Your Interview Process is a Liability


84% of developers now use AI tools. 46% of code is AI-generated, and 50% of engineers at Meta and 34% at Microsoft use Claude Code as their primary coding tool. [1] Yet we still hire by testing people without AI access.

That’s not just unfair to candidates. It’s a business risk.

You’re optimizing your pipeline for people who might actively resist AI adoption. Or worse, who lack the judgment to use it well. And you may be selecting against the engineers who’ve most effectively integrated AI into their workflow. Exactly the people you want.

The compliance risk nobody talks about

76% of neurodivergent job seekers say traditional recruitment methods put them at a disadvantage. [2] Timed assessments. Panel interviews. Whiteboard performances. 65% fear discrimination from management if they disclose their condition. [3] Employment tribunal awards citing neurodivergent conditions are up 133% year on year. [4]

I have ADHD. I can tell you exactly what happens in a high pressure interview. My brain jumps ahead. I see the whole system at once but struggle to narrate it linearly. I know the answer before I can explain how I got there. Under time pressure, that gap between knowing and explaining gets worse.

None of that means I can’t build the system. It means I can’t perform building the system in a way that maps to neurotypical expectations of what “thinking out loud” should look like.

Timed live coding doesn’t test ability to ship. It tests ability to mask. And increasingly, that’s a liability for companies, not just candidates.

What we’re actually selecting for

Here’s the uncomfortable question. If 46% of code is AI-generated, what’s the actual skill we need?

It’s not typing speed. It’s not memorizing algorithms. It’s not even system design pattern recognition, though that helps.

The skill is judgment. The ability to inject constraints the AI doesn’t know. To catch the hallucination before it ships. To know when the machine is wrong.

Stack Overflow’s 2025 survey found that 46% of developers don’t trust AI output accuracy. [5]

Think about that for a second. 46% of code is AI-generated. 46% of developers don’t trust AI output. That’s not a contradiction - that’s the job description. The skill isn’t writing code. It’s knowing when the code is wrong.

We know human review is critical. We know the value is in verification, not generation.

But the format still rewards speed and recall under pressure, even when the rubric claims to assess trade-offs or problem-solving. A 45-minute timed test, no tools, no documentation, no AI. These conditions never exist on the job. Even on-call at 3am, you have Claude.

I’ve run sales hiring processes and built sales teams. If someone proposed “sell me this pen” as an interview round, they’d get laughed out of the room. We figured out decades ago that theatrical stunts don’t predict performance.

In my first engineering interview back in 2024, I was asked to recursively flatten an array of n levels of depth. Without using flatMap. Not “explain when you’d use flatMap.” Not “what are the tradeoffs of recursion vs. iteration.” Just… do it the hard way, for no reason. That’s the “sell me this pen” of engineering interviews.
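For readers who don’t live in JavaScript, here’s roughly what that prompt amounts to. This is a minimal sketch, not the exact interview question, and `nested` is just a placeholder input:

// The one-liner the interview rules out: the standard library already does this
const flattened = nested.flat(Infinity);

// The “hard way”: reimplement it by hand with recursion
function flatten(arr) {
  return arr.reduce(
    (acc, item) => acc.concat(Array.isArray(item) ? flatten(item) : item),
    []
  );
}

flatten([1, [2, [3, [4]]]]); // [1, 2, 3, 4]

Knowing how to write that recursion isn’t useless. Pretending the one-liner doesn’t exist is.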

[Image: Leonardo DiCaprio selling a pen. Caption: “Now invert a binary tree.”]

Yet engineering interviews are full of the equivalent. Timed algorithm puzzles, whiteboard performances, pressure tests that reward recall over reasoning. Google banned their famous brainteaser questions (“how many golf balls fit in an airplane?”) in 2013 after their head of HR found they “don’t predict anything” and “serve primarily to make the interviewer feel smart.” [6] Then they doubled down on LeetCode. The industry recognized the problem and made it worse.

An AI-First Framework

Engineers have always complained interviews reward the wrong things. But this feels different. The system was built for a world that no longer exists.

I understand why LeetCode persists. It scales. Thousands of candidates, consistent bar, minimal interviewer training required. It’s easy to administer and hard to argue with. But standardizing the wrong thing isn’t rigor.

It’s theater.

Yes, skilled interviewers can extract signal from any format. But why make their job harder with prompts that narrow the conversation, instead of opening it? If the conversation is the real signal, let AI handle the part it’s good at.

The research is damning. A North Carolina State/Microsoft study found that candidates solving problems on a whiteboard in front of interviewers performed half as well as those given the same problem alone. [7] The format tests anxiety, not competence. In the study, all women failed when observed, and all women passed when given privacy. Same problem, same skill level, different environment. [8] If you’ve ever wondered why engineering struggles to retain women, consider that we’ve built a filter optimized to reject them.

And the format barely predicts job performance. Isolated algorithmic assessments have validity coefficients of 0.31-0.42, explaining just 10-18% of the variance. Combine them with work samples and structured interviews? Validity jumps to 0.63-0.65. [9] We know how to do this better. We just don’t. Worse, AI can now pass the algorithmic part anyway. GPT-4 solves 32-68% of LeetCode problems. [10] You’re selecting for a skill that’s increasingly automatable, not the judgment that remains distinctly human.

If this is so obvious, why hasn’t it changed? A few reasons. First, survivor bias: people who passed LeetCode interviews design LeetCode interviews. The system selects for its own perpetuation. Second, invisible false negatives: when you reject a good candidate, you never find out. When you hire a bad one, everyone knows. So companies optimize for avoiding visible mistakes, not for finding the best people. Third, legal comfort. “We gave everyone the same test” is easier to defend than “we made a judgment call about their judgment.” And finally, nobody got fired for copying Google, even when Google’s process makes no sense for a 50-person startup.

Understanding why the system persists doesn’t make it right. But it does explain why change has to be deliberate.

Criticism is easy and solutions are hard, so here’s my good-faith attempt at a replacement.

A note on scope - this framework assumes you’re hiring for roles where judgment matters more than raw implementation speed. That is typically mid-level and above. For junior roles, you probably do need to see someone write code. But even there, “build this small thing with full tool access” beats “implement this algorithm without Google.” The question isn’t whether to assess technical ability. It’s whether artificial constraints add signal or just noise.

Judgment Under Ambiguity

No code. Shared whiteboard. But instead of “present your solution while we watch,” it’s “let’s solve this together.”

You can ship this feature in 4 weeks with tech debt, or 12 weeks done properly. A competitor just launched something similar. What do you do?

There’s no off-the-shelf answer here. No pattern to recognize. It’s trade-offs all the way down.

Candidate: “Ship in 4 weeks. We can’t let them own the narrative.”

You push back: How bad is the tech debt? What’s the rewrite cost? What if the competitor’s version flops and you’ve rushed something you didn’t need to rush?

“Okay, it depends. How similar is their feature? Are we competing for the same customers? What’s our runway - can we afford to be second to market? And what does ‘done properly’ even mean for something we might pivot away from in six months?”

Now you’re getting somewhere. But the best candidates go further. They ask the questions that actually matter to the business.

“Are prospects asking for this feature, or are we assuming they want it? Is this a wedge to open new deals, or just downside risk if we don’t have it? What’s our average deal cycle - does 8 weeks even matter if sales takes 6 months? How does this map to what we’re hearing in pipeline reviews?”

That’s not a technical answer. It’s a business answer. And increasingly, that’s the job.

Nobody makes these calls perfectly in their head. Real engineering is iterative. You propose, get pushback, revise.

The trade-offs here are genuine. Two senior engineers at the same company might disagree on the right answer. That’s what makes it interesting, and that’s what makes it useful. You’re not testing whether they know the “correct” solution. You’re testing how they think when there isn’t one.

It shows what judgment actually looks like. And it’s a mile away from inverting a binary tree.

Other questions that work:

  • “Your biggest customer’s contract is up for renewal. They want a feature that would take 6 weeks. Sales says it might save the deal. Do you build it?”
  • “A key customer wants a feature that would require breaking your API for everyone else. How do you handle it?”
  • “You have doubts about a feature the PM is convinced we need. What do you do?”

Vague Prompt Test

Give them a laptop. Full AI access. A deliberately underspecified goal.

Build an endpoint that enriches user data with analytics from Snowflake.

A candidate prompts “Write an endpoint that fetches users and calls Snowflake.” Claude generates this:

// Loads full, hydrated ORM documents up front...
const users = await User.find().populate('profile');

// ...and holds all of them in memory while the slow Snowflake call runs
const analytics = await snowflake.getAnalytics(
  users.map(u => u._id)
);

return users.map(u => ({
  ...u.toObject(),
  ...analytics[u._id]
}));

The code works. Tests pass. But what questions did they ask first?

A stronger approach starts with discovery: “What’s the latency on the Snowflake call? What’s expected concurrency? How big are these user documents?”

That leads to different code:

// Fetch only the IDs: a lightweight array, not hydrated documents
const userIds = (await User.find().select('_id').lean())
  .map(u => u._id);

// The slow external call runs while memory holds almost nothing
const analytics = await snowflake.getAnalytics(userIds);

// Hydrate the full objects only once the analytics are back
const users = await User.find({ _id: { $in: userIds } })
  .populate('profile')
  .lean();

return users.map(u => ({ ...u, ...analytics[u._id] }));

The first version holds thousands of hydrated ORM objects in memory during a 2-second external call. Under load, those objects survive garbage collection, get promoted to old gen, and trigger stop-the-world pauses. P95 explodes.

The second version holds a lightweight array of IDs in memory, does a 2-second external call using the IDs, then hydrates the full objects. Same result, fraction of the memory pressure.

AI won’t catch this on its own, because someone has to know the constraints and feed them in. It doesn’t know your traffic patterns. It doesn’t know what’s sitting in memory while it waits for a slow API call. You’re assessing whether the candidate asks the questions that surface those constraints before prompting the AI.

Hallucination Review

Hand them an AI-generated pull request. The code works. The tests pass.

// GraphQL resolver
const attendees = async (event, args, context) => {
  return Promise.all(
    event.attendeeIds.map(id => context.dataloaders.user.load(id))
  );
};

“This loads full User documents for every attendee. 500 attendees means 500 full docs in memory simultaneously, all held until the response serializes. Can we add a projection? Do we actually need the full user object here, or just name and avatar?”
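What the fix might look like, as a sketch: it assumes a Mongoose-style User model, that `name` and `avatarUrl` are the only fields the response needs, and it accepts losing the dataloader’s cross-resolver batching in exchange for the projection.

// Batch-fetch attendees with a projection: two small fields instead of full documents
const attendees = async (event, args, context) => {
  return User.find({ _id: { $in: event.attendeeIds } })
    .select('name avatarUrl')
    .lean();
};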

// Webhook handler
app.post('/webhook', async (req, res) => {
  await processEvent(req.body);
  res.sendStatus(200);
});

“What if processEvent takes 30 seconds? Most webhook providers timeout after 10 seconds and retry. We’d process the same event twice. Should we acknowledge immediately and use a queue? Or at least add idempotency keys?”
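One shape the fix could take, sketched with stand-ins: `alreadySeen` and `enqueue` represent your idempotency store and queue of choice, not real APIs from the PR.

// Webhook handler: acknowledge fast, process later
app.post('/webhook', async (req, res) => {
  // Idempotency: skip events the provider has already retried (alreadySeen is a stand-in)
  if (await alreadySeen(req.body.id)) {
    return res.sendStatus(200);
  }

  // Hand the payload to a queue (enqueue is a stand-in for SQS, BullMQ, etc.)
  await enqueue('webhook-events', req.body);

  // Respond well inside the provider’s timeout; the real work happens off the request path
  res.sendStatus(200);
});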

// Order details endpoint
app.get('/api/orders/:id', authMiddleware, async (req, res) => {
  const order = await Order.findById(req.params.id);
  res.json(order);
});

“This checks if a user is logged in. It doesn’t check if this user owns the order. I can fetch anyone’s order if I guess the ID.”

This one’s worth pausing on. AI (still) gets auth wrong constantly. When you prompt “create an endpoint to fetch an order by ID,” nothing in that sentence says “users should only see their own orders.” The AI builds exactly what you asked for. It doesn’t know your access control rules because you didn’t tell it, and it can’t infer them from context.

Authentication is visible: there’s middleware to call, patterns to copy. Authorization is business logic that lives in your head. Wouldn’t a business that cares about security want to assess exactly this? What risks are you taking on by not explicitly testing for it in your interview process?
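The fix is usually a one-line scoping change. A sketch, assuming orders store a `userId` and that `authMiddleware` populates `req.user`:

// Order details endpoint: scope the query to the authenticated user
app.get('/api/orders/:id', authMiddleware, async (req, res) => {
  const order = await Order.findOne({
    _id: req.params.id,
    userId: req.user.id
  });

  // 404 either way, so the response doesn’t leak whether the order exists
  if (!order) {
    return res.sendStatus(404);
  }

  res.json(order);
});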

AI generates plausible code. It often generates code with subtle issues that only surface under load, at scale, or in failure modes. Or, in this case, when someone guesses an ID. Catching these before they ship is the job now.

Paid Trial

Not every role needs this. But if you want to see someone work in your actual codebase, pay them for it.

A half-day contract. A real ticket from your backlog. Their environment, their hours, their tools. You’re not testing whether they can perform under artificial constraints. You’re seeing how they actually work.

The take-home gets a bad reputation because companies use it as a free filter early in the pipeline. Bounded, paid, and late in the process, it’s something else entirely: a mutual trial. They’re evaluating you too.

The uncomfortable truth

In an AI-native world, soft skills are becoming more important, not less. The ability to clarify requirements. To communicate tradeoffs. To collaborate under uncertainty. These are the things AI can’t do.

Yet 68% of HR professionals admit their recruitment frameworks aren’t designed to surface these skills. [11] We test what’s easy to measure. Algorithm recall. Syntax. Speed.

Here’s the irony. We know 46% of developers don’t trust AI output. We know human judgment is critical. But we’ve built hiring pipelines that select for the opposite.

Some companies have figured this out

Canva ($3B ARR, 220 million users) now requires that candidates use AI tools during technical interviews. They retired LeetCode-style questions entirely. Instead, they give candidates ambiguous, realistic problems and evaluate whether they can break down requirements, catch issues in AI-generated code, and make sound technical decisions. [12]

Their finding?

“Candidates with minimal AI experience often struggled. Not because they couldn’t code, but because they lacked the judgment to guide AI effectively.”

The framework exists. Some companies are already using it.

What’s at stake

Your interview process is a filter. The question is what it’s filtering for.

Test for algorithm recall under pressure and that’s exactly what you select for. But that’s not judgment. It’s pattern matching in a vacuum.

Judgment is knowing what to ask before you write a single line.

In an AI-native world, that’s not just a missed opportunity. It’s a liability you’re actively building into your team.

You’re not hiring for 2026. You’re hiring for 2019 - and calling it rigor.

Pick one of these. Try it in your next loop. See what you learn. [13]