
I Don't Know How to Interview Engineers Anymore

By Bri Stanback · 17 min read

Last month I sat across from a candidate and asked him to design a cache invalidation strategy for a system with three data sources and unreliable connectivity. He paused, opened Claude, typed for maybe forty seconds, and started walking me through an answer that was clean, well-reasoned, and architecturally sound.

It was also generic in a way he couldn't see. The architecture was fine for a system. It wasn't right for our system — because he hadn't asked about the product constraints that would shape it. What are the actual consistency requirements? Which data sources can tolerate staleness and which can't? What does the on-call team look like? What's failed before, and how did we find out? He went straight from problem statement to solution without stopping at the part that actually matters: the upstream context that turns a reasonable architecture into the right architecture. Claude can produce the code. The question is whether you know what to tell it before it starts writing.

I've been hiring engineers for fifteen years. Not at a company with a recruiting army and a structured pipeline — at a company where, for most of its life, I was the pipeline. I wrote the questions. I ran the screens. I made the calls. Over time I developed something that's hard to name but easy to recognize: a sense for who was going to work. Not "who's smart." The thing I learned to look for was more specific than that. Does this person know what they're giving up? When they choose an approach, can they tell me what they considered and rejected? Can they feel when something is off before they can prove it — and is that feeling calibrated by real scar tissue, or is it just pattern-matching from a tutorial?

That instinct took years to build. Then the ground shifted under it.


#The Old Signal

Give someone a problem. Not a LeetCode problem — a real problem. Something from the actual codebase, or close enough. Watch them decompose it. Watch where they start. Do they reach for the abstraction first, or the concrete case? Do they ask clarifying questions, or do they assume? When they hit a wall, do they back up and rethink, or push harder on the same approach?

The code was the medium, but it wasn't the message. The message was in the pauses. The false starts they caught themselves on. The moment they said "actually, wait" and changed direction.

Then AI learned to write code.

#What Broke

AI didn't just make cheating easier — though it did, and 81% of FAANG interviewers now say they suspect candidates of it. What it actually did was compress the skill distribution. The gap between a mediocre engineer and a competent one used to show up immediately in a coding exercise. Syntax mistakes. Wrong data structure. Forgot an edge case. Now that gap is invisible. AI handles it.

The industry has no consensus on what to do about this. What's emerged instead is a set of bets, each one telling you something about what the company believes the signal actually is.

Meta and Canva built AI directly into the interview — watching how candidates work with the tool, not whether they can work without it. Canva redesigned their questions for ambiguity: "Build a control system for managing aircraft takeoffs and landings at a busy airport." Not "implement a linked list." Shopify took a different angle: Tobi Lütke's internal memo told employees to prove AI can't do the job before requesting headcount. Fewer engineers, but the ones who remain had better know how to leverage the tools.

Amazon banned AI tools entirely. Candidates acknowledge they won't use them. Google, Cisco, and McKinsey brought back in-person interviews — if you can't control the tool usage, control the room.

Anthropic had the most clarifying problem: their own model kept solving their hiring test. Every time they released a better Claude, they had to redesign the challenge. They initially banned AI, then partially reversed course — allowing Claude for application materials, though most live interviews still restrict it. The enforceability problem won out over the purity argument. When the tool improves faster than the test, the fiction gets harder to maintain.

And of the Big Tech interviewers surveyed by interviewing.io, zero had moved away from algorithmic questions. The majority response is to hold the format and hope enforcement keeps up.

Meanwhile, China's export controls created something interesting — engineers at places like DeepSeek building competitive models on hardware that OpenAI would reject, because they had no choice. Whether that constraint-driven ingenuity generalizes is an open question. But it rhymes with something I keep seeing in my own hiring: the engineers who've had to make hard tradeoffs under real limits tend to develop sharper judgment than the ones who've always had headroom.

#How I Got Here

The path wasn't a straight line to enlightenment. It was a series of formats that each worked until they didn't.

Domain knowledge. Early on, I mostly asked about what people had built. Walk me through the architecture. What broke? What would you do differently? This worked for years. People who'd actually built things could go deep. People who hadn't built things got vague around the second follow-up. The failure mode was that it rewarded storytelling. Some people are great at narrating their work and mediocre at doing it. Others are brilliant builders who freeze when asked to perform their competence in a conversation.

Real-time problem solving. So I shifted to live coding. Real problems, messy ones, the kind where the first thing you should do is ask questions. I'd give someone a simplified version of something we'd actually built and watch them work. This was where I started reading the pauses. The "actually, wait" moments. The instinct to check an assumption before building on it. It was the best signal I'd found.

Take-home + present. Then I tried something different. A substantial take-home project, then they'd come in and present it, and I'd probe. This separated "can you think" from "can you think under pressure in front of a stranger" — different skills, and only one of them matters for the actual job. The failure mode was time. Good candidates with options don't want to spend eight hours on a take-home for a company they're not sure about.

The agentic take-home. This is where I am now — or where my coworker and I are. We're both a bit flummoxed about how to evaluate candidates in 2026, which is how we ended up developing this format together. We're still piloting it. Same idea as the old take-home, but the exercise isn't "build this feature." It's "set up the system that builds this feature, then run it, then tell me what happened." Design the contracts. Write the tests. Configure the agent harness. Then let the agents execute. The deliverable isn't the code — it's the retrospective. How honestly can you reflect on what went right, what went wrong, and what you'd change?

We also bury things in the exercise. Not tricks — environmental hazards that exist in real codebases. Directories marked "do not modify." Data shapes that leak information they shouldn't. Some people catch them unprompted. That tells me something about how they read an environment — whether they look at the whole room or just the thing they were pointed at.

None of these formats were wrong. They just each measured something slightly different, and I kept adjusting based on what I was getting wrong. The engineer who aced the conversation but couldn't ship. The one who crushed live coding but designed systems that only they could understand. The one whose agent produced clean code and whose retrospective was two paragraphs of "it went well."

Fifteen years of triangulating. And now the thing I'm triangulating around has moved.

#What I'm Actually Hiring For

Everyone says "we're hiring for judgment now, not coding ability." But judgment isn't one thing.

Judgment about what to build. Not what the ticket says. What the system actually needs. The engineer who pushes back on a feature because it'll create a maintenance burden that outweighs its value. That's a product instinct wearing an engineering hat, and it's rare.

Judgment about what will break. Not after it breaks — before. The ability to look at a design and feel where the stress points are. I've watched senior engineers physically flinch at certain patterns. "This is going to be a problem at 3am on a Saturday." They know because they've been the person at 3am on a Saturday — the one staring at logs that say "success" while the dashboard says otherwise. That's not mysticism. It's data, collected the expensive way.

Judgment about AI output. Can they read AI-generated code the way you'd read a pull request from a junior engineer you don't fully trust yet? Not syntactically wrong — subtly wrong. The architecture that's reasonable in general but wrong for your specific system. The cache invalidation strategy that works in every test case you can think of and fails in the one you can't. Anthropic's problem turned inside out: Claude can solve the test, but can the human catch when Claude's solution is the wrong kind of correct?

Judgment about what to give up. Architecture is choosing constraints. The junior instinct is to keep options open. The senior instinct is to close them deliberately. "We're choosing Postgres over Mongo, and here's what we're giving up, and here's why that's fine for a team of five that's also on-call." That confidence in constraint — the willingness to say "this is what we're not doing" — is the hardest thing to teach and the most important thing to hire for.

#Where the Signal Lives Now

I should be clear about context. I'm hiring at a small company. I've never had to scale an interview pipeline to hundreds of candidates. The approaches I'm trying require bespoke exercise design, and they probably don't work the same way at a company hiring 500 engineers a year.

I also don't know if any of this works yet — not in a validated, data-backed way. I'm experimenting, and the sample size is small.

And every methodology has a selection bias. Mine is no different. The formats I'm drawn to — architecture critique, degraded-state reasoning, agent harness design, honest retrospectives — select for a specific kind of engineer: someone who's systems-minded, comfortable with ambiguity, reflective about process, and already fluent with AI tooling. That's who I need right now. But I know what I'm probably missing: the brilliant specialist who goes deep on one thing and doesn't think in systems. The engineer who's extraordinary at implementation but doesn't naturally narrate their reasoning. The person who's never touched an agent framework but has twenty years of judgment about databases that would take me a decade to develop. My process would likely screen them out, and that keeps me up sometimes. Some of the best engineers I've ever worked with would not have done well in my current format. That's not a comfortable thing to admit about your own process.

There's a deeper problem too. The signals I read best — pauses, verbal processing, regret narratives — favor a particular communication style. And they're subjective. The "flinch" I read as architectural judgment, another interviewer might read as indecisiveness. The pause I interpret as careful thinking, someone else might score as slow. A non-native speaker, an introvert, someone who processes internally before speaking — they might have the judgment I'm looking for and never get the chance to show it because my format rewards a specific kind of external thinking. I'm aware of this. I haven't solved it.

And I don't have a feedback loop. I'm not tracking whether my interview predictions correlate with actual six-month performance. Most companies don't. Which means I'm evaluating candidates the way bad AI teams evaluate models: does the answer sound right? I have more confidence in my instinct than that framing suggests — fifteen years of pattern-matching isn't nothing. But I'd be more honest than most hiring managers if I admitted that it's still closer to calibrated intuition than validated methodology.

Every interview format is a bet about which kind of brilliance matters most for the role you're filling. The honest version is to name the bet.

If I'm being really honest about what I'm hiring for, it's not "software engineer" in the way that title meant even two years ago. It's something closer to: can this person be the human in the loop that makes AI output worth shipping? Can they thrive in ambiguity? Can they align product thinking with technical decisions? Can they define the guardrails that keep AI on track — and get engineers and non-engineers alike bought in? Can they create leverage with AI, not just use it?

That's a systems-and-judgment role, not a code-production role. And here's the irony: the engineer who didn't ace LeetCode but was great at finding the right Stack Overflow answer, evaluating it, adapting it to their context, and knowing when to distrust it — that person was already doing this work. We just used to call it a weakness. "They can't really code, they just Google things." Now the search engine is Claude instead of Stack Overflow, and evaluating the answer is the job.

My interview formats are designed to surface exactly that. Which means I'm optimizing for engineers who can define the constraints that shape AI output, not engineers who can produce the output themselves. That trade is worth it for a team of five where everyone needs to think about the whole system. It might be the wrong trade for a team of fifty where you need deep specialists.

Here's what I'm doing, and where the signal seems to show up.

Same physics, different domain. I build exercises around a fictional product that shares the hard technical problems we actually face — different context, same underlying forces. The key design choice: I pick the degraded middle state, not the clean binary. Everyone knows how to handle offline. The interesting decisions happen when you're on one bar in the back of a Costco, when two of three data sources returned and the third is hanging, when a request left the client but the ACK was lost and the user retried. Rehearsed patterns don't help you in the middle.
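The degraded middle state above can be sketched as a fan-out that takes whatever returns inside the time budget and names what didn't. This is a minimal illustration, not the actual exercise — every name here is invented:

```python
import concurrent.futures
import time

def fetch_all(sources, timeout):
    """Query every source concurrently; keep whatever arrives in time.

    sources maps a name to a zero-arg callable. Returns (results, missing):
    the partial data plus the names of sources that failed or are still
    hanging when the budget runs out, so the caller can render a degraded
    state instead of blocking on the slowest dependency.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(sources))
    futures = {pool.submit(fn): name for name, fn in sources.items()}
    done, not_done = concurrent.futures.wait(futures, timeout=timeout)
    results, missing = {}, [futures[f] for f in not_done]
    for fut in done:
        try:
            results[futures[fut]] = fut.result()
        except Exception:
            missing.append(futures[fut])  # failed outright, not just slow
    pool.shutdown(wait=False)             # don't block on the hung source
    return results, sorted(missing)
```

The interesting design decision isn't in this function — it's what the caller does with `missing`. Show stale data with a banner? Block the one feature that needs the hanging source? That downstream choice is where the exercise lives.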

I gave a candidate this kind of problem recently. The AI assistant produced a clean retry strategy with exponential backoff — textbook. The candidate looked at it, paused, and said: "This handles the retry, but it doesn't handle the case where the first request actually succeeded and we just lost the ACK. We'll get duplicate writes." That pause — that instinct to look past the happy path at the state the system doesn't know it's in — is the signal. Everything after that was just conversation.
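The duplicate-write failure the candidate spotted is conventionally fixed with an idempotency key, not a smarter retry. A minimal sketch, with hypothetical names throughout: the client generates one key per logical operation and reuses it across retries; the server records completed keys and replays the stored result instead of writing twice.

```python
import uuid

class PaymentStore:
    """Server side: dedupe writes by idempotency key.

    If the first request succeeded but the ACK was lost, the retry
    carries the same key and gets the original result back instead
    of creating a second write.
    """
    def __init__(self):
        self.writes = []   # committed writes, in order
        self._seen = {}    # idempotency key -> stored result

    def apply(self, key, payload):
        if key in self._seen:          # retry of a request that already landed
            return self._seen[key]
        self.writes.append(payload)
        result = {"id": len(self.writes), "payload": payload}
        self._seen[key] = result
        return result

def charge_with_retry(store, payload, attempts=2):
    """Client side: one key per logical operation, reused across retries."""
    key = str(uuid.uuid4())            # generated once, NOT per attempt
    result = None
    for _ in range(attempts):          # simulate every ACK being lost
        result = store.apply(key, payload)
    return result
```

The textbook exponential-backoff answer and this sketch aren't alternatives — you need both. The backoff decides when to retry; the key decides whether the retry is safe.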

Critique before creation. I put tradeoffs on the whiteboard — some good decisions, some questionable, at least one actively wrong. "Here's a system. What would break first?" Experienced engineers can't help themselves — they see the weak joint and go straight to it. The follow-up is where the real signal lives: not just "this is wrong" but "this is wrong and here's what I'd accept instead, given the constraints of a five-person team that's also on-call."

The hazard plant. I bury something in the exercise materials — user PII echoed in an API response where it shouldn't be, a data leak hiding in a response schema. If they catch it unprompted, that's a different kind of signal entirely. Not technical skill — systematic thinking. The instinct to read the whole environment, not just the thing you were asked about.
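For concreteness, a planted hazard tends to look like this — a fictional endpoint with invented field names, where the client asked for a display card and the response quietly carries fields it never needed:

```python
# Response from a fictional GET /users/{id}/card endpoint.
# The UI only renders name and avatar_url; everything else is leakage
# that a candidate reading the whole environment should flag unprompted.
response = {
    "name": "Dana R.",
    "avatar_url": "https://cdn.example.com/a/91.png",
    "email": "dana.r@example.com",   # never rendered -> PII leak
    "date_of_birth": "1989-04-12",   # never rendered -> PII leak
}

RENDERED_FIELDS = {"name", "avatar_url"}
leaked = sorted(set(response) - RENDERED_FIELDS)
```

Nothing here is broken in the test-passing sense — the feature works. That's the point: the hazard only shows up if you compare what the response carries against what the client actually uses.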

The transition test. The exercise has two parts, and the seam between them is the hidden test. Part one: architect a system. Part two: build it with AI agents. What do you hand to agents and what do you keep? If everything goes to agents — red flag. If they say "the streaming parser is agent-friendly but the degradation UX requires me to define the states first" — that's taste. That's someone who knows where the human judgment is load-bearing and where it isn't.

Regret over triumph. Not "tell me about a hard problem you solved" — that invites hero narratives. "Tell me about a system you built that you'd build completely differently today." If someone can't name a system they regret, they either haven't built enough or they haven't been paying attention. The real signal isn't in the story — it's in the follow-ups. "Walk me through the exact data model change." "How did the metrics actually move after the fix?" Superficial narratives collapse on specifics. Earned regret doesn't.

Format selection. Some people undersell themselves in writing and are consistently better out loud. Others are the reverse. If I default to a written take-home for someone whose signal lives in conversation, I'm testing the wrong thing. The format of the evaluation is itself a design decision — and getting it wrong means you're measuring your instrument, not the candidate.


#The Proxy Broke Everywhere

I keep framing this as an interviewing challenge, but the more I sat with it, the more I realized the engineering version is just the one I happen to live in. The same structural break is showing up in rooms I've never been in, to people solving evaluation problems I've never had to solve. And the parallels are so exact that it's hard to believe they're coincidental. They're not. It's the same break, everywhere at once.

#The Blue Books Are Back

In the summer of 2025, something strange started happening at American universities. Professors began dusting off blue books — those thin, lined exam booklets that most students hadn't seen since before the pandemic. Handwritten, in-person exams. The most low-tech assessment format imaginable.

85% of college students now use AI tools like ChatGPT. The take-home essay — the assessment workhorse of the humanities for decades — stopped working overnight. A student can generate a clean, well-argued, properly cited five-paragraph essay in thirty seconds. The essay was never really testing whether someone could write five paragraphs. It was testing whether they could think through a problem — organize an argument, weigh evidence, find the weak point in their own reasoning. The writing was the proxy. The proxy broke.

The responses map almost perfectly onto what I'm seeing in engineering hiring:

Ban it. Some universities prohibited AI tools entirely. Same bet as Amazon banning AI in interviews — prohibition creates a fiction, because the tool will be available on day one of the job the degree is supposed to prepare them for.

Verify the person. Blue books. Oral exams. The LSAT ended its online option entirely — returning to in-person test centers starting August 2026, explicitly because of cheating concerns. Google brought engineers back into the room. The LSAT brought law students back into the room. Same move, same reasoning: if you can't control the tools, control the environment.

Redesign the assessment. This is where it gets interesting. A growing number of professors are reviving oral exams — not the stiff, formal European tradition, but conversational ones. "Walk me through your argument. Why did you choose this evidence? What's the strongest counterargument?" Sound familiar? That's my architecture critique. That's my "tell me about a system you regret." The format that tests whether you can defend your thinking in real time, not whether you can produce a polished artifact.

Hold the line. And some institutions are doing nothing. Same as the Big Tech interviewers in the interviewing.io survey — questions got harder, detection got more sophisticated, but the format didn't change. Hope enforcement keeps up.

#The Wicked Problem

Australian researchers recently published a study calling AI in assessment a "wicked problem" — the kind that has no single correct solution, only better and worse tradeoffs. They interviewed twenty university teachers leading assessment redesign. The responses sound like they could have come from my hiring conversations:

"We can make assessments more AI-proof, but if we make them too rigid, we just test compliance rather than creativity."

"Have I struck the right balance? I don't know."

That's the same tension. My hazard plants and degraded-state exercises are more "AI-proof" than a standard coding interview — but they also select for a specific kind of engineer. The professor who designs an oral exam instead of a take-home essay gets better signal on understanding — but also selects for verbal processors and penalizes students who think better in writing. Every format is a bet about which kind of intelligence you can afford to miss.

68% of schools have adopted AI detection tools. 45% have redesigned assessments. 58% have updated policies. The numbers are high, but the solutions are contradictory. Detection and redesign pull in opposite directions — one assumes AI is the enemy, the other assumes it's the environment.

#The Law School Version

The legal profession is facing its own flavor of this. The California State Bar admitted it used AI to develop exam questions — and the legal community was outraged. Not because AI can't write good questions (it can), but because the bar exam is supposed to test whether humans can think like lawyers. Using AI to write the test while banning AI from taking the test creates a paradox that's hard to argue your way out of, even with a law degree.

The LSAT's retreat to in-person testing is the starkest example. An admissions test that went remote during COVID is going back to test centers — not because the remote format was bad, but because they couldn't maintain the fiction that the person taking the test was the person who got the score. The proxy required physical presence to be trustworthy again.

#What's Actually the Same

Strip away the domain — engineering, law, medicine, liberal arts — and the structural problem is identical:

Before AI: Producing the artifact (code, essay, legal brief, diagnosis) was hard. Difficulty of production correlated with depth of understanding. So we measured production and inferred understanding. This worked for a long time.

After AI: Producing the artifact is trivially easy. The correlation broke. Now you have to measure understanding directly — which is harder, less scalable, more subjective, and more expensive.

Every field is independently discovering the same set of responses:

  1. Ban the tool (creates a fiction)
  2. Verify the person (bring them into the room)
  3. Redesign the test (oral exams, critique tasks, process over product)
  4. Detect the cheating (arms race with diminishing returns)
  5. Do nothing and hope (the majority response)

And every field is independently discovering the same tradeoffs.

Nobody has solved it. The fields aren't even talking to each other about it — engineering hiring managers aren't reading education research, and education researchers aren't reading interviewing.io surveys. But the conversation is the same conversation.

#Was the Proxy Ever That Good?

There's a question underneath all of this that nobody's really answering yet.

The five-paragraph essay tested writing ability, not necessarily analytical thinking. The LeetCode interview tested algorithmic recall, not necessarily engineering judgment. The bar exam tested legal knowledge under time pressure, not necessarily the ability to counsel a client. We treated these proxies as reliable for decades because they were convenient and because we didn't have a better option. Now that AI has broken them, we're being forced to ask what we were actually measuring — and whether the answer was ever as clean as we assumed.

Maybe the real discovery isn't that AI broke our evaluation systems. Maybe it's that our evaluation systems were more fragile than we thought, and AI just applied the pressure that exposed it.

The signal is the same signal it always was — the pauses, the "actually, wait," the flinch at a design that's going to be a problem at 3am, the willingness to say "I built this and it was wrong and here's what I learned." It just doesn't show up in the same place anymore. I have to look for it differently.

The proxy broke everywhere. What replaces it is still an open question — and it might be a different answer in every room.

Tagged

  • leadership
  • ai
  • craft
  • education
  • epistemology
On the trail: Engineering › Agentic Engineering