Last month I sat across from a candidate and asked him to design a cache invalidation strategy for a system with three data sources and unreliable connectivity. He paused, opened Claude, typed for maybe forty seconds, and started walking me through an answer that was clean, well-reasoned, and architecturally sound.
It was also generic in a way he couldn't see. The architecture was fine for a system. It wasn't right for our system — because he hadn't asked about the product constraints that would shape it. What are the actual consistency requirements? Which data sources can tolerate staleness and which can't? What does the on-call team look like? What's failed before, and how did we find out? He went straight from problem statement to solution without stopping at the part that actually matters: the upstream context that turns a reasonable architecture into the right architecture. Claude can produce the code. The question is whether you know what to tell it before it starts writing.
I've been hiring engineers for fifteen years. Not at a company with a recruiting army and a structured pipeline — at a company where, for most of its life, I was the pipeline. I wrote the questions. I ran the screens. I made the calls. Over time I developed something that's hard to name but easy to recognize: a sense for who was going to work. Not "who's smart." The thing I learned to look for was more specific than that. Does this person know what they're giving up? When they choose an approach, can they tell me what they considered and rejected? Can they feel when something is off before they can prove it — and is that feeling calibrated by real scar tissue, or is it just pattern-matching from a tutorial?
That instinct took years to build. Then the ground shifted under it.
# The Old Signal
Give someone a problem. Not a LeetCode problem — a real problem. Something from the actual codebase, or close enough. Watch them decompose it. Watch where they start. Do they reach for the abstraction first, or the concrete case? Do they ask clarifying questions, or do they assume? When they hit a wall, do they back up and rethink, or push harder on the same approach?
The code was the medium, but it wasn't the message. The message was in the pauses. The false starts they caught themselves on. The moment they said "actually, wait" and changed direction.
Then AI learned to write code.
# What Broke
AI didn't just make cheating easier — though it did, and 81% of FAANG interviewers now suspect it. What it actually did was compress the skill distribution. The gap between a mediocre engineer and a competent one used to show up immediately in a coding exercise. Syntax mistakes. Wrong data structure. Forgot an edge case. Now that gap is invisible. AI handles it.
The industry has no consensus on what to do about this. What's emerged instead is a set of bets, each one telling you something about what the company believes the signal actually is.
Meta and Canva built AI directly into the interview — watching how candidates work with the tool, not whether they can work without it. Canva redesigned their questions for ambiguity: "Build a control system for managing aircraft takeoffs and landings at a busy airport." Not "implement a linked list." Shopify took a different angle: Tobi Lütke's internal memo told employees to prove AI can't do the job before requesting headcount. Fewer engineers, but the ones who remain had better know how to leverage the tools.
Amazon banned AI tools entirely. Candidates acknowledge they won't use them. Google, Cisco, and McKinsey brought back in-person interviews — if you can't control the tool usage, control the room.
Anthropic had the most clarifying problem: their own model kept solving their hiring test. Every time they released a better Claude, they had to redesign the challenge. They initially banned AI, then partially reversed course — allowing Claude for application materials, though most live interviews still restrict it. The enforceability problem won out over the purity argument. When the tool improves faster than the test, the fiction gets harder to maintain.
And of the Big Tech interviewers surveyed by interviewing.io, zero had moved away from algorithmic questions. The majority response is to hold the format and hope enforcement keeps up.
Meanwhile, China's export controls created something interesting — engineers at places like DeepSeek building competitive models on hardware that OpenAI would reject, because they had no choice. Whether that constraint-driven ingenuity generalizes is an open question. But it rhymes with something I keep seeing in my own hiring: the engineers who've had to make hard tradeoffs under real limits tend to develop sharper judgment than the ones who've always had headroom.
# How I Got Here
The path wasn't a straight line to enlightenment. It was a series of formats that each worked until they didn't.
Domain knowledge. Early on, I mostly asked about what people had built. Walk me through the architecture. What broke? What would you do differently? This worked for years. People who'd actually built things could go deep. People who hadn't got vague around the second follow-up. The failure mode was that it rewarded storytelling. Some people are great at narrating their work and mediocre at doing it. Others are brilliant builders who freeze when asked to perform their competence in a conversation.
Real-time problem solving. So I shifted to live coding. Real problems, messy ones, the kind where the first thing you should do is ask questions. I'd give someone a simplified version of something we'd actually built and watch them work. This was where I started reading the pauses. The "actually, wait" moments. The instinct to check an assumption before building on it. It was the best signal I'd found.
Take-home + present. Then I tried something different. A substantial take-home project, then they'd come in and present it, and I'd probe. This separated "can you think" from "can you think under pressure in front of a stranger" — different skills, only one of them matters for the actual job. The failure mode was time. Good candidates with options don't want to spend eight hours on a take-home for a company they're not sure about.
The agentic take-home. This is where I am now — or where my coworker and I are. We're both a bit flummoxed about how to evaluate candidates in 2026, which is how we ended up developing this format together. We're still piloting it. Same idea as the old take-home, but the exercise isn't "build this feature." It's "set up the system that builds this feature, then run it, then tell me what happened." Design the contracts. Write the tests. Configure the agent harness. Then let the agents execute. The deliverable isn't the code — it's the retrospective. How honestly can you reflect on what went right, what went wrong, and what you'd change?
We also bury things in the exercise. Not tricks — environmental hazards that exist in real codebases. Directories marked "do not modify." Data shapes that leak information they shouldn't. Some people catch them unprompted. That tells me something about how they read an environment — whether they look at the whole room or just the thing they were pointed at.
None of these formats were wrong. They just each measured something slightly different, and I kept adjusting based on what I was getting wrong. The engineer who aced the conversation but couldn't ship. The one who crushed live coding but designed systems that only they could understand. The one whose agent produced clean code and whose retrospective was two paragraphs of "it went well."
Fifteen years of triangulating. And now the thing I'm triangulating around has moved.
# What I'm Actually Hiring For
Everyone says "we're hiring for judgment now, not coding ability." But judgment isn't one thing.
Judgment about what to build. Not what the ticket says. What the system actually needs. The engineer who pushes back on a feature because it'll create a maintenance burden that outweighs its value. That's a product instinct wearing an engineering hat, and it's rare.
Judgment about what will break. Not after it breaks — before. The ability to look at a design and feel where the stress points are. I've watched senior engineers physically flinch at certain patterns. "This is going to be a problem at 3am on a Saturday." They know because they've been the person at 3am on a Saturday — the one staring at logs that say "success" while the dashboard says otherwise. That's not mysticism. It's data, collected the expensive way.
Judgment about AI output. Can they read AI-generated code the way you'd read a pull request from a junior engineer you don't fully trust yet? Not syntactically wrong — subtly wrong. The architecture that's reasonable in general but wrong for your specific system. The cache invalidation strategy that works in every test case you can think of and fails in the one you can't. Anthropic's problem turned inside out: Claude can solve the test, but can the human catch when Claude's solution is the wrong kind of correct?
Judgment about what to give up. Architecture is choosing constraints. The junior instinct is to keep options open. The senior instinct is to close them deliberately. "We're choosing Postgres over Mongo, and here's what we're giving up, and here's why that's fine for a team of five that's also on-call." That confidence in constraint — the willingness to say "this is what we're not doing" — is the hardest thing to teach and the most important thing to hire for.
# Where the Signal Lives Now
I should be clear about context. I'm hiring at a small company. I've never had to scale an interview pipeline to hundreds of candidates. The approaches I'm trying require bespoke exercise design, and they probably don't work the same way at a company hiring 500 engineers a year.
I also don't know if any of this works yet — not in a validated, data-backed way. I'm experimenting, and the sample size is small.
And every methodology has a selection bias. Mine is no different. The formats I'm drawn to — architecture critique, degraded-state reasoning, agent harness design, honest retrospectives — select for a specific kind of engineer: someone who's systems-minded, comfortable with ambiguity, reflective about process, and already fluent with AI tooling. That's who I need right now. But I know what I'm probably missing: the brilliant specialist who goes deep on one thing and doesn't think in systems. The engineer who's extraordinary at implementation but doesn't naturally narrate their reasoning. The person who's never touched an agent framework but has twenty years of judgment about databases that would take me a decade to develop. My process would likely screen them out, and that keeps me up sometimes. Some of the best engineers I've ever worked with would not have done well in my current format. That's not a comfortable thing to admit about your own process.
There's a deeper problem too. The signals I read best — pauses, verbal processing, regret narratives — favor a particular communication style. And they're subjective. The "flinch" I read as architectural judgment, another interviewer might read as indecisiveness. The pause I interpret as careful thinking, someone else might score as slow. A non-native speaker, an introvert, someone who processes internally before speaking — they might have the judgment I'm looking for and never get the chance to show it because my format rewards a specific kind of external thinking. I'm aware of this. I haven't solved it.
And I don't have a feedback loop. I'm not tracking whether my interview predictions correlate with actual six-month performance. Most companies don't. Which means I'm evaluating candidates the way bad AI teams evaluate models: does the answer sound right? I have more confidence in my instinct than that framing suggests — fifteen years of pattern-matching isn't nothing. But I'd be more honest than most hiring managers if I admitted that it's still closer to calibrated intuition than validated methodology.
Every interview format is a bet about which kind of brilliance matters most for the role you're filling. The honest version is to name the bet.
If I'm being really honest about what I'm hiring for, it's not "software engineer" in the way that title meant even two years ago. It's something closer to: can this person be the human in the loop that makes AI output worth shipping? Can they thrive in ambiguity? Can they align product thinking with technical decisions? Can they define the guardrails that keep AI on track — and get engineers and non-engineers alike bought in? Can they create leverage with AI, not just use it?
That's a systems-and-judgment role, not a code-production role. And here's the irony: the engineer who didn't ace LeetCode but was great at finding the right Stack Overflow answer, evaluating it, adapting it to their context, and knowing when to distrust it — that person was already doing this work. We just used to call it a weakness. "They can't really code, they just Google things." Now the search engine is Claude instead of Stack Overflow, and evaluating the answer is the job.
My interview formats are designed to surface exactly that. Which means I'm optimizing for engineers who can define the constraints that shape AI output, not engineers who can produce the output themselves. That trade is worth it for a team of five where everyone needs to think about the whole system. It might be the wrong trade for a team of fifty where you need deep specialists.
Here's what I'm doing, and where the signal seems to show up.
Same physics, different domain. I build exercises around a fictional product that shares the hard technical problems we actually face — different context, same underlying forces. The key design choice: I pick the degraded middle state, not the clean binary. Everyone knows how to handle offline. The interesting decisions happen when you're on one bar in the back of a Costco, when two of three data sources returned and the third is hanging, when a request left the client but the ACK was lost and the user retried. Rehearsed patterns don't help you in the middle.
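The degraded middle can be made concrete. Here is a minimal sketch, in Python, of the "two of three sources answered and the third is hanging" state: give the aggregate call a time budget, take whatever finished, and mark the rest degraded instead of blocking the whole response on it. The source names, delays, and budget are all invented for illustration.

```python
import asyncio

# Hypothetical illustration of the "degraded middle": three data sources,
# one of which hangs. Names, delays, and the time budget are made up.

async def fetch(source: str, delay: float) -> str:
    await asyncio.sleep(delay)
    return f"{source}:ok"

async def gather_with_budget(budget: float) -> dict:
    tasks = {
        name: asyncio.create_task(fetch(name, delay))
        for name, delay in [("inventory", 0.01), ("pricing", 0.02), ("reviews", 10.0)]
    }
    done, pending = await asyncio.wait(tasks.values(), timeout=budget)
    # Cancel the stragglers rather than waiting on them forever.
    for t in pending:
        t.cancel()
    # Report partial results: answered sources keep their value,
    # the hanging one is explicitly marked degraded.
    return {
        name: (t.result() if t in done else "DEGRADED")
        for name, t in tasks.items()
    }

result = asyncio.run(gather_with_budget(budget=0.5))
print(result)  # "reviews" comes back DEGRADED; the other two answer
```

The interesting interview conversation starts after this sketch, not with it: what does the UI show for the `DEGRADED` source, and which sources are allowed to degrade at all?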
I gave a candidate this kind of problem recently. The AI assistant produced a clean retry strategy with exponential backoff — textbook. The candidate looked at it, paused, and said: "This handles the retry, but it doesn't handle the case where the first request actually succeeded and we just lost the ACK. We'll get duplicate writes." That pause — that instinct to look past the happy path at the state the system doesn't know it's in — is the signal. Everything after that was just conversation.
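The duplicate-write gap the candidate spotted is the classic argument for idempotency keys: the client attaches one stable key to the logical operation, so a retry after a lost ACK replays the recorded result instead of writing twice. A toy sketch, with all names invented:

```python
import uuid

# Hypothetical sketch of the duplicate-write fix: one idempotency key per
# logical operation, and a server that remembers which keys it has applied.
# Retrying after a lost ACK then becomes safe.

class Server:
    def __init__(self):
        self.applied: dict[str, str] = {}  # idempotency key -> stored result
        self.writes = 0

    def write(self, key: str, payload: str) -> str:
        if key in self.applied:        # retry of a request that already succeeded
            return self.applied[key]   # replay the stored result, no second write
        self.writes += 1
        result = f"stored:{payload}"
        self.applied[key] = result
        return result

server = Server()
key = str(uuid.uuid4())                # one key for the whole logical operation

first = server.write(key, "order-42")  # succeeds, but imagine the ACK was lost
retry = server.write(key, "order-42")  # client retries with the SAME key

assert first == retry
assert server.writes == 1              # exactly one write despite the retry
```

The retry-with-backoff the AI produced and this key scheme solve different problems; noticing that they are different problems is the whole point.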
Critique before creation. I put tradeoffs on the whiteboard — some good decisions, some questionable, at least one actively wrong. "Here's a system. What would break first?" Experienced engineers can't help themselves — they see the weak joint and go straight to it. The follow-up is where the real signal lives: not just "this is wrong" but "this is wrong and here's what I'd accept instead, given the constraints of a five-person team that's also on-call."
The hazard plant. I bury something in the exercise materials — user PII echoed in an API response where it shouldn't be, a data leak hiding in a response schema. If they catch it unprompted, that's a different kind of signal entirely. Not technical skill — systematic thinking. The instinct to read the whole environment, not just the thing you were asked about.
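That kind of hazard is easy to plant because it is easy to commit: serialize the whole record and every field rides along. A toy illustration (field names invented), with the allowlisted version a careful candidate might propose:

```python
# Hypothetical example of the planted hazard: a handler that serializes the
# whole user record, so fields the endpoint never needed leak into the
# public response. All field names are invented.

user_row = {
    "id": 7,
    "display_name": "Sam",
    "email": "sam@example.com",       # PII the endpoint has no reason to return
    "support_notes": "disputed a charge in March",
}

def profile_response_leaky(row: dict) -> dict:
    return dict(row)                   # echoes everything, including the hazard

def profile_response_scoped(row: dict) -> dict:
    # Allowlist the fields the endpoint actually needs; columns added to the
    # row later stay private by default instead of leaking by default.
    return {k: row[k] for k in ("id", "display_name")}

assert "email" in profile_response_leaky(user_row)
assert "email" not in profile_response_scoped(user_row)
```

Nobody is asked to look for it. Catching it means the candidate read the response shapes, not just the prompt.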
The transition test. The exercise has two parts, and the seam between them is the hidden test. Part one: architect a system. Part two: build it with AI agents. What do you hand to agents and what do you keep? If everything goes to agents — red flag. If they say "the streaming parser is agent-friendly but the degradation UX requires me to define the states first" — that's taste. That's someone who knows where the human judgment is load-bearing and where it isn't.
Regret over triumph. Not "tell me about a hard problem you solved" — that invites hero narratives. "Tell me about a system you built that you'd build completely differently today." If someone can't name a system they regret, they either haven't built enough or they haven't been paying attention. The real signal isn't in the story — it's in the follow-ups. "Walk me through the exact data model change." "How did the metrics actually move after the fix?" Superficial narratives collapse on specifics. Earned regret doesn't.
Format selection. Some people undersell themselves in writing and are consistently better out loud. Others are the reverse. If I default to a written take-home for someone whose signal lives in conversation, I'm testing the wrong thing. The format of the evaluation is itself a design decision — and getting it wrong means you're measuring your instrument, not the candidate.
# The Bigger Problem
I keep framing this as an interviewing challenge, but the pattern extends past engineering.
A teacher assigns a five-paragraph essay on the causes of the Civil War. A student submits something clean, well-structured, grammatically perfect. The old test measured what my live coding exercise measured: can you produce the artifact? The better teachers are doing what the better interviewers are doing — testing judgment about the artifact instead of production of it. Here's an AI-generated essay. What's wrong with it? What assumption is it making? What would you add that the model couldn't know?
Medical boards are facing it. Law schools are facing it. The artifact was a proxy for understanding, and the proxy broke.
Output was convenient. It was legible. You could grade it, score it, compare it across candidates. Judgment is none of those things. Many of the things that matter most about expertise are also the hardest to measure. We got lucky for a while — production correlated with understanding, so we could measure the easy thing and infer the hard thing. That shortcut is disappearing.
# What's Left
AI didn't eliminate the need to evaluate engineers. It eliminated the convenience of using code as a proxy for engineering judgment.
For the first few months after the ground shifted, it felt like starting over. My usual questions didn't land. The formats I'd spent years refining felt like testing penmanship in the age of keyboards.
But the formats broke. The instinct didn't.
The signal is the same signal it always was — the pauses, the "actually, wait," the flinch at a design that's going to be a problem at 3am, the willingness to say "I built this and it was wrong and here's what I learned." It just doesn't show up in the same place anymore. I have to look for it differently.
I'm not starting over. I'm recalibrating. Which is what I've been doing every few years for fifteen years anyway.
The difference is that this time, everyone else is recalibrating too.