I wrote recently about not knowing how to interview engineers anymore. The thesis was simple: AI made code free, and every interview format that used code production as a proxy for engineering judgment lost its signal. I'm recalibrating. Everyone is.
But the more I sat with that essay, the more I realized the engineering version is just the one I happen to live in. The same thing is happening in rooms I've never been in, to people solving evaluation problems I've never had to solve. And the parallels are so exact that it's hard to believe they're coincidental. They're not. It's the same structural break, showing up everywhere at once.
# The Blue Books Are Back
In the summer of 2025, something strange started happening at American universities. Professors began dusting off blue books — those thin, lined exam booklets that most students hadn't seen since before the pandemic. Handwritten, in-person exams. The most low-tech assessment format imaginable.
85% of college students now use AI tools like ChatGPT. The take-home essay — the assessment workhorse of the humanities for decades — stopped working overnight. A student can generate a clean, well-argued, properly cited five-paragraph essay in thirty seconds. The essay was never really testing whether someone could write five paragraphs. It was testing whether they could think through a problem — organize an argument, weigh evidence, find the weak point in their own reasoning. The writing was the proxy. The proxy broke.
The responses map almost perfectly onto what I'm seeing in engineering hiring:
Ban it. Some universities prohibited AI tools entirely. Same bet as Amazon banning AI in interviews — prohibition creates a fiction, because the tool will be available on day one of the job the degree is supposed to prepare students for.
Verify the person. Blue books. Oral exams. The LSAT ended its online option entirely — returning to in-person test centers starting August 2026, explicitly because of cheating concerns. Google brought engineers back into the room. The LSAT brought law school applicants back into the room. Same move, same reasoning: if you can't control the tools, control the environment.
Redesign the assessment. This is where it gets interesting. A growing number of professors are reviving oral exams — not the stiff, formal European tradition, but conversational ones. "Walk me through your argument. Why did you choose this evidence? What's the strongest counterargument?" Sound familiar? That's my architecture critique. That's my "tell me about a system you regret." The format that tests whether you can defend your thinking in real time, not whether you can produce a polished artifact.
Hold the line. Some institutions are changing nothing. Same as the Big Tech interviewers in the interviewing.io survey — questions got harder, detection got more sophisticated, but the format didn't change. Hope enforcement keeps up.
# The Wicked Problem
Australian researchers recently published a study calling AI in assessment a "wicked problem" — the kind that has no single correct solution, only better and worse tradeoffs. They interviewed twenty university teachers leading assessment redesign. The responses sound like they could have come from my hiring conversations:
"We can make assessments more AI-proof, but if we make them too rigid, we just test compliance rather than creativity."
"Have I struck the right balance? I don't know."
That's the same tension. My hazard plants and degraded-state exercises are more "AI-proof" than a standard coding interview — but they also select for a specific kind of engineer. The professor who designs an oral exam instead of a take-home essay gets better signal on understanding — but also selects for verbal processors and penalizes students who think better in writing. Every format is a bet about which kind of intelligence you can afford to miss.
68% of schools have adopted AI detection tools. 45% have redesigned assessments. 58% have updated policies. The numbers are high, but the solutions are contradictory. Detection and redesign pull in opposite directions — one assumes AI is the enemy, the other assumes it's the environment.
# The Law School Version
The legal profession is facing its own flavor of this. The California State Bar admitted it used AI to develop exam questions — and the legal community was outraged. Not because AI can't write good questions (it can), but because the bar exam is supposed to test whether humans can think like lawyers. Using AI to write the test while banning AI from taking the test creates a paradox that's hard to argue your way out of, even with a law degree.
The LSAT's retreat to in-person testing is the starkest example. An admissions test that went remote during COVID is going back to test centers — not because the remote format was bad, but because they couldn't maintain the fiction that the person taking the test was the person who got the score. The proxy required physical presence to be trustworthy again.
# What's Actually the Same
Here's what I keep coming back to. Strip away the domain — engineering, law, medicine, liberal arts — and the structural problem is identical:
Before AI: Producing the artifact (code, essay, legal brief, diagnosis) was hard. Difficulty of production correlated with depth of understanding. So we measured production and inferred understanding. This worked for a long time.
After AI: Producing the artifact is trivially easy. The correlation broke. Now you have to measure understanding directly — which is harder, less scalable, more subjective, and more expensive.
Every field is independently discovering the same set of responses:
- Ban the tool (creates a fiction)
- Verify the person (bring them into the room)
- Redesign the test (oral exams, critique tasks, process over product)
- Detect the cheating (arms race with diminishing returns)
- Do nothing and hope (the majority response)
And every field is independently discovering the same tradeoffs:
- Oral exams are better signal but favor verbal processors
- In-person tests are more secure but less accessible
- Process-based assessments are richer but harder to standardize
- Detection tools are convenient but produce false positives that disproportionately flag non-native English speakers
Nobody has solved it. The fields aren't even talking to each other about it — engineering hiring managers aren't reading education research, and education researchers aren't reading interviewing.io surveys. But the conversation is the same conversation.
# The Deeper Question
There's a question underneath all of this that nobody's really answering yet: was the proxy ever that good?
The five-paragraph essay tested writing ability, not necessarily analytical thinking. The LeetCode interview tested algorithmic recall, not necessarily engineering judgment. The bar exam tested legal knowledge under time pressure, not necessarily the ability to counsel a client. We treated these proxies as reliable for decades because they were convenient and because we didn't have a better option. Now that AI has broken them, we're being forced to ask what we were actually measuring — and whether the answer was ever as clean as we assumed.
Maybe the real discovery isn't that AI broke our evaluation systems. Maybe it's that our evaluation systems were more fragile than we thought, and AI just applied the pressure that exposed the fragility.
I don't have a conclusion here. That's honest. What I have is a pattern: every domain that relied on artifact production as proof of understanding is now scrambling, independently, to find something better. The scrambling looks remarkably similar across fields. And nobody's published a solution that works at scale.
The proxy broke everywhere. What replaces it is still an open question — and it might be a different answer in every room.