
What Are Test Scores Actually Worth?

By Bri Stanback

I have a three-year-old. In a couple of years, she'll be in kindergarten. I was standing in the parking lot after a school event, talking to another parent about schools in our area — the usual anxiety spiral: which ones are good, what do the ratings mean, should we move. Normally I'd jot something down and never follow up. Instead, I pulled out my phone, opened Telegram, and asked Lunen — my always-on AI assistant running on a Mac Mini at home via OpenClaw — to pull the state test data and rank schools near us against the state average. By the time I got home, I had something useful.

The list was interesting. But the questions it raised were more interesting. What are these scores measuring? Why do the top schools all look the same demographically? What's the actual difference between a school rated 7 and a school rated 8?

That conversation became SchoolScope — a school performance analysis tool for California. It's still a side project, but it's grown into something with its own methodology, grounding documents for 25 states, a system of axioms, and a few hundred commits.

It's also a testbed. At Product.ai, where I'm Chief Architect, I can't break things — we process hundreds of millions of events a day. SchoolScope is where I work fast, break things, and stress-test every new feature Anthropic ships in Claude Code that week. The thinking patterns that survive here are the ones I bring back to the team. The tech stacks are completely different (oRPC, Cloudflare Workers — things we'd never adopt in production), but the methodology transfers.

Here's how my thinking evolved, and where it falls short. It's also a microcosm of the shift I've been living at Product.ai: from writing code to defining constraints, from optimizing prompts to engineering context.


#Phase 1: Test Scores Are Everything

The starting assumption was simple. California publishes CAASPP results — standardized test scores for every public school, broken down by grade, subject, and subgroup. It's the closest thing to an objective, comparable data point across thousands of schools. GreatSchools uses it. Niche uses it. If you're going to rank schools, this is the raw material.

So I pulled the data. Built a composite score. Started ranking schools.

And almost immediately, something felt wrong.

The top-ranked schools were overwhelmingly in wealthy neighborhoods. The bottom-ranked schools were overwhelmingly in disadvantaged communities. The correlation between test scores and household income was so strong that you could practically predict a school's ranking from its zip code.

This isn't a discovery — anyone who's looked at school data knows this. But building the ranking system myself made the mechanism visceral. I wasn't just reading about the demographic proxy problem. I was producing it. My algorithm was taking public data and generating a leaderboard that mostly reflected where rich families live.

The data was technically correct. The output was generically "good." And it was specifically wrong for what I wanted to build. If that sounds familiar — it's the same problem I hit at Product.ai when we tried to prompt-engineer our way to good code. The AI's version of "best practices" is an average of the internet. Your version is specific to your context. Without that context, you get plausible garbage.


#Phase 2: What Are All the Signals?

So I started asking: what else is there? What does the state actually publish?

It turns out — a lot more than test scores:

  • Growth: how the same school's scores move across years
  • Chronic absenteeism rates
  • Suspension rates
  • English learner progress (ELPAC)
  • Per-pupil spending
  • Enrollment and student-teacher ratios
  • Equity gaps by subgroup

Each of these is a lens. None of them alone tells you if a school is "good." Together, they start to tell you something more honest.

But now I was making judgment calls about what matters — which meant I needed a system for encoding those judgments.


#Phase 3: Axioms, Grounding, and Encoding Judgment

An axiom, the way I use the term, is an almost-immutable truth you've decided to build on. Not a guideline. Not a suggestion. A constraint with teeth. Constraints are crystallized taste — someone has to decide what matters, what's non-negotiable, what variance is acceptable. That's the part AI can't do for you.

At Product.ai, we develop axioms through the Adversarial Reasoning Cycle — researching a question across multiple AI models, finding convergence, distilling the result. For SchoolScope, the axioms came from staring at the data and making calls:

  • "Exceeded" carries more signal than "Met"
  • Growth measures the school; proficiency measures the neighborhood
  • Chronic absenteeism is a culture signal
  • Diversity metrics are never scoring inputs
  • Labels read "Needs Support," never "Failing"
  • Go direct to the agency; never depend on intermediary APIs when the raw data is public

These axioms live in Markdown files in a grounding/axioms/ directory. Every AI agent that touches the codebase reads them before writing a line of code. They're the constitution of the product.

#State grounding documents

Each state gets its own grounding doc - a comprehensive Markdown file that describes everything: what test the state uses, where the raw data lives, what format it's in, what the performance levels are called, what's available and what's missing. California's is the most developed. Texas (STAAR), New York (Regents), Florida (FAST) - each has different tests, different file formats, different quirks. I've written grounding documents for 25 states so far.

California's specifies file formats (caret-delimited, Latin-1), encoding changes between years, WAF restrictions on automated downloads, COVID data gaps. These aren't things you want an AI agent to discover by trial and error.
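Those quirks translate directly into parsing code. Here's a minimal sketch of what reading a caret-delimited, Latin-1 CAASPP file might look like in Node. The suppression marker and field handling are assumptions for illustration, not the actual CAASPP layout.

```typescript
import { readFileSync } from "node:fs";

// Split one row of a caret-delimited CAASPP research file.
export function parseCaasppLine(line: string): string[] {
  return line.split("^").map((f) => f.trim());
}

export function parseCaasppFile(path: string): string[][] {
  // Older CAASPP files are Latin-1, not UTF-8 — decode accordingly.
  const text = readFileSync(path, "latin1");
  return text
    .split(/\r?\n/)
    .filter((l) => l.length > 0)
    .map(parseCaasppLine);
}

export function toRate(cell: string): number | null {
  // Assumed convention: suppressed cells ("*") become null, never zero.
  return cell === "*" ? null : Number(cell);
}
```

The null-vs-zero distinction is exactly the kind of thing a grounding doc pins down so an agent doesn't silently treat a suppressed small-school cell as a 0% pass rate.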

Each state also gets a config file in TypeScript:

```typescript
// src/config/states/ca.ts
export const CA_CONFIG = {
  state: "CA",
  testName: "CAASPP Smarter Balanced",
  performanceLevels: ["Exceeded", "Met", "Nearly Met", "Not Met"],
  levels: {
    elementary: {
      grades: [3, 4, 5],
      composite: {
        exceeded: 0.43,
        metAbove: 0.22,
        growth: 0.15,
        absenteeism: 0.10,
        suspension: 0.05,
        elpac: 0.05,
      },
    },
    // middle, high...
  },
};
```

When Texas launches, it gets tx.ts with STAAR equivalents. The import pipeline reads the config. The agents read the config. Nobody hardcodes "CAASPP" anywhere. The architecture is designed so a new state is a new config file and grounding doc - not a rewrite.
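As a sketch of that claim, a hypothetical tx.ts might look like this. The STAAR performance-level names are real; the weights and the elProgress key are illustrative placeholders, not an actual SchoolScope Texas config (which doesn't exist yet).

```typescript
// src/config/states/tx.ts — hypothetical sketch, same shape as ca.ts
export const TX_CONFIG = {
  state: "TX",
  testName: "STAAR",
  performanceLevels: ["Masters", "Meets", "Approaches", "Did Not Meet"],
  levels: {
    elementary: {
      grades: [3, 4, 5],
      composite: {
        exceeded: 0.43,   // "Masters Grade Level" plays the role of CA's "Exceeded"
        metAbove: 0.22,
        growth: 0.15,
        absenteeism: 0.10,
        suspension: 0.05,
        elProgress: 0.05, // TELPAS instead of ELPAC — illustrative key name
      },
    },
    // middle, high...
  },
};
```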

#The Scope Score

The composite weights those axioms into a number. At elementary: exceeded 43%, met+above 22%, growth 15%, absenteeism 10%, suspension 5%, ELPAC 5%.

Every one of those weights is an opinion. The methodology is public. Every weight is explained. Every limitation is stated. If you disagree — and reasonable people should — the whole point is that you can see the machinery. GreatSchools gives you a 1-10 number with no escape hatch. I wanted the opposite: here's the score, here's exactly why, here's what it can't tell you.
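For concreteness, here's a minimal sketch of how those weights could combine into a score, assuming each metric has already been normalized to [0, 1]. The real pipeline's normalization and rounding may differ.

```typescript
// Metric names mirror the config; values assumed pre-normalized to [0, 1].
export type Metrics = {
  exceeded: number;    // share of students at "Exceeded"
  metAbove: number;    // share at "Met" or above
  growth: number;      // normalized cohort growth signal
  absenteeism: number; // inverted: 1 minus chronic absenteeism rate
  suspension: number;  // inverted: 1 minus suspension rate
  elpac: number;       // English learner progress
};

const ELEMENTARY_WEIGHTS: Metrics = {
  exceeded: 0.43,
  metAbove: 0.22,
  growth: 0.15,
  absenteeism: 0.10,
  suspension: 0.05,
  elpac: 0.05,
};

export function scopeScore(m: Metrics, w: Metrics = ELEMENTARY_WEIGHTS): number {
  // Weighted sum of the normalized metrics, scaled to 0-100 for display.
  const keys = Object.keys(w) as (keyof Metrics)[];
  const raw = keys.reduce((sum, k) => sum + w[k] * m[k], 0);
  return Math.round(raw * 100);
}
```

Note how the negative signals (absenteeism, suspension) are inverted before weighting, so every input points the same direction: higher is better.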

I also built archetypes - labels like "Growth Engine" (low raw scores but strong improvement), "High Ceiling" (exceptional exceeded rates), "Steady Foundation" (consistent across metrics). These turned out to be more useful than the number itself. A parent looking at a "Growth Engine" school understands something a percentile rank can't communicate: this school is doing real work with the students it has.
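A toy version of that assignment might look like the following; the labels are SchoolScope's, but the thresholds are invented for the sketch.

```typescript
// Illustrative archetype rules — the cutoffs here are made up for the example.
export type SchoolSignals = {
  metAbovePct: number;      // 0-1, share at Met or above
  exceededPct: number;      // 0-1, share at Exceeded
  growthPercentile: number; // 0-100 vs. the state
};

export function archetype(s: SchoolSignals): string {
  if (s.exceededPct >= 0.5) return "High Ceiling";   // exceptional top-level rates
  if (s.metAbovePct < 0.45 && s.growthPercentile >= 75) {
    return "Growth Engine";                          // low raw scores, strong improvement
  }
  if (s.growthPercentile >= 40 && s.growthPercentile <= 60) {
    return "Steady Foundation";                      // consistent, middle-of-the-pack
  }
  return "Unclassified";
}
```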


#Phase 4: Where It Falls Short

Here's the honest part.

Test scores correlate with demographics. I've done everything I can to mitigate this - weighting growth, contextualizing with spending, showing equity gaps by subgroup, never using diversity metrics as scoring inputs, choosing "Needs Support" over "Failing" as labels. But the correlation doesn't go away. A composite that includes test scores will always partially reflect who walks in the door, not just what happens inside.

Pseudo-cohort isn't true cohort. I actually built something better than cross-sectional: when historical data is available, I track the same school's cohort across years - 2023's 3rd graders measured again as 2025's 5th graders, using SBAC scale scores designed for cross-year comparison. 98% of elementary schools have this data. It's stronger than what most rating sites do (comparing different students at different grades in the same year). But it's still "pseudo" - I'm tracking school-level averages, not individual students. Kids transfer in and out. It's the closest thing to true value-added measurement available from public data, and it's still imperfect.
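The pseudo-cohort comparison can be sketched in a few lines. Field names and the bare scale-score delta are simplifications; in practice you'd compare the delta against the expected grade-to-grade scale-score gain rather than treat any positive number as growth.

```typescript
export type GradeResult = { year: number; grade: number; meanScaleScore: number };

// Track the same school's cohort across years: e.g. 2023's 3rd graders
// measured again as 2025's 5th graders, using SBAC scale scores.
export function pseudoCohortGrowth(
  results: GradeResult[],
  startYear: number,
  startGrade: number,
  span = 2,
): number | null {
  const start = results.find((r) => r.year === startYear && r.grade === startGrade);
  const end = results.find(
    (r) => r.year === startYear + span && r.grade === startGrade + span,
  );
  if (!start || !end) return null; // COVID gap, suppression, or new school
  return end.meanScaleScore - start.meanScaleScore;
}
```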

I can't measure what matters most. Teacher quality. School culture. Whether the art program is any good. Whether your kid will have a friend. Whether the principal actually cares. The data captures outcomes that are measurable at scale. The things parents care about most are the things that don't fit in a spreadsheet.

Small schools get smoothed. A school with 15 test-takers can swing wildly year to year. I apply Bayesian smoothing toward the state average to prevent noise from distorting rankings. That's statistically sound and experientially unsatisfying - it means small schools get pulled toward mediocrity in the data even when they're exceptional in practice.
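The smoothing is the classic shrinkage formula. Here's a sketch, with the prior strength k as an assumed value since the actual constant isn't stated here.

```typescript
// Shrink a small-sample school rate toward the state average.
// k is a pseudo-count: the smaller nTested is relative to k,
// the harder the rate gets pulled toward the state mean.
export function smoothTowardState(
  schoolRate: number, // e.g. share of students at Met or above
  nTested: number,    // test-takers behind that rate
  stateRate: number,
  k = 25,             // assumed prior strength, not SchoolScope's real value
): number {
  return (nTested * schoolRate + k * stateRate) / (nTested + k);
}
```

With k = 25, a 15-student school at 80% met-or-above against a 48% state average shrinks to 60%: small samples move a lot, large samples barely at all.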

Private schools are a black box. 2,452 private schools in California from the NCES survey. No test scores. No growth data. No chronic absenteeism. I show enrollment, student-teacher ratio, religious affiliation - but I can't rank them alongside public schools. The data doesn't exist to do it fairly. I'd rather show an honest "we don't have enough data to score this school" than manufacture a number.

High school scores lack growth data. California only tests 11th graders in high school - there's no earlier grade to compare against within the same level. So the growth trajectory signal that's most powerful at elementary and middle school doesn't exist for high schools. The Scope Score still works, but it's a less complete picture.


#What I Actually Believe Now

Test scores are the least bad objective measure we have. They're real. They're standardized. They're published. You can compare across schools. That matters - especially when the alternative is GreatSchools' black box or Niche's stale federal data supplemented by sparse user reviews.

But they're one lens. I put that line on every page of SchoolScope, and I mean it. A school that scores in the 40th percentile but has strong growth, low absenteeism, and stable teachers might be a better fit for your kid than a 90th percentile school with high suspension rates and demographic sorting.

The real value isn't the number. It's the structure around the number - the context, the comparison, the honesty about limitations. The prompt is what you say to the contractor. The constraints are the building code. SchoolScope's axioms are the building code. A parent who understands that "Exceeded" matters more than "Met," that growth measures the school and proficiency measures the neighborhood, that chronic absenteeism is a culture signal - that parent is making a fundamentally different decision than one staring at a single rating.

I built this because I'll need it soon. My daughter will be in the system in a couple years, and I want to make the decision with real information, not marketing. I've seen how the sausage gets made now — I know what a 7 versus an 8 actually means, and more importantly, what it doesn't mean. I know that a school with strong growth and low absenteeism in a working-class neighborhood might be doing better work than the 9-rated school in the hills. I want that context when it's my kid.

But I also built it because the process changed what I think "real information" means. It's not more data. It's more honest data — data that tells you what it knows, what it doesn't, and where you need to go look for yourself. The number is never the answer. The number plus the context plus the limitations plus the visit plus the gut feeling — that's getting closer.

Let me be real: it's a toy project. Handful of hits a day. I have no idea if I'll invest more time in it. But it taught me something I preach at Product.ai and didn't fully feel until I built this: having a great product means nothing if no one sees it. GreatSchools has brand recognition, backlinks, and a decade of SEO. I have better data and zero distribution. The marketing, the storytelling, the networking — that's the actual hard part. Execution got cheap. Attention didn't.

I used to say it's not about the idea, it's about the implementation. I don't say that anymore. The doing got cheaper. The deciding got more valuable. And the decisions are only as good as the constraints you write down before the first line of code gets generated.


#How It's Built

SchoolScope is a side project built in stolen hours - evenings, weekends, naptime. I wrote approximately zero lines of code by hand, which sounds like a flex but is really about the architecture. I wrote constraints, specs, and axioms - the grounding documents I described above. AI coding agents read those documents and build within them. The constraint engineering follows what I think of as a three-layer model: constraints bound the space, prompts express intent, code is what emerges.

When I wanted to add per-pupil spending data, I didn't open VS Code. I wrote a spec, then dispatched two AI agents in parallel - one to build the data pipeline (download NCES F-33 finance data, parse it, match districts to schools), one to build the UI (spending cards on school profiles, district pages, state comparison). Both agents read the grounding documents. Both ran in parallel. The feature was live in about two hours.

Then I noticed the spending data was three years stale - because we'd used an intermediary API that lagged behind the source. The raw NCES files with current data were sitting on ies.ed.gov the whole time. I dispatched a third agent to go direct to the primary source. Ten minutes later, the data was three years fresher than our competitors.

That mistake became an axiom: "Always go direct to the agency. Never depend on intermediary APIs when the raw data is publicly available." A lesson learned, codified, and now every future agent reads it before touching data imports.

That's how axioms work in practice. You make a mistake. You understand why. You write it down as a constraint. Every agent that comes after inherits the lesson. The axioms compound.

The full methodology is published at schoolscope.co/methodology. Every weight, every data source, every limitation. If you think I weighted something wrong, I'd genuinely like to hear about it.


SchoolScope is live at schoolscope.co. Side project, not affiliated with my employer. All data sourced from public agencies (California Department of Education, NCES, U.S. Census).

Tagged

  • ai
  • architecture
  • education
  • systems