
What Are Test Scores Actually Worth?

By Bri Stanback

I have a three-year-old. In a couple of years, she'll be in kindergarten. I was standing in the parking lot after a school event, talking to another parent about schools in our area — the usual anxiety spiral: which ones are good, what do the ratings mean, should we move. Normally I'd jot something down and never follow up. Instead, I pulled out my phone, opened Telegram, and asked Lunen — my always-on AI assistant running on a Mac Mini at home via OpenClaw — to pull the state test data and rank schools near us against the state average. By the time I got home, I had something useful.

The list was interesting. But the questions it raised were more interesting. What are these scores measuring? Why do the top schools all look the same demographically? What's the actual difference between a school rated 7 and a school rated 8?

That conversation became SchoolScope — a school performance analysis tool for California. It's still a side project, but it's grown into something with its own methodology, grounding documents for 25 states, a system of axioms, and a few hundred commits.

It's also a testbed. At Product.ai, where I'm Chief Architect, I can't break things — we process hundreds of millions of events a day. SchoolScope is where I work fast, break things, and stress-test every new feature Anthropic ships in Claude Code that week. The thinking patterns that survive here are the ones I bring back to the team. The tech stacks are completely different (oRPC, Cloudflare Workers — things we'd never adopt in production), but the methodology transfers.

Here's how my thinking evolved, and where it falls short. It's also a microcosm of the shift I've been living at Product.ai: from writing code to defining constraints, from optimizing prompts to engineering context.


#Phase 1: Test Scores Are Everything

The starting assumption was simple. California publishes CAASPP results — standardized test scores for every public school, broken down by grade, subject, and subgroup. It's the closest thing to an objective, comparable data point across thousands of schools. GreatSchools uses it. Niche uses it. If you're going to rank schools, this is the raw material.

So I pulled the data. Built a composite score. Started ranking schools.

And almost immediately, something felt wrong.

The top-ranked schools were overwhelmingly in wealthy neighborhoods. The bottom-ranked schools were overwhelmingly in disadvantaged communities. The correlation between test scores and household income was so strong that you could practically predict a school's ranking from its zip code.

This isn't a discovery — anyone who's looked at school data knows this. But building the ranking system myself made the mechanism visceral. I wasn't just reading about the demographic proxy problem. I was producing it. My algorithm was taking public data and generating a leaderboard that mostly reflected where rich families live.

The data was technically correct. The output was generically "good." And it was specifically wrong for what I wanted to build. If that sounds familiar — it's the same problem I hit at Product.ai when we tried to prompt-engineer our way to good code. The AI's version of "best practices" is an average of the internet. Your version is specific to your context. Without that context, you get plausible garbage.


#Phase 2: What Are All the Signals?

So I started asking: what else is there? What does the state actually publish?

It turns out — a lot more than test scores:

  • Growth: how the same school's scores move across years
  • Chronic absenteeism rates
  • Suspension rates
  • English learner progress (ELPAC)
  • Per-pupil spending
  • Enrollment and student-teacher ratios
  • Equity gaps by subgroup

Each of these is a lens. None of them alone tells you if a school is "good." Together, they start to tell you something more honest.

But now I was making judgment calls about what matters — which meant I needed a system for encoding those judgments.


#Phase 3: Axioms, Grounding, and Encoding Judgment

An axiom, the way I use the term, is an almost-immutable truth you've decided to build on. Not a guideline. Not a suggestion. A constraint with teeth. Constraints are crystallized taste — someone has to decide what matters, what's non-negotiable, what variance is acceptable. That's the part AI can't do for you.

At Product.ai, we develop axioms through the Adversarial Reasoning Cycle — researching a question across multiple AI models, finding convergence, distilling the result. For SchoolScope, the axioms came from staring at the data and making calls:

  • "Exceeded" carries more signal than "Met"
  • Growth measures the school; proficiency measures the neighborhood
  • Chronic absenteeism is a culture signal
  • Diversity metrics are never scoring inputs
  • Labels read "Needs Support," never "Failing"
  • Go direct to the agency; never depend on intermediary APIs when the raw data is public

These axioms live in Markdown files in a grounding/axioms/ directory. Every AI agent that touches the codebase reads them before writing a line of code. They're the constitution of the product.

#State grounding documents

Each state gets its own grounding doc - a comprehensive Markdown file that describes everything: what test the state uses, where the raw data lives, what format it's in, what the performance levels are called, what's available and what's missing. California's is the most developed. Texas (STAAR), New York (Regents), Florida (FAST) - each has different tests, different file formats, different quirks. I've written grounding documents for 25 states so far.

California's specifies file formats (caret-delimited, Latin-1), encoding changes between years, WAF restrictions on automated downloads, COVID data gaps. These aren't things you want an AI agent to discover by trial and error.
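Those quirks translate directly into parsing code. Here's a minimal sketch of what reading a caret-delimited, Latin-1 CAASPP file might look like in Node. The suppression marker and field handling are assumptions for illustration, not the actual CAASPP layout.

```typescript
import { readFileSync } from "node:fs";

// Split one row of a caret-delimited CAASPP research file.
export function parseCaasppLine(line: string): string[] {
  return line.split("^").map((f) => f.trim());
}

export function parseCaasppFile(path: string): string[][] {
  // Older CAASPP files are Latin-1, not UTF-8 — decode accordingly.
  const text = readFileSync(path, "latin1");
  return text
    .split(/\r?\n/)
    .filter((l) => l.length > 0)
    .map(parseCaasppLine);
}

export function toRate(cell: string): number | null {
  // Assumed convention: suppressed cells ("*") become null, never zero.
  return cell === "*" ? null : Number(cell);
}
```

The null-vs-zero distinction is exactly the kind of thing a grounding doc pins down so an agent doesn't silently treat a suppressed small-school cell as a 0% pass rate.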

Each state also gets a config file in TypeScript:

```typescript
// src/config/states/ca.ts
export const CA_CONFIG = {
  state: "CA",
  testName: "CAASPP Smarter Balanced",
  performanceLevels: ["Exceeded", "Met", "Nearly Met", "Not Met"],
  levels: {
    elementary: {
      grades: [3, 4, 5],
      composite: {
        exceeded: 0.43,
        metAbove: 0.22,
        growth: 0.15,
        absenteeism: 0.10,
        suspension: 0.05,
        elpac: 0.05,
      },
    },
    // middle, high...
  },
};
```

When Texas launches, it gets tx.ts with STAAR equivalents. The import pipeline reads the config. The agents read the config. Nobody hardcodes "CAASPP" anywhere. The architecture is designed so a new state is a new config file and grounding doc - not a rewrite.
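As a sketch of that claim, a hypothetical tx.ts might look like this. The STAAR performance-level names are real; the weights and the elProgress key are illustrative placeholders, not an actual SchoolScope Texas config (which doesn't exist yet).

```typescript
// src/config/states/tx.ts — hypothetical sketch, same shape as ca.ts
export const TX_CONFIG = {
  state: "TX",
  testName: "STAAR",
  performanceLevels: ["Masters", "Meets", "Approaches", "Did Not Meet"],
  levels: {
    elementary: {
      grades: [3, 4, 5],
      composite: {
        exceeded: 0.43,   // "Masters Grade Level" plays the role of CA's "Exceeded"
        metAbove: 0.22,
        growth: 0.15,
        absenteeism: 0.10,
        suspension: 0.05,
        elProgress: 0.05, // TELPAS instead of ELPAC — illustrative key name
      },
    },
    // middle, high...
  },
};
```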

#The Scope Score

The composite weights those axioms into a number. At elementary: exceeded 43%, met+above 22%, growth 15%, absenteeism 10%, suspension 5%, ELPAC 5%.

Every one of those weights is an opinion. The methodology is public. Every weight is explained. Every limitation is stated. If you disagree — and reasonable people should — the whole point is that you can see the machinery. GreatSchools gives you a 1-10 number with no escape hatch. I wanted the opposite: here's the score, here's exactly why, here's what it can't tell you.
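For concreteness, here's a minimal sketch of how those weights could combine into a score, assuming each metric has already been normalized to [0, 1]. The real pipeline's normalization and rounding may differ.

```typescript
// Metric names mirror the config; values assumed pre-normalized to [0, 1].
export type Metrics = {
  exceeded: number;    // share of students at "Exceeded"
  metAbove: number;    // share at "Met" or above
  growth: number;      // normalized cohort growth signal
  absenteeism: number; // inverted: 1 minus chronic absenteeism rate
  suspension: number;  // inverted: 1 minus suspension rate
  elpac: number;       // English learner progress
};

const ELEMENTARY_WEIGHTS: Metrics = {
  exceeded: 0.43,
  metAbove: 0.22,
  growth: 0.15,
  absenteeism: 0.10,
  suspension: 0.05,
  elpac: 0.05,
};

export function scopeScore(m: Metrics, w: Metrics = ELEMENTARY_WEIGHTS): number {
  // Weighted sum of the normalized metrics, scaled to 0-100 for display.
  const keys = Object.keys(w) as (keyof Metrics)[];
  const raw = keys.reduce((sum, k) => sum + w[k] * m[k], 0);
  return Math.round(raw * 100);
}
```

Note how the negative signals (absenteeism, suspension) are inverted before weighting, so every input points the same direction: higher is better.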

I also built archetypes - labels like "Growth Engine" (low raw scores but strong improvement), "High Ceiling" (exceptional exceeded rates), "Steady Foundation" (consistent across metrics). These turned out to be more useful than the number itself. A parent looking at a "Growth Engine" school understands something a percentile rank can't communicate: this school is doing real work with the students it has.
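A toy version of that assignment might look like the following; the labels are SchoolScope's, but the thresholds are invented for the sketch.

```typescript
// Illustrative archetype rules — the cutoffs here are made up for the example.
export type SchoolSignals = {
  metAbovePct: number;      // 0-1, share at Met or above
  exceededPct: number;      // 0-1, share at Exceeded
  growthPercentile: number; // 0-100 vs. the state
};

export function archetype(s: SchoolSignals): string {
  if (s.exceededPct >= 0.5) return "High Ceiling";   // exceptional top-level rates
  if (s.metAbovePct < 0.45 && s.growthPercentile >= 75) {
    return "Growth Engine";                          // low raw scores, strong improvement
  }
  if (s.growthPercentile >= 40 && s.growthPercentile <= 60) {
    return "Steady Foundation";                      // consistent, middle-of-the-pack
  }
  return "Unclassified";
}
```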


#Phase 4: Where It Falls Short

Here's the honest part.

Test scores correlate with demographics. I've done everything I can to mitigate this - weighting growth, contextualizing with spending, showing equity gaps by subgroup, never using diversity metrics as scoring inputs, choosing "Needs Support" over "Failing" as labels. But the correlation doesn't go away. A composite that includes test scores will always partially reflect who walks in the door, not just what happens inside.

Pseudo-cohort isn't true cohort. I actually built something better than cross-sectional: when historical data is available, I track the same school's cohort across years - 2023's 3rd graders measured again as 2025's 5th graders, using SBAC scale scores designed for cross-year comparison. 98% of elementary schools have this data. It's stronger than what most rating sites do (comparing different students at different grades in the same year). But it's still "pseudo" - I'm tracking school-level averages, not individual students. Kids transfer in and out. It's the closest thing to true value-added measurement available from public data, and it's still imperfect.
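The pseudo-cohort comparison can be sketched in a few lines. Field names and the bare scale-score delta are simplifications; in practice you'd compare the delta against the expected grade-to-grade scale-score gain rather than treat any positive number as growth.

```typescript
export type GradeResult = { year: number; grade: number; meanScaleScore: number };

// Track the same school's cohort across years: e.g. 2023's 3rd graders
// measured again as 2025's 5th graders, using SBAC scale scores.
export function pseudoCohortGrowth(
  results: GradeResult[],
  startYear: number,
  startGrade: number,
  span = 2,
): number | null {
  const start = results.find((r) => r.year === startYear && r.grade === startGrade);
  const end = results.find(
    (r) => r.year === startYear + span && r.grade === startGrade + span,
  );
  if (!start || !end) return null; // COVID gap, suppression, or new school
  return end.meanScaleScore - start.meanScaleScore;
}
```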

I can't measure what matters most. Teacher quality. School culture. Whether the art program is any good. Whether your kid will have a friend. Whether the principal actually cares. The data captures outcomes that are measurable at scale. The things parents care about most are the things that don't fit in a spreadsheet.

Small schools get smoothed. A school with 15 test-takers can swing wildly year to year. I apply Bayesian smoothing toward the state average to prevent noise from distorting rankings. That's statistically sound and experientially unsatisfying - it means small schools get pulled toward mediocrity in the data even when they're exceptional in practice.
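The smoothing is the classic shrinkage formula. Here's a sketch, with the prior strength k as an assumed value since the actual constant isn't stated here.

```typescript
// Shrink a small-sample school rate toward the state average.
// k is a pseudo-count: the smaller nTested is relative to k,
// the harder the rate gets pulled toward the state mean.
export function smoothTowardState(
  schoolRate: number, // e.g. share of students at Met or above
  nTested: number,    // test-takers behind that rate
  stateRate: number,
  k = 25,             // assumed prior strength, not SchoolScope's real value
): number {
  return (nTested * schoolRate + k * stateRate) / (nTested + k);
}
```

With k = 25, a 15-student school at 80% met-or-above against a 48% state average shrinks to 60%: small samples move a lot, large samples barely at all.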

Private schools are a black box. 2,452 private schools in California from the NCES survey. No test scores. No growth data. No chronic absenteeism. I show enrollment, student-teacher ratio, religious affiliation - but I can't rank them alongside public schools. The data doesn't exist to do it fairly. I'd rather show an honest "we don't have enough data to score this school" than manufacture a number.

High school scores lack growth data. California only tests 11th graders in high school - there's no earlier grade to compare against within the same level. So the growth trajectory signal that's most powerful at elementary and middle school doesn't exist for high schools. The Scope Score still works, but it's a less complete picture.


#What I Actually Believe Now

Test scores are the least bad objective measure we have. They're real. They're standardized. They're published. You can compare across schools. That matters - especially when the alternative is GreatSchools' black box or Niche's stale federal data supplemented by sparse user reviews.

But they're one lens. I put that line on every page of SchoolScope, and I mean it. A school that scores in the 40th percentile but has strong growth, low absenteeism, and stable teachers might be a better fit for your kid than a 90th percentile school with high suspension rates and demographic sorting.

The real value isn't the number. It's the structure around the number - the context, the comparison, the honesty about limitations. The prompt is what you say to the contractor. The constraints are the building code. SchoolScope's axioms are the building code. A parent who understands that "Exceeded" matters more than "Met," that growth measures the school and proficiency measures the neighborhood, that chronic absenteeism is a culture signal - that parent is making a fundamentally different decision than one staring at a single rating.

I built this because I'll need it soon. My daughter will be in the system in a couple years, and I want to make the decision with real information, not marketing. I've seen how the sausage gets made now — I know what a 7 versus an 8 actually means, and more importantly, what it doesn't mean. I know that a school with strong growth and low absenteeism in a working-class neighborhood might be doing better work than the 9-rated school in the hills. I want that context when it's my kid.

But I also built it because the process changed what I think "real information" means. It's not more data. It's more honest data — data that tells you what it knows, what it doesn't, and where you need to go look for yourself. The number is never the answer. The number plus the context plus the limitations plus the visit plus the gut feeling — that's getting closer.

Let me be real: it's a toy project. Handful of hits a day. I have no idea if I'll invest more time in it. But it taught me something I preach at Product.ai and didn't fully feel until I built this: having a great product means nothing if no one sees it. GreatSchools has brand recognition, backlinks, and a decade of SEO. I have better data and zero distribution. The marketing, the storytelling, the networking — that's the actual hard part. Execution got cheap. Attention didn't.

I used to say it's not about the idea, it's about the implementation. I don't say that anymore. The doing got cheaper. The deciding got more valuable. And the decisions are only as good as the constraints you write down before the first line of code gets generated.


#How It's Built

SchoolScope is a side project built in stolen hours - evenings, weekends, naptime. I wrote approximately zero lines of code by hand, which sounds like a flex but is really about the architecture. I wrote constraints, specs, and axioms - the grounding documents I described above. AI coding agents read those documents and build within them. The constraint engineering follows what I think of as a three-layer model: constraints bound the space, prompts express intent, code is what emerges.

When I wanted to add per-pupil spending data, I didn't open VS Code. I wrote a spec, then dispatched two AI agents in parallel - one to build the data pipeline (download NCES F-33 finance data, parse it, match districts to schools), one to build the UI (spending cards on school profiles, district pages, state comparison). Both agents read the grounding documents. Both ran in parallel. The feature was live in about two hours.

Then I noticed the spending data was three years stale - because we'd used an intermediary API that lagged behind the source. The raw NCES files with current data were sitting on ies.ed.gov the whole time. I dispatched a third agent to go direct to the primary source. Ten minutes later, the data was three years fresher than our competitors.

That mistake became an axiom: "Always go direct to the agency. Never depend on intermediary APIs when the raw data is publicly available." A lesson learned, codified, and now every future agent reads it before touching data imports.

That's how axioms work in practice. You make a mistake. You understand why. You write it down as a constraint. Every agent that comes after inherits the lesson. The axioms compound.

The full methodology is published at schoolscope.co/methodology. Every weight, every data source, every limitation. If you think I weighted something wrong, I'd genuinely like to hear about it.


SchoolScope is live at schoolscope.co. Side project, not affiliated with my employer. All data sourced from public agencies (California Department of Education, NCES, U.S. Census).

Tagged

  • ai
  • architecture
  • education
  • systems