fitme·story
Flagship · v4.4

5 min read

Summary card · 60-second read

Can You Test AI Output Quality the Same Way You Test Code?

Version · v4.4
Date · 2026-04-09
Tier · flagship

AI output quality treated as a testable property. A 2-layer eval design (golden I/O and heuristic XCTest evals, plus a monitoring schema) and a lifecycle expansion from 5 to 6 phases, with Eval inserted between Implementation and Review.

Honest disclosures
  • Three of four eval files were authored alongside the components they test — co-authoring removes some of the catch surface that independent evaluation would expose.
  • +71% velocity is computed against a pre-v6.0 baseline; the baseline itself carries ±15% measurement uncertainty (closed in v6.0).
How to read this case study
T1/T2/T3 · ledger · kill criterion
T1 · Instrumented
Numbers come from a machine-generated ledger or commit. Reproducible. Highest reader trust.
T2 · Declared
Numbers stated by a structured declaration (PRD, plan, frontmatter) but not directly measured.
T3 · Narrative
Estimates and observations from session memory. Useful for context; not citable as evidence.
Ledger
Where to verify the claim — a file path, GitHub issue, or backlog entry. Anything labelled ledger: is the audit trail.
Kill criterion
The pre-registered threshold under which this work would have been killed mid-flight. Not fired = work shipped without hitting the threshold.
Deferred
Items intentionally not closed in this version. Each cites the ledger that tracks remaining work.
Plan · define eval criteria
Implement · build the feature
Eval (new) · golden I/O + heuristics
Review · spec compliance
Merge · ship it
Learn · eval trends in monitoring
The new 6-phase lifecycle. Eval slots between Implementation and Review — criteria defined at Plan, run at Eval, analysed at Learn.

Can You Test AI Output Quality the Same Way You Test Code?

What happens when you add a formal eval layer — golden I/O tests and heuristic quality checks — to an AI-assisted PM framework?

29 / 29
eval cases green on first run (4 AI subsystems)

Context

By framework v4.3, the PM workflow had proven it could ship features fast. But "fast" is meaningless if the AI outputs are wrong. Unit tests verified that code compiled and functions returned expected values — but nobody was testing whether the AI's readiness scores fell in sane ranges, whether recommendation copy matched the user's context, or whether confidence gates fired at the right thresholds. The eval layer was an attempt to close that gap: treat AI output quality as a testable property, not an assumption.


The Problem

Before v4.4, AI quality verification was binary: "build passes, tests pass." But those tests checked code correctness, not output quality. A readiness score that returned 0.99 for a sleep-deprived user with elevated resting heart rate would pass every unit test — the function returned a float in range. The eval layer needed to assert behavioral contracts: given this input profile, the score should fall in this range, the copy should use this tone, the confidence badge should show this level.
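That kind of behavioral contract reads naturally as a golden I/O test. Below is a minimal sketch of what such a case could look like in XCTest; UserSnapshot, ReadinessEngine, and the placeholder scoring formula are illustrative stand-ins, not the project's actual types.

```swift
import XCTest

// Illustrative stand-ins so the sketch compiles; the real formula is
// whatever the framework under test implements.
struct UserSnapshot {
    var sleepHours: Double
    var restingHeartRate: Double
    var baselineRestingHeartRate: Double
}

struct ReadinessEngine {
    func score(for s: UserSnapshot) -> Double {
        let sleepFactor = min(s.sleepHours / 8.0, 1.0)
        let hrPenalty = max(0, (s.restingHeartRate - s.baselineRestingHeartRate) / s.baselineRestingHeartRate)
        return max(0, min(1, sleepFactor - hrPenalty))
    }
}

final class ReadinessFormulaEvalsSketch: XCTestCase {
    func testEval_sleepDeprivedElevatedHR_scoresLow() {
        // Golden input: a profile that should never read as "fully ready".
        let snapshot = UserSnapshot(sleepHours: 4.5,
                                    restingHeartRate: 72,
                                    baselineRestingHeartRate: 58)
        let score = ReadinessEngine().score(for: snapshot)

        // Behavioral contract: the score must land in a sane band,
        // not merely be a valid float in range.
        XCTAssertGreaterThanOrEqual(score, 0.0)
        XCTAssertLessThan(score, 0.5, "sleep-deprived + elevated RHR should not read as ready")
    }
}
```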


The Approach

2-layer design:

  • Layer 1 (XCTest evals): Golden I/O tests (exact input → expected output range) and heuristic quality checks (output properties that should always hold). 29 eval cases across 4 files covering readiness scoring, AI output quality, tier behavior, and user profile integrity.
  • Layer 2 (monitoring schema): A structured ai_quality_metrics schema that captures eval results per feature, making quality trends visible across the project's lifetime. A hypothetical sketch of one record follows this list.
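As a rough illustration of Layer 2, here is what one ai_quality_metrics record might look like as a Codable type. Only the schema name comes from the case study; every field below is an assumption about what "eval results per feature" could capture.

```swift
import Foundation

// Hypothetical shape of one ai_quality_metrics record.
struct AIQualityMetrics: Codable {
    let feature: String            // e.g. "readiness_scoring" (assumed)
    let frameworkVersion: String   // e.g. "v4.4"
    let evalFile: String           // e.g. "ReadinessFormulaEvals"
    let casesRun: Int
    let casesPassed: Int
    let runDate: Date

    // Derived pass rate; not encoded, computed on read.
    var passRate: Double {
        casesRun == 0 ? 0 : Double(casesPassed) / Double(casesRun)
    }
}
```

Encoded per feature run, records like this are what the Learn phase can aggregate into the eval trends the monitoring layer is meant to surface.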

6-phase skill lifecycle: The framework's skill lifecycle expanded from 5 phases to 6, inserting an Eval phase between Implementation and Review. Eval criteria are defined during planning, run in the Eval phase, and analysed in the Learn phase, closing the loop.

Implementation: 45 minutes total. 15 minutes for the spec, 25 minutes for 3 eval files (20 cases), 5 minutes for framework docs. A 4th eval file (9 more cases) took another 10 minutes during the next feature.


Key Metrics

Metric · Value
Wall time · ~55 min (including follow-on eval file)
Eval cases · 29 across 4 files
Eval pass rate · 100% (29/29 green on first run)
Complexity Units · 12.6
Velocity · 4.37 min/CU (+71% vs baseline)
Framework lifecycle change · 5-phase → 6-phase (added Eval)
Coverage gaps identified · 3 subsystems with zero eval coverage

What the Eval Layer Caught

File · Tests · Category · What It Asserts
ReadinessFormulaEvals · 7 · Golden I/O · Score ranges, band assignments, goal-aware shifts, cold start behavior, contradictory signal handling
AIOutputQualityEvals · 7 · Heuristic · Signal coverage, no raw keys in UI strings, copy length bounds, tone matching, confidence badge accuracy
AITierBehaviorEvals · 6 · Tier behavior · Local fallback works, confidence gate fires at 0.4 boundary, empty/stale snapshot handling
ProfileEvals · 9 · Golden I/O + heuristic · Minimal/full profile handling, goal mutation, enum completeness, analytics prefix compliance

The confidence gate boundary test (0.39 fails, exactly 0.4 passes, 0.41 passes) is the kind of edge case that escapes unit tests but breaks real user experiences.
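A boundary case like that can be pinned down in a few assertions. The gate function below is a stand-in; only the 0.4 threshold and the pass/fail behavior around it come from the case study.

```swift
import XCTest

// Illustrative gate: passes when confidence meets or exceeds the threshold.
func confidenceGatePasses(_ confidence: Double, threshold: Double = 0.4) -> Bool {
    confidence >= threshold
}

final class ConfidenceGateBoundarySketch: XCTestCase {
    func testEval_confidenceGate_exactBoundaryBehaviour() {
        XCTAssertFalse(confidenceGatePasses(0.39), "just below the gate must fall back")
        XCTAssertTrue(confidenceGatePasses(0.40), "the exact boundary passes")
        XCTAssertTrue(confidenceGatePasses(0.41), "just above the gate passes")
    }
}
```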


What This Proved About the Framework

  1. Eval patterns are reusable across features. The 4th eval file was written in 10 minutes by following the testEval_ convention from the first 3. The pattern scales without additional design work (see the heuristic-check sketch after this list).

  2. Quality infrastructure can be free. The eval phase added zero measurable overhead to feature delivery. Eval tests compile and run in under 1 second. The framework got more rigorous without getting slower.

  3. Coverage gaps become visible. Once you formalize what's tested, the gaps are obvious: nutrition recommendations, training plan suggestions, and cohort intelligence had zero eval coverage. The eval layer didn't just test — it created a map of what's untested.

  4. The closed loop works. Define quality criteria at planning → build → verify against criteria → learn from results. This cycle prevents the "ship fast, discover quality problems later" pattern.
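To make the reusability point concrete, here is a sketch of a heuristic check written in the testEval_ convention. The copy provider and the specific raw keys are assumptions; the shape of the check is the point.

```swift
import XCTest
import Foundation

final class AIOutputQualityEvalsSketch: XCTestCase {

    // Stand-in for whatever produces user-facing recommendation copy.
    func makeRecommendationCopy() -> String {
        "Ease off today: your recovery signals suggest a lighter session."
    }

    func testEval_recommendationCopy_noRawKeysAndWithinLengthBounds() {
        let copy = makeRecommendationCopy()

        // Heuristic 1: raw metric keys must not leak into UI strings.
        // (Key names here are hypothetical examples.)
        XCTAssertFalse(copy.contains("resting_heart_rate"))
        XCTAssertFalse(copy.contains("hrv_rmssd"))

        // Heuristic 2: copy length stays within readable bounds.
        XCTAssertGreaterThan(copy.count, 20)
        XCTAssertLessThan(copy.count, 200)
    }
}
```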


Key Takeaways

  • AI output quality is testable. Golden I/O tests and heuristic checks work the same way for AI outputs as they do for traditional code — you just need to define what "correct" means for probabilistic outputs.
  • Adding a quality phase to the lifecycle cost nothing in delivery speed because eval tests parallelize with other test work.
  • The 4.37 min/CU velocity at v4.4 showed that framework improvements compound: each version adds capability without adding cost, because the new capabilities are designed to run in parallel with existing work.