Case studies
Six milestones where the framework fundamentally changed — plus supporting studies and engineering deep-dives. Read top to bottom for the chronological story, or jump to any milestone.
How we measured
Framing · read before the numbers below
Every case study below cites numbers — wall time, complexity units, cache-hit rates, throughput multipliers. This block explains where those numbers come from, how the measurement approach evolved across framework versions, and what the numbers can and can't prove.
- The assumption — a natural experiment
- Six sequential UX-alignment refactors ran across six FitMe screens. Identical scope (same phase list, same compliance checklist, same design-system target) meant any velocity difference between refactor 1 and refactor 6 could only come from screen complexity, practitioner learning, or framework evolution. Normalize for complexity and treat learning as roughly constant after the first run, and what's left is framework evolution — a controlled natural experiment with N=6 initial datapoints, now 17.
- How measurement evolved
- Three generations. v2.0–v5.2 — estimated: wall time from commit timestamps (±15–30 min), cache hit rates inferred from narrative. v6.0 — instrumented: per-phase timestamps, L1/L2/L3 cache counters, tokenizer-based overhead measurement, mandatory eval-coverage gate. v7.0 onwards — continuous factors: view-count tiers replaced binary "has UI", architectural novelty replaced binary "new model".
- The normalization model
- Every feature converts to a single number — Complexity Units (CU) — via
CU = Tasks × Work_Type_Weight × (1 + Σ Complexity_Factors). The primary metric is min/CU: wall time divided by CU; lower is better. This is what makes a 6.5-hour onboarding refactor comparable to a 54-minute 4-feature parallel run. Full formula →
- How we analyzed results
- Three comparison axes: framework-era averages (v2.0 → v7.0), work-type segmentation (refactor vs feature vs enhancement), and execution-mode (serial vs parallel). Trend fitted as a power law — R² = 0.87 under v2 factors. Rolling baselines replaced the single anchor to detect plateaus. Regressions documented honestly: two real ones (Training v4.0, Readiness v4.2), both attributed to measurable learning taxes from new framework capabilities. Full retrospective →
- How we compared across features
- Every case study ends with a Normalized velocity block that cites the same CU formula, making cross-comparison honest. A framework refactor and a new feature land on the same axis. A serial v5.1 run and a parallel v5.1 stress test land on the same axis. The full dataset was submitted for independent review — arithmetic verified, structure sound, weaknesses surfaced and mostly fixed in v6.0. External validation →
- What this can't prove
- Single practitioner. N=17 is small for robust regression. Of the 185 full-audit findings, only 11.4% are externally automated (confirmed by build, test, or independent measurement); 78.9% are framework-only (an AI assertion from code reading). All claims should be read as directional signals, not statistical certainties. The honest reporting of regressions and limitations is what makes the rest of the dataset trustworthy.
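The CU normalization described above can be sketched in a few lines. The work-type weights, task counts, and complexity factors below are illustrative placeholders, not the framework's actual calibration.

```python
# Sketch of the Complexity Units (CU) normalization:
#   CU = Tasks × Work_Type_Weight × (1 + Σ Complexity_Factors)
# All weights and factor values here are hypothetical, chosen only
# to show the shape of the calculation.

WORK_TYPE_WEIGHTS = {  # hypothetical per-work-type weights
    "refactor": 0.8,
    "feature": 1.0,
    "enhancement": 0.6,
}

def complexity_units(tasks: int, work_type: str, factors: list[float]) -> float:
    """CU = Tasks x Work_Type_Weight x (1 + sum of complexity factors)."""
    return tasks * WORK_TYPE_WEIGHTS[work_type] * (1 + sum(factors))

def min_per_cu(wall_minutes: float, cu: float) -> float:
    """Primary velocity metric: wall time divided by CU. Lower is better."""
    return wall_minutes / cu

# Example with made-up inputs: a 10-task feature with two complexity
# factors, measured at 6.5h (390 min) of wall time.
pilot_cu = complexity_units(tasks=10, work_type="feature", factors=[0.3, 0.2])
print(pilot_cu, min_per_cu(390, pilot_cu))  # → 15.0 26.0
```

Because every run reduces to the same min/CU axis, a long serial refactor and a short parallel stress test become directly comparable, which is the point of the normalization.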
Milestones
Six inflection points across the framework's evolution — each one the study where something that had been a hypothesis became a result.
- Milestone 1 · v2.0 · Baseline pilot · 5 min read
The Pilot — Running the Full PM Lifecycle on Onboarding
The pilot run. Full 9-phase PM lifecycle on one feature, end-to-end. Every number that follows in this timeline is relative to this one — including the 3 rework cycles and 5 latent bugs.
Impact: Baseline · 6.5h
- Milestone 2 · v4.1 · Compounding proven · 9 min read
How 6 Screen Refactors Proved a 6.5× Speedup
Six identical-scope refactors across four framework versions. A controlled natural experiment that isolated framework improvement from practitioner learning. Wall time dropped from 6.5h to 1h, defect escape rate from 5 to 0.
Impact: 6.5× faster · defects → 0
- Milestone 3 · v4.4 · Quality gate · 5 min read
Can You Test AI Output Quality the Same Way You Test Code?
Golden I/O tests and heuristic checks across four AI subsystems, all green on first run. Added a quality phase to the lifecycle without adding measurable overhead.
Impact: 29 / 29 green
- Milestone 4 · v5.2 · Parallel dispatch · 7 min read
Shipping 4 Features in 54 Minutes — The Parallel Stress Test
Four independent features dispatched concurrently — 54 minutes from first prompt to four merged PRs. Zero merge conflicts, zero regressions. The stress test that proved the framework could parallelize.
Impact: 4 features / 54 min
- Milestone 5 · v6.0 · Measurement · 7 min read
When We Stopped Estimating and Started Measuring
Deterministic phase timing, skill activation, and cache hits — all measured per feature, not estimated. The version where the framework stopped claiming numbers and started capturing them.
Impact: Instrumentation, not estimation
- Milestone 6 · v7.0 · Hardware-aware · 6 min read
V7.0 HADF — Teaching the Framework to Detect Chip Architecture
The framework learned to detect the machine it runs on. 17 chip profiles, 7 cloud signatures, dispatch routing that adapts to hardware — little-core for mechanical work, big-core for reasoning, cloud only when locally infeasible.
Impact: 17 chip profiles
More case studies
Supporting studies and methodology notes — the work that validates, extends, or explains the milestones above.
- v4.4 · The Most Complex Feature Completed at Refactor Speed · 5 min
- v5.0 · What If You Designed Software Like a Chip? · 5 min
- v5.1 · The Fastest Feature — 86% Velocity Improvement on Auth Flow · 7 min
- v5.1 · First Feature Under the New Architecture — AI Engine Adaptation · 7 min
- v5.2 · What Breaks When You Run 4 Features at Once — And How to Fix It · 6 min
- v5.2 · From "Zero Conflicts by Luck" to "Zero Conflicts by Design" · 5 min
- v6.1 · 185 Findings, 12 Critical — What a Full-System Audit Revealed · 8 min
- v6.1 · Building the Site That Tells the Story — A Two-Hour Meta-Build · 27 min
- v6.1 · The Dual-Sync Race — Two Backends, One Last-Writer-Wins Silence · 6 min
- v6.1 · The Stacked-PR Misfire — When "Merged" Didn't Mean "On Main" · 6 min
- v6.1 · The XCTWaiter Abort — Learning to Stop, Rollback, and Retry · 6 min
- v7.6 · Mechanical Enforcement — How v7.6 Closed the Class B Gap from Gemini's Audit · 8 min
- v7.7 · Validity Closure — How v7.7 Closed the Last Closable Class B Gap · 9 min
- supporting · External Validation — Did Our Numbers Hold Up? · 6 min
- supporting · What If We Had Measurement From Day One? — A Retrospective ROI Analysis · 7 min
- supporting · How We Normalized Complexity Across 16 Different Features · 7 min
- supporting · The operations layer in practice · 3 short studies
Developer deep-dives
Engineering write-ups for readers who want the code-level story — SSR, animation plumbing, component design. Not required reading for the framework narrative.