
Case studies

Six milestones where the framework fundamentally changed — plus supporting studies and engineering deep-dives. Read top to bottom for the chronological story, or jump to any milestone.

How we measured

Framing · read before the numbers below

Every case study below cites numbers — wall time, complexity units, cache-hit rates, throughput multipliers. This block explains where those numbers come from, how the measurement approach evolved across framework versions, and what the numbers can and can't prove.

The assumption — a natural experiment
Six sequential UX-alignment refactors ran across six FitMe screens. Identical scope (same phase list, same compliance checklist, same design-system target) meant any velocity difference between refactor 1 and refactor 6 could only come from screen complexity, practitioner learning, or framework evolution. Normalize for complexity and treat learning as roughly constant after the first run, and what's left is framework evolution — a controlled natural experiment with N=6 initial datapoints, now 17.
How measurement evolved
Three generations. v2.0–v5.2 — estimated: wall time from commit timestamps (±15–30 min), cache-hit rates inferred from narrative. v6.0 — instrumented: per-phase timestamps, L1/L2/L3 cache counters, tokenizer-based overhead measurement, a mandatory eval-coverage gate. v7.0 onwards — continuous factors: view-count tiers replaced the binary "has UI" flag, and architectural novelty replaced the binary "new model" flag.
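For concreteness, here is a minimal TypeScript sketch of what a v6.0-style per-phase record could look like. The field names and the PhaseTimer wrapper are assumptions for illustration; the text above specifies only what is captured (per-phase timestamps, L1/L2/L3 cache counters, tokenizer-measured overhead), not the schema.

```ts
// Illustrative shape of a v6.0 instrumentation record. Field names are
// assumptions; only the captured quantities come from the description above.
interface PhaseRecord {
  phase: string;            // one of the 9 lifecycle phases
  startedAt: number;        // epoch ms, captured at phase entry
  endedAt: number;          // epoch ms, captured at phase exit
  cacheHits: { l1: number; l2: number; l3: number };
  overheadTokens: number;   // tokenizer-measured framework overhead
}

class PhaseTimer {
  private records: PhaseRecord[] = [];

  // Wrap a phase so its wall time is measured, not estimated.
  run<T>(phase: string, body: () => T): T {
    const rec: PhaseRecord = {
      phase,
      startedAt: Date.now(),
      endedAt: 0,
      // Counters would be incremented by cache and tokenizer hooks,
      // elided in this sketch.
      cacheHits: { l1: 0, l2: 0, l3: 0 },
      overheadTokens: 0,
    };
    try {
      return body();
    } finally {
      rec.endedAt = Date.now();
      this.records.push(rec);
    }
  }

  // Total measured wall time across all recorded phases, in minutes.
  wallTimeMinutes(): number {
    return (
      this.records.reduce((ms, r) => ms + (r.endedAt - r.startedAt), 0) / 60_000
    );
  }
}
```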
The normalization model
Every feature converts to a single number — Complexity Units (CU) — via CU = Tasks × Work_Type_Weight × (1 + Σ Complexity_Factors). The primary metric is min/CU: wall time divided by CU; lower is better. This is what makes a 6.5-hour onboarding refactor comparable to a 54-minute 4-feature parallel run. Full formula →
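A minimal TypeScript sketch of the normalization, assuming illustrative work-type weights, field names, and factor values; only the shape of the formula comes from the text above.

```ts
// CU = Tasks × Work_Type_Weight × (1 + Σ Complexity_Factors)
type WorkType = "refactor" | "feature" | "enhancement";

// Hypothetical weights; the real per-work-type values are not published here.
const WORK_TYPE_WEIGHT: Record<WorkType, number> = {
  refactor: 1.0,
  feature: 1.5,
  enhancement: 0.8,
};

interface FeatureRun {
  tasks: number;               // task count for the run
  workType: WorkType;
  complexityFactors: number[]; // e.g. view-count tier, architectural novelty
  wallTimeMinutes: number;
}

function complexityUnits(run: FeatureRun): number {
  const factorSum = run.complexityFactors.reduce((a, b) => a + b, 0);
  return run.tasks * WORK_TYPE_WEIGHT[run.workType] * (1 + factorSum);
}

// Primary metric: minutes per complexity unit; lower is better.
function minPerCU(run: FeatureRun): number {
  return run.wallTimeMinutes / complexityUnits(run);
}
```

With this in place, the 6.5-hour onboarding refactor and the 54-minute four-feature parallel run land on the same axis: each run's wall time is divided by its own CU before any comparison is made.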
How we analyzed results
Three comparison axes: framework-era averages (v2.0 → v7.0), work-type segmentation (refactor vs feature vs enhancement), and execution-mode (serial vs parallel). Trend fitted as a power law — R² = 0.87 under v2 factors. Rolling baselines replaced the single anchor to detect plateaus. Regressions documented honestly: two real ones (Training v4.0, Readiness v4.2), both attributed to measurable learning taxes from new framework capabilities. Full retrospective →
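The power-law fit itself is standard: since y = a·x^b becomes log y = log a + b·log x, taking logs reduces it to ordinary least squares. A minimal sketch below, assuming min/CU as y against run index as x; log-log OLS is one common way to fit a power law, not necessarily the exact procedure used here.

```ts
// Fit y = a·x^b by OLS on log-transformed data and report R² in log space.
// Inputs must be positive (e.g. run index 1..N vs. measured min/CU).
function fitPowerLaw(xs: number[], ys: number[]) {
  const lx = xs.map(Math.log);
  const ly = ys.map(Math.log);
  const n = lx.length;
  const mean = (v: number[]) => v.reduce((a, b) => a + b, 0) / n;
  const mx = mean(lx);
  const my = mean(ly);

  // OLS slope and intercept in log-log space.
  let sxy = 0;
  let sxx = 0;
  for (let i = 0; i < n; i++) {
    sxy += (lx[i] - mx) * (ly[i] - my);
    sxx += (lx[i] - mx) ** 2;
  }
  const b = sxy / sxx;             // power-law exponent
  const a = Math.exp(my - b * mx); // scale factor

  // R² = 1 − SS_residual / SS_total, computed on the log-transformed data.
  let ssRes = 0;
  let ssTot = 0;
  for (let i = 0; i < n; i++) {
    const pred = Math.log(a) + b * lx[i];
    ssRes += (ly[i] - pred) ** 2;
    ssTot += (ly[i] - my) ** 2;
  }
  return { a, b, r2: 1 - ssRes / ssTot };
}
```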
How we compared across features
Every case study ends with a Normalized velocity block that cites the same CU formula, making cross-comparison honest. A framework refactor and a new feature land on the same axis. A serial v5.1 run and a parallel v5.1 stress test land on the same axis. The full dataset was submitted for independent review — arithmetic verified, structure sound, weaknesses surfaced and mostly fixed in v6.0. External validation →
What this can't prove
Single practitioner. N=17 is small for robust regression. Of the 185 full-audit findings, only 11.4% are externally-automated (confirmed by build, test, or independent measurement); 78.9% are framework-only (AI assertion from code reading). All claims should be read as directional signals, not statistical certainties. The honest reporting of regressions and limitations is what makes the rest of the dataset trustworthy.

Milestones

Six inflection points across the framework's evolution — each one the study where something that had been a hypothesis became a result.

  1. Milestone 1 · v2.0 · Baseline pilot · 5 min read

    The Pilot — Running the Full PM Lifecycle on Onboarding

    The pilot run. Full 9-phase PM lifecycle on one feature, end-to-end. Every number that follows in this timeline is relative to this one — including the 3 rework cycles and 5 latent bugs.

    Impact: Baseline · 6.5h
  2. Milestone 2 · v4.1 · Compounding proven · 9 min read

    How 6 Screen Refactors Proved a 6.5× Speedup

    Six identical-scope refactors across four framework versions. A controlled natural experiment that isolated framework improvement from practitioner learning. Wall time dropped from 6.5h to 1h; escaped defects fell from 5 to 0.

    Impact: 6.5× faster · defects → 0
  3. Milestone 3 · v4.4 · Quality gate · 5 min read

    Can You Test AI Output Quality the Same Way You Test Code?

    Golden I/O tests and heuristic checks across four AI subsystems, all green on first run. Added a quality phase to the lifecycle without adding measurable overhead.

    Impact: 29 / 29 green
  4. Milestone 4 · v5.2 · Parallel dispatch · 7 min read

    Shipping 4 Features in 54 Minutes — The Parallel Stress Test

    Four independent features dispatched concurrently — 54 minutes from first prompt to four merged PRs. Zero merge conflicts, zero regressions. The stress test that proved the framework could parallelize.

    Impact: 4 features / 54 min
  5. Milestone 5 · v6.0 · Measurement · 7 min read

    When We Stopped Estimating and Started Measuring

    Deterministic phase timing, skill activation, and cache hits — all measured per feature, not estimated. The version where the framework stopped claiming numbers and started capturing them.

    Impact: Instrumentation, not estimation
  6. Milestone 6 · v7.0 · Hardware-aware · 6 min read

    V7.0 HADF — Teaching the Framework to Detect Chip Architecture

    The framework learned to detect the machine it runs on. 17 chip profiles, 7 cloud signatures, dispatch routing that adapts to hardware — little-core for mechanical work, big-core for reasoning, cloud only when locally infeasible.

    Impact: 17 chip profiles

More case studies

Supporting studies and methodology notes — the work that validates, extends, or explains the milestones above.

Developer deep-dives

Engineering write-ups for readers who want the code-level story — SSR, animation plumbing, component design. Not required reading for the framework narrative.