fitme·story
Flagship · v4.3

9 min read

Summary card · 60-second read

How 6 Screen Refactors Proved a 6.5x Speedup

Version
v4.3
Date
2026-04-10
Tier
flagship

Six identical-scope screen refactors across four framework versions isolated framework improvement from practitioner learning. Wall time fell from 6.5h (v2.0 onboarding) to 1h (v4.1 settings) — a 6.5× speedup that survived as a controlled natural experiment.

Honest disclosures
  • Home (v3.0) is a documented 36h outlier — established the v2 file convention plus 3 sub-features. Excluded from curve fits, retained as a labeled data point.
  • 6.5× headline conflates framework improvement with practitioner learning even with the controlled-scope design. Tagged T2 Declared, not T1 Instrumented.
How to read this case studyT1/T2/T3 · ledger · kill criterion
T1Instrumented
Numbers come from a machine-generated ledger or commit. Reproducible. Highest reader trust.
T2Declared
Numbers stated by a structured declaration (PRD, plan, frontmatter) but not directly measured.
T3Narrative
Estimates and observations from session memory. Useful for context; not citable as evidence.
Ledger
Where to verify the claim — a file path, GitHub issue, or backlog entry. Anything labelled ledger: is the audit trail.
Kill criterion
The pre-registered threshold under which this work would have been killed mid-flight. Not fired = work shipped without hitting the threshold.
Deferred
Items intentionally not closed in this version. Each cites the ledger that tracks remaining work.
Wall time per refactor — sorted (Home outlier flagged)
Home's 36h includes establishing the v2 convention + 3 sub-features. Refactoring-only time was ~4–5h, consistent with the trend.

How 6 Screen Refactors Proved a 6.5x Speedup

Six identical-scope tasks. Four framework versions. A controlled natural experiment that isolated framework improvement from practitioner learning.

6.5×
speedup across 6 identical-scope refactors

Context

Between April 5 and April 10, 2026, FitMe iOS underwent a UX alignment pass across all six main screens. Each screen was treated as an independent refactor with its own feature branch, audit, and PR. The screens were refactored sequentially -- not because of a constraint, but because the sequence created a natural experiment. With identical scope (same phases, same compliance checklist, same design system target), any change in velocity between screen 1 and screen 6 can only be explained by screen complexity (normalizable), practitioner learning (roughly constant after the first run), or framework evolution. This case study isolates and measures the third factor.


The Experiment

ScreenLines of CodeRefactor OrderFramework VersionWall TimePR
Onboarding1,1061stv2.0~6.5h#59
Home7032ndv3.0~36h*#61
Training2,1353rdv4.0~5h#74
Nutrition4874thv4.1~2h#75
Stats3125thv4.1~1.5h#76
Settings2896thv4.1~1h#77

*Home was an outlier -- it established the v2 file convention, spawned 3 sub-features, and integrated external tools for the first time. Its refactoring-only time was ~4-5h, consistent with the trend. All analysis excludes Home from curve fitting while retaining it as a labeled data point.


Three Levels of Improvement

Level 1: Individual Skill Throughput (Micro)

Each specialized skill showed measurable improvement across framework versions:

UX Audit throughput: 13.5 findings/hour (v3.0) improved to 46.0 findings/hour (v4.1) -- a 3.4x gain. The mechanism: a cached principle-to-pattern mapping eliminated per-screen re-derivation of which UX foundations apply to which UI elements. An anti-pattern library grew with each run, compressing time-to-first-finding.

Analytics instrumentation: ~9 events/hour (v2.0-v3.0) jumped to 48 events/hour (v4.0) when a screen-prefix naming convention was formalized and cached. This is the signature of a cached rule replacing a deliberative process -- a single discontinuity, not a gradual improvement.

Implementation velocity: The v2 subdirectory convention, established during the Home refactor, eliminated the most time-consuming pre-implementation decision: how to structure the refactor. Commit count dropped from 20 (Onboarding, patching v1 in place) to 3 (Settings, applying a cached recipe).

Test density: Home was over-tested at 5.25 tests per analytics event because test patterns were being established in real time. By Training, test templates were cached, and density stabilized at 2.4-2.7 tests/event -- leaner but not undertested.

Level 2: Cross-Skill Handoff Quality (Meso)

How skills communicate determines how much context is lost at each handoff. The communication substrate evolved through three stages:

v2.0 -- Conversation context only. Context existed only in the active session window. Closing the session discarded inter-skill state. Resuming required re-reading the PRD and reconstructing intent from scratch.

v3.0 -- Shared durable files. Audit reports, UX specs, and state files persisted across sessions. Any skill could read any prior artifact. Handoffs became explicit file reads, not context reconstruction.

v4.0+ -- Shared files plus a multi-level learning cache. Cache entries were pushed to receiving skills before they began execution. The receiving skill arrived with a hypothesis already formed, not blank. This eliminated the "cold start" cost that each skill paid even when shared files existed.

The handoff improvement is most visible in the /ux-to-/design-to-/dev chain:

  • v2.0 (Onboarding): Informal finding list, no severity tiers, no token gap analysis. Result: 20 patches, 3 pbxproj fix commits, 5 latent bugs found later.
  • v3.0 (Home): Structured audit report with numbered findings, severity, principle-to-finding index. Result: 5 clean commits, v2 convention established, 0 latent bugs.
  • v4.0 (Training): Cache-pre-loaded anti-patterns and principle mappings. Result: 7 files (most complex screen), single pbxproj commit, 0 bugs. Implementation time matched Home despite Training being 3x larger.

Level 3: End-to-End Lifecycle (Macro)

Metricv2.0 (Onboarding)v4.1 avg (Nutrition/Stats/Settings)Improvement
Wall time~6.5h~1.5h4.3x
Planning velocity3.7 findings/h13.7 findings/h3.7x
Phase compression1.4 phases/h6.0 phases/h4.3x
Rework cycles3 (pbxproj bugs)0Eliminated
Defect escape rate5 latent bugs0Eliminated
v2.0 (Onboarding)
6.5h
3 rework cycles · 5 latent bugs
v4.1 avg (3 screens)
1.5h
0 rework · 0 bugs

The Learning Cache as Inflection Point

The single most impactful structural change was the introduction of a multi-level learning cache at v4.0:

Framework VersionCache Hit Rate
v2.0 (Onboarding)0%
v3.0 (Home)0%
v4.0 (Training)~40%
v4.1 (Nutrition)~55%
v4.1 (Stats)~65%
v4.1 (Settings)~70%

Each percentage point represents a unit of prior research time that did not need to be repeated. At 70% cache hit rate, seven-tenths of the context-derivation work that was performed on Onboarding was no longer being performed at all. The cache does not make skills smarter -- it makes them stop re-deriving things they already know.

Two inflection points appear in the data:

Inflection 1 (v3.0 to v4.0): Training was the most complex screen at 2,135 lines -- larger than all others. Under a natural-learning-only model, it should have taken longer than Home. It completed in 5 hours. Phase compression jumped from 0.25 phases/hour to 1.8 phases/hour -- a 7.2x improvement on a harder problem.

Inflection 2 (v4.0 to v4.1): The shared cross-skill cache made patterns learned by one skill automatically available to others. The Nutrition run logged the first explicit L2 cache hit by name. Phase compression jumped to 6.0 phases/hour.


Every New Capability Costs Before It Pays

The data includes two non-outlier regressions:

  • Training v2 (v4.0): 16.0 min/CU vs the 15.2 baseline -- a 5% regression. The cache system was new and being populated for the first time. Learning taxes are real.
  • Home v2 (v3.0): 36 hours total. Three independent factors compounded: establishing the v2 file convention (framework development disguised as feature work), scope explosion (3 sub-features spawned), and external tool integration (Figma and Notion connected for the first time).

Both regressions were one-time costs. Every subsequent feature benefited from what these runs established.


The Growth Curve

A logarithmic model fitted to five data points (excluding Home) predicts wall time well for the first two refactors but consistently underestimates performance from the fourth onward. The delta between predicted and actual is stable at ~2.3 hours across three consecutive v4.1 runs -- the signature of a structural improvement (the L2 cache) on top of natural practitioner learning.

Compound efficiency gains across the full span:

Metricv2.0 to v4.1Factor
Wall time6.5h to 1h6.5x faster
Planning velocity3.7 to 16.0 findings/h4.3x
Defect escape rate5 to 0Eliminated
Design system tokens added per refactor0 to 2.3 avgSystem now grows with every run
Cache hit rate0% to 70%70% of prior research eliminated

The design system growth metric captures a compounding effect. Each token added in a v4.1 run reduces the discovery work for the next run. The framework converted a compliance-check process into a design-system-enrichment loop.

Does the curve plateau? Through six refactors, no. Wall time is still declining (2.0h to 1.5h to 1.0h across the three v4.1 screens), and cache hit rate is still increasing (55% to 65% to 70%).


Key Takeaways

  • 6.5x speedup on identical-scope work is the headline, but the defect escape rate dropping from 5 to 0 is the more practically significant result. Speed and quality improved together, not at each other's expense.
  • The improvement curves are skill-specific. Analytics shows a discontinuity (cached rule replaces deliberation). Design shows compounding (each refactor enriches the vocabulary). Implementation shows a step change then gradual improvement. These different shapes confirm the improvements are driven by what each skill learned, not a single "the team got faster" effect.
  • A framework that achieves 7.2x speedup on a harder problem (Training vs Home) by adding infrastructure is demonstrating returns to investment, not diminishing returns. The conventional prediction would be that more skills, more cache layers, and more shared files add coordination overhead. The data shows the opposite.
  • The framework is approaching a floor set by the irreducible complexity of the work itself, not a ceiling set by coordination overhead.