9 min read
How 6 Screen Refactors Proved a 6.5x Speedup
- Version
- v4.3
- Date
- 2026-04-10
- Tier
- flagship
Six identical-scope screen refactors across four framework versions isolated framework improvement from practitioner learning. Wall time fell from 6.5h (v2.0 onboarding) to 1h (v4.1 settings) — a 6.5× speedup that survived as a controlled natural experiment.
- •Home (v3.0) is a documented 36h outlier — established the v2 file convention plus 3 sub-features. Excluded from curve fits, retained as a labeled data point.
- •6.5× headline conflates framework improvement with practitioner learning even with the controlled-scope design. Tagged T2 Declared, not T1 Instrumented.
How to read this case studyT1/T2/T3 · ledger · kill criterion▾
- T1Instrumented
- Numbers come from a machine-generated ledger or commit. Reproducible. Highest reader trust.
- T2Declared
- Numbers stated by a structured declaration (PRD, plan, frontmatter) but not directly measured.
- T3Narrative
- Estimates and observations from session memory. Useful for context; not citable as evidence.
- Ledger
- Where to verify the claim — a file path, GitHub issue, or backlog entry. Anything labelled
ledger:is the audit trail. - Kill criterion
- The pre-registered threshold under which this work would have been killed mid-flight. Not fired = work shipped without hitting the threshold.
- Deferred
- Items intentionally not closed in this version. Each cites the ledger that tracks remaining work.
- Home (v3.0, outlier)36h
- Onboarding (v2.0)6.5h
- Training (v4.0)5h
- Nutrition (v4.1)2h
- Stats (v4.1)1.5h
- Settings (v4.1)1h
How 6 Screen Refactors Proved a 6.5x Speedup
Six identical-scope tasks. Four framework versions. A controlled natural experiment that isolated framework improvement from practitioner learning.
Context
Between April 5 and April 10, 2026, FitMe iOS underwent a UX alignment pass across all six main screens. Each screen was treated as an independent refactor with its own feature branch, audit, and PR. The screens were refactored sequentially -- not because of a constraint, but because the sequence created a natural experiment. With identical scope (same phases, same compliance checklist, same design system target), any change in velocity between screen 1 and screen 6 can only be explained by screen complexity (normalizable), practitioner learning (roughly constant after the first run), or framework evolution. This case study isolates and measures the third factor.
The Experiment
| Screen | Lines of Code | Refactor Order | Framework Version | Wall Time | PR |
|---|---|---|---|---|---|
| Onboarding | 1,106 | 1st | v2.0 | ~6.5h | #59 |
| Home | 703 | 2nd | v3.0 | ~36h* | #61 |
| Training | 2,135 | 3rd | v4.0 | ~5h | #74 |
| Nutrition | 487 | 4th | v4.1 | ~2h | #75 |
| Stats | 312 | 5th | v4.1 | ~1.5h | #76 |
| Settings | 289 | 6th | v4.1 | ~1h | #77 |
*Home was an outlier -- it established the v2 file convention, spawned 3 sub-features, and integrated external tools for the first time. Its refactoring-only time was ~4-5h, consistent with the trend. All analysis excludes Home from curve fitting while retaining it as a labeled data point.
Three Levels of Improvement
Level 1: Individual Skill Throughput (Micro)
Each specialized skill showed measurable improvement across framework versions:
UX Audit throughput: 13.5 findings/hour (v3.0) improved to 46.0 findings/hour (v4.1) -- a 3.4x gain. The mechanism: a cached principle-to-pattern mapping eliminated per-screen re-derivation of which UX foundations apply to which UI elements. An anti-pattern library grew with each run, compressing time-to-first-finding.
Analytics instrumentation: ~9 events/hour (v2.0-v3.0) jumped to 48 events/hour (v4.0) when a screen-prefix naming convention was formalized and cached. This is the signature of a cached rule replacing a deliberative process -- a single discontinuity, not a gradual improvement.
Implementation velocity: The v2 subdirectory convention, established during the Home refactor, eliminated the most time-consuming pre-implementation decision: how to structure the refactor. Commit count dropped from 20 (Onboarding, patching v1 in place) to 3 (Settings, applying a cached recipe).
Test density: Home was over-tested at 5.25 tests per analytics event because test patterns were being established in real time. By Training, test templates were cached, and density stabilized at 2.4-2.7 tests/event -- leaner but not undertested.
Level 2: Cross-Skill Handoff Quality (Meso)
How skills communicate determines how much context is lost at each handoff. The communication substrate evolved through three stages:
v2.0 -- Conversation context only. Context existed only in the active session window. Closing the session discarded inter-skill state. Resuming required re-reading the PRD and reconstructing intent from scratch.
v3.0 -- Shared durable files. Audit reports, UX specs, and state files persisted across sessions. Any skill could read any prior artifact. Handoffs became explicit file reads, not context reconstruction.
v4.0+ -- Shared files plus a multi-level learning cache. Cache entries were pushed to receiving skills before they began execution. The receiving skill arrived with a hypothesis already formed, not blank. This eliminated the "cold start" cost that each skill paid even when shared files existed.
The handoff improvement is most visible in the /ux-to-/design-to-/dev chain:
- v2.0 (Onboarding): Informal finding list, no severity tiers, no token gap analysis. Result: 20 patches, 3 pbxproj fix commits, 5 latent bugs found later.
- v3.0 (Home): Structured audit report with numbered findings, severity, principle-to-finding index. Result: 5 clean commits, v2 convention established, 0 latent bugs.
- v4.0 (Training): Cache-pre-loaded anti-patterns and principle mappings. Result: 7 files (most complex screen), single pbxproj commit, 0 bugs. Implementation time matched Home despite Training being 3x larger.
Level 3: End-to-End Lifecycle (Macro)
| Metric | v2.0 (Onboarding) | v4.1 avg (Nutrition/Stats/Settings) | Improvement |
|---|---|---|---|
| Wall time | ~6.5h | ~1.5h | 4.3x |
| Planning velocity | 3.7 findings/h | 13.7 findings/h | 3.7x |
| Phase compression | 1.4 phases/h | 6.0 phases/h | 4.3x |
| Rework cycles | 3 (pbxproj bugs) | 0 | Eliminated |
| Defect escape rate | 5 latent bugs | 0 | Eliminated |
The Learning Cache as Inflection Point
The single most impactful structural change was the introduction of a multi-level learning cache at v4.0:
| Framework Version | Cache Hit Rate |
|---|---|
| v2.0 (Onboarding) | 0% |
| v3.0 (Home) | 0% |
| v4.0 (Training) | ~40% |
| v4.1 (Nutrition) | ~55% |
| v4.1 (Stats) | ~65% |
| v4.1 (Settings) | ~70% |
Each percentage point represents a unit of prior research time that did not need to be repeated. At 70% cache hit rate, seven-tenths of the context-derivation work that was performed on Onboarding was no longer being performed at all. The cache does not make skills smarter -- it makes them stop re-deriving things they already know.
Two inflection points appear in the data:
Inflection 1 (v3.0 to v4.0): Training was the most complex screen at 2,135 lines -- larger than all others. Under a natural-learning-only model, it should have taken longer than Home. It completed in 5 hours. Phase compression jumped from 0.25 phases/hour to 1.8 phases/hour -- a 7.2x improvement on a harder problem.
Inflection 2 (v4.0 to v4.1): The shared cross-skill cache made patterns learned by one skill automatically available to others. The Nutrition run logged the first explicit L2 cache hit by name. Phase compression jumped to 6.0 phases/hour.
Every New Capability Costs Before It Pays
The data includes two non-outlier regressions:
- Training v2 (v4.0): 16.0 min/CU vs the 15.2 baseline -- a 5% regression. The cache system was new and being populated for the first time. Learning taxes are real.
- Home v2 (v3.0): 36 hours total. Three independent factors compounded: establishing the v2 file convention (framework development disguised as feature work), scope explosion (3 sub-features spawned), and external tool integration (Figma and Notion connected for the first time).
Both regressions were one-time costs. Every subsequent feature benefited from what these runs established.
The Growth Curve
A logarithmic model fitted to five data points (excluding Home) predicts wall time well for the first two refactors but consistently underestimates performance from the fourth onward. The delta between predicted and actual is stable at ~2.3 hours across three consecutive v4.1 runs -- the signature of a structural improvement (the L2 cache) on top of natural practitioner learning.
Compound efficiency gains across the full span:
| Metric | v2.0 to v4.1 | Factor |
|---|---|---|
| Wall time | 6.5h to 1h | 6.5x faster |
| Planning velocity | 3.7 to 16.0 findings/h | 4.3x |
| Defect escape rate | 5 to 0 | Eliminated |
| Design system tokens added per refactor | 0 to 2.3 avg | System now grows with every run |
| Cache hit rate | 0% to 70% | 70% of prior research eliminated |
The design system growth metric captures a compounding effect. Each token added in a v4.1 run reduces the discovery work for the next run. The framework converted a compliance-check process into a design-system-enrichment loop.
Does the curve plateau? Through six refactors, no. Wall time is still declining (2.0h to 1.5h to 1.0h across the three v4.1 screens), and cache hit rate is still increasing (55% to 65% to 70%).
Key Takeaways
- 6.5x speedup on identical-scope work is the headline, but the defect escape rate dropping from 5 to 0 is the more practically significant result. Speed and quality improved together, not at each other's expense.
- The improvement curves are skill-specific. Analytics shows a discontinuity (cached rule replaces deliberation). Design shows compounding (each refactor enriches the vocabulary). Implementation shows a step change then gradual improvement. These different shapes confirm the improvements are driven by what each skill learned, not a single "the team got faster" effect.
- A framework that achieves 7.2x speedup on a harder problem (Training vs Home) by adding infrastructure is demonstrating returns to investment, not diminishing returns. The conventional prediction would be that more skills, more cache layers, and more shared files add coordination overhead. The data shows the opposite.
- The framework is approaching a floor set by the irreducible complexity of the work itself, not a ceiling set by coordination overhead.