How 6 Screen Refactors Proved a 6.5x Speedup

Six identical-scope tasks. Four framework versions. A controlled natural experiment that isolated framework improvement from practitioner learning.

6.5×

speedup across 6 identical-scope refactors

Context

Between April 5 and April 10, 2026, FitMe iOS underwent a UX alignment pass across all six main screens. Each screen was treated as an independent refactor with its own feature branch, audit, and PR. The screens were refactored sequentially -- not because of a constraint, but because the sequence created a natural experiment. With identical scope (same phases, same compliance checklist, same design system target), any change in velocity between screen 1 and screen 6 can only be explained by screen complexity (normalizable), practitioner learning (roughly constant after the first run), or framework evolution. This case study isolates and measures the third factor.

The Experiment

Screen	Lines of Code	Refactor Order	Framework Version	Wall Time	PR
Onboarding	1,106	1st	v2.0	~6.5h	#59
Home	703	2nd	v3.0	~36h*	#61
Training	2,135	3rd	v4.0	~5h	#74
Nutrition	487	4th	v4.1	~2h	#75
Stats	312	5th	v4.1	~1.5h	#76
Settings	289	6th	v4.1	~1h	#77

*Home was an outlier -- it established the v2 file convention, spawned 3 sub-features, and integrated external tools for the first time. Its refactoring-only time was ~4-5h, consistent with the trend. All analysis excludes Home from curve fitting while retaining it as a labeled data point.

Three Levels of Improvement

Level 1: Individual Skill Throughput (Micro)

Each specialized skill showed measurable improvement across framework versions:

UX Audit throughput: 13.5 findings/hour (v3.0) improved to 46.0 findings/hour (v4.1) -- a 3.4x gain. The mechanism: a cached principle-to-pattern mapping eliminated per-screen re-derivation of which UX foundations apply to which UI elements. An anti-pattern library grew with each run, compressing time-to-first-finding.

Analytics instrumentation: ~9 events/hour (v2.0-v3.0) jumped to 48 events/hour (v4.0) when a screen-prefix naming convention was formalized and cached. This is the signature of a cached rule replacing a deliberative process -- a single discontinuity, not a gradual improvement.

Implementation velocity: The v2 subdirectory convention, established during the Home refactor, eliminated the most time-consuming pre-implementation decision: how to structure the refactor. Commit count dropped from 20 (Onboarding, patching v1 in place) to 3 (Settings, applying a cached recipe).

Test density: Home was over-tested at 5.25 tests per analytics event because test patterns were being established in real time. By Training, test templates were cached, and density stabilized at 2.4-2.7 tests/event -- leaner but not undertested.

Level 2: Cross-Skill Handoff Quality (Meso)

How skills communicate determines how much context is lost at each handoff. The communication substrate evolved through three stages:

v2.0 -- Conversation context only. Context existed only in the active session window. Closing the session discarded inter-skill state. Resuming required re-reading the PRD and reconstructing intent from scratch.

v3.0 -- Shared durable files. Audit reports, UX specs, and state files persisted across sessions. Any skill could read any prior artifact. Handoffs became explicit file reads, not context reconstruction.

v4.0+ -- Shared files plus a multi-level learning cache. Cache entries were pushed to receiving skills before they began execution. The receiving skill arrived with a hypothesis already formed, not blank. This eliminated the "cold start" cost that each skill paid even when shared files existed.

The handoff improvement is most visible in the /ux-to-/design-to-/dev chain:

v2.0 (Onboarding): Informal finding list, no severity tiers, no token gap analysis. Result: 20 patches, 3 pbxproj fix commits, 5 latent bugs found later.
v3.0 (Home): Structured audit report with numbered findings, severity, principle-to-finding index. Result: 5 clean commits, v2 convention established, 0 latent bugs.
v4.0 (Training): Cache-pre-loaded anti-patterns and principle mappings. Result: 7 files (most complex screen), single pbxproj commit, 0 bugs. Implementation time matched Home despite Training being 3x larger.

Level 3: End-to-End Lifecycle (Macro)

Metric	v2.0 (Onboarding)	v4.1 avg (Nutrition/Stats/Settings)	Improvement
Wall time	~6.5h	~1.5h	4.3x
Planning velocity	3.7 findings/h	13.7 findings/h	3.7x
Phase compression	1.4 phases/h	6.0 phases/h	4.3x
Rework cycles	3 (pbxproj bugs)	0	Eliminated
Defect escape rate	5 latent bugs	0	Eliminated

v2.0 (Onboarding)

6.5h

3 rework cycles · 5 latent bugs

v4.1 avg (3 screens)

1.5h

0 rework · 0 bugs

The Learning Cache as Inflection Point

The single most impactful structural change was the introduction of a multi-level learning cache at v4.0:

Framework Version	Cache Hit Rate
v2.0 (Onboarding)	0%
v3.0 (Home)	0%
v4.0 (Training)	~40%
v4.1 (Nutrition)	~55%
v4.1 (Stats)	~65%
v4.1 (Settings)	~70%

Each percentage point represents a unit of prior research time that did not need to be repeated. At 70% cache hit rate, seven-tenths of the context-derivation work that was performed on Onboarding was no longer being performed at all. The cache does not make skills smarter -- it makes them stop re-deriving things they already know.

Two inflection points appear in the data:

Inflection 1 (v3.0 to v4.0): Training was the most complex screen at 2,135 lines -- larger than all others. Under a natural-learning-only model, it should have taken longer than Home. It completed in 5 hours. Phase compression jumped from 0.25 phases/hour to 1.8 phases/hour -- a 7.2x improvement on a harder problem.

Inflection 2 (v4.0 to v4.1): The shared cross-skill cache made patterns learned by one skill automatically available to others. The Nutrition run logged the first explicit L2 cache hit by name. Phase compression jumped to 6.0 phases/hour.

Every New Capability Costs Before It Pays

The data includes two non-outlier regressions:

Training v2 (v4.0): 16.0 min/CU vs the 15.2 baseline -- a 5% regression. The cache system was new and being populated for the first time. Learning taxes are real.
Home v2 (v3.0): 36 hours total. Three independent factors compounded: establishing the v2 file convention (framework development disguised as feature work), scope explosion (3 sub-features spawned), and external tool integration (Figma and Notion connected for the first time).

Both regressions were one-time costs. Every subsequent feature benefited from what these runs established.

The Growth Curve

A logarithmic model fitted to five data points (excluding Home) predicts wall time well for the first two refactors but consistently underestimates performance from the fourth onward. The delta between predicted and actual is stable at ~2.3 hours across three consecutive v4.1 runs -- the signature of a structural improvement (the L2 cache) on top of natural practitioner learning.

Compound efficiency gains across the full span:

Metric	v2.0 to v4.1	Factor
Wall time	6.5h to 1h	6.5x faster
Planning velocity	3.7 to 16.0 findings/h	4.3x
Defect escape rate	5 to 0	Eliminated
Design system tokens added per refactor	0 to 2.3 avg	System now grows with every run
Cache hit rate	0% to 70%	70% of prior research eliminated

The design system growth metric captures a compounding effect. Each token added in a v4.1 run reduces the discovery work for the next run. The framework converted a compliance-check process into a design-system-enrichment loop.

Does the curve plateau? Through six refactors, no. Wall time is still declining (2.0h to 1.5h to 1.0h across the three v4.1 screens), and cache hit rate is still increasing (55% to 65% to 70%).

Key Takeaways

6.5x speedup on identical-scope work is the headline, but the defect escape rate dropping from 5 to 0 is the more practically significant result. Speed and quality improved together, not at each other's expense.
The improvement curves are skill-specific. Analytics shows a discontinuity (cached rule replaces deliberation). Design shows compounding (each refactor enriches the vocabulary). Implementation shows a step change then gradual improvement. These different shapes confirm the improvements are driven by what each skill learned, not a single "the team got faster" effect.
A framework that achieves 7.2x speedup on a harder problem (Training vs Home) by adding infrastructure is demonstrating returns to investment, not diminishing returns. The conventional prediction would be that more skills, more cache layers, and more shared files add coordination overhead. The data shows the opposite.
The framework is approaching a floor set by the irreducible complexity of the work itself, not a ceiling set by coordination overhead.