Summary card · 60-second read

How We Normalized Complexity Across 16 Different Features

Date: 2026-04-16
Tier: appendix

Raw metrics like wall time and file count are meaningless without normalization. The Complexity Unit (CU) model — additive factors for tasks, types, views, design iterations, architectural novelty — makes 16 different features comparable.

Honest disclosures
  • CU magnitudes are judgment-based; v7.6 mechanical enforcement validates schema (presence + ranges + total) but not whether a number is the right number for a feature.
  • The 16-feature corpus is the whole project, not a controlled sample. Generalisation outside FitMe is unproven.
How to read this case study
T1/T2/T3 · ledger · kill criterion
T1 · Instrumented
Numbers come from a machine-generated ledger or commit. Reproducible. Highest reader trust.
T2 · Declared
Numbers stated by a structured declaration (PRD, plan, frontmatter) but not directly measured.
T3 · Narrative
Estimates and observations from session memory. Useful for context; not citable as evidence.
Ledger
Where to verify the claim — a file path, GitHub issue, or backlog entry. Anything labelled ledger: is the audit trail.
Kill criterion
The pre-registered threshold under which this work would have been killed mid-flight. Not fired = work shipped without hitting the threshold.
Deferred
Items intentionally not closed in this version. Each cites the ledger that tracks remaining work.


How We Normalized Complexity Across 16 Different Features

Raw metrics like wall time and file count are meaningless without normalization. A 22-task UI refactor with auth integration is fundamentally different from a 4-task backend enhancement. This model makes them comparable.

Context

Across 16 case-studied features, the project produced a rare longitudinal dataset: identical workflow template, sequential execution, and progressively newer framework versions. But comparing "6.5 hours for onboarding" to "1.5 hours for AI engine" is misleading without accounting for the fact that onboarding had 22 tasks with UI work while the AI engine enhancement had 13 tasks with cross-cutting architectural scope. The Complexity Unit (CU) model exists to make these comparisons honest.


The Formula

CU = Tasks x Work_Type_Weight x (1 + sum(Complexity_Factors))

Work Type Weights

| Work Type | Weight | Rationale |
|-----------|--------|-----------|
| Feature | 1.0 | Full lifecycle, maximum ceremony |
| Refactor (v2) | 0.9 | Full lifecycle, but v1 exists as reference |
| Enhancement | 0.8 | 4-phase lifecycle, parent PRD exists |
| Fix | 0.5 | 2-phase lifecycle, minimal planning |
| Chore | 0.3 | 1-phase, docs/config only |

Complexity Factors (v2 -- continuous)

| Factor | v1 (binary) | v2 (continuous) | Signal |
|--------|-------------|-----------------|--------|
| Has UI | +0.3 | +0.15 (1 view) / +0.30 (2-3) / +0.45 (4+) | View count from state |
| Auth/External Service | +0.5 | +0.5 (unchanged) | Binary flag |
| Runtime Testing Required | +0.4 | +0.4 (unchanged) | Binary flag |
| New Model/Service | +0.2 | +0.1 (1-2 types) / +0.2 (3-5) / +0.3 (6+) | Type count from state |
| Cross-Feature Dependencies | +0.2 | +0.2 (unchanged) | Binary flag |
| Design Iterations | +0.15 per round | +0.10 (text) / +0.15 (layout) / +0.20 (interaction) / +0.25 (full redesign) per round | Iteration scope |
| Architectural Novelty | Not tracked | +0.2 | First-of-kind flag (no cache entry) |
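For concreteness, here is a minimal Python sketch of the v2 formula. The weights and factor bands mirror the two tables above; the function and parameter names (cu_v2, design_rounds, and so on) are illustrative, not the project's actual tooling.

```python
# Minimal sketch of the CU v2 formula. Weights and factor bands are taken
# from the tables above; names are illustrative, not the project's API.

WORK_TYPE_WEIGHTS = {
    "feature": 1.0,      # full lifecycle, maximum ceremony
    "refactor": 0.9,     # full lifecycle, v1 exists as reference
    "enhancement": 0.8,  # 4-phase lifecycle, parent PRD exists
    "fix": 0.5,          # 2-phase lifecycle, minimal planning
    "chore": 0.3,        # 1-phase, docs/config only
}

DESIGN_ROUND_WEIGHTS = {
    "text": 0.10, "layout": 0.15, "interaction": 0.20, "redesign": 0.25,
}

def ui_factor(views: int) -> float:
    """Banded v2 UI factor, keyed on view count."""
    if views == 0:
        return 0.0
    return 0.15 if views == 1 else 0.30 if views <= 3 else 0.45

def model_factor(new_types: int) -> float:
    """Banded v2 model/service factor, keyed on new type count."""
    if new_types == 0:
        return 0.0
    return 0.10 if new_types <= 2 else 0.20 if new_types <= 5 else 0.30

def cu_v2(tasks: int, work_type: str, *, views: int = 0, new_types: int = 0,
          auth: bool = False, runtime_testing: bool = False,
          cross_feature: bool = False, design_rounds: tuple = (),
          first_of_kind: bool = False) -> float:
    """CU = Tasks x Work_Type_Weight x (1 + sum(Complexity_Factors))."""
    factors = (
        ui_factor(views)
        + model_factor(new_types)
        + (0.5 if auth else 0.0)
        + (0.4 if runtime_testing else 0.0)
        + (0.2 if cross_feature else 0.0)
        + sum(DESIGN_ROUND_WEIGHTS[scope] for scope in design_rounds)
        + (0.2 if first_of_kind else 0.0)
    )
    return tasks * WORK_TYPE_WEIGHTS[work_type] * (1 + factors)
```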

Primary Metric: Minutes Per Complexity Unit (min/CU)

Velocity = Wall_Time_Minutes / CU

Lower is better. This is the single metric that enables cross-version comparison.
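As a worked example using the sketch above (illustrative only: the actual factor breakdown for Onboarding v2 is not published, but its 25.7 CU is consistent with a 22-task refactor carrying a 2-3 view UI factor):

```python
# Hypothetical reconstruction -- the real factor mix may differ.
cu = cu_v2(tasks=22, work_type="refactor", views=3)  # 22 * 0.9 * 1.30 = 25.74
velocity = 390 / cu                                  # 6.5h = 390 min -> ~15.2 min/CU
```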


All 16 Features, Normalized

| # | Feature | FW Ver | Type | Wall Time | Tasks | CU | min/CU | vs Baseline |
|---|---------|--------|------|-----------|-------|-----|--------|-------------|
| 1 | Onboarding v2 | v2.0 | refactor | 6.5h | 22 | 25.7 | 15.2 | Baseline |
| 2 | Home v2 | v3.0 | refactor | 36h* | 17 | 23.0 | 93.9 | *Outlier |
| 3 | Training v2 | v4.0 | refactor | 5h | 16 | 18.7 | 16.0 | -5% |
| 4 | Nutrition v2 | v4.1 | refactor | 2h | 14 | 16.4 | 7.3 | +52% |
| 5 | Stats v2 | v4.1 | refactor | 1.5h | 10 | 11.7 | 7.7 | +49% |
| 6 | Settings v2 | v4.1 | refactor | 1h | 6 | 7.0 | 8.6 | +43% |
| 7 | Readiness v2 | v4.2 | enhancement | 2.5h | 7 | 8.4 | 17.9 | -18% |
| 8 | AI Engine v2 | v4.2 | enhancement | 0.5h | 4 | 3.8 | 7.9 | +48% |
| 9 | AI Rec UI | v4.2 | feature | 0.7h | 6 | 7.8 | 5.4 | +64% |
| 10 | Profile | v4.4 | feature | 2h | 13 | 16.9 | 7.1 | +53% |
| 11 | AI Engine Arch | v5.1 | enhancement | 1.5h | 13 | 17.7 | 5.1 | +66% |
| 12 | Onboarding Auth | v5.1 | feature | ~1.7h | 18 | 47.7 | 2.1 | +86% |
| 13 | Parallel Stress Test | v5.1 | 4x feature | 54 min | 30 | 43.9 | 1.23 | +92% |
| 14 | Parallel Write Safety | v5.2 | feature | 20 min | 6 | 2.16 | 9.26 | +39% |
| 15 | Framework Measurement | v6.0 | feature | 1.5h | 20 | 28.0 | 3.21 | +79% |

*Home v2 excluded from trend analysis -- outlier that invented the v2 convention, spawned 3 sub-features, and integrated external tools for the first time.


Trend Analysis

By Framework Version

| FW Version | Features | Avg min/CU | vs Baseline | Interpretation |
|------------|----------|------------|-------------|----------------|
| v2.0 | 1 | 15.2 | Baseline | No cache, no skills, monolithic PM |
| v4.0 | 1 | 16.0 | -5% | Learning cost of cache system (expected regression) |
| v4.1 | 3 | 7.9 | +48% | Cache acceleration kicks in (40-70% hit rates) |
| v4.2 | 3 | 10.4 | +32% | Mixed -- includes Readiness (new model type learning tax) |
| v4.4 | 1 | 7.1 | +53% | Eval-driven development |
| v5.1 | 2 | 3.6 | +76% | SoC optimizations + deep pattern reuse |
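The per-version averages are plain means over the non-outlier rows. A dependency-free sketch of that aggregation (two versions shown, values copied from the feature table):

```python
from collections import defaultdict

# (FW version, min/CU) pairs copied from the feature table above.
rows = [
    ("v4.1", 7.3), ("v4.1", 7.7), ("v4.1", 8.6),    # Nutrition, Stats, Settings
    ("v4.2", 17.9), ("v4.2", 7.9), ("v4.2", 5.4),   # Readiness, AI Engine, AI Rec UI
]

by_version = defaultdict(list)
for version, min_per_cu in rows:
    by_version[version].append(min_per_cu)

for version, values in sorted(by_version.items()):
    print(f"{version}: {sum(values) / len(values):.1f} min/CU")  # v4.1: 7.9, v4.2: 10.4
```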

Power Law Fit

Velocity(N) = 15.2 x N^(-0.68), R-squared = 0.82 (v1 factors)
Velocity(N) = 15.2 x N^(-0.61), R-squared = 0.87 (v2 factors)

The -0.68 exponent indicates steep improvement that has not yet plateaued. For comparison: typical software learning curves show -0.3 to -0.5; manufacturing improvement shows -0.2 to -0.3.

CU v2 improves the fit from R-squared 0.82 to 0.87. The exponent drops to -0.61 -- slightly less steep but more consistent, with fewer features deviating from the trend.
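The fit itself is ordinary least squares in log-log space. A minimal sketch of the procedure, assuming min/CU is ordered by feature sequence N with Home v2 excluded; the published coefficients will only reproduce exactly under the author's own data handling, which is not fully specified here.

```python
import numpy as np

# min/CU by feature sequence, Home v2 (outlier) excluded; from the table above.
velocity = np.array([15.2, 16.0, 7.3, 7.7, 8.6, 17.9, 7.9,
                     5.4, 7.1, 5.1, 2.1, 1.23, 9.26, 3.21])
n = np.arange(1, len(velocity) + 1)

# Power law Velocity(N) = a * N^b is linear in log space: log v = log a + b*log N.
b, log_a = np.polyfit(np.log(n), np.log(velocity), 1)
residuals = np.log(velocity) - (log_a + b * np.log(n))
r_squared = 1 - residuals.var() / np.log(velocity).var()

print(f"Velocity(N) = {np.exp(log_a):.1f} * N^({b:.2f}), R^2 = {r_squared:.2f}")
```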

Regressions and Learning Taxes

Two non-outlier regressions are visible:

| Feature | FW Version | min/CU | vs Baseline | Attribution |
|---------|------------|--------|-------------|-------------|
| Training v2 | v4.0 | 16.0 | -5% | Cache-system learning overhead (first use) |
| Readiness v2 | v4.2 | 17.9 | -18% | First-of-kind model/service work |

Pattern: When the framework introduces a new structural capability, the next feature pays a measurable learning tax before gains appear. Under CU v2, the Readiness regression drops from -18% to -7% -- half the apparent regression was an artifact of binary factors failing to capture the feature's true complexity.


Confounders and Limitations

| Confounder | Impact | Mitigation |
|------------|--------|------------|
| Single practitioner | Cannot separate framework improvement from personal learning | Cache hit rate provides a proxy -- high cache % means the framework is learning, not just the human |
| Feature complexity varies | Addressed by CU normalization | Some factors (auth complexity, design iteration difficulty) remain subjective |
| Framework evolves between measurements | This IS the signal, not noise | Documented which version produced which result |
| Session continuity varies | Single-session features benefit from warm context | Noted in each case study |
| Task count is self-reported | Different features may count at different granularity | Mitigated by consistent methodology after v3.0 |

Key Takeaways

  • Full-lifecycle features (4.9 min/CU avg) outperform refactors (8.6 min/CU avg). This is counterintuitive -- new features should be harder. The explanation: refactors were early in the framework's evolution when the cache was cold and the workflow was immature. New features benefit from a mature cache.
  • The power law fit (R-squared = 0.87) explains most variance, but ~13% remains attributable to practitioner learning and feature-specific novelty.
  • CU v2 continuous factors retroactively explain regressions that binary factors could not. The Readiness regression was halved when view count and architectural novelty were modeled as continuous variables instead of binary flags.
  • N=16 is small for robust regression. All claims should be treated as directional signals, not definitive measures. Bootstrap confidence intervals should be used for any published benchmarks; a minimal sketch follows below.
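As a starting point for such intervals, here is a percentile-bootstrap sketch for the mean min/CU. It uses only the standard library; values are copied from the feature table with Home v2 excluded, and the function name bootstrap_ci is illustrative, not the project's tooling.

```python
import random

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_resamples)
    )
    return (means[int(alpha / 2 * n_resamples)],
            means[int((1 - alpha / 2) * n_resamples) - 1])

# min/CU values from the feature table, Home v2 (outlier) excluded.
min_per_cu = [15.2, 16.0, 7.3, 7.7, 8.6, 17.9, 7.9,
              5.4, 7.1, 5.1, 2.1, 1.23, 9.26, 3.21]
print(bootstrap_ci(min_per_cu))  # wide interval -- expected at this sample size
```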