Summary card · 60-second read

185 Findings, 12 Critical — What a Full-System Audit Revealed

Version: v6.1
Date: 2026-04-18
Tier: light

The same AI that built the framework audited its own work across 22 features and 6 framework versions, using a 4-layer methodology (parallel surface sweep → risk-weighted deep dive → historical cross-reference → external validation). 185 raw findings compressed to 20 thematic clusters and ultimately 3 root causes.

Honest disclosures
  • Audit is self-referential — same AI that built the framework. Bias implications are documented in the dedicated section of the case study.
  • Layer 4 external validation covers 21 of 185 findings; the remaining 164 carry only AI-author signal. External replication remains a Tier 3.3 backlog item.
  • Two findings are regressions of prior 'fixed' issues — session-restore timeout (incomplete fix) and color-token migration (66 deprecated calls remained).
How to read this case study
T1/T2/T3 · ledger · kill criterion

T1 · Instrumented
Numbers come from a machine-generated ledger or commit. Reproducible. Highest reader trust.
T2 · Declared
Numbers stated by a structured declaration (PRD, plan, frontmatter) but not directly measured.
T3 · Narrative
Estimates and observations from session memory. Useful for context; not citable as evidence.
Ledger
Where to verify the claim — a file path, GitHub issue, or backlog entry. Anything labelled ledger: is the audit trail.
Kill criterion
The pre-registered threshold under which this work would have been killed mid-flight. Not fired = work shipped without hitting the threshold.
Deferred
Items intentionally not closed in this version. Each cites the ledger that tracks remaining work.
185 raw findings → 20 clusters → 3 root causes
The 4-layer methodology compressed 185 raw findings into 20 thematic clusters and ultimately 3 root causes.

185 Findings, 12 Critical — What a Full-System Audit Revealed

What happens when the AI system that built the framework audits its own work -- and how honest can a self-referential audit be?


Context

After shipping 22 features across 6 framework versions, FitMe had accumulated substantial technical debt that individual feature-level audits could not see. This full-system audit used a 4-layer methodology -- surface sweep, risk-weighted deep dive, historical cross-reference, and external validation -- to assess the entire codebase. The audit was performed by the same AI system that built the framework. The implications of that self-referential structure are documented honestly in the bias section below.


The 4-Layer Methodology

Layer 1 -- Surface Sweep (parallel domain agents). Six independent agents ran concurrently, each constrained to one domain (Backend, Tests, AI, UI, Design System, Framework). Domain boundaries prevented cross-domain bias in the initial pass. 170 findings surfaced.
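
A minimal sketch of the fan-out shape, assuming hypothetical Finding and Domain types (the real Layer 1 orchestration is an AI-agent workflow, not application code). The point it illustrates is ordering: every domain pass completes in isolation before synthesis ever sees the combined list.

    // Hypothetical sketch of the Layer 1 fan-out: one pass per domain, run
    // concurrently so no domain's findings can anchor another's initial sweep.
    struct Finding: Sendable { let domain: String; let summary: String }

    enum Domain: String, CaseIterable {
        case backend = "Backend", tests = "Tests", ai = "AI"
        case ui = "UI", designSystem = "Design System", framework = "Framework"
    }

    // Placeholder for a domain-constrained surface sweep.
    func surfaceSweep(of domain: Domain) async -> [Finding] {
        // ...domain-scoped analysis would run here...
        return []
    }

    func runLayer1() async -> [Finding] {
        await withTaskGroup(of: [Finding].self) { group in
            for domain in Domain.allCases {
                group.addTask { await surfaceSweep(of: domain) }
            }
            var all: [Finding] = []
            for await findings in group { all.append(contentsOf: findings) }
            return all   // clustering and synthesis happen only after every pass finishes
        }
    }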

Layer 2 -- Risk-Weighted Deep Dive. Targeted the two highest-risk clusters from Layer 1: the authentication/encryption complex and the dual-sync architecture. Root-cause analysis yielded 15 additional findings that explained 31 Layer 1 findings as symptoms of deeper issues.

Layer 3 -- Historical Cross-Reference. Sampled 5 prior case studies to classify findings as recurring (9 matches), regression (2 matches), predicted (4 matches from prior case studies that had explicitly called out gaps), or new category (5 never-seen patterns).

Layer 4 -- External Validation. 21 findings confirmed by objective, non-AI signals: build success, test suite results (231 pass / 0 fail), file line counts, and token budget measurements. These carry the highest credibility because the validation is independent of the AI that generated them.


What the Audit Found

Domain             Findings  Critical  High  Medium  Low  Info
Backend            60        2         18    22      13   5
Tests              30        5         8     15      2    0
AI                 20        0         1     9       8    2
UI                 25        0         5     9       7    4
Design System      15        0         0     12      3    0
Framework          20        0         5     8       4    3
Layer 2 additions  15        5         1     2       1    5
Total              185       12        49    90      25   9

The Three Root Causes

Discovery 1: Fabrication-Over-Nil Pattern (AI Domain). When adapter methods use non-optional return types, absent source data gets replaced with plausible-sounding fabricated values instead of triggering the "insufficient data" fallback path. Gender identity hardcoded to a default. Training frequency returning enum case count instead of user schedule. Stress level returning "moderate" when no log exists. Impact: every user with a completed profile receives AI recommendations computed from partially fabricated inputs.
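
A minimal sketch of the pattern, using a hypothetical stress-level adapter (the real adapter code is not quoted here). The non-optional signature forces a fabricated default; an optional return lets the caller reach the "insufficient data" path instead.

    // Fabrication-over-nil: a non-optional return type has to produce *something*.
    struct StressAdapter {
        let loggedStressLevel: String?   // nil when the user has never logged stress

        // Before: absent data is silently replaced with a plausible default,
        // and the fabricated value flows into the AI recommendation input.
        func stressLevel() -> String {
            loggedStressLevel ?? "moderate"
        }

        // After: absent data stays absent, so the caller can branch explicitly.
        func stressLevelIfKnown() -> String? {
            loggedStressLevel
        }
    }

    func recommendationInput(from adapter: StressAdapter) -> String {
        guard let stress = adapter.stressLevelIfKnown() else {
            return "insufficient data"   // the fallback path that fabrication bypasses
        }
        return "stress: \(stress)"
    }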

Discovery 2: Dual-Sync Race Condition (Backend). Two sync services (CloudKit and Supabase) both pull on login with no merge coordination. The last one to finish overwrites the other's result non-deterministically. Additionally, the pull timestamp advances even when rows fail decryption, permanently skipping failed records on subsequent syncs.
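
The cursor half of this finding is easy to show in isolation. A hedged sketch with hypothetical names (it does not address the missing merge coordination): the pull cursor must only advance past rows that were actually decrypted and applied.

    import Foundation

    struct EncryptedRow { let modifiedAt: Date; let payload: Data }

    // Returns the new pull cursor. The buggy version advanced the cursor past rows
    // that failed decryption, so those rows were never retried on later syncs.
    func pullChanges(since cursor: Date,
                     rows: [EncryptedRow],
                     decrypt: (Data) throws -> Data) -> Date {
        var newCursor = cursor
        for row in rows.sorted(by: { $0.modifiedAt < $1.modifiedAt })
        where row.modifiedAt > cursor {
            guard (try? decrypt(row.payload)) != nil else {
                return newCursor   // stop here so the failed row is retried on the next pull
            }
            newCursor = row.modifiedAt   // advance only past rows that were applied
        }
        return newCursor
    }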

Discovery 3: Authentication Security Gaps (Backend). A review-mode session has no debug fence -- any production binary receiving a specific environment variable authenticates as a developer identity. Passkey assertions are used directly as session tokens without server-side verification. A session restore timeout fix from a prior case study introduced a regression (missing cancellation in the nil branch creates a crash risk).
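
The debug-fence gap is the simplest of the three to illustrate. A minimal sketch with a hypothetical environment-variable name (the real variable is deliberately not reproduced): the check must be excluded from release builds at compile time, not merely gated at runtime.

    import Foundation

    struct Session { enum Role { case developer, user }; let role: Role }

    func reviewModeSession() -> Session? {
        #if DEBUG
        // Only debug builds ever honor the variable; release binaries compile this out.
        if ProcessInfo.processInfo.environment["REVIEW_MODE"] == "1" {   // hypothetical name
            return Session(role: .developer)
        }
        #endif
        return nil   // the audited code lacked the fence, so production builds accepted it too
    }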

Regressions from Prior "Fixes"

Two findings were regressions of previously "fixed" issues:

  • The session restore timeout fix was incomplete -- the nil branch never cancels the timeout, creating a double-resume crash risk (sketched after this list)
  • A token migration claimed complete in a prior case study still had 66 deprecated color calls active
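
A hedged sketch of the shape of that first regression, with hypothetical names. The specific gap is one branch resuming the continuation without cancelling the timeout; a production-grade fix would also need a single-resume guard so the two tasks can never both resume.

    struct Session { let userID: String }

    func loadPersistedSession() async -> Session? { nil }   // hypothetical loader

    func restoreSession() async -> Session? {
        await withCheckedContinuation { continuation in
            // Timeout path: resumes with nil if restore takes too long.
            let timeout = Task {
                try? await Task.sleep(nanoseconds: 5_000_000_000)
                guard !Task.isCancelled else { return }
                continuation.resume(returning: nil)
            }
            // Restore path: must cancel the timeout before resuming -- in every branch.
            Task {
                let session = await loadPersistedSession()
                timeout.cancel()   // the incomplete fix skipped this when session was nil,
                                   // leaving the timeout free to resume a second time (crash)
                continuation.resume(returning: session)
            }
        }
    }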

The Framework's Own Measurement Was Broken

The self-audit of the framework infrastructure revealed:

  • All cache index files had empty arrays -- cache discovery was non-functional
  • The cache metrics aggregate had all-zero values despite 8+ completed features with documented cache hits in case study narratives
  • Phase 0 health check had never fired
  • Version numbers were inconsistent across config files

This means all published cache performance claims in prior case studies were narrative, not measured. The cache was working (velocity improvements are real), but the measurement infrastructure for the cache was not.
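
A minimal sketch of the kind of Phase 0 health check that would have caught this, assuming a hypothetical JSON cache-index layout (the framework's real file format is not quoted here). The design choice it encodes: a metrics file that exists but was never populated is an error, not a zero.

    import Foundation

    struct CacheIndex: Decodable { let entries: [String] }   // hypothetical index shape

    enum HealthCheckError: Error { case unreadable(URL), neverPopulated(URL) }

    func verifyCacheIndexes(at urls: [URL]) throws {
        for url in urls {
            guard let data = try? Data(contentsOf: url),
                  let index = try? JSONDecoder().decode(CacheIndex.self, from: data) else {
                throw HealthCheckError.unreadable(url)
            }
            guard !index.entries.isEmpty else {
                throw HealthCheckError.neverPopulated(url)   // fail loudly instead of reporting zeros
            }
        }
    }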


System Health Scorecard

Domain         Score (0-100)  Grade
AI             0              F
Backend        0              F
Tests          0              F
UI             9              F
Framework      42             D
Design System  46             D
Overall        16.2           F

The scorecard weights severity heavily -- critical and high findings dominate. A domain needs zero critical/high findings and few medium findings to score above 50. This is intentionally harsh. The point is not to produce a flattering number but to create a clear priority list.
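
The weights themselves are not published in this case study; the sketch below uses made-up numbers purely to show how a severity-heavy formula floors a domain's score once critical and high findings appear.

    // Hypothetical weights -- not the framework's actual scoring formula.
    struct SeverityCounts { var critical = 0, high = 0, medium = 0, low = 0, info = 0 }

    func domainScore(_ c: SeverityCounts) -> Double {
        let penalty = Double(c.critical) * 40 + Double(c.high) * 15 +
                      Double(c.medium) * 4 + Double(c.low) * 1 + Double(c.info) * 0.5
        return max(0, 100 - penalty)
    }

    // With weights like these, Backend's 2 critical + 18 high findings alone
    // exhaust the 100-point budget, illustrating how the grades cluster at F.
    let backend = SeverityCounts(critical: 2, high: 18, medium: 22, low: 13, info: 5)
    print(domainScore(backend))   // 0.0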

What IS working well: Build passes cleanly across 143 Swift files. 231 tests run with 0 failures. Design system token coverage for v2-aligned screens is good. The AI three-tier architecture is correctly structured at the orchestration level. The hardware-aware dispatch infrastructure is complete and intact.


The Self-Referential Bias Report

Validation Distribution

Type                Count  Percentage  What It Means
External-Automated  21     11.4%       Confirmed by build, test, or measurement -- independent of AI
Cross-Referenced    18     9.7%        Confirmed against prior documented findings
Framework-Only      146    78.9%       AI assertion from code reading -- plausible but unverified

What This Audit Can Detect

  • Code patterns that are structurally wrong (non-optional return types where nil is meaningful)
  • Security configurations that are incorrect by definition (hardcoded tokens without debug fences)
  • Coverage gaps (files with zero test references)
  • Configuration consistency across files

What This Audit Cannot Detect

  • Runtime behavior. The audit never ran the app. Bugs requiring specific state sequences are described from code analysis but not verified.
  • Performance. Zero profiling data.
  • Concurrency. Thread safety limited to async/await pattern reading. Actor reentrancy and deadlocks require runtime analysis.
  • UI/UX quality. Compliance scores inferred from code structure, not from observing users.
  • Security exploitability. Findings identified from code, but actual exploitability depends on runtime conditions.

The Core Credibility Problem

This audit was performed by the same AI system that built the framework. Three credibility problems follow:

  1. Blind-spot alignment. The audit finds issues in categories the AI understands (code patterns, token compliance, test coverage) and misses categories it cannot observe (runtime behavior, real user experience, security exploitability).
  2. Self-critical performance. The framework findings are the most credible section because they are genuinely self-critical. But acknowledging limitations does not make those limitations less real.
  3. Every finding was an assertion when written. Cross-reference and external validation were applied post-hoc, leaving 78.9% still as assertions.

Treat this audit as a prioritized list of plausible issues, not ground truth. The critical findings warrant investigation precisely because they are plausible -- not because they are verified.


Key Takeaways

  • Parallel domain agents prevent cross-domain anchoring. Running 6 independent agents before synthesis meant Backend findings did not influence what the Tests agent looked for. The domain boundaries held.
  • Risk-weighted deep diving multiplied value. 15 Layer 2 findings explained 31 Layer 1 findings as symptoms of root causes. Fixing one root cause (like making adapter return types optional) addresses multiple surface findings simultaneously.
  • Historical cross-reference caught 2 regressions that appeared to be completed fixes. Both would have been invisible without checking prior case study claims against current code.
  • The most honest thing a self-referential audit can do is document what it cannot see. The bias report is not a disclaimer -- it is a scope definition. The 185 findings are the floor of the defect surface, not the ceiling.
  • Measurement infrastructure that is defined but never populated is worse than no measurement infrastructure -- it creates false confidence. Every published cache metric was narrative, not measured. The infrastructure existed but never fired.