fitme·story

How this site stays honest

Every case study and line of code on this site is periodically reviewed by an independent external AI model — a neutral second pass that checks accuracy, honesty, and methodology. No curator-written spin; an outside model reads the same artifacts you do.

What the audit checks for

  • Factual accuracy of every number in the case studies — commits, tests, metrics, timing claims
  • Honest representation of failures alongside successes — nothing quietly dropped
  • No cherry-picked data in comparisons
  • No silent edits to historical claims (git history is preserved and publicly inspectable)
  • No secrets or private data inadvertently published

Which models audit the site

Audits are performed by models from different vendors than the one that writes the site. The goal is independence — a model that had no hand in producing the artifacts it reviews.

  • Google Gemini 2.5 Pro — inaugural external auditor, first pass on 2026-04-21.

Additional auditors (from other vendors and model families) will be added as they run. Each audit is archived verbatim with the date it was performed and the model that performed it. Nothing is silently re-audited or retracted.

Audit cadence

  • On-demand after each framework version bump or major case-study batch.
  • Minimum quarterly even if no version bump has occurred, to catch drift.
  • Immediately if an internal structural check (the 72h integrity cycle) flags a regression that is not resolvable in-repo.

Latest audit results

2026-04-21 · Google Gemini 2.5 Pro · follow-up progress 2026-04-24

Mixed: methodologically strong, empirically weak on pre-v6.0 quantitative claims.

  • Scope: 24 showcase + 41 main-repo case studies + 3 internal meta-analyses.
  • Remediation progress (2026-04-24): 7 of 9 Tier 1/2/3 items fully shipped, 2 partial/pilot, 1 external-blocked. Remaining: real-provider auth verification (manual 7-step playbook), Tier 2.2 process adoption, Tier 3.2 trend after 3 integrity cycles, and the Tier 1.1 adoption gap (see “New measurement gap” below).
  • Well-supported: process documentation; internal arithmetic; post-v6.0 instrumentation; honest failure reporting.
  • Weak / uncertain: pre-v6.0 quantitative claims; causality of speedup; power-law predictive power; runtime correctness.
  • Corrections (same-day): the initial “3 broken PR citations” finding was a false positive propagated from the input meta-analysis; all three were GitHub issues, not PRs. The correction was appended to the audit, and issue #138 was closed with a full explanation.
  • Honest downgrade (2026-04-23): Tier 1.1 (automated time/event metrics) was initially marked “done” on 2026-04-21; the 2026-04-23 status pass downgraded it to “partial” because system-wide adoption is still incomplete. We published the downgrade rather than leaving the overstatement in place.
  • Tier advancement (2026-04-24): Tier 1.2 (integrate with sources of truth) was promoted from partial to fully shipped; a pre-commit hook now verifies that the phases.merge.pr_number field resolves on GitHub at write time, not just at audit time (a sketch of this kind of check follows the list). Tier 2.2 (contemporaneous logging) expanded from pilot to 5 live logs, with fresh scaffolds for app-store-assets, import-training-plan, and push-notifications.
  • New measurement gap (2026-04-24): a fresh integrity run found cache_hits at 0 of 40 across the feature corpus. The v6.0 measurement protocol defined the field, but no feature session actually writes to it; this is distinct from the Tier 1.1 adoption gap. Filed as issue #140 rather than silently fixed; measurable via make measurement-adoption (second sketch below).
  • Framework evolution (2026-04-24): Framework v7.1 → v7.5 (“Data Integrity Framework”) shipped 2026-04-24 as a direct extension of the Gemini audit remediation, a same-direction response to “our own measurement was the weakest link.” Details in the audit archive.
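
For concreteness, the sketch below shows the kind of write-time check the Tier 1.2 hook performs. Every specific in it is an assumption: the OWNER/REPO slug, the YAML file pattern, and the regex extraction are hypothetical stand-ins, since the real hook's internals are not published here. Only the idea comes from the record above: refuse the commit when a cited PR number does not resolve on GitHub.

```python
"""Hypothetical sketch of a Tier 1.2-style pre-commit check (not the real hook)."""
import re
import subprocess
import sys
import urllib.error
import urllib.request

REPO = "OWNER/REPO"  # hypothetical; the repository the case studies cite

def staged_files() -> list[str]:
    # Ask git for the files staged in this commit.
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [f for f in out.splitlines() if f.endswith((".yml", ".yaml"))]

def pr_exists(number: int) -> bool:
    # Resolve the PR against the GitHub API; 404 means a broken citation.
    url = f"https://api.github.com/repos/{REPO}/pulls/{number}"
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False

def main() -> int:
    broken = []
    for path in staged_files():
        text = open(path, encoding="utf-8").read()
        # Naive extraction of pr_number fields; a real hook would parse the YAML.
        for match in re.finditer(r"pr_number:\s*(\d+)", text):
            number = int(match.group(1))
            if not pr_exists(number):
                broken.append((path, number))
    for path, number in broken:
        print(f"{path}: PR #{number} does not resolve on github.com/{REPO}")
    return 1 if broken else 0  # a nonzero exit blocks the commit

if __name__ == "__main__":
    sys.exit(main())
```

Running the check at write time rather than audit time is what moved this item from partial to fully shipped.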

Read the full audit →
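
The second sketch shows what a corpus-wide adoption count in the spirit of make measurement-adoption could look like. The layout (features/*/session.json) and schema are hypothetical assumptions, not the real target's internals; what it illustrates is the finding itself: a field the protocol defines but that zero sessions actually write.

```python
"""Hypothetical sketch of an adoption count like `make measurement-adoption`."""
import json
from pathlib import Path

def measurement_adoption(corpus_root: str, field: str = "cache_hits") -> tuple[int, int]:
    # Assumed layout: one session.json per feature directory (hypothetical).
    sessions = sorted(Path(corpus_root).glob("*/session.json"))
    populated = 0
    for path in sessions:
        data = json.loads(path.read_text(encoding="utf-8"))
        # A field counts as adopted only if a session wrote a value to it;
        # being defined in the measurement protocol is not enough.
        if data.get(field) is not None:
            populated += 1
    return populated, len(sessions)

if __name__ == "__main__":
    hits, total = measurement_adoption("features")
    # The 2026-04-24 integrity run reported 0 of 40 for cache_hits (issue #140).
    print(f"cache_hits: {hits} of {total} sessions populated")
```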

Framework advancement — what the audit could and couldn't verify

The chart below plots each significant feature against the framework version it shipped under. Solid dots are T1 instrumented data — the post-v6.0 era Gemini endorsed as reliable. Dashed outlines are pre-v6.0 T3 narrative estimates, which the audit flagged as unreliable and which are shown here for context rather than as trend evidence. The trend line only connects T1 points.

Legend: T1 · instrumented (audit-trusted); T3 · narrative estimate (pre-v6.0).

Framework advancement plotted on the axes the Gemini audit endorsed: wall clock in minutes (Y, log) across framework versions (X). Solid dots = T1 instrumented (audit-trusted); dashed outlines = T3 narrative estimates from the pre-v6.0 era the audit flagged as unreliable. The trend line connects only T1 points; its sparsity is the honest shape of the audit-validated window. Each point is labeled with CU (complexity units) for reference. Measurement adoption gap tracked at issue #140.
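
As an illustration of this charting convention, here is a hedged matplotlib sketch: solid markers for T1 instrumented points, open markers standing in for the dashed T3 outlines, a log Y axis, and a trend line through T1 points only. All data values are placeholders, not the site's measurements.

```python
# Illustrative only: placeholder data, NOT the site's real measurements.
# Open gray markers stand in for the page's "dashed outline" T3 style,
# which plain matplotlib markers cannot render exactly.
import matplotlib.pyplot as plt

# (framework_version, wall_clock_minutes, complexity_units, tier) -- hypothetical
points = [
    (4.0, 540, 18, "T3"),   # pre-v6.0 narrative estimate
    (5.0, 300, 22, "T3"),
    (6.0, 150, 20, "T1"),   # post-v6.0 instrumented
    (6.5,  95, 24, "T1"),
    (7.0,  60, 21, "T1"),
]

fig, ax = plt.subplots()
for version, minutes, cu, tier in points:
    if tier == "T1":
        ax.plot(version, minutes, "o", color="black")   # solid, audit-trusted
    else:
        ax.plot(version, minutes, "o", markerfacecolor="none",
                markeredgecolor="gray")                 # context, not evidence
    ax.annotate(f"{cu} CU", (version, minutes),
                textcoords="offset points", xytext=(5, 5))

# The trend line connects T1 points only; T3 estimates never feed the trend.
t1 = sorted((v, m) for v, m, _, tier in points if tier == "T1")
ax.plot([v for v, _ in t1], [m for _, m in t1], "-", color="black")

ax.set_yscale("log")                        # Y: wall clock in minutes, log scale
ax.set_xlabel("Framework version")
ax.set_ylabel("Wall clock (minutes, log)")
plt.show()
```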

How we act on audit findings

Findings are not silently fixed. When an audit surfaces a broken citation, a methodological flaw, or a data inconsistency, we:

  1. Publish the audit verbatim — even before corrections are made, so the raw finding is visible.
  2. File a tracked issue on the underlying code repository (e.g., FitTracker2) with the remediation plan.
  3. Append corrections, don't overwrite. The audit archive and its git history show what was found, what was done, and when.
  4. Harden the process so the same class of finding can't recur (e.g., the 72h integrity cycle and the “Auditor Agent” recommended by Gemini).

Everything on this site and its underlying code is open for inspection at github.com/Regevba/fitme-story and github.com/Regevba/fitme-showcase.