Independent audit

Google Gemini 2.5 Pro — 2026-04-21

Auditor: Google Gemini 2.5 Pro
Audit date: 2026-04-21
Corpus: 24 showcase + 41 main-repo case studies + 3 internal meta-analyses
Independence: different model family, different vendor, artifact-only access

Correction (same-day): the “3 broken PR citations” finding was a false positive.

Gemini was supplied a meta-analysis that had flagged #51, #69, #70 as non-existent PRs. Same-day verification shows all three are real GitHub issues, not PRs. The structural meta-analysis used a liberal #\d+ regex and conflated issue citations with PR citations; Gemini faithfully repeated the error because it was given the flawed meta-analysis as input. See §10 below for the full correction. Issue #138 has been closed with the full explanation. The original Gemini text in §§ 1–9 below is preserved unchanged — this site publishes audits verbatim and appends corrections rather than silently editing.

Full audit source (independent-audit-2026-04-21-gemini.md)

1. The prompt used

The following instructions were given to Gemini verbatim, twice (once for the case-by-case and initial meta-analysis review, once for the second-layer meta-analysis review):

You are an independent, skeptical validator of a set of case studies and associated meta-analysis. You must stay neutral, evidence-based, and avoid inventing facts.

I have provided: (1) a dossier prepared by Claude that summarizes each case study, the framework and principles, metrics and methodology, per-case evidence, cross-case meta-analysis, and data-quality/validation gaps; (2) additional meta-analysis files or text I created on the entire dataset.

Your task: (1) validate each case study independently; (2) validate the meta-analysis (full dataset, consistent definitions, reasonable statistics, justified conclusions); (3) cross-check Claude's dossier against the meta-analysis.

Output structure: executive summary; case-by-case validation notes; meta-analysis evaluation (methodology, justification, gaps); overall verdict (well-supported vs weak; additional checks).

Rules: skeptical but fair (challenge claims, not the author); specific and evidence-based; do not invent data or interpretations; state explicitly when unclear; neutral and professional tone.

2. Executive summary

The dossier presents a comprehensive and unusually transparent summary of a longitudinal dataset tracking an AI-assisted PM framework. The overall strength of the evidence is mixed and splits cleanly into two categories:

Methodologically strong: The project's commitment to documenting its process, including its failures, regressions, and measurement gaps, is its greatest strength. The evolution of the measurement framework from estimated (pre-v6.0) to instrumented (post-v6.0) is well-documented. The internal arithmetic of the primary metric (min/CU) appears consistent based on the provided formulas and examples.

Empirically Weak: The majority of the quantitative performance claims, especially those pre-dating framework v6.0, are based on estimates, narrative inference, and proxies with disclosed, significant error margins. The dataset is from a single practitioner, making it impossible to disentangle framework improvements from individual learning. The most significant validation efforts (the S13 audit and the D3 analysis) are flagged as potentially self-referential.

Conclusion: The dataset is highly reliable as a qualitative record of a methodology's evolution and a case study in rigorous self-assessment. It is unreliable for making precise, predictive quantitative claims about AI-assisted development velocity. The dossier itself does an excellent job of highlighting these limitations, and my validation largely confirms its own stated caveats.

3. Case-by-case validation notes

Gemini's strength rating and issue notes for each case study. Ratings are Gemini's verbatim classification, not the site curator's.

ID · Case · Strength — Issues & gaps
S01 · Onboarding Pilot · Weak — 15.2 min/CU baseline arithmetically consistent but a single retroactive data point based on estimated wall time. Fragile anchor for trend.
S02 · 6 Refactors · Moderate — 6.5x speedup is plausible. Home v2 outlier exclusion is significant. Practitioner learning vs framework improvement is unresolved.
S03 · Eval-Driven Dev · Moderate — 100% pass rate supported. Eval coverage density not normalized — pass rate could be on a trivial subset of logic.
S04 · User Profile · Weak — “Highest recorded velocity” claim used a proxy metric (files/hour), not min/CU. CU derived later.
S05 · SoC-on-Software · Weak — 63% overhead reduction based on pre-v6.0 token proxy. Before/after figures may not be comparable. Precision not supported by measurement quality.
S06 · Auth Flow · Moderate — 2.1 min/CU sound. Unplanned work not captured in CU denominator — weakness in pre-v2 CU model.
S07 · AI Engine Arch · Weak — 45% cache hit rate invalidated by S11/S13 disclosures. “0 regressions” only supported by existing test suite; no new tests added.
S08 · Parallel Stress Test · Moderate — 1.23 min/CU internally consistent. Conflict-free outcome was “partially luck-dependent”; 52% of dispatches blocked.
S09 · Dispatch Intelligence · Weak — 80% accuracy claim based on 4/5 tasks — not statistically significant. Two critical bugs caught by manual review gate.
S10 · Parallel Write Safety · Very Weak — “No runtime stress test yet.” Proposal, not validation.
S11 · Framework Measurement v6 · Strong — Honest disclosure invalidating prior cache-hit-rate claims. New baseline for data quality.
S12 · V7.0 HADF · Moderate — Shipped disabled (enabled=false). 5 open questions on real-world performance. Design claims, not validated effect.
S13 · Full-System Audit · Strong (as hypotheses) — Zero-coverage greps consistent. 78.9% of findings are unverified “framework-only assertions.” Validates limits of self-assessment, not code quality.
S14–S24 · Deep dives · Moderate–Strong — S15, S22, S23 strong on runtime gap vs static analysis. S20 provides the only replication-style result. Honest “shipped-without-a-door” reporting lends credibility.

4. Meta-analysis evaluation

4.1 Methodology

Use of Full Dataset: The analysis uses the full dataset, correctly identifying and justifying the exclusion of one outlier (Home v2). It also correctly includes and discusses regressions (Training v2, Readiness v2), which avoids cherry-picking.

Consistency: There is a major inconsistency in data quality between pre- and post-v6.0 measurements. The meta-analysis acknowledges this, but any trendline drawn across this boundary (like the power-law fit) is inherently suspect.

Reasoning/Statistics: The power-law fit on N=12 data points is a significant over-extrapolation. R²=0.82–0.87 with small N and estimated inputs makes this fit illustrative at best, not predictive. The serial/parallel decomposition is a sound analytical technique.
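For context on the criticized technique: a power law v = a·n^b is usually fit by linear regression in log-log space, so with N=12 a couple of noisy, estimated points can move the exponent substantially. The sketch below uses synthetic data only — the 15.2 anchor and the −0.8 exponent are hypothetical stand-ins, not the project's measurements:

```python
# Sketch of a log-log power-law fit; the data here is synthetic and only
# illustrates why N=12 noisy points make such a fit fragile.
import numpy as np

np.random.seed(0)
case_index = np.arange(1, 13)                       # N=12, as in the dossier
min_per_cu = 15.2 * case_index ** -0.8              # hypothetical trend...
min_per_cu *= np.exp(np.random.normal(0, 0.3, 12))  # ...plus estimate noise

# Fit min/CU = a * index^b  <=>  log(min/CU) = log(a) + b * log(index)
b, log_a = np.polyfit(np.log(case_index), np.log(min_per_cu), 1)
pred = np.exp(log_a) * case_index ** b
ss_res = np.sum((np.log(min_per_cu) - np.log(pred)) ** 2)
ss_tot = np.sum((np.log(min_per_cu) - np.log(min_per_cu).mean()) ** 2)
print(f"exponent b = {b:.2f}, R^2 (log space) = {1 - ss_res / ss_tot:.2f}")
```

Rerunning this with a different noise seed shifts the fitted exponent noticeably, which is the substance of the "illustrative, not predictive" verdict.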

4.2 Conclusion justification

The primary conclusion — that velocity improved over time — is directionally supported. The magnitude (12.4×) is built on a baseline with high uncertainty and a peak from a “luck-dependent” stress test. The serial-velocity plateau hypothesis (4–5 min/CU) is reasonable but remains weak on such a small N. The dossier's own framing — “validation value hinges less on accepting individual velocity numbers and more on whether the methodology, arithmetic, and caveats hold up” — is sound.

4.3 Gaps and risks

Confirmation bias is the most significant risk: the entire dataset — code, framework, case studies, audit, and validation document (D3) — was generated by the same entity.

Provenance of D3 (meta-analysis-validation.md) is unconfirmed. Without confirmation of its independence, it must be treated as a rigorous internal review, not an external audit.

CU model may have been fitted post-hoc to the data. If the model was tuned to make the trend look good, the entire quantitative analysis is compromised.

5. Overall validation verdict

Well-supported

  • Process documentation — iterative improvement, measurement, self-correction
  • Internal arithmetic — CU formula and min/CU recalculable from inputs
  • Methodological flaw disclosures — the honest failure reporting in S11, S13, and S22 is the most credible part of the dataset
  • Post-v6.0 instrumentation — more trustworthy due to deterministic tooling

Weak or uncertain

  • Pre-v6.0 quantitative claims — velocity numbers, cache-hit rates, token counts are directional estimates only. 15.2 min/CU baseline particularly uncertain.
  • Causality of speedup — cannot rule out learning effects with a single practitioner
  • Predictive power — power-law fit has little to no predictive validity on this sample
  • Runtime correctness — almost no validated evidence. “Shipped-without-a-door” cases (S22, S23) are strong evidence against assuming runtime correctness from static analysis.

Additional checks Gemini recommends

  • Provenance of D3 — clarify identity and independence of the validation-document reviewer
  • CU model timing — confirm whether CU weights were defined before or after data collection
  • Minimal external replication — even one case study run by a separate independent entity would check self-referential bias
  • Runtime spot-checks — smoke tests on features flagged as problematic (e.g., Push Notifications S22)
  • Re-fit the trend with post-v6.0 data only as it becomes available

6. Second-layer review — Gemini's evaluation of the structural meta-analysis

A follow-up turn supplied Gemini with meta-analysis-2026-04-21.md (the structural audit of 41 main-repo case-study files) and asked it to evaluate the file as an additional evidence layer.

What is new

Focus on structural integrity, not performance; a larger corpus (41 source files vs 24 showcase files); mechanically verifiable checks (word counts, schema checks, gh pr view calls). Key new findings:

  • 6 stub files that are near-duplicates
  • 3 cited PR numbers that do not exist on GitHub
  • 63.4% of cases don't state their dispatch pattern
  • state.json schema inconsistencies
  • 38 of 39 dated files written in a three-week April 2026 window

Consistency with case studies and the dossier

The full dataset is used. Identifying and treating the 6 stub files separately is a methodologically sound way to avoid skewing results. The analysis is transparent about its own limitations (“extraction is lexical”). The scope mismatch with the dossier is explained: the dossier uses a curated subset (24 showcase files), while this analysis uses the larger, messier source corpus (41 files). The PR check finds that PR #51 (cited in S01's source) and PRs #69 and #70 (cited in S19's source) are non-existent, which directly weakens those specific case-study claims. The 95%-in-three-weeks finding reframes the corpus from “lab notebooks” to “historical reconstructions.”

Final judgment on the second meta-analysis

This second meta-analysis significantly strengthens the overall validation process while simultaneously weakening the blind acceptance of the original claims.

It strengthens the validation by providing a new, objective, reproducible layer based on structural integrity; moving the conversation from “are the conclusions logical?” to “is the underlying data reliable?”; discovering concrete data-quality issues; and giving a clearer picture of the documentation's retroactive nature.

It weakens the claims by demonstrating factual errors in the evidence cited for at least two case studies (S01, S19); revealing that the documentation process is less systematic than the polished showcase files suggest; and confirming that governance metrics (like kill criteria) are not consistently documented even in the source files.

Overall confidence in any specific quantitative claim from a pre-v6.0 case study should now be considered low until specific evidentiary pointers (like PR numbers) have been manually verified. Confidence in the project's commitment to eventual transparency remains high — this audit itself is part of the provided evidence.

7. Gemini's remediation recommendations — Tier 1/2/3

Tier 1 — Foundational instrumentation & tooling

  1. Automate all time and event-based metrics. Instrument the PM framework itself; duration = timestamp_exit − timestamp_enter.
  2. Integrate directly with sources of truth. Use the GitHub API to link feature IDs to PR IDs and status. Prevents data integrity errors like the non-existent PRs.
  3. Enforce schema on write. Pre-commit hook or CI check that validates all state.json files. (Items 2 and 3 are sketched below.)
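A minimal sketch of how items 2 and 3 could combine in a single pre-commit hook, assuming hypothetical state.json field names (feature_id, phase, pr) and reusing the check-code strings that later appear in v7.5 — the real schema and scripts/integrity-check.py may differ:

```python
# Sketch of a write-time gate combining Tier 1.2 (PR linkage via the GitHub
# CLI) and Tier 1.3 (schema-on-write). Field names are hypothetical.
import json
import subprocess
import sys

REQUIRED_KEYS = {"feature_id", "phase"}  # assumed minimal schema

def validate_state_file(path: str) -> list[str]:
    """Return findings for one state.json file; an empty list means clean."""
    try:
        with open(path) as f:
            state = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"SCHEMA_DRIFT: {path} unreadable or invalid JSON ({exc})"]

    findings = []
    missing = REQUIRED_KEYS - state.keys()
    if missing:
        findings.append(f"SCHEMA_DRIFT: {path} missing keys {sorted(missing)}")

    pr = state.get("pr")
    if pr is not None:
        # Tier 1.2: ask GitHub whether the linked PR actually exists.
        result = subprocess.run(
            ["gh", "pr", "view", str(pr), "--json", "state"],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            findings.append(f"PR_NUMBER_UNRESOLVED: {path} cites PR #{pr}")
    return findings

if __name__ == "__main__":
    all_findings = [f for p in sys.argv[1:] for f in validate_state_file(p)]
    print("\n".join(all_findings))
    sys.exit(1 if all_findings else 0)  # non-zero exit blocks the commit
```

Wired in as a pre-commit hook, the non-zero exit is what prevents data-integrity errors like non-existent PR citations from landing in the first place.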

Tier 2 — Process & workflow improvements

  1. Gated phase transitions with verifiable evidence. Implement → Test requires linked PR + CI green; Review → Merge requires basic runtime smoke test.
  2. Shift from retroactive case studies to contemporaneous logs. Primary artifact becomes a structured append-only log.
  3. Mandate explicit data quality tiers. Label every metric as Tier 1 (Instrumented) / Tier 2 (Declared) / Tier 3 (Narrative). (Items 2 and 3 are sketched below.)
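A sketch of how items 2 and 3 might look together: a contemporaneous, append-only log where every metric carries a mandatory quality tier. The field names and JSONL format are illustrative assumptions, not the framework's actual log schema:

```python
# Illustrative append-only JSONL logger with mandatory data-quality tiers.
# Field names are hypothetical; the framework's real log schema may differ.
import json
import time
from enum import Enum

class Tier(str, Enum):
    INSTRUMENTED = "T1"  # emitted by tooling, no human in the loop
    DECLARED = "T2"      # stated by the practitioner at the time of work
    NARRATIVE = "T3"     # reconstructed after the fact

def append_metric(log_path: str, name: str, value: float, tier: Tier) -> None:
    """Append one metric record; the file is never rewritten, only appended."""
    record = {
        "ts": time.time(),   # contemporaneous timestamp, not retroactive
        "metric": name,
        "value": value,
        "tier": tier.value,  # every metric must carry a tier label
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")

append_metric("feature-log.jsonl", "min_per_cu", 3.33, Tier.INSTRUMENTED)
```

The append-only discipline is what makes the log contemporaneous: records cannot be retroactively rewritten, only superseded by later entries.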

Tier 3 — Governance & verification

  1. Introduce an independent Auditor Agent. A separate automated process that runs the checks from the second meta-analysis on a cadence.
  2. Define and track documentation debt. Auditor Agent tracks metrics like kill-criteria and dispatch-pattern compliance on a dashboard.

External (not self-executable)

  1. Minimal external replication. A single case study run and documented by a separate independent entity would check self-referential bias.

8. How this audit is being acted on

Transparent accounting rather than concealment. The three PR citations Gemini flagged (#51, #69, #70) are tracked in GitHub issue #138 for root-cause investigation (typo, deleted PR, or misremembered number). The audit is being published before that investigation concludes; corrections will be appended here and in the cited case studies as they land.

Improvement backlog. Gemini's Tier 1/2/3 recommendations have been prioritized by ROI × reversibility — items 1–3 below are single-hour tasks; 4–5 are multi-session process changes:

  1. Independent Auditor Agent (extends the 72h integrity cycle v7.1) — Tier 3.1
  2. state.json schema enforcement on write — Tier 1.3
  3. Data quality tiers in all reports — Tier 2.3
  4. PR-link verification on phase transition — Tier 1.2 subset
  5. Contemporaneous logging replacing retroactive case studies — Tier 2.2

Already shipped going forward: Tier 1.1 (automated time metrics, framework v6.0, 2026-04-16). Items that cannot be done solo (Tier 3.3 — external replication) remain open.

9. What this audit does NOT do

  • It does not re-verify CU arithmetic, velocity numbers, or cache-hit percentages. Out of scope for the structural review.
  • It does not investigate the root cause of the 3 non-existent PR citations — only that they don't resolve.
  • It does not validate runtime correctness of any shipped feature — only the documentation of that shipping.
  • It does not constitute external replication. Gemini reviewed artifacts provided in a prompt; it did not run the framework on an independent task.

10. Corrections (appended 2026-04-21 after initial publication)

The “three PR citations don't resolve” finding is wrong. Gemini correctly flagged it based on what it was told, but the underlying claim came from a false positive in the structural meta-analysis Gemini was supplied with.

What actually exists

All three numbers are real GitHub issues, not PRs:

  • Issue #51 “Onboarding Flow” (CLOSED) — cited in pm-workflow-showcase-onboarding.md as regevba/fittracker2#51
  • Issue #69 “Rest Day — Positive Experience Redesign” (OPEN) — cited in training-plan-v2-case-study.md as issue #69
  • Issue #70 “Advanced Data Fusion + AI Exercise Recommendations” (OPEN) — cited in training-plan-v2-case-study.md as issue #70

gh issue view confirms all three resolve. No case-study correction is needed for the citations themselves.

Who was wrong, and what this means

  • Claude's structural meta-analysis was the source of the error. Its mechanical PR extraction used a liberal #\d+ regex and checked every match against gh pr list, conflating issue citations with PR citations.
  • Gemini's audit faithfully reproduced the error because the flawed meta-analysis was fed to it as input. Gemini did not independently re-run the gh pr view queries — it cited the meta-analysis's finding as evidence.
  • Gemini's meta-evaluation was still correct in form: “demonstrating factual errors in the evidence cited for at least two case studies (S01, S19) weakens the original claims.” If the finding had been real, that critique would stand. Because the finding itself was flawed, the actual weakness is in the meta-analysis's precision — something Gemini could not have detected without re-running the queries.

What was actually verified vs. what wasn't

Gemini's other structural findings — state.json schema drift, the audit-v2-gN stub cluster, the dispatch-pattern gap, the 95%-in-three-weeks observation, the showcase ↔ main-repo mapping — were not re-verified by the author. Those findings propagate from the meta-analysis with the same epistemic status: they could be correct, or they could contain similar precision gaps. The Auditor Agent (below) is the forward defense.

Tooling response

The Auditor Agent extension shipped in scripts/integrity-check.py (2026-04-21) uses a tighter regex requiring PR or pull/ context. Running this check against the same corpus produces zero BROKEN_PR_CITATION findings on the original case studies. This is how the false positive surfaced: cross-checking the new Auditor Agent's output against the original meta-analysis revealed the discrepancy.
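The difference between the two extraction patterns is easy to reproduce. The tight pattern below illustrates the “PR or pull/ context” requirement; it is not the exact regex shipped in scripts/integrity-check.py:

```python
# Liberal vs. context-aware citation extraction (illustrative patterns only).
import re

text = "Closed by PR #83; see also issue #51 and github.com/x/y/pull/70."

liberal = re.findall(r"#(\d+)", text)               # any '#<digits>' token
tight = re.findall(r"(?:PR\s*#|pull/)(\d+)", text)  # requires PR context

print(liberal)  # ['83', '51'] — the issue citation is miscounted as a PR
print(tight)    # ['83', '70'] — issue #51 is correctly excluded
```

Checking every liberal match against gh pr list is exactly how an issue citation like #51 became a “broken PR citation.”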

Policy precedent

This correction was appended, not substituted for the original text. Sections 1–9 above still contain the original statement. Issue #138 is closed with a full explanation, not deleted. Every subsequent audit will follow the same pattern: the initial finding stays visible, the correction appears as a time-stamped append, and the chain of reasoning is preserved for future reviewers (human or AI) to retrace.


11. How we responded — v7.5 (policy) + v7.6 (mechanical)

The 9 Tier 1/2/3 recommendations in §7 were addressed in two waves. The response is published here as appended commentary, not as silent edits to §§ 1–10.

v7.5 — policy response (shipped 2026-04-24)

v7.5 turned the audit's recommendations into eight cooperating defenses spanning write-time gates (pre-commit hooks for SCHEMA_DRIFT and PR_NUMBER_UNRESOLVED), cycle-time enforcement (the 72h Auditor Agent extended from 8 to 11 check codes), runtime smoke gates (5 profiles including sign_in_surface), contemporaneous logging (5 active feature logs), data-quality tiers (T1 / T2 / T3 convention codified in CLAUDE.md), a documentation-debt dashboard, and a measurement-adoption ledger. 7 of Gemini's 9 Tier items shipped fully or effectively, 2 shipped as partial/pilot with measured known deltas, and 1 (Tier 3.3 — external replication) was deferred as external-blocked. Full narrative: data-integrity-framework-v7.5-case-study.md.

v7.6 — mechanical response (shipped 2026-04-25)

v7.5 was a complete policy answer, but most of its new defenses were Class B — they relied on the agent remembering to invoke them. v7.6 is the mechanical layer that promotes 7 silent agent-attention checks to Class A enforcement and explicitly documents the 5 gaps that cannot be promoted (because pretending we could mechanize them would itself be a lie).

  • 4 new write-time pre-commit check codes: PHASE_TRANSITION_NO_LOG, PHASE_TRANSITION_NO_TIMING, BROKEN_PR_CITATION (write-time, narrow regex), and CASE_STUDY_MISSING_TIER_TAGS (forward-only, dates ≥ 2026-04-21). Every commit now passes through these checks before it can land.
  • Per-PR review bot: a new pm-framework/pr-integrity commit status that fails when a PR introduces NEW findings vs main. Sticky comment updates in place; status check available for branch protection.
  • Weekly framework-status cron: Mondays 05:00 UTC, snapshots the measurement-adoption history (dedup-by-date) and opens a regression issue when fully_adopted or any_adopted decreases. (The snapshot logic is sketched after this list.)
  • 5 explicit Class B gaps documented: unclosable-gaps.md enumerates each gap with a 4-section format (technical reason, observability, human action, tracking). The gaps are: cache_hits[] writer-path adoption (issue #140), cu_v2 factor magnitude correctness, T1/T2/T3 tag correctness (presence promoted to Class A; correctness stays B), Tier 2.1 real-provider auth, and Tier 3.3 external replication.
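The weekly cron's dedup-and-regression logic (third bullet) reduces to very little code. The snapshot structure below is inferred from the two field names the case study publishes (fully_adopted, any_adopted); the history file format is an assumption:

```python
# Sketch of the weekly adoption-regression check. Snapshot fields assumed
# from the published names (fully_adopted, any_adopted); history format
# is hypothetical.
import json

def append_snapshot(history_path: str, snapshot: dict) -> list[str]:
    """Dedup by date, append, and report any adoption metric that decreased."""
    with open(history_path) as f:
        history = json.load(f)  # list of {"date", "fully_adopted", "any_adopted"}

    if any(s["date"] == snapshot["date"] for s in history):
        return []  # dedup-by-date: at most one snapshot per day

    regressions = []
    if history:
        prev = history[-1]
        for key in ("fully_adopted", "any_adopted"):
            if snapshot[key] < prev[key]:
                # In the real cron, each regression opens a GitHub issue.
                regressions.append(f"{key} decreased: {prev[key]} -> {snapshot[key]}")

    history.append(snapshot)
    with open(history_path, "w") as f:
        json.dump(history, f, indent=2)
    return regressions
```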

Outlier framing — read this before quoting numbers

The v7.6 case study is itself an outlier in the corpus. It shipped in a single ~6-hour working session on 2026-04-25; the data is dogfooded (the author of the framework wrote the data and reads it); and the v6.0 measurement protocol the case study uses was applied retroactively (v6.0 shipped 2026-04-16; most prior features have empty cache_hits[] arrays not because no hits occurred but because no logger was wired). The case study labels these limits explicitly and applies them to the published numbers — the 3.33 min/CU velocity is a dogfooded micro-benchmark, not a generalizable framework-velocity claim. The fair test of v7.6's success will be downstream organic feature work over the next 6+ weeks. See §§ 10–11 of the upstream for the full data analysis with outlier caveat.

Detailed mechanical answer (links)

Tier 3.3 — the public invitation for an external operator to run the framework on an unrelated product — is the explicit final v7.6 deliverable and was filed on 2026-04-25 as issue #142 (pinned, labels: tier-3-3, external-replication, help wanted). Until an external case study lands in docs/case-studies/external/, the framework's own measurements remain self-referential by definition; the v7.6 publication is the framework honestly admitting where its own evidence runs out.

Full audit source archived at FitTracker2/docs/case-studies/meta-analysis/independent-audit-2026-04-21-gemini.md. This page reproduces that archive verbatim, plus the publication date. Any future edits will be appended (not silently rewritten); diffs will be visible in the archive's git history.