Shipping 4 Features in 54 Minutes — The Parallel Stress Test
- Version: v5.1
- Date: 2026-04-14
- Tier: flagship
4 features advanced through 8 lifecycle phases concurrently in 54 minutes. 0 build failures, 0 test failures, 0 merge conflicts across 31 subagent dispatches. The single bottleneck was infrastructure (write permissions), not architecture.
- Zero-conflict result was probabilistic in v5.1 -- agents happened to edit non-overlapping regions. v5.2 Parallel Write Safety made it structural.
- Context-window pressure was expected to be the bottleneck and wasn't -- but the test had only 4 features. Higher-N parallel runs may shift the answer.
- Permission routing (52% denial rate on framework state writes) ate ~10 min of overhead -- fixed in v5.2 sub-project C.
How to read this case study: T1/T2/T3 · ledger · kill criterion
- T1 (Instrumented): Numbers come from a machine-generated ledger or commit. Reproducible. Highest reader trust.
- T2 (Declared): Numbers stated by a structured declaration (PRD, plan, frontmatter) but not directly measured.
- T3 (Narrative): Estimates and observations from session memory. Useful for context; not citable as evidence.
- Ledger: Where to verify a claim -- a file path, GitHub issue, or backlog entry. Anything labelled "ledger:" is the audit trail.
- Kill criterion: The pre-registered threshold under which this work would have been killed mid-flight. "Not fired" means the work shipped without hitting the threshold.
- Deferred: Items intentionally not closed in this version. Each cites the ledger that tracks remaining work.
What happens when you push a framework designed for sequential work to handle four features simultaneously?
Context
After proving that the PM framework could deliver single features at high velocity (2.1 min/CU for auth flow, 5.1 min/CU for AI engine), the natural question was: does it scale horizontally? Can four independent features advance through the full lifecycle in parallel without degrading quality, producing merge conflicts, or overwhelming the coordination layer? This experiment answers that question with 54 minutes of measured data.
The Setup
4 features, running simultaneously through 8 lifecycle phases:
| Feature | Starting Phase | Final Phase | Complexity |
|---|---|---|---|
| Push Notifications | PRD (research done) | Testing (10/12 tasks) | Medium -- permission handling, notification center |
| App Store Assets | PRD (research done) | Implementation (5/10 tasks) | Low -- visual assets, config |
| Import Training Plan | Research (pending) | Testing (8/13 tasks) | High -- multi-source parser, exercise mapping, UI |
| Smart Reminders | Not started | Testing (7/14 tasks) | High -- AI-powered, 5 types, frequency caps |
The hypothesis: Framework optimizations (skill-on-demand loading, cache compression, batch dispatch) should enable 4 parallel workflows without significant quality degradation. Expected bottleneck: context window pressure at phase transitions.
The actual result: the context window was not a bottleneck at all. The single bottleneck was infrastructure (file write permissions for subagents), not architecture.
The Results
Zero Quality Degradation
| Metric | Result |
|---|---|
| Build failures | 0 out of 5 builds |
| Test failures | 0 out of 35 tests |
| Git merge conflicts | 0 across 8 phases, 31 subagent dispatches |
| Same-file parallel edits | 15 edits to 3 shared files, 0 conflicts |
| Quality rework | 0 specs requiring revision |
| Cross-agent code comprehension | 100% -- test agents correctly understood implementation agents' code |
Throughput Numbers
| Execution Mode | Features | Wall Time | Total CU | CU/hour | vs Baseline |
|---|---|---|---|---|---|
| Serial (v2.0 baseline) | 1 | 390 min | 25.7 | 3.95 | 1.0x |
| Serial (v5.1 average) | 1 | ~80 min | ~20 | ~15.0 | 3.8x |
| Parallel (this test) | 4 | 54 min | 43.9 | 48.8 | 12.4x |
Phase Timing
| Phase | Duration | Transitions | New Files | Build |
|---|---|---|---|---|
| Research to PRD | 5 min | 4 | 2 PRDs | -- |
| PRD to Tasks | 4 min | 4 | 2 PRDs, 2 task files | -- |
| Tasks to UX | 8 min | 4 | 4 UX specs | -- |
| UX to Implementation | 5 min | 4 | 3 Swift files, 1 script | PASS |
| Deep Implementation | 4 min | 0 | 5 Swift files, 1 JSON | PASS |
| UI + Orchestrator | 3 min | 0 | 4 Swift files | PASS |
| Analytics (same-file) | 4 min | 0 | 1 Swift file, 23 events | PASS |
| Testing | 17 min | 0 | 3 test files | 35/35 PASS |
How Same-File Parallel Writes Worked
The analytics phase demonstrated that three agents can edit the same source files simultaneously: each agent added its own events to the shared analytics provider and service files, using section markers for isolation. Git's sequential commit model meant each agent's commit landed on top of the previous agent's additions.
Why it worked: additive-only changes at different positions, section-marker isolation, and sequential commits. Each agent wrote only inside its own marked section, as in the sketch below.
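A minimal sketch of that marker discipline, assuming hypothetical marker strings and a hypothetical helper name -- the framework's actual mechanism may differ:

```python
# Sketch of section-marker isolation. Each agent appends only inside
# its own marked region, so parallel edits to the same file always
# touch disjoint line ranges.

def append_to_agent_section(path: str, agent: str, new_lines: list[str]) -> None:
    begin = f"// AGENT-SECTION BEGIN: {agent}"
    end = f"// AGENT-SECTION END: {agent}"
    with open(path) as f:
        lines = f.read().splitlines()
    # Insert just before the agent's END marker; fail loudly if the
    # region is missing rather than risk writing outside it.
    if begin not in lines or end not in lines:
        raise RuntimeError(f"no marked section for {agent!r} in {path}")
    insert_at = lines.index(end)
    lines[insert_at:insert_at] = new_lines
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

# Example: three agents call this concurrently on the same Swift file;
# as long as commits serialize, each merge is a clean additive diff:
#   append_to_agent_section("Analytics/AnalyticsEvents.swift", "smart-reminders",
#                           ['    case reminderScheduled = "reminder_scheduled"'])
```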
The honest caveat: this success partly depends on agents writing at different positions. If two agents modified the same function or the same line, conflicts would occur. A structural solution (region extraction and reconstruction) was identified as a research direction but not yet implemented.
What Broke Down
Critical: Subagent file permissions. Every dispatch that needed to write to feature state files was denied -- 16 of 31 dispatches (52%). The controller had to batch all state updates manually, adding ~10 minutes of overhead. Without that overhead, the experiment would have finished in ~44 minutes.
Low severity: Agent execution time variance. Agent execution ranged from 43 seconds to 987 seconds (23x spread) for similar-complexity tasks. More tool uses did not produce better quality -- the agent with 68 tool uses produced identical output to the agent with 7 tool uses.
Non-issue: Context window. The expected bottleneck never materialized. Agent results are summaries (not full file contents), state updates are formulaic, and skill-on-demand loading keeps irrelevant context out.
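For illustration, a hypothetical shape of what a dispatch hands back to the controller -- the field names here are assumptions, not the framework's actual schema:

```python
# Only a compact summary crosses back into the controller's context,
# never full file contents -- which is why 31 dispatches never
# pressured the context window. (Hypothetical field names.)
dispatch_result = {
    "feature": "smart-reminders",
    "phase": "implementation",
    "files_written": ["Sources/Reminders/ReminderScheduler.swift"],
    "summary": "Added frequency-cap logic; wired 5 reminder types to the scheduler.",
    "build": "PASS",
}
```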
Why Parallel Execution Is Super-Linear
Serial v5.1 produces ~15 CU/hour (per the throughput table above). Parallel v5.1 produced 48.8 CU/hour -- about 3.3x the serial rate, not the ~1x you would expect if coordination overhead consumed the parallelism. The super-linear improvement comes from:
- Amortized batch updates -- one script updates 4 state files in 2 seconds vs 4 separate operations (see the sketch after this list)
- Controller learning -- after phase 1, the controller adapted prompts to avoid permission failures, reducing overhead in subsequent phases
- Agent independence -- features don't share code paths early in the lifecycle, so coordination cost is zero
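A sketch of what that batch update could look like, assuming hypothetical state-file paths and a hypothetical JSON state schema:

```python
# Because subagents could not write feature state files, the controller
# collected their reported changes and applied them all in one pass
# instead of four separate write operations.

import json

def apply_state_updates(updates: dict[str, dict]) -> None:
    """Merge agent-reported fields into each feature's state file."""
    for path, fields in updates.items():
        with open(path) as f:
            state = json.load(f)
        state.update(fields)  # merge the agent-reported fields
        with open(path, "w") as f:
            json.dump(state, f, indent=2)

# One controller-side call covers all four features at once:
apply_state_updates({
    "state/push-notifications.json":   {"phase": "testing", "tasks_done": 10},
    "state/app-store-assets.json":     {"phase": "implementation", "tasks_done": 5},
    "state/import-training-plan.json": {"phase": "testing", "tasks_done": 8},
    "state/smart-reminders.json":      {"phase": "testing", "tasks_done": 7},
})
```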
Decomposing the 12.4x headline:
- Serial framework improvement (v5.1 vs v2.0): ~3.8x (15.0 vs 3.95 CU/hour)
- Parallel execution speedup (4 features vs 1): ~3.3x (48.8 vs 15.0 CU/hour)
- Combined: ~3.8 x ~3.3 ≈ 12.4x, matching the measured ratio 48.8 / 3.95
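The whole chain can be recomputed from the throughput table:

```python
# Recomputing the headline from the throughput table above.
baseline   = 25.7 / (390 / 60)   # v2.0 serial:  3.95 CU/hour
serial_v51 = 20.0 / (80 / 60)    # v5.1 serial: ~15.0 CU/hour
parallel   = 43.9 / (54 / 60)    # this test:    48.8 CU/hour

serial_gain   = serial_v51 / baseline    # ~3.8x framework improvement
parallel_gain = parallel / serial_v51    # ~3.3x parallel speedup
print(round(serial_gain * parallel_gain, 1))  # ~12.3 -- the ~12.4x headline, within rounding
```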
Normalized Velocity
| Feature | CU | Wall Time (est.) | min/CU |
|---|---|---|---|
| Push Notifications | 15.0 | ~13 min | 0.87 |
| App Store Assets | 5.0 | ~8 min | 1.60 |
| Import Training Plan | 12.0 | ~13 min | 1.08 |
| Smart Reminders | 11.9 | ~13 min | 1.09 |
| Combined | 43.9 | 54 min | 1.23 |
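The min/CU column is just estimated wall time divided by CU, so it can be recomputed directly:

```python
# min/CU = estimated wall time / complexity units, from the table above.
rows = {
    "Push Notifications":   (13, 15.0),
    "App Store Assets":     (8,  5.0),
    "Import Training Plan": (13, 12.0),
    "Smart Reminders":      (13, 11.9),
}
for name, (minutes, cu) in rows.items():
    print(f"{name}: {minutes / cu:.2f} min/CU")
print(f"Combined: {54 / 43.9:.2f} min/CU")  # 1.23
```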
The combined 1.23 min/CU is 2.4x better than the power law prediction for the 13th iteration, suggesting parallelism provides super-linear improvement beyond framework learning effects.
Key Takeaways
- The framework crossed the threshold from "helpful tool" to "force multiplier." It doesn't just organize work -- it makes previously impossible workloads achievable. 4 features in 54 minutes with zero quality degradation was not possible at any prior framework version.
- The bottleneck was infrastructure, not architecture. File permissions, not context windows or coordination overhead, were the only blocking issue. This means the architecture has headroom.
- 12.4x throughput vs baseline decomposes cleanly into ~4x serial improvement and ~3x parallel speedup. Both are independently valuable and independently improvable.
- 35 tests, 0 failures, 0 merge conflicts across 4 simultaneous features is the quality story. Speed without quality is not a feature.