5 min read
What If You Designed Software Like a Chip?
- Version: v5.0
- Date: 2026-04-12
- Tier: flagship
7 hardware architecture principles applied to a PM framework — LoRA hot-swap, palettization, weight-stationary, UMA zero-copy, mixed precision, branch prediction, big.LITTLE. Framework overhead fell 63%; free context nearly doubled.
- Token measurements use a tokenizer pipeline added in v6.0; the v4.4 baseline numbers are word-count-derived with ~15% error.
- The 82% LoRA hot-swap savings only land for skills with high per-phase variance; cache-stable phases see closer to 35% savings.
- The 7 chip principles were chosen to fit the framework's known shape — not all hardware ideas mapped, and the case study leaves room for failed analogies.
How to read this case study (T1/T2/T3 · ledger · kill criterion)
- T1 (Instrumented): Numbers come from a machine-generated ledger or commit. Reproducible. Highest reader trust.
- T2 (Declared): Numbers stated by a structured declaration (PRD, plan, frontmatter) but not directly measured.
- T3 (Narrative): Estimates and observations from session memory. Useful for context; not citable as evidence.
- Ledger: Where to verify the claim — a file path, GitHub issue, or backlog entry. Anything labelled `ledger:` is the audit trail.
- Kill criterion: The pre-registered threshold under which this work would have been killed mid-flight. Not fired = work shipped without hitting the threshold.
- Deferred: Items intentionally not closed in this version. Each cites the ledger that tracks remaining work.
- Floor 6 (v6.0 Measurement): Instrumentation overlay
  - phase-timing.json
  - cache-hits.json
  - CU v2
  - rolling baselines
- Floor 5 (v5.2 Dispatch Intelligence): Parallel Write Safety
  - complexity_scoring
  - model_routing
  - tool_budgets
  - mirror_pattern
  - snapshot/rollback
- Floor 4 (v5.1 Adaptive Batch): Throughput primitives
  - batch_dispatch
  - result_forwarding
  - model_tiering
  - speculative_preload
  - systolic_chains
  - task_complexity_gate
- Floor 3 (v5.0 SoC-on-Software): Reclaim context
  - phase_skills (skill-on-demand)
  - compressed_view (cache compression)
- Floor 2 (Skills + Cache): Hub-and-spoke
  - pm-workflow/SKILL.md
  - .claude/cache/ L1/L2/L3
- Floor 1 (Shared State): The load-bearing slab
  - audit-findings.json
  - skill-routing.json
  - feature-registry.json
  - design-system.json
  - token-budget.json
Now watch it run.
Sprint I — 10 UI/DS token migrations
Low-risk mechanical work. Routed to a LITTLE core. Only 2 skills loaded.
- Beat 1 of 7 · Request arrives. Sprint I enters the framework: 10 mechanical UI/DS fixes queued from the audit backlog. Raw fonts, inline shadows, unmapped opacity literals. (10 findings · low risk)
- Beat 2 of 7 · Floor 1 — shared state read. audit-findings.json returns 10 unresolved UI/DS findings. feature-registry.json maps them to 6 views. design-system.json exposes the token API.
- Beat 3 of 7 · Floor 5 — dispatch intelligence classifies. complexity_scoring: "mechanical token migration, low." task_complexity_gate routes to a LITTLE core. tool_budgets allocates a small Edit-heavy budget. No parallel dispatch needed. (LITTLE core · serial dispatch; routing sketched below, after the walkthrough)
- Beat 4 of 7 · Floor 3 — skills loaded on demand. phase_skills["implement"] loads only the `design` and `dev` SKILL.md files. The other 9 skills stay dormant. compressed_view reads the cache in palettized form. (2 / 11 skills loaded · ~27K tok context saved)
- Beat 5 of 7 · Floor 2 — cache tiers consulted. L1 cache returns the design/token map (AppText.displayXL = 36pt). L2 returns ux-foundations principles. L3 returns prior v2 migration patterns. cache-hits.json increments. (Floor 2 → Floor 6 — hits +3)
- Beat 6 of 7 · Floor 4 — systolic execution loop. Per finding: Grep → Read → Edit forwards results without reloading. 10 iterations in sequence. Floor 5 keeps a snapshot armed; rollback is never needed and never fires. (10 edits · 0 rollbacks · Floor 4 → Floor 6 — phase-timing tick)
- Beat 7 of 7 · Write-back & exit. audit-findings.json updated: 10 resolved. case-study-monitoring records the phase transition. cache-metrics flushed. PR #97 opened and merged. (10 / 10 findings resolved · PR #97)
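The walkthrough above is also easy to express as plain control flow. Here is a minimal sketch of the beat 3-4 routing decision, assuming a toy complexity score and the phase_skills map shown on Floor 3; the function names, thresholds, and data shapes are illustrative, not the framework's actual API.

```python
# Illustrative sketch of the beat 3-4 routing decision. All names and
# thresholds here are assumptions for the example, not the framework's API.

# Hypothetical per-phase skill map (the Floor 3 phase_skills idea).
PHASE_SKILLS = {
    "implement": ["design", "dev"],              # 2 of 11 skills for mechanical UI/DS work
    "architect": ["architecture", "dev", "qa"],
}

def score_complexity(findings: list[dict]) -> str:
    """Classify a batch: small, purely mechanical edits score 'low'."""
    mechanical = all(f.get("kind") == "token_migration" for f in findings)
    return "low" if mechanical and len(findings) <= 20 else "high"

def route(findings: list[dict], phase: str) -> dict:
    """Pick a core tier (big.LITTLE analog) and the skills to load on demand."""
    complexity = score_complexity(findings)
    core = "LITTLE" if complexity == "low" else "big"
    return {
        "core": core,
        "dispatch": "serial" if core == "LITTLE" else "parallel",
        "skills": PHASE_SKILLS.get(phase, []),   # skill-on-demand: load only these
    }

# Sprint I: 10 low-risk token migrations -> LITTLE core, serial dispatch, 2 skills.
sprint_i = [{"kind": "token_migration", "id": i} for i in range(10)]
print(route(sprint_i, "implement"))
```

On the Sprint I batch this returns a LITTLE-core, serial plan with only `design` and `dev` loaded, matching beats 3 and 4 above.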
Context
By framework v4.4, the PM workflow was fast — but it was also heavy. Every phase loaded all 11 skill files, the entire cache, all integration adapters, and the full shared data layer. This consumed 48% of the AI's context window before any actual work began. The framework was powerful but bloated, like a chip that loads its entire instruction set for every operation.
The insight came from studying how modern chips solve the same problem: LoRA adapter hot-swap, weight palettization, TPU dataflow architecture, Apple's Unified Memory, ANE mixed precision, branch prediction, and ARM's big.LITTLE heterogeneous compute. Each principle had a direct software analog. The question was whether applying them systematically could reclaim enough context to meaningfully change what the framework could do.
The Problem
Before (v4.4): 121,714 tokens of framework overhead per phase. Only 78,286 tokens free for actual work. Complex features bumped against context limits, and parallel execution was constrained by how much each agent needed to load.
| Layer | Tokens | % of 200K window |
|---|---|---|
| 11 SKILL.md files | 42,907 | 21% |
| Cache entries | 33,737 | 17% |
| Shared layer | 29,079 | 15% |
| Adapters | 15,991 | 8% |
| Total overhead | 121,714 | 61% |
The 7 Principles
| # | Chip Principle | Software Analog | Measured Savings |
|---|---|---|---|
| 1 | LoRA Adapter Hot-Swap | Skill-on-demand loading (load only the skills needed per phase) | ~35K/phase (82%) |
| 2 | Palettization (3.7-bit) | Cache compression (compressed views of cache entries) | ~30.5K (91% compression) |
| 3 | TPU Weight-Stationary | Template-stationary batch audits (reuse templates across targets) | ~50% fewer reads |
| 4 | UMA Zero-Copy | Result forwarding (pass outputs directly between skills) | ~7.5K/phase |
| 5 | ANE Mixed Precision | Model tiering (use cheaper models for simple phases) | 60% of phases on lighter model |
| 6 | Branch Prediction | Speculative cache pre-loading (predict which cache entries a phase needs) | ~7 cache reads eliminated/lifecycle |
| 7 | ARM big.LITTLE | Hybrid task dispatch (route lightweight tasks to fast agents, heavyweight to capable ones) | ~2-3x throughput on mixed workloads |
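As one concrete example of how literal the mapping is, here is a minimal sketch of the palettization analog (row 2), under the assumption that a cache entry is a flat list of heavily repeated string values: distinct values are stored once in a palette and the entry body becomes small indices, the same move 3.7-bit weight palettization makes with weights. The `CacheView` and `palettize` names are illustrative; the framework's actual compressed_view format isn't shown in this case study.

```python
# Illustrative palettization analog (principle 2), not the framework's real
# compressed_view format. Repeated values are stored once in a palette;
# the entry body becomes a list of small indices into that palette.
from dataclasses import dataclass

@dataclass
class CacheView:
    palette: list[str]   # each distinct value, stored once
    indices: list[int]   # per-slot reference into the palette

def palettize(values: list[str]) -> CacheView:
    palette: list[str] = []
    lookup: dict[str, int] = {}
    indices: list[int] = []
    for v in values:
        if v not in lookup:
            lookup[v] = len(palette)
            palette.append(v)
        indices.append(lookup[v])
    return CacheView(palette, indices)

def expand(view: CacheView) -> list[str]:
    """Reconstruct the original entry from the compressed view."""
    return [view.palette[i] for i in view.indices]

# A design-token cache entry with heavy repetition compresses well.
entry = ["AppText.displayXL"] * 40 + ["AppShadow.card"] * 25 + ["AppOpacity.disabled"] * 15
view = palettize(entry)
assert expand(view) == entry
print(len(view.palette), "distinct values for", len(entry), "slots")
```

The ~91% figure in the table depends on how repetitive real cache entries are; the sketch shows only the mechanism, not the measured ratio.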
The Result
| Metric | v4.4 (before) | v5.1 (after) | Change |
|---|---|---|---|
| Framework overhead | 121,714 tokens | ~45,125 tokens | -63% |
| Free context | 78,286 tokens | 154,875 tokens | +98% |
| SKILL.md loading | All 11 files always | 2-3 files per phase | -82% |
| Cache loading | All entries always | Phase-relevant only | -90% |
Implementation: 30 minutes total. 10 minutes for v5.0 (items 1-2, the largest savings), 15 minutes for v5.1 (items 3-8, multiplicative optimizations), 5 minutes for documentation.
Velocity: 2.98 min/CU for v5.0 alone (+80% vs baseline). The research had been done in a prior session — implementation was pure execution against a clear spec.
Why This Mattered
The doubled context budget directly enabled the next generation of features. The AI Engine Architecture Adaptation (17 files, 986 insertions) and Onboarding Auth Flow (15 files, 627 insertions) both shipped at v5.1 — and both would have been constrained at the old 78K token budget.
The principles also formed a coherent system rather than independent optimizations. Skill-on-demand decides what to load. Cache compression controls how much. Result forwarding eliminates redundant loading between skills. Model tiering picks which model runs each phase. Speculative pre-loading predicts what to load early. Together, they're a unified execution model — not a bag of tricks.
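A rough sketch of that composition, assuming toy phase tables; every name below is illustrative rather than the framework's actual orchestration code.

```python
# Illustrative composition of the five decisions named above. All names,
# tables, and heuristics are assumptions for the sketch, not the real API.

PHASE_SKILLS = {"implement": ["design", "dev"], "qa": ["qa", "dev"]}
NEXT_PHASE = {"implement": "qa"}            # drives speculative pre-loading
LIGHT_PHASES = {"implement", "docs"}        # phases a lighter model can handle

def plan_phase(phase: str, forwarded: dict | None = None) -> dict:
    """Build one execution plan that applies all five principles together."""
    skills = PHASE_SKILLS.get(phase, [])                          # skill-on-demand
    cache = [f"{s}.compressed_view" for s in skills]              # cache compression
    inputs = forwarded or {}                                      # result forwarding
    model = "light" if phase in LIGHT_PHASES else "heavy"         # model tiering
    preload = PHASE_SKILLS.get(NEXT_PHASE.get(phase, ""), [])     # speculative pre-loading
    return {"skills": skills, "cache": cache, "inputs": inputs,
            "model": model, "preload_next": preload}

implement_plan = plan_phase("implement")
qa_plan = plan_phase("qa", forwarded={"edits": 10})  # QA reuses implement's output
print(implement_plan)
print(qa_plan)
```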
What Worked
- Items 1-2 (v5.0) delivered the largest absolute savings. Skill-on-demand + cache compression reclaimed 54K tokens with purely additive changes — no skill rewritten, no cache entry deleted, no workflow changed.
- The chip-to-software mapping wasn't metaphorical. Each principle mapped to a specific configuration change with measurable token savings. The hardware analogy produced actionable architecture, not just a nice narrative.
- The research-to-implementation pipeline completed in under an hour. Research was done in a prior session, tracked externally, then executed as a focused implementation session.
What Failed
- Subagents hit write permission denials on cache files. The first 3 agents dispatched for cache compression failed because the write tool was denied on framework files. A 4th agent using a different tool succeeded. This wasted ~5 minutes and highlighted that permission-aware dispatch was needed (which became v5.2's Dispatch Intelligence).
- v5.0 was a waypoint, not a stable release. The framework went v4.4 → v5.0 → v5.1 within hours. No feature actually ran at v5.0 — it was superseded before the next workflow invocation.
Key Takeaways
- Hardware architecture principles transfer directly to software framework design. The mapping isn't forced — modern chips and AI frameworks face the same fundamental constraint: limited bandwidth (silicon bus / context window) that must be managed intelligently.
- The largest gains came from the simplest changes. Loading only what you need (skill-on-demand) and compressing what you store (cache compression) together reclaimed 27% of the entire context window.
- Doubling free context didn't just make existing features faster — it made previously impossible features possible. The SoC optimization was a capability unlock, not just a performance improvement.