5 min read
What If You Designed Software Like a Chip?
- Version: v5.0
- Date: 2026-04-12
- Tier: flagship
7 hardware architecture principles applied to a PM framework — LoRA hot-swap, palettization, weight-stationary, UMA zero-copy, mixed precision, branch prediction, big.LITTLE. Framework overhead fell 63%; free context nearly doubled.
- Token measurements use a tokenizer pipeline added in v6.0; the v4.4 baseline numbers are word-count-derived with ~15% error.
- The 82% LoRA hot-swap savings only land for skills with high per-phase variance; cache-stable phases see closer to 35% savings.
- The 7 chip principles were chosen to fit the framework's known shape — not all hardware ideas mapped, and the case study leaves room for failed analogies.
How to read this case study (T1/T2/T3 · ledger · kill criterion)
- T1 (Instrumented): Numbers come from a machine-generated ledger or commit. Reproducible. Highest reader trust.
- T2 (Declared): Numbers stated by a structured declaration (PRD, plan, frontmatter) but not directly measured.
- T3 (Narrative): Estimates and observations from session memory. Useful for context; not citable as evidence.
- Ledger: Where to verify the claim — a file path, GitHub issue, or backlog entry. Anything labelled `ledger:` is the audit trail.
- Kill criterion: The pre-registered threshold under which this work would have been killed mid-flight. Not fired = work shipped without hitting the threshold.
- Deferred: Items intentionally not closed in this version. Each cites the ledger that tracks remaining work.
- Floor 6 (v6.0 Measurement): Instrumentation overlay
  - phase-timing.json
  - cache-hits.json
  - CU v2
  - rolling baselines
- Floor 5 (v5.2 Dispatch Intelligence): Parallel Write Safety
  - complexity_scoring
  - model_routing
  - tool_budgets
  - mirror_pattern
  - snapshot/rollback
- Floor 4 (v5.1 Adaptive Batch): Throughput primitives
  - batch_dispatch
  - result_forwarding
  - model_tiering
  - speculative_preload
  - systolic_chains
  - task_complexity_gate
- Floor 3 (v5.0 SoC-on-Software): Reclaim context
  - phase_skills (skill-on-demand)
  - compressed_view (cache compression)
- Floor 2 (Skills + Cache): Hub-and-spoke
  - pm-workflow/SKILL.md
  - .claude/cache/ L1/L2/L3
- Floor 1 (Shared State): The load-bearing slab
  - audit-findings.json
  - skill-routing.json
  - feature-registry.json
  - design-system.json
  - token-budget.json
Now watch it run.
Sprint I — 10 UI/DS token migrations
Low-risk mechanical work. Routed to a LITTLE core. Only 2 skills loaded.
- Beat 1 of 7 · Request arrives. Sprint I enters the framework: 10 mechanical UI/DS fixes queued from the audit backlog. Raw fonts, inline shadows, unmapped opacity literals. (10 findings · low risk)
- Beat 2 of 7 · Floor 1 — shared state read. audit-findings.json returns 10 unresolved UI/DS findings. feature-registry.json maps them to 6 views. design-system.json exposes the token API.
- Beat 3 of 7 · Floor 5 — dispatch intelligence classifies. complexity_scoring: "mechanical token migration, low." task_complexity_gate routes to a LITTLE core. tool_budgets allocates a small Edit-heavy budget. No parallel dispatch needed. (LITTLE core · serial dispatch; routing sketched below, after the walkthrough)
- Beat 4 of 7 · Floor 3 — skills loaded on demand. phase_skills["implement"] loads only the `design` and `dev` SKILL.md files. The other 9 skills stay dormant. compressed_view reads the cache in palettized form. (2 / 11 skills loaded · ~27K tok context saved)
- Beat 5 of 7 · Floor 2 — cache tiers consulted. L1 cache returns the design/token map (AppText.displayXL = 36pt). L2 returns ux-foundations principles. L3 returns prior v2 migration patterns. cache-hits.json increments. (Floor 2 → Floor 6 — hits +3)
- Beat 6 of 7 · Floor 4 — systolic execution loop. Per finding: Grep → Read → Edit forwards results without reloading. 10 iterations in sequence. Floor 5 keeps a snapshot armed; rollback is never needed and never fires. (10 edits · 0 rollbacks · Floor 4 → Floor 6 — phase-timing tick)
- Beat 7 of 7 · Write-back & exit. audit-findings.json updated: 10 resolved. case-study-monitoring records the phase transition. cache-metrics flushed. PR #97 opened and merged. (10 / 10 findings resolved · PR #97)
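The walkthrough above is also easy to express as plain control flow. Here is a minimal sketch of the beat 3-4 routing decision, assuming a toy complexity score and the phase_skills map shown on Floor 3; the function names, thresholds, and data shapes are illustrative, not the framework's actual API.

```python
# Illustrative sketch of the beat 3-4 routing decision. All names and
# thresholds here are assumptions for the example, not the framework's API.

# Hypothetical per-phase skill map (the Floor 3 phase_skills idea).
PHASE_SKILLS = {
    "implement": ["design", "dev"],              # 2 of 11 skills for mechanical UI/DS work
    "architect": ["architecture", "dev", "qa"],
}

def score_complexity(findings: list[dict]) -> str:
    """Classify a batch: small, purely mechanical edits score 'low'."""
    mechanical = all(f.get("kind") == "token_migration" for f in findings)
    return "low" if mechanical and len(findings) <= 20 else "high"

def route(findings: list[dict], phase: str) -> dict:
    """Pick a core tier (big.LITTLE analog) and the skills to load on demand."""
    complexity = score_complexity(findings)
    core = "LITTLE" if complexity == "low" else "big"
    return {
        "core": core,
        "dispatch": "serial" if core == "LITTLE" else "parallel",
        "skills": PHASE_SKILLS.get(phase, []),   # skill-on-demand: load only these
    }

# Sprint I: 10 low-risk token migrations -> LITTLE core, serial dispatch, 2 skills.
sprint_i = [{"kind": "token_migration", "id": i} for i in range(10)]
print(route(sprint_i, "implement"))
```

On the Sprint I batch this returns a LITTLE-core, serial plan with only `design` and `dev` loaded, matching beats 3 and 4 above.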
Context
By framework v4.4, the PM workflow was fast — but it was also heavy. Every phase loaded all 11 skill files, the entire cache, all integration adapters, and the full shared data layer. This consumed 48% of the AI's context window before any actual work began. The framework was powerful but bloated, like a chip that loads its entire instruction set for every operation.
The insight came from studying how modern chips solve the same problem: LoRA adapter hot-swap, weight palettization, TPU dataflow architecture, Apple's Unified Memory, ANE mixed precision, branch prediction, and ARM's big.LITTLE heterogeneous compute. Each principle had a direct software analog. The question was whether applying them systematically could reclaim enough context to meaningfully change what the framework could do.
The Problem
Before (v4.4): 121,714 tokens of framework overhead per phase. Only 78,286 tokens free for actual work. Complex features bumped against context limits, and parallel execution was constrained by how much each agent needed to load.
| Layer | Tokens | % of 200K window |
|---|---|---|
| 11 SKILL.md files | 42,907 | 21% |
| Cache entries | 33,737 | 17% |
| Shared layer | 29,079 | 15% |
| Adapters | 15,991 | 8% |
| Total overhead | 121,714 | 61% |
The 7 Principles
| # | Chip Principle | Software Analog | Measured Savings |
|---|---|---|---|
| 1 | LoRA Adapter Hot-Swap | Skill-on-demand loading (load only the skills needed per phase) | ~35K/phase (82%) |
| 2 | Palettization (3.7-bit) | Cache compression (compressed views of cache entries) | ~30.5K (91% compression) |
| 3 | TPU Weight-Stationary | Template-stationary batch audits (reuse templates across targets) | ~50% fewer reads |
| 4 | UMA Zero-Copy | Result forwarding (pass outputs directly between skills) | ~7.5K/phase |
| 5 | ANE Mixed Precision | Model tiering (use cheaper models for simple phases) | 60% of phases on lighter model |
| 6 | Branch Prediction | Speculative cache pre-loading (predict which cache entries a phase needs) | ~7 cache reads eliminated/lifecycle |
| 7 | ARM big.LITTLE | Hybrid task dispatch (route lightweight tasks to fast agents, heavyweight to capable ones) | ~2-3x throughput on mixed workloads |
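As one concrete example of how literal the mapping is, here is a minimal sketch of the palettization analog (row 2), under the assumption that a cache entry is a flat list of heavily repeated string values: distinct values are stored once in a palette and the entry body becomes small indices, the same move 3.7-bit weight palettization makes with weights. The `CacheView` and `palettize` names are illustrative; the framework's actual compressed_view format isn't shown in this case study.

```python
# Illustrative palettization analog (principle 2), not the framework's real
# compressed_view format. Repeated values are stored once in a palette;
# the entry body becomes a list of small indices into that palette.
from dataclasses import dataclass

@dataclass
class CacheView:
    palette: list[str]   # each distinct value, stored once
    indices: list[int]   # per-slot reference into the palette

def palettize(values: list[str]) -> CacheView:
    palette: list[str] = []
    lookup: dict[str, int] = {}
    indices: list[int] = []
    for v in values:
        if v not in lookup:
            lookup[v] = len(palette)
            palette.append(v)
        indices.append(lookup[v])
    return CacheView(palette, indices)

def expand(view: CacheView) -> list[str]:
    """Reconstruct the original entry from the compressed view."""
    return [view.palette[i] for i in view.indices]

# A design-token cache entry with heavy repetition compresses well.
entry = ["AppText.displayXL"] * 40 + ["AppShadow.card"] * 25 + ["AppOpacity.disabled"] * 15
view = palettize(entry)
assert expand(view) == entry
print(len(view.palette), "distinct values for", len(entry), "slots")
```

The ~91% figure in the table depends on how repetitive real cache entries are; the sketch shows only the mechanism, not the measured ratio.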
The Result
| Metric | v4.4 (before) | v5.1 (after) | Change |
|---|---|---|---|
| Framework overhead | 121,714 tokens | ~45,125 tokens | -63% |
| Free context | 78,286 tokens | 154,875 tokens | +98% |
| SKILL.md loading | All 11 files always | 2-3 files per phase | -82% |
| Cache loading | All entries always | Phase-relevant only | -90% |
Implementation: 30 minutes total. 10 minutes for v5.0 (items 1-2, the largest savings), 15 minutes for v5.1 (items 3-8, multiplicative optimizations), 5 minutes for documentation.
Velocity: 2.98 min/CU for v5.0 alone (+80% vs baseline). The research had been done in a prior session — implementation was pure execution against a clear spec.
Why This Mattered
The doubled context budget directly enabled the next generation of features. The AI Engine Architecture Adaptation (17 files, 986 insertions) and Onboarding Auth Flow (15 files, 627 insertions) both shipped at v5.1 — and both would have been constrained at the old 78K token budget.
The principles also formed a coherent system rather than independent optimizations. Skill-on-demand decides what to load. Cache compression controls how much. Result forwarding eliminates redundant loading between skills. Model tiering picks which model runs each phase. Speculative pre-loading predicts what to load early. Together, they're a unified execution model — not a bag of tricks.
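A rough sketch of that composition, assuming toy phase tables; every name below is illustrative rather than the framework's actual orchestration code.

```python
# Illustrative composition of the five decisions named above. All names,
# tables, and heuristics are assumptions for the sketch, not the real API.

PHASE_SKILLS = {"implement": ["design", "dev"], "qa": ["qa", "dev"]}
NEXT_PHASE = {"implement": "qa"}            # drives speculative pre-loading
LIGHT_PHASES = {"implement", "docs"}        # phases a lighter model can handle

def plan_phase(phase: str, forwarded: dict | None = None) -> dict:
    """Build one execution plan that applies all five principles together."""
    skills = PHASE_SKILLS.get(phase, [])                          # skill-on-demand
    cache = [f"{s}.compressed_view" for s in skills]              # cache compression
    inputs = forwarded or {}                                      # result forwarding
    model = "light" if phase in LIGHT_PHASES else "heavy"         # model tiering
    preload = PHASE_SKILLS.get(NEXT_PHASE.get(phase, ""), [])     # speculative pre-loading
    return {"skills": skills, "cache": cache, "inputs": inputs,
            "model": model, "preload_next": preload}

implement_plan = plan_phase("implement")
qa_plan = plan_phase("qa", forwarded={"edits": 10})  # QA reuses implement's output
print(implement_plan)
print(qa_plan)
```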
What Worked
- Items 1-2 (v5.0) delivered the largest absolute savings. Skill-on-demand + cache compression reclaimed 54K tokens with purely additive changes — no skill rewritten, no cache entry deleted, no workflow changed.
- The chip-to-software mapping wasn't metaphorical. Each principle mapped to a specific configuration change with measurable token savings. The hardware analogy produced actionable architecture, not just a nice narrative.
- The research-to-implementation pipeline completed in under an hour. Research was done in a prior session, tracked externally, then executed as a focused implementation session.
What Failed
- Subagents hit write permission denials on cache files. The first 3 agents dispatched for cache compression failed because the write tool was denied on framework files. A 4th agent using a different tool succeeded. This wasted ~5 minutes and highlighted that permission-aware dispatch was needed (which became v5.2's Dispatch Intelligence).
- v5.0 was a waypoint, not a stable release. The framework went v4.4 → v5.0 → v5.1 within hours. No feature actually ran at v5.0 — it was superseded before the next workflow invocation.
Key Takeaways
- Hardware architecture principles transfer directly to software framework design. The mapping isn't forced — modern chips and AI frameworks face the same fundamental constraint: limited bandwidth (silicon bus / context window) that must be managed intelligently.
- The largest gains came from the simplest changes. Loading only what you need (skill-on-demand) and compressing what you store (cache compression) together reclaimed 27% of the entire context window.
- Doubling free context didn't just make existing features faster — it made previously impossible features possible. The SoC optimization was a capability unlock, not just a performance improvement.