The real-world testing had exposed a pattern. Most of the failures weren’t in the worker’s code generation — they were in the planner’s instructions to the worker. The planner was telling Qwen what to build without providing enough context about what already existed. That raised an obvious question: would a better planner fix the worker’s output?

I ran the same 10-prompt website building sequence with three different planner models. Identical worker (Qwen 3 14B), identical system, clean episodic memory each time. The only variable was which model wrote the plans.

Sonnet 4.5 reached step 6 before stalling. It read only the HTML file before planning updates, ignoring CSS and JavaScript entirely. When it told Qwen to “generate a dark mode toggle button,” it didn’t provide the existing HTML. Qwen invented a button ID. Then in the JavaScript step, Qwen invented a different ID. The toggle rendered but did nothing — the JS was listening for clicks on an element that didn’t exist. The countdown timer broke the same way. Files went missing between updates because the planner didn’t carry them forward.

Sonnet 4.6 reached step 9. Every modification started by reading all existing files — HTML, CSS, and JavaScript. When it told Qwen to generate dark mode JavaScript, it included the current HTML with the actual button ID. Qwen followed the instructions exactly. Same model, same task, working dark mode. The difference was entirely in the quality of the instructions.
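The fix that separated the two planners can be sketched in a few lines. This is a hypothetical reconstruction, not the system's actual code — the function name `build_worker_prompt` and the prompt layout are assumptions — but it shows the mechanism: read every existing file and embed it in the worker's instruction, so the worker reuses real element IDs instead of inventing new ones.

```python
from pathlib import Path

def build_worker_prompt(task: str, project_dir: str) -> str:
    """Assemble a worker instruction that carries the full current project state."""
    sections = [f"Task: {task}", "", "Existing project files:"]
    for path in sorted(Path(project_dir).iterdir()):
        # Read ALL three file types -- skipping CSS/JS is exactly what
        # caused the mismatched-ID failures in the earlier planner.
        if path.suffix in {".html", ".css", ".js"}:
            sections.append(f"--- {path.name} ---")
            sections.append(path.read_text())
    sections.append("Reuse the element IDs and filenames shown above exactly.")
    return "\n".join(sections)
```

With the current HTML in the prompt, "generate dark mode JavaScript" becomes a closed task: the button ID the worker must target is already in front of it.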

Opus 4.6 also reached step 9, with slightly more efficient plans — fewer steps to achieve the same outcome. It made different architectural choices (inline styles instead of a separate CSS file) but the results were functionally identical.

The costs for the two successful planners were close: Sonnet 4.6 at $1.19 per plan sequence versus Opus 4.6 at $1.34. Sonnet 4.5 was $2.59 — more expensive despite worse results, because failures triggered replanning cycles that burned additional tokens.

The key finding was that Qwen’s code quality is almost entirely dependent on the quality of instructions it receives. The worker model didn’t change. Its capabilities didn’t change. But when the planner provided proper context — existing code, element IDs, file structure — the output went from broken to working. The planner is the bottleneck, not the worker.

Sonnet 4.6 became the default planner. But the model upgrade exposed another problem.

The episodic memory system stores records of successful plans so the planner can reference them when handling similar future requests. During testing, failed attempts were being stored too — plans that produced broken dark mode toggles, plans that lost files, plans with hallucinated filenames. These records had high similarity scores to new requests, so they were being injected into the planner’s context as “examples to follow.” The planner was learning from its worst work.
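The failure mode boils down to ranking purely on similarity. A minimal sketch of the fix (field names are assumptions, not the system's actual schema) is to filter on a success flag before ranking, so a broken plan with a high similarity score can never outrank a working one:

```python
def retrieve_examples(records: list[dict], k: int = 3) -> list[dict]:
    """Return the top-k plans to inject as examples.

    Filter out failed runs BEFORE ranking by similarity -- similarity
    alone happily surfaces the planner's worst work as a template.
    """
    successful = [r for r in records if r.get("succeeded")]
    return sorted(successful, key=lambda r: r["similarity"], reverse=True)[:k]
```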

I purged the episodic memory entirely — 15 stale records and 6 strategy patterns, all contaminated by failed test runs. Then I switched to using Opus 4.6 to seed the memory with high-quality plans. Opus generates excellent plans with proper context gathering, comprehensive file handling, and correct variable references. Those plans become the episodic examples that Sonnet 4.6 references when handling similar requests.

The result is a two-tier model strategy. Opus 4.6 writes the canonical plans that populate episodic memory. Sonnet 4.6 handles day-to-day planning with those high-quality examples available as reference. Sonnet gets the benefit of Opus-level reasoning without paying Opus-level costs on every request. And the episodic memory is no longer polluted with failure patterns from testing.
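The two-tier arrangement can be sketched as two functions — the model names are real, but the function signatures and record fields are illustrative assumptions. The expensive model runs offline to seed memory; the cheaper model handles every live request with those seeds injected as examples:

```python
from typing import Callable

def seed_memory(memory: list[dict], task: str,
                plan_with_opus: Callable[[str], str]) -> None:
    """Offline, one-time: store a canonical plan from the stronger model."""
    memory.append({"task": task, "plan": plan_with_opus(task), "succeeded": True})

def plan_request(memory: list[dict], task: str,
                 plan_with_sonnet: Callable[[str], str]) -> str:
    """Online, every request: plan with the cheaper model, guided by seeds."""
    examples = "\n".join(r["plan"] for r in memory if r["succeeded"])
    prompt = f"Examples of good plans:\n{examples}\n\nTask: {task}"
    return plan_with_sonnet(prompt)
```

The point of the split is that the Opus call is amortized: it runs once per seeded task, while every subsequent similar request pays only Sonnet prices.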