The pieces were in place. Sonnet 4.6 as planner with high-quality episodic memory seeded by Opus. Deterministic keyword classifier for fast routing. The file_patch tool for incremental modifications. Time to see what it could actually do end-to-end.
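A deterministic keyword router of the kind described can be sketched in a few lines. The route names and keyword lists below are illustrative assumptions, not the system's actual rule set:

```python
# Hypothetical routes and keywords -- the real classifier's rules aren't
# shown in the text. The point is the mechanism: pure string matching,
# no model call on the hot path, same prompt always yields the same route.
ROUTES = {
    "build":  ("create", "build", "new page"),
    "modify": ("change", "add", "update", "restyle", "make the"),
    "search": ("search", "look up", "find out"),
}

def route(prompt: str) -> str:
    """Return the first route whose keyword appears in the prompt."""
    text = prompt.lower()
    for name, keywords in ROUTES.items():
        if any(k in text for k in keywords):
            return name
    return "build"  # default when nothing matches
```

Because the check is deterministic, routing decisions are fast and reproducible, which matters when every user prompt passes through it.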
The test was a multi-step website build. Start with a basic page — dark background, a live clock, the current date. Then iterate: change the layout, add features, restyle elements, integrate live data from web searches. Each prompt builds on the last. The kind of workflow where a person has a rough idea, starts building, and refines as they go.
Step 1 worked cleanly. The planner generated a plan to create three files — HTML, CSS, and JavaScript. The worker produced each file independently. The website tool assembled them and returned a live URL. Clock ticking, date displaying, clean layout. About 160 seconds from prompt to live site.
The interesting part was what happened next. “Add a countdown timer.” The planner read all three existing files before planning the modification. It found the JavaScript filename (not app.js — the hallucination problem from earlier sessions had been addressed in the prompt guidance). It identified the insertion points. The worker generated only the countdown logic. file_patch spliced it into the existing JavaScript without touching the clock code.
This is what wasn’t possible before. The old flow would have regenerated the entire JavaScript file — clock, date, and countdown — with the worker reproducing existing code from memory. Now the existing code is untouched. The worker only writes the new feature.
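The splice itself can be sketched as an anchor-based insert. The real file_patch interface isn't shown in the text, so the function signature and the anchor convention below are assumptions:

```python
# Minimal sketch of an anchor-based patch: insert new code immediately
# after the line containing the anchor, leaving every other line of the
# file byte-for-byte untouched.
def apply_patch(source: str, anchor: str, new_code: str) -> str:
    lines = source.splitlines()
    for i, line in enumerate(lines):
        if anchor in line:
            lines[i + 1:i + 1] = new_code.splitlines()
            break
    else:
        # Failing loudly beats patching the wrong spot.
        raise ValueError(f"anchor not found: {anchor!r}")
    return "\n".join(lines)

clock_js = (
    "function updateClock() {\n"
    "  // existing clock code\n"
    "}\n"
    "setInterval(updateClock, 1000);"
)
patched = apply_patch(
    clock_js,
    anchor="setInterval(updateClock, 1000);",
    new_code=(
        "function updateCountdown() {\n"
        "  // new countdown logic\n"
        "}\n"
        "setInterval(updateCountdown, 1000);"
    ),
)
```

The worker only has to produce `new_code`; the clock code is never regenerated, so it cannot be truncated or subtly rewritten.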
Styling changes worked the same way. “Make the background darker, change the accent colour.” The planner read the CSS, identified the relevant selectors, and directed targeted replacements. No risk of the worker reorganising the stylesheet or letting the colour values drift, a pattern that had surfaced repeatedly during real-world testing: colours would shift between updates as the worker chose slightly different hex values each time.
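A targeted CSS replacement can be sketched as an exact find/replace that refuses ambiguous matches, which is what rules out drift. The selectors and hex values here are illustrative, not taken from the actual site:

```python
# Replace exactly one occurrence; refuse missing or ambiguous matches so
# the patch can't silently alter some other rule in the stylesheet.
def replace_exact(css: str, old: str, new: str) -> str:
    count = css.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return css.replace(old, new)

css = "body { background: #1e1e2e; }\nh1 { color: #89b4fa; }"
css = replace_exact(css, "background: #1e1e2e;", "background: #11111b;")
css = replace_exact(css, "color: #89b4fa;", "color: #f38ba8;")
```

Because the old value must match verbatim, the planner's read of the current file is the single source of truth; the worker never gets a chance to pick a "close enough" hex value.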
Cross-tool chaining was where it got more interesting. “Search the web for the current weather and add it to the page.” The planner broke this into stages: web search for the data, then a file_patch to insert the results into the HTML. Two different tool types coordinated through the plan. The weather data came back, the planner identified where on the page it should go, the worker generated a weather display fragment, and file_patch inserted it after the correct element.
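The staged plan for that request might look roughly like the following. The tool names, field layout, and anchor are assumptions about the plan schema, not its actual format:

```python
# Hypothetical three-step plan: fetch data, render a fragment, splice it
# in. Each step's output feeds the next by name.
steps = [
    {"tool": "web_search", "query": "current weather",
     "output": "weather_data"},
    {"tool": "worker", "action": "render_fragment",
     "input": "weather_data", "output": "weather_html"},
    {"tool": "file_patch", "path": "index.html",
     "insert_after": '<div id="clock">', "content": "weather_html"},
]
```

The coordination lives entirely in the plan: neither tool knows about the other, and the planner is the only component that sees the whole chain.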
The failure patterns that had dominated earlier testing — content truncation, lost files, style drift, ID mismatches — were largely gone. Content truncation was eliminated by file_patch. File loss was eliminated by the planner reading all files before planning. Style drift was reduced by targeted CSS replacements instead of full regeneration. ID mismatches were eliminated by the planner providing existing IDs to the worker.
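Handing the worker real IDs presupposes extracting them first. The text only says the planner reads all files before planning, so the sketch below is one plausible mechanism, not the actual one:

```python
import re

# Pull element IDs out of the existing HTML so the worker can be given
# real anchors instead of inventing plausible-sounding ones.
def extract_ids(html: str) -> list[str]:
    return re.findall(r'id="([^"]+)"', html)

extract_ids('<div id="clock"></div><span id="date"></span>')
# -> ["clock", "date"]
```

Any ID the worker references then comes from this list, which is why mismatches disappear: there is nothing left to hallucinate.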
Not everything was solved. The planner still occasionally over-engineers — adding intermediate extraction steps that aren’t needed, or breaking a simple change into too many steps. Web search data quality is variable — some queries return clean structured data, others return fragments that the worker has to interpret. And the worker still produces the occasional syntax error that the code fixer catches.
But the core workflow works. Describe what you want. Watch it appear. Describe a change. Watch it apply to the existing site without breaking what’s already there. Describe another change. Each modification is incremental, targeted, and preserves everything else. That’s a genuine capability — not a demo that works once, but a workflow that handles iteration.
The system went from failing at step 8 of a 10-step sequence to handling open-ended iterative building. The individual changes — better planner model, deterministic classifier, incremental patching — each solved a specific problem. Together they shifted the system from “impressive but fragile” to something you can actually use.