The capability test that exposed the problem was straightforward. Build a website, populate six panels with live data, then add a status bar. Steps 1 through 7 worked: each panel was populated with real data from web searches, calendar, and email, and all six panels were preserved correctly through seven iterations. Then step 8 asked for a colour-coded status bar at the top.
The pipeline reported success. The site loaded. Every panel was empty. The HTML contained a single comment where six panels of content used to be. Qwen had summarised everything into a placeholder and generated the new feature around it.
Qwen had been asked to regenerate the entire file while adding the status bar. By step 8, the HTML was around 7KB with six populated panels. That’s beyond the worker’s reliable output capacity for faithful reproduction. Rather than reproducing 7KB of existing content plus the new status bar, Qwen summarised the existing content into a comment and generated the new feature. Technically correct from the model’s perspective — it was told to generate the full file and it produced valid HTML. Catastrophically wrong from the user’s perspective.
This wasn’t a fixable bug. It was an architectural limitation. Any system that requires full-file regeneration for every modification will eventually hit a file size where the worker can’t faithfully reproduce the existing content. The solution had to eliminate full-file regeneration entirely.
The file_patch tool works differently. The planner reads the existing file (trusted context, full content). It identifies the specific location where a change needs to happen by selecting a unique text anchor — a string that appears exactly once in the file. It then directs the worker to generate only the new fragment. The file_patch tool splices the fragment into the file at the anchor location. No LLM is involved in the splice — it’s pure string matching with validation.
Four operations: insert_after, insert_before, replace, and delete. Each one takes an anchor string that must be unique in the file, plus the new content for insert and replace operations. The tool validates that the anchor exists and appears exactly once before making any change. If the anchor is ambiguous or missing, it fails cleanly rather than modifying the wrong location.
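The core of the splice can be sketched in a few lines. This is a minimal illustration, not the actual implementation; the names apply_patch and PatchError are assumptions:

```python
class PatchError(Exception):
    """Raised when an anchor is missing or ambiguous."""

def apply_patch(text: str, op: str, anchor: str, content: str = "") -> str:
    # Validate before modifying anything: the anchor must appear exactly once.
    count = text.count(anchor)
    if count == 0:
        raise PatchError(f"anchor not found: {anchor!r}")
    if count > 1:
        raise PatchError(f"anchor ambiguous ({count} matches): {anchor!r}")
    i = text.index(anchor)
    end = i + len(anchor)
    # Pure string splicing -- no LLM involved at this point.
    if op == "insert_after":
        return text[:end] + content + text[end:]
    if op == "insert_before":
        return text[:i] + content + text[i:]
    if op == "replace":
        return text[:i] + content + text[end:]
    if op == "delete":
        return text[:i] + text[end:]
    raise PatchError(f"unknown operation: {op}")
```

Because validation happens before any mutation, an ambiguous or missing anchor fails cleanly and leaves the file untouched.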
Getting there was harder than expected.
The first problem was anchor uniqueness. HTML has a lot of repetition: <div class="panel"> might appear six times. The planner needs to select anchors that are genuinely unique, such as an element's ID attribute, a specific heading text, or a combination of tag and content. This required guidance in the planner prompt about what makes a good anchor versus a bad one.
CSS selectors were the initial approach for HTML files — use the DOM structure to identify elements. But HTML often has non-unique selectors, and the same outer HTML can appear in multiple places. The solution switched to positional resolution with source-level string matching. CSS selectors identify candidates, source positions disambiguate when there are multiple matches. Deterministic, no LLM involved in the resolution.
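The disambiguation step reduces to finding the Nth occurrence of a snippet in the source. A hedged sketch, where the function name and zero-based occurrence index are assumptions:

```python
def resolve_position(source: str, snippet: str, occurrence: int = 0) -> int:
    """Return the source offset of the Nth occurrence of snippet, or -1.

    Selector matching narrows the file down to candidate snippets; when the
    same outer HTML appears in several places, the occurrence index picks
    the intended one deterministically.
    """
    start = -1
    for _ in range(occurrence + 1):
        start = source.find(snippet, start + 1)
        if start == -1:
            return -1  # fewer occurrences than requested
    return start
```

The resolution is plain string scanning, so the same inputs always produce the same offset.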
The trust gate needed modification. The provenance system tracks files that Sentinel writes and tags them as trusted. But file_patch modifies files that already exist — site content that was created by an earlier step. The trust gate needed a destination-aware exemption: file_patch operations on site content directories are permitted because the content originated from the system, even though the specific file being patched wasn’t written by the current step.
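The exemption can be sketched as a destination check layered on top of the normal provenance rule. SITE_CONTENT_ROOT and is_patch_allowed are hypothetical names for illustration, assuming provenance is tracked as a set of trusted paths:

```python
from pathlib import Path

SITE_CONTENT_ROOT = Path("/srv/site/content")  # assumed location of site content

def is_patch_allowed(target: Path, trusted_files: set[Path]) -> bool:
    # Normal provenance rule: files written by the system are trusted.
    if target in trusted_files:
        return True
    # Destination-aware exemption: patches inside the site content tree are
    # permitted, because that content originated from the system even if the
    # specific file was written by an earlier step.
    return SITE_CONTENT_ROOT in target.resolve().parents
```

The key property is that the exemption keys on where the patch lands, not on which step wrote the file.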
Continuation plans broke in a way that took time to diagnose. When a plan is too complex for a single pass, the planner generates a continuation — a follow-up plan that picks up where the first one left off. But continuation plans couldn’t reference output variables from the completed steps. The replan context was being generated without the variable names from earlier execution, so the continuation planner had no way to reference the file content that had already been read. The fix was to inject output variable names into the replan context.
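The shape of the fix can be illustrated as follows; the dict layout and field names are assumptions about the plan format, not the actual schema:

```python
def build_replan_context(completed_steps: list[dict]) -> dict:
    """Collect output variable names from completed steps into the replan context."""
    output_vars = {}
    for step in completed_steps:
        for name in step.get("outputs", []):
            output_vars[name] = step["id"]
    return {
        "completed_step_ids": [s["id"] for s in completed_steps],
        # Without this map, the continuation planner has no way to reference
        # file content that earlier steps already read.
        "available_variables": output_vars,
    }
```

With the variable names injected, a continuation plan can refer to an earlier step's output by name instead of re-reading or regenerating it.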
The tool was test-driven — test suite written first (red), then implementation until green. The tests cover anchor validation, all four operations, edge cases around whitespace and newline handling, and the trust gate integration.
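The anchor-validation tests look roughly like this; validate_anchor is a hypothetical stand-in for the tool's check, and the test names are illustrative:

```python
def validate_anchor(text: str, anchor: str) -> "str | None":
    """Return an error message, or None if the anchor is usable."""
    n = text.count(anchor)
    if n == 0:
        return "anchor not found"
    if n > 1:
        return f"anchor ambiguous: {n} matches"
    return None

def test_unique_anchor_passes():
    assert validate_anchor("<h1>Title</h1>", "<h1>") is None

def test_missing_anchor_fails():
    assert validate_anchor("<p>body</p>", "<h1>") == "anchor not found"

def test_repeated_anchor_fails():
    # <li> appears twice, so it cannot safely identify a patch location.
    msg = validate_anchor("<li>a</li><li>b</li>", "<li>")
    assert msg is not None and msg.startswith("anchor ambiguous")
```

Writing these first (red) pinned down the failure modes before any splicing logic existed.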
The impact on capability was immediate. Iterative website building that previously failed at step 8 due to content truncation now works through arbitrary numbers of iterations. Each modification touches only the changed fragment. A 50KB HTML file gets the same treatment as a 500-byte one — the worker only generates the new content, never the existing content.
It also changed how the planner thinks about modifications. Instead of “regenerate the file with this change,” the plan becomes “read the file, find the weather panel, insert the new data after the panel header.” More specific, more predictable, and fundamentally bounded in output size regardless of file complexity.