The benchmarks were comprehensive. 1,588 probes across 38 hours in Run 16. 107 injection tests against real infrastructure. The numbers looked good. But benchmarks test what you designed them to test. I hadn’t tested what happens when a human sits at the keyboard and types normal requests.
I started with a simple sequence. Create a website. Add a clock. Add the date. Style it. Add a dark mode toggle. Add a countdown timer. The kind of iterative building that a person would actually do — start simple, add features one at a time, adjust as you go.
The first thing that broke was invisible. The website appeared, the clock showed 00:00:00, and it never updated. No errors, no warnings, just a frozen clock. The cause: the Content Security Policy header blocks inline JavaScript. The <script> tag in the HTML was silently ignored. Every benchmark test that generated JavaScript had the same issue; the output had simply never been rendered through the website tool before.
The fix was to add CSP constraints to the planner’s system prompt. JavaScript has to go in separate .js files, not inline <script> tags. Once the planner knew the constraint, Qwen generated the code correctly.
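A pipeline-side lint would have caught this class of failure before it reached the browser. The sketch below assumes a Python pipeline and an illustrative helper name; the CSP policy (`script-src 'self'`) rejects inline script bodies, so generated HTML must reference separate .js files instead.

```python
import re

# Matches a <script> tag that has an inline body rather than a src=
# attribute. Under a script-src 'self' CSP, such scripts are silently
# ignored by the browser, which is exactly the frozen-clock failure.
INLINE_SCRIPT = re.compile(
    r"<script(?![^>]*\bsrc=)[^>]*>.*?</script>",
    re.IGNORECASE | re.DOTALL,
)

def violates_csp(html: str) -> bool:
    """Return True if the HTML contains an inline <script> block."""
    return bool(INLINE_SCRIPT.search(html))
```

Run against generated output, this flags `<script>...</script>` bodies while passing `<script src="clock.js"></script>`, mirroring the constraint that was added to the planner's system prompt.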
The second problem was worse. “Add a clock to the site” created a second website instead of updating the first. The planner didn’t know the site already existed because the successful turn summaries didn’t include the site ID or URL. Every new request looked like a fresh start. The fix was to enrich step outcomes with site metadata so the planner could see what already existed.
Then there were the spotlighting markers. The pipeline uses markers to tag untrusted content — invisible characters that help track data provenance. They’d been implemented correctly and were being applied correctly. But the function to strip them from output had never been wired into the pipeline. It had never mattered before because previous use cases never asked Qwen to copy content from prior outputs. The moment it did, corrupted characters appeared throughout the HTML. A latent bug, sitting there since the feature was built, surfacing only under real-world usage patterns.
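The missing wiring amounts to one small function that was never called. In this sketch, zero-width Unicode characters stand in for the actual (unspecified) spotlighting markers, and the function name is illustrative.

```python
# Assumed marker characters -- the real pipeline's markers may differ.
SPOTLIGHT_MARKERS = {"\u200b", "\u200c"}

def strip_markers(text: str) -> str:
    """Remove provenance markers before text is reused in new output.

    The markers are invisible in a terminal, which is why the bug was
    latent: output looked fine until copied content was re-rendered as
    HTML and the stray characters corrupted it.
    """
    return "".join(ch for ch in text if ch not in SPOTLIGHT_MARKERS)
```

The lesson is less about the function than about the call site: tagging untrusted content is only half the feature; the strip step has to sit on every path where tagged content can flow back into generated output.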
Qwen’s code generation had its own issues. innerHTML was blocked by the security scanner even when used safely; the scanner can’t distinguish safe from unsafe uses of a DOM API that is inherently risky, so the planner prompt got guidance to use textContent instead. Qwen also put semicolons where commas belonged in JavaScript object literals, and the code fixer caught some of these errors but not all.
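The scanner's behavior is easy to illustrate: it flags the API by name, not by how the assigned value was produced. This is a simplified stand-in, not the real scanner, written under the assumption of a Python pipeline.

```python
def flag_risky_dom_calls(js_source: str) -> list[str]:
    """Flag every use of innerHTML, safe or not.

    A textual scanner cannot trace where the assigned string came from,
    so it must treat the sink itself as the risk and steer generation
    toward textContent instead.
    """
    findings = []
    for lineno, line in enumerate(js_source.splitlines(), start=1):
        if "innerHTML" in line:
            findings.append(f"line {lineno}: innerHTML (use textContent)")
    return findings
```

A static-string assignment like `el.innerHTML = 'hi'` is perfectly safe, yet it gets flagged all the same; that over-blocking is why the guidance moved into the planner prompt rather than into the scanner.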
The most persistent problem was filename hallucination. The planner would assume JavaScript files were named app.js because that’s what appeared in the examples it had been given, while the actual filename on the site was different. Rather than discovering the real filename, the planner would confidently plan around a file that didn’t exist. This one still isn’t fully fixed; the workaround is prompting the planner to check the site’s file list first.
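The check-first workaround can also be enforced mechanically rather than left to the prompt. This is a hypothetical resolver, assuming a Python pipeline that can list a site's files before planning an edit.

```python
def resolve_script_file(planned_name: str, site_files: list[str]) -> str:
    """Validate a planned filename against the site's real file list.

    If the planner's guess (e.g. app.js) doesn't exist but exactly one
    .js file does, use the real file; otherwise fail loudly instead of
    letting the plan proceed against a file that isn't there.
    """
    if planned_name in site_files:
        return planned_name
    js_files = [f for f in site_files if f.endswith(".js")]
    if len(js_files) == 1:
        return js_files[0]
    raise FileNotFoundError(f"{planned_name} not on site; found {js_files}")
```

This turns a confident hallucination into either a silent correction (one candidate) or a hard error (zero or several), both of which beat planning around a nonexistent file.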
The gap between programmatic testing and real usage was bigger than expected. The benchmarks tested security boundaries, adversarial resistance, functional correctness. They didn’t test the mundane reality of a user building something incrementally and expecting each step to preserve what came before. That’s a different kind of correctness — not “does it block attacks” but “does it do what I asked without breaking what’s already there.”
Nine distinct issues surfaced in a single afternoon of manual testing. All of them were in the planner’s reasoning and the worker’s code generation, not in the security or orchestration layers. The architecture was sound. The instructions were the problem.