The TL4 benchmark was supposed to be the final gate before activating autonomous mode. 1,136 adversarial prompts, the most comprehensive test run yet. I kicked it off and let it run overnight.
I checked in at around 800 prompts and noticed something was wrong. The multi-step false positive rate was 62%. Not 6.2%: sixty-two percent. Nearly two-thirds of legitimate multi-step plans were being blocked by the scanners.
Every single false positive was on step 2 or later. Zero on step 1. The pattern was unmistakable.
The ascii gate — a scanner I’d built to catch non-Latin character injection — was doing its job correctly on the initial prompt but then also checking the resolved prompt after variable substitution. When Qwen’s output from step 1 was injected into step 2’s prompt via $step1_output, the gate saw CJK characters in Qwen’s generated code comments and blocked the entire step.
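The failure mode is easy to reproduce in miniature. Here is a minimal sketch, not the real scanner code: the function names (`ascii_gate`, `resolve`) and the `$step1_output` substitution mechanism are assumptions based on the description above.

```python
def ascii_gate(prompt: str) -> bool:
    """Hypothetical gate: pass only if the prompt is pure ASCII."""
    return prompt.isascii()

def resolve(template: str, outputs: dict[str, str]) -> str:
    """Substitute $name-style variables with prior step outputs (sketch)."""
    for name, value in outputs.items():
        template = template.replace(f"${name}", value)
    return template

# The author-written step-2 template is pure ASCII, so it passes.
step2_template = "Review this code:\n$step1_output"
assert ascii_gate(step2_template)

# But Qwen's step-1 output carries a CJK code comment...
step1_output = "def add(a, b):\n    # 返回两数之和\n    return a + b"

# ...so the *resolved* step-2 prompt fails the gate: a false positive.
resolved = resolve(step2_template, {"step1_output": step1_output})
assert not ascii_gate(resolved)
```

Nothing malicious was injected; the gate simply could not distinguish the pipeline's own trusted model output from attacker-controlled text once substitution had flattened them into one string.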
I stopped the benchmark at 802 of 1,136 prompts. Running more prompts against a known regression would just produce more data points confirming the same bug.
The fix was clean: skip the ascii gate for chained steps with variable substitution. But the deeper fix was the set of corrective measures that followed: restructuring the planner prompt, redesigning the worker prompt, adding constraint-gated provenance bypass. I spent the time not just fixing the bug but making sure that category of bug couldn't recur.
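The first-order fix can be sketched as follows. This is a hedged illustration, not the actual patch: the variable-reference pattern and the `gate_step` helper are mine, and the real provenance logic is presumably richer than a template/resolved split.

```python
import re

VAR = re.compile(r"\$\w+")  # assumed $step1_output-style variable syntax

def ascii_gate(prompt: str) -> bool:
    """Hypothetical gate: pass only if the prompt is pure ASCII."""
    return prompt.isascii()

def gate_step(template: str, resolved: str) -> bool:
    """Run the ascii gate on the right surface for each step."""
    if VAR.search(template):
        # Chained step: the substituted text is trusted model output,
        # so scan only the author-written template, not the resolved prompt.
        return ascii_gate(template)
    # Unchained step (e.g. step 1): scan the prompt as before.
    return ascii_gate(resolved)

# Step 1, no variables: behavior unchanged, CJK injection still blocked.
assert not gate_step("执行这个", "执行这个")
# Step 2, chained: CJK in the substituted output no longer trips the gate.
assert gate_step("Review:\n$step1_output", "Review:\n# 返回两数之和")
```

The key design choice is that the decision to skip keys off provenance (did substitution happen?) rather than content, so the gate still fires on anything the user actually typed.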