Writing attack scripts against your own system is a strange experience. You’re simultaneously the defender and the attacker, trying to break the thing you’ve spent weeks hardening. It requires a kind of deliberate adversarial thinking that doesn’t come naturally to someone who built the defences.

I structured it into four scenarios. B1 tests the worker — can a compromised Qwen output manipulate the scanning pipeline? B1.5 tests cross-model confusion — can crafted output make the system confuse which model said what? B2 is the scariest one — what if the planner itself is compromised? And B3 tests the external perimeter — can someone reach the internal services from outside?

B2 required building a special test endpoint that simulates a compromised Claude. It generates plans with hidden malicious steps, embedded data exfiltration attempts, constraint laundering chains. The endpoint had to be wired into the system, tests run, then the endpoint removed before going to production. You don’t leave a “compromised planner simulator” running in your live system.

The red team runs became the project’s quality gate. No S0 (breach) or S1 (exploitable leak) across all scenarios, all runs. That was the bar for TL4 activation. Six consecutive clean runs before I’d trust it.

The thing about red teaming is that it changes how you think about every feature. You stop asking “does this work?” and start asking “how would I exploit this?” Once you’re in that mindset, gaps become visible that you’d never notice from the builder’s perspective.