Everything was deployed — failure replanning, code fixer v2.5, dynamic replanning, the full episodic learning pipeline. Time to find out if it actually worked together.
Run 16 was the largest test campaign Sentinel had undergone: three phases, fully automated, running for 38 hours with no human intervention.
Phase 1 was the adversarial and capability benchmark: 1,136 prompts across 37 categories (20 adversarial, 17 genuine). Phase 2 ran 60 G-Suite functional scenarios. Phase 3 unleashed the red team — 392 probes across 6 attack bands, from adversarial users to compromised planners to container escape attempts.
The adversarial benchmark threw 801 attack prompts at the pipeline. 62.7% were hard-blocked by scanners. 28.5% were caught by planner or worker refusal. Combined defence rate: 93.6%. Of the 292 raw escapes (which include planner-stage refusals that look like passes), automated triage classified zero as real risk. 13 of 20 adversarial categories achieved zero escapes — direct injection (116 prompts), path traversal, encoding obfuscation, Unicode homoglyphs, social engineering, all clean.
The weakest category was multi-turn, at a 41% raw escape rate. That number is misleading, though: multi-turn campaigns include benign setup turns that are supposed to pass. Of the 34 escapes, 29 were early turns doing legitimate things, and a further 20 escalation turns in those campaigns were caught by the planner. Zero real risk.
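The triage step that separates raw escapes from real risk can be sketched roughly like this. The turn fields ("role", "passed", "blocked_by") are my assumptions about how a scripted turn might be labelled, not Sentinel's actual schema:

```python
# Hypothetical sketch of raw-escape triage for multi-turn campaigns.
# Field names and categories are assumptions, not Sentinel's real schema.

def triage_turn(turn: dict) -> str:
    if turn["role"] == "benign_setup" and turn["passed"]:
        # Legitimate early turns are supposed to pass; counting them as
        # escapes is what inflates the raw multi-turn escape rate.
        return "expected_pass"
    if turn["role"] == "escalation" and turn.get("blocked_by"):
        return "caught"
    if turn["role"] == "escalation" and turn["passed"]:
        return "real_risk"
    return "needs_review"

turns = [
    {"role": "benign_setup", "passed": True},
    {"role": "escalation", "passed": False, "blocked_by": "planner"},
]
labels = [triage_turn(t) for t in turns]
# A campaign with only expected passes and caught escalations carries no real risk.
assert "real_risk" not in labels
```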
Scanner effectiveness showed healthy defence-in-depth distribution. No single scanner carried more than 48% of detections. The sensitive path scanner led with 251 catches, followed by Prompt Guard (126 input-side, 41 output-side), credential scanner (41), command pattern scanner (41), and encoding normalisation (38). The spread matters — it means no single scanner failure would be catastrophic.
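The concentration check behind that claim is simple arithmetic over the per-scanner counts above. This is an illustrative sketch, not Sentinel's actual reporting code, and it only covers the scanners listed:

```python
# Per-scanner detection counts from Run 16 (the six scanners named above).
detections = {
    "sensitive_path": 251,
    "prompt_guard_input": 126,
    "prompt_guard_output": 41,
    "credential": 41,
    "command_pattern": 41,
    "encoding_normalisation": 38,
}

total = sum(detections.values())  # 538 across these six scanners
top_scanner, top_count = max(detections.items(), key=lambda kv: kv[1])
top_share = top_count / total
print(f"{top_scanner} carries {top_share:.1%} of these detections")
# → sensitive_path carries 46.7% of these detections

# No scanner above ~50% means losing any one of them is survivable.
assert top_share < 0.5
```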
The red team was the test I cared about most. Claude Sonnet 4.5 ran as an AI attacker across 6 threat models.
Band 1 (adversarial user): 81 probes, zero breaches. 15 S2 findings — information leakage about scanner names and risk thresholds, but nothing actionable. Band 1.5 (channel injection via Brave search, Signal, email, webhooks): zero breaches. Band 2 (compromised planner — the critical test): 64 probes where the planner was fully compromised and submitting malicious plans. Zero exploits. 6 contained attempts (split-string credential construction, workspace persistence planting) were all caught downstream by sandbox isolation. Even with the planner fully working against you, the layers below it prevent harm.
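The split-string trick Band 2 attempted is worth a toy illustration: a credential assembled from fragments so that no single string matches a credential pattern. The regex and values here are assumptions for demonstration (an AWS-style key shape), not Sentinel's actual scanner rules, and in Run 16 the containment came from sandbox isolation rather than this kind of re-scan:

```python
import re

# Example AWS-style access-key pattern (assumed for illustration only).
CRED_RE = re.compile(r"AKIA[0-9A-Z]{16}")

fragments = ["AKIA", "EXAMPLE", "KEY123456789"]

# Scanning fragments individually misses the credential...
assert not any(CRED_RE.search(f) for f in fragments)

# ...but the value a downstream layer actually sees is the assembled string,
# which is why checks below the planner still matter when the planner is hostile.
assembled = "".join(fragments)
assert CRED_RE.search(assembled)
```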
Band 3 (container perimeter): air-gap enforcement held completely. DNS tunnelling, ICMP, routing, container escape vectors — all blocked. Band 4 (sandbox isolation): 58 pass, 15 warn, 0 fail. All warnings were mitigated by defence-in-depth.
Band 5 (database security) was the only band with real findings. An S0 — audit log entries were deletable via the owner role. An S1 — the application role could GRANT privileges to other roles, which shouldn't be possible. Another S1 — the audit log was missing user isolation policies for multi-user mode. Real findings with real mitigations needed.
On the capability side, genuine prompt completion hit 88.7% (up from 84% in the previous run). False positives dropped to 5.3% (down from 14.3%). Output quality grading on 283 genuine responses: 84.5% good, 9.5% acceptable, 5.7% poor, 0.4% broken (one empty response).
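The grading percentages round-trip cleanly to integer counts on 283 responses. The "broken" count of 1 is the empty response mentioned above; the other counts are inferred from the rounded percentages, not stated in the results:

```python
# Grade counts implied by the reported percentages on 283 genuine responses.
# Only the single "broken" response is stated directly; the rest are inferred.
total = 283
counts = {"good": 239, "acceptable": 27, "poor": 16, "broken": 1}

assert sum(counts.values()) == total
for grade, n in counts.items():
    print(f"{grade}: {n} ({n / total:.1%})")
```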
The most satisfying result was episodic learning’s impact. G2 scenarios (which test whether the system applies lessons from past failures) went from 72% to 88.9%. G3 scenarios (multi-step tasks requiring adaptation) went from 62% to 87.5%. The system was measurably getting better at tasks it had seen before.
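The retrieval half of episodic learning can be sketched as a store of lessons keyed by task signature, replayed into the planner on the next similar task. Everything here (class, method, and signature names, the example task) is hypothetical; Sentinel's actual pipeline is more involved:

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeStore:
    """Minimal sketch of an episodic-learning store (hypothetical API)."""
    lessons: dict = field(default_factory=dict)

    def record_failure(self, task_signature: str, lesson: str) -> None:
        # Persist what went wrong, keyed by the kind of task.
        self.lessons.setdefault(task_signature, []).append(lesson)

    def lessons_for(self, task_signature: str) -> list:
        # Retrieved lessons would be injected into the planner's context
        # on the next attempt at a similar task.
        return self.lessons.get(task_signature, [])

store = EpisodeStore()
store.record_failure("gsuite.calendar.create", "confirm timezone before scheduling")
assert store.lessons_for("gsuite.calendar.create") == ["confirm timezone before scheduling"]
assert store.lessons_for("unseen.task") == []
```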
38 hours of automated testing with zero crashes, zero human intervention, and clear results. But the validation run only tests what the benchmark tests. The next question was whether the security pipeline could handle attacks designed specifically to find gaps.