Part 39: Real-World Testing
Programmatic benchmarks said the system worked. Typing real prompts told a different story.
The injection benchmark found 11 exploits. All shared the same root cause — files in the workspace inherited trusted status regardless of who put them there.
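That root cause can be sketched in a few lines. This is a hypothetical illustration, not the project's actual code: the flawed check trusts anything by location, while a safer check trusts only files with known provenance (function and variable names here are assumptions for illustration).

```python
# Illustrative sketch of the root cause: trust derived from file
# location instead of file provenance.

def is_trusted_by_location(path: str, workspace: str) -> bool:
    # Flawed: anything inside the workspace is trusted,
    # regardless of who wrote it there.
    return path.startswith(workspace + "/")

def is_trusted_by_provenance(path: str, provenance: dict) -> bool:
    # Safer: trust only files whose recorded origin is the user.
    return provenance.get(path) == "user"

workspace = "/home/agent/workspace"
provenance = {
    "/home/agent/workspace/notes.md": "user",
    "/home/agent/workspace/fetched.html": "web",  # attacker-controlled content
}

# The location check trusts the attacker-written file; the provenance check does not.
attacker_file = "/home/agent/workspace/fetched.html"
print(is_trusted_by_location(attacker_file, workspace))    # True  (the bug)
print(is_trusted_by_provenance(attacker_file, provenance)) # False (the fix)
```

The exploits all reduce to the first function: once "inside the workspace" implies "trusted", any tool that writes attacker-controlled content into the workspace launders it into the trusted tier.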
A custom-built injection benchmark with real email, real calendars, real web pages. No simulated backends. 130 tests designed to break the trust architecture.
38 hours, 1,588 probes, zero human intervention. The first comprehensive validation with everything deployed.
The features were built. Now came the hardening — false-positive reduction, credential scanner expansion, metadata enrichment, and 600 new tests before the big run.
The third full security audit. 13 batches of fixes, from API hardening to dead code removal.
199 findings across 7 units. 19 fix batches. 7 systemic improvements. The most thorough review the codebase has ever had.
Four attack scenarios, including a simulated compromised planner. Six clean runs before trusting it.
1,136 adversarial prompts. A 62% false positive rate on multi-step plans. The ASCII gate was the culprit.
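The failure mode is easy to demonstrate. A minimal sketch, assuming the gate simply rejects any non-ASCII byte (the function name and example strings are illustrative, not from the project):

```python
# Illustrative sketch: a strict ASCII gate flags any non-ASCII character
# as suspicious. Benign multi-step plans often contain typographic
# characters, which inflates the false positive rate.

def ascii_gate(text: str) -> bool:
    """Return True if the text passes the gate (pure ASCII)."""
    return text.isascii()

benign_plan = "Step 1: fetch the page \u2192 Step 2: summarize it"  # U+2192 arrow
print(ascii_gate(benign_plan))            # False: flagged, a false positive
print(ascii_gate("Step 1: fetch page"))   # True: plain ASCII passes
```

A single curly quote, arrow, or accented name anywhere in a long multi-step plan fails the whole plan, which is how the rate climbed to 62%.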
99 findings from a systematic security audit of my own code. Zero critical — but 16 high-severity.
A security system that scores 3/5 on security is failing. The score became a to-do list.
741 attack prompts, run overnight. The results changed the entire direction of the project.