Part 39: Real-World Testing
Programmatic benchmarks said the system worked. Typing real prompts told a different story.
The injection benchmark found 11 exploits. All shared the same root cause — files in the workspace inherited trusted status regardless of who put them there.
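That root cause can be sketched in a few lines. This is a hypothetical illustration, not the project's actual code: the flawed check trusts anything by location, while a safer check trusts only files with known provenance (function and variable names here are assumptions for illustration).

```python
# Illustrative sketch of the root cause: trust derived from file
# location instead of file provenance.

def is_trusted_by_location(path: str, workspace: str) -> bool:
    # Flawed: anything inside the workspace is trusted,
    # regardless of who wrote it there.
    return path.startswith(workspace + "/")

def is_trusted_by_provenance(path: str, provenance: dict) -> bool:
    # Safer: trust only files whose recorded origin is the user.
    return provenance.get(path) == "user"

workspace = "/home/agent/workspace"
provenance = {
    "/home/agent/workspace/notes.md": "user",
    "/home/agent/workspace/fetched.html": "web",  # attacker-controlled content
}

# The location check trusts the attacker-written file; the provenance check does not.
attacker_file = "/home/agent/workspace/fetched.html"
print(is_trusted_by_location(attacker_file, workspace))    # True  (the bug)
print(is_trusted_by_provenance(attacker_file, provenance)) # False (the fix)
```

The exploits all reduce to the first function: once "inside the workspace" implies "trusted", any tool that writes attacker-controlled content into the workspace launders it into the trusted tier.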
A custom-built injection benchmark with real email, real calendars, real web pages. No simulated backends. 130 tests designed to break the trust architecture.
38 hours, 1,588 probes, zero human intervention. The first comprehensive validation with everything deployed.
The features were built. Now came the hardening — false-positive reduction, credential scanner expansion, metadata enrichment, and 600 new tests before the big run.
The third full security audit. 13 batches of fixes, from API hardening to dead code removal.
199 findings across 7 units. 19 fix batches. 7 systemic improvements. The most thorough review the codebase has ever had.
Four attack scenarios, including a simulated compromised planner. Six clean runs before trusting it.
1,136 adversarial prompts. A 62% false positive rate on multi-step plans. The ASCII gate was the culprit.
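The failure mode is easy to demonstrate. A minimal sketch, assuming the gate simply rejects any non-ASCII byte (the function name and example strings are illustrative, not from the project):

```python
# Illustrative sketch: a strict ASCII gate flags any non-ASCII character
# as suspicious. Benign multi-step plans often contain typographic
# characters, which inflates the false positive rate.

def ascii_gate(text: str) -> bool:
    """Return True if the text passes the gate (pure ASCII)."""
    return text.isascii()

benign_plan = "Step 1: fetch the page \u2192 Step 2: summarize it"  # U+2192 arrow
print(ascii_gate(benign_plan))            # False: flagged, a false positive
print(ascii_gate("Step 1: fetch page"))   # True: plain ASCII passes
```

A single curly quote, arrow, or accented name anywhere in a long multi-step plan fails the whole plan, which is how the rate climbed to 62%.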
99 findings from a systematic security audit of my own code. Zero critical — but 16 high-severity.
A security system that scores 3/5 on security is failing. The score became a to-do list.
741 attack prompts, run overnight. The results changed the entire direction of the project.