Testing on Sentinel

Recent content in Testing on Sentinel
Sun, 08 Feb 2026
https://sentinel-blog.cherrypod.org/tags/testing/

Part 39: Real-World Testing
Sun, 08 Feb 2026
https://sentinel-blog.cherrypod.org/posts/39-real-world-testing/
Programmatic benchmarks said the system worked. Typing real prompts told a different story.

Part 37: Trust Laundering
Fri, 06 Feb 2026
https://sentinel-blog.cherrypod.org/posts/37-trust-laundering/
The injection benchmark found 11 exploits. All shared the same root cause: files in the workspace inherited trusted status regardless of who put them there.

Part 36: The Injection Benchmark
Thu, 05 Feb 2026
https://sentinel-blog.cherrypod.org/posts/36-the-injection-benchmark/
A custom-built injection benchmark with real email, real calendars, real web pages. No simulated backends. 130 tests designed to break the trust architecture.

Part 35: The Stress Test
Wed, 04 Feb 2026
https://sentinel-blog.cherrypod.org/posts/35-the-stress-test/
38 hours, 1,588 probes, zero human intervention. The first comprehensive validation with everything deployed.

Part 34: Tightening the Screws
Tue, 03 Feb 2026
https://sentinel-blog.cherrypod.org/posts/34-tightening-the-screws/
The features were built. Now came the hardening: FP reduction, credential scanner expansion, metadata enrichment, and 600 new tests before the big run.

Part 28: Bug Hunt Three
Wed, 28 Jan 2026
https://sentinel-blog.cherrypod.org/posts/28-bug-hunt-three/
The third full security audit. 13 batches of fixes, from API hardening to dead code removal.

Part 21: The Second Audit
Wed, 21 Jan 2026
https://sentinel-blog.cherrypod.org/posts/21-the-second-audit/
199 findings across 7 units. 19 fix batches. 7 systemic improvements. The most thorough review the codebase has ever had.

Part 15: The Red Team
Thu, 15 Jan 2026
https://sentinel-blog.cherrypod.org/posts/15-the-red-team/
Four attack scenarios, including a simulated compromised planner. Six clean runs before trusting it.

Part 14: The Benchmark That Broke Everything
Wed, 14 Jan 2026
https://sentinel-blog.cherrypod.org/posts/14-the-benchmark-that-broke-everything/
1,136 adversarial prompts. A 62% false positive rate on multi-step plans. The ASCII gate was the culprit.

Part 7: The Bug Hunt
Wed, 07 Jan 2026
https://sentinel-blog.cherrypod.org/posts/07-the-bug-hunt/
99 findings from a systematic security audit of my own code. Zero critical, but 16 high-severity.

Part 5: Three Out of Five
Mon, 05 Jan 2026
https://sentinel-blog.cherrypod.org/posts/05-three-out-of-five/
A security system that scores 3/5 on security is failing. The score became a to-do list.

Part 4: The First Real Test
Sun, 04 Jan 2026
https://sentinel-blog.cherrypod.org/posts/04-the-first-real-test/
741 attack prompts, run overnight. The results changed the entire direction of the project.