<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Testing on Sentinel</title><link>https://sentinel-blog.cherrypod.org/tags/testing/</link><description>Recent content in Testing on Sentinel</description><image><title>Sentinel</title><url>https://sentinel-blog.cherrypod.org/images/social-preview.png</url><link>https://sentinel-blog.cherrypod.org/images/social-preview.png</link></image><generator>Hugo -- 0.147.0</generator><language>en-gb</language><lastBuildDate>Sun, 08 Feb 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://sentinel-blog.cherrypod.org/tags/testing/index.xml" rel="self" type="application/rss+xml"/><item><title>Part 39: Real-World Testing</title><link>https://sentinel-blog.cherrypod.org/posts/39-real-world-testing/</link><pubDate>Sun, 08 Feb 2026 00:00:00 +0000</pubDate><guid>https://sentinel-blog.cherrypod.org/posts/39-real-world-testing/</guid><description>Programmatic benchmarks said the system worked. Typing real prompts told a different story.</description></item><item><title>Part 37: Trust Laundering</title><link>https://sentinel-blog.cherrypod.org/posts/37-trust-laundering/</link><pubDate>Fri, 06 Feb 2026 00:00:00 +0000</pubDate><guid>https://sentinel-blog.cherrypod.org/posts/37-trust-laundering/</guid><description>The injection benchmark found 11 exploits. All shared the same root cause — files in the workspace inherited trusted status regardless of who put them there.</description></item><item><title>Part 36: The Injection Benchmark</title><link>https://sentinel-blog.cherrypod.org/posts/36-the-injection-benchmark/</link><pubDate>Thu, 05 Feb 2026 00:00:00 +0000</pubDate><guid>https://sentinel-blog.cherrypod.org/posts/36-the-injection-benchmark/</guid><description>A custom-built injection benchmark with real email, real calendars, real web pages. No simulated backends. 
130 tests designed to break the trust architecture.</description></item><item><title>Part 35: The Stress Test</title><link>https://sentinel-blog.cherrypod.org/posts/35-the-stress-test/</link><pubDate>Wed, 04 Feb 2026 00:00:00 +0000</pubDate><guid>https://sentinel-blog.cherrypod.org/posts/35-the-stress-test/</guid><description>38 hours, 1,588 probes, zero human intervention. The first comprehensive validation with everything deployed.</description></item><item><title>Part 34: Tightening the Screws</title><link>https://sentinel-blog.cherrypod.org/posts/34-tightening-the-screws/</link><pubDate>Tue, 03 Feb 2026 00:00:00 +0000</pubDate><guid>https://sentinel-blog.cherrypod.org/posts/34-tightening-the-screws/</guid><description>The features were built. Now came the hardening — FP reduction, credential scanner expansion, metadata enrichment, and 600 new tests before the big run.</description></item><item><title>Part 28: Bug Hunt Three</title><link>https://sentinel-blog.cherrypod.org/posts/28-bug-hunt-three/</link><pubDate>Wed, 28 Jan 2026 00:00:00 +0000</pubDate><guid>https://sentinel-blog.cherrypod.org/posts/28-bug-hunt-three/</guid><description>The third full security audit. 13 batches of fixes, from API hardening to dead code removal.</description></item><item><title>Part 21: The Second Audit</title><link>https://sentinel-blog.cherrypod.org/posts/21-the-second-audit/</link><pubDate>Wed, 21 Jan 2026 00:00:00 +0000</pubDate><guid>https://sentinel-blog.cherrypod.org/posts/21-the-second-audit/</guid><description>199 findings across 7 units. 19 fix batches. 7 systemic improvements. The most thorough review the codebase has ever had.</description></item><item><title>Part 15: The Red Team</title><link>https://sentinel-blog.cherrypod.org/posts/15-the-red-team/</link><pubDate>Thu, 15 Jan 2026 00:00:00 +0000</pubDate><guid>https://sentinel-blog.cherrypod.org/posts/15-the-red-team/</guid><description>Four attack scenarios, including a simulated compromised planner. 
Six clean runs before trusting it.</description></item><item><title>Part 14: The Benchmark That Broke Everything</title><link>https://sentinel-blog.cherrypod.org/posts/14-the-benchmark-that-broke-everything/</link><pubDate>Wed, 14 Jan 2026 00:00:00 +0000</pubDate><guid>https://sentinel-blog.cherrypod.org/posts/14-the-benchmark-that-broke-everything/</guid><description>1,136 adversarial prompts. A 62% false positive rate on multi-step plans. The ASCII gate was the culprit.</description></item><item><title>Part 7: The Bug Hunt</title><link>https://sentinel-blog.cherrypod.org/posts/07-the-bug-hunt/</link><pubDate>Wed, 07 Jan 2026 00:00:00 +0000</pubDate><guid>https://sentinel-blog.cherrypod.org/posts/07-the-bug-hunt/</guid><description>99 findings from a systematic security audit of my own code. Zero critical — but 16 high-severity.</description></item><item><title>Part 5: Three Out of Five</title><link>https://sentinel-blog.cherrypod.org/posts/05-three-out-of-five/</link><pubDate>Mon, 05 Jan 2026 00:00:00 +0000</pubDate><guid>https://sentinel-blog.cherrypod.org/posts/05-three-out-of-five/</guid><description>A security system that scores 3/5 on security is failing. The score became a to-do list.</description></item><item><title>Part 4: The First Real Test</title><link>https://sentinel-blog.cherrypod.org/posts/04-the-first-real-test/</link><pubDate>Sun, 04 Jan 2026 00:00:00 +0000</pubDate><guid>https://sentinel-blog.cherrypod.org/posts/04-the-first-real-test/</guid><description>741 attack prompts, run overnight. The results changed the entire direction of the project.</description></item></channel></rss>