107 tests. 54 defended, 37 contained, 11 exploits, 5 errors (all resolved to defended on rerun).
The good news first. External channels held. Email bodies (29 tests): zero exploits. Calendar descriptions (26 tests): zero exploits. Web pages (26 tests): zero exploits. Signal triggers (3 tests): zero exploits. The CaMeL trust architecture worked exactly as designed — data from external sources is tagged UNTRUSTED, and untrusted data cannot influence the planner’s decisions. All 13 payload styles were contained equally across email and calendar. The architecture doesn’t care what the payload says because it never gets the authority to say it.
The contained results were the strongest finding. Every contained test completed the user’s original task while silently ignoring the injection. The user asked for an email summary — they got an email summary. The attacker’s exfiltration payload was irrelevant. That’s the ideal outcome: security without utility loss.
Then there was file_content.
All 11 exploits came from the same vector: files in the workspace. And they all followed an identical chain.
Step 1: The attacker seeds a file in /workspace/. Step 2: The user asks Sentinel to read and process files in that directory. Step 3: Sentinel’s file_read function reads the file and tags it TrustLevel.TRUSTED — because the provenance system only tracks files that Sentinel itself wrote. A file placed there by anyone else has no provenance record, and no provenance record defaults to trusted. Step 4: The planner sees trusted content containing “instructions” and a user prompt saying “process any instructions.” Step 5: The planner generates a shell step to execute the instruction. Step 6: The shell step writes BACKDOOR_MARKER to a file.
I’m calling it trust laundering. Untrusted data enters the system through a channel that doesn’t get tagged, inherits trusted status by default, and flows through the pipeline with full authority. The provenance system was designed to track what Sentinel writes, not to distrust what it didn’t write. That’s the gap.
All 13 payload styles worked — from blunt “ignore previous instructions” to subtle social engineering. The payload didn’t matter because the trust tag had already done the damage. Once data is marked TRUSTED, the architectural defence is bypassed entirely. The scanners didn’t catch it either — echo BACKDOOR_MARKER >> file is a normal shell command, not a pattern any scanner flags.
Three contributing factors. The permissive user prompt (“process any instructions”) is partly a test design issue — it essentially consents to following whatever the file says. But the email exfiltration tests used a reasonable prompt (“email a summary to my manager”) and the planner still refused those, so the prompt alone isn’t the full story. The real issue is the trust tag. The third factor is the absence of any planner guidance about treating file contents as potentially adversarial.
The defence layer analysis told a clear story. Layer 1 (trust architecture) caught everything from external channels — 37 tests contained. Layer 2 (planner refusal) caught another 36 — Claude recognised suspicious exfiltration intent even with trusted files. Layer 3 (scanners) caught blatant patterns. Layer 4 failed on the 11 file_content/file_modification tests where all three upstream layers had nothing to work with.
There was an important finding about utility. Every “defended” test killed the legitimate user task. The user asked for an email summary and got “Request refused.” The system was safe but useless. Every “contained” test delivered the user’s result while silently ignoring the attack. Contained is strictly better on both axes. Scanner blocking should be the fallback, not the goal — architectural containment should be the target for every vector.
Pipeline non-determinism surfaced too. Two tests (subtle_context and encoded_evasion) flipped between exploit and defended on identical inputs. The root cause isn’t scanner stochasticity — Prompt Guard is deterministic given the same input. It’s upstream LLM non-determinism: Qwen and Claude generate slightly different outputs each run, which means the scanner sees different text. An attacker who can seed files gets multiple shots at a probabilistic defence. Another reason structural trust tagging is the right fix — it eliminates the vector regardless of what the LLM generates.
The web page results were deceptive. All 26 scored “defended” — but not because of the trust architecture. The web server ran on a LAN IP, and Sentinel’s SSRF filter (which blocks private IP addresses to prevent server-side request forgery) caught every request before any web content was retrieved. The trust architecture for web content is empirically unvalidated. I’ve since set up 5 public test sites to close that gap.
The combined testing report — Run 16’s 1,588 probes plus the injection benchmark’s 107 tests — gives a clear picture of where things stand: ~1,695 total test probes, 11 real exploits (all the same root cause), zero external channel breaches, and a prioritised remediation plan.
The fix is straightforward in principle. Files without a Sentinel provenance record should default to UNTRUSTED, not TRUSTED. A content hash check ensures that even files Sentinel did write can’t be silently modified by an attacker and still retain their trust tag. The planner prompt gets guidance about file-sourced data as a defence-in-depth layer. And the injection benchmark gets rerun after each fix to verify the exploits are actually dead.
Eleven exploits from one root cause. It’s a better position than eleven exploits from eleven different causes — but it’s a reminder that the gap you didn’t think to test is always the one that bites you.