Part 18: The False Positive Problem

Autonomy exposed a problem I hadn’t fully appreciated: when there’s no human to override a scanner, false positives become functional failures.

In approval mode, if a scanner incorrectly flagged something, I’d see it in the approval view, recognise it as a false positive, and approve anyway. The system was conservative and that was fine because I was the safety valve.

At TL4, there is no safety valve. A false positive means a blocked step, which means a failed task, which means the user gets an error message instead of an answer. The security system becomes the thing that stops the system from being useful.

The worst version of this was false positive cascading. Sentinel tracks cumulative risk per session — each flagged item adds to a risk score. If the score exceeds a threshold, the session locks. In approval mode this rarely mattered because I’d clear the flags. Autonomously, false positives accumulated. A few innocent flags would push the session over the threshold, the session would lock, and every subsequent request in that session would be blocked regardless of content.
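To make the cascade concrete, here is a minimal sketch of a cumulative score with no decay. It is not Sentinel's actual code: the `Session` class, its field names, the threshold of 5.0 and the flag weight of 2.0 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical per-session risk state (names and values are illustrative)."""
    lock_threshold: float = 5.0  # assumed value, not taken from the real system
    risk_score: float = 0.0
    locked: bool = False

    def allow(self, flagged: bool, weight: float = 2.0) -> bool:
        """Return True if the request may proceed."""
        # Once the session is locked, content no longer matters.
        if self.locked:
            return False
        if flagged:
            # Every flag adds to the running total, whether or not it
            # was a false positive. Nothing ever subtracts from it.
            self.risk_score += weight
            if self.risk_score > self.lock_threshold:
                self.locked = True
            return False  # the flagged step itself is blocked
        return True


session = Session()
for _ in range(3):
    session.allow(flagged=True)        # three innocent but flagged requests
print(session.locked)                  # the session is now locked
print(session.allow(flagged=False))    # a clean request is refused too
```

With these assumed numbers, three false positives are enough to lock the session, and nothing in the model ever brings the score back down.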

The fix was risk decay. The cumulative risk score now decays over time, at a fixed rate of 1.0 point per minute. A session that gets temporarily heated doesn’t stay locked forever. Combined with an auto-unlock timeout that clears stale conversation turns (so the system doesn’t re-trigger on old history), decay means false positives are annoying but no longer permanently crippling.
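A rough sketch of linear decay, under stated assumptions: the 1.0 per minute rate comes from the description above, but the `DecayingSession` class, the threshold of 5.0, and the rule that the session unlocks as soon as the score falls back to the threshold are hypothetical. The auto-unlock timeout and the clearing of stale turns are not modelled here. Time is passed in explicitly (in seconds) so the behaviour is easy to follow.

```python
from dataclasses import dataclass

DECAY_PER_MINUTE = 1.0  # rate stated in the text


@dataclass
class DecayingSession:
    """Hypothetical session state with linear risk decay."""
    lock_threshold: float = 5.0  # assumed value
    risk_score: float = 0.0
    last_update: float = 0.0     # seconds on a caller-supplied clock
    locked: bool = False

    def _decay(self, now: float) -> None:
        # Subtract 1.0 point for each minute elapsed, never going below zero.
        elapsed_minutes = max(0.0, now - self.last_update) / 60.0
        self.risk_score = max(
            0.0, self.risk_score - DECAY_PER_MINUTE * elapsed_minutes
        )
        self.last_update = now
        # Assumption: the lock lifts once the score is back under the threshold.
        if self.risk_score <= self.lock_threshold:
            self.locked = False

    def record_flag(self, weight: float, now: float) -> None:
        # Apply pending decay first, then add the new flag's weight.
        self._decay(now)
        self.risk_score += weight
        if self.risk_score > self.lock_threshold:
            self.locked = True

    def is_locked(self, now: float) -> bool:
        self._decay(now)
        return self.locked


session = DecayingSession()
for _ in range(3):
    session.record_flag(weight=2.0, now=0.0)  # score reaches 6.0, session locks
print(session.is_locked(now=30.0))    # score 5.5, still above the threshold
print(session.is_locked(now=120.0))   # score 4.0, lock has lifted
```

In this sketch the score is only recalculated when the session is touched, which avoids needing a background timer.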

Getting this balance right — aggressive enough to catch real attacks, forgiving enough to recover from false alarms — is probably the ongoing challenge of the project. It’s easy to build a system that blocks everything. The hard part is blocking only the right things.