The features were done. Episodic learning, dynamic replanning, code fixer v2.5, multi-user, the app.py refactor — all merged. Next came the hardening: fixing the things that only surface when you run the system for real, expanding test coverage into areas that hadn’t been properly exercised, and enriching the data the system records about its own behaviour.
The false positive problem from earlier needed another round. Variance testing had identified specific scanner patterns that were triggering on legitimate content. Dockerfile content was being flagged because the scanner saw what looked like code execution patterns inside build instructions. The find -exec command — completely normal shell usage — was matching the blocked exec pattern. Risk decay had to be raised from 1.0 to 2.0 points per minute because false positives were still cascading into session locks under sustained workloads.
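The decay mechanic is simple to sketch. A minimal version, with an injectable clock and a hypothetical lock threshold (the real threshold isn't stated here), looks something like:

```python
import time


class RiskTracker:
    """Accumulates a per-session risk score that decays over time."""

    DECAY_PER_MINUTE = 2.0   # raised from 1.0 to stop false-positive cascades
    LOCK_THRESHOLD = 10.0    # hypothetical value for illustration

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._risk = 0.0
        self._last = clock()

    def _apply_decay(self):
        now = self._clock()
        elapsed_min = (now - self._last) / 60.0
        self._risk = max(0.0, self._risk - self.DECAY_PER_MINUTE * elapsed_min)
        self._last = now

    def record_violation(self, weight=1.0):
        """Add a weighted violation; return True if the session should lock."""
        self._apply_decay()
        self._risk += weight
        return self._risk >= self.LOCK_THRESHOLD
```

Doubling the decay rate means a burst of false positives drains out of the score twice as fast, so a session only locks if violations keep arriving faster than the decay can absorb them.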
The credential scanner got a major expansion. 10 new patterns covering OpenVPN static keys, Discord bot tokens, AWS keys, and generic secret assignments (the password = "hunter2" pattern that appears in a thousand different formats). Placeholder suppression was the tricky part — the scanner needs to flag api_key = "sk-live-abc123" but not api_key = "fakekey-for-testing". A suppression list (fakekey, dummy, sample, example, placeholder, test) handles the obvious cases. 40 new tests to verify the patterns catch what they should and ignore what they shouldn’t.
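The suppression logic can be sketched in a few lines. The regex and the exact pattern set here are simplified stand-ins for the real scanner's rules; the suppression list is the one named above:

```python
import re

# Simplified stand-in for the generic secret-assignment pattern.
SECRET_RE = re.compile(
    r'(?i)\b(api_key|password|secret|token)\s*=\s*["\']([^"\']+)["\']'
)

# Suppression list: values containing these substrings are treated as placeholders.
PLACEHOLDERS = ("fakekey", "dummy", "sample", "example", "placeholder", "test")


def scan_line(line):
    """Return (name, value) pairs for secret assignments, skipping placeholders."""
    findings = []
    for m in SECRET_RE.finditer(line):
        name, value = m.group(1), m.group(2)
        if any(p in value.lower() for p in PLACEHOLDERS):
            continue
        findings.append((name, value))
    return findings
```

Substring matching on the value is the pragmatic choice: it catches "fakekey-for-testing" and "my-dummy-token" alike, at the cost of suppressing any real secret that happens to contain one of those words.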
Metadata enrichment was about giving the episodic learning pipeline better data to work with. Every step outcome now records exit codes, stderr previews (up to 10 lines / 1,500 characters), scanner results, and code fixer actions. The scanner error categories expanded from a single bucket to four (syntax, security, format, runtime), so the system can distinguish between a Python syntax error and a security scanner block when learning from past failures. Sandbox timeout and OOM-kill detection got added to the execution metadata too — if a step was killed because it ran too long or consumed too much memory, that’s a different lesson than if it failed because the code was wrong.
Trust level escalation tests were new. Two scenarios verify that trust levels behave correctly under adversarial pressure: Scenario A tests that a user can’t escalate their own trust level through a sequence of requests; Scenario B tests that violation accumulation works correctly. The violation accumulation fix was a real bug — violations were being weighted equally regardless of category, which meant a series of minor formatting flags could lock a session the same way a genuine injection attempt would. Now violations are weighted by block category, so a formatting false positive contributes less than a command injection detection.
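The weighting itself reduces to a lookup table. The specific weight values below are invented for illustration; the real ones live in the system's config:

```python
# Hypothetical per-category weights, not the production values.
CATEGORY_WEIGHTS = {
    "formatting": 0.5,
    "credential": 3.0,
    "command_injection": 5.0,
}
DEFAULT_WEIGHT = 1.0


def violation_score(categories):
    """Sum weighted violations instead of counting every flag equally."""
    return sum(CATEGORY_WEIGHTS.get(c, DEFAULT_WEIGHT) for c in categories)
```

With weights like these, four formatting flags score 2.0 while a single injection detection scores 5.0 — the lock threshold now responds to severity, not volume.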
The non-zero exit code replan closed the last gap in the failure recovery system. Previously, when a shell command returned a non-zero exit code, the step was marked as failed and the system moved on. Now it’s marked soft_failed, and the failure trigger flows back to the planner with FAILURE RECOVERY guidance. The planner can then generate a fix cycle — diagnose the failure, attempt a repair, retry. Validated at 12/12 on the mini G-suite before deployment.
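The classification and the replan trigger can be sketched as follows. The payload shape is an assumption about what the planner consumes, not the actual wire format:

```python
def classify_exit(exit_code: int, scanner_blocked: bool = False) -> str:
    """Map a step result to a status the planner understands."""
    if scanner_blocked:
        return "failed"        # hard stop, no retry
    if exit_code == 0:
        return "completed"
    return "soft_failed"       # non-zero exit: eligible for a fix cycle


FAILURE_RECOVERY_GUIDANCE = (
    "FAILURE RECOVERY: diagnose the non-zero exit, propose a repair step, retry."
)


def replan_payload(step_id: str, exit_code: int, stderr_preview: str) -> dict:
    """Assumed shape of the trigger sent back to the planner on soft failure."""
    return {
        "step": step_id,
        "status": "soft_failed",
        "exit_code": exit_code,
        "stderr": stderr_preview,
        "guidance": FAILURE_RECOVERY_GUIDANCE,
    }
```

The key distinction is that soft_failed keeps the step alive in the planner's view, whereas the old failed status terminated it.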
The Rust sidecar got a startup optimization. WASM modules were being compiled on every call — each security check paid a JIT compilation penalty. Pre-compiling at startup eliminated the per-call overhead entirely. Small change, measurable impact on latency for every request that touches the sidecar.
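The pattern is compile-once-at-startup, execute-many. A minimal sketch of the idea, using Python's compile() as a stand-in for the WASM JIT (the actual sidecar is Rust; names here are illustrative):

```python
class ModuleCache:
    """Compile each check module once at startup instead of on every call."""

    def __init__(self, sources: dict[str, str]):
        # Pay the full compilation cost here, once, when the sidecar boots.
        self._compiled = {
            name: compile(src, name, "exec") for name, src in sources.items()
        }

    def run(self, name: str, **inputs):
        # Per-call work is now just executing the cached code object.
        scope = dict(inputs)
        exec(self._compiled[name], scope)
        return scope.get("result")
```

The cost model is the same whether the artifact is a code object or a JIT-compiled WASM module: compilation moves off the request path, so every call after startup pays only execution.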
By the time everything was deployed, the test count had grown from about 3,500 to 4,147. The pipeline had every feature, every hardening fix, and every new test. Time to see how it held up under sustained pressure.