A structural audit of the codebase surfaced what rapid development had been accumulating. The largest file — executor.py — was 3,888 lines. The largest class, ToolExecutor, had 64 methods. The largest single function, _handle_task_inner, was 999 lines. These weren’t legacy modules nobody touched. They were active, central components that every feature depended on. Every new capability had been bolted onto the same handful of files because that’s where the entry points were.

The audit categorised 19 high-severity findings, 108 medium, and 255 low. The highs were the god files and god functions — components that did too many things, knew about too many other components, and couldn’t be tested or reasoned about in isolation. The mediums were inconsistent logging, missing error context, and duplicated patterns. The lows were style issues, magic numbers, and naming inconsistencies. Individually, each finding was manageable. Collectively, they represented a codebase that was becoming harder to work with every time something was added.

The refactor was structured as seventeen phases, each targeting a specific module or subsystem. The executor came first — decomposing its 3,888 lines into focused sub-modules, each handling a specific category of tool operation. The orchestrator followed, extracted into a mixin architecture with an OrchestratorServices Protocol that defined the interface between components. The security module, the code fixer, the tool handlers, the API routes, the channel integrations — each one got its own phase of audit, decomposition, and quality verification.
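The mixin-plus-Protocol shape can be sketched in a few lines. This is an illustrative reconstruction, not the real Sentinel code — the method names (`log_event`, `submit_task`) are assumptions; the point is that each mixin types `self` against the shared Protocol rather than against a concrete class:

```python
from typing import Protocol


class OrchestratorServices(Protocol):
    """Hypothetical contract between orchestrator mixins (names are illustrative)."""

    def log_event(self, event: str) -> None: ...
    def submit_task(self, name: str) -> str: ...


class LoggingMixin:
    events: list

    def log_event(self, event: str) -> None:
        self.events.append(event)


class TaskMixin:
    # Typing self as the Protocol makes the mixin's dependencies explicit:
    # it needs log_event, but doesn't care which sibling provides it.
    def submit_task(self: "OrchestratorServices", name: str) -> str:
        self.log_event(f"task_submitted:{name}")
        return f"task-{name}"


class Orchestrator(LoggingMixin, TaskMixin):
    def __init__(self) -> None:
        self.events: list[str] = []
```

Each mixin can then be type-checked and unit-tested against the Protocol alone, without constructing the full orchestrator.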

The discipline was strict. One concern per session. No feature work during refactoring. A mandatory quality gate after each phase — the code had to pass before moving to the next module. This prevented the common refactoring failure mode where half-finished restructuring creates more problems than it solves. Each phase left the code in a working, tested state.

Several architectural patterns emerged across the phases. God functions were decomposed into focused sub-methods, with a strict ceiling — no function over 200 lines. Singleton session state was replaced with a TaskExecutionContext dataclass that could be passed explicitly rather than accessed through global state. A unified exception hierarchy, SentinelError, replaced the scattered mix of ad-hoc exception classes and bare raise statements. Protocol definitions formalised the interfaces between modules, making dependencies explicit and testable.
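A minimal sketch of two of these patterns together — the explicit context dataclass and the unified exception hierarchy. The field names and the `ToolError` subclass are assumptions for illustration; only the names `TaskExecutionContext` and `SentinelError` come from the text:

```python
from dataclasses import dataclass, field


class SentinelError(Exception):
    """Base of the unified exception hierarchy (subclasses are illustrative)."""


class ToolError(SentinelError):
    """Hypothetical subclass for tool-execution failures."""


@dataclass
class TaskExecutionContext:
    # State travels with the context object, not through module globals.
    task_id: str
    channel: str
    attempts: int = 0
    metadata: dict = field(default_factory=dict)


def run_step(ctx: TaskExecutionContext) -> TaskExecutionContext:
    ctx.attempts += 1
    if not ctx.task_id:
        raise ToolError("missing task_id")
    return ctx
```

Because the context is passed explicitly, a test can construct one, run a step, and inspect the result — no global state to set up or tear down.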

One of the most significant outcomes was audit_fix — a tool that started as a one-off script and evolved into something considerably more useful. It’s an AST-based code quality tool that can analyse Python source files and apply deterministic fixes. Not a linter that tells you what’s wrong — a tool that fixes things directly.

The core capability is structured logging injection. audit_fix parses Python source using the ast module, identifies functions that lack entry logging, and injects logger.debug() calls with structured event names and parameter summaries. It’s not a naive “add a log line to every function” tool. It has sophisticated skip rules: dunder methods are ignored, trivial functions under five effective lines are skipped, getters and setters are excluded, functions decorated with @no_audit_log are left alone, and functions that already have logging at entry are not double-logged.
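The skip rules above can be approximated with a short `ast` walk. This is a simplified sketch, not audit_fix's actual implementation — the five-line threshold comes from the text, but the exact definition of "effective lines" and the getter/setter handling are elided here:

```python
import ast


def needs_entry_logging(fn: ast.FunctionDef) -> bool:
    """Simplified version of the skip rules (illustrative, not the real tool)."""
    if fn.name.startswith("__") and fn.name.endswith("__"):
        return False  # dunder methods are ignored
    if any(isinstance(d, ast.Name) and d.id == "no_audit_log"
           for d in fn.decorator_list):
        return False  # explicit opt-out decorator
    body = fn.body
    if body and isinstance(body[0], ast.Expr) and isinstance(body[0].value, ast.Constant):
        body = body[1:]  # drop the docstring before counting
    if len(body) < 5:
        return False  # trivial function, skip
    first = body[0]
    if (isinstance(first, ast.Expr) and isinstance(first.value, ast.Call)
            and isinstance(first.value.func, ast.Attribute)
            and isinstance(first.value.func.value, ast.Name)
            and first.value.func.value.id == "logger"):
        return False  # already has logging at entry; don't double-log
    return True
```

The real tool would then construct a `logger.debug()` call node and splice it in; the detection half shown here is the part that keeps the output from being noisy.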

Parameter handling is particularly careful. Parameters are classified into three categories. Large content parameters — things likely to contain file contents or long strings — get len() logged rather than their full value. Safe primitives like booleans, numbers, and short strings get their value logged directly. Everything else gets type().__name__ — informative without being noisy. A maximum of three parameters are logged per function to keep the output useful.
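The three-way classification might look like this. The name-based heuristic for spotting large content parameters and the 40-character cutoff are assumptions; the text specifies only the three categories and the three-parameter cap:

```python
# Hypothetical name hints for parameters likely to hold large content.
LARGE_HINTS = ("content", "text", "body", "data", "source")


def summarise_param(name: str, value: object) -> str:
    """Sketch of the three-way classification described above."""
    if isinstance(value, str) and any(h in name for h in LARGE_HINTS):
        return f"{name}_len={len(value)}"          # large content: log length only
    if isinstance(value, (bool, int, float)):
        return f"{name}={value!r}"                 # safe primitive: log the value
    if isinstance(value, str) and len(value) <= 40:
        return f"{name}={value!r}"                 # short string: safe to log
    return f"{name}_type={type(value).__name__}"   # everything else: just the type


def summarise_params(params: dict) -> list:
    # Cap at three parameters to keep the log line readable.
    return [summarise_param(k, v) for k, v in list(params.items())[:3]]
```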

Beyond entry logging, audit_fix handles exception blocks (injecting logger.exception() where except clauses silently swallow errors), negative-path logging (filling asymmetric gaps where an if-branch has logging but the else-branch doesn’t), and several mechanical fixes: enforcing raise X from exc for proper exception chaining, fixing exc_info usage patterns, and resolving collisions with reserved LogRecord attribute names.
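What the exception-handling fixes produce is standard Python, and worth showing concretely. A hypothetical "after" state — the `ConfigError`/`load_port` names are invented for illustration, but the `logger.exception()` call and `raise ... from exc` chaining are exactly the patterns the text describes enforcing:

```python
import logging

logger = logging.getLogger(__name__)


class ConfigError(Exception):
    """Illustrative domain error, standing in for the real hierarchy."""


def load_port(raw: str) -> int:
    try:
        return int(raw)
    except ValueError as exc:
        # Instead of silently swallowing: record the traceback...
        logger.exception("load_port_failed raw=%r", raw)
        # ...and chain the original cause so it survives in __cause__.
        raise ConfigError(f"invalid port: {raw!r}") from exc
```

With `from exc`, the original `ValueError` is preserved on `__cause__` and shows up in tracebacks as "The above exception was the direct cause of...", rather than being lost.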

The execution pipeline has a strict, load-bearing order. Ruff formats the code to a clean baseline first. Logger setup ensures every qualifying module has a properly configured logger using __name__. Then the injectors run — entry logging, exception logging, negative-path logging. Ruff formats again after injection. Mechanical fixers apply their transformations. A final Ruff pass ensures everything is clean. Every injector validates its output with ast.parse() and rolls back on syntax errors — a malformed injection never makes it to disk.
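The validate-and-roll-back step at the end of that pipeline reduces to a small wrapper. A sketch under the assumption that each injector is a pure string-to-string transform (the real tool's interface may differ):

```python
import ast


def apply_injection(original: str, transform) -> str:
    """Run one injector and keep its output only if it still parses."""
    candidate = transform(original)
    try:
        ast.parse(candidate)   # malformed output never makes it to disk
    except SyntaxError:
        return original        # roll back to the untouched source
    return candidate
```

Wrapping every injector this way means a bug in any single transformation degrades to a no-op for that file rather than corrupting it.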

audit_fix itself went through a significant restructuring during the refactor. It started as a 2,881-line monolith that had three systemic bugs: it was hijacking all loggers to a hardcoded name instead of respecting __name__, it had seven injection types that sprayed noisy logging everywhere, and it used inline markers that conflicted with manual code review. The refactored version is a proper package with each injector and fixer in its own module, clear skip rules, and convergence behaviour — running it twice on the same file produces identical output.
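Convergence is also cheap to check mechanically: apply the fixer twice and compare. The `normalise_blank_lines` fixer below is a toy stand-in for illustration, not one of audit_fix's real steps:

```python
import re


def normalise_blank_lines(src: str) -> str:
    """Toy fixer: collapse runs of blank lines down to one blank line."""
    return re.sub(r"\n{3,}", "\n\n", src)


def converges(fix, source: str) -> bool:
    """A fixer converges if a second pass changes nothing (idempotency)."""
    once = fix(source)
    return fix(once) == once
```

A check like this makes a useful property test for every injector and fixer: any step that fails it will fight with itself (and with reviewers) on every run.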

The tool now runs as part of the development workflow. A --dry-run flag shows what would change without modifying files. A --check flag provides CI integration, exiting with code 1 if changes are needed. A --manifest flag produces JSON output for programmatic consumption. Steps can be selectively applied or excluded. It’s a tool that makes the codebase better every time it runs, deterministically and without human intervention for the mechanical parts.
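The flag surface described above maps naturally onto `argparse`. A sketch, assuming the exit-code convention from the text; the real CLI's internals and manifest schema are not shown here:

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """Sketch of the audit_fix flag surface (the real CLI may differ)."""
    p = argparse.ArgumentParser(prog="audit_fix")
    p.add_argument("--dry-run", action="store_true",
                   help="show what would change without modifying files")
    p.add_argument("--check", action="store_true",
                   help="exit with code 1 if changes are needed (CI mode)")
    p.add_argument("--manifest", action="store_true",
                   help="emit JSON output for programmatic consumption")
    return p


def run(argv, changes_needed: bool) -> int:
    args = build_parser().parse_args(argv)
    if args.manifest:
        print(json.dumps({"changes_needed": changes_needed}))
    if args.check and changes_needed:
        return 1   # non-zero exit lets CI fail the build on drift
    return 0
```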

A companion tool, audit_scan, provides the detection side. It runs tiered checks — tier 1 mechanical checks (import patterns, naming conventions, structural metrics) and tier 2 context-aware checks (logging quality, security patterns, validation coverage, structural complexity). It supports baseline comparison, showing only new findings since the last scan, and can be scoped to only changed files via git diff. Together, audit_scan finds problems and audit_fix resolves the ones that have deterministic solutions.
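Baseline comparison is essentially a set difference over finding identities. A sketch in which the identity of a finding is assumed to be its `(file, check)` pair — the real audit_scan key may include more fields:

```python
def new_findings(current: list, baseline: list) -> list:
    """Report only findings that were not present in the previous scan.

    The (file, check) identity key is an assumption for illustration.
    """
    seen = {(f["file"], f["check"]) for f in baseline}
    return [f for f in current if (f["file"], f["check"]) not in seen]
```

This is what makes the tool usable incrementally: a scan of a large legacy codebase stays quiet about the 255 known lows and surfaces only what the latest change introduced.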

The security modules got particular attention during the refactor. The output scanner’s core functions were decomposed into focused sub-functions. The Rust sidecar — used for performance-critical scanning — had 16 clippy warnings resolved and 3 SSRF bypass vectors closed. XSS hardening was applied across the UI layer. These weren’t new security features, but the decomposition made the existing security logic auditable in a way that a 900-line function never was.

The numbers tell part of the story. 690 files changed. 126,000 lines added, 19,000 removed. 105 bugs found and fixed along the way — not bugs that the refactor introduced, but bugs that the refactor exposed. Issues that had been hiding in god functions, masked by complexity, invisible until the code was simple enough to read clearly. The test suite grew from 5,679 to 5,956 tests, plus 55 new Rust tests for the sidecar.

The quality score — a composite metric from audit_scan covering logging, structure, error handling, and security patterns — went from roughly 6.6 to 9.2 out of 10 across the codebase. More importantly, the codebase is now something that can be maintained. Functions are short enough to understand. Dependencies are explicit. Interfaces are defined by Protocols. The logging is structured and consistent. When the next feature needs to be added, it has a clear place to go rather than being bolted onto a 4,000-line file.

The refactor didn’t add any user-facing capabilities. Sentinel does exactly what it did before. But the foundation it’s built on is fundamentally different — modular, observable, and maintainable. The next hundred features will be easier than the last ten were.