There was a bug that crystallised the problem. “Change the background to dark green.” The planner generated a plan. The executor ran four discovery phases — reading files, checking structure, identifying the right CSS selector. Then the replan budget ran out. No file_patch was ever executed. The background was still whatever colour it started as. The task reported: success.
Success, in this context, meant “all steps that ran completed without errors.” Which was technically true. The discovery steps all succeeded. But the goal — changing the background colour — was never attempted, let alone achieved. The system was conflating execution completion with goal achievement.
The verification system addresses this with two tiers.
Tier 1 is deterministic. It answers three questions without any LLM inference. First: did all planned steps execute, or did the budget run out? If the budget ran out, the status is now partial, not success. Second: did the plan execute any “effect” tools — file_write, file_patch, shell with mutations, signal_send? Discovery-only executions (just reading files) can’t have achieved a modification goal. Third: what actually changed on disk? File mutations are aggregated with size and line diffs. A file_patch that didn’t change the file size or content is suspicious.
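The three Tier-1 questions can be sketched as a single pure function. This is a minimal illustration, not the system's actual code; the tool names, field names, and thresholds are assumptions for the sake of the example.

```python
from dataclasses import dataclass

# Hypothetical tool identifiers; the real system's names may differ.
EFFECT_TOOLS = {"file_write", "file_patch", "shell", "signal_send"}

@dataclass
class StepResult:
    tool: str
    ok: bool

@dataclass
class FileMutation:
    path: str
    bytes_delta: int    # size change on disk
    lines_changed: int  # diff line count

def tier1_verdict(steps_planned: int,
                  steps_run: list[StepResult],
                  budget_exhausted: bool,
                  mutations: list[FileMutation]) -> dict:
    """Deterministic checks: no LLM inference involved."""
    # Q1: did everything planned actually run, or did the budget run out?
    all_steps_ran = len(steps_run) == steps_planned and not budget_exhausted
    # Q2: did any effect tool execute? Discovery-only runs can't modify anything.
    effect_tools_ran = any(s.tool in EFFECT_TOOLS and s.ok for s in steps_run)
    # Q3: did anything change on disk? A patch with no size or content
    # change is suspicious.
    real_change = any(m.bytes_delta != 0 or m.lines_changed > 0
                      for m in mutations)

    if not all_steps_ran:
        status = "partial"      # budget exhausted: never claim success
    elif not effect_tools_ran or not real_change:
        status = "ambiguous"    # escalate to the Tier-2 judge
    else:
        status = "success"
    return {"status": status,
            "goal_actions_executed": effect_tools_ran,
            "files_changed": real_change}

# The dark-green-background bug: four discovery steps, budget exhausted.
verdict = tier1_verdict(
    steps_planned=6,
    steps_run=[StepResult("file_read", True)] * 4,
    budget_exhausted=True,
    mutations=[],
)
# Reports partial with goal_actions_executed False, instead of success.
```

The key design point is that every branch is decidable from recorded execution data, so this tier costs nothing in inference.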
Tier 2 is the planner-as-judge. When Tier 1 signals are ambiguous — steps completed, effect tools ran, but the outcome is unclear — Claude is invoked with the original goal and a set of decomposed yes/no questions. Did the requested change appear in the final file? Does the output match the user’s description? The judge verdict gets recorded alongside the task result.
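The judge invocation amounts to prompt assembly plus strict parsing of a structured verdict. The sketch below assumes a JSON response format and the helper names shown; neither is confirmed by the source, and the actual model call is elided.

```python
import json

def build_judge_prompt(goal: str, final_file: str,
                       questions: list[str]) -> str:
    """Assemble the Tier-2 prompt: the original goal plus decomposed
    yes/no questions, answered against the final artefact."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, 1))
    return (
        f"Goal: {goal}\n\n"
        f"Final file contents:\n{final_file}\n\n"
        "Answer each question with yes or no, as JSON "
        '{"answers": [...], "verdict": "achieved" | "not_achieved"}:\n'
        f"{numbered}"
    )

def parse_verdict(raw: str) -> dict:
    """Parse the judge's reply so the verdict can be recorded
    alongside the task result."""
    data = json.loads(raw)
    data["verdict_ok"] = data.get("verdict") == "achieved"
    return data

prompt = build_judge_prompt(
    goal="Change the background to dark green",
    final_file="body { background: #006400; }",
    questions=[
        "Did the requested change appear in the final file?",
        "Does the output match the user's description?",
    ],
)
# A hypothetical model reply, parsed into a recordable verdict.
result = parse_verdict('{"answers": ["yes", "yes"], "verdict": "achieved"}')
```

Decomposing the goal into yes/no questions keeps the judge's job narrow: each answer is checkable against the artefact, rather than a single free-form "did it work?".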
The planner can also attach assertions to plan steps — post-conditions like “file contains this pattern” or “HTTP endpoint returns 200.” Seven evaluator types handle the common cases: file_exists, file_contains, file_size_gt, output_contains, exit_code_eq, socket_open, http_200. All paths are sandboxed to prevent traversal. Failed assertions trigger recovery instructions in the replan prompt rather than hard failures — the system gets a chance to fix what went wrong.
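A few of the seven evaluators can be sketched as plain boolean predicates behind a path-sandboxing guard. The sandbox root and function signatures here are assumptions; the point is that each evaluator resolves its path inside the sandbox and returns a bool that can feed recovery instructions rather than raise a hard failure.

```python
import os

# Hypothetical sandbox root; realpath so symlinked parents resolve too.
SANDBOX = os.path.realpath("/tmp/agent-workdir")

def _safe_path(path: str) -> str:
    """Resolve against the sandbox and reject traversal attempts."""
    resolved = os.path.realpath(os.path.join(SANDBOX, path))
    if not resolved.startswith(SANDBOX + os.sep):
        raise ValueError(f"path escapes sandbox: {path}")
    return resolved

def file_exists(path: str) -> bool:
    return os.path.exists(_safe_path(path))

def file_contains(path: str, pattern: str) -> bool:
    try:
        with open(_safe_path(path)) as f:
            return pattern in f.read()
    except OSError:
        return False

def file_size_gt(path: str, n: int) -> bool:
    try:
        return os.path.getsize(_safe_path(path)) > n
    except OSError:
        return False

def exit_code_eq(actual: int, expected: int) -> bool:
    return actual == expected

# Demonstration: a post-condition passes, a traversal attempt is blocked.
os.makedirs(SANDBOX, exist_ok=True)
with open(os.path.join(SANDBOX, "style.css"), "w") as f:
    f.write("body { background: #006400; }")

ok = file_contains("style.css", "#006400")
blocked = False
try:
    file_exists("../../etc/passwd")
except ValueError:
    blocked = True
```

Returning booleans (and swallowing `OSError` into `False`) is what lets a failed assertion become a replan hint instead of an exception that kills the task.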
The Glasgow bug — the dark green background — now correctly reports partial with goal_actions_executed: false. That’s the right answer. The system tried to solve the problem, ran out of budget during discovery, and never made the change. Honest status reporting instead of optimistic inference.
There’s a broader pattern here. LLM-driven systems have a tendency toward false positives — reporting success because nothing explicitly failed, even when the actual goal wasn’t achieved. Deterministic checks catch the obvious cases (nothing changed, budget exhausted) without spending inference. The LLM judge handles the nuanced cases where changes were made but might not match intent. Defence in depth, applied to self-assessment.