The planner had a problem with optimism. It would execute a multi-step plan, report success, and move on. Sometimes the result was correct. Sometimes a step had failed silently — a file write that produced empty output, a code patch that didn’t apply, a website that rendered a blank page. The planner didn’t check. It assumed that completing the steps meant achieving the goal.

The goal verification pipeline addresses this with three layers: a loop controller for persistent goal pursuit, an assertion framework for deterministic checks, and content manifests for observable-state verification.

The loop controller wraps the orchestrator in a retry loop. Instead of a single pass through the plan, it calls handle_task() repeatedly, evaluating the result after each attempt. If the goal isn’t met, it synthesises an actionable “gap summary” — a description of what’s missing or broken — and feeds it back as an enriched request for the next attempt. The planner sees its previous attempt’s outcome and can adjust its approach.
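In outline, the loop looks something like this. A minimal sketch: `handle_task`, `evaluate_goal`, and `synthesise_gap` are stand-ins for the real orchestrator and judge interfaces, and the enrichment format is an assumption, not the system's actual prompt shape.

```python
def run_goal_loop(request, handle_task, evaluate_goal, synthesise_gap,
                  max_attempts=3):
    """Call the orchestrator repeatedly until the goal is met or the
    attempt budget is spent. All callables are illustrative stand-ins."""
    enriched = request
    gaps = []
    for attempt in range(1, max_attempts + 1):
        result = handle_task(enriched)
        if evaluate_goal(result):
            return {"status": "complete", "attempts": attempt}
        gap = synthesise_gap(result)  # e.g. "the JavaScript file is empty"
        gaps.append(gap)
        # Feed the gap back so the planner sees what its last attempt missed.
        enriched = f"{request}\n\nPrevious attempt fell short: {gap}"
    return {"status": "budget_exhausted", "attempts": max_attempts,
            "gaps": gaps}
```

The important property is that each iteration's input carries the previous failure, so the planner is steered rather than simply re-rolled.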

Gap synthesis is the intelligence behind the retries. It’s not a blind “try again.” A priority cascade of six detectors runs against the attempt result: judge gap (the verification judge identified a specific failure), blocked step (a plan step couldn’t execute), no mutations (nothing changed on disk), empty output (the tool returned nothing), partial completion (some steps succeeded, others didn’t), and step failure (a step errored). The first detector that fires produces the gap summary. This means the retry request is specific: “The CSS file was created but the JavaScript file is empty” rather than “Something went wrong.”
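The cascade can be sketched as a chain of ordered checks over the attempt record. The dict keys and messages below are illustrative assumptions; the real detectors inspect richer structures.

```python
def synthesise_gap(result):
    """Run the six detectors in priority order; the first to fire wins.

    `result` is a plain dict standing in for the real attempt record.
    """
    if result.get("judge_gap"):                       # most specific signal
        return result["judge_gap"]
    if result.get("blocked_step"):
        return f"step could not execute: {result['blocked_step']}"
    if not result.get("mutations"):
        return "no mutations: nothing changed on disk"
    if not result.get("output"):
        return "empty output: the tool returned nothing"
    failed = [s for s in result.get("steps", []) if not s["ok"]]
    if failed and len(failed) < len(result["steps"]):
        return f"partial completion: {len(failed)} step(s) incomplete"
    if failed:
        return f"step failure: {failed[0]['error']}"
    return None  # no detector fired
```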

The gap summaries are carefully sanitised before reaching the planner. An allowlist approach ensures only safe error descriptions get through. Terms like “semgrep”, “yara”, “clamav” — the names of security scanners in the pipeline — are scrubbed by a backstop regex. The planner should know that its output was rejected, but it shouldn’t know which specific security tool rejected it or why. That information could be used to craft content that evades the scanner.
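The backstop layer of that sanitisation can be as small as one substitution. The scanner names come from the article; the replacement token and function shape are assumptions, and the real pipeline's allowlist sits in front of this.

```python
import re

# Backstop regex: scrub scanner names so the planner cannot fingerprint
# which security tool rejected its output.
SCANNER_TERMS = re.compile(r"\b(semgrep|yara|clamav)\b", re.IGNORECASE)

def sanitise_gap(summary: str) -> str:
    """Replace any leaked scanner name with a neutral phrase."""
    return SCANNER_TERMS.sub("a security check", summary)
```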

Attempt history is compressed intelligently. Older attempts are collapsed to 60-character summaries; recent ones keep full detail. This keeps the context from growing without bound while preserving the most relevant information at full fidelity. The loop has configurable budgets — maximum iterations and wall-clock time — to prevent runaway retries. Loop state is persisted in Postgres with user scoping, cancellation support, and a concurrency guard ensuring one active loop per user.
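The compression policy is simple enough to show directly. A sketch under assumed parameter names; the real summarisation may be smarter than plain truncation.

```python
def compress_history(attempts, keep_full=2, summary_len=60):
    """Collapse older attempts to short summaries; keep recent ones whole.

    `attempts` is a list of gap-summary strings, oldest first.
    """
    cutoff = max(0, len(attempts) - keep_full)
    return [a[:summary_len] for a in attempts[:cutoff]] + attempts[cutoff:]
```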

The assertion framework provides deterministic verification that doesn’t require an LLM call. Seven assertion types cover the common cases: file_exists, file_not_empty, file_contains, file_not_contains, content_changed, response_contains, and command_returns. The planner can attach assertions to its plans — “after creating the website, assert that index.html exists and is not empty.” These are cheap, fast, and unambiguous. Either the file exists or it doesn’t.

Assertions feed into a two-tier judge pipeline. Tier 1 is the deterministic layer — assertion results, tool output scanning, structural checks. It’s free and fast. Tier 2 is a planner-as-judge LLM call that evaluates whether the goal was actually achieved. The key insight is that Tier 2 is only invoked when Tier 1 is ambiguous. If all assertions pass and the structural checks look clean, the task is marked complete without burning a judge call. If assertions fail, the task is marked failed without needing the judge to confirm the obvious. The judge only runs when the deterministic signals are mixed or insufficient.
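The escalation logic reduces to a few lines. In this sketch `llm_judge` is a hypothetical callable standing in for the planner-as-judge call, and the return labels are assumptions; the point is only that the expensive call sits behind the deterministic branches.

```python
def judge(assertion_results, structural_ok, llm_judge):
    """Tier 1 settles unambiguous cases; Tier 2 runs only on mixed signals."""
    if assertion_results:
        if not all(assertion_results):
            return "failed"        # deterministic fail: judge skipped
        if structural_ok:
            return "complete"      # deterministic pass: judge skipped
    # No assertions, or passing assertions with unclean structure.
    return "complete" if llm_judge() else "failed"
```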

Content manifests fill the gap between structural digests and semantic verification. A structural digest can tell you that an HTML file has 12 elements and 3 script tags. But it can’t tell you whether the background is blue or whether the navigation links point somewhere real. Content manifests extract observable properties: the actual background colour from a CSS rule, the actual text content of a heading, the actual DOM element IDs that JavaScript references, the actual endpoints that fetch calls target.

Four extractors cover the main file types. HTML extraction pulls body styles, per-element inline styles, text content, panel counts, and script references. CSS extraction captures element styles by selector, the full colour palette, and layout rules. JavaScript extraction identifies DOM references, timer behaviours, fetch calls, and Date usage. Python extraction maps entry points, class signatures, and CLI argument patterns. Each extractor is designed to confirm what the planner intended — not to expose raw content the planner didn’t originate.
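To make the idea concrete, here is a toy version of one extractor: pulling per-selector background colours from CSS. The real extractors are far richer, and the regexes below are illustrative assumptions that handle only flat `selector { ... }` rules, not nested or at-rule CSS.

```python
import re

def extract_css_manifest(css: str) -> dict:
    """Map each selector to its declared background colour, if any."""
    manifest = {}
    for selector, body in re.findall(r"([^{}]+)\{([^}]*)\}", css):
        m = re.search(r"background(?:-color)?\s*:\s*([^;]+)", body)
        if m:
            manifest[selector.strip()] = m.group(1).strip()
    return manifest
```

The output is an observable-state record the judge can compare against intent ("the background should be blue") without re-reading the raw file.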

The structural cross-reference check ties it together. After a website is built, it validates that JavaScript DOM references (getElementById('chart')) actually correspond to HTML elements with matching IDs, and that HTML script tags reference files that exist on disk. These are the integration bugs that individual file checks miss — each file can be valid in isolation while the wiring between them is broken.

The result is a verification pipeline that catches failures at multiple levels. Assertions catch the obvious cases — missing files, empty output. Content manifests catch semantic mismatches — the wrong colour, missing content. Cross-reference checks catch integration errors — broken references between files. The judge handles everything else. And the loop controller turns detection into correction, feeding failures back into the planner until the goal is actually met or the budget runs out.

The planner is still optimistic. But now there’s a system that checks its work.