Part 38: The Classifier Swap
The LLM classifier was slow, expensive, and occasionally wrong. A deterministic keyword matcher replaced it, classifying in microseconds with zero GPU.
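The post doesn't show the matcher itself, so here is a minimal sketch of the idea: a rule table of trigger keywords, compiled once into regexes, with first-match-wins semantics. The category names and keywords are invented for illustration, not the actual rule set.

```python
import re

# Hypothetical categories and trigger keywords -- illustrative only,
# not the post's actual rules.
RULES: dict[str, list[str]] = {
    "code": ["traceback", "stack trace", "compile", "refactor"],
    "search": ["look up", "latest", "news", "weather"],
    "chat": [],  # fallback when nothing matches
}

# Compile one alternation pattern per category, up front.
PATTERNS = {
    label: re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b", re.I)
    for label, words in RULES.items()
    if words
}


def classify(message: str) -> str:
    """Deterministic classification: the first matching rule wins."""
    for label, pattern in PATTERNS.items():
        if pattern.search(message):
            return label
    return "chat"


assert classify("Here's the traceback from the build") == "code"
```

The trade is obvious but worth stating: the matcher can't generalise beyond its rule table, but it never hallucinates a label, costs nothing to run, and its failures are reproducible and debuggable.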
Qwen was silently spilling out of VRAM to the CPU. Fixing the KV cache quantisation unlocked more context and faster inference.
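The post doesn't show the exact configuration change, but the arithmetic explains why KV cache quantisation matters: halving the bytes per element halves the cache, which is often the difference between fitting in VRAM and spilling. A back-of-envelope sketch, using Qwen2.5-7B-style geometry (assumed figures; check your model's config.json):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, bytes_per_elem: int) -> int:
    """Total KV cache size: K and V tensors for every layer and position."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem


# Assumed Qwen2.5-7B-style dimensions (GQA with 4 KV heads) at 32k context.
LAYERS, KV_HEADS, HEAD_DIM, CTX = 28, 4, 128, 32768

fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX, 2)
q8 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX, 1)

print(f"fp16 KV cache: {fp16 / 2**30:.2f} GiB")  # ~1.75 GiB
print(f"8-bit KV cache: {q8 / 2**30:.2f} GiB")   # ~0.88 GiB
```

That ~0.9 GiB saved is exactly the kind of margin that keeps the whole model resident on the GPU instead of silently offloading to the CPU.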
Not every request needs a frontier model to plan it. The router classifies incoming messages and takes the fast path when it can.
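A sketch of that routing shape, reusing `classify()` from the keyword-matcher sketch above. The handler names and the LLM fallback are hypothetical plumbing, not the post's actual router:

```python
def run_pipeline(label: str, message: str) -> str:
    """Fast path: a pre-wired handler, no model call needed (stub)."""
    return f"[{label}] handled deterministically"


def frontier_plan(message: str) -> str:
    """Slow path: hand the message to the big model to plan (stub)."""
    return "[llm] planned by the frontier model"


def route(message: str) -> str:
    label = classify(message)       # microseconds, zero GPU
    if label != "chat":             # a rule fired: skip the model entirely
        return run_pipeline(label, message)
    return frontier_plan(message)   # no rule matched: pay for the planner
```

The key design choice is that the fallback is the expensive path, not the default: the frontier model only sees the messages the cheap matcher couldn't confidently place.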