Qwen3 14B runs on an RTX 3060 with 12GB of VRAM. The model weights fit, but I hadn't been paying close enough attention to what else was consuming GPU memory.
The KV cache — the key-value pairs the model stores for attention across the context window — was running at full 16-bit precision (f16). At 20K context with default settings, the cache was overflowing VRAM and silently spilling into system RAM. Ollama doesn't warn you when this happens. The model still works, it just gets slower, and you have no idea why.
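The cache size is easy to estimate: for every token in the context, the model stores a key and a value vector per layer. A rough sketch, assuming Qwen3 14B's published shape (40 layers, 8 KV heads under grouped-query attention, head dimension 128) — those numbers are my reading of the model card, not something Ollama reports:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   ctx_tokens: int, bytes_per_value: int) -> int:
    """Size of the KV cache: one key and one value vector per layer per token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K + V
    return per_token * ctx_tokens

# Assumed Qwen3 14B shape: 40 layers, 8 KV heads, head_dim 128.
f16 = kv_cache_bytes(40, 8, 128, 20_480, 2)  # 16-bit cache at 20K context
q8 = kv_cache_bytes(40, 8, 128, 20_480, 1)   # ~8-bit, ignoring scale overhead
print(f"f16: {f16 / 2**30:.2f} GiB, q8_0: {q8 / 2**30:.2f} GiB")
```

Under those assumptions the f16 cache comes to roughly 3.1 GiB on top of the quantised weights — on a 12GB card, more than enough to explain the spill.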
I found it while investigating why inference felt sluggish on longer conversations. The fix was two changes to the Ollama model configuration: switch the KV cache to q8_0 quantisation (8-bit instead of 16-bit, roughly halving its memory footprint) and enable flash attention (which computes attention more efficiently without materialising the full attention matrix in memory).
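Concretely, both switches live in the environment of the Ollama server process, via the documented `OLLAMA_KV_CACHE_TYPE` and `OLLAMA_FLASH_ATTENTION` variables. A minimal sketch (how you persist these — systemd unit, shell profile — depends on your setup):

```shell
# Flash attention must be on for KV cache quantisation to take effect.
export OLLAMA_FLASH_ATTENTION=1
# Quantise the KV cache to 8-bit (default is f16; q4_0 is also available).
export OLLAMA_KV_CACHE_TYPE=q8_0

# Restart the server so the settings apply.
ollama serve
```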
The result was immediate. The KV cache shrank enough that I could increase the context window from 20K to 24K tokens and raise the output cap from 8K to 12K tokens — both improvements to capability, not just performance. Longer plans, more detailed output, fewer truncation issues.
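The context and output caps are per-model settings. A sketch as an Ollama Modelfile — the base model tag and derived model name are my placeholders:

```
FROM qwen3:14b

# 24K context window (up from 20K) and a 12K output cap (up from 8K).
PARAMETER num_ctx 24576
PARAMETER num_predict 12288
```

Built with `ollama create qwen3-long -f Modelfile`; the same options can also be passed per-request through the API.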
The pre-warm step at startup also helped. The fast-path classifier calls Ollama for every incoming request, and a cold model means the first request after idle is noticeably slow. Now the system sends a throwaway inference at startup so the model is already loaded and warm when the first real request arrives.
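The pre-warm itself is just a throwaway generate call. A minimal sketch against Ollama's HTTP API — the model tag and prompt are placeholders, and I'm assuming the default `localhost:11434` endpoint:

```python
import json
import urllib.request

def warmup_payload(model: str) -> dict:
    """A minimal generate request: one token is enough to force a model load."""
    return {
        "model": model,
        "prompt": "ping",  # throwaway prompt; the reply is discarded
        "stream": False,
        "options": {"num_predict": 1},
    }

def warm_up(model: str, host: str = "http://localhost:11434") -> None:
    """Send the throwaway request so the first real request hits a warm model."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(warmup_payload(model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).read()

if __name__ == "__main__":
    warm_up("qwen3:14b")  # placeholder model tag
```

The model then stays resident for Ollama's keep-alive window, so the classifier's first real call skips the cold load.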
The frustrating part is how invisible the problem was. No error, no warning, no metric showing GPU memory pressure. Just slightly slower inference that I’d been living with unnecessarily.