Understanding KV Cache Requirements
Most discussions of LLM memory requirements focus on model weights. But for long-context workloads, the KV (key-value) cache can consume as much memory as the model weights themselves, or more. Here's how to calculate it accurately.
What is the KV Cache?
During inference, transformer models compute key and value tensors for each token in the context. These are cached to avoid recomputation on subsequent generation steps. The cache grows linearly with context length and the number of attention heads.
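As a toy illustration of that growth (plain Python, placeholder values only; real implementations keep one tensor per layer and store actual K/V projections), the cache simply gains one K entry and one V entry per generated token:

```python
head_dim = 4                     # toy size; real models use e.g. 128

k_cache: list[list[float]] = []  # one row per cached token
v_cache: list[list[float]] = []

def decode_step(token_id: int) -> None:
    """Append the new token's K and V instead of recomputing the past."""
    k_cache.append([0.0] * head_dim)  # placeholder projection values
    v_cache.append([0.0] * head_dim)

for t in range(3):               # generate three tokens
    decode_step(t)

print(len(k_cache))              # 3: the cache grows linearly with context
```

Nothing is ever evicted during normal decoding, which is why the cache's footprint is set by the full context length rather than by the number of new tokens.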
The Formula
The memory cost of the KV cache is:
KV Cache (GB) =
2 × num_layers × num_kv_heads × head_dim × ctx_len × bytes_per_element
/ (1024³)

The leading 2 accounts for storing both K and V. Note that num_kv_heads is the number of key/value heads, which equals the query head count only for full multi-head attention; GQA models have far fewer. For concrete examples using Llama 3.1 70B (80 layers, 8 KV heads via GQA, head_dim=128) at FP16:
- 8K context: ~2.5 GB
- 32K context: ~10 GB
- 128K context: ~40 GB
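A minimal sketch of this arithmetic (keeping in mind that for a GQA model like Llama 3.1 70B the relevant head count is its 8 KV heads, not its 64 query heads):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_element: int = 2) -> int:
    """Total KV cache size in bytes: K and V for every layer and token."""
    return 2 * num_layers * num_kv_heads * head_dim * ctx_len * bytes_per_element

# Llama 3.1 70B: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 (2 bytes)
for ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(80, 8, 128, ctx) / 1024**3
    print(f"{ctx:>7} tokens: {gib:5.1f} GiB")
```

Running this prints 2.5, 10.0, and 40.0 GiB for the three context lengths, matching the list above.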
Why This Matters Practically
A 70B model at Q4_K_M weighs ~38 GB. On a 48 GB VRAM setup (e.g., RTX 6000 Ada), that leaves only ~10 GB for the KV cache, roughly 32K tokens of FP16 context for Llama 3.1 70B, and less in practice once compute buffers and activations are accounted for. Underestimating this overhead is the most common reason users hit out-of-memory errors at longer contexts.
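The headroom arithmetic can be sketched directly. This is a simplification: it ignores activation and compute-buffer overhead, and the per-token cost assumes Llama 3.1 70B's 8 KV heads at FP16:

```python
def max_context_tokens(vram_gib: float, weights_gib: float,
                       kv_bytes_per_token: int) -> int:
    """How many tokens of KV cache fit in the VRAM left after the weights."""
    free_bytes = (vram_gib - weights_gib) * 1024**3
    return int(free_bytes // kv_bytes_per_token)

# Llama 3.1 70B at FP16: 2 (K+V) * 80 layers * 8 KV heads * 128 dim * 2 bytes
per_token = 2 * 80 * 8 * 128 * 2          # 327,680 bytes (~320 KiB) per token

print(max_context_tokens(48, 38, per_token))  # 32768
```

Treat the result as an upper bound; real deployments should leave a safety margin of a few GB.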
Reducing KV Cache Pressure
- KV cache quantization: llama.cpp supports `--cache-type-k q8_0` (and `--cache-type-v q8_0`, which requires flash attention to be enabled) to roughly halve KV cache memory at minimal quality cost.
- Grouped Query Attention (GQA): Many modern models (Llama 3, Mistral) use GQA, which reduces the KV cache by 4–8× by sharing each KV head across a group of query heads.
- Sliding window attention: Models like Mistral 7B limit attention to a local window, keeping KV cache size bounded regardless of total context.
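The relative impact of the first two techniques can be checked with the same formula (approximating q8_0 as 1 byte per element; its block scales add a small overhead in practice, and the non-GQA baseline here is hypothetical):

```python
def kv_bytes(layers: int, kv_heads: int, head_dim: int,
             ctx: int, bytes_per_elem: float) -> float:
    """KV cache size in bytes for K and V across all layers and tokens."""
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem

ctx = 32_768
mha = kv_bytes(80, 64, 128, ctx, 2)   # hypothetical 70B without GQA
gqa = kv_bytes(80, 8, 128, ctx, 2)    # Llama 3-style GQA: 8 KV heads
q8  = kv_bytes(80, 8, 128, ctx, 1)    # plus ~8-bit quantized K/V cache

print(mha / gqa)  # 8.0  (64 query heads sharing 8 KV heads)
print(gqa / q8)   # 2.0  (FP16 -> ~8-bit roughly halves it)
```

The two reductions compose: a GQA model with a quantized cache uses roughly 1/16 the memory of an FP16 full-MHA cache of the same shape.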
LocalOps Calculator
The LocalOps hardware simulator accounts for KV cache in its memory calculations. When you set a context window in the Advanced panel, the KV cache contribution is added to the base model VRAM requirement automatically. This is why setting 128K context on the calculator shows significantly higher system RAM requirements.