LocalOps
February 28, 2026 · 8 min read

Maximizing DeepSeek R1 671B Performance Locally

DeepSeek R1 671B is a Mixture-of-Experts (MoE) model with 671 billion total parameters, but only ~37B active per forward pass. This architecture makes it far more tractable to run locally than a dense 671B model would be — but it still requires careful configuration to avoid memory bottlenecks.
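A quick back-of-the-envelope calculation shows why that sparsity matters. The parameter counts come from the text above; the ~4.5 bits per parameter for Q4_K_M is an approximation:

```shell
# Rough sketch: weights touched per token for the MoE vs. a hypothetical dense model
awk 'BEGIN {
  total = 671; active = 37                 # billions of parameters (from the text)
  bpp   = 4.5 / 8                          # ~bytes per parameter at Q4_K_M (approximate)
  printf "active fraction:        ~%.1f%%\n", 100 * active / total
  printf "weights read per token: ~%.0f GB (a dense 671B model would read ~%.0f GB)\n",
         active * bpp, total * bpp
}'
```

Only about 5.5% of the weights are exercised on any given token, which is why per-token compute and memory bandwidth demands are closer to a 37B model's than a 671B one's.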

Hardware Requirements

At Q4_K_M quantization, DeepSeek R1 671B weighs approximately 380 GB. This means you need either:

  • A multi-GPU setup with ≥400 GB combined VRAM (e.g., 8× H100 80GB, or 8× A100 80GB), or
  • CPU+GPU hybrid offloading with at least 64–128 GB system RAM to absorb the layers that don't fit in VRAM.
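Before committing to a ~380 GB download, it's worth checking what your box actually has. A minimal sketch, assuming Linux with NVIDIA GPUs and `nvidia-smi` on the PATH:

```shell
# Sum VRAM across all NVIDIA GPUs and report total system RAM (Linux only)
total_vram_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits \
  | awk '{s += $1} END {print s}')
total_ram_kib=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
echo "Combined VRAM: $((total_vram_mib / 1024)) GiB"
echo "System RAM:    $((total_ram_kib / 1024 / 1024)) GiB"
```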

Recommended Quantization

For MoE models, Q2_K (2-bit) can be surprisingly usable because the sparsity compensates for quantization loss. However, Q4_K_M remains the sweet spot for quality. Avoid Q8 unless you have abundant VRAM — the quality gains over Q4 are marginal for this architecture.
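To put rough numbers on that trade-off, you can estimate GGUF file size from bits per weight. The bits-per-weight figures below are approximations; actual file sizes vary with the exact tensor mix:

```shell
# Approximate GGUF size at common quantization levels for a 671B-parameter model
awk 'BEGIN {
  params = 671e9
  n = split("Q2_K:2.6 Q4_K_M:4.5 Q8_0:8.5", levels, " ")   # approx bits per weight
  for (i = 1; i <= n; i++) {
    split(levels[i], kv, ":")
    printf "%-7s ~%.0f GB\n", kv[1], params * kv[2] / 8 / 1e9
  }
}'
```

Q2_K lands around 218 GB, which is what makes it interesting for setups that can't reach the ~380 GB that Q4_K_M demands.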

MoE Offloading Strategy

The key insight with MoE models: expert weights dominate the footprint, but only a few experts fire per token, so they tolerate living in slower memory. With llama.cpp, tune --n-gpu-layers so that as many layers as fit (including the attention and shared weights) live in VRAM, while the remaining layers — which are mostly expert FFN weight — stream from system RAM. This yields 60–80% of pure-GPU throughput at a fraction of the VRAM cost.

# -ngl 30 keeps 30 layers on the GPU; --mlock pins the model in RAM to prevent paging.
# (Comments can't follow a trailing backslash in shell — they'd break the line continuation.)
./llama-cli \
  -m deepseek-r1-671b-q4_k_m.gguf \
  -ngl 30 \
  --mlock \
  -c 8192

Context Window Trade-offs

Every 4K tokens of context adds ~2 GB of KV cache at default precision. For a 128K context, that's ~64 GB of additional memory pressure. Keep context at 8K–32K unless your task demands more.
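The scaling is linear, so it's easy to tabulate for the context sizes you're considering, using the ~2 GB per 4K tokens rule of thumb from above:

```shell
# KV-cache memory at ~2 GB per 4K tokens (default KV precision)
awk 'BEGIN {
  gb_per_4k = 2
  n = split("8192 32768 131072", ctx, " ")
  for (i = 1; i <= n; i++)
    printf "context %6d tokens -> ~%2d GB KV cache\n", ctx[i], ctx[i] / 4096 * gb_per_4k
}'
```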

Inference Speed Expectations

On a 4× RTX 4090 setup with Q4_K_M and 30 GPU layers, expect roughly 8–15 tokens/sec. This is usable for interactive sessions but slow for batch workloads. For production throughput, consider vLLM with tensor parallelism.
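If you do move to vLLM for batch serving, a launch along these lines shards the model across GPUs with tensor parallelism. The model identifier and flag values here are illustrative, not a tested recipe for this hardware:

```shell
# Serve DeepSeek R1 with tensor parallelism across 8 GPUs (values are illustrative)
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --max-model-len 32768
```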