Reaching 64 t/s: LM Cache, KV Checkpointing, and MTP

Why llama-server is Faster Than llama-cli

The answer is KV cache checkpointing. llama-server stores the attention state of each conversation as checkpoints. When a new message arrives, the server finds the nearest matching checkpoint and restores it — only processing the new tokens since that checkpoint.

restored context checkpoint (pos_min = 11924, n_tokens = 11925)
prompt eval: 744 tok/s  ← only 771 NEW tokens processed
eval:        61.96 t/s  ← generation

In this example, 11,925 tokens were restored from cache. Only 771 new tokens needed processing — a 94% cache hit. The prompt eval that takes 14 seconds from scratch completes in 1 second.
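The hit ratio and the one-second figure follow directly from the log excerpt. A minimal sketch of that arithmetic, using only the three numbers shown above (the function names are illustrative, not part of llama.cpp):

```python
# Numbers copied from the log excerpt above.
restored_tokens = 11925   # n_tokens restored from the checkpoint
new_tokens = 771          # tokens that still had to be processed
prompt_rate = 744.0       # prompt eval speed in tok/s


def cache_hit_ratio(restored: int, new: int) -> float:
    """Fraction of the full prompt that came from the checkpoint."""
    return restored / (restored + new)


def tail_seconds(new: int, rate: float) -> float:
    """Time to process only the uncached tail at the given rate."""
    return new / rate


print(f"cache hit: {cache_hit_ratio(restored_tokens, new_tokens):.1%}")
print(f"tail prompt eval: {tail_seconds(new_tokens, prompt_rate):.2f} s")
```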

llama-cli has no persistent state between runs. Every invocation reprocesses the full prompt from scratch. That's the 36 TPS ceiling — and why the server mode number from Experiment 02 looked suspiciously high.

The LM cache isn't optional for real coding use. Cline builds up 10k-40k tokens of context across a session. Without checkpointing, every message would require reprocessing the full history — adding 10-30 seconds of latency per turn.

How checkpoints work

llama-server creates a checkpoint every 8192 tokens during prompt processing. The -ctxcp 64 flag allows up to 64 checkpoints per slot. When a new prompt arrives, the server scans backwards to find the longest matching prefix, restores that checkpoint, then processes only the divergent tail. Each checkpoint is roughly 63-75 MiB of KV state.
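The restore logic described above can be modeled in a few lines. This is a simplified sketch, not llama.cpp's actual implementation: it assumes token IDs are plain integers, that checkpoint positions are given as a list, and all function names are hypothetical. It also estimates the memory a full set of checkpoints would take at the sizes quoted above.

```python
from typing import Sequence, Tuple


def common_prefix_len(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Number of leading tokens shared by the cached and the new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def pick_checkpoint(checkpoints: Sequence[int], prefix_len: int) -> int:
    """Largest checkpoint position that lies inside the shared prefix (0 if none)."""
    usable = [pos for pos in checkpoints if pos <= prefix_len]
    return max(usable, default=0)


def tokens_to_process(cached: Sequence[int], prompt: Sequence[int],
                      checkpoints: Sequence[int]) -> int:
    """Tokens left to evaluate after restoring the best checkpoint."""
    start = pick_checkpoint(checkpoints, common_prefix_len(cached, prompt))
    return len(prompt) - start


def checkpoint_budget_mib(count: int, low_mib: float, high_mib: float) -> Tuple[float, float]:
    """Memory range for `count` checkpoints of low_mib..high_mib each."""
    return count * low_mib, count * high_mib


# Hypothetical session: 20k cached tokens, the new prompt diverges at 12,000.
cached = list(range(20000))
prompt = list(range(12000)) + [-1] * 500
print(tokens_to_process(cached, prompt, checkpoints=[8192, 16384]))
print(checkpoint_budget_mib(64, 63, 75))
```

The checkpoint at 16384 is past the point of divergence, so the sketch falls back to 8192 and reprocesses everything after it. That is why denser checkpoints mean a shorter tail.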

MTP: Multi-Token Prediction

With llama-server's LM cache giving 57 TPS, the next step was MTP. MTP is speculative decoding without a separate draft model — the Qwen3.6 35B-A3B model was trained with an additional prediction head that guesses the next 2-4 tokens. llama-server verifies them in a single forward pass. Correct guesses are accepted for free. Zero quality loss — rejected drafts fall back to standard sampling.
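The accept/reject step is easier to see in code. Below is a minimal sketch of greedy draft verification, assuming the main model's own choice at each drafted position is already available from the single verification pass. The function name and the list-of-ints token representation are illustrative, not llama.cpp's API.

```python
from typing import List, Sequence


def verify_draft(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Greedy verification of drafted tokens.

    draft:  k tokens proposed by the MTP head.
    target: k + 1 tokens the main model picks at the same positions; the
            last entry is its choice after the full draft.
    Returns the tokens actually emitted this step.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more token than draft")
    emitted: List[int] = []
    for guess, truth in zip(draft, target):
        if guess != truth:
            # First wrong guess: emit the main model's token and stop, since
            # later target entries were conditioned on the rejected guess.
            emitted.append(truth)
            return emitted
        emitted.append(guess)
    # Every guess matched, so the extra target token comes along as well.
    emitted.append(target[-1])
    return emitted


print(verify_draft([5, 7], [5, 7, 9]))  # both guesses accepted
print(verify_draft([5, 7], [5, 8, 9]))  # second guess rejected
```

In both cases the emitted tokens are exactly what the main model would have produced on its own, which is the sense in which there is no quality loss. A step yields between 1 and k + 1 tokens.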

MTP requires training-time support

You cannot add MTP to an existing model at inference time. The prediction head must exist in the weights from training. Qwen3.6 35B-A3B has one — it ships as a separate GGUF variant with the MTP tensors fused in (~19.4GB, same quantization level as the base model).

Building llama.cpp with MTP support

cd ~/llama.cpp
git fetch origin pull/22673/head:pr-22673
git merge --no-ff pr-22673 -m "Merge PR #22673: MTP Support"
cmake -B build -DGGML_CUDA=ON -DGGML_AVX2=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j8

Download the MTP model:

hf download localweights/Qwen3.6-35B-A3B-MTP-IQ4_XS-GGUF \
  --local-dir D:/aiprojects/models/

The OOM problem and -fitt 1536

The first attempt at 65k context crashed with OOM. The MTP draft context needs VRAM headroom that wasn't being reserved — llama.cpp by default tries to use all available VRAM for the model, leaving nothing for the draft.

-fitt 1536 fixes this: it tells llama.cpp to fit the model while keeping a 1536 MiB margin free for the draft context and KV cache. llama.cpp then auto-balances the CPU/GPU expert split to satisfy the constraint, effectively replacing the manual -ncmoe 11 tuning from Experiment 01.

With -fitt 1536: 131k context loaded cleanly, MTP active, no OOM.

Full command

# -fitt 1536: leave 1536 MiB for MTP draft + KV cache
# -c 131072:  131k context
# -np 1:      required, MTP supports single slot only
./llama-server \
  -m Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf \
  -fitt 1536 \
  -c 131072 \
  -n 32768 \
  -fa on \
  -np 1 \
  -ctk q8_0 -ctv q8_0 \
  -ctkd q8_0 -ctvd q8_0 \
  -ctxcp 64 \
  --no-mmap --mlock \
  --no-warmup \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --jinja --metrics \
  --host 0.0.0.0 --port 8080
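Because the command enables --metrics, throughput can be read from the server instead of eyeballing logs. The sketch below parses Prometheus-style text into a dictionary. The /metrics path, the metric names in the sample, and the sample text itself are assumptions for illustration (the values are the ones reported in this post), so check them against your build's output.

```python
import urllib.request
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse 'name value' lines of Prometheus text format, skipping comments."""
    metrics: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        if not name:
            continue
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return metrics


def fetch_metrics(url: str = "http://localhost:8080/metrics") -> Dict[str, float]:
    """Fetch and parse metrics from a running server (URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Hypothetical sample; metric names are assumed, not copied from a real run.
SAMPLE = """\
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 64.19
# TYPE llamacpp:prompt_tokens_seconds gauge
llamacpp:prompt_tokens_seconds 744
"""

print(parse_metrics(SAMPLE))
```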

llama-cli vs llama-server — again

The same pattern repeated. Running the MTP model with llama-cli: 10.9 TPS. Running it through llama-server: 64 TPS. The server's batched speculative verification loop is what makes MTP effective. Always benchmark with the server.

llama-cli + MTP: 10.9 TPS. MTP loads but the speculative loop is not optimized for interactive mode.

llama-server + MTP: 64.19 TPS. 76% draft acceptance rate, 131k context, GPU at 100% during generation.

Results

64.19 tokens/sec · 76% acceptance rate · 131k context tokens · 3.5× over Ollama

Ollama default: 18 t/s
llama-cli -ncmoe 11: 36 t/s
llama-server (LM cache): 57 t/s
llama-server + MTP ✦: 64 t/s

Step	Bottleneck removed	TPS
Ollama → llama-cli -ncmoe 11	Naive CPU/GPU split	18 → 36
llama-cli → llama-server	Full prompt reprocessing every request	36 → 57
+ MTP draft-mtp	Sequential token generation	57 → 64
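To see how much each step contributes, the table's TPS values can be turned into per-step and cumulative speedups. This is a small sketch using only the rounded numbers above; the helper name is illustrative.

```python
from typing import List, Tuple

# (configuration, tokens/sec) taken from the table above.
steps: List[Tuple[str, float]] = [
    ("Ollama default", 18.0),
    ("llama-cli -ncmoe 11", 36.0),
    ("llama-server (LM cache)", 57.0),
    ("llama-server + MTP", 64.0),
]


def step_speedups(rows: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Speedup of each configuration relative to the one before it."""
    return [(rows[i][0], rows[i][1] / rows[i - 1][1]) for i in range(1, len(rows))]


for name, factor in step_speedups(steps):
    print(f"{name}: {factor:.2f}x over previous step")

total = steps[-1][1] / steps[0][1]
print(f"total: {total:.2f}x over Ollama")
```

The largest gains come from the first two steps. MTP adds roughly 12% on top of the cached server.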

Real-World Cline Numbers

The benchmark is 500 tokens on a fresh context. Real Cline sessions accumulate much more. TPS degrades as context grows — the KV cache competes with expert weights for VRAM.

TPS and acceptance rate vs context size — real Cline sessions

Scenario	TPS	Acceptance	Context
Standard benchmark	64	76%	~1k
Cline coding task	57–66	74–93%	~25k
Cline deep session	46–53	55–67%	~43k
Claude Code agentic	46–53	55–67%	~43k

Cline's peak acceptance rate of 93% on fresh coding tasks is the standout number. Code is highly predictable, so at its best the MTP head guesses correctly nearly every time. Claude Code's multi-agent parallel requests split GPU compute across slots and accumulate more context, reducing per-request TPS.

The degradation at 43k context is the KV cache spilling into shared GPU memory — PCIe-backed DDR5 at 32 GB/s instead of dedicated VRAM at 448 GB/s. TurboQuant KV compression would keep the cache in VRAM at much larger contexts. That's Experiment 04.
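A rough way to see why even a small spill hurts: if a fraction of the bytes read per token comes from the slow path, the time per byte is a weighted sum, so the effective bandwidth is a harmonic mix of the two rates. The sketch below uses the 448 GB/s and 32 GB/s figures above. The spill fractions are hypothetical, and the model assumes reads are purely bandwidth-bound and not overlapped.

```python
VRAM_GBPS = 448.0    # dedicated VRAM bandwidth quoted above
SHARED_GBPS = 32.0   # PCIe-backed shared memory bandwidth quoted above


def effective_bandwidth(spill_fraction: float,
                        fast: float = VRAM_GBPS,
                        slow: float = SHARED_GBPS) -> float:
    """Effective GB/s when `spill_fraction` of the bytes come from slow memory."""
    if not 0.0 <= spill_fraction <= 1.0:
        raise ValueError("spill_fraction must be between 0 and 1")
    seconds_per_gb = (1.0 - spill_fraction) / fast + spill_fraction / slow
    return 1.0 / seconds_per_gb


# Hypothetical spill fractions, for illustration only.
for fraction in (0.0, 0.05, 0.10, 0.25):
    print(f"{fraction:.0%} spilled -> {effective_bandwidth(fraction):.0f} GB/s")
```

Under these assumptions, spilling 10% of the traffic already cuts effective bandwidth by more than half, which fits the direction of the slowdown seen at 43k context.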

Next — Experiment 04

80B Model on 41GB: When NVMe Becomes the Bottleneck

Qwen3 Coder Next 80B-A3B at Q4. 7GB spills to NVMe at 3-4 GB/s real-world speed. Same MTP + -fitt approach. Does it hold at double the model size? What's the minimum viable stack when the model no longer fits in fast memory?