Why llama-server Is Faster Than llama-cli
The answer is KV cache checkpointing. llama-server stores the attention state of each conversation as checkpoints. When a new message arrives, the server finds the nearest matching checkpoint and restores it — only processing the new tokens since that checkpoint.
restored context checkpoint (pos_min = 11924, n_tokens = 11925)
prompt eval: 744 tok/s ← only 771 NEW tokens processed
eval: 61.96 t/s ← generation
In this example, 11,925 tokens were restored from cache and only 771 new tokens needed processing, a 94% cache hit. A prompt eval that takes 14 seconds from scratch completes in about 1 second.
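The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch using the numbers from the log above; the function name `cache_hit_stats` and its parameters are made up for illustration and are not part of llama.cpp.

```python
def cache_hit_stats(restored_tokens: int, new_tokens: int, prompt_tok_per_s: float) -> dict:
    """Summarize how much of a prompt was served from a restored checkpoint."""
    total = restored_tokens + new_tokens
    return {
        "total_tokens": total,
        # Fraction of the prompt that did not need to be re-evaluated.
        "hit_ratio": restored_tokens / total,
        # Time spent on prompt eval for the uncached tail only.
        "incremental_seconds": new_tokens / prompt_tok_per_s,
    }


# Values taken from the log excerpt: 11,925 restored, 771 new, 744 tok/s.
stats = cache_hit_stats(restored_tokens=11925, new_tokens=771, prompt_tok_per_s=744)
print(f"{stats['hit_ratio']:.1%} of {stats['total_tokens']} tokens came from cache")
print(f"{stats['incremental_seconds']:.2f} s to process the new tokens")
```

The hit ratio rounds to 94%, and the incremental prompt eval lands at roughly one second, matching the log.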
llama-cli has no persistent state between runs. Every invocation reprocesses the full prompt from scratch. That's the 36 TPS ceiling — and why the server mode number from Experiment 02 looked suspiciously high.
KV cache checkpointing isn't optional for real coding use. Cline builds up 10k–40k tokens of context across a session. Without checkpointing, every message would require reprocessing the full history, adding 10–30 seconds of latency per turn.
How checkpoints work
llama-server creates a checkpoint every 8192 tokens during prompt processing. The -ctxcp 64 flag allows up to 64 checkpoints per slot. When a new prompt arrives, the server scans backwards to find the longest matching prefix, restores that checkpoint, then processes only the divergent tail. Each checkpoint is roughly 63-75 MiB of KV state.
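The lookup described above can be sketched in Python. This is a simplified model, not llama.cpp's actual code: the helper names (`common_prefix_len`, `pick_checkpoint`) are hypothetical, and it assumes a checkpoint at position p holds the state for tokens 0 to p-1 and can be reused only if the new prompt matches the cached one for at least p tokens.

```python
from bisect import bisect_right


def common_prefix_len(cached: list[int], prompt: list[int]) -> int:
    """Number of leading tokens shared by the cached sequence and the new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def pick_checkpoint(checkpoint_positions: list[int], cached: list[int], prompt: list[int]) -> int:
    """Return the token position to resume from (0 means reprocess everything).

    checkpoint_positions must be sorted in ascending order.
    """
    shared = common_prefix_len(cached, prompt)
    # Count the checkpoints that lie entirely inside the shared prefix.
    idx = bisect_right(checkpoint_positions, shared)
    return checkpoint_positions[idx - 1] if idx > 0 else 0


# Toy example: the new prompt diverges from the cached tokens at position 18000.
cached = list(range(20000))
prompt = list(range(18000)) + [-1] * 500
checkpoints = [8192, 16384]  # one checkpoint every 8192 tokens

resume = pick_checkpoint(checkpoints, cached, prompt)
print(f"resume at {resume}, reprocess {len(prompt) - resume} tokens")
```

Here the server would restore the checkpoint at 16384 and evaluate only the remaining 2,116 tokens. On the memory side, 64 checkpoints at 63 to 75 MiB each come to roughly 3.9 to 4.7 GiB per slot if every checkpoint is in use.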
MTP: Multi-Token Prediction
With llama-server's checkpoint cache giving 57 TPS, the next step was MTP. MTP is speculative decoding without a separate draft model: the Qwen3.6 35B-A3B model was trained with an additional prediction head that guesses the next 2-4 tokens. llama-server verifies them in a single forward pass, and correct guesses are accepted for free. There is zero quality loss, because rejected drafts fall back to standard sampling.
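The accept-or-fall-back rule is easiest to see in a toy version with greedy decoding. The sketch below is illustrative only: `verify_draft` and `toy_target` are hypothetical names, and the loop queries the target one position at a time for readability, whereas a real implementation scores every draft position in one batched forward pass.

```python
from typing import Callable, Sequence


def verify_draft(
    context: list[int],
    draft: Sequence[int],
    target_next: Callable[[list[int]], int],
) -> list[int]:
    """Greedy verification of drafted tokens against the target model.

    Draft tokens are kept while they match what the target would have produced.
    At the first mismatch the target's own token is emitted instead, so the
    output is identical to decoding with the target alone.
    """
    accepted: list[int] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # fall back to the target's token
            return accepted
        accepted.append(tok)
    # Every draft token matched: the same pass also yields one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def toy_target(tokens: list[int]) -> int:
    """Stand-in for the full model: always predicts last token + 1 (mod 10)."""
    return (tokens[-1] + 1) % 10


print(verify_draft([1, 2, 3], [4, 5], toy_target))  # both drafts accepted
print(verify_draft([1, 2, 3], [4, 9], toy_target))  # second draft rejected
print(verify_draft([1, 2, 3], [7, 8], toy_target))  # first draft rejected
```

With two draft tokens, one verification step yields between one and three tokens, and the result always equals what the target model would have generated by itself. That is why a wrong guess costs speed but not quality.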
MTP requires training-time support
You cannot add MTP to an existing model at inference time. The prediction head must exist in the weights from training. Qwen3.6 35B-A3B has one; it ships as a separate GGUF variant with the MTP tensors fused in (~19.4 GB, same quantization level as the base model).
Building llama.cpp with MTP support
cd ~/llama.cpp
git fetch origin pull/22673/head:pr-22673
git merge --no-ff pr-22673 -m "Merge PR #22673: MTP Support"
cmake -B build -DGGML_CUDA=ON -DGGML_AVX2=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j8
Download the MTP model:
hf download localweights/Qwen3.6-35B-A3B-MTP-IQ4_XS-GGUF \
--local-dir D:/aiprojects/models/
The OOM problem and -fitt 1536
The first attempt at 65k context crashed with OOM. The MTP draft context needs VRAM headroom that wasn't being reserved — llama.cpp by default tries to use all available VRAM for the model, leaving nothing for the draft.
-fitt 1536 fixes this: it tells llama.cpp to fit the model, then leave 1536 MB free for the draft context and KV cache. llama.cpp auto-balances the CPU/GPU expert split to satisfy the constraint, effectively replacing the manual -ncmoe 11 tuning from Experiment 01.
With -fitt 1536: 131k context loaded cleanly, MTP active, no OOM.
Full command
Three flags matter most here: -fitt 1536 leaves 1536 MB for the MTP draft and KV cache, -c 131072 sets a 131k context, and -np 1 is required because MTP supports only a single slot.
./llama-server \
-m Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf \
-fitt 1536 \
-c 131072 \
-n 32768 \
-fa on \
-np 1 \
-ctk q8_0 -ctv q8_0 \
-ctkd q8_0 -ctvd q8_0 \
-ctxcp 64 \
--no-mmap --mlock \
--no-warmup \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--jinja --metrics \
--host 0.0.0.0 --port 8080
llama-cli vs llama-server — again
The same pattern repeated. Running the MTP model with llama-cli: 10.9 TPS. Running it through llama-server: 64 TPS. The server's batched speculative verification loop is what makes MTP effective. Always benchmark with the server.
llama-cli: 10.9 TPS. MTP loads, but the speculative loop is not optimized for interactive mode.
llama-server: 64.19 TPS with a 76% draft acceptance rate, 131k context, and GPU at 100% during generation.
Results
| Step | Bottleneck removed | TPS |
|---|---|---|
| Ollama → llama-cli -ncmoe 11 | Naive CPU/GPU split | 18 → 36 |
| llama-cli → llama-server | Full prompt reprocessing every request | 36 → 57 |
| + MTP draft-mtp | Sequential token generation | 57 → 64 |
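To put the table in relative terms, the per-step and cumulative speedups can be computed directly from its TPS figures. This is a small sketch; the `steps` list simply restates the table.

```python
# (step, TPS before, TPS after), copied from the results table.
steps = [
    ("Ollama -> llama-cli -ncmoe 11", 18, 36),
    ("llama-cli -> llama-server", 36, 57),
    ("+ MTP draft-mtp", 57, 64),
]

for name, before, after in steps:
    print(f"{name}: {after / before:.2f}x")

# End-to-end gain from the first baseline to the final configuration.
cumulative = steps[-1][2] / steps[0][1]
print(f"cumulative: {cumulative:.2f}x")
```

The three steps work out to about 2.00x, 1.58x, and 1.12x, for a cumulative gain of roughly 3.56x over the Ollama baseline.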
Real-World Cline Numbers
The benchmark is 500 tokens on a fresh context. Real Cline sessions accumulate much more. TPS degrades as context grows — the KV cache competes with expert weights for VRAM.
| Scenario | TPS | Acceptance | Context |
|---|---|---|---|
| Standard benchmark | 64 | 76% | ~1k |
| Cline coding task | 57–66 | 74–93% | ~25k |
| Cline deep session | 46–53 | 55–67% | ~43k |
| Claude Code agentic | 46–53 | 55–67% | ~43k |
Cline's 93% acceptance rate on fresh coding tasks is the standout number. Code is highly predictable — the MTP head guesses correctly nearly every time. Claude Code's multi-agent parallel requests split GPU compute across slots and accumulate more context, reducing per-request TPS.
The degradation at 43k context is the KV cache spilling into shared GPU memory — PCIe-backed DDR5 at 32 GB/s instead of dedicated VRAM at 448 GB/s. TurboQuant KV compression would keep the cache in VRAM at much larger contexts. That's Experiment 04.
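A rough model shows why even a small spill hurts. It assumes generation is purely memory-bandwidth bound and that a fixed fraction of the bytes read per token sits in shared memory; both are simplifications, and `effective_bandwidth` is a hypothetical helper, not a measurement. The 448 GB/s and 32 GB/s figures are the ones quoted above.

```python
VRAM_GBPS = 448.0    # dedicated VRAM bandwidth quoted above
SHARED_GBPS = 32.0   # PCIe-backed shared memory bandwidth quoted above


def effective_bandwidth(spilled_fraction: float) -> float:
    """Blended bandwidth when part of the per-token reads come from shared memory.

    Time per byte is the weighted sum of the two access times, so the
    effective bandwidth is a weighted harmonic mean of the two rates.
    """
    if not 0.0 <= spilled_fraction <= 1.0:
        raise ValueError("spilled_fraction must be between 0 and 1")
    time_per_gb = (1.0 - spilled_fraction) / VRAM_GBPS + spilled_fraction / SHARED_GBPS
    return 1.0 / time_per_gb


for fraction in (0.0, 0.05, 0.10, 0.25):
    print(f"{fraction:.0%} spilled -> {effective_bandwidth(fraction):.0f} GB/s effective")
```

Under these assumptions, spilling just 5% of the reads cuts effective bandwidth by about 40%, because the slow path dominates the time per token.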