What does --n-cpu-moe do in llama.cpp?

In upstream llama.cpp, --n-cpu-moe N keeps the Mixture of Experts (MoE) expert weights of the first N layers in CPU memory instead of offloading them to the GPU; the related --cpu-moe flag keeps all of them there. It is a layer count, not a thread count. Forks can wire it differently, so check the help output of the specific build you run.

Why can more CPU work make llama.cpp faster?

More CPU work can make llama.cpp faster when it cuts GPU idle time. If the CPU prepares expert routing and memory handoffs sooner, the GPU waits less between decode steps. Throughput goes up even though total host activity rises. That's the trade.

How should I tune Qwen gguf on 12 GB VRAM?

Tune it by balancing GPU offload, memory fit, and how many layers' expert weights stay on the CPU. Start with VRAM and RAM headroom, then sweep GPU layers and --n-cpu-moe together. Most bad results come from hidden transfer stalls, not from one obviously wrong flag.

Does TurboQuant change Qwen 35B performance behavior?

Yes, TurboQuant can change where the bottleneck lands because custom quant kernels alter compute and memory tradeoffs. That may improve speed, but it can also make thread tuning more sensitive. Comparing against upstream llama.cpp is the fastest way to tell whether a gain is general or fork-specific. We'd do that first.

When should I lower n-cpu-moe instead of raising it?

Lower it when you have VRAM to spare, or when CPU contention, cache thrash, or scheduler overhead starts to dominate. If tokens per second flatten or fall while CPU usage spikes and the system feels saturated, you've probably gone past the sweet spot. Keeping more expert weights on the host stops paying off once the extra CPU work and coordination cost exceed the benefit of the VRAM it frees.

llama.cpp n-cpu-moe performance: why speedups happen

⚡ Quick Answer

Yes, increasing --n-cpu-moe can raise throughput. The setting keeps more MoE expert weights in system RAM, and on constrained 12 GB VRAM systems that can relieve memory pressure, reduce stalls, keep the GPU fed, and lift tokens per second even though the CPU does more total work.

llama.cpp n-cpu-moe performance can look strange the first time you run into it. You crank up a CPU-heavy knob, and tokens per second climb anyway. Feels backward. But on a 12 GB VRAM box running Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf through a TurboQuant-flavored llama.cpp build, the limit often isn't plain CPU math by itself. It's the handoff across expert routing, quantized weight reads, memory traffic, and GPU idle gaps.

Why did llama.cpp n-cpu-moe increase tokens per second?

Because a higher --n-cpu-moe setting can trim pipeline stalls and improve overlap, it can raise throughput even when the CPU takes on more MoE work. Short version: less waiting. In upstream llama.cpp the flag keeps the expert weights of the first N layers in system RAM, so raising it frees VRAM for the attention weights, KV cache, and compute buffers that would otherwise spill or force a lower GPU layer count.

In mixture-of-experts inference, the model doesn't wake up every expert for every token; it routes each token through a smaller subset, which creates uneven scheduling and messy memory access. And that unevenness matters more than most people assume. If the host side is slow to ready expert weights or other host-managed work, the GPU can sit underfed between decode steps. That's wasted silicon. In llama.cpp, especially with partially offloaded GGUF models, the fastest setup often isn't the one doing the least CPU work. It's the one with the fewest bubbles. We'd argue that's the missing mental model in a lot of forum replies.

For a concrete example, people running Qwen-family MoE checkpoints on cards like the RTX 3060 12 GB often report sudden jumps after offload tuning, because the earlier choke point was stalls and transfers rather than matrix multiply throughput.
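A toy timing model makes the "fewest bubbles" point concrete. This is only a sketch: the per-token costs are made-up illustrative numbers rather than measurements, and the function name and the `overlap` parameter are hypothetical, not anything llama.cpp exposes.

```python
def tokens_per_second(host_ms: float, gpu_ms: float, overlap: float) -> float:
    """Toy decode model.

    `overlap` in [0, 1] is the fraction of the shorter stage that is
    hidden behind the longer one (0 = fully serial, 1 = fully pipelined).
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    hidden_ms = overlap * min(host_ms, gpu_ms)
    step_ms = host_ms + gpu_ms - hidden_ms
    return 1000.0 / step_ms


# Illustrative numbers only: 20 ms of host work and 30 ms of GPU work per token.
serial = tokens_per_second(20.0, 30.0, overlap=0.0)     # stages run back to back
pipelined = tokens_per_second(20.0, 30.0, overlap=1.0)  # host work fully hidden
print(f"serial: {serial:.1f} tok/s, pipelined: {pipelined:.1f} tok/s")
```

In this model the total amount of work never changes; only the waiting does, which is why a setting that adds host work can still come out ahead.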


How llama.cpp moe offload explanation works on 12 GB VRAM rigs

The short answer is that 12 GB VRAM usually forces a split execution path, and split paths reward balanced host-device coordination. A dense model tends to stress the GPU in a more predictable way, but an MoE model adds routing, expert dispatch, and fragmented memory behavior that can push some consequential work back onto the CPU and system RAM. So the GPU isn't the whole story. On a machine with 32 GB RAM and 12 GB VRAM, llama.cpp may leave some layers, expert weights, caches, or side buffers in host memory depending on quantization format, context settings, and offload choices. That means PCIe transfers and host-side prep can become first-order factors. PCIe Gen4 x16 offers about 32 GB/s in each direction in ideal conditions, far below on-card GDDR6 bandwidth that often clears 300 GB/s on mainstream GPUs. That's roughly a tenfold gap. When expert routing triggers frequent cross-boundary access, a better split between host and device can cut wait time enough to outweigh the extra CPU cycles. Think of an RTX 3060 paired with DDR5 desktop memory: the balance between host prep and device execution can matter more than people expect.
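The bandwidth gap is easy to turn into rough numbers. The sketch below computes the best-case time to move a chunk of weights across each path. It counts bandwidth only and ignores latency and driver overhead, and the 0.5 GB slice size is a hypothetical value chosen purely for illustration.

```python
PCIE_GEN4_X16_GBPS = 32.0   # approximate one-direction peak, GB/s
RTX_3060_VRAM_GBPS = 360.0  # GDDR6 bandwidth of the RTX 3060 12 GB, GB/s


def read_time_ms(size_gb: float, bandwidth_gbps: float) -> float:
    """Best-case time in milliseconds to move `size_gb` at the given bandwidth."""
    return size_gb / bandwidth_gbps * 1000.0


expert_slice_gb = 0.5  # hypothetical amount of weight data, for illustration
over_pcie = read_time_ms(expert_slice_gb, PCIE_GEN4_X16_GBPS)
from_vram = read_time_ms(expert_slice_gb, RTX_3060_VRAM_GBPS)
print(f"PCIe: {over_pcie:.2f} ms, VRAM: {from_vram:.2f} ms, "
      f"ratio: {over_pcie / from_vram:.2f}x")
```

The ratio depends only on the two bandwidth figures, so the same data costs about eleven times longer to reach the GPU over PCIe than to read from local VRAM, whatever slice size you plug in.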


What expert routing diagrams reveal about Qwen 35B TurboQuant performance

The direct answer is that MoE routing creates uneven work, and uneven work rewards a careful split between host and device. Think of each token taking a path like this: token embedding -> router score -> top-k expert choice -> fetch expert weights -> run expert MLP -> merge outputs -> continue decode. Simple on paper. Messy in practice. If two experts land in different memory regions, or one sits in a section handled differently by TurboQuant kernels, timing can wobble from token to token.

Here's the first diagram to keep in mind:

[Token] -> [Router on active layers] -> [Expert A + Expert C] -> [Combine] -> [Next layer]

And here's the second:

[CPU thread pool prepares dispatch] || [GPU runs current kernels] -> [sync point] -> [next token]

When --n-cpu-moe is too low for the VRAM you have, the GPU side can end up oversubscribed and stall at those sync points. When you raise it, more expert weights stay in host RAM, memory pressure eases, and the GPU can spend less time waiting. That, more than any mystery speed hack, likely explains those surprising Qwen 35B TurboQuant gains people keep posting. We'd say that's the real story.
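The router -> top-k -> expert -> merge path can be sketched in a few lines of plain Python. This is a minimal illustration, not llama.cpp's or Qwen's actual implementation: the function names are hypothetical, the "experts" are stand-in scaling functions, and renormalizing the weights over the chosen experts is one common variant that is assumed here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, router_logits, experts, k=2):
    """Weighted merge of the chosen experts' outputs for one token vector."""
    out = [0.0] * len(x)
    for idx, weight in route_token(router_logits, k):
        y = experts[idx](x)  # only the selected experts run
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Stand-in experts: each one scales the input by a different factor.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.5]  # made-up router scores for one token
chosen = route_token(logits, k=2)
merged = moe_layer([1.0, 1.0], logits, experts, k=2)
print(chosen, merged)
```

Only two of the four experts run for this token, and a different token with different router scores would touch a different pair. That is the source of the uneven memory access described above.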

Which bottlenecks drive llama.cpp inference speed tuning with quantized GGUF?

The best answer is that quantized GGUF inference lives or dies by the slowest of four moving parts: compute, memory bandwidth, transfer latency, and scheduler overhead. Quantization cuts memory footprint and usually lifts effective throughput, but it can also change kernel behavior, dequant cost, and cache locality. Not all Q4 variants behave the same. In llama.cpp, formats like Q4_K and related grouped quant schemes trade precision and packing efficiency against dequant overhead, and the exact payoff depends on your CPU cache, GPU architecture, and whether kernels come from CUDA, Metal, Vulkan, or a custom path like TurboQuant. We see this constantly in community benchmarks. A faster quant on one rig can lose on another because DRAM latency or PCIe contention becomes the real limiter. MLPerf Inference results have pointed to this for years: serving speed depends heavily on system balance, not just accelerator peak specs, and that same logic carries over here even if llama.cpp hobbyist rigs aren't a one-to-one match. Our view is blunt: if you only watch GPU utilization, you'll misread half your llama.cpp inference tuning problems. Take a Ryzen desktop and an RTX 3060, for instance; the bottleneck can hop around depending on cache behavior alone.

How to benchmark Qwen gguf 12GB VRAM performance without fooling yourself

The direct answer is that you need controlled tests, fixed prompts, and one-variable changes, or you'll confuse noise with signal. Start by measuring prompt processing and token generation separately, because MoE routing effects often show up more clearly in decode than in prefill. And keep context length fixed. Use the same seed, shut down background tasks you can control, and log GPU utilization, CPU package usage, RAM pressure, VRAM usage, and tokens per second together. A single FPS-style number hides too much. As a rough reference point, many llama.cpp users report that retuning offload and host-side settings on 30B to 40B-class quantized models can shift decode throughput by double-digit percentages, especially on DDR5 desktops with midrange GPUs. That said, thermal throttling and OS scheduling can fake gains or losses. So run at least three passes per setting and keep the median. We'd argue that's the minimum. A Windows box with Discord updates kicking off in the background can throw off a result more than people realize.
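The "three passes, keep the median" rule can be sketched as a small helper. The decision rule here (call a gain real only when it exceeds the larger run-to-run spread) is an assumption chosen for illustration, not an established standard, and the tok/s figures are hypothetical.

```python
from statistics import median


def summarize(runs_tok_s):
    """Return (median, relative spread) for repeated runs of one setting."""
    if len(runs_tok_s) < 3:
        raise ValueError("need at least three passes per setting")
    mid = median(runs_tok_s)
    spread = (max(runs_tok_s) - min(runs_tok_s)) / mid
    return mid, spread


def is_real_gain(baseline_runs, candidate_runs):
    """True only if the median gain exceeds the larger run-to-run spread."""
    base_mid, base_spread = summarize(baseline_runs)
    cand_mid, cand_spread = summarize(candidate_runs)
    gain = (cand_mid - base_mid) / base_mid
    return gain > max(base_spread, cand_spread)


# Hypothetical decode speeds in tok/s, three passes each.
baseline = [18.2, 18.9, 18.5]
candidate = [22.8, 23.4, 23.1]
noisy_baseline = [18.0, 20.0, 19.0]
noisy_candidate = [19.5, 18.5, 20.5]

print(is_real_gain(baseline, candidate))              # gain well above the spread
print(is_real_gain(noisy_baseline, noisy_candidate))  # gain buried in the noise
```

Run it once on decode numbers and once on prompt-processing numbers, so a win in one phase cannot hide a loss in the other.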

What tuning checklist fixes weird llama.cpp n-cpu-moe performance results?

The practical answer is to tune in dependency order: verify memory limits, then offload, then --n-cpu-moe, then batch and context, then kernel choices. Boring, yes. First, confirm you aren't swapping system RAM to disk or overcommitting VRAM, because paging wrecks any useful signal. Second, test nearby --gpu-layers or similar offload settings to find the boundary where transfers start hurting more than they help. Third, sweep --n-cpu-moe in small steps such as 8, 12, 16, 20, 24, and 30 while watching both tok/s and CPU saturation. Fourth, compare prompt eval and decode separately; a win in one can hide a loss in the other. Fifth, test different --threads values and affinity behavior, since CPU-side expert work competes with the main sampling loop for cores. And finally, if you're using TurboQuant or another fork, compare one run against upstream llama.cpp because custom kernels sometimes redraw the bottleneck map. A side-by-side run on upstream versus TurboQuant can tell you more than an hour of guessing.
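The sweep in the third step is easier to keep honest when the command lines are generated rather than typed. The sketch below only builds and prints them. It assumes your build's llama-bench accepts --n-cpu-moe alongside the usual -m, -ngl, -p, -n, and -r options, so confirm that with `llama-bench --help` first, especially on a fork. The helper name and the prompt and generation lengths are illustrative choices.

```python
import shlex


def sweep_commands(model_path, n_cpu_moe_values, gpu_layers=99, repetitions=3):
    """Build one llama-bench argv list per --n-cpu-moe value (nothing is executed)."""
    commands = []
    for n in n_cpu_moe_values:
        commands.append([
            "llama-bench",
            "-m", model_path,
            "-ngl", str(gpu_layers),   # offload everything that fits
            "--n-cpu-moe", str(n),     # assumed to be supported by your build
            "-p", "512",               # prompt-processing test length
            "-n", "128",               # token-generation test length
            "-r", str(repetitions),    # repeated passes per setting
        ])
    return commands


for argv in sweep_commands("Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf", [8, 12, 16, 20, 24, 30]):
    print(shlex.join(argv))
```

Because only one value changes per command, any difference in the reported prompt and generation speeds can be attributed to --n-cpu-moe rather than to a stray setting.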

Step-by-Step Guide

1. Measure a clean baseline
Run the same prompt, context length, and sampling settings three times before changing anything. Record prompt eval speed, decode speed, CPU usage, VRAM usage, and RAM usage separately. You'll need that baseline to tell a real gain from random variance.

2. Map your memory boundary
Check whether the model plus KV cache fits cleanly inside your VRAM and system RAM limits. If VRAM spills force repeated host-device movement, throughput can collapse in strange ways. Tools like nvidia-smi, Windows Task Manager, or nvtop make this visible fast.

3. Sweep n-cpu-moe methodically
Increase --n-cpu-moe in small increments rather than jumping straight to the maximum. Watch for the point where tokens per second rise, flatten, or reverse. On 12 GB VRAM rigs, the best value often lands where the model fits in VRAM with a little headroom and the CPU isn't oversubscribed.

4. Separate decode from prefill
Benchmark prompt ingestion and token generation as different phases. Changes to --n-cpu-moe may barely affect prefill but materially improve decode. That's why a single average throughput number can mislead even experienced users.

5. Compare offload and thread settings together
Retest your best n-cpu-moe value with adjacent GPU offload and CPU thread counts. These settings interact, especially when expert routing and quantized weights bounce across memory tiers. A good --n-cpu-moe value paired with a poor offload split can still lose badly.

6. Validate on a second build
Run the same benchmark on upstream llama.cpp if you're using TurboQuant or another specialized fork. Different kernels, schedulers, or quant handlers can move the bottleneck. If the pattern persists across builds, you've probably found a real systems effect rather than a fork-specific quirk.
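Once the sweep from step 3 is done, the "rise, flatten, or reverse" reading can be made mechanical. This is a sketch under stated assumptions: the 2% band for calling a step flat is arbitrary, the function name is hypothetical, and the tok/s medians are invented for illustration.

```python
def classify_sweep(results, flat_band=0.02):
    """Label each step of a sweep and return the best setting.

    `results` maps an --n-cpu-moe value to its median decode tok/s.
    Returns (best_value, trend), where trend is a list of
    (previous_setting, current_setting, 'rise' | 'flat' | 'drop').
    """
    if not results:
        raise ValueError("no results to classify")
    settings = sorted(results)
    trend = []
    for prev, cur in zip(settings, settings[1:]):
        change = (results[cur] - results[prev]) / results[prev]
        if change > flat_band:
            label = "rise"
        elif change < -flat_band:
            label = "drop"
        else:
            label = "flat"
        trend.append((prev, cur, label))
    best = max(settings, key=lambda s: results[s])
    return best, trend


# Hypothetical median decode speeds from a sweep.
sweep = {8: 14.0, 12: 17.5, 16: 21.0, 20: 21.2, 24: 19.0}
best, trend = classify_sweep(sweep)
print("best:", best)
for prev, cur, label in trend:
    print(f"{prev} -> {cur}: {label}")
```

A flat step followed by a drop marks the plateau. Retesting the settings around it with adjacent offload and thread values, as in step 5, is usually more informative than extending the sweep further.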

Key Statistics

PCIe Gen4 x16 tops out at roughly 32 GB/s per direction in ideal conditions. That figure matters because host-device transfers for partially offloaded MoE models are far slower than on-card VRAM access, so scheduling inefficiency quickly shows up as lost throughput.

Consumer GPUs such as the GeForce RTX 3060 12 GB ship with memory bandwidth around 360 GB/s, according to NVIDIA specifications. The huge gap between local VRAM bandwidth and PCIe transfer bandwidth explains why keeping expert work well staged matters so much on constrained VRAM rigs.

MLPerf Inference submissions routinely show large performance swings across systems with similar accelerator classes but different host configurations. The lesson for llama.cpp users is simple: CPU, memory, and software stack choices can materially change throughput even when the GPU stays the same.

Community llama.cpp benchmarks often report double-digit decode gains after thread and offload retuning on 30B-plus quantized models. Those anecdotal but repeated results support the idea that weird speedups usually come from removing pipeline stalls rather than from any single magic flag.

Key Takeaways

✓ Keeping more MoE expert layers on the CPU can speed inference by cutting idle GPU time
✓ MoE routing, quantization, and memory bandwidth interact more than most users expect
✓ TurboQuant and GGUF layouts can shift where the bottleneck actually sits
✓ 12 GB VRAM rigs often win from balanced overlap, not maximal GPU offload
✓ A repeatable tuning checklist beats one-off benchmark anecdotes every time
