Ollama leaves 2.2× performance on the table for MoE models. A deep dive into memory bandwidth hierarchy and why GPU utilization % is a misleading metric.
The goal is simple: run the biggest open-weight model possible at good tokens per second, entirely on local, consumer-grade hardware. No cloud. No API keys. Just your machine.
Follow the build →
How to change the default directory for Ollama models on Windows using an environment variable.
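As a quick preview of that technique: Ollama looks up the `OLLAMA_MODELS` environment variable to locate its model store, so pointing it at another drive is a one-line change. A minimal sketch for Windows (the path below is a placeholder, not a recommendation):

```shell
:: Persistently redirect Ollama's model storage (Command Prompt).
:: "D:\ollama\models" is an example path -- use any directory with enough space.
setx OLLAMA_MODELS "D:\ollama\models"
```

Restart Ollama afterward so the new value is picked up; existing models can be moved into the new directory rather than re-downloaded.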
A clear breakdown of what large language models are, how they work, and why they matter for local inference.
Compiled Thoughts is a public build log with one goal: run the biggest open-weight model possible at good tokens per second, entirely on local hardware.
Open-weight models are getting bigger and better, fast. But actually running them — on your own machine, without cloud APIs or expensive subscriptions — is still harder than it should be. This blog is about closing that gap.
Every post is a step in the build: benchmarks, tooling, configuration, failures, and breakthroughs. All numbers are real and measured.