Running an LLM on Your Laptop in 2026: M-Series, Quantization, and What Actually Works
Step-by-step: pick a quantization, install Ollama or LM Studio, run a 7B-14B model on a MacBook or 16GB GPU, and not lose your sanity.
In 2024 you could run a 7B model on a MacBook if you were willing to wait. In 2026 a base M4 MacBook Air (16GB) runs Phi-4 or Llama 4 8B at 25-40 tokens/sec — fast enough to be a real coding assistant. Here's how to set it up without spending a weekend reading docs.
**Pick a model first**

Three solid defaults:
- Llama 4 8B Q4: 4.7GB on disk. Best general chat. Most polished.
- Phi-4 Q4: 8.5GB on disk. Best for math, code, structured reasoning.
- Qwen3.6 7B Q4: 4.5GB on disk. Best multilingual + has vision.
If you're on a 16GB Mac/PC, all three fit with room for everything else. On 8GB, only Llama 4 8B Q4 is workable, and even then you'll want to close other apps.
**Install Ollama (easiest)**

Download from ollama.com and run the installer. Then in a terminal:
```
ollama pull llama4:8b
ollama run llama4:8b
```
That's it. There's a chat interface in the terminal. For a GUI, install Open WebUI (Docker, takes 5 minutes).
**Install LM Studio (best GUI)**

Download from lmstudio.ai and install. Search for the model in the in-app browser, click download, then "Load model." You get built-in chat, a model browser, and a local server endpoint that mimics the OpenAI API, so any code targeting GPT works against your laptop with one URL change.
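To make "one URL change" concrete, here is a sketch of the request any OpenAI-style client sends, assuming LM Studio's default server port (1234); the model name is hypothetical, use whatever you loaded in the GUI:

```python
import json

# Point your client at LM Studio's local server instead of api.openai.com.
# Port 1234 is LM Studio's default; change it if you changed the server config.
BASE_URL = "http://localhost:1234/v1"

payload = {
    "model": "phi-4-q4_k_m",  # assumption: the name of the model you loaded
    "messages": [
        {"role": "system", "content": "You are a terse coding assistant."},
        {"role": "user", "content": "Reverse a string in Python."},
    ],
    "temperature": 0.2,
}

# POST json.dumps(payload) to this URL with any HTTP client:
endpoint = BASE_URL + "/chat/completions"
body = json.dumps(payload)
```

With the official `openai` Python client, the same change is passing `base_url=BASE_URL` (and any placeholder API key) to the constructor; nothing else in your code moves.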
**Quantization, briefly**

- Q8 = 8-bit weights: almost lossless, big files.
- Q4_K_M = 4-bit: ~75% smaller than full precision, ~2% quality loss.
- Q2 = 2-bit: much smaller, noticeable quality loss.

Default to **Q4_K_M** for chat, and Q5_K_M for code if you have the RAM.
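The file sizes quoted earlier follow from simple arithmetic: parameters × bits-per-weight ÷ 8. A sketch, with effective bits-per-weight values that are approximations (K-quants store per-block scale metadata, so Q4_K_M lands closer to ~4.8 bits than exactly 4):

```python
# Approximate effective bits per weight for common GGUF quantizations.
# These are rough figures, not exact format constants.
BITS = {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q2_K": 2.6}

def approx_size_gb(params_b: float, quant: str) -> float:
    """Approximate model file size in GB for params_b billion weights."""
    return params_b * 1e9 * BITS[quant] / 8 / 1e9

print(round(approx_size_gb(8, "Q4_K_M"), 1))  # -> 4.8, close to Llama 4 8B Q4's 4.7GB
```

The same arithmetic explains the size gap: an 8-bit file is about twice a 4-bit one, not more.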
**What about M-series Macs specifically?**

The unified memory in M-series chips means the GPU and CPU share RAM. Ollama and LM Studio both use Metal acceleration automatically. Rule of thumb: model weights file ≤ (RAM - 4GB). So a 16GB Mac → model files up to ~12GB; an M3 Max with 36GB → model files up to ~32GB. (A 70B Q4 is roughly 40GB on disk, so even 36GB doesn't cover that tier; you'd want 48GB or more.)
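The rule of thumb above, as one line of code (the 4GB headroom figure is the article's, left as a tunable default):

```python
def max_model_file_gb(ram_gb: float, headroom_gb: float = 4.0) -> float:
    """Largest model file that fits comfortably: leave ~4GB for the OS and apps."""
    return max(ram_gb - headroom_gb, 0.0)

for ram in (8, 16, 36):
    print(ram, "GB RAM ->", max_model_file_gb(ram), "GB of model weights")
```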
**Speed expectations (M4 16GB, real numbers)**

- Llama 4 8B Q4: 35 tok/s
- Phi-4 Q4: 28 tok/s
- Qwen3.6 7B Q4: 32 tok/s
- Gemma 3 9B Q4: 30 tok/s
Above 20 tok/s is comfortable for chat. Below 10 tok/s, you'll be annoyed.
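To see why 20 tok/s is the comfort line, convert it to wall-clock time for a typical answer (the ~400-token reply length is my assumption, not a measured figure):

```python
def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Wall-clock time to generate a reply at a given decoding speed."""
    return tokens / tok_per_s

# A ~400-token answer at comfortable vs painful speeds:
print(round(seconds_for(400, 35), 1))  # -> 11.4 (fine for chat)
print(round(seconds_for(400, 8), 1))   # -> 50.0 (you'll tab away and forget)
```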
**Common mistakes**

1. **Loading a model bigger than your RAM**: macOS will swap and the experience will be miserable. Stay under (RAM - 4GB).
2. **Using Q8 when Q4 would work**: 2x bigger files, roughly 2x slower (generation is memory-bandwidth bound), and basically the same quality for chat.
3. **Running on CPU when the GPU is available**: Ollama and LM Studio both auto-detect, but if something is misconfigured you'll be at <5 tok/s.
4. **Using a base model instead of an instruct model**: base models autocomplete; instruct models chat. The "-Instruct" or "-Chat" suffix is what you want.
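Mistakes 1 and 4 are easy to catch before you download anything. A hypothetical pre-flight check; the function name, thresholds, and suffix list are mine, not from Ollama or LM Studio:

```python
def check_model_choice(file_gb: float, ram_gb: float, name: str) -> list:
    """Flag the two most common local-LLM mistakes before pulling a model."""
    warnings = []
    if file_gb > ram_gb - 4:
        warnings.append("model bigger than RAM - 4GB: expect swapping")
    if not any(s in name.lower() for s in ("instruct", "chat")):
        warnings.append("looks like a base model: it will autocomplete, not chat")
    return warnings

print(check_model_choice(13, 16, "llama4:8b"))           # both warnings fire
print(check_model_choice(4.7, 16, "llama4:8b-instruct")) # -> []
```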
If you're picking your first model, start with Llama 4 8B Q4 in Ollama. Once that works, swap in Phi-4 to compare. That's the whole onboarding.