Running an LLM on Your Laptop in 2026: M-Series, Quantization, and What Actually Works
Step-by-step: pick a quantization, install Ollama or LM Studio, run a 7B-14B model on a MacBook or 16GB GPU, and not lose your sanity.
In 2024 you could run a 7B model on a MacBook if you were willing to wait. In 2026 a base M4 MacBook Air (16GB) runs Phi-4 or Llama 4 8B at 25-40 tokens/sec — fast enough to be a real coding assistant. Here's how to set it up without spending a weekend reading docs.
**Pick a model first**

Three solid defaults:
- Llama 4 8B Q4: 4.7GB on disk. Best general chat. Most polished.
- Phi-4 Q4: 8.5GB on disk. Best for math, code, structured reasoning.
- Qwen3.6 7B Q4: 4.5GB on disk. Best multilingual + has vision.
If you're on a 16GB Mac/PC, all three fit with room for everything else. On 8GB, only Llama 4 8B Q4 is workable, and even then you'll want to close other apps.
**Install Ollama (easiest)**

Download from ollama.com and run the installer. Then in a terminal:
```
ollama pull llama4:8b
ollama run llama4:8b
```
That's it. There's a chat interface in the terminal. For a GUI, install Open WebUI (Docker, takes 5 minutes).
**Install LM Studio (best GUI)**

Download from lmstudio.ai and install. Search for the model in the in-app browser, click download, then "Load model." You get built-in chat, a model browser, and a local server endpoint that mimics the OpenAI API, so any code targeting GPT works against your laptop with one URL change.
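To make "one URL change" concrete, here is a sketch of the request any OpenAI-style client sends, assuming LM Studio's default server port (1234); the model name is hypothetical, use whatever you loaded in the GUI:

```python
import json

# Point your client at LM Studio's local server instead of api.openai.com.
# Port 1234 is LM Studio's default; change it if you changed the server config.
BASE_URL = "http://localhost:1234/v1"

payload = {
    "model": "phi-4-q4_k_m",  # assumption: the name of the model you loaded
    "messages": [
        {"role": "system", "content": "You are a terse coding assistant."},
        {"role": "user", "content": "Reverse a string in Python."},
    ],
    "temperature": 0.2,
}

# POST json.dumps(payload) to this URL with any HTTP client:
endpoint = BASE_URL + "/chat/completions"
body = json.dumps(payload)
```

With the official `openai` Python client, the same change is passing `base_url=BASE_URL` (and any placeholder API key) to the constructor; nothing else in your code moves.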
**Quantization, briefly**

- Q8 = 8-bit weights: almost lossless, big files.
- Q4_K_M = 4-bit: ~75% smaller than full precision, ~2% quality loss.
- Q2 = 2-bit: much smaller, noticeable quality loss.

Default to **Q4_K_M** for chat, and Q5_K_M for code if you have the RAM.
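The file sizes quoted earlier follow from simple arithmetic: parameters × bits-per-weight ÷ 8. A sketch, with effective bits-per-weight values that are approximations (K-quants store per-block scale metadata, so Q4_K_M lands closer to ~4.8 bits than exactly 4):

```python
# Approximate effective bits per weight for common GGUF quantizations.
# These are rough figures, not exact format constants.
BITS = {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q2_K": 2.6}

def approx_size_gb(params_b: float, quant: str) -> float:
    """Approximate model file size in GB for params_b billion weights."""
    return params_b * 1e9 * BITS[quant] / 8 / 1e9

print(round(approx_size_gb(8, "Q4_K_M"), 1))  # -> 4.8, close to Llama 4 8B Q4's 4.7GB
```

The same arithmetic explains the size gap: an 8-bit file is about twice a 4-bit one, not more.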
**What about M-series Macs specifically?**

The unified memory in M-series chips means the GPU and CPU share RAM. Ollama and LM Studio both use Metal acceleration automatically. Rule of thumb: model weights file ≤ (RAM - 4GB). So a 16GB Mac → model files up to ~12GB; an M3 Max with 36GB → model files up to ~32GB. (A 70B Q4 is roughly 40GB on disk, so even 36GB doesn't cover that tier; you'd want 48GB or more.)
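The rule of thumb above, as one line of code (the 4GB headroom figure is the article's, left as a tunable default):

```python
def max_model_file_gb(ram_gb: float, headroom_gb: float = 4.0) -> float:
    """Largest model file that fits comfortably: leave ~4GB for the OS and apps."""
    return max(ram_gb - headroom_gb, 0.0)

for ram in (8, 16, 36):
    print(ram, "GB RAM ->", max_model_file_gb(ram), "GB of model weights")
```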
**Speed expectations (M4 16GB, real numbers)**

- Llama 4 8B Q4: 35 tok/s
- Phi-4 Q4: 28 tok/s
- Qwen3.6 7B Q4: 32 tok/s
- Gemma 3 9B Q4: 30 tok/s
Above 20 tok/s is comfortable for chat. Below 10 tok/s, you'll be annoyed.
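To see why 20 tok/s is the comfort line, convert it to wall-clock time for a typical answer (the ~400-token reply length is my assumption, not a measured figure):

```python
def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Wall-clock time to generate a reply at a given decoding speed."""
    return tokens / tok_per_s

# A ~400-token answer at comfortable vs painful speeds:
print(round(seconds_for(400, 35), 1))  # -> 11.4 (fine for chat)
print(round(seconds_for(400, 8), 1))   # -> 50.0 (you'll tab away and forget)
```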
**Common mistakes**

1. **Loading a model bigger than your RAM**: macOS will swap and the experience will be miserable. Stay under (RAM - 4GB).
2. **Using Q8 when Q4 would work**: 2x bigger files, roughly 2x slower (generation is memory-bandwidth bound), and basically the same quality for chat.
3. **Running on CPU when the GPU is available**: Ollama and LM Studio both auto-detect, but if something is misconfigured you'll be at <5 tok/s.
4. **Using a base model instead of an instruct model**: base models autocomplete; instruct models chat. The "-Instruct" or "-Chat" suffix is what you want.
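Mistakes 1 and 4 are easy to catch before you download anything. A hypothetical pre-flight check; the function name, thresholds, and suffix list are mine, not from Ollama or LM Studio:

```python
def check_model_choice(file_gb: float, ram_gb: float, name: str) -> list:
    """Flag the two most common local-LLM mistakes before pulling a model."""
    warnings = []
    if file_gb > ram_gb - 4:
        warnings.append("model bigger than RAM - 4GB: expect swapping")
    if not any(s in name.lower() for s in ("instruct", "chat")):
        warnings.append("looks like a base model: it will autocomplete, not chat")
    return warnings

print(check_model_choice(13, 16, "llama4:8b"))           # both warnings fire
print(check_model_choice(4.7, 16, "llama4:8b-instruct")) # -> []
```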
If you're picking your first model, start with Llama 4 8B Q4 in Ollama. Once that works, swap in Phi-4 to compare. That's the whole onboarding.