2026-04-21 · edge · small

# Small LLMs on Edge Devices: What Runs on Phones, Pis, and Browsers in 2026

Gemma 3 2B runs on a Pi 5. Phi-4 runs in a browser via WebGPU. Phones run Llama 3B. A practical guide to LLMs on tiny hardware.

Edge LLMs are no longer a tech demo. In 2026 you can put a useful chat model on phones, browsers, Raspberry Pis, and embedded devices. Here's what actually runs and how.
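Whether a model fits a given device mostly comes down to arithmetic: quantized size ≈ parameter count × effective bits per weight ÷ 8, plus KV-cache and runtime overhead. A back-of-envelope sketch (the bits-per-weight figures are rough assumptions; real GGUF files vary because embeddings and norms are often kept at higher precision):

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough on-disk size of a quantized model, in decimal GB.

    bits_per_weight is an *effective* average: Q4 GGUF variants land
    anywhere from ~4.5 to ~6.5 bits once embeddings are counted.
    """
    return params_billion * bits_per_weight / 8

# A 2B model at an effective ~6.4 bits/weight comes out near 1.6 GB,
# in line with the Gemma 3 2B Q4 figure used throughout this post.
print(round(quantized_size_gb(2.0, 6.4), 2))   # 1.6

# A 14B model at an effective ~4.8 bits/weight is roughly 8.4 GB,
# in the same ballpark as the Phi-4 Q4 browser download below.
print(round(quantized_size_gb(14.0, 4.8), 1))  # 8.4
```

Add headroom for the KV cache (hundreds of MB at 8-32K context) and the OS itself before deciding a model "fits" in 8GB.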

## Phones (iOS / Android)

**Best pick**: Gemma 3 2B (Q4) at ~1.6GB on disk. Runs at 5-12 tok/s on iPhone 15 Pro and Pixel 8 Pro via Apple's Foundation Models or MediaPipe.

Apps: Apple Intelligence (uses internal models), Gemini Nano (Google's edge model), or third-party apps like Private LLM, MLC Chat, and Layla.

Use cases that work: summarizing emails, drafting short replies, smart compose, on-device translation. Use cases that don't: long-form writing, anything math-heavy, anything multi-step.

## Browsers (WebGPU)

**Best pick**: Phi-4 Q4 via [WebLLM](https://webllm.mlc.ai). 8.5GB download (cached after first load), then runs entirely in the browser at 8-15 tok/s on M-series Macs.

Real-world apps: in-browser code assistants, chat widgets that don't call any server, privacy-first writing tools.

Limitations: WebGPU is Chrome/Edge only as of April 2026 (Firefox and Safari have it behind a flag). An 8.5GB download is too much for casual visitors; this is best for installed-as-PWA scenarios.
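The download cost is worth quantifying. A small helper (pure arithmetic, nothing WebLLM-specific) shows why a multi-gigabyte first load only makes sense for apps the user has committed to:

```python
def first_load_minutes(download_gb: float, link_mbps: float) -> float:
    """Minutes to fetch a model: GB -> gigabits, divided by link speed."""
    return download_gb * 8 * 1000 / link_mbps / 60

# An 8.5 GB model over a 100 Mbps connection:
print(round(first_load_minutes(8.5, 100), 1))  # 11.3 minutes
```

Eleven minutes of spinner is fine for a PWA a user installs once, and a non-starter for a chat widget on a landing page.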

## Raspberry Pi 5 (8GB)

**Best pick**: Gemma 3 2B (Q4) at 1.6GB. Runs at 3-5 tok/s with llama.cpp.
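A minimal invocation sketch, assuming llama.cpp's `llama-cli` binary is already built on the Pi and a GGUF file is on disk (the model filename here is illustrative). Building the argument list in Python keeps the config explicit and easy to reuse from a home-automation script:

```python
import subprocess

def llama_cli_args(model_path: str, prompt: str,
                   threads: int = 4, ctx: int = 2048) -> list[str]:
    """Argument list for llama.cpp's llama-cli: pin threads to the
    Pi 5's four cores and keep the context modest to save RAM."""
    return [
        "llama-cli",
        "-m", model_path,    # GGUF model file
        "-p", prompt,        # prompt text
        "-t", str(threads),  # CPU threads (Pi 5 has 4 cores)
        "-c", str(ctx),      # context window size
        "-n", "256",         # cap generated tokens
    ]

args = llama_cli_args("gemma-3-2b-q4.gguf", "Turn off the kitchen lights.")
# subprocess.run(args)  # uncomment on a machine with llama.cpp installed
```

At 3-5 tok/s, capping `-n` matters: a 256-token reply already takes about a minute, which is fine for a voice assistant's short answers and painful for anything longer.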

Use cases: home automation voice assistants, smart speakers, embedded chat in IoT devices.

Pi 4 (4GB) can technically run TinyLlama 1.1B or Qwen3 0.6B, but quality is weak. Don't bother unless you really need a $40 LLM box.

## Embedded (Coral, Jetson Nano, RK3588)

NVIDIA Jetson Orin Nano 8GB runs Phi-4 14B Q4 at 6-10 tok/s. Good for in-vehicle assistants, robotics, and industrial applications.

RK3588 SoC (Orange Pi 5+, Radxa Rock 5B) runs Gemma 3 2B with vendor-specific NPU acceleration at ~8 tok/s.

## What you can't (yet) do on edge

Anything requiring 30B+ params. Anything requiring more than ~32K context. Vision-language reasoning at scale (you can do simple captioning, not multi-image analysis).

## The mental model

Edge LLMs are best for **simple, short, repeated tasks** where the latency win + privacy win + offline win outweighs the quality cost. They are not "GPT-5 in your pocket" — they're "useful enough for a specific job, on hardware you already have."
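That trade-off can be made concrete as a routing heuristic. A toy sketch (the thresholds are illustrative, not benchmarked) for deciding whether a request stays on-device or goes to a cloud model:

```python
def route_to_edge(prompt_tokens: int, expected_output_tokens: int,
                  needs_math_or_tools: bool, privacy_sensitive: bool) -> bool:
    """Toy router: keep short, simple, or private jobs on-device."""
    if needs_math_or_tools:   # small models fail at multi-step work
        return False
    if privacy_sensitive:     # never ship private text off-device
        return True
    # Short in, short out: the latency/offline win beats the quality cost.
    return prompt_tokens < 2000 and expected_output_tokens < 300

print(route_to_edge(400, 80, False, False))    # summarize an email: True
print(route_to_edge(400, 80, True, False))     # math-heavy: False
print(route_to_edge(6000, 1200, False, True))  # private but long: True
```

The interesting cases are the conflicts: a private, math-heavy request has no good answer on a 2B model, which is exactly the "quality cost" the mental model is warning about.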
