Small LLMs on Edge Devices: What Runs on Phones, Pis, and Browsers in 2026
Gemma 2B runs on a Pi 5. Phi-4 runs in a browser via WebGPU. Phones run Llama 3B. A practical guide to LLMs on tiny hardware.
Edge LLMs are no longer a tech demo. In 2026 you can put a useful chat model on phones, browsers, Raspberry Pis, and embedded devices. Here's what actually runs and how.
**Phones (iOS / Android)**

**Best pick**: Gemma 3 2B (Q4), ~1.6GB on disk. Runs at 5-12 tok/s on an iPhone 15 Pro or Pixel 8 Pro via on-device runtimes such as MediaPipe's LLM Inference API or MLC.
Apps: Apple Intelligence (Apple's internal models), Gemini Nano (Google's on-device model), or third-party apps like Private LLM, MLC Chat, and Layla.
Use cases that work: summarizing emails, drafting short replies, smart compose, on-device translation. Use cases that don't: long-form writing, anything math-heavy, anything multi-step.
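To give a sense of the API surface behind these features: MediaPipe's LLM Inference task ships Kotlin, Swift, and JavaScript bindings with the same basic shape. Here's a minimal TypeScript sketch using the web binding of that task; the model path, option values, and CDN URL are assumptions to check against the MediaPipe docs, and the mobile bindings follow the same pattern.

```ts
import { FilesetResolver, LlmInference } from "@mediapipe/tasks-genai";

// Load the WASM runtime, then point the task at a local quantized Gemma bundle.
// The model path and tuning options below are placeholders, not canonical values.
const fileset = await FilesetResolver.forGenAiTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm"
);
const llm = await LlmInference.createFromOptions(fileset, {
  baseOptions: { modelAssetPath: "/models/gemma-2b-it-q4.bin" }, // hypothetical path
  maxTokens: 512,
  temperature: 0.7,
});

// The kind of call an on-device "summarize this email" feature would make.
const emailBody = "Hi team, quick update on the Q3 launch timeline...";
const summary = await llm.generateResponse(
  "Summarize this email in two sentences:\n" + emailBody
);
console.log(summary);
```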
**Browsers (WebGPU)**

**Best pick**: Phi-4 (Q4) via [WebLLM](https://webllm.mlc.ai). It's an 8.5GB download (cached after the first load), then runs entirely in the browser at 8-15 tok/s on M-series Macs.
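Getting a model running client-side takes only a few lines. A minimal sketch with WebLLM's OpenAI-style API; the exact model ID is an assumption, so verify it against WebLLM's prebuilt model list (`prebuiltAppConfig`).

```ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// The first call downloads the quantized weights and caches them in the browser,
// so later visits skip the multi-GB fetch.
const engine = await CreateMLCEngine("Phi-4-q4f16_1-MLC" /* assumed model ID */, {
  initProgressCallback: (p) => console.log(p.text),
});

// OpenAI-compatible chat API, but inference runs locally on WebGPU.
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Explain WebGPU in one paragraph." }],
  max_tokens: 256,
});
console.log(reply.choices[0].message.content);
```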
Real-world apps: in-browser code assistants, chat widgets that don't call any server, privacy-first writing tools.
Limitations: as of April 2026, WebGPU is enabled by default only in Chrome and Edge (Firefox and Safari keep it behind a flag), and an 8GB+ download is too much for casual visitors. This is best for installed-as-a-PWA scenarios.
**Raspberry Pi 5 (8GB)**

**Best pick**: Gemma 3 2B (Q4) at 1.6GB. Runs at 3-5 tok/s with llama.cpp.
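On the Pi, the usual route is llama.cpp's CLI or built-in server, but you can also drive it from Node. A sketch using the node-llama-cpp bindings (which bundle llama.cpp); the GGUF filename is a placeholder and API details may vary by version.

```ts
import { getLlama, LlamaChatSession } from "node-llama-cpp";

// On a Pi 5 this runs on the CPU; expect the 3-5 tok/s mentioned above.
const llama = await getLlama();
const model = await llama.loadModel({
  modelPath: "./models/gemma-3-2b-it-Q4_K_M.gguf", // hypothetical filename
});
const context = await model.createContext();
const session = new LlamaChatSession({ contextSequence: context.getSequence() });

// Typical smart-home style request: short prompt, short answer.
const answer = await session.prompt(
  "Turn this into a shopping list: eggs, milk, and whatever makes pancakes."
);
console.log(answer);
```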
Use cases: home automation voice assistants, smart speakers, embedded chat in IoT devices.
A Pi 4 (4GB) can technically run TinyLlama 1.1B or Qwen3 0.6B, but quality is weak. Don't bother unless you really need a $40 LLM box.
**Embedded (Coral, Jetson Nano, RK3588)**

An NVIDIA Jetson Orin Nano (8GB) runs Phi-4 14B (Q4) at 6-10 tok/s. Good for in-vehicle assistants, robotics, and industrial applications.
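A common setup on a Jetson is to run a local server (for example llama.cpp's llama-server, which exposes an OpenAI-compatible endpoint) and have the vehicle or robot code talk to it over the local network. A sketch of that client side; the hostname, port, and prompts are assumptions.

```ts
// Hypothetical in-vehicle client calling a local OpenAI-compatible endpoint
// served from the Jetson (e.g. llama.cpp's llama-server on its default port).
const res = await fetch("http://jetson.local:8080/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    messages: [
      { role: "system", content: "You are a voice assistant in a vehicle. Answer in one short sentence." },
      { role: "user", content: "How do I turn on the rear defroster?" },
    ],
    max_tokens: 64,
  }),
});
const data = await res.json();
console.log(data.choices[0].message.content);
```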
The RK3588 SoC (Orange Pi 5 Plus, Radxa Rock 5B) runs Gemma 3 2B at ~8 tok/s with vendor-specific NPU acceleration via Rockchip's RKLLM toolkit.