Open source LLMs,
honest comparison.
The directory we wished existed when picking an open LLM. 18 models — DeepSeek V4, Llama 4, Qwen3.6, Phi-4, Gemma 3 — with real benchmarks, license clarity, and honest tradeoffs. No vendor fluff.
Mixture-of-experts flagship released April 2026. Tops the open leaderboard for reasoning and coding, distilled into 236B and 67B variants for self-hosters.
Meta's flagship dense LLM. Comes in 8B, 70B, and 405B; 8B fits on a single 16GB GPU and is the most-downloaded local model on Hugging Face.
European flagship from Mistral AI. Strong general reasoning, conservative on hallucinations, EU-aligned for compliance-heavy buyers.
Microsoft's 14B reasoning model. Hits 70B-level math and reasoning scores from a 14B base — best parameter efficiency on the leaderboard.
Alibaba's April 2026 release; the 35B variant punches above its weight on coding and vision benchmarks, with a permissive license and strong Chinese support.
Shanghai AI Lab's third-generation InternLM. Apache-2.0, 200K context, strong reasoning. Quietly one of the best open Chinese-led models.
Cohere's RAG-tuned flagship. Built for retrieval, multilingual chat, and tool use. Open weights — research license only; commercial use needs Cohere API.
Google's open-weight family. The 2B variant is one of the strongest small models — runs on phones, edge devices, even a Raspberry Pi 5.
MoE classic from 2024 — 8 experts of 22B, 39B active per token. Apache-2.0 makes it the go-to open MoE for production teams.
Open-source flagship from UAE's Technology Innovation Institute. Fully open weights, commercial use allowed up to certain revenue thresholds.
01.AI's open Yi family. Fully Apache-2.0, no usage restrictions. Solid bilingual EN/CN model that's been a self-hosting favorite since 2024.
7B variant of Qwen3.6 with vision support. Punches above its weight class, with the broadest language coverage of any 7B model.
AI2's instruction-tuned Llama variant. Open recipe — every step of the post-training pipeline is documented and reproducible.
The 8B Llama 4 variant — the most popular local LLM by download count. Runs on a 16GB GPU at fp16, or 4GB at Q4.
Community-fine-tuned chat model on top of Llama base. Apache-2.0 weights, often tops local LLM evaluations among <10B models.
Allen AI's truly open model — weights, training code, training data, and checkpoints all released. The most reproducible model on this list.
Code-specialized DeepSeek model. Best HumanEval among open code models, supports fill-in-the-middle and repository-level context.
BigCode collaboration's open code LLM, trained on permissively licensed code only — important for compliance-sensitive code generation.
From the blog
Honest takes on open-LLM tradeoffs.
Use case guides
How-to walkthroughs for real situations.
Head-to-head comparisons
When two models are close, here's how to choose.
FAQs
What does 'open source LLM' actually mean in 2026?
Most so-called 'open source LLMs' are open weights — you get the trained model file but not the training data or training code. True open-source models (OSI-approved licenses like Apache-2.0 and MIT) are a smaller subset: Mixtral 8x22B, Yi 1.5, OLMo 2, OpenChat 4, InternLM 3, and Phi-4. Llama, Gemma, Qwen, and DeepSeek are open-weights with custom permissive licenses.
Which is the best open source LLM in 2026?
DeepSeek V4 685B currently tops open-leaderboard reasoning and code benchmarks. For local use, Llama 4 8B has the deepest ecosystem; Phi-4 has the best parameter efficiency; Qwen3.6 covers the most languages including built-in vision.
Can I use Llama 4 in a commercial product?
Yes, with two caveats: products with more than 700M monthly active users must negotiate with Meta, and the Llama Acceptable Use Policy bans certain applications (weapons, mass deception, etc.). For under-700M-MAU products, the license is effectively permissive commercial.
How much VRAM do I need to run an open LLM?
Quantized to Q4: a 7-8B model needs 4-6GB, 13-14B needs 8-10GB, 33-35B needs 20-24GB, 70B needs ~40GB, 405B needs an H100 cluster. Double that for fp16. Use the Max VRAM filter above to see what fits.
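The figures above follow a simple rule of thumb: parameter count times bytes per parameter, plus a cushion for the KV cache and runtime buffers. A minimal sketch (the ~20% overhead factor is a heuristic assumption, not a guarantee):

```python
def vram_gb(params_b: float, bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: parameter bytes plus ~20% for
    KV cache and runtime buffers (heuristic, not exact)."""
    total_bytes = params_b * 1e9 * (bits / 8) * overhead
    return total_bytes / 2**30

for params, label in [(8, "8B"), (14, "14B"), (34, "34B"), (70, "70B")]:
    print(f"{label}: Q4 ~{vram_gb(params, 4):.1f} GB, "
          f"fp16 ~{vram_gb(params, 16):.1f} GB")
```

This reproduces the table's ballparks: an 8B model at Q4 lands around 4.5 GB, a 70B around 39 GB.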
What's the difference between an instruct model and a base model?
Base models are trained only on next-token prediction — they autocomplete. Instruct (or Chat) models are the same base, fine-tuned on instructions and human feedback to act like a helpful assistant. Always pick the -Instruct or -Chat variant for chat use cases.
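The practical difference shows up in how you prompt them. A base model just continues raw text; an instruct model expects its chat template. A sketch using a ChatML-style layout purely for illustration (every model family defines its own special tokens, so check the model card for the real template):

```python
def to_chat_prompt(messages):
    """Render messages in a ChatML-style template (illustrative only;
    each model family defines its own special tokens)."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "\n".join(parts)

# A base model would simply autocomplete this raw string:
base_prompt = "The capital of France is"
# An instruct model expects the structured template instead:
chat_prompt = to_chat_prompt(
    [{"role": "user", "content": "What is the capital of France?"}]
)
print(chat_prompt)
```

Inference frontends like Ollama and LM Studio apply the correct template for you, which is one more reason to grab the -Instruct variant.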
Is DeepSeek V4 really better than Claude Opus and GPT-5?
On many open benchmarks DeepSeek V4 685B is competitive with or beats hosted frontier models. On the hardest reasoning tasks (ARC-AGI 2, FrontierMath) Claude Opus 4.7 still edges it out. For everyday code, chat, and reasoning, the gap is small to nonexistent.
What is MoE (Mixture of Experts)?
An architecture where the model has many expert sub-networks but each token activates only a few. A 685B-parameter MoE like DeepSeek V4 uses only ~22B parameters per token. Result: training efficiency and inference cost similar to a much smaller model.
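The routing idea can be shown in a few lines. This toy sketch (hand-picked gate weights and scaling "experts", purely illustrative; real MoE layers route per token inside transformer blocks) scores every expert but runs only the top-k, mixing their outputs by softmax weight:

```python
import math

def moe_layer(x, experts, gate_weights, k=2):
    """Toy mixture-of-experts: the gate scores every expert,
    only the top-k actually run, outputs are softmax-weighted."""
    scores = [sum(w * xi for w, xi in zip(gw, x)) for gw in gate_weights]
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    z = max(scores[i] for i in top)
    weights = {i: math.exp(scores[i] - z) for i in top}
    total = sum(weights.values())
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)  # only k experts do any work per token
        w = weights[i] / total
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top

experts = [lambda x, s=s: [xi * s for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]
gate = [[0.1, 0.0], [0.9, 0.0], [0.0, 0.2], [0.0, 0.8]]
out, active = moe_layer([1.0, 1.0], experts, gate, k=2)
print(active)  # only 2 of the 4 experts ran for this input
```

Compute scales with the k active experts, not the full parameter count, which is exactly why a 685B MoE can serve at the cost of a ~22B dense model.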
Can I run an LLM on my laptop?
Yes. A 16GB Mac (M3/M4) runs Llama 4 8B Q4, Phi-4 Q4, or Qwen3.6 7B Q4 at 25-40 tokens per second using Ollama or LM Studio. 8GB Macs can run 8B models comfortably; 4GB GPUs can run 3B models.
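Those tokens-per-second figures follow from a common back-of-envelope rule: single-stream decoding is memory-bandwidth-bound, so speed is roughly bandwidth divided by model size. A sketch (the ~100 GB/s bandwidth figure is an assumed example, and the formula ignores KV cache traffic and compute limits):

```python
def est_tokens_per_sec(model_gb: float, bandwidth_gbps: float) -> float:
    """Back-of-envelope decode speed: each generated token streams the
    whole quantized model through memory once, so tokens/sec is roughly
    memory bandwidth / model size. Ignores KV cache and compute limits."""
    return bandwidth_gbps / model_gb

# e.g. a laptop with ~100 GB/s memory bandwidth (assumed) and an
# 8B model at Q4 (~4.5 GB on disk/memory)
print(round(est_tokens_per_sec(4.5, 100)))  # ballpark tokens/sec
```

Higher-bandwidth machines (Max/Ultra-class Apple silicon, discrete GPUs) scale this estimate up proportionally.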
What is GGUF?
The standard file format for quantized LLMs used by llama.cpp, Ollama, LM Studio, and most local inference tools. A .gguf file contains weights, tokenizer, and metadata in one cross-platform binary. The successor to GGML.
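The format is easy to inspect: a GGUF file opens with a fixed little-endian preamble of a 4-byte magic, a uint32 version, a uint64 tensor count, and a uint64 metadata key/value count. A minimal header parser, demonstrated on synthetic bytes (a real .gguf continues with the metadata pairs and tensor data):

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF preamble: 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata kv count (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic header for illustration; in practice pass the first
# 24 bytes of a downloaded .gguf file.
fake = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(fake))
```

Checking the magic bytes like this is also a quick sanity test that a download wasn't truncated or mislabeled.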
Where do I download open LLM weights safely?
First-party sources: official organization accounts on Hugging Face (meta-llama, deepseek-ai, Qwen, mistralai, google, microsoft, allenai). For quantized GGUFs, trusted re-uploaders like TheBloke, lmstudio-community, and bartowski. Always verify SHA256 against the source repo.
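Verifying a checksum is a few lines of standard library. This sketch streams the file so multi-GB weight files never need to fit in memory (demonstrated here on a throwaway temp file; in practice point it at the downloaded .gguf and compare against the checksum published in the source repo):

```python
import hashlib
import os
import tempfile

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks so multi-GB
    weight files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Demo on a throwaway file; in practice: sha256_of("model.Q4_K_M.gguf")
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"hello")
tmp.close()
digest = sha256_of(tmp.name)
os.unlink(tmp.name)
print(digest)
```

A mismatch means a corrupted or tampered download; delete it and re-fetch from the first-party source.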
How do I fine-tune an open LLM?
QLoRA is the default in 2026 — it quantizes the base to 4-bit and trains small low-rank adapters. Tools: Unsloth, Axolotl, LLaMA-Factory. Memory: ~24GB for a 7B fine-tune, ~80GB for a 70B. Always build an eval set first or you can't tell if the fine-tune helped.
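The "small low-rank adapters" are just two thin matrices whose product perturbs the frozen weight: W' = W + (alpha/r) * B @ A, where r is the adapter rank. A stdlib-only sketch of the merge step on toy 2x2 matrices (illustrative numbers; real adapters attach to attention and MLP projections):

```python
def matmul(A, B):
    """Plain list-of-lists matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_merge(W, A, B, alpha):
    """LoRA: the frozen weight W is adapted by a low-rank update
    (alpha/r) * B @ A. QLoRA keeps W quantized to 4-bit and trains
    only the small A and B matrices."""
    r = len(A)                # adapter rank = rows of A
    scale = alpha / r
    delta = matmul(B, A)      # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w + scale * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 base weight
A = [[1.0, 0.0]]              # rank r=1, d_in=2
B = [[0.5], [0.0]]            # d_out=2, r=1
print(lora_merge(W, A, B, alpha=1))
```

Because only A and B are trained, the optimizer state stays tiny, which is where the ~24GB-for-a-7B figure comes from.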
Which open LLM is best for code?
DeepSeek Coder V3 33B leads HumanEval at 89.4 among open code models, with fill-in-the-middle support. StarCoder 3 trails on benchmarks but trains only on permissively licensed code — important for IP-conscious teams. Both run on a single A100.
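Fill-in-the-middle means the model sees the code before and after the cursor and generates the missing span, which is what makes editor autocompletion work. A sketch of how such a prompt is assembled; the <PRE>/<SUF>/<MID> sentinels here follow the Code Llama convention purely as an illustration, and DeepSeek Coder uses its own sentinel tokens (check the model card):

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt. Sentinel tokens vary by
    model family; this layout follows the Code Llama convention as an
    illustration only."""
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"

before = "def area(r):\n    return "
after = "\n\nprint(area(2))"
print(fim_prompt(before, after))
```

The model's completion after <MID> is the code that belongs at the cursor.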
Which open LLM is best for multilingual?
Qwen3.6 35B with 119 languages supported and strong Chinese/English bilingual results. Runner-up: Mistral Large 3 for European languages. For Indian languages, look at Sarvam-1 (not in this directory yet) and Llama 4 70B fine-tunes.
What is RAG and which open LLM is best for it?
Retrieval-Augmented Generation = retrieve relevant docs from a vector DB, append to prompt, then generate. Command R+ 2 has the best RAG-tuned open weights but is research-license-only. For commercial RAG: InternLM 3 (Apache, 200K context) or Llama 4 70B with custom prompting.
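The retrieve-append-generate loop fits in a few lines. This sketch ranks stored chunks by cosine similarity and splices the best ones into the prompt; the 2-dimensional vectors and document texts are toy stand-ins for a real embedding model plus vector DB:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def build_rag_prompt(question, q_vec, store, top_k=2):
    """Minimal RAG: rank stored (vector, text) chunks by cosine
    similarity to the query vector, then splice the top chunks
    into the prompt before generation."""
    ranked = sorted(store, key=lambda item: cosine(q_vec, item[0]),
                    reverse=True)
    context = "\n".join(text for _, text in ranked[:top_k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

store = [  # toy embeddings; a real pipeline computes these with a model
    ([0.9, 0.1], "Invoices are due within 30 days."),
    ([0.1, 0.9], "The office closes at 6pm."),
    ([0.8, 0.3], "Late invoices incur a 2% fee."),
]
prompt = build_rag_prompt("When are invoices due?", [1.0, 0.2], store)
print(prompt)
```

The assembled prompt then goes to whichever model you chose, which is why long context windows like InternLM 3's 200K matter for RAG.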
How does this directory stay current?
We maintain it manually with monthly updates as models release. If you spot a model we should add or stats that need updating, email choppy.young@gmail.com. Last updated April 2026.
Built something with an open LLM?
Tell us what worked. We'll update the directory.
choppy.young@gmail.com →