Updated April 2026 · 18 models tracked

Open source LLMs,
honest comparison.

The directory we wish existed when picking an open LLM. 18 models — DeepSeek V4, Llama 4, Qwen3.6, Phi-4, Gemma 3 — with real benchmarks, license clarity, and honest tradeoffs. No vendor fluff.

DeepSeek V4
DeepSeek AI · 2026-04
685B
256K ctx

Mixture-of-experts flagship released April 2026. Tops the open leaderboard for reasoning and coding, distilled into 236B and 67B variants for self-hosters.

DeepSeek · Commercial OK · VRAM 80GB+
MMLU 89.4
HumanEval 92.1
GSM8K 95.2
Llama 4
Meta · 2025-09
405B
128K ctx

Meta's flagship dense LLM. Comes in 8B, 70B, and 405B; 8B fits on a single 16GB GPU and is the most-downloaded local model on Hugging Face.

Llama · Commercial w/ caveats · 👁 vision · VRAM 16GB+
MMLU 87.1
HumanEval 84.5
GSM8K 93.0
Mistral Large 3
Mistral AI · 2026-02
123B
128K ctx

European flagship from Mistral AI. Strong general reasoning, conservative on hallucinations, EU-aligned for compliance-heavy buyers.

Mistral · Commercial w/ caveats · VRAM 48GB+
MMLU 84.8
HumanEval 80.3
GSM8K 92.1
Phi-4
Microsoft · 2025-12
14B
16K ctx

Microsoft's 14B reasoning model. Hits 70B-level math and reasoning scores from a 14B base — best parameter efficiency on the leaderboard.

MIT · Commercial OK · VRAM 8GB+
MMLU 84.8
HumanEval 82.6
GSM8K 95.2
Qwen3.6 35B
Alibaba · 2026-04
35B
128K ctx

Alibaba's April 2026 release; the 35B variant punches above its weight on coding and vision benchmarks, with a permissive license and strong Chinese support.

Qwen · Commercial OK · 👁 vision · VRAM 24GB+
MMLU 84.2
HumanEval 87.2
GSM8K 91.4
InternLM 3
Shanghai AI Lab · 2026-01
70B
200K ctx

Shanghai AI Lab's third-generation InternLM. Apache-2.0, 200K context, strong reasoning. Quietly one of the best open Chinese-led models.

Apache-2.0 · Commercial OK · VRAM 14GB+
MMLU 80.4
HumanEval 73.0
GSM8K 89.6
Command R+ 2
Cohere · 2026-01
104B
128K ctx

Cohere's RAG-tuned flagship. Built for retrieval, multilingual chat, and tool use. Open weights — research license only; commercial use needs Cohere API.

Custom (research) · Research only · VRAM 48GB+
MMLU 78.8
HumanEval 71.7
GSM8K 87.3
Gemma 3
Google DeepMind · 2025-11
27B
128K ctx

Google's open-weight family. The 2B variant is one of the strongest small models — runs on phones, edge devices, even a Raspberry Pi 5.

Gemma · Commercial OK · 👁 vision · VRAM 6GB+
MMLU 78.5
HumanEval 71.2
GSM8K 86.5
Mixtral 8x22B
Mistral AI · 2024-04
141B
64K ctx

MoE classic from 2024 — 8 experts of 22B, 39B active per token. Apache-2.0 makes it the go-to open MoE for production teams.

Apache-2.0 · Commercial OK · VRAM 80GB+
MMLU 77.8
HumanEval 75.3
GSM8K 88.4
Falcon 3 180B
TII (UAE) · 2025-08
180B
32K ctx

Open-source flagship from UAE's Technology Innovation Institute. Fully open weights, commercial use allowed up to certain revenue thresholds.

Falcon · Commercial OK · VRAM 24GB+
MMLU 77.4
HumanEval 67.0
GSM8K 81.0
Yi 1.5 34B
01.AI · 2024-05
34B
32K ctx

01.AI's open Yi family. Fully Apache-2.0, no usage restrictions. Solid bilingual EN/CN model that's been a self-hosting favorite since 2024.

Apache-2.0 · Commercial OK · VRAM 24GB+
MMLU 76.8
HumanEval 75.2
GSM8K 84.2
Qwen3.6 7B
Alibaba · 2026-04
7B
128K ctx

7B variant of Qwen3.6 with vision support. Punches above its weight class, with the broadest language coverage of any 7B model.

Qwen · Commercial OK · 👁 vision · VRAM 16GB+
MMLU 76.4
HumanEval 75.0
GSM8K 84.6
Tulu 3
Allen AI · 2025-12
70B
8K ctx

AI2's instruction-tuned Llama variant. Open recipe — every step of the post-training pipeline is documented and reproducible.

Llama · Commercial w/ caveats · VRAM 16GB+
MMLU 75.0
HumanEval 65.6
GSM8K 87.1
Llama 4 8B
Meta · 2025-09
8B
128K ctx

The 8B Llama 4 variant — the most popular local LLM by download count. Runs on a 16GB GPU at fp16, or 4GB at Q4.

Llama · Commercial w/ caveats · VRAM 16GB+
MMLU 73
HumanEval 62.2
GSM8K 85.3
OpenChat 4
OpenChat (community) · 2025-10
8B
32K ctx

Community-fine-tuned chat model on top of Llama base. Apache-2.0 weights, often tops local LLM evaluations among <10B models.

Apache-2.0 · Commercial OK · VRAM 8GB+
MMLU 72.4
HumanEval 70.7
GSM8K 85.4
OLMo 2
Allen AI · 2025-11
32B
8K ctx

Allen AI's truly open model — weights, training code, training data, and checkpoints all released. The most reproducible model on this list.

Apache-2.0 · Commercial OK · VRAM 16GB+
MMLU 71.2
HumanEval 60.0
GSM8K 78.5
DeepSeek Coder V3
DeepSeek AI · 2026-02
33B
64K ctx

Code-specialized DeepSeek model. Best HumanEval among open code models, supports fill-in-the-middle and repository-level context.

DeepSeek · Commercial OK · VRAM 24GB+
MMLU 70.5
HumanEval 89.4
GSM8K 81.0
StarCoder 3
BigCode · 2025-07
15B
16K ctx

BigCode collaboration's open code LLM, trained on permissively licensed code only — important for compliance-sensitive code generation.

Open RAIL · Commercial w/ caveats · VRAM 16GB+
MMLU 51.5
HumanEval 73.2
GSM8K 64.0

From the blog

Honest takes on open-LLM tradeoffs.

Use case guides

How-to walkthroughs for real situations.

Head-to-head comparisons

When two models are close, here's how to choose.

FAQs

What does 'open source LLM' actually mean in 2026?

Most so-called 'open source LLMs' are open weights — you get the trained model file but not the training data or training code. True open-source models (OSI-approved licenses like Apache-2.0 and MIT) are a smaller subset: Mixtral 8x22B, Yi 1.5, OLMo 2, OpenChat 4, InternLM 3, and Phi-4. Llama, Gemma, Qwen, and DeepSeek are open-weights with custom permissive licenses.

Which is the best open source LLM in 2026?

DeepSeek V4 685B currently tops open-leaderboard reasoning and code benchmarks. For local use, Llama 4 8B has the deepest ecosystem; Phi-4 has the best parameter efficiency; Qwen3.6 covers the most languages including built-in vision.

Can I use Llama 4 in a commercial product?

Yes, with two caveats: products with more than 700M monthly active users must negotiate with Meta, and the Llama Acceptable Use Policy bans certain applications (weapons, mass deception, etc.). For under-700M-MAU products, the license is effectively permissive commercial.

How much VRAM do I need to run an open LLM?

Quantized to Q4: a 7-8B model needs 4-6GB, a 13-14B needs 8-10GB, a 33-35B needs 20-24GB, a 70B needs ~40GB, and a 405B needs an H100 cluster. Double that for fp16. Use the Max VRAM filter above to see what fits.
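The rule of thumb above amounts to parameters × bits-per-weight plus runtime overhead. A quick back-of-envelope sketch (the 20% overhead factor is our assumption; real usage grows with context length and KV cache):

```python
def vram_estimate_gb(params_b: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough VRAM to load a model: parameters x bits-per-weight,
    plus ~20% for KV cache and runtime overhead (rule of thumb only)."""
    weight_gb = params_b * 1e9 * bits / 8 / 1e9  # bytes -> GB (decimal)
    return round(weight_gb * overhead, 1)

print(vram_estimate_gb(8))           # 8B at Q4   -> 4.8
print(vram_estimate_gb(70))          # 70B at Q4  -> 42.0
print(vram_estimate_gb(8, bits=16))  # 8B at fp16 -> 19.2
```

The Q4 estimates land inside the ranges in the table above; fp16 is four times the Q4 footprint, not two, which is why almost everyone self-hosts quantized.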

What's the difference between an instruct model and a base model?

Base models are trained only on next-token prediction — they autocomplete. Instruct (or Chat) models are the same base, fine-tuned on instructions and human feedback to act like a helpful assistant. Always pick the -Instruct or -Chat variant for chat use cases.

Is DeepSeek V4 really better than Claude Opus and GPT-5?

On many open benchmarks DeepSeek V4 685B is competitive with or beats hosted frontier models. On the hardest reasoning tasks (ARC-AGI 2, FrontierMath), Claude Opus 4.7 still edges it out. For everyday code, chat, and reasoning, the gap is small to nonexistent.

What is MoE (Mixture of Experts)?

An architecture where the model has many expert sub-networks but each token only activates a few. A 685B-parameter MoE like DeepSeek V4 only uses 22B params per forward pass. Result: training efficiency and inference cost similar to a much smaller model.
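The routing idea fits in a few lines of plain Python. A toy sketch with made-up dimensions and a linear router — an illustration of top-k gating, not DeepSeek's actual architecture:

```python
import math

def moe_layer(x, experts, router_weights, k=2):
    """Toy mixture-of-experts step: a router scores every expert for the
    input, only the top-k experts actually run, and their outputs are
    combined weighted by a softmax over the selected scores."""
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in router_weights]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i]) for i in top]
    gates = [e / sum(exps) for e in exps]
    out = [0.0] * len(x)
    for gate, i in zip(gates, top):
        for j, v in enumerate(experts[i](x)):  # only these k experts run
            out[j] += gate * v
    return out, top

# 4 experts total, but each token activates just 2 of them:
experts = [lambda x, s=s: [v * s for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router = [[0.1, 0.0], [0.9, 0.0], [0.0, 0.2], [0.0, 0.8]]
y, active = moe_layer([1.0, 1.0], experts, router)
print(active)  # [1, 3] — the two highest-scoring experts
```

Scale the same picture up and you get the headline numbers: all experts sit in memory, but only the routed fraction of parameters does work per token.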

Can I run an LLM on my laptop?

Yes. A 16GB Mac (M3/M4) runs Llama 4 8B Q4, Phi-4 Q4, or Qwen3.6 7B Q4 at 25-40 tokens per second using Ollama or LM Studio. 8GB Macs can run 8B models comfortably; 4GB GPUs can run 3B models.

What is GGUF?

The standard file format for quantized LLMs used by llama.cpp, Ollama, LM Studio, and most local inference tools. A .gguf file contains weights, tokenizer, and metadata in one cross-platform binary. The successor to GGML.
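The fixed preamble of a .gguf file can be inspected with nothing but the stdlib. A sketch of the header layout as described in the ggml GGUF spec: the ASCII magic 'GGUF', then little-endian version, tensor count, and metadata key/value count (the sample values below are synthetic):

```python
import struct

def read_gguf_header(data: bytes):
    """Parse the fixed GGUF preamble: 4-byte magic b'GGUF', then
    little-endian uint32 version, uint64 tensor count, uint64
    metadata key/value count (24 bytes total)."""
    magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", data[:24])
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# A synthetic header, just to show the layout:
fake = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(fake))  # {'version': 3, 'tensors': 291, 'metadata_kv': 24}
```

After this preamble come the metadata key/value pairs (architecture, tokenizer, quantization type) and then the tensor data itself.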

Where do I download open LLM weights safely?

First-party sources: official organization accounts on Hugging Face (meta-llama, deepseek-ai, Qwen, mistralai, google, microsoft, allenai). For quantized GGUFs, trusted re-uploaders like TheBloke, lmstudio-community, and bartowski. Always verify SHA256 against the source repo.
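The verification step takes a few lines of stdlib Python — stream the file through the hash so a multi-gigabyte weight file never has to fit in RAM (the file name and expected hash below are placeholders):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a (possibly huge) file through SHA-256 in 1MB chunks
    instead of loading it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Compare against whatever hash the source repo publishes:
# assert sha256_of("model-Q4_K_M.gguf") == expected_hash
```

If the hexdigest doesn't match the hash published in the source repo, delete the file and re-download from a first-party account.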

How do I fine-tune an open LLM?

QLoRA is the default in 2026 — it quantizes the base model to 4-bit and trains small low-rank adapters on top. Tools: Unsloth, Axolotl, LLaMA-Factory. Memory: roughly 6-10GB for a 7B QLoRA fine-tune and ~48GB for a 70B (the original QLoRA paper tuned a 65B model on a single 48GB GPU). Always build an eval set first, or you can't tell whether the fine-tune helped.
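The adapter math behind (Q)LoRA is small enough to show directly: the base weight W stays frozen, and a rank-r update scaled by alpha/r is added on top. A toy pure-Python sketch with made-up dimensions (real implementations use tensors and a real optimizer, of course):

```python
def matmul(a, b):
    """Minimal matrix multiply so the sketch has no dependencies."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, alpha=2.0, r=1):
    """Frozen base path x@W plus the trainable low-rank path
    (alpha/r) * x@A@B. Only A (d x r) and B (r x d) are trained,
    so the adapter is tiny compared with W."""
    base = matmul(x, W)
    low = matmul(matmul(x, A), B)
    s = alpha / r
    return [[b + s * l for b, l in zip(br, lr)] for br, lr in zip(base, low)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 base weight
A = [[1.0], [0.0]]            # d x r, r = 1
B0 = [[0.0, 0.0]]             # r x d, zero-initialized as in LoRA
x = [[1.0, 2.0]]

print(lora_forward(x, W, A, B0))                        # [[1.0, 2.0]] — zero B leaves the base untouched
print(lora_forward(x, W, A, [[0.1, 0.0]], alpha=1.0))   # [[1.1, 2.0]] — trained B nudges the output
```

Zero-initializing B means training starts exactly at the base model's behavior, which is why LoRA runs are stable; QLoRA adds 4-bit quantization of W on top of this same idea.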

Which open LLM is best for code?

DeepSeek Coder V3 33B leads HumanEval at 89.4 among open code models, with fill-in-the-middle support. StarCoder 3 trails on benchmarks but trains only on permissively licensed code — important for IP-conscious teams. Both run on a single A100.

Which open LLM is best for multilingual?

Qwen3.6 35B with 119 languages supported and strong Chinese/English bilingual results. Runner-up: Mistral Large 3 for European languages. For Indian languages, look at Sarvam-1 (not in this directory yet) and Llama 4 70B fine-tunes.

What is RAG and which open LLM is best for it?

Retrieval-Augmented Generation = retrieve relevant docs from a vector DB, append to prompt, then generate. Command R+ 2 has the best RAG-tuned open weights but is research-license-only. For commercial RAG: InternLM 3 (Apache, 200K context) or Llama 4 70B with custom prompting.
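The loop described above — embed, retrieve, stuff the prompt — sketched with a toy bag-of-words "embedding" standing in for a real embedding model and vector DB (the documents and question are made up):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' — stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rag_prompt(question, docs, top_k=1):
    """Score every doc against the question, keep the best top_k,
    and prepend them to the prompt. The 'generate' step would hand
    this prompt to the LLM."""
    ranked = sorted(docs, key=lambda d: cosine(embed(question), embed(d)), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

docs = [
    "InternLM 3 is Apache-2.0 licensed with a 200K context window.",
    "GGUF is the file format used by llama.cpp and Ollama.",
]
print(rag_prompt("What license does InternLM 3 use?", docs))
```

Only the relevant document ends up in the context window; production systems swap in a real embedding model, a vector database, and chunking, but the shape of the loop is the same.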

How does this directory stay current?

We maintain it manually with monthly updates as models release. If you spot a model we should add or stats that need updating, email choppy.young@gmail.com. Last updated April 2026.

Built something with an open LLM?

Tell us what worked. We'll update the directory.

choppy.young@gmail.com