Updated April 2026 · 18 models tracked

Open source LLMs,
honest comparison.

The directory we wish existed when picking an open LLM. 18 models — DeepSeek V4, Llama 4, Qwen3.6, Phi-4, Gemma 3 — with real benchmarks, license clarity, and honest tradeoffs. No vendor fluff.

DeepSeek V4
DeepSeek AI · 2026-04
685B
256K ctx

Mixture-of-experts flagship released April 2026. Tops the open leaderboard for reasoning and coding, distilled into 236B and 67B variants for self-hosters.

DeepSeek · Commercial OK · VRAM 80GB+
MMLU 89.4
HumanEval 92.1
GSM8K 95.2
Llama 4
Meta · 2025-09
405B
128K ctx

Meta's flagship dense LLM. Comes in 8B, 70B, and 405B; 8B fits on a single 16GB GPU and is the most-downloaded local model on Hugging Face.

Llama · Commercial w/ caveats · 👁 vision · VRAM 16GB+
MMLU 87.1
HumanEval 84.5
GSM8K 93.0
Mistral Large 3
Mistral AI · 2026-02
123B
128K ctx

European flagship from Mistral AI. Strong general reasoning, conservative on hallucinations, EU-aligned for compliance-heavy buyers.

Mistral · Commercial w/ caveats · VRAM 48GB+
MMLU 84.8
HumanEval 80.3
GSM8K 92.1
Phi-4
Microsoft · 2025-12
14B
16K ctx

Microsoft's 14B reasoning model. Hits 70B-level math and reasoning scores from a 14B base — best parameter efficiency on the leaderboard.

MIT · Commercial OK · VRAM 8GB+
MMLU 84.8
HumanEval 82.6
GSM8K 95.2
Qwen3.6 35B
Alibaba · 2026-04
35B
128K ctx

Alibaba's April 2026 release; the 35B variant punches above its weight on coding and vision benchmarks, with a permissive license and strong Chinese support.

Qwen · Commercial OK · 👁 vision · VRAM 24GB+
MMLU 84.2
HumanEval 87.2
GSM8K 91.4
InternLM 3
Shanghai AI Lab · 2026-01
70B
200K ctx

Shanghai AI Lab's third-generation InternLM. Apache-2.0, 200K context, strong reasoning. Quietly one of the best open Chinese-led models.

Apache-2.0 · Commercial OK · VRAM 14GB+
MMLU 80.4
HumanEval 73.0
GSM8K 89.6
Command R+ 2
Cohere · 2026-01
104B
128K ctx

Cohere's RAG-tuned flagship. Built for retrieval, multilingual chat, and tool use. Open weights — research license only; commercial use needs Cohere API.

Custom (research) · Research only · VRAM 48GB+
MMLU 78.8
HumanEval 71.7
GSM8K 87.3
Gemma 3
Google DeepMind · 2025-11
27B
128K ctx

Google's open-weight family. The 2B variant is one of the strongest small models — runs on phones, edge devices, even a Raspberry Pi 5.

Gemma · Commercial OK · 👁 vision · VRAM 6GB+
MMLU 78.5
HumanEval 71.2
GSM8K 86.5
Mixtral 8x22B
Mistral AI · 2024-04
141B
64K ctx

MoE classic from 2024 — 8 experts of 22B, 39B active per token. Apache-2.0 makes it the go-to open MoE for production teams.

Apache-2.0 · Commercial OK · VRAM 80GB+
MMLU 77.8
HumanEval 75.3
GSM8K 88.4
Falcon 3 180B
TII (UAE) · 2025-08
180B
32K ctx

Open-source flagship from UAE's Technology Innovation Institute. Fully open weights, commercial use allowed up to certain revenue thresholds.

Falcon · Commercial OK · VRAM 24GB+
MMLU 77.4
HumanEval 67.0
GSM8K 81.0
Yi 1.5 34B
01.AI · 2024-05
34B
32K ctx

01.AI's open Yi family. Fully Apache-2.0, no usage restrictions. Solid bilingual EN/CN model that's been a self-hosting favorite since 2024.

Apache-2.0 · Commercial OK · VRAM 24GB+
MMLU 76.8
HumanEval 75.2
GSM8K 84.2
Qwen3.6 7B
Alibaba · 2026-04
7B
128K ctx

7B variant of Qwen3.6 with vision support. Punches above its weight class, with the broadest language coverage of any 7B model.

Qwen · Commercial OK · 👁 vision · VRAM 16GB+
MMLU 76.4
HumanEval 75.0
GSM8K 84.6
Tulu 3
Allen AI · 2025-12
70B
8K ctx

AI2's instruction-tuned Llama variant. Open recipe — every step of the post-training pipeline is documented and reproducible.

Llama · Commercial w/ caveats · VRAM 16GB+
MMLU 75.0
HumanEval 65.6
GSM8K 87.1
Llama 4 8B
Meta · 2025-09
8B
128K ctx

The 8B Llama 4 variant — the most popular local LLM by download count. Runs on a 16GB GPU at fp16, or 4GB at Q4.

Llama · Commercial w/ caveats · VRAM 16GB+
MMLU 73
HumanEval 62.2
GSM8K 85.3
OpenChat 4
OpenChat (community) · 2025-10
8B
32K ctx

Community-fine-tuned chat model on top of Llama base. Apache-2.0 weights, often tops local LLM evaluations among <10B models.

Apache-2.0 · Commercial OK · VRAM 8GB+
MMLU 72.4
HumanEval 70.7
GSM8K 85.4
OLMo 2
Allen AI · 2025-11
32B
8K ctx

Allen AI's truly open model — weights, training code, training data, and checkpoints all released. The most reproducible model on this list.

Apache-2.0 · Commercial OK · VRAM 16GB+
MMLU 71.2
HumanEval 60.0
GSM8K 78.5
DeepSeek Coder V3
DeepSeek AI · 2026-02
33B
64K ctx

Code-specialized DeepSeek model. Best HumanEval among open code models, supports fill-in-the-middle and repository-level context.

DeepSeek · Commercial OK · VRAM 24GB+
MMLU 70.5
HumanEval 89.4
GSM8K 81.0
StarCoder 3
BigCode · 2025-07
15B
16K ctx

BigCode collaboration's open code LLM, trained on permissively licensed code only — important for compliance-sensitive code generation.

Open RAIL · Commercial w/ caveats · VRAM 16GB+
MMLU 51.5
HumanEval 73.2
GSM8K 64.0

From the blog

Honest takes on open-LLM tradeoffs.

Use case guides

How-to walkthroughs for real situations.

Head-to-head comparisons

When two models are close, here's how to choose.

FAQs

What does 'open source LLM' actually mean in 2026?

Most so-called 'open source LLMs' are open weights — you get the trained model file but not the training data or training code. True open-source models (OSI-approved licenses like Apache-2.0 and MIT) are a smaller subset: Mixtral 8x22B, Yi 1.5, OLMo 2, OpenChat 4, InternLM 3, and Phi-4. Llama, Gemma, Qwen, and DeepSeek are open-weights with custom permissive licenses.

Which is the best open source LLM in 2026?

DeepSeek V4 685B currently tops open-leaderboard reasoning and code benchmarks. For local use, Llama 4 8B has the deepest ecosystem; Phi-4 has the best parameter efficiency; Qwen3.6 covers the most languages including built-in vision.

Can I use Llama 4 in a commercial product?

Yes, with two caveats: products with more than 700M monthly active users must negotiate with Meta, and the Llama Acceptable Use Policy bans certain applications (weapons, mass deception, etc.). For under-700M-MAU products, the license is effectively permissive commercial.

How much VRAM do I need to run an open LLM?

Quantized to Q4: a 7-8B model needs 4-6GB, a 13-14B needs 8-10GB, a 33-35B needs 20-24GB, a 70B needs ~40GB, and a 405B needs an H100 cluster. Double that for fp16. Use the Max VRAM filter above to see what fits.
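The rule of thumb above amounts to parameters × bits-per-weight plus runtime overhead. A quick back-of-envelope sketch (the 20% overhead factor is our assumption; real usage grows with context length and KV cache):

```python
def vram_estimate_gb(params_b: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough VRAM to load a model: parameters x bits-per-weight,
    plus ~20% for KV cache and runtime overhead (rule of thumb only)."""
    weight_gb = params_b * 1e9 * bits / 8 / 1e9  # bytes -> GB (decimal)
    return round(weight_gb * overhead, 1)

print(vram_estimate_gb(8))           # 8B at Q4   -> 4.8
print(vram_estimate_gb(70))          # 70B at Q4  -> 42.0
print(vram_estimate_gb(8, bits=16))  # 8B at fp16 -> 19.2
```

The Q4 estimates land inside the ranges in the table above; fp16 is four times the Q4 footprint, not two, which is why almost everyone self-hosts quantized.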

What's the difference between an instruct model and a base model?

Base models are trained only on next-token prediction — they autocomplete. Instruct (or Chat) models are the same base, fine-tuned on instructions and human feedback to act like a helpful assistant. Always pick the -Instruct or -Chat variant for chat use cases.

Is DeepSeek V4 really better than Claude Opus and GPT-5?

On many open benchmarks DeepSeek V4 685B is competitive with or beats hosted frontier models. On the hardest reasoning tasks (ARC-AGI 2, FrontierMath), Claude Opus 4.7 still edges it out. For everyday code, chat, and reasoning, the gap is small to nonexistent.

What is MoE (Mixture of Experts)?

An architecture where the model has many expert sub-networks but each token only activates a few. A 685B-parameter MoE like DeepSeek V4 only uses 22B params per forward pass. Result: training efficiency and inference cost similar to a much smaller model.
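The routing idea fits in a few lines of plain Python. A toy sketch with made-up dimensions and a linear router — an illustration of top-k gating, not DeepSeek's actual architecture:

```python
import math

def moe_layer(x, experts, router_weights, k=2):
    """Toy mixture-of-experts step: a router scores every expert for the
    input, only the top-k experts actually run, and their outputs are
    combined weighted by a softmax over the selected scores."""
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in router_weights]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i]) for i in top]
    gates = [e / sum(exps) for e in exps]
    out = [0.0] * len(x)
    for gate, i in zip(gates, top):
        for j, v in enumerate(experts[i](x)):  # only these k experts run
            out[j] += gate * v
    return out, top

# 4 experts total, but each token activates just 2 of them:
experts = [lambda x, s=s: [v * s for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router = [[0.1, 0.0], [0.9, 0.0], [0.0, 0.2], [0.0, 0.8]]
y, active = moe_layer([1.0, 1.0], experts, router)
print(active)  # [1, 3] — the two highest-scoring experts
```

Scale the same picture up and you get the headline numbers: all experts sit in memory, but only the routed fraction of parameters does work per token.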

Can I run an LLM on my laptop?

Yes. A 16GB Mac (M3/M4) runs Llama 4 8B Q4, Phi-4 Q4, or Qwen3.6 7B Q4 at 25-40 tokens per second using Ollama or LM Studio. 8GB Macs can run 8B models comfortably; 4GB GPUs can run 3B models.

What is GGUF?

The standard file format for quantized LLMs used by llama.cpp, Ollama, LM Studio, and most local inference tools. A .gguf file contains weights, tokenizer, and metadata in one cross-platform binary. The successor to GGML.
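The fixed preamble of a .gguf file can be inspected with nothing but the stdlib. A sketch of the header layout as described in the ggml GGUF spec: the ASCII magic 'GGUF', then little-endian version, tensor count, and metadata key/value count (the sample values below are synthetic):

```python
import struct

def read_gguf_header(data: bytes):
    """Parse the fixed GGUF preamble: 4-byte magic b'GGUF', then
    little-endian uint32 version, uint64 tensor count, uint64
    metadata key/value count (24 bytes total)."""
    magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", data[:24])
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# A synthetic header, just to show the layout:
fake = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(fake))  # {'version': 3, 'tensors': 291, 'metadata_kv': 24}
```

After this preamble come the metadata key/value pairs (architecture, tokenizer, quantization type) and then the tensor data itself.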

Where do I download open LLM weights safely?

First-party sources: official organization accounts on Hugging Face (meta-llama, deepseek-ai, Qwen, mistralai, google, microsoft, allenai). For quantized GGUFs, trusted re-uploaders like TheBloke, lmstudio-community, and bartowski. Always verify SHA256 against the source repo.
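The verification step takes a few lines of stdlib Python — stream the file through the hash so a multi-gigabyte weight file never has to fit in RAM (the file name and expected hash below are placeholders):

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a (possibly huge) file through SHA-256 in 1MB chunks
    instead of loading it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Compare against whatever hash the source repo publishes:
# assert sha256_of("model-Q4_K_M.gguf") == expected_hash
```

If the hexdigest doesn't match the hash published in the source repo, delete the file and re-download from a first-party account.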

How do I fine-tune an open LLM?

QLoRA is the default in 2026 — it quantizes the base model to 4-bit and trains small low-rank adapters on top. Tools: Unsloth, Axolotl, LLaMA-Factory. Memory: roughly 6-10GB for a 7B QLoRA fine-tune and ~48GB for a 70B (the original QLoRA paper tuned a 65B model on a single 48GB GPU). Always build an eval set first, or you can't tell whether the fine-tune helped.
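The adapter math behind (Q)LoRA is small enough to show directly: the base weight W stays frozen, and a rank-r update scaled by alpha/r is added on top. A toy pure-Python sketch with made-up dimensions (real implementations use tensors and a real optimizer, of course):

```python
def matmul(a, b):
    """Minimal matrix multiply so the sketch has no dependencies."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, alpha=2.0, r=1):
    """Frozen base path x@W plus the trainable low-rank path
    (alpha/r) * x@A@B. Only A (d x r) and B (r x d) are trained,
    so the adapter is tiny compared with W."""
    base = matmul(x, W)
    low = matmul(matmul(x, A), B)
    s = alpha / r
    return [[b + s * l for b, l in zip(br, lr)] for br, lr in zip(base, low)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 base weight
A = [[1.0], [0.0]]            # d x r, r = 1
B0 = [[0.0, 0.0]]             # r x d, zero-initialized as in LoRA
x = [[1.0, 2.0]]

print(lora_forward(x, W, A, B0))                        # [[1.0, 2.0]] — zero B leaves the base untouched
print(lora_forward(x, W, A, [[0.1, 0.0]], alpha=1.0))   # [[1.1, 2.0]] — trained B nudges the output
```

Zero-initializing B means training starts exactly at the base model's behavior, which is why LoRA runs are stable; QLoRA adds 4-bit quantization of W on top of this same idea.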

Which open LLM is best for code?

DeepSeek Coder V3 33B leads HumanEval at 89.4 among open code models, with fill-in-the-middle support. StarCoder 3 trails on benchmarks but trains only on permissively licensed code — important for IP-conscious teams. Both run on a single A100.

Which open LLM is best for multilingual?

Qwen3.6 35B with 119 languages supported and strong Chinese/English bilingual results. Runner-up: Mistral Large 3 for European languages. For Indian languages, look at Sarvam-1 (not in this directory yet) and Llama 4 70B fine-tunes.

What is RAG and which open LLM is best for it?

Retrieval-Augmented Generation = retrieve relevant docs from a vector DB, append to prompt, then generate. Command R+ 2 has the best RAG-tuned open weights but is research-license-only. For commercial RAG: InternLM 3 (Apache, 200K context) or Llama 4 70B with custom prompting.
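The loop described above — embed, retrieve, stuff the prompt — sketched with a toy bag-of-words "embedding" standing in for a real embedding model and vector DB (the documents and question are made up):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' — stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rag_prompt(question, docs, top_k=1):
    """Score every doc against the question, keep the best top_k,
    and prepend them to the prompt. The 'generate' step would hand
    this prompt to the LLM."""
    ranked = sorted(docs, key=lambda d: cosine(embed(question), embed(d)), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

docs = [
    "InternLM 3 is Apache-2.0 licensed with a 200K context window.",
    "GGUF is the file format used by llama.cpp and Ollama.",
]
print(rag_prompt("What license does InternLM 3 use?", docs))
```

Only the relevant document ends up in the context window; production systems swap in a real embedding model, a vector database, and chunking, but the shape of the loop is the same.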

How does this directory stay current?

We maintain it manually with monthly updates as models release. If you spot a model we should add or stats that need updating, email choppy.young@gmail.com. Last updated April 2026.

Built something with an open LLM?

Tell us what worked. We'll update the directory.

choppy.young@gmail.com