How to Build Private RAG on Internal Docs with an Open LLM
Set up retrieval-augmented generation over your company's docs using Command R+ 2 or InternLM 3 — without sending data to OpenAI.
- STEP 1
Pick model + embedder
Generation: Command R+ 2 35B (research license permits internal use) or InternLM 3 70B (Apache-licensed, suitable for production). Embedder: BGE-M3 or NV-Embed-v2 for multilingual support.
- STEP 2
Index your docs
Use LlamaIndex or LangChain to chunk docs into 500-1000-token chunks with ~100 tokens of overlap, embed each chunk, and store the vectors in a vector DB (Qdrant, Weaviate, or pgvector).
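The chunking step can be sketched in plain Python. This is a minimal sliding-window chunker, not LlamaIndex's or LangChain's implementation; sizes here are counted in whitespace-delimited words as a rough stand-in for tokens, so a real pipeline would swap in the embedder's tokenizer:

```python
def chunk_text(text, chunk_size=800, overlap=100):
    """Split text into overlapping chunks.

    chunk_size and overlap are in words here as an approximation of
    tokens; use the embedder's tokenizer for accurate counts.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk repeats the last 100 words of its predecessor, so a sentence split at a chunk boundary still appears whole in at least one chunk.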
- STEP 3
Wire up retrieval
On query: embed the query → run a top-K (8-15) similarity search → rerank the candidates with a cross-encoder → stuff the survivors into the model prompt. Command R+ 2 has a built-in prompt template for this; with InternLM 3 you wire the prompt manually.
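The retrieval hop can be sketched with pure-Python cosine similarity. This stands in for the vector DB's search call (Qdrant et al. do the same math, faster), and reranking is omitted since it just re-sorts the returned candidates with a cross-encoder's scores:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, top_k=10):
    """index: list of (chunk_text, embedding) pairs.

    Returns the top_k chunk texts by cosine similarity, best first --
    the same contract a vector DB's search endpoint gives you.
    """
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```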
- STEP 4
Add citations
Command R+ 2 and most modern instruction-tuned models can emit citation markers like [1] [2] that refer to retrieved chunks. Display these in your UI as 'Source: docs/foo.md L42'.
- STEP 5
Run on-premise
Deploy with vLLM or TGI on a single A100 80GB (for Command R+ 2 35B) or an H100 (for InternLM 3 70B). Add Open WebUI as the chat surface. The whole stack runs on one GPU machine with no external dependencies.
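Once vLLM is serving, it exposes an OpenAI-compatible HTTP API, so the app tier is a plain JSON POST. A stdlib-only client sketch; the port and the served model name are assumptions for your deployment:

```python
import json
import urllib.request

# Assumed vLLM default; adjust to your deployment.
VLLM_URL = "http://localhost:8000/v1/chat/completions"

def build_request(prompt, model="served-model"):
    """Build the payload for an OpenAI-compatible chat completions call.

    model must match the name vLLM is serving under (an assumption here).
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # low temperature keeps answers grounded in the docs
        "max_tokens": 512,
    }

def ask(prompt, model="served-model"):
    """POST the prompt to the local vLLM server and return the answer text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_request(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the API shape matches OpenAI's, Open WebUI and most client libraries can point at this endpoint unchanged.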