
How to Build Private RAG on Internal Docs with an Open LLM

Set up retrieval-augmented generation over your company's docs using Command R+ 2 or InternLM 3 — without sending data to OpenAI.

  1. STEP 1

    Pick model + embedder

    Generation: Command R+ 2 35B (research license permits internal use) or InternLM 3 70B (Apache-2.0, safe to ship in production). Embedder: BGE-M3 or NV-Embed-v2 if you need multilingual support.
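The choice above boils down to two stack profiles. A minimal sketch of that decision as data — the embedder repo IDs are real Hugging Face names, but the generator labels are just the display names from this guide, not authoritative model paths:

```python
# Two stack profiles from the guide; generator strings are display names,
# not Hugging Face repo IDs -- substitute your actual model paths.
STACKS = {
    "research": {
        "generator": "Command R+ 2 35B",   # research license: internal use only
        "embedder": "BAAI/bge-m3",         # multilingual embedder
        "min_gpu": "A100 80GB",
    },
    "production": {
        "generator": "InternLM 3 70B",     # Apache-2.0: OK for production
        "embedder": "nvidia/NV-Embed-v2",  # multilingual embedder
        "min_gpu": "H100 80GB",
    },
}

def pick_stack(for_production: bool) -> dict:
    """Return the stack profile matching the deployment licensing needs."""
    return STACKS["production" if for_production else "research"]
```

Encoding this as config rather than hard-coding model names keeps the rest of the pipeline identical when you swap generators.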

  2. STEP 2

    Index your docs

    Use LlamaIndex or LangChain to chunk docs into 500–1000-token pieces with a 100-token overlap, embed each chunk, and store the vectors in a vector DB (Qdrant, Weaviate, or pgvector).
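The chunking step can be sketched without any framework. This is a minimal dependency-free version that approximates tokens as whitespace-separated words (a real pipeline would count tokens with the embedder's own tokenizer, as the framework splitters do):

```python
def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping windows of roughly `size` tokens.

    Tokens are approximated as whitespace-separated words; consecutive
    chunks share `overlap` words so retrieval doesn't lose context at
    chunk boundaries.
    """
    words = text.split()
    if not words:
        return []
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # last window already covers the tail
            break
    return chunks
```

Each chunk would then be embedded and upserted into the vector DB alongside its source path, which step 4 needs for citations.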

  3. STEP 3

    Wire up retrieval

    On each query: embed the query → top-K (8–15) similarity search → rerank with a cross-encoder → stuff the winners into the model prompt. Command R+ 2 ships a built-in prompt template for this; with InternLM 3 you wire the prompt up manually.
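The manual wiring for the InternLM-style path can be sketched in plain Python. This version uses brute-force cosine similarity in place of a vector DB query and skips the cross-encoder rerank (in practice you would call the DB's search API and a reranker such as a BGE reranker; both are assumptions here, not part of the snippet):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list[float], index: list[tuple[str, list[float]]],
             k: int = 8) -> list[str]:
    """Return the k chunk texts most similar to the query vector."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in scored[:k]]

def build_prompt(question: str, chunks: list[str]) -> str:
    """Number the retrieved chunks so the model can cite them as [n]."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (f"Answer using only the sources below. Cite as [n].\n\n"
            f"{context}\n\nQuestion: {question}")
```

Numbering the chunks in the prompt is what makes the citation step possible: the model's [1], [2] markers map back to these indices.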

  4. STEP 4

    Add citations

    Command R+ 2 and most modern instruction-tuned models can emit citation markers like [1] [2] that refer to the retrieved chunks. Display them in your UI as 'Source: docs/foo.md L42'.
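Turning the model's markers into readable source links is a small post-processing pass. A sketch, assuming the markers follow the [n] convention and that you stored each chunk's source path at index time:

```python
import re

def render_citations(answer: str, sources: list[str]) -> str:
    """Rewrite [n] markers in a model answer as [n: source-path].

    `sources` is the ordered list of source paths for the retrieved
    chunks, so marker [1] maps to sources[0]. Out-of-range markers are
    left untouched rather than guessed at.
    """
    def repl(m: re.Match) -> str:
        n = int(m.group(1))
        if 1 <= n <= len(sources):
            return f"[{n}: {sources[n - 1]}]"
        return m.group(0)

    return re.sub(r"\[(\d+)\]", repl, answer)
```

A UI would typically render these as clickable links back to the chunk's position in the original document.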

  5. STEP 5

    Run on-premise

    Deploy with vLLM or TGI on a single A100 80GB (Command R+ 2 35B) or H100 (InternLM 3 70B). Put Open WebUI in front as the chat surface. The whole stack runs on one GPU machine with no external dependencies.
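A deployment sketch for the vLLM + Open WebUI path. The model repo ID on the first serve line is a placeholder, not a confirmed Hugging Face path; the vLLM flags and the Open WebUI image are real:

```shell
# Serve the generator with vLLM's OpenAI-compatible API on port 8000.
# YOUR_ORG/your-model-repo is a placeholder -- substitute the actual repo ID.
pip install vllm
vllm serve YOUR_ORG/your-model-repo \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000

# Point Open WebUI at that endpoint as the chat surface.
docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  ghcr.io/open-webui/open-webui:main
```

Because vLLM exposes an OpenAI-compatible API, the retrieval code from step 3 can call it with any OpenAI-style client while everything stays on your own hardware.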

Recommended models