Multilingual RAG That Actually Works: A Practical Architecture

Summary
- Index, retrieve, and reason in one base language.
- Translate only at input and output, with entity protection and glossaries.
- Use language-aware chunking and a cross-encoder reranker to keep context small and relevant.
- Evaluate on groundedness, retrieval quality, and cross language agreement.
Why one base language works
Running separate indices for many languages multiplies complexity. Models perform unevenly across languages, caches fragment, and audits become harder. Consolidating on one base language gives you a single vector store to tune, one reranking stack to secure, and one grounded generation policy to validate. Content can arrive in any language. Normalize it into English for retrieval and reasoning, and keep originals for previews and audit. Operations become simpler, and incident response is faster because there is one path to debug.
Ingest: normalize first, translate once
Reliability starts before the first query. Parse web pages, office files, and scans. Normalize punctuation and whitespace, repair hyphenation, and preserve structure such as titles, headings, lists, tables, and code blocks. Detect language at document and block level and store a confidence score. When detection is uncertain, use script and character distribution checks as a backstop.
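A minimal sketch of block-level detection with a script-distribution backstop, assuming the langdetect package; the confidence threshold and the script ranges are illustrative, not exhaustive.

```python
from langdetect import detect_langs, LangDetectException

# Coarse Unicode ranges for a script check; illustrative, not exhaustive.
SCRIPT_RANGES = {
    "latin": [(0x0041, 0x024F)],
    "cyrillic": [(0x0400, 0x04FF)],
    "arabic": [(0x0600, 0x06FF)],
    "cjk": [(0x3040, 0x30FF), (0x4E00, 0x9FFF), (0xAC00, 0xD7AF)],
}

def dominant_script(text: str) -> str | None:
    """Count characters per script and return the most frequent one."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    for ch in text:
        cp = ord(ch)
        for name, ranges in SCRIPT_RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                counts[name] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None

def detect_block_language(text: str, min_conf: float = 0.80) -> dict:
    """Detect language with a confidence score, flagging uncertain blocks
    so callers can lean on the script signal instead of a shaky guess."""
    try:
        top = detect_langs(text)[0]          # best candidate, e.g. de:0.93
        lang, conf = top.lang, top.prob
    except LangDetectException:
        lang, conf = None, 0.0
    return {
        "lang": lang,
        "confidence": conf,
        "script": dominant_script(text),
        "uncertain": conf < min_conf,
    }
```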
Protect entities and sensitive data before any external service sees the text. Brand terms, SKUs, order numbers, addresses, and personal names should be masked or marked do-not-translate. Apply a versioned glossary so terms like "refund window" or "store credit" remain consistent across channels. Translate to the base language once at ingest. Keep the original for traceability, then chunk, embed, and index only the base-language version.
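Here is one way the mask-translate-restore step could look. The placeholder format, the SKU and order-number patterns, and the call_mt_engine stub are all illustrative assumptions, not a specific vendor API.

```python
import re

# Illustrative do-not-translate patterns; real systems load these per tenant.
DO_NOT_TRANSLATE = [
    re.compile(r"\bSKU-\d{4,}\b"),        # hypothetical SKU format
    re.compile(r"\bORD-[A-Z0-9]{8}\b"),   # hypothetical order-number format
]

def mask_entities(text: str) -> tuple[str, dict[str, str]]:
    """Swap protected spans for stable placeholders before the text leaves
    the trust boundary; return the mapping so they can be restored later."""
    mapping: dict[str, str] = {}
    for pattern in DO_NOT_TRANSLATE:
        for match in sorted(set(pattern.findall(text))):
            key = f"[[ENT{len(mapping)}]]"
            mapping[key] = match
            text = text.replace(match, key)
    return text, mapping

def unmask_entities(text: str, mapping: dict[str, str]) -> str:
    """Put the original entities back after translation."""
    for key, original in mapping.items():
        text = text.replace(key, original)
    return text

def call_mt_engine(text: str, glossary: str) -> str:
    """Placeholder for the MT provider call; identity for this sketch."""
    return text

def translate_once(text: str, glossary_version: str) -> str:
    """Translate to the base language exactly once at ingest, recording
    the glossary version so the audit trail can reproduce the output."""
    masked, mapping = mask_entities(text)
    translated = call_mt_engine(masked, glossary=glossary_version)
    return unmask_entities(translated, mapping)
```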
Chunking: respect structure and script
Chunking drives recall and groundedness.
- Target moderate passages with modest overlap. Break on sentence boundaries and align with headings.
- For scripts without spaces, run word segmentation before sentence splitting. If segmentation confidence is low, fall back to semantic windows anchored on punctuation, bullets, and section headers.
- Keep tables and lists intact. Group related rows or items and carry captions and column headers into the chunk so numbers have meaning.
- Add a short breadcrumb such as Title › H2 › H3 to improve header recall without bloating context.
Fewer and stronger chunks perform better in retrieval, stabilize reranking, and keep generation extractive.
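A compact sketch of chunking under these rules, assuming a character budget stands in for tokens; the budget, overlap, and breadcrumb format are illustrative.

```python
import re

def sentences(text: str) -> list[str]:
    """Naive splitter; swap in a word segmenter first for spaceless scripts."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_section(heading_path: list[str], body: str,
                  max_chars: int = 1200, overlap: int = 1) -> list[str]:
    """Pack whole sentences into size-bounded chunks with a one-sentence
    overlap, prefixing each chunk with its heading breadcrumb. Tables and
    lists should bypass this and be kept intact as their own chunks."""
    breadcrumb = " › ".join(heading_path)        # e.g. Title › H2 › H3
    chunks: list[str] = []
    window: list[str] = []
    for sent in sentences(body):
        if window and sum(len(s) for s in window) + len(sent) > max_chars:
            chunks.append(breadcrumb + "\n" + " ".join(window))
            window = window[-overlap:]            # carry modest overlap forward
        window.append(sent)
    if window:
        chunks.append(breadcrumb + "\n" + " ".join(window))
    return chunks
```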
Query path: translate at the edges, reason in the middle
When a user asks a question, detect the language and record confidence. Protect entities in the query so product names, IDs, and addresses remain stable. Translate the question into the base language using the same glossary you used at ingest. Cache common templates and paraphrases to cut latency.
Embed the base language query, retrieve candidates from the base language index, and rerank with a cross encoder. Keep a small top set for generation. Apply hard filters for tenant, version, and freshness, and soft boosts for recency and exact heading matches.
If language detection or translation confidence is low, widen retrieval and, if needed, union in a quick multilingual search over the original text before reranking everything together. When groundedness is weak, prefer a short clarification over a speculative answer.
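A sketch of that confidence-gated fallback; retrieve_base and retrieve_multilingual are hypothetical stand-ins for your index clients, and the thresholds are illustrative.

```python
def retrieve_base(query: str, top_k: int) -> list[dict]:
    """Stub for the base-language ANN index; returns dicts with an 'id' key."""
    return []

def retrieve_multilingual(query: str, top_k: int) -> list[dict]:
    """Stub for a quick multilingual search over the original text."""
    return []

def retrieve_with_fallback(query: str, detect_conf: float, translate_conf: float,
                           base_k: int = 20, wide_k: int = 50) -> list[dict]:
    """When language detection or translation confidence is low, widen
    top-k and union in multilingual hits; the reranker sorts the pool."""
    confident = detect_conf >= 0.8 and translate_conf >= 0.8   # illustrative bar
    k = base_k if confident else wide_k
    candidates = retrieve_base(query, top_k=k)
    if not confident:
        seen = {c["id"] for c in candidates}
        candidates += [c for c in retrieve_multilingual(query, top_k=k)
                       if c["id"] not in seen]
    return candidates
```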
Retrieval and reranking: small, sharp, fresh
Use a strong embedding model for the base language and a high-recall ANN index. Avoid blind query expansion; let chunk quality and reranking do the work. Enforce tenant, version, and freshness with hard filters, and apply soft boosts where they matter. A cross-encoder reranker usually pays for itself by allowing fewer, better passages, which reduces tokens, noise, and latency.
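A sketch using the sentence-transformers CrossEncoder class; the checkpoint named here is just a common public one, not a recommendation from this architecture.

```python
from sentence_transformers import CrossEncoder

# A widely used public checkpoint; substitute your own fine-tuned reranker.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, passages: list[str], keep: int = 5) -> list[str]:
    """Score each (query, passage) pair jointly and keep a small top set,
    so the generator sees fewer, better passages."""
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in ranked[:keep]]
```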
Grounded generation and back-translation
Keep the generator on a short leash. It should see only the reranked passages and rules that enforce citation and extraction-first behavior. Tool calls for dates or arithmetic are fine; invented sources are not. Produce a base-language draft with citations. Translate that draft back to the user language, restore protected entities, and localize numbers, currencies, and dates. English in the middle keeps reasoning stable as models evolve.
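One way to express that short leash in code. The prompt wording and the call_llm, translate_out, and restore_entities stubs are assumptions for illustration.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for your model client."""
    return ""

def translate_out(text: str, target_lang: str) -> str:
    """Placeholder for the outbound MT call; number, currency, and date
    localization would also happen at this edge."""
    return text

def restore_entities(text: str, mapping: dict[str, str]) -> str:
    """Put protected entities back into the localized draft."""
    for key, original in mapping.items():
        text = text.replace(key, original)
    return text

def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Constrain the generator to the supplied passages, with citations."""
    numbered = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using ONLY the passages below. Quote or closely paraphrase, "
        "cite passage numbers like [1], and reply 'not found' if the "
        "passages do not contain the answer.\n\n"
        f"Passages:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )

def answer(question_base: str, passages: list[str], user_lang: str,
           entity_map: dict[str, str]) -> str:
    """Draft in the base language, then translate and restore at the edge."""
    draft = call_llm(build_grounded_prompt(question_base, passages))
    return restore_entities(translate_out(draft, user_lang), entity_map)
```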
Guardrails, observability, and failure modes
Design for uncertainty.
- Low confidence in language detection or translation. Widen retrieval, keep more candidates, and merge multilingual hits before reranking.
- Low groundedness. Ask for clarification instead of guessing.
- Policy. Mask sensitive fields at the edges and unmask only where rules allow.
- Audit. Persist language scores, translation engine and glossary version, candidate IDs and text hashes for both original and base text, reranker scores, and groundedness ratings; a record sketch follows below.
These controls matter more than shaving a few milliseconds from a single model call.
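A sketch of what that audit record might persist, with field names as assumptions.

```python
from dataclasses import dataclass, field
import hashlib

def text_hash(text: str) -> str:
    """Stable hash so audits can verify content without storing it twice."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@dataclass
class AuditRecord:
    query_id: str
    detected_lang: str
    detect_confidence: float
    mt_engine: str                                             # engine name and version
    glossary_version: str
    candidate_ids: list[str] = field(default_factory=list)
    original_hashes: list[str] = field(default_factory=list)   # pre-translation text
    base_hashes: list[str] = field(default_factory=list)       # base-language text
    reranker_scores: list[float] = field(default_factory=list)
    groundedness: float | None = None
```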
Evaluation: prove it end to end
Measure retrieval and generation separately and together.
- Retrieval quality. Build parallel query sets with gold passages, run the full translate-retrieve-rerank path, and score the ranked results as sketched after this list.
- Faithfulness and helpfulness. Use a constrained judge that only sees supplied passages to score groundedness and entailment.
- Term and style consistency. Check translation with automated metrics and glossary consistency rules to catch drift in key terms.
- Cross language agreement. Canonicalize numbers, dates, and named entities and ensure answers to the same question agree factually across languages.
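A sketch of scoring the retrieval side with recall@k and mean reciprocal rank over gold passages; the query-set shape is an assumption.

```python
def recall_at_k(ranked_ids: list[str], gold_ids: set[str], k: int = 5) -> float:
    """Fraction of gold passages that appear in the top-k results."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in gold_ids)
    return hits / max(len(gold_ids), 1)

def mrr(ranked_ids: list[str], gold_ids: set[str]) -> float:
    """Reciprocal rank of the first gold passage, 0 if none was retrieved."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in gold_ids:
            return 1.0 / rank
    return 0.0

def evaluate(queries: list[dict], k: int = 5) -> dict[str, float]:
    """Each query dict carries 'ranked' (post-rerank IDs from the production
    translate-retrieve-rerank path) and 'gold' (the gold passage IDs)."""
    n = max(len(queries), 1)
    return {
        f"recall@{k}": sum(recall_at_k(q["ranked"], q["gold"], k) for q in queries) / n,
        "mrr": sum(mrr(q["ranked"], q["gold"]) for q in queries) / n,
    }
```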
Operationally, watch resolution rate, human handoff rate, citation clicks, 24-hour reopen rate, and latency and token cost by user language.
Latency and cost: aim for predictability
Translation adds overhead. Manage it. Cache common questions and variants. Pretranslate high traffic knowledge pages at ingest. Keep context tight through strong reranking. Favor simple and repeatable prompts over sprawling instruction sets. Predictable latency makes service levels enforceable and incidents easier to triage.
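A sketch of a normalized translation cache; the whitespace-and-case normalization and the LRU eviction policy are assumptions, and a real system would layer paraphrase templates on top.

```python
from collections import OrderedDict

class TranslationCache:
    """Tiny LRU cache keyed on a normalized query so paraphrases that
    normalize identically skip the MT call entirely."""

    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self._store: OrderedDict[str, str] = OrderedDict()

    @staticmethod
    def _key(query: str, lang: str) -> str:
        # Illustrative normalization: lowercase, collapse whitespace.
        return f"{lang}:{' '.join(query.lower().split())}"

    def get(self, query: str, lang: str) -> str | None:
        key = self._key(query, lang)
        if key not in self._store:
            return None
        self._store.move_to_end(key)          # mark as recently used
        return self._store[key]

    def put(self, query: str, lang: str, translation: str) -> None:
        key = self._key(query, lang)
        self._store[key] = translation
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)   # evict least recently used
```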
Common pitfalls
- Running multiple live indices by language.
- Splitting chunks mid-sentence, mid-table, or mid-list.
- Letting translation alter names and IDs.
- Overstuffing contexts in place of reranking.
- Relying on intuition instead of gold passages and groundedness checks.
Case study: HoverBot
HoverBot translates on write, keeps retrieval and reasoning in English, and translates at the edges with entity locking and tenant glossaries. Chunking preserves structure and script nuances, retrieval uses a high recall index with a cross encoder reranker, and context stays small. When language or translation confidence is low, retrieval widens and may union multilingual hits before reranking. Evaluation is strict on groundedness and cross language agreement, and all decisions are logged for audit. The outcome is consistent answers across locales with predictable latency and cost.