
Step 1: You Are Not Building a Search Engine
When people start with RAG, they assume retrieval is about accuracy — finding the “right” document. But retrieval in RAG is about usefulness, not just correctness. The best chunk isn’t always the one that has the answer word-for-word. Sometimes it’s the chunk that gives context, so the LLM can reason better. Tune your retriever for relevance plus helpfulness, not just keyword matching. Users don’t care about metrics. They care about answers.
Step 2: Chunking Is an Art Form
The chunking strategy makes or breaks your system. Bad chunking means important info gets sliced mid-thought, or related concepts get spread across different vectors.
Real-world chunking tips:
- Semantic chunking > size-based chunking. Use headings, paragraphs, and logical breakpoints.
- Overlap your chunks slightly (say 10–20%). It’s like adding mortar between bricks.
- If your domain is very factual (like laws, specs, or APIs), prefer smaller, precise chunks.
- If your domain is narrative (like sales enablement or onboarding), use larger, story-like chunks.
Use recursive chunking — first by section, then by paragraph, then by sentences if needed.
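Here is a minimal sketch of that recursive idea in plain Python (no libraries): split on headings first, then paragraphs, then sentences, and add a little overlap at the end. It assumes markdown-style headings, and the regexes, size budget, and overlap are illustrative values to tune for your own corpus and embedding model.

```python
# Recursive chunking sketch: split on markdown-style headings first, then blank
# lines (paragraphs), then sentence boundaries, until each piece fits the budget.
import re

MAX_CHARS = 1200   # rough per-chunk budget; tune for your embedding model
OVERLAP = 150      # roughly 10-20% overlap between adjacent chunks

def split_recursive(text: str) -> list[str]:
    separators = [r"\n#{1,3} ", r"\n\n", r"(?<=[.!?]) "]  # headings, paragraphs, sentences
    return _split(text, separators)

def _split(text: str, separators: list[str]) -> list[str]:
    if len(text) <= MAX_CHARS or not separators:
        return [text.strip()] if text.strip() else []
    chunks: list[str] = []
    for part in re.split(separators[0], text):
        chunks.extend(_split(part, separators[1:]))
    return chunks

def add_overlap(chunks: list[str]) -> list[str]:
    """Prepend the tail of the previous chunk so boundary context isn't lost."""
    return [
        (chunks[i - 1][-OVERLAP:] + " " + chunk).strip() if i > 0 else chunk
        for i, chunk in enumerate(chunks)
    ]

# Example: chunks = add_overlap(split_recursive(open("handbook.md").read()))
```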
Step 3: Retrieval Alone Is Not Enough
You need a few layers of safety:
- Cite your sources. Make it obvious where every piece of information came from.
- Re-rank your candidates. Retrieval should be multi-stage: initial recall, then re-ranking based on semantic similarity or even task-specific relevance (e.g., “Is this paragraph answering a medical question?”). A minimal retrieve-and-rerank sketch follows the example prompt below.
- Ground the output. Use prompt engineering to force the LLM to stick closely to retrieved passages. Few-shot examples help a lot here.
Example grounding prompt:
Answer the following question based solely on the provided documents. If the documents do not contain the answer, say, “The answer is not available in the documents.”
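To make the re-ranking and grounding concrete, here is a rough sketch using the sentence-transformers library: a bi-encoder for broad recall, a cross-encoder to re-rank, and the grounding prompt wrapped around whatever survives. The model names and k values are examples only, and in a real system you would precompute and store the corpus embeddings in your vector DB rather than encoding them on every query.

```python
# Two-stage retrieval sketch: bi-encoder recall, cross-encoder re-ranking,
# then a grounded prompt built from the surviving passages.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(question: str, chunks: list[str],
                        recall_k: int = 20, top_k: int = 4) -> list[str]:
    # Stage 1: cheap recall. In production, precompute corpus embeddings instead
    # of encoding the whole corpus per query.
    corpus_emb = bi_encoder.encode(chunks, convert_to_tensor=True)
    query_emb = bi_encoder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=recall_k)[0]
    candidates = [chunks[hit["corpus_id"]] for hit in hits]

    # Stage 2: precise (but slower) re-ranking on the small candidate set.
    scores = cross_encoder.predict([(question, c) for c in candidates])
    ranked = [c for _, c in sorted(zip(scores, candidates),
                                   key=lambda pair: pair[0], reverse=True)]
    return ranked[:top_k]

def build_grounded_prompt(question: str, passages: list[str]) -> str:
    # Number the passages so the model (and the user) can cite them.
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the following question based solely on the provided documents. "
        "If the documents do not contain the answer, say, "
        '"The answer is not available in the documents."\n\n'
        f"Documents:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```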
Step 4: Cost, Latency, and User Expectations Will Haunt You
You might think a system that calls the retriever and the LLM in sequence is fine.
When you scale, you realize:
- Vector DB queries have real latency (~50–200 ms).
- LLM inference can be slow (~2–10s per query).
- And users expect instant responses (like Google fast, not “waiting for a reply” fast).
Real talk:
- Batch retrievals if you can.
- Keep the top-k small (k=3–5 is often enough).
- Fine-tune a smaller LLM (like Mistral or Zephyr) if your generation needs are lightweight.
- Cache aggressively. Precompute common question embeddings, if possible.
- Use smarter retrieval triggers — don’t always call the retriever if the question is simple (both the caching and trigger tricks are sketched below).
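Two of those tricks, caching query embeddings and skipping retrieval for trivial questions, fit in a few lines. This again assumes sentence-transformers; the small-talk list and word-count threshold are placeholder heuristics, not recommendations.

```python
# Caching + retrieval-trigger sketch. The lru_cache means a repeated question
# never hits the embedding model twice; needs_retrieval() is a deliberately
# simple heuristic for skipping the vector DB on trivial queries.
from functools import lru_cache
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

@lru_cache(maxsize=10_000)
def embed_query(question: str):
    # Normalize before embedding so trivially different phrasings share a cache entry.
    return embedder.encode(question.strip().lower())

SMALL_TALK = {"hi", "hello", "thanks", "thank you", "bye"}

def needs_retrieval(question: str) -> bool:
    """Skip the retriever for greetings and very short queries."""
    q = question.strip().lower()
    return q not in SMALL_TALK and len(q.split()) > 2
```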
Step 5: Your RAG System Is a Living, Breathing Beast
You don’t “launch” a RAG system. You adopt it, like a needy, hyperactive puppy.
It needs:
- Constant retraining of retrievers
- Periodic re-embedding of documents (especially if you update content)
- Monitoring for drift — if users start asking different types of questions over time
- Feedback loops — let users upvote/downvote answers, and retrain your retriever on this signal (a minimal logging sketch follows this list).
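A feedback loop doesn’t have to be fancy to start. Here is a minimal sketch that appends votes to a JSONL log and later mines the helpful ones into (question, chunk) pairs you can feed into retriever fine-tuning. The file name and record schema are invented for illustration.

```python
# Feedback-loop sketch: log every vote, then yield positive pairs for
# fine-tuning the retriever later.
import json
import time

FEEDBACK_LOG = "retriever_feedback.jsonl"

def log_feedback(question: str, chunk_id: str, helpful: bool) -> None:
    record = {"ts": time.time(), "question": question,
              "chunk_id": chunk_id, "helpful": helpful}
    with open(FEEDBACK_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")

def training_pairs():
    """Yield (question, chunk_id) pairs the retriever should learn to rank higher."""
    with open(FEEDBACK_LOG) as f:
        for line in f:
            record = json.loads(line)
            if record["helpful"]:
                yield record["question"], record["chunk_id"]
```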
Biggest myth: “Static knowledge base = set and forget.” Real users will break whatever assumptions you made in your comfy dev environment.
Finally, if you’ve made it this far, congratulations: you now know more about RAG than 90% of the people posting cute diagrams on LinkedIn. Building a great RAG system isn’t about fancy tech stacks or the newest model releases. It’s about sweating the tiny, boring, invisible details — document structures, retrieval tuning, prompt crafting, and user expectations.