Production Systems
Lesson 12 of 14 · 20 min
Retrieval-Augmented Generation (RAG)
Ground LLM responses in your own data to build reliable LLM applications
Learning goals
- Understand the RAG architecture
- Learn to implement basic RAG systems
- Know common RAG pitfalls and solutions
Why RAG?
LLMs on their own have key limitations:
- Knowledge cutoff: they don't know about recent events
- No access to your private data
- They may hallucinate facts
RAG addresses these by:
1. Retrieving relevant documents from your data
2. Augmenting the prompt with this context
3. Generating a response grounded in real information
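The three steps above can be sketched end to end. This is a toy, assuming a tiny in-memory document list and a bag-of-words similarity standing in for real embeddings; a production system would call an embedding model and a vector database instead.

```python
import math
import re
from collections import Counter

# Toy in-memory corpus standing in for your document store.
DOCS = [
    "The refund policy allows returns within 30 days of purchase.",
    "Our headquarters are located in Berlin, Germany.",
    "Support is available by email 24/7.",
]

def embed(text: str) -> Counter:
    # Stand-in "embedding": a bag-of-words count vector.
    # A real system would call an embedding model here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    # Step 1: retrieve the k documents most similar to the query.
    q = embed(query)
    ranked = sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    # Step 2: augment the prompt with the retrieved context.
    # Step 3 (generation) would send this string to an LLM.
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using ONLY the context below. If the answer is not "
        'in the context, say "I don\'t know."\n\n'
        f"Context:\n{joined}\n\nQuestion: {query}"
    )

prompt = build_prompt("What is the refund policy?",
                      retrieve("What is the refund policy?"))
```

Swapping `embed` for a real model and `DOCS` for a vector index turns this skeleton into the full architecture shown below.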
RAG Architecture
User Query
↓
┌─────────────────┐
│ Embed Query │
└────────┬────────┘
↓
┌─────────────────┐
│ Vector Search │ ← Your document embeddings
└────────┬────────┘
↓
┌─────────────────┐
│ Top K Results │
└────────┬────────┘
↓
┌─────────────────┐
│ Augment Prompt │
└────────┬────────┘
↓
┌─────────────────┐
│ Generate │
└────────┬────────┘
↓
Response

Implementation Considerations
Chunking Strategy
- Too small: loses context
- Too large: noise and irrelevance
- Sweet spot: 200-500 tokens with overlap
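One way to implement the overlap is a sliding window. A minimal sketch, using plain list items as a stand-in for model tokens (a real pipeline would count tokens with the model's tokenizer):

```python
def chunk(tokens: list[str], size: int = 300, overlap: int = 50) -> list[list[str]]:
    # Slide a window of `size` tokens, stepping by (size - overlap)
    # so consecutive chunks share `overlap` tokens of context.
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

words = ["tok"] * 700        # a 700-token document
pieces = chunk(words)        # three chunks: 300, 300, and 200 tokens
```

The shared 50-token overlap means a sentence split at a chunk boundary still appears whole in at least one chunk.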
Retrieval Quality
- Number of results (k): balance relevance vs. context size
- Similarity threshold: filter out low-relevance results
- Hybrid search: combine semantic and keyword matching
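A hybrid ranker can be sketched by blending a semantic score with a keyword-overlap score, then filtering by a threshold and keeping the top k. Here a bag-of-words cosine stands in for embedding similarity, and `alpha` is an illustrative weighting parameter; a production system would typically combine a vector index with a lexical scorer such as BM25.

```python
import math
import re
from collections import Counter

def _bow(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_search(query: str, docs: list[str], k: int = 3,
                  threshold: float = 0.1, alpha: float = 0.7) -> list[str]:
    q = _bow(query)
    scored = []
    for doc in docs:
        d = _bow(doc)
        semantic = _cosine(q, d)  # stand-in for embedding similarity
        keyword = len(set(q) & set(d)) / len(q) if q else 0.0
        score = alpha * semantic + (1 - alpha) * keyword
        if score >= threshold:    # drop low-relevance results
            scored.append((score, doc))
    scored.sort(reverse=True)     # highest combined score first
    return [doc for _, doc in scored[:k]]
```

Tuning `k`, `threshold`, and `alpha` against a labeled query set is the practical way to balance relevance against context size.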
Prompt Design
- Clearly separate the context from the question
- Instruct the model to say "I don't know" if the context doesn't contain the answer
- Consider citing sources
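These guidelines can be combined in one prompt template. A hypothetical format, with bracketed source ids so the model can cite where each fact came from:

```python
def rag_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    # `chunks` are (source_id, text) pairs so the model can cite sources.
    context = "\n\n".join(f"[{sid}] {text}" for sid, text in chunks)
    return (
        "Answer the question using ONLY the context below.\n"
        "Cite the [source] id for each fact you use.\n"
        'If the context does not contain the answer, reply "I don\'t know."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

p = rag_prompt("Where is the company headquartered?",
               [("doc-2", "Our headquarters are located in Berlin, Germany.")])
```

The hard separation (labeled context section, then the question) makes it less likely the model treats retrieved text as instructions.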
Common mistakes
× Not saying "I don't know": the model may hallucinate if the context lacks the answer
× Poor chunking: chunks that are too large or too small hurt retrieval
× Ignoring metadata: dates, sources, and document types improve relevance
× No evaluation: track retrieval quality and answer accuracy separately
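Retrieval quality can be measured on its own with standard IR metrics. For example, recall@k over a small labeled set of (query, relevant documents) pairs:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the known-relevant documents found in the top-k results.
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0
```

Answer accuracy still needs its own separate check (e.g., human review of generated answers), since perfect retrieval can still produce a wrong generation.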
Key takeaways
+ RAG grounds LLM responses in your actual data
+ Quality depends on both retrieval and generation
+ Chunking strategy significantly impacts results
+ Always include instructions for handling missing information