The most surprising thing about RAG is that it’s not fundamentally about finding the right answer, but about finding the most plausible answer given the context.
Let’s see this in action. Imagine you have a PDF about astrophysics. You want to ask Claude, a powerful LLM, "What is the redshift of a galaxy with a recession velocity of 10,000 km/s?"
First, we need to get that PDF into a format LlamaIndex can use.
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load every file in ./data (place the astrophysics PDF there)
documents = SimpleDirectoryReader("./data").load_data()

# Build a vector index: the documents are chunked and each chunk is embedded
index = VectorStoreIndex.from_documents(documents)

# Create a query engine; it calls the LLM configured in Settings.llm
# (set that to an Anthropic model if you want Claude to answer)
query_engine = index.as_query_engine()

# Query the index
response = query_engine.query("What is the redshift of a galaxy with a recession velocity of 10,000 km/s?")
print(response)
```
When you run this, LlamaIndex doesn’t just send your question to Claude. It first uses the index to find the most relevant chunks of text from your PDF that might contain the answer. It then takes those chunks and your original question and sends them together as a prompt to Claude. Claude, armed with this specific context, generates a much more informed and accurate response than it could have on its own.
The core problem RAG solves is the LLM’s "knowledge cutoff" and its tendency to hallucinate when asked about specific, private, or very recent information. LLMs are trained on massive datasets, but that data is static and doesn’t include your specific documents. RAG bridges this gap.
Internally, LlamaIndex does a few key things:
- Indexing: It breaks your documents into smaller pieces (chunks). For each chunk, it generates an embedding, a numerical vector that represents the chunk's meaning. These embeddings are stored in a vector store, which is held in memory by default and can be swapped for an external vector database.
- Retrieval: When you ask a question, LlamaIndex also generates an embedding for your question. It then searches the vector database for document chunks whose embeddings are "closest" (most similar in meaning) to your question’s embedding.
- Augmentation: The retrieved chunks are combined with your original question into a single, augmented prompt.
- Generation: This augmented prompt is sent to the LLM (like Claude), which uses the provided context to formulate its answer.
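The four stages above can be sketched in plain Python. This is a toy illustration rather than LlamaIndex's implementation: `embed` is a bag-of-words counter standing in for a real embedding model, the three chunks are made-up sample text, and the helper names (`embed`, `cosine`, `retrieve`, `build_prompt`) are hypothetical. Generation is left out; the final prompt is what would be sent to the LLM.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, index, top_k=2):
    """Return the top_k chunks whose vectors are closest to the question's vector."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

def build_prompt(question, chunks):
    """Combine the retrieved chunks and the question into one augmented prompt."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Made-up sample chunks, as if they had been cut from the PDF
chunks = [
    "For small velocities, redshift is approximately recession velocity divided by the speed of light.",
    "Spiral galaxies contain large amounts of gas and dust.",
    "The speed of light is about 299,792 km/s.",
]

index = [(c, embed(c)) for c in chunks]            # Indexing
question = "What is the redshift of a galaxy with a recession velocity of 10,000 km/s?"
top_chunks = retrieve(question, index, top_k=2)    # Retrieval
prompt = build_prompt(question, top_chunks)        # Augmentation
print(prompt)                                      # Generation would send this to the LLM
```

With this sample data the two chunks about redshift and the speed of light are retrieved and the unrelated chunk about gas and dust is left out. A word-count embedding is easily swayed by common words such as "the" and "of", which is one reason real pipelines use learned embedding models.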
The exact levers you control are primarily in the indexing and retrieval stages. You can tweak:
- Chunking strategy: How large are the pieces of text you break your documents into? Smaller chunks can be more precise but might lose broader context. Larger chunks retain context but can bury the relevant passage among unrelated text. For a physics document, a chunk size of 512 tokens is a reasonable starting point.
- Embedding model: Which model do you use to generate embeddings? Different models have different strengths and capture nuances in text differently. `text-embedding-ada-002` from OpenAI and `bge-small-en-v1.5` from HuggingFace are common choices.
- Similarity metric: How do you measure "closeness" between embeddings? Cosine similarity is the standard, but others exist.
- Number of retrieved chunks: How many of the top-matching chunks do you send to the LLM? Too few, and you might miss crucial information; too many, and you risk overwhelming the LLM or exceeding its context window. You might set `similarity_top_k=3` or `similarity_top_k=5`.
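To make the chunk-size trade-off concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not LlamaIndex's splitter: the function name `chunk_words` is hypothetical, and it counts whitespace-separated words as a rough stand-in for tokens. The overlap repeats a few words at each boundary so that a sentence cut in two still appears whole in at least one chunk.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into windows of chunk_size words; neighbours share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks

# A synthetic 20-word "document": w0 w1 ... w19
text = " ".join(f"w{i}" for i in range(20))

small = chunk_words(text, chunk_size=5, overlap=1)    # many narrow chunks
large = chunk_words(text, chunk_size=12, overlap=2)   # few wide chunks

print(len(small), len(large))
print(small[0], "|", small[1])
```

Smaller windows produce more chunks, each covering less text, so a match is more specific but carries less surrounding context. The number of chunks you retrieve interacts with this choice: retrieving 3 chunks of 512 tokens places roughly 1,500 tokens of context in the prompt, before the question and instructions are added.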
Retrieval doesn't have to stop at grabbing the top k most similar chunks. Many pipelines add an optional second step called "reranking," which LlamaIndex does not apply unless you configure it. After an initial retrieval based on embedding similarity, a more sophisticated (and often slower) model re-scores each retrieved chunk against the specific question. This helps ensure that the chunks sent to the LLM are not just semantically similar but are truly the most pertinent to answering the question, even if their initial embedding scores were slightly lower than others.
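The two-stage idea can be sketched as follows. Both scorers are toy stand-ins: `first_stage_score` plays the role of embedding similarity and `rerank_score` plays the role of a slower reranking model. The names `fetch_k` and `top_n`, the stopword list, and the sample chunks are all assumptions made for illustration.

```python
import re

STOPWORDS = {"the", "of", "a", "is", "what", "with", "and", "to", "in"}

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def first_stage_score(question, chunk):
    """Cheap score: count of shared words, stopwords included (stand-in for embedding similarity)."""
    return len(set(tokens(question)) & set(tokens(chunk)))

def rerank_score(question, chunk):
    """Stand-in for a reranker: fraction of the question's content words present in the chunk."""
    q = set(tokens(question)) - STOPWORDS
    return len(q & set(tokens(chunk))) / len(q) if q else 0.0

def retrieve_then_rerank(question, chunks, fetch_k=3, top_n=2):
    # Stage 1: fetch a generous candidate set with the cheap score
    candidates = sorted(chunks, key=lambda c: first_stage_score(question, c), reverse=True)[:fetch_k]
    # Stage 2: re-score only those candidates and keep the best top_n
    return sorted(candidates, key=lambda c: rerank_score(question, c), reverse=True)[:top_n]

question = "What is the redshift of a galaxy with a recession velocity of 10,000 km/s?"
chunks = [
    "What is the mass of a star with a radius of the Sun?",
    "What is a galaxy?",
    "Redshift equals recession velocity divided by light speed.",
    "Dust absorbs blue light.",
]

first_stage_top2 = sorted(chunks, key=lambda c: first_stage_score(question, c), reverse=True)[:2]
reranked_top2 = retrieve_then_rerank(question, chunks, fetch_k=3, top_n=2)

print(first_stage_top2)
print(reranked_top2)
```

In this sample the cheap score favors chunks that merely share filler words with the question, so the redshift chunk ranks third and would be dropped at k=2. Fetching three candidates and re-scoring them moves it to first place. The cost is that the second scorer runs once per candidate, which is why the candidate set is kept small.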
The next concept you’ll grapple with is managing the LLM’s context window, especially when dealing with very long documents or complex queries that require retrieving many chunks.