Enhancements and Evaluations
- This section details the different enhancements we can apply to our base RAG pipeline to better optimize our results.
Query Enhancement or Transformation
- Problem: Raw user queries may contain errors or lack clarity, leading to poor context retrieval.
- Solution: Leverage an LLM to improve queries by:
- Reformulating them for better retrieval context.
- Rewriting or expanding to multiple queries to cover a broader scope.
- Breaking complex questions into simpler sub-questions for more precise answers.
- ex: "What are the differences in features between A and B?" can be broken down to:
- Sub-query 1: “What are the features of A?”
- Sub-query 2: “What are the features of B?”
- Convert to a step-back query: Instead of focusing on the specific details or constraints of the original question, we step back to address the underlying concepts, context, or principles related to the query.
- User Query:
- “I have a dataset with 10 billion records and want to store it in PostgreSQL for querying. Is it possible?”
- Step-back Query:
- "What are the capabilities and limitations of PostgreSQL for storing and querying large-scale datasets, and how does it handle datasets with billions of records?"
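A minimal sketch of query decomposition: build a decomposition prompt for the LLM, then parse its response into sub-queries. The prompt template and the canned response are illustrative; the actual LLM call is omitted and would use whichever client your pipeline already has.

```python
# Sketch of LLM-based query decomposition. The prompt template is an
# illustrative assumption; the LLM call itself is stubbed with a canned response.

DECOMPOSE_PROMPT = """Break the user question into simpler sub-questions, \
one per line, each starting with "- ".

Question: {question}
Sub-questions:"""

def build_decomposition_prompt(question: str) -> str:
    """Fill the decomposition template with the raw user question."""
    return DECOMPOSE_PROMPT.format(question=question)

def parse_sub_queries(llm_output: str) -> list[str]:
    """Extract the '- ' prefixed sub-questions from the LLM's response."""
    subs = []
    for line in llm_output.splitlines():
        line = line.strip()
        if line.startswith("- "):
            subs.append(line[2:].strip())
    return subs

# Canned response standing in for a real LLM call:
fake_response = "- What are the features of A?\n- What are the features of B?"
subs = parse_sub_queries(fake_response)
```

Each sub-query is then embedded and retrieved against independently, and the results merged before generation.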
Query Routing
- Problem: How do we determine whether a query should be routed to the RAG system rather than handled directly by the LLM?
- For example, a query like “Who is the manager of Team A at my company?” likely requires domain-specific knowledge and should go to the RAG system, whereas “Please summarize this answer better” is a general request that the LLM can handle without external retrieval.
- Solution: Build a Query Router or Query Agent:
- Agentic RAG: Implement an LLM agent with predefined rules to classify queries. The agent decides whether the query requires an internet search, domain-specific knowledge (RAG), or can be answered directly by the LLM based on those rules.
- Classification Model Using Embeddings: Use a model to classify queries into categories (e.g., political, sports, technical) based on their embeddings. This helps determine if the query aligns with areas covered by the RAG system.
- Example Tool: GitHub Semantic-Router
- Embedding Score Threshold: Set a similarity score threshold (e.g., < 0.6) from the vector store. If the score is below this threshold, fallback to alternative paths such as an internet search, direct LLM response, or return a “404” (no relevant information found).
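The threshold approach can be sketched in a few lines. The embeddings and the 0.6 cutoff below are illustrative; a real system would take the score straight from its vector store instead of computing cosine similarity by hand.

```python
# Threshold-based query routing sketch: if the best similarity score from the
# vector store falls below a cutoff, fall back instead of running RAG.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Standard cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def route(query_vec: list[float], doc_vecs: list[list[float]],
          threshold: float = 0.6) -> str:
    """Return 'rag' when some stored vector is similar enough, else 'fallback'."""
    best = max((cosine_similarity(query_vec, d) for d in doc_vecs), default=0.0)
    return "rag" if best >= threshold else "fallback"

# Toy 2-d "embeddings" for illustration:
docs = [[1.0, 0.0], [0.7, 0.7]]
decision = route([1.0, 0.1], docs)  # close to the first doc
```

The "fallback" branch is where an internet search, a direct LLM answer, or a "no relevant information found" response would go.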
Multi-Modal Retrieval
- Problem: Documents like PDFs often include more than text (e.g., images, graphs, tables), but standard retrieval focuses only on text, missing valuable information.
- Solution: Use advanced tools to extract and embed non-text data:
- Extract images and tables, then embed them alongside text.
- For images: Use a model (an LLM with vision capabilities, or an image-text model such as OpenAI CLIP) to generate descriptions, storing these with metadata (e.g., a cloud image URL) for retrieval.
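One way to wire this up is to embed the generated description as ordinary text while keeping the image URL in metadata, so the answer can link back to the original image. The record layout and URL below are illustrative assumptions; the vision-model call that produces the description is omitted.

```python
# Sketch: packaging an extracted image as an embeddable text record.
# The description would come from a vision model; here it is supplied directly.

def build_image_record(image_url: str, description: str, page: int) -> dict:
    """Store a generated description with metadata so the description can be
    embedded like text, while the URL lets us surface the original image."""
    return {
        "text": description,  # this field gets embedded
        "metadata": {"source_url": image_url, "page": page, "type": "image"},
    }

# Hypothetical example record:
record = build_image_record(
    "https://storage.example.com/fig1.png",
    "Bar chart comparing quarterly revenue across regions.",
    page=4,
)
```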
Optimizing Pipeline Settings
- Chunk Size:
- Problem: Chunk size affects speed and context preservation, but the ideal size varies.
- Solution: Experiment with different chunk sizes and creation algorithms to balance speed and context.
- Adjust chunk and overlap size depending on the document's size and page count.
- Large chunks might include too much irrelevant information, while smaller chunks allow a RAG system to be more precise when retrieving answers to specific queries.
- For a very small document (e.g., 15 pages), a chunk size of 200 with an overlap of 60 might be enough.
- For larger files (e.g., 300 pages), a chunk size of 1000 with an overlap of 200 can work.
- Use overlapping chunks to preserve continuity (e.g., table headers, paragraph transitions), ensuring key details aren’t lost.
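A fixed-size chunker with overlap can be sketched as below. Sizes are in characters for simplicity; production pipelines often chunk by tokens or sentence boundaries instead.

```python
# Simple fixed-size chunker with overlap, matching the sizes discussed above.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 60) -> list[str]:
    """Split text into chunks of `chunk_size` characters, each sharing
    `overlap` characters with the previous chunk to preserve continuity."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 500, chunk_size=200, overlap=60)
```

With a 500-character input, each chunk starts 140 characters after the previous one, so the first 60 characters of every chunk repeat the tail of the chunk before it.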
- Embedding Model:
- Problem: The chosen embedding model impacts retrieval quality.
- Solution: Test different embedding models and evaluate their performance for your use case.
- Embedded Content Selection:
- Problem: What you choose to embed (e.g., an image description vs. the raw image, or which of an item's attributes) impacts retrieval quality for your use case.
- Solution: For an item with many attributes such as title, description, and price, embedding the description alone might lead to better results. For image embedding, instead of embedding the whole image, segmenting it into parts (e.g., pants, t-shirts) can yield more granular quality.
- The kNN Algorithm Used:
- Problem: Depending on the embedding data you have, the algorithm you use to retrieve relevant embeddings matters.
- Solution:
- facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.
- For less than a thousand embeddings, a brute force search makes sense
- For less than a million, a fast but not memory-efficient algorithm (such as HNSW) is appropriate
- For less than a billion, quantization (using k-means and IVF) becomes important
- At the trillion scale, the only solution becomes on-disk indices
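For the small end of that scale, brute-force search is a few lines of code. This sketch is the "less than a thousand embeddings" case; beyond that, a library like faiss (HNSW or IVF indices) takes over.

```python
# Brute-force nearest-neighbour search, appropriate for small collections
# (up to around a thousand embeddings, per the guidance above).
import math

def euclidean(a: list[float], b: list[float]) -> float:
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query: list[float], vectors: list[list[float]], k: int = 3) -> list[int]:
    """Return the indices of the k vectors closest to the query."""
    order = sorted(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))
    return order[:k]

# Toy 2-d vectors for illustration:
vecs = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.1, 0.0]]
nearest = knn([0.0, 0.1], vecs, k=2)
```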
- Final LLM Model:
- Problem: The LLM generating the final answer may not always perform optimally.
- Solution: Optimize or swap the LLM for better accuracy and efficiency.
Contextual Chunk Header or Metadata
- Problem: Without additional context, retrieved chunks may lack relevance or traceability.
- Solution: Add metadata to chunks (e.g., book name, page, year, author, security details) during processing.
- Benefits: Enables filtering for more relevant results, provides source info (book/page), identifies authorship, and supports authorization via security metadata.
- Further reading:
- Introducing Contextual Retrieval \ Anthropic
- Describes how prepending chunk-specific explanatory context to each chunk before embedding (“Contextual Embeddings”) and before creating the BM25 index (“Contextual BM25”) can improve RAG performance.
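A minimal sketch of attaching a contextual header before embedding; the header format and fields below are illustrative assumptions, not a prescribed schema.

```python
# Sketch: prepend source metadata to a chunk so the embedded text carries
# its own context, in the spirit of contextual chunk headers.

def with_context_header(chunk: str, book: str, page: int, author: str) -> str:
    """Prepend a metadata header so retrieval sees book/page/author context."""
    header = f"[source: {book} | page: {page} | author: {author}]"
    return f"{header}\n{chunk}"

# Hypothetical example:
enriched = with_context_header(
    "Indexes speed up lookups at the cost of slower writes.",
    book="Database Internals", page=42, author="A. Petrov",
)
```

The same metadata can also be stored as structured fields on the chunk, enabling filtered retrieval and authorization checks.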
Hybrid Search
- Problem: Dense vectors alone may miss some relevant matches that traditional keyword-based methods could catch.
- Solution: Implement hybrid search (dense + sparse vectors, aka fusion search):
- Sparse Vectors Explained: Represent content via word frequency (e.g., for "Why is Jaguar the best brand," sparse vector might encode "Jaguar: 1, the: 2," etc.), enabling traditional keyword matching.
- This can be very useful when the user’s query benefits from exact keyword matching, such as queries containing a product name or music title.
- Process: During ingestion, convert data into both dense and sparse vectors and store them. At query time, convert the user query into both formats and search the database, combining results for improved recall.
- Ensemble Retriever:
- A technique that combines additional retrieval methods (e.g., keyword-based or sparse retrievers) with dense vector embedding, merging their results.
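A toy illustration of hybrid scoring: a sparse keyword-overlap score fused with a dense similarity score via a weighted sum. Real systems typically use BM25 on the sparse side, normalize both scores to a common scale, and often fuse with methods like RRF; the weight and the term-count scoring here are illustrative.

```python
# Toy hybrid search scoring: weighted fusion of a dense similarity score
# and a sparse keyword-overlap score. Scores are unnormalized for simplicity.
from collections import Counter

def sparse_score(query: str, doc: str) -> float:
    """Keyword overlap: how many query terms appear in the document."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return float(sum(min(q[t], d[t]) for t in q))

def hybrid_score(dense: float, query: str, doc: str, alpha: float = 0.5) -> float:
    """Weighted sum of the dense similarity and the sparse keyword score."""
    return alpha * dense + (1 - alpha) * sparse_score(query, doc)

score = hybrid_score(0.8, "jaguar car review", "The Jaguar car impressed reviewers.")
```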
Reranking
- Problem: Vector search returns the top k matches (e.g., top 3), but a discarded result (e.g., 7th) might hold critical context.
- Additionally, we might have broken the user query into multiple sub-queries, each with its own top-k matches; how do we refine which results to forward to the LLM?
- Solutions:
- Good: Increase the result set (e.g., from top 3 to top 15 or 25), though this risks exceeding the LLM’s context window (fixed in size, despite recent growth).
- Better: Use a reranker model (aka cross-encoder or two-stage retrieval):
- Retrieve a larger set (e.g., top 25) from the vector database, then pass these with the query to a reranker. The reranker selects the top 3 most relevant results.
- Alternative Options:
- LLM-Based Ranking: Feed results to an LLM and let it rank them based on perceived relevance.
- Metadata-Based Ranking: Incorporate chunk metadata as a factor in the ranking process.
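For the sub-query case, one standard way to merge several ranked lists before (or instead of) a cross-encoder is Reciprocal Rank Fusion (RRF); the document IDs below are placeholders.

```python
# Reciprocal Rank Fusion: merge ranked result lists from multiple sub-queries
# into one list, rewarding documents that rank highly in several lists.
from collections import defaultdict

def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Score each doc by the sum of 1/(k + rank) across lists, best first.
    k=60 is the commonly used default damping constant."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two sub-queries returned overlapping ranked lists:
fused = rrf([["d1", "d2", "d3"], ["d2", "d4", "d1"]])
```

The top of the fused list (here the documents appearing high in both rankings) is what gets forwarded to the reranker or the LLM.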
Prompt Engineering
- Prompt engineering is the practice of optimizing LLM prompts to improve the quality and accuracy of generated output. It is often one of the lowest-hanging fruits among techniques for improving RAG systems.
- Prompt engineering does not require making changes to the underlying LLM itself. This makes it an efficient and accessible way to enhance performance without complex modifications.
- Check out Prompt Engineering best practices by major model providers such as Google, OpenAI and Anthropic.
Graph RAG (Graph-Augmented Retrieval)
- Problem: Traditional RAG relies on retrieving isolated chunks of data based on vector similarity, which may miss deeper relationships or dependencies between pieces of information (e.g., how concepts, entities, or events connect across documents). This can lead to incomplete or less coherent context for the LLM.
- For example, failing to answer questions like: "What are the main topics discussed in this document?"
- Solution: Implement Graph RAG by integrating a knowledge graph into the retrieval process to capture and leverage relationships:
- Example:
- Query: "How does SQLite improve database performance?"
- Vector search retrieves chunks about SQLite features.
- Graph RAG adds related info, like "SQLite → improves → query optimization" or "SQLite → used by → specific applications," based on graph connections.
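The expansion step can be sketched with a tiny in-memory graph of (subject, relation, object) triples; the triples and one-hop traversal below are illustrative, whereas real Graph RAG systems build the graph with an LLM and store it in a graph database.

```python
# Toy knowledge-graph expansion: after vector search identifies entities,
# follow graph edges to pull in related facts as extra context.

GRAPH: dict[str, list[tuple[str, str]]] = {
    "SQLite": [("improves", "query optimization"),
               ("used by", "embedded applications")],
    "query optimization": [("relies on", "indexes")],
}

def expand(entities: list[str], hops: int = 1) -> list[str]:
    """Collect 'subject -> relation -> object' facts reachable within `hops`."""
    facts, frontier = [], list(entities)
    for _ in range(hops):
        next_frontier = []
        for subj in frontier:
            for rel, obj in GRAPH.get(subj, []):
                facts.append(f"{subj} -> {rel} -> {obj}")
                next_frontier.append(obj)
        frontier = next_frontier
    return facts

facts = expand(["SQLite"])
```

These facts are appended to the retrieved chunks so the LLM sees the relationships, not just isolated text.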
Agentic RAG
- AI agents are autonomous systems that can interpret information, formulate plans, and make decisions. When added to a RAG pipeline, agents can reformulate user queries and re-retrieve more relevant information if initial results are inaccurate or irrelevant.
- Agentic RAG can also handle more complex queries that require multistep reasoning and tools (e.g., an internet search tool), like comparing information across multiple documents, asking follow-up questions, and iteratively adjusting retrieval and generation strategies.
Evaluations
- Levels of Complexity: RAG Applications - jxnl.co
- Evaluation measures (information retrieval)
- Metrics can be broadly split into two categories: online metrics and offline metrics.
- Online metrics can be measured only from the usage of the system, often in an A/B test setting. For recommendation in particular, the click-through rate, or directly the revenue, can be considered.
- Offline metrics are generally created from relevance judgment sessions where the judges score the quality of the search results. Both binary (relevant/non-relevant) and multi-level (e.g., relevance from 0 to 5) scales can be used to score each document returned in response to a query.
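As a concrete offline metric, precision@k over binary relevance judgments is one of the simplest; the document IDs and judgments below are placeholders for real judging-session data.

```python
# Example offline retrieval metric: precision@k with binary relevance labels.

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents judged relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

# Hypothetical ranked results and human judgments:
p = precision_at_k(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"}, k=3)
```

Multi-level judgments (e.g., 0 to 5) would instead feed graded metrics such as NDCG.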
Feedback Loop RAG
- Problem: RAG systems may not adapt to user needs or improve over time without insights into their performance, leading to stagnant or suboptimal results.
- Solution: Incorporate a user-based feedback loop to refine and enhance the RAG system iteratively.
- Example: If users consistently mark answers as "not helpful" for certain queries, tweak the retrieval process (e.g., increase overlap in chunks) or update the knowledge base.
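A minimal sketch of closing that loop: tally "not helpful" votes per query and flag queries that cross a threshold as candidates for retrieval tuning. The data shape and threshold are illustrative assumptions.

```python
# Feedback-loop sketch: surface queries that repeatedly get unhelpful answers.
from collections import Counter

def flag_problem_queries(feedback: list[tuple[str, bool]],
                         threshold: int = 2) -> list[str]:
    """feedback: (query, helpful) pairs. Return queries whose unhelpful-vote
    count reaches the threshold, as candidates for pipeline review."""
    unhelpful = Counter(q for q, helpful in feedback if not helpful)
    return sorted(q for q, n in unhelpful.items() if n >= threshold)

# Hypothetical collected feedback:
flags = flag_problem_queries([
    ("refund policy", False),
    ("refund policy", False),
    ("pricing", True),
])
```

The flagged queries are where you would experiment with chunk overlap, routing, or knowledge-base updates.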
LLM Based Evaluations
- Use an LLM to automate and standardize evaluation of RAG results.
- How It Works:
- Define measurement criteria, such as:
- Correctness: Does the answer accurately reflect the facts?
- Relevance: Is the response aligned with the query’s intent?
- Completeness: Does it cover all key aspects of the question?
- Pass the RAG output (retrieved context + generated answer) and the original query to an LLM evaluator.
- The LLM scores the output based on the set criteria, providing quantitative or qualitative feedback.
- Example: Test two chunk sizes (500 vs. 1000 characters) by having the LLM evaluate responses for relevance and completeness, then compare scores to pick the better setting.
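A sketch of the judge harness: build the evaluation prompt, then parse the judge LLM's per-criterion scores. The prompt wording and "criterion: score" output format are assumptions; the judge call itself is stubbed with a canned response.

```python
# LLM-as-judge sketch: prompt template plus a parser for the judge's scores.

EVAL_PROMPT = """Rate the answer from 1-5 on correctness, relevance, and completeness.
Query: {query}
Context: {context}
Answer: {answer}
Respond with one line per criterion, like "correctness: 4"."""

def parse_scores(judge_output: str) -> dict[str, int]:
    """Parse 'criterion: score' lines from the judge LLM's response."""
    scores: dict[str, int] = {}
    for line in judge_output.splitlines():
        if ":" in line:
            name, value = line.split(":", 1)
            if value.strip().isdigit():
                scores[name.strip().lower()] = int(value.strip())
    return scores

# Canned judge response standing in for a real LLM call:
scores = parse_scores("correctness: 4\nrelevance: 5\ncompleteness: 3")
```

Averaging these scores across a fixed query set lets you compare settings (e.g., the two chunk sizes above) quantitatively.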