Can AI generate truly relevant answers at scale? How do we make sure it understands complex, multi-turn conversations? And how do we keep it from confidently spitting out incorrect facts? These are the kinds of challenges that modern AI systems face, especially those built using Retrieval-Augmented Generation (RAG). RAG combines the power of document retrieval with the fluency of language generation, allowing systems to answer questions with context-aware, grounded responses. While basic RAG systems work well for straightforward tasks, they often stumble with complex queries, hallucinations, and context retention across longer interactions. That’s where advanced RAG techniques come in.
In this blog, we’ll explore how to level up your RAG pipelines, enhancing each stage of the stack: Indexing, Retrieval, and Generation. We’ll walk through powerful methods (with hands-on code) that can help improve relevance, reduce noise, and scale your system’s performance—whether you’re building a healthcare assistant, an educational tutor, or an enterprise knowledge bot.
Table of contents
- Where Does Basic RAG Fall Short?
- Indexing and Chunking: Building a Strong Foundation
- HNSW: Hierarchical Navigable Small Worlds
- Semantic Chunking
- Language Model-Based Chunking
- Leveraging Metadata: Adding Context
- Using GLiNER to Generate Metadata
- Retrieval: Finding the Right Information
- Hybrid Search
- Query Rewriting
- LLM Prompt-based Contextual Compression Retrieval
- Fine-Tuning Embedding Models
- Generation: Crafting High-Quality Responses
- Autocut to Remove Irrelevant Information
- Reranking Retrieved Objects
- Fine-Tuning the LLM
- Using RAFT: Adapting Language Models to Domain-Specific RAG
- Conclusion
Where Does Basic RAG Fall Short?
Let’s look at the Basic RAG framework:
This RAG system architecture shows the basic storing of chunk embeddings in the vector store. The first step is to load the documents, then split or chunk them using various chunking techniques, and then embed them using an embedding model so that they can be easily searched and understood by LLMs.
This image depicts the retrieval and generation steps of RAG; a question is asked by the user, and then our system extracts the results based on the question by searching the Vector store. Then the retrieved content is passed to the LLM along with the question, and the LLM provides a structured output.
Basic RAG systems have clear limitations, especially in demanding situations.
- Hallucinations: A major problem is hallucination. The model creates content that is factually wrong or not supported by the source documents. This hurts reliability, particularly in fields like medicine or law where precision is critical.
- Lack of Domain Specificity: Standard RAG models struggle with specialized topics. Without adapting the retrieval and generation processes to the specific details of a domain, the system risks finding irrelevant or inaccurate information.
- Complex Conversations: Basic RAG systems have trouble with complex queries or multi-turn conversations. They often lose the context across interactions. This leads to disconnected or incomplete answers. RAG systems must handle increasing query complexity.
Hence, we’ll go through each part of the RAG stack for advanced RAG techniques: Indexing, Retrieval, and Generation. We’ll discuss improvements using open-source libraries and resources. These advanced RAG techniques apply generally, whether you are building a healthcare chatbot, an educational bot, or other applications, and they will improve most RAG systems.
Let’s begin with the Advanced RAG Techniques!
Indexing and Chunking: Building a Strong Foundation
Good indexing is essential for any RAG system. The first step involves how we bring in, break up, and store data. Let’s explore methods to index data, focusing on indexing and chunking text and using metadata.
1. HNSW: Hierarchical Navigable Small Worlds
Hierarchical Navigable Small Worlds (HNSW) is an effective algorithm for finding similar items in large datasets. It helps in quickly locating approximate nearest neighbors (ANN) by using a structured approach based on graphs.
- Proximity Graph: HNSW builds a graph where each point connects to nearby points. This structure allows for efficient searching.
- Hierarchical Structure: The algorithm organizes points into multiple layers. The top layer connects distant points, while lower layers connect closer points. This setup speeds up the search process.
- Greedy Routing: HNSW uses a greedy method to find neighbors. It starts at a high-level point and moves to the nearest neighbor until it reaches a local minimum. This method reduces the time needed to find similar items.
How does HNSW work?
HNSW’s operation involves several key components:
- Input Layer: Each data point is represented as a vector in a high-dimensional space.
- Graph Construction:
- Nodes are added to the graph one at a time.
- Each node is assigned to a layer based on a probability function. This function decides how likely a node is to be placed in a higher layer.
- The algorithm balances the number of connections and the speed of searches.
- Search Process:
- The search starts at a specific entry point in the top layer.
- The algorithm moves to the nearest neighbor at each step.
- Once it reaches a local minimum, it shifts to the next lower layer and continues searching until it finds the closest point in the bottom layer.
- Parameters:
- M: The number of neighbors connected to each node.
- efConstruction: This parameter affects how many neighbors the algorithm considers when building the graph.
- efSearch: This parameter influences the search process, determining how many neighbors to evaluate.
HNSW’s design allows it to find similar items quickly and accurately. This makes it a strong choice for tasks that require efficient searches in large datasets.
The image depicts a simplified HNSW search: starting at the “entry point” (blue), the algorithm navigates the graph towards the “query vector” (yellow). The “nearest neighbor” (striped) is identified by traversing edges based on proximity. This illustrates the core concept of navigating a graph for efficient approximate nearest neighbor search.
Hands-on HNSW
Follow these steps to implement the Hierarchical Navigable Small Worlds (HNSW) algorithm with FAISS. This guide includes example outputs to illustrate the process.
Step 1: Set Up HNSW Parameters
First, define the parameters for the HNSW index. You need to specify the size of the vectors and the number of neighbors for each node.
```python
import faiss
import numpy as np

# Set up HNSW parameters
d = 128  # Size of the vectors
M = 32   # Number of neighbors for each node
```
Step 2: Initialize the HNSW Index
Create the HNSW index using the parameters defined above.
```python
# Initialize the HNSW index
index = faiss.IndexHNSWFlat(d, M)
```
Step 3: Set efConstruction
Before adding data to the index, set the `efConstruction` parameter. This parameter controls how many neighbors the algorithm considers when building the index.
```python
efConstruction = 200  # Example value for efConstruction
index.hnsw.efConstruction = efConstruction
```
Step 4: Generate Sample Data
For this example, generate random data to index. Here, `xb` represents the dataset you want to index.
```python
# Generate a random dataset of vectors
n = 10000  # Number of vectors to index
xb = np.random.random((n, d)).astype('float32')

# Add data to the index (this builds the HNSW graph)
index.add(xb)
```
Step 5: Set efSearch
After building the index, set the `efSearch` parameter. This parameter affects the search process.
```python
efSearch = 100  # Example value for efSearch
index.hnsw.efSearch = efSearch
```
Step 6: Perform a Search
Now, you can search for the nearest neighbors of your query vectors. Here, `xq` represents the query vectors.
```python
# Generate random query vectors
nq = 5  # Number of query vectors
xq = np.random.random((nq, d)).astype('float32')

# Perform a search for the top k nearest neighbors
k = 5  # Number of nearest neighbors to retrieve
distances, indices = index.search(xq, k)

# Output the results
print("Query Vectors:\n", xq)
print("\nNearest Neighbors Indices:\n", indices)
print("\nNearest Neighbors Distances:\n", distances)
```
Output
```
Query Vectors:
 [[0.12345678 0.23456789 ... 0.98765432]
 [0.23456789 0.34567890 ... 0.87654321]
 [0.34567890 0.45678901 ... 0.76543210]
 [0.45678901 0.56789012 ... 0.65432109]
 [0.56789012 0.67890123 ... 0.54321098]]

Nearest Neighbors Indices:
 [[ 123 456 789 101 112]
 [ 234 567 890 123 134]
 [ 345 678 901 234 245]
 [ 456 789 012 345 356]
 [ 567 890 123 456 467]]

Nearest Neighbors Distances:
 [[0.123 0.234 0.345 0.456 0.567]
 [0.234 0.345 0.456 0.567 0.678]
 [0.345 0.456 0.567 0.678 0.789]
 [0.456 0.567 0.678 0.789 0.890]
 [0.567 0.678 0.789 0.890 0.901]]
```
2. Semantic Chunking
This approach divides text based on meaning, not just fixed sizes. Each chunk represents a coherent piece of information. We calculate the cosine distance between sentence embeddings. If two consecutive sentences are semantically similar (their cosine distance is below a threshold), they go in the same chunk. This creates chunks of different lengths based on the content’s meaning.
- Pros: Creates more coherent and meaningful chunks, improving retrieval.
- Cons: Requires more computation (using a BERT-based encoder).
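To make the mechanism concrete before the library-based example below, here is a simplified sketch (not the LangChain implementation): embed the sentences and start a new chunk whenever the cosine distance between consecutive sentences exceeds a threshold. The sentence-transformers model name and the 0.3 threshold are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_chunks(sentences, threshold=0.3):
    """Group consecutive sentences into chunks based on cosine distance."""
    if not sentences:
        return []
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine distance between consecutive sentence embeddings
        distance = 1 - float(np.dot(embeddings[i - 1], embeddings[i]))
        if distance > threshold:  # semantic shift: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```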
Hands-on Semantic Chunking
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(OpenAIEmbeddings())
docs = text_splitter.create_documents([document])  # `document` is your raw text
print(docs[0].page_content)
```
This code utilizes SemanticChunker from LangChain, which splits a document into semantically related chunks using OpenAI embeddings. It creates document chunks where each chunk aims to capture coherent semantic units rather than arbitrary text segments.
3. Language Model-Based Chunking
This advanced method uses a language model to create complete statements from text. Each chunk is semantically whole. A language model (e.g., a 7-billion parameter model) processes the text. It breaks it into statements that make sense on their own. The model then combines these into chunks, balancing completeness and context. This method is computationally heavy but offers high accuracy.
- Pros: Adapts to the nuances of the text and creates high-quality chunks.
- Cons: Computationally expensive; may need fine-tuning for specific uses.
Hands-on Language Model-Based Chunking
```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set in the environment

async def generate_contexts(document, chunks):
    async def process_chunk(chunk):
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Generate a brief context explaining how this chunk relates to the full document."},
                {"role": "user", "content": f"<document> \n{document} \n</document> \nHere is the chunk we want to situate within the whole document \n<chunk> \n{chunk} \n</chunk> \nPlease give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."}
            ],
            temperature=0.3,
            max_tokens=100
        )
        context = response.choices[0].message.content
        return f"{context} {chunk}"

    # Process all chunks concurrently
    contextual_chunks = await asyncio.gather(
        *[process_chunk(chunk) for chunk in chunks]
    )
    return contextual_chunks
```
This code snippet utilizes an LLM (likely OpenAI’s GPT-4o via the client.chat.completions.create call) to generate contextual information for each chunk of a document. It processes each chunk asynchronously, prompting the LLM to explain how the chunk relates to the full document. Finally, it returns a list of the original chunks prepended with their generated context, effectively enriching them for improved search retrieval.
4. Leveraging Metadata: Adding Context
Adding and Filtering with Metadata
Metadata provides extra context. This improves retrieval accuracy. By including metadata like dates, patient age, and pre-existing conditions, you can filter out irrelevant information during searches. Filtering narrows the search, making retrieval more efficient and relevant. When indexing, store metadata alongside the text.
For example, in healthcare, patient records include age, visit date, and specific conditions. Use this metadata to filter search results so the system retrieves only relevant information. For instance, if a query relates to children, filter out records of patients over 18. This reduces noise and improves relevance.
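Metadata filtering can be applied at query time. Below is a hedged sketch assuming a Chroma vector store built with the langchain-chroma package; the collection name, the patient_age metadata field, and the filter values are illustrative.

```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Assumes chunks were indexed earlier with metadata such as {"patient_age": 11}
vectorstore = Chroma(
    collection_name="patient_records",   # illustrative collection name
    embedding_function=OpenAIEmbeddings(),
)

# Only search chunks whose metadata marks the patient as under 18
results = vectorstore.similarity_search(
    "asthma treatment options",
    k=3,
    filter={"patient_age": {"$lt": 18}},
)
```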
Example
Chunk #1
Source Metadata: {'id': 'doc:1c6f3e3f7ee14027bc856822871572dc:26e9aac7d5494208a56ff0c6cbbfda20', 'source': 'https://plato.stanford.edu/entries/goedel/'}
Source Text:
2.2.1 The First Incompleteness Theorem

In his Logical Journey (Wang 1996) Hao Wang published the full text of material Gödel had written (at Wang’s request) about his discovery of the incompleteness theorems. This material had formed the basis of Wang’s “Some Facts about Kurt Gödel,” and was read and approved by Gödel:
Chunk #2
Source Metadata: {'id': 'doc:1c6f3e3f7ee14027bc856822871572dc:d15f62c453c64072b768e136080cb5ba', 'source': 'https://plato.stanford.edu/entries/goedel/'}
Source Text:
The First Incompleteness Theorem provides a counterexample to completeness by exhibiting an arithmetic statement which is neither provable nor refutable in Peano arithmetic, though true in the standard model. The Second Incompleteness Theorem shows that the consistency of arithmetic cannot be proved in arithmetic itself. Thus Gödel’s theorems demonstrated the infeasibility of the Hilbert program, if it is to be characterized by those particular desiderata, consistency and completeness.
Here, we can see that the metadata contains the unique ID and source of each chunk, which provide extra context and make retrieval and filtering easier.
5. Using GLiNER to Generate Metadata
You won’t always have rich metadata, but a model like GLiNER can generate it on the fly! GLiNER tags and labels chunks during ingestion to create metadata.
Implementation
Give GLiNER each chunk along with the tags (entity labels) you want it to identify. If matching entities are found, it labels them; if no match is confident enough, no tags are produced.
This works well in general but may need fine-tuning for niche datasets. It improves retrieval accuracy but adds a processing step during ingestion.
GLiNER can parse incoming queries and match them against metadata labels for filtering.
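A hedged sketch using the gliner Python package; the checkpoint name, labels, and confidence threshold are illustrative choices.

```python
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")  # illustrative checkpoint

chunk = "Patient, age 11, presented on 2024-03-14 with mild asthma symptoms."
labels = ["age", "date", "medical condition"]

entities = model.predict_entities(chunk, labels, threshold=0.5)

# Attach the detected entities to the chunk as metadata
metadata = {entity["label"]: entity["text"] for entity in entities}
print(metadata)
```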
GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer
These techniques build a strong RAG system. They enable efficient retrieval from large datasets. The choice of chunking and metadata use depends on your dataset’s specific needs and features.
Retrieval: Finding the Right Information
Now, let’s focus on the “R” in RAG. How can we improve retrieval from a vector database? This is about retrieving all documents relevant to a query. This greatly increases the chances the LLM can produce high-quality results. Here are several techniques:
6. Hybrid Search
Hybrid search combines vector search (which captures semantic meaning) with keyword search (which finds exact matches), using the strengths of both. In AI, many terms are specific keywords: algorithm names, technology terms, names of LLMs. A vector search alone might miss these. Keyword search ensures these important terms are considered. Combining both methods creates a more complete retrieval process. These searches run at the same time.
Results are merged and ranked using a weighting system. For example, using Weaviate, you adjust the alpha parameter to balance vector and keyword results. This creates a combined, ranked list.
- Pros: Balances precision and recall, improving retrieval quality.
- Cons: Requires careful tuning of weights.
Hands-on Hybrid Search
```python
from langchain_community.retrievers import WeaviateHybridSearchRetriever
from langchain_core.documents import Document

# `client` is an existing weaviate.Client instance
retriever = WeaviateHybridSearchRetriever(
    client=client,
    index_name="LangChain",
    text_key="text",
    attributes=[],
    create_schema_if_missing=True,
)

retriever.invoke("the ethical implications of AI")
```
This code initializes a WeaviateHybridSearchRetriever for retrieving documents from a Weaviate vector database. It combines vector search and keyword search within Weaviate’s hybrid retrieval capabilities. Finally, it executes a query, “the ethical implications of AI” to retrieve relevant documents using this hybrid approach.
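To adjust the balance described above, Weaviate’s hybrid queries use an alpha weight (0.0 leans entirely on keyword/BM25 search, 1.0 entirely on vector search). The snippet below is a hedged sketch that assumes the LangChain retriever accepts alpha directly; 0.75 is an illustrative value favoring vector results.

```python
# Hedged sketch: `client` is an existing weaviate.Client instance
retriever = WeaviateHybridSearchRetriever(
    client=client,
    index_name="LangChain",
    text_key="text",
    alpha=0.75,  # 0.0 = pure keyword (BM25) search, 1.0 = pure vector search
    create_schema_if_missing=True,
)
```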
7. Query Rewriting
Query rewriting recognizes that human queries may not be optimal for databases or language models. Using a language model to rewrite queries significantly improves retrieval.
- Rewriting for Vector Databases: This transforms the user’s initial query into a database-friendly format. For example, “what are AI agents and why they are the next big thing in 2025” might become “AI agents big thing 2025”. We can use any LLM for rewriting the query so that it captures the important aspects of the query.
- Prompt Rewriting for Language Models: This involves automatically creating prompts to optimize interaction with the language model. This improves the quality and accuracy of results. We can use frameworks like DSPy, or any LLM, to rewrite the query. These rewritten queries and prompts ensure the search process retrieves relevant documents and the language model is prompted effectively.
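Here is a minimal sketch of LLM-based query rewriting, assuming the langchain_openai package; the prompt wording is an illustrative choice and any chat model could stand in.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

rewrite_prompt = ChatPromptTemplate.from_template(
    "Rewrite the following user question as a short, keyword-rich search query "
    "for a vector database. Return only the rewritten query.\n\nQuestion: {question}"
)

rewriter = rewrite_prompt | llm

rewritten = rewriter.invoke(
    {"question": "what are AI agents and why they are the next big thing in 2025"}
)
print(rewritten.content)  # e.g. "AI agents big thing 2025"
```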
Multi Query Retrieval
Retrieval can yield different results based on slight changes in how a query is worded. If the embeddings do not accurately reflect the meaning of the data, this issue can become more pronounced. To address these challenges, prompt engineering or tuning is often used, but this process can be time-consuming.
The MultiQueryRetriever simplifies this task. It uses a large language model (LLM) to create multiple queries from different angles based on a single user input. For each generated query, it retrieves a set of relevant documents. By combining the unique results from all queries, the MultiQueryRetriever provides a broader set of potentially relevant documents. This approach enhances the chances of finding useful information without the need for extensive manual tuning.
```python
import logging

from langchain_openai import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

chatgpt = ChatOpenAI(model_name="gpt-4o", temperature=0)

# `chroma_db3` is an existing Chroma vector store built earlier
similarity_retriever3 = chroma_db3.as_retriever(search_type="similarity", search_kwargs={"k": 2})

mq_retriever = MultiQueryRetriever.from_llm(
    retriever=similarity_retriever3, llm=chatgpt, include_original=True
)

# Set logging so we can see what queries are generated by the LLM
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

query = "what is the capital of India?"
docs = mq_retriever.invoke(query)
docs
```
This code sets up a multi-query retrieval system using LangChain. It generates multiple variations of the input query (“what is the capital of India?”). These variations are then used to query a Chroma vector database (chroma_db3) via a similarity retriever, aiming to broaden the search and capture diverse relevant documents. The MultiQueryRetriever ultimately aggregates and returns the retrieved documents.
Output
```
[Document(metadata={'article_id': '5117', 'title': 'New Delhi'}, page_content='New Delhi () is the capital of India and a union territory of the megacity of Delhi. It has a very old history and is home to several monuments where the city is expensive to live in. In traditional Indian geography it falls under the North Indian zone. The city has an area of about 42.7\xa0km. New Delhi has a population of about 9.4 Million people.'),
 Document(metadata={'article_id': '4062', 'title': 'Kolkata'}, page_content="Kolkata (spelled Calcutta before 1 January 2001) is the capital city of the Indian state of West Bengal. It is the second largest city in India after Mumbai. It is on the east bank of the River Hooghly. When it is called Calcutta, it includes the suburbs. This makes it the third largest city of India. This also makes it the world's 8th largest metropolitan area as defined by the United Nations. Kolkata served as the capital of India during the British Raj until 1911. Kolkata was once the center of industry and education. However, it has witnessed political violence and economic problems since 1954. Since 2000, Kolkata has grown due to economic growth. Like other metropolitan cities in India, Kolkata struggles with poverty, pollution and traffic congestion."),
 Document(metadata={'article_id': '22215', 'title': 'States and union territories of India'}, page_content='The Republic of India is divided into twenty-eight States,and eight union territories including the National Capital Territory.')]
```
8. LLM Prompt-based Contextual Compression Retrieval
Context compression helps improve the relevance of retrieved documents. This can occur in two main ways:
- Extracting Relevant Content: Remove parts of the retrieved documents that do not relate to the query. This means keeping only the sections that answer the question.
- Filtering Irrelevant Documents: Excluding documents that do not relate to the query, without altering the content of the documents themselves.
To achieve this, we can use the LLMChainExtractor, which reviews the initially returned documents and extracts only the relevant content for the query. It may also drop completely irrelevant documents.
Here is how to implement this using LangChain:
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

# Initialize the language model
chatgpt = ChatOpenAI(model_name="gpt-4o", temperature=0)

# Set up a similarity retriever
similarity_retriever = chroma_db3.as_retriever(search_type="similarity", search_kwargs={"k": 3})

# Create the extractor to get relevant content
compressor = LLMChainExtractor.from_llm(llm=chatgpt)

# Combine the retriever and the extractor
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=similarity_retriever)

# Example query
query = "What is the capital of India?"
docs = compression_retriever.invoke(query)
print(docs)
```
Output:
```
[Document(metadata={'article_id': '5117', 'title': 'New Delhi'}, page_content='New Delhi is the capital of India and a union territory of the megacity of Delhi.')]
```
For a different query:
query = "What is the old capital of India?" docs = compression_retriever.invoke(query) print(docs)
Output
```
[Document(metadata={'article_id': '4062', 'title': 'Kolkata'}, page_content='Kolkata served as the capital of India during the British Raj until 1911.')]
```
The `LLMChainFilter` offers a simpler but effective way to filter documents. It uses an LLM chain to decide which documents to keep and which to discard without changing the content of the documents.
Here’s how to implement the filter:
```python
from langchain.retrievers.document_compressors import LLMChainFilter

# Set up the filter
_filter = LLMChainFilter.from_llm(llm=chatgpt)

# Combine the retriever and the filter
compression_retriever = ContextualCompressionRetriever(base_compressor=_filter, base_retriever=similarity_retriever)

# Example query
query = "What is the capital of India?"
docs = compression_retriever.invoke(query)
print(docs)
```
Output
```
[Document(metadata={'article_id': '5117', 'title': 'New Delhi'}, page_content='New Delhi is the capital of India and a union territory of the megacity of Delhi.')]
```
For another query:
query = "What is the old capital of India?" docs = compression_retriever.invoke(query) print(docs)
Output:
```
[Document(metadata={'article_id': '4062', 'title': 'Kolkata'}, page_content='Kolkata served as the capital of India during the British Raj until 1911.')]
```
These strategies help refine the retrieval process by focusing on relevant content. The `LLMChainExtractor` extracts only the necessary parts of documents, while the `LLMChainFilter` decides which documents to keep. Both methods enhance the quality of the information retrieved, making it more relevant to the user’s query.
9. Fine-Tuning Embedding Models
Pre-trained embedding models are a good start. Fine-tuning these models on your data greatly improves retrieval.
Choosing the Right Models: For specialized fields like medicine, select models pre-trained on relevant data. For example, you can use the MedCPT family of query and document encoders, pre-trained on 255M query-article pairs from PubMed search logs.
Fine-Tuning with Positive and Negative Pairs: Collect your own data and create pairs of similar (positive) and dissimilar (negative) examples. Fine-tune the model to understand these differences. This helps the model learn domain-specific relationships, improving retrieval.
- Pros: Improves retrieval performance.
- Cons: Requires carefully created training data.
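Below is a minimal sketch of fine-tuning with the sentence-transformers classic fit API; the base model and the toy positive pairs are illustrative. MultipleNegativesRankingLoss treats the other passages in a batch as negatives, so only positive (query, relevant passage) pairs need to be supplied here.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Start from a general-purpose pre-trained model (illustrative choice)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Positive (query, relevant passage) pairs from your own domain data
train_examples = [
    InputExample(texts=["symptoms of asthma in children",
                        "Pediatric asthma commonly presents with wheezing and coughing."]),
    InputExample(texts=["treatment for type 2 diabetes",
                        "Metformin is a first-line therapy for type 2 diabetes."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: other passages in the batch act as dissimilar examples
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
model.save("finetuned-domain-embedder")
```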
These combined techniques create a strong retrieval system. This improves the relevance of objects given to the LLM, boosting generation quality.
Also read this: Training and Finetuning Embedding Models with Sentence Transformers v3
Generation: Crafting High-Quality Responses
Finally, let’s discuss improving the generation quality of a Language Model (LLM). The goal is to give the LLM context that is as relevant to the prompt as possible. Irrelevant data can trigger hallucinations. Here are tips for better generation:
10. Autocut to Remove Irrelevant Information
Autocut filters out irrelevant information retrieved from the database. This prevents the LLM from being misled.
- Retrieve and Score Similarity: When a query is made, multiple objects are retrieved with similarity scores.
- Identify and Cut Off: Use the similarity scores to find a cutoff point where scores drop significantly. Exclude objects beyond this point. This ensures that only the most relevant information is given to the LLM. For example, if you retrieve six objects, scores might drop sharply after the fourth. By looking at the rate of change, you can determine which objects to exclude.
Hands-on Autocut
```python
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# `docs` is the list of Documents to index
vectorstore = PineconeVectorStore.from_documents(
    docs, index_name="sample", embedding=OpenAIEmbeddings()
)

@chain
def retriever(query: str) -> List[Document]:
    docs, scores = zip(*vectorstore.similarity_search_with_score(query))
    for doc, score in zip(docs, scores):
        # Attach the similarity score to each document's metadata
        doc.metadata["score"] = score
    return list(docs)

result = retriever.invoke("dinosaur")
result
```
This code snippet uses LangChain and Pinecone to perform a similarity search. It embeds documents using OpenAI embeddings, stores them in a Pinecone vector store, and defines a retriever function. The retriever searches for documents similar to a given query (“dinosaur”), calculates similarity scores, and adds these scores to the document metadata before returning the results.
Output
```
[Document(page_content='In her second book, Dr. Simmons delves deeper into the ethical considerations surrounding AI development and deployment. It is an eye-opening examination of the dilemmas faced by developers, policymakers, and society at large.', metadata={}),
 Document(page_content='A comprehensive analysis of the evolution of artificial intelligence, from its inception to its future prospects. Dr. Simmons covers ethical considerations, potentials, and threats posed by AI.', metadata={}),
 Document(page_content="In his follow-up to 'Symbiosis', Prof. Sterling takes a look at the subtle, unnoticed presence and influence of AI in our everyday lives. It reveals how AI has become woven into our routines, often without our explicit realization.", metadata={}),
 Document(page_content='Prof. Sterling explores the potential for harmonious coexistence between humans and artificial intelligence. The book discusses how AI can be integrated into society in a beneficial and non-disruptive manner.', metadata={})]
```
We can see that it also returns the similarity scores; using these, we can cut off results at a threshold or at a sharp drop in score, as sketched below.
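The cutoff itself is easy to sketch: sort the returned documents by score and truncate at the first sharp drop between consecutive scores. This is an illustrative helper (the min_drop value is an assumption), applied to the result list produced above.

```python
def autocut(docs, min_drop=0.1):
    """Keep documents up to the first sharp drop in similarity score.

    Assumes each document carries a 'score' in its metadata, where higher
    means more similar, and that `min_drop` (illustrative) defines a drop
    large enough to cut at.
    """
    ranked = sorted(docs, key=lambda d: d.metadata["score"], reverse=True)
    for i in range(1, len(ranked)):
        if ranked[i - 1].metadata["score"] - ranked[i].metadata["score"] > min_drop:
            return ranked[:i]  # exclude everything after the drop
    return ranked

relevant_docs = autocut(result)
```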
11. Reranking Retrieved Objects
Reranking uses a more advanced model to re-evaluate and reorder the initially retrieved objects. This improves the quality of the final retrieved set.
- Overfetch: Initially retrieve more objects than needed.
- Apply Ranker Model: Use a high-latency model (typically a cross encoder) to re-evaluate relevance. This model considers the query and each object pairwise to reassess similarity.
- Reorder Results: Based on the new assessment, reorder the objects. Put the most relevant results at the top. This ensures that the most relevant documents are prioritized, improving the data given to the LLM.
Hands-on Reranking Retrieved Objects
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)

compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Jackson Brown"
)
print([doc.metadata["id"] for doc in compressed_docs])
pretty_print_docs(compressed_docs)
```
This code snippet utilizes FlashrankRerank within a ContextualCompressionRetriever to improve the relevance of retrieved documents. It specifically reranks documents obtained by a base retriever (represented by retriever) based on their relevance to the query “What did the president say about Ketanji Jackson Brown”. Finally, it prints the document IDs and the compressed, reranked documents.
Output
```
[0, 5, 3]

Document 1:

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
----------------------------------------------------------------------------------------------------
Document 2:

He met the Ukrainian people.

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.

Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland.

In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight.
----------------------------------------------------------------------------------------------------
Document 3:

And tonight, I’m announcing that the Justice Department will name a chief prosecutor for pandemic fraud.

By the end of this year, the deficit will be down to less than half what it was before I took office.

The only president ever to cut the deficit by more than one trillion dollars in a single year.

Lowering your costs also means demanding more competition.

I’m a capitalist, but capitalism without competition isn’t capitalism.

It’s exploitation—and it drives up prices.
```
The output shows that it reranks the retrieved chunks based on their relevance.
12. Fine-Tuning the LLM
Fine-tuning the LLM on domain-specific data greatly enhances its performance. For instance, take a model like Meditron 70B, a fine-tuned version of LLaMA 2 70B for medical data that uses both:
- Unsupervised Fine-Tuning: Continue pre-training on a large collection of domain-specific text (e.g., PubMed literature).
- Supervised Fine-Tuning: Further refine the model using supervised learning on domain-specific tasks (e.g., medical multiple-choice questions).
This specialized training helps the model perform well in the target domain. It outperforms its base model and larger, less specialized models like GPT-3.5 on specific tasks.
This image depicts the process of fine-tuning on task-specific examples. This approach allows developers to specify desired outputs, encourage certain behaviors, or achieve better control over the model’s responses.
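As a rough illustration of the supervised fine-tuning step, the sketch below attaches LoRA adapters to a causal language model using the peft and transformers libraries; the model name, the tiny dataset, and the hyperparameters are placeholders for illustration, not the Meditron recipe.

```python
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder; Meditron builds on LLaMA 2 70B
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Attach small trainable LoRA adapters instead of updating all weights
lora_config = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# Tiny illustrative domain dataset of instruction-style examples
examples = [
    {"text": "Question: What is a first-line therapy for type 2 diabetes?\nAnswer: Metformin."},
]
dataset = Dataset.from_list(examples).map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llm-finetuned",
        num_train_epochs=1,
        per_device_train_batch_size=1,
        learning_rate=2e-4,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```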
13. Using RAFT: Adapting Language Models to Domain-Specific RAG
RAFT, or Retrieval-Augmented Fine-Tuning, is a method that improves how large language models (LLMs) work in specific fields. It helps these models use relevant information from documents to answer questions more accurately.
- Retrieval-Augmented Fine-Tuning: RAFT combines fine-tuning with retrieval methods. This allows the model to learn from both useful and less useful documents during training.
- Chain-of-Thought Reasoning: The model generates answers that show its reasoning process. This helps it provide clear and accurate responses based on the documents it retrieves.
- Dynamic Document Handling: RAFT trains the model to find and use the most relevant documents while ignoring those that do not help answer the question.
Architecture of RAFT
The RAFT architecture includes several key components:
- Input Layer: The model takes in a question (Q) and a set of retrieved documents (D), which include both relevant and irrelevant documents.
- Processing Layer:
- The model analyzes the input to find important information in the documents.
- It creates an answer (A*) that refers to the relevant documents.
- Output Layer: The model produces the final answer based on the relevant documents while disregarding the irrelevant ones.
- Training Mechanism: During training, some data includes both relevant and irrelevant documents, while other data includes only irrelevant ones. This setup encourages the model to focus on context rather than memorization.
- Evaluation: The model’s performance is assessed based on its ability to answer questions accurately using the retrieved documents.
By using this architecture, RAFT enhances the model’s ability to work in specific domains. It provides a reliable way to generate accurate and relevant responses.
The top-left figure depicts the approach of adapting LLMs to reading solutions from a set of positive and distractor documents in contrast to the standard RAG setup, where models are trained based on the retriever outputs, which is a mixture of both memorization and reading. At test time, all methods follow the standard RAG setting, provided with top-k retrieved documents in the context.
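To make the training-data setup concrete, here is a hedged sketch of assembling one RAFT-style example: a question, an oracle (relevant) document mixed with distractors, and a chain-of-thought answer grounded in the oracle. The field names, prompt format, and the 0.8 oracle-inclusion probability are illustrative, not the official RAFT recipe.

```python
import json
import random

def build_raft_example(question, oracle_doc, distractor_docs, cot_answer,
                       p_include_oracle=0.8):
    """Assemble one RAFT-style training record (illustrative format).

    With probability p_include_oracle the oracle document appears in the
    context; otherwise only distractors are shown, pushing the model to rely
    on reading the context rather than memorizing answers.
    """
    context_docs = list(distractor_docs)
    if random.random() < p_include_oracle:
        context_docs.append(oracle_doc)
    random.shuffle(context_docs)

    context = "\n\n".join(f"<doc>{doc}</doc>" for doc in context_docs)
    prompt = f"{context}\n\nQuestion: {question}\nAnswer with reasoning:"
    return {"prompt": prompt, "completion": cot_answer}

example = build_raft_example(
    question="Which theorem shows arithmetic cannot prove its own consistency?",
    oracle_doc="The Second Incompleteness Theorem shows that the consistency of arithmetic cannot be proved in arithmetic itself.",
    distractor_docs=["New Delhi is the capital of India.",
                     "HNSW is a graph-based approximate nearest neighbor algorithm."],
    cot_answer="The relevant document states that the consistency of arithmetic cannot be proved within arithmetic; this is the Second Incompleteness Theorem.",
)
print(json.dumps(example, indent=2))
```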
Conclusion
Improving retrieval and generation in RAG systems is essential for better AI applications. The techniques discussed range from low-effort, high-impact methods (query rewriting, reranking) to more intensive processes (embedding and LLM fine-tuning). The best technique depends on your application’s specific needs and limits. Advanced RAG techniques, when applied thoughtfully, allow developers to build more accurate, reliable, and context-aware AI systems capable of handling complex information needs.