Home >Backend Development >Python Tutorial >How to Create Your Own RAG with Free LLM Models and a Knowledge Base
This article explores the implementation of a straightforward yet effective question-answering system that combines modern transformer-based models. The system uses T5 (Text-to-Text Transfer Transformer) for answer generation and Sentence Transformers for semantic similarity matching.
In my previous article, I explained how to create a simple translation API with a web interface using a free foundational LLM model. This time, let’s dive into building a Retrieval-Augmented Generation (RAG) system using free transformer-based LLM models and a knowledge base.
RAG (Retrieval-Augmented Generation) is a technique that combines two key components:
Retrieval: First, it searches through a knowledge base (like documents, databases, etc.) to find relevant information for a given query. This usually involves:
Generation: Then it uses a language model (like T5 in our code) to generate a response by:
Combining the retrieved information with the original question
Creating a natural language response based on this context
In the code:
Benefits of RAG:
The implementation consists of a SimpleQASystem class that orchestrates two main components:
You can download the latest version of the source code here: https://github.com/alexander-uspenskiy/rag_project
This guide will help you set up your Retrieval-Augmented Generation (RAG) project on both macOS and Windows.
For macOS:
Install Homebrew (if not already installed):
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Install Python 3.8 using Homebrew
brew install python@3.10
For Windows:
Download and install Python 3.8 from python.org
Make sure to check “Add Python to PATH” during installation
macOS:
mkdir RAG_project
cd RAG_project
Windows:
mkdir RAG_project
cd RAG_project
Step 2: Set Up Virtual Environment
macOS:
python3 -m venv venv
source venv/bin/activate
Windows:
python -m venv venv
venvScriptsactivate
**Core Components
def __init__(self): self.model_name = 't5-small' self.tokenizer = T5Tokenizer.from_pretrained(self.model_name) self.model = T5ForConditionalGeneration.from_pretrained(self.model_name) self.encoder = SentenceTransformer('paraphrase-MiniLM-L6-v2')
The system initializes with two primary models:
T5-small: A smaller version of the T5 model for generating answers
paraphrase-MiniLM-L6-v2: A sentence transformer model for encoding text into meaningful vectors
2. Dataset Preparation
def prepare_dataset(self, data: List[Dict[str, str]]): self.answers = [item['answer'] for item in data] self.answer_embeddings = [] for answer in self.answers: embedding = self.encoder.encode(answer, convert_to_tensor=True) self.answer_embeddings.append(embedding)
The dataset preparation phase:
1. Question Processing
When a user submits a question, the system follows these steps:
Embedding Generation: The question is converted into a vector representation using the same sentence transformer model used for the answers.
Semantic Search: The system finds the most relevant stored answer by:
2. Answer Generation
def get_answer(self, question: str) -> str: # ... semantic search logic ... input_text = f"Given the context, what is the answer to the question: {question} Context: {context}" input_ids = self.tokenizer(input_text, max_length=512, truncation=True, padding='max_length', return_tensors='pt').input_ids outputs = self.model.generate(input_ids, max_length=50, num_beams=4, early_stopping=True, no_repeat_ngram_size=2
The answer generation process:
3. Answer Cleaning
def __init__(self): self.model_name = 't5-small' self.tokenizer = T5Tokenizer.from_pretrained(self.model_name) self.model = T5ForConditionalGeneration.from_pretrained(self.model_name) self.encoder = SentenceTransformer('paraphrase-MiniLM-L6-v2')
You can download the latest version of source code here: https://github.com/alexander-uspenskiy/rag_project
def prepare_dataset(self, data: List[Dict[str, str]]): self.answers = [item['answer'] for item in data] self.answer_embeddings = [] for answer in self.answers: embedding = self.encoder.encode(answer, convert_to_tensor=True) self.answer_embeddings.append(embedding)
The system explicitly uses CPU to avoid memory issues
Embeddings are converted to CPU tensors when needed
Input length is limited to 512 tokens
Usage Example
def get_answer(self, question: str) -> str: # ... semantic search logic ... input_text = f"Given the context, what is the answer to the question: {question} Context: {context}" input_ids = self.tokenizer(input_text, max_length=512, truncation=True, padding='max_length', return_tensors='pt').input_ids outputs = self.model.generate(input_ids, max_length=50, num_beams=4, early_stopping=True, no_repeat_ngram_size=2
Run in terminal
Scalability:
The current implementation keeps all embeddings in memory
Could be improved with vector databases for large-scale applications
Answer Quality:
Relies heavily on the quality of the provided answer dataset
Limited by the context window of T5-small
Could benefit from answer validation or confidence scoring
Performance:
This implementation provides a solid foundation for a question-answering system, combining the strengths of semantic search and transformer-based text generation. Feel free to play with model parameters (like max_length, num_beams, early_stopping, no_repeat_ngram_size, etc) to find a better way to get more coherent and stable answers. While there’s room for improvement, the current implementation offers a good balance between complexity and functionality, making it suitable for educational purposes and small to medium-scale applications.
Happy coding!
The above is the detailed content of How to Create Your Own RAG with Free LLM Models and a Knowledge Base. For more information, please follow other related articles on the PHP Chinese website!