


This article explores the implementation of a straightforward yet effective question-answering system that combines modern transformer-based models. The system uses T5 (Text-to-Text Transfer Transformer) for answer generation and Sentence Transformers for semantic similarity matching.
In my previous article, I explained how to create a simple translation API with a web interface using a free foundational LLM model. This time, let’s dive into building a Retrieval-Augmented Generation (RAG) system using free transformer-based LLM models and a knowledge base.
RAG (Retrieval-Augmented Generation) is a technique that combines two key components:
Retrieval: First, it searches through a knowledge base (like documents, databases, etc.) to find relevant information for a given query. This usually involves:
- Converting text into embeddings (numerical vectors that represent meaning)
- Finding similar content using similarity measures (like cosine similarity)
- Selecting the most relevant pieces of information
Generation: Then it uses a language model (like T5 in our code) to generate a response by:
Combining the retrieved information with the original question
Creating a natural language response based on this context
In the code:
- The SentenceTransformer handles the retrieval part by creating embeddings
- The T5 model handles the generation part by creating answers
Benefits of RAG:
- More accurate responses since they’re grounded in specific knowledge
- Reduced hallucination compared to pure LLM responses
- Ability to access up-to-date or domain-specific information
- More controllable and transparent than pure generation
System Architecture Overview
The implementation consists of a SimpleQASystem class that orchestrates two main components:
- A semantic search system using Sentence Transformers
- An answer generation system using T5
You can download the latest version of the source code here: https://github.com/alexander-uspenskiy/rag_project
System Diagram
RAG Project Setup Guide
This guide will help you set up your Retrieval-Augmented Generation (RAG) project on both macOS and Windows.
Prerequisites
For macOS:
Install Homebrew (if not already installed):
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Install Python 3.8 using Homebrew
brew install python@3.10
For Windows:
Download and install Python 3.8 from python.org
Make sure to check “Add Python to PATH” during installation
Project Setup
Step 1: Create Project Directory
macOS:
mkdir RAG_project
cd RAG_project
Windows:
mkdir RAG_project
cd RAG_project
Step 2: Set Up Virtual Environment
macOS:
python3 -m venv venv
source venv/bin/activate
Windows:
python -m venv venv
venvScriptsactivate
**Core Components
- Initialization**
def __init__(self): self.model_name = 't5-small' self.tokenizer = T5Tokenizer.from_pretrained(self.model_name) self.model = T5ForConditionalGeneration.from_pretrained(self.model_name) self.encoder = SentenceTransformer('paraphrase-MiniLM-L6-v2')
The system initializes with two primary models:
T5-small: A smaller version of the T5 model for generating answers
paraphrase-MiniLM-L6-v2: A sentence transformer model for encoding text into meaningful vectors
2. Dataset Preparation
def prepare_dataset(self, data: List[Dict[str, str]]): self.answers = [item['answer'] for item in data] self.answer_embeddings = [] for answer in self.answers: embedding = self.encoder.encode(answer, convert_to_tensor=True) self.answer_embeddings.append(embedding)
The dataset preparation phase:
- Extracts answers from the input data
- Creates embeddings for each answer using the sentence transformer
- Stores both answers and their embeddings for quick retrieval
How the System Works
1. Question Processing
When a user submits a question, the system follows these steps:
Embedding Generation: The question is converted into a vector representation using the same sentence transformer model used for the answers.
Semantic Search: The system finds the most relevant stored answer by:
- Computing cosine similarity between the question embedding and all answer embeddings
- Selecting the answer with the highest similarity score Context Formation: The selected answer becomes the context for T5 to generate a final response.
2. Answer Generation
def get_answer(self, question: str) -> str: # ... semantic search logic ... input_text = f"Given the context, what is the answer to the question: {question} Context: {context}" input_ids = self.tokenizer(input_text, max_length=512, truncation=True, padding='max_length', return_tensors='pt').input_ids outputs = self.model.generate(input_ids, max_length=50, num_beams=4, early_stopping=True, no_repeat_ngram_size=2
The answer generation process:
- Combines the question and context into a prompt for T5
- Tokenizes the input text with a maximum length of 512 tokens
- Generates an answer using beam search with these parameters:
- max_length=50: Limits answer length
- num_beams=4: Uses beam search with 4 beams
- early_stopping=True: Stops generation when all beams reach an end token
- no_repeat_ngram_size=2: Prevents repetition of bigrams
3. Answer Cleaning
def __init__(self): self.model_name = 't5-small' self.tokenizer = T5Tokenizer.from_pretrained(self.model_name) self.model = T5ForConditionalGeneration.from_pretrained(self.model_name) self.encoder = SentenceTransformer('paraphrase-MiniLM-L6-v2')
- Removes duplicate consecutive words (case-insensitive)
- Capitalizes the first letter of the answer
- Removes extra whitespace
Full Source Code
You can download the latest version of source code here: https://github.com/alexander-uspenskiy/rag_project
def prepare_dataset(self, data: List[Dict[str, str]]): self.answers = [item['answer'] for item in data] self.answer_embeddings = [] for answer in self.answers: embedding = self.encoder.encode(answer, convert_to_tensor=True) self.answer_embeddings.append(embedding)
Memory Management:
The system explicitly uses CPU to avoid memory issues
Embeddings are converted to CPU tensors when needed
Input length is limited to 512 tokens
Error Handling:
- Comprehensive try-except blocks throughout the code
- Meaningful error messages for debugging
- Validation checks for uninitialized components
Usage Example
def get_answer(self, question: str) -> str: # ... semantic search logic ... input_text = f"Given the context, what is the answer to the question: {question} Context: {context}" input_ids = self.tokenizer(input_text, max_length=512, truncation=True, padding='max_length', return_tensors='pt').input_ids outputs = self.model.generate(input_ids, max_length=50, num_beams=4, early_stopping=True, no_repeat_ngram_size=2
Run in terminal
Limitations and Potential Improvements
Scalability:
The current implementation keeps all embeddings in memory
Could be improved with vector databases for large-scale applications
Answer Quality:
Relies heavily on the quality of the provided answer dataset
Limited by the context window of T5-small
Could benefit from answer validation or confidence scoring
Performance:
- Using CPU only might be slower for large-scale applications
- Could be optimized with batch processing
- Could implement caching for frequently asked questions
Conclusion
This implementation provides a solid foundation for a question-answering system, combining the strengths of semantic search and transformer-based text generation. Feel free to play with model parameters (like max_length, num_beams, early_stopping, no_repeat_ngram_size, etc) to find a better way to get more coherent and stable answers. While there’s room for improvement, the current implementation offers a good balance between complexity and functionality, making it suitable for educational purposes and small to medium-scale applications.
Happy coding!
The above is the detailed content of How to Create Your Own RAG with Free LLM Models and a Knowledge Base. For more information, please follow other related articles on the PHP Chinese website!

Python is easier to learn and use, while C is more powerful but complex. 1. Python syntax is concise and suitable for beginners. Dynamic typing and automatic memory management make it easy to use, but may cause runtime errors. 2.C provides low-level control and advanced features, suitable for high-performance applications, but has a high learning threshold and requires manual memory and type safety management.

Python and C have significant differences in memory management and control. 1. Python uses automatic memory management, based on reference counting and garbage collection, simplifying the work of programmers. 2.C requires manual management of memory, providing more control but increasing complexity and error risk. Which language to choose should be based on project requirements and team technology stack.

Python's applications in scientific computing include data analysis, machine learning, numerical simulation and visualization. 1.Numpy provides efficient multi-dimensional arrays and mathematical functions. 2. SciPy extends Numpy functionality and provides optimization and linear algebra tools. 3. Pandas is used for data processing and analysis. 4.Matplotlib is used to generate various graphs and visual results.

Whether to choose Python or C depends on project requirements: 1) Python is suitable for rapid development, data science, and scripting because of its concise syntax and rich libraries; 2) C is suitable for scenarios that require high performance and underlying control, such as system programming and game development, because of its compilation and manual memory management.

Python is widely used in data science and machine learning, mainly relying on its simplicity and a powerful library ecosystem. 1) Pandas is used for data processing and analysis, 2) Numpy provides efficient numerical calculations, and 3) Scikit-learn is used for machine learning model construction and optimization, these libraries make Python an ideal tool for data science and machine learning.

Is it enough to learn Python for two hours a day? It depends on your goals and learning methods. 1) Develop a clear learning plan, 2) Select appropriate learning resources and methods, 3) Practice and review and consolidate hands-on practice and review and consolidate, and you can gradually master the basic knowledge and advanced functions of Python during this period.

Key applications of Python in web development include the use of Django and Flask frameworks, API development, data analysis and visualization, machine learning and AI, and performance optimization. 1. Django and Flask framework: Django is suitable for rapid development of complex applications, and Flask is suitable for small or highly customized projects. 2. API development: Use Flask or DjangoRESTFramework to build RESTfulAPI. 3. Data analysis and visualization: Use Python to process data and display it through the web interface. 4. Machine Learning and AI: Python is used to build intelligent web applications. 5. Performance optimization: optimized through asynchronous programming, caching and code

Python is better than C in development efficiency, but C is higher in execution performance. 1. Python's concise syntax and rich libraries improve development efficiency. 2.C's compilation-type characteristics and hardware control improve execution performance. When making a choice, you need to weigh the development speed and execution efficiency based on project needs.


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

Video Face Swap
Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Hot Tools

MinGW - Minimalist GNU for Windows
This project is in the process of being migrated to osdn.net/projects/mingw, you can continue to follow us there. MinGW: A native Windows port of the GNU Compiler Collection (GCC), freely distributable import libraries and header files for building native Windows applications; includes extensions to the MSVC runtime to support C99 functionality. All MinGW software can run on 64-bit Windows platforms.

Dreamweaver Mac version
Visual web development tools

Safe Exam Browser
Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.

SublimeText3 Mac version
God-level code editing software (SublimeText3)

ZendStudio 13.5.1 Mac
Powerful PHP integrated development environment