Introduction
In the second part of our series on building a RAG application on a Raspberry Pi, we’ll expand on the foundation we laid in the first part, where we created and tested the core pipeline. Now we’re going to take things a step further by building a FastAPI application to serve our RAG pipeline and creating a Reflex app to give users a simple, interactive way to access it. This part will guide you through setting up the FastAPI back-end, designing the front-end with Reflex, and getting everything up and running on your Raspberry Pi. By the end, you’ll have a complete, working application that’s ready for real-world use.
Learning Objectives
- Set up a FastAPI back-end to integrate with the existing RAG pipeline and process queries efficiently.
- Design a user-friendly interface using Reflex to interact with the FastAPI back-end and the RAG pipeline.
- Create and test API endpoints for querying and document ingestion, ensuring smooth operation with FastAPI.
- Deploy and test the complete application on a Raspberry Pi, ensuring both back-end and front-end components function seamlessly.
- Understand the integration between FastAPI and Reflex for a cohesive RAG application experience.
- Implement and troubleshoot FastAPI and Reflex components to provide a fully operational RAG application on a Raspberry Pi.
If you missed the previous edition, be sure to check it out here: Self-Hosting RAG Applications on Edge Devices with Langchain and Ollama – Part I.
Table of contents
- Creating Python Environment
- Developing the Back-End with FastAPI
- Designing the Front-End with Reflex
- Testing and Deployment
- Frequently Asked Questions
This article was published as a part of the Data Science Blogathon.
Creating Python Environment
Before we start creating the application, we need to set up the environment. Create a virtual environment and install the dependencies below:
```
deeplake
boto3==1.34.144
botocore==1.34.144
fastapi==0.110.3
gunicorn==22.0.0
httpx==0.27.0
huggingface-hub==0.23.4
langchain==0.2.6
langchain-community==0.2.6
langchain-core==0.2.11
langchain-experimental==0.0.62
langchain-text-splitters==0.2.2
langsmith==0.1.83
marshmallow==3.21.3
numpy==1.26.4
pandas==2.2.2
pydantic==2.8.2
pydantic_core==2.20.1
PyMuPDF==1.24.7
PyMuPDFb==1.24.6
python-dotenv==1.0.1
pytz==2024.1
PyYAML==6.0.1
reflex==0.5.6
requests==2.32.3
reflex-hosting-cli==0.1.13
```
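One way to set this up is shown below as a quick sketch. It assumes you save the list above as a requirements.txt file; the environment name rag-env is arbitrary.
python -m venv rag-env
source rag-env/bin/activate
pip install -r requirements.txt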
Once the required packages are installed, we need to have the required models present on the device. We will do this using Ollama. Follow the steps from Part 1 of this article to download both the language and embedding models. Finally, create two directories, one for the back-end and one for the front-end application.
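For reference, the two models used by the configuration later in this article (phi3 for generation and nomic-embed-text for embeddings) can be pulled with Ollama as follows, assuming Ollama is already installed as described in Part 1:
ollama pull phi3
ollama pull nomic-embed-text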
Once the models are pulled using Ollama, we are ready to build the final application.
Developing the Back-End with FastAPI
In Part 1 of this article, we built the RAG pipeline with both the Ingestion and QnA modules and tested it on a few documents to confirm everything worked as expected. Now we need to wrap the pipeline with FastAPI to create a consumable API. This will let us integrate it with any front-end application, such as Streamlit, Chainlit, Gradio, Reflex, React, or Angular. Let’s start by laying out a structure for the application. Following this structure is completely optional, but if you use a different one, make sure to adjust the imports accordingly.
Below is the tree structure we will follow:
```
backend
├── app.py
├── requirements.txt
└── src
    ├── config.py
    ├── doc_loader
    │   ├── base_loader.py
    │   ├── __init__.py
    │   └── pdf_loader.py
    ├── ingestion.py
    ├── __init__.py
    └── qna.py
```
Let’s start with the config.py. This file will contain all the configurable options for the application, like the Ollama URL, LLM name and the embeddings model name. Below is an example:
```python
LANGUAGE_MODEL_NAME = "phi3"
EMBEDDINGS_MODEL_NAME = "nomic-embed-text"
OLLAMA_URL = "http://localhost:11434"
```
The base_loader.py file contains the parent document loader class that child document loaders will inherit. Since this application works only with PDF files, a child PDFLoader class will be created that inherits from the BaseLoader class.
Below are the contents of base_loader.py and pdf_loader.py:
```python
# base_loader.py
from abc import ABC, abstractmethod


class BaseLoader(ABC):
    def __init__(self, file_path: str) -> None:
        self.file_path = file_path

    @abstractmethod
    async def load_document(self):
        pass
```

```python
# pdf_loader.py
import os

from .base_loader import BaseLoader
from langchain.schema import Document
from langchain.document_loaders.pdf import PyMuPDFLoader
from langchain.text_splitter import CharacterTextSplitter


class PDFLoader(BaseLoader):
    def __init__(self, file_path: str) -> None:
        super().__init__(file_path)

    async def load_document(self):
        self.file_name = os.path.basename(self.file_path)
        loader = PyMuPDFLoader(file_path=self.file_path)

        text_splitter = CharacterTextSplitter(
            separator="\n",
            chunk_size=1000,
            chunk_overlap=200,
        )

        pages = await loader.aload()
        total_pages = len(pages)
        chunks = []
        for idx, page in enumerate(pages):
            chunks.append(
                Document(
                    page_content=page.page_content,
                    metadata=dict(
                        {
                            "file_name": self.file_name,
                            "page_no": str(idx + 1),
                            "total_pages": str(total_pages),
                        }
                    ),
                )
            )

        final_chunks = text_splitter.split_documents(chunks)
        return final_chunks
```
We discussed how the PDF loader works in Part 1 of this article.
Next, let’s build the Ingestion class. It is the same as the one we built in Part 1.
Code for Ingestion Class
```python
# ingestion.py
import os

from langchain.vectorstores.deeplake import DeepLake
from langchain.embeddings.ollama import OllamaEmbeddings

from . import config as cfg  # relative import so config.py resolves when src is imported as a package
from .doc_loader import PDFLoader


class Ingestion:
    """Document Ingestion pipeline."""

    def __init__(self):
        try:
            self.embeddings = OllamaEmbeddings(
                model=cfg.EMBEDDINGS_MODEL_NAME,
                base_url=cfg.OLLAMA_URL,
                show_progress=True,
            )
            self.vector_store = DeepLake(
                dataset_path="data/text_vectorstore",
                embedding=self.embeddings,
                num_workers=4,
                verbose=False,
            )
        except Exception as e:
            raise RuntimeError(f"Failed to initialize Ingestion system. ERROR: {e}")

    async def create_and_add_embeddings(
        self,
        file: str,
    ):
        try:
            loader = PDFLoader(
                file_path=file,
            )

            chunks = await loader.load_document()
            # aadd_documents returns the ids of the added documents; len() gives the count.
            size = await self.vector_store.aadd_documents(documents=chunks)
            return len(size)
        except (ValueError, RuntimeError, KeyError, TypeError) as e:
            raise Exception(f"ERROR: {e}")
```
Now that we have set up the Ingestion class, let’s move on to the QnA class. This too is the same as the one we created in Part 1 of this article.
Code for QnA Class
```python
# qna.py
import os

from langchain.vectorstores.deeplake import DeepLake
from langchain.embeddings.ollama import OllamaEmbeddings
from langchain_community.llms.ollama import Ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

from . import config as cfg  # relative import so config.py resolves when src is imported as a package
from .doc_loader import PDFLoader


class QnA:
    """Document QnA pipeline."""

    def __init__(self):
        try:
            self.embeddings = OllamaEmbeddings(
                model=cfg.EMBEDDINGS_MODEL_NAME,
                base_url=cfg.OLLAMA_URL,
                show_progress=True,
            )
            self.model = Ollama(
                model=cfg.LANGUAGE_MODEL_NAME,
                base_url=cfg.OLLAMA_URL,
                verbose=True,
                temperature=0.2,
            )
            self.vector_store = DeepLake(
                dataset_path="data/text_vectorstore",
                embedding=self.embeddings,
                num_workers=4,
                verbose=False,
            )
            self.retriever = self.vector_store.as_retriever(
                search_type="similarity",
                search_kwargs={
                    "k": 10,
                },
            )
        except Exception as e:
            raise RuntimeError(f"Failed to initialize QnA system. ERROR: {e}")

    def create_rag_chain(self):
        try:
            # Replace <instructions> with your own system instructions;
            # keep the {context} placeholder so retrieved chunks are injected.
            system_prompt = """<instructions>

            Context: {context}
            """
            prompt = ChatPromptTemplate.from_messages(
                [
                    ("system", system_prompt),
                    ("human", "{input}"),
                ]
            )
            question_answer_chain = create_stuff_documents_chain(self.model, prompt)
            rag_chain = create_retrieval_chain(self.retriever, question_answer_chain)

            return rag_chain
        except Exception as e:
            raise RuntimeError(f"Failed to create retrieval chain. ERROR: {e}")
```
With this, we have finished building the core functionality of the RAG app. Now let’s wrap it with FastAPI.
Code for the FastAPI Application
```python
# app.py
import os

import uvicorn
from fastapi import FastAPI, Request, File, UploadFile
from fastapi.responses import StreamingResponse

from src import QnA, Ingestion

app = FastAPI()

ingestion = Ingestion()
chatbot = QnA()
rag_chain = chatbot.create_rag_chain()


@app.get("/")
def hello():
    return {"message": "API running on port 8089"}


@app.post("/query")
async def ask_query(request: Request):
    data = await request.json()
    question = data.get("question")

    async def event_generator():
        # Stream only the "answer" key of the chain output, chunk by chunk.
        for chunk in rag_chain.pick("answer").stream({"input": question}):
            yield chunk

    return StreamingResponse(event_generator(), media_type="text/plain")


@app.post("/ingest")
async def ingest_document(file: UploadFile = File(...)):
    try:
        os.makedirs("files", exist_ok=True)
        file_location = f"files/{file.filename}"
        with open(file_location, "wb") as file_object:
            file_object.write(file.file.read())
        size = await ingestion.create_and_add_embeddings(file=file_location)
        return {"message": f"File ingested! Document count: {size}"}
    except Exception as e:
        return {"message": f"An error occurred: {e}"}


if __name__ == "__main__":
    try:
        uvicorn.run(app, host="0.0.0.0", port=8089)
    except KeyboardInterrupt:
        print("App stopped!")
```
Let’s break down the app endpoint by endpoint:
- First we initialize the FastAPI app, the Ingestion and the QnA objects. We then create a RAG chain using the create_rag_chain method of QnA class.
- Our first endpoint is a simple GET method. This will help us know whether the app is healthy or not. Think of it like a ‘Hello World’ endpoint.
- The second is the query endpoint. This is a POST method used to run the chain. It takes a request parameter, from which we extract the user’s query. We then create an asynchronous generator that wraps the chain’s stream call. This lets FastAPI stream the LLM’s output and gives a ChatGPT-like experience in the chat interface. Finally, we wrap the generator with the StreamingResponse class and return it.
- The third endpoint is the ingestion endpoint. It is also a POST method and takes the uploaded file as input. We store the file in a local directory and then ingest it using the create_and_add_embeddings method of the Ingestion class.
Finally, we run the app with the uvicorn package, specifying the host and port. To test the app, run it using the following command:
python app.py
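During development you could instead start the same app with uvicorn’s own CLI, which adds auto-reload (an optional alternative; it assumes uvicorn is available on your PATH):
uvicorn app:app --host 0.0.0.0 --port 8089 --reload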
Use an API testing tool like Postman, Insomnia, or Bruno to test the application. You can also use the Thunder Client extension in VS Code to do the same.
Testing the Ingestion endpoint:
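If you prefer testing from a script, here is a minimal sketch using the requests library. The file name sample.pdf and the localhost URL are assumptions; adjust them to your setup.

```python
# A hypothetical test script for the /ingest endpoint.
import requests

# Assumes the FastAPI back-end is running locally on port 8089
# and that sample.pdf exists in the current directory.
with open("sample.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8089/ingest",
        files={"file": ("sample.pdf", f, "application/pdf")},
    )

print(response.status_code)
print(response.json())  # e.g. {"message": "File ingested! Document count: ..."}
```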
Testing the query endpoint:
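Similarly, a minimal sketch for the streaming /query endpoint (the question text is just an example):

```python
# A hypothetical test script for the /query endpoint.
import requests

response = requests.post(
    "http://localhost:8089/query",
    json={"question": "What is this document about?"},
    stream=True,
)

# The endpoint returns a plain-text StreamingResponse, so print chunks as they arrive.
for chunk in response.iter_content(chunk_size=512):
    if chunk:
        print(chunk.decode(), end="", flush=True)
print()
```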
Designing the Front-End with Reflex
We have successfully created a FastAPI app for the back-end of our RAG application. It’s time to build the front-end. You can choose any front-end library for this, but for this article we will build the front-end using Reflex. Reflex is a Python-only front-end library for building web applications purely in Python. It provides templates for common applications such as a calculator, an image generator, and a chatbot. We will use the chatbot template as the starting point for our user interface. Our final app will have the following structure, shown here for reference.
Frontend Directory
We will have a frontend directory for this:
```
frontend
├── assets
│   └── favicon.ico
├── docs
│   └── demo.gif
├── chat
│   ├── components
│   │   ├── chat.py
│   │   ├── file_upload.py
│   │   ├── __init__.py
│   │   ├── loading_icon.py
│   │   ├── modal.py
│   │   └── navbar.py
│   ├── __init__.py
│   ├── chat.py
│   └── state.py
├── requirements.txt
├── rxconfig.py
└── uploaded_files
```
Steps for Final App
Follow these steps to prepare the groundwork for the final app.
Step 1: Clone the chat template repository into the frontend directory
git clone https://github.com/reflex-dev/reflex-chat.git .
Step 2: Run the following command to initialize the directory as a Reflex app
reflex init
This will set up the Reflex app, and it will be ready to run and develop.
Step 3: To test the app, use the following command from inside the frontend directory
reflex run
Let’s start modifying the components. First, let’s modify the chat.py file inside the components directory.
Below is the code for the same:
```python
# components/chat.py
import reflex as rx

from reflex_demo.components import loading_icon
from reflex_demo.state import QA, State

message_style = dict(
    display="inline-block",
    padding="0 10px",
    border_radius="8px",
    max_width=["30em", "30em", "50em", "50em", "50em", "50em"],
)


def message(qa: QA) -> rx.Component:
    """A single question/answer message.

    Args:
        qa: The question/answer pair.

    Returns:
        A component displaying the question/answer pair.
    """
    return rx.box(
        rx.box(
            rx.markdown(
                qa.question,
                background_color=rx.color("mauve", 4),
                color=rx.color("mauve", 12),
                **message_style,
            ),
            text_align="right",
            margin_top="1em",
        ),
        rx.box(
            rx.markdown(
                qa.answer,
                background_color=rx.color("accent", 4),
                color=rx.color("accent", 12),
                **message_style,
            ),
            text_align="left",
            padding_top="1em",
        ),
        width="100%",
    )


def chat() -> rx.Component:
    """List all the messages in a single conversation."""
    return rx.vstack(
        rx.box(rx.foreach(State.chats[State.current_chat], message), width="100%"),
        py="8",
        flex="1",
        width="100%",
        max_width="50em",
        padding_x="4px",
        align_self="center",
        overflow="hidden",
        padding_bottom="5em",
    )


def action_bar() -> rx.Component:
    """The action bar to send a new message."""
    return rx.center(
        rx.vstack(
            rx.chakra.form(
                rx.chakra.form_control(
                    rx.hstack(
                        rx.input(
                            rx.input.slot(
                                rx.tooltip(
                                    rx.icon("info", size=18),
                                    content="Enter a question to get a response.",
                                )
                            ),
                            placeholder="Type something...",
                            id="question",  # must match the key read in State.process_question
                            width=["15em", "20em", "45em", "50em", "50em", "50em"],
                        ),
                        rx.button(
                            rx.cond(
                                State.processing,
                                loading_icon(height="1em"),
                                rx.text("Send", font_family="Ubuntu"),
                            ),
                            type="submit",
                        ),
                        align_items="center",
                    ),
                    is_disabled=State.processing,
                ),
                on_submit=State.process_question,
                reset_on_submit=True,
            ),
            rx.text(
                "ReflexGPT may return factually incorrect or misleading responses. Use discretion.",
                text_align="center",
                font_size=".75em",
                color=rx.color("mauve", 10),
                font_family="Ubuntu",
            ),
            rx.logo(margin_top="-1em", margin_bottom="-1em"),
            align_items="center",
        ),
        position="sticky",
        bottom="0",
        left="0",
        padding_y="16px",
        backdrop_filter="auto",
        backdrop_blur="lg",
        border_top=f"1px solid {rx.color('mauve', 3)}",
        background_color=rx.color("mauve", 2),
        align_items="stretch",
        width="100%",
    )
```
The changes from the version that ships with the template are minimal.
Next, we will edit the main chat.py module. This is the main chat component that defines the app and its pages.
Code for Main Chat Component
Below is the code for it:
```python
# chat.py
import reflex as rx

from reflex_demo.components import chat, navbar, upload_form
from reflex_demo.state import State


@rx.page(route="/chat", title="RAG Chatbot")
def chat_interface() -> rx.Component:
    return rx.chakra.vstack(
        navbar(),
        chat.chat(),
        chat.action_bar(),
        background_color=rx.color("mauve", 1),
        color=rx.color("mauve", 12),
        min_height="100vh",
        align_items="stretch",
        spacing="0",
    )


@rx.page(route="/", title="RAG Chatbot")
def index() -> rx.Component:
    return rx.chakra.vstack(
        navbar(),
        upload_form(),
        background_color=rx.color("mauve", 1),
        color=rx.color("mauve", 12),
        min_height="100vh",
        align_items="stretch",
        spacing="0",
    )


# Add state and page to the app.
app = rx.App(
    theme=rx.theme(
        appearance="dark",
        accent_color="jade",
    ),
    stylesheets=["https://fonts.googleapis.com/css2?family=Ubuntu&display=swap"],
    style={
        "font_family": "Ubuntu",
    },
)
app.add_page(index)
app.add_page(chat_interface)
```
This is the code for the chat interface. We have only added the font family to the app config; the rest of the code is the same as the template.
Next, let’s edit the state.py file. This is where the front-end makes calls to the API endpoints and handles the responses.
Editing state.py File
```python
# state.py
import requests
import reflex as rx


class QA(rx.Base):
    question: str
    answer: str


DEFAULT_CHATS = {
    "Intros": [],
}


class State(rx.State):
    chats: dict[str, list[QA]] = DEFAULT_CHATS
    current_chat = "Intros"
    url: str = "http://localhost:8089/query"
    question: str
    processing: bool = False
    new_chat_name: str = ""

    def create_chat(self):
        """Create a new chat."""
        # Add the new chat to the list of chats.
        self.current_chat = self.new_chat_name
        self.chats[self.new_chat_name] = []

    def delete_chat(self):
        """Delete the current chat."""
        del self.chats[self.current_chat]
        if len(self.chats) == 0:
            self.chats = DEFAULT_CHATS
        self.current_chat = list(self.chats.keys())[0]

    def set_chat(self, chat_name: str):
        """Set the name of the current chat.

        Args:
            chat_name: The name of the chat.
        """
        self.current_chat = chat_name

    @rx.var
    def chat_titles(self) -> list[str]:
        """Get the list of chat titles.

        Returns:
            The list of chat names.
        """
        return list(self.chats.keys())

    async def process_question(self, form_data: dict[str, str]):
        # Get the question from the form
        question = form_data["question"]

        # Check if the question is empty
        if question == "":
            return

        model = self.openai_process_question

        async for value in model(question):
            yield value

    async def openai_process_question(self, question: str):
        """Get the response from the API.

        Args:
            question: The current question.
        """
        # Add the question to the list of questions.
        qa = QA(question=question, answer="")
        self.chats[self.current_chat].append(qa)
        payload = {"question": question}

        # Clear the input and start the processing.
        self.processing = True
        yield

        response = requests.post(self.url, json=payload, stream=True)

        # Stream the results, appending each chunk to the last answer.
        for answer_text in response.iter_content(chunk_size=512):
            # Ensure answer_text is not None before concatenation
            answer_text = answer_text.decode()
            if answer_text is not None:
                self.chats[self.current_chat][-1].answer += answer_text
            else:
                answer_text = ""
                self.chats[self.current_chat][-1].answer += answer_text
            self.chats = self.chats
            yield

        # Toggle the processing flag.
        self.processing = False
```
In this file, we have defined the URL of the query endpoint. We have also modified the openai_process_question method to send a POST request to the query endpoint and stream the response, which is displayed in the chat interface.
Writing Contents of the file_upload.py File
Finally, let’s write the contents of the file_upload.py file. This component is displayed first and allows us to upload a file for ingestion.
```python
# file_upload.py
import os

import reflex as rx
import requests


class UploadExample(rx.State):
    uploading: bool = False
    ingesting: bool = False
    progress: int = 0
    total_bytes: int = 0
    ingestion_url = "http://127.0.0.1:8089/ingest"

    async def handle_upload(self, files: list[rx.UploadFile]):
        self.ingesting = True
        yield
        for file in files:
            file_bytes = await file.read()
            file_name = file.filename
            files = {
                "file": (os.path.basename(file_name), file_bytes, "multipart/form-data")
            }
            response = requests.post(self.ingestion_url, files=files)
            self.ingesting = False
            yield
            if response.status_code == 200:
                # yield rx.redirect("/chat")
                # show_redirect_popup is not defined in this snippet;
                # replace it with your own notification or redirect logic.
                self.show_redirect_popup()

    def handle_upload_progress(self, progress: dict):
        self.uploading = True
        self.progress = round(progress["progress"] * 100)
        if self.progress >= 100:
            self.uploading = False

    def cancel_upload(self):
        self.uploading = False
        return rx.cancel_upload("upload3")


def upload_form():
    return rx.vstack(
        rx.upload(
            rx.flex(
                rx.text(
                    "Drag and drop file here or click to select file",
                    font_family="Ubuntu",
                ),
                rx.icon("upload", size=30),
                direction="column",
                align="center",
            ),
            id="upload3",  # referenced by rx.selected_files and rx.cancel_upload
            border="1px solid rgb(233, 233,233, 0.4)",
            margin="5em 0 10px 0",
            background_color="rgb(107,99,246)",
            border_radius="8px",
            padding="1em",
        ),
        rx.vstack(rx.foreach(rx.selected_files("upload3"), rx.text)),
        rx.cond(
            ~UploadExample.ingesting,
            rx.button(
                "Upload",
                on_click=UploadExample.handle_upload(
                    rx.upload_files(
                        upload_id="upload3",
                        on_upload_progress=UploadExample.handle_upload_progress,
                    ),
                ),
            ),
            rx.flex(
                rx.spinner(size="3", loading=UploadExample.ingesting),
                rx.button(
                    "Cancel",
                    on_click=UploadExample.cancel_upload,
                ),
                align="center",
                spacing="3",
            ),
        ),
        rx.alert_dialog.root(
            rx.alert_dialog.trigger(
                rx.button("Continue to Chat", color_scheme="green"),
            ),
            rx.alert_dialog.content(
                rx.alert_dialog.title("Redirect to Chat Interface?"),
                rx.alert_dialog.description(
                    "You will be redirected to the Chat Interface.",
                    size="2",
                ),
                rx.flex(
                    rx.alert_dialog.cancel(
                        rx.button(
                            "Cancel",
                            variant="soft",
                            color_scheme="gray",
                        ),
                    ),
                    rx.alert_dialog.action(
                        rx.button(
                            "Continue",
                            color_scheme="green",
                            variant="solid",
                            on_click=rx.redirect("/chat"),
                        ),
                    ),
                    spacing="3",
                    margin_top="16px",
                    justify="end",
                ),
                style={"max_width": 450},
            ),
        ),
        align="center",
    )
```
This component allows us to upload a file and ingest it into the vector store. It uses the ingest endpoint of our FastAPI app to upload and ingest the file. After ingestion, the user can simply move to the chat interface to ask queries.
With this, we have completed building the front-end for our application. Now we need to test the application with some documents.
Testing and Deployment
Now let’s test the application on some manuals or documents. To use the application, we need to run both the back-end app and the Reflex app separately. Run the back-end app from its directory using the following command:
python app.py
Wait for the FastAPI app to start running. Then, in another terminal instance, run the front-end app using the following command:
reflex run
Once the apps are up and running, open the Reflex app in your browser (by default, Reflex serves the front-end at http://localhost:3000). Initially you will land on the file upload page. Upload a file and press the Upload button.
The file will be uploaded and ingested. This will take a while depending on the document size and
the device specs. Once it’s done, click on the ‘Continue to Chat’ button to move to the chat interface. Write your query and press Send.
Conclusion
In this two-part series, you’ve now built a complete and functional RAG application on a Raspberry Pi, from creating the core pipeline to wrapping it with a FastAPI back-end and developing a Reflex-based front-end. With these tools, your RAG pipeline is accessible and interactive, providing real-time query processing through a user-friendly web interface. By mastering these steps, you’ve gained valuable experience in building and deploying end-to-end applications on a compact, efficient platform. This setup opens the door to countless possibilities for deploying AI-driven applications on resource-constrained devices like the Raspberry Pi, making cutting-edge technology more accessible and practical for everyday use.
Key Takeaways
- A detailed guide is provided on setting up the development environment, including installing necessary dependencies and models using Ollama, ensuring the application is ready for the final build.
- The article explains how to wrap the RAG pipeline in a FastAPI application, including setting up endpoints for querying the model and ingesting documents, making the pipeline accessible via a web API.
- The front-end of the RAG application is built using Reflex, a Python-only front-end library. The article demonstrates how to modify the chat application template to create a user-friendly interface for interacting with the RAG pipeline.
- The article guides you through integrating the FastAPI back-end with the Reflex front-end and deploying the complete application on a Raspberry Pi, ensuring seamless operation and user accessibility.
- Practical steps are provided for testing both the ingestion and query endpoints using tools like Postman or Thunder Client, along with running and testing the Reflex front-end to ensure the entire application functions as expected.
Frequently Asked Questions
Q1: How can I make the app accessible to myself from anywhere in the world without compromising security?
A. There is a platform named Tailscale that connects your devices to a private, secure network accessible only to you. You can add your Raspberry Pi and your other devices to your Tailscale network and connect to the VPN to access your apps from anywhere in the world.
Q2: My application is very slow in terms of ingestion and QnA.
A. That is a constraint of the Raspberry Pi’s low hardware specifications. This article is just a heads-up tutorial on how to start building a RAG app using a Raspberry Pi and Ollama.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.