Evaluating Medical Retrieval-Augmented Generation (RAG) with NVIDIA AI Endpoints and Ragas

In medicine, incorporating advanced technologies is essential to enhance patient care and improve research methodologies. Retrieval-augmented generation (RAG) is one of these innovations, blending the power of large language models (LLMs) with external knowledge retrieval. By pulling relevant information from databases, scientific literature, and patient records, RAG systems ground their responses in accurate, contextually rich material, addressing limitations such as outdated knowledge and hallucinations that are often observed in standalone LLMs.

In this overview, we'll explore RAG's growing role in healthcare, focusing on its potential to transform applications like drug discovery and clinical trials. We'll also dive into the methods and tools needed to evaluate medical RAG systems against their unique demands, including NVIDIA AI endpoints for LangChain and the Ragas framework, along with the MACCROBAT dataset, a collection of patient case reports from PubMed Central.


Key Challenges of Medical RAG

  1. Scalability: With medical data expanding at a compound annual growth rate (CAGR) of more than 35%, RAG systems need to manage and retrieve information efficiently without compromising speed, especially in scenarios where timely insights can impact patient care.

  2. Specialized Language and Knowledge Requirements: Medical RAG systems require domain-specific tuning since the medical lexicon and content differ substantially from other domains like finance or law.

  3. Absence of Tailored Evaluation Metrics: Unlike general-purpose RAG applications, medical RAG lacks well-suited benchmarks. Conventional metrics (like BLEU or ROUGE) emphasize text similarity rather than the factual accuracy critical in medical contexts.

  4. Component-wise Evaluation: Effective evaluation requires independent scrutiny of both the retrieval and generation components. Retrieval must pull relevant, current data, and the generation component must ensure faithfulness to retrieved content.

Introducing Ragas for RAG Evaluation

Ragas, an open-source evaluation framework, offers an automated approach for assessing RAG pipelines. Its toolkit focuses on context relevancy, recall, faithfulness, and answer relevancy. Utilizing an LLM-as-a-judge model, Ragas minimizes the need for manually annotated data, making the process efficient and cost-effective.
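
As a minimal sketch of how these metrics plug into the evaluation (metric and import names follow Ragas 0.1.x and may differ in newer releases; context_precision covers the context-relevancy check, and eval_dataset stands in for the evaluation dataset assembled later in this article):

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# eval_dataset: a Hugging Face Dataset with "question", "contexts",
# "answer", and "ground_truth" columns (assembled later in this article)
result = evaluate(
    eval_dataset,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(result)  # aggregate score per metric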

Evaluation Strategies for RAG Systems

For robust RAG evaluation, consider these steps:

  1. Synthetic Data Generation: Generate triplet data (question, answer, context) based on the vector store documents to create synthetic test data.
  2. Metric-Based Evaluation: Evaluate the RAG system on metrics like precision and recall, comparing its responses to the generated synthetic data as ground truth.
  3. Independent Component Evaluation: For each question, assess retrieval context relevance and the generation’s answer accuracy.

Here’s an example pipeline: given a question like “What are typical BP measurements in congestive heart failure?”, the system first retrieves the relevant context and then evaluates whether the generated response answers the question accurately and is supported by that context.
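
Concretely, each evaluation record pairs the question with its retrieved contexts, the system's answer, and a reference answer. A sketch of such a record (field names follow Ragas conventions; the values are placeholders, not real clinical data):

# An illustrative evaluation record; the values are placeholders
sample = {
    "question": "What are typical BP measurements in the case of congestive heart failure?",
    "contexts": ["<passages retrieved from the vector store for this question>"],
    "answer": "<the RAG system's generated response>",
    "ground_truth": "<the reference answer produced during test set generation>",
}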

Setting Up RAG with NVIDIA API and LangChain

To follow along, create an NVIDIA account and obtain an API key. Install the necessary packages with:

pip install langchain
pip install langchain_nvidia_ai_endpoints
pip install ragas
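
The NVIDIA LangChain endpoint clients look for your key in the NVIDIA_API_KEY environment variable (it can also be passed directly to the client constructors), so set it before running the code below:

import os

# Make the API key available to langchain_nvidia_ai_endpoints;
# replace the placeholder with the key from your NVIDIA account
os.environ["NVIDIA_API_KEY"] = "<your-api-key>"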

Download the MACCROBAT dataset, which offers comprehensive medical records that can be loaded and processed via LangChain.

from langchain_community.document_loaders import HuggingFaceDatasetLoader

# Load the MACCROBAT patient reports from the Hugging Face Hub,
# using the "full_text" column as each document's page content
dataset_name = "singh-aditya/MACCROBAT_biomedical_ner"
page_content_column = "full_text"

loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)
dataset = loader.load()  # returns a list of LangChain Document objects
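
Each record becomes a LangChain Document whose page_content holds the report's full text, with the remaining dataset columns carried along as metadata. A quick sanity check:

# Quick sanity check on what the loader returned
print(len(dataset))                   # number of patient reports loaded
print(dataset[0].page_content[:200])  # first 200 characters of the first report
print(dataset[0].metadata.keys())     # other dataset columns kept as metadata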

Using NVIDIA endpoints and LangChain, we can now build a robust test set generator and create synthetic data based on the dataset:

from ragas.testset.generator import TestsetGenerator
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings

# The critic LLM filters and validates generated samples;
# the generator LLM writes the questions and reference answers
critic_llm = ChatNVIDIA(model="meta/llama3.1-8b-instruct")
generator_llm = ChatNVIDIA(model="mistralai/mixtral-8x7b-instruct-v0.1")
embeddings = NVIDIAEmbeddings(model="nv-embedqa-e5-v5", truncate="END")

generator = TestsetGenerator.from_langchain(
    generator_llm, critic_llm, embeddings, chunk_size=512
)

# Produce 10 synthetic (question, context, ground-truth answer) samples
testset = generator.generate_with_langchain_docs(dataset, test_size=10)
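
To inspect what was generated, convert the test set to a pandas DataFrame (the to_pandas helper is available on the Ragas 0.1.x test set object):

# Inspect the synthetic questions, contexts, and ground-truth answers
df = testset.to_pandas()
print(df.head())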

Deploying and Evaluating the Pipeline

Deploy your RAG system against a vector store built from the medical reports; the synthetic test set supplies sample questions drawn from actual records, for example:

# Sample questions
["What are typical BP measurements in the case of congestive heart failure?",
 "What can scans reveal in patients with severe acute pain?",
 "Is surgical intervention necessary for liver metastasis?"]

Each question is paired with a retrieved context and a generated ground-truth answer, which can then be used to evaluate the performance of both the retrieval and generation components.
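
As a minimal sketch of this step (assuming a FAISS vector store, which requires the faiss-cpu package; any LangChain-compatible store would do, and the prompt and column names here are illustrative, following Ragas 0.1.x conventions), the reports are indexed with the NVIDIA embeddings defined earlier, contexts and answers are produced for each synthetic question, and everything is assembled into a dataset that Ragas can score:

from datasets import Dataset
from langchain_community.vectorstores import FAISS

# Index the medical reports with the NVIDIA embedding model defined earlier
vector_store = FAISS.from_documents(dataset, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

records = {"question": [], "contexts": [], "answer": [], "ground_truth": []}
for row in testset.to_pandas().itertuples():
    # Retrieve supporting passages for each synthetic question
    retrieved_docs = retriever.invoke(row.question)
    contexts = [doc.page_content for doc in retrieved_docs]
    # Generate an answer that is asked to stay within the retrieved context
    answer = generator_llm.invoke(
        f"Answer using only this context:\n{contexts}\n\nQuestion: {row.question}"
    ).content
    records["question"].append(row.question)
    records["contexts"].append(contexts)
    records["answer"].append(answer)
    records["ground_truth"].append(row.ground_truth)

eval_dataset = Dataset.from_dict(records)
# eval_dataset can now be scored with ragas.evaluate(...) using the metrics shown earlier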

Custom Metrics with Ragas

Medical RAG systems may need custom metrics to assess retrieval precision. For instance, a metric could determine if a retrieved document is relevant enough for a search query:

from dataclasses import dataclass, field

# Import paths below follow Ragas 0.1.x and may differ in newer releases
from ragas import evaluate
from ragas.llms.prompt import Prompt
from ragas.metrics.base import EvaluationMode, MetricWithLLM

RETRIEVAL_PRECISION = Prompt(
    name="retrieval_precision",
    instruction="Is this result relevant enough for the first page of search results? Answer '1' for yes and '0' for no.",
    input_keys=["question", "context"],
    output_key="relevant",  # assumed name for the judge's 0/1 verdict field
)

@dataclass
class RetrievalPrecision(MetricWithLLM):
    name: str = "retrieval_precision"
    evaluation_mode: EvaluationMode = EvaluationMode.qc  # the metric consumes question + context
    context_relevancy_prompt: Prompt = field(default_factory=lambda: RETRIEVAL_PRECISION)
    # (the metric's scoring coroutine is omitted here for brevity)

# Use this custom metric on the evaluation dataset assembled above
score = evaluate(eval_dataset, metrics=[RetrievalPrecision()])

Structured Output for Precision and Reliability

For efficient and reliable evaluation, structured output simplifies downstream processing. With the NVIDIA LangChain endpoints, you can constrain the LLM's response to predefined categories (e.g., yes/no):

import enum

from langchain_nvidia_ai_endpoints import ChatNVIDIA

class Choices(enum.Enum):
    Y = "Y"
    N = "N"

# Any NVIDIA-hosted chat model can serve as the judge here
nvidia_llm = ChatNVIDIA(model="meta/llama3.1-8b-instruct")

# Constrain the model's output to one of the enum members
structured_llm = nvidia_llm.with_structured_output(Choices)
structured_llm.invoke("Is this search result relevant to the query?")
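
Because the structured call is bound to the Choices enum, its verdict can be mapped straight onto the 1/0 scale used by the retrieval-precision prompt above; a small illustrative usage:

# Map the structured Y/N verdict onto the metric's 1/0 relevance score
verdict = structured_llm.invoke(
    "Query: typical BP measurements in congestive heart failure\n"
    "Result: <retrieved passage>\n"
    "Is this search result relevant to the query?"
)
relevance_score = 1 if verdict == Choices.Y else 0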

Conclusion

RAG bridges LLMs and dense vector retrieval for highly efficient, scalable applications across medical, multilingual, and code generation domains. In healthcare, its potential to bring accurate, contextually aware responses is evident, but evaluation must prioritize accuracy, domain specificity, and cost-efficiency.

The outlined evaluation pipeline, employing synthetic test data, NVIDIA endpoints, and Ragas, offers a robust method to meet these demands. For a deeper dive, you can explore Ragas and NVIDIA Generative AI examples on GitHub.
