Give your RAG system a comprehensive "physical exam" with Amazon's open-source RAGChecker diagnostic tool

WBOY | Original: 2024-08-19 04:29:32 | 1070 views

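The AIxiv column is where this site publishes academic and technical content. Over the past few years, the column has received more than 2,000 reports covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com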

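Amazon's Shanghai AI research lab was established in 2018 and has become one of the leading institutions in deep learning research, having published about 90 papers. Its research areas include the fundamental theory of deep learning, natural language processing, computer vision, graph machine learning, high-performance computing, intelligent recommendation systems, fraud detection and risk control, knowledge graph construction, and intelligent decision-making systems. The lab led the research and development of Deep Graph Library (DGL), a leading library for deep graph learning that combines the strengths of deep learning and graph-structured representations and has influenced many important application areas.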

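Retrieval-augmented generation (RAG) is transforming AI applications. By integrating external knowledge bases with the internal knowledge of LLMs, it substantially improves the accuracy and reliability of AI systems. However, as RAG systems are deployed widely across industries, evaluating and optimizing them has become a major challenge. Existing evaluation methods, whether traditional end-to-end metrics or evaluations of a single module, struggle to fully reflect the complexity and real-world performance of a RAG system. In particular, they yield only a final score report that reflects how well the RAG system performs overall.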

When people get sick, they go to the hospital for an examination. So when a RAG system is "sick", how do we diagnose it?

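Recently, Amazon's Shanghai AI research lab released a diagnostic tool called RAGChecker, which provides RAG systems with fine-grained, comprehensive, and reliable diagnostic reports, along with actionable directions for further improving performance. This article takes a closer look at this RAG "microscope" and at how it can help developers build smarter, more reliable RAG systems.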

Paper: https://arxiv.org/pdf/2408.08067
Project: https://github.com/amazon-science/RAGChecker

RAGChecker: a comprehensive diagnostic tool for RAG systems

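Imagine what a comprehensive "physical exam" for a RAG system would look like. RAGChecker was built for exactly this. It not only evaluates the system's overall performance but also provides an in-depth analysis of its two core modules, retrieval and generation.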

The main features of RAGChecker are:

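Fine-grained evaluation: RAGChecker uses claim-level entailment checking rather than simple response-level evaluation. This allows a more detailed and nuanced analysis of system performance and yields deeper insights.
Comprehensive metric system: The framework provides a set of metrics covering every aspect of RAG system performance, including faithfulness, context utilization, noise sensitivity, and hallucination.
Proven effectiveness: Reliability tests show that RAGChecker's evaluation results correlate strongly with human judgments and outperform other existing evaluation metrics. This supports the credibility and practical value of the evaluation results.
Actionable insights: The diagnostic metrics that RAGChecker provides give clear direction for improving a RAG system. These insights help researchers and practitioners develop more effective and reliable AI applications.

RAGChecker's core metrics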

These indicators are divided into three major categories:

1. Overall indicators:

Precision: The proportion of statements (claims) in the model's answer that are correct
Recall: The proportion of statements in the standard answer that are included in the model's answer
F1 score: The harmonic mean of precision and recall, providing a balanced performance measure
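As a rough illustration of the overall metrics above, the sketch below computes precision, recall, and F1 from statement-level judgments. It assumes the judgments (whether each statement is correct or covered) are already available as booleans; the function name `overall_metrics` and its inputs are hypothetical and are not part of the RAGChecker API.

```python
def overall_metrics(answer_statements_correct, standard_statements_covered):
    """Compute statement-level precision, recall, and F1.

    answer_statements_correct: one bool per statement in the model's answer,
        True if that statement is judged correct.
    standard_statements_covered: one bool per statement in the standard answer,
        True if the model's answer includes it.
    Both lists are assumed inputs; producing them is outside this sketch.
    """
    n_answer = len(answer_statements_correct)
    n_standard = len(standard_statements_covered)
    precision = sum(answer_statements_correct) / n_answer if n_answer else 0.0
    recall = sum(standard_statements_covered) / n_standard if n_standard else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: 3 of 4 answer statements are correct,
# and 1 of 2 standard-answer statements is covered.
scores = overall_metrics([True, True, False, True], [True, False])
print(scores)
```

In this toy case the precision is 3/4 and the recall is 1/2, so the F1 score sits between the two, closer to the lower value.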

2. Retrieval module metrics:

Context Precision: The proportion of retrieved blocks that contain at least one statement from the standard answer
Claim Recall: The proportion of statements in the standard answer that are covered by the retrieved blocks
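The two retrieval metrics can be sketched from a block-by-statement coverage matrix. This is a minimal illustration that assumes such a matrix has already been produced by some entailment check; the function name `retriever_metrics` and the matrix layout are hypothetical, not RAGChecker's internal representation.

```python
def retriever_metrics(coverage):
    """Compute Context Precision and Claim Recall from a coverage matrix.

    coverage[i][j] is True if retrieved block i covers statement j of the
    standard answer. The matrix is an assumed input for this sketch.
    """
    num_blocks = len(coverage)
    num_statements = len(coverage[0]) if coverage else 0

    # A block is relevant if it covers at least one standard-answer statement.
    relevant_blocks = sum(1 for row in coverage if any(row))

    # A statement is covered if at least one retrieved block contains it.
    covered_statements = sum(
        1 for j in range(num_statements) if any(row[j] for row in coverage)
    )

    return {
        "context_precision": relevant_blocks / num_blocks if num_blocks else 0.0,
        "claim_recall": covered_statements / num_statements if num_statements else 0.0,
    }


# Toy example: 3 retrieved blocks, 4 standard-answer statements.
coverage = [
    [True, False, False, False],   # block 0 covers statement 0
    [False, False, False, False],  # block 1 covers nothing
    [True, True, False, False],    # block 2 covers statements 0 and 1
]
retrieval_scores = retriever_metrics(coverage)
print(retrieval_scores)
```

Here two of the three blocks are relevant, and two of the four statements are covered by at least one block.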

3. Generation module metrics:

Context Utilization: Evaluates how effectively the generation module uses the relevant information in the retrieved blocks to produce correct statements. This metric reflects how efficiently the system uses the retrieved information.
Noise Sensitivity: A measure of the generation module’s tendency to include erroneous information from the retrieval block in its answers. This metric helps identify how sensitive a system is to irrelevant or erroneous information.
Hallucination: Measures how often the model generates information that is neither in the retrieval block nor in the standard answer. This is like capturing the situation where the model "makes up" information out of thin air, and is an important indicator for evaluating the reliability of the model.
Self-knowledge: Evaluates how often the model answers questions correctly without obtaining information from the retrieval block. This reflects the model's ability to leverage its own built-in knowledge when needed.
Faithfulness: Measures how consistent the response of the generation module is with the information provided by the retrieval block. This metric reflects the system's compliance with the given information.
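One simplified reading of the generation metrics above is that each statement in the response is labeled along two axes: whether it is correct, and whether it is supported by a retrieved block. The sketch below follows that reading. The class `ResponseClaim`, the function names, and the input layout are hypothetical, and the project's own definitions may be more refined than this.

```python
from dataclasses import dataclass


@dataclass
class ResponseClaim:
    correct: bool     # judged correct against the standard answer
    in_context: bool  # supported by at least one retrieved block


def fraction(flags):
    """Share of True values in an iterable of booleans (0.0 if empty)."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def generator_metrics(response_claims, standard_statements):
    """Simplified generation metrics.

    response_claims: list of ResponseClaim, one per statement in the response.
    standard_statements: (in_context, in_response) bool pairs, one per
        statement in the standard answer.
    """
    # Of the standard-answer statements that were retrieved, how many were used?
    used = [in_resp for in_ctx, in_resp in standard_statements if in_ctx]
    return {
        "context_utilization": fraction(used),
        "noise_sensitivity": fraction(
            (not c.correct) and c.in_context for c in response_claims
        ),
        "hallucination": fraction(
            (not c.correct) and (not c.in_context) for c in response_claims
        ),
        "self_knowledge": fraction(
            c.correct and (not c.in_context) for c in response_claims
        ),
        "faithfulness": fraction(c.in_context for c in response_claims),
    }


claims = [
    ResponseClaim(correct=True, in_context=True),
    ResponseClaim(correct=True, in_context=False),   # self-knowledge
    ResponseClaim(correct=False, in_context=True),   # noise picked up from context
    ResponseClaim(correct=False, in_context=False),  # hallucination
    ResponseClaim(correct=True, in_context=True),
]
standard = [(True, True), (True, False), (False, True), (True, True)]
generation_scores = generator_metrics(claims, standard)
print(generation_scores)
```

Under this reading, noise sensitivity, hallucination, and self-knowledge partition the response statements that are either incorrect or unsupported, which makes it easy to see where an answer's problems come from.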

These indicators are like the "physical examination report" of the RAG system, helping developers comprehensively understand the health of the system and identify areas for improvement.

Start using RAGChecker

For developers who want to try RAGChecker, the getting started process is very simple. The following are the steps to get started quickly:

1. Environment setup: First, install RAGChecker and its dependencies:

pip install ragchecker
python -m spacy download en_core_web_sm

2. Prepare data: Convert the output of your RAG system into a specific JSON format that includes the query, the standard answer, the model's answer, and the retrieved context. The data format should look like this:

{
  "results": [
    {
      "query_id": "<query ID>",
      "query": "<input query>",
      "gt_answer": "<standard answer>",
      "response": "<response generated by the RAG system>",
      "retrieved_context": [
        {
          "doc_id": "<document ID>",
          "text": "<content of the retrieved block>"
        },
        ...
      ]
    },
    ...
  ]
}
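If your RAG pipeline keeps its outputs in memory, a small helper can produce the layout shown above. The sketch below is only an illustration: the record keys (`question`, `reference`, `answer`, `chunks`) and the function `build_checking_inputs` are hypothetical names for whatever your own pipeline produces; only the output field names come from the format above.

```python
import json


def build_checking_inputs(records):
    """Convert in-memory RAG outputs into the JSON layout described above.

    records: list of dicts with the hypothetical keys "question",
        "reference", "answer", and "chunks" (a list of (doc_id, text) pairs).
    """
    results = []
    for index, record in enumerate(records):
        results.append({
            "query_id": str(index),
            "query": record["question"],
            "gt_answer": record["reference"],
            "response": record["answer"],
            "retrieved_context": [
                {"doc_id": doc_id, "text": text}
                for doc_id, text in record["chunks"]
            ],
        })
    return {"results": results}


records = [
    {
        "question": "When was the lab established?",
        "reference": "The lab was established in 2018.",
        "answer": "It was established in 2018.",
        "chunks": [("doc-1", "The lab was established in 2018.")],
    }
]
payload = build_checking_inputs(records)
serialized = json.dumps(payload, ensure_ascii=False, indent=2)
print(serialized)
```

The resulting string can then be written to the file that is passed as the input path in the next step.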

3. Run the evaluation:

Use the command line:

ragchecker-cli \
    --input_path=examples/checking_inputs.json \
    --output_path=examples/checking_outputs.json

Or use Python code:

from ragchecker import RAGResults, RAGChecker
from ragchecker.metrics import all_metrics

# Initialize RAGResults from JSON
with open("examples/checking_inputs.json") as fp:
    rag_results = RAGResults.from_json(fp.read())

# Set up the evaluator
evaluator = RAGChecker()

# Evaluate the results
evaluator.evaluate(rag_results, all_metrics)
print(rag_results)

4. Analyze the results: RAGChecker writes a JSON file containing the evaluation metrics, which helps you understand how each part of the RAG system performs.

The format of the output result is as follows:

[Figure: example of the RAGChecker output format]

By analyzing these indicators, developers can optimize various aspects of the RAG system in a targeted manner. For example:

Low Claim Recall may indicate the need for a better retrieval strategy. It means the system may not have retrieved enough relevant information, so the retrieval algorithm may need tuning or the knowledge base may need expanding.
High Noise Sensitivity indicates that the generation module needs stronger reasoning to distinguish relevant information from irrelevant or erroneous details in the retrieved context. This may require improving the model's training methods or enhancing its ability to understand context.
High Hallucination scores may point to the need to better integrate the generation module with the retrieved context. This might involve improving how the model exploits retrieved information, or increasing its fidelity to the facts.
The balance between Context Utilization and Self-knowledge can help you optimize the trade-off between retrieval information utilization and model inherent knowledge. This might involve adjusting how much the model relies on retrieval information, or improving its ability to combine multiple sources of information.
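The first three rules of thumb above can be turned into a simple triage helper. This is a hedged sketch: the function `suggest_improvements`, the metric key names, and the threshold values are all assumptions chosen for illustration, not values recommended by RAGChecker, and the scores are assumed to be fractions between 0 and 1.

```python
def suggest_improvements(metrics, low_recall=0.5, high_noise=0.3, high_hallucination=0.2):
    """Map metric values to the improvement hints listed above.

    metrics: dict with the hypothetical keys "claim_recall",
        "noise_sensitivity", and "hallucination" (fractions in [0, 1]).
    The thresholds are illustrative defaults, not recommended values.
    """
    suggestions = []
    if metrics.get("claim_recall", 1.0) < low_recall:
        suggestions.append(
            "Low Claim Recall: tune the retrieval algorithm or expand the knowledge base."
        )
    if metrics.get("noise_sensitivity", 0.0) > high_noise:
        suggestions.append(
            "High Noise Sensitivity: improve how the generator separates relevant "
            "from irrelevant or erroneous retrieved details."
        )
    if metrics.get("hallucination", 0.0) > high_hallucination:
        suggestions.append(
            "High Hallucination: integrate the generator more tightly with the "
            "retrieved context."
        )
    return suggestions


hints = suggest_improvements(
    {"claim_recall": 0.3, "noise_sensitivity": 0.1, "hallucination": 0.4}
)
for hint in hints:
    print(hint)
```

The balance between Context Utilization and Self-knowledge is left out on purpose, since it describes a trade-off that depends on the application and does not reduce to a single threshold.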

In this way, RAGChecker not only provides a detailed performance evaluation, but also provides clear guidance on the specific optimization direction of the RAG system.

Using RAGChecker in LlamaIndex

RAGChecker is now integrated with LlamaIndex, providing a powerful evaluation tool for RAG applications built with LlamaIndex. If you want to know how to use RAGChecker in the LlamaIndex project, you can refer to the section about RAGChecker integration in the LlamaIndex documentation.

Conclusion

RAGChecker provides a new tool for evaluating and optimizing RAG systems. It gives developers a "microscope" for understanding a RAG system in depth and optimizing it precisely. Whether you are a researcher studying RAG technology or an engineer building smarter AI applications, RAGChecker can be a valuable assistant. Readers can visit https://github.com/amazon-science/RAGChecker for more information or to participate in the development of the project.
