
Anthropic's Contextual Retrieval: An Implementation Guide

William Shakespeare
2025-03-02 09:34:12

Retrieval-Augmented Generation (RAG) enhances AI models by integrating external knowledge. However, traditional RAG often fragments documents, losing critical context and hurting accuracy. Anthropic's contextual retrieval addresses this by adding a concise contextual explanation to each document chunk before embedding. This significantly reduces retrieval errors and improves downstream task performance. This article explains contextual retrieval and walks through an implementation.

RAG with LangChain

Leverage LangChain and RAG to integrate external data with LLMs.

Contextual retrieval explained

Traditional RAG splits documents into smaller chunks for easier retrieval, but this can strip away essential context. For example, a chunk might state "its more than 3.85 million inhabitants make it the European Union's most populous city" without specifying which city. This missing context hurts retrieval accuracy.

Contextual retrieval solves this by prepending a short, chunk-specific contextual summary to each chunk before embedding. The earlier example becomes:

<code>contextualized_chunk = """Berlin is the capital and largest city of Germany, known for being the EU's most populous city within its limits.
Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.
"""</code>
Anthropic's internal tests across diverse datasets (codebases, scientific papers, fiction) show that contextual retrieval reduces retrieval errors by up to 49% when contextual embedding models are paired with contextual BM25.
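The contextual BM25 half of that pairing can be sketched in pure Python. This is illustrative only: a production system would use a library such as rank_bm25 or a search engine's built-in BM25 scoring, and the sample chunks below are hypothetical. The key point is that the index is built over the *contextualized* chunks, so lexical matches can hit the added context as well as the original text:

```python
# Minimal BM25 sketch (illustrative; use a real library in production).
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency of each query term
    df = {t: sum(1 for d in docs_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if df.get(term, 0) == 0:
                continue  # term appears in no document
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Index the contextualized chunks, not the raw ones
contextualized_chunks = [
    "berlin germany capital: its more than 3.85 million inhabitants make it "
    "the european union's most populous city",
    "paris france capital: the city has over 2.1 million residents within "
    "its administrative limits",
]
tokenized = [chunk.split() for chunk in contextualized_chunks]
scores = bm25_scores("berlin inhabitants".split(), tokenized)
best_chunk = contextualized_chunks[scores.index(max(scores))]
```

Because the prepended context names the city, a lexical query for "berlin" now matches the population chunk even though the raw chunk never mentions Berlin.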


Implementing contextual retrieval

This section outlines a step-by-step implementation using a sample document.

Step 1: Define the input document

Define the sample document that will serve as the knowledge base:
<code># Input text for the knowledge base
input_text = """Berlin is the capital and largest city of Germany, both by area and by population.
Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.
The city is also one of the states of Germany and is the third smallest state in the country in terms of area.
Paris is the capital and most populous city of France.
It is situated along the Seine River in the north-central part of the country.
The city has a population of over 2.1 million residents within its administrative limits, making it one of Europe's major population centers."""</code>

Step 2: Chunk creation

Split the document into smaller, self-contained chunks (here, sentences):

<code># Splitting the input text into smaller chunks
test_chunks = [
    'Berlin is the capital and largest city of Germany, both by area and by population.',
    "\n\nIts more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.",
    '\n\nThe city is also one of the states of Germany and is the third smallest state in the country in terms of area.',
    '\n\n# Paris is the capital and most populous city of France.',
    '\n\n# It is situated along the Seine River in the north-central part of the country.',
    "\n\n# The city has a population of over 2.1 million residents within its administrative limits, making it one of Europe's major population centers."
]</code>
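The hand-written list above can also be produced programmatically. A minimal sketch, assuming one sentence per line as in the sample document; real documents would call for a proper splitter such as LangChain's RecursiveCharacterTextSplitter:

```python
input_text = """Berlin is the capital and largest city of Germany, both by area and by population.
Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.
The city is also one of the states of Germany and is the third smallest state in the country in terms of area.
Paris is the capital and most populous city of France.
It is situated along the Seine River in the north-central part of the country.
The city has a population of over 2.1 million residents within its administrative limits, making it one of Europe's major population centers."""

# Each sentence sits on its own line in this sample, so a line split suffices
chunks = [line.strip() for line in input_text.splitlines() if line.strip()]
```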

Step 3: Prompt template definition

Define the prompt for context generation (Anthropic's template is used):

<code>from langchain.prompts import ChatPromptTemplate, PromptTemplate, HumanMessagePromptTemplate

# Define the prompt for generating contextual information
anthropic_contextual_retrieval_system_prompt = """<document>
{WHOLE_DOCUMENT}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{CHUNK_CONTENT}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

# Assemble the final chat prompt from the system prompt above
# (this step is elided in the original; reconstructed here)
anthropic_prompt_template = PromptTemplate(
    input_variables=['WHOLE_DOCUMENT', 'CHUNK_CONTENT'],
    template=anthropic_contextual_retrieval_system_prompt,
)
anthropic_contextual_retrieval_final_prompt = ChatPromptTemplate.from_messages(
    [HumanMessagePromptTemplate(prompt=anthropic_prompt_template)]
)</code>
Step 4: LLM initialization

Choose an LLM (here, OpenAI's GPT-4o):
<code>import os
from langchain_openai import ChatOpenAI

# Set the OpenAI API key (replace with your own)
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

# Initialize the model instance
llm_model_instance = ChatOpenAI(
    model="gpt-4o",
)</code>
Step 5: Chain creation

Connect the prompt and the LLM:
<code>from langchain_core.output_parsers import StrOutputParser

# Chain the prompt with the model instance
contextual_chunk_creation = anthropic_contextual_retrieval_final_prompt | llm_model_instance | StrOutputParser()</code>
Step 6: Chunk processing

Generate the context for each chunk (the generated outputs are shown in the original example):

<code># Process each chunk and generate contextual information
for test_chunk in test_chunks:
    res = contextual_chunk_creation.invoke({
        "WHOLE_DOCUMENT": input_text,
        "CHUNK_CONTENT": test_chunk
    })
    print(res)
    print('-----')</code>

Reranking for enhanced precision

Reranking further refines retrieval by prioritizing the most relevant chunks, improving accuracy while reducing cost. In Anthropic's tests, reranking cut the retrieval error rate from 5.7% to 1.9%, a 67% improvement.
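As an illustration of the mechanics only (Anthropic's experiments used a dedicated reranking model; the word-overlap scorer below is a hypothetical stand-in), reranking re-scores an initial candidate list and keeps only the top-k chunks:

```python
def rerank(query, candidates, top_k=2):
    """Toy reranker: order candidates by query-term overlap.
    A real system would call a reranking model here instead."""
    query_terms = set(query.lower().split())

    def overlap(chunk):
        return len(query_terms & set(chunk.lower().split()))

    return sorted(candidates, key=overlap, reverse=True)[:top_k]

candidates = [
    "Paris is the capital and most populous city of France.",
    "Berlin is the capital and largest city of Germany, both by area and by population.",
    "It is situated along the Seine River in the north-central part of the country.",
]
top_chunks = rerank("largest city of Germany", candidates)
```

Only the surviving top-k chunks are passed to the LLM, which is why reranking reduces cost as well as error rate: fewer, better chunks enter the final prompt.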

Additional considerations

For smaller knowledge bases (fewer than 200,000 tokens), including the entire knowledge base directly in the prompt may be more effective than a retrieval system. Additionally, prompt caching (available with Claude) can significantly reduce costs and improve response times.
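That threshold can be checked with a rough heuristic of about four characters per token for English text. This is an approximation only; an exact count would come from a real tokenizer such as tiktoken:

```python
def fits_in_context(text, token_limit=200_000, chars_per_token=4):
    """Rough check: is the knowledge base small enough to inline in the prompt?
    Uses the common ~4-characters-per-token approximation for English text."""
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens <= token_limit

# A small knowledge base comfortably under the threshold
small_kb = "Berlin is the capital of Germany. " * 1000  # ~8,500 estimated tokens
inline_ok = fits_in_context(small_kb)
```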

Conclusion

Anthropic's contextual retrieval offers a simple yet powerful way to improve RAG systems. The combination of contextual embeddings, contextual BM25, and reranking substantially improves accuracy. Further exploration of other retrieval techniques is recommended.

