Contextual retrieval enhances retrieval-augmented generation (RAG) by integrating external knowledge into AI models. However, traditional RAG often fragments documents, losing key context and hurting accuracy.
Traditional RAG
Traditional RAG methods split documents into smaller chunks for easier retrieval, but this can strip away essential context. For example, a chunk might state that "its more than 3.85 million inhabitants make it the European Union's most populous city" without specifying which city. This missing context hurts accuracy.
Implementing Contextual Retrieval
This section outlines a step-by-step implementation using an example document.
The implementation covers five steps: creating chunks, defining the prompt template for context generation (Anthropic's template is used), initializing the LLM, chaining the prompt with the LLM, and generating context for each chunk.
Reranking for Enhanced Precision
Reranking further refines retrieval by prioritizing the most relevant chunks. This improves accuracy and reduces costs. In Anthropic's tests, reranked contextual retrieval reduced the retrieval error rate from 5.7% to 1.9%, a 67% improvement.
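The reranking step can be sketched as follows. In a real system the score would come from a cross-encoder reranking model applied to each (query, chunk) pair; the token-overlap scorer below is only a hypothetical stand-in so the sketch stays self-contained:

```python
def rerank(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Reorder retrieved chunks by relevance to the query, keeping the top_k.

    Stand-in scorer: fraction of query tokens found in the chunk.
    A production reranker would use a dedicated reranking model instead.
    """
    def score(chunk: str) -> float:
        q = set(query.lower().split())
        c = set(chunk.lower().replace(".", "").split())
        return len(q & c) / (len(q) or 1)

    return sorted(chunks, key=score, reverse=True)[:top_k]

candidates = [
    "The city is also one of the states of Germany.",
    "Berlin is the capital and largest city of Germany.",
    "Paris is the capital and most populous city of France.",
]
top = rerank("capital city of Germany", candidates)
print(top[0])  # the Berlin chunk matches every query token
```

Only the top-ranked chunks are passed to the generation model, which is what reduces both error rate and token cost.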
Contextual Retrieval Explained
Contextual retrieval addresses this problem by prepending a short, context-specific summary to each chunk before embedding. The earlier example becomes:
<code>contextualized_chunk = """Berlin is the capital and largest city of Germany, known for being the EU's most populous city within its limits.
Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.
"""</code>
Anthropic's internal tests across diverse datasets (codebases, scientific papers, fiction) show that contextual retrieval reduces retrieval errors by up to 49% when paired with contextual embedding models and contextual BM25.
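Contextual BM25 simply runs standard BM25 lexical scoring over the contextualized chunks. A minimal self-contained BM25 sketch (standard Okapi formula; the example chunks and query are illustrative):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # Document frequency of each query term
    df = {t: sum(1 for d in docs_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

chunks = [
    "Berlin is the capital and largest city of Germany.",
    "Paris is the capital and most populous city of France.",
]
tokenized = [c.lower().replace(".", "").split() for c in chunks]
scores = bm25_scores("capital of germany".split(), tokenized)
best = chunks[scores.index(max(scores))]
print(best)  # "germany" appears only in the first chunk, so it wins
```

Because the context summary is prepended before indexing, terms like the city name become matchable by BM25 even when the original chunk never mentions them.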
Step 1: Chunk Creation
<code># Input text for the knowledge base
input_text = """Berlin is the capital and largest city of Germany, both by area and by population.
Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.
The city is also one of the states of Germany and is the third smallest state in the country in terms of area.
Paris is the capital and most populous city of France.
It is situated along the Seine River in the north-central part of the country.
The city has a population of over 2.1 million residents within its administrative limits, making it one of Europe's major population centers."""</code>
<code># Splitting the input text into smaller chunks
test_chunks = [
'Berlin is the capital and largest city of Germany, both by area and by population.',
"\n\nIts more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.",
'\n\nThe city is also one of the states of Germany and is the third smallest state in the country in terms of area.',
'\n\nParis is the capital and most populous city of France.',
'\n\nIt is situated along the Seine River in the north-central part of the country.',
"\n\nThe city has a population of over 2.1 million residents within its administrative limits, making it one of Europe's major population centers."
]</code>
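The chunks above are hardcoded for clarity. In practice they could be produced programmatically, for example by splitting on sentence boundaries; a minimal sketch (the regex-based splitter is an assumption, not part of the original tutorial):

```python
import re

input_text = """Berlin is the capital and largest city of Germany, both by area and by population.
Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."""

# Split wherever sentence-ending punctuation is followed by whitespace
chunks = [s.strip() for s in re.split(r"(?<=[.!?])\s+", input_text) if s.strip()]
print(len(chunks))  # two sentences -> two chunks
```

Real pipelines typically use a recursive character splitter with overlap instead of plain sentence splitting, but the idea is the same.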
Step 2: Prompt Template Definition
Define the prompt for generating contextual information (Anthropic's template is used):
<code>from langchain.prompts import ChatPromptTemplate, PromptTemplate, HumanMessagePromptTemplate
# Define the prompt for generating contextual information
anthropic_contextual_retrieval_system_prompt = """<document>
{WHOLE_DOCUMENT}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{CHUNK_CONTENT}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""
# Wrap the raw template string in a chat prompt template
anthropic_prompt_template = PromptTemplate(
    input_variables=['WHOLE_DOCUMENT', 'CHUNK_CONTENT'],
    template=anthropic_contextual_retrieval_system_prompt
)
anthropic_prompt = HumanMessagePromptTemplate(prompt=anthropic_prompt_template)
anthropic_contextual_retrieval_final_prompt = ChatPromptTemplate.from_messages([anthropic_prompt])</code>
Step 3: LLM Initialization
<code>import os
from langchain_openai import ChatOpenAI
# Load environment variables
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
# Initialize the model instance
llm_model_instance = ChatOpenAI(
model="gpt-4o",
)</code>
Step 4: Chain Creation
<code>from langchain_core.output_parsers import StrOutputParser
# Chain the prompt with the model instance
contextual_chunk_creation = anthropic_contextual_retrieval_final_prompt | llm_model_instance | StrOutputParser()</code>
Step 5: Chunk Processing
Generate the context for each chunk:
<code># Process each chunk and generate contextual information
for test_chunk in test_chunks:
res = contextual_chunk_creation.invoke({
"WHOLE_DOCUMENT": input_text,
"CHUNK_CONTENT": test_chunk
})
print(res)
print('-----')</code>
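Each generated context is then prepended to its chunk before embedding or BM25 indexing. A minimal sketch of that final assembly (the `contexts` string below is a placeholder standing in for the chain's LLM output):

```python
chunks = [
    "Its more than 3.85 million inhabitants make it the European Union's most populous city.",
]
# Placeholder for the LLM-generated context produced by the chain above
contexts = [
    "This chunk describes the population of Berlin, the capital of Germany.",
]
# Prepend each context to its chunk to form the contextualized chunk
contextualized_chunks = [f"{ctx}\n{chunk}" for ctx, chunk in zip(contexts, chunks)]
print(contextualized_chunks[0])
```

The contextualized chunks, not the raw ones, are what get embedded and indexed, which is how the city name stays attached to the population fact.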
Additional Considerations
For smaller knowledge bases (< 200,000 tokens), including the entire knowledge base directly in the prompt may be more effective than using a retrieval system. Additionally, prompt caching (available with Claude) can significantly reduce costs and improve response times.
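That 200,000-token threshold can be checked with a rough size estimate before choosing a strategy; a sketch using the common ~4-characters-per-token heuristic for English text (an approximation, not an exact tokenizer):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text
    return len(text) // 4

knowledge_base = "Berlin is the capital and largest city of Germany. " * 100

# Small knowledge bases can skip retrieval entirely
if estimate_tokens(knowledge_base) < 200_000:
    strategy = "include entire knowledge base in the prompt"
else:
    strategy = "use contextual retrieval"
print(strategy)
```

For an exact count, use the tokenizer of the target model instead of this heuristic.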
Conclusion
Anthropic's contextual retrieval offers a simple yet powerful way to improve RAG systems. The combination of contextual embeddings, contextual BM25, and reranking substantially improves accuracy, and exploring further retrieval techniques alongside it is worthwhile.