Enhancing RAG: Beyond Vanilla Approaches

By WBOY (original), 2025-02-25 16:38:09, 634 views

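By integrating external information retrieval, Retrieval-Augmented Generation (RAG) can significantly enhance language models. Standard RAG improves response relevance, but it often falters in complex retrieval scenarios. This article examines the shortcomings of basic RAG and proposes advanced methods that improve accuracy and efficiency.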

Limitations of Basic RAG

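Consider a simple scenario: retrieving relevant information from several documents. Our dataset includes a main document describing a healthy, productive daily routine, plus two unrelated documents that share some keywords with it but use them in different contexts.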

<code>main_document_text = """
Morning Routine (5:30 AM - 9:00 AM)
✅ Wake Up Early - Aim for 6-8 hours of sleep to feel well-rested.
✅ Hydrate First - Drink a glass of water to rehydrate your body.
✅ Morning Stretch or Light Exercise - Do 5-10 minutes of stretching or a short workout to activate your body.
✅ Mindfulness or Meditation - Spend 5-10 minutes practicing mindfulness or deep breathing.
✅ Healthy Breakfast - Eat a balanced meal with protein, healthy fats, and fiber.
✅ Plan Your Day - Set goals, review your schedule, and prioritize tasks.
...
"""</code>

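When a basic RAG system is queried with questions such as:

How can I improve my health and productivity?
What are the best strategies for a healthy and productive lifestyle?

it may struggle to retrieve the main document consistently, because similar words also appear in the unrelated documents.

Helper Functions: Streamlining the RAG Pipeline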

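To improve retrieval accuracy and simplify query processing, we introduce helper functions. They handle tasks such as querying the ChatGPT API, computing document embeddings, and calculating similarity scores, which makes the RAG pipeline more efficient.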

Here are the helper functions:

<code># **Imports**
import os
import json
import openai
import numpy as np
from scipy.spatial.distance import cosine
from google.colab import userdata  # Colab-only; elsewhere, set OPENAI_API_KEY yourself

# Set up OpenAI API key
os.environ["OPENAI_API_KEY"] = userdata.get('AiTeam')

# Client used by the helpers below; it reads OPENAI_API_KEY from the environment
client = openai.OpenAI()</code>

<code>def query_chatgpt(prompt, model="gpt-4o", response_format=openai.NOT_GIVEN):
    """Sends a single-turn prompt to the chat API and returns the reply text."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,  # Adjust for more or less creativity
            response_format=response_format
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        return f"Error: {e}"</code>

<code>def get_embedding(text, model="text-embedding-3-large"): #"text-embedding-ada-002"
    """Fetches the embedding for a given text using OpenAI's API."""
    response = client.embeddings.create(
        input=[text],
        model=model
    )
    return response.data[0].embedding</code>

<code>def compute_similarity_metrics(embed1, embed2):
    """Computes the cosine similarity between two embeddings."""
    # scipy's cosine() returns a distance, so subtract it from 1
    cosine_sim = 1 - cosine(embed1, embed2)

    return cosine_sim</code>

<code>def fetch_similar_docs(query, docs, threshold=.55, top=1):
    """Returns up to `top` docs whose similarity to the query is at least `threshold`."""
    query_em = get_embedding(query)
    data = []
    for d in docs:
        # Score each document against the query embedding
        similarity_results = compute_similarity_metrics(d["embedding"], query_em)
        if similarity_results >= threshold:
            data.append({"id": d["id"], "ref_doc": d.get("ref_doc", ""), "score": similarity_results})

    # Sort by score, highest first, then keep the best `top` matches
    sorted_data = sorted(data, key=lambda x: x["score"], reverse=True)
    sorted_data = sorted_data[:min(top, len(sorted_data))]
    return sorted_data</code>

Evaluating Basic RAG

We test basic RAG with predefined queries to evaluate its ability to retrieve the most relevant document based on semantic similarity. This highlights its limitations.
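
`fetch_similar_docs` expects every entry in `docs` to be a dict carrying an `id` and a precomputed `embedding`, but the article does not show how that list is built. The following is only a sketch of one plausible shape. `toy_embedding` and `build_docs` are hypothetical names, `toy_embedding` is an offline stand-in for `get_embedding` (a hashed bag of words, not a real semantic embedding), and the texts are placeholders rather than the article's dataset.

```python
import hashlib
import math


def toy_embedding(text, dim=64):
    """Hypothetical offline stand-in for get_embedding: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # Map each word to a stable bucket so the sketch is deterministic
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_docs(texts, embed=toy_embedding):
    """Embeds each text once and stores it in the dict shape fetch_similar_docs reads."""
    return [
        {"id": doc_id, "text": text, "embedding": embed(text)}
        for doc_id, text in texts.items()
    ]


# Placeholder texts (assumption): not the documents used in the article
texts = {
    "main_document": "Drink a glass of water and plan your day",
    "unrelated_1": "Water the garden plants every day",
}
docs = build_docs(texts)
```

In the real pipeline one would presumably pass `embed=get_embedding`, so that each document is embedded once up front and only the query is embedded at search time.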

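<code># **Testing Vanilla RAG**
# `docs` is the list of document dicts, each with "id" and "embedding" keys

query = "what should I do to stay healthy and productive?"
r = fetch_similar_docs(query, docs)
print("query = ", query)
print("documents = ", r)

query = "what are the best practices to stay healthy and productive ?"
r = fetch_similar_docs(query, docs)
print("query = ", query)
print("documents = ", r)</code>

Advanced Techniques for Enhancing RAG

To improve the retrieval process, we introduce functions that generate structured information, which enhances both document retrieval and query processing.

Three key enhancements are implemented: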

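1. Generating FAQs

Creating FAQs from a document expands the ways a query can match it. The FAQs are generated once and stored, so they enrich the search space without recurring cost.

<code>def generate_faq(text):
    """Asks the model for questions (not answers) that cover every subject in the text."""
    prompt = f'''
    given the following text: """{text}"""
    Ask relevant simple atomic questions ONLY (don't answer them) to cover all subjects covered by the text. Return the result as a json list example [q1, q2, q3...]
    '''
    # JSON mode returns a JSON object, so the question list typically arrives wrapped under a key
    return query_chatgpt(prompt, response_format={ "type": "json_object" })</code>

2. Creating an Overview

A concise summary captures the document's core ideas, which improves retrieval efficiency. The overview's embedding is added to the document collection.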
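
The article does not include the overview helper itself, so the following is a hypothetical sketch patterned on `generate_faq`. `generate_overview` and `add_overview_entry` are assumed names, the prompt wording and the `_overview` id suffix are illustrative, and `ask_llm` and `embed` are injected callables (in the article's pipeline `query_chatgpt` and `get_embedding` could plausibly fill those roles). Only the `ref_doc` key is taken from the source, since `fetch_similar_docs` reads it.

```python
from typing import Callable, Dict, List


def generate_overview(text: str, ask_llm: Callable[[str], str]) -> str:
    """Asks the model for a short summary of the text (prompt wording is an assumption)."""
    prompt = (
        f'given the following text: """{text}"""\n'
        "Summarize what it is about in three lines or fewer."
    )
    return ask_llm(prompt).strip()


def add_overview_entry(
    docs: List[Dict],
    doc_id: str,
    text: str,
    ask_llm: Callable[[str], str],
    embed: Callable[[str], List[float]],
) -> str:
    """Embeds the overview and stores it as an extra entry pointing back to its source doc."""
    overview = generate_overview(text, ask_llm)
    docs.append({
        "id": f"{doc_id}_overview",   # hypothetical naming scheme
        "ref_doc": doc_id,            # lets a hit on the overview resolve to the original
        "embedding": embed(overview),
    })
    return overview


# Offline stubs so the sketch runs without an API key
def fake_llm(prompt: str) -> str:
    return " A daily routine for staying healthy and productive. "


def fake_embed(text: str) -> List[float]:
    return [float(len(text)), 1.0]


docs: List[Dict] = []
overview = add_overview_entry(docs, "main_document", "Morning Routine ...", fake_llm, fake_embed)
```

Because the overview is produced and embedded once per document, it adds a single extra entry to the collection rather than any per-query cost.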

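3. Query Decomposition

Broad queries are broken down into smaller, more precise sub-queries. Each sub-query is compared against the enhanced document collection (original documents, FAQs, and overviews), and the results are merged to improve relevance.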
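
The decomposition and merge code is not shown in the article, so this is a minimal sketch of one way the merge step could work. `retrieve_with_decomposition` is a hypothetical name, and `decompose` and `search` are injected callables (an LLM-backed splitter and `fetch_similar_docs` would be natural candidates). The merge rule, keeping the best score per source document, is an assumption, and the stub sub-queries and scores are invented purely to make the sketch runnable.

```python
def retrieve_with_decomposition(query, decompose, search, top=3):
    """Runs each sub-query through `search` and keeps the best score per source document."""
    best = {}
    for sub_query in decompose(query):
        for hit in search(sub_query):
            # FAQ and overview hits carry ref_doc; original documents fall back to their own id
            source = hit.get("ref_doc") or hit["id"]
            if source not in best or hit["score"] > best[source]["score"]:
                best[source] = {"id": source, "score": hit["score"], "matched_by": sub_query}
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:top]


# Offline stubs with made-up scores (assumption): stand-ins for an LLM and the vector search
def fake_decompose(query):
    return ["how to stay healthy?", "how to stay productive?"]


def fake_search(sub_query):
    table = {
        "how to stay healthy?": [
            {"id": "faq_3", "ref_doc": "main_document", "score": 0.71},
        ],
        "how to stay productive?": [
            {"id": "main_document", "ref_doc": "", "score": 0.64},
            {"id": "unrelated_1", "ref_doc": "", "score": 0.58},
        ],
    }
    return table.get(sub_query, [])


results = retrieve_with_decomposition(
    "how to stay healthy and productive?", fake_decompose, fake_search
)
```

Resolving hits to their source document before merging means a match on an FAQ or an overview counts toward the document it was generated from, which is what lets the enriched collection lift the main document above the unrelated ones.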

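Evaluating Enhanced RAG

Rerunning the initial queries with these enhancements shows a significant improvement. Query decomposition produces multiple sub-queries, which leads to successful retrieval from both the FAQs and the original document.

Example FAQ output: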

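Cost-Benefit Analysis

Preprocessing (generating FAQs, overviews, and embeddings) adds upfront cost, but it is a one-time expense per document. It offsets the inefficiencies of a poorly optimized RAG system: frustrated users and higher query costs caused by retrieving irrelevant information. For high-volume systems, preprocessing is a worthwhile investment.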
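
The trade-off can be made concrete with a rough break-even estimate. This is a deliberately simplified model, not something taken from the article: `break_even_queries` is a hypothetical helper, it assumes every query against the unenhanced system wastes a fixed amount that preprocessing would avoid, and the numbers are arbitrary cost units chosen only for illustration.

```python
import math


def break_even_queries(preprocess_cost_per_doc, num_docs, wasted_cost_per_query):
    """Number of queries after which one-time preprocessing has paid for itself."""
    if wasted_cost_per_query <= 0:
        raise ValueError("wasted_cost_per_query must be positive")
    total_upfront = preprocess_cost_per_doc * num_docs
    return math.ceil(total_upfront / wasted_cost_per_query)


# Illustrative values only (assumption), in arbitrary cost units
queries_needed = break_even_queries(
    preprocess_cost_per_doc=0.5,
    num_docs=100,
    wasted_cost_per_query=0.25,
)
print(queries_needed)  # 200
```

Under this model the upfront cost grows with the number of documents while the savings grow with query volume, which is why the investment favors high-volume systems.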

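Conclusion

Combining document preprocessing (FAQs and overviews) with query decomposition yields a smarter RAG system that balances accuracy and cost-effectiveness. It improves retrieval quality, reduces irrelevant results, and gives users a better experience. Future research could explore further optimizations such as dynamic thresholds and reinforcement learning for query refinement.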
