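Interestingly, researchers at Facebook AI Research (now Meta AI) published the first paper on RAG in 2020, but its potential was fully realized only after the arrival of ChatGPT. Progress has not stopped since. More advanced and complex RAG frameworks have been introduced that not only improve the accuracy of the technique but also enable it to handle multimodal data, widening its range of applications. I have covered this topic in detail in the following articles, which discuss contextual multimodal RAG, multimodal AI search for business applications, and information extraction and matchmaking platforms.

- Integrating multimodal data into a large language model
- Multimodal AI search for business applications
- AI-powered information extraction and matchmaking

With the expanding landscape of RAG technology and emerging data access requirements, it became clear that a pure RAG, which answers questions from a static knowledge base, can be extended by integrating other diverse knowledge sources and tools, for example:

- Multiple databases (e.g., a knowledge base comprising a vector database and a knowledge graph)
- Real-time web search to access the latest information
- External APIs to collect specific data, such as stock market trends or data from company-specific tools (e.g., Slack channels or email accounts)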
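In this article, we will develop a specific agentic RAG application called Smart Business Guide (SBG).

The first version of this tool is part of our ongoing project 乐观, funded by Interreg Central Baltic. The project focuses on upskilling high-skilled immigrants in Finland and Estonia in entrepreneurship and business planning using AI. SBG is one of the tools intended for use in the project's upskilling process. It focuses on providing precise and quick information from authentic sources to people who intend to start a business or are already running one. The sections below describe the components of SBG's agentic RAG.

What is special about this agentic RAG?

- It offers the option to choose among different open-source models (Llama, Mistral, Gemma)
- It is an advanced agentic RAG (hereafter called Smart Business Guide or SBG) developed using free, open-source models

The code is structured in files: `agentic_rag.py` implements the entire agentic workflow, and `app.py` provides the Streamlit user interface.

Let's dive in.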
LlamaParse is a GenAI-native document parsing platform built with LLMs and for LLM use cases. I explained the use of LlamaParse in the articles cited above. This time, I parsed the documents directly in LlamaCloud. LlamaParse offers 1000 free credits per day, and how quickly they are used depends on the parsing mode. For text-only PDFs, 'Fast' mode (1 credit / 3 pages) works well; it skips OCR, image extraction, and table/heading identification. More advanced modes are available at a higher credit cost per page. I selected 'premium' mode, which performs OCR, image extraction, and table/heading identification and is well suited to complex documents with images. I defined the following parsing instructions.

> You are given a document containing text, tables, and images. Extract all the contents in their correct format. Extract each table in a correct format and include a detailed explanation of each table before its extracted format. If an image contains text, extract all the text in the correct format and include a detailed explanation of each image before its extracted text. Produce the output in markdown text. Extract each page separately in the form of an individual node. Assign the document name and page number to each extracted node in the format: [Creativity and Business, page 7]. Include the document name and page number at the start and end of each extracted page.

The parsed documents were downloaded in markdown format from LlamaCloud. The same parsing can be done through the LlamaCloud API as follows.

```python
import os

from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

# Assumption: the LlamaCloud API key is supplied through an environment variable
LLAMA_CLOUD_API_KEY = os.environ["LLAMA_CLOUD_API_KEY"]

# Parsing instructions passed to LlamaParse
parsing_instructions = """You are given a document containing text, tables, and images. Extract all the contents in their correct format. Extract each table in a correct format and include a detailed explanation of each table before its extracted format.
If an image contains text, extract all the text in the correct format and include a detailed explanation of each image before its extracted text.
Produce the output in markdown text. Extract each page separately in the form of an individual node. Assign the document name and page number to each extracted node in the format: [Creativity and Business, page 7].
Include the document name and page number at the start and end of each extracted page.
"""


def save_to_markdown(output_path, content):
    """
    Save extracted content to a markdown file.

    Parameters:
    output_path (str): The path where the markdown file will be saved.
    content (list): The extracted Document objects to be saved.
    """
    with open(output_path, "w", encoding="utf-8") as md_file:
        for document in content:
            # Write the 'text' attribute of each Document, separated by a blank line
            md_file.write(document.text + "\n\n")


def extract_document(input_path):
    """Parse every PDF in input_path with LlamaParse in premium mode."""
    parser = LlamaParse(
        result_type="markdown",
        # llama_parse names this keyword `parsing_instruction` (singular);
        # verify it against your installed llama_parse version
        parsing_instruction=parsing_instructions,
        premium_mode=True,
        api_key=LLAMA_CLOUD_API_KEY,
        verbose=True,
    )
    file_extractor = {".pdf": parser}
    documents = SimpleDirectoryReader(
        input_path, file_extractor=file_extractor
    ).load_data()
    return documents


input_path = r"C:\Users\h02317\Downloads\docs"  # Replace with your document path
output_file = r"C:\Users\h02317\Downloads\extracted_document.md"  # Output markdown file name

# Extract the documents and save them as markdown
extracted_content = extract_document(input_path)
save_to_markdown(output_file, extracted_content)
```

Here is an example page from the guide Creativity and Business by Pikkala, A. et al. (2015) ("free to copy for non-commercial private or public use with attribution").
Here is the parsed output of this page. LlamaParse efficiently extracted information from all the structures on the page. The notebook shown on the page is an image.

> [Creativity and Business, page 8]
>
> # How to use this book
>
> 1. The book is divided into six chapters and sub-sections dealing with different topics. You can read the book through one chapter and topic at a time, or you can use the checklist of the table of contents to select sections on topics in which you need more information and support.
> 2. Each section opens with a creative entrepreneur's thought on the topic.
> 3. The introduction gives a brief description of the topic.
> 4. Each section contains exercises that help you reflect on your own skills and business idea and develop your business idea further.
>
> ## What is your business idea
>
> "I would like to launch a touring theatre company." Do you have an idea about a product or service you would like to sell? Or do you have a bunch of ideas you have been mull- ing over for some time? This section will help you get a better understanding about your business idea and what competen- cies you already have that could help you implement it, and what types of competencies you still need to gain.
>
> ### EXTRA Business idea development in a nutshell
>
> I found a great definition of what business idea development is from the My Coach online service (Youtube 27 May 2014). It divides the idea development process into three stages: the thinking - stage, the (subconscious) talking - stage, and the customer feedback stage. It is important that you talk about your business idea, as it is very easy to become stuck on a particular path and ignore everything else. You can bounce your idea around with all sorts of people: with a local business advisor; an experienced entrepreneur; or a friend. As you talk about your business idea with others, your subconscious will start working on the idea, and the feedback from others will help steer the idea in the right direction.
>
> ### Recommended reading
>
> Taivas + helvetti (Terho Puustinen & Mika Mäkeläinen: One on One Publishing Oy 2013)
>
> ### Keywords
>
> treasure map; business idea; business idea development
>
> ## EXERCISE: Identifying your personal competencies
>
> Write down the various things you have done in your life and think what kind of competencies each of these things has given you. The idea is not just to write down your education, training and work experience like in a CV; you should also include hobbies, encounters with different types of people, and any life experiences that may have contributed to you being here now with your business idea. The starting circle can be you at any age, from birth to adulthood, depending on what types of experiences you have had time to accumulate. The final circle can be you at this moment.
>
> PERSONAL CAREER PATH SUPPLEMENTARY PERSONAL DEVELOPMENT (e.g. training courses; literature; seminars)
>
> Fill in the "My Competencies" section of the Creative Business Model Canvas:
>
> 5. Each section also includes an EXTRA box with interesting tidbits about the topic at hand.
> 6. For each topic, tips on further reading are given in the grey box.
> 7. The second grey box contains recommended keywords for searching more information about the topic online.
> 8. By completing each section of the one-page business plan or "Creative Business Model Canvas" (page 74), by the end of the book you will have a complete business plan.
> 9. By writing down your business start-up costs (e.g. marketing or logistics) in the price tag box of each section, by the time you get to the Finance and Administration section you will already know your start-up costs and you can enter them in the receipt provided in the Finance and Administration section (page 57).
>
> This book is based on Finnish practices. The authors and the publisher are not responsible for the applicability of factual information to other countries. Readers are advised to check country-specific information on business structures, support organisations, taxation, legislation, etc. Factual information about Finnish practices should also be checked in case of differing interpretations by authorities.
>
> [Creativity and Business, page 8]

The parsed markdown documents are then split into chunks using RecursiveCharacterTextSplitter with chunk_size = 3000 and chunk_overlap = 200.

```python
import os

# Usual import locations; they may differ between LangChain versions
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import UnstructuredMarkdownLoader

CHUNK_SIZE = 3000
CHUNK_OVERLAP = 200


def staticChunker(folder_path):
    """Load every Markdown file in folder_path and split it into overlapping chunks."""
    docs = []
    print(f"Creating chunks. CHUNK_SIZE: {CHUNK_SIZE}, CHUNK_OVERLAP: {CHUNK_OVERLAP}")

    # Loop through all .md files in the folder
    for file_name in os.listdir(folder_path):
        if file_name.endswith(".md"):
            file_path = os.path.join(folder_path, file_name)
            print(f"Processing file: {file_path}")

            # Load documents from the Markdown file
            loader = UnstructuredMarkdownLoader(file_path)
            documents = loader.load()

            # Add file-specific metadata (optional)
            for doc in documents:
                doc.metadata["source_file"] = file_name

            # Split loaded documents into chunks
            text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
            )
            chunked_docs = text_splitter.split_documents(documents)
            docs.extend(chunked_docs)
    return docs
```

Subsequently, a vector store is created in a Chroma database using an embedding model such as the open-source all-MiniLM-L6-v2 model or OpenAI's text-embedding-3-large.

```python
import os

import streamlit as st
from langchain_community.vectorstores import Chroma  # import path may differ by LangChain version

# collection_name and DATA_FOLDER are assumed to be module-level settings defined elsewhere.


def load_or_create_vs(persist_directory):
    """Load the persisted Chroma store if it exists, otherwise build and persist it."""
    # Check if the vector store directory exists
    if os.path.exists(persist_directory):
        print("Loading existing vector store...")
        # Load the existing vector store
        vectorstore = Chroma(
            persist_directory=persist_directory,
            embedding_function=st.session_state.embed_model,
            collection_name=collection_name,
        )
    else:
        print("Vector store not found. Creating a new one...\n")
        docs = staticChunker(DATA_FOLDER)
        print("Computing embeddings...")
        # Create and persist a new Chroma vector store
        vectorstore = Chroma.from_documents(
            documents=docs,
            embedding=st.session_state.embed_model,
            persist_directory=persist_directory,
            collection_name=collection_name,
        )
        print("Vector store created and persisted successfully!")
    return vectorstore
```
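To make the effect of the chunk size and overlap settings concrete, here is a minimal, dependency-free sketch of fixed-window chunking with overlap. It is a simplification and not the splitter's actual algorithm: RecursiveCharacterTextSplitter additionally prefers to break on separators such as paragraph and line boundaries. The function name `split_with_overlap` and the toy sizes are assumptions made for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Toy sizes so the overlap is easy to see; the article uses 3000 and 200.
text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

With the article's settings, each window would advance by 3000 - 200 = 2800 characters, so a sentence that straddles a chunk boundary still appears intact in at least one chunk as long as it is shorter than the overlap.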
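We need to create a graph (see also my earlier article, How to Develop a Free AI Agent with Automatic Internet Search). The graph consists of nodes, which represent steps in the workflow where decisions are made (e.g., a web search or a vector database search). The nodes are connected by edges, which define the flow of decisions and actions (e.g., what the next state is after retrieval). The graph state tracks information as it moves through the graph so that the agent uses the correct data at each step.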
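The relationship between nodes, edges, and state can be sketched without any framework. The following is a minimal illustration under stated assumptions, not the application's code: the node bodies are stand-ins for the real vector search and LLM call, and `NODES`, `EDGES`, and `run_graph` are hypothetical names. A graph library would replace the hand-written loop.

```python
from typing import Callable, Dict, List, TypedDict


class GraphState(TypedDict):
    """Information carried from node to node."""
    question: str
    documents: List[str]
    generation: str


def retrieve(state: GraphState) -> GraphState:
    # Stand-in for a vector database search
    return {**state, "documents": [f"chunk about: {state['question']}"]}


def generate(state: GraphState) -> GraphState:
    # Stand-in for the LLM call that writes the answer
    context = " | ".join(state["documents"])
    return {**state, "generation": f"Answer based on [{context}]"}


# Nodes are functions; edges name the node that runs next
NODES: Dict[str, Callable[[GraphState], GraphState]] = {
    "retrieve": retrieve,
    "generate": generate,
}
EDGES: Dict[str, str] = {"retrieve": "generate", "generate": "END"}


def run_graph(entry: str, state: GraphState) -> GraphState:
    """Run nodes in edge order, threading the state through each one."""
    node = entry
    while node != "END":
        state = NODES[node](state)
        node = EDGES[node]
    return state


final_state = run_graph(
    "retrieve",
    {"question": "How do I register a company?", "documents": [], "generation": ""},
)
print(final_state["generation"])
```

Each node returns a new state rather than mutating shared variables, which is what lets later nodes rely on the fields that earlier nodes populated.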
The entry point of the workflow is a router function, which determines the initial node to execute by analyzing the user's query. The workflow contains the following nodes.

- retrieve
- web_search: invoked when no relevant chunks are found in the retrieved information, or directly by route_question
- hybrid_search: combines the results of the retriever and the Tavily search and populates the "documents" state variable, which is passed on together with the "question" state variable

The entire workflow is depicted in the following figure.

Calling tools

The tools used in this agentic workflow are scraping functions that fetch information from predefined trusted URLs. The difference between Tavily and these tools is that Tavily performs a broader internet search and brings back results from diverse sources, whereas these tools use Python's Beautiful Soup web scraping library to extract information from trusted sources (predefined URLs). This way, we make sure that the information for certain queries is extracted from known, trusted sources. In addition, this information retrieval is completely free.
One such tool is `get_tax_info`.
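A tool of this kind might look roughly like the sketch below. It is not the article's implementation: the standard library's `html.parser` is swapped in for Beautiful Soup so the example has no third-party dependency, the URL in `TRUSTED_URLS` is a placeholder, and the names `TextExtractor`, `extract_text`, and `scrape_trusted` are hypothetical. The page fetcher is passed in as a function so the example runs on an inline HTML string instead of a live site.

```python
from html.parser import HTMLParser

# Hypothetical allow-list; the real URLs used by SBG are not given in the article
TRUSTED_URLS = {"tax": "https://example.org/tax-info"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def scrape_trusted(topic: str, fetch) -> str:
    """Fetch and clean the page registered for `topic`; refuse unknown topics."""
    if topic not in TRUSTED_URLS:
        raise KeyError(f"No trusted source registered for topic: {topic}")
    return extract_text(fetch(TRUSTED_URLS[topic]))


sample_html = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Taxation</h1><p>File the return on time.</p>"
    "<script>track()</script></body></html>"
)
# The lambda stands in for a real HTTP request to the trusted URL
print(scrape_trusted("tax", lambda url: sample_html))
```

Restricting the tool to an allow-list is what distinguishes it from a general web search: a query can only ever be answered from a source that was vetted in advance.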
The generate node creates the final response by invoking a chain with a predefined prompt (LangChain's prompt template class). The rag_prompt receives the state variables and governs the behavior of response generation, including instructions on response style, conversational tone, formatting guidelines, citation rules, hybrid context handling, and context-only focus.

The generate node first retrieves the state variables "question", "documents", and "answer_style" and formats the documents into a single string that serves as the context. Subsequently, it invokes the generation chain with rag_prompt and the response-generation LLM to produce the final answer, which populates the "generation" state variable. app.py uses this state variable to display the generated response in the Streamlit user interface.

With Groq's free API, it is possible to hit a model's rate limit or context window limit. For that case, I extended the generate node to switch models dynamically, cycling through the list of model names in a circular fashion, and to revert to the current model after the response has been generated.

Helper functions

The first helper function is triggered from the Streamlit app during application initialization and whenever a model or state variable is changed. It reinitializes the components and saves the updated state. This function also keeps track of various session variables and prevents redundant initialization.

The following helper functions initialize the answering LLM, the embedding model, the router LLM, and the grading LLM. The list of model names, model_list, is used to keep track of the models during dynamic model switching in the generate node.
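The circular model switching described for the generate node can be sketched as follows. This is an illustration under assumptions rather than the article's code: the model names are placeholders, `call_model` stands in for the real LLM call, a `RuntimeError` stands in for a rate or context limit error, and the `session` dict plays the role of the session state.

```python
from typing import Callable, Dict, List


def generate_with_fallback(session: Dict[str, str], model_list: List[str],
                           call_model: Callable[[str], str]) -> str:
    """Try the current model first, then the others in circular order."""
    original = session["model"]
    start = model_list.index(original)
    try:
        for offset in range(len(model_list)):
            # Wrap around the end of the list with the modulo operator
            session["model"] = model_list[(start + offset) % len(model_list)]
            try:
                return call_model(session["model"])
            except RuntimeError:
                continue  # limit hit on this model, move on to the next one
        raise RuntimeError("every model in model_list failed")
    finally:
        # Revert to the model that was selected before the call
        session["model"] = original


model_list = ["model-a", "model-b", "model-c"]  # placeholder names
session = {"model": "model-b"}


def fake_call(name: str) -> str:
    # Simulate the currently selected model hitting its rate limit
    if name == "model-b":
        raise RuntimeError("rate limit reached")
    return f"answer from {name}"


answer = generate_with_fallback(session, model_list, fake_call)
print(answer, "| current model:", session["model"])
```

Restoring the selection in a `finally` block means the user's chosen model stays in effect for the next question even when a fallback model produced this answer.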
Building the workflow

Now the graph is instantiated with the graph state, the nodes are added, the conditional entry point is set using route_question, and the edges are defined to establish the flow between nodes. Finally, the workflow is compiled into an executable app for use in the Streamlit interface. The conditional entry point in the workflow uses route_question, and the conditional edges describe which node to transition to.

Streamlit interface

The Streamlit application in app.py provides an interactive interface for asking questions and displaying responses, with dynamic settings for model selection, answer style, and query-specific tools. The initialize_app function, imported from agentic_rag.py, initializes all session variables, including all the LLMs, the embedding model, and the other options selected in the left sidebar.

The print statements in agentic_rag.py are captured by redirecting sys.stdout to an io.StringIO buffer. The content of this buffer is then displayed using the text_area component in Streamlit.
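Capturing print output in a buffer can be sketched with the standard library alone. The helper `run_and_capture` and the function `noisy_workflow` are hypothetical names for illustration; `contextlib.redirect_stdout` temporarily points `sys.stdout` at the `io.StringIO` buffer and restores it afterwards.

```python
import io
from contextlib import redirect_stdout


def run_and_capture(func, *args, **kwargs):
    """Call func and return its result together with everything it printed."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):  # sys.stdout points at the buffer inside this block
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


def noisy_workflow(question):
    # Stand-in for workflow code that reports progress with print statements
    print(f"Routing question: {question}")
    print("Retrieving documents...")
    return "final answer"


result, log_text = run_and_capture(noisy_workflow, "What is VAT?")
```

In a Streamlit app, the captured text could then be shown with something like `st.text_area("Debug log", log_text)`; the label here is an arbitrary choice.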
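Here is a snapshot of the Streamlit interface:

The following image shows the answer generated with the 'concise' answer style. The router invokes the retriever (vector search), and the grading function finds all the retrieved chunks relevant. Hence, the decision to generate the answer through the generate node is taken by the route_after_grading node.

The following image shows the answer using the 'explanatory' answer style. As instructed in rag_prompt, the LLM elaborates the answer with more explanations.

The following image shows the router triggering the get_license_info tool in response to the question.

The following image shows the web search invoked when no relevant chunks are found in the vector search.

The following image shows the case in which the route_question node finds the internet_search_enabled state flag set to 'True' and routes the question to the hybrid_search node.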
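The routing behavior seen in these screenshots can be summarized in a small sketch. It is only an approximation under stated assumptions: in the application the router is an LLM, whereas the keyword checks below merely stand in for its decision, and the priority given to the internet_search_enabled flag over the tools is an assumption. The node names follow the ones mentioned in the text.

```python
def route_question(state: dict) -> str:
    """Pick the entry node. Keyword checks stand in for the router LLM."""
    question = state["question"].lower()
    if state.get("internet_search_enabled"):
        return "hybrid_search"
    if "license" in question or "licence" in question:
        return "get_license_info"
    if "tax" in question:
        return "get_tax_info"
    return "retrieve"


def route_after_grading(state: dict) -> str:
    """Generate when relevant chunks survived grading, else fall back to web search."""
    return "generate" if state.get("documents") else "web_search"


print(route_question({"question": "Do I need a license for a cafe?"}))
print(route_after_grading({"documents": []}))
```

Keeping the routers as plain functions that return a node name makes each decision easy to test in isolation before wiring it into the graph.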
Directions for extension

This application can be enhanced in several directions.

If you liked the article, please clap for it (multiple times), write a comment, and follow me on Medium and LinkedIn.
That's all folks!