
How to use LangChain and OpenAI API for document analysis

WBOY | 2023-11-23 11:14:46

Translator: Bugatti | Reviewer: Chonglou

Extracting insights from documents and data is critical for making informed decisions. However, privacy concerns can arise when dealing with sensitive information. By combining LangChain with the OpenAI API, you can analyze your local documents without uploading them to the internet.

This works by keeping your data local, using embeddings and vectorization for analysis, and executing the whole process within your own environment. In addition, OpenAI does not use data submitted by customers through its API to train its models or improve its services.

Set Up the Environment

Create a new Python virtual environment. This ensures there are no library version conflicts. Then run the following terminal command to install the required libraries.

pip install langchain openai tiktoken faiss-cpu pypdf

Here is how you will use each library:

  • LangChain: to create and manage language chains for text processing and analysis. It provides modules for document loading, text splitting, embeddings, and vector storage.
  • OpenAI: to run queries and get results from the language model.
  • tiktoken: to count the tokens (units of text) in a given text. This is useful because OpenAI charges based on the number of tokens you use, so you can track the token count during API interactions.
  • FAISS: to create and manage a vector store, allowing fast retrieval of similar vectors based on their embeddings.
  • PyPDF: to extract text from PDFs. It helps load PDF files and extract their text for further processing.

After installing all the libraries, your environment is ready.
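To see why token counting matters for billing, here is a minimal sketch that estimates a token count and cost. The four-characters-per-token ratio and the price are illustrative assumptions only; for real numbers, use tiktoken's actual tokenizer and OpenAI's current pricing page.

```python
# Rough token/cost estimate. The 4-chars-per-token heuristic and the
# price below are illustrative assumptions, not tiktoken or real pricing.
def estimate_tokens(text: str) -> int:
    # Heuristic: roughly 4 characters per token for English text
    return max(1, len(text) // 4)

def estimate_cost(text: str, price_per_1k_tokens: float = 0.002) -> float:
    # Cost scales linearly with token count
    return estimate_tokens(text) / 1000 * price_per_1k_tokens

sample = "LangChain makes it easy to analyze documents locally."
print(estimate_tokens(sample), estimate_cost(sample))
```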

Get an OpenAI API Key

When you make a request to the OpenAI API, you need to include an API key as part of the request. This key allows the API provider to verify that the request comes from a legitimate source and that you have the permissions required to access its functionality.

To obtain an OpenAI API key, go to the OpenAI platform. Then, under your account profile in the upper right corner, click "View API keys". The API keys page will appear. Click the "Create new secret key" button, name the key, and click "Create new key". OpenAI will generate the API key, which you should copy and keep in a safe place. For security reasons, you will not be able to view it again through your OpenAI account. If you lose the key, you will need to generate a new one.

Import the Required Libraries

To use the libraries installed in the virtual environment, you need to import them.

from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

Note that you import the dependencies from LangChain, which lets you use specific features of the LangChain framework.

Load the Document for Analysis

First, create a variable to hold your API key. You will use this variable later in the code for authentication.

# Hardcoded API key
openai_api_key = "Your API key"

Hardcoding the API key is not recommended if you plan to share your code with third parties. For production-grade code that you intend to distribute, use an environment variable instead.
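As a minimal sketch of the environment-variable approach, you could read the key like this. The variable name OPENAI_API_KEY is a common convention, but any name works; set it in your shell first (e.g. export OPENAI_API_KEY="sk-...").

```python
import os

def get_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    # Read the key from the environment instead of hardcoding it,
    # and fail early with a clear message if it is missing.
    key = os.environ.get(var_name)
    if key is None:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key
```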

Next, create a function that loads the document. The function should load a PDF or a text file. If the document is neither, the function raises a ValueError.

def load_document(filename):
    if filename.endswith(".pdf"):
        loader = PyPDFLoader(filename)
        documents = loader.load()
    elif filename.endswith(".txt"):
        loader = TextLoader(filename)
        documents = loader.load()
    else:
        raise ValueError("Invalid file type")

After loading the documents, create a CharacterTextSplitter. This splitter divides the loaded documents into smaller chunks based on characters.


    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=30, separator="\n")
    return text_splitter.split_documents(documents=documents)

Splitting the documents ensures the chunks are of a manageable size while still being connected by some overlapping context. This is useful for tasks such as text analysis and information retrieval.
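To illustrate what chunking with overlap does, here is a minimal pure-Python sketch. This is not LangChain's actual implementation (which also splits on the separator); it only shows how consecutive chunks come to share context.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Slide a window of chunk_size characters across the text,
    # stepping forward by (chunk_size - chunk_overlap) each time
    # so consecutive chunks share some context.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # consecutive chunks share two characters
```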

Query the Document

You need a way to query the uploaded document to gain insights from it. To do so, create a function that takes a query string and a retriever as input. It then creates a RetrievalQA instance using the retriever and an instance of the OpenAI language model.

def query_pdf(query, retriever):
    qa = RetrievalQA.from_chain_type(llm=OpenAI(openai_api_key=openai_api_key), chain_type="stuff", retriever=retriever)
    result = qa.run(query)
    print(result)

The function uses the created QA instance to run the query and print the result.

Create the Main Function

The main function controls the overall program flow. It takes user input for the document's filename and loads that document. It then creates an OpenAIEmbeddings instance for text embeddings and constructs a vector store from the loaded documents and the embeddings. It saves this vector store to a local file.

Next, it loads the persisted vector store back from the local file. It then enters a loop in which the user can input queries. The main function passes these queries, together with the persisted vector store's retriever, to the query_pdf function. The loop continues until the user types "exit".

def main():
    filename = input("Enter the name of the document (.pdf or .txt):\n")
    docs = load_document(filename)
    embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
    vectorstore = FAISS.from_documents(docs, embeddings)
    vectorstore.save_local("faiss_index_constitution")
    persisted_vectorstore = FAISS.load_local("faiss_index_constitution", embeddings)
    query = input("Type in your query (type 'exit' to quit):\n")
    while query != "exit":
        query_pdf(query, persisted_vectorstore.as_retriever())
        query = input("Type in your query (type 'exit' to quit):\n")

Embeddings capture the semantic relationships between words. A vector is a form in which a piece of text can be represented.

This code converts the text data in the document into vectors using the embeddings generated by OpenAIEmbeddings. It then indexes these vectors using FAISS, so that similar vectors can be retrieved and compared efficiently. This is what makes analysis of the uploaded document possible.
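To show the idea behind similarity search over embeddings, here is a minimal sketch using cosine similarity over toy vectors. Real OpenAI embeddings have over a thousand dimensions; the three-dimensional vectors and chunk names below are illustrative assumptions only.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three text chunks (illustrative values).
store = {
    "chunk about cats": [0.9, 0.1, 0.0],
    "chunk about dogs": [0.8, 0.3, 0.1],
    "chunk about tax law": [0.0, 0.1, 0.9],
}

query_vector = [0.9, 0.1, 0.0]  # pretend this embeds a query about cats

# Retrieve the most similar chunk, as a vector store would.
best = max(store, key=lambda k: cosine_similarity(query_vector, store[k]))
print(best)
```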

Finally, use the if __name__ == "__main__" construct to call the main function when the program is run directly.

if __name__ == "__main__":
    main()

This is a command-line application. As an extension, you could use Streamlit to add a web interface to it.

Perform the Document Analysis

To perform document analysis, store the document you want to analyze in the same folder as the project, then run the program. It will ask for the name of the document to analyze. Enter its full name, then enter queries for the program to analyze.

The screenshot below shows the results of analyzing a PDF.

[Screenshot: results of analyzing a PDF]

The output below shows the results of analyzing a text file containing source code.

[Screenshot: results of analyzing a text file of source code]

Make sure the file you want to analyze is in PDF or text format. If your documents are in other formats, you can use online tools to convert them to PDF. The complete source code is available in the GitHub repository: https://github.com/makeuseofcode/Document-analysis-using-LangChain-and-OpenAI

Original title: How to Analyze Documents With LangChain and the OpenAI API. Author: Denis Kuria


The above is the detailed content of "How to use LangChain and OpenAI API for document analysis".

Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to have it deleted.