
Alibaba Cloud AnalyticDB (ADB) + LLM: Building an enterprise-specific Chatbot in the AIGC era​

王林 · 2023-05-23

Introduction

How do you use a vector database plus a large language model (LLM) to build a company-specific Chatbot that understands you better?

1. Why does a Chatbot need a large language model + vector database?

The most striking technology product this spring has been ChatGPT. Through large language models (LLMs), generative AI has shown language abilities highly similar to a human's; AI is no longer out of reach and is entering people's work and lives. This has revitalized an AI field that had been dormant for some time, and countless practitioners are rushing toward the opportunities of the next era of change. By one incomplete count, in just four months more than 4,000 generative-AI financings were completed in the United States. In the next generation of technology, generative AI has become something neither capital nor enterprises can ignore, and it increasingly needs higher-grade infrastructure to support its development.


Large models are good at relatively general questions, but in vertical, professional fields they suffer from insufficient depth and timeliness of knowledge. So how can companies seize the opportunity and build services for vertical fields? There are currently two approaches. The first is fine-tuning: building a vertical-domain model on top of a large model. Its overall cost is high and its update frequency low, so it does not suit every enterprise. The second is to build the enterprise's own knowledge assets in a vector database and combine the large model with the vector database to build deep services for the vertical field; in essence, this uses the database for prompt engineering. An enterprise can, for example, use vertical collections of legal provisions and precedents to build legal-tech services for specific areas of the legal industry. The legal-tech company Harvey, for instance, is building a "Copilot for Lawyers" to improve legal drafting and research. Extracting vector features from enterprise knowledge-base documents and real-time information, storing them in a vector database, and combining them with an LLM makes the Chatbot's (Q&A bot's) answers more professional and timely, yielding an enterprise-specific Chatbot.


How can a Chatbot based on a large language model better answer questions about current events? Visit the "Alibaba Cloud Yaochi Database" video account to watch the demo.


The rest of this article focuses on the principles and workflow of building an enterprise-specific Chatbot with a large language model (LLM) + vector database, and on ADB-PG's core capabilities in this scenario.

2. What is a vector database?

In the real world, most data is unstructured: images, audio, video, and text. With the rise of smart cities, short video, personalized product recommendation, visual product search, and similar applications, unstructured data has grown explosively. To process it, we typically use AI techniques to extract its features and convert them into feature vectors, then analyze and retrieve those vectors. A database that can store, analyze, and retrieve feature vectors is therefore called a vector database.

For fast retrieval of feature vectors, vector databases generally build vector indexes. The vector indexes we usually talk about belong to ANNS (Approximate Nearest Neighbor Search). Its core idea is to no longer insist on returning only the exact closest results, but to search only for data items that are likely to be near neighbors — sacrificing a little accuracy, within an acceptable range, in exchange for much better retrieval efficiency. This is also the biggest difference between vector databases and traditional databases.
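To make the tradeoff concrete, here is a minimal exact nearest-neighbor search in plain Python (all names are illustrative). An ANNS index returns approximately these results while visiting only a small fraction of the stored vectors:

```python
import math

def euclidean(a, b):
    # Exact Euclidean distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn(query, vectors, k=2):
    # Brute force: score every stored vector, then sort.
    # An ANNS index skips most of this work, so it may miss
    # a true neighbor (recall below 100%) in exchange for speed.
    ranked = sorted(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))
    return ranked[:k]

vectors = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.2], [5.0, 5.0]]
print(exact_knn([0.0, 0.1], vectors, k=2))  # -> [0, 2]
```

Exact search scans every row, which is why large collections need an approximate index instead.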


Currently, in production environments, there are two main practical ways to apply ANNS vector indexes conveniently. One is to turn the ANNS vector index into a separate service providing index creation and retrieval, forming a dedicated vector database; the other is to integrate the ANNS vector index into a traditional structured database, forming a DBMS with vector-retrieval capability. In real business scenarios, a dedicated vector database usually has to be used alongside other traditional databases, which causes common problems such as data redundancy, excessive data movement, and data-consistency issues. Compared with a true DBMS, a dedicated vector database requires extra specialized maintenance and cost, and its query language, programmability, scalability, and tool integration are very limited.


A DBMS with built-in vector retrieval is different. It is first of all a complete, modern database platform that meets application developers' functional needs; its integrated vector retrieval then covers what a dedicated vector database does, while vector storage and retrieval inherit the DBMS's strengths: ease of use (processing vectors directly with SQL), transactions, high availability, high scalability, and so on.


The ADB-PG introduced in this article is such a DBMS with vector retrieval: it provides not only vector search but also one-stop database capabilities. Before introducing ADB-PG's specific capabilities, let's first look at the Chatbot-building process and principles behind the demo video.

3. LLM + ADB-PG: Building an enterprise-specific Chatbot

Case: Local Knowledge Q&A System

In the earlier demo video, which combined the large language model (LLM) with ADB-PG to comment on current news, we asked the LLM "What is Tongyi Qianwen?". If we ask the LLM directly, the answer is meaningless, because the LLM's training data contains nothing about it. But when we use the vector database as local knowledge storage and let the LLM automatically extract the relevant knowledge, it answers "What is Tongyi Qianwen?" correctly.


With local knowledge drawn from documents, PDFs, emails, web content, and other sources, the same approach supports many scenarios. For example:

  1. Combine the latest flight information, trending check-in spots, and other travel-guide resources to build a travel assistant — for example, answering where is best to travel next week and how to do it most economically.
  2. Commentary and summaries of sports events and current hot news — for example, who is the MVP of today's NBA game.
  3. Education: interpreting the latest hot topics — for example, explaining what AIGC is, what Stable Diffusion is and how to use it, and more.
  4. Finance: quickly analyzing financial reports across industries to build a financial advisory assistant.
  5. Customer-service bots for professional fields, and so on.


Implementation principle

A local knowledge Q&A system (Local QA System) mainly combines the reasoning ability of a large language model with the storage and retrieval ability of a vector database. Vector retrieval fetches the most semantically relevant fragments; the large language model then reasons over the question in the context of those fragments to reach a correct conclusion. Two main processes are involved:

a. Back-end data processing and storage process

b. Front-end Q&A process

At the same time, it depends on two underlying modules:

1. Inference module based on large language model

2. Vector data management module based on vector database


(Figure: Local QA system on LLM & VectorStore)

Back-end data processing and storage process

The black part of the figure above is the back-end data-processing flow. It mainly handles embedding the original data and storing the embeddings, together with the original data, in the vector database ADB-PG. Here you only need to pay attention to the blue dashed box in the figure: the black processing modules and the ADB-PG vector database.

  • Step 1: Extract all text from the original documents, then split it by semantics into chunks — text passages that each express a complete idea. Additional work such as metadata extraction and sensitive-information detection can also happen here.
  • Step 2: Feed the chunks to the embedding model to compute an embedding for each chunk.
  • Step 3: Store each embedding together with its original chunk in the vector database.
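The three steps above can be sketched in a few lines of toy Python. The chunker splits on blank lines, the embedding function is only a deterministic stand-in for a real model such as text2vec, and a plain list stands in for the ADB-PG table:

```python
def split_into_chunks(text):
    # Step 1: naive semantic chunking - one chunk per paragraph.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def embed(chunk, dim=8):
    # Step 2: stand-in for a real embedding model (e.g. text2vec).
    # Builds a deterministic, normalized toy vector from character codes.
    vec = [0.0] * dim
    for i, ch in enumerate(chunk):
        vec[i % dim] += ord(ch)
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

vector_db = []  # stand-in for the ADB-PG chunks table

def store(text):
    # Step 3: store each embedding together with its original chunk.
    for chunk in split_into_chunks(text):
        vector_db.append({"content": chunk, "embedding": embed(chunk)})

store("Tongyi Qianwen is a large language model.\n\n"
      "ADB-PG is a cloud-native data warehouse.")
print(len(vector_db))  # -> 2
```

In the real pipeline, each `store` call becomes an INSERT into ADB-PG with the chunk text and its vector column.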
Front-end Q&A process

This process has three main parts: 1) question refining; 2) vector retrieval of the most relevant knowledge; 3) reasoning and answering. Focus on the orange part of the figure. The principle alone can be a bit abstract, so we will walk through the earlier example.

(Figure: Local QA system on LLM & VectorStore)

Part 1: Question Refining

This part is optional. It exists because some questions depend on context: a user's new question on its own may not let the LLM understand the user's intent.


For example, suppose the user's new question is "What can it do?". The LLM does not know what "it" refers to; it must combine the previous chat history — such as "What is Tongyi Qianwen?" — to derive the standalone question the user actually wants answered: "What can Tongyi Qianwen do?". The LLM cannot correctly answer the vague "What can it do?", but it can correctly answer the standalone version. If your question is already self-contained, this part is unnecessary.
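A common way to implement this refinement is to ask the LLM itself to rewrite the question given the chat history. The sketch below only builds the prompt; the template wording is our assumption, and the call to a real model is left out:

```python
CONDENSE_PROMPT = """Given the chat history and a follow-up question, \
rewrite the follow-up as a standalone question.

Chat history:
{history}

Follow-up question: {question}
Standalone question:"""

def build_condense_prompt(history, question):
    # Fill the template; the result is sent to the LLM, which should
    # return something like "What can Tongyi Qianwen do?".
    return CONDENSE_PROMPT.format(history="\n".join(history), question=question)

prompt = build_condense_prompt(
    ["User: What is Tongyi Qianwen?",
     "Bot: Tongyi Qianwen is a large language model..."],
    "What can it do?",
)
print(prompt)
```

Skipping this step is fine when the incoming question is already self-contained.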

Once we have the standalone question, we compute its embedding and then search the vector database for the most similar vectors, finding the most relevant content. This happens inside the Part 2 retrieval plugin.

Part 2: Vector Retrieval


The standalone question is embedded by the text2vec model. With that embedding we search the data stored in advance in the vector database. For example, suppose ADB-PG stores the content below; the query vector retrieves the most similar content or knowledge — say the first and third items: "Tongyi Qianwen is ...", "Tongyi Qianwen can help us xxx".
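What this retrieval step computes can be shown in a few lines (the embeddings here are made up; in the real system they come from the text2vec model and the search runs inside ADB-PG's vector index):

```python
def cosine(a, b):
    # Cosine similarity: higher means more semantically similar.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

rows = [  # chunks already stored, with made-up embeddings
    {"content": "Tongyi Qianwen is ...", "embedding": [0.9, 0.1, 0.0]},
    {"content": "ADB-PG supports vector search.", "embedding": [0.0, 0.9, 0.4]},
    {"content": "Tongyi Qianwen can help us xxx", "embedding": [0.8, 0.2, 0.1]},
]

def top_k(query_embedding, k=2):
    # Return the k chunks whose embeddings are most similar to the query.
    ranked = sorted(rows, key=lambda r: cosine(query_embedding, r["embedding"]),
                    reverse=True)
    return [r["content"] for r in ranked[:k]]

print(top_k([1.0, 0.0, 0.0]))  # the two Tongyi Qianwen chunks
```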

Part 3: Inference and Answering


Having obtained the most relevant knowledge, we let the LLM reason over the standalone question together with that knowledge to produce the final answer. Here it answers "What can Tongyi Qianwen do?" by combining the most relevant snippets, such as "Tongyi Qianwen is ..." and "Tongyi Qianwen can help us xxx". The LLM's final reasoning looks roughly like this:

(Screenshot: the LLM's final answer)
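The prompt that drives this final step typically looks like the sketch below (the template wording is our assumption, not the exact prompt used in the demo):

```python
ANSWER_PROMPT = """Answer the question using only the context below. \
If the context is not sufficient, say you don't know.

Context:
- {context}

Question: {question}
Answer:"""

def build_answer_prompt(chunks, question):
    # Join the retrieved chunks into the context section of the prompt;
    # the filled template is then sent to the LLM.
    return ANSWER_PROMPT.format(context="\n- ".join(chunks), question=question)

prompt = build_answer_prompt(
    ["Tongyi Qianwen is ...", "Tongyi Qianwen can help us xxx"],
    "What can Tongyi Qianwen do?",
)
print(prompt)
```

Constraining the model to the supplied context is what keeps the answer grounded in the enterprise knowledge base.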



4. ADB-PG: A one-stop enterprise knowledge database with built-in vector search and full-text search

Why is ADB-PG a good knowledge database for a Chatbot? ADB-PG is a cloud-native data warehouse with massively parallel processing. It supports both row store and column store, providing high-performance offline data processing as well as high-concurrency online analysis and querying of massive data. ADB-PG is thus a data-warehouse platform that supports distributed transactions and mixed workloads, and it also handles a variety of unstructured and semi-structured data sources: its vector-retrieval plugin enables high-performance vector search and analysis over unstructured data such as images, audio, video, and text, and it offers full-text search and analysis over semi-structured data such as JSON.


In AIGC scenarios, therefore, ADB-PG can serve as a vector database, meeting the needs of vector storage and retrieval, while also storing and querying other structured data and providing full-text search — a one-stop solution for AIGC business applications. Below we introduce three ADB-PG capabilities in detail: vector retrieval, fusion retrieval, and full-text retrieval.


ADB-PG's vector retrieval and fusion retrieval were first launched on the public cloud in 2020 and have been widely used in face recognition. Because ADB-PG's vector capability is built into the data-warehouse platform, it inherits almost all the advantages of a DBMS: ANSI SQL, ACID transactions, high availability, failure recovery, point-in-time recovery, programmability, scalability, and so on. It supports vector similarity search by dot-product distance, Hamming distance, and Euclidean distance; these functions are already widely used in face recognition, product recognition, and text-based semantic search. With AIGC taking off, they provide a solid foundation for text-based Chatbots. The ADB-PG vector-retrieval engine also uses Intel SIMD instructions to perform vector similarity matching extremely efficiently.
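The three distance metrics mentioned above are easy to state precisely. The plain-Python definitions below are only for illustration; inside the engine they are evaluated with SIMD instructions:

```python
def dot_product_distance(a, b):
    # A larger dot product means more similar; negate so smaller = closer.
    return -sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    # Straight-line distance between the two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def hamming_distance(a, b):
    # For binary vectors: the number of positions that differ.
    return sum(1 for x, y in zip(a, b) if x != y)

print(dot_product_distance([1.0, 2.0], [3.0, 4.0]))  # -> -11.0
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))    # -> 5.0
print(hamming_distance([1, 0, 1, 1], [1, 1, 1, 0]))  # -> 2
```

Which metric to use depends on the embedding model: cosine/dot-product for normalized text embeddings, Hamming for binary signatures.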


Let's illustrate ADB-PG's vector retrieval and fusion retrieval with a concrete example. Suppose we have a text knowledge base built by splitting a batch of articles into chunks, converting the chunks into embedding vectors, and loading them into the database. The chunks table contains the following fields:

(Table: fields of the chunks table)


The corresponding table-creation DDL is as follows:

(Screenshot: CREATE TABLE DDL for the chunks table)

To accelerate vector retrieval, we also need to create a vector index:


(Screenshot: vector index DDL)

To accelerate fused vector + structured queries, we also create indexes on commonly used structured columns:


(Screenshot: structured-column index DDL)


When inserting data, we can use the standard SQL INSERT syntax:

(Screenshot: INSERT example)



In this example, to find the source article for a given piece of text, we can query directly with vector search. The SQL looks like this:

(Screenshot: vector search SQL)



Similarly, if we need to find the source article of a given text within the last month, we can use fusion retrieval. The SQL looks like this:

(Screenshot: fusion retrieval SQL)
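Since the original SQL screenshots are not reproduced here, what the fusion query computes can be emulated in plain Python: filter on the structured column first, then rank by vector distance. The schema and data below are made up; in ADB-PG both steps run in a single SQL statement using the structured and vector indexes together:

```python
from datetime import date

rows = [
    {"content": "old chunk",   "embedding": [1.0, 0.0], "ctime": date(2023, 1, 1)},
    {"content": "new chunk A", "embedding": [0.9, 0.1], "ctime": date(2023, 5, 10)},
    {"content": "new chunk B", "embedding": [0.0, 1.0], "ctime": date(2023, 5, 12)},
]

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def fusion_search(query_vec, since, k=1):
    # Equivalent to: WHERE ctime >= since ORDER BY distance LIMIT k
    candidates = [r for r in rows if r["ctime"] >= since]
    candidates.sort(key=lambda r: euclidean(query_vec, r["embedding"]))
    return [r["content"] for r in candidates[:k]]

print(fusion_search([1.0, 0.0], since=date(2023, 4, 23)))  # -> ['new chunk A']
```

The structured filter shrinks the candidate set before the vector ranking, which is exactly what the structured-column indexes accelerate.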



From the examples above, it is clear that using vector retrieval and fusion retrieval in ADB-PG is as convenient as using a traditional database, with no learning curve. We have also made many targeted optimizations for vector retrieval — vector data compression, parallel vector-index construction, parallel multi-partition vector search, and more — which we will not detail here.


ADB-PG also has rich full-text search capabilities, supporting complex combined conditions and result ranking. For Chinese datasets, ADB-PG supports Chinese word segmentation, processing and segmenting Chinese text efficiently and customizably; it can also use indexes to accelerate full-text retrieval and analysis. These capabilities are all useful in AIGC scenarios — for example, a business can perform two-way recall over knowledge-base documents by combining the vector retrieval described above with full-text retrieval.

The knowledge-retrieval part thus combines traditional keyword full-text search with vector feature search. Keyword full-text search guarantees query precision, while vector search adds generalization and semantic matching: beyond literal matches, it recalls semantically matching knowledge, lowers the no-result rate, and supplies the large model with richer context, which helps it summarize and generalize.
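A minimal sketch of two-way recall: merge keyword hits with vector hits and de-duplicate. The scoring and fusion strategy here are simplified assumptions; real systems often use weighted or reciprocal-rank fusion:

```python
def keyword_recall(query, docs):
    # Full-text style recall: every doc containing all query terms.
    terms = query.lower().split()
    return [d for d in docs if all(t in d.lower() for t in terms)]

def vector_recall(query_vec, doc_vecs, docs, k=2):
    # Vector recall: top-k docs by squared Euclidean distance.
    order = sorted(range(len(docs)),
                   key=lambda i: sum((x - y) ** 2
                                     for x, y in zip(query_vec, doc_vecs[i])))
    return [docs[i] for i in order[:k]]

def two_way_recall(query, query_vec, docs, doc_vecs):
    # Union of both channels, keyword hits first, duplicates removed.
    merged = keyword_recall(query, docs) + vector_recall(query_vec, doc_vecs, docs)
    seen, out = set(), []
    for d in merged:
        if d not in seen:
            seen.add(d)
            out.append(d)
    return out

docs = ["vector search in ADB-PG", "full-text search basics",
        "semantic matching with embeddings"]
vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.2]]
print(two_way_recall("vector search", [1.0, 0.0], docs, vecs))
```

Note how the vector channel recalls the "semantic matching" document even though it contains neither query keyword — this is the lower no-result rate the text describes.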

5. Summary

Pulling together everything above: if we liken a knowledgeable Chatbot to a person, then the large language model is the knowledge and reasoning ability that person acquired from books and public material in many fields before graduating from college. A Chatbot built only on the large language model can answer questions about things known before "graduation". But when a question involves a specialized professional field (whose material is proprietary to an organization and not public) or a concept that emerged later (not yet "born" at graduation), the knowledge gained in school (the pre-trained large language model) is not enough. After graduation, one needs channels for acquiring new knowledge — such as a professional learning database for one's job — and must combine them with one's own learning and reasoning ability to respond professionally.

Likewise, a Chatbot needs to combine the learning and reasoning abilities of a large language model with a one-stop database such as ADB-PG that provides both vector retrieval and full-text retrieval (storing an organization's proprietary, up-to-date knowledge documents and their vector features), so that it can answer questions more professionally and with better timeliness based on the knowledge in the database.


Statement: this article is reproduced from 51cto.com.