Long context cannot kill RAG: SQL + vectors drive a new paradigm for large models and big data, and the MyScale AI database is officially open source

PHPz

Apr 12, 2024 am 08:04 AM

The combination of large models and AI databases has become a powerful tool for reducing the cost and improving the efficiency of large models, and for making big data truly intelligent.

The wave of large language models (LLMs) has been surging for more than a year, with models such as GPT-4, Gemini 1.5 and Claude 3 taking the stage one after another and becoming the well-deserved center of attention. On the LLM track, some research focuses on increasing model parameters, and some is devoted to multimodality. Among these directions, the context length an LLM can process has become an important indicator for evaluating models, since a longer context means stronger retrieval performance within the model. For example, the ability of some models to process up to 1 million tokens in one pass has led many researchers to ask whether Retrieval-Augmented Generation (RAG) is still necessary.

Some people think that RAG will be killed by long-context models, but this view has been refuted by many researchers and architects. They argue that, on the one hand, data structures are complex and change regularly, and much data has an important time dimension, all of which may be too complex for an LLM. On the other hand, it is unrealistic to put all the massive heterogeneous data of enterprises and industries into the context window. Combining large models with AI databases injects professional, accurate and real-time information into generative AI systems, greatly reducing hallucinations and improving the practicality of the system. At the same time, the data-centric LLM approach can use the massive data management and query capabilities of AI databases to significantly reduce the cost of training and fine-tuning large models, and to support few-shot tuning of the system in different scenarios. In summary, the combination of large models and AI databases not only reduces costs and increases efficiency for large models, but also makes big data truly intelligent.

After several years of development and iteration, MyScaleDB is finally open source

The emergence of RAG allows LLMs to accurately extract information from large-scale knowledge bases and generate real-time, professional and insightful answers. Along with this, the vector database, the core component of a RAG system, has also developed rapidly. By design philosophy, vector databases can be roughly divided into three categories: dedicated vector databases, retrieval systems that combine keywords and vectors, and SQL vector databases.

Dedicated vector databases represented by Pinecone/Weaviate/Milvus were designed and built for vector retrieval from the beginning. Their vector retrieval performance is excellent, but their general-purpose data management functions are weak.
Keyword and vector retrieval systems represented by Elasticsearch/OpenSearch are widely used in production because of their complete keyword retrieval functions. However, they consume a lot of system resources, and the accuracy and performance of their joint keyword and vector queries are unsatisfactory.
SQL vector databases represented by pgvector (the vector search plug-in for PostgreSQL) and the MyScale AI database are based on SQL and have powerful data management functions. However, because of the disadvantages of PostgreSQL's row storage and the limitations of its vector algorithms, pgvector has low accuracy in complex vector queries.

MyScale AI Database (MyScaleDB) is built on a high-performance columnar SQL database and uses a self-developed vector index algorithm with high performance and high data density. Its retrieval and storage engines have been deeply developed and optimized for joint SQL and vector queries. It is the world's first SQL vector database product whose comprehensive performance and cost-effectiveness greatly exceed those of dedicated vector databases.

Thanks to the long-term polishing of SQL databases in massive structured data scenarios, MyScaleDB supports efficient storage and querying of both massive vector data and structured data, including multiple data types such as strings, JSON, spatial data and time series. It will also launch powerful inverted index and keyword retrieval functions in the near future to further improve the accuracy of RAG systems and replace systems such as Elasticsearch.

After nearly 6 years of development and several version iterations, MyScaleDB has recently been open sourced. All developers and enterprise users are welcome to star the project on GitHub and explore a new way of using SQL to build production-level AI applications!

Project address: https://github.com/myscale/myscaledb

Fully compatible with SQL, improved accuracy, cost reduction

With complete SQL data management capabilities and powerful, efficient storage and querying of structured, vector and heterogeneous data, MyScaleDB is expected to become the first AI database that is truly oriented to large models and big data.

Native compatibility with SQL and vectors

In the half century since the birth of SQL, and despite waves such as NoSQL and big data, the ever-evolving SQL database still holds a major share of the data management market. Even retrieval and big data systems such as Elasticsearch and Spark have successively added SQL interfaces. Although dedicated vector databases have been optimized and designed for vectors, their query interfaces usually lack standardization and do not offer a high-level query language, which leaves the interfaces with weak generality. For example, Pinecone's query interface does not even allow specifying the fields to be retrieved, let alone common database functions such as paging and aggregation.

The weak generality of the interface means that it changes frequently, which increases the learning cost. The MyScale team believes that a systematically optimized SQL and vector system can maintain complete SQL support while ensuring high-performance vector retrieval, and the results of their open source evaluation have fully demonstrated this.

In real, complex AI application scenarios, the combination of SQL and vectors can greatly increase the flexibility of data modeling and simplify the development process. For example, in the Science Navigator project, a cooperation between the MyScale team and the Beijing Institute of Scientific Intelligence, MyScaleDB is used to retrieve massive scientific literature data and perform intelligent question answering. There are more than 10 main SQL tables, many of which have vector and inverted indexes and are associated with each other through primary and foreign keys. In actual queries, the system also performs joint queries over structured, vector and keyword data, as well as related queries across several tables. This kind of modeling and association is difficult to achieve in a dedicated vector database, which in turn leads to slow iteration of the final system, inefficient querying and difficult maintenance.

Science Navigator main table structure diagram (bold columns have vector indexes or inverted indexes)
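The kind of modeling described above, with several tables linked by primary and foreign keys and queried together with a vector ranking, can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module rather than MyScaleDB, the table and column names (papers, chunks, embedding) are hypothetical and not taken from Science Navigator, and the distance is computed by brute force where a real SQL vector database would use a vector index.

```python
import json
import math
import sqlite3


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


conn = sqlite3.connect(":memory:")
# Expose the distance as a SQL function so that ranking lives in the query.
conn.create_function(
    "cosine_distance", 2,
    lambda stored, query: cosine_distance(json.loads(stored), json.loads(query)),
)
conn.executescript("""
    CREATE TABLE papers (
        paper_id INTEGER PRIMARY KEY,
        title    TEXT NOT NULL,
        year     INTEGER NOT NULL
    );
    CREATE TABLE chunks (
        chunk_id  INTEGER PRIMARY KEY,
        paper_id  INTEGER NOT NULL REFERENCES papers(paper_id),
        body      TEXT NOT NULL,
        embedding TEXT NOT NULL  -- JSON-encoded toy vector (hypothetical layout)
    );
""")
conn.executemany("INSERT INTO papers VALUES (?, ?, ?)", [
    (1, "Protein folding survey", 2021),
    (2, "Battery materials review", 2023),
])
conn.executemany("INSERT INTO chunks VALUES (?, ?, ?, ?)", [
    (10, 1, "folding pathways", json.dumps([1.0, 0.0])),
    (11, 2, "solid electrolytes", json.dumps([0.0, 1.0])),
    (12, 2, "cathode degradation", json.dumps([0.6, 0.8])),
])


def search(query_vector, min_year, limit=2):
    """Join chunks to papers, filter on structured data, rank by distance."""
    return conn.execute(
        """
        SELECT p.title, c.body,
               cosine_distance(c.embedding, ?) AS dist
        FROM chunks AS c
        JOIN papers AS p ON p.paper_id = c.paper_id
        WHERE p.year >= ?
        ORDER BY dist
        LIMIT ?
        """,
        (json.dumps(query_vector), min_year, limit),
    ).fetchall()


results = search([0.0, 1.0], min_year=2022)
```

The join, the structured filter and the vector ranking all sit in one SQL statement, which is the property the paragraph above attributes to SQL vector databases. A dedicated vector store would typically need the join and filter to be handled in application code.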

Support joint query of structured, vector and keyword data

In real RAG systems, the accuracy and quality of retrieval are the main bottlenecks restricting deployment. This requires the AI database to efficiently support joint queries over structured, vector and keyword data in order to comprehensively improve retrieval accuracy.

For example, in a financial scenario, a user asks the document library, "What was the revenue of a certain company's global businesses in 2023?" Structured meta-information such as "a certain company" and "2023" cannot be captured well by vectors, and may not even appear directly in the corresponding paragraphs. Performing vector retrieval directly on the entire database returns a large amount of noise and reduces the final accuracy of the system. On the other hand, the company name, year and similar attributes can usually be obtained as meta-information of the document. We can use WHERE year=2023 AND company ILIKE "%<company name>%" as the filter condition of the vector query to locate the relevant information precisely, which greatly improves the reliability of the system. In finance, manufacturing, scientific research and other scenarios, the MyScale team has observed the power of heterogeneous data modeling and related queries, and in many scenarios has even seen accuracy improvements of 60% to 90%.
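The effect of such a metadata filter can be illustrated with a small in-memory sketch. It is not MyScaleDB code: the Chunk class, the company names and the two-dimensional embeddings are invented for illustration, and ILIKE is emulated with a case-insensitive substring test.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    company: str
    year: int
    text: str
    embedding: list


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filtered_search(chunks, query_vector, year, company_substring, k=3):
    """Apply the structured filter first, then rank the survivors by distance.

    Mirrors a WHERE clause on year and company followed by a vector ORDER BY.
    """
    needle = company_substring.lower()
    candidates = [
        c for c in chunks
        if c.year == year and needle in c.company.lower()
    ]
    candidates.sort(key=lambda c: l2_distance(c.embedding, query_vector))
    return candidates[:k]


# Hypothetical corpus: three chunks share the same embedding, so the vector
# alone cannot tell which company or year a revenue paragraph belongs to.
corpus = [
    Chunk("Acme Corp", 2023, "Global revenue by segment", [0.9, 0.1]),
    Chunk("Acme Corp", 2022, "Global revenue by segment", [0.9, 0.1]),
    Chunk("Globex Inc", 2023, "Global revenue by segment", [0.9, 0.1]),
    Chunk("Acme Corp", 2023, "Board member biographies", [0.1, 0.9]),
]

hits = filtered_search(corpus, [1.0, 0.0], year=2023, company_substring="acme")
```

Without the filter, the three revenue chunks are equally close to the query and two of them are noise. With it, only the 2023 chunks of the matching company are ranked.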

Although traditional database products have gradually recognized the importance of vector queries in the AI era and have begun adding vector capabilities, the accuracy of their joint queries still has significant problems. For example, in filtered-query scenarios, the QPS of Elasticsearch drops to only about 5 when the filter ratio is 0.1, while the retrieval accuracy of PostgreSQL (using the pgvector plug-in) is only about 50% when the filter ratio is 0.01. Such unstable query accuracy and performance greatly restrict their application scenarios. MyScale, by contrast, achieves high-performance, high-precision queries across scenarios with different filter ratios at only 36% of the cost of pgvector and 12% of the cost of Elasticsearch.

In scenarios with different filter ratios, MyScale achieves high-precision and high-performance queries
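To make the terms concrete, the sketch below shows how a filter ratio and recall at k can be computed, and why a low filter ratio is hard for a naive "search first, filter afterwards" strategy. It assumes that filter ratio means the fraction of rows that pass the structured filter. It is a generic toy illustration with one-dimensional "embeddings", not a description of how Elasticsearch, pgvector or MyScaleDB are implemented, and all function names are hypothetical.

```python
def exact_filtered_top_k(items, query, predicate, k):
    """Ground truth: filter first, then take the k nearest survivors."""
    survivors = [i for i in items if predicate(i)]
    return sorted(survivors, key=lambda i: abs(i - query))[:k]


def post_filtered_top_k(items, query, predicate, k, pool_size):
    """Take the pool_size nearest rows ignoring the filter, then filter."""
    pool = sorted(items, key=lambda i: abs(i - query))[:pool_size]
    return [i for i in pool if predicate(i)][:k]


def recall_at_k(found, truth):
    """Fraction of the true top-k that the search actually returned."""
    return len(set(found) & set(truth)) / len(truth)


def keep(i):
    """Structured filter: only 10 of the 1000 toy rows pass."""
    return i % 100 == 99


# Toy data: each row is a 1-D "embedding" equal to its own id.
rows = list(range(1000))
filter_ratio = sum(keep(i) for i in rows) / len(rows)

truth = exact_filtered_top_k(rows, query=0, predicate=keep, k=5)
found = post_filtered_top_k(rows, query=0, predicate=keep, k=5, pool_size=200)
recall = recall_at_k(found, truth)
```

Here the filter ratio is 0.01 and the candidate pool of 200 rows contains only two of the five true neighbours, so recall is 0.4. Raising pool_size recovers accuracy at the cost of more work per query, which is the kind of accuracy and performance trade-off discussed above.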

The balance between performance and cost in real scenarios

Because vector retrieval is so important and receives so much attention in large model applications, more and more teams have invested in the vector database track. Everyone's initial focus was on improving QPS in pure vector search scenarios, but pure vector search is far from enough! In real-world scenarios, data modeling, query flexibility and accuracy, and the balance among data density, query performance and cost are the more important issues.

In RAG scenarios, it is common for current dedicated vector databases to deliver 10x more pure vector query performance than is needed, while their vectors occupy huge resources and their joint query functions are either missing or poor in performance and accuracy.

MyScaleDB is committed to improving the comprehensive performance of AI databases in real massive-data scenarios. Its MyScale Vector Database Benchmark is also the industry's first open source evaluation system that compares the comprehensive performance and cost-effectiveness of mainstream vector database systems at a scale of five million vectors and across different query scenarios. Everyone is welcome to follow the project and raise issues. The MyScale team said that there is still a lot of room for optimizing AI databases in real application scenarios, and they hope to continue polishing the product and improving the evaluation system in practice.

MyScale Vector Database Benchmark project address:

https://github.com/myscale/vector-db-benchmark

Outlook: A large model + big data Agent platform supported by an AI database

Machine learning + big data drove the success of the Internet and the previous generation of information systems. In the era of large models, the MyScale team is likewise committed to proposing a new generation of large model + big data solutions. With a high-performance SQL vector database as solid support, MyScaleDB provides the key capabilities of large-scale data processing, knowledge query, observability, data analysis and few-shot learning, builds a closed loop between AI and data, and becomes the key foundation of the next-generation large model + big data Agent platform. The MyScale team has already explored the implementation of this solution in scientific research, finance, industry, healthcare and other fields.

With the rapid development of technology, some form of artificial general intelligence (AGI) is expected to appear in the next 5 to 10 years. This raises a question: do we need a large model that is static, virtual and in competition with humans, or is there a more comprehensive solution? Data is undoubtedly an important link between large models, the world and users. The MyScale team's vision is to organically combine large models and big data to create an AI system that is more professional, real-time and efficient at collaboration, and that is also full of human warmth and value.

Statement

This article is reproduced from 机器之心. If there is any infringement, please contact admin@php.cn to have it deleted.
