search
HomeTechnology peripheralsAIWhen LLM meets Database: Alibaba DAMO Academy and HKU launch a new Text-to-SQL benchmark​

Background

Large Model (LLM) provides a new direction for the development of general artificial intelligence (AGI). It uses massive public data, such as the Internet, books and other corpora to conduct large-scale self-learning. Through supervised training, powerful language understanding, language generation, reasoning and other abilities have been obtained. However, large models still face some challenges in utilizing private domain data. Private domain data refers to data owned by specific companies or individuals and usually contains domain-specific knowledge. Combining large models with private domain knowledge will Provide great value.

Private domain knowledge can be divided into unstructured and structured data in terms of data form. Unstructured data, such as documents, are usually enhanced through retrieval, and tools such as langchain can be used to quickly implement a question and answer system. Structured data, such as databases (DB), require large models to interact with the database, query and analyze to obtain useful information. A series of products and applications have recently been derived around large models and databases, such as using LLM to build intelligent databases, perform BI analysis, and complete automatic table construction. Among them, text-to-SQL technology, which interacts with the database in a natural language, has always been a highly anticipated direction.

In academia, past text-to-SQL benchmarks only focused on small-scale databases. The most advanced LLM can already achieve an execution accuracy of 85.3%, but does this mean that LLM Already available as a natural language interface to the database?

New generation data set

Recently, Alibaba, together with the University of Hong Kong and other institutions, launched a new benchmark BIRD (Can LLM Already Serve as A Database) for large-scale real databases Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs), including 95 large-scale databases and high-quality Text-SQL pairs, with a data storage capacity of up to 33.4 GB. The previous best model only achieved 40.08% evaluation on BIRD, which is still far from the human result of 92.96%, proving that challenges still exist. In addition to evaluating the correctness of SQL, the author also added an evaluation of SQL execution efficiency, hoping that the model can not only write correct SQL, but also write efficient SQL.

When LLM meets Database: Alibaba DAMO Academy and HKU launch a new Text-to-SQL benchmark​

Paper: https://arxiv.org/abs/2305.03111

Homepage: https://bird-bench.github.io

##Code: https://github. com/AlibabaResearch/DAMO-ConvAI/tree/main/bird

When LLM meets Database: Alibaba DAMO Academy and HKU launch a new Text-to-SQL benchmark​

##Currently, BIRD’s data and code , and the list have all been open sourced, and have been downloaded more than 10,000 times around the world. BIRD has aroused widespread attention and discussion on Twitter since its launch.

When LLM meets Database: Alibaba DAMO Academy and HKU launch a new Text-to-SQL benchmark​

When LLM meets Database: Alibaba DAMO Academy and HKU launch a new Text-to-SQL benchmark​

# Comments from overseas users are also very exciting:

When LLM meets Database: Alibaba DAMO Academy and HKU launch a new Text-to-SQL benchmark​

An LLM project not to be missed

When LLM meets Database: Alibaba DAMO Academy and HKU launch a new Text-to-SQL benchmark​

Very useful checkpoint, hotbed for improvement

When LLM meets Database: Alibaba DAMO Academy and HKU launch a new Text-to-SQL benchmark​

AI can help you, but it cannot replace you

When LLM meets Database: Alibaba DAMO Academy and HKU launch a new Text-to-SQL benchmark​

##My job is safe for now...

Method Overview

New Challenges

This research is mainly oriented to Text-to-SQL evaluation of real databases, popular test benchmarks in the past, Spider and WikiSQL, for example, only focus on database schema with a small amount of database content, resulting in a gap between academic research and practical applications. BIRD focuses on three new challenges: massive and real database content, external knowledge reasoning between natural language questions and database content, and the efficiency of SQL when processing large databases.

When LLM meets Database: Alibaba DAMO Academy and HKU launch a new Text-to-SQL benchmark​

First of all, the database contains massive and noisy data values. In the example on the left, the average salary needs to be calculated by converting the string in the database into a floating point value (Float) and then performing the aggregation calculation (Aggregation);

Secondly, external knowledge inference is necessary. In the middle example, in order to accurately return answers to the user, the model must first know that the account type eligible for the loan must be "OWNER", which represents a huge The mysteries hidden behind database content sometimes require external knowledge and reasoning to reveal;

Finally, query execution efficiency needs to be considered. In the example on the right, using more efficient SQL queries can significantly improve speed, which is of great value to the industry because users not only expect to write correct SQL, but also expect efficient SQL execution, especially in large databases In the case of;

Data annotation

BIRD decouples question generation and SQL annotation during the annotation process . At the same time, experts are added to write database description files to help problem and SQL annotation personnel better understand the database.

When LLM meets Database: Alibaba DAMO Academy and HKU launch a new Text-to-SQL benchmark​

1. Database collection: The author collected and processed 80 databases from open source data platforms such as Kaggle and CTU Prague Relational Learning Repository. Fifteen databases were manually created as black-box tests by collecting real table data, building ER diagrams, and setting database constraints to avoid the current database being learned by the current large model. BIRD's database contains patterns and values ​​in multiple fields, 37 fields, covering blockchain, sports, medical care, games, etc.

2. Problem collection: First, the author hires experts to write a description file for the database. The description file includes a complete description of the column name, database value, and external parameters used to understand the value. knowledge etc. Then 11 native speakers from the United States, United Kingdom, Canada, Singapore and other countries were recruited to generate questions for BIRD. Every speaker has at least a bachelor's degree or above.

3. SQL generation: A global annotation team composed of data engineers and database course students was recruited to generate SQL for BIRD. Given a database and a reference database description file, the annotator needs to generate SQL to correctly answer the question. The Double-Blind annotation method is adopted, requiring two annotators to annotate the same question. Double-blind annotation can minimize errors caused by a single annotator.

4. Quality inspection: Quality inspection is divided into two parts: effectiveness and consistency of result execution. Validity not only requires the correctness of execution, but also requires that the execution result cannot be null (NULL). Experts will gradually modify the problem conditions until the SQL execution results are valid.

5. Difficulty division: The difficulty index of text-to-SQL can provide researchers with a reference for optimizing algorithms. The difficulty of Text-to-SQL depends not only on the complexity of the SQL, but also on factors such as the difficulty of the question, the ease of understanding with additional knowledge, and the complexity of the database. The authors therefore asked SQL annotators to rate the difficulty during the annotation process and divided the difficulty into three categories: easy, moderate, and challenging.

Data statistics

1. Question type statistics: Questions are divided into two categories, basic question types (Fundamental Type) and Reasoning Type. Basic question types include those covered in traditional Text-to-SQL datasets, while inference question types include questions that require external knowledge to understand the value:

When LLM meets Database: Alibaba DAMO Academy and HKU launch a new Text-to-SQL benchmark​

##2. Database distribution: The author uses a sunburst diagram to show the relationship between the database domain and its data size. A larger radius means more text-SQL is based on that database, and vice versa. The darker the color, the larger the database size. For example, donor is the largest database in the benchmark, occupying 4.5GB of space.

When LLM meets Database: Alibaba DAMO Academy and HKU launch a new Text-to-SQL benchmark​

3.SQL distribution: the number of tokens passed by the author through SQL, the number of keywords, the number of n-gram types, JOIN The number and other 4 dimensions prove that BIRD's SQL is by far the most diverse and complex.

When LLM meets Database: Alibaba DAMO Academy and HKU launch a new Text-to-SQL benchmark​

Evaluation Indicators

1. Execution Accuracy: Compare the difference between the SQL execution results predicted by the model and the actual annotated SQL execution results;

2. Effective efficiency score: Taking into account both the accuracy and efficiency of SQL, compare the model predictions The relative difference between the SQL execution speed and the real annotated SQL execution speed takes the running time as the main indicator of efficiency.

Experimental analysis

The author selected the training-type T5 model and the large-scale model that performed well in previous benchmark tests Language models (LLM) as baseline models: Codex (code-davinci-002) and ChatGPT (gpt-3.5-turbo). In order to better understand whether multi-step reasoning can stimulate the reasoning capabilities of large language models in real database environments, their Chain-of-Thought version is also provided. And test the baseline model in two settings: one is full schema information input, and the other is human understanding of the database values ​​involved in the problem, summarized into natural language description (knowledge evidence) to assist the model in understanding the database.

When LLM meets Database: Alibaba DAMO Academy and HKU launch a new Text-to-SQL benchmark​

The author gives some conclusions:

1. Additional knowledge Gain: Increasing the knowledge evidence (knowledge evidence) on the understanding of database values ​​has a significant improvement. This proves that in real database scenarios, relying solely on semantic parsing capabilities is not enough. Understanding database values ​​will help users find more accurately Answer.

2. The thinking chain is not necessarily completely beneficial: when the model does not have a given database value description and zero-shot, the model's own COT inference can be generated more accurately Answer. However, when given additional knowledge (knowledge evidence), LLM was asked to perform COT and found that the effect was not significant or even declined. Therefore in this scenario, LLM may generate knowledge conflicts. How to resolve this conflict so that the model can both accept external knowledge and benefit from its own powerful multi-step reasoning will be a key research direction in the future.

3. The gap with humans: BIRD also provides human indicators. The author uses an exam to test the performance of the annotator when facing the test set for the first time, and uses it as the basis for human indicators. Experiments have found that the current best LLM is still far behind humans, proving that challenges still exist. The authors performed a detailed error analysis and provided some potential directions for future research.

When LLM meets Database: Alibaba DAMO Academy and HKU launch a new Text-to-SQL benchmark​

Conclusion

The application of LLM in the database field will provide users with smarter and more convenient Database interactive experience. The emergence of BIRD will promote the intelligent development of interaction between natural language and real databases, provide room for progress in text-to-SQL technology for real database scenarios, and help researchers develop more advanced and practical database applications.

The above is the detailed content of When LLM meets Database: Alibaba DAMO Academy and HKU launch a new Text-to-SQL benchmark​. For more information, please follow other related articles on the PHP Chinese website!

Statement
This article is reproduced at:51CTO.COM. If there is any infringement, please contact admin@php.cn delete
AI Therapists Are Here: 14 Groundbreaking Mental Health Tools You Need To KnowAI Therapists Are Here: 14 Groundbreaking Mental Health Tools You Need To KnowApr 30, 2025 am 11:17 AM

While it can’t provide the human connection and intuition of a trained therapist, research has shown that many people are comfortable sharing their worries and concerns with relatively faceless and anonymous AI bots. Whether this is always a good i

Calling AI To The Grocery AisleCalling AI To The Grocery AisleApr 30, 2025 am 11:16 AM

Artificial intelligence (AI), a technology decades in the making, is revolutionizing the food retail industry. From large-scale efficiency gains and cost reductions to streamlined processes across various business functions, AI's impact is undeniabl

Getting Pep Talks From Generative AI To Lift Your SpiritGetting Pep Talks From Generative AI To Lift Your SpiritApr 30, 2025 am 11:15 AM

Let’s talk about it. This analysis of an innovative AI breakthrough is part of my ongoing Forbes column coverage on the latest in AI including identifying and explaining various impactful AI complexities (see the link here). In addition, for my comp

Why AI-Powered Hyper-Personalization Is A Must For All BusinessesWhy AI-Powered Hyper-Personalization Is A Must For All BusinessesApr 30, 2025 am 11:14 AM

Maintaining a professional image requires occasional wardrobe updates. While online shopping is convenient, it lacks the certainty of in-person try-ons. My solution? AI-powered personalization. I envision an AI assistant curating clothing selecti

Forget Duolingo: Google Translate's New AI Feature Teaches LanguagesForget Duolingo: Google Translate's New AI Feature Teaches LanguagesApr 30, 2025 am 11:13 AM

Google Translate adds language learning function According to Android Authority, app expert AssembleDebug has found that the latest version of the Google Translate app contains a new "practice" mode of testing code designed to help users improve their language skills through personalized activities. This feature is currently invisible to users, but AssembleDebug is able to partially activate it and view some of its new user interface elements. When activated, the feature adds a new Graduation Cap icon at the bottom of the screen marked with a "Beta" badge indicating that the "Practice" feature will be released initially in experimental form. The related pop-up prompt shows "Practice the activities tailored for you!", which means Google will generate customized

They're Making TCP/IP For AI, And It's Called NANDAThey're Making TCP/IP For AI, And It's Called NANDAApr 30, 2025 am 11:12 AM

MIT researchers are developing NANDA, a groundbreaking web protocol designed for AI agents. Short for Networked Agents and Decentralized AI, NANDA builds upon Anthropic's Model Context Protocol (MCP) by adding internet capabilities, enabling AI agen

The Prompt: Deepfake Detection Is A Booming BusinessThe Prompt: Deepfake Detection Is A Booming BusinessApr 30, 2025 am 11:11 AM

Meta's Latest Venture: An AI App to Rival ChatGPT Meta, the parent company of Facebook, Instagram, WhatsApp, and Threads, is launching a new AI-powered application. This standalone app, Meta AI, aims to compete directly with OpenAI's ChatGPT. Lever

The Next Two Years In AI Cybersecurity For Business LeadersThe Next Two Years In AI Cybersecurity For Business LeadersApr 30, 2025 am 11:10 AM

Navigating the Rising Tide of AI Cyber Attacks Recently, Jason Clinton, CISO for Anthropic, underscored the emerging risks tied to non-human identities—as machine-to-machine communication proliferates, safeguarding these "identities" become

See all articles

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

Video Face Swap

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Tools

EditPlus Chinese cracked version

EditPlus Chinese cracked version

Small size, syntax highlighting, does not support code prompt function

PhpStorm Mac version

PhpStorm Mac version

The latest (2018.2.1) professional PHP integrated development tool

SecLists

SecLists

SecLists is the ultimate security tester's companion. It is a collection of various types of lists that are frequently used during security assessments, all in one place. SecLists helps make security testing more efficient and productive by conveniently providing all the lists a security tester might need. List types include usernames, passwords, URLs, fuzzing payloads, sensitive data patterns, web shells, and more. The tester can simply pull this repository onto a new test machine and he will have access to every type of list he needs.

Safe Exam Browser

Safe Exam Browser

Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment