Home >Technology peripherals >AI >Automatically discovered 100+ vulnerabilities in four major databases in one day, Zhejiang University research won the best paper in SIGMOD 2023

Automatically discovered 100+ vulnerabilities in four major databases in one day, Zhejiang University research won the best paper in SIGMOD 2023

WBOY
WBOYforward
2023-05-24 15:58:06900browse

The 2023 ACM SIGMOD/PODS International Conference on Data Management (SIGMOD 2023) will be held in Seattle, USA from June 18-23 local time. Recently, the conference announced the list of the best papers, with "Predicate Pushdown for Data Science Pipelines" from Microsoft Research and "Detecting Logic Bugs of Join Optimizations in DBMS" from Zhejiang University winning. This is the first time a mainland Chinese research team has won the best paper award at the conference since its inception in 1975. Among them, research from Zhejiang University proposed a novel method that can automatically discover logical vulnerabilities in database management systems such as MySQL, MariaDB, TiDB and PolarDB.

一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文

Over the past few decades, modern database management systems (DBMS) have continued to evolve to support a variety of new architectures, such as cloud platforms and HTAP , which requires increasingly sophisticated optimization of query evaluation. The query optimizer is considered to be one of the most complex and important components in a DBMS. Its function is to parse the input SQL query and then generate an efficient execution plan with the assistance of the built-in cost model. Errors in query optimizer implementation may lead to vulnerabilities (bugs), including crash vulnerabilities and logic holes. Crash vulnerabilities are easy to detect because a crash causes the system to stop immediately. However, logic vulnerabilities are easily overlooked because they can cause the DBMS to return erroneous result sets that are difficult to detect. This paper focuses on detecting these silent vulnerabilities.

一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文

There is an emerging approach to detecting logical vulnerabilities in DBMS, namely Pivoted Query Synthesis (PQS). The core idea of ​​this method is to randomly select a pivot row (pivot row) from the table and then generate a query with this row as the result. If any synthesized query cannot return that data row, then a logic vulnerability has been detected. PQS is mainly used to support option queries in a single table, and 90% of its reported vulnerabilities only involve single-table queries. There is still a large research gap for multi-table queries that use different join algorithms and join structures (which are more error-prone than single-table queries).

The following figure shows two logical loopholes in connecting queries in MySQL. Both vulnerabilities can be detected using the new tools proposed in this article.

一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文

Figure 1: Example of logical vulnerability in join optimization in DBMS

Figure 1(a) shows a logic vulnerability in the hash join in MySQL 8.0.18. In this example, the first query returns the correct result set because it is executed using a block nested loop join. However, the second query using an inner hash join went wrong and returned an incorrectly empty result set. This is because the underlying hash join algorithm incorrectly assumes that 0 is not equal to −0.

The logic vulnerability in Figure 1 (b) originates from the semi-join processing in MySQL 8.0.28. In the first query, the nested loop inner join converts the data type varchar to bigint, resulting in the correct result set. When executing the second query using a hash semi-join, the data type varchar will be converted to double, resulting in a loss of data accuracy and errors in equivalent comparisons.

The difficulty of using query synthesis method for logical vulnerability detection of multi-table join queries is far more difficult than that of single-table queries. This involves two challenges:

  • Result verification: In order to verify the correctness of the query results, the previous method used a differential testing strategy. The idea is to use different physical execution plans (the way the database system actually executes query statements) to process queries. If these plans return inconsistent result sets, a logic flaw may have been detected. However, the differential testing method has two disadvantages. First, certain logical vulnerabilities can affect multiple physical execution plans and cause them all to produce the same erroneous result. Second, when inconsistent result sets are observed, it becomes expensive to manually check which execution plan generated the correct result. A possible solution to this problem is to construct ground-truth results for arbitrary test queries, but existing tools do not support this operation;
  • Search space : For a given database schema, the number of join queries that can be generated scales exponentially with the number of tables and columns. Since it is impossible to enumerate all possible queries for verification, an effective query space exploration mechanism is needed to allow us to detect logical vulnerabilities as efficiently as possible.

In response to the above problems, researchers from Zhejiang University proposed a method called Transformed Query Synthesis (TQS). TQS is a new, universal and cost-effective tool for the task of detecting logical vulnerabilities in join optimization in DBMSs.

In response to the first challenge mentioned above, the solution proposed by researchers is DSG, which stands for Data-guided Schema and query Generation. ). Given a data set represented as a wide table, DSG can split the data set into multiple tables based on the detected paradigm. To speed up the discovery of vulnerabilities, DSG also injects some artificial noise data into the generated database. First, convert the database schema into a graph, where nodes are tables/columns and edges are relationships between nodes. DSG uses random walks on the pattern graph to select tables for the query, and then uses these tables to generate joins. For a specific join query involving multiple tables, we can easily find the truth result from the wide table. In this way, DSG can effectively generate (query, result) collections for database validation.

#In response to the second challenge mentioned above, the method designed by the researchers is KQE, which is Knowledge-guided Query space Exploration. The approach begins by extending the pattern graph into a plan-iterative graph, which represents the entire query generation space. Each join query is then represented as a subgraph. To score the generated query graph, KQE employs an embedding-based graph index that searches the already explored space for structurally similar query graphs. The random walk query generator is guided based on the coverage score to explore as much of the unknown query space as possible.

#In order to demonstrate the versatility and effectiveness of the method, the researchers evaluated TQS on four commonly used DBMS: MySQL, MariaDB, TiDB and PolarDB . After running for 24 hours, TQS successfully found 115 vulnerabilities, including 31 in MySQL, 30 in MariaDB, 31 in TiDB, and 23 in PolarDB. By analyzing the root causes, the types of these vulnerabilities can be summarized, among which there are 7 types of vulnerabilities in MySQL, 5 types in MariaDB, 5 types in TiDB, and 3 types in PolarDB. The researchers have submitted the discovered vulnerabilities to the corresponding communities and received positive feedback.

The following will describe the problem to be solved and the solution proposed by Zhejiang University in mathematical form.

Problem Definition

There are two types of database vulnerabilities: crashes and logical vulnerabilities. Crash vulnerabilities arise from the execution of the operating system and DBMS. They can cause the DBMS to be forcibly terminated due to reasons such as insufficient resources such as memory or access to an invalid memory address. Therefore, crash vulnerabilities are easily discovered. Logic vulnerabilities are much harder to find because the database will still function normally and will process queries and return seemingly correct results (and most of the time they do, but in a few cases they may Reading wrong result set). These silent vulnerabilities are like invisible bombs and are more dangerous because they are difficult to detect and may affect the correctness of the application.

This paper introduces a query optimizer to detect logical vulnerabilities for multi-table join query problems. Researchers call these vulnerabilities join optimization bugs. Using the notation given in Table 1, the join optimization vulnerability detection problem can be formally defined as:

Definition: For each query qi in the query workload Q, let query optimize The server performs the join of qi through multiple actual plans and validates its result set using the ground truth 一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文. If 一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文, a connection optimization vulnerability has been found.

一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文

Table 1: Symbol Description Table

Program Overview

Figure 2 gives an architectural overview of TQS. Given a baseline data set and a target DBMS, TQS searches for possible logical vulnerabilities in the DBMS by generating queries based on the data set. TQS has two key components: data-guided schema and query generation (DSG) and knowledge-guided query space exploration (KQE)

一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文

Figure 2: TQS overview

DSG treats the input data set as a wide table, and in addition to the original tuple , DSG also deliberately synthesizes some tuples with error-prone values ​​(such as null values ​​or very long strings). For join queries, DSG creates a new schema for the wide table by splitting the wide table into multiple tables, ensuring that the tables conform to a functional dependency-based paradigm. DSG models the database schema as a graph and then generates logical/conceptual queries via random walks on the schema graph. A DSG will materialize a logical query into a physical execution plan and transform the query with different hints, allowing the DBMS to execute multiple different physical execution plans to search for vulnerabilities. For a join query, the ground truth result is obtained by mapping the join graph back to the wide table.

After completing the schema setup and data splitting, KQE expands the schema diagram into a planning iteration diagram. Each query is represented as a subgraph. KQE builds an embedding-based graph index for the embeddings of the query graph in history (i.e., within the already explored query space). Intuitively, the role of KQE is to ensure that the newly generated query graph is as far away from its nearest neighbors in the history as possible, i.e. this is to explore new query graphs rather than repeating existing query graphs. KQE does this by scoring the generated query graph based on structural similarity (to query graphs in history) while using an adaptive random walk method to generate queries. .

Algorithm 1 summarizes the core idea of ​​TQS, where lines 2, 10, and 12 are DSG, and lines 4, 8, and 9 are KQE.

一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文

Given a dataset 一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文 and a wide table 一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文 sampled from 一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文, DSG will Table 一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文 is split into multiple tables, and these tables form a database schema 一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文 (line 2) that conforms to 3NF. The pattern 一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文 can be viewed as a graph 一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文, where tables and columns are vertices and edges represent relationships between vertices. DSG uses a random walk on 一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文 to generate a join expression of the query (line 10). In fact, join queries can be projected as subgraphs of 一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文. The DSG can easily retrieve the ground truth result for this query by mapping the subgraph back to the wide table 一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文 (line 12).

KQE extends the pattern diagram into a planning iteration diagram (line 4). To avoid testing similar paths, KQE will build an embedding-based graph index

一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文

to index the embeddings of the existing query graph (line 9) . KQE updates the edge weight π of the planning iteration graph G based on the structural similarity between the current query graph and the existing query graph (line 8). KQE scores the next possible path, which guides the random walk generator, thereby preferring to explore the unknown query space.

For a query 一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文, TQS transforms the query through the prompt set 一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文 to execute multiple different actual query plans (Section 11 OK). Finally, the result set of query

一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文# is compared to the ground truth value 一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文 (line 14). If they are inconsistent, then a join optimization vulnerability has been detected (line 15).

Please read the original paper for more detailed descriptions of DSG and KQE.

Experimental results

TQS ​​successfully found some logical vulnerabilities in database management systems such as MySQL, MariaDB, TiDB and PolarDB. They are divided into 20 types, including 7 vulnerabilities in MySQL and 7 in MariaDB. There are 5 types of TiDB, 5 types of PolarDB, and 3 types of PolarDB, as shown in the table below.

一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文

Compared with other methods, the overall performance of TQS proposed by Zhejiang University is also quite outstanding, and it has achieved significant improvements in many indicators. Excellent results were obtained, and the effectiveness of each component was also tested through controlled variable experiments.

一天自动发现四大数据库100+漏洞,浙大研究获SIGMOD 2023最佳论文

But researchers also said that TQS is currently focusing on equijoin queries. Nonetheless, the DSG and KQE ideas can also be extended to the case of non-equijoins. The only difficulty is how to generate and manage query truth results - in the case of non-equijoins, the size of these results increases exponentially. This aspect still needs further research in the future.

The above is the detailed content of Automatically discovered 100+ vulnerabilities in four major databases in one day, Zhejiang University research won the best paper in SIGMOD 2023. For more information, please follow other related articles on the PHP Chinese website!

Statement:
This article is reproduced at:51cto.com. If there is any infringement, please contact admin@php.cn delete