Regarding MySQL join, you must have known a lot of "anecdotes" about it. For example, a two-table join requires a small table to drive a large table. Alibaba developer specifications prohibit three For join operations of more than one table, MySQL's join function is extremely weak and so on. These norms or remarks may be true or false, sometimes right or sometimes wrong. You need to have an in-depth understanding of join to understand clearly.
Next, let’s take a comprehensive look at MySQL’s join operation.
In daily database queries, we often need to join multiple tables to obtain the merged data of multiple tables at one time. This requires the use of join in the database. grammar. Join is a very common operation in the data field to merge two data sets. If you know more about it, you will find that MySQL, Oracle, PostgreSQL and Spark all support this operation. The protagonist of this article is MySQL. If there is no special explanation below, MySQL's join will be used as the main subject. Oracle, PostgreSQL and Spark can be regarded as the big bosses that beat them. Their algorithm optimization and implementation of join are better than MySQL.
MySQL's join has many rules. If you are not careful, a bad join statement may not only cause a full table query on a certain table, but also may affect The cache of the database causes most hot data to be replaced, dragging down the performance of the entire database.
Therefore, the industry has summarized many norms or principles for MySQL joins, such as small tables driving large tables and prohibiting join operations of more than three tables. Below we will introduce the MySQL join algorithm in turn, compare it with the join implementation of Oracle and Spark, and intersperse answers to why the above specifications or principles are formed.
For the implementation of join operations, there are probably three more common algorithms: Nested Loop Join (loop nested join), Hash Join (hash join) and Sort Merge Join (sort merge join), they each have their own advantages, disadvantages and applicable conditions, we will introduce them in turn next.
Nested Loop Join scans the driver table. Every time a record is read, the corresponding data is queried in the driven table according to the index on the associated field of the join. . It is suitable for scenarios where the subset of data to be connected is small. It is also the only algorithm implementation of MySQL join. We will explain its details in detail next.
There are two variants of the Nested Loop Join algorithm in MySQL, namely Index Nested-Loop Join and Block Nested-Loop Join.
Next, let’s initialize the related table structure and data
CREATE TABLE `t1` ( `id` int(11) NOT NULL, `a` int(11) DEFAULT NULL, `b` int(11) DEFAULT NULL, PRIMARY KEY (`id`), KEY `a` (`a`) ) ENGINE=InnoDB; delimiter ;; # 定义存储过程来初始化t1 create procedure init_data() begin declare i int; set i=1; while(i<p>As can be seen from the above command, both tables have a The primary key index id and an index a, there is no index on field b. The stored procedure init_data inserts 10,000 rows of data into table t1 and 500 rows of data into table t2. </p><p>In order to prevent the MySQL optimizer from selecting the table as the driver table and affecting the execution process of the analyzed SQL statement, we directly use straight_join to let MySQL use a fixed connection table order for querying. In the following statement, t1 is the driver Table, t2 is the driven table. </p><pre class="brush:php;toolbar:false">select * from t2 straight_join t1 on (t2.a=t1.a);复制代码
Use the explain command introduced in our previous article to view the execution plan of the statement.
As you can see from the above figure, the a field on the t1 table is indexed, and the index is used in the join process. Therefore, the execution flow of the SQL statement is as follows:
We call this process Index Nested-Loop Join, or NLJ for short, and its corresponding flow chart is as follows.
It should be noted that in the second step, when querying table t1 based on the a field, an index is used, so each scan will only scan one row (from explain the results, which vary according to different case scenarios).
Assume that the number of rows in the driving table is N and the number of rows in the driven table is M. Because during the execution of this join statement, the driving table performs a full table scan, while the driven table uses an index, and each row of data in the driving table must be indexed in the driven table for index query, so the entire join process is The approximate complexity is N2log2M. Obviously, N has a greater impact on the number of scanned rows, so in this case, a small table should be used as the driving table.
Of course, the premise of all this is that the associated field of join is a, and there is an index on the a field of the t1 table.
如果没有索引时,再用上图的执行流程时,每次到 t1 去匹配的时候,就要做一次全表扫描。这也导致整个过程的时间复杂度编程了 N * M,这是不可接受的。所以,当没有索引时,MySQL 使用 Block Nested-Loop Join 算法。
Block Nested-Loop Join的算法,简称 Is MySQLs join function too weak?,它是 MySQL 在被驱动表上无可用索引时使用的 join 算法,其具体流程如下所示:
比如下面这条 SQL
select * from t2 straight_join t1 on (t2.b=t1.b);复制代码
这条语句的 explain 结果如下所示。可以看出
可以看出,这次 join 过程对 t1 和 t2 都做了一次全表扫描,并且将表 t2 中的 500 条数据全部放入内存 join_buffer 中,并且对于表 t1 中的每一行数据,都要去 join_buffer 中遍历一遍,都要做 500 次对比,所以一共要进行 500 * 10000 次内存对比操作,具体流程如下图所示。
主要注意的是,第一步中,并不是将表 t2 中的所有数据都放入 join_buffer,而是根据具体的 SQL 语句,而放入不同行的数据和不同的字段。比如下面这条 join 语句则只会将表 t2 中符合 b >= 100 的数据的 b 字段存入 join_buffer。
select t2.b,t1.b from t2 straight_join t1 on (t2.b=t1.b) where t2.b >= 100;复制代码
join_buffer 并不是无限大的,由 join_buffer_size 控制,默认值为 256K。当要存入的数据过大时,就只有分段存储了,整个执行过程就变成了:
这个流程体现了该算法名称中 Block 的由来,分块去执行 join 操作。因为表 t2 的数据被分成了 5 次存入 join_buffer,导致表 t1 要被全表扫描 5次。
全部存入 | 分5次存入 | |
---|---|---|
内存操作 | 10000 * 500 | 10000 * (100 + 100 + 100 + 100 + 100) |
扫描行数 | 10000 + 500 | 10000 * 5 + 500 |
As shown above, compared with table data that can be all stored in join_buffer, the number of memory judgments has not changed. It is the product of the row numbers of the two tables, which is 10000 * 500, but the driven table will be scanned multiple times. , every time it is saved, the driven table will be scanned again, which affects the final execution efficiency.
Based on the above two algorithms, we can draw the following conclusion, which is also the standard for most MySQL join statements on the Internet.
There is an index on the driven table, that is, when the Index Nested-Loop Join algorithm can be used, the join operation can be used.
Whether it is the Index Nested-Loop Join algorithm or Block Nested-Loop Join, a small table must be used as the driving table.
Because the time complexity of the above two join algorithms is at least also has a first-order relationship with the number of rows in the table involved, and it takes a lot of memory space , so it is understandable that Alibaba developer specifications strictly prohibit join operations of more than three tables.
But the above two algorithms are only one of the join algorithms. There are also more efficient join algorithms, such as Hash Join and Sorted Merged Join. Unfortunately, these two algorithms are not currently available in the mainstream version of MySQL, but Oracle, PostgreSQL and Spark all support it. This is also the reason why online complaints about MySQL are so weak(MySQL 8.0 version supports Hash join, but 8.0 currently supports it. Not yet a mainstream version).
In fact, Alibaba developer specifications also stipulated that when migrating from Oracle to MySQL, the join operation performance of MySQL was too poor to prohibit the join operation of more than three tables.
Hash Join scans the driver table, uses the associated fields of the join to create a hash table in the memory, and then scans the driven table, reading out each row of data from the hash table Find the corresponding data. It is a common method for large data set connection operations. It is suitable for scenarios where the amount of data in the driver table is small and can be put into memory. It can be used for large tables without indexes and parallel queries. Best performance. Unfortunately, it only applies to equi-join scenarios, such as on a.id = where b.a_id.
It is still the join statement of the above two tables, and its execution process is as follows
It can be seen that This algorithm is similar to Block Nested-Loop Join, except that the unordered Join Buffer is changed to a hash table hash table, so that the data does not match It is no longer necessary to traverse all the data in the join buffer, but directly use hashing to obtain the matching row with a time complexity close to O(1), which greatly improves the join speed of the two tables.
However, due to the characteristics of hash, this algorithm can only be applied to the equivalent connection scenario, and this algorithm cannot be used in other connection scenarios.
Sort Merge Join first sorts the two tables according to the associated fields of the join (if they are already sorted, such as if there is an index on the field, there is no need to sort again) , and then perform a merge operation on the two tables. If the two tables have been sorted, there is no need to sort them again when performing a sort merge join. In this case, the performance of Merge Join will be better than Hash Join. Merge Join can be adapted to non-equivalent Joins (>, =, ).
It should be noted that if the connected field already has an index, that is to say, if it has been sorted, the merge operation can be performed directly. However, if the connected field does not have an index, its execution process is as follows shown.
The main time consumption of the Sorted Merge Join algorithm is the sorting operation of the two tables, so if the two tables have been sorted according to the connection field, the algorithm is even faster than the Hash Join algorithm. In one case, this algorithm is faster than the Nested Loop Join algorithm.
Now, let’s summarize the differences, advantages and disadvantages of the above three algorithms.
Hash Join | Sorted Merge Join | ||
---|---|---|---|
Applies to any condition | Applies only to equivalence connections (=) | Equal or non-equivalent connections ( >, =, ' | |
CPU, disk I/ O | Memory, temporary space | Memory, temporary space | |
When there is a highly selective index or restrictive The search efficiency is relatively high and the first search result can be quickly returned | When there is a lack of index or the index conditions are ambiguous, Hash Join is more effective than Nested Loop. Usually faster than Merge Join. In a data warehouse environment, if the table has a large number of records, the efficiency is high | When there is a lack of indexes or the index conditions are ambiguous, Sort Merge Join is more effective than Nested Loop. When the connection field has an index or is sorted in advance, it is faster than hash join and supports more connection conditions | |
Efficiency when there is no index or there are many table records Low | Building a hash table requires a lot of memory, and the first result return is slow | All tables need to be sorted. It is designed for optimal throughput and does not return data until all results are found | |
Yes (no indexing is too inefficient) | No | No |
More related free learning recommendations: mysql tutorial(Video)
The above is the detailed content of Is MySQL's join function too weak?. For more information, please follow other related articles on the PHP Chinese website!