Home  >  Article  >  Database  >  What is the principle of MySQL subquery

What is the principle of MySQL subquery

王林
王林forward
2023-05-29 08:14:151075browse

    01 Preface

    Subquery, the popular explanation is that another query statement is nested in the query statement. I believe that students who come into contact with MySQL in their daily work all know or have used subqueries, but how is it implemented? How efficient is the query? I'm afraid many people are not clear about these. Let's explore these two issues together. one time.

    02 Preparation content

    This task requires the use of three tables. These three tables all have a primary key index id and an index a, but there is no index on word b. The stored procedure idata() inserts 100 rows of data into table t1, and 1000 rows of data into tables t2 and t3. The table creation statement is as follows:

    CREATE TABLE `t1` (
        `id` INT ( 11 ) NOT NULL,
        `t1_a` INT ( 11 ) DEFAULT NULL,
        `t1_b` INT ( 11 ) DEFAULT NULL,
    PRIMARY KEY ( `id` ),
    KEY `idx_a` ( `t1_a` )) ENGINE = INNODB;
    
    CREATE TABLE `t2` (
        `id` INT ( 11 ) NOT NULL,
        `t2_a` INT ( 11 ) DEFAULT NULL,
        `t2_b` INT ( 11 ) DEFAULT NULL,
    PRIMARY KEY ( `id` ),
    KEY `idx_a` ( `t2_a` )) ENGINE = INNODB;
    
    CREATE TABLE `t3` (
        `id` INT ( 11 ) NOT NULL,
        `t3_a` INT ( 11 ) DEFAULT NULL,
        `t3_b` INT ( 11 ) DEFAULT NULL,
    PRIMARY KEY ( `id` ),
    KEY `idx_a` ( `t3_a` )) ENGINE = INNODB;
    
    -- 向t1添加100条数据
    -- drop procedure idata;
    delimiter ;;
    create procedure idata()
    begin
      declare i int;
      set i=1;
      while(i<=100)do
            insert into t1 values(i, i, i);
        set i=i+1;
      end while;
    end;;
    delimiter ;
    call idata();
    
    -- 向t2添加1000条数据
    drop procedure idata;
    delimiter ;;
    create procedure idata()
    begin
      declare i int;
      set i=101;
      while(i<=1100)do
            insert into t2 values(i, i, i);
        set i=i+1;
      end while;
    end;;
    delimiter ;
    call idata();
    
    -- 向t2添加1000条数据,且t3_a列的值为倒叙
    drop procedure idata;
    delimiter ;;
    create procedure idata()
    begin
      declare i int;
      set i=101;
      while(i<=1100)do
            insert into t3 values(i, 1101-i, i);
        set i=i+1;
      end while;
    end;;
    delimiter ;
    call idata();

    03 Grammar form and classification of subquery

    3.1 Grammar form

    The syntax of subquery stipulates that subquery can be in an outer query Various positions appear, here we only introduce a few commonly used ones:

    3.1.1

    in the FROM clause, such as SELECT m, n FROM (SELECT m2 1 AS m, n2 AS n FROM t2 WHERE m2 > 2) AS t;

    The subquery in this example is: (SELECT m2 1 AS m, n2 AS n FROM t2 WHERE m2 > 2 ), this subquery placed in the FROM clause is equivalent to a table, but it is a little different from the table we usually use. This table composed of the subquery result set is called a derived table.

    3.1.2

    in WHERE or IN clause, such as: SELECT * FROM t1 WHERE m1 = (SELECT MIN(m2) FROM t2);

        SELECT * FROM t1 WHERE m1 IN (SELECT m2 FROM t2);

    Others include the SELECT clause, the ORDER BY clause, and the GROUP BY clause, although The syntax is supported, but it doesn’t make much sense, so I won’t dwell on these situations.

    3.2 Classification

    3.2.1 Distinguish

    scalar subquery according to the returned result set. A subquery that only returns a single value is called a scalar subquery, for example:

    SELECT * FROM t1 WHERE m1 = (SELECT m1 FROM t1 LIMIT 1);

    Row subquery is a subquery that only returns one record, but this Records need to contain multiple columns (containing only one column becomes a scalar subquery). For example: SELECT * FROM t1 WHERE (m1, n1) = (SELECT m2, n2 FROM t2 LIMIT 1);

    The column subquery only returns the data of one column, but this column The data needs to contain multiple records (containing only one record becomes a scalar subquery). For example: SELECT * FROM t1 WHERE m1 IN (SELECT m2 FROM t2);

    Table subquery means that the result of the subquery contains both many records and many columns, such as :

    SELECT * FROM t1 WHERE (m1, n1) IN (SELECT m2, n2 FROM t2);

    (SELECT m2, n2 FROM t2) is A table subquery needs to be compared with the row subquery. In the row subquery, we use LIMIT 1 to ensure that the result of the subquery has only one record.

    3.2.2 Distinguish irrelevant subqueries according to their relationship with the outer query

    Irrelevant subqueries mean that the subquery can be run independently to produce results without relying on the value of the outer query. We can Call this subquery an irrelevant subquery.
    Correlated subquery is a subquery that needs to depend on the value of the outer query, which is called a correlated subquery. For example: SELECT * FROM t1 WHERE m1 IN (SELECT m2 FROM t2 WHERE n1 = n2);

    04How subquery is executed in MySQL

    4.1 Scalar subquery, row subquery Execution method

    4.1.1 Uncorrelated subquery

    The following query statement:

    mysql root@localhost:test> explain select * from t1 where t1_a = (select t2_a from t2 limit 1);
    +----+-------------+-------+-------+---------------+-------+---------+--------+------+-------------+
    | id | select_type | table | type  | possible_keys | key   | key_len | ref    | rows | Extra       |
    +----+-------------+-------+-------+---------------+-------+---------+--------+------+-------------+
    | 1  | PRIMARY     | t1    | ref   | idx_a         | idx_a | 5       | const  | 1    | Using where |
    | 2  | SUBQUERY    | t2    | index | <null>        | idx_a | 5       | <null> | 1000 | Using index |
    +----+-------------+-------+-------+---------------+-------+---------+--------+------+-------------+

    Its execution method:

    Execute separately first ( select t2_a from t2 limit 1) This subquery.

    Then use the result of the previous subquery as the parameter of the outer query and then execute the outer query select * from t1 where t1_a = ....

    In other words, for query statements that contain unrelated scalar subqueries or row subqueries, MySQL will execute the outer query and subquery independently, just like two single table queries. .

    4.1.2 Related subqueries

    For example, the query below:

    mysql root@localhost:test> explain select * from t1 where t1_a = (select t2_a from t2 where t1.t1_b=t2.t2_b  limit 1);
    +----+--------------------+-------+------+---------------+--------+---------+--------+------+-------------+
    | id | select_type        | table | type | possible_keys | key    | key_len | ref    | rows | Extra       |
    +----+--------------------+-------+------+---------------+--------+---------+--------+------+-------------+
    | 1  | PRIMARY            | t1    | ALL  | <null>        | <null> | <null>  | <null> | 100  | Using where |
    | 2  | DEPENDENT SUBQUERY | t2    | ALL  | <null>        | <null> | <null>  | <null> | 1000 | Using where |
    +----+--------------------+-------+------+---------------+--------+---------+--------+------+-------------+

    The execution method is like this:

    First query from the outer layer Get a record from the t1 table. In this example, we first get a record from the t1 table.

    Then find the value involved in the subquery from the record obtained in the previous step, that is, find the value of the t1.t1_b column in the t1 table, and then execute the subquery.

    Finally, based on the query results of the subquery, it is detected whether the condition of the WHERE clause of the outer query is true. If it is true, the record of the outer query is added to the result set, otherwise it is discarded.

    Then repeat the above steps until all records in t1 are matched.

    4.2 IN subquery

    4.2.1 Materialization

    If the number of records in the result set of the subquery is very small, then the subquery and the outer query are regarded as The efficiency of two separate single-table queries is quite high, but if there are too many result sets after executing the subquery separately, it will lead to these problems:

    There are too many result sets, and they may not fit in the memory. ~

    For the outer query, if the result set of the subquery is too many, it means that there are too many parameters in the IN clause, which results in:

    1)无法有效的使用索引,只能对外层查询进行全表扫描。

    2)在对外层查询执行全表扫描时,由于 IN 子句中的参数太多,这会导致检测一条记录是否符合和 IN 子句中的参数匹配花费的时间太长。

    因此,可以将非相关子查询的结果集写入临时表中,而不直接将其作为外层查询的参数。写入临时表的过程是这样的:

    该临时表的列就是子查询结果集中的列。

    写入临时表的记录会被去重,让临时表变得更小,更省地方。

    一般情况下子查询结果集不大时,就会为它建立基于内存的使用 Memory 存储引擎的临时表,而且会为该表建立哈希索引。

    如果子查询的结果集非常大,超过了系统变量 tmp_table_size或者 max_heap_table_size,临时表会转而使用基于磁盘的存储引擎来保存结果集中的记录,索引类型也对应转变为 B+ 树索引。

    将子查询结果集中的记录保存到临时表中的过程被称为物化(Materialize)。我们可以称存储子查询结果集的临时表为物化表,以便更加便利。正因为物化表中的记录都建立了索引(基于内存的物化表有哈希索引,基于磁盘的有 B+ 树索引),通过索引执行IN语句判断某个操作数在不在子查询结果集中变得非常快,从而提升了子查询语句的性能。

    mysql root@localhost:test> explain select * from t3 where t3_a in (select t2_a from t2);
    +----+--------------+-------------+--------+---------------+------------+---------+--------------+------+-------------+
    | id | select_type  | table       | type   | possible_keys | key        | key_len | ref          | rows | Extra       |
    +----+--------------+-------------+--------+---------------+------------+---------+--------------+------+-------------+
    | 1  | SIMPLE       | t3          | ALL    | idx_a         | <null>     | <null>  | <null>       | 1000 | Using where |
    | 1  | SIMPLE       | <subquery2> | eq_ref | <auto_key>    | <auto_key> | 5       | test.t3.t3_a | 1    | <null>      |
    | 2  | MATERIALIZED | t2          | index  | idx_a         | idx_a      | 5       | <null>       | 1000 | Using index |
    +----+--------------+-------------+--------+---------------+------------+---------+--------------+------+-------------+

    其实上边的查询就相当于表 t3 和子查询物化表进行内连接:

    mysql root@localhost:test> explain select * from t3 left join t2 on t3.t3_a=t2.t2_a;
    +----+-------------+-------+------+---------------+--------+---------+--------------+------+--------+
    | id | select_type | table | type | possible_keys | key    | key_len | ref          | rows | Extra  |
    +----+-------------+-------+------+---------------+--------+---------+--------------+------+--------+
    | 1  | SIMPLE      | t3    | ALL  | <null>        | <null> | <null>  | <null>       | 1000 | <null> |
    | 1  | SIMPLE      | t2    | ref  | idx_a         | idx_a  | 5       | test.t3.t3_a | 1    | <null> |
    +----+-------------+-------+------+---------------+--------+---------+--------------+------+--------+

    此时 MySQL 查询优化器会通过运算来选择成本更低的方案来执行查询。

    虽然,上面通过物化表的方式,将IN子查询转换成了联接查询,但还是会有建立临时表的成本,能不能不进行物化操作直接把子查询转换为连接呢?直接转换肯定不行。
    -- 这里我们先构造了3条记录,其实也是构造不唯一的普通索引

    +------+------+------+
    | id   | t2_a | t2_b |
    +------+------+------+
    | 1100 | 1000 | 1000 |
    | 1101 | 1000 | 1000 |
    | 1102 | 1000 | 1000 |
    +------+------+------+
    -- 加限制条件where t2.id>=1100是为了减少要显示的数据
    mysql root@localhost:test> select * from t3 where t3_a in (select t2_a from t2 where t2.id>=1100);
    +-----+------+------+
    | id  | t3_a | t3_b |
    +-----+------+------+
    | 101 | 1000 | 101  |
    +-----+------+------+
    1 row in set
    Time: 0.016s
    mysql root@localhost:test> select * from t3 left join t2 on t3.t3_a=t2.t2_a where t2.id>=1100;
    +-----+------+------+------+------+------+
    | id  | t3_a | t3_b | id   | t2_a | t2_b |
    +-----+------+------+------+------+------+
    | 101 | 1000 | 101  | 1100 | 1000 | 1000 |
    | 101 | 1000 | 101  | 1101 | 1000 | 1000 |
    | 101 | 1000 | 101  | 1102 | 1000 | 1000 |
    +-----+------+------+------+------+------+
    3 rows in set
    Time: 0.018s

    所以说 IN 子查询和表联接之间并不完全等价。而我们需要的是另一种叫做半联接 (semi-join) 的联接方式 :对于 t3 表的某条记录来说,我们只关心在 t2 表中是否存在与之匹配的记录,而不关心具体有多少条记录与之匹配,最终的结果集中也只保留 t3 表的记录。

    注意:semi-join 只是在 MySQL 内部采用的一种执行子查询的方式,MySQL 并没有提供面向用户的 semi-join 语法。

    4.2.2 半联接的实现:
    • Table pullout (子查询中的表上拉)

    当子查询的查询列表处只有主键或者唯一索引列时,可以直接把子查询中的表上拉到外层查询的 FROM 子句中,并把子查询中的搜索条件合并到外层查询的搜索条件中,比如这个:

    mysql root@localhost:test> select * from t3 where t3_a in (select t2_a from t2 where t2.id=999)
    +-----+------+------+
    | id  | t3_a | t3_b |
    +-----+------+------+
    | 102 | 999  | 102  |
    +-----+------+------+
    1 row in set
    Time: 0.024s
    mysql root@localhost:test> select * from t3 join t2 on t3.t3_a=t2.t2_a where t2.id=999;
    +-----+------+------+-----+------+------+
    | id  | t3_a | t3_b | id  | t2_a | t2_b |
    +-----+------+------+-----+------+------+
    | 102 | 999  | 102  | 999 | 999  | 999  |
    +-----+------+------+-----+------+------+
    1 row in set
    Time: 0.028s
    mysql root@localhost:test> explain select * from t3 where t3_a in (select t2_a from t2 where t2.id=999)
    +----+-------------+-------+-------+---------------+---------+---------+-------+------+--------+
    | id | select_type | table | type  | possible_keys | key     | key_len | ref   | rows | Extra  |
    +----+-------------+-------+-------+---------------+---------+---------+-------+------+--------+
    | 1  | SIMPLE      | t2    | const | PRIMARY,idx_a | PRIMARY | 4       | const | 1    | <null> |
    | 1  | SIMPLE      | t3    | ref   | idx_a         | idx_a   | 5       | const | 1    | <null> |
    +----+-------------+-------+-------+---------------+---------+---------+-------+------+--------+
    • FirstMatch execution strategy (首次匹配)

    FirstMatch 是一种最原始的半连接执行方式,跟相关子查询的执行方式是一样的,就是说先取一条外层查询的中的记录,然后到子查询的表中寻找符合匹配条件的记录,如果能找到一条,则将该外层查询的记录放入最终的结果集并且停止查找更多匹配的记录,如果找不到则把该外层查询的记录丢弃掉。然后再开始取下一条外层查询中的记录,重复上边这个过程。

    mysql root@localhost:test> explain select * from t3 where t3_a in (select t2_a from t2 where t2.t2_a=1000)
    +----+-------------+-------+------+---------------+-------+---------+-------+------+-----------------------------+
    | id | select_type | table | type | possible_keys | key   | key_len | ref   | rows | Extra                       |
    +----+-------------+-------+------+---------------+-------+---------+-------+------+-----------------------------+
    | 1  | SIMPLE      | t3    | ref  | idx_a         | idx_a | 5       | const | 1    | <null>                      |
    | 1  | SIMPLE      | t2    | ref  | idx_a         | idx_a | 5       | const | 4    | Using index; FirstMatch(t3) |
    +----+-------------+-------+------+---------------+-------+---------+-------+------+-----------------------------+
    • DuplicateWeedout execution strategy (重复值消除)

    转换为半连接查询后,t3 表中的某条记录可能在 t2 表中有多条匹配的记录,所以该条记录可能多次被添加到最后的结果集中,为了消除重复,我们可以建立一个临时表,并设置主键id,每当某条 t3 表中的记录要加入结果集时,就首先把这条记录的id值加入到这个临时表里,如果添加成功,说明之前这条 t2 表中的记录并没有加入最终的结果集,是一条需要的结果;如果添加失败,说明之前这条 s1 表中的记录已经加入过最终的结果集,直接把它丢弃。

    • LooseScan execution strategy (松散扫描)

    这种虽然是扫描索引,但只取值相同的记录的第一条去做匹配操作的方式称之为松散扫描。

    4.2.3 半联接的适用条件

    当然,并不是所有包含IN子查询的查询语句都可以转换为 semi-join,只有形如这样的查询才可以被转换为 semi-join:

    SELECT ... FROM outer_tables 
        WHERE expr IN (SELECT ... FROM inner_tables ...) AND ...
    
    -- 或者这样的形式也可以:
    
    SELECT ... FROM outer_tables 
        WHERE (oe1, oe2, ...) IN (SELECT ie1, ie2, ... FROM inner_tables ...) AND ...

    用文字总结一下,只有符合下边这些条件的子查询才可以被转换为 semi-join:

    1. 该子查询必须是和IN语句组成的布尔表达式,并且在外层查询的 WHERE 或者 ON 子句中出现

    2. 外层查询也可以有其他的搜索条件,只不过和 IN 子查询的搜索条件必须使用AND 连接起来

    3. 该子查询必须是一个单一的查询,不能是由若干查询由 UNION 连接起来的形式

    4. 该子查询不能包含 GROUP BY 或者 HAVING 语句或者聚集函数

    4.2.4 转为 EXISTS 子查询

    可以将 IN 子查询转换为 EXISTS 子查询,无论该子查询是相关的还是不相关的。这里提供了一个通用的例子:可以将任何一个 IN 子查询转换为 EXISTS 子查询

    outer_expr IN (SELECT inner_expr FROM ... WHERE subquery_where)
    -- 可以被转换为:
    EXISTS (SELECT inner_expr FROM ... WHERE subquery_where AND outer_expr=inner_expr)

    当然这个过程中有一些特殊情况,比如在 outer_expr 或者 inner_expr 值为 NULL 的情况下就比较特殊。如果表达式中有任何一个操作数的值为 NULL,结果通常是 NULL。例如:

    mysql root@localhost:test> SELECT NULL IN (1, 2, 3);
    +-------------------+
    | NULL IN (1, 2, 3) |
    +-------------------+
    | <null>            |
    +-------------------+
    1 row in set

    EXISTS 子查询的结果肯定是布尔型,值为 TRUE 或 FALSE。但是现实中我们大部分使用 IN 子查询的场景是把它放在 WHERE 或者 ON 子句中,而 WHERE 或者 ON 子句是不区分 NULL 和 FALSE 的,比方说:

    mysql root@localhost:test> SELECT 1 FROM s1 WHERE NULL;
    +---+
    | 1 |
    +---+
    0 rows in set
    Time: 0.016s
    mysql root@localhost:test> SELECT 1 FROM s1 WHERE FALSE;
    +---+
    | 1 |
    +---+
    0 rows in set
    Time: 0.033s

    所以只要我们的IN子查询是放在 WHERE 或者 ON 子句中的,那么 IN ->  EXISTS 的转换就是没问题的。为什么需要进行转换呢?因为如果不进行转换,可能无法使用索引,例如下面的查询语句:

    mysql root@localhost:test> explain select * from t3 where t3_a in (select t2_a from t2 where t2.t2_a>=999) or t3_b > 1000;
    +----+-------------+-------+-------+---------------+--------+---------+--------+------+--------------------------+
    | id | select_type | table | type  | possible_keys | key    | key_len | ref    | rows | Extra                    |
    +----+-------------+-------+-------+---------------+--------+---------+--------+------+--------------------------+
    | 1  | PRIMARY     | t3    | ALL   | <null>        | <null> | <null>  | <null> | 1000 | Using where              |
    | 2  | SUBQUERY    | t2    | range | idx_a         | idx_a  | 5       | <null> | 107  | Using where; Using index |
    +----+-------------+-------+-------+---------------+--------+---------+--------+------+--------------------------+

    但是将它转为 EXISTS 子查询后却可以使用到索引:

    mysql root@localhost:test> explain select * from t3 where exists (select 1 from t2 where t2.t2_a>=999 and t2.t2_a=t3.t3_a) or t3_b > 1000;
    +----+--------------------+-------+------+---------------+--------+---------+--------------+------+--------------------------+
    | id | select_type        | table | type | possible_keys | key    | key_len | ref          | rows | Extra                    |
    +----+--------------------+-------+------+---------------+--------+---------+--------------+------+--------------------------+
    | 1  | PRIMARY            | t3    | ALL  | <null>        | <null> | <null>  | <null>       | 1000 | Using where              |
    | 2  | DEPENDENT SUBQUERY | t2    | ref  | idx_a         | idx_a  | 5       | test.t3.t3_a | 1    | Using where; Using index |
    +----+--------------------+-------+------+---------------+--------+---------+--------------+------+--------------------------+

    需要注意的是,如果 IN 子查询不满足转换为 semi-join 的条件,又不能转换为物化表如果将其转换为物化表成本过高,那么就需要使用 EXISTS 查询。如果将其转换为物化表成本过高,那么就需要使用 EXISTS 查询。

    The above is the detailed content of What is the principle of MySQL subquery. For more information, please follow other related articles on the PHP Chinese website!

    Statement:
    This article is reproduced at:yisu.com. If there is any infringement, please contact admin@php.cn delete