Home >Database >Mysql Tutorial >MySQL Internals Optimizer

MySQL Internals Optimizer

黄舟
黄舟Original
2016-12-16 11:01:451121browse

The Optimizer

This article describes how the MySQL query optimizer works. The MySQL query optimizer mainly determines the most efficient route (routine, direction) for the executed query.

One. Source code and concepts


This part discusses the key concepts and terminology of the optimizer, and how they correspond to the MySQL source code.

1. Definition

Narrow definition: The optimizer is a series of routes that the DBMS uses to determine which execution path to take when querying.

MySQL often adjusts the query route, so you have to compare the logic described in this article with the logic in the source code. To make the comparison easier, annotations of relevant files and paths will be included here, such as source code/sql/sql_select.cc, optimize_cond() function.

When a query is converted into another query and the result is the same, statement conversion occurs. The following query

SELECT ... WHERE 5 = a
will be converted into


SELECT ... WHERE a = 5
The most obvious statement conversion is less, and some statement conversions are for faster execution .

2. Optimizer source code

The following pseudo code shows the logical structure of the handle_select() function in /sql/sql_select.cc. (Source code/sql/sql_select.cc handles SQL queries)

handle_select()
mysql_select()
JOIN::PRepare()
setup_fields()
JOIN::optimize() /* optimizer is from here... * /
Optimize_cond ()
OPT_SUM_QUERY ()
Make_join_statistics ()
get_quick_record_count ()
chooose_plan ()
/ * Find the best way to accept P / * As Specified by the User. * /
Optimize_straight_join ()
BEST_ACCESS_PATH ()
                                                                                         best_extension_by_limited_search ()
             best_access_path()                                                                                                    /* Perform an exhaustive search for an optimal plan */                                                                                                                                                                 .
The indented row shows which function Which function to call, such as handle_select() function calls mysql_select() function, mysql_select() function will call JOIN::prepare(), JOIN::optimize(), JOIN::exec(), and so on. The first part of the mysql_select() function is to call JOIN::prepare(). This function is used for context analysis, metadata creation and some statement transformations. The query optimizer function JOIN::optimize() and all its sub-routes in the optimization process. After the JOIN::optimize() function is executed, JOIN::exec() takes over and completes the execution work after the optimization decision of the JOIN::optimize() function.

Although the word JOIN appears, the query optimizer actually handles all query types, not just JOIN join queries.



Two. Top Optimizations


This section discusses the most important optimizations performed by the server.

1. Optimize the constant relationship

Constant equivalent transfer


The following expression will undergo statement transformation:

WHERE column1 = column2 AND column2 = 'x'

This expression, as we all know, if A =B && B=C => A=C (can be passed by equal value), the expression in the previous sentence becomes:

WHERE column1='x' AND column2='x'


if and only When the operator is any of the following, statement transformation will occur in the column1 column2 condition:


=, <, >, <=, >=, <>, <=>, LIKE

Note: The transformation of equal value transfer is not suitable for BETWEEN. It may not be suitable for LIKE either, but that’s a story later.


Constant equivalent transfer also occurs in loops, and the output of the previous step is used as the input of the next step.


See /sql/sql_select.cc for source code, change_cond_ref_to_const() function. Or /sql/sql_select.cc, propagate_cond_constants() function.

Eliminate dead code


Conditions that are always TRUE will undergo statement transformation, such as:

WHERE 0=0 AND column1='y'
In this case, the first condition will be eliminated, and finally it will be :


column1='y'
See /sql/sql_select.cc for the source code, remove_eq_conds().

Conditions that are always FLASE will also undergo statement transformation, such as:

WHERE (0 = AND s1 = OR s1 = 7
Parentheses and the first two conditions are always FLASE, and the last one is:


WHERE s1 = 7

In some cases, when the WHERE statement represents an impossible condition, the query optimizer may eliminate all statements, as follows:

WHERE (0 = AND s1 = 5)
Because this condition will never be is TRUE, Impossible WHERE will be displayed in the EXPLAIN analysis. Simply put, MySQL will say that the WHERE condition has been optimized.


If a field cannot be NULL, the optimizer will eliminate all irrelevant IS NULL conditions. In this way, the condition

WHERE not_null_column IS NULL
is always FLASE; and the condition


WHERE not_null_column IS NOT NULL
is always TRUE, so this field query condition is also eliminated. Subtle. For example: in an OUT JOIN outer join, a field that is defined as NOT NULL still contains a NULL value, and the optimizer will exclude the IS NULL condition in this special case. The optimizer will not check it. All Impossible WHERE conditions, because there are too many possibilities in this area. For example:

CREATE TABLE Table1 (column1 CHAR(1));

...

SELECT * FROM Table1 WHERE column1 = 'Popgo';

Optimization The processor will not eliminate the conditions of this query, even if it is made impossible in the CREATE TABLE definition



Combinable constant values ​​


The following expression will undergo statement transformation:

WHERE column1 =. 1 + 2

Finally:

WHERE column1 = 3
As mentioned before, the optimizer can easily merge such query statements together. This operation simplifies the results. And constant table

MySQL constant value sometimes refers not only to the literal meaning of the SQL statement of the query, but also in the content of constant tables (constant tables) is defined as:

1. A table with no records or one record

2. The expression of the table is constrained by the WHERE condition, and contains the expression form column = "constant", or all fields of the primary key of the table, or all fields of any unique key (unique The key field is defined as NOT NULL)

For example, the definition of the Table0 table contains:

... PRIMARY KEY (column1, column2)

Then, the query expression:


FROM Table0... WHERE column1=5 AND column2=7 ...

will return constant tables. More simply, if the definition of the Table1 table contains:

... unique_not_null_column INT NOT NULL UNIQUE
Then, the query expression:


FROM Table1... WHERE unique_not_null_column=5
will also return the constant table (constant tables).



This rule means that a constant table (constant tables) has at most one record value. MySQL will first evaluate whether it is a constant table (constant tables) and find out that value. In this way, MySQL will insert this value into the query statement. For example:


SELECT Table1.unique_not_null_column, Table2.any_column
FROM Table1, Table2

WHERE Table1.unique_not_null_column = Table2.any_column

AND Table1.unique_not_null_column = 5;
When MySQL evaluates this statement, it will first It is found that according to the constant The second definition of the table (constant tables) is that the table Table1 with the query condition unique_not_null_column is a constant table (constant tables), and it will obtain this value.

If the value fails, that is, unique_not_null_column = no value in Table1, the result after EXPLAIN:



Impossible WHERE noticed after reading const tables
On the contrary, if the value succeeds, that is, unique_not_null_column = is in Table1 A record value, MySQL will be converted into the following statement:

SELECT 5, Table2.any_column
FROM Table1, Table2

WHERE 5 = Table2.any_column

AND 5 = 5;

In fact, this is a good example of. The optimizer performs some statement transformations due to the constant equivalent passing mentioned earlier. In addition, why should we describe constant equal value transfer first? Because it is done before MySQL confirms what constant tables are. The order of optimizer steps sometimes differs.


Although many queries do not have reference to constant tables. It should be remembered that whenever the word constant is mentioned in the future, it refers to any literal value or the contents of a constant table.

2. Optimize JOIN connection


This section discusses different methods of optimizing JOIN connections. Note: JOIN connection does not only refer to the JOIN type, but also the types of all conditional queries. Some people prefer to call it access type.

Determine the JOIN connection type


When evaluating the query condition expression, MySQL will determine which JOIN connection type it belongs to.

The following JOIN types are recorded in the file, sorted from best to worst:

system: system type of constant tables (constant tables)
const: constant tables (constant tables)
eq_ref: unique equality relationship Or primary key index
ref: Index of equality relationship, the value of this index cannot be NULL
ref_or_null: Index of equality relationship, the value of this index may be NULL
range: Related index, such as BETWEEN, IN, >=, LIKE, etc.
index: sequential scan index
ALL: sequential scan of the entire table data

See /sql/sql_select.h, enum join_type{} for the source code. In addition, there is a small part that is not recorded in the file, for the JOIN connection type of the subquery.

The optimizer uses the JOIN connection type to select a driving expression, as follows:

SELECT *
FROM Table1
WHERE indexed_column = AND unindexed_column = 6
If indexed_column has a better JOIN connection type, it is more likely to become a driving expression Mode. You will also encounter various exceptions to it, but this description is the first simple optimization rule.


What is the most meaningful thing for a driver? There are two execution paths for the following query:


Worst execution plan: scan all rows of the read table. This is also called a sequential scan or simple table scan of Table1. Query each row and check whether the values ​​of indexed_column and unindexed_column match the query conditions.


The best execution plan: Retrieve records with indexed_column = value through the index. This is also called an indexed search. Query each row and check whether the value of the unindexed_column column matches the query conditions.


An indexed search usually calls for fewer accesses than a sequential scan, and if the table being accessed is huge and the index is unique, the table accesses will be very few. This is why access tables with good execution plans are better, and this is why indexed_column is often used as the driver.

Methods of joining and access

In single table search, bad JOIN join execution choices cause more performance damage than bad execution choices. So MySQL developers spend more time ensuring that the tables in the query are joined in an optimal order and that the best access method (often called the access path) is chosen to check the table data. The combination of a fixed order of table joins and the corresponding table access methods for all tables is called a query execution plan (QEP). The purpose of the query optimizer is to find the best QEP among all possible plans. There are some general concepts behind JOIN connection priority.

Each plan or part of a plan is defined as COST. The cost of a plan roughly reflects the resources required to compute the query as planned, with the primary factor being the total number of records accessed when computing the query. Once we have a way to assign QEPs to different costs, we have a way to compare them. In this way, the purpose of the optimizer is to find the lowest-cost QEP among all possible plans.

In MySQL, the best QEP search is implemented in a bottom-up approach. The optimizer first confirms all plans for one table, then all plans for two tables, and so on until a complete optimal QEP is established. Query plans that include only some of the tables and predicates in the query are called partial plans. The optimizer relies on the fact that the more tables are added to partial plans, the higher the cost (note: the higher the cost, the lower the execution efficiency). This allows the optimizer to scale to more tables using only lower-cost partial plans than the current best full plan.
To complete the key route of searching for the best QEP, see sql/sql_select.cc, find_best(). It performs an exhaustive search of all possible plans, thereby guaranteeing that it will eventually find an optimal one.


Below we describe the pseudo code of the find_best() method. This is recursive, so some of the input variables are marked to indicate where they are from the previous iteration so far.

remaining_tables = {t1, ..., tn}; /* all tables referenced in a query */

procedure find_best(
   partial_plan in,      /* in, partial plan of tables-joined-so-far */
   partial_plan_cost,    /* in, cost of partial_plan */
   remaining_tables,     /* in, set of tables not referenced in partial_plan */
   best_plan_so_far,     /* in/out, best plan found so far */
   best_plan_so_far_cost)/* in/out, cost of best_plan_so_far */
{
   for each table T from remaining_tables
   {
     /* Calculate the cost of using table T. Factors that the
        optimizer takes into account may include:
          Many rows in table (bad)
          Many key parts in common with tables so far (very good)
          Restriction mentioned in the WHERE clause (good)
          Long key (good)
          Unique or primary key (good)
          Full-text key (bad)
        Other factors that may at some time be worth considering:
          Many columns in key
          Short average/maximum key length
          Small table file
          Few levels in index
          All ORDER BY / GROUP columns come from this table */
     cost = complex-series-of-calculations;
     /* Add the cost to the cost so far. */
     partial_plan_cost+= cost;

     if (partial_plan_cost >= best_plan_so_far_cost)
       /* partial_plan_cost already too great, stop search */
       continue;

     partial_plan= expand partial_plan by best_access_method;
     remaining_tables= remaining_tables - table T;
     if (remaining_tables is not an empty set)
     {
       find_best(partial_plan, partial_plan_cost,
                 remaining_tables,
                 best_plan_so_far, best_plan_so_far_cost);
     }
     else
     {
       best_plan_so_far_cost= partial_plan_cost;
       best_plan_so_far= partial_plan;
     }
   }
}

这里优化器利用了一种深度优先搜索算法。它完成了在FROM语句中评估所有的表。如果评估比起目前为止最好的评估,变得更差,它将停止搜索。扫描的顺序依赖于出现FROM语句中的表的顺序。


源代码见:/sql/table.h, struct st_table。


 分析表(ANALYZE TABLE)可能会影响到一些优化器决断的因素。源代码见:/sql/sql_sqlect.cc,make_join_statistics()。


 find_best()和greedy_search()的直截了当(straightforward)使用将不会用于LEFT JOIN或RIGHT JOIN。例如,从MySQL 4.0.14起,在一些情况下,优化器可能转变LEFT JOIN为STRAIHT JOIN,并交换表的顺序。另外见:LEFT JOIN and RIGHT JOIN Optimization。


 

 RANGE联接类型

 有些条件可以使用索引,但是在一个键的范围(range)或宽度内。这些称为范围条件,最常看到的是带>,>=,<,<=,IN,LIKE,BETWEEN的查询表达式。


 

 对优化器来说,如下表达式:

column1 IN (1,2,3)
和这个是一样的:


column1 = OR column1 = OR column1 = 3
 MySQL同样对待这种语句,无需对查询条件的IN到OR或OR到IN做转变。


 

 如下语句,优化器也会用到索引(范围查询range search)

column1 LIKE 'x%'
 但这种就不行:


column1 LIKE '%x%'
 也就是说,如果匹配条件的第一个字符是通配符,那就没范围查询。

 

 同样,如下两个语句也是一样的

column1 BETWEEN 5 AND 7


column1 >= AND column1 <= 7

If the query conditions retrieve too many index keys, the optimizer may change the RANGE join type to the ALL JOIN join type. Transformations like this are especially possible in < and > conditions and multi-level secondary indexes. See the source code: /myisam/mi_range.c, mi_records_in_range() (MyISAM index).


INDEX join type

Consider this query

SELECT column1 FROM Table1;
If column1 has an index added, the optimizer may choose to get the value from the added index instead of the table (full table scan). Indexes like this are generally called covering indexes (COVERING INDEX). In the EXPLAIN Extra description, MySQL will simply use the word INDEX to represent the covering index (COVERING INDEX).


Statement:

SELECT column1, column2 FROM Table1; Only when the index is defined as follows, the optimizer will use the JOIN connection type INDEX: join type = index CREATE INDEX ... ON Table1 (column1, column2);
In other words, the queried fields (such as column1, column2) must be indexed, and the multiple indexed fields are not in order. Therefore, it makes more sense to strictly define a multiple-column index as a covering index (COVERING INDEX) to be used regardless of search considerations.


INDEX MERGE join type


Overview

Using index merging (INDEX MERGE), when the table conditions can be transformed into the following:


cond_1 OR cond_2 ... OR cond_N
The conditions for transformation are: each Several cond_i (cond_1, cond_2...) conditions can be used for range scans, and no pair of conditions (cond_i, cond_j) use the same index. If the cond_i and cond_j conditions use the same index, then the cond_i or cond_j conditions can be combined into a single range scan, and there is no need to merge.


The following query uses index merge (INDEX MERGE):

SELECT * FROM t WHERE key1=c1 OR key2

SELECT * FROM t WHERE (key1= c1 OR key2 Index MERGE (INDEX MERGE) is a container that is implemented as a range key and constructed with cond_i (cond_1, cond_2...) conditions. When doing an index merge (INDEX MERGE), MySQL retrieves the rows for each keyscans and then runs them through a deduplication process. Currently the class Unique is used to eliminate duplicates.

INDEX MERGE Optimizer


A single SEL_TREE object cannot be constructed to have keys with different members in an OR statement, similar to this condition:

key1 < c1 OR key2 < c2


From Starting with MySQL 5.0, these conditions are handled by the index merge (INDEX MERGE) method and the range optimizer (range optimizer) structure class SEL_IMERGE. SEL_IMERGE represents the disjunction of several SEL_TREE objects, which is expressed as:

sel_imerge_cond = (t_1 OR t_1 OR ... OR t_n)
Each t_i (t_1, t_2...) represents a SEL_TREE, there is no pair (t_i, t_j) Different SEL_TREE objects can be merged into a single SEL_TREE object.


The current implementation method only constructs SEL_IMERGE when no single SEL_TREE object can be constructed as part of the analyzed query; if it is found that a single SEL_TREE object can be constructed, the SEL_TREE will be discarded immediately. This is actually a limitation and can lead to the use of worst-case row retrieval strategies. The following query:


SELECT * FROM t WHERE (goodkey1=c1 OR goodkey1=c2) AND badkey=c3
The scan on badkey will be selected, even if the index merge (INDEX MERGE) on (goodkey1, goodkey1) will be updated quick.


Index Merge (INDEX MERGE) The optimizer will collect a list of all possible routes for index merge (INDEX MERGE) to access rows. This SEL_IMERGE structure list represents the following conditions:


(t_11 OR t_12 OR ... OR t_1k) AND
(t_21 OR t_22 OR ... OR t_2l) AND
... OR . .. OR t_mp)
When t_ij is a SEL_IMERGE and a condition is a SEL_IMERGE object.

Minimum cost SEL_IMERGE object used to retrieve rows.

For detailed information on the index merge (INDEX MERGE) constructor, see: source code sql/opt_range.cc, imerge_list_and_list(), imerge_list_or_list(), and member functions of the SEL_IMERGE class.

RANGE Optimizer

For range RANGE queries, the MySQL optimizer builds a SEL_TREE object in the following form:

range_cond = (cond_key_1 AND cond_key_2 AND ... AND cond_key_N)

Each cond_key_i is a key conditions of the components. MySQL creates a cond_key_i condition for each useful key. Then this cheapest condition cond_key_i is used to do range RANGE scanning.


A single cond_key_i condition is represented by a pointer-linked network of SEL_ARG objects. Each SEL_ARG object refers to a specific part of the key and represents the following conditions:


sel_arg_cond= (inf_val < key_part_n AND key_part_n < sup_val) (1)
                                          )
              OR left_sel_arg_cond                                                                                                               4)

1. The implementation interval may have no upper or lower critical values, or may or may not include critical values.

2. Implement a SEL_ARG object with condition on next key component (is for a SEL_ARG object with condition on next key component).

3. Implement a SEL_ARG object with an interval on the same field as this SEL_ARG object (is for a SEL_ARG object with an interval on the same field as this SEL_ARG object). The intervals between the current object and the left object are disjoint. left_sel_arg_cond.sup_val <= inf_val.

4. Implement a spaced SEL_ARG object in the same area as this SEL_ARG object. The intervals between the current object and the right object are disjoint. left_sel_arg_cond.min_val >= max_val.

MySQL will convert nested AND-OR conditions of any depth into the above connected form.

Row retrieval algorithm

Index merge (INDEX MERGE) has the following two steps:

Preparation phase:

activate 'index only';

foreach key_i in (key_scans clustered_pk_scan)
{
while (retrieve next (key, rowid) pair from key_i)
{
if (no clustered PK scan ||
row doesn't match clustered PK scan condition)
put rowid into Unique;
}
}
deactivate 'index only';

Row retrieval phase:

for each rowid in Unique

{
retrieve row and pass it to output;
}
if (clustered_pk_scan)
{
while (retrieve next row for clustered_pk_scan)
pass row to output;
}
See the source code: sql/opt_range.cc, the index merging (INDEX MERGE) row retrieval code of the QUICK_INDEX_MERGE_SELECT class function.


3. Transpositions

MySQL supports transposition of simple statement expressions (reversing the order of operands of relational operators). In other words:

WHERE - 5 = column1

This statement can be converted into:

WHERE column1 = -5

However, MySQL does not support transpositions with operations, such as:

WHERE 5 = - column1

And this sentence cannot be treated equally:

WHERE column1 = -5

The transposition of column = constant expressions like this is for better index retrieval. If this form of statement has indexed fields, MySQL will always use the indexed ones regardless of the size of the table. (Exception: If the table has no records or only one row of records, it is a constant table and requires special handling. See Constant Values ​​and Constant Tables).

AND relationship

An AND query form is condition1 AND condition2, as follows:

WHERE column1 = 'x' AND column2 = 'y'

This step describes the decision-making process of the optimizer:

1. If neither condition is indexed, a sequential scan (full table scan) is used.

2. In addition to the previous point, if one of the conditions has a better JOIN join type, select a driver with the JOIN join type. (See Determining the JOIN Connection Type)
3. In addition to the first two points, if both conditions have indexed and equal JOIN connection types (note: JON connection types have good or bad effects), a driver will be selected based on the first created index.

The optimizer will also select index merging (INDEX MERGE) based on index crossover, see INDEX MERGE join type. Examples are as follows:

CREATE TABLE Table1 (s1 INT, s2 INT);
CREATE INDEX Index1 ON Table1 (s2);
CREATE INDEX Index2 ON Table1 (s1);
...
SELECT * FROM Table1 WHERE s1 = AND s2 = 5;
When choosing a strategy to solve this query, the optimizer will choose s2 = 5 as the driver because the index on s2 is created first. Treat it as a casual effect rather than a rule that may change at any time.


OR relationship

An OR query form is condition1 OR condition2, as follows:

WHERE column1 = 'x' OR column2 = 'y'
The decision of this query optimizer is to use a sequential full table scan.

There is also an option to use index merging (INDEX MERGE) under certain circumstances. For more information, see INDEX MERGE Optimizer and Index Merge Optimization.


The above specific case cannot be used if the fields of the two conditions are the same. As follows:

WHERE column1 = 'x' OR column1 = 'y'
In this case, since the statement is a RANG query, it will be indexed. This topic will be seen again in the discussion of IN predicates.


UNION query

All SELECT query statements containing UNION will be optimized individually. Therefore, this query:

SELECT * FROM Table1 WHERE column1 = 'x'
UNION ALL
SELECT * FROM TABLE1 WHERE column2 = 'y'
If both column1 and column2 are indexed, each SELECT will use index scan, respectively The result sets are merged. Note: This query may produce the same result set if the query uses the sequential scan OR example.


NOT (<>) relationship

A logical condition is as follows:


column1 <> 5
is equivalent to:


column1 < 5 OR column1 > 5
However, MySQL No conversion statement will be made for this condition. If you think that using RANG to query will result in better results, you must manually convert the statements yourself.

There is also a logical condition as follows:

WHERE NOT (column1 != 5)
is equivalent to:


WHERE column1 = 5
In this case, MySQL will not perform statement conversion.


We look forward to adding new optimization methods for the above two situations.

ORDER BY statement


Usually, if the optimizer finds that the row records are ordered anyway, it will also skip the SORT process in the ORDER BY statement. However, there are still a few exceptions to verify.


Example:

SELECT column1 FROM Table1 ORDER BY 'x';
The optimizer will throw away the ORDER BY statement, which is also an example of dead code deletion.


Example:


SELECT column1 FROM Table1 ORDER BY column1;
The optimizer will use the index of column1, if it exists.


Example:

SELECT column1 FROM Table1 ORDER BY column1+1;
The optimizer will use the index of column1, if it exists. But don't be confused, indexes are only used to find record values. In addition: the cost of sequentially scanning the index is cheaper than the cost of sequentially scanning the entire table (generally the size of the index will be smaller than the size of the data value), which is why the INDEX JOIN connection type is more optimized than the ALL type. See Determining the JOIN Join Type.


There is also a sorting SORT for all result sets, for example:


SELECT * FROM Table1
WHERE column1 > 'x' AND column2 > 'x'
ORDER BY column2;
If both column1 and column2 If there is an index, the optimizer will go to the index on column1. In this query, the ordering of column2 values ​​does not affect driver selection.


See the source code: /sql/sql_select.cc, test_if_order_by_key() and /sql/sql_select.cc, test_if_skip_sort_order().


ORDER BY Optimization describes the content mechanism of the SORT sorting process and will not be explained again here. But you are urged to read it because it describes the operation of the buffering and quicksort mechanisms.

GROUP BY and related conditions


Here is the main description of GROUP BY and related conditions (HAVING, COUNT(), MAX(), MIN(), SUM(), AVG(), DISTINCT()) optimization.


GROUP BY will use the index if an index exists.
GROUP BY will use sorting if no index exists. The optimizer may choose to use HASH table sorting.
In the case of GROUP BY x ORDER BY x, the optimizer will think that ORDER BY is unnecessary because GROUP BY will be sorted by x.
The optimizer contains code in the WHERE statement that transfers specific HAVING conditions. However, this code is not valid at the time of writing. See the source code: /sql/sql_select.cc, JOIN::optimize(), after #ifdef HAVE_REF_TO_FIELDS.
If the table handler has a valid fast row-count, then this query:
SELECT COUNT(*) FROM Table1;
You can get the row count without scanning all rows. This is only true for MyISAM tables, but not for InnoDB tables. In addition, this query


SELECT COUNT(column1) FROM Table1;
will not have the same optimization unless column1 is defined as NOT NULL.


New optimization methods for MAX() and MIN(). Example:
SELECT MAX(column1)
FROM Table1
WHERE column1 < 'a';
If column1 is indexed, it is easy to find the maximum value by querying the 'a' value in the index and returning the index key before that.


Optimize the query in the following form and perform statement transformation:
SELECT DISTINCT column1 FROM Table1;
Into:


SELECT column1 FROM Table1 GROUP BY column1;
If and only if these two conditions are true:


* GROUP BY can be completed by index. This implies that there is only one table in the FROM statement and no WHERE statement.
* There is no LIMIT statement.

Because DISTINCT statements are not always converted into GROUP BY, do not expect that query statements containing DISTINCT will always have a sorted result set. However, you can rely on GROUP BY optimization rules unless the query includes ORDER BY NULL.


Three. Other optimizations


This section discusses other more special optimization methods.

1. NULLs value filtering access for ref and eq_ref


This part discusses the NULLs value filtering optimization method for ref and eq_ref connection types.

Early NULLs value filtering

Suppose we have a join sequence as follows:

..., tblX, ..., tblY, ...
Assume more deeply that table tblY is joined through ref or eq_ref The type is accessed:


tblY.key_column = tblX.column
Alternatively, using ref type access with multiple key parts:


... AND tblY.key_partN = tblX.column AND ...
tblX.column works is NULL. When accessing ref (or eq_ref) type, NULLs filtering will be applied in the early stage. We make the following inference:


(tblY.key_partN = tblX.column) => (tblX.column IS NOT NULL)
The original equation is checked only after reading the current row records of tables tblX and tblY. The IS NOT NULL predicate is checked only after reading the current row record in table tblX. If there are any


other tables in the union sort of tables tblX and tblY, the IS NOT NULL predicate check allows us to skip accessing those tables.

The implementation code of this feature is as follows:


ref analyzer (containing method update_ref_and_keys()) checks and marks query equations of this type like the above by setting KEY_FIELD::null_rejecting=TRUE.
After selecting JOIN connection sorting, add_not_null_conds() will add the appropriate IS NOT NULL predicate to the relevant conditions of the appropriate table.

Adding the IS NOT NULL predicate to all equations may be used by ref access types (rather than those that are actually used). However, this is not currently being done.

Late (Late) NULLs filtering

Suppose we have a table tblX query plan, which is accessed through the ref access type:

tblX.key_part1 = expr1 AND tblX.key_part2 = expr2 AND ...
In calling the index Before retrieval, we determine whether any expri (expr1, expr2, expr3...) value is NULL. If it is, we will not call the search, but will immediately return an array of no matches found.


This optimization method reuses the null_rejecting attribute generated by early NULLs filtering. The source code of this check can be found in: function join_read_always_key().

2. Partition-related optimization

This part discusses MySQL partition-related optimization. For concepts and implementation related to MySQL5.1 partitioning, see: Partitioning.

Partition pruning (pruning)

The operation of partition pruning is defined as follows:

"Provides a query for a partitioned table, comparing the DDL statement of this partitioned table with any WHERE or ON statement in the query , and find the minimum partition set accessed by this query. "


The partition set obtained in this way is much smaller than the set of all partitions of the table. This partition set will also be used in subsequent query statements. Other partitions that are not added to this partition set will not be accessed, that is to say, the partitions have been pruned. Because of this, queries execute faster.


Non-Transactional Table Engines.??If MyISAM has no transactional storage engine, locks will be added to the entire partition table. Theoretically, it is possible to improve concurrency using partition pruning by only adding locks to the partitions being used. But this function has not been implemented yet.

Partition pruning does not depend on the storage engine of the table, so this function is part of the MySQL query optimizer. The following sections describe the details of partition pruning.

Overview of partition pruning

The implementation steps of partition pruning are as follows:

1. Analyze the WHERE statement conditions and construct an interval graph to describe the results of the analysis.

2. Through the interval graph, find the set of visited partitions (including sub-partitions) for each interval.

3. Construct the set of partitions required for the query.


The interval graph is constructed in a bottom-up manner and represents the description of the above steps. Following the discussion, we will first define the term interval graph, then describe how to use partition intervals to form an interval graph, and finally describe the workflow of the interval graph.

Partitioning Intervals

Single-Point Intervals

Starting from the simplest case, assume a partitioned table with N columns, represented by the partition type p_type and the partition function p_func. As follows:

CREATE TABLE t (columns)
PARTITION BY p_type(p_func(col1, col2,... colN)...);
Then assume that the WHERE condition of the query is in the following form:


WHERE t.col1=const1 AND t. col2=const2 AND ... t.colN=constN
We can calculate p_func(const1, const2 ... constN) and find out which partition contains the same record as the WHERE condition. Note: This process will operate on all partition types and all partition functions.


Note: This process only works if the WHERE condition is of the form above, and each column of the table must be verified to be equal to some arbitrary constant (the same constant does not need to be the same for each column). For example, if there is no col1=const1 in the WHERE statement of the above example, then we will not calculate the value of the p_func partition function, and will not constrain the actual partition set used.


Interval walking (Walking)

Suppose a partition table t is defined as a column set, the partition type p_type, and the partition function p_func uses the integer type field int_col, as follows:

CREATE TABLE t (columns)
PARTITION BY
p_type(p_func(int_col))
...
Suppose we have a WHERE condition query of the following form:


WHERE const1 <= int_col <= const2
We can reduce the conditions of this situation into a series of single-point intervals ( Single-Point Intervals), as follows, by transforming this WHERE statement into the following relationship:


int_field=const1 OR
int_field=const1 + 1 OR
int_field=const1 + 2 OR
... OR
int_field=const2

In the source code, this transformation is called interval walking (Walking). The cost of traveling short intervals is not expensive, so we can reduce the number of partitions to scan small partitions. However, traversing long ranges is not very efficient and requires checking a large number of partitions, in which case all partitions may be scanned.

The following parameters determine the value of range walking (Walking):

#define MAX_RANGE_TO_WALK=10

Note: The following conditional relationship will also use the logic of the above range walking (Walking):

const1 >= int_col > = const2

Interval mapping (mapping)

Assume the following partition table definition:

CREATE TABLE t (columns)
PARTITION BY RANGE | LIST (unary_ascending_function (column))
Assume the WHERE of our query for table t Statement is one of the following forms:


const1 <= t.column <= const2
t.column <= const2
const1 <= t.column

The self-partitioning function is in ascending order, see below The relationship:

const1 <= t.col <= const2

=> p_func(const1) <=

p_func(t.column) <= p_func(const2)
Use A and B to represent this The leftmost and rightmost parts of the relationship, we can rewrite the relationship as:


A <= p_func(t.column) <= B
Note: In this example, the interval is closed and has two bounds value. However, similar inferences can be extended to other types of intervals.


Such as range partitioning (RANGE partitioning), each partition occupies an interval on the axis of the partition function value, and each interval is not connected, as follows:

                                                                                                                                                                                                                -x------x--------x-------->

search interval ----x============= =x----------->

                                                                                                                                                                                                                                                                                         


For example, LIST partitioning, each partition includes points set on the axis of the partition function value, and each partition will produce different intersection points, as follows:

                                                                                                                                                                                 -+---+----+----+----+----+---->

search interval ----x===================x------>
                                                                                                        In the search interval [A, B]. The partition set used determines the partitions that run from A to B and collect their points within this search range.


Subpartitioning intervals

In the previous section, we described several ways to infer the active partition set from basic WHERE conditions. Everything shows that the inference methods of these partitions are suitable for subpartitions, except for the subpartitions of range partitioning (RANGE partitioning) and enumeration partitioning (LIST partitioning).

Since each partition is molecule partitioned in the same way, we will find out which sub-partition within each partition will be accessed.


From WHERE Clauses to Intervals

The previous chapter described how to infer partition sets from WHERE statements representing partitions and sub-partition intervals. Now let's see how to extract a range from any WHERE statement.

The extracted process uses the range analyzer (RANGE Analyzer), which is part of the MySQL optimizer. It generates plans for range RANGE access. This is because the tasks are similar. Two forms of WHERE statements: the RANGE access type uses index range (interval) scanning; the partition pruning module uses partition intervals to determine which partition is used.

For partition pruning, the RANGE Analyzer is called with the WHERE statement, a column list of the table used by the partition and sub-partition functions:

(part_col1, part_col2, ... part_colN,

subpart_col1, subpart_col2, ... subpart_colM)
The result of the range analyzer (RANGE Analyzer) work is called the SEL_ARG graph. This is a very complex structure and we are not going to describe it here. The current focus of this cultural discussion is that we can travel through all partitions and collect the ranges of partitions and sub-partitions.


The following example illustrates the structure and travel process. Suppose table t is partitioned as follows:

CREATE TABLE t (..., pf INT, sp1 CHAR(5), sp2 INT, ... )

PARTITION BY LIST (pf)
SUBPARTITION BY HASH(sp1, sp2) (
PARTITION p0 VALUES IN (1),
PARTITION p1 VALUES IN (2),
PARTITION p2 VALUES IN (3),
PARTITION p3 VALUES IN (4),
PARTITION p4 VALUES IN (5),
);
Now assume a very complex WHERE statement query for table t:

pf=1 AND (sp1='foo' AND sp2 IN (40,50))

OR

(pf1=3 OR pf1=4) AND sp1='bar' AND sp2=33

OR

((pf=3 OR pf=4) AND sp1=5)

OR

p=8

The SEL_ARG picture is as follows:

(root)
   |               :
   | Partitioning  :         Sub-partitioning
   |               :
   |               :
   |               :
   |   +------+    :     +-----------+   +--------+
   ---| pf=1 |----:-----| sp1='foo' |---| sp2=40 |
       +------+    :     +-----------+   +--------+
          |        :                         |
          |        :                     +--------+
          |        :                     | sp2=50 |
          |        :                     +--------+
          |        :
          |        :
       +------+    :     +-----------+   +--------+
       | pf=3 |----:--+--| sp1='bar' |---| sp2=33 |
       +------+    :  |  +-----------+   +--------+
          |        :  |
       +------+    :  |
       | pf=4 |----:--+
       +------+    :
          |        :
          |        :
       +------+    :     +-----------+
       | pf=8 |----:-----| sp1='baz' |
       +------+    :     +-----------+

 上述图表,竖的边界(|)代表OR,横的(-)代表AND,横的和竖的线也代表AND。


 

 分区裁剪(partition pruning)代码游历从图上方到下方,从左边到右边,并做了如下的推论

1。在最上和最左的区间,从使用分区的空集合开始游历:

2。 
执行pf=1的区间分析,找到分区P0的相应集合,右移到sp1='foo' 
再次右移,到sp2=40 
分析sp1='foo' AND sp2=40区间,在某SP1子分区找到行。推论一:在每个分区组成集合P0,标识子分区SP1“被使用” 
下移到sp2=50 
分析sp1='foo'区间和sp2=50区间,在某SP2子分区找到行。推论二:在每个分区组成集合P0,标识子分区SP2“被使用” 
移回到pf=1,然后下称到pf=3 
3。

执行pf=3的区间分析,找到分区P1的相应集合,右移到sp1='bar' 
再次右移,到sp2=33

分析sp1='bar' AND sp2=33区间,在某SP3子分区找到行。推论三:在每个分区组成集合P1,标识子分区SP3“被使用”
移回到pf=3,然后下移到pf=4

4。


执行pf=4的区间分析,找到分区P2的相应集合,右移到sp2='bar'
执行移动和类似的推论已在pf=3验证正确。这样的效果是比较差的,因为我们将再次分析sp1='bar' AND sp2=33区间,但这个操作不会很大影响到整体性能。
移回到pf=3,然后下称到pf=8

5。


执行pf=8的区间分析,找到分区P3的相应集合,右移到sp1='baz'

Now that we have reached sp1='baz', we find that we can no longer move to the right, nor can we build sub-partition intervals. We record and return pf=8
From the previous process, we can no longer limit the sub-partition range, so the inference: in each partition in the P3 partition set, it is assumed that all sub-partitions are valid.

6. Try to move down from pf=9 and find that it reaches the end, so the tour graph is completed.

Note: Under certain circumstances, the results of the range analyzer (RANGE Analyzer) will have several SEL_ARG graphs, which are composed of OR or AND operators. This occurs for WHERE statements that are either very complex or do not allow the construction of a single range list. For this situation, the partition pruning code uses the appropriate operation, for example:


SELECT * FROM t1 WHERE partition_id=10 OR subpartition_id=20
In this example, no single range is constructed, but the partition is pruned (partition pruning) The code correctly infers that the partition set used is a union:


All subpartitions in a partition contain rows with partition_id=10, and within each partition a subpartition contains rows with subpartition_id=20.


Implementation of partition pruning in the source code

Simple explanation of the source code:

sql/opt_range.cc:
This code includes the implementation of From WHERE Clauses to Intervals , method prune_partitions(). There are detailed line-by-line code comments about partition pruning, starting from the PartitionPruningModule code:

sql/partition_info.h:

class partition_info {
...
/*
Bitmap of used (i.e. not pruned away) partitions. This is where result
of partition pruning is stored.
*/
MY_BITMAP used_partitions;

/*
"virtual function" pointers to functions that perform interval analysis
on this partitioned table (used by the code in opt_range .cc)
*/
get_partitions_in_range_iter get_part_iter_for_interval;
get_partitions_in_range_iter get_subpart_iter_for_interval;

};
sql/sql_partition.cc:

This code contains methods to implement all partition interval analysis types.

Partition Retrieval

If a partitioned table is accessed by a series of index retrievals (i.e. ref, eq_ref, ref_or_null join access methods), MySQL will check whether all partitions are required for index retrieval or restrict access to a specific partition. Example:

CREATE TABLE t1 (a INT, b INT);

INSERT INTO t1 VALUES (1,1),(2,2),(3,3);

CREATE TABLE t2 (
) keypart1 INT,
keypart2 INT,
KEY(keypart1, keypart2)
)
PARTITION BY HASH(keypart2);

INSERT INTO t2 VALUES (1,1),(2,2),(3,3);

Query conditions As follows:

SELECT * FROM t1, t2
WHERE t2.keypart1=t1.a
AND t2.keypart2=t1.b;
Use the following algorithm to execute:


(for each record in t1:)
{
t2 ->index_read({current-value-of(t1.a), current-value-of(t1.b)});
while( t2->index_next_same() )
pass row combination to query output;
}
In the index_read() call, the partition table handle will dig out the values ​​of all partition columns identified, in this case, a single column b, and then find a partition to access. If this partition is pruned, no other partitions will be accessible.

The above is the content of MySQL Internals Optimizer. For more related articles, please pay attention to the PHP Chinese website (www.php.cn)!


Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn