SQL Query Statement Optimization
Table of contents
1. What is an execution plan? What information does the execution plan depend on?
2. Unify the way SQL statements are written to reduce parsing overhead
3. Reduce the nesting of SQL statements
4. Use "temporary tables" to hold intermediate results
5. OLTP system SQL statements must use bind variables
6. The bind-variable snooping problem of skewed fields
7. begin tran transactions should be as small as possible
8. Some SQL query statements should be given nolock
9. After adding nolock, querying a table where page splits occur frequently is prone to skipped or repeated reads
10. A clustered index not built on a sequential field of the table makes the table prone to page splits
11. Use composite indexes to improve the speed of queries with multiple where conditions
12. When using like for fuzzy queries, try not to use a leading %
13. The three ways SQL Server joins tables
14. Row_number causes table scans; temporary-table paging is better
What is an execution plan? What information does the execution plan depend on?
The execution plan is the query plan the database makes based on the SQL statement and the statistics of the related tables, produced automatically by the query optimizer. For example, if a SQL statement queries 1 record from a table with 100,000 records, the query optimizer will choose an "index seek". If the table has since been archived and only 5,000 records remain, the query optimizer will change the plan and use a "full table scan".
It can be seen that the execution plan is not fixed, it is "personalized". There are two important points in generating a correct "execution plan":
Does the SQL statement clearly tell the query optimizer what it wants to do?
Are the database statistics obtained by the query optimizer up-to-date and correct?
Unify the writing method of SQL statements to reduce parsing overhead
For the following two SQL statements, programmers think they are the same, but the database query optimizer may think they are different.
select * from dual
Select * From dual
In fact, because the case differs, the query analyzer treats them as two different SQL statements and must parse them twice, generating two execution plans. As a programmer, you should therefore make sure the same query is written identically everywhere; even one extra space matters!
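To check whether the two spellings really produced separate plans, you can inspect SQL Server's plan cache (a sketch; the `%dual%` text filter is just illustrative):

```sql
-- Each textually distinct statement gets its own row in the plan cache
SELECT st.text, cp.usecounts
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE '%dual%';
```

If both spellings appear as separate rows with their own usecounts, the server parsed and cached them independently.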
Reduce the nesting of SQL statements
I often see SQL statements captured from the database that run to two A4 pages when printed out. Generally speaking, such complex statements usually have problems. I once took a 2-page SQL statement to its original author, and he said it had been written so long ago that he could no longer understand it at a glance. If even the original author can be confused by a SQL statement, the database will be confused as well.
Typically the result of a Select statement is used as a subset, and the query then runs against that subset. One or two levels of nesting like this are common, but experience shows that with more than 3 levels of nesting the query optimizer struggles and easily produces a wrong execution plan: it gets overwhelmed. Anything like artificial intelligence is ultimately inferior to human judgment; if a person gets dizzy reading the statement, I can guarantee the database will get dizzy too.
In addition, execution plans can be reused, and the simpler the SQL statement, the more likely its plan can be reused. If even one character of a complex SQL statement changes, it must be re-parsed, and the plan cache fills with garbage. It is easy to imagine how inefficient the database becomes.
Use "temporary table" to temporarily store intermediate results
An important way to simplify SQL statements is to use temporary tables to hold intermediate results. The benefits go beyond simplification, though: intermediate results are stored in a temporary table (which lives in tempdb) for subsequent queries. This avoids scanning the main table multiple times, and it also greatly reduces the blocking of "update locks" by "shared locks" while the program runs, reducing blocking and improving concurrency.
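As a sketch of the idea (the `orderheader` columns `orderid`, `contactid`, and `amount` are assumed here for illustration):

```sql
-- Scan the main table once and store the intermediate result in tempdb
SELECT orderid, contactid, amount
INTO #recent_orders
FROM orderheader
WHERE changetime > '2010-10-20 00:00:01';

-- All subsequent queries hit the small temp table, not orderheader
SELECT contactid, SUM(amount) AS total_amount
FROM #recent_orders
GROUP BY contactid;

DROP TABLE #recent_orders;
```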
OLTP system SQL statements must use bind variables
select * from orderheader where changetime > '2010-10-20 00:00:01'
select * from orderheader where changetime > '2010-09-22 00:00:01'
The above two statements are considered by the query optimizer to be different SQL statements and need to be parsed twice. If you use bind variables
select * from orderheader where changetime > @chgtime
@chgtime variable can pass in any value, so that a large number of similar queries can reuse the execution plan, which can greatly reduce the burden of the database parsing SQL statements. Parse once and reuse multiple times is the principle to improve database efficiency.
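In T-SQL this is typically done with `sp_executesql`, which parameterizes the statement so one cached plan serves every date value (a minimal sketch based on the `orderheader` query above):

```sql
-- One plan is compiled for the parameterized text and reused for any @chgtime
EXEC sp_executesql
     N'select * from orderheader where changetime > @chgtime',
     N'@chgtime datetime',
     @chgtime = '2010-10-20 00:00:01';
```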
Bind variable snooping problem of skewed fields
Everything has two sides. Bind variables are applicable to most OLTP processing, but there are exceptions. For example, when the field in the where condition is a "skewed field".
A "skewed field" is one where most of the values in the column are the same. For example, in a census table, more than 90% of the values in the "ethnicity" column are Han. If a SQL statement needs to count 30-year-old Han people, the "ethnicity" column must appear in the where condition, and using a bind variable @nation there causes a big problem.
Imagine: if the first value passed in for @nation is "Han", the execution plan will inevitably choose a table scan. The second value passed in is "Buyi"; since "Buyi" may account for only one ten-thousandth of the rows, an index seek should be used. But because the "Han" plan parsed the first time is reused, the second query also uses a table scan. This problem is the famous "bind variable snooping" (parameter sniffing). It is recommended not to use bind variables for "skewed fields".
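For a skewed column, writing the literal value into the statement lets the optimizer pick a plan per value (the `census` table and its column names are hypothetical):

```sql
-- ~90% of rows match: the optimizer will likely choose a table scan
select * from census where nation = N'Han' and age = 30;

-- ~0.01% of rows match: with the literal visible, an index seek is likely
select * from census where nation = N'Buyi' and age = 30;
```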
begin tran transactions should be as small as possible
In SQL Server, a SQL statement is a transaction by default, and it is committed by default after the statement is executed. In fact, this is a minimized form of begin tran, just like a begin tran is implied at the beginning of each statement, and a commit is implied at the end.
In some cases, we need to declare begin tran explicitly. For example, an "insert, delete, update" operation may need to modify several tables at once, with the requirement that either all the modifications succeed or none do. begin tran serves exactly this role: it lets several SQL statements execute together and commit together. The advantage is guaranteed data consistency, but nothing is perfect: the price of begin tran is that, until the commit, none of the resources locked by its SQL statements can be released.
It follows that if begin tran wraps too many SQL statements, database performance suffers: before the big transaction commits, other statements are inevitably blocked, producing a lot of blocking.
The principle for using begin tran is that, provided data consistency is preserved, the fewer SQL statements it wraps, the better! In some cases triggers can be used to synchronize data, and begin tran is not strictly necessary.
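A minimal sketch of a small explicit transaction (the `orderlog` table and its columns are hypothetical):

```sql
-- Wrap only the statements that must succeed or fail together
BEGIN TRAN;
    UPDATE orderheader SET changetime = GETDATE() WHERE orderid = 1001;
    INSERT INTO orderlog (orderid, action) VALUES (1001, 'updated');
COMMIT TRAN;
```

Everything else (validation, lookups, formatting) should happen outside the transaction, so the locks are held as briefly as possible.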
Some SQL query statements should be added with nolock
Adding nolock to SQL statements is an important means of improving SQL Server's concurrency. It is not needed in Oracle, whose architecture is more reasonable: the undo tablespace stores the "before image" of data, so if a modification has not yet been committed, readers see the pre-modification copy from undo. Oracle's reads and writes therefore do not block each other, which is one reason Oracle is widely praised. In SQL Server, reads and writes do block each other; to improve concurrency, nolock can be added to some queries, so that reading is allowed while writing occurs. The drawback is that uncommitted dirty data may be read. There are three principles for using nolock.
(1) If the query results are used for "insertion, deletion, and modification", nolock cannot be added!
(2) If the queried table is one where page splits occur frequently, use nolock with caution!
(3) Using temporary tables can also save the "previous data", which functions like Oracle's undo table space.
If you can use temporary tables to improve concurrency performance, do not use nolock.
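The hint is written per table with `WITH (NOLOCK)` (a sketch against the `orderheader` table used earlier):

```sql
-- Read without taking shared locks; may return uncommitted (dirty) rows
SELECT orderid, changetime
FROM orderheader WITH (NOLOCK)
WHERE changetime > '2010-10-20 00:00:01';
```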
After adding nolock, querying tables where page splits often occur is prone to skipped or repeated reads.
After adding nolock you can query while "insert, delete, update" operations run. But because those operations happen concurrently, once a data page fills up, a page split is unavoidable, and a nolock query may be running at exactly that moment. For example, a record already read on page 100 may be moved to page 101 by a split; when the nolock query reaches page 101 it reads that record again, producing a "repeated read". Similarly, if data on page 100 is split off to page 99 before it has been read, the nolock query may miss the record entirely, producing a "skipped read".
A friend reported that some operations errored out after nolock was added; most likely a nolock query produced repeated reads, and inserting two identical records into another table naturally caused a primary key conflict.
A clustered index not built on a sequential field of the table makes the table prone to page splits
For example, the order table has the order number orderid and the customer number contactid. So which field should the clustered index be added to? For this table, the order numbers are added sequentially. If a clustered index is added to the orderid, the new rows will be added at the end, so that page splits will not occur frequently. However, since most queries are based on customer numbers, it only makes sense to add a clustered index to contactid. For the order table, contactid is not a sequential field.
For example, the "contactid" of "Zhang San" is 001, then the order information of "Zhang San" must be placed on the first data page of this table. If "Zhang San" places a new order today, then The order information cannot be placed on the last page of the table, but on the first page! What if the first page is full? Sorry, all the data in this table needs to be moved back to make room for this record.
SQL Server’s index is different from Oracle’s index. SQL Server’s clustered index actually sorts the table in the order of the clustered index fields, which is equivalent to Oracle’s index-organized table. SQL Server's clustered index is an organizational form of the table itself, so its efficiency is very high. Because of this, when a record is inserted, its location is not placed randomly, but on the data page where it should be placed in order. If there is no space on that data page, it will cause page splits. So obviously, the clustered index is not built on the sequential fields of the table, and the table is prone to page splits.
I once encountered a case where a friend's insert performance dropped sharply after rebuilding an index on a certain table. The situation was probably this: the table's clustered index was not built on a sequential field, and the table was archived regularly, so its data sat in a sparse state. For example, Zhang San had placed 20 orders, but only 5 in the last three months; with an archiving policy that retains 3 months of data, Zhang San's 15 older orders had been archived, leaving 15 gaps that could be reused as new inserts occurred. In that state, because free slots were available, no page splits occurred; but query performance was relatively low, because queries had to scan those empty slots holding no data.
The situation changed after the clustered index was rebuilt, because rebuilding the clustered index rearranges the data in the table: the original gaps are gone and the page fill rate is very high, so inserts frequently cause page splits and performance drops sharply.
For tables whose clustered index is not built on a sequential field, should we use a lower page fill rate? Should we avoid rebuilding the clustered index? Both are questions worth considering!
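If you do opt for a lower page fill rate, it is set when rebuilding the index (the index name is hypothetical, and 70 is just an illustrative value):

```sql
-- Leave 30% free space on each leaf page to absorb out-of-order inserts
ALTER INDEX ix_orderheader_contactid ON orderheader
REBUILD WITH (FILLFACTOR = 70);
```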
Use composite indexes to improve query speed for multiple where conditions
A composite index usually has better selectivity than a single-column index. Moreover, it is an index set up for a specific where condition and is already sorted, so it answers that query faster than a single-column index. The leading column of a composite index must be a "highly selective" column. For example, given the 3 fields date, gender, and age, which should lead? Clearly "date" should be the leading column: it is the most selective of the three.
There is an exception here. If the date is also the leading field of the clustered index, you can not build a composite index and use the clustered index directly, which is also more efficient.
Don’t build a clustered index into a “composite index”. The simpler the clustered index, the better, and the higher the selectivity, the better! The clustered index includes 2 fields and is tolerable. But if there are more than 2 fields, you should consider building an auto-increment field as the primary key. The clustered index does not need to be the primary key.
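A sketch of the composite index described above, with the most selective column leading (the `people` table name is hypothetical; `date` is bracketed because it is a reserved word):

```sql
-- Leading column = most selective column, per the guidance above
CREATE NONCLUSTERED INDEX ix_people_date_gender_age
ON people ([date], gender, age);
```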
When using like for fuzzy queries, you should be careful not to use the first %
Sometimes you need to do some fuzzy queries, such as
Select * from contact where username like '%yue%'
Because "%" appears before the keyword yue, the query has to scan the entire table. Unless absolutely necessary, do not put % before the keyword.
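If the pattern only needs to match a prefix, dropping the leading % lets the optimizer seek an index on `username` (assuming such an index exists):

```sql
-- 'yue%' can seek an index on username; '%yue%' cannot
Select * from contact where username like 'yue%'
```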
Three ways to join SQL Server tables
(1) Merge Join
(2) Nested Loop Join
(3) Hash Join
SQL Server 2000 had only one join method: Nested Loop Join. If result set A is small, it is used as the outer table by default, and every record in A probes B once, so the actual number of rows scanned is equivalent to the rows in result set A multiplied by the rows in result set B. If both result sets are large, the join performs badly.
SQL Server 2005 added Merge Join. If the join columns of tables A and B happen to be the columns their clustered indexes are built on, the rows are already in order and just need to be merged from both sides. The cost of this join is equivalent to the rows of result set A plus the rows of result set B: one is addition, the other is multiplication, so Merge Join performs much better than Nested Loop Join.
If the join columns have no index, SQL 2000's efficiency is quite low, whereas SQL 2005 provides Hash Join, which is equivalent to building temporary indexes on the result sets of tables A and B. I think this is an important reason SQL 2005 is a big improvement over SQL 2000.
To summarize, you should pay attention to the following points when connecting tables:
(1) Try to choose the field where the clustered index is located for the connection field
(2) Carefully consider the where condition and try to reduce the result set of tables A and B
(3) If many join connection fields lack indexes, and you are still using SQL2000, upgrade immediately.
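When comparing plans, T-SQL join hints can force a particular physical join so you can measure the difference yourself (the `contact` table and its columns are assumed for illustration):

```sql
-- Force a merge join instead of letting the optimizer choose
SELECT a.orderid, b.username
FROM orderheader AS a
INNER MERGE JOIN contact AS b
    ON a.contactid = b.contactid;
```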
Row_number will cause table scanning, and it is better to use temporary table paging
ROW_Number paging test results:
Paging with ROW_Number: CPU time = 317,265 ms, elapsed time = 423,090 ms
Paging with a temporary table: CPU time = 1,266 ms, elapsed time = 6,705 ms
ROW_Number implementation is based on order by, and the impact of sorting on the query is obvious.
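For reference, a typical ROW_NUMBER paging shape looks like this (page size 50, page 3; sorting on `orderid` is an assumption):

```sql
-- The ORDER BY inside OVER() is what forces the sort / scan
SELECT *
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY orderid) AS rn, *
    FROM orderheader
) AS t
WHERE t.rn BETWEEN 101 AND 150;
```

The temporary-table variant materializes the sorted keys once into a temp table and then pages by joining back on the key, avoiding a fresh sort on every page request.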
Others
For example, some ways of writing a predicate prevent the index from being used. Applying arithmetic to the indexed column disables an index seek:
Select * from tablename where chgdate + 7 < sysdate
Rewriting the predicate so the column stands alone on one side allows the index to be used:
Select * from tablename where chgdate < sysdate - 7