Home >Database >Mysql Tutorial >Optimizing SQL queries: How to write high-performance SQL statements
The execution plan is a query plan made by the database based on the SQL statement and the statistical information of the related tables. This plan is automatically analyzed and generated by the query optimizer. For example, if a SQL statement is used to query 1 record from a table with 100,000 records, records, the query optimizer will choose the "index search" method. If the table is archived and there are currently only 5,000 records left, the query optimizer will change the plan and use the "full table scan" method.
It can be seen that the execution plan is not fixed, it is "personalized". There are two important points in generating a correct "execution plan":
(1) Does the SQL statement clearly tell the query optimizer what it wants to do?
(2) Is the database statistical information obtained by the query optimizer up to date and correct?
For the following two SQL statements, programmers think they are the same, but the database query optimizer thinks they are different.
select*from dual select*From dual
In fact, if the case is different, the query analyzer considers it to be two different SQL statements and must be parsed twice. Generate 2 execution plans. Therefore, as a programmer, you should ensure that the same query statement is consistent everywhere, even one more space will not work!
I often see that a SQL statement captured from the database is as long as 2 A4 sheets of paper when printed. Generally speaking, there are usually problems with such complex statements. I took this 2-page long SQL statement to ask the original author, but he said it took too long and he couldn't understand it for a while. It is conceivable that even the original author may be confused by the SQL statement, and the database will be confused as well.
Generally, the results of a Select statement are used as a subset, and then the query is performed from the subset. This kind of one-level nested statement is relatively common, but according to experience, if there are more than three levels of nesting, the query optimizer will easily give Wrong execution plan. Because it was stunned. Things like artificial intelligence are ultimately inferior to human resolution. If people are dizzy, I can guarantee that the database will be dizzy as well.
In addition, the execution plan can be reused, and the simpler the SQL statement, the higher the possibility of being reused. As long as one character changes in a complex SQL statement, it must be re-parsed, and then a lot of garbage will be stuffed in the memory. It is conceivable how inefficient the database will be.
An important way to simplify SQL statements is to use temporary tables to temporarily store intermediate results. However, the benefits of temporary tables are far more than these. Temporary results are stored in temporary tables, and subsequent queries are in tempdb, which can avoid multiple times in the program. Scanning the main table also greatly reduces "shared lock" blocking and "update lock" during program execution, reducing blocking and improving concurrency performance.
select*from orderheader where changetime >'2010-10-20 00:00:01' select*from orderheader where changetime >'2010-09-22 00:00:01'
The above two statements are considered by the query optimizer to be different SQL statements and need to be parsed twice. If using bind variables
select*from orderheader where changetime >@chgtime
The @chgtime variable can pass in any value, so that a large number of similar queries can reuse the execution plan, which can greatly reduce the burden of the database parsing SQL statements. Parse once and reuse multiple times is the principle to improve database efficiency.
There are two sides to everything, and bind variables are applicable to most OLTP processes, but there are exceptions. For example, when the field in the where condition is a "skewed field".
"Tilted field" means that most of the values in the column are the same. For example, in a census table, in the "Ethnicity" column, more than 90% of the values are Han. So if a SQL statement wants to query the population of Han people who are 30 years old, then the "ethnic" column must be placed in the where condition. At this time, there will be a big problem if you use the bind variable @nation.
Just imagine if the first value passed in by @nation is "Han", then the entire execution plan will inevitably choose table scan. Then, the second value passed in is "Buyei". It stands to reason that the proportion of "Buyi" may be only one ten thousandth, so index search should be used. However, since the execution plan of "Han" parsed for the first time is reused, the table scan method will also be used for the second time. This problem is the famous "bind variable snooping". It is recommended not to use bind variables for "skewed fields".
only when necessary. A SQL statement in SQL Server is a transaction by default, and it is committed by default after the statement is executed. In fact, this is a minimized form of begin tran, just like a begin tran is implied at the beginning of each statement, and a commit is implied at the end.
In some cases, we need to explicitly declare begin tran. For example, when performing "insertion, deletion, and modification" operations, we need to modify several tables at the same time. It is required that either all modifications of several tables are successful or none of them are successful. begin tran can play such a role. It can execute several SQL statements together and finally commit them together. The advantage is that data consistency is guaranteed, but nothing is perfect. The price paid by Begin tran is that before submission, all resources locked by SQL statements cannot be released until they are committed.
It can be seen that if Begin tran traps too many SQL statements, the performance of the database will be poor. Before the large transaction is committed, other statements will inevitably be blocked, resulting in a lot of blocks.
The principle of using Begin tran is that on the premise of ensuring data consistency, the fewer SQL statements trapped by begin tran, the better! In some cases, triggers can be used to synchronize data, and begin tran is not necessarily used.
Adding nolock to the SQL statement is an important means to improve the concurrency performance of SQL Server. This is not necessary in Oracle because the structure of Oracle is more reasonable and there is an undo table space to save the "data foreshadow". If the data is being modified It has not been committed yet, so what you read is the copy before it was modified, which is placed in the undo table space. In this way, Oracle's reading and writing can be independent of each other, which is why Oracle is widely praised. SQL Server's reading and writing will block each other. In order to improve concurrency performance, nolock can be added to some queries, so that writing can be allowed during reading. However, the disadvantage is that uncommitted dirty data may be read. There are three principles for using nolock.
(1) If the query results are used for "insertion, deletion, and modification", nolock cannot be added!
(2) The queried table is one where page splits occur frequently, so use nolock with caution!
(3) Using a temporary table can also save the "data foreshadow", which has a function similar to Oracle's undo table space,
If you can use temporary tables to improve concurrency performance, do not use nolock.
For example, in the order table, there is the order number orderid and the customer number contactid. So which field should the clustered index be added to? For this table, the order numbers are added sequentially. If a clustered index is added to the orderid, the new rows will be added at the end, so that page splits will not occur frequently. However, since most queries are based on customer numbers, it only makes sense to add a clustered index to contactid. For the order table, contactid is not a sequential field.
For example, the "contactid" of "Zhang San" is 001, then the order information of "Zhang San" must be placed on the first data page of this table. If "Zhang San" places a new order today, then the order The information cannot be placed on the last page of the table, but on the first page! What if the first page is full? Sorry, all the data in this table needs to be moved back to make room for this record.
SQL Server's indexes are different from Oracle's indexes. SQL Server's clustered index actually sorts the table in the order of the clustered index fields, which is equivalent to Oracle's index-organized table. The clustered index of SQL Server is an organizational form of the table itself, so its efficiency is very high. Because of this, when a record is inserted, its location is not placed randomly, but on the data page where it should be placed in order. If there is no space on that data page, it will cause page splits. So obviously, the clustered index is not built on the sequential fields of the table, and the table is prone to page splits.
I once encountered a situation where a friend's insertion efficiency dropped significantly after re-indexing a certain table. It is estimated that the situation is probably like this. The clustered index of the table may not be built on the sequential fields of the table. The table is often archived, so the data of the table exists in a sparse state. For example, Zhang San has placed 20 orders, but there are only 5 orders in the last three months. The archiving strategy is to retain 3 months of data. Then Zhang San’s past 15 orders have been archived, leaving 15 vacancies, which can be entered in insert Repurposed as they occur. In this case, since there are free spaces available, no page split will occur. However, the query performance will be relatively low, because the query must scan those empty positions without data.
The situation changed after rebuilding the clustered index, because rebuilding the clustered index means rearranging the data in the table. The original gaps are gone, and the page fill rate is very high. Page splits often occur when inserting data, so the performance drops significantly.
For tables whose clustered indexes are not built on sequential fields, should we give a lower page fill rate? Do you want to avoid rebuilding the clustered index? It’s a question worth considering!
加nolock后可以在“插、删、改”的同时进行查询,但是由于同时发生“插、删、改”,在某些情况下,一旦该数据页满了,那么页分裂不可避免,而此时nolock的查询正在发生,比如在第100页已经读过的记录,可能会因为页分裂而分到第101页,这有可能使得nolock查询在读101页时重复读到该条数据,产生“重复读”。同理,如果在100页上的数据还没被读到就分到99页去了,那nolock查询有可能会漏过该记录,产生“跳读”。
上面提到的哥们,在加了nolock后一些操作出现报错,估计有可能因为nolock查询产生了重复读,2条相同的记录去插入别的表,当然会发生主键冲突。
有的时候会需要进行一些模糊查询比如
select*from contact where username like ‘%yue%’
关键词%yue%,由于yue前面用到了“%”,因此该查询必然走全表扫描,除非必要,否则不要在关键词前加%,
sql server2000的数据库,我们的程序在提交sql语句的时候,没有使用强类型提交这个字段的值,由sql server 2000自动转换数据类型,会导致传入的参数与主键字段类型不一致,这个时候sql server 2000可能就会使用全表扫描。Sql2005上没有发现这种问题,但是还是应该注意一下。
(1) Merge Join
(2) Nested Loop Join
(3) Hash Join
SQL Server 2000只有一种join方式——Nested Loop Join,如果A结果集较小,那就默认作为外表,A中每条记录都要去B中扫描一遍,实际扫过的行数相当于A结果集行数x B结果集行数。所以如果两个结果集都很大,那Join的结果很糟糕。
SQL Server 2005新增了Merge Join,如果A表和B表的连接字段正好是聚集索引所在字段,那么表的顺序已经排好,只要两边拼上去就行了,这种join的开销相当于A表的结果集行数加上B表的结果集行数,一个是加,一个是乘,可见merge join 的效果要比Nested Loop Join好多了。
如果连接的字段上没有索引,那SQL2000的效率是相当低的,而SQL2005提供了Hash join,相当于临时给A,B表的结果集加上索引,因此SQL2005的效率比SQL2000有很大提高,我认为,这是一个重要的原因。
总结一下,在表连接时要注意以下几点:
(1) 连接字段尽量选择聚集索引所在的字段
(2) 仔细考虑where条件,尽量减小A、B表的结果集
(3) 如果很多join的连接字段都缺少索引,而你还在用SQL Server 2000,赶紧升级吧。
The above is the detailed content of Optimizing SQL queries: How to write high-performance SQL statements. For more information, please follow other related articles on the PHP Chinese website!