This article covers the architecture of MySQL and takes a closer look at the InnoDB storage engine.
MySQL basic architecture diagram
In general, MySQL can be divided into two parts: the Server layer and the storage engine layer.
The Server layer includes the connector, query cache, analyzer, optimizer, and executor. It covers most of MySQL's core service functions, including all built-in functions (such as date, time, mathematical, and encryption functions). All cross-storage-engine features, such as stored procedures, triggers, and views, are also implemented in this layer.
Connector
The connector is what you talk to when connecting to the database. It is responsible for establishing the connection with the client, obtaining permissions, and maintaining and managing the connection.
Command: mysql -h$ip -P$port -u$user -p, then press Enter and type the password. You can also put the password directly after -p, but this risks leaking it.
Run show processlist to check the connection status. A value of Sleep in the Command column indicates that the connection is idle.
Idle connections are disconnected after 8 hours by default; this is configured with the wait_timeout parameter.
In the database, a long connection means that once the connection is established, the same connection keeps being used as long as the client continues to make requests. A short connection means that the connection is closed after a few queries, and a new one is established for the next query.
Since establishing a connection is expensive, it is recommended to use long connections as much as possible. However, with long connections the memory used by MySQL can grow very quickly. This is because the memory MySQL uses temporarily during execution is managed inside the connection object, and it is only released when the connection is closed. If long connections accumulate, MySQL may occupy too much memory and be forcibly killed by the system (OOM). From the outside, this looks like MySQL restarting abnormally.
Solution:
Disconnect long connections regularly. After a connection has been used for a while, or after the program determines that it has executed a large, memory-hungry query, close it and reconnect the next time a query is needed.
If you are using MySQL 5.7 or newer, you can reinitialize the connection resources by executing mysql_reset_connection after each relatively large operation. This does not require reconnecting or repeating permission verification, but it restores the connection to the state it had when it was just created.
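The first approach, recycling a long connection after some amount of use, can be sketched as follows. This is only an illustration under stated assumptions: `FakeConnection`, `RecyclingConnection`, and the `max_queries` threshold are hypothetical names invented for the example, and a real application would wrap its actual MySQL client library instead.

```python
class FakeConnection:
    """Stand-in for a real MySQL connection (hypothetical, for illustration)."""
    created = 0

    def __init__(self):
        FakeConnection.created += 1
        self.closed = False

    def query(self, sql):
        return f"result of {sql}"

    def close(self):
        self.closed = True


class RecyclingConnection:
    """Reuses one long connection, but reconnects after max_queries queries."""

    def __init__(self, connect, max_queries=3):
        self._connect = connect          # factory that opens a new connection
        self._max_queries = max_queries  # assumed recycling threshold
        self._conn = connect()
        self._count = 0

    def query(self, sql):
        if self._count >= self._max_queries:
            # Disconnect so the server can release per-connection memory.
            self._conn.close()
            self._conn = self._connect()
            self._count = 0
        self._count += 1
        return self._conn.query(sql)


conn = RecyclingConnection(FakeConnection, max_queries=3)
results = [conn.query(f"SELECT {i}") for i in range(7)]
```

With a threshold of three, seven queries end up spread across three physical connections, while the caller only ever sees one object.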
Query cache
The query cache stores previously executed statements and their results in memory as key-value pairs. The key is the query statement and the value is the query result. If a query's key is found in this cache, the value is returned directly to the client.
The query cache was removed in MySQL 8.0. Any update to a table invalidates every cached query on that table, so invalidation is frequent and the hit rate is low.
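A toy model makes the low hit rate easier to see. The sketch below is not MySQL's implementation; `QueryCache` and its methods are hypothetical, and the only behavior it borrows from the text is "key = statement, value = result, and a write to a table throws its cached results away".

```python
class QueryCache:
    """Toy key-value query cache: key = SQL text, value = result rows."""

    def __init__(self):
        self._cache = {}          # sql -> (table, result)
        self.hits = 0
        self.misses = 0

    def get(self, sql, table, run_query):
        if sql in self._cache:
            self.hits += 1
            return self._cache[sql][1]
        self.misses += 1
        result = run_query(sql)
        self._cache[sql] = (table, result)
        return result

    def invalidate(self, table):
        # Any write to a table discards every cached result for that table.
        self._cache = {k: v for k, v in self._cache.items() if v[0] != table}


cache = QueryCache()
run = lambda sql: ["row"]                     # placeholder for real execution
sql = "SELECT * FROM users WHERE id = 1"
cache.get(sql, "users", run)                  # miss, result is cached
cache.get(sql, "users", run)                  # hit
cache.invalidate("users")                     # an UPDATE on users happened
cache.get(sql, "users", run)                  # miss again
```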
Analyzer
The analyzer first performs "lexical analysis" to identify what each string in the statement is and what it represents. It then performs "syntax analysis" to determine whether the SQL statement satisfies MySQL syntax.
Optimizer
Executor
The storage engine layer is responsible for storing and retrieving data. It uses a plug-in architecture and supports multiple storage engines such as InnoDB, MyISAM, and Memory. The most commonly used storage engine today is InnoDB, which has been the default since MySQL 5.5.5.
A Select statement execution process
The above figure takes the InnoDB storage engine as an example. The processing flow is as follows:
The user sends a request to Tomcat, a connection is established between the Tomcat connection pool and the MySQL connection pool, and the SQL statement is sent to MySQL over that connection;
MySQL has a separate listening thread, which reads the request data and extracts the SQL statement from the connection;
The SQL statement is handed to the SQL interface for execution;
The SQL interface sends the SQL to the SQL parser for parsing;
The parsed SQL is sent to the query optimizer, which finds the optimal query path and passes it to the executor;
The executor calls the storage engine interface, step by step and in order, according to the optimized execution plan.
For example, the executor may first call a storage engine interface to fetch the first row of the "users" table and check whether the value of its "id" field equals the value we expect. If not, it calls the storage engine interface again to fetch the next row of the "users" table. Following this idea, the executor takes the execution plan generated by the optimizer and keeps calling the storage engine's interfaces until the plan for the SQL statement is complete, which roughly amounts to repeatedly updating or extracting data.
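The row-by-row loop described above can be sketched like this. The `TableScanEngine` interface (`first_row`, `next_row`) is a hypothetical stand-in for the storage engine API, chosen only to mirror the "fetch first row, check, fetch next row" description; it is not MySQL's actual handler interface.

```python
class TableScanEngine:
    """Hypothetical storage-engine interface: hands out rows one at a time."""

    def __init__(self, rows):
        self._rows = rows
        self._pos = 0

    def first_row(self):
        self._pos = 0
        return self.next_row()

    def next_row(self):
        if self._pos >= len(self._rows):
            return None            # end of table
        row = self._rows[self._pos]
        self._pos += 1
        return row


def execute_select(engine, column, expected):
    """Executor loop: fetch a row, test the condition, keep matching rows."""
    result = []
    row = engine.first_row()
    while row is not None:
        if row[column] == expected:
            result.append(row)
        row = engine.next_row()
    return result


users = TableScanEngine([
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
    {"id": 3, "name": "carol"},
])
matches = execute_select(users, "id", 2)
```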
There are several questions involved here:
What exactly is the MySQL driver?
Taking Java as an example, if we want to access a MySQL database from a Java system, we must add a MySQL driver to the system's dependencies. In Maven, for example:
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.46</version>
</dependency>
So what exactly is this MySQL driver? At the lowest level, the MySQL driver establishes a network connection with the database. Once there is a network connection, requests can be sent to the database server. A system written in a given language accesses the database through the MySQL driver for that language, as shown below.
What is the database connection pool used for?
Assume that a web service is developed in Java and deployed on Tomcat. Tomcat processes requests concurrently with multiple threads, so the first point is that creating only one database connection is not an option: many requests competing for a single connection would be very inefficient. Second, what if a database connection is created for each request? That is also bad, because establishing a database connection is time-consuming. After going to the trouble of establishing the connection and executing the SQL statement, the connection is destroyed, and this frequent creation and destruction causes performance problems.
So a database connection pool is generally used. Multiple database connections are maintained in a pool, and multiple threads use different connections from the pool to execute SQL statements. After a statement is executed, the connection is not destroyed but put back into the pool for later use. This mechanism solves the problem of multiple threads concurrently needing multiple database connections, and avoids destroying connections after every use.
What is the connection pool of the MySQL database used for?
The connection pool on the MySQL side serves the same purpose as the connection pool in the Java application: both exist to reuse connections.
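The pooling idea described above can be sketched in a few lines. This is a minimal illustration, not a production pool: `ConnectionPool` and `fake_connect` are hypothetical names, and the "connections" are plain objects rather than real database connections.

```python
import queue


class ConnectionPool:
    """Keeps a fixed number of connections open and hands them out for reuse."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    def acquire(self, timeout=None):
        # Blocks until a connection is free (raises queue.Empty on timeout).
        return self._idle.get(timeout=timeout)

    def release(self, conn):
        # Return the connection to the pool instead of closing it.
        self._idle.put(conn)


opened = []

def fake_connect():
    conn = object()               # placeholder for a real connection
    opened.append(conn)
    return conn

pool = ConnectionPool(fake_connect, size=2)
a = pool.acquire()
b = pool.acquire()
pool.release(a)
c = pool.acquire()                # reuses the connection released above
```

Because `queue.Queue` is thread-safe, several worker threads can call `acquire` and `release` concurrently without extra locking.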
InnoDB Storage Engine
InnoDB Architecture Brief Analysis
As can be seen from the picture, the InnoDB storage engine consists of three parts: the memory pool, background threads, and disk files.
Here is another picture that highlights the key points:
InnoDB Storage Engine Part 1: Memory Structure
Buffer Pool
The InnoDB storage engine is disk-based and manages records in pages. Because of the gap between CPU speed and disk speed, disk-based database systems usually use a buffer pool to improve overall performance.
When the database performs a read, the page read from disk is placed in the buffer pool. The next time the same page is read, the engine first checks whether the page is in the buffer pool. If it is, the page is said to be hit in the buffer pool and is read directly from memory; otherwise, the page is read from disk.
When a page is modified, the copy in the buffer pool is modified first and later flushed to disk at a certain frequency. Flushing a page from the buffer pool back to disk is not triggered on every page update; it happens through a mechanism called Checkpoint.
The size of the buffer pool therefore directly affects the overall performance of the database. It is set with the configuration parameter innodb_buffer_pool_size. The default of 128MB is rather small. If your database runs on a machine with 16 cores and 32GB of memory, you could, for example, allocate 2GB to the buffer pool.
Since the buffer pool is finite, it will eventually fill up as data pages are continuously loaded from disk. At that point some cached pages have to be evicted. This is done with the least recently used (LRU) algorithm: an LRU linked list is maintained, from which you can tell which cache pages were least recently used. When a cache page needs to be freed and flushed to disk, the least recently used page in the LRU list is chosen for eviction.
The types of pages cached in the buffer pool include index pages, data pages, undo pages, the insert buffer, the adaptive hash index, lock information stored by InnoDB, and data dictionary information.
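The read and write paths described above can be sketched as a toy model. `BufferPool` here is a hypothetical class, the "disk" is a plain dictionary, and `checkpoint` simply writes every dirty page back; real checkpointing is far more selective. Eviction is left out on purpose, since the LRU list is discussed separately.

```python
class BufferPool:
    """Toy buffer pool: reads hit memory first, writes are flushed later."""

    def __init__(self, disk):
        self.disk = disk          # page_no -> page content (the "disk")
        self.pages = {}           # pages cached in memory
        self.dirty = set()        # modified but not yet flushed
        self.hits = 0
        self.disk_reads = 0

    def read(self, page_no):
        if page_no in self.pages:
            self.hits += 1                       # page hit in the buffer pool
        else:
            self.disk_reads += 1                 # miss: load the page from disk
            self.pages[page_no] = self.disk[page_no]
        return self.pages[page_no]

    def write(self, page_no, content):
        self.read(page_no)                       # make sure the page is cached
        self.pages[page_no] = content            # modify in memory only
        self.dirty.add(page_no)

    def checkpoint(self):
        for page_no in sorted(self.dirty):       # flush dirty pages to disk
            self.disk[page_no] = self.pages[page_no]
        self.dirty.clear()


disk = {1: "a", 2: "b"}
pool = BufferPool(disk)
pool.read(1)                  # miss
pool.read(1)                  # hit
pool.write(2, "B")            # modifies the cached copy only
before = disk[2]
pool.checkpoint()
after = disk[2]
```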
Data pages and index pages
A page is the most basic structure of InnoDB storage and the smallest unit of InnoDB disk management. All database content is stored in pages. Pages come in several types, of which data pages and index pages are the two most important.
Insert Buffer
When inserting into an InnoDB table, rows are generally inserted in primary key order, which gives higher insert performance. When a table also has a non-clustered, non-unique index, the data pages are still written in primary key order, but inserts into the leaf nodes of the non-clustered index are no longer sequential. The non-clustered index pages have to be accessed at scattered locations, and these random reads degrade insert performance.
So the InnoDB storage engine introduced the Insert Buffer. An insert or update on a non-clustered index is not applied directly to the index page every time. The engine first checks whether the target non-clustered index page is in the buffer pool. If it is, the entry is inserted directly; if not, it is first placed in an Insert Buffer object. This is a little like cheating: the database appears to have inserted the entry into the leaf node of the non-clustered index, but it is actually stored somewhere else. The Insert Buffer is then merged with the leaf nodes of the secondary index pages at a certain frequency and under certain conditions. At that point multiple inserts can usually be merged into one operation (because they fall in the same index page), which greatly improves the performance of non-clustered index inserts.
However, Insert Buffer can be used only when both of the following conditions are met:
The index is a secondary index;
The index is not unique.
When both conditions are met, the InnoDB storage engine uses the Insert Buffer, which improves insert performance. But consider an application that performs a large number of inserts, all involving non-unique non-clustered indexes and therefore all using the Insert Buffer. If the MySQL database goes down at this point, a large amount of Insert Buffer content will not yet have been merged into the actual non-clustered indexes.
Recovery may then take a long time, even several hours in extreme cases. The secondary index cannot be unique because, when buffering an insert, the database does not look up the index page to check the uniqueness of the inserted record. If it did, random reads would certainly occur, and the Insert Buffer would lose its purpose.
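The buffering and merging behavior can be sketched as below. `InsertBuffer` is a hypothetical toy class, not InnoDB's data structure (which is a B+ tree in the system tablespace); it only models "insert directly if the index page is cached, otherwise defer, then merge all entries for a page with one page read".

```python
from collections import defaultdict


class InsertBuffer:
    """Toy insert buffer for a non-unique secondary index."""

    def __init__(self, cached_pages):
        self.cached_pages = set(cached_pages)    # index pages in the buffer pool
        self.pending = defaultdict(list)         # page_no -> buffered entries
        self.direct_inserts = 0
        self.merges = 0                          # page reads caused by merging
        self.merged_recs = 0

    def insert(self, page_no, entry):
        if page_no in self.cached_pages:
            self.direct_inserts += 1             # page is in memory: insert now
        else:
            self.pending[page_no].append(entry)  # defer, no random read yet

    def merge(self):
        for page_no, entries in self.pending.items():
            self.cached_pages.add(page_no)       # one read brings the page in
            self.merges += 1
            self.merged_recs += len(entries)     # all entries applied at once
        self.pending.clear()


ibuf = InsertBuffer(cached_pages=[10])
for page_no, key in [(10, "k1"), (20, "k2"), (20, "k3"), (20, "k4"), (30, "k5")]:
    ibuf.insert(page_no, key)
ibuf.merge()
```

Four buffered entries are applied with only two page reads, which is the saving the Insert Buffer is designed to provide.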
You can view insert buffer information with the SHOW ENGINE INNODB STATUS command.
seg size shows that the current size of the Insert Buffer is 11336×16KB, approximately 177MB; free list len is the length of the free list; size is the number of pages of merged records. The second line, in bold, is probably what users really care about, because it shows the improvement in insert performance. inserts is the number of inserted records; merged recs is the number of inserted records that were merged; merges is the number of merge operations, that is, the number of actual page reads. The ratio merges : merged recs is about 1:3, which means the insert buffer reduces random IO requests for non-clustered index pages by about 2/3.
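The two figures quoted above follow from simple arithmetic. The snippet below just reproduces them; the 1:3 ratio is taken from the text as a ratio, not from actual counter values.

```python
PAGE_SIZE_KB = 16
seg_size_pages = 11336

# Insert Buffer size: number of pages times the 16KB page size, in MB.
size_mb = seg_size_pages * PAGE_SIZE_KB / 1024

# merges : merged recs is about 1 : 3, so roughly one page read serves
# three buffered records, saving about two thirds of the random IO.
merges, merged_recs = 1, 3
io_saved = 1 - merges / merged_recs
```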
As mentioned before, the Insert Buffer currently has one problem: under write-intensive workloads it can occupy too much buffer pool memory (innodb buffer pool). By default, it can occupy up to 1/2 of the buffer pool memory. The following is the initialization code for the insert buffer in the InnoDB storage engine source:
Change Buffer
Starting with version 1.0.x, InnoDB introduced the Change Buffer, which can be regarded as an upgraded Insert Buffer. The InnoDB storage engine can now buffer the DML operations INSERT, DELETE, and UPDATE; the corresponding buffers are the Insert Buffer, the Delete Buffer, and the Purge Buffer. As with the earlier Insert Buffer, the Change Buffer still applies only to non-unique secondary indexes.
An UPDATE operation on a record may be divided into two steps:
Mark the record as deleted;
Actually delete the record.
The Delete Buffer corresponds to the first step of the UPDATE operation, marking the record as deleted. The Purge Buffer corresponds to the second step, actually deleting the record. InnoDB also provides the parameter innodb_change_buffering to enable the various buffer options. Its possible values are inserts, deletes, purges, changes, all, and none. inserts, deletes, and purges are the three cases discussed above; changes enables inserts and deletes, all enables everything, and none disables everything. The default value of this parameter is all.
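The meaning of each innodb_change_buffering value can be written down as a small lookup table. The operation labels (`insert`, `delete_mark`, `purge`) and the helper `is_buffered` are names chosen for this illustration, not identifiers from MySQL.

```python
# Which kinds of secondary-index changes each setting buffers,
# following the description above.
CHANGE_BUFFERING = {
    "none":    set(),
    "inserts": {"insert"},
    "deletes": {"delete_mark"},
    "purges":  {"purge"},
    "changes": {"insert", "delete_mark"},
    "all":     {"insert", "delete_mark", "purge"},
}


def is_buffered(setting, operation):
    """Return True if `operation` is buffered under the given setting."""
    return operation in CHANGE_BUFFERING[setting]
```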
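Starting with InnoDB 1.2.x, the parameter innodb_change_buffer_max_size controls the maximum amount of memory the Change Buffer may use:
mysql> show variables like 'innodb_change_buffer_max_size';
+-------------------------------+-------+
| Variable_name                 | Value |
+-------------------------------+-------+
| innodb_change_buffer_max_size | 25    |
+-------------------------------+-------+
1 row in set (0.05 sec)
The default value of innodb_change_buffer_max_size is 25, meaning the Change Buffer may use at most 1/4 of the buffer pool memory.
Note that the maximum valid value of this parameter is 50. In MySQL 5.5, the SHOW ENGINE INNODB STATUS command shows output similar to the following:
The output shows merged operations and discarded operations, followed by the count of each kind of operation in the Change Buffer. insert refers to the Insert Buffer; delete mark refers to the Delete Buffer; delete refers to the Purge Buffer. discarded operations counts the cases where the table had already been dropped when the Change Buffer was merged, so the records no longer needed to be merged into the secondary index.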
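Adaptive hash index
InnoDB builds hash indexes for hot pages based on access frequency and access patterns in order to speed up queries. The InnoDB storage engine monitors queries against each index page of a table, and if it observes that a hash index would bring a speed improvement, it builds one. This is why it is called an adaptive hash index.
The adaptive hash index is built from the B+ tree pages in the buffer pool, so it is fast to build, and there is no need to build a hash index over the whole table. There is one requirement: the pattern of consecutive accesses to the page must be the same. In other words, the query conditions must be exactly identical, and the accesses must be consecutive.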
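Lock information (lock info)
As we know, the InnoDB storage engine locks table data at the row level. In addition, whenever InnoDB opens a table, it adds a corresponding object to the data dictionary.
Data dictionary
The data dictionary is the collection of metadata about the data, database objects, table objects, and so on in the database. In MySQL, the data dictionary includes table structures, database and table names, column data types, views, indexes, column information, stored procedures, triggers, and similar content. The MySQL INFORMATION_SCHEMA database provides access to database metadata, statistics, and information about the MySQL Server (for example, database or table names, column data types, and access privileges). The information stored in this database can also be called MySQL's data dictionary.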
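Read-ahead mechanism
MySQL's read-ahead mechanism means that when a data page is loaded from disk, other data pages adjacent to it may be loaded into the cache along with it.
For example, suppose there are two free cache pages. When one data page is loaded, an adjacent data page is loaded with it, and each data page goes into one of the free cache pages.
When is MySQL's read-ahead mechanism triggered?
The first rule is governed by the parameter innodb_read_ahead_threshold, whose default value is 56. If multiple data pages in one extent are accessed sequentially and the number of pages accessed exceeds this threshold, read-ahead is triggered and all data pages in the next adjacent extent are loaded into the cache.
The second rule: if the Buffer Pool has cached 13 consecutive data pages from one extent, and these pages are accessed fairly frequently, read-ahead is triggered directly and the remaining data pages of that extent are loaded into the cache. This rule is controlled by the parameter innodb_random_read_ahead, which defaults to OFF, so it is disabled by default.
So by default it is mainly the first rule that may trigger read-ahead, loading many data pages from the adjacent extent into the cache at once.
The benefit of read-ahead is better performance. Suppose you read data page 01 into a cache page. You may well go on to read the adjacent data page 02 next, and without read-ahead, reading page 02 would require another disk IO.
To optimize this, MySQL uses read-ahead. If you have sequentially read many data pages within one extent, say data pages 01 through 56 one after another, MySQL concludes that you will probably continue reading the following pages in order. It therefore reads a batch of subsequent data pages (for example, data pages 57 through 72) into the Buffer Pool in advance.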
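The first (linear) read-ahead rule can be sketched as follows. This is a simplified model with several assumptions: `LinearReadAhead` is a hypothetical class, an extent is taken to be 64 pages, "sequential" is reduced to "page number increases by exactly one within the same extent", and the trigger fires when the run length reaches the threshold.

```python
PAGES_PER_EXTENT = 64   # assumption: 1MB extents made of 16KB pages


class LinearReadAhead:
    """Toy model of threshold-based linear read-ahead."""

    def __init__(self, threshold=56):
        self.threshold = threshold
        self.last_page = None
        self.run_length = 0          # pages read sequentially in this extent
        self.prefetched = []

    def access(self, page_no):
        same_extent = (
            self.last_page is not None
            and page_no // PAGES_PER_EXTENT == self.last_page // PAGES_PER_EXTENT
        )
        if same_extent and page_no == self.last_page + 1:
            self.run_length += 1
        else:
            self.run_length = 1      # pattern broken: start counting again
        self.last_page = page_no
        if self.run_length == self.threshold:
            # Prefetch every page of the next adjacent extent.
            start = (page_no // PAGES_PER_EXTENT + 1) * PAGES_PER_EXTENT
            self.prefetched = list(range(start, start + PAGES_PER_EXTENT))
            return True
        return False


ra = LinearReadAhead(threshold=56)
triggered = [ra.access(p) for p in range(0, 56)]
```

In this run the 56th sequential access is the one that triggers the prefetch of the whole next extent.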
Buffer pool memory management
There are three linked lists to understand here: the Free List, the Flush List, and the LRU List.
Free List
Data pages on disk and cache pages correspond one to one: both are 16KB, and one data page maps to one cache page. The database maintains a free linked list for the Buffer Pool, implemented as a doubly linked list. Each node in this list is the description data block of a free cache page. In other words, as long as a cache page is free, its description data block is placed in the free list. When the database has just started, all cache pages may be free, because it may be an empty database with no data at all, so the description data blocks of all cache pages are placed in the free list. The free list also has a base node, which points to the head and tail nodes of the list and stores how many description data block nodes the list contains, that is, how many free cache pages there are.
Flush List
The Flush List is similar to the Free List. In essence, it uses two pointers in the description data block of each cache page to link the description data blocks of modified cache pages into a doubly linked list. Whenever a cache page is modified, its description data block is added to the flush list. "Flush" means that these are dirty pages that will be flushed to disk later.
LRU List
The size of the buffer pool is fixed, so the number of free cache pages in the free list is fixed as well. As data pages on disk keep being loaded into free cache pages, free cache pages keep being removed from the free list, and sooner or later none are left. At that point some cache pages have to be evicted. Which ones? This is where the cache hit rate comes in: pages that are hit often are in regular use, and pages that are rarely hit can be evicted. The LRU linked list is introduced to determine which cache pages are rarely used.
What is the eviction strategy of the LRU linked list?
Suppose that whenever we load a data page from disk into a cache page, we put the description data block of that cache page at the head of the LRU list. Then every cache page that holds data is in the LRU list, and the most recently loaded pages are at the head. When a cache page near the tail is queried, it is moved to the head, and when eviction is needed, the page at the tail is evicted.
But is this really okay?
The first problem: read-ahead defeats it
Read-ahead loads adjacent data pages that have never been accessed into the cache. Only one cache page has actually been accessed; the other cache page, loaded by read-ahead, has not been accessed by anyone. Yet both cache pages sit at the front of the LRU list, as shown below.
If there are no free cache pages at this point and a new data page needs to be loaded, the so-called "least recently used" cache page is taken from the tail of the LRU list, flushed to disk, and freed. Evicting a page that has actually been used while keeping a prefetched page that nobody has touched is obviously unreasonable.
The second problem: frequently accessed cache pages may be evicted
A full table scan loads all the data pages of the table from disk into the Buffer Pool in one go, one cache page after another. After that, a long run of cache pages at the front of the LRU list may consist entirely of pages loaded by the full table scan. What if the data in this table is hardly used again afterwards? The tail of the LRU list may then consist entirely of cache pages that used to be accessed frequently. When some cache pages have to be evicted to make room, the frequently accessed pages at the tail are evicted, leaving behind a large number of rarely accessed pages loaded by the full table scan.
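The second problem is easy to reproduce with a naive LRU list. The sketch below uses Python's `OrderedDict` as the list (first item = head); `NaiveLRU` and the page names are made up for the illustration.

```python
from collections import OrderedDict


class NaiveLRU:
    """Plain LRU: every access moves the page to the head of the list."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()        # first item = head (most recent)

    def access(self, page_no):
        if page_no in self.pages:
            self.pages.move_to_end(page_no, last=False)   # move to head
            return
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=True)                 # evict the tail
        self.pages[page_no] = True
        self.pages.move_to_end(page_no, last=False)       # new page at head


lru = NaiveLRU(capacity=4)
for hot in ["h1", "h2", "h1", "h2"]:      # frequently used pages
    lru.access(hot)
for scanned in ["s1", "s2", "s3", "s4"]:  # one full table scan
    lru.access(scanned)
survivors = list(lru.pages)
```

After the scan, both frequently used pages have been evicted and the list holds only scanned pages.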
Optimizing the LRU algorithm: an LRU linked list based on separating hot and cold data
MySQL's LRU linked list is designed around separating hot and cold data. The list is split into two parts, one for hot data and one for cold data. The ratio between them is controlled by the innodb_old_blocks_pct parameter. Its default value is 37, meaning cold data accounts for 37% of the list. When a data page is loaded into the cache for the first time, its cache page is placed at the head of the cold data area.
MySQL then applies a rule based on the innodb_old_blocks_time parameter, whose default value is 1000, that is, 1000 milliseconds. After a data page is loaded into a cache page, the page is moved to the head of the hot data area only if it is accessed again more than 1s later. The reasoning is that if you load a data page and still access it after 1s has passed, you are likely to access it frequently in the future. Accesses within the first second do not promote the page.
As a result, pages brought in by read-ahead and full table scans stay at the head of the cold data area and do not enter the hot data area right away.
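The promotion rule can be sketched as below. `MidpointLRU` is a hypothetical toy class: it models only the "load into the cold head, promote on an access at least innodb_old_blocks_time later" rule, takes the current time as an explicit `now_ms` argument, and ignores the 37% sizing and eviction.

```python
class MidpointLRU:
    """Toy LRU list split into a hot region and a cold region."""

    def __init__(self, old_blocks_time_ms=1000):
        self.old_blocks_time_ms = old_blocks_time_ms
        self.hot = []            # index 0 is the head
        self.cold = []
        self.loaded_at = {}      # page_no -> time of first load (ms)

    def access(self, page_no, now_ms):
        if page_no in self.hot:
            self.hot.remove(page_no)
            self.hot.insert(0, page_no)
        elif page_no in self.cold:
            # Promote only if the page is touched again after the time window.
            if now_ms - self.loaded_at[page_no] >= self.old_blocks_time_ms:
                self.cold.remove(page_no)
                self.hot.insert(0, page_no)
        else:
            # First load (query, read-ahead or scan): head of the cold region.
            self.cold.insert(0, page_no)
            self.loaded_at[page_no] = now_ms


lru = MidpointLRU()
lru.access("scan_page", now_ms=0)
lru.access("scan_page", now_ms=5)      # touched again within 1s: stays cold
lru.access("user_page", now_ms=0)
lru.access("user_page", now_ms=1500)   # touched after 1s: promoted
```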
Further optimization of the LRU algorithm
The access rule for the hot data area of the LRU list is also optimized: a cache page is moved to the head of the list only if it sits in the last 3/4 of the hot data area when it is accessed. If a cache page in the first 1/4 of the hot data area is accessed, it is not moved.
For example, if the hot data area contains 100 cache pages, the first 25 are not moved to the head even when they are accessed. The remaining 75 are moved to the head whenever they are accessed. This keeps the movement of nodes in the linked list to a minimum.
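The 1/4 rule can be expressed as a short function. `touch_hot_page` is a hypothetical helper that treats the hot region as a Python list with the head at index 0.

```python
def touch_hot_page(hot, page_no):
    """Move page_no to the head only if it sits in the last 3/4 of `hot`."""
    index = hot.index(page_no)
    if index >= len(hot) // 4:            # beyond the first quarter
        hot.insert(0, hot.pop(index))
    return hot


hot = list(range(100))            # 100 cached pages, page 0 at the head
touch_hot_page(hot, 10)           # within the first 25: no movement
unchanged = hot[:3] == [0, 1, 2]
touch_hot_page(hot, 60)           # in the last 75: moved to the head
```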
When cache pages are evicted from the LRU linked list
When MySQL executes CRUD operations, it works on a large number of cache pages and the corresponding linked lists. Once the cache pages are full, some of them have to be flushed to disk and cleared so that the required data pages can be loaded into them.
We already know that cache pages are evicted based on the LRU list. So when are the cache pages in the cold data area of the LRU list flushed to disk? There are three occasions:
Regularly flushing some cache pages at the tail of the LRU list to disk
A background thread runs a scheduled task. Every so often, this task flushes some cache pages at the tail of the cold data area of the LRU list to disk, clears them, and adds them back to the free list.
Regularly flushing some cache pages in the flush list to disk
Flushing only the cold data area of the LRU list is not enough, because many cache pages in the hot data area may also be modified frequently. Should they never be flushed to disk?
So the background thread also flushes the cache pages in the flush list to disk when MySQL is not very busy. This way, modified data is flushed to disk sooner or later.
Whenever a batch of cache pages in the flush list is flushed to disk, those cache pages are removed from the flush list and the LRU list and added to the free list.
The overall effect is this: data keeps being loaded into cache pages, and cached data keeps being queried and modified. Cache pages in the free list keep decreasing, cache pages in the flush list keep increasing, and cache pages in the LRU list keep increasing and moving.
On the other side, the background thread keeps flushing cache pages from the cold data area of the LRU list and from the flush list to disk and clearing them. Cache pages in the flush list and the LRU list then decrease, and cache pages in the free list increase.
The free list has no free cache pages
If the free list is exhausted and a data page needs to be loaded from disk into a free cache page, a cache page is taken from the tail of the cold data area of the LRU list, which is by definition the least recently used one. It is flushed to disk and cleared, and the data page is then loaded into the freed cache page.
To summarize how the three lists are used: while the Buffer Pool is in use, data pages are frequently loaded from disk into cache pages, and the free list, flush list, and LRU list are all used at the same time. For example, when data is loaded into a cache page, the cache page is removed from the free list and placed at the head of the cold data area of the LRU list.
If you then modify a cache page, the dirty page is recorded in the flush list, and in the LRU list the page may also be moved from the cold data area to the head of the hot data area.
If you query a cache page, it is moved to the hot data area of the LRU list, or, if it is already there, possibly to the head of the hot data area.
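How the three lists interact can be sketched in one small class. This is a toy model that follows the description above rather than InnoDB's real code: `BufferPoolLists` is hypothetical, pages are identified by name, "flushing" does no IO, and promotion from the cold region to the hot region is left out.

```python
class BufferPoolLists:
    """Toy model of the free list, flush list and LRU list working together."""

    def __init__(self, num_pages):
        self.free = list(range(num_pages))   # free cache page slots
        self.lru_cold = []                   # index 0 = head
        self.lru_hot = []
        self.flush = []                      # dirty pages
        self.slot_of = {}                    # data page -> cache slot

    def load(self, page_no):
        if not self.free:
            self.evict_cold_tail()           # no free slot: evict first
        slot = self.free.pop()
        self.slot_of[page_no] = slot
        self.lru_cold.insert(0, page_no)     # new pages enter the cold head

    def modify(self, page_no):
        if page_no not in self.flush:
            self.flush.append(page_no)       # now a dirty page

    def flush_page(self, page_no):
        # "Write to disk", then the slot goes back to the free list.
        if page_no in self.flush:
            self.flush.remove(page_no)
        for region in (self.lru_cold, self.lru_hot):
            if page_no in region:
                region.remove(page_no)
        self.free.append(self.slot_of.pop(page_no))

    def evict_cold_tail(self):
        self.flush_page(self.lru_cold[-1])


lists = BufferPoolLists(num_pages=2)
lists.load("p1")
lists.load("p2")
lists.modify("p1")
lists.load("p3")       # no free slot: the cold tail (p1) is flushed and evicted
```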
Redo Log Buffer
InnoDB has a buffer pool (bp for short), which is the cache of database pages. Any modification in InnoDB is first applied to a page in the bp. The page is then marked as dirty and placed on a dedicated flush list. Later, the master thread or a dedicated cleaner thread periodically writes these pages to disk (hard disk or SSD).
The advantage is that not every write operation has to touch the disk, which would cause a large amount of random IO. Periodic flushing can merge multiple modifications of a page into one IO operation, and asynchronous writing also reduces access latency. However, if the server shuts down abnormally before the dirty pages are flushed to disk, these modifications are lost. If a write is in progress at that moment, the database may even become unusable because of damaged data files.
To avoid these problems, InnoDB writes all page modifications to a special file and performs recovery from this file when the database starts. This file is the redo log file. The technique allows the flushing of bp pages to be delayed, which improves database throughput and effectively reduces access latency.
The cost is the additional overhead of writing the redo log (sequential IO, so very fast) and the time needed for recovery when the database starts.
The redo log consists of two parts: the redo log buffer and the redo log file (covered in the disk file section). InnoDB is a transactional storage engine. When a transaction commits, all of its logs must first be written to the redo log file; the transaction is not complete until the commit operation has finished. Each time the redo log buffer is written to the redo log file, an fsync operation must be called, because the write first lands only in the operating system's buffer cache and is not guaranteed to reach the disk until fsync is performed. The performance of the disk therefore determines the performance of transaction commits to a certain extent (how the redo log is flushed to disk is covered later).
The InnoDB storage engine first puts redo log information into the redo log buffer and then flushes it to the redo log file at a certain frequency. The log buffer generally does not need to be very large, because it is normally flushed to the log file every second. Its size is controlled by the configuration parameter innodb_log_buffer_size, which defaults to 8MB in the version described here (newer MySQL versions default to 16MB).
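The write-ahead idea behind the redo log can be sketched as follows. `RedoLog` and `recover` are hypothetical names, log records are reduced to "page number, new value", and the fsync is only mentioned in a comment; the point is simply that committed changes survive a crash even though the dirty page was never flushed.

```python
class RedoLog:
    """Toy redo log: an in-memory buffer plus a durable log file."""

    def __init__(self):
        self.buffer = []        # redo log buffer (memory)
        self.file = []          # redo log file (survives a crash)

    def append(self, page_no, value):
        self.buffer.append((page_no, value))

    def commit(self):
        # Flush the buffer to the log file before the commit returns
        # (a real engine would also call fsync here).
        self.file.extend(self.buffer)
        self.buffer.clear()


def recover(disk_pages, redo_file):
    """Replay the redo log file on top of the pages found on disk."""
    pages = dict(disk_pages)
    for page_no, value in redo_file:
        pages[page_no] = value
    return pages


disk = {1: "old"}                # data file: the dirty page was never flushed
log = RedoLog()
log.append(1, "new")             # change made by a committed transaction
log.commit()
log.append(1, "lost")            # uncommitted change, only in the log buffer
# crash: everything held in memory is gone, only disk and log.file remain
recovered = recover(disk, log.file)
```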
Double Write
If the Insert Buffer gives the InnoDB storage engine better performance, then Double Write gives it reliable data pages.
InnoDB's page size is generally 16KB, and its data checksum is also calculated over these 16KB. Data is written to disk in units of pages. In most cases, the file system does not write large pages (such as InnoDB's 16KB) atomically, so if the server goes down, only part of a page may have been written. For example, if a power failure or OS crash occurs when only 4KB of a 16KB page has been written, only part of the write succeeds. This is the partial page write problem.
An experienced DBA may think that if a write fails, MySQL can recover from the redo log. That is one approach, but it is important to understand that the redo log records physical modifications to a page, such as "at offset 800, write 'aaaa'". If the page itself is damaged, redoing it is pointless. During recovery, MySQL checks the page's checksum, which relies on the last transaction number of the page. When a partial page write occurs, the page is damaged and the transaction number in the page cannot be found. From InnoDB's point of view, such a data page fails checksum verification and cannot be recovered. Even if verification were forced to pass, crash recovery would still fail, because some of InnoDB's current log types are logical operations and are not idempotent.
To solve this problem, InnoDB implements the double write buffer. Put simply, before a data page is written, it is first written to a separate physical file location (ibdata) and only then to the data file. When the machine is restarted after a crash and a data page is found to be damaged, the page is first restored from its copy before the redo log is applied. This is double write. What it brings to the InnoDB storage engine is the reliability of data pages. The double write technique is analyzed below.
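As shown in the figure above, doublewrite consists of two parts: an in-memory doublewrite buffer of 2MB, and 128 consecutive pages in the shared tablespace on disk, also 2MB in total. When dirty pages in the buffer pool are flushed, they are not written directly to their place on disk. Instead, the dirty pages are first copied into the in-memory doublewrite buffer with memcpy. The doublewrite buffer is then written sequentially to the shared tablespace on disk in two batches of 1MB each, and fsync is called immediately afterwards to sync the disk and avoid the problems caused by buffered writes in the operating system. Once the doublewrite pages have been written, the pages in the doublewrite buffer are written to their respective tablespace files.
In this process the doublewrite itself is a sequential write, so its overhead is small. After the doublewrite completes, the pages in the doublewrite buffer are written to the individual tablespace files, and these writes are scattered.
If the operating system crashes while a page is being written to disk, then during recovery the InnoDB storage engine can find a copy of the page in the doublewrite area of the shared tablespace, copy it to the tablespace file, and then apply the redo log.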
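The recovery path described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not InnoDB's actual page format: the 4-byte CRC32 header, the 4KB OS block size, and the helper names `make_page`, `torn_write` and `recover_page` are all hypothetical and exist only for illustration.

```python
import zlib

PAGE_SIZE = 16 * 1024   # InnoDB default page size
BLOCK_SIZE = 4 * 1024   # assumed OS write unit (hypothetical)


def make_page(payload: bytes) -> bytes:
    """Build a toy page: 4-byte CRC32 followed by the zero-padded payload."""
    if len(payload) > PAGE_SIZE - 4:
        raise ValueError("payload does not fit in one page")
    body = payload.ljust(PAGE_SIZE - 4, b"\x00")
    return zlib.crc32(body).to_bytes(4, "big") + body


def page_is_valid(page: bytes) -> bool:
    """A page is valid when the stored checksum matches its body."""
    stored = int.from_bytes(page[:4], "big")
    return len(page) == PAGE_SIZE and stored == zlib.crc32(page[4:])


def torn_write(old_page: bytes, new_page: bytes, blocks_written: int) -> bytes:
    """Simulate a crash after only some blocks of the new page reached disk."""
    cut = blocks_written * BLOCK_SIZE
    return new_page[:cut] + old_page[cut:]


def recover_page(data_file_page: bytes, doublewrite_copy: bytes) -> bytes:
    """Use the doublewrite copy when the data file page fails its checksum."""
    if page_is_valid(data_file_page):
        return data_file_page
    if page_is_valid(doublewrite_copy):
        return doublewrite_copy
    raise ValueError("page and doublewrite copy are both corrupt")


old = make_page(b"A" * (PAGE_SIZE - 4))
new = make_page(b"B" * (PAGE_SIZE - 4))

doublewrite_area = new              # step 1: sequential write + fsync finished
on_disk = torn_write(old, new, 1)   # step 2: crash after 1 of 4 blocks

restored = recover_page(on_disk, doublewrite_area)
```

After `recover_page` returns, the page is whole again, and only then would the redo log be applied to it.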
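InnoDB storage engine part 2: background threads
IO threads
InnoDB makes heavy use of AIO (asynchronous IO) for reads and writes, which greatly improves database performance. Before InnoDB 1.0 there were 4 IO threads: write, read, insert buffer and log. Later versions increased the number of read threads and write threads to 4 each, for a total of 10.
- read thread: handles reads, loading data from disk into pages in the cache. 4 threads
- write thread: handles writes, flushing dirty pages in the cache to disk. 4 threads
- log thread: flushes the contents of the log buffer to disk. 1 thread
- insert buffer thread: flushes the contents of the write buffer to disk. 1 thread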
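Purge thread
After a transaction commits, the undo log it used is no longer needed, so the purge thread reclaims the undo pages that were allocated. show variables like '%innodb_purge_threads%';
Page Cleaner thread
Its job is to flush dirty data to disk. Once dirty data has been flushed, the corresponding redo log can be overwritten, so this both synchronizes the data and allows the redo log to be reused in a cycle. It calls the write threads to do the work. show variables like '%innodb_page_cleaners%';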
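InnoDB storage engine part 3: disk files
InnoDB's main disk files fall into three groups: the system tablespace, the user tablespaces, and the redo log files and archive files.
Files such as the binary log (binlog) are maintained by the MySQL Server layer, so they are not listed among InnoDB's disk files.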
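System tablespace and user tablespaces
The system tablespace contains the InnoDB data dictionary (metadata and related objects) and is the storage area for the doublewrite buffer, the change buffer and the undo logs.
By default, the system tablespace also contains the table data and index data of any tables that users create in the system tablespace.
The system tablespace is a shared tablespace, because it is shared by multiple tables.
The system tablespace consists of one or more data files. By default, one system data file named ibdata1 with an initial size of 10MB (12MB in more recent MySQL versions) is created in the MySQL data directory. Users can configure the size and number of data files with innodb_data_file_path.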
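The format of innodb_data_file_path is as follows:
innodb_data_file_path=datafile1[,datafile2]...
Users can build one tablespace out of several files and specify the attributes of each file:
innodb_data_file_path = /db/ibdata1:1000M;/dr2/db/ibdata2:1000M:autoextend
Here the two files /db/ibdata1 and /dr2/db/ibdata2 make up the system tablespace. If the two files are on different disks, the load may be spread across the disks, which can improve the overall performance of the database. Each file name is followed by its attributes: ibdata1 has a size of 1000MB, and ibdata2 has a size of 1000MB and can grow automatically once its space is used up.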
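To make the attribute syntax concrete, here is a minimal Python sketch that splits the example value above into its parts. The function `parse_data_file_path` is a hypothetical helper written for this article, not part of MySQL, and it assumes paths contain no colons and sizes always end in K, M or G.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_data_file_path(value):
    """Split an innodb_data_file_path value into one dict per data file."""
    files = []
    for spec in value.split(";"):
        parts = spec.strip().split(":")
        path, size = parts[0], parts[1]
        files.append({
            "path": path,
            # "1000M" -> 1000 * 1024 * 1024 bytes
            "size_bytes": int(size[:-1]) * UNITS[size[-1].upper()],
            # any further attribute list may contain "autoextend"
            "autoextend": "autoextend" in parts[2:],
        })
    return files


files = parse_data_file_path(
    "/db/ibdata1:1000M;/dr2/db/ibdata2:1000M:autoextend"
)
for f in files:
    print(f["path"], f["size_bytes"], f["autoextend"])
```

Only the last file carries autoextend here, which matches the description: ibdata1 is fixed at 1000MB, while ibdata2 starts at 1000MB and may grow.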
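Once innodb_data_file_path is set, the data of all tables based on the InnoDB storage engine is recorded in this system tablespace. If the innodb_file_per_table parameter is set, each table based on the InnoDB storage engine gets its own independent user tablespace.
User tablespaces are named tablename.ibd. This way, users do not have to keep all data in the default system tablespace. However, a user tablespace stores only that table's data, indexes, insert buffer bitmap and similar information; everything else is still stored in the default system tablespace.
The figure below shows how the InnoDB storage engine stores its files. The frm file is the table structure definition file, which records the structure definition of each table.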
Redo log files and archive files
By default, the data directory of the InnoDB storage engine contains two files named ib_logfile0 and ib_logfile1. These are the InnoDB redo log files, which record the transaction log of the InnoDB storage engine.
When an error occurs in InnoDB's data storage file, the redo log file can come in handy. The InnoDB storage engine can use redo log files to restore data to the correct state to ensure data correctness and integrity.
Each InnoDB storage engine has at least one redo log file group, and each file group has at least two redo log files, such as the default ib_logfile0 and ib_logfile1.
In order to obtain higher reliability, users can set up multiple mirror log groups and place different file groups on different disks to improve the high availability of redo logs.
All redo log files in a log group have the same size and are written in a circular fashion. The InnoDB storage engine first writes redo log file 1. When that file is full, it switches to redo log file 2. When redo log file 2 is also full, it switches back to redo log file 1.
Users can set the size of the redo log files with innodb_log_file_size, which has a great impact on the performance of the InnoDB storage engine.
If the redo log files are set too large, crash recovery may take a long time. If they are set too small, they fill up quickly, so checkpoints are triggered often and dirty pages must be flushed to disk frequently, which causes performance jitter.
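The effect of the file size on checkpoint frequency can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not InnoDB's real checkpoint logic: the class `CircularRedoLog` and its fields are hypothetical, positions are plain byte counters, and a "forced checkpoint" simply frees the whole log at once.

```python
class CircularRedoLog:
    """Toy model of a redo log group written in a circle."""

    def __init__(self, file_size, n_files=2):
        self.file_size = file_size
        self.capacity = file_size * n_files
        self.write_pos = 0           # total bytes ever written
        self.checkpoint = 0          # bytes before this may be overwritten
        self.forced_checkpoints = 0  # times writers had to wait for a flush

    def current_file(self):
        """Index of the file the next byte would be written to."""
        return (self.write_pos % self.capacity) // self.file_size

    def append(self, n_bytes):
        if n_bytes > self.capacity:
            raise ValueError("record larger than the whole log group")
        if self.write_pos + n_bytes - self.checkpoint > self.capacity:
            # No free space left: dirty pages must be flushed first so
            # that the old log records can be overwritten.
            self.checkpoint = self.write_pos
            self.forced_checkpoints += 1
        self.write_pos += n_bytes


small = CircularRedoLog(file_size=100)    # 200 bytes in total
large = CircularRedoLog(file_size=1000)   # 2000 bytes in total
for _ in range(40):                       # same workload for both
    small.append(50)
    large.append(50)

print(small.forced_checkpoints, large.forced_checkpoints)
```

With the same workload, the small log group is forced to checkpoint repeatedly while the large one never is, which is the jitter the paragraph above describes.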
Redo log flushing mechanism
InnoDB follows two rules when flushing data files and log files: WAL (write-ahead logging) and Force-log-at-commit. Both ensure the durability of transactions. WAL requires that before a data change is written to disk, the corresponding log in memory must first be written to disk. Force-log-at-commit requires that when a transaction commits, all the logs it generated must be flushed to disk. If the log has been flushed successfully and the database crashes before the data in the buffer pool is flushed to disk, the database can recover the data from the log when it restarts.
As shown in the figure above, when InnoDB changes data in the buffer pool, it first writes the change to the redo log buffer. The redo log buffer is then written to disk either periodically (for example, by a once-per-second flush mechanism) or when the transaction commits, which follows the Force-log-at-commit rule. Only after the redo log has been written to disk is the changed data in the buffer pool written to disk according to the checkpoint mechanism, which follows the WAL rule.
One of the conditions that triggers a checkpoint is that the redo log files are full. So, as mentioned above, if the redo log files are too small and fill up often, checkpoints will frequently force changed data to be written to disk, causing performance jitter.
The file system of the operating system has a cache. When InnoDB writes data to disk, the data may only have reached the file system cache, so it is not yet truly safe.
The innodb_flush_log_at_trx_commit parameter controls what InnoDB does each time a transaction commits. When the value is 0, the redo log is not written at commit; instead, the main thread writes it periodically. When the value is 1, the redo log is written to the file system cache at commit and the file system's fsync is called to actually write the data in the file system cache to disk, which ensures that no data is lost. When the value is 2, the redo log is also written to the file system cache at commit, but fsync is not called; the file system decides when to write the cache to disk.
The log flushing mechanism is shown in the figure below:
innodb_flush_log_at_trx_commit is a basic parameter for InnoDB performance tuning. It affects both InnoDB's write efficiency and data safety. When the value is 0, write efficiency is highest but data safety is lowest. When the value is 1, write efficiency is lowest but data safety is highest. When the value is 2, both are at a medium level. It is generally recommended to set the value to 1 for higher safety, and only a value of 1 guarantees the durability of transactions.
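The three settings can be compared with a toy model of the three places a redo record can sit: the log buffer, the file system cache, and the disk. The class `RedoLogSim` and its method names are hypothetical and only mirror the behavior described in this section; a "mysqld" crash loses the log buffer, while an "os" crash (or power loss) also loses the file system cache.

```python
class RedoLogSim:
    """Toy model of innodb_flush_log_at_trx_commit = 0, 1 or 2."""

    def __init__(self, flush_log_at_trx_commit):
        self.policy = flush_log_at_trx_commit
        self.log_buffer = []   # inside the mysqld process
        self.os_cache = []     # file system cache
        self.disk = []         # durable storage

    def _write(self):
        """Move records from the log buffer to the file system cache."""
        self.os_cache.extend(self.log_buffer)
        self.log_buffer.clear()

    def _fsync(self):
        """Force the file system cache to disk."""
        self.disk.extend(self.os_cache)
        self.os_cache.clear()

    def commit(self, trx):
        self.log_buffer.append(trx)
        if self.policy in (1, 2):
            self._write()
        if self.policy == 1:
            self._fsync()

    def background_flush(self):
        """The periodic write performed by the main thread."""
        self._write()
        self._fsync()

    def surviving(self, crash):
        """Committed transactions still recoverable after a crash."""
        if crash == "mysqld":
            return self.disk + self.os_cache
        if crash == "os":
            return list(self.disk)
        raise ValueError("crash must be 'mysqld' or 'os'")


for policy in (0, 1, 2):
    sim = RedoLogSim(policy)
    sim.commit("t1")
    print(policy, sim.surviving("mysqld"), sim.surviving("os"))
```

In this model only the value 1 keeps a just-committed transaction through both kinds of crash, which is why it is the only setting that guarantees durability.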
Use an UPDATE statement to learn more about the InnoDB storage engine
With the above introduction to the basic architecture of the InnoDB storage engine, let's now analyze the specific process of a data update, using an UPDATE statement as the example.
We divide this picture into upper and lower parts. The upper part is the MySQL Server layer processing flow, and the lower part is the MySQL InnoDB storage engine processing flow.
MySQL Server layer processing flow
This part of the flow does not depend on which storage engine is used. It is handled by the Server layer. The specific steps are as follows:
Various user operations trigger SQL execution in the background. Through the database connection pool that comes with the web project (such as dbcp, c3p0 or druid), a network connection is established with the connection pool of the database server;
After a thread in the database connection pool detects the request, it passes the received SQL statement through the SQL interface to the query parser. The query parser parses the SQL according to SQL syntax to determine which fields of which table to query and what the query conditions are;
The query optimizer then selects the optimal execution plan for the SQL;
The executor then calls a series of storage engine interfaces to carry out the plan and complete the execution of the whole SQL statement.
This part of the process is basically the same as the processing of a SELECT request analyzed above.
InnoDB storage engine processing flow
The actual execution of the statement is completed by the storage engine, as shown in the figure above:
After the binlog has been written to disk, the binlog file name, file path information and a commit mark are written to the redo log, synchronously and in order. The purpose of this step is to keep the redo log consistent with the binlog. The commit mark is the key criterion for deciding whether a transaction was committed successfully. For example, if MySQL crashes after step 5 or step 6 has completed, the redo log contains no final commit mark for the transaction, so the transaction is judged to be unsuccessful. This rules out the case where the redo log file contains a record of the update but the binlog file does not, so the two logs cannot become inconsistent.
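The role of the commit mark during crash recovery can be sketched as follows. This is a deliberately simplified model of the rule stated above, not MySQL's actual recovery code: the record layout `(trx_id, kind)` and the function `classify_after_crash` are assumptions made for illustration.

```python
def classify_after_crash(redo_records):
    """Split transactions into kept and discarded, based on the commit mark.

    redo_records is a list of (trx_id, kind) tuples in log order, where
    kind is "change" for a data modification or "commit" for the mark.
    """
    seen = []          # transaction ids in order of first appearance
    committed = set()  # ids for which a commit mark was found
    for trx_id, kind in redo_records:
        if trx_id not in seen:
            seen.append(trx_id)
        if kind == "commit":
            committed.add(trx_id)
    keep = [t for t in seen if t in committed]
    discard = [t for t in seen if t not in committed]
    return keep, discard


# Transaction 2 crashed before its commit mark reached the redo log.
redo = [(1, "change"), (1, "commit"), (2, "change")]
keep, discard = classify_after_crash(redo)
print("keep:", keep, "discard:", discard)
```

Transaction 2 has changes in the redo log but no commit mark, so in this model it is treated as not committed.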
After the steps above, the in-memory data has been modified, the transaction has been committed, and the logs have been written to disk, but the data on disk has not yet been updated. An IO thread in the background of the InnoDB storage engine handles this. During periods of low database load, it flushes the dirty data in the buffer pool to disk, that is, data that has been updated by transactions but not yet written to disk (it is called dirty because the data on disk and the data in memory no longer agree). This completes the persistence of the transaction.
So the InnoDB writing process can be represented by the following picture