The biggest difference between a database system and a file system is that the database can guarantee the atomicity of operations: an operation either happens completely or not at all, and even if the database crashes midway, no operation is left half-done. Behind the scenes, this requires database logs and a complete crash recovery mechanism. This article carefully analyzes the crash recovery process of InnoDB; the code is based on the 5.6 branch.
lsn: It can be understood as the total amount of redo log generated since the database was created. The larger the value, the more the database has been updated; it can also be understood as a point in update time. In addition, each data page has an lsn, which records the lsn at the time the page was last modified; the larger the value, the later it was modified. For example, the lsn of data page A is 100, the lsn of data page B is 200, the checkpoint lsn is 150, and the system lsn is 300. This means the system has been updated up to lsn 300, and all data pages with lsn smaller than 150 have been flushed to disk, so the latest version of data page A is definitely on disk, while that of data page B is not necessarily on disk and may still be only in memory.
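These relationships can be illustrated with a minimal sketch (illustrative C++ with made-up names, not InnoDB code):

```cpp
#include <cstdint>

using lsn_t = uint64_t;

// A page whose last-modification lsn is below the checkpoint lsn is
// guaranteed to be on disk; otherwise its latest version may be memory-only.
bool latest_version_on_disk(lsn_t page_lsn, lsn_t checkpoint_lsn) {
    return page_lsn < checkpoint_lsn;
}
// checkpoint_lsn = 150: page A (lsn 100) -> true, page B (lsn 200) -> false.
```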
redo log: Modern databases all need to write redo logs. For example, to modify a piece of data, the redo log is written first, then the data itself; once the redo log has been written, success is returned directly to the client. Although this appears to add an extra disk write, performance improves greatly because random writes (writing data pages) are converted into sequential writes (writing redo logs). When the database crashes, scanning the redo log reveals the data pages that were not flushed (before the crash, the pages may have been modified only in memory without having been written to disk), ensuring that data is not lost.
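The write-ahead ordering can be sketched as follows (a conceptual sketch with hypothetical names and trivial stubs, not the actual InnoDB call path):

```cpp
#include <cstdint>
#include <cstdio>

using lsn_t = uint64_t;
static lsn_t system_lsn = 0;

// Trivial stubs standing in for the real engine machinery (hypothetical).
static lsn_t redo_log_append(uint32_t /*page_no*/) { return ++system_lsn; } // sequential write
static void  modify_page_in_memory(uint32_t, lsn_t) {}  // buffer pool update
static void  redo_log_flush_up_to(lsn_t) {}             // sync the log file
static void  ack_client() { std::puts("ok"); }

// Write-ahead logging: append the log record, modify the page in memory,
// sync the log, then reply; the dirty page itself is flushed to disk later.
void update_one_page(uint32_t page_no) {
    lsn_t lsn = redo_log_append(page_no);
    modify_page_in_memory(page_no, lsn);
    redo_log_flush_up_to(lsn);
    ack_client();
}
```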
undo log: The database also provides an undo-like capability: when you find you have modified some wrong data, you can roll back previous operations with the rollback command. This feature requires undo logs. In addition, to improve concurrency (for the same record: reads do not conflict with reads, reads and writes do not conflict with each other, only simultaneous writes conflict), mainstream databases all implement a mechanism similar to MVCC; in InnoDB, this also relies on undo logs. For unified management, unlike redo logs, undo logs have corresponding data pages in the Buffer Pool that are managed together with ordinary data pages: they are evicted from memory according to LRU rules and read back from disk later when needed. Like ordinary data pages, modifications to undo pages also require writing redo logs first.
Checkpoint: To improve performance, the database does not flush a data page to disk every time it is modified in memory. Data pages before the checkpoint are guaranteed to have been flushed to disk, so the logs before that point become useless (since InnoDB recycles its redo log files, that portion of the log can now be overwritten). Data pages after the checkpoint may or may not have been flushed to disk, so the logs after the checkpoint are still needed during crash recovery. InnoDB periodically advances the checkpoint based on the flushing progress of dirty pages, thereby reducing crash recovery time. The checkpoint information is stored in the header of the first log file.
Crash recovery: The user modified some data and received a success response, but for the database, the modified data may not yet have reached the disk. If the database crashes at this point, then after restart the database must recover the modified data from the log and rewrite it to disk to ensure the user's data is not lost. This process of recovering data from the log is the main task of crash recovery, and is also called rolling the database forward. Of course, crash recovery also needs to roll back uncommitted transactions and complete the commit of transactions that had not finished committing. Since rollback requires the support of undo logs, and the integrity and reliability of undo logs are in turn guaranteed by redo logs, crash recovery first performs the redo roll forward, then the undo rollback.
Let's carefully analyze the database crash recovery process from the source code perspective. The entire process is completed in the engine initialization phase (innobase_init), and its most important function is innobase_start_or_create_for_mysql, through which InnoDB completes creation and initialization, including crash recovery. First, let's look at the roll forward of the database.
Rolling the database forward is mainly divided into two stages. The first is the log scanning stage: redo logs are distributed into hash_table according to the space_id and page_no of the data page they belong to, which ensures that logs for the same data page land in the same hash bucket, sorted by lsn from small to large. After the scan completes, the entire hash table is traversed and the logs of each data page are applied in turn. After application, every data page is restored at least to its state before the crash. Let's analyze the code in detail.
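The scan-stage structure can be sketched with STL containers (a simplified sketch; InnoDB's actual hash table and list types are its own):

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <unordered_map>
#include <vector>

using lsn_t = uint64_t;

struct PageId {
    uint32_t space_id;
    uint32_t page_no;
    bool operator==(const PageId& o) const {
        return space_id == o.space_id && page_no == o.page_no;
    }
};
struct PageIdHash {
    size_t operator()(const PageId& p) const {
        return std::hash<uint64_t>{}((uint64_t{p.space_id} << 32) | p.page_no);
    }
};

struct LogRec { lsn_t lsn; std::vector<uint8_t> body; };

// All logs for the same page land in the same bucket, ordered by lsn;
// the apply stage later walks each bucket from the smallest lsn upward.
using RecvHashTable =
    std::unordered_map<PageId, std::map<lsn_t, LogRec>, PageIdHash>;

void distribute(RecvHashTable& table, PageId page, LogRec rec) {
    table[page].emplace(rec.lsn, std::move(rec));
}
```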
First, open all ibdata files (open_or_create_data_files); there can be multiple ibdata files. Each ibdata file has a flush_lsn in its header; compute the max_flush_lsn and min_flush_lsn among these files, because writes to ibdata themselves may be incomplete and in need of recovery. Subsequently (recv_recovery_from_checkpoint_start_func), whether the ibdata files need to be rolled forward is determined by comparing checkpoint_lsn with these two values.
Next, open all files of the system tablespace and the log files (fil_open_log_and_system_tablespace_files) to avoid running out of file handles, and clear the buffer pool (buf_pool_invalidate). Then enter the most core function, recv_recovery_from_checkpoint_start_func; note that it is entered even if the database was shut down normally.
Although recv_recovery_from_checkpoint_start_func looks lengthy, much of its code exists for the LOG_ARCHIVE feature; the code doing the real crash recovery is not that much.
First, some variables are initialized and the srv_force_recovery variable is checked; if the user has set it to skip the roll-forward phase, the function returns directly.
Next, initialize the recv_sys structure, allocate the hash_table, and initialize the flush list rbtree. The recv_sys structure is used mainly in the roll-forward phase of crash recovery. hash_table is, as mentioned before, the hash table used to store the logs of different data pages; its size is initialized to buffer_size_in_bytes/512, which is also its maximum length: more entries cannot be stored. Fortunately, the number of data pages needing recovery cannot exceed this value, because the buffer pool (holding the dirty pages that were in memory before the crash) can hold at most buffer_size_in_bytes/16KB data pages, and even counting compressed pages, at most buffer_size_in_bytes/1KB. Regarding the memory allocation size of this hash table, see also bug #53122. The flush list rbtree is used mainly to speed up inserting dirty pages into the flush list. InnoDB's flush list must be sorted by the oldest modification lsn (oldest_modification) of each data page, from small to large. During normal operation, log_sys->mutex and log_sys->log_flush_order_mutex guarantee this order, but there is no such guarantee during crash recovery: when applying logs, the hash table is traversed from its first element, and there is no guarantee that data pages come out sorted by oldest_modification, so a linear traversal of the flush_list would be needed to find each insertion position, which is too inefficient. A red-black tree is therefore introduced to speed up finding the insertion position.
Next, the checkpoint information is read from the header of ib_logfile0, chiefly checkpoint_lsn and checkpoint_no. Since InnoDB logs are used cyclically and there are always at least 2 of them, ib_logfile0 always exists, so it is safe to store the checkpoint information there with no risk of it being deleted. The checkpoint information is actually written to two places in the file header, and the two checkpoint fields are written in turn. Why write to two places alternately? Suppose there were only one checkpoint field that was constantly updated; the field is 512 bytes (OS_FILE_LOG_BLOCK_SIZE). If the database and the server happened to go down exactly while those 512 bytes were being written (leaving aside the atomic-write feature of hardware, which early hardware did not have), only half of the 512 bytes might be written, making the entire checkpoint field unusable; the database would then be unable to perform crash recovery and thus unable to start. With two checkpoint fields, even if one is damaged, the other can still be used to attempt recovery; although the log may have been partially overwritten by then, it at least increases the probability of a successful recovery. Writing the two fields in turn also reduces the impact of disk sector failures. The data pages before checkpoint_lsn are already on disk and need no roll forward; the data pages after it may not have reached the disk and must be recovered, and it does not matter if they are already on disk, because redo logs are idempotent: applying them once or twice gives the same result (at the lowest level, if the lsn on the data page is greater than or equal to the lsn of the current redo record, the record is not applied; otherwise it is applied). checkpoint_no can be understood as the number of times the checkpoint field has been written to disk; it is incremented by 1 on each flush, and this value modulo 2 implements the alternating writes of the two checkpoint fields. In normal logic, the checkpoint field with the larger checkpoint_no is selected as the final checkpoint information and used as the starting point of the subsequent crash recovery scan.
Then, after the information in the checkpoint field has been used to initialize parts of the recv_sys structure, we enter the core log-parsing function recv_group_scan_log_recs, which we will analyze later. Its main job is to parse redo logs; if memory runs short, it calls the log application function (recv_apply_hashed_log_recs) directly and then continues parsing. If only a few logs remain to be applied, parsing merely distributes them, and they are applied later in the recv_recovery_from_checkpoint_finish function.
Next, a checkpoint is made based on the current state of flushed data pages, because some logs may already have been applied inside recv_group_scan_log_recs. At this point, the recv_recovery_from_checkpoint_start_func function ends.
In the recv_recovery_from_checkpoint_finish function, if srv_force_recovery permits, recv_apply_hashed_log_recs is called to apply the logs; then we wait for the page-flushing thread to exit (this thread is started temporarily during crash recovery), and finally the resources of recv_sys and the memory occupied by hash_table are released.
At this point, the database roll forward is complete. Next, we analyze in detail the implementation of the redo log parsing function and the redo log application function.
The top-level parsing function is recv_group_scan_log_recs. It calls a lower-level function (log_group_read_log_seg) to read in batches of RECV_SCAN_SIZE (64KB). After a batch is read, it first judges whether the end of the log has been reached, using the relationship between block_no and lsn together with the log checksum (so there is no marker in the log header indicating where the valid log ends; it is judged entirely by these two conditions). If the end has been reached, the function returns (as mentioned before, the crash recovery logic runs even after a normal shutdown, and it returns here because the checkpoint of a normal shutdown necessarily points to the end of the log). Otherwise, the log blocks are stripped of their headers and copied into recv_sys->buf; the block headers store control information and checksum values used only for verification and positioning, which are useless when actually applying logs. Before placing data into recv_sys->buf, it checks whether recv_sys->buf is full (RECV_PARSING_BUF_SIZE, 2M) and reports an error if so (if the previous batch ended with an incomplete log record, the parsing function does not distribute it but leaves the incomplete part in recv_sys->buf until the complete record has been parsed). Next, log records are parsed out of recv_sys->buf (recv_parse_log_recs). There are two kinds of records: single_rec and multi_rec. The former means a single operation on a single data page; the latter means multiple operations on one or more data pages. A record also carries the space_id and page_no of the data page concerned, the type of operation, and the operation content (recv_parse_log_rec). After a record is parsed, it is hashed by space_id and page_no (if the corresponding tablespace does not exist in memory, the table has already been deleted) and placed into hash_table (the log contents themselves are still stored in buffer-pool memory), waiting to be applied later. There are a few points worth noting here:
For the multi_rec type, the logs are considered complete and distributed to hash_table only when the MLOG_MULTI_REC_END marker is encountered. Looking at the code, we can see that multi_rec logs are parsed twice: once to verify integrity (looking for MLOG_MULTI_REC_END), and a second time to distribute the logs, as the sketch below illustrates. This feels like a point that could be optimized.
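A heavily simplified sketch of the two-pass handling (the real recv_parse_log_recs also handles single_rec records and many record types; parse_one and distribute_to_hash are hypothetical stand-ins):

```cpp
#include <cstddef>
#include <cstdint>

// Simplified record view produced by one parse step (hypothetical).
struct ParsedRec {
    uint8_t  type;
    uint32_t space_id;
    uint32_t page_no;
    size_t   len;       // total bytes consumed by this record
};

constexpr uint8_t MLOG_MULTI_REC_END = 31;  // value from InnoDB's mtr0mtr.h

bool parse_one(const uint8_t* buf, size_t size, ParsedRec* rec);  // stub
void distribute_to_hash(const ParsedRec& rec);                    // stub

// Pass 1: verify the whole mtr is present (ends with MLOG_MULTI_REC_END).
// Pass 2: only then distribute each record to the hash table. If pass 1
// fails, the tail stays in recv_sys->buf until more log is read in.
bool parse_multi_rec(const uint8_t* buf, size_t size) {
    size_t pos = 0;
    ParsedRec rec;
    while (parse_one(buf + pos, size - pos, &rec)) {   // pass 1: check only
        pos += rec.len;
        if (rec.type == MLOG_MULTI_REC_END) {
            for (size_t p = 0; p < pos;) {             // pass 2: distribute
                parse_one(buf + p, pos - p, &rec);
                if (rec.type != MLOG_MULTI_REC_END) distribute_to_hash(rec);
                p += rec.len;
            }
            return true;    // one complete mtr consumed
        }
    }
    return false;           // incomplete; wait for the next batch
}
```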
There are currently more than 50 types of log operations, and the content following each type differs, so the lengths differ too. The current parsing logic must parse out the full content of each record in turn, then determine its length in order to locate the start of the next record. This is somewhat inefficient. In fact, a field could be added to each record's header storing the length of its body, so that the content itself need not be parsed just to find the record boundary, improving parsing speed and thus crash recovery speed. Judging from test results, the speed can be more than doubled (from 38 seconds to 14 seconds, see bug #82937 for details).
If logs are found after the checkpoint, it means the database was not shut down properly and crash recovery is needed, so some extra operations are performed (recv_init_crash_recovery): printing in the error log the familiar "Database was not shutdown normally!" and "Starting crash recovery."; checking in the double write buffer whether any data page was half-written and restoring it if necessary (buf_dblwr_process); and starting a thread to flush the dirty pages produced by applying logs (because buf_flush_page_cleaner_thread has not been started at this point). Finally, all tablespaces need to be opened. Note: all tables! In operating Alibaba Cloud RDS MySQL, we often find databases stuck in the crash recovery stage, with messages like "Reading tablespace information from the .ibd files..." in the error log, which means the database is opening all of its tables; counting the tables, there may be hundreds of thousands or even millions. The database must open all tables because, when distributing logs, it needs to determine which ibd file a given space_id corresponds to, which it learns by opening every table and reading its space_id; another reason is to let the double write buffer check half-written data pages. To solve the problem of too many tables making recovery too slow, MySQL 5.7 introduced an optimization, WL#7142. The main idea is to write a new log record, mlog_file_name (containing the mapping between space_id and filename), to indicate that operations were performed on a table; subsequent operations on the same table need not write this record again. During crash recovery, one extra scan collects the mlog_file_name records to determine which tables were modified, so there is no need to open all tables just to determine space_ids.
The last point worth noting is memory. As mentioned before, if too many logs have been distributed and occupy too much memory, the parsing function applies logs at an appropriate moment rather than waiting to apply everything at the end. The question then is: how much memory may the logs use before the apply logic is triggered? The answer is buffer_pool_size_in_bytes - 512 * buffer_pool_instance_num * 16KB. Since buffer_pool_instance_num is generally not large, most of the buffer pool memory can be used to store logs. The remaining data pages are reserved mainly for the pages read in while applying logs, because at this point log application is single-threaded: read a page, apply all of its logs, flush it back to disk; this does not require many pages.
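As a concrete illustration with hypothetical numbers (not from the source): for a 16 GiB buffer pool split into 8 instances, 16 GiB - 512 x 8 x 16 KiB = 16 GiB - 64 MiB, so all but 64 MiB of the buffer pool may hold parsed logs before an intermediate apply pass is forced.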
The top-level function for applying logs is recv_apply_hashed_log_recs (logs may also be applied in the io_helper threads). Its main job is to traverse hash_table, read each data page from disk, and apply the logs in its hash bucket in order. After all logs have been applied, all buffer pool pages are flushed if necessary; after all, space is limited. The following points are worth noting:
The logs of a given data page must be applied in lsn order from small to large, otherwise newer data would be overwritten by older data. Only redo records whose lsn is greater than page_lsn are applied; only those need redoing, and the rest are ignored. After the logs are applied, the dirty page is added to the dirty page list; since that list is sorted by the oldest modification lsn (oldest_modification), a red-black tree is introduced here to speed up finding the insertion position, reducing the time complexity from linear search to logarithmic.
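The apply rule and its idempotence can be sketched as follows (simplified; in the source this check lives in recv_recover_page, and apply_rec_to_page stands in for the real type-specific redo handlers):

```cpp
#include <cstdint>
#include <map>

using lsn_t = uint64_t;
struct LogRec { /* parsed redo contents */ };

void apply_rec_to_page(void* page, const LogRec& rec);  // stub

// Walk one hash bucket in lsn order and replay only records newer than the
// page itself; anything with lsn <= page_lsn is already on the page, which
// is exactly why replaying the same log twice is harmless (idempotence).
lsn_t apply_bucket(void* page, lsn_t page_lsn,
                   const std::map<lsn_t, LogRec>& bucket) {
    for (const auto& [lsn, rec] : bucket) {
        if (lsn <= page_lsn) continue;  // already applied: skip
        apply_rec_to_page(page, rec);
        page_lsn = lsn;                 // the page lsn advances as we apply
    }
    return page_lsn;
}
```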
When a data page is needed and is not found in the Buffer Pool, the 32 data pages surrounding it are checked to see whether they also need recovery; if so, they are read out together, which amounts to an io merge that reduces io operations (recv_read_in_area). Because this read is asynchronous, the final log application is done by the io_helper thread (buf_page_io_complete). In addition, to avoid issuing too many ios in a short time, flow-control logic was added to the code (buf_read_recv_pages). If a data page is found to already be in memory, recv_recover_page is called directly to apply its logs. From this we can see that InnoDB's log application is not actually single-threaded: besides the main crash recovery thread, the io_helper threads also participate in recovery, and the number of concurrent threads depends on the number of io_helper read threads.
After the redo roll forward has run, all data pages of the database are in a consistent state, and the undo rollback can safely be executed. At the time of the crash there may have been uncommitted transactions, as well as transactions whose commit had not completed; it must now be decided whether to commit or roll them back. This is done in three main steps. First, scan the undo logs and rebuild the undo log linked lists. Then, based on the lists built in the previous step, reconstruct the transactions from before the crash, i.e., restore the state each transaction was in at that moment. Finally, roll back or commit each transaction according to its status.
Between recv_recovery_from_checkpoint_start_func and recv_recovery_from_checkpoint_finish, trx_sys_init_at_db_start is called; this function performs the first two of the three steps above.
The first step is handled in the function trx_rseg_array_init, which traverses the entire undo log space (up to TRX_SYS_N_RSEGS (128) rollback segments); any non-empty undo segment found is initialized (trx_rseg_create_instance). Within each undo segment, every non-empty undo slot (up to TRX_RSEG_N_SLOTS (1024) slots) is initialized (trx_undo_lists_init). After an undo slot is initialized, undo logs of different types are placed into different linked lists (trx_undo_mem_create_at_db_start). There are two main types of undo log: TRX_UNDO_INSERT and TRX_UNDO_UPDATE. The former is used mainly for insert operations; the latter for update and delete operations. As mentioned before, undo logs serve two purposes: rolling back transactions and serving MVCC snapshot reads. Since inserted data does not need to be provided to other threads' snapshots, undo logs of type TRX_UNDO_INSERT can be deleted as soon as the transaction commits; TRX_UNDO_UPDATE logs cannot be deleted at commit time and can only be cleaned up by the background purge thread once it is certain that no snapshot is still using them.
The second step is performed in the function trx_lists_init_at_db_start. Since the first step has already built undo_insert_list and undo_update_list in memory (each undo segment has its own independent lists), this step only needs to traverse all the lists and reconstruct the transaction states (trx_resurrect_insert and trx_resurrect_update). Simply put, if an undo log's status is TRX_UNDO_ACTIVE, the transaction's status becomes TRX_ACTIVE; if its status is TRX_UNDO_PREPARED, the transaction's status becomes TRX_PREPARED. The setting of srv_force_recovery is also taken into account here: if this variable is non-zero, all transactions are marked for rollback (i.e., set to TRX_ACTIVE), even if a transaction's status ought to be TRX_STATE_PREPARED. After a transaction is rebuilt, it is added to the trx_sys->trx_list linked list in order of transaction id. Finally, in the function trx_sys_init_at_db_start, all transactions needing rollback (status TRX_ACTIVE) are counted and the total number of rows to undo is printed to the error log, similar to: 5 transaction(s) which must be rolled back or cleaned up. InnoDB: in total 342232 row operations to undo.
The third step is invoked from two places: one at the end of recv_recovery_from_checkpoint_finish, the other in recv_recovery_rollback_active. The former mainly rolls back operations on the data dictionary, i.e., the operations of DDL statements; the latter rolls back DML statements. The former must complete before the database can serve requests, while the latter can continue after the database has started serving (i.e., after crash recovery has finished), handled by a newly started background thread, trx_rollback_or_clean_all_recovered. InnoDB considers the data dictionary the most important thing, so it must first be rolled back to a consistent state; the data in user tables can wait and be restored gradually after external service has begun. That is why we often find that the database has already started while the error log is still printing transaction rollback messages. The core function of transaction rollback is trx_rollback_or_clean_recovered, whose logic is very simple: traverse trx_sys->trx_list and roll back or commit according to each transaction's status (trx_rollback_resurrected). Note that if a transaction is in the TRX_STATE_PREPARED state, the InnoDB layer does nothing; the server layer decides whether to roll it back based on the binlog. If the transaction's binlog has already been written, the transaction is committed, because the binlog may have been shipped to a standby; rolling it back on the primary would make primary and standby inconsistent. If the binlog has not been written, the transaction is rolled back.
innodb_fast_shutdown:
innodb_fast_shutdown = 0. When MySQL is shut down, a slow shutdown is performed: not only are the logs and data pages flushed, but data cleanup (purge), ibuf merging, buffer pool dump and lazy table drop operations are also completed (if a table still has pending operations, it may not be deleted immediately even though drop table has already returned success).
innodb_fast_shutdown = 1. This is the default value, which means that when MySQL is shut down, only the logs and data will be flushed.
innodb_fast_shutdown = 2. When MySQL is shut down, only the logs are flushed and nothing else is done, as if MySQL had crashed.
The larger the value of this parameter, the faster MySQL shuts down, but the slower it subsequently starts up: the work skipped at shutdown is effectively shifted to crash recovery. In addition, if MySQL is to be upgraded, the first method, a clean shutdown, is recommended.
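For example, a clean shutdown before an upgrade can be forced like this (innodb_fast_shutdown is a dynamic variable):

```sql
-- Make the next shutdown a full slow shutdown (purge, ibuf merge, etc.)
SET GLOBAL innodb_fast_shutdown = 0;
-- then shut the server down, e.g. with: mysqladmin -u root -p shutdown
```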
innodb_force_recovery:
This parameter mainly controls how much work InnoDB does when it starts. The larger the value, the less work is done and the easier startup becomes, but the greater the risk of data inconsistency. When MySQL cannot start for some uncontrollable reason, you can set this parameter, increasing it gradually from 1 until MySQL starts, then use SELECT INTO OUTFILE to export the data and minimize the data loss.
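An illustrative rescue configuration for the workflow just described (raise the value only as far as needed):

```ini
# my.cnf: start at 1 and increase one step at a time until mysqld comes up,
# then dump the data (e.g. SELECT ... INTO OUTFILE or mysqldump).
[mysqld]
innodb_force_recovery = 1
```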
innodb_force_recovery = 0. This is the default. Everything is done at startup, including redo log application, undo log rollback, starting the background master and purge threads, and ibuf merging. If a damaged data page is detected in the system tablespace, the server crashes; if in a user tablespace, an error is written to the log.
innodb_force_recovery = 1. If a damaged data page is detected, the server neither crashes nor reports an error (buf_page_io_complete); the correctness of a tablespace's first data page is not verified at startup (fil_check_first_page); inaccessible tablespaces are skipped and crash recovery continues (fil_open_single_table_tablespace, fil_load_single_table_tablespace); DDL operations cannot be performed (check_if_supported_inplace_alter); write operations are also disabled (row_insert_for_mysql, row_update_for_mysql, etc.); and all prepared transactions are rolled back (trx_resurrect_insert, trx_resurrect_update_in_prepared_state). This option is quite commonly used: a data page may be damaged by a failing disk, and setting 1 can let the database start normally.
innodb_force_recovery = 2. In addition to the effects of setting 1, the background master and purge threads are not started (srv_master_thread, srv_purge_coordinator_thread, etc.). Set this when you find the database failing to start because of these two threads.
innodb_force_recovery = 3. In addition to the effects of setting 2, no undo rollback is performed, but the rollback segments are still scanned and the undo linked lists are still built (trx_sys_init_at_db_start). srv_read_only_mode is turned on.
innodb_force_recovery = 4. In addition to the effects of setting 3, ibuf operations are not performed (ibuf_merge_or_delete_for_page), and the table statistics thread does not run (because a damaged index page could crash the database) (info_low, dict_stats_update, etc.). From this option onward, data may be permanently damaged; use with caution.
innodb_force_recovery = 5. In addition to the effects of setting 4, the rollback segments are not scanned (recv_recovery_rollback_active) and the undo linked lists are not built. This is used mainly when the undo logs themselves are corrupted.
innodb_force_recovery = 6. In addition to the effects of setting 5, the database roll forward is not performed at all, including log parsing and application (recv_recovery_from_checkpoint_start_func).
InnoDB has implemented a complete crash recovery mechanism that guarantees the database can recover normally no matter what state it crashed in (including a crash during crash recovery itself); this is its biggest difference from a file system. In addition, crash recovery applies physical redo logs to data pages, which suggests a new idea for MySQL Replication: could a standby synchronize data by applying redo logs in a similar way? The Alibaba Cloud RDS MySQL team will bring similar features in subsequent products, so stay tuned.