Transactions on InnoDB_MySQL-mysql教程-PHP中文网

首页

数据库

mysql教程

Transactions on InnoDB_MySQL

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

May 31, 2016 am 08:48 AM

from: https://blogs.oracle.com/mysqlinnodb/entry/innodb_redundant_row_format

InnoDB Redundant Row Format

By annamalai.gurusami on Sep 13, 2013

Introduction

This article describes the InnoDB redundant row format. If you are new to InnoDB code base (a new developer starting to work with InnoDB), then this article is for you. I'll explain the row format by making use of a gdb session. An overview of the article is given below:

Create a simple table and populate few rows.
Access the page that contains the rows inserted.
Access a couple of rows and explain its format.
Give summary of redundant row format.
Useful gdb commands to analyse the InnoDB rows.
Look at a GNU Emacs Lisp function to traverse rows in an InnoDB index page.

To get the most out of this article, the reader is expected to repeat the gdb session as described here.

The Schema

Consider the following SQL statements to produce the schema:

CREATE TABLE t1 (f1 int unsigned) row_format=redundant engine=innodb;INSERT INTO t1 VALUES (1), (2), (3), (4), (5);

When the above CREATE TABLE is executed, InnoDB internally creates a clustered B-tree index (or just clustered index) representing the table. If no primary key is explicitly given, internally a primary key is added. Each node of B-tree is represented by one page (which is by default is of size 16KB). In the given scenario, we insert only 5 records. So the B-tree will consist of only 1 node, the root node.

Access InnoDB Index Page in gdb

All select queries will go through the row_search_for_mysql() function of InnoDB. When the first time it is called for a query, a persistent cursor (btr_pcur_t) will be opened by this function. When this B-tree persistent cursor (pcur in the following gdb) points to the InnoDB index page containing the rows of our interest, we can save the page frame to a file as follows:

(gdb) set $frame = pcur->btr_cur->page_cur->block->frame(gdb) dump binary memory /tmp/gdb.anna $frame $frame + srv_page_size(gdb) p srv_page_size$10 = 16384(gdb)

If you do not understand the commands above, refer to the gdb manual for more information. Saving a page in binary format in a file on disk helps us to analyse the contents of the page by loading it in a text editor like GNU Emacs or ViM. This is sometimes a very useful approach for debugging. The above commands were executed when I was on the sync point row_search_rec_loop, in the function row_search_for_mysql(). The line numbers are from mysql 5.7.3.

4155         } else if (mode == PAGE_CUR_G || mode == PAGE_CUR_L) {4156                 btr_pcur_open_at_index_side(4157                         mode == PAGE_CUR_G, index, BTR_SEARCH_LEAF,4158                         pcur, false, 0, &mtr);4159         }4160 4161 rec_loop:4162         DEBUG_SYNC_C("row_search_rec_loop");4163         if (trx_is_interrupted(trx)) {4164                 btr_pcur_store_position(pcur, &mtr);4165                 err = DB_INTERRUPTED;4166                 goto normal_return;4167         }

An important point to remember when analysing the page is that all the data in the page are stored in big-endian format.

Open Page in GNU Emacs

The following image shows the InnoDB page opened in GNU Emacs in hexl-mode major mode. I have placed the cursor on the origin of the infimum record (at offset 0x65 from page origin). In the right side, you can look at the text "infimum" and "supremum" as stored in the page. These are special records.

All rows in an InnoDB page are linked to one another in ascending order via the next record pointer. In the given example this ordering is based on the internally generated DB_ROW_ID. In each page, the first record (or the minimum record) is the infimum record, and the last record (or the maximum record) is the supremum record. The records between the infimum and supremum records are called the user records.

The next record pointer of each record points to the "origin" of the next record. The record contains data to the right of the origin and meta data to the left of the origin. The next record pointer is available in the 2 bytes before the record origin.

The next record pointer of infimum as given in the picture above is 0x0087 (which is 135 in decimal). This is the offset of the next record from the beginning of the page.

Printing the infimum record

As mentioned previously, infimum is a special minimum record in an index page. It exists at a fixed offset of 101 bytes from the page beginning. The function page_get_infimum_rec() can be used to get the infimum record given a page frame pointer. All records in InnoDB are pointed to at their record origin. So to access the meta data in the record, we need to subtract offset from the record origin.

Printing the infimum record given a page frame pointer (in char):

(gdb) p  pcur->btr_cur->page_cur->block->frame+101$41 = (unsigned char *) 0x7fffd1d3c065 "infimum"

Printing the infimum record given a page frame pointer (in hex):

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+101)@7$45 = {0x69, 0x6e, 0x66, 0x69, 0x6d, 0x75, 0x6d}

Printing the 6 byte Informational Bit Fields

Given the origin of the infimum record, we can print the 6 bytes of informational bits as follows (in hex).

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+101-6)@6$50 = {0x1, 0x0, 0x0, 0x3, 0x0, 0x87}

Printing the 6 bytes of informational bits of the infimum record (in binary). In this output, the right most byte displayed is the first byte before the record origin.

(gdb) x /6tb  pcur->btr_cur->page_cur->block->frame+101-60x7fffd1d3c05f: 00000001  00000000   00000000  00000011  00000000  10000111(gdb)

The above 6 bytes bit fields of the record is formatted as follows: (it also contains the actual values as seen above)

4 bits
info bits
byte: 6
00004 bits
n_owned
byte: 6
000113 bits
heap number
byte: 4,5
00000000 0000010 bits
field count
byte: 3,4
000 00000011 bit
flag
byte: 3
12 bytes
next record offset
byte: 1, 2
00000000 10000111

Each of the box above represents one bit field. It contains the following information:

The size of bit field in bits or bytes
A short description of the bit field.
The byte where this bit field exists.
The actual value of this bit field as seen in above gdb output

The 1 bit flag above is used to tell whether the offset array element is 1 byte (if set) or 2 bytes (if not set). In our example case, the 1 bit flag is set and hence the offset array elements are each 1 byte long.

From the field count bit field, we can see that the infimum record has only one field, whose value is fixed to "infimum". The next record pointer of the infimum record points to the first user record in the page. This is available at the offset 0x0087, which is 135 in decimal. So the first user record is available at page frame pointer + 135 bytes.

The Bit Fields

The information available in 6 bytes (REC_N_OLD_EXTRA_BYTES) of bit field data is listed here. These six bytes are available immediately before the record origin. These bit fields must be accessed by making use of the interface provided by the record manager (refer to rem0rec.h).

The Bit Field	Size of Bit Field	Description of the Bit Field
info bits	4 bits	Used to delete mark a record
n_owned	4 bits	Number of records owned by this record. This is related to page directory. To understand this field, we need to understand the page structure used by InnoDB. This is not explained further in this article.
heap number	13 bits	Gives the order number of this record in the page. Each row has a unique heap number in the page. The lock manager uses this heap number.
field count	10 bits	Number of fields in this record.
flag	1 bit	Tells if the offset array element is 1 or 2 bytes.
next record offset	2 bytes	Offset of the next record from page origin.

Printing the First User Record

To first parse the record we need to know the number of records in the row. This can be obtained from the 6 bytes of bit-field information available before the record origin. Let us first see this for the user record:

(gdb) x /6tb  pcur->btr_cur->page_cur->block->frame+135-60x7fffd1d3c081: 00000000  00000000  00010000  00001001  00000000  10101000(gdb)

4 bits
info bits
byte: 6
00004 bits
n_owned
byte: 6
000013 bits
heap number
byte: 4,5
00000000 0001010 bits
field count
byte: 3,4
000 00001001 bit
flag
byte: 3
12 bytes
next record offset
byte: 1, 2
00000000 10101000

The above information shows that this record has 4 fields (000 0000100 in binary). But our table definition was

CREATE TABLE t1 (f1 unsigned int) row_format=redundant engine=innodb;

If you are wondering why a row of this table has 4 field count, it is because InnoDB added 3 internal fields to the table t1 - a row identifier (DB_ROW_ID), transaction identifier (DB_TRX_ID) and a undo roll pointer (DB_ROLL_PTR). A row identifier column is internally added to a table if no primary key is explicitly given by the user. This row identifier becomes the internal primary key. The transaction identifier specifies the transaction that has last updated the row. The undo roll pointer points to the previous versions of the record. The 4th field is the actual column f1 added by the user.

The list of fields in the table t1 can be printed by accessing its associated dict_table_t object (or its clustered index dict_index_t object). In the following, the name GEN_CLUST_INDEX is the name of the clustered index of the table t1. This is the standard name of the clustered index if it was internally generated by InnoDB.

(gdb) p prebuilt->index->name       $75 = 0x7fffa00232e8 "GEN_CLUST_INDEX"(gdb) p prebuilt->index->table      $73 = (dict_table_t *) 0x7fffa001fe68(gdb) p prebuilt->index->table->name$74 = 0x7fffa000cef0 "test/t1"(gdb) set print array on(gdb) p *prebuilt->index->fields@4$67 =   {{col = 0x7fffa0020154, name = 0x7fffa00201c3 "DB_ROW_ID", prefix_len = 0,     fixed_len = 6},  {col = 0x7fffa0020160, name = 0x7fffa00201cd "DB_TRX_ID", prefix_len = 0,     fixed_len = 6},  {col = 0x7fffa002016c, name = 0x7fffa00201d7 "DB_ROLL_PTR", prefix_len = 0,     fixed_len = 7},  {col = 0x7fffa0020148, name = 0x7fffa00201c0 "f1", prefix_len = 0, fixed_len = 4}}(gdb)

Printing the Record Offsets (or Offset Array)

The redundant record contains an array of offsets of each field of the row. This array is available after the 6-byte bit fields (down from the record origin). The size of each element of the array is either 1 byte or 2 byte depending on the 1 bit flag in the 6 byte bit fields. In our current case each offset element is 1 byte long.

(gdb) p /d  *(pcur->btr_cur->page_cur->block->frame+135-6-4)@4$79 = {23, 19, 12, 6}(gdb)

The first field (DB_ROW_ID) starts at offset 0 from record origin. The second field (DB_TRX_ID) starts at offset 6 from record origin. The third field (DB_ROLL_PTR) starts at offset 12 bytes from record origin. The fourth field (f1) starts at offset 19 bytes from record origin. The end of the record is 23 bytes from the record origin. So the total size of this record can be calculated as follows:

S - size of offset array element (1 byte) N - number of fields in the row (4 fields)I - size of bit fields (6 bytes)D - Length of row after record origin (23 as given in offset array)R - Total Size of the row R = (S*N) + I + D = 4 + 6 + 23 = 33 bytes

The total size of the row is 33 bytes. Since this table contains only fixed length fields, the size of each row in this table will be the same.

Printing the DB_ROW_ID

The DB_ROW_ID is 6 bytes long. It starts at the record origin itself. It can be printed as follows:

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135)@6$82 = {0x0, 0x0, 0x0, 0x0, 0x2, 0x0}(gdb)

Printing the DB_TRX_ID

The second field DB_TRX_ID is also 6 bytes long. It starts at an offset of 6 bytes from the record origin. It can be examined as follows:

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135+6)@6$84 = {0x0, 0x0, 0x0, 0x0, 0x9, 0x8}(gdb)

Printing the DB_ROLL_PTR

The third field DB_ROLL_PTR is 7 bytes long. It points to the previous versions of a record available in the undo segments. It can be examined as follows:

(gdb)  p /x *(pcur->btr_cur->page_cur->block->frame+135+12)@7$3 = {0xa8, 0x0, 0x0, 0x1, 0x1a, 0x1, 0x10}

Printing the User Data Field

The fourth field "f1" is the user defined field. It is 4 bytes long. It can be examined as follows:

(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135+19)@4$5 = {0x0, 0x0, 0x0, 0x1}(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+135+19))$8 = 1(gdb)

Thus, we have completely accessed the first user record. Given the first user record, we can find out the location of the next user record (the first 2 bytes before the record origin). If we keep accessing the next user record, we will eventually reach the supremum record, which is the last (and considered largest or maximum) record in a page.

Printing All Rows

Just to improve our confidence in accessing the rows, here I have printed all the rows that we have inserted into the table. The first record is the infimum, followed by the records 1, 2, 3, 4 and 5 (which we have inserted) and then the last record supremum.

(gdb) p pcur->btr_cur->page_cur->block->frame+101$13 = (unsigned char *) 0x7fffd1e48065 "infimum"(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+135+19))$14 = 1(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+168+19))$15 = 2(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+201+19))$16 = 3(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+234+19))$17 = 4(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+267+19))$18 = 5(gdb) p pcur->btr_cur->page_cur->block->frame+116$19 = (unsigned char *) 0x7fffd1e48074 "supremum"(gdb)

The Complete Redundant Row Format

One redundant record can be logically separated into 5 parts - the offsets array, the 6-byte bit field, record origin, hidden fields (DB_ROW_ID, DB_TRX_ID, DB_ROLL_PTR) and the actual user-specified fields of the table. They can be viewed as follows:

offsets array
fn ... f1, x2, x1, pn ... p16-byte bit fieldoriginprimary key
DB_ROW_ID
or user given
p1, p2 ... pnhidden fields
DB_TRX_ID
DB_ROLL_PTR
x1, x2user fields
f1, f2, f3 ... fn

All record pointers point to the record origin. The record origin is the beginning of the primary key for the clustered primary index. It is not a separate entity that occupies some space. I am showing it separately to make it easy to visualize. All information prior to the record origin can be considered as the record header or the record meta-data.

Each data field has a corresponding offsets array element. The offsets array is arranged in reverse order. The first element in the array provides the offset of the end of the last data field and the last element in the array provides the offset of the end of the first data element.

The offsets array also tells whether a data field is NULL. If a field is NULL, the highest bit in the associated offset array element will be set to 1.

The biggest advantage of the redundant row format is that a record in this format can be parsed without needing to access the data dictionary. But this format is not space efficient. Even for fixed length fields, we store information in the offsets array (this is avoided in COMPACT, DYNAMIC and COMPRESSED row formats).

One more limitation of the redundant row format is that the blob data cannot be stored completely off page (as in DYNAMIC and COMPRESSED row formats). A 768 byte prefix of the blob will always be stored inline with the row and cannot be avoided.

A small note on row formats. InnoDB really supports only two row formats - the redundant row format and the compact row format. The compressed and dynamic row formats are a variation available only on top of the compact row format. Technically these variations can be implemented for the redundant row format as well, but is currently not available.

GNU Emacs Function to Traverse Rows in Page

If we dump a single InnoDB index page into a file, we can open it in GNU Emacs hexl-mode and use the following elisp function to traverse from infimum record to the supremum record. First place the cursor on the 'i' of the infimum record and then call this function to reach the next record origin. I am not an expert in writing elisp functions. If you can improve on this and write a better one, kindly share it with me.

<strong>(</strong><strong>defun </strong>ib-next-red-row <strong>(</strong><strong>)</strong>  "Go to the origin of the next record in ROW_FORMAT=REDUNDANT.The cursor must be positioned at the origin of a record. This is for the GNU Emacs hexl-mode. "  <strong>(</strong>interactive<strong>)</strong>  <strong>(</strong>setq rec-origin <strong>(</strong>point<strong>)</strong><strong>)</strong>  <strong>(</strong>setq rec-origin-offset <strong>(</strong>hexl-current-address<strong>)</strong><strong>)</strong>  ;; The next-rec pointer is stored in 2 bytes before the  ;; record origin  <strong>(</strong>hexl-backward-char 2<strong>)</strong>  <strong>(</strong>setq next-rec <strong>(</strong>string <strong>(</strong>char-after<strong>)</strong><strong>)</strong><strong>)</strong>  <strong>(</strong>forward-char 1<strong>)</strong>  <strong>(</strong>setq next-rec 	<strong>(</strong>concat next-rec <strong>(</strong>string <strong>(</strong>char-after<strong>)</strong><strong>)</strong><strong>)</strong><strong>)</strong>  <strong>(</strong>hexl-forward-char 1<strong>)</strong>  <strong>(</strong>setq next-rec 	<strong>(</strong>concat next-rec <strong>(</strong>string <strong>(</strong>char-after<strong>)</strong><strong>)</strong><strong>)</strong><strong>)</strong>  <strong>(</strong>forward-char 1<strong>)</strong>  <strong>(</strong>setq next-rec 	<strong>(</strong>concat next-rec <strong>(</strong>string <strong>(</strong>char-after<strong>)</strong><strong>)</strong><strong>)</strong><strong>)</strong>  ;; Next record is next-rec-off bytes away from page origin  <strong>(</strong>setq next-rec-off <strong>(</strong>hexl-hex-string-to-integer next-rec<strong>)</strong><strong>)</strong>  ;; Assuming that there is only one page in dump  <strong>(</strong>hexl-goto-address 0<strong>)</strong>  <strong>(</strong>hexl-forward-char next-rec-off<strong>)</strong><strong>)</strong>

Conclusion

In this article, the format of the redundant row was explained. A gdb session was used to demonstrate how to access a page and the rows within that page. The infimum record was accessed, printed and analysed. Using its next record pointer, the first user record was accessed. The reader can use the next record pointer of the first user record to access the next user record and so on till the supremum record is reached. I leave that as an exercise for the readers.

Thanks to Marko Makela for his useful tips on debugging at our yearly InnoDB Team Meeting 2013 held at Shanghai, China. This article derives from his ideas.

声明

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系admin@php.cn

MySQL：世界上最受欢迎的数据库的简介Apr 12, 2025 am 12:18 AM

MySQL是一种开源的关系型数据库管理系统，主要用于快速、可靠地存储和检索数据。其工作原理包括客户端请求、查询解析、执行查询和返回结果。使用示例包括创建表、插入和查询数据，以及高级功能如JOIN操作。常见错误涉及SQL语法、数据类型和权限问题，优化建议包括使用索引、优化查询和分表分区。

MySQL的重要性：数据存储和管理Apr 12, 2025 am 12:18 AM

MySQL是一个开源的关系型数据库管理系统，适用于数据存储、管理、查询和安全。1.它支持多种操作系统，广泛应用于Web应用等领域。2.通过客户端-服务器架构和不同存储引擎，MySQL高效处理数据。3.基本用法包括创建数据库和表，插入、查询和更新数据。4.高级用法涉及复杂查询和存储过程。5.常见错误可通过EXPLAIN语句调试。6.性能优化包括合理使用索引和优化查询语句。

为什么要使用mysql？利益和优势Apr 12, 2025 am 12:17 AM

选择MySQL的原因是其性能、可靠性、易用性和社区支持。1.MySQL提供高效的数据存储和检索功能，支持多种数据类型和高级查询操作。2.采用客户端-服务器架构和多种存储引擎，支持事务和查询优化。3.易于使用，支持多种操作系统和编程语言。4.拥有强大的社区支持，提供丰富的资源和解决方案。

描述InnoDB锁定机制（共享锁，独家锁，意向锁，记录锁，间隙锁，下一键锁）。Apr 12, 2025 am 12:16 AM

InnoDB的锁机制包括共享锁、排他锁、意向锁、记录锁、间隙锁和下一个键锁。1.共享锁允许事务读取数据而不阻止其他事务读取。2.排他锁阻止其他事务读取和修改数据。3.意向锁优化锁效率。4.记录锁锁定索引记录。5.间隙锁锁定索引记录间隙。6.下一个键锁是记录锁和间隙锁的组合，确保数据一致性。

MySQL查询性能差的常见原因是什么？Apr 12, 2025 am 12:11 AM

MySQL查询性能不佳的原因主要包括没有使用索引、查询优化器选择错误的执行计划、表设计不合理、数据量过大和锁竞争。 1.没有索引导致查询缓慢，添加索引后可显着提升性能。 2.使用EXPLAIN命令可以分析查询计划，找出优化器错误。 3.重构表结构和优化JOIN条件可改善表设计问题。 4.数据量大时，采用分区和分表策略。 5.高并发环境下，优化事务和锁策略可减少锁竞争。

您什么时候应该使用复合索引与多个单列索引？Apr 11, 2025 am 12:06 AM

在数据库优化中，应根据查询需求选择索引策略：1.当查询涉及多个列且条件顺序固定时，使用复合索引；2.当查询涉及多个列但条件顺序不固定时，使用多个单列索引。复合索引适用于优化多列查询，单列索引则适合单列查询。

如何识别和优化MySQL中的慢速查询？（慢查询日志，performance_schema）Apr 10, 2025 am 09:36 AM

要优化MySQL慢查询，需使用slowquerylog和performance_schema：1.启用slowquerylog并设置阈值，记录慢查询；2.利用performance_schema分析查询执行细节，找出性能瓶颈并优化。

MySQL和SQL：开发人员的基本技能Apr 10, 2025 am 09:30 AM

MySQL和SQL是开发者必备技能。1.MySQL是开源的关系型数据库管理系统，SQL是用于管理和操作数据库的标准语言。2.MySQL通过高效的数据存储和检索功能支持多种存储引擎，SQL通过简单语句完成复杂数据操作。3.使用示例包括基本查询和高级查询，如按条件过滤和排序。4.常见错误包括语法错误和性能问题，可通过检查SQL语句和使用EXPLAIN命令优化。5.性能优化技巧包括使用索引、避免全表扫描、优化JOIN操作和提升代码可读性。

See all articles