from: https://blogs.oracle.com/mysqlinnodb/entry/innodb_redundant_row_format
InnoDB Redundant Row Format
By annamalai.gurusami on Sep 13, 2013
Introduction
This article describes the InnoDB redundant row format. If you are new to InnoDB code base (a new developer starting to work with InnoDB), then this article is for you. I'll explain the row format by making use of a gdb session. An overview of the article is given below:
- Create a simple table and populate few rows.
- Access the page that contains the rows inserted.
- Access a couple of rows and explain its format.
- Give summary of redundant row format.
- Useful gdb commands to analyse the InnoDB rows.
- Look at a GNU Emacs Lisp function to traverse rows in an InnoDB index page.
To get the most out of this article, the reader is expected to repeat the gdb session as described here.
The Schema
Consider the following SQL statements to produce the schema:
CREATE TABLE t1 (f1 int unsigned) row_format=redundant engine=innodb;INSERT INTO t1 VALUES (1), (2), (3), (4), (5);
When the above CREATE TABLE is executed, InnoDB internally creates a clustered B-tree index (or just clustered index) representing the table. If no primary key is explicitly given, internally a primary key is added. Each node of B-tree is represented by one page (which is by default is of size 16KB). In the given scenario, we insert only 5 records. So the B-tree will consist of only 1 node, the root node.
Access InnoDB Index Page in gdb
All select queries will go through the row_search_for_mysql() function of InnoDB. When the first time it is called for a query, a persistent cursor (btr_pcur_t) will be opened by this function. When this B-tree persistent cursor (pcur in the following gdb) points to the InnoDB index page containing the rows of our interest, we can save the page frame to a file as follows:
(gdb) set $frame = pcur->btr_cur->page_cur->block->frame(gdb) dump binary memory /tmp/gdb.anna $frame $frame + srv_page_size(gdb) p srv_page_size$10 = 16384(gdb)
If you do not understand the commands above, refer to the gdb manual for more information. Saving a page in binary format in a file on disk helps us to analyse the contents of the page by loading it in a text editor like GNU Emacs or ViM. This is sometimes a very useful approach for debugging. The above commands were executed when I was on the sync point row_search_rec_loop, in the function row_search_for_mysql(). The line numbers are from mysql 5.7.3.
4155 } else if (mode == PAGE_CUR_G || mode == PAGE_CUR_L) {4156 btr_pcur_open_at_index_side(4157 mode == PAGE_CUR_G, index, BTR_SEARCH_LEAF,4158 pcur, false, 0, &mtr);4159 }4160 4161 rec_loop:4162 DEBUG_SYNC_C("row_search_rec_loop");4163 if (trx_is_interrupted(trx)) {4164 btr_pcur_store_position(pcur, &mtr);4165 err = DB_INTERRUPTED;4166 goto normal_return;4167 }
An important point to remember when analysing the page is that all the data in the page are stored in big-endian format.
Open Page in GNU Emacs
The following image shows the InnoDB page opened in GNU Emacs in hexl-mode major mode. I have placed the cursor on the origin of the infimum record (at offset 0x65 from page origin). In the right side, you can look at the text "infimum" and "supremum" as stored in the page. These are special records.

All rows in an InnoDB page are linked to one another in ascending order via the next record pointer. In the given example this ordering is based on the internally generated DB_ROW_ID. In each page, the first record (or the minimum record) is the infimum record, and the last record (or the maximum record) is the supremum record. The records between the infimum and supremum records are called the user records.
The next record pointer of each record points to the "origin" of the next record. The record contains data to the right of the origin and meta data to the left of the origin. The next record pointer is available in the 2 bytes before the record origin.
The next record pointer of infimum as given in the picture above is 0x0087 (which is 135 in decimal). This is the offset of the next record from the beginning of the page.
Printing the infimum record
As mentioned previously, infimum is a special minimum record in an index page. It exists at a fixed offset of 101 bytes from the page beginning. The function page_get_infimum_rec() can be used to get the infimum record given a page frame pointer. All records in InnoDB are pointed to at their record origin. So to access the meta data in the record, we need to subtract offset from the record origin.
Printing the infimum record given a page frame pointer (in char):
(gdb) p pcur->btr_cur->page_cur->block->frame+101$41 = (unsigned char *) 0x7fffd1d3c065 "infimum"
Printing the infimum record given a page frame pointer (in hex):
(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+101)@7$45 = {0x69, 0x6e, 0x66, 0x69, 0x6d, 0x75, 0x6d}
Printing the 6 byte Informational Bit Fields
Given the origin of the infimum record, we can print the 6 bytes of informational bits as follows (in hex).
(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+101-6)@6$50 = {0x1, 0x0, 0x0, 0x3, 0x0, 0x87}
Printing the 6 bytes of informational bits of the infimum record (in binary). In this output, the right most byte displayed is the first byte before the record origin.
(gdb) x /6tb pcur->btr_cur->page_cur->block->frame+101-60x7fffd1d3c05f: 00000001 00000000 00000000 00000011 00000000 10000111(gdb)
The above 6 bytes bit fields of the record is formatted as follows: (it also contains the actual values as seen above)
4 bitsinfo bits
byte: 6
00004 bits
n_owned
byte: 6
000113 bits
heap number
byte: 4,5
00000000 0000010 bits
field count
byte: 3,4
000 00000011 bit
flag
byte: 3
12 bytes
next record offset
byte: 1, 2
00000000 10000111
Each of the box above represents one bit field. It contains the following information:
The size of bit field in bits or bytesA short description of the bit field.
The byte where this bit field exists.
The actual value of this bit field as seen in above gdb output
The 1 bit flag above is used to tell whether the offset array element is 1 byte (if set) or 2 bytes (if not set). In our example case, the 1 bit flag is set and hence the offset array elements are each 1 byte long.
From the field count bit field, we can see that the infimum record has only one field, whose value is fixed to "infimum". The next record pointer of the infimum record points to the first user record in the page. This is available at the offset 0x0087, which is 135 in decimal. So the first user record is available at page frame pointer + 135 bytes.
The Bit Fields
The information available in 6 bytes (REC_N_OLD_EXTRA_BYTES) of bit field data is listed here. These six bytes are available immediately before the record origin. These bit fields must be accessed by making use of the interface provided by the record manager (refer to rem0rec.h).
The Bit Field | Size of Bit Field | Description of the Bit Field |
---|---|---|
info bits | 4 bits | Used to delete mark a record |
n_owned | 4 bits | Number of records owned by this record. This is related to page directory. To understand this field, we need to understand the page structure used by InnoDB. This is not explained further in this article. |
heap number | 13 bits | Gives the order number of this record in the page. Each row has a unique heap number in the page. The lock manager uses this heap number. |
field count | 10 bits | Number of fields in this record. |
flag | 1 bit | Tells if the offset array element is 1 or 2 bytes. |
next record offset | 2 bytes | Offset of the next record from page origin. |
Printing the First User Record
To first parse the record we need to know the number of records in the row. This can be obtained from the 6 bytes of bit-field information available before the record origin. Let us first see this for the user record:
(gdb) x /6tb pcur->btr_cur->page_cur->block->frame+135-60x7fffd1d3c081: 00000000 00000000 00010000 00001001 00000000 10101000(gdb)4 bits
info bits
byte: 6
00004 bits
n_owned
byte: 6
000013 bits
heap number
byte: 4,5
00000000 0001010 bits
field count
byte: 3,4
000 00001001 bit
flag
byte: 3
12 bytes
next record offset
byte: 1, 2
00000000 10101000
The above information shows that this record has 4 fields (000 0000100 in binary). But our table definition was
CREATE TABLE t1 (f1 unsigned int) row_format=redundant engine=innodb;
If you are wondering why a row of this table has 4 field count, it is because InnoDB added 3 internal fields to the table t1 - a row identifier (DB_ROW_ID), transaction identifier (DB_TRX_ID) and a undo roll pointer (DB_ROLL_PTR). A row identifier column is internally added to a table if no primary key is explicitly given by the user. This row identifier becomes the internal primary key. The transaction identifier specifies the transaction that has last updated the row. The undo roll pointer points to the previous versions of the record. The 4th field is the actual column f1 added by the user.
The list of fields in the table t1 can be printed by accessing its associated dict_table_t object (or its clustered index dict_index_t object). In the following, the name GEN_CLUST_INDEX is the name of the clustered index of the table t1. This is the standard name of the clustered index if it was internally generated by InnoDB.
(gdb) p prebuilt->index->name $75 = 0x7fffa00232e8 "GEN_CLUST_INDEX"(gdb) p prebuilt->index->table $73 = (dict_table_t *) 0x7fffa001fe68(gdb) p prebuilt->index->table->name$74 = 0x7fffa000cef0 "test/t1"(gdb) set print array on(gdb) p *prebuilt->index->fields@4$67 = {{col = 0x7fffa0020154, name = 0x7fffa00201c3 "DB_ROW_ID", prefix_len = 0, fixed_len = 6}, {col = 0x7fffa0020160, name = 0x7fffa00201cd "DB_TRX_ID", prefix_len = 0, fixed_len = 6}, {col = 0x7fffa002016c, name = 0x7fffa00201d7 "DB_ROLL_PTR", prefix_len = 0, fixed_len = 7}, {col = 0x7fffa0020148, name = 0x7fffa00201c0 "f1", prefix_len = 0, fixed_len = 4}}(gdb)
Printing the Record Offsets (or Offset Array)
The redundant record contains an array of offsets of each field of the row. This array is available after the 6-byte bit fields (down from the record origin). The size of each element of the array is either 1 byte or 2 byte depending on the 1 bit flag in the 6 byte bit fields. In our current case each offset element is 1 byte long.
(gdb) p /d *(pcur->btr_cur->page_cur->block->frame+135-6-4)@4$79 = {23, 19, 12, 6}(gdb)
The first field (DB_ROW_ID) starts at offset 0 from record origin. The second field (DB_TRX_ID) starts at offset 6 from record origin. The third field (DB_ROLL_PTR) starts at offset 12 bytes from record origin. The fourth field (f1) starts at offset 19 bytes from record origin. The end of the record is 23 bytes from the record origin. So the total size of this record can be calculated as follows:
S - size of offset array element (1 byte) N - number of fields in the row (4 fields)I - size of bit fields (6 bytes)D - Length of row after record origin (23 as given in offset array)R - Total Size of the row R = (S*N) + I + D = 4 + 6 + 23 = 33 bytes
The total size of the row is 33 bytes. Since this table contains only fixed length fields, the size of each row in this table will be the same.
Printing the DB_ROW_ID
The DB_ROW_ID is 6 bytes long. It starts at the record origin itself. It can be printed as follows:
(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135)@6$82 = {0x0, 0x0, 0x0, 0x0, 0x2, 0x0}(gdb)
Printing the DB_TRX_ID
The second field DB_TRX_ID is also 6 bytes long. It starts at an offset of 6 bytes from the record origin. It can be examined as follows:
(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135+6)@6$84 = {0x0, 0x0, 0x0, 0x0, 0x9, 0x8}(gdb)
Printing the DB_ROLL_PTR
The third field DB_ROLL_PTR is 7 bytes long. It points to the previous versions of a record available in the undo segments. It can be examined as follows:
(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135+12)@7$3 = {0xa8, 0x0, 0x0, 0x1, 0x1a, 0x1, 0x10}
Printing the User Data Field
The fourth field "f1" is the user defined field. It is 4 bytes long. It can be examined as follows:
(gdb) p /x *(pcur->btr_cur->page_cur->block->frame+135+19)@4$5 = {0x0, 0x0, 0x0, 0x1}(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+135+19))$8 = 1(gdb)
Thus, we have completely accessed the first user record. Given the first user record, we can find out the location of the next user record (the first 2 bytes before the record origin). If we keep accessing the next user record, we will eventually reach the supremum record, which is the last (and considered largest or maximum) record in a page.
Printing All Rows
Just to improve our confidence in accessing the rows, here I have printed all the rows that we have inserted into the table. The first record is the infimum, followed by the records 1, 2, 3, 4 and 5 (which we have inserted) and then the last record supremum.
(gdb) p pcur->btr_cur->page_cur->block->frame+101$13 = (unsigned char *) 0x7fffd1e48065 "infimum"(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+135+19))$14 = 1(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+168+19))$15 = 2(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+201+19))$16 = 3(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+234+19))$17 = 4(gdb) p ntohl(*(uint32_t *) (pcur->btr_cur->page_cur->block->frame+267+19))$18 = 5(gdb) p pcur->btr_cur->page_cur->block->frame+116$19 = (unsigned char *) 0x7fffd1e48074 "supremum"(gdb)
The Complete Redundant Row Format
One redundant record can be logically separated into 5 parts - the offsets array, the 6-byte bit field, record origin, hidden fields (DB_ROW_ID, DB_TRX_ID, DB_ROLL_PTR) and the actual user-specified fields of the table. They can be viewed as follows:
offsets arrayfn ... f1, x2, x1, pn ... p16-byte bit fieldoriginprimary key
DB_ROW_ID
or user given
p1, p2 ... pnhidden fields
DB_TRX_ID
DB_ROLL_PTR
x1, x2user fields
f1, f2, f3 ... fn
All record pointers point to the record origin. The record origin is the beginning of the primary key for the clustered primary index. It is not a separate entity that occupies some space. I am showing it separately to make it easy to visualize. All information prior to the record origin can be considered as the record header or the record meta-data.
Each data field has a corresponding offsets array element. The offsets array is arranged in reverse order. The first element in the array provides the offset of the end of the last data field and the last element in the array provides the offset of the end of the first data element.
The offsets array also tells whether a data field is NULL. If a field is NULL, the highest bit in the associated offset array element will be set to 1.
The biggest advantage of the redundant row format is that a record in this format can be parsed without needing to access the data dictionary. But this format is not space efficient. Even for fixed length fields, we store information in the offsets array (this is avoided in COMPACT, DYNAMIC and COMPRESSED row formats).
One more limitation of the redundant row format is that the blob data cannot be stored completely off page (as in DYNAMIC and COMPRESSED row formats). A 768 byte prefix of the blob will always be stored inline with the row and cannot be avoided.
A small note on row formats. InnoDB really supports only two row formats - the redundant row format and the compact row format. The compressed and dynamic row formats are a variation available only on top of the compact row format. Technically these variations can be implemented for the redundant row format as well, but is currently not available.
GNU Emacs Function to Traverse Rows in Page
If we dump a single InnoDB index page into a file, we can open it in GNU Emacs hexl-mode and use the following elisp function to traverse from infimum record to the supremum record. First place the cursor on the 'i' of the infimum record and then call this function to reach the next record origin. I am not an expert in writing elisp functions. If you can improve on this and write a better one, kindly share it with me.
<strong>(</strong><strong>defun </strong>ib-next-red-row <strong>(</strong><strong>)</strong> "Go to the origin of the next record in ROW_FORMAT=REDUNDANT.The cursor must be positioned at the origin of a record. This is for the GNU Emacs hexl-mode. " <strong>(</strong>interactive<strong>)</strong> <strong>(</strong>setq rec-origin <strong>(</strong>point<strong>)</strong><strong>)</strong> <strong>(</strong>setq rec-origin-offset <strong>(</strong>hexl-current-address<strong>)</strong><strong>)</strong> ;; The next-rec pointer is stored in 2 bytes before the ;; record origin <strong>(</strong>hexl-backward-char 2<strong>)</strong> <strong>(</strong>setq next-rec <strong>(</strong>string <strong>(</strong>char-after<strong>)</strong><strong>)</strong><strong>)</strong> <strong>(</strong>forward-char 1<strong>)</strong> <strong>(</strong>setq next-rec <strong>(</strong>concat next-rec <strong>(</strong>string <strong>(</strong>char-after<strong>)</strong><strong>)</strong><strong>)</strong><strong>)</strong> <strong>(</strong>hexl-forward-char 1<strong>)</strong> <strong>(</strong>setq next-rec <strong>(</strong>concat next-rec <strong>(</strong>string <strong>(</strong>char-after<strong>)</strong><strong>)</strong><strong>)</strong><strong>)</strong> <strong>(</strong>forward-char 1<strong>)</strong> <strong>(</strong>setq next-rec <strong>(</strong>concat next-rec <strong>(</strong>string <strong>(</strong>char-after<strong>)</strong><strong>)</strong><strong>)</strong><strong>)</strong> ;; Next record is next-rec-off bytes away from page origin <strong>(</strong>setq next-rec-off <strong>(</strong>hexl-hex-string-to-integer next-rec<strong>)</strong><strong>)</strong> ;; Assuming that there is only one page in dump <strong>(</strong>hexl-goto-address 0<strong>)</strong> <strong>(</strong>hexl-forward-char next-rec-off<strong>)</strong><strong>)</strong>
Conclusion
In this article, the format of the redundant row was explained. A gdb session was used to demonstrate how to access a page and the rows within that page. The infimum record was accessed, printed and analysed. Using its next record pointer, the first user record was accessed. The reader can use the next record pointer of the first user record to access the next user record and so on till the supremum record is reached. I leave that as an exercise for the readers.
Thanks to Marko Makela for his useful tips on debugging at our yearly InnoDB Team Meeting 2013 held at Shanghai, China. This article derives from his ideas.

MySQL是一种开源的关系型数据库管理系统,主要用于快速、可靠地存储和检索数据。其工作原理包括客户端请求、查询解析、执行查询和返回结果。使用示例包括创建表、插入和查询数据,以及高级功能如JOIN操作。常见错误涉及SQL语法、数据类型和权限问题,优化建议包括使用索引、优化查询和分表分区。

MySQL是一个开源的关系型数据库管理系统,适用于数据存储、管理、查询和安全。1.它支持多种操作系统,广泛应用于Web应用等领域。2.通过客户端-服务器架构和不同存储引擎,MySQL高效处理数据。3.基本用法包括创建数据库和表,插入、查询和更新数据。4.高级用法涉及复杂查询和存储过程。5.常见错误可通过EXPLAIN语句调试。6.性能优化包括合理使用索引和优化查询语句。

选择MySQL的原因是其性能、可靠性、易用性和社区支持。1.MySQL提供高效的数据存储和检索功能,支持多种数据类型和高级查询操作。2.采用客户端-服务器架构和多种存储引擎,支持事务和查询优化。3.易于使用,支持多种操作系统和编程语言。4.拥有强大的社区支持,提供丰富的资源和解决方案。

InnoDB的锁机制包括共享锁、排他锁、意向锁、记录锁、间隙锁和下一个键锁。1.共享锁允许事务读取数据而不阻止其他事务读取。2.排他锁阻止其他事务读取和修改数据。3.意向锁优化锁效率。4.记录锁锁定索引记录。5.间隙锁锁定索引记录间隙。6.下一个键锁是记录锁和间隙锁的组合,确保数据一致性。

MySQL查询性能不佳的原因主要包括没有使用索引、查询优化器选择错误的执行计划、表设计不合理、数据量过大和锁竞争。 1.没有索引导致查询缓慢,添加索引后可显着提升性能。 2.使用EXPLAIN命令可以分析查询计划,找出优化器错误。 3.重构表结构和优化JOIN条件可改善表设计问题。 4.数据量大时,采用分区和分表策略。 5.高并发环境下,优化事务和锁策略可减少锁竞争。

在数据库优化中,应根据查询需求选择索引策略:1.当查询涉及多个列且条件顺序固定时,使用复合索引;2.当查询涉及多个列但条件顺序不固定时,使用多个单列索引。复合索引适用于优化多列查询,单列索引则适合单列查询。

要优化MySQL慢查询,需使用slowquerylog和performance_schema:1.启用slowquerylog并设置阈值,记录慢查询;2.利用performance_schema分析查询执行细节,找出性能瓶颈并优化。

MySQL和SQL是开发者必备技能。1.MySQL是开源的关系型数据库管理系统,SQL是用于管理和操作数据库的标准语言。2.MySQL通过高效的数据存储和检索功能支持多种存储引擎,SQL通过简单语句完成复杂数据操作。3.使用示例包括基本查询和高级查询,如按条件过滤和排序。4.常见错误包括语法错误和性能问题,可通过检查SQL语句和使用EXPLAIN命令优化。5.性能优化技巧包括使用索引、避免全表扫描、优化JOIN操作和提升代码可读性。


热AI工具

Undresser.AI Undress
人工智能驱动的应用程序,用于创建逼真的裸体照片

AI Clothes Remover
用于从照片中去除衣服的在线人工智能工具。

Undress AI Tool
免费脱衣服图片

Clothoff.io
AI脱衣机

AI Hentai Generator
免费生成ai无尽的。

热门文章

热工具

mPDF
mPDF是一个PHP库,可以从UTF-8编码的HTML生成PDF文件。原作者Ian Back编写mPDF以从他的网站上“即时”输出PDF文件,并处理不同的语言。与原始脚本如HTML2FPDF相比,它的速度较慢,并且在使用Unicode字体时生成的文件较大,但支持CSS样式等,并进行了大量增强。支持几乎所有语言,包括RTL(阿拉伯语和希伯来语)和CJK(中日韩)。支持嵌套的块级元素(如P、DIV),

DVWA
Damn Vulnerable Web App (DVWA) 是一个PHP/MySQL的Web应用程序,非常容易受到攻击。它的主要目标是成为安全专业人员在合法环境中测试自己的技能和工具的辅助工具,帮助Web开发人员更好地理解保护Web应用程序的过程,并帮助教师/学生在课堂环境中教授/学习Web应用程序安全。DVWA的目标是通过简单直接的界面练习一些最常见的Web漏洞,难度各不相同。请注意,该软件中

SecLists
SecLists是最终安全测试人员的伙伴。它是一个包含各种类型列表的集合,这些列表在安全评估过程中经常使用,都在一个地方。SecLists通过方便地提供安全测试人员可能需要的所有列表,帮助提高安全测试的效率和生产力。列表类型包括用户名、密码、URL、模糊测试有效载荷、敏感数据模式、Web shell等等。测试人员只需将此存储库拉到新的测试机上,他就可以访问到所需的每种类型的列表。

记事本++7.3.1
好用且免费的代码编辑器

MinGW - 适用于 Windows 的极简 GNU
这个项目正在迁移到osdn.net/projects/mingw的过程中,你可以继续在那里关注我们。MinGW:GNU编译器集合(GCC)的本地Windows移植版本,可自由分发的导入库和用于构建本地Windows应用程序的头文件;包括对MSVC运行时的扩展,以支持C99功能。MinGW的所有软件都可以在64位Windows平台上运行。