InnoDB uses a novel file flush technique called doublewrite. Before writing pages to the data files, InnoDB first writes them to a contiguous area called the doublewrite buffer. Only after the write and the flush to the doublewrite buffer have completed, does InnoDB write the pages to their proper positions in the data file. If there is an operating system, storage subsystem, or mysqld process crash in the middle of a page write, InnoDB can later find a good copy of the page from the doublewrite buffer during crash recovery.
Although data is always written twice, the doublewrite buffer does not require twice as much I/O overhead or twice as many I/O operations. Data is written to the buffer itself as a large sequential chunk, with a single fsync() call to the operating system.
To turn off the doublewrite buffer, specify the option innodb_doublewrite=0.
2.partial page writeInnoDB 的Page Size一般是16KB,其数据校验也是针对这16KB来计算的,将数据写入到磁盘是以Page为单位进行操作的。而计算机硬件和操作系统,写文件是以4KB作为单位的,那么每写一个innodb的page到磁盘上,在os级别上需要写4个块.
16K的数据,写入4K 时,发生了系统断电/os crash ,只有一部分写是成功的,这种情况下就是 partial page write 问题。有人会想到系统恢复后MySQL可以根据redolog 进行恢复,而mysql在恢复的过程中是检查page的checksum,checksum就是pgae的最后事务号,发生partial page write 问题时,page已经损坏,找不到该page中的事务号,就无法恢复。
MySQL数据库的页为16K;而OS的块则一般为4K;IO块则更小,linux内核要求IO block size<=OS block size。
磁盘IO除了IO block size,还有一个概念是扇区(IO sector),扇区是磁盘物理操作的基本单位,而IO 块是磁盘操作的逻辑单位,一个IO块对应一个或多个扇区,扇区大小一般为512个字节。
(涉及到SSD和HDD:在固态硬盘中,读写的最小单位叫“页”,最常见的尺寸是4K;为了与过去的机械硬盘兼容,固态硬盘将一个4KB页模拟成8个512字节扇区,这样操作系统就不会觉察到有任何变化。)
所以各个块大小的关系可以梳理如下:
DB block > OS block >= IO block > 磁盘 sector,而且他们之间保持了整数倍的关系。比如我的系统各个块的大小如下,DB以mysql为例,OS以linux为例
MySQL页大小
mysql> show variables like 'innodb_page_size';
+------------------+-------+
| Variable_name | Value |
+------------------+-------+
| innodb_page_size | 16384 |
+------------------+-------+
1 row in set (0.00 sec)
OS块大小
[root@localhost ~]# getconf PAGESIZE
4096
IO块大小
[root@localhost ~]# blockdev --getbsz /dev/sda
4096
sector size
[root@localhost ~]# fdisk -l
磁盘 /dev/sda:21.5 GB, 21474836480 字节,41943040 个扇区
Units = 扇区 of 1 * 512 = 512 bytes
扇区大小(逻辑/物理):512 字节 / 512 字节
3.如何解决partial page write?
double write
1.当mysql将脏数据flush到data file的时候, 先使用memcopy 将脏数据复制到内存中的double write buffer ,
2.通过double write buffer再分2次,每次写入1MB到共享表空间,
3.然后马上调用fsync函数,同步到磁盘上,避免缓冲带来的问题。
在这个过程中,doublewrite是顺序写,开销并不大,在完成doublewrite写入后,在将double write buffer写入各表空间文件,这时是离散写入。如果发生了极端情况(断电),InnoDB再次启动后,发现了一个Page数据已经损坏,那么此时就可以从doublewrite buffer中进行数据恢复了。
mysql> show variables like '%double%';
+--------------------+-------+
| Variable_name | Value |
+--------------------+-------+
| innodb_doublewrite | ON |
+--------------------+-------+
1 row in set (0.01 sec)
mysql> show status like '%innodb_dblwr%';
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Innodb_dblwr_pages_written | 2558 |
| Innodb_dblwr_writes | 431 |
+----------------------------+-------+
2 rows in set (0.00 sec)
从上面可以看出,flush了2558次到doublewrite buffer中,写文件共431次,则每次write合并了2558/431次flush。