MySQL innodb_flush_method: possible values and their effects



Overview

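innodb_flush_method controls which method InnoDB uses to flush its redo log files and data files. On Unix-like and Linux systems the accepted values are fsync/0, O_DSYNC/1, littlesync/2, nosync/3, O_DIRECT/4, and O_DIRECT_NO_FSYNC/5, with fsync as the default. On Windows the accepted values are unbuffered/0 and normal/1, with unbuffered as the default.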

First, what the official documentation says:

fsync or 0: InnoDB uses the fsync() system call to flush both the data and log files. fsync is the default setting.

O_DSYNC or 1: InnoDB uses O_SYNC to open and flush the log files, and fsync() to flush the data files. InnoDB does not use O_DSYNC directly because there have been problems with it on many varieties of Unix. In other words, despite the option's name, the flag InnoDB passes to open() is O_SYNC; it avoids the O_DSYNC flag itself.

littlesync or 2: This option is used for internal performance testing and is currently unsupported. Use at your own risk. As of 8.0.22 it is still for internal testing only.

nosync or 3: This option is used for internal performance testing and is currently unsupported. Use at your own risk. As of 8.0.22 it is still for internal testing only.

O_DIRECT or 4: InnoDB uses O_DIRECT (or directio() on Solaris) to open the data files, and uses fsync() to flush both the data and log files. This option is available on some GNU/Linux versions, FreeBSD, and Solaris.

O_DIRECT_NO_FSYNC: InnoDB uses O_DIRECT during flushing I/O, but skips the fsync() system call after each write operation.

unbuffered (Windows): InnoDB uses simulated asynchronous I/O and non-buffered I/O.

normal or 1 (Windows): InnoDB uses simulated asynchronous I/O and buffered I/O.

The rest of this article focuses on Unix and Linux.

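Every flush method listed above comes down to four system calls, open(), write(), fsync(), and fdatasync(), plus the flags passed to open(). Understanding the options therefore means understanding these system calls.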

The Linux manual page for open is at https://man7.org/linux/man-pages/man2/open.2.html.
The prototypes of open and its relatives:


int open(const char *pathname, int flags);
int open(const char *pathname, int flags, mode_t mode);
int creat(const char *pathname, mode_t mode);
int openat(int dirfd, const char *pathname, int flags);
int openat(int dirfd, const char *pathname, int flags, mode_t mode);
/* Documented separately, in openat2(2): */
int openat2(int dirfd, const char *pathname,
            const struct open_how *how, size_t size);

open() opens the file at pathname according to flags and mode, and returns a file descriptor (fd) for it. The flags fall into three groups. Access mode flags: O_RDONLY, O_WRONLY, or O_RDWR. File creation flags: O_CLOEXEC, O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TMPFILE, and O_TRUNC. File status flags: O_SYNC, O_DSYNC, O_DIRECT, and so on. The access mode flags control how the file may be accessed: read-only, write-only, or read-write. The file creation flags "affect the semantics of the open operation itself", while the file status flags "affect the semantics of subsequent I/O operations". Three of the status flags matter here:

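O_SYNC Write operations on the file will complete according to the requirements of synchronized I/O file integrity completion (by contrast with the synchronized I/O data integrity completion provided by O_DSYNC.) By the time write(2) (or similar) returns, the output data and associated file metadata have been transferred to the underlying hardware (i.e., as though each write(2) was followed by a call to fsync(2)).

With O_SYNC, a write completes only once the requirements of synchronized I/O file integrity completion are met: by the time write() returns, both the written data and the file metadata have reached the underlying storage hardware, as if every write() were followed by an fsync(). This differs from the synchronized I/O data integrity completion that O_DSYNC provides. O_SYNC is about file integrity, O_DSYNC about data integrity. A file is its data plus its metadata, so file integrity is the stronger guarantee, and it generally costs more disk I/O than data integrity.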

O_DSYNC Write operations on the file will complete according to the requirements of synchronized I/O data integrity completion. By the time write(2) (and similar) return, the output data has been transferred to the underlying hardware, along with any file metadata that would be required to retrieve that data (i.e., as though each write(2) was followed by a call to fdatasync(2)).

With O_DSYNC, a write completes once the requirements of synchronized I/O data integrity completion are met: by the time write() returns, the written data, together with any file metadata needed to read that data back, has reached the underlying storage hardware, as if the write() were followed by an fdatasync(). The wording is dense, so the manual gives an example:

To understand the difference between the two types of completion, consider two pieces of file metadata: the file last modification timestamp (st_mtime) and the file length. All write operations will update the last file modification timestamp, but only writes that add data to the end of the file will change the file length. The last modification timestamp is not needed to ensure that a read completes successfully, but the file length is. Thus, O_DSYNC would only guarantee to flush updates to the file length metadata (whereas O_SYNC would also always flush the last modification timestamp metadata).

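In other words, consider two pieces of file metadata: the last modification time (st_mtime) and the file length. Every write updates st_mtime, but only a write that appends data to the end of the file changes the file length. A later read does not depend on st_mtime, but it does depend on the file length to reach the data at the end of the file. So st_mtime is not metadata required by subsequent reads, whereas the file length is. Under the data integrity guarantee of O_DSYNC, the file length is guaranteed to reach the disk along with the write while st_mtime is not; under the file integrity guarantee of O_SYNC, both st_mtime and the file length are written to disk.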

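O_DIRECT (since Linux 2.4.10) Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT.

The O_DIRECT flag minimizes the effect of caching on reads from and writes to the file: file I/O goes directly to and from user-space buffers. The NOTES section of the open(2) manual page covers the details. O_DIRECT only makes an effort to transfer data synchronously; it does not give the O_SYNC guarantee that both the data and the necessary metadata have been transferred. To get both behaviours, O_DIRECT has to be combined with O_SYNC. Note also that O_DIRECT exposes I/O alignment to the caller. Before Linux 2.6.0, alignment was to the logical block size of the filesystem; since 2.6.0, alignment to the logical block size of the underlying storage suffices. That size can be obtained with the ioctl() system call (the BLKSSZGET operation) or from the shell with "blockdev --getss <device>".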
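The alignment requirement is easiest to see in code. The following is a rough sketch under stated assumptions, not how InnoDB does it: the function name direct_write_block is hypothetical, the 4096-byte alignment is an assumed value that is a multiple of the common 512-byte and 4096-byte logical block sizes (a real program would query the device), and the fsync() at the end reflects the point above that O_DIRECT alone gives no durability guarantee.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* glibc: makes <fcntl.h> define O_DIRECT */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DIRECT_IO_ALIGN 4096   /* assumed alignment, not queried from the device */

/* Write one zero-padded block holding `payload` to `path`, bypassing the
 * page cache. Returns 0 on success, -1 on failure with errno set. */
int direct_write_block(const char *path, const char *payload)
{
    assert(path != NULL && payload != NULL);
#ifndef O_DIRECT
    errno = ENOTSUP;           /* platform headers do not offer O_DIRECT */
    return -1;
#else
    void *buf = NULL;
    /* Buffer address and transfer size must both be aligned. */
    int err = posix_memalign(&buf, DIRECT_IO_ALIGN, DIRECT_IO_ALIGN);
    if (err != 0) {
        errno = err;
        return -1;
    }
    memset(buf, 0, DIRECT_IO_ALIGN);
    strncpy(buf, payload, DIRECT_IO_ALIGN - 1);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) {
        err = errno;           /* e.g. EINVAL if the filesystem lacks O_DIRECT */
    } else {
        ssize_t n = write(fd, buf, DIRECT_IO_ALIGN);
        if (n < 0)
            err = errno;
        else if (n != DIRECT_IO_ALIGN)
            err = EIO;         /* short write */
        else if (fsync(fd) != 0)
            err = errno;       /* O_DIRECT alone does not make data durable */
        close(fd);
    }
    free(buf);
    if (err != 0) {
        errno = err;
        return -1;
    }
    return 0;
#endif
}
```

Some filesystems reject O_DIRECT at open() time, so a caller should be prepared for the -1 return and fall back to buffered I/O if that suits the application.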

Next, write():

ssize_t write(int fd, const void *buf, size_t count);

The write() call that actually performs the write needs little explanation. The one point worth noting is this:

If a write() is interrupted by a signal handler before any bytes are written, then the call fails with the error EINTR; if it is interrupted after at least one byte has been written, the call succeeds, and returns the number of bytes written. Note that a successful write() may transfer fewer than count bytes. Such partial writes can occur for various reasons; for example, because there was insufficient space on the disk device to write all of the requested bytes, or because a blocked write() to a socket, pipe, or similar was interrupted by a signal handler after it had transferred some, but before it had transferred all of the requested bytes. In the event of a partial write, the caller can make another write() call to transfer the remaining bytes. The subsequent call will either transfer further bytes or may result in an error (e.g., if the disk is now full).

So a single write() may transfer only part of the data. If the host loses power at that point, a page can be left only partly written on disk (a torn page) and the file's integrity is broken. This is the situation MySQL's doublewrite buffer exists for.
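The retry logic the manual describes is commonly wrapped in a small loop. A minimal sketch, with write_all as a hypothetical helper name rather than anything taken from MySQL:

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Write exactly `count` bytes from `buf` to `fd`, resuming after partial
 * writes and after EINTR. Returns 0 on success, -1 on error with errno
 * left as set by write(). */
int write_all(int fd, const void *buf, size_t count)
{
    assert(buf != NULL || count == 0);
    const char *p = buf;
    while (count > 0) {
        ssize_t n = write(fd, p, count);
        if (n < 0) {
            if (errno == EINTR)
                continue;      /* interrupted before any byte was written */
            return -1;         /* real error, e.g. ENOSPC or EBADF */
        }
        p += n;                /* partial write: skip what was transferred */
        count -= (size_t)n;    /* and retry with the remainder */
    }
    return 0;
}
```

A loop like this only guarantees that all bytes were handed to the kernel. It does nothing about a crash between two iterations, which is why a separate mechanism such as the doublewrite buffer is still needed.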

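Finally, fsync:
The manual describes the purpose of fsync and fdatasync as "synchronize a file's in-core state with storage device", that is, bring the storage device in line with the file's in-memory (kernel) state.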

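Put simply, fsync() pushes the file's data from the operating system's buffers, and from the disk cache if there is one, onto the disk hardware. The call blocks until the flush has completed. fdatasync() differs from fsync() in that it does not flush metadata that is not needed to read the data back, so it puts less I/O pressure on the disk than fsync().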
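As a small illustration of the two calls, here is a buffered write followed by an explicit flush. The helper name write_then_flush and its data_only parameter are hypothetical, chosen only for this sketch.

```c
#include <assert.h>
#include <fcntl.h>   /* open(), used by callers that create the descriptor */
#include <unistd.h>

/* Buffered write followed by an explicit flush of the descriptor.
 * data_only != 0 -> fdatasync(): skips metadata not needed to read the data
 * data_only == 0 -> fsync():     flushes the data and all file metadata
 * Returns 0 on success, -1 on failure (including a partial write). */
int write_then_flush(int fd, const void *buf, size_t len, int data_only)
{
    assert(buf != NULL);
    if (write(fd, buf, len) != (ssize_t)len)
        return -1;
    /* Either call blocks until the device reports completion. */
    return data_only ? fdatasync(fd) : fsync(fd);
}
```

Unlike opening the file with O_SYNC or O_DSYNC, this pattern lets the caller batch several write() calls and pay for a single flush at the end.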

With the system calls covered, back to the logic behind each MySQL flush method:

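fsync: files are opened in the ordinary buffered way, so data travels "disk file ⇄ kernel space ⇄ user space ⇄ InnoDB buffer pool". The same file block ends up cached both in the operating system's file cache and in the InnoDB buffer pool, which wastes memory. On write, InnoDB calls fsync() to make sure the cached data reaches the disk file.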

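O_DSYNC: the log files are opened and flushed with the O_SYNC flag, and the data files are flushed with fsync(). Recall that O_SYNC gives file integrity, not just data integrity: each write() to the log behaves as if it were followed by an fsync(). By itself, then, this setting does not lower the disk I/O load for the log compared with calling fsync(); what changes is that the log is synchronized as part of every write() instead of through a separate fsync() call.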

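O_DIRECT: the data files are opened with the O_DIRECT flag (or directio() on Solaris), and fsync() is used to flush both the data and log files. Compared with the fsync method, data file blocks are no longer cached a second time in the operating system's cache.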

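O_DIRECT_NO_FSYNC: files are opened with O_DIRECT, and the fsync() call after each write is skipped. This avoids caching file blocks a second time in the operating system's cache, but with no fsync() to force the data to disk the setting is risky and should be used with care in production. Note also that fsync() synchronizes file metadata as well as data, so skipping it after writes means the metadata is not synchronized either. Before MySQL 8.0.14 this was a real problem on some file systems. From 8.0.14 on, MySQL still calls fsync() for operations that change file metadata, namely creating a file, changing a file's size, and closing a file.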
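The four supported Unix settings can be summarized as a small lookup table. This is only a simplified model of the descriptions above, not InnoDB source code: struct flush_policy, its fields, and lookup_policy are names invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Simplified model of the four supported Unix settings described above.
 * NOT InnoDB code: every name here is made up for illustration. */
struct flush_policy {
    const char *name;
    bool log_open_o_sync;      /* log files opened with O_SYNC */
    bool data_open_o_direct;   /* data files opened with O_DIRECT */
    bool fsync_after_write;    /* fsync() issued after data file writes */
};

static const struct flush_policy POLICIES[] = {
    {"fsync",             false, false, true},
    {"O_DSYNC",           true,  false, true},
    {"O_DIRECT",          false, true,  true},
    {"O_DIRECT_NO_FSYNC", false, true,  false},
};

/* Return the policy for a setting name, or NULL if it is not modeled. */
const struct flush_policy *lookup_policy(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof POLICIES / sizeof POLICIES[0]; i++)
        if (strcmp(POLICIES[i].name, name) == 0)
            return &POLICIES[i];
    return NULL;
}

/* True when data file blocks are cached both by the OS and by InnoDB. */
bool double_caches_data(const struct flush_policy *p)
{
    assert(p != NULL);
    return !p->data_open_o_direct;
}
```

The table deliberately leaves out littlesync and nosync, which the documentation marks as unsupported, and it does not model the metadata-related fsync() calls that MySQL 8.0.14 and later still make under O_DIRECT_NO_FSYNC.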

The number of fsync() calls InnoDB has made can be monitored through the Innodb_data_fsyncs status variable.

References:
https://man7.org/linux/man-pages/man2/open.2.html
https://man7.org/linux/man-pages/man2/fsync.2.html
https://man7.org/linux/man-pages/man2/write.2.html
https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_flush_method

Update, 2020-11-20:
While reading the source code recently (version 8.0.20), I found the following:

// buf0dblwr.cc
#ifndef _WIN32
  /** @return true if we need to fsync to disk */
  static bool is_fsync_required() noexcept MY_ATTRIBUTE((warn_unused_result)) {
    /* srv_unix_file_flush_method is a dynamic variable. */
    return srv_unix_file_flush_method != SRV_UNIX_O_DIRECT &&
           srv_unix_file_flush_method != SRV_UNIX_O_DIRECT_NO_FSYNC;
  }
#endif /* _WIN32 */
#ifndef _WIN32
  if (is_fsync_required()) {
    segment->flush();
  }
#endif /* !_WIN32 */
#ifndef _WIN32
  if (is_fsync_required()) {
    batch_segment->flush();
  }
#endif /* !_WIN32 */

As the code shows, on non-Windows systems, when the flush method is set to O_DIRECT or O_DIRECT_NO_FSYNC, writes to the doublewrite buffer are not followed by a flush.