
A brief analysis of the Linux file system (File System) architecture

WBOY | 2024-03-11 15:31

This article mainly discusses the virtual file system. The architecture of the Linux file system includes an abstraction layer between specific file systems (such as Ext2, Ext3 and XFS) and applications, namely the Virtual File System (VFS). VFS allows applications to communicate with different types of file systems without knowing the details of the underlying implementation. By isolating and decoupling file system implementations from applications, VFS improves the flexibility and maintainability of the system. It also allows the Linux kernel to support multiple file system types and provides a unified interface for applications to access them. Under the VFS framework, a specific file system communicates with the kernel by implementing the standard set of file system operation interfaces.

[Figure: overall architecture of the Linux file system, with VFS between applications and specific file systems]

The above figure shows that the center of this architecture is the virtual file system VFS. VFS provides a file system framework, and local file systems can be implemented based on VFS. It mainly completes the following tasks:

1) As an abstraction layer, VFS provides a unified interface (read, write, chmod, etc.) for the application layer.

2) Some common functions are implemented in VFS, such as inode cache and page cache.

3) Standardizes the interface that a specific file system should implement.

Based on the above design, a specific file system only needs to follow the conventions of VFS, implement the corresponding interfaces and internal logic, and register itself with the system. After the user formats a disk and mounts the file system, the disk space can be used through that file system.
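To make the registration step concrete, below is a minimal, hedged sketch of how a file system plugs into VFS, written as a kernel module. The name myfs, the deliberately empty fill_super and the use of the classic .mount/mount_bdev helpers are illustrative choices (recent kernels also offer the fs_context API); this is not code from any real file system.

```c
/* Hedged sketch: a skeleton of how a concrete file system registers with VFS.
 * "myfs" and the empty fill_super are placeholders, not a real file system. */
#include <linux/module.h>
#include <linux/fs.h>

static int myfs_fill_super(struct super_block *sb, void *data, int silent)
{
        /* A real file system would read its on-disk superblock here,
         * set sb->s_op to its super_operations and create the root inode. */
        return -EINVAL; /* placeholder: refuse to mount */
}

static struct dentry *myfs_mount(struct file_system_type *fs_type,
                                 int flags, const char *dev_name, void *data)
{
        /* mount_bdev() is the common helper for block-device based file systems */
        return mount_bdev(fs_type, flags, dev_name, data, myfs_fill_super);
}

static struct file_system_type myfs_type = {
        .owner   = THIS_MODULE,
        .name    = "myfs",
        .mount   = myfs_mount,
        .kill_sb = kill_block_super,
};

static int __init myfs_init(void)
{
        return register_filesystem(&myfs_type);   /* make "myfs" known to VFS */
}

static void __exit myfs_exit(void)
{
        unregister_filesystem(&myfs_type);
}

module_init(myfs_init);
module_exit(myfs_exit);
MODULE_LICENSE("GPL");
```

Once registered, the file system type can be named in a mount command or mount(2) call, and VFS will route operations on files under its mount point to the callbacks it provided.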

In Linux, after formatting a disk you need to use the mount command to attach it to a directory in the system directory tree; this directory is called the mount point. After mounting is completed, the disk space formatted with that file system can be used. The mount point can be almost any directory, but by convention it is usually a subdirectory under /mnt.
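As a small illustration of mounting from a program (the shell equivalent would be a mount command), the sketch below calls the mount(2) system call directly. The device /dev/sdb1 and the mount point /mnt/ext4 are example paths, not taken from the article, and the program must be run with sufficient privileges.

```c
/* Illustrative user-space sketch: mounting an Ext4 file system on /mnt/ext4.
 * Device and mount point are example paths. */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
        /* Roughly equivalent to: mount -t ext4 /dev/sdb1 /mnt/ext4 */
        if (mount("/dev/sdb1", "/mnt/ext4", "ext4", 0, NULL) != 0) {
                perror("mount");
                return 1;
        }
        printf("mounted /dev/sdb1 on /mnt/ext4\n");
        return 0;
}
```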

The following shows a relatively complex directory structure. In this directory tree, the root directory is located on disk sda, and there are three subdirectories under /mnt, namely ext4, xfs and nfs, which mount an Ext4 file system (built on disk sdb), an XFS file system (built on disk sdc) and an NFS file system (mounted over the network), respectively.

[Figure: directory tree with Ext4, XFS and NFS file systems mounted under /mnt]

The relationship between the file systems in the directory tree is represented by data structures in the kernel. When a file system is mounted, its relationship to the directory tree is established and the APIs of that specific file system are registered. When a user-mode program calls the API to open a file, the corresponding file system's functions are found and associated with the file-related structures (such as file and inode).

The above description is rather schematic and may still feel a little confusing, but don't worry: next we will introduce VFS, and how it supports multiple file systems, in more detail based on the code.

1. From file system API to VFS, then to specific file system

Linux's VFS was not there from the beginning; the earliest released versions of Linux had no VFS. Moreover, VFS was not invented in Linux: it was first developed by Sun in 1985 for SunOS 2.0, mainly to accommodate both its local file system and the NFS file system.

VFS abstracts specific file systems through a set of public APIs and data structures. When the user calls a file system API provided by the operating system, the corresponding VFS function in the kernel is invoked via a software interrupt (system call). The following table shows the correspondence between some file APIs and kernel VFS functions.

User-mode API      Kernel function           Description
open               do_sys_open               Open a file
close              ksys_close                Close a file
read               ksys_read / vfs_read      Read data
write              ksys_write / vfs_write    Write data
mount              do_mount                  Mount a file system

It can be seen from the above table that each user-mode API has a corresponding kernel-mode function. When an application calls a file system API, the corresponding kernel function is triggered. Only a small subset of the file system API is listed here, the purpose being to illustrate the relationship between the API and VFS. If you want to know about other APIs, please read the kernel source code; this article will not go into further detail.
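The short user-space program below walks through the APIs listed in the table; each call traps into the kernel and ends up in the corresponding VFS-side function (open in do_sys_open, read in ksys_read/vfs_read, and so on). The file path /tmp/vfs-demo.txt is an arbitrary example.

```c
/* Simple user-space sequence corresponding to the table above. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/tmp/vfs-demo.txt", O_CREAT | O_RDWR, 0644);  /* -> do_sys_open */
        if (fd < 0) {
                perror("open");
                return 1;
        }

        const char *msg = "hello vfs\n";
        write(fd, msg, strlen(msg));              /* -> ksys_write / vfs_write */

        char buf[32];
        lseek(fd, 0, SEEK_SET);
        ssize_t n = read(fd, buf, sizeof(buf));   /* -> ksys_read / vfs_read */
        if (n > 0)
                fwrite(buf, 1, (size_t)n, stdout);

        close(fd);                                /* -> the kernel close path */
        return 0;
}
```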

To give a concrete feel for the relationship between VFS and a specific file system, this section takes Ext2's write API as an example and shows the calling path from the API to the VFS function and then to the Ext2 file system function. As shown in the figure below, the API function write traps into the kernel and triggers the ksys_write function. After some processing, this function finally calls the ext2_file_write_iter function of the Ext2 file system through the function pointer file->f_op->write_iter.

[Figure: call path from write() through ksys_write and vfs_write to ext2_file_write_iter]

In the above figure, the entry to the kernel flow is the ksys_write function. As can be seen from its implementation, its main job is to obtain the fd structure for the file descriptor and then call vfs_write with the file member of that fd as a parameter. The format of the fd structure is shown in the figure below; the file member is the core data structure, and it is through this member that control is transferred to the functions of the Ext2 file system.

[Figure: layout of the fd structure, whose file member points to the struct file]

It looks very simple: VFS only needs to call the function pointer registered by the specific file system. But there is an unresolved question here: when were the function pointers in VFS registered?

The function pointers for Ext2 are initialized when the file is opened (for details, please refer to Section 3.1.2.2 of "Insider of File System Technology"). As we know, when a user-mode program opens a file it gets back a file descriptor, and inside the kernel this corresponds to a struct file that represents the file. The more important members of this structure include f_inode, f_op and f_mapping, as shown in the figure below.

[Figure: struct file and its members f_inode, f_op and f_mapping]

In the above figure, f_inode is the inode corresponding to the file. f_op is the collection of function pointers for file operations in the specific file system (such as Ext2); it is initialized when the file is opened. It is through this collection of function pointers that VFS accesses a specific file system.
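The following toy program (user space, not kernel code) mimics the mechanism just described: a table of function pointers is filled in by the "file system" when the file is opened, and the "VFS" side calls through it without knowing which implementation sits behind it. The names mirror the kernel's, but the code is only an illustration.

```c
/* Toy model of the VFS dispatch pattern: names mirror the kernel's,
 * but this is a user-space illustration, not kernel code. */
#include <stdio.h>
#include <stddef.h>
#include <sys/types.h>

struct file;   /* forward declaration */

struct file_operations {
        ssize_t (*write_iter)(struct file *f, const char *buf, size_t len);
};

struct file {
        const struct file_operations *f_op;   /* set when the file is opened */
};

/* "Ext2" side: the concrete implementation registered in its file_operations */
static ssize_t ext2_file_write_iter(struct file *f, const char *buf, size_t len)
{
        printf("ext2_file_write_iter: writing %zu bytes\n", len);
        return (ssize_t)len;
}

static const struct file_operations ext2_file_operations = {
        .write_iter = ext2_file_write_iter,
};

/* "VFS" side: knows nothing about Ext2, only the function-pointer table */
static ssize_t vfs_write(struct file *f, const char *buf, size_t len)
{
        return f->f_op->write_iter(f, buf, len);
}

int main(void)
{
        struct file f = { .f_op = &ext2_file_operations };  /* done at open() time in the kernel */
        vfs_write(&f, "hello", 5);
        return 0;
}
```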

The above involves another VFS concept: the inode. In Linux, inode is short for index node, and it represents a specific object (such as a file or a directory) in the file system. VFS has a data structure named inode, which is an abstraction over the inodes of specific file systems. For example, in the Ext2 file system the corresponding structure is ext2_inode_info, while in XFS it is the data structure xfs_inode. The inode data structure of a specific file system is intrinsically tied to the VFS inode; you can explore the details in the code yourself.
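The usual way a specific file system ties its own inode to the VFS inode is to embed the VFS inode inside its private structure and recover the outer structure with container_of(). The toy user-space program below reproduces that pattern; the field names are modeled on Ext2 but heavily simplified.

```c
/* Toy user-space model of "VFS inode embedded in a file-system inode". */
#include <stdio.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

struct inode {                     /* stand-in for the VFS inode */
        unsigned long i_ino;
};

struct ext2_inode_info {           /* stand-in for the Ext2-specific inode */
        unsigned int  i_block[15]; /* Ext2-style block pointers (on-disk detail) */
        struct inode  vfs_inode;   /* the embedded VFS inode */
};

static struct ext2_inode_info *EXT2_I(struct inode *inode)
{
        return container_of(inode, struct ext2_inode_info, vfs_inode);
}

int main(void)
{
        struct ext2_inode_info ei = { .vfs_inode = { .i_ino = 12 } };
        struct inode *vfs = &ei.vfs_inode;           /* what VFS sees */
        struct ext2_inode_info *back = EXT2_I(vfs);  /* what Ext2 recovers */
        printf("inode %lu, same object: %d\n", vfs->i_ino, back == &ei);
        return 0;
}
```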

2. Inode cache and dentry cache

In the architecture diagram we saw that VFS contains several caches, including the page cache, the inode cache and the dentry cache. The inode cache and the dentry cache are implemented in the same way and are relatively simple, so this article introduces these two caches first.

In fact, both caches are implemented with hash tables; the concept of a hash table should be familiar, so this article will not go into its details. Take the inode cache as an example: the following figure shows its initialization process. Through the parameter ihash_entries you can see that its size is dynamic (it depends on the amount of system memory: the more memory the system has, the larger the inode cache).

[Figure: initialization of the inode cache hash table, sized via ihash_entries]

Since inode and dentry are frequently accessed when accessing files, caching them can avoid the performance loss caused by reading data from the hard disk.
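To give a feel for what such a cache amounts to, here is a hedged user-space sketch of a hash-table cache keyed by inode number: a hit returns the cached object, a miss would read the inode from disk and insert it. The bucket count and hash function are arbitrary; the kernel sizes its table from available memory (the ihash_entries parameter above).

```c
/* Minimal sketch of the inode-cache idea: a hash table keyed by inode number. */
#include <stdio.h>
#include <stdlib.h>

#define HASH_BUCKETS 1024   /* illustrative fixed size */

struct cached_inode {
        unsigned long        ino;
        struct cached_inode *next;   /* collision chain */
        /* cached metadata would live here */
};

static struct cached_inode *hash_table[HASH_BUCKETS];

static unsigned long hash_ino(unsigned long ino)
{
        return ino % HASH_BUCKETS;
}

/* Look up an inode; on a miss, "read it from disk" and insert it. */
static struct cached_inode *iget(unsigned long ino)
{
        unsigned long h = hash_ino(ino);
        struct cached_inode *ci;

        for (ci = hash_table[h]; ci; ci = ci->next)
                if (ci->ino == ino)
                        return ci;                 /* cache hit */

        ci = calloc(1, sizeof(*ci));               /* cache miss: would read the disk here */
        ci->ino = ino;
        ci->next = hash_table[h];
        hash_table[h] = ci;
        return ci;
}

int main(void)
{
        struct cached_inode *a = iget(42);
        struct cached_inode *b = iget(42);
        printf("same cached object: %d\n", a == b); /* second lookup hits the cache */
        return 0;
}
```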

3. Page Cache

The VFS page cache is mainly used to improve file system performance. Caching refers to storing part of the file system's data and metadata in memory to speed up access. Since the access latency of memory is roughly one hundred-thousandth of that of a mechanical hard disk (as shown in the figure below, with a register access taken as the base unit of 1s), caching can greatly improve the performance of the file system.

[Figure: relative access latencies of register, cache, memory and disk, scaled so that a register access takes 1s]

The cache improves file system performance through three kinds of IO optimization: hot data, read-ahead and IO merging. Many applications have hot data; for example, when an author edits a document, the current data block and the nearby blocks are hot data, and when an article becomes popular, its content is hot data. Underlying storage devices usually perform better on large reads and writes. Read-ahead means reading large blocks of data from the underlying device in advance and caching them, so that subsequent application requests can be served from the cache. IO merging targets write requests: writes are not persisted to the backing device immediately, but are cached and combined into larger chunks of IO before being written out.

Since memory capacity is much smaller than disk capacity, the page cache naturally cannot hold all of the disk's data, so only a subset of the file system data can be kept in the cache. When users keep writing data, the cache will eventually become full; at that point the question arises of how to flush cached data to disk in order to make room for new data.

The process of evicting cached data (flushing it to disk when necessary) to make room for new data is called cache replacement. There are many cache replacement algorithms, each designed to solve different problems. Next we introduce several common ones.

LRU algorithm: LRU stands for Least Recently Used. This algorithm is based on the principle of temporal locality: if a piece of data has been used recently, there is a high probability that it will be used again in the future. Therefore, the algorithm releases the cache blocks that have not been used for the longest time.

The LRU algorithm is usually implemented with a linked list. A cache block that has just been used is moved to the head of the list, while data that is not used is gradually pushed towards the tail. To understand the principle of LRU more clearly, we illustrate it with the following figure.

[Figure: LRU example, each access moves the accessed block to the head of the list]

In this example we first consider the fully-hit case. Assume that there are 6 data blocks in the cache, as shown in the first row of the figure; the number in each box is the block number. Assume that the first access (a read or a write) is to block 3; since it has been accessed, it is moved to the head of the list.

The second access is to block 4. By the same principle, this block is also moved to the head of the list, as shown in row 2 of the figure.

By analogy, after 4 rounds of access the accessed blocks have moved towards the front, while the blocks that were not accessed (such as 1 and 2) have gradually been pushed to the back of the list. This suggests, to some extent, that these two blocks are less likely to be accessed again soon.

In the fully-hit case there is no cache replacement. In practice, however, the cache is often insufficient, and some of its data must be released (flushed to disk first, if needed) to store new data. This is where the LRU algorithm comes in: it reuses the block at the tail of the list for the new data and then puts it at the head of the list, as shown in the figure below. If the tail block contains dirty data it needs to be flushed to disk first; otherwise it can be released directly.

[Figure: LRU replacement, the tail block is reused for new data and moved to the head of the list]
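Below is a hedged toy implementation of the LRU behaviour just described: a hit moves the block to the head of the list, and when the cache is full the tail block is evicted (flushing a dirty block is only indicated by a comment). A real cache would pair the list with a hash table for O(1) lookup; the linear scan is kept only for clarity.

```c
/* Toy LRU cache: head = most recently used, tail = least recently used. */
#include <stdio.h>
#include <stdlib.h>

struct block {
        int           id;
        struct block *prev, *next;
};

struct lru {
        struct block *head, *tail;
        int           count, capacity;
};

static void unlink_block(struct lru *c, struct block *b)
{
        if (b->prev) b->prev->next = b->next; else c->head = b->next;
        if (b->next) b->next->prev = b->prev; else c->tail = b->prev;
}

static void push_head(struct lru *c, struct block *b)
{
        b->prev = NULL;
        b->next = c->head;
        if (c->head) c->head->prev = b; else c->tail = b;
        c->head = b;
}

/* Access block `id`: move it to the head on a hit, insert it on a miss,
 * evicting the tail (least recently used) when the cache is full. */
static void access_block(struct lru *c, int id)
{
        for (struct block *b = c->head; b; b = b->next) {
                if (b->id == id) {               /* hit */
                        unlink_block(c, b);
                        push_head(c, b);
                        return;
                }
        }
        if (c->count == c->capacity) {           /* miss and cache full: evict tail */
                struct block *victim = c->tail;
                printf("evict block %d\n", victim->id);  /* flush here if dirty */
                unlink_block(c, victim);
                free(victim);
                c->count--;
        }
        struct block *b = malloc(sizeof(*b));
        b->id = id;
        push_head(c, b);
        c->count++;
}

int main(void)
{
        struct lru c = { .capacity = 3 };
        int trace[] = { 1, 2, 3, 2, 4 };         /* accessing 4 evicts 1, the LRU block */
        for (int i = 0; i < 5; i++)
                access_block(&c, trace[i]);
        return 0;
}
```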

The principle and implementation of the LRU algorithm are relatively simple, yet it is very widely used. However, LRU has a shortcoming: when a large amount of sequential data is suddenly written, all cache blocks are replaced, and all previous cache usage history becomes useless. This phenomenon is called cache pollution. To address cache pollution, there are many improved LRU algorithms, among which the more common ones are LRU-K, 2Q and LIRS.

LFU algorithm: LFU stands for Least Frequently Used. This algorithm decides which cache block to release based on the access frequency of the data; the cache block with the lowest access frequency is released first.

Shown below is a schematic diagram of the LFU algorithm. Row 1 is the original state, and the number in each box is the number of times that cache block has been accessed. New data is added, and cache blocks are evicted, from the tail. Assume that a certain block (the dotted box) is accessed 4 more times, so its access count rises from 12 to 16; it therefore has to be moved to a new position, which is what row 2 of the figure shows.

[Figure: LFU example, a block whose access count rises from 12 to 16 is moved to its new position in the list]

A linked list is used here only to make the principle of LFU easy to understand; in a real implementation a linked list is never used. Every time the access count of a block changes, a new position has to be found for it, and searching a linked list is very time-consuming, so a search tree is generally used to achieve fast lookup.
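For comparison, here is an equally simplified LFU sketch: every cached block carries an access count, and on a miss the block with the smallest count is evicted. A linear scan is used purely for readability; as noted above, a real implementation would keep the blocks in a search tree or heap.

```c
/* Toy LFU cache: evict the block with the lowest access count. */
#include <stdio.h>

#define CAP 4

struct block { int id; int freq; int used; };
static struct block cache[CAP];

static void access_block(int id)
{
        int victim = 0;
        for (int i = 0; i < CAP; i++) {
                if (cache[i].used && cache[i].id == id) {   /* hit: bump frequency */
                        cache[i].freq++;
                        return;
                }
        }
        for (int i = 0; i < CAP; i++) {                     /* miss: pick an empty or LFU slot */
                if (!cache[i].used) { victim = i; break; }
                if (cache[i].freq < cache[victim].freq)
                        victim = i;
        }
        if (cache[victim].used)
                printf("evict block %d (freq %d)\n", cache[victim].id, cache[victim].freq);
        cache[victim] = (struct block){ .id = id, .freq = 1, .used = 1 };
}

int main(void)
{
        int trace[] = { 1, 1, 2, 3, 4, 5 };  /* accessing 5 evicts one of the least used blocks */
        for (int i = 0; i < 6; i++)
                access_block(trace[i]);
        return 0;
}
```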

LFU also has its shortcomings: if a block was accessed frequently during some period long ago but is never accessed again, it will stay in the cache. Since that data will never be accessed again, the effective capacity of the cache is reduced. In other words, the LFU algorithm does not take recency into account.

This article mainly introduces two very basic replacement algorithms, LRU and LFU. Besides these, there are many other replacement algorithms, most of which build on the ideas of LRU and LFU, such as 2Q, MQ, LRFU, TinyLFU and ARC. Due to space limitations, this article will not go into their details; you can read the relevant papers yourself.

Data read-ahead also follows certain algorithms. A read-ahead algorithm reads data from disk into the cache in advance by recognizing IO patterns, so that when the application later reads the data it can be served directly from the cache, greatly improving read performance.

The most important part of a read-ahead algorithm is the trigger condition, that is, under what circumstances a read-ahead operation is initiated. There are usually two situations that trigger read-ahead: one is when several consecutive read requests to sequential addresses arrive; the other is when the application accesses a cache page carrying a read-ahead mark. The read-ahead mark is a flag set on a cache page when a read-ahead operation completes; when the application reads a page with this mark, the next read-ahead is triggered, so the IO pattern does not need to be recognized again.

[Figure: read-ahead process over five points in time, showing synchronous read-ahead, the read-ahead mark and the subsequent read-aheads it triggers]

To explain the logic of read-ahead more clearly, we walk through the whole process using the figure above. When the file system recognizes that the IO pattern calls for read-ahead, it reads an extra portion of data (called synchronous read-ahead), as shown at time 1 (the first row). At the same time, the file system marks one of the blocks of the read-ahead data; the purpose of this mark is to trigger the next read-ahead as early as possible, before the end of the cached data is reached.

At time 2, as the application continues to read data, the marked cache block is read and the next read-ahead is triggered at the same time; the next batch of data is read from disk in advance, and you can see from the figure that the cached region grows.

At the next time points, 3 and 4, the application can read data directly from the cache. Since no marked cache block is read, the next read-ahead is not triggered. At time point 5, because a block with the read-ahead mark is read, the read-ahead process is triggered again.

As can be seen from the above analysis, thanks to read-ahead the data is brought into the cache in advance, and the application can read it directly from the cache without accessing the disk, so overall access performance is greatly improved.
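The toy program below simulates the marking scheme described above: a cache miss triggers a synchronous read plus a read-ahead window, one block of the window is marked, and touching the marked block triggers the next read-ahead. The window size, the position of the mark and the block-granularity model are all illustrative simplifications, not the kernel's actual policy.

```c
/* Toy simulation of read-ahead with a trigger mark. */
#include <stdio.h>
#include <stdbool.h>

#define NBLOCKS   64
#define RA_WINDOW 8

static bool cached[NBLOCKS];
static bool ra_mark[NBLOCKS];

/* Read `count` blocks starting at `start` into the cache and mark one block. */
static void do_readahead(int start, int count)
{
        int end = start + count < NBLOCKS ? start + count : NBLOCKS;
        for (int b = start; b < end; b++)
                cached[b] = true;
        if (start < end)
                ra_mark[start] = true;   /* touching this block triggers the next round */
        printf("readahead blocks %d..%d\n", start, end - 1);
}

static void read_block(int b)
{
        if (!cached[b]) {                /* cache miss: synchronous read plus read-ahead */
                cached[b] = true;
                do_readahead(b + 1, RA_WINDOW);
        } else if (ra_mark[b]) {         /* marked block: trigger the next read-ahead */
                ra_mark[b] = false;
                do_readahead(b + RA_WINDOW, RA_WINDOW);
        }                                /* otherwise: plain cache hit, no disk access */
}

int main(void)
{
        for (int b = 0; b < 24; b++)     /* simulate a sequential read */
                read_block(b);
        return 0;
}
```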
