Kernel documentation translation: Overview of the Linux Virtual File System

Published: 2023-12-02 20:53:01, author: 摩斯电码

Original: https://www.kernel.org/doc/html/latest/filesystems/vfs.html#overview-of-the-linux-virtual-file-system

Introduction

The Virtual File System (also known as the Virtual Filesystem Switch) is the software layer in the kernel that provides the filesystem interface to userspace programs. It also provides an abstraction within the kernel which allows different filesystem implementations to coexist.
虚拟文件系统(也称为虚拟文件系统切换)是内核中提供文件系统接口给用户空间程序的软件层。它还在内核中提供了一个抽象,允许不同的文件系统实现共存。

VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so on are called from a process context. Filesystem locking is described in the document Locking.
VFS系统调用open(2)、stat(2)、read(2)、write(2)、chmod(2)等是从进程上下文中调用的。文件系统锁定在文档《锁定》中有描述。

Directory Entry Cache (dcache)

目录项缓存(dcache)

The VFS implements the open(2), stat(2), chmod(2), and similar system calls. The pathname argument that is passed to them is used by the VFS to search through the directory entry cache (also known as the dentry cache or dcache). This provides a very fast look-up mechanism to translate a pathname (filename) into a specific dentry. Dentries live in RAM and are never saved to disc: they exist only for performance.
VFS实现了open(2)、stat(2)、chmod(2)等类似的系统调用。传递给它们的路径名参数被VFS用于在目录项缓存(也称为dentry缓存或dcache)中进行搜索。这提供了一个非常快速的查找机制,将路径名(文件名)转换为特定的dentry。Dentries存在于RAM中,从不保存到磁盘:它们仅用于性能。

The dentry cache is meant to be a view into your entire filespace. As most computers cannot fit all dentries in the RAM at the same time, some bits of the cache are missing. In order to resolve your pathname into a dentry, the VFS may have to resort to creating dentries along the way, and then loading the inode. This is done by looking up the inode.
目录项缓存旨在成为整个文件空间的视图。由于大多数计算机无法同时将所有dentries放入RAM中,因此缓存中的一些位是缺失的。为了将您的路径名解析为dentry,VFS可能需要创建沿途的dentries,然后加载inode。这是通过查找inode来完成的。
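
The walk described above can be sketched as a toy userspace model. Everything here is hypothetical (the `toy_` names, the fixed-size array, the hard-coded /usr/bin tree, inode number 1 for the root); the real dcache is a hashed, reference-counted structure. The sketch only shows the idea: each path component is looked up in a cache first, and the filesystem is asked only on a miss, after which the answer (including "no such entry") is cached.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_SLOTS 16
#define TOY_NAME_MAX 32

/* One cached answer: (parent inode, name) -> inode.
 * ino == 0 records that the name does not exist. */
struct toy_dentry {
    int used;
    unsigned long parent;
    unsigned long ino;
    char name[TOY_NAME_MAX];
};

static struct toy_dentry toy_dcache[TOY_SLOTS];
unsigned long toy_fs_lookups;   /* how often the "filesystem" was asked */

/* Stand-in for a filesystem lookup: knows only /usr and /usr/bin. */
static unsigned long toy_fs_lookup(unsigned long parent, const char *name)
{
    toy_fs_lookups++;
    if (parent == 1 && strcmp(name, "usr") == 0)
        return 2;
    if (parent == 2 && strcmp(name, "bin") == 0)
        return 3;
    return 0;
}

/* Resolve one component: cache first, filesystem on a miss. */
static unsigned long toy_lookup_one(unsigned long parent, const char *name)
{
    struct toy_dentry *slot = NULL;
    unsigned long ino;
    int i;

    for (i = 0; i < TOY_SLOTS; i++) {
        struct toy_dentry *d = &toy_dcache[i];

        if (!d->used) {
            if (!slot)
                slot = d;       /* remember the first free slot */
        } else if (d->parent == parent && strcmp(d->name, name) == 0) {
            return d->ino;      /* cache hit */
        }
    }
    ino = toy_fs_lookup(parent, name);   /* cache miss */
    if (slot) {
        slot->used = 1;
        slot->parent = parent;
        slot->ino = ino;
        strcpy(slot->name, name);        /* caller guarantees it fits */
    }
    return ino;
}

/* Resolve an absolute path; returns the inode number, or 0 if missing. */
unsigned long toy_resolve(const char *path)
{
    unsigned long cur = 1;               /* the root directory */
    char comp[TOY_NAME_MAX];

    assert(path != NULL);
    while (cur) {
        size_t len;

        path += strspn(path, "/");       /* skip separators */
        len = strcspn(path, "/");
        if (len == 0)
            break;                       /* end of path */
        if (len >= sizeof(comp))
            return 0;                    /* component too long for the toy */
        memcpy(comp, path, len);
        comp[len] = '\0';
        cur = toy_lookup_one(cur, comp);
        path += len;
    }
    return cur;
}
```

Resolving the same path twice leaves `toy_fs_lookups` unchanged the second time, which is the whole point of the cache.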

The Inode Object

inode对象

An individual dentry usually has a pointer to an inode. Inodes are filesystem objects such as regular files, directories, FIFOs and other beasts. They live either on the disc (for block device filesystems) or in the memory (for pseudo filesystems). Inodes that live on the disc are copied into the memory when required and changes to the inode are written back to disc. A single inode can be pointed to by multiple dentries (hard links, for example, do this).
一个单独的dentry通常指向一个inode。Inodes是文件系统对象,如普通文件、目录、FIFO等。它们可以存在于磁盘上(对于块设备文件系统)或内存中(对于伪文件系统)。存在于磁盘上的inode在需要时被复制到内存中,并且对inode的更改被写回磁盘。单个inode可以被多个dentries指向(例如,硬链接就是这样做的)。

To look up an inode requires that the VFS calls the lookup() method of the parent directory inode. This method is installed by the specific filesystem implementation that the inode lives in. Once the VFS has the required dentry (and hence the inode), we can do all those boring things like open(2) the file, or stat(2) it to peek at the inode data. The stat(2) operation is fairly simple: once the VFS has the dentry, it peeks at the inode data and passes some of it back to userspace.
要查找一个inode,需要VFS调用父目录inode的lookup()方法。这个方法是由inode所在的特定文件系统实现安装的。一旦VFS有了所需的dentry(因此有了inode),我们就可以做所有那些无聊的事情,比如打开文件(open(2)),或者stat(2)以查看inode数据。stat(2)操作相当简单:一旦VFS有了dentry,它就会查看inode数据并将其中的一些数据传递回用户空间。
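
Seen from userspace, the stat(2) path described above looks roughly like the following minimal sketch. The helper name `path_is_directory` is made up for illustration; the only system interface used is stat(2) itself, which copies part of the inode data into a `struct stat`.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Return 1 if `path` names a directory, 0 if it names something else,
 * and -1 if stat(2) fails (for example because the path does not exist).
 */
int path_is_directory(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    /* st_mode is part of the inode data passed back to userspace. */
    return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

For example, `path_is_directory("/")` should report a directory on any running system.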

The File Object

文件对象

Opening a file requires another operation: allocation of a file structure (this is the kernel-side implementation of file descriptors). The freshly allocated file structure is initialized with a pointer to the dentry and a set of file operation member functions. These are taken from the inode data. The open() file method is then called so the specific filesystem implementation can do its work. You can see that this is another switch performed by the VFS. The file structure is placed into the file descriptor table for the process.
打开文件需要另一个操作:分配一个文件结构(这是文件描述符的内核端实现)。新分配的文件结构被初始化为指向dentry和一组文件操作成员函数。这些函数来自inode数据。然后调用open()文件方法,以便特定的文件系统实现可以进行其工作。您可以看到这是VFS执行的另一个切换。文件结构被放入进程的文件描述符表中。
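
The "switch" mentioned above is essentially a table of function pointers. The following is a simplified userspace sketch, not kernel code: all `toy_` and `ramtoy_` names are hypothetical, and the toy file points straight at an inode, whereas the real file structure points at a dentry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Method table supplied by a filesystem. */
struct toy_file_ops {
    int  (*open)(struct toy_file *);
    long (*read)(struct toy_file *, char *buf, size_t len);
};

struct toy_inode {
    const struct toy_file_ops *i_fop;  /* methods for files on this inode */
    const char *data;                  /* toy "file contents" */
};

struct toy_file {
    struct toy_inode *f_inode;
    const struct toy_file_ops *f_op;   /* taken from the inode at open time */
    size_t f_pos;
};

/* One toy filesystem's methods. */
static int ramtoy_open(struct toy_file *f)
{
    f->f_pos = 0;
    return 0;
}

static long ramtoy_read(struct toy_file *f, char *buf, size_t len)
{
    size_t left = strlen(f->f_inode->data) - f->f_pos;

    if (len > left)
        len = left;
    memcpy(buf, f->f_inode->data + f->f_pos, len);
    f->f_pos += len;
    return (long)len;
}

const struct toy_file_ops ramtoy_fops = {
    .open = ramtoy_open,
    .read = ramtoy_read,
};

/* Generic layer: initialise the file from the inode, then call ->open. */
int toy_open(struct toy_file *f, struct toy_inode *inode)
{
    assert(f != NULL && inode != NULL && inode->i_fop != NULL);
    f->f_inode = inode;
    f->f_op = inode->i_fop;
    f->f_pos = 0;
    return f->f_op->open ? f->f_op->open(f) : 0;
}

/* Generic layer: dispatch through whatever table the file carries. */
long toy_read(struct toy_file *f, char *buf, size_t len)
{
    return f->f_op->read ? f->f_op->read(f, buf, len) : -1;
}

/* Example objects for trying the sketch out. */
struct toy_inode toy_hello = { .i_fop = &ramtoy_fops, .data = "hello" };
struct toy_file toy_f;
char toy_buf[16];
```

The generic functions never name a concrete filesystem; a second filesystem would only need to supply a different `toy_file_ops` table.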

Reading, writing and closing files (and other assorted VFS operations) is done by using the userspace file descriptor to grab the appropriate file structure, and then calling the required file structure method to do whatever is required. For as long as the file is open, it keeps the dentry in use, which in turn means that the VFS inode is still in use.
读取、写入和关闭文件(以及其他各种VFS操作)是通过使用用户空间文件描述符来获取适当的文件结构,然后调用所需的文件结构方法来执行所需的操作。只要文件是打开的,它就会保持dentry处于使用状态,这反过来意味着VFS inode仍然在使用中。
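
One observable consequence of an open file keeping its inode in use is that a file can be unlinked while open and still be read through the descriptor. The sketch below is only a userspace illustration under assumptions: the helper name is made up, it assumes a writable /tmp, and the link count of 0 is what one would expect on common local filesystems (some network filesystems may behave differently).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a temporary file, remove its name while it is still open,
 * write and read back `data` through the descriptor, and return the
 * link count reported by fstat(2). Returns -1 on any failure.
 */
long nlink_after_unlink(const char *data)
{
    char path[] = "/tmp/vfs-demo-XXXXXX";
    char buf[64];
    struct stat st;
    size_t len;
    long result = -1;
    int fd;

    assert(data != NULL);
    len = strlen(data);
    if (len > sizeof(buf))
        return -1;
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    /* After unlink() the name is gone, but the descriptor still works. */
    if (unlink(path) == 0 &&
        write(fd, data, len) == (ssize_t)len &&
        fstat(fd, &st) == 0 &&
        pread(fd, buf, len, 0) == (ssize_t)len &&
        memcmp(buf, data, len) == 0)
        result = (long)st.st_nlink;
    close(fd);
    return result;
}
```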

Registering and Mounting a Filesystem

注册和挂载文件系统

To register and unregister a filesystem, use the following API functions:
要注册和注销文件系统,请使用以下API函数:

#include <linux/fs.h>

extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);

The passed struct file_system_type describes your filesystem. When a request is made to mount a filesystem onto a directory in your namespace, the VFS will call the appropriate mount() method for the specific filesystem. A new vfsmount referring to the tree returned by ->mount() will be attached to the mountpoint, so that when pathname resolution reaches the mountpoint it will jump into the root of that vfsmount.
传递的struct file_system_type描述了您的文件系统。当在您的命名空间中的目录上挂载文件系统的请求时,VFS将调用特定文件系统的适当mount()方法。新的vfsmount引用->mount()返回的树将附加到挂载点,因此当路径名解析到达挂载点时,它将跳转到该vfsmount的根目录。

You can see all filesystems that are registered to the kernel in the file /proc/filesystems.
您可以在文件/proc/filesystems中看到所有已注册到内核的文件系统。
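
A small userspace sketch for checking that list programmatically follows. It assumes the usual line format of /proc/filesystems, an optional "nodev" flag, a tab, then the type name; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if one /proc/filesystems line names the type `name`, else 0. */
int fs_line_matches(const char *line, const char *name)
{
    const char *tab, *type;
    size_t len;

    assert(line != NULL && name != NULL);
    tab = strchr(line, '\t');
    type = tab ? tab + 1 : line;       /* the name follows the tab */
    len = strcspn(type, "\n");         /* ignore the trailing newline */
    return strlen(name) == len && strncmp(type, name, len) == 0;
}

/*
 * Return 1 if the running kernel lists `name`, 0 if it does not,
 * and -1 if /proc/filesystems cannot be opened.
 */
int fs_type_registered(const char *name)
{
    char line[256];
    int found = 0;
    FILE *f = fopen("/proc/filesystems", "r");

    if (!f)
        return -1;
    while (!found && fgets(line, sizeof(line), f))
        found = fs_line_matches(line, name);
    fclose(f);
    return found;
}
```

Keeping the line parser separate from the file access makes it possible to exercise the parsing on fixed strings, independent of which filesystems a given machine has registered.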

struct file_system_type

This describes the filesystem. The following members are defined:
这段描述了文件系统。以下成员被定义:

struct file_system_type {
        const char *name;
        int fs_flags;
        int (*init_fs_context)(struct fs_context *);
        const struct fs_parameter_spec *parameters;
        struct dentry *(*mount) (struct file_system_type *, int,
                const char *, void *);
        void (*kill_sb) (struct super_block *);
        struct module *owner;
        struct file_system_type * next;
        struct hlist_head fs_supers;

        struct lock_class_key s_lock_key;
        struct lock_class_key s_umount_key;
        struct lock_class_key s_vfs_rename_key;
        struct lock_class_key s_writers_key[SB_FREEZE_LEVELS];

        struct lock_class_key i_lock_key;
        struct lock_class_key i_mutex_key;
        struct lock_class_key invalidate_lock_key;
        struct lock_class_key i_mutex_dir_key;
};
  • name
    the name of the filesystem type, such as "ext2", "iso9660", "msdos" and so on
    文件系统类型的名称,如"ext2"、"iso9660"、"msdos"等

  • fs_flags
    various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
    各种标志(例如FS_REQUIRES_DEV、FS_NO_DCACHE等)

  • init_fs_context
    Initializes 'struct fs_context' ->ops and ->fs_private fields with filesystem-specific data.
    使用特定文件系统代码初始化 'struct fs_context' ->ops 和 ->fs_private 字段。

  • parameters
    Pointer to the array of filesystem parameters descriptors 'struct fs_parameter_spec'. More info in Filesystem Mount API.
    指向文件系统参数描述符 'struct fs_parameter_spec' 数组的指针。有关更多信息,请参阅文件系统挂载 API。

  • mount
    the method to call when a new instance of this filesystem should be mounted
    当应该挂载此文件系统的新实例时调用的方法

  • kill_sb
    the method to call when an instance of this filesystem should be shut down
    当应该关闭此文件系统的实例时调用的方法

  • owner
    for internal VFS use: you should initialize this to THIS_MODULE in most cases.
    内部 VFS 使用:在大多数情况下,您应该将其初始化为 THIS_MODULE。

  • next
    for internal VFS use: you should initialize this to NULL
    内部 VFS 使用:您应该将其初始化为 NULL。

  • fs_supers
    for internal VFS use: hlist of filesystem instances (superblocks)
    内部 VFS 使用:文件系统实例(超级块)的 hlist

  • s_lock_key, s_umount_key, s_vfs_rename_key, s_writers_key, i_lock_key, i_mutex_key, invalidate_lock_key, i_mutex_dir_key
    lockdep-specific
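
To make the role of the name and next members more concrete, here is a userspace toy registry. It is a sketch under assumptions rather than the kernel's implementation: the `toy_` names are hypothetical and the error codes are choices made for this example. It shows why next should start out as NULL (it is the list linkage) and why a type name can only be registered once.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct file_system_type: only the fields used here. */
struct toy_fs_type {
    const char *name;
    struct toy_fs_type *next;   /* list linkage, must start out NULL */
};

static struct toy_fs_type *toy_fs_list;   /* head of the registry */

/* Return the link holding `name`, or the terminating NULL link. */
static struct toy_fs_type **toy_find(const char *name)
{
    struct toy_fs_type **p;

    for (p = &toy_fs_list; *p; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

int toy_register_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    assert(fs != NULL && fs->name != NULL);
    if (fs->next)
        return -EBUSY;          /* already linked into a list */
    p = toy_find(fs->name);
    if (*p)
        return -EBUSY;          /* a type with this name exists */
    *p = fs;                    /* append at the tail */
    return 0;
}

int toy_unregister_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    for (p = &toy_fs_list; *p; p = &(*p)->next) {
        if (*p == fs) {
            *p = fs->next;      /* unlink from the list */
            fs->next = NULL;
            return 0;
        }
    }
    return -EINVAL;             /* was never registered */
}

/* Example instances: two distinct objects that claim the same name. */
struct toy_fs_type toy_a = { .name = "toyfs" };
struct toy_fs_type toy_dup = { .name = "toyfs" };
```

Registering `toy_dup` while `toy_a` is registered fails in this sketch; after `toy_a` is unregistered, the name becomes available again.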

The mount() method has the following arguments:
mount() 方法具有以下参数:

  • struct file_system_type *fs_type
    describes the filesystem, partly initialized by the specific filesystem code
    部分由特定文件系统代码初始化的文件系统描述

  • int flags
    mount flags
    挂载标志

  • const char *dev_name
    the device name we are mounting.
    我们正在挂载的设备名称。

  • void *data
    arbitrary mount options, usually comes as an ASCII string (see "Mount Options" section)
    任意挂载选项,通常作为 ASCII 字符串提供(请参阅“挂载选项”部分)

The mount() method must return the root dentry of the tree requested by caller. An active reference to its superblock must be grabbed and the superblock must be locked. On failure it should return ERR_PTR(error).
mount() 方法必须返回调用者请求的树的根 dentry。必须抓取其超级块的活动引用并锁定超级块。失败时,它应该返回 ERR_PTR(error)。

The arguments match those of mount(2) and their interpretation depends on filesystem type. E.g. for block filesystems, dev_name is interpreted as block device name, that device is opened and if it contains a suitable filesystem image the method creates and initializes struct super_block accordingly, returning its root dentry to caller.
参数与 mount(2) 的参数匹配,其解释取决于文件系统类型。例如,对于块文件系统,dev_name 被解释为块设备名称,该设备被打开,如果它包含适当的文件系统映像,则该方法相应地创建和初始化 struct super_block,并将其根 dentry 返回给调用者。

->mount() may choose to return a subtree of existing filesystem - it doesn't have to create a new one. The main result from the caller's point of view is a reference to dentry at the root of (sub)tree to be attached; creation of new superblock is a common side effect.
->mount() 可能选择返回现有文件系统的子树 - 它不必创建新的文件系统。从调用者的角度来看,主要结果是对要附加的(子)树根的 dentry 的引用;创建新的超级块是一个常见的副作用。

The most interesting member of the superblock structure that the mount() method fills in is the "s_op" field. This is a pointer to a "struct super_operations" which describes the next level of the filesystem implementation.
mount() 方法填充的超级块结构的最有趣的成员是 "s_op" 字段。这是指向描述文件系统实现的下一个级别的 "struct super_operations" 的指针。

Usually, a filesystem uses one of the generic mount() implementations and provides a fill_super() callback instead. The generic variants are:
通常,文件系统使用通用的 mount() 实现之一,并提供一个 fill_super() 回调。通用变体包括:

  • mount_bdev
    mount a filesystem residing on a block device
    挂载驻留在块设备上的文件系统

  • mount_nodev
    mount a filesystem that is not backed by a device
    挂载不由设备支持的文件系统

  • mount_single
    mount a filesystem which shares the instance between all mounts
    挂载在所有挂载之间共享实例的文件系统

A fill_super() callback implementation has the following arguments:
fill_super() 回调实现具有以下参数:

  • struct super_block *sb
    the superblock structure. The callback must initialize this properly.
    超级块结构。回调必须正确初始化此结构。

  • void *data
    arbitrary mount options, usually comes as an ASCII string (see "Mount Options" section)
    任意挂载选项,通常作为 ASCII 字符串提供(请参阅“挂载选项”部分)

  • int silent
    whether or not to be silent on error
    是否在错误时保持沉默

The Superblock Object

超级块对象

A superblock object represents a mounted filesystem.
超级块对象表示已挂载的文件系统。

struct super_operations

结构体 super_operations

This describes how the VFS can manipulate the superblock of your filesystem. The following members are defined:
这描述了虚拟文件系统(VFS)如何操作文件系统的超级块。以下成员已定义:

struct super_operations {
        struct inode *(*alloc_inode)(struct super_block *sb);
        void (*destroy_inode)(struct inode *);
        void (*free_inode)(struct inode *);

        void (*dirty_inode) (struct inode *, int flags);
        int (*write_inode) (struct inode *, struct writeback_control *wbc);
        int (*drop_inode) (struct inode *);
        void (*evict_inode) (struct inode *);
        void (*put_super) (struct super_block *);
        int (*sync_fs)(struct super_block *sb, int wait);
        int (*freeze_super) (struct super_block *);
        int (*freeze_fs) (struct super_block *);
        int (*thaw_super) (struct super_block *);
        int (*unfreeze_fs) (struct super_block *);
        int (*statfs) (struct dentry *, struct kstatfs *);
        int (*remount_fs) (struct super_block *, int *, char *);
        void (*umount_begin) (struct super_block *);

        int (*show_options)(struct seq_file *, struct dentry *);
        int (*show_devname)(struct seq_file *, struct dentry *);
        int (*show_path)(struct seq_file *, struct dentry *);
        int (*show_stats)(struct seq_file *, struct dentry *);

        ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
        ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
        struct dquot **(*get_dquots)(struct inode *);

        long (*nr_cached_objects)(struct super_block *,
                                struct shrink_control *);
        long (*free_cached_objects)(struct super_block *,
                                struct shrink_control *);
};

All methods are called without any locks being held, unless otherwise noted. This means that most methods can block safely. All methods are only called from a process context (i.e. not from an interrupt handler or bottom half).
所有方法都在没有任何锁的情况下调用,除非另有说明。这意味着大多数方法可以安全地阻塞。所有方法仅从进程上下文中调用(即不是从中断处理程序或底半部分调用)。

  • alloc_inode
    this method is called by alloc_inode() to allocate memory for struct inode and initialize it. If this function is not defined, a simple 'struct inode' is allocated. Normally alloc_inode will be used to allocate a larger structure which contains a 'struct inode' embedded within it.
    此方法由 alloc_inode() 调用以为 struct inode 分配内存并对其进行初始化。如果未定义此函数,则分配一个简单的 'struct inode'。通常,alloc_inode 将用于分配一个包含嵌入其中的 'struct inode' 的较大结构。

  • destroy_inode
    this method is called by destroy_inode() to release resources allocated for struct inode. It is only required if ->alloc_inode was defined and simply undoes anything done by ->alloc_inode.
    此方法由 destroy_inode() 调用以释放为 struct inode 分配的资源。仅在 ->alloc_inode 被定义并且简单地撤消了 ->alloc_inode 所做的任何操作时才需要。

  • free_inode
    this method is called from RCU callback. If you use call_rcu() in ->destroy_inode to free 'struct inode' memory, then it's better to release memory in this method.
    此方法从 RCU 回调中调用。如果在 ->destroy_inode 中使用 call_rcu() 释放 'struct inode' 内存,那么最好在此方法中释放内存。

  • dirty_inode
    this method is called by the VFS when an inode is marked dirty. This is specifically for the inode itself being marked dirty, not its data. If the update needs to be persisted by fdatasync(), then I_DIRTY_DATASYNC will be set in the flags argument. I_DIRTY_TIME will be set in the flags in case lazytime is enabled and struct inode has times updated since the last ->dirty_inode call.
    当 inode 被标记为脏时,VFS 调用此方法。这特别是指 inode 本身被标记为脏,而不是其数据。如果更新需要通过 fdatasync() 进行持久化,则 flags 参数中将设置 I_DIRTY_DATASYNC。如果启用了 lazytime 并且自上次 ->dirty_inode 调用以来 struct inode 的时间已更新,则 flags 中将设置 I_DIRTY_TIME。

  • write_inode
    this method is called when the VFS needs to write an inode to disc. The second parameter indicates whether the write should be synchronous or not; not all filesystems check this flag.
    当 VFS 需要将 inode 写入磁盘时,调用此方法。第二个参数指示写入是否应该是同步的,不是所有文件系统都检查此标志。

  • drop_inode
    called when the last access to the inode is dropped, with the inode->i_lock spinlock held.
    当对 inode 的最后访问被丢弃时调用,保持 inode->i_lock 自旋锁。

    This method should be either NULL (normal UNIX filesystem semantics) or "generic_delete_inode" (for filesystems that do not want to cache inodes - causing "delete_inode" to always be called regardless of the value of i_nlink)
    此方法应为 NULL(正常的 UNIX 文件系统语义)或 "generic_delete_inode"(对于不希望缓存 inode 的文件系统 - 导致始终调用 "delete_inode",而不管 i_nlink 的值如何)。

    The "generic_delete_inode()" behavior is equivalent to the old practice of using "force_delete" in the put_inode() case, but does not have the races that the "force_delete()" approach had.
    “generic_delete_inode()”行为等同于在 put_inode() 情况下使用 “force_delete”的旧做法,但不具有“force_delete()”方法存在的竞争。

  • evict_inode
    called when the VFS wants to evict an inode. Caller does not evict the pagecache or inode-associated metadata buffers; the method has to use truncate_inode_pages_final() to get rid of those. Caller makes sure async writeback cannot be running for the inode while (or after) ->evict_inode() is called. Optional.
    当 VFS 想要驱逐一个 inode 时调用。调用者不会驱逐页缓存或与 inode 相关的元数据缓冲区;该方法必须使用 truncate_inode_pages_final() 来摆脱这些。调用者确保异步写回不能在调用 ->evict_inode() 期间(或之后)运行。可选。

  • put_super
    called when the VFS wishes to free the superblock (i.e. unmount). This is called with the superblock lock held
    当 VFS 希望释放超级块(即卸载)时调用。这是在保持超级块锁的情况下调用的。

  • sync_fs
    called when VFS is writing out all dirty data associated with a superblock. The second parameter indicates whether the method should wait until the write out has been completed. Optional.
    当 VFS 写出与超级块关联的所有脏数据时调用。第二个参数指示方法是否应等待写出完成。可选。

  • freeze_super
    Called instead of ->freeze_fs callback if provided. Main difference is that ->freeze_super is called without taking down_write(&sb->s_umount). If filesystem implements it and wants ->freeze_fs to be called too, then it has to call ->freeze_fs explicitly from this callback. Optional.
    如果提供,则调用此方法而不是 ->freeze_fs 回调。主要区别在于调用 ->freeze_super 时不会采取 down_write(&sb->s_umount)。如果文件系统实现了它并且希望调用 ->freeze_fs,则必须从此回调中显式调用 ->freeze_fs。可选。

  • freeze_fs
    called when VFS is locking a filesystem and forcing it into a consistent state. This method is currently used by the Logical Volume Manager (LVM) and ioctl(FIFREEZE). Optional.
    当 VFS 锁定文件系统并强制其进入一致状态时调用。此方法目前由逻辑卷管理器(LVM)和 ioctl(FIFREEZE) 使用。可选。

  • thaw_super
    called when VFS is unlocking a filesystem and making it writable again after ->freeze_super. Optional.
    当 VFS 解锁文件系统并使其可写时调用 ->freeze_super 后。可选。

  • unfreeze_fs
    called when VFS is unlocking a filesystem and making it writable again after ->freeze_fs. Optional.
    当 VFS 解锁文件系统并使其可写时调用 ->freeze_fs 后。可选。

  • statfs
    called when the VFS needs to get filesystem statistics.
    当 VFS 需要获取文件系统统计信息时调用。

  • remount_fs
    called when the filesystem is remounted. This is called with the kernel lock held
    当重新挂载文件系统时调用。这是在持有内核锁的情况下调用的。

  • umount_begin
    called when the VFS is unmounting a filesystem.
    当 VFS 卸载文件系统时调用。

  • show_options
    called by the VFS to show mount options for /proc/<pid>/mounts and /proc/<pid>/mountinfo. (see "Mount Options" section)
    由 VFS 调用以显示 /proc/<pid>/mounts/proc/<pid>/mountinfo 的挂载选项(参见“挂载选项”部分)。

  • show_devname
    Optional. Called by the VFS to show device name for /proc/<pid>/{mounts,mountinfo,mountstats}. If not provided then '(struct mount).mnt_devname' will be used.
    可选。由 VFS 调用以显示 /proc/<pid>/{mounts,mountinfo,mountstats} 的设备名称。如果未提供,则将使用 '(struct mount).mnt_devname'。

  • show_path
    Optional. Called by the VFS (for /proc/<pid>/mountinfo) to show the mount root dentry path relative to the filesystem root.
    可选。由 VFS(对于 /proc/<pid>/mountinfo)调用以显示相对于文件系统根的挂载根 dentry 路径。

  • show_stats
    Optional. Called by the VFS (for /proc/<pid>/mountstats) to show filesystem-specific mount statistics.
    可选。由 VFS(对于 /proc/<pid>/mountstats)调用以显示特定于文件系统的挂载统计信息。

  • quota_read
    called by the VFS to read from filesystem quota file.
    由 VFS 调用以从文件系统配额文件中读取。

  • quota_write
    called by the VFS to write to filesystem quota file.
    由 VFS 调用以向文件系统配额文件中写入。

  • get_dquots
    called by quota to get 'struct dquot' array for a particular inode. Optional.
    由配额调用以获取特定 inode 的 'struct dquot' 数组。可选。

  • nr_cached_objects
    called by the sb cache shrinking function for the filesystem to return the number of freeable cached objects it contains. Optional.
    由文件系统的 sb 缓存收缩函数调用,以返回它包含的可释放缓存对象的数量。可选。

  • free_cached_objects
    called by the sb cache shrinking function for the filesystem to scan the number of objects indicated to try to free them. Optional, but any filesystem implementing this method needs to also implement ->nr_cached_objects for it to be called correctly.
    由文件系统的 sb 缓存收缩函数调用,以扫描指定数量的对象并尝试释放它们。可选,但任何实现此方法的文件系统都需要实现 ->nr_cached_objects 以便正确调用。

    We can't do anything with any errors that the filesystem might have encountered, hence the void return type. This will never be called if the VM is trying to reclaim under GFP_NOFS conditions, hence this method does not need to handle that situation itself.
    我们无法处理文件系统可能遇到的任何错误,因此返回类型为 void。如果 VM 尝试在 GFP_NOFS 条件下回收,那么永远不会调用此方法,因此此方法本身不需要处理该情况。

    Implementations must include conditional reschedule calls inside any scanning loop that is done. This allows the VFS to determine appropriate scan batch sizes without having to worry about whether implementations will cause holdoff problems due to large scan batch sizes.
    实现必须在任何扫描循环内包含条件性的重新调度调用。这允许 VFS 确定适当的扫描批处理大小,而无需担心实现是否会因大批处理大小而导致暂停问题。
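
The alloc_inode entry above mentions allocating a larger structure with a 'struct inode' embedded in it. The pattern can be sketched in plain userspace C as follows. All names (`toy_inode`, `toyfs_inode_info`, `TOYFS_I`, `toy_container_of`) are hypothetical stand-ins, and malloc-style allocation replaces whatever allocator a real filesystem would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the generic inode; one field is enough here. */
struct toy_inode {
    unsigned long i_ino;
};

/* Filesystem-private inode with the generic inode embedded inside it. */
struct toyfs_inode_info {
    unsigned int private_flags;     /* hypothetical per-filesystem state */
    struct toy_inode vfs_inode;
};

/* Recover the containing structure from a pointer to its member. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Plays the role of ->alloc_inode: allocate the larger structure,
 * but hand back only the embedded generic inode. */
struct toy_inode *toyfs_alloc_inode(unsigned int flags)
{
    struct toyfs_inode_info *info = calloc(1, sizeof(*info));

    if (!info)
        return NULL;
    info->private_flags = flags;
    return &info->vfs_inode;
}

/* Map a generic inode back to the filesystem-private structure. */
struct toyfs_inode_info *TOYFS_I(struct toy_inode *inode)
{
    assert(inode != NULL);
    return toy_container_of(inode, struct toyfs_inode_info, vfs_inode);
}

/* Plays the role of the matching release method: undo the allocation. */
void toyfs_free_inode(struct toy_inode *inode)
{
    free(TOYFS_I(inode));
}

/* Round trip: private state set at allocation is visible via TOYFS_I(). */
unsigned int toyfs_roundtrip(unsigned int flags)
{
    struct toy_inode *inode = toyfs_alloc_inode(flags);
    unsigned int seen;

    if (!inode)
        return 0;
    seen = TOYFS_I(inode)->private_flags;
    toyfs_free_inode(inode);
    return seen;
}
```

Generic code only ever sees the `struct toy_inode` pointer, while the filesystem can get back to its private data without any extra lookup.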

Whoever sets up the inode is responsible for filling in the "i_op" field. This is a pointer to a "struct inode_operations" which describes the methods that can be performed on individual inodes.
设置 inode 的人负责填写 "i_op" 字段。这是指向描述可以在单个 inode 上执行的方法的 "struct inode_operations" 的指针。

struct xattr_handler

On filesystems that support extended attributes (xattrs), the s_xattr superblock field points to a NULL-terminated array of xattr handlers. Extended attributes are name:value pairs.
在支持扩展属性(xattrs)的文件系统上,s_xattr 超级块字段指向一个以 NULL 结尾的 xattr 处理程序数组。扩展属性是名称:值对。

  • name
    Indicates that the handler matches attributes with the specified name (such as "system.posix_acl_access"); the prefix field must be NULL.
    指示处理程序匹配具有指定名称的属性(例如 "system.posix_acl_access");前缀字段必须为 NULL。

  • prefix
    Indicates that the handler matches all attributes with the specified name prefix (such as "user."); the name field must be NULL.
    指示处理程序匹配具有指定名称前缀的所有属性(例如 "user.");名称字段必须为 NULL。

  • list
    Determine if attributes matching this xattr handler should be listed for a particular dentry. Used by some listxattr implementations like generic_listxattr.
    确定是否应该为特定 dentry 列出与此 xattr 处理程序匹配的属性。由一些 listxattr 实现(如 generic_listxattr)使用。

  • get
    Called by the VFS to get the value of a particular extended attribute. This method is called by the getxattr(2) system call.
    由 VFS 调用以获取特定扩展属性的值。此方法由 getxattr(2) 系统调用调用。

  • set
    Called by the VFS to set the value of a particular extended attribute. When the new value is NULL, called to remove a particular extended attribute. This method is called by the setxattr(2) and removexattr(2) system calls.
    由 VFS 调用以设置特定扩展属性的值。当新值为 NULL 时,用于移除特定扩展属性。此方法由 setxattr(2) 和 removexattr(2) 系统调用调用。

When none of the xattr handlers of a filesystem match the specified attribute name or when a filesystem doesn't support extended attributes, the various *xattr(2) system calls return -EOPNOTSUPP.
当文件系统的所有 xattr 处理程序都不匹配指定的属性名称,或者文件系统不支持扩展属性时,各种 *xattr(2) 系统调用返回 -EOPNOTSUPP。
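
The name/prefix matching rules above can be illustrated with a toy resolver. This is a simplified sketch with hypothetical `toy_` names; it models only the matching and the -EOPNOTSUPP fallback, not the get/set/list callbacks or any other checks the kernel performs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy handler: exactly one of `name` and `prefix` is non-NULL. */
struct toy_xattr_handler {
    const char *name;
    const char *prefix;
};

static const struct toy_xattr_handler toy_acl_handler = {
    .name = "system.posix_acl_access",
};
static const struct toy_xattr_handler toy_user_handler = {
    .prefix = "user.",
};

/* NULL-terminated array, in the spirit of what s_xattr points to. */
const struct toy_xattr_handler *toy_handlers[] = {
    &toy_acl_handler,
    &toy_user_handler,
    NULL,
};

/*
 * Return the index of the first handler matching `attr`, or
 * -EOPNOTSUPP if none matches or the filesystem has no handlers.
 */
int toy_xattr_resolve(const struct toy_xattr_handler **handlers,
                      const char *attr)
{
    int i;

    assert(attr != NULL);
    for (i = 0; handlers && handlers[i]; i++) {
        const struct toy_xattr_handler *h = handlers[i];

        if (h->name && strcmp(attr, h->name) == 0)
            return i;           /* exact-name handler */
        if (h->prefix && strncmp(attr, h->prefix, strlen(h->prefix)) == 0)
            return i;           /* prefix handler */
    }
    return -EOPNOTSUPP;
}
```

Passing NULL for the handler array models a filesystem without extended attribute support.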

The Inode Object

An inode object represents an object within the filesystem.
Inode对象表示文件系统中的一个对象。

struct inode_operations

This describes how the VFS can manipulate an inode in your filesystem. As of kernel 2.6.22, the following members are defined:
这描述了VFS如何在文件系统中操作一个inode。从内核2.6.22开始,定义了以下成员:

struct inode_operations {
        int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t, bool);
        struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
        int (*link) (struct dentry *,struct inode *,struct dentry *);
        int (*unlink) (struct inode *,struct dentry *);
        int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
        int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
        int (*rmdir) (struct inode *,struct dentry *);
        int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
        int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
                       struct inode *, struct dentry *, unsigned int);
        int (*readlink) (struct dentry *, char __user *,int);
        const char *(*get_link) (struct dentry *, struct inode *,
                                 struct delayed_call *);
        int (*permission) (struct mnt_idmap *, struct inode *, int);
        struct posix_acl * (*get_inode_acl)(struct inode *, int, bool);
        int (*setattr) (struct mnt_idmap *, struct dentry *, struct iattr *);
        int (*getattr) (struct mnt_idmap *, const struct path *, struct kstat *, u32, unsigned int);
        ssize_t (*listxattr) (struct dentry *, char *, size_t);
        void (*update_time)(struct inode *, struct timespec *, int);
        int (*atomic_open)(struct inode *, struct dentry *, struct file *,
                           unsigned open_flag, umode_t create_mode);
        int (*tmpfile) (struct mnt_idmap *, struct inode *, struct file *, umode_t);
        struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
        int (*set_acl)(struct mnt_idmap *, struct dentry *, struct posix_acl *, int);
        int (*fileattr_set)(struct mnt_idmap *idmap,
                            struct dentry *dentry, struct fileattr *fa);
        int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
};

Again, all methods are called without any locks being held, unless otherwise noted.
同样,除非另有说明,所有方法在没有任何锁的情况下调用。

  • create
    called by the open(2) and creat(2) system calls. Only required if you want to support regular files. The dentry you get should not have an inode (i.e. it should be a negative dentry). Here you will probably call d_instantiate() with the dentry and the newly created inode
    由open(2)和creat(2)系统调用调用。仅在要支持常规文件时才需要。您获得的dentry不应该有一个inode(即它应该是一个负的dentry)。在这里,您可能会使用d_instantiate()和dentry以及新创建的inode。

  • lookup
    called when the VFS needs to look up an inode in a parent directory. The name to look for is found in the dentry. This method must call d_add() to insert the found inode into the dentry. The "i_count" field in the inode structure should be incremented. If the named inode does not exist a NULL inode should be inserted into the dentry (this is called a negative dentry). Returning an error code from this routine must only be done on a real error, otherwise creating inodes with system calls like creat(2), mknod(2), mkdir(2) and so on will fail. If you wish to overload the dentry methods then you should initialise the "d_dop" field in the dentry; this is a pointer to a struct "dentry_operations". This method is called with the directory inode semaphore held.
    当VFS需要在父目录中查找inode时调用。要查找的名称在dentry中找到。此方法必须调用d_add()将找到的inode插入dentry中。应增加inode结构中的“i_count”字段。如果命名的inode不存在,则应将NULL inode插入dentry中(这称为负dentry)。从此例程返回错误代码只能在真正的错误上执行,否则使用create(2)、mknod(2)、mkdir(2)等系统调用创建inode将失败。如果要重载dentry方法,则应初始化dentry中的“d_dop”字段;这是指向struct“dentry_operations”的指针。此方法在持有目录inode信号量的情况下调用。

  • link
    called by the link(2) system call. Only required if you want to support hard links. You will probably need to call d_instantiate() just as you would in the create() method
    由link(2)系统调用调用。仅在要支持硬链接时才需要。您可能需要像在create()方法中一样调用d_instantiate()。

  • unlink
    called by the unlink(2) system call. Only required if you want to support deleting inodes
    由unlink(2)系统调用调用。仅在要支持删除inode时才需要。

  • symlink
    called by the symlink(2) system call. Only required if you want to support symlinks. You will probably need to call d_instantiate() just as you would in the create() method
    由symlink(2)系统调用调用。仅在要支持符号链接时才需要。您可能需要像在create()方法中一样调用d_instantiate()。

  • mkdir
    called by the mkdir(2) system call. Only required if you want to support creating subdirectories. You will probably need to call d_instantiate() just as you would in the create() method
    由mkdir(2)系统调用调用。仅在要支持创建子目录时才需要。您可能需要像在create()方法中一样调用d_instantiate()。

  • rmdir
    called by the rmdir(2) system call. Only required if you want to support deleting subdirectories
    由rmdir(2)系统调用调用。仅在要支持删除子目录时才需要。

  • mknod
    called by the mknod(2) system call to create a device (char, block) inode or a named pipe (FIFO) or socket. Only required if you want to support creating these types of inodes. You will probably need to call d_instantiate() just as you would in the create() method
    由mknod(2)系统调用调用,用于创建设备(char、block)inode、命名管道(FIFO)或套接字。仅在要支持创建这些类型的inode时才需要。您可能需要像在create()方法中一样调用d_instantiate()。

  • rename
    called by the rename(2) system call to rename the object to have the parent and name given by the second inode and dentry.
    由rename(2)系统调用调用,将对象重命名为由第二个inode和dentry给出的父目录和名称。

    The filesystem must return -EINVAL for any unsupported or unknown flags. Currently the following flags are implemented: (1) RENAME_NOREPLACE: this flag indicates that if the target of the rename exists the rename should fail with -EEXIST instead of replacing the target. The VFS already checks for existence, so for local filesystems the RENAME_NOREPLACE implementation is equivalent to plain rename. (2) RENAME_EXCHANGE: exchange source and target. Both must exist; this is checked by the VFS. Unlike plain rename, source and target may be of different type.
    文件系统必须对任何不受支持或未知标志返回-EINVAL。当前实现了以下标志:(1)RENAME_NOREPLACE:此标志表示如果重命名的目标存在,则重命名应失败并返回-EEXIST,而不是替换目标。VFS已经检查了存在性,因此对于本地文件系统,RENAME_NOREPLACE实现等同于普通重命名。(2)RENAME_EXCHANGE:交换源和目标。源和目标都必须存在;这由VFS检查。与普通重命名不同,源和目标可以是不同类型的。

  • get_link
    called by the VFS to follow a symbolic link to the inode it points to. Only required if you want to support symbolic links. This method returns the symlink body to traverse (and possibly resets the current position with nd_jump_link()). If the body won't go away until the inode is gone, nothing else is needed; if it needs to be otherwise pinned, arrange for its release by having get_link(..., ..., done) do set_delayed_call(done, destructor, argument). In that case destructor(argument) will be called once VFS is done with the body you've returned. May be called in RCU mode; that is indicated by NULL dentry argument. If request can't be handled without leaving RCU mode, have it return ERR_PTR(-ECHILD).
    由VFS调用以跟随符号链接指向的inode。仅在要支持符号链接时才需要。此方法返回要遍历的符号链接主体(可能使用nd_jump_link()重置当前位置)。如果主体在inode消失之前不会消失,则不需要其他操作;如果需要以其他方式固定它,请通过get_link(..., ..., done)调用set_delayed_call(done, destructor, argument)来安排其释放。在这种情况下,VFS完成返回主体后将调用destructor(argument)。可能在RCU模式下调用;这由NULL dentry参数表示。如果请求无法在离开RCU模式的情况下处理,请使其返回ERR_PTR(-ECHILD)。

    If the filesystem stores the symlink target in ->i_link, the VFS may use it directly without calling ->get_link(); however, ->get_link() must still be provided. ->i_link must not be freed until after an RCU grace period. Writing to ->i_link post-iget() time requires a 'release' memory barrier.
    如果文件系统将符号链接目标存储在->i_link中,则VFS可以直接使用它而无需调用->get_link();但是,仍然必须提供->get_link()。->i_link在RCU宽限期之后才能释放。在iget()时间后写入->i_link需要一个'release'内存屏障。

  • readlink
    this is now just an override for use by readlink(2) for the cases when ->get_link uses nd_jump_link() or object is not in fact a symlink. Normally filesystems should only implement ->get_link for symlinks and readlink(2) will automatically use that.
    现在只是readlink(2)的一个覆盖,用于当->get_link使用nd_jump_link()或对象实际上不是符号链接的情况。通常,文件系统应该只为符号链接实现->get_link,而readlink(2)将自动使用它。

  • permission
    called by the VFS to check for access rights on a POSIX-like filesystem.
    由VFS调用以检查POSIX类似文件系统上的访问权限。

    May be called in rcu-walk mode (mask & MAY_NOT_BLOCK). If in rcu-walk mode, the filesystem must check the permission without blocking or storing to the inode.
    可能以rcu-walk模式调用(mask & MAY_NOT_BLOCK)。如果在rcu-walk模式下,文件系统必须在不阻塞或存储到inode的情况下检查权限。

    If a situation is encountered that rcu-walk cannot handle, return -ECHILD and it will be called again in ref-walk mode.
    如果遇到rcu-walk无法处理的情况,请返回-ECHILD,然后将以ref-walk模式再次调用。

  • setattr
    called by the VFS to set attributes for a file. This method is called by chmod(2) and related system calls.
    由VFS调用以设置文件的属性。此方法由chmod(2)和相关系统调用调用。

  • getattr
    called by the VFS to get attributes of a file. This method is called by stat(2) and related system calls.
    由VFS调用以获取文件的属性。此方法由stat(2)和相关系统调用调用。

  • listxattr
    called by the VFS to list all extended attributes for a given file. This method is called by the listxattr(2) system call.
    由VFS调用以列出给定文件的所有扩展属性。此方法由listxattr(2)系统调用调用。

  • update_time
    called by the VFS to update a specific time or the i_version of an inode. If this is not defined the VFS will update the inode itself and call mark_inode_dirty_sync.
    由VFS调用以更新特定时间或inode的i_version。如果未定义此方法,VFS将更新inode本身并调用mark_inode_dirty_sync。

  • atomic_open
    called on the last component of an open. Using this optional method the filesystem can look up, possibly create and open the file in one atomic operation. If it wants to leave actual opening to the caller (e.g. if the file turned out to be a symlink, device, or just something filesystem won't do atomic open for), it may signal this by returning finish_no_open(file, dentry). This method is only called if the last component is negative or needs lookup. Cached positive dentries are still handled by f_op->open(). If the file was created, FMODE_CREATED flag should be set in file->f_mode. In case of O_EXCL the method must only succeed if the file didn't exist and hence FMODE_CREATED shall always be set on success.
    在打开的最后一个组件上调用。使用这个可选方法,文件系统可以在一个原子操作中查找、可能创建并打开文件。如果它希望将实际的打开留给调用者(例如,如果文件最终是一个符号链接、设备,或者只是文件系统不支持原子打开的某些东西),它可以通过返回 finish_no_open(file, dentry) 来表示这一点。只有在最后一个组件为负或需要查找时才会调用此方法。缓存的正向 dentries 仍然由 f_op->open() 处理。如果文件已创建,应在 file->f_mode 中设置 FMODE_CREATED 标志。在 O_EXCL 的情况下,该方法只能在文件不存在时成功,因此成功时必须始终设置 FMODE_CREATED。

  • tmpfile
    called at the end of O_TMPFILE open(). Optional, equivalent to atomically creating, opening and unlinking a file in the given directory. On success it needs to return with the file already open; this can be done by calling finish_open_simple() right at the end.
    在 O_TMPFILE 打开的末尾调用。可选,相当于在给定目录中原子地创建、打开和取消链接文件。成功时需要返回已经打开的文件;可以通过在最后立即调用 finish_open_simple() 来实现。

  • fileattr_get
    called on ioctl(FS_IOC_GETFLAGS) and ioctl(FS_IOC_FSGETXATTR) to retrieve miscellaneous file flags and attributes. Also called before the relevant SET operation to check what is being changed (in this case with i_rwsem locked exclusive). If unset, then fall back to f_op->ioctl().
    在 ioctl(FS_IOC_GETFLAGS) 和 ioctl(FS_IOC_FSGETXATTR) 上调用,以检索杂项文件标志和属性。在相关的 SET 操作之前调用,以检查正在更改的内容(在这种情况下,使用 i_rwsem 锁定独占)。如果未设置,则回退到 f_op->ioctl()。

  • fileattr_set
    called on ioctl(FS_IOC_SETFLAGS) and ioctl(FS_IOC_FSSETXATTR) to change miscellaneous file flags and attributes. Callers hold i_rwsem exclusive. If unset, then fall back to f_op->ioctl().
    在 ioctl(FS_IOC_SETFLAGS) 和 ioctl(FS_IOC_FSSETXATTR) 上调用,以更改杂项文件标志和属性。调用者持有 i_rwsem 独占。如果未设置,则回退到 f_op->ioctl()。
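
The rcu-walk rule described for ->permission above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: demo_inode, demo_permission, demo_check_access and the DEMO_* flag values are hypothetical names invented for illustration, not kernel definitions. The point it tries to show is that a non-blocking call returns -ECHILD without touching the inode, and that the caller then retries in blocking (ref-walk) mode.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the real kernel flag values differ. */
#define DEMO_MAY_WRITE      0x2
#define DEMO_MAY_READ       0x4
#define DEMO_MAY_NOT_BLOCK  0x80

struct demo_inode {
    unsigned int mode_bits;  /* permissions held in memory */
    bool attrs_cached;       /* false: a check would need blocking work */
};

/* Model of ->permission: never block or store when NOT_BLOCK is set. */
int demo_permission(struct demo_inode *inode, int mask)
{
    unsigned int want = (unsigned int)mask & ~(unsigned int)DEMO_MAY_NOT_BLOCK;

    if (!inode->attrs_cached) {
        if (mask & DEMO_MAY_NOT_BLOCK)
            return -ECHILD;          /* ask to be called again in ref-walk */
        inode->attrs_cached = true;  /* pretend we fetched them (may block) */
    }
    return (inode->mode_bits & want) == want ? 0 : -EACCES;
}

/* Model of the caller: try rcu-walk first, fall back to ref-walk. */
int demo_check_access(struct demo_inode *inode, int mask)
{
    int err = demo_permission(inode, mask | DEMO_MAY_NOT_BLOCK);

    if (err == -ECHILD)
        err = demo_permission(inode, mask);
    return err;
}
```

In this model the fast path never modifies the inode; only the retry without DEMO_MAY_NOT_BLOCK is allowed to fill in attrs_cached.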

The Address Space Object

地址空间对象

The address space object is used to group and manage pages in the page cache. It can be used to keep track of the pages in a file (or anything else) and also track the mapping of sections of the file into process address spaces.
地址空间对象用于对页面缓存中的页面进行分组和管理。它可以用于跟踪文件(或其他任何内容)中的页面,并跟踪文件的部分映射到进程地址空间中。

There are a number of distinct yet related services that an address-space can provide. These include communicating memory pressure, page lookup by address, and keeping track of pages tagged as Dirty or Writeback.
地址空间可以提供一些独立但相关的服务。这些包括通知内存压力、按地址查找页面以及跟踪标记为脏页或写回页的页面。

The first can be used independently of the others. The VM can try to either write dirty pages in order to clean them, or release clean pages in order to reuse them. To do this it can call the ->writepage method on dirty pages, and ->release_folio on clean folios with the private flag set. Clean pages without PagePrivate and with no external references will be released without notice being given to the address_space.
第一个服务可以独立于其他服务使用。虚拟内存可以尝试写入脏页以清理它们,或释放干净页以重用它们。为此,它可以在脏页上调用 ->writepage 方法,并在设置了私有标志的干净 folio 上调用 ->release_folio。没有 PagePrivate 标志且没有外部引用的干净页将在不通知地址空间的情况下被释放。
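
The decision described in the paragraph above can be written down as a tiny userspace model. It is a sketch only: demo_rpage, demo_reclaim_decide and the enum values are hypothetical names, and the DEMO_KEEP outcome for clean pages that still have external references is an assumption added to make the function total.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_reclaim_action {
    DEMO_CALL_WRITEPAGE,      /* dirty: write it out to clean it */
    DEMO_CALL_RELEASE_FOLIO,  /* clean, but private data is attached */
    DEMO_FREE_SILENTLY,       /* clean, no private data, no external refs */
    DEMO_KEEP                 /* assumption: still referenced, leave it */
};

struct demo_rpage {
    bool dirty;
    bool has_private;
    int external_refs;
};

/* Pick what the VM would ask of the address space for this page. */
enum demo_reclaim_action demo_reclaim_decide(const struct demo_rpage *p)
{
    if (p->dirty)
        return DEMO_CALL_WRITEPAGE;
    if (p->has_private)
        return DEMO_CALL_RELEASE_FOLIO;
    if (p->external_refs == 0)
        return DEMO_FREE_SILENTLY;
    return DEMO_KEEP;
}
```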

To achieve this functionality, pages need to be placed on an LRU with lru_cache_add and mark_page_active needs to be called whenever the page is used.
为实现这一功能,需要使用 lru_cache_add 将页面放置在 LRU 上,并在页面被使用时调用 mark_page_active。

Pages are normally kept in a radix tree index by ->index. This tree maintains information about the PG_Dirty and PG_Writeback status of each page, so that pages with either of these flags can be found quickly.
页面通常通过 ->index 保持在基数树索引中。该树维护有关每个页面的 PG_Dirty 和 PG_Writeback 状态的信息,以便可以快速找到具有这些标志之一的页面。

The Dirty tag is primarily used by mpage_writepages - the default ->writepages method. It uses the tag to find dirty pages to call ->writepage on. If mpage_writepages is not used (i.e. the address space provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is almost unused. write_inode_now and sync_inode do use it (through __sync_single_inode) to check if ->writepages has been successful in writing out the whole address_space.
脏标记主要由 mpage_writepages(默认的 ->writepages 方法)使用。它使用该标记查找需要调用 ->writepage 的脏页。如果未使用 mpage_writepages(即地址空间提供自己的 ->writepages),则 PAGECACHE_TAG_DIRTY 标记几乎不被使用。write_inode_now 和 sync_inode 使用它(通过 __sync_single_inode)来检查 ->writepages 是否成功写出整个地址空间。
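
To make the role of the tags concrete, here is a minimal userspace sketch of tag-based lookup. It assumes a flat array instead of the kernel's tree, so the linear scan is only a stand-in for the real tagged search; demo_tagged_mapping, demo_find_tagged, demo_mapping_tagged and the DEMO_TAG_* values are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_TAG_DIRTY      (1u << 0)
#define DEMO_TAG_WRITEBACK  (1u << 1)
#define DEMO_NR_PAGES       8

struct demo_tagged_mapping {
    unsigned int tags[DEMO_NR_PAGES];  /* tag bits, indexed by page index */
};

/* Return the first index >= start that carries the tag, or -1 if none. */
long demo_find_tagged(const struct demo_tagged_mapping *m, size_t start,
                      unsigned int tag)
{
    for (size_t i = start; i < DEMO_NR_PAGES; i++)
        if (m->tags[i] & tag)
            return (long)i;
    return -1;
}

/* True if any page still carries the tag, e.g. to check that a
 * writeout pass left nothing dirty behind. */
bool demo_mapping_tagged(const struct demo_tagged_mapping *m, unsigned int tag)
{
    return demo_find_tagged(m, 0, tag) >= 0;
}
```

A default writepages-style loop would call demo_find_tagged repeatedly with DEMO_TAG_DIRTY, while a sync-style caller would use demo_mapping_tagged afterwards to see whether the whole mapping was written.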

The Writeback tag is used by the filemap*wait* and sync_page* functions, via filemap_fdatawait_range, to wait for all writeback to complete.
写回标记由 filemap*wait* 和 sync_page* 函数使用,通过 filemap_fdatawait_range,等待所有写回完成。

An address_space handler may attach extra information to a page, typically using the 'private' field in the 'struct page'. If such information is attached, the PG_Private flag should be set. This will cause various VM routines to make extra calls into the address_space handler to deal with that data.
地址空间处理程序可以附加额外的信息到页面,通常使用 'struct page' 中的 'private' 字段。如果附加了这样的信息,则应设置 PG_Private 标志。这将导致各种虚拟内存例程调用地址空间处理程序以处理该数据。

An address space acts as an intermediate between storage and application. Data is read into the address space a whole page at a time, and provided to the application either by copying of the page, or by memory-mapping the page. Data is written into the address space by the application, and then written-back to storage typically in whole pages, however the address_space has finer control of write sizes.
地址空间充当存储和应用程序之间的中间层。数据以整页的方式读入地址空间,并通过复制页面或内存映射页面提供给应用程序。数据以应用程序写入地址空间,然后通常以整页的方式写回到存储,但地址空间可以更精细地控制写入大小。

The read process essentially only requires 'read_folio'. The write process is more complicated and uses write_begin/write_end or dirty_folio to write data into the address_space, and writepage and writepages to writeback data to storage.
读取过程基本上只需要 'read_folio'。写入过程更为复杂,使用 write_begin/write_end 或 dirty_folio 将数据写入地址空间,使用 writepage 和 writepages 将数据写回到存储。

Adding and removing pages to/from an address_space is protected by the inode's i_mutex.
向地址空间添加和删除页面受到 inode 的 i_mutex 保护。

When data is written to a page, the PG_Dirty flag should be set. It typically remains set until writepage asks for it to be written. This should clear PG_Dirty and set PG_Writeback. It can be actually written at any point after PG_Dirty is clear. Once it is known to be safe, PG_Writeback is cleared.
当数据写入页面时,应设置 PG_Dirty 标志。通常情况下,该标志保持设置,直到 writepage 要求将其写入。这应清除 PG_Dirty 并设置 PG_Writeback。在确定安全后,可以在清除 PG_Dirty 后的任何时刻实际写入。一旦知道是安全的,就会清除 PG_Writeback。
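
The flag lifecycle in the paragraph above can be modelled as a small state machine. This is a userspace sketch with hypothetical names (demo_fpage, demo_write_data, demo_start_writeout, demo_end_writeout); it only tracks the two flags and performs no I/O.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_fpage {
    bool dirty;      /* models PG_Dirty */
    bool writeback;  /* models PG_Writeback */
};

/* Data was written into the page: mark it dirty. */
void demo_write_data(struct demo_fpage *p)
{
    p->dirty = true;
}

/* Model of writepage starting writeout: clear dirty, set writeback.
 * Returns false if the page was not dirty and nothing was started. */
bool demo_start_writeout(struct demo_fpage *p)
{
    if (!p->dirty)
        return false;
    p->dirty = false;
    p->writeback = true;
    return true;
}

/* The data is known to be safe on storage: clear writeback. */
void demo_end_writeout(struct demo_fpage *p)
{
    assert(p->writeback);
    p->writeback = false;
}
```

Because the dirty flag is cleared before the data reaches storage, a page in this model can be dirtied again while writeback is still in progress, in which case both flags are set for a while.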

Writeback makes use of a writeback_control structure to direct the operations. This gives the writepage and writepages operations some information about the nature of and reason for the writeback request, and the constraints under which it is being done. It is also used to return information back to the caller about the result of a writepage or writepages request.
写回使用 writeback_control 结构来指导操作。这使得 writepage 和 writepages 操作可以获得有关写回请求的性质和原因以及正在进行的约束的一些信息,并用于向调用者返回有关 writepage 或 writepages 请求结果的信息。
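
As a rough illustration of how such a control structure both directs a writeback pass and carries results back, consider the following userspace model. The names demo_wbc and demo_writepages are hypothetical, and the range is expressed in page indexes purely for simplicity; this is an assumption of the sketch, not the layout of the kernel's writeback_control.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_sync_mode { DEMO_WB_SYNC_NONE, DEMO_WB_SYNC_ALL };

struct demo_wbc {
    enum demo_sync_mode sync_mode;
    long nr_to_write;    /* budget in; decremented so the caller sees usage */
    size_t range_start;  /* first page index (used for SYNC_ALL) */
    size_t range_end;    /* last page index, inclusive (used for SYNC_ALL) */
};

/* "Write" dirty pages as directed by wbc; returns the number written. */
int demo_writepages(bool *dirty, size_t nr_pages, struct demo_wbc *wbc)
{
    size_t first = 0, last;
    int written = 0;

    if (nr_pages == 0)
        return 0;
    last = nr_pages - 1;
    if (wbc->sync_mode == DEMO_WB_SYNC_ALL) {
        first = wbc->range_start;          /* the range must be written */
        if (wbc->range_end < last)
            last = wbc->range_end;
    }
    for (size_t i = first; i <= last; i++) {
        if (!dirty[i])
            continue;
        if (wbc->sync_mode == DEMO_WB_SYNC_NONE && wbc->nr_to_write <= 0)
            break;                         /* best effort: budget used up */
        dirty[i] = false;
        wbc->nr_to_write--;
        written++;
    }
    return written;
}
```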

Handling errors during writeback

处理写回期间的错误

Most applications that do buffered I/O will periodically call a file synchronization call (fsync, fdatasync, msync or sync_file_range) to ensure that data written has made it to the backing store. When there is an error during writeback, they expect that error to be reported when a file sync request is made. After an error has been reported on one request, subsequent requests on the same file descriptor should return 0, unless further writeback errors have occurred since the previous file synchronization.
大多数进行缓冲I/O的应用程序会定期调用文件同步调用(fsync、fdatasync、msync或sync_file_range),以确保已写入的数据已经到达后备存储。当写回期间发生错误时,它们期望在发出文件同步请求时报告该错误。在一个请求上报告了错误之后,除非自上次文件同步以来发生了进一步的写回错误,否则对同一文件描述符的后续请求应该返回0。

Ideally, the kernel would report errors only on file descriptions on which writes were done that subsequently failed to be written back. The generic pagecache infrastructure does not track the file descriptions that have dirtied each individual page however, so determining which file descriptors should get back an error is not possible.
理想情况下,内核应该只在对写入后未能写回的文件描述符上报告错误。然而,通用页面缓存基础设施并不跟踪每个页面已脏的文件描述符,因此无法确定哪些文件描述符应该返回错误。

Instead, the generic writeback error tracking infrastructure in the kernel settles for reporting errors to fsync on all file descriptions that were open at the time that the error occurred. In a situation with multiple writers, all of them will get back an error on a subsequent fsync, even if all of the writes done through that particular file descriptor succeeded (or even if there were no writes on that file descriptor at all).
因此,内核中的通用写回错误跟踪基础设施只能报告在错误发生时打开的所有文件描述符上的fsync错误。在存在多个写入者的情况下,即使通过特定文件描述符进行的所有写入都成功(甚至如果该文件描述符上根本没有写入),所有写入者在随后的fsync上都会收到错误。

Filesystems that wish to use this infrastructure should call mapping_set_error to record the error in the address_space when it occurs. Then, after writing back data from the pagecache in their file->fsync operation, they should call file_check_and_advance_wb_err to ensure that the struct file's error cursor has advanced to the correct point in the stream of errors emitted by the backing device(s).
希望使用此基础设施的文件系统应在发生错误时调用mapping_set_error来记录地址空间中的错误。然后,在其file->fsync操作中从页面缓存中写回数据后,它们应调用file_check_and_advance_wb_err来确保struct file的错误光标已经移动到由后备设备发出的错误流中的正确位置。
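
The "error cursor" idea can be sketched in userspace as follows. This is a deliberately simplified model, not the kernel's actual error-tracking type: demo_wb_mapping, demo_wb_file and the demo_* functions are hypothetical names that merely mirror the roles of mapping_set_error and file_check_and_advance_wb_err described above.

```c
#include <assert.h>
#include <errno.h>

struct demo_wb_mapping {
    unsigned int err_seq;  /* bumped every time an error is recorded */
    int err;               /* most recent error, as a negative errno */
};

struct demo_wb_file {
    struct demo_wb_mapping *mapping;
    unsigned int seen_seq; /* cursor: last error sequence already reported */
};

/* Opening samples the current position in the stream of errors. */
void demo_wb_open(struct demo_wb_file *f, struct demo_wb_mapping *m)
{
    f->mapping = m;
    f->seen_seq = m->err_seq;
}

/* Role of mapping_set_error: record a writeback failure. */
void demo_mapping_set_error(struct demo_wb_mapping *m, int err)
{
    m->err = err;
    m->err_seq++;
}

/* Role of file_check_and_advance_wb_err: report once, then advance. */
int demo_check_and_advance(struct demo_wb_file *f)
{
    if (f->seen_seq == f->mapping->err_seq)
        return 0;
    f->seen_seq = f->mapping->err_seq;
    return f->mapping->err;
}
```

Each open file keeps its own cursor, so in this model every file that was open when the error occurred reports it exactly once on its next fsync-style check, and returns 0 afterwards until a new error is recorded.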

struct address_space_operations

This describes how the VFS can manipulate mapping of a file to page cache in your filesystem. The following members are defined:
这描述了 VFS 如何在文件系统中操作文件到页面缓存的映射。以下成员被定义:

struct address_space_operations {
        int (*writepage)(struct page *page, struct writeback_control *wbc);
        int (*read_folio)(struct file *, struct folio *);
        int (*writepages)(struct address_space *, struct writeback_control *);
        bool (*dirty_folio)(struct address_space *, struct folio *);
        void (*readahead)(struct readahead_control *);
        int (*write_begin)(struct file *, struct address_space *mapping,
                           loff_t pos, unsigned len,
                           struct page **pagep, void **fsdata);
        int (*write_end)(struct file *, struct address_space *mapping,
                         loff_t pos, unsigned len, unsigned copied,
                         struct page *page, void *fsdata);
        sector_t (*bmap)(struct address_space *, sector_t);
        void (*invalidate_folio) (struct folio *, size_t start, size_t len);
        bool (*release_folio)(struct folio *, gfp_t);
        void (*free_folio)(struct folio *);
        ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
        int (*migrate_folio)(struct address_space *, struct folio *dst,
                        struct folio *src, enum migrate_mode);
        int (*launder_folio) (struct folio *);

        bool (*is_partially_uptodate) (struct folio *, size_t from,
                                       size_t count);
        void (*is_dirty_writeback)(struct folio *, bool *, bool *);
        int (*error_remove_page) (struct address_space *mapping, struct page *page);
        int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span);
        int (*swap_deactivate)(struct file *);
        int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
};
  • writepage
    called by the VM to write a dirty page to backing store. This may happen for data integrity reasons (i.e. 'sync'), or to free up memory (flush). The difference can be seen in wbc->sync_mode. The PG_Dirty flag has been cleared and PageLocked is true. writepage should start writeout, should set PG_Writeback, and should make sure the page is unlocked, either synchronously or asynchronously when the write operation completes.
    由 VM 调用以将脏页写入后备存储。这可能是出于数据完整性的原因(即“同步”),或者为了释放内存(刷新)。可以通过 wbc->sync_mode 来看到区别。PG_Dirty 标志已被清除,PageLocked 为真。writepage 应该开始写出,应该设置 PG_Writeback,并确保在写操作完成时,页面被解锁,无论是同步还是异步。

    If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to try too hard if there are problems, and may choose to write out other pages from the mapping if that is easier (e.g. due to internal dependencies). If it chooses not to start writeout, it should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep calling ->writepage on that page.
    如果 wbc->sync_mode 是 WB_SYNC_NONE,则如果存在问题,->writepage 不必尝试太努力,并且可以选择写出映射中的其他页面(例如由于内部依赖关系)。如果它选择不开始写出,它应该返回 AOP_WRITEPAGE_ACTIVATE,以便 VM 不会继续在该页面上调用 ->writepage。

    See the file "Locking" for more details.
    有关更多详细信息,请参阅“锁定”文件。

  • read_folio
    Called by the page cache to read a folio from the backing store. The 'file' argument supplies authentication information to network filesystems, and is generally not used by block based filesystems. It may be NULL if the caller does not have an open file (eg if the kernel is performing a read for itself rather than on behalf of a userspace process with an open file).
    由页面缓存调用以从后备存储中读取 folio。'file' 参数为网络文件系统提供认证信息,并且通常不被基于块的文件系统使用。如果调用者没有打开文件(例如内核正在为自身执行读取而不是代表具有打开文件的用户空间进程执行读取),则可能为 NULL。

    If the mapping does not support large folios, the folio will contain a single page. The folio will be locked when read_folio is called. If the read completes successfully, the folio should be marked uptodate. The filesystem should unlock the folio once the read has completed, whether it was successful or not. The filesystem does not need to modify the refcount on the folio; the page cache holds a reference count and that will not be released until the folio is unlocked.
    如果映射不支持大 folio,则 folio 将包含单个页面。在调用 read_folio 时,folio 将被锁定。如果读取成功完成,folio 应标记为 uptodate。文件系统应在读取完成后解锁 folio,无论成功与否。文件系统不需要修改 folio 的引用计数;页面缓存保持引用计数,直到 folio 被解锁。

    Filesystems may implement ->read_folio() synchronously. In normal operation, folios are read through the ->readahead() method. Only if this fails, or if the caller needs to wait for the read to complete will the page cache call ->read_folio(). Filesystems should not attempt to perform their own readahead in the ->read_folio() operation.
    文件系统可以同步实现 ->read_folio()。在正常操作中,folio 通过 ->readahead() 方法读取。只有在这失败,或者如果调用者需要等待读取完成,页面缓存才会调用 ->read_folio()。文件系统不应尝试在 ->read_folio() 操作中执行自己的预读。

    If the filesystem cannot perform the read at this time, it can unlock the folio, do whatever action it needs to ensure that the read will succeed in the future and return AOP_TRUNCATED_PAGE. In this case, the caller should look up the folio, lock it, and call ->read_folio again.
    如果文件系统此时无法执行读取,它可以解锁 folio,执行任何必要的操作以确保将来读取成功,并返回 AOP_TRUNCATED_PAGE。在这种情况下,调用者应查找 folio,锁定它,并再次调用 ->read_folio。

    Callers may invoke the ->read_folio() method directly, but using read_mapping_folio() will take care of locking, waiting for the read to complete and handle cases such as AOP_TRUNCATED_PAGE.
    调用者可以直接调用 ->read_folio() 方法,但使用 read_mapping_folio() 将负责锁定、等待读取完成并处理诸如 AOP_TRUNCATED_PAGE 之类的情况。

  • writepages
    called by the VM to write out pages associated with the address_space object. If wbc->sync_mode is WB_SYNC_ALL, then the writeback_control will specify a range of pages that must be written out. If it is WB_SYNC_NONE, then a nr_to_write is given and that many pages should be written if possible. If no ->writepages is given, then mpage_writepages is used instead. This will choose pages from the address space that are tagged as DIRTY and will pass them to ->writepage.
    由 VM 调用以写出与 address_space 对象关联的页面。如果 wbc->sync_mode 是 WB_SYNC_ALL,则 writeback_control 将指定必须写出的页面范围。如果是 WB_SYNC_NONE,则给出 nr_to_write,并且应尽可能写入那么多页面。如果没有给出 ->writepages,则将使用 mpage_writepages。这将选择标记为 DIRTY 的地址空间中的页面,并将它们传递给 ->writepage。

  • dirty_folio
    called by the VM to mark a folio as dirty. This is particularly needed if an address space attaches private data to a folio, and that data needs to be updated when a folio is dirtied. This is called, for example, when a memory mapped page gets modified. If defined, it should set the folio dirty flag, and the PAGECACHE_TAG_DIRTY search mark in i_pages.
    由 VM 调用以将 folio 标记为脏。如果地址空间附加了私有数据到 folio,并且在脏化 folio 时需要更新该数据,则特别需要这样做。例如,当内存映射页面被修改时会调用此函数。如果定义了,应设置 folio 的脏标志,并在 i_pages 中设置 PAGECACHE_TAG_DIRTY 搜索标记。

  • readahead
    Called by the VM to read pages associated with the address_space object. The pages are consecutive in the page cache and are locked. The implementation should decrement the page refcount after starting I/O on each page. Usually the page will be unlocked by the I/O completion handler. The set of pages is divided into some sync pages followed by some async pages; rac->ra->async_size gives the number of async pages. The filesystem should attempt to read all sync pages but may decide to stop once it reaches the async pages. If it does decide to stop attempting I/O, it can simply return. The caller will remove the remaining pages from the address space, unlock them and decrement the page refcount. Set PageUptodate if the I/O completes successfully. Setting PageError on any page will be ignored; simply unlock the page if an I/O error occurs.
    由 VM 调用以读取与 address_space 对象关联的页面。页面在页面缓存中是连续的,并且已锁定。实现应在启动每个页面的 I/O 后递减页面引用计数。通常情况下,页面将在 I/O 完成处理程序中被解锁。页面集被分为一些同步页面,然后是一些异步页面,rac->ra->async_size 给出了异步页面的数量。文件系统应尝试读取所有同步页面,但一旦达到异步页面,可以决定停止。如果决定停止尝试 I/O,可以简单地返回。调用者将从地址空间中移除剩余页面,解锁它们并递减页面引用计数。如果 I/O 完成成功,则设置 PageUptodate。在任何页面上设置 PageError 将被忽略;如果发生 I/O 错误,只需解锁页面。

  • write_begin
    Called by the generic buffered write code to ask the filesystem to prepare to write len bytes at the given offset in the file. The address_space should check that the write will be able to complete, by allocating space if necessary and doing any other internal housekeeping. If the write will update parts of any basic-blocks on storage, then those blocks should be pre-read (if they haven't been read already) so that the updated blocks can be written out properly.
    由通用缓冲写入代码调用,以请求文件系统准备在文件中给定偏移处写入 len 字节。地址空间应检查写入是否能够完成,必要时分配空间并进行任何其他内部工作。如果写入将更新存储上任何基本块的部分,则应预读这些块(如果尚未读取),以便可以正确地写出更新的块。

    The filesystem must return the locked pagecache page for the specified offset, in *pagep, for the caller to write into.
    文件系统必须将指定偏移处的锁定 pagecache 页面,在 *pagep 中返回给调用者写入。

    It must be able to cope with short writes (where the length passed to write_begin is greater than the number of bytes copied into the page).
    它必须能够处理短写入(write_begin 传递的长度大于复制到页面中的字节数)。

    A void * may be returned in fsdata, which then gets passed into write_end.
    可以在 fsdata 中返回 void *,然后将其传递到 write_end。

    Returns 0 on success; < 0 on failure (which is the error code), in which case write_end is not called.
    成功返回 0;失败返回 < 0(即错误代码),在这种情况下不会调用 write_end。

  • write_end
    After a successful write_begin, and data copy, write_end must be called. len is the original len passed to write_begin, and copied is the amount that was able to be copied.
    在成功的 write_begin 和数据复制后,必须调用 write_end。len 是传递给 write_begin 的原始长度,copied 是能够复制的数量。

    The filesystem must take care of unlocking the page, releasing its refcount, and updating i_size.
    文件系统必须负责解锁页面并释放它的引用计数,并更新 i_size。

    Returns < 0 on failure, otherwise the number of bytes (<= 'copied') that were able to be copied into pagecache.
    失败返回 < 0,否则返回能够复制到 pagecache 中的字节数(<= 'copied')。

  • bmap
    called by the VFS to map a logical block offset within object to physical block number. This method is used by the FIBMAP ioctl and for working with swap-files. To be able to swap to a file, the file must have a stable mapping to a block device. The swap system does not go through the filesystem but instead uses bmap to find out where the blocks in the file are and uses those addresses directly.
    VFS调用bmap来将对象内的逻辑块偏移映射到物理块号。该方法被FIBMAP ioctl调用,用于处理交换文件。为了能够交换到文件,文件必须与块设备有稳定的映射关系。交换系统不通过文件系统,而是使用bmap来查找文件中的块位置,并直接使用这些地址。

  • invalidate_folio
    If a folio has private data, then invalidate_folio will be called when part or all of the folio is to be removed from the address space. This generally corresponds to either a truncation, punch hole or a complete invalidation of the address space (in the latter case 'offset' will always be 0 and 'length' will be folio_size()). Any private data associated with the folio should be updated to reflect this truncation. If offset is 0 and length is folio_size(), then the private data should be released, because the folio must be able to be completely discarded. This may be done by calling the ->release_folio function, but in this case the release MUST succeed.
    如果一个folio具有私有数据,那么当部分或全部folio将从地址空间中移除时,将调用invalidate_folio。这通常对应于截断、打孔或完全使地址空间无效(在后一种情况下,'offset'将始终为0,'length'将为folio_size())。应更新与folio关联的任何私有数据以反映此截断。如果offset为0且length为folio_size(),则应释放私有数据,因为folio必须能够完全丢弃。这可以通过调用->release_folio函数来完成,但在这种情况下,释放必须成功。

  • release_folio
    release_folio is called on folios with private data to tell the filesystem that the folio is about to be freed. ->release_folio should remove any private data from the folio and clear the private flag. If release_folio() fails, it should return false. release_folio() is used in two distinct though related cases. The first is when the VM wants to free a clean folio with no active users. If ->release_folio succeeds, the folio will be removed from the address_space and be freed.
    对具有私有数据的folios调用release_folio,以通知文件系统该folio即将被释放。->release_folio应从folio中移除任何私有数据并清除私有标志。如果release_folio()失败,应返回false。release_folio()用于两种不同但相关的情况。第一种是当VM想要释放一个没有活跃用户的干净folio时。如果->release_folio成功,该folio将从address_space中移除并被释放。

    The second case is when a request has been made to invalidate some or all folios in an address_space. This can happen through the fadvise(POSIX_FADV_DONTNEED) system call or by the filesystem explicitly requesting it as nfs and 9p do (when they believe the cache may be out of date with storage) by calling invalidate_inode_pages2(). If the filesystem makes such a call, and needs to be certain that all folios are invalidated, then its release_folio will need to ensure this. Possibly it can clear the uptodate flag if it cannot free private data yet.
    第二种情况是当请求已经发出以使地址空间中的一些或所有folios无效。这可以通过fadvise(POSIX_FADV_DONTNEED)系统调用或通过文件系统显式请求(例如nfs和9p)来实现(当它们认为缓存可能与存储不一致时),通过调用invalidate_inode_pages2()。如果文件系统进行了这样的调用,并且需要确保所有folios都被使无效,则其release_folio将需要确保这一点。如果尚不能释放私有数据,可能可以清除uptodate标志。

  • free_folio
    free_folio is called once the folio is no longer visible in the page cache in order to allow the cleanup of any private data. Since it may be called by the memory reclaimer, it should not assume that the original address_space mapping still exists, and it should not block.
    一旦folio在页面缓存中不再可见,就会调用free_folio以允许清理任何私有数据。由于可能会被内存回收器调用,因此不应假定原始address_space映射仍然存在,并且不应阻塞。

  • direct_IO
    called by the generic read/write routines to perform direct_IO - that is IO requests which bypass the page cache and transfer data directly between the storage and the application's address space.
    由通用的读/写例程调用以执行直接IO - 即绕过页面缓存并在存储和应用程序地址空间之间直接传输数据的IO请求。

  • migrate_folio
    This is used to compact the physical memory usage. If the VM wants to relocate a folio (maybe from a memory device that is signalling imminent failure) it will pass a new folio and an old folio to this function. migrate_folio should transfer any private data across and update any references that it has to the folio.
    用于压缩物理内存使用。如果VM想要重新定位一个folio(可能来自一个发出即将失败信号的内存设备),它将向此函数传递一个新folio和一个旧folio。migrate_folio应传输任何私有数据并更新其对folio的任何引用。

  • launder_folio
    Called before freeing a folio - it writes back the dirty folio. To prevent redirtying the folio, it is kept locked during the whole operation.
    在释放folio之前调用 - 它会将脏folio写回。为了防止重新标记脏folio,在整个操作期间它会保持锁定状态。

  • is_partially_uptodate
    Called by the VM when reading a file through the pagecache when the underlying blocksize is smaller than the size of the folio. If the required block is up to date then the read can complete without needing I/O to bring the whole page up to date.
    当通过页面缓存读取文件时,底层块大小小于folio大小时,VM会调用此函数。如果所需块是最新的,则读取可以在不需要I/O将整个页面更新的情况下完成。

  • is_dirty_writeback
    Called by the VM when attempting to reclaim a folio. The VM uses dirty and writeback information to determine if it needs to stall to allow flushers a chance to complete some IO. Ordinarily it can use folio_test_dirty and folio_test_writeback but some filesystems have more complex state (unstable folios in NFS prevent reclaim) or do not set those flags due to locking problems. This callback allows a filesystem to indicate to the VM if a folio should be treated as dirty or writeback for the purposes of stalling.
    在尝试回收folio时,VM会调用此函数。VM使用脏和写回信息来确定是否需要停顿以允许刷新程序完成一些IO。通常可以使用folio_test_dirty和folio_test_writeback,但某些文件系统具有更复杂的状态(NFS中的不稳定folios阻止回收),或者由于锁定问题而不设置这些标志。此回调允许文件系统指示给VM,对于停顿的目的,一个folio是否应被视为脏或写回。

  • error_remove_page
    normally set to generic_error_remove_page if truncation is ok for this address space. Used for memory failure handling. Setting this implies you deal with pages going away under you, unless you have them locked or reference counts increased.
    通常设置为generic_error_remove_page,如果对于此地址空间来说截断是可以的,则用于内存故障处理。设置这个意味着你要处理页面在你下面消失的情况,除非你锁定了它们或者增加了引用计数。

  • swap_activate
    Called to prepare the given file for swap. It should perform any validation and preparation necessary to ensure that writes can be performed with minimal memory allocation. It should call add_swap_extent(), or the helper iomap_swapfile_activate(), and return the number of extents added. If IO should be submitted through ->swap_rw(), it should set SWP_FS_OPS, otherwise IO will be submitted directly to the block device sis->bdev.
    用于准备给定的文件进行交换。它应执行任何必要的验证和准备工作,以确保可以进行最小内存分配的写入。它应调用add_swap_extent()或辅助函数iomap_swapfile_activate(),并返回添加的范围数量。如果应通过->swap_rw()提交IO,它应设置SWP_FS_OPS,否则IO将直接提交给块设备sis->bdev。

  • swap_deactivate
    Called during swapoff on files where swap_activate was successful.
    在swapoff期间对成功调用swap_activate的文件进行调用。

  • swap_rw
    Called to read or write swap pages when SWP_FS_OPS is set.
    在设置了SWP_FS_OPS时,用于读取或写入交换页面。
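
The write_begin/write_end contract from the list above, including the short-write case, can be sketched with a one-page userspace model. Everything here is hypothetical (demo_wfile, demo_write_begin, demo_write_end, demo_buffered_write, the 16-byte page size, and the choice of -ENOSPC as the failure code); it only illustrates that len is what was requested, copied is what actually arrived, and write_end is skipped when write_begin fails.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 16

struct demo_wfile {
    char page[DEMO_PAGE_SIZE];  /* a single pagecache page */
    size_t i_size;
    int locked;
};

/* Model of ->write_begin: check the write can complete, lock the page
 * and hand it back through *pagep. */
int demo_write_begin(struct demo_wfile *f, size_t pos, size_t len,
                     char **pagep)
{
    if (pos + len > DEMO_PAGE_SIZE)
        return -ENOSPC;  /* assumed error code; write_end is not called */
    f->locked = 1;
    *pagep = f->page;
    return 0;
}

/* Model of ->write_end: unlock and update i_size from what was copied. */
int demo_write_end(struct demo_wfile *f, size_t pos, size_t len,
                   size_t copied)
{
    (void)len;  /* original request; a real filesystem may need it */
    f->locked = 0;
    if (pos + copied > f->i_size)
        f->i_size = pos + copied;
    return (int)copied;
}

/* Model of the generic caller; only `avail` source bytes can be copied,
 * which may be fewer than len (a short write). */
int demo_buffered_write(struct demo_wfile *f, size_t pos, const char *src,
                        size_t len, size_t avail)
{
    char *page;
    size_t copied;
    int err = demo_write_begin(f, pos, len, &page);

    if (err < 0)
        return err;
    copied = avail < len ? avail : len;
    memcpy(page + pos, src, copied);
    return demo_write_end(f, pos, len, copied);
}
```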

The File Object

A file object represents a file opened by a process. This is also known as an "open file description" in POSIX parlance.
文件对象表示进程打开的文件。在POSIX术语中,这也被称为“打开文件描述符”。

struct file_operations

This describes how the VFS can manipulate an open file. As of kernel 4.18, the following members are defined:
这描述了VFS如何操作打开的文件。截至内核4.18,定义了以下成员:

struct file_operations {
        struct module *owner;
        loff_t (*llseek) (struct file *, loff_t, int);
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
        ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
        ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
        int (*iopoll)(struct kiocb *kiocb, bool spin);
        int (*iterate) (struct file *, struct dir_context *);
        int (*iterate_shared) (struct file *, struct dir_context *);
        __poll_t (*poll) (struct file *, struct poll_table_struct *);
        long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
        long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
        int (*mmap) (struct file *, struct vm_area_struct *);
        int (*open) (struct inode *, struct file *);
        int (*flush) (struct file *, fl_owner_t id);
        int (*release) (struct inode *, struct file *);
        int (*fsync) (struct file *, loff_t, loff_t, int datasync);
        int (*fasync) (int, struct file *, int);
        int (*lock) (struct file *, int, struct file_lock *);
        ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
        unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
        int (*check_flags)(int);
        int (*flock) (struct file *, int, struct file_lock *);
        ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
        ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
        int (*setlease)(struct file *, long, struct file_lock **, void **);
        long (*fallocate)(struct file *file, int mode, loff_t offset,
                          loff_t len);
        void (*show_fdinfo)(struct seq_file *m, struct file *f);
#ifndef CONFIG_MMU
        unsigned (*mmap_capabilities)(struct file *);
#endif
        ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
        loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
                                   struct file *file_out, loff_t pos_out,
                                   loff_t len, unsigned int remap_flags);
        int (*fadvise)(struct file *, loff_t, loff_t, int);
};

Again, all methods are called without any locks being held, unless otherwise noted.
同样,除非另有说明,所有方法在没有任何锁的情况下调用。

  • llseek
    called when the VFS needs to move the file position index
    在VFS需要移动文件位置索引时调用

  • read
    called by read(2) and related system calls
    由read(2)和相关系统调用调用

  • read_iter
    possibly asynchronous read with iov_iter as destination
    可能是带有iov_iter作为目标的异步读取

  • write
    called by write(2) and related system calls
    由write(2)和相关系统调用调用

  • write_iter
    possibly asynchronous write with iov_iter as source
    可能是带有iov_iter作为源的异步写入

  • iopoll
    called when aio wants to poll for completions on HIPRI iocbs
    当aio想要对HIPRI iocbs进行轮询时调用

  • iterate
    called when the VFS needs to read the directory contents
    在VFS需要读取目录内容时调用

  • iterate_shared
    called when the VFS needs to read the directory contents when filesystem supports concurrent dir iterators
    在文件系统支持并发目录迭代器时,VFS需要读取目录内容时调用

  • poll
    called by the VFS when a process wants to check if there is activity on this file and (optionally) go to sleep until there is activity. Called by the select(2) and poll(2) system calls
    当进程想要检查文件上是否有活动(可选地)并进入睡眠状态时,由VFS调用。由select(2)和poll(2)系统调用调用

  • unlocked_ioctl
    called by the ioctl(2) system call.
    由ioctl(2)系统调用调用。

  • compat_ioctl
    called by the ioctl(2) system call when 32 bit system calls are used on 64 bit kernels.
    在64位内核上使用32位系统调用时,由ioctl(2)系统调用调用。

  • mmap
    called by the mmap(2) system call
    由mmap(2)系统调用调用

  • open
    called by the VFS when an inode should be opened. When the VFS opens a file, it creates a new "struct file". It then calls the open method for the newly allocated file structure. You might think that the open method really belongs in "struct inode_operations", and you may be right. I think it's done the way it is because it makes filesystems simpler to implement. The open() method is a good place to initialize the "private_data" member in the file structure if you want to point to a device structure
    在VFS打开inode时调用。当VFS打开文件时,它会创建一个新的“struct file”。然后,它调用新分配的文件结构的open方法。你可能认为open方法实际上属于“struct inode_operations”,你可能是对的。我认为它的实现方式是为了使文件系统更容易实现。如果要指向设备结构,open()方法是初始化文件结构中的“private_data”成员的好地方

  • flush
    called by the close(2) system call to flush a file
    由close(2)系统调用调用以刷新文件

  • release
    called when the last reference to an open file is closed
    当对打开文件的最后一个引用关闭时调用

  • fsync
    called by the fsync(2) system call. Also see the section above entitled "Handling errors during writeback".
    由fsync(2)系统调用调用。还请参阅上面标题为“处理写回期间的错误”的部分。

  • fasync
    called by the fcntl(2) system call when asynchronous (non-blocking) mode is enabled for a file
    当文件启用异步(非阻塞)模式时,由fcntl(2)系统调用调用

  • lock
    called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW commands
    由fcntl(2)系统调用调用F_GETLK、F_SETLK和F_SETLKW命令

  • get_unmapped_area
    called by the mmap(2) system call
    由mmap(2)系统调用调用

  • check_flags
    called by the fcntl(2) system call for F_SETFL command
    由fcntl(2)系统调用调用F_SETFL命令

  • flock
    called by the flock(2) system call
    由flock(2)系统调用调用

  • splice_write
    called by the VFS to splice data from a pipe to a file. This method is used by the splice(2) system call
    由VFS将数据从管道拼接到文件中。此方法由splice(2)系统调用使用

  • splice_read
    called by the VFS to splice data from file to a pipe. This method is used by the splice(2) system call
    由VFS将数据从文件拼接到管道中。此方法由splice(2)系统调用使用

  • setlease
    called by the VFS to set or release a file lock lease. setlease implementations should call generic_setlease to record or remove the lease in the inode after setting it.
    由VFS设置或释放文件锁租约。setlease实现应调用generic_setlease,在设置后在inode中记录或删除租约。

  • fallocate
    called by the VFS to preallocate blocks or punch a hole.
    由VFS预分配块或戳一个洞。

  • copy_file_range
    called by the copy_file_range(2) system call.
    由copy_file_range(2)系统调用调用。

  • remap_file_range
    called by the ioctl(2) system call for FICLONERANGE and FICLONE and FIDEDUPERANGE commands to remap file ranges. An implementation should remap len bytes at pos_in of the source file into the dest file at pos_out. Implementations must handle callers passing in len == 0; this means "remap to the end of the source file". The return value should be the number of bytes remapped, or the usual negative error code if errors occurred before any bytes were remapped. The remap_flags parameter accepts REMAP_FILE_* flags. If REMAP_FILE_DEDUP is set then the implementation must only remap if the requested file ranges have identical contents. If REMAP_FILE_CAN_SHORTEN is set, the caller is ok with the implementation shortening the request length to satisfy alignment or EOF requirements (or any other reason).
    由ioctl(2)系统调用调用FICLONERANGE、FICLONE和FIDEDUPERANGE命令以重新映射文件范围。实现应将源文件中pos_in处的len字节重新映射到目标文件中pos_out处。实现必须处理调用者传入len == 0的情况;这意味着“重新映射到源文件的末尾”。返回值应该是重新映射的字节数,如果在重新映射任何字节之前发生错误,则返回通常的负错误代码。remap_flags参数接受REMAP_FILE_*标志。如果设置了REMAP_FILE_DEDUP,则只有在请求的文件范围具有相同内容时,实现才能重新映射。如果设置了REMAP_FILE_CAN_SHORTEN,则调用者可以接受实现缩短请求长度以满足对齐或EOF要求(或任何其他原因)。

  • fadvise
    possibly called by the fadvise64() system call.
    可能由fadvise64()系统调用调用。
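
Several of the methods listed above are reached from userspace through fcntl(2), flock(2) and close(2). The following is a minimal userspace sketch of those calls, not kernel code. The scratch file name under /tmp and the function name demo_lock_calls() are assumptions made for illustration. Whether a filesystem's own lock or flock method actually runs depends on whether that filesystem defines one; otherwise the VFS falls back to its default handling.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Take and drop a whole-file POSIX write lock, then a BSD-style flock(),
 * on a scratch file. Returns 0 on success, -1 on any failure.
 */
int demo_lock_calls(void)
{
        char path[] = "/tmp/vfs-lock-demo-XXXXXX";  /* hypothetical scratch name */
        struct flock fl = {
                .l_type = F_WRLCK,
                .l_whence = SEEK_SET,
                .l_start = 0,
                .l_len = 0,             /* 0 means "up to end of file" */
        };
        int fd = mkstemp(path);
        int ret = -1;

        if (fd < 0)
                return -1;

        /* fcntl(F_SETLK): served by ->lock when the filesystem defines it */
        if (fcntl(fd, F_SETLK, &fl) < 0)
                goto out;
        fl.l_type = F_UNLCK;
        if (fcntl(fd, F_SETLK, &fl) < 0)
                goto out;

        /* flock(2): served by ->flock when the filesystem defines it */
        if (flock(fd, LOCK_EX | LOCK_NB) < 0)
                goto out;
        if (flock(fd, LOCK_UN) < 0)
                goto out;
        ret = 0;
out:
        /* close(2) is where ->flush is called; ->release follows on the last reference */
        close(fd);
        unlink(path);
        return ret;
}
```

Calling demo_lock_calls() on a local filesystem is expected to return 0; a filesystem without lock support may make it return -1.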

Note that the file operations are implemented by the specific filesystem in which the inode resides. When opening a device node (character or block special) most filesystems will call special support routines in the VFS which will locate the required device driver information. These support routines replace the filesystem file operations with those for the device driver, and then proceed to call the new open() method for the file. This is how opening a device file in the filesystem eventually ends up calling the device driver open() method.
请注意,文件操作由inode所在的特定文件系统实现。当打开设备节点(字符或块特殊文件)时,大多数文件系统将调用VFS中的特殊支持例程,以查找所需的设备驱动程序信息。这些支持例程将文件系统文件操作替换为设备驱动程序的文件操作,然后继续调用文件的新open()方法。这就是在文件系统中打开设备文件最终调用设备驱动程序open()方法的方式。
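
The hand-over described above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the structures are cut-down stand-ins for the kernel's "struct file" and "struct file_operations", and every demo_* name is hypothetical. In the real kernel the support routine locates the driver's operations by device number; here the lookup is replaced by a fixed pointer.

```c
#include <stddef.h>

struct file;

struct file_operations {
        int (*open)(struct file *filp);
};

struct file {
        const struct file_operations *f_op;
        void *private_data;
};

/* Hypothetical driver: its open() points private_data at its device state. */
static int demo_device_state = 42;

static int demo_driver_open(struct file *filp)
{
        filp->private_data = &demo_device_state;
        return 0;
}

static const struct file_operations demo_driver_fops = {
        .open = demo_driver_open,
};

/*
 * Stand-in for the VFS support routine: install the driver's operations
 * in the file, then call the new open() method.
 */
static int demo_special_open(struct file *filp)
{
        filp->f_op = &demo_driver_fops;  /* real code looks this up by device number */
        return filp->f_op->open ? filp->f_op->open(filp) : 0;
}

/* What the filesystem initially installs for a device node. */
static const struct file_operations demo_special_fops = {
        .open = demo_special_open,
};

/* Model of opening a device node; returns the driver state reached, or -1. */
int demo_open_device_node(void)
{
        struct file filp = { .f_op = &demo_special_fops, .private_data = NULL };

        if (filp.f_op->open(&filp) != 0 || filp.f_op != &demo_driver_fops)
                return -1;
        return *(int *)filp.private_data;
}
```

After the call, the file's operations are the driver's, so later operations on the same file go straight to the driver without passing through the filesystem's table again.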

Directory Entry Cache (dcache)

目录条目缓存(dcache)

struct dentry_operations

This describes how a filesystem can overload the standard dentry operations. Dentries and the dcache are the domain of the VFS and the individual filesystem implementations. Device drivers have no business here. These methods may be set to NULL, as they are either optional or the VFS uses a default. As of kernel 2.6.22, the following members are defined:
这描述了文件系统如何重载标准的dentry操作。Dentries和dcache是VFS和各个文件系统实现的领域,设备驱动程序在这里没有业务。这些方法可以设置为NULL,因为它们要么是可选的,要么VFS使用默认值。截至内核2.6.22,以下成员已定义:

struct dentry_operations {
        int (*d_revalidate)(struct dentry *, unsigned int);
        int (*d_weak_revalidate)(struct dentry *, unsigned int);
        int (*d_hash)(const struct dentry *, struct qstr *);
        int (*d_compare)(const struct dentry *,
                         unsigned int, const char *, const struct qstr *);
        int (*d_delete)(const struct dentry *);
        int (*d_init)(struct dentry *);
        void (*d_release)(struct dentry *);
        void (*d_iput)(struct dentry *, struct inode *);
        char *(*d_dname)(struct dentry *, char *, int);
        struct vfsmount *(*d_automount)(struct path *);
        int (*d_manage)(const struct path *, bool);
        struct dentry *(*d_real)(struct dentry *, const struct inode *);
};
  • d_revalidate
    called when the VFS needs to revalidate a dentry. This is called whenever a name look-up finds a dentry in the dcache. Most local filesystems leave this as NULL, because all their dentries in the dcache are valid. Network filesystems are different since things can change on the server without the client necessarily being aware of it.
    当VFS需要重新验证dentry时调用。每当名称查找在dcache中找到一个dentry时都会调用此函数。大多数本地文件系统将其设置为NULL,因为它们在dcache中的所有dentries都是有效的。网络文件系统不同,因为服务器上的内容可能发生变化,而客户端未必知晓。

    This function should return a positive value if the dentry is still valid, and zero or a negative error code if it isn't.
    如果dentry仍然有效,此函数应返回正值;如果无效,则返回零或负错误代码。

    d_revalidate may be called in rcu-walk mode (flags & LOOKUP_RCU). If in rcu-walk mode, the filesystem must revalidate the dentry without blocking or storing to the dentry. d_parent and d_inode should not be used without care (because they can change and, in the d_inode case, even become NULL under us).
    d_revalidate可能在rcu-walk模式下调用(flags & LOOKUP_RCU)。在rcu-walk模式下,文件系统必须在不阻塞、不写入dentry的情况下重新验证dentry。d_parent和d_inode必须谨慎使用(因为它们可能发生变化,d_inode甚至可能变为NULL)。

    If a situation is encountered that rcu-walk cannot handle, return -ECHILD and it will be called again in ref-walk mode.
    如果遇到rcu-walk无法处理的情况,请返回-ECHILD,它将在ref-walk模式下再次调用。

  • d_weak_revalidate
    called when the VFS needs to revalidate a "jumped" dentry. This is called when a path-walk ends at a dentry that was not acquired by doing a lookup in the parent directory. This includes "/", "." and "..", as well as procfs-style symlinks and mountpoint traversal.
    当VFS需要重新验证“跳转”的dentry时调用。当路径遍历结束于并非通过在父目录中查找而获得的dentry时调用。这包括“/”、“.”和“..”,以及procfs风格的符号链接和挂载点遍历。

    In this case, we are less concerned with whether the dentry is still fully correct, but rather that the inode is still valid. As with d_revalidate, most local filesystems will set this to NULL since their dcache entries are always valid.
    在这种情况下,我们不太关心dentry是否仍然完全正确,而更关心inode是否仍然有效。与d_revalidate一样,大多数本地文件系统将其设置为NULL,因为它们的dcache条目始终有效。

    This function has the same return code semantics as d_revalidate.
    此函数具有与d_revalidate相同的返回代码语义。

    d_weak_revalidate is only called after leaving rcu-walk mode.
    仅在离开rcu-walk模式后才调用d_weak_revalidate。

  • d_hash
    called when the VFS adds a dentry to the hash table. The first dentry passed to d_hash is the parent directory that the name is to be hashed into.
    当VFS将dentry添加到哈希表时调用。传递给d_hash的第一个dentry是要将名称哈希到的父目录。

    Same locking and synchronisation rules as d_compare regarding what is safe to dereference etc.
    关于什么是安全解引用等方面的锁定和同步规则与d_compare相同。

  • d_compare
    called to compare a dentry name with a given name. The first dentry is the parent of the dentry to be compared, the second is the child dentry. len and name string are properties of the dentry to be compared. qstr is the name to compare it with.
    用给定名称比较dentry名称时调用。第一个dentry是要比较的dentry的父目录,第二个是子dentry。len和name string是要比较的dentry的属性。qstr是要与之比较的名称。

    Must be constant and idempotent, should not take locks if possible, and should not store into the dentry. Should not dereference pointers outside the dentry without lots of care (e.g. d_parent, d_inode, d_name should not be used).
    必须是常量和幂等的,如果可能的话不应该获取锁,并且不应该写入dentry。不应该在不加谨慎的情况下解引用dentry之外的指针(例如不应使用d_parent、d_inode、d_name)。

    However, our vfsmount is pinned, and RCU held, so the dentries and inodes won't disappear, neither will our sb or filesystem module. ->d_sb may be used.
    但是,我们的vfsmount被固定,RCU被持有,因此dentries和inodes不会消失,我们的sb或文件系统模块也不会消失。->d_sb可能会被使用。

    It is a tricky calling convention because it needs to be called under "rcu-walk", ie. without any locks or references on things.
    这是一个棘手的调用约定,因为它需要在“rcu-walk”下调用,即在没有任何锁或引用的情况下调用。

  • d_delete
    called when the last reference to a dentry is dropped and the dcache is deciding whether or not to cache it. Return 1 to delete immediately, or 0 to cache the dentry. Default is NULL which means to always cache a reachable dentry. d_delete must be constant and idempotent.
    当对dentry的最后一个引用被释放并且dcache正在决定是否缓存它时调用。返回1立即删除,返回0缓存dentry。默认值为NULL,这意味着始终缓存可达的dentry。d_delete必须是常量和幂等的。

  • d_init
    called when a dentry is allocated
    当分配dentry时调用

  • d_release
    called when a dentry is really deallocated
    当dentry真正被释放时调用

  • d_iput
    called when a dentry loses its inode (just prior to its being deallocated). The default when this is NULL is that the VFS calls iput(). If you define this method, you must call iput() yourself
    当dentry失去其inode时调用(在其被释放之前)。当此值为NULL时,默认情况下VFS调用iput()。如果定义了此方法,必须自己调用iput()

  • d_dname
    called when the pathname of a dentry should be generated. Useful for some pseudo filesystems (sockfs, pipefs, ...) to delay pathname generation. (Instead of doing it when the dentry is created, it's done only when the path is needed.) Real filesystems probably don't want to use it, because their dentries are present in the global dcache hash, so their hash should be an invariant. As no lock is held, d_dname() should not try to modify the dentry itself, unless appropriate SMP safety is used. CAUTION: d_path() logic is quite tricky. The correct way to return, for example, "Hello" is to put it at the end of the buffer and return a pointer to the first char. The dynamic_dname() helper function is provided to take care of this.
    当应该生成dentry的路径名时调用。对于一些伪文件系统(如sockfs、pipefs等),这对于延迟路径名生成很有用(不是在创建dentry时进行,而是在需要路径时进行)。真实的文件系统可能不想使用它,因为它们的dentries存在于全局dcache哈希中,因此它们的哈希应该是不变的。由于没有锁被持有,d_dname()不应该尝试修改dentry本身,除非使用了适当的SMP安全性。注意:d_path()逻辑非常棘手。例如,正确返回“Hello”的方法是将其放在缓冲区的末尾,并返回指向第一个字符的指针。内核提供了dynamic_dname()辅助函数来处理这个问题。

Example:

/* Generate "pipe:[<inode number>]" on demand instead of storing a name. */
static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
{
        return dynamic_dname(dentry, buffer, buflen, "pipe:[%lu]",
                        dentry->d_inode->i_ino);
}
  • d_automount
    called when an automount dentry is to be traversed (optional). This should create a new VFS mount record and return the record to the caller. The caller is supplied with a path parameter giving the automount directory to describe the automount target and the parent VFS mount record to provide inheritable mount parameters. NULL should be returned if someone else managed to make the automount first. If the vfsmount creation failed, then an error code should be returned. If -EISDIR is returned, then the directory will be treated as an ordinary directory and returned to pathwalk to continue walking.
    当要遍历一个自动挂载的dentry时调用(可选)。这应该创建一个新的VFS挂载记录并将记录返回给调用者。调用者提供了一个路径参数,给出了要描述自动挂载目标的自动挂载目录,以及提供可继承的挂载参数的父VFS挂载记录。如果有人先制作了自动挂载,则应返回NULL。如果vfsmount创建失败,则应返回错误代码。如果返回-EISDIR,则该目录将被视为普通目录,并返回到pathwalk以继续遍历。

    If a vfsmount is returned, the caller will attempt to mount it on the mountpoint and will remove the vfsmount from its expiration list in the case of failure. The vfsmount should be returned with 2 refs on it to prevent automatic expiration - the caller will clean up the additional ref.
    如果返回vfsmount,调用者将尝试在挂载点上挂载它,并在失败的情况下从其到期列表中删除vfsmount。vfsmount应该返回带有2个引用的vfsmount,以防止自动到期-调用者将清理额外的引用。

    This function is only used if DCACHE_NEED_AUTOMOUNT is set on the dentry. This is set by __d_instantiate() if S_AUTOMOUNT is set on the inode being added.
    仅当DCACHE_NEED_AUTOMOUNT在dentry上设置时才使用此函数。如果inode被添加时设置了S_AUTOMOUNT,则由__d_instantiate()设置。

  • d_manage
    called to allow the filesystem to manage the transition from a dentry (optional). This allows autofs, for example, to hold up clients waiting to explore behind a 'mountpoint' while letting the daemon go past and construct the subtree there. 0 should be returned to let the calling process continue. -EISDIR can be returned to tell pathwalk to use this directory as an ordinary directory and to ignore anything mounted on it and not to check the automount flag. Any other error code will abort pathwalk completely.
    调用以允许文件系统管理从某个dentry出发的过渡(可选)。例如,这允许autofs让等待探索“挂载点”之后内容的客户端暂停,同时让守护进程通过并在那里构造子树。应返回0以让调用进程继续。可以返回-EISDIR以告诉pathwalk将此目录用作普通目录,忽略挂载在其上的任何内容,并且不检查automount标志。任何其他错误代码都将完全中止pathwalk。

    If the 'rcu_walk' parameter is true, then the caller is doing a pathwalk in RCU-walk mode. Sleeping is not permitted in this mode, and the caller can be asked to leave it and call again by returning -ECHILD. -EISDIR may also be returned to tell pathwalk to ignore d_automount or any mounts.
    如果“rcu_walk”参数为true,则调用者正在以RCU-walk模式进行pathwalk。在此模式下不允许睡眠,并且可以要求调用者离开并再次调用,返回-ECHILD。也可以返回-EISDIR以告诉pathwalk忽略d_automount或任何挂载。

    This function is only used if DCACHE_MANAGE_TRANSIT is set on the dentry being transited from.
    仅当DCACHE_MANAGE_TRANSIT在正在过渡的dentry上设置时才使用此函数。

  • d_real
    overlay/union type filesystems implement this method to return one of the underlying dentries hidden by the overlay. It is used in two different modes:
    覆盖/联合类型的文件系统实现此方法以返回被覆盖的底层dentries之一。它以两种不同的模式使用:

    Called from file_dentry() it returns the real dentry matching the inode argument. The real dentry may be from a lower layer already copied up, but still referenced from the file. This mode is selected with a non-NULL inode argument.
    从file_dentry()调用时,它返回与inode参数匹配的真实dentry。真实的dentry可能来自已经复制上来的较低层,但仍然被文件引用。使用非NULL inode参数选择此模式。

    With NULL inode the topmost real underlying dentry is returned.
    使用NULL inode返回顶层真实的底层dentry。
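
The return-code convention shared by d_revalidate and d_manage (answer without blocking in rcu-walk mode, or return -ECHILD to be called again in ref-walk mode) can be modelled in userspace. This is a sketch under assumptions: the demo_* names, the DEMO_LOOKUP_RCU value and the fields of the model structure are all hypothetical, and the "server" is just a flag standing in for an operation that could block.

```c
#include <errno.h>

#define DEMO_LOOKUP_RCU 0x1     /* stand-in for the kernel's LOOKUP_RCU flag */

struct demo_reval_dentry {
        int cached_ok;          /* verdict that is available without blocking */
        int needs_server;       /* nonzero: must ask the server, which may block */
        int server_says_ok;     /* simulated answer from the server */
};

/*
 * Returns a positive value if the dentry is still valid, 0 if it is not,
 * or -ECHILD if the question cannot be answered in rcu-walk mode.
 */
int demo_d_revalidate(const struct demo_reval_dentry *d, unsigned int flags)
{
        if (!d->needs_server)
                return d->cached_ok ? 1 : 0;
        if (flags & DEMO_LOOKUP_RCU)
                return -ECHILD;         /* would block: ask for a ref-walk retry */
        return d->server_says_ok ? 1 : 0;
}

/* Caller side: try rcu-walk first and fall back to ref-walk on -ECHILD. */
int demo_revalidate_with_fallback(const struct demo_reval_dentry *d)
{
        int ret = demo_d_revalidate(d, DEMO_LOOKUP_RCU);

        if (ret == -ECHILD)
                ret = demo_d_revalidate(d, 0);
        return ret;
}
```

A dentry that needs the server yields -ECHILD on the first attempt and the real verdict on the second; a dentry with a cached verdict never leaves the fast path.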

Each dentry has a pointer to its parent dentry, as well as a hash list of child dentries. Child dentries are basically like files in a directory.
每个dentry都有指向其父dentry的指针,以及子dentries的哈希列表。子dentries基本上类似于目录中的文件。
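
The parent pointer is what makes it possible to turn a dentry back into a pathname, and it also explains the "put it at the end of the buffer" convention mentioned under d_dname: walking from child to parent produces the components in reverse order, so the string is built right to left. Below is a minimal userspace sketch of that idea. The structure and function names are hypothetical, and the model assumes, as in the kernel, that the root's parent pointer refers to the root itself.

```c
#include <stddef.h>
#include <string.h>

struct demo_path_dentry {
        const char *name;
        const struct demo_path_dentry *parent;  /* the root points to itself */
};

/*
 * Build the path of d at the END of buf, walking parent pointers.
 * Returns a pointer to the first character of the path inside buf,
 * or NULL if the buffer is too small.
 */
char *demo_path(const struct demo_path_dentry *d, char *buf, int buflen)
{
        char *p;

        if (buflen < 2)
                return NULL;
        p = buf + buflen;
        *--p = '\0';
        if (d->parent == d) {           /* the root itself */
                *--p = '/';
                return p;
        }
        while (d->parent != d) {
                size_t len = strlen(d->name);

                if ((size_t)(p - buf) < len + 1)
                        return NULL;    /* no room for "/" plus the name */
                p -= len;
                memcpy(p, d->name, len);
                *--p = '/';
                d = d->parent;
        }
        return p;
}
```

The returned pointer is generally not the start of the buffer, which is why callers must use the return value rather than the buffer they passed in.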

Directory Entry Cache API

There are a number of functions defined which permit a filesystem to manipulate dentries:
定义了许多函数,允许文件系统操作dentries:

  • dget
    open a new handle for an existing dentry (this just increments the usage count)
    为现有dentry打开一个新句柄(这只是增加使用计数)

  • dput
    close a handle for a dentry (decrements the usage count). If the usage count drops to 0, and the dentry is still in its parent's hash, the "d_delete" method is called to check whether it should be cached. If it should not be cached, or if the dentry is not hashed, it is deleted. Otherwise cached dentries are put into an LRU list to be reclaimed on memory shortage.
    关闭dentry的句柄(减少使用计数)。如果使用计数降至0,并且dentry仍在其父哈希中,则调用“d_delete”方法以检查是否应该缓存它。如果不应该缓存,或者dentry没有哈希,则删除它。否则,缓存的dentries将放入LRU列表中,在内存短缺时将被回收。

  • d_drop
    this unhashes a dentry from its parent's hash list. A subsequent call to dput() will deallocate the dentry if its usage count drops to 0
    这将从其父哈希列表中取消哈希一个dentry。如果其使用计数降至0,则随后调用dput()将释放dentry。

  • d_delete
    delete a dentry. If there are no other open references to the dentry then the dentry is turned into a negative dentry (the d_iput() method is called). If there are other references, then d_drop() is called instead
    删除一个dentry。如果没有其他对dentry的打开引用,则将dentry转换为负dentry(调用d_iput()方法)。如果有其他引用,则调用d_drop()。

  • d_add
    add a dentry to its parent's hash list and then call d_instantiate()
    将一个dentry添加到其父哈希列表,然后调用d_instantiate()

  • d_instantiate
    add a dentry to the alias hash list for the inode and update the "d_inode" member. The "i_count" member in the inode structure should be set/incremented. If the inode pointer is NULL, the dentry is called a "negative dentry". This function is commonly called when an inode is created for an existing negative dentry
    将一个dentry添加到inode的别名哈希列表,并更新“d_inode”成员。应设置/增加inode结构中的“i_count”成员。如果inode指针为NULL,则称dentry为“负dentry”。当为现有负dentry创建inode时,通常调用此函数。

  • d_lookup
    look up a dentry given its parent and path name component. It looks up the child of that given name from the dcache hash table. If it is found, the reference count is incremented and the dentry is returned. The caller must use dput() to free the dentry when it finishes using it.
    查找给定其父目录和路径名组件的dentry。它从dcache哈希表中查找给定名称的子项。如果找到,引用计数将增加,并返回dentry。调用者在使用完后必须使用dput()来释放dentry。
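
The interplay of dget, dput and d_drop described above can be summarised as a small state model. This is a userspace sketch, not the kernel implementation: the structure, the state names and the demo_* functions are hypothetical, and locking, the real hash table and the real LRU list are left out. It only encodes the decision dput makes when the usage count reaches 0.

```c
#include <stdbool.h>
#include <stddef.h>

enum demo_state { DEMO_IN_USE, DEMO_ON_LRU, DEMO_FREED };

struct demo_ref_dentry {
        int count;              /* usage count */
        bool hashed;            /* still in its parent's hash list? */
        /* optional: return 1 to delete immediately, 0 to cache */
        int (*d_delete)(const struct demo_ref_dentry *);
        enum demo_state state;
};

/* A d_delete method that never wants the dentry cached. */
int demo_always_delete(const struct demo_ref_dentry *d)
{
        (void)d;
        return 1;
}

void demo_dget(struct demo_ref_dentry *d)
{
        d->count++;
        d->state = DEMO_IN_USE;
}

void demo_d_drop(struct demo_ref_dentry *d)
{
        d->hashed = false;
}

void demo_dput(struct demo_ref_dentry *d)
{
        if (--d->count > 0)
                return;                         /* other handles remain */
        if (!d->hashed || (d->d_delete && d->d_delete(d)))
                d->state = DEMO_FREED;          /* unhashed or not worth caching */
        else
                d->state = DEMO_ON_LRU;         /* kept until memory gets short */
}
```

In this model a hashed dentry with no d_delete method ends up on the LRU list after its last dput, while the same dentry is freed if d_drop was called first.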

Mount Options

Parsing options

On mount and remount the filesystem is passed a string containing a comma separated list of mount options. The options can have either of these forms:
在挂载和重新挂载时,文件系统会接收一个包含逗号分隔的挂载选项列表的字符串。选项可以采用以下形式之一:

    option
    option=value

The <linux/parser.h> header defines an API that helps parse these options. There are plenty of examples on how to use it in existing filesystems.
<linux/parser.h>头文件定义了一个API,帮助解析这些选项。现有文件系统中有很多如何使用它的示例。
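
To make the two option forms concrete, here is a minimal userspace sketch that splits a comma-separated string and handles one flag option and one option=value pair. It deliberately does not use the <linux/parser.h> API, which is only available inside the kernel. The option names "ro" and "uid", the structure and the function name are assumptions chosen for illustration.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct demo_opts {
        int ro;                 /* set by the flag option "ro" */
        unsigned long uid;      /* set by "uid=<decimal number>" */
};

/*
 * Parse a comma-separated option string in place (the string is modified).
 * Returns 0 on success, -1 on an unknown or malformed option.
 */
int demo_parse_options(char *s, struct demo_opts *o)
{
        char *next;

        for (; s && *s; s = next) {
                char *val, *end;

                next = strchr(s, ',');
                if (next)
                        *next++ = '\0';         /* terminate this item */
                val = strchr(s, '=');
                if (val)
                        *val++ = '\0';          /* split name from value */

                if (!*s && !val)
                        continue;               /* empty item, e.g. "ro,,uid=1" */
                if (!strcmp(s, "ro") && !val) {
                        o->ro = 1;
                } else if (!strcmp(s, "uid") && val &&
                           isdigit((unsigned char)*val)) {
                        o->uid = strtoul(val, &end, 10);
                        if (*end)
                                return -1;      /* trailing junk after the number */
                } else {
                        return -1;
                }
        }
        return 0;
}
```

Passing a writable copy of "ro,uid=1000" sets both fields; an unknown name such as "bogus" is rejected rather than silently ignored.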

Showing options

If a filesystem accepts mount options, it must define show_options() to show all the currently active options. The rules are:
如果文件系统接受挂载选项,必须定义show_options()以显示所有当前活动的选项。规则如下:

  • options MUST be shown which are not default or their values differ from the default
    必须显示不是默认的或其值与默认值不同的选项

  • options MAY be shown which are enabled by default or have their default value
    可以显示默认启用的或其默认值的选项

Options used only internally between a mount helper and the kernel (such as file descriptors), or which only have an effect during the mounting (such as ones controlling the creation of a journal) are exempt from the above rules.
仅在挂载助手和内核之间内部使用的选项(例如文件描述符),或仅在挂载期间起作用的选项(例如控制日志创建的选项),不受上述规则约束。

The underlying reason for the above rules is to make sure that a mount can be accurately replicated (e.g. umounting and mounting again) based on the information found in /proc/mounts.
上述规则的根本原因是确保可以根据在/proc/mounts中找到的信息准确地复制挂载(例如卸载和再次挂载)。
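
The rules for showing options can be sketched as follows: emit every option whose value differs from its default and stay silent about the rest. This is a userspace model, not a real show_options() implementation. The option names "quiet" and "uid", the default value and the demo_* names are assumptions; in the kernel the text would be written to a seq_file rather than to a character buffer. Each item is written with a leading comma, on the assumption that the generic options come first on the same line.

```c
#include <stdio.h>

#define DEMO_DEFAULT_UID 0u

struct demo_mount {
        int quiet;              /* off by default */
        unsigned int uid;       /* DEMO_DEFAULT_UID by default */
};

/*
 * Write ",name" or ",name=value" for each option that is not at its default.
 * Returns the number of characters written, or -1 if buf is too small.
 */
int demo_show_options(const struct demo_mount *m, char *buf, size_t len)
{
        size_t n = 0;
        int r;

        if (len == 0)
                return -1;
        buf[0] = '\0';

        if (m->quiet) {
                r = snprintf(buf + n, len - n, ",quiet");
                if (r < 0 || (size_t)r >= len - n)
                        return -1;
                n += (size_t)r;
        }
        if (m->uid != DEMO_DEFAULT_UID) {
                r = snprintf(buf + n, len - n, ",uid=%u", m->uid);
                if (r < 0 || (size_t)r >= len - n)
                        return -1;
                n += (size_t)r;
        }
        return (int)n;
}
```

A mount with all defaults produces an empty string, so mounting again with the shown options reproduces the same state either way.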

Resources

(Note: some of these resources are not up-to-date with the latest kernel version.)

Creating Linux virtual filesystems. 2002
https://lwn.net/Articles/13325/

The Linux Virtual File-system Layer by Neil Brown. 1999
http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html

A tour of the Linux VFS by Michael K. Johnson. 1996
https://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html

A small trail through the Linux kernel by Andries Brouwer. 2001
https://www.win.tue.nl/~aeb/linux/vfs/trail.html