Unevictable LRU Infrastructure (translated by ChatGPT)

Published 2023-12-05 12:06:38  Author: 摩斯电码

Original: https://www.kernel.org/doc/html/latest/mm/unevictable-lru.html

Introduction

This document describes the Linux memory manager's "Unevictable LRU" infrastructure and the use of this to manage several types of "unevictable" folios.

The document attempts to provide the overall rationale behind this mechanism and the rationale for some of the design decisions that drove the implementation. The latter design rationale is discussed in the context of an implementation description. Admittedly, one can obtain the implementation details - the "what does it do?" - by reading the code. One hopes that the descriptions below add value by providing the answer to "why does it do that?".

The Unevictable LRU

The Unevictable LRU facility adds an additional LRU list to track unevictable folios and to hide these folios from vmscan. This mechanism is based on a patch by Larry Woodman of Red Hat to address several scalability problems with folio reclaim in Linux. The problems have been observed at customer sites on large memory x86_64 systems.

To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of main memory will have over 32 million 4k pages in a single node. When a large fraction of these pages are not evictable for any reason [see below], vmscan will spend a lot of time scanning the LRU lists looking for the small fraction of pages that are evictable. This can result in a situation where all CPUs are spending 100% of their time in vmscan for hours or days on end, with the system completely unresponsive.

The unevictable list addresses the following classes of unevictable pages:

  • Those owned by ramfs.

  • Those owned by tmpfs with the noswap mount option.

  • Those mapped into SHM_LOCK'd shared memory regions.

  • Those mapped into VM_LOCKED [mlock()ed] VMAs.

The infrastructure may also be able to handle other conditions that make pages unevictable, either by definition or by circumstance, in the future.

The Unevictable LRU Folio List

The Unevictable LRU folio list is a lie. It was never an LRU-ordered list, but a companion to the LRU-ordered anonymous and file, active and inactive folio lists; and now it is not even a folio list. But following familiar convention, here in this document and in the source, we often imagine it as a fifth LRU folio list.

The Unevictable LRU infrastructure consists of an additional, per-node, LRU list called the "unevictable" list and an associated folio flag, PG_unevictable, to indicate that the folio is being managed on the unevictable list.

The PG_unevictable flag is analogous to, and mutually exclusive with, the PG_active flag in that it indicates on which LRU list a folio resides when PG_lru is set.
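
As a rough illustration, the target list can be derived from these flags much as the kernel's folio_lru_list() helper in include/linux/mm_inline.h does it; the sketch below paraphrases that logic and is not the verbatim kernel source:

    /* Paraphrase of folio_lru_list(): PG_lru says the folio is on some LRU
     * list; PG_unevictable and PG_active then select which one. */
    static enum lru_list which_lru_list(struct folio *folio)
    {
        if (folio_test_unevictable(folio))  /* never set together with PG_active */
            return LRU_UNEVICTABLE;
        if (folio_is_file_lru(folio))
            return folio_test_active(folio) ? LRU_ACTIVE_FILE : LRU_INACTIVE_FILE;
        return folio_test_active(folio) ? LRU_ACTIVE_ANON : LRU_INACTIVE_ANON;
    }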

The Unevictable LRU infrastructure maintains unevictable folios as if they were on an additional LRU list for a few reasons:

  1. We get to "treat unevictable folios just like we treat other folios in the system - which means we get to use the same code to manipulate them, the same code to isolate them (for migrate, etc.), the same code to keep track of the statistics, etc..." [Rik van Riel]

  2. We want to be able to migrate unevictable folios between nodes for memory defragmentation, workload management and memory hotplug. The Linux kernel can only migrate folios that it can successfully isolate from the LRU lists (or "Movable" pages: outside of consideration here). If we were to maintain folios elsewhere than on an LRU-like list, where they can be detected by folio_isolate_lru(), we would prevent their migration.

The unevictable list does not differentiate between file-backed and anonymous, swap-backed folios. This differentiation is only important while the folios are, in fact, evictable.

The unevictable list benefits from the "arrayification" of the per-node LRU lists and statistics originally proposed and posted by Christoph Lameter.

Memory Control Group Interaction

The unevictable LRU facility interacts with the memory control group [aka memory controller; see Memory Resource Controller] by extending the lru_list enum.

The memory controller data structure automatically gets a per-node unevictable list as a result of the "arrayification" of the per-node LRU lists (one per lru_list enum element). The memory controller tracks the movement of pages to and from the unevictable list.
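
For reference, "arrayification" means one list per lru_list enum element. The enum in include/linux/mmzone.h looks approximately like the sketch below (some helper values are omitted), and each per-node (and per-memcg, per-node) lruvec simply carries an array indexed by it:

    enum lru_list {
        LRU_INACTIVE_ANON,
        LRU_ACTIVE_ANON,
        LRU_INACTIVE_FILE,
        LRU_ACTIVE_FILE,
        LRU_UNEVICTABLE,    /* the "fifth" list described above */
        NR_LRU_LISTS
    };

    /* Abridged: every lruvec carries one list head per lru_list element,
     * so the unevictable list gets per-node and per-memcg statistics for
     * free. Locking and counters are omitted here. */
    struct lruvec {
        struct list_head lists[NR_LRU_LISTS];
        /* ... */
    };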

When a memory control group comes under memory pressure, the controller will not attempt to reclaim pages on the unevictable list. This has a couple of effects:

  1. Because the pages are "hidden" from reclaim on the unevictable list, the reclaim process can be more efficient, dealing only with pages that have a chance of being reclaimed.

  2. On the other hand, if too many of the pages charged to the control group are unevictable, the evictable portion of the working set of the tasks in the control group may not fit into the available memory. This can cause the control group to thrash or to OOM-kill tasks.

Marking Address Spaces Unevictable

For facilities such as ramfs, none of the pages attached to the address space may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE address space flag is provided, and this can be manipulated by a filesystem using a number of wrapper functions:

  • void mapping_set_unevictable(struct address_space *mapping);
    Mark the address space as being completely unevictable.

  • void mapping_clear_unevictable(struct address_space *mapping);
    Mark the address space as being evictable.

  • int mapping_unevictable(struct address_space *mapping);
    Query the address space, and return true if it is completely unevictable.
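
These wrappers are thin: they essentially set, clear, or test the AS_UNEVICTABLE bit in mapping->flags. A sketch of what they boil down to (the real definitions live in include/linux/pagemap.h and may differ in detail, e.g. the query returns bool in current kernels):

    static inline void mapping_set_unevictable(struct address_space *mapping)
    {
        set_bit(AS_UNEVICTABLE, &mapping->flags);
    }

    static inline void mapping_clear_unevictable(struct address_space *mapping)
    {
        clear_bit(AS_UNEVICTABLE, &mapping->flags);
    }

    static inline bool mapping_unevictable(struct address_space *mapping)
    {
        return mapping && test_bit(AS_UNEVICTABLE, &mapping->flags);
    }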

These are currently used in three places in the kernel:

  1. By ramfs to mark the address spaces of its inodes when they are created, and this mark remains for the life of the inode.

  2. By SYSV SHM to mark SHM_LOCK'd address spaces until SHM_UNLOCK is called. Note that SHM_LOCK is not required to page in the locked pages if they're swapped out; the application must touch the pages manually if it wants to ensure they're in memory (a minimal userspace example follows this list).

  3. By the i915 driver to mark pinned address space until it's unpinned. The amount of unevictable memory marked by i915 driver is roughly the bounded object size in debugfs/dri/0/i915_gem_objects.
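
To make case 2 concrete, a minimal userspace sequence might look like the sketch below. shmctl(SHM_LOCK) only marks the segment's address space; it is the explicit memset() that actually faults the pages in. SHM_LOCK may require CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK.

    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        size_t len = 1 << 20;                      /* 1 MiB segment */
        int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
        if (id < 0) { perror("shmget"); return 1; }

        /* Mark the whole segment SHM_LOCK'd: its address space becomes
         * AS_UNEVICTABLE, but nothing is faulted in yet. */
        if (shmctl(id, SHM_LOCK, NULL) != 0)
            perror("shmctl(SHM_LOCK)");            /* privileges or rlimit */

        char *p = shmat(id, NULL, 0);
        if (p == (void *)-1) { perror("shmat"); return 1; }

        /* Touching the pages is what actually brings them into memory;
         * vmscan will move them to the unevictable list if it ever meets
         * them during a reclaim scan. */
        memset(p, 0, len);

        shmctl(id, SHM_UNLOCK, NULL);              /* pages become rescuable */
        shmdt(p);
        shmctl(id, IPC_RMID, NULL);
        return 0;
    }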

Detecting Unevictable Pages

The function folio_evictable() in mm/internal.h determines whether a folio is evictable or not using the query function outlined above [see section Marking address spaces unevictable] to check the AS_UNEVICTABLE flag.

For address spaces that are so marked after being populated (as SHM regions might be), the lock action (e.g. SHM_LOCK) can be lazy, and need not populate the page tables for the region as does, for example, mlock(), nor need it make any special effort to push any pages in the SHM_LOCK'd area to the unevictable list. Instead, vmscan will do this if and when it encounters the folios during a reclamation scan.

On an unlock action (such as SHM_UNLOCK), the unlocker (e.g. shmctl()) must scan the pages in the region and "rescue" them from the unevictable list if no other condition is keeping them unevictable. If an unevictable region is destroyed, the pages are also "rescued" from the unevictable list in the process of freeing them.

folio_evictable() also checks for mlocked folios by calling folio_test_mlocked(), which is set when a folio is faulted into a VM_LOCKED VMA, or found in a VMA being VM_LOCKED.
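
Putting the two checks together, folio_evictable() amounts to roughly the following paraphrase of mm/internal.h; the real helper also guards the mapping lookup (e.g. with RCU) so the address_space cannot be freed underneath it:

    /* Sketch: a folio is evictable unless its mapping is marked
     * AS_UNEVICTABLE or the folio itself is mlocked. */
    static inline bool folio_evictable(struct folio *folio)
    {
        return !mapping_unevictable(folio_mapping(folio)) &&
               !folio_test_mlocked(folio);
    }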

Vmscan's Handling of Unevictable Folios

If unevictable folios are culled in the fault path, or moved to the unevictable list at mlock() or mmap() time, vmscan will not encounter the folios until they have become evictable again (via munlock() for example) and have been "rescued" from the unevictable list. However, there may be situations where we decide, for the sake of expediency, to leave an unevictable folio on one of the regular active/inactive LRU lists for vmscan to deal with. vmscan checks for such folios in all of the shrink_{active|inactive|page}_list() functions and will "cull" such folios that it encounters: that is, it diverts those folios to the unevictable list for the memory cgroup and node being scanned.

There may be situations where a folio is mapped into a VM_LOCKED VMA, but the folio does not have the mlocked flag set. Such folios will make it all the way to shrink_active_list() or shrink_page_list() where they will be detected when vmscan walks the reverse map in folio_referenced() or try_to_unmap(). The folio is culled to the unevictable list when it is released by the shrinker.

To "cull" an unevictable folio, vmscan simply puts the folio back on the LRU list using folio_putback_lru() - the inverse operation to folio_isolate_lru() - after dropping the folio lock. Because the condition which makes the folio unevictable may change once the folio is unlocked, __pagevec_lru_add_fn() will recheck the unevictable state of a folio before placing it on the unevictable list.

MLOCKED Pages

The unevictable folio list is also useful for mlock(), in addition to ramfs and SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in NOMMU situations, all mappings are effectively mlocked.

History

The "Unevictable mlocked Pages" infrastructure is based on work originally posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU". Nick posted his patch as an alternative to a patch posted by Christoph Lameter to achieve the same objective: hiding mlocked pages from vmscan.

In Nick's patch, he used one of the struct page LRU list link fields as a count of VM_LOCKED VMAs that map the page (Rik van Riel had the same idea three years earlier). But this use of the link field for a count prevented the management of the pages on an LRU list, and thus mlocked pages were not migratable as isolate_lru_page() could not detect them, and the LRU list link field was not available to the migration subsystem.

Nick resolved this by putting mlocked pages back on the LRU list before attempting to isolate them, thus abandoning the count of VM_LOCKED VMAs. When Nick's patch was integrated with the Unevictable LRU work, the count was replaced by walking the reverse map when munlocking, to determine whether any other VM_LOCKED VMAs still mapped the page.

However, walking the reverse map for each page when munlocking was ugly and inefficient, and could lead to catastrophic contention on a file's rmap lock, when many processes which had it mlocked were trying to exit. In 5.18, the idea of keeping mlock_count in Unevictable LRU list link field was revived and put to work, without preventing the migration of mlocked pages. This is why the "Unevictable LRU list" cannot be a linked list of pages now; but there was no use for that linked list anyway - though its size is maintained for meminfo.

Basic Management

mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable pages. When such a page has been "noticed" by the memory management subsystem, the page is marked with the PG_mlocked flag. This can be manipulated using the PageMlocked() functions.

A PG_mlocked page will be placed on the unevictable list when it is added to the LRU. Such pages can be "noticed" by memory management in several places:

  1. in the mlock()/mlock2()/mlockall() system call handlers;

  2. in the mmap() system call handler when mmapping a region with the MAP_LOCKED flag;

  3. mmapping a region in a task that has called mlockall() with the MCL_FUTURE flag;

  4. in the fault path and when a VM_LOCKED stack segment is expanded; or

  5. as mentioned above, in vmscan:shrink_page_list() when attempting to reclaim a page in a VM_LOCKED VMA by folio_referenced() or try_to_unmap().

mlocked pages become unlocked and rescued from the unevictable list when:

  1. mapped in a range unlocked via the munlock()/munlockall() system calls;

  2. munmap()'d out of the last VM_LOCKED VMA that maps the page, including unmapping at task exit;

  3. when the page is truncated from the last VM_LOCKED VMA of an mmapped file; or

  4. before a page is COW'd in a VM_LOCKED VMA.

mlock()/mlock2()/mlockall() System Call Handling

mlock(), mlock2() and mlockall() system call handlers proceed to mlock_fixup() for each VMA in the range specified by the call. In the case of mlockall(), this is the entire active address space of the task. Note that mlock_fixup() is used for both mlocking and munlocking a range of memory. A call to mlock() an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED, is treated as a no-op and mlock_fixup() simply returns.

If the VMA passes some filtering as described in "Filtering Special VMAs" below, mlock_fixup() will attempt to merge the VMA with its neighbors or split off a subset of the VMA if the range does not cover the entire VMA. Any pages already present in the VMA are then marked as mlocked by mlock_folio() via mlock_pte_range() via walk_page_range() via mlock_vma_pages_range().

Before returning from the system call, do_mlock() or mlockall() will call __mm_populate() to fault in the remaining pages via get_user_pages() and to mark those pages as mlocked as they are faulted.

Note that the VMA being mlocked might be mapped with PROT_NONE. In this case, get_user_pages() will be unable to fault in the pages. That's okay. If pages do end up getting faulted into this VM_LOCKED VMA, they will be handled in the fault path - which is also how mlock2()'s MLOCK_ONFAULT areas are handled.
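
For comparison from userspace, an MLOCK_ONFAULT region is requested as in the sketch below (assuming a glibc recent enough, roughly 2.27+, to provide the mlock2() wrapper and the MLOCK_ONFAULT constant); pages are locked only as they are faulted in, rather than being populated up front:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 4 << 20;                          /* 4 MiB */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        /* The VMA becomes VM_LOCKED (lock-on-fault), but no pages are
         * faulted in or mlocked yet. */
        if (mlock2(buf, len, MLOCK_ONFAULT) != 0) {
            perror("mlock2(MLOCK_ONFAULT)");           /* check RLIMIT_MEMLOCK */
            return 1;
        }

        /* Each page is marked mlocked in the fault path as it is touched. */
        memset(buf, 0, len);

        munlock(buf, len);
        munmap(buf, len);
        return 0;
    }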

For each PTE (or PMD) being faulted into a VMA, the page add rmap function calls mlock_vma_folio(), which calls mlock_folio() when the VMA is VM_LOCKED (unless it is a PTE mapping of a part of a transparent huge page). Or when it is a newly allocated anonymous page, folio_add_lru_vma() calls mlock_new_folio() instead: similar to mlock_folio(), but can make better judgments, since this page is held exclusively and known not to be on LRU yet.

mlock_folio() sets PG_mlocked immediately, then places the page on the CPU's mlock folio batch, to batch up the rest of the work to be done under lru_lock by __mlock_folio(). __mlock_folio() sets PG_unevictable, initializes mlock_count and moves the page to unevictable state ("the unevictable LRU", but with mlock_count in place of LRU threading). Or if the page was already PG_lru and PG_unevictable and PG_mlocked, it simply increments the mlock_count.

But in practice that may not work ideally: the page may not yet be on an LRU, or it may have been temporarily isolated from LRU. In such cases the mlock_count field cannot be touched, but will be set to 0 later when __munlock_folio() returns the page to "LRU". Races prohibit mlock_count from being set to 1 then: rather than risk stranding a page indefinitely as unevictable, always err with mlock_count on the low side, so that when munlocked the page will be rescued to an evictable LRU, then perhaps be mlocked again later if vmscan finds it in a VM_LOCKED VMA.
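
In rough pseudo-C, the batched work described in the last two paragraphs looks something like the sketch below. It paraphrases the behaviour documented here rather than quoting mm/mlock.c, and lruvec_move_to_unevictable() is a hypothetical placeholder for the real list manipulation:

    /* Runs under lru_lock, after mlock_folio() has set PG_mlocked and
     * batched the folio on the per-CPU mlock folio batch. */
    static void __mlock_folio_sketch(struct folio *folio, struct lruvec *lruvec)
    {
        if (!folio_test_lru(folio))
            return;     /* off LRU or isolated: mlock_count cannot be touched;
                           it will be reset to 0 on return to "LRU" */

        if (folio_test_unevictable(folio)) {
            folio->mlock_count++;       /* already mlocked: just count it */
            return;
        }

        folio_set_unevictable(folio);
        folio->mlock_count = 1;
        lruvec_move_to_unevictable(lruvec, folio);      /* hypothetical */
    }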

Filtering Special VMAs

mlock_fixup() filters several classes of "special" VMAs:

  1. VMAs with VM_IO or VM_PFNMAP set are skipped entirely. The pages behind these mappings are inherently pinned, so we don't need to mark them as mlocked. In any case, most of the pages have no struct page in which to so mark the page. Because of this, get_user_pages() will fail for these VMAs, so there is no sense in attempting to visit them.

  2. VMAs mapping hugetlbfs pages are already effectively pinned into memory. We neither need nor want to mlock() these pages. But __mm_populate() includes hugetlbfs ranges, allocating the huge pages and populating the PTEs.

  3. VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages, such as the VDSO page, relay channel pages, etc. These pages are inherently unevictable and are not managed on the LRU lists. __mm_populate() includes these ranges, populating the PTEs if not already populated.

  4. VMAs with VM_MIXEDMAP set are not marked VM_LOCKED, but __mm_populate() includes these ranges, populating the PTEs if not already populated.

Note that for all of these special VMAs, mlock_fixup() does not set the VM_LOCKED flag. Therefore, we won't have to deal with them later during munlock(), munmap() or task exit. Neither does mlock_fixup() account these VMAs against the task's "locked_vm".

munlock()/munlockall() System Call Handling

The munlock() and munlockall() system calls are handled by the same mlock_fixup() function as mlock(), mlock2() and mlockall() system calls are. If called to munlock an already munlocked VMA, mlock_fixup() simply returns. Because of the VMA filtering discussed above, VM_LOCKED will not be set in any "special" VMAs. So, those VMAs will be ignored for munlock.

If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the specified range. All pages in the VMA are then munlocked by munlock_folio() via mlock_pte_range() via walk_page_range() via mlock_vma_pages_range() - the same function used when mlocking a VMA range, with new flags for the VMA indicating that it is munlock() being performed.

munlock_folio() uses the mlock pagevec to batch up work to be done under lru_lock by __munlock_folio(). __munlock_folio() decrements the folio's mlock_count, and when that reaches 0 it clears the mlocked flag and clears the unevictable flag, moving the folio from unevictable state to the inactive LRU.

But in practice that may not work ideally: the folio may not yet have reached "the unevictable LRU", or it may have been temporarily isolated from it. In those cases its mlock_count field is unusable and must be assumed to be 0: so that the folio will be rescued to an evictable LRU, then perhaps be mlocked again later if vmscan finds it in a VM_LOCKED VMA.
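
As rough pseudo-C, the munlock side mirrors the mlock sketch given earlier; again this paraphrases the behaviour described above rather than the actual mm/mlock.c source, and lruvec_move_to_inactive() is a hypothetical placeholder:

    /* Runs under lru_lock, after munlock_folio() batched the folio. */
    static void __munlock_folio_sketch(struct folio *folio, struct lruvec *lruvec)
    {
        if (folio_test_lru(folio) && folio_test_unevictable(folio) &&
            folio->mlock_count > 0 && --folio->mlock_count > 0)
            return;     /* still mlocked by another VM_LOCKED VMA */

        /* Count reached 0, or the folio never reached the unevictable state
         * (mlock_count unusable, assumed 0): rescue it. */
        folio_clear_mlocked(folio);
        folio_clear_unevictable(folio);
        lruvec_move_to_inactive(lruvec, folio);         /* hypothetical */
    }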

Migrating MLOCKED Pages

A page that is being migrated has been isolated from the LRU lists and is held locked across unmapping of the page, updating the page's address space entry and copying the contents and state, until the page table entry has been replaced with an entry that refers to the new page. Linux supports migration of mlocked pages and other unevictable pages. PG_mlocked is cleared from the old page when it is unmapped from the last VM_LOCKED VMA, and set when the new page is mapped in place of the migration entry in a VM_LOCKED VMA. If the page was unevictable because mlocked, PG_unevictable follows PG_mlocked; but if the page was unevictable for other reasons, PG_unevictable is copied explicitly.

Note that page migration can race with mlocking or munlocking of the same page. There is mostly no problem since page migration requires unmapping all PTEs of the old page (including munlock where VM_LOCKED), then mapping in the new page (including mlock where VM_LOCKED). The page table locks provide sufficient synchronization.

However, since mlock_vma_pages_range() starts by setting VM_LOCKED on a VMA, before mlocking any pages already present, if one of those pages were migrated before mlock_pte_range() reached it, it would get counted twice in mlock_count. To prevent that, mlock_vma_pages_range() temporarily marks the VMA as VM_IO, so that mlock_vma_folio() will skip it.

To complete page migration, we place the old and new pages back onto the LRU afterwards. The "unneeded" page - old page on success, new page on failure - is freed when the reference count held by the migration process is released.

Compacting MLOCKED Pages

The memory map can be scanned for compactable regions and the default behavior is to let unevictable pages be moved. /proc/sys/vm/compact_unevictable_allowed controls this behavior (see Documentation for /proc/sys/vm/). The work of compaction is mostly handled by the page migration code and the same work flow as described in Migrating MLOCKED Pages will apply.

MLOCKING Transparent Huge Pages

A transparent huge page is represented by a single entry on an LRU list. Therefore, we can only make unevictable an entire compound page, not individual subpages.

If a user tries to mlock() part of a huge page, and no user mlock()s the whole of the huge page, we want the rest of the page to be reclaimable.

We cannot just split the page on partial mlock() as split_huge_page() can fail and a new intermittent failure mode for the syscall is undesirable.

We handle this by keeping PTE-mlocked huge pages on evictable LRU lists: the PMD on the border of a VM_LOCKED VMA will be split into a PTE table.

This way the huge page is accessible for vmscan. Under memory pressure the page will be split, subpages which belong to VM_LOCKED VMAs will be moved to the unevictable LRU and the rest can be reclaimed.

/proc/meminfo's Unevictable and Mlocked amounts do not include those parts of a transparent huge page which are mapped only by PTEs in VM_LOCKED VMAs.
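
These amounts are exported as the "Unevictable:" and "Mlocked:" lines of /proc/meminfo; a small standalone helper (plain userspace code, not kernel code) to print just those two counters might be:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[256];

        if (!f) { perror("/proc/meminfo"); return 1; }
        while (fgets(line, sizeof(line), f)) {
            /* e.g. "Unevictable:     1234 kB" and "Mlocked:     1234 kB" */
            if (!strncmp(line, "Unevictable:", 12) ||
                !strncmp(line, "Mlocked:", 8))
                fputs(line, stdout);
        }
        fclose(f);
        return 0;
    }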

mmap(MAP_LOCKED) System Call Handling

In addition to the mlock(), mlock2() and mlockall() system calls, an application can request that a region of memory be mlocked by supplying the MAP_LOCKED flag to the mmap() call. There is one important and subtle difference here, though. mmap() + mlock() will fail if the range cannot be faulted in (e.g. because mm_populate fails) and returns with ENOMEM while mmap(MAP_LOCKED) will not fail. The mmapped area will still have properties of the locked area - pages will not get swapped out - but major page faults to fault memory in might still happen.
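
The difference can be observed from userspace: with MAP_LOCKED the mapping is returned even if populating it fails, while an explicit mlock() reports the failure. A minimal sketch (the 64 MiB size is arbitrary, and either variant may also need RLIMIT_MEMLOCK raised):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 64UL << 20;    /* 64 MiB */

        /* Variant 1: MAP_LOCKED. The mapping is created and marked
         * VM_LOCKED; a failure to populate it is not reported here. */
        void *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
        if (a == MAP_FAILED)
            perror("mmap(MAP_LOCKED)");

        /* Variant 2: mmap() + mlock(). mlock() returns an error (e.g.
         * ENOMEM) if the range cannot be locked and faulted in. */
        void *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (b != MAP_FAILED && mlock(b, len) != 0)
            fprintf(stderr, "mlock: %s\n", strerror(errno));

        if (a != MAP_FAILED) munmap(a, len);
        if (b != MAP_FAILED) munmap(b, len);
        return 0;
    }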

Furthermore, any mmap() call or brk() call that expands the heap by a task that has previously called mlockall() with the MCL_FUTURE flag will result in the newly mapped memory being mlocked. Before the unevictable/mlock changes, the kernel simply called make_pages_present() to allocate pages and populate the page table.
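
A task opts into that behaviour simply by calling mlockall() with MCL_FUTURE before creating the new mappings; a minimal sketch:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* All current and all future mappings of this task become
         * VM_LOCKED; new mmap()/brk() memory is mlocked as it appears. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");         /* needs CAP_IPC_LOCK or rlimit */
            return 1;
        }

        /* This mapping is created already locked; no mlock() is needed. */
        char *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        memset(p, 0, 1 << 20);

        munlockall();
        munmap(p, 1 << 20);
        return 0;
    }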

To mlock a range of memory under the unevictable/mlock infrastructure, the mmap() handler and task address space expansion functions call populate_vma_page_range() specifying the vma and the address range to mlock.

munmap()/exit()/exec() System Call Handling

When unmapping an mlocked region of memory, whether by an explicit call to munmap() or via an internal unmap from exit() or exec() processing, we must munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages. Before the unevictable/mlock changes, mlocking did not mark the pages in any way, so unmapping them required no processing.

For each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED (unless it was a PTE mapping of a part of a transparent huge page).

munlock_folio() uses the mlock pagevec to batch up work to be done under lru_lock by __munlock_folio(). __munlock_folio() decrements the folio's mlock_count, and when that reaches 0 it clears the mlocked flag and clears the unevictable flag, moving the folio from unevictable state to the inactive LRU.

But in practice that may not work ideally: the folio may not yet have reached "the unevictable LRU", or it may have been temporarily isolated from it. In those cases its mlock_count field is unusable and must be assumed to be 0: so that the folio will be rescued to an evictable LRU, then perhaps be mlocked again later if vmscan finds it in a VM_LOCKED VMA.

Truncating MLOCKED Pages

File truncation or hole punching forcibly unmaps the deleted pages from userspace; truncation even unmaps and deletes any private anonymous pages which had been Copied-On-Write from the file pages now being truncated.

Mlocked pages can be munlocked and deleted in this way: like with munmap(), for each PTE (or PMD) being unmapped from a VMA, page_remove_rmap() calls munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED (unless it was a PTE mapping of a part of a transparent huge page).

However, if there is a racing munlock(), since mlock_vma_pages_range() starts munlocking by clearing VM_LOCKED from a VMA, before munlocking all the pages present, if one of those pages were unmapped by truncation or hole punch before mlock_pte_range() reached it, it would not be recognized as mlocked by this VMA, and would not be counted out of mlock_count. In this rare case, a page may still appear as PG_mlocked after it has been fully unmapped: and it is left to release_pages() (or __page_cache_release()) to clear it and update statistics before freeing (this event is counted in /proc/vmstat unevictable_pgs_cleared, which is usually 0).

Page Reclaim in shrink_*_list()

vmscan's shrink_active_list() culls any obviously unevictable pages - i.e. !page_evictable(page) pages - diverting those to the unevictable list. However, shrink_active_list() only sees unevictable pages that made it onto the active/inactive LRU lists. Note that these pages do not have PG_unevictable set - otherwise they would be on the unevictable list and shrink_active_list() would never see them.

Some examples of these unevictable pages on the LRU lists are:

  1. ramfs pages that have been placed on the LRU lists when first allocated.

  2. SHM_LOCK'd shared memory pages. shmctl(SHM_LOCK) does not attempt to allocate or fault in the pages in the shared memory region. This happens when an application accesses the page the first time after SHM_LOCK'ing the segment.

  3. pages still mapped into VM_LOCKED VMAs, which should be marked mlocked, but events left mlock_count too low, so they were munlocked too early.

vmscan's shrink_inactive_list() and shrink_page_list() also divert obviously unevictable pages found on the inactive lists to the appropriate memory cgroup and node unevictable list.

rmap's folio_referenced_one(), called via vmscan's shrink_active_list() or shrink_page_list(), and rmap's try_to_unmap_one() called via shrink_page_list(), check for (3) pages still mapped into VM_LOCKED VMAs, and call mlock_vma_folio() to correct them. Such pages are culled to the unevictable list when released by the shrinker.