Physical Memory (translated by chatgpt)

Published 2023-12-04 22:39:01 · Author: 摩斯电码

Original: https://www.kernel.org/doc/html/latest/mm/physical_memory.html

Linux is available for a wide range of architectures so there is a need for an architecture-independent abstraction to represent the physical memory. This chapter describes the structures used to manage physical memory in a running system.

The first principal concept prevalent in the memory management is Non-Uniform Memory Access (NUMA). With multi-core and multi-socket machines, memory may be arranged into banks that incur a different cost to access depending on the “distance” from the processor. For example, there might be a bank of memory assigned to each CPU or a bank of memory very suitable for DMA near peripheral devices.

Each bank is called a node and the concept is represented under Linux by a struct pglist_data even if the architecture is UMA. This structure is always referenced by its typedef pg_data_t. A pg_data_t structure for a particular node can be referenced by the NODE_DATA(nid) macro where nid is the ID of that node.
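
For illustration, a minimal sketch of looking up a node's pg_data_t and reading two of its fields (a kernel-context fragment; the fields used here are described later in Section Node structure):

#include <linux/mmzone.h>

pg_data_t *pgdat = NODE_DATA(0);                   /* node 0's pg_data_t  */
unsigned long start_pfn = pgdat->node_start_pfn;   /* first PFN of node 0 */
unsigned long present = pgdat->node_present_pages; /* pages present there */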

For NUMA architectures, the node structures are allocated by the architecture-specific code early during boot. Usually, these structures are allocated locally on the memory bank they represent. For UMA architectures, only one static pg_data_t structure called contig_page_data is used. Nodes will be discussed further in Section Nodes.
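
For the UMA case this is visible in include/linux/mmzone.h, where the lookup macro has historically resolved to the single static node (a paraphrased excerpt, not a verbatim listing; the exact form varies between kernel versions):

#ifndef CONFIG_NUMA

extern struct pglist_data contig_page_data;
#define NODE_DATA(nid)          (&contig_page_data)

#endif /* CONFIG_NUMA */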

The entire physical address space is partitioned into one or more blocks called zones which represent ranges within memory. These ranges are usually determined by architectural constraints for accessing the physical memory. The memory range within a node that corresponds to a particular zone is described by a struct zone, typedeffed to zone_t. Each zone has one of the types described below.

  • ZONE_DMA and ZONE_DMA32 historically represented memory suitable for DMA by peripheral devices that cannot access all of the addressable memory. For many years there have been better and more robust interfaces to get memory with DMA-specific requirements (Dynamic DMA mapping using the generic device), but ZONE_DMA and ZONE_DMA32 still represent memory ranges that have restrictions on how they can be accessed. Depending on the architecture, either of these zone types, or even both, can be disabled at build time using the CONFIG_ZONE_DMA and CONFIG_ZONE_DMA32 configuration options. Some 64-bit platforms may need both zones as they support peripherals with different DMA addressing limitations. A minimal allocation sketch follows this list.

  • ZONE_NORMAL is for normal memory that can be accessed by the kernel all the time. DMA operations can be performed on pages in this zone if the DMA devices support transfers to all addressable memory. ZONE_NORMAL is always enabled.

  • ZONE_HIGHMEM is the part of the physical memory that is not covered by a permanent mapping in the kernel page tables. The memory in this zone is only accessible to the kernel using temporary mappings. This zone is available only on some 32-bit architectures and is enabled with CONFIG_HIGHMEM.

  • ZONE_MOVABLE is for normal accessible memory, just like ZONE_NORMAL. The difference is that the contents of most pages in ZONE_MOVABLE are movable. That means that while the virtual addresses of these pages do not change, their content may move between different physical pages. Often ZONE_MOVABLE is populated during memory hotplug, but it may also be populated on boot using one of the kernelcore, movablecore and movable_node kernel command line parameters. See Page migration and Memory Hot(Un)Plug for additional details.

  • ZONE_DEVICE represents memory residing on devices such as PMEM and GPU. It has different characteristics than RAM zone types and it exists to provide struct page and memory map services for device driver identified physical address ranges. ZONE_DEVICE is enabled with configuration option CONFIG_ZONE_DEVICE.
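
As mentioned in the ZONE_DMA item above, drivers should use the generic DMA API rather than reference the DMA zones directly. A minimal sketch (dev stands for the driver's struct device; error handling is reduced to the bare minimum):

#include <linux/dma-mapping.h>

dma_addr_t dma_handle;
void *cpu_addr;

/* The DMA API picks a suitable zone based on the device's DMA mask,
 * so the driver never names ZONE_DMA or ZONE_DMA32 itself. */
cpu_addr = dma_alloc_coherent(dev, PAGE_SIZE, &dma_handle, GFP_KERNEL);
if (cpu_addr) {
        /* ... program the device with dma_handle ... */
        dma_free_coherent(dev, PAGE_SIZE, cpu_addr, dma_handle);
}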

It is important to note that many kernel operations can only take place using ZONE_NORMAL, so it is the most performance-critical zone. Zones are discussed further in Section Zones.

The relation between node and zone extents is determined by the physical memory map reported by the firmware, architectural constraints for memory addressing and certain parameters in the kernel command line.

For example, with a 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM, the entire memory will be on node 0 and there will be three zones: ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM:

0                                                            2G
+-------------------------------------------------------------+
|                            node 0                           |
+-------------------------------------------------------------+

0         16M                    896M                        2G
+----------+-----------------------+--------------------------+
| ZONE_DMA |      ZONE_NORMAL      |       ZONE_HIGHMEM       |
+----------+-----------------------+--------------------------+

With a kernel built with ZONE_DMA disabled and ZONE_DMA32 enabled, and booted with the movablecore=80% parameter on an arm64 machine with 16 Gbytes of RAM equally split between two nodes, there will be ZONE_DMA32, ZONE_NORMAL and ZONE_MOVABLE on node 0, and ZONE_NORMAL and ZONE_MOVABLE on node 1:

1G                                9G                         17G
+--------------------------------+ +--------------------------+
|              node 0            | |          node 1          |
+--------------------------------+ +--------------------------+

1G       4G        4200M          9G          9320M          17G
+---------+----------+-----------+ +------------+-------------+
|  DMA32  |  NORMAL  |  MOVABLE  | |   NORMAL   |   MOVABLE   |
+---------+----------+-----------+ +------------+-------------+

Memory banks may belong to interleaving nodes. In the example below an x86 machine has 16 Gbytes of RAM in 4 memory banks, even banks belong to node 0 and odd banks belong to node 1:

0              4G              8G             12G            16G
+-------------+ +-------------+ +-------------+ +-------------+
|    node 0   | |    node 1   | |    node 0   | |    node 1   |
+-------------+ +-------------+ +-------------+ +-------------+

0   16M      4G
+-----+-------+ +-------------+ +-------------+ +-------------+
| DMA | DMA32 | |    NORMAL   | |    NORMAL   | |    NORMAL   |
+-----+-------+ +-------------+ +-------------+ +-------------+

In this case node 0 will span from 0 to 12 Gbytes and node 1 will span from 4 to 16 Gbytes.

Nodes

As we have mentioned, each node in memory is described by a pg_data_t which is a typedef for a struct pglist_data. When allocating a page, by default Linux uses a node-local allocation policy to allocate memory from the node closest to the running CPU. As processes tend to run on the same CPU, it is likely the memory from the current node will be used. The allocation policy can be controlled by users as described in NUMA Memory Policy.
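
For illustration, a minimal sketch of the default node-local behavior and of requesting a specific node explicitly (a kernel-context fragment; nid is assumed to be a valid node ID):

#include <linux/gfp.h>

/* Allocates from the node of the running CPU by default. */
struct page *local = alloc_pages(GFP_KERNEL, 0);

/* Explicitly prefer the node identified by nid instead. */
struct page *remote = alloc_pages_node(nid, GFP_KERNEL, 0);

if (local)
        __free_pages(local, 0);
if (remote)
        __free_pages(remote, 0);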

Most NUMA architectures maintain an array of pointers to the node structures. The actual structures are allocated early during boot when architecture-specific code parses the physical memory map reported by the firmware. The bulk of the node initialization happens slightly later in the boot process in the free_area_init() function, described later in Section Initialization.

Along with the node structures, kernel maintains an array of nodemask_t bitmasks called node_states. Each bitmask in this array represents a set of nodes with particular properties as defined by enum node_states:

  • N_POSSIBLE
    The node could become online at some point.
  • N_ONLINE
    The node is online.
  • N_NORMAL_MEMORY
    The node has regular memory.
  • N_HIGH_MEMORY
    The node has regular or high memory. When CONFIG_HIGHMEM is disabled, this is aliased to N_NORMAL_MEMORY.
  • N_MEMORY
    The node has memory (regular, high, movable).
  • N_CPU
    The node has one or more CPUs.

For each node that has a property described above, the bit corresponding to the node ID in the node_states[<property>] bitmask is set.

For example, for node 2 with normal memory and CPUs, bit 2 will be set in:

node_states[N_POSSIBLE]
node_states[N_ONLINE]
node_states[N_NORMAL_MEMORY]
node_states[N_HIGH_MEMORY]
node_states[N_MEMORY]
node_states[N_CPU]

For the various operations possible with nodemasks, please refer to include/linux/nodemask.h; a short sketch of a few of these helpers follows.
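
For illustration, a kernel-context fragment using a few of these helpers (node 2 is an arbitrary example):

#include <linux/nodemask.h>
#include <linux/printk.h>

nodemask_t mask = NODE_MASK_NONE;

node_set(2, mask);                  /* mark node 2 in a private mask  */
if (node_isset(2, mask))            /* test a node's bit in that mask */
        pr_info("node 2 is in the mask\n");
if (node_state(2, N_ONLINE))        /* query the global node_states   */
        pr_info("node 2 is online\n");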

Among other things, nodemasks are used to provide macros for node traversal, namely for_each_node() and for_each_online_node().

For instance, to call a function foo() for each online node:

for_each_online_node(nid) {
        /* NODE_DATA() maps a node ID to that node's pg_data_t */
        pg_data_t *pgdat = NODE_DATA(nid);

        foo(pgdat);
}

Node structure

The node structure struct pglist_data is declared in include/linux/mmzone.h. Here we briefly describe the fields of this structure:

General

  • node_zones
    The zones for this node. Not all of the zones may be populated, but it is the full list. It is referenced by this node's node_zonelists as well as by other nodes' node_zonelists.
  • node_zonelists
    The list of all zones in all nodes. This list defines the order of zones that allocations are preferred from. The node_zonelists is set up by build_zonelists() in mm/page_alloc.c during the initialization of core memory management structures.
  • nr_zones
    Number of populated zones in this node.
  • node_mem_map
    For UMA systems that use the FLATMEM memory model, node 0's node_mem_map is an array of struct page representing each physical frame.
  • node_page_ext
    For UMA systems that use the FLATMEM memory model, node 0's node_page_ext is an array of extensions of struct page. Available only in kernels built with CONFIG_PAGE_EXTENSION enabled.
  • node_start_pfn
    The page frame number of the starting page frame in this node.
  • node_present_pages
    Total number of physical pages present in this node.
  • node_spanned_pages
    Total size of physical page range, including holes.
  • node_size_lock
    A lock that protects the fields defining the node extents. Only defined when at least one of the CONFIG_MEMORY_HOTPLUG or CONFIG_DEFERRED_STRUCT_PAGE_INIT configuration options is enabled. pgdat_resize_lock() and pgdat_resize_unlock() are provided to manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG or CONFIG_DEFERRED_STRUCT_PAGE_INIT; see the sketch after this list.
  • node_id
    The Node ID (NID) of the node, starts at 0.
  • totalreserve_pages
    This is a per-node reserve of pages that are not available to userspace allocations.
  • first_deferred_pfn
    If memory initialization on large machines is deferred then this is the first PFN that needs to be initialized. Defined only when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.
  • deferred_split_queue
    Per-node queue of huge pages whose split was deferred. Defined only when CONFIG_TRANSPARENT_HUGEPAGE is enabled.
  • __lruvec
    Per-node lruvec holding LRU lists and related parameters. Used only when memory cgroups are disabled. It should not be accessed directly, use mem_cgroup_lruvec() to look up lruvecs instead.
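
For illustration, a minimal sketch of the node_size_lock helpers mentioned above (a kernel-context fragment; nid is assumed to be a valid node ID):

#include <linux/memory_hotplug.h>
#include <linux/mmzone.h>

pg_data_t *pgdat = NODE_DATA(nid);
unsigned long flags;

pgdat_resize_lock(pgdat, &flags);
/* The fields defining the node extents cannot change while
 * the resize lock is held. */
unsigned long start = pgdat->node_start_pfn;
unsigned long span = pgdat->node_spanned_pages;
pgdat_resize_unlock(pgdat, &flags);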

Reclaim control

See also Page Reclaim.

  • kswapd
    Per-node instance of kswapd kernel thread.
  • kswapd_wait, pfmemalloc_wait, reclaim_wait
    Workqueues used to synchronize memory reclaim tasks.
  • nr_writeback_throttled
    Number of tasks that are throttled waiting on dirty pages to clean.
  • nr_reclaim_start
    Number of pages written while reclaim is throttled waiting for writeback.
  • kswapd_order
    Controls the order kswapd tries to reclaim.
  • kswapd_highest_zoneidx
    The highest zone index to be reclaimed by kswapd.
  • kswapd_failures
    Number of runs in which kswapd was unable to reclaim any pages.
  • min_unmapped_pages
    Minimal number of unmapped file-backed pages that cannot be reclaimed. Determined by the vm.min_unmapped_ratio sysctl. Only defined when CONFIG_NUMA is enabled.
  • min_slab_pages
    Minimal number of SLAB pages that cannot be reclaimed. Determined by the vm.min_slab_ratio sysctl. Only defined when CONFIG_NUMA is enabled.
  • flags
    Flags controlling reclaim behavior.

Compaction control

  • kcompactd_max_order
    Page order that kcompactd should try to achieve.
  • kcompactd_highest_zoneidx
    The highest zone index to be compacted by kcompactd.
  • kcompactd_wait
    Workqueue used to synchronize memory compaction tasks.
  • kcompactd
    Per-node instance of kcompactd kernel thread.
  • proactive_compact_trigger
    Determines if proactive compaction is enabled. Controlled by vm.compaction_proactiveness sysctl.

Statistics

  • per_cpu_nodestats
    Per-CPU VM statistics for the node.
  • vm_stat
    VM statistics for the node; see the sketch below.
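
For illustration, a minimal sketch of reading one per-node VM statistic through the interface declared in include/linux/vmstat.h (a kernel-context fragment; NR_FILE_PAGES is just an example counter and nid is assumed to be a valid node ID):

#include <linux/mmzone.h>
#include <linux/vmstat.h>

/* Reads the node-wide value of the counter (updated periodically
 * from the per-CPU deltas). */
unsigned long file_pages = node_page_state(NODE_DATA(nid), NR_FILE_PAGES);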