Control Group v2 —— Controller(翻译 by chatgpt)

发布时间 2023-12-07 21:03:42 作者: 摩斯电码

原文:https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#controllers

Controllers

CPU

The "cpu" controllers regulates distribution of CPU cycles. This controller implements weight and absolute bandwidth limit models for normal scheduling policy and absolute bandwidth allocation model for realtime scheduling policy.
"cpu" 控制器调节 CPU 周期的分配。该控制器实现了普通调度策略的权重和绝对带宽限制模型,以及实时调度策略的绝对带宽分配模型。

In all the above models, cycles distribution is defined only on a temporal base and it does not account for the frequency at which tasks are executed. The (optional) utilization clamping support allows to hint the schedutil cpufreq governor about the minimum desired frequency which should always be provided by a CPU, as well as the maximum desired frequency, which should not be exceeded by a CPU.
在上述所有模型中,周期分配仅基于时间,不考虑任务执行时的频率。(可选的)利用率钳位(utilization clamping)支持允许向 schedutil cpufreq governor 提示 CPU 应始终提供的最低期望频率,以及 CPU 不应超过的最高期望频率。

关于utilization clamping,可以参考 Linux Kernel Utilization Clamping简介

WARNING: cgroup2 doesn't yet support control of realtime processes and the cpu controller can only be enabled when all RT processes are in the root cgroup. Be aware that system management software may already have placed RT processes into nonroot cgroups during the system boot process, and these processes may need to be moved to the root cgroup before the cpu controller can be enabled.
警告:cgroup2 尚不支持对实时进程的控制,CPU 控制器只能在所有 RT 进程位于根 cgroup 时启用。请注意,系统管理软件可能已经在系统启动过程中将 RT 进程放入非根 cgroup 中,这些进程可能需要在启用 CPU 控制器之前移动到根 cgroup 中。
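
As an illustration, a minimal sketch of enabling the cpu controller for child cgroups, assuming cgroup2 is mounted at /sys/fs/cgroup and any RT processes have already been moved to the root cgroup:
作为示意,下面是一个为子 cgroup 启用 cpu 控制器的最小示例,假设 cgroup2 挂载在 /sys/fs/cgroup,且所有 RT 进程已移动到根 cgroup:

    # 查看可用的控制器以及已对子 cgroup 启用的控制器
    cat /sys/fs/cgroup/cgroup.controllers
    cat /sys/fs/cgroup/cgroup.subtree_control
    # 为子 cgroup 启用 cpu 控制器
    echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control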

CPU Interface Files

CPU 接口文件

All time durations are in microseconds.
所有时间值均以微秒为单位。

  • cpu.stat
    A read-only flat-keyed file. This file exists whether the controller is enabled or not.
    一个只读的扁平键文件。无论控制器是否启用,此文件都存在。

    It always reports the following three stats:
    它始终报告以下三个统计信息:

    • usage_usec
    • user_usec
    • system_usec

    and the following five when the controller is enabled:
    当控制器启用时,还报告以下五个统计信息:

    • nr_periods
    • nr_throttled
    • throttled_usec
    • nr_bursts
    • burst_usec
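
    An illustrative read of cpu.stat (the numbers are made up; actual values depend on the workload):
    cpu.stat 的一次示意性读取(数值为虚构,实际值取决于工作负载):

    # cat cpu.stat
    usage_usec 527089
    user_usec 412064
    system_usec 115025
    nr_periods 104
    nr_throttled 3
    throttled_usec 12083
    nr_bursts 0
    burst_usec 0
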
  • cpu.weight
    A read-write single value file which exists on non-root cgroups. The default is "100".
    一个读写的单值文件,存在于非根 cgroup 中。默认值为 "100"。

    The weight in the range [1, 10000].
    权重范围为 [1, 10000]。

  • cpu.weight.nice
    A read-write single value file which exists on non-root cgroups. The default is "0".
    一个读写的单值文件,存在于非根 cgroup 中。默认值为 "0"。

    The nice value is in the range [-20, 19].
    优先级值范围为 [-20, 19]。

    This interface file is an alternative interface for "cpu.weight" and allows reading and setting weight using the same values used by nice(2). Because the range is smaller and granularity is coarser for the nice values, the read value is the closest approximation of the current weight.
    此接口文件是 "cpu.weight" 的替代接口,允许使用与 nice(2) 使用的相同值来读取和设置权重。由于 nice 值的范围较小且粒度较粗,读取值是当前权重的最接近近似值。

  • cpu.max
    A read-write two value file which exists on non-root cgroups. The default is "max 100000".
    一个读写的双值文件,存在于非根 cgroup 中。默认值为 "max 100000"。

    The maximum bandwidth limit. It's in the following format:
    最大带宽限制。格式如下:

    $MAX $PERIOD
    

    which indicates that the group may consume up to $MAX in each $PERIOD duration. "max" for $MAX indicates no limit. If only one number is written, $MAX is updated.
    表示该组在每个 $PERIOD 期间最多可以消耗 $MAX。$MAX 为 "max" 表示无限制。如果只写入一个数字,则只更新 $MAX。
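
    For example, to allow the group at most 50ms of CPU time in every 100ms period (values are illustrative), then to update only $MAX, and finally to remove the limit:
    例如,先限制该组在每个 100ms 周期内最多使用 50ms 的 CPU 时间(数值仅为示意),再只更新 $MAX,最后解除限制:

    echo "50000 100000" > cpu.max
    echo "25000" > cpu.max
    echo "max" > cpu.max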

  • cpu.max.burst
    A read-write single value file which exists on non-root cgroups. The default is "0".
    一个读写的单值文件,存在于非根 cgroup 中。默认值为 "0"。

    The burst in the range [0, $MAX].
    突发范围为 [0, $MAX]。

  • cpu.pressure
    A read-write nested-keyed file.
    一个读写的嵌套键文件。

    Shows pressure stall information for CPU. See Documentation/accounting/psi.rst for details.
    显示 CPU 的压力阻塞信息。有关详情,请参阅 Documentation/accounting/psi.rst

  • cpu.uclamp.min
    A read-write single value file which exists on non-root cgroups. The default is "0", i.e. no utilization boosting.
    一个读写的单值文件,存在于非根 cgroup 中。默认值为 "0",即无利用率增强。

    The requested minimum utilization (protection) as a percentage rational number, e.g. 12.34 for 12.34%.
    请求的最小利用率(保护)作为百分比有理数,例如 12.34 表示 12.34%。

    This interface allows reading and setting minimum utilization clamp values similar to the sched_setattr(2). This minimum utilization value is used to clamp the task specific minimum utilization clamp.
    此接口允许以与 sched_setattr(2) 类似的方式读取和设置最小利用率钳位值。该值用于对任务自身设置的最小利用率钳位再做一次钳位。

    The requested minimum utilization (protection) is always capped by the current value for the maximum utilization (limit), i.e. cpu.uclamp.max.
    请求的最小利用率(保护)始终不会超过最大利用率(限制)的当前值,即 cpu.uclamp.max。

  • cpu.uclamp.max
    A read-write single value file which exists on non-root cgroups. The default is "max", i.e. no utilization capping.
    一个读写的单值文件,存在于非根 cgroup 中。默认值为 "max",即不设利用率上限。

    The requested maximum utilization (limit) as a percentage rational number, e.g. 98.76 for 98.76%.
    请求的最大利用率(限制)作为百分比有理数,例如 98.76 表示 98.76%。

    This interface allows reading and setting maximum utilization clamp values similar to the sched_setattr(2). This maximum utilization value is used to clamp the task specific maximum utilization clamp.
    此接口允许以与 sched_setattr(2) 类似的方式读取和设置最大利用率钳位值。该值用于对任务自身设置的最大利用率钳位再做一次钳位。
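
    For example, requesting that tasks in this cgroup be clamped between 20% and 80% utilization (percentages are illustrative):
    例如,请求将该 cgroup 中任务的利用率钳位在 20% 到 80% 之间(百分比仅为示意):

    echo "20.00" > cpu.uclamp.min
    echo "80.00" > cpu.uclamp.max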

Memory

The "memory" controller regulates distribution of memory. Memory is stateful and implements both limit and protection models. Due to the intertwining between memory usage and reclaim pressure and the stateful nature of memory, the distribution model is relatively complex.
"内存"控制器调节内存的分配。内存是有状态的,并实现了限制和保护模型。由于内存使用和回收压力之间的交织以及内存的有状态性,分配模型相对复杂。

While not completely water-tight, all major memory usages by a given cgroup are tracked so that the total memory consumption can be accounted and controlled to a reasonable extent. Currently, the following types of memory usages are tracked.
虽然并非完全严密,但给定 cgroup 的所有主要内存使用都会被跟踪,从而可以在合理程度上核算和控制总内存消耗。目前会跟踪以下类型的内存使用。

  • Userland memory - page cache and anonymous memory.
    用户空间内存 - 页面缓存和匿名内存。

  • Kernel data structures such as dentries and inodes.
    内核数据结构,如 dentries 和 inodes。

  • TCP socket buffers.
    TCP 套接字缓冲区。

The above list may expand in the future for better coverage.
以上列表可能会在未来扩展以获得更好的覆盖范围。

Memory Interface Files

内存接口文件

All memory amounts are in bytes. If a value which is not aligned to PAGE_SIZE is written, the value may be rounded up to the closest PAGE_SIZE multiple when read back.
所有内存量均以字节为单位。如果写入的值未按 PAGE_SIZE 对齐,读回时该值可能会被向上取整到最接近的 PAGE_SIZE 倍数。
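
For example, writing a value that is not a multiple of PAGE_SIZE (assuming 4KiB pages) may read back adjusted to a nearby page multiple:
例如,写入一个不是 PAGE_SIZE 倍数的值(假设页大小为 4KiB),读回时可能被调整为邻近的页整数倍:

    echo 1000000 > memory.max
    cat memory.max
    # 读回的值可能是邻近的 PAGE_SIZE 倍数,例如 999424 或 1003520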

  • memory.current
    A read-only single value file which exists on non-root cgroups.
    只读单值文件,存在于非根 cgroup 中。

    The total amount of memory currently being used by the cgroup and its descendants.
    由 cgroup 及其后代当前使用的内存总量。

  • memory.min
    A read-write single value file which exists on non-root cgroups. The default is "0".
    读写单值文件,存在于非根 cgroup 中。默认值为 "0"。

    Hard memory protection. If the memory usage of a cgroup is within its effective min boundary, the cgroup's memory won't be reclaimed under any conditions. If there is no unprotected reclaimable memory available, OOM killer is invoked. Above the effective min boundary (or effective low boundary if it is higher), pages are reclaimed proportionally to the overage, reducing reclaim pressure for smaller overages.
    硬内存保护。如果 cgroup 的内存使用量在其有效 min 边界以内,那么在任何情况下都不会回收该 cgroup 的内存。如果没有未受保护的可回收内存可用,将调用 OOM killer。超过有效 min 边界(或有效 low 边界,以较高者为准)后,页面会按超出量的比例被回收,超出量越小,回收压力越小。

    Effective min boundary is limited by memory.min values of all ancestor cgroups. If there is memory.min overcommitment (child cgroup or cgroups are requiring more protected memory than parent will allow), then each child cgroup will get the part of parent's protection proportional to its actual memory usage below memory.min.
    有效 min 边界受所有祖先 cgroup 的 memory.min 值限制。如果存在 memory.min 的过度承诺(子 cgroup 请求的受保护内存总量超过父级允许的范围),那么每个子 cgroup 获得的父级保护份额,与其低于 memory.min 部分的实际内存使用量成正比。

    Putting more memory than generally available under this protection is discouraged and may lead to constant OOMs.
    在此保护下放置比通常可用的更多内存是不鼓励的,可能导致持续的 OOM。

    If a memory cgroup is not populated with processes, its memory.min is ignored.
    如果内存 cgroup 未填充进程,则其 memory.min 将被忽略。

  • memory.low
    A read-write single value file which exists on non-root cgroups. The default is "0".
    读写单值文件,存在于非根 cgroup 中。默认值为 "0"。

    Best-effort memory protection. If the memory usage of a cgroup is within its effective low boundary, the cgroup's memory won't be reclaimed unless there is no reclaimable memory available in unprotected cgroups. Above the effective low boundary (or effective min boundary if it is higher), pages are reclaimed proportionally to the overage, reducing reclaim pressure for smaller overages.
    尽力而为(best-effort)的内存保护。如果 cgroup 的内存使用量在其有效 low 边界以内,那么除非未受保护的 cgroup 中已没有可回收内存,否则不会回收该 cgroup 的内存。超过有效 low 边界(或有效 min 边界,以较高者为准)后,页面会按超出量的比例被回收,超出量越小,回收压力越小。

    Effective low boundary is limited by memory.low values of all ancestor cgroups. If there is memory.low overcommitment (child cgroup or cgroups are requiring more protected memory than parent will allow), then each child cgroup will get the part of parent's protection proportional to its actual memory usage below memory.low.
    有效 low 边界受所有祖先 cgroup 的 memory.low 值限制。如果存在 memory.low 的过度承诺(子 cgroup 请求的受保护内存总量超过父级允许的范围),那么每个子 cgroup 获得的父级保护份额,与其低于 memory.low 部分的实际内存使用量成正比。

    Putting more memory than generally available under this protection is discouraged.
    在此保护下放置比通常可用的更多内存是不鼓励的。

  • memory.high
    A read-write single value file which exists on non-root cgroups. The default is "max".
    读写单值文件,存在于非根 cgroup 中。默认值为 "max"。

    Memory usage throttle limit. If a cgroup's usage goes over the high boundary, the processes of the cgroup are throttled and put under heavy reclaim pressure.
    内存使用节流(throttle)限制。如果 cgroup 的使用量超过 high 边界,该 cgroup 的进程将被节流,并承受沉重的回收压力。

    Going over the high limit never invokes the OOM killer and under extreme conditions the limit may be breached. The high limit should be used in scenarios where an external process monitors the limited cgroup to alleviate heavy reclaim pressure.
    超过高限不会调用 OOM killer,在极端情况下可能会超出限制。应在外部进程监视受限制的 cgroup 以减轻严重的回收压力的情况下使用高限。

  • memory.max
    A read-write single value file which exists on non-root cgroups. The default is "max".
    读写单值文件,存在于非根 cgroup 中。默认值为 "max"。

    Memory usage hard limit. This is the main mechanism to limit memory usage of a cgroup. If a cgroup's memory usage reaches this limit and can't be reduced, the OOM killer is invoked in the cgroup. Under certain circumstances, the usage may go over the limit temporarily.
    内存使用硬限制。这是限制 cgroup 内存使用的主要机制。如果 cgroup 的内存使用达到此限制且无法减少,则在 cgroup 中调用 OOM killer。在某些情况下,使用量可能会暂时超过限制。

    In default configuration regular 0-order allocations always succeed unless OOM killer chooses current task as a victim.
    在默认配置下,常规的 0 阶(order-0)分配总是会成功,除非 OOM killer 将当前任务选为牺牲者。

    Some kinds of allocations don't invoke the OOM killer. Caller could retry them differently, return into userspace as -ENOMEM or silently ignore in cases like disk readahead.
    某些类型的分配不会调用 OOM killer。调用者可以以不同方式重试它们,作为 -ENOMEM 返回到用户空间,或者在诸如磁盘预读取等情况下默默忽略。
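
    For example, capping the cgroup at 1GiB (illustrative; size suffixes such as "G" should be accepted here), then removing the limit:
    例如,将该 cgroup 的内存硬限制设为 1GiB(示意;这里应当也接受 "G" 这类大小后缀),然后解除限制:

    echo 1G > memory.max
    echo max > memory.max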

  • memory.reclaim
    A write-only nested-keyed file which exists for all cgroups.
    仅写入的嵌套键文件,适用于所有 cgroups。

    This is a simple interface to trigger memory reclaim in the target cgroup.
    这是一个简单的接口,用于触发目标 cgroup 中的内存回收。

    This file accepts a single key, the number of bytes to reclaim. No nested keys are currently supported.
    此文件接受一个单一键,即要回收的字节数。当前不支持嵌套键。

    Example:

    echo "1G" > memory.reclaim

    The interface can be later extended with nested keys to configure the reclaim behavior. For example, specify the type of memory to reclaim from (anon, file, ..).
    接口可以稍后通过嵌套键扩展以配置回收行为。例如,指定要从中回收的内存类型(anon、file 等)。

    Please note that the kernel can over or under reclaim from the target cgroup. If less bytes are reclaimed than the specified amount, -EAGAIN is returned.
    请注意,内核可能会过度或不足地从目标 cgroup 中回收。如果回收的字节数少于指定的数量,则返回 -EAGAIN。

    Please note that the proactive reclaim (triggered by this interface) is not meant to indicate memory pressure on the memory cgroup. Therefore socket memory balancing triggered by the memory reclaim normally is not exercised in this case. This means that the networking layer will not adapt based on reclaim induced by memory.reclaim.
    请注意,主动回收(由此接口触发)并不意味着该内存 cgroup 存在内存压力。因此,通常由内存回收触发的套接字内存平衡在这种情况下不会被执行。这意味着网络层不会因 memory.reclaim 引起的回收而进行调整。

  • memory.peak
    A read-only single value file which exists on non-root cgroups.
    只读单值文件,存在于非根 cgroup 中。

    The max memory usage recorded for the cgroup and its descendants since the creation of the cgroup.
    自创建 cgroup 以来记录的 cgroup 及其后代的最大内存使用量。

  • memory.oom.group
    A read-write single value file which exists on non-root cgroups. The default value is "0".
    读写单值文件,存在于非根 cgroup 中。默认值为 "0"。

    Determines whether the cgroup should be treated as an indivisible workload by the OOM killer. If set, all tasks belonging to the cgroup or to its descendants (if the memory cgroup is not a leaf cgroup) are killed together or not at all. This can be used to avoid partial kills to guarantee workload integrity.
    确定是否应将 cgroup 视为 OOM killer 的不可分割的工作负载。如果设置,属于 cgroup 或其后代(如果内存 cgroup 不是叶子 cgroup)的所有任务将一起被杀死或完全不被杀死。这可用于避免部分杀死以保证工作负载的完整性。

    Tasks with the OOM protection (oom_score_adj set to -1000) are treated as an exception and are never killed.
    具有 OOM 保护(oom_score_adj 设置为 -1000)的任务被视为例外,并且永远不会被杀死。

    If the OOM killer is invoked in a cgroup, it's not going to kill any tasks outside of this cgroup, regardless of the memory.oom.group values of ancestor cgroups.
    如果在某个 cgroup 中调用了 OOM killer,则无论祖先 cgroup 的 memory.oom.group 值如何,都不会杀死该 cgroup 之外的任何任务。
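
    For example, to have all tasks in this cgroup killed together whenever the OOM killer picks any one of them:
    例如,让 OOM killer 在选中该 cgroup 中的任意任务时,将其中的所有任务一起杀死:

    echo 1 > memory.oom.group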

  • memory.events
    A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified otherwise, a value change in this file generates a file modified event.
    只读的扁平键文件,存在于非根 cgroup 中。定义了以下条目。除非另有说明,此文件中的值更改会生成文件修改事件。

    Note that all fields in this file are hierarchical and the file modified event can be generated due to an event down the hierarchy. For the local events at the cgroup level see memory.events.local.
    请注意,此文件中的所有字段都是分层的,文件修改事件可能是由于层次结构下的事件而生成的。有关 cgroup 级别的本地事件,请参阅 memory.events.local。

    • low
      The number of times the cgroup is reclaimed due to high memory pressure even though its usage is under the low boundary. This usually indicates that the low boundary is over-committed.
      由于内存压力高,尽管使用量低于低边界,但 cgroup 被回收的次数。这通常表示低边界被过度承诺。

    • high
      The number of times processes of the cgroup are throttled and routed to perform direct memory reclaim because the high memory boundary was exceeded. For a cgroup whose memory usage is capped by the high limit rather than global memory pressure, this event's occurrences are expected.
      由于超过高内存边界,cgroup 的进程被限制并路由执行直接内存回收的次数。对于其内存使用受到高限制而不是全局内存压力的 cgroup,预期会发生此事件。

    • max
      The number of times the cgroup's memory usage was about to go over the max boundary. If direct reclaim fails to bring it down, the cgroup goes to OOM state.
      cgroup 的内存使用量即将超过 max 边界的次数。如果直接回收无法将其降下来,该 cgroup 将进入 OOM 状态。

    • oom
      The number of times the cgroup's memory usage reached the limit and allocation was about to fail.
      cgroup 的内存使用量达到限制且分配即将失败的次数。

    This event is not raised if the OOM killer is not considered as an option, e.g. for failed high-order allocations or if caller asked to not retry attempts.
    如果不考虑 OOM killer 作为选项(例如对于失败的高阶分配或如果调用者要求不重试尝试),则不会引发此事件。

    • oom_kill
      The number of processes belonging to this cgroup killed by any kind of OOM killer.
      由任何类型的 OOM killer 杀死的属于此 cgroup 的进程数。

    • oom_group_kill
      The number of times a group OOM has occurred.
      组 OOM 发生的次数。

  • memory.events.local
    Similar to memory.events but the fields in the file are local to the cgroup i.e. not hierarchical. The file modified event generated on this file reflects only the local events.
    类似于 memory.events,但文件中的字段是本地的,即不是分层的。在此文件上生成的文件修改事件仅反映本地事件。

  • memory.stat
    A read-only flat-keyed file which exists on non-root cgroups.
    这是一个只读的扁平键文件,存在于非根 cgroups 中。

    This breaks down the cgroup's memory footprint into different types of memory, type-specific details, and other information on the state and past events of the memory management system.
    它将 cgroup 的内存占用分解为不同类型的内存、类型特定的细节以及内存管理系统的状态和过去事件的其他信息。

    All memory amounts are in bytes.
    所有内存量都以字节为单位。

    The entries are ordered to be human readable, and new entries can show up in the middle. Don't rely on items remaining in a fixed position; use the keys to look up specific values!
    这些条目被排序为人类可读,并且新条目可能会出现在中间。不要依赖于条目保持固定位置;使用键来查找特定值!

    If an entry has no per-node counter (i.e. it does not show up in memory.numa_stat), we use the 'npn' (non-per-node) tag to indicate that it will not show in memory.numa_stat.
    如果某个条目没有每节点计数器(即不会出现在 memory.numa_stat 中),我们用 'npn'(non-per-node,非每节点)标签来表示它不会出现在 memory.numa_stat 中。

    • anon
      Amount of memory used in anonymous mappings such as brk(), sbrk(), and mmap(MAP_ANONYMOUS)
      匿名映射中使用的内存量,例如 brk()、sbrk() 和 mmap(MAP_ANONYMOUS)

    • file
      Amount of memory used to cache filesystem data, including tmpfs and shared memory.
      用于缓存文件系统数据的内存量,包括 tmpfs 和共享内存。

    • kernel (npn)
      Amount of total kernel memory, including (kernel_stack, pagetables, percpu, vmalloc, slab) in addition to other kernel memory use cases.
      总内核内存量,包括(kernel_stack、pagetables、percpu、vmalloc、slab)以及其他内核内存使用情况。

    • kernel_stack
      Amount of memory allocated to kernel stacks.
      分配给内核栈的内存量。

    • pagetables
      Amount of memory allocated for page tables.
      用于页表的内存量。

    • sec_pagetables
      Amount of memory allocated for secondary page tables, this currently includes KVM mmu allocations on x86 and arm64.
      用于次级页表的内存量,目前包括 x86 和 arm64 上的 KVM mmu 分配。

    • percpu (npn)
      Amount of memory used for storing per-cpu kernel data structures.
      用于存储每 CPU 内核数据结构的内存量。

    • sock (npn)
      Amount of memory used in network transmission buffers
      网络传输缓冲区中使用的内存量

    • vmalloc (npn)
      Amount of memory used for vmap backed memory.
      用于 vmap 支持的内存量。

    • shmem
      Amount of cached filesystem data that is swap-backed, such as tmpfs, shm segments, shared anonymous mmap()s
      缓存的文件系统数据,如 tmpfs、shm 段、共享匿名 mmap(),这些数据是交换支持的。

    • zswap
      Amount of memory consumed by the zswap compression backend.
      zswap 压缩后端消耗的内存量。

    • zswapped
      Amount of application memory swapped out to zswap.
      交换到 zswap 的应用程序内存量。

    • file_mapped
      Amount of cached filesystem data mapped with mmap()
      使用 mmap() 映射的缓存文件系统数据量。

    • file_dirty
      Amount of cached filesystem data that was modified but not yet written back to disk
      已修改但尚未写回磁盘的缓存文件系统数据量。

    • file_writeback
      Amount of cached filesystem data that was modified and is currently being written back to disk
      已修改且当前正在写回磁盘的缓存文件系统数据量。

    • swapcached
      Amount of swap cached in memory. The swapcache is accounted against both memory and swap usage.
      内存中缓存的交换空间量。交换缓存计入内存和交换使用量。

    • anon_thp
      Amount of memory used in anonymous mappings backed by transparent hugepages
      由透明巨大页支持的匿名映射中使用的内存量。

    • file_thp
      Amount of cached filesystem data backed by transparent hugepages
      由透明巨大页支持的缓存文件系统数据量。

    • shmem_thp
      Amount of shm, tmpfs, shared anonymous mmap()s backed by transparent hugepages
      由透明巨大页支持的 shm、tmpfs、共享匿名 mmap() 的内存量。

    • inactive_anon, active_anon, inactive_file, active_file, unevictable
      Amount of memory, swap-backed and filesystem-backed, on the internal memory management lists used by the page reclaim algorithm.
      内存、交换支持和文件系统支持的数量,位于页面回收算法使用的内部内存管理列表上。

      As these represent internal list state (eg. shmem pages are on anon memory management lists), inactive_foo + active_foo may not be equal to the value for the foo counter, since the foo counter is type-based, not list-based.
      由于这些代表内部列表状态(例如,shmem 页面位于匿名内存管理列表上),因此 inactive_foo + active_foo 可能不等于 foo 计数的值,因为 foo 计数是基于类型而不是基于列表的。

    • slab_reclaimable
      Part of "slab" that might be reclaimed, such as dentries and inodes.
      可能被回收的“slab”部分,例如 dentries 和 inodes。

    • slab_unreclaimable
      Part of "slab" that cannot be reclaimed on memory pressure.
      在内存压力下无法回收的“slab”部分。

    • slab (npn)
      Amount of memory used for storing in-kernel data structures.
      用于存储内核数据结构的内存量。

    • workingset_refault_anon
      Number of refaults of previously evicted anonymous pages.
      先前被驱逐的匿名页面再次发生缺页(refault)的次数。

    • workingset_refault_file
      Number of refaults of previously evicted file pages.
      先前被驱逐的文件页面再次发生缺页(refault)的次数。

    • workingset_activate_anon
      Number of refaulted anonymous pages that were immediately activated.
      发生 refault 后立即被激活的匿名页面数量。

    • workingset_activate_file
      Number of refaulted file pages that were immediately activated.
      发生 refault 后立即被激活的文件页面数量。

    • workingset_restore_anon
      Number of restored anonymous pages which have been detected as an active workingset before they got reclaimed.
      在被回收之前被检测为活动工作集的已恢复匿名页面数量。

    • workingset_restore_file
      Number of restored file pages which have been detected as an active workingset before they got reclaimed.
      在被回收之前被检测为活动工作集的已恢复文件页面数量。

    • workingset_nodereclaim
      Number of times a shadow node has been reclaimed
      影子节点被回收的次数

    • pgscan (npn)
      Amount of scanned pages (in an inactive LRU list)
      扫描页面的数量(在非活动 LRU 列表中)

    • pgsteal (npn)
      Amount of reclaimed pages
      回收页面的数量

    • pgscan_kswapd (npn)
      Amount of scanned pages by kswapd (in an inactive LRU list)
      kswapd 扫描的页面数量(在非活动 LRU 列表中)

    • pgscan_direct (npn)
      Amount of scanned pages directly (in an inactive LRU list)
      直接扫描的页面数量(在非活动 LRU 列表中)

    • pgscan_khugepaged (npn)
      Amount of scanned pages by khugepaged (in an inactive LRU list)
      khugepaged 扫描的页面数量(在非活动 LRU 列表中)

    • pgsteal_kswapd (npn)
      Amount of reclaimed pages by kswapd
      kswapd 回收的页面数量

    • pgsteal_direct (npn)
      Amount of reclaimed pages directly
      直接回收的页面数量

    • pgsteal_khugepaged (npn)
      Amount of reclaimed pages by khugepaged
      khugepaged 回收的页面数量

    • pgfault (npn)
      Total number of page faults incurred
      发生的页面错误总数

    • pgmajfault (npn)
      Number of major page faults incurred
      发生的主要页面错误数

    • pgrefill (npn)
      Amount of scanned pages (in an active LRU list)
      扫描的页面数量(在活动 LRU 列表中)

    • pgactivate (npn)
      Amount of pages moved to the active LRU list
      移动到活动 LRU 列表的页面数量

    • pgdeactivate (npn)
      Amount of pages moved to the inactive LRU list
      移动到非活动 LRU 列表的页面数量

    • pglazyfree (npn)
      Amount of pages postponed to be freed under memory pressure
      在内存压力下推迟释放的页面数量

    • pglazyfreed (npn)
      Amount of reclaimed lazyfree pages
      回收的延迟释放页面数量

    • thp_fault_alloc (npn)
      Number of transparent hugepages which were allocated to satisfy a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
      为满足页面错误而分配的透明巨大页数。当未设置 CONFIG_TRANSPARENT_HUGEPAGE 时,此计数器不存在。

    • thp_collapse_alloc (npn)
      Number of transparent hugepages which were allocated to allow collapsing an existing range of pages. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
      为允许折叠现有页面范围而分配的透明巨大页数。当未设置 CONFIG_TRANSPARENT_HUGEPAGE 时,此计数器不存在。

    • thp_swpout (npn)
      Number of transparent hugepages which are swapout in one piece without splitting.
      一次性交换出的透明巨大页数。

    • thp_swpout_fallback (npn)
      Number of transparent hugepages which were split before swapout. Usually because failed to allocate some continuous swap space for the huge page.
      在交换出之前被分割的透明巨大页数。通常是因为无法为巨大页分配连续的交换空间。

  • memory.numa_stat
    A read-only nested-keyed file which exists on non-root cgroups.
    这是一个只读的嵌套键文件,存在于非根 cgroups 中。

    This breaks down the cgroup's memory footprint into different types of memory, type-specific details, and other information per node on the state of the memory management system.
    它将 cgroup 的内存占用分解为不同类型的内存、特定类型的细节以及有关内存管理系统状态的每个节点的其他信息。

    This is useful for providing visibility into the NUMA locality information within an memcg since the pages are allowed to be allocated from any physical node. One of the use case is evaluating application performance by combining this information with the application's CPU allocation.
    这对于提供 memcg 中 NUMA 本地性信息的可见性很有用,因为页面可以从任何物理节点分配。一个用例是通过将此信息与应用程序的 CPU 分配结合起来评估应用程序的性能。

    All memory amounts are in bytes.
    所有内存量都以字节为单位。

    The output format of memory.numa_stat is:
    memory.numa_stat 的输出格式为:

    type N0=<bytes in node 0> N1=<bytes in node 1> ...
    type N0=<节点 0 中的字节数> N1=<节点 1 中的字节数> ...

    The entries are ordered to be human readable, and new entries can show up in the middle. Don't rely on items remaining in a fixed position; use the keys to look up specific values!
    条目的顺序是为了便于阅读,并且新条目可能会出现在中间。不要依赖于条目保持固定位置;使用键来查找特定值!

    The entries can refer to the memory.stat.
    这些条目可以参考 memory.stat。

  • memory.swap.current
    A read-only single value file which exists on non-root cgroups.
    这是一个只读的单值文件,存在于非根 cgroups 中。

    The total amount of swap currently being used by the cgroup and its descendants.
    表示 cgroup 及其后代当前正在使用的交换空间总量。

  • memory.swap.high
    A read-write single value file which exists on non-root cgroups. The default is "max".
    这是一个读写的单值文件,存在于非根 cgroups 中。默认值为 "max"。

    Swap usage throttle limit. If a cgroup's swap usage exceeds this limit, all its further allocations will be throttled to allow userspace to implement custom out-of-memory procedures.
    交换使用率限制。如果 cgroup 的交换使用率超过此限制,所有进一步的分配将被限制,以允许用户空间实现自定义的内存不足程序。

    This limit marks a point of no return for the cgroup. It is NOT designed to manage the amount of swapping a workload does during regular operation. Compare to memory.swap.max, which prohibits swapping past a set amount, but lets the cgroup continue unimpeded as long as other memory can be reclaimed.
    此限制标志着 cgroup 的不可逆转点。它并非设计用于管理工作负载在正常操作期间进行交换的数量。与 memory.swap.max 相比,后者禁止超过设定数量的交换,但只要其他内存可以被回收,就让 cgroup 继续不受阻碍。

    Healthy workloads are not expected to reach this limit.
    健康的工作负载不应该达到此限制。

  • memory.swap.peak
    A read-only single value file which exists on non-root cgroups.
    这是一个只读的单值文件,存在于非根 cgroups 中。

    The max swap usage recorded for the cgroup and its descendants since the creation of the cgroup.
    自创建 cgroup 以来,记录的 cgroup 及其后代的最大交换使用量。

  • memory.swap.max
    A read-write single value file which exists on non-root cgroups. The default is "max".
    这是一个读写的单值文件,存在于非根 cgroups 中。默认值为 "max"。

    Swap usage hard limit. If a cgroup's swap usage reaches this limit, anonymous memory of the cgroup will not be swapped out.
    交换使用硬限制。如果 cgroup 的交换使用达到此限制,cgroup 的匿名内存将不会被交换出去。

  • memory.swap.events
    A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified otherwise, a value change in this file generates a file modified event.
    这是一个只读的扁平键文件,存在于非根 cgroups 中。定义了以下条目。除非另有说明,此文件中的值更改会生成文件修改事件。

    • high
      The number of times the cgroup's swap usage was over the high threshold.
      cgroup 的交换使用次数超过高阈值的次数。

    • max
      The number of times the cgroup's swap usage was about to go over the max boundary and swap allocation failed.
      cgroup 的交换使用次数即将超过最大边界并且交换分配失败的次数。

    • fail
      The number of times swap allocation failed either because of running out of swap system-wide or max limit.
      交换分配失败的次数,要么是因为系统范围内的交换用完了,要么是因为达到了最大限制。

    When reduced under the current usage, the existing swap entries are reclaimed gradually and the swap usage may stay higher than the limit for an extended period of time. This reduces the impact on the workload and memory management.
    在当前使用量减少时,现有的交换条目会逐渐被回收,交换使用量可能会长时间保持高于限制。这减少了对工作负载和内存管理的影响。

  • memory.zswap.current
    A read-only single value file which exists on non-root cgroups.
    这是一个只读的单值文件,存在于非根 cgroups 中。

    The total amount of memory consumed by the zswap compression backend.
    zswap 压缩后端消耗的内存总量。

  • memory.zswap.max
    A read-write single value file which exists on non-root cgroups. The default is "max".
    这是一个读写的单值文件,存在于非根 cgroups 中。默认值为 "max"。

    Zswap usage hard limit. If a cgroup's zswap pool reaches this limit, it will refuse to take any more stores before existing entries fault back in or are written out to disk.
    zswap 使用硬限制。如果 cgroup 的 zswap 池达到此限制,它将拒绝在现有条目故障返回或写入磁盘之前接受更多存储。

  • memory.pressure
    A read-only nested-keyed file.
    这是一个只读的嵌套键文件。

    Shows pressure stall information for memory. See Documentation/accounting/psi.rst for details.
    显示内存的压力阻塞信息。有关详细信息,请参阅 Documentation/accounting/psi.rst
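
    An illustrative read of a PSI file such as memory.pressure (numbers are made up; see Documentation/accounting/psi.rst for the exact semantics):
    memory.pressure 这类 PSI 文件的一次示意性读取(数值为虚构,确切语义参见 Documentation/accounting/psi.rst):

    # cat memory.pressure
    some avg10=0.12 avg60=0.05 avg300=0.01 total=273649
    full avg10=0.00 avg60=0.00 avg300=0.00 total=41329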

Usage Guidelines

使用指南

"memory.high" is the main mechanism to control memory usage. Over-committing on high limit (sum of high limits > available memory) and letting global memory pressure to distribute memory according to usage is a viable strategy.
"memory.high" 是控制内存使用的主要机制。在高限制上进行过度承诺(高限制之和 > 可用内存)并让全局内存压力根据使用情况分配内存是一种可行的策略。

Because breach of the high limit doesn't trigger the OOM killer but throttles the offending cgroup, a management agent has ample opportunities to monitor and take appropriate actions such as granting more memory or terminating the workload.
由于高限制的突破不会触发 OOM 杀手,而是限制违规的 cgroup,管理代理有充分的机会监视并采取适当的行动,比如分配更多内存或终止工作负载。

Determining whether a cgroup has enough memory is not trivial as memory usage doesn't indicate whether the workload can benefit from more memory. For example, a workload which writes data received from network to a file can use all available memory but can also operate as performant with a small amount of memory. A measure of memory pressure - how much the workload is being impacted due to lack of memory - is necessary to determine whether a workload needs more memory; unfortunately, memory pressure monitoring mechanism isn't implemented yet.
确定一个 cgroup 是否有足够的内存并不是一件简单的事,因为内存使用量并不表明工作负载是否可以从更多内存中受益。例如,一个将从网络接收的数据写入文件的工作负载可以使用所有可用内存,但也可以在少量内存的情况下运行得很好。需要一种内存压力的度量 - 工作负载由于缺乏内存而受到的影响程度 - 来确定工作负载是否需要更多内存;不幸的是,内存压力监控机制尚未实现。

Memory Ownership

内存所有权

A memory area is charged to the cgroup which instantiated it and stays charged to the cgroup until the area is released. Migrating a process to a different cgroup doesn't move the memory usages that it instantiated while in the previous cgroup to the new cgroup.
内存区域由实例化它的 cgroup 承担责任,并且直到释放该区域之前一直由该 cgroup 承担责任。将进程迁移到不同的 cgroup 不会将其在以前 cgroup 中实例化的内存使用情况移动到新 cgroup 中。

A memory area may be used by processes belonging to different cgroups. To which cgroup the area will be charged is in-deterministic; however, over time, the memory area is likely to end up in a cgroup which has enough memory allowance to avoid high reclaim pressure.
一个内存区域可能被属于不同 cgroup 的进程使用。该区域将被收取到哪个 cgroup 是不确定的;然而,随着时间的推移,内存区域很可能最终进入具有足够内存配额以避免高回收压力的 cgroup。

If a cgroup sweeps a considerable amount of memory which is expected to be accessed repeatedly by other cgroups, it may make sense to use POSIX_FADV_DONTNEED to relinquish the ownership of memory areas belonging to the affected files to ensure correct memory ownership.
如果一个 cgroup 扫描了预计会被其他 cgroup 重复访问的大量内存,使用 POSIX_FADV_DONTNEED 放弃受影响文件的内存区域的所有权可能是有意义的,以确保正确的内存所有权。

IO

The "io" controller regulates the distribution of IO resources. This controller implements both weight based and absolute bandwidth or IOPS limit distribution; however, weight based distribution is available only if cfq-iosched is in use and neither scheme is available for blk-mq devices.
"IO"控制器调节IO资源的分配。该控制器实现了基于权重和绝对带宽或IOPS限制的分配;然而,只有在使用cfq-iosched时才可用基于权重的分配,并且对于blk-mq设备,这两种方案都不可用。

IO Interface Files

IO接口文件

  • io.stat
    A read-only nested-keyed file.
    一个只读的嵌套键文件。

    Lines are keyed by $MAJ:$MIN device numbers and not ordered. The following nested keys are defined.
    行由$MAJ:$MIN设备号键控,并且没有顺序。定义了以下嵌套键。

    rbytes Bytes read 读取的字节数
    wbytes Bytes written 写入的字节数
    rios Number of read IOs 读取IO的次数
    wios Number of write IOs 写入IO的次数
    dbytes Bytes discarded 丢弃的字节数
    dios Number of discard IOs 丢弃IO的次数

    An example read output follows:
    以下是一个示例读取输出:

        8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
        8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
    
  • io.cost.qos
    A read-write nested-keyed file which exists only on the root cgroup.
    一个可读写的嵌套键文件,仅存在于根cgroup中。

    This file configures the Quality of Service of the IO cost model based controller (CONFIG_BLK_CGROUP_IOCOST) which currently implements "io.weight" proportional control. Lines are keyed by $MAJ:$MIN device numbers and not ordered. The line for a given device is populated on the first write for the device on "io.cost.qos" or "io.cost.model". The following nested keys are defined.
    该文件配置IO成本模型基于控制器(CONFIG_BLK_CGROUP_IOCOST)的质量服务,目前实现了"io.weight"比例控制。行由$MAJ:$MIN设备号键控,并且没有顺序。对于给定设备的行在第一次写入"io.cost.qos"或"io.cost.model"时被填充。定义了以下嵌套键。

    enable Weight-based control enable 基于权重的控制启用
    ctrl "auto" or "user"
    rpct Read latency percentile [0, 100] 读取延迟百分位数[0, 100]
    rlat Read latency threshold 读取延迟阈值
    wpct Write latency percentile [0, 100] 写入延迟百分位数[0, 100]
    wlat Write latency threshold 写入延迟阈值
    min Minimum scaling percentage [1, 10000] 最小缩放百分比[1, 10000]
    max Maximum scaling percentage [1, 10000] 最大缩放百分比[1, 10000]

    The controller is disabled by default and can be enabled by setting "enable" to 1. "rpct" and "wpct" parameters default to zero and the controller uses internal device saturation state to adjust the overall IO rate between "min" and "max".
    该控制器默认处于禁用状态,可以通过将"enable"设置为1来启用。"rpct"和"wpct"参数默认为零,控制器使用内部设备饱和状态来调整"min"和"max"之间的整体IO速率。

    When a better control quality is needed, latency QoS parameters can be configured. For example:
    当需要更好的控制质量时,可以配置延迟QoS参数。例如:

    8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0

    shows that on sdb, the controller is enabled, will consider the device saturated if the 95th percentile of read completion latencies is above 75ms or write 150ms, and adjust the overall IO issue rate between 50% and 150% accordingly.
    表示在 sdb 上控制器已启用;如果读取完成延迟的第 95 百分位超过 75ms,或写入完成延迟的第 95 百分位超过 150ms,则认为设备已饱和,并相应地将整体 IO 下发速率调整在 50% 到 150% 之间。

    The lower the saturation point, the better the latency QoS at the cost of aggregate bandwidth. The narrower the allowed adjustment range between "min" and "max", the more conformant to the cost model the IO behavior. Note that the IO issue base rate may be far off from 100% and setting "min" and "max" blindly can lead to a significant loss of device capacity or control quality. "min" and "max" are useful for regulating devices which show wide temporary behavior changes - e.g. a ssd which accepts writes at the line speed for a while and then completely stalls for multiple seconds.
    饱和点越低,延迟QoS越好,但会牺牲总带宽。允许的"min"和"max"之间的调整范围越窄,IO行为越符合成本模型。请注意,IO发出基础速率可能远非100%,盲目设置"min"和"max"可能导致设备容量或控制质量的显著损失。"min"和"max"对于调节显示临时行为变化的设备很有用 - 例如,一块SSD在一段时间内以线速度接受写入,然后完全停滞了多秒钟。

    When "ctrl" is "auto", the parameters are controlled by the kernel and may change automatically. Setting "ctrl" to "user" or setting any of the percentile and latency parameters puts it into "user" mode and disables the automatic changes. The automatic mode can be restored by setting "ctrl" to "auto".
    当"ctrl"为"auto"时,参数由内核控制,并可能自动更改。将"ctrl"设置为"user"或设置任何百分位数和延迟参数会将其置于"user"模式,并禁用自动更改。可以通过将"ctrl"设置为"auto"来恢复自动模式。

  • io.cost.model
    A read-write nested-keyed file which exists only on the root cgroup.
    一个可读写的嵌套键文件,仅存在于根cgroup中。

    This file configures the cost model of the IO cost model based controller (CONFIG_BLK_CGROUP_IOCOST) which currently implements "io.weight" proportional control. Lines are keyed by $MAJ:$MIN device numbers and not ordered. The line for a given device is populated on the first write for the device on "io.cost.qos" or "io.cost.model". The following nested keys are defined.
    该文件配置IO成本模型基于控制器(CONFIG_BLK_CGROUP_IOCOST)的成本模型,目前实现了"io.weight"比例控制。行由$MAJ:$MIN设备号键控,并且没有顺序。对于给定设备的行在第一次写入"io.cost.qos"或"io.cost.model"时被填充。定义了以下嵌套键。

    ctrl "auto" or "user"
    model The cost model in use - "linear" 正在使用的成本模型 - "linear"

    When "ctrl" is "auto", the kernel may change all parameters dynamically. When "ctrl" is set to "user" or any other parameters are written to, "ctrl" become "user" and the automatic changes are disabled.
    当"ctrl"为"auto"时,内核可以动态更改所有参数。当"ctrl"设置为"user"或写入任何其他参数时,"ctrl"变为"user",自动更改被禁用。

    When "model" is "linear", the following model parameters are defined.
    当"model"为"linear"时,定义了以下模型参数。

    [r|w]bps The maximum sequential IO throughput 最大顺序IO吞吐量
    [r|w]seqiops The maximum 4k sequential IOs per second 每秒最大4k顺序IO数
    [r|w]randiops The maximum 4k random IOs per second 每秒最大4k随机IO数

    From the above, the builtin linear model determines the base costs of a sequential and random IO and the cost coefficient for the IO size. While simple, this model can cover most common device classes acceptably.
    从上面可以看出,内置的线性模型确定了顺序和随机IO的基本成本以及IO大小的成本系数。虽然简单,但这个模型可以较好地覆盖大多数常见设备类。

    The IO cost model isn't expected to be accurate in absolute sense and is scaled to the device behavior dynamically.
    IO成本模型不会在绝对意义上准确,并且会根据设备行为动态调整。

    If needed, tools/cgroup/iocost_coef_gen.py can be used to generate device-specific coefficients.
    如果需要,可以使用tools/cgroup/iocost_coef_gen.py来生成特定设备的系数。

  • io.weight
    A read-write flat-keyed file which exists on non-root cgroups. The default is "default 100".
    一个可读写的扁平键文件,存在于非根cgroup中。默认值为"default 100"。

    The first line is the default weight applied to devices without specific override. The rest are overrides keyed by $MAJ:$MIN device numbers and not ordered. The weights are in the range [1, 10000] and specifies the relative amount IO time the cgroup can use in relation to its siblings.
    第一行是应用于没有特定覆盖的设备的默认权重。其余的行由$MAJ:$MIN设备号键控,并且没有顺序。权重在[1, 10000]范围内,指定了cgroup相对于其同级别兄弟节点可以使用的IO时间的相对量。

    The default weight can be updated by writing either "default $WEIGHT" or simply "$WEIGHT". Overrides can be set by writing "$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".
    默认权重可以通过写入"default $WEIGHT"或简单地"$WEIGHT"来更新。可以通过写入"$MAJ:$MIN $WEIGHT"来设置覆盖,并通过写入"$MAJ:$MIN default"来取消设置。

    An example read output follows:
    以下是一个示例读取输出:

        default 100
        8:16 200
        8:0 50
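
    The corresponding write examples (device numbers are illustrative):
    对应的写入示例(设备号仅为示意):

    echo "default 150" > io.weight
    echo "8:16 200" > io.weight
    # 取消对 8:16 的覆盖,恢复使用默认权重
    echo "8:16 default" > io.weight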
    
  • io.max
    A read-write nested-keyed file which exists on non-root cgroups.
    一个可读写的嵌套键文件,存在于非根cgroup中。

    BPS and IOPS based IO limit. Lines are keyed by $MAJ:$MIN device numbers and not ordered. The following nested keys are defined.
    BPS和IOPS基础的IO限制。行由$MAJ:$MIN设备号键控,并且没有顺序。定义了以下嵌套键。

    rbps Max read bytes per second 每秒最大读取字节数
    wbps Max write bytes per second 每秒最大写入字节数
    riops Max read IO operations per second 每秒最大读取IO操作数
    wiops Max write IO operations per second 每秒最大写入IO操作数

    When writing, any number of nested key-value pairs can be specified in any order. "max" can be specified as the value to remove a specific limit. If the same key is specified multiple times, the outcome is undefined.
    在写入时,可以以任意顺序指定任意数量的嵌套键值对。可以通过指定"max"作为值来删除特定限制。如果多次指定相同的键,结果是未定义的。

    BPS and IOPS are measured in each IO direction and IOs are delayed if limit is reached. Temporary bursts are allowed.
    BPS和IOPS是在每个IO方向上测量的,如果达到限制,则IO会延迟。允许临时突发。

    Setting read limit at 2M BPS and write at 120 IOPS for 8:16:
    设置8:16的读取限制为2M BPS和写入限制为120 IOPS:

    echo "8:16 rbps=2097152 wiops=120" > io.max

    Reading returns the following:
    读取返回以下内容:

    8:16 rbps=2097152 wbps=max riops=max wiops=120

    Write IOPS limit can be removed by writing the following:
    可以通过写入以下内容来移除写入IOPS限制:

    echo "8:16 wiops=max" > io.max

    Reading now returns the following:
    现在读取返回以下内容:

    8:16 rbps=2097152 wbps=max riops=max wiops=max

  • io.pressure
    A read-only nested-keyed file.
    一个只读的嵌套键文件。

    Shows pressure stall information for IO. See Documentation/accounting/psi.rst for details.
    显示IO的压力阻塞信息。详细信息请参阅Documentation/accounting/psi.rst

Writeback

Page cache is dirtied through buffered writes and shared mmaps and written asynchronously to the backing filesystem by the writeback mechanism. Writeback sits between the memory and IO domains and regulates the proportion of dirty memory by balancing dirtying and write IOs.
页面缓存通过缓冲写入和共享内存映射来脏化,并通过写回机制异步地写入到后备文件系统。写回机制位于内存和IO域之间,通过平衡脏化和写IO来调节脏内存的比例。

The io controller, in conjunction with the memory controller, implements control of page cache writeback IOs. The memory controller defines the memory domain that dirty memory ratio is calculated and maintained for and the io controller defines the io domain which writes out dirty pages for the memory domain. Both system-wide and per-cgroup dirty memory states are examined and the more restrictive of the two is enforced.
IO 控制器与内存控制器一起实现对页面缓存写回 IO 的控制。内存控制器定义了用于计算和维护脏内存比例的内存域,IO 控制器则定义了负责为该内存域写出脏页的 IO 域。系统范围和每个 cgroup 的脏内存状态都会被检查,并以两者中更严格的那个为准。

cgroup writeback requires explicit support from the underlying filesystem. Currently, cgroup writeback is implemented on ext2, ext4, btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are attributed to the root cgroup.
cgroup写回需要底层文件系统的显式支持。目前,cgroup写回在ext2、ext4、btrfs、f2fs和xfs上实现。在其他文件系统上,所有写回IO都归属于根cgroup。

There are inherent differences in memory and writeback management which affects how cgroup ownership is tracked. Memory is tracked per page while writeback per inode. For the purpose of writeback, an inode is assigned to a cgroup and all IO requests to write dirty pages from the inode are attributed to that cgroup.
内存和写回管理存在固有的差异,这会影响到cgroup所有权的跟踪方式。内存是按页跟踪的,而写回是按inode跟踪的。为了进行写回,一个inode被分配给一个cgroup,并且所有写入该inode的脏页的IO请求都归属于该cgroup。

As cgroup ownership for memory is tracked per page, there can be pages which are associated with different cgroups than the one the inode is associated with. These are called foreign pages. The writeback constantly keeps track of foreign pages and, if a particular foreign cgroup becomes the majority over a certain period of time, switches the ownership of the inode to that cgroup.
由于内存的cgroup所有权是按页跟踪的,所以可能存在与inode关联的不同cgroup的页面。这些页面被称为外部页面。写回机制不断跟踪外部页面,并且如果某个外部cgroup在一定时间内成为多数,就会将inode的所有权切换到该cgroup。

我的理解:比如以页缓存为例,不同的进程都可能生成页缓存,可能导致这些页缓存对应的page属于不同的cgroup,而页缓存对应的文件,即inode只有一个,这个inode的所有权每次只能属于一个cgroup

While this model is enough for most use cases where a given inode is mostly dirtied by a single cgroup even when the main writing cgroup changes over time, use cases where multiple cgroups write to a single inode simultaneously are not supported well. In such circumstances, a significant portion of IOs are likely to be attributed incorrectly. As memory controller assigns page ownership on the first use and doesn't update it until the page is released, even if writeback strictly follows page ownership, multiple cgroups dirtying overlapping areas wouldn't work as expected. It's recommended to avoid such usage patterns.
尽管这种模型足够满足大多数用例,其中给定的inode大部分由单个cgroup脏化,即使主要写入cgroup随时间变化,但不支持多个cgroup同时写入单个inode的用例。在这种情况下,很可能会错误地归属大部分IO。由于内存控制器在第一次使用时分配页面所有权,并且在页面释放之前不更新它,即使写回严格遵循页面所有权,多个cgroup脏化重叠区域也无法按预期工作。建议避免这种使用模式。

The sysctl knobs which affect writeback behavior are applied to cgroup writeback as follows.
影响写回行为的sysctl参数适用于cgroup写回,如下所示。

  • vm.dirty_background_ratio, vm.dirty_ratio
    These ratios apply the same to cgroup writeback with the amount of available memory capped by limits imposed by the memory controller and system-wide clean memory.
    这些比率适用于cgroup写回,可用内存量受内存控制器和系统范围的干净内存的限制。

  • vm.dirty_background_bytes, vm.dirty_bytes
    For cgroup writeback, this is calculated into ratio against total available memory and applied the same way as vm.dirty[_background]_ratio.
    对于cgroup写回,这将根据总可用内存计算为比率,并以与vm.dirty[_background]_ratio相同的方式应用。

IO Latency

IO延迟

This is a cgroup v2 controller for IO workload protection. You provide a group with a latency target, and if the average latency exceeds that target the controller will throttle any peers that have a lower latency target than the protected workload.
这是用于IO工作负载保护的cgroup v2控制器。您可以为一个组提供一个延迟目标,如果平均延迟超过该目标,控制器将限制任何具有比受保护工作负载更低延迟目标的对等组。

The limits are only applied at the peer level in the hierarchy. This means that in the diagram below, only groups A, B, and C will influence each other, and groups D and F will influence each other. Group G will influence nobody:
限制仅在层次结构中的对等级别上应用。这意味着在下面的图表中,只有A、B和C组会相互影响,D和F组会相互影响。G组不会影响任何人:

          [root]
  /          |            \
  A          B            C
 /  \        |
D    F       G

So the ideal way to configure this is to set io.latency in groups A, B, and C. Generally you do not want to set a value lower than the latency your device supports. Experiment to find the value that works best for your workload. Start at higher than the expected latency for your device and watch the avg_lat value in io.stat for your workload group to get an idea of the latency you see during normal operation. Use the avg_lat value as a basis for your real setting, setting at 10-15% higher than the value in io.stat.
因此,配置的理想方式是在A、B和C组中设置io.latency。通常情况下,您不希望将值设置为低于设备支持的延迟。尝试找到适合您工作负载的最佳值。从比设备预期延迟更高的值开始,并观察io.stat中工作负载组的avg_lat值,以了解正常操作期间的延迟情况。将avg_lat值作为您实际设置的基础,设置为比io.stat中的值高10-15%。
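
For example, giving group A a 10ms latency target on device 8:16 (the device number, path and value are illustrative; the interface format is described below):
例如,在设备 8:16 上为 A 组设置 10ms 的延迟目标(设备号、路径和数值仅为示意;接口格式见下文):

    echo "8:16 target=10000" > /sys/fs/cgroup/A/io.latency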

How IO Latency Throttling Works

IO延迟限制的工作原理

io.latency is work conserving; so as long as everybody is meeting their latency target the controller doesn't do anything. Once a group starts missing its target it begins throttling any peer group that has a higher target than itself. This throttling takes 2 forms:
io.latency 是 work conserving(不会无谓地浪费设备能力)的;只要每个组都满足其延迟目标,控制器就不会做任何事情。一旦某个组开始达不到其目标,它就会开始限制任何延迟目标比它自己高的对等组。这种限制有两种形式:

  • Queue depth throttling. This is the number of outstanding IO's a group is allowed to have. We will clamp down relatively quickly, starting at no limit and going all the way down to 1 IO at a time.
    队列深度限制。这是一个组允许的未完成IO数量。我们会相对迅速地进行限制,从没有限制一直降到一次只允许1个IO。

  • Artificial delay induction. There are certain types of IO that cannot be throttled without possibly adversely affecting higher priority groups. This includes swapping and metadata IO. These types of IO are allowed to occur normally, however they are "charged" to the originating group. If the originating group is being throttled you will see the use_delay and delay fields in io.stat increase. The delay value is how many microseconds that are being added to any process that runs in this group. Because this number can grow quite large if there is a lot of swapping or metadata IO occurring we limit the individual delay events to 1 second at a time.
    人为延迟引入。有些类型的IO无法进行限制,否则可能会对优先级更高的组产生不利影响。这包括交换和元数据IO。这些类型的IO可以正常进行,但它们会“计费”给发起组。如果发起组被限制,您将看到io.stat中的use_delay和delay字段增加。delay值是添加到在该组中运行的任何进程的微秒数。由于如果存在大量交换或元数据IO,这个数字可能会变得非常大,我们将单个延迟事件限制为1秒。

Once the victimized group starts meeting its latency target again it will start unthrottling any peer groups that were throttled previously. If the victimized group simply stops doing IO the global counter will unthrottle appropriately.
一旦受害组再次满足其延迟目标,它将开始解除限制之前被限制的任何对等组。如果受害组停止进行IO,全局计数器将适当地解除限制。

IO Latency Interface Files

IO延迟接口文件

  • io.latency
    This takes a similar format as the other controllers.
    其格式与其他控制器相似。

    "MAJOR:MINOR target=<target time in microseconds>"
    "MAJOR:MINOR target=<目标时间(以微秒为单位)>"

  • io.stat
    If the controller is enabled you will see extra stats in io.stat in addition to the normal ones.
    如果启用了控制器,您将在io.stat中看到额外的统计信息,除了常规统计信息外。

    • depth
      This is the current queue depth for the group.
      这是该组的当前队列深度。

    • avg_lat
      This is an exponential moving average with a decay rate of 1/exp bound by the sampling interval. The decay rate interval can be calculated by multiplying the win value in io.stat by the corresponding number of samples based on the win value.
      这是指数移动平均值,衰减率为1/exp,受采样间隔限制。衰减率间隔可以通过将io.stat中的win值乘以基于win值的相应样本数来计算。

    • win
      The sampling window size in milliseconds. This is the minimum duration of time between evaluation events. Windows only elapse with IO activity. Idle periods extend the most recent window.
      采样窗口大小(以毫秒为单位)。这是评估事件之间的最小持续时间。窗口只在有IO活动时流逝。空闲期间会延长最近的窗口。

译者注:上述额外的统计项需要先通过向 io.latency 写入延迟目标来启用;io.latency 只存在于非根 cgroup 中

IO Priority

IO优先级

A single attribute controls the behavior of the I/O priority cgroup policy, namely the io.prio.class attribute. The following values are accepted for that attribute:
一个属性控制I/O优先级cgroup策略的行为,即io.prio.class属性。该属性接受以下值:

  • no-change
    Do not modify the I/O priority class.
    不修改I/O优先级类别。

  • promote-to-rt
    For requests that have a non-RT I/O priority class, change it into RT. Also change the priority level of these requests to 4. Do not modify the I/O priority of requests that have priority class RT.
    对于具有非RT I/O优先级类别的请求,将其更改为RT。还将这些请求的优先级级别更改为4。不修改具有RT优先级类别的请求的I/O优先级。

  • restrict-to-be
    For requests that do not have an I/O priority class or that have I/O priority class RT, change it into BE. Also change the priority level of these requests to 0. Do not modify the I/O priority class of requests that have priority class IDLE.
    对于没有I/O优先级类别或具有RT I/O优先级类别的请求,将其更改为BE。还将这些请求的优先级级别更改为0。不修改具有IDLE优先级类别的请求的I/O优先级类别。

  • idle
    Change the I/O priority class of all requests into IDLE, the lowest I/O priority class.
    将所有请求的I/O优先级类别更改为IDLE,即最低的I/O优先级类别。

  • none-to-rt
    Deprecated. Just an alias for promote-to-rt.
    已弃用。只是promote-to-rt的别名。
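
For example, preventing tasks in a cgroup from issuing RT-class IO by restricting their requests to best effort (a minimal sketch; the file lives in the target cgroup's directory):
例如,通过将请求限制为 best-effort,防止某个 cgroup 中的任务发出 RT 类别的 IO(最小示例;该文件位于目标 cgroup 的目录中):

    echo restrict-to-be > io.prio.class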

The following numerical values are associated with the I/O priority policies:
以下数字与I/O优先级策略相关联:

no-change 0
promote-to-rt 1
restrict-to-be 2
idle 3

The numerical value that corresponds to each I/O priority class is as follows:
与每个I/O优先级类别相关的数字值如下:

IOPRIO_CLASS_NONE 0
IOPRIO_CLASS_RT (real-time) 1
IOPRIO_CLASS_BE (best effort) 2
IOPRIO_CLASS_IDLE 3

The algorithm to set the I/O priority class for a request is as follows:
设置请求的I/O优先级类别的算法如下:

  • If I/O priority class policy is promote-to-rt, change the request I/O priority class to IOPRIO_CLASS_RT and change the request I/O priority level to 4.
    如果I/O优先级类别策略是promote-to-rt,则将请求的I/O优先级类别更改为IOPRIO_CLASS_RT,并将请求的I/O优先级级别更改为4。

  • If I/O priority class policy is not promote-to-rt, translate the I/O priority class policy into a number, then change the request I/O priority class into the maximum of the I/O priority class policy number and the numerical I/O priority class.
    如果 I/O 优先级类别策略不是 promote-to-rt,则将该策略转换为对应的数字,然后将请求的 I/O 优先级类别更改为该策略数字与请求原有 I/O 优先级类别数值中的较大者。

译者注:io.prio.class 只存在于非根 cgroup 中

PID

The process number controller is used to allow a cgroup to stop any new tasks from being fork()'d or clone()'d after a specified limit is reached.
进程号控制器用于在达到指定限制后阻止任何新任务进行fork()或clone()。

The number of tasks in a cgroup can be exhausted in ways which other controllers cannot prevent, thus warranting its own controller. For example, a fork bomb is likely to exhaust the number of tasks before hitting memory restrictions.
在某些情况下,cgroup中的任务数量可能会耗尽,而其他控制器无法阻止这种情况,因此需要单独的控制器。例如,fork炸弹可能会在达到内存限制之前耗尽任务数量。

Note that PIDs used in this controller refer to TIDs, process IDs as used by the kernel.
请注意,此控制器中使用的 PID 指的是 TID,即内核所使用的进程 ID。

PID Interface Files

PID接口文件

  • pids.max
    A read-write single value file which exists on non-root cgroups. The default is "max".
    非根cgroup上存在的可读写单值文件。默认值为"max"。

    Hard limit of number of processes.
    进程数量的硬限制

  • pids.current
    A read-only single value file which exists on all cgroups.
    所有cgroup上都存在的只读单值文件。

    The number of processes currently in the cgroup and its descendants.
    当前cgroup及其子孙中的进程数量。

Organisational operations are not blocked by cgroup policies, so it is possible to have pids.current > pids.max. This can be done by either setting the limit to be smaller than pids.current, or attaching enough processes to the cgroup such that pids.current is larger than pids.max. However, it is not possible to violate a cgroup PID policy through fork() or clone(). These will return -EAGAIN if the creation of a new process would cause a cgroup policy to be violated.
组织操作不受cgroup策略的阻塞,因此pids.current可能大于pids.max。这可以通过将限制设置为小于pids.current的值,或者将足够多的进程附加到cgroup中,使得pids.current大于pids.max来实现。但是,无法通过fork()或clone()违反cgroup PID策略。如果创建新进程会违反cgroup策略,则会返回-EAGAIN。
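
A minimal sketch of limiting the number of tasks, assuming a child cgroup "test" has been created under /sys/fs/cgroup:
一个限制任务数量的最小示例,假设已在 /sys/fs/cgroup 下创建子 cgroup "test":

    echo 16 > /sys/fs/cgroup/test/pids.max
    cat /sys/fs/cgroup/test/pids.current
    # 达到限制后,该 cgroup 中的 fork()/clone() 将返回 -EAGAIN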

Cpuset

The "cpuset" controller provides a mechanism for constraining the CPU and memory node placement of tasks to only the resources specified in the cpuset interface files in a task's current cgroup. This is especially valuable on large NUMA systems where placing jobs on properly sized subsets of the systems with careful processor and memory placement to reduce cross-node memory access and contention can improve overall system performance.
"cpuset"控制器提供了一种机制,用于将任务的CPU和内存节点限制为仅使用任务当前cgroup中cpuset接口文件中指定的资源。这在大型NUMA系统上特别有价值,可以通过将作业放置在正确大小的系统子集上,并进行仔细的处理器和内存放置,以减少跨节点内存访问和争用,从而提高整个系统的性能。

The "cpuset" controller is hierarchical. That means the controller cannot use CPUs or memory nodes not allowed in its parent.
"cpuset"控制器是分层的。这意味着控制器不能使用其父级不允许的CPU或内存节点。

Cpuset Interface Files

Cpuset接口文件

  • cpuset.cpus
    A read-write multiple values file which exists on non-root cpuset-enabled cgroups.
    非根cpuset启用的cgroup上存在的可读写多值文件。

    It lists the requested CPUs to be used by tasks within this cgroup. The actual list of CPUs to be granted, however, is subjected to constraints imposed by its parent and can differ from the requested CPUs.
    它列出了此cgroup中的任务要使用的请求CPU。然而,实际授予的CPU列表受其父级施加的约束限制,并且可能与请求的CPU不同。

    The CPU numbers are comma-separated numbers or ranges. For example:
    CPU编号是逗号分隔的数字或范围。例如:
    # cat cpuset.cpus
    0-4,6,8-10
    An empty value indicates that the cgroup is using the same setting as the nearest cgroup ancestor with a non-empty "cpuset.cpus" or all the available CPUs if none is found.
    空值表示cgroup正在使用与最近的非空"cpuset.cpus"的cgroup祖先相同的设置,或者如果找不到,则使用所有可用的CPU。

    The value of "cpuset.cpus" stays constant until the next update and won't be affected by any CPU hotplug events.
    "cpuset.cpus"的值在下一次更新之前保持不变,并且不会受到任何CPU热插拔事件的影响。

  • cpuset.cpus.effective
    A read-only multiple values file which exists on all cpuset-enabled cgroups.
    所有cpuset启用的cgroup上都存在的只读多值文件。

    It lists the onlined CPUs that are actually granted to this cgroup by its parent. These CPUs are allowed to be used by tasks within the current cgroup.
    它列出了由其父级实际授予此cgroup的在线CPU。这些CPU允许在当前cgroup中的任务使用。

    If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows all the CPUs from the parent cgroup that can be available to be used by this cgroup. Otherwise, it should be a subset of "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus" can be granted. In this case, it will be treated just like an empty "cpuset.cpus".
    如果"cpuset.cpus"为空,则"cpuset.cpus.effective"文件显示可以由此cgroup使用的父级cgroup中的所有CPU。否则,它应该是"cpuset.cpus"的子集,除非无法授予"cpuset.cpus"中列出的任何CPU。在这种情况下,它将被视为一个空的"cpuset.cpus"。

    Its value will be affected by CPU hotplug events.
    其值将受到CPU热插拔事件的影响。

  • cpuset.mems
    A read-write multiple values file which exists on non-root cpuset-enabled cgroups.
    非根cpuset启用的cgroup上存在的可读写多值文件。

    It lists the requested memory nodes to be used by tasks within this cgroup. The actual list of memory nodes granted, however, is subjected to constraints imposed by its parent and can differ from the requested memory nodes.
    它列出了此cgroup中的任务要使用的请求内存节点。然而,实际授予的内存节点列表受其父级施加的约束限制,并且可能与请求的内存节点不同。

    The memory node numbers are comma-separated numbers or ranges. For example:
    内存节点编号是逗号分隔的数字或范围。例如:
    # cat cpuset.mems
    0-1,3
    An empty value indicates that the cgroup is using the same setting as the nearest cgroup ancestor with a non-empty "cpuset.mems" or all the available memory nodes if none is found.
    空值表示cgroup正在使用与最近的非空"cpuset.mems"的cgroup祖先相同的设置,或者如果找不到,则使用所有可用的内存节点。

    The value of "cpuset.mems" stays constant until the next update and won't be affected by any memory nodes hotplug events.
    "cpuset.mems"的值在下一次更新之前保持不变,并且不会受到任何内存节点热插拔事件的影响。

    Setting a non-empty value to "cpuset.mems" causes memory of tasks within the cgroup to be migrated to the designated nodes if they are currently using memory outside of the designated nodes.
    如果将非空值设置为"cpuset.mems",则会将cgroup内的任务内存迁移到指定节点,前提是它们当前正在使用指定节点之外的内存。

    There is a cost for this memory migration. The migration may not be complete and some memory pages may be left behind. So it is recommended that "cpuset.mems" should be set properly before spawning new tasks into the cpuset. Even if there is a need to change "cpuset.mems" with active tasks, it shouldn't be done frequently.
    这种内存迁移是有代价的。迁移可能不完全,可能会有一些内存页面被遗留下来。因此,在将新任务生成到cpuset之前,建议正确设置"cpuset.mems"。即使需要在活动任务中更改"cpuset.mems",也不应频繁进行更改。

  • cpuset.mems.effective
    A read-only multiple values file which exists on all cpuset-enabled cgroups.
    这是一个只读的多值文件,存在于所有启用cpuset的cgroup中。

    It lists the onlined memory nodes that are actually granted to this cgroup by its parent. These memory nodes are allowed to be used by tasks within the current cgroup.
    它列出了由其父级授予给该cgroup的实际在线内存节点。这些内存节点允许当前cgroup内的任务使用。

    If "cpuset.mems" is empty, it shows all the memory nodes from the parent cgroup that will be available to be used by this cgroup. Otherwise, it should be a subset of "cpuset.mems" unless none of the memory nodes listed in "cpuset.mems" can be granted. In this case, it will be treated just like an empty "cpuset.mems".
    如果"cpuset.mems"为空,则显示从父级cgroup中将可用于该cgroup的所有内存节点。否则,它应该是"cpuset.mems"的子集,除非"cpuset.mems"中列出的内存节点都无法被授予。在这种情况下,它将被视为一个空的"cpuset.mems"。

    Its value will be affected by memory nodes hotplug events.
    它的值会受到内存节点热插拔事件的影响。
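
    For illustration only (hypothetical path and node numbers): when "cpuset.mems" is left empty, "cpuset.mems.effective" falls back to the memory nodes available from the parent, so the first cat below prints an empty line:

        # cat /sys/fs/cgroup/test/cpuset.mems

        # cat /sys/fs/cgroup/test/cpuset.mems.effective
        0-1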

  • cpuset.cpus.exclusive
    A read-write multiple values file which exists on non-root cpuset-enabled cgroups.
    这是一个可读写的多值文件,存在于非根cpuset启用的cgroup中。

    It lists all the exclusive CPUs that are allowed to be used to create a new cpuset partition. Its value is not used unless the cgroup becomes a valid partition root. See the "cpuset.cpus.partition" section below for a description of what a cpuset partition is.
    它列出了允许用于创建新的cpuset分区的所有独占CPU。除非cgroup成为有效的分区根,否则不使用它的值。有关cpuset分区的描述,请参见下面的"cpuset.cpus.partition"部分。

    When the cgroup becomes a partition root, the actual exclusive CPUs that are allocated to that partition are listed in "cpuset.cpus.exclusive.effective" which may be different from "cpuset.cpus.exclusive". If "cpuset.cpus.exclusive" has previously been set, "cpuset.cpus.exclusive.effective" is always a subset of it.
    当cgroup成为分区根时,分配给该分区的实际独占CPU列在"cpuset.cpus.exclusive.effective"中,该值可能与"cpuset.cpus.exclusive"不同。如果之前已经设置了"cpuset.cpus.exclusive",那么"cpuset.cpus.exclusive.effective"始终是它的子集。

    Users can manually set it to a value that is different from "cpuset.cpus". The only constraint in setting it is that the list of CPUs must be exclusive with respect to its sibling.
    用户可以手动将其设置为与"cpuset.cpus"不同的值。设置的唯一约束是CPU列表必须与其同级的CPU互斥。

    For a parent cgroup, any one of its exclusive CPUs can only be distributed to at most one of its child cgroups. Having an exclusive CPU appearing in two or more of its child cgroups is not allowed (the exclusivity rule). A value that violates the exclusivity rule will be rejected with a write error.
    对于父级cgroup,其任何一个独占CPU只能分配给其最多一个子cgroup。不允许在两个或更多子cgroup中出现相同的独占CPU(互斥规则)。违反互斥规则的值将被拒绝并引发写入错误。

    The root cgroup is a partition root and all its available CPUs are in its exclusive CPU set.
    根cgroup是一个分区根,其所有可用的CPU都在其独占CPU集合中。
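
    A rough sketch of the exclusivity rule (hypothetical paths and CPU numbers; the exact error text and code reported by the kernel may differ). The second write is rejected because CPU 3 is already exclusive to child1:

        # echo "2-3" > /sys/fs/cgroup/parent/child1/cpuset.cpus.exclusive
        # echo "3-4" > /sys/fs/cgroup/parent/child2/cpuset.cpus.exclusive
        echo: write error: Invalid argument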

  • cpuset.cpus.exclusive.effective
    A read-only multiple values file which exists on all non-root cpuset-enabled cgroups.
    这是一个只读的多值文件,存在于所有非根cpuset启用的cgroup中。

    This file shows the effective set of exclusive CPUs that can be used to create a partition root. The content of this file will always be a subset of "cpuset.cpus" and its parent's "cpuset.cpus.exclusive.effective" if its parent is not the root cgroup. It will also be a subset of "cpuset.cpus.exclusive" if it is set. If "cpuset.cpus.exclusive" is not set, it is treated to have an implicit value of "cpuset.cpus" in the formation of local partition.
    该文件显示可以用于创建分区根的有效独占CPU集合。该文件的内容始终是"cpuset.cpus"及其父级的"cpuset.cpus.exclusive.effective"的子集(如果其父级不是根cgroup)。如果设置了"cpuset.cpus.exclusive",它也将是其子集。如果未设置"cpuset.cpus.exclusive",则在本地分区的形成中,它被视为具有隐式值"cpuset.cpus"。

  • cpuset.cpus.partition
    A read-write single value file which exists on non-root cpuset-enabled cgroups. This flag is owned by the parent cgroup and is not delegatable.
    这是一个可读写的单值文件,存在于非根cpuset启用的cgroup中。此标志由父级cgroup拥有,不可委派。

    It accepts only the following input values when written to.
    写入时,它只接受以下输入值。

    "member" Non-root member of a partition 分区的非根成员
    "root" Partition root 分区根
    "isolated" Partition root without load balancing 无负载均衡的分区根

    A cpuset partition is a collection of cpuset-enabled cgroups with a partition root at the top of the hierarchy and its descendants except those that are separate partition roots themselves and their descendants. A partition has exclusive access to the set of exclusive CPUs allocated to it. Other cgroups outside of that partition cannot use any CPUs in that set.
    cpuset分区是由一组启用cpuset的cgroup组成的集合:分区根位于该层次结构的顶部,并包含其所有子孙,但不包括那些本身是独立分区根的cgroup及其各自的子孙。分区对分配给它的独占CPU集合拥有独占访问权,该分区之外的其他cgroup不能使用该集合中的任何CPU。

    There are two types of partitions - local and remote. A local partition is one whose parent cgroup is also a valid partition root. A remote partition is one whose parent cgroup is not a valid partition root itself. Writing to "cpuset.cpus.exclusive" is optional for the creation of a local partition as its "cpuset.cpus.exclusive" file will assume an implicit value that is the same as "cpuset.cpus" if it is not set. Writing the proper "cpuset.cpus.exclusive" values down the cgroup hierarchy before the target partition root is mandatory for the creation of a remote partition.
    分区有两种类型:本地分区和远程分区。本地分区是指其父级cgroup也是有效分区根的分区;远程分区是指其父级cgroup本身不是有效分区根的分区。创建本地分区时,写入"cpuset.cpus.exclusive"是可选的,因为如果未设置,其"cpuset.cpus.exclusive"文件将采用与"cpuset.cpus"相同的隐式值。而创建远程分区时,必须在到达目标分区根之前,沿cgroup层次结构逐级写入正确的"cpuset.cpus.exclusive"值。

    Currently, a remote partition cannot be created under a local partition. All the ancestors of a remote partition root except the root cgroup cannot be a partition root.
    目前,无法在本地分区下创建远程分区。远程分区根的所有祖先(除了根cgroup)都不能是分区根。
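
    A rough sketch of creating a remote partition (hypothetical paths and CPU numbers; the exact sequence of writes may vary with the kernel version and the existing configuration). Here /sys/fs/cgroup/a stays a "member", so a/b cannot be a local partition and the exclusive CPUs have to be written down the hierarchy first:

        # echo "4-7" > /sys/fs/cgroup/a/cpuset.cpus.exclusive
        # echo "4-7" > /sys/fs/cgroup/a/b/cpuset.cpus
        # echo "4-7" > /sys/fs/cgroup/a/b/cpuset.cpus.exclusive
        # echo root > /sys/fs/cgroup/a/b/cpuset.cpus.partition
        # cat /sys/fs/cgroup/a/b/cpuset.cpus.partition
        root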

    The root cgroup is always a partition root and its state cannot be changed. All other non-root cgroups start out as "member".
    根cgroup始终是分区根,其状态无法更改。所有其他非根cgroup最初都是"member"。

    When set to "root", the current cgroup is the root of a new partition or scheduling domain. The set of exclusive CPUs is determined by the value of its "cpuset.cpus.exclusive.effective".
    当设置为"root"时,当前cgroup是一个新分区或调度域的根。独占CPU集合的设置由其"cpuset.cpus.exclusive.effective"的值确定。

    When set to "isolated", the CPUs in that partition will be in an isolated state without any load balancing from the scheduler. Tasks placed in such a partition with multiple CPUs should be carefully distributed and bound to each of the individual CPUs for optimal performance.
    当设置为"isolated"时,该分区中的CPU将处于隔离状态,调度程序不进行任何负载均衡。放置在具有多个CPU的此分区中的任务应谨慎分配并绑定到每个单独的CPU以获得最佳性能。

    A partition root ("root" or "isolated") can be in one of the two possible states - valid or invalid. An invalid partition root is in a degraded state where some state information may be retained, but behaves more like a "member".
    分区根("root"或"isolated")可以处于两种可能的状态之一-有效或无效。无效的分区根处于退化状态,可能保留了一些状态信息,但行为更像是一个"member"。

    All possible state transitions among "member", "root" and "isolated" are allowed.
    允许在"member"、"root"和"isolated"之间进行所有可能的状态转换。

    On read, the "cpuset.cpus.partition" file can show the following values.
    在读取时,"cpuset.cpus.partition"文件可以显示以下值。

    "member" Non-root member of a partition 分区的非根成员
    "root" Partition root 分区根
    "isolated" Partition root without load balancing 无负载均衡的分区根
    "root invalid (<reason>)" Invalid partition root 无效的分区根
    "isolated invalid (<reason>)" Invalid isolated partition root 无效的隔离分区根

    In the case of an invalid partition root, a descriptive string on why the partition is invalid is included within parentheses.
    在无效的分区根的情况下,括号中包含了关于分区无效的描述性字符串。

    For a local partition root to be valid, the following conditions must be met.
    对于本地分区根要有效,必须满足以下条件。

    1. The parent cgroup is a valid partition root.
      父级cgroup是有效的分区根。

    2. The "cpuset.cpus.exclusive.effective" file cannot be empty, though it may contain offline CPUs.
      "cpuset.cpus.exclusive.effective"文件不能为空,尽管它可能包含离线的CPU。

    3. The "cpuset.cpus.effective" cannot be empty unless there is no task associated with this partition.
      "cpuset.cpus.effective"不能为空,除非与该分区关联的任务为空。

    For a remote partition root to be valid, all the above conditions except the first one must be met.
    对于远程分区根要有效,必须满足上述所有条件,除了第一个条件。

    External events like hotplug or changes to "cpuset.cpus" or "cpuset.cpus.exclusive" can cause a valid partition root to become invalid and vice versa. Note that a task cannot be moved to a cgroup with empty "cpuset.cpus.effective".
    外部事件,如热插拔或对"cpuset.cpus"或"cpuset.cpus.exclusive"的更改,可能会导致有效的分区根变为无效,反之亦然。请注意,任务不能移动到具有空"cpuset.cpus.effective"的cgroup中。

    A valid non-root parent partition may distribute out all its CPUs to its child local partitions when there is no task associated with it.
    有效的非根父分区可以在没有与之关联的任务时将其所有CPU分配给其子本地分区。

    Care must be taken to change a valid partition root to "member" as all its child local partitions, if present, will become invalid causing disruption to tasks running in those child partitions. These inactivated partitions could be recovered if their parent is switched back to a partition root with a proper value in "cpuset.cpus" or "cpuset.cpus.exclusive".
    更改有效分区根为"member"时必须小心,因为所有子本地分区(如果存在)将变为无效,从而导致运行在这些子分区中的任务中断。如果将其父级切换回具有"cpuset.cpus"或"cpuset.cpus.exclusive"中适当值的分区根,这些未激活的分区可以恢复。

    Poll and inotify events are triggered whenever the state of "cpuset.cpus.partition" changes. That includes changes caused by write to "cpuset.cpus.partition", cpu hotplug or other changes that modify the validity status of the partition. This will allow user space agents to monitor unexpected changes to "cpuset.cpus.partition" without the need to do continuous polling.
    每当"cpuset.cpus.partition"的状态发生变化时,都会触发poll和inotify事件。这包括由于写入"cpuset.cpus.partition"、CPU热插拔或其他修改分区有效性状态的操作所引起的变化。这样,用户空间代理就可以监视"cpuset.cpus.partition"的意外变化,而无需持续轮询。

    A user can pre-configure certain CPUs to an isolated state with load balancing disabled at boot time with the "isolcpus" kernel boot command line option. If those CPUs are to be put into a partition, they have to be used in an isolated partition.
    用户可以通过内核启动命令行选项"isolcpus"在启动时将某些CPU预先配置为禁用负载均衡的隔离状态。如果要将这些CPU放入某个分区,则必须在隔离("isolated")分区中使用它们。
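
    As an illustrative sketch only (hypothetical cgroup name and CPU numbers), a local isolated partition directly under the root cgroup could be created as follows; "cpuset.cpus.exclusive" is left unset, so it implicitly takes the value of "cpuset.cpus":

        # echo "2-3" > /sys/fs/cgroup/rt/cpuset.cpus
        # echo isolated > /sys/fs/cgroup/rt/cpuset.cpus.partition
        # cat /sys/fs/cgroup/rt/cpuset.cpus.partition
        isolated
        # cat /sys/fs/cgroup/rt/cpuset.cpus.effective
        2-3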

Device controller

设备控制器

Device controller manages access to device files. It includes both creation of new device files (using mknod), and access to the existing device files.
设备控制器管理对设备文件的访问。它包括创建新的设备文件(使用mknod)和访问现有设备文件。

Cgroup v2 device controller has no interface files and is implemented on top of cgroup BPF. To control access to device files, a user may create bpf programs of type BPF_PROG_TYPE_CGROUP_DEVICE and attach them to cgroups with BPF_CGROUP_DEVICE flag. On an attempt to access a device file, corresponding BPF programs will be executed, and depending on the return value the attempt will succeed or fail with -EPERM.
Cgroup v2设备控制器没有接口文件,它是在cgroup BPF之上实现的。为了控制对设备文件的访问,用户可以创建类型为BPF_PROG_TYPE_CGROUP_DEVICE的bpf程序,并使用BPF_CGROUP_DEVICE标志将其附加到cgroup上。在尝试访问设备文件时,将执行相应的BPF程序,并根据其返回值决定该尝试成功还是以-EPERM失败。

A BPF_PROG_TYPE_CGROUP_DEVICE program takes a pointer to the bpf_cgroup_dev_ctx structure, which describes the device access attempt: access type (mknod/read/write) and device (type, major and minor numbers). If the program returns 0, the attempt fails with -EPERM, otherwise it succeeds.
BPF_PROG_TYPE_CGROUP_DEVICE程序接受指向bpf_cgroup_dev_ctx结构的指针,该结构描述了设备访问尝试的详细信息:访问类型(mknod/read/write)和设备(类型、主要号和次要号)。如果程序返回0,则尝试失败(返回-EPERM),否则成功。

An example of BPF_PROG_TYPE_CGROUP_DEVICE program may be found in tools/testing/selftests/bpf/progs/dev_cgroup.c in the kernel source tree.
在内核源代码的tools/testing/selftests/bpf/progs/dev_cgroup.c中可以找到BPF_PROG_TYPE_CGROUP_DEVICE程序的示例。
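
As a rough, version-dependent sketch (not taken from the kernel documentation), a compiled device-cgroup BPF object could be loaded and attached with bpftool; the object file name, pin path and cgroup path below are hypothetical, and the exact bpftool syntax and required ELF section names may vary:

    # bpftool prog load dev_cgroup.bpf.o /sys/fs/bpf/dev_cgroup
    # bpftool cgroup attach /sys/fs/cgroup/mygrp device pinned /sys/fs/bpf/dev_cgroup
    # bpftool cgroup show /sys/fs/cgroup/mygrp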

RDMA

The "rdma" controller regulates the distribution and accounting of RDMA resources.
"rdma"控制器用于调节RDMA资源的分配和计量。

RDMA Interface Files

RDMA接口文件

  • rdma.max
    A read-write nested-keyed file that exists for all the cgroups except root and describes the currently configured resource limit for a RDMA/IB device.
    对于除根cgroup之外的所有cgroup,这是一个可读写的嵌套键控文件,描述了RDMA/IB设备的当前配置资源限制。

    Lines are keyed by device name and are not ordered. Each line contains space separated resource name and its configured limit that can be distributed.
    每行由设备名称作为键,不排序。每行包含以空格分隔的资源名称和其配置的限制。

    The following nested keys are defined.
    定义了以下嵌套键

    hca_handle Maximum number of HCA Handles HCA句柄的最大数量
    hca_object Maximum number of HCA Objects HCA对象的最大数量

    An example for mlx4 and ocrdma device follows:
    以下是mlx4和ocrdma设备的示例:

        mlx4_0 hca_handle=2 hca_object=2000
        ocrdma1 hca_handle=3 hca_object=max
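
    Limits can be set by writing one device per line; the device names below are examples and must match devices actually present on the system (a sketch):

        # echo "mlx4_0 hca_handle=2 hca_object=2000" > rdma.max
        # echo "ocrdma1 hca_handle=3" > rdma.max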
    
  • rdma.current
    A read-only file that describes current resource usage. It exists for all the cgroups except root.
    这是一个只读文件,描述当前资源使用情况。对于除根cgroup之外的所有cgroup都存在。

    An example for mlx4 and ocrdma device follows:
    以下是mlx4和ocrdma设备的示例:

        mlx4_0 hca_handle=1 hca_object=20
        ocrdma1 hca_handle=1 hca_object=23
    

HugeTLB

The HugeTLB controller allows limiting the HugeTLB usage per control group and enforces the controller limit during page fault.
HugeTLB控制器允许限制每个控制组中的HugeTLB使用,并在页面错误期间强制执行控制器限制。

HugeTLB Interface Files

  • hugetlb.<hugepagesize>.current
    Show current usage for "hugepagesize" hugetlb. It exists for all the cgroups except root.
    显示“hugepagesize” HugeTLB的当前使用情况。对于除根cgroup之外的所有cgroup都存在。

  • hugetlb.<hugepagesize>.max
    Set/show the hard limit of "hugepagesize" hugetlb usage. The default value is "max". It exists for all the cgroups except root.
    设置/显示“hugepagesize” HugeTLB使用的硬限制。默认值为“max”。对于除根cgroup之外的所有cgroup都存在。
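
    For example (a sketch; the <hugepagesize> component depends on the huge page sizes supported by the system, e.g. "2MB" or "1GB", and the limit is given in bytes):

        # echo 2147483648 > hugetlb.2MB.max    # allow up to 2 GiB of 2MB huge pages
        # cat hugetlb.2MB.current
        0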

  • hugetlb.<hugepagesize>.events
    A read-only flat-keyed file which exists on non-root cgroups.
    这是一个只读的扁平键控文件,存在于非根cgroup上。

    • max
      The number of allocation failures due to the HugeTLB limit
      由于HugeTLB限制而导致的分配失败次数
  • hugetlb.<hugepagesize>.events.local
    Similar to hugetlb.<hugepagesize>.events but the fields in the file are local to the cgroup i.e. not hierarchical. The file modified event generated on this file reflects only the local events.
    类似于hugetlb.<hugepagesize>.events,但文件中的字段是针对cgroup本身的,即不是分层的。对此文件的修改事件仅反映本地事件。

  • hugetlb.<hugepagesize>.numa_stat
    Similar to memory.numa_stat, it shows the numa information of the hugetlb pages of <hugepagesize> in this cgroup. Only active, in-use hugetlb pages are included. The per-node values are in bytes.
    类似于memory.numa_stat,它显示了此cgroup中<hugepagesize>的HugeTLB页面的NUMA信息。仅包括正在使用的活动HugeTLB页面。每个节点的值以字节为单位。

Misc

杂项

The Miscellaneous cgroup provides the resource limiting and tracking mechanism for the scalar resources which cannot be abstracted like the other cgroup resources. Controller is enabled by the CONFIG_CGROUP_MISC config option.
"Miscellaneous cgroup"提供了对无法像其他cgroup资源那样抽象化的标量资源进行限制和跟踪的机制。控制器由CONFIG_CGROUP_MISC配置选项启用。

A resource can be added to the controller via enum misc_res_type{} in the include/linux/misc_cgroup.h file and the corresponding name via misc_res_name[] in the kernel/cgroup/misc.c file. Provider of the resource must set its capacity prior to using the resource by calling misc_cg_set_capacity().
可以通过include/linux/misc_cgroup.h文件中的enum misc_res_type{}向该控制器添加资源,并通过kernel/cgroup/misc.c文件中的misc_res_name[]添加相应的资源名称。资源的提供者必须在使用资源之前调用misc_cg_set_capacity()来设置其容量。

Once a capacity is set then the resource usage can be updated using charge and uncharge APIs. All of the APIs to interact with misc controller are in include/linux/misc_cgroup.h.
一旦设置了容量,就可以使用charge和uncharge API更新资源使用情况。与misc控制器交互的所有API都在include/linux/misc_cgroup.h中。

Misc Interface Files

杂项接口文件

Miscellaneous controller provides 3 interface files. If two misc resources (res_a and res_b) are registered then:
杂项控制器提供了3个接口文件。如果注册了两个杂项资源(res_a和res_b),则:

  • misc.capacity
    A read-only flat-keyed file shown only in the root cgroup. It shows miscellaneous scalar resources available on the platform along with their quantities:
    这是一个只读的扁平键控文件,仅在根cgroup中显示。它显示了平台上可用的杂项标量资源及其数量:

        $ cat misc.capacity
        res_a 50
        res_b 10
    
  • misc.current
    A read-only flat-keyed file shown in all cgroups. It shows the current usage of the resources in the cgroup and its children:
    这是一个只读的扁平键控文件,显示在所有cgroup中。它显示了cgroup及其子cgroup中资源的当前使用情况:

        $ cat misc.current
        res_a 3
        res_b 0
    
  • misc.max
    A read-write flat-keyed file shown in the non-root cgroups. Allowed maximum usage of the resources in the cgroup and its children:
    这是一个可读写的扁平键控文件,显示在非根cgroup中。它允许设置cgroup及其子cgroup中资源的最大使用量:

        $ cat misc.max
        res_a max
        res_b 4
    

    Limit can be set by:
    可以通过以下方式设置限制:

        # echo res_a 1 > misc.max
    

    Limit can be set to max by:
    可以将限制设置为最大值:

        # echo res_a max > misc.max
    

    Limits can be set higher than the capacity value in the misc.capacity file.
    限制可以设置得比misc.capacity文件中的容量值更高。

  • misc.events
    A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified otherwise, a value change in this file generates a file modified event. All fields in this file are hierarchical.
    这是一个只读的扁平键控文件,存在于非根cgroup上。定义了以下条目。除非另有说明,否则此文件中的值更改会生成文件修改事件。此文件中的所有字段都是分层的。

    • max
      The number of times the cgroup's resource usage was about to go over the max boundary.
      cgroup的资源使用量即将超过最大边界的次数。

Migration and Ownership

迁移和所有权

A miscellaneous scalar resource is charged to the cgroup in which it is used first, and stays charged to that cgroup until that resource is freed. Migrating a process to a different cgroup does not move the charge to the destination cgroup where the process has moved.
杂项标量资源首先被计入使用它的cgroup,并保持计入直到该资源被释放。将进程迁移到不同的cgroup不会将计费转移到进程所迁移到的目标cgroup中。

关于MISC Cgroup的用法,参考 LWN:新增misc cgroup!

Others

其他

perf_event

perf_event controller, if not mounted on a legacy hierarchy, is automatically enabled on the v2 hierarchy so that perf events can always be filtered by cgroup v2 path. The controller can still be moved to a legacy hierarchy after v2 hierarchy is populated.
如果perf_event控制器没有挂载在传统层次结构上,则会自动在v2层次结构上启用,以便可以始终通过cgroup v2路径过滤perf事件。在v2层次结构填充之后,仍然可以将控制器移动到传统层次结构中。

Non-normative information

非规范信息

This section contains information that isn't considered to be a part of the stable kernel API and so is subject to change.
本节包含的信息不被视为稳定的内核API的一部分,因此可能会发生更改。

CPU controller root cgroup process behaviour

CPU控制器根cgroup进程行为

When distributing CPU cycles in the root cgroup each thread in this cgroup is treated as if it was hosted in a separate child cgroup of the root cgroup. This child cgroup weight is dependent on its thread nice level.
在根cgroup中分配CPU周期时,该cgroup中的每个线程都被视为托管在根cgroup的单独子cgroup中。该子cgroup的权重取决于其线程的nice级别。

For details of this mapping see sched_prio_to_weight array in kernel/sched/core.c file (values from this array should be scaled appropriately so the neutral - nice 0 - value is 100 instead of 1024).
有关此映射的详细信息,请参见kernel/sched/core.c文件中的sched_prio_to_weight数组(此数组中的值应适当缩放,使中性值(nice 0)为100,而不是1024)。
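
As a rough worked example of that scaling (the array values quoted here are taken from recent kernels and are illustrative only):

    weight(nice) ≈ sched_prio_to_weight[nice + 20] * 100 / 1024
    nice  0  ->  1024 * 100 / 1024  =  100
    nice 10  ->   110 * 100 / 1024  ≈   11
    nice 19  ->    15 * 100 / 1024  ≈    1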

IO controller root cgroup process behaviour

IO控制器根cgroup进程行为

Root cgroup processes are hosted in an implicit leaf child node. When distributing IO resources this implicit child node is taken into account as if it was a normal child cgroup of the root cgroup with a weight value of 200.
根cgroup进程托管在一个隐式的叶子子节点中。在分配IO资源时,将考虑此隐式子节点,就好像它是根cgroup的普通子cgroup,其权重值为200。