Control Group v2 (translated by ChatGPT)

Published 2023-12-06 20:19:27, author: 摩斯电码

Original: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

This is the authoritative documentation on the design, interface and conventions of cgroup v2. It describes all userland-visible aspects of cgroup including core and specific controller behaviors. All future changes must be reflected in this document. Documentation for v1 is available under Documentation/admin-guide/cgroup-v1/index.rst.

Introduction

Terminology

"cgroup" stands for "control group" and is never capitalized. The singular form is used to designate the whole feature and also as a qualifier as in "cgroup controllers". When explicitly referring to multiple individual control groups, the plural form "cgroups" is used.

What is cgroup?

cgroup is a mechanism to organize processes hierarchically and distribute system resources along the hierarchy in a controlled and configurable manner.

cgroup is largely composed of two parts - the core and controllers. cgroup core is primarily responsible for hierarchically organizing processes. A cgroup controller is usually responsible for distributing a specific type of system resource along the hierarchy although there are utility controllers which serve purposes other than resource distribution.

cgroups form a tree structure and every process in the system belongs to one and only one cgroup. All threads of a process belong to the same cgroup. On creation, all processes are put in the cgroup that the parent process belongs to at the time. A process can be migrated to another cgroup. Migration of a process doesn't affect already existing descendant processes.

Following certain structural constraints, controllers may be enabled or disabled selectively on a cgroup. All controller behaviors are hierarchical - if a controller is enabled on a cgroup, it affects all processes which belong to the cgroups constituting the inclusive sub-hierarchy of the cgroup. When a controller is enabled on a nested cgroup, it always restricts the resource distribution further. The restrictions set closer to the root in the hierarchy cannot be overridden from further away.

Basic Operations

Mounting

Unlike v1, cgroup v2 has only a single hierarchy. The cgroup v2 hierarchy can be mounted with the following mount command:

# mount -t cgroup2 none $MOUNT_POINT

cgroup2 filesystem has the magic number 0x63677270 ("cgrp"). All controllers which support v2 and are not bound to a v1 hierarchy are automatically bound to the v2 hierarchy and show up at the root. Controllers which are not in active use in the v2 hierarchy can be bound to other hierarchies. This allows mixing v2 hierarchy with the legacy v1 multiple hierarchies in a fully backward compatible way.
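
As a quick sanity check of the magic number quoted above, its four bytes can be decoded as ASCII. This is only a small sketch: the constant name follows the kernel header naming convention, and the helper name `magic_to_ascii` is made up for illustration.

```python
# The cgroup2 filesystem magic number, as quoted in the text above.
CGROUP2_SUPER_MAGIC = 0x63677270


def magic_to_ascii(magic: int) -> str:
    """Decode a 32-bit filesystem magic number as four big-endian ASCII bytes."""
    return magic.to_bytes(4, byteorder="big").decode("ascii")


print(magic_to_ascii(CGROUP2_SUPER_MAGIC))
```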

A controller can be moved across hierarchies only after the controller is no longer referenced in its current hierarchy. Because per-cgroup controller states are destroyed asynchronously and controllers may have lingering references, a controller may not show up immediately on the v2 hierarchy after the final umount of the previous hierarchy. Similarly, a controller should be fully disabled to be moved out of the unified hierarchy and it may take some time for the disabled controller to become available for other hierarchies; furthermore, due to inter-controller dependencies, other controllers may need to be disabled too.

While useful for development and manual configurations, moving controllers dynamically between the v2 and other hierarchies is strongly discouraged for production use. It is recommended to decide the hierarchies and controller associations before starting to use the controllers after system boot.

During transition to v2, system management software might still automount the v1 cgroup filesystem and so hijack all controllers during boot, before manual intervention is possible. To make testing and experimenting easier, the kernel parameter cgroup_no_v1= allows disabling controllers in v1 and makes them always available in v2.

cgroup v2 currently supports the following mount options.

  • nsdelegate
    Consider cgroup namespaces as delegation boundaries. This option is system wide and can only be set on mount or modified through remount from the init namespace. The mount option is ignored on non-init namespace mounts. Please refer to the Delegation section for details.

  • favordynmods
    Reduce the latencies of dynamic cgroup modifications such as task migrations and controller on/offs at the cost of making hot path operations such as forks and exits more expensive. The static usage pattern of creating a cgroup, enabling controllers, and then seeding it with CLONE_INTO_CGROUP is not affected by this option.

  • memory_localevents
    Only populate memory.events with data for the current cgroup, and not any subtrees. This is legacy behaviour; the default behaviour without this option is to include subtree counts. This option is system wide and can only be set on mount or modified through remount from the init namespace. The mount option is ignored on non-init namespace mounts.

  • memory_recursiveprot
    Recursively apply memory.min and memory.low protection to entire subtrees, without requiring explicit downward propagation into leaf cgroups. This allows protecting entire subtrees from one another, while retaining free competition within those subtrees. This should have been the default behavior but is a mount-option to avoid regressing setups relying on the original semantics (e.g. specifying bogusly high 'bypass' protection values at higher tree levels).

  • memory_hugetlb_accounting
    Count HugeTLB memory usage towards the cgroup's overall memory usage for the memory controller (for the purpose of statistics reporting and memory protection). This is a new behavior that could regress existing setups, so it must be explicitly opted in with this mount option.

    A few caveats to keep in mind:

    • There is no HugeTLB pool management involved in the memory controller. The pre-allocated pool does not belong to anyone. Specifically, when a new HugeTLB folio is allocated to the pool, it is not accounted for from the perspective of the memory controller. It is only charged to a cgroup when it is actually used (e.g. at page fault time). Host memory overcommit management has to consider this when configuring hard limits. In general, HugeTLB pool management should be done via other mechanisms (such as the HugeTLB controller).

    • Failure to charge a HugeTLB folio to the memory controller results in SIGBUS. This could happen even if the HugeTLB pool still has pages available (but the cgroup limit is hit and reclaim attempt fails).

    • Charging HugeTLB memory towards the memory controller affects memory protection and reclaim dynamics. Any userspace tuning (e.g. of the low and min limits) needs to take this into account.

    • HugeTLB pages utilized while this option is not selected will not be tracked by the memory controller (even if cgroup v2 is remounted later on).
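
To see which of the mount options above are in effect on a running system, the options column of /proc/mounts can be inspected. The following is a minimal sketch that assumes the usual six-column /proc/mounts layout; the function name and the sample line are illustrative, not taken from the original text.

```python
from typing import Dict, List


def cgroup2_mount_options(mounts_text: str) -> Dict[str, List[str]]:
    """Map each cgroup2 mount point to its list of mount options.

    Expects text in the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    result: Dict[str, List[str]] = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "cgroup2":
            result[fields[1]] = fields[3].split(",")
    return result


# Hypothetical sample input; on a real system pass the contents of /proc/mounts.
sample = (
    "proc /proc proc rw,nosuid 0 0\n"
    "cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nsdelegate,memory_recursiveprot 0 0\n"
)
options = cgroup2_mount_options(sample)
print("nsdelegate" in options["/sys/fs/cgroup"])
```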

Organizing Processes and Threads

Processes

Initially, only the root cgroup exists to which all processes belong. A child cgroup can be created by creating a sub-directory:

# mkdir $CGROUP_NAME

A given cgroup may have multiple child cgroups forming a tree structure. Each cgroup has a read-writable interface file "cgroup.procs". When read, it lists the PIDs of all processes which belong to the cgroup one-per-line. The PIDs are not ordered and the same PID may show up more than once if the process got moved to another cgroup and then back or the PID got recycled while reading.

A process can be migrated into a cgroup by writing its PID to the target cgroup's "cgroup.procs" file. Only one process can be migrated on a single write(2) call. If a process is composed of multiple threads, writing the PID of any thread migrates all threads of the process.
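
The two behaviours described above (an unordered listing that may repeat a PID, and one PID per write(2) call) can be sketched as follows. The function names are hypothetical, and writing to a real "cgroup.procs" file requires suitable privileges.

```python
import os
from typing import Set


def parse_procs_listing(text: str) -> Set[int]:
    """Parse the contents of a cgroup.procs file into a set of PIDs.

    A set is used because the listing is unordered and may repeat a PID.
    """
    return {int(line) for line in text.splitlines() if line.strip()}


def migrate_process(pid: int, procs_path: str) -> None:
    """Migrate one process by writing its PID with a single write(2) call.

    procs_path is the target cgroup's cgroup.procs file.
    """
    fd = os.open(procs_path, os.O_WRONLY)
    try:
        os.write(fd, str(pid).encode("ascii"))
    finally:
        os.close(fd)
```

To move several processes, call `migrate_process` once per PID rather than writing a list in one call.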

When a process forks a child process, the new process is born into the cgroup that the forking process belongs to at the time of the operation. After exit, a process stays associated with the cgroup that it belonged to at the time of exit until it's reaped; however, a zombie process does not appear in "cgroup.procs" and thus can't be moved to another cgroup.

A cgroup which doesn't have any children or live processes can be destroyed by removing the directory. Note that a cgroup which doesn't have any children and is associated only with zombie processes is considered empty and can be removed:

# rmdir $CGROUP_NAME

"/proc/$PID/cgroup" lists a process's cgroup membership. If legacy cgroup is in use in the system, this file may contain multiple lines, one for each hierarchy. The entry for cgroup v2 is always in the format "0::$PATH":

# cat /proc/842/cgroup
...
0::/test-cgroup/test-cgroup-nested

If the process becomes a zombie and the cgroup it was associated with is removed subsequently, " (deleted)" is appended to the path:

# cat /proc/842/cgroup
...
0::/test-cgroup/test-cgroup-nested (deleted)
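
The "0::$PATH" format and the " (deleted)" suffix shown above can be parsed with a few lines of code. This is a sketch under the assumption that each line has the shape "hierarchy-ID:controller-list:path"; the type and function names are made up for illustration. A cgroup whose directory name itself ends in " (deleted)" would be misread by this simple approach.

```python
from typing import NamedTuple, Optional


class V2Membership(NamedTuple):
    path: str
    deleted: bool


def parse_v2_membership(proc_cgroup_text: str) -> Optional[V2Membership]:
    """Find the cgroup v2 entry ("0::$PATH") in /proc/$PID/cgroup text."""
    suffix = " (deleted)"
    for line in proc_cgroup_text.splitlines():
        # The v2 entry has hierarchy ID "0" and an empty controller list.
        hierarchy_id, _, rest = line.partition(":")
        controllers, _, path = rest.partition(":")
        if hierarchy_id == "0" and controllers == "":
            if path.endswith(suffix):
                return V2Membership(path[: -len(suffix)], True)
            return V2Membership(path, False)
    return None


print(parse_v2_membership("0::/test-cgroup/test-cgroup-nested (deleted)\n"))
```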

Threads

cgroup v2 supports thread granularity for a subset of controllers to support use cases requiring hierarchical resource distribution across the threads of a group of processes. By default, all threads of a process belong to the same cgroup, which also serves as the resource domain to host resource consumptions which are not specific to a process or thread. The thread mode allows threads to be spread across a subtree while still maintaining the common resource domain for them.

Controllers which support thread mode are called threaded controllers. The ones which don't are called domain controllers.

Marking a cgroup threaded makes it join the resource domain of its parent as a threaded cgroup. The parent may be another threaded cgroup whose resource domain is further up in the hierarchy. The root of a threaded subtree, that is, the nearest ancestor which is not threaded, is called threaded domain or thread root interchangeably and serves as the resource domain for the entire subtree.

Inside a threaded subtree, threads of a process can be put in different cgroups and are not subject to the no internal process constraint - threaded controllers can be enabled on non-leaf cgroups whether they have threads in them or not.

As the threaded domain cgroup hosts all the domain resource consumptions of the subtree, it is considered to have internal resource consumptions whether there are processes in it or not and can't have populated child cgroups which aren't threaded. Because the root cgroup is not subject to no internal process constraint, it can serve both as a threaded domain and a parent to domain cgroups.

The current operation mode or type of the cgroup is shown in the "cgroup.type" file which indicates whether the cgroup is a normal domain, a domain which is serving as the domain of a threaded subtree, or a threaded cgroup.

On creation, a cgroup is always a domain cgroup and can be made threaded by writing "threaded" to the "cgroup.type" file. The operation is single direction:

# echo threaded > cgroup.type

Once threaded, the cgroup can't be made a domain again. To enable the thread mode, the following conditions must be met.

  • As the cgroup will join the parent's resource domain, the parent must either be a valid (threaded) domain or a threaded cgroup.

  • When the parent is an unthreaded domain, it must not have any domain controllers enabled or populated domain children. The root is exempt from this requirement.

Topology-wise, a cgroup can be in an invalid state. Please consider the following topology:

A (threaded domain) - B (threaded) - C (domain, just created)

C is created as a domain but isn't connected to a parent which can host child domains. C can't be used until it is turned into a threaded cgroup. "cgroup.type" file will report "domain (invalid)" in these cases. Operations which fail due to invalid topology use EOPNOTSUPP as the errno.

A domain cgroup is turned into a threaded domain when one of its child cgroups becomes threaded or threaded controllers are enabled in the "cgroup.subtree_control" file while there are processes in the cgroup. A threaded domain reverts to a normal domain when the conditions clear.

When read, "cgroup.threads" contains the list of the thread IDs of all threads in the cgroup. Except that the operations are per-thread instead of per-process, "cgroup.threads" has the same format and behaves the same way as "cgroup.procs". While "cgroup.threads" can be written to in any cgroup, as it can only move threads inside the same threaded domain, its operations are confined inside each threaded subtree.

The threaded domain cgroup serves as the resource domain for the whole subtree, and, while the threads can be scattered across the subtree, all the processes are considered to be in the threaded domain cgroup. "cgroup.procs" in a threaded domain cgroup contains the PIDs of all processes in the subtree and is not readable in the subtree proper. However, "cgroup.procs" can be written to from anywhere in the subtree to migrate all threads of the matching process to the cgroup.

Only threaded controllers can be enabled in a threaded subtree. When a threaded controller is enabled inside a threaded subtree, it only accounts for and controls resource consumptions associated with the threads in the cgroup and its descendants. All consumptions which aren't tied to a specific thread belong to the threaded domain cgroup.

Because a threaded subtree is exempt from no internal process constraint, a threaded controller must be able to handle competition between threads in a non-leaf cgroup and its child cgroups. Each threaded controller defines how such competitions are handled.

Currently, the following controllers are threaded and can be enabled in a threaded cgroup:

  • cpu
  • cpuset
  • perf_event
  • pids

[Un]populated Notification

Each non-root cgroup has a "cgroup.events" file which contains a "populated" field indicating whether the cgroup's sub-hierarchy has live processes in it. Its value is 0 if there is no live process in the cgroup and its descendants; otherwise, 1. poll and [id]notify events are triggered when the value changes. This can be used, for example, to start a clean-up operation after all processes of a given sub-hierarchy have exited. The populated state updates and notifications are recursive. Consider the following sub-hierarchy where the numbers in the parentheses represent the numbers of processes in each cgroup:

A(4) - B(0) - C(1)
            \ D(0)

A, B and C's "populated" fields would be 1 while D's 0. After the one process in C exits, B and C's "populated" fields would flip to "0" and file modified events will be generated on the "cgroup.events" files of both cgroups.
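
The recursive nature of the "populated" state can be illustrated with a toy model of the example hierarchy. This is not how the kernel computes it; it is only a sketch that recomputes the states from per-cgroup process counts, with all names chosen for illustration.

```python
from typing import Dict, List


def populated_states(children: Dict[str, List[str]],
                     live_procs: Dict[str, int],
                     root: str) -> Dict[str, int]:
    """Compute the "populated" value (0 or 1) for every cgroup under root.

    A cgroup is populated when it or any descendant has a live process.
    """
    states: Dict[str, int] = {}

    def visit(name: str) -> int:
        populated = 1 if live_procs.get(name, 0) > 0 else 0
        for child in children.get(name, []):
            # Visit every child so that each one gets a recorded state.
            populated = max(populated, visit(child))
        states[name] = populated
        return populated

    visit(root)
    return states


# The sub-hierarchy from the text: A(4) - B(0) - C(1)
#                                              \ D(0)
tree = {"A": ["B"], "B": ["C", "D"]}
before = populated_states(tree, {"A": 4, "B": 0, "C": 1, "D": 0}, "A")
# The one process in C exits.
after = populated_states(tree, {"A": 4, "B": 0, "C": 0, "D": 0}, "A")
# Cgroups whose value flipped are the ones that would see a file modified event.
flipped = sorted(name for name in before if before[name] != after[name])
print(before, after, flipped)
```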

Controlling Controllers

Enabling and Disabling

Each cgroup has a "cgroup.controllers" file which lists all controllers available for the cgroup to enable:

# cat cgroup.controllers
cpu io memory

No controller is enabled by default. Controllers can be enabled and disabled by writing to the "cgroup.subtree_control" file:

# echo "+cpu +memory -io" > cgroup.subtree_control

Only controllers which are listed in "cgroup.controllers" can be enabled. When multiple operations are specified as above, either they all succeed or fail. If multiple operations on the same controller are specified, the last one is effective.
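
The write semantics described above (all-or-nothing, last operation on a controller wins, only available controllers can be enabled) can be modelled in a few lines. This is a simplified sketch rather than the kernel's implementation; the function name and the use of ValueError are assumptions made for the example.

```python
from typing import Dict, Iterable, Set


def apply_subtree_control(enabled: Set[str],
                          available: Iterable[str],
                          request: str) -> Set[str]:
    """Return the new enabled set after a write such as "+cpu +memory -io".

    Raises ValueError without modifying `enabled` if the request is invalid,
    so the operations either all succeed or all fail.
    """
    available_set = set(available)
    wanted: Dict[str, bool] = {}  # controller name -> enable (True) or disable (False)
    for token in request.split():
        sign, name = token[0], token[1:]
        if sign not in ("+", "-") or not name:
            raise ValueError(f"malformed token: {token!r}")
        wanted[name] = sign == "+"  # a later token overrides an earlier one
    for name, enable in wanted.items():
        if enable and name not in available_set:
            raise ValueError(f"controller not available: {name!r}")
    result = set(enabled)
    for name, enable in wanted.items():
        if enable:
            result.add(name)
        else:
            result.discard(name)
    return result


print(sorted(apply_subtree_control(set(), ["cpu", "io", "memory"], "+cpu +memory -io")))
```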

Enabling a controller in a cgroup indicates that the distribution of the target resource across its immediate children will be controlled. Consider the following sub-hierarchy. The enabled controllers are listed in parentheses:

A(cpu,memory) - B(memory) - C()
                          \ D()

As A has "cpu" and "memory" enabled, A will control the distribution of CPU cycles and memory to its children, in this case, B. As B has "memory" enabled but not "cpu", C and D will compete freely on CPU cycles but their division of memory available to B will be controlled.

As a controller regulates the distribution of the target resource to the cgroup's children, enabling it creates the controller's interface files in the child cgroups. In the above example, enabling "cpu" on B would create the "cpu." prefixed controller interface files in C and D. Likewise, disabling "memory" from B would remove the "memory." prefixed controller interface files from C and D. This means that the controller interface files - anything which doesn't start with "cgroup." - are owned by the parent rather than the cgroup itself.

Top-down Constraint

Resources are distributed top-down and a cgroup can further distribute a resource only if the resource has been distributed to it from the parent. This means that all non-root "cgroup.subtree_control" files can only contain controllers which are enabled in the parent's "cgroup.subtree_control" file. A controller can be enabled only if the parent has the controller enabled and a controller can't be disabled if one or more children have it enabled.
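
The two rules of the top-down constraint can be written as simple predicates over the contents of "cgroup.subtree_control". The sketch below reuses the A/B/C/D hierarchy from the previous section; the function names are hypothetical.

```python
from typing import Iterable, Set


def can_enable(controller: str, parent_subtree_control: Set[str]) -> bool:
    """A non-root cgroup may enable a controller only if its parent has it enabled."""
    return controller in parent_subtree_control


def can_disable(controller: str, children_subtree_control: Iterable[Set[str]]) -> bool:
    """A controller can't be disabled while one or more children have it enabled."""
    return all(controller not in child for child in children_subtree_control)


# A(cpu,memory) - B(memory) - C()
#                           \ D()
a_control = {"cpu", "memory"}
b_control = {"memory"}
print(can_enable("cpu", a_control))        # B enabling "cpu": its parent A has it
print(can_enable("cpu", b_control))        # C enabling "cpu": its parent B does not
print(can_disable("memory", [b_control]))  # A disabling "memory" while B has it
```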

No Internal Process Constraint

Non-root cgroups can distribute domain resources to their children only when they don't have any processes of their own. In other words, only domain cgroups which don't contain any processes can have domain controllers enabled in their "cgroup.subtree_control" files.

This guarantees that, when a domain controller is looking at the part of the hierarchy which has it enabled, processes are always only on the leaves. This rules out situations where child cgroups compete against internal processes of the parent.

The root cgroup is exempt from this restriction. Root contains processes and anonymous resource consumption which can't be associated with any other cgroups and requires special treatment from most controllers. How resource consumption in the root cgroup is governed is up to each controller (for more information on this topic please refer to the Non-normative information section in the Controllers chapter).

Note that the restriction doesn't get in the way if there is no enabled controller in the cgroup's "cgroup.subtree_control". This is important as otherwise it wouldn't be possible to create children of a populated cgroup. To control resource distribution of a cgroup, the cgroup must create children and transfer all its processes to the children before enabling controllers in its "cgroup.subtree_control" file.

Delegation

Model of Delegation

A cgroup can be delegated in two ways. First, to a less privileged user by granting write access of the directory and its "cgroup.procs", "cgroup.threads" and "cgroup.subtree_control" files to the user. Second, if the "nsdelegate" mount option is set, automatically to a cgroup namespace on namespace creation.

Because the resource control interface files in a given directory control the distribution of the parent's resources, the delegatee shouldn't be allowed to write to them. For the first method, this is achieved by not granting access to these files. For the second, the kernel rejects writes to all files other than "cgroup.procs" and "cgroup.subtree_control" on a namespace root from inside the namespace.

The end results are equivalent for both delegation types. Once delegated, the user can build a sub-hierarchy under the directory, organize processes inside it as it sees fit and further distribute the resources it received from the parent. The limits and other settings of all resource controllers are hierarchical and regardless of what happens in the delegated sub-hierarchy, nothing can escape the resource restrictions imposed by the parent.

Currently, cgroup doesn't impose any restrictions on the number of cgroups in or nesting depth of a delegated sub-hierarchy; however, this may be limited explicitly in the future.

Delegation Containment

A delegated sub-hierarchy is contained in the sense that processes can't be moved into or out of the sub-hierarchy by the delegatee.

For delegations to a less privileged user, this is achieved by requiring the following conditions for a process with a non-root euid to migrate a target process into a cgroup by writing its PID to the "cgroup.procs" file.

  • The writer must have write access to the "cgroup.procs" file.

  • The writer must have write access to the "cgroup.procs" file of the common ancestor of the source and destination cgroups.

The above two constraints ensure that while a delegatee may migrate processes around freely in the delegated sub-hierarchy it can't pull in from or push out to outside the sub-hierarchy.

As an example, let's assume cgroups C0 and C1 have been delegated to user U0 who created C00, C01 under C0 and C10 under C1 as follows and all processes under C0 and C1 belong to U0:

~~~~~~~~~~~~~ - C0 - C00
~ cgroup    ~      \ C01
~ hierarchy ~
~~~~~~~~~~~~~ - C1 - C10

Let's also say U0 wants to write the PID of a process which is currently in C10 into "C00/cgroup.procs". U0 has write access to the file; however, the common ancestor of the source cgroup C10 and the destination cgroup C00 is above the points of delegation and U0 would not have write access to its "cgroup.procs" files and thus the write will be denied with -EACCES.
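
The C0/C1 example can be replayed with a small model of the two write-access conditions listed above. This sketch only covers those two conditions and represents cgroups as absolute paths such as "/C0/C00", which is an assumption made for the example; the set of writable "cgroup.procs" files is supplied by the caller.

```python
import posixpath
from typing import Set


def may_migrate(source: str, destination: str, writable_procs: Set[str]) -> bool:
    """Model the two write-access conditions for a writer with a non-root euid.

    writable_procs holds the cgroup paths whose "cgroup.procs" file the
    writer has write access to.
    """
    common_ancestor = posixpath.commonpath([source, destination])
    return destination in writable_procs and common_ancestor in writable_procs


# U0 was delegated C0 and C1 and created C00, C01 and C10 beneath them.
u0_writable = {"/C0", "/C0/C00", "/C0/C01", "/C1", "/C1/C10"}
# C10 -> C00: the common ancestor lies above the points of delegation.
print(may_migrate("/C1/C10", "/C0/C00", u0_writable))
# C01 -> C00: the common ancestor is C0, which U0 can write to.
print(may_migrate("/C0/C01", "/C0/C00", u0_writable))
```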

For delegations to namespaces, containment is achieved by requiring that both the source and destination cgroups are reachable from the namespace of the process which is attempting the migration. If either is not reachable, the migration is rejected with -ENOENT.

Guidelines

Organize Once and Control

Migrating a process across cgroups is a relatively expensive operation and stateful resources such as memory are not moved together with the process. This is an explicit design decision as there often exist inherent trade-offs between migration and various hot paths in terms of synchronization cost.

As such, migrating processes across cgroups frequently as a means to apply different resource restrictions is discouraged. A workload should be assigned to a cgroup according to the system's logical and resource structure once on start-up. Dynamic adjustments to resource distribution can be made by changing controller configuration through the interface files.

Avoid Name Collisions

Interface files for a cgroup and its children cgroups occupy the same directory and it is possible to create children cgroups which collide with interface files.

All cgroup core interface files are prefixed with "cgroup." and each controller's interface files are prefixed with the controller name and a dot. A controller's name is composed of lower case letters and '_'s but never begins with an '_' so it can be used as the prefix character for collision avoidance. Also, interface file names won't start or end with terms which are often used in categorizing workloads such as job, service, slice, unit or workload.
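
A tool that creates child cgroups could use the naming rules above as a conservative pre-check. The following is a rough sketch: the regular expression encodes the stated controller naming rule, the list of workload terms is limited to the five named in the text (which presents them only as examples), and both function names are hypothetical.

```python
import re

# Naming rule quoted in the text: lower case letters and underscores,
# never starting with an underscore.
CONTROLLER_NAME = re.compile(r"[a-z][a-z_]*")

# Terms the text says interface file names won't start or end with.
WORKLOAD_TERMS = ("job", "service", "slice", "unit", "workload")


def is_valid_controller_name(name: str) -> bool:
    return CONTROLLER_NAME.fullmatch(name) is not None


def may_collide_with_interface_file(child_name: str) -> bool:
    """Conservatively flag names shaped like "<controller-like prefix>.<rest>"."""
    if child_name.startswith(WORKLOAD_TERMS) or child_name.endswith(WORKLOAD_TERMS):
        return False
    prefix, dot, _ = child_name.partition(".")
    return dot == "." and is_valid_controller_name(prefix)


for name in ("memory.max", "_memory.max", "system.slice", "web"):
    print(name, may_collide_with_interface_file(name))
```

Prefixing a child name with '_' is enough to pass this check, since no controller name begins with an underscore.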

cgroup doesn't do anything to prevent name collisions and it's the user's responsibility to avoid them.

Resource Distribution Models

cgroup controllers implement several resource distribution schemes depending on the resource type and expected use cases. This section describes major schemes in use along with their expected behaviors.

Weights

A parent's resource is distributed by adding up the weights of all active children and giving each the fraction matching the ratio of its weight against the sum. As only children which can make use of the resource at the moment participate in the distribution, this is work-conserving. Due to the dynamic nature, this model is usually used for stateless resources.

All weights are in the range [1, 10000] with the default at 100. This allows symmetric multiplicative biases in both directions at fine enough granularity while staying in the intuitive range.
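
The weight model reduces to a small piece of arithmetic: each active child receives its weight divided by the sum of the weights of all active children. The sketch below illustrates only that arithmetic with made-up child names; it is not a description of any particular controller's scheduler.

```python
from fractions import Fraction
from typing import Dict, Set


def weight_shares(weights: Dict[str, int], active: Set[str]) -> Dict[str, Fraction]:
    """Return each active child's fraction of the parent's resource."""
    for name, weight in weights.items():
        if not 1 <= weight <= 10000:
            raise ValueError(f"weight out of range for {name}: {weight}")
    total = sum(weights[name] for name in active)
    return {name: Fraction(weights[name], total) for name in active}


weights = {"a": 100, "b": 200, "c": 100}
# With all three children active, "b" gets half of the resource.
print(weight_shares(weights, {"a", "b", "c"}))
# When "c" is idle it does not participate, so "a" and "b" split everything.
print(weight_shares(weights, {"a", "b"}))
```

The second call shows the work-conserving property: the share that "c" is not using is redistributed to the children that can use it.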

As long as the weight is in range, all configuration combinations are valid and there is no reason to reject configuration changes or process migrations.
只要权重在范围内,所有的配置组合都是有效的,没有理由拒绝配置更改或进程迁移。

"cpu.weight" proportionally distributes CPU cycles to active children and is an example of this type.
"cpu.weight" 按比例将 CPU 周期分配给活跃的子级,是这种类型的一个例子。
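
The weight arithmetic described above can be sketched in a few lines. This is only an illustration of the model, not kernel code: the function name `distribute_by_weight` and the child names are hypothetical, and "active" is simply passed in as a list.

```python
from fractions import Fraction

def distribute_by_weight(weights, active):
    """Return each active child's share of the parent's resource.

    weights: mapping of child name -> configured weight (hypothetical names).
    active:  names of the children that can use the resource right now.
    """
    for name, weight in weights.items():
        if not 1 <= weight <= 10000:
            raise ValueError(f"weight for {name} out of range [1, 10000]: {weight}")
    # Only active children take part, which is what makes the model work-conserving.
    total = sum(weights[name] for name in active)
    return {name: Fraction(weights[name], total) for name in active}

weights = {"a": 100, "b": 200, "c": 100}
all_active = distribute_by_weight(weights, ["a", "b", "c"])  # a: 1/4, b: 1/2, c: 1/4
c_idle = distribute_by_weight(weights, ["a", "b"])           # a: 1/3, b: 2/3
```

When child "c" goes idle, its share is redistributed to "a" and "b" in proportion to their weights, so no part of the parent's resource is left unused.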

Limits

限制

A child can only consume up to the configured amount of the resource. Limits can be over-committed - the sum of the limits of children can exceed the amount of resource available to the parent.
子级最多只能消耗所配置数量的资源。限制可以被超额分配 - 各子级的限制总和可以超过父级可用的资源量。

Limits are in the range [0, max] and default to "max", which is a no-op.
限制在 [0, max] 的范围内,默认为 "max",即无操作。

As limits can be over-committed, all configuration combinations are valid and there is no reason to reject configuration changes or process migrations.
由于限制可以被超额分配,所有的配置组合都是有效的,没有理由拒绝配置更改或进程迁移。

"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume on an IO device and is an example of this type.
"io.max" 限制了 cgroup 在 IO 设备上可以消耗的最大 BPS 和/或 IOPS,是这种类型的一个例子。
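
As a rough illustration of the limit model, the sketch below parses a limit value in which "max" stands for "no limit" and checks whether a set of child limits over-commits the parent. The helper names (`parse_limit`, `format_limit`, `is_overcommitted`) are made up for this example and are not part of any cgroup API.

```python
import math

def parse_limit(text):
    """Parse a limit as read from an interface file; "max" means unlimited."""
    text = text.strip()
    if text == "max":
        return math.inf
    value = int(text)
    if value < 0:
        raise ValueError(f"limit must be in [0, max]: {value}")
    return value

def format_limit(value):
    """Format a limit for writing; infinity is spelled "max"."""
    return "max" if value == math.inf else str(value)

def is_overcommitted(parent_available, child_limits):
    """True if the children's limits add up to more than the parent has.

    Over-commitment is valid for limits; this only reports it.
    """
    return sum(child_limits) > parent_available

limits = [parse_limit("80"), parse_limit("80")]
overcommitted = is_overcommitted(100, limits)  # True, and still a valid configuration
```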

Protections

保护

A cgroup is protected up to the configured amount of the resource as long as the usages of all its ancestors are under their protected levels. Protections can be hard guarantees or best effort soft boundaries. Protections can also be over-committed in which case only up to the amount available to the parent is protected among children.
只要所有祖先的使用量都在其保护级别以下,cgroup 就在所配置的资源量范围内受到保护。保护可以是硬性保证,也可以是尽力而为的软性边界。保护也可以被超额分配,此时各子级之间受保护的总量以父级可用的资源量为上限。

Protections are in the range [0, max] and default to 0, which is a no-op.
保护在 [0, max] 的范围内,默认为 0,即无操作。

As protections can be over-committed, all configuration combinations are valid and there is no reason to reject configuration changes or process migrations.
由于保护可以被超额分配,所有的配置组合都是有效的,没有理由拒绝配置更改或进程迁移。

"memory.low" implements best-effort memory protection and is an example of this type.
"memory.low" 实现了尽力而为的内存保护,是这种类型的一个例子。

Allocations

分配

A cgroup is exclusively allocated a certain amount of a finite resource. Allocations can't be over-committed - the sum of the allocations of children can not exceed the amount of resource available to the parent.
一个 cgroup 被独占地分配一定数量的有限资源。分配不能被超额分配 - 各子级的分配总和不能超过父级可用的资源量。

Allocations are in the range [0, max] and default to 0, which means no resource.
分配在 [0, max] 的范围内,默认为 0,即没有资源。

As allocations can't be over-committed, some configuration combinations are invalid and should be rejected. Also, if the resource is mandatory for execution of processes, process migrations may be rejected.
由于分配不能被超额分配,某些配置组合是无效的,应该被拒绝。此外,如果资源对于进程的执行是强制性的,可能会拒绝进程迁移。

"cpu.rt.max" hard-allocates realtime slices and is an example of this type.
"cpu.rt.max" 硬分配了实时时间片,是这种类型的一个例子。
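
Unlike limits and protections, allocations have configurations that must be rejected. The following sketch shows one way such a check could look; `set_allocation` and its arguments are hypothetical and only model the rule that the children's allocations may not exceed what the parent has available.

```python
def set_allocation(parent_available, allocations, child, amount):
    """Return a new allocation table with child set to amount, or raise.

    allocations: mapping of child name -> currently allocated amount.
    The input mapping is never modified, so a rejected change leaves
    the existing configuration untouched.
    """
    if amount < 0:
        raise ValueError("allocation must be non-negative")
    proposed = dict(allocations)
    proposed[child] = amount
    if sum(proposed.values()) > parent_available:
        raise ValueError("allocations would exceed what the parent has available")
    return proposed

current = {"a": 60}
current = set_allocation(100, current, "b", 40)   # accepted: 60 + 40 <= 100
try:
    set_allocation(100, current, "c", 1)          # rejected: 60 + 40 + 1 > 100
except ValueError as err:
    rejected_reason = str(err)
```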

Interface Files

接口文件

Format

格式

All interface files should be in one of the following formats whenever possible:

New-line separated values
以换行分隔的数值
(when only one value can be written at once)
(当一次只能写入一个值时)

      VAL0\n
      VAL1\n
      ...

Space separated values
以空格分隔的数值
(when read-only or multiple values can be written at once)
(当只读或一次可以写入多个值时)

      VAL0 VAL1 ...\n

Flat keyed
扁平键

      KEY0 VAL0\n
      KEY1 VAL1\n
      ...

Nested keyed
嵌套键

      KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
      KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
      ...

For a writable file, the format for writing should generally match reading; however, controllers may allow omitting later fields or implement restricted shortcuts for most common use cases.
对于可写入的文件,写入的格式通常应与读取的格式相匹配;但是,控制器可能允许省略后续字段或实现最常见用例的受限快捷方式。

For both flat and nested keyed files, only the values for a single key can be written at a time. For nested keyed files, the sub key pairs may be specified in any order and not all pairs have to be specified.
对于扁平键和嵌套键文件,一次只能写入单个键的值。对于嵌套键文件,子键对可以以任何顺序指定,并且不必指定所有对。
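
The two keyed formats are simple enough to parse with string splitting. The functions below are a minimal sketch, not an official parser; the names `parse_flat_keyed` and `parse_nested_keyed` are assumptions of this example, and all values are kept as strings because their meaning depends on the individual file.

```python
def parse_flat_keyed(text):
    """Parse "KEY VAL" lines into a dict of key -> value string."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split(None, 1)
        result[key] = value.strip()
    return result

def parse_nested_keyed(text):
    """Parse "KEY SUB_KEY=VAL ..." lines into a dict of key -> {sub_key: value}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        key, pairs = fields[0], fields[1:]
        # Sub key pairs may come in any order and some may be missing.
        result[key] = dict(pair.split("=", 1) for pair in pairs)
    return result

flat = parse_flat_keyed("KEY0 VAL0\nKEY1 VAL1\n")
nested = parse_nested_keyed("KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01\nKEY1 SUB_KEY1=VAL11\n")
```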

Conventions

约定

  • Settings for a single feature should be contained in a single file.
    单个特性的设置应包含在单个文件中。

  • The root cgroup should be exempt from resource control and thus shouldn't have resource control interface files.
    根 cgroup 应该豁免资源控制,因此不应该有资源控制接口文件。

  • The default time unit is microseconds. If a different unit is ever used, an explicit unit suffix must be present.
    默认时间单位为微秒。如果使用不同的单位,必须明确指定单位后缀。

  • A parts-per quantity should use a percentage decimal with at least two digit fractional part - e.g. 13.40.
    比例类(parts-per)数值应使用百分比小数表示,且小数部分至少保留两位,例如 13.40。

  • If a controller implements weight based resource distribution, its interface file should be named "weight" and have the range [1, 10000] with 100 as the default. The values are chosen to allow enough and symmetric bias in both directions while keeping it intuitive (the default is 100%).
    如果控制器实现基于权重的资源分配,则其接口文件应命名为 "weight",范围为 [1, 10000],默认为 100。选择这些值是为了在两个方向上提供足够对称的偏差,并保持直观性(默认值为 100%)。

  • If a controller implements an absolute resource guarantee and/or limit, the interface files should be named "min" and "max" respectively. If a controller implements best effort resource guarantee and/or limit, the interface files should be named "low" and "high" respectively.
    如果控制器实现绝对资源保证和/或限制,则接口文件应分别命名为 "min" 和 "max"。如果控制器实现尽力而为的资源保证和/或限制,则接口文件应分别命名为 "low" 和 "high"。

    In the above four control files, the special token "max" should be used to represent upward infinity for both reading and writing.
    在上述四个控制文件中,特殊标记 "max" 应该用于表示读取和写入的向上无限。

  • If a setting has a configurable default value and keyed specific overrides, the default entry should be keyed with "default" and appear as the first entry in the file.
    如果设置具有可配置的默认值和特定键的覆盖,那么默认条目应该以 "default" 为键,并出现在文件的第一个条目中。

    The default value can be updated by writing either "default $VAL" or "$VAL".
    默认值可以通过写入 "default $VAL" 或 "$VAL" 来更新。

    When writing to update a specific override, "default" can be used as the value to indicate removal of the override. Override entries with "default" as the value must not appear when read.
    在写入以更新特定覆盖时,可以使用 "default" 作为值来指示删除覆盖。当读取时,值为 "default" 的覆盖条目不得出现。

    For example, a setting which is keyed by major:minor device numbers with integer values may look like the following:
    例如,一个由主:次设备号键控的具有整数值的设置可能如下所示:

    # cat cgroup-example-interface-file
    default 150
    8:0 300
    

    The default value can be updated by:
    默认值可以通过以下方式更新:

    # echo 125 > cgroup-example-interface-file
    

    or:
    或者:

    # echo "default 125" > cgroup-example-interface-file
    

    An override can be set by:
    可以通过以下方式设置覆盖:

    # echo "8:16 170" > cgroup-example-interface-file
    

    and cleared by:
    并通过以下方式清除:

    # echo "8:0 default" > cgroup-example-interface-file
    # cat cgroup-example-interface-file
    default 125
    8:16 170
    
  • For events which are not very high frequency, an interface file "events" should be created which lists event key value pairs. Whenever a notifiable event happens, a file modified event should be generated on the file.
    对于不是非常高频的事件,应创建一个名为 "events" 的接口文件,其中列出事件键值对。每当发生可通知的事件时,应在文件上生成文件修改事件。
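
The "default" and override convention from the list above can be modelled with a small in-memory class. This is a sketch under the assumption of integer values, as in the example file; `KeyedSetting` is a hypothetical name and nothing here touches a real cgroup file.

```python
class KeyedSetting:
    """In-memory model of a setting with a default and keyed overrides."""

    def __init__(self, default, overrides=None):
        self.default = default
        self.overrides = dict(overrides or {})

    def write(self, text):
        fields = text.split()
        if len(fields) == 1:
            # "$VAL" updates the default.
            self.default = int(fields[0])
        elif len(fields) == 2:
            key, value = fields
            if key == "default":
                # "default $VAL" also updates the default.
                self.default = int(value)
            elif value == "default":
                # "$KEY default" removes the override.
                self.overrides.pop(key, None)
            else:
                self.overrides[key] = int(value)
        else:
            raise ValueError(f"unexpected input: {text!r}")

    def read(self):
        # The default entry comes first; removed overrides never appear.
        lines = [f"default {self.default}"]
        lines += [f"{key} {value}" for key, value in self.overrides.items()]
        return "\n".join(lines) + "\n"

setting = KeyedSetting(150, {"8:0": 300})
setting.write("125")
setting.write("8:16 170")
setting.write("8:0 default")
contents = setting.read()  # "default 125\n8:16 170\n"
```

Replaying the writes from the shell example yields the same final contents as the last `cat` shown there.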

Core Interface Files

核心接口文件

All cgroup core files are prefixed with "cgroup."
所有 cgroup 核心文件都以 "cgroup." 为前缀。

  • cgroup.type
    A read-write single value file which exists on non-root cgroups.
    一个可读写的单值文件,存在于非根 cgroup 上。
    When read, it indicates the current type of the cgroup, which can be one of the following values.
    读取时,指示 cgroup 的当前类型的值,可以是以下值之一。

    • "domain" : A normal valid domain cgroup.
      正常有效的域 cgroup。

    • "domain threaded" : A threaded domain cgroup which is serving as the root of a threaded subtree.
      作为线程子树根的线程域 cgroup。

    • "domain invalid" : A cgroup which is in an invalid state. It can't be populated or have controllers enabled. It may be allowed to become a threaded cgroup.
      处于无效状态的 cgroup。它不能被填充或启用控制器。可能允许成为线程 cgroup。

    • "threaded" : A threaded cgroup which is a member of a threaded subtree.
      作为线程子树的成员的线程 cgroup。

    A cgroup can be turned into a threaded cgroup by writing "threaded" to this file.
    可以通过向该文件写入 "threaded" 将 cgroup 转换为线程 cgroup。

  • cgroup.procs
    A read-write new-line separated values file which exists on all cgroups.
    一个可读写的以换行分隔的数值文件,存在于所有 cgroup 上。

    When read, it lists the PIDs of all processes which belong to the cgroup one-per-line. The PIDs are not ordered and the same PID may show up more than once if the process got moved to another cgroup and then back or the PID got recycled while reading.
    读取时,按行列出属于该 cgroup 的所有进程的 PID。PID 没有顺序,如果进程被移动到另一个 cgroup,然后再移回,或者在读取时 PID 被回收,同一个 PID 可能会出现多次。

    A PID can be written to migrate the process associated with the PID to the cgroup. The writer should match all of the following conditions.
    可以写入 PID 来迁移与该 PID 关联的进程到该 cgroup。写入者应满足以下所有条件。

    • It must have write access to the "cgroup.procs" file.
      必须对 "cgroup.procs" 文件具有写入访问权限。

    • It must have write access to the "cgroup.procs" file of the common ancestor of the source and destination cgroups.
      必须对源 cgroup 和目标 cgroup 的共同祖先的 "cgroup.procs" 文件具有写入访问权限。

    When delegating a sub-hierarchy, write access to this file should be granted along with the containing directory.
    在委托子层次结构时,应授予对该文件以及包含目录的写入访问权限。

    In a threaded cgroup, reading this file fails with EOPNOTSUPP as all the processes belong to the thread root. Writing is supported and moves every thread of the process to the cgroup.
    在线程 cgroup 中,读取该文件会因为所有进程属于线程根而失败,返回 EOPNOTSUPP。支持写入,并将该进程的每个线程移动到该 cgroup。

  • cgroup.threads
    A read-write new-line separated values file which exists on all cgroups.
    一个可读写的以换行分隔的数值文件,存在于所有 cgroup 上。

    When read, it lists the TIDs of all threads which belong to the cgroup one-per-line. The TIDs are not ordered and the same TID may show up more than once if the thread got moved to another cgroup and then back or the TID got recycled while reading.
    读取时,按行列出属于该 cgroup 的所有线程的 TID。TID 没有顺序,如果线程被移动到另一个 cgroup,然后再移回,或者在读取时 TID 被回收,同一个 TID 可能会出现多次。

    A TID can be written to migrate the thread associated with the TID to the cgroup. The writer should match all of the following conditions.
    可以写入 TID 来将与该 TID 关联的线程迁移到该 cgroup。写入者应满足以下所有条件。

    • It must have write access to the "cgroup.threads" file.
      必须对 "cgroup.threads" 文件具有写入访问权限。

    • The cgroup that the thread is currently in must be in the same resource domain as the destination cgroup.
      线程当前所在的 cgroup 必须与目标 cgroup 属于相同的资源域。

    • It must have write access to the "cgroup.procs" file of the common ancestor of the source and destination cgroups.
      必须对源 cgroup 和目标 cgroup 的共同祖先的 "cgroup.procs" 文件具有写入访问权限。

    When delegating a sub-hierarchy, write access to this file should be granted along with the containing directory.
    在委托子层次结构时,应授予对该文件以及包含目录的写入访问权限。

  • cgroup.controllers
    A read-only space separated values file which exists on all cgroups.
    一个只读的以空格分隔的数值文件,存在于所有 cgroup 上。

    It shows a space separated list of all controllers available to the cgroup. The controllers are not ordered.
    显示所有可用于该 cgroup 的控制器的空格分隔列表。控制器没有顺序。

  • cgroup.subtree_control
    A read-write space separated values file which exists on all cgroups. Starts out empty.
    一个可读写的以空格分隔的数值文件,存在于所有 cgroup 上。初始为空。

    When read, it shows a space separated list of the controllers which are enabled to control resource distribution from the cgroup to its children.
    读取时,显示启用控制资源分配从该 cgroup 到其子级的控制器的空格分隔列表。

    A space separated list of controllers prefixed with '+' or '-' can be written to enable or disable controllers. A controller name prefixed with '+' enables the controller and '-' disables it. If a controller appears more than once on the list, the last one is effective. When multiple enable and disable operations are specified, either all succeed or all fail.
    可以写入以 '+' 或 '-' 为前缀的控制器的空格分隔列表,以启用或禁用控制器。以 '+' 为前缀的控制器名称启用该控制器,以 '-' 为前缀的控制器名称禁用该控制器。如果控制器在列表中出现多次,则最后一个有效。如果指定了多个启用和禁用操作,则要么全部成功,要么全部失败。

  • cgroup.events
    A read-only flat-keyed file which exists on non-root cgroups. The following entries are defined. Unless specified otherwise, a value change in this file generates a file modified event.
    一个只读的扁平键文件,存在于非根 cgroup 上。定义了以下条目。除非另有说明,该文件中值的更改会生成文件修改事件。

    • populated
      1 if the cgroup or its descendants contains any live processes; otherwise, 0.
      如果 cgroup 或其后代包含任何活动进程,则为 1;否则为 0。

    • frozen
      1 if the cgroup is frozen; otherwise, 0.
      如果 cgroup 被冻结,则为 1;否则为 0。

  • cgroup.max.descendants
    A read-write single value file. The default is "max".
    一个可读写的单值文件。默认为 "max"。

    Maximum allowed number of descendant cgroups. If the actual number of descendants is equal to or larger than this value, an attempt to create a new cgroup in the hierarchy will fail.
    允许的最大后代 cgroup 数。如果实际后代数等于或大于此值,则尝试在层次结构中创建新 cgroup 将失败。

  • cgroup.max.depth
    A read-write single value file. The default is "max".
    一个可读写的单值文件。默认为 "max"。

    Maximum allowed descent depth below the current cgroup. If the actual descent depth is equal to or larger than this value, an attempt to create a new child cgroup will fail.
    当前 cgroup 下允许的最大后代深度。如果实际后代深度等于或大于此值,则尝试创建新的子 cgroup 将失败。

  • cgroup.stat
    A read-only flat-keyed file with the following entries:
    一个只读的扁平键文件,具有以下条目:

    • nr_descendants
      Total number of visible descendant cgroups.
      可见后代 cgroup 的总数。

    • nr_dying_descendants
      Total number of dying descendant cgroups. A cgroup becomes dying after being deleted by a user. The cgroup will remain in the dying state for some undefined time (which can depend on system load) before being completely destroyed.
      垂死后代 cgroup 的总数。用户删除 cgroup 后,该 cgroup 将变为垂死状态。在完全销毁之前,该 cgroup 将保持在垂死状态一段时间(这可能取决于系统负载)。

      A process can't enter a dying cgroup under any circumstances, and a dying cgroup can't revive.
      在任何情况下,进程都不能进入垂死 cgroup,垂死 cgroup 也不能恢复。

      A dying cgroup can consume system resources, but not beyond the limits which were active at the moment of cgroup deletion.
      垂死 cgroup 可以消耗系统资源,但不会超出删除 cgroup 时的限制。

  • cgroup.freeze
    A read-write single value file which exists on non-root cgroups. Allowed values are "0" and "1". The default is "0".
    一个可读写的单值文件,存在于非根 cgroup 上。允许的值为 "0" 和 "1"。默认为 "0"。

    Writing "1" to the file causes freezing of the cgroup and all descendant cgroups. This means that all processes belonging to them will be stopped and will not run until the cgroup is explicitly unfrozen. Freezing of the cgroup may take some time; when this action is completed, the "frozen" value in the cgroup.events control file will be updated to "1" and the corresponding notification will be issued.
    将 "1" 写入该文件会导致冻结该 cgroup 及其所有后代 cgroup。这意味着所有属于该 cgroup 的进程将停止运行,并且直到显式解冻该 cgroup 之前都不会运行。冻结 cgroup 可能需要一些时间;当此操作完成时,cgroup.events 控制文件中的 "frozen" 值将更新为 "1",并将发出相应的通知。

    A cgroup can be frozen either by its own settings or by the settings of any ancestor cgroup. If any ancestor cgroup is frozen, the cgroup will remain frozen.
    一个 cgroup 可以被其自身的设置或任何祖先 cgroup 的设置冻结。如果任何祖先 cgroup 中有一个被冻结,该 cgroup 将保持冻结状态。

    Processes in the frozen cgroup can be killed by a fatal signal. They also can enter and leave a frozen cgroup: either by an explicit move by a user, or if freezing of the cgroup races with fork(). If a process is moved to a frozen cgroup, it stops. If a process is moved out of a frozen cgroup, it becomes running.
    处于冻结状态的 cgroup 中的进程可以被致命信号杀死。它们也可以进入和离开冻结的 cgroup:要么是用户明确移动,要么是冻结 cgroup 与 fork() 操作竞争。如果进程被移动到冻结的 cgroup,它将停止运行。如果进程从冻结的 cgroup 中移出,它将变为运行状态。

    Frozen status of a cgroup doesn't affect any cgroup tree operations: it's possible to delete a frozen (and empty) cgroup, as well as create new sub-cgroups.
    冻结的 cgroup 的状态不会影响任何 cgroup 树操作:可以删除冻结的(且为空的)cgroup,以及创建新的子 cgroup。

  • cgroup.kill
    A write-only single value file which exists in non-root cgroups. The only allowed value is "1".
    一个只允许写入的单值文件,存在于非根 cgroup 上。唯一允许的值为 "1"。

    Writing "1" to the file causes the cgroup and all descendant cgroups to be killed. This means that all processes located in the affected cgroup tree will be killed via SIGKILL.
    将 "1" 写入该文件会导致杀死该 cgroup 及其所有后代 cgroup。这意味着位于受影响 cgroup 树中的所有进程将通过 SIGKILL 被杀死。

    Killing a cgroup tree will deal with concurrent forks appropriately and is protected against migrations.
    杀死 cgroup 树的操作会妥善处理并发的 fork,并且不会受到进程迁移的干扰。

    In a threaded cgroup, writing this file fails with EOPNOTSUPP as killing cgroups is a process directed operation, i.e. it affects the whole thread-group.
    在线程 cgroup 中,写入该文件会因为杀死 cgroup 是一个进程指导的操作,即它会影响整个线程组,而失败,返回 EOPNOTSUPP。

  • cgroup.pressure
    A read-write single value file whose allowed values are "0" and "1". The default is "1".
    一个可读写的单值文件,允许的值为 "0" 和 "1"。默认为 "1"。

    Writing "0" to the file will disable the cgroup PSI accounting. Writing "1" to the file will re-enable the cgroup PSI accounting.
    将 "0" 写入该文件将禁用 cgroup 的 PSI 统计。将 "1" 写入该文件将重新启用 cgroup 的 PSI 统计。

    This control attribute is not hierarchical, so disabling or enabling PSI accounting in a cgroup does not affect PSI accounting in its descendants, and enablement does not need to be passed down through ancestors from the root.
    此控制属性不是分层的,因此在 cgroup 中禁用或启用 PSI 统计不会影响其后代的 PSI 统计,也不需要从根开始经由各级祖先逐层启用。

    The reason this control attribute exists is that PSI accounts stalls for each cgroup separately and aggregates them at each level of the hierarchy. This may cause non-negligible overhead for some workloads in deep hierarchies, in which case this control attribute can be used to disable PSI accounting in the non-leaf cgroups.
    存在此控制属性的原因是,PSI 为每个 cgroup 单独统计停滞,并在层次结构的每个级别进行聚合。对于某些工作负载,在层次较深的情况下,这可能会带来不可忽略的开销,此时可以使用此控制属性来禁用非叶子 cgroup 中的 PSI 统计。

  • irq.pressure
    A read-write nested-keyed file.
    一个可读写的嵌套键文件。

    Shows pressure stall information for IRQ/SOFTIRQ. See Documentation/accounting/psi.rst for details.
    显示 IRQ/SOFTIRQ 的压力停滞信息。有关详细信息,请参阅 Documentation/accounting/psi.rst
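
Of the core files in this section, "cgroup.subtree_control" has the most involved write semantics: prefixed names, last occurrence wins, and all-or-nothing application. The sketch below models those rules on plain Python sets. The function name `apply_subtree_control` is hypothetical, and rejecting names that are not in the available set is an assumption of this example rather than something stated above.

```python
def apply_subtree_control(enabled, available, request):
    """Return the new set of enabled controllers after applying a request.

    enabled:   controllers currently enabled (as read from the file).
    available: controllers listed as available to the cgroup.
    request:   a string such as "+cpu -memory".
    """
    final_ops = {}
    for token in request.split():
        if len(token) < 2 or token[0] not in "+-":
            raise ValueError(f"malformed token: {token!r}")
        # If a controller appears more than once, the last one is effective.
        final_ops[token[1:]] = token[0]

    # Work on a copy so that a failure leaves the caller's set unchanged
    # (either all operations succeed or all fail).
    result = set(enabled)
    for name, op in final_ops.items():
        if name not in available:
            # Assumption of this sketch: unknown controllers reject the request.
            raise ValueError(f"controller not available: {name}")
        if op == "+":
            result.add(name)
        else:
            result.discard(name)
    return result

available = {"cpu", "io", "memory"}
enabled = apply_subtree_control({"cpu"}, available, "+memory -cpu +cpu")
# enabled now holds "cpu" and "memory": the trailing "+cpu" overrides "-cpu".
```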