Control Group v2 - Namespace (translated by ChatGPT)

Published: 2023-12-07 22:00:37  Author: 摩斯电码

Source: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#namespace

Namespace

Basics

cgroup namespace provides a mechanism to virtualize the view of the "/proc/$PID/cgroup" file and cgroup mounts. The CLONE_NEWCGROUP clone flag can be used with clone(2) and unshare(2) to create a new cgroup namespace. The process running inside the cgroup namespace will have its "/proc/$PID/cgroup" output restricted to cgroupns root. The cgroupns root is the cgroup of the process at the time of creation of the cgroup namespace.
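
For illustration only, a minimal C sketch of the sequence described above, assuming a glibc that exposes unshare(2) and CLONE_NEWCGROUP and a caller with CAP_SYS_ADMIN in its user namespace:

/* cgroupns_demo.c - unshare into a new cgroup namespace and show how the
 * /proc/self/cgroup view changes.  A sketch, not a complete tool. */
#define _GNU_SOURCE
#include <sched.h>              /* unshare(), CLONE_NEWCGROUP */
#include <stdio.h>
#include <stdlib.h>

static void dump_cgroup(const char *when)
{
        char line[256];
        FILE *f = fopen("/proc/self/cgroup", "r");

        if (!f) { perror("fopen"); exit(1); }
        while (fgets(line, sizeof(line), f))
                printf("%s: %s", when, line);   /* e.g. "before: 0::/batchjobs/container_id1" */
        fclose(f);
}

int main(void)
{
        dump_cgroup("before");

        if (unshare(CLONE_NEWCGROUP) < 0) { perror("unshare"); exit(1); }

        dump_cgroup("after");   /* now "after: 0::/" - the current cgroup became the cgroupns root */
        return 0;
}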

Without cgroup namespace, the "/proc/$PID/cgroup" file shows the complete path of the cgroup of a process. In a container setup where a set of cgroups and namespaces are intended to isolate processes, the "/proc/$PID/cgroup" file may leak potential system-level information to the isolated processes. For example:

# cat /proc/self/cgroup
0::/batchjobs/container_id1

The path '/batchjobs/container_id1' can be considered system data that is undesirable to expose to the isolated processes. cgroup namespace can be used to restrict visibility of this path. For example, before creating a cgroup namespace, one would see:

# ls -l /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
# cat /proc/self/cgroup
0::/batchjobs/container_id1

After unsharing a new namespace, the view changes:

# ls -l /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
# cat /proc/self/cgroup
0::/

When some thread from a multi-threaded process unshares its cgroup namespace, the new cgroupns gets applied to the entire process (all the threads). This is natural for the v2 hierarchy; however, for the legacy hierarchies, this may be unexpected.

A cgroup namespace is alive as long as there are processes inside or mounts pinning it. When the last usage goes away, the cgroup namespace is destroyed. The cgroupns root and the actual cgroups remain.

The Root and Views

The 'cgroupns root' for a cgroup namespace is the cgroup in which the process calling unshare(2) is running. For example, if a process in /batchjobs/container_id1 cgroup calls unshare, cgroup /batchjobs/container_id1 becomes the cgroupns root. For the init_cgroup_ns, this is the real root ('/') cgroup.

The cgroupns root cgroup does not change even if the namespace creator process later moves to a different cgroup:

# ~/unshare -c # unshare cgroupns in some cgroup
# cat /proc/self/cgroup
0::/
# mkdir sub_cgrp_1
# echo 0 > sub_cgrp_1/cgroup.procs
# cat /proc/self/cgroup
0::/sub_cgrp_1

Each process gets its namespace-specific view of "/proc/$PID/cgroup"

Processes running inside the cgroup namespace will be able to see cgroup paths (in /proc/self/cgroup) only inside their root cgroup. From within an unshared cgroupns:

# sleep 100000 &
[1] 7353
# echo 7353 > sub_cgrp_1/cgroup.procs
# cat /proc/7353/cgroup
0::/sub_cgrp_1

From the initial cgroup namespace, the real cgroup path will be visible:

$ cat /proc/7353/cgroup
0::/batchjobs/container_id1/sub_cgrp_1

From a sibling cgroup namespace (that is, a namespace rooted at a different cgroup), the cgroup path relative to its own cgroup namespace root will be shown. For instance, if PID 7353's cgroup namespace root is at '/batchjobs/container_id2', then it will see:

# cat /proc/7353/cgroup
0::/../container_id2/sub_cgrp_1

Note that the relative path always starts with '/' to indicate that it is relative to the cgroup namespace root of the caller.

Migration and setns(2)

Processes inside a cgroup namespace can move into and out of the namespace root if they have proper access to external cgroups. For example, from inside a namespace with cgroupns root at /batchjobs/container_id1, and assuming that the global hierarchy is still accessible inside cgroupns:

# cat /proc/7353/cgroup
0::/sub_cgrp_1
# echo 7353 > batchjobs/container_id2/cgroup.procs
# cat /proc/7353/cgroup
0::/../container_id2

Note that this kind of setup is not encouraged. A task inside cgroup namespace should only be exposed to its own cgroupns hierarchy.

setns(2) to another cgroup namespace is allowed when:

  • the process has CAP_SYS_ADMIN against its current user namespace

  • the process has CAP_SYS_ADMIN against the target cgroup namespace's userns

No implicit cgroup changes happen with attaching to another cgroup namespace. It is expected that someone moves the attaching process under the target cgroup namespace root.
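
As a rough C sketch of that sequence, assuming both CAP_SYS_ADMIN conditions above are met, reusing pid 7353 from the earlier examples as the process whose cgroup namespace is joined, and using a hypothetical cgroupfs path for the explicit move:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>              /* setns(), CLONE_NEWCGROUP */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        /* Attach to the cgroup namespace of pid 7353 (example pid from above). */
        int nsfd = open("/proc/7353/ns/cgroup", O_RDONLY);
        if (nsfd < 0 || setns(nsfd, CLONE_NEWCGROUP) < 0) {
                perror("setns");
                exit(1);
        }
        close(nsfd);

        /* setns() did not change our cgroup; move ourselves under the target
         * namespace's root explicitly.  The path below is hypothetical. */
        FILE *procs = fopen("/sys/fs/cgroup/batchjobs/container_id1/cgroup.procs", "w");
        if (!procs || fprintf(procs, "%d\n", getpid()) < 0) {
                perror("cgroup.procs");
                exit(1);
        }
        fclose(procs);
        return 0;
}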

Interaction with Other Namespaces

Namespace specific cgroup hierarchy can be mounted by a process running inside a non-init cgroup namespace:

# mount -t cgroup2 none $MOUNT_POINT

This will mount the unified cgroup hierarchy with cgroupns root as the filesystem root. The process needs CAP_SYS_ADMIN against its user and mount namespaces.

The virtualization of /proc/self/cgroup file combined with restricting the view of cgroup hierarchy by namespace-private cgroupfs mount provides a properly isolated cgroup view inside the container.

Information on Kernel Programming

This section contains kernel programming information in the areas where interacting with cgroup is necessary. cgroup core and controllers are not covered.

Filesystem Support for Writeback

A filesystem can support cgroup writeback by updating address_space_operations->writepage[s]() to annotate bio's using the following two functions.

  • wbc_init_bio(@wbc, @bio)
    Should be called for each bio carrying writeback data and associates the bio with the inode's owner cgroup and the corresponding request queue. This must be called after a queue (device) has been associated with the bio and before submission.

  • wbc_account_cgroup_owner(@wbc, @page, @bytes)
    Should be called for each data segment being written out. While this function doesn't care exactly when it's called during the writeback session, it's the easiest and most natural to call it as data segments are added to a bio.

With writeback bio's annotated, cgroup support can be enabled per super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for selective disabling of cgroup writeback support which is helpful when certain filesystem features, e.g. journaled data mode, are incompatible.
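
As a hedged sketch of how these pieces fit together (the foofs_* names and call sites are hypothetical; wbc_init_bio(), wbc_account_cgroup_owner() and SB_I_CGROUPWB are the interfaces described above, used with the signatures as documented here):

#include <linux/bio.h>
#include <linux/fs.h>
#include <linux/writeback.h>

/* At mount time, opt the filesystem into cgroup writeback, e.g. from its
 * fill_super callback: */
static void foofs_enable_cgroup_wb(struct super_block *sb)
{
        sb->s_iflags |= SB_I_CGROUPWB;
}

/* In the ->writepages() path, once the bio has been allocated against the
 * target block device and before it is submitted: */
static void foofs_submit_page(struct writeback_control *wbc, struct bio *bio,
                              struct page *page)
{
        /* Bind the bio to the inode owner's cgroup and request queue. */
        wbc_init_bio(wbc, bio);
        bio_add_page(bio, page, PAGE_SIZE, 0);
        /* Charge this data segment to the owner for writeback accounting. */
        wbc_account_cgroup_owner(wbc, page, PAGE_SIZE);
        submit_bio(bio);
}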

wbc_init_bio() binds the specified bio to its cgroup. Depending on the configuration, the bio may be executed at a lower priority and if the writeback session is holding shared resources, e.g. a journal entry, may lead to priority inversion. There is no one easy solution for the problem. Filesystems can try to work around specific problem cases by skipping wbc_init_bio() and using bio_associate_blkg() directly.

Deprecated v1 Core Features

  • Multiple hierarchies including named ones are not supported.

  • None of the v1 mount options are supported.

  • The "tasks" file is removed and "cgroup.procs" is not sorted.

  • "cgroup.clone_children" is removed.

  • /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file at the root instead.

Issues with v1 and Rationales for v2

Multiple Hierarchies

cgroup v1 allowed an arbitrary number of hierarchies and each hierarchy could host any number of controllers. While this seemed to provide a high level of flexibility, it wasn't useful in practice.

For example, as there is only one instance of each controller, utility type controllers such as freezer which can be useful in all hierarchies could only be used in one. The issue is exacerbated by the fact that controllers couldn't be moved to another hierarchy once hierarchies were populated. Another issue was that all controllers bound to a hierarchy were forced to have exactly the same view of the hierarchy. It wasn't possible to vary the granularity depending on the specific controller.

In practice, these issues heavily limited which controllers could be put on the same hierarchy and most configurations resorted to putting each controller on its own hierarchy. Only closely related ones, such as the cpu and cpuacct controllers, made sense to be put on the same hierarchy. This often meant that userland ended up managing multiple similar hierarchies repeating the same steps on each hierarchy whenever a hierarchy management operation was necessary.

Furthermore, support for multiple hierarchies came at a steep cost. It greatly complicated cgroup core implementation but more importantly the support for multiple hierarchies restricted how cgroup could be used in general and what controllers were able to do.

There was no limit on how many hierarchies there might be, which meant that a thread's cgroup membership couldn't be described in finite length. The key might contain any number of entries and was unlimited in length, which made it highly awkward to manipulate and led to addition of controllers which existed only to identify membership, which in turn exacerbated the original problem of proliferating number of hierarchies.

Also, as a controller couldn't have any expectation regarding the topologies of hierarchies other controllers might be on, each controller had to assume that all other controllers were attached to completely orthogonal hierarchies. This made it impossible, or at least very cumbersome, for controllers to cooperate with each other.

In most use cases, putting controllers on hierarchies which are completely orthogonal to each other isn't necessary. What usually is called for is the ability to have differing levels of granularity depending on the specific controller. In other words, hierarchy may be collapsed from leaf towards root when viewed from specific controllers. For example, a given configuration might not care about how memory is distributed beyond a certain level while still wanting to control how CPU cycles are distributed.

Thread Granularity

cgroup v1 allowed threads of a process to belong to different cgroups. This didn't make sense for some controllers and those controllers ended up implementing different ways to ignore such situations but much more importantly it blurred the line between API exposed to individual applications and system management interface.

Generally, in-process knowledge is available only to the process itself; thus, unlike service-level organization of processes, categorizing threads of a process requires active participation from the application which owns the target process.

cgroup v1 had an ambiguously defined delegation model which got abused in combination with thread granularity. cgroups were delegated to individual applications so that they can create and manage their own sub-hierarchies and control resource distributions along them. This effectively raised cgroup to the status of a syscall-like API exposed to lay programs.

First of all, cgroup has a fundamentally inadequate interface to be exposed this way. For a process to access its own knobs, it has to extract the path on the target hierarchy from /proc/self/cgroup, construct the path by appending the name of the knob to the path, open and then read and/or write to it. This is not only extremely clunky and unusual but also inherently racy. There is no conventional way to define transaction across the required steps and nothing can guarantee that the process would actually be operating on its own sub-hierarchy.
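
To make the clunkiness concrete, the pattern looks roughly like the following. This is an illustration of the interface being criticized, not a recommendation; the v1 hierarchy mount point /sys/fs/cgroup/cpu and the cpu.shares knob are assumptions for the example:

#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[512], knob[640], *path = NULL;
        FILE *f = fopen("/proc/self/cgroup", "r");

        /* Step 1: find our own path on the target (cpu) hierarchy. */
        while (f && fgets(line, sizeof(line), f)) {
                if (strstr(line, ":cpu:")) {            /* e.g. "3:cpu:/batchjobs/job1" */
                        path = strchr(strchr(line, ':') + 1, ':') + 1;
                        path[strcspn(path, "\n")] = '\0';
                        break;
                }
        }
        if (f)
                fclose(f);
        if (!path)
                return 1;

        /* Step 2: construct the knob path and write to it.  Nothing guarantees
         * the process is still in that cgroup by the time the write lands. */
        snprintf(knob, sizeof(knob), "/sys/fs/cgroup/cpu%s/cpu.shares", path);
        FILE *k = fopen(knob, "w");
        if (k) {
                fprintf(k, "512\n");
                fclose(k);
        }
        return 0;
}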

cgroup controllers implemented a number of knobs which would never be accepted as public APIs because they were just adding control knobs to system-management pseudo filesystem. cgroup ended up with interface knobs which were not properly abstracted or refined and directly revealed kernel internal details. These knobs got exposed to individual applications through the ill-defined delegation mechanism effectively abusing cgroup as a shortcut to implementing public APIs without going through the required scrutiny.

This was painful for both userland and kernel. Userland ended up with misbehaving and poorly abstracted interfaces, and the kernel inadvertently ended up exposing, and being locked into, such constructs.

Competition Between Inner Nodes and Threads

cgroup v1 allowed threads to be in any cgroup, which created an interesting problem where threads belonging to a parent cgroup and its children cgroups competed for resources. This was nasty as two different types of entities competed and there was no obvious way to settle it. Different controllers did different things.

The cpu controller considered threads and cgroups as equivalents and mapped nice levels to cgroup weights. This worked for some cases but fell flat when children wanted to be allocated specific ratios of CPU cycles and the number of internal threads fluctuated - the ratios constantly changed as the number of competing entities fluctuated. There also were other issues. The mapping from nice level to weight wasn't obvious or universal, and there were various other knobs which simply weren't available for threads.

The io controller implicitly created a hidden leaf node for each cgroup to host the threads. The hidden leaf had its own copies of all the knobs with leaf_ prefixed. While this allowed equivalent control over internal threads, it came with serious drawbacks. It always added an extra layer of nesting which wouldn't be necessary otherwise, made the interface messy and significantly complicated the implementation.

The memory controller didn't have a way to control what happened between internal tasks and child cgroups and the behavior was not clearly defined. There were attempts to add ad-hoc behaviors and knobs to tailor the behavior to specific workloads which would have led to problems extremely difficult to resolve in the long term.

Multiple controllers struggled with internal tasks and came up with different ways to deal with it; unfortunately, all the approaches were severely flawed and, furthermore, the widely different behaviors made cgroup as a whole highly inconsistent.

This clearly is a problem which needs to be addressed from cgroup core in a uniform way.

Other Interface Issues

cgroup v1 grew without oversight and developed a large number of idiosyncrasies and inconsistencies. One issue on the cgroup core side was how an empty cgroup was notified - a userland helper binary was forked and executed for each event. The event delivery wasn't recursive or delegatable. The limitations of the mechanism also led to in-kernel event delivery filtering mechanism further complicating the interface.

Controller interfaces were problematic too. An extreme example is controllers completely ignoring hierarchical organization and treating all cgroups as if they were all located directly under the root cgroup. Some controllers exposed a large amount of inconsistent implementation details to userland.

There also was no consistency across controllers. When a new cgroup was created, some controllers defaulted to not imposing extra restrictions while others disallowed any resource usage until explicitly configured. Configuration knobs for the same type of control used widely differing naming schemes and formats. Statistics and information knobs were named arbitrarily and used different formats and units even in the same controller.

cgroup v2 establishes common conventions where appropriate and updates controllers so that they expose minimal and consistent interfaces.

Controller Issues and Remedies

Memory

The original lower boundary, the soft limit, is defined as a limit that is per default unset. As a result, the set of cgroups that global reclaim prefers is opt-in, rather than opt-out. The costs for optimizing these mostly negative lookups are so high that the implementation, despite its enormous size, does not even provide the basic desirable behavior. First off, the soft limit has no hierarchical meaning. All configured groups are organized in a global rbtree and treated like equal peers, regardless where they are located in the hierarchy. This makes subtree delegation impossible. Second, the soft limit reclaim pass is so aggressive that it not just introduces high allocation latencies into the system, but also impacts system performance due to overreclaim, to the point where the feature becomes self-defeating.

The memory.low boundary on the other hand is a top-down allocated reserve. A cgroup enjoys reclaim protection when it's within its effective low, which makes delegation of subtrees possible. It also enjoys having reclaim pressure proportional to its overage when above its effective low.

The original high boundary, the hard limit, is defined as a strict limit that can not budge, even if the OOM killer has to be called. But this generally goes against the goal of making the most out of the available memory. The memory consumption of workloads varies during runtime, and that requires users to overcommit. But doing that with a strict upper limit requires either a fairly accurate prediction of the working set size or adding slack to the limit. Since working set size estimation is hard and error prone, and getting it wrong results in OOM kills, most users tend to err on the side of a looser limit and end up wasting precious resources.

The memory.high boundary on the other hand can be set much more conservatively. When hit, it throttles allocations by forcing them into direct reclaim to work off the excess, but it never invokes the OOM killer. As a result, a high boundary that is chosen too aggressively will not terminate the processes, but instead it will lead to gradual performance degradation. The user can monitor this and make corrections until the minimal memory footprint that still gives acceptable performance is found.

In extreme cases, with many concurrent allocations and a complete breakdown of reclaim progress within the group, the high boundary can be exceeded. But even then it's mostly better to satisfy the allocation from the slack available in other groups or the rest of the system than killing the group. Otherwise, memory.max is there to limit this type of spillover and ultimately contain buggy or even malicious applications.

Setting the original memory.limit_in_bytes below the current usage was subject to a race condition, where concurrent charges could cause the limit setting to fail. memory.max on the other hand will first set the limit to prevent new charges, and then reclaim and OOM kill until the new limit is met - or the task writing to memory.max is killed.
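
For instance, a minimal userspace sketch of setting these two boundaries and checking the result, assuming a hypothetical delegated cgroup at /sys/fs/cgroup/job1 with the memory controller enabled:

/* Cap a workload with memory.high as a throttle and memory.max as a hard
 * backstop, then read memory.current to judge whether the cap is too
 * aggressive.  The cgroup path /sys/fs/cgroup/job1 is an assumption. */
#include <stdio.h>

static int write_file(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f || fprintf(f, "%s", val) < 0)
                return -1;
        return fclose(f);
}

int main(void)
{
        char buf[64];
        FILE *f;

        /* Throttle (but never OOM kill) above 512M. */
        write_file("/sys/fs/cgroup/job1/memory.high", "536870912");
        /* Hard cap as a backstop against runaway or malicious growth. */
        write_file("/sys/fs/cgroup/job1/memory.max", "1073741824");

        f = fopen("/sys/fs/cgroup/job1/memory.current", "r");
        if (f && fgets(buf, sizeof(buf), f))
                printf("current usage: %s", buf);   /* compare against memory.high */
        if (f)
                fclose(f);
        return 0;
}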

The combined memory+swap accounting and limiting is replaced by real control over swap space.

The main argument for a combined memory+swap facility in the original cgroup design was that global or parental pressure would always be able to swap all anonymous memory of a child group, regardless of the child's own (possibly untrusted) configuration. However, untrusted groups can sabotage swapping by other means - such as referencing its anonymous memory in a tight loop - and an admin can not assume full swappability when overcommitting untrusted jobs.

For trusted jobs, on the other hand, a combined counter is not an intuitive userspace interface, and it flies in the face of the idea that cgroup controllers should account and limit specific physical resources. Swap space is a resource like all others in the system, and that's why unified hierarchy allows distributing it separately.