Workqueue (translated by chatgpt)

Published 2023-12-04 16:59:05  Author: 摩斯电码

Original: https://www.kernel.org/doc/html/latest/core-api/workqueue.html

Introduction

There are many cases where an asynchronous process execution context is needed and the workqueue (wq) API is the most commonly used mechanism for such cases.

When such an asynchronous execution context is needed, a work item describing which function to execute is put on a queue. An independent thread serves as the asynchronous execution context. The queue is called workqueue and the thread is called worker.

While there are work items on the workqueue the worker executes the functions associated with the work items one after the other. When there is no work item left on the workqueue the worker becomes idle. When a new work item gets queued, the worker begins executing again.

Why Concurrency Managed Workqueue?

In the original wq implementation, a multi threaded (MT) wq had one worker thread per CPU and a single threaded (ST) wq had one worker thread system-wide. A single MT wq needed to keep around the same number of workers as the number of CPUs. The kernel grew a lot of MT wq users over the years and with the number of CPU cores continuously rising, some systems saturated the default 32k PID space just booting up.

Although MT wq wasted a lot of resource, the level of concurrency provided was unsatisfactory. The limitation was common to both ST and MT wq albeit less severe on MT. Each wq maintained its own separate worker pool. An MT wq could provide only one execution context per CPU while an ST wq one for the whole system. Work items had to compete for those very limited execution contexts leading to various problems including proneness to deadlocks around the single execution context.

The tension between the provided level of concurrency and resource usage also forced its users to make unnecessary tradeoffs like libata choosing to use ST wq for polling PIOs and accepting an unnecessary limitation that no two polling PIOs can progress at the same time. As MT wq don't provide much better concurrency, users which require higher level of concurrency, like async or fscache, had to implement their own thread pool.

Concurrency Managed Workqueue (cmwq) is a reimplementation of wq with focus on the following goals.

  • Maintain compatibility with the original workqueue API.

  • Use per-CPU unified worker pools shared by all wq to provide flexible level of concurrency on demand without wasting a lot of resource.

  • Automatically regulate worker pool and level of concurrency so that the API users don't need to worry about such details.

The Design

In order to ease the asynchronous execution of functions a new abstraction, the work item, is introduced.

A work item is a simple struct that holds a pointer to the function that is to be executed asynchronously. Whenever a driver or subsystem wants a function to be executed asynchronously it has to set up a work item pointing to that function and queue that work item on a workqueue.
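
As an illustration of that pattern, here is a minimal kernel C sketch using the standard workqueue API (struct work_struct, INIT_WORK(), schedule_work()); the names my_driver_data, my_work_fn and my_driver_kick are placeholders, not part of the API:

#include <linux/kernel.h>
#include <linux/printk.h>
#include <linux/workqueue.h>

struct my_driver_data {
        struct work_struct work;        /* the work item */
        int payload;                    /* data the asynchronous function will need */
};

/* Runs later in the asynchronous execution context of a worker thread. */
static void my_work_fn(struct work_struct *work)
{
        struct my_driver_data *data =
                container_of(work, struct my_driver_data, work);

        pr_info("processing payload %d\n", data->payload);
}

static void my_driver_kick(struct my_driver_data *data)
{
        INIT_WORK(&data->work, my_work_fn);     /* point the work item at the function */
        schedule_work(&data->work);             /* queue it on the default system workqueue */
}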

Special purpose threads, called worker threads, execute the functions off of the queue, one after the other. If no work is queued, the worker threads become idle. These worker threads are managed in so called worker-pools.

The cmwq design differentiates between the user-facing workqueues that subsystems and drivers queue work items on and the backend mechanism which manages worker-pools and processes the queued work items.

There are two worker-pools, one for normal work items and the other for high priority ones, for each possible CPU and some extra worker-pools to serve work items queued on unbound workqueues - the number of these backing pools is dynamic.

Subsystems and drivers can create and queue work items through special workqueue API functions as they see fit. They can influence some aspects of the way the work items are executed by setting flags on the workqueue they are putting the work item on. These flags include things like CPU locality, concurrency limits, priority and more. To get a detailed overview refer to the API description of alloc_workqueue() below.

When a work item is queued to a workqueue, the target worker-pool is determined according to the queue parameters and workqueue attributes and appended on the shared worklist of the worker-pool. For example, unless specifically overridden, a work item of a bound workqueue will be queued on the worklist of either normal or highpri worker-pool that is associated to the CPU the issuer is running on.
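
As a small sketch of the default placement versus an explicit override (my_wq, my_work and my_other_work are hypothetical variables set up as above):

/* Default: goes to the normal or highpri pool of the CPU this code is running on. */
queue_work(my_wq, &my_work);

/* Override: explicitly target the worker-pool associated with CPU 2. */
queue_work_on(2, my_wq, &my_other_work);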

For any worker pool implementation, managing the concurrency level (how many execution contexts are active) is an important issue. cmwq tries to keep the concurrency at a minimal but sufficient level. Minimal to save resources and sufficient in that the system is used at its full capacity.

Each worker-pool bound to an actual CPU implements concurrency management by hooking into the scheduler. The worker-pool is notified whenever an active worker wakes up or sleeps and keeps track of the number of the currently runnable workers. Generally, work items are not expected to hog a CPU and consume many cycles. That means maintaining just enough concurrency to prevent work processing from stalling should be optimal. As long as there are one or more runnable workers on the CPU, the worker-pool doesn't start execution of a new work, but, when the last running worker goes to sleep, it immediately schedules a new worker so that the CPU doesn't sit idle while there are pending work items. This allows using a minimal number of workers without losing execution bandwidth.

Keeping idle workers around doesn't cost anything other than the memory space for kthreads, so cmwq holds onto idle ones for a while before killing them.

For unbound workqueues, the number of backing pools is dynamic. Unbound workqueue can be assigned custom attributes using apply_workqueue_attrs() and workqueue will automatically create backing worker pools matching the attributes. The responsibility of regulating concurrency level is on the users. There is also a flag to mark a bound wq to ignore the concurrency management. Please refer to the API section for details.
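
A kernel-internal sketch of such a configuration is shown below; the details are version-dependent (alloc_workqueue_attrs() takes no argument on recent kernels and apply_workqueue_attrs() is not exported to modules), so treat it as illustrative only:

struct workqueue_struct *wq;
struct workqueue_attrs *attrs;
int ret;

wq = alloc_workqueue("my_unbound_wq", WQ_UNBOUND, 0);
if (!wq)
        return -ENOMEM;

attrs = alloc_workqueue_attrs();
if (!attrs) {
        destroy_workqueue(wq);
        return -ENOMEM;
}

attrs->nice = -5;                                  /* run the backing workers at a higher priority */
cpumask_copy(attrs->cpumask, cpumask_of_node(0));  /* confine the backing pool to node 0's CPUs */

ret = apply_workqueue_attrs(wq, attrs);            /* workqueue picks or creates a matching pool */
free_workqueue_attrs(attrs);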

Forward progress guarantee relies on workers being created when more execution contexts are necessary, which in turn is guaranteed through the use of rescue workers. All work items which might be used on code paths that handle memory reclaim are required to be queued on wq's that have a rescue-worker reserved for execution under memory pressure. Else it is possible that the worker-pool deadlocks waiting for execution contexts to free up.

Application Programming Interface (API)

alloc_workqueue() allocates a wq. The original create_*workqueue() functions are deprecated and scheduled for removal. alloc_workqueue() takes three arguments - @name, @flags and @max_active. @name is the name of the wq and also used as the name of the rescuer thread if there is one.

A wq no longer manages execution resources but serves as a domain for forward progress guarantee, flush and work item attributes. @flags and @max_active control how work items are assigned execution resources, scheduled and executed.
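
For example, a typical allocation might look like the sketch below; the name, the flag choice and the my_work item are illustrative, not prescriptive:

struct workqueue_struct *my_wq;

/* Unbound, usable on memory-reclaim paths, 0 selects the default max_active. */
my_wq = alloc_workqueue("my_wq", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
if (!my_wq)
        return -ENOMEM;

queue_work(my_wq, &my_work);            /* queue work items on it as usual */

flush_workqueue(my_wq);                 /* wait for the queued work items to finish */
destroy_workqueue(my_wq);               /* release the wq (and its rescuer) */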

flags

  • WQ_UNBOUND

    Work items queued to an unbound wq are served by the special worker-pools which host workers which are not bound to any specific CPU. This makes the wq behave as a simple execution context provider without concurrency management. The unbound worker-pools try to start execution of work items as soon as possible. Unbound wq sacrifices locality but is useful for the following cases.

    • Wide fluctuation in the concurrency level requirement is expected and using bound wq may end up creating a large number of mostly unused workers across different CPUs as the issuer hops through different CPUs.

    • Long running CPU intensive workloads which can be better managed by the system scheduler.

  • WQ_FREEZABLE

    A freezable wq participates in the freeze phase of the system suspend operations. Work items on the wq are drained and no new work item starts execution until thawed.

  • WQ_MEM_RECLAIM

    All wq which might be used in the memory reclaim paths MUST have this flag set. The wq is guaranteed to have at least one execution context regardless of memory pressure.

  • WQ_HIGHPRI

    Work items of a highpri wq are queued to the highpri worker-pool of the target cpu. Highpri worker-pools are served by worker threads with elevated nice level.

    Note that normal and highpri worker-pools don't interact with each other. Each maintains its separate pool of workers and implements concurrency management among its workers.

  • WQ_CPU_INTENSIVE

    Work items of a CPU intensive wq do not contribute to the concurrency level. In other words, runnable CPU intensive work items will not prevent other work items in the same worker-pool from starting execution. This is useful for bound work items which are expected to hog CPU cycles so that their execution is regulated by the system scheduler.

    Although CPU intensive work items don't contribute to the concurrency level, start of their executions is still regulated by the concurrency management and runnable non-CPU-intensive work items can delay execution of CPU intensive work items.

    This flag is meaningless for unbound wq.

max_active

@max_active determines the maximum number of execution contexts per CPU which can be assigned to the work items of a wq. For example, with @max_active of 16, at most 16 work items of the wq can be executing at the same time per CPU. This is always a per-CPU attribute, even for unbound workqueues.

The maximum limit for @max_active is 512 and the default value used when 0 is specified is 256. These values are chosen sufficiently high such that they are not the limiting factor while providing protection in runaway cases.

The number of active work items of a wq is usually regulated by the users of the wq, more specifically, by how many work items the users may queue at the same time. Unless there is a specific need for throttling the number of active work items, specifying '0' is recommended.

Some users depend on the strict execution ordering of ST wq. The combination of @max_active of 1 and WQ_UNBOUND used to achieve this behavior. Work items on such wq were always queued to the unbound worker-pools and only one work item could be active at any given time thus achieving the same ordering property as ST wq.

In the current implementation the above configuration only guarantees ST behavior within a given NUMA node. Instead alloc_ordered_workqueue() should be used to achieve system-wide ST behavior.
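
A minimal sketch (the workqueue name is illustrative):

struct workqueue_struct *ordered_wq;

/* One work item at a time, strict system-wide FIFO ordering. */
ordered_wq = alloc_ordered_workqueue("my_ordered_wq", WQ_MEM_RECLAIM);
if (!ordered_wq)
        return -ENOMEM;

queue_work(ordered_wq, &w1);
queue_work(ordered_wq, &w2);            /* w2 will not start until w1 has finished */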

Example Execution Scenarios

The following example execution scenarios try to illustrate how cmwq behave under different configurations.

Work items w0, w1, w2 are queued to a bound wq q0 on the same CPU. w0 burns CPU for 5ms then sleeps for 10ms then burns CPU for 5ms again before finishing. w1 and w2 burn CPU for 5ms then sleep for 10ms.

Ignoring all other tasks, works and processing overhead, and assuming simple FIFO scheduling, the following is one highly simplified version of possible sequences of events with the original wq.

TIME IN MSECS  EVENT
0              w0 starts and burns CPU
5              w0 sleeps
15             w0 wakes up and burns CPU
20             w0 finishes
20             w1 starts and burns CPU
25             w1 sleeps
35             w1 wakes up and finishes
35             w2 starts and burns CPU
40             w2 sleeps
50             w2 wakes up and finishes

And with cmwq with @max_active >= 3,

TIME IN MSECS  EVENT
0              w0 starts and burns CPU
5              w0 sleeps
5              w1 starts and burns CPU
10             w1 sleeps
10             w2 starts and burns CPU
15             w2 sleeps
15             w0 wakes up and burns CPU
20             w0 finishes
20             w1 wakes up and finishes
25             w2 wakes up and finishes

If @max_active == 2,

TIME IN MSECS  EVENT
0              w0 starts and burns CPU
5              w0 sleeps
5              w1 starts and burns CPU
10             w1 sleeps
15             w0 wakes up and burns CPU
20             w0 finishes
20             w1 wakes up and finishes
20             w2 starts and burns CPU
25             w2 sleeps
35             w2 wakes up and finishes

Now, let's assume w1 and w2 are queued to a different wq q1 which has WQ_CPU_INTENSIVE set,

TIME IN MSECS  EVENT
0              w0 starts and burns CPU
5              w0 sleeps
5              w1 and w2 start and burn CPU
10             w1 sleeps
15             w2 sleeps
15             w0 wakes up and burns CPU
20             w0 finishes
20             w1 wakes up and finishes
25             w2 wakes up and finishes

Guidelines

  • Do not forget to use WQ_MEM_RECLAIM if a wq may process work items which are used during memory reclaim. Each wq with WQ_MEM_RECLAIM set has an execution context reserved for it. If there is dependency among multiple work items used during memory reclaim, they should be queued to separate wq each with WQ_MEM_RECLAIM (see the sketch after this list).

  • Unless strict ordering is required, there is no need to use ST wq.

  • Unless there is a specific need, using 0 for @max_active is recommended. In most use cases, concurrency level usually stays well under the default limit.

  • A wq serves as a domain for forward progress guarantee (WQ_MEM_RECLAIM), flush and work item attributes. Work items which are not involved in memory reclaim and don't need to be flushed as a part of a group of work items, and don't require any special attribute, can use one of the system wq. There is no difference in execution characteristics between using a dedicated wq and a system wq.

  • Unless work items are expected to consume a huge amount of CPU cycles, using a bound wq is usually beneficial due to the increased level of locality in wq operations and work item execution.
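
The sketch below makes the first guideline concrete (the names are hypothetical): if work_a participates in memory reclaim and has to wait on work_b, giving each its own WQ_MEM_RECLAIM wq guarantees both an execution context even when no new workers can be created:

static struct workqueue_struct *wq_a, *wq_b;

wq_a = alloc_workqueue("reclaim_stage_a", WQ_MEM_RECLAIM, 0);
wq_b = alloc_workqueue("reclaim_stage_b", WQ_MEM_RECLAIM, 0);

/*
 * work_a's function does queue_work(wq_b, &work_b) and then flush_work(&work_b).
 * Because wq_a and wq_b each have their own rescuer, the dependency cannot
 * deadlock under memory pressure.
 */
queue_work(wq_a, &work_a);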

Affinity Scopes

An unbound workqueue groups CPUs according to its affinity scope to improve cache locality. For example, if a workqueue is using the default affinity scope of "cache", it will group CPUs according to last level cache boundaries. A work item queued on the workqueue will be assigned to a worker on one of the CPUs which share the last level cache with the issuing CPU. Once started, the worker may or may not be allowed to move outside the scope depending on the affinity_strict setting of the scope.

Workqueue currently supports the following affinity scopes.

  • default

    Use the scope in module parameter workqueue.default_affinity_scope which is always set to one of the scopes below.

  • cpu

    CPUs are not grouped. A work item issued on one CPU is processed by a worker on the same CPU. This makes unbound workqueues behave as per-cpu workqueues without concurrency management.

  • smt

    CPUs are grouped according to SMT boundaries. This usually means that the logical threads of each physical CPU core are grouped together.

  • cache

    CPUs are grouped according to cache boundaries. Which specific cache boundary is used is determined by the arch code. L3 is used in a lot of cases. This is the default affinity scope.

  • numa

    CPUs are grouped according to NUMA boundaries.

  • system

    All CPUs are put in the same group. Workqueue makes no effort to process a work item on a CPU close to the issuing CPU.

The default affinity scope can be changed with the module parameter workqueue.default_affinity_scope and a specific workqueue's affinity scope can be changed using apply_workqueue_attrs().
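
For example, a sketch of changing the default scope (assuming a kernel where the module parameter is runtime-writable; the per-workqueue sysfs knobs are described below):

# Pick a different default scope on the kernel command line:
workqueue.default_affinity_scope=numa

# Or switch the default at runtime through the module parameter:
$ echo smt > /sys/module/workqueue/parameters/default_affinity_scope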

If WQ_SYSFS is set, the workqueue will have the following affinity scope related interface files under its /sys/devices/virtual/workqueue/WQ_NAME/ directory (a usage sketch follows the list).

  • affinity_scope

    Read to see the current affinity scope. Write to change.

    When default is the current scope, reading this file will also show the current effective scope in parentheses, for example, default (cache).

  • affinity_strict

    0 by default indicating that affinity scopes are not strict. When a work item starts execution, workqueue makes a best-effort attempt to ensure that the worker is inside its affinity scope, which is called repatriation. Once started, the scheduler is free to move the worker anywhere in the system as it sees fit. This enables benefiting from scope locality while still being able to utilize other CPUs if necessary and available.

    If set to 1, all workers of the scope are guaranteed always to be in the scope. This may be useful when crossing affinity scopes has other implications, for example, in terms of power consumption or workload isolation. Strict NUMA scope can also be used to match the workqueue behavior of older kernels.
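
Putting the two files together, a usage sketch for a WQ_SYSFS workqueue (WQ_NAME stands for the actual workqueue name):

$ cat /sys/devices/virtual/workqueue/WQ_NAME/affinity_scope
default (cache)

$ echo numa > /sys/devices/virtual/workqueue/WQ_NAME/affinity_scope    # switch the scope
$ echo 1 > /sys/devices/virtual/workqueue/WQ_NAME/affinity_strict      # enforce it strictly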

Affinity Scopes and Performance

It'd be ideal if an unbound workqueue's behavior is optimal for the vast majority of use cases without further tuning. Unfortunately, in the current kernel, there exists a pronounced trade-off between locality and utilization necessitating explicit configurations when workqueues are heavily used.

Higher locality leads to higher efficiency where more work is performed for the same number of consumed CPU cycles. However, higher locality may also cause lower overall system utilization if the work items are not spread enough across the affinity scopes by the issuers. The following performance testing with dm-crypt clearly illustrates this trade-off.

The tests are run on a CPU with 12-cores/24-threads split across four L3 caches (AMD Ryzen 9 3900x). CPU clock boost is turned off for consistency. /dev/dm-0 is a dm-crypt device created on NVME SSD (Samsung 990 PRO) and opened with cryptsetup with default settings.

Scenario 1: Enough issuers and work spread across the machine

The command used:

$ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k --ioengine=libaio \
  --iodepth=64 --runtime=60 --numjobs=24 --time_based --group_reporting \
  --name=iops-test-job --verify=sha512

There are 24 issuers, each issuing 64 IOs concurrently. --verify=sha512 makes fio generate and read back the content each time which makes execution locality matter between the issuer and kcryptd. The following are the read bandwidths and CPU utilizations depending on different affinity scope settings on kcryptd measured over five runs. Bandwidths are in MiBps, and CPU util in percents.

Affinity        Bandwidth (MiBps)  CPU util (%)
system          1159.40 ±1.34      99.31 ±0.02
cache           1166.40 ±0.89      99.34 ±0.01
cache (strict)  1166.00 ±0.71      99.35 ±0.01

With enough issuers spread across the system, there is no downside to "cache", strict or otherwise. All three configurations saturate the whole machine but the cache-affine ones outperform by 0.6% thanks to improved locality.

Scenario 2: Fewer issuers, enough work for saturation

The command used:

$ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
  --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 \
  --time_based --group_reporting --name=iops-test-job --verify=sha512

The only difference from the previous scenario is --numjobs=8. There are a third as many issuers but still enough total work to saturate the system.

Affinity        Bandwidth (MiBps)  CPU util (%)
system          1155.40 ±0.89      97.41 ±0.05
cache           1154.40 ±1.14      96.15 ±0.09
cache (strict)  1112.00 ±4.64      93.26 ±0.35

This is more than enough work to saturate the system. Both "system" and "cache" are nearly saturating the machine but not fully. "cache" is using less CPU but the better efficiency puts it at the same bandwidth as "system".

Eight issuers moving around over four L3 cache scopes still allow "cache (strict)" to mostly saturate the machine but the loss of work conservation is now starting to hurt with 3.7% bandwidth loss.

Scenario 3: Even fewer issuers, not enough work to saturate

The command used:

$ fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
  --ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 \
  --time_based --group_reporting --name=iops-test-job --verify=sha512

Again, the only difference is --numjobs=4. With the number of issuers reduced to four, there now isn't enough work to saturate the whole system and the bandwidth becomes dependent on completion latencies.

Affinity        Bandwidth (MiBps)  CPU util (%)
system           993.60 ±1.82      75.49 ±0.06
cache            973.40 ±1.52      74.90 ±0.07
cache (strict)   828.20 ±4.49      66.84 ±0.29

Now, the tradeoff between locality and utilization is clearer. "cache" shows 2% bandwidth loss compared to "system" and "cache (strict)" a whopping 20%.

Conclusion and Recommendations

In the above experiments, the efficiency advantage of the "cache" affinity scope over "system" is, while consistent and noticeable, small. However, the impact is dependent on the distances between the scopes and may be more pronounced in processors with more complex topologies.

While the loss of work-conservation in certain scenarios hurts, it is a lot better than "cache (strict)" and maximizing workqueue utilization is unlikely to be the common case anyway. As such, "cache" is the default affinity scope for unbound pools.

  • As there is no one option which is great for most cases, workqueue usages that may consume a significant amount of CPU are recommended to configure the workqueues using apply_workqueue_attrs() and/or enable WQ_SYSFS.

  • An unbound workqueue with strict "cpu" affinity scope behaves the same as WQ_CPU_INTENSIVE per-cpu workqueue. There is no real advantage to the latter and an unbound workqueue provides a lot more flexibility.

  • Affinity scopes are introduced in Linux v6.5. To emulate the previous behavior, use strict "numa" affinity scope.

  • The loss of work-conservation in non-strict affinity scopes is likely originating from the scheduler. There is no theoretical reason why the kernel wouldn't be able to do the right thing and maintain work-conservation in most cases. As such, it is possible that future scheduler improvements may make most of these tunables unnecessary.

Examining Configuration

Use tools/workqueue/wq_dump.py to examine unbound CPU affinity configuration, worker pools and how workqueues map to the pools:

$ tools/workqueue/wq_dump.py
Affinity Scopes
===============
wq_unbound_cpumask=0000000f

CPU
  nr_pods  4
  pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
  pod_node [0]=0 [1]=0 [2]=1 [3]=1
  cpu_pod  [0]=0 [1]=1 [2]=2 [3]=3

SMT
  nr_pods  4
  pod_cpus [0]=00000001 [1]=00000002 [2]=00000004 [3]=00000008
  pod_node [0]=0 [1]=0 [2]=1 [3]=1
  cpu_pod  [0]=0 [1]=1 [2]=2 [3]=3

CACHE (default)
  nr_pods  2
  pod_cpus [0]=00000003 [1]=0000000c
  pod_node [0]=0 [1]=1
  cpu_pod  [0]=0 [1]=0 [2]=1 [3]=1

NUMA
  nr_pods  2
  pod_cpus [0]=00000003 [1]=0000000c
  pod_node [0]=0 [1]=1
  cpu_pod  [0]=0 [1]=0 [2]=1 [3]=1

SYSTEM
  nr_pods  1
  pod_cpus [0]=0000000f
  pod_node [0]=-1
  cpu_pod  [0]=0 [1]=0 [2]=0 [3]=0

Worker Pools
============
pool[00] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  0
pool[01] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  0
pool[02] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  1
pool[03] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  1
pool[04] ref= 1 nice=  0 idle/workers=  4/  4 cpu=  2
pool[05] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  2
pool[06] ref= 1 nice=  0 idle/workers=  3/  3 cpu=  3
pool[07] ref= 1 nice=-20 idle/workers=  2/  2 cpu=  3
pool[08] ref=42 nice=  0 idle/workers=  6/  6 cpus=0000000f
pool[09] ref=28 nice=  0 idle/workers=  3/  3 cpus=00000003
pool[10] ref=28 nice=  0 idle/workers= 17/ 17 cpus=0000000c
pool[11] ref= 1 nice=-20 idle/workers=  1/  1 cpus=0000000f
pool[12] ref= 2 nice=-20 idle/workers=  1/  1 cpus=00000003
pool[13] ref= 2 nice=-20 idle/workers=  1/  1 cpus=0000000c

Workqueue CPU -> pool
=====================
[    workqueue \ CPU              0  1  2  3 dfl]
events                   percpu   0  2  4  6
events_highpri           percpu   1  3  5  7
events_long              percpu   0  2  4  6
events_unbound           unbound  9  9 10 10  8
events_freezable         percpu   0  2  4  6
events_power_efficient   percpu   0  2  4  6
events_freezable_power_  percpu   0  2  4  6
rcu_gp                   percpu   0  2  4  6
rcu_par_gp               percpu   0  2  4  6
slub_flushwq             percpu   0  2  4  6
netns                    ordered  8  8  8  8  8
...

See the command's help message for more info.

Monitoring

Use tools/workqueue/wq_monitor.py to monitor workqueue operations:

$ tools/workqueue/wq_monitor.py events
                            total  infl  CPUtime  CPUhog CMW/RPR  mayday rescued
events                      18545     0      6.1       0       5       -       -
events_highpri                  8     0      0.0       0       0       -       -
events_long                     3     0      0.0       0       0       -       -
events_unbound              38306     0      0.1       -       7       -       -
events_freezable                0     0      0.0       0       0       -       -
events_power_efficient      29598     0      0.2       0       0       -       -
events_freezable_power_        10     0      0.0       0       0       -       -
sock_diag_events                0     0      0.0       0       0       -       -

                            total  infl  CPUtime  CPUhog CMW/RPR  mayday rescued
events                      18548     0      6.1       0       5       -       -
events_highpri                  8     0      0.0       0       0       -       -
events_long                     3     0      0.0       0       0       -       -
events_unbound              38322     0      0.1       -       7       -       -
events_freezable                0     0      0.0       0       0       -       -
events_power_efficient      29603     0      0.2       0       0       -       -
events_freezable_power_        10     0      0.0       0       0       -       -
sock_diag_events                0     0      0.0       0       0       -       -

...

See the command's help message for more info.

Debugging

Because the work functions are executed by generic worker threads there are a few tricks needed to shed some light on misbehaving workqueue users.

Worker threads show up in the process list as:

root      5671  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/0:1]
root      5672  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/1:2]
root      5673  0.0  0.0      0     0 ?        S    12:12   0:00 [kworker/0:0]
root      5674  0.0  0.0      0     0 ?        S    12:13   0:00 [kworker/1:0]

If kworkers are going crazy (using too much cpu), there are two types of possible problems:

  1. Something being scheduled in rapid succession

  2. A single work item that consumes lots of cpu cycles

The first one can be tracked using tracing:

$ echo workqueue:workqueue_queue_work > /sys/kernel/tracing/set_event
$ cat /sys/kernel/tracing/trace_pipe > out.txt
(wait a few secs)
^C

If something is busy looping on work queueing, it would be dominating the output and the offender can be determined with the work item function.
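
As a rough post-processing sketch (assuming the trace lines carry a function=<symbol> field, which is how the workqueue_queue_work event is normally rendered):

$ grep -o 'function=[^ ]*' out.txt | sort | uniq -c | sort -rn | head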

For the second type of problems it should be possible to just check the stack trace of the offending worker thread.

$ cat /proc/THE_OFFENDING_KWORKER/stack

The work item's function should be trivially visible in the stack trace.

Non-reentrance Conditions

Workqueue guarantees that a work item cannot be re-entrant if the following conditions hold after a work item gets queued:

  1. The work function hasn't been changed.

  2. No one queues the work item to another workqueue.

  3. The work item hasn't been reinitiated.

In other words, if the above conditions hold, the work item is guaranteed to be executed by at most one worker system-wide at any given time.

Note that requeuing the work item (to the same queue) from within the work function itself doesn't break these conditions, so it's safe to do. Otherwise, caution is required when breaking the conditions inside a work function.
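
A sketch of the safe self-requeueing pattern mentioned above (my_wq, struct my_state and process_one_chunk() are hypothetical):

static void my_work_fn(struct work_struct *work)
{
        struct my_state *st = container_of(work, struct my_state, work);

        if (!process_one_chunk(st))
                return;                 /* nothing left to do */

        /*
         * Requeue ourselves on the same workqueue. The conditions above
         * still hold, so at most one instance of this function runs
         * system-wide at any given time.
         */
        queue_work(my_wq, work);
}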