A Tour Through TREE_RCU's Grace-Period Memory Ordering (Translation)

Published: 2023-11-06 19:52:50  Author: 摩斯电码

Original:
https://docs.kernel.org/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html

August 8, 2017

This article was contributed by Paul E. McKenney

Introduction

This document gives a rough visual overview of how Tree RCU's grace-period memory ordering guarantee is provided.

What Is Tree RCU's Grace Period Memory Ordering Guarantee?

RCU grace periods provide extremely strong memory-ordering guarantees for non-idle non-offline code. Any code that happens after the end of a given RCU grace period is guaranteed to see the effects of all accesses prior to the beginning of that grace period that are within RCU read-side critical sections. Similarly, any code that happens before the beginning of a given RCU grace period is guaranteed to not see the effects of all accesses following the end of that grace period that are within RCU read-side critical sections.

Note well that RCU-sched read-side critical sections include any region of code for which preemption is disabled. Given that each individual machine instruction can be thought of as an extremely small region of preemption-disabled code, one can think of synchronize_rcu() as smp_mb() on steroids.
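
For example (a minimal sketch; struct foo, gp, and reader() are illustrative names, not from the original document), a preemption-disabled region acts as an RCU-sched read-side critical section:

#include <linux/preempt.h>
#include <linux/rcupdate.h>

struct foo { int a; };
struct foo __rcu *gp;              /* hypothetical RCU-protected pointer */

static int reader(void)
{
  struct foo *p;
  int a = -1;

  preempt_disable();               /* begins an RCU-sched critical section */
  p = rcu_dereference_sched(gp);
  if (p)
    a = p->a;
  preempt_enable();                /* ends it; synchronize_rcu() waits for it */
  return a;
}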

RCU updaters use this guarantee by splitting their updates into two phases, one of which is executed before the grace period and the other of which is executed after the grace period. In the most common use case, phase one removes an element from a linked RCU-protected data structure, and phase two frees that element. For this to work, any readers that have witnessed state prior to the phase-one update (in the common case, removal) must not witness state following the phase-two update (in the common case, freeing).
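
As a concrete sketch of this two-phase pattern (gp, gp_lock, and struct foo are hypothetical, and updaters are assumed to serialize on gp_lock):

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo { int a; };
static struct foo __rcu *gp;       /* hypothetical RCU-protected pointer */
static DEFINE_SPINLOCK(gp_lock);   /* assumed to serialize updaters */

static void remove_then_free(void)
{
  struct foo *p;

  spin_lock(&gp_lock);
  p = rcu_dereference_protected(gp, lockdep_is_held(&gp_lock));
  rcu_assign_pointer(gp, NULL);    /* phase one: removal */
  spin_unlock(&gp_lock);

  synchronize_rcu();               /* all pre-existing readers finish */
  kfree(p);                        /* phase two: freeing */
}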

The RCU implementation provides this guarantee using a network of lock-based critical sections, memory barriers, and per-CPU processing, as is described in the following sections.

Tree RCU Grace Period Memory Ordering Building Blocks

The workhorse for RCU's grace-period memory ordering is the critical section for the rcu_node structure's ->lock. These critical sections use helper functions for lock acquisition, including raw_spin_lock_rcu_node(), raw_spin_lock_irq_rcu_node(), and raw_spin_lock_irqsave_rcu_node(). Their lock-release counterparts are raw_spin_unlock_rcu_node(), raw_spin_unlock_irq_rcu_node(), and raw_spin_unlock_irqrestore_rcu_node(), respectively. For completeness, a raw_spin_trylock_rcu_node() is also provided. The key point is that the lock-acquisition functions, including raw_spin_trylock_rcu_node(), all invoke smp_mb__after_unlock_lock() immediately after successful acquisition of the lock.

Therefore, for any given rcu_node structure, any access happening before one of the above lock-release functions will be seen by all CPUs as happening before any access happening after a later one of the above lock-acquisition functions. Furthermore, any access happening before one of the above lock-release function on any given CPU will be seen by all CPUs as happening before any access happening after a later one of the above lock-acquisition functions executing on that same CPU, even if the lock-release and lock-acquisition functions are operating on different rcu_node structures. Tree RCU uses these two ordering guarantees to form an ordering network among all CPUs that were in any way involved in the grace period, including any CPUs that came online or went offline during the grace period in question.

The following litmus test exhibits the ordering effects of these lock-acquisition and lock-release functions:

int x, y, z;

void task0(void)
{
  raw_spin_lock_rcu_node(rnp);
  WRITE_ONCE(x, 1);
  r1 = READ_ONCE(y);
  raw_spin_unlock_rcu_node(rnp);
}

void task1(void)
{
  raw_spin_lock_rcu_node(rnp);
  WRITE_ONCE(y, 1);
  r2 = READ_ONCE(z);
  raw_spin_unlock_rcu_node(rnp);
}

void task2(void)
{
  WRITE_ONCE(z, 1);
  smp_mb();
  r3 = READ_ONCE(x);
}

WARN_ON(r1 == 0 && r2 == 0 && r3 == 0);

The WARN_ON() is evaluated at "the end of time", after all changes have propagated throughout the system. Without the smp_mb__after_unlock_lock() provided by the acquisition functions, this WARN_ON() could trigger, for example on PowerPC. The smp_mb__after_unlock_lock() invocations prevent this WARN_ON() from triggering.

Quick Quiz:
But the chain of rcu_node-structure lock acquisitions guarantees that new readers will see all of the updater's pre-grace-period accesses and also guarantees that the updater's post-grace-period accesses will see all of the old reader's accesses. So why do we need all of those calls to smp_mb__after_unlock_lock()?
Answer:
Because we must provide ordering for RCU's polling grace-period primitives, for example, get_state_synchronize_rcu() and poll_state_synchronize_rcu(). Consider this code:

	CPU 0                                     CPU 1
	----                                      ----
	WRITE_ONCE(X, 1)                          WRITE_ONCE(Y, 1)
	g = get_state_synchronize_rcu()           smp_mb()
	while (!poll_state_synchronize_rcu(g))    r1 = READ_ONCE(X)
	        continue;
	r0 = READ_ONCE(Y)

RCU guarantees that the outcome r0 == 0 && r1 == 0 will not happen, even if CPU 1 is in an RCU extended quiescent state (idle or offline) and thus won't interact directly with the RCU core processing at all.
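
Outside of litmus tests, the polled interface is typically used to defer phase two until a grace period is known to have elapsed. A hedged usage sketch (poll_then_free() and its backoff strategy are illustrative; the two grace-period API calls are the real ones):

#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/slab.h>

static void poll_then_free(void *old)
{
  unsigned long cookie = get_state_synchronize_rcu();

  /* ... unrelated work can proceed while the grace period advances ... */

  while (!poll_state_synchronize_rcu(cookie))
    schedule_timeout_uninterruptible(1);  /* illustrative backoff */
  kfree(old);  /* every reader that could have seen old has finished */
}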

This approach must be extended to include idle CPUs, which need RCU's grace-period memory ordering guarantee to extend to any RCU read-side critical sections preceding and following the current idle sojourn. This case is handled by calls to the strongly ordered atomic_add_return() read-modify-write atomic operation that is invoked within rcu_dynticks_eqs_enter() at idle-entry time and within rcu_dynticks_eqs_exit() at idle-exit time. The grace-period kthread invokes rcu_dynticks_snap() and rcu_dynticks_in_eqs_since() (both of which invoke an atomic_add_return() of zero) to detect idle CPUs.
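
A much-simplified sketch of the idea (the per-CPU counter and function names below are illustrative; the kernel's actual encoding of the dynticks counter differs): every idle transition performs a fully ordered read-modify-write, and the grace-period kthread samples the counter with an atomic_add_return() of zero:

#include <linux/atomic.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(atomic_t, eqs_ctr); /* illustrative stand-in */

static void eqs_enter_sketch(void)        /* at idle entry */
{
  /* Fully ordered RMW: prior accesses are seen before the idle sojourn. */
  atomic_add_return(1, this_cpu_ptr(&eqs_ctr));
}

static void eqs_exit_sketch(void)         /* at idle exit */
{
  /* Fully ordered RMW: the idle sojourn is seen before later accesses. */
  atomic_add_return(1, this_cpu_ptr(&eqs_ctr));
}

static int eqs_snap_sketch(int cpu)       /* grace-period kthread side */
{
  /* An atomic_add_return() of zero: a fully ordered sample. */
  return atomic_add_return(0, per_cpu_ptr(&eqs_ctr, cpu));
}

static bool eqs_changed_since_sketch(int cpu, int snap)
{
  /* A changed counter means the CPU passed through an EQS transition. */
  return eqs_snap_sketch(cpu) != snap;
}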

Quick Quiz:
But what about CPUs that remain offline for the entire grace period?
Answer:
Such CPUs will be offline at the beginning of the grace period, so the grace period won't expect quiescent states from them. Races between grace-period start and CPU-hotplug operations are mediated by the CPU's leaf rcu_node structure's ->lock as described above.

The approach must be extended to handle one final case, that of waking a task blocked in synchronize_rcu(). This task might be affined to a CPU that is not yet aware that the grace period has ended, and thus might not yet be subject to the grace period's memory ordering. Therefore, there is an smp_mb() after the return from wait_for_completion() in the synchronize_rcu() code path.

Quick Quiz:
What? Where??? I don't see any smp_mb() after the return from wait_for_completion()!!!
Answer:
That would be because I spotted the need for that smp_mb() during the creation of this documentation, and it is therefore unlikely to hit mainline before v4.14. Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and Jonathan Cameron for asking questions that sensitized me to the rather elaborate sequence of events that demonstrate the need for this memory barrier.

Tree RCU's grace-period memory-ordering guarantees rely most heavily on the rcu_node structure's ->lock field, so much so that it is necessary to abbreviate this pattern in the diagrams in the next section. For example, consider the rcu_prepare_for_idle() function shown below, which is one of several functions that enforce ordering of newly arrived RCU callbacks against future grace periods:

 1 static void rcu_prepare_for_idle(void)
 2 {
 3   bool needwake;
 4   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 5   struct rcu_node *rnp;
 6   int tne;
 7
 8   lockdep_assert_irqs_disabled();
 9   if (rcu_rdp_is_offloaded(rdp))
10     return;
11
12   /* Handle nohz enablement switches conservatively. */
13   tne = READ_ONCE(tick_nohz_active);
14   if (tne != rdp->tick_nohz_enabled_snap) {
15     if (!rcu_segcblist_empty(&rdp->cblist))
16       invoke_rcu_core(); /* force nohz to see update. */
17     rdp->tick_nohz_enabled_snap = tne;
18     return;
19   }
20   if (!tne)
21     return;
22
23   /*
24    * If we have not yet accelerated this jiffy, accelerate all
25    * callbacks on this CPU.
 26    */
27   if (rdp->last_accelerate == jiffies)
28     return;
29   rdp->last_accelerate = jiffies;
30   if (rcu_segcblist_pend_cbs(&rdp->cblist)) {
31     rnp = rdp->mynode;
32     raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
33     needwake = rcu_accelerate_cbs(rnp, rdp);
34     raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
35     if (needwake)
36       rcu_gp_kthread_wake();
37   }
38 }

But the only part of rcu_prepare_for_idle() that really matters for this discussion is lines 32–34. We will therefore abbreviate this function as follows:
[Figure: rcu_prepare_for_idle() abbreviated to its rcu_node ->lock critical section]

The box represents the rcu_node structure's ->lock critical section, with the double line on top representing the additional smp_mb__after_unlock_lock().

Tree RCU Grace Period Memory Ordering Components

Tree RCU's grace-period memory-ordering guarantee is provided by a number of RCU components:

  1. Callback Registry
  2. Grace-Period Initialization
  3. Self-Reported Quiescent States
  4. Dynamic Tick Interface
  5. CPU-Hotplug Interface
  6. Forcing Quiescent States
  7. Grace-Period Cleanup
  8. Callback Invocation

Each of the following sections looks at the corresponding component in detail.

Callback Registry

If RCU's grace-period guarantee is to mean anything at all, any access that happens before a given invocation of call_rcu() must also happen before the corresponding grace period. The implementation of this portion of RCU's grace period guarantee is shown in the following figure:
[Figure: callback-registry code paths funneling into rcu_accelerate_cbs()]

Because call_rcu() normally acts only on CPU-local state, it provides no ordering guarantees, either for itself or for phase one of the update (which again will usually be removal of an element from an RCU-protected data structure). It simply enqueues the rcu_head structure on a per-CPU list, which cannot become associated with a grace period until a later call to rcu_accelerate_cbs(), as shown in the diagram above.
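
For reference, a minimal sketch of how an updater typically hands RCU its phase-two action through call_rcu() (struct foo and the function names are illustrative):

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
  int a;
  struct rcu_head rh;  /* queued per-CPU until later accelerated */
};

static void foo_free_cb(struct rcu_head *rhp)
{
  /* Phase two: runs only after a full grace period has elapsed. */
  kfree(container_of(rhp, struct foo, rh));
}

static void foo_remove(struct foo *p)
{
  /* Phase one (removal from the data structure) happens before this. */
  call_rcu(&p->rh, foo_free_cb);
}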

One set of code paths shown on the left invokes rcu_accelerate_cbs() via note_gp_changes(), either directly from call_rcu() (if the current CPU is inundated with queued rcu_head structures) or more likely from an RCU_SOFTIRQ handler. Another code path in the middle is taken only in kernels built with CONFIG_RCU_FAST_NO_HZ=y, which invokes rcu_accelerate_cbs() via rcu_prepare_for_idle(). The final code path on the right is taken only in kernels built with CONFIG_HOTPLUG_CPU=y, which invokes rcu_accelerate_cbs() via rcu_advance_cbs(), rcu_migrate_callbacks(), rcutree_migrate_callbacks(), and takedown_cpu(), which in turn is invoked on a surviving CPU after the outgoing CPU has been completely offlined.

There are a few other code paths within grace-period processing that opportunistically invoke rcu_accelerate_cbs(). However, either way, all of the CPU's recently queued rcu_head structures are associated with a future grace-period number under the protection of the CPU's leaf rcu_node structure's ->lock. In all cases, there is full ordering against any prior critical section for that same rcu_node structure's ->lock, and also full ordering against any of the current task's or CPU's prior critical sections for any rcu_node structure's ->lock.

The next section will show how this ordering ensures that any accesses prior to the call_rcu() (particularly including phase one of the update) happen before the start of the corresponding grace period.

Quick Quiz:
But what about synchronize_rcu()?
Answer:
The synchronize_rcu() passes call_rcu() to wait_rcu_gp(), which invokes it. So either way, it eventually comes down to call_rcu().
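
A rough completion-based sketch of that reduction (the real code uses wait_rcu_gp() and is more general; the names below are simplified stand-ins):

#include <linux/completion.h>
#include <linux/kernel.h>
#include <linux/rcupdate.h>

struct rcu_synchronize_sketch {
  struct rcu_head head;
  struct completion completion;
};

static void wakeme_after_gp_sketch(struct rcu_head *head)
{
  struct rcu_synchronize_sketch *rs =
    container_of(head, struct rcu_synchronize_sketch, head);

  complete(&rs->completion);  /* the grace period has ended */
}

static void synchronize_rcu_sketch(void)
{
  struct rcu_synchronize_sketch rs;

  init_completion(&rs.completion);
  call_rcu(&rs.head, wakeme_after_gp_sketch);
  wait_for_completion(&rs.completion);
  smp_mb();  /* the barrier discussed in the earlier Quick Quiz */
}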

Grace-Period Initialization

Grace-period initialization is carried out by the grace-period kernel thread, which makes several passes over the rcu_node tree within the rcu_gp_init() function. This means that showing the full flow of ordering through the grace-period computation will require duplicating this tree. If you find this confusing, please note that the state of the rcu_node changes over time, just like Heraclitus's river. However, to keep the rcu_node river tractable, the grace-period kernel thread's traversals are presented in multiple parts, starting in this section with the various phases of grace-period initialization.

The first ordering-related grace-period initialization action is to advance the rcu_state structure's ->gp_seq grace-period-number counter, as shown below:
[Figure: grace-period initialization advancing rcu_state.gp_seq at the root]

The actual increment is carried out using smp_store_release(), which helps reject false-positive RCU CPU stall detection. Note that only the root rcu_node structure is touched.
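
In the spirit of a sketch (names simplified; the in-kernel helpers are rcu_seq_start() and friends), the advance and a matching acquire-load might look like:

#include <linux/atomic.h>

static unsigned long gp_seq_sketch;  /* stand-in for rcu_state.gp_seq */

static void gp_seq_advance_sketch(void)
{
  /* Release store: work done before the grace period officially starts
   * is visible to anyone who acquire-loads the new value, which helps
   * stall-warning code avoid false positives. */
  smp_store_release(&gp_seq_sketch, gp_seq_sketch + 1);
}

static unsigned long gp_seq_read_sketch(void)
{
  return smp_load_acquire(&gp_seq_sketch);  /* pairs with the release */
}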

The first pass through the rcu_node tree updates bitmasks based on CPUs having come online or gone offline since the start of the previous grace period. In the common case where the number of online CPUs for this rcu_node structure has not transitioned to or from zero, this pass will scan only the leaf rcu_node structures. However, if the number of online CPUs for a given leaf rcu_node structure has transitioned from zero, rcu_init_new_rnp() will be invoked for the first incoming CPU. Similarly, if the number of online CPUs for a given leaf rcu_node structure has transitioned to zero, rcu_cleanup_dead_rnp() will be invoked for the last outgoing CPU. The diagram below shows the path of ordering if the leftmost rcu_node structure onlines its first CPU and if the next rcu_node structure has no online CPUs (or, alternatively if the leftmost rcu_node structure offlines its last CPU and if the next rcu_node structure has no online CPUs).
[Figure: ordering paths for the first pass's online/offline bitmask updates]

The final rcu_gp_init() pass through the rcu_node tree traverses breadth-first, setting each rcu_node structure's ->gp_seq field to the newly advanced value from the rcu_state structure, as shown in the following diagram.
[Figure: breadth-first propagation of ->gp_seq to each rcu_node structure]

This change will also cause each CPU's next call to __note_gp_changes() to notice that a new grace period has started, as described in the next section. But because the grace-period kthread started the grace period at the root (with the advancing of the rcu_state structure's ->gp_seq field) before setting each leaf rcu_node structure's ->gp_seq field, each CPU's observation of the start of the grace period will happen after the actual start of the grace period.

Quick Quiz:
But what about the CPU that started the grace period? Why wouldn't it see the start of the grace period right when it started that grace period?
Answer:
In some deep philosophical and overly anthropomorphized sense, yes, the CPU starting the grace period is immediately aware of having done so. However, if we instead assume that RCU is not self-aware, then even the CPU starting the grace period does not really become aware of the start of this grace period until its first call to __note_gp_changes(). On the other hand, this CPU potentially gets early notification because it invokes __note_gp_changes() during its last rcu_gp_init() pass through its leaf rcu_node structure.

Self-Reported Quiescent States

When all entities that might block the grace period have reported quiescent states (or as described in a later section, had quiescent states reported on their behalf), the grace period can end. Online non-idle CPUs report their own quiescent states, as shown in the following diagram:
[Figure: a CPU self-reporting a quiescent state up the rcu_node tree]

This is for the last CPU to report a quiescent state, which signals the end of the grace period. Earlier quiescent states would push up the rcu_node tree only until they encountered an rcu_node structure that is waiting for additional quiescent states. However, ordering is nevertheless preserved because some later quiescent state will acquire that rcu_node structure's ->lock.

Any number of events can lead up to a CPU invoking note_gp_changes() (or alternatively, directly invoking __note_gp_changes()), at which point that CPU will notice the start of a new grace period while holding its leaf rcu_node lock. Therefore, all execution shown in this diagram happens after the start of the grace period. In addition, this CPU will consider any RCU read-side critical section that started before the invocation of __note_gp_changes() to have started before the grace period, and thus a critical section that the grace period must wait on.

Quick Quiz:
But an RCU read-side critical section might have started after the beginning of the grace period (the advancing of ->gp_seq from earlier), so why should the grace period wait on such a critical section?
Answer:
It is indeed not necessary for the grace period to wait on such a critical section. However, it is permissible to wait on it. And it is furthermore important to wait on it, as this lazy approach is far more scalable than a “big bang” all-at-once grace-period start could possibly be.

If the CPU does a context switch, a quiescent state will be noted by rcu_note_context_switch() on the left. On the other hand, if the CPU takes a scheduler-clock interrupt while executing in usermode, a quiescent state will be noted by rcu_sched_clock_irq() on the right. Either way, the passage through a quiescent state will be noted in a per-CPU variable.

The next time an RCU_SOFTIRQ handler executes on this CPU (for example, after the next scheduler-clock interrupt), rcu_core() will invoke rcu_check_quiescent_state(), which will notice the recorded quiescent state, and invoke rcu_report_qs_rdp(). If rcu_report_qs_rdp() verifies that the quiescent state really does apply to the current grace period, it invokes rcu_report_rnp() which traverses up the rcu_node tree as shown at the bottom of the diagram, clearing bits from each rcu_node structure's ->qsmask field, and propagating up the tree when the result is zero.

Note that traversal passes upwards out of a given rcu_node structure only if the current CPU is reporting the last quiescent state for the subtree headed by that rcu_node structure. A key point is that if a CPU's traversal stops at a given rcu_node structure, then there will be a later traversal by another CPU (or perhaps the same one) that proceeds upwards from that point, and the rcu_node ->lock guarantees that the first CPU's quiescent state happens before the remainder of the second CPU's traversal. Applying this line of thought repeatedly shows that all CPUs' quiescent states happen before the last CPU traverses through the root rcu_node structure, the “last CPU” being the one that clears the last bit in the root rcu_node structure's ->qsmask field.
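
A heavily simplified sketch of this upward propagation (the real rcu_report_qs_rnp() also handles preempted readers, grace-period numbers, and wakeups; the structure below is an illustrative stand-in):

#include <linux/spinlock.h>

struct rnp_sketch {                /* illustrative rcu_node stand-in */
  raw_spinlock_t lock;
  unsigned long qsmask;            /* CPUs/children still blocking the GP */
  unsigned long grpmask;           /* this node's bit in parent->qsmask */
  struct rnp_sketch *parent;
};

static void report_qs_sketch(struct rnp_sketch *rnp, unsigned long mask)
{
  for (;;) {
    raw_spin_lock(&rnp->lock);     /* real RCU adds smp_mb__after_unlock_lock() */
    rnp->qsmask &= ~mask;
    if (rnp->qsmask) {
      /* Others still pending; a later reporter continues upward,
       * and ->lock orders this report before that traversal. */
      raw_spin_unlock(&rnp->lock);
      return;
    }
    if (!rnp->parent) {
      /* Last bit cleared at the root: the grace period can end
       * (real RCU wakes the grace-period kthread here). */
      raw_spin_unlock(&rnp->lock);
      return;
    }
    mask = rnp->grpmask;           /* clear our bit one level up */
    raw_spin_unlock(&rnp->lock);
    rnp = rnp->parent;
  }
}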

Dynamic Tick Interface

Due to energy-efficiency considerations, RCU is forbidden from disturbing idle CPUs. CPUs are therefore required to notify RCU when entering or leaving idle state, which they do via fully ordered value-returning atomic operations on a per-CPU variable. The ordering effects are as shown below:
[Figure: ordering provided by the dynamic tick (idle-entry/exit) interface]

The RCU grace-period kernel thread samples the per-CPU idleness variable while holding the corresponding CPU's leaf rcu_node structure's ->lock. This means that any RCU read-side critical sections that precede the idle period (the oval near the top of the diagram above) will happen before the end of the current grace period. Similarly, the beginning of the current grace period will happen before any RCU read-side critical sections that follow the idle period (the oval near the bottom of the diagram above).

Plumbing this into the full grace-period execution is described below.

CPU-Hotplug Interface

RCU is also forbidden from disturbing offline CPUs, which might well be powered off and removed from the system completely. CPUs are therefore required to notify RCU of their comings and goings as part of the corresponding CPU hotplug operations. The ordering effects are shown below:
[Figure: ordering provided by the CPU-hotplug interface]

Because CPU hotplug operations are much less frequent than idle transitions, they are heavier weight, and thus acquire the CPU's leaf rcu_node structure's ->lock and update this structure's ->qsmaskinitnext. The RCU grace-period kernel thread samples this mask to detect CPUs having gone offline since the beginning of this grace period.
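
A minimal sketch of the hotplug-side update (illustrative stand-ins; real RCU uses raw_spin_lock_irqsave_rcu_node(), which adds smp_mb__after_unlock_lock() after acquisition):

#include <linux/spinlock.h>

struct leaf_rnp_sketch {           /* illustrative leaf rcu_node stand-in */
  raw_spinlock_t lock;
  unsigned long qsmaskinitnext;    /* CPUs to wait on in future GPs */
};

static void hotplug_update_sketch(struct leaf_rnp_sketch *rnp,
                                  unsigned long cpu_bit, bool incoming)
{
  unsigned long flags;

  raw_spin_lock_irqsave(&rnp->lock, flags);
  if (incoming)
    rnp->qsmaskinitnext |= cpu_bit;   /* CPU coming online */
  else
    rnp->qsmaskinitnext &= ~cpu_bit;  /* CPU going offline */
  raw_spin_unlock_irqrestore(&rnp->lock, flags);
}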

Plumbing this into the full grace-period execution is described below.

Forcing Quiescent States

As noted above, idle and offline CPUs cannot report their own quiescent states, and therefore the grace-period kernel thread must do the reporting on their behalf. This process is called “forcing quiescent states”, it is repeated every few jiffies, and its ordering effects are shown below:
[Figure: forcing quiescent states on behalf of idled and offlined CPUs]

Each pass of quiescent state forcing is guaranteed to traverse the leaf rcu_node structures, and if there are no new quiescent states due to recently idled and/or offlined CPUs, then only the leaves are traversed. However, if there is a newly offlined CPU as illustrated on the left or a newly idled CPU as illustrated on the right, the corresponding quiescent state will be driven up towards the root. As with self-reported quiescent states, the upwards driving stops once it reaches an rcu_node structure that has quiescent states outstanding from other CPUs.

Quick Quiz:
The leftmost drive to root stopped before it reached the root rcu_node structure, which means that there are still CPUs subordinate to that structure on which the current grace period is waiting. Given that, how is it possible that the rightmost drive to root ended the grace period?
Answer:
Good analysis! It is in fact impossible in the absence of bugs in RCU. But this diagram is complex enough as it is, so simplicity overrode accuracy. You can think of it as poetic license, or you can think of it as misdirection that is resolved in the stitched-together diagram.

Grace-Period Cleanup

Grace-period cleanup first scans the rcu_node tree breadth-first advancing all the ->gp_seq fields, then it advances the rcu_state structure's ->gp_seq field. The ordering effects are shown below:
[Figure: grace-period cleanup propagating ->gp_seq updates through the tree]

As indicated by the oval at the bottom of the diagram, once grace-period cleanup is complete, the next grace period can begin.

Quick Quiz:
But when precisely does the grace period end?
Answer:
There is no useful single point at which the grace period can be said to end. The earliest reasonable candidate is as soon as the last CPU has reported its quiescent state, but it may be some milliseconds before RCU becomes aware of this. The latest reasonable candidate is once the rcu_state structure's ->gp_seq field has been updated, but it is quite possible that some CPUs have already completed phase two of their updates by that time. In short, if you are going to work with RCU, you need to learn to embrace uncertainty.

Callback Invocation

Once a given CPU's leaf rcu_node structure's ->gp_seq field has been updated, that CPU can begin invoking its RCU callbacks that were waiting for this grace period to end. These callbacks are identified by rcu_advance_cbs(), which is usually invoked by __note_gp_changes(). As shown in the diagram below, this invocation can be triggered by the scheduling-clock interrupt (rcu_sched_clock_irq() on the left) or by idle entry (rcu_cleanup_after_idle() on the right, but only for kernels built with CONFIG_RCU_FAST_NO_HZ=y). Either way, RCU_SOFTIRQ is raised, which results in rcu_do_batch() invoking the callbacks, which in turn allows those callbacks to carry out (either directly or indirectly via wakeup) the needed phase-two processing for each update.
[Figure: code paths leading to callback invocation via rcu_do_batch()]

Please note that callback invocation can also be prompted by any number of corner-case code paths, for example, when a CPU notes that it has excessive numbers of callbacks queued. In all cases, the CPU acquires its leaf rcu_node structure's ->lock before invoking callbacks, which preserves the required ordering against the newly completed grace period.

However, if the callback function communicates to other CPUs, for example, doing a wakeup, then it is that function's responsibility to maintain ordering. For example, if the callback function wakes up a task that runs on some other CPU, proper ordering must be in place in both the callback function and the task being awakened. To see why this is important, consider the top half of the grace-period cleanup diagram. The callback might be running on a CPU corresponding to the leftmost leaf rcu_node structure, and awaken a task that is to run on a CPU corresponding to the rightmost leaf rcu_node structure, and the grace-period kernel thread might not yet have reached the rightmost leaf. In this case, the grace period's memory ordering might not yet have reached that CPU, so again the callback function and the awakened task must supply proper ordering.
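
One way to supply that ordering is a release/acquire pair between the callback and the awakened task; a hedged sketch (the flag and both function names are illustrative):

#include <linux/rcupdate.h>
#include <linux/sched.h>

static int gp_phase2_ready;  /* illustrative flag published by the callback */

/* Runs as an RCU callback, possibly on the leftmost leaf's CPU. */
static void publish_and_wake_cb(struct rcu_head *rhp)
{
  /* The release store supplies the ordering that the not-yet-cleaned-up
   * rightmost leaf cannot: everything this callback saw, including the
   * completed grace period, is ordered before the flag. */
  smp_store_release(&gp_phase2_ready, 1);
  /* ... then wake the waiting task, e.g. with wake_up_process() ... */
}

/* Runs on a CPU whose leaf rcu_node may not yet have been visited. */
static void awakened_task_sketch(void)
{
  /* The acquire load pairs with the release store above. */
  while (!smp_load_acquire(&gp_phase2_ready))
    schedule_timeout_uninterruptible(1);
  /* Safe to rely on the grace period's ordering from here on. */
}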

Putting It All Together

A stitched-together diagram is here:
[Figure: stitched-together diagram of the full grace-period memory-ordering flow]

Legal Statement

This work represents the view of the author and does not necessarily represent the view of IBM.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.