指令cache一致性

N2 also gets optional hardware instruction cache coherency. ARM recommends enabling it on systems with a lot of cores because broadcasting software-issued instruction cache invalidates would not be scalable. To implement instruction cache coherency, ARM makes the L2 cache inclusive of L1i contents. Then, I assume the L2 becomes exclusive of L1 data cache contents, ensuring that data writes will never cause a L1d hit when the address is in L1i. Finally, a read-for-ownership will evict a line from L2 caches in all other cores, which automatically causes L1i invalidates and prevents L1i caches from holding stale data.

ARM recommends configuring the core with 1 MB of L2. A 512 KB L2 would spend 1/8 of its capacity duplicating L1i contents to ensure L1i coherency, and a 256 KB L2 would be a really bad idea (likely why ARM doesn’t even allow it as an option). Thankfully, going to 1 MB of L2 capacity doesn’t cost any extra latency. Just as with A710’s 512 KB L2, getting data from L2 takes 13-14 cycles. Ampere Altra and Zen 4 have similar cycle counts for L2 accesses, though Zen 4 enjoys an actual latency advantage thanks to higher clock speeds.

‍

Strangely, code fetch bandwidth from L2 is worse than from L3. I wonder if Loongson ran into some difficulties when implementing hardware instruction cache coherency. If done correctly, hardware instruction cache coherency can benefit JIT-ed code and enable better scaling to high core counts. However, it’s not easy. Loongson’s L2 is non-inclusive, which means it can’t act as a snoop filter. Maybe a L2 hit from the instruction side has to probe the L1D to ensure it gets up-to-date data. But a L3 hit might benefit from separate coherency directory located in the L3 complex, which can indicate whether up-to-date data can be provided without snoops.

526互联

指令cache一致性

指令cache一致性