Compute Is Easy, Memory Is Harder And Harder

What good is a floating point operation embodied in a vector or matrix unit if you can’t get data into it fast enough to actually use the compute engine to process it in some fashion in a clock cycle? The answer is obvious to all of us: Not much.

People have been talking about the imbalance between compute and memory bandwidth for decades, and every year the high performance computing industry has been forced to accept less and less memory bandwidth per floating point operation, because increasing memory bandwidth in a fashion that doesn’t also end up being very pricey is exceedingly difficult.

And while we are thinking about it, increasing memory capacity is also getting more difficult because fat memory is also coming under Moore’s Law pressures, and making memory that is both increasingly dense and increasingly fast is getting harder and harder; hence the price of memory has not come down as much as it otherwise would have. And thus, we do not have the kind of massive memory machines that we might have dreamed of decades ago.

We were reminded of this acutely during the Turing Award keynote by Jack Dongarra, well known to readers of The Next Platform as a distinguished researcher at Oak Ridge National Laboratory and research professor emeritus at the University of Tennessee. Like many of you, we watched the Turing Award keynote that Dongarra gave, talking about how he unexpectedly got into the supercomputing business and became the expert on measuring system performance on these massive machines – mostly by being part of the team that was constantly evolving math libraries as supercomputer architectures changed every decade or so. If you haven’t watched the keynote, you should, and you can do so at this link. This history is fascinating, and it forecasts how we will continue to evolve software as architectures continue to evolve.

But that is not what we are going to talk about here.

What stuck out in our minds as we were watching Dongarra’s keynote was the massive overprovisioning of flops in today’s processors relative to memory bandwidth, and it resonated because that same week Intel had just announced some benchmark results on its upcoming “Sapphire Rapids” Xeon SP server CPUs, showing the benefit of HBM2e stacked memory, which has roughly 4X the memory bandwidth of the plain vanilla DDR5 memory sticks used in modern server CPUs. (Sapphire Rapids has a 64 GB HBM2e memory option, which can be used in conjunction with DDR5 memory or instead of it.) The benefit of the HBM2e high bandwidth memory shows how far out of whack flops and bandwidth are:

http://www.nextplatform.com/wp-content/uploads/2022/11/intel-max-series-cpu-hpc-performance.jpg

The addition of HBM2e memory to the Sapphire Rapids CPU does not affect Dongarra’s beloved High Performance Linpack (HPL) matrix math test very much, as you can see, and that is because HPL is not particularly memory bound. But the High Performance Conjugate Gradients (HPCG) and Stream Triad benchmarks, both of which are memory bound like crazy, sure do see a performance boost just by switching memory. (We presume that the machines tested had a pair of top bin, 60-core Sapphire Rapids chips.) Under normal circumstances with the HPCG test, which is probably the most accurate test reflecting how some very tough HPC applications really are written (by necessity, not by choice), the world’s fastest supercomputers use anywhere from 1 percent to 5 percent of the machine’s potential flops. So increasing this by a factor of 3.8X would be a very, very big improvement indeed if that performance can scale across thousands of nodes. (This remains to be seen, and HPCG is the test that will – or won’t – show it.)
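
As a point of reference, Stream Triad is about as simple as a benchmark gets: two flops for every 24 bytes of memory traffic, which is why it responds so directly to a memory swap. Here is a minimal, unofficial sketch of the Triad kernel in C – the real STREAM benchmark has strict rules about array sizes, repetitions, and timing, so treat the sizing and timing below as purely illustrative:

```c
/* Minimal sketch of the Stream Triad kernel -- not the official STREAM
 * benchmark, just the core loop with simplified sizing and POSIX timing. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 26)  /* 64M doubles per array (512 MB each) to defeat the caches */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    const double q = 3.0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];  /* 2 flops and 24 bytes of traffic per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double gbytes = 3.0 * N * sizeof(double) / 1e9;  /* two reads plus one write */
    printf("Triad: %.1f GB/sec\n", gbytes / secs);
    free(a); free(b); free(c);
    return 0;
}
```

Compiled with optimization on a garden variety two-socket server, a loop like this settles at something close to the machine’s sustainable DRAM bandwidth, which is exactly why Triad scales almost in lockstep when DDR5 is swapped out for HBM2e.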

So just how far out of whack are flops and memory bandwidth with each other? Dongarra showed how it is getting worse with each passing architectural revolution in supercomputing:

http://www.nextplatform.com/wp-content/uploads/2022/12/dongarra-turing-memory-vs-compute.jpg

And here is a zoom into the chart that Dongarra showed:

http://www.nextplatform.com/wp-content/uploads/2022/12/dongarra-turing-memory-vs-compute-zoom.jpg

“When we look at performance today on our machines, the data movement is the thing that’s the killer,” Dongarra explained. “We’re looking at the floating point execution rate divided by the data movement rate, and we’re looking at different processors. In the old days, we had processors that basically had a match of one flop per one data movement – that’s how they were balanced. And if you guys remember the old Cray-1s, you could do two floating point operations and three data movements all simultaneously. So this is trying to get a handle on that. But over time, the processors have changed the balance. What has happened over the course of the next twenty years from the beginning here is that an order of magnitude was lost. That is, we can now do ten floating point operations for every data movement that we make. And more recently, we’ve seen that number grow to 100 floating point operations for every data movement. And even some machines today are in the 200 range. That says there’s a tremendous imbalance between the floating point and data movement. So we have tremendous floating point capability – we are overprovisioned for floating point – but we don’t have the mechanism for moving data very effectively around in our system.”
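
To put numbers on the ratio Dongarra is describing, here is the back-of-the-envelope arithmetic in C. The peak flops and bandwidth figures below are round numbers we are assuming purely for illustration – they are not the datasheet of any particular part:

```c
/* Back-of-the-envelope flops-per-word ratio in the spirit of Dongarra's
 * chart. Both peak figures are illustrative assumptions, not measurements. */
#include <stdio.h>

int main(void) {
    double peak_gflops = 3200.0;  /* assumed FP64 peak for a many-core vector CPU */
    double mem_gbs     = 300.0;   /* assumed aggregate DDR5 bandwidth per socket */

    double words_per_sec = mem_gbs / 8.0;  /* one FP64 word is 8 bytes */
    /* The Cray-1 sat near one flop per word moved; this lands near 85. */
    printf("flops per word moved: %.0f\n", peak_gflops / words_per_sec);
    return 0;
}
```

Swap the DDR5 figure for an HBM2e-class 1,000 GB/sec and the same arithmetic drops the ratio to around 26 – better, but still a long way from the Cray-1’s balance.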

The chart shows how the balance has gotten worse and worse with each generation. And moving to HBM2e and even HBM3 or HBM4 and HBM5 memory is only a start, we think. And CXL memory can only partially address the issue. Inasmuch as CXL memory is faster than flash, we love it as a tool for system architects. But there are only so many PCI-Express lanes in the system to do CXL memory capacity and memory bandwidth expansion inside of a node. And while shared memory is interesting and possibly quite useful for HPC simulation and modeling as well as AI training workloads – again, because it will be higher performing than flash storage – that doesn’t mean any of this will be affordable.
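
Here is a quick sense of why those lanes run out. The figures are rounded assumptions on our part – roughly 64 GB/sec per direction for a PCI-Express 5.0 x16 link, and something on the order of 1 TB/sec for a multi-stack HBM2e package:

```c
/* Rough comparison of CXL-over-PCI-Express bandwidth against on-package
 * HBM2e, using rounded figures that are our assumptions, not spec sheets. */
#include <stdio.h>

int main(void) {
    double pcie5_x16_gbs = 64.0;    /* ~64 GB/sec per direction, Gen5 x16 link */
    double hbm2e_gbs     = 1000.0;  /* ~1 TB/sec class, multi-stack HBM2e */

    double links = hbm2e_gbs / pcie5_x16_gbs;
    printf("Gen5 x16 links to match HBM2e: %.0f (%.0f lanes)\n",
           links, links * 16.0);  /* about 16 links, or 256 lanes */
    return 0;
}
```

Today’s server sockets have well under 256 lanes to play with, and they have to feed I/O and accelerators too, which is why CXL can add capacity and some bandwidth but cannot close the gap by itself.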

We don’t yet know what even the HBM2e memory option on Sapphire Rapids will cost. If it gooses memory bound applications by 4X to 5X but the CPU costs 3X more, that is not really a gain on the performance per watt front that gates architectural choices.

The HBM2e memory option on the future Xeon SP is a good step in the right direction. But maybe having a lot more SRAM in L1, L2, and L3 caches is more important than adding cores if we want to get the memory back in balance.

Having won the Turing Award gives Dongarra a chance to lecture the industry a bit, and once encouraged to do so, he thankfully did. And we quote him at length because when Dongarra speaks, people should listen.

“I have harped on the imbalance of the machines,” Dongarra said. “So today, we build our machines based on commodity off the shelf processors from AMD or Intel, commodity off the shelf accelerators, commodity off the shelf interconnects – those are commodity stuff. We’re not designing our hardware to the specifics of the applications that are going to be used to drive them. So perhaps we should step back and have a closer look at how the architecture should interact with the applications, with the software co-design – something we talk about, but the reality is very little co-design takes place today with our hardware. And you can see from those numbers, there’s very little that goes on. And perhaps a good – better – indicator is what’s happening in Japan, where they have much closer interactions with the architects, with the hardware people to design machines that have a better balance. So if I was going to look at forward looking research projects, I would say maybe we should spin up projects that look at architecture and have the architecture better reflected in the applications. But I would say that we should have a better balance between the hardware and the applications and the software – really engage in co-design. Have spin-off projects, which look at hardware. You know, in the old days, when I was going to school, we had universities that were developing architectures, that would put together machines. Illinois was a good example of that – Stanford, MIT, CMU. Other places spun up and had hardware projects that were investigating architectures. We don’t see that as much today. Maybe we should think about investing there, putting some research money – perhaps from the Department of Energy – into that mechanism for doing that kind of work.”

We agree wholeheartedly on hardware-software co-design, and we believe that architectures should reflect the software that runs on them. Frankly, if an exascale machine costs $500 million, but you can only use 5 percent of the flops to do real work, that is like paying $10 billion for what is effectively a 100 petaflops machine running at 100 percent utilization if you look at the price/performance. To do it the way Dongarra is suggesting would make all supercomputers more unique and less general purpose, and also more expensive. But there is a place where the performance per watt, cost per flops, performance per memory bandwidth, and cost per memory bandwidth all line up better than we are seeing today with tests like HPCG. We have to get these HPC and AI architectures back in whack.
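
For what it is worth, here is one way to run that arithmetic, using the assumptions in the paragraph above – a 1 exaflops peak machine at $500 million, sustaining 5 percent on memory bound codes:

```c
/* Effective price/performance of an exascale machine under the article's
 * assumptions: $500M price tag, 1 exaflops peak, 5 percent sustained. */
#include <stdio.h>

int main(void) {
    double price_busd  = 0.5;   /* $500 million, in billions of dollars */
    double peak_eflops = 1.0;   /* 1 exaflops peak */
    double efficiency  = 0.05;  /* 5 percent sustained on HPCG-like codes */

    double sustained_pflops = peak_eflops * efficiency * 1000.0;       /* 50 PF */
    double busd_per_sustained_ef = price_busd / (peak_eflops * efficiency);
    printf("sustained: %.0f petaflops; effective price: $%.0f billion per sustained exaflops\n",
           sustained_pflops, busd_per_sustained_ef);  /* 50 PF and $10 billion */
    return 0;
}
```

That $10 billion per sustained exaflops is the sense in which a $500 million machine is really a much smaller machine at a much bigger price.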

The next generation of researchers, inspired by Dongarra and his peers, need to tackle this memory bandwidth problem and not sweep it under the rug. Or, better still for a metaphorical image – stop rolling it up in a carpet like a Mob hit and driving it out to the Meadowlands in the trunk of a Lincoln. A divergence of 100X or 200X is, in fact, a performance and an economic crime.
