Will AMD’s MI300 beat NVIDIA in AI?

The upcoming MI300, which will ship later this year after NVIDIA’s Grace/Hopper Superchip, certainly has a shot at that. But many unknowns remain that will determine how well it works for AI applications. And then there is software. Yes, software. Lots of software.

At her 2023 CES keynote, AMD CEO Dr. Lisa Su reiterated the company’s plan to bring the Instinct MI300 to market by the end of this year, showing the monster silicon in hand. The chip is certainly an important milestone for the company, and for the industry in general, as the most aggressive chiplet implementation to date. Combining the industry’s fastest CPU with a new GPU and HBM can bring many benefits, especially since the design shares that HBM memory across the entire compute complex. The idea of a large APU is not new; I worked on the canceled Big APU at AMD in 2014 and am a true believer. But combining the CPU and GPU into a single package is just the start.

What we know

The MI300 is a monster device, with nine of TSMC’s 5nm chiplets stacked over four 6nm chiplets using 3D die stacking, all paired with 128GB of shared HBM memory on the package to maximize bandwidth and minimize data movement. Note that NVIDIA’s Grace/Hopper, which we expect to ship before the MI300, will still split memory into two separate pools, using HBM for the GPU and a much larger pool of DRAM for the CPU. AMD says the MI300 can optionally run without any DRAM at all, using just the HBM, which would be pretty cool and really fast.
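To make the data-movement argument concrete, here is a back-of-the-envelope sketch of the copy tax a split-memory design pays on every host-to-device transfer, and that a single shared HBM pool avoids. The link bandwidth and working-set size below are illustrative assumptions of mine, not AMD or NVIDIA specifications.

```python
# Back-of-the-envelope: the CPU<->GPU copy time a shared HBM pool avoids.
# All numbers below are illustrative assumptions, not vendor specifications.

def copy_time_ms(bytes_moved: float, link_gb_per_s: float) -> float:
    """Time to move `bytes_moved` over a host<->device link of the given bandwidth."""
    return bytes_moved / (link_gb_per_s * 1e9) * 1e3

batch_bytes = 8 * 1e9     # assume an 8 GB working set shuffled between CPU and GPU per step
host_link = 64.0          # assume ~64 GB/s effective host<->device bandwidth

print(f"Split memory pools: ~{copy_time_ms(batch_bytes, host_link):.0f} ms of copying per step")
print("Shared HBM pool:    ~0 ms, since CPU and GPU dereference the same physical memory")
```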

At 146B transistors, this device will take a lot of energy to power and cool; I have seen estimates of 900 watts. But at this advanced end of AI hardware, that may not matter; NVIDIA’s Grace/Hopper Superchip will consume about the same, and a Cerebras Wafer-Scale Engine uses 15kW. What matters is how much work that power enables.

AMD repeated the claim from its financial analyst day that the MI300 will outperform its own MI250X by 8X for AI while delivering 5X the power efficiency. We would note that this is a fairly low bar, since the MI250X does not natively support low-precision math below 16 bits. The new GPU will likely support 4- and 8-bit integer and floating point formats, and will have four times the number of CUs, so 8X is a chip shot AMD should be able to exceed.
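As a rough illustration of why sub-16-bit support matters, and why accuracy usually survives it, here is a small NumPy sketch that runs a matrix multiply through 8-bit integers with a simple per-tensor scale. The scaling scheme is my own illustrative choice, not anything AMD has disclosed.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 128)).astype(np.float32)
B = rng.standard_normal((128, 32)).astype(np.float32)

def quantize(x):
    """Symmetric per-tensor quantization to int8; returns the int tensor and its scale."""
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale

A_q, a_scale = quantize(A)
B_q, b_scale = quantize(B)

# Accumulate in int32, as low-precision matrix engines do, then rescale to float.
ref = A @ B
approx = (A_q.astype(np.int32) @ B_q.astype(np.int32)) * (a_scale * b_scale)

rel_err = np.abs(approx - ref).mean() / np.abs(ref).mean()
print(f"Mean relative error of int8 matmul vs fp32: {rel_err:.3%}")
```

The throughput win comes from pushing twice as many 8-bit operands per cycle as 16-bit ones through the same datapath, which is where most of the easy “8X” headroom lives.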

What we don’t know

So from a hardware standpoint, the MI300 looks potentially very strong. But AMD has been slow to innovate beyond the GPU cores, focusing more on the floating point needed by HPC customers. For example, AMD did not provide an equivalent to Tensor Cores on the MI250X, which can dramatically improve the performance of AI (and select HPC) applications by increasing parallelism. Does the MI300 support tensor cores? I would assume so. But the AI game has moved on from the convolutional algorithms for image processing, which tensor cores accelerate well, to natural language processing and foundation generative models, and that requires more innovation.
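For readers unfamiliar with what a tensor or matrix core actually buys you: instead of issuing one scalar fused multiply-add at a time, the hardware consumes an entire small tile per instruction (D = A×B + C). The NumPy sketch below simply counts the scalar FMAs that one tile-level instruction replaces; the 16×16×16 tile shape is illustrative, not any vendor’s actual geometry.

```python
import numpy as np

# One tensor-core-style instruction computes D = A @ B + C on a small tile.
# The tile shape here (16x16x16) is illustrative; real hardware tile sizes vary.
M, N, K = 16, 16, 16
A = np.random.rand(M, K).astype(np.float16)
B = np.random.rand(K, N).astype(np.float16)
C = np.zeros((M, N), dtype=np.float32)

D = A.astype(np.float32) @ B.astype(np.float32) + C   # one "matrix" operation

scalar_fmas = M * N * K   # what a plain scalar/vector pipeline would have to issue
print(f"One {M}x{N}x{K} tile instruction replaces {scalar_fmas} scalar FMAs")
```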

As we’ve all seen with GPT-3 and now ChatGPT, large foundation language models are the new frontier for AI. To accelerate them, NVIDIA’s Hopper has a Transformer Engine that can speed up training by as much as 9X and inference throughput by as much as 30X. The H100 Transformer Engine can mix 8-bit and 16-bit half-precision math as needed while maintaining accuracy. Will AMD have something similar? AMD fans had better hope so; foundation models are the future of AI.
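The core trick behind mixing 8-bit and 16-bit precision is per-tensor scaling: measure a tensor’s dynamic range, scale it into the representable range of the narrow format, do the matrix math, then undo the scale. The sketch below imitates that idea in NumPy with a crude stand-in for an FP8-style (e4m3) format; NVIDIA’s actual Transformer Engine heuristics are more sophisticated and are not what is shown here.

```python
import numpy as np

FP8_MAX = 448.0              # approx. largest finite value in the e4m3 format
FP8_MIN_NORMAL = 2.0 ** -6   # approx. smallest normal e4m3 value

def fake_fp8(x):
    """Crude e4m3 stand-in: clip to range, keep a few mantissa bits, flush tiny values.
    Real hardware also keeps subnormals; that detail is ignored here."""
    clipped = np.clip(x, -FP8_MAX, FP8_MAX)
    mant, exp = np.frexp(clipped)
    rounded = np.ldexp(np.round(mant * 16) / 16, exp)   # coarse mantissa rounding
    return np.where(np.abs(rounded) < FP8_MIN_NORMAL, 0.0, rounded)

def scaled_fp8(x):
    """Per-tensor scaling: fit the tensor's dynamic range into FP8 before casting."""
    scale = FP8_MAX / max(np.abs(x).max(), 1e-12)
    return fake_fp8(x * scale) / scale

x = np.random.standard_normal(4096).astype(np.float32) * 1e-3   # small-magnitude activations
naive = fake_fp8(x)       # casting directly wipes out values below FP8's normal range
scaled = scaled_fp8(x)    # scaling first preserves them

print("naive  max abs error:", np.abs(naive - x).max())
print("scaled max abs error:", np.abs(scaled - x).max())
```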

We also don’t know how large a shared-memory cluster the MI300 will support. NVIDIA, specifically, is moving from an 8-GPU node to a 256-GPU shared-memory cluster, greatly simplifying the deployment of large AI models. Likewise, we don’t yet know how AMD will scale up to larger nodes; different models require different ratios of GPUs to CPUs, and NVIDIA has shown that it will support 16 Hoppers per Grace CPU over NVLink.
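A quick, hedged calculation shows why the size of that shared-memory domain matters. The figures below are round rules of thumb (a GPT-3-class parameter count, fp16 weights, and roughly 16 bytes per parameter for training state), paired with the 128GB of HBM reported for the MI300; none of this is a vendor spec for any particular system.

```python
# Why large coherent clusters matter: a 175B-parameter model doesn't fit in one package.
# Round illustrative numbers; not vendor specifications.
params = 175e9                  # GPT-3-class parameter count
bytes_per_param_infer = 2       # fp16/bf16 weights for inference
bytes_per_param_train = 16      # weights + gradients + optimizer state (rough rule of thumb)
hbm_per_package = 128e9         # HBM capacity reported for an MI300 package

infer_pkgs = params * bytes_per_param_infer / hbm_per_package
train_pkgs = params * bytes_per_param_train / hbm_per_package
print(f"Inference needs ~{infer_pkgs:.0f}+ packages just to hold the weights")
print(f"Training needs ~{train_pkgs:.0f}+ packages before activations are even counted")
```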

Software is a big problem for AMD

Finally, in the software arena, I think we have to give AMD a hall pass: given AMD’s AI hardware performance to date, there hasn’t been much serious work on the software stack. Yes, ROCm is a good start, but it really only covers the basics: getting code to run reasonably well on the hardware.

By contrast, consider ROCm next to NVIDIA’s software stack. The ROCm libraries roughly correspond to just ONE of the small icons on the NVIDIA image below: cuDNN. NVIDIA doesn’t even bother to call out things like OpenMPI, debuggers and tracers, or Kubernetes and Docker; those are merely table stakes. AMD has no Triton Inference Server, no RAPIDS, no TensorRT, etc., etc., etc. And there is no hint of anything approaching the 14 application frameworks across the top of NVIDIA’s slide.

That said, some customers, such as OpenAI, have insulated themselves from vendor-supplied software, and that is instructive. Last year, OpenAI introduced its open-source Triton software stack, bypassing the NVIDIA CUDA stack. One could imagine OpenAI running its own software on the MI300 and being just fine. But for most others, there is much more to AI software than CUDA libraries.
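For a flavor of what that looks like in practice, below is a minimal vector-add kernel written with OpenAI’s Triton, closely following Triton’s own introductory tutorial. Whether kernels like this run well on an MI300 depends entirely on the maturity of the AMD backend, which is exactly the open question.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements            # guard the ragged tail of the vector
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

On NVIDIA GPUs today this compiles through Triton’s own backend rather than hand-written CUDA; in principle the same source could target AMD hardware if and when the ROCm backend matures.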

Conclusions

AMD has done an admirable job with the MI300, leading the entire industry in embracing chiplet-based architectures. We believe the MI300 will position AMD as a worthy alternative to Grace/Hopper, especially for those who prefer a non-NVIDIA platform. Consequently, AMD has the opportunity to be considered a viable second source for fast GPUs, especially where HPC is the number one application space and AI is an important but secondary consideration; AMD’s floating point performance is already well ahead of NVIDIA’s. And Intel’s combined CPU + GPU, called Falcon Shores, is not slated to arrive until 2024, assuming that schedule doesn’t slip.

But what we and the market need to see is real-world application performance. So let’s see some MLPerf, AMD!
