One interesting consequence of the model consolidation happening in the LLM world is that it's actually easier for AMD's hardware to be competitive. It used to be that you had to support a ton of ops, since nobody was going to use your GPU for just a single model. Nowadays, most models are just transformers with only a few operations inside, albeit repeated many times over. On top of that, you can build a reasonable application or business using only one or two models (e.g. LLaMA 2 + Whisper). So as long as they can get those running fast, they can sell GPUs to people who don't need the flexibility and breadth of CUDA.
I'm not sure this is true. Nvidia ships hardware today that is specialized for transformers, and presumably AMD needs something similar on their side. An arms race over specialized hardware like that is probably not to their advantage, and it makes the interoperability story even harder.
They still need a usable GPGPU language to glue all that hardware together, too.
There is nothing special about Nvidia hardware beyond them shipping significantly better low-precision FLOPS before anyone else. Their marketing chose to call this a "transformer engine", but there is no actual "engine" in silicon.
As AMD announced today, and as shown in the article we're commenting on, AMD is now 1.3 times faster at FP8 and INT8 FLOPS, with and without sparsity, meaning their "transformer engine" is 1.3 times faster.
The rest of your comment doesn't make sense. See the other comments on software in this thread.
That seems like a minor bump compared to the new H100, but sometimes the quality of your training, or the speed of your inference, is hugely affected by small thresholds of VRAM capacity per card. The MI300X can do things the H100 cannot, and judging by the sheer size of the silicon, do them reasonably quickly.
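For a rough sense of why those capacity thresholds matter, here's a back-of-the-envelope sketch. The parameter counts and per-card capacities are just illustrative assumptions, and it ignores KV cache, activations and framework overhead, which add a lot in practice:

    import math

    # Weights-only footprint: how many cards just to hold the model?
    def cards_needed(params_billion, bytes_per_param, card_gb):
        footprint_gb = params_billion * bytes_per_param   # 1e9 params * bytes ~ GB
        return footprint_gb, math.ceil(footprint_gb / card_gb)

    for params in (70, 180):          # e.g. a 70B-class and a 180B-class model
        for card in (80, 192):        # H100-class vs MI300X-class capacity
            gb, n = cards_needed(params, 2, card)   # 2 bytes/param = FP16
            print(f"{params}B @ FP16 ~ {gb} GB -> {n} x {card} GB card(s)")

The interesting cases are the ones where a model that needed several cards (and the interconnect traffic that comes with them) suddenly fits on one.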
I think VRAM is one thing they can beat Nvidia on, both on the pro and consumer GPUs. Just put 48 GB into a consumer GPU and everyone would flock to AMD and help build out their software side.
If they were going to do it on the consumer side, they would have done it already. And I would own a 7900 48GB instead of a 3090, and probably have debugged ROCm on several projects by now :/
Exactly - users would flock to it. They have to compete with Nvidia somehow, and I don't see what other edge they could have: the massive disadvantage on the software side has to be compensated for in some way.
Well, local LLM inference with ROCm isn't exactly a huge business. You're not wrong, but I can see why decision makers wouldn't want to undercut the pro line.
I'll add some highlights for those who don't have the time: FlashAttention-2 kernels, PagedAttention kernels, PyTorch, vLLM, Hugging Face transformers, ONNX Runtime support, OpenAI Triton support in v3, TensorFlow, JAX, OpenXLA.
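On the PyTorch point specifically: the ROCm build exposes AMD GPUs through the regular torch.cuda API, so most existing code runs unchanged. A minimal sketch, assuming a supported GPU and a ROCm build of PyTorch:

    import torch

    # On a ROCm build, the HIP backend is surfaced through torch.cuda,
    # so "cuda" device strings resolve to the AMD GPU.
    print(torch.cuda.is_available())   # True on a supported AMD GPU
    print(torch.version.hip)           # HIP/ROCm version string (None on CUDA builds)

    x = torch.randn(4096, 4096, device="cuda")
    y = x @ x                          # matmul dispatched to rocBLAS under the hood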
Also, while this wasn't in the presentation, CUDA can be translated to HIP using automated tools. For example:
CuPBoPAMD is a translator that converts CUDA programs at the NVVM IR level into HIP-compatible IR that can run on AMD GPUs (https://dl.acm.org/doi/pdf/10.1145/3624062.3624185).
One of my coworkers uses a Vega 56 in their local workstation. That card is six years old. It's not officially supported, but it still works. I have no idea if the AI frameworks work on hardware that old, but the math libraries still do.
All that buzzword bingo does not mean the software is actually robust and performs well. If you have hardware that has much better specs on paper, but only delivers maybe 20% better performance in their own benchmarks, I am still going to presume that their software is not up to par.
20% is a huge amount in terms of high end performance. For example, 20% more transistors / compute units / etc does not give 20% more performance at the top end unless your execution is excellent.
OP's point is that the hardware specs are 2-3x higher in many places, but all their benchmarks are only 20-30% higher. The article mentions this as well. It means AMD can't even utilize their own hardware very well at this point.
Is this a drop-in replacement for clouds (which is the actual question to ask, because if you're buying this you have to build new infrastructure)? The answer is no. Will Microsoft and Oracle let you rent it? Yes. Do Microsoft and Oracle need more accelerators for internal business processes? Absolutely (Microsoft announced it would rent AI accelerators because it can't keep up with its own internal, and maybe external, demand). So cost has to be measured not just per card but including the infrastructure build-out.
Not sure when Microsoft will just have their own TPU-style accelerator, but I wouldn't be surprised if they announce it sometime.
The part is OCP-compliant and can be dropped in. Microsoft already announced their "TPU" at Ignite. The amount of misinformation in these threads is driving me insane.
Can you cite your claims about Oracle wanting to deploy these? Oracle has a deep partnership with Nvidia to the point that in many cases, they have joint booths at conferences.
I was not aware of any efforts to purchase or deploy MI300s at OCI.
AMD's use of chiplets matters here. There is overhead, but they not only get higher yields, they also have many more dies that can hit their highest target clock speed, meaning they can ship far more high-end parts.
This is the same reason they were so successful with EPYC. It's easier to find eight small 8-core chiplets that hit high clocks at low power than to find one 56-core monolithic die that does the same, which is why Intel sold so many cut-down chips while AMD just sold its small number of defective chiplets as 6- and 12-core consumer parts.
Nvidia will have a harder time shipping defective dies, because everyone wants the best chips with the best performance/watt to save money and do things fast. Outside of that, there's not a huge market for defective 800 mm2 chips and the massive GPUs they come on. AMD can sell their defective units a couple of chiplets at a time to non-AI customers in laptops and workstation cards.
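For intuition, here's a toy yield calculation using the standard first-order Poisson model; the defect density and die areas are made up for illustration:

    from math import exp

    # First-order yield model: Y = exp(-A * D), A = die area, D = defect density.
    def die_yield(area_mm2, defects_per_cm2=0.2):   # 0.2/cm^2 is illustrative
        return exp(-(area_mm2 / 100.0) * defects_per_cm2)

    print(f"~800 mm^2 monolithic die: {die_yield(800):.0%} yield")   # ~20%
    print(f"~200 mm^2 chiplet:        {die_yield(200):.0%} yield")   # ~67%
    # With chiplets you bin known-good dies before packaging, so a defect
    # costs you one small chiplet rather than a whole huge die, and the
    # partially-defective chiplets can still be sold in cheaper products.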
It's transparent, but it has a performance cost. There will be low-level ways to improve locality for certain libraries, but it's nearly impossible for common users to claw that performance loss back.
I ran 120k of the RX 470/RX 580 series and another 30k of the PS5 APU. All of them were run at the most efficient and tuned settings possible. Every single chip was a snowflake.
We're not just talking about defective chips; we're talking about a silicon lottery on performance, and it varies between every single one.
If the backlog gets too big, as a consumer you might feel compelled to just abandon it and switch to the "available" alternative, even if it's slightly inferior.
Suddenly having a lot less backlog might screw up Nvidia faster than it "helps" AMD, too.
Does the memory controller also bottom out at about half the bandwidth in practice, like on the MI250? Or is the full memory bandwidth finally available?
Theoretical numbers are nice, but they are just that: numbers on a piece of paper. And on a CUDA device, I know from practice that I can get at least 90% of the memory bandwidth.
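That 90% figure is easy to check yourself. Here's roughly how I'd measure it with PyTorch (a sketch; sizes are placeholders, and the same code runs against a ROCm build, where "cuda" maps to the AMD GPU):

    import time
    import torch

    # Achieved device-memory bandwidth via a large on-GPU copy.
    # A copy reads N bytes and writes N bytes, so traffic = 2 * N per iteration.
    src = torch.empty(4 * 1024**3, dtype=torch.uint8, device="cuda")  # 4 GiB
    dst = torch.empty_like(src)

    for _ in range(3):                 # warm-up
        dst.copy_(src)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

    gbs = 2 * src.numel() * iters / elapsed / 1e9
    print(f"achieved ~{gbs:.0f} GB/s vs the spec-sheet number")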
Pricing at this point is almost irrelevant, given that it's available in Q1/Q2 2024, H100s are sold out until who knows when, and you're still going to be at the bottom of a long line unless you have a giant order to place. Nvidia's advantages, like IB, have 50+ week lead times as well, if you're lucky.
Ok, so even if the AMD chips are equal or faster, Nvidia still has a large advantage in networking with their proprietary NVLink switching, which was almost certainly born from the Mellanox InfiniBand IP acquisition. AMD is going to have to rely on more traditional PCIe-based networking for all off-system traffic, via InfiniBand or Converged Ethernet, versus Nvidia's more optimized GPU clustering that uses its own proprietary super-low-latency NVLink-based switching.
I'm super curious to see how their networking results compare between transports when real testing results between large supercomputer-ish farms get out there.
AMD uses Infinity Fabric, which is their NVLink equivalent. They announced today that they're opening up Infinity Fabric to select partners, including Broadcom, meaning partners can now develop their own switches.
Also, as AMD and HPE power some of the most performant supercomputers in the world, they wouldn't have won those contracts if their networking were subpar. Those use Slingshot.
NVLink is used for linking chips together; InfiniBand is used for linking systems together, i.e. into big HPC clusters. Infinity Fabric is used for inter-chip communication; it's not a replacement for InfiniBand and cannot be used to create clusters of machines.
I queued up for you the part where Forrest Norrod explains all this in today's presentation, including a discussion of ultrafast Ethernet.
https://youtu.be/tfSZqjxsr0M?t=5198
I think Pensando and Xilinx are the networking story, not PCIe. The clusters use Cray's Slingshot. Within a node, the x64 cores and GPU units are on a common fabric, which is also not PCIe.
But without CUDA it’s gotta be DOA no? CUDA is what makes Nvidia’s hardware valuable. AMD should finance a competitor. They’ve had strong hardware but it’s the software that’s hurting them from what I’ve seen.
Before that, they were pushing heavily for standard OpenCL, but that failed because the hardware wasn't as competitive and the ecosystem/tooling was barren.
ROCm is not just unpopular; it is so boilerplate-ridden that it is nearly unusable. It takes about 5x more code to do things there compared to CUDA.
Yes, “just use libraries”. But as Andrej Karpathy used to say, “I don't need some library holding my hand and providing abstractions. Real men command GPUs with their own raw kernel code”.
(And, incidentally, there are far fewer libraries for ROCm, for this reason. Somebody should write them, and why do that with something that takes 5x more code to do the same thing?)
Why? I understand the framework setup may not be as polished as CUDA, but I was under the impression HIP is the primary kernel language supported by ROCm and at first glance it's basically a 1:1 CUDA clone.
Hm, apparently, in the last few years HIP has become much more integrated into ROCm as the principal way of doing things. I was referring to the situation as of 2-3 years ago, when plain ROCm was a chore to write.
Depends on your definition of DOA. For the big customers who can't get as many H100s as they want, it'll be an option. They have the resources to make sure their workload does in fact work and then scale out hardware appropriately. And it might not even be a downgrade: the MI300X has more bandwidth and a larger memory capacity per card.
And there's the MI300A, a combined CPU+GPU with a shared memory space, which already has supercomputer customers. You can say it's DOA relative to Hopper, but the MI300 will be AMD's most successful GPGPU yet.
I don't think so. A few algorithms cover the majority of use. For example, in NLP, fast LLM training/inference (for something like LLaMA) plus support for Hugging Face transformers (BERT and such) would cover 95%+ of NLP use cases that need a GPU.
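And the code for that 95% doesn't mention the vendor at all. Something like this (a sketch; the model name is just an example, and on a ROCm build of PyTorch the "GPU" here is an AMD card):

    import torch
    from transformers import pipeline

    # The same high-level Hugging Face code covers most NLP workloads;
    # whether it runs on CUDA or ROCm depends only on the PyTorch build.
    device = 0 if torch.cuda.is_available() else -1   # first GPU, else CPU
    classifier = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
        device=device,
    )
    print(classifier("The MI300X announcement looks promising."))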
If they can't afford to out-R&D Nvidia today, how could they afford to fund a competitor? In the past 20 years the only serious new entrant to the discrete GPU market has been Intel, and they are also far behind Nvidia and CUDA.
It's somewhat sad that AMD bet on open source over proprietary and then approximately zero people stepped up to help. Easier to use CUDA and complain, I suppose. Fortunately they've stuck with the open source plan anyway and just written it themselves.
AMD's paid Linux driver team is moving relatively quickly using some documentation and tools (simulators) under NDA. It is difficult to contribute usefully to the core of that effort without getting hired by AMD. AFAIK, that is why not many hobbyists are contributing.
I would guess the documentation is something of a hindrance. I'm in compilers rather than drivers, where the public docs are essentially the PDFs at https://gpuopen.com/amd-isa-documentation/. They're a lot better than the Nvidia equivalent, but it would still be rough going implementing a compiler from those alone; too many instructions with names but unspecified semantics. The driver/firmware story is probably similar.
Oh, I thought I was missing something, maybe another document with more explanations. Thanks for clearing that up.
I was actually motivated to contribute to the AMD drivers at some point and did land a couple of patches, but wasn't looking to switch career paths. The drivers have become quite good without me anyway, so no regrets :)
Look, I love AMD, but there are different groups of open source people here. One group is trying to broaden the support for desktop Linux across all hardware, and the other group is trying to crunch numbers with GPUs and build support for that. There isn't a super strong overlap, and I think the former group is currently larger than the latter.
Counterpoint: at scale, a month of GPU training costs more than an engineer nowadays.
And the large companies (Meta, Microsoft) buy these GPUs by the thousands upon thousands.
Having a small team of engineers spend a couple months porting code over is well worth it for even modest cost reductions.
These may not be useful to smaller companies working in the AI space, but odds are that, to sell out, all AMD really needs are the sales contracts they've already announced.
That being said, if you're a cloud provider and need to scale up a bunch of basically similar transformer models, then it should be an easy sell for AMD.