AMD MI300 performance – Faster than H100, but how much? (semianalysis.com)
116 points by klelatti on Dec 6, 2023 | 85 comments


One interesting consequence of the model consolidation happening in the LLM world is that it's actually easier for AMD's hardware to be competitive. It used to be that you had to support a ton of ops, since nobody was going to use your GPU for just a single model. Nowadays, most models are just transformers with only a few operations inside, albeit repeated many times over. On top of that, you can build a reasonable application or business using only one or two LLMs (e.g. LLaMA 2 + Whisper). So as long as they can get those running fast, they can sell GPUs to people who don't need the flexibility and breadth of CUDA.
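To make that concrete, here's a rough PyTorch sketch (purely illustrative, not any particular model) of how few primitive ops a decoder block actually needs; a whole model is mostly this, stacked many times:

    import torch
    import torch.nn.functional as F

    def toy_block(x, wq, wk, wv, wo, w_up, w_down):
        # One simplified decoder block: nothing here but matmuls, a softmax,
        # a layernorm, and a GELU. A real model repeats this N times.
        h = F.layer_norm(x, x.shape[-1:])
        q, k, v = h @ wq, h @ wk, h @ wv
        att = F.softmax((q @ k.transpose(-2, -1)) / k.shape[-1] ** 0.5, dim=-1)
        x = x + (att @ v) @ wo              # attention + residual
        x = x + F.gelu(x @ w_up) @ w_down   # MLP + residual
        return x

    # Shapes: x is (batch, seq, d_model); the w_* are plain weight matrices.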


The idea of model architecture making fast hardware design easier is what makes https://github.com/tinygrad/tinygrad so interesting.


yep! I think geohot could singlehandedly turn AMD into a winner if (big if) they get tinygrad competitive perf-wise on AMD hardware. So far it isn't.


The mighty dot product and matrix multiplication strike again!


I’m not sure this is true. Nvidia ships hardware today that is specialized for transformers, and presumably AMD needs something similar on their side. An arms race of specific hardware like that is probably not to their advantage. It makes the interoperability story even harder.

They still need a usable GPGPU language to glue all that hardware together, too.


There is nothing special about Nvidia hardware beyond having significantly better low-precision FLOPs before anyone else. Their marketing chose to call this a "transformer engine", but there is no actual "engine" in silicon.

Like AMD announced today, and as shown in the article we're commenting on, AMD is now 1.3 times faster at FP8 and INT8 FLOPs, with and without sparsity, meaning their "transformer engine" is 1.3 times faster.

The rest of your comment doesn't make sense. See the other comments on software in this thread.


One very exciting thing is the vram. 192GB!!!

That seems like a minor bump compared to the new H100, but sometimes the quality of your training, or the speed of your inference, is hugely affected by small thresholds of VRAM capacity per card. The MI300X can do things the H100 cannot, and judging by the sheer size of the silicon, do them reasonably quickly.
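For a back-of-envelope feel for those thresholds (weights only, ignoring KV cache and activations; model sizes below are just illustrative):

    def weights_gb(params_billion, bytes_per_param):
        # Rough memory needed to hold just the weights, in GiB.
        return params_billion * 1e9 * bytes_per_param / 2**30

    for params, dtype, bpp in [(70, "fp16", 2), (70, "int8", 1), (180, "fp16", 2)]:
        gb = weights_gb(params, bpp)
        print(f"{params}B @ {dtype}: {gb:5.0f} GB  "
              f"fits 80 GB (H100): {gb <= 80}   fits 192 GB (MI300X): {gb <= 192}")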


I think VRAM is one thing they can beat Nvidia on, both on the pro and consumer GPUs. Just put 48GB into a consumer GPU and everyone would flock to AMD and help build their software side.


They already have 48GB Pro GPUs.

If they were going to do it on the consumer side, they would have done it already. And I would own a 7900 48GB instead of a 3090, and probably have debugged ROCm on several projects by now :/


Exactly - users would flock to it. They have to compete with Nvidia somehow, and I don't see what other edge they could have - the massive disadvantage on the software side has to be compensated for in some way.


Well, local LLM running with ROCm isn't exactly huge business. You are not wrong, but I can envision why decision makers wouldn't want to spoil the pro line.


The real pros will use data centre GPUs like the MI300. I think local ML will become huge in games


Today's MI300x announcement. Worth a watch before you comment about how their software sucks.

https://www.youtube.com/watch?v=tfSZqjxsr0M


I'll add some highlights for those who don't have the time - Flash Attention 2 kernels, Paged Attention kernels, PyTorch, vLLM, transformers, ONNX Runtime support, OpenAI Triton support in v3, TensorFlow, JAX, OpenXLA.
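As a concrete illustration of what the vLLM support means in practice, a minimal sketch, assuming a ROCm build of vLLM and access to the (illustrative) model named below; the high-level API is the same one people use on Nvidia:

    from vllm import LLM, SamplingParams

    # Model name is illustrative; any HF model you have access to works the same way.
    llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
    params = SamplingParams(temperature=0.8, max_tokens=128)
    outputs = llm.generate(["Explain HBM in one sentence."], params)
    print(outputs[0].outputs[0].text)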

Also, while this wasn't in the presentation, CUDA can be translated to HIP using automated tools. For example, CuPBoP-AMD is a CUDA translator that converts CUDA programs at the NVVM IR level to HIP-compatible IR that can run on AMD GPUs (https://dl.acm.org/doi/pdf/10.1145/3624062.3624185).


They've gone so far toward drop-in code compatibility that an <import cuda> statement is still <import cuda> when running ROCm on AMD!
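A minimal sketch of what that looks like in PyTorch terms (assuming a ROCm build of PyTorch and a supported AMD GPU):

    import torch

    print(torch.version.hip)                  # set on ROCm builds, None on CUDA builds
    if torch.cuda.is_available():             # True with a supported AMD GPU
        x = torch.randn(4096, 4096, device="cuda")   # "cuda" maps to the AMD device
        y = x @ x                             # dispatched to ROCm libraries, not cuBLAS
        print(torch.cuda.get_device_name(0))  # prints the AMD part's name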

AMD made a presentation on their AI software strategy at Microsoft Ignite two weeks ago. Worth a watch for the slides and live demo

https://youtu.be/7jqZBTduhAQ?t=61


> CUDA can be translated to HIP using automated tools.

Not just automated, but also open source.

https://github.com/ROCm-Developer-Tools/HIPIFY


Open source, but only covering some AMD hardware for 4 years. Don't trust the support to last.


One of my coworkers uses a Vega 56 in their local workstation. That card is six years old. It's not officially supported, but it still works. I have no idea if the AI frameworks work on hardware that old, but the math libraries still do.


>CuPBoPAMD

They really couldn't come up with a better name than whatever that is?


Koduri left so they have a chance to make the software not suck anymore


All that buzzword bingo does not mean the software is actually robust and performs well. If you have hardware that has much better specs on paper, but only delivers maybe 20% better performance in their own benchmarks, I am still going to presume that their software is not up to par.


20% is a huge amount in terms of high end performance. For example, 20% more transistors / compute units / etc does not give 20% more performance at the top end unless your execution is excellent.


OP's point is that the hardware specs are 2-3x higher in many places, but all their benchmarks are 20-30% higher. The article mentions this as well. It means AMD couldn't even utilize their own hardware very well at this point.


Transformers are heavily memory bandwidth bound on modern hardware, and these chips only have 60% higher memory bandwidth.
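Back-of-envelope for why that matters at decode time: every generated token streams essentially all the weights through memory once, so bandwidth sets the ceiling. The numbers below are illustrative (batch 1, fp16 weights, KV cache ignored):

    def decode_tokens_per_sec(params_billion, bytes_per_param, bandwidth_tb_s):
        # Each token must stream roughly all the weights through memory once.
        bytes_per_token = params_billion * 1e9 * bytes_per_param
        return bandwidth_tb_s * 1e12 / bytes_per_token

    print(decode_tokens_per_sec(70, 2, 3.35))  # ~24 tok/s at ~3.35 TB/s (H100-class)
    print(decode_tokens_per_sec(70, 2, 5.3))   # ~38 tok/s at ~5.3 TB/s (MI300X-class)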


Their own slides show a 20-30% speedup on attention tasks.


20% better performance, lower cost and most important... availability.


Is this a drop-in replacement for clouds? (That's the actual question to ask, because if you are buying this you have to build new infrastructure.) The answer is no. Will Microsoft and Oracle let you rent it? Yes. Do Microsoft and Oracle need more accelerators for internal business processes? Absolutely (Microsoft announced it would rent AI accelerators because it can't keep up with its own internal, and maybe external, demand). So cost has to be measured not just per card but including the infrastructure build-out.

Not sure when Microsoft will have their own TPU-style accelerator, but I wouldn't be surprised if they announce it sometime.


They announced the Maia 100 AI accelerator last month. TSMC N5, 105b transistors.

https://news.microsoft.com/source/features/ai/in-house-chips...


The part is OCP compliant and can be dropped in. Microsoft has already announced their "TPU" at Ignite. The amount of misinformation in these threads is driving me insane.


Me too, but you're doing a great job not only responding to things, but also giving more context.


Can you cite your claims about Oracle wanting to deploy these? Oracle has a deep partnership with Nvidia to the point that in many cases, they have joint booths at conferences.

I was not aware of any efforts to purchase or deploy MI300s at OCI.


At today's AMD presentation, the SVP of Oracle Cloud Infrastructure revealed it. https://youtu.be/tfSZqjxsr0M?t=2006


There are a lot of SVPs, but this was still helpful, even though Karan is just a small fry!

Looking forward to trying out the MI300x.


Oracle were present at the event today and said it themselves. Also, it has been widely reported for months now that this is happening.


So how much of the market does AMD need to be profitable and keep going? Supposedly the Nvidia H100 alone has ~15 billion USD in order backlog.

If AMD sells say $3 billion USD into that market, is that a big net positive for them?


AMD's use of chiplets matters here. There is overhead, but they not only get higher yields, they also have many more chips that can hit their highest target clock speed, meaning they can ship far more high-end chips.

This is the same reason they were so successful with EPYC. It's easier to find eight small 8-core chiplets with high clocks and low power than to find one 56-core chip that hits the same high-clock/low-power bar, which is why Intel sold so many cut-down chips while AMD just sold its relatively few defective chiplets as 6- and 12-core consumer parts.

Nvidia will have a harder time shipping defective chips because everyone wants the best chips with the best performance/watt to save money and do things fast. Outside of that, there's not a huge market for defective 800mm2 chips and the massive GPUs they end up in. AMD can sell their defective units a couple of chiplets at a time to non-AI customers in laptops and workstation cards.
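A toy yield model (simple Poisson defect assumption; the defect density and die areas are made-up illustrative numbers, not AMD's) of why small dies waste less silicon:

    import math

    def clean_die_fraction(area_mm2, defects_per_mm2=0.001):
        # Probability a die of this area has zero random defects (Poisson model).
        return math.exp(-defects_per_mm2 * area_mm2)

    # A defect in a monolithic die scraps (or down-bins) the whole ~800 mm^2 part;
    # chiplets are tested before packaging, so a defect only costs one small die.
    print(f"~800 mm^2 monolithic: {clean_die_fraction(800):.0%} come out clean")
    print(f"~100 mm^2 chiplet:    {clean_die_fraction(100):.0%} come out clean")

With these assumed numbers that's roughly 45% vs 90%, and the good chiplets can additionally be binned for clocks before packaging.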


It also implies that it's harder to program efficiently and is slower on the interconnect.


It should be a transparent GPU like M1 Ultra.


It's transparent, but it has a performance cost. There will be low-level ways to improve locality for certain libraries, but it's nearly impossible for common users to claw that performance loss back.


I ran 120k of the rx470-rx580 series and another 30k of the PS5 APU. All of them were run at the most efficient and tuned settings possible. Every single chip was a snowflake.

We're not just talking defective, we are talking about a silicon lottery on performance and it varies between every single one.


Oh wow. Could you talk more about your use of all these GPUs? And how you handled all these disparities for real usage?


I have in previous comments. =)


Thank you for that detailed info - had not thought about that...


If the backlog gets too big, as a consumer you might get compelled to just abandon it and switch to the "available" alternative, even if it's slightly inferior.

Suddenly having potentially a lot less backlog might hurt Nvidia faster than it "helps" AMD, too.


Does the memory controller also bottom out at about half the bandwidth in practice, like on the MI250? Or is the full memory bandwidth finally available?

Theoretical numbers are nice, but they are just that: numbers on a piece of paper. And on a CUDA device, I know from practice that I can get at least 90% of the memory bandwidth.
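For reference, a rough way to measure what you actually get in PyTorch, on either a CUDA or a ROCm build (device-to-device copy, counting both the read and the write):

    import torch

    n = 2**30                                 # move 1 GiB per copy
    src = torch.empty(n, dtype=torch.uint8, device="cuda")
    dst = torch.empty_like(src)

    for _ in range(3):                        # warm-up
        dst.copy_(src)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    iters = 20
    start.record()
    for _ in range(iters):
        dst.copy_(src)
    end.record()
    torch.cuda.synchronize()

    seconds = start.elapsed_time(end) / 1e3   # elapsed_time() is in milliseconds
    gb_moved = 2 * n * iters / 1e9            # each copy reads and writes n bytes
    print(f"~{gb_moved / seconds:.0f} GB/s effective copy bandwidth")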


DOE HPC people are not having this problem with AMD GPUs: here’s a paper that reports 1.36 TB/s on one GCD of an MI250x (theoretical is 1.6TB/s).

https://dl.acm.org/doi/pdf/10.1145/3624062.3624203

Table 6


No mention of price anywhere that I've seen. I assume it will only be a slight $/flop discount vs. Nvidia.


Pricing at this point is almost irrelevant given that it is available in Q1/Q2 2024 while H100s are sold out until who knows when, and you're still going to be at the bottom of a long line unless you have a giant order to place. Nvidia's advantages, like InfiniBand, come with 50+ week lead times as well, if you're lucky.


H100 GPU is a sibling of Nvidia's consumer RTX 40-series: https://en.wikipedia.org/wiki/Hopper_(microarchitecture)

Is MI300 also a variant of their consumer GPUs?


This is a different architecture, CDNA3


Which is a derivative of the pre-RDNA consumer GPU architecture, Vega.


It costs about $20k as far as I understand. Why is it so expensive?


*if you ignore sparsity.


Ok, so even if the AMD chips are equal to or faster, Nvidia still has a large advantage with their networking, using their proprietary NVLink switching that was most certainly born from the Mellanox InfiniBand IP acquisition. AMD is going to have to rely on more traditional PCIe-based networking for all off-system traffic via InfiniBand or Converged Ethernet, vs the more optimized Nvidia GPU clustering that uses its own proprietary super-low-latency NVLink-based switching.

https://www.nvidia.com/en-us/data-center/nvlink/

I'm super curious to see how their networking results compare between transports when real testing results between large supercomputer-ish farms get out there.


AMD uses Infinity Fabric, which is their NVLink. They announced today how they're opening up Infinity Fabric to select partners, including Broadcom. Meaning partners can now develop their own switches.

Also, as AMD and HPE power some of the most performant supercomputers in the world, they wouldn't have won those contracts if their networking was subpar. Those use slingshot.

You may also be interested in reading about https://ultraethernet.org/ .


NVLink is used for linking chips together, InfiniBand is used for linking systems together, i.e. into big HPC clusters. Infinity Fabric is used for inter chip communication, it's not a replacement for Infiniband and cannot be used to create clusters of machines.


I queued up for you the part where Forrest Norrod explains all this in today's presentation, including a discussion of Ultra Ethernet. https://youtu.be/tfSZqjxsr0M?t=5198


You obviously didn't watch the announcement. They are doubling down on ethernet and open standards.


I think Pensando and Xilinx are the networking story, not PCIe. The clusters use Cray's Slingshot. Within a node, the x64 cores and GPU units are on a common fabric which is also not PCIe.


But without CUDA it’s gotta be DOA no? CUDA is what makes Nvidia’s hardware valuable. AMD should finance a competitor. They’ve had strong hardware but it’s the software that’s hurting them from what I’ve seen.


They have a competitor, it's just less popular than CUDA:

https://www.amd.com/en/products/software/rocm.html

Before that, they were pushing heavily for standard OpenCL, but that failed because the hardware wasn't as competitive and the ecosystem/tooling barren.


ROCm is not just unpopular, it is so boilerplate-ridden, it is nearly unusable. It takes about 5x more code to do things there compared to CUDA.

Yes, “just use libraries”. But as Andrej Karpathy used to say, “I don't need some library holding my hand and providing abstractions. Real men command GPUs with their own raw kernel code”.

(And, incidentally, there are far fewer libraries for ROCm, for this reason. Somebody should write them, and why do that with something that takes 5x more code to do the same thing?)


That was a year ago. AMD is changing their software ecosystem at a rapid pace with AI software as a #1 priority.

To get a picture of the current state which has changed a lot this MS Ignite presentation may be of interest. https://youtu.be/7jqZBTduhAQ?t=61


Thanks. Indeed, things became much better than I remembered them.


Why? I understand the framework setup may not be as polished as CUDA, but I was under the impression HIP is the primary kernel language supported by ROCm and at first glance it's basically a 1:1 CUDA clone.


Hm, apparently, in the last few years, HIP became much more integrated into ROCm as the principal way of doing things. I was referring to the situation as of 2-3 years ago, when plain ROCm was a chore to write.

OK, I retract my statement.


Where do you get the 5x code idea from? HIP and CUDA are almost regex-equivalent, and OpenMP is the same language on each.
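To illustrate the "regex-equivalent" point, a toy Python version of the kind of textual renames hipify-perl applies; the real tool covers far more of the API and the cases where a plain rename isn't enough:

    import re

    # A small sample of real CUDA-to-HIP renames; hipify handles far more.
    RENAMES = {
        "cuda_runtime.h": "hip/hip_runtime.h",
        "cudaMalloc": "hipMalloc",
        "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
        "cudaMemcpy": "hipMemcpy",
        "cudaFree": "hipFree",
        "cudaDeviceSynchronize": "hipDeviceSynchronize",
    }

    def toy_hipify(source: str) -> str:
        # Longest names first so cudaMemcpyHostToDevice isn't clobbered by cudaMemcpy.
        keys = sorted(RENAMES, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(k) for k in keys))
        return pattern.sub(lambda m: RENAMES[m.group(0)], source)

    print(toy_hipify("#include <cuda_runtime.h>\ncudaMalloc(&buf, n); cudaDeviceSynchronize();"))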


OP was talking about plain ROCm, not HIP. HIP is an “almost API-compatible CUDA”.


Depends on your definition of DOA. For the big customers who can't get as many H100 as they want it'll be an option. They have the resources to make sure their workload does in fact work and then scale out hardware appropriately. And it might not even be a downgrade: MI300X has more bandwidth and larger shared memory space.

And there's MI300A which has a combined CPU+GPU with shared memory space which already has supercomputer customers. You can say it's DOA relative to Hopper but MI300 will be AMD's most successful GPGPU yet.


PyTorch and Triton don't use CUDA. An extraordinary volume of big tech compute only cares about the PTX “isa” for NVIDIA.


I don't think so. There are a few algorithms that cover the majority of use. For example, in NLP, a fast LLM (such as LLaMA) for training/prediction plus support for Hugging Face transformers (BERT and such) would cover 95%+ of NLP use cases that need a GPU.


> AMD should finance a competitor.

If they can’t afford to out R&D Nvidia today, how could they afford to fund a competitor? In the past 20 years the only serious new entrant to the discrete GPU market has been Intel and they are also far behind Nvidia and CUDA.


AMD is now starting to have decent software support.


It's somewhat sad that AMD bet on open source over proprietary and then approximately zero people stepped up to help. Easier to use cuda and complain I suppose. Fortunately they've stuck with the open source plan anyway and just written it themselves.


Open source people would rather spend their time reverse engineering Nvidia, Mali, and Apple GPUs than helping AMD.


AMD's paid linux driver team is moving relatively quickly using some documentation and tools (simulators) under NDA. It is difficult to contribute usefully to the core of that effort without getting hired by AMD. AFAIK, that is why not many hobbyists are contributing.


I would guess the documentation is something of a hindrance. I'm in compilers rather than drivers, where the public docs are essentially the pdfs at https://gpuopen.com/amd-isa-documentation/. It's a lot better than the nvidia equivalent but it would also be rough going implementing a compiler based on those alone; too many instructions with names but unspecified semantics. The driver / firmware story is probably similar.


Oh, I thought I was missing something, maybe another document with more explanations. Thanks for clearing that up.

I was actually motivated to contribute to the AMD drivers at some point and did land a couple of patches, but wasn't looking to switch career paths. The drivers have become quite good without me anyway, so no regrets :)


Look, I love AMD, but there are different groups of open source people here. One group is trying to broaden the support for desktop Linux across all hardware, and the other group is trying to crunch numbers with GPUs and build support for that. There isn't a super strong overlap, and I think the former group is currently larger than the latter.


The highest performing open source graphics driver available anywhere is RADV.


ROCm will only get adoption when CUDA code can be ported over with one click.

At the very least, PyTorch has to work effortlessly across GPUs, down to the less popular and odd workarounds people build into their PyTorch code.

Even if ROCm reaches parity, the first-mover advantage for CUDA is too large. ROCm porting has to be literally effortless. Everything else is DOA.


Counterpoint: at scale, a month of GPU training costs more than an engineer these days.
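Rough illustrative arithmetic behind that (every figure here is an assumption, not a quote):

    gpus = 1_000                   # assumed cluster size for a large training run
    gpu_cost_per_hour = 2.50       # assumed all-in (cloud or amortized) rate per GPU
    hours_per_month = 730
    engineer_per_month = 40_000    # assumed fully loaded monthly cost of an engineer

    cluster = gpus * gpu_cost_per_hour * hours_per_month
    print(f"cluster:            ${cluster:,.0f} / month")   # ~$1.8M
    print(f"5-person port team: ${5 * engineer_per_month:,.0f} / month")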

And the large companies (Meta, Microsoft) buy these GPUs by the thousands upon thousands.

Having a small team of engineers spend a couple months porting code over is well worth it for even modest cost reductions.

These may not be useful to smaller companies working in the AI space, but odds are that to sell out, all AMD really needs are the sales contracts they've already announced.


The facts don't bear this out. Various companies have already ported to AMD even though it isn't effortless.


Edit-

That being said, if you're a cloud provider and need to scale up a bunch of basically similar transformer models, then it should be an easy sell for AMD.


People here don't want to hear the truth, but this is the truth.



