One interesting consequence of the model consolidation happening in the LLM world is that it's actually easier for AMD's hardware to be competitive. It used to be that you had to support a ton of ops, since nobody was going to use your GPU for just a single model. Nowadays, most models are just transformers with only a few operations inside, albeit repeated many times over. On top of that, you can build a reasonable application or business using only one or two models (e.g. LLaMA 2 + Whisper). So as long as they can get those running fast, they can sell GPUs to people who don't need the flexibility and breadth of CUDA.
I'm not sure this is true. Nvidia ships hardware today that is specialized for transformers, and presumably AMD needs something similar on their side. An arms race over specialized hardware like that is probably not to their advantage, and it makes the interoperability story even harder.
They still need a usable GPGPU language to glue all that hardware together, too.
There is nothing special about Nvidia hardware beyond them shipping significantly better low-precision FLOPS before anyone else. Their marketing chose to call this a "transformer engine", but there is no actual "engine" in silicon.
As AMD announced today, and as shown in the article we're commenting on, AMD is now 1.3 times faster at FP8 and INT8 FLOPS, with and without sparsity, meaning their "transformer engine" is 1.3 times faster.
The rest of your comment doesn't make sense. See the other comments on software in this thread.
That seems like a minor bump compared to the new H100, but sometimes the quality of your training, or the speed of your inference, is hugely affected by small thresholds of VRAM capacity per card. The MI300X can do things the H100 cannot, and judging by the sheer size of the silicon, do them reasonably quickly.
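For a rough sense of why those capacity thresholds matter, here's a back-of-the-envelope sketch. The parameter counts and per-card capacities are just illustrative assumptions, and it ignores KV cache, activations and framework overhead, which add a lot in practice:

    import math

    # Weights-only footprint: how many cards just to hold the model?
    def cards_needed(params_billion, bytes_per_param, card_gb):
        footprint_gb = params_billion * bytes_per_param   # 1e9 params * bytes ~ GB
        return footprint_gb, math.ceil(footprint_gb / card_gb)

    for params in (70, 180):          # e.g. a 70B-class and a 180B-class model
        for card in (80, 192):        # H100-class vs MI300X-class capacity
            gb, n = cards_needed(params, 2, card)   # 2 bytes/param = FP16
            print(f"{params}B @ FP16 ~ {gb} GB -> {n} x {card} GB card(s)")

The interesting cases are the ones where a model that needed several cards (and the interconnect traffic that comes with them) suddenly fits on one.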
I think VRAM is one thing they can beat Nvidia on, both on the pro and consumer GPUs. Just put 48 GB into a consumer GPU and everyone would flock to AMD and help build out their software side.
If they were going to do it on the consumer side, they would have done it already. And I would own a 7900 48GB instead of a 3090, and probably have debugged ROCm on several projects by now :/
Exactly - users would flock to it. They have to compete with Nvidia somehow, and I don't see what other edge they could have: the massive disadvantage on the software side has to be compensated for in some way.
Well, local LLM inference with ROCm isn't exactly a huge business. You're not wrong, but I can see why decision makers wouldn't want to undercut the pro line.
I'll add some highlights for those who don't have the time: FlashAttention-2 kernels, PagedAttention kernels, PyTorch, vLLM, Hugging Face transformers, ONNX Runtime support, OpenAI Triton support in v3, TensorFlow, JAX, OpenXLA.
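On the PyTorch point specifically: the ROCm build exposes AMD GPUs through the regular torch.cuda API, so most existing code runs unchanged. A minimal sketch, assuming a supported GPU and a ROCm build of PyTorch:

    import torch

    # On a ROCm build, the HIP backend is surfaced through torch.cuda,
    # so "cuda" device strings resolve to the AMD GPU.
    print(torch.cuda.is_available())   # True on a supported AMD GPU
    print(torch.version.hip)           # HIP/ROCm version string (None on CUDA builds)

    x = torch.randn(4096, 4096, device="cuda")
    y = x @ x                          # matmul dispatched to rocBLAS under the hood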
Also, while this wasn't in the presentation, CUDA can be translated to HIP using automated tools. For example:
CuPBoPAMD is a translator that converts CUDA programs at the NVVM IR level into HIP-compatible IR that can run on AMD GPUs (https://dl.acm.org/doi/pdf/10.1145/3624062.3624185).
One of my coworkers uses a Vega 56 in their local workstation. That card is six years old. It's not officially supported, but it still works. I have no idea if the AI frameworks work on hardware that old, but the math libraries still do.
All that buzzword bingo does not mean the software is actually robust and performs well. If you have hardware that has much better specs on paper, but only delivers maybe 20% better performance in their own benchmarks, I am still going to presume that their software is not up to par.
20% is a huge amount in terms of high end performance. For example, 20% more transistors / compute units / etc does not give 20% more performance at the top end unless your execution is excellent.
OP's point is that the hardware specs are 2-3x higher in many places, but all their benchmarks are only 20-30% higher. The article mentions this as well. It means AMD can't even utilize their own hardware very well at this point.
Is this a drop-in replacement for clouds (which is the actual question to ask, because if you're buying this you have to build new infrastructure)? The answer is no. Will Microsoft and Oracle let you rent it? Yes. Do Microsoft and Oracle need more accelerators for internal business processes? Absolutely (Microsoft announced it would rent AI accelerators because it can't keep up with its own internal, and maybe external, demand). So cost has to be measured not just per card but including the infrastructure build-out.
Not sure when Microsoft will just have their own TPU-style accelerator, but I wouldn't be surprised if they announce it sometime.
The part is OCP-compliant and can be dropped in. Microsoft already announced their "TPU" at Ignite. The amount of misinformation in these threads is driving me insane.
Can you cite your claims about Oracle wanting to deploy these? Oracle has a deep partnership with Nvidia to the point that in many cases, they have joint booths at conferences.
I was not aware of any efforts to purchase or deploy MI300s at OCI.
AMD's use of chiplets matters here. There is overhead, but they not only get higher yields, they also have many more dies that can hit their highest target clock speed, meaning they can ship far more high-end parts.
This is the same reason they were so successful with EPYC. It's easier to find eight small 8-core chiplets that hit high clocks at low power than to find one 56-core monolithic die that does the same, which is why Intel sold so many cut-down chips while AMD just sold its small number of defective chiplets as 6- and 12-core consumer parts.
Nvidia will have a harder time shipping defective dies, because everyone wants the best chips with the best performance/watt to save money and do things fast. Outside of that, there's not a huge market for defective 800 mm2 chips and the massive GPUs they come on. AMD can sell their defective units a couple of chiplets at a time to non-AI customers in laptops and workstation cards.
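For intuition, here's a toy yield calculation using the standard first-order Poisson model; the defect density and die areas are made up for illustration:

    from math import exp

    # First-order yield model: Y = exp(-A * D), A = die area, D = defect density.
    def die_yield(area_mm2, defects_per_cm2=0.2):   # 0.2/cm^2 is illustrative
        return exp(-(area_mm2 / 100.0) * defects_per_cm2)

    print(f"~800 mm^2 monolithic die: {die_yield(800):.0%} yield")   # ~20%
    print(f"~200 mm^2 chiplet:        {die_yield(200):.0%} yield")   # ~67%
    # With chiplets you bin known-good dies before packaging, so a defect
    # costs you one small chiplet rather than a whole huge die, and the
    # partially-defective chiplets can still be sold in cheaper products.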
It's transparent, but it has a performance cost. There will be low-level ways to improve locality for certain libraries, but it's nearly impossible for common users to claw that performance loss back.
I ran 120k of the RX 470/RX 580 series and another 30k of the PS5 APU. All of them were run at the most efficient and tuned settings possible. Every single chip was a snowflake.
We're not just talking about defective chips; we're talking about a silicon lottery on performance, and it varies between every single one.
If the backlog gets too big, as a consumer you might feel compelled to just abandon it and switch to the "available" alternative, even if it's slightly inferior.
Suddenly having a lot less backlog might screw up Nvidia faster than it "helps" AMD, too.
Does the memory controller also bottom out at about half the bandwidth in practice, like on the MI250? Or is the full memory bandwidth finally available?
Theoretical numbers are nice, but they are just that: numbers on a piece of paper. And on a CUDA device, I know from practice that I can get at least 90% of the memory bandwidth.
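That 90% figure is easy to check yourself. Here's roughly how I'd measure it with PyTorch (a sketch; sizes are placeholders, and the same code runs against a ROCm build, where "cuda" maps to the AMD GPU):

    import time
    import torch

    # Achieved device-memory bandwidth via a large on-GPU copy.
    # A copy reads N bytes and writes N bytes, so traffic = 2 * N per iteration.
    src = torch.empty(4 * 1024**3, dtype=torch.uint8, device="cuda")  # 4 GiB
    dst = torch.empty_like(src)

    for _ in range(3):                 # warm-up
        dst.copy_(src)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

    gbs = 2 * src.numel() * iters / elapsed / 1e9
    print(f"achieved ~{gbs:.0f} GB/s vs the spec-sheet number")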
Pricing at this point is almost irrelevant, given that it's available in Q1/Q2 2024, H100s are sold out until who knows when, and you're still going to be at the bottom of a long line unless you have a giant order to place. Nvidia's advantages, like IB, have 50+ week lead times as well, if you're lucky.
Ok, so even if the AMD chips are equal or faster, Nvidia still has a large advantage in networking with their proprietary NVLink switching, which was almost certainly born from the Mellanox InfiniBand IP acquisition. AMD is going to have to rely on more traditional PCIe-based networking for all off-system traffic, via InfiniBand or Converged Ethernet, versus Nvidia's more optimized GPU clustering that uses its own proprietary super-low-latency NVLink-based switching.
I'm super curious to see how their networking results compare between transports when real testing results between large supercomputer-ish farms get out there.
AMD uses Infinity Fabric, which is their NVLink equivalent. They announced today that they're opening up Infinity Fabric to select partners, including Broadcom, meaning partners can now develop their own switches.
Also, as AMD and HPE power some of the most performant supercomputers in the world, they wouldn't have won those contracts if their networking were subpar. Those use Slingshot.
NVLink is used for linking chips together; InfiniBand is used for linking systems together, i.e. into big HPC clusters. Infinity Fabric is used for inter-chip communication; it's not a replacement for InfiniBand and cannot be used to create clusters of machines.
I queued up for you the part where Forrest Norrod explains all this in today's presentation, including a discussion of ultrafast Ethernet.
https://youtu.be/tfSZqjxsr0M?t=5198
I think Pensando and Xilinx are the networking story, not PCIe. The clusters use Cray's Slingshot. Within a node, the x64 cores and GPU units are on a common fabric, which is also not PCIe.
But without CUDA it’s gotta be DOA no? CUDA is what makes Nvidia’s hardware valuable. AMD should finance a competitor. They’ve had strong hardware but it’s the software that’s hurting them from what I’ve seen.
Before that, they were pushing heavily for standard OpenCL, but that failed because the hardware wasn't as competitive and the ecosystem/tooling was barren.
ROCm is not just unpopular; it is so boilerplate-ridden that it is nearly unusable. It takes about 5x more code to do things there compared to CUDA.
Yes, “just use libraries”. But as Andrej Karpathy used to say, “I don't need some library holding my hand and providing abstractions. Real men command GPUs with their own raw kernel code”.
(And, incidentally, there are far fewer libraries for ROCm, for this reason. Somebody should write them, and why do that with something that takes 5x more code to do the same thing?)
Why? I understand the framework setup may not be as polished as CUDA, but I was under the impression HIP is the primary kernel language supported by ROCm and at first glance it's basically a 1:1 CUDA clone.
Hm, apparently, in the last few years HIP has become much more integrated into ROCm as the principal way of doing things. I was referring to the situation as of 2-3 years ago, when plain ROCm was a chore to write.
Depends on your definition of DOA. For the big customers who can't get as many H100s as they want, it'll be an option. They have the resources to make sure their workload does in fact work and then scale out hardware appropriately. And it might not even be a downgrade: the MI300X has more bandwidth and a larger memory capacity per card.
And there's the MI300A, a combined CPU+GPU with a shared memory space, which already has supercomputer customers. You can say it's DOA relative to Hopper, but the MI300 will be AMD's most successful GPGPU yet.
I don't think so. A few algorithms cover the majority of use. For example, in NLP, fast LLM training/inference (for something like LLaMA) plus support for Hugging Face transformers (BERT and such) would cover 95%+ of NLP use cases that need a GPU.
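And the code for that 95% doesn't mention the vendor at all. Something like this (a sketch; the model name is just an example, and on a ROCm build of PyTorch the "GPU" here is an AMD card):

    import torch
    from transformers import pipeline

    # The same high-level Hugging Face code covers most NLP workloads;
    # whether it runs on CUDA or ROCm depends only on the PyTorch build.
    device = 0 if torch.cuda.is_available() else -1   # first GPU, else CPU
    classifier = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
        device=device,
    )
    print(classifier("The MI300X announcement looks promising."))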
If they can't afford to out-R&D Nvidia today, how could they afford to fund a competitor? In the past 20 years the only serious new entrant to the discrete GPU market has been Intel, and they are also far behind Nvidia and CUDA.
It's somewhat sad that AMD bet on open source over proprietary and then approximately zero people stepped up to help. Easier to use CUDA and complain, I suppose. Fortunately they've stuck with the open source plan anyway and just written it themselves.
AMD's paid Linux driver team is moving relatively quickly using some documentation and tools (simulators) under NDA. It is difficult to contribute usefully to the core of that effort without getting hired by AMD. AFAIK, that is why not many hobbyists are contributing.
I would guess the documentation is something of a hindrance. I'm in compilers rather than drivers, where the public docs are essentially the PDFs at https://gpuopen.com/amd-isa-documentation/. They're a lot better than the Nvidia equivalent, but it would still be rough going implementing a compiler from those alone; too many instructions with names but unspecified semantics. The driver/firmware story is probably similar.
Oh, I thought I was missing something, maybe another document with more explanations. Thanks for clearing that up.
I was actually motivated to contribute to the AMD drivers at some point and did land a couple of patches, but wasn't looking to switch career paths. The drivers have become quite good without me anyway, so no regrets :)
Look, I love AMD, but there are different groups of open source people here. One group is trying to broaden the support for desktop Linux across all hardware, and the other group is trying to crunch numbers with GPUs and build support for that. There isn't a super strong overlap, and I think the former group is currently larger than the latter.
Counterpoint: at scale, a month of GPU training costs more than an engineer nowadays.
And the large companies (Meta, Microsoft) buy these GPUs by the thousands upon thousands.
Having a small team of engineers spend a couple months porting code over is well worth it for even modest cost reductions.
These may not be useful to smaller companies working in the AI space, but odds are that, to sell out, all AMD really needs are the sales contracts they've already announced.
That being said, if you're a cloud provider and need to scale up a bunch of basically similar transformer models, then it should be an easy sell for AMD.