When I was growing up, the media in my country kept telling us how great geek culture in the US was, how deep down the stack the geeks were willing to go, and both the adults and us kids were left in awe. The entire nation, from what I could tell, routinely reflected on why we couldn't be like the US: educating and nurturing generations of geeks to be the best engineers and scientists in the world.
Well, it was quite a reverse culture shock after I moved to the US. I definitely didn't know that "teacher's pet" was a thing, or that my coworker, a brilliant engineer who went to a highly reputed public school, was chased off his school bus simply because he used some poetic words, or that geeks were not all that respected in schools, or that "a mile wide and an inch deep" with great leadership was what Americans revered. In the meantime, I guess other countries more or less picked up the baton of US culture and grew their own geeks.
Did you not see American movies? I'm also from a different country, but this part of the American culture is very much front and center in its exports, IMO.
> Did you not see American movies? I'm also from a different country, but this part of the American culture is very much front and center in its exports, IMO.
There exist quite a lot of US-American movies. I would claim that the ones less deeply centered on US-specific cultural traits are typically much more revered in other countries.
There are, of course, exceptions to this rule (for example "The Simpsons", which relentlessly satirizes US popular culture but is nevertheless loved in many other countries), but I do think the general rule of thumb holds.
In this sense: even if you watch US-American movies, it is rather easy to mostly ignore the ones that strongly display US-specific cultural traits and habits, particularly if those are considered annoying in the other country.
I did watch American movies, but somehow I still came away with the same conclusions as a child. I think I was influenced by documentaries that seemed to show such intelligence and hard work in the sciences and technology, and I assumed the whole culture supported it; films were supposed to be fiction, after all.
As an adult across the pond, I'm very disappointed by the US.
Not really. Our English education was so bad that I could barely understand spoken English and I knew fewer than 5000 words when graduating high school. Ironically, I did fabulously well in English exams, which said a lot about our English education back then.
It comes in waves. In the 90s there was a monopoly so rapacious that the Department of Justice had to very nearly break it up before it strangled the Web in its crib. Grift and nepotism and cronyism and distorted markets were the norm then too.
This is the first time it’s been so bad since, but I’m optimistic: we’re actually in a better position now because serious foreign tech dynamos don’t give a fuck about American mafia nepotism, we can’t just keep this in the family.
I don’t know if this DeepSeek App Store thing will be the match that lights this thing up, but the grass is very tall and very, very dry.
Large portions of the world are anti-intellectual. At the same time, intellectuals are often much worse than the average person and frequently do deserve scorn as a class of people in society.
China is not anti-intellectual in the same way. I remember watching Xi's new year speech a few years ago, and he was highlighting specific scientific achievements and talking about astronauts. It was very different from a US presidential speech.
Well, Biden can't really give a speech, being undeniably old to the point of incoherence (which is what led to him dropping his candidacy). Trump did Operation Warp Speed, created Space Force, and openly talked about both; he also openly supports Elon Musk running his own scientific enterprises. So I'm not sure what the argument is for America not talking about its own scientific achievements.
That's just it: American tech companies aren't staffed by Americans anymore. Liang Wenfeng, the founder of DeepSeek, made the point that he values developing domestic Chinese talent over importing foreign experts. It seems to have worked.
"It seems to have worked" — I'm sure he's a great manager, but what's really working for him is the crazy high number of graduates and PhDs in China who are unemployed or underemployed.
That would sound a lot like the languishing native STEM talent in the US, if it weren't for the fact that most of the DeepSeek team don't have PhDs.
I should clarify: the US STEM workforce has historically been very white and very male. That's the body of talent that has been languishing, and the data proves it.
> That's a complete non sequitur. Why would it make sense for new hires to be overwhelmingly disproportionately oversampled from minority groups according to current demographics?
Because that is not what this stat is measuring. They are computing (demographics of new employees − demographics of retiring employees), not just the demographics of new employees, which is why it is a misleading statistic. They word it very carefully so as not to say that 94% of new hires are minorities.
Source: the "The Analysis" section of your own source, or if you need it stated more explicitly [0]:
> Before judging whether that’s impressive or excessive or some other adjective, it’s helpful to know what the available pool of new workers looked like. Or, more precisely, what the pool of new workers minus the pool of departing workers looked like. Net change is what we’re able to see.
I'm sure this will have precisely zero impact on your worldview, though.
Since your reply died: you're absolutely wrong about massive oversampling. You can get 95% with basic assumptions: roughly 60% white, 40% minority among new hires (matching actual proportions in the population, and actually an underestimate when you consider that new hires are young) and a retiring Fortune 500 population that is 70-90% white.
Haha, what a shoddy headline. "Bypasses" and "industry-standard" have no place here.
CUDA is not an industry standard. Vulkan is an industry standard. They did not bypass CUDA; that's like saying that if I use Vulkan I'm bypassing OpenGL. PTX is an alternative low-level API provided by Nvidia because of how awful CUDA is for high-performance code.
What DeepSeek wrote could only have been written in either PTX or Vulkan.
Any other company could have done this, and the low-latency traders on Wall Street who use Nvidia write their stuff in PTX for obvious reasons.
OpenAI was, is, and always will be absolutely incompetent when it comes to using their hardware effectively, and they're no different from any other company. Reading is not a goddamned superpower! Just read the docs!
You can ignore it; the commenter clearly has no idea what they are talking about. PTX is literally the instruction set that CUDA, Vulkan, and OpenGL compile to on Nvidia cards in the end. It's assembly for GPUs, and it's infinitely harder to work with. Go to an average technical university and you'll probably find quite a few people who can write CUDA (or OpenGL or Vulkan, for that matter), but it would be very surprising if you could find even a single person who can comfortably write PTX.
"Compile to" isn't exactly the correct phrase either.
PTX is not the IL used by Nvidia's drivers, but it does compile directly to it with less slop involved. If you had said "PTX's instructions are analogous to writing assembly for CPUs or any other GPUs (a la Clang's AMDGPU target)", that would have been the better way to put it.
Arguably, PTX is closer to being the SPIR-V part of their stack (more than just an assembler, but similar in concept). None of Nvidia's tools ever really line up with good analogies to the outside world; such is the curse of Nvidia's NIH syndrome.
Generally, you're not going to be writing all of your code in PTX, but I find it wild that you think people at "an average technical university" would be unable to use it for the parts they need it for. That says more about you than it does about them.
All of Nvidia's docs for this are online, it isn't that hard. Have you tried?
>PTX's instructions are analogous to writing assembly for CPUs
How else would you have understood it? At this level it's literally just pedantry. In the same way, you can say C doesn't technically compile to assembly for CPUs. The point is that it's the lower abstraction level that is still (more or less) human-readable. And just like in CUDA, you may want to write parts of your code in it if you want to benefit from things that the higher-level language doesn't expose. The terminology might seem different, but in practice it is pretty analogous.
This is somewhat untrue as well. HFT firms, being similarly constrained, have to optimize at this level, akin to HFT crypto shops doing optimizations not in Solidity or Yul but on opcodes in Huff. That's the issue with these big tech companies: endless budget, so they throw bad code at larger distributed clusters to overcompensate.
I wonder if you could point me to concrete examples where people write PTX rather than CUDA? I'm asking because I just learned CUDA, since it's so much faster than Python!
For various micro-benchmarking reasons I wanted to use a global clock instead of an SM-local one, and I believe PTX was needed for that.
Also note that even CUDA has "lower level"-like operations, e.g. warp primitives. PTX itself is super easy to embed in it like asm.
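To give a concrete flavor of what this looks like, here's a minimal sketch (mine, not the parent commenter's actual code) of reading the device-wide `%globaltimer` special register through inline PTX — one way to get a global clock instead of the SM-local `clock64()`:

```cuda
// Sketch: a device-wide nanosecond timestamp via inline PTX.
// %globaltimer is a PTX special register shared across the device,
// unlike clock64(), which is an SM-local cycle counter.
__device__ unsigned long long global_clock() {
    unsigned long long t;
    // "=l" binds t to a 64-bit register operand.
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

__global__ void timed_kernel(unsigned long long *elapsed_ns) {
    unsigned long long start = global_clock();
    // ... work being measured ...
    *elapsed_ns = global_clock() - start;
}
```

The `asm volatile(...)` syntax is exactly how CUDA C++ embeds PTX, which is why mixing a few PTX lines into an otherwise ordinary kernel is far easier than writing whole kernels in PTX.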
There aren't a lot of easily accessible examples outside of the corporate world.
Open source authors typically shy away from Nvidia's closed-source APIs, and PTX is tied to how Nvidia hardware works, so you won't see it implemented for other hardware.
If you wanted to do what DeepSeek did but didn't want to waste your time and money on Nvidia, you'd use Vulkan. There's more Vulkan in the world than CUDA.
Not in HFT, but I'd guess maybe for running optimization solvers and forecast models very fast, etc.? Essentially compute models ultimately driving market decisions based on lots of input data.
We do a lot of forecasting and solvers where I am, though we just run them on CPUs... but maybe if you're competing on speed you would?
> Optimization solvers usually don't benefit from GPUs. I think it's because it's sparse matrices and a sequential series of pivots.
This depends a lot on the problem and the algorithm used. For example, interior point methods are clearly better suited to running on GPUs than the primal or dual simplex algorithms.
What it does show is that CUDA leaves serious performance optimization on the table despite its gigantic code base. Using compression to reduce memory bandwidth is a well known trick in quantization, and in other scenarios since forever. There has been little competitive pressure on Nvidia to go further since their software stack leaves the competition in the dust. This time, they may actually need to step up their efforts, due to customer pressure. Good times!
There are already open stacks out there that help. The problem is that Nvidia provides a full-stack option — chips, networking, and software — whereas AMD only provides the chips; you then need another company's networking, and then you need to plug together open-source software.
I'd also bet that if you buy enough of the Nvidia chips, they'll probably send in a bunch of engineers to get everything working with their full stack. AMD won't be able to do that in the same way because they're not vertically integrated.
Writing a few intrinsics where necessary is not really comparable to the work required to reimplement something like CUDA on AMD (or, equivalently, to deal with ROCm).
This is ridiculous. Since the actual training code for DeepSeek is _not_ public, this is based only on the technical report, which mentions PTX one (1) time, in §3.2.2, "Efficient Implementation of Cross-Node All-to-All Communication":
> Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.
So they have some intrinsics in some part of their training framework. That's it.
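For context on what a "customized PTX instruction" touching the L2 cache might even look like, here's a hedged illustration — emphatically not DeepSeek's actual code, which is unpublished — using the PTX cache-eviction-hint qualifiers that plain CUDA C++ loads don't expose:

```cuda
// Illustrative only: a load with an L2 eviction hint, the kind of
// cache-control knob available in PTX but not in ordinary CUDA C++.
// Eviction-hint qualifiers require sm_80+ and a recent PTX ISA.
__device__ float load_evict_first(const float *p) {
    float v;
    // L2::evict_first marks the line as the first candidate for
    // eviction, so streamed communication data displaces less of
    // the L2 working set used by other concurrent kernels.
    asm volatile("ld.global.L2::evict_first.f32 %0, [%1];"
                 : "=f"(v) : "l"(p));
    return v;
}
```

Whether DeepSeek used this particular qualifier is pure speculation on my part; the report only says their customized instructions "significantly reduce the use of the L2 cache."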
It all feels like an attempt to prevent replication and to further tank the market. Not necessarily the technical details themselves, but the reporting thereof, which was spammed all over Reddit and HN.
This whole episode is weird. I can’t tell how much of the popular reporting is misinformed and how much has been disinformed. R1 (and sorta V3) are clearly progress, but are definitely not step-function improvements to prior SOTAs.
IIRC this is still relatively hardware-agnostic. Can you actually get very far by doing this? From a quick perusal, DeepSeek also uses Triton in the codebase.
tl;dr: they wrote low-level code instead of using a higher-level framework like their competitors have been doing, so they were able to hand-tune the performance.
This gives them a few months head start before meta and Google start doing the same thing.