Haha, what a shoddy headline. "Bypasses" and "industry-standard" have no place here.
CUDA is not an industry standard. Vulkan is an industry standard. They did not bypass CUDA... that's like saying if I use Vulkan I'm bypassing OpenGL. PTX is an alternative low-level API provided by Nvidia because of how awful CUDA is for high-performance code.
What DeepSeek wrote could only have been written in either PTX or Vulkan.
Any other company could have done this, and low-latency traders on Wall Street who use Nvidia write their stuff in PTX for obvious reasons.
OpenAI was, is, and always will be absolutely incompetent when it comes to using their hardware effectively... and they're no different than any other company. Reading is not a goddamned super power! Just read the docs!
You can ignore it, the commenter clearly has no idea what they are talking about. PTX is literally the instruction set that CUDA, Vulkan and OpenGL compile to on Nvidia cards in the end. It's assembly for GPUs. And it's infinitely harder to work with. Go to an average technical university and you'll probably find quite a few people who can write CUDA (or OpenGL or Vulkan for that matter). But it would be very surprising if you can find even a single person that can comfortably write PTX.
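To make the "assembly for GPUs" comparison concrete: here's a trivial CUDA kernel and, abridged in a comment, roughly the shape of the PTX that nvcc emits for it. The register names and instruction ordering below are illustrative, not exact compiler output:

```cuda
// Trivial CUDA kernel: one thread per element, c[i] = a[i] + b[i].
__global__ void add(const float *a, const float *b, float *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

/* Roughly what `nvcc -ptx` produces for the kernel body (abridged,
   registers renamed for readability):

   mov.u32    %r1, %ctaid.x;           // blockIdx.x
   mov.u32    %r2, %ntid.x;            // blockDim.x
   mov.u32    %r3, %tid.x;             // threadIdx.x
   mad.lo.s32 %r4, %r1, %r2, %r3;      // i = blockIdx.x * blockDim.x + threadIdx.x
   mul.wide.s32 %rd1, %r4, 4;          // byte offset = i * sizeof(float)
   // ...pointer arithmetic, then:
   ld.global.f32 %f1, [%rd2];          // a[i]
   ld.global.f32 %f2, [%rd3];          // b[i]
   add.f32    %f3, %f1, %f2;
   st.global.f32 [%rd4], %f3;          // c[i]
*/
```

One C-level statement fans out into explicit register moves, loads, and stores, which is exactly why few people write whole programs at this level.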
"Compile to" isn't exactly the correct phrase either.
PTX is not the IL used by Nvidia's drivers, but it does compile directly to it with less slop involved. If you had said "PTX's instructions are analogous to writing assembly for CPUs or any other GPUs (à la Clang's AMDGPU target)", that would probably have been the better way to put it.
Arguably, PTX is closer to being the SPIR-V part of their stack (more than just an assembler's input, but similar in concept). None of Nvidia's tools ever really line up with clean analogies to the outside world; that's the curse of Nvidia's NIH syndrome.
Generally, you're not going to be writing all of your code in PTX, but I find it wild you think people going to "an average technical university" would be unable to use it for the parts they need it for. That says more about you than it does them.
All of Nvidia's docs for this are online, it isn't that hard. Have you tried?
>PTX's instructions are analogous to writing assembly for CPUs
How else would you have understood it? At this level it's literally just pedantry. In the same way you can say C doesn't technically compile to assembly for CPUs. The point is that it's the lower abstraction level that is still (more or less) human readable. But just like in CUDA, you may want to write parts of your code in it if you want to benefit from things that the higher-level language doesn't expose. The terminology might seem different, but in practice it is pretty analogous.
This is somewhat untrue as well. HFT firms, being similarly constrained, have to optimize at this level, much like HFT in crypto, where optimizations are done not in Solidity, nor in Yul, but at the opcode level in Huff. That's the issue with these big tech companies: endless budget, so they throw bad code at ever-larger distributed clusters to overcompensate.
I wonder if you could point me to concrete examples where people write PTX rather than CUDA? I'm asking because I just learned CUDA, since it's so much faster than Python!
For various micro-benchmarking reasons I wanted to use a global clock instead of an SM-local one, and I believe PTX was needed for that.
Also note that even CUDA has "lower level"-like operations, e.g. warp primitives. PTX itself is super easy to embed in CUDA as inline asm.
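For example, reading the GPU-wide nanosecond timer (the %globaltimer PTX special register, which plain CUDA C++ doesn't expose directly, unlike the SM-local clock64()) is a one-line inline asm statement. A minimal sketch, assuming a kernel where you want to time a region:

```cuda
#include <cstdio>

// Read the global nanosecond timer via inline PTX.
// %globaltimer is a PTX special register shared across the device,
// whereas clock()/clock64() read the per-SM cycle counter.
__device__ unsigned long long global_timer_ns() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

__global__ void time_something() {
    unsigned long long start = global_timer_ns();
    // ... the work being measured goes here ...
    unsigned long long end = global_timer_ns();
    if (threadIdx.x == 0 && blockIdx.x == 0)
        printf("elapsed: %llu ns\n", end - start);
}
```

Note the timer's actual resolution is platform-dependent, so this is best for coarse micro-benchmarks rather than cycle-accurate measurement.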
There aren't a lot of easily accessible examples outside of the corporate world.
Open source authors typically shy away from Nvidia's closed-source APIs, and PTX is tied to how Nvidia hardware works, so you won't see it implemented for other hardware.
If you wanted to do what DeepSeek did but didn't want to waste your time and money on Nvidia, you'd use Vulkan. There's more Vulkan in the world than CUDA.
Not in HFT, but I guess maybe for running optimization solvers and forecast models very fast, etc.? Essentially compute models ultimately driving market decisions based on lots of input data.
We do a lot of forecasting and solvers where I am; we just run them on CPUs though. But maybe if you're competing on speed you would?
> Optimization solvers usually don't benefit from GPUs. I think it's because it's sparse matrices and a sequential series of pivots.
This depends a lot on the problem and the algorithm that is used. For example, interior point methods are clearly better suited to running on GPUs than the primal or dual simplex algorithm.