
Sure, but I'm not sure if that is what the parent poster was saying (that nvcc generates poor quality PTX for newer devices).

It's been a while since I looked at CUDA, but it used to be that NVIDIA was continually extending cuDNN to add support for kernels needed by SOTA models, and I assume those kernels were all hand-optimized.

I'm curious what kinds of models people are writing where not only is there no optimized cuDNN support, but solutions like Triton or torch.compile, and even hand-optimized CUDA C kernels, are too slow. Are hand-written PTX kernels really that common?



Yes. Take a look at, say, CUTLASS: you'll see that they use PTX instructions because there are no intrinsics, much less automatic compiler lowering, for the accelerators they target.
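
To make that concrete: when the compiler offers no intrinsic for an instruction, CUDA C++ code can emit the PTX directly via inline assembly. This is a hypothetical minimal sketch using a trivial add.s32 instruction just to show the mechanism; CUTLASS does the same thing for instructions like mma.sync and cp.async that have no (or late-arriving) intrinsic coverage:

```cuda
// Sketch only: inline PTX from CUDA C++ (compiled with nvcc).
// add_via_ptx is an illustrative name, not a CUTLASS function.
__device__ int add_via_ptx(int a, int b) {
    int c;
    // "=r" binds c as a 32-bit register output; "r" binds a and b as inputs.
    asm volatile("add.s32 %0, %1, %2;" : "=r"(c) : "r"(a), "r"(b));
    return c;
}
```

The same asm-constraint syntax is how hand-written kernels reach newer accelerator instructions before the compiler exposes them any other way.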


Yes, but that's an NVIDIA project, so it would be expected to be hand-optimized, the same as their cuDNN kernels.

I'm more curious about what types of models people in research or industry are developing where NVIDIA-provided support such as this is not enough and they end up writing their own PTX kernels.



