If you want to write a very efficient CUDA kernel for a modern datacenter NVIDIA GPU (read: H100), you need to write it with the hardware in mind (and preferably in hand; the H100 and RTX 4090 behave very differently in practice). So I don't think the difference between AMD and NVIDIA is as big as everyone perceives.