roanakb · on Aug 23, 2024

Unfortunately, SM efficiency is not accessible via nvidia-smi. The best methods to track it would be to:

1. Profile your model with PyTorch Profiler
2. Export metrics with NVIDIA DCGM

roanakb · on Aug 23, 2024

oh this looks great, thank you for bringing this up! I'll have to give it a try, but seems like the FSDP limitation on torch.compile might carry over?

roanakb · on Aug 23, 2024

Yup, you'll see 100% GPU utilization over a time period as long as a kernel is considered active during it, and that includes a kernel with just a single thread executing [1]. SM occupancy is great but can be a little difficult to interpret, since unlike SM efficiency, you're not simply trying to maximize it.

[1]: https://pytorch.org/blog/pytorch-profiler-1.9-released/#gpu-...
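To make the gap between the two metrics concrete, here is a minimal sketch of a toy model, not how nvidia-smi or the profiler actually compute them. It assumes kernels never overlap in time and that we somehow know how many SMs each kernel kept busy; the `Kernel` record and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    start_ms: float
    end_ms: float
    active_sms: int  # hypothetical input: SMs this kernel kept busy


def gpu_utilization(kernels, window_ms):
    """Fraction of the window in which any kernel was running.

    Assumes kernels do not overlap in time.
    """
    busy_ms = sum(k.end_ms - k.start_ms for k in kernels)
    return busy_ms / window_ms


def sm_efficiency(kernels, window_ms, total_sms):
    """Time-weighted fraction of SMs that were busy over the window.

    Same non-overlap assumption as above.
    """
    sm_ms = sum(
        (k.end_ms - k.start_ms) * min(k.active_sms, total_sms)
        for k in kernels
    )
    return sm_ms / (window_ms * total_sms)


# One kernel that occupies a single SM for the whole 100 ms window
# on a GPU with 108 SMs (the SM count here is just an example value).
kernels = [Kernel(start_ms=0.0, end_ms=100.0, active_sms=1)]
util = gpu_utilization(kernels, window_ms=100.0)
eff = sm_efficiency(kernels, window_ms=100.0, total_sms=108)
print(f"GPU util: {util:.0%}, SM efficiency: {eff:.2%}")
```

In this toy case utilization is 100% while SM efficiency is under 1%, which is the single-thread situation described above.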

roanakb · on Aug 22, 2024

Nice, seems like ML Productivity Goodput is a pretty well thought-out metric to understand the overall efficiency of your cluster. I'll consider adding this into our cluster management platform. The only potential drawbacks I'd guess are that it's somewhat difficult to compute, since it relies on metrics like MFU, and that it's not something we can observe layer-by-layer to find inefficient kernels, but I'll take a deeper look. Thanks!
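For a sense of what computing MFU involves, here is a rough sketch for a dense transformer. It assumes the common approximation of about 6 FLOPs per parameter per training token (forward plus backward), which is not something the Goodput write-up is confirmed to use; the function name and all input numbers are hypothetical placeholders.

```python
def model_flops_utilization(tokens_per_second, n_params, n_gpus,
                            peak_flops_per_gpu):
    """Estimate MFU as achieved model FLOP/s over aggregate peak FLOP/s.

    Assumes ~6 FLOPs per parameter per token for training a dense
    transformer; attention FLOPs and recomputation are ignored.
    """
    achieved_flops_per_second = 6 * n_params * tokens_per_second
    peak_flops_per_second = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


# Placeholder numbers: a 7B-parameter model at 20k tokens/s on 8 GPUs,
# each with an assumed peak of 312 TFLOP/s. Swap in your own values.
mfu = model_flops_utilization(
    tokens_per_second=20_000,
    n_params=7e9,
    n_gpus=8,
    peak_flops_per_gpu=312e12,
)
print(f"MFU: {mfu:.1%}")
```

The hard part in practice is measuring throughput and picking the right peak number, and the result is a single figure for the whole job with no per-layer breakdown.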

roanakb · on Aug 22, 2024

Agreed, roofline plots would be quite powerful in this context. From a quick search, seems like the only way to create a roofline plot for your model would be to use Nsight [1]? Would be interested to know if there are any simpler tools, since one of the big benefits of SM efficiency is how easily the metric is accessed.

[1]: https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s...

jfkfif · on Aug 23, 2024

Depending on the size of your application, you can calculate FLOPs by hand.

https://docs.nersc.gov/tools/performance/roofline/
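As a rough illustration of the by-hand approach, here is a sketch that counts FLOPs and bytes for a single matrix multiply and places it on a roofline. It assumes each operand is read or written exactly once (ideal caching), and the peak compute and bandwidth figures are made-up placeholders, not any particular GPU's specs.

```python
def matmul_flops(m, n, k):
    """FLOPs for C[m,n] = A[m,k] @ B[k,n]: one multiply and one add per term."""
    return 2 * m * n * k


def matmul_bytes(m, n, k, bytes_per_elem=2):
    """Bytes moved if A and B are read once and C is written once."""
    return bytes_per_elem * (m * k + k * n + m * n)


def attainable_flops(arithmetic_intensity, peak_flops, peak_bandwidth):
    """Roofline bound: limited by compute or by memory traffic."""
    return min(peak_flops, arithmetic_intensity * peak_bandwidth)


# Placeholder hardware numbers (assumptions, not a real datasheet).
PEAK_FLOPS = 100e12      # FLOP/s
PEAK_BANDWIDTH = 1e12    # bytes/s

for name, (m, n, k) in {"square matmul": (1024, 1024, 1024),
                        "matrix-vector": (1, 1024, 1024)}.items():
    ai = matmul_flops(m, n, k) / matmul_bytes(m, n, k)
    bound = attainable_flops(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
    kind = "compute-bound" if bound == PEAK_FLOPS else "memory-bound"
    print(f"{name}: AI = {ai:.2f} FLOP/byte, {kind}")
```

Comparing the measured FLOP/s of a kernel against this bound shows how far it sits below the roof, without needing a profiler for the FLOP count itself.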

roanakb · on Aug 22, 2024

Yup, similar to SM efficiency in that sense too. If you aren't seeing >80%, there is certainly time left on the table. But a high SM efficiency value doesn't guarantee you're making good use of the hardware either. (Still a better proxy than GPU util, though.)

roanakb · on Aug 14, 2024

this is a good one for debugging rdma: https://docs.redhat.com/en/documentation/red_hat_enterprise_...

roanakb · on July 25, 2024

Looks great, you guys made it really easy to integrate!

brianjkim21 · on July 25, 2024

thanks roanak!

roanakb · on Aug 1, 2023

Thanks! Let me know if there are any features you'd like to see added.

roanakb · on July 27, 2023

Looks really cool! Nice work.