When trying to understand the performance of your model, it's very helpful to look at a roofline plot [1]. The roofline plot shows floating-point performance as a function of arithmetic intensity (FLOPs per byte of memory traffic) for the various ops in your model. The plot has two regimes: a memory-bound regime on the left and a compute-bound regime on the right. This can help identify memory-bound ops that take up a significant fraction of run time.
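To make the two regimes concrete, here's a rough sketch of the roofline bound itself: attainable performance is the smaller of peak compute and memory bandwidth times arithmetic intensity. The hardware numbers and the per-op intensities below are made-up placeholders, not measurements from any real GPU or model.

```python
def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity).

    intensity      -- FLOPs per byte moved to/from memory
    peak_flops     -- peak compute throughput in FLOP/s
    mem_bandwidth  -- peak memory bandwidth in bytes/s
    """
    return min(peak_flops, mem_bandwidth * intensity)


def ridge_point(peak_flops, mem_bandwidth):
    """Arithmetic intensity at which the two roofs meet."""
    return peak_flops / mem_bandwidth


def regime(intensity, peak_flops, mem_bandwidth):
    """Classify an op as memory-bound or compute-bound."""
    if intensity < ridge_point(peak_flops, mem_bandwidth):
        return "memory-bound"
    return "compute-bound"


# Hypothetical device: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
MEM_BANDWIDTH = 1e12

# Hypothetical ops and their arithmetic intensities (FLOPs/byte).
ops = {"elementwise_add": 0.25, "large_matmul": 500.0}

for name, ai in ops.items():
    bound = attainable_flops(ai, PEAK_FLOPS, MEM_BANDWIDTH)
    print(f"{name}: {regime(ai, PEAK_FLOPS, MEM_BANDWIDTH)}, "
          f"at most {bound / 1e12:.2f} TFLOP/s")
```

Ops that land left of the ridge point can't reach peak compute no matter how well the kernel is written, which is why they're the ones worth fusing or otherwise reducing memory traffic for.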
Agreed, roofline plots would be quite powerful in this context. From a quick search, it seems like the only way to create a roofline plot for your model is to use Nsight [1]? I'd be interested to know if there are any simpler tools, since one of the big benefits of SM efficiency is how easily the metric can be accessed.
[1]: https://en.wikipedia.org/wiki/Roofline_model