
Oh, that's a really good point. The closest I've done is NVIDIA profiling -- the convs are definitely the most efficient from a multiplications-per-second perspective.

I'll certainly have to try that. The closest I have in the meantime to what you might be interested in is a more detailed profile I did of the kernel launches in PyTorch -- it handily comes with a _very_ lovely profiler that is such a departure from the rather overly intricate and fiddly ones of yore.
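For anyone who wants to reproduce that kind of kernel-launch breakdown, here's a minimal sketch using `torch.profiler` (the model and shapes below are just stand-ins, not the actual network):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Toy model standing in for the real network (hypothetical shapes).
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)
x = torch.randn(8, 3, 32, 32, requires_grad=True)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

# Profile one forward+backward step; each aten op/kernel gets its own row.
with profile(activities=activities) as prof:
    model(x).sum().backward()

table = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)
print(table)
```

Sorting by self time makes the slow ops (like the pooling backward) float to the top of the table.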

Anywho -- here's what I ran into and all of the requisite debugging with it: https://github.com/tysam-code/hlb-CIFAR10/issues/2#issuecomm...

Currently, MaxPooling is one of the slowest operations per-byte of memory handled, I think due to the very slow indexed backprop. Unfortunately, it is surprisingly/shockingly necessary for good performance.

Thank you for the suggestion. I had a friend that I learned this from, and totally forgot to do that basic-identity-overhead-testing trick. D'oh! XD If I get any good results from it, I'll be sure to do my best to tag/credit you in the GitHub release notes for the next release. :D <3



> Currently, MaxPooling is one of the slowest operations per-byte of memory handled, I think due to the very slow indexed backprop.

Could be a bandwidth bottleneck. As a rule of thumb, transferring a float from DRAM to the GPU costs about as much as 100 multiply-adds, and pooling does only about one op per transferred value. Indexing shouldn't be an issue with a half-decent CUDA implementation.
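The rule of thumb falls out of a back-of-envelope ratio; the figures below are approximate A100-class specs, assumed purely for illustration:

```python
# Back-of-envelope check of the "~100 multiply-adds per transferred float" rule.
# Numbers are approximate A100 specs (an assumption for illustration):
bandwidth_bytes_per_s = 2.0e12    # ~2 TB/s HBM2e bandwidth
flops_per_s = 156e12              # ~156 TFLOP/s TF32 tensor-core throughput

floats_per_s = bandwidth_bytes_per_s / 4       # float32 is 4 bytes
flops_per_float = flops_per_s / floats_per_s   # arithmetic per float moved
macs_per_float = flops_per_float / 2           # one multiply-add = 2 FLOPs

print(f"~{macs_per_float:.0f} multiply-adds per float transferred")
# Pooling does ~1 op per value it reads, so it sits far below this line:
# it's bandwidth-bound and the ALUs are mostly idle.
```

The exact number moves around with GPU generation and precision, but pooling is orders of magnitude below break-even either way.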

Could you do a stride 2 convolution instead of stride 1 + pooling?
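The two alternatives being compared look like this in PyTorch (toy channel counts, just to show the shapes come out the same):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 16, 16)

# Option A: stride-1 conv followed by 2x2 max pooling.
conv_pool = nn.Sequential(
    nn.Conv2d(32, 64, 3, stride=1, padding=1),
    nn.MaxPool2d(2),
)

# Option B: a single stride-2 conv does the downsampling itself.
conv_s2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)

# Both halve the spatial resolution, but B skips the pooling kernel
# (and its backward pass) entirely.
print(conv_pool(x).shape, conv_s2(x).shape)  # both (1, 64, 8, 8)
```

Option B also does 4x fewer multiply-adds in the conv itself, since it only evaluates the kernel at every other position.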


I wiiiiish we could do stride 2. I have tried many schemes to get it to work; it would be a dream. The best I could guess would be a fused conv+max operator. Accuracy drops a very unreasonable amount with anything else (including 5x5, stride 2). Something about the sparse activations, I think, is playing very nicely with the high learning rate.

If it's not indexing, I'm not quite sure what it is. It's the backward pass that is slow (all on the GPU, of course) -- like, super slow. If I were able to crack out Triton, I would just make a bitmask, nearest-neighbor-expand the incoming gradients, and multiply against the bitmask. Gather-like ops have been slow in PyTorch on the GPU from what I've seen, but your comment is making me scratch my head as to what the cause could be now.
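For the curious, here's a sketch of that bitmask trick in plain PyTorch rather than Triton, for a 2x2, stride-2 max pool (names and shapes are illustrative; ties in a window would route gradient to multiple inputs, which true max-pool backward does not):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 8, 8)

# Forward: pool, then mark each window's argmax with a bitmask by
# nearest-neighbor-expanding the pooled maxima back to input resolution.
pooled = F.max_pool2d(x, 2)
mask = (x == F.interpolate(pooled, scale_factor=2, mode="nearest")).float()

# Backward: nearest-neighbor-expand the incoming gradient and mask it,
# instead of scattering through saved argmax indices.
grad_out = torch.randn_like(pooled)
grad_in = F.interpolate(grad_out, scale_factor=2, mode="nearest") * mask
```

With random float inputs (no ties), this matches the gradient autograd computes for `F.max_pool2d`, while replacing the gather/scatter with dense, coalesced-friendly ops.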

Hm.... very interesting. Thanks for letting me know. Very curious. I'll have to take a look at that.


If only the backward pass is slow, that is strange. Maybe it is an issue with the CUDA implementation: a 2x2, stride-2 pooling should be simple, but if it goes through a parametrized general max-pooling kernel, the implementation would be more complex than a specialized one. How much slower is it than the forward pass?
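A quick way to answer that question is to time the two passes directly; a rough sketch (wall-clock timing here, with `torch.cuda.synchronize` guarding the GPU case -- `torch.cuda.Event` would be more precise):

```python
import time
import torch
import torch.nn.functional as F

def time_fn(fn, iters=20):
    # Crude average wall-clock timing; synchronize so GPU work is counted.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

x = torch.randn(32, 64, 64, 64, requires_grad=True)
out = F.max_pool2d(x, 2)
g = torch.randn_like(out)

fwd = time_fn(lambda: F.max_pool2d(x, 2))
bwd = time_fn(lambda: out.backward(g, retain_graph=True))
print(f"backward/forward ratio: {bwd / fwd:.1f}x")
```

If the ratio comes out far above ~2x, that would point at the backward kernel specifically rather than bandwidth, since both passes move a similar amount of data.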



