
Oh, that's a really good point. The closest I've done is NVIDIA profiling -- the convs are definitely the most efficient from a multiplications-per-second perspective.

I'll certainly have to try that. The closest I have in the meantime to what you might be interested in is a more detailed profile I did of the kernel launches in PyTorch -- it handily comes with a _very_ lovely profiler that is such a departure from the rather overly intricate and fiddly ones of yore.
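For anyone who wants to reproduce that kind of kernel-launch breakdown, here's a minimal sketch using `torch.profiler` (the model and shapes below are just stand-ins, not the actual network):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Toy model standing in for the real network (hypothetical shapes).
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)
x = torch.randn(8, 3, 32, 32, requires_grad=True)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

# Profile one forward+backward step; each aten op/kernel gets its own row.
with profile(activities=activities) as prof:
    model(x).sum().backward()

table = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)
print(table)
```

Sorting by self time makes the slow ops (like the pooling backward) float to the top of the table.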

Anywho -- here's what I ran into and all of the requisite debugging with it: https://github.com/tysam-code/hlb-CIFAR10/issues/2#issuecomm...

Currently, MaxPooling is one of the slowest operations per-byte of memory handled, I think due to the very slow indexed backprop. Unfortunately, it is surprisingly/shockingly necessary for good performance.

Thank you for the suggestion. I had a friend that I learned this from, and totally forgot to do that basic-identity-overhead-testing trick. D'oh! XD If I get any good results from it, I'll be sure to do my best to tag/credit you in the GitHub release notes for the next release. :D <3



> Currently, MaxPooling is one of the slowest operations per-byte of memory handled, I think due to the very slow indexed backprop.

Could be a bandwidth bottleneck. As a rule of thumb, transferring a float from DRAM to the GPU costs about as much as 100 multiply-adds, and pooling does only about one op per transferred value. Indexing shouldn't be an issue with a half-decent CUDA implementation.
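The rule of thumb falls out of a back-of-envelope ratio; the figures below are approximate A100-class specs, assumed purely for illustration:

```python
# Back-of-envelope check of the "~100 multiply-adds per transferred float" rule.
# Numbers are approximate A100 specs (an assumption for illustration):
bandwidth_bytes_per_s = 2.0e12    # ~2 TB/s HBM2e bandwidth
flops_per_s = 156e12              # ~156 TFLOP/s TF32 tensor-core throughput

floats_per_s = bandwidth_bytes_per_s / 4       # float32 is 4 bytes
flops_per_float = flops_per_s / floats_per_s   # arithmetic per float moved
macs_per_float = flops_per_float / 2           # one multiply-add = 2 FLOPs

print(f"~{macs_per_float:.0f} multiply-adds per float transferred")
# Pooling does ~1 op per value it reads, so it sits far below this line:
# it's bandwidth-bound and the ALUs are mostly idle.
```

The exact number moves around with GPU generation and precision, but pooling is orders of magnitude below break-even either way.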

Could you do a stride 2 convolution instead of stride 1 + pooling?
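The two alternatives being compared look like this in PyTorch (toy channel counts, just to show the shapes come out the same):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 16, 16)

# Option A: stride-1 conv followed by 2x2 max pooling.
conv_pool = nn.Sequential(
    nn.Conv2d(32, 64, 3, stride=1, padding=1),
    nn.MaxPool2d(2),
)

# Option B: a single stride-2 conv does the downsampling itself.
conv_s2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)

# Both halve the spatial resolution, but B skips the pooling kernel
# (and its backward pass) entirely.
print(conv_pool(x).shape, conv_s2(x).shape)  # both (1, 64, 8, 8)
```

Option B also does 4x fewer multiply-adds in the conv itself, since it only evaluates the kernel at every other position.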


I wiiiiish we could do stride 2. I have tried many schemes to get it to work; it would be a dream. The best I could guess would be a fused conv+max operator. Accuracy drops a very unreasonable amount with anything else (including 5x5, stride 2). Something about the sparse activations, I think, is playing very nicely with the high learning rate.

If it's not indexing, I'm not quite sure what it is. It's the backward pass that is slow (all on the GPU, of course) -- like, super slow. If I were able to crack out Triton, I would just make a bitmask, nearest-neighbor-expand the incoming gradients, and multiply against the bitmask. Gather-like ops have been slow in PyTorch on the GPU from what I've seen, but your comment is making me scratch my head as to what the cause could be now.
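For the curious, here's a sketch of that bitmask trick in plain PyTorch rather than Triton, for a 2x2, stride-2 max pool (names and shapes are illustrative; ties in a window would route gradient to multiple inputs, which true max-pool backward does not):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 8, 8)

# Forward: pool, then mark each window's argmax with a bitmask by
# nearest-neighbor-expanding the pooled maxima back to input resolution.
pooled = F.max_pool2d(x, 2)
mask = (x == F.interpolate(pooled, scale_factor=2, mode="nearest")).float()

# Backward: nearest-neighbor-expand the incoming gradient and mask it,
# instead of scattering through saved argmax indices.
grad_out = torch.randn_like(pooled)
grad_in = F.interpolate(grad_out, scale_factor=2, mode="nearest") * mask
```

With random float inputs (no ties), this matches the gradient autograd computes for `F.max_pool2d`, while replacing the gather/scatter with dense, coalesced-friendly ops.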

Hm.... very interesting. Thanks for letting me know. Very curious. I'll have to take a look at that.


If only the backward pass is slow, that is strange. Maybe it is an issue with the CUDA implementation: a 2x2, stride-2 pooling should be simple, but if it goes through a parametrized general max-pooling kernel, the implementation would be more complex than a specialized one. How much slower is it than the forward pass?
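A quick way to answer that question is to time the two passes directly; a rough sketch (wall-clock timing here, with `torch.cuda.synchronize` guarding the GPU case -- `torch.cuda.Event` would be more precise):

```python
import time
import torch
import torch.nn.functional as F

def time_fn(fn, iters=20):
    # Crude average wall-clock timing; synchronize so GPU work is counted.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

x = torch.randn(32, 64, 64, 64, requires_grad=True)
out = F.max_pool2d(x, 2)
g = torch.randn_like(out)

fwd = time_fn(lambda: F.max_pool2d(x, 2))
bwd = time_fn(lambda: out.backward(g, retain_graph=True))
print(f"backward/forward ratio: {bwd / fwd:.1f}x")
```

If the ratio comes out far above ~2x, that would point at the backward kernel specifically rather than bandwidth, since both passes move a similar amount of data.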



