First, congratulations! It's a really cool project, speed of training is such an impediment to research, it's wonderful to see such amazing speeds achieved on these big networks.
I have two questions:
Page's write-up makes very clear how much time is saved by each of his optimizations. I've read your readme and releases, but it's not entirely clear what the main contributors to your 3x improvement in training time were. Would you be able to broadly quantify how much each change factored into your improvements?
Second, how much do you believe these optimizations are transferable to different convolutional architectures? When I read about work on this subject I'm always worried the training time is actually (hundreds of hours of manual inductive bias farming + 10s of training), and that those hundreds of hours of work would have to be redone for a new architecture. Is that the case, or have you been able to draw generally applicable conclusions that improve training speed for any convolutional network?
Are there any competitions/leaderboards for training X dataset to Y accuracy on Z hardware? Kind of like code golf for ML, but one that probably delivers quite practical benefits at the same time. It's a cool idea.
Great point! If I recall correctly, this team (well, nearly all the top teams from DawnBench) took Page's code and wrestled it into the multi-GPU realm. I'm a sucker for simplicity, as much as is reasonable (this codebase currently does not use JIT or any custom kernels! (!!!)), and also making sure that the average practitioner (like me) could do something workable without having to pay tons of money. My computing costs are $50 a month currently, i.e. the cost of Pro Colab. And we were able to break the single-GPU WR, and we're really close to pushing past any of the official multi-GPU submissions (old as they may be!).
I took David's work in a different direction and just kept it true I think to the original spirit of things. Cycle times for experimentation are king in ML when it comes to the speed of research progress, regardless of what anyone else might tell you. Having tons of hardware may be really flashy and useful for the end product, but it's certainly not needed for much of the lo-fi, day-to-day stuff.
That said, the A100 is definitely a step up. It is under 2x, though, as we are basically only memory-and-slow-backprop-kernel limited now, not as much by the convolutions (which now are among the shorter operations). Running https://github.com/99991/cifar10-fast-simple on my end gave me 17.2 seconds, vs the 24 seconds that Dave reported on the V100 (though the lovely author of that repo, @99991, was able to get faster speeds on their personal A100 setup). So we're definitely in that weird regime where moving everything to massively scaled matrix multiplies when possible is preferred, and sometimes that's...tricky for a few of these operations.
> may not be a fair comparison to newer hardware like the A100
In fairness, most of those entries use 4-8 V100s vs. OP's single GPU. While the A100 is more powerful, I think the "on a single GPU" framing is valuable in its own right.
I commented in the parent comment addressing this too, sorry that I topic-leaked! I'm cruising in my personal ML sabbatical on savings, so I'm sorta money-incentivized to be as thrifty as possible. Hence as noted before, right now I'm just at $50 a month!
I'm hoping this research is valuable to people in other areas, too. The concepts about order-of-operations, information flow, scaling, information-efficiency-at-high-throughputs, etc., I think are applicable anywhere, given the right contexts. Though I have a huge sneaking suspicion that many of these laws (like the traditional scaling laws) only start becoming important and relevant as the ideally efficient architecture families are slowly approached through iterative optimization.
The top cost on DAWNBench a few years ago was $0.02, but that was on a single V100; the best time was 45s on 8*V100. No idea how much the 10s (top time) run cost, but it was also 8*V100.
I think it's maybe something like 13.8 'credits' an hour on Colab, and you get 500 credits for $50 straight up, or $50 a month (I'm truly a sucker for simple flat pricing schemes with a natural cap on them, it's good for the overzealous network trainer's/developer's wallet! :D). So that's like, I dunno, $1.38 per hour for an A100 basically guaranteed (not bad at all! And the H100 is coming soon, I'd assume! :D)
If training takes ~9.91-9.96 seconds, and we ignore everything else in the process (assuming we have some kind of strange Elvish magical computers that don't require any spinup of any sorts)... then that's (9.91 to 9.96)/60/60 * 1.38 = 0.0037988 - 0.0038180 dollars per run, or .37988 - .38180 cents per run. The full setup including install from clone, data download, and network init, I'd estimate being lower bounded at maybe 1.2-1.3 cents per run or so with a good internet connection (but I'm not entirely sure about that! D:). Upper bound for a reasonably fast machine I think would be no more than 2 cents, clean start to finish for a single training run. Multiples for best-of (maybe not the safest idea), or better yet -- simple ensembling of the EMA-ed models could be upper bounded at likely no more than ~4 cents or so for 5 models, if I'm doing my math correctly.
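For concreteness, here's that back-of-the-envelope math as a tiny Python sketch (the Colab credit numbers are my rough recollection, not official pricing):

```python
# Rough cost-per-run estimate, assuming ~13.8 Colab credits/hour
# and 500 credits for $50 (both numbers from memory, so approximate).
credits_per_hour = 13.8
dollars_per_credit = 50 / 500                             # $0.10 per credit
dollars_per_hour = credits_per_hour * dollars_per_credit  # => $1.38/hour

def cost_per_run_dollars(seconds):
    # ignore spinup/teardown and count only the training seconds
    return seconds / 3600 * dollars_per_hour

low, high = cost_per_run_dollars(9.91), cost_per_run_dollars(9.96)
print(low, high)  # ~0.0038 dollars, i.e. ~0.38 cents per run
```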
That said, to be clear, the per-run figure is, I think, the .37988-.38180 cents number in this case.
What's weird is that that does seem a bit steep considering it's 8 V100s for 45 seconds, and those were... pretty pricey at the time, I think? So maybe something is horribly wrong with my math! D:
Hope that helps, great question and many thanks for asking, happy to answer any follow-up questions you might have. This is a very interesting line of inquiry, and I haven't spent enough time developing it yet! :D
My best guess is that these are baseline runs from the Stanford DAWNBench team themselves in 2017, when (I'm assuming) the competition was first launched.
Oddly enough, this field was not as crowded as I remembered. Nowadays, I'm alone re-running a competition that many have left behind, much as many have left the '90s and its fashion behind. It's a bit of a lonely competition field, but these days basically any major change I make is a new world record.
I wanted to thank you for pointing that out, that's really interesting. I think I haven't given enough credit to the lovely entry by Chen Wang -- beating out the previous competitor, who used a P100, with their own small GTX 1080 (?!?!?!) in 35 minutes. Now, the absolutely bonkers thing here is that they seemed to achieve an accuracy of 95.29%, which is very much nonlinear in difficulty. This, I've personally found, is truly a case of dabbing on the haters, as the children (or whatever they are) are saying these days.
Where I've been for work before, I have a bit of a reputation for making things fast...unreasonably so. Reasonably unreasonably so? I don't know, I have no idea!
First, I think we implemented it basically from scratch so we could do some serious hacking and speed up the development cycles, then it's been just seriously off to the races to keep improving things with the basics -- and the basics done as well as can be! We do have some very nefarious things up our sleeves for the goal of getting below the (hopefully eventual) ludicrous bar of a 2s training time.
Very observant of you to think about and check on that, I never would have guessed that someone would have looked for that! The other posts really did not get a lot of traction, but I think now that we're really getting into the thick of it, it's starting to get a lot more spicy for different people who are interested in this sort of thing.
Alright, I'm too wired from all the nonstop commenting, I need to go to bed. 'Night! :D :) <3 <3 <3 <3 :DDDD
Thank you! I don't know for sure if we will get there, but it has been going _muuuch_ more quickly than expected already. I do suspect that it might slow down as life gets going and the required changes become more and more involved. Remember, this is a freaking super tightly integrated system, so complex components may take a long time to get working.
But, we can hope! Now that we're below the lovely advertising-friendly bar of 10 seconds, I can use it as a living resume as I look to pick up some research/architecture-related stuff part-time.
Cheers and thanks again for your comment, much love and have a good/great week! :D
Can you add some details about what the most complex components in your system are? Also, what parts did you implement from scratch? I see you still use torch.
Very nice teaching comments, I am learning new stuff that also applies to transformers.
I have been wondering for a while now: how do you benchmark GPU code accurately? Huggingface/transformers used to have a benchmarking library that uses python, pytorch and cuda libraries to benchmark memory usage, inference and epoch times but it has rightly been deprecated for being inaccurate. They recommend using external 3rd party tooling to benchmark GPU processes. Thing is: I can't seem to find good 3rd party benchmarking tools.
I pumped my fist and involuntarily said "YEEEHEEESSSSSS" quite loudly when I read your first compliment, you are the first comment on that and I really, really, really, appreciate it. Woop woop!
I am so freaking exhausted from all the commenting here and on Reddit (self-inflicted!), so I'll just give you a gem that has saved my bacon multiple times. You are correct that profiling in this world for GPU stuff is basically absolute crap. There is one exception that I have found (finally) at long last:
    import torch

    with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=10, warmup=15, active=5),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
        record_shapes=True
    ) as profiler:
        for step, batch in enumerate(train_loader):  # your existing DataLoader
            # ... forward pass, loss, backward pass, optimizer step ...
            profiler.step()
            if step >= 10 + 15 + 5:  # wait + warmup + active steps
                break
    # oh, and don't forget to set your epoch count to 0 for this so you don't loop over and over again
Then load up TensorBoard and point it at that trace directory. And now, to close it out, Steve Jobs style: It's that simple.
Hope that helps. Much love and 'Night! ::) :D :D :D :D :))))
In my experience working on applied DL research problems, the training bottleneck is almost always data loading. This is because datasets for real-world problems usually don't fit in GPU memory (not even close) and often require expensive pre-processing. The latter can obviously be driven down using similar methods to those here, but the former seems insurmountable. It's not clear to me what the path is to getting these workloads down into this rapid experimental iteration regime.
Expensive pre-processing is hardly a problem: Generate representations and features once and store to drive. Use a feature/vector database if you feel fancy.
For deep learning, out-of-core instance loading by means of batching is the standard approach. Sure you sometimes have to tune batch size and learning rate for optimal learning, but it's a solved issue.
I disagree that it's solved in practice for all users. I agree it's solved for many well-established workloads.
Researchers are the people most in need of rapid experimental iteration vs. one-off training. But during research, the data requirements may not be pinned down well enough to pre-process everything ahead of time, or repeatedly doing so is its own bottleneck. Out-of-core loading may still be a bottleneck if IO is slow, samples are big, or necessary pre-processing can't be pushed to the GPU.
I'm meaning to reply to the parent comment after I get some sleep, but I do believe that is what they are implying! The hard part, I believe, is _how_ we effectively do the memory management! :D
This doesn't work due to the requirement of online data augmentation like random crops and image distortions (for image models; loading is a non-issue for text). And a feature database is not going to help you if you just need to iterate over all the features.
Hey hey heya! Sorry for the long wait! I've wanted to reply to every comment thoughtfully, and your comment on the single file bit was the first feedback I've heard about it. I've sort of fretted about that decision (though it's paid off well for me), and I wanted to say I can't express how much anxiety that one small compliment has relieved for me. Not that everyone will like it, but that it's indeed a good idea.
As far as speed goes, I've filled out a few good comments about a "what's next" here with Bram in my comment to them. There's also a corresponding Reddit thread that I've been frantically corresponding on, but most of that is more historical-and-hyperparameter focused! (https://www.reddit.com/r/MachineLearning/comments/10op6va/r_...)
The upper limit with (I think very-large-scale?) pre-training on different data, then transferring to CIFAR10, I believe is 99.5%, but that is extremely unreasonable for commodity algorithms and hardware, unfortunately, I believe D:
Hope this helps, thank you for your comment, I really appreciate it! :D
To add to the other response: Human accuracy is a fuzzy thing. The accuracy you'll get from a labeling service spending a couple seconds per label is different from the accuracy you'll get from a typical person if you ask them to painstakingly label just a single image.
Oh! Ack! Thanks! Very good question. I'll try to remember to put a 'learning comment' for that on the next go-around. Those are the means and standard deviations of each channel in the training set -- normalizing with them up front means the images going into the neural network have roughly zero mean and unit variance.
This is helpful as we don't need a BatchNorm in front of our first layer that way (and I'm sure for other reasons too).
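As a sketch, channel-wise normalization looks something like this (the mean/std constants below are the commonly quoted CIFAR-10 training-set values, not necessarily this repo's exact numbers):

```python
import torch

# Commonly quoted CIFAR-10 per-channel training-set statistics (approximate)
mean = torch.tensor([0.4914, 0.4822, 0.4465]).view(1, 3, 1, 1)
std  = torch.tensor([0.2470, 0.2435, 0.2616]).view(1, 3, 1, 1)

def normalize(images):
    # images: (N, 3, H, W) float tensor scaled to [0, 1]
    return (images - mean) / std
```

After this, a batch of training images has roughly zero mean and unit variance per channel, which is what lets the first layer skip a leading BatchNorm.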
Many thanks for asking, that's a very great question! I was super scared for a minute that I somehow had a naked hyperparameter untagged, but these certainly are very ~mysterious~ numbers indeedy!~ :D <3 <3 <3 <3 :D :D :D :)))
Very great suggestion! Thanks for posting this! :D :)))) I'd like to, but Colab seems to be super duper finicky about the CUDA installations included. The version of Triton required for the default (and really, the fastest, I think) compilation backend of PyTorch 2.0 requires a CUDA version a step or so higher than what's on the Colab machines. I'm sure there's a fantastic way to sidestep that, I just haven't found it yet.
Opening that up, should compile times not take too long, should _~hopefully~_ allow for a lot more cool things that are slow under eager/interpreted mode -- like, perhaps, better parallel branches of execution, or some ops which are technically much faster due to being in-place but were on the whole much slower due to requiring lots of teeny tiny individual kernel launches. I'd estimate the network will improve by maybe 1-2+ seconds with it (and I really hope the compile overhead isn't that long! D: D: D: D: That would/could be another nightmare in and of itself!)
Greetings, Bram! I hope you are well (we don't know each other, I just felt compelled to say that XD :D)
I browsed your Twitter, this might be the shortlist that you find the most interesting:
> fp16 -> fp8
>>>>> loss(dtype=fp16).sum() -> loss(dtype=fp8).mean() # the summing is for regularization's sake
> nn.MaxPool2d's backward is hellaciously slow due to what appears to be indexing. right now we're under 10 seconds without _any_ custom kernels or JITting (!!!, as far as I know!!!). this may need to change, though, because I think max_pooling is now taking longer than some of the convolutions themselves. it's also annoyingly necessary for performance, very much so
> BatchNorm2d(bias=True, weight=False) + Conv2d(bias=False) is paradoxically faster than BatchNorm2d(affine=False) + Conv2d(bias=True). This shouldn't be the case, as having a fused kernel like that should be faster.
> Information flow engineering. Knowing which layers are more information-stressed during training could help with micro-adjustments of layer depths to an uncomfortable level, but right now we're just sort of guessing in the dark. Being able to visualize and track this would be nice.
> Waiting for Pytorch 2.0! That will open up a lot of neat tricks that would otherwise slow down training with tons of small kernel launches. However, that likely will mean we end up losing our JIT card. I'm hoping that's not the Faustian bargain we have to make.
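For the curious, the "bias-only BatchNorm + bias-free Conv" item from the list above can be sketched roughly like this (PyTorch's `BatchNorm2d` only exposes `affine=True/False`, so freezing the weight is one way to approximate it -- this is my illustration, not the repo's actual code):

```python
import torch
import torch.nn as nn

def bias_only_batchnorm(channels):
    # learnable bias, frozen scale: keep affine=True but stop the
    # weight (gamma) from receiving gradients, leaving it at 1.0
    bn = nn.BatchNorm2d(channels, affine=True)
    bn.weight.requires_grad_(False)
    return bn

block = nn.Sequential(
    bias_only_batchnorm(32),
    nn.Conv2d(32, 64, kernel_size=3, padding=1, bias=False),
)
```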
And a few others, however, those are under wraps as I think we'd like to publish those if/when they work out and once they're completed. We always have other improvements in the pipeline even as we're finalizing and publishing a release. :D
Hope that helps, the library you're working on looks super cool! Thanks for sharing the pictures/notes of that on your Twitter, I truly enjoyed reading through that.
Have you tried training with 1 channel in each layer, just to estimate the overhead? When I tried that with David Page's network I only got about a 40% speedup compared to the full network.
Oh, that's a really good point. I think the closest I've done is NVIDIA profiling -- the convs are definitely the most efficient from a multiplication/second perspective.
I think I'll certainly have to try that. The closest I have in the meantime to what you might be interested in is a more detailed profiling I did of the kernel launches in Pytorch -- it handily comes with a _very_ lovely profiler that is such a departure from the rather overly intricate and fiddly ones of yore.
Currently, MaxPooling is one of the slowest operations per-byte of memory handled, I think due to the very slow indexed backprop. Unfortunately, it is surprisingly/shockingly necessary for good performance.
Thank you for the suggestion, I had a friend that I learned this from, and totally forgot to do that basic-identity-overhead-testing trick. D'oh'! XD If I get any good results from it, I'll be sure to do my best to tag/credit you in the GitHub release notes for the next release. :D :D :))))) <3 <3 <3 <3 :D :D :))))
> Currently, MaxPooling is one of the slowest operations per-byte of memory handled, I think due to the very slow indexed backprop.
Could be a bandwidth bottleneck. As a rule of thumb, transferring a float from DRAM to the GPU is equivalent to doing about a 100 multiply-adds, and pooling only does about one op per transferred value. Indexing should not be an issue with a half decent CUDA implementation.
Could you do a stride 2 convolution instead of stride 1 + pooling?
I wiiiiish we could do stride 2. I have tried many schemes to get it to work; it would be a dream. The best guess I have would be a fused conv+max operator. Accuracy drops a very unreasonable amount with anything else (including 5x5, stride 2). Something about the sparse activations, I think, is playing very nicely with the high learning rate.
If it's not indexing, I'm not sure quite what it is. It's the backward pass that is slow (all on GPU, of course), like super slow. If I were able to crack out Triton, I would just make a bitmask, nearest-neighbor expand the incoming gradients, and multiply against the bitmask. On the GPU, gather-like ops have been slow in Pytorch from what I've seen, but your comment is making me scratch my head as to what it could be now.
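For what it's worth, here's a rough sketch of that bitmask idea in plain PyTorch (my illustration only, with the caveat that ties would route gradient to every tied element, unlike a true argmax backward):

```python
import torch
import torch.nn.functional as F

def maxpool2x2_backward(x, grad_out):
    # recompute the 2x2/stride-2 pool, nearest-neighbor-expand the pooled
    # maxes and the incoming gradient, then mask to the winning positions
    pooled = F.max_pool2d(x, 2)
    expanded_max = F.interpolate(pooled, scale_factor=2, mode='nearest')
    expanded_grad = F.interpolate(grad_out, scale_factor=2, mode='nearest')
    mask = (x == expanded_max).to(x.dtype)  # the "bitmask" of max positions
    return expanded_grad * mask
```

With no ties (almost surely the case for random floats), this matches what autograd computes for `F.max_pool2d`.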
Hm.... very interesting. Thanks for letting me know. Very curious. I'll have to take a look at that.
If only the backward pass is slow, that is strange. Maybe it is an issue with the CUDA implementation. A 2x2 stride-2 pooling should be simple, but if it's using a parametrized general max-pooling kernel then the implementation would be more complex than a specialised one. How much slower is it than the forward pass?
Many many many thanks -- thank you! The architecture here is certainly specialized, though it should apply quite well to a number of other use cases at similar image scales. I believe the only other main changes needed for a larger-image dataset (like ImageNet) would be adding maybe 2-3 blocks, so an additional 6-9 convolutions (making it Resnet15-Resnet18).
The hyperparameters are fiddly but the region of good performance is pretty flat, so I'd expect it to do well. I might take it for a spin on something else at some point, let me know if you do!
One benefit of having to go extremely fast is that it sort of makes you have to make your architectures as simple and short as possible. This helps with generalization to other datasets, I'd reckon from my experience here and other professional experience.
Thanks for your question -- much appreciated -- and there are indeed a lot of potentials for this little research project, hope this helps!