
First, congratulations! It's a really cool project. Training speed is such an impediment to research, and it's wonderful to see such amazing speeds achieved on these big networks.

I have two questions:

First, Page's write-up makes very clear how much time each of his optimizations saved. I've read your readme and releases, but it's not entirely clear which changes were the main contributors to your 3x improvement in training time. Would you be able to broadly quantify how much each change factored into your improvements?

Second, how transferable do you believe these optimizations are to different convolutional architectures? When I read about work on this subject, I'm always worried that the training time is actually (hundreds of hours of manual inductive-bias farming + 10s of training), and that those hundreds of hours of work would have to be redone for a new architecture. Is that the case, or have you been able to draw generally applicable conclusions that improve training speed for any convolutional network?



