Excellent article, and the little bit of the paper I've read so far stresses building intuition. Andrew Ng's classes stressed building intuition as well.
Even though I have largely earned my living doing deep learning over the last six years, I believe that hybrid AI will get us to AGI. Getting a first-principles understanding of DL models is complementary to building hybrid AI systems.
EDIT: I am happy to have a work-in-progress PDF of the manuscript, but I wish authors would release ePub and Kindle formats in addition to the PDF. I spread my reading across devices, and with a PDF I need to remember where I left off reading and navigate there.
The winner of the recently concluded BirdCLEF competition (for identifying bird species from audio) was a hybrid combining a deep neural network with gradient-boosted decision trees, to handle the date+geo metadata.
This all makes sense, but at the same time feels a bit paradoxical to me. We’re developing a first-principles theory to understand the mechanics of massive empirical models. Isn’t that kind of ironic?
If the problem we’re solving is “faster iteration and design of optimal deep learning models for any particular problem”, I would have thought the ML-style approach would be to solve that... through massive empiricism. Develop a generalized dataset simulator and a language capable of describing the design space of all deep learning models, then build a mapping between dataset characteristics <> optimal model architecture. Maybe that’s the goal and I haven’t dug deep enough. It just feels funny that all of our raw computing power has led us back to the need for more concise theory. Maybe that’s a fundamental law of some kind.
I think this is to be expected. The oscillation between first-principles and empirical models falls out of the scientific method: see a few datapoints, develop a predictive theory, try to prove the theory wrong with new datapoints, and reiterate for alternative explanations with fewer assumptions, greater predictive power, etc...
This happens even in pure mathematics, just at a more abstract level: start with conjectures seen on finite examples, prove some limited infinite cases, eventually prove or disprove the conjecture entirely.
Current DL models are so huge they've outpaced the scale where our existing first-principles tools (like linear algebra) can efficiently predict the phenomena we see when we use them. The space has gotten larger, but human brains haven't, so if we still want humans to be productive, we need to develop a more efficient theory. Empirical models explaining empirical models might work, but not for humans.
We had a manager with a Theoretical Physics education from a top school who seriously suggested ultimately solving QA by building a system to run the program through all the possible combinations of branches, etc.
Floating-point math is hard, but testing these functions is trivial, and fast. Just do it.
The functions ceil, floor, and round are particularly easy to test because there are presumed-good CRT (C RunTime) functions that you can check them against. And you can test every float bit pattern (all four billion!) in about ninety seconds. It’s actually very easy: just iterate through all four billion (technically 2^32) bit patterns, call your test function, call your reference function, and make sure the results match. Properly comparing NaN and zero results takes a bit of care, but it’s still not too bad.
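Here's a minimal sketch of that exhaustive loop in C (my own illustration, not the poster's code; `my_floor` is a hypothetical stand-in for the function under test, and here it just forwards to the CRT's `floorf` so the sketch compiles and runs):

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical function under test; swap in your own implementation. */
static float my_floor(float x) { return floorf(x); }

int main(void) {
    uint64_t mismatches = 0;
    for (uint64_t i = 0; i <= UINT32_MAX; ++i) {
        uint32_t bits = (uint32_t)i;
        float x;
        memcpy(&x, &bits, sizeof x);   /* reinterpret the bit pattern as a float */

        float got  = my_floor(x);
        float want = floorf(x);        /* presumed-good CRT reference */

        /* NaN != NaN, so compare bit patterns and special-case the both-NaN case. */
        uint32_t gb, wb;
        memcpy(&gb, &got,  sizeof gb);
        memcpy(&wb, &want, sizeof wb);
        if (!(gb == wb || (isnan(got) && isnan(want))))
            ++mismatches;
    }
    printf("mismatches: %llu\n", (unsigned long long)mismatches);
    return mismatches != 0;
}
```

The bit-pattern comparison also distinguishes +0.0 from -0.0, which is one of the "takes a bit of care" details; a plain numeric == would silently treat them as equal.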
For some types of system, searching through all possible rules for the behaviour that you want isn't out of the question, and can sometimes lead to finding a solution faster than wandering around in parameter-space (either along a directed path or randomly).
A recommendation of this 'search-based' approach (where it can be used) instead of always constructing solutions is one of the 'new' aspects of Wolfram's "New Kind of Science".
I believe that this ability to tolerate searching for a solution is a component of what allows neural nets to find solutions better than programmers would ever write. An NN will happily explore nonsensical or illogical bits of parameter-space, where there are still good solutions to be found in places the 'rational' or 'logical' human mind wouldn't (or even couldn't?) go.
My favorite theoretical description of multilayer networks comes from the first multilayer network, the 1986 harmonium [1]. It used a free-energy-minimization model (in the paper it is called harmony maximization), which is concise, natural, and effective. I find the paper very well written and insightful, even today.
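For anyone who wants the shape of the idea without reading the paper: in modern restricted-Boltzmann-machine notation (my paraphrase, not the 1986 paper's own notation), the energy of a visible/hidden configuration and the free energy of the visibles are

```
E(v, h) = - \sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j
F(v)    = - \log \sum_h \exp(-E(v, h))
```

Harmony plays the role of negative energy, so maximizing harmony corresponds to minimizing energy, and fitting the model amounts to lowering the free energy F(v) on the training data.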
I haven't fully read the current paper, but it doesn't mention "free energy", which seems odd given their emphasis on thermodynamics and first principles.
I just added it to my reading list. Thankfully, the authors require only knowledge of undergrad math (basic linear algebra and multivariable calculus), and prioritize "intuition above formality":
> While this book might look a little different from the other deep learning books that you’ve seen before, we assure you that it is appropriate for everyone with knowledge of linear algebra, multivariable calculus, and informal probability theory, and with a healthy interest in neural networks. Practitioner and theorist alike, we want all of you to enjoy this book... we’ve strived for pedagogy in every choice we’ve made, placing intuition above formality.
I just finished looking through the manuscript [https://deeplearningtheory.com/PDLT.pdf]. The mathematics is heavy for me, especially for a quick read, but one great thing I see is that the authors have reduced dependencies on external literature by inlining the various derivations and proofs instead of just providing references.
## The epilogue section (page 387 of the book, 395 in the PDF) gives a good overview, summarized below per my own understanding:
Networks with a very large number of parameters, much larger than the size of the training data, should by this reasoning overfit. The number of parameters is conventionally taken as a measure of model complexity, and a very large network can perform well on the training data by simply memorizing it, then perform poorly on unseen data. Yet empirically these very large networks still generalize well, i.e., they pick up good patterns from the training data.
The authors show that model complexity (or, I would say, the ability to generalize well) for such large networks depends on the depth-to-width ratio:
* When the network is much wider than it is deep (the ratio approaches zero), the neurons in the network don't have as many "data-dependent couplings". My understanding is that while the large width gives the network power in terms of the number of parameters, it has less opportunity for a correspondingly large number of feature transformations. While the network can still fit the training data well [2, 3], it may not generalize well. In the authors' words, when the depth-to-width ratio is close to zero (page 394), "such networks are not really deep" (even if the depth is much more than two) "and they do not learn representations."
* On the opposite end, when the network is very deep (the ratio approaching one or larger), the network (rephrasing the authors from my limited understanding) requires a non-Gaussian description of the model parameter space, which makes it "not tractable" and not practically useful for machine learning.
While it makes intuitive sense that the network's capability to find good patterns and representations depends on the depth-to-width ratio, the authors supply the mathematical underpinnings behind this, briefly summarized above. My previous intuition was that having a larger number of layers allows for more feature transformations, making learning easier for the network. The new understanding via the authors' work is that if, for the same number of layers, the width is increased, the network now has a harder job learning feature transformations commensurate with the now-larger number of neurons.
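Put compactly (this is my paraphrase of the above; the symbol r for the aspect ratio is mine, not necessarily the book's):

```
r \equiv depth / width

r \to 0     : effectively Gaussian, no data-dependent couplings ("not really deep", no representation learning)
0 < r \ll 1 : nearly-Gaussian and perturbatively tractable, the regime where representations are actually learned
r \gtrsim 1 : strongly non-Gaussian, "not tractable" and not practically useful
```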
## My own commentary and understanding (some from before looking at the manuscript)
If the size of the network is very small, the network won't be able to fit the training data well. A network with a larger size generally has more 'representation' power, allowing it to capture more complex patterns.
The ability to fit the training data is, of course, different from the ability to generalize to unseen data. Merely adding more representation power can allow the network to overfit. As the network size starts exceeding the size of the training data, it could have a tendency to just memorize the training data without generalizing, unless something is done to prevent that.
So as the size of the network is increased with the intention of giving it more representation power, we need something more, so that the network first learns the most common patterns (highest compression, but lossy) and then keeps learning progressively more intricate patterns (less compression, more accurate).
My intuition so far was that achieving this was an aspect of the training algorithm and cell-design innovations, and also of the depth-to-width ratio. The authors, however, show that it depends on the depth-to-width ratio in the way specified above. It is still counter-intuitive to me that algorithmic innovation may not play a role in this, or perhaps I am misunderstanding the work.
So the 'representation power' of the network, and its ability to fit the training data itself, generally increase with the size of the network, while its ability to learn good representations and generalize depends on the depth-to-width ratio. Loosely speaking, then: to increase accuracy on the training data itself, the model size may need to be increased while keeping the aspect ratio constant (at least as long as the training data remains larger than the model), whereas to improve generalization and find good representations for a given model size, the aspect ratio should be tuned.
Intuitively, I think that in a pathological case where the network is so large that merely its width (as opposed to width times depth) exceeds the size of the training data, the model would still fail to learn well, even if the depth-to-width ratio is chosen according to the authors' guidance (page 394 in the book).
Finally, I wonder what the implications of the work are for networks with temporal or spatial weight-sharing: convolutional networks, recurrent and recursive networks, attention, transformers, etc. For example, for recurrent neural networks, the effective depth of the network depends on how long the input sequence is, i.e., the depth-to-width ratio could vary simply because the input length varies. The learning from the authors' work should, I think, apply directly if each time step is treated as a training sample on its own, i.e., if backpropagation through time is not considered. However, I wonder whether the authors' work still puts constraints on how long the input sequences can be, as the non-Gaussian aspect may start coming into the picture.
As time permits, I will read the manuscript in more detail. I'm hopeful, however, that other people will get there faster and help me understand it better. :-)
The depth-to-width ratio reminds me of the EfficientNet paper, which introduces simultaneous scaling of network depth and width to trade off model size and quality against computational complexity.
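For reference, the compound-scaling rule from that paper couples depth, width, and input resolution through a single coefficient \phi (written from memory, so double-check against the paper):

```
depth:      d = \alpha^{\phi}
width:      w = \beta^{\phi}
resolution: r = \gamma^{\phi}
subject to  \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,  with \alpha, \beta, \gamma \ge 1
```

so each increment of \phi roughly doubles the FLOPs while keeping the three dimensions in a fixed proportion.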
I trust you're joking. Carnot wanted a theory that could explain the steam-powered machines of the industrial age -- just as now we want a theory that could explain the AI-powered machines of the information age. The authors mention as much in the introduction.
He's not joking, though; they really do say that in the PDF linked below: "Steam navigation brings nearer together the most distant nations. ... their theory is very little understood, and the attempts to improve them are still directed almost by chance. ... We propose now to submit these questions to a deliberate examination. -Sadi Carnot, commenting on the need for a theory of deep learning"
It's probably a typo; it'll get corrected. It should probably say "a theory of thermodynamics" or something similar instead of deep learning.
Are you the author of the PDF, or do you know the author, so that you know for sure? If not, then it could be a typo. I'm not sure why that's such a controversial statement.