ML researcher here wanting to offer a clarification.
L1 regularization induces sparsity. Weight decay explicitly _does not_, because it is an L2 penalty. This is a common misconception.
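To make the sparsity point concrete, here's a quick toy sketch (mine, not from the talk, with made-up names and hyperparameters): plain gradient descent on a least-squares problem, once with an L2 penalty (i.e. weight decay) and once with an L1 penalty applied via soft-thresholding. The L1 run ends with many weights at exactly zero; the L2 run shrinks everything but zeroes out nothing.

```python
# Toy demo: L1 vs L2 (weight decay) on a sparse linear regression problem.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:5] = rng.normal(size=5)           # only 5 of 50 features actually matter
y = X @ true_w + 0.1 * rng.normal(size=n)

def fit(penalty, lam=0.1, lr=0.01, steps=2000):
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n      # gradient of the mean squared loss
        if penalty == "l2":
            w -= lr * (grad + lam * w)    # weight decay: shrinks toward 0, never hits exactly 0
        else:  # "l1"
            w -= lr * grad
            # proximal step (soft-thresholding): small weights snap to exactly 0
            w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

for p in ("l2", "l1"):
    w = fit(p)
    print(p, "exact zeros:", int((w == 0).sum()), "of", d)
```

The L2 run reports zero exactly-zero weights; the L1 run zeroes out most of the irrelevant features.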
Something a lot of people don't know is that weight decay works because, applied as regularization, it pushes the network toward a minimum description length (MDL) solution, which reduces regret during training.
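For what it's worth, the textbook way to make the MDL connection concrete (not claiming this is exactly the argument above, just the standard two-part-code view) is that the L2-penalized objective is, up to constants, a description length, with the penalty playing the role of the bits needed to encode the weights under a Gaussian prior:

$$
\mathcal{L}(w) \;=\; \underbrace{-\log p(\mathcal{D}\mid w)}_{\text{bits to describe the data given } w} \;+\; \underbrace{\tfrac{\lambda}{2}\,\lVert w\rVert_2^2}_{\propto\, -\log p(w),\ \text{bits to describe } w}
$$

Minimizing the total is the usual formal cash-out of "the simplest model that still fits the data."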
Pruning in the brain is somewhat related, but because the brain (fundamentally, IIRC) uses sparsity to induce representations rather than to compress, it's basically a different motif entirely.
If you need a hint on this one, think about the implicit biases of different representations and the downstream impact they can have on the learned (or learnable) representations of whatever system is in question.
I enjoyed this presentation, thank you for sharing it. Good stuff in here.
I think the reasoning behind the basis functions is a bit off, but as I noted elsewhere here, that's work I'm not entirely able to talk about since I'm actively developing it right now; I'll release it when I can.
However, you can see some of the empirical consequences of my updated understanding of encoding and compression in an upcoming release of hlb-CIFAR10, which should cut out another decent chunk of training time. As part of it, we reduce the network from a ResNet8 to a ResNet7 and remove one of the (potentially less necessary) residual connections. It's all 'just' empirical, of course, but long-term, as they say, the proof is in the pudding, and things are already so incredibly tightened down.
I hope this answers your question.