ML researcher here, wanting to offer a clarification.

L1 induces sparsity. Weight decay explicitly _does not_, as it is L2. This is a common misconception.
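
A quick way to see the sparsity difference (a toy sketch, nothing to do with any particular codebase): fit the same least-squares problem with an L1 penalty via proximal gradient descent and with an L2 penalty via plain gradient descent, then count the weights that land exactly on zero.

    # Toy illustration: L1 (soft-thresholding) produces exact zeros,
    # L2 (weight decay) only shrinks the weights toward zero.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))
    true_w = np.zeros(50)
    true_w[:5] = rng.normal(size=5)            # only 5 informative features
    y = X @ true_w + 0.1 * rng.normal(size=200)

    lr, lam, steps = 1e-3, 5.0, 5000
    w_l1 = np.zeros(50)
    w_l2 = np.zeros(50)
    for _ in range(steps):
        w_l1 = w_l1 - lr * (X.T @ (X @ w_l1 - y))
        w_l1 = np.sign(w_l1) * np.maximum(np.abs(w_l1) - lr * lam, 0.0)  # prox step for L1
        w_l2 = w_l2 - lr * (X.T @ (X @ w_l2 - y) + lam * w_l2)           # L2 penalty in the gradient

    print("exact zeros, L1:", int(np.sum(w_l1 == 0)))   # most of the 50 weights
    print("exact zeros, L2:", int(np.sum(w_l2 == 0)))   # typically none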

Something a lot of people don't know is that weight decay works because, when applied as regularization, it causes the network to approach the MDL, which reduces regret during training.
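
For concreteness on the weight decay half of that (the MDL/regret framing is the interpretive layer on top): under plain SGD, an explicit L2 penalty and a per-step decay of the weights are the same update, which is why the two terms get used interchangeably; adaptive optimizers like Adam break the equivalence, hence AdamW. A minimal sketch with a stand-in loss:

    # Sanity check: under vanilla SGD, "add lam*w to the gradient" (L2 penalty)
    # and "multiply the weights by (1 - lr*lam) each step" (weight decay) are
    # the same update. grad_loss is just a stand-in loss gradient.
    import numpy as np

    rng = np.random.default_rng(1)
    w_pen = rng.normal(size=10)
    w_dec = w_pen.copy()
    lr, lam = 0.1, 0.01

    def grad_loss(w):
        return 2.0 * w - 1.0

    for _ in range(100):
        w_pen = w_pen - lr * (grad_loss(w_pen) + lam * w_pen)
        w_dec = (1.0 - lr * lam) * w_dec - lr * grad_loss(w_dec)

    print(np.allclose(w_pen, w_dec))   # True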

Pruning in the brain is somewhat related, but because the brain (fundamentally, IIRC) uses sparsity to induce representations rather than for compression, it's basically a different motif entirely.

If you need a hint on this one, think about the implicit biases of different representations and the downstream impacts they can have on the learned (or learnable) representations of whatever system is in question.

I hope this answers your question.



can you please spell out what MDL is an acronym for?


Minimum description length.


thanks


> because the brain uses sparsity to (fundamentally, IIRC) induce representations instead of compression

What's the evidence for this?


https://bernstein-network.de/wp-content/uploads/2021/03/Lect... has an awesome overview of the current understanding of neural encoding mechanisms.


I enjoyed this presentation, thank you for sharing it. Good stuff in here.

I think the reasoning behind the basis functions is a bit off, but as I noted elsewhere here, that's work I'm not entirely able to talk about, since I'm actively developing it right now; I'll release it when I can.

However, you can see some of the empirical consequences of my updated understanding of encoding and compression in an upcoming release of hlb-CIFAR10, which should cut out another decent chunk of training time. As part of it, we reduce the network from a ResNet8 architecture to a ResNet7 and additionally remove one of the (potentially less necessary) residuals. It is all 'just' empirical, of course, but long-term, as they say, the proof is in the pudding, since things are already so incredibly tightened down.
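
To make "remove one of the residuals" concrete, here is a hypothetical PyTorch block where the skip connection is simply made optional; this is a sketch, not the actual hlb-CIFAR10 code.

    # Hypothetical sketch (not the actual hlb-CIFAR10 code): "removing a
    # residual" means the block returns its conv path alone instead of
    # adding the input back in.
    import torch
    import torch.nn as nn

    class Block(nn.Module):
        def __init__(self, channels, residual=True):
            super().__init__()
            self.residual = residual
            self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.norm = nn.BatchNorm2d(channels)
            self.act = nn.GELU()

        def forward(self, x):
            out = self.act(self.norm(self.conv(x)))
            return x + out if self.residual else out   # optional skip connection

    x = torch.randn(2, 64, 8, 8)
    print(Block(64, residual=False)(x).shape)   # torch.Size([2, 64, 8, 8])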


That looks interesting; do you know which paper discusses the connection between MDL, regret, and weight decay?


Shannon's information theory and the Wikipedia pages on L2 regularization and the MDL are a decent starting point.

For the former, there are a few good papers that simplify the concepts even further.
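
For the L2 side of that link, the standard identity is that the cost of coding a weight vector under a zero-mean Gaussian is an L2 penalty plus a constant; the regret piece is a separate argument. A small numeric check of just that identity (sigma is an assumed coding width):

    # Description-length reading of L2: -log of a Gaussian over the weights
    # is 0.5 * ||w||^2 / sigma^2 plus a constant, i.e. an L2 penalty.
    import numpy as np

    sigma = 1.0                                   # assumed width of the coding distribution
    w = np.random.default_rng(2).normal(size=8)

    l2_plus_const = 0.5 * np.sum(w**2) / sigma**2 + 0.5 * len(w) * np.log(2 * np.pi * sigma**2)
    neg_log_gauss = np.sum(0.5 * np.log(2 * np.pi * sigma**2) + 0.5 * w**2 / sigma**2)

    print(np.isclose(l2_plus_const, neg_log_gauss))   # True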


Sorry, I know what MDL and L2 regularization are; I would like the paper that connects them in the way you mentioned.



