I think that in chess, capturing a piece increases the value, but you also have to consider the position (that is, how your pieces can move), and that is the entropy. So maximum entropy means taking pieces while still weighing strategic position (the policy). But there must also be a "confluence" term: some measure of how beneficial it is to have many pieces in play or many new reachable states. I don't know how to relate that confluence term to entropy mathematically. From a computational point of view, a huge number of states makes computing the best move intractable, but it can also raise the optimum, so it comes down to how well, given the available compute, the algorithm can approximate a maximum that is an increasing function of the number of states. There must be a trade-off here, which is what I called confluence.
About KL-regularization, think of it like training wheels for the robot's brain. It helps the robot's learning process by preventing it from making drastic changes to its strategy too quickly.
It's like saying, "Hey robot, remember what you learned last time? Don't forget it completely, but feel free to adjust a bit."
It's essentially a fancy way of keeping the updated policy within some delta of the original one. Otherwise the model ends up "exploiting" outliers that make sense to machines but not to humans. They do the same thing with PPO in RLHF, where the probability ratio between the new and old policy gets clipped.
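For the curious, here's a minimal sketch of that clipping as it appears in PPO's clipped surrogate objective (toy NumPy numbers; `epsilon` is the "delta" mentioned above, and 0.2 is the common default from the PPO paper):

```python
import numpy as np

def ppo_clipped_objective(new_probs, old_probs, advantages, epsilon=0.2):
    """Clipped surrogate objective in the style of PPO.

    The probability ratio new/old is clamped to [1 - epsilon, 1 + epsilon],
    so a single update can't move the policy too far from the old one.
    """
    ratio = new_probs / old_probs
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Take the pessimistic (minimum) of the unclipped and clipped terms,
    # which removes the incentive to chase outlier advantages.
    return np.minimum(ratio * advantages, clipped * advantages).mean()

# Toy example: one outlier advantage that the clip keeps from dominating.
new_p = np.array([0.5, 0.9, 0.1])
old_p = np.array([0.4, 0.3, 0.3])
adv   = np.array([1.0, 10.0, -1.0])
print(ppo_clipped_objective(new_p, old_p, adv))
```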
You will have hyperparameters that weight the KL divergence (between the updated policy distribution and the current policy distribution). This lets you tune how sensitive the training process is. Entropy maximization is common in offline RL specifically, as it ensures the policy retains at least some non-determinism and isn't bound so closely to the data you have collected that it becomes effectively deterministic. This is also tunable with a weight. A rough sketch of how those two weights show up in an objective is below.
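This is an illustrative sketch, not any particular library's API; the names `beta` (KL weight) and `alpha` (entropy weight) are just the hyperparameters described above:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)))

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution."""
    return -np.sum(p * np.log(p + eps))

def regularized_objective(expected_reward, new_policy, old_policy,
                          beta=0.1, alpha=0.01):
    """Objective to maximize: reward, minus a weighted KL penalty
    (stay near the current policy), plus a weighted entropy bonus
    (keep the policy from collapsing to deterministic)."""
    return (expected_reward
            - beta * kl_divergence(new_policy, old_policy)
            + alpha * entropy(new_policy))

new_pi = np.array([0.7, 0.2, 0.1])
old_pi = np.array([0.5, 0.3, 0.2])
print(regularized_objective(expected_reward=1.0,
                            new_policy=new_pi, old_policy=old_pi))
```

Turning `beta` up makes training more conservative (the new policy hugs the old one); turning `alpha` up keeps the policy more exploratory.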
As a lazy reader, could I ask posters who use linear models for prediction to just say so? For example, use a tag like: [Linear prediction based on the last five years of bitcoin value]. Since I can easily make that prediction myself, I am not going to extract much value from your linear prediction. In any case, that prediction reflects what you are interested in, or what you think could be interesting for the rest of us. That answers another question: what do you think is interesting for us?
I am just curious about this. You used the word "never," and I think your claim can be tested: perhaps you could post a list of five obscure questions for an LLM to answer, and then someone could put them to a good LLM on your behalf, or to an expert in that field, to assess the value of the answers.
Edited: I just submitted an Ask HN post about this.
Edited: I found this to be useful for explaining maximum entropy https://awjuliani.medium.com/maximum-entropy-policies-in-rei...
Also, thanks for all the explanations.