
So to my understanding, this work reproduces DeepSeek R1's reinforcement learning mechanism in a very small language model.

The AI gets "rewards" (like points) for doing two things correctly:

Accuracy: Getting the right answer. For example, math answers must be in a specific format (e.g., inside a box) so a computer can easily check them. For coding problems, test cases verify if the code works.

Format: Using the <think> and <answer> tags properly. This forces the AI to organize its responses clearly.

So in this case, the training program can extract the model's answer by parsing the <answer> tag. We can then evaluate whether that answer is correct: if it is, give a reward; otherwise, no reward.

Sample N such answers from a single question, giving an array of N rewards. That is enough signal for the RL algorithm to guide the model toward getting smarter.
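
Roughly, in code, the reward check looks something like this (a minimal sketch; the function name, tag parsing, and weights here are my assumptions, not the exact ones from the repo or the R1 paper):

    import re

    def compute_reward(completion: str, ground_truth: str) -> float:
        """Toy rule-based reward: format bonus plus correctness bonus."""
        reward = 0.0
        # Format reward: well-formed <think>...</think> <answer>...</answer> structure.
        if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL):
            reward += 0.1  # assumed weight
        # Accuracy reward: extract the answer and compare it to the reference.
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if match and match.group(1).strip() == ground_truth.strip():
            reward += 1.0  # assumed weight
        return reward

    # One question, N sampled completions -> N rewards for the RL step.
    sampled = [
        "<think>6 * 7 = 42</think> <answer>42</answer>",
        "<think>hmm</think> <answer>41</answer>",
    ]
    print([compute_reward(c, "42") for c in sampled])  # [1.1, 0.1]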



I've been trying to follow the literature on PPO/GRPO as applied to LLMs. From what I understand, since the reward is only given once the entire CoT sequence is sampled, traditional RL techniques would require some form of credit assignment to distribute that reward amongst individual tokens – which is where the critic/value network comes in, right?

Instead DeepSeek (with GRPO) seems to just omit that value function entirely and use only sparse rewards. How does this end up being more efficient, since I thought the sparse nature of rewards makes it harder to converge to the optimal policy?
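
My (possibly wrong) mental model of the update, ignoring GRPO's clipping and KL terms so this is closer to plain REINFORCE than the actual objective, is that the single sequence-level advantage just gets broadcast to every token of the trace:

    import torch

    def policy_loss(token_logprobs: torch.Tensor, advantage: float) -> torch.Tensor:
        # token_logprobs: log-probs of the sampled CoT tokens under the current policy.
        # The one sequence-level advantage multiplies every token's log-prob,
        # i.e. credit is spread uniformly rather than learned by a critic.
        return -(advantage * token_logprobs).sum()

    # Toy usage: a 5-token trace whose group-relative advantage came out to 1.5.
    logprobs = torch.log(torch.tensor([0.9, 0.8, 0.7, 0.9, 0.6]))
    loss = policy_loss(logprobs, advantage=1.5)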


I don't think it's only using sparse rewards, because of the format rewards. The training recipe is pretty comprehensive and involves multiple stages.[1] The paper mentions that when using only the RL technique, the output is often not suitable for reading (language mixing, etc.). That feels like an AlphaZero moment for LLMs?

[1]: https://www.reddit.com/r/LocalLLaMA/comments/1i8rujw/notes_o...


The R1 paper says that they didn't use "process reward modeling". And the paper that introduced GRPO says that it can be used either with "outcome supervision" or "process supervision", with outcome supervision "only provid[ing] a reward at the end of each output". Put together, doesn't that imply R1 uses sparse rewards provided only at the end of the CoT sequence?


Ah sorry, you might be right. I meant "sparse reward" as a reward system that is mostly 0 but occasionally 1. Your "sparse reward" means only providing a reward at the end of each output.


> Ah sorry, you might be right. I meant "sparse reward" as a reward system that is mostly 0 but occasionally 1.

Did we introduce the abusive pressure of Korean educational culture to machines?


I think the reward is relative to the other sampled answers for the same question. This way the signal is strongest at the very margin of what is possible with a given model, and there is less noise from questions that are impossible or too easy.

There is some confusion here, because they do compute that simple reward, but then they convert it to a relative value and call it the advantage. And I think they use that advantage to update the model, not the raw reward.
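
Roughly (a sketch of the idea, not the exact implementation), the conversion is just a normalization within the group of N samples for the same question:

    import numpy as np

    def group_relative_advantages(rewards, eps=1e-8):
        # rewards: raw rewards for the N completions sampled for the SAME question.
        # The advantage is the reward normalized within that group.
        # (Whether the std uses N or N-1 in the denominator varies by implementation.)
        r = np.asarray(rewards, dtype=float)
        return (r - r.mean()) / (r.std(ddof=1) + eps)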


Yes, you're right. In their paper I think they say the process of sampling multiple traces and then taking relative rewards is supposed to Monte Carlo-approximate the value network? I don't really have the intuition for that, but it does make sense that rather than simply nudging probabilities in the direction of the trace with the highest absolute reward, you want to favor the trace which had the best reward relative to the current state. E.g., for quick intuition: if the absolute rewards for the traces were {0, 0, 0, 0.01}, then using absolute rewards would only give a weak signal for the last trace (nudging weights proportional to 0.01 * logprob), whereas using relative rewards (based on the z-score) would give a much stronger signal of roughly 1.5 * logprob.
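
Quick numeric check of that example (using the sample standard deviation; a population std gives slightly different numbers):

    import numpy as np

    r = np.array([0.0, 0.0, 0.0, 0.01])
    z = (r - r.mean()) / r.std(ddof=1)
    print(z)  # approximately [-0.5 -0.5 -0.5  1.5]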


Not only that: if you have {0, 0, 0, 0.01}, then the probability that you would get any reward in one shot is very low. And I also have the intuition that giving the rewards to traces at the edge is more efficient, because the model needs only a small perturbation to get them right. If you gave negative rewards to traces that are very far from being right, then the model might be steered in a wrong direction.


It looks like 'old-school' RL to me, which makes me wonder why it took so long to get here.


Nothing like acronyms to make me feel dumb and ill-informed.



The part I found strange: these RL formulations give no reward for incorrect solutions, so unless there are training examples that are easy enough for the base model to solve, the RL process won’t do anything.

So is the actual magic that the base models are good enough to sometimes generate successful CoT output in their unmodified state? Or did I miss something in the R1 paper and the code here?


I think this is where the relative rewards come into play: they sample many thinking traces and reward those that are correct. This works at the current 'cutting edge' for the model, exactly where it could be improved.


I was wondering the same thing. I feel there is too large a gap between a raw base model and a model that produces fully correct answers and follows a specific format. My guess is their rule-based reward system is more nuanced than just correctness and format.


Yeah, I find this part not clearly expressed as well. My best guess is that it's not simply a binary "correct/incorrect"; rather, the reward is made up of multiple parts (e.g. format + correctness) and structured in a way such that "close enough" answers still get some reward. From there I would expect that a base model might at least be able to "autocomplete" the format/style, at which point the RL machinery would kick in to tune it to properly obey the format, and once that's mastered, eventually correctness.
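
Purely as an illustration of that guess (the weights and the "close enough" rule here are made up, not taken from the paper or the repo), something like:

    import re

    def graded_reward(completion: str, target: float) -> float:
        # Hypothetical graded reward: partial credit for format and for
        # near-miss numeric answers, full credit only for the right one.
        reward = 0.0
        if "<think>" in completion and "<answer>" in completion:
            reward += 0.1                  # followed the format at all
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if match:
            try:
                value = float(match.group(1).strip())
            except ValueError:
                return reward              # answer tag present but not parseable
            if value == target:
                reward += 1.0              # exactly right
            elif abs(value - target) <= 0.01 * abs(target):
                reward += 0.3              # "close enough" still gets something
        return reward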

They did mention something about tuning on an un-SFT'd base model being much slower than 'warming it up' first with some existing reasoning traces.



