The R1 paper (https://arxiv.org/pdf/2501.12948) emphasizes the authors' success with reinforcement learning that doesn't require any supervised data (unlike RLHF, for example). They note that this works well for math and programming questions with verifiable answers.
What's totally unclear is what data they used for this reinforcement learning step. How many math problems of the right difficulty with well-defined labeled answers are available on the internet? (I see about 1,000 historical AIME questions, maybe another factor of 10 from other similar contests). Similarly, they mention LeetCode - it looks like there are around 3,000 LeetCode questions online. Curious what others think - maybe the reinforcement learning step requires far less data than I would guess?
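To make my rough count concrete, here's a quick back-of-envelope tally in Python. Every number is just my own guess from above, not anything stated in the paper, and I'm reading "another factor of 10" as an upper bound of about 10x the AIME count for contest math overall.

```python
# Back-of-envelope tally of publicly available problems with verifiable answers.
# All figures are my rough guesses, NOT numbers from the R1 paper.
aime_problems = 1_000        # historical AIME questions I could find
contest_factor = 10          # assumed multiplier once similar contests are included
leetcode_problems = 3_000    # approximate count of LeetCode questions online

# Assumption: "another factor of 10" means contest math totals ~10x the AIME count.
math_problems = aime_problems * contest_factor
total_problems = math_problems + leetcode_problems

print(f"math: ~{math_problems:,}  code: ~{leetcode_problems:,}  total: ~{total_problems:,}")
```

So under these guesses the pool is on the order of ten thousand or so problems, which is why I'm wondering whether the RL step really needs that little data.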