You're right, although tests like this have been done many times locally as well...

You're right, although tests like this have been done many times locally as well. This issue comes from the fact that RL usually kills the token prediction variance, disproportionately narrowing it to 2-3 likely choices in the output distribution even in cases where uncertainty calls for hundreds. This is also a major factor behind fixed LLM stereotypes and -isms. Base models usually don't exhibit that behavior and have sufficient randomness.