
It isn't even vaguely a distill of o1. The reasoning models are, from what we can tell, relatively small. This model is massive and they probably scaled the parameter count to improve factual knowledge retention.

They also mentioned developing some new techniques for training small models and then incorporating those into the larger model (probably to help it scale across datacenters), so I wonder if they are doing a bit of what people commonly mistake MoE for: pre-train a smaller model, specialize it on specific domains, then use it to generate synthetic data for training the larger model on those domains.
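Something like this, very roughly, in Hugging Face terms. To be clear, this is a minimal sketch of the idea, not anything OpenAI has confirmed; the model name and prompt are made up just to show the shape of the pipeline:

    # Sketch of the "small specialist -> synthetic data -> big model" idea.
    # "small-domain-model" is a hypothetical placeholder checkpoint.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("small-domain-model")
    specialist = AutoModelForCausalLM.from_pretrained("small-domain-model")

    def synthesize(prompts, max_new_tokens=512):
        """Sample domain text from the small specialist model."""
        corpus = []
        for p in prompts:
            ids = tok(p, return_tensors="pt").input_ids
            gen = specialist.generate(ids, max_new_tokens=max_new_tokens,
                                      do_sample=True, temperature=0.8)
            corpus.append(tok.decode(gen[0], skip_special_tokens=True))
        return corpus

    # The synthetic corpus then joins the large model's pre-training mix,
    # transferring the specialist's domain coverage without sharing weights.
    corpus = synthesize(["Derive the quadratic formula step by step."])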



You can 'distill' with data from a smaller, better model into a larger, shittier one; the direction doesn't matter. This is what they said they did on the livestream.
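At the data level it's just next-token cross-entropy on text the teacher sampled; nothing in the objective knows which model has more parameters. Toy PyTorch sketch (the tiny stand-in "student" is only there to keep it runnable, assume it's arbitrarily large):

    # Sequence-level distillation (Kim & Rush, 2016): train the student on
    # text sampled from the teacher. Model sizes never appear in the loss.
    import torch
    import torch.nn.functional as F

    # Stand-in student: any causal LM would slot in here, however large.
    student = torch.nn.Sequential(torch.nn.Embedding(100, 32),
                                  torch.nn.Linear(32, 100))
    opt = torch.optim.SGD(student.parameters(), lr=0.1)

    # Pretend these token ids were sampled from the smaller teacher model.
    teacher_text = torch.randint(0, 100, (4, 16))

    inputs, targets = teacher_text[:, :-1], teacher_text[:, 1:]
    logits = student(inputs)                        # (batch, seq, vocab)
    loss = F.cross_entropy(logits.reshape(-1, 100), targets.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()    # ordinary LM update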


I have distilled models before; I know how it works. They may have used o1 or o3 to create some of the synthetic data for this one, but they clearly did not try to create any self-reflective reasoning in this model whatsoever.



