
So why is "distilling from N-gram" better? Why does it make the transformer learn English faster?

Hypothesis: it's the standard "teacher-student" or "distillation" trick: if you're training on plain next-token prediction, you only learn what the single correct answer is (i.e. a one-hot spike in probability), but when you're distilling from a teacher model, you learn the teacher's entire distribution over plausible next tokens.
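
Roughly what I have in mind, as a sketch (the PyTorch setup and names like student_logits / teacher_probs are my own illustration, not from the paper):

    import torch
    import torch.nn.functional as F

    # Plain next-token training: the target is a single "correct" token id,
    # so the only learning signal is the spike at that one index.
    def hard_label_loss(student_logits, target_ids):
        return F.cross_entropy(student_logits, target_ids)

    # Distillation: the target is the teacher's full next-token distribution
    # (e.g. the N-gram model's predicted probabilities), so the student also
    # learns how probability mass spreads over the plausible alternatives.
    def distill_loss(student_logits, teacher_probs):
        student_log_probs = F.log_softmax(student_logits, dim=-1)
        return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")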

Curious, can anyone more experienced in AI research comment on this?


