
It says it’s common crawl. I interpret that to mean a generic web scrape dataset; presumably they filter out the stuff they don’t want before pretraining. You’d have to do some ablation testing to know what value it adds.


Common Crawl is a particular dataset. commoncrawl.org



