
No, the model has nothing to do with Llama. We are using our own architecture and training from scratch. Llama also does not have open training data, and is non-compliant, in contrast to this model.

Source: I'm part of the training team



If you guys need help with GGUFs + Unsloth dynamic quants + finetuning support via Unsloth (https://github.com/unslothai/unsloth) on day 0/1, I'm more than happy to help :)


absolutely! i sent you a linkedin message last week, but here seems to work much better. thanks a lot!


Oh sorry, I might have missed it! I think you or your colleague emailed me? My email is daniel @ unsloth.ai if that helps :)


Hey, really cool project, I'm excited to see the outcome. Is there a blog / paper summarizing how you are doing it? Also, which research group at ETH is currently working on it?


Llama 3 has open pretraining data, it's just not official for obvious legal reasons: https://huggingface.co/datasets/HuggingFaceFW/fineweb


Wait, the whole (English-language) web content dataset is ~50TB?


Yes, if we take the filtered and deduplicated HTML pages from CommonCrawl. I recently made a video on this topic: https://www.youtube.com/watch?v=8yH3rY1fZEA


Fun presentation, thanks! 72min ingestion time for ~81TB of data is ~1TB/min or ~19GB/s. Distributed or single-node? Shards? I see 50 jobs are used for parallel ingestion, and I wonder how ~19GB/s was achieved since ingestion rates were far below that figure last time I played around with CH performance. Granted, that was some years ago.


Distributed across 20 replicas.
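
A quick back-of-the-envelope check of those numbers in Python (the per-replica and per-job split below is just arithmetic on the figures mentioned in this thread, not a description of the actual cluster layout):

    # ~81 TB ingested in 72 minutes, across 20 replicas with 50 parallel jobs
    total_tb, minutes = 81, 72
    replicas, jobs = 20, 50

    gb_per_s = total_tb * 1000 / (minutes * 60)
    print(f"aggregate:   {gb_per_s:.1f} GB/s")               # ~18.8 GB/s
    print(f"per replica: {gb_per_s / replicas:.2f} GB/s")    # ~0.94 GB/s
    print(f"per job:     {gb_per_s / jobs:.2f} GB/s")        # ~0.38 GB/s

So each replica only has to sustain on the order of 1 GB/s, which is far less daunting than the aggregate figure.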


So you're not going to use copyrighted data for training? That's going to be a disadvantage with respect to Llama and other well-known models; it's an open secret that everyone is using everything they can get their hands on.

Good luck though, very needed project!


Not sure about Swiss law, but the EU AI Act, and the 2019/790 Digital Single Market copyright directive it piggybacks on for this topic, do allow training on copyrighted data as long as any opt-out mechanisms (e.g. robots.txt) are respected. AFAICT this LLM was trained while respecting those mechanisms (and, as linked elsewhere, they found no practical difference in performance - note that there is an exception allowing the opt-out mechanisms to be ignored for research purposes, which is how they could make that comparison).
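
For concreteness, here is a minimal sketch of what respecting a robots.txt opt-out looks like before fetching a page, using only Python's standard library (my own illustration, not the project's actual crawler; the user-agent string and URL are placeholders):

    from urllib import robotparser
    from urllib.parse import urlparse

    def allowed_to_fetch(url: str, user_agent: str = "MyCrawler") -> bool:
        """Return True only if the site's robots.txt permits this user agent."""
        parts = urlparse(url)
        rp = robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        try:
            rp.read()              # fetch and parse the site's robots.txt
        except OSError:
            return False           # be conservative if robots.txt is unreachable
        return rp.can_fetch(user_agent, url)

    # Skip any URL whose host has opted out of crawling for this agent.
    if allowed_to_fetch("https://example.com/some/article"):
        ...  # fetch the page and keep it as training data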


That is not correct. The EU AI Act has no such provision, and the data mining exemption does not apply, as the EU has made clear. As for Switzerland, copyrighted material cannot be used unless it is licensed.


Thanks for clarifying! I wish you the best of luck!


Are you using DBpedia?


no. the main source is fineweb2, but with additional filtering for compliance, toxicity removal, and quality filters such as fineweb2-hq
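
to make that filtering step concrete, here is a rough sketch of the kind of post-filtering pipeline described above (my own illustration: the config name and the three predicate functions are placeholders, not the team's actual filters):

    from datasets import load_dataset

    def passes_compliance(example) -> bool:
        # placeholder: e.g. drop documents from sources with licensing/opt-out issues
        return bool(example.get("url"))

    def passes_toxicity(example) -> bool:
        # placeholder: a real pipeline would use a toxicity classifier score
        return "blocklisted-term" not in example["text"].lower()

    def passes_quality(example) -> bool:
        # placeholder for model-based quality filtering in the spirit of fineweb2-hq
        return len(example["text"].split()) > 50

    stream = load_dataset("HuggingFaceFW/fineweb-2", name="deu_Latn",
                          split="train", streaming=True)

    filtered = (stream
                .filter(passes_compliance)
                .filter(passes_toxicity)
                .filter(passes_quality))

    for doc in filtered.take(3):   # peek at a few surviving documents
        print(doc["url"])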


Thx for engaging here.

Can you comment on how the filtering impacted language coverage? E.g. fineweb2 has 1800+ languages, but some with very little actual representation, while fineweb2-hq has just 20, each with a substantial data set.

(I'm personally most interested in covering the 24 official EU languages)


we kept all 1800+ (script, language) pairs, not only the quality-filtered ones. whether mixing quality-filtered and unfiltered languages affects the overall data mix is still an open question. preliminary research (Section 4.2.7 of https://arxiv.org/abs/2502.10361 ) indicates that quality filtering can mitigate the curse of multilinguality to some degree, and thus facilitate cross-lingual generalization, but it remains to be seen how strong this effect is at larger scale.
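
for readers unfamiliar with the "mixing" question: a common way to set per-language sampling weights is temperature-based sampling, which upsamples low-resource languages. the snippet below is a generic illustration of that standard technique, not necessarily this project's recipe, and the token counts are made up:

    def mixing_weights(token_counts: dict[str, float], temperature: float = 0.7) -> dict[str, float]:
        """Flatten the corpus-size distribution so small languages are upsampled."""
        scaled = {lang: n ** temperature for lang, n in token_counts.items()}
        total = sum(scaled.values())
        return {lang: s / total for lang, s in scaled.items()}

    # toy example: a large, a medium and a tiny corpus (counts are illustrative only)
    print(mixing_weights({"eng_Latn": 1e12, "deu_Latn": 1e11, "gle_Latn": 1e9}))

quality filtering changes these effective per-language sizes, which is why its interaction with the mix is a live question.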



