The ClimbMix 400B dataset looks like it is ~600 GB. It would be neat if someone could host it in compressed form: since LLMs can be used as compressors, even a small LLM should beat classical compression algorithms on this kind of text. Why isn't this approach used within the ML community? A rough sketch of the idea below.
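To make the claim concrete, here is a minimal sketch (GPT-2 is just a placeholder small model, and instead of implementing a full arithmetic coder it estimates the size an arithmetic coder driven by the model's next-token probabilities would achieve, which is the text's negative log-likelihood in bits) compared against zlib:

```python
# Sketch: estimate LLM-compressed size via the NLL bound, vs. zlib.
# Assumes `transformers` and `torch` are installed; gpt2 is a stand-in.
import math
import zlib

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

text = "The quick brown fox jumps over the lazy dog. " * 20

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    # Mean cross-entropy (nats) per predicted token, labels shifted internally.
    loss = model(ids, labels=ids).loss.item()

n_predicted = ids.shape[1] - 1
llm_bits = loss * n_predicted / math.log(2)  # arithmetic-coding size bound
zlib_bits = len(zlib.compress(text.encode())) * 8

print(f"raw:  {len(text.encode()) * 8} bits")
print(f"zlib: {zlib_bits} bits")
print(f"LLM bound (excludes model weights): {llm_bits:.0f} bits")
```

Note the bound excludes the model weights themselves, and decompression needs bit-exact reproduction of the same forward passes, which is presumably part of the practical hurdle.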
Or is it the "anyone who means anything in the field, has access to high bandwidth anyway"?