You're only thinking of the training data. But the pre-trained model is like a newborn: thrashing and yelling and not listening. It needs a second stage of training on a mix of roughly 1,800 supervised tasks. Now it has progressed a little and you can get it to listen, but it's still not OK; it's like a 5-year-old. Then you need to label more data with human preferences and fine-tune the model to align it with what we consider good behaviour. Now it behaves like a 10-year-old.
In the original pretraining dataset you already combine dozens of sources: web scrapes, book collections, paper collections, material in many languages, etc. In the second stage you have thousands of small supervised datasets. In the third stage you have to label by hand. So I think the dataset-building phase is pretty difficult.
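To make the three stages concrete, here's a toy sketch in Python. All the source names, mixture weights, and example records are made up for illustration; real pipelines are vastly larger, but the shape of the data at each stage is roughly this:

```python
# Toy sketch of the three dataset-building stages (all names/weights invented).
from collections import Counter
import random

# Stage 1: pretraining mixture - dozens of sources, sampled by weight.
pretrain_sources = {"web_scrape": 0.6, "books": 0.2, "papers": 0.1, "multilingual": 0.1}

def sample_pretrain_docs(n, seed=0):
    """Draw n documents according to the mixture weights."""
    rng = random.Random(seed)
    names = list(pretrain_sources)
    weights = [pretrain_sources[s] for s in names]
    return [rng.choices(names, weights=weights)[0] for _ in range(n)]

# Stage 2: thousands of small supervised tasks, each an (instruction, target) pair.
supervised_tasks = [
    {"task": "translation", "input": "Translate to English: Bonjour", "target": "Hello"},
    {"task": "summarization", "input": "Summarize: <long text>", "target": "<short text>"},
]

# Stage 3: human preference labels - an annotator picks the better of two answers.
def label_preference(prompt, answer_a, answer_b, human_picks_a):
    chosen, rejected = (answer_a, answer_b) if human_picks_a else (answer_b, answer_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

counts = Counter(sample_pretrain_docs(1000))
pair = label_preference("Explain RLHF", "a helpful answer", "a rude answer", True)
```

The point of the sketch is that each stage produces a differently shaped dataset: raw documents weighted by source, instruction/target pairs, and chosen/rejected pairs, and each one needs its own curation effort.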