They absolutely did not spend millions to train it. Credible estimates place the cost for an entity like Meta at roughly $30k-$100k, probably less since Meta likely already owns the 256 8xA100 nodes needed to train it.
Even as an individual, it wouldn't cost you anywhere near a million if you only trained 13B and took advantage of volume pricing.
I don’t think it’s fair to just ignore the capex part of the model training costs. If we take AWS pricing, the 21 days of training for 65B cited in the LLaMA paper would cost about $2.6M at reserved instance prices. While there’s a lot of AWS profit baked into that, it’s a reasonable first approximation of the TCO of that hardware. Even if the real TCO is a third of it, that’s still nearly a million to train 65B, never mind the staff costs.
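The napkin math behind that ~$2.6M figure, assuming the paper's reported 2048 A100s for ~21 days and an assumed p4d.24xlarge reserved rate (actual rates vary by region and commitment term):

```python
# LLaMA 65B reportedly trained on 2048 A100s for ~21 days.
# AWS p4d.24xlarge bundles 8 A100s; the hourly rate below
# (~$19.22, 1-yr reserved) is an assumption, not a quote.
gpus = 2048
gpus_per_instance = 8          # p4d.24xlarge
hours = 21 * 24                # 21 days of wall-clock training

instances = gpus // gpus_per_instance   # 256 instances
reserved_rate = 19.22                   # assumed $/instance-hour
cost = instances * hours * reserved_rate
print(f"~${cost / 1e6:.1f}M")           # roughly $2.5M
```

That lands in the same ballpark as the $2.6M cited, and it excludes storage, networking, and any wasted runs.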
Plus there are bound to be false starts, reverts, crashes, etc. that bump up the actual reproduction cost. Most training cost estimates take an extremely rosy best-case view, assuming everything goes smoothly on the first try and no GPU cycles are wasted.
Could I get a source for that? Not that I don't believe you, but my napkin math puts the cost of training the 65B parameter model alone a lot higher than $100k.