
I mean, I 100% agree size is not everything. You can have a massive model that was trained poorly, so it actually performs worse than a smaller, better/more efficiently trained model. That's why we use Llama 2 70b over Falcon-180b, OPT-175b, and Bloom-175b.

I don't know how Mistral performs on codegen specifically, but models which are finetuned for a specific use case can definitely punch above their weight class. As I stated, I'm just talking about the general case.

But so far we don't know of a 7b model (there could be a private one we don't know about) that can beat a modern 70b model such as Llama 2 70b. Could one have been created that can do that without us knowing about it? Yes. Could we apply Phi's technique to 7b models and reach Llama 2 70b levels of performance? Maybe, but I'll believe it when we have a 7b model built that way and a human evaluation study to confirm it. It's been months now since the Phi study came out, and I haven't heard of any new 7b model being built on it. If it really were such a breakthrough, allowing a 10x parameter reduction and a 100x dataset reduction, it would be dumb for these companies not to pursue it.


