Er, how would that reduce the cost? You still need to train the model, which is the expensive bit.
Also, the base model for V3 and the only-RL-tuned R1-Zero are available, and they behave like base models, which seems unlikely if they used data from OpenAI as their primary data source.
It's much more likely that they've consumed the background radiation of the web, where OpenAI contamination is dominant.
Hypothetical question: is the Chinese government capable of exploiting ChatGPT to get around the query limit? For example, by making queries through compromised devices, or even snooping local traffic on devices? Let's face it, these models are closely aligned with China's national security interests, so it's not a far-fetched question to ask.
You can't distill from GPT-4 because OpenAI conceals the token probabilities (and has for a couple of years now, since before GPT-4), presumably to prevent exactly that. You can fine-tune against its output, though. My guess is that they used something like OpenOrca or some other public dataset that includes GPT-4 output as part of their initial fine-tuning.
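To make the distinction concrete, here's a rough PyTorch-style sketch (function names are made up, not anyone's actual training code): proper distillation needs the teacher's per-token probability distribution for a KL loss, which the API doesn't give you, while fine-tuning on sampled output is just ordinary cross-entropy on the generated text, which is all a public set like OpenOrca provides.

```python
import torch.nn.functional as F

# True distillation: match the teacher's full token distribution.
# Requires per-token probabilities/logits from the teacher, which the
# GPT-4 API does not expose.
def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    # KL divergence between temperature-softened teacher and student distributions
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

# Fine-tuning on sampled output: only the teacher's generated text is needed.
# Plain next-token cross-entropy on (prompt, teacher_completion) pairs.
def sft_loss(student_logits, target_token_ids):
    return F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        target_token_ids.view(-1),
    )
```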
How does such a distillation work in theory? They don't have the weights of OpenAI's models and can only call their APIs, right? So how can they actually build on top of them?
They fixed that. Now it replies: "Hi! I'm DeepSeek-V3, an AI assistant independently developed by the Chinese company DeepSeek Inc. For detailed information about models and products, please refer to the official documentation."
I 100% believe they distilled GPT-4, hence the low "training" cost.