During pre-training the model is learning next-token prediction, which is naturally additive. Even if you added DEL as a token, it would still be quite hard to transform the data so that it can be used in a next-token prediction task.
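To make the "additive" point concrete, here's a toy sketch (the <DEL> token and the edit trace are hypothetical, invented for illustration): ordinary scraped text gives you (prefix -> next token) training pairs for free, whereas deletions only show up if someone synthesizes edit histories, which the web doesn't record.

  # Next-token prediction: every position in ordinary text is a free training pair.
  tokens = ["The", "cat", "sat"]
  pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
  # -> [(["The"], "cat"), (["The", "cat"], "sat")]

  # A hypothetical <DEL> token only helps if the data contains edit traces like
  # this one -- but scraped text is final drafts, so these traces would have to
  # be synthesized somehow.
  edit_trace = ["The", "dog", "<DEL>", "cat", "sat"]  # "dog" typed, then deleted
  del_pairs = [(edit_trace[:i], edit_trace[i]) for i in range(1, len(edit_trace))]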
Hope that helps
Sure, but the extent to which you bend the truth to get those impressive numbers is absolutely gotcha-able.
Showing a new screen by default to everyone using your main product flow, and then claiming that everyone who sees it is a priori a "user", is absurd. And that is the only way they can get to 2 billion a month, by my estimation.
They could put a new yellow rectangle at the top of all Google search results and claim that the product launch has reached 2 billion monthly users and is one of the fastest-growing products of all time. Clearly absurd, and it's the same math as what they are saying here. I'm claiming my hot-take gotcha :)
Anthropic has amazing scientists and engineers, but when it comes to results that align with the narrative of LLMs being conscious, intelligent, or having similar properties, they tend to blow the results out of proportion.
Edit: In my opinion at least. Maybe they would say that if models are exhibiting that stuff 20% of the time nowadays, then we're a few years away from that reaching > 50%, or some other argument that I would probably disagree with.
Not necessarily meaningless, but maybe relative, i.e., a person who generally replaces non-Apple laptops every X years would replace MacBooks every Y years, with Y > X.
Mixture of Experts isn't using multiple models with different specialties; it's more of a sparsity technique, where you massively increase the number of parameters but use only a subset of the weights in each forward pass.
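For intuition, a toy PyTorch sketch (the class name, sizes, and plain-Linear "experts" are all made up; real MoE layers route to expert MLPs inside a transformer block). Total parameters scale with num_experts, but each token only pays the compute for top_k experts:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class TinyMoE(nn.Module):
      def __init__(self, dim=64, num_experts=8, top_k=2):
          super().__init__()
          # Many experts -> lots of parameters...
          self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
          self.router = nn.Linear(dim, num_experts)  # learned gating
          self.top_k = top_k                         # ...but only top_k run per token

      def forward(self, x):  # x: (num_tokens, dim)
          scores = self.router(x)                         # (num_tokens, num_experts)
          top_w, top_i = scores.topk(self.top_k, dim=-1)  # pick k experts per token
          top_w = F.softmax(top_w, dim=-1)                # normalize the k gate weights
          out = torch.zeros_like(x)
          for slot in range(self.top_k):
              for e, expert in enumerate(self.experts):
                  mask = top_i[:, slot] == e              # tokens whose slot-th pick is e
                  if mask.any():
                      out[mask] += top_w[mask, slot].unsqueeze(1) * expert(x[mask])
          return out

  y = TinyMoE()(torch.randn(16, 64))  # 8 experts' worth of params, 2 used per token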