It can also directly output images; some examples are up on the page. Though given how little coverage that's gotten, I'm not sure users will ever get to play with it.
People are saying that GPT-4o still uses DALL-E for image generation. I think it doesn't match the quality of dedicated image models yet, which is understandable. I bet it can't generate music as well as Suno or Udio either. But the direction is clear, and I'm sure someday it will generate great images, music, and video. You'll be able to do a video call with it where it generates its own avatar in real time. And they'll add more outputs for keyboard/mouse/touchscreen control, and eventually robot control. GPT-7o is going to be absolutely wild.
I think they're assuming a world where you take this existing model architecture but train it on a dataset of animals vocalizing at each other. You could then feed the trained model one animal's vocalization, and it would produce a continuation of audio with a better-than-zero chance of being a realistic sound coming from another animal. In other words: if dogs have some type of bark that encodes an "I found something yummy" message, and other dogs tend to respond with a bark that encodes "I'm on my way", and we're just oblivious to all that subtext, then maybe the model could communicate back and forth with an animal in a way that makes "sense" to the animal.
Probably substitute chimps for dogs, though.
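To make the "continuation" idea concrete, here's a toy sketch. Everything in it is invented for illustration: real systems would discretize raw audio with a neural codec and train a big sequence model, whereas this just counts call/response pairs over made-up token labels and picks the most frequent reply.

```python
from collections import Counter, defaultdict

# Pretend each animal call has already been discretized into a token.
# These labels are made up; a real pipeline would produce codec tokens.
dialogues = [
    ["FOUND_FOOD", "ON_MY_WAY"],
    ["FOUND_FOOD", "ON_MY_WAY"],
    ["FOUND_FOOD", "NOT_INTERESTED"],
    ["DANGER", "FLEE"],
    ["DANGER", "FLEE"],
]

# "Train" a bigram continuation model: count P(next call | previous call).
counts = defaultdict(Counter)
for seq in dialogues:
    for prev, nxt in zip(seq, seq[1:]):
        counts[prev][nxt] += 1

def continue_call(prev_call):
    """Return the most likely next call, or None if the call is unseen."""
    if prev_call not in counts:
        return None
    return counts[prev_call].most_common(1)[0][0]
```

So `continue_call("FOUND_FOOD")` yields the reply most often observed after that call; the speculation upthread is essentially this, scaled up to a large model over real audio tokens.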
But obviously that doesn't solve for human-understandability at all. Unless maybe you train on audio+video and then ask the model to explain what visuals often accompany a specific type of audio? Maybe the model could learn which sounds accompany violence, or the discovery of a source of water, or something like that.
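The audio+video grounding idea above can be sketched the same way. Again, every label here is hypothetical: the point is just that co-occurrence counts between sound types and visual contexts could give humans a rough gloss of what a call "means".

```python
from collections import Counter, defaultdict

# Each clip pairs a sound token with the visual label seen at the same
# moment. Labels are invented for illustration only.
clips = [
    ("ALARM_BARK", "predator"),
    ("ALARM_BARK", "predator"),
    ("ALARM_BARK", "stranger"),
    ("EXCITED_BARK", "water"),
    ("EXCITED_BARK", "water"),
]

# Count which visual contexts co-occur with each sound.
cooccur = defaultdict(Counter)
for sound, visual in clips:
    cooccur[sound][visual] += 1

def likely_context(sound):
    """Most common visual context for a sound, or None if unseen."""
    if sound not in cooccur:
        return None
    return cooccur[sound].most_common(1)[0][0]
```

Here `likely_context("ALARM_BARK")` would surface "predator" as the usual accompanying visual, which is the kind of human-readable gloss the comment is hoping for.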