There is no text. The model ingests audio directly and also outputs audio directly.


So they retrained the whole model on audio datasets, and the tokens are now sounds, not words/parts of words?


They trained on text and audio and images. The model accepts tokens of all three types. And it can directly output audio as well as text.
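
If it helps to picture it, here's a rough toy sketch (my own code, not anything from OpenAI) of how audio could become discrete tokens that live in the same vocabulary as the text tokens, so one transformer can consume a single interleaved sequence of both:

    import numpy as np

    TEXT_VOCAB_SIZE = 50_000        # assumed size of the text tokenizer's vocab
    CODEBOOK_SIZE = 1_024           # assumed number of discrete audio codes
    rng = np.random.default_rng(0)

    # Toy codebook. In a real system this would be learned (VQ-VAE style),
    # not random.
    codebook = rng.normal(size=(CODEBOOK_SIZE, 80))   # 80-dim audio frame features

    def audio_to_tokens(frames):
        # Snap each audio frame to its nearest codebook entry, then shift the
        # IDs past the text vocab so audio and text tokens never collide.
        dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        return [TEXT_VOCAB_SIZE + int(c) for c in dists.argmin(axis=1)]

    frames = rng.normal(size=(100, 80))     # stand-in for ~1s of audio features
    sequence = [17, 942, 7] + audio_to_tokens(frames)   # text prompt + audio tokens
    print(len(sequence), sequence[:8])

The actual tokenization scheme isn't public as far as I know; the point is just that "tokens" don't have to mean pieces of words.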


It can also directly output images. Some examples are up on the page. Though with how little coverage that's gotten, I'm not sure users will ever be able to play with that.


People are saying that GPT-4o still uses DALL-E for image generation. I think it doesn't match the quality of dedicated image models yet, which is understandable. I bet it can't generate music as well as Suno or Udio either. But the direction is clear, and I'm sure someday it will generate great images, music, and video. You'll be able to do a video call with it where it generates its own avatar in real time. And they'll add more outputs for keyboard/mouse/touchscreen control, and eventually robot control. GPT-7o is going to be absolutely wild.


Is it a stretch to think this thing could accurately "talk" with animals?


Yes? Why would it be able to do that?


I think they are assuming a world where you take this existing model but train it on a dataset of animals making noises to each other. You could then feed the trained model the vocalization of one animal, and the model would produce a continuation of audio that has a better-than-zero chance of being a realistic sound coming from another animal. In other words, if dogs have some type of bark that encodes an "I found something yummy" message, and other dogs tend to have some bark that encodes "I'm on my way", and we're just oblivious to all of that subtext, then maybe the model would be able to communicate back and forth with an animal in a way that makes "sense" to the animal.

Probably substitute chimps for dogs, though.

But obviously that doesn't solve for human understandability at all, unless maybe you have it all as audio+video and then ask the model to explain what visual often accompanies a specific type of audio. Maybe the model could learn what sounds accompany violence, or accompany the discovery of a source of water, or something.
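
To make the first part concrete, here's a purely made-up toy sketch of the "continue the vocalization" idea (nothing to do with GPT-4o; a real attempt would need an actual audio tokenizer and a transformer, not the bigram counter standing in for the model here):

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(1)
    N_CODES = 64   # hypothetical audio-token vocabulary

    # Pretend corpus: each item is one recorded exchange, already turned
    # into discrete audio tokens.
    corpus = [list(rng.integers(0, N_CODES, size=50)) for _ in range(200)]

    # "Train": count which token tends to follow which (a bigram model).
    counts = defaultdict(lambda: np.zeros(N_CODES))
    for seq in corpus:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1

    def continue_vocalization(prompt_tokens, n_new=25):
        # Sample a continuation one token at a time, the way a language
        # model would, conditioned here only on the previous token.
        out = list(prompt_tokens)
        for _ in range(n_new):
            dist = counts[out[-1]]
            probs = dist / dist.sum() if dist.sum() else np.full(N_CODES, 1 / N_CODES)
            out.append(int(rng.choice(N_CODES, p=probs)))
        return out

    # A tokenized "I found something yummy" bark goes in; the model's idea
    # of a plausible reply comes out, ready to be decoded back into audio.
    print(continue_vocalization([3, 17, 42]))

Whether the continuation means anything to the animal is the whole open question, of course.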


Yep, exactly what brought that to mind. Multimodal seems like the kind of thing needed for such a far-fetched idea.


Not really a stretch in my mind. https://www.earthspecies.org/ and others are working on it already.



