My layperson interpretation of this particular error was that the AI model probably came up with the initial recipe response in full, but when the audio of that response was cut off by the user's interruption, the model wasn't given any context about where it was cut off, so it didn't understand that the user hadn't heard the first part of the recipe.
I assume the responses from that point onwards didn't take the video input into account, and the model just assumed the user had completed the first step based on the conversation history. I don't know how these 'live' AI sessions work, but based on the existing OpenAI/Gemini live chat products it seems to me that most of the time the model will immediately comment on the video when the 'live' chat starts, while for the rest of the conversation it works using TTS+STT unless the user asks the AI to consider the visual input.
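To make that guess concrete, here's a minimal sketch of the turn loop I'm imagining, assuming the video frame only gets attached on the first turn or when the user explicitly mentions looking at something. All the function names here (transcribe, capture_frame, llm_reply, speak) are hypothetical placeholders for illustration, not any real product's API.

```python
# Hedged sketch of the speculated pipeline: the model only "sees" the camera
# at session start or when the user explicitly asks about it; otherwise it's
# a plain STT -> LLM -> TTS loop over conversation history.

def transcribe(audio_chunk: bytes) -> str:
    """Placeholder STT step (hypothetical)."""
    return audio_chunk.decode("utf-8", errors="ignore")

def capture_frame() -> str:
    """Placeholder for grabbing the current camera frame (hypothetical)."""
    return "<current video frame>"

def llm_reply(history: list[dict], frame: str | None) -> str:
    """Placeholder LLM call; a real system would send history (+ frame) to a model."""
    seen = " (looking at the camera)" if frame else ""
    return f"Reply to: {history[-1]['content']}{seen}"

def speak(text: str) -> None:
    """Placeholder TTS step (hypothetical)."""
    print("AI:", text)

def live_session(audio_turns: list[bytes]) -> None:
    history: list[dict] = []
    for i, chunk in enumerate(audio_turns):
        user_text = transcribe(chunk)
        history.append({"role": "user", "content": user_text})
        # Speculation: vision is only consulted on the first turn, or when the
        # user explicitly refers to what's on camera.
        wants_vision = i == 0 or any(w in user_text.lower() for w in ("look", "see"))
        frame = capture_frame() if wants_vision else None
        reply = llm_reply(history, frame)
        history.append({"role": "assistant", "content": reply})
        speak(reply)

if __name__ == "__main__":
    live_session([
        b"What can I cook with these ingredients?",
        b"Okay, what's the next step?",
        b"Can you look at the pan and tell me if it's done?",
    ])
```

If that's roughly how it works, it would explain why an interruption breaks things: the model keeps responding off the text history alone and never re-checks the video to see where the user actually is.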
I guess if you have enough experience with these live AI sessions you can probably see why it's going wrong and steer it back in the right direction with more explicit instructions, but that wouldn't look very slick in a developer keynote. I think in reality this feature could still be pretty useful as long as you aren't expecting it to be as smooth as talking to a real person.
If someone said "The earth is round and anybody who says it isn't doesn't know what they are talking about" would you still challenge their intellectual honesty in this way?
Then how come in face-to-face interactions people generally communicate using speech rather than text?
Clearly there's a disadvantage to using text in that situation, and I think it's that it almost always takes longer to express thoughts/intents using text. ISTM a sufficiently advanced computer voice interface would have the same advantage.
Because it allows people to communicate when they're not in close physical proximity. Would you rather go out to dinner with friends and just speak to each other or sit there and type your conversation out in a WhatsApp group chat?
It's a convenience/necessity thing, pure and simple.
I said I was talking about face-to-face (or 'in person', as you put it) communication. You're absolutely right that over long distances people prefer to communicate by text, but in person people prefer to communicate by speech, so that's exactly my point: there are at least some contexts in which people prefer speech.
I guess I could also follow suit and return your weird toxic/patronising insult here too since you clearly didn't understand my original comment, but perhaps it would be nicer if we didn't do that?
That's funny, the way I interpreted this sentence is that usage was already high among older, male users in high-income countries, so most of the new users are coming from outside those demographics. Which, ironically, is the exact opposite of what you're saying.
You read "Users are younger, increasingly female, global, and adoption is growing fastest in lower-income countries" and gathered that "Young moms with no money in poor countries use this product the most". Do I really need to spell out the fact that you completely failed to understand basic English here?
The restaurant spends resources (both physical and human) cooking and serving you the meal; likewise for the barber. A better example would be showing up late for a cinema showing so that you deliberately avoid watching the adverts and trailers... which I would guess most people would agree is morally fine?
The more direct cinema example would be sneaking into the theater when there are empty seats (so you haven't denied anyone else access to the movie). Is that morally fine? You watched the movie, but the creator doesn't get paid.