How far are we from something like a helmet with ChatGPT and a video camera installed? I imagine this would be awesome for low-vision people. Imagine having a guide tell you how to walk to the grocery store and help you grocery shop without an assistant. Of course you have tons of liability issues here, but this is very impressive.
We're planning on getting a phone-carrying lanyard, and she will just carry her phone around her neck with Be My Eyes^0 looking out through the rear camera. She's DeafBlind, so it'll be connected over Bluetooth to her hearing aids, and she can interact with the world through the conversational AI.
I helped her access the video from the presentation, and it brought her to tears. Now she can play guitar, and she and the AI can write songs and sing them together.
This is a big day in the lives of a lot of people who aren't normally part of the conversation. As of today, they are.
That story has always been completely reasonable and plausible to me. Incredible foresight. I guess I should start a midlevel management voice automation company.
Definitely heading there:
https://marshallbrain.com/manna
"With half of the jobs eliminated by robots, what happens to all the people who are out of work? The book Manna explores the possibilities and shows two contrasting outcomes, one filled with great hope and the other filled with misery."
And here are some ideas I put together around 2010 on how to deal with the socio-economic fallout from AI and other advanced technology:
https://pdfernhout.net/beyond-a-jobless-recovery-knol.html
"This article explores the issue of a "Jobless Recovery" mainly from a heterodox economic perspective. It emphasizes the implications of ideas by Marshall Brain and others that improvements in robotics, automation, design, and voluntary social networks are fundamentally changing the structure of the economic landscape. It outlines towards the end four major alternatives to mainstream economic practice (a basic income, a gift economy, stronger local subsistence economies, and resource-based planning). These alternatives could be used in combination to address what, even as far back as 1964, has been described as a breaking "income-through-jobs link". This link between jobs and income is breaking because of the declining value of most paid human labor relative to capital investments in automation and better design. Or, as is now the case, the value of paid human labor like at some newspapers or universities is also declining relative to the output of voluntary social networks such as for digital content production (like represented by this document). It is suggested that we will need to fundamentally reevaluate our economic theories and practices to adjust to these new realities emerging from exponential trends in technology and society."
And a related YouTube video:
"The Richest Man in the World: A parable about structural unemployment and a basic income"
https://www.youtube.com/watch?v=p14bAe6AzhA
"A parable about robotics, abundance, technological change, unemployment, happiness, and a basic income."
My sig is about the deeper issue here though: "The biggest challenge of the 21st century is the irony of technologies of abundance in the hands of those still thinking in terms of scarcity."
Your last quote also reminds me that this may be true for everything else, especially our diets.
Technology has leapfrogged nature, and our consumption patterns have not caught up to modern abundance. Scott Galloway recently mentioned this in his OMR speech and speculated that GLP-1 drugs (which actually help with addiction) will assist in bringing our biological impulses more in line with current reality.
Indeed, they are related. A 2006 book on eating healthier called "The Pleasure Trap: Mastering the Hidden Force that Undermines Health & Happiness" by Douglas J. Lisle and Alan Goldhamer helped me see that connection (so, actually going the other way at first). A later book from 2010 called "Supernormal Stimuli: How Primal Urges Overran Their Evolutionary Purpose" by Deirdre Barrett expanded that idea beyond food to media, gaming, and more. The 2010 essay "The Acceleration of Addictiveness" by Paul Graham also explores those themes. And in the 2007 book The Assault on Reason, Al Gore talks about watching television and the orienting response to sudden motion like scene changes.
In short, humans are adapted for a world with a scarcity of salt, refined carbs like sugar, fat, information, sudden motion, and more. But the world most humans live in now has an abundance of those things -- and our previously-adaptive evolved inclinations to stock up on salt/sugar/fat (especially when stressed) or to pay attention to the unusual (a cause of stress) are now working against our physical and mental health in this new environment. Thanks for the reference to a potential anti-addiction substance. Definitely something that deserves more research.
My sig -- informed by the writings of people like Mumford, Einstein, Fuller, Hogan, Le Guin, Banks, Adams, Pet, and many others -- is making the leap to how that evolutionary-mismatch theme applies to our use of all sorts of technology.
Here is a deeper exploration of that in relation to militarism (and also commercial competition to some extent):
https://pdfernhout.net/recognizing-irony-is-a-key-to-transce...
"There is a fundamental mismatch between 21st century reality and 20th century security thinking. Those "security" agencies are using those tools of abundance, cooperation, and sharing mainly from a mindset of scarcity, competition, and secrecy. Given the power of 21st century technology as an amplifier (including as weapons of mass destruction), a scarcity-based approach to using such technology ultimately is just making us all insecure. Such powerful technologies of abundance, designed, organized, and used from a mindset of scarcity could well ironically doom us all whether through military robots, nukes, plagues, propaganda, or whatever else... Or alternatively, as Bucky Fuller and others have suggested, we could use such technologies to build a world that is abundant and secure for all. ... The big problem is that all these new war machines and the surrounding infrastructure are created with the tools of abundance. The irony is that these tools of abundance are being wielded by people still obsessed with fighting over scarcity. So, the scarcity-based political mindset driving the military uses the technologies of abundance to create artificial scarcity. That is a tremendously deep irony that remains so far unappreciated by the mainstream."
Conversely, reflecting on this more just now, are we perhaps evolutionarily adapted to take for granted some things like social connections, being in natural green spaces, getting sunlight, getting enough sleep, or getting physical exercise? These are all things that are in increasingly short supply in the modern world for many people -- but which there may never have been much evolutionary pressure to seek out, since they were previously always available.
For example, in the past humans were pretty much always in face-to-face interactions with others of their tribe, so there was no big need to seek that out especially if it meant ignoring the next then-rare new shiny thing. Johann Hari and others write about this loss of regular human face-to-face connection as a major cause of depression.
Stephen Ilardi expands on that in his work, which brings together many of these themes and tries to help people address them to move to better health.
From: https://tlc.ku.edu/
"We were never designed for the sedentary, indoor, sleep-deprived, socially-isolated, fast-food-laden, frenetic pace of modern life. (Stephen Ilardi, PhD)"
GPT-4o, by apparently providing "Her"-movie-like engaging interactions with an AI avatar that seeks to please the user (while possibly exploiting them), is yet another example of our evolutionary tendencies potentially being used to our detriment. And when our social lives are filled to overflowing with "junk" social relationships with AIs, will most people have the inclination to seek out other real humans if it involves doing perhaps increasingly-uncomfortable-from-disuse actions (like leaving the home or putting down the smartphone)? Not quite the same, but consider: https://en.wikipedia.org/wiki/Hikikomori
Related points by others:
"AI and Trust"
https://www.schneier.com/blog/archives/2023/12/ai-and-trust....
"In this talk, I am going to make several arguments. One, that there are two different kinds of trust—interpersonal trust and social trust—and that we regularly confuse them. Two, that the confusion will increase with artificial intelligence. We will make a fundamental category error. We will think of AIs as friends when they’re really just services. Three, that the corporations controlling AI systems will take advantage of our confusion to take advantage of us. They will not be trustworthy. And four, that it is the role of government to create trust in society. And therefore, it is their role to create an environment for trustworthy AI. And that means regulation. Not regulating AI, but regulating the organizations that control and use AI."
"The Expanding Dark Forest and Generative AI - Maggie Appleton"
https://youtu.be/VXkDaDDJjoA?t=2098 (in the section on the lack of human relationship potential when interacting with generated content)
This Dutch book [1] by Gummbah has the text "Kooptip" imprinted on the cover, which would roughly translate to "Buying recommendation". It worked for me!
Does it give you voice instructions based on what it knows or is it actively watching the environment and telling you things like "light is red, car is coming"?
Just the ability to distinguish bills would be hugely helpful, although I suppose that's much less of a problem these days with credit cards and digital payment options.
With this capability, how close are y'all to it being able to listen to my pronunciation of a new language (e.g. Italian) and give specific feedback about how to pronounce it like a local?
It completely botched teaching someone to say “hello” in Chinese - it used the wrong tones and then incorrectly told them their pronunciation was good.
As for the Mandarin tones, the model might have mixed them up with the tones from a dialect like Cantonese. It’s interesting to discover how much difference a more specific prompt could make.
I don't know if my iOS app is using GPT-4o, but asking it to translate to Cantonese gives you gibberish. It gave me the correct characters, but the Jyutping was completely unrelated. Funny thing is that the model pronounced the incorrect Jyutping plus said the numbers (for the tones) out loud.
I think there is too much focus on tones in beginning Chinese. Yes, you should get them right, but you'll get better as long as you speak more, even if your tones are wrong at first. So rather than remembering how to say fewer words with the right tones, you'll get farther if you can say more words with whatever tones you feel like applying. That "feeling" will just get better over time. Until then, you'll talk about as well as a farmer coming in from the countryside whose first language isn't Mandarin.
I couldn’t disagree more. Everyone can understand some common tourist phrases without tones - and you will probably get a lot of positive feedback from Chinese people. It’s common to view a foreigner making an attempt at Mandarin (even a bad one) as a sign of respect.
But for conversation, you can’t speak Mandarin without using proper tones because you simply won’t be understood.
That really isn't true, or at least it isn't true once you have some practice. You don't have to consciously think about or learn tones, but you will eventually pick them up anyway (tones are learned unconsciously via lots of practice trying to speak and be understood).
You can be perfectly understood if you don't speak broadcast Chinese. There are plenty of heavy accents to deal with anyways. Like Beijing 儿化 or the inability of southerners to pronounce sh very differently from s.
People always say tech workers are all white guys -- it's such a bizarre delusion, because if you've ever actually seen software engineers at most companies, a majority of them are not white. Not to mention that product/project managers, designers, and QA are all intimately involved in these projects, and in my experience those departments tend to have a much higher ratio of women.
Even beside that though -- it's patently ridiculous to suggest that these devices would perform worse with an Asian man who speaks fluent English and was born in California. Or a white woman from the Bay Area. Or a white man from Massachusetts.
You kind of have a point about tech being the product of the culture in which it was produced, but the needless exaggerated references to gender and race undermine it.
An interesting point: I tend to have better outcomes using my heavily accented ESL English than my native pronunciation of my mother tongue.
I'm guessing it's part of the tech work force being a bit more multicultural than initially thought, or it just being easier to test with
It's a shame, because that means I can use stuff that I can't recommend to people around me
Multilingual UX is an interesting pain point. I had to change the language of my account to English so I could use some early Bard version, even though it was perfectly able to understand and answer in Spanish.
You also get the synchronicity / four-minute-mile effect egging on other people to excel with specialized models, like Falcon or Qwen did in the wake of the original ChatGPT/Llama excitement.
I don't think that'd work without a dedicated startup behind it.
The first (and imo the main) hurdle is not reproduction, but just learning to hear the correct sounds. If you don't speak Hindi and are a native English speaker, this [1] is a good example. You can only work on nailing those consonants when they become as distinct to your ear as cUp and cAp are in English.
We can get by by falling back to context (it's unlikely someone would ask for a "shit of paper"!), but it's impossible to confidently reproduce the sounds unless they are already completely distinct in our heads/ears.
That's because we think we hear things as they are, but it's an illusion. Cup/cap distinction is as subtle to an Eastern European as Hindi consonants or Mandarin tones are to English speakers, because the set of meaningful sounds distinctions differs between languages. Relearning the phonetic system requires dedicated work (minimal pairs is one option) and learning enough phonetics to have the vocabulary to discuss sounds as they are. It's not enough to just give feedback.
> but it's impossible to confidently reproduce the sounds unless they are already completely distinct in our heads/ears
Interestingly, I think this isn't always true -- I was able to coach my native-Spanish-speaking wife to correctly pronounce "v" vs "b" (both are just "b" in Spanish, or at least in her dialect) before she could hear the difference; later on she developed the ability to hear it.
I had a similar experience learning Mandarin as a native English speaker in my late 30s. I learned to pronounce the ü sound (which doesn't exist in English) by getting feedback and instruction from a teacher about what mouth shape to use. And then I just memorized which words used it. It was maybe a year later before I started to be able to actually hear it as a distinct sound rather than perceiving it as some other vowel.
After watching the demo, my question isn't about how close it is to helping me learn a language, but about how close it is to being me in another language.
Even styles of thought might be different in other languages, so I don't say that lightly... (stay strong, Sapir-Whorf, stay strong ;)
I was conversing with it in Hinglish (a combination of Hindi and English), which folks in urban India use, and it was pretty on point apart from some use of esoteric Hindi words. But I think with the right prompting we can fix that.
I'm a Spaniard and to my ears it clearly sounds like "Es una manzana y un plátano".
What's strange to me is that, as far as I know, "plátano" is only commonly used in Spain, but the accent of the AI voice didn't sound like it's from Spain. It sounds more like an American who speaks Spanish as a second language, and those folks typically speak some Mexican dialect of Spanish.
Interesting, I was reading some comments from Japanese users and they said the Japanese voice sounds like a (very good N1 level) foreigner speaking Japanese.
At least IME, and there may be regional or other variations I’m missing, people in México tend to use “plátano” for bananas and “plátano macho” for plantains.
In Spain, it's like that. In Latin America, it was always "plátano," but in the last ten years, I've seen a new "global Latin American Spanish" emerging that uses "banana" for Cavendish, some Mexican slang, etc. I suspect it's because of YouTube and Twitch.
The content was correct but the pronunciation was awful. Is it good enough? For sure, but I would not be able to stand something talking like that all the time.
Most people don't, since you either speak with native speakers or you speak in English: in international teams you speak English rather than one of the native languages, even if nobody speaks English natively. So it is rare to hear broken non-English.
And note that understanding broken language is a skill you have to train. If you aren't used to it then it is impossible to understand what they say. You might not have been in that situation if you are an English speaker since you are so used to broken English, but it happens a lot for others.
It sounds like a generic Eastern European who has learned some Italian. The girl in the clip did not sound native Italian either (or she has an accent that I have never heard in my life).
This is damn near one of the most impressive things. I can only imagine what you'd be capable of with live translation and voice synthesis (ElevenLabs style) integrated into something like Teams: select each person's language and do real-time translation into each person's native language, with their own voice and intonation. That would be NUTS.
By humanity you mean Microsoft's shareholders right? Cause for regular people all this crap means is they have to deal with even more spam and scams everywhere they turn. You now have to be paranoid about even answering the phone with your real voice, lest the psychopaths on the other end record it and use it to fool a family member.
Yeah, real win for humanity, and not the psycho AI sycophants
Random OpenAI question: while the GPT models have become ever cheaper, the price for the TTS models has stayed in the $15 per 1M characters range. I was hoping this would also become cheaper at some point. There are so many apps (e.g. language learning) that quickly become too expensive given these prices. With the GPT-4o voice (which sounds much better than the current TTS or TTS HD endpoint) I thought maybe the prices for TTS would go down. Sadly that hasn't happened. Is that something on the OpenAI agenda?
I've always wondered what GPT models lack that makes them "query -> response" only. I've tried to get chatbots to drop the initially needed query, to no avail. What would it take to get a GPT model to freely generate tokens in a thought-like pattern? I think when I'm alone, without a query from another human. Why can't they?
> What would it take to get a GPT model to freely generate tokens in a thought-like pattern?
That’s fundamentally not how GPT models work, but you can easily build a framework around them that calls them in a loop. You’d need a special system prompt to get anything “thought-like” that way, and, if you want it to be anything other than a stream of simulated consciousness with no relevance to anything, a non-empty “user” prompt each round, which could be as simple as the time, a status update on something in the world, etc.
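A minimal sketch of that kind of loop, assuming the OpenAI Python client; the model name, system prompt, and memory handling are just illustrative placeholders, not how any particular product does it:

    # Call the model in a loop, feeding back recent "thoughts" plus a small
    # real-world update each round. Purely a sketch, not production code.
    import time
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    SYSTEM = ("You are thinking out loud. Each round you get the current time "
              "and your recent thoughts; continue the train of thought briefly.")
    thoughts = []

    for _ in range(5):  # a real framework would loop indefinitely
        user_msg = "Time: " + time.ctime() + "\nRecent thoughts:\n" + "\n".join(thoughts[-3:])
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": SYSTEM},
                      {"role": "user", "content": user_msg}],
        )
        thoughts.append(resp.choices[0].message.content)
        print(thoughts[-1])
        time.sleep(10)  # pace the loop

Without the injected "user" update each round, this degenerates into the model just riffing on its own output.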
Monkeys who've been trained since birth to use sign language, and who can reply to incredible questions, have the same issue. The researchers noticed they never once asked a question like "why is the sky blue?" or "why do you dress up?". Zero initiating conversation, but they do reply when you ask what they want.
I suppose it would cost even more electricity to have ChatGPT musing alone though, burning through its nvidia cards...
I think this will be key in a logical proof that statistical generation can never lead to sentience; Penrose will be shown to be correct, at least regarding the computability of consciousness.
You could say, in a sense, that without a human mind to collapse the wave function, the superposition of data in a neural net's weights can never have any meaning.
Even when we build connections between these statistical systems to interact with each other in a way similar to contemplation, they still require a human-created nucleation point on which to root the generation of their ultimate chain of outputs.
I feel like the fact that these models contain so much data has gripped our hardwired obsession for novelty and clouds our perception of their actual capacity to do de novo creation, which I think will be shown to be nil.
An understanding of how LLMs function should probably make this intuitively clear. Even with infinite context and infinite ability to weigh conceptual relations, they would still sit lifeless for all time without some, any, initial input against which they can run their statistics.
It happens sometimes. Just the other day a local TinyLlama instance started asking me questions.
The chat memory was full of mostly nonsense, and it asked me a completely random and simple question out of the blue: whether chatbots had evolved a lot since it was created.
I think you can get models to "think" if you give them a goal in the system prompt, a memory of previous thoughts, and keep invoking them with cron
Yes, but that's the fundamental difference. Even if I closed my eyes, plugged my ears and nose, and lay in a saltwater floating chamber, my brain would always generate new input / noise.
(GPT) models toggle between existing when queried and ceasing to exist when not.
They are designed for query and response. They don't do anything unless you give them input. Also, there's not much research on the best architecture for running continuous thought loops in the background and how to mix them into the conversational "context". Current LLMs only emulate single thought synthesis based on long-term memory recall (and some go off to query the Internet).
> I think when I'm alone without query from another human.
You are actually constantly queried, but it's stimulation from your senses. There are also neurons in your brain which fire regularly, like a clock that ticks every second.
Do you want to make a system that thinks without input? Then you need to add hidden stimuli via a non-deterministic random number generator, preferably a quantum-based RNG (or it won't be possible to claim the resulting system has free will). Even a single photon hitting your retina can affect your thoughts, and there are no doubt other quantum effects that trip neurons in your brain above the firing threshold.
I think you need at least three or four levels of loops interacting, with varying strength between them. The first level would be the interface to the world, the input and output level (video, audio, text). Data from here is high priority and capable of interrupting lower levels.
The second level would be short-term memory and context switching. Conversations need to be classified and stored in a database, and you need an API to retrieve old contexts (conversations). You also possibly need context compression (summarization of conversations in case you're about to hit a context window limit).
The third level would be the actual "thinking": a loop that constantly talks to itself to accomplish a goal using the data from all the other levels, but mostly driven by the short-term memory. Possibly you could go super-human here and spawn multiple worker processes in parallel. You need to classify the memories by asking: do I need more information? Where do I find this information? Do I need an algorithm to accomplish a task? What are the completion criteria? Everything here is powered by an algorithm. You would take your data and produce a list of steps that you have to follow to resolve to a conclusion.
Everything you do as a human to resolve a thought can be expressed as a list or tree of steps.
If you've had a conversation with someone and you keep thinking about it afterwards, what has happened is basically that you have spawned a "worker process" that tries to come to a conclusion that satisfies some criteria. Perhaps there was ambiguity in the conversation that you are trying to resolve, or the conversation gave you some chemical stimulation.
The last level would be subconscious noise driven by the RNG; this would filter up with low priority. In the absence of other external stimuli with higher priority, or currently running thought processes, this would drive the spontaneous self-thinking portion (and dreams).
Implement this and you will have something more akin to true AGI (whatever that is) on a very basic level.
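As a very rough toy illustration of those levels (everything here -- the priority numbers, the thread layout, the stubbed-out "think" step -- is invented for illustration; a real system would put an LLM call and proper memory behind it):

    # Toy sketch of the layered loops described above. Priorities and names are
    # invented; the "thinking" step is a stub where an LLM call would go.
    import queue, random, threading, time

    events = queue.PriorityQueue()   # lower number = higher priority
    memory = []                      # level 2: crude short-term memory

    def senses():                    # level 1: external input, highest priority
        while True:
            events.put((0, "sensor tick at " + time.ctime()))
            time.sleep(5)

    def subconscious_noise():        # level 4: low-priority random stimuli
        while True:
            events.put((2, "random seed %.3f" % random.random()))
            time.sleep(1)

    def think(stimulus, context):    # level 3: the "thinking" step (LLM stub)
        return "thought about '%s' given %d memories" % (stimulus, len(context))

    threading.Thread(target=senses, daemon=True).start()
    threading.Thread(target=subconscious_noise, daemon=True).start()

    for _ in range(10):              # main loop: handle the highest-priority stimulus first
        _, stimulus = events.get()
        memory.append(think(stimulus, memory[-5:]))
        print(memory[-1])

When the "senses" thread is quiet, the low-priority random noise is what keeps the main loop producing thoughts, which is the spontaneous self-thinking behavior described above.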
In my ChatGPT app and on the website I can select GPT-4o as a model, but my model doesn't seem to work like the demo. The voice mode is the same as before, the images come from DALL-E, and ChatGPT doesn't seem to understand or modify them any better than previously.
I couldn’t quite tell from the announcement, but is there still a separate TTS step, where GPT is generating tones/pitches that are to be used, or is it completely end to end where GPT is generating the output sounds directly?
Very exciting, would love to read more about how the architecture of the image generation works. Is it still a diffusion model that has been integrated with a transformer somehow, or an entirely new architecture that is not diffusion based?
Licensing the emotion-intoned TTS as a standalone API is something I would look forward to seeing. Not sure how feasible that would be if, as a sibling comment suggested, it bypasses the text-rendering step altogether.
Is it possible to use this as a TTS model? I noticed on the announcement post that this is a single model as opposed to a text model being piped to a separate TTS model.
The web page implies you can try it immediately. Initially it wasn't available.
A few hours later it was in both the web UI and the mobile app - I got a popup telling me that GPT-4o was available. However, nothing seems to be any different. I'm not given any option to use video as an input, and the app can't seem to pick up any new info from my voice.
I'm left a bit confused as to what I can do that I couldn't do before. I certainly can't seem to recreate much of the stuff from the announcement demos.
Sorry to hijack, but how the hell can I solve this? I have the EXACT SAME error on two iOS devices (native app only — web is fine), but not on Android, Mac, or Windows.
It's really how it works.