Not only by providing the correct SotA, but also noting that the graduate student, probably at an expensive University, was so "cheap" as not to buy the cheap tools for their research. Imagine physicists from the 1900s working without tools and not being able to do experiments because "we would have to buy radium so let's try with free iron that I have instead". "Radioactivity is not a thing".
Yes, totally, especially given this was written only a month ago!
The student referred me to a recent arXiv paper 2303.12712 [cs.CL] about GPT-4, which is apparently behind a paywall at the moment but does even better than the system he could use (https://chat.openai.com/).
I wonder whether the graduate student considered paying the $20 and/or asking Knuth to pay.
The game “20 questions” is probably the hardest I’ve seen chatGPT fail.
What’s interesting about the game is that, at first pass, there’s no ambiguity. All questions need to be answered with “Yes” or “No”. But many questions asked during the game actually have answers of “it depends”.
For example, I was thinking of “peanut butter” and chatGPT asked me “Does it fit in your hand?” as well as “Is it used in the kitchen?”. Given my answers, chatGPT spent the back half of its questions on different kitchen utensils. It never once considered backing up and verifying that there wasn’t some misunderstanding.
I played three games with it, and it made the same mistake each time.
Of course, playing the game via text loses a lot of information relative to playing IRL with your friends. In person, the answerer would pause, hum, and otherwise demonstrate that the question asked was ambiguous given the restrictions of the game.
Regardless, it was clear that chatGPT wasn’t accounting for ambiguity.
> It never once considered backing up and verifying that there wasn’t some misunderstanding.
Of course not; ChatGPT doesn't "consider". It doesn't think, it doesn't know. It can't identify that there was a misunderstanding of its own volition.
All ChatGPT does is use a (very sophisticated!) statistical analysis to generate text that conforms to an expectation of what a human response to a similar prompt might look like. It has been trained well insofar as it is able to produce responses that seem like a human may have written them, but it doesn't reveal cognitive processes like "reconsidering" because it doesn't have any.
20-some years ago, I had this "20 questions" handheld electronic game that was eerily good at winning. I imagine it was a bunch of well-programmed tables of data, but in any case, it's certainly possible for a machine to do well at this game.
I think the more we see ChatGPT do things like "oh, I know this game -- I'm going to run a 20-year-old 20 Questions subroutine that is not part of my neural network language model to generate responses", it will become even more impressive.
> I think the more we see ChatGPT do things like "oh, I know this game -- I'm going to run a 20-year-old 20 Questions subroutine that is not part of my neural network language model to generate responses", it will become even more impressive.
Agreed. Incidentally I’ve built a little toy version of a runtime for exactly this purpose - there’s a translation layer that’s given a bunch of available “APIs” (fed through the LLM context), and breaks down a high level goal into a structured series of API calls.
The runtime parses these API calls and natively executes some (e.g. run a program, write to the file system), while others result in LLM invocations.
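For anyone curious what that kind of dispatch loop looks like, here is a minimal sketch. It is not the actual runtime described above: the JSON plan format, the API names (`write_file`, `read_file`, `ask_llm`), and the in-memory `FILES` dict standing in for the file system are all made up for illustration, and `api_ask_llm` is a placeholder where a real model call would go.

```python
import json

FILES = {}  # hypothetical in-memory stand-in for the file system


def api_write_file(path, text):
    """Native API: store text under a path."""
    FILES[path] = text
    return f"wrote {len(text)} chars to {path}"


def api_read_file(path):
    """Native API: return the text stored under a path."""
    return FILES[path]


def api_ask_llm(prompt):
    """Placeholder: a real runtime would invoke the language model here."""
    return f"<llm response to {prompt!r}>"


# Calls the runtime executes itself vs. calls it hands back to the model.
NATIVE_APIS = {"write_file": api_write_file, "read_file": api_read_file}
LLM_APIS = {"ask_llm": api_ask_llm}


def run_plan(plan_json):
    """Parse a JSON list of {"api": ..., "args": {...}} steps and run them in order."""
    results = []
    for step in json.loads(plan_json):
        name = step["api"]
        args = step.get("args", {})
        handler = NATIVE_APIS.get(name) or LLM_APIS.get(name)
        if handler is None:
            raise ValueError(f"unknown API: {name}")
        results.append(handler(**args))
    return results


# The kind of structured plan a translation layer might emit for a high-level goal.
PLAN = json.dumps([
    {"api": "write_file", "args": {"path": "notes.txt", "text": "hello"}},
    {"api": "read_file", "args": {"path": "notes.txt"}},
    {"api": "ask_llm", "args": {"prompt": "Summarize notes.txt"}},
])

if __name__ == "__main__":
    for result in run_plan(PLAN):
        print(result)
```

The point of the split is that anything deterministic never goes through the model at all; only the steps that genuinely need language generation do.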
I’m sure OpenAI and crew are way ahead of me here, of course. I’m excited to see what the future holds in this field.
The first AI-style program I ever wrote (about 25 years ago. Yes, I'm old) played 20 questions, but it would "learn" from prior games, so the more you played, the better it performed.
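A rough sketch of how a "learning" 20 questions program of that sort can work, assuming the classic approach of a binary tree of yes/no questions that grows whenever a guess is wrong. This is not the original program; the `Node` class and the `answer`/`teach` callbacks are hypothetical names chosen so the example runs without keyboard input.

```python
class Node:
    """A question (inner node) or a guess (leaf) in the yes/no tree."""

    def __init__(self, text, yes=None, no=None):
        self.text = text
        self.yes = yes
        self.no = no

    def is_leaf(self):
        return self.yes is None and self.no is None


def play(root, answer, teach):
    """Play one game and learn from a wrong guess.

    answer(question) -> bool gives the player's yes/no reply.
    teach(wrong_guess) -> (thing, question) supplies the right thing and a
    question whose answer is yes for it and no for the wrong guess.
    Returns True if the program guessed correctly.
    """
    node = root
    while not node.is_leaf():
        node = node.yes if answer(node.text) else node.no
    if answer(f"Is it {node.text}?"):
        return True
    thing, question = teach(node.text)
    # Turn the wrong leaf into a question that separates the two things.
    node.yes = Node(thing)
    node.no = Node(node.text)
    node.text = question
    return False


if __name__ == "__main__":
    tree = Node("a cat")  # the program starts out knowing one thing
    thinking_of_dog = lambda q: q in {"Does it bark?", "Is it a dog?"}
    teach_dog = lambda wrong: ("a dog", "Does it bark?")
    print(play(tree, thinking_of_dog, teach_dog))  # first game: it guesses a cat
    print(play(tree, thinking_of_dog, teach_dog))  # second game: it has learned
```

Every lost game adds one question and one new thing to the tree, which is why this kind of program gets better the more it is played.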
Yeah, ChatGPT could integrate Akinator[0] and trivially be great at the game. Without the help, though, it's a good, revealing benchmark for the LLM's ability.
LLMs, for the foreseeable future, function most reliably as a user interface layer for other systems. I use GPT to “translate” natural language down into the API calls that get real data and it works great. I’d never trust it beyond that.
"Green Glass Door" also completely stumped it. It just could not deduce that the trick was semantic at the word representation level, rather than something related to the object that the word describes.
What's funny about 20 questions is that Akinator has been absolutely slaying it for like 20 years now.
What happens if you answer with something approximating the hemming and hawing rather than a straight yes or no? You can encode that into text, it's just less common outside of very informal chat conversations.
I just did a 20-questions with it, and was surprised by how badly GPT-4 did. Then for fun, I turned it around and had me be the guesser. It's weird and surreal to play 20-questions when you know that the clue-giver doesn't have an answer in their mind (or more literally, there isn't a single answer in any stateful form while you play), but is instead just eventually saying "yes that's what I was thinking of" when it's statistically appropriate.
With the code execution plugin, one could theoretically ask chatgpt to generate a salted hash of their answer at the start that's revealed at the end to prove it was correct.
Without any plugins, chatgpt happily returned SHA hashes and salts when I asked it to play rock paper scissors this way. The only trouble was, the hashes were totally wrong.
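For reference, the commit-and-reveal idea is only a few lines when the hashing is done by real code rather than by the model "imagining" a hash. A minimal sketch using Python's standard `hashlib` and `secrets`; the `commit` and `verify` function names and the `salt:answer` encoding are assumptions made for this example, not anything a plugin is known to use.

```python
import hashlib
import hmac
import secrets


def _digest(answer, salt):
    """SHA-256 over 'salt:answer', hex encoded."""
    return hashlib.sha256(f"{salt}:{answer}".encode("utf-8")).hexdigest()


def commit(answer):
    """Pick a random salt and return (salt, digest).

    The digest is published at the start of the game; the salt and the
    answer stay secret until the end.
    """
    salt = secrets.token_hex(16)
    return salt, _digest(answer, salt)


def verify(answer, salt, digest):
    """Check at the end that the revealed answer and salt match the digest."""
    return hmac.compare_digest(_digest(answer, salt), digest)


if __name__ == "__main__":
    salt, digest = commit("peanut butter")
    print("published up front:", digest)
    # ... the 20 questions happen here ...
    print("reveal checks out:", verify("peanut butter", salt, digest))
    print("a swapped answer fails:", not verify("honey", salt, digest))
```

The salt matters because the space of plausible answers is small; without it, the guesser could simply hash every likely answer and compare against the published digest.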
i love your example, i wonder if this kind of game can be implemented in future training scenarios
we as humans understand ambiguity so much easier because we learn to speak and interact before we write, and writing ambiguity is way less obvious if you've never experienced it
I use food (including peanut butter) in cooking. I cook in the kitchen. Therefore peanut butter is a thing I use in the kitchen. Seems correct and proper to me.
The ambiguity as I see it is that the kitchen isn't the only place I use peanut butter. I've eaten it (which I think counts as "using") in other rooms. I've even made peanut-butter sandwiches (properly "using" it) in the living room before.
Well, the alleged point is challenged. If, while playing this game, the questioner must constantly verify that the other party is using the language properly, they'll exhaust that 20 q limit rather quickly.
- is it used in the kitchen?
- yes.
- [well, kitchen appliances, here we go ..] is it ..?
...
- [aha. meat intelligence no speak proper English?] Is this thing you use in kitchen edible?
- Oh, yeah.
- [oh dear. we can not let meat machines govern this planet...]
Yes. You use edible things in preparing or cooking food (which may happen in the kitchen). 'Use' maps to food prep (the act) but never to prep location. Only in cases where the thing has both general edible and food preparation usage -- "I use honey extensively in the kitchen" for example -- do "use" and "edible" make sense together.
But peanut butter has general edible and food preparation usage quite similar to honey, doesn't it? You can spread it on a slice of bread to eat directly or use it as a baking ingredient, but you probably wouldn't eat it by the spoonful straight from the container. (Or maybe that's how people usually eat peanut butter, I kind of don't want to know.)
Right -- although many things that are ambiguous in text are disambiguated in actual speech, so the problems that arise with audio speech are not wholly the same as with text.
A classic example is the word "record", which has first syllable stress as a noun, but second syllable stress as a verb. "I bought a RECord" vs "Please reCORD the music".
(in the dominant American dialect; I don't recall about other dialects/countries)
I haven't seen anyone mention Anvil[1] yet, but it lets you "Build web apps with nothing but Python." and is a lovely tool that I've successfully used for a handful of side projects.
But as someone who feels most at home with Python, I always love to see new competition in this space.
EDIT: My following statement about self hosting is incorrect. You can, in fact, self host.
This looks wonderful, but the inability to self host is a killer from the solo developer point of view. Being limited to 50,000 database rows on the free account isn't ideal.
I've been using Resh[0] for the past 6 months or so. A rich and queryable shell history is a massive boost in day-to-day productivity. The syncing described here is a pretty cool feature.
https://news.ycombinator.com/item?id=32291993