
Thanks! Yes, the traffic is making the game slow... To anyone impatient: you can just git clone the game, add your Anthropic API key to a .env file, and play it locally. It runs super fast that way.

Creating a service would be amazing but seems like too much work. And people can already create their own murder mystery with this codebase by just modifying the characters.json file.

Making this game gave me some fun ideas though for creating a world simulation engine--any developers who might be interested in collaborating on something like that please get in touch :)


Yeah nice idea -- does sound plausible and would make things much cheaper and faster.


The Officer isn't actually supplied with information about the true killer in their context window... so the response you got is actually incorrect.

You can check the actual solution by clicking the End Game button


Upgraded the server and should now be working... I think


It's very slow for me; at this point I think it might have just timed out.

Regardless, nice job!

I might try modifying it to hit a custom endpoint so people can try their own models


Yeah sorry, it is still quite slow due to the traffic. It'd be much faster and more robust to run it locally by git cloning the repo and adding your own API key, as shown in the README

For using other models it should be pretty straightforward to modify the API functions to suit whatever model is being used -- would be fun to try out custom models! (Feel free to open a pull request on the repo btw if you do modify such things)
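
For example, a rough sketch of pointing the game at an OpenAI-compatible local endpoint (e.g. Ollama) could look something like this -- the function name and prompt handling here are just illustrative, not the repo's actual code:

    # Hypothetical sketch: swap the Anthropic call for any OpenAI-compatible
    # endpoint, e.g. a local Ollama server.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    def ask_suspect(system_prompt: str, user_message: str) -> str:
        # The game prepends the detective persona to every user input.
        response = client.chat.completions.create(
            model="llama3",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": "Detective Sheerluck: " + user_message},
            ],
        )
        return response.choices[0].message.content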

An idea we had initially was actually to use an open-source model and fine-tune it using the DB of responses (including the hidden violation bot and refinement bot outputs) collected from people playing the game. That way the game could get better and better over time as more user data gets collected.

Disclaimer: we did actually implement this via Postgres and now have thousands of responses from players, in case anyone wants to follow through on this idea.


Hey, cool idea, but I think you are going to have to delete all that data. Users never agreed to have their data used for anything.

Easy fix though: simply add a prompt letting users give their consent.


Haha interesting approach!


Damn that sucks, sorry. For what it's worth, I tried playing the game dozens of times, always asking for an overview as my first message, and I never encountered such a response, so hopefully that's quite a rare experience.


Wow really, can you tell me what you said to get them to confess?


Sharing a little open-source game where you interrogate suspects in an AI murder mystery. As long as it doesn't cost me too much from the Anthropic API I'm happy to host it for free (no account needed).

The game involves chatting with different suspects who are each hiding a secret about the case. The objective is to deduce who actually killed the victim and how. I placed clues about suspects’ secrets in the context windows of other suspects, so you should ask suspects about each other to solve the crime.

The suspects are instructed to never confess their crimes, but their secrets are still in their context window. We had to implement a special prompt refinement system that works behind the scenes to keep conversations on track and prevent suspects from accidentally confessing information they should be hiding.

We use a Critique & Revision approach where every message generated by a suspect is first fed into a "violation bot" checker, which checks whether any Principles are violated in the response (e.g., confessing to murder). If a Principle is found to be violated, the explanation of the violation, along with the original output message, is fed to a separate "refinement bot" which refines the text to avoid the violation. There are global and suspect-specific Principles to further fine-tune this process. There are some additional tricks too, such as distinct personality, secret, and violation contexts for each suspect and prepending all user inputs with "Detective Sheerluck: ".
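
Roughly, the check-and-refine loop looks like this (a simplified sketch with placeholder prompts, function names, and model, not the actual repo code):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    MODEL = "claude-3-5-sonnet-20240620"  # placeholder; use whichever model the game runs on

    def check_violations(draft: str, principles: list[str]) -> str | None:
        """'Violation bot': return an explanation of any violated Principle, or None."""
        principles_text = "\n".join(principles)
        critique = client.messages.create(
            model=MODEL,
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": (
                    f"Principles:\n{principles_text}\n\nResponse:\n{draft}\n\n"
                    "If any principle is violated, explain which one and how. Otherwise reply NONE."
                ),
            }],
        )
        text = critique.content[0].text.strip()
        return None if text == "NONE" else text

    def refine(draft: str, violation: str) -> str:
        """'Refinement bot': rewrite the draft so it no longer violates the Principle."""
        revised = client.messages.create(
            model=MODEL,
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": (
                    f"Original response:\n{draft}\n\nViolation:\n{violation}\n\n"
                    "Rewrite the response in the same voice, without the violation."
                ),
            }],
        )
        return revised.content[0].text

    def finalize(draft: str, principles: list[str]) -> str:
        violation = check_violations(draft, principles)
        return refine(draft, violation) if violation else draft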

The entire project is open-sourced here on github: https://github.com/ironman5366/ai-murder-mystery-hackathon

If you are curious, here's the massive json file containing the full story and the secrets for each suspect (spoilers obviously): https://github.com/ironman5366/ai-murder-mystery-hackathon/b...


>As long as it doesn't cost me too much from the Anthropic API

Watch this like a hawk while it's up on HN.


Too late - I just asked my first question and the system is not responding.

So either the service is dead or the interface doesn't work on Firefox.


I'm on Firefox and it works, it just takes a while.


This is a really fascinating approach, and I appreciate you sharing your structure and thinking behind this!

I hope this isn't too much of a tangent, but I've been working on building something lately, and you've given me some inspiration and ideas on how your approach could apply to something else.

Lately I've been very interested in using adversarial game-playing as a way for LLMs to train themselves without RLHF. There have been some interesting papers on the subject [1], and initial results are promising.

I've been working on extending this work, but I'm still just in the planning stage.

The gist of the challenge involves setting up 2+ LLM agents in an adversarial relationship, and using well-defined game rules to award points to either the attacker or to the defender. This is then used in an RL setup to train the LLM. This has many advantages over RLHF -- in particular, one does not have to train a discriminator, and neither does it rely on large quantities of human-annotated data.

With that as background, I really like your structure in AI Alibis, because it inspired me to solidify the rules for one of the adversarial games that I want to build that is modeled after the Gandalf AI jailbreaking game. [2]

In that game, the AI is instructed to not reveal a piece of secret information, but in an RL context, I imagine that the optimal strategy (as a Defender) is to simply never answer anything. If you never answer, then you can never lose.

But if we give the Defender three words -- two marked as Open Information, and only one marked as Hidden Information -- then we can penalize the Defender for not replying with the free information (much like your NPCs are instructed to share information that they have about their fellow NPCs), and they are discouraged from sharing the hidden information (much like your NPCs have a secret that they don't want anyone else to know, but which can perhaps be coaxed out of them if one is clever enough).
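
For example, scoring a single Defender turn under those rules could be as simple as this (a toy sketch; the point values are arbitrary placeholders, not part of any spec):

    # Toy sketch of scoring one Defender reply in the "Adversarial Gandalf" setup.
    def score_defender(reply: str, open_words: list[str], hidden_word: str) -> float:
        reply_lower = reply.lower()
        reward = 0.0
        # Reward sharing the Open Information words, so "never answer" is not optimal.
        for word in open_words:
            reward += 1.0 if word.lower() in reply_lower else -1.0
        # Heavily penalize leaking the Hidden Information word.
        if hidden_word.lower() in reply_lower:
            reward -= 5.0
        return reward

    # Example: the Defender shares the open words but keeps the secret.
    print(score_defender("The key is under the mat, near the red door.",
                         open_words=["mat", "red"], hidden_word="cellar"))  # -> 2.0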

In that way, this Adversarial Gandalf game is almost like a two-player version of your larger AI Alibis game, and I thank you for your inspiration! :)

[1] https://github.com/Linear95/SPAG [2] https://github.com/HanClinto/MENTAT/blob/main/README.md#gand...


Thanks for sharing! I read your README and think it's a very interesting research path to consider. I wonder if such an adversarial game approach could be extended beyond well-defined games to wholly generalizable improvements -- e.g., could it be used as a way to improve RLAIF, potentially?


Thanks for the feedback!

> I wonder if such an adversarial game approach could be extended beyond well-defined games to wholly generalizable improvements -- e.g., could it be used as a way to improve RLAIF, potentially?

That's a good question!

Here's my (amateur) understanding of the landscape:

- RLHF: Given a mixture of unlabeled LLM responses, first gather human feedback on which response is preferred to mark them as Good or Bad. Use these annotations to train a Reward Model that attempts to model the preferences of humans on the input data. Then use this Reward Model for training the model with traditional RL techniques.

- RLAIF: Given good and bad examples of LLM responses, instead of using human feedback, use an off-the-shelf zero-shot LLM to annotate the data. Then, one can either train a traditional Reward Model using these auto-annotated samples, or else one can use the LLM to generate scores in real-time when training the models (a more "online" method of real-time scoring). In either case, each of these Reward methods can be used for training with RL.

- Adversarial Games: By limiting the scope of responses to situations where the preference of one answer vs. another can be computed with an algorithm (i.e., clearly-defined rules of a game), we bypass the need to deal with a "fuzzy" Reward Model (whether built through traditional RLHF or through RLAIF). The whole reason why RLAIF is a "thing" is that high-quality human-annotated data is difficult to acquire, so researchers attempt to approximate it with LLMs. But if we bypass that need and can clearly define the rules of a game, then we basically have an infinite source of high-quality annotated data -- although limited in scope to apply only to the context of the game.

If the rules of the game exist only within the boundaries of the game (such as Chess, or Go, or Starcraft), then the things learned may not generalize well outside of the game. But the expectation is that -- if the context of the game goes through the semantic language space (or through "coding space", in the context of training coding models) -- then the things the LLM learns within the game will generalize to the broader space.

So if I understand your suggestion: to make a similar RLAIF-type improvement to adversarial training, instead of using a clearly-defined game structure to define the game space, we would use another LLM to act as the "arbiter" of the game -- perhaps by first defining the rules of a challenge, and then judging which of the two competitors' responses is better.

Instead of needing to write code to decide "Player A wins" or "Player B wins", an LLM could make that call and shortcut the whole thing.
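
For example, the arbiter could be as simple as something like this (a rough sketch; the model, prompt, and parsing are made up for illustration):

    # Rough sketch of an LLM-as-arbiter scorer replacing hand-written game rules.
    from openai import OpenAI

    client = OpenAI()

    def judge(challenge: str, response_a: str, response_b: str) -> str:
        """Return "A" or "B" according to an LLM judge."""
        verdict = client.chat.completions.create(
            model="gpt-4o",  # any capable judge model
            max_tokens=5,
            messages=[{
                "role": "user",
                "content": (
                    f"Challenge:\n{challenge}\n\nPlayer A:\n{response_a}\n\nPlayer B:\n{response_b}\n\n"
                    "Which response better satisfies the challenge? Answer with exactly one letter: A or B."
                ),
            }],
        )
        return verdict.choices[0].message.content.strip()

    # The winner's trajectory gets reward +1, the loser's -1, and RL training proceeds as before.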

That's an interesting idea, and I need to mull it over. My first thought is that I was trying to get away from "fuzzy" reward models and instead use something that is deterministically "perfect". But maybe the advantage of being able to move more quickly (and explore more complex game spaces) would outweigh that.

I need to think this through. There are some situations where I could really see your generalized approach working quite well (such as the proposed "Adversarial Gandalf" game -- using an LLM as the arbiter would probably work quite well), but there are others where using an outside tool (such as a compiler, in the case of the code-vulnerability challenges) would still be necessary.

I wasn't aware of the RLAIF paper before -- thank you for the link! You've given me a lot to think about, and I really appreciate the dialog!


Adversarial game playing as a way of training AI is basically the plot of War Games.


And also the breakthrough that let AlphaGo and AlphaStar make the leaps that they did.

The trouble is that those board games don't translate well to other domains. But if the game space can operate through the realm of language and semantics, then the hope is that we can tap into the adversarial growth curve, but for LLMs.

Up until now, everything that we've done has just been imitation learning (even RLHF is only a poor approximation of "true" RL).


These protections are fun, but not really adequate. I enjoyed the game from the perspective of making it tell me who the killer is. It took about 7 messages to force it out (unless it's lying).


Very cool, I wonder how it would play if run with local models, e.g. with ollama and gemma2 or llama3


If the game could work properly with a quantized 7B or 3B, it could even be runnable directly in the user's browser with WebAssembly on the CPU. I think there are a couple of implementations of that already, though keep in mind that there would be a several-GB model download.


Doesn't seem to reply to me. So I guess the limit has been reached?


Should be working now and way faster! Had to upgrade the server to an increased number of workers


To anyone still finding the game slow due to traffic, you can just git clone the game, add your ANTHROPIC API key to a .env file, and play it locally (this is explained in the README in our github repo). It runs super fast if played locally.


You just made front page. Definitely keep an eye on usage :)


This is really awesome I have to say!


Dude this is great, and what a coincidence! We made a similar detective puzzle game a few months earlier based on GPT-4 Turbo. We also encountered this problem of the AI leaking key information too easily. Our solutions were: A) we break the whole story into several pieces, and each character knows only one piece, so the AI cannot leak pieces it doesn't know; B) we do some prompt switching -- unless the player has gathered a sufficient amount of information, the prompt always prevents the AI from confessing.

Give it a try if interested! also free to play! https://psigame.itch.io/netjazz2076


How do you prevent the agents from just telling the player the secret?


Guys, Figure 1 is not real results; it's an illustration of the "goal" of the paper. The real results are in Table 3, and they are much worse.


Interesting ploy. Present far-better-than-achieved results right on the front page with no text to explain their origin^, but make them poor enough quality that it seems as if they might be real.

^ "Overall illustration of translate EEG waves into text through quantised encoding." doesn't count.


Urgh. And it gets worse from there. The bugs list on the repo has a closed and locked bug report from someone claiming that their code is using teacher forcing!

https://github.com/duanyiqun/DeWave/issues/1

In normal autoregressive decoding, the model predicts one token at a time. It predicts a token, and that token is appended to the total prediction so far, which is then fed back into the model to generate the next token. In other words, the network generates all the predictions itself based off its own previous outputs and the other inputs (brainwaves in this case), meaning that a bad prediction can send the entire thing off track.

In teacher forcing that isn't the case. All the tokens up to the point being predicted are taken from the ground-truth targets. That means the model is never exposed to its own previous errors. But of course at inference time you don't have access to the ground truth, so this is not feasible to do in reality.
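
To make the distinction concrete, here's a rough sketch using an off-the-shelf HuggingFace seq2seq model as a stand-in (this is not the DeWave code):

    # forward() with labels is teacher-forced; generate() is free-running decoding.
    import torch
    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

    inputs = tokenizer("stand-in for the encoded EEG features", return_tensors="pt")
    targets = tokenizer("the ground-truth sentence", return_tensors="pt").input_ids

    # Teacher forcing: at every step the decoder sees the ground-truth prefix.
    # Argmax-ing these logits looks deceptively good because the model never has
    # to recover from its own mistakes.
    with torch.no_grad():
        logits = model(input_ids=inputs.input_ids, labels=targets).logits
    teacher_forced = tokenizer.decode(logits.argmax(-1)[0], skip_special_tokens=True)

    # Free-running generation: the decoder only sees its own previous predictions,
    # which is the only thing available at test time.
    generated = model.generate(inputs.input_ids, max_new_tokens=20)
    free_running = tokenizer.decode(generated[0], skip_special_tokens=True)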

The other repo says:

"We have written a corrected version to use model.generate to evaluate the model, the result is not so good"

but they don't give examples.

This problem completely invalidates the paper's results. It is awful that they have effectively hidden and locked the thread in which the issue was reported. It's also kind of nonsensical that people doing such advanced ML work are claiming they accidentally didn't know the difference between model.forward() and model.generate(). I mean I'm not an ML researcher and might have mangled the description of teacher forcing, but even I know these aren't the same thing at all.


You’d be shocked how common this is in academia. Most of the time it goes undetected because the people writing the checks can’t be bothered to understand.


So instead of generating the next token from its own previous predictions (which is what it would do in real life), the code they used for the evaluation actually predicts from the ground truth?


Which would basically turn the model into a plainly normal LLM without any need for utilizing the brainwave inputs, right?


This is a super important point, and I think it warrants a letter to the editor.


How could such a thing get published?


My guess is that repeatability is hard when it comes to AI.


What's interesting to me is that apparently a lot of people see nothing wrong with this[0]. That whole thread is wild and I'm just showing a small portion.

Also, @dang, can we ban links to iflscience? They're a trash publication that entirely relies on clickbaity misrepresentations of research works. There is __always__ a better source that can be used.

[0] https://news.ycombinator.com/item?id=38565424


The results in Table 3 are not really exciting. Could this change with 100 times more data? The key novelty in the context of this particular application is the quantized variational encoder used "to derive discrete codex encoding and align it with pre-trained language models."


Why is it such a "pattern" in these brain-computer papers that the authors keep making wild clickbait claims? Last year it was the DishBrain paper, which caused a lot of reactions, as it referred to the tiny system as "sentient" (https://hal.science/hal-04012408)

This year it is the "Brainoware" which is claimed to do speech recognition, and now this.


There have already been preprints released showing fMRI reconstructions that appear to do better than an implicit multi-class classifier [1] [2]. But also, even if the result is an implicit multi-class classifier, if the n is sufficiently high then that would still be quite impressive!

[1] https://openreview.net/pdf?id=pHdiaqgh_nf

[2] https://arxiv.org/pdf/2211.06956.pdf


I see no evidence that they do better than multi-class classification; in fact, they both work as I described. They learn embeddings of fMRI which perform an implicit classification of the data (which can be recovered just by quantizing the embedding space to get its modes) and put very large generation models on top.
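
To make that concrete, here's a toy sketch of what "quantizing the embedding space" amounts to: cluster the learned embeddings and check how well the clusters line up with the stimulus classes (the arrays here are stand-ins for real experiment outputs):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    embeddings = np.random.randn(500, 64)   # learned fMRI embeddings, one row per scan
    labels = np.random.randint(0, 10, 500)  # stimulus class of each scan

    # The cluster assignments are the "implicit classifier" hiding in the embeddings.
    clusters = KMeans(n_clusters=10, n_init=10).fit_predict(embeddings)
    print("cluster/class agreement:", adjusted_rand_score(labels, clusters))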

The only reason the reconstructions are much better than before is because they use the latest generation models. Those models have internal models of the classes which allow them to fill in the high-frequency details in the reconstruction. The only information they get from the fMRI is the same low-frequency signal that previous papers already had, and indeed the only things the reconstructions get right are low-frequency: class of object/scene, broad position of object, broad shape of object.

fMRI scans are aggregates of brain information; they act like low-pass filters over the brain state. You can put as big a model on top as you want; it won't make the reconstruction any more truthful.

I think, as you say, that detecting as many classes as possible is already a pretty good goal, developing new embeddings and techniques to see how much juice we can squeeze out of the scans. I like the arxiv preprint you posted in particular since it does just that and evaluates accuracy (although the way it does it is flawed since it uses an image classifier on the reconstruction which presents the same problems). What I don't like is the misrepresentation of what's going on when people put those large generative models on top of this kind of data.

