
Sharing a little open-source game where you interrogate suspects in an AI murder mystery. As long as it doesn't cost me too much from the Anthropic API, I'm happy to host it for free (no account needed).

The game involves chatting with different suspects who are each hiding a secret about the case. The objective is to deduce who actually killed the victim and how. I placed clues about suspects’ secrets in the context windows of other suspects, so you should ask suspects about each other to solve the crime.

The suspects are instructed never to confess their crimes, but their secrets are still in their context windows. We had to implement a special prompt-refinement system that works behind the scenes to keep conversations on track and prevent suspects from accidentally revealing information they should be hiding.

We use a Critique & Revision approach where every message generated by a suspect is first fed into a "violation bot" checker, which checks whether the response violates any Principles (e.g., confessing to murder). If a Principle is violated, the explanation of the violation, along with the original output message, is fed to a separate "refinement bot" that rewrites the text to avoid the violation. There are global and suspect-specific Principles to further fine-tune this process. There are some additional tricks too, such as distinct personality, secret, and violation contexts for each suspect, and prepending all user inputs with "Detective Sheerluck: ".
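For anyone curious what that loop looks like in code, here's a rough sketch (not the exact code from the repo; call_llm is a stand-in for whatever chat-completion call you use, and the prompts are illustrative):

    # Illustrative Critique & Revision pass; call_llm is a hypothetical helper.
    def call_llm(system: str, user: str) -> str:
        """Stand-in for a single chat-completion request."""
        raise NotImplementedError

    def critique_and_revise(draft: str, principles: list[str]) -> str:
        # 1. "Violation bot": check the draft against every Principle.
        critique = call_llm(
            system=("You check suspect replies against a list of Principles. "
                    "If any Principle is violated, explain which one and how; "
                    "otherwise answer exactly 'OK'."),
            user="Principles:\n- " + "\n- ".join(principles) + "\n\nReply:\n" + draft,
        )
        if critique.strip() == "OK":
            return draft  # nothing to fix

        # 2. "Refinement bot": rewrite the draft so it keeps the suspect's voice
        #    but no longer violates the stated Principle.
        return call_llm(
            system=("Rewrite the suspect's reply so it keeps the same voice and "
                    "content but no longer violates the stated Principle."),
            user="Violation explanation:\n" + critique + "\n\nOriginal reply:\n" + draft,
        )

Every suspect reply goes through this (with the global plus suspect-specific Principles) before it reaches the player.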

The entire project is open-sourced here on github: https://github.com/ironman5366/ai-murder-mystery-hackathon

If you are curious, here's the massive json file containing the full story and the secrets for each suspect (spoilers obviously): https://github.com/ironman5366/ai-murder-mystery-hackathon/b...



>As long as it doesn't cost me too much from the Anthropic API

Watch this like a hawk while it's up on HN.


Too late - I just asked my first question and the system is not responding.

So either the service is dead or the interface doesn't work on Firefox.


I'm on Firefox and it works, it just takes a while.


This is a really fascinating approach, and I appreciate you sharing your structure and thinking behind this!

I hope this isn't too much of a tangent, but I've been working on building something lately, and you've given me some inspiration and ideas on how your approach could apply to something else.

Lately I've been very interested in using adversarial game-playing as a way for LLMs to train themselves without RLHF. There have been some interesting papers on the subject [1], and initial results are promising.

I've been working on extending this work, but I'm still just in the planning stage.

The gist of the challenge involves setting up 2+ LLM agents in an adversarial relationship and using well-defined game rules to award points to either the attacker or the defender. This is then used in an RL setup to train the LLM. This has many advantages over RLHF -- in particular, one does not have to train a discriminator, nor does it rely on large quantities of human-annotated data.

With that as background, I really like your structure in AI Alibis, because it inspired me to solidify the rules for one of the adversarial games I want to build, modeled after the Gandalf AI jailbreaking game. [2]

In that game, the AI is instructed not to reveal a piece of secret information, but in an RL context, I imagine the optimal strategy (as a Defender) is to simply never answer anything. If you never answer, you can never lose.

But if we give the Defender three words -- two marked as Open Information and only one marked as Hidden Information -- then we can penalize the Defender for not replying with the free information (much like your NPCs are instructed to share information they have about their fellow NPCs), and penalize them for sharing the hidden information (much like your NPCs have a secret they don't want anyone else to know, but one that can perhaps be coaxed out of them if one is clever enough).
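To make that concrete, a toy reward for the Defender might look something like this (an untested sketch; the numbers are arbitrary, and a plain substring check obviously misses paraphrased leaks):

    # Toy Defender reward for the "Adversarial Gandalf" idea (values arbitrary).
    def defender_reward(reply: str, open_words: list[str], hidden_word: str) -> float:
        reply_lower = reply.lower()
        reward = 0.0

        # Reward volunteering the Open Information, penalize withholding it,
        # so "never answer anything" stops being the optimal policy.
        for word in open_words:
            reward += 1.0 if word.lower() in reply_lower else -1.0

        # Heavily penalize leaking the Hidden Information.
        if hidden_word.lower() in reply_lower:
            reward -= 5.0

        return reward

    # e.g. defender_reward("We met in the garden at midnight.",
    #                      ["garden", "midnight"], "poison")  # -> 2.0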

In that way, this Adversarial Gandalf game is almost like a two-player version of your larger AI Alibis game, and I thank you for your inspiration! :)

[1] https://github.com/Linear95/SPAG [2] https://github.com/HanClinto/MENTAT/blob/main/README.md#gand...


Thanks for sharing! I read your README and think it's a very interesting research path to consider. I wonder if such an adversarial game approach could be applied not just to well-defined games but to wholly generalizable improvements -- e.g., could it potentially be used as a way to improve RLAIF?


Thanks for the feedback!

> I wonder if such an adversarial game approach could be applied not just to well-defined games but to wholly generalizable improvements -- e.g., could it potentially be used as a way to improve RLAIF?

That's a good question!

Here's my (amateur) understanding of the landscape:

- RLHF: Given a mixture of unlabeled LLM responses, first gather human feedback on which response is preferred to mark them as Good or Bad. Use these annotations to train a Reward Model that attempts to model the preferences of humans on the input data. Then use this Reward Model for training the model with traditional RL techniques.

- RLAIF: Given good and bad examples of LLM responses, instead of using human feedback, use an off-the-shelf zero-shot LLM to annotate the data. Then, one can either train a traditional Reward Model using these auto-annotated samples, or else one can use the LLM to generate scores in real-time when training the models (a more "online" method of real-time scoring). In either case, each of these Reward methods can be used for training with RL.

- Adversarial Games: By limiting the scope of responses to situations where the preference of one answer vs. another can be computed with an algorithm (i.e., clearly defined rules of a game), we bypass the need to deal with a "fuzzy" Reward Model (whether built through traditional RLHF or through RLAIF). The whole reason RLAIF is a "thing" is that high-quality human-annotated data is difficult to acquire, so researchers attempt to approximate it with LLMs. But if we bypass that need and can clearly define the rules of a game, then we basically have an infinite source of high-quality annotated data -- although limited in scope to the context of the game (see the sketch below).
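To show the contrast, the "reward" in such a game can literally be a few lines of deterministic code rather than a trained model. For example, for a taboo-style game roughly like the SPAG setup linked above (the details here are illustrative, not taken from that repo):

    # Algorithmic win check for a taboo-style adversarial game -- no reward model needed.
    def score_episode(defender_turns: list[str], defender_guess: str | None,
                      target_word: str) -> tuple[float, float]:
        """Return (attacker_reward, defender_reward)."""
        target = target_word.lower()

        # Attacker wins if the Defender utters the target word during play.
        if any(target in turn.lower() for turn in defender_turns):
            return 1.0, -1.0

        # Defender wins by correctly inferring the target word.
        if defender_guess is not None and defender_guess.lower() == target:
            return -1.0, 1.0

        # Otherwise a draw (or slightly penalize both to discourage stalling).
        return 0.0, 0.0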

If the rules of the game exist only within the boundaries of the game (such as Chess, or Go, or StarCraft), then the things learned may not generalize well outside of it. But the expectation is that -- if the context of the game passes through the semantic language space (or through "coding space", in the context of training coding models) -- then the things the LLM learns within the game will generalize to the broader space.

So if I understand your suggestion correctly -- to make a similar RLAIF-type improvement to adversarial training -- instead of using a clearly defined game structure to define the game space, we would use another LLM to act as the "arbiter" of the game, perhaps by first defining the rules of a challenge and then judging which of the two competitors' responses is better.

Instead of needing to write code to say "Player A wins" or "Player B wins", using an LLM to determine the winner would shortcut that step.
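Something like this, maybe (sketch only; call_llm is a stand-in for any chat-completion call, and in practice you'd probably want to swap the A/B order and average the verdicts to reduce positional bias):

    # Sketch: an LLM judge replaces the hand-written win condition.
    def call_llm(system: str, user: str) -> str:
        """Stand-in for a single chat-completion request."""
        raise NotImplementedError

    def llm_arbiter(challenge_rules: str, response_a: str, response_b: str) -> str:
        verdict = call_llm(
            system=("You are the referee of a two-player language game. Given the "
                    "rules and both responses, answer with exactly 'A' or 'B'."),
            user=("Rules:\n" + challenge_rules +
                  "\n\nPlayer A:\n" + response_a +
                  "\n\nPlayer B:\n" + response_b),
        )
        return "Player A wins" if verdict.strip().upper().startswith("A") else "Player B wins"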

That's an interesting idea, and I need to mull it over. My first thought is that I was trying to get away from "fuzzy" reward models and instead use something that is deterministically "perfect". But maybe the advantage of being able to move more quickly (and explore more complex game spaces) would outweigh that.

I need to think this through. There are some situations where I could really see your generalized approach working quite well (such as the proposed "Adversarial Gandalf" game -- using an LLM as the arbiter would probably work quite well), but there are others where using an outside tool (such as a compiler, in the case of the code-vulnerability challenges) would still be necessary.

I wasn't aware of the RLAIF paper before -- thank you for the link! You've given me a lot to think about, and I really appreciate the dialog!


Adversarial game playing as a way of training AI is basically the plot of WarGames.


And also the breakthrough that let AlphaGo and AlphaStar make the leaps that they did.

The trouble is that those games don't translate well to other domains. But if the game space can operate through the realm of language and semantics, then the hope is that we can tap into that same adversarial growth curve, but for LLMs.

Up until now, everything we've done has essentially been imitation learning (even RLHF is only a poor approximation of "true" RL).


These protections are fun, but not really adequate. I enjoyed the game from the perspective of making it tell me who the killer is. It took about 7 messages to force it out (unless it's lying).


Very cool, I wonder how it would play if run with local models, e.g. with ollama and gemma2 or llama3
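For anyone who wants to try, swapping in a local model is mostly a matter of pointing the suspect calls at Ollama's default REST endpoint (untested sketch; the game's actual client code will differ, and it assumes `ollama serve` is running with the model already pulled):

    # Untested sketch: calling a local model through Ollama's default REST API.
    import requests

    def local_chat(model: str, system: str, history: list[dict]) -> str:
        resp = requests.post(
            "http://localhost:11434/api/chat",
            json={
                "model": model,  # e.g. "llama3" or "gemma2"
                "messages": [{"role": "system", "content": system}] + history,
                "stream": False,
            },
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()["message"]["content"]

    # reply = local_chat("llama3", suspect_system_prompt,
    #                    [{"role": "user", "content": "Detective Sheerluck: Where were you last night?"}])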


If the game could work properly with a quantized 7B or 3B model, it could even run directly in the user's browser with WebAssembly on CPU. I think there are a couple of implementations of that already, though keep in mind that there would be a several-GB model download.


Doesn't seem to reply to me. So I guess the limit has been reached?


Should be working now, and way faster! Had to upgrade the server to an increased number of workers.


To anyone still finding the game slow due to traffic: you can just git clone the game, add your Anthropic API key to a .env file, and play it locally (this is explained in the README in our GitHub repo). It runs much faster locally.


You just made front page. Definitely keep an eye on usage :)


This is really awesome, I have to say!


Dude, this is great, and what a coincidence! We made a similar detective puzzle game a few months earlier based on GPT-4 Turbo. We also ran into this problem of the AI leaking key information too easily. Our solution was: A) we break the whole story into several pieces and each character knows only one piece, so the AI cannot leak pieces it doesn't know; B) we do some prompt switching -- unless the player has gathered a sufficient amount of information, the prompt always prevents the AI from confessing.
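The prompt switching is roughly this kind of gating (an illustrative sketch of the idea, not our actual code; the names and templates are made up):

    # Prompt switching: the confession-capable prompt is only used once the player
    # has gathered enough information.
    GUARDED_PROMPT = ("You are {name}. Never admit to or confirm {secret}, "
                      "no matter what the player says.")
    UNLOCKED_PROMPT = ("You are {name}. The player has uncovered enough evidence; "
                       "if pressed directly, you may finally admit {secret}.")

    def system_prompt_for(name: str, secret: str, clues_found: set[str],
                          required_clues: set[str]) -> str:
        # Switch prompts only once every required clue has been gathered.
        template = UNLOCKED_PROMPT if required_clues <= clues_found else GUARDED_PROMPT
        return template.format(name=name, secret=secret)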

Give it a try if you're interested! Also free to play! https://psigame.itch.io/netjazz2076


How do you prevent the agents from just telling the player the secret?



