I'd say it's more like Waymo's world model. The main actor uses a latent vector representation of the state of the game to make decisions. At train time, this latent vector is meant to compress a lot of useful information about the game. So while you can't really interpret the latent vector itself, you do know it encodes at least the state of the game.
This world model stuff is only possible in environments that are sandboxed, i.e., where you can represent the state of the world and have a way of producing the next state given a current state and action. Things like Atari games, robot simulations, etc.
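To make that concrete, here's a minimal toy sketch in Python (purely illustrative, not anyone's actual architecture): the explicit state and transition function are what the sandbox gives you, and the encoder stands in for the learned compression into a latent vector the actor acts on.

```python
# Toy sketch of the world-model setup: a sandboxed environment exposes an
# explicit state and a transition function, and an encoder compresses the
# observation into a latent vector. All names here are illustrative.
import numpy as np

def transition(state: np.ndarray, action: int) -> np.ndarray:
    """Sandboxed dynamics: next state is a function of (state, action)."""
    next_state = state.copy()
    next_state[action % state.size] += 1.0  # toy update rule
    return next_state

def encode(observation: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Stand-in for a learned encoder: project the observation to a latent."""
    return np.tanh(W @ observation)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))      # toy encoder weights (would be learned)
state = np.zeros(8)

for step in range(3):
    z = encode(state, W)         # the latent the actor would act on
    action = int(np.argmax(z))   # trivial "policy" over the latent
    state = transition(state, action)
print(state)
```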
yea. We're definitely concerned about hallucinations and are using a variety of techniques to try to mitigate them (there's some existing discussion here, but using committees and sub-agents responsible for smaller tasks has helped).
What's helped the most, though, is using live cluster information to back up decision making. That way we know the data it's considering isn't garbage, and the outputs are grounded in actual data.
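A sketch of what that grounding step can look like, assuming the LLM call itself is a placeholder (the kubectl invocation is standard, everything else here is illustrative):

```python
# Pull live cluster state and put it in the model's context, so its
# reasoning is anchored to real data rather than whatever it imagines.
import json
import subprocess

def get_pod_states(namespace: str = "default") -> list[dict]:
    """Fetch current pod data straight from the cluster (ground truth)."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    items = json.loads(out)["items"]
    return [{"name": p["metadata"]["name"], "phase": p["status"]["phase"]}
            for p in items]

def build_prompt(question: str, namespace: str = "default") -> str:
    """Prepend real cluster data to the question before it reaches the LLM."""
    evidence = json.dumps(get_pod_states(namespace), indent=2)
    return f"Cluster evidence:\n{evidence}\n\nQuestion: {question}"
```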
one thing we're experimenting with to help with the hallucination/error-rate issue is a committee framework where we take a majority vote.
If the error rate of one expert is 5%, then for a committee of 10 experts the probability that a majority (6 or more) errs is about 0.00028% (binomial distribution with n=10, p=0.05: P(X >= 6) is roughly 2.8e-6). Over 10 steps, a union bound gives an error rate of about 0.0028%.
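For anyone who wants to check the arithmetic, the exact binomial tail with just the standard library:

```python
# Exact probability that a majority (>= 6 of 10) of independent experts err,
# each with error rate p = 0.05, plus a union bound over 10 steps.
from math import comb

def majority_error(n: int, p: float) -> float:
    k_majority = n // 2 + 1  # 6 for n = 10
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_majority, n + 1))

p_step = majority_error(10, 0.05)
print(f"per-step committee error: {p_step:.2e}")      # ~2.75e-06 (0.000275%)
print(f"10-step union bound:      {10 * p_step:.2e}")  # ~2.75e-05 (0.00275%)
```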
I'm not sure they are highly correlated. A committee uses the same LLM with the same input context to generate different outputs. Given the same context (and fixed model parameters, temperature, etc.), an LLM produces the same next-token output distribution, so complete outputs are independent samples from that distribution, even though tokens within a single output are highly correlated. You're right that the error events aren't truly independent, since a hard input can push every sample toward the same mistake, but the calculation was just a simplification.
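For what it's worth, the mechanism itself is tiny. A sketch, assuming a hypothetical sample_answer() wrapper around whatever LLM API you use at temperature > 0:

```python
# Committee vote over independently sampled outputs for the same context.
from collections import Counter

def sample_answer(context: str) -> str:
    """Placeholder: one independent sample from the model's output distribution."""
    raise NotImplementedError("wrap your LLM API call here")

def committee_vote(context: str, n: int = 10) -> str:
    """Draw n independent samples for the same context; return the majority answer."""
    answers = [sample_answer(context) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    if count <= n // 2:
        raise ValueError("no strict majority; treat as abstain/escalate")
    return winner
```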
The product will automatically execute runbooks for you. So far we've focused on using runbooks customers already have, since they know those work for them. We've also added the ability to turn off automatic execution for cases like a suggested runbook, so the customer can make any edits if necessary before approving it to be executed automatically.
Yea, this is a big challenge for us. We're using a variety of strategies to keep hallucinations rare, but that's also why we're committed to not executing actions that modify your cluster unless they're explicitly specified in a runbook.
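A minimal sketch of that kind of execution gate, with purely hypothetical names (Runbook, auto_execute, etc. are illustrative, not the product's actual API): nothing that mutates the cluster runs unless it appears in an approved runbook.

```python
# Hypothetical execution gate: a step only runs if it is in the runbook and
# the runbook is approved with automatic execution enabled.
from dataclasses import dataclass

@dataclass
class Runbook:
    name: str
    steps: list[str]
    auto_execute: bool = False  # off => suggested runbooks wait for a human
    approved: bool = False

def can_execute(runbook: Runbook, step: str) -> bool:
    """Gate: the step must be in the runbook, and the runbook cleared to run."""
    return step in runbook.steps and runbook.approved and runbook.auto_execute

rb = Runbook("restart-stuck-deployment",
             steps=["kubectl rollout restart deployment/api"],
             auto_execute=True, approved=True)
print(can_execute(rb, "kubectl rollout restart deployment/api"))  # True
print(can_execute(rb, "kubectl delete namespace prod"))           # False: not in runbook
```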
yea, we'd like to actually create these issues on a real cluster, but we couldn't figure out a good way of doing that at scale. The best alternative we could think of was using an LLM that knows the root cause and can (hopefully) simulate the outputs of commands consistently. Let us know if you have other ideas; we're always looking for ways to improve it.
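A sketch of what that evaluation setup might look like, with llm() as a hypothetical wrapper around whatever model API you use: the model is told the hidden root cause up front and plays the "cluster", answering each diagnostic command consistently with it.

```python
# The simulator LLM knows the hidden root cause and must answer every
# command consistently with it and with its own previous answers.
def llm(system: str, user: str) -> str:
    raise NotImplementedError("wrap your LLM API call here")

def simulate_command(root_cause: str,
                     history: list[tuple[str, str]],
                     command: str) -> str:
    """Return plausible command output consistent with the hidden root cause."""
    system = (
        "You are simulating a Kubernetes cluster with this hidden root cause: "
        f"{root_cause}. Answer each command with realistic output consistent "
        "with that root cause and with all previous answers."
    )
    transcript = "\n".join(f"$ {c}\n{o}" for c, o in history)
    return llm(system, f"{transcript}\n$ {command}")
```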
> the user competition is meant to be a fun side-project that we threw together today, I think it's cool that people hack things like that so quickly :)
The website comes off as a marketing strategy rather than a fun one-day hackathon project. I think that's why it's getting the reaction you're seeing.
More seriously, issues where the observed behaviour is "the system is slow" are usually harder to root-cause than complete outages. It partly depends on how good your capacity planning is, obviously, but maybe an AI could help with that too.