The tournament ranks the players by cumulative winnings. However, in poker those can be far from the statistical expectation because of the variance in the card distribution.
To establish a real winner, you need to play many games:
> As seen in the Claudico match (20), even 80,000 games may not be enough to statistically significantly separate players whose skill differs by a considerable margin [1]
It is possible to reduce the number of required games thanks to variance reduction techniques [1], but I don't think this is what the website does.
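To get a feel for the numbers, here is a rough back-of-the-envelope sketch (mine, not taken from [1]). It uses the usual normal approximation: the true edge must exceed about z standard errors of the mean before the difference becomes visible. The function name `hands_needed`, the 10 bb/hand standard deviation and the 0.05 bb/hand edge are illustrative assumptions, not figures from the tournament or the paper.

```python
import math


def hands_needed(edge_per_hand: float, std_per_hand: float, z: float = 1.96) -> int:
    """Approximate number of hands before a true edge exceeds z standard errors.

    Normal approximation: the standard error of the mean after n hands is
    std_per_hand / sqrt(n), so we need n >= (z * std_per_hand / edge_per_hand)^2.
    """
    if edge_per_hand <= 0 or std_per_hand <= 0:
        raise ValueError("edge_per_hand and std_per_hand must be positive")
    return math.ceil((z * std_per_hand / edge_per_hand) ** 2)


# Assumed, illustrative inputs (in big blinds per hand), not measured values.
baseline = hands_needed(edge_per_hand=0.05, std_per_hand=10.0)

# Variance reduction shrinks the effective standard deviation; halving it
# cuts the required number of hands to roughly a quarter.
reduced = hands_needed(edge_per_hand=0.05, std_per_hand=5.0)

print(f"hands needed without variance reduction: {baseline}")
print(f"hands needed with std halved:            {reduced}")
```

Under these assumed inputs the answer is on the order of 150,000 hands, which fits the point of the quote above: tens of thousands of games can still be too few.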
To answer the question of which "quality" of the LLMs this tournament actually measures: since we can't tell the winner reliably, I don't think we can make any particular claims about the LLMs.
However, it could be interesting to analyze the play from a "psychological profile" perspective, using the dark triad (psychopaths / Machiavellians / narcissists).
Essentially, these personality types have been observed to prefer certain strategies, and this can be quantified [2].
[1] DeepStack, https://static1.squarespace.com/static/58a75073e6f2e1c1d5b36...
[2] Generation of Games for Opponent Model Differentiation, https://arxiv.org/pdf/2311.16781