
I don't like this test, because for the very first question I was presented with, both answers looked equally good. Actually, they were almost the same, just phrased differently. So my choice would be completely random, which means the end score will be polluted by randomness. They should have added options like "both answers are good" and "both answers are bad".


If the positions are randomly assigned, it shouldn't matter. Such an option might make the results converge faster, but the overall outcome shouldn't change even if you need to flip a coin from time to time.


Sure, but providing an "undecided" option would solve the issue the OP is describing for the individual voter.


I have a lot of experience with pairwise testing so I can explain this.

The reason there isn't an "equal" option is because it's impossible to calibrate. How close do the two options have to be before the average person considers them "equal"? You can't really say.

The other problem is when two things are very close, if you provide an "equal" option you lose the very slight preference information. One test I did was getting people to say which of two greyscale colours is lighter. With enough comparisons you can easily get the correct ordering even down to 8 bits (i.e. people can distinguish 0x808080 and 0x818181), but they really look the same if you just look at a pair of them (unless they are directly adjacent, which wasn't the case in my test).
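To make that concrete, here's a minimal simulation sketch of the effect (the noise model and all numbers are invented, not the actual test data): each individual judgement is only slightly better than a coin flip, yet aggregating enough of them recovers the true ordering.

```python
import random

# Hypothetical setup: five grey values one step apart, so the true
# ordering is just their numeric order. Each pairwise judgement is
# assumed to be correct only slightly more often than chance.
greys = [0x80, 0x81, 0x82, 0x83, 0x84]
P_CORRECT = 0.55          # assumed per-comparison accuracy, barely above 50%
N_COMPARISONS = 5000

wins = {g: 0 for g in greys}
for _ in range(N_COMPARISONS):
    a, b = random.sample(greys, 2)
    lighter, darker = max(a, b), min(a, b)
    # The judge picks the truly lighter swatch with probability P_CORRECT.
    pick = lighter if random.random() < P_CORRECT else darker
    wins[pick] += 1

# Sorting by win count usually recovers the true dark-to-light order
# despite the heavy per-comparison noise.
print([hex(g) for g in sorted(greys, key=wins.get)])
```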

The "polluted by randomness" issue isn't a problem with sufficient comparisons because you show the things in a random order so it eventually gets cancelled out. Imagine throwing a very slightly weighted coin; it's mostly random but with enough throws you can see the bias.

...

On the other hand, 16 comparisons isn't very many at all. And I did implement an ad-hoc "they look the same" option for my tests, and it actually performed significantly better, even if it isn't quite as mathematically rigorous.

Also player skill ranking systems like Elo or TrueSkill have to deal with draws (in games that allow them), and really most of these ranking algorithms are totally ad-hoc anyway (e.g. why does Bradley-Terry use a sigmoid model?), so it's not really a big deal to add more ad-hocness into your model.
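For reference, a minimal sketch of the Bradley-Terry sigmoid model mentioned above, with draws bolted on ad hoc via the half-point treatment Elo uses (the players, outcomes, and learning rate here are all invented):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical outcomes: (a, b, score) where score is 1.0 if a won,
# 0.0 if b won, and 0.5 for a draw -- the same half-point Elo uses.
games = [("x", "y", 1.0), ("y", "x", 0.5), ("x", "y", 1.0), ("x", "y", 0.5)]

ratings = {"x": 0.0, "y": 0.0}
LR = 0.1  # arbitrary learning rate

# Gradient ascent on the Bradley-Terry log-likelihood:
# P(a beats b) = sigmoid(r_a - r_b), with draws counted as half a win.
for _ in range(200):
    for a, b, score in games:
        p = sigmoid(ratings[a] - ratings[b])
        ratings[a] += LR * (score - p)
        ratings[b] -= LR * (score - p)

print(ratings)  # x ends up rated above y
```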


Ordering isn't necessarily the most valuable signal for ranking models when much stronger degrees of preference exist between some of the answers, though. "I don't mind either of these answers, but I do have a clear preference for this one" is sometimes a more valuable signal than a forced choice. And a model x that is consistently but subtly preferred to model y in the common case where both yield acceptable outputs, yet is universally disfavoured for being wrong or bad more often, is going to be the worse model for most use cases.
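A hypothetical back-of-the-envelope version of that argument (every number here is invented for illustration): x edges out y whenever both are fine, yet fails four times as often, so the forced-choice win rate and a crude usefulness measure point in opposite directions.

```python
# Invented numbers for illustration only.
p_both_fine = 0.90             # both answers acceptable
p_x_bad, p_y_bad = 0.08, 0.02  # x fails far more often than y
x_win_when_fine = 0.55         # subtle preference for x when both are fine

# Forced choice: x also "wins" whenever y fails outright.
win_rate_x = p_both_fine * x_win_when_fine + p_y_bad
print(f"x forced-choice win rate: {win_rate_x:.3f}")  # 0.515 -> x ranks higher

# Crude utility: a fine answer is worth 1, a bad one costs 5 (arbitrary).
utility = lambda p_bad: (1 - p_bad) * 1 + p_bad * -5
print(f"x utility: {utility(p_x_bad):.2f}, y utility: {utility(p_y_bad):.2f}")
# x: 0.52 vs y: 0.88 -> y is the better model despite losing head-to-head
```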

It also depends on what the pairwise comparisons are measuring, of course. If it's shades of grey, is the statistical preference identifying a small fraction of the public able to discern a subtle mismatch in shading between adjacent boxes, or is it purely subjective colour preference confounded by far greater variation in monitor output? If it's LLM responses, I wonder whether regular LLM users have subtle biases against the recognisable phrasing quirks of well-known models, quirks which aren't necessarily more prominent or less appropriate than those of a less familiar model. Heavy use of em-dashes, "not x but y" constructions, and bullet points were perceived as clear, well-structured communication before they came to be seen as stereotypical, artificial AI responses.



