Do You Find LM Arena Leaderboard Useful?

in #stem, yesterday

"AI" benchmarks can be gamed with some effort. I have seen scenarios where a changing of the order of the answers in an MCQ resulting in drastically difference performance "AI". When there is a massive gap in benchmarks between models, we can also see that they are either an older model or of a different size. In such cases, the users don't even need the benchmark to figure out which is better.

LM Arena Offers Blind Testing

LM Arena 1.png

I asked a very short and simple prompt about HIVE and got the results very fast. The paid subscriptions are handled by LM Arena. Users can select one of four options. Once the voting is complete, the model names are revealed.

LM Arena 2.png

The results of these votes are used to rank various models against each other. The votes come from a small sample of enthusiasts who already know about LM Arena. Since that same userbase is the one most likely to know and understand "AI", I don't think the sample size is going to be a problem.
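To make the idea of ranking from votes concrete, here is a minimal sketch of how head-to-head votes can be turned into a leaderboard. I am assuming a simple Elo-style update purely for illustration, since nothing above says how LM Arena actually computes its scores. The model names, the `k` factor and the starting rating are all made up.

```python
from collections import defaultdict

def expected_score(r_a, r_b):
    """Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def rank_from_votes(votes, k=32, start=1000.0):
    """Build a leaderboard from (model_a, model_b, outcome) votes.

    outcome: 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k and start are illustrative assumptions, not LM Arena's settings.
    """
    ratings = defaultdict(lambda: start)
    for model_a, model_b, outcome in votes:
        r_a, r_b = ratings[model_a], ratings[model_b]
        e_a = expected_score(r_a, r_b)
        # Zero-sum update: whatever A gains, B loses.
        ratings[model_a] = r_a + k * (outcome - e_a)
        ratings[model_b] = r_b - k * (outcome - e_a)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical blind votes; the names are placeholders, not real models.
votes = [
    ("model-x", "model-y", 1.0),
    ("model-x", "model-z", 1.0),
    ("model-y", "model-z", 0.5),
]
leaderboard = rank_from_votes(votes)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

With only a handful of votes the ratings barely separate, which is why the number and variety of voters matters for how much the final order can be trusted.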

Current Leaderboard With xAI on Top

LM Arena Top 10.png


I'll need to use it more and see. But normally, a ranking where users see side-by-side answers from two models, don't know which models they are, and select the better answer should be pretty accurate.