My understanding was that the system consists of using the historical odds of winning (given the rating difference). If you benchmark that using only past data, I think it is by definition the most accurate system. (The data is always a better fit to itself than a theoretical fit is.)
Naturally, future data is much harder to deal with than past data. But even for future data it's not obvious that Elo (or any other theoretical fit to the odds of winning) will be more accurate than the historical odds.
Yes, the best fit for the data is the data itself; it's a tautology. There's nothing wrong with Elo's exponential curve, it just can't beat the actual data.
You raise a good point in that I could've created a training set and a test set; that probably would have been a better validation. But I don't know, I'm not doing science, I'm making a game.
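For what it's worth, the training/test split mentioned here could look something like the following. It is a minimal sketch, not the game's actual code: the names `elo_expected`, `fit_empirical`, `brier`, and `holdout_compare` are made up for illustration, matches are assumed to be `(rating_diff, won)` pairs in chronological order, the 50-point bucket width is arbitrary, and the Elo curve uses the conventional 400-point logistic scale.

```python
from collections import defaultdict


def elo_expected(diff, scale=400.0):
    """Standard logistic Elo expectation for a rating difference `diff`."""
    return 1.0 / (1.0 + 10.0 ** (-diff / scale))


def fit_empirical(train, bucket=50):
    """Build a lookup of observed win rates per rating-difference bucket."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for diff, won in train:
        key = round(diff / bucket)
        games[key] += 1
        wins[key] += won

    def predict(diff):
        key = round(diff / bucket)
        # Fall back to a coin flip for buckets never seen in training.
        return wins[key] / games[key] if games.get(key) else 0.5

    return predict


def brier(predict, matches):
    """Mean squared error between predicted probability and outcome (lower is better)."""
    return sum((predict(d) - w) ** 2 for d, w in matches) / len(matches)


def holdout_compare(matches, train_fraction=0.8):
    """Fit on the oldest matches, then score both predictors on the newest ones."""
    split = int(len(matches) * train_fraction)
    train, test = matches[:split], matches[split:]
    return {
        "empirical": brier(fit_empirical(train), test),
        "elo": brier(elo_expected, test),
    }
```

Whichever predictor gets the lower score on the held-out matches did better on data it had not seen, which is the comparison the in-sample benchmark cannot make.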
On the topic of whether the future matches the past, the predictions were based on a rolling database of the past 100,000 matches, which is approximately the number of matches played per 7 days. So my theory is that the data is recent enough that it should, in general, match.
Of course, I never tested this. In the end, I'm not doing science, I'm making a game. If retention goes up and complaints go down, then I can't keep working on the rating system; there are 1000 other things to do.
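A rolling table of historical odds like the one described could be sketched roughly as follows. This is an assumption-laden illustration rather than the real implementation: the class name `RollingWinOdds`, the 25-point bucket width, and the 0.5 fallback for empty buckets are all hypothetical choices; only the 100,000-match window comes from the description above.

```python
from collections import defaultdict, deque


class RollingWinOdds:
    """Observed win rate per rating-difference bucket over the last `window` matches."""

    def __init__(self, window=100_000, bucket=25):
        self.window = window
        self.bucket = bucket
        self.matches = deque()          # (bucket_key, won) in arrival order
        self.wins = defaultdict(int)
        self.games = defaultdict(int)

    def _key(self, rating_diff):
        return round(rating_diff / self.bucket)

    def add(self, rating_diff, won):
        # Evict the oldest match once the window is full.
        if len(self.matches) == self.window:
            old_key, old_won = self.matches.popleft()
            self.games[old_key] -= 1
            self.wins[old_key] -= old_won
        key = self._key(rating_diff)
        self.matches.append((key, int(won)))
        self.games[key] += 1
        self.wins[key] += int(won)

    def predict(self, rating_diff, default=0.5):
        # Historical odds for this bucket, or `default` if the bucket is empty.
        key = self._key(rating_diff)
        n = self.games[key]
        return self.wins[key] / n if n else default
```

Usage would be along the lines of calling `odds.add(diff, won)` after every match and `odds.predict(diff)` before the next one. Thinly populated buckets at extreme rating differences will be noisy, so the bucket width is a trade-off between resolution and sample size.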
Yeah, I'm not giving advice on how you should do it. I was just unsure whether critics here had understood that measured data is probably better than any theoretical fit, even the revered Elo.
> I think it is by definition the most accurate system
By gum, an opportunity to quibble semantics on the internet. That is true if 'benchmark using' means 'only admit to knowing' and 'accuracy' means 'must be numerically quantifiable given existing data'. It is false otherwise, especially if 'accuracy' means 'conforming to truth' and we have a model for how the numbers are being generated.
Obviously if I generate a set of numbers by sampling a normal distribution then the most accurate model is a normal distribution, no matter what empirical data I use for benchmarking.
That is to say, if we know how the data was generated (sans noise) we can reject empirical distributions as the most accurate, because we can directly know the distribution of the data.
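To make the normal-distribution example concrete, here is a small sketch. It assumes a standard normal as the known generator and compares the empirical CDF of a sample against the true CDF; the helper names (`normal_cdf`, `empirical_cdf`, `max_cdf_error`) and the seed are arbitrary choices for illustration.

```python
import bisect
import math
import random


def normal_cdf(x, mu=0.0, sigma=1.0):
    """True CDF of the generating normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def empirical_cdf(sample):
    """Return a function giving the fraction of observations <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        return bisect.bisect_right(ordered, x) / n

    return cdf


def max_cdf_error(sample):
    """Largest gap between the empirical CDF and the true N(0, 1) CDF at the sample points."""
    ecdf = empirical_cdf(sample)
    return max(abs(ecdf(x) - normal_cdf(x)) for x in sample)


rng = random.Random(0)  # fixed seed, chosen arbitrarily
for n in (10, 100, 10_000):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    print(n, round(max_cdf_error(sample), 4))
```

The empirical CDF fits its own sample perfectly by construction, yet it always differs from the distribution that actually produced the numbers; the printed gap is how far off it is for each sample size.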
Ok, that is a legitimate ... quibble. Let's assume that we don't already know the correct distribution. In that case we're going to judge each theoretical fit by how close it comes to the historical data. (Or else we're going to get that wrong, which is another common approach.) Elo is much more prestigious and credible than some guy who made a game, but it is less credible than data, for some number of data points N. (Although I think a theory can be more prestigious than data almost independent of N.)
Sure. If there's enough data then the data becomes more credible than even the most popular theoretical fit. If all I have is four games I played with my nephew, then people should probably go with Elo.