The performance of machine learning models is usually compared via the mean value of a selected performance measure, such as the area under the receiver operating characteristic curve, on a specific benchmark data set. However, neither this measure, nor its mean value, nor even the relative differences between models reliably predict whether the results will translate to other data sets. Gosiewska and colleagues present here a comparison based on Elo ranking, which offers a probabilistic interpretation of how much better one model is than another.
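
To illustrate the kind of probabilistic reading an Elo-style score allows, the following is a minimal sketch of the classic chess Elo formulas, not the authors' own method, which may differ in its scale and estimation procedure. The function names, the scale of 400 and the update factor `k = 32` are assumptions taken from the conventional chess setting; a "match" is assumed here to mean one model scoring better than another on a single evaluation round.

```python
def elo_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A beats B under the classic Elo model.

    The scale of 400 is the chess convention (an assumption here):
    a 400-point gap corresponds to 10:1 odds in favour of the
    higher-rated player.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A won, 0.0 if A lost and 0.5 for a tie.
    k controls the step size; 32 is a common chess default (assumption).
    """
    expected_a = elo_win_probability(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Hypothetical ratings for two models, for illustration only.
    print(elo_win_probability(1900.0, 1500.0))
    print(elo_update(1500.0, 1500.0, score_a=1.0))
```

Under this sketch, the difference between two ratings maps directly to a win probability, which is what makes an Elo-style score easier to interpret than a raw difference in a performance measure.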
- Alicja Gosiewska
- Katarzyna Woźnica
- Przemysław Biecek