The area under the receiver operating characteristic curve (AUROC) on the test set is used throughout machine learning (ML) to assess a model’s performance. However, when concordance is not the only goal, the test set AUROC gives only partial insight into performance, because it masks distribution shifts in model outputs as well as model instability.
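One way to see why a rank-based measure can hide a shift in model outputs is the minimal sketch below. It is an illustration only and is not taken from the article: the toy labels and scores, the helper names `auroc` and `accuracy_at`, and the choice of halving every score as the "shift" are all assumptions made for this example. Because the AUROC depends only on how positives rank against negatives, any order-preserving change to the scores leaves it untouched, while a metric tied to a fixed decision threshold can change substantially.

```python
def auroc(labels, scores):
    """Concordance: fraction of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy_at(labels, scores, threshold=0.5):
    """Accuracy when a score >= threshold is read as a positive prediction."""
    hits = sum((s >= threshold) == (y == 1) for y, s in zip(labels, scores))
    return hits / len(labels)


# Hypothetical toy data (assumption): ten cases, positives tend to score higher.
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

# Hypothetical output shift (assumption): every score is halved.
# The ordering of the cases is unchanged, but the output distribution moves.
shifted = [0.5 * s for s in scores]

print("AUROC    original:", auroc(labels, scores), " shifted:", auroc(labels, shifted))
print("Accuracy original:", accuracy_at(labels, scores), " shifted:", accuracy_at(labels, shifted))
```

In this toy setting the AUROC is identical before and after the shift, whereas accuracy at the fixed 0.5 threshold drops because no shifted score reaches the threshold any more. Reporting the AUROC alone would therefore not reveal the change.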
Change history
12 April 2024
A Correction to this paper has been published: https://doi.org/10.1038/s42256-024-00834-6
Ethics declarations
Competing interests
The authors declare no competing interests.
Cite this article
Roberts, M., Hazan, A., Dittmer, S. et al. The curious case of the test set AUROC. Nat Mach Intell 6, 373–376 (2024). https://doi.org/10.1038/s42256-024-00817-7
DOI: https://doi.org/10.1038/s42256-024-00817-7