
  • Perspective
  • Published:

Developing robust benchmarks for driving forward AI innovation in healthcare

Abstract

Machine learning technologies have seen increased application in the healthcare domain. The main drivers are openly available healthcare datasets and a general interest from the community in applying the power of these methods to knowledge discovery and technological advancement in this more conservative field. However, with this additional volume of work comes a range of questions and concerns: are the obtained results meaningful and the conclusions accurate; how do we know we have improved on the state of the art; is the clinical problem well defined, and does the model address it? We reflect on the key aspects of the end-to-end pipeline that we believe suffer the most in this space, and suggest some good practices to avoid reproducing these issues.
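The question of whether a model has genuinely improved on the state of the art is partly a statistical one. As a minimal illustrative sketch (not taken from the paper; all function and variable names are hypothetical), a paired bootstrap over a shared held-out test set yields an uncertainty interval for a claimed gain rather than a single point estimate:

```python
# Illustrative sketch only: when claiming an improvement over a baseline,
# report an uncertainty estimate for the gain, not just a point metric.
# Here we bootstrap the difference in AUROC on a shared held-out test set.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_delta(y_true, scores_new, scores_baseline,
                          n_boot=1000, seed=0):
    """Return the observed AUROC gain and a 95% bootstrap interval for it."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    scores_new = np.asarray(scores_new)
    scores_baseline = np.asarray(scores_baseline)
    deltas = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        if len(np.unique(y_true[idx])) < 2:              # AUROC needs both classes
            continue
        deltas.append(roc_auc_score(y_true[idx], scores_new[idx]) -
                      roc_auc_score(y_true[idx], scores_baseline[idx]))
    observed = roc_auc_score(y_true, scores_new) - roc_auc_score(y_true, scores_baseline)
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return observed, (lo, hi)
```

Under these assumptions, a gain whose interval overlaps zero is weak evidence of an improvement, whatever the point estimate suggests.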



Acknowledgements

We thank the clinicians who offered their help and opinions when reviewing this paper: L. Hartsell and M. Seneviratne. We also thank our colleagues and collaborators, N. Tomasev, K. Heller, J. Schrouff, N. Rostamzadeh, C. Ghate, L. Proleev, L. Hartsel, N. Broestl, G. Flores and S. Pfohl, for their help and support in reviewing and beta-testing our opinions.

Author information

Corresponding authors

Correspondence to Diana Mincu or Subhrajit Roy.

Ethics declarations

Competing interests

Both authors are employed by Google UK.

Peer review

Peer review information

Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Mincu, D., Roy, S. Developing robust benchmarks for driving forward AI innovation in healthcare. Nat Mach Intell 4, 916–921 (2022). https://doi.org/10.1038/s42256-022-00559-4

