Table 2 General information about the included silent studies
From: A scoping review of silent trials for medical artificial intelligence
Study | Aim and rationale | Model type and intended use | Model evaluation | Additional considerations | Categorization |
|---|---|---|---|---|---|
Aakre et al. (2017)21 | To assess an automated SOFA score calculation for patients in the ICU | Predictive machine learning | • Agreement between automated and manual SOFA score calculation over a 1-month period • Comparison of 215 ICU inpatients’ SOFA scores at 3 hospital sites, with 5,978 total scores compared • 134 random spot checks on 27 unique patients to assess the real-time accuracy of automated SOFA score calculation • Manual scoring performed independently by research team members, with a chart review for comparison | Interviewed clinicians about interface features to visualize SOFA subcomponents | Compared model outputs with clinician annotations |
Afshar et al. (2023)28 | To assess the AI tool’s predictive performance and evaluate human factors | Predictive deep learning | • Algorithm performance: sensitivity and specificity • Observed 100 random encounters with adult patients • Described data flow from and to the EHR • Described scalability and computational infrastructure | • Interview guide and survey to assess user acceptability of the tool • Determined barriers and facilitators to using the tool | Framework for the design and implementation of the model |
Alrajhi et al. (2022)75 | To assess a real-time severity prediction tool for COVID-19 management | Predictive machine learning | • Algorithm performance: AUC/ROC, F1 • 185 cases for the prospective validation set • Imputed missing data; addressed class imbalances | Clinician feedback related to class imbalance issue | Algorithmic validation study |
Aydın et al. (2025)76 | To validate and compare an ML-based scoring system for paediatric appendicitis | Diagnostic machine learning | • Algorithm performance: AUC, sensitivity, specificity, PPV, NPV • Applied to 3,036 paediatric patients across 13 hospitals and 13 paediatric centres • ML-based diagnosis assessed against histopathological examination (gold standard) • Compared ML model performance against existing scoring methods | • Specified separation of care and model validation • Assessed feature interactions and ranked importance | Algorithmic validation, comparative study |
Bachelot et al. (2023)77 | To compare model performance for testicular sperm extraction | Predictive machine learning | • Algorithm performance: AUC, sensitivity, specificity • 26 patients for the prospective validation set • Described data processing | Assessed feature importance across models | Algorithmic validation study |
Bedoya et al. (2020)39 | To validate a sepsis prediction model | Diagnostic deep learning | • Algorithm performance: compared with standard EWS, compared multiple models with the standard process • 1,475 encounters over a 2-month silent trial • Model development team tracked alarm volume, resolved technical issues and identified label leakage • Calculated alarm volume | Stakeholder engagement with clinical teams | Comparison of the model with the standard-of-care algorithm |
Berg et al. (2023)78 | To assess AI software for classifying palpable breast masses in a low-resource setting | Predictive AI | • Algorithm performance: AUC, specificity, NPR • 758 masses in breast tissue • A single radiologist reader reviewed AI- and radiologist-assigned malignancies • Minimal training for users to mimic the conditions of intended use | | Compared diagnostic performance with human readers |
Brajer et al. (2020)36 | To assess the model’s ability to predict the risk of in-hospital mortality for adult patients | Predictive machine learning | • Algorithm performance: ROC, PR, AUROC • 5,273 hospitalizations, 4,525 unique adult patients in the ICU • Assessed subgroup-specific performance for sensitivity, specificity and PPV • Assessed threshold setting in different environments • Described data and model availability; updated predictions daily | • Partnered with clinical and operational leaders to design the model and evaluation • Clinical partners provided feedback on the interface • Model fact sheet iteratively designed with stakeholder input | Compared algorithmic prediction with human annotations |
Butler et al. (2019)79 | To clinically validate an AI tool for triaging brain cancer | Triage machine learning | • Algorithm performance: sensitivity, specificity • 104 patients with brain cancer • Outcome assessment was blinded to the algorithm • Some subgroup-specific analysis of under-represented cancer cases | Simulated workflow run within a research laboratory | Compared algorithmic prediction with independent clinician diagnosis |
Campanella et al. (2025)80 | To conduct a prospective silent trial of a model for lung cancer detection | Predictive machine learning | • Algorithm performance: AUC, PPV, NPV, sensitivity, specificity • Application of an open-source foundation model with local fine-tuning • 4-month trial period • Subgroup analysis by sample type, failure mode testing of false negatives • Assessed different thresholds against primary metrics • Described data pipeline and real-time stream | Assessed the attention areas of the model | Prospective silent trial |
Chen et al. (2025)81 | To evaluate the utility of a radiomics nomogram to predict oesophageal pathological progression | Predictive machine learning | • Algorithm performance: AUC, sensitivity, specificity, accuracy, DCA • 251 cases • Ground truth was reviewed by a pathologist and compared with, and combined with, the model output to assess overall clinical utility • Described the need for preprocessing due to equipment differences | DCA for utility | Clinical validation |
Cheng et al. (2025)82 | To prospectively validate a hypertension risk model | Predictive machine learning | • Algorithm performance: AUC, precision, sensitivity, specificity, calibration curves • 961,519 cases • Assessed fairness across age and sex, BMI across different risk levels, model performance, and socioeconomic factors in the high-risk group • Discussed managing data missingness and shift | Clinician-focused app providing an opportunity to assess prediction utility and risk factor contributions | Algorithmic validation |
Chiang et al. (2025)83 | To prospectively validate an early warning haemodynamic risk model | Predictive machine learning | • Algorithm performance: AUROC, AUPRC, precision, recall, specificity, false alarm rate and missed alarm rate • 18,438 patient cases • Assessed sex and age, as well as respiratory, cardiovascular, gastrointestinal and trauma groups on AUROC and AUPRC • Model updates hourly | | Algorithmic validation |
Chufal et al. (2025)84 | To prospectively and temporally validate a model predicting ineligibility for radiotherapy treatment | Predictive machine learning | • Algorithm performance: AUC • 47 patients • Compared model prediction with clinical decision on a case-by-case basis, with only the research team seeing the model predictions • Noted fairness concerns by sociodemographic groups; stated that these were addressed through consistency in the assessment method | Discussion of threshold setting based on clinical impact to patients and risk assessment | Prospective algorithmic validation with clinical verification |
Coley et al. (2021)85 | To assess an algorithm’s accuracy in predicting suicide attempts within 90 days | Predictive machine learning | • Algorithm performance: sensitivity, specificity, PPV, NPV • Prospective algorithmic validation concurrent with the testing set | | Temporal validation, internal algorithmic validation |
Corbin et al. (2023)86 | To conduct a silent trial of the model’s prospective performance | Predictive machine learning | • Algorithm performance: AUROC, ROC, calibration, net benefit, expected utility • 10,000–20,000 unique patients • Bias assessed across protected demographic classes • Mapping of data inputs to outputs across the data stream workflow | | Prospective algorithmic validation |
Dave et al. (2023)87 | To evaluate the accuracy of a real-time model detecting abnormal lung parenchyma | Predictive deep learning | • Algorithm performance: AUROC, F1 • 100 patients, sample size rationale provided • Analysed by sex, race, ventilation strategy and BMI • Functionality embedded into an ultrasound machine • Assessed different classification and contiguity thresholds • Human assessment independent from predictions | | Compared algorithmic prediction with human annotations |
El Moheb et al. (2025)88 | To validate a model for automated billing coding | Administrative deep learning | • Algorithm performance: precision, recall, F1, AUPRC • 268 operative notes • Trained to predict 19 CPT codes for automated coding, compared with expert medical coders • Assessed overcoding and undercoding, as well as discrepancies against ground truth | | Prospective algorithmic validation study |
Escalé-Besa et al. (2023)24 | To validate a model’s diagnostic accuracy for skin diseases | Diagnostic deep learning | • Algorithm performance: accuracy, sensitivity, specificity per disease; TP, FP, TN or FN based on the top 3 most likely diagnoses • 100 patients • Failure case analysis • Clinician diagnosis and offered AI prediction | Satisfaction of GPs with AI as decision support for each case | Compared diagnostic performance with human readers |
Faqar-Uz-Zaman et al. (2022)89 | To evaluate the diagnostic accuracy of an app in the ED | Diagnostic (N/A) | • Algorithm performance • 450 patients • Compared diagnostic accuracy for the top 4–5 diagnoses between the AI tool and the ED physician (matched between candidate diagnoses) | | Compared algorithmic prediction with human annotations |
Felmingham et al. (2022)90 | To evaluate an AI tool’s diagnostic accuracy for skin cancer detection | Diagnostic deep learning | • Algorithm performance: AUROC, sensitivity, specificity, FNR • 214 cases, 742 lesions • Users were trained on the camera and software before the study • Compared diagnostic accuracy with independent diagnoses by teledermatologists • Analysis of AI errors | | Compared algorithmic prediction with independent clinician diagnosis |
Feng et al. (2025)91 | To validate a diagnostic model for distinguishing thymomas from other nodules | Diagnostic machine learning | • Algorithm performance: ROC, DCA, sensitivity, specificity • 23 patients • Expert evaluation panel provided ground truth • Performance of 3 radiologists (mixed experience levels) compared with model performance using AUC • No clinical information provided to the radiologists | Described a training process for radiologists | Prospective clinical validation (silent trial) |
Hanley et al. (2017)92 | To evaluate an AI tool for predicting the need for a CT scan in patients with TBI | Triage machine learning | • Algorithm performance: AUROC, sensitivity, specificity, NPV, PPV; clinical utility • 720 patient CTs across 11 ED sites • Assessed model outputs against clinical annotations as determined by laboratory reading and imaging specialist readers according to a prespecified statistical plan • Failure mode analysis of false negatives | | Compared algorithmic prediction with human annotations |
Hoang et al. (2025)93 | To evaluate SAFE-WAIT in a silent trial | Predictive machine learning | • Algorithm performance: recall, specificity, accuracy, precision, NPV, FPR, FNR, F1 score • Bias assessment conducted by sex (male, female) and age bracket (young, middle-aged, older adult) | Utility value calculation articulated in terms of clinically relevant decisions and outcomes | Silent trial (algorithmic validation) |
Im et al. (2018)94 | To validate an AI tool for diagnosing aggressive lymphomas before deployment to LMICs | Diagnostic deep learning | • Algorithm performance: specificity, sensitivity, efficiency, size measurements, staining, reproducibility • Described data quality controls • Equipment detailed • 40 patients | Computational time and system components, cost, computational infrastructure | Independent verification of AI labels against clinician assessment |
Jauk et al. (2020)19 | To evaluate a delirium prediction model in its clinical setting | Predictive machine learning | • Algorithm performance: AUROC, sensitivity, specificity, FPR, FNR, PPV, NPV • Rated against nurse assessment of the delirium risk score and the Confusion Assessment Method • Reported failure modes and exclusions • Independent assessment by nurses on 33 patients, 86 with exposure to the AI output | • Expert group of senior physicians, ward nurses, technicians, employees • Offered training for users | Compared outcomes with expert ratings |
Kim et al. (2023)10 | To validate a commercial AI tool for detecting chest radiographic abnormalities | Diagnostic AI | • Algorithm performance: AUROC, sensitivity, specificity • Assessed pathologies on 3,047 radiographs with and without AI output across two centres • CE marking by the Ministry of Food and Drug Safety of Korea • 4 first- and third-year radiology residents as target users • Reading times and failure case analysis | | Compared diagnostic accuracy with and without AI assistance |
Korfiatis et al. (2023)95 | To evaluate an AI tool detecting PDA from CT scans | Diagnostic deep learning | • Algorithm performance: AUROC, sensitivity, specificity, F1 • Simulated a screening sample of 297 consecutive abdominal CTs for validation by radiologists • Assessed failure modes using tumour-related parameters | • Reported substantial impact to clinical workflow • Used heat maps during the review process | Radiologist-verified diagnostic accuracy |
Kramer et al. (2024)96 | To validate a model predicting malnutrition in hospitalized patients | Predictive machine learning | • Algorithm performance: AUROC, sensitivity, specificity, accuracy • 159 patients • Dieticians assessed malnutrition in admitted patients, compared (masked) with real-time ML predictions | | Compared algorithmic prediction with human annotations |
Kwong et al. (2022)97 | To evaluate a model predicting hydronephrosis in utero | Predictive deep learning | • Algorithm performance: AUROC, AUPRC • Assessed failure modes by age, laterality, changes in image processing and ultrasound machine • Assessed bias for sex and postal code • Looked for potential causes of drift • Recorded model downtime • 1,234 cases with prediction at the desired implementation care point and compared with later decision to proceed with surgery • Reported data stream for model evaluation related to patient data confidentiality and security | • Measured clinician engagement • Assessed usability and disruption to workflow • Used activation maps • Conducted patient and family surveys to assess receptivity | Verification of the model against the outcome label |
Liu et al. (2023)98 | To validate a model predicting postoperative pain | Predictive deep learning | • Algorithm performance: ROC, AUC, RMSE, correlation • Compared algorithmic prediction of maximum pain score with clinician preprocedure prediction in adult inpatients undergoing noncardiac surgery with general anaesthesia • Included patient race in the model but did not report performance subgrouped by race • Reported dataset drift | | Compared algorithmic prediction with independent clinician rating |
Liu et al. (2024)99 | To evaluate an AI model estimating bone age | Decision support deep learning | • Algorithm performance: RMSE, MSE • Assessed performance by patient age and sex, as well as radiography vendor • 973 radiographs across 9 hospitals • 3 expert reviewers as gold standard; inter-rater reliability calculated | • Measured time to completion of reading, human versus AI • Per-bone κ values to indicate disagreements | Clinical validation study comparing AI with gold standard |
Luo et al. (2019)100 | To validate a model detecting gastrointestinal cancers | Diagnostic deep learning | • Algorithm performance: AUC, ROC, PPV, NPV, sensitivity, specificity • Reviewed false negatives and a random subset against an independent assessment by experts • 175 patients, 4,532 images collected from 5 hospitals • Noted the presence and location of tumours | Measured processing time | Algorithmic validation with verification of a random subset |
Lupei et al. (2022)101 | To evaluate the real-time performance of a COVID-19 prognostic model | Predictive machine learning | • Algorithm performance: AUC, ROC, PPV, NPV, sensitivity, specificity • 13,271 symptomatic patients with COVID-19 • Evaluated sensitivity and specificity across sex and race • Assessed label drift as a result of improved outcomes for patients | Patient opt-outs of research requests were noted in the chart and honoured by the team | Prospective algorithmic validation |
Mahajan et al. (2023)102 | To assess a model’s predictive accuracy for 30-day postoperative mortality and major adverse cardiac and cerebrovascular events | Predictive machine learning | • Algorithm performance: AUC, ROC, PPV, NPV, sensitivity, specificity • 206,353 patient cases • Compared performance with an algorithm already used in care | SHAP values applied to retrospective test only | Prospective algorithmic validation study |
Major et al. (2020)103 | To validate a model predicting short-term in-hospital mortality | Predictive machine learning | • Algorithm performance: descriptive statistics (n patients meeting the primary outcome) • 9-month trial with 41,728 predictions + 12-week silent test in which hospitalists reviewed 104 alerts to determine whether the alert was actionable and appropriate • Assessed bias by comparing algorithmic fairness approaches | • Clinical stakeholders selected 75% PPV as the desired threshold for the model • Experimented with different thresholds, varied across sites to reflect population needs | Prospective algorithmic validation |
Manz et al. (2020)16 | To validate an algorithm predicting 180-day mortality risk in a general oncology cohort | Predictive machine learning | • Algorithm performance: AUC, AUPRC, Brier score, PPV, NPV, sensitivity, alert rate tested at different risk thresholds • 24,582 patient cases over a 2-month period • Calculated performance metrics across different groups (disease site, practice type, self-reported race, sex, insurance, stage of cancer); reported performance to be better for women or at a later stage of cancer for men • Described the model being locked; no updates made | Use of a nudging strategy described in a companion paper | Prospective algorithmic validation |
Miró Catalina et al. (2024)104 | To validate a diagnostic algorithm in radiology | Diagnostic deep learning | • Algorithm performance: TP, TN, FP, FN, sensitivity, specificity • 278 cases of 471 participants • Researchers interpreted reference radiology reports before inputting to AI to obtain a diagnosis for comparison • Error testing for certain pathologies | | Compared diagnostic performance with human readers |
Morse et al. (2022)27 | To evaluate a model detecting CKD in a paediatric hospital | Evaluative machine learning | • Algorithm performance: AUROC • ML model draws data directly from the EHR in near real time • 1,270 patient admissions over ~6 months | | Prospective algorithmic validation |
Nemeth et al. (2023)37 | To validate a model for detecting septic shock | Predictive machine learning | • Algorithm performance: AUC, PPV, NPV • 5,384 hospital admissions in 4,804 patients during a 6-month silent test, comparing predictions with a clinician’s independent judgement • Extensive failure case analysis • Tested different time horizons • Described data flow and infrastructure for the model | • Codesign using interviews with multiple stakeholders • User acceptance testing • Alignment of model use with practice guidelines | Compared model outputs with clinician annotations |
O’Brien et al. (2020)105 | To evaluate an EWS for patient deterioration | Predictive machine learning | • Algorithm performance: PPV, sensitivity, thresholding • 4,210 encounters, 97 patients • Set up data analytics to reflect real-time streaming of live data | • Alert risk presented using red, yellow and green colour codes • Nursing consult on visualization | Algorithmic validation study |
Ouyang et al. (2020)32 | To validate a segmentation model assessing cardiac function | Predictive deep learning | • Algorithm performance: AUC, RMSE, R2 • Measurements of cardiac function in 1,288 patients • Compared model measurements with those by human annotators, with manual blinded re-evaluation by 5 experts for cases with a large discrepancy between the model and annotations | Compared model outputs with clinician annotations | |
Pan et al. (2025)106 | To validate a model predicting the utility of CT for mTBI | Predictive machine learning | • Algorithm performance: AUC, accuracy, sensitivity, specificity, PPV, NPV, F1, DCA • 86 patients • ML model compared with serum biomarkers for TBI and a statistical regression model | • SHAP values • DCA to assess clinical utility | Prospective clinical validation (silent trial) |
Pou-Prom et al. (2022)34 | To validate an early warning system in inpatients | Predictive machine learning | • Algorithm performance: AUC, PPV, sensitivity • Determined a composite outcome label • Described the shift needed to accommodate changes due to onset of the COVID-19 pandemic • Described a detailed preprocessing plan • Evaluated the processing stream • Initially planned a 4-month trial, which was extended to 6 months • Conducted training with users | Weekly check-ins with stakeholders during the silent phase | Real-time algorithmic validation |
Pyrros et al. (2023)107 | To validate a model detecting type 2 diabetes from chest radiographs and EHR data | Predictive deep learning | • Algorithm performance: AUROC, PPV, sensitivity, specificity, F1, Youden’s J index, PR, NPV, odds ratio, demographics • 9,943 chest radiographs • Noted the potential for health disparities; planned subgroup analysis by race/ethnicity; mentioned the need for fine-tuning due to fairness and robustness issues • Data stream and infrastructure described | Used an animated technique through an autoencoder for feature highlighting | Algorithmic validation study |
Qian et al. (2025)108 | To validate a model predicting surgical intervention need for paediatric intussusception | Predictive deep learning | • Algorithm performance: AUC, accuracy, NPV, F1, ROC • 50 patients • Reported consistent performance across different patient populations by age | | Algorithmic validation |
Rajakariar et al. (2020)25 | To validate a smartwatch device for detecting atrial fibrillation | Diagnostic machine learning | • Algorithm performance: sensitivity, specificity, TP, TN, Cohen’s κ for agreement • Failure case analysis for unclassified tracings assessed by 2 electrophysiologists • Described the data pipeline • 200 consecutive patients over 6 months, 439 ECGs • Cardiologist diagnosis as the reference standard | | Compared device output with clinician diagnosis |
Rawson et al. (2021)109 | To validate a model detecting secondary bacterial infection during COVID-19 | Predictive machine learning | • Algorithm performance: AUROC, descriptive analysis | | Prospective pilot test of the algorithm |
Razavian et al. (2020)33 | To validate a model predicting outcomes for hospitalized patients with COVID-19 | Predictive machine learning | • Algorithm performance: AUROC, AUPRC, PPV, thresholded sensitivity, confidence intervals • Integration through the EHR; data flow described • Described the cleaning process, feature minimization, threshold selection and time horizon • 445 patients over 474 admissions (109,913 prediction instances) • Medical students and practicing physicians assessed face validity, timing and clinical utility | • Review with medical students to assess 30 patient encounters for impact on clinical decision-making from model prediction • Interface described • Feature-level XAI | Prospective observational study (impact unclear) |
Ren et al. (2025)110 | To evaluate a smartphone-based AI for classifying auricular deformities | Diagnostic deep learning | • Algorithm performance: AUC, ROC, sensitivity, specificity, precision, F1 score • 272 cases • Ground truth established by two independent professionals • Compared human and model performance • Scalable and low-cost diagnostic support • Guidance for proper image acquisition • Failure analysis identified discrepancies between retrospective and prospective validation sets • Described the data pipeline and inference process | | Clinical validation |
Schinkel et al. (2022)111 | To validate a model predicting a positive blood culture result | Predictive machine learning | • Algorithm performance: AUROC, AUPRC, calibration, feature contributions, DCA • Described data processing in a live context • 3-month period of real-time validation | | Real-time prospective algorithmic validation |
Shah et al. (2021)112 | To validate a model predicting clinical deterioration | Predictive machine learning | • Algorithm performance: AUPRC, AUROC, PPV, NNE • Preplanned subgroup analysis by race, sex and age revealed discrepancies • 146,446 hospitalizations in 103,930 unique patients • Described data processing steps and feature importance calculations | | Algorithmic validation study |
Shamout et al. (2021)113 | To validate a model predicting deterioration from COVID-19 | Predictive machine learning | • Algorithm performance: AUC, PR, PPV, NPV • 375 examinations • Real-time extraction; addressed computational resources | | Prospective algorithmic validation (silent trial) |
Shelov et al. (2018)38 | To validate a model predicting clinical acuity in a paediatric ICU | Machine learning decision support | • Algorithm performance: Littenberg Technology Assessment in Medicine framework • Approximately 6-month verification phase before going live • Measured the impact of the model in EHR on processing time • Validation done through a survey for project team clinicians to complete (315 forms for 182 patients) • Retrospective analysis of data quality and patients meeting the at-risk criteria • Reported on the availability of the algorithm | • Some interfaces included • Design included a multidisciplinary team comprising physicians, nurses, informaticians, respiratory therapists and improvement advisors | Prospective verification of the model against clinical judgement |
Sheppard et al. (2018)29 | To validate an algorithm for triaging patients with suspected high BP for ambulatory pressure monitoring | Triage machine learning | • Algorithm performance: sensitivity, specificity, PPV, NPV, AUROC • Compared the accuracy of the triaging strategy across subgroups (by setting, age, sex, smoking status, BMI, history of hypertension, diabetes, CKD, cardiovascular disease and BP measuring device) • 887 eligible patients with 3 same-visit BP readings • Described the rationale for excluding cases based on data missingness | Advised patients with hypertension history on the design of the project, recruitment and study literature before ethics submission | Comparison of algorithmic triaging approach against the standard |
Shi et al. (2025)114 | To evaluate a model predicting the risk of colorectal polyp recurrence | Predictive machine learning | • Algorithm performance: ROC, DCA, sensitivity, specificity • 166 patients | • DCA to assess clinical utility • Demonstrated the user interface | Prospective algorithmic validation study |
Smith et al. (2024)115 | To evaluate a model for breast cancer screening | AI decision support | • Algorithm performance: recall or no recall decision • Assessed concordant and discordant cases • 8,779 patients aged 50–70 years • Trained film readers verified the results • Assessed multiple features of patients and scan results | Regions of interest available during reviews | Compared diagnostic performance with human readers |
Stamatopoulos et al. (2025)116 | To validate a model predicting miscarriage risk | Predictive machine learning | • Algorithm performance: sensitivity, specificity, PPV, NPV • Assessor had access to ground truth and compared algorithm predictions against short-term outcomes | Inferred a lack of clinical utility due to unreliable predictions | Prospective algorithmic validation study |
Stephen et al. (2023)20 | To validate a model detecting paediatric sepsis | Predictive machine learning | • Algorithm performance: AUC, PPV • 8,608 cases (1-year period) • Thresholding for alerts to consider false alerts, alert fatigue, resources for sepsis huddle | Team of clinicians, data scientists, improvement experts and clinical informaticians; regular meetings throughout the project | Real-time algorithmic validation |
Swinnerton et al. (2025)117 | To prospectively validate a prediction tool for severe COVID-19 risk | Predictive machine learning | • Algorithm performance: AUC, calibration • 51,587 infections • Assessed subgroup performance | Feature importance | Prospective algorithmic validation study |
Tan et al. (2025)26 | To clinically validate AI-based multispectral imaging for burn wound assessment | Classification deep learning | • Algorithm performance: sensitivity, specificity, accuracy • 40 patients, 70 burn images • Failure mode analysis of overdiagnosis • Bias assessment by skin pigmentation and tattoo presence • Reported on availability, feasibility and time to diagnostic result • Described the user interface • UKCA class I medical device, ISO 13485 | • Reported evaluator training • Described the user interface | Prospective clinical validation study |
Tariq et al. (2023)118 | To validate a model screening for low bone density | Screening machine learning | • Algorithm performance: image label, precision, recall, F score, AUROC • For 2 consecutive days, curated 344 scans (with and without contrast) from patients aged ≥50 years • Some analysis of lower-performing areas | Heat maps for regions of interest | Algorithmic validation study |
Titano et al. (2018)119 | To simulate the clinical implementation of a triage algorithm for radiology | Triage deep learning | • Algorithm performance: AUC, sensitivity, specificity, accuracy, time to notify about critical findings, runtime • 180 images reviewed by a radiologist and a surgeon (50/50 split); 2 radiologists and a neurosurgeon reviewed images without access to the EMR or prior images | | Prospective simulated trial with human readers |
Vaid et al. (2020)120 | To validate an outcome prediction model for COVID-19 | Predictive machine learning | • Algorithm performance: AUROC, AUPRC, F1, sensitivity, specificity • 21-day trial • Assessed race as a potential contributing variable to outcome prediction | SHAP scores | Prospective algorithmic validation (silent trial) |
Wall et al. (2022)121 | To evaluate a model for supporting radiation therapy plans | Predictive machine learning | • Algorithm performance: prediction error, ROC, concordance • VQA application provides failures for features, top 5 features and ‘total gain’ • Reported runtime and compute power • Physicists measured 445 VMAT plans over 3 months • VQA predictions recorded alongside PSQA measurements | | Prospective validation including comparison with the standard of care |
Wan et al. (2025)122 | To validate a model predicting neoadjuvant treatment response | Predictive machine learning | • Algorithm performance: AIC, ROC, PPV, NPV, DCA, calibration • 76 patients • Compared the performance of a clinical–radiomics model to that of a radiomics model, a clinical model and a radiologist’s subjective assessment | DCA to assess potential clinical benefit | Clinical validation |
Wang et al. (2019)123 | To validate a model predicting new-onset lung cancer | Predictive machine learning | • Algorithm performance: AUC, ROC, PPV, sensitivity, specificity • Performance within each risk category • 836,659 patient records | | Algorithmic validation study |
Wang et al. (2025)124 | To validate a model for cardiovascular disease diagnosis | Diagnostic deep learning | • Algorithm performance: AUC, sensitivity, specificity, F1, accuracy • 62 patients • Ground truth established by 3 emergency physicians reviewing the data, compared with algorithm outputs | SHAP values | Algorithmic validation with clinical verification |
Wissel et al. (2020)125 | To validate an NLP application to assign surgical candidacy for epilepsy | Decision support machine learning | • Algorithm performance: AUC, sensitivity, specificity, PPV, NPV, NNS, number of prospective surgical candidates • Retrained the model weekly on the most recent training set based on free text notes • Verification on 100 randomly selected patient cases • Tested the inter-rater reliability of clinicians’ manual classifications versus the algorithm | Interpretability analysis revealed wording associated with surgical candidacy | Algorithmic validation with verification of a random subset |
Wong et al. (2021)30 | To temporally validate a model predicting acute respiratory failure | Predictive machine learning | • Algorithm performance: AUROC, AUPRC, sensitivity, specificity, PPV, NPV • Event horizon • 122,842 encounters, 112,740 controls | | Temporal validation study |
Xie et al. (2025)126 | To validate a model diagnosing axial spondyloarthritis | Diagnostic deep learning | • Algorithm performance: AUC, accuracy, sensitivity, specificity, F1, precision • 209 patients • Diagnostic accuracy compared with accepted clinical classification criteria for each patient | SHAP values | Algorithmic validation |
Ye et al. (2019)127 | To validate a real-time early warning system predicting high risk of inpatient mortality | Predictive machine learning | • Algorithm performance: sensitivity, specificity, PPV, ROC, C-statistic, hazard ratios • 11,762 patients with an assigned EWS | Top 50 important features | Algorithmic validation study |
Ye et al. (2020)128 | To validate a nomogram for predicting liver failure | Predictive machine learning | • Algorithm performance: precision, recall, accuracy, F1 • 120 patients undergoing hepatectomy | | Algorithmic validation study |
Yu et al. (2022)129 | To validate a sepsis prediction model | Predictive machine learning | • Algorithm performance: F1, sensitivity, specificity, AUROC, AUPRC • 3,532 alerts; 388 met the sepsis criteria • Analysed model successes and failures • Considered scalability through compute requirements | SHAP values for a ‘lite’ version of the model | Algorithmic validation study |
Zhang et al. (2025)130 | To validate a model identifying atrial fibrillation after ischaemic stroke | Diagnostic deep learning | • Algorithm performance: AUC, sensitivity, specificity, PPV, NPV • 73 patients • Assessed model performance by patient age bracket • An independent researcher conducted a blinded review of predicted atrial fibrillation status and actual diagnosis after clinical workup • Described data cleaning and patient inclusion criteria | | Algorithmic validation |
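
Most rows above summarize "algorithm performance" with the same family of discrimination and threshold-based metrics (AUROC, sensitivity, specificity, PPV, NPV), frequently broken down by subgroup for bias assessment. The snippet below is a minimal, hypothetical sketch of how such metrics might be computed from a silent-trial prediction log; the column names (y_true, y_score, sex), the alert threshold and the toy data are illustrative assumptions and are not drawn from any of the included studies.

```python
# Minimal sketch (not from any cited study): compute the common silent-trial
# performance metrics from a logged prediction table, overall and by subgroup.
# Column names, threshold and data below are hypothetical.
import pandas as pd
from sklearn.metrics import roc_auc_score, confusion_matrix


def evaluate(df: pd.DataFrame, threshold: float = 0.5) -> dict:
    """Compare logged model scores against adjudicated outcome labels."""
    y_true = df["y_true"].to_numpy()
    y_pred = (df["y_score"].to_numpy() >= threshold).astype(int)
    # confusion_matrix with labels=[0, 1] always returns a 2x2 matrix.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "n": len(df),
        "auroc": roc_auc_score(y_true, df["y_score"]),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
        "npv": tn / (tn + fn) if (tn + fn) else float("nan"),
    }


if __name__ == "__main__":
    # Hypothetical silent-trial log: one row per prediction, with outcome
    # labels assigned later by chart review, blinded to the model output.
    log = pd.DataFrame({
        "y_true": [0, 0, 1, 1, 0, 1, 0, 1],
        "y_score": [0.1, 0.4, 0.8, 0.6, 0.2, 0.9, 0.55, 0.3],
        "sex": ["F", "M", "F", "M", "F", "M", "F", "M"],
    })
    print("overall:", evaluate(log, threshold=0.5))
    # Subgroup-specific performance, as in several studies' bias assessments.
    for group, subset in log.groupby("sex"):
        print(group, evaluate(subset, threshold=0.5))
```

In practice the threshold would be chosen with clinical stakeholders (as several studies describe), and subgroup breakdowns would cover whichever attributes the trial prespecifies (age, race, site, device and so on).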