Table 2 General information about the included silent studies

From: A scoping review of silent trials for medical artificial intelligence

Study

Aim and rationale

Model type and intended use

Model evaluation

Additional considerations

Categorization

Aakre et al. (2017)21

To assess an automated SOFA score calculation for patients in the ICU

Predictive machine learning

• Agreement between automated SOFA scoring and manual scoring calculation over a 1-month period

• Comparison of 215 ICU inpatients’ SOFA scores at 3 hospital sites, with 5,978 total scores compared

• 134 random spot checks on 27 unique patients to assess the real-time accuracy of automated SOFA score calculation

• Manual scoring performed independently by research team members, with a chart review for comparison

Interviewed clinicians about interface features to visualize SOFA subcomponents

Compared model outputs with clinician annotations

Afshar et al. (2023)28

To assess the AI tool’s predictive performance and evaluate human factors

Predictive deep learning

• Algorithm performance: sensitivity and specificity

• Observed 100 random encounters with adult patients

• Described data flow from and to the EHR

• Described scalability and computational infrastructure

• Interview guide and survey to assess user acceptability of the tool

• Determined barriers and facilitators to using the tool

Framework for the design and implementation of the model

Alrajhi et al. (2022)75

To assess a real-time severity prediction tool for COVID-19 management

Predictive machine learning

• Algorithm performance: AUC/ROC, F1

• 185 cases for the prospective validation set

• Imputed missing data; addressed class imbalances

Clinician feedback related to class imbalance issue

Algorithmic validation study

Aydın et al. (2025)76

To validate and compare an ML-based scoring system for paediatric appendicitis

Diagnostic machine learning

• Algorithm performance: AUC, sensitivity, specificity, PPV, NPV

• Applied to 3,036 paediatric patients across 13 hospitals and 13 paediatric centres

• ML-based diagnosis assessed against histopathological examination (gold standard)

• Compared ML model performance against existing scoring methods

• Specified separation of care and model validation

• Assessed feature interactions and ranked importance

Algorithmic validation, comparative study

Bachelot et al. (2023)77

To compare model performance for testicular sperm extraction

Predictive machine learning

• Algorithm performance: AUC, sensitivity, specificity

• 26 patients for the prospective validation set

• Described data processing

Assessed feature importance across models

Algorithmic validation study

Bedoya et al. (2020)39

To validate a sepsis prediction model

Diagnostic deep learning

• Algorithm performance: compared with the standard EWS; multiple models compared with the standard process

• 1,475 encounters over a 2-month silent trial

• Model development team tracked alarm volume, resolved technical issues and identified label leakage

• Calculated alarm volume

Stakeholder engagement with clinical teams

Comparison of the model with the standard-of-care algorithm

Berg et al. (2023)78

To assess an AI software for classifying palpable breast masses in a low-resource setting

Predictive AI

• Algorithm performance: AUC, specificity, NPR

• 758 masses in breast tissue

• A single radiologist reader reviewed AI- and radiologist-assigned malignancies

• Minimal training for users to mimic the conditions of intended use

 

Compared diagnostic performance with human readers

Brajer et al. (2020)36

To assess the model’s ability to predict the risk of in-hospital mortality for adult patients

Predictive machine learning

• Algorithm performance: ROC, PR, AUROC

• 5,273 hospitalizations, 4,525 unique adult patients in the ICU

• Assessed subgroup-specific performance for sensitivity, specificity and PPV

• Assessed threshold setting in different environments

• Described data and model availability; updated predictions daily

• Partnered with clinical and operational leaders to design the model and evaluation

• Clinical partners provided feedback into the interface

• Model fact sheet iteratively designed with stakeholder input

Compared algorithmic prediction with human annotations

Butler et al. (2019)79

To clinically validate an AI tool for triaging brain cancer

Triage machine learning

• Algorithm performance: sensitivity, specificity

• 104 patients with brain cancer

• Outcome assessment was blinded to the algorithm

• Some subgroup-specific analysis of under-represented cancer cases

Simulated workflow run within a research laboratory

Compared algorithmic prediction with independent clinician diagnosis

Campanella et al. (2025)80

To conduct a prospective silent trial of a model for lung cancer detection

Predictive machine learning

• Algorithm performance: AUC, PPV, NPV, sensitivity, specificity

• Application of an open-source foundation model with local fine-tuning

• 4-month trial period

• Subgrouped analysis by sample type, failure mode testing of false negatives

• Assessed different thresholds against primary metrics

• Described data pipeline and real-time stream

Assessed the attention areas of the model

Prospective silent trial

Chen et al. (2025)81

To evaluate the utility of a radiomics nomogram to predict oesophageal pathological progression

Predictive machine learning

• Algorithm performance: AUC, sensitivity, specificity, accuracy, DCA

• 251 cases

• Ground truth reviewed by a pathologist and compared with, and combined with, the model to assess overall clinical utility

• Described the need for preprocessing due to equipment differences

DCA for utility

Clinical validation

Cheng et al. (2025)82

To prospectively validate a hypertension risk model

Predictive machine learning

• Algorithm performance: AUC, precision, sensitivity, specificity, calibration curves

• 961,519 cases

• Assessed fairness across age and sex; BMI across different risk levels; model performance; and socioeconomic factors in the high-risk group

• Discussed managing data missingness and shift

Clinician-facing app providing an opportunity to assess prediction utility and risk factor contributions

Algorithmic validation

Chiang et al. (2025)83

To prospectively validate an early warning haemodynamic risk model

Predictive machine learning

• Algorithm performance: AUROC, AUPRC, precision, recall, specificity, false alarm rate and missed alarm rate

• 18,438 patient cases

• Assessed AUROC and AUPRC by sex and age, as well as for respiratory, cardiovascular, gastrointestinal and trauma groups

• Model updates hourly

 

Algorithmic validation

Chufal et al. (2025)84

To prospectively and temporally validate a model predicting ineligibility for radiotherapy treatment

Predictive machine learning

• Algorithm performance: AUC

• 47 patients

• Compared model prediction with clinical decision on a case-by-case basis, with only the research team seeing the model predictions

• Noted fairness concerns by sociodemographic groups; stated that these were addressed through consistency in the assessment method

Discussion of threshold setting based on clinical impact to patients and risk assessment

Prospective algorithmic validation with clinical verification

Coley et al. (2021)85

To assess an algorithm’s predictive accuracy of suicide attempt within 90 days

Predictive machine learning

• Algorithm performance: sensitivity, specificity, PPV, NPV

• Prospective algorithmic validation concurrent with the testing set

 

Temporal validation, internal algorithmic validation

Corbin et al. (2023)86

To conduct a silent trial of the model’s prospective performance

Predictive machine learning

• Algorithm performance: AUROC, ROC, calibration, net benefit, expected utility

• 10,000–20,000 unique patients

• Bias assessed across protected demographic classes

• Mapping of data inputs to outputs across the data stream workflow

 

Prospective algorithmic validation

Dave et al. (2023)87

To evaluate the accuracy of a real-time model detecting abnormal lung parenchyma

Predictive deep learning

• Algorithm performance: AUROC, F1

• 100 patients, sample size rationale provided

• Analysed by sex, race, ventilation strategy and BMI

• Functionality embedded into an ultrasound machine

• Assessed different classification and contiguity thresholds

• Human assessment independent from predictions

 

Compared algorithmic prediction with human annotations

El Moheb et al. (2025)88

To validate a model for automated billing coding

Administrative deep learning

• Algorithm performance: precision, recall, F1, AUPRC

• 268 operative notes

• Trained to predict 19 CPT codes for automated coding, compared with expert medical coders

• Assessed overcoding and undercoding, as well as discrepancies against ground truth

 

Prospective algorithmic validation study

Escalé-Besa et al. (2023)24

To validate a model’s diagnostic accuracy for skin diseases

Diagnostic deep learning

• Algorithm performance: accuracy, sensitivity, specificity per disease; TP, FP, TN or FN based on the top 3 most likely diagnoses

• 100 patients

• Failure case analysis

• Clinicians provided a diagnosis and were offered the AI prediction

Satisfaction of GPs with AI as decision support for each case

Compared diagnostic performance with human readers

Faqar-Uz-Zaman et al. (2022)89

To evaluate the diagnostic accuracy of an app in the ED

Diagnostic (N/A)

• Algorithm performance:

• 450 patients

• Compared diagnostic accuracy for the top 4–5 diagnoses between the AI tool and the ED physician (matched between candidate diagnoses)

 

Compared algorithmic prediction with human annotations

Felmingham et al. (2022)90

To evaluate an AI tool’s diagnostic accuracy for skin cancer detection

Diagnostic deep learning

• Algorithm performance: AUROC, sensitivity, specificity, FNR

• 214 cases, 742 lesions

• Trained on the use of a camera and software before the study

• Compared diagnostic accuracy with independent diagnoses by teledermatologists

• Analysis of AI errors

 

Compared algorithmic prediction with independent clinician diagnosis

Feng et al. (2025)91

To validate a diagnostic model for distinguishing thymomas from other nodules

Diagnostic machine learning

• Algorithm performance: ROC, DCA, sensitivity, specificity

• 23 patients

• Expert evaluation panel provided ground truth

• Performance of 3 radiologists (mixed experience levels) compared with model performance using AUC

• No clinical information provided to the radiologists

Described a training process for radiologists

Prospective clinical validation (silent trial)

Hanley et al. (2017)92

To evaluate an AI tool for predicting the need for a CT scan in patients with TBI

Triage machine learning

• Algorithm performance: AUROC, sensitivity, specificity, NPV, PPV; clinical utility

• 720 patient CTs across 11 ED sites

• Assessed model outputs against clinical annotations as determined by laboratory reading and imaging specialist readers according to a prespecified statistical plan

• Failure mode analysis of false negatives

 

Compared algorithmic prediction with human annotations

Hoang et al. (2025)93

To evaluate SAFE-WAIT in a silent trial

Predictive machine learning

• Algorithm performance: recall, specificity, accuracy, precision, NPV, FPR, FNR, F1 score

• Bias assessment conducted by sex (male, female) and age bracket (young, middle-aged, older adult)

Utility value calculation articulated in terms of clinically relevant decisions and outcomes

Silent trial (algorithmic validation)

Im et al. (2018)94

To validate an AI tool for diagnosing aggressive lymphomas before deployment to LMICs

Diagnostic deep learning

• Algorithm performance: specificity, sensitivity, efficiency, size measurements, staining, reproducibility

• Described data quality controls

• Equipment detailed

• 40 patients

Computational time and system components, cost, computational infrastructure

Independent verification of AI labels against clinician assessment

Jauk et al. (2020)19

To evaluate a delirium prediction model in its clinical setting

Predictive machine learning

• Algorithm performance: AUROC, sensitivity, specificity, FPR, FNR, PPV, NPV

• Rated against nurse assessment of the delirium risk score and the Confusion Assessment Method

• Reported failure modes and exclusions

• Independent assessment by nurses on 33 patients, 86 with exposure to the AI output

• Expert group of senior physicians, ward nurses, technicians, employees

• Offered training for users

Compared outcomes with expert ratings

Kim et al. (2023)10

To validate a commercial AI tool for detecting chest radiographic abnormalities

Diagnostic AI

• Algorithm performance: AUROC, sensitivity, specificity

• Assessed pathologies on 3,047 radiographs with and without AI output across two centres

• CE marking and approval by the Ministry of Food and Drug Safety of Korea

• 4 first- and third-year radiology residents as target users

• Reading times and failure case analysis

 

Compared diagnostic accuracy with and without AI assistance

Korfiatis et al. (2023)95

To evaluate an AI tool detecting PDA from CT scans

Diagnostic deep learning

• Algorithm performance: AUROC, sensitivity, specificity, F1

• Simulated a screening sample of 297 consecutive abdominal CTs for validation by radiologists

• Assessed failure modes using tumour-related parameters

• Reported substantial impact on clinical workflow

• Used heat maps during the review process

Radiologist-verified diagnostic accuracy

Kramer et al. (2024)96

To validate a model predicting malnutrition in hospitalized patients

Predictive machine learning

• Algorithm performance: AUROC, sensitivity, specificity, accuracy

• 159 patients

• Dieticians assessed malnutrition in admitted patients, compared (masked) with real-time ML predictions

 

Compared algorithmic prediction with human annotations

Kwong et al. (2022)97

To evaluate a model predicting hydronephrosis in utero

Predictive deep learning

• Algorithm performance: AUROC, AUPRC

• Assessed failure modes by age, laterality, changes in image processing and ultrasound machine

• Assessed bias for sex and postal code

• Looked for potential causes of drift

• Recorded model downtime

• 1,234 cases with prediction at the desired implementation care point and compared with later decision to proceed with surgery

• Reported data stream for model evaluation related to patient data confidentiality and security

• Measured clinician engagement

• Assessed usability and disruption to workflow

• Used activation maps

• Conducted patient and family surveys to assess receptivity

Verification of the model against the outcome label

Liu et al. (2023)98

To validate a model predicting postoperative pain

Predictive deep learning

• Algorithm performance: ROC, AUC, RMSE, correlation

• Compared algorithmic prediction of maximum pain score with clinician preprocedure prediction in adult inpatients undergoing noncardiac surgery with general anaesthesia

• Included patient race in the model but did not report performance subgrouped by race

• Reported dataset drift

 

Compared algorithmic prediction with independent clinician rating

Liu et al. (2024)99

To evaluate an AI model estimating bone age

Decision support deep learning

• Algorithm performance: RMSE, MSE

• Assessed performance by patient age and sex, as well as radiography vendor

• 973 radiographs across 9 hospitals

• 3 expert reviewers as gold standard; inter-rater reliability calculated

• Measured time to completion of reading, human versus AI

• Per-bone κ values to indicate disagreements

Clinical validation study comparing AI with gold standard

Luo et al. (2019)100

To validate a model detecting gastrointestinal cancers

Diagnostic deep learning

• Algorithm performance: AUC, ROC, PPV, NPV, sensitivity, specificity

• Reviewed false negatives, plus a random subset, against independent expert assessment

• 175 patients, 4,532 images collected from 5 hospitals

• Noted the presence and location of tumours

Measured processing time

Algorithmic validation with verification of a random subset

Lupei et al. (2022)101

To evaluate the real-time performance of a COVID-19 prognostic model

Predictive machine learning

• Algorithm performance: AUC, ROC, PPV, NPV, sensitivity, specificity

• 13,271 symptomatic patients with COVID-19

• Evaluated sensitivity and specificity across sex and race

• Assessed label drift as a result of improved outcomes for patients

Patient opt-outs of research requests were noted in the chart and honoured by the team

Prospective algorithmic validation

Mahajan et al. (2023)102

To assess a model’s predictive accuracy for 30-day postoperative mortality and major adverse cardiac and cerebrovascular events

Predictive machine learning

• Algorithm performance: AUC, ROC, PPV, NPV, sensitivity, specificity

• 206,353 patient cases

• Compared performance with an algorithm already used in care

SHAP values applied to retrospective test only

Prospective algorithmic validation study

Major et al. (2020)103

To validate a model predicting short-term in-hospital mortality

Predictive machine learning

• Algorithm performance: descriptive statistics (n patients meeting the primary outcome)

• 9-month trial with 41,728 predictions, plus a 12-week silent test in which hospitalists reviewed 104 alerts to determine whether the alert was actionable and appropriate

• Assessed bias by comparing algorithmic fairness approaches

• Clinical stakeholders selected 75% PPV as the desired threshold for the model

• Experimented with different thresholds, varied across sites to reflect population needs

Prospective algorithmic validation

Manz et al. (2020)16

To validate an algorithm predicting 180-day mortality risk in a general oncology cohort

Predictive machine learning

• Algorithm performance: AUC, AUPRC, Brier score, PPV, NPV, sensitivity, alert rate tested at different risk thresholds

• 24,582 patient cases over a 2-month period

• Calculated performance metrics across different groups (disease site, practice type, self-reported race, sex, insurance, stage of cancer); reported performance to be better for women or at a later stage of cancer for men

• Described the model being locked; no updates made

Use of a nudging strategy described in a companion paper

Prospective algorithmic validation

Miró Catalina et al. (2024)104

To validate a diagnostic algorithm in radiology

Diagnostic deep learning

• Algorithm performance: TP, TN, FP, FN, sensitivity, specificity

• 278 cases of 471 participants

• Researchers interpreted reference radiology reports before inputting to AI to obtain a diagnosis for comparison

• Error testing for certain pathologies

 

Compared diagnostic performance with human readers

Morse et al. (2022)27

To evaluate a model detecting CKD in a paediatric hospital

Evaluative machine learning

• Algorithm performance: AUROC

• ML model draws data directly from the EHR in near real time

• 1,270 patient admissions over ~6 months

 

Prospective algorithmic validation

Nemeth et al. (2023)37

To validate a model for detecting septic shock

Predictive machine learning

• Algorithm performance: AUC, PPV, NPV

• 5,384 hospital admissions in 4,804 patients during a 6-month silent test, comparing predictions with a clinician’s independent judgement

• Extensive failure case analysis

• Tested different time horizons

• Described data flow and infrastructure for the model

• Codesign using interviews with multiple stakeholders

• User acceptance testing

• Alignment of model use with practice guidelines

Compared model outputs with clinician annotations

O’Brien et al. (2020)105

To evaluate an EWS for patient deterioration

Predictive machine learning

• Algorithm performance: PPV, sensitivity, thresholding

• 4,210 encounters, 97 patients

• Set up data analytics to reflect real-time streaming of live data

• Alert risk presented using red, yellow and green colour codes

• Consulted nursing staff on the visualization

Algorithmic validation study

Ouyang et al. (2020)32

To validate a segmentation model assessing cardiac function

Predictive deep learning

• Algorithm performance: AUC, RMSE, R²

• Measurements of cardiac function in 1,288 patients

• Compared model measurements with those by human annotators, with manual blinded re-evaluation by 5 experts for cases with a large discrepancy between the model and annotations

 

Compared model outputs with clinician annotations

Pan et al. (2025)106

To validate a model predicting the utility of CT for mTBI

Predictive machine learning

• Algorithm performance: AUC, accuracy, sensitivity, specificity, PPV, NPV, F1, DCA

• 86 patients

• ML model compared with serum biomarkers for TBI and a statistical regression model

• SHAP values

• DCA to assess clinical utility

Prospective clinical validation (silent trial)

Pou-Prom et al. (2022)34

To validate an early warning system in inpatients

Predictive machine learning

• Algorithm performance: AUC, PPV, sensitivity

• Determined a composite outcome label

• Described the shift needed to accommodate changes due to the onset of the COVID-19 pandemic

• Described a detailed preprocessing plan

• Evaluated the processing stream

• Initially planned a 4-month trial, which was extended to 6 months

• Conducted training with users

Weekly check-ins with stakeholders during the silent phase

Real-time algorithmic validation

Pyrros et al. (2023)107

To validate a model detecting type 2 diabetes from chest radiographs and EHR data

Predictive deep learning

• Algorithm performance: AUROC, PPV, sensitivity, specificity, F1, Youden’s J index, PR, NPV, odds ratio, demographics

• 9,943 chest radiographs

• Noted the potential for health disparities; planned subgroup analysis by race/ethnicity; mentioned the need for fine-tuning due to fairness and robustness issues

• Data stream and infrastructure described

Used an animated technique through an autoencoder for feature highlighting

Algorithmic validation study

Qian et al. (2025)108

To validate a model predicting surgical intervention need for paediatric intussusception

Predictive deep learning

• Algorithm performance: AUC, accuracy, NPV, F1, ROC

• 50 patients

• Reported consistent performance across different patient populations by age

 

Algorithmic validation

Rajakariar et al. (2020)25

To validate a smartwatch device for detecting atrial fibrillation

Diagnostic machine learning

• Algorithm performance: sensitivity, specificity, TP, TN, Cohen’s κ for agreement

• Failure case analysis for unclassified tracings assessed by 2 electrophysiologists

• Described the data pipeline

• 200 consecutive patients over 6 months, 439 ECGs

• Cardiologist diagnosis as the reference standard

 

Compared device output with clinician diagnosis

Rawson et al. (2021)109

To validate a model detecting secondary bacterial infection during COVID-19

Predictive machine learning

• Algorithm performance: AUROC, descriptive analysis

 

Prospective pilot test of the algorithm

Razavian et al. (2020)33

To validate a model predicting outcomes for hospitalized patients with COVID-19

Predictive machine learning

• Algorithm performance: AUROC, AUPRC, PPV, thresholded sensitivity, confidence intervals

• Integration through the EHR; data flow described

• Described the cleaning process, feature minimization, threshold selection and time horizon

• 445 patients over 474 admissions (109,913 prediction instances)

• Medical students and practicing physicians assessed face validity, timing and clinical utility

• Review of 30 patient encounters with medical students to assess the impact of model predictions on clinical decision-making

• Interface described

• Feature-level XAI

Prospective observational study (impact unclear)

Ren et al. (2025)110

To evaluate a smartphone-based AI for classifying auricular deformities

Diagnostic deep learning

• Algorithm performance: AUC, ROC, sensitivity, specificity, precision, F1 score

• 272 cases

• Ground truth established by two independent professionals

• Compared human and model performance

• Scalable and low-cost diagnostic support

• Guidance for proper image acquisition

• Failure analysis identified discrepancies between retrospective and prospective validation sets

• Described the data pipeline and inference process

 

Clinical validation

Schinkel et al. (2022)111

To validate a model predicting a positive blood culture result

Predictive machine learning

• Algorithm performance: AUROC, AUPRC, calibration, feature contributions, DCA

• Described data processing in a live context

• 3-month period of real-time validation

 

Real-time prospective algorithmic validation

Shah et al. (2021)112

To validate a model predicting clinical deterioration

Predictive machine learning

• Algorithm performance: AUPRC, AUROC, PPV, NNE

• Preplanned subgroup analysis by race, sex and age revealed discrepancies

• 146,446 hospitalizations in 103,930 unique patients

• Described data processing steps and feature importance calculations

 

Algorithmic validation study

Shamout et al. (2021)113

To validate a model predicting deterioration from COVID-19

Predictive machine learning

• Algorithm performance: AUC, PR, PPV, NPV

• 375 examinations

• Real-time extraction; addressed computational resources

 

Prospective algorithmic validation (silent trial)

Shelov et al. (2018)38

To validate a model predicting clinical acuity in a paediatric ICU

Machine learning decision support

• Algorithm performance: Littenberg Technology Assessment in Medicine framework

• Approximately 6-month verification phase before going live

• Measured the impact of the model in the EHR on processing time

• Validation done through a survey for project team clinicians to complete (315 forms for 182 patients)

• Retrospective analysis of data quality and patients meeting the at-risk criteria

• Reported on the availability of the algorithm

• Some interfaces included

• Design included a multidisciplinary team comprising physicians, nurses, informaticians, respiratory therapists and improvement advisors

Prospective verification of the model against clinical judgement

Sheppard et al. (2018)29

To validate an algorithm for triaging patients with suspected high BP for ambulatory pressure monitoring

Triage machine learning

• Algorithm performance: sensitivity, specificity, PPV, NPV, AUROC

• Compared the accuracy of the triaging strategy across subgroups (by setting, age, sex, smoking status, BMI, history of hypertension, diabetes, CKD, cardiovascular disease and BP measuring device)

• 887 eligible patients with 3 same-visit BP readings

• Described the rationale for excluding cases based on data missingness

Patients with a history of hypertension advised on the design of the project, recruitment and study literature before ethics submission

Comparison of algorithmic triaging approach against the standard

Shi et al. (2025)114

To evaluate a model predicting the risk of colorectal polyp recurrence

Predictive machine learning

• Algorithm performance: ROC, DCA, sensitivity, specificity

• 166 patients

• DCA to assess clinical utility

• Demonstrated the user interface

Prospective algorithmic validation study

Smith et al. (2024)115

To evaluate a model for breast cancer screening

AI decision support

• Algorithm performance: recall or no recall decision

• Assessed concordant and discordant cases

• 8,779 patients aged 50–70 years

• Trained film readers verified the results

• Assessed multiple features of patients and scan results

Regions of interest available during reviews

Compared diagnostic performance with human readers

Stamatopoulos et al. (2025)116

To validate a model predicting miscarriage risk

Predictive machine learning

• Algorithm performance: sensitivity, specificity, PPV, NPV

• Assessor had access to ground truth and compared algorithm predictions against short-term outcomes

Inferred a lack of clinical utility due to unreliable predictions

Prospective algorithmic validation study

Stephen et al. (2023)20

To validate a model detecting paediatric sepsis

Predictive machine learning

• Algorithm performance: AUC, PPV

• 8,608 cases (1-year period)

• Thresholding for alerts accounted for false alerts, alert fatigue and resources for the sepsis huddle

Team of clinicians, data scientists, improvement experts and clinical informaticians; regular meetings throughout the project

Real-time algorithmic validation

Swinnerton et al. (2025)117

To prospectively validate a prediction tool for severe COVID-19 risk

Predictive machine learning

• Algorithm performance: AUC, calibration

• 51,587 infections

• Assessed subgroup performance

Feature importance

Prospective algorithmic validation study

Tan et al. (2025)26

To clinically validate AI-based multispectral imaging for burn wound assessment

Classification deep learning

• Algorithm performance: sensitivity, specificity, accuracy

• 40 patients, 70 burn images

• Failure mode analysis affecting overdiagnosis

• Bias assessment by skin pigmentation and tattoo presence

• Reported on availability, feasibility and time to diagnostic result

• Described the user interface

• UKCA class I medical device, ISO 13485

• Reported evaluator training

Prospective clinical validation study

Tariq et al. (2023)118

To validate a model screening for low bone density

Screening machine learning

• Algorithm performance: image label, precision, recall, F score, AUROC

• For 2 consecutive days, curated 344 scans (with and without contrast) from patients aged ≥50 years

• Some analysis of lower-performing areas

Heat maps for regions of interest

Algorithmic validation study

Titano et al. (2018)119

To simulate the clinical implementation of a triage algorithm for radiology

Triage deep learning

• Algorithm performance: AUC, sensitivity, specificity, accuracy, time to notify about critical findings, runtime

• 180 images reviewed by a radiologist and a surgeon (50/50 split); 2 radiologists and a neurosurgeon reviewed images without access to the EMR or prior images

 

Prospective simulated trial with human readers

Vaid et al. (2020)120

To validate an outcome prediction model for COVID-19

Predictive machine learning

• Algorithm performance: AUROC, AUPRC, F1, sensitivity, specificity

• 21-day trial

• Assessed race as a potential contributing variable to outcome prediction

SHAP scores

Prospective algorithmic validation (silent trial)

Wall et al. (2022)121

To evaluate a model for supporting radiation therapy plans

Predictive machine learning

• Algorithm performance: prediction error, ROC, concordance

• VQA application provides failures for features, top 5 features and ‘total gain’

• Reported runtime and compute power

• Physicists measured 445 VMAT plans over 3 months

• VQA predictions recorded alongside PSQA measurements

 

Prospective validation including comparison with the standard of care

Wan et al. (2025)122

To validate a model predicting neoadjuvant treatment response

Predictive machine learning

• Algorithm performance: AIC, ROC, PPV, NPV, DCA, calibration

• 76 patients

• Compared the performance of a clinical–radiomics model to that of a radiomics model, a clinical model and a radiologist’s subjective assessment

DCA to assess potential clinical benefit

Clinical validation

Wang et al. (2019)123

To validate a model predicting new-onset lung cancer

Predictive machine learning

• Algorithm performance: AUC, ROC, PPV, sensitivity, specificity

• Performance within each risk category

• 836,659 patient records

 

Algorithmic validation study

Wang et al. (2025)124

To validate a model for cardiovascular disease diagnosis

Diagnostic deep learning

• Algorithm performance: AUC, sensitivity, specificity, F1, accuracy

• 62 patients

• Ground truth established by 3 emergency physicians reviewing the data, compared with algorithm outputs

SHAP values

Algorithmic validation with clinical verification

Wissel et al. (2020)125

To validate an NLP application to assign surgical candidacy for epilepsy

Decision support machine learning

• Algorithm performance: AUC, sensitivity, specificity, PPV, NPV, NNS, number of prospective surgical candidates

• Retrained the model weekly on the most recent training set based on free text notes

• Verification on 100 randomly selected patient cases

• Tested the inter-rater reliability of clinicians’ manual classifications versus the algorithm

Interpretability analysis revealed wording associated with surgical candidacy

Algorithmic validation with verification of a random subset

Wong et al. (2021)30

To temporally validate a model predicting acute respiratory failure

Predictive machine learning

• Algorithm performance: AUROC, AUPRC, sensitivity, specificity, PPV, NPV

• Event horizon

• 122,842 encounters, 112,740 controls

 

Temporal validation study

Xie et al. (2025)126

To validate a model diagnosing axial spondyloarthritis

Diagnostic deep learning

• Algorithm performance: AUC, accuracy, sensitivity, specificity, F1, precision

• 209 patients

• Diagnostic accuracy compared with accepted clinical classification criteria for each patient

SHAP values

Algorithmic validation

Ye et al. (2019)127

To validate a real-time early warning system predicting high risk of inpatient mortality

Predictive machine learning

• Algorithm performance: sensitivity, specificity, PPV, ROC, C-statistic, hazard ratios

• 11,762 patients with an assigned EWS

Top 50 important features

Algorithmic validation study

Ye et al. (2020)128

To validate a nomogram for predicting liver failure

Predictive machine learning

• Algorithm performance: precision, recall, accuracy, F1

• 120 patients undergoing hepatectomy

 

Algorithmic validation study

Yu et al. (2022)129

To validate a sepsis prediction model

Predictive machine learning

• Algorithm performance: F1, sensitivity, specificity, AUROC, AUPRC

• 3,532 alerts; 388 met the sepsis criteria

• Analysed model successes and failures

• Considered scalability through compute requirements

SHAP values for a ‘lite’ version of the model

Algorithmic validation study

Zhang et al. (2025)130

To validate a model identifying atrial fibrillation after ischaemic stroke

Diagnostic deep learning

• Algorithm performance: AUC, sensitivity, specificity, PPV, NPV

• 73 patients

• Assessed model performance by patient age bracket

• An independent researcher conducted a blinded review of predicted atrial fibrillation status and actual diagnosis after clinical workup

• Described data cleaning and patient inclusion criteria

 

Algorithmic validation

AIC, Akaike information criterion; AUC, area under the curve; AUPRC, area under the precision–recall curve; AUROC, area under the receiver operating characteristic curve; BMI, body mass index; BP, blood pressure; COVID-19, coronavirus disease 2019; CKD, chronic kidney disease; CPT, Current Procedural Terminology; CT, computed tomography; DCA, decision curve analysis; ECG, electrocardiogram; ED, emergency department; EHR, electronic health record; EMR, electronic medical record; EWS, early warning score; FN, false negative; FNR, false negative rate; FP, false positive; FPR, false positive rate; GP, general practitioner; ICU, intensive care unit; ISO, International Organization for Standardization; LMICs, low- to middle-income countries; ML, machine learning; MSE, mean square error; mTBI, mild traumatic brain injury; N/A, not applicable; NLP, natural language processing; NNE, number needed to evaluate; NNS, number needed to screen; NPR, negative prediction rate; NPV, negative predictive value; PDA, pancreatic ductal adenocarcinoma; PPV, positive predictive value; PR, precision–recall; PSQA, patient-specific quality assurance; RMSE, root mean square error; ROC, receiver operating characteristic; SHAP, Shapley additive explanations; SOFA, sequential organ failure assessment; TBI, traumatic brain injury; TN, true negative; TP, true positive; UKCA, UK Conformity Assessed; VMAT, volumetric modulated arc therapy; VQA, virtual quality assurance; XAI, explainable AI.
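Several of the entries above list decision curve analysis (DCA) among their evaluation or clinical-utility metrics (for example, Chen et al. (2025), Pan et al. (2025), Schinkel et al. (2022) and Wan et al. (2025)). As a minimal, illustrative sketch of what that metric computes (the function and example data below are hypothetical and are not drawn from any of the reviewed studies), the net benefit of a risk model at a chosen threshold probability can be calculated as:

import numpy as np

def net_benefit(y_true, y_prob, threshold):
    # Net benefit at threshold probability pt: TP/n - (FP/n) * pt / (1 - pt)
    y_true = np.asarray(y_true)
    predicted_positive = np.asarray(y_prob) >= threshold
    n = len(y_true)
    tp = np.sum(predicted_positive & (y_true == 1))
    fp = np.sum(predicted_positive & (y_true == 0))
    return tp / n - (fp / n) * threshold / (1 - threshold)

# Hypothetical usage: compare the model against the 'treat all' strategy
# over a range of threshold probabilities, as a DCA plot would.
y_true = np.array([0, 1, 0, 1, 1, 0, 0, 1])                   # observed outcomes
y_prob = np.array([0.1, 0.8, 0.3, 0.6, 0.9, 0.2, 0.4, 0.7])   # model-predicted risks
for pt in (0.1, 0.2, 0.3):
    treat_all = y_true.mean() - (1 - y_true.mean()) * pt / (1 - pt)
    print(pt, net_benefit(y_true, y_prob, pt), treat_all)

In a full DCA, net benefit is plotted across a clinically relevant range of thresholds and compared with the 'treat all' and 'treat none' strategies; a model is generally considered clinically useful where its curve lies above both.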