Table 4 Comparison of evaluation indexes of five machine learning algorithm models.

From: Development and validation of a hypoxemia prediction model in middle-aged and elderly outpatients undergoing painless gastroscopy

| Model | Group | Accuracy (%) | AUROC (95% CI) | AUPRC | Precision | Recall | F1 | Brier |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LR | Train | 83.252 | 0.895 (0.879, 0.915) | 0.875 | 0.833 | 0.832 | 0.832 | 0.128 |
| LR | Test | 80.405 | 0.893 (0.881, 0.899) | 0.586 | 0.333 | 0.781 | 0.467 | 0.134 |
| LR | Validate | 75.275 | 0.832 (0.809, 0.845) | 0.426 | 0.385 | 0.833 | 0.526 | 0.195 |
| SVM | Train | 88.399 | 0.938 (0.931, 0.957) | 0.914 | 0.866 | 0.908 | 0.887 | 0.091 |
| SVM | Test | 80.405 | 0.855 (0.812, 0.884) | 0.438 | 0.303 | 0.625 | 0.408 | 0.113 |
| SVM | Validate | 75.275 | 0.791 (0.724, 0.817) | 0.365 | 0.383 | 0.817 | 0.521 | 0.196 |
| RF | Train | 97.386 | 0.998 (0.998, 0.999) | 0.998 | 0.972 | 0.975 | 0.974 | 0.026 |
| RF | Test | 90.878 | 0.914 (0.889, 0.924) | 0.721 | 0.561 | 0.719 | 0.630 | 0.072 |
| RF | Validate | 79.945 | 0.848 (0.819, 0.860) | 0.458 | 0.434 | 0.717 | 0.541 | 0.130 |
| XGB | Train | 94.853 | 0.988 (0.987, 0.994) | 0.988 | 0.949 | 0.948 | 0.948 | 0.042 |
| XGB | Test | 88.514 | 0.902 (0.865, 0.920) | 0.702 | 0.478 | 0.688 | 0.564 | 0.076 |
| XGB | Validate | 78.846 | 0.808 (0.745, 0.859) | 0.440 | 0.412 | 0.667 | 0.510 | 0.155 |
| LightGBM | Train | 95.670 | 0.991 (0.990, 0.996) | 0.991 | 0.962 | 0.951 | 0.956 | 0.037 |
| LightGBM | Test | 88.176 | 0.891 (0.847, 0.917) | 0.652 | 0.468 | 0.688 | 0.557 | 0.083 |
| LightGBM | Validate | 81.319 | 0.814 (0.711, 0.856) | 0.470 | 0.455 | 0.667 | 0.541 | 0.141 |

  1. LR: logistic regression; SVM: support vector machine; RF: random forest; XGB: extreme gradient boosting (XGBoost); LightGBM: light gradient boosting machine; Train: training set; Test: internal testing set; Validate: external validation set; Accuracy: the percentage of correctly predicted samples out of all samples; AUROC: the area under the receiver operating characteristic curve; 95% CI: 95% confidence interval; AUPRC: the area under the precision-recall curve; Precision: the proportion of true positives among all positive predictions; Recall: the proportion of true positives among all actual positive instances; F1: the harmonic mean of precision and recall; Brier: the Brier score, which measures how close the model's predicted probabilities are to the observed outcomes. A lower score indicates better performance, with predicted probabilities closer to the true labels.
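
The definitions above can be illustrated with a minimal Python sketch. This is not the authors' code: the function names are hypothetical and the labels and probabilities in the toy Brier example are invented for illustration. The only values taken from the table are the precision and recall of the LR and RF models on the internal testing set, used to show that the reported F1 follows from the harmonic-mean definition.

```python
from typing import Sequence


def precision(tp: int, fp: int) -> float:
    """Proportion of true positives among all positive predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Proportion of true positives among all actual positive instances."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Consistency check against the table (internal testing set rows).
f1_lr_test = f1_score(0.333, 0.781)  # LR: precision 0.333, recall 0.781
f1_rf_test = f1_score(0.561, 0.719)  # RF: precision 0.561, recall 0.719
print(round(f1_lr_test, 3), round(f1_rf_test, 3))

# Toy Brier example with invented labels and probabilities (not study data).
toy_labels = [1, 0, 0, 1]
toy_probs = [0.9, 0.2, 0.1, 0.6]
toy_brier = brier_score(toy_labels, toy_probs)
print(round(toy_brier, 3))
```

Rounded to three decimals, the two F1 values agree with the 0.467 and 0.630 reported in the table. The low precision alongside higher recall on the testing and validation sets is also why F1 drops well below accuracy there.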