Table 2. Hyperparameter tuning of transformer models on the validation dataset and evaluation of generalizability on the test dataset. The final model was selected by highest sensitivity on the validation dataset (bold). * indicates the mean metric across the ensembled models rather than the voting-ensemble metric. AUROC refers to the model's "Area Under the Receiver Operating Characteristic" curve metric, LR refers to the model's "Learning Rate", and PPV refers to the model's "Positive Predictive Value". The Pr@50Re metric reports a model's precision when it achieves 50% recall, while the Sp@50Se metric reports a model's specificity when it reaches 50% sensitivity.
| Experiment | Exp. Param. | Sensitivity | Specificity | Accuracy | AUROC | PPV | NPV | F1 Score |
|---|---|---|---|---|---|---|---|---|
| Ensembled Models | Grouped Codes | 60.10% | 61.90% | 61.90% | 65.50%* | 2.50% | N/A | 4.70% |
| | All Codes | 61.80% | 60.80% | 60.80% | 66.10%* | 2.50% | N/A | 4.70% |
| | Pretrained | 68.10% | 61.40% | 61.50% | 70.90%* | 2.70% | N/A | 5.30% |
| | Pretrained w/ LR decay | 72.00% | 57.90% | 58.10% | 71.20%* | 2.70% | N/A | 5.10% |
| | Pretrained w/ LR decay, batch size = 64 | 72.30% | 58.00% | 58.20% | 71.10%* | 2.70% | 99.24% | 5.20% |
| Final Model; Test Dataset | Pretrained w/ LR decay, batch size = 64 | 70.90% | 56.90% | 57.10% | 69.60%* | 2.40% | 99.22% | 4.70% |
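The standard definitions behind the table's columns can be sketched from confusion-matrix counts as follows; the counts in the usage example are purely illustrative and are not taken from the study.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the metrics reported in Table 2 from confusion-matrix counts.

    tp/fp/tn/fn are true-positive, false-positive, true-negative, and
    false-negative counts. AUROC is omitted, as it requires per-example
    scores rather than a single thresholded confusion matrix.
    """
    sensitivity = tp / (tp + fn)              # recall / true positive rate
    specificity = tn / (tn + fp)              # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    ppv = tp / (tp + fp)                      # positive predictive value (precision)
    npv = tn / (tn + fn)                      # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


# Illustrative counts for a rare-positive task (hypothetical, not study data):
# with few positives, PPV and F1 stay low even when sensitivity is high.
metrics = classification_metrics(tp=70, fp=2830, tn=5000, fn=30)
```

With such heavily imbalanced counts, sensitivity is 70% while PPV falls below 3%, mirroring the pattern of high sensitivity and low PPV seen across the table's rows.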