Table 2. Test F1-Scores and ROC-AUC for the deep neural networks in COVID-19 detection (o.o.d. evaluation)^a

From: Improving deep neural network generalization and robustness to background bias via layer-wise relevance propagation optimization

| Model and Metric | Normal | Pneumonia | COVID-19 | Mean (macro-average) |
| --- | --- | --- | --- | --- |
| ISNet F1-Score | 0.555 ± 0.022, [0.512,0.597] | 0.858 ± 0.007, [0.844,0.871] | 0.907 ± 0.006, [0.896,0.918] | 0.773 ± 0.009, [0.755,0.791] |
| U-Net+DenseNet121 F1-Score | 0.571 ± 0.018, [0.535,0.607] | 0.586 ± 0.013, [0.561,0.611] | 0.776 ± 0.008, [0.76,0.792] | 0.645 ± 0.009, [0.626,0.663] |
| DenseNet121 F1-Score | 0.444 ± 0.02, [0.403,0.482] | 0.434 ± 0.015, [0.405,0.463] | 0.76 ± 0.008, [0.744,0.775] | 0.546 ± 0.01, [0.527,0.565] |
| Multi-task U-Net F1-Score | 0.419 ± 0.025, [0.369,0.469] | 0.119 ± 0.011, [0.098,0.14] | 0.585 ± 0.009, [0.566,0.602] | 0.374 ± 0.01, [0.355,0.394] |
| AG-Sononet F1-Score | 0.124 ± 0.015, [0.096,0.153] | 0.284 ± 0.015, [0.255,0.312] | 0.659 ± 0.009, [0.641,0.676] | 0.356 ± 0.008, [0.34,0.372] |
| Extended GAIN F1-Score | 0.203 ± 0.019, [0.166,0.24] | 0.485 ± 0.013, [0.46,0.511] | 0.711 ± 0.009, [0.693,0.728] | 0.466 ± 0.009, [0.449,0.485] |
| RRR F1-Score | 0.36 ± 0.018, [0.325,0.394] | 0.552 ± 0.013, [0.526,0.577] | 0.737 ± 0.009, [0.72,0.755] | 0.55 ± 0.009, [0.532,0.568] |
| Vision Transformer (ViT-B/16) F1-Score | 0.382 ± 0.017, [0.348,0.415] | 0.474 ± 0.013, [0.448,0.499] | 0.525 ± 0.011, [0.503,0.548] | 0.46 ± 0.009, [0.443,0.478] |
| ISNet AUC | 0.931 ± 0.01 | 0.962 ± 0.006 | 0.976 ± 0.005 | 0.952 |
| U-Net+DenseNet121 AUC | 0.888 ± 0.019 | 0.78 ± 0.016 | 0.846 ± 0.013 | 0.833 |
| DenseNet121 AUC | 0.804 ± 0.023 | 0.805 ± 0.015 | 0.86 ± 0.013 | 0.808 |
| Multi-task U-Net AUC | 0.721 ± 0.034 | 0.412 ± 0.019 | 0.487 ± 0.02 | 0.553 |
| AG-Sononet AUC | 0.451 ± 0.028 | 0.681 ± 0.019 | 0.658 ± 0.018 | 0.591 |
| Extended GAIN AUC | 0.7 ± 0.025 | 0.756 ± 0.016 | 0.806 ± 0.016 | 0.724 |
| RRR AUC | 0.782 ± 0.02 | 0.736 ± 0.017 | 0.835 ± 0.014 | 0.775 |
| Vision Transformer (ViT-B/16) AUC | 0.755 ± 0.032 | 0.645 ± 0.019 | 0.619 ± 0.019 | 0.683 |

^a Class ROC-AUC scores are calculated with a one-versus-rest approach and are accompanied by 95% confidence intervals. Mean AUC is given as a point estimate, calculated with a pairwise technique (ref. 48) rather than by averaging the class scores. Other metrics are reported as: mean ± std, [95% HDI]. Both the mean and the standard deviation (std) are extracted from the metric’s probability distribution, obtained by Bayesian estimation. 95% HDI denotes the 95% highest density interval: an interval containing 95% of the metric’s probability mass, in which every point has a higher probability density than any point outside. Supplementary Note 10 explains the statistical methods in detail.
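The quantities named in the footnote can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy labels and the toy probabilities are all hypothetical, the pairwise mean AUC is assumed to follow a Hand and Till style scheme (the source only cites reference 48), and the HDI is estimated here as the narrowest window over sorted samples, which is one common way to approximate it from draws of a posterior distribution.

```python
import math
from itertools import combinations


def binary_auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def one_vs_rest_auc(labels, probs, cls):
    """AUC of class `cls` against all other classes, scored by P(cls)."""
    pos = [p[cls] for y, p in zip(labels, probs) if y == cls]
    neg = [p[cls] for y, p in zip(labels, probs) if y != cls]
    return binary_auc(pos, neg)


def pairwise_mean_auc(labels, probs, n_classes):
    """Assumed pairwise scheme: average over class pairs of the mean of A(i|j) and A(j|i)."""
    pairs = list(combinations(range(n_classes), 2))
    total = 0.0
    for i, j in pairs:
        a_ij = binary_auc([p[i] for y, p in zip(labels, probs) if y == i],
                          [p[i] for y, p in zip(labels, probs) if y == j])
        a_ji = binary_auc([p[j] for y, p in zip(labels, probs) if y == j],
                          [p[j] for y, p in zip(labels, probs) if y == i])
        total += 0.5 * (a_ij + a_ji)
    return total / len(pairs)


def hdi(samples, mass=0.95):
    """Narrowest interval that covers a fraction `mass` of the samples."""
    xs = sorted(samples)
    k = math.ceil(mass * len(xs))  # number of samples the interval must cover
    widths = [(xs[i + k - 1] - xs[i], i) for i in range(len(xs) - k + 1)]
    _, start = min(widths)
    return xs[start], xs[start + k - 1]


# Hypothetical toy data: 3 classes (0 = Normal, 1 = Pneumonia, 2 = COVID-19).
labels = [0, 0, 0, 1, 2]
probs = [
    [0.6, 0.3, 0.1],
    [0.5, 0.1, 0.4],
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
    [0.4, 0.2, 0.4],
]

class_aucs = [one_vs_rest_auc(labels, probs, c) for c in range(3)]
print("one-vs-rest AUCs:", class_aucs)
print("average of class AUCs:", sum(class_aucs) / 3)
print("pairwise mean AUC:", pairwise_mean_auc(labels, probs, 3))
print("50% HDI of toy samples:", hdi([0, 1, 2, 3, 10, 20, 30, 40], mass=0.5))
```

On this toy data the pairwise mean AUC differs from the plain average of the one-versus-rest scores, which is consistent with the Mean column of the AUC rows not being a simple average of the three class columns.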