Table 1 Test macro-average F1-scores for neural networks trained on datasets with synthetic background biasᵃ

From: Improving deep neural network generalization and robustness to background bias via layer-wise relevance propagation optimization

| Model | Biased test maF1 | Standard test maF1 | Deceiving bias test maF1 |
| --- | --- | --- | --- |
| **Stanford Dogs with synthetic background bias** | | | |
| ISNet | 0.548 ± 0.035 | 0.553 ± 0.035 | 0.548 ± 0.035 |
| ISNet Grad*Input | 0.55 ± 0.034 | 0.545 ± 0.034 | 0.545 ± 0.034 |
| Standard classifier | 0.926 ± 0.019 | 0.419 ± 0.034 | 0.071 ± 0.017 |
| Segmentation-classification pipeline | 0.519 ± 0.035 | 0.519 ± 0.035 | 0.518 ± 0.035 |
| Multi-task U-Net | 0.522 ± 0.036 | 0.455 ± 0.036 | 0.38 ± 0.035 |
| AG-Sononet | 0.956 ± 0.015 | 0.214 ± 0.027 | 0.019 ± 0.009 |
| Extended GAIN | 0.935 ± 0.017 | 0.445 ± 0.034 | 0.1 ± 0.019 |
| RRR | 0.851 ± 0.025 | 0.548 ± 0.034 | 0.299 ± 0.025 |
| Vision transformer (ViT-B/16) | 0.637 ± 0.034 | 0.419 ± 0.032 | 0.399 ± 0.032 |
| Standard classifier reference (trained without synthetic bias) | | 0.556 ± 0.035 | |
| **COVID-19 detection with synthetic background bias** | | | |
| ISNet | 0.775 ± 0.008 | 0.775 ± 0.008 | 0.775 ± 0.008 |
| ISNet Grad*Input | 0.542 ± 0.01 | 0.544 ± 0.01 | 0.417 ± 0.01 |
| Standard classifier | 0.775 ± 0.008 | 0.434 ± 0.01 | 0.195 ± 0.004 |
| Segmentation-classification pipeline | 0.618 ± 0.009 | 0.619 ± 0.009 | 0.618 ± 0.009 |
| Multi-task U-Net | 0.667 ± 0.01 | 0.341 ± 0.007 | 0.156 ± 0.004 |
| AG-Sononet | 0.943 ± 0.005 | 0.386 ± 0.008 | 0.047 ± 0.003 |
| Extended GAIN | 0.41 ± 0.009 | 0.306 ± 0.006 | 0.219 ± 0.003 |
| RRR | 0.464 ± 0.009 | 0.458 ± 0.008 | 0.426 ± 0.008 |
| Vision transformer (ViT-B/16) | 0.685 ± 0.009 | 0.496 ± 0.009 | 0.327 ± 0.009 |
| Standard classifier reference (trained without synthetic bias) | | 0.546 ± 0.01 | |
| **Facial attribute estimation with synthetic background bias** | | | |
| ISNet | 0.807 ± 0.027 | 0.807 ± 0.027 | 0.807 ± 0.027 |
| ISNet Grad*Input | 0.496 ± 0.02 | 0.499 ± 0.02 | 0.503 ± 0.021 |
| Standard classifier | 0.974 ± 0.012 | 0.641 ± 0.054 | 0.398 ± 0.019 |
| Segmentation-classification pipeline | 0.794 ± 0.031 | 0.794 ± 0.031 | 0.794 ± 0.031 |
| Multi-task U-Net | 0.985 ± 0.008 | 0.665 ± 0.129 | 0.351 ± 0.015 |
| AG-Sononet | 0.985 ± 0.009 | 0.616 ± 0.094 | 0.326 ± 0.016 |
| Extended GAIN | 0.886 ± 0.023 | 0.773 ± 0.034 | 0.633 ± 0.03 |
| RRR | 0.794 ± 0.024 | 0.77 ± 0.032 | 0.557 ± 0.025 |
| Vision transformer (ViT-B/16) | 0.675 ± 0.023 | 0.645 ± 0.03 | 0.531 ± 0.023 |
| Standard classifier reference (trained without synthetic bias) | | 0.802 ± 0.028 | |

ᵃ In the multi-class single-label experiments (Stanford Dogs and COVID-19 detection), scores are reported as mean ± standard deviation. In facial attribute estimation (a multi-label problem), they are reported as mean and 95% confidence interval. Supplementary Note 10 provides more details about the statistical analysis in this study.
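
For readers unfamiliar with the metric, the sketch below shows how a macro-average F1 score (maF1) with a bootstrap mean ± standard deviation could be computed with scikit-learn. It is a minimal illustration under assumed placeholder labels and 200 bootstrap resamples, not the evaluation code used in this study.

```python
# Illustrative sketch only: macro-average F1 (maF1) plus a simple bootstrap
# mean ± standard deviation, assuming scikit-learn is available.
# The label arrays below are random placeholders, not data from the paper.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Placeholder multi-class predictions and ground truth (e.g., 3 classes).
y_true = rng.integers(0, 3, size=1000)
y_pred = rng.integers(0, 3, size=1000)

# Macro-average F1: per-class F1 scores averaged with equal class weights.
point = f1_score(y_true, y_pred, average="macro")

# Hypothetical bootstrap over the test set to attach a mean ± std to the score.
scores = []
for _ in range(200):
    idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
    scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))

print(f"maF1 = {point:.3f} (bootstrap: {np.mean(scores):.3f} ± {np.std(scores):.3f})")
```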