Table 1 Performance evaluation of diagnosis-based counterfactual explanations obtained using different approaches. In each case, we report results averaged across 500 test samples.

From: Training calibration-based counterfactual explainers for deep learning models in medical image analysis

| Method | Validity \(\uparrow\) | Confidence \(\uparrow\) | Sparsity \(\downarrow\) | Proximity \(\downarrow\) | Realism \(\uparrow\) |
|---|---|---|---|---|---|
| Vanilla | 0.68 | 0.63 ± 0.11 | 0.30 ± 0.17 | 4.59 ± 0.68 | 1.16 ± 0.09 |
| Mixup | 0.78 | 0.69 ± 0.17 | 0.27 ± 0.16 | 4.09 ± 0.52 | 1.19 ± 0.13 |
| UWCC | 0.79 | 0.75 ± 0.13 | 0.25 ± 0.17 | 4.26 ± 0.63 | 1.16 ± 0.20 |
| MC dropout | 0.73 | 0.66 ± 0.16 | 0.34 ± 0.19 | 4.57 ± 0.53 | 1.18 ± 0.16 |
| Deep ensembles (5 models) | 0.80 | 0.72 ± 0.09 | 0.29 ± 0.11 | **3.68 ± 0.57** | 1.21 ± 0.12 |
| TraCE | **0.87** | **0.81 ± 0.12** | **0.23 ± 0.14** | 3.73 ± 0.51 | **1.33 ± 0.13** |

  1. Significant values are in bold.
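The mean ± standard deviation entries above are simple per-metric aggregates over the 500 test samples. A minimal sketch of that aggregation, assuming hypothetical per-sample metric scores collected in a NumPy array (the array name, shape, and placeholder values are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical per-sample scores for one method: shape (500, 4), with columns
# Confidence, Sparsity, Proximity, Realism. Validity is handled separately
# below, since it is reported as a single rate without a ± term.
scores = np.random.rand(500, 4)  # placeholder; real values come from the explainer

metric_names = ["Confidence", "Sparsity", "Proximity", "Realism"]
means = scores.mean(axis=0)  # column-wise mean over the 500 test samples
stds = scores.std(axis=0)    # column-wise standard deviation

for name, m, s in zip(metric_names, means, stds):
    print(f"{name}: {m:.2f} ± {s:.2f}")

# Validity as the fraction of test samples with a valid counterfactual:
is_valid = np.random.rand(500) > 0.2  # placeholder per-sample boolean flags
print(f"Validity: {is_valid.mean():.2f}")
```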