Table 3 Accuracy, confidence, and utility scores stratified by explanation type for correct and incorrect recommendations.

From: How machine-learning recommendations influence clinician treatment selections: the example of antidepressant selection

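| Recommendation correctness | Explanation type | Accuracy | Confidence | Perceived utility |
| --- | --- | --- | --- | --- |
| Correct recommendations | Baseline | 0.357 (0.333–0.381) | 3.67 (3.63–3.72) | N/A |
| Correct recommendations | No explanation | 0.394 (0.355–0.433) | 3.64 (3.57–3.72) | 3.45 (3.35–3.55) |
| Correct recommendations | Placebo | 0.357 (0.318–0.397) | 3.66 (3.59–3.74) | 3.53 (3.43–3.63) |
| Correct recommendations | Feature based | 0.397 (0.356–0.437) | 3.62 (3.54–3.69) | 3.54 (3.44–3.64) |
| Correct recommendations | Heuristic based | 0.390 (0.352–0.428) | 3.70 (3.62–3.77) | 3.54 (3.44–3.64) |
| Correct recommendations | p-value | p = 0.239 | p = 0.239 | p = 0.343 |
| Incorrect recommendations | Baseline | 0.357 (0.333–0.381) | 3.67 (3.63–3.72) | N/A |
| Incorrect recommendations | No explanation | 0.298 (0.250–0.345) | 3.60 (3.50–3.70) | 3.38 (3.23–3.53) |
| Incorrect recommendations | Placebo | 0.311 (0.262–0.359) | 3.65 (3.54–3.76) | 3.36 (3.21–3.51) |
| Incorrect recommendations | Feature based | 0.262 (0.214–0.310) | 3.67 (3.56–3.77) | 3.48 (3.34–3.63) |
| Incorrect recommendations | Heuristic based | 0.327 (0.277–0.376) | 3.57 (3.46–3.68) | 3.36 (3.20–3.52) |
| Incorrect recommendations | p-value | p = 0.004* | p = 0.155 | p = 0.573 |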

  1. p-values were calculated using repeated-measures ANOVA with a significance level of 0.05.