Table 2 Qualitative reproducibility results, in terms of the repeatability measure, for the Likert scoring of the radiologist’s annotation and the DLS output.

From: Reproducibility analysis of automated deep learning based localisation of mandibular canals on a temporal CBCT dataset

Reproducibility  Heterogeneity   Radiologist mean (95% CI)  DLS mean (95% CI)
Average          Full dataset    0.923 (0.842, 0.979)       0.877 (0.814, 0.932)
                 Normal          0.958 (0.882, 1.000)       0.895 (0.833, 0.935)
                 TMJ Prosthetic  0.841 (0.725, 0.925)       0.911 (0.829, 0.970)
                 Orthognathic    0.945 (0.844, 1.000)       0.815 (0.711, 0.907)
                 Pathological    0.916 (0.747, 0.996)       0.899 (0.712, 0.998)
Expert 1         Full dataset    0.964 (0.937, 0.983)       0.915 (0.887, 0.936)
                 Normal          0.996 (0.976, 1.000)       0.913 (0.864, 0.939)
                 TMJ Prosthetic  0.865 (0.776, 0.933)       0.936 (0.872, 0.977)
                 Orthognathic    0.995 (0.970, 1.000)       0.866 (0.796, 0.917)
                 Pathological    0.960 (0.881, 0.997)       0.969 (0.905, 1.000)
Expert 2         Full dataset    0.932 (0.899, 0.958)       0.874 (0.838, 0.904)
                 Normal          0.950 (0.893, 0.984)       0.886 (0.828, 0.925)
                 TMJ Prosthetic  0.856 (0.766, 0.927)       0.913 (0.841, 0.961)
                 Orthognathic    0.947 (0.888, 0.982)       0.793 (0.704, 0.861)
                 Pathological    0.959 (0.882, 0.997)       0.928 (0.842, 0.981)
Expert 3         Full dataset    0.872 (0.829, 0.907)       0.841 (0.801, 0.875)
                 Normal          0.928 (0.864, 0.970)       0.887 (0.827, 0.925)
                 TMJ Prosthetic  0.800 (0.697, 0.891)       0.885 (0.810, 0.934)
                 Orthognathic    0.893 (0.821, 0.944)       0.786 (0.698, 0.856)
                 Pathological    0.829 (0.709, 0.918)       0.798 (0.678, 0.885)

1. Results are shown for each heterogeneity group and each radiologist, as well as for the full dataset and the average of the experts. CI denotes the Bayesian credibility interval.
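The footnote describes the Average row as the average of the three experts. As a quick consistency check, the sketch below assumes that "average" means the unweighted arithmetic mean of the experts' reported point estimates, recomputes those means from the Expert rows, and compares them with the reported Average row. The dictionary and function names are hypothetical and exist only for this illustration. With the values transcribed from the table, the recomputed means agree with the reported ones to within 0.001, which is consistent with rounding. The intervals cannot be reproduced in the same way: averaging the experts' full-dataset radiologist lower bounds gives about 0.888, whereas the table reports 0.842. The Average intervals were therefore presumably derived by a procedure the table does not specify, so the sketch checks point estimates only.

```python
# Point estimates transcribed from Table 2.
# Hypothetical layout: rater -> heterogeneity group -> (radiologist mean, DLS mean).
EXPERTS = {
    "Expert 1": {
        "Full dataset": (0.964, 0.915),
        "Normal": (0.996, 0.913),
        "TMJ Prosthetic": (0.865, 0.936),
        "Orthognathic": (0.995, 0.866),
        "Pathological": (0.960, 0.969),
    },
    "Expert 2": {
        "Full dataset": (0.932, 0.874),
        "Normal": (0.950, 0.886),
        "TMJ Prosthetic": (0.856, 0.913),
        "Orthognathic": (0.947, 0.793),
        "Pathological": (0.959, 0.928),
    },
    "Expert 3": {
        "Full dataset": (0.872, 0.841),
        "Normal": (0.928, 0.887),
        "TMJ Prosthetic": (0.800, 0.885),
        "Orthognathic": (0.893, 0.786),
        "Pathological": (0.829, 0.798),
    },
}

# Reported "Average" row (point estimates only).
AVERAGE = {
    "Full dataset": (0.923, 0.877),
    "Normal": (0.958, 0.895),
    "TMJ Prosthetic": (0.841, 0.911),
    "Orthognathic": (0.945, 0.815),
    "Pathological": (0.916, 0.899),
}


def recompute_average(experts):
    """Unweighted mean of each column across raters, per heterogeneity group."""
    groups = next(iter(experts.values())).keys()
    result = {}
    for group in groups:
        rad = [experts[rater][group][0] for rater in experts]
        dls = [experts[rater][group][1] for rater in experts]
        result[group] = (sum(rad) / len(rad), sum(dls) / len(dls))
    return result


def max_abs_discrepancy(reported, recomputed):
    """Largest absolute difference between reported and recomputed means."""
    return max(
        abs(r - c)
        for group in reported
        for r, c in zip(reported[group], recomputed[group])
    )


if __name__ == "__main__":
    recomputed = recompute_average(EXPERTS)
    for group, (rad, dls) in recomputed.items():
        print(
            f"{group:15s} radiologist {rad:.4f} (reported {AVERAGE[group][0]:.3f})"
            f"  DLS {dls:.4f} (reported {AVERAGE[group][1]:.3f})"
        )
    print("max |difference|:", round(max_abs_discrepancy(AVERAGE, recomputed), 4))
```

The largest differences (for example TMJ Prosthetic, radiologist: 0.8403 recomputed vs 0.841 reported) are below one unit in the third decimal place, so they can be attributed to the experts' values having been rounded before transcription.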