Table 4 Inter-reader reliability of the models and radiologists

From: Deep learning models in classifying primary bone tumors and bone infections based on radiographs

Inter-reader reliability between the ensemble model and radiologists

Fleiss κ (95% CI): 0.501 (0.463–0.538)

Cohen κ (95% CI):

| | Expert 1 | CSTC | Expert 2 | CSTC | Expert 3 | CSTC |
| --- | --- | --- | --- | --- | --- | --- |
| Ensemble model | 0.299 (0.265–0.333) | + + | 0.493 (0.456–0.531) | + + + | 0.456 (0.419–0.493) | + + + |

| | Expert 4 | CSTC | Expert 5 | CSTC | Expert 6 | CSTC |
| --- | --- | --- | --- | --- | --- | --- |
| Ensemble model | 0.356 (0.321–0.392) | + + | 0.570 (0.532–0.607) | + + + | 0.596 (0.560–0.633) | + + + |

Inter-reader reliability among radiologists

Fleiss κ (95% CI): 0.401 (0.364–0.438)

Cohen κ (95% CI):

| EG1 | CSTC | EG2 | CSTC | EG3 | CSTC |
| --- | --- | --- | --- | --- | --- |
| 0.267 (0.234–0.300) | + + | 0.295 (0.261–0.329) | + + | 0.581 (0.544–0.618) | + + + |

Inter-reader reliability among models

Fleiss κ (95% CI): 0.800 (0.770–0.830)

Cohen κ (95% CI):

| | E3 | CSTC | E4 | CSTC | ViT | CSTC | SWIN | CSTC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ensemble model | 0.805 (0.775–0.835) | + + + + + | 0.793 (0.763–0.823) | + + + + | 0.783 (0.752–0.814) | + + + + | 0.908 (0.886–0.930) | + + + + + |

  1. EG expert group, CSTC consistency, E3 EfficientNet B3, E4 EfficientNet B4, ViT vision transformer, SWIN Swin transformer, CI confidence interval.
  2. Note: EG1 = expert 1 + expert 2 (junior radiologist group); EG2 = expert 3 + expert 4 (medium-seniority group); EG3 = expert 5 + expert 6 (senior radiologist group).
  3. CSTC evaluation (consistency evaluation):
  4. 0 < Fleiss κ, Cohen κ ≤ 0.2: low consistency, “+”.
  5. 0.2 < Fleiss κ, Cohen κ ≤ 0.4: general consistency, “+ +”.
  6. 0.4 < Fleiss κ, Cohen κ ≤ 0.6: moderate consistency, “+ + +”.
  7. 0.6 < Fleiss κ, Cohen κ ≤ 0.8: high consistency, “+ + + +”.
  8. 0.8 < Fleiss κ, Cohen κ ≤ 1.0: extremely high consistency, “+ + + + +”.
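The cut-offs in footnotes 4–8 are a simple step function from a κ estimate to a CSTC grade. As a rough illustration of how such agreement statistics might be reproduced, the sketch below computes Cohen's κ for a reader pair and Fleiss' κ for a reader panel with scikit-learn and statsmodels, then maps each estimate to its grade. The simulated reader labels and the `cstc_grade` helper are hypothetical stand-ins, not the authors' code, and the resampling needed for the 95% CIs in the table is omitted.

```python
# Minimal sketch (assumed workflow, not the study's code): Cohen's and Fleiss'
# kappa for categorical reads, plus the CSTC grading rule from the footnotes.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa


def cstc_grade(kappa: float) -> str:
    """Map a kappa value in (0, 1] to the '+' grade defined in footnotes 4-8."""
    for upper, grade in [(0.2, "+"), (0.4, "+ +"), (0.6, "+ + +"),
                         (0.8, "+ + + +"), (1.0, "+ + + + +")]:
        if kappa <= upper:
            return grade
    return "+ + + + +"


# Hypothetical reads for 50 radiographs with three diagnostic categories
# (e.g. 0 = benign tumor, 1 = malignant tumor, 2 = infection).
rng = np.random.default_rng(0)
reader_a = rng.integers(0, 3, size=50)
reader_b = np.where(rng.random(50) < 0.7, reader_a, rng.integers(0, 3, size=50))
reader_c = np.where(rng.random(50) < 0.6, reader_a, rng.integers(0, 3, size=50))

# Pairwise agreement between two readers (Cohen's kappa).
pair_kappa = cohen_kappa_score(reader_a, reader_b)
print(f"Cohen κ = {pair_kappa:.3f}, CSTC {cstc_grade(pair_kappa)}")

# Panel agreement across all readers (Fleiss' kappa): rows are cases,
# columns are per-category counts produced by aggregate_raters.
reads = np.column_stack([reader_a, reader_b, reader_c])
counts, _ = aggregate_raters(reads)
panel_kappa = fleiss_kappa(counts, method="fleiss")
print(f"Fleiss κ = {panel_kappa:.3f}, CSTC {cstc_grade(panel_kappa)}")
```

Confidence intervals like those reported in the table are typically obtained by bootstrapping over cases and taking the 2.5th and 97.5th percentiles of the resampled κ values.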