Table 2 Detailed comparison of the performance of humans and chatbots across the different levels of the AO classification system.

From: Performance of ChatGPT, human radiologists, and context-aware ChatGPT in identifying AO codes from radiology reports

 

|  | GPT 3.5-Turbo | GPT 4 | FraCChat 3.5 | FraCChat 4 | Human |
| --- | --- | --- | --- | --- | --- |
| Full AO |  |  |  |  |  |
| GPT 3.5-Turbo |  | −0.44 (0.64) | −4.79 (< 0.001) | −7.17 (< 0.001) | −9.20 (< 0.001) |
| GPT 4 | 0.44 (0.64) |  | −4.36 (< 0.001) | −6.73 (< 0.001) | −8.77 (< 0.001) |
| FraCChat 3.5 | 4.79 (< 0.001) | 4.36 (< 0.001) |  | −2.37 (< 0.001) | −4.41 (< 0.001) |
| FraCChat 4 | 7.17 (< 0.001) | 6.73 (< 0.001) | 2.37 (< 0.001) |  | −2.04 (< 0.001) |
| Human | 9.20 (< 0.001) | 8.77 (< 0.001) | 4.41 (< 0.001) | 2.04 (< 0.001) |  |
| Location |  |  |  |  |  |
| GPT 3.5-Turbo |  | −1.49 (< 0.001) | −6.71 (< 0.001) | −9.01 (< 0.001) | −9.96 (< 0.001) |
| GPT 4 | 1.49 (< 0.001) |  | −5.22 (< 0.001) | −7.52 (< 0.001) | −8.47 (< 0.001) |
| FraCChat 3.5 | 6.71 (< 0.001) | 5.22 (< 0.001) |  | −2.31 (< 0.001) | −3.26 (< 0.001) |
| FraCChat 4 | 9.01 (< 0.001) | 7.52 (< 0.001) | 2.31 (< 0.001) |  | −0.95 (0.70) |
| Human | 9.96 (< 0.001) | 8.47 (< 0.001) | 3.26 (< 0.001) | 0.95 (0.70) |  |
| Part of bone |  |  |  |  |  |
| GPT 3.5-Turbo |  | −1.33 (< 0.001) | −7.03 (< 0.001) | −9.99 (< 0.001) | −9.23 (< 0.001) |
| GPT 4 | 1.33 (< 0.001) |  | −5.69 (< 0.001) | −8.66 (< 0.001) | −7.90 (< 0.001) |
| FraCChat 3.5 | 7.03 (< 0.001) | 5.69 (< 0.001) |  | −2.97 (0.002) | −2.21 (0.11) |
| FraCChat 4 | 9.99 (< 0.001) | 8.66 (< 0.001) | 2.97 (0.002) |  | 0.76 (0.95) |
| Human | 9.23 (< 0.001) | 7.90 (< 0.001) | 2.21 (0.11) | −0.76 (0.95) |  |
| Type |  |  |  |  |  |
| GPT 3.5-Turbo |  | −0.88 (< 0.001) | −6.00 (< 0.001) | −8.56 (< 0.001) | −9.39 (< 0.001) |
| GPT 4 | 0.88 (< 0.001) |  | −5.13 (< 0.001) | −7.68 (< 0.001) | −8.52 (< 0.001) |
| FraCChat 3.5 | 6.00 (< 0.001) | 5.13 (< 0.001) |  | −2.56 (< 0.001) | −3.39 (< 0.001) |
| FraCChat 4 | 8.56 (< 0.001) | 7.68 (< 0.001) | 2.56 (< 0.001) |  | −0.83 (0.75) |
| Human | 9.39 (< 0.001) | 8.52 (< 0.001) | 3.39 (< 0.001) | 0.83 (0.75) |  |
| Group |  |  |  |  |  |
| GPT 3.5-Turbo |  | −0.72 (0.02) | −4.19 (< 0.001) | −5.79 (< 0.001) | −7.04 (< 0.001) |
| GPT 4 | 0.72 (0.02) |  | −3.47 (< 0.001) | −5.07 (< 0.001) | −6.32 (< 0.001) |
| FraCChat 3.5 | 4.19 (< 0.001) | 3.47 (< 0.001) |  | −1.59 (< 0.001) | −2.85 (< 0.001) |
| FraCChat 4 | 5.79 (< 0.001) | 5.07 (< 0.001) | 1.59 (< 0.001) |  | −1.25 (0.12) |
| Human | 7.04 (< 0.001) | 6.32 (< 0.001) | 2.85 (< 0.001) | 1.25 (0.12) |  |
| Subgroup |  |  |  |  |  |
| GPT 3.5-Turbo |  | 0.09 (0.99) | −3.47 (< 0.001) | −5.89 (< 0.001) | −7.96 (< 0.001) |
| GPT 4 | −0.09 (0.99) |  | −3.56 (< 0.001) | −5.99 (< 0.001) | −8.05 (< 0.001) |
| FraCChat 3.5 | 3.47 (< 0.001) | 3.56 (< 0.001) |  | −2.42 (< 0.001) | −4.49 (< 0.001) |
| FraCChat 4 | 5.89 (< 0.001) | 5.99 (< 0.001) | 2.42 (< 0.001) |  | −2.07 (< 0.001) |
| Human | 7.96 (< 0.001) | 8.05 (< 0.001) | 4.49 (< 0.001) | 2.07 (< 0.001) |  |