Table 2 Detailed comparison of the performance of humans and chatbots across the different levels of the AO classification system.
| | GPT 3.5-Turbo | GPT 4 | FraCChat 3.5 | FraCChat 4 | Human |
|---|---|---|---|---|---|
Full AO | |||||
GPT 3.5-Turbo | – | −0.44 (0.64) | −4.79 (< 0.001) | −7.17 (< 0.001) | −9.20 (< 0.001) |
GPT 4 | 0.44 (0.64) | – | −4.36 (< 0.001) | −6.73 (< 0.001) | −8.77 (< 0.001) |
FraCChat 3.5 | 4.79 (< 0.001) | 4.36 (< 0.001) | – | −2.37 (< 0.001) | −4.41 (< 0.001) |
FraCChat 4 | 7.17 (< 0.001) | 6.73 (< 0.001) | 2.37 (< 0.001) | – | −2.04 (< 0.001) |
Human | 9.20 (< 0.001) | 8.77 (< 0.001) | 4.41 (< 0.001) | 2.04 (< 0.001) | – |
Location | |||||
GPT 3.5-Turbo | – | −1.49 (< 0.001) | −6.71 (< 0.001) | −9.01 (< 0.001) | −9.96 (< 0.001) |
GPT 4 | 1.49 (< 0.001) | – | −5.22 (< 0.001) | −7.52 (< 0.001) | −8.47 (< 0.001) |
FraCChat 3.5 | 6.71 (< 0.001) | 5.22 (< 0.001) | – | −2.31 (< 0.001) | −3.26 (< 0.001) |
FraCChat 4 | 9.01 (< 0.001) | 7.52 (< 0.001) | 2.31 (< 0.001) | – | −0.95 (0.70) |
Human | 9.96 (< 0.001) | 8.47 (< 0.001) | 3.26 (< 0.001) | 0.95 (0.70) | – |
Part of bone | |||||
GPT 3.5-Turbo | – | −1.33 (< 0.001) | −7.03 (< 0.001) | −9.99 (< 0.001) | −9.23 (< 0.001) |
GPT 4 | 1.33 (< 0.001) | – | −5.69 (< 0.001) | −8.66 (< 0.001) | −7.90 (< 0.001) |
FraCChat 3.5 | 7.03 (< 0.001) | 5.69 (< 0.001) | – | −2.97 (0.002) | −2.21 (0.11) |
FraCChat 4 | 9.99 (< 0.001) | 8.66 (< 0.001) | 2.97 (0.002) | – | 0.76 (0.95) |
Human | 9.23 (< 0.001) | 7.90 (< 0.001) | 2.21 (0.11) | −0.76 (0.95) | – |
Type | |||||
GPT 3.5-Turbo | – | −0.88 (< 0.001) | −6.00 (< 0.001) | −8.56 (< 0.001) | −9.39 (< 0.001) |
GPT 4 | 0.88 (< 0.001) | – | −5.13 (< 0.001) | −7.68 (< 0.001) | −8.52 (< 0.001) |
FraCChat 3.5 | 6.00 (< 0.001) | 5.13 (< 0.001) | – | −2.56 (< 0.001) | −3.39 (< 0.001) |
FraCChat 4 | 8.56 (< 0.001) | 7.68 (< 0.001) | 2.56 (< 0.001) | – | −0.83 (0.75) |
Human | 9.39 (< 0.001) | 8.52 (< 0.001) | 3.39 (< 0.001) | 0.83 (0.75) | – |
Group | |||||
GPT 3.5-Turbo | – | −0.72 (0.02) | −4.19 (< 0.001) | −5.79 (< 0.001) | −7.04 (< 0.001) |
GPT 4 | 0.72 (0.02) | – | −3.47 (< 0.001) | −5.07 (< 0.001) | −6.32 (< 0.001) |
FraCChat 3.5 | 4.19 (< 0.001) | 3.47 (< 0.001) | – | −1.59 (< 0.001) | −2.85 (< 0.001) |
FraCChat 4 | 5.79 (< 0.001) | 5.07 (< 0.001) | 1.59 (< 0.001) | – | −1.25 (0.12) |
Human | 7.04 (< 0.001) | 6.32 (< 0.001) | 2.85 (< 0.001) | 1.25 (0.12) | – |
Subgroup | |||||
GPT 3.5-Turbo | – | 0.09 (0.99) | −3.47 (< 0.001) | −5.89 (< 0.001) | −7.96 (< 0.001) |
GPT 4 | −0.09 (0.99) | – | −3.56 (< 0.001) | −5.99 (< 0.001) | −8.05 (< 0.001) |
FraCChat 3.5 | 3.47 (< 0.001) | 3.56 (< 0.001) | – | −2.42 (< 0.001) | −4.49 (< 0.001) |
FraCChat 4 | 5.89 (< 0.001) | 5.99 (< 0.001) | 2.42 (< 0.001) | – | −2.07 (< 0.001) |
Human | 7.96 (< 0.001) | 8.05 (< 0.001) | 4.49 (< 0.001) | 2.07 (< 0.001) | – |