Table 2 Performance of LLM chatbots in addressing questions with English and Chinese prompts
| English Prompts | BARD | ChatGPT 3.5 | ChatGPT 4 | P (BARD vs. 3.5) | P (BARD vs. 4) | P (3.5 vs. 4) |
|---|---|---|---|---|---|---|
| Sum Score, mean (SD)^a | 5.40 (0.93) | 5.45 (1.06) | 5.65 (0.67) | 0.74 | 0.057 | 0.16 |
| Appropriate, n (%) | 66 (88.0) | 69 (92.0) | 73 (97.3) | 0.33 | 0.021 | 0.24 |
| Borderline, n (%) | 9 (12.0) | 5 (6.7) | 1 (1.3) | | | |
| Inappropriate, n (%) | 0 (0.0) | 1 (1.3) | 1 (1.3) | | | |

| Chinese Prompts | ERNIE | ChatGPT 3.5 | ChatGPT 4 | P (ERNIE vs. 3.5) | P (ERNIE vs. 4) | P (3.5 vs. 4) |
|---|---|---|---|---|---|---|
| Sum Score, mean (SD)^a | 4.99 (1.85) | 5.25 (1.62) | 5.07 (1.74) | 0.34 | 0.78 | 0.49 |
| Appropriate, n (%) | 63 (84.0) | 66 (88.0) | 64 (85.3) | 0.71 | 0.96 | 0.85 |
| Borderline, n (%) | 3 (4.0) | 3 (4.0) | 3 (4.0) | | | |
| Inappropriate, n (%) | 9 (12.0) | 6 (8.0) | 8 (10.7) | | | |
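The statistical tests behind the P values are not stated in this excerpt. The sketch below shows how such pairwise comparisons could be computed: it assumes a paired Wilcoxon signed-rank test for the per-question sum scores (using hypothetical rating vectors, since the raw ratings are not reproduced here) and a chi-square test for the appropriateness counts, which are taken directly from the English-prompt rows of the table. The variable names and the choice of tests are illustrative assumptions, not the authors' confirmed method.

```python
# Hypothetical sketch of the pairwise comparisons behind Table 2.
# Assumptions (not stated in the excerpt): Wilcoxon signed-rank for
# paired sum scores; chi-square for the appropriateness distribution.
import numpy as np
from scipy.stats import wilcoxon, chi2_contingency

rng = np.random.default_rng(0)

# Hypothetical per-question sum scores for the 75 questions;
# the real ratings are not available in this excerpt.
scores_bard = rng.integers(3, 7, size=75)  # placeholder ratings
scores_gpt4 = rng.integers(4, 7, size=75)  # placeholder ratings

# Paired, non-parametric comparison of sum scores (BARD vs. ChatGPT 4)
stat, p_scores = wilcoxon(scores_bard, scores_gpt4)
print(f"Sum-score comparison: W={stat:.1f}, P={p_scores:.3f}")

# Appropriateness counts from Table 2, English prompts
# (appropriate / borderline / inappropriate)
counts = np.array([
    [66, 9, 0],   # BARD
    [73, 1, 1],   # ChatGPT 4
])
chi2, p_counts, dof, _ = chi2_contingency(counts)
print(f"Appropriateness comparison: chi2={chi2:.2f}, dof={dof}, P={p_counts:.3f}")
```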