Table 1 Overview of prompt-response pairs
From: Red teaming ChatGPT in medicine to yield real-world insights on model behavior
| Characteristic | All (N = 1504) | GPT-3.5 (N = 376) | GPT 4.0 (N = 376) | GPT 4.0 with Internet (N = 376) | GPT-4o (N = 376) |
|---|---|---|---|---|---|
| Appropriate Responses | 1201 (79.9%) | 279 (74.2%) | 314 (83.5%) | 309 (82.2%) | 299 (79.5%) |
| Inappropriate Responses | 303 (20.1%) | 97 (25.8%) | 62 (16.5%) | 67 (17.8%) | 77 (20.4%) |
| Safety<sup>a</sup> | 71 (23.7%) | 27 (27.8%) | 14 (22.6%) | 16 (23.9%) | 14 (18.2%) |
| Privacy<sup>a</sup> | 31 (10.2%) | 13 (13.3%) | 7 (11.3%) | 7 (10.4%) | 4 (5.2%) |
| Hallucinations<sup>a</sup> | 156 (51.3%) | 56 (57.1%) | 27 (43.5%) | 32 (47.8%) | 41 (53.2%) |
| Bias<sup>a</sup> | 101 (33.2%) | 30 (30.6%) | 20 (32.3%) | 22 (32.8%) | 29 (37.7%) |
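
The percentages in the Appropriate Responses row appear to be each count divided by the column's N. The following is a minimal sketch of that arithmetic, assuming simple rounding to one decimal place; the dictionary layout and the names `counts`, `pct`, and `N_PER_MODEL` are illustrative and not from the source. The rows marked with a superscript a are left out because the footnote that defines their denominators is not reproduced here.

```python
# Counts transcribed from Table 1 as (appropriate, inappropriate) per model.
# The structure and names below are illustrative assumptions.
N_PER_MODEL = 376
counts = {
    "GPT-3.5": (279, 97),
    "GPT 4.0": (314, 62),
    "GPT 4.0 with Internet": (309, 67),
    "GPT-4o": (299, 77),
}


def pct(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Each model's appropriate and inappropriate counts should sum to its N.
for model, (appropriate, inappropriate) in counts.items():
    assert appropriate + inappropriate == N_PER_MODEL, model

# Column totals should reproduce the "All" column.
total_appropriate = sum(a for a, _ in counts.values())
total_inappropriate = sum(i for _, i in counts.values())
total_n = N_PER_MODEL * len(counts)

# Share of appropriate responses per model and overall.
appropriate_pct = {model: pct(a, N_PER_MODEL) for model, (a, _) in counts.items()}
overall_appropriate_pct = pct(total_appropriate, total_n)

for model, value in appropriate_pct.items():
    print(f"{model}: {value}%")
print(f"All: {overall_appropriate_pct}%")
```

Under these assumptions the per-model counts sum to 376 in every column, and the column totals match the 1201 and 303 reported under All.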