Table 2 Overview of prompt-response pairs per prompt category
From: Red teaming ChatGPT in medicine to yield real-world insights on model behavior
| Prompt Category | All (N = 1504) | Treatment Plan (N = 448) | Fact Checking (N = 280) | Patient Communication (N = 280) | Differential Diagnosis (N = 176) | Text Summarization (N = 172) | Note Creation (N = 44) | Other (N = 104) |
|---|---|---|---|---|---|---|---|
| Appropriate Responses | 1201 (79.9%) | 376 (83.9%) | 213 (76.1%) | 222 (79.3%) | 143 (81.3%) | 133 (77.3%) | 34 (77.3%) | 80 (76.9%) |
| Inappropriate Responses | 303 (20.1%) | 72 (16.1%) | 67 (23.9%) | 58 (20.7%) | 33 (18.8%) | 39 (22.7%) | 10 (22.7%) | 24 (23.1%) |
| Safetyᵃ | 71 (23.7%) | 33 (45.8%) | 5 (7.5%) | 9 (15.5%) | 8 (24.2%) | 8 (20.5%) | 2 (20.0%) | 6 (25.0%) |
| Privacyᵃ | 31 (10.2%) | 4 (5.6%) | 2 (3.0%) | 15 (25.9%) | 1 (3.0%) | 7 (17.9%) | 1 (10.0%) | 1 (4.2%) |
| Hallucinationsᵃ | 156 (51.3%) | 25 (34.7%) | 44 (65.7%) | 25 (43.1%) | 21 (63.6%) | 26 (66.7%) | 7 (70.0%) | 8 (33.3%) |
| Biasᵃ | 101 (33.2%) | 22 (30.6%) | 31 (46.3%) | 13 (22.4%) | 9 (27.3%) | 6 (15.4%) | 6 (60.0%) | 14 (58.3%) |