Table 1 Study corpus characteristics
From: Evaluating clinical AI summaries with large language models as judges
Characteristic | Train/Development | Test | P-value |
|---|---|---|---|
Summaries, n | 160 | 40 | |
Provider Specialty, n (%) | | | 0.299 |
Primary Care | 78 (48.75%) | 14 (35.00%) | |
Surgical Care | 39 (24.38%) | 13 (32.50%) | |
Emergency/Urgent Care | 21 (13.12%) | 4 (10.00%) | |
Neurology/Neurosurgery | 11 (6.88%) | 3 (7.50%) | |
Other Specialty Care | 11 (6.88%) | 6 (15.00%) | |
Number of Notes, n (%) | | | 0.078 |
Three | 37 (23.12%) | 16 (40.00%) | |
Four | 45 (28.12%) | 7 (17.50%) | |
Five | 78 (48.75%) | 17 (42.50%) | |
Length of Notes (words), median (IQR) | 3050 (2174, 4128) | 2816 (2137, 4364) | 0.931 |
Length of Notes (tokens), median (IQR) | 6445 (4366, 8665) | 5845 (4428, 9086) | 0.949 |
Length of LLM Input* (words), median (IQR) | 4746 (3871, 5831) | 4788 (3782, 4788) | 0.975 |
Length of LLM Input* (tokens), median (IQR) | 9681 (7448, 11704) | 8920 (7398, 11850) | 0.854 |
Length of Summary (words), median (IQR) | 328 (191, 498) | 250 (179, 414) | 0.418 |
Length of Summary (tokens), median (IQR) | 566 (368, 869) | 425 (340, 787) | 0.221 |