Table 1 Study corpus characteristics

From: Evaluating clinical AI summaries with large language models as judges

 

| | Train/Development | Test | P-value |
|---|---|---|---|
| Summaries, n | 160 | 40 | |
| Provider Specialty, n (%) | | | 0.299 |
| Primary Care | 78 (48.75%) | 14 (35.00%) | |
| Surgical Care | 39 (24.38%) | 13 (32.50%) | |
| Emergency/Urgent Care | 21 (13.12%) | 4 (10.00%) | |
| Neurology/Neurosurgery | 11 (6.88%) | 3 (7.50%) | |
| Other Specialty Care | 11 (6.88%) | 6 (15.00%) | |
| Number of Notes, n (%) | | | 0.078 |
| Three | 37 (23.12%) | 16 (40.00%) | |
| Four | 45 (28.12%) | 7 (17.50%) | |
| Five | 78 (48.75%) | 17 (42.50%) | |
| Length of Notes (words), median (IQR) | 3050 (2174, 4128) | 2816 (2137, 4364) | 0.931 |
| Length of Notes (tokens), median (IQR) | 6445 (4366, 8665) | 5845 (4428, 9086) | 0.949 |
| Length of LLM Input* (words), median (IQR) | 4746 (3871, 5831) | 4788 (3782, 4788) | 0.975 |
| Length of LLM Input* (tokens), median (IQR) | 9681 (7448, 11704) | 8920 (7398, 11850) | 0.854 |
| Length of Summary (words), median (IQR) | 328 (191, 498) | 250 (179, 414) | 0.418 |
| Length of Summary (tokens), median (IQR) | 566 (368, 869) | 425 (340, 787) | 0.221 |

  1. *Instruction + Rubric + Notes + Summary.
  2. The study corpus consisted of 200 unique patient summaries split into development and test sets. Provider specialties were grouped into five categories: Primary Care (Internal Medicine, Family Medicine, Pediatrics); Surgical Care (General Surgery, Orthopedics, Ophthalmology, Urology); Neurology/Neurosurgery; Emergency/Urgent Care; and Other Specialty Care (Gynecology, Dermatology, Psychiatry, Sleep, and Anesthesiology). The median and interquartile range (IQR) of the length of the notes, summaries, and LLM input are provided in both words and tokens.
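
The table does not say which statistical test produced the P-values. As a minimal sketch, assuming a Pearson chi-square test of independence was used for the categorical rows, the snippet below recomputes the column percentages and the P-value for "Number of Notes" from the reported counts. The function names are illustrative and not taken from the study. Under this assumption the P-value comes out near 0.078, in line with the reported value.

```python
import math

# Counts copied from the "Number of Notes" rows (Three, Four, Five).
dev = [37, 45, 78]   # Train/Development, n = 160
test = [16, 7, 17]   # Test, n = 40


def column_percentages(counts):
    """Return each count as a percentage of its column total."""
    total = sum(counts)
    return [100 * c / total for c in counts]


def chi_square_two_groups(a, b):
    """Pearson chi-square statistic for a (categories x 2 groups) table."""
    n_a, n_b = sum(a), sum(b)
    n = n_a + n_b
    stat = 0.0
    for obs_a, obs_b in zip(a, b):
        row_total = obs_a + obs_b
        # Expected counts if category and group were independent.
        exp_a = row_total * n_a / n
        exp_b = row_total * n_b / n
        stat += (obs_a - exp_a) ** 2 / exp_a + (obs_b - exp_b) ** 2 / exp_b
    return stat


stat = chi_square_two_groups(dev, test)

# Three categories and two groups give 2 degrees of freedom. For 2 degrees
# of freedom the chi-square upper-tail probability is exactly exp(-x / 2),
# so no statistics library is needed here.
p_value = math.exp(-stat / 2)

print([round(p, 2) for p in column_percentages(dev)])
print(round(stat, 3), round(p_value, 3))
```

The closed-form tail probability only holds for 2 degrees of freedom. The "Provider Specialty" rows have five categories (4 degrees of freedom) and would need a general chi-square distribution function, for example from a statistics library.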