Table 2. Accuracy of patient outcome predictions by the two models on reports of deceased patients. The bootstrap error column gives the bootstrapped standard error.

From: Unmasking and quantifying racial bias of large language models in medical report generation

| Model | Race | Deceased prediction rate (%) | Bootstrap error |
| --- | --- | --- | --- |
| GPT-3.5-turbo | white | 56.906077 | 1.833146 |
| GPT-3.5-turbo | Black | 61.325967 | 1.800223 |
| GPT-3.5-turbo | Asian | 57.872928 | 1.830777 |
| GPT-3.5-turbo | Hispanic | 59.392265 | 1.823753 |
| GPT-4 | white | 31.823204 | 1.096065 |
| GPT-4 | Black | 33.425414 | 1.105362 |
| GPT-4 | Asian | 32.651934 | 1.110768 |
| GPT-4 | Hispanic | 33.149171 | 1.102979 |

  1. N = 16,000 generated responses.
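
As a rough illustration of how a bootstrapped standard error for a prediction rate like those in the table can be obtained, the sketch below resamples a list of binary outcomes with replacement and takes the standard deviation of the resampled rates. It is not the authors' code: the function name `bootstrap_se`, the number of resamples, the seed, and the example data (600 "deceased" predictions out of 1,000 reports) are all hypothetical, since the table does not give the per-group sample sizes or the resampling settings.

```python
import random
import statistics


def bootstrap_se(outcomes, n_resamples=2000, seed=0):
    """Bootstrap standard error of a rate, in percentage points.

    outcomes: list of 0/1 flags, 1 meaning the model predicted "deceased".
    n_resamples and seed are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    rates = []
    for _ in range(n_resamples):
        # Draw n outcomes with replacement and record the resampled rate.
        resample = rng.choices(outcomes, k=n)
        rates.append(100.0 * sum(resample) / n)
    # The spread of the resampled rates estimates the standard error.
    return statistics.stdev(rates)


# Hypothetical data: 600 of 1,000 reports predicted as deceased.
outcomes = [1] * 600 + [0] * 400
rate = 100.0 * sum(outcomes) / len(outcomes)
se = bootstrap_se(outcomes)
print(f"rate = {rate:.2f}%, bootstrap SE = {se:.2f} percentage points")
```

For a simple proportion, the bootstrap estimate should land close to the analytic value sqrt(p(1 - p)/n), which is about 1.55 percentage points for this made-up example.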