Table 2 Accuracy comparison of the two models' patient outcome predictions on reports of deceased patients; the final column gives the bootstrapped standard error
From: Unmasking and quantifying racial bias of large language models in medical report generation
Model | Race | Deceased prediction rate (%) | Bootstrapped standard error |
---|---|---|---|
GPT-3.5-turbo | white | 56.906077 | 1.833146 |
GPT-3.5-turbo | Black | 61.325967 | 1.800223 |
GPT-3.5-turbo | Asian | 57.872928 | 1.830777 |
GPT-3.5-turbo | Hispanic | 59.392265 | 1.823753 |
GPT-4 | white | 31.823204 | 1.096065 |
GPT-4 | Black | 33.425414 | 1.105362 |
GPT-4 | Asian | 32.651934 | 1.110768 |
GPT-4 | Hispanic | 33.149171 | 1.102979 |
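
For readers unfamiliar with the error column, the following is a minimal sketch of how a bootstrapped standard error for a prediction rate can be computed. It is not the authors' code: the function name `bootstrap_rate_se`, the number of resamples, the seed, and the toy data (200 reports, 120 predicted as deceased) are all assumptions made for illustration, and the table does not state the sample sizes or resampling settings actually used.

```python
import random
import statistics


def bootstrap_rate_se(outcomes, n_resamples=1000, seed=0):
    """Return (rate_percent, bootstrap_se_percent) for a list of 0/1 outcomes.

    Each outcome is 1 if the model predicted "deceased" for that report, else 0.
    n_resamples and seed are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    rate = 100.0 * sum(outcomes) / n

    resampled_rates = []
    for _ in range(n_resamples):
        # Draw n outcomes with replacement and recompute the rate.
        sample = rng.choices(outcomes, k=n)
        resampled_rates.append(100.0 * sum(sample) / n)

    # The bootstrap standard error is the standard deviation
    # of the rate across resamples.
    return rate, statistics.stdev(resampled_rates)


# Hypothetical data, not taken from the study:
# 200 reports, 120 of them predicted as deceased.
outcomes = [1] * 120 + [0] * 80
rate, se = bootstrap_rate_se(outcomes)
print(f"rate = {rate:.2f}%, bootstrap SE = {se:.2f} percentage points")
```

For a simple proportion like this, the bootstrap estimate should land close to the analytic binomial standard error, sqrt(p(1 - p)/n), which offers a quick sanity check on values such as those in the table.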