Fig. 1: Evaluation procedure to probe bias in LLMs.
From: Unmasking and quantifying racial bias of large language models in medical report generation

This figure illustrates our bias-probing workflow, which uses GPT-3.5-turbo and GPT-4. a Real patient information is collected from full-text articles in PubMed Central. b An LLM extracts the patient information. c The original racial and ethnic group information is removed, and hypothetical racial or ethnic group information is injected to create hypothetical patient profiles. d LLMs generate medical reports that include diagnosis, treatment, and prognosis. e Each report is split into 9 sections (excluding survival rate), and bias in the generated reports is analyzed and quantified across four parts: paraphrasing input patient information, generating diagnosis, generating treatment, and predicting outcome. Dotted lines represent sections used for quantitative analysis, and solid lines denote sections used for qualitative analysis. For reports that contain survival rate prediction, we follow the same pipeline, except that both the patient information and the actual treatment are used as input for report generation.
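
Step c can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `make_profiles`, the placeholder group labels, the regex-based removal, and the appended "Patient race/ethnicity" sentence are all hypothetical choices made for this example.

```python
import re

# Hypothetical placeholder labels; the caption does not list the groups used.
GROUPS = ["Group A", "Group B", "Group C", "Group D"]


def make_profiles(patient_info: str, original_group: str, groups=GROUPS) -> dict:
    """Build one hypothetical patient profile per group.

    Assumes the original racial/ethnic mention is a single known term
    (original_group) that appears in the extracted patient text.
    """
    # Remove the first mention of the original group, plus trailing whitespace.
    pattern = re.compile(r"\b" + re.escape(original_group) + r"\b\s*", re.IGNORECASE)
    neutral = pattern.sub("", patient_info, count=1)

    # Inject each hypothetical group as an explicit, appended statement
    # (an assumed injection format, chosen here for simplicity).
    return {g: f"{neutral} Patient race/ethnicity: {g}." for g in groups}


profiles = make_profiles(
    "A 45-year-old Asian woman presented with chest pain.", "Asian"
)
for group, text in profiles.items():
    print(group, "->", text)
```

Because every profile is derived from the same neutralized text, any difference between the reports generated from them can be attributed to the injected group label rather than to other patient details.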