Introduction

Generative artificial intelligence (AI) aims to create synthetic data that is indistinguishable from real data1. The field is advancing rapidly, driven by the development of powerful new algorithms and increasing computational power. One example of generative AI is text-to-image generators, which create synthetic images from human text commands2. These generators use algorithms from natural language processing, such as transformers, to interpret the text command and then apply algorithms from computer vision, such as generative adversarial networks or diffusion models, to create images based on it.

Today, the images generated by publicly available text-to-image generators are often photorealistic and hardly distinguishable from real images. In addition, most generators are free to use, the commands used to generate the images can be adapted to any need of the user, the copyright typically belongs to the individual creating the images, and no consent of the person depicted is required. The visual quality of the images, together with the convenience of image generation, has led to the increasing popularity of these algorithms, with millions of images being created every day. In the medical context, artificially created images of patients are used in scientific and non-scientific publications and for teaching purposes (e.g., in presentation slides and reading material)3,4,5,6,7. Furthermore, as medical data is scarce, such images are used to augment data sets for training other AI algorithms6,8,9, which are, for example, used to obtain diagnoses from photos of patients’ faces10.

Generative AI holds tremendous potential, but the increasing number of users and use cases brings risks and challenges. Previous research has shown that text-to-image generators may fail to depict factual details accurately11,12. Especially in the medical context, however, it is not sufficient that images are photorealistic; they also need to be accurate. For example, when depicting patients with certain diseases, the pictures should accurately present important features of the disease, including fundamental epidemiological characteristics. Although some diseases display typical phenotypes that could be reflected in facial images (e.g., a flattened nose and epicanthus in Down syndrome), disease-specific facial features are lacking for numerous conditions. On the other hand, most diseases predominantly occur in specific age ranges, and/or a specific sex, and/or specific races and ethnicities. Accordingly, a first step for generative AI should be the accurate representation of the epidemiological characteristics of diseases, so that generated images match the real-world epidemiology. In addition, there is a vast literature on the susceptibility of medical personnel to unconscious biases related to epidemiological characteristics such as age, sex, race/ethnicity, or body weight13. Biases, in this context, refer to systematic errors or prejudices that lead to a tendency to favor one group over another, thereby resulting in unequal medical treatment13. These biases can have profound effects on patient well-being. For example, women and people of color have been shown to have worse access to healthcare (e.g., greater delays to healthcare and less healthcare coverage)14,15. AI has been shown to amplify such biases: for example, women and people of color have been misclassified and under-represented by AI algorithms16,17,18. This is partly because the models are often trained on publicly available data and are thus likely to adopt and replicate biases in the data17. AI, however, should help reduce these biases rather than potentially amplify them.

Yet, recent research has shown that AI algorithms in general, and generative AI algorithms in particular, are susceptible to biases based on sex and gender as well as race and ethnicity16,17,18. Females and people of color, especially, have been shown to be under-represented in established training datasets16. As a consequence, AI algorithms tend to misclassify and misinterpret images of these groups16,17,18. In addition, some diseases, such as infectious19,20, psychiatric21,22, or internal medicine23 diseases, carry disease-related stigmas with profound effects on patients. Stigmas reduce the likelihood of individuals seeking help and adhering to therapy regimens, reduce the quality of treatment by medical personnel, and increase societal risk factors for patients21,22. Thus, replication or amplification of biases and stigmas by generative AI models may have particularly adverse effects.

In this study, we evaluate the accuracy of the representation of disease-specific demographic characteristics (not key phenotypic features) of patient populations depicted by all four commonly used text-to-image generators that allow the generation of patient images. More specifically, we use Firefly from Adobe, the Bing Image Generator from Microsoft, Imagine from Meta, and Midjourney to create images of patients with 29 diseases. Among these are 14 diseases with a distinct epidemiology (e.g., occurrence only in a specific age group or sex) and 15 stigmatized diseases. Further, we analyze potential biases, with a focus on the sex and race/ethnicity of the depicted individuals.

Results

A total of 9060 images were generated: up to 2320 with each of the four text-to-image generators and up to 320 for each of the 29 diseases (Fig. 1). There were a few exceptions: for substance use disorder, only 20 images could be generated in Bing, likely due to a software update. Generation of images of patients with HIV infection and liver cirrhosis was not possible in Adobe due to company guidelines24,25. For exemplary AI-generated images and ratings, see Fig. 2. Given the substantial differences in image characteristics between diseases, it is plausible that the disease-specific prompts determined the main features of the photos.

Fig. 1: Comparison of real-world epidemiological data and results from ratings of images created by text-to-image generators.

The green background indicates “accurate” demographic representation in comparison to the real-world data, the yellow background indicates “imprecise” representation, and the red background indicates fundamentally “wrong” representation. Note: Anxiety disorders: the real-world age peak varies between 8 years for phobias and 32 years for generalized anxiety disorder. Cholecystitis: real-world epidemiological data are unclear for “cholecystitis”; the data presented are for “gall bladder and biliary diseases”. HIV infection and liver cirrhosis: for Adobe Firefly n = 0 due to company guidelines prohibiting the creation of images of patients with these diseases. Substance use disorder: for the Bing Image Generator n = 20, likely due to a software patch preventing the creation of additional images. There were only isolated images with the race rated as Native Hawaiian or Other Pacific Islander or American Indian or Alaska Native; these are not depicted here. References: diseases predominantly affecting children42,43, diseases predominantly affecting adults44,45, diseases predominantly affecting the elderly46,47,63,64, diseases predominantly affecting males48,49,65,66, diseases predominantly affecting females50,51,65,67, diseases predominantly affecting White individuals52,53,68,69, diseases predominantly affecting Black or African American individuals55,65,70,71, stigmatized infectious diseases34,65,72,73,74,75,76,77,78,79,80,81,82,83, stigmatized psychiatric diseases35,65,84,85,86,87,88,89,90,91, stigmatized internal medicine conditions and diseases40,92,93,94,95,96,97. ADHD attention deficit hyperactivity disorder. Age groups: Adu adults, Ch children, Eld elderly. Sex: F female, M male. Races/ethnicities: BAA Black or African American, HL Hispanic or Latino.

Fig. 2: Examples of AI-generated images and corresponding demographic characteristics.

The images show patients with the four diseases attention deficit hyperactivity disorder (ADHD), Alzheimer’s disease, obesity, and sickle cell disease. Images in the first row were generated by Adobe Firefly, in the second row by the Bing Image Generator, in the third row by Meta Imagine, and in the fourth row by Midjourney. BAA Black or African American, HL Hispanic or Latino.

Ratings and inter-rater reliability

For each image, two raters determined the following demographic characteristics:

  • Sex [female (F), male (M)];

  • Age [child (0–19 years), adult (20–60 years), elderly (>60 years)] as suggested by the World Health Organization (WHO)26 and the United Nations (UN)27;

  • A combined rating of “race/ethnicity” [Asian, Black or African American (BAA), Hispanic or Latino (HL), Native Hawaiian or Other Pacific Islander (NHPI), American Indian or Alaska Native (AIAN), White] as suggested by the United States Census Bureau28 and National Institutes of Health (NIH)29;

  • Weight [underweight, normal weight, overweight].

The inter-rater reliability (IRR) for sex was κ = 0.963; for age κ = 0.792; for race/ethnicity κ = 0.896; and for weight κ = 0.842.

Accuracy of the representation of disease-specific demographic characteristics

Across the 29 diseases, the representation of disease-specific demographic characteristics in the patient images was often inaccurate for all four text-to-image generators (Figs. 1 and 3). Accurate representation of age, sex, and race/ethnicity was achieved only twice, in images of patients with multiple sclerosis and liver cirrhosis generated by Meta. The fundamental demographic characteristics as defined in Fig. 1 (e.g., depiction of children in the case of pyloric stenosis or medulloblastoma) were most often accurately depicted by Midjourney (in 9 of 14 diseases) and least often by Adobe (in 2 of 14 diseases). In many cases, demographic characteristics were depicted markedly incorrectly, e.g., for the sex-specific diseases prostate cancer, hemophilia B, premenstrual syndrome, and eclampsia, for which Adobe, Meta, and partly Midjourney depicted both female and male patients. On the other hand, Meta showed better accuracy in representing race/ethnicity, and Midjourney in representing the age groups (Fig. 3). The incidence of a disease did not have a clear effect on accuracy; e.g., images of patients with the more common disease pyloric stenosis were not more accurate than images of patients with the rare disease medulloblastoma.

Fig. 3: Comparison of the overall performance of each text-to-image generator.

The top row of diagrams shows the accuracy in the depiction of the variables age group, sex, and race/ethnicity for all 29 diseases and each text-to-image generator. The middle row shows the accuracy for the 14 diseases with a distinct epidemiology. The bottom row shows the accuracy for the 15 stigmatized diseases. Green color indicates “accurate” demographic representation in comparison to the real-world data, yellow color indicates “imprecise” representation, red color indicates fundamentally “wrong” representation.

General biases

Among all diseases, there was an over-representation of White individuals that was most pronounced in Adobe (Adobe: 87%, Bing: 68%, Meta: 28%, Midjourney: 78%, pooled real-world patient data: 20%; Fig. 4a). There was no substantial over-representation of Asian, BAA, HL, NHPI, or AIAN individuals in the 15 stigmatized diseases (Fig. 4b). Moreover, among all diseases, there was an over-representation of normal-weight individuals (Adobe: 96%, Bing: 88%, Meta: 93%, Midjourney: 93%, general population30: 63%). Conversely, there was an under-representation especially of overweight individuals (Adobe: 3%, Bing: 5%, Meta: 4%, Midjourney: 3%, general population30: 32%). We did not observe a substantial over-representation of male sex (Adobe: 49%, Bing: 55%, Meta: 48%, Midjourney: 42%, pooled real-world patient data: 52%).

Fig. 4: Representation of the races/ethnicities in the real world, in images by Adobe, Bing, Meta, and Midjourney.

a Racial/ethnic representation for all 27 diseases for which reliable real-world data exist. b Racial/ethnic representation for the 15 stigmatized diseases only. Note: there were only isolated images with the race rated as Native Hawaiian or Other Pacific Islander or American Indian or Alaska Native; these are not depicted here. BAA Black or African American, HL Hispanic or Latino.

Sex differences in stigmatized diseases

Among the 15 stigmatized diseases, there were sex differences in age, facial expression, and weight. In all four text-to-image generators, females were younger than males (Adobe: F(1, 1036) = 8.270, p = 0.004; Bing: F(1, 1136) = 23.878, p < 0.001; Meta: F(1, 1195) = 19.872, p < 0.001; Midjourney: F(1, 1196) = 58.746, p < 0.001). More precisely, they were more often depicted as children and/or adults, and less often as elderly.

Other findings were more mixed. In Adobe (F(1, 1035) = 4.960, p = 0.026), females were more often depicted as being happy or sad/anxious/in pain, and less often as having neutral facial expression. In Midjourney (F(1, 1195) = 4.386, p = 0.036), females were more often depicted as being neutral or sad/anxious/in pain, and less often as being happy or angry.

Bing (F(1, 1136) = 23.878, p < 0.001) and Meta (F(1, 1195) = 19.872, p < 0.001) depicted females as having higher weight than males. In Midjourney (F(1, 1195) = 4.011, p = 0.045), females were depicted as having lower weight than males. For an analysis of all 29 diseases together, see the Supplementary Materials.

Racial/ethnic differences in stigmatized diseases

Among the 15 stigmatized diseases, there were racial/ethnic differences in age, facial expression, and weight. In all four text-to-image generators, White individuals were more often depicted as elderly (Adobe: F(1, 1036) = 6.475, p = 0.011; Bing: F(1, 1136) = 4.810, p = 0.029; Meta: F(1, 1195) = 50.692, p < 0.001; Midjourney: F(1, 1196) = 13.072, p < 0.001). Of note, the mean age peak of diseases for which White is the most common race is 42 years, compared with 34 years for all other diseases.

In Adobe (F(1, 1035) = 8.104, p = 0.005), Meta (F(1, 1194) = 45.094, p < 0.001), and Midjourney (F(1, 1195) = 8.347, p = 0.004), White individuals were more often sad/anxious/in pain and less often neutral.

In images by Bing (F(1, 1135) = 27.083, p < 0.001) and Meta (F(1, 1194) = 4.646, p = 0.031), Asian, BAA, HL, NHPI, and AIAN individuals combined were rated as having higher weight than White individuals. In Midjourney, the opposite was the case (F(1, 1195) = 6.804, p = 0.009). For an analysis of all 29 diseases together, see the Supplementary Materials.

Discussion

In summary, we found that images of patients created by Adobe Firefly, the Bing Image Generator, Meta Imagine, and Midjourney often did not accurately represent disease-specific demographic characteristics. In addition, we observed an over-representation of White as well as normal-weight individuals across all analyzed diseases. In all text-to-image generators, female individuals were more often depicted as younger, and White individuals more often as elderly, compared with male and with Asian, BAA, HL, NHPI, and AIAN individuals, respectively. Such inaccuracies raise concern about the role of AI in amplifying misconceptions in healthcare18, given its large number of users and use cases3,4,5,6,7,8,9. Addressing these concerns may help to realize the full potential of generative AI in healthcare.

We found that images by all four text-to-image generators displayed a broad range of demographic inaccuracies. This was most striking for Adobe’s and Meta’s depictions of patients with prostate cancer, hemophilia B, premenstrual syndrome, and eclampsia, for which both female and male individuals were shown. Likewise, images by Bing and Midjourney often displayed substantial inaccuracies, especially regarding the races/ethnicities.

Presumably, these inaccuracies are largely caused by the composition of the training data of the generative AI models. They are typically trained on large non-medical datasets consisting of publicly available images from the internet, databases such as ImageNet and Common Objects in Context, and other sources6. Such large datasets are necessary to produce photorealistic images. However, as they do not contain large numbers of images of actual patients, information on disease-specific demographic characteristics, as well as on important risk factors, is missing. Thus, the models’ ability to generate accurate images of these patients and their diseases is limited. This may also have led to the over-representation of White and normal-weight individuals, who may likewise be over-represented in the training data.

Another factor influencing the quality of the output is bias mitigation strategies in the code of the algorithms, which can be applied in the post-training phase of algorithm development and aim to counteract known biases in the training data. Such strategies can result in an over-correction of biases, as has been shown previously31. Thus, it can also be speculated that the depiction of both female and male patients in images of sex-specific diseases by Adobe, Meta, and partly Midjourney was influenced by such code-based adaptations. In fact, there was no over-representation of either sex in the images of these two text-to-image generators. This may be interpreted as a positive sign, as sex/gender biases have been a common phenomenon in generative AI algorithms16,18. On the other hand, achieving accurate demographic representation by applying bias mitigation seems challenging, and representative training data may be necessary.

Moreover, we also found examples of insufficient representation in our data. We detected a bias toward White individuals in all four generators. Interestingly, this bias was much lower in Meta Imagine, which may be another sign of stricter bias mitigation by Meta. A similar over-representation of White individuals has previously been reported in a study on AI-generated images of healthcare professionals18. In addition, we detected a bias towards normal weight in all four generators; conversely, overweight individuals in particular were under-represented, which may be caused by a similar under-representation in the training data. However, these results need to be interpreted cautiously, as the images did not depict the entire body and estimation of BMI from facial photos is challenging.

In all four text-to-image generators, the depicted females were younger than the males, which may represent a bias of the algorithms, potentially veering towards gender stereotypes. However, the real-world epidemiology is complex. While females generally have a higher life expectancy than males32, some studies suggest an earlier onset in females for some of the diseases included in our analyses, e.g., depression33 or COVID-1934. Other studies suggest an earlier symptom onset in males for other diseases, e.g., schizophrenia35 or type 2 diabetes36, or no known sex differences, e.g., in multiple sclerosis37 or malaria38. There is additional research on sex differences in the age of diagnosis, as opposed to the age of disease onset35. Taken together, no conclusive interpretation of these findings is possible.

In all four text-to-image generators, White individuals were more often depicted as elderly. This is in line with the mean age peak of diseases predominantly affecting White individuals being higher than that of all other diseases (42 vs. 34 years). Also, general life expectancy is still highest in Europe (around 79 years) and lowest in Africa (around 64 years)39. In addition, Adobe, Meta, and Midjourney depicted White individuals more often as sad/anxious/in pain compared with Asian, BAA, HL, NHPI, and AIAN patients. This could potentially be interpreted as a difference in emotionality in reaction to disease, or as a sign of higher empathy of the algorithms towards White patients, although caution is warranted. Further, Asian, BAA, HL, NHPI, and AIAN individuals combined were depicted as having higher weight than White individuals, which is inaccurate despite global shifts in the distribution of under- and overweight40,41. Thus, our data revealed not only an under-representation of Asian, BAA, HL, NHPI, and AIAN individuals combined but also a tendency to portray them as more overweight compared with White individuals.

Generative AI holds tremendous potential in healthcare. However, AI-generated images are not yet ready to be used in the medical context without caution. Instead, they should be carefully evaluated for accuracy and potential biases. Such biases include, but are not limited to, over-representation of White individuals, male sex (although not observed in our sample), and normal weight. Given our findings, transparency about the use of AI-generated images is advised. Further, it is recommended to address inaccuracies and biases manually, e.g., by selecting specific images so that they represent the real-world distribution of the most important demographic characteristics. Importantly, this requires knowledge about these characteristics in the real world and is bound to introduce individual biases. Going forward, the accuracy of the algorithms could be improved by making the general training data more representative or by fine-tuning models pre-trained on images of healthy humans with carefully curated patient images. Moreover, biases such as the over-representation of White and normal-weight individuals could also be addressed with code-based bias mitigation strategies. Furthermore, in the context of sensitive or scientific content, text-to-image generators may be improved by prompt engineering and quality control measures. For example, rather than prohibiting the generation of patient images, as we observed for Google Gemini, Stable Diffusion, and DALL-E/ChatGPT, text-to-image generators could flag such images and either ask whether the disease characteristics should be represented accurately or provide precise scientific data. Prohibiting image generation of patients with stigmatized diseases such as substance use disorder, HIV infection, or liver cirrhosis may instead perpetuate stigma through censorship.

There are limitations to our study. Firstly, although we chose a neutral prompt for generating patient images, other prompts requesting epidemiologically accurate and unbiased presentation of patients could have led to improved images and thus to different results of our analyses. However, a neutral prompt was chosen to standardize the model input and gain the least biased impression of model performance. Moreover, most users—particularly outside scientific contexts—are likely to use similarly simple prompts for reasons of convenience. They may also be unaware of the limitations of generative AI or the potential of prompt engineering to yield more accurate patient representations. Future research is warranted to better understand the potential usefulness of prompt engineering as well as the application of suggestive prompts provided by the AI algorithm. Secondly, the rating process is inherently limited. One can only approximate demographic characteristics from images, despite the rating criteria being carefully defined. For example, race and ethnicity are aspects of a person’s identity that we could only estimate based on features such as skin color and facial characteristics. It is also difficult to estimate the weight category just from pictures of faces as facial shape and fat mass need not be correlated with BMI. Biological sex can only be determined by chromosomal analysis and ratings do not reflect gender identity. Thirdly, our comparisons to real-world epidemiological data are limited by the availability and quality of the real-world epidemiological data itself. Fourthly, the field of generative AI is rapidly evolving. Thus, our findings are only a snapshot of the features and capabilities of these algorithms in February and October 2024. However, the conclusions drawn from the results of this study point to more fundamental issues surrounding accuracy and biases in text-to-image generators that need to be addressed.

Future research could (A) explore the effects of prompt engineering to improve results; (B) investigate the effects of improved training data on demographic accuracy; (C) adopt more advanced measures for weight/BMI estimation from images, e.g., using a deep learning model to estimate BMI based on facial images; (D) explore intersectional biases, e.g., the depiction of BAA women compared to White men; and (E) study whether community guidelines preventing the generation of images of patients with certain stigmatized diseases such as substance use disorder, HIV infection, or liver cirrhosis amplify biases and stigmatization rather than preventing them.

Taken together, images of patients created by all four common text-to-image generators that permit the generation of patient images did not accurately display fundamental demographic characteristics such as sex, age, and race/ethnicity. In addition, we observed an over-representation of White as well as normal weight individuals. In consequence, the use of AI-generated patient images requires caution and future software models should focus on ensuring adequate demographic representation of patient groups across the world.

Methods

Text-to-image generators

We used the latest versions of four common text-to-image generators: Firefly from Adobe (adobe.com/products/firefly.html), the Bing Image Generator from Microsoft (bing.com/images/create), Imagine from Meta (imagine.meta.com), and Midjourney (midjourney.com). Importantly, other commonly used text-to-image generators, including DALL-E/ChatGPT from OpenAI, Gemini from Google, and Stable Diffusion from Stability AI, were tested, but company guidelines prohibited the generation of patient images.

The following text prompt was used to generate images of patients: “Photo of the face of a patient with [disease]”. The placeholder was filled with the name of the specific disease (e.g., “pyloric stenosis”; Fig. 1). More specifically, we created images of patients with 14 different diseases with distinct epidemiological characteristics to analyze the epidemiological accuracy of the generated patient images. We chose the diseases pyloric stenosis42 and medulloblastoma43, which predominantly occur in children; cholecystitis44 and granulomatosis with polyangiitis45, which predominantly occur in adults; Alzheimer’s disease46 and multiple myeloma47, which predominantly occur in the elderly; prostate cancer48 and hemophilia B49, which only or predominantly occur in males; premenstrual syndrome50 and eclampsia51, which only occur in females; melanoma52 and multiple sclerosis53, which predominantly occur in individuals originating from or living in Europe or North America; and malaria54 and sickle cell anemia55, which predominantly occur in individuals originating from or living in Africa. Of the two diseases in each category, we chose one with a fairly high incidence and one with a fairly low incidence to assess whether incidence affects the quality of the images.
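For illustration only, the disease-specific prompts could be assembled programmatically as in the following Python sketch. The disease list shown is an abbreviated, illustrative subset; in this study, the prompts were entered manually through each generator’s web interface.

```python
# Illustrative sketch only: assembling the disease-specific prompts described above.
# The disease list is truncated; in the study, prompts were entered manually into
# each text-to-image generator's web interface.
DISEASES = [
    "pyloric stenosis", "medulloblastoma",                 # predominantly children
    "cholecystitis", "granulomatosis with polyangiitis",   # predominantly adults
    "Alzheimer's disease", "multiple myeloma",             # predominantly elderly
    # ... remaining diseases omitted for brevity
]

PROMPT_TEMPLATE = "Photo of the face of a patient with {disease}"

prompts = [PROMPT_TEMPLATE.format(disease=d) for d in DISEASES]
for prompt in prompts:
    print(prompt)
```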

In addition, we created images of patients with 15 different diseases that are commonly stigmatized. More specifically, we created images of patients with the five stigmatized infectious diseases19,20 human immunodeficiency virus (HIV) infection, tuberculosis, hepatitis B, lues, and COVID-19; the five stigmatized psychiatric diseases21,56,57 depression, substance use disorder, anxiety disorder, schizophrenia, and attention deficit hyperactivity disorder (ADHD); and the five stigmatized internal medicine conditions and diseases58,59,60,61 obesity, heart attack, type 2 diabetes, lung cancer, and liver cirrhosis. For detailed descriptions of the rationale behind each disease/condition, see the Supplementary Materials.

The first result of each prompt was always used. Images were only excluded if they were black and white, did not represent a realistic photo, presented ambiguity in terms of which person should be rated, or if essential parts of the face (e.g., eyes, nose, mouth) were cut off.

All images were created in February 2024 and, due to article revision, in October 2024. We used eight computers, four internet browsers (i.e., Firefox, Internet Explorer, Google Chrome, and Safari), and 16 accounts to minimize the influence of user data on the image generation. Prompts were applied one by one in a new session/empty interface of the text-to-image generators. We generated 80 images for each of the 29 diseases in Adobe, Bing, Meta, and Midjourney. Importantly, we were only able to generate 20 images of patients with substance use disorder in Bing, likely due to a sudden software update prohibiting the generation of additional images of individuals with substance use disorders. Likewise, generation of images of patients with HIV infection and liver cirrhosis was not possible in Adobe Firefly due to company guidelines.

Ratings

Determining demographic characteristics from images of faces is challenging. We thus took several measures to standardize the ratings and reduce subjectivity. First, the ratings were performed by an international, multiracial and multiethnic team of twelve M.D./Ph.D. researchers (T.L.T.W., L.B.J., J.A.G., L.S.S., P.M., J.F.R., P.M., S.J., L.H.N., M.P., L.I.V., L.K.; 6 female, 6 male; 9 nationalities; 3 races/ethnicities). Second, the raters adhered to the established multiracial Chicago face dataset, which includes standardized images and descriptions of faces62. Third, a separate practice data set with images from the four text-to-image generators was created, and all practice ratings were performed and discussed by the entire group of raters in accordance with the Chicago face dataset. Fourth, each image was rated independently by two raters. In case of disagreement, a third rater was included and the final rating was reached by discussion and majority voting.

Wherever possible, real-world epidemiological data were obtained from official sources such as the WHO or large-scale epidemiological reviews such as global burden of disease studies. If such sources were not available, other epidemiological publications were used (see references in Fig. 1).

Statistical analyses

Firstly, we calculated the IRR for each variable based on the ratings by raters 1 and 2 using Cohen’s κ.
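For illustration, this IRR calculation can be reproduced with standard statistical libraries. The following Python sketch uses scikit-learn’s cohen_kappa_score on hypothetical example ratings; the study used the full set of paired ratings of raters 1 and 2 for each variable.

```python
# Minimal sketch of the inter-rater reliability calculation using Cohen's kappa.
# The rating lists below are hypothetical examples; in the study, the paired ratings
# of raters 1 and 2 for each variable (sex, age, race/ethnicity, weight) were used.
from sklearn.metrics import cohen_kappa_score

rater1_sex = ["F", "M", "M", "F", "M", "F"]
rater2_sex = ["F", "M", "F", "F", "M", "F"]

kappa_sex = cohen_kappa_score(rater1_sex, rater2_sex)
print(f"Cohen's kappa (sex): {kappa_sex:.3f}")
```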

Secondly, we analyzed the accuracy of the representation of disease-specific demographic characteristics in the generated patient images. Here, for each disease and text-to-image generator we compared the age, sex, and race/ethnicity combined to the real-world epidemiology. We evaluated whether the patients’ age, sex, and race/ethnicity as depicted in the images were “accurate” in comparison to the real-world epidemiological data (green background in Fig. 1), “imprecise” (yellow background), or “wrong” (red background).

Age was rated as “accurate” if both the most common age group and the age distribution matched the real-world data, as “imprecise” if only one of the two matched, and as “wrong” if neither matched.

Sex was rated as “accurate” if the F:M ratio differed from the real-world ratio by a factor of less than 1.50, as “imprecise” if it differed by a factor of 1.50–3.00, and as “wrong” if the difference was larger than a factor of 3.00. For example, a ratio of 60F:40M in the generated images and 45F:55M in the real world corresponds to a factor of (60/40)/(45/55) = 1.83 (“imprecise”).
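A minimal sketch of this rule is shown below, assuming the fold difference is taken direction-agnostically (i.e., a ratio half as large counts the same as one twice as large) and that both sexes are represented; sex-exclusive diseases would need separate handling.

```python
# Sketch of the sex-accuracy rule: compare the F:M ratio in the generated images
# to the real-world F:M ratio and classify by the fold difference.
# Assumptions: the fold difference is direction-agnostic, and no share is zero.
def classify_sex_accuracy(img_f, img_m, real_f, real_m):
    factor = (img_f / img_m) / (real_f / real_m)
    factor = max(factor, 1 / factor)  # treat 0.5x and 2x as the same deviation
    if factor < 1.50:
        return "accurate"
    if factor <= 3.00:
        return "imprecise"
    return "wrong"

# Worked example from the text: 60F:40M generated vs. 45F:55M real world -> ~1.83
print(classify_sex_accuracy(60, 40, 45, 55))  # prints "imprecise"
```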

For ratings of race/ethnicity, we calculated the cumulative deviation of the percentage values in the generated images from the real-world epidemiology. The race/ethnicity was rated as “accurate” if the deviation was less than 50, as “imprecise” if the deviation was 50–100, and as “wrong” if the deviation was larger than 100. For example, for pyloric stenosis, the real-world epidemiology is Asian: 39%, White: 36%, HL: 15%, BAA: 10%. In Adobe, the distribution was Asian: 8%, White: 79%, HL: 5%, BAA: 8%. The deviation thus corresponds to (39 − 8) + (79 − 36) + (15 − 5) + (10 − 8) = 86 (“imprecise”; for details see the Supplementary Materials).
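A minimal sketch of this calculation follows, assuming the cumulative deviation is the sum of absolute percentage-point differences across the race/ethnicity groups, as in the worked example above.

```python
# Sketch of the race/ethnicity accuracy rule: cumulative deviation of the percentage
# distribution in the generated images from the real-world distribution.
# Assumption: deviation = sum of absolute percentage-point differences per group.
def classify_race_accuracy(image_pct, real_pct):
    groups = set(image_pct) | set(real_pct)
    deviation = sum(abs(image_pct.get(g, 0) - real_pct.get(g, 0)) for g in groups)
    if deviation < 50:
        return "accurate"
    if deviation <= 100:
        return "imprecise"
    return "wrong"

# Worked example from the text (pyloric stenosis, Adobe): deviation = 86 -> "imprecise"
real_world = {"Asian": 39, "White": 36, "HL": 15, "BAA": 10}
adobe_images = {"Asian": 8, "White": 79, "HL": 5, "BAA": 8}
print(classify_race_accuracy(adobe_images, real_world))
```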

Thirdly, we compared the real-world data and the image ratings among all diseases combined to analyze more general biases such as an over-representation of male and White individuals as reported previously18.
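For illustration, such pooled comparisons amount to aggregating the ratings across all images of one generator and contrasting the resulting shares with the pooled real-world values; a small sketch with hypothetical counts:

```python
# Sketch of a pooled bias check: share of White individuals across all rated images
# of one generator compared with the pooled real-world share. Ratings are hypothetical.
from collections import Counter

ratings = ["White", "White", "BAA", "Asian", "White", "HL", "White"]  # hypothetical
counts = Counter(ratings)
white_share = 100 * counts["White"] / sum(counts.values())

POOLED_REAL_WORLD_WHITE = 20  # percent, pooled real-world patient data (see Results)
print(f"White share in images: {white_share:.0f}% vs. real world: {POOLED_REAL_WORLD_WHITE}%")
```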

Fourthly, we investigated biases regarding sex and race/ethnicity in the images of patients with stigmatized diseases. We used analyses of covariance (ANCOVA) to identify sex differences as well as racial/ethnic differences in weight and age. Based on the literature, we expected biases especially in the depiction of White individuals in comparison to people of color18. We therefore dichotomized the race/ethnicity variable into White vs. Asian, BAA, HL, NHPI, or AIAN. Analyses of sex differences were controlled for the effects of the disease depicted, race/ethnicity, and age (the latter not in analyses of sex differences in age). Analyses of racial/ethnic differences were controlled for the effects of the disease depicted, sex, and age (the latter not in analyses of racial/ethnic differences in age). The analyses were performed in IBM SPSS Statistics version 29.0.2.0. P-values < 0.05 were considered statistically significant.
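As an illustrative sketch only (the analyses reported here were run in IBM SPSS), an equivalent ANCOVA could be set up in Python with statsmodels. The file and column names below are hypothetical placeholders for one row of ratings per image.

```python
# Illustrative sketch of an ANCOVA on sex differences in age among images of
# stigmatized diseases, controlling for the depicted disease and the dichotomized
# race/ethnicity variable. File and column names are hypothetical; the study's
# analyses were performed in IBM SPSS.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("image_ratings.csv")  # one row per rated image (hypothetical file)

# age_code: ordinal code for the rated age group; sex, race_white_vs_other, and
# disease are categorical columns corresponding to the variables described above.
model = smf.ols(
    "age_code ~ C(sex) + C(race_white_vs_other) + C(disease)", data=df
).fit()
print(anova_lm(model, typ=2))  # F statistics and p-values, including the sex effect
```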