Abstract
Background
Equitable deployment of clinical artificial intelligence systems requires consistent performance across diverse patient populations. However, race information in electronic health records is often missing or inconsistently documented, limiting the ability to construct representative cohorts or assess algorithmic bias. This study evaluates model performance and fairness in predicting race from clinical text.
Methods
We compared four transformer-based deep learning models with a hierarchical convolutional neural network designed to capture the multilevel structure of clinical narratives. A two-phase active learning framework guided annotation of a primary care database. A fairness-aware loss function was applied to mitigate disparities across racial groups. Each model was trained with and without fairness-aware optimization. Performance and equity were evaluated using 10-fold cross-validation and subgroup audits across race, sex, age, and their intersections.
Results
Here we show that the hierarchical convolutional neural network achieves higher accuracy and performance equity than transformer models (macro F1 = 98.4%). Fairness constraints enhance parity across most transformer architectures but degrade hierarchical model performance and cause one clinical model to collapse toward majority predictions, demonstrating that fairness interventions are highly model-dependent. Persistent disparities across race, sex, and age indicate that inequities reflect architectural limitations and systemic biases.
Conclusions
This study demonstrates that fairness can be integrated into clinical language models, though effects vary by model type. Architectures aligned with clinical text structure inherently promote fairness, yet the mixed outcomes of fairness constraints highlight the need for tailored interventions. Persistent demographic disparities show that algorithmic bias often reflects upstream documentation inequities. This framework offers a scalable path toward equitable NLP for clinical artificial intelligence.
Plain Language Summary
Medical records often lack information about patients’ race, making it hard to identify potential race-associated health inequalities. We developed computer programs to find race information in doctors’ notes. We tested different types of artificial intelligence models and added special rules to make them work fairly for all racial groups. We found that a model designed to read notes the way doctors write them worked best. Adding additional fairness rules helped some models but hurt others, showing there is no one-size-fits-all solution. Many differences we saw came from how doctors write their notes differently for different patient groups. This research shows we can build fairer medical artificial intelligence, but fixing computer programs alone is not enough. We also need to improve how health information is recorded.
Data availability
The data used in this study are individual-level, de-identified electronic health record data. Policies, procedures, and Research Ethics Board (REB) regulations governing the source data prohibit public release of individual-level data; only aggregate data may be disclosed, and the data used in this project cannot be aggregated in a form suitable for public release. The dataset was derived from the University of Toronto Practice-Based Research Network (UTOPIAN) Data Safe Haven, a large primary care EHR repository encompassing over 400 clinics and 400,000 patients in Ontario, Canada. The parent database has been archived and is not currently accessible. Access to the dataset may be considered in the future upon request and approval by the University of Toronto Health Sciences REB. Requests for data access should be directed to the Human Research Ethics Unit at ethics.review@utoronto.ca or to the research ethics coordinator, Mariya Gancheva (m.gancheva@utoronto.ca). Requests will be reviewed within approximately four weeks and are subject to applicable institutional data use agreements. All data are stored securely on encrypted institutional servers within the University of Toronto Data Safe Haven environment. All aggregate numerical source data underlying the main and Supplementary Figs. are provided in Supplementary Data 1 (Excel), which is sufficient to reproduce the analyses and visualizations presented in this paper. Numerical data underlying Figure 7 (provider-level proportions) are not publicly shared due to potential re-identification risk under UTOPIAN Data Safe Haven REB policy.
Code availability
All models were implemented using the PyTorch framework (version 2.3.1+cu121)86, with transformer-based architectures developed using the HuggingFace Transformers library (version 4.37.1)87. Model development and analysis were conducted in Python 3.10.12 using NumPy 1.26.4, pandas 2.1.1, scikit-learn 1.4.dev0, Matplotlib 3.8.1, Seaborn 0.13.0, and NLTK 3.8.1. Training was performed on an NVIDIA Quadro RTX 6000 GPU using CUDA 12.2 (driver version 535.247.01). Hyperparameters and training configurations for all models are provided in the Methods section and summarized in Table 1. The code for the active learning pipeline used for data annotation is publicly available at https://github.com/seperahm/EMR_Race_Classification. The remaining modeling code, developed for model training and the fairness-aware loss implementation, is stored within the secure University of Toronto Data Safe Haven environment alongside the study data and cannot currently be exported for public release, as the environment has been archived under institutional privacy and security regulations. All transformer-based models used are standard, publicly available pre-trained architectures, and the hierarchical CNN, the primary methodological contribution of this work, is fully specified in the Methods section, including architectural details, optimized hyperparameters, and training procedures, enabling independent reimplementation.
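To aid such reimplementation, the sketch below shows one plausible shape of a two-level hierarchical CNN in PyTorch, assuming each document is tokenized into a (sentences × words) grid: word-level convolutions are pooled into sentence embeddings, and sentence-level convolutions are pooled into a document representation for classification. All layer sizes, kernel widths, and names are illustrative assumptions; the authoritative specification remains the one given in the Methods section.

```python
import torch
import torch.nn as nn

class HierarchicalCNN(nn.Module):
    """Illustrative two-level CNN over (document -> sentence -> word) text.

    Layer sizes and structure are assumptions for demonstration only and
    do not reproduce the exact architecture described in the Methods.
    """
    def __init__(self, vocab_size, embed_dim=128, n_filters=64,
                 kernel_size=3, n_classes=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Word-level convolution: produces one embedding per sentence.
        self.word_conv = nn.Conv1d(embed_dim, n_filters, kernel_size, padding=1)
        # Sentence-level convolution over the sequence of sentence embeddings.
        self.sent_conv = nn.Conv1d(n_filters, n_filters, kernel_size, padding=1)
        self.classifier = nn.Linear(n_filters, n_classes)

    def forward(self, x):
        # x: (batch, n_sentences, n_words) of token ids
        b, s, w = x.shape
        emb = self.embedding(x.view(b * s, w)).transpose(1, 2)    # (b*s, E, w)
        sent = torch.relu(self.word_conv(emb)).max(dim=2).values  # (b*s, F)
        sent = sent.view(b, s, -1).transpose(1, 2)                # (b, F, s)
        doc = torch.relu(self.sent_conv(sent)).max(dim=2).values  # (b, F)
        return self.classifier(doc)

# Dummy forward pass: 2 documents, 4 sentences, 10 words each.
model = HierarchicalCNN(vocab_size=10000)
logits = model(torch.randint(1, 10000, (2, 4, 10)))
```

Max-pooling at both levels is one simple way to obtain fixed-size sentence and document representations; attention or mean pooling would be equally plausible choices.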
Researchers seeking further methodological clarification or architecture-level guidance may contact the corresponding author for additional details or code review under appropriate data-sharing agreements.
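Similarly, while the fairness-aware loss code cannot be exported from the secure environment, losses of this kind typically follow a standard pattern. Below is a minimal, hypothetical PyTorch sketch of one such objective, a group-reweighted cross-entropy in which examples from under-represented groups carry larger weights; the function name, the `group_weights` tensor, and the weighting scheme are illustrative assumptions, not the exact loss used in this study.

```python
import torch
import torch.nn.functional as F

def group_weighted_cross_entropy(logits, labels, group_ids, group_weights):
    # Per-example cross-entropy, kept unreduced so each note can be reweighted.
    per_example = F.cross_entropy(logits, labels, reduction="none")
    # Look up each example's weight from its demographic group.
    weights = group_weights[group_ids]
    return (weights * per_example).mean()

# Dummy usage: 8 notes, 5 race categories, 3 demographic groups.
logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
group_ids = torch.randint(0, 3, (8,))
group_weights = torch.tensor([1.0, 2.0, 1.5])  # e.g. inverse-frequency weights
loss = group_weighted_cross_entropy(logits, labels, group_ids, group_weights)
```

Because such weights only rescale per-example gradients while leaving the architecture untouched, the effect of the intervention can differ across model types, consistent with the model-dependent behaviour reported in the Results.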
References
Ford, M. E. & Kelly, P. A. Conceptualizing and categorizing race and ethnicity in health services research. Health Serv. Res. 40, 1658–1675 (2005).
Prus, S. G. Comparing social determinants of self-rated health across the United States and Canada. Soc. Sci. Med. 73, 50–59 (2011).
Morris, S. M. et al. Predictive modeling for clinical features associated with neurofibromatosis type 1. Neurol. Clin. Pract. 11, e497–e505 (2021).
Brown, T. H., O’Rand, A. M. & Adkins, D. E. Race–ethnicity and health trajectories: Tests of three hypotheses across multiple groups and health outcomes. J. Health Soc. Behav. 53, 359–377 (2012).
Lubetkin, E. I., Jia, H., Franks, P. & Gold, M. R. Relationship among sociodemographic factors, clinical conditions, and health-related quality of life: Examining the EQ-5D in the US general population. Qual. Life Res. 14, 2187–2196 (2005).
Lingren, T. et al. Developing an algorithm to detect early childhood obesity in two tertiary pediatric medical centers. Appl. Clin. Inform. 7, 693–706 (2016).
Ahuja, Y. et al. Leveraging electronic health records data to predict multiple sclerosis disease activity. Ann. Clin. Transl. Neurol. 8, 800–810 (2021).
Franks, P., Gold, M. R. & Fiscella, K. Sociodemographics, self-rated health, and mortality in the US. Soc. Sci. Med. 56, 2505–2514 (2003).
Freeman, H. P. The meaning of race in science–considerations for cancer research: Concerns of special populations in the national cancer program. Cancer 82, 219–225 (1998).
Davidson, J., Vashisht, R. & Butte, A. J. From genes to geography, from cells to community, from biomolecules to behaviors: The importance of social determinants of health. Biomolecules 12, 1449 (2022).
Bucher, B. T. et al. Determination of marital status of patients from structured and unstructured electronic healthcare data. In AMIA Annu. Symp. Proc., vol. 2019, 267–274 (2019).
Han, S. et al. Classifying social determinants of health from unstructured electronic health records using deep learning-based natural language processing. J. Biomed. Inform. 127, 103984 (2022).
Sholle, E. T. et al. Underserved populations with missing race ethnicity data differ significantly from those with structured race/ethnicity documentation. J. Am. Med. Inform. Assoc. 26, 722–729 (2019).
Polubriaginof, F. C. et al. Challenges with quality of race and ethnicity data in observational databases. J. Am. Med. Inform. Assoc. 26, 730–736 (2019).
Proumen, R., Connolly, H., Debick, N. A. & Hopkins, R. Assessing the accuracy of electronic health record gender identity and REaL data at an academic medical center. BMC Health Serv. Res. 23, 884 (2023).
Qing, L., Linhong, W. & Xuehai, D. A novel neural network-based method for medical text classification. Future Internet 11, 255 (2019).
Nguyen, H. & Patrick, J. Text mining in clinical domain: Dealing with noise. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 549–558 (2016).
Abulibdeh, R. et al. Assessing the capture of sociodemographic information in electronic medical records to inform clinical decision making. PloS One 20, e0317599 (2025).
Senior, M. et al. Identifying predictors of suicide in severe mental illness: A feasibility study of a clinical prediction rule (oxford mental illness and suicide tool or OxMIS). Front. Psychiatry 11, 268 (2020).
Lybarger, K. et al. Leveraging natural language processing to augment structured social determinants of health data in the electronic health record. J. Am. Med. Inform. Assoc. 30, 1389–1397 (2023).
Patra, B. G. et al. Extracting social determinants of health from electronic health records using natural language processing: A systematic review. J. Am. Med. Inform. Assoc. 28, 2716–2727 (2021).
Bompelli, A. et al. Social and behavioral determinants of health in the era of artificial intelligence with electronic health records: A scoping review. Health Data Sci. 2021, 1–19 (2021).
Zhang, D., Thadajarassiri, J., Sen, C. & Rundensteiner, E. Time-aware transformer-based network for clinical notes series prediction. In Machine Learning for Healthcare Conference, 566–588 (PMLR, 2020).
Yang, Z. et al. Hierarchical attention networks for document classification. In 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1480–1489 (2016).
Abulibdeh, R., Tu, K. & Sejdić, E. Natural language processing methods for assessing social determinants of health in the electronic health records: A narrative review. Expert Syst. Appl. 127928 (2025).
Shi, J. et al. Accelerating clinical NLP at scale with a hybrid framework with reduced GPU demands: A case study in dementia identification. arXiv preprint arXiv:2504.12494 (2025).
Flaxman, A. D. & Vos, T. Machine learning in population health: Opportunities and threats. PLoS Med. 15, e1002702 (2018).
Weissler, E. H. et al. The role of machine learning in clinical research: Transforming the future of evidence generation. Trials 22, 1–15 (2021).
Habehh, H. & Gohel, S. Machine learning in healthcare. Curr. Genomics 22, 291 (2021).
Haider, S. A. et al. The algorithmic divide: A systematic review on AI-driven racial disparities in healthcare. J. Racial Ethnic Health Disparities 188–217 (2024).
Yu, Z. et al. Identifying social determinants of health from clinical narratives: A study of performance, documentation ratio, and potential bias. J. Biomed. Inform. 153, 104642 (2024).
Guevara, M. et al. Large language models to identify social determinants of health in electronic health records. NPJ Digital Med. 7, 6 (2024).
Gao, Y., Sharma, T. & Cui, Y. Addressing the challenge of biomedical data inequality: An artificial intelligence perspective. Annu. Rev. Biomed. Data Sci. 6, 153–171 (2023).
University of Toronto family medicine report. Tech. Rep., Department of Family and Community Medicine at the University of Toronto, Toronto, ON, Canada https://issuu.com/dfcm/docs/u_of_t_family_medicine_report (2019).
OntarioMD. Provincial EMR-integrated access https://www.ontariomd.ca/emr-certification/omd-certified-emrs-numbers/integrated-ehr-products (2025).
OntarioMD. From foundation to integration: Annual report 2016-2017. https://www.ontariomd.ca/documents/annual_report_2017.pd (2017).
Canadian Institute for Health Information. Guidance on the use of standards for race-based and Indigenous identity data collection and health reporting in Canada https://www.cihi.ca/en/race-based-and-indigenous-identity-data (2022).
Lybarger, K., Ostendorf, M. & Yetisgen, M. Annotating social determinants of health using active learning, and characterizing determinants using neural event extraction. J. Biomed. Inform. 113, 103631 (2021).
Figueroa, R. L., Zeng-Treitler, Q., Ngo, L. H., Goryachev, S. & Wiechmann, E. P. Active learning for clinical text classification: Is it better than random sampling? J. Am. Med. Inform. Assoc. 19, 809–816 (2012).
Chen, Y., Lasko, T. A., Mei, Q., Denny, J. C. & Xu, H. A study of active learning methods for named entity recognition in clinical text. J. Biomed. Inform. 58, 11–18 (2015).
Yang, Z., Dehmer, M., Yli-Harja, O. & Emmert-Streib, F. Combining deep learning with token selection for patient phenotyping from electronic health records. Sci. Rep. 10, 1432 (2020).
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A. & Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res. 18, 1–52 (2018).
Caton, S. & Haas, C. Fairness in machine learning: A survey. ACM Comput. Surv. 56, 1–38 (2024).
Han, J., Kamber, M. & Pei, J. Data Mining: Concepts and Techniques 3rd edn (Morgan Kaufmann, Boston, 2012).
Hardt, M., Price, E. & Srebro, N. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems 29 (2016).
Khalili, M. M., Zhang, X. & Abroshan, M. Loss balancing for fair supervised learning. In International Conference on Machine Learning, 16271–16290 (PMLR, 2023).
Lai, Y. & Guan, L. Flexible fairness-aware learning via inverse conditional permutation. arXiv preprint arXiv:2404.05678 (2024).
Liu, M. et al. FAIM: Fairness-aware interpretable modeling for trustworthy machine learning in healthcare. Patterns 5, 101059 (2024).
Lee, G. & Sayer, S. Exploring equality: An investigation into custom loss functions for fairness definitions. arXiv preprint arXiv:2501.01889 (2025).
Stemerman, R. et al. Identification of social determinants of health using multi-label classification of electronic health record clinical notes. J. Am. Med. Inform. Assoc. Open 4, ooaa069 (2021).
Grandini, M., Bagli, E. & Visani, G. Metrics for multi-class classification: An overview. arXiv preprint arXiv:2008.05756 (2020).
Shapiro, S. S. & Wilk, M. B. An analysis of variance test for normality. Biometrika 52, 591–611 (1965).
Scheffé, H. The Analysis of Variance, vol. 72 (John Wiley & Sons, 1999).
Tukey, J. W. Comparing individual means in the analysis of variance. Biometrics 5, 99–114 (1949).
Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 32, 675–701 (1937).
Nemenyi, P. B. Distribution-Free Multiple Comparisons. PhD thesis, Princeton University (1963).
Wilcoxon, F. Individual comparisons by ranking methods (Springer, New York, NY, USA, 1992).
Mann, H. B. & Whitney, D. R. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18, 50–60 (1947).
Kruskal, W. H. & Wallis, W. A. Use of ranks in one-criterion variance analysis. J. Am. Stat. Assoc. 47, 583–621 (1952).
Pearson, K. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Lond., Edinb., Dublin Philos. Mag. J. Sci. 50, 157–175 (1900).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–4186 (2019).
Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
He, P., Liu, X., Gao, J. & Chen, W. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations https://openreview.net/forum?id=XPZIaotutsD (2021).
Alsentzer, E. et al. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop (2019).
Gichoya, J. W. et al. AI recognition of patient race in medical imaging: A modelling study. Lancet Digital Health 4, e406–e414 (2022).
Sun, M., Oliwa, T., Peek, M. E. & Tung, E. L. Negative patient descriptors: Documenting racial bias in the electronic health record. Health Aff. 41, 203–211 (2022).
Wen, D. et al. Characteristics of publicly available skin cancer image datasets: A systematic review. Lancet Digital Health 4, e64–e74 (2022).
Adam, H. et al. Write it like you see it: Detectable differences in clinical notes by race lead to differential model recommendations. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, 7–21 (2022).
Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623 (2021).
Webster, K. et al. Measuring and reducing gendered correlations in pre-trained models. arXiv preprint arXiv:2010.06032 (2020).
Kaneko, M. & Bollegala, D. Unmasking the mask–evaluating social biases in masked language models. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, 11954–11962 (2022).
Gallifant, J. et al. Peer review of GPT-4 technical report and systems card. PLOS Digital Health 3, e0000417 (2024).
Omiye, J. A., Lester, J. C., Spichak, S., Rotemberg, V. & Daneshjou, R. Large language models propagate race-based medicine. NPJ Digital Med. 6, 195 (2023).
Zack, T. et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: A model evaluation study. Lancet Digital Health 6, e12–e22 (2024).
Labban, M. et al. Disparities in travel-related barriers to accessing health care from the 2017 National Household Travel Survey. JAMA Netw. Open 6, e2325291 (2023).
Yang, J., Soltan, A. A., Eyre, D. W., Yang, Y. & Clifton, D. A. An adversarial training framework for mitigating algorithmic biases in clinical machine learning. NPJ Digital Med. 6, 55 (2023).
Tsai, T. C. et al. Algorithmic fairness in pandemic forecasting: Lessons from COVID-19. NPJ Digital Med. 5, 59 (2022).
Dunkelau, J. & Leuschel, M. Fairness-aware machine learning: An extensive overview. Preprint, 1–60 (2019).
van de Sande, D., van Bommel, J., Fung Fen Chung, E., Gommers, D. & van Genderen, M. E. Algorithmic fairness audits in intensive care medicine: Artificial intelligence for all? Crit. Care 26, 315 (2022).
Liu, X. et al. The medical algorithmic audit. Lancet Digital Health 4, e384–e397 (2022).
Hassija, V. et al. Interpreting black-box models: A review on explainable artificial intelligence. Cogn. Comput. 16, 45–74 (2024).
Nizam, T. & Zafar, S. Explainable artificial intelligence (XAI): Conception, visualization and assessment approaches towards amenable XAI. In Explainable Edge AI: A Futuristic Computing Perspective, 35–51 (Springer, 2022).
Ghai, B. & Mueller, K. D-bias: A causality-based human-in-the-loop system for tackling algorithmic bias. IEEE Trans. Vis. Comput. Graph. 29, 473–482 (2022).
Albert, S. M. et al. Do patients want clinicians to ask about social needs and include this information in their medical record? BMC Health Serv. Res. 22, 1275 (2022).
Yelton, B. et al. Assessment and documentation of social determinants of health among health care providers: Qualitative study. J. Med. Internet Res. Formative Res. 7, e47461 (2023).
Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019).
Wolf, T. et al. HuggingFace’s Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45 (2020).
Acknowledgements
This work was supported by the Canadian Institutes of Health Research [grant number 173094]. Dr. K. Tu holds a Chair in Family and Community Medicine Research in Primary Care at UHN and a Research Scholar Award from the Department of Family and Community Medicine, Temerty Faculty of Medicine, University of Toronto. Dr. L. Celi is funded by the National Institutes of Health through DS-I Africa U54 TW012043-01 and Bridge2AI OT2OD032701, and by the National Science Foundation through ITEST #2148451.
Author information
Authors and Affiliations
Contributions
K.T. and E.S. conceived the study. R.A. designed and conducted the study, developed and implemented the models, collected and processed the data, performed model and bias analyses, and drafted the manuscript. K.T. and E.S. supervised the study, provided resources, assisted in manuscript editing and review, and contributed to project administration. K.T. additionally curated data and secured funding. Y.L. contributed to the conceptualization and development of the hierarchical CNN model and the active learning model, and provided input on the methodology and interpretation of results. S.A. developed the active learning model, performed its analysis, generated results, and assisted in drafting portions of the manuscript. L.A.C. contributed to the interpretation of findings, assisted in drafting the discussion and future directions, and provided critical feedback on the manuscript. Q.Z. provided support for data analysis and interpretation of results. All authors (R.A., Y.L., S.A., K.T., L.A.C., Q.Z., and E.S.) reviewed and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Medicine thanks Brandon Theodorou and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Abulibdeh, R., Lin, Y., Ahmadi, S. et al. Integration of fairness-awareness into clinical language processing models. Commun Med (2026). https://doi.org/10.1038/s43856-026-01433-9
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s43856-026-01433-9