Introduction

In the United States, social factors, including educational level, racism, and poverty, contribute to over a third of total deaths in a year1. According to a study that estimated the number of deaths attributable to social factors in the United States in the year 2000, 245,000 deaths were attributed to low education, 176,000 to experiences of racism, 162,000 to low social support, 133,000 to individual-level poverty, 119,000 to income inequality, and 39,000 to area-level poverty1. Such mortality estimates are comparable to those of the leading disease-related causes of death. A growing body of research has reported on the negative impact of social determinants of health (SDoH) factors (socioeconomic and environmental factors that describe the conditions in which people are born, live, work, and age) in creating undesirable circumstances, such as health disparities and discrimination2,3,4. For example, the likelihood of premature death increases as income goes down2, children born to parents who have not completed high school are more likely to live in environments that contain barriers to health3, and low education levels are directly correlated with low income, higher likelihood of smoking, and shorter life expectancy2. We note that in the context of this study the term SDoH is used in a broad sense, covering not only socio-economic factors but also behavioral, environmental, and other closely linked factors. This aligns with the comprehensive definition of SDoH outlined by Healthy People 20305 and is supported by a systematic review of prior literature on SDoH6.

The mounting body of evidence demonstrating the impact of social determinants on health has resulted in an increasing acknowledgment within the health care sector that achieving improved health and health equity is likely to be contingent, at least in part, on addressing unfavorable social determinants7,8,9,10. Over the last decade, electronic health record (EHR) systems have been increasingly implemented at US hospitals and have generated large amounts of longitudinal patient data. These massive EHR databases are becoming enabling resources for diverse types of clinical, genomic, and translational research, with successful demonstrations by large initiatives such as the Electronic Medical Records and Genomics (eMERGE) Network11, the Patient-Centered Outcomes Research Institute (PCORI)12, and the Observational Health Data Sciences and Informatics (OHDSI) consortium13. However, most current studies do not use the full spectrum of EHR data, and more efforts are being devoted to including unstructured textual data in EHRs for real-world evidence generation14.

In clinical practice, SDoH information has not been systematically collected and analyzed in EHRs15,16,17. While some SDoH factors (e.g., race, sex) are present as structured data in EHRs, the lack of specific EHR fields to record SDoH information and the lack of standards for collecting SDoH data are major reasons for insufficient SDoH documentation. On the other hand, clinical narratives contain more detailed characterization of several SDoH factors (e.g., living status, financial instability, social support) beyond the structured representation. Unstructured text often describes in detail how multiple socioeconomic, behavioral, and situational variables interact with one another. To bridge this gap, natural language processing (NLP) techniques have emerged to automatically identify SDoH information from unstructured notes.

The current landscape of NLP approaches for identifying SDoH factors has been extensively reviewed6,18. While multiple NLP methods are available, there are four primary challenges for SDoH identification. First, many studies, even recent ones19,20,21,22,23,24,25, have focused on a limited set of SDoH factors (up to 8 factors). A comprehensive review6 revealed that among the 82 papers examined, only three SDoH factors were commonly addressed: smoking status (27 papers), substance use (21), and homelessness (20). Other critical factors such as education, insurance status, and broader social issues are still at a developmental stage. The n2c2/UW SDOH Challenge26 on SDoH extraction focused on the status, extent, and temporality of the factors, but was limited to five factors (the commonly studied alcohol, drug, and tobacco use, in addition to employment and living situation) using the Social History Annotated Corpus (SHAC)27. Second, a considerable number (22 out of 82) of these approaches rely on rule-based methodologies, which are highly dependent on specific lexicons. The lack of a standardized documentation framework for SDoH means these lexicons might not be easily transferable or adaptable across different healthcare systems, significantly limiting the methods’ effectiveness. Recent work in this direction is the Gravity Project28, a collaborative effort aimed at developing consensus-based data standards for SDoH information. Some prior studies have explored the use of deep neural network architectures such as CNN29, LSTM30, and BERT21,23,31 to automatically classify SDoH categories32, or named entity recognition (NER)-based approaches, utilizing tools like cTAKES33 and CNN- and BERT-based models34, to extract SDoH factors from clinical text. Third, most methods were developed and evaluated within specific healthcare systems, such as a Veterans Health Administration center35, a medical center36, and a trauma center37. The lack of broader testing raises concerns about the generalizability and adaptability of these NLP methods to diverse healthcare settings. For example, Stemerman et al.38 designed a multi-label classifier to identify six SDoH categories within sentences extracted from clinical notes sourced from the University of North Carolina’s clinical data warehouse. Finally, the decision to formulate the SDoH extraction task as classification15,32,38 vs. NER27,39,40 requires careful consideration.

Existing research has also explored integrating patient SDoH data with geospatial datasets such as census demographics and the social vulnerability index (SVI) to estimate SDoH factors at a community or neighborhood level41,42,43,44. One such study analyzed the extent to which self-reported SDoH needs assessments collected by a health system are associated with census tract-level social vulnerability measured using the SVI42. Other work has explored mapping SDoH documentation and area-level deprivation to assess individual-level social risks44. While leveraging linkage between patient addresses and geo-referenced datasets offers a valuable community-level perspective, our work aims to directly identify patient-specific SDoH information from unstructured text using NLP techniques.

Recent studies have explored the use of large language models (LLMs) for the identification of SDoH factors19,24,45,46. Guevara et al. conducted a study using encoder-decoder language models, specifically fine-tuned Flan-T5 XL and Flan-T5 XXL, to classify mentions of SDoH categories from narrative text in EHRs45. The study aimed to identify any mention of these SDoH factors as well as specifically adverse mentions, and it also investigated the potential of data augmentation by incorporating LLM-generated synthetic SDoH data into the training process to further enhance model performance. However, the study was limited in scope to identifying the presence or absence of only six SDoH factors, without exploring temporality or nuanced assertions beyond a broad adverse categorization.

The inclusion of SDoH in clinical notes can vary depending on various factors, including the healthcare system, individual clinician practices, and the specific context of patient encounters47,48. These factors strongly influence the heterogeneity in the distribution of SDoH factors in clinical notes. Datasets and NLP models developed under narrow conditions (a specific note type, a single institution, a particular hospital setting, or individual documentation practices) might not always translate to real-world scenarios, especially when the aim is to develop models that generalize well across different settings. Hence, the primary goal of this study is to investigate the feasibility of developing high-performing and generalizable NLP approaches for extracting comprehensive SDoH information from clinical notes. We train and evaluate several NLP models, including deep-learning architectures and LLMs, assessing their capabilities both within their training distribution (in-domain) and on datasets from other sites (out-of-domain). By analyzing patient data from diverse clinical settings across multiple healthcare institutions, we also shed light on the real-world distribution of SDoH factors.

In summary, this paper presents the following key contributions: (1) a study identifying the most comprehensive list of SDoH factors from clinical notes; (2) curation of the largest multi-institution annotated corpora covering assertions and temporality of SDoH factors; and (3) development and evaluation of both traditional models and LLMs, highlighting the impact of varying label distributions of SDoH factors across datasets on cross-domain transfer learning. To enhance accessibility and encourage further research, we make our models, trained and fine-tuned on this multi-institutional SDoH dataset, publicly available to the broader scientific community.

Results

Annotation and datasets

For the UTHealth Harris County Psychiatric Center (HCPC) dataset, after eliminating duplicate notes/sentences, we ended up with a total of 4953 sentences (including 344 sentences with no SDoH). The distribution of SDoH factors in the level 1 annotated corpus is shown in Fig. 1 (top left). This dataset has a high prevalence of several factors, such as adverse childhood experiences, education level, isolation, geographic location where the person was born/raised, physical/sexual abuse, and social support, compared to the other datasets, in which these factors were not documented or were documented less frequently. A total of 18 SDoH factors were annotated in this dataset.

Fig. 1: Distribution of SDoH factors.

Number of sentences documenting each SDoH factor for all four corpora annotated at level 1 (identifying SDoH factors only).

For the UT Physicians (UTP) dataset, a total of 1691 sentences were annotated from 1000 chart notes. Several of these notes contained duplicate information because of copy-pasting across multiple visits by the same patient, which was eliminated. For this dataset, comprising ADRD patients in an outpatient setting, the chart notes documented substance use information in detail under “social history” sections. Sex and race were almost always mentioned at the beginning (e.g., “The patient is an 80 yr old Caucasian F …”), followed by some information regarding the patient’s employment, living, and marital status. A total of 14 SDoH factors were annotated, showing a high prevalence of sex, race, and substance abuse.

The notes with note type “social work” extracted from the MIMIC-III database49 were written by social workers, largely documenting their interactions with patients’ families. This explains the high frequency of marital status, as well as the living status and social support information that could often be inferred. Other frequently reported factors included alcohol and drug use, sex, and employment. A total of 2838 sentences were annotated from 500 notes, with 20 SDoH factors present.

The Mayo dataset was the smallest, with 964 sentences annotated. The dataset contained all clinical notes of a chronic pain cohort, with much of the data structured as templates to be filled in, resulting in high volumes of duplicate information. In addition, a major part of the content documented diseases and medications, explaining the high frequency of the “non SDoH” category. A total of 18 SDoH factors were present, with 10 factors having fewer than 10 samples each.

The distributions of SDoH factors with level 2 annotations for all datasets are shown in Supplementary Figs. 1–4.

Model performance

In this section, we present the results of our experiments with different models for all annotated corpora. Table 1 presents the results of fivefold cross-validation on both level 1 and level 2 annotated corpora. Micro-averaged precision, recall, and F1 scores are reported by averaging across all five folds. We observe that the instruction-tuned LLaMA model performed better than all other models for all datasets except the MIMIC-III dataset (level 1), for which the ClinicalBERT50 model was better. The ClinicalBERT model had the second-best performance on the level 1 annotated corpora. For level 1, the UTP dataset has only 15 classes (Fig. 1, top right) and comparatively better class balance. This is reflected in the model performance, with fivefold cross-validation consistently yielding an F1 score above 0.95 for all models. On the other hand, the Mayo corpus (Fig. 1, bottom right) is highly imbalanced, is the smallest dataset, and has a total of 19 classes. It is notable that the instruction-tuned LLaMA model still achieved an F1 score of 0.94, closely followed by ClinicalBERT at 0.919.

Table 1 Micro-averaged precision, recall, and F1 metrics (average of 5 folds) of models across all datasets

If the number of classes increases while the number of datapoints remains the same, model performance decreases, as is evident from the results for the level 2 annotated corpora (Table 1, bottom half). However, the decrease in performance for the LLaMA model (ranging from 2.7% to 4.6%) is smaller than for the other models (4.6% to 11.3% for XGBoost, 10.2% to 23.9% for TextCNN, 4.5% to 12% for SBERT, and 3% to 13.3% for ClinicalBERT). For UTP, all models except TextCNN have an F1 score above 0.91. The largest difference between ClinicalBERT and LLaMA (13.2%) was observed for the Mayo dataset (level 2), due to the high class imbalance (Supplementary Fig. 4) and most classes not having enough samples for learning.

The macro-averaged metrics (Table 2) are a better reflection of model performance on such class-imbalanced datasets. We notice a widened performance gap between LLaMA and the other models, with the LLM still performing better than the other models, achieving F1 scores greater than 0.45 across the datasets and levels of annotation. This also demonstrates the ability of the LLaMA models to learn from very few examples for many classes.

Table 2 Macro-averaged precision, recall, and F1 metrics (average of 5 folds) of models across all datasets

Our extensive prior work comparing zero-/few-shot and fine-tuned LLaMA models has demonstrated the superior performance of fine-tuned models51,52,53,54. We also evaluated LLaMA 2’s in-context learning capabilities (7B-base and 7B-chat variants) using up to ten prompt samples to demonstrate the utility of and need for fine-tuning for this task. The 7B-base model produced inconsistent outputs, while the 7B-chat model, though slightly improved, generated overly verbose responses with numerous false positives (outputting the entire list of SDoH factors in the majority of cases), requiring extensive post-processing. As shown in Supplementary Table 12, even with optimized prompts and the larger 13B-chat model, in-context learning yielded poor results. The top performance on level 1 annotations was observed on the HCPC dataset (micro-averaged F1 of 0.518) with the 13B-chat model. The performance dropped further on level 2 annotations and macro-averaged evaluations, likely due to the complexity of this task and the increase in the number of classes. The prompts used are included in the Supplementary Materials (Supplementary Notes 2 and 3).

Generalizability experiments

Figure 2 shows the performance (micro-averaged) of all models when trained on one corpus and evaluated on all other corpora (level 1 annotated corpora). Overall, the LLaMA model generalized across datasets better than the other models. When fine-tuned on HCPC, LLaMA achieved an F1 score of 0.91 on both the UTP and Mayo datasets and 0.82 on MIMIC-III, the overall best cross-dataset performance. Other notable results include training on MIMIC-III and testing on UTP (F1 = 0.94). ClinicalBERT performed better when trained on UTP and tested on the MIMIC-III and Mayo datasets, and when trained on MIMIC-III and tested on the Mayo dataset. These results can be attributed to the high overlap of SDoH categories between these datasets, with minimal impact from non-overlapping categories on micro-averaged metrics due to their limited sample sizes.

Fig. 2: Heatmap of cross-dataset performance evaluation (level 1).

The diagonal shows micro-averaged F1 scores when models are trained and tested on the same dataset for level 1 annotations. Other cells show F1 scores when trained on one dataset and tested on another.

With the increase in the number of classes in the level 2 annotated corpora, the LLaMA model performed better than all other models (Fig. 3) in all generalizability evaluations. The lowest performance for any dataset pair was an F1 score of 0.40 for LLaMA 2, compared with 0.08 for ClinicalBERT, 0.06 for SBERT, 0.04 for TextCNN, and 0.09 for XGBoost, clearly demonstrating the superior ability of instruction-tuned LLaMA models to generalize to unseen data. Micro-averaged precision and recall scores, along with F1, for both levels of annotation can be found in Supplementary Tables 1 and 2.

Fig. 3: Heatmap of cross-dataset performance evaluation (level 2).

The diagonal shows micro-averaged F1 scores when models are trained and tested on the same dataset for level 2 annotations. Other cells show F1 scores when trained on one dataset and tested on another.

LLaMA models outperformed all other models in cross-dataset generalizability, with an even wider performance gap when evaluated on macro-averaged precision, recall, and F1 (Supplementary Table 3). This advantage is likely due to LLaMA’s ability to learn effectively from fewer samples, a key factor given the datasets’ high imbalance. Per-factor performance evaluations (Supplementary Tables 4–7), with training data instance counts highlighted (* for <15 samples, ** for >50 samples), further illustrate LLaMA’s few-shot abilities and their impact on the macro-averaged evaluation. These tables include instances where LLaMA achieved a perfect F1 score of 1.0, as well as scenarios where all models, including LLaMA, performed poorly or where LLaMA’s F1 score was lower than that of other models. A single instance of zero-shot performance by LLaMA was also observed for financial issues (Supplementary Table 7). Overall, LLaMA performs particularly well on low-prevalence factors; however, other models can achieve similar or even better performance in some cases when sufficient training data are available.

To further understand the generalizability of models when sufficient training data are available, we performed another set of experiments involving five SDoH factors (Alcohol, Drug, Employment Status, Marital Status, Living Status) that are present in all four datasets and have at least 25 examples in the level 1 training data. Supplementary Tables 8 to 11 show the performance of all models on these five factors when trained on one data source and tested across the three other data sources. Out of 60 experiments (5 factors and 4 training data sources, each tested across three other data sources), LLaMA achieved the top performance in 41, ClinicalBERT in 13, XGBoost in 5, and TextCNN and SBERT in 1 each. These results indicate that models such as ClinicalBERT and XGBoost can still outperform LLaMA when sufficient training data are available. We discuss in detail in the Discussion section the performance of different models in different scenarios and its implications for cross-dataset generalizability.

The final set of evaluations involved training models on a combined dataset (combining the train splits of all four datasets) and evaluating the performance of the resulting model on the individual datasets. As shown in Table 3, ClinicalBERT achieved on-par or sometimes better performance than LLaMA on level 1 micro-averaged metrics on the combined data. However, LLaMA outperformed ClinicalBERT on three out of four datasets in the level 2 evaluation. When macro-averaged metrics were used, ClinicalBERT lagged considerably behind LLaMA, as shown in Table 4. The difference ranged from 2.6% to 12.6% for level 1 and from 2% to 13.4% for level 2 annotations.

Table 3 Performance (P/R/F1) of all models (micro-averaged) when trained on the combined dataset for both levels of annotation
Table 4 Performance (P/R/F1) of all models (macro-averaged) when trained on the combined dataset for both levels of annotation

Error analysis

A thorough examination of the performance of the instruction-tuned LLaMA model revealed several key areas for potential improvement. Analyzing the predictions, we observed instances where the model failed to make any prediction, generated predictions outside the expected class labels (e.g., predicting “6th grade” instead of “Education level”), and struggled to differentiate between SDoH pertaining to the patient and those referring to family members. The primary reasons for the notably low performance on the Mayo dataset when evaluated with the model trained on the combined dataset (Table 3) were instances of no response and the generation of irrelevant outputs such as numerical values (e.g., dates). Additionally, an interesting observation emerged regarding the level of annotation: for sentences describing the patient as a non-smoker, the LLaMA model failed to generate “smoking” as a label when using the level 1 annotated corpus; however, for the same sentences in the level 2 annotated corpus, the model consistently generated “nonsmoker” as a response. These issues highlight the need for improved prompt design with more specific instructions to regulate such outputs. Incorporating detailed instructions from annotation guidelines, as observed in our prior studies on LLMs53, could help mitigate these errors and enhance the model’s overall performance.

Discussion

Our work builds upon and extends a growing body of research applying NLP techniques to automatically identify individual-level SDoH information from unstructured clinical notes. While existing studies typically focus on highly documented SDoH factors or combine less frequent factors into a single category, we undertook a comprehensive, multilevel annotation of all SDoH factors with due attention to temporality and assertion, in addition to fine-grained subcategorization of each factor. Because we do not constrain the SDoH factors in this way, our model performance better reflects real-world operating conditions. We curated and constructed four annotated corpora of SDoH using data from four distinct healthcare systems, ensuring a diverse representation of SDoH factors. Leveraging these corpora, we conducted extensive experiments employing four text classification models and a generative model to detect SDoH factors. We not only evaluated the performance of these models but also highlighted the impact of the diverse label distribution of SDoH factors across datasets on the effectiveness of cross-domain transfer learning. We hope that our findings provide valuable insights into the complex landscape of SDoH factors, paving the way for enhanced understanding and future advancements in healthcare interventions and policies by facilitating and encouraging better documentation practices.

A primary motivation behind our work is to provide the research community with models pre-trained on these SDoH datasets, facilitating their customization and adaptation to local institutional contexts. We recognize that many academic and healthcare organizations, including our own, face significant resource constraints that prohibit the utilization of LLMs beyond a certain scale. Consequently, our study prioritized the evaluation of more modestly sized LLMs that still demonstrate promising performance capabilities. By openly disseminating checkpoints of models fine-tuned on our aggregated multi-institutional SDoH dataset, we aim to lower the barrier for other research groups to build upon our work. These models can serve as effective initialization points for transfer learning, requiring further fine-tuning on relatively small amounts of locally curated data to improve performance.

Our results demonstrate the ability of the instruction-tuned LLaMA model to outperform all other models, especially as the number of factors increases and when performance is evaluated per factor in light of the high class imbalance. The base LLaMA model being a general-domain LLM and the task of SDoH identification being a largely non-medical one might have made these models more suitable for this task than for other biomedical NLP tasks that require considerable biomedical knowledge. The ClinicalBERT model performed on par, and sometimes better, on micro-averaged evaluation and when enough data were available for the different factors. The performance of the XGBoost model was comparable to that of the SBERT model, and it is interesting to observe that the architectural differences did not significantly influence the outcomes when sufficient and diverse data were available, at least within the scope of this study.

Comparing ClinicalBERT and SBERT, ClinicalBERT demonstrated better performance, possibly due to its pre-training on MIMIC notes. The performance of the TextCNN models was much lower on two datasets, MIMIC-III and Mayo. The MIMIC-III dataset has high lexical and semantic variability; its sentences are diverse, do not follow any patterns or templates, and are mostly written in a style that reproduces conversations with patients and their families, which might have affected model performance. Mayo, in turn, is the smallest dataset and has high class imbalance, with more than half of the samples belonging to the ‘non SDoH’ category. Even though the ClinicalBERT, SBERT, and XGBoost models performed well when trained and tested on the same datasets, this performance did not carry over when tested on other datasets, especially on the level 2 corpora and when evaluating per-factor performance. This low performance could be attributed to differences in (1) the SDoH frequency distribution and (2) the patterns in which SDoH factors are described, which vary significantly depending on the hospital setting, the preferences of individual practitioners, the templates in place, and the medical specialties. To address this issue, data need to be collected from diverse sources and combined to create datasets that capture the variations that contribute to this heterogeneity.

We want to distinguish two distinct aspects of model generalizability: (1) the zero-/few-shot capabilities of models on new, unseen tasks (i.e., extracting new SDoH factors not seen during training) and (2) the performance on new datasets for the same task for which the model was trained/fine-tuned (i.e., extracting shared SDoH factors across institutions). While LLMs excel in zero- and few-shot settings on unseen tasks, making them suitable for novel SDoH extraction, other models also show reasonable performance when evaluating generalizability on shared SDoH factors across datasets with sufficient training data. Our experiments focusing on five common SDoH factors (Alcohol, Drug, Employment Status, Marital Status, and Living Status), present in all four datasets with at least 25 training examples each (Supplementary Tables 8–11), demonstrate this point. In these cross-dataset evaluations, LLaMA achieved the top performance in 41 out of 60 experiments, demonstrating strong generalizability. However, ClinicalBERT and XGBoost also showed competitive performance, achieving the top performance in 13 and 5 experiments, respectively. For example, ClinicalBERT achieved a superior F1 score of 0.90 compared to LLaMA’s 0.71 on ‘living status’ when trained on the Mayo dataset and tested on the UTP dataset. This suggests there are other factors that affect model generalizability. With sufficient training data, simpler models may achieve comparable or sometimes better generalizability for well-defined tasks. Furthermore, our exploration of LLaMA 2’s in-context learning capabilities (Supplementary Table 12) revealed limitations in its zero-/few-shot performance for this task, with the 13B-chat model achieving a best micro-averaged F1 of only 0.518 on the HCPC dataset (the highest among all datasets) for level 1 annotations, and performance declining further on level 2 annotations and macro-averaged evaluations. This illustrates the need for fine-tuning to achieve robust performance for SDoH extraction using LLMs.

While our cross-dataset evaluation sheds some light on models’ ability to handle distribution shifts between sites, there are inherent challenges to developing universally generalizable SDoH extraction models. Many social and environmental risk factors are intricately tied to the unique circumstances, lived experiences, and cultural contexts of different communities. The salience and manifestations of specific SDoH can vary drastically between patient populations. For example, documentation of housing instability or food insecurity may present very differently in an urban setting compared to rural or affluent suburban communities. Such population-level heterogeneity makes it difficult to capture all potential linguistic variations in a single model. As such, while our work demonstrates some generalizability across datasets from different health systems, the highest performances are still achieved when models are tested in-domain. In cross-dataset evaluation, when the test set contains a class absent from the training dataset, the model is effectively operating in a zero-shot setting for that class, and the resulting performance drop is anticipated due to label shift. Conversely, when all classes are covered in training, the generalizability assessment extends to evaluating the model’s capacity to manage covariate shift, which emerges when the training and testing data distributions differ. Even though these factors contributed significantly to the drop in performance, we still included rare SDoH factors in our evaluation, as this allows us to assess the pipeline’s performance in real-world scenarios, even when it rests solely on zero-shot capabilities.

Developing and deploying LLMs can be challenging due to the significant computational resources required for training, fine-tuning, and inference. High energy consumption and a large carbon footprint are additional environmental considerations. With parameter counts in the billions, LLMs may also pose storage challenges in some environments. Specifically, fine-tuning the 7B LLaMA model on about 7300 instruction demonstrations (combined training on all corpora) required approximately 10 min on four NVIDIA A100 80GB GPUs. We performed full model fine-tuning, though techniques like parameter-efficient fine-tuning and quantization could reduce the GPU and memory requirements considerably. Inference with the fine-tuned LLaMA model was performed on a single A100 GPU. The TextCNN, SBERT, and ClinicalBERT models had more modest training overheads, requiring only a single GPU. The XGBoost model, being a conventional tree-based method, could be trained efficiently on CPUs and exhibited the fastest runtimes overall.

We note that differing criteria were used to construct the datasets from each site. While ideally all datasets would undergo identical preprocessing, some variation was required across sites to construct datasets with sufficient densities of SDoH documentation. However, we cannot rule out the possibility that these dataset-specific preprocessing steps may have inadvertently introduced biases that hindered cross-dataset generalization and contributed to performance differences between models. The HCPC, UTP, and MIMIC datasets employed random sampling and keyword filtering, but the Mayo dataset originated from a study prescreening for chronic pain patients. This condition-specific cohort likely does not reflect the diverse SDoH profiles of Mayo’s general clinical population, which, along with its small sample size, may have affected both the SDoH distribution and model performance on this dataset.

Careful consideration must be given to mitigating algorithmic bias when developing and deploying language models for extracting SDoH information from unstructured clinical notes across different healthcare settings. Models trained on data reflecting societal biases and disparities may perpetuate or amplify these inequities when deployed in clinical practice. Hence, evaluating bias in LLMs using patients’ clinical data is crucial to ensure that the models do not perpetuate or amplify existing biases present in healthcare datasets55,56,57. Differences in race, ethnicity, sex, age distributions, and other traits across sites could contribute to language biases, disproportionate SDoH mention rates, or framing disparities stemming from clinician implicit biases. While some information about the patient populations is provided in Table 5, further characteristics such as age, sex, and race could also explain differences in the SDoH extracted from each dataset and might introduce bias into model performance56. A study designed to characterize cohort compositions and documentation patterns, and to analyze them uniformly across datasets, is needed to understand which factors contribute to SDoH distribution variations and model performance. Since our study was not originally designed to comprehensively investigate or report on bias evaluations, a redesigned study specifically focused on bias evaluations and mitigation approaches would be required to address this limitation thoroughly. Additionally, the use of a previously published dataset and the limited availability of structured data for some of the other datasets prevented us from providing the further cohort characterization required for comprehensive bias evaluations.

Table 5 Description of the corpora used in this study

Our study was conducted using data from US-based medical centers. The specific SDoH factors and the linguistic patterns for expressing them may differ across other geographic and cultural contexts. To maximize the global impact and generalizability of using language models for SDoH extraction, it will be critical to create additional annotated datasets representing diverse populations worldwide58. Evaluation of model performance and potential biases must also be conducted across different languages, healthcare settings, and sociocultural environments. Expanding this work beyond the US context is therefore an important future direction to advance global health equity through SDoH documentation.

It is imperative to continually benchmark the latest advances in LLMs for challenging applications like extracting SDoH from unstructured clinical text. More powerful models like LLaMA 3.159 might perform better than the LLaMA 2 used in this study. However, our fine-tuning strategies and evaluations remain relevant independent of the specific model used. In future work, we will investigate commercial LLMs such as GPT-460 for SDoH extraction, once arrangements that allow sending clinical data to GPT models (e.g., via GPT models on Azure) have been approved by our institution.

If the ultimate goal is to develop a single model that can identify a variety of SDoH factors for a wide range of clinical notes originating from multiple institutions, federated learning61 could be a promising direction. In federated learning, each institution trains its own model on its privately owned dataset and updates it through several rounds of model merging. After each round, the merged model is broadcast to each institution before the next round of training begins. Other approaches, such as a combination of domain-incremental learning and class-incremental learning62, can also be applied to sequentially update a single model with new domains and new sets of classes. We will explore these in future work.

Guevara et al.45 explored the use of synthetic data for SDoH, utilizing a simple prompt that provided the model with several examples of each category. While showing promise, the generated synthetic data did not exhaustively capture all possible linguistic variations of SDoH concepts that could be encountered in real clinical text. We will systematically investigate different synthetic data generation strategies, including zero-/few-shot prompting, chain-of-thought prompting, and fine-tuning63,64,65,66,67. In addition to experimenting with advanced prompting strategies, we plan to propose a template-based conditional generation approach to increase the coverage of long-tailed factors, thereby mitigating the factor imbalance and increasing the diversity of generated samples.

In summary, extracting and classifying social determinants of health from EHRs and clinical notes has the potential to support the development of effective treatment plans, improve population health, and reduce health disparities. While which SDoH factors are documented, and where, varies widely across healthcare settings and medical specialties, the effects of these variations on model performance in real-world scenarios are rarely studied, especially when developing generalizable models is the primary goal. Hence, in this study we analyzed the performance of multiple models on SDoH extraction by designing and developing annotation guidelines for classifying SDoH factors and creating several annotated corpora using data from four healthcare systems. The datasets were curated to include different patient cohorts, note types, layers of care in hospital settings, and documentation practices, thus showcasing the heterogeneity in the distribution of SDoH factors. Four classification models and a large language model were evaluated for detecting SDoH factors, with the LLM performing best. The generalizability of the models across institutions was also evaluated. While all models perform relatively well when trained and tested on a single dataset, performance varied and dropped in cross-dataset evaluation, indicating the need for further research in this domain. To encourage research in this direction, we will make available models trained by combining all annotated datasets.

Methods

Prior research has often constrained the number of SDoH factors within datasets due to the manual annotation burden, or has merged numerous factors with only a few samples into an “Other SDoH” category to potentially improve model performance32. However, this oversimplified approach fails to mirror the intricacies of real-world scenarios. Hence, in this study we undertook a meticulous annotation process, annotating all SDoH factors present in the clinical notes and refraining from consolidation even in cases of limited sample sizes, to better reflect the landscape of SDoH factor distributions.

We collected data from four different hospitals and hospital settings (inpatient and outpatient), including the publicly available MIMIC-III49 database, to acquire rich and diverse documentation of SDoH factors. The datasets include psychosocial assessment notes, chart notes, social work notes, and all clinical notes from a published cohort study34. This variety in note types also accounts for the variety in documentation practices, as the notes are written by physicians, nurses, social workers, and others. While “social history” sections are rich in SDoH information (and many studies have used this section exclusively), SDoH factors can be scattered under different sections depending on the type of note, note templates, and individual note-taking styles of the providers. In our study, we conducted a preliminary analysis of notes from each dataset to identify SDoH information across sections. In the HCPC dataset, SDoH details were predominantly concentrated in the social history sections of psychosocial assessments, so only those sections were considered. Conversely, notes from UTP contained diverse SDoH information spread across different sections such as “History of present illness” and “General observation,” prompting a comprehensive review of the note contents to ensure inclusion of relevant data.

We decided to use multilabel classification instead of NER, as this approach offers flexibility in capturing the presence of various social determinants without requiring precise identification of entity boundaries. Annotators can focus on understanding the overall meaning and context of the sentence to determine which social determinants are relevant. This approach can potentially speed up the annotation process compared to the detailed entity-level annotation required in NER. Furthermore, certain factors, such as social support, isolation, or adverse childhood experiences, may not lend themselves well to strict entity identification, as these factors often need to be inferred from the context. By adopting a multilabel sentence classification approach, annotators have the flexibility to capture these nuanced factors that cannot be precisely pinpointed as individual entities.

There is a higher probability of finding some of the less studied SDoH factors, such as social support, adverse childhood experiences, and physical abuse, in the clinical notes of patients experiencing mental health issues, as these factors are highly relevant for mental health cohorts. These factors might be less prevalent in clinical notes written by physicians from other specialties47,48; SDoH documentation thus also depends on the patient population. Given this real-world scenario, it is necessary to incorporate clinical notes from multiple medical specialties to develop models that can extract a wide range of SDoH factors. Hence, we utilize our multi-institutional corpora to explore the variations in the distribution of SDoH factors and conduct experiments toward understanding the feasibility of developing generalizable models that can extract multiple SDoH factors from clinical notes. Apart from traditional machine learning models and deep learning models, we also investigate the use of an open-source LLM, LLaMA68, for SDoH extraction and conduct a thorough evaluation of model performance under different settings.

Selection of SDoH factors

We performed an extensive review of the literature to identify prior work on social determinants, including systematic and scoping reviews6,69 focused on SDoH. Several ontologies/terminologies that cover SDoH factors70,71 (e.g., the Ontology of Medically Related Social Entities (OMRSE)72 and the Semantic Mining of Activity, Social, and Health data (SMASH) system ontology73) were also analyzed to identify major concepts, in addition to standards such as SNOMED CT74 and LOINC75. Additional information was obtained from multiple surveys on SDoH, including the All of Us SDoH Survey76, the Million Veteran Program (MVP) Lifestyle Survey77, the 2020 AHIMA Social Determinants of Health (SDOH) Survey78, and the Gravity Project28. Following this process, domain experts were consulted to identify important SDoH factors for each domain, finally narrowing the list down to 21 SDoH factors, which were then utilized for annotating the clinical notes.

Annotation

After obtaining Institutional Review Board (IRB) approval for the study, the datasets were annotated. The Committee for Protection of Human Subjects at UTHealth (UTHealth CPHS) approved the study (HSC-SBMI-12-0754) for utilizing the clinical notes from UTHealth Harris County Psychiatric Center and UT Physicians. For utilization of clinical notes from Mayo Clinic, the study was approved by the Mayo Clinic and the Olmsted Medical Center (OMC) IRBs (Mayo Clinic: 18-006536 and Olmsted Medical Center: 038-OMC-18). The requirement for informed consent was waived by all the governing IRBs under exempt category 4, as the research involved only secondary use of clinical notes and did not involve any patient contact.

The annotation process consists of two levels. At the first level (SDoH factors only), a sentence is assigned one or more labels from the 21 SDoH factors. For example, the sentence “She is single, lives with her parents, works 3 days per week.” will be labeled with ‘marital status’, ‘living status’, and ‘employment status’. The second level of annotation (SDoH factors along with their corresponding values/attributes) provides more granular information about these factors with respect to their subtypes, presence or absence, and temporality. The sentence mentioned above at level 2 will be labeled with ‘marital status - single’, ‘living status - with family’, and ‘employment status - employed, current’. The level 2 SDoH label set contains fine-grained subcategories that are not necessarily mutually exclusive within a given level 1 category. For example, a single sentence could potentially be assigned multiple level 2 labels corresponding to ‘living status - with family - past’ and ‘living status - alone - current’ under the broader ‘living status’ level 1 category. The subcategory labels can also encode assertions of presence/absence as well as temporality markers (e.g., ‘education level - college - current’). During annotation, all applicable level 2 labels were assigned to a given sentence, including multiple subcategories from the same level 1 category. If a sentence does not correspond to any SDoH factor, it is labeled ‘non SDoH’. Detailed information about the 21 SDoH factors and their attributes, along with examples, can be found in the annotation guidelines (see Supplementary Tables 13 and 14 and Supplementary Note 1).
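To make the two annotation levels concrete, the illustrative snippet below shows how the example sentence above could be represented as a multi-label record at each level. The field names and the non-SDoH example sentence are hypothetical and chosen only to mirror the guideline wording; the actual corpora follow the annotation guidelines in the Supplementary Materials.

```python
# Illustrative records for the two annotation levels; field names and the
# second example sentence are hypothetical.
example = {
    "sentence": "She is single, lives with her parents, works 3 days per week.",
    "level_1_labels": ["marital status", "living status", "employment status"],
    "level_2_labels": [
        "marital status - single",
        "living status - with family",
        "employment status - employed, current",
    ],
}

# A sentence with no social determinant carries the single label "non SDoH".
no_sdoh_example = {
    "sentence": "Plan: continue current medications and follow up in 2 weeks.",
    "level_1_labels": ["non SDoH"],
    "level_2_labels": ["non SDoH"],
}
```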

The team that developed the annotation guidelines, performed the annotation of the dataset, and facilitated conflict resolution consisted of an MD specialized in psychiatry, an internal medicine physician, a postdoctoral fellow trained in clinical NLP, a PhD candidate specializing in the development of biomedical ontologies, a master’s student in public health with a prior master’s degree in biomedical informatics, and a research associate with a master’s degree in biomedical informatics and a bachelor’s degree in biomedical engineering. The postdoctoral fellow and the PhD candidate formulated the annotation guidelines under the expert guidance of the two physicians. The master’s student and the research associate performed the annotation of the dataset, and the entire team held regular discussions to resolve conflicts. The annotators went through multiple rounds of training, starting with a detailed discussion of the annotation guidelines and calculating the inter-annotator agreement (Cohen’s Kappa) after each training round. Depending on the dataset (discussed below), each round of annotator training utilized 10 clinical notes or 50 sentences. Once a Kappa value greater than 0.7 was achieved, the remaining notes/sentences were annotated individually by the annotators, and any discrepancies were resolved as described above. The inter-annotator agreement after the final round of training was in the range of 0.73–0.90 across the datasets (similar to the agreement reported in other studies32).

Datasets

The prevalence of duplicate information in EHRs due to copy-pasting, templating, and summarizing is a major barrier to finding relevant information in EHRs79,80. A recent study reported that duplication increased from 33% in 2015 to 54.2% for notes written in 202079. The overrepresentation of certain SDoH factors resulting from this duplication impairs the training and evaluation of machine learning and deep learning models81,82. Hence, we removed duplicate sentences across the multiple notes of each patient from all datasets. This enabled us to reduce the class imbalance among the SDoH factors to some extent. Nevertheless, the distribution of factors still varies, and class imbalance remains. We also observed several clinical notes documenting variations of the text “Nothing new to report”, which were also eliminated.
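As a rough illustration of this de-duplication step, the minimal sketch below drops exact repeats of a sentence across all notes of the same patient. The record layout and the text normalization (lowercasing, whitespace collapsing) are assumptions for illustration, not the exact preprocessing used in the study.

```python
from collections import defaultdict

def deduplicate_sentences(records):
    """Keep the first occurrence of each (patient, normalized sentence) pair.

    `records` is assumed to be an iterable of dicts with "patient_id" and
    "sentence" keys; the normalization is illustrative only.
    """
    seen = defaultdict(set)
    unique = []
    for rec in records:
        key = " ".join(rec["sentence"].lower().split())
        if key not in seen[rec["patient_id"]]:
            seen[rec["patient_id"]].add(key)
            unique.append(rec)
    return unique
```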

We developed four datasets from de-identified and de-duplicated clinical notes of patients diagnosed with different health conditions at different hospitals. These include “psychosocial assessment” notes of patients experiencing mental health issues, “chart notes” of patients diagnosed with Alzheimer’s disease, “social work” notes in the MIMIC-III database49, and clinical notes of patients with chronic pain. The clinical notes were sampled using different techniques based on the note type and on where the majority of the SDoH information was present. The details of these datasets are described below and summarized in Table 5.

Our first corpus comprises psychosocial assessment notes from the UTHealth Harris County Psychiatric Center (HCPC). Psychosocial assessments provide a comprehensive understanding of the psychological, social, and cultural context of a person, guiding the development of individual care plans. HCPC is one of the largest providers of inpatient psychiatric care in the USA. About 10,000 patients are admitted yearly, including adults, adolescents, and children. Commonly treated conditions are psychotic or mood disorders, acute crises, and signs of endangering oneself or others. The EHR goes back to 2001 and includes about 120,000 unique patients. We randomly selected 2000 assessment notes (corresponding to 1529 patients) and extracted their “Social history” sections (using the medspaCy sectionizer83), which are rich in SDoH information. These notes were then annotated by two annotators following the annotation guidelines.
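A minimal sketch of this section extraction step is shown below. It assumes a recent medspaCy release in which the sectionizer is added as the "medspacy_sectionizer" pipe and exposes detected sections via `doc._.sections`; the attribute names and the "social_history" category string are assumptions that may differ across versions and rule sets.

```python
import medspacy

# Assumes a medspaCy version that registers the sectionizer under this pipe
# name and exposes sections on doc._.sections; adjust to your installation.
nlp = medspacy.load()
nlp.add_pipe("medspacy_sectionizer")

def extract_social_history(note_text):
    """Return the text of sections detected as social history (assumed category name)."""
    doc = nlp(note_text)
    return [
        section.body_span.text
        for section in doc._.sections
        if section.category == "social_history"  # category label is an assumption
    ]
```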

The chart notes of patients diagnosed with Alzheimer’s disease and related dementias (ADRD) from UT Physicians (UTP) constituted the second corpus. Several studies have shown the association of SDoH factors such as education level, isolation, and loneliness with the onset of ADRD in older adults84,85,86. UTP provides outpatient care through multiple satellite clinics throughout the greater Houston area. We reviewed various note/document types within the EHR (e.g., progress notes, discharge summaries, procedure notes) and identified that “chart notes” (different systems might use different terminology, as there is no specific standard) were enriched with SDoH documentation. We therefore filtered the UTP data to include only this note type. Unlike the psychosocial assessment notes from HCPC, where the majority of the SDoH factors were described in the “Social history” section, the social history in UTP chart notes mostly recorded information related to substance use. Other SDoH information was scattered under sections titled “History of present illness”, “General observation”, etc. Additionally, chart notes are comparatively long and contain information regarding medications and different body systems that is irrelevant to this study. Hence, to increase annotation efficiency and decrease annotation time, we developed a list of keywords to filter notes rich in SDoH information. The list of keywords was collected during the initial literature review and expanded by combining keywords from our prior work developing an SDoH ontology and those obtained while annotating the clinical notes from HCPC32,70,71,87.
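The keyword filtering step can be approximated as in the sketch below. The keyword list and the hit threshold are hypothetical samples for illustration; the study's actual lexicon was compiled from the literature review, a prior SDoH ontology, and the HCPC annotations.

```python
import re

# Hypothetical sample of SDoH keywords; not the actual lexicon used in the study.
SDOH_KEYWORDS = ["lives alone", "homeless", "unemployed", "divorced",
                 "social support", "alcohol", "smok", "financial"]

pattern = re.compile("|".join(re.escape(k) for k in SDOH_KEYWORDS), re.IGNORECASE)

def is_sdoh_rich(note_text, min_hits=2):
    """Flag a note for annotation if it matches at least `min_hits` keywords.

    The threshold is illustrative; the actual filtering criteria may differ.
    """
    return len(pattern.findall(note_text)) >= min_hits
```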

The social work notes from the MIMIC-III database were utilized to construct the third corpus. The Medical Information Mart for Intensive Care (MIMIC-III) database contains more than two million free-text notes in different categories (e.g., nursing notes, discharge summaries) for patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. It includes 2670 “social work” notes documenting the patients’ social and family history and the interactions of social workers with the patient and family members during the hospital stay. We randomly selected and annotated 500 such notes.

For our fourth dataset, we utilized a subset of the corpus from a study34 that characterized chronic pain episodes in adult patients receiving treatment at the Mayo Clinic and the Olmsted Medical Center. An annotated corpus of 62 adults experiencing noncancer chronic pain was created using a chronic pain ICD code as the anchor diagnosis and retrieving all of each patient’s clinical notes from 6 months before to 2 years after this anchor diagnosis (for more information, please refer to the original study34). For our study, a total of 527 notes corresponding to the same 62 patients (retrieved using our keyword search) were annotated following our annotation guidelines. We annotated the entire note without filtering any sections. These notes exhibited a semi-structured format characterized by template structures, particularly noticeable in sections concerning substance use, along with a large amount of duplicated information.

A schematic representation of the workflow is illustrated in Fig. 4. For brevity, we will hereafter refer to the four datasets by the abbreviations of the hospitals/database from which the notes were extracted: HCPC, UTP, MIMIC-III, and Mayo. During the annotation process, we observed that some of the 21 annotated factors had a notably low prevalence and that these factors varied by corpus. To address class imbalance and enhance model performance, it is common practice to merge classes with fewer samples into a single category. Han et al.32 employed this approach by consolidating their initial set of 14 annotation categories for SDoH classification into eight, merging the less frequent ones into an “other-social” category. However, in our study we chose not to follow this practice and instead trained our models on all classes, including those with a smaller number of samples. By doing so, we aimed to provide a more realistic evaluation of model performance in real-world scenarios, considering the wide variation in the distribution of SDoH factors. Furthermore, retaining and training models on all classes is crucial for our cross-dataset evaluation to ensure consistent and reliable assessment, as differences in the total number and types of classes would otherwise affect the evaluation process.

Fig. 4: A schematic representation of the workflow.

Figure shows some of the SDoH factors, data sources, models used and evaluation process.

Models

We experimented with five different models: XGBoost, TextCNN, SBERT, ClinicalBERT, and LLaMA. Since each sentence in the annotated datasets can be annotated with multiple SDoH factors, we formulated the task of identifying SDoH factors in a sentence as a multi-label binary relevance problem for all models except LLaMA. Formally, given an input sentence \(s\) and a label set \(C\) of SDoH factors, a model produces a binary label in \(\{0,\,1\}\) for each SDoH factor \(c\in C\). For the LLaMA model, we performed supervised instruction fine-tuning: an instruction dataset was created with appropriate task instructions, with the sentences as input and the expected responses as output. Below we provide a brief overview of the five models.
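Under this binary relevance formulation, each sentence's gold labels can be turned into a 0/1 indicator vector over the label set, for example with scikit-learn's MultiLabelBinarizer as sketched below; the label names are only illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative label set; the real corpora use up to 22 (level 1) or
# 71 (level 2) classes, including "non SDoH".
label_set = ["marital status", "living status", "employment status", "non SDoH"]

mlb = MultiLabelBinarizer(classes=label_set)
Y = mlb.fit_transform([
    ["marital status", "living status", "employment status"],
    ["non SDoH"],
])
# Y is an (n_sentences, n_labels) 0/1 matrix; each column corresponds to one
# binary relevance problem solved by the non-LLaMA models.
print(Y)
```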

We used the scikit-learn-compatible implementation of XGBoost88 with tf-idf vectors as input features. OneVsRestClassifier was used with n_jobs = -1 (to use all processors) and a max_depth value of 4 for the underlying classifier. We implemented a TextCNN model following89 for multi-label text classification, using pre-trained word embeddings (GloVe90) of dimension 300 for each word in a sentence as inputs. The model applies convolutional filters with kernel sizes of 3, 4, and 5 to the input. The outputs of the convolutional filters are max-pooled and fed to a set of classification heads, each of which is a feed-forward network corresponding to a label.
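A minimal sketch of the XGBoost setup under one plausible reading of this description is shown below: a tf-idf vectorizer feeding a one-vs-rest wrapper around the xgboost package's scikit-learn-compatible XGBClassifier. Only max_depth=4 and n_jobs=-1 are taken from the text; everything else is left at illustrative defaults.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

# tf-idf features feeding one binary XGBoost classifier per SDoH label.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(XGBClassifier(max_depth=4), n_jobs=-1),
)

# Usage sketch, assuming X_train is a list of sentences and Y_train is the
# binary indicator matrix from MultiLabelBinarizer:
# model.fit(X_train, Y_train)
# Y_pred = model.predict(X_test)
```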

We used a pre-trained Sentence-BERT (SBERT)91 encoder to encode each sentence into a dense vector representation (i.e., a sentence embedding). SBERT fine-tunes BERT31 in a siamese/triplet network architecture to produce semantically meaningful sentence embeddings. We further fine-tuned SBERT for multi-label classification on the SDoH datasets. The output sentence embeddings of SBERT are passed to \(C\) binary classifier heads, each of which is a feed-forward neural network; the SBERT sentence encoder is shared among all classifier heads. We fine-tuned the classifier heads and the shared SBERT encoder by minimizing the binary cross-entropy loss of all classifier heads. During inference, if a classifier head produces a score of 0.5 or more, the input sentence is assigned a positive label for that SDoH factor.

The fourth model, ClinicalBERT50, is a BERT model that has been further pre-trained on text from all note types in the MIMIC-III v1.4 database. We used the same training and inference process as described above for SBERT, simply replacing the encoder with ClinicalBERT.
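The shared-encoder-plus-binary-heads design used for both SBERT and ClinicalBERT can be sketched in PyTorch as follows. This is a minimal sketch, not the exact implementation: a single linear layer producing one logit per label stands in for the per-label feed-forward heads, mean pooling stands in for the sentence embedding, and the checkpoint name would be the relevant SBERT or ClinicalBERT model on the Hugging Face hub.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiLabelSentenceClassifier(nn.Module):
    """Shared transformer encoder with one binary output per SDoH label."""

    def __init__(self, model_name, num_labels):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # One logit per label; stands in for the per-label feed-forward heads.
        self.heads = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        mask = attention_mask.unsqueeze(-1).float()
        # Mean pooling over non-padding tokens (an assumption for this sketch).
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        return self.heads(pooled)  # raw logits, one per SDoH label

# Training uses binary cross-entropy over all heads; at inference a sigmoid
# score of 0.5 or more assigns the corresponding SDoH label.
# loss_fn = nn.BCEWithLogitsLoss()
# probs = torch.sigmoid(model(input_ids, attention_mask))
# predictions = (probs >= 0.5)
```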

We utilized the open-source pretrained LLM LLaMA 2 7B68,92 and adapted it for multilabel classification through instruction fine-tuning. The training data for each dataset were converted into instruction demonstrations. An example of an instruction demonstration is shown in Supplementary Fig. 5; it has three components: an instruction describing the task to perform, an input, which in this case is the sentence from which SDoH factors need to be extracted, and an output, which is the gold annotation label(s) for that sentence. We followed the approach described in ref. 93, using Hugging Face’s training framework with fully sharded data parallelism and mixed-precision training51. During inference, we provided the fine-tuned models with the “instruction” and “input” to generate the “output”. We also performed preliminary experiments with LLaMA 2 7B and 13B (base and chat variants) models using different prompts to assess the need for fine-tuning for this task compared to using the base or chat variants of LLaMA directly.
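An instruction demonstration along these lines might look like the snippet below. The instruction wording and output format here are hypothetical paraphrases; the exact prompts and demonstration template used in the study are given in Supplementary Fig. 5 and the Supplementary Notes.

```python
# Hypothetical instruction demonstration; the actual wording is in
# Supplementary Fig. 5 and the Supplementary Notes.
demonstration = {
    "instruction": (
        "Identify all social determinants of health (SDoH) factors mentioned "
        "in the input sentence and return them as a comma-separated list. "
        "If no SDoH factor is present, return 'non SDoH'."
    ),
    "input": "She is single, lives with her parents, works 3 days per week.",
    "output": "marital status, living status, employment status",
}
```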

Evaluation

In total, we have eight annotated corpora: two versions (level 1: SDoH factors only; level 2: SDoH factors and values/attributes) of each of the four datasets. We evaluated the models on all corpora by first training and testing on the same corpus and then performing cross-dataset evaluation. To assess the models’ performance in assigning multiple SDoH categories to a sentence, we computed micro-averaged precision (P), recall (R), and F1. The micro-average is calculated by pooling the total numbers of true positives, false positives, and false negatives across all classes.
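With gold and predicted labels expressed as binary indicator matrices, these scores can be computed as sketched below; the macro-averaged scores reported in Table 2 follow by switching the `average` argument. This is an illustrative sketch, not the study's exact evaluation code.

```python
from sklearn.metrics import precision_recall_fscore_support

def multilabel_scores(y_true, y_pred, average="micro"):
    """Compute P/R/F1 for (n_sentences, n_labels) 0/1 indicator arrays.

    Micro-averaging pools true/false positives and false negatives across all
    classes; average="macro" instead averages the per-class scores.
    """
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=average, zero_division=0
    )
    return precision, recall, f1
```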

We performed fivefold cross-validation by dividing each corpus into five folds; the performance on each fold was recorded, and the average performance across all five folds was calculated. Models were trained and tested only on the SDoH categories present in the corpus, and hence the number of classes differs across the datasets.

Model performance is usually good when models are trained and tested on the same corpus. We therefore wanted to evaluate how different models would perform when trained on one corpus and evaluated on another (cross-dataset evaluation). Each corpus was randomly divided in a 7:1:2 ratio into train, validation, and test splits, respectively. Models were trained and tested on 22 classes (21 SDoH and 1 non SDoH) for the level 1 annotated corpora and 71 classes (70 SDoH + values and 1 non SDoH) for the level 2 annotated corpora.
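The 7:1:2 split can be realized, for example, with two successive random splits, as in the sketch below; the random seed is illustrative and the study's exact splitting procedure may differ.

```python
from sklearn.model_selection import train_test_split

def split_70_10_20(sentences, labels, seed=42):
    """Split a corpus into 70% train, 10% validation, and 20% test.

    `sentences` and `labels` are assumed to be aligned sequences; the seed
    is illustrative only.
    """
    x_train, x_rest, y_train, y_rest = train_test_split(
        sentences, labels, test_size=0.3, random_state=seed
    )
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=2 / 3, random_state=seed
    )
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```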

With factors varying in their distribution across datasets, how would performance be affected if we combined the datasets and trained a single model (combined dataset evaluation)? To evaluate this, we retained the exact splits used for cross-dataset evaluation, combined the training splits of all four corpora at level 1 into a single training split, and combined the validation splits into a single validation split. The test splits were preserved separately. The same process was repeated for the level 2 annotated corpora. Next, a model was trained on the combined training data, validated, and tested on the test split of each corpus separately. The number of classes, as in the cross-dataset evaluation, was 22 for level 1 and 71 for level 2.