Introduction

In the United States, social factors, including educational level, racism, and poverty, contribute to over a third of total deaths in a year1. According to a study that estimated the number of deaths attributable to social factors in the United States in the year 2000, 245,000 deaths were attributed to low education, 176,000 to experiences of racism, 162,000 to low social support, 133,000 to individual-level poverty, 119,000 to income inequality, and 39,000 to area-level poverty1. Such mortality estimates are comparable to those of the leading disease-related causes of death. A growing body of research has reported on the negative impact of social determinants of health (SDoH) factors (socioeconomic and environmental factors that describe the conditions in which people are born, live, work, and age) in creating undesirable circumstances, such as health disparities and discrimination2,3,4. For example, the likelihood of premature death increases as income goes down2, children born to parents who have not completed high school are more likely to live in environments that contain barriers to health3, and low education levels are directly correlated with low income, higher likelihood of smoking, and shorter life expectancy2. We note that in the context of this study the term SDoH is used in a broad sense, covering not only socio-economic factors but also behavioral, environmental, and other closely linked factors. This aligns with the comprehensive definition of SDoH outlined by Healthy People 20305 and is supported by a systematic review of prior literature on SDoH6.

The mounting body of evidence demonstrating the impact of social determinants on health has resulted in an increasing acknowledgment within the health care sector that achieving improved health and health equity is likely to be contingent, at least in part, on addressing unfavorable social determinants7,8,9,10. Over the last decade, electronic health record (EHR) systems have been increasingly implemented at US hospitals and have generated large amounts of longitudinal patient data. These massive EHR databases are becoming enabling resources for diverse types of clinical, genomic, and translational research, with successful demonstrations by large initiatives such as the Electronic Medical Records and Genomics (eMERGE) Network11, the Patient-Centered Outcomes Research Institute (PCORI)12, and the Observational Health Data Sciences and Informatics (OHDSI) consortium13. However, most current studies do not use the full spectrum of EHR data, and more efforts are being devoted to including unstructured textual data in EHRs for real-world evidence generation14.

In clinical practice, SDoH information has not been systematically collected and analyzed in EHRs15,16,17. While some SDoH factors (e.g., race, sex) are present as structured data in EHRs, the lack of specific EHR fields to record SDoH information and the lack of standards for collecting SDoH data are major reasons for insufficient SDoH documentation. On the other hand, clinical narratives contain more detailed characterization of several SDoH factors (e.g., living status, financial instability, social support) beyond the structured representation. Unstructured text often describes in detail how multiple socioeconomic, behavioral, and situational variables interact with one another. To bridge this gap, natural language processing (NLP) techniques have emerged to automatically identify SDoH information from unstructured notes.

The current landscape of NLP approaches for identifying SDoH factors has been extensively reviewed6,18. While multiple NLP methods are available, there are four primary challenges for SDoH identification. First, many studies, even recent ones19,20,21,22,23,24,25, have focused on a limited set of SDoH factors (up to 8 factors). A comprehensive review6 revealed that among the 82 papers examined, only three SDoH factors were commonly addressed: smoking status (27 papers), substance use (21), and homelessness (20). Other critical factors such as education, insurance status, and broader social issues are still at a developmental stage. The n2c2/UW SDOH Challenge26 on SDoH extraction focused on the status, extent, and temporality of the factors, but was limited to five factors (the commonly studied alcohol, drug, and tobacco use, in addition to employment and living situation) using the Social History Annotated Corpus (SHAC)27. Second, a considerable number (22 out of 82) of these approaches rely on rule-based methodologies, which are highly dependent on specific lexicons. The lack of a standardized documentation framework for SDoH means these lexicons might not be easily transferable or adaptable across different healthcare systems, significantly limiting the methods’ effectiveness. Recent work in this direction is the Gravity Project28, a collaborative effort aimed at developing consensus-based data standards for SDoH information. Some prior studies have explored the use of deep neural network architectures such as CNN29, LSTM30, and BERT21,23,31 to automatically classify SDoH categories32, or named entity recognition (NER)-based approaches, utilizing tools like cTAKES33 and CNN- and BERT-based models34, to extract SDoH factors from clinical text. Third, most methods were developed and evaluated within specific healthcare systems, such as a Veterans Health Administration center35, a medical center36, and a trauma center37. The lack of broader testing raises concerns about the generalizability and adaptability of these NLP methods to diverse healthcare settings. For example, Stemerman et al.38 designed a multi-label classifier to identify six SDoH categories within sentences extracted from clinical notes sourced from the University of North Carolina’s clinical data warehouse. Finally, the decision to formulate the SDoH extraction task as classification15,32,38 vs. NER27,39,40 requires careful consideration.

Existing research has also explored integrating patient SDoH data with geospatial datasets such as census demographics and the social vulnerability index (SVI) to estimate SDoH factors at a community or neighborhood level41,42,43,44. One such study analyzed the extent to which self-reported SDoH needs assessments collected by a health system are associated with census tract-level social vulnerability measured using the SVI42. Other work has explored mapping SDoH documentation and area-level deprivation to assess individual-level social risks44. While leveraging linkage between patient addresses and geo-referenced datasets offers a valuable community-level perspective, our work aims to directly identify patient-specific SDoH information from unstructured text using NLP techniques.

Recent studies have explored the use of large language models (LLMs) for the identification of SDoH factors19,24,45,46. Guevara et al. conducted a study using encoder-decoder language models, specifically fine-tuned Flan-T5 XL and Flan-T5 XXL, to classify mentions of SDoH categories from narrative text in EHRs45. The study aimed to identify any mention of these SDoH factors as well as specifically adverse mentions, and it also investigated the potential of data augmentation by incorporating LLM-generated synthetic SDoH data into the training process to further enhance model performance. However, the study was limited in scope to identifying the presence or absence of only six SDoH factors, without exploring temporality or nuanced assertions beyond a broad adverse categorization.

The inclusion of SDoH in clinical notes can vary depending on various factors, including the healthcare system, individual clinician practices, and the specific context of patient encounters47,48. These factors strongly influence the heterogeneity in the distribution of SDoH factors in clinical notes. Datasets and NLP models developed under narrow conditions (a specific note type, a single institution, a particular hospital setting, or individual documentation practices) might not always translate to real-world scenarios, especially when the aim is to develop models that generalize well across different settings. Hence, the primary goal of this study is to investigate the feasibility of developing high-performing and generalizable NLP approaches for extracting comprehensive SDoH information from clinical notes. We train and evaluate several NLP models, including deep-learning architectures and LLMs, assessing their capabilities both within their training distribution (in-domain) and on datasets from other sites (out-of-domain). By analyzing patient data from diverse clinical settings across multiple healthcare institutions, we also shed light on the real-world distribution of SDoH factors.

In summary, this paper presents the following key contributions: (1) a study identifying the most comprehensive list of SDoH factors from clinical notes; (2) curation of the largest multi-institution annotated corpora covering assertions and temporality of SDoH factors; and (3) development and evaluation of both traditional models and LLMs, highlighting the impact of varying label distributions of SDoH factors across datasets on cross-domain transfer learning. To enhance accessibility and encourage further research, we make our models, trained and fine-tuned on this multi-institutional SDoH dataset, publicly available to the broader scientific community.

Results

Annotation and datasets

For the UTHealth Harris County Psychiatric Center (HCPC) dataset, after eliminating duplicate notes/sentences, we ended up with a total of 4953 sentences (including 344 sentences with no SDoH). The distribution of SDoH factors in the level 1 annotated corpus is shown in Fig. 1 (top left). This dataset has a high prevalence of several factors, such as adverse childhood experiences, education level, isolation, geographic location where the person was born/raised, physical/sexual abuse, and social support, compared to the other datasets, in which these factors were not documented or were documented less frequently. A total of 18 SDoH factors were annotated in this dataset.

Fig. 1: Distribution of SDoH factors.

Number of sentences documenting each SDoH factor for all four corpora annotated at level 1 (identifying SDoH factors only).

For the UT Physicians (UTP) dataset, a total of 1691 sentences were annotated from 1000 chart notes. Several of these notes contained duplicate information because of copy-pasting across multiple visits by the same patient, which was eliminated. For this dataset, comprising ADRD patients in an outpatient setting, the chart notes documented substance use information in detail under “social history” sections. Sex and race were almost always mentioned at the beginning (e.g., “The patient is an 80 yr old Caucasian F …”), followed by some information regarding the patient’s employment, living, and marital status. A total of 14 SDoH factors were annotated, showing a high prevalence of sex, race, and substance abuse.

The notes with note type “social work” extracted from the MIMIC-III database49 were written by social workers, largely documenting their interactions with patients’ families. This explains the high frequency of marital status, as well as the living status and social support information that could often be inferred. Other frequently reported factors included alcohol and drug use, sex, and employment. A total of 2838 sentences were annotated from 500 notes, with 20 SDoH factors present.

The Mayo dataset was the smallest, with 964 sentences annotated. The dataset contained all clinical notes of a chronic pain cohort, with much of the data structured as templates to be filled in, resulting in high volumes of duplicate information. In addition, a major part of the content documented diseases and medications, explaining the high frequency of the “non SDoH” category. A total of 18 SDoH factors were present, with 10 factors having fewer than 10 samples each.

The distributions of SDoH factors with level 2 annotations for all datasets are shown in Supplementary Figs. 1–4.

Model performance

In this section, we present the results of our experiments with different models for all annotated corpora. Table 1 presents the results of fivefold cross-validation on both level 1 and level 2 annotated corpora. Micro-averaged precision, recall, and F1 scores are reported by averaging across all five folds. We observe that the instruction-tuned LLaMA model performed better than all other models for all datasets except the MIMIC-III dataset (level 1), for which the ClinicalBERT50 model was better. The ClinicalBERT model had the second-best performance on the level 1 annotated corpora. For level 1, the UTP dataset has only 15 classes (Fig. 1, top right) and comparatively better class balance. This is reflected in the model performance, with fivefold cross-validation consistently yielding an F1 score above 0.95 for all models. On the other hand, the Mayo corpus (Fig. 1, bottom right) is highly imbalanced, is the smallest dataset, and has a total of 19 classes. It is notable that the instruction-tuned LLaMA model still achieved an F1 score of 0.94, closely followed by ClinicalBERT at 0.919.

Table 1 Micro-averaged precision, recall, and F1 metrics (average of 5 folds) of models across all datasets

If the number of classes increases while the number of datapoints remains the same, model performance decreases, as is evident from the results for the level 2 annotated corpora (Table 1, bottom half). However, the decrease in performance for the LLaMA model (ranging from 2.7% to 4.6%) is smaller than for the other models (4.6% to 11.3% for XGBoost, 10.2% to 23.9% for TextCNN, 4.5% to 12% for SBERT, and 3% to 13.3% for ClinicalBERT). For UTP, all models except TextCNN have an F1 score above 0.91. The largest difference between ClinicalBERT and LLaMA (13.2%) was observed for the Mayo dataset (level 2), due to the high class imbalance (Supplementary Fig. 4) and most classes not having enough samples for learning.

The macro-averaged metrics (Table 2) are a better reflection of model performance on such class-imbalanced datasets. We notice a widened performance gap between LLaMA and the other models, with the LLM still performing better than the other models, achieving F1 scores greater than 0.45 across the datasets and levels of annotation. This also demonstrates the ability of the LLaMA models to learn from very few examples for many classes.

Table 2 Macro-averaged precision, recall, and F1 metrics (average of 5 folds) of models across all datasets

Our extensive prior work comparing zero-/few-shot and fine-tuned LLaMA models has demonstrated the superior performance of fine-tuned models51,52,53,54. We also evaluated LLaMA 2’s in-context learning capabilities (7B-base and 7B-chat variants) using up to ten prompt samples to demonstrate the utility of and need for fine-tuning for this task. The 7B-base model produced inconsistent outputs, while the 7B-chat model, though slightly improved, generated overly verbose responses with numerous false positives (outputting the entire list of SDoH factors in the majority of cases), requiring extensive post-processing. As shown in Supplementary Table 12, even with optimized prompts and the larger 13B-chat model, in-context learning yielded poor results. The top performance on level 1 annotations was observed on the HCPC dataset (micro-averaged F1 of 0.518) with the 13B-chat model. The performance dropped further on level 2 annotations and macro-averaged evaluations, likely due to the complexity of this task and the increase in the number of classes. The prompts used are included in the Supplementary Materials (Supplementary Notes 2 and 3).

Generalizability experiments

Figure 2 shows the performance (micro-averaged) of all models when trained on one corpus and evaluated on all other corpora (level 1 annotated corpora). Overall, the LLaMA model generalized across datasets better than the other models. When fine-tuned on HCPC, LLaMA achieved an F1 score of 0.91 on both the UTP and Mayo datasets and 0.82 on MIMIC-III, the overall best cross-dataset performance. Other notable results include training on MIMIC-III and testing on UTP (F1 = 0.94). ClinicalBERT performed better when trained on UTP and tested on the MIMIC-III and Mayo datasets, and when trained on MIMIC-III and tested on the Mayo dataset. These results can be attributed to the high overlap of SDoH categories between these datasets, with minimal impact from non-overlapping categories on micro-averaged metrics due to their limited sample sizes.

Fig. 2: Heatmap of cross-dataset performance evaluation (level 1).

The diagonal shows micro-averaged F1 scores when models are trained and tested on the same dataset for level 1 annotations. Other cells show F1 scores when trained on one dataset and tested on another.

With the increase in the number of classes in the level 2 annotated corpora, the LLaMA model performed better than all other models (Fig. 3) in all generalizability evaluations. The lowest performance for any dataset pair was an F1 score of 0.40 for LLaMA 2, compared with 0.08 for ClinicalBERT, 0.06 for SBERT, 0.04 for TextCNN, and 0.09 for XGBoost, clearly demonstrating the superior ability of instruction-tuned LLaMA models to generalize to unseen data. Micro-averaged precision and recall scores, along with F1, for both levels of annotation can be found in Supplementary Tables 1 and 2.

Fig. 3: Heatmap of cross-dataset performance evaluation (level 2).

The diagonal shows micro-averaged F1 scores when models are trained and tested on the same dataset for level 2 annotations. Other cells show F1 scores when trained on one dataset and tested on another.

LLaMA models outperformed all other models in cross-dataset generalizability, with an even wider performance gap when evaluated on macro-averaged precision, recall, and F1 (Supplementary Table 3). This advantage is likely due to LLaMA’s ability to learn effectively from fewer samples, a key factor given the datasets’ high imbalance. Per-factor performance evaluations (Supplementary Tables 4–7), with training data instance counts highlighted (* for <15 samples, ** for >50 samples), further illustrate LLaMA’s few-shot abilities and their impact on the macro-averaged evaluation. These tables include instances where LLaMA achieved a perfect F1 score of 1.0, as well as scenarios where all models, including LLaMA, performed poorly or where LLaMA’s F1 score was lower than that of other models. A single instance of zero-shot performance by LLaMA was also observed for financial issues (Supplementary Table 7). Overall, LLaMA performs particularly well on low-prevalence factors; however, other models can achieve similar or even better performance in some cases when sufficient training data are available.

To further understand the generalizability of models when sufficient training data are available, we performed another set of experiments involving five SDoH factors (Alcohol, Drug, Employment Status, Marital Status, Living Status) that are present in all four datasets and have at least 25 examples in the level 1 training data. Supplementary Tables 8 to 11 show the performance of all models on these five factors when trained on one data source and tested across the three other data sources. Out of 60 experiments (5 factors and 4 training data sources, each tested across three other data sources), LLaMA achieved the top performance in 41, ClinicalBERT in 13, XGBoost in 5, and TextCNN and SBERT in 1 each. These results indicate that models such as ClinicalBERT and XGBoost can still outperform LLaMA when sufficient training data are available. We discuss in detail in the Discussion section the performance of different models in different scenarios and its implications for cross-dataset generalizability.

The final set of evaluations involved training models on a combined dataset (combining the train splits of all four datasets) and evaluating the performance of the resulting model on the individual datasets. As shown in Table 3, ClinicalBERT achieved on-par or sometimes better performance than LLaMA on level 1 micro-averaged metrics on the combined data. However, LLaMA outperformed ClinicalBERT on three out of four datasets in the level 2 evaluation. When macro-averaged metrics were used, ClinicalBERT lagged considerably behind LLaMA, as shown in Table 4. The difference ranged from 2.6% to 12.6% for level 1 and from 2% to 13.4% for level 2 annotations.

Table 3 Performance (P/R/F1) of all models (micro-averaged) when trained on the combined dataset for both levels of annotation
Table 4 Performance (P/R/F1) of all models (macro-averaged) when trained on the combined dataset for both levels of annotation

Error analysis

A thorough examination of the performance of the instruction-tuned LLaMA model revealed several key areas for potential improvement. Analyzing the predictions, we observed instances where the model failed to make any prediction, generated predictions outside the expected class labels (e.g., predicting “6th grade” instead of “Education level”), and struggled to differentiate between SDoH pertaining to the patient and those referring to family members. The primary reasons for the notably low performance on the Mayo dataset when evaluated with the model trained on the combined dataset (Table 3) were instances of no response and the generation of irrelevant outputs such as numerical values (e.g., dates). Additionally, an interesting observation emerged regarding the level of annotation: for sentences describing the patient as a non-smoker, the LLaMA model failed to generate “smoking” as a label when using the level 1 annotated corpus; however, for the same sentences in the level 2 annotated corpus, the model consistently generated “nonsmoker” as a response. These issues highlight the need for improved prompt design with more specific instructions to regulate such outputs. Incorporating detailed instructions from annotation guidelines, as observed in our prior studies on LLMs53, could help mitigate these errors and enhance the model’s overall performance.

Discussion

Our work builds upon and extends a growing body of research applying NLP techniques to automatically identify individual-level SDoH information from unstructured clinical notes. While existing studies typically focus on highly documented SDoH factors or combine less frequent factors into a single category, we undertook a comprehensive, multilevel annotation of all SDoH factors with due attention to temporality and assertion, in addition to fine-grained subcategorization of each factor. Because we do not constrain the SDoH factors in this way, our model performance better reflects real-world operating conditions. We curated and constructed four annotated corpora of SDoH using data from four distinct healthcare systems, ensuring a diverse representation of SDoH factors. Leveraging these corpora, we conducted extensive experiments employing four text classification models and a generative model to detect SDoH factors. We not only evaluated the performance of these models but also highlighted the impact of the diverse label distribution of SDoH factors across datasets on the effectiveness of cross-domain transfer learning. We hope that our findings provide valuable insights into the complex landscape of SDoH factors, paving the way for enhanced understanding and future advancements in healthcare interventions and policies by facilitating and encouraging better documentation practices.

A primary motivation behind our work is to provide the research community with models pre-trained on these SDoH datasets, facilitating their customization and adaptation to local institutional contexts. We recognize that many academic and healthcare organizations, including our own, face significant resource constraints that prohibit the utilization of LLMs beyond a certain scale. Consequently, our study prioritized the evaluation of more modestly sized LLMs that still demonstrate promising performance capabilities. By openly disseminating checkpoints of models fine-tuned on our aggregated multi-institutional SDoH dataset, we aim to lower the barrier for other research groups to build upon our work. These models can serve as effective initialization points for transfer learning, requiring further fine-tuning on relatively small amounts of locally curated data to improve performance.

Our results demonstrate the ability of the instruction-tuned LLaMA model to outperform all other models, especially as the number of factors increases and when performance is evaluated per factor in light of the high class imbalance. The base LLaMA model being a general-domain LLM and the task of SDoH identification being a largely non-medical one might have made these models more suitable for this task than for other biomedical NLP tasks that require considerable biomedical knowledge. The ClinicalBERT model performed on par, and sometimes better, on micro-averaged evaluation and when enough data were available for the different factors. The performance of the XGBoost model was comparable to that of the SBERT model, and it is interesting to observe that the architectural differences did not significantly influence the outcomes when sufficient and diverse data were available, at least within the scope of this study.

Comparing ClinicalBERT and SBERT, ClinicalBERT demonstrated better performance, possibly due to its pre-training on MIMIC notes. The performance of the TextCNN models was much lower on two datasets, MIMIC-III and Mayo. The MIMIC-III dataset has high lexical and semantic variability; its sentences are diverse, do not follow any patterns or templates, and are mostly written in a style that reproduces conversations with patients and their families, which might have affected model performance. Mayo, in turn, is the smallest dataset and has high class imbalance, with more than half of the samples belonging to the ‘non SDoH’ category. Even though the ClinicalBERT, SBERT, and XGBoost models performed well when trained and tested on the same datasets, this performance did not carry over when tested on other datasets, especially on the level 2 corpora and when evaluating per-factor performance. This low performance could be attributed to differences in (1) the SDoH frequency distribution and (2) the patterns in which SDoH factors are described, which vary significantly depending on the hospital setting, the preferences of individual practitioners, the templates in place, and the medical specialties. To address this issue, data need to be collected from diverse sources and combined to create datasets that capture the variations that contribute to this heterogeneity.

We want to distinguish two distinct aspects of model generalizability: (1) the zero-/few-shot capabilities of models on new, unseen tasks (i.e., extracting new SDoH factors not seen during training) and (2) the performance on new datasets for the same task for which the model was trained/fine-tuned (i.e., extracting shared SDoH factors across institutions). While LLMs excel in zero- and few-shot settings on unseen tasks, making them suitable for novel SDoH extraction, other models also show reasonable performance when evaluating generalizability on shared SDoH factors across datasets with sufficient training data. Our experiments focusing on five common SDoH factors (Alcohol, Drug, Employment Status, Marital Status, and Living Status), present in all four datasets with at least 25 training examples each (Supplementary Tables 8–11), demonstrate this point. In these cross-dataset evaluations, LLaMA achieved the top performance in 41 out of 60 experiments, demonstrating strong generalizability. However, ClinicalBERT and XGBoost also showed competitive performance, achieving the top performance in 13 and 5 experiments, respectively. For example, ClinicalBERT achieved a superior F1 score of 0.90 compared to LLaMA’s 0.71 on ‘living status’ when trained on the Mayo dataset and tested on the UTP dataset. This suggests there are other factors that affect model generalizability. With sufficient training data, simpler models may achieve comparable or sometimes better generalizability for well-defined tasks. Furthermore, our exploration of LLaMA 2’s in-context learning capabilities (Supplementary Table 12) revealed limitations in its zero-/few-shot performance for this task, with the 13B-chat model achieving a best micro-averaged F1 of only 0.518 on the HCPC dataset (the highest among all datasets) for level 1 annotations, and performance declining further on level 2 annotations and macro-averaged evaluations. This illustrates the need for fine-tuning to achieve robust performance for SDoH extraction using LLMs.

While our cross-dataset evaluation sheds some light on models’ ability to handle distribution shifts between sites, there are inherent challenges to developing universally generalizable SDoH extraction models. Many social and environmental risk factors are intricately tied to the unique circumstances, lived experiences, and cultural contexts of different communities. The salience and manifestations of specific SDoH can vary drastically between patient populations. For example, documentation of housing instability or food insecurity may present very differently in an urban setting compared to rural or affluent suburban communities. Such population-level heterogeneity makes it difficult to capture all potential linguistic variations in a single model. As such, while our work demonstrates some generalizability across datasets from different health systems, the highest performances are still achieved when models are tested in-domain. In cross-dataset evaluation, when the test set contains a class absent from the training dataset, the model is effectively operating in a zero-shot setting for that class, and the resulting performance drop is anticipated due to label shift. Conversely, when all classes are covered in training, the generalizability assessment extends to evaluating the model’s capacity to manage covariate shift, which emerges when the training and testing data distributions differ. Even though these factors contributed significantly to the drop in performance, we still included rare SDoH factors in our evaluation, as this allows us to assess the pipeline’s performance in real-world scenarios, even when it rests solely on zero-shot capabilities.

Developing and deploying LLMs can be challenging due to the significant computational resources required for training, fine-tuning, and inference. High energy consumption and a large carbon footprint are additional environmental considerations. With parameter counts in the billions, LLMs may also pose storage challenges in some environments. Specifically, fine-tuning the 7B LLaMA model on about 7300 instruction demonstrations (combined training on all corpora) required approximately 10 min on four NVIDIA A100 80GB GPUs. We performed full model fine-tuning, though techniques like parameter-efficient fine-tuning and quantization could reduce the GPU and memory requirements considerably. Inference with the fine-tuned LLaMA model was performed on a single A100 GPU. The TextCNN, SBERT, and ClinicalBERT models had more modest training overheads, requiring only a single GPU. The XGBoost model, being a conventional tree-based method, could be trained efficiently on CPUs and exhibited the fastest runtimes overall.

We note that differing criteria were used to construct the datasets from each site. While ideally all datasets would undergo identical preprocessing, some variation was required across sites to construct datasets with sufficient densities of SDoH documentation. However, we cannot rule out the possibility that these dataset-specific preprocessing steps may have inadvertently introduced biases that hindered cross-dataset generalization and contributed to performance differences between models. The HCPC, UTP, and MIMIC datasets employed random sampling and keyword filtering, but the Mayo dataset originated from a study prescreening for chronic pain patients. This condition-specific cohort likely does not reflect the diverse SDoH profiles of Mayo’s general clinical population, which, along with its small sample size, may have affected both the SDoH distribution and model performance on this dataset.

Careful consideration must be given to mitigating algorithmic bias when developing and deploying language models for extracting SDoH information from unstructured clinical notes across different healthcare settings. Models trained on data reflecting societal biases and disparities may perpetuate or amplify these inequities when deployed in clinical practice. Hence, evaluating bias in LLMs using patients’ clinical data is crucial to ensure that the models do not perpetuate or amplify existing biases present in healthcare datasets55,56,57. Differences in race, ethnicity, sex, age distributions, and other traits across sites could contribute to language biases, disproportionate SDoH mention rates, or framing disparities stemming from clinician implicit biases. While some information about the patient populations is provided in Table 5, further characteristics such as age, sex, and race could also explain differences in the SDoH extracted from each dataset and might introduce bias into model performance56. A study designed to characterize cohort compositions and documentation patterns, and to analyze them uniformly across datasets, is needed to understand which factors contribute to SDoH distribution variations and model performance. Since our study was not originally designed to comprehensively investigate or report on bias evaluations, a redesigned study specifically focused on bias evaluations and mitigation approaches would be required to address this limitation thoroughly. Additionally, the use of a previously published dataset and the limited availability of structured data for some of the other datasets prevented us from providing the further cohort characterization required for comprehensive bias evaluations.

Table 5 Description of the corpora used in this study

Our study was conducted using data from US-based medical centers. The specific SDoH factors and the linguistic patterns for expressing them may differ across other geographic and cultural contexts. To maximize the global impact and generalizability of using language models for SDoH extraction, it will be critical to create additional annotated datasets representing diverse populations worldwide58. Evaluation of model performance and potential biases must also be conducted across different languages, healthcare settings, and sociocultural environments. Expanding this work beyond the US context is therefore an important future direction to advance global health equity through SDoH documentation.

It is imperative to continually benchmark the latest advances in LLMs for challenging applications like extracting SDoH from unstructured clinical text. More powerful models like LLaMA 3.159 might perform better than the LLaMA 2 used in this study. However, our fine-tuning strategies and evaluations remain relevant independent of the specific model used. In future work, we will investigate commercial LLMs such as GPT-460 for SDoH extraction, once arrangements that allow sending clinical data to GPT models (e.g., via GPT models on Azure) have been approved by our institution.

If the ultimate goal is to develop a single model that can identify a variety of SDoH factors for a wide range of clinical notes originating from multiple institutions, federated learning61 could be a promising direction. In federated learning, each institution trains its own model on its privately owned dataset and updates it through several rounds of model merging. After each round, the merged model is broadcast to each institution before the next round of training begins. Other approaches, such as a combination of domain-incremental learning and class-incremental learning62, can also be applied to sequentially update a single model with new domains and new sets of classes. We will explore these in future work.

Guevara et al.45 explored the use of synthetic data for SDoH, utilizing a simple prompt that provided the model with several examples of each category. While showing promise, the generated synthetic data did not exhaustively capture all possible linguistic variations of SDoH concepts that could be encountered in real clinical text. We will systematically investigate different synthetic data generation strategies, including zero-/few-shot prompting, chain-of-thought prompting, and fine-tuning63,64,65,66,67. In addition to experimenting with advanced prompting strategies, we plan to propose a template-based conditional generation approach to increase the coverage of long-tailed factors, thereby mitigating the factor imbalance and increasing the diversity of generated samples.

In summary, extracting and classifying social determinants of health from EHRs and clinical notes has the potential to support the development of effective treatment plans, improve population health, and reduce health disparities. While which SDoH factors are documented, and where, varies widely across healthcare settings and medical specialties, the effects of these variations on model performance in real-world scenarios are rarely studied, especially when developing generalizable models is the primary goal. Hence, in this study we analyzed the performance of multiple models on SDoH extraction by designing and developing annotation guidelines for classifying SDoH factors and creating several annotated corpora using data from four healthcare systems. The datasets were curated to include different patient cohorts, note types, layers of care in hospital settings, and documentation practices, thus showcasing the heterogeneity in the distribution of SDoH factors. Four classification models and a large language model were evaluated for detecting SDoH factors, with the LLM performing best. The generalizability of the models across institutions was also evaluated. While all models perform relatively well when trained and tested on a single dataset, performance varied and dropped in cross-dataset evaluation, indicating the need for further research in this domain. To encourage research in this direction, we will make available models trained by combining all annotated datasets.

Methods

Prior research has often constrained the number of SDoH factors within datasets due to the manual annotation burden, or has merged numerous factors with only a few samples into an “Other SDoH” category to potentially improve model performance32. However, this oversimplified approach fails to mirror the intricacies of real-world scenarios. Hence, in this study we undertook a meticulous annotation process, annotating all SDoH factors present in the clinical notes and refraining from consolidation even in cases of limited sample sizes, to better reflect the landscape of SDoH factor distributions.

We collected data from four different hospitals and hospital settings (inpatient and outpatient), including the publicly available MIMIC-III49 database, to acquire rich and diverse documentation of SDoH factors. The datasets include psychosocial assessment notes, chart notes, social work notes, and all clinical notes from a published cohort study34. This variety in note types also accounts for the variety in documentation practices, as the notes are written by physicians, nurses, social workers, and others. While “social history” sections are rich in SDoH information (and many studies have used this section exclusively), SDoH factors can be scattered under different sections depending on the type of note, note templates, and individual note-taking styles of the providers. In our study, we conducted a preliminary analysis of notes from each dataset to identify SDoH information across sections. In the HCPC dataset, SDoH details were predominantly concentrated in the social history sections of psychosocial assessments, so only those sections were considered. Conversely, notes from UTP contained diverse SDoH information spread across different sections such as “History of present illness” and “General observation,” prompting a comprehensive review of the note contents to ensure inclusion of relevant data.

We decided to use multilabel classification instead of NER, as this approach offers flexibility in capturing the presence of various social determinants without requiring precise identification of entity boundaries. Annotators can focus on understanding the overall meaning and context of the sentence to determine which social determinants are relevant. This approach can potentially speed up the annotation process compared to the detailed entity-level annotation required in NER. Furthermore, certain factors, such as social support, isolation, or adverse childhood experiences, may not lend themselves well to strict entity identification, as these factors often need to be inferred from the context. By adopting a multilabel sentence classification approach, annotators have the flexibility to capture these nuanced factors that cannot be precisely pinpointed as individual entities.

There is a higher probability of finding some of the less studied SDoH factors, such as social support, adverse childhood experiences, and physical abuse, in the clinical notes of patients experiencing mental health issues, as these factors are highly relevant for mental health cohorts. These factors might be less prevalent in clinical notes written by physicians from other specialties47,48; SDoH documentation thus also depends on the patient population. Given this real-world scenario, it is necessary to incorporate clinical notes from multiple medical specialties to develop models that can extract a wide range of SDoH factors. Hence, we utilize our multi-institutional corpora to explore the variations in the distribution of SDoH factors and conduct experiments toward understanding the feasibility of developing generalizable models that can extract multiple SDoH factors from clinical notes. Apart from traditional machine learning models and deep learning models, we also investigate the use of an open-source LLM, LLaMA68, for SDoH extraction and conduct a thorough evaluation of model performance under different settings.

Selection of SDoH factors

We performed an extensive review of the literature to identify prior work on social determinants, including systematic and scoping reviews6,69 focused on SDoH. Several ontologies/terminologies that cover SDoH factors70,71 (e.g., the Ontology of Medically Related Social Entities (OMRSE)72 and the Semantic Mining of Activity, Social, and Health data (SMASH) system ontology73) were also analyzed to identify major concepts, in addition to standards such as SNOMED CT74 and LOINC75. Additional information was obtained from multiple surveys on SDoH, including the All of Us SDoH Survey76, the Million Veteran Program (MVP) Lifestyle Survey77, the 2020 AHIMA Social Determinants of Health (SDOH) Survey78, and the Gravity Project28. Following this process, domain experts were consulted to identify important SDoH factors for each domain, finally narrowing the list down to 21 SDoH factors, which were then utilized for annotating the clinical notes.

Annotation

After obtaining Institutional Review Board (IRB) approval for the study, the datasets were annotated. The Committee for Protection of Human Subjects at UTHealth (UTHealth CPHS) approved the study (HSC-SBMI-12-0754) for utilizing the clinical notes from UTHealth Harris County Psychiatric Center and UT Physicians. For utilization of clinical notes from Mayo Clinic, the study was approved by the Mayo Clinic and the Olmsted Medical Center (OMC) IRBs (Mayo Clinic: 18-006536 and Olmsted Medical Center: 038-OMC-18). The requirement for informed consent was waived by all the governing IRBs under exempt category 4, as the research involved only secondary use of clinical notes and did not involve any patient contact.

The annotation process consists of two levels. At the first level (SDoH factors only), a sentence is assigned one or more labels from the 21 SDoH factors. For example, the sentence “She is single, lives with her parents, works 3 days per week.” will be labeled with ‘marital status’, ‘living status’, and ‘employment status’. The second level of annotation (SDoH factors along with their corresponding values/attributes) provides more granular information about these factors with respect to their subtypes, presence or absence, and temporality. The sentence mentioned above at level 2 will be labeled with ‘marital status - single’, ‘living status - with family’, and ‘employment status - employed, current’. The level 2 SDoH label set contains fine-grained subcategories that are not necessarily mutually exclusive within a given level 1 category. For example, a single sentence could potentially be assigned multiple level 2 labels corresponding to ‘living status - with family - past’ and ‘living status - alone - current’ under the broader ‘living status’ level 1 category. The subcategory labels can also encode assertions of presence/absence as well as temporality markers (e.g., ‘education level - college - current’). During annotation, all applicable level 2 labels were assigned to a given sentence, including multiple subcategories from the same level 1 category. If a sentence does not correspond to any SDoH factor, it is labeled ‘non SDoH’. Detailed information about the 21 SDoH factors and their attributes, along with examples, can be found in the annotation guidelines (see Supplementary Tables 13 and 14 and Supplementary Note 1).
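To make the two annotation levels concrete, the illustrative snippet below shows how the example sentence above could be represented as a multi-label record at each level. The field names and the non-SDoH example sentence are hypothetical and chosen only to mirror the guideline wording; the actual corpora follow the annotation guidelines in the Supplementary Materials.

```python
# Illustrative records for the two annotation levels; field names and the
# second example sentence are hypothetical.
example = {
    "sentence": "She is single, lives with her parents, works 3 days per week.",
    "level_1_labels": ["marital status", "living status", "employment status"],
    "level_2_labels": [
        "marital status - single",
        "living status - with family",
        "employment status - employed, current",
    ],
}

# A sentence with no social determinant carries the single label "non SDoH".
no_sdoh_example = {
    "sentence": "Plan: continue current medications and follow up in 2 weeks.",
    "level_1_labels": ["non SDoH"],
    "level_2_labels": ["non SDoH"],
}
```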

The team that developed the annotation guidelines, performed the annotation of the dataset, and facilitated conflict resolution consisted of an MD specialized in psychiatry, an internal medicine physician, a postdoctoral fellow trained in clinical NLP, a PhD candidate specializing in the development of biomedical ontologies, a master’s student in public health with a prior master’s degree in biomedical informatics, and a research associate with a master’s degree in biomedical informatics and a bachelor’s degree in biomedical engineering. The postdoctoral fellow and the PhD candidate formulated the annotation guidelines under the expert guidance of the two physicians. The master’s student and the research associate performed the annotation of the dataset, and the entire team held regular discussions to resolve conflicts. The annotators went through multiple rounds of training, starting with a detailed discussion of the annotation guidelines and calculating the inter-annotator agreement (Cohen’s Kappa) after each training round. Depending on the dataset (discussed below), each round of annotator training utilized 10 clinical notes or 50 sentences. Once a Kappa value greater than 0.7 was achieved, the remaining notes/sentences were annotated individually by the annotators, and any discrepancies were resolved as described above. The inter-annotator agreement after the final round of training was in the range of 0.73–0.90 across the datasets (similar to the agreement reported in other studies32).

Datasets

The prevalence of duplicate information in EHRs due to copy-pasting, templating, and summarizing is a major barrier to finding relevant information in EHRs79,80. A recent study reported that duplication increased from 33% in 2015 to 54.2% for notes written in 202079. The overrepresentation of certain SDoH factors resulting from this duplication impairs the training and evaluation of machine learning and deep learning models81,82. Hence, we removed duplicate sentences across the multiple notes of each patient from all datasets. This enabled us to reduce the class imbalance among the SDoH factors to some extent. Nevertheless, the distribution of factors still varies, and class imbalance remains. We also observed several clinical notes documenting variations of the text “Nothing new to report”, which were also eliminated.
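As a rough illustration of this de-duplication step, the minimal sketch below drops exact repeats of a sentence across all notes of the same patient. The record layout and the text normalization (lowercasing, whitespace collapsing) are assumptions for illustration, not the exact preprocessing used in the study.

```python
from collections import defaultdict

def deduplicate_sentences(records):
    """Keep the first occurrence of each (patient, normalized sentence) pair.

    `records` is assumed to be an iterable of dicts with "patient_id" and
    "sentence" keys; the normalization is illustrative only.
    """
    seen = defaultdict(set)
    unique = []
    for rec in records:
        key = " ".join(rec["sentence"].lower().split())
        if key not in seen[rec["patient_id"]]:
            seen[rec["patient_id"]].add(key)
            unique.append(rec)
    return unique
```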

We developed four datasets from de-identified and de-duplicated clinical notes of patients diagnosed with different health conditions at different hospitals. These include “psychosocial assessment” notes of patients experiencing mental health issues, “chart notes” of patients diagnosed with Alzheimer’s disease, “social work” notes in the MIMIC-III database49, and clinical notes of patients with chronic pain. The clinical notes were sampled using different techniques based on the note type and on where the majority of the SDoH information was present. The details of these datasets are described below and summarized in Table 5.

Our first corpus comprises psychosocial assessment notes from the UTHealth Harris County Psychiatric Center (HCPC). Psychosocial assessments provide a comprehensive understanding of the psychological, social, and cultural context of a person, guiding the development of individual care plans. HCPC is one of the largest providers of inpatient psychiatric care in the USA. About 10,000 patients are admitted yearly, including adults, adolescents, and children. Commonly treated conditions are psychotic or mood disorders, acute crises, and signs of endangering oneself or others. The EHR goes back to 2001 and includes about 120,000 unique patients. We randomly selected 2000 assessment notes (corresponding to 1529 patients) and extracted their “Social history” sections (using the medspaCy sectionizer83), which are rich in SDoH information. These notes were then annotated by two annotators following the annotation guidelines.
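A minimal sketch of this section extraction step is shown below. It assumes a recent medspaCy release in which the sectionizer is added as the "medspacy_sectionizer" pipe and exposes detected sections via `doc._.sections`; the attribute names and the "social_history" category string are assumptions that may differ across versions and rule sets.

```python
import medspacy

# Assumes a medspaCy version that registers the sectionizer under this pipe
# name and exposes sections on doc._.sections; adjust to your installation.
nlp = medspacy.load()
nlp.add_pipe("medspacy_sectionizer")

def extract_social_history(note_text):
    """Return the text of sections detected as social history (assumed category name)."""
    doc = nlp(note_text)
    return [
        section.body_span.text
        for section in doc._.sections
        if section.category == "social_history"  # category label is an assumption
    ]
```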

The chart notes of patients diagnosed with Alzheimer’s disease and related dementias (ADRD) from UT Physicians (UTP) constituted the second corpus. Several studies have shown the association of SDoH factors such as education level, isolation, and loneliness with the onset of ADRD in older adults84,85,86. UTP provides outpatient care through multiple satellite clinics throughout the greater Houston area. We reviewed various note/document types within the EHR (e.g., progress notes, discharge summaries, procedure notes) and identified that “chart notes” (different systems might use different terminology, as there is no specific standard) were enriched with SDoH documentation. We therefore filtered the UTP data to include only this note type. Unlike the psychosocial assessment notes from HCPC, where the majority of the SDoH factors were described in the “Social history” section, the social history in UTP chart notes mostly recorded information related to substance use. Other SDoH information was scattered under sections titled “History of present illness”, “General observation”, etc. Additionally, chart notes are comparatively long and contain information regarding medications and different body systems that is irrelevant to this study. Hence, to increase annotation efficiency and decrease annotation time, we developed a list of keywords to filter notes rich in SDoH information. The list of keywords was collected during the initial literature review and expanded by combining keywords from our prior work developing an SDoH ontology and those obtained while annotating the clinical notes from HCPC32,70,71,87.
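The keyword filtering step can be approximated as in the sketch below. The keyword list and the hit threshold are hypothetical samples for illustration; the study's actual lexicon was compiled from the literature review, a prior SDoH ontology, and the HCPC annotations.

```python
import re

# Hypothetical sample of SDoH keywords; not the actual lexicon used in the study.
SDOH_KEYWORDS = ["lives alone", "homeless", "unemployed", "divorced",
                 "social support", "alcohol", "smok", "financial"]

pattern = re.compile("|".join(re.escape(k) for k in SDOH_KEYWORDS), re.IGNORECASE)

def is_sdoh_rich(note_text, min_hits=2):
    """Flag a note for annotation if it matches at least `min_hits` keywords.

    The threshold is illustrative; the actual filtering criteria may differ.
    """
    return len(pattern.findall(note_text)) >= min_hits
```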

The social work notes from the MIMIC-III database were utilized to construct the third corpus. The Medical Information Mart for Intensive Care (MIMIC-III) database contains more than two million free-text notes in different categories (e.g., nursing notes, discharge summaries) for patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. It includes 2670 “social work” notes documenting the patients’ social and family history and the interactions of social workers with the patient and family members during the hospital stay. We randomly selected and annotated 500 such notes.

For our fourth dataset, we utilized a subset of the corpus from a study34 that characterized chronic pain episodes in adult patients receiving treatment at the Mayo Clinic and the Olmsted Medical Center. An annotated corpus of 62 adults experiencing noncancer chronic pain was created using a chronic pain ICD code as the anchor diagnosis and retrieving all of each patient’s clinical notes from 6 months before to 2 years after this anchor diagnosis (for more information, please refer to the original study34). For our study, a total of 527 notes corresponding to the same 62 patients (retrieved using our keyword search) were annotated following our annotation guidelines. We annotated the entire note without filtering any sections. These notes exhibited a semi-structured format characterized by template structures, particularly noticeable in sections concerning substance use, along with a large amount of duplicated information.

A schematic representation of the workflow is illustrated in Fig. 4. For brevity, we will hereafter refer to the four datasets by the abbreviations of the hospitals/database from which the notes were extracted: HCPC, UTP, MIMIC-III, and Mayo. During the annotation process, we observed that some of the 21 annotated factors had a notably low prevalence and that these factors varied by corpus. To address class imbalance and enhance model performance, it is common practice to merge classes with fewer samples into a single category. Han et al.32 employed this approach by consolidating their initial set of 14 annotation categories for SDoH classification into eight, merging the less frequent ones into an “other-social” category. However, in our study we chose not to follow this practice and instead trained our models on all classes, including those with a smaller number of samples. By doing so, we aimed to provide a more realistic evaluation of model performance in real-world scenarios, considering the wide variation in the distribution of SDoH factors. Furthermore, retaining and training models on all classes is crucial for our cross-dataset evaluation to ensure consistent and reliable assessment, as differences in the total number and types of classes would otherwise affect the evaluation process.

Fig. 4: A schematic representation of the workflow.

Figure shows some of the SDoH factors, data sources, models used and evaluation process.

Models

We experimented with five different models: XGBoost, TextCNN, SBERT, ClinicalBERT, and LLaMA. Since each sentence in the annotated datasets can be annotated with multiple SDoH factors, we formulated the task of identifying SDoH factors in a sentence as a multi-label binary relevance problem for all models except LLaMA. Formally, given an input sentence \(s\) and a label set \(C\) of SDoH factors, a model produces a binary label in \(\{0,\,1\}\) for each SDoH factor \(c\in C\). For the LLaMA model, we performed supervised instruction fine-tuning: an instruction dataset was created with appropriate task instructions, with the sentences as input and the expected responses as output. Below we provide a brief overview of the five models.
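Under this binary relevance formulation, each sentence's gold labels can be turned into a 0/1 indicator vector over the label set, for example with scikit-learn's MultiLabelBinarizer as sketched below; the label names are only illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative label set; the real corpora use up to 22 (level 1) or
# 71 (level 2) classes, including "non SDoH".
label_set = ["marital status", "living status", "employment status", "non SDoH"]

mlb = MultiLabelBinarizer(classes=label_set)
Y = mlb.fit_transform([
    ["marital status", "living status", "employment status"],
    ["non SDoH"],
])
# Y is an (n_sentences, n_labels) 0/1 matrix; each column corresponds to one
# binary relevance problem solved by the non-LLaMA models.
print(Y)
```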

We used the scikit-learn-compatible implementation of XGBoost88 with tf-idf vectors as input features. OneVsRestClassifier was used with n_jobs = -1 (to use all processors) and a max_depth value of 4 for the underlying classifier. We implemented a TextCNN model following89 for multi-label text classification, using pre-trained word embeddings (GloVe90) of dimension 300 for each word in a sentence as inputs. The model applies convolutional filters with kernel sizes of 3, 4, and 5 to the input. The outputs of the convolutional filters are max-pooled and fed to a set of classification heads, each of which is a feed-forward network corresponding to a label.
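A minimal sketch of the XGBoost setup under one plausible reading of this description is shown below: a tf-idf vectorizer feeding a one-vs-rest wrapper around the xgboost package's scikit-learn-compatible XGBClassifier. Only max_depth=4 and n_jobs=-1 are taken from the text; everything else is left at illustrative defaults.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

# tf-idf features feeding one binary XGBoost classifier per SDoH label.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(XGBClassifier(max_depth=4), n_jobs=-1),
)

# Usage sketch, assuming X_train is a list of sentences and Y_train is the
# binary indicator matrix from MultiLabelBinarizer:
# model.fit(X_train, Y_train)
# Y_pred = model.predict(X_test)
```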

We used a pre-trained Sentence-BERT (SBERT)91 encoder to encode each sentence into a dense vector representation (i.e., a sentence embedding). SBERT fine-tunes BERT31 in a siamese/triplet network architecture to produce semantically meaningful sentence embeddings. We further fine-tuned SBERT for multi-label classification on the SDoH datasets. The output sentence embeddings of SBERT are passed to \(C\) binary classifier heads, each of which is a feed-forward neural network; the SBERT sentence encoder is shared among all classifier heads. We fine-tuned the classifier heads and the shared SBERT encoder by minimizing the binary cross-entropy loss of all classifier heads. During inference, if a classifier head produces a score of 0.5 or more, the input sentence is assigned a positive label for that SDoH factor.

The fourth model, ClinicalBERT50, is a BERT model that has been further pre-trained on text from all note types in the MIMIC-III v1.4 database. We used the same training and inference process as described above for SBERT, simply replacing the encoder with ClinicalBERT.
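The shared-encoder-plus-binary-heads design used for both SBERT and ClinicalBERT can be sketched in PyTorch as follows. This is a minimal sketch, not the exact implementation: a single linear layer producing one logit per label stands in for the per-label feed-forward heads, mean pooling stands in for the sentence embedding, and the checkpoint name would be the relevant SBERT or ClinicalBERT model on the Hugging Face hub.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiLabelSentenceClassifier(nn.Module):
    """Shared transformer encoder with one binary output per SDoH label."""

    def __init__(self, model_name, num_labels):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # One logit per label; stands in for the per-label feed-forward heads.
        self.heads = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        mask = attention_mask.unsqueeze(-1).float()
        # Mean pooling over non-padding tokens (an assumption for this sketch).
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        return self.heads(pooled)  # raw logits, one per SDoH label

# Training uses binary cross-entropy over all heads; at inference a sigmoid
# score of 0.5 or more assigns the corresponding SDoH label.
# loss_fn = nn.BCEWithLogitsLoss()
# probs = torch.sigmoid(model(input_ids, attention_mask))
# predictions = (probs >= 0.5)
```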

We utilized the open-source pretrained LLM LLaMA 2 7B68,92 and adapted it for multilabel classification through instruction fine-tuning. The training data for each dataset were converted into instruction demonstrations. An example of an instruction demonstration is shown in Supplementary Fig. 5; it has three components: an instruction describing the task to perform, an input, which in this case is the sentence from which SDoH factors need to be extracted, and an output, which is the gold annotation label(s) for that sentence. We followed the approach described in ref. 93, using Hugging Face’s training framework with fully sharded data parallelism and mixed-precision training51. During inference, we provided the fine-tuned models with the “instruction” and “input” to generate the “output”. We also performed preliminary experiments with LLaMA 2 7B and 13B (base and chat variants) models using different prompts to assess the need for fine-tuning for this task compared to using the base or chat variants of LLaMA directly.
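An instruction demonstration along these lines might look like the snippet below. The instruction wording and output format here are hypothetical paraphrases; the exact prompts and demonstration template used in the study are given in Supplementary Fig. 5 and the Supplementary Notes.

```python
# Hypothetical instruction demonstration; the actual wording is in
# Supplementary Fig. 5 and the Supplementary Notes.
demonstration = {
    "instruction": (
        "Identify all social determinants of health (SDoH) factors mentioned "
        "in the input sentence and return them as a comma-separated list. "
        "If no SDoH factor is present, return 'non SDoH'."
    ),
    "input": "She is single, lives with her parents, works 3 days per week.",
    "output": "marital status, living status, employment status",
}
```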

Evaluation

In total, we have eight annotated corpora: two versions (level 1: SDoH factors only; level 2: SDoH factors and values/attributes) of each of the four datasets. We evaluated the models on all corpora by first training and testing on the same corpus and then performing cross-dataset evaluation. To assess the models’ performance in assigning multiple SDoH categories to a sentence, we computed micro-averaged precision (P), recall (R), and F1. The micro-average is calculated by pooling the total numbers of true positives, false positives, and false negatives across all classes.
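With gold and predicted labels expressed as binary indicator matrices, these scores can be computed as sketched below; the macro-averaged scores reported in Table 2 follow by switching the `average` argument. This is an illustrative sketch, not the study's exact evaluation code.

```python
from sklearn.metrics import precision_recall_fscore_support

def multilabel_scores(y_true, y_pred, average="micro"):
    """Compute P/R/F1 for (n_sentences, n_labels) 0/1 indicator arrays.

    Micro-averaging pools true/false positives and false negatives across all
    classes; average="macro" instead averages the per-class scores.
    """
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=average, zero_division=0
    )
    return precision, recall, f1
```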

We performed fivefold cross-validation by dividing each corpus into five folds; the performance on each fold was recorded, and the average performance across all five folds was calculated. Models were trained and tested only on the SDoH categories present in the corpus, and hence the number of classes differs across the datasets.

Model performance is usually good when models are trained and tested on the same corpus. We therefore wanted to evaluate how different models would perform when trained on one corpus and evaluated on another (cross-dataset evaluation). Each corpus was randomly divided in a 7:1:2 ratio into train, validation, and test splits, respectively. Models were trained and tested on 22 classes (21 SDoH and 1 non SDoH) for the level 1 annotated corpora and 71 classes (70 SDoH + values and 1 non SDoH) for the level 2 annotated corpora.
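The 7:1:2 split can be realized, for example, with two successive random splits, as in the sketch below; the random seed is illustrative and the study's exact splitting procedure may differ.

```python
from sklearn.model_selection import train_test_split

def split_70_10_20(sentences, labels, seed=42):
    """Split a corpus into 70% train, 10% validation, and 20% test.

    `sentences` and `labels` are assumed to be aligned sequences; the seed
    is illustrative only.
    """
    x_train, x_rest, y_train, y_rest = train_test_split(
        sentences, labels, test_size=0.3, random_state=seed
    )
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=2 / 3, random_state=seed
    )
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```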

With factors varying in their distribution across datasets, how would performance be affected if we combined the datasets and trained a single model (combined dataset evaluation)? To evaluate this, we retained the exact splits used for cross-dataset evaluation, combined the training splits of all four corpora at level 1 into a single training split, and combined the validation splits into a single validation split. The test splits were preserved separately. The same process was repeated for the level 2 annotated corpora. Next, a model was trained on the combined training data, validated, and tested on the test split of each corpus separately. The number of classes, as in the cross-dataset evaluation, was 22 for level 1 and 71 for level 2.