Introduction

Social determinants of health (SDoH) are the conditions in which people grow, live, work, and age that influence their quality of life and health4. These determinants encompass a broad range of socioeconomic factors, including family support, employment status, and education, as well as health-related behaviors, such as substance use and physical activity5.

Despite a continuous increase in overall life expectancy, social inequalities in health persist and are widening throughout the life course1,2. Medical progress can prolong the life expectancy of individuals with severe diseases, but care management cannot be disconnected from the socio-economic environments where patients live3. The outcomes of chronic diseases are shaped by a combination of behaviors and exposures, resulting in a complex socio-biological process that determines both individual and societal health status4.

In addition to shaping behaviors, socioeconomic conditions also determine individuals’ exposure to environmental risks, such as air pollution, noise, and extreme weather events, which further impact health outcomes6. Together, social and behavioral factors are major drivers of health disparities, contributing to 47% and 34% of patient outcomes, respectively7.

In clinical settings, SDoH are documented in electronic health records (EHRs) through both structured data (e.g., coded fields) and unstructured data (e.g., clinical notes)8. However, unstructured clinical notes provide more intricate and detailed representations of SDoH than structured data9,10. To enable large-scale secondary use of EHR data, there is an increasing need for automated methods to extract and structure patient SDoH, enhancing our ability to study the impact of social inequalities on health11. In this context, studying SDoH in clinical sciences is essential for understanding how factors like income, education, housing, and social environments shape health outcomes beyond technical and biological variables. Integrating insights from SDoH into clinical practice highlights opportunities for early intervention, holistic patient care, and addressing health inequities at their root. Furthermore, recognizing the actionable nature of SDoH enables the development of evidence-based public health policies that target structural barriers and improve population health at scale12.

In recent years, automatic extraction of SDoH has been widely studied in the English language using natural language processing (NLP)13. Progress has been accelerated by the dissemination of annotated corpora and shared task challenges, for example the i2b2 NLP Smoking Challenge on identifying patients’ smoking status with a corpus of 502 clinical notes14 and the 2022 n2c2 shared task, which explored the extraction of SDoH from 4,405 social history sections from clinical notes, including substance use (alcohol, drug and tobacco), employment and living conditions15. Most subsequent studies have focused on U.S. hospital settings, showing that NLP methods applied to unstructured data from EHRs can reliably identify key social risk factors16,17,18,19,20.

In terms of coverage, the most commonly addressed SDoH are smoking status, substance abuse (alcohol and drug) and housing instability13, which are well known for their impact on health, and thus are well documented in EHRs8. In contrast, other SDoH, such as education, employment status, social support and isolation, remain underexplored and present ongoing challenges for NLP systems13.

To address SDoH identification and extraction, a range of NLP methods have been developed. Early approaches relied on rule-based systems and keyword matching, offering high precision but limited generalizability21. Subsequently, semantic word embedding methods such as word2vec22 enabled more nuanced lexical representations, supporting downstream machine learning classifiers for SDoH identification23,24,25,26,27. More advanced approaches have leveraged deep learning architectures, including CNNs28, LSTMs28, and transformer-based models such as BERT28,29,30,31, which have demonstrated improved performance in extracting contextually rich SDoH information.

Recent studies have also shown the potential of large language models (LLMs) for the identification and classification of SDoH in clinical notes. Decoder-only models such as GPT-432 have been used in zero- and few-shot settings33, while encoder-decoder models such as Flan-T549 have been applied for generative SDoH extraction tasks, and are the current state-of-the-art approaches for SDoH extraction15,20,31,33,34,35.

However, most available work focuses on the English language, while resources for other languages remain scarce14,36. The extraction of smoking status from clinical narrative texts has been studied in Spanish37, Finnish38, Swedish39, and in a Korean-English bilingual setting40, but none of the underlying corpora are available. To our knowledge, the extraction of SDoH from French clinical texts has not been addressed.

In this work, we propose a sequence-to-sequence approach based on a large language model for extracting SDoH from French clinical texts. Our study focuses on 13 SDoH categories: living condition, marital status, descendants, employment status, occupation, tobacco use, alcohol use, drug use, housing, education, physical activity, income, and ethnicity/country of birth. To support model development and evaluation, we constructed and manually annotated four datasets consisting of social history sections from clinical notes. Two of these datasets are publicly released to promote reproducibility and support the development of new methods for SDoH extraction in French.

Methods

Data

To train and evaluate the proposed SDoH model, we used four datasets: MUSCADET-InHouse, MUSCADET-Synthetic, UW-FrenchSDOH and InHouse Tuberculosis/ALS.

MUSCADET-InHouse was obtained from clinical notes from the Nantes biomedical data warehouse (NBDW), as summarized in Fig. 1. The NBDW encompasses nearly 1.5 million patients who received care at the Nantes University Hospital over the past 20 years. It includes different dimensions of patient-related data: structured data (e.g., Classification Commune des Actes Médicaux billing codes, a French coding system of clinical procedures; ICD-10 codes; laboratory results; drug administrations), and unstructured data such as outpatient and inpatient clinical notes, radiology and operative reports41.

The NBDW was authorized by the French data protection authority (Commission Nationale de l’Informatique et des Libertés) (Registration code n° 920242). The present study is compliant with French regulatory and General Data Protection Regulation requirements, including informed consent. This study was approved by the Nantes Ethics Group in Healthcare (Groupe Nantais d’Éthique dans le Domaine de la Santé – GNEDS) (Registration code n° 23-3-01-110). All methods were carried out in accordance with relevant guidelines and regulations.

A total of 1,144,443 clinical notes were selected from the NBDW according to the following inclusion criteria: age ≥ 18 years and the presence of clinical notes within the NBDW between August 1, 2018, and June 1, 2022. The non-inclusion criterion was patient opposition to data reuse. We then focused on semi-structured clinical notes containing predefined sections (e.g., ‘History’, ‘Medications’, ‘Social History’), with a particular emphasis on two categories: consultation reports and hospital stay reports. These two types of notes span multiple medical specialties, resulting in a total of 206,973 clinical notes covering diverse patient profiles. The clinical notes selected in the previous step were filtered to retain only those containing a social history section, for a total of 32,666 clinical notes. The social history section was extracted using a rule-based approach. Finally, 1,700 social history sections were randomly selected to constitute our corpus for annotation. This dataset was randomly divided into training (70%), validation (10%) and test (20%) sets for our experiments.
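The 70/10/20 split described above can be illustrated with a minimal sketch. The function name `split_corpus`, the seed, and the use of integer document identifiers are assumptions made for illustration; the actual splitting procedure is not detailed in the text.

```python
import random


def split_corpus(doc_ids, seed=0, train=0.7, valid=0.1):
    """Shuffle document ids and cut them into train/validation/test lists.

    The remaining share (here 20%) goes to the test set.
    """
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # seeded for a reproducible split
    n_train = round(len(ids) * train)
    n_valid = round(len(ids) * valid)
    return (
        ids[:n_train],
        ids[n_train:n_train + n_valid],
        ids[n_train + n_valid:],
    )


# 1,700 annotated social history sections, as in MUSCADET-InHouse
train_ids, valid_ids, test_ids = split_corpus(range(1700))
print(len(train_ids), len(valid_ids), len(test_ids))  # 1190 170 340
```

With 1,700 documents, this yields 1,190 training, 170 validation and 340 test documents, which is consistent with the test set size of 340 reported for MUSCADET-Synthetic.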

To assess the generalization capabilities of our model, we constructed three external test datasets. For reproducibility, we introduce two open-source datasets: MUSCADET-Synthetic and UW-FrenchSDOH.

MUSCADET-Synthetic comprises synthetic social history section texts written by a physician. These synthetic documents follow the template of real medical records but were entirely written from scratch, ensuring that they do not reference any real patient. The corpus includes 340 documents, matching the test set size of MUSCADET-InHouse.

UW-FrenchSDOH, the second open dataset, is an automatically translated version of an existing dataset from the University of Washington (UW)42,43. It consists of 364 social history sections collected from MTSamples. The dataset was translated into French using GPT-4o (gpt-4o-2024-11-20) and manually corrected during annotation.

InHouse Tuberculosis and ALS. Since other datasets focus only on social history sections, it was essential to assess the model’s effectiveness in a broader clinical context, particularly on non-SDoH texts, to determine its propensity for false positives. To this end, we applied the model to full clinical notes in two use cases, focusing on patients hospitalized for tuberculosis or amyotrophic lateral sclerosis (ALS). These diseases were selected due to the significant impact of SDoH on their outcomes44,45,46. Both groups of patients and related clinical notes were selected on ICD-10 criteria: A15-A19 for tuberculosis and G12.2 for ALS. A total of 1,186 patients were identified for tuberculosis and 647 for ALS. We included the first clinical note recorded for each patient visit associated with the respective ICD-10 code as the principal diagnosis. For each disease, 200 clinical notes were fully annotated to serve as a test set, for a total of 400 clinical notes.

Fig. 1

MUSCADET-InHouse corpus construction flow-chart.

Annotation scheme

The annotation scheme was designed to provide a broad coverage of SDoH, with a fine-grained description of the determinants. The annotation scheme includes entities, attributes, and relations between entities.

The annotation scheme comprises 25 entities related to SDoH, covering 13 SDoH categories (living condition, marital status, descendants, employment status, job, tobacco use, alcohol use, drug use, housing, education, physical activity, income and ethnicity/country of birth) and 6 entities related to relations (StatusTime, History, Duration, Amount, Frequency, Type). SDoH entities are either text span-only (Job, Income, Education, Ethnicity, Alcohol, Tobacco, Drug) or labeled with categories (Living, MaritalStatus, Descendants, Employment, Housing, PhysicalActivity), while all relation entities are span-only. Table 1 presents all entities.

Table 1 Entities description. *Processing of sensitive data such as ethnicity is subject to legal restrictions under the General Data Protection Regulation (GDPR), and its systematic collection or use for secondary research purposes is often not permitted, which restricts its structured availability. The country of birth is therefore often used as a substitute.

The annotation scheme also includes six relations (Table 2):

  • Status: encodes the status (current, past, or none) of the substance use (alcohol, tobacco, or drugs).

  • History: links any event to its date of occurrence.

  • Duration: encodes the duration of exposure to substance use.

  • Amount: encodes the quantity of substance use or the number of children in a lineage. Units of measurement: number of glasses, number of cigarettes, number of children, grams, etc.

  • Frequency: gives the frequency of an event’s occurrence. This relation is also used when the amount related to substance use is not precise enough. For example: drinks occasionally.

  • Type: details certain entities, such as the type of lineage or the type of substance use.

Table 2 Relations. Involved entities in bold indicate that the relation is required when the SDoH entity is annotated. *Indicates that any entity can be linked.

Following the work on the SHAC corpus47, we annotated substance use (tobacco, alcohol, and drugs) using an event-based scheme characterized by a trigger entity and status-related attributes.

We used the BRAT Rapid Annotation Tool (BRAT) for dataset annotation48. For the MUSCADET-InHouse dataset, the annotation process was carried out in three phases, with inter-annotator agreement calculated at the end of the first two phases: (1) a preliminary annotation phase on 100 documents by three annotators (PCDB, a physician; AB, an NLP researcher; and MK, an epidemiologist) to evaluate the annotation scheme and refine the guidelines; (2) a second annotation phase on 200 documents by two annotators (PCDB, AB) to validate the modifications made to the guidelines following the first phase; and (3) a final annotation phase during which each annotator worked independently according to the finalized annotation guidelines.

For the MUSCADET-Synthetic dataset, all texts were annotated by three annotators (PCDB, AB, MK). The UW-FrenchSDOH dataset was annotated by a single annotator (AB), while the InHouse Tuberculosis and ALS dataset was annotated by two annotators (PCDB, AB). We computed inter-annotator agreement (IAA) values for entities using the F-measure from the open-source tool bratiaa (https://github.com/kldtz/bratiaa) for each annotator pair. For relation annotations, F-measure scores were computed using an in-house script. The BRAT configuration files and the scripts for IAA computation are available in the project’s repository (https://github.com/CliniqueDesDonnees/SDoH).
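To make the pairwise agreement measure concrete, the following sketch computes an F-measure between two annotators by treating one as the reference, then averages over all annotator pairs. It is not the bratiaa implementation nor the in-house relation script; the tuple layout `(doc_id, label, start, end)`, the annotator names, and the exact-match criterion are assumptions for illustration.

```python
from itertools import combinations


def pairwise_f1(ann_a, ann_b):
    """F-measure between two annotation sets.

    Taking either annotator as the reference gives the same value:
    F1 = 2 * |A ∩ B| / (|A| + |B|).
    """
    a, b = set(ann_a), set(ann_b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


def mean_pairwise_f1(annotations):
    """Average F-measure over every pair of annotators."""
    pairs = list(combinations(sorted(annotations), 2))
    return sum(pairwise_f1(annotations[x], annotations[y]) for x, y in pairs) / len(pairs)


# Toy annotations: (doc_id, label, start, end); names are hypothetical
annotations = {
    "annotator_1": {("doc1", "Tobacco", 0, 5), ("doc1", "Job", 10, 18)},
    "annotator_2": {("doc1", "Tobacco", 0, 5), ("doc1", "Job", 10, 20)},
    "annotator_3": {("doc1", "Tobacco", 0, 5)},
}
print(round(mean_pairwise_f1(annotations), 3))
```

In this toy case the two annotators who disagree on the boundaries of the Job span obtain a pairwise F-measure of 0.5, which shows how strict span matching lowers agreement even when both annotators identified the same concept.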

Annotation statistics

Table 3 presents the distribution of entity types across all datasets. The most frequent entities are Living_WithOthers, MaritalStatus_InRelationship, Descendants_Yes, Job, Tobacco, Alcohol, and Housing_Yes, while the remaining entities appear less frequently. Similarly, Table 4 presents the distribution of relation types across all datasets. The most common relations are Status, Amount, and Type, the latter two being highly associated with the entity Descendants_Yes. The distribution of all possible entity-relation pairs is presented in Supplementary Table 3.

Table 3 Distribution of annotated entities in all datasets. The n corresponds to the number of documents while figures in cells correspond to the number of instances for this entity.
Table 4 Distribution of annotated relations in all datasets. The n corresponds to the number of documents while figures in cells correspond to the number of instances for this relation.

Experiment

Following recent studies on SDoH extraction leveraging large language models (LLMs)33,34, we used the Flan-T5-Large model49 in our experiments. We formulated SDoH extraction as a text-to-structure translation task, where the model receives a social history section from a clinical note and generates a linearized sequence of SDoH events. This sequence-to-sequence (seq2seq) formulation allows the model to jointly predict entities, attributes, and their relations within a single decoding pass. As illustrated in Fig. 2, the output events are ordered from left to right according to their token offsets in the original text.
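As a rough illustration of the text-to-structure formulation, the sketch below sorts events by their offset in the source text and serializes them into a single target string. The bracketed output format, the dictionary keys, and the toy sentence are hypothetical; the actual serialization used for fine-tuning is the one shown in Fig. 2 and is not reproduced here.

```python
def linearize(events):
    """Serialize events left to right by their start offset.

    Each event is a dict with 'label', 'text', 'start' and 'args',
    where 'args' is a list of {'role', 'text'} dicts (assumed layout).
    """
    ordered = sorted(events, key=lambda e: e["start"])
    parts = []
    for event in ordered:
        args = "".join(f" ({a['role']}: {a['text']})" for a in event["args"])
        parts.append(f"[{event['label']}: {event['text']}{args}]")
    return " ".join(parts)


# Toy input, deliberately listed out of order
source = "Lives with his wife. Tobacco-free for 33 years."
events = [
    {"label": "Tobacco", "text": "Tobacco", "start": 21,
     "args": [{"role": "StatusTime", "text": "free"}]},
    {"label": "Living_WithOthers", "text": "Lives with his wife", "start": 0,
     "args": []},
]
target = linearize(events)
print(target)
```

Ordering by offset gives the decoder a deterministic target sequence, so two annotations of the same text always map to the same training string.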

To train the model, Flan-T5-Large was fine-tuned on the MUSCADET-InHouse training set, with input-output representations presented in Fig. 2. The model was fine-tuned for 10 epochs on two 24GB NVIDIA RTX 4090 GPUs.

Fig. 2

Example of the model input-output format used for fine-tuning: (1) annotated social history section as input and (2) the corresponding structured sequence of SDoH events as output. Example translated into English: Social History: Lives with his wife and two daughters. Does the housework, gardening, and drives. Has been tobacco-free for 33 years and alcohol-free for 15 years.

During inference for evaluation, the sequence of SDoH events generated by the model was post-processed to recover token offsets corresponding to entities and relations. The source text was then searched for tokens matching the SDoH events in the output sequence. However, the output sequence tokens often matched multiple offsets in the source text. Ambiguities were resolved by applying distinct strategies for entities and relations. For entities, when multiple non-overlapping matches were found, we selected the leftmost occurrence in the source text that had not already been extracted. For relations, we selected the match nearest to the associated entity to ensure contextual accuracy.
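The two disambiguation strategies described above can be sketched as follows. This is an illustrative reading of the description, not the authors' post-processing script: the function names, the half-open `(start, end)` character offsets, and the distance definition used for "nearest" are assumptions.

```python
def find_all(text, span):
    """Start offsets of every non-overlapping occurrence of span in text."""
    if not span:
        return []
    starts, i = [], text.find(span)
    while i != -1:
        starts.append(i)
        i = text.find(span, i + len(span))
    return starts


def locate_entity(text, span, used):
    """Leftmost occurrence of span that no earlier entity has claimed."""
    for start in find_all(text, span):
        offsets = (start, start + len(span))
        if offsets not in used:
            used.add(offsets)
            return offsets
    return None


def locate_relation(text, span, entity_offsets):
    """Occurrence of span with the smallest character gap to its entity."""
    e_start, e_end = entity_offsets

    def gap(start):
        end = start + len(span)
        if end <= e_start:
            return e_start - end  # relation span lies before the entity
        if start >= e_end:
            return start - e_end  # relation span lies after the entity
        return 0                  # spans overlap

    starts = find_all(text, span)
    if not starts:
        return None
    best = min(starts, key=gap)
    return (best, best + len(span))


text = "Tobacco: 2 per day. Alcohol: 2 per day."
used = set()
tobacco = locate_entity(text, "Tobacco", used)
alcohol = locate_entity(text, "Alcohol", used)
# The amount "2" occurs twice; each entity takes the closest one.
print(locate_relation(text, "2", tobacco), locate_relation(text, "2", alcohol))
```

The example also shows the limitation of the nearest-match rule: it works when arguments sit next to their trigger, but it cannot recover an argument that is farther away than another identical string.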

For the InHouse Tuberculosis and ALS dataset, we evaluated the model using both the full clinical notes and the social history sections alone after preprocessing to assess its robustness on non-social history section texts.

Evaluation

We conducted the evaluation at two levels: (i) SDoH factors and associated values, and (ii) the fine-grained SDoH extraction including all entities and relations.

In the level 1 evaluation, we assessed the exact match presence of labeled entities in the gold standard and the model’s predictions. For alcohol, tobacco, and drug use, we included the corresponding Status relations to convert span-only entities into labeled entities (e.g., for Tobacco: Tobacco_StatusTime: current, Tobacco_StatusTime: past, Tobacco_StatusTime: none).
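A minimal sketch of this conversion is given below, assuming that each predicted or gold event is available as a dictionary with a `label` and an optional `status` value. The function names and the handling of a substance entity without status (kept as a plain label) are assumptions.

```python
SUBSTANCES = {"Tobacco", "Alcohol", "Drug"}


def level1_label(entity_label, status=None):
    """Turn a span-only substance entity into a labeled entity using its status."""
    if entity_label in SUBSTANCES and status is not None:
        return f"{entity_label}_StatusTime: {status}"
    return entity_label  # already-labeled entities are kept unchanged


def level1_labels(events):
    """Set of level 1 labels present in one document."""
    return {level1_label(e["label"], e.get("status")) for e in events}


gold = level1_labels([
    {"label": "Tobacco", "status": "past"},
    {"label": "Living_WithOthers"},
])
pred = level1_labels([
    {"label": "Tobacco", "status": "current"},
    {"label": "Living_WithOthers"},
])
# A wrong status counts as both a false positive and a false negative.
print(sorted(gold & pred), sorted(pred - gold), sorted(gold - pred))
```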

In the level 2 evaluation, we evaluated the extraction of SDoH as a slot-filling task, following prior work on evaluating SDoH extraction models in the context of the 2022 n2c2/UW Shared Task15. This approach allows for multiple equivalent span annotations. Figure 3 illustrates this by presenting the same sentence with two equivalent sets of annotations.

Event equivalence was defined using two criteria for model evaluation: exact-match spans and overlap-match spans. In the exact match setting, two events were considered equivalent if the entity offsets matched exactly between the gold standard and predictions, and their associated relation offsets also matched exactly. In the overlap match setting, two events were considered equivalent if the entity offsets shared at least one overlapping character between the gold standard and predictions, and their associated relation offsets also shared at least one overlapping character.
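The two equivalence criteria can be expressed compactly, as in the sketch below. It assumes half-open `(start, end)` character offsets, one span per relation type, and that two events must carry the same entity label and the same set of relation types to be compared; these details are assumptions and may differ from the evaluation script.

```python
def exact(a, b):
    """Spans match only if both offsets are identical."""
    return a == b


def overlap(a, b):
    """Spans match if they share at least one character (half-open offsets)."""
    return a[0] < b[1] and b[0] < a[1]


def events_equivalent(gold, pred, span_match):
    """Compare entity spans, then every relation span, with the same criterion."""
    if gold["label"] != pred["label"]:
        return False
    if not span_match(gold["span"], pred["span"]):
        return False
    if gold["relations"].keys() != pred["relations"].keys():
        return False
    return all(
        span_match(gold["relations"][r], pred["relations"][r])
        for r in gold["relations"]
    )


# "Active smoking at 17 cigarettes per day." annotated in two ways
gold = {"label": "Tobacco", "span": (7, 14), "relations": {"Amount": (18, 31)}}
pred = {"label": "Tobacco", "span": (0, 14), "relations": {"Amount": (18, 20)}}
print(events_equivalent(gold, pred, exact), events_equivalent(gold, pred, overlap))
```

Here the prediction selects "Active smoking" and "17" where the reference selects "smoking" and "17 cigarettes": the events differ under exact match but are equivalent under overlap match.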

The performance of the model for all evaluation settings was measured using macro precision (P), recall (R), and F1-score (F1). For level 2 evaluation, the performance was measured on each SDoH category by averaging the performance of all possible entity-relation pairs within that category (the distribution of all possible entity-relation pairs is given in Supplementary Table 3).
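For reference, macro-averaging can be sketched as follows: precision, recall and F1 are computed per class from true positive, false positive and false negative counts, then averaged with equal weight per class. The counts below are invented, and two choices are assumptions: macro-F1 is taken as the mean of per-class F1 scores, and undefined ratios are set to 0.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class; undefined ratios are set to 0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def macro_average(counts):
    """Unweighted mean of per-class precision, recall and F1."""
    scores = [prf(*c) for c in counts.values()]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Invented (tp, fp, fn) counts for a frequent and a rare class
counts = {"Tobacco": (8, 2, 2), "Housing": (1, 1, 3)}
macro_p, macro_r, macro_f1 = macro_average(counts)
print(round(macro_p, 3), round(macro_r, 3), round(macro_f1, 3))
```

Because every class weighs the same, a rare and poorly predicted class pulls the macro score down as much as a frequent one, which is worth keeping in mind when reading the per-category results.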

Fig. 3

Examples of substance use annotated as events. Annotations (1) and (2) are considered equivalent. English translation of the example: Active smoking at 17 cigarettes per day.

Comparison with structured EHR data

To assess the completeness of SDoH documentation in structured versus unstructured EHR data, we collected Z-codes for all patients in the MUSCADET-InHouse dataset. Z-codes are ICD-10 codes that describe factors influencing health status and healthcare utilization when the primary reason for the encounter is not a specific disease or injury; only some of them are related to SDoH. All collected Z-codes for MUSCADET-InHouse patients were manually mapped to SDoH categories where relevant (see Supplementary Table 1). We compared the presence of one or more SDoH categories in manually annotated text from MUSCADET-InHouse against the corresponding patient’s Z-codes from structured EHR data.
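The comparison can be sketched as a per-patient set comparison. The mapping below is a small illustrative excerpt built from three ICD-10 codes and is not the mapping of Supplementary Table 1; the patient identifiers, the category names chosen for the mapping, and the definition of overlap (at least one shared category for a patient) are assumptions.

```python
# Illustrative mapping only; the study's mapping is in Supplementary Table 1
ZCODE_TO_CATEGORY = {"Z60.2": "Living", "Z59.0": "Housing", "Z72.0": "Tobacco"}


def categories_from_zcodes(codes):
    """SDoH categories implied by a patient's Z-codes; unmapped codes are ignored."""
    return {ZCODE_TO_CATEGORY[c] for c in codes if c in ZCODE_TO_CATEGORY}


def compare_sources(text_cats, zcodes):
    """Count patients with SDoH information in text, in Z-codes, and in both."""
    patients = sorted(set(text_cats) | set(zcodes))
    in_text = {p for p in patients if text_cats.get(p)}
    in_codes = {p for p in patients if categories_from_zcodes(zcodes.get(p, []))}
    both = {
        p for p in in_text & in_codes
        if text_cats[p] & categories_from_zcodes(zcodes[p])
    }
    return {"n": len(patients), "text": len(in_text),
            "codes": len(in_codes), "overlap": len(both)}


# Toy patients: annotated categories from text and coded Z-codes
text_cats = {"p1": {"Living", "Tobacco"}, "p2": {"Job"}, "p3": set()}
zcodes = {"p1": ["Z60.2", "Z00.0"], "p3": ["Z59.0"]}
summary = compare_sources(text_cats, zcodes)
print(summary)
```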

Results

Inter-annotator agreement

Table 5 presents the inter-annotator agreement scores (F-measure). During the first phase of annotation for the MUSCADET-InHouse dataset, the average entity agreement was 0.689 before adjudication; it improved to 0.725 in the second phase. A similar trend was observed for relations, with an F-measure of 0.795 in the first phase, increasing to 0.829 in the second phase. For the MUSCADET-Synthetic dataset, the average agreement was 0.742 for entities and 0.788 for relations.

Table 5 Inter-annotator agreement.

Model performance

Table 6 shows the macro-averaged performance of the fine-tuned Flan-T5-Large model on all datasets. The model demonstrates stable performance when evaluated on the social history sections alone, achieving a macro-F1 score ranging from 0.7618 to 0.7863 in the level 1 evaluation setting (SDoH entities with associated values), and from 0.3934 to 0.4804 in the level 2 setting (SDoH extraction with all entities and relations) under the exact match criterion. However, when applied to full clinical notes from the InHouse Tuberculosis and ALS dataset, the model produces a high number of false positives, resulting in a substantial drop in performance, with a macro-F1 of 0.4017 in the level 1 evaluation and 0.0451 in the level 2 evaluation. The model’s performance on the UW-FrenchSDOH dataset is slightly lower in the level 2 evaluation, likely because the dataset was translated from English to French. Although the SDoH-related terms are accurately translated, the word order and writing style retain Anglophone patterns. Since the model was not trained on such translated or non-native-like data, it struggles to extract SDoH text spans precisely.

Table 6 Macro-averaged precision, recall, and F1 metrics of the seq2seq Flan-T5-Large model across all SDoH datasets.

Table 7 presents the model’s performance for each SDoH category. Some categories, such as living condition, marital status, descendants, job, tobacco and alcohol use, are well modeled by Flan-T5-Large with F1 scores over 0.80. These SDoH categories are often expressed in consistent, structured ways in clinical documents, making it easier for the model to learn their patterns. In contrast, categories such as employment status, housing, physical activity, income, and education present greater challenges. The model struggles to achieve consistent performance on these categories, likely due to the greater linguistic variability and contextual diversity in how they are documented in the clinical notes. This variability, along with the scarcity of certain SDoH categories, makes generalization more difficult, especially when annotations vary in phrasing or context.

Table 7 Performance of Flan-T5-Large on SDoH categories. Precision, recall and F1 scores are reported as macro-averaged scores across all entity-relation pairs within each SDoH category.

Impact of applying model on non-social history section text

Applying the model to entire clinical documents significantly increases the number of incorrect predictions (false positives). Specifically, the model’s performance on full-text documents is considerably lower than when applied to social history sections only, with a macro-F1 of 0.4017 compared to 0.7893 in the level 1 evaluation setting. This discrepancy is expected, as the model was trained exclusively on social history sections and does not generalize effectively to other parts of the clinical notes.

While restricting inference to social history sections improves precision and overall performance, this approach risks missing important SDoH information that may appear elsewhere in the document. Thus, there is a trade-off between achieving high precision and ensuring comprehensive recall of patient-related social information.

To assess the extent of information missed under this constraint, we compared the number of SDoH annotations in the InHouse Tuberculosis and ALS dataset across full documents versus social history sections only. Across the full dataset, 665 annotations were identified, of which 461 (69.3%) were located within the social history sections. This indicates that 204 annotations (30.7%) lie outside these sections. Among these, 87 annotations occurred in documents that do not include a social history section, while the remaining 117 were found outside the social history section in documents that did include one. Notably, 81 of the 117 were redundant—i.e., SDoH categories that were already mentioned within the corresponding social history section. The remaining 36 annotations represented unique SDoH information not captured in the social history sections. These were primarily related to substance use (tobacco, alcohol, and drug use), commonly discussed in sections such as medical history or risk factors.

In total, restricting model inference to social history sections results in 123 missed unique SDoH annotations (87 from documents without a social history section, and 36 unique mentions from documents with one), accounting for approximately 18.5% of all SDoH annotations in the InHouse Tuberculosis and ALS dataset.
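The breakdown reported above can be checked with a few lines of arithmetic. All counts come from the text; only the variable names are introduced here.

```python
total = 665                      # annotations in the full documents
in_social_history = 461          # annotations inside social history sections

outside = total - in_social_history                 # 204
no_section_docs = 87                                # in documents without a section
outside_with_section = outside - no_section_docs    # 117
redundant = 81                                      # already stated in the section
unique_outside = outside_with_section - redundant   # 36

missed = no_section_docs + unique_outside           # 123

print(f"inside:  {in_social_history / total:.1%}")  # 69.3%
print(f"outside: {outside / total:.1%}")            # 30.7%
print(f"missed:  {missed / total:.1%}")             # 18.5%
```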

Error analysis

Supplementary Table 2 provides an overview of the primary differences observed between the model outputs and the reference annotations. Through qualitative inspection, we categorized these discrepancies into eight distinct error types: (1) human annotation errors, (2) false positives, (3) false negatives, (4) difficulties in adhering to the structured output format, and (5) cases where the predicted text span was correct but the associated label was incorrect. Additional discrepancies labeled as errors were, in fact, not entirely incorrect; these resulted from (6) post-processing rules—for instance, when multiple identical text spans were present—or from (7) model predictions that differed from the ground truth annotation in terms of text spans but were nonetheless valid in the context of the slot-filling task. A small number of errors also stemmed from (8) limitations of the tokenizer, which did not support several French characters, such as ï. This led the model to generate incorrect forms, such as producing ‘cocane’ instead of ‘cocaïne’, thereby introducing errors in post-processing.

Comparison with structured EHR data

Manual annotation of the MUSCADET-InHouse dataset identified at least one SDoH category in 98.5% of patients (1621/1646). In contrast, structured EHR data, based on Z-codes, captured SDoH information in only 2.8% of cases (46/1646). Among these, 17 SDoH mentions overlapped between the two sources. The remaining non-overlapping instances from the structured data were mainly associated with Z-codes such as Z29.0 and Z60.20, which correspond to living alone.

Discussion

We developed a sequence-to-sequence model to extract 13 SDoH categories from French clinical notes, demonstrating the potential of large language models for enhancing the collection of real-world SDoH data. Our model performed well in identifying SDoH mentions in clinical notes and showed consistent performance across four datasets, including two that are publicly available to the research community. SDoH mentions annotated in clinical notes identified 98.5% of patients with relevant information, compared to 2.8% for ICD-10 codes from structured EHR data, underscoring the added value of unstructured data.

These results highlight the effectiveness of NLP approaches in leveraging unstructured clinical notes to improve the completeness of real-world data, which is often missing or sparsely represented in structured EHR data. For example, ICD-10 Z-codes describing SDoH (e.g., ‘Problems relating to housing and economic circumstances’) are used in less than 5% of cases by clinicians in routine discharge coding practice, whereas automated NLP systems can recover comparable information with far less effort, requiring about one day of processing versus nine person-days per physician50. Yet clinician documentation habits remain a bottleneck: in a U.S. study of > 5 million patients, structured data such as address and race were well-documented, while housing, income, and social isolation were mentioned in less than 5% of records51. Greater clinician awareness and consistent recording of these factors are therefore essential.

Our model achieved strong performance (macro-F1 > 0.80) in identifying well-documented SDoH categories (living condition, marital status, descendants, smoking status, alcohol use, employment and physical activity) but lower scores for housing status and drug use. These discrepancies were primarily due to inconsistencies in human annotation, limited training data, and highly variable language, ranging from direct mentions (e.g., “apartment,” “house”) to more indirect or context-dependent references (e.g., “nursing home,” “home nurse,” “in-home assistance”). These results highlight the strengths of our approach in extracting high-level SDoH categories (level 1 evaluation), which is particularly relevant for secondary use applications. Since the output is already structured for each SDoH category, it can be directly integrated into clinical databases and research cohorts without requiring additional post-processing. However, when more granular detail is needed, fine-grained SDoH extraction with entities and relations (level 2) involves additional post-processing steps and results in lower and less stable performance across SDoH categories. This indicates that while the model is reliable for detecting whether broad SDoH concepts are present (useful for screening, surveillance, or cohort characterization52), its outputs should be treated with caution when detailed entity or relation-level information is required for tasks in clinical practice such as care planning or automated decision support.

Direct comparisons with previous studies are challenging due to methodological differences in annotation schemes, evaluation strategies, and underlying SDoH distribution. To the best of our knowledge, the study by Romanowski et al.34 is the only prior work using a model training and evaluation approach comparable to ours. Even though some entities overlap, the annotation of entities and relations differs between the two studies, which limits strict comparability. In addition, the distribution of entities and relations is not equivalent across French and English. For example, substance use is more frequently represented in the English datasets than in our datasets, which may reflect the higher burden of substance abuse in the US53. Among the most comparable categories, such as alcohol, tobacco, and drug use, our results tended to be lower than those reported in English, which may reflect both linguistic challenges and differences in data availability. These observations underscore the need for multilingual benchmarks and harmonized annotation practices to enable robust cross-study comparison in SDoH extraction.

Our error analysis revealed several limitations in using language models to extract SDoH from French clinical notes. While such models are generally capable of identifying the presence of relevant concepts (level 1 evaluation), they often struggle to precisely extract detailed information (level 2 evaluation). This performance gap may be explained by multiple factors, including the relatively small number of model parameters compared to state-of-the-art architectures, and quality issues in the annotated data. Indeed, model performance is inherently limited by the quality and consistency of the annotations, which are difficult to ensure in the SDoH domain due to its conceptual complexity. Annotator bias and inconsistency further reduce reliability and, consequently, model accuracy.

Additional errors stem from the use of English-based tokenizers, which often mishandle accented characters. As a result, post-processing becomes difficult: the predicted spans cannot be reliably aligned with gold annotations, and character offsets are often miscalculated during evaluation. These issues underscore the need for tokenizers and models tailored to specific languages, as most publicly available models are English-centric and may not generalize well to other languages or multilingual contexts54,55.

Moreover, the generation-based approach introduces alignment errors during post-processing. Specifically, selecting the leftmost matching text span to align the generated SDoH outputs can result in incorrect mappings when the same string occurs more than once in a note. Similarly, associating predicted entities with their nearest potential relation arguments can introduce a proximity bias, potentially overlooking longer-range dependencies. Together, these findings suggest that our current generation-based method is promising for high-level SDoH categorization, but more robust modeling and evaluation approaches are required before it can be reliably used for fine-grained extraction in clinical or research workflows.
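The two failure modes described above can be reproduced with a minimal Python sketch. It is a simplified stand-in for the heuristics, not our actual post-processing code: the function names `leftmost_span` and `nearest_argument`, the use of span centers as the distance measure, and the example sentence are all assumptions made for illustration.

```python
from typing import Optional

Span = tuple[int, int]  # half-open character offsets [start, end)


def leftmost_span(text: str, mention: str) -> Optional[Span]:
    """Align a generated mention to the first place it occurs in the text."""
    start = text.find(mention)
    return None if start == -1 else (start, start + len(mention))


def nearest_argument(trigger: Span, candidates: list[Span]) -> Span:
    """Attach a trigger to the candidate whose center is closest to its own."""
    def center(span: Span) -> float:
        return (span[0] + span[1]) / 2
    return min(candidates, key=lambda c: abs(center(c) - center(trigger)))


text = "Tabac : non. Alcool : non."

# Leftmost matching: a generated status "non" meant for alcohol is aligned
# to the first "non", which belongs to the tobacco mention.
aligned = leftmost_span(text, "non")

# Proximity bias: the "Alcool" trigger sits slightly closer to the tobacco
# status on its left than to its own status on its right.
alcohol_trigger = leftmost_span(text, "Alcool")
status_spans = [(8, 11), (22, 25)]
linked = nearest_argument(alcohol_trigger, status_spans)

print(aligned, alcohol_trigger, linked)
```

In both cases the heuristic returns a well-formed span, so the error is silent; constraining the search to text following the trigger, or generating offsets directly, are possible alternatives that we have not evaluated.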

Applying NLP to EHR data poses a persistent challenge of transferability. Models often struggle to maintain consistent performance across different patient sub-populations within the same institution, and even more so across hospitals or over time as clinical language and practices evolve. This limits the reliability, equity, and generalizability of NLP-driven insights and underscores the importance of adaptable, continuously validated models in clinical settings. It also highlights the need for ongoing ad hoc validation studies and for methodological transparency in studies like ours, including the release of code and data. In this context, the choice of model and approach for retrieving SDoH also matters. Recently, the use of LLMs in clinical tasks has expanded rapidly, with proprietary models such as GPT-4 often achieving top performance in benchmarks. However, for SDoH extraction, existing studies suggest that their performance remains limited and often comparable to other deep learning methods that are less computationally intensive33. Beyond performance, proprietary models also raise concerns regarding reproducibility, transparency, bias, and data privacy. In contrast, open-source models like Flan-T5 can be fine-tuned and adapted to the target language and local contexts, offering a more controllable and reproducible approach in low-resource hospital settings.

Another key challenge in SDoH research is the scarcity of resources in languages other than English, which limits the development of NLP methods that account for social and cultural variability across healthcare systems. SDoH are deeply context-dependent, shaped by language, culture, policy, and local healthcare practice, making it essential to develop corpora that reflect diverse populations56. A major motivation behind our work is to address this gap by providing a French-language SDoH corpus that is openly accessible and free from legal constraints. Given the sensitivity of medical data under the General Data Protection Regulation (GDPR), we adopted a dual approach to ensure compliance: generating synthetic social history sections authored by a physician and translating a publicly available English-language dataset from the University of Washington into French. This approach enables us to uphold privacy standards while advancing FAIR (Findability, Accessibility, Interoperability, and Reusability) research principles. By introducing a corpus tailored to the French clinical context, we aim to promote inclusivity, facilitate the development of NLP methods for French-speaking populations, and foster multilingual research in SDoH extraction.

Our study has several limitations that affect the generalizability of our findings. First, our training dataset was derived from a predominantly Caucasian population treated at one comprehensive center, Nantes University Hospital. This demographic skew impacted certain SDoH categories, such as ethnicity, which are more likely to be documented for non-Caucasian individuals. In addition, ethnic data are not usually collected by French physicians unless deemed relevant for healthcare purposes. In general, we observed variation in the amount of SDoH information available across populations. On average, patients born in France (n = 1397, mean = 5.45 (SD 2.32) SDoH mentions) had more SDoH information recorded (P < 0.01 using a Student’s t-test) than patients born outside France (n = 239, mean = 4.87 (SD 2.29) SDoH mentions), suggesting possible disparities in data completeness. A second limitation is that we trained our model only on the social history sections of clinical notes, rather than on full-text documents. While this decision reduced annotation effort, it limited the model’s ability to generalize to other sections. As a result, the model is not directly applicable to raw clinical notes and requires a preprocessing step to isolate the relevant sections prior to inference. This design choice may also reduce recall, as SDoH can appear in other sections of clinical notes. Furthermore, because not all clinical notes include a social history section, information may be missed for certain patients.
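The comparison between patients born in and outside France can be checked from the reported summary statistics alone. The Python sketch below recomputes the test statistic under two stated assumptions: a two-sided Student’s t-test with pooled (equal) variance, and a normal approximation to the t-distribution for the P value, which is reasonable here given the large number of degrees of freedom. The function name `pooled_t_from_summary` is hypothetical, and the result is a consistency check on rounded figures, not a reproduction of the original analysis.

```python
import math


def pooled_t_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Student's t statistic (pooled variance) from group summary statistics.

    Returns (t, degrees of freedom, approximate two-sided P value).
    The P value uses a normal approximation, an assumption that is only
    appropriate when the degrees of freedom are large.
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / std_err
    p_approx = math.erfc(abs(t) / math.sqrt(2))
    return t, df, p_approx


# Values as reported in the text above.
t, df, p = pooled_t_from_summary(1397, 5.45, 2.32, 239, 4.87, 2.29)
print(f"t = {t:.2f}, df = {df}, approximate two-sided P = {p:.4f}")
```

Under these assumptions the statistic comes out at about 3.6 with 1634 degrees of freedom, which is consistent with the reported P < 0.01.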

The availability and quality of SDoH documentation in EHRs are often limited and inconsistent. Real-world data are primarily collected for clinical and administrative purposes by physicians during patient care, rather than for secondary use in research. Consequently, most SDoH categories are often overlooked during consultations, with the exception of substance use, which is a well-known risk factor. Several factors contribute to this under-documentation: lack of awareness among healthcare providers about the relevance of social factors to health outcomes, discomfort with asking about these factors, and limited resources, staffing, and time to conduct screenings, which often compete with medical priorities57. Moreover, as Nantes University Hospital serves as the comprehensive center in our region, physicians there tend to focus more on medical care and less on the social environment than general practitioners58. As a result, SDoH information is frequently missing or incomplete, even in unstructured formats within the EHR. This under-documentation limits the ability to study social determinants at scale and hinders efforts to reduce health disparities. It also impairs the capacity of health systems to implement targeted, equity-oriented interventions based on complete, representative patient data.

Conclusion

Social determinants of health have a profound impact on both individual and population health outcomes, influencing morbidity, mortality, and healthcare access. Yet, SDoH are often under-documented in EHRs, particularly in structured data. In this work, we developed and evaluated a sequence-to-sequence language model to extract 13 SDoH categories from French clinical notes, demonstrating the effectiveness of NLP for improving the completeness of real-world health data. Our model consistently outperformed structured EHR data by identifying the majority of relevant SDoH across all patients and showed robust performance across multiple datasets, including publicly available benchmarks.

Future work will explore data augmentation techniques and the use of synthetic clinical text to improve the model’s generalization. It will also facilitate the open-source release of both the model and the annotated training dataset to support reproducibility and multilingual SDoH research. Ultimately, advancing automated SDoH extraction from unstructured clinical text can support more equitable healthcare by enabling richer, more representative data for research, policy-making, and population health interventions.