Introduction

Post-acute sequelae of coronavirus disease 2019 (PASC), or Long COVID, is an often debilitating and complex infection-associated chronic condition (IACC) that occurs after SARS-CoV-2 infection and persists for at least 3 months beyond the acute phase of the infection1,2. PASC has affected a tremendous number of individuals globally2,3,4,5, with over 200 distinct symptoms reported across multiple organ systems, underscoring its complexity and multifaceted nature5,6,7. Diagnosing, treating, and caring for patients with PASC remains challenging because of its myriad symptoms, which evolve over long and variable time intervals1,5,8,9. Accurate characterization and identification of patients with PASC are critical for managing this evolving public health issue, enabling accurate diagnosis, risk stratification, evaluation of the impact of therapeutics and immunizations, and diverse recruitment for clinical research studies1,5.

Electronic health records (EHRs) are widely used in clinical practice and in AI-driven healthcare research. Current EHR analyses of PASC predominantly rely on structured data, such as billing diagnoses recorded as ICD-10 codes. Yet the existing diagnostic code for PASC (i.e., U09.9) has been shown to lack the sensitivity and specificity required for an accurate PASC diagnosis10,11,12,13,14. Additionally, findings reveal demographic biases among patients coded with U09.9, with higher prevalence in women, White, non-Hispanic individuals, and those from areas of lower poverty and higher education14,15. Structured billing diagnoses also have limitations in assessing the true frequency of conditions associated with PASC, such as Postural Orthostatic Tachycardia Syndrome (POTS), which is diagnosed in 2–14% of individuals after a COVID-19 infection16. POTS did not have a specific ICD-10 code until October 1, 2022. Additionally, its primary symptoms (palpitations and dizziness) may be documented in clinical notes but often go unrecognized as part of the syndrome by clinicians, leading to underreporting in billing data. Comparable challenges arise in diagnosing Myalgic Encephalomyelitis/Chronic Fatigue Syndrome (ME/CFS) when relying solely on structured data17,18,19. These biases and limitations complicate our understanding and management of PASC and highlight the need for improved diagnostic approaches.

Unstructured narratives, found in EHR notes, contain detailed accounts of a patient’s clinical history, symptoms, and the effects of those symptoms on physical and cognitive functioning. Previous studies have applied NLP to detect acute COVID symptoms within free-form text notes20,21,22,23. Studies on acute COVID-19 symptom extraction have achieved promising results24. However, due to the inherent difference between PASC and COVID-19, such NLP approaches are not directly transferable21,23,24. Other research has concentrated on identifying PASC symptoms without assertion detection. The PASCLex study by Wang et al. 25 developed a lexicon of symptoms and synonyms, employing a rule-based approach to search for symptoms within clinical notes.

To address the challenges of identifying PASC and build on previous research to improve diagnostic methods, we developed a hybrid NLP pipeline that integrates rule-based and deep-learning approaches. This hybrid pipeline not only detects relevant symptoms in clinical notes but also accurately determines their assertion status, i.e., whether a symptom is truly present or non-present. Our approach was applied to data from the RECOVER Initiative, which provides access to large, diverse COVID-19 and PASC patient populations with EHR data from a network of large health systems across the United States. One of the key features of our pipeline is a comprehensive PASC lexicon, developed in collaboration with clinicians, consisting of 25 symptom categories and 798 Unified Medical Language System (UMLS) concepts for precise symptom identification. Additionally, we designed a BERT-based module specifically to assess symptom assertions, distinguishing between symptoms that are present or non-present (including absent, uncertain, or other possible statuses, as detailed in the Methods). To ensure our approach extracts PASC symptoms from unstructured clinical notes robustly and efficiently, we curated 60 progress notes from New York-Presbyterian/Weill Cornell Medicine (WCM) for model development and internal validation, and 100 progress notes from 10 additional sites for external validation. To strengthen our evaluation in comparison with large language models, we developed a prompt and applied GPT-4 for symptom extraction with assertion detection on a 20-note subset of the WCM internal validation notes. Furthermore, we analyzed 47,654 clinical notes from 11 health systems to conduct a population-level prevalence study, aiming to improve the understanding of PASC symptom-mentioning patterns and refine disease characterization. Leveraging the RECOVER26 datasets, our approach addresses key limitations of prior research by incorporating a large, diverse population and validating performance across multiple sites, which helps account for variations in clinical language and documentation styles.

Results

PASC Lexicon

The PASC lexicon is an ontology graph consisting of 25 symptom categories, 798 finer-grained UMLS concepts, and their synonyms (Fig. 1). The complete symptom lexicon is available in Supplementary File 1.

Fig. 1

PASC Lexicon with Examples of Representative Symptoms.

The 25 categories, together with the initial symptom lexicon, were identified by physicians (led by CF). The categories and lexicon were then manually mapped and consolidated using UMLS and SNOMED CT. To make the lexicon comprehensive, we included all synonyms from the UMLS concepts and the lexicon extracted from PASCLex25. We further leveraged the Broader-Narrower relationship (hierarchies between concepts) and included all children of a concept in the initial lexicon.

A Hybrid NLP Pipeline for PASC Symptom Extraction

We developed MedText, a hybrid NLP pipeline that integrates both rule-based modules and deep learning models at different stages (Fig. 2). At a high level, MedText employs a text preprocessing module to split the clinical notes into sections and sentences. It then uses the PASC lexicon and a rule-based NER module to extract PASC symptoms from clinical notes. Finally, a BERT-based assertion detection module is employed to determine whether the extracted symptom is positive or not (e.g., “there is no diarrhea”).

Fig. 2

The architecture of the NLP pipeline.

Quantitative Results for Assertion Detection

In this study, we used three popular domain-specific pretrained BERT models, BioBERT, ClinicalBERT, and BiomedBERT27,28,29, to develop the assertion detection module in MedText. All BERT models were fine-tuned using the combination of the WCM Training Set and public i2b2 2010 assertion dataset30 (See Methods).

We evaluated the performance of assertion detection using the clinical notes in the WCM Internal Validation and Multi-site External Validation sets. Performance was measured in terms of precision, recall, and F1-score (the harmonic mean of precision and recall) across all the symptoms in these two datasets.

Figure 3 presents a comparative performance analysis on the WCM Internal Validation dataset. Among the three models, BiomedBERT consistently shows superior performance across all metrics, notably achieving an average recall of 0.75 ± 0.040 (95% CI: 0.676–0.834) and an average F1 score of 0.82 ± 0.028 (95% CI: 0.766–0.876).

Fig. 3: The performance of three BERT variants on the WCM internal validation set.

Boxplots for the performance metrics—(a) Precision, (b) Recall, and (c) F1-score. ns not significant; **P ≤ 0.01; ***P ≤ 0.001; ****P ≤ 0.0001.

Figure 4 presents a comparative performance analysis on the Multi-site External Validation dataset. Precision, recall, and F1 scores show no significant differences across the models (i.e., P > 0.05 in t-tests). BiomedBERT achieved an average recall of 0.775 (95% CI: 0.726–0.825) and an average F1 score of 0.745 (95% CI: 0.697–0.796), and ClinicalBERT achieved an average recall of 0.775 (95% CI: 0.713–0.838) and an average F1 score of 0.730 (95% CI: 0.649–0.812), both numerically lower than BioBERT (recall of 0.792 [95% CI: 0.715–0.868] and F1 of 0.782 [95% CI: 0.716–0.848]). The average precision of BiomedBERT (0.737 [95% CI: 0.646–0.829]) is not significantly different from that of BioBERT (0.792 [95% CI: 0.704–0.880]) but is higher than that of ClinicalBERT (0.710 [95% CI: 0.598–0.822]).

Fig. 4: Multi-site External Validation.

Radar Chart for the performance metrics—(a) precision, (b) recall, and (c) F1-score—of the three fine-tuned MedText-BERT pipeline variants on the 100-note non-WCM multi-site external validation set. The metrics are computed for the positive (i.e., “Present”) symptom mentions. The BERT-based models trained/fine-tuned in different scenarios for assertion detection are: BioBERT fine-tuned, BiomedBERT fine-tuned, BiomedBERT benchmark, and ClinicalBERT fine-tuned from left to right in each subfigure. a Seattle - Seattle Children’s; (b) Monte - Montefiore Medical Center; (c) CHOP - The Children’s Hospital of Philadelphia; (d) OCHIN - Oregon Community Health Information Network; (e) Missouri - University of Missouri; (f) CCHMC - Cincinnati Children’s Hospital Medical Center; (g) Nemours - Nemours Children’s Health; (h) MCW - Medical College of Wisconsin; (i) Nationwide - Nationwide Children’s Hospital; (j) UTSW - UT Southwestern Medical Center.

Frequency Analysis of PASC Symptoms in The Population-Level Prevalence Study

Based on the WCM Internal Validation and Multi-site External Validation results, the BiomedBERT model was selected for a population-level prevalence study.

Figure 5A summarizes the positive (i.e., “present”) and negative (i.e., “non-present”) mentions of the 25 PASC symptom categories identified by MedText on the Population-level Prevalence Study dataset. Across all 11 sites, “pain” is the leading positively mentioned symptom category. Other leading symptom categories include headache, digestive issues, depression, anxiety, respiratory symptoms, and fatigue, in varying orders across sites.

Fig. 5: Population-level Prevalence Study.

A Frequency analysis of positive (“present”) in red and negative (“non-present”) in blue symptom category occurrences in different sites. B Spearman correlation coefficients between the positive (i.e., “present”) symptom mentioning patterns of sites and the overall dataset. C Spearman correlation coefficients between the negative (i.e., “non-present”) symptom-mentioning patterns of sites and the overall dataset. (a) Seattle Children’s, (b) Montefiore Medical Center, (c) The Children’s Hospital of Philadelphia, (d) Oregon Community Health Information Network, (e) University of Missouri, (f) Cincinnati Children’s Hospital Medical Center, (g) Nemours Children’s Health System, (h) Medical College of Wisconsin, (i) Nationwide Children’s Hospital, (j) UT Southwestern Medical Center, (k) Weill Cornell Medicine, (l) Total.

Figure 5B, C further compare the symptom-mentioning patterns, in terms of the relative frequency of the symptom categories across sites, using the Spearman correlation test. In particular, the lowest Spearman correlation coefficient for positive symptom-mentioning patterns, computed between any pair of sites or between any site and the entire dataset, exceeds 0.83, while for negative symptom-mentioning patterns it is above 0.72. All of these Spearman correlation tests yielded P < 0.0001.

Figure 6 shows the Spearman correlation coefficients for the positive mentions between each pair of the 25 symptom categories and the total number of symptom mentions.

Fig. 6: Cross-symptom-category correlation test and symptom-category distribution.

Spearman correlation coefficients between the positive (i.e., “Present”) symptom-mentioning patterns of symptom categories. The total count of positive symptom mentions for each symptom category is to the right of the correlation diagram.

Processing Time

We deployed MedText on an AWS Sagemaker platform with a Tesla T4 GPU configuration, achieving efficient processing of the large-scale clinical notes datasets. Overall, MedText processed each note at an average of 2.448 ± 0.812 s across the 11 sites. Figure 7 shows the module-wise summary of the mean and standard deviation of runtime in seconds in the population-level prevalence study across 11 sites.

Fig. 7: Module-wise runtime summary of MedText processing.

The mean (shown by the values on each bar) and standard deviation of the mean per-note runtime of each MedText module, computed across the 11 sites.

Comparison between Rule-Based Module and GPT-4 for Symptom Extraction

To strengthen our evaluation, we conducted an experiment comparing our rule-based NER module with a large language model, GPT-4, for PASC-related symptom extraction (see Methods). Using a random sample of 20 intake notes from the WCM internal validation set, we applied GPT-4 (API version: 2024-05-01-preview) with a prompt specifying the 25 predefined symptom categories. GPT-4 generated 189 symptom mentions, which ZB manually reviewed to assess correctness—specifically, whether each mention existed in the note and corresponded to a valid symptom or synonym. This review identified 9 incorrect mentions, yielding a precision of 95.24% for GPT-4-based NER.

To estimate recall of the NER module in our hybrid NLP pipeline, we constructed a proxy ground truth by taking the union of all verified correct mentions identified by GPT-4 (180) and our rule-based NER module (640) on these 20 notes. This union set contains 706 verified correct mentions. Based on this reference set, the rule-based module achieved a recall of 90.65%.
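The arithmetic behind these estimates is sketched below in Python, using only the aggregate counts reported above (the underlying mention sets are not reproduced here); it is an illustration of the calculation rather than part of the pipeline.

```python
# Sketch of the precision/recall arithmetic from the 20-note comparison,
# using the reported counts (not the actual mention sets).
gpt4_generated = 189   # symptom mentions produced by GPT-4
gpt4_incorrect = 9     # mentions judged incorrect on manual review
gpt4_correct = gpt4_generated - gpt4_incorrect   # 180
rule_correct = 640     # verified correct mentions from the rule-based NER module
union_correct = 706    # union of both verified-correct sets (proxy ground truth)

gpt4_precision = gpt4_correct / gpt4_generated   # ~0.9524
rule_recall = rule_correct / union_correct       # ~0.9065

print(f"GPT-4 precision: {gpt4_precision:.2%}")        # 95.24%
print(f"Rule-based NER recall: {rule_recall:.2%}")     # 90.65%
```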

In addition, our prompt required GPT-4 to generate assertion labels. These assertion detection results were manually reviewed for correctness; on the 180 correctly recognized symptom mentions, GPT-4 achieved a weighted F1 score of 97.78% and a balanced accuracy of 97.72%.

Discussion

In this study, we developed a hybrid NLP pipeline combining rule-based and deep learning techniques and leveraging state-of-the-art methods31,32,33 for clinical text preprocessing and symptom extraction. Our pipeline not only identified relevant symptoms of PASC but also accurately evaluated whether the symptoms are truly present or absent. This pipeline includes a PASC lexicon we developed with clinical specialists, with 25 symptom categories and 798 finer-grained UMLS concepts and synonyms. We adapted a BERT-based module for symptom assertion detection. For model training and evaluation, we curated 160 progress notes from 11 health systems in the RECOVER Initiative network across the U.S. Experiments on the WCM Internal Validation set and the Multi-site External Validation set show that BiomedBERT achieved the overall best performance in assertion detection. Our model achieved high precision and recall, with an average F1 score of 0.82 in internal validation at one site and 0.76 in external validation across 10 sites for assertion detection, demonstrating good generalizability across sites. Additionally, we collected 47,654 progress notes for a population-level prevalence study on PASC symptom-mention patterns. Our pipeline processed each note in an average of 2.448 ± 0.812 seconds. Spearman correlation tests showed ρ > 0.83 for positive mentions and ρ > 0.72 for negative mentions, both with P < 0.0001. These results demonstrate that our pipeline effectively, consistently, and efficiently captures PASC symptoms across multiple health systems. Our population-level prevalence study provides novel insights into PASC-related symptom-mentioning patterns and may support a more comprehensive understanding of PASC in future research and clinical practice.

The hybrid NLP pipeline we developed is tailored explicitly to PASC, whereas previous research has explored assertion detection in other clinical contexts, e.g., BERT-based34 and prompt-based35 approaches. In particular, BERT-based models applied to Chia36 achieved an F1 score of ~0.77 for “Present” and a micro-averaged F1 score of ~0.72. Although these models reported over 0.9 for both the “Present” F1 score and the micro-averaged F1 score on conventional benchmarking datasets (i2b2 201030, BioScope37, MIMIC-III38, and NegEx39), none of them has addressed the unique challenges posed by PASC-related narratives. On the curated RECOVER datasets, our model achieved an average F1 score of 0.82 in internal validation and 0.76 in 10-site external validation with BiomedBERT. This demonstrates that domain-specific models remain necessary compared to general-domain language models, given the nascent and rapidly evolving nature of PASC-related medical narratives. The lack of standardized terminology25, coupled with the limited availability of annotated training data40 and the heterogeneous symptomatology of PASC41, presents challenges that are not as prevalent in more established clinical note datasets for NLP tasks. Comparative analyses of end-to-end processing speeds across different NLP pipelines with BERT-based modules for unstructured clinical notes are not extensively documented in the literature, and differences between datasets and experimental setups can undermine fair comparison. Nonetheless, our model offers a competitive runtime of ~2.5 seconds per note; by comparison, LESA-BERT, a model originally designed for patient message triage, required approximately 212.6 seconds to process its test set on a CPU with a batch size of 1, and its distilled versions, Distil-LESA-BERT-6 and Distil-LESA-BERT-3, showed inference times of 79.8 seconds and 40.8 seconds, respectively42. This demonstrates the efficiency of our pipeline.

We conducted an error analysis during multi-site external validation and identified site-specific performance limitations, revealing potential causes that may affect model effectiveness. First, inconsistent manual annotations led to varied interpretations: for example, “as needed” cases were sometimes labeled strictly as “positive”, “hypothetical” mentions in the EHR were labeled “non-positive”, and symptoms in phrases containing “history of” were often misclassified as “positive”. Second, sentences listing multiple symptoms after the negation phrase “negative for” posed challenges: excessive spacing in these instances caused the model to misinterpret the context, and symptoms far from the negation keyword were frequently misclassified as “positive”. Third, symptom mentions that were excluded from the performance evaluation, e.g., due to unresolved mismatches, could affect the assessment of the NLP pipeline’s overall performance. These findings may provide clues for those working toward enhancing a model’s ability to handle more intricate textual contexts or classify temporal context.

To date, we have not found any reported study applying Large Language Models (LLMs) (e.g., general-purpose models ChatGPT43,44 and Gemini44,45, and open-source, medical domain-specific models OpenBioLLM-70B46 and Llama-3-8B-UltraMedical47) to PASC-related symptom extraction with assertion detection. To strengthen our evaluation, we randomly selected 20 notes from the 30-note WCM internal validation subset, applied GPT-4 to perform both NER and assertion detection, and compared its results with those of the rule-based NER module for symptom extraction. Without being given a PASC-specific lexicon, GPT-4 achieved better performance in assertion detection on the 180 manually verified, correctly identified symptom mentions, with a weighted F1 score of 97.78%. However, it also missed a notable number of symptom/synonym-related tokens compared with the 640 correct symptom mentions identified by our rule-based NER module on the same 20 notes.

Our study has limitations: (i) Our current performance evaluation focused on the positive or negative mentions of symptoms already identified by the MedText pipeline; whether all possible symptoms are captured remains unclear. (ii) We excluded 18 symptom mentions from the performance evaluation due to unresolved mismatches, which may affect the results of symptom extraction. (iii) Our primary goal in this work is to develop a generalizable NLP pipeline for extracting PASC-relevant symptoms from clinical narratives, with intake notes serving as an initial use case. While intake notes are often the earliest comprehensive documentation of a patient’s initial concerns, they may overrepresent acute or prominent symptoms, potentially biasing the extracted distributions. To address these limitations, we will enhance our pipeline in the future by: (i) expanding the annotated datasets in collaboration with clinicians to annotate all possible symptoms in a note, (ii) integrating structured EHR data and COVID-related symptoms extracted from unstructured clinical notes, and (iii) evaluating how different note types contribute to identifying high-risk patients and exploring how the extracted outputs could be integrated into clinical workflows, for example, by flagging patients for specialist referral or follow-up care. Since our pipeline itself is not restricted to intake notes, it can be directly applied to other types of notes.

All the above demonstrate that our hybrid NLP pipeline, MedText, can extract symptoms effectively, efficiently, and robustly. MedText may contribute to (i) unlocking rich clinical information from narrative clinical notes, (ii) assisting clinicians in diagnosing PASC efficiently and precisely, (iii) supporting predictive modeling and risk assessment for PASC, (iv) facilitating PASC clinical decision support systems, and (v) being adapted to extract symptoms of other diseases, enabling large-scale research and public health insights. However, MedText may not capture all nuanced clinical details, highlighting the importance of clinical review to ensure accuracy and completeness in symptom identification.

Methods

Ethics Oversight

Institutional Review Board (IRB) approval was obtained under Biomedical Research Alliance of New York (BRANY) protocol #21-08-508. As part of the BRANY IRB process, the protocol was reviewed in accordance with institutional guidelines, and BRANY waived the need for consent and HIPAA authorization. IRB oversight was provided by BRANY under protocol #21-08-508-380. The study was performed in accordance with the Declaration of Helsinki.

Study cohort

We collected 47,814 clinical notes from 11 sites within the RECOVER network: Weill Cornell Medicine (WCM), Medical College of Wisconsin, Cincinnati Children’s Hospital Medical Center, The Children’s Hospital of Philadelphia, University of Missouri, Nationwide Children’s Hospital, Nemours Children’s Health System, Oregon Community Health Information Network, Seattle Children’s, UT Southwestern Medical Center, and Montefiore Medical Center. The data follow the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), with the Institutional Review Board (IRB) approval obtained under Biomedical Research Alliance of New York (BRANY) protocol #21-08-508.

Curating Data for PASC Symptom Review and NLP Pipeline Evaluation

For pipeline development and validation, we sampled 60 ambulatory patients from WCM and 100 from 10 other sites in the RECOVER network (10 patients/site) (Table 1). Then, we extracted each patient’s intake progress notes for human annotators. Here, we hypothesize that the intake notes contain the most detailed list of problems for each patient. The 60 WCM notes form the Model Development Dataset, in which 30 were used as the WCM Training set, and the other 30 as the WCM Internal Validation set. The 100 notes from 10 other sites form the Multi-site External Validation set (Fig. 8). We also integrated the publicly available 2010 i2b2 assertion dataset35 with the WCM training set for model fine-tuning.

Fig. 8

Data construction workflow.

Table 1 Datasets curated from the 11 sites of the RECOVER Initiative

For annotation, we designed a structured review pipeline to generate manual annotations for symptom mentions. We first applied MedText to extract symptoms from the notes. Then, we used Screen_Tool, an open-source R software developed by TB, to determine the assertion status of each symptom mention (see Fig. 9). Screen_Tool displays each extracted symptom highlighted in its context (the passage it belongs to), along with the symptom category identified by MedText as its concept. The annotators were required to determine whether the token is “related” or “unrelated” to this concept, as well as which of five statuses it belongs to: “present”, “absent”, “hypothetical”, “past”, or “other”. For performance validation of assertion detection, these statuses were mapped into a binary status: positive vs. non-positive. Specifically, “present” mentions were mapped to positive, while mentions with any other status were deemed non-positive.

Fig. 9: Example screenshot for Screen_Tool.

Screen_Tool is an open-source R-based software for manual annotation of symptom mentions.

Two annotators (ZB and ZX) independently reviewed and annotated each mention of extracted symptoms. We used Cohen’s Kappa (CK) to measure inter-rater agreement (IRA). This process achieved a Cohen’s Kappa of 0.98 on the WCM annotated subset (i.e., both the WCM Training and WCM Internal Validation sets) and 0.99 on the multi-site external validation set. All mentions with disagreements between the two annotators, or marked as unrelated, were removed from the dataset. Specifically, 24 of the 2301 mentions (1.04%) were removed from the WCM annotated subset, and 18 of the 1886 mentions (0.95%) were removed from the multi-site external validation set. Among the 4187 mentions extracted from the 160 annotated notes, 12 were ‘unrelated’ NER results in which the trigger token did not match a recognized symptom or synonym. This yields an overall NER precision of 99.7%, with 99.5% (11 unrelated mentions) on the WCM subset and 99.9% (1 unrelated mention) on the multi-site external validation set.
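For illustration, inter-rater agreement of this kind can be computed with scikit-learn’s cohen_kappa_score; the label lists below are illustrative placeholders, not the actual annotations.

```python
# Minimal sketch of the inter-rater agreement computation, assuming the two
# annotators' labels are aligned per mention (illustrative labels only).
from sklearn.metrics import cohen_kappa_score

annotator_zb = ["present", "absent", "present", "hypothetical", "present"]
annotator_zx = ["present", "absent", "present", "hypothetical", "absent"]

kappa = cohen_kappa_score(annotator_zb, annotator_zx)
print(f"Cohen's Kappa: {kappa:.2f}")
```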

Constructing PASC Lexicon

We first compiled an initial symptom lexicon with 25 categories for PASC-related terms, based on input from subject matter experts (SMEs) and a literature review6. PASC symptoms were defined as patient characteristics occurring in the post-acute COVID-19 period. Next, we searched the symptoms in the UMLS Metathesaurus® (version 2023AB) in the English language, with vocabulary sources of “SNOMEDCT_US” and “MeSH”. We included concepts that can be strictly matched in UMLS, together with their children (narrow concepts). To make the lexicon comprehensive, we included all synonyms from the UMLS concepts and the lexicon extracted from PASCLex25. This process yielded a hierarchical knowledge graph comprising terms and keywords that define 798 symptom sub-concepts and synonyms, organized into 25 categories (Supplementary File 2).
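A simplified sketch of the resulting lexicon structure (category, UMLS concept, synonyms) is shown below; the concepts and synonym lists are illustrative excerpts, not the full 798-concept lexicon in Supplementary File 2.

```python
# Illustrative sketch of the lexicon structure: category -> UMLS concept -> synonyms.
# CUIs and synonyms here are examples only.
pasc_lexicon = {
    "Fatigue": {
        "C0015672": ["fatigue", "tiredness", "exhaustion", "lack of energy"],
    },
    "Headache": {
        "C0018681": ["headache", "cephalalgia", "head pain"],
    },
}

# Flatten the hierarchy into a term -> category map for downstream matching.
term_to_category = {
    synonym.lower(): category
    for category, concepts in pasc_lexicon.items()
    for synonyms in concepts.values()
    for synonym in synonyms
}
```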

Developing a Hybrid NLP Pipeline for PASC Symptom Extraction

We developed a hybrid NLP pipeline that combines rule-based and deep-learning modules to mine PASC symptoms. At a high level, MedText employs a text preprocessing module to split the clinical notes into sections and sentences. Next, a rule-based NER module uses a clinician-curated PASC lexicon to extract PASC symptoms from the clinical notes. Finally, a BERT-based assertion detection module assesses whether each extracted symptom is present or non-present (e.g., “there is no diarrhea”) (Fig. 2).

MedText48 is an open-source clinical text analysis system developed with Python. It offers an easy-to-use text analysis pipeline, including de-identification, section segmentation, sentence split, NER, constituency parsing, dependency parsing, and assertion detection. For this study, we utilized modules ranging from section segmentation and sentence split to NER and assertion detection (Fig. 2).

The section split module divides the report into sections. This rule-based module uses a list of section titles to divide the notes. We used rules from medspacy, which were adapted from SecTag and expanded through practice49. The sentence split module splits the report into sentences using NLTK50. The NER module recognizes mention spans of a particular entity type (e.g., PASC symptoms) from the reports. Technically, it uses a rule-based method via spaCy’s PhraseMatcher to efficiently match large terminology lists. Here, the NER step identifies and classifies concepts of interest according to the 798 symptoms and their synonyms in the PASC lexicon. The assertion detection module determines the status of the recognized concept based on its context within the sentence35, as detailed below in the assertion detection module development.
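A minimal sketch of the sentence splitting and lexicon matching steps is shown below, assuming a small flat list of lexicon terms; section handling and the full lexicon are omitted, and this is not the exact MedText implementation.

```python
# Minimal sketch: NLTK sentence splitting followed by spaCy PhraseMatcher
# lookup against a (truncated, illustrative) PASC term list.
import spacy
from spacy.matcher import PhraseMatcher
from nltk.tokenize import sent_tokenize  # requires the NLTK 'punkt' data

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
terms = ["shortness of breath", "fatigue", "diarrhea", "palpitations"]
matcher.add("PASC_SYMPTOM", [nlp.make_doc(t) for t in terms])

note = "Patient reports fatigue and palpitations. There is no diarrhea."
for sentence in sent_tokenize(note):
    doc = nlp(sentence)
    for _, start, end in matcher(doc):
        print(sentence, "->", doc[start:end].text)
```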

Additionally, MedText features a flexible modular design, provides a hybrid text processing schema, and supports raw text processing and local processing. It also adopts BioC51 as the unified interface, and standardizes the input/output into a structured representation compatible with the OMOP CDM. This allows us to deploy MedText and validate its performance across multiple, disparate data sources. In this study, we deployed MedText on an AWS Sagemaker platform with a Tesla T4 GPU configuration and 15 GiB of CPU memory.
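As an illustration of the structured output, extracted mentions can be serialized into a NOTE_NLP-style table; the columns below follow OMOP CDM conventions but are a simplified assumption, not the exact MedText schema.

```python
# Illustrative sketch of writing extracted mentions into an OMOP NOTE_NLP-style
# table; the column set and values are simplified examples.
import pandas as pd

mentions = [
    {"note_id": 101, "lexical_variant": "fatigue", "symptom_category": "Fatigue",
     "section": "History of Present Illness", "term_exists": "Y"},   # present
    {"note_id": 101, "lexical_variant": "diarrhea", "symptom_category": "Digestive",
     "section": "Review of Systems", "term_exists": "N"},            # negated
]
note_nlp = pd.DataFrame(mentions)
note_nlp.to_csv("note_nlp_output.csv", index=False)
```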

Developing the Assertion Detection Module

In this study, we used three popular domain-specific pretrained BERT models, BioBERT, ClinicalBERT, and BiomedBERT27,28,29, to develop classifiers for the assertion detection step in the MedText pipeline after extracting symptom mentions. Detailed descriptions of these pretrained BERT models and their hyperparameter settings are provided in Table 2.

Table 2 Hyperparameters for BERT-based assertion detection model training

To construct the training set for model development, we combined the publicly available 2010 i2b2 assertion dataset of clinical notes30,35, which consists of 4359 sentences and 7073 labels collected from three sites (Partners Healthcare, Beth Israel Deaconess Medical Center, and the University of Pittsburgh Medical Center)30,35, with the WCM Model Training set. In particular, 3055 sentences and 4243 mentions from the public i2b2 2010 assertion dataset30,35, merged with the 30-note WCM Model Training set, were used to fine-tune the BERT-based assertion detection models. This integrated dataset is referred to as the i2b2&WCM-merged training set. To evaluate the effectiveness of our fine-tuning approach, we conducted comparative analyses among BiomedBERT29, BioBERT27, and ClinicalBERT28, all of which were trained on the i2b2&WCM-merged training set. We used the WCM Internal Validation set (30 notes, 350 sentences, and 953 mentions) for model selection and the 10-site Multi-site External Validation set (100 notes, 1113 sentences, and 1886 mentions) for performance evaluation (Table 1). This direct comparison offered insights into the relative strengths of each model in handling biomedical text. The multi-category prediction task over the possible assertion statuses in the original BERT models was revised into a binary prediction task. Specifically, only the “present” status is mapped to positive with a label of 1, whereas all other possible statuses, including “absent”, “hypothetical”, and any form of “uncertain”, are mapped to non-positive with a label of 0.
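A minimal fine-tuning sketch for the binary assertion classifier is shown below using the Hugging Face transformers library; the checkpoint name, sentence-symptom pairing scheme, and training arguments are illustrative assumptions rather than the exact MedText configuration (see Table 2 for the actual hyperparameters).

```python
# Minimal sketch: fine-tune a BERT checkpoint for binary assertion detection.
# Checkpoint, input formatting, and hyperparameters are illustrative only.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "dmis-lab/biobert-v1.1"   # assumed public BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Each example pairs a sentence with the target symptom span; 1 = present, 0 = non-present.
texts = ["There is no diarrhea. [SEP] diarrhea",
         "Patient reports persistent fatigue. [SEP] fatigue"]
labels = [0, 1]

class AssertionDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
        self.labels = torch.tensor(labels)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        return {**{k: v[i] for k, v in self.enc.items()}, "labels": self.labels[i]}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="assertion_model", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=2e-5),
    train_dataset=AssertionDataset(texts, labels),
)
trainer.train()
```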

Evaluation Plan

We explored three different assertion detection models in the MedText pipeline and considered two application scenarios: a numerical performance evaluation using the smaller datasets with manually annotated ground-truth labels (i.e., the WCM Internal Validation set and the Multi-site External Validation set), and a population-level prevalence study using approximately 5000 clinical notes from each of the 11 sites of the RECOVER network (i.e., the population statistics set). The patient demographics (race, ethnicity, age, and sex) across the 11 RECOVER sites in the population-level prevalence study are summarized in Table 3.

Table 3 Summary of patient demographics

In the numerical performance evaluation, each model was assessed using precision, recall, and F1 score. For each trained and fine-tuned model, bootstrapping was applied in 100 iterations on the WCM Internal Validation set. The performance of these models was evaluated using the same metrics, along with their mean values, standard deviation, and 95% confidence intervals.
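One way to implement such a bootstrap is sketched below, with illustrative labels standing in for the validation mentions; the percentile-based confidence interval is a common choice and may differ from the exact procedure used.

```python
# Sketch of a 100-iteration bootstrap over validation mentions, reporting
# mean, standard deviation, and a 95% percentile confidence interval for F1.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])   # illustrative gold labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1, 1, 0])   # illustrative predictions

f1_scores = []
for _ in range(100):
    idx = rng.integers(0, len(y_true), size=len(y_true))   # resample with replacement
    _, _, f1, _ = precision_recall_fscore_support(
        y_true[idx], y_pred[idx], average="binary", zero_division=0)
    f1_scores.append(f1)

mean, std = np.mean(f1_scores), np.std(f1_scores)
ci_low, ci_high = np.percentile(f1_scores, [2.5, 97.5])
print(f"F1: {mean:.3f} ± {std:.3f} (95% CI: {ci_low:.3f}–{ci_high:.3f})")
```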

To study the prevalence of PASC-related symptoms at the population level, we analyzed symptom mentions identified by the optimal configuration of our NLP pipeline across the 11 RECOVER sites. Using the MedText pipeline, we analyzed 47,654 notes from these sites as the Population-level Prevalence Study dataset (Fig. 8). Specifically, this dataset included 5000 intake progress notes from unique patients per site, except for three sites: Nationwide had 3680 notes, Nemours had 2132, and Seattle had 1842. We then applied the Spearman rank correlation test to compare symptom-mentioning patterns across sites. We conducted pairwise Spearman tests between sites and between symptom categories, for positive and negative mentions, respectively. For the between-site analysis, the symptom mentions over the 25 symptom categories of each site form the symptom-mentioning vector for that site. For the between-category analysis, the symptom mentions across the 11 sites for each symptom category form the symptom-mentioning vector of that symptom category.
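A minimal sketch of the between-site correlation test is shown below, with toy count vectors in place of the actual 25-category mention counts.

```python
# Sketch of the between-site Spearman test: each site is represented by its
# vector of positive mention counts over the symptom categories (toy counts).
from scipy.stats import spearmanr

site_a = [1200, 950, 430, 310, 280]   # counts over symptom categories (illustrative)
site_b = [1100, 990, 400, 350, 260]
overall = [5400, 4800, 2100, 1600, 1300]

rho_ab, p_ab = spearmanr(site_a, site_b)
rho_a_all, p_a_all = spearmanr(site_a, overall)
print(f"site A vs site B: rho={rho_ab:.2f}, p={p_ab:.4f}")
print(f"site A vs overall: rho={rho_a_all:.2f}, p={p_a_all:.4f}")
```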

Symptom Extraction and Assertion Detection Using Large Language Model: GPT-4

To compare our hybrid NLP pipeline with large language models for PASC symptom extraction and assertion detection, we conducted an experiment using GPT-4 (API version: 2024-05-01-preview). We randomly selected 20 intake notes from the subset used for WCM internal validation. GPT-4 was prompted with only the 25 predefined PASC symptom categories and instructed to extract all symptom mentions along with their assertion status (i.e., positive or negative). The model was allowed to rely on its internal knowledge to recognize synonymous expressions without being provided explicit lexicons. The specific prompt is provided in Supplementary File 3.
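A hedged sketch of such an API call is shown below; the endpoint, deployment name, and prompt wording are placeholders, and the actual prompt used in this study is provided in Supplementary File 3.

```python
# Sketch of a GPT-4 call via Azure OpenAI; endpoint, deployment name, and
# prompt text are placeholders (the real prompt is in Supplementary File 3).
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_version="2024-05-01-preview",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],   # placeholder
    api_key=os.environ["AZURE_OPENAI_API_KEY"],           # placeholder
)

categories = "fatigue, headache, pain, ..."   # the 25 predefined categories (elided)
note_text = "..."                             # intake note text (placeholder)

response = client.chat.completions.create(
    model="gpt-4",   # Azure deployment name is site-specific
    messages=[
        {"role": "system",
         "content": f"Extract mentions of these symptom categories: {categories}. "
                    "For each mention, return the phrase, its category, and whether "
                    "it is asserted as present or not."},
        {"role": "user", "content": note_text},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```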

Each output mention was manually reviewed (by ZB) to determine whether the identified phrase existed in the original note and correctly matched a symptom or its synonym. To create a proxy ground truth for recall estimation, we took the union of all verified correct mentions extracted by either GPT-4 or the rule-based NER pipeline. Using this set, we evaluated recall for the rule-based method and precision for GPT-4. The assertion labels assigned by GPT-4 were also manually reviewed for correctness.