Abstract
Lower respiratory tract infections (LRTI) are a leading cause of mortality and are challenging to diagnose in critically ill patients, as non-infectious causes of respiratory failure can present with similar clinical features. We develop an LRTI diagnostic method combining the pulmonary transcriptomic biomarker FABP4 with electronic medical record text assessment using the large language model Generative Pre-trained Transformer 4. In a cohort of critically ill adults, a combined classifier incorporating FABP4 expression and large language model electronic medical record analysis achieves an area under the receiver operating characteristic curve (AUC) of 0.93 ± 0.08 and an accuracy of 84%, outperforming FABP4 expression alone (0.84 ± 0.11) and large language model-based analysis alone (0.83 ± 0.07). By comparison, the medical team admission diagnosis has an accuracy of 72%. In an independent validation cohort, the combined classifier yields an AUC of 0.98 ± 0.04 and accuracy of 96%. This study suggests that integrating a host biomarker with large language model analysis can improve LRTI diagnosis in critically ill adults.
Introduction
Lower respiratory tract infections (LRTI) are a leading cause of death worldwide, yet remain challenging to diagnose1. This is especially true in the intensive care unit (ICU), where non-infectious acute respiratory illnesses often have similar clinical manifestations. Further complicating accurate diagnosis is the failure to identify a causative pathogen in most clinically recognized cases of LRTI2. The resulting diagnostic uncertainty drives the overuse of empiric antibiotics, leading to adverse outcomes ranging from Clostridioides difficile infection to the development of antimicrobial resistance3,4.
Host transcriptional biomarkers are a promising modality for LRTI diagnosis that overcome several limitations of traditional microbiologic tests5,6. By offering a more direct and dynamic measure of the host immune response, they can enable earlier and more accurate identification of infection and can differentiate bacterial and viral etiologies, even in cases where pathogen detection is unsuccessful6,7. Single gene biomarkers are particularly amenable to clinical translation, as they can be readily incorporated into simple nucleic acid amplification platforms which are already widely used in healthcare settings.
The pulmonary expression of the gene FABP4, for instance, was recently identified as an LRTI diagnostic biomarker in critically ill patients with acute respiratory failure, achieving an area under the receiver operating characteristic curve (AUC) of 0.85 ± 0.12 in adults and 0.90 ± 0.07 in children8. FABP4’s consistent performance across cohorts with differing microbiology (predominantly bacterial infections in adults and viral infections in children) suggested that FABP4 is agnostic to the type of pathogen causing LRTI8. Mechanistically, fatty acid binding proteins are a family of intracellular proteins that modulate fatty acid trafficking and inflammatory signaling9. FABP4 specifically is highly expressed in tissue-resident alveolar macrophages, which are preferentially depleted during respiratory infections including bacterial pneumonia10 and COVID-1911. While its performance as an LRTI biomarker exceeds that of clinical biomarkers such as C-reactive protein12 or procalcitonin13, FABP4 alone likely does not achieve the accuracy necessary to enable confident clinical decisions regarding antimicrobial use in critically ill patients with acute respiratory failure.
Large language models (LLMs) such as Generative Pre-trained Transformer 4 (GPT-4) represent a new class of artificial intelligence tools with potential utility across a diversity of medical applications14. GPT-4 provides a text interface in which a clinician or other user may pose questions, to which GPT-4 then responds in conversational language. While LLMs have demonstrated remarkable performance for some medical use cases, including image interpretation15,16,17 and patient risk stratification18, their utility in aiding clinical reasoning remains unclear19,20,21, and their potential role for diagnosing LRTI or other critical illness syndromes based on electronic medical record (EMR) data has not been assessed. Furthermore, evaluation of LLMs in combination with other tools, such as host biomarkers, remains largely unexplored.
Here, we address this gap by building a diagnostic classifier combining FABP4 expression with GPT-4 analysis of electronic medical record (EMR) data. We find that this combination affords remarkably accurate LRTI diagnosis, suggesting a promising approach to improve the care of critically ill patients.
Results
We evaluated the performance of four different diagnostic approaches (FABP4, GPT-4, integrated FABP4/GPT-4 classifier, and admission diagnosis by the primary medical team) against a gold-standard of retrospective LRTI adjudication performed by two or more physicians. The derivation cohort included 42 patients with LRTI and 56 with no evidence of infection and a clear alternative explanation for respiratory failure, and the external validation cohort included 33 LRTI cases and 26 No LRTI cases (Fig. 1). The majority of LRTI patients in the derivation cohort had bacterial etiologies of LRTI, while the validation cohort, which was largely recruited during the COVID-19 pandemic, was predominantly viral (Table 1). Time to microbiologic diagnosis from intubation was a median of 56.3 h (interquartile range (IQR) 28.9–73.4 h) in the derivation cohort and 59.6 h (IQR 33.9–88.8 h) in the validation cohort for patients with diagnoses other than COVID-19. Due to the incorporation of universal SARS-CoV-2 testing prior to hospital admission during the period of validation cohort enrollment, most patients with COVID-19 were diagnosed prior to intubation.
We provided GPT-4 with practical clinical summary information from the EMR that would be available to a treating physician assuming care of a patient in the ICU: a chest x-ray (CXR) radiology report from the day of enrollment and the note written by the medical team from the day prior. In the derivation cohort, notes and radiology reports from five patients were utilized for GPT-4 prompt engineering and optimization (Methods) and seven lacked a clinical note from the day prior to study enrollment, leaving a total of 37 LRTI and 49 No LRTI cases available for analysis.
We first compared the accuracy of the primary medical team’s ICU admission diagnosis, extrapolated from their decision to prescribe antimicrobials, against the gold-standard retrospective LRTI adjudication. The medical team correctly identified 37/37 (100%) of true LRTI cases but incorrectly called LRTI in 24/49 (49%) of patients in the No LRTI group, equating to an accuracy of 72% (Fig. 2A and Table 1). All 24 of the derivation cohort No LRTI patients unnecessarily treated for LRTI received antibacterial coverage, with four additionally receiving empiric therapy directed at viral and/or fungal pathogens (Supplementary Table 1). We next assessed the diagnostic performance of FABP4 and found that it achieved an AUC of 0.84 ± 0.11 (mean ± standard deviation) by five-fold cross-validation (Fig. 2B). We then assessed the performance of GPT-4 in diagnosing LRTI; for each patient, GPT-4 rendered three independent diagnoses, yielding a score from 0 to 3. A logistic regression classifier based on this GPT-4 score achieved an AUC of 0.83 ± 0.07 (Fig. 2B).
A Confusion matrices for the initial ICU diagnosis and the integrated FABP4/GPT-4 classifier in the derivation cohort. B Receiver operating characteristic curves for the GPT-4 classifier, FABP4 classifier, and integrated FABP4/GPT-4 classifier in the derivation cohort. C Confusion matrices for the initial ICU diagnosis and the integrated FABP4/GPT-4 classifier in the validation cohort. D Receiver operating characteristic curves for the GPT-4 classifier, FABP4 classifier, and integrated FABP4/GPT-4 classifier in the validation cohort. In (A and C), the classifiers output an LRTI diagnosis if the patient had a predicted out-of-fold LRTI probability of 50% or higher. Intensity of color in the confusion matrices reflects the percentage of patients in each quadrant; red indicates the initial ICU diagnosis and green the integrated FABP4/GPT-4 classifier. In (B and D), the areas under the curve (AUCs) are presented as mean ± standard deviation.
Combining FABP4 and GPT-4 into a single logistic regression model achieved an AUC of 0.93 ± 0.08 (Fig. 2B), outperforming both FABP4 (P = 0.002, one-sided paired t-test) and GPT-4 alone (P = 0.008, one-sided paired t-test). We used the one-sided paired t-test because we were primarily interested in whether the integrated classifier outperformed each individual classifier, but the differences remained statistically significant when using two-sided t-tests (P = 0.004 and P = 0.016, respectively). Considering an out-of-fold LRTI probability of 50% or higher as LRTI-positive, the integrated FABP4/GPT-4 classifier had a sensitivity of 78%, specificity of 88%, and accuracy of 84% (Fig. 2A). Assessment of the integrated classifier’s performance at Youden’s index within each cross-validation fold demonstrated an average sensitivity of 86%, specificity of 98%, accuracy of 93%, positive predictive value of 97%, negative predictive value of 91%, and F1 score of 0.91.
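As a worked illustration of this statistical comparison, the one-sided paired t-test on per-fold AUCs can be run in R as sketched below. The fold-level AUC vectors shown are hypothetical placeholders for illustration only, not the study's actual fold-level results.

```r
# Hypothetical per-fold AUCs (5-fold cross-validation); placeholders only
auc_integrated <- c(0.97, 0.90, 0.95, 0.89, 0.94)
auc_fabp4      <- c(0.88, 0.79, 0.86, 0.81, 0.86)

# One-sided paired t-test: does the integrated classifier achieve higher AUC?
t.test(auc_integrated, auc_fabp4, paired = TRUE, alternative = "greater")
```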
Next, we assessed the validation cohort (Fig. 1), in which the primary medical team correctly identified 25/25 (100%) of LRTI cases but unnecessarily treated 7/22 (32%) of patients in the No LRTI group with antibacterials, equating to an accuracy of 85% (Fig. 2C, Supplementary Table 1). The integrated FABP4/GPT-4 classifier achieved a sensitivity of 96%, specificity of 95%, and accuracy of 96%, again outperforming both FABP4 alone (accuracy 79%) and GPT-4 alone (accuracy 79%). In the validation cohort, the integrated classifier achieved an AUC of 0.98 ± 0.04 using 3-fold cross-validation, as compared to FABP4 (0.86 ± 0.06) or GPT-4 (0.90 ± 0.01) alone (P = 0.08 and P = 0.02, respectively, one-sided paired t-test) (Fig. 2D).
To gain insight into how GPT-4 returns diagnoses based on limited information, we compared the LLM against the decision making of three comparison physicians who were provided identical input. From the same limited EMR data and prompt provided to GPT-4, we asked the comparison physicians to assign a diagnosis of LRTI or No LRTI for each patient in the derivation cohort. Considering a threshold of at least one LRTI diagnosis per patient across the three physicians as LRTI-positive, we found a sensitivity of 78%, specificity of 88%, and accuracy of 84% (Fig. 3A). Finally, we sought to identify potential biases in GPT-4 diagnoses by comparing GPT-4 results to those of the comparison physicians (Fig. 3B), focusing on cases with two or more discordant LRTI diagnoses. Of the nine patients more frequently diagnosed with LRTI by GPT-4 versus the comparison physicians (Fig. 3B), six had clinical notes with no mention of LRTI, but explicit concern for LRTI in the CXR report, as judged by mention of “pneumonia” and/or “infection” in the radiologist read (CXR reports provided in Supplementary Appendix 4). This suggested that GPT-4 may have placed more weight on CXR reads relative to physicians. Of the two patients disproportionately diagnosed with LRTI by comparison physicians versus GPT-4 (Fig. 3B), one had a final diagnosis of e-cigarette/vaping associated lung injury, and the other had LRTI attributed to rhinovirus.
A Confusion matrix of diagnoses by three comparison physicians who received the same prompt and data as GPT-4. B Comparison of GPT-4 LRTI scores with comparison physician scores. In (B), the y axis depicts the number of times GPT-4 diagnosed LRTI out of 3, and the x axis shows the number of times the physicians called LRTI out of 3. Blue boxes indicate instances in which GPT-4 diagnoses were most discordant with the comparison physicians (scores differing by 2 or more).
Discussion
Our findings demonstrate that the combination of a host transcriptomic biomarker with AI assessment of EMR text data can improve LRTI diagnosis in critically ill patients. We found that an integrated FABP4/GPT-4 classifier achieved higher LRTI diagnostic accuracy than FABP4 alone, GPT-4 alone, or the treating medical team. In our study population, we found that the initial treating physicians unnecessarily prescribed antimicrobials in a third to half of patients ultimately found to have non-infectious causes of acute respiratory failure. All patients who inappropriately received antimicrobials received antibacterial therapy, and a smaller number additionally received antiviral and/or antifungal therapy. Had our integrated classifier results been available at the time of ICU admission, we estimate that inappropriate antimicrobial use might have been prevented in 20/24 (83%) and 7/7 (100%) of No LRTI patients who were unnecessarily treated in the derivation and validation cohorts, respectively. Acute respiratory illness is a leading reason for inappropriate antibiotic use22, and our results suggest a potential role for biomarker/AI classifiers in antimicrobial stewardship, a major goal of the U.S. CDC23 and the World Health Organization24. However, given the challenges of de-escalating antimicrobials in critically ill patients, and the potential consequences of inappropriately stopping treatment in a patient with true LRTI, our results serve primarily as a proof of concept requiring further validation.
Many patients with clinical LRTI never have a confirmed microbiologic diagnosis2; across the study cohorts, approximately 50% of enrolled patients were clinically adjudicated as having LRTI without an identified microbial pathogen, or as having an indeterminate LRTI status (Fig. 1). Although we focused on the unequivocal cases of proven LRTI or No LRTI to develop and test our GPT-4 and FABP4/GPT-4 classifiers, it is those cases without a clear diagnosis, in which LRTI is considered as one possible diagnosis among many, where this method may ultimately prove most useful. A future randomized clinical trial will be needed to conclusively test this.
Previous studies have found that GPT-4 is influenced by the precise language used in a prompt, leading to a need for prompt engineering20. By iterating our prompt on a subset of patients, and through direct comparison to physicians provided with identical EMR data, we identified possible blind spots of GPT-4 and gained insights that may help guide future optimization of LLMs for infectious disease diagnosis.
A primary strength of this study is the combination of a host transcriptional biomarker with AI interpretation of EMR text data to advance infectious disease diagnosis. We address one of the most common and challenging diagnostic dilemmas in the ICU, leverage deeply characterized cohorts, and employ a rigorous post-hoc LRTI adjudication approach incorporating multiple physicians.
Importantly, clinicians with access to a HIPAA-compliant GPT-4 interface can readily use our prompt without any prior bioinformatics expertise. Moreover, this approach yielded promising results in two cohorts of patients with very different microbial etiologies of LRTI, suggesting that both FABP4 and LLM analysis of EMR data may have utility as diagnostic approaches agnostic to the type of LRTI pathogen. Bacterial LRTI was predominant in the pre-pandemic derivation cohort, while the validation cohort, which enrolled during the COVID-19 pandemic, primarily consisted of patients with viral LRTI.
Limitations of this study include a relatively small sample size, particularly in the validation cohort, which necessitated the use of 3-fold (versus 5-fold) cross-validation. In addition, our focus on mechanically ventilated patients may limit generalizability to less severe respiratory illnesses. Antimicrobial administration is an imperfect proxy for clinical team LRTI diagnosis; however, it was an objective, reproducible and unbiased option for retrospectively estimating the clinical team’s decision making. We restricted GPT-4 analyses to a single EMR note and CXR read, and it is possible that assessment of more complete EMR data would have led to improved, or different, performance.
Given these limitations, this study is best seen as a proof-of-concept that establishes the feasibility and promise of a diagnostic approach combining artificial intelligence-based EMR analysis with a host biomarker. Future work can test whether GPT-4 can improve the marginal performance of widely available clinical biomarkers such as C-reactive protein, assess the generalizability of FABP4/GPT-4 classifier performance in larger independent cohorts of ICU patients, and evaluate these methods for the diagnosis of other critical illness syndromes such as sepsis.
Methods
Adjudication of LRTI status
Gold standard adjudication of LRTI status was performed retrospectively following ICU discharge by two or more physicians using all available information in the EMR, and based on the U.S. Centers for Disease Control and Prevention (CDC) PNEU1 criteria27 as well as an identified pulmonary pathogen. Patients with negative microbiological testing and a clear alternative reason for their acute respiratory failure besides pulmonary infection, representing the clinically relevant control group, were also identified (No LRTI group). Any adjudication discrepancies were resolved by a third physician, and patients with indeterminate LRTI status were excluded.
Extraction of EMR data
The primary medical or ICU team’s clinical note from the day prior to study enrollment and the CXR read from the day of enrollment were extracted from the EMR. If no note was written on the day prior to enrollment, a note from two days prior was substituted (Table 1). Notes were written in the Epic EMR platform by physicians from the primary care team, which included Internal Medicine, Critical Care, and several additional services (Table 1). Notes varied in length and structure, reflecting the real-world diversity of clinical practice and providing a realistic scenario for GPT-4 use. If no CXR was performed on the day of enrollment, the next closest CXR read prior to the date of enrollment was used instead. Patients with no clinical notes available prior to study enrollment were excluded (N = 7 derivation cohort, N = 12 validation cohort). The clinical treatment team’s LRTI diagnosis was extrapolated based on administration of empiric antimicrobials (antibacterial, antiviral, and/or antifungal agents) for at least 24 h within one day of study enrollment, excluding agents given for established non-pulmonary infections or prophylaxis.
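A minimal R sketch of this extrapolation step follows. The `meds` data frame and all of its column names are hypothetical stand-ins for illustration, not the study's actual data structures.

```r
library(dplyr)

# Flag patients empirically treated for LRTI: any antibacterial, antiviral,
# or antifungal administered for >= 24 h, starting within one day of study
# enrollment, excluding prophylaxis and established non-pulmonary infections.
# All column names below (drug_class, prophylaxis, etc.) are hypothetical.
team_lrti_dx <- meds |>
  filter(drug_class %in% c("antibacterial", "antiviral", "antifungal"),
         !prophylaxis,
         !nonpulmonary_indication,
         abs(as.numeric(difftime(start_time, enroll_time, units = "days"))) <= 1,
         duration_hours >= 24) |>
  distinct(patient_id)   # one row per patient extrapolated as "team called LRTI"
```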
RNA sequencing
RNA was extracted from tracheal aspirates collected on the day of enrollment and underwent rRNA depletion followed by library preparation using the NEBNext Ultra II kit on a Beckman Coulter Echo liquid handling instrument, as previously described8. Finished libraries underwent paired-end sequencing on an Illumina NovaSeq.
FABP4 diagnostic classifier
All analyses were done in R version 4.5.0. FABP4 expression was normalized using the varianceStabilizingTransformation function from the DESeq2 package (v1.48.1)28 and used to train a logistic regression classifier. We chose logistic regression because, among machine learning methods, it was best suited to the one or two features we sought to test within the sample size of the cohorts. More specifically, logistic regression is less vulnerable to overfitting than more complex models, such as random forest or gradient boosting classifiers. In addition, logistic regression is among the most broadly utilized statistical methods reported in the medical literature, and we believe that this inherent familiarity and interpretability would be particularly appealing in clinical settings compared to more advanced but less transparent machine learning models.
In each iteration of 5-fold cross-validation, both training and test sets were filtered to retain only genes with at least 10 counts across 20% of the samples in the training set. The test fold’s FABP4 expression level was normalized using the variance-stabilizing transformation with the dispersions learned from the training folds, and input to the trained logistic regression classifier to assign LRTI or No LRTI status for each patient in the test fold. The performance and receiver operating characteristic (ROC) curve for each of the five folds were evaluated using the package pROC v1.19.0.129. The mean AUC and standard deviation were calculated from the AUCs of the individual test folds. The sensitivity and specificity at Youden’s index were extracted for each test fold separately using the coords function from the pROC package, and the average and standard deviation were calculated across the cross-validation folds.
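The sketch below illustrates one fold of this procedure in R. It is a minimal reconstruction under stated assumptions (a `counts` gene-by-sample matrix, an `lrti` label vector with levels "LRTI"/"No LRTI", and fold index vectors `train_idx`/`test_idx`), not the authors' exact code, which is available in the linked repository.

```r
# A minimal sketch of one cross-validation fold of the FABP4 classifier.
suppressPackageStartupMessages({
  library(DESeq2)  # v1.48.1 in the paper
  library(pROC)    # v1.19.0.1 in the paper
})

# Gene filter defined on the training fold only:
# keep genes with >= 10 counts in at least 20% of training samples
keep <- rowSums(counts[, train_idx] >= 10) >= 0.2 * length(train_idx)

# Learn size factors and dispersions on the training fold
dds_train <- DESeqDataSetFromMatrix(counts[keep, train_idx],
                                    colData = data.frame(row = colnames(counts)[train_idx]),
                                    design = ~ 1)
dds_train <- estimateSizeFactors(dds_train)
dds_train <- estimateDispersions(dds_train)
vst_train <- assay(varianceStabilizingTransformation(dds_train, blind = FALSE))

# Transform the test fold using the dispersion function learned on training data
dds_test <- DESeqDataSetFromMatrix(counts[keep, test_idx],
                                   colData = data.frame(row = colnames(counts)[test_idx]),
                                   design = ~ 1)
dds_test <- estimateSizeFactors(dds_test)
dispersionFunction(dds_test) <- dispersionFunction(dds_train)
vst_test <- assay(varianceStabilizingTransformation(dds_test, blind = FALSE))

# Logistic regression on normalized FABP4 expression alone
train_df <- data.frame(y = as.integer(lrti[train_idx] == "LRTI"),
                       fabp4 = vst_train["FABP4", ])
fit <- glm(y ~ fabp4, data = train_df, family = binomial)
prob <- predict(fit, newdata = data.frame(fabp4 = vst_test["FABP4", ]),
                type = "response")

# Per-fold ROC/AUC and the operating point at Youden's index
# (J = sensitivity + specificity - 1)
roc_fold <- roc(response = as.integer(lrti[test_idx] == "LRTI"), predictor = prob)
auc(roc_fold)
coords(roc_fold, x = "best", best.method = "youden",
       ret = c("sensitivity", "specificity"))
```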
GPT-4 input, scoring, and prompt engineering
We used the GPT-4 turbo model with 128k context length and a temperature setting of 0.2, implemented in Versa, a University of California, San Francisco (UCSF) Health Insurance Portability and Accountability Act (HIPAA)-compliant deployment. For each patient, compiled clinical notes and CXR reads were input into the GPT-4 chat interface. Prompt engineering was initially carried out by iterative testing on clinical notes and CXR reads from five randomly selected patients in the derivation cohort, who were excluded from subsequent analyses. We employed a chain-of-thought prompt strategy30 that involved asking GPT-4 to analyze the note and CXR step by step. Because the validation cohort included patients enrolled during the height of the COVID-19 pandemic, we redacted the terms “SARS-CoV-2” and “COVID-19” from their notes to avoid biasing the GPT-4 analysis. In our final version of the prompt (Supplementary Appendix 1), we asked GPT-4 to choose either LRTI or No LRTI, as exemplified in two example responses (Supplementary Appendix 2 and 3). We found that GPT-4 would sometimes give different answers to the same prompt and EMR input data in separate chat sessions. Therefore, for each patient, GPT-4 was asked to diagnose LRTI in three separate sessions. A per-patient GPT-4 score was calculated as the total number of LRTI-positive diagnoses made by GPT-4 (range 0–3).
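A minimal R sketch of this querying and scoring pattern is shown below, assuming an OpenAI-style chat completions endpoint. The endpoint URL, model identifier, API key variable, and answer parsing are illustrative placeholders, not the actual Versa service; `lrti_prompt`, `patient_note`, and `patient_cxr` are assumed objects.

```r
# Illustrative only: endpoint, model name, and key variable are hypothetical.
library(httr2)

ask_gpt4 <- function(prompt, note, cxr_read, temperature = 0.2) {
  resp <- request("https://example-hipaa-endpoint.org/v1/chat/completions") |>
    req_headers(Authorization = paste("Bearer", Sys.getenv("GPT4_API_KEY"))) |>
    req_body_json(list(
      model = "gpt-4-turbo",      # 128k-context GPT-4 turbo
      temperature = temperature,  # 0.2, as in the study
      messages = list(
        list(role = "user",
             content = paste(prompt,
                             "Clinical note:", note,
                             "CXR report:", cxr_read))
      )
    )) |>
    req_perform()
  resp_body_json(resp)$choices[[1]]$message$content
}

# Three separate sessions per patient; the per-patient score (0-3) counts
# LRTI-positive diagnoses. The parsing below assumes the prompt instructs
# GPT-4 to conclude with either "LRTI" or "No LRTI".
answers <- replicate(3, ask_gpt4(lrti_prompt, patient_note, patient_cxr))
gpt4_score <- sum(grepl("\\bLRTI\\b", answers) &
                  !grepl("\\bno LRTI\\b", answers, ignore.case = TRUE))
```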
Integrated classifier
The integrated classifier’s performance was tested using 5-fold cross-validation in the derivation cohort. Because of the smaller sample size, 3-fold cross-validation was used in the validation cohort. For each test fold, a logistic regression classifier was trained on the remaining training folds using both normalized FABP4 expression and the GPT-4 score. The performance and ROC curve for each fold were evaluated as described above. The sensitivity, specificity, and accuracy were calculated based on an out-of-fold predicted LRTI probability threshold of 50% or higher.
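Continuing the sketch from the FABP4 classifier section (and again assuming, rather than reproducing, the authors' objects: `vst_train`, `vst_test`, `lrti`, the fold index vectors, and a per-patient `gpt4_score` vector), one fold of the integrated classifier could look like:

```r
# Two-feature logistic regression: normalized FABP4 expression + GPT-4 score
train_df <- data.frame(y     = as.integer(lrti[train_idx] == "LRTI"),
                       fabp4 = vst_train["FABP4", ],
                       gpt4  = gpt4_score[train_idx])
fit <- glm(y ~ fabp4 + gpt4, data = train_df, family = binomial)

test_df <- data.frame(fabp4 = vst_test["FABP4", ],
                      gpt4  = gpt4_score[test_idx])
prob <- predict(fit, newdata = test_df, type = "response")

# Out-of-fold diagnosis at the 50% probability threshold, and the resulting
# confusion matrix against the gold-standard adjudication
pred <- ifelse(prob >= 0.5, "LRTI", "No LRTI")
table(predicted = pred, adjudicated = lrti[test_idx])
```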
Comparing GPT-4 to physicians provided with the same data
We compared LRTI diagnosis by GPT-4 against LRTI diagnosis made by three physicians trained in internal medicine (A.D.K.) or additionally subspecializing in infectious diseases (A.C., N.L.R.). The physicians were provided with identical information and prompts as GPT-4, and they were asked to assign each patient as either LRTI or No LRTI. The comparison physician group score (0–3) was calculated based on the total number of LRTI-positive diagnoses made by the comparison physicians.
Ethics statement
We studied patients from two prospective observational cohorts of critically ill adults with acute respiratory failure enrolled within 72 h of intubation at the University of California, San Francisco (UCSF) Medical Center (Fig. 1, Table 1). The derivation cohort7 (N = 202) was enrolled between 10/2013 and 01/2019, and the validation cohort (N = 115) was enrolled between 04/2020 and 12/2023. This research was approved by the University of California, San Francisco Institutional Review Board (IRB) under the following protocols: #10-02701 for the derivation cohort, and #20-30497 and #10-02852 for the validation cohort.
If a patient met inclusion criteria, then a study coordinator or physician obtained written informed consent for enrollment from the patient or their surrogate. Patients or surrogates were provided with detailed written and verbal information about the goals of the study, data and specimens that would be collected, and potential risks to the subject. Patients and their surrogates were also informed that there would be no benefit to them from enrollment in these studies and that they may withdraw informed consent at any time during the course of the study. All questions were answered, and informed consent documented by obtaining the signature of the patient or their surrogate on the consent document or on an IRB-approved electronic equivalent. As previously described25,26, the IRB granted an initial waiver of consent for patients who could not provide informed consent at time of enrollment.
More specifically, subjects who were unable to provide informed consent at the time of enrollment could have biological samples as well as clinical data from the medical record collected. Surrogate consent was actively pursued, and each patient was regularly examined to determine if and when they would be able to consent for themselves. For patients whose surrogates provided informed consent, direct consent from the patient was then obtained if they survived their acute illness and regained the ability to consent. A full waiver of consent was approved for subjects who died prior to consent being obtained. Further details on the enrollment and consent process for these studies can be found in two recent publications25,26.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Source data are provided with this paper. The human raw sequencing data are protected due to data privacy restrictions from the IRB protocol governing patient enrollment, which prohibits the release of raw sequencing data from patients enrolled under a waiver of consent. To honor this, researchers who wish to obtain FASTQ files for the purposes of independently generating gene counts can contact the corresponding author (natasha.spottiswoode@ucsf.edu) and request to be added to the IRB protocol. All patient demographic data, sample metadata, and processed host gene counts needed to replicate this study are available in the GitHub repository, and all data generated in this study are available in the code outputs file and in the Source Data file.
Code availability
All input data and code used in this study are available at https://github.com/infectiousdisease-langelier-lab/LRTI_FABP4_GPT4_classifier and at Zenodo (https://doi.org/10.5281/zenodo.17362824).
References
World Health Organization. The top 10 causes of death. https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death (2020).
Jain, S. et al. Community-acquired pneumonia requiring hospitalization among U.S. adults. N. Engl. J. Med 373, 415–427 (2015).
Langford, B. J. et al. Antibiotic resistance associated with the COVID-19 pandemic: a systematic review and meta-analysis. Clin. Microbiol Infect. 29, 302–309 (2023).
Centers for Disease Control and Prevention, National Center for Emerging and Zoonotic Infectious Diseases (U.S.), Division of Healthcare Quality Promotion. COVID-19: U.S. impact on antimicrobial resistance, special report 2022. https://stacks.cdc.gov/view/cdc/119025 (2022).
Tsalik, E. L. et al. Host gene expression classifiers diagnose acute respiratory illness etiology. Sci. Transl. Med 8, 322ra11 (2016).
Mick, E. et al. Integrated host/microbe metagenomics enables accurate lower respiratory tract infection diagnosis in critically ill children. J. Clin. Invest. https://doi.org/10.1172/JCI165904 (2023).
Langelier, C. et al. Integrating host response and unbiased microbe detection for lower respiratory tract infection diagnosis in critically ill adults. Proc. Natl Acad. Sci. USA 115, E12353–E12362 (2018).
Lydon, E. C. et al. Pulmonary FABP4 is an inverse biomarker of pneumonia in critically ill children and adults. Am. J. Respir. Crit. Care Med 210, 1480–1483 (2024).
Furuhashi, M., Saitoh, S., Shimamoto, K. & Miura, T. Fatty acid-binding protein 4 (FABP4): pathophysiological insights and potent clinical biomarker of metabolic and cardiovascular diseases. Clin. Med Insights Cardiol. 8, 23–33 (2014).
Perea, L. et al. Reduced airway levels of fatty-acid binding protein 4 in COPD: relationship with airway infection and disease severity. Respir. Res 21, 21 (2020).
Liao, M. et al. Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19. Nat. Med 26, 842–844 (2020).
van der Meer, V., Neven, A. K., van den Broek, P. J. & Assendelft, W. J. Diagnostic value of C reactive protein in infections of the lower respiratory tract: systematic review. BMJ 331, 26 (2005).
Self, W. H. et al. Procalcitonin as a marker of etiology in adults hospitalized with community-acquired pneumonia. Clin. Infect. Dis. 66, 1640–1641 (2018).
OpenAI. Introducing ChatGPT. https://openai.com/index/chatgpt/ (2022).
Zhou, Y. et al. Evaluating GPT-V4 (GPT-4 with Vision) on detection of radiologic findings on chest radiographs. Radiology 311, e233270 (2024).
Kessler, D. et al. Development and testing of a deep learning algorithm to detect lung consolidation among children with pneumonia using hand-held ultrasound. PLoS One 19, e0309109 (2024).
Chen, K. C. et al. Diagnosis of common pulmonary diseases in children by X-ray images and deep learning. Sci. Rep. 10, 17374 (2020).
Beaulieu-Jones, B. K. et al. Machine learning for patient risk stratification: standing on, or looking over, the shoulders of clinicians? NPJ Digit. Med 4, 62 (2021).
Maillard, A. et al. Can chatbot artificial intelligence replace infectious diseases physicians in the management of bloodstream infections? A prospective cohort study. Clin. Infect. Dis. 78, 825–832 (2024).
Lee, P., Bubeck, S. & Petro, J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. Reply. N. Engl. J. Med 388, 2400 (2023).
Goh, E. et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw. Open 7, e2440969 (2024).
Merenstein, D. J., Barrett, B. & Ebell, M. H. Antibiotics not associated with shorter duration or reduced severity of acute lower respiratory tract infection. J. Gen. Intern. Med. https://doi.org/10.1007/s11606-024-08758-y (2024).
Centers for Disease Control and Prevention. Antibiotic resistance threats in the United States. (U.S. Department of Health and Human Services, Atlanta, GA, 2019). https://www.cdc.gov/antimicrobial-resistance/media/pdfs/2019-ar-threats-report-508.pdf.
World Health Organization. Antimicrobial stewardship interventions: a practical guide. https://www.who.int/europe/publications/i/item/9789289056267 (2021).
Sarma, A. et al. Tracheal aspirate RNA sequencing identifies distinct immunological features of COVID-19 ARDS. Nat. Commun. 12, 5152 (2021).
Spottiswoode, N. et al. Microbial dynamics and pulmonary immune responses in COVID-19 secondary bacterial pneumonia. Nat. Commun. 15, 9339 (2024).
Centers for Disease Control and Prevention, National Healthcare Safety Network. CDC/NHSN surveillance definition for specific types of infections. https://www.cdc.gov/nhsn/pdfs/pscmanual/17pscnosinfdef_current.pdf (2021).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinforma. 12, 77 (2011).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35 (NeurIPS, 2022).
Acknowledgements
This work was supported by the National Institute of Allergy and Infectious Diseases, National Institutes of Health [grant number R01AI185511 to C.R.L.], the National Heart, Lung, and Blood Institute, National Institutes of Health [grant number R35HL140026 to C.S.C.], and the Chan Zuckerberg Biohub [C.R.L.].
Author information
Authors and Affiliations
Contributions
H.V.P., N.S., and C.R.L. conceived and designed the study. C.S.C., C.R.L. supervised the study. E.C.L., V.T.C., and P.D. acquired the data. A.C., A.D.K., and N.L.R. provided comparison physician diagnoses. H.V.P., N.S., C.R.L., E.C.L., and V.T.C. analyzed and interpreted the data. H.V.P., N.S., and C.R.L. wrote the manuscript. E.C.L., V.T.C., and C.S.C. reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
No authors report financial or personal conflicts of interest.
Peer review
Peer review information
Nature Communications thanks Gustavo Sganzerla Martinez, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Phan, H.V., Spottiswoode, N., Lydon, E.C. et al. Integrating a host biomarker with a large language model for diagnosis of lower respiratory tract infection. Nat Commun 16, 10882 (2025). https://doi.org/10.1038/s41467-025-66218-5