Introduction

Lower respiratory tract infections (LRTIs) cause more than 3 million deaths a year1,2. An etiological diagnosis of acute respiratory disease is often unavailable because current microbiological tests are limited in sensitivity, speed, and spectrum. In fact, conventional microbiological methods identify a pathogen in only 30-40% of LRTIs cases3. Without a reliable diagnosis, clinicians may initiate unnecessary antibiotic treatment, possibly driving the emergence of antibiotic resistance. Complex clinical cases pose a further challenge. Critically ill patients in the intensive care unit can present with overlapping clinical features of infection and noninfectious inflammatory conditions4, and severely immunocompromised individuals may present with atypical infection characteristics due to immuno-hyporesponsiveness5.

The conventional model of LRTIs pathogenesis assumes that the lungs are sterile. However, lung microbiome studies using culture-independent assessment of microbial genomes have shown that the pathogenesis of LRTIs is an emergent phenomenon in which a pre-existing homeostasis of biodiversity gives way to a combination of high microbial biomass, low microbial diversity, and host inflammation6,7,8. Another new frontier in infectious disease diagnostics is examining the host response to infection, which offers a strategy complementary to the pathogen-based diagnostics already in use for distinguishing acute respiratory infections from noninfectious illnesses9,10.

Metagenomic next-generation sequencing (mNGS) holds promise for overcoming the diagnostic limitations of traditional pathogen identification, and for extending both transcriptional biomarker assays of the immune response and analysis of the airway microbiome in patients with LRTIs, by testing microliter volumes of clinical samples4,11,12. Here we combine host transcriptional profiling with the respiratory microbiome using machine learning to improve the diagnostic accuracy for LRTIs, and then evaluate the performance of this diagnostic model.

Methods

Study design and subjects

This study sequentially enrolled subjects with suspected LRTIs who were admitted to Peking University People’s Hospital between May 2020 and January 2021. The study was approved by the research ethics board at Peking University People’s Hospital (approval no. 2019PHB134). The inclusion criteria were as follows: (1) compliance with the clinical definition of LRTIs according to the US Centers for Disease Control/National Healthcare Safety Network (CDC/NHSN) surveillance case definitions (Dataset S1), adjudicated by at least two physicians; (2) informed consent and voluntary participation in the study; and (3) complete clinical data and a flexible bronchoscopy with bronchoalveolar lavage fluid (BALF) collected. The exclusion criteria were as follows: (1) definite evidence of co-infection of other organs; (2) bronchoscopy could not be performed; (3) pregnant or breastfeeding women; and (4) subjects who were unable to cooperate, unwilling to participate, or whose clinical data were incomplete.

Demographic and baseline characteristics, clinical presentation, radiography, and laboratory findings of the 136 recruited subjects were obtained from electronic medical records and reviewed for LRTIs diagnosis. Continuous variables were summarized as medians with interquartile ranges (IQR) and were compared using the Mann–Whitney U test. Categorical data were presented as numbers (%) and were analyzed using either the chi-square test or Fisher’s exact test. Statistical significance was defined as a two-sided P-value less than 0.05.
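The two group comparisons named above can be illustrated with a minimal standard-library Python sketch. It is not the authors' analysis (which was run in SPSS); the function names and the toy values in the test data are hypothetical. The sketch computes only the Mann-Whitney U statistic (no P-value) and a two-sided Fisher exact P-value for a 2x2 table.

```python
from math import comb

def mann_whitney_u(x, y):
    """U statistic for sample x: number of pairs (xi, yj) with xi > yj,
    counting ties as 0.5. Converting U to a P-value is not shown here."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):
        # Probability that the top-left cell equals k given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))
```

A result would be called significant under the stated rule when the two-sided P-value is below 0.05.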

Specimen preparation and mNGS

The BALF specimens from the 136 subjects were collected after admission to the hospital and before antibiotic therapy. Residual BALF was mixed with DNA/RNA Shield on ice and frozen at −80 °C. We established an mNGS and data analysis protocol comprising DNA/RNA extraction and purification, library construction, sequencing, and bioinformatics analysis. DNA and RNA were extracted from 300 μL of the patient BALF mixture using the MagMAX™ CORE Nucleic Acid Purification Kit (Thermo Fisher Scientific). Sterile deionized water was extracted alongside the specimens as a negative control and was carried through the full process, including specimen preparation, library construction, and sequencing. To obtain microbial transcripts, which indicate that the microbes were alive at the time of sampling, DNA was removed from the RNA using DNase (RNase free; Thermo Fisher Scientific), and the concentrations of purified RNA were then determined using the Qubit RNA HS (high sensitivity) Assay Kit (Thermo Fisher Scientific). The collected RNA was reverse transcribed to generate cDNA, which was used to prepare sequencing libraries with the Reverse Transcription Module for NGS Library Prep Kit (Matridx, Hangzhou, China). The final DNA-seq libraries were sequenced on Illumina’s NextSeq 550 System (150-bp paired-end reads).

Bioinformatics pipeline

Reads were processed with Trimmomatic (version 0.36)13 to remove adapters and trim low-quality bases. After bases with a quality score below 20 were trimmed, read pairs of at least 35 bp were retained as high-quality sequencing data for further analysis. First, these high-quality reads were mapped to the human reference genome (GRCh38) to separate host reads (mapped reads) from microbial reads (unmapped reads). Then, the unmapped reads were aligned against the SILVA (version 138) database14 using Bowtie2 (v2.3.4.3)15 to remove rRNA sequences.
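The quality and length thresholds above can be sketched in a few lines of Python. This is a simplified stand-in, not Trimmomatic's algorithm: the source does not say which trimming steps were used, so the sketch assumes bases below Q20 are trimmed from both read ends only, and that a pair is kept when both mates remain at least 35 bp long. Function names are hypothetical.

```python
def trim_low_quality_ends(seq, quals, min_q=20):
    """Trim bases with quality below min_q from both ends of a read.

    seq is the base string; quals is a list of Phred scores of equal length.
    """
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]

def keep_pair(read1, read2, min_q=20, min_len=35):
    """Return the trimmed pair if both mates are still >= min_len, else None.

    Each read is a (sequence, qualities) tuple.
    """
    t1 = trim_low_quality_ends(read1[0], read1[1], min_q)
    t2 = trim_low_quality_ends(read2[0], read2[1], min_q)
    if len(t1[0]) >= min_len and len(t2[0]) >= min_len:
        return t1, t2
    return None
```

Pairs that pass this filter would then go to host mapping; pairs returning None are discarded.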

Microbiome analysis

After filtering and rRNA removal, microbial reads were used for species identification and abundance calculation with Kraken2. To decontaminate each sample, the mean abundance across all negative control samples (sterile deionized water) was used as the baseline, and microorganisms with an abundance less than tenfold the negative-control mean were removed from further analysis. The α diversity of the microbial flora was estimated using the vegan package in R. Differential abundance analysis of microorganisms was performed using the edgeR package in R16. Sixty-eight samples whose total read count was greater than 1,000,000 were selected for trimmed mean of M-values (TMM) normalization and significance testing. The Rtsne package17 was used for t-SNE analysis. The TMM-normalized abundance of microorganisms was also used for subsequent correlation analysis and machine learning.
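The decontamination rule and one of the α diversity measures can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' R code: abundances are held in plain dictionaries keyed by taxon name, a taxon absent from every negative control has a baseline of zero and is kept, and the Shannon index uses the natural logarithm. All names and the sample values in the test data are hypothetical.

```python
from math import log

def decontaminate(sample, negative_controls, fold=10.0):
    """Keep taxa whose abundance is at least `fold` times the mean
    abundance of the same taxon across the negative controls."""
    kept = {}
    for taxon, abundance in sample.items():
        baseline = sum(nc.get(taxon, 0.0) for nc in negative_controls) / len(negative_controls)
        if abundance > 0 and abundance >= fold * baseline:
            kept[taxon] = abundance
    return kept

def shannon_index(abundances):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with p_i > 0."""
    total = sum(abundances)
    return -sum((a / total) * log(a / total) for a in abundances if a > 0)
```

A sample dominated by one taxon gives a Shannon index near zero, which is the direction of change the study reports for the LRTIs group.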

Host transcriptome analysis

The high-quality data were aligned to the human reference genome with STAR (v2.7.8a), and the read count of each host gene was then estimated using featureCounts (v2.0.1)18,19. We analyzed differential expression using DESeq220. First, the samples with a total read count greater than 1,000,000 were selected for normalization, and then genes with an absolute log2 fold change (log2FC) greater than 1 and a corrected P-value less than 0.05 were selected as significantly differentially expressed genes (DEGs). Mitochondrial genes were removed. Enrichment analysis was performed using the clusterProfiler package21. In addition, gene read counts were used to calculate transcripts per million (TPM) values as the measure of gene expression for subsequent correlation analysis and machine learning.
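The TPM calculation and the DEG thresholds described above can be written out as a short Python sketch. It assumes the standard TPM definition (read count divided by gene length, rescaled so that the values sum to one million); the source does not state which gene length was used, and the function names are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from raw read counts and gene lengths (bp).

    TPM_i = (count_i / length_i) / sum_j(count_j / length_j) * 1e6
    """
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total = sum(rates.values())
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}

def is_deg(log2_fold_change, adjusted_p, lfc_cutoff=1.0, alpha=0.05):
    """Apply the stated thresholds: |log2FC| > 1 and corrected P < 0.05."""
    return abs(log2_fold_change) > lfc_cutoff and adjusted_p < alpha
```

Because TPM values sum to the same total in every sample, they are comparable across samples as relative expression, which is why they suit the downstream correlation step.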

Correlation analysis between the microbiome and host transcriptome

We used the TPM expression of host genes and the TMM-normalized abundance of the microbial flora for correlation analysis. The genes and bacterial populations with significant differential expression or abundance were selected, and correlations were calculated with the Hmisc package in R. Gene-microbe pairs with an absolute correlation coefficient greater than 0.5 and a P-value less than 0.05 were considered significantly correlated. For the microbes correlated with host genes, microbial gene expression was then estimated: microbial reads were mapped to the selected microbes’ reference genomes using Bowtie2 (v2.3.4.3), and the read count of the microbial genes was estimated using featureCounts (v2.0.1). The read counts of microbial genes were normalized to TPM values for further analysis. Correlation analysis between microbial genes and significantly differentially expressed host genes was performed with the Hmisc package in R. The correlation network between genes was visualized with Cytoscape (v3.7.1).
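The pairwise screening step can be sketched in Python as below. Two assumptions are made loudly: the source does not say whether Pearson or Spearman correlation was used, so Pearson is assumed here, and only the |r| > 0.5 cutoff is applied (the P-value filter is not reproduced in this sketch). Names and inputs are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def strong_pairs(gene_expr, taxon_abund, r_cutoff=0.5):
    """Return (gene, taxon, r) for every pair with |r| above the cutoff.

    gene_expr and taxon_abund map names to per-sample value lists,
    with samples in the same order in both.
    """
    pairs = []
    for gene, gx in gene_expr.items():
        for taxon, tx in taxon_abund.items():
            r = pearson_r(gx, tx)
            if abs(r) > r_cutoff:
                pairs.append((gene, taxon, r))
    return pairs
```

The sign of r separates the two patterns reported later: host genes rising with a taxon (positive) and host genes falling as it rises (negative).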

Machine learning

We trained machine learning models to predict the diagnosis of LRTIs versus non-LRTIs samples. Significantly differentially expressed host genes and microorganisms with significantly differential abundance were selected as features. Model training was carried out mainly with the caret package in R. First, we randomly partitioned the data into a training set and a validation set: two-thirds of the samples were used for model training, and the remaining one-third were used for validation. Next, random forest analysis was used to estimate feature importance and select the feature set with the best predictive performance. The optimized features were then used to train three typical machine learning models: Random Forest, support vector machine (SVM), and artificial neural network (ANN). For the Random Forest model, the caret package was used to tune the model with the repeatedcv method, a tuning grid from 4 to 20, and ntree set to 1000. For the SVM model, tuning was also performed with the caret package and the repeatedcv method, using a radial kernel and a tuning grid of sigma = 0.005, 0.01, 0.015, 0.2 and C = 0.55, 0.7, 0.75, 0.9, 1. For the ANN model, the mlp function of the RSNNS package was used with learnFunc = Std_Backpropagation, maxit = 1000, and learnFuncParams = c(0.1,2.0,0.0001,0.1). Finally, the performance of the different models was evaluated, and the ROC curves with 95% confidence intervals (CI) were estimated and drawn using the pROC package22.
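Two pieces of this procedure, the random two-thirds/one-third partition and the area under the ROC curve, can be illustrated with a small Python sketch. It is not the caret or pROC implementation: the seed, the rounding of the cut point, and the function names are assumptions, and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counting one half).

```python
import random

def split_two_thirds(sample_ids, seed=0):
    """Randomly split sample IDs into (training, validation) at about 2:1."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # local RNG keeps the split reproducible
    cut = round(len(ids) * 2 / 3)
    return ids[:cut], ids[cut:]

def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 1 for LRTIs and 0 for non-LRTIs; scores are model outputs
    where higher means more likely LRTIs.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Only the validation portion should be scored with roc_auc; scoring the training portion would overstate performance.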

Results

Subject baseline characteristics

We retrospectively enrolled 68 subjects with suspected LRTIs who were admitted to Peking University People’s Hospital for machine learning. Forty-one subjects were identified as LRTIs, including 28 cases in the training cohort and 13 cases in the validation cohort. Another 27 subjects were identified as non-LRTIs, including 18 cases in the training cohort and 9 cases in the validation cohort. The baseline demographic information, clinical parameters, and laboratory features of the LRTIs and non-LRTIs groups were compared, with P values calculated in SPSS 26, and are given in Table 1 and Dataset S2. We subsequently enrolled another 68 subjects with suspected LRTIs for external validation of the prediction model, of whom 45 were identified as LRTIs and 23 as non-LRTIs. Their baseline characteristics were also compared in SPSS 26 and are listed in detail in Supplementary Table 1 and Dataset S2.

Table 1 Demographic, clinical, and laboratory characteristics of the 68 cases used for machine learning

The study followed the workflow shown in the flow chart (Fig. 1), comprising the experimental methods, microbiome analysis, host transcriptome analysis, and machine learning for building the prediction model.

Fig. 1: Flow chart of this study.

RNA sequencing was performed on BALF specimens. The host transcriptome analysis included alignment to the reference genome by STAR, quantification by RSEM, and differential expression analysis by DESeq2. The microbiome analysis included the removal of rRNA, estimation of taxonomic abundance by Bracken, and testing the significance of differential abundance of microbial species by edgeR. The 68 cases were used to construct LRTIs diagnosis models. Two-thirds of the samples were used as the training cohort, and the remaining one-third were used as the validation cohort. We selected host genes with significant differential expression and microbial flora abundance as features to train three typical machine learning models: Random Forest, SVM, and ANN. To further verify the effectiveness of the model, we validated it in an independent dataset of 68 cases.

Differential lower respiratory tract microbiome in LRTIs patients versus non-LRTIs patients

Compared with the non-LRTIs group, the microbiota diversity in the BALF of patients with LRTIs was decreased (Fig. 2A and Dataset S3). Differences in the microbiota between samples were visualized with the t-SNE algorithm, where each point represents one BALF sample. Similar samples lie relatively close to each other, forming clusters of microbiota composition. The microbiota of the 68 BALF samples formed 5 clusters. All samples in cluster 1 are LRTIs samples, all but one of the cluster 3 samples are LRTIs samples, the majority of cluster 5 samples are non-LRTIs samples, and LRTIs and non-LRTIs samples are mixed in clusters 2 and 4 (Fig. 2B). We overlaid the most abundant species of each sample on the t-SNE plot (Fig. 2C). In cluster 5 (mainly non-LRTIs samples), the most abundant species are Corynebacterium xerosis (n = 12) and Rothia mucilaginosa (n = 1), which are normal members of the respiratory tract microbiome. The most abundant species in all samples in cluster 1 (all LRTIs samples) is the pathogen Klebsiella pneumoniae. The most abundant species in most of the samples in cluster 3 (mainly LRTIs samples) are conditional pathogenic bacteria, such as Klebsiella pneumoniae, Escherichia coli and Pseudomonas aeruginosa. In clusters 2 and 4 (mixed), the normal respiratory flora and pathogenic bacteria were interwoven (Fig. 2C).

Fig. 2: Differential lower respiratory tract microbiome in LRTIs patients versus non-LRTIs patients.

A Boxplot of Chao1 index, ACE index, Shannon index, and Simpson index. The P-value was calculated by the Wilcoxon test. B Scatter plot with five subgroups identified by t-SNE analysis. C Dot plot with five subgroups identified by t-SNE analysis; dot size indicates the abundance of species. D Bar plot of the differential abundance of microbial species; bars represent the relative abundance of species, error bars represent the 95% CI, and the P-value from edgeR is shown on the right side.

The lower respiratory tract microbiome (LRTM) showed that the abundance of oral and pharyngeal commensal bacteria (Treponema denticola, Porphyromonas gingivalis, Filifactor alocis, Corynebacterium dentalis, Leptotrichia hofstadii, and Actinomyces nasicola) in BALF of the non-LRTIs group was significantly higher than that of LRTIs, while the abundance of opportunistic pathogens (Pneumocystis jirovecii, Corynebacterium striatum, Haemophilus influenzae, Pseudomonas aeruginosa, Human betaherpesvirus 5, Streptococcus pneumoniae, Human coronavirus HKU1, Tropheryma whipplei, Aspergillus fumigatus, Klebsiella pneumoniae, Klebsiella aerogenes, Moraxella catarrhalis, Rhinovirus A, and Acinetobacter calcoaceticus/baumannii complex) in the LRTIs group was significantly higher than that of the non-LRTIs group (Fig. 2D and Dataset S4). Pneumocystis jirovecii showed the most significant difference between the two groups (Fig. 2D).

Differential host transcriptome in LRTIs patients versus non-LRTIs patients

BALF host transcriptome analysis revealed a total of 649 differentially expressed genes (DEGs), of which 613 were upregulated and 36 were downregulated in the LRTIs group (Dataset S5). The number of DEGs upregulated in the LRTIs group was significantly higher than that in the non-LRTIs group (Fig. 3A). Host DEGs were classified into three clusters, which are mainly enriched in the immune system, signal transmission, infectious disease: viral and bacterial, signaling molecules and interactions, and other pathways. Significantly more DEGs were enriched in the signaling molecules and interaction pathway in cluster C than in clusters A and B (Fig. 3A).

Fig. 3: Differential host transcriptome in LRTIs patients versus non-LRTIs patients.

A Heatmap of host differentially expressed genes; the genes were grouped into three subsets, each with a different KEGG pathway enrichment indicated by the pie chart. B Network linking genes and KEGG pathways; large nodes represent pathways, and small nodes represent genes colored by log2 fold change. C Dot plot of DOSE enrichment of host differentially expressed genes; each dot represents a significantly enriched disease, dot size indicates the number of enriched genes, and dot color indicates the FDR.

Gene Ontology (GO) enrichment analysis was further performed on the DEGs (Dataset S6). The DEGs were mainly enriched in infection- and immune-related pathways; in particular, the DEGs upregulated in the LRTIs group were mainly enriched in leukocyte-related pathways (Supplementary Fig. 1). The Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment results showed that the DEGs in the LRTIs group were mainly enriched in cytokine-cytokine receptor interaction, NF-kappa B signaling pathway, neutrophil extracellular trap formation, chemokine signaling pathway, TNF signaling pathway, HIF-1 signaling pathway, and Th17 cell differentiation (Fig. 3B, Supplementary Fig. 2 and Dataset S7). Using clusterProfiler, we performed disease enrichment analysis of the DEGs in the DOSE database. The results indicate that the DEGs are mainly enriched in lung infection-related diseases, such as pneumonia (Fig. 3C and Dataset S8).

Interaction between active lower respiratory tract microbiome and host genes

The interaction between the host genes and the LRTM plays an important role in LRTIs; therefore, we performed a correlation analysis of the host’s differentially expressed genes and the LRTM. We found that 42 host genes correlated positively with opportunistic pathogens and negatively with normal flora. In addition, four host genes were negatively associated with conditional pathogenic bacteria but positively associated with normal flora (Fig. 4A and Dataset S9). Several genes associated with opportunistic pathogens are tied to infection (IFNGR1, WAS) or inflammation (CSF3R, SIGLEC9). Notably, IFNGR1 showed a strong positive correlation with Human coronavirus HKU1. This gene encodes the ligand-binding chain (alpha) of the gamma interferon receptor, which is pivotal for antimicrobial, antiviral, and antitumor responses by activating effector immune cells and enhancing antigen presentation23. A robust positive correlation was also identified between SIGLEC9 (Sialic Acid Binding Ig Like Lectin 9) and Rhinovirus A. Furthermore, a correlation analysis between specific microbial genes and differentially expressed host genes revealed a strong association between the alpha-2-macroglobulin family protein of Klebsiella pneumoniae and the host genes TNFRSF1B, CSF3R, and IL6R. Additionally, the peptide alpha-N acetyltransferase MAK3 of Pneumocystis jirovecii showed a strong relationship with the host’s myeloperoxidase (MPO) (Fig. 4B and Dataset S10). TNFRSF1B, a member of the TNF-receptor superfamily, mediates most of the metabolic effects of TNF-alpha, while CSF3R, the receptor for granulocyte colony-stimulating factor (CSF3), is crucial for granulocyte maturation. CSF3 plays a vital role in the proliferation, differentiation, and survival of neutrophil lineages24. The enrichment analysis of the host transcriptome likewise showed that DEGs were enriched in leukocyte-related pathways. The soluble IL6 receptor (sIL6R), functioning as an IL6 agonist, mediates the pro-inflammatory properties of IL6 and plays a key role in chronic inflammatory diseases25. MPO is part of the host defense system of polymorphonuclear leukocytes (PMNs) and exhibits bactericidal activity against various organisms. In stimulated PMNs, MPO catalyzes the production of hypochlorous acid and other toxic intermediates, significantly boosting the microbicidal activity of PMNs26 (Fig. 4).

Fig. 4: Interaction between active lower respiratory tract microbiome and host gene expression.

A Heatmap of correlation between differential microorganisms and host differential expressed genes. B Network of correlation between host differential expressed genes and microbial genes; blue nodes are host genes and genes from microbes were colored according to the corresponding microbe’s name.

Discriminate LRTIs and non-LRTIs based on lower respiratory tract microbiome and host response

Many challenges remain in diagnosing LRTIs. This study therefore attempted to discriminate LRTIs from non-LRTIs based on the LRTM and host gene expression through machine learning. The random forest model was used for feature selection on the differential LRTM and differentially expressed host genes, and 12 features were finally selected by training and testing, comprising 6 optimal LRTM features and 6 optimal host gene features. We then used these features to train, tune, and test three classical machine learning models: Random Forest, SVM, and ANN. The results showed that the Random Forest model had the best performance in the validation set (Fig. 5A–C and Dataset S11). The model that integrated LRTM and host gene features (ROC AUC: 0.937, 95% CI: 0.832–1) was superior to models using host gene features alone (ROC AUC: 0.881, 95% CI: 0.736–1) or LRTM features alone (ROC AUC: 0.881, 95% CI: 0.731–1), and had the highest accuracy (82.9%), precision (85.7%), and recall (85.7%) (Dataset S11). To further verify the effectiveness of this model, we validated it in another independent dataset of 68 cases (Dataset S2C). The sensitivity, specificity, positive predictive value, negative predictive value, and accuracy of the model were 95.6%, 39.1%, 75.4%, 81.8%, and 76.5%, respectively (Dataset S12).
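The five external-validation metrics reported above follow directly from a 2x2 confusion matrix, as the Python sketch below shows. The counts used in the example (43 true positives, 2 false negatives, 9 true negatives, 14 false positives) are not stated in the source; they are back-calculated here from the reported percentages and the group sizes of 45 LRTIs and 23 non-LRTIs cases, and should be read as an inference that happens to reproduce all five values.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard diagnostic accuracy measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),           # LRTIs cases correctly flagged
        "specificity": tn / (tn + fp),           # non-LRTIs cases correctly cleared
        "ppv": tp / (tp + fp),                   # positive predictive value
        "npv": tn / (tn + fn),                   # negative predictive value
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# Inferred counts (an assumption, see the paragraph above): 45 LRTIs, 23 non-LRTIs.
external = diagnostic_metrics(tp=43, fn=2, tn=9, fp=14)
percentages = {name: round(value * 100, 1) for name, value in external.items()}
```

Read this way, the low specificity corresponds to 14 of the 23 non-LRTIs cases being classified as LRTIs, while only 2 of the 45 LRTIs cases were missed.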

Fig. 5: LRTIs prediction based on lower respiratory tract microbiome and host response.

A ROC curves in the validation set; colored lines represent different models that include both human gene and microbial features. B ROC curves in the validation set; colored lines represent different models that include only human gene features. C ROC curves in the validation set; colored lines represent different models that include only microbial features. D Heatmap of the features used in the model.

Further analysis of these features showed that the abundance of most opportunistic pathogens was higher in the LRTIs group than in the non-LRTIs group; conversely, the abundance of most commensal bacteria was higher in the non-LRTIs group than in the LRTIs group. The host genes CSF3R, NBPF19, AL121985.1, RLF, and TNFRSF14 were upregulated in the LRTIs group, while CICP7 was downregulated (Fig. 5D). CSF3R and TNFRSF14 may play important roles among the host-related features in LRTIs. TNFRSF14 encodes a member of the tumor necrosis factor (TNF) receptor superfamily. The encoded protein plays a role in signaling pathways that activate both inflammatory and inhibitory T-cell immune responses. It also binds to glycoprotein D (gD) on the envelope of herpes simplex virus (HSV), mediating the entry of the virus into cells.

Discussion

LRTIs have high morbidity and mortality1,2,27. In some cases, LRTIs and non-LRTIs patients were fairly similar in clinical characteristics even though the LRTIs cases met the case definition, as shown by the key clinical characteristics in our study (P > 0.05). To address this diagnostic challenge, this study used machine learning to integrate the host transcriptome and LRTM to distinguish LRTIs from non-LRTIs.

The characteristics and diversity of the respiratory microbiome are closely related to LRTIs. A study of children with LRTIs found that the presence and abundance of Haemophilus spp., Streptococcus pneumoniae, and oral species were closely related to infection28. In contrast, the abundance of potentially beneficial bacteria such as Moraxella, Corynebacterium, Dolosigranulum, and Helcococcus spp. was lower in infection cases28. Healthy individuals have a high diversity of respiratory microbiota, and a decrease in this diversity can lead to ecological imbalance29. A previous study found a decrease in fungal diversity in ventilated patients with pneumonia30. This study found a significant decrease in the diversity of respiratory microbiota in the LRTIs group, which is consistent with previous studies. This study also found that the abundance of normal flora (Porphyromonas gingivalis, Corynebacterium dentalis, and Treponema denticola) was significantly higher in the non-LRTIs group, while the abundance of opportunistic pathogens, such as Klebsiella pneumoniae, Pseudomonas aeruginosa, Streptococcus pneumoniae, and Pneumocystis jirovecii, was higher in the LRTIs group. These results suggest that decreased LRTM diversity and increased abundance of opportunistic pathogens are high-risk factors for LRTIs.

LRTIs are caused by the interaction between microbiota and host characteristics rather than by a single microorganism28. Therefore, for LRTIs, we need to focus on both the LRTM and the immune response of the host. This study showed that, in the state of infection, upregulated host genes significantly outnumbered downregulated genes, which may be related to the acute nature of pneumonia.

This study demonstrated that differentially expressed genes in the LRTIs group were predominantly enriched in pathways related to infection or inflammation. Pathogen invasion triggers the activation of the immune system, cytokines, and intricate signaling pathways involved in LRTIs. Pathogen components like lipopolysaccharides (LPS) are recognized by antigen-presenting cells (APCs). With the help of chemokines, immune cells such as neutrophils are recruited to the site of infection or injury to capture and eliminate microbes locally through the formation of neutrophil extracellular traps (NETs). Simultaneously, transcription factors like NF-kappa B and HIF-1 are activated by upstream signals, leading to the expression of various cytokines, including TNF and IL-17 downstream. These cytokines play a crucial role in mediating communication between immune cells and inflammation during the progression of pneumonia.

The interplay between the human LRTM and the host alveolar epithelial cells and the immune system is intricate and influenced by multiple factors31,32. Genera such as Haemophilus and Moraxella in the LRTM upregulate neutrophils by boosting interleukin-8 (IL-8) production, resulting in airway hyperresponsiveness during respiratory viral infections31. Tumor necrosis factor-alpha (TNF-α) serves as a critical inflammatory cytokine in acute respiratory distress syndrome (ARDS) pulmonary inflammation, with the ability to independently predict mortality33. The enrichment of Proteobacteria in the LRTM of ARDS patients can lead to an increase in TNF-α levels, while the enrichment of Bacteroidetes is associated with its reduction34. This study also identified a notable positive correlation between opportunistic pathogenic bacteria and genes linked to host infection or inflammation. Further analysis indicated a strong relationship between the alpha-2-macroglobulin family protein of Klebsiella pneumoniae and genes related to host infection or inflammation, such as TNFRSF1B, CSF3R, and IL6R. Additionally, the peptide alpha-N acetyltransferase MAK3 of Pneumocystis jirovecii displayed a strong correlation with the host’s infection-related gene MPO. These findings suggest an interaction between the LRTM and pathways associated with host infection or inflammation in LRTIs, potentially influencing the onset and progression of LRTIs. Further validation through cellular and animal experiments is necessary to confirm these observations.

As mentioned above, LRTIs are not caused by a single microorganism; they result from the interaction between the LRTM and the host. In conventional clinical practice, LRTIs are usually initially diagnosed based on the patient’s clinical manifestations, radiography data, and inflammatory indicators. Including the respiratory microbiota and the host immune response in the model may better help clinicians discriminate LRTIs from non-LRTIs. This study used a random forest algorithm to screen out 12 features (6 differential lung microbiota features and 6 host DEGs) that can effectively discriminate LRTIs from non-LRTIs. Most of these features are related to LRTIs. Therefore, a Random Forest model based on these two kinds of information, supported both biologically and statistically, is expected to be applicable in clinical settings for the precise diagnosis of LRTIs. Because the sample size of this study was relatively small, the precision and long-term stability of the model were affected. Given this limitation, the model cannot completely replace clinical diagnosis, but it can serve as a supporting tool for inexperienced clinicians making clinical diagnoses. The long-term stability of the model can be improved by recruiting more samples. Adjusting the model’s features to enhance its long-term stability would also require updating the features and retraining the model. In addition to the microbiome and host transcriptome features, additional features such as the metabolome and serum proteome could be incorporated to improve the model’s scalability.

This study has several limitations. Although several key respiratory microbiota and host genes have been found to play important roles in LRTIs, their detailed biological functions have not been further validated. This study employed machine learning to identify features for effectively diagnosing LRTIs. However, there is a need to enhance the long-term stability and scalability of the model.

In summary, to address the clinical difficulty of diagnosing LRTIs, this study combined LRTM and host transcriptome information and used machine learning to construct a model based on 12 features that can diagnose LRTIs well, providing a novel strategy for the clinical diagnosis of LRTIs.