Stacking ensemble learning models diagnose pulmonary infections using host transcriptome data from metatranscriptomics

Zhang, Tian; Deng, Ying; Wang, Wentao; Zhao, Zhe; Wu, Yiling; Wang, Haoqian; Xia, Shutao; Liao, Weifang; Liao, Weijie

doi:10.1038/s41598-025-15914-9


Article
Open access
Published: 20 August 2025


Tian Zhang^1,2^na1,
Ying Deng¹^na1,
Wentao Wang⁵,
Zhe Zhao⁵,
Yiling Wu²,
Haoqian Wang¹,
Shutao Xia^1,2,
Weifang Liao⁴ &
Weijie Liao^1,3

Scientific Reports volume 15, Article number: 30516 (2025)


Abstract

The prompt diagnosis of pulmonary infections of unknown etiology in severely ill patients remains a challenge because rapid and effective diagnostic methods are lacking. While metatranscriptomic sequencing offers a powerful approach, its clinical utility is often limited by slow turnaround. In this study, we conducted metatranscriptomic sequencing on bronchoalveolar lavage fluid (BALF) collected from critically ill, severely ill, and ICU patients. Based on microbial detection results, patients were classified into four types: negative, bacterial infection, viral infection, and fungal infection. To identify host gene expression signatures associated with infection, we screened characteristic genes from the human metatranscriptomic data by comparing patients with and without confirmed infection in a training set comprising 70% of the patients. Leveraging these characteristic genes, we constructed classification sub-models with 13 machine learning algorithms and integrated these sub-models into stacking-based ensemble models using Lasso regression, resulting in diagnostic models that require only a small set of gene expression inputs. The average performance in five-fold cross-validation demonstrated high diagnostic accuracy: distinguishing infection from non-infection (AUC = 0.984), bacterial infection from non-bacterial infection (AUC = 0.98), and viral infection from non-viral infection (AUC = 0.98). On the test cohorts, the method showed high concordance with metatranscriptomic sequencing in determining infection status (AUC = 0.865) and infection type (viral: AUC = 0.934; bacterial: AUC = 0.871). Our study presents a rapid and inexpensive adjunctive diagnostic strategy that achieves diagnostic accuracy comparable to metatranscriptomic sequencing, enabling timely identification of both infection status and type in pulmonary infections.


Introduction

As a vital organ for gas exchange between the human body and the external environment, the lungs are highly susceptible to infection due to various factors, including microbial invasion, environmental exposure, and iatrogenic causes. However, the histopathological features of pulmonary infections, such as inflammatory infiltration and tissue necrosis, often closely resemble those of non-infectious diseases, posing a significant diagnostic challenge for surgical pathologists. Failure to establish a timely and accurate diagnosis may lead to disease progression, including pulmonary fibrosis or even systemic dissemination to other organs. In severe cases, patients can rapidly deteriorate into respiratory failure or sepsis, ultimately resulting in death¹. Therefore, timely and precise diagnosis, followed by targeted intervention, is critical for improving clinical outcomes.

Patients with severe illness, critical condition, and those admitted to the ICU frequently face life-threatening pulmonary infections of undetermined etiology, posing major diagnostic and therapeutic challenges in clinical management^2,3. While conventional methods for detecting pulmonary infections encompass physical examinations, imaging studies, microbial cultures, and blood tests, they are encumbered by low specificity, delays, restricted detection capabilities, and intricate operational procedures^4,5. In the absence of a confirmed diagnosis, empirical therapy predominates, typically involving the combined use of broad-spectrum antibiotics, antiviral agents, and/or antifungal drugs. This approach is costly, lacks specificity, and contributes to antibiotic misuse, drug resistance, and potential adverse effects³.

In recent years, metagenomic sequencing has emerged as a promising strategy for diagnosing pulmonary infections^6,7. Both metagenomic and metatranscriptomic detection technologies offer substantial improvements in detection accuracy⁸ and microbial spectrum coverage, including bacteria⁹, viruses¹⁰, and fungi, thereby enabling personalized treatment¹¹. Nonetheless, these techniques have notable limitations, including complex data analysis and the need for specialized knowledge and skills, which prolong turnaround times¹². As a result, the optimal treatment window may be missed for patients with severe illness, critical conditions, and those admitted to the ICU. Consequently, it is important to develop rapid and inexpensive diagnostic approaches that match the diagnostic precision of metagenomic sequencing and can discern both infection status and type. Such advances would be pivotal in guiding timely, targeted antibiotic, antiviral, or antifungal interventions for these patients. This could not only decrease drug selection pressure and adverse drug reactions but also improve patient outcomes.

To address this critical need, we collected BALF samples from critically ill and ICU patients for metatranscriptomic sequencing analysis. Based on the metatranscriptomic microbial detection results, patients were categorized as negative, bacterial infection, viral infection, or fungal infection, and characteristic genes were screened from the transcriptome data of human genes. Using the transcriptome data of these characteristic genes, classification diagnostic sub-models were constructed with various machine learning algorithms. However, a single machine learning model may be limited by data quality, overfitting risk, and poor generalization within the biological context. In contrast, ensemble learning combines the strengths of diverse models, thereby reducing model bias and overfitting risk, improving generalizability across subsets, and enhancing the robustness of feature interpretation. Therefore, stacking-based ensemble learning models that integrate the sub-models were developed using Lasso, ultimately yielding a rapid, cost-effective classification diagnostic method that requires the detection of only a few characteristic genes. Test results indicated that the developed method was highly consistent with metatranscriptomic sequencing in distinguishing whether patients were infected and the type of infection (viral and bacterial), and could thus serve as an auxiliary diagnostic method for pulmonary infections of unknown etiology in patients with severe illness, critical condition, and those admitted to the ICU.

Methods

Study design and participants

This retrospective study analyzed BALF specimens from 180 critically ill patients admitted to the intensive care unit (ICU) between 2021 and 2023, obtained from Zhengzhou Kingmed Diagnostics. Patients with respiratory system infections of unknown etiology were included, while those under 18 years of age were excluded. All specimens underwent metatranscriptomic sequencing. Based on metatranscriptomic data analysis (Supplementary Table S1), data from non-infected patients (n = 41) and infected patients (n = 139) were included in subsequent analyses. Infected patients were further stratified into three diagnostic groups: bacterial infection (n = 91), viral infection (n = 33), and fungal infection (n = 15). The pathogenic microbial infection model incorporated 41 non-infected cases alongside 139 confirmed infected cases. For pathogenic bacterial infection classification, the cohort comprised 91 patients with bacterial infection and 89 patients without. Similarly, the pathogenic viral infection model compared 33 cases with viral infection against 147 cases without. This study was approved by the ethics committee of Kingmed Diagnostics (No.2023160), and all experiments were performed in accordance with relevant guidelines and regulations. Due to the retrospective nature of the study, the ethics committee of Kingmed Diagnostics waived the need for informed consent.
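The class balance of the three binary tasks follows directly from the group sizes reported above. The short Python sketch below only re-derives those counts as a consistency check; the variable and function names are illustrative and are not taken from the study's own (R-based) analysis.

```python
# Group sizes reported in the study design section.
groups = {"negative": 41, "bacterial": 91, "viral": 33, "fungal": 15}
total = sum(groups.values())


def one_vs_rest(positive):
    """Return (positive count, remaining count) for a one-vs-rest task."""
    pos = groups[positive]
    return pos, total - pos


tasks = {
    "infection vs. non-infection": (total - groups["negative"], groups["negative"]),
    "bacterial vs. non-bacterial": one_vs_rest("bacterial"),
    "viral vs. non-viral": one_vs_rest("viral"),
}

for name, (pos, neg) in tasks.items():
    print(f"{name}: {pos} positive, {neg} negative, prevalence {pos / total:.2f}")
```

The viral task is the most imbalanced of the three (33 of 180 positive), which is worth keeping in mind when reading accuracy figures alongside AUC.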

Metatranscriptomic sequencing

Metatranscriptomic sequencing was carried out by KingMed Diagnostics. A general DNA/RNA extraction kit (TR202-50) from GENSTONE Biotech was used to extract total RNA from bronchoalveolar lavage fluid. Pathogenic microbial RNA detection kits (KS619-RNAmN48) from KingCreate Biotechnology were used to construct sequencing libraries, with library lengths ranging from 250 to 400 base pairs. Sequencing was performed on the Illumina NextSeq 550 platform. Each sample yielded over 2 Gb of raw data.

Bioinformatics analysis and statistical tests

Processing of microbial metatranscriptomic data: After quality control with fastp¹³, the raw data were filtered to remove low-quality and low-complexity reads. Subsequently, adapter sequences and duplicate reads were removed, followed by sequence alignment. Reads aligned to microorganisms were then quantified. The proprietary database developed by KingMed Diagnostics comprised over 11,000 bacterial species, 7,300 viral species, and 1,600 fungal species.

Processing of human metatranscriptomic data: FastQC¹⁴ was employed to assess the quality of raw sequencing reads, followed by alignment using HISAT2^15,16 and deduplication and sorting using SAMtools¹⁷. Subsequently, featureCounts¹⁸ was used to obtain count data, from which Transcripts Per Million (TPM) values were calculated. Using the cmdscale function and the vegan package¹⁹, both t-distributed Stochastic Neighbor Embedding (tSNE) and Principal Component Analysis (PCA) were conducted on TPM data from the two sample groups. Differential analysis of count data from the same two sample groups was carried out using the DESeq2²⁰, edgeR²¹, and limma²² packages. In parallel, TPM data from the two groups were compared using Wilcoxon tests. Hyperparameters were set as follows. DESeq2: fitType = “mean”, minReplicatesForReplace = 7, parallel = FALSE; genes with padj < 0.05 and |log2FC| > 1 were considered significant. limma: the design matrix was constructed using model.matrix(~ GroupLabel), and empirical Bayes moderation was applied via eBayes(); significance thresholds were P.Value < 0.05 and |logFC| > 1. Wilcoxon test: a non-parametric rank-sum test was applied on a per-gene basis using default settings; genes with p < 0.05 and |log2FC| > 1 were retained. edgeR: normalization with calcNormFactors(), dispersion estimation with estimateDisp(), and model fitting and hypothesis testing with glmFit() and glmLRT(), using the design matrix model.matrix(~ GroupLabel); genes with P.Value < 0.05 and |logFC| > 1 were selected. Subsequently, the genes identified by all four differential analysis methods were intersected to determine the differentially expressed genes, and the ggpubr package²³ was employed to generate a Venn diagram. Heatmaps were generated using the pheatmap package²⁴.
Additionally, the clusterProfiler package²⁵ was utilized for Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG)^26,27,28 pathway analysis, while the IOBR package²⁹ was employed for immune infiltration analysis.
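For readers unfamiliar with TPM, the following Python sketch shows the standard definition: counts are first divided by gene length in kilobases, then rescaled so that each sample sums to one million. This is an illustration of the general formula under the assumption that gene lengths in base pairs are available alongside the counts; the text does not describe the exact script used in the study, and the function name and toy values are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts to TPM given gene lengths in base pairs."""
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Reads per kilobase: normalise each count by its gene length.
    rpk = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    scale = sum(rpk)
    if scale == 0:
        return [0.0] * len(counts)
    # Rescale so that the values of one sample sum to one million.
    return [r / scale * 1e6 for r in rpk]


# Toy example (hypothetical values): counts proportional to length
# give identical TPM for every gene.
example_counts = [100, 200, 300]
example_lengths = [1000, 2000, 3000]
example_tpm = tpm(example_counts, example_lengths)
print(example_tpm)
```

Because TPM is length-normalised and sums to a constant per sample, it suits the between-sample comparisons described here (ordination, Wilcoxon tests), whereas the count-based packages work on raw counts.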

Machine learning model construction

This study included a total of 180 samples for each binary classification task (infection vs. non-infection, bacterial infection vs. non-bacterial infection, or viral infection vs. non-viral infection). The samples were randomly divided into a training set and a testing set at a 7:3 ratio, resulting in 126 samples for training and 54 for testing. Differential genes were analyzed using the training set and further screened using three machine learning algorithms: Support Vector Machine (SVM), Random Forest (RF), and Least Absolute Shrinkage and Selection Operator (Lasso). The intersecting genes, which served as the input features for the model, were obtained through analysis using the ggpubr package. Subsequently, thirteen algorithms, including Gradient Boosting Machine (GBM), Naive Bayes, Neural Network, RF, Elastic Net, Ridge Regression, Decision Tree, Support Vector Regression (SVR), Linear Model (LM), SVM, RPart, Generalized Linear Models (GLM), and Lasso, were employed to construct diagnostic sub-models. The performance of each model was assessed, and finally the outputs of the sub-models were integrated using the Lasso algorithm to build a classification diagnostic model based on ensemble learning. The hyperparameter details of each model are shown in Supplementary Table S2. The effectiveness of the model was evaluated using 5-fold cross-validation on the training set samples, visualized with the pROC (Receiver Operating Characteristic Curve) package³⁰, and further tested on the test set samples (Fig. 1). The key components of the entire modeling pipeline are as follows:

1. Dependent Variable (Target): The outcome variable (AllLable) is a binary classification label indicating clinical infection status: Class 0, non-infected; Class 1, infected. We constructed four types of datasets: negative, bacteria, virus, and fungus. This binary variable serves as the target (y) for all supervised classification models.
2. Predictor Variables (Features): Input features are selected from differentially expressed genes (DEGs) identified through statistical comparisons (DESeq2, limma, Wilcoxon). Prior to modeling, genes with low variance were removed. Features were ranked by standard deviation, and the selected genes were assembled into the predictor matrix: x_train <- as.matrix(gene_expression[, colnames(gene_expression) %in% ModelGene$ENSG]).
3. Feature Selection and Cross-Validation: The feature selection and cross-validation settings of each model are shown in Supplementary Table S3.
4. Classification Task Type: All models were built for binary classification, aiming to distinguish between infected and non-infected samples using gene expression data.
5. Evaluation Metrics: To assess model performance and robustness, we applied four criteria: Accuracy (proportion of correctly predicted samples), AUC (model discrimination ability), Mean Predicted Probability (average model confidence for each sample), and Cross-validation average score (stability across folds).
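The split and two of the evaluation metrics listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the study's code (which used R and the pROC package): the random seed, the 0.5 decision threshold for accuracy, and all function names are assumptions introduced here. AUC is computed in its rank-based form, as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one.

```python
import random


def split_indices(n_samples, train_fraction=0.7, seed=0):
    """Randomly split sample indices into training and test sets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # seed is an arbitrary assumption
    n_train = round(n_samples * train_fraction)
    return idx[:n_train], idx[n_train:]


def accuracy(labels, scores, threshold=0.5):
    """Proportion of correct predictions at an assumed fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auc(labels, scores):
    """Rank-based AUC: P(score of a positive > score of a negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


train_idx, test_idx = split_indices(180)
print(len(train_idx), len(test_idx))

# Hypothetical toy labels and scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(toy_labels, toy_scores), accuracy(toy_labels, toy_scores))
```

Unlike accuracy, AUC does not depend on a decision threshold, which is one reason the two are reported side by side.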

Results

Model construction for pathogenic microbial infection

To develop a diagnostic model for pathogenic microbial infections, we first established comparative cohorts based on metatranscriptomic sequencing results. There were no significant differences in gender or age between the infected and non-infected cohorts (Fig. 2A). tSNE analysis revealed significant differentiation between the two groups of samples (p = 0.002), and PCA gave consistent results (p < 0.001) (Fig. 2B and Figure S1A). Immune cell composition analysis of the BALF showed a significant increase in immune cells that promote the immune response, including B cells, eosinophils, CD4+ T cells, CD8+ T cells, Th2 cells, and plasmacytoid dendritic cells, in the cohort with pathogenic microbial infections compared to the cohort without, whereas the population of M2-type macrophages, which suppress the immune response, was significantly reduced. This suggested a heightened immune response in patients with detected pathogenic microbes compared to those without (Fig. 2C).

To identify DEGs between the two cohorts, we employed three algorithms (Figure S1B-D). In addition, we performed DEG analysis using the Wilcoxon test. By intersecting the results from these four methods, we identified a set of DEGs between the two cohorts, comprising 132 genes downregulated relative to the non-infection cohort (Fig. 2D and Figure S1E). GO enrichment analysis indicated that these DEGs are associated with cellular translation, oxidative phosphorylation, cellular respiration, electron transport chain, respiratory complexes, cytochrome c oxidase activity, as well as cell adhesion and migration (Fig. 2E). KEGG enrichment analysis revealed that the differential genes mainly affect ribosome biogenesis, mitochondrial energy metabolism and quality control, and intracellular oxidative stress (Fig. 2F). Together, the GO and KEGG enrichment analyses suggested that these DEGs mainly relate to cellular energy metabolism and substance metabolism, which are significantly affected following pathogenic microbial infection.
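The intersection step can be pictured as follows. The Python sketch below applies the significance thresholds stated in the Methods (p < 0.05 and |log2FC| > 1) to four result tables and keeps only genes that pass in all of them. The gene names and values are entirely hypothetical, and for brevity every table uses the same two keys ("p", "lfc"), whereas the actual tools report differently named columns (for example padj for DESeq2 and P.Value for limma).

```python
def significant(results, p_cutoff=0.05, lfc_cutoff=1.0):
    """Return the genes with p < p_cutoff and |log2FC| > lfc_cutoff."""
    return {
        gene
        for gene, row in results.items()
        if row["p"] < p_cutoff and abs(row["lfc"]) > lfc_cutoff
    }


# Hypothetical toy results from the four differential analysis methods.
deseq2 = {"GENE_A": {"p": 0.01, "lfc": -1.8},
          "GENE_B": {"p": 0.03, "lfc": 1.4},
          "GENE_C": {"p": 0.20, "lfc": -2.0}}
edger = {"GENE_A": {"p": 0.02, "lfc": -1.6},
         "GENE_B": {"p": 0.04, "lfc": 1.2},
         "GENE_C": {"p": 0.01, "lfc": -1.9}}
limma = {"GENE_A": {"p": 0.001, "lfc": -1.5},
         "GENE_B": {"p": 0.04, "lfc": 0.8},
         "GENE_C": {"p": 0.03, "lfc": -1.7}}
wilcoxon = {"GENE_A": {"p": 0.004, "lfc": -1.7},
            "GENE_B": {"p": 0.01, "lfc": 1.3},
            "GENE_C": {"p": 0.02, "lfc": -1.5}}

# Keep only genes called significant by all four methods.
consensus = set.intersection(
    *(significant(table) for table in (deseq2, edger, limma, wilcoxon))
)
# Direction is read here from the sign of the DESeq2 log2 fold change.
downregulated = {g for g in consensus if deseq2[g]["lfc"] < 0}
print(sorted(consensus), sorted(downregulated))
```

Requiring agreement across methods trades sensitivity for specificity: a gene missed by any single method is dropped, as GENE_B and GENE_C are in this toy example.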

Subsequently, we screened the DEGs related to the presence or absence of pathogenic microbial infection. The SVM algorithm identified 66 genes, the RF algorithm identified 66 genes, and the Lasso algorithm identified 9 genes. By taking the intersection of these three gene sets, FIS1, MARCKSL1, RPL8, RPS15, RPS5, and STUB1 were selected as input genes for the sub-models (Fig. 2G). We then constructed thirteen classification sub-models using thirteen algorithms; the prediction outputs of each sample in the test set across all 13 models are presented in Supplementary Table S4. The SHAP values of each input gene are shown in Figure S1F. Subsequently, we integrated these sub-models using the Lasso algorithm and selected the best regularization parameter (Lambda) through coefficient calculation (Fig. 2H) and 10-fold cross-validation (Fig. 2I), obtaining an optimal Lambda value of 3.3e-03. This Lambda value was then used to construct a classification model, formally expressed as: S = −0.13 + 0.14 E_MLP + 0.04 E_NaiveBayes + 0.11 E_NeuralNet + 0.49 E_RF + 0.48 E_GBM (Fig. 2J). Five-fold cross-validation on the training set demonstrated the model’s robust diagnostic performance, with an AUC value of 0.984 (Fig. 2K). Furthermore, evaluation on the test set yielded an AUC value of 0.865 (Fig. 2L), indicating good consistency with the metatranscriptomic detection of pathogenic microbial infections.

Model construction for pathogenic bacterial infection

In this part of the research, we aimed to establish a diagnostic model consistent with the metatranscriptomic detection of bacterial infection alone. Cases with and without bacterial infection were identified (Fig. 3A). The two cohorts differed substantially, as evidenced by tSNE (p < 0.001) and PCA (p < 0.001) (Fig. 3B and Figure S2A). DEGs were identified using the same method as described above, with 1719 genes downregulated and 44 genes upregulated compared to cases without infection (Fig. 3C and S2B-E). GO enrichment analysis revealed associations with bacterial infection, such as regulation of innate immune response, response to lipopolysaccharide, response to molecule of bacterial origin, focal adhesion, and MHC protein complex binding (Fig. 3D). KEGG enrichment analysis showed that the differential genes were involved in TNF signaling, NF-kappa B signaling, tight junction, necroptosis, MAPK signaling, and bacterial invasion of epithelial cells, all of which are associated with bacterial infection (Fig. 3E). In addition, Gene Set Enrichment Analysis (GSEA) identified numerous pathways associated with bacterial infection, for instance antigen processing and presentation, natural killer cell mediated cytotoxicity, lysosome, apoptosis, endocytosis, and the MAPK signaling pathway (Fig. 3F). These results suggested that the screened DEGs were highly correlated with bacterial infection.

DEGs underwent further screening, resulting in 10 genes identified by SVM, 10 genes selected by RF, and 17 genes determined by Lasso. By intersecting these sets, AKT1, CRIP1, GTF3A, HDGF, HLA-DPB1, OGFR, PEBP1, and RPL13 were obtained and used as input genes for model construction (Fig. 3G). The 13 classification sub-models were constructed, and the prediction outputs of each sample in the test set across all 13 models are presented in Supplementary Table S4. The SHAP values of each input gene are shown in Figure S2F. The 13 sub-models were integrated via the Lasso algorithm, and selection of the optimal regularization parameter (Lambda) through coefficient calculation (Fig. 3H) and 10-fold cross-validation (Fig. 3I) yielded an optimal Lambda value of 5.63e-04. Using this Lambda value, an ensemble learning model was constructed, represented by: S = −0.05 + 0.06 E_MLP + 0.25 E_NeuralNet + 0.47 E_RF + 0.33 E_GBM (Fig. 3J). Five-fold cross-validation on the training set underscored the model’s robust diagnostic performance, with an AUC value of 0.98 (Fig. 3K). Furthermore, evaluation on the test set yielded an AUC value of 0.871 (Fig. 3L), indicating substantial concordance with the metatranscriptomic detection of pathogenic bacterial infections.
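Only four of the thirteen sub-models appear in the equation above because the Lasso penalty shrinks small coefficients exactly to zero. The mechanism is easiest to see in the idealized case of an orthonormal design, where the Lasso estimate is the soft-thresholded least-squares estimate. The Python sketch below shows that operator on hypothetical weights; it illustrates the principle only and is not the fitting procedure used in the study, where the sub-model outputs are correlated and the solution is found iteratively.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalised weights for four sub-models.
unpenalised = [0.50, 0.30, 0.02, -0.01]
lam = 0.05  # illustrative penalty, not a value from the study
shrunk = [round(soft_threshold(w, lam), 4) for w in unpenalised]
print(shrunk)
```

Weights smaller in magnitude than the penalty vanish, while larger ones are shrunk toward zero, which is how a Lasso meta-learner performs sub-model selection and weighting in a single step.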

Model construction for pathogenic viral infection

Next, we sought to build a diagnostic model consistent with the metatranscriptomic detection of viral infection alone. The cohort comprised 147 cases without viral infection and 33 cases with viral infection (Fig. 4A). The two cohorts were clearly separated by tSNE (p < 0.001) and PCA (p < 0.001) (Fig. 4B and Figure S3A). Using the same methods as described above, we identified 598 downregulated genes and 2 upregulated genes relative to non-viral infection cases (Fig. 4C and Figure S3B-E). GO enrichment analysis revealed associations indicative of viral infection, including the viral process, antigen processing and presentation, MHC protein complex assembly, aerobic respiration, immune receptor activity, and peptide antigen binding (Fig. 4D). Moreover, KEGG enrichment analysis showed that the differential genes were involved in Th17 cell differentiation, Th1 and Th2 cell differentiation, oxidative phosphorylation, influenza A, COVID-19, necroptosis, lysosome, and antigen processing and presentation (Fig. 4E). Furthermore, GSEA identified numerous pathways linked to mitochondrial function, including Huntington disease, Parkinson disease, and oxidative phosphorylation (Fig. 4F). These findings collectively indicated a correlation between the identified DEGs and viral infection.

Subsequently, DEGs underwent further screening, yielding CPVL, DENND4C, MAFB, PTGER4, SLAMF8, TSPO, and UBE2Q2, which served as input features for model construction (Fig. 4G). Thirteen classification sub-models were built, and the prediction outputs of each sample in the test set across all 13 models are presented in Supplementary Table S4. The SHAP values of each input gene are shown in Figure S3F. The sub-models were integrated through the Lasso algorithm, with the optimal regularization parameter (Lambda) determined via coefficient calculation (Fig. 4H) and 10-fold cross-validation (Fig. 4I), resulting in an optimal Lambda value of 1.29e-03. Using this Lambda value, a composite classification model was constructed, represented by the equation: S = −0.20 + 0.28 E_MLP + 0.15 E_NeuralNet + 0.91 E_RF + 0.05 E_GBM + 0.10 E_Elastic − 0.05 E_NaiveBayes (Fig. 4J). Five-fold cross-validation on the training set underscored the robust diagnostic performance of the model, yielding an AUC value of 0.98 (Fig. 4K). Furthermore, evaluation on the held-out test set yielded an AUC value of 0.934 (Fig. 4L), indicating strong concordance with the metatranscriptomic detection of pathogenic viral infections.

Robustness assessment of three stacking-based models

To further evaluate robustness, we designed a comprehensive evaluation framework to quantitatively compare the performance stability of the three stacking models against the 13 conventional classifiers using the test cohorts. Specifically, we introduced a 5-dimensional evaluation scheme capturing both accuracy and generalization performance across heterogeneous subsets: AUC (measures overall discrimination ability); Accuracy (measures performance at a fixed threshold); Subset Accuracy Std (reflects prediction stability across different subgroups); Subset AUC Std (captures discrimination consistency across subsets); Subset Top3 Score (quantifies how often the model ranks in the top three across key subsets). All test set samples were used to evaluate the three stacking models and their 13 constituent sub-models, and the averages of the five metrics were calculated (Supplementary Table S5). We further formulated the following scoring function to aggregate these metrics into a single robustness score: 0.2 × AUC_average + 0.2 × Accuracy_average + 0.2 × (1 − Subset Accuracy Std)_average + 0.2 × (1 − Subset AUC Std)_average + 0.2 × Subset Top3 Score (Normalized)_average. This scoring system weights global predictive quality, cross-scenario stability, and localized subset performance equally. The results showed that the stacking models achieved the highest robustness score (0.93), outperforming every single model. This was further illustrated with visualizations including radar plots (Fig. 5A) and score distributions (Fig. 5B), highlighting their superior stability and resilience in diverse clinical conditions. In summary, we quantitatively and visually demonstrated that the stacking models were not only accurate but also robust across multiple data partitions, substantiating their advantage over individual models.
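The robustness score is an equally weighted average of five components, with the two standard deviations inverted so that lower variability raises the score. A minimal Python sketch of the scoring function follows; the input values in the example are hypothetical and are not taken from Supplementary Table S5, and the argument names are chosen here for readability.

```python
def robustness_score(auc_avg, accuracy_avg, subset_acc_std_avg,
                     subset_auc_std_avg, top3_score_norm_avg):
    """Aggregate the five averaged metrics with equal weights of 0.2."""
    components = [
        auc_avg,
        accuracy_avg,
        1.0 - subset_acc_std_avg,   # lower variability scores higher
        1.0 - subset_auc_std_avg,
        top3_score_norm_avg,        # assumed already normalised to [0, 1]
    ]
    return sum(0.2 * value for value in components)


# Hypothetical metric averages for one model.
example_score = robustness_score(0.90, 0.80, 0.10, 0.05, 1.0)
print(round(example_score, 4))
```

Since every component lies between 0 and 1, the score is bounded by the same range, and a model can only approach 1 by being both accurate and stable across subsets.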

Discussion

Particularly concerning are patients in critical condition, including those in the ICU, where delayed diagnosis of severe pulmonary infections can lead to organ dysfunction and increased mortality risk³¹. Broad-spectrum antibiotics, antiviral agents, and antifungal medications are frequently employed as empiric therapies. However, up to half of ICU patients receiving empirical antimicrobial therapy lack definitive confirmation of infection. Administration of antimicrobial agents to non-infected patients not only fails to confer benefits but may also introduce risks of adverse events, secondary opportunistic infections, and drug resistance. Traditional microbiological detection methods, primarily microbial culture, suffer from stringent technical requirements and conditions, resulting in unstable culture outcomes and poor timeliness, which may impede disease progression monitoring in critically ill patients awaiting results. The advent of sequencing technologies has introduced metagenomic and metatranscriptomic analyses, enabling discrimination between infection and non-infection states and accurate identification of microbial taxa. However, the technical complexity of these assays, combined with prolonged turnaround times for data analysis and result interpretation (typically 2–7 days in clinical settings), often compromises timely clinical decision-making. In China’s diagnostic landscape, the substantial cost (approximately 3,000 RMB per sample) and time requirements of BALF metatranscriptomic sequencing further limit its routine clinical application for critically ill, severely ill, and ICU patients. On the other hand, compared to metagenomic analyses, metatranscriptomic sequencing can identify the impact of infecting microorganisms on host gene expression³², facilitating the identification of host gene biomarkers for infection diagnosis³³.
However, effective host biomarkers have yet to be fully elucidated, and diagnostic models based on basic biomarkers require further development.

In this study, we proposed a targeted framework to address these limitations. BALF samples from respiratory and critically ill patients were retrospectively collected and subjected to metatranscriptomic sequencing. Based on metatranscriptomic microbial detection results, patients were categorized into 41 negative cases, 91 bacterial infections, 33 viral infections, and 15 fungal infections. Due to the limited number of fungal infection cases and substantial inter-individual variation (p = 0.475) (Figure S4A-B), the fungal group was not analyzed further, which constitutes a major limitation of this study. Further investigation will be conducted as more fungal infection samples are collected. Characteristic host genes were selected from the transcriptomic data by comparing patients whose microbial detection results indicated bacterial or viral infection with those in whom no microbes were detected. Using the transcriptomic data of these characteristic genes, various machine learning algorithms were employed to construct classification diagnostic sub-models. Subsequently, a stacking-based ensemble learning model was constructed using Lasso to integrate the sub-models, ultimately forming a rapid and inexpensive classification diagnostic method that requires the detection of only a few host gene biomarkers. Compared to a single model, ensemble learning models take the outputs of multiple models into account, thereby enhancing predictive performance, mitigating the risk of overfitting, bolstering robustness, and providing increased flexibility and interpretability^34,35. We emphasize that the choice to apply a unified set of 13 models across all datasets was not arbitrary, but rather driven by the need to maintain methodological consistency and ensure fair cross-cohort evaluation.
The final stacking model was thus selected based on its superior performance and robustness across all datasets, balancing predictive power with technical uniformity. The average five-fold cross-validation performance was AUC = 0.984 for distinguishing infection from non-infection, AUC = 0.98 for distinguishing bacterial infection from non-bacterial infection, and AUC = 0.98 for distinguishing viral infection from non-viral infection. Moreover, test set results demonstrated that the developed method was highly consistent with metatranscriptomic sequencing in differentiating patients with infection (AUC = 0.865) and infection types (viral AUC = 0.934 and bacterial AUC = 0.871), and could serve as an adjunctive diagnostic method for unidentified pulmonary infections in critically ill, severely ill, and ICU patients.

Our study presents a prospective adjunctive diagnostic approach based on qPCR detection of 21 genes, which requires only approximately 210 RMB (∼93% reduction compared to conventional metatranscriptomic sequencing) with a turnaround time of less than 6 h while maintaining diagnostic accuracy comparable to metatranscriptomic sequencing, enabling the identification of infection status and type in pulmonary infections of undetermined etiology among critically ill, severely ill, and ICU patients. Our developed detection method holds promise for reducing medical costs, mitigating drug selection pressures and adverse effects, and improving patient outcomes. Future directions of this study could involve integrated analysis linking pathogen metagenomic features with host immune signatures. A multi-omics data fusion approach would help elucidate critical molecular patterns of pathogen-host interactions, ultimately enhancing clinical diagnostic performance.
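The cost figures quoted above can be checked with simple arithmetic. The Python sketch below recomputes the relative saving from the two per-sample costs given in the text and confirms that the three sets of input genes reported in the Results contain 21 distinct genes. Treating the qPCR panel as the union of those three sets is an assumption made here; the text does not list the panel explicitly.

```python
# Per-sample costs quoted in the Discussion (RMB).
SEQUENCING_COST_RMB = 3000
PANEL_COST_RMB = 210

# Input genes of the three stacking models as listed in the Results.
infection_genes = {"FIS1", "MARCKSL1", "RPL8", "RPS15", "RPS5", "STUB1"}
bacterial_genes = {"AKT1", "CRIP1", "GTF3A", "HDGF", "HLA-DPB1", "OGFR",
                   "PEBP1", "RPL13"}
viral_genes = {"CPVL", "DENND4C", "MAFB", "PTGER4", "SLAMF8", "TSPO",
               "UBE2Q2"}

# Assumption: the panel is the union of the three gene sets.
panel = infection_genes | bacterial_genes | viral_genes
cost_reduction = 1 - PANEL_COST_RMB / SEQUENCING_COST_RMB
cost_per_gene = PANEL_COST_RMB / len(panel)

print(len(panel), f"{cost_reduction:.0%}", cost_per_gene)
```

The three gene sets do not overlap, so the 6, 8, and 7 input genes add up to the 21 genes mentioned, and the 210 RMB figure corresponds to 10 RMB per gene.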

Data availability

The metatranscriptomic raw data generated and analyzed in this project are included in the article and its Supplementary Material; further inquiries can be directed to the corresponding authors.

References

Kradin, R. L. & Mark, E. J. 8 - Pathology of Pulmonary Infection (Elsevier, 2018).
Whiteside, S. A., McGinniss, J. E. & Collman, R. G. The lung microbiome: progress and promise. J. Clin. Invest. 131 https://doi.org/10.1172/JCI150473 (2021).
Timsit, J. F. et al. Rationalizing antimicrobial therapy in the ICU: a narrative review. Intensive Care Med. 45, 172–189. https://doi.org/10.1007/s00134-019-05520-5 (2019).
Jain, S. et al. Community-Acquired pneumonia requiring hospitalization among U.S. Adults. N Engl. J. Med. 373, 415–427. https://doi.org/10.1056/NEJMoa1500245 (2015).
Versporten, A. et al. Antimicrobial consumption and resistance in adult hospital inpatients in 53 countries: results of an internet-based global point prevalence survey. Lancet Glob Health. 6, e619–e629. https://doi.org/10.1016/S2214-109X(18)30186-4 (2018).
Guo, Y. et al. Metagenomic next-generation sequencing to identify pathogens and cancer in lung biopsy tissue. EBioMedicine 73, 103639. https://doi.org/10.1016/j.ebiom.2021.103639 (2021).
Xu, H. et al. Clinical application and evaluation of metagenomic Next-Generation sequencing in pulmonary infection with pleural effusion. Infect. Drug Resist. 15, 2813–2824. https://doi.org/10.2147/IDR.S365757 (2022).
Huang, J. et al. Metagenomic Next-Generation sequencing versus traditional pathogen detection in the diagnosis of peripheral pulmonary infectious lesions. Infect. Drug Resist. 13, 567–576. https://doi.org/10.2147/IDR.S235182 (2020).
Charalampous, T. et al. Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection. Nat. Biotechnol. 37, 783–792. https://doi.org/10.1038/s41587-019-0156-5 (2019).
Zhu, N. et al. A novel coronavirus from patients with pneumonia in China, 2019. N Engl. J. Med. 382, 727–733. https://doi.org/10.1056/NEJMoa2001017 (2020).
Carr, V. R. & Chaguza, C. Metagenomics for surveillance of respiratory pathogens. Nat. Rev. Microbiol. 19, 285. https://doi.org/10.1038/s41579-021-00541-8 (2021).
Diao, Z., Han, D., Zhang, R. & Li, J. Metagenomics next-generation sequencing tests take the stage in the diagnosis of lower respiratory tract infections. J. Adv. Res. 38, 201–212. https://doi.org/10.1016/j.jare.2021.09.012 (2022).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. Fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890. https://doi.org/10.1093/bioinformatics/bty560 (2018).
Andrews, S. (Cambridge, United Kingdom, 2010).
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods. 12, 357–360. https://doi.org/10.1038/nmeth.3317 (2015).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915. https://doi.org/10.1038/s41587-019-0201-4 (2019).
Li, H. et al. The sequence alignment/map format and samtools. Bioinformatics 25, 2078–2079. https://doi.org/10.1093/bioinformatics/btp352 (2009).
Liao, Y., Smyth, G. K. & Shi, W. FeatureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930. https://doi.org/10.1093/bioinformatics/btt656 (2014).
Dixon, P. VEGAN, a package of R functions for community ecology. J. Veg. Sci. 14, 927–930 (2003).
Love, M. I., Huber, W. & Anders, S. Moderated Estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550. https://doi.org/10.1186/s13059-014-0550-8 (2014).
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. EdgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140. https://doi.org/10.1093/bioinformatics/btp616 (2010).
Smyth, G. (New York: Springer, 2005).
Hornik, K. The comprehensive R archive network. Wiley Interdisciplinary Reviews: Comput. Stat. 4, 394–398 (2012).
Kolde, R. Pheatmap: pretty heatmaps. R Package Version. 1, 726 (2019).
Yu, G., Wang, L. G., Han, Y. & He, Q. Y. ClusterProfiler: an R package for comparing biological themes among gene clusters. Omics: J. Integr. Biology. 16, 284–287 (2012).
Kanehisa, M. Toward Understanding the origin and evolution of cellular organisms. Protein Sci. 28, 1947–1951. https://doi.org/10.1002/pro.3715 (2019).
Kanehisa, M., Furumichi, M., Sato, Y., Matsuura, Y. & Ishiguro-Watanabe, M. KEGG: biological systems database as a model of the real world. Nucleic Acids Res. 53, D672–D677. https://doi.org/10.1093/nar/gkae909 (2024).
Kanehisa, M. & Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30. https://doi.org/10.1093/nar/28.1.27 (2000).
Zeng, D. et al. Multi-Omics Immuno-Oncology biological research to Decode tumor microenvironment and signatures. Front. Immunol. 12, 687975. https://doi.org/10.3389/fimmu.2021.687975 (2021).
Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform. 12, 1–8 (2011).
Detsis, M., Karanika, S. & Mylonakis, E. ICU acquisition rate, risk factors, and clinical significance of digestive tract colonization with Extended-Spectrum Beta-Lactamase-Producing enterobacteriaceae: A systematic review and Meta-Analysis. Crit. Care Med. 45, 705–714. https://doi.org/10.1097/CCM.0000000000002253 (2017).
Ramachandran, P. S. et al. Integrating central nervous system metagenomics and host response for diagnosis of tuberculosis meningitis and its mimics. Nat. Commun. 13, 1675. https://doi.org/10.1038/s41467-022-29353-x (2022).
Fan, S. et al. Integrating host transcriptomic signatures for distinguishing autoimmune encephalitis in cerebrospinal fluid by metagenomic sequencing. Cell. Biosci. 13, 111. https://doi.org/10.1186/s13578-023-01047-x (2023).
Polikar, R. Ensemble based systems in decision making. IEEE Circuits Syst. Mag. 6, 21–45 (2006).
Kuncheva, L. I. & Rodríguez, J. J. A weighted voting framework for classifiers ensembles. Knowl. Inf. Syst. 38, 259–275 (2014).

Funding

This work was supported by the National Natural Science Foundation of Young Scientists of China (32200438).

Author information

Tian Zhang and Ying Deng are co-first authors.

Authors and Affiliations

Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, People’s Republic of China
Tian Zhang, Ying Deng, Haoqian Wang, Shutao Xia & Weijie Liao
Peng Cheng Laboratory, Shenzhen, China
Tian Zhang, Yiling Wu & Shutao Xia
Department of Hematology and Oncology, International Cancer Center, Shenzhen University General Hospital, Shenzhen University, Shenzhen, People’s Republic of China
Weijie Liao
College of Life Science and Technology, Wuhan Polytechnic University, Wuhan, People’s Republic of China
Weifang Liao
Zhengzhou KingMed Center for Clinical Laboratory, Zhengzhou, Henan, People’s Republic of China
Wentao Wang & Zhe Zhao

Contributions

W.J.-L. and W.F.-L. designed and performed the project, wrote and revised the manuscript, and supervised the project. T.-Z. and Y.-D. designed this project, analyzed the data, and revised the manuscript. W.T.-W. and Z.-Z. collected and interpreted the samples and analyzed the data. H.Q.-W. revised the manuscript and supervised the project. Y.L.-W. and S.T.-X. assisted in revising the manuscript.

Corresponding authors

Correspondence to Weifang Liao or Weijie Liao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethics approval and consent to participate

This study was approved by the ethics committee of Kingmed Diagnostics (2023160), and all experiments were performed in accordance with relevant guidelines and regulations. Due to the retrospective nature of the study, the ethics committee of Kingmed Diagnostics waived the need to obtain informed consent.

Code availability

Code supporting the study is available at https://doi.org/10.5281/zenodo.15655782.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1

Supplementary Material 2

Supplementary Material 3

Supplementary Material 4

Supplementary Material 5

Supplementary Material 6

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zhang, T., Deng, Y., Wang, W. et al. Stacking ensemble learning models diagnose pulmonary infections using host transcriptome data from metatranscriptomics. Sci Rep 15, 30516 (2025). https://doi.org/10.1038/s41598-025-15914-9

Received: 18 June 2024
Accepted: 11 August 2025
Published: 20 August 2025
Version of record: 20 August 2025
DOI: https://doi.org/10.1038/s41598-025-15914-9