Introduction

Tuberculosis (TB), caused by Mycobacterium tuberculosis (Mtb), contributes significantly to mortality worldwide1. Mycobacterium avium complex (MAC) is a frequently isolated group of non-tuberculous mycobacteria (NTM), and Mycobacterium avium complex pulmonary disease (MAC-PD) has a heterogeneous clinical course2. There is an inverse relationship between TB incidence and NTM disease in high TB-burden countries3. However, the clinical manifestations of NTM disease are similar to those of TB and may pose a diagnostic challenge even to an experienced clinician4.

The tuberculin skin test (TST) and interferon gamma release assays (IGRAs) are the main diagnostic tools for latent TB infection (LTBI) or suspected active TB disease (ATB); however, both have strengths and limitations5. The new NTM-IGRA could improve the clinical handling of NTM-infected patients and help avoid misdiagnosis and unnecessary treatment6. However, the clinical presentation of patients varies according to the time of occurrence of Mycobacterium infection and the level of immunodeficiency7.

Microscopy should be routinely carried out using auramine-phenol staining, as it allows the direct evaluation of Mycobacterium and provides a useful adjunct to culture for Mycobacterium diseases8. However, the isolation of NTM alone does not establish the diagnosis; the diagnosis of NTM pulmonary disease rests upon microbiological, clinical, and radiological findings. Moreover, smear-negative patients accounted for 51% of pulmonary TB cases9.

Next-generation sequencing (NGS) technology has been used to comprehensively diagnose Mtb and NTM; however, NGS performed directly on diagnostic specimens is not yet standardized10. Whole genome sequencing (WGS) may serve as a high-resolution assay for the diagnosis of different subtypes of Mycobacterium infection11. Nevertheless, sequencing not only requires expensive equipment and experimental consumables but also demands professional sequencing and data analysis teams, ultimately passing the cost of detection on to the patient and rendering this approach unfeasible in lower-level hospitals.

Therefore, there is an urgent need for a rapid and simple method that does not rely on immune tests, sputum specimens, or sequencing, particularly in lower-level health care settings where the most readily available specimen is a patient's whole blood. Unfortunately, there is currently no specific biomarker that can accurately distinguish between active tuberculosis, latent infection, and NTM infection using whole blood specimens. With recent advancements, RNA-sequencing (RNA-seq) has become a valuable tool for understanding biological processes in infectious diseases12, and it facilitates the detection of differentially expressed genes (DEGs) and hence of potential biomarker candidates13.

In this study, we obtained four datasets (GSE205161, GSE107991, GSE199403 and GSE152218) from the GEO database and processed the raw data with R packages. We then applied enrichment analysis and identified key DEGs using machine learning techniques. Protein–protein interaction (PPI) analysis, correlation analysis, gene ontology (GO) enrichment analysis, gene set enrichment analysis (GSEA), and Kyoto Encyclopedia of Genes and Genomes (KEGG)14,15,16 pathway enrichment analysis were performed to characterize the biological processes potentially involved in Mycobacterium infection. To identify key biomarkers, we used LASSO logistic regression with tenfold cross-validation. Finally, external validation datasets were used for extensive filtering and identification of NTM/LTBI/ATB-related molecular biomarkers.

Material and methods

Data sources and processing

The gene expression profiles related to Mycobacterium infection (NTM, LTBI and ATB) were retrieved and downloaded from the Gene Expression Omnibus (GEO) database of the National Center for Biotechnology Information (NCBI). Four datasets of expression profiling by high-throughput sequencing (GSE205161, GSE107991, GSE199403 and GSE152218) were selected and downloaded from the GEO database for analysis.

GSE205161 (ref. 17) is an expression profiling by high-throughput sequencing dataset based on the GPL24676 platform (Illumina NovaSeq 6000, Homo sapiens) and contains samples of non-tuberculous mycobacteria (NTM) pulmonary disease. GSE107991 (ref. 18) is an expression profiling by high-throughput sequencing dataset based on the GPL20301 platform (Illumina HiSeq 4000, Homo sapiens) and contains samples of active tuberculosis (ATB) and latent tuberculosis infection (LTBI). GSE199403 (ref. 19) is an expression profiling by high-throughput sequencing dataset based on the GPL24676 platform (Illumina NovaSeq 6000, Homo sapiens) and contains samples of NTM pulmonary disease and healthy controls (IGRA+ and IGRA-). GSE152218 is an expression profiling by high-throughput sequencing dataset based on the GPL16791 (ref. 20) platform (Illumina HiSeq 2500, Homo sapiens) and contains samples of latent TB infection (LTBI) and active TB (ATB).

Analysis of expression profiling datasets

The samples in the datasets fell into four groups: control (CON), NTM, LTBI and ATB. The GEO2R tool was used to analyze the three expression datasets21,22. To identify DEGs, each dataset was normalized and log2-transformed. The filtering criteria were |log2 fold change| > 1 and FDR < 0.05. The resulting DEG lists were then compared using a Venn diagram drawn with an R package, and the final DEGs were obtained from the intersection of the datasets (Fig. 1).
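
As a minimal illustration of the filtering thresholds described above (not the authors' GEO2R/R workflow), the following Python sketch applies a Benjamini-Hochberg FDR adjustment and then keeps genes with |log2 fold change| > 1 and FDR < 0.05. The function names, the input layout (gene mapped to a log2 fold change and a raw p-value) and the use of Benjamini-Hochberg as the FDR method are assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values (FDR), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values are forced to be monotone in the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def filter_degs(
    table: Dict[str, Tuple[float, float]],
    lfc_cutoff: float = 1.0,
    fdr_cutoff: float = 0.05,
) -> Set[str]:
    """Keep genes with |log2FC| > lfc_cutoff and FDR < fdr_cutoff.

    `table` maps a gene symbol to (log2 fold change, raw p-value);
    this layout is hypothetical and chosen only for the sketch.
    """
    genes = list(table)
    fdr = benjamini_hochberg([table[g][1] for g in genes])
    return {
        g
        for g, q in zip(genes, fdr)
        if abs(table[g][0]) > lfc_cutoff and q < fdr_cutoff
    }
```

Given one such set per pairwise comparison, Python's set operators (`&` for intersection, `|` for union) reproduce the regions of a Venn diagram.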

Figure 1

Flowchart for bioinformatics analysis of publicly available data from GEO databases.

Functional enrichment analysis

The clusterProfiler, org.Hs.eg.db, and GOplot packages in R were used to perform gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses of the DEGs, with P < 0.05 as the significance threshold. Gene set enrichment analysis (GSEA) software (version 4.1.0) was used to compare the biological processes that differed significantly between the compared groups.
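
Over-representation analysis of GO and KEGG terms is commonly based on a one-sided hypergeometric test. The sketch below computes that p-value for a single term in plain Python. It is intended only to show the underlying calculation, not to reproduce clusterProfiler; the function name and argument names are assumptions for this example.

```python
from math import comb


def enrichment_pvalue(n_universe: int, n_term: int, n_degs: int, n_overlap: int) -> float:
    """One-sided hypergeometric p-value for over-representation.

    n_universe: number of background genes
    n_term:     background genes annotated to the term
    n_degs:     number of DEGs tested
    n_overlap:  DEGs annotated to the term

    Returns P(X >= n_overlap) when n_degs genes are drawn at random
    without replacement from the background.
    """
    upper = min(n_term, n_degs)
    total = comb(n_universe, n_degs)
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

For example, with a background of 10 genes, 4 of them annotated to a term, and 3 DEGs of which 2 carry the annotation, the function returns 40/120. In practice the p-values across all tested terms would then be corrected for multiple testing.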

Protein–protein interaction network analysis

We submitted NTM-, LTBI-, and ATB-related DEGs to the STRING database (http://www.string-db.org/) for PPI network prediction (interaction score > 0.7). Genes in clusters were identified by importing STRING database analysis results into Cytoscape v.12.0, and the Molecular Complex Detection (MCODE) plugin was then used to perform clustering analysis for NTM-, LTBI-, and ATB-related DEGs (Degree Cutoff = 2, Node Score Cutoff = 0.2, K-Core = 2, and Max. Depth = 100).
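
The degree (number of adjacent nodes) of each gene in a thresholded PPI network can be counted directly from an edge list. The following Python sketch assumes the edges are available as (gene, gene, combined score) triples with scores scaled to the range 0 to 1. That input layout, the function names and the placeholder gene labels in the usage note are hypothetical, and the code does not replicate STRING, Cytoscape or MCODE.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Edge = Tuple[str, str, float]


def node_degrees(edges: Iterable[Edge], min_score: float = 0.7) -> Counter:
    """Count neighbours per gene, keeping edges with score > min_score."""
    seen = set()
    degree: Counter = Counter()
    for gene_a, gene_b, score in edges:
        pair = frozenset((gene_a, gene_b))
        # Skip self-loops, low-confidence edges and duplicated pairs.
        if gene_a == gene_b or score <= min_score or pair in seen:
            continue
        seen.add(pair)
        degree[gene_a] += 1
        degree[gene_b] += 1
    return degree


def top_hubs(edges: Iterable[Edge], k: int = 3, min_score: float = 0.7) -> List[Tuple[str, int]]:
    """Return the k genes with the most neighbours (ties broken by name)."""
    degree = node_degrees(edges, min_score)
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))[:k]
```

Calling `top_hubs([("G1", "G2", 0.95), ("G1", "G3", 0.90), ("G2", "G3", 0.50)])` keeps the two edges above the 0.7 cutoff and ranks G1 first with two neighbours.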

Identification of NTM, LTBI and ATB related genes using LASSO regression

The R package "glmnet" was used to perform LASSO logistic regression on the 43 unique genes identified, an approach chosen because the sample size was limited and the number of covariates was large. Genes at the intersection of those selected by LASSO were then used to distinguish NTM, LTBI and ATB.
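
To make the idea of LASSO logistic regression with tenfold cross-validation concrete, the sketch below fits an L1-penalised logistic model by proximal gradient descent (soft-thresholding) in plain Python. This is a deliberately simple stand-in: glmnet uses coordinate descent along a lambda path and scores folds by deviance by default, whereas this example uses a fixed step size, a fixed iteration count and misclassification error. All function names and default values are assumptions, and the features are assumed to be standardised beforehand.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_threshold(value: float, threshold: float) -> float:
    """Proximal operator of the L1 penalty: shrink the value towards zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def predict(row: Sequence[float], w: Sequence[float], b: float) -> float:
    """Predicted probability of class 1 for one sample."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, row)))


def fit_lasso_logistic(
    X: Sequence[Sequence[float]], y: Sequence[int], lam: float,
    step: float = 0.1, n_iter: int = 1000,
) -> Tuple[List[float], float]:
    """Minimise mean log-loss + lam * ||w||_1; the intercept is not penalised."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        residual = [predict(row, w, b) - yi for row, yi in zip(X, y)]
        grad_b = sum(residual) / n
        grad_w = [sum(r * row[j] for r, row in zip(residual, X)) / n for j in range(p)]
        b -= step * grad_b
        # Gradient step followed by soft-thresholding gives exact zeros.
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad_w)]
    return w, b


def cv_error(X: Sequence[Sequence[float]], y: Sequence[int], lam: float, k: int = 10) -> float:
    """k-fold cross-validated misclassification rate for one lambda."""
    n = len(X)
    errors = 0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        w, b = fit_lasso_logistic([X[i] for i in train], [y[i] for i in train], lam)
        for i in range(fold, n, k):
            predicted = 1 if predict(X[i], w, b) >= 0.5 else 0
            errors += predicted != y[i]
    return errors / n
```

A lambda would be chosen by evaluating `cv_error` over a grid of candidate values and keeping the one with the lowest error. Genes whose coefficients remain non-zero at that lambda form the candidate signature.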

Verification using external validation datasets

To validate the mRNA expression levels of the key crosstalk genes, we used independent external datasets (GSE199403 and GSE152218). GSE199403 included 73 MAC-PD samples, 50 IGRA+ healthy control samples and 75 IGRA- healthy control samples. GSE152218 included 16 ATB and 32 LTBI samples. The genes selected by LASSO logistic regression analysis were designated as the key genes. We then compared the mRNA expression levels of the key genes between the case groups and the IGRA- healthy control group using the Mann–Whitney U test. Results were considered statistically significant if the p-value was less than 0.05, and were visualized using the “ggplot2” package. Finally, we evaluated the diagnostic efficacy of the key genes in the test dataset by constructing receiver operating characteristic (ROC) curves using the “pROC” package.
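
The area under a ROC curve equals the probability that a randomly chosen case has a higher marker value than a randomly chosen control, with ties counted as one half. The Python sketch below computes the AUC in that way and also reports the cutoff that maximises Youden's J. It assumes that higher expression indicates a case; the function names and the choice of Youden's J are illustrative and are not taken from the pROC workflow used in the study.

```python
from typing import Sequence, Tuple


def roc_auc(cases: Sequence[float], controls: Sequence[float]) -> float:
    """AUC as the fraction of (case, control) pairs ranked correctly."""
    wins = 0.0
    for c in cases:
        for h in controls:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5  # ties contribute one half
    return wins / (len(cases) * len(controls))


def best_cutoff(cases: Sequence[float], controls: Sequence[float]) -> Tuple[float, float, float]:
    """Return (cutoff, sensitivity, specificity) maximising Youden's J.

    A sample is called positive when its value is >= cutoff.
    The first cutoff reaching the maximum is kept.
    """
    best_j, best = -1.0, (0.0, 0.0, 0.0)
    for t in sorted(set(cases) | set(controls)):
        sensitivity = sum(v >= t for v in cases) / len(cases)
        specificity = sum(v < t for v in controls) / len(controls)
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_j, best = j, (t, sensitivity, specificity)
    return best
```

For a marker that is lower in cases, such as a downregulated gene, the two arguments can be swapped or the values negated before calling these functions.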

Statistical analysis

Statistical analyses were performed using R software (version 4.3.2, http://www.R-project.org). The Wilcoxon rank-sum test was used to assess the significance of differences in gene expression in the GEO datasets. Student's t-tests were performed to analyze significant differences between distinct groups. The glmnet R package was used for the LASSO regression analysis. A p-value of less than 0.05 was considered statistically significant.
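
The Wilcoxon rank-sum test is equivalent to the Mann–Whitney U test. A minimal Python sketch of the two-sided test is given below, using the normal approximation with tie and continuity corrections. It is illustrative only: R can instead compute an exact p-value for small samples without ties, so results may differ slightly, and the function name is an assumption.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    total = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < total:
        # Find the block of tied values starting at position i.
        j = i
        while j < total and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        ties = j - i
        tie_term += ties ** 3 - ties
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((total + 1) - tie_term / (total * (total - 1)))
    if var_u <= 0:
        return u, 1.0  # all values identical: no evidence of a shift
    # Continuity correction of 0.5, never pushing |z| below zero.
    z = max(abs(u - mean_u) - 0.5, 0.0) / math.sqrt(var_u)
    return u, math.erfc(z / math.sqrt(2.0))
```

The U statistic divided by the product of the two group sizes gives the AUC of the same marker, which links this test to ROC analysis.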

Results

Identification of differential expression genes associated with NTM, LTBI and ATB

We first identified the differentially expressed genes (DEGs) in the GEO datasets (GSE205161 and GSE107991). There were 20,488 genes in the NTM vs LTBI group, 20,272 genes in the NTM vs ATB group, 20,299 genes in the LTBI vs ATB group, and a total of 20,736 non-duplicated genes across all combinations. The significance criteria for DEGs were |fold change| ≥ 2, which corresponds to |log2FC| ≥ 1, and p-value < 0.05. Ultimately, 229 differential genes in the NTM vs LTBI group met these thresholds. In the NTM vs ATB group, 256 differential genes met the thresholds, and in the LTBI vs ATB group, 572 differential genes met the thresholds (Fig. 1). The ten most significantly up- and down-regulated genes were listed and selected for further analysis (Fig. 2).

Figure 2

Transcriptional changes in patients with NTM, LTBI and ATB. (A) NTM vs LTBI, volcano plot of all genes; significantly (FDR < 0.05) differentially expressed genes (|log2 fold change| > 1) are indicated in red (upregulated) or green (downregulated); (B) NTM vs ATB; (C) LTBI vs ATB; (D) NTM vs LTBI, the top 10 differentially expressed genes are indicated in blue (upregulated) or yellow (downregulated); (E) NTM vs ATB; (F) LTBI vs ATB.

Heatmaps of differential expression genes (DEGs) of NTM, LTBI and ATB

The patients were subdivided into NTM, LTBI and ATB groups according to the GEO (GSE205161 and GSE107991) cohorts. The gene expression profiles of the DEGs between the paired groups are shown in the heatmaps in Fig. 3 A, B and C. Genes with higher expression are depicted in red, genes with lower expression are depicted in blue, and unchanged expression is marked in white. These results indicated that patients with NTM, LTBI and ATB could be classified based on genes associated with these Mycobacterium infections. We then found 1,127 DEGs in the GEO (GSE205161 and GSE107991) cohorts and finally screened 598 unique DEGs for further analysis (Fig. 3D).

Figure 3

Heatmaps of differential expression genes (DEGs) of NTM, LTBI and ATB. (A). Heatmap of DEGs between NTM and LTBI; (B). Heatmap of DEGs between NTM and ATB; (C). Heatmap of DEGs between LTBI and ATB; (D). Venn diagram demonstrates the intersections of DEGs among three groups.

Gene ontology analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment clusters of DEGs

To investigate the role of DEGs in the NTM, LTBI and ATB groups, we performed GO and KEGG pathway enrichment analyses. The results showed that these genes were strongly enriched in GO–Biological Process (BP) terms related to detoxification, response to virus, and defense response to bacterium. Among GO–Cell Component (CC) terms, these genes were highly enriched in haptoglobin-hemoglobin complex, blood microparticle and collagen-containing extracellular matrix, and among GO–Molecular Function (MF) terms, they were enriched in antioxidant activity, serine-type endopeptidase activity and endopeptidase activity (Fig. 4 A B and C). KEGG cluster analysis showed that these genes were enriched in Ribosome biogenesis in eukaryotes, Coronavirus disease (COVID-19) and Staphylococcus aureus infection (Fig. 4 D E and F). Overall, the GO enrichment of the DEGs was in agreement with the KEGG cluster analysis.

Figure 4

Enrichment analysis of DEGs among NTM, LTBI and ATB. (A) Bubble plot showing that DEGs between NTM and LTBI are enriched in several biological processes (BP), cell components (CC) and molecular functions (MF) in the NTM group; (B) Bubble plot showing the terms enriched for DEGs between NTM and ATB; (C) Bubble plot showing the terms enriched for DEGs between LTBI and ATB; Circle plot of KEGG pathway enrichment analysis in (D) NTM vs LTBI, (E) NTM vs ATB and (F) LTBI vs ATB groups. In the bubble plots, the top 10 BP, CC and MF terms sorted by p-value are visualized. In the circle plots of KEGG clusters, the top 5 KEGG pathways by p-value are shown, and each color represents one pathway.

Exploration of signaling pathways

Gene set enrichment analysis (GSEA) has the advantage of exploring the signaling pathways involved from an overall perspective. GSEA revealed that the DEGs in the NTM vs LTBI group of the GEO cohorts were significantly enriched in erythrocytes take up carbon dioxide and release oxygen, mRNA splicing, O2/CO2 exchange in erythrocytes and scavenging of heme from plasma (Fig. 5A). The DEGs in the NTM vs ATB group of the GEO cohorts were significantly enriched in cohesin loading onto chromatin, DDX58/IFIH1-mediated induction of interferon-alpha/beta, mitotic telophase/cytokinesis, RNA polymerase I transcription termination and scavenging of heme from plasma (Fig. 5B). The DEGs in the LTBI vs ATB group of the GEO cohorts were significantly enriched in amyloid fiber formation, complement cascade, innate immune system, interferon gamma signaling and neutrophil degranulation (Fig. 5C).

Figure 5

GSEA analysis of Top 5 DEGs among NTM, LTBI and ATB. (A) The significant GSEA sets in NTM vs LTBI group; (B) The significant GSEA sets in NTM vs ATB group; and (C) The significant GSEA sets in LTBI vs ATB group.

Protein–protein interactions network and correlation analyses of DEGs

The interactions among DEGs were investigated further by constructing a PPI network (Fig. 6) using online tools. Figure 6 A, B and C show the number of neighboring nodes of the DEGs for NTM vs LTBI, NTM vs ATB and LTBI vs ATB. The DEGs with the three highest numbers of adjacent nodes in each comparison (HP, HBD, HBB, RSAD2, ISG15, IFI44L, GBP1, IFIT3 and STAT1) are highlighted in bold (Fig. 6 D E and F). Overall, the LTBI vs ATB group had the most extensive PPI network, with a maximum of 56 adjacent nodes; the NTM vs LTBI group had the smallest PPI network and the fewest nodes; and the NTM vs ATB group lay between the two, with a maximum of 24 adjacent nodes.

Figure 6

PPI networks of DEGs in NTM, LTBI and ATB. Bar plot of the adjacent-node counts of genes in (A) NTM vs LTBI group; (B) NTM vs ATB group and (C) LTBI vs ATB group. PPI network of DEGs in (D) NTM vs LTBI group; (E) NTM vs ATB group and (F) LTBI vs ATB group. The top 3 key genes are shown in bold.

Construction of diagnosis gene signature in GEO training cohorts

Based on the top 10 upregulated and downregulated DEGs in these three groups, 43 unique DEGs were selected and visualized with a Venn diagram (Fig. 7A). These genes were subjected to least absolute shrinkage and selection operator (LASSO) regression analysis to construct the diagnostic model of NTM, LTBI and ATB diseases. The regression coefficients are visualized in Fig. 7 B, C and D, where a coefficient profile plot was produced against the log(lambda) sequence in the LASSO model. The optimal parameter (lambda) was chosen at the position indicated by the first black dotted line (Fig. 7 E to G). With minimum lambda values of 0.0002, 0.0009 and 0.001 in the NTM vs LTBI, NTM vs ATB, and LTBI vs ATB groups, respectively, four significant genes were identified in each group (Fig. 7 H).

Figure 7

Identification of the highly correlated gene expression signatures in the NTM, LTBI and ATB groups. (A) Venn diagram showing the 43 unique DEGs of the three groups; coefficient profiles of the 43 unique DEGs in NTM vs LTBI (B), NTM vs ATB (C) and LTBI vs ATB (D) groups; selection of the optimum lambda in the LASSO regression model by cross-validation in NTM vs LTBI (E), NTM vs ATB (F) and LTBI vs ATB (G) groups; (H) Potential gene signatures for distinguishing between NTM and LTBI, NTM and ATB, and LTBI and ATB, respectively.

Validation of the diagnosis gene signature in other GEO datasets

GEO datasets (GSE199403 and GSE152218) were used to validate the performance of the gene-based diagnostic model. Figure 8 shows the expression of the four hub genes (HBA1, HBA2, HBD and METTL7B) selected by LASSO regression in all groups (interferon gamma release assay-negative healthy controls, “IGRA-HC”; NTM; LTBI; and ATB). The expression levels of HBA1/2 in patients with NTM lay between those of the HC and LTBI groups (Fig. 8 A B). Figure 8 C shows that the HBD gene was significantly upregulated in the ATB group compared with the IGRA-HC and NTM groups (all P < 0.001). The METTL7B gene was significantly downregulated in the LTBI group (P < 0.001), and differences in METTL7B expression could efficiently differentiate LTBI from ATB (Fig. 8 D).

Figure 8

Validation of gene expression (TPM) of the potential biomarkers. Expression levels of HBA1 (A), HBA2 (B), HBD (C), and METTL7B (D). “Gene exp”: Gene expression; “IGRA-HC”: Interferon gamma release assay negative healthy control.

Further validation of the diagnosis gene signature of each group

For HBA1, the area under the curve (AUC) of the classifier was 0.803 (95%CI, 0.709–0.898) in the test datasets, and HBA1 expression could efficiently differentiate LTBI from the control group (IGRA-HC and NTM) (Fig. 9 A). For HBA2, the AUC of the classifier was 0.798 (95%CI, 0.702–0.894) in the test datasets, and HBA2 expression could also efficiently differentiate LTBI from the control group (IGRA-HC and NTM) (Fig. 9 B). HBD expression could diagnose ATB against the control group (IGRA-HC and NTM), with an AUC of 0.772 (95%CI, 0.661–0.883) (Fig. 9 C). Moreover, METTL7B expression could identify patients with LTBI among the IGRA-HC and ATB groups, with an AUC of 0.983 (95%CI, 0.964–1) (Fig. 9 D).

Figure 9

ROC analysis of diagnostic power in the GEO validation sets. The diagnostic value of unique DEGs in patients with NTM, LTBI and ATB. (A), (B) HBA1/2 for distinguishing NTM from LTBI; (C) HBD for distinguishing NTM from ATB; (D) METTL7B for distinguishing LTBI from ATB.

Discussion

Most genetic markers associated with Mycobacterium infection are identified by comparing patients with healthy individuals; however, exploring markers that differ between subtypes (e.g. Mtb and NTM) is of great clinical value in diagnosis and differential diagnosis, especially when such markers also differ from healthy populations. In this study, we identified the role of Mycobacterium infection-related genes in NTM, LTBI and ATB diseases. The GEO samples consisted of four subgroups: IGRA-HC, NTM, LTBI and ATB. We used LASSO and ROC analyses to identify and confirm the unique DEGs of these subgroups and found that upregulated HBA1/2 could distinguish LTBI in the NTM vs LTBI group, and upregulated HBD could identify ATB in the NTM vs ATB group. Moreover, downregulated METTL7B was associated with LTBI and may distinguish LTBI from ATB. Most importantly, the expression levels of these genes also differed significantly between IGRA-HC and these diseases.

Both ATB and NTM pose severe health risks worldwide. A simple non-sequencing technique is required for rapid diagnosis, particularly in less experienced hospitals; however, no single biomarker is widely used for all Mycobacterium species23. Previous studies concluded that the serum level of coronin-1A might serve as a novel biomarker for laboratory examination to distinguish patients with tuberculosis from those with other related diseases and from healthy individuals24. Results from Wu et al. collectively indicate that IRF1, which was significantly upregulated in patients with TB compared with healthy controls, could serve as a new biomarker for the diagnosis of pulmonary tuberculosis25. However, these studies focused on DEGs between healthy people and patients. Comparative analysis of blood samples from subjects with these conditions in an endemic setting may yield new differentiating biomarkers and insights into inflammatory and immunological responses to ATB and NTM. Studies have demonstrated that plasma proteomes differ among LTBI, ATB, NTM and HC cohorts, and that these differences were augmented after prior stimulation of blood immune cells with the ESAT-6/CFP-10 antigen cocktail26. Studies have also confirmed that whole-genome sequencing (WGS) may serve as a high-resolution assay for the diagnosis of different subtypes of Mycobacterium infection11. Moreover, combining sequencing technology with traditional diagnostic methods could identify pathogens at the early stages of the disease27. Nucleic acid amplification testing combined with acid-fast staining of lung biopsy tissues can lead to timely and accurate diagnosis in patients with smear-negative pulmonary tuberculosis28.

Acid-fast bacilli (AFB) sputum smear microscopy and Mtb culture are commonly used to diagnose ATB29. However, AFB smear microscopy has low sensitivity30 and is not suitable for monitoring therapy effects, as it cannot distinguish between living and dead bacilli. Although the nanopore sequencing assay using clinical samples showed the best diagnostic accuracy for sputum-scarce ATB31, for patients who had no positive tuberculosis results (smear or culture) the time to diagnosis was significantly reduced, while the diagnostic efficacy declined32. In this context, transcriptome RNA-sequencing provides new information on the role of the host's protective immune responses to Mtb33.

In this study, we analyzed representative DEGs that differed among the HC, NTM, LTBI and ATB groups using transcriptome RNA-sequencing datasets. Although HBA1/2 levels were upregulated in all disease groups, their expression did not differ significantly between LTBI and ATB. Moreover, HBA1/2 expression in the NTM group lay between that of the HC and LTBI/ATB groups, so the corresponding cutoffs could efficiently separate HC from patients with NTM. As hemoglobin subunits, HBA1/2 were upregulated in the NTM group. Indeed, the top NTM-linked gene set revealed by our GSEA analysis was erythrocytes take up carbon dioxide and release oxygen. Studies have shown that enhanced oxidative stress is closely related to methemoglobin generation in human erythrocytes34.

Bacterial coinfections were observed in 19–66% of patients with MAC-PD during the entire duration of the disease, and patients with a positive bacterial culture at diagnosis were more likely to develop bronchiectasis exacerbation35. Therefore, patients with MAC-PD should initiate treatment, and the most important step before that is the diagnosis of the disease. HBA1/2 could distinguish NTM from LTBI with AUCs of 0.803 and 0.798, respectively. Furthermore, HBD could distinguish NTM from ATB with an AUC of 0.772.

Therefore, these genes have increased the efficiency of NTM identification.

METTL7B (methyltransferase-like 7B) is a member of the methyltransferase-like family36. Little is known about the exact role of METTL7B in LTBI and ATB. Our results showed that downregulated METTL7B not only distinguished LTBI from HC, with an area under the curve (AUC) of 0.87 (95% CI: 0.81 to 0.94) (P < 0.05), but also differed significantly in expression between LTBI and ATB (P < 0.001). Recent studies reported that METTL7B may be a novel tumor immune microenvironment-associated biomarker that correlates with the prognosis of patients with prostate adenocarcinoma37. Li et al. found that interfering with METTL7B expression could significantly inhibit lung adenocarcinoma by modulating the ability of cells to proliferate and migrate38. There is also evidence that the PD-1/PD-L1 signaling pathway, which is regulated by METTL7B39, may be associated with ATB or other chronic infections and may even protect patients from the development of cancer40.

Conclusions

In summary, this study employed bioinformatics and machine learning techniques to identify four potential Mycobacterium-related genes. The diagnostic significance of these genes (HBA1, HBA2, HBD, and METTL7B) was validated using external validation cohorts from the GEO dataset. These findings suggest that these genes could potentially improve the diagnosis of patients with NTM, LTBI, or ATB, particularly in cases where culture or other etiological methods are unsuccessful. Furthermore, our study provides insights into the signaling pathways enriched in mycobacterial infections, thereby contributing to the design of novel therapeutic strategies for managing this disease.