Introduction

Leprosy is a chronic infectious disease caused by Mycobacterium leprae or M. lepromatosis, likely transmitted via droplets from contagious patients1. About 170,000 new cases occur annually, mainly in low socio-economic regions1. The disease affects the skin and peripheral nerves and, if untreated, can cause irreversible disabilities, stigma, and social exclusion2. Leprosy presents various clinical forms, ranging from tuberculoid (TT) with low bacterial index (BI) and strong cellular immunity3 to lepromatous (LL) with high bacterial load and humoral response3,4. Intermediate forms include borderline tuberculoid (BT), borderline borderline (BB), and borderline lepromatous (BL)5. The WHO classifies cases as paucibacillary (≤ 5 lesions, no detectable bacilli) or multibacillary (> 5 lesions and/or positive skin smear)1,6,7.

Leprosy diagnosis is primarily clinical, based on skin lesions with sensory loss, peripheral nerve thickening, and detection of M. leprae8. The wide disease spectrum complicates diagnosis7, and often requires expertise of experienced clinicians to obtain invasive samples like skin smears or biopsies9. Early detection of leprosy is characterized as one of the main challenges to stop transmission and prevent irreversible handicaps due to tissue damage caused by M. leprae10,11,12. Therefore, the identification of host biomarkers that can identify M. leprae infection or the onset of disease in low-invasive samples would be a game-changer for leprosy control. In this respect, monitoring household contacts (HHC) remains crucial, as they are intensely and frequently exposed to the bacterium and thus at greater risk of developing leprosy7,13.

Since the identification of new host biomarkers can provide new insights into pathogenesis and ways to differentiate between the various spectral forms in the disease spectrum, including early disease and M. leprae infection without clinical signs14, the identification of host proteomic and transcriptomic biomarkers for leprosy has been studied intensely in the past years10,11,12,15,16,17,18,19.

Non-coding RNAs (ncRNAs) are genomic elements that are not translated into proteins and play a regulatory role in gene expression. Recently, the use of ncRNAs as host biomarkers for infectious diseases has gained attention. These include microRNAs (miRNAs), which are small non-coding RNAs, and long non-coding RNAs (lncRNAs), both of which were found to be associated with the regulation of the immune response against M. leprae and with disease progression20,21,22,23. The expression of some miRNAs was shown to discriminate leprosy patients from healthy subjects in a Brazilian cohort using skin specimens23,24 as well as blood samples23, thereby highlighting the possible role of non-coding genes for the diagnosis of leprosy23,24.

Previously, we identified a four-gene signature, designated RISK4LEP (MT-ND2, REX1BD, TPGS1, UBC), that differentiated HHC who later developed PB leprosy (designated progressors) from those who did not in a leprosy endemic population in Bangladesh10,25. As point-of-care (PoC) tests assessing RNA expression of host genes in fingerstick blood did not exist yet at the time of the previous study, we then focused on the identification of genes encoding for serum proteins that could be detectable in capillary blood using a lateral flow format, while excluding non-coding genes from the validation process.

Recently, Sutherland et al. described the use of the Cepheid GeneXpert MTB-HR prototype, a fingerstick blood test measuring the RNA expression levels of three different genes (GBP5, DUSP3, and KLF2)26,27, to discriminate TB from other respiratory diseases (ORD), thereby highlighting the potential of host transcriptomic biomarkers in PoC diagnostic tests26,27.

Considering the regulatory roles of ncRNAs, a host transcriptomic biomarker signature that includes non-coding genes could provide a more complete insight into the overall transcriptome in leprosy pathogenesis. Therefore, the present study reanalyzed the RNA-Seq dataset, including ncRNA10,25, and combined three established methodologies (edgeR, DESeq2, and Limma-Voom), as well as two machine learning approaches (Random Forest, RF, and Support Vector Machine, SVM), to establish an improved host gene biomarker profile able to predict leprosy. The results of this study can contribute to the improvement of early diagnosis and prediction of leprosy.

Materials and methods

Ethics

Ethical permission for leprosy patients’ samples was received from the National Research Ethics Committee (Bangladesh Medical Research Council) in Bangladesh (Ref no. BMRC/NREC/2010–2013/1534 and BMRC/NREC/2016–2019/214)28 and local ethical boards in the Netherlands (MEC-2012-589), and followed the Helsinki Declaration (version Fortaleza, Brazil, October 2013). Participants were informed about the study objectives in their local language and provided written informed consent before being included in the study. All participants were informed about their right to refuse to take part or withdraw from the study without consequences.

Description of cohorts

Between April 2013 and April 2018, HHC (n = 5,352) of newly diagnosed leprosy patients were enrolled in the study, and a first blood sample was collected (t = 1) from participants in four districts of the northwest region of Bangladesh. A more detailed description can be found in Tió-Coma et al.10. Briefly, HHC were classified as individuals living in the same house, within the same compound, sharing a kitchen, or as direct (first) neighbors of individuals diagnosed with leprosy. Exclusion criteria included individuals with prior leprosy, those who refused consent, pregnant women, (treated) tuberculosis patients, children under five years old, individuals with liver disease or jaundice, and temporary residents29. HHC were monitored annually for at least 36 months to check for leprosy symptoms, with follow-up until 2022. HHC diagnosed with leprosy 4–61 months after t = 1 were classified as progressors. A second blood sample was collected from progressors at diagnosis, before starting multidrug therapy (t = 2). Progressors who presented with five or fewer skin lesions and BI = 0 were classified as PB, while those with more than five skin lesions were classified as MB6. An initial set was drawn from the cohort, consisting of 39 HHC and 39 progressors who were diagnosed with leprosy 4–61 months after recruitment10.

To validate the results from the initial set, a validation set was later drawn from the same follow-up, which included 12 HHC and 12 progressors (validation cohort) who were diagnosed with leprosy 4–61 months after recruitment. ECs are characterized as individuals from the general population of the same area who have had no known contact with leprosy patients30. The demographic information of the different groups is described in Supplementary Table S1.

RNA isolation, library preparation, and sequencing

Venous blood was collected in PAXgene tubes (BD Biosciences, Franklin Lakes, NJ) and frozen until shipment on dry ice to Leiden University Medical Centre (LUMC, The Netherlands) for analysis. The RNA isolation from PAXgene tubes was automated using a QIAcube machine (Qiagen, Hilden, Germany) and PAXgene blood RNA kits (Qiagen) according to the manufacturer’s protocol. RNA concentrations were measured by Qubit RNA BR Assay (Thermo Fisher Scientific, Waltham, MA). The libraries for sequencing were prepared using the NEBNext Ultra II Directional RNA Library Prep Kit for Illumina (New England Biolabs, Ipswich, MA), which includes poly(A) enrichment, and globin reduction was performed using the GLOBINclear kit (Thermo Fisher Scientific). Clustering and sequencing were performed in a NovaSeq6000 System (Illumina, San Diego, CA). The library preparation and sequencing followed the methodological approach outlined previously10.

RNA-Sequencing analysis

RNA-Seq files were processed using the open-source BIOWDL RNA-Seq pipeline v2.0 (https://github.com/biowdl/RNA-seq/tree/v2.0.0) developed at Leiden University Medical Centre. The RNA-Seq pipeline used in this study was based on the framework established previously10.

Reverse transcription quantitative PCR (RT-qPCR)

RNA was reverse-transcribed into cDNA by annealing phase at 25 °C for 5 min, extending phase at 42 °C for 60 min, and deactivation phase at 70 °C for 15 min using the GoScript™ Reverse Transcription System kit (Promega A5001) following the manufacturer’s instructions. Subsequently, a quantitative PCR was performed by mixing 10 µl of GoTaq® Probe qPCR Master Mix (Promega, A6102), 3 µl of nuclease-free water, 1 µl of the primer solution for the target gene (Supplementary Table S2), 1 µl of the primer solution for the reference gene GAPDH (Hs99999905_m1), and 5 µl of the cDNA sample. To monitor assay performance, ROX dye (Thermo Fisher) was added to the qPCR mix. Nuclease-free water was used as a negative control. cDNA was amplified using the following protocol: 2 min at 95 °C followed by 40 cycles of 15 s at 95 °C and 1 min at 60 °C in the QuantStudio 6 Flex Real-Time PCR System (Applied Biosystems). Relative target gene expression was determined by calculating ∆Ct using GAPDH as a reference gene. Since mRNA levels of immune genes were expressed as ΔCt values, larger values represent lower mRNA levels in the sample. To obtain the correct orientation, the ΔCt values were multiplied by −1.
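The ΔCt orientation step described above can be illustrated with a minimal sketch. The function names and the Ct values are hypothetical and serve only as an example; they are not study data, and the actual calculation in this study was not necessarily scripted this way.

```python
def delta_ct(ct_target, ct_reference):
    """Delta Ct = Ct(target) - Ct(reference); a larger value means less target mRNA."""
    return ct_target - ct_reference


def oriented_expression(ct_target, ct_reference):
    """Multiply Delta Ct by -1 so that larger values mean higher mRNA levels."""
    return -delta_ct(ct_target, ct_reference)


# Hypothetical Ct values (target gene, GAPDH reference); not taken from the study.
samples = {"sample_A": (27.5, 20.0), "sample_B": (25.0, 20.5)}
expression = {
    name: oriented_expression(target, reference)
    for name, (target, reference) in samples.items()
}
print(expression)
```

In this toy example, sample_B has the smaller ΔCt and therefore the higher oriented expression value, which matches the interpretation used in the figures.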

Differential gene expression analysis using multi-method integration

Differential gene expression (DGE) analysis was performed on the RNA-Seq data10 to identify genes that were significantly differentially expressed between leprosy progressors at t = 1 and the HHC group. DGE analysis was performed using three established libraries, edgeR, DESeq2, and Limma-Voom, executed according to their guidelines31,32,33. In edgeR and Limma-Voom, the raw counts were normalized for library size with the Trimmed Mean of M-values (TMM) method. In DESeq2, the median of ratios normalization method was used. The p-values were adjusted for multiple testing with the Benjamini-Hochberg method to control the false discovery rate34. Genes with a false discovery rate below 0.05 (adjusted p-value < 0.05) and a log2 fold change above log2(1.5) or below −log2(1.5) were classified as differentially expressed. These methods were applied separately to all genes, to coding genes, and to non-coding genes. The analyses were performed using R (version 4.4.1) and the RStudio program for Windows 11.
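As an illustration of the filtering rule only, the following sketch applies a Benjamini-Hochberg adjustment and the two thresholds to a toy table. It is written in plain Python for readability, whereas the study used the R libraries named above; the gene names, fold changes, p-values, and function names are hypothetical.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


FDR_CUTOFF = 0.05            # adjusted p-value threshold from the text
LFC_CUTOFF = math.log2(1.5)  # log2 fold-change threshold from the text


def call_degs(genes, log2_fold_changes, pvalues):
    """Keep genes passing both the FDR and the absolute log2 fold-change cut-off."""
    padj = benjamini_hochberg(pvalues)
    return [
        gene
        for gene, lfc, q in zip(genes, log2_fold_changes, padj)
        if q < FDR_CUTOFF and abs(lfc) > LFC_CUTOFF
    ]


# Hypothetical toy input, not study data.
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
lfc = [1.2, -0.9, 0.3, -0.1]
pvals = [0.001, 0.01, 0.02, 0.8]
print(call_degs(genes, lfc, pvals))
```

Here gene_c passes the FDR threshold but not the fold-change threshold, so only gene_a and gene_b are retained.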

Machine learning approaches

The intersection of genes from the DGE analysis shared between the three libraries was divided into three categories: whole RNA-Seq, coding genes, and non-coding genes. Through two different feature selection techniques, the chi-square test and recursive feature elimination (RFE), gene groups of different sizes, ranging from three to 20 genes, were generated for each of the three categories. These groups were then applied in two different machine learning approaches, Random Forest (RF) and Support Vector Machine (SVM). For the RF method, Leave One Out cross-validation (LOOCV) was applied, in which each sample in turn serves as the validation set while the remaining samples are used to train the model. This procedure was repeated five times. For the SVM, LOOCV was applied without the five-fold repetition, since SVM is a deterministic approach. The analyses were performed in PyCharm Edu (version 2020.2.1) and scikit-learn (version 1.4.1) using Python (version 3.10) for Windows 11. To compare the different subgroups generated through RF and SVM, the area under the curve (AUC) score was determined using a test set. These methods were applied to the whole RNA-Seq, coding genes, and non-coding gene groups separately.
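The leave-one-out procedure and the AUC score can be sketched as follows. To keep the example dependency-free, a simple nearest-centroid classifier is swapped in for the RF and SVM models that were actually used with scikit-learn, so this is an illustration of the validation scheme and not of the study's models. All names and the toy expression values are hypothetical.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loocv_scores(X, y):
    """Leave-one-out: score each sample with a model fit on all other samples."""
    scores = []
    for i in range(len(X)):
        train = [(x, lab) for j, (x, lab) in enumerate(zip(X, y)) if j != i]
        pos = centroid([x for x, lab in train if lab == 1])
        neg = centroid([x for x, lab in train if lab == 0])
        # Higher score = held-out sample lies closer to the label-1 centroid.
        scores.append(sq_dist(X[i], neg) - sq_dist(X[i], pos))
    return scores


def auc(scores, y):
    """Probability that a random label-1 sample outranks a random label-0 sample."""
    pos = [s for s, lab in zip(scores, y) if lab == 1]
    neg = [s for s, lab in zip(scores, y) if lab == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical two-gene expression values; 1 = progressor, 0 = HHC.
X = [[2.0, 1.8], [2.2, 2.1], [1.9, 2.4], [0.5, 0.7], [0.8, 0.4], [0.6, 0.9]]
y = [1, 1, 1, 0, 0, 0]
print(auc(loocv_scores(X, y), y))
```

Because every sample is scored by a model that never saw it, the resulting AUC is less optimistic than one computed on the training data. A stochastic learner such as RF gives slightly different scores per run, which is why repeating the procedure is useful there and unnecessary for a deterministic learner.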

Based on the average of higher AUC score values from both RF and SVM approaches, the best group size of genes was chosen in the whole RNA-Seq, coding genes, and non-coding genes groups, totaling six groups. After that, the intersection genes between the RF and SVM approaches were selected in the three different gene groups. The genes in the intersection of RF and SVM were split into coding and non-coding genes and ranked according to previously determined order of significance by the EdgeR, DESeq2, and LimmaVoom libraries, resulting in three different rankings. The ten most significant genes of each library were compared, and the common genes were selected. The RISK4LEP gene signature was included in the intersection of common genes of these two different machine learning approaches. Based on the intersection of shared genes, the chi-square and the RFE methods were applied to generate subgroups of three genes, after which the subgroups were applied in RF and SVM.
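The step of comparing the most significant genes across the three rankings amounts to a set intersection, sketched below under simplifying assumptions. The function name, the placeholder gene identifiers, and the reduced list length are hypothetical; the study compared the ten most significant genes per library.

```python
def shared_top_genes(rankings, top_n=10):
    """Return genes found in the top_n of every ranking (most significant first)."""
    top_sets = [set(ranked[:top_n]) for ranked in rankings.values()]
    shared = set.intersection(*top_sets)
    # Report the shared genes in the order of the first ranking.
    first = next(iter(rankings.values()))
    return [gene for gene in first if gene in shared]


# Hypothetical rankings with placeholder gene identifiers, not study results.
rankings = {
    "edgeR": ["g1", "g2", "g3", "g4"],
    "DESeq2": ["g2", "g1", "g4", "g3"],
    "Limma-Voom": ["g2", "g5", "g1", "g3"],
}
print(shared_top_genes(rankings, top_n=3))
```

Requiring agreement between all three rankings is a conservative choice: a gene ranked highly by only one library is dropped.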

Statistical analysis in the validation cohort

DGE analysis of RT-qPCR data of the validation cohort of progressors and HHC was performed by comparing the median expression between the two groups and was evaluated with the Mann-Whitney test. When more than two groups were compared, the difference in median expression between groups was assessed with the Kruskal-Wallis test, followed by Dunn’s multiple comparisons test. DGE was considered significant when the corresponding p-value was below the 0.05 threshold. The analyses were performed using GraphPad Prism (version 9.3.1) for Windows.
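For readers unfamiliar with the two-group comparison, the sketch below computes the Mann-Whitney U statistic from mid-ranks. The p-value uses a normal approximation without tie or continuity correction, which is a simplification and an assumption of this example; it will generally differ from the values reported by GraphPad Prism. The function names and input values are hypothetical.

```python
import math


def midranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return (U for group_a, approximate two-sided p-value)."""
    n1, n2 = len(group_a), len(group_b)
    ranks = midranks(list(group_a) + list(group_b))
    u_a = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Normal approximation; no tie or continuity correction (simplification).
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_a - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_a, p_value


# Hypothetical oriented expression values, not study data.
group_a = [5.1, 6.2, 7.3, 8.4]
group_b = [1.0, 2.0, 3.0, 4.0]
u_stat, p_approx = mann_whitney_u(group_a, group_b)
print(u_stat, p_approx)
```

Dividing U by the product of the two group sizes gives the AUC for that single marker, which links this test to the discriminatory performance reported per gene.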

Results

Differential gene expression in blood between leprosy progressors and HHC before leprosy diagnosis

In a previous study, we analyzed RNA-Seq data of a cohort of HHC in Bangladesh to identify leprosy risk-predicting host transcriptomic biomarkers encoding for serum proteins that could apply to rapid PoC diagnostic tests. Therefore, non-coding genes were excluded from that analysis10. Since the potential of host transcriptomic biomarkers in PoC diagnostic tests26,35 was recently demonstrated for TB, the current study aimed to optimize the previously identified host transcriptomic biomarker signature that predicted risk of developing leprosy (RISK4LEP)10, by including also non-coding genes in the RNA-Seq data for evaluation of their discriminatory potential for progressors and HHC.

Since performing different analyses and combining various analytical frameworks can help mitigate bias and ensure reliable findings, identification of DGE between progressors (n = 39) and HHC (n = 39), using the same RNA-seq data10, was here performed through three different statistical analyses: the DESeq2, EdgeR, and LimmaVoom methods as summarized in the study design pipeline in Fig. 1 and further detailed in Supplementary Fig. S1.

Fig. 1

Schematic overview of analyses used to discover host transcriptomic biomarkers predicting the risk for leprosy. Including both coding and non-coding genes from the RNA-Seq data, a funnel approach of statistical analysis identified 19 candidate biomarkers, of which 14 were testable in RT-qPCR and 12 validated as host transcriptomic biomarkers associated with risk of leprosy. FC: Fold Change; RFE: Recursive feature Elimination; RF: Random Forest; SVM: Support Vector Machine. * Three of the 4 genes in the RISK4LEP gene signature were added for comparative purposes; one of the genes (TPGS1) was among 37 genes identified by the current analysis.

From 58,741 genes included in total in the RNASeq dataset (coding and non-coding genes), the DESeq2 method identified 344 genes as significantly differentially expressed between progressors and HHC. Among these, 142 were upregulated and 202 were downregulated (Supplementary Figure S2). Using the EdgeR method, 367 DEGs were found, with 183 upregulated and 184 downregulated. The LimmaVoom method identified 187 DEGs, of which 101 were upregulated and 86 were downregulated. For each method, the 20 most significant genes are highlighted in the volcano plots in Supplementary Figure S2.

Comparison of the three analytical methods (DESeq2, EdgeR, and LimmaVoom) yielded 108 genes that were detected by all three methods, including 62 coding and 46 non-coding genes (Figs. 2a, b, and c, respectively). DEGs identified in the subgroups (only coding or only non-coding genes) are described in Supplementary Tables S4 and S5, respectively.

Fig. 2

Venn diagram representing DEGs of whole blood RNA expression of leprosy progressors (n = 39) and household contacts (HHC) (n = 39) in (a) the whole RNA-Seq dataset (b) coding, or (c) non-coding genes. DEGs were identified using three statistical methods (DESeq2, EdgeR, and LimmaVoom). Numbers within the diagrams indicate either the overlap of DEGs between the 3 methods or DEGs uniquely identified by each method.

Application of different machine learning approaches to identify genes capable of predicting leprosy development

Next, two different machine learning models (RF and SVM) were applied to the DEGs that were identified by all 3 methods mentioned above in either the whole RNA-Seq dataset (n = 108), coding (63), and non-coding (45) genes. In addition, using different feature importance parameters (RFE and chi-squared methods), groups of 3 to 20 genes were generated to identify the optimal gene combination (based on the AUC score) able to predict leprosy. Out of the 108 genes, 37 DEGs were identified by RF as well as SVM and included for further analysis (Supplementary Table S6). From these 37 genes, one gene was previously identified in the RISK4LEP signature (TPGS1). For the sake of comparison, we also included the other three RISK4LEP genes (MT-ND2, REX1BD, and UBC) for further validation10. Among these 40 DEG, 18 were coding and 22 were non-coding genes (Fig. 1 and Supplementary Fig. S1; Supplementary Fig. S3).

Application of machine learning approaches to identify the optimal 3-biomarker signature to predict leprosy

To identify the best combination of genes that jointly identify progressors amongst HHC, feature selection was performed on the 40 selected genes to identify the optimal 3-gene combination in analogy to the 3-gene biomarker signature for active TB. In the RF approach, RFE identified SNHG8, C6orf48, and SNHG5 as the optimal gene combination (AUC:0.96; sensitivity (Sn): 88%; specificity (Sp): 88%; Supplementary Table S7). Using the chi-square method, the best 3 gene-combination consisted of UBC, RPL18AP3, and RPS21 (AUC: 0.85; Sn: 76%; Sp: 78%; Supplementary Table S7). In the SVM approach, RFE identified TRIM74, RP11-385D13.3, and TMEM238 as the best combination (AUC:0.94; Sn:85%; Sp:95%). When chi-square was applied in SVM, the same optimal gene set as in RF (UBC, RPL18AP3, and RPS21 genes) was obtained, but with an improved performance (AUC: 0.91; Sn:77%; Sp:85%).

Validation of host transcriptomic biomarkers for leprosy prediction

From the 40 genes, the ten most significant coding and non-coding genes were ranked, and the genes shared between the DGE approaches were selected (Supplementary Table S8). This selection, along with the groups containing the optimal 3-gene combinations, yielded 19 genes, of which 14 were commercially available for RT-qPCR purposes, namely C6orf48, UBC, MT-CO1, C12orf57, MZT2B, MT-ND2, TMEM238, DPM3, RPL13AP5, RPSAP58, SNHG8, SNHG5, RPS21, and TRIM74. These genes were tested in an independent cohort of 12 progressors and 12 HHC from the same area in northwest Bangladesh for subsequent validation.

Of the 14 tested genes, 12 were significantly differentially expressed between progressors and HHC: RPSAP58, RPS21, C12orf57, SNHG5, DPM3, SNHG8, C6orf48, UBC, TMEM238, RPL13AP5, and TRIM74 were upregulated in the progressors group compared to the HHC, while MT-ND2 was downregulated (Fig. 3). No significant differences were found for MT-CO1 and MZT2B.

Fig. 3

Differential gene expression (DGE) in whole blood samples between progressors and household contacts (HHC). mRNA levels were measured by RT-qPCR in whole blood RNA samples of progressors (n = 12) and HHC (n = 12). The Mann–Whitney U test was performed to determine the statistical significance between groups (adjusted p-values: * p < 0.05, ** p < 0.01, *** p < 0.001, **** p < 0.0001). Relative target gene expression was determined by calculating ∆Ct using GAPDH as a reference gene. Since mRNA levels were expressed as ΔCt values, larger values represent lower mRNA levels. ΔCt values were multiplied by −1.

The discriminatory potential of these genes in distinguishing progressors from HHC was assessed as well, and yielded the highest AUC values for RPSAP58, C12orf57, and RPS21 (0.96, 0.92, and 0.90, respectively) (Supplementary Fig. S4). Remarkably, RPSAP58 achieved a sensitivity of 100% and a specificity of 83%, underscoring its potential as a host biomarker to discriminate progressors from HHC. C12orf57 had a sensitivity of 83% and a specificity of 92%, whereas RPS21 achieved a sensitivity of 75% and a specificity of 92% (Supplementary Fig. S4).

To validate the optimal 3-gene signature (SNHG8, C6orf48, and SNHG5) found in the RNA-Seq analysis, NUM-scores (based on the sum of the number of individual positive biomarkers detected) were calculated using optimal cut-offs (Youden’s index)14,36, to evaluate discrimination between progressors and HHC in the validation cohort. This 3-gene signature achieved a Sn of 100% and Sp of 66.67% (AUC: 0.92; Supplementary Fig. S5). At the second-highest Youden index value, this signature achieved a Sn of 83.33% and Sp of 75%. These findings highlight the contribution of non-coding genes, in conjunction with coding genes, to improving early diagnosis and accurate prediction of leprosy.
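The NUM-score calculation can be sketched as follows. The sketch assumes that a marker counts as positive when its oriented expression value is at or above its cut-off, which fits upregulated markers; this direction, the function names, the placeholder marker names, and all values are assumptions of this example and not study data.

```python
def youden_cutoff(case_values, control_values):
    """Cut-off maximizing Youden's J = sensitivity + specificity - 1."""
    best_j, best_cutoff = -1.0, None
    for cutoff in sorted(set(case_values) | set(control_values)):
        sensitivity = sum(v >= cutoff for v in case_values) / len(case_values)
        specificity = sum(v < cutoff for v in control_values) / len(control_values)
        j = sensitivity + specificity - 1
        if j > best_j:  # on ties, the lowest cut-off is kept
            best_j, best_cutoff = j, cutoff
    return best_cutoff


def num_score(sample, cutoffs):
    """Number of markers whose value reaches its own cut-off."""
    return sum(sample[marker] >= cutoff for marker, cutoff in cutoffs.items())


# Hypothetical oriented expression values for two placeholder markers.
cases = {"marker_1": [4.0, 5.0, 6.0], "marker_2": [3.0, 5.0, 6.0]}
controls = {"marker_1": [1.0, 2.0, 3.0], "marker_2": [1.0, 2.0, 2.5]}
cutoffs = {m: youden_cutoff(cases[m], controls[m]) for m in cases}

print(cutoffs)
print(num_score({"marker_1": 4.5, "marker_2": 2.0}, cutoffs))
```

A signature-level call then requires choosing how many positive markers define a positive test; lowering that number favors sensitivity and raising it favors specificity, which is the trade-off described for the two Youden index values above.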

Assessment of host transcriptomic biomarkers in leprosy patients

Since the RNA-Seq dataset was based on whole blood RNA expression levels in contacts of leprosy patients 4–61 months before diagnosis (progressors) or in contacts who never developed disease (HHC), we also assessed the 12 DEGs in MB and PB patients and compared these to the groups of progressors and HHC as well as an endemic control (EC) group (Fig. 4). Progressors presented a higher expression of RPS21 and SNHG5 compared to PB patients and a higher expression of RPS21 and C12orf57 compared to EC and HHC, while SNHG5 and DPM3 presented a higher expression only in relation to HHC (Fig. 4). DPM3 also presented a significant downregulation in MB patients compared to EC and progressors (Fig. 4).

MB and PB patients showed significantly higher expression of SNHG8 and C6orf48 compared to HHC. SNHG8 also presented a higher expression in MB patients compared to progressors (Fig. 4). The C6orf48 presented significantly increased expression in the PB patients compared to EC. Furthermore, UBC expression was higher in MB patients compared to EC (Fig. 4).

TMEM238 and RPSAP58 were significantly differentially expressed between the groups PB, MB, progressors, EC, and HHC. Both genes presented a higher expression in the EC group compared to progressors and HHC. Furthermore, they also showed an upregulation in PB patients compared to HHC (Fig. 4). TMEM238 presented a higher expression in PB patients compared to MB patients and progressors. RPSAP58 also showed an upregulation in MB patients compared to HHC (Fig. 4). All significant DEGs are described in Supplementary Table S9. These findings describe the potential role of RPS21 and SNHG5 in the progression of leprosy, suggesting a temporary role during the early (preclinical) stages of the disease. The main results of this study are described in Supplementary Fig. S5.

Fig. 4

Differential gene expression (DGE) in whole blood samples between multibacillary (MB) patients, paucibacillary (PB) patients, endemic controls (EC), progressors, and household contacts (HHC). The mRNA levels were measured by RT-qPCR in human samples of MB (n = 17), PB (n = 16), EC (n = 16), progressors (n = 12), and HHC (n = 12). The Kruskal-Wallis test, followed by Dunn’s multiple comparisons test, was performed to determine the statistical significance between groups (adjusted p-values: * p < 0.05, ** p < 0.01, *** p < 0.001, **** p < 0.0001). The horizontal lines indicate the median with the interquartile range. GAPDH was used as a reference gene, as highlighted in Fig. 3.

Discussion

Leprosy is a chronic infectious disease that can cause nerve damage, disability, and stigma if untreated2. M. leprae’s slow growth delays symptoms and potentially allows silent transmission37, while diagnosis still relies on clinical expertise and invasive tests9. A low complexity test based on host biomarkers that can be identified in minimally invasive samples would facilitate leprosy care. Recent advances in host transcriptomic biomarker signatures offer promising tools for diagnosis, early detection, and surveillance of leprosy10,11,12,18,19,38. The successful application of host transcriptomic biomarkers in PoC tests has been described for tuberculosis: the Cepheid GeneXpert MTB-HR test evaluates three genes (GBP5, DUSP3, and KLF2) that distinguish active TB from other respiratory diseases (ORDs) using fingerstick blood26,27,39, demonstrating the potential of host transcriptomic-based diagnostics as PoC tests.

Recent studies have demonstrated that non-coding genes, such as miRNAs and lncRNAs, are involved in the regulation of leprosy pathogenesis21,22,23. Cáceres-Durán et al. showed that a miRNA signature composed of hsa-miR-16-5p, hsa-miR-106b-5p, hsa-miR-1291, hsa-miR-144-5p, and hsa-miR-20a-5p can discriminate between individuals without leprosy and leprosy patients40. The function of these miRNAs in leprosy patients may lead to the dysregulation of several biological pathways, such as apoptosis, autophagy, cell cycle regulation, and mitophagy40. These studies highlight the potential of non-coding genes as host biomarkers to improve leprosy diagnostics and provide insights into the immunopathology of this disease.

In this study, 12 individual genes were differentially expressed between progressors and HHC in a validation cohort, highlighting these host transcriptomic biomarkers as promising candidates for early detection of leprosy. In addition, the differential expression of these genes across progressors, PB, MB, EC, and HHC groups was evaluated.

An optimal 3-gene signature (SNHG5, SNHG8, and C6orf48) was identified, which achieved a sensitivity of 88% and specificity of 88% (AUC = 0.96), with higher expression in progressors compared to HHCs who did not develop disease. The C6orf48 gene is reported to be possibly involved in translation processes, in association with ribosomal proteins, in colon cancer41. Moreover, polymorphisms in this gene, located at chromosome 6 in the MHC III region, are positively correlated with the susceptibility to the development of breast cancer in a Chinese cohort42. For SNHG5, a long non-coding RNA (lncRNA), several functions have been described, such as cell survival, differentiation, and migration43. This gene also plays a pivotal role in the development of different types of cancer43. For example, the inhibition of SNHG5 in a human leukemia cell line reduces cell proliferation and induces apoptosis44. SNHG5 is also involved in the modulation of immune responses: its upregulation is related to a reduction of inflammation, as this gene binds to the nuclear factor-kappa B (NF-κB) p65 subunit, thereby inhibiting signaling and reducing the inflammatory response cascade45. This could indicate that SNHG5 participates in the regulation of the immune response, in an attempt to dampen pro-inflammatory signaling as present in those HHCs who progress to PB leprosy. SNHG8, like SNHG5, belongs to the lncRNA family and shares functions with SNHG5, indicating its regulatory role in cellular functions, such as cell survival, differentiation, and migration46,47,48,49.

The 3-gene signature identified in the present study (SNHG5, SNHG8, and C6orf48) outperformed the RISK4LEP signature (MT-ND2, REX1BD, TPGS1, and UBC; Sn: 87.5%; Sp: 72.3%; AUC: 0.86). In this study, the optimal 3-gene signature achieved a Sn of 100% and Sp of 67% (AUC: 0.92). This improvement highlights the value of incorporating non-coding genes, which can capture critical disease-related regulatory signals. When opting for higher specificity (75%), sensitivity decreased to 83.33%. However, when targeting all individuals who will develop leprosy for prophylactic treatment, a higher sensitivity is preferred to minimize missed cases, although specificity remains important to avoid overtreatment.

The current data also showed that RPS21 and SNHG5 expression were significantly higher in progressors compared to PB patients and HHC who did not develop disease. The elevated expression of these genes in HHC who progress to disease later indicates a temporary activation of these genes during early (preclinical) disease, which decreases again after development of clinical PB disease. The gene RPS21 encodes a component of the 40S ribosomal subunit50. This gene has been described as upregulated in metastasis of hepatocellular carcinoma cells and has been reported as a promising prognostic biomarker for patients with this disease51,52. It has been reported that the increase of RPS21 is negatively correlated with ubiquitination of the GPX4 enzyme, a key regulator of ferroptosis. The stability of GPX4 leads to a reduction of reactive oxygen species (ROS) levels and the inhibition of ferroptosis52. During M. leprae infection, a reduction of oxidative stress could favor bacterial survival and progression of disease. Furthermore, variants of other genes encoding ribosomal proteins, such as the mitochondrial ribosomal protein S5 (MRPS5) gene, were shown to contribute to risk for leprosy in a Chinese population53. To the best of our knowledge, this is the first study to report RPS21 and SNHG5 as promising host transcriptomic biomarkers for detecting leprosy in its early stages.

In our findings, SNHG8 expression was higher in MB patients compared to progressors. In line with our findings on the biomarker potential of lncRNA, Fava et al. identified a lncRNA with two isoforms, RP11-135D11.2, and an uncharacterized LOC105378318 gene with two variants, rs1875147 and rs10826321, which were potential risk factors for the development of leprosy type-1 reactions (T1R)22. Although downregulated in MB patients, the DPM3 gene also showed discriminatory potential. This gene encodes subunit 3 of the dolichol-phosphate mannose (DPM) synthase complex, which also comprises two other subunits, DPM1 and DPM254,55. DPM synthase plays a role in a range of glycosylation processes, such as N-glycosylation and O-mannosylation56. Its downregulation in MB patients could indicate altered glycosylation, such as the synthesis of truncated N-glycans and hypo-O-mannosylated proteins57. Such alterations could affect multiple immune response pathways, such as lymphocyte activation and immune signaling58, leading to poor cellular immunity against M. leprae.

Although in the current study we analyzed multiple clinical groups, all samples were collected in Bangladesh. As socioeconomic, environmental, genetic, and demographic factors can influence RNA expression, future studies with cohorts from other countries are essential to validate our findings worldwide. However, evaluation of the DEGs from this study in an alternative cohort might be challenging, as the current analysis was based on > 9 years of sampling of HHCs without leprosy and active case finding by field staff due to the low number of HHCs who develop leprosy after infection, as well as the long incubation time of leprosy.

In summary, analysis of RNA-Seq data of HHC of leprosy patients in Bangladesh using a combination of novel DGE analyses and machine learning approaches resulted in validation of 12 genes differentially expressed between progressors and HHC (RPSAP58, RPS21, C12orf57, SNHG5, DPM3, SNHG8, C6orf48, MT-ND2, UBC, TMEM238, RPL13AP5, and TRIM74). Three of these genes, SNHG5, SNHG8, and C6orf48, jointly formed the optimal 3-gene signature, which achieved a sensitivity of 88% and specificity of 88% (AUC = 0.96). Moreover, RPS21 and SNHG5 were expressed at significantly higher levels in progressors compared to diagnosed PB patients. This study identifies promising new host biomarkers that may contribute to improved targeting of prophylactic measures to prevent leprosy, enhance prognosis and diagnosis, and provide a better understanding of the immunopathology of the disease.