Introduction

Hashimoto’s thyroiditis (HT) is an autoimmune disease characterized by specific thyroid pathology1. Its prevalence has increased in recent decades, affecting approximately 0.3–1.5 cases per 1000 individuals2,3. Pathological features include lymphocyte infiltration, thyroid enlargement, cell destruction, atrophy, and fibrosis4. Thyroglobulin antibodies (TGAb) and thyroid peroxidase antibodies (TPOAb) 5 are serological markers indicating autoimmune attack on the thyroid. HT often presents with thyroid morphological changes and elevated antibody levels, sometimes without obvious symptoms or significant thyroid dysfunction6, posing challenges in molecular characterization.

Non-coding RNAs (ncRNAs) are RNA transcripts that don’t encode proteins but play vital roles in regulating gene transcription and protein translation. Altered expression patterns and regulatory networks of ncRNAs have shown potential for enhancing our understanding of chronic diseases like cancer, neurodegenerative disorders, and cardiovascular diseases7,8,9. Specific ncRNA expression studies offer promise as molecular signatures10. In HT research, Zhao et al. identified 6 miRNAs as potential plasma markers11, Peng et al. found that upregulated lncRNA-IFNG-AS1 contributes to HT’s pathogenesis12, and Xiong et al. highlighted the role of has_cir_0089172 in HT pathogenesis by sequestering miR-125-3p13. However, identifying expression patterns and regulatory network changes in HT remains challenging, with implications for understanding disease mechanisms and potential therapeutic strategies.

Traditional methods of analyzing differential expression focus on individual RNA types, providing detailed biological information but neglecting interrelationships and joint regulation between different RNA types14. Integrative analysis methods, such as multi-omics factor analysis (MOFA), are needed to capture complex relationships by combining multiple RNA sequencing data types15,16. However, careful feature selection, appropriate data preprocessing, and efficient computing methods are necessary to enhance MOFA’s performance with high-dimensional RNA-seq data. Artificial intelligence models, including machine learning, have shown promise in disease characterization and personalized medicine17,18. In HT research, Zhao et al. used a deep learning model based on CNN to analyze ultrasound images19, while Peng et al. developed a machine learning model using clinical and laboratory data to characterize HT and identify key factors20. Integrating differential expression analysis, MOFA, and machine learning to identify RNA signatures and transcriptional regulatory networks in HT requires further research.

This study aims to analyze whole-transcriptome sequencing data from HT patients. By integrating differential expression analysis and MOFA models, we seek to identify transcriptomic signatures for characterizing HT and potential regulatory mechanisms. Co-expression and regulatory networks will be constructed to reveal changes in gene regulation. Additionally, a novel machine learning stacking model will be developed to assess the potential of these signatures for enhanced molecular characterization of HT. This comprehensive approach aims to provide valuable insights into HT molecular mechanisms and identify potential targets for future research and therapeutic strategies.

Methods

Study participants

This study involved two distinct cohorts from different medical centers. Between June 2022 and May 2023, a total of 31 early HT patients and 30 healthy controls (HC) were recruited from Qilu Hospital of Shandong University and the Second Affiliated Hospital of Xi’an Jiaotong University. The diagnosis of HT was based on specific criteria such as thyroid enlargement, characteristic ultrasound images, and elevated thyroid antibodies, with no clinical or biochemical evidence of thyroid dysfunction. To further identify early HT, a longitudinal assessment was conducted over a period of six months, with evaluations performed every two months, to observe the development of autoantibodies and ultrasound changes. If subjects experience thyroid dysfunction during the assessment period, they are considered to be in a stage other than early HT and excluded from this study. Participants with prior history of thyroid diseases, thyroid cancer, or thyroid surgery were excluded. The healthy control group exhibited normal thyroid function, no ultrasound abnormalities, and tested negative for thyroid autoantibodies. Exclusion criteria included the presence of cardiovascular, autoimmune, infectious, musculoskeletal, or malignant diseases, as well as recent surgery or trauma. Thyroid function tests were conducted using the Roche Cobas 6000 E601 module immunoassay analyzer. Demographic information was collected through a questionnaire. The study was ethically approved by the Medical Ethics Committee of Xi’an Jiaotong University and conducted following the Helsinki Declaration guidelines (NO. 2020844). Written informed consent was obtained from all participants.

RNA sequencing and data pre-processing

Peripheral blood samples were collected during the confirmed early HT assessment for transcriptome sequencing. Total RNA was extracted from peripheral blood samples of each individual using the PAXgene Blood RNA Kit (BD Biosciences, USA). Ribosomal RNA (rRNA) was depleted, and libraries were prepared using the TruSeq Small RNA Library Preparation Kit. The libraries were sequenced on the NovaSeq 6000 platform, generating 150 bp paired-end reads. Raw data quality control was conducted using FastQC (v0.11.8)21. miRNA identification was performed using Bowtie (v1.1.2)22, and miRNA expression profiles were generated with miRDeep2. The filtered and quality-controlled sequences, referred to as “clean reads”, were aligned using HISAT2 (v2.0.4)23. Transcript assembly and gene expression estimation were performed using StringTie (v1.3.1)24. After quality control and alignment, transcripts were classified through a systematic pipeline. For protein-coding genes, transcripts were mapped to the human reference genome (GRCh38) and annotated based on RefSeq database using StringTie. miRNAs were identified through alignment to miRBase (v22) using Bowtie and quantified with miRDeep2. For lncRNA identification, we first excluded known protein-coding transcripts and small RNAs, then selected transcripts longer than 200 nucleotides, and assessed their coding potential using CPC2 and CNCI tools. circRNAs were identified by detecting back-spliced junction reads using find_circ and CIRI2 algorithms, with at least two unique back-spliced reads required for circRNA annotation. All identified transcripts were further filtered based on expression level (FPKM > 0.1) to ensure reliable quantification. The resulting gene expression matrices for mRNA, miRNA, lncRNA, and circRNA were utilized for subsequent analysis.

Identification of HT-associated RNA signatures

Differential expression analysis

Differential expression analysis was conducted on four RNA expression matrices (mRNA, miRNA, lncRNA, and circRNA) using the limma25 and edgeR26 packages. The parallel implementation of both limma and edgeR packages for identifying DEGs was strategically chosen to leverage their complementary strengths. Limma excels in handling complex experimental designs through its sophisticated linear modeling framework and empirical Bayes methods, which are particularly effective for controlling false discovery rates in multi-factor analyses. Meanwhile, edgeR specifically addresses the challenges of RNA-seq count data through its negative binomial distribution-based statistical framework, making it especially robust for analyzing genes with low expression levels. This dual-package approach enhances the reliability of our differential expression analysis by combining limma’s statistical power in handling experimental complexity with edgeR’s specialized capabilities for RNA-seq data characteristics. The intersection of results from both methods provides a more stringent and confident set of differentially expressed genes. Differentially expressed RNAs meeting the significance criteria were identified (P-value < 0.05&|log2FC| > 1). The intersection of differentially expressed RNA lists obtained from both packages was used for further analysis.

Multi-omics factor analysis

MOFA16 was employed to integrate the expression matrices of the four RNA types identified in the differential expression analysis. Prior to generating the MOFA model, z-score normalization was applied to eliminate biases caused by library size discrepancies. The normalization was performed using the formula:

$${x}_{ij}^{\ast}=\frac{{x}_{ij}-{\mu}_{i}}{{\sigma}_{i}}$$

Where xij represents the expression value of RNA j in sample i, µi represents the mean expression value of sample i, and σi represents the standard deviation of sample i. A 15-factor MOFA model was generated using 10,000 iterations in ‘slow’ convergence mode.

Subsequent analysis was performed using the MOFA + tool15. The sample factor matrix was extracted to examine correlations between factors and clinical variables. Factors showing significant differences between the HT and control groups were identified. High-contributing weight features meeting the criteria (weight ≥ mean + 2 * standard deviation & weight ≤ mean − 2 * standard deviation) under these factors were selected as the characteristic RNA signatures.

Pathway enrichment analysis

For the mRNAs among the characteristic RNA signatures, we performed enrichment analysis based on Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) using the clusterProfiler package27. Enrichment information for cellular components, biological processes, molecular functions, and KEGG pathways with P-value < 0.05 was considered statistically significant. This analysis aimed to provide insights into the biological processes and pathways potentially involved in HT pathogenesis.

Analysis of the biological basis underlying HT-associated RNA signatures

Construction of co-expression networks

To identify potential regulatory relationships, we conducted a co-expression network analysis for the characteristic RNA signatures, focusing on lncRNA-miRNA, lncRNA-mRNA, miRNA-mRNA, and circRNA-miRNA interactions. Pearson correlation coefficient was used to examine the correlation between RNA pairs. RNA pairs with a statistical significance (P-value < 0.05) and meaningful correlation strength (|r| > 0.2) were included to construct the co-expression network, where RNAs represented nodes and significant correlations were depicted as edges. Visualization of the co-expression network was performed using Cytoscape. For node ranking, we employed CytoHubba with five attribute-ranking methods: Betweenness, Closeness, Degree, Edge Percolated Component (EPC), and Maximal Clique Centrality (MCC). The shared top 5 ranking nodes in attributes were considered as hub RNAs in the co-expression network. To identify modules with potential similar expression patterns, we utilized ClusterONE with the following parameters: Minimum size = 3, Minimum density = Auto, Edge weight = unweighted, Node penalty = 2, Haircut threshold = 0, Merging method = Single-pass, Similarity = Match coefficient, Overlap threshold = 0.8, Seeding method = From unused nodes.

Development of regulatory networks based on experimental non-coding RNA interactome

To construct a novel human Experimental Non-coding RNA Interactome (ENCI), we integrated the latest data from ten databases, including DIANA28, lncBook29, miRTarBase30,31, miRNet32, NPinter433, RAIN34, RISE35, Starbase36, Circbank37, RNAInter38, and data from a study by Morselli et al.39. Standardized IDs were used for different RNA types, with HUGO Gene Nomenclature Committee (HGNC)40 ID for mRNA and lncRNA, ENSEMBL ID for lncRNA not in HGNC, miRBase41 ID for miRNA, and circBase42 ID for circRNA. Self-interactions and duplicate records were removed for data quality assurance. Based on the ENCI, regulatory networks were constructed for characteristic RNA signatures, and CytoHubba and clusterONE software were used to identify hub RNAs and modules in these networks.

Development of a stacking machine learning model for enhanced HT molecular characterization

We extracted and integrated expression matrices corresponding to characteristic RNA signatures as the dataset for machine learning (ML) model construction. The dataset was randomly split into two independent subsets: one for model training (training set: 21 HT, 20 HC) and the other for model evaluation (testing set: 10 HT, 10 HC). Our modeling strategy involved combining 10-fold bagging and two-layer stacking. In the first layer, we trained 10 base models using 9/10 of the training set and made predictions on the remaining 1/10 of the data. The base models included logistic regression (LR)43, decision tree (DT)44, random forest (RF)45, extra tree (ET)46, gradient boosting decision tree (GBDT)47, XGBoost (eXtreme Gradient Boosting)48, support vector machine (SVM)49, multilayer perceptron (MLP)50, CatBoost51, and K-nearest neighbors (KNN)52. In the second layer, we concatenated the predictions from the first layer with the original training set as input. The second-layer models included the ten ML models from the first layer, and the final model was a logistic regression model. The selection of logistic regression as the final model was based on several key considerations. First, logistic regression excels in binary classification tasks and is particularly effective when combining predictions from multiple models, making it ideal for our stacking architecture. Second, its linear nature helps prevent overfitting when integrating diverse predictions from the double-layer models. Third, the model provides interpretable probability outputs and clear insights into the contribution of each base model, which is crucial for understanding the relative importance of different RNA signatures in HT characterization. Furthermore, in our preliminary evaluations, logistic regression demonstrated stable performance in handling the transformed feature space created by the double-layer predictions, while maintaining computational efficiency. Hyperparameter optimization and fine-tuning of the models were performed using the Tree-structured Parzen Estimator Approach (TPE)-based Bayesian optimization algorithm. Model performance and reliability were assessed using accuracy (ACC) and area under the curve (AUC) metrics on the testing set.

Results

Demographic and clinical features of the study participants

This study included 61 participants (31 HT, 30 HC), with similar demographic variables between the groups. Laboratory examinations revealed significant differences in thyroid morphology, function, hormone, and immunity levels (all p values < 0.05). The HT group showed increased thyroid volume, elevated tissue hardness, and higher levels of TPOAb and TGAb antibodies. Thyroid function indicators (FT3, FT4) showed varying degrees of abnormality, reflecting different stages of HT progression. Table 1 summarizes the demographic and clinical characteristics of the participants.

Table 1 The demographic and clinical characteristics of the subjects.

Identification and characterization of HT-associated RNA signatures

Distinct RNA profiles in HT by differential expression analyses

Differential expression analysis using the limma and edgeR packages identified significant mRNAs, miRNAs, lncRNAs, and circRNAs. The intersection analysis of both methods revealed distinct regulatory patterns across different RNA types. Specifically, we identified 1279 differentially expressed mRNAs (52 up-regulated, 1227 down-regulated), 9 miRNAs (3 up-regulated, 6 down-regulated), 652 lncRNAs (35 up-regulated, 617 down-regulated), and 18 circRNAs (3 up-regulated, 15 down-regulated) (Fig. 1). Notably, the predominant down-regulation pattern, particularly evident in mRNAs (96% down-regulated) and lncRNAs (95% down-regulated), suggests a broad transcriptional suppression in HT pathogenesis. The detailed results can be found in the supplementary materials (Table S1).

Fig. 1
figure 1

Differential expression analysis of RNA signatures in Hashimoto’s thyroiditis. (A-D) Volcano plots depicting the results of differential expression analysis based on limma for four RNA-seq datasets. Significantly differentially expressed RNAs were identified: 1743 mRNAs (342 Up, 1401 Down), 13 miRNAs (5 Up, 8 Down), 969 lncRNAs (198 Up, 771 Down), and 24 circRNAs (9 Up, 15 Down). (E-H) Volcano plots showing the results of differential expression analysis based on edgeR for the same four RNA-seq datasets. Significantly differentially expressed RNAs were identified: 1919 mRNAs (71 Up, 1848 Down), 20 miRNAs (7 Up, 13 Down), 873 lncRNAs (53 Up, 820 Down), and 33 circRNAs (9 Up, 24 Down). (I-L) Venn diagrams illustrating the intersection of upregulated differentially expressed RNAs identified by both limma and edgeR for the four RNA-seq datasets. (M-P) Venn diagrams displaying the intersection of downregulated differentially expressed RNAs identified by both limma and edgeR for the four RNA-seq datasets. Intersection analysis of the two methods resulted in a final set of preliminary RNA signature candidates: 1279 mRNAs (52 Up, 1227 Down), 9 miRNAs (3 Up, 6 Down), 652 lncRNAs (35 Up, 617 Down), and 18 circRNAs (3 Up, 15 Down). ENSEMBL IDs are shown only for transcripts without officially assigned gene symbols, primarily novel lncRNAs; all other transcripts are displayed using their HGNC gene symbols.

Key RNA signatures in HT by multi-omics factor analyses

MOFA model was used to identify key RNA signatures for HT molecular characterization. The model explained variances of 3.7%, 39.4%, 8.8%, and 39.9% for mRNA, miRNA, lncRNA, and circRNA data, respectively (Fig. 2A, Table S2). Factor correlation analysis indicated low correlations, reflecting a good model fit (Fig. 2B, Table S2). Variance decomposition analysis revealed that Factor7 explained the main mRNA variance, while miRNA variance was primarily explained by Factor1 and Factor3. lncRNA variance was distributed across multiple factors, with Factor2, Factor4, and Factor12 being prominent. Factors 1 and 2 predominantly explained circRNA variance (Fig. 2C, Table S2). Significant differences were observed in Factor3 (p = 0.016) and Factor10 (p = 0.006) between HT and HC groups (Fig. 2D and E, Table S2). RNA signatures meeting specific criteria (weight ≥ mean + 2 * standard deviation & weight ≤ mean − 2 * standard deviation) were identified based on feature weight matrices for Factor3 and Factor10, resulting in 79 characteristic RNA signatures including 3 mRNAs, 6 miRNAs, 64 lncRNAs, and 6 circRNAs (Table S2). The expression heatmap of these RNA signatures was presented in Fig. 2F. Enrichment analysis revealed the involvement of these RNA signatures in various biological processes, including peroxisome organization, peroxisomal membrane transport, immune system processes, Golgi apparatus function, and lipid metabolism (Fig. 2G, Table S2). These findings provide insights into the molecular mechanisms potentially involved in HT pathogenesis and progression.

Fig. 2
figure 2

Multi-omic factor analysis of RNA signatures in Hashimoto’s thyroiditis. (A) Bar plot demonstrating the total variance explanation of the model for the four RNA-seq datasets. (B) Heatmap representing the correlation between factors. (C) Heatmap displaying the variance explanation for each factor across the four RNA types. (D, E) Violin plots indicating the values of Factor 3 and Factor 10 across Hashimoto’s thyroiditis and healthy control samples. (F) Expression heatmap visualizing the characteristic RNA signatures identified through multi-omic factor analysis. (G) Bar plot presenting the results of GO and KEGG enrichment analysis for the mRNAs included in the characteristic RNA signatures. ENSEMBL IDs are shown only for transcripts without officially assigned gene symbols, primarily novel lncRNAs; all other transcripts are displayed using their HGNC gene symbols.

Molecular interactions and regulatory mechanisms in HT

Hub genes and modules in HT by co-expression networks

We categorized RNA signatures based on RNA interactions (lncRNA-miRNA, lncRNA-mRNA, miRNA-mRNA, circRNA-miRNA) and performed correlation tests, resulting in co-expression networks. The networks comprised 77 nodes and 266 edges, including lncRNA-miRNA (65 nodes, 121 edges), lncRNA-mRNA (66 nodes, 122 edges), miRNA-mRNA (7 nodes, 8 edges), and circRNA-miRNA (9 nodes, 15 edges) (Fig. 3A, Table S3). Using CytoHubba, we identified hsa-miR-548aq-3p, PEX13, MARCHF1, and hsa-miR-374a-5p as hub RNAs from the top shared nodes in attributes (Table S3). ClusterONE software extracted two modules with similar expression patterns, containing 5 RNAs each (Fig. 3B,I). Module 1 exhibited significant patterns among its RNA pairs (all p values < 0.05, Fig. 3C–H). Module 2 showed significant patterns for all pairs except between ENSG00000214955 and hsa-miR-6767-3p (p = 0.36) (all other p values < 0.05, Fig. 3J–P).

Fig. 3
figure 3

Results of co-expression network analysis in Hashimoto’s thyroiditis. (A) Visualization of the co-expression network. The yellow nodes represent hub genes, “V”-shaped nodes indicate modules, and lines between nodes represent interactions. (B) Visualization of Module 1. This module consisted of 5 RNA signatures (MARCHF1, PDC-AS1, ENSG00000236308, ENSG00000279098, PEX13). (CH) Expression correlation plots for molecular pairs within Module 1. Red points represent the HT group, purple points represent the healthy control group, the black line indicates the fitted line, and the gray interval represents the confidence interval. (I) Visualization of Module 2. This module comprised 5 RNA signatures (ENSG00000272372, GAL3ST2, hsa-miR-6767-3p, ENSG00000214955, ENSG00000229660). (JP) Expression correlation plots for molecular pairs within Module 2. Red points represent the HT group, purple points represent the healthy control group, the black line indicates the fitted line, and the gray interval represents the confidence interval. ENSEMBL IDs are shown only for transcripts without officially assigned gene symbols, primarily novel lncRNAs; all other transcripts are displayed using their HGNC gene symbols.

Dysregulation of multiple processes in HT by regulatory networks

By integrating information from 10 databases and the study by Morselli et al., an ENCI with 210,672 nodes and 30,392,876 edges was constructed (Table S4). Analysis of the 79 characteristic RNA signatures using the ENCI identified 18 RNAs with direct interactions, forming a regulatory network with 18 nodes and 45 edges (Fig. 4A). Hub RNAs in the network, ranked by CytoHubba, included hsa-miR-144-3p, hsa-miR-424-5p, hsa-miR-374a-5p, and hsa-miR-548aq-3p (Table S4). The regulatory network was clustered into four modules using ClusterONE software (Fig. 4B–E). The first module comprised 6 RNAs (hsa-miR-548aq-3p, hsa-miR-374a-5p, PEX13, MARCHF1, hsa-miR-144-3p, PTENP1-AS). The second module consisted of 5 RNAs (BACH1-IT2, hsa-miR-1178-3p, PDC-AS1, hsa-miR-374a-5p, PTENP1-A). The third module included 5 RNAs (hsa-miR-1178-3p, NUTM2B-AS1, hsa-miR-424-5p, hsa-miR-6767-3p, ARHGAP15-AS1). The fourth module contained 5 RNAs (BACH1-IT2, hsa-miR-1178-3p, HIF1A-AS3, hsa-miR-374a-5p, PTENP1-AS). GO and KEGG enrichment analysis of the 18 nodes revealed their involvement in key biological processes and pathways, such as immune system regulation, cell adhesion and migration, RNA and protein regulation, lipid metabolism, and vesicle and membrane regulation (Table S4, Fig. 4F). These findings suggest that HT is characterized by dysregulation of multiple biological processes and pathways, potentially contributing to its complex pathogenesis and progression.

Fig. 4
figure 4

Regulatory network analysis of RNA signatures in Hashimoto’s thyroiditis. (A) Visualization of the regulatory network. Gray circular nodes represent mRNAs, red “V”-shaped nodes represent miRNAs, yellow diamond-shaped nodes represent lncRNAs, and green square nodes represent circRNAs. (BF) Visualization of four modules within the regulatory network, highlighting potential functional RNA signature clusters. (G) Bar plot presenting the results of GO and KEGG enrichment analysis for the regulatory network, providing insights into the biological processes and pathways associated with the identified RNA signatures.

Enhanced HT molecular characterization via ensemble machine learning

We utilized the expression matrix of 79 RNA signatures to build ML models. Twenty randomly selected samples (10 HT, 10 HC) were used for testing, while the rest were used for training. Ten ML models mentioned in the methods were employed, along with a stacking model for cross-model comparison (hyperparameter settings in Table S5). Figure 5A showed the overview of our newly constructed stacking model. Figure 5B displayed the prediction results on the testing set (showing only models with ACC > 80%). The stacking model showed the best performance, achieving an ACC of 95% and an AUC of 97%. These results demonstrate the superiority of the multi-model fusion approach over individual models in characterizing HT molecular features using a small sample dataset. Our RNA signature selection method combined with the stacking model has potential for enhancing HT molecular characterization.

Fig. 5
figure 5

Stacking model construction and performance evaluation for molecular characterization of Hashimoto’s thyroiditis. (A) Overview of the proposed stacking model integrating multiple machine learning algorithms for enhanced molecular characterization of HT. (B) Comparison of various machine learning models, including the stacking model, on the independent testing set, demonstrating the performance in distinguishing Hashimoto’s thyroiditis from healthy controls based on the identified RNA signatures.

Discussion

In this study, performed a comprehensive transcriptomic analysis of HT patients and healthy individuals, identifying 79 characteristic RNA signatures. We constructed co-expression and regulatory networks based on these signatures and developed a Stacking model for enhanced molecular characterization of HT. Our model demonstrated excellent performance on an independent testing set, highlighting its potential for improving HT molecular profiling. These findings provide valuable insights into HT’s molecular mechanisms and may contribute to the development of more precise diagnostic and therapeutic strategies.

The widespread down-regulation observed across different RNA types provides important insights into HT pathogenesis. The down-regulated mRNAs were primarily enriched in pathways related to thyroid hormone synthesis and metabolism, suggesting compromised thyroid function. The dysregulated miRNAs, particularly the up-regulated hsa-miR-144-3p and hsa-miR-374a-5p, may contribute to immune dysregulation through their targeting of immune-related genes. The extensive down-regulation of lncRNAs implies altered transcriptional regulation, potentially affecting thyroid-specific gene expression programs. These regulatory patterns align with the progressive nature of HT and may represent early molecular events in disease development.

Previous transcriptomic studies often focused solely on differential expression analysis, overlooking the complex interactions between multiple RNA types. To address this limitation, we introduced MOFA, which simultaneously identifies factors explaining variation across different omics datasets. However, MOFA’s computational demands and sensitivity to data quality posed challenges. As noise or outliers can impact its accuracy, it is very necessary to reduce the dimensionality of RNA-seq data and remove irrelevant variables. To overcome these, we combined differential expression analysis with MOFA for screening HT characteristic RNA signatures and employed a fusion strategy using Bagging and Stacking to characterize HT based on the identified signatures. Bagging reduces overfitting by training diverse learners on subsets53, while stacking integrates different learners to reduce variance54. This strategy significantly mitigates model fluctuations caused by data changes and enhances performance for the complex task of molecularly characterizing HT.

HT is a complex autoimmune disease with a multifaceted pathogenesis. Integrating different RNA data types provides comprehensive insights into gene regulation, potentially identifying novel factors and pathways involved in HT55. Analysis of characteristic RNA signatures and enrichment helps uncover key molecules and processes involved in HT dysregulation. For instance, PEX13 and MARCHF1 impact immune cell activity. PEX13 regulates peroxisome balance, while MARCHF1 influences MHC II expression and immune recognition56,57. MiRNAs like hsa-miR-144-3p and hsa-miR-374a-5p control inflammation and autoimmune-associated genes58. While these molecules have known roles in immune regulation and inflammation, their specific involvement in HT pathogenesis and identification as hub genes in HT-specific regulatory networks represents a novel finding. Similarly, hsa-miR-1178-3p interacts with BRCA2, potentially influencing thyroid cancer risk59. HIF1A-AS3 expression is higher in multiple sclerosis60. This combination of known immune regulators and novel HT-associated RNAs suggests both the biological relevance of our findings and their potential to reveal previously unknown aspects of HT pathogenesis. These findings suggest common expression patterns and regulatory changes among autoimmune diseases, with PEX13, MARCHF1, hsa-miR-144-3p, hsa-miR-374a-5p, hsa-miR-1178-3p, and HIF1A-AS3 potentially playing significant roles in HT pathogenesis. Our enrichment analysis supports these findings, indicating the involvement of immune system processes, cell processes, and RNA/protein regulation in HT pathogenesis. Notably, the characteristic RNA signature co-expression network resembles the regulatory network but contains more nodes and edges, suggesting additional molecules and relationships worth exploring further.

While our study provides valuable insights into HT’s molecular landscape, there are limitations to consider. The sample size was relatively small, and the study population was homogeneous, which may limit the generalizability of our findings. Although we implemented several strategies to maximize reliability, including rigorous patient selection with longitudinal assessment, 10-fold bagging in our stacking model, and balanced training-testing sets, external validation using larger and more diverse cohorts remains essential. This validation should include independent cohorts from different populations, cross-center studies, and testing across various disease stages and subtypes. Further research is necessary to functionally characterize the identified RNA signatures and explore their potential as therapeutic targets. Cellular and animal models can help elucidate their mechanisms in HT pathogenesis. Integrating other omics data, such as proteomics and metabolomics, will further enhance our understanding of HT’s molecular features. Addressing these limitations through additional investigations will advance our understanding of HT and potentially uncover novel therapeutic targets.

Conclusions

Our study has yielded several novel and significant findings in HT research. First, we identified a unique set of 79 characteristic RNA signatures specific to early-stage HT, including previously unreported associations such as hsa-miR-548aq-3p. Second, we constructed the first comprehensive experimental non-coding RNA interactome (ENCI) for HT, integrating data from ten databases and revealing novel regulatory networks with 18 nodes and 45 edges. Third, we discovered two distinct co-expression modules and four regulatory modules that provide new insights into potential HT pathogenic mechanisms. Finally, our innovative stacking model achieved superior performance (95% accuracy, 97% AUC) in molecular characterization of early HT, demonstrating the potential of machine learning in disease diagnosis. These computational findings establish a foundation for future experimental validation and potential therapeutic development.

In summary, this study performed a comprehensive transcriptomic analysis of peripheral blood from HT patients, identifying 79 characteristic RNA signatures through the integration of differential expression analysis and MOFA. Our co-expression and regulatory network analyses revealed key molecular interactions, including previously unreported RNA relationships and functional modules in HT. The constructed stacking model achieved promising performance in molecular characterization, suggesting the potential utility of RNA signatures in HT diagnosis. While experimental validation is needed, our findings provide new insights into the complex RNA regulatory networks in HT and establish a foundation for future mechanistic studies and potential therapeutic developments. This computational framework also demonstrates the value of integrative approaches in understanding autoimmune disease pathogenesis.