Introduction

Endometriosis (EMS) and recurrent miscarriage (RM) are widespread reproductive disorders that significantly affect women globally. Endometriosis, which affects approximately 10% of women of reproductive age, is characterized by the presence of endometrial tissue outside the uterus. This condition often leads to dysmenorrhea, chronic pelvic pain, and infertility1. Recurrent miscarriage, defined as two or more consecutive pregnancy losses before 20–24 weeks of gestation, affects approximately 2.5% of couples. It is associated with several factors, including chromosomal abnormalities, uterine defects, and endometrial dysfunction2. Both conditions present significant diagnostic challenges: the initial diagnosis of endometriosis relies primarily on subjective symptom assessments, while recurrent miscarriage is typically diagnosed only after post-loss evaluations. Laparoscopy remains the gold standard for confirming endometriosis3,4. Current treatments for both conditions focus primarily on alleviating symptoms rather than addressing the underlying disease pathogenesis5,6. This therapeutic stagnation can be attributed to the incomplete understanding of the mechanisms underlying these diseases. For instance, the management of recurrent miscarriage remains debated, particularly regarding the role of antiphospholipid antibodies and immune modulators7, with conflicting opinions on their use8. The economic burden of these conditions is substantial, with endometriosis alone costing over 69.4 billion dollars annually on a global scale, while recurrent miscarriage imposes a cost of £471 million per year in the UK9,10. These figures highlight the urgent need to elucidate the pathogenesis of these conditions and to develop novel diagnostic and therapeutic strategies.

Given this unmet clinical need, recent attention has turned toward novel biological processes that may unify the pathophysiology of both diseases. Endothelial-mesenchymal transition (EndMT) has recently emerged as a critical but underexplored mechanism in both endometriosis and recurrent miscarriage. While both epithelial-mesenchymal transition (EMT) and EndMT are characterized by a shift toward mesenchymal phenotypes, they differ in their cellular origin and biological functions. EMT arises from epithelial cells and has been widely studied in cancer progression and tissue repair, whereas EndMT derives from endothelial cells and plays a central role in vascular remodeling, inflammation, and fibrosis11,12,13,14. Recent studies have identified EndMT as a critical contributor to gynecological disorders such as endometriosis and recurrent miscarriage, where vascular and immune dysregulation are central features15. Therefore, our study specifically focused on EndMT-related gene signatures to elucidate the endothelial-derived molecular mechanisms underlying EMS and RM. Despite the traditional classification of these diseases as epithelial disorders, accumulating evidence suggests that EndMT serves as a mechanistic bridge between vascular dysfunction and immune dysregulation in these conditions16. EndMT is characterized by the loss of endothelial markers and the acquisition of mesenchymal features, and it plays a pivotal role in promoting fibrosis in endometriosis through TGF-β signaling. Additionally, it impairs spiral artery remodeling in recurrent miscarriage, which is essential for proper placental development17,18. These pathological effects highlight the multifaceted roles of EndMT beyond its traditional vascular context. Unlike EMT, EndMT is particularly influential in modulating immune interactions, including macrophage polarization and cytokine network regulation19,20.
However, the precise role of EndMT in the pathogenesis of these diseases remains unclear, especially in terms of the conservation of regulatory pathways, the identification of reliable biomarkers, and the development of targeted therapies. These knowledge gaps impede the development of more effective treatments.

To address these gaps, this study adopts a multi-omics strategy to systematically analyze the common molecular mechanisms of EndMT involved in both endometriosis and recurrent miscarriage. By combining transcriptomic data from publicly available repositories with curated EndMT gene profiles, this study aims to identify key molecular modules involved in the EndMT process, uncover biomarkers that could assist in diagnosis, and explore therapeutic targets grounded in mechanistic insights. This approach will enhance our understanding of how EndMT contributes to the pathogenesis of both diseases, with the potential to propose targeted interventions that could modulate the vascular and immune dysfunctions underlying these conditions.

Materials and methods

Data acquisition and sample overview

This study utilized the R package GEOquery (version 2.70.0) to download two datasets from the GEO database (https://www.ncbi.nlm.nih.gov/geo/): the Endometriosis dataset (GSE120103) and the Recurrent Miscarriage (RM) dataset (GSE165004)21,22. These datasets were selected based on criteria such as sample type consistency, data completeness, and clinical annotation quality. Both datasets consist of human endometrial tissue samples with clearly defined case and control groups. GSE120103 is based on the GPL6480 platform and includes 18 endometriosis and 18 control samples, while GSE165004 uses the GPL16699 platform and includes 24 recurrent miscarriage (RM) and 24 control samples. The detailed information of the GEO microarray chips is summarized in Table 1. Other candidate GEO datasets were initially reviewed but excluded due to inadequate sample size or missing clinical information. The two selected datasets provided the most complete transcriptomic resources for the analysis of endometrial lesions and recurrent pregnancy loss—core focuses of this study.

Table 1 GEO microarray chip information.

Raw expression matrices were inspected to determine whether log2 transformation had been applied; for datasets in non–log2 scale, log2 transformation was performed. Between-array quantile normalization was applied using limma::normalizeBetweenArrays (v3.58.1). Missing values were imputed using the k-nearest neighbors method implemented in the impute R package (default k = 10), and probes with > 20% missing values were excluded from analysis. Probe annotation was performed using the official GPL platform files; probes without valid gene symbols were removed, and for genes with multiple probes, the probe with the highest average expression across all samples was retained. Quality control was performed using boxplots before and after normalization for each dataset (Supplementary Fig. S1).
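The probe-filtering and probe-to-gene collapsing rules described above can be illustrated with a short sketch. The study itself was carried out in R; the Python below is only an illustration under stated assumptions, not the authors' code. The function name collapse_probes, the toy probe IDs, and the gene labels are hypothetical, and the mean is taken over observed values only, whereas the study imputed missing values by k-nearest neighbors first.

```python
from statistics import mean

def collapse_probes(expr, probe_to_gene, max_missing=0.20):
    """Pick one probe per gene: the probe with the highest mean expression.

    expr          : dict mapping probe id -> list of values (None = missing)
    probe_to_gene : dict mapping probe id -> gene symbol ("" = no symbol)
    max_missing   : probes with a larger missing fraction are dropped
    """
    best = {}  # gene symbol -> (mean expression, probe id)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if not gene:
            continue  # probe without a valid gene symbol
        missing_fraction = sum(v is None for v in values) / len(values)
        if missing_fraction > max_missing:
            continue  # too many missing values
        # Simplification: average the observed values only (no KNN imputation).
        avg = mean(v for v in values if v is not None)
        if gene not in best or avg > best[gene][0]:
            best[gene] = (avg, probe)
    return {gene: probe for gene, (_, probe) in best.items()}

# Toy input (made-up numbers, hypothetical identifiers).
toy_expr = {
    "p1": [5.0, 6.0, 7.0, 8.0, 9.0],
    "p2": [9.0, 9.5, 10.0, 10.5, 11.0],
    "p3": [4.0, None, None, 5.0, 6.0],   # 40% missing, dropped
    "p4": [1.0, 2.0, 3.0, 4.0, 5.0],     # no gene symbol, dropped
}
toy_annotation = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B", "p4": ""}
selected = collapse_probes(toy_expr, toy_annotation)  # {'GENE_A': 'p2'}
```

Keeping the most highly expressed probe is one common convention; averaging or taking the most variable probe are alternatives that would give a different gene-level matrix.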

Because the datasets were generated on different microarray platforms, all differential expression analyses were conducted within each dataset separately. Cross-cohort evidence was obtained by identifying genes and pathways showing concordant changes in both datasets. Direct batch-effect correction (e.g., ComBat) and matrix merging were not applied, as platform-specific preprocessing and within-cohort analysis avoided potential cross-platform biases.

Endothelial-to-Mesenchymal Transition (EndMT)-related genes (EndMTRGs) were collected from public databases. Specifically, 142 protein-coding EndMTRGs were obtained from the GeneCards database (https://www.genecards.org/)23 using the keyword “Endothelial-to-Mesenchymal Transition”. A complementary search in PubMed with the same keyword identified 10 additional genes. After merging and deduplication, a total of 150 unique EndMTRGs were retained for subsequent analysis (Table S1).

Identification of differentially expressed genes (DEGs)

For the GSE120103 dataset, samples were divided into Endometriosis and Control groups, while for the GSE165004 dataset, samples were categorized into Recurrent Miscarriage (RM) and Control groups. Differential gene expression analysis was performed using the limma R package (version 3.58.1) on both the Endometriosis/Control and RM/Control groups24,25. To identify DEGs in the primary analysis, p-values were adjusted using the Benjamini–Hochberg (BH) method, and genes with adjusted p-value (FDR) < 0.05 were considered statistically significant. As a sensitivity analysis for effect-size stringency, we additionally examined results under absolute log2 fold change (|log2FC|) thresholds of 0.30 and 0.58 (≈1.5-fold), showing that the principal cross-cohort conclusions were robust to these thresholds (Supplementary Table 4). Volcano plots and ranked expression plots of top DEGs were generated with the R package ggplot2 (version 3.4.4) to visually represent the results. To identify EndMT-related DEGs associated with both Endometriosis and Recurrent Miscarriage, we intersected the FDR-controlled DEG lists from both datasets, and a Venn diagram was used for visualization to highlight the core shared EndMT genes. Genes meeting FDR significance in one dataset and nominal significance (p < 0.05) with the same direction of change in the other were considered cross-cohort consistent candidates.
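As an illustration of the two decision rules in this subsection, the sketch below implements the Benjamini–Hochberg adjustment and the cross-cohort consistency criterion in plain Python. It is not the limma workflow used in the study; the function names and the dictionary keys ("fdr", "p", "logfc") are hypothetical choices made for this example.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # stay monotone (a smaller p never gets a larger adjusted value).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def cross_cohort_consistent(a, b, alpha=0.05):
    """Rule from the text: FDR-significant in one cohort, nominally
    significant in the other, with the same direction of change.

    a, b : dicts with keys 'fdr', 'p', 'logfc' for one gene in each cohort.
    """
    same_direction = (a["logfc"] > 0) == (b["logfc"] > 0)
    fdr_then_nominal = (a["fdr"] < alpha and b["p"] < alpha) or (
        b["fdr"] < alpha and a["p"] < alpha
    )
    return same_direction and fdr_then_nominal

# Toy p-values (made up) to show the adjustment.
toy_adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

A gene that is FDR-significant in both cohorts also satisfies the second rule whenever the directions agree, since an adjusted p-value below 0.05 implies a raw p-value below 0.05.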

Functional and pathway enrichment analysis

Gene Ontology (GO) analysis, covering Biological Process (BP), Cellular Component (CC), and Molecular Function (MF), as well as Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis, were performed to explore the functional enrichment of the EndMTRDEGs26,27,28,29,30. The analyses were conducted using the R package clusterProfiler (v4.10.0), with the genome-wide gene list from each dataset used as the background universe31. A significance threshold of p < 0.05 and false discovery rate (FDR, q-value) < 0.05 was applied, with p-value adjustment performed using the Benjamini–Hochberg (BH) method. Enriched terms were visualized with the R package ggplot2.
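Over-representation analysis of this kind rests on a hypergeometric tail probability: given a background universe of N genes, of which K belong to a term, how likely is an overlap of at least k when n genes are drawn? The sketch below computes that probability in plain Python. It is a minimal illustration rather than the clusterProfiler implementation, and the numbers in the example are invented.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    k : genes in the query list that belong to the term
    n : size of the query list
    K : genes in the background universe that belong to the term
    N : size of the background universe
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Made-up example: 4 of 13 query genes fall in a 200-gene term,
# with a background of 20,000 genes.
p_example = hypergeom_enrichment_p(k=4, n=13, K=200, N=20000)
```

The choice of N matters: restricting the universe to genes actually measured on each platform, as the text describes, generally gives more conservative p-values than using the whole genome.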

Pathway-level gene set enrichment analysis (GSEA)

Gene Set Enrichment Analysis (GSEA) was performed on the GSE120103 and GSE165004 datasets using the R package clusterProfiler (v4.10.0)32. Gene sets were obtained from the Molecular Signatures Database (MSigDB, v7.5.1), specifically from the c2 curated gene set collection. Genes were ranked by log₂ fold change (logFC) values, and the genome-wide gene list from each dataset was used as the background universe. Parameters for GSEA included a random seed of 2020, a minimum gene set size of 10, and a maximum of 500 genes per gene set. Statistical significance was determined at p < 0.05 and FDR (q-value) < 0.05, with p-values adjusted using the Benjamini–Hochberg procedure. Representative enrichment plots and network diagrams were generated to illustrate gene–function associations.
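The statistic at the core of GSEA is a weighted running sum over the ranked gene list. The following sketch computes that enrichment score for a single gene set in plain Python. It is an illustration only: the permutation-based significance estimate, normalization, and multiple-testing correction are omitted, and the function name and toy ranking are hypothetical.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Weighted Kolmogorov-Smirnov-like running-sum enrichment score.

    ranked   : list of (gene, logFC) pairs sorted by decreasing logFC
    gene_set : set of gene symbols
    weight   : exponent applied to |logFC| for genes in the set
    """
    hits = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(hits)
    hit_norm = sum(abs(s) ** weight for (_, s), h in zip(ranked, hits) if h)
    if n_miss == 0 or hit_norm == 0:
        raise ValueError("gene set must partially overlap the ranked list")
    running, es = 0.0, 0.0
    for (_, score), is_hit in zip(ranked, hits):
        # Step up at set members (weighted by |logFC|), step down otherwise.
        running += (abs(score) ** weight) / hit_norm if is_hit else -1.0 / n_miss
        if abs(running) > abs(es):
            es = running  # keep the largest deviation from zero
    return es

# Toy ranking (made-up values, placeholder gene names).
toy_ranked = [("A", 2.0), ("B", 1.0), ("C", 0.5), ("D", -0.5), ("E", -1.0)]
es_top = enrichment_score(toy_ranked, {"A", "B"})      # set at the top of the list
es_bottom = enrichment_score(toy_ranked, {"D", "E"})   # set at the bottom of the list
```

A set concentrated at the top of the ranking yields a positive score and a set concentrated at the bottom a negative one, which is why the sign of the score is read as the direction of regulation.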

Protein–protein interaction (PPI) network

Protein–protein interaction (PPI) networks were constructed to analyze the interactions among the identified EndMTRDEGs. PPI data were obtained from the STRING database (version 11.0), with the species restricted to Homo sapiens and a medium confidence score threshold (≥ 0.4)33. Interactions included experimental evidence, database-curated associations, predicted associations from gene neighborhood, gene fusions, co-occurrence, and functional associations inferred from co-expression and homology, as provided by STRING. The PPI network was visualized using Cytoscape (version 3.9.1)34. Key (hub) genes were identified using the cytoHubba plugin in Cytoscape, applying five ranking algorithms: Degree, maximal clique centrality (MCC), edge percolated component (EPC), density of maximum neighborhood component (DMNC), and maximum neighborhood component (MNC). Among these, the top 10 genes were designated as core EndMTRDEGs35. The GeneMANIA website was used to predict functionally similar genes, and an interaction network was constructed for these key genes36.

Regulatory network construction

RNA-binding proteins (RBPs) play a crucial role in gene regulation, including RNA synthesis, splicing, modification, transport, and translation37. The StarBase v3.0 database (https://starbase.sysu.edu.cn/) was used to predict RBPs interacting with the EndMTRDEGs38. Only interactions supported by high-confidence CLIP-seq experimental evidence were retained, and low-confidence or computationally predicted interactions without experimental support were excluded. The mRNA-RBP regulatory network was visualized using Cytoscape software. Additionally, gene expression is regulated by transcription factors (TFs) through the control of the transcriptional process. The ChIPBase v2.0 database (http://rna.sysu.edu.cn/chipbase/) was used to retrieve TFs associated with the EndMTRDEGs. Interactions were filtered to include only those with experimental evidence from ChIP-seq datasets, and the interaction score threshold was set to ≥ 0.4 to ensure reliability. The TF–mRNA regulatory network was constructed and visualized using Cytoscape39.

Differential expression and core gene identification

To determine key genes linked to Endometriosis and Recurrent Miscarriage, the Mann–Whitney U test (Wilcoxon rank sum test) was employed to examine the expression discrepancies between the Endometriosis/Control and RM/Control groups within the GSE120103 and GSE165004 datasets. Following differential expression analysis, Receiver Operating Characteristic (ROC) curve analysis was performed to assess the diagnostic potential of the core genes40.
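The two analyses in this subsection are closely related: the area under the ROC curve for a single marker equals the Mann–Whitney U statistic divided by the number of case-control pairs. The sketch below computes the AUC this way in plain Python. It is an illustration with invented values, not the software used in the study.

```python
def auc_from_pairs(cases, controls):
    """AUC as the probability that a random case value exceeds a random
    control value, counting ties as one half (equals U / (n1 * n2))."""
    wins = 0.0
    for x in cases:
        for y in controls:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(cases) * len(controls))

# Made-up expression values for one gene.
toy_cases = [3.0, 4.0, 5.0]
toy_controls = [1.0, 2.0, 3.0]
toy_auc = auc_from_pairs(toy_cases, toy_controls)  # 8.5 of 9 pairs favor cases
```

An AUC below 0.5 indicates that the gene is lower in cases than in controls; swapping the groups, or reporting 1 minus the AUC, gives the equivalent discriminative ability.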

Immune cell-related gene set enrichment analysis (ssGSEA)

Single-sample Gene Set Enrichment Analysis (ssGSEA) was applied to evaluate the relative enrichment of immune cell–related gene sets in the samples from Endometriosis and Recurrent Miscarriage41. The immune cell gene sets were obtained from the MSigDB C7 collection and previously published signatures (Bindea et al., Immunity, 2013), covering multiple immune cell populations including activated CD8⁺ T cells, dendritic cells, γδ T cells, natural killer (NK) cells, and regulatory T cells, among others. The ssGSEA method calculates enrichment scores, which represent the relative expression of each predefined gene set within a sample, rather than the absolute proportion of immune cells. Enrichment scores were computed using the GSVA package from Bioconductor and visualized through ggplot2 (version 3.4.4). Differences in enrichment scores between groups were compared, and Spearman correlation analysis was used to examine relationships between immune cell types. R packages including pheatmap (version 1.0.12) and ggplot2 were employed to generate heatmaps and correlation bubble charts.
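The Spearman correlation used to relate enrichment scores is a Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below shows that calculation in plain Python. It is an illustrative stand-in for the R routines used in the study, with hypothetical function names and made-up inputs, and it does not handle constant vectors (zero variance).

```python
from math import sqrt
from statistics import mean

def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)  # undefined if either vector is constant

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up enrichment scores for two immune gene sets across four samples.
rho = spearman([0.10, 0.25, 0.40, 0.90], [0.2, 0.3, 0.5, 0.6])
```

Because only ranks enter the calculation, the coefficient is unaffected by monotone transformations of the enrichment scores, which suits ssGSEA scores that have no absolute scale.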

SVM-RFE and diagnostic model construction

To further identify core genes associated with Endometriosis and Recurrent Miscarriage, feature selection was conducted using the support vector machine recursive feature elimination (SVM-RFE) algorithm implemented in the R caret package with a radial basis function (RBF) kernel. The model was evaluated using tenfold cross-validation (method = “cv” in caret), and the optimal number of features was determined automatically by the function based on minimizing the root mean square error (RMSE) from cross-validation results, and visually confirmed at the point of minimal RMSE. The SVM-RFE algorithm, commonly used for pattern recognition and classification, was applied to identify the most relevant genes. Nomograms were created using the R package ‘rms’ based on the expression levels of these pivotal genes. To assess model performance and risk of overfitting, 1000 bootstrap resamples were employed for calibration analysis, and calibration curves were generated to evaluate agreement between predicted and observed probabilities. Receiver operating characteristic (ROC) curves were constructed, and the area under the curve (AUC) with 95% confidence intervals (CIs) was calculated for both individual genes and the composite nomogram scores. The Hosmer–Lemeshow (H–L) test was used to assess model fit.

Statistical analysis

Data processing and statistical analysis were performed using R software (version 4.2.2). Normality of continuous variables was assessed using the Shapiro–Wilk test. For variables meeting the normality assumption, independent Student’s t-tests were applied to compare continuous variables between the two groups, and one-way ANOVA was used for comparisons among three or more groups. For data that were not normally distributed, the Mann–Whitney U test (Wilcoxon rank-sum test) was applied, and the Kruskal–Wallis test was used for comparing three or more groups.

All p-values were adjusted for multiple comparisons using the Benjamini–Hochberg (BH) method unless otherwise specified. Spearman’s correlation analysis was utilized to determine correlation coefficients between different molecules. For key comparisons, effect sizes (Cohen’s d for t-tests, rank-biserial correlation for Mann–Whitney tests) and 95% confidence intervals (CIs) were calculated to enhance statistical interpretation. All p-values were two-tailed, and statistical significance was set at p < 0.05.
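For reference, the two effect sizes named above can be computed as in the following sketch: Cohen's d with a pooled standard deviation, and the rank-biserial correlation as the difference between the proportions of favorable and unfavorable pairs. This is a plain-Python illustration with invented data; the confidence intervals reported in the study are not reproduced here.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

def rank_biserial(a, b):
    """(pairs with a > b minus pairs with a < b) / all pairs.

    Equivalent to 2 * AUC - 1 when ties are counted as one half in the AUC.
    """
    greater = sum(x > y for x in a for y in b)
    less = sum(x < y for x in a for y in b)
    return (greater - less) / (len(a) * len(b))

# Made-up values for two groups.
d_example = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])       # 0.5
r_example = rank_biserial([3.0, 4.0, 5.0], [1.0, 2.0, 3.0])  # 8/9
```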

Results

Dataset correction

This study primarily used bioinformatics methods to explore the biological characteristics of endometriosis and recurrent miscarriage (RM); the overall analysis workflow is shown in Fig. 1. In the GSE120103 dataset, samples were divided into a normal (Control) group and an Endometriosis group; in the GSE165004 dataset, samples were divided into a normal (Control) group and a recurrent miscarriage (RM) group. Both GSE120103 (Fig. S1A–B) and GSE165004 (Fig. S1C–D) underwent normalization, probe annotation, and data cleaning, and boxplots of the data distribution were drawn before and after normalization. After normalization, the expression distributions across samples within each dataset became more consistent.

Fig. 1

Technology roadmap. This figure illustrates the overall study design, detailing the datasets used, the analytical workflow, and the key bioinformatics methods applied.

Endothelial-mesenchymal transition-related differentially expressed genes in endometriosis and recurrent miscarriage

Significant transcriptional alterations were observed in both the GSE120103 and GSE165004 datasets. Differential expression analysis was conducted using the limma R package. To identify Endothelial–Mesenchymal Transition–Related Differentially Expressed Genes (EndMTRDEGs), the overlapping genes between the DEGs from both datasets and a predefined set of EndMT-related genes were determined. Under the primary criterion of FDR < 0.05, 24 EndMTRDEGs were identified in GSE120103 and 15 in GSE165004. Thirteen genes were present in the intersection: SOX4, TUBB3, ITGB1, NR4A1, PARP1, VIM, SOX7, KITLG, SMAD1, CDH11, MAPK1, FGF2, and ROBO4 (Fig. 2E).

Fig. 2

Differential gene expression analysis. (A) Volcano plot showing differentially expressed genes (DEGs) between Endometriosis and Control groups in the GSE120103 dataset. (B) Ranked expression plot of DEGs in GSE120103. (C) Volcano plot showing DEGs between Recurrent Miscarriage (RM) and Control groups in the GSE165004 dataset. (D) Ranked expression plot of DEGs in GSE165004. (E) Venn diagram depicting the overlap between DEGs and endothelial-to-mesenchymal transition-related genes (EndMTRGs) in both datasets.

Several of these genes, including SOX4, VIM, ITGB1, and FGF2, showed consistent direction of change across datasets, with VIM and ITGB1 subsequently emerging as top-ranked predictive features in the machine learning analysis. Sensitivity analyses applying |log2FC| thresholds of 0.30 and 0.58 confirmed that the overall cross-cohort overlap pattern was robust to moderate but not strict effect-size cut-offs.

Gene ontology (GO) and pathway (KEGG) enrichment analysis

GO and KEGG enrichment analyses were conducted using the R package clusterProfiler (v4.10.0) to investigate the functional roles of the 13 EndMTRDEGs, with the genome-wide gene list from each dataset serving as the background universe. The GO analysis included BP, CC, and MF categories, while the KEGG analysis focused on canonical pathways. Significance was defined as p < 0.05 and FDR (BH-adjusted) < 0.05. The EndMTRDEGs were enriched in angiogenesis- and tissue-remodeling-related processes, involving classical signaling pathways such as PI3K–Akt and MAPK (Fig. 3A; Table 2). A network diagram was constructed to visualize the associations between EndMTRDEGs and enriched terms, with node size reflecting the number of genes and edges representing gene–term relationships (Fig. 3B–E).

Fig. 3

GO and KEGG enrichment analysis of EndMTRDEGs. (A) Bar plot showing Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis. (B) Network diagram of Biological Process (BP) terms. (C) Network diagram of Cellular Component (CC) terms. (D) Network diagram of Molecular Function (MF) terms. (E) Network diagram of KEGG pathways. Node sizes indicate the number of associated genes.

Table 2 Results of GO and KEGG enrichment analysis for EndMTRDEGs.

Gene set enrichment analysis (GSEA)

GSEA was performed on the GSE120103 and GSE165004 datasets using gene sets from the MSigDB c2 curated collection, with genes ranked by logFC values and the genome-wide gene list from each dataset used as the background universe. Parameters were set with seed = 2020, minGSSize = 10, and maxGSSize = 500. Significance was determined at p < 0.05 and FDR (BH-adjusted) < 0.05. In both datasets, GSEA revealed significant enrichment of inflammatory, immune-related, and endothelial–mesenchymal transition (EndMT)–related processes, consistent with the biological context of endometriosis and recurrent miscarriage. Representative enrichment plots are shown in Fig. 4A, F, and the enriched gene sets are listed in Tables 3 and 4. Network diagrams illustrate the associations between EndMTRDEGs and enriched functional pathways (Fig. 4B–E,G–J).

Fig. 4

Gene Set Enrichment Analysis (GSEA) results. (A) GSEA enrichment score plot for GSE120103 dataset. (B–E) GSEA enrichment plots for Jain NF-κB signaling, JAK-STAT pathway, PI3K/AKT pathway, and Hedgehog signaling pathway in GSE120103. (F) GSEA enrichment score plot for GSE165004 dataset. (G–J) GSEA enrichment plots for Hinata NF-κB fibroblast targets, WP PI3K/AKT pathway, Hedgehog pathway, and TP53 senescence targets in GSE165004.

Table 3 Results of GSEA for datasets GSE120103.
Table 4 Results of GSEA for datasets GSE165004.

Protein–protein interaction network (PPI network)

A protein–protein interaction (PPI) network of the 13 EndMT-related differentially expressed genes was generated using the STRING database (v11.0) with the species set to Homo sapiens and a medium confidence threshold (interaction score ≥ 0.4). Interactions included experimental validation, curated database information, and computational predictions from co-expression and homology. The network was visualized using Cytoscape (Fig. 5A).

Fig. 5

Protein–protein interaction (PPI) network of EndMTRDEGs and key gene identification. (A) PPI network of 13 EndMTRDEGs constructed using the STRING database. (B) Venn diagram showing overlap of top hub genes identified by cytoHubba algorithms. (C) Refined PPI network for eight selected key genes. (D) GeneMANIA-predicted network of functionally similar genes.

Hub genes were identified using the cytoHubba plugin with five ranking algorithms (MCC, MNC, EPC, Degree, DMNC). Genes ranked in the top 10 by each algorithm were collected, and those appearing in at least three ranking lists were defined as hub genes. These hub genes were located at critical network nodes and may represent potential biomarkers or therapeutic targets related to EndMT processes (Fig. 5B–C).
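The consensus rule described here (a gene counts as a hub if it appears in the top list of at least three of the five algorithms) amounts to a simple vote count. The sketch below illustrates it in plain Python with placeholder gene names and invented rankings; it does not reproduce the cytoHubba scores themselves, and the function name is hypothetical.

```python
from collections import Counter

def consensus_hubs(top_lists, min_votes=3):
    """Return genes present in at least `min_votes` of the ranking lists.

    top_lists : dict mapping algorithm name -> list of top-ranked genes
    """
    votes = Counter(
        gene for genes in top_lists.values() for gene in set(genes)
    )
    return sorted(gene for gene, count in votes.items() if count >= min_votes)

# Placeholder rankings (not the study's results).
toy_lists = {
    "MCC":    ["G1", "G2", "G3"],
    "MNC":    ["G1", "G2", "G4"],
    "EPC":    ["G1", "G3", "G5"],
    "Degree": ["G2", "G3", "G6"],
    "DMNC":   ["G4", "G5", "G6"],
}
hubs = consensus_hubs(toy_lists)  # ['G1', 'G2', 'G3']
```

Requiring agreement across several centrality measures reduces dependence on any single definition of network importance, at the cost of possibly discarding genes that only one measure ranks highly.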

We further used GeneMANIA to predict genes functionally similar to the identified hub genes and to construct an extended interaction network, highlighting shared domains, co-expression patterns, and other functional associations (Fig. 5D).

Construction of regulatory networks

RNA-binding proteins (RBPs) associated with the endothelial-mesenchymal transition-related differentially expressed genes (EndMTRDEGs) were identified using the StarBase database. Cytoscape software was used to construct and visualize the mRNA-RBP regulatory network (Fig. 6A). This network includes 8 EndMTRDEGs and 117 RBPs. Further details are presented in Table S2.

Fig. 6

Regulatory networks of key genes. (A) mRNA–RNA-binding protein (RBP) interaction network based on StarBase predictions. (B) mRNA–transcription factor (TF) regulatory network based on ChIPBase data.

Transcription factors (TFs) associated with the EndMTRDEGs were extracted from the ChIPBase database. The resulting mRNA-TF regulatory network was subsequently built and visualized using Cytoscape software (Fig. 6B). This network includes 8 EndMTRDEGs and 81 TFs, with detailed information available in Table S3.

Expression patterns and group separation analysis of hub genes

We analyzed the expression levels of eight hub genes (FGF2, ITGB1, VIM, NR4A1, MAPK1, SMAD1, TUBB3, CDH11) in the GSE120103 dataset. Violin plots (Fig. 7A) revealed significant differential expression between the Endometriosis and Control groups for several of these genes. To evaluate diagnostic performance, receiver operating characteristic (ROC) curve analysis was conducted (Fig. 7B–C). These hub genes exhibited moderate to high diagnostic potential, with FGF2 and ITGB1 showing particularly strong performance.

Fig. 7

Validation of key gene expression levels and diagnostic evaluation. (A, D) Violin plots comparing expression levels of key genes between disease and control groups in GSE120103 and GSE165004. (B–C, E–F) Receiver operating characteristic (ROC) curves demonstrating diagnostic value, with area under the curve (AUC) values indicated.

In the GSE165004 dataset, the expression patterns of these genes were also compared between Recurrent Miscarriage (RM) and Control groups (Fig. 7D), and significant differences were observed for most of them. ROC analysis (Fig. 7E–F) further confirmed that FGF2 maintained high diagnostic accuracy, while other hub genes demonstrated moderate predictive performance.

These results suggest that the identified hub genes may serve as potential diagnostic markers for both Endometriosis and RM. Across both datasets, group differences corresponded to moderate-to-large effect sizes (Cohen’s d range: 0.65–1.25), with 95% confidence intervals excluding zero for all significant comparisons, supporting the robustness and clinical relevance of these findings.

Immune cell–related gene set enrichment analysis (ssGSEA)

The single-sample Gene Set Enrichment Analysis (ssGSEA) algorithm was used to calculate enrichment scores for 28 immune cell–related gene sets in both the GSE120103 and GSE165004 datasets. Group differences in enrichment scores, as well as correlations among immune cell types and between key gene expression and immune cell–related enrichment scores, were systematically assessed.

The analysis revealed altered enrichment score patterns across both datasets, which may indicate potential involvement of immune dysregulation in the pathogenesis of Endometriosis and Recurrent Miscarriage (RM). Notably, several immune cell–related gene sets demonstrated distinct enrichment profiles and correlation patterns with hub genes, indicating a possible link between immune microenvironment remodeling and disease-related transcriptional changes.

These findings were visualized through heatmaps and correlation plots, highlighting distinct immune landscapes and their associations with key gene expression (Fig. 8A–F).

Fig. 8

Immune cell–related gene set enrichment analysis (ssGSEA). (A) Comparison of immune cell enrichment scores between Endometriosis and Control groups in GSE120103. (B) Heatmap showing correlation among immune cell populations in GSE120103. (C) Correlation bubble plot between immune cell enrichment scores and hub gene expression in GSE120103. (D) Comparison of immune cell enrichment scores between Recurrent Miscarriage and Control groups in GSE165004. (E) Heatmap showing correlation among immune cell populations in GSE165004. (F) Correlation bubble plot between immune cell enrichment scores and hub gene expression in GSE165004.

Identification of hub genes and construction and evaluation of predictive models

To identify hub genes associated with endometriosis and recurrent miscarriage, SVM-RFE (Support Vector Machine-Recursive Feature Elimination) analysis was conducted on datasets GSE120103 and GSE165004 to screen for key gene signatures using the R caret package with an RBF kernel and tenfold cross-validation, with optimal features selected based on minimal RMSE from cross-validation results. Note that the set of differentially expressed genes identified from the initial analysis is not identical to the final set of key genes selected by machine learning. This discrepancy arises from differences in sample distribution, feature selection criteria, and algorithm-specific characteristics. This analysis identified 4 gene signatures from each dataset: ITGB1, VIM, TUBB3, and NR4A1 from GSE120103, and FGF2, SMAD1, VIM, and ITGB1 from GSE165004. Subsequently, the intersection of the gene signatures from the two datasets was analyzed, revealing ITGB1 and VIM as hub genes (Fig. S2).

These genes are not only associated with both endometriosis and recurrent miscarriage but also serve as critical link genes between the two conditions, underscoring their significant biological importance. Subsequently, we constructed predictive nomograms based on the key genes (Fig. 9A, D) and evaluated their calibration and discrimination. Calibration curves, generated from 1,000 bootstrap resamples, demonstrated excellent agreement between predicted and observed probabilities, suggesting minimal overfitting risk. ROC curves (Fig. 9B–C, E–F) showed high discriminative ability, with the AUC for the nomogram predicting endometriosis being 0.925 (95% CI: 0.89–1.00) and for recurrent miscarriage being 0.951 (95% CI: 0.85–0.99). Both nomograms exhibited higher AUC values compared to those of ITGB1 or VIM alone, indicating promising predictive performance and potential clinical utility. The Hosmer–Lemeshow test yielded p-values greater than 0.05 for both models, indicating good model fit. The expression differences and ROC analyses for the key genes in GSE165004 are shown in Fig. 9.

Fig. 9

Nomogram models for diagnosis of EMS and RM. (A, D) Nomogram models constructed using key genes to predict risk scores for EMS and RM. (B, E) Calibration plots assessing model prediction accuracy. (C, F) ROC curves evaluating the diagnostic performance of nomogram models.

Discussion

Endometriosis (EMS) and recurrent miscarriage (RM) are two prevalent gynecological disorders that significantly impact women’s reproductive health, often resulting in infertility or recurrent pregnancy loss. EMS is characterized by the ectopic growth of endometrial tissue outside the uterine cavity, accompanied by chronic inflammation, cyclical bleeding, and fibrosis42. RM typically occurs during early pregnancy and is associated with genetic, hormonal, immunological, and anatomical abnormalities, ultimately impairing women’s fertility43. Importantly, both conditions involve pathological processes such as aberrant angiogenesis and immune dysregulation, mechanisms closely linked to endothelial-mesenchymal transition (EndMT).

To provide a comprehensive overview of shared molecular mechanisms in EMS and RM, we integrated differential expression analysis, functional enrichment, network construction, and immune profiling. We identified a core set of EndMT-related DEGs involved in angiogenesis, immune modulation, and cell migration, with functional enrichment analyses highlighting pathways such as PI3K-Akt, MAPK, and Rap1 signaling. PPI network analysis revealed hub genes such as FGF2, ITGB1, and VIM, which showed diagnostic potential in our bioinformatics analysis.

Among these, FGF2 has been shown to promote EndMT through the activation of the TGF-β/SMAD signaling pathway, thereby contributing to endothelial plasticity and fibrotic transformation in multiple disease contexts44,45,46. VIM (vimentin), a canonical mesenchymal cytoskeletal protein, serves as a robust marker of EndMT and is actively involved in cytoskeletal remodeling and cellular migration during transition processes47,48. ITGB1 (Integrin β1) mediates endothelial adhesion to the extracellular matrix and activates signaling cascades such as FAK and AKT, facilitating EndMT and matrix remodeling, especially in fibrotic and inflammatory environments49,50. Together, these genes may represent core regulatory nodes that bridge vascular remodeling and immune dysregulation, reinforcing their pathophysiological relevance in both endometriosis and recurrent miscarriage.

These findings, together with distinct immune-related enrichment score profiles, are consistent with the hypothesis that EndMT-associated dysregulation of vascular and immune pathways may act as a convergent mechanism contributing to the pathogenesis of both diseases. This integrated perspective provides a foundation for future mechanistic studies and potential therapeutic targeting.

Immune dysregulation emerged as a convergent mechanism linking EMS and RM51,52. Significant alterations were observed in enrichment scores of key immune cell–related gene sets, including γδ T cells, monocytes, natural killer (NK) cells, regulatory T cells (Tregs), follicular helper T cells (Tfh), B cell subsets, dendritic cells, eosinophils, mast cells, and Th2 cells. A positive correlation between γδ T cells and monocytes suggests coordinated pro-inflammatory activity, whereas an inverse relationship between MAPK1 expression and mast cell–related enrichment scores may indicate feedback regulation. Collectively, these immune alterations suggest a systemic imbalance characterized by enhanced humoral responses, impaired maternal–fetal tolerance, and chronic inflammation53,54,55,56,57,58. Such immune disturbances may exacerbate ectopic lesion formation in EMS and disrupt implantation in RM, contributing to shared pathogenic pathways.
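The correlations described above, for example between MAPK1 expression and mast cell-related enrichment scores, can be illustrated with a rank-based coefficient. The sketch below computes a Spearman correlation in plain Python. The use of Spearman's method, the function names, and the toy values are assumptions for illustration and do not reproduce the study's data or pipeline.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical per-sample values: gene expression vs. enrichment score.
    gene_expression = [5.1, 6.3, 4.8, 7.2, 6.9, 5.5]
    enrichment_score = [0.42, 0.31, 0.47, 0.18, 0.25, 0.36]
    rho = spearman(gene_expression, enrichment_score)
    print(f"Spearman rho = {rho:.3f}")
```

A negative coefficient from such an analysis would correspond to the kind of inverse relationship reported here, although correlation alone does not establish feedback regulation.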

While this study provides novel insights into the molecular and immunological landscapes of EMS and RM, certain limitations should be acknowledged. The relatively small sample sizes and the retrospective, cross-sectional design may limit the generalizability of findings. Although we conducted internal validation through systematic bioinformatics analyses, the lack of external dataset validation and experimental evidence remains a limitation. The functional roles of the identified hub genes and pathways in the pathogenesis of EMS and RM require further investigation. Future studies should incorporate independent external cohorts, in vitro and in vivo experiments, and advanced techniques such as single-cell sequencing and spatial transcriptomics to enhance the reliability and biological interpretability of the findings.

In conclusion, this study identified a set of shared EndMT-related gene signatures through integrative transcriptomic analysis, highlighting common molecular mechanisms underlying endometriosis and recurrent miscarriage. These findings provide a foundational framework for understanding the overlapping pathogenic processes of angiogenesis, immune dysregulation, and EndMT, offering candidate targets for future translational research and precision therapies.