Introduction

Thymoma is a rare neoplasm of the anterior mediastinum that typically arises from the epithelial cells of the thymus. Although it accounts for less than 1% of all malignancies, thymoma is the most common epithelial tumor in the anterior mediastinum1,2,3. Thymomas exhibit diverse histological types, with significant variations in clinical manifestations and biological behaviors among different subtypes4. Notably, thymomas are frequently associated with autoimmune diseases, particularly myasthenia gravis (MG), making the molecular mechanisms of thymoma tumorigenesis a focal point of research5. In terms of imaging diagnosis, computed tomography (CT) is the primary tool for the identification and staging of thymomas. In clinical practice, CT scans not only effectively diagnose thymomas but also help predict their histological classification based on imaging features6,7. Additionally, in complex cases, magnetic resonance imaging (MRI) and 18F-fluorodeoxyglucose positron emission tomography (PET) are utilized for further evaluation and differential diagnosis of anterior mediastinal masses8. The treatment strategy for thymoma primarily relies on surgical resection, especially in the early stages. For advanced thymomas, a combined approach of surgery and radiotherapy has been shown to significantly improve patient survival rates9,10,11. The prognosis of thymoma is closely associated with multiple factors, among which the Masaoka staging system and WHO histological classification are the most critical prognostic indicators12. Studies have shown that thymomas of cortical origin are associated not only with an increased risk of other malignancies but also with significantly poorer survival rates, making cortical origin an important prognostic factor12. Therefore, in-depth investigation into the molecular mechanisms underlying thymoma, coupled with the optimization of diagnostic approaches and therapeutic strategies, is important for improving patient outcomes.

The ability of the immune system to distinguish between self and foreign substances is a core mechanism for maintaining organismal homeostasis, and this function plays a critical role in the development of cancer immunotherapy13. Studies have shown that immune cells can spontaneously control tumor growth at early stages and even induce tumor rejection14. However, alterations in the tumor immune microenvironment may lead to defects in central tolerance and immune regulation, which are particularly important in the pathogenesis of thymoma15. A key feature of thymoma-related immune deficiency is the abnormal changes in T cell populations, particularly the increase in naive T cells and the decrease in CD247 expression (the latter being crucial for T cell receptor (TCR) signaling). These changes may result in impaired immune responses, increased susceptibility to infections, and significant alterations in T cell subsets and cytokine responses, indicating disrupted immune regulation16,17. Furthermore, observed changes in the immune system in thymic malignancies further underscore the complex interplay between the thymus and immune function18.

In recent years, immunotherapy has shown remarkable efficacy in the treatment of thymic carcinoma. For example, the application of PD-1 inhibitors such as pembrolizumab and PD-L1 inhibitors has significantly improved the immune therapeutic outcomes for thymic carcinoma patients19,20. The disruption of the thymic microenvironment in thymoma patients can lead to immune system dysregulation and may even trigger graft-versus-host disease-like syndromes21. Therefore, investigating the interaction between the immune system and thymoma not only helps to elucidate the immunological characteristics of thymoma but may also provide a new theoretical basis for developing immune-based therapeutic strategies targeting thymoma.

In summary, thymoma, as a tumor closely associated with the immune system, exhibits unique immunological characteristics that make it an important model for tumor immunology research. In this study, we integrated multi-omics data from the GEO and TCGA databases and employed differential analysis combined with machine learning methods to identify key immune-related biomarkers in thymoma. Additionally, we conducted a systematic analysis of the immune infiltration characteristics in thymoma, along with the enriched pathways and regulatory networks associated with these biomarkers, aiming to elucidate the immunological mechanisms underlying thymoma and establish a scientific foundation for the development of personalized treatment strategies. These findings not only enhance our understanding of the immune microenvironment in thymoma but also provide valuable theoretical support for the identification and development of novel immunotherapeutic targets.

Materials and methods

Data source

Transcriptomic sequencing data for thymoma (TCGA-THYM) were obtained from the UCSC Xena platform (https://xenabrowser.net/datapages/) and included a total of 121 samples: 2 control samples and 119 thymoma samples. The control group was supplemented with 444 blood samples from the GTEx database (https://gtexportal.org/). The GSE177522 dataset, comprising 19 samples (6 normal controls, 11 thymic carcinoids, and 2 thymomas), was obtained from the GEO database (https://www.ncbi.nlm.nih.gov/geo/) as the validation cohort. Furthermore, a total of 1039 immune-related genes (IRGs) were identified based on previous studies22.

Data processing and differential analysis

The SVA package (v3.52.0)23 was used to merge the TCGA-THYM and GTEx datasets and to remove batch effects. The expression matrix was formatted as FPKM-normalized values and subsequently transformed using log2(x + 1). This process resulted in a training set comprising 446 control samples and 118 tumor samples. The limma package (v3.60.4)24 was then used to compare gene expression levels between the thymoma and control groups and to identify differentially expressed genes (DEGs), with |log2FC| > 1 and p.adj < 0.05 as the selection criteria. Results were visualized as volcano plots and heatmaps generated using ggplot2 (v3.5.1)25 and pheatmap (v2.20.0)26, respectively. After that, the DEGs were intersected with the IRGs to identify differentially expressed IRGs (DEIRGs) associated with thymoma, and the overlap was visualized using the ggvenn package (v0.1.9) in R27.
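
The thresholding step described above can be illustrated with a minimal Python sketch. It is not the limma workflow used in the study: it only shows the log2(x + 1) transform and the |log2FC| > 1, p.adj < 0.05 filter, and it assumes Benjamini-Hochberg adjustment (the text does not name the adjustment method). The gene names and values are hypothetical.

```python
import math

def log2p1(x):
    """log2(x + 1) transform applied to a single expression value."""
    return math.log2(x + 1.0)

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def select_degs(genes, log2fc, pvals, fc_cut=1.0, padj_cut=0.05):
    """Return {gene: 'up' | 'down'} for genes passing both thresholds."""
    padj = bh_adjust(pvals)
    calls = {}
    for gene, fc, q in zip(genes, log2fc, padj):
        if abs(fc) > fc_cut and q < padj_cut:
            calls[gene] = "up" if fc > 0 else "down"
    return calls

# Hypothetical toy input, not taken from the study.
genes = ["G1", "G2", "G3", "G4"]
log2fc = [2.3, -1.8, 0.4, -1.2]
pvals = [0.001, 0.004, 0.0005, 0.30]
degs = select_degs(genes, log2fc, pvals)
print(degs)
```

In this toy input, G3 fails the fold-change cutoff and G4 fails the adjusted p-value cutoff, so only G1 and G2 are retained.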

Enrichment analysis, PPI network construction, chromosomal localization, and correlation analysis

A protein–protein interaction (PPI) network was constructed using the STRING database (https://cn.string-db.org/) with a maximum interaction limit of 0 and a confidence score threshold of ≥ 0.7. The resulting network was then visualized and analyzed using Cytoscape software (v3.10.2)28. Module analysis of the PPI network was conducted using Molecular Complex Detection (MCODE) to identify significant subnetworks, thereby highlighting key genes within the network. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG)29,30,31 enrichment analyses were performed using the clusterProfiler package in R (v4.7.1.001)32. The top 5 enriched terms for each category, ranked by p-value, were visualized using the ggplot2 package (v3.5.1)25. Chromosomal localization of the identified genes was determined using the RCircos package (v1.2.0)33 in R. Additionally, Spearman correlation analysis was performed using the cor function in R, with the correlation coefficients visualized via ggplot2 (v3.4.4)34.
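
For readers unfamiliar with the Spearman coefficient used here, the following is a minimal Python sketch of the calculation: values are converted to ranks (ties receive their average rank) and the Pearson correlation of the ranks is taken. This is an illustration only; the study used the cor function in R, and the helper names below are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical expression values for two genes across five samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [1.0, 4.0, 9.0, 16.0, 25.0]
rho = spearman(gene_a, gene_b)
print(rho)
```

Because the coefficient depends only on ranks, any monotonically increasing relationship between the two genes gives a value of 1, regardless of whether it is linear.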

Machine learning

Based on the training set, potential biomarkers were identified using three distinct feature selection methods: SVM-RFE, LASSO regression, and the Boruta algorithm. First, SVM-RFE analysis was performed using the caret package35. This method recursively trained SVM models and eliminated the least important features, which allowed for the selection of genes with the highest discriminatory power and predictive performance. Next, LASSO analysis was conducted using the glmnet package (v4.1.7)36. Tenfold cross-validation was employed to ensure the robustness of the results, and the LASSO model that minimized the error was chosen as the optimal model. Finally, the Boruta algorithm, which is based on the Random Forest methodology and is designed to identify features that have a significant impact on the target variable, was used for feature selection. The Boruta analysis was executed using the Boruta package, with 100 iterations performed to enhance the reliability of the selected features. The intersection of the genes obtained from these three algorithms was taken as the final list of potential biomarkers.

Validation of potential biomarkers

Using the training and validation datasets derived from thymoma, ROC analysis was conducted with the pROC package (v1.18.0)36 in R. This analysis enabled the calculation of AUC values for the selected biomarkers. Subsequently, gene expression data associated with these biomarkers were visualized through box plots to confirm their differential expression.
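
As an illustration of what the AUC measures, the sketch below computes it directly as the probability that a randomly chosen tumor sample receives a higher score than a randomly chosen control, with ties counted as one half. It is not the pROC implementation used in the study. Because the candidate biomarkers are lower in tumors, the sketch assumes that the score is the negated expression value; all numbers are hypothetical.

```python
def auc(scores_pos, scores_neg):
    """P(random positive scores higher than random negative); ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def auc_low_expression(tumor_expr, control_expr):
    """AUC for a marker that is expected to be LOWER in tumors.

    Expression is negated so that lower expression means a higher score.
    """
    return auc([-v for v in tumor_expr], [-v for v in control_expr])

# Hypothetical expression values, not taken from the study.
tumor = [1.0, 1.5, 2.0]
control = [2.0, 3.0, 4.0]
marker_auc = auc_low_expression(tumor, control)
print(round(marker_auc, 4))
```

Of the nine tumor-control pairs in this toy example, eight are ordered correctly and one is tied, giving 8.5/9.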

Immune infiltration analysis

The CIBERSORT algorithm implemented in the IOBR R package (v0.99.9)37 was employed to assess the extent of immune cell infiltration in each sample from the training dataset. Subsequently, the levels of immune cell infiltration were visualized using box plots generated with the ggplot2 package (v3.4.4)38 in R. Differences between the thymoma and control groups were evaluated using the Wilcoxon test (P < 0.05). Furthermore, correlation coefficients (cor) and corresponding p-values between the differentially infiltrated immune cells and the identified biomarkers were calculated using Spearman’s method.

Construction of a nomogram

Utilizing the expression levels of the identified biomarkers, a nomogram model was constructed using the rms R package (v6.8-1)39. Each gene was assigned a score based on its expression level, and these scores were summed to derive a total score that predicted the risk of thymoma occurrence. To evaluate the diagnostic performance of the nomogram model, calibration curves and decision curve analysis (DCA) curves were subsequently generated.
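
The quantity plotted in a decision curve can be sketched as follows. This minimal Python example assumes the standard net-benefit definition (true positives per sample minus false positives per sample weighted by the threshold odds) and compares a model with the "treat all" strategy. The predicted probabilities and labels are hypothetical and do not come from the study's nomogram.

```python
def net_benefit(probs, labels, threshold):
    """Net benefit of calling a sample positive when prob >= threshold."""
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    odds = threshold / (1.0 - threshold)
    return tp / n - (fp / n) * odds

def net_benefit_treat_all(labels, threshold):
    """Net benefit when every sample is called positive."""
    prevalence = sum(labels) / len(labels)
    odds = threshold / (1.0 - threshold)
    return prevalence - (1.0 - prevalence) * odds

# Hypothetical predicted risks and true labels (1 = thymoma, 0 = control).
probs = [0.9, 0.8, 0.3, 0.2]
labels = [1, 1, 0, 0]
model_nb = net_benefit(probs, labels, threshold=0.5)
all_nb = net_benefit_treat_all(labels, threshold=0.5)
print(model_nb, all_nb)
```

A decision curve repeats this calculation over a range of thresholds; a model is considered useful where its curve lies above both the "treat all" and "treat none" (net benefit of zero) strategies.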

GSEA analysis and drug prediction of biomarkers

Gene Set Enrichment Analysis (GSEA) was conducted utilizing the clusterProfiler (v3.18.1)40 R package based on the training dataset to further evaluate the biological functions of biomarkers. This analysis aimed to determine the statistical significance of molecular pathways, with q values < 0.25 and p values < 0.05 defined as statistically significant. Additionally, compounds corresponding to each biomarker were retrieved from the Comparative Toxicogenomics Database (CTD). A biomarker-drug interaction network was constructed using Cytoscape software to visualize these relationships.

Regulatory network construction

The NetworkAnalyst database (https://www.networkanalyst.ca/) was utilized to identify miRNAs and TFs that regulate the biomarkers. Subsequently, StarBase v2.0 was employed to predict the lncRNAs associated with these miRNAs. Finally, a regulatory network encompassing TFs, mRNAs, miRNAs, and lncRNAs was constructed using Cytoscape software.

Real-time polymerase chain reaction (RT-qPCR)

Total RNA was extracted from the tumor tissues of 8 patients and from 8 control samples using the TRIzol reagent (Thermo Fisher). Reverse transcription was subsequently performed with the PrimeScript RT Reagent Kit (Takara) to synthesize complementary DNA (cDNA). Amplification was carried out using SYBR Green PCR Master Mix. Relative quantification was performed using the 2^−ΔΔCt method, with β-actin serving as the internal reference gene to normalize gene expression levels. The primer sequences are provided in Table 1. All experiments were conducted in triplicate to ensure reliability and reproducibility. This study was approved by the Clinical Research Ethics Committee of Shanxi Bethune Hospital (Approval No. LYLL-2025-002/PJ29). All procedures involving human participants were conducted in accordance with the Declaration of Helsinki.
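
The relative quantification step can be illustrated with a short Python sketch of the 2^−ΔΔCt calculation: the target Ct is normalized to the reference gene within each sample, and the tumor ΔCt is then compared with the control ΔCt. The Ct values are hypothetical, and averaging triplicate wells before the calculation is an assumption about how the replicates were handled.

```python
from statistics import mean

def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of the target gene in sample vs control (2^-ddCt)."""
    delta_sample = ct_target_sample - ct_ref_sample      # dCt in sample
    delta_control = ct_target_control - ct_ref_control   # dCt in control
    ddct = delta_sample - delta_control
    return 2.0 ** (-ddct)

# Hypothetical triplicate Ct values (target gene and beta-actin reference).
tumor_target = mean([26.0, 26.5, 25.5])
tumor_ref = mean([18.0, 18.5, 17.5])
control_target = mean([24.0, 24.5, 23.5])
control_ref = mean([18.0, 18.5, 17.5])

fold_change = relative_expression(tumor_target, tumor_ref,
                                  control_target, control_ref)
print(fold_change)
```

Here the ΔΔCt is 2, so the target gene is expressed at one quarter of the control level; values below 1 indicate downregulation in the tumor sample.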

Table 1 The primer sequences of biomarkers in RT-qPCR.

Western blotting

Total protein was extracted from tissue using RIPA lysis buffer (Beyotime Biotechnology, P0013B) supplemented with protease and phosphatase inhibitors (Sigma, P8340; Sigma, P0044; Servicebio, CR2404006). The protein concentration was determined using the BCA protein assay kit (Beyotime Biotechnology, P0010). Equal amounts of protein (typically 20–35 µg) were separated on a 10% SDS-PAGE gel and transferred to a PVDF membrane (Millipore, 0000279048). Membranes were then incubated overnight at 4 °C with primary antibodies diluted in blocking solution (CCR9 Rabbit pAb: ABclonal, A14848, 1:1000; CCL13 Antibody: Abmart, PK21663S, 1:1000). After washing with PBS-T three times, the membranes were incubated with an HRP-conjugated secondary antibody (Goat anti-Rabbit IgG (H + L) Secondary Antibody, HRP, Invitrogen, 31460, 1:5000) for 1 h at room temperature. Protein bands were visualized using an ECL detection system (bltlux, GelView 6000 Plus) and captured on X-ray film or a digital imaging system. Densitometric analysis was performed using ImageJ software to quantify the protein expression levels.

Immunohistochemistry (IHC)

For immunohistochemical analysis, paraffin-embedded tissue sections were deparaffinized, rehydrated, and incubated with 3% hydrogen peroxide at room temperature for 5–10 min to block endogenous peroxidase activity. After washing with distilled water, sections were immersed in PBS twice for 5 min each, followed by antigen retrieval for 10 min. Once cooled, sections were washed with PBS and incubated with 5% BSA blocking solution for 20 min. Primary antibodies against CCR9 (Proteintech, 11386-1-AP, 1:100) or CCL13 (Abmart, PK21663S, 1:100) were applied and incubated at 37 °C for 2 h. After washing, sections were treated with S-vision polyclonal secondary antibody (Servicebio, G1302-100ML) at room temperature for 20 min. Following further washes, DAB (Servicebio, G1212) was used for color development, and after stopping the reaction with distilled water, sections were counterstained with hematoxylin (Servicebio, G1004), dehydrated, cleared, mounted, and observed under a microscope. Immunohistochemistry was performed using cancer and adjacent normal tissue samples from 3 groups of patients, with 3 different fields captured for each group.

Statistical analysis

All analyses were performed using R version 4.1.2. The significance of differences in expression levels, infiltration ratios, and other pertinent features was evaluated using the Wilcoxon test. A p value < 0.05 was considered statistically significant.

Results

A total of 161 DEIRGs in thymoma

To identify DEIRGs in thymoma, a differential expression analysis was first conducted, yielding a total of 1,844 DEGs, of which 1,303 were downregulated and 541 were upregulated (Fig. 1A, B, Supplementary Table 1). These DEGs were then intersected with the IRGs, resulting in 161 overlapping DEIRGs (Fig. 1C). These DEIRGs served as the basis for the subsequent analyses of gene interactions, regulatory networks, and biomarkers.

Fig. 1

Differential expression analysis in the training dataset. (A) Volcano plot illustrating differentially expressed genes (DEGs) between control and thymoma samples. (B) Heatmap depicting the expression patterns of DEGs in the training set. (C) Venn diagram of the differentially expressed immune-related genes (DEIRGs).

Acquisition of key genes by PPI MCODE analysis

PPI networks integrate the interactions among proteins that underlie many critical biological processes, and module analysis of such networks identifies key modules and nodes, improving the understanding of protein functions and interaction patterns. In this study, a PPI network was constructed based on the DEIRGs, resulting in 438 unique protein interaction pairs (Fig. 2A). Subsequently, MCODE analysis was performed to uncover densely connected subnetworks within the PPI network. This analysis revealed 5 biologically significant modules, with a focus on the top 3 clusters (Fig. 2B). Notably, cluster 1 exhibited the highest score of 12.462 and comprised 14 genes: CCL13, CCL17, CCL22, CCL25, CCL21, CCR5, CCL24, CCL19, CCR7, CCR9, CX3CL1, CCL2, CXCL2, and CD8A. These genes were selected as key genes for the subsequent analyses.

Fig. 2

Construction of Protein–protein interaction (PPI) network and module mining analysis. (A) PPI network analysis. (B) Identification of submodules within the PPI network.

Enrichment function, chromosome location and gene correlation of key genes

GO and KEGG enrichment analyses were conducted to annotate the functions of the key genes and to investigate the primary metabolic and signaling pathways in which these genes were involved. The GO analysis revealed a total of 441 enriched terms: 412 related to biological processes (BP), 1 to cellular components (CC), and 28 to molecular functions (MF). Notably enriched terms included the “chemokine-mediated signaling pathway”, “neutrophil migration”, and “cytokine receptor binding” (Supplementary Fig. 1A). Furthermore, KEGG pathway analysis revealed 15 significantly enriched pathways, including “the chemokine signaling pathway”, “viral protein interaction with cytokines and their receptors”, and “the IL-17 signaling pathway” (Supplementary Fig. 1B). These analyses provided insights into the functional roles of the key genes. Chromosomal localization analysis indicated that the key genes were predominantly concentrated on chromosomes 16 and 17 (Supplementary Fig. 1C). Additionally, correlation analysis demonstrated a significant positive correlation between CCL13 and CCR9 (cor = 0.4992, P < 0.05), suggesting a potential functional interaction (Supplementary Fig. 1D).
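
Over-representation analyses of this kind are typically based on a hypergeometric test. The following minimal Python sketch shows that calculation for a single term; it is an illustration under stated assumptions rather than the clusterProfiler implementation, and the background size, term size, and overlap below are hypothetical.

```python
from math import comb

def enrichment_pvalue(background, term_size, list_size, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    background: number of annotated genes in the universe
    term_size:  number of background genes annotated to the term
    list_size:  number of genes in the query list
    overlap:    number of query genes annotated to the term
    """
    total = comb(background, list_size)
    upper = min(term_size, list_size)
    tail = 0
    for k in range(overlap, upper + 1):
        tail += comb(term_size, k) * comb(background - term_size, list_size - k)
    return tail / total

# Hypothetical numbers: 20 background genes, 5 in the term,
# a 4-gene query list with 2 genes in the term.
p = enrichment_pvalue(background=20, term_size=5, list_size=4, overlap=2)
print(round(p, 4))
```

In practice the p-values from all tested terms are then adjusted for multiple testing before terms are reported as enriched.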

Identification of CCR9, CX3CL1, and CCL13 as potential biomarkers

Machine learning algorithms can automatically extract relevant features from high-dimensional candidate gene data, thereby enhancing the accuracy and efficiency of biomarker identification. In this study, three algorithms were applied to the key genes. LASSO regression identified 4 genes: CCL13, CCL21, CCR9, and CX3CL1 (Fig. 3A). The Boruta algorithm yielded a total of 13 significant genes, including CCL13, CCL17, and CCL22 (Fig. 3B). SVM-RFE achieved optimal accuracy when the number of features was set to 4, identifying CCR9, CX3CL1, CCL13, and CCL22 (Fig. 3C). Ultimately, the intersection of the gene sets obtained from these 3 algorithms yielded 3 potential biomarkers: CCR9, CX3CL1, and CCL13 (Fig. 3D).
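
The intersection step can be reproduced with plain set operations, as in the short Python sketch below. The LASSO and SVM-RFE sets are those reported above. The Boruta set is only partially known: the text names 3 of its 13 genes, and CCR9 and CX3CL1 are inferred to belong to it because they appear in the final intersection, so the set shown is an assumption that omits the unnamed members.

```python
# Gene sets as reported in the text.
lasso = {"CCL13", "CCL21", "CCR9", "CX3CL1"}
svm_rfe = {"CCR9", "CX3CL1", "CCL13", "CCL22"}

# Partial Boruta set (assumption): 3 named genes plus the 2 genes inferred
# from the reported intersection; the other 8 Boruta genes are not named.
boruta_known = {"CCL13", "CCL17", "CCL22", "CCR9", "CX3CL1"}

# Genes selected by all three methods.
biomarkers = sorted(lasso & svm_rfe & boruta_known)
print(biomarkers)
```

Requiring agreement across all three methods is a conservative choice: a gene selected by only one or two algorithms, such as CCL21 or CCL22, is excluded.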

Fig. 3

Machine learning screening of potential biomarkers. (A) Candidate gene selection via Least Absolute Shrinkage and Selection Operator (LASSO) regression. (B) Identification of characteristic variables using the Boruta algorithm. (C) Important gene prioritization in thymoma via Support Vector Machine Recursive Feature Elimination (SVM-RFE). (D) Venn diagram showing the intersection of three potential biomarkers.

Identification of biomarkers with better diagnostic value for thymoma

To better understand the relationship between each candidate biomarker and disease progression, expression differences of the biomarkers were analyzed across various samples in both the training and validation sets. In the training cohort, all 3 candidate biomarkers exhibited low expression levels in thymoma samples (Fig. 4A). ROC curves were subsequently generated for these biomarkers to assess their predictive utility. The analysis revealed that all biomarkers demonstrated high efficiency, with AUC values exceeding 0.9: CCR9 (AUC = 0.999), CX3CL1 (AUC = 0.962), and CCL13 (AUC = 0.993) (Fig. 4B). In the validation cohort, expression levels of all three candidate biomarkers were found to be lower in tumor samples (thymic carcinoid and thymoma) compared to the control group; however, CX3CL1 did not reach statistical significance (Fig. 4C). ROC curves were again generated for each biomarker to further evaluate their predictive performance. The results indicated that, with the exception of CX3CL1 (AUC = 0.603), CCR9 and CCL13 exhibited AUC values exceeding 0.7, demonstrating strong discriminatory ability (Fig. 4D). These findings supported the potential of CCR9 and CCL13 as reliable biomarkers for accurate disease prediction. In summary, both CCR9 and CCL13 were consistently downregulated in the thymic tumor group within the validation cohort, maintaining AUC values above 0.7. In contrast, CX3CL1 failed to show statistical significance with an AUC value below 0.7. Consequently, CCR9 and CCL13 were validated as biomarkers for further analysis.

Fig. 4

Verification of the 3 potential biomarkers. (A) The expression patterns of the potential biomarkers in training set. (B) Receiver Operating Characteristic (ROC) curve analysis of the training set. (C) The expression patterns of the potential biomarkers in validation set. (D) ROC curve analysis of the validation set.

Relevance between biomarkers and the infiltration of immune cells

Immune infiltration is a critical factor in various physiological and pathological processes, particularly in the context of tumor immunotherapy. Significant differences in immune cell scores were observed between the thymoma group and the control group for 14 types of immune cells, including naive CD4+ T cells, follicular helper T cells, and regulatory T cells (Tregs) (Fig. 5A). Furthermore, a histogram illustrated the distribution of the different immune cell types, showing their relative abundance within the samples (Fig. 5B). Next, we conducted a correlation analysis of the 14 differentially infiltrated immune cell types. We found a significant positive correlation between Tregs and naive CD4+ T cells, while a significant negative correlation was observed between Tregs and monocytes (Fig. 5C). The correlation between the biomarkers and the differentially infiltrated immune cells was subsequently analyzed. CCL13 exhibited its highest positive correlation with M0 macrophages (cor = 0.4763) and its strongest negative correlation with M2 macrophages (cor = −0.3219) (Fig. 5D). In contrast, CCR9 showed its strongest positive correlation with naive CD4+ T cells (cor = 0.7684) and a significant negative correlation with activated NK cells (cor = −0.3023) (Fig. 5E). These findings underscored the complex interplay between the biomarkers and various immune cell populations in patients with thymoma and highlighted potential avenues for targeted immunotherapy strategies aimed at modulating immune responses in this patient population.

Fig. 5

Immune infiltration analysis. (A) The differences in immune cell infiltration between control and thymoma groups. (B) Relative abundance of immune cell types in thymoma and control samples. (C) Correlation analysis between differentially infiltrated immune cell subsets. (D, E) Spearman correlation between the biomarkers CCL13 and CCR9 and immune cells.

Construction of a robust diagnostic nomogram of biomarkers

Nomograms allow multiple predictive variables to be considered together, thereby improving prediction accuracy and providing a more holistic representation of disease risk when several factors are involved. In this study, the 2 biomarkers were incorporated into the nomogram, in which a higher total score indicated an increased likelihood of disease occurrence (Fig. 6A). The slope of the calibration curve approached 1, indicating good agreement between the predicted and observed risk of thymoma (Fig. 6B). Furthermore, in DCA, the curve representing the nomogram consistently showed the highest net benefit, supporting its predictive power (Fig. 6C). These findings underscored the potential utility of the nomogram in clinical settings for risk assessment and informed decision-making regarding patient management.

Fig. 6

Construction and validation of the nomograms for thymoma. (A) Nomogram of the biomarkers. (B) Calibration curve of the nomogram. (C) Decision curve analysis (DCA) curve of the nomogram.

Regulatory networks and drug prediction of biomarkers

Regulatory networks of TF-mRNA and mRNA-miRNA-lncRNA interactions are instrumental in biological and medical research because they provide a structured view of the mechanisms governing gene expression regulation. In this analysis, CCL13 and CCR9 were predicted to interact with a total of 28 molecular entities, comprising 9 miRNAs, 7 lncRNAs, and 12 TFs. Notably, CCR9 was associated with all 9 miRNAs, including hsa-miR-335-5p, hsa-miR-145-3p, and hsa-miR-30a-5p. CCR9 and CCL13 were each predicted to have 6 distinct TFs (Supplementary Fig. 2A). Additionally, drug prediction analyses revealed that CCL13 had the highest number of associated drugs, totaling 15 (including methotrexate and prednisone) (Supplementary Fig. 2B). Among these, prednisone is frequently used in combination therapies to alleviate symptoms related to thymoma41. These findings underscored the complex interplay between the molecular components of the regulatory network and highlighted potential therapeutic avenues for thymoma-related conditions.

Activation and inhibition pathways of biomarkers

To further evaluate the biological functions of the identified biomarkers in patients with thymoma, single-gene GSEA was conducted. The analysis of CCR9 revealed that several pathways were suppressed, including those associated with asthma, complement and coagulation cascades, and graft-versus-host disease. Conversely, pathways such as DNA replication and primary immunodeficiency were found to be activated (Fig. 7A). For CCL13, the GSEA results indicated that pathways related to extracellular matrix (ECM) interactions with receptors and the TGF-beta signaling pathway were suppressed. In contrast, pathways associated with allograft rejection and autoimmune thyroid disease showed activation (Fig. 7B). These findings suggested that CCR9 and CCL13 may have significant roles in modulating immune responses and pathological processes in thymoma patients, indicating the need for further investigation into their potential as therapeutic targets or prognostic indicators.

Fig. 7

Gene Set Enrichment Analysis (GSEA) of biomarkers. (A) GSEA results for CCR9. (B) GSEA results for CCL13.

Validation of biomarkers through biological experiments

To validate biomarker expression in thymoma, RT-qPCR analysis was performed on 8 paired thymoma and adjacent normal tissues. As shown in Fig. 8A, CCR9 expression was significantly downregulated in thymoma tissues compared with normal controls (P < 0.05), whereas CCL13 expression was significantly upregulated in thymoma samples. The findings for CCR9 were consistent with those obtained from the bioinformatic analysis. Additionally, Western blot results demonstrated heterogeneity in CCR9 and CCL13 expression, which corroborated the PCR findings (Fig. 8B). Furthermore, we performed immunohistochemical analysis on tumor and adjacent normal tissues from 3 additional patients to examine the expression levels of CCR9 and CCL13. The results showed that both CCR9 and CCL13 were expressed at lower levels in tumor tissues compared to normal tissues, which is consistent with our previous analyses (Fig. 8C).

Fig. 8

Validation of biomarkers through biological experiments. The protein and mRNA expression of CCR9 and CCL13 were further validated in clinical tissue by Real-time polymerase chain reaction (RT-qPCR) (A), Western blotting (B), and Immunohistochemistry (IHC) (C). Original blots are presented in Supplementary Fig. 3. *P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001.

Discussion

The thymus, as a central immune organ, serves a dual role in host defense: 1) providing immunosurveillance against pathogens, tumor cells, and antigens derived from damaged tissues; and 2) integrating immune responses through the coordinated actions of its two functional arms—thymus-independent (innate) and thymus-dependent (adaptive) immunity42. IRGs are genes identified through research as being significantly associated with immune response pathways. These genes are involved in the activation of immune cells, migration, and the release of inflammatory factors, playing an important role in the development and progression of cancer. Studies have shown that high expression of IRGs is associated with better prognosis in both EBV-positive and EBV-negative gastric cancer patients43. However, not all patients benefit from immunotherapy. Therefore, establishing predictive models based on IRG signatures to screen thymoma patients for survival prognosis and to identify populations that may benefit from cancer immunotherapy holds significant clinical importance.

In this study, we first identified DEGs between thymoma patients and normal samples based on the TCGA-GTEx cohort. Due to the extreme scarcity of normal thymic tissue samples in TCGA, we incorporated GTEx-derived blood samples to enhance statistical power in the control group, while applying computational measures such as batch effect correction to minimize confounding effects caused by tissue background differences44. Subsequently, by intersecting IRGs with DEGs, we obtained key IRGs in thymoma. Further, using machine learning methods, including the SVM-RFE, Boruta, and LASSO algorithms, along with validation in an external GEO dataset, we identified CCR9 and CCL13 as immune-related biomarkers in thymoma. Through in-depth analysis of these biomarkers, we explored their correlations with differential immune cell infiltration and revealed their involvement in enriched pathways and regulatory networks. These findings not only provide new insights into the immune microenvironment of thymoma but also offer potential targets for developing personalized treatment strategies based on IRGs. The CCR9 and CCL13-based nomogram may complement existing imaging and histopathological tools by providing immune-related risk stratification, offering a more comprehensive assessment of thymoma. While traditional methods focus on anatomical changes, this nomogram integrates immune microenvironment data, which may enhance disease risk prediction and help guide treatment decisions. Additionally, identifying CCR9 and CCL13 as key biomarkers for thymoma opens the possibility of developing targeted immunotherapies, though challenges remain in clinical translation, including the need for further validation, potential off-target effects, and the complexity of immune modulation in the thymic microenvironment.

CCR9 (C–C chemokine receptor 9) is a specific receptor for CCL25 and is selectively expressed in the thymus, lymph nodes, and spleen45. CCR9 is predominantly distributed on the surface of immature T lymphocytes and intestinal cells, where it plays a crucial role in T lymphocyte development and tissue-specific homing upon binding to its specific ligand, CCL2546. Studies have shown that CCR9 mediates thymocyte chemotaxis by responding to CCL25/thymus-expressed chemokines, and its mRNA is selectively expressed in T lymphocyte-generating sites such as the thymus and small intestine47. Additionally, the expression of CXCR4, CCR7, and CCR9 in T lymphocytes significantly influences their entry into the thymus, further underscoring the critical role of chemokine receptors in thymocyte development48. Thymic homing signals, including those mediated by CCR7 and CCR9, play a vital role in the migration and selection of T cell progenitors within the thymus42. Although CCR9 is highly expressed in various cancers, such as breast cancer, colorectal cancer, and melanoma49, and inflammatory factors like TNF-α can promote CCR9 expression in human breast cancer MCF-7 cells50, our study revealed a significant downregulation of CCR9 in thymoma, which may indicate disrupted thymocyte migration and immune selection processes specific to the thymic microenvironment. Given the frequent association of thymoma with autoimmune diseases such as myasthenia gravis51, this aberrant expression may also reflect impairments in central immune tolerance, highlighting CCR9’s potentially unique role in thymoma pathogenesis. This downregulation may be related to the regulation of CCR9 expression by certain non-coding RNAs (miRNAs and lncRNAs), which could in turn modulate its biological functions. Overall, the literature suggests that CCR9 plays a key role in thymoma and thymus-related immune responses, affecting thymocyte development, migration, and selection. Further investigation into the mechanisms of CCR9 in thymoma and immune-mediated diseases will not only help elucidate its function in the thymic microenvironment but may also provide new targets and strategies for the treatment of these diseases.

CCL13 (C–C motif chemokine ligand 13) has been found to be closely associated with the clinical staging and progression of various tumors. Studies have shown that high serum levels of CCL13 are an independent predictor of distant metastasis in colorectal cancer and are significantly correlated with advanced age, late T-stage, distant metastasis, and UICC staging52. In thymoma, pan-cancer analysis has revealed significant upregulation of CCL13 expression53,54. Additionally, CCL13 mRNA levels are significantly higher in hepatocellular carcinoma (HCC) tumor tissues compared to normal tissues55. In ovarian cancer, CCL13 triggers epithelial-mesenchymal transition (EMT) through activation of the p38 MAPK pathway, thereby promoting tumor invasion and metastasis56. However, in vitro experiments indicate that chronic hypoxia has no significant effect on CCL13 expression in breast cancer and lung adenocarcinoma cells57. While CCL13’s role in other cancers has been well studied, its function in thymoma remains unclear. Our findings suggest that CCL13 is closely linked to macrophage infiltration within the unique immune context of thymoma, possibly influencing tumor behavior through local immune modulation. This thymoma-specific role may differ from mechanisms observed in other tumors and warrants further investigation.

Immune cell infiltration analysis revealed significant differences in 14 immune cell types between the disease and control groups. Among these, CD8+ T cells, as key players in adaptive immune responses, play a crucial role in immune surveillance58. When tumor cells are recognized in the body, CD8+ T cells are activated through T cell receptor (TCR) antigen recognition, rapidly proliferate, and differentiate into cytotoxic T lymphocytes (CTLs), which directly destroy tumor cells through cell-to-cell contact59. Studies have shown that CD8+ T cells can be used as part of an immune scoring system, providing a more accurate assessment of patient prognosis compared to traditional pathological criteria60.

Further analysis of the correlation between biomarkers and differentially expressed immune cells revealed that CCL13 is significantly associated with macrophages. Research indicates that inhibition of SOCS3, a negative regulator of the STAT3 pathway in macrophages, leads to increased production of CCL13. This upregulation of CCL13 is associated with reduced cancer metastasis, suggesting a complex interaction between chemokine signaling and macrophage-mediated immune responses61. In contrast, CCR9 shows a stronger correlation with naive CD4+ T cells. Studies have found that the heteromeric complex formed by CCR9 and dopamine receptor D5 plays a critical role in the homing of CD4+ T cells to the gut during inflammation, highlighting the multifaceted role of CCR9 in immune cell migration41. These findings provide new perspectives on the immune regulatory mechanisms in thymoma and suggest that CCL13 and CCR9 may serve as potential immunotherapeutic targets.
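As an illustration of how a biomarker's expression can be related to an immune cell fraction across samples, the following Python sketch computes a rank correlation. It assumes Spearman's coefficient; the correlation method actually used is not specified in this section, so this is a stand-in. The functions `average_ranks` and `spearman_rho` and the toy values are hypothetical and are not data from this study.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy values: gene expression and an immune cell fraction per sample.
expression = [1.0, 2.0, 3.0, 4.0, 5.0]
cell_fraction = [0.10, 0.15, 0.30, 0.20, 0.40]
rho = spearman_rho(expression, cell_fraction)
print(round(rho, 3))
```

A rank-based coefficient is a common choice for this kind of analysis because estimated cell fractions are bounded and often skewed, but a real analysis would also report a p-value and correct for multiple testing across cell types.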

In summary, this study is the first to integrate multi-omics data and employ machine learning methods to identify and validate CCL13 and CCR9 as key immune-related biomarkers in thymoma, providing new insights into the immune regulatory mechanisms of thymoma and potential therapeutic targets. However, several limitations should be noted. First, the sample size for the experimental part of this study is relatively small, and future research should expand it to improve experimental reliability. Second, the data primarily originate from public databases such as TCGA and GEO, and differences in sequencing platforms, sample sources, and processing methods may introduce sample heterogeneity, potentially affecting the consistency and generalizability of the findings. Third, due to the limited number of control samples in the TCGA database, we incorporated the GTEx database for analysis. Despite batch correction efforts, this integration may still lead to systematic biases that influence biomarker identification. Fourth, our validation dataset (GSE177522) predominantly consists of thymic carcinoids rather than typical thymomas. Given the biological differences between these two tumor types, caution must be exercised when interpreting the validation results. Therefore, future studies should incorporate independent validation cohorts to further assess the robustness and reproducibility of the identified biomarkers. Finally, functional validation through in vitro and in vivo experiments, followed by clinical studies, will be essential to confirm their practical utility in thymoma diagnosis and treatment.

Conclusion

In this study, through multi-omics data analysis and experimental validation, we successfully identified CCR9 and CCL13 as key immune-related biomarkers in thymoma. These findings not only reveal the critical roles of CCR9 and CCL13 in the immune microenvironment of thymoma but also highlight their potential as therapeutic targets. Based on these results, this study provides potential directions for developing novel treatment strategies for thymoma and lays a theoretical foundation for early diagnosis and targeted therapy of thymoma.