Introduction

Thyroid cancer (THCA) incidence has risen in recent decades, notably among adolescents and young adults1. Young-onset THCA exhibits molecular and clinical features distinct from those of THCA in older patients and often presents as aggressive, advanced-stage disease2. The tumor microenvironment (TME), particularly tumor-associated macrophages (TAMs), critically influences tumor progression and therapy resistance3. In thyroid cancer, high TAM density correlates with lymph node metastasis and poor survival4. However, the roles of TAMs in young-onset THCA remain uncharacterized. This study integrates single-cell RNA sequencing (scRNA-seq) data from young-onset THCA to identify TAM-specific markers, which are then validated across bulk RNA-seq and proteomic datasets. We aim to identify TAM-related biomarkers of aggressiveness in young-onset THCA and thereby inform targeted therapeutic strategies.

A publicly available scRNA-seq dataset comprising 38,224 cells (Fig. 1A, B) was accessed and re-analyzed. Nine major cell types were identified using canonical markers (Fig. 1C). Macrophage abundance was significantly increased in the Young group (Fig. 1D). Differential expression analysis identified 47 differentially expressed genes (DEGs) with age- and macrophage-specific changes, linked to immune regulation and inflammatory signaling (Fig. 1E, F). Of these 47 DEGs, 45 were protein-coding genes. We examined the expression of these 45 DEGs across age groups and pathological subtypes and found that age contributes substantially to their expression variability; however, because age group and subtype are highly collinear, these estimates should be interpreted with caution (Supplementary Fig. 1). The 45 DEGs were further validated in the GSE153659 dataset, where they showed higher expression in the Young group (Fig. 1G) and eight genes differed significantly between age groups (Fig. 1H). A similar trend was observed in the GSE53157 dataset (Fig. 1I), in which ten genes differed significantly between age groups (Fig. 1J). This trend persisted after stratification by pathological subtype (Supplementary Fig. 2). Notably, OLR1 and SIGLEC1 were significantly upregulated in the Young group in both datasets, suggesting relevance to age-related changes in TAMs.

Fig. 1: Identification of TAM-specific genes in young-onset THCA through scRNA-seq analysis and validation in bulk RNA-seq datasets.

A UMAP visualization of 9 major cell types in the GSE193581 dataset. B UMAP visualization of two age groups. C Dotplot of marker genes. D Box plot of the cell type percentage in two age groups. All boxes are centered at the median and bounded by the first (Q1) and third (Q3) quartiles. Upper whiskers extend to the smaller of the maximum and Q3 + 1.5 × IQR, and lower whiskers extend to the larger of the minimum and Q1 − 1.5 × IQR. E Filter criteria to obtain age-specific and macrophage-specific genes. F Bar plot exhibiting enrichment results of upregulated and downregulated genes on GOBP terms. Bar length represents the −log10(q-value) of enrichment, reflecting correlation strength. G Gene expression heatmap in the Young and Old groups from GSE153659, based on Z-scores of normalized gene expression levels. H Boxplot of significant DEGs in GSE153659, based on FPKM values. I Gene expression heatmap in the Young and Old groups from GSE53157, based on Z-scores of normalized gene expression levels. J Boxplot of significant DEGs in GSE53157, based on signal intensity. Asterisks indicate the level of statistical significance: ns, non-significant, *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001.

The 45 DEGs were then validated in The Cancer Genome Atlas (TCGA)-THCA dataset to examine the association with age, lymph node metastasis and tumor staging. Although the differences between age groups were not highly significant (Supplementary Fig. 3), in the metastasis subgroup comparison, 21 genes showed significant differential expression, with 14 upregulated and 7 downregulated in the metastasis group (Fig. 2A), and the top 8 upregulated genes were visualized (Fig. 2B). Additionally, 18 DEGs differed in T staging, with 11 upregulated and 7 downregulated in the T3_4 group (Fig. 2C), and the top 8 upregulated genes were visualized (Fig. 2D). To evaluate the relationship between these 45 DEGs and immune status, we assessed their T-cell dysfunction scores using the TIDE tool, where a higher score indicates that samples with high expression tend to be enriched in the T-cell dysfunction phenotype. KCNMA1, SIGLEC1, and FOLR2 exhibited the highest dysfunction scores (Fig. 2E). Specifically, in the TCGA Melanoma dataset, these genes were grouped by High and Low expression, with survival analysis showing opposite trends between the cytotoxic T cell (CTL) Top and Bottom groups for KCNMA1 (Fig. 2F), SIGLEC1 (Fig. 2G), and FOLR2 (Fig. 2H). In other datasets, KCNMA1 and OLR1 (TCGA Endometrial; Fig. 2I-J) and P2RY13 (METABRIC; Fig. 2K) also showed significant dysfunction. Pearson correlation analysis revealed significant correlations with CTL levels (Fig. 2L), and survival risk scores were consistent across five datasets (Fig. 2M).

Fig. 2: Associations of selected DEGs with lymph node metastasis and tumor staging in the TCGA-THCA dataset and association with tumor immune dysfunction, immune infiltration, and prognosis.

A DEGs associated with lymph node metastasis. B Significant DEGs between N0 and N1 groups, represented by gene read count. C DEGs associated with tumor T staging. D Significant DEGs between T1_2 and T3_4 groups, represented by gene read count. E Heatmap of T cell dysfunction scores for the 45 genes across five datasets. F Kaplan–Meier (K–M) curves for KCNMA1 gene High and Low expression groups with different CTL proportions in the TCGA Melanoma dataset. G K–M curves for SIGLEC1 gene High and Low expression groups with different CTL proportions in the TCGA Melanoma dataset. H K–M curves for FOLR2 gene High and Low expression groups with different CTL proportions in the TCGA Melanoma dataset. I K–M curves for KCNMA1 gene High and Low expression groups with different CTL proportions in the TCGA Endometrial dataset. J K–M curves for OLR1 gene High and Low expression groups with different CTL proportions in the TCGA Endometrial dataset. K K–M curves for P2RY13 gene High and Low expression groups with different CTL proportions in the METABRIC dataset. L Pearson correlation coefficient (Pearson’s r) heatmap between the 45 genes and CTL levels across five datasets. M Survival risk scores heatmap for the 45 genes across five datasets. Asterisks indicate the level of statistical significance: ns non-significant, *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001.

The 45 DEGs were further validated at the proteomic level. The coefficient of variation across the 12 pooled samples mainly ranged from 0 to 0.2 (Supplementary Fig. 4A), with a similar distribution after missing value imputation (Supplementary Fig. 4B). Principal component analysis after batch effect correction with ComBat showed no apparent batch effects (Supplementary Fig. 4C). A total of 412 differentially expressed proteins were shared between the pediatric malignant (PM) vs. pediatric benign (PB) and PM vs. adult malignant (AM) comparisons (Supplementary Fig. 4D). Of the 45 selected genes, four overlapped with the PM vs. PB comparison and two with the PM vs. AM comparison (Supplementary Fig. 4E): ALOX5, IL4I1, MNDA, and HPGDS were upregulated in PM vs. PB, whereas MERTK was upregulated and OLFML3 was downregulated in PM vs. AM. However, OLR1 was not detected in this dataset, and SIGLEC1 did not differ significantly between groups.

Our analysis revealed increased macrophage infiltration in young-onset THCA patients and identified 45 TAM-associated DEGs with age-specific expression patterns that were validated across multiple datasets. These DEGs were significantly associated with mast cell-mediated immunity and activation, upregulation of vascular endothelial growth factor production, nitric oxide biosynthesis, and calcium-mediated signaling, all of which are potential mechanisms of tumor dissemination5,6,7,8. Two markers, OLR1 and SIGLEC1, showed consistent upregulation, in line with our previous findings. Comprehensive transcriptomic analysis and protein staining confirmed OLR1’s specific expression on TAMs within the TME, with OLR1 levels serving as a reliable biomarker for macrophage infiltration and correlating with poor clinical outcomes in head and neck squamous cell carcinoma9. Our prior research also highlighted the widespread presence of Siglec family members in TAMs across various cancers, suggesting a role in tumor progression through immunomodulation and TAM polarization within the TME10. The co-expression of OLR1 and SIGLEC1 suggests synergistic TAM polarization toward immunosuppressive phenotypes. Importantly, high marker expression reversed the protective association of CTL infiltration, which is consistent with TAM-mediated impairment of cytotoxic T cells through direct suppression or checkpoint modulation11,12. This observation supports the view that interactions between tumor cells and immune cells in the TME significantly influence tumor progression13. Because TIDE does not include a thyroid cancer dataset, we were unable to validate this hypothesis directly in thyroid cancer samples.

In conclusion, our study highlights the critical role of TAMs in the prognosis of young-onset THCA. The identified TAM-specific genes serve as important biomarkers for tumor metastasis, staging, and immune dysfunction, as well as potential therapeutic targets. Further prospective cohort studies and experimental validation are warranted to confirm these findings and explore their clinical implications.

Methods

Thyroid cancer scRNA-seq data analysis

The single-cell RNA sequencing (scRNA-seq) dataset was obtained from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/) under the accession number GSE193581 (ref. 14). After excluding normal thyroid tissues, cell line data, and tumor samples treated with drugs, seven anaplastic thyroid cancer (ATC) samples and five papillary thyroid cancer (PTC) samples were selected for analysis. The raw gene expression matrix and corresponding metadata were processed and analyzed using Seurat (v5)15 in R (v4.4.2).

The gene expression matrix was first normalized using the normalize_total and log1p functions. Highly variable genes were identified using the highly_variable_genes function and selected for downstream analysis. Principal component analysis (PCA) and Harmony were applied for dimensionality reduction and batch effect correction, with the parameters “batch_key = ‘SampleID’, n_pcs = 20”. Uniform Manifold Approximation and Projection (UMAP) was then used for visualization.
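As an illustration of what total-count normalization followed by a log1p transform does to a count matrix, a minimal pure-Python sketch is given below. It is not the pipeline used in the study: the helper names, the toy matrix, and the target sum of 100 are assumptions chosen only to keep the arithmetic easy to follow.

```python
import math


def normalize_total(counts, target_sum=1e4):
    """Scale each cell (row) so that its counts sum to target_sum.

    target_sum is an illustrative assumption; the study does not report it.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        # A cell with no counts stays all-zero instead of dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([value * scale for value in cell])
    return normalized


def log1p_transform(matrix):
    """Apply log(1 + x) element-wise to stabilize variance."""
    return [[math.log1p(value) for value in row] for row in matrix]


# Toy matrix: 2 cells (rows) x 3 genes (columns), hypothetical values.
counts = [[1, 3, 0], [10, 0, 10]]
norm = normalize_total(counts, target_sum=100)
logged = log1p_transform(norm)
```

After normalization both toy cells have the same total, so differences in sequencing depth no longer dominate comparisons between cells; zeros remain zeros after the log1p step.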

Samples were divided into two age groups based on a 35-year-old threshold: the Old group (n = 8) and the Young group (n = 4). To identify differentially expressed genes (DEGs) in macrophages, we used the FindMarkers function with the parameters “ident.1 = young, ident.2 = old, logfc.threshold = 0.5” and retained genes with an adjusted p-value (p.adj) < 0.01. Finally, Gene Ontology (GO) enrichment analysis was performed using the clusterProfiler (v4.12.6) package16 in R.

To assess the confounding effect of pathological subtype on target gene expression, we examined the distribution of pathological types within the two age groups. The expression values of the 45 genes were aggregated into pseudobulk profiles for each sample, scaled, and used to generate a gene expression matrix. We then compared gene expression across age and pathological subgroups and performed ANOVA to quantify the contributions of age group and pathological subtype to expression variance. To better visualize these contributions, we calculated the proportion of the sum of squares for each factor relative to the total sum of squares.
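The sum-of-squares proportion can be sketched as follows. This is a simplified illustration, not the study's ANOVA model: it evaluates one grouping factor at a time (between-group sum of squares divided by total sum of squares), and the function name, sample values, and labels are hypothetical. The two fractions add up cleanly only in a balanced design such as this toy example; when age group and subtype are collinear, the fractions overlap and cannot be attributed uniquely to either factor.

```python
def sum_of_squares_fraction(values, labels):
    """Fraction of the total sum of squares explained by one grouping factor."""
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)

    # Collect the values belonging to each level of the factor.
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    # Between-group sum of squares: group size times squared mean offset.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return ss_between / ss_total if ss_total > 0 else 0.0


# Hypothetical scaled pseudobulk expression of one gene in four samples.
expression = [2.0, 1.0, -1.0, -2.0]
age_group = ["Young", "Young", "Old", "Old"]
subtype = ["ATC", "PTC", "ATC", "PTC"]

frac_age = sum_of_squares_fraction(expression, age_group)
frac_subtype = sum_of_squares_fraction(expression, subtype)
```

In this balanced toy example, age group accounts for most of the variance and subtype for the remainder.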

Bulk RNA-seq data acquisition and processing

The GSE153659 (ref. 17) and GSE53157 (ref. 18) datasets were downloaded from the GEO database. GSE153659 contains data from 24 PTC samples, while GSE53157 includes data from 24 thyroid carcinoma samples. In GSE153659, patients older than 50 years were excluded, and the remaining samples were classified into two age groups: Young (≤35 years) and Old (>35 years). After removing two samples with abnormal values, 15 PTC samples were retained for analysis. Similarly, in GSE53157, samples with the pathological type “follicular variant of papillary carcinoma” were excluded, leaving 5 poorly differentiated thyroid carcinomas (PDTC), 7 PTC, and 4 follicular thyroid carcinomas (FTC). Following the removal of one outlier, 15 samples remained, which were also stratified into Young and Old groups using the 35-year-old threshold.

To assess differential gene expression between the Young and Old groups, the Wilcoxon rank-sum test was performed. In addition, we examined the expression levels of the differentially expressed target genes across the pathological subgroups in the GSE53157 dataset.
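For groups as small as those analyzed here, the Wilcoxon rank-sum test can be computed exactly by enumerating every way of assigning the pooled ranks to the two groups. The sketch below shows that idea in pure Python. It is an illustration under assumptions: the text does not state whether an exact or approximate p-value was used, and the function names and expression values are hypothetical.

```python
from itertools import combinations


def midranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def exact_rank_sum_test(x, y):
    """Return the rank sum of x and the two-sided exact p-value."""
    ranks = midranks(list(x) + list(y))
    n_x, n = len(x), len(x) + len(y)
    observed = sum(ranks[:n_x])
    expected = n_x * (n + 1) / 2  # mean rank sum under the null hypothesis
    observed_dev = abs(observed - expected)

    extreme = total = 0
    # Enumerate every possible choice of n_x ranks for the first group.
    for idx in combinations(range(n), n_x):
        total += 1
        deviation = abs(sum(ranks[i] for i in idx) - expected)
        if deviation >= observed_dev - 1e-9:
            extreme += 1
    return observed, extreme / total


# Hypothetical expression values for one gene.
young = [4.5, 5.0, 6.7]
old = [1.2, 2.3, 3.1]
rank_sum, p_value = exact_rank_sum_test(old, young)
```

With three samples per group and complete separation, the smallest attainable two-sided p-value is 2/20 = 0.1, which shows how strongly group size limits statistical power in small cohorts.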

The Cancer Genome Atlas (TCGA)-THCA data analysis

The TCGA-THCA dataset and the corresponding clinical data were downloaded from the Xena platform, comprising 508 THCA patients, of which 500 are PTC samples.

Differential expression analyses were conducted between the Young and Old subgroups, between the non-metastasis and lymph node metastasis subgroups, and between the T1_2 (combined T1 and T2 stages) and T3_4 (combined T3 and T4 stages) groups. Differentially expressed genes (DEGs) between groups were identified using the DESeq2 R package (v1.46.0)19, with significance defined as an adjusted p-value < 0.05 and an absolute fold change greater than 1.2.
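The significance criteria can be expressed as a small filter. The sketch below assumes that "absolute fold change greater than 1.2" is applied symmetrically on the log2 scale, so that a gene counts as downregulated when its fold change is below 1/1.2. The function name and the gene records are hypothetical and are not taken from the study's results.

```python
import math


def classify_deg(log2_fc, padj, fc_cutoff=1.2, padj_cutoff=0.05):
    """Label a gene as 'up', 'down', or 'ns' (not significant).

    A missing adjusted p-value (None) is treated as not significant.
    """
    if padj is None or padj >= padj_cutoff:
        return "ns"
    threshold = math.log2(fc_cutoff)  # about 0.263 for a 1.2-fold change
    if log2_fc > threshold:
        return "up"
    if log2_fc < -threshold:
        return "down"
    return "ns"


# Hypothetical results: gene -> (log2 fold change, adjusted p-value).
results = {
    "GENE_A": (0.50, 0.01),
    "GENE_B": (-0.40, 0.03),
    "GENE_C": (0.10, 0.001),  # significant p-value but fold change too small
    "GENE_D": (1.00, 0.20),   # large fold change but not significant
}
labels = {gene: classify_deg(lfc, padj) for gene, (lfc, padj) in results.items()}
```

Both conditions must hold at once, so a gene with a very small adjusted p-value is still excluded when its fold change falls inside the 1.2-fold band.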

TIDE-based evaluation of selected genes, immune infiltration, and tumor prognosis

The tumor immune dysfunction and exclusion (TIDE)20,21 tool was used to evaluate the association between the selected genes, immune infiltration in the TME, and tumor prognosis.

T-cell dysfunction scores for the 45 genes were computed across five public datasets. Gene expression levels were classified into High and Low groups and further stratified by cytotoxic T lymphocyte (CTL) levels into CTL Top and CTL Bottom groups. The interaction effect between gene expression and CTL levels was assessed using the Cox proportional hazards (CoxPH) model, with z-scores and p-values calculated to determine statistical significance. A higher T-cell dysfunction score indicates that samples with high expression tend to be enriched in the T-cell dysfunction phenotype, whereas a lower score indicates that samples with low expression tend to be enriched in the T-cell functional phenotype.
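The interaction test can be written compactly as follows. This is a sketch based on our reading of how the TIDE framework defines the score; the symbols C, G, a, b, and d are notation introduced here for illustration, and any additional clinical covariates that may enter the model are omitted.

```latex
% C: CTL level of a sample, G: expression of the candidate gene,
% h_0(t): baseline hazard, a, b, d: fitted Cox coefficients.
h(t \mid C, G) = h_0(t)\,\exp\bigl(a\,C + b\,G + d\,C \cdot G\bigr)

% The T-cell dysfunction score is the Wald z-score of the interaction term:
z = \frac{\hat{d}}{\operatorname{SE}(\hat{d})}
```

Under this formulation, a positive z means that the survival benefit associated with high CTL levels weakens as expression of the gene increases, which matches the opposite survival trends described for the High and Low expression groups.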

To further explore immune interactions, Pearson correlation analysis was performed to examine the relationship between selected genes and CTL levels, assessing their impact on T cell activity within the tumor microenvironment.
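For reference, Pearson's r between a gene's expression and CTL levels is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it directly; the function name and the values are hypothetical and serve only to show the calculation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Hypothetical per-sample values.
gene_expression = [1.0, 2.0, 3.0, 4.0]
ctl_level = [2.0, 4.0, 6.0, 8.0]
r_positive = pearson_r(gene_expression, ctl_level)
r_negative = pearson_r(gene_expression, list(reversed(ctl_level)))
```

A value near +1 indicates that expression rises with CTL levels, and a value near −1 indicates the opposite; the coefficient captures linear association only.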

Finally, the survival risk score was determined by calculating the z-score of each gene’s effect on death risk using the CoxPH model.

Proteomics level validation

Proteomics data from thyroid cancer samples22, including 83 pediatric benign (PB), 85 pediatric malignant (PM), and 66 adult malignant (AM) nodules, were used for further validation. All malignant samples were of the PTC pathological type. To reduce statistical bias, 1272 proteins with a missing value rate greater than 85% were excluded, resulting in a final dataset containing 9154 proteins.

Data quality was assessed by evaluating the coefficient of variation (CV) across pooled samples and technical replicates. Missing values were excluded, and the protein abundance was log2-transformed for further analysis. Missing value imputation was performed with the NAguideR R package using the robust sequential imputation method (impsqrob). The batch effect in the protein matrix was corrected using ComBat, an empirical Bayes method implemented in the sva R package23. After imputation and correction, non-positive values were substituted with half of the minimum positive abundance for the corresponding protein. Each pair of technical replicates was averaged to create a single sample representing the mean protein abundance.
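Three of these steps are simple enough to sketch directly: the coefficient of variation, the substitution of non-positive values, and the averaging of technical replicates. The code below is an illustration under assumptions: the CV uses the sample standard deviation, the scale on which the CV was computed is not stated in the text, and all function names and values are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def replace_non_positive(abundances):
    """Replace non-positive values with half the minimum positive value.

    Operates on the abundances of a single protein across samples.
    """
    positives = [v for v in abundances if v > 0]
    if not positives:
        return list(abundances)  # nothing to anchor the replacement on
    floor_value = min(positives) / 2
    return [v if v > 0 else floor_value for v in abundances]


def average_replicates(replicate_a, replicate_b):
    """Average a pair of technical replicates protein by protein."""
    return [(a + b) / 2 for a, b in zip(replicate_a, replicate_b)]


# Hypothetical values for illustration.
cv = coefficient_of_variation([9.0, 10.0, 11.0])
cleaned = replace_non_positive([4.0, -1.0, 0.0, 2.0])
merged = average_replicates([1.0, 3.0], [3.0, 5.0])
```

Using half of the minimum positive value keeps substituted entries below every measured abundance for that protein while remaining positive.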

Differentially expressed proteins (DEPs) were identified with a fold change (FC) > 1.2.