Abstract
We developed a deep residual network (ResNet) framework to classify thyroid cancer differentiation states by integrating multiomic data and interpretability analysis. Our framework incorporated untargeted metabolomic, whole-exome sequencing, and transcriptomic data from 158 thyroid tumors and 57 matched normal tissues, encompassing well-differentiated, poorly differentiated, and anaplastic thyroid cancers. We further examined single-cell RNA sequencing datasets from the Gene Expression Omnibus (GEO) to map key metabolic reprogramming pathways in dedifferentiated thyroid cancer. We systematically integrated transcriptomic data from follicular epithelial-derived thyroid carcinomas across all GEO cohorts, establishing a pan-pathological classification model based on a 10-gene metabolic signature. To complement this approach, a 10-metabolite model was developed via the ResNet architecture, capitalizing on the direct pathophysiological responsiveness of metabolites to tumor progression states. By employing Shapley additive explanations, we highlighted critical metabolic signatures driving differentiation states. Our findings reveal how metabolic shifts underpin thyroid cancer progression, and based on these findings, we propose an accurate, interpretable model that may facilitate early diagnosis and inform clinical decision-making.
Introduction
The incidence of thyroid cancer has been increasing worldwide1,2,3. The most prevalent type is well-differentiated thyroid cancer (WDTC), which includes papillary thyroid cancer (PTC) and follicular thyroid cancer (FTC). Dedifferentiated thyroid cancer (DDTC), including poorly differentiated thyroid cancer (PDTC) and anaplastic thyroid cancer (ATC), is frequently fatal despite its low incidence4. ATC, recognized as the most lethal form, has a median survival of ~5 months and a one-year survival rate of merely 20%5. Fine-needle aspiration (FNA) is generally utilized for the first evaluation. However, this approach can be challenging because of the variable histopathology, leading to a suboptimal accuracy of ~60%. Overall, more precise diagnostic strategies are urgently needed6.
Dedifferentiation, driven by genetic and environmental factors, is a key mechanism in tumor progression5,7. Since metabolic status is closely tied to tumor differentiation, deciphering metabolic differences in different differentiation states is biologically meaningful. Recent studies have applied metabolomics to uncover molecular mechanisms and identify predictive biomarkers for diagnosis and prognosis8,9,10. In our previous study, we conducted a proteogenomic and metabolomic analysis of 102 PTC patients from Fudan Shanghai Cancer Center (FUSCC)11. However, the relevant studies have been limited by the absence of certain pathological subtypes, the insufficient representation of metabolic characteristics and the lack of precise models for dedifferentiation prediction12.
Machine learning has been utilized to analyze and process sophisticated multiomic data. Residual networks (ResNet), a type of deep neural network, enable higher-order feature learning without manual engineering and are well suited to scaling diagnostic and predictive tasks to larger datasets13,14,15. However, the application of deep neural networks to generating diagnostic models remains largely unexplored, offering considerable potential for further research.
Here, we performed untargeted metabolomics, whole-exome sequencing (WES) and transcriptomics on FTC, PTC, PDTC, and ATC samples, for a total of 215 samples. By incorporating Gene Expression Omnibus (GEO) single-cell sequencing datasets, we identified correlations between clinical phenotypes (e.g., age, lymph node metastasis, and distant metastasis) and metabolites, characterized metabolic behaviors that remain persistently activated during dedifferentiation, and elucidated DDTC-specific metabolic reprogramming profiles. To overcome the abovementioned limitations, multiomic strategies increasingly integrate transcriptomic and metabolomic data to capture tumor biology comprehensively. However, batch effects in metabolomics threaten reproducibility, driving the search for robust biomarkers. We addressed this by building a ResNet classifier system on the aggregated GEO and FUSCC datasets (n = 453). Rigorous cross-validation revealed a 10-gene signature that accurately distinguished all differentiation states, with an average accuracy of 92.7%. We utilized Shapley additive explanations (SHAP) analysis to confirm its biological relevance16. Because metabolites represent the final readouts of regulatory networks, we also trained a ResNet model on metabolomic profiles. Despite the limited availability of public datasets, our 10-metabolite (10M) signature achieved a mean area under the curve (AUC) of 0.98 in the FUSCC cohort, effectively separating PDTC from ATC.
To the best of our knowledge, this study represents the first comprehensive multiomic investigation that spans the full spectrum of follicular epithelial-derived thyroid carcinomas. In contrast to previous studies, our work integrates metabolomic, transcriptomic, and genomic data to provide a holistic view of thyroid cancer dedifferentiation. Moreover, we propose novel, multiomic-based deep learning frameworks that enhance classification accuracy and reproducibility—addressing key challenges faced in earlier studies and offering new opportunities for precision medicine in thyroid oncology.
Results
Clinical and demographic overview
To comprehensively profile the patient cohort and evaluate their clinicopathological and genomic features, we conducted an integrated multiomic analysis as summarized in Fig. 1. The clinicopathological information of the patients is shown in Table 1. In summary, the cohort had a mean age of 45.92 (±16.25) years and an average tumor size of 3.52 (±1.97) cm.
Integrated multi-omics and ResNet models delineate dedifferentiation and accurately classify four thyroid cancer subtypes in 158 patients. The illustration was created with a full license on BioRender.com. In this study, a total of 158 patients were recruited from FUSCC, including those with papillary thyroid carcinomas (PTCs, n = 104), follicular thyroid carcinomas (FTCs, n = 24), poorly differentiated thyroid carcinomas (PDTCs, n = 19), and anaplastic thyroid carcinomas (ATCs, n = 11). Using an integrative multiomic approach, including whole-exome sequencing (WES), metabolomics, transcriptomics, and GEO single-cell sequencing, we comprehensively characterized the metabolic reprogramming underlying thyroid cancer dedifferentiation. We built two ResNet classifiers. The 10-MG model was based on a 10-Metabolic genes signature that accurately distinguished all differentiation states. Because metabolites represent the final readouts of regulatory networks, we also trained a ResNet model on metabolomic profiles. Despite the limited availability of public datasets, our 10M signature achieved a mean AUC of 0.98 in the FUSCC cohort, effectively separating PDTC from ATC.
By utilizing high-precision Sanger sequencing, WES, and high-throughput RNA sequencing, we further characterized the cohort, revealing distinct gene mutations that drive cancer development in both WDTC and DDTC (Fig. 2a). Tumor Protein p53 (TP53) and telomerase reverse transcriptase (TERT) promoter mutations (C228T and C250T) were the most prevalent in DDTC, with the TERT C250T mutation significantly linked to radioiodine (RAI) refractoriness and the occurrence of ATC (Fig. 2b) (Supplementary Fig. 1). In contrast, B-Raf proto-oncogene (BRAF) V600E and Lipopolysaccharide-responsive beige-like anchor (LRBA) mutations were more frequently observed in WDTC (Fig. 2a). The total mutation burden in the DDTC group was greater than that in the WDTC group, with higher concentrations of mutations and gene fusions (Supplementary Figs. 2 and 3). The mutation distributions in the gene regions of WDTC and DDTC are displayed in Supplementary Fig. 4.
Genetic alterations reshape lipid-centered metabolism and correlate with key metabolite shifts in tumors. a Genetic profile and associated clinical information of the FUSCC cohort. b TERT C250T mutation was significantly associated with RAI refractoriness and ATC occurrence (filtered by ANOVA. ***p < 0.001, *****p < 0.00001). c The proportions of annotated polar metabolites and lipids in our study (n = 512). FA fatty acid, GL glycerolipid, SP sphingolipid, GP glycerophospholipid, PR prenol lipid, ST sterol lipid. d Differential analysis demonstrating the major gene expression and metabolite changes influenced by the TERT C250T mutation. e GO enrichment analysis showing that TERT C250T led to lipid metabolism changes. f Clustered heatmap revealing the correlation between genomic mutations analyzed by Welch’s t-test. g Welch’s t-test revealing a significant correlation between metabolites and gene fusion. The AC026191.1_SRGAP3 fusion was positively related to NADH abundance (p < 0.05).
Dedifferentiation status drives poor survival, with lipid metabolites adding prognostic insight beyond conventional clinicopathologic factors. a Univariate Cox regression of thyroid cancer patient prognosis at FUSCC, indicating that age (>55 years), pathology type, ETE, ENE, DM, TNM and RR were significant prognostic factors for thyroid cancer (p < 0.05). ETE extrathyroidal extension, ENE extranodal extension, DM distant metastasis, RR RAI refractoriness, CI confidence interval. b Multivariate Cox regression of thyroid cancer patient prognosis at FUSCC showing that differentiation status had the most significant impact among all the aforementioned factors. c Kaplan‒Meier curve showing that PDTC and ATC patients had poor prognoses. P values were calculated via the log-rank test. d Thyroid differentiation scores of different differentiation statuses. TDS scoring failed to differentiate between PDTC and ATC. e Multigroup volcano plots showing correlations between clinical phenotypes and metabolites analyzed by Benjamini‒Hochberg-corrected Mann‒Whitney U tests (q < 0.05, Log2 FC > 2 or Log2 FC < −2). ETE and LNM.No had the strongest correlations with the metabolites (n = 7). f Spearman correlation analysis of clinical phenotypes and metabolites. Both PC (20:1_18:1) and PC (18:1_18:1) were related to age, metastasis, ETE and LNM number (FDR < 0.05). 1-Methylnicotinamide was positively related to ENE.
Compared with WDTC, DDTC exhibited significant associations with large tumor size, extrathyroidal extension (ETE), extranodal extension (ENE), and distant metastasis (DM) (Table 1).
Impact of gene mutations and fusions on cancer metabolism
We annotated 512 polar metabolites and lipids (Fig. 2c). The data quality was checked using internal standards and quality control samples. We grouped the samples on the basis of different TERT mutation types (Fig. 2d). The TERT C250T mutation was associated with alterations in lipid metabolism (Fig. 2e). The adipocytokine signaling pathway, involving Acetyl-CoA Carboxylase Beta, was significantly upregulated, along with changes in adipocyte metabolism (Supplementary Table 1). In addition, Welch’s t-test revealed that carbohydrates, nucleotides, peptides, vitamins, and cofactors exhibited significant correlations across various mutation backgrounds. Notably, PIK3CA mutation had a positive correlation with kynurenine (KYN) and S-adenosylhomocysteine (SAH) levels (q < 0.05). TERT C250T had a positive linear correlation with amino acid, peptide and fatty acid (FA) levels (q < 0.05) (Fig. 2f).
We also analyzed gene fusion events. Utilizing Welch’s t-test, we identified differentially abundant metabolites (q < 0.05) associated with the following gene fusions: HLA_A_HLA_S (n = 38), SMC5_AS1_MAMDC2_AS1 (n = 106), and AC026191.1_SRGAP3 (n = 82) (Supplementary Fig. 5). Notably, the AC026191.1_SRGAP3 fusion was positively related to nicotinamide adenine dinucleotide (NADH) abundance (q < 0.05) (Fig. 2g).
Differentiation states significantly affect overall survival time
Univariate Cox analysis revealed that several factors, including age, pathology type, ETE, ENE, DM, TNM stage, and RAI refractoriness (RR), significantly influenced patients’ overall survival (OS) (Fig. 3a). Multivariate Cox analysis revealed that pathology type, classified by differentiation state, had the most significant impact among all the aforementioned factors, underscoring the importance of further investigation (Fig. 3b). The follow-up duration and censoring rates varied across the four thyroid cancer subtypes, reflecting the natural heterogeneity of clinical progression and follow-up patterns in retrospective studies. As expected, PTC and FTC groups exhibited the longest follow-up durations and highest censoring rates (94.2% and 79.2%, respectively), consistent with their indolent behavior and favorable prognosis. In contrast, the PDTC and ATC groups showed shorter follow-up times and lower censoring rates (21.1% and 18.2%, respectively), reflecting their more aggressive clinical course and higher death rates (Supplementary Table 2). The Kaplan‒Meier survival curve demonstrated that ATC patients had the worst survival outcome, whereas WDTC patients and DDTC patients had markedly different prognoses (p < 0.0001) (Fig. 3c). The traditional thyroid differentiation score (TDS) did not effectively distinguish between PDTC and ATC (Fig. 3d).
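The survival quantities discussed above can be illustrated with a minimal sketch of the Kaplan‒Meier product-limit estimator and the censoring rate. The function names and the toy follow-up data are hypothetical and are not taken from the FUSCC cohort; this is an illustration of the standard estimator, not the analysis pipeline used in the study.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up time per patient (any consistent unit)
    events : 1 if death was observed, 0 if the patient was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    survival, curve = 1.0, []
    for t in sorted(deaths):
        # Patients still under observation just before time t
        at_risk = sum(1 for u in times if u >= t)
        survival *= 1.0 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve

def censoring_rate(events):
    """Fraction of patients without an observed event."""
    return 1.0 - sum(events) / len(events)

# Toy data (hypothetical, not from the study)
times = [2, 3, 3, 5, 8, 12]
events = [1, 1, 0, 1, 0, 0]
curve = kaplan_meier(times, events)
rate = censoring_rate(events)
```

A high censoring rate, as reported for PTC and FTC, means that few deaths enter the product, so the estimated curve stays close to 1.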
To identify metabolites associated with key clinical factors, we performed Benjamini‒Hochberg-corrected Mann‒Whitney U tests, which revealed that ETE and lymph node metastasis numbers (LNM.No) were most strongly associated with seven metabolites, predominantly glycerophospholipids (GPs) (Fig. 3e). Further Spearman rank correlation analysis revealed that the levels of phosphatidylcholine (PC) (20:1_18:1) and PC (18:1_18:1) were significantly correlated with age, DM, ETE, and increased LNM.No (Fig. 3f).
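The filtering rule used in these comparisons (Benjamini‒Hochberg q < 0.05 combined with |Log2 FC| > 2) can be sketched as follows. The p-values are taken as given, for example from per-metabolite Mann‒Whitney U tests; the function names, metabolite labels and numbers are hypothetical illustrations.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

def select_hits(names, pvalues, log2_fc, q_cut=0.05, fc_cut=2.0):
    """Keep features with q < q_cut and |log2 fold change| > fc_cut."""
    q = benjamini_hochberg(pvalues)
    return [n for n, qi, fc in zip(names, q, log2_fc)
            if qi < q_cut and abs(fc) > fc_cut]

# Toy inputs (hypothetical, not from the study)
names = ["m1", "m2", "m3", "m4", "m5"]
pvalues = [0.001, 0.008, 0.039, 0.041, 0.6]
log2_fc = [2.5, -3.1, 2.2, 0.4, 1.0]
qvalues = benjamini_hochberg(pvalues)
hits = select_hits(names, pvalues, log2_fc)
```

In this toy case the third metabolite has a raw p-value below 0.05 and a large fold change, yet it is excluded because its adjusted q-value exceeds the cutoff.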
The metabolic landscape in thyroid cancer patients with different differentiation states
Through sparse partial least-squares discriminant analysis (sPLS-DA), we revealed significant metabolic differences between DDTC and WDTC. Additionally, the orthogonal partial least squares discriminant analysis (OPLS-DA) results for each group also revealed distinct differences (Supplementary Fig. 6). Interestingly, despite their high degree of differentiation, FTC and PTC exhibited varied compositions (Supplementary Fig. 7). We first performed single-sample gene set enrichment analysis (ssGSEA) using the Reactome database. The top 15 pathways with the most significant intergroup differences were selected and visualized in a heatmap on the basis of their normalized enrichment scores (NESs) (Fig. 4a). These pathways were functionally categorized into three types: cellular proliferation, immune infiltration, and metabolic reprogramming. Notably, the thyroxine biosynthesis pathway—critical for normal follicular epithelial cell function—had the highest NES in FTC, followed by PTC, with PDTC and ATC exhibiting the lowest scores. This gradient decline in pathway activity correlates with progressive dedifferentiation, indicating more complete loss of parenchymal cell functionality in advanced thyroid malignancies. Therefore, we selected PTC, which has a relatively high incidence and represents a relatively large proportion, as a representative of WDTC.
Multi-layer analyses reveal stepwise metabolic-pathway activation tracking the transition from PTC to aggressive ATC. a Heatmap visualizing ssGSEA results for thyroid carcinoma samples in the FUSCC cohort, with pathway enrichment scores normalized and represented as normalized enrichment scores (NESs). b Multigroup differential analysis showing the common significantly varied component (q < 0.05, Log2 FC > 2 or Log2 FC < −2). Metabolomic data were analyzed via Benjamini‒Hochberg-corrected Mann‒Whitney U tests (left panel). Transcriptomic data were analyzed via DEseq2 (right panel). c Linear clustering showing the polar metabolites (n = 7) and lipid categories (n = 2) with an increasing tendency during dedifferentiation. d Pathway enrichment in KEGG and SMPDB (q < 0.05). Red represents persistently changed pathways, and green and blue represent dedifferentiation-specific pathways. e UMAP plot visualizing the single-cell transcriptomics data from untreated ATC samples in the GSE193581 database. f Active metabolic pathways in epithelial cells from untreated PTC and ATC samples in the GSE193581 cohort.
Benjamini‒Hochberg-corrected Mann‒Whitney U tests revealed that the abundances of dihydroxyacetone phosphate (DHAP), 3-phosphoglyceric acid (3-PGA), N-acetylglucosamine-6-phosphate (GlcNAc-6P), NADH, nicotinamide adenine dinucleotide phosphate (NADPH), inorganic pyrophosphate (PPi) and citrate (CA) significantly changed during dedifferentiation (Fig. 4b left panel). Notably, NADPH and NADH showed significantly increased abundance in WDTC but markedly reduced abundance in DDTC. The oxidized form of nicotinamide adenine dinucleotide (NAD+)/NADH ratio reflects the cellular energy status, whereas the NAD+/NADPH ratio indicates redox capacity. Statistical analysis revealed a significant increase in the NAD+/NADH ratio in DDTC, accompanied by a decrease in the NAD+/NADPH ratio across tumor subtypes (Supplementary Fig. 8).
At the gene level, ATC exhibited the most pronounced changes, indicating the most significant molecular divergence from normal tissues. UGT2B11, ADH4, ALDOB, ENO3, and 25 other genes may play critical roles in cancer progression (Fig. 4b right panel). We profiled the metabolic genes and visualized consensus alterations across differentiation states using a Sankey diagram, identifying 25 genes with functional convergence in glycolytic reprogramming and retinoic acid metabolism. Notably, the enriched pathways included P450-mediated drug metabolism and protein glycosylation. As monooxygenase family members, P450 enzymes drive oxidative reactions linked to cytotoxic effects through redox imbalance (Supplementary Fig. 9).
In DDTC, there was an overall increase in fatty acid and amino acid metabolism (KYN, 2-aminobutyric acid, proline) (Supplementary Fig. 10). Using linear clustering methods, we identified a series of metabolites that accumulated throughout dedifferentiation. The abundances of polar metabolites (n = 7) (left panel), such as KYN and SAH, and lipid metabolites (n = 7) (right panel), such as lysophosphatidylcholine-O (LPC-O) and free fatty acids (FFAs), tended to increase (Fig. 4c). This result aligns with our research on the correlation between phenotypes and metabolites.
Pathway enrichment analysis based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) and the Small Molecule Pathway Database (SMPDB) revealed that cancer metabolic reprogramming primarily encompassed core pathways (n = 13). Retinol metabolism, glycolysis, linoleic acid metabolism, histidine metabolism, tyrosine metabolism and beta-alanine metabolism showed persistent changes. Phenylalanine metabolism specifically plays a role in DDTC, whereas tryptophan metabolism particularly contributes to ATC (Fig. 4d).
Single-cell analysis allows more precise identification of distinct cell populations. Therefore, we utilized single-cell transcriptomic data from the public database GSE193581. Patients who had received immunotherapy were excluded from the analysis (Fig. 4e). Pathway enrichment analysis was performed specifically on epithelial cells. Our analysis revealed that glycolysis and oxidative phosphorylation were active within the tumor cells (Fig. 4f). Notably, these metabolic pathways were more active in PTC, indicating enhanced energy production in this cancer type.
Tryptophan metabolic pathway in anaplastic thyroid cancer
KYN was found to accumulate in ATC, and the tryptophan metabolic pathway played a critical role in the metabolism of ATC, which suggests the need for further analysis (Fig. 5a). By combining pathway diagrams with differential analysis, we found that in ATC, the upregulation of TDO2 led to increased conversion of tryptophan to KYN. This was accompanied by the downregulation of DDC, which mediates the entry of tryptophan into the alternative pathway leading to indole acetate.
ATC up-regulates the kynurenine arm of tryptophan metabolism, reshaping immune-stromal interactions within the tumor microenvironment. a Pathway diagrams revealing the dedifferentiation-specific regulatory genes and metabolites involved in tryptophan metabolism. Squares represent the average abundances and expression in different tumor types. Uncharacterized metabolites and mRNAs are not included in this diagram. TDO2 and kynurenine were significantly upregulated in ATC, whereas DDC was markedly downregulated in ATC. b Volcano plot showing differentially expressed genes between samples with high and low Trp/Kyn ratios. c Bar chart of pathway enrichment analysis for differentially expressed genes. d Scatter plots showing the expression levels of KYNU and TDO2 across various cell types. e UMAP plots depicting the spatial expression patterns of KYNU and TDO2 in single-cell RNA sequencing data, highlighting their localization within specific regions of the tumor microenvironment. f Elevated expression of KYNU and TDO2 was particularly observed in specific immune and stromal cell populations.
On the basis of the tryptophan/kynurenine ratio, we divided the samples into the high and low groups and performed Gene Ontology (GO) pathway enrichment analysis (Fig. 5b). We found that the top-ranked pathways were clustered around processes associated with the immune microenvironment (Fig. 5c). On the basis of our previous metabolic pathway analysis of ATC epithelial cells in which tryptophan metabolism was not observed, we hypothesized that the tryptophan pathway functions primarily within the immune microenvironment. By analyzing the metabolism-regulating genes associated with KYN that were also significantly altered during dedifferentiation, we identified Kynureninase (KYNU) and Tryptophan 2,3-dioxygenase (TDO2) (Fig. 5d). Single-cell analysis revealed that KYNU was predominantly upregulated in monocytes, macrophages, and dendritic cells (DCs), whereas TDO2 was upregulated in fibroblasts (Fig. 5e). Additionally, tryptophan metabolism was strongest in DCs, monocytes, and fibroblasts (Fig. 5f).
Metabolic reprogramming profiles of dedifferentiation
To our knowledge, PDTC and ATC share quite similar metabolic reprogramming characteristics. Through integrative analysis of mRNA expression and metabolite abundance, we identified unique metabolic reprogramming changes associated with dedifferentiation. We selected significantly changed metabolites and mRNAs (p < 0.05) and visualized them using a four-quadrant plot according to their function (Supplementary Table 3). The upregulation of GAMT directly impacted SAH, leading to an imbalance in methylation. Lysophosphatidylcholine Acyltransferase 1 (LPCAT1) upregulation led to an increase in LPC-O (18:1) (Fig. 6a, b).
Coordinated mRNA–metabolite analysis pinpoints a metabolic rewiring distinguishing dedifferentiated from well-differentiated thyroid carcinomas. a The mRNA expression of metabolic genes paired with the abundances of paired polar metabolites. Red points represent significant changes in both the metabolite and the mRNA. b The mRNA expression of metabolic genes paired with the abundances of paired lipid metabolites. Red points represent significant changes in both the metabolite and the mRNA. c The correlation network centered on NT5E and PNP between WDTC and DDTC. Squares represent metabolites, and circles represent mRNAs.
A network was generated on the basis of the differential analysis of WDTC and DDTC. The metabolites involved in nucleotide metabolism (such as adenosine monophosphate, adenosine diphosphate, and guanosine monophosphate), amino acid metabolism (such as l-arginine) and other essential biological pathways were scattered across the network. PNP and NT5E appeared to be central nodes in the network. This network highlighted the interconnections between key metabolites and mRNAs, reflecting metabolic pathways that are likely critical in cellular processes such as energy production, purine metabolism, immune regulation, and lipid metabolism (Fig. 6c).
10 Metabolic gene model
In our previous study, we screened dedifferentiation-associated metabolites using transcriptomic data, but that work did not encompass all histological subtypes and involved a relatively small sample size10. To investigate the predictive relevance of transcriptomic features in multiclass classification tasks, we designed a comprehensive workflow that integrates data preprocessing, deep learning-based modeling, and interpretability analysis. We selected 30 genes critical within the dedifferentiation network from the aforementioned study as candidate genes for model training. We first integrated all publicly available thyroid carcinoma cohorts from the GEO database, including five datasets (GSE29265, GSE33630, GSE53157, GSE65144, and GSE76039) (Fig. 7a). This significantly increased the proportion of PDTC samples in the training cohort (Fig. 7b). After applying batch effect correction to the transcriptomic data from these cohorts, we adopted a stratified splitting approach (2:1 training-to-test ratio) to partition the dataset while preserving the original distribution of tumor subtypes, given the limited sample size of the integrated cohort. We then employed a supervised classification framework based on a one-dimensional ResNet, which was trained and validated using stratified 9-fold cross-validation. This ensured that each class was proportionally represented across folds and enhanced the robustness of the performance estimates. To avoid overfitting and improve model generalizability, early stopping and dropout mechanisms were incorporated during training.
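The partitioning scheme described above (a stratified 2:1 split followed by stratified 9-fold cross-validation on the training portion) can be sketched in plain Python. The class sizes are those reported for the integrated cohort (n = 453); the function names, the random seed and the round-robin fold assignment are assumptions made for illustration and may differ from the implementation used in the study.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=1 / 3, seed=0):
    """Split sample indices into train/test while keeping class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

def stratified_folds(indices, labels, k=9, seed=0):
    """Assign indices to k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds

# Class sizes reported for the integrated GEO + FUSCC cohort
counts = {"Normal": 139, "PTC": 179, "FTC": 27, "PDTC": 44, "ATC": 64}
labels = [name for name, n in counts.items() for _ in range(n)]
train, test = stratified_split(labels)
folds = stratified_folds(train, labels)
```

Under these assumptions, the smallest class (FTC, n = 27) contributes 9 test samples and 18 training samples, that is, 2 per fold, which shows how few samples of a rare subtype each fold contains even with stratification.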
10 MG ResNet classifier accurately stratifies thyroid tumors across five differentiation states with clear SHAP-based biomarker insights. a Schematic representation of ResNet model development and validation for distinguishing the differentiation states of thyroid cancer. This model demonstrated high accuracy and robustness in distinguishing thyroid cancer differentiation states. b Pie chart depicting the composition of the training and test sets integrated from the FUSCC and public GSE cohorts, with transcriptomic data processed for batch effect removal. The cohort included 453 subjects: normal controls (n = 139), PTC patients (n = 179), FTC patients (n = 27), PDTC patients (n = 44), and ATC patients (n = 64). Sample origins were annotated for the PDTC and ATC subgroups. c Bar chart showing the average SHAP values for the top 10 metabolic genes contributing to the ResNet model’s predictions. The most influential genes included GPX3, TPO, CYP1B1, and PDE8B, reflecting their significant roles in differentiating thyroid cancer states. d Embedding plot displaying sample clustering on the basis of expression levels of the top two SHAP-ranked features. e Beeswarm plot illustrating the contribution levels of 10MG genes across differentiation states, quantified through SHAP value analysis. f ROC curve for the test cohort. The AUROC ranged from 0.87 to 0.96, demonstrating the model’s robustness in distinguishing differentiation states. g Bar plots demonstrating the contribution levels of 10MG genes across differentiation states, quantified by SHAP values.
Following model training, we performed a comprehensive evaluation using both cross-validation folds and a held-out test subset drawn from the merged internal and public datasets to assess generalizability. SHAP analysis was systematically implemented to quantify feature contributions across classification categories. Through iterative feature pruning guided by SHAP importance rankings during each training cycle, we observed that the average accuracy on the test set remained stable above 90%, even when the feature set was reduced to 10 genes, demonstrating optimal predictive efficiency. Feature selection was also stable, with the same core set of 10 genes identified across all experimental replicates: Glutathione Peroxidase 3 (GPX3), Thyroid Peroxidase (TPO), Cytochrome P450 Family 1 Subfamily B Member 1 (CYP1B1), Phosphodiesterase 8B (PDE8B), Alpha-1,3-Mannosyl-Glycoprotein 4-β-N-Acetylglucosaminyltransferase B (MGAT4B), Dehydrogenase/Reductase 3 (DHRS3), Arachidonate 5-Lipoxygenase (ALOX5), Phosphodiesterase 5A (PDE5A), Ecto-5′-Nucleotidase (NT5E) and ST3 β-Galactoside α-2,3-Sialyltransferase 5 (ST3GAL5) (Fig. 7c). Subsequently, we visualized swarm plots stratified by differentiation state and constructed SHAP value heatmaps, systematically delineating the magnitude of feature contributions across distinct differentiation endpoints (Fig. 7d, e).
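To make the idea of SHAP-guided pruning concrete, the sketch below computes exact Shapley values for a three-feature toy model by enumerating all feature coalitions and then drops the least important feature. This is only a conceptual illustration: the linear model, its weights, the input and the baseline are hypothetical, and for a deep network such as a ResNet, SHAP values are estimated with approximation algorithms rather than by exhaustive enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(n_features, value):
    """Exact Shapley values for a set function value(frozenset) -> float."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley formula
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Hypothetical toy model: a linear score over three features
weights = [2.0, -1.0, 0.5]
x = [1.0, 3.0, 2.0]          # sample being explained
baseline = [0.0, 1.0, 2.0]   # reference values for "absent" features

def value(subset):
    """Model output with features outside `subset` set to the baseline."""
    mixed = [x[i] if i in subset else baseline[i] for i in range(len(x))]
    return sum(w * v for w, v in zip(weights, mixed))

phi = shapley_values(3, value)
# One pruning step: keep the two features with the largest |Shapley value|
kept = sorted(range(3), key=lambda i: abs(phi[i]), reverse=True)[:2]
```

For a linear model each Shapley value reduces to the weight times the deviation from the baseline, so the third feature, which equals its baseline, receives zero attribution and is the one pruned.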
The final model incorporating this refined 10-gene signature was retrained and validated through receiver operating characteristic (ROC) curve analysis (Fig. 7f). The AUC values across subtypes were as follows: Normal AUC = 0.92, FTC AUC = 0.87, PTC AUC = 0.95, PDTC AUC = 0.92, and ATC AUC = 0.96. We delineated gene-specific contributions across tumor differentiation states (Fig. 7g).
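For a multiclass classifier, a per-subtype AUC is commonly obtained in a one-vs-rest fashion. The sketch below computes it from the rank interpretation of the AUC, namely the probability that a sample of the target class scores higher than a sample of any other class. The scores and labels are hypothetical; only the dictionary of reported AUCs comes from the text, and its unweighted mean is simple arithmetic rather than a figure stated by the authors.

```python
def auc_one_vs_rest(scores, labels, positive):
    """AUC as P(score of positive > score of negative), ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predicted probabilities of the ATC class (hypothetical)
labels = ["ATC", "ATC", "PTC", "PDTC", "PTC"]
atc_scores = [0.9, 0.4, 0.3, 0.6, 0.1]
atc_auc = auc_one_vs_rest(atc_scores, labels, "ATC")

# Per-class AUCs as reported for the test cohort
reported = {"Normal": 0.92, "FTC": 0.87, "PTC": 0.95, "PDTC": 0.92, "ATC": 0.96}
macro_auc = sum(reported.values()) / len(reported)
```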
10 Metabolite model
Metabolites represent the terminal readouts of complex tumor regulatory networks. To leverage this biological hierarchy, we developed a diagnostic framework using a ResNet1D neural network trained on metabolomic profiles (including both polar metabolites and lipid species) to predict tumor differentiation states. To enable a 2:1 split between training and test sets, we curated a cohort of 210 FUSCC samples (Fig. 8a). Given the scarcity of publicly available metabolomic datasets, all model development was performed within this cohort, using stratified sampling to preserve class proportions during partitioning (Fig. 8b, c). Iterative feature pruning guided by SHAP importance analysis revealed that a refined 10-metabolite signature achieved superior diagnostic performance (mean AUC 0.98), demonstrating significant improvements in distinguishing differentiation states (Fig. 8d, e).
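The defining element of a one-dimensional ResNet is the residual (skip) connection, which adds a block's input back to its convolutional output. The sketch below shows a single-channel residual block in plain Python. The layer layout, kernel sizes and absence of normalization are assumptions made for illustration; the text does not specify the architecture details of the ResNet1D model used here.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; output length equals input length.

    Assumes an odd kernel length.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]

def relu(values):
    return [max(0.0, v) for v in values]

def residual_block(x, kernel_a, kernel_b):
    """y = relu(x + conv_b(relu(conv_a(x)))): the skip path adds x back."""
    hidden = relu(conv1d_same(x, kernel_a))
    update = conv1d_same(hidden, kernel_b)
    return relu([xi + ui for xi, ui in zip(x, update)])

# Hypothetical feature vector (e.g., four scaled metabolite abundances)
profile = [0.2, 1.5, 0.0, 0.7]
zero = [0.0, 0.0, 0.0]
unchanged = residual_block(profile, zero, zero)
```

With all-zero kernels the block passes a non-negative input through unchanged, which illustrates why residual blocks can default to an identity mapping and are easier to stack into deep networks.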
10-M ResNet classifier cleanly separated PDTC and ATC from other thyroid states in metabolomic space. a Workflow for constructing the 10-M classifier neural network model to stratify thyroid cancer differentiation status, with feature selection based on SHAP value rankings. The model was developed using the FUSCC cohort, comprising normal (n = 57), FTC (n = 23), PTC (n = 103), PDTC (n = 17) and ATC (n = 10) samples. b Bar chart showing the average SHAP values for the top 10 metabolites contributing to the ResNet model’s predictions. The most influential metabolites include PC (17:0_24:0), PC (36:4), Hex1Cer (d18:1_22:0), N-acetylaspartic acid, ADP-ribose, ribose-5-phosphate, 1-methyl-nicotinamide, PC (32:1), UDP-glucose and ChE (20:3). c ROC curve for the validation cohort, showing an average AUROC of 0.98. d The dimensionality reduction embedding plot illustrates the model’s accuracy in predicting PDTC. e The dimensionality reduction embedding plot illustrates the model’s accuracy in predicting ATC.
Prior studies have consistently demonstrated the histological and molecular convergence between PDTC and ATC, which manifests as overlapping transcriptional profiles in traditional TDS evaluations. However, our 10-M model achieved exceptional discriminative capacity. Dimensionality reduction visualization demonstrated that PDTC and ATC could be accurately distinguished using the 10-metabolite model.
Discussion
Although the vast majority of thyroid cancers appear to be well differentiated with extremely promising prognoses, there are dedifferentiated types that appear to be lethal1,3,6. DDTC is defined here as encompassing both PDTC and ATC, reflecting a spectrum of progressive dedifferentiation and increasing malignancy7. The primary clinical hurdle in DDTC management lies in its diagnostic complexity6. While thyroid cancer diagnosis typically relies on fine-needle aspiration biopsy (FNAB) or core-needle biopsy (CNB), the limited cellular yield and poorly differentiated cytomorphology of DDTC often lead to inconclusive results. Diagnostic accuracy further decreases when samples contain necrosis or inflammatory infiltrates. Moreover, the heterogeneous histopathology of DDTC frequently overlaps with that of other aggressive malignancies, such as thyroid lymphoma, necessitating ancillary techniques such as immunohistochemistry and molecular profiling for definitive diagnosis. In accordance with the 2021 ATA guidelines, DDTC treatment combines surgery, radiotherapy, chemotherapy, and targeted therapies6. However, its aggressive biology and rapid progression often preclude curative resection. For locally advanced cases (stage IVA/B), multimodal therapy (surgery with adjuvant chemoradiation) may prolong survival, whereas distant metastases (stage IVC) limit surgical utility, shifting the focus to palliative radiotherapy and systemic therapies.
In this study, we employed untargeted metabolomics, WES and transcriptomics to investigate the different differentiation states of thyroid cancer. By combining these data with GEO single-cell sequencing datasets, we delineated the distinct metabolic landscape of different differentiation states. Moreover, we leveraged the power of the ResNet algorithm to distinguish thyroid cancer differentiation states from both transcriptomic and metabolomic perspectives. The 10-metabolic-gene (10MG) model was further validated in an external GEO cohort17,18,19. The model achieved excellent sensitivity and robustness, with an average accuracy of 92.7%, highlighting its ability to predict differentiation states. In the future, commercial kits and simplified instruments could be developed on the basis of the 10-M and 10MG models to aid in the early detection of dedifferentiation states, support clinical decision-making through patient risk stratification, and ultimately promote precision medicine and clinical applications.
Our study integrated clinical information with intricate omics data. For example, we revealed that GPs, such as PC (20:1_18:1), were closely associated with metastatic characteristics. It has been reported that GPs help regulate cell membrane fluidity, signaling pathways, and the tumor microenvironment, supporting tumor cell migration, invasion, and metabolic adaptability20,21,22. A significant association between 1-methyl-nicotinamide and ENE was also found, implying an imbalance of methylation in thyroid cancer23.
Because WES has insufficient sequencing depth in high-GC regions and at genomic loci with complex structural variations, which may compromise its ability to comprehensively detect low-frequency or heterozygous mutations, we additionally performed targeted sequencing of candidate genes in tumor samples. The integration of these two sequencing platforms enabled the acquisition of more complete and precise genomic variation data, leading to the successful construction of a genomic alteration profile for the FUSCC cohort. Notably, we identified AHNAK and DICER1 mutations in undifferentiated carcinoma, including a novel missense mutation site in the DICER1 gene. Furthermore, this study reaffirmed the driving roles of two classic genomic events, TP53 nonsense mutations and TERT promoter mutations, in tumor dedifferentiation within our cohort24,25.
Gene fusion events are chromosomal rearrangements that abnormally merge two distinct genes, potentially generating novel chimeric genes or altering the expression patterns of parental genes. These molecular aberrations may produce oncogenic fusion proteins capable of disrupting cellular functions26. Fusions involving dedifferentiation-associated genes such as NTRK, RET, and ALK are rare in thyroid carcinoma6,27 and are characteristically mutually exclusive with other oncogenic driver mutations. This observation may explain the predominance of patients with conventional driver mutations over those harboring dedifferentiation-specific fusion events in the FUSCC cohort, reflecting current clinical diagnostic practices that prioritize the detection of key mutational biomarkers in WDTC.
Metabolic profiling using sPLS-DA revealed distinct clustering of metabolic features across thyroid cancer differentiation states. Notably, NADPH and NADH, which are critical contributors to the cellular redox capacity, were depleted in all the subtypes26,28. However, phenylalanine metabolism was aberrantly activated in DDTC, whereas tryptophan metabolism showed specific activation only in ATC29,30,31.
Cellular biosynthesis relies heavily on reductive reactions, necessitating substantial redox equivalents9,11,32. NADPH serves as the primary cellular reductant and is regenerated from NADP+ via metabolic activity. It supplies hydrogen for biosynthesis and sustains antioxidant defenses. In contrast, NADH primarily shuttles electrons to the mitochondrial electron transport chain for adenosine triphosphate (ATP) production. Proliferating cells balance carbon allocation, partitioning glucose between energy-generating catabolism and NADPH-producing anabolism. NADH synthesis occurs via glycolysis, the tricarboxylic acid cycle, and fatty acid β-oxidation, with consumption linked to ATP production and lactate generation28,33. NADPH is synthesized predominantly through the pentose phosphate pathway and is utilized for gluconeogenesis and glutathione recycling. In ATC, depletion of both NADH and NADPH indicates compromised redox buffering during rapid tumor growth, where accelerated energy demands and gluconeogenic stress override compensatory mechanisms, collectively triggering the collapse of the antioxidant system. Sustained differential pathway analysis highlighted metabolic reprogramming strategies23,34. WDTC upregulated histidine metabolism, β-alanine metabolism, and glycolysis to meet proliferation demands. Conversely, DDTC shifted toward phenylalanine metabolism and tryptophan–KYN axis activation, reflecting an evolutionary shift from differentiation-dependent energetics to survival-driven immunometabolic adaptation. Dynamic glycolytic dysregulation further validated this pattern.
We established a three-dimensional analytical framework encompassing ‘genetic mutation–transcriptional regulation–metabolic remodeling’ to systematically dissect the dynamic evolution of metabolic networks during dedifferentiation. By leveraging multiomic integration, we revealed spatiotemporally specific interactions between thyroid cancer differentiation hierarchies and metabolic-immune microenvironments. Notably, PDTC, as a critical transitional state in the differentiation continuum, exhibited metabolic profiles closer to those of ATC than to those of well-differentiated subtypes. Consequently, we classified thyroid cancers into two broad states, WDTC and DDTC, to analyze representative metabolic components and core regulatory genes during dedifferentiation from a macroscopic perspective. This approach facilitated the systematic construction of a core metabolic regulatory network, laying a molecular foundation for developing multiomic classifiers based on metabolic biomarkers and identifying clinically actionable metabolic checkpoints. Using the Recon3D database, we matched differentially altered metabolites and metabolic genes under dedifferentiation conditions and visualized them via a four-quadrant matrix35. These components were further modeled as interconnected network nodes, enabling comprehensive characterization of metabolic behaviors in thyroid cancer. The ATA guidelines emphasize diagnostic challenges in DDTC due to the limited number of biopsy samples, high necrotic content, and pathological heterogeneity, which delay early intervention6. To address this, we focused on constructing classifiers to distinguish tumor differentiation states. Our metabolomic models incorporate both lipid metabolites, which reflect chronic biological adaptations such as membrane remodeling and immune evasion, and polar metabolites, which capture acute changes in energy metabolism. 
This dual approach enhances the resolution of metabolic heterogeneity and improves sensitivity to differentiation continua, highlighting the potential of metabolomics for the future development of noninvasive diagnostic tools. Serum lipidomic profiles, which are more stable than tissue-based measurements, further support the translational viability of hybrid metabolite models. Conventional machine learning classifiers have shown limited performance. Given that ResNet has advantages in feature learning and dimensionality reduction over traditional machine learning models, we selected this architecture for predictive modeling. Metabolomics captures real-time physiological states with high stability in plasma, making it ideal for developing noninvasive diagnostic models. In contrast, transcriptomics exhibits lower susceptibility to batch effects and reduced assay development costs, coupled with extensive external public cohorts available for validation. These complementary strengths prompted us to develop our dual-dimensional modeling approach. We implemented a ResNet neural network, employing stratified sampling to ensure a proportional representation of PDTC and ATC in each training epoch while increasing DDTC weighting in the validation sets. Because ResNet has inherent limitations in feature interpretability, we integrated SHAP values to quantify biomarker contributions. The 10-M model exhibited superior discriminative power for PDTC versus ATC, outperforming conventional 16-gene TDS classifiers. Both the sensitivity and robustness of the 10MG model were validated in external cohorts. Because they operate independently, the 10-M and 10MG models circumvent the technical complexities of data integration while allowing flexible clinical deployment. However, broader independent validation remains essential to confirm classifier robustness given current sample size constraints.
Future studies will enhance the robustness of the 10-M model through an expanded collection of patient-derived metabolomic data.
In summary, we combined untargeted metabolomics, WES, and transcriptomics to investigate the metabolic landscape of thyroid cancer with varying differentiation states. By integrating GEO single-cell sequencing datasets, we identified distinct metabolic reprogramming patterns associated with dedifferentiation. This study provides the first comprehensive analysis of follicular epithelial thyroid tumors across all histopathological subtypes by integrating neural networks with publicly available GEO datasets. We developed dual transcriptomic‒metabolomic classification frameworks and implemented SHAP analysis to clarify feature contributions. These insights elucidate the biological complexity of thyroid malignancies and provide clinically actionable frameworks for precision oncology.
Methods
Ethics approval and patient consent
This study was approved by the Institutional Review Board of Fudan University Shanghai Cancer Center, Shanghai, China (FUSCC-IRB No. 05043-4-2307E). Written informed consent was obtained from all participants for the use of their specimens and clinical data. All procedures followed the ethical standards of the approving committee and the principles of the Declaration of Helsinki and its later amendments.
Clinical sample collection
All procedures adhered to the ethical standards of our institutional research committee and the 1964 Helsinki Declaration. Thyroid cancer samples were collected from patients at FUSCC between October 2014 and July 2022, in accordance with FUSCC Institutional Tissue Bank protocols. To avoid confounding molecular signals induced by immunotherapy, which is known to cause significant transcriptional and metabolic reprogramming, patients who had received any form of immunotherapy prior to sample collection were excluded. Our study thus included a total of 158 patients diagnosed with thyroid cancer, comprising 104 PTCs, 24 FTCs, 19 PDTCs, and 11 ATCs (Fig. 1). The PTC cohort in the present study partially overlapped with that described in our previous study11. The remaining PTC samples, together with the follicular (FTC, n = 24), poorly differentiated (PDTC, n = 19), and anaplastic (ATC, n = 11) samples, were newly collected for this work, thereby expanding our cohort to include histological subtypes not covered in the earlier study. Samples were processed into fresh-frozen and formalin-fixed, paraffin-embedded sections, stained with hematoxylin and eosin, and independently evaluated by expert pathologists. Inclusion criteria required ≥80% tumor cell nuclei, ≥80% overall cellularity, and ≤20% necrosis. Clinical parameters, including TNM stage and lymph node involvement, were recorded. Written informed consent was obtained from all patients, and the study was approved by the FUSCC Ethical Committee.
DNA library preparation and whole-exome sequencing (WES)
DNA libraries for WES were prepared from 0.4 μg of genomic DNA using the Agilent SureSelect Human All Exon V6 Kit. After DNA end repair, A-tailing, and adaptor ligation, polymerase chain reaction (PCR) was performed to enrich fragments with adaptors. Biotin-labeled probes and streptavidin-coated beads were then used to capture the exonic regions, followed by a second PCR to introduce index tags. Final libraries were purified using Agencourt AMPure XP beads, quantified using a Qubit 3.0 fluorometer and real-time PCR, and subsequently sequenced on an Illumina platform with 150 bp paired-end reads.
RNA library preparation and RNA-seq analysis
Total RNA was extracted and assessed for integrity using a Bioanalyzer 2100 system (Agilent Technologies), with samples having an RNA Integrity Number (RIN) ≥ 7.0 deemed suitable for downstream processing. Approximately 1 μg of total RNA per sample was used as input for mRNA isolation with poly-T oligo-attached magnetic beads. Library preparation was performed using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (New England Biolabs), following the manufacturer’s protocol.
Purified mRNA was fragmented in First Strand Synthesis Reaction Buffer, and first-strand cDNA synthesis was carried out using random hexamers and M-MuLV Reverse Transcriptase. Second-strand cDNA was generated using DNA Polymerase I and RNase H. The resulting double-stranded cDNA fragments were then end-repaired, adenylated at the 3’ ends, and ligated to Illumina sequencing adaptors. Target-sized cDNA fragments (~370–420 bp) were purified using the AMPure XP system and subsequently PCR-amplified to generate the final libraries. Library quality was confirmed using a Bioanalyzer 2100, and libraries were sequenced on the Illumina NovaSeq 6000 platform to produce 150 bp paired-end reads, with a mean yield of ~40–50 million reads per sample and an average coverage of approximately 100×.
For downstream data analysis, raw sequencing reads were aligned to the human reference genome (GRCh38/hg38) using HISAT2 (v2.0.5). Gene-level quantification was performed using HTSeq (v2.0.2) to obtain raw read counts. The resulting gene expression matrix was normalized using the counts function from DESeq2 (v1.26.0) with the parameter normalized=TRUE. Differential expression analysis was then conducted using DESeq2, and genes with an adjusted p-value < 0.05 and a defined fold-change threshold were considered significantly differentially expressed.
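The normalization step can be illustrated with a simplified sketch of the median-of-ratios idea that DESeq2 uses to estimate per-sample size factors. This is not the DESeq2 implementation and not the code used in this study; the function name `size_factors` and the toy count matrix are assumptions made purely for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw read counts (one per sample).
    Genes containing a zero count are skipped, because their geometric
    mean is zero and the ratio is undefined.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if min(gene) <= 0:
            continue
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


# Toy matrix (hypothetical): sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
sf = size_factors(toy_counts)

# Normalized counts are raw counts divided by the sample's size factor.
normalized = [[c / s for c, s in zip(gene, sf)] for gene in toy_counts]
```

After normalization, the first three toy genes have equal values in both samples, which is the intended effect of correcting for sequencing depth.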
Gene fusion
Gene fusions were detected from the RNA-seq data using Arriba (v2.3.0) and STAR-Fusion (v1.4.0, https://github.com/STAR-Fusion/STAR-Fusion/wiki). The results from the two methods were further annotated and filtered on the basis of annoFuseData (https://github.com/d3b-center/annoFuseData) and annoFuse (v0.90.0), and only fusions with JunctionReadCount > 3 and rated as high or medium confidence were retained.
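The retention rule can be expressed as a simple filter. The sketch below is illustrative only: the record layout, the field names `junction_read_count` and `confidence`, the confidence labels, and the placeholder gene pairs are assumptions for this example, not the actual output columns of Arriba, STAR-Fusion, or annoFuse.

```python
from dataclasses import dataclass


@dataclass
class FusionCall:
    gene_pair: str            # placeholder identifier, e.g. "GENE_A--GENE_B"
    junction_read_count: int  # reads spanning the fusion junction
    confidence: str           # assumed labels: "high", "medium", or "low"


def keep_fusion(call, min_junction_reads=3):
    """Retain calls with more than `min_junction_reads` junction reads
    and a high or medium confidence rating."""
    return (call.junction_read_count > min_junction_reads
            and call.confidence in {"high", "medium"})


# Hypothetical calls, not data from the study.
calls = [
    FusionCall("GENE_A--GENE_B", 12, "high"),
    FusionCall("GENE_C--GENE_D", 3, "high"),    # dropped: count is not > 3
    FusionCall("GENE_E--GENE_F", 8, "low"),     # dropped: low confidence
    FusionCall("GENE_G--GENE_H", 5, "medium"),
]
retained = [c for c in calls if keep_fusion(c)]
```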
Metabolomic analysis and mass spectrometry
Metabolite extraction and mass spectrometry analyses were conducted according to previously described methods11. Briefly, metabolites were extracted using a chloroform/methanol/water system with subsequent phase separation and centrifugation. The hydrophilic and hydrophobic fractions were analyzed by capillary electrophoresis–mass spectrometry (CE-MS) and liquid chromatography–mass spectrometry (LC–MS), respectively. Quality control (QC) samples, prepared by pooling aliquots from each experimental sample, were interspersed throughout the analyses to monitor system stability.
Mass spectrometry analyses were performed using a CE system (Agilent) coupled to a time-of-flight mass spectrometer (TOF-MS) in both cation and anion modes. LC–MS analyses utilized an ACQUITY UPLC system coupled to a TripleTOF™ 5600+ spectrometer (Sciex), with both positive and negative electrospray ionization modes employed.
Metabolite identification relied on pre-analyzed metabolite standard libraries from Human Metabolome Technologies (HMT) for CE-MS data, and on comprehensive databases, including the Human Metabolome Database (HMDB), the Metabolite and Tandem MS Database (METLIN), and Lipid Metabolites and Pathways Strategy (LIPID MAPS), for LC–MS data. Data were normalized using internal standards and tissue weights to reduce analytical variability.
WES data analysis
Adaptor sequences and low-quality reads from WES data were removed using Trimmomatic (v0.39), and data quality was assessed with FastQC (v0.11.9). The sequencing reads were aligned to the human genome (GRCh38/hg38) using Burrows–Wheeler Aligner (v0.7.17) and Sequence Alignment/Map tools (v1.8). Somatic mutations in tumor samples with matched normal samples were identified via VarScan2 (v2.4.4), whereas variants in tumor-only samples were identified using VarScan2’s mpileup2cns function with stringent filters. Variants were annotated via ANNOVAR (2020-06-08) using multiple databases, including refGene, ExAC03, COSMIC70, avsnp147, 1000 Genomes (2015_08), ESP6500, and ClinVar_20220320.
Sanger sequencing
For the detection of gene mutations via Sanger sequencing, genomic DNA was first extracted from tissue samples using an appropriate extraction kit, ensuring high purity with an A260/A280 ratio between 1.8 and 2.0. Specific primers were designed on the basis of the target gene sequence to amplify the region of interest through PCR. The PCR mixture included template DNA, primers, dNTPs, Taq DNA polymerase, and buffer, followed by thermal cycling to achieve amplification. After amplification, the PCR products were purified to remove unreacted components and then subjected to Sanger sequencing, where a separate reaction was set up with fluorescently labeled dideoxynucleotides (ddNTPs) and primers. The sequencing process involved thermal cycling to generate readable sequence data, which were subsequently analyzed using bioinformatics software to identify mutations. Finally, mutations were confirmed by comparing the results against a reference sequence and documenting their types and locations for further interpretation.
Survival analysis
Kaplan–Meier survival curves were generated using the “survival” package (version 3.8-3) in R software (version 4.3.2)36. Survival time was defined as the period from initial diagnosis to death or the date of last follow-up. Patients who were alive at the end of follow-up were considered censored. For comparisons between two or more groups, the log-rank test was used to evaluate statistical differences in survival distributions. In addition, univariate and multivariate Cox proportional hazards regression analyses were performed to assess the effects of clinicopathological variables on survival outcomes. Hazard ratios (HRs) and 95% confidence intervals (CIs) were calculated.
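For readers less familiar with the estimator, the product-limit (Kaplan–Meier) calculation can be sketched in a few lines. The analysis in this study was run with the R "survival" package; the Python function below is only an assumed, simplified illustration, and the follow-up times and event indicators are invented toy values.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times:  follow-up durations (any consistent unit)
    events: 1 if death was observed, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set.
        n_at_risk -= removed
    return curve


# Toy data (hypothetical): months of follow-up and event indicators.
toy_curve = kaplan_meier([5, 8, 8, 12, 20], [1, 1, 0, 1, 0])
```

Censored patients contribute to the risk set until their last follow-up and then drop out without lowering the survival estimate.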
Statistical analysis
Normality testing was performed using the Shapiro–Wilk test, followed by Levene’s test for homogeneity of variances. Based on distribution and variance characteristics, we applied Student’s t-test, Welch’s t-test, one-way ANOVA, Mann–Whitney U test, or Kruskal–Wallis test as appropriate. For categorical variables, either Pearson’s chi-square test or Fisher’s exact test was used. Correlation analyses employed Pearson or Spearman methods depending on data normality and linearity. All statistical analyses were performed in R software (version 4.3.2). A two-sided p < 0.05 was considered statistically significant. For multiple comparisons, p-values were adjusted using the Benjamini–Hochberg method to control the false discovery rate (FDR), and corresponding q-values were reported where applicable. A q-value < 0.05 was considered statistically significant.
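The Benjamini–Hochberg adjustment can be written compactly. The sketch below is an assumed Python illustration of the step-up procedure (the study itself used R); the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Toy p-values (hypothetical), e.g. one per metabolite.
toy_q = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
significant = [q < 0.05 for q in toy_q]
```

In this toy case only the first test remains below the q < 0.05 threshold after adjustment, even though three raw p-values were below 0.05.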
Metabolomic data analysis
MetaboAnalyst 6.0 was used to analyze the metabolomic data. All the data that were input into MetaboAnalyst were autoscaled (mean-centered and divided by the standard deviation of each variable). We employed sparse partial least squares discriminant analysis (sPLS-DA) using the mixOmics package (version 6.30.0)37. sPLS-DA was chosen for its ability to perform simultaneous dimensionality reduction and feature selection. It was performed to visualize the metabolic profiles and identify relationships among several groups: normal, FTC, PTC, PDTC and ATC.
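Autoscaling as described here amounts to a per-variable z-score. A minimal sketch, assuming the sample standard deviation is used and with invented intensity values:

```python
from statistics import mean, stdev


def autoscale(column):
    """Mean-center one metabolite's values and divide by their
    sample standard deviation."""
    mu = mean(column)
    sd = stdev(column)
    return [(value - mu) / sd for value in column]


# Toy intensities (hypothetical) for a single metabolite across three samples.
scaled = autoscale([10.0, 12.0, 14.0])
```

After scaling, every metabolite has zero mean and unit variance, so abundant and scarce metabolites carry equal weight in sPLS-DA.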
Enrichment analysis was performed to identify significantly altered metabolic pathways. The analysis was completed by comparing the metabolite data with both the SMPDB metabolite sets and KEGG pathways, which are based on normal human metabolic pathways38,39. The enrichment ratio and significance of each pathway were calculated on the basis of the number of metabolites detected within a specific pathway relative to the number of known metabolites in that pathway.
We conducted a comprehensive network analysis and visualized the network using the Network Explorer module within MetaboAnalyst to explore the biological processes driven by interactions among metabolites and genes40.
Analysis of the associations between phenotypes and metabolomics
We employed Spearman's rank correlation analysis to assess the associations between phenotypes and polar metabolomics and lipidomics. Prior to the analysis, we applied min–max normalization to the metabolite data, ensuring comparability across different metabolites. The procedure involved collecting all the metabolites along with the corresponding phenotype information. We calculated the Spearman correlation coefficient for each normalized metabolite against the phenotype variables, ensuring the exclusion of missing values and the performance of appropriate data preprocessing. The results are reported as Spearman correlation coefficients along with corresponding p-values to evaluate the significance of the correlations. All the statistical analyses were conducted using R software, utilizing the ‘cor.test’ function for Spearman rank correlation calculations.
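The correlation step can be sketched as Pearson correlation computed on ranks. The Python code below is an assumed illustration (the study used R's cor.test), omits the p-value, and uses invented values. Because Spearman's coefficient depends only on ranks, min–max scaling leaves it unchanged, which the example checks.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Toy data (hypothetical): one metabolite and a binary phenotype.
metabolite = [2.0, 8.5, 3.1, 9.7, 5.4]
phenotype = [0, 1, 0, 1, 1]
rho_raw = spearman(metabolite, phenotype)
rho_scaled = spearman(min_max(metabolite), phenotype)
```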
Analysis of the associations among somatic mutations, gene fusions and metabolomics
We analyzed the associations among somatic mutations, gene fusions, and metabolomics using Welch’s t-statistic. First, metabolomic data were collected alongside information on somatic mutations and gene fusions. To account for potential variations in sample sizes and unequal variances, we applied Welch’s t-test to compare the metabolite levels between groups defined by specific mutations and gene fusions. Each metabolite was analyzed separately, and the calculated t-statistics, along with corresponding p-values, were reported to evaluate the significance of the associations. All the statistical analyses were performed using R software.
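Welch's statistic and its approximate degrees of freedom (Welch–Satterthwaite) can be computed directly from group means and variances. The sketch below is an assumed Python illustration, since the study used R; the group names and intensity values are hypothetical, and the p-value lookup from the t distribution is omitted.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return Welch's t-statistic and Welch-Satterthwaite degrees of freedom
    for two groups with possibly unequal sizes and variances."""
    va = variance(group_a) / len(group_a)
    vb = variance(group_b) / len(group_b)
    t_stat = (mean(group_a) - mean(group_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(group_a) - 1) + vb ** 2 / (len(group_b) - 1)
    )
    return t_stat, df


# Toy metabolite levels (hypothetical) in mutated versus wild-type tumors.
mutated = [2.0, 4.0, 6.0]
wild_type = [1.0, 2.0, 3.0, 4.0, 5.0]
t_stat, df = welch_t(mutated, wild_type)
```

Unlike Student's t-test, the variances are not pooled, which is why the degrees of freedom are generally non-integer.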
Metabolomic and transcriptomic matching
We conducted metabolomic and transcriptomic matching to identify key associations between metabolites and their corresponding genes. We began by selecting important metabolites on the basis of our previous analyses. Using the SMPDB and KEGG, we then mapped these metabolites to their respective metabolic pathways and identified the genes involved in each pathway. Each selected metabolite was matched with the corresponding genes, allowing us to create a comprehensive dataset linking metabolomic profiles to transcriptomic data.
Single-cell RNA-seq data processing and analysis
Publicly available single-cell RNA-seq data were obtained from GEO (accession GSE193581)41. Based on the “treatment protocol” metadata, we included only samples with the annotation “untreated” to avoid confounding effects of therapeutic interventions on gene expression. The selected patient samples included PTC01, PTC02, PTC04, PTC07, ATC09, ATC10, ATC11, ATC12, ATC13, ATC14, ATC15, ATC17, NORM07, NORM18, NORM19, NORM20, and NORM21. Data processing and analysis were performed using the Seurat package (version 5.3) in R42,43,44. Dimensionality reduction was conducted using principal component analysis on highly variable genes identified with FindVariableFeatures(). UMAP embedding was then computed using RunUMAP() (method = “uwot”, metric = “cosine”) for visualization of cell clusters in two-dimensional space45.
Trend clustering of metabolomic data
The ‘Mfuzz’ R package was used to perform trend clustering of the metabolomic data46. We focused on analyzing metabolites according to the normal, PTC, PDTC, and ATC states. First, the metabolomic data were preprocessed to ensure quality and consistency. Using Mfuzz, we identified fuzzy clusters of metabolites, capturing the underlying trends and dynamic patterns within the data. We set 4 clusters on the basis of biological relevance and performed clustering analysis to group metabolites with similar expression patterns across the different states.
Machine learning metabolomic state model
All analyses were conducted in a Conda‐managed Python 3.9 environment using PyTorch for model development and SHAP for interpretability47. We developed a one‐dimensional residual convolutional network (ResNet1D) for multiclass classification on preprocessed feature matrices. We integrated transcriptomic data from multiple sources, including four public microarray datasets48. The microarray datasets were processed using the robust multi-array average method for background correction and quantile normalization, resulting in log₂-transformed gene expression values. For the RNA-seq data, raw read counts were normalized for sequencing depth and then log₂-transformed to produce expression values on a comparable scale to the microarray data. Before combining the datasets, each dataset was independently standardized on a gene-by-gene basis using Z-score normalization. This per-gene standardization ensured that all features contributed equally and had comparable variance across different platforms. We then merged the datasets using the set of genes common to all platforms. After merging, we applied the ComBat algorithm (from the sva package v3.40.0 in R) to adjust for systematic differences between the platforms. Samples were then stratified by pathology type and split into training and test sets at a 2:1 ratio. Importantly, batch correction and normalization were performed separately within the training set and the test set (to prevent any information leakage between training and testing phases). Metabolomic intensities from a single‐center platform were directly Z‐score normalized within each cohort. Model training employed 9-fold stratified cross-validation (sklearn.model_selection.StratifiedKFold) to preserve class proportions, running 40 epochs per fold with a batch size of 32. 
We optimized a categorical cross-entropy loss via the Adam optimizer (initial learning rate = 0.0001) without explicit learning-rate schedules, weight decay, or early-stopping, leveraging Adam’s adaptive moment estimation and dropout as regularization. SHAP’s DeepExplainer was applied only to the training subset of one representative fold (randomly sampled background of 100 instances) to compute global feature importances without leaking test-set information.
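To make the architecture concrete, the following is a minimal sketch of a one-dimensional residual network trained with Adam (learning rate 0.0001) and a cross-entropy loss, as described above. It is not the authors' implementation: the channel width, kernel size, number of residual blocks, dropout rate, and the random input tensors are all assumptions chosen for illustration. The input is assumed to be 10 features (for example, the 10 selected metabolites) and the output 5 classes (normal, FTC, PTC, PDTC, ATC).

```python
import torch
from torch import nn


class ResidualBlock1D(nn.Module):
    """Two 1D convolutions with a skip connection (assumed layout)."""

    def __init__(self, channels, kernel_size=3, dropout=0.2):
        super().__init__()
        padding = kernel_size // 2  # keeps the sequence length unchanged
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=padding),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Conv1d(channels, channels, kernel_size, padding=padding),
            nn.BatchNorm1d(channels),
        )
        self.activation = nn.ReLU()

    def forward(self, x):
        # The residual connection adds the block input to its output.
        return self.activation(x + self.body(x))


class ResNet1D(nn.Module):
    def __init__(self, n_classes=5, channels=16, n_blocks=2):
        super().__init__()
        self.stem = nn.Conv1d(1, channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(
            *[ResidualBlock1D(channels) for _ in range(n_blocks)]
        )
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.head = nn.Linear(channels, n_classes)

    def forward(self, x):
        x = x.unsqueeze(1)  # (batch, features) -> (batch, 1, features)
        x = self.blocks(self.stem(x))
        return self.head(self.pool(x).squeeze(-1))  # class logits


torch.manual_seed(0)
model = ResNet1D()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()  # accepts integer class indices as targets

# One training step on a random toy batch of 32 samples (not study data).
features = torch.randn(32, 10)
labels = torch.randint(0, 5, (32,))
loss = loss_fn(model(features), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Treating the feature vector as a length-10 single-channel sequence is one common way to apply 1D convolutions to tabular omics data; other layouts are possible.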
For input preparation, one-hot encoding was applied to class labels using PyTorch’s F.one_hot, and features were normalized before training. During each fold, training and validation subsets were dynamically generated using index slicing, and performance was recorded in terms of classification accuracy. ROC curves and AUC scores were computed using sklearn.metrics.
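The effect of stratification on rare classes can be illustrated with a simplified, deterministic stand-in for sklearn's StratifiedKFold. The function below is an assumption written for this illustration (it does no shuffling), and the class counts are those of the FUSCC metabolomic cohort reported in the figure legend (normal 57, FTC 23, PTC 103, PDTC 17, ATC 10).

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds):
    """Assign sample indices to folds round-robin within each class,
    so every fold receives a near-proportional share of each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


labels = (["normal"] * 57 + ["FTC"] * 23 + ["PTC"] * 103
          + ["PDTC"] * 17 + ["ATC"] * 10)
folds = stratified_folds(labels, n_folds=9)

# Per-fold class composition; each fold serves once as the validation set.
composition = [Counter(labels[i] for i in fold) for fold in folds]
```

With 9 folds, every validation fold in this sketch contains at least one ATC and one PDTC sample, which an unstratified split would not guarantee.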
To interpret the trained model, we applied SHAP’s DeepExplainer, using randomly sampled background data from the training set to estimate local and global feature contributions. The mean absolute SHAP values were computed to rank features globally, and the top-N features were selected to retrain the model on a reduced input space. The retrained model was assessed with the same cross-validation scheme and subsequently tested on a held-out subset drawn from the combined internal–external dataset (cohort = 2)17,18,19,49. The external cohort consisted of samples from the GEO datasets, including GSE29265 (20 normal, 20 PTC, 9 ATC), GSE33630 (49 PTC, 45 normal, 11 ATC), GSE53157 (3 normal, 15 PTC, 4 FTC, 5 PDTC), GSE65144 (13 normal, 12 ATC), and GSE76039 (17 PDTC, 20 ATC). ROC curves, classwise AUCs, and macro-average ROCs were plotted using matplotlib.
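The global ranking step can be sketched independently of the SHAP library. The example assumes the SHAP values have already been computed and arranged as a samples-by-features matrix (for a multiclass model this would require aggregating over classes first, which is not shown). The function name, feature names, and numbers are hypothetical.

```python
def rank_features_by_mean_abs_shap(shap_values, feature_names, top_n):
    """Rank features by mean absolute SHAP value and return the top N
    as (name, mean_abs_shap) pairs."""
    n_samples = len(shap_values)
    mean_abs = [
        sum(abs(row[j]) for row in shap_values) / n_samples
        for j in range(len(feature_names))
    ]
    ranked = sorted(
        zip(feature_names, mean_abs), key=lambda pair: pair[1], reverse=True
    )
    return ranked[:top_n]


# Toy SHAP matrix (hypothetical): 3 samples x 3 features.
toy_shap = [
    [0.2, -0.5, 0.00],
    [-0.4, 0.1, 0.10],
    [0.3, -0.6, -0.05],
]
top_features = rank_features_by_mean_abs_shap(
    toy_shap, ["met_A", "met_B", "met_C"], top_n=2
)
selected_names = [name for name, _ in top_features]
```

Taking absolute values before averaging matters: a feature that pushes predictions strongly in opposite directions for different samples would otherwise appear unimportant.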
All code was executed within a Miniconda environment with the following key dependencies: PyTorch, scikit-learn, pandas, numpy, shap, and matplotlib.
Data availability
All external transcriptomic reference cohorts employed in this work are publicly available from the NCBI Gene Expression Omnibus (GEO) and can be downloaded without restriction from the GEO website (https://www.ncbi.nlm.nih.gov/geo/) under accession numbers GSE29265, GSE33630, GSE53157, GSE65144, and GSE76039. The raw sequencing and untargeted metabolomics data generated at FUSCC cannot be released publicly at this stage because follow-up analyses are still in progress and the files contain participant-level information that could permit re-identification. Researchers may request access to the de-identified datasets by e-mailing the corresponding author and the FUSCC Data-Access Committee (dac@fudan.edu.cn). All requests are acknowledged within 5 working days and evaluated within 30 working days.
Code availability
The code is available at https://github.com/Aceracede3000/Aceracede3000. All analyses were conducted in a Conda‐managed Python 3.9 environment using PyTorch; all required software versions and dependencies are listed in the repository’s requirements file, which has no access restrictions.
References
Kim, J., Gosnell, J. E. & Roman, S. A. Geographic influences in the global rise of thyroid cancer. Nat. Rev. Endocrinol. 16, 17–29 (2020).
Pereira, M., Williams, V. L., Hallanger Johnson, J. & Valderrabano, P. Thyroid cancer incidence trends in the United States: association with changes in professional guideline recommendations. Thyroid 30, 1132–1140 (2020).
Megwalu, U. C. & Moon, P. K. Thyroid cancer incidence and mortality trends in the United States: 2000–2018. Thyroid 32, 560–570 (2022).
Yoo, S. K. et al. Integrative analysis of genomic and transcriptomic characteristics associated with progression of aggressive thyroid cancer. Nat. Commun. 10, 2764 (2019).
Zeng, P. Y. F. et al. The genomic and evolutionary landscapes of anaplastic thyroid carcinoma. Cell Rep. 43, 113826 (2024).
Bible, K. C. et al. 2021 American Thyroid Association Guidelines for management of patients with anaplastic thyroid cancer: American Thyroid Association Anaplastic Thyroid Cancer Guidelines Task Force. Thyroid 31, 337–386 (2021).
Pan, Z. et al. Integrative proteogenomic characterization reveals therapeutic targets in poorly differentiated and anaplastic thyroid cancers. Nat. Commun. 16, 3601 (2025).
Fung, M. M. H. et al. Developing a named entity framework for thyroid cancer staging and risk level classification using large language models. npj Digit. Med. 8, 134 (2025).
Kim, Y. H. et al. Integrative multi-omics analysis reveals different metabolic phenotypes based on molecular characteristics in thyroid cancer. Clin. Cancer Res. 30, 883–894 (2024).
Ma, B. et al. Transcriptome analyses identify a metabolic gene signature indicative of dedifferentiation of papillary thyroid cancer. J. Clin. Endocrinol. Metab. 104, 3713–3725 (2019).
Qu, N. et al. Integrated proteogenomic and metabolomic characterization of papillary thyroid cancer with different recurrence risks. Nat. Commun. 15, 3175 (2024).
Shao, C. et al. Optical diagnostic imaging and therapy for thyroid cancer. Mater. Today Bio 17, 100441 (2022).
Wu, Z., Shen, C. & van den Hengel, A. Wider or deeper: revisiting the ResNet model for visual recognition. Pattern Recognit. 90, 119–133 (2019).
Buergel, T. et al. Metabolomic profiles predict individual multidisease outcomes. Nat. Med. 28, 2309–2320 (2022).
Balamurugan, V. & Sivasankari, K. Segment anything model-based segmentation with inception-ResNet-v2 classifier for Parkinson’s disease diagnosis and metaverse visualization. Comput. Biol. Med. 187, 109732 (2025).
Wang, H., Liang, Q., Hancock, J. T. & Khoshgoftaar, T. M. Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods. J. Big Data 11, 44 (2024).
Wen, S. et al. Cancer-associated fibroblasts positively correlate with dedifferentiation and aggressiveness of thyroid cancer. OncoTargets Ther. 14, 1205–1217 (2021).
Landa, I. et al. Genomic and transcriptomic hallmarks of poorly differentiated and anaplastic thyroid cancers. J. Clin. Investig. 126, 1052–1066 (2016).
Pita, J. M., Banito, A., Cavaco, B. M. & Leite, V. Gene expression profiling associated with the progression to poorly differentiated thyroid carcinomas. Br. J. Cancer 101, 1782–1791 (2009).
Lesko, J. et al. Phospholipid dynamics in ex vivo lung cancer and normal lung explants. Exp. Mol. Med. 53, 81–90 (2021).
Hidalgo, I., Sorolla, M. A., Sorolla, A., Salud, A. & Parisi, E. Secreted phospholipases A2: drivers of inflammation and cancer. Int. J. Mol. Sci. 25, 12408 (2024).
Laface, C. et al. Autotaxin-lysophosphatidate axis: promoter of cancer development and possible therapeutic implications. Int. J. Mol. Sci. 25, 7737 (2024).
Kilgour, M. K. et al. 1-Methylnicotinamide is an immune regulatory metabolite in human ovarian cancer. Sci. Adv. 7, eabe1174 (2021).
Freed-Pastor, W. A. et al. Mutant p53 disrupts mammary tissue architecture via the mevalonate pathway. Cell 148, 244–258 (2012).
Hussain, T. et al. Activation of PPAR-γ prevents TERT-mediated pulmonary vascular remodeling in MCT-induced pulmonary hypertension. Heliyon 9, e14173 (2023).
Parlani, M., Jorgez, C. & Friedl, P. Plasticity of cancer invasion and energy metabolism. Trends Cell Biol. 33, 388–402 (2023).
Doebele, R. C. et al. Entrectinib in patients with advanced or metastatic NTRK fusion-positive solid tumours: integrated analysis of three phase 1–2 trials. Lancet Oncol. 21, 271–282 (2020).
Mladenov, M., Sazdova, I., Hadzi-Petrushev, N., Konakchieva, R. & Gagov, H. The role of reductive stress in the pathogenesis of endocrine-related metabolic diseases and cancer. Int. J. Mol. Sci. 26, 1910 (2025).
Xue, C. et al. Tryptophan metabolism in health and disease. Cell Metab. 35, 1304–1326 (2023).
Lou, F. et al. Dysbiotic oral microbiota-derived kynurenine, induced by chronic restraint stress, promotes head and neck squamous cell carcinoma by enhancing CD8(+) T cell exhaustion. Gut 74, 925–947 (2025).
Eom, H. et al. Discovery of highly active kynureninases for cancer immunotherapy through protein language model. Nucleic Acids Res. 53, gkae1245 (2025).
Ma, W. & Zhou, S. Metabolic rewiring in the face of genomic assault: integrating DNA damage response and cellular metabolism. Biomolecules 15, 168 (2025).
Dutta, D. et al. A defect in mitochondrial fatty acid synthesis impairs iron metabolism and causes elevated ceramide levels. Nat. Metab. 5, 1595–1614 (2023).
Nie, M. et al. Evolutionary metabolic landscape from preneoplasia to invasive lung adenocarcinoma. Nat. Commun. 12, 6479 (2021).
Brunk, E. et al. Recon3D enables a three-dimensional view of gene variation in human metabolism. Nat. Biotechnol. 36, 272–281 (2018).
Therneau, T. M. A Package for Survival Analysis in R. R package version 3.7-0. CRAN, R Foundation for Statistical Computing, Vienna, Austria. Available at: https://CRAN.R-project.org/package=survival (2024).
Rohart, F., Gautier, B., Singh, A. & Le Cao, K.-A. mixOmics: an R package for ‘omics feature selection and multiple data integration. PLoS Comput. Biol. 13, e1005752 (2017).
Jewison, T. et al. SMPDB 2.0: big improvements to the Small Molecule Pathway Database. Nucleic Acids Res. 42, D478–D484 (2014).
Kanehisa, M. The KEGG database. In Bock, G. R. & Goode, J. A. (eds) ‘In Silico’ Simulation of Biological Processes (Novartis Foundation Symposium, Vol. 247), 91–103 (Wiley, 2002).
Chong, J. & Xia, J. Using MetaboAnalyst 4.0 for metabolomics data analysis, interpretation, and integration with other omics data. Comput. Methods Data Anal. Metabolomics 2104, 337–360 (2020).
Lu, L. et al. Anaplastic transformation model in thyroid cancer revealed by single cell lineage and fate analysis. Cancer Res. 83, 3131–3131 (2023).
Hao, Y. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat. Biotechnol. 42, 293–304 (2024).
Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e3529 (2021).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e1821 (2019).
Lundberg, S. M. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30 (eds Guyon, I., von Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S. & Garnett, R.) 4765–4774 (Curran Associates, Inc., 2017).
Kumar, L. & Futschik, M. E. Mfuzz: a software package for soft clustering of microarray data. Bioinformation 2, 5 (2007).
Lundberg, S. M. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) 4765–4774 (Curran Associates, Inc., Red Hook, NY, USA, 2017).
Foltz, S. M., Greene, C. S. & Taroni, J. N. Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously. Commun. Biol. 6, 222 (2023).
Von Roemeling, C. A. et al. Aberrant lipid metabolism in anaplastic thyroid carcinoma reveals stearoyl CoA desaturase 1 as a novel therapeutic target. J. Clin. Endocrinol. Metab. 100, E697–E709 (2015).
Acknowledgements
We thank the participants enrolled in the study and acknowledge Yihao Chen and Wu Nv for their kind assistance with participant recruitment and with sample and clinical data collection. The study was supported by the Ministry of Science and Technology of the People’s Republic of China (2024ZD0525600 to Yu Wang), the National Natural Science Foundation of China (82473361 to Yu Wang, 82203052 to Ben Ma), the Science and Technology Commission of Shanghai Municipality (22Y21900100 to Yu Wang), the Shanghai Anticancer Association (SACA-AX202213 to Yu Wang), and the Shanghai Municipal Health Commission (20234Y0120 to Ben Ma).
Author information
Contributions
Y.Z. designed the overall experiment and wrote the manuscript, while X.D., S.C., and Y.C. collected clinical samples and data. D.Z., W.X., and T.L. performed data checking and verification. N.Q., R.S., and Q.J. provided clinical guidance and contributed to methodological refinements. B.M. and Y.W. provided funding support and overall supervision of the project. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, Y., Du, X., Cai, S. et al. Developing a thyroid cancer differentiation state classification system using deep residual networks and metabolic signature profiling. npj Digit. Med. 8, 572 (2025). https://doi.org/10.1038/s41746-025-01927-1