Introduction

Cancer is an inherently heterogeneous disease, yet it progresses through well-defined biological principles that govern its development and spread1,2. Hanahan and Weinberg introduced the concept of the “hallmarks of cancer,” a unifying framework that identifies the fundamental capabilities cancer cells acquire during tumorigenesis3. These core hallmarks include: (1) sustaining proliferative signaling to drive uncontrolled growth; (2) evading growth suppressors to bypass regulatory constraints; (3) resisting cell death to survive environmental and intracellular stress; (4) enabling replicative immortality to achieve limitless cell division; (5) inducing angiogenesis to ensure a continuous nutrient supply through neovascularization; and (6) activating invasion and metastasis to colonize distant tissues. This framework has been expanded to incorporate emerging hallmarks, such as (7) deregulating cellular energetics to sustain rapid proliferation, and (8) avoiding immune destruction by escaping immune surveillance. Enabling characteristics, including (9) genome instability and mutation, which accelerate tumor evolution, and (10) tumor-promoting inflammation, which supports a microenvironment conducive to malignancy, further illustrate the complexity of cancer biology4.

Despite the insights offered by these hallmark frameworks, current diagnostic approaches often fail to integrate these molecular underpinnings into routine clinical practice. Traditional methods, such as staging systems (e.g., AJCC and TNM) and grading scales (e.g., Gleason grading), primarily focus on macroscopic and microscopic tumor characteristics, overlooking the molecular heterogeneity that drives tumor behavior5,6. Consequently, patients with the same cancer type, stage, and grade may exhibit divergent outcomes, exposing the limitations of these approaches. Moreover, these approaches do not provide insights into the dynamic, micro-evolutionary molecular changes within tumors, limiting their capacity to guide personalized treatment strategies7. A hallmark-based diagnostic framework has the potential to address these limitations by integrating molecular data to illuminate the biological mechanisms underlying tumorigenesis—an essential step toward precision oncology, in which treatments are tailored to the individual tumor’s molecular profile8. Although multi-omics technologies, artificial intelligence, and real-time monitoring have advanced considerably, we still lack a single, unified method that can simultaneously annotate all hallmark activities in a tumor9,10.

To address this critical gap, we developed OncoMark, a high-throughput neural multi-task learning (N-MTL) framework that simultaneously quantifies the activity of all ten cancer hallmarks from the transcriptomes of tumor biopsies. To the best of our knowledge, it is the first computational tool designed to predict all cancer hallmarks concurrently. Given an expression profile, OncoMark outputs the probability that each hallmark is active, yielding a detailed molecular profile of the tumor. We validated the model extensively to assess its robustness and generalizability. In cross-validation, mean accuracy and F1 scores across hallmarks exceeded 99%. On five independent external datasets, accuracy remained at or above 97.26%, with F1 scores of at least 93.46%. Additional validation on eight gold-standard datasets, comprising six cancer (TCGA, MET500, CCLE, TARGET, PCAWG, and POG570) and two normal (GTEx and ENCODE) datasets, showed that the model distinguishes cancer from normal samples with high accuracy. Moreover, predicted hallmark activity was significantly associated with AJCC stage and TNM categories, with the strongest associations observed at advanced stages of cancer progression. To support clinical use, we have developed a user-friendly software platform, accessible at https://oncomark-ai.hf.space/, which integrates hallmark activity profiling into research and clinical workflows.

Results

Overview of OncoMark

The hallmarks of cancer comprise a set of core biological capabilities acquired by tumor cells during malignant transformation and progression4. Despite their central role in oncogenesis, no experimental standard currently exists for annotating hallmark activity in clinical biopsy samples. While techniques such as immunohistochemistry (IHC) offer some capacity to infer hallmark states, their reliance on single biomarkers limits both specificity and sensitivity, as no single marker universally defines a given hallmark11. Consequently, hallmark-annotated biopsy datasets remain unavailable for model training.

To address this gap, we developed a strategy to generate synthetic, hallmark-annotated bulk transcriptomic profiles by leveraging single-cell RNA sequencing (scRNA-seq) data12,13,14 (Fig. 1 and Supplementary Fig. 1). We first curated gene sets corresponding to each hallmark of cancer and used them to compute digital hallmark activity scores for individual cells15. These scores were binarized to indicate the presence or absence of each hallmark, and cells with identical annotations were aggregated to create synthetic pseudo-bulk samples. Single-cell profiles were not used directly for model training, because they capture only localized and often transient transcriptional activity, which does not reliably reflect the population-level hallmark patterns observed in clinical biopsies16. This aggregation approach preserved biological realism while minimizing noise in the training labels.

Fig. 1: Overview of the OncoMark framework.

Single-cell transcriptomic data from multiple cancer types undergo quality control to remove low-quality cells. Each cell is then scored for hallmark gene expression signatures, followed by binary annotation (Yes/No) indicating the presence or absence of each hallmark. These annotated single cells are aggregated to create synthetic pseudo-bulk datasets for each hallmark. A multi-task neural network is trained on these synthetic data, learning a shared feature representation across all hallmarks, with hallmark-specific output layers that predict the presence of each hallmark.

However, real-world tumor samples frequently exhibit co-activation of multiple hallmarks, a complexity not present in our synthetic training data by design. If synthetic biopsies were constructed to include cells with diverse hallmark profiles, reliable binary annotation of hallmark presence or absence would become infeasible. Such ambiguity would increase label noise and hinder model performance, particularly in a supervised learning context where robust class separation is critical for effective training17. Moreover, the biological processes underlying different hallmarks are highly interdependent, with shared regulatory pathways and transcriptional programs18. To address both the lack of mixed hallmark states in the training data and the need to capture inter-hallmark dependencies, we adopted a multi-task learning (MTL) framework19.

Our MTL architecture uses shared backbone layers that learn generalized transcriptomic features across all hallmarks, paired with hallmark-specific output heads that refine these representations for individual predictions. This design lets the model exploit shared biological signals and learn hallmark coordination patterns implicitly, even though the training labels for different hallmarks do not overlap (Supplementary Fig. 2). In this way, the MTL framework bridges the gap between our cleanly labeled synthetic data and the complex, interwoven hallmark states of real tumors, allowing the model to generalize to heterogeneous biopsy transcriptomes. The resulting predictions are biologically grounded, robust to hallmark co-activation, and consistent with the functional interplay of hallmark processes. Further details are provided in the Methods section.
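The shared-backbone design can be illustrated with a minimal NumPy sketch of hard parameter sharing. The layer sizes, the single ReLU backbone layer, and the per-sample label mask (so that a sample labeled for one hallmark contributes loss only through that hallmark's head) are illustrative assumptions, not OncoMark's published configuration; training by backpropagation is omitted.

```python
import numpy as np

# Illustrative sizes (assumptions): G input genes, H shared hidden units, K = 10 hallmark heads.
G, H, K = 50, 16, 10
rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Shared backbone (a single dense layer here) and one logistic output head per hallmark.
W_shared = rng.normal(0.0, 0.1, size=(G, H))
b_shared = np.zeros(H)
W_heads = rng.normal(0.0, 0.1, size=(H, K))  # column k is the head for hallmark k
b_heads = np.zeros(K)

def forward(X):
    """Return an (N, K) matrix of per-hallmark activity probabilities."""
    shared = relu(X @ W_shared + b_shared)  # representation shared by all heads
    return sigmoid(shared @ W_heads + b_heads)

def masked_bce(probs, labels, mask, eps=1e-7):
    """Binary cross-entropy averaged over labeled (sample, hallmark) pairs only."""
    p = np.clip(probs, eps, 1.0 - eps)
    bce = -(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))
    return float((bce * mask).sum() / mask.sum())

# Toy batch: each sample carries a label for exactly one hallmark.
X = rng.normal(size=(4, G))
labels = np.zeros((4, K))
mask = np.zeros((4, K))
labels[0, 2] = mask[0, 2] = 1.0  # sample 0: hallmark 2 present
mask[1, 2] = 1.0                 # sample 1: hallmark 2 absent
labels[2, 7] = mask[2, 7] = 1.0  # sample 2: hallmark 7 present
mask[3, 7] = 1.0                 # sample 3: hallmark 7 absent

probs = forward(X)
loss = masked_bce(probs, labels, mask)
```

Because every head reads the same shared representation, gradients from samples labeled for any hallmark update the backbone, which is how such a network can pick up cross-hallmark structure despite single-hallmark labels.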

Validation of hallmark annotations in synthetic training data

To assess the accuracy of the hallmark annotations used for model training, we compared the gene expression profiles of hallmark-positive and hallmark-negative samples within the synthetic (pseudo-bulk) datasets. For each hallmark, we performed differential expression analysis20 and visualized the log2 fold change and −log10 false discovery rate (FDR) of the associated gene sets as heatmaps (Fig. 2). Hallmark genes were strongly enriched in samples annotated as hallmark-positive, as indicated by deep blue signals in both the fold change and FDR panels. Notably, several gene sets had FDR values of zero (arising from unadjusted p values that were numerically zero), which yield infinite −log10 FDR values; these appear as white or missing entries in the −log10 FDR heatmap and reflect extremely significant associations. These findings support the reliability of our binary hallmark annotations and the validity of the synthetic training data used for model development.
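The two quantities shown in Fig. 2 can be sketched as follows. The differential expression method of ref. 20 is not reproduced here; as an assumption, this sketch uses a mean-based log2 fold change with a pseudocount and the Benjamini–Hochberg procedure, a standard FDR correction. It also shows why a numerically zero p value produces an FDR of zero and therefore an infinite −log10 FDR.

```python
import numpy as np

def log2_fold_change(pos, neg, pseudocount=1.0):
    """Per-gene log2 ratio of mean expression, hallmark-positive vs hallmark-negative samples.

    pos, neg: (samples, genes) arrays of non-negative expression values.
    """
    return np.log2((pos.mean(axis=0) + pseudocount) / (neg.mean(axis=0) + pseudocount))

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: each adjusted value is the minimum over all larger p values.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    fdr = np.empty(m)
    fdr[order] = np.minimum(scaled, 1.0)
    return fdr

# A p value that is numerically zero gives FDR = 0, hence an infinite -log10 FDR.
with np.errstate(divide="ignore"):
    neg_log10_fdr = -np.log10(bh_fdr([0.0, 1e-3, 0.2]))
```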

Fig. 2: Validation of hallmark annotations in synthetic training data.

Heatmaps depict the log2 fold change and −log10 FDR values for hallmark gene sets, comparing hallmark-present versus hallmark-absent samples for each hallmark. Deep blue in both the log2 fold change and −log10 FDR panels indicates strong and significant enrichment of the gene set in hallmark-positive samples, supporting accurate hallmark annotation. Blank areas in the −log10 FDR heatmap indicate infinite −log10 FDR values (FDR values of zero arising from numerically zero unadjusted p values), corresponding to highly significant associations.

Robust performance across internal cross-validation

We evaluated OncoMark using five-fold cross-validation repeated twice. For activating invasion and metastasis (AIM), the model achieved an accuracy of 99.97%, an F1 score of 99.97%, and a balanced accuracy of 99.97% across 729 patients (3364 positive and 4184 negative samples). For deregulating cellular energetics (DCE), accuracy was 98.91%, F1 score 97.99%, and balanced accuracy 98.88% across 764 patients (3841 positive and 4249 negative samples). For evading growth suppressors (EGS), accuracy was 99.95%, F1 score 99.92%, and balanced accuracy 99.96% across 711 patients (3724 positive and 3168 negative samples). For genome instability and mutation (GIM), accuracy was 98.73%, F1 score 98.68%, and balanced accuracy 98.74% across 395 patients (1282 positive and 846 negative samples). For resisting cell death (RCD), accuracy was 99.93%, F1 score 99.92%, and balanced accuracy 99.93% across 652 patients (2590 positive and 2833 negative samples). For sustaining proliferative signaling (SPS), performance was perfect, with 100% accuracy, F1 score, and balanced accuracy across 695 patients (3305 positive and 3796 negative samples). For avoiding immune destruction (AID), accuracy was 99.71%, F1 score 99.96%, and balanced accuracy 99.74% across 691 patients (2832 positive and 3671 negative samples). For inducing angiogenesis (IA), accuracy was 99.92%, F1 score 99.87%, and balanced accuracy 99.91% across 705 patients (2318 positive and 4334 negative samples). For enabling replicative immortality (ERI), accuracy was 99.28%, F1 score 98.53%, and balanced accuracy 99.08% across 816 patients (3205 positive and 7238 negative samples). Lastly, for tumor-promoting inflammation (TPI), accuracy was 99.43%, F1 score 99.13%, and balanced accuracy 99.57% across 730 patients (2326 positive and 4824 negative samples) (Fig. 3a and Table 1). Confusion matrices showed clear separation between correct and incorrect predictions for every hallmark (Supplementary Fig. 3). ROC curves for all hallmarks approached the upper-left corner, with AUROC values of 1.00 and negligible standard deviations across folds, indicating robust discrimination between positive and negative samples (Supplementary Fig. 4a). Precision–recall curves likewise showed that precision remained consistently high even as recall approached 1.0, indicating that the model identifies positive cases without sacrificing precision (Supplementary Fig. 4b).
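For reference, the per-hallmark metrics reported above follow from the confusion-matrix counts, and AUROC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The Python sketch below implements these textbook definitions; it is not OncoMark's evaluation code, and the pairwise AUROC computation is quadratic in memory, so it suits small examples only.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1, and balanced accuracy from binary labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / y_true.size,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": (recall + specificity) / 2,
    }

def auroc(y_true, scores):
    """AUROC as P(score of a positive > score of a negative), counting ties as 1/2."""
    y_true = np.asarray(y_true, dtype=bool)
    s = np.asarray(scores, dtype=float)
    pos, neg = s[y_true], s[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)
```

Balanced accuracy averages sensitivity and specificity, which is why it is reported alongside accuracy for hallmarks with unequal numbers of positive and negative samples.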

Fig. 3: Performance evaluation of OncoMark using cross-validation and external datasets.

a Bar plots summarize precision, recall, F1 score, accuracy, and balanced accuracy from two independent rounds of five-fold cross-validation. Individual dots represent performance on each fold, and error bars denote standard deviations. b OncoMark’s generalizability is assessed across synthetic data created from five independent external datasets, using the same evaluation metrics, highlighting robust and consistent predictive performance.

Table 1 Sample information along with performance evaluation of OncoMark for the prediction of ten cancer hallmarks

Generalizability across independent external cohorts

We further validated the model on five external datasets. On the dataset from ref. 21, OncoMark achieved an accuracy of 97.26% and an F1 score of 96.19% across 13 patients (105 positive and 41 negative samples). On the dataset of Yost et al.22, it achieved an accuracy of 99.64% and an F1 score of 99.42% across 23 patients (175 positive and 105 negative samples). On the dataset from ref. 23, it achieved an accuracy of 97.40% and an F1 score of 93.46% across 45 patients (260 positive and 395 negative samples). On the datasets of Gao et al.24 and Nam et al.25, it achieved 100% accuracy and F1 scores across four patients (1 positive and 22 negative samples) and ten patients (65 positive and 50 negative samples), respectively (Fig. 3b and Table 1). ROC curves yielded high AUC values for the datasets from ref. 21 (0.98), ref. 22 (0.99), and ref. 23 (0.97), and perfect AUC values of 1.0 for Gao et al.24 and Nam et al.25 (Supplementary Fig. 4c). Similarly, precision remained close to 1.0 across recall levels for most datasets, with values of 0.99 (ref. 21), 0.99 (ref. 22), and 0.98 (ref. 23), and perfect values for Gao et al.24 and Nam et al.25 (Supplementary Fig. 4d). Altogether, the external validation included 95 patients with 627 positive and 600 negative samples, supporting the robust generalizability of the model.

Superior discrimination of normal and malignant tissues

The hallmark-specific probabilities predicted by OncoMark for normal (GTEx and ENCODE) and cancer datasets (TCGA, CCLE, POG570, PCAWG, TARGET, and MET500) revealed a clear distinction between tissue homeostasis and malignancy (Fig. 4a and Supplementary Fig. 5). In normal datasets, predicted hallmark probabilities followed tightly constrained density distributions, consistent with the stable, balanced biological processes that maintain normal cellular function. In contrast, cancer datasets showed a marked shift toward significantly higher probabilities, indicative of the dysregulated pathways that drive tumor progression.

Fig. 4: Generalizability of OncoMark across diverse biological contexts.

a The left panel displays probability density distributions of hallmark predictions in normal tissue datasets (GTEx and ENCODE), revealing hallmark-specific variation in healthy tissues. The right panel contrasts this with hallmark activity in cancer datasets (TCGA, CCLE, POG570, PCAWG, TARGET, and MET500), showing pronounced shifts in predicted probabilities consistent with malignant transformation. b Density plot for the activating invasion and metastasis hallmark across primary tumor sites with known metastatic cases.

To contextualize OncoMark’s performance, we also compared it against several standard machine learning models, including logistic regression (LR), support vector classifier (SVC), decision tree (DT), random forest (RF), XGBoost, and multi-layer perceptron (MLP) (Supplementary Fig. 6a–f). Unlike OncoMark, these baseline models were strongly biased toward assigning near-zero probabilities to hallmark signatures across all samples. This yielded acceptable performance on normal tissues but consistently poor sensitivity on cancer samples, indicating that the standard models failed to learn meaningful patterns associated with malignancy and instead defaulted to predicting hallmark inactivity for all inputs, with no genuine discriminative learning.

The Kolmogorov–Smirnov (K–S) test further validated the hallmark distributional differences captured by OncoMark, with K–S statistic values exceeding 0.7 and p values effectively zero for all hallmarks (Table 2). These findings highlight both the biological significance of OncoMark’s hallmark activity profiles and its superior capacity to distinguish between healthy and malignant states—a performance not replicated by baseline approaches.
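The two-sample K–S statistic is the largest vertical distance between the empirical cumulative distribution functions of two samples (here, predicted hallmark probabilities in normal versus cancer datasets). A minimal sketch of the statistic alone:

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic D = max_x |F_a(x) - F_b(x)|."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])  # the supremum is attained at an observed value
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))
```

A statistic above 0.7, as reported in Table 2, means that at some probability value the two empirical CDFs differ by more than 70 percentage points. In practice, scipy.stats.ks_2samp returns the same statistic together with a p value.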

Table 2 The Kolmogorov–Smirnov (K–S) test statistic and p value of the hallmark-specific probability difference in the model prediction

Detection of metastasis hallmark in primary tumors

To evaluate the ability of OncoMark to detect hallmark activity associated with metastatic progression, we focused on the Activating Invasion and Metastasis hallmark using datasets from Vareslija et al.26 and Cosgrove et al.27, comprising matched primary and metastatic tumor transcriptomes. Given that all patients had confirmed metastatic disease, we reasoned that the primary tumors should already exhibit transcriptional programs indicative of metastatic potential. We therefore applied OncoMark exclusively to primary tumor samples. The model predicted elevated activity for the Activating Invasion and Metastasis hallmark across these cases, suggesting that it captures biologically relevant signatures reflective of early metastatic competence (Fig. 4b).

Hallmark activity increases with advancing clinical cancer stage

Because clinical stage reflects disease progression, we applied the model to data from The Cancer Genome Atlas (TCGA) to investigate how hallmark activity co-occurs with AJCC stage and TNM categories. We quantified co-occurrence using odds ratios (ORs) to assess the strength of association between each hallmark and each clinical stage28. Hallmark activity changed dynamically across clinical stages, offering insight into the biological changes associated with tumor development and progression. Across AJCC stages, hallmark activity increased progressively from Stage I to Stage IV, with the strongest associations at advanced stages, underscoring the role of hallmark pathways in tumor aggressiveness (Fig. 5a). For metastasis status (M0 to M1), hallmark activity increased with metastatic disease, and for nodal status (N0 to N2/N3), activity intensified with greater lymph node involvement, highlighting the role of hallmark pathways in tumor spread (Fig. 5b, c). A similar trend was observed across tumor categories (T1 to T4), with hallmark activity highest at T4, reflecting the greater engagement of these pathways in advanced, aggressive tumors (Fig. 5d). Collectively, these findings highlight the dynamic regulation of hallmark pathways during cancer progression and their potential as biomarkers for disease staging and therapeutic intervention.
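The odds-ratio calculation can be sketched as follows, assuming each hallmark–stage pair is summarized as a 2×2 table of patients (stage of interest versus a reference stage, hallmark active versus inactive). The Haldane–Anscombe correction for zero cells and the Wald confidence interval are assumptions of this sketch; the exact procedure of ref. 28 may differ.

```python
import math

def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio with a Wald confidence interval for the 2x2 table

                            hallmark active   hallmark inactive
        stage of interest          a                  b
        reference stage            c                  d

    If any cell is zero, 0.5 is added to every cell (Haldane-Anscombe correction).
    Returns (OR, lower, upper); z = 1.96 gives an approximate 95% interval.
    """
    if 0 in (a, b, c, d):
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    log_or = math.log(ratio)
    return ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)
```

An OR above 1 whose interval excludes 1 indicates that hallmark activity is more common at the stage of interest than at the reference stage.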

Fig. 5: OncoMark reveals associations between hallmark activity and clinical cancer staging.

Heatmaps display associations between hallmark activity and key clinical staging metrics: (a) AJCC stage, (b) metastasis stage, (c) lymph node involvement, and (d) tumor size. Color intensity reflects the strength of association, with darker hues indicating stronger relationships. Asterisks (*) denote statistically significant associations (p < 0.05).

Therapeutic associations of hallmark activity with patient survival outcomes

We next examined how cancer therapies relate to patient outcomes, specifically overall survival (OS), disease-free survival (DFS), and progression-free survival (PFS). Using TCGA data, we investigated associations between cancer drugs and hallmark activity. Hallmark–drug associations were quantified for all cancer patients and used as features in a logistic regression model with survival status as the dependent variable. The fitted feature weights were interpreted as impact scores, a quantitative measure of the association between each drug and patient survival, and were used to rank treatments by their association with altered hallmark pathways. For DFS, high impact scores linked ERI with anastrozole, AIM with cyclophosphamide, TPI with radiation, and RCD with trastuzumab and vinorelbine, suggesting that these therapies may modify the hallmarks that drive tumor recurrence (Fig. 6a). For PFS, RCD was associated predominantly with cyclophosphamide, trastuzumab, and vinorelbine, consistent with these drugs’ potential to suppress disease progression by disrupting hallmark-specific pathways involved in tumor growth and metastasis (Fig. 6b). For OS, high impact scores linked ERI with anastrozole and leuprolide, RCD with cyclophosphamide and trastuzumab, GIM and IA with irinotecan, and SPS with radiation (Fig. 6c). These results suggest that these therapies may improve patient survival by targeting hallmark pathways central to tumor progression.
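The impact-score idea can be sketched as a logistic regression whose coefficients are read as scores. How drug–hallmark associations were encoded as features is not specified here, so the feature names and the simulated data below are hypothetical, and plain batch gradient descent stands in for whatever solver was actually used. The coefficients are log-odds associations with survival status, not causal treatment effects.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, n_iter=3000):
    """Unregularized logistic regression fitted by batch gradient descent."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        err = prob - y  # gradient of the mean log-loss with respect to the logits
        w -= lr * (X.T @ err) / n
        b -= lr * err.mean()
    return w, b

# Hypothetical drug-by-hallmark features for 500 simulated patients.
feature_names = ["drug_A:hallmark_1", "drug_B:hallmark_2", "drug_C:hallmark_3"]
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 3)).astype(float)
true_logit = 2.0 * X[:, 0] - 2.0 * X[:, 1]  # the third feature has no effect
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

coef, intercept = fit_logistic(X, y)
impact_scores = dict(zip(feature_names, coef))  # coefficients read as impact scores
ranking = sorted(impact_scores, key=impact_scores.get, reverse=True)
```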

Fig. 6: Linking hallmark activity to drug response and clinical outcomes.

Heatmaps show the impact of various anticancer drugs on hallmark activity, alongside their associations with (a) disease-free survival, (b) progression-free survival, and (c) overall survival. Darker shades correspond to stronger drug–hallmark interactions, revealing key therapeutic contributions to improved patient outcomes.

Discussion

Tumor staging and grading are indispensable for cancer assessment, offering insights into tumor size, spread, and cellular differentiation5. However, these conventional metrics primarily reflect anatomical and morphological characteristics, often failing to capture the molecular mechanisms driving tumor behavior29. Cancer progression is governed not only by physical growth but also by hallmark biological processes4. Identifying and quantifying these hallmarks could provide a more nuanced understanding of tumor biology, uncovering therapeutic vulnerabilities and resistance mechanisms that traditional diagnostic methods overlook30,31. Routine cancer diagnostics, such as imaging and histopathology, are limited in their ability to assess molecular hallmarks due to their focus on observable features rather than the underlying gene expression profiles that drive hallmark activation. Advances in transcriptomics and machine learning offer the potential to bridge this gap32. By leveraging these tools, hallmark-specific molecular patterns can be identified and quantified, enabling more precise assessments of tumor biology. This approach may complement conventional methods, enhancing prognostic accuracy and supporting personalized therapeutic interventions.

OncoMark addresses these challenges by quantifying the activation states of cancer hallmarks using a multi-task neural network trained on synthetic biopsy transcriptomic data. By quantifying hallmark activity, OncoMark enables tumors to be stratified by molecular profile rather than by anatomical characteristics alone. This molecular stratification may reveal biologically aggressive tumors that appear indolent under traditional grading and staging, improving risk assessment and early detection of relapse or resistance. Additionally, hallmark-driven profiling might uncover novel prognostic biomarkers, enhancing personalized care and outcome prediction11. Furthermore, OncoMark’s capacity to capture hallmark interdependencies offers a comprehensive view of tumor heterogeneity, which is critical for advancing precision oncology. Beyond prognosis, OncoMark may aid therapy design by identifying hallmark-specific vulnerabilities. For instance, tumors with high activity of the angiogenesis hallmark may benefit from anti-angiogenic therapies, while those with genomic instability might respond to DNA damage repair inhibitors. Because OncoMark was trained on data from 14 tumor sites, it may also apply across diverse cancer types, potentially supporting the development of tailored treatment strategies. Moreover, tracking hallmark states over time might help guide therapeutic adjustments and monitor treatment responses, which could contribute to improved patient outcomes31.

Despite its promise, the widespread adoption of OncoMark faces several challenges. The high cost of transcriptomic profiling and the need for specialized infrastructure limit accessibility in many healthcare systems, particularly in resource-limited settings. Integration into clinical workflows will require significant adaptations, including updates to diagnostic guidelines and training for healthcare providers. Moreover, hallmark-level predictions could not be validated directly on real-world biopsies because ground-truth hallmark annotations are unavailable, and comprehensive benchmarking was not possible because no existing tool annotates these hallmarks. However, these challenges may diminish with technological advances. The declining cost of sequencing, the development of portable sequencing devices, and the emergence of cloud-based analytical platforms could democratize access to transcriptomic analysis33. Additionally, the growing integration of electronic health records and data-sharing initiatives may facilitate the incorporation of hallmark-based diagnostics into clinical practice.

In conclusion, OncoMark is, to our knowledge, the first computational tool to predict all hallmarks of cancer simultaneously. By bridging the gap between molecular biology and clinical practice, OncoMark has the potential to enhance prognostication, improve therapeutic targeting, and support the transition toward precision oncology. With continued advances in technology and infrastructure, hallmark-based diagnostics may become a routine component of personalized cancer management, offering patients more accurate and effective care.

Methods

Data overview

We used a comprehensive dataset of 3.1 million single-cell transcriptomes from 14 tumor sites (head and neck, lung, liver/biliary, kidney, prostate, sarcoma, brain, breast, pancreas, neuroendocrine, colorectal, ovarian, skin, and hematologic), collected from 941 patients across 56 studies worldwide as part of the Weizmann 3CA repository, to generate synthetic (pseudo-bulk) datasets (Supplementary Table 1). For model training and cross-validation, we used about 90% of the studies (846 patients from 51 studies). For external validation, we used the remaining studies, which were excluded from the training set; these five studies (refs. 21–25) included a total of 95 patients. Additionally, we collected real-world bulk transcriptomic data from publicly available sources, including The Cancer Genome Atlas (TCGA, n = 6679), MET500 (n = 868)34, POG570 (n = 570)35, the Cancer Cell Line Encyclopedia (CCLE, n = 1019)36, Therapeutically Applicable Research to Generate Effective Treatments (TARGET, n = 734), the Pan-Cancer Analysis of Whole Genomes (PCAWG, n = 1210)37, and normal datasets from the Genotype-Tissue Expression project (GTEx, n = 8228)38 and the Encyclopedia of DNA Elements (ENCODE, n = 329)39,40. These datasets were used to evaluate the sensitivity and specificity of OncoMark in predicting hallmark activity in both synthetic and real-world settings. The detailed methodology for processing and sampling the real-world bulk transcriptomic data is outlined in the Supplementary Methods. All datasets were publicly available, de-identified, and obtained with appropriate consent from participants in their respective studies. Institutional permissions for data use were secured by the respective studies, so no further ethics approval was required.
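Holding out whole studies, rather than individual cells or patients, keeps every external validation study unseen during training. A minimal sketch, assuming the held-out studies are drawn at random (the selection procedure is not described here) and using hypothetical study identifiers:

```python
import numpy as np

def split_by_study(study_ids, n_holdout=5, seed=0):
    """Return (train_mask, holdout_mask) that never place one study in both sets."""
    study_ids = np.asarray(study_ids)
    rng = np.random.default_rng(seed)
    held_out = rng.choice(np.unique(study_ids), size=n_holdout, replace=False)
    holdout_mask = np.isin(study_ids, held_out)
    return ~holdout_mask, holdout_mask

# Toy example: 56 hypothetical studies with three samples each.
study_ids = np.repeat([f"study_{i:02d}" for i in range(56)], 3)
train_mask, holdout_mask = split_by_study(study_ids)
```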

Synthetic data construction

Single-cell RNA-seq data processing and quality control

Prior to synthetic data generation, raw single-cell RNA sequencing (scRNA-seq) data underwent rigorous quality control (QC) to ensure reliability and biological relevance of downstream analyses41,42. The following QC criteria were applied:

• Mitochondrial content filtering: Cells with mitochondrial transcript content exceeding 15% were excluded to eliminate stressed or dying cells, which typically exhibit elevated mitochondrial RNA levels due to leakage.

• Gene count thresholding: Cells expressing fewer than 200 or more than 6000 genes were discarded to remove low-quality cells and potential multiplets. These thresholds retained biologically meaningful cells with appropriate RNA content.
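The two QC filters can be expressed as a boolean mask over cells. This NumPy sketch assumes a dense cells × genes count matrix and a precomputed mask of mitochondrial genes (for example, genes whose symbols start with "MT-" in human data); single-cell toolkits such as Scanpy and Seurat provide equivalent built-in helpers.

```python
import numpy as np

def qc_mask(counts, is_mito, max_mito_pct=15.0, min_genes=200, max_genes=6000):
    """True for cells passing both QC filters.

    counts: (cells, genes) raw count matrix; is_mito: boolean mask over genes.
    """
    total = counts.sum(axis=1)
    mito_pct = 100.0 * counts[:, is_mito].sum(axis=1) / np.maximum(total, 1)
    n_genes = (counts > 0).sum(axis=1)  # number of detected genes per cell
    return (mito_pct <= max_mito_pct) & (n_genes >= min_genes) & (n_genes <= max_genes)

# Toy data: 3 cells x 300 genes, the first 10 genes treated as mitochondrial.
is_mito = np.zeros(300, dtype=bool)
is_mito[:10] = True
counts = np.zeros((3, 300), dtype=int)
counts[0, 10:260] = 1   # 250 genes, 0% mitochondrial: kept
counts[1, 10:110] = 1   # only 100 genes detected: removed
counts[2, 10:260] = 1   # 250 nuclear genes ...
counts[2, :10] = 10     # ... plus ~29% mitochondrial reads: removed
keep = qc_mask(counts, is_mito)
```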

Importantly, we did not filter cells based on cell type, nor did we restrict the analysis to malignant cells. This approach preserved the cellular heterogeneity characteristic of real tumor biopsies, including stromal and immune infiltration. Additionally, we intentionally refrained from applying batch correction to the source transcriptomes to retain technical variability introduced by differing sequencing platforms, library preparation protocols, and contributing centers. This design choice allowed the synthetic data to emulate the complexity of real-world transcriptomic profiles.

Gene set curation for cancer hallmarks

Gene sets associated with cancer hallmarks were curated from multiple databases, retaining only genes identified in at least two independent sources43,44,45,46,47,48. To evaluate their prognostic relevance, we fitted Cox proportional hazards models using patient-level expression and survival data from The Cancer Genome Atlas (TCGA)49. Genes with hazard ratios (HR) < 1.05 were excluded to enrich for features more strongly associated with adverse clinical outcomes and tumor progression. The resulting gene sets were further refined by manual literature curation to ensure inclusion of genes mechanistically implicated in hallmark-related biological processes (Supplementary Data 1).
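The two automated filters (recurrence in at least two databases and HR ≥ 1.05) can be sketched in plain Python. The gene names and hazard ratios below are hypothetical placeholders, genes without a hazard ratio are assumed to be dropped, and the manual literature curation step is not captured.

```python
from collections import Counter

def curate_gene_set(source_sets, hazard_ratios, min_sources=2, min_hr=1.05):
    """Keep genes listed in >= min_sources databases with a Cox HR >= min_hr.

    Genes without a hazard ratio are dropped (an assumption of this sketch).
    """
    n_sources = Counter(gene for genes in source_sets for gene in set(genes))
    recurrent = (gene for gene, k in n_sources.items() if k >= min_sources)
    return sorted(gene for gene in recurrent if hazard_ratios.get(gene, 0.0) >= min_hr)

# Hypothetical inputs for one hallmark.
sources = [{"GENE_A", "GENE_B", "GENE_C"}, {"GENE_A", "GENE_C"}, {"GENE_B", "GENE_D"}]
hazard_ratios = {"GENE_A": 1.30, "GENE_B": 0.90, "GENE_C": 1.05, "GENE_D": 2.00}
curated = curate_gene_set(sources, hazard_ratios)
```

Here GENE_D is dropped for appearing in only one source despite its high hazard ratio, and GENE_B is dropped for its protective hazard ratio.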

Computation of digital hallmark scores

Digital hallmark scores were computed for 3.1 million single cells using UCell, a robust method for quantifying gene signature enrichment in scRNA-seq data15. UCell scores gene signatures using the Mann–Whitney U statistic. Given a gene expression matrix \(X\) of size \({G\; \times\; C}\), where \(G\) is the number of genes and \(C\) is the number of cells, UCell first constructs a relative rank matrix \(R\) by ranking the expression values within each column (i.e., each cell) in decreasing order, so that the most highly expressed gene in a cell receives rank 1. Because scRNA-seq data are sparse and contain many zero values, genes with large ranks (low or zero expression) form a long, uninformative tail. To address this, all ranks greater than a predefined threshold \({r}_{\max }\) (default = 1500) are set to \({r}_{\max }+1\), mitigating the influence of this tail. For each hallmark signature of \(n\) genes, the UCell score \({U}_{j}^{{\prime} }\) of each cell \(j\) in \(X\) is computed as:

$${U}_{j}^{{\prime} }=1-\frac{{U}_{j}}{n\cdot {r}_{\max }}$$

where \(U_j\) is the Mann–Whitney U statistic, computed from the truncated ranks \(r_{i,j}^{*}\) of the signature genes \(i=1,\ldots,n\) in cell \(j\):

$${U}_{j}={\sum }_{i=1}^{n}{r}_{i,j}^{* }-\frac{n\left(n+1\right)}{2}$$
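The scoring step can be sketched with NumPy and SciPy as below. This is an illustrative re-implementation of the two equations above, not the UCell package itself. It assumes a dense genes × cells matrix and assigns average ranks to tied values such as zeros. The function name `ucell_scores` is hypothetical.

```python
import numpy as np
from scipy.stats import rankdata

def ucell_scores(X, signature_idx, r_max=1500):
    """UCell-style scores for a genes x cells matrix X and a list of signature row indices.

    Assumptions: rank 1 = highest expression within a cell; ties get their average rank.
    """
    X = np.asarray(X, dtype=float)
    n = len(signature_idx)
    # Rank genes within each column (cell); negating makes the highest expression rank 1
    R = rankdata(-X, method="average", axis=0)
    # Truncate the uninformative tail: every rank above r_max becomes r_max + 1
    R[R > r_max] = r_max + 1
    # Mann-Whitney U per cell from the signature genes' truncated ranks
    U = R[signature_idx, :].sum(axis=0) - n * (n + 1) / 2
    # Normalize: 1 when the signature genes occupy the top n ranks of the cell
    return 1.0 - U / (n * r_max)

# Toy example: 5 genes x 2 cells, signature = genes 0 and 1, r_max = 3
X = np.array([[10, 0], [9, 0], [0, 5], [0, 4], [0, 3]])
print(ucell_scores(X, [0, 1], r_max=3))  # cell 0 -> 1.0, cell 1 -> 1/6
```

In the second cell the two signature genes are tied at zero (average rank 4.5), which truncation maps to \(r_{\max}+1 = 4\), giving \(U_j = 8 - 3 = 5\) and a score of \(1 - 5/6\).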

Threshold determination and binary label assignment

To assign binary hallmark labels to single cells, we benchmarked three thresholding algorithms: Otsu50, Yen51, and ISODATA. Given the tissue-specific expression patterns of hallmark-associated genes, we computed digital score thresholds independently for each combination of tissue site and hallmark. Threshold performance was evaluated by how well each method separated the bimodal hallmark score distributions. Otsu's method consistently outperformed the other two and was adopted for all subsequent analyses (Supplementary Fig. 7).
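Otsu's method chooses the cut point that maximizes the between-class variance of a histogram of scores. A minimal NumPy sketch follows. It is not the implementation used in the study, and the bin count of 256 is an assumption. In the setting described above, it would be applied separately to the scores of each tissue site and hallmark.

```python
import numpy as np

def otsu_threshold(scores, nbins=256):
    """Return the histogram cut point that maximizes between-class variance (Otsu)."""
    scores = np.asarray(scores, dtype=float)
    counts, edges = np.histogram(scores, bins=nbins)
    centers = (edges[:-1] + edges[1:]) / 2
    # Class weights for every candidate split: lower class = bins 0..k, upper = bins k+1..end
    w0 = np.cumsum(counts)
    w1 = w0[-1] - w0
    m0_sum = np.cumsum(counts * centers)
    total = m0_sum[-1]
    valid = (w0 > 0) & (w1 > 0)
    # Class means (guarded against empty classes)
    mu0 = np.where(valid, m0_sum / np.maximum(w0, 1), 0.0)
    mu1 = np.where(valid, (total - m0_sum) / np.maximum(w1, 1), 0.0)
    # Unnormalized between-class variance; invalid splits can never win
    between = np.where(valid, w0 * w1 * (mu0 - mu1) ** 2, -1.0)
    k = int(np.argmax(between))
    # Scores above the upper edge of bin k fall in the upper class
    return edges[k + 1]
```

With a per-cell table of scores, the function would be called once per (tissue, hallmark) group, for example via a pandas `groupby` on those two columns.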

Rather than applying a single cutoff, we defined upper and lower thresholds per hallmark and tissue type. Cells with scores above the upper threshold were labeled as hallmark-positive, while those below the lower threshold were labeled as hallmark-negative. Cells with intermediate scores were excluded to minimize label ambiguity.
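The three-way labeling rule can be expressed as below. The text does not specify how the upper and lower thresholds are derived from the Otsu cut point, so `lower` and `upper` are hypothetical inputs here. The code `-1` for excluded cells is an arbitrary convention chosen for this sketch.

```python
import numpy as np

def assign_hallmark_labels(scores, lower, upper):
    """Label cells 1 (positive), 0 (negative) or -1 (ambiguous, excluded).

    lower/upper: per-tissue, per-hallmark cutoffs supplied by the caller (hypothetical).
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    scores = np.asarray(scores, dtype=float)
    labels = np.full(scores.shape, -1, dtype=int)  # default: intermediate score, excluded
    labels[scores > upper] = 1                     # above the upper threshold
    labels[scores < lower] = 0                     # below the lower threshold
    return labels

print(assign_hallmark_labels([0.05, 0.2, 0.35, 0.5], lower=0.15, upper=0.4))  # [ 0 -1 -1  1]
```

Cells labeled `-1` would simply be dropped before any pseudo-bulk aggregation.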

Synthetic biopsy generation for model training and validation

Following label assignment, synthetic pseudo-bulk datasets were created by aggregating hallmark-positive or hallmark-negative single cells12,13,14,52. For each tumor site (n = 14), we selected 200 cells per sample (where available) from either class of each hallmark and summed their gene expression values to form a pseudo-bulk profile. This aggregation strategy enabled the modeling of hallmark-specific transcriptomic signals while retaining biological heterogeneity. The choice of 200 cells balanced representational depth with the need for adequate sample diversity for training.
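A minimal sketch of the training-set aggregation follows. It assumes a dense genes × cells count matrix for one tumor site and that each pseudo-bulk profile is the sum of 200 cells drawn without replacement from a single class. It further assumes that cells may be reused across different profiles and that no profile is produced when fewer than 200 cells of that class exist. The number of profiles per site and the name `make_pseudobulk` are hypothetical.

```python
import numpy as np

def make_pseudobulk(X, labels, target_label, n_profiles, n_cells=200, seed=0):
    """Build pseudo-bulk profiles by summing random draws of same-class cells.

    X: genes x cells counts for one tumor site; labels: per-cell 1 / 0 / -1.
    Returns a genes x n_profiles array (zero columns if too few cells are available).
    """
    rng = np.random.default_rng(seed)
    pool = np.flatnonzero(np.asarray(labels) == target_label)
    if pool.size < n_cells:
        return np.empty((X.shape[0], 0))  # assumed: skip classes with < n_cells cells
    profiles = []
    for _ in range(n_profiles):
        # Cells are unique within a profile but may recur across profiles
        chosen = rng.choice(pool, size=n_cells, replace=False)
        profiles.append(X[:, chosen].sum(axis=1))
    return np.column_stack(profiles)
```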

For validation, we generated synthetic datasets from five independent scRNA-seq studies. To mitigate inductive bias, we reused the previously established hallmark thresholds but adopted a modified aggregation strategy: we included all available hallmark-positive or hallmark-negative cells per patient, requiring only a minimum of 100 cells. This produced pseudo-bulk profiles of variable size that better reflect the inherent heterogeneity and compositional imbalance of real tumor samples. These datasets were used exclusively to test the generalizability and robustness of our model under unseen conditions.
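For the validation sets, the per-patient aggregation with a 100-cell minimum can be sketched with pandas as below. The metadata column names `patient` and `label` and the function name are hypothetical. Labels follow the convention 1 = positive and 0 = negative, and any other value (such as excluded cells) is ignored.

```python
import numpy as np
import pandas as pd

def patient_pseudobulk(X, cell_meta, min_cells=100):
    """Sum all positive (or all negative) cells per patient, skipping small groups.

    X: genes x cells array; cell_meta: one row per cell (same order as X's columns)
    with hypothetical columns 'patient' and 'label'.
    """
    cell_meta = cell_meta.reset_index(drop=True)  # index now equals column position in X
    usable = cell_meta[cell_meta["label"].isin([0, 1])]
    profiles, rows = [], []
    for (patient, label), grp in usable.groupby(["patient", "label"]):
        if len(grp) < min_cells:
            continue  # below the minimum cell count
        profiles.append(X[:, grp.index.to_numpy()].sum(axis=1))
        rows.append((patient, label, len(grp)))
    meta = pd.DataFrame(rows, columns=["patient", "label", "n_cells"])
    if not profiles:
        return np.empty((X.shape[0], 0)), meta
    return np.column_stack(profiles), meta
```

Unlike the fixed 200-cell draws used for training, each profile here sums a different number of cells, which is recorded in `n_cells`.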

Feature selection and gene filtering

To identify the most informative genes while accounting for missing data, we implemented a variance-based feature selection approach53. Gene-wise variance was computed across all pseudo-bulk samples. Genes with undefined or zero variance were excluded. The top 10,000 genes with the highest valid variance were retained. To ensure consistency across hallmark-positive and -negative datasets, we intersected the top variable genes from both classes, yielding a final set of 9326 features for model training.
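The variance filter and the class-wise intersection might look like this in pandas. It is a sketch that assumes samples × genes DataFrames in which genes absent from a dataset appear as NaN. The use of pandas' default sample variance (ddof = 1) and the function names are assumptions.

```python
import numpy as np
import pandas as pd

def top_variable_genes(expr, n_top=10_000):
    """Set of genes with the highest finite, non-zero variance across samples."""
    var = expr.var(axis=0, skipna=True)       # NaN-aware variance per gene
    var = var[np.isfinite(var) & (var > 0)]   # drop undefined or zero variance
    return set(var.nlargest(n_top).index)

def shared_features(pos_expr, neg_expr, n_top=10_000):
    # Keep only genes ranked highly variable in both the positive and negative datasets
    return sorted(top_variable_genes(pos_expr, n_top) & top_variable_genes(neg_expr, n_top))
```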

Data normalization and preprocessing

To ensure uniform feature representation, missing gene values were imputed as zeros. All datasets, including both synthetic and external validation sets, were transformed into rank space to minimize batch effects and to neutralize differences in the normalization methods applied to the source datasets54. Expression values were subsequently log2-transformed and z-score standardized (mean = 0, standard deviation = 1) across samples. These preprocessing steps were also intended to improve model convergence.
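The preprocessing chain can be sketched as follows. Several details are assumptions that the text does not state. Ranks are computed across genes within each sample, in ascending order with average ranks for ties. Log2 is applied directly to the ranks, which are all ≥ 1. The z-score uses the population standard deviation. The name `normalize_for_model` is hypothetical.

```python
import numpy as np
import pandas as pd

def normalize_for_model(expr, features):
    """Zero-impute, rank within each sample, log2-transform, then z-score each gene.

    expr: samples x genes DataFrame; features: ordered list of model input genes.
    """
    X = expr.reindex(columns=features).fillna(0.0)  # genes missing from a dataset -> 0
    R = X.rank(axis=1, method="average")            # rank space, computed per sample
    L = np.log2(R)                                  # ranks start at 1, so no offset needed
    mu, sd = L.mean(axis=0), L.std(axis=0, ddof=0)
    # Constant genes would give sd = 0; divide by 1 instead to avoid NaN
    return (L - mu) / sd.replace(0, 1.0)
```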

Model architecture

The OncoMark framework uses a deep learning approach to predict hallmark activity in biopsy samples, emphasizing both hallmark-specific precision and the biological interplay among hallmarks. The model is structured as a multi-task neural network consisting of a shared base layer and task-specific output layers19. The shared base layer processes the input features \(x\in\mathbb{R}^{d}\) to extract pan-hallmark characteristics common to all cancer hallmarks, yielding a latent representation \(h=f_{\text{shared}}(x)\). The task-specific output layers then refine this shared representation by focusing on hallmark-specific features, enabling the model to capture the interplay between hallmarks. Each output layer produces an independent probability \(\hat{y}_{i}=\sigma(f_{\text{task}_{i}}(h))\), where \(i\in\{1,\ldots,10\}\) and \(\sigma\) is the sigmoid function, and \(\hat{y}=[\hat{y}_{1},\hat{y}_{2},\ldots,\hat{y}_{10}]\in[0,1]^{10}\) represents the model's prediction of hallmark presence across all ten hallmarks. This design reflects the interconnected nature of hallmark activities observed in tumor biology. By combining a shared representation with hallmark-specific refinement, the framework is designed to mirror the cooperative and dynamic behavior of hallmarks in cancer progression. Details of the model architecture and training methodology are provided in the Supplementary Methods (Supplementary Figs. 2, 8).
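To make the shared-base and task-head structure concrete, the NumPy sketch below implements only a forward pass. One shared ReLU layer computes \(h\), and ten linear heads followed by independent sigmoids produce \(\hat{y}\). The layer width, activation, initialization, and single-layer heads are illustrative assumptions. The actual OncoMark architecture is described in the Supplementary Methods and was built in TensorFlow.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(d, hidden=256, n_tasks=10, seed=0):
    # Hypothetical sizes: one shared hidden layer, one linear head per hallmark
    rng = np.random.default_rng(seed)
    return {
        "W_shared": rng.normal(0.0, np.sqrt(2.0 / d), size=(d, hidden)),
        "b_shared": np.zeros(hidden),
        "W_task": rng.normal(0.0, np.sqrt(1.0 / hidden), size=(n_tasks, hidden)),
        "b_task": np.zeros(n_tasks),
    }

def forward(params, x):
    """x: (batch, d) -> y_hat: (batch, n_tasks) with entries in (0, 1)."""
    h = np.maximum(0.0, x @ params["W_shared"] + params["b_shared"])  # h = f_shared(x)
    logits = h @ params["W_task"].T + params["b_task"]                # f_task_i(h) for all i
    return sigmoid(logits)  # independent per-hallmark probabilities (not a softmax)
```

Using independent sigmoids rather than a softmax lets several hallmarks be predicted as present in the same sample.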

Model training and hyperparameter optimization

The model was trained on a balanced dataset comprising 67,930 samples with 9326 input features representing gene expression profiles. Of these, 57,735 samples (85%) were used for training and 10,195 samples (15%) for validation; the train–test split was performed separately for each hallmark to ensure balanced representation in both sets. To prevent catastrophic forgetting (a phenomenon in which a model loses previously learned information when trained on new tasks), data from all hallmark tasks were merged and randomly shuffled during training, avoiding hallmark-specific batches and ensuring uniform exposure of the model to all hallmark datasets. The Adam optimizer, with a learning rate of 0.0001, was used to minimize the binary cross-entropy loss, which was calculated independently for each hallmark prediction task and combined into a weighted average using task-specific indicators to ensure balanced learning. Early stopping with a patience of six epochs was implemented to prevent overfitting by halting training when the validation loss showed no improvement, and a learning rate scheduler multiplied the learning rate by 0.5 after three consecutive epochs without improvement, down to a minimum of 1e-6. Training was run for 50 epochs with a batch size of 256. Although the validation loss continued to improve, the gains became marginal in later epochs, so training was halted at 50 epochs, before full convergence (Supplementary Fig. 9 and Supplementary Table 2).
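One way to read "combined into a weighted average using task-specific indicators" is as a masked loss. Because samples from all hallmarks are mixed in each batch, each sample carries labels for only some hallmarks, and an indicator mask keeps the unlabeled ones out of the average. The NumPy sketch below implements that interpretation, which is an assumption rather than the authors' code.

```python
import numpy as np

def masked_multitask_bce(y_true, y_pred, mask, eps=1e-7):
    """Binary cross-entropy averaged over (sample, hallmark) pairs whose indicator is 1.

    All inputs have shape (batch, n_tasks). Assumed meaning of mask[s, i]:
    1 if sample s has a label for hallmark i, else 0.
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    # Unlabeled pairs contribute nothing; normalize by the number of labeled pairs
    return float((mask * bce).sum() / np.maximum(mask.sum(), 1))
```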

Evaluation protocol and performance metrics

Our methodology used five-fold cross-validation repeated twice to ensure robust model assessment. In each iteration, four folds were used to train the model, and the fifth fold was divided into two equal parts: one served as a validation set for model selection and performance monitoring, and the other was reserved for independent performance evaluation (Supplementary Table 3). The model was also validated on five independent datasets. Model performance was assessed using the F1 score, accuracy, precision, recall, balanced accuracy, the confusion matrix, and the areas under the precision–recall curve (AUC-PR) and the receiver operating characteristic curve (AUC-ROC) (Supplementary Data 2). We report the mean and standard deviation (SD) of these AUC values across all cross-validation repetitions to capture variability and reliability. Additionally, the model was applied to two normal datasets, Genotype-Tissue Expression (GTEx) and The Encyclopedia of DNA Elements (ENCODE), and to six cancer datasets: The Cancer Genome Atlas (TCGA), MET500, POG570, the Cancer Cell Line Encyclopedia (CCLE), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Pan-Cancer Analysis of Whole Genomes (PCAWG). Probability density distributions of hallmark predictions were plotted to assess the model's ability to identify hallmark occurrences in cancer datasets and whether it refrained from predicting hallmarks in normal samples. To determine whether the probability distributions of cancer and normal samples differ significantly, a Kolmogorov–Smirnov (K–S) test was conducted55,56.
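The fold scheme described above can be sketched as a generator of index splits. This version uses a plain random permutation and does not reproduce any stratification that may have been used in the study. The function name and seed are hypothetical.

```python
import numpy as np

def repeated_cv_splits(n_samples, n_folds=5, n_repeats=2, seed=0):
    """Yield (train, val, test) index arrays for repeated k-fold cross-validation.

    The held-out fold is halved: one half for validation (model selection),
    the other half for independent testing.
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        folds = np.array_split(rng.permutation(n_samples), n_folds)
        for k in range(n_folds):
            held_out = folds[k]
            half = len(held_out) // 2
            val, test = held_out[:half], held_out[half:]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            yield train, val, test
```

Five folds repeated twice yields ten (train, validation, test) triples, over which the mean and SD of each metric can be computed.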

Statistics and reproducibility

Python v3.11.0 with NumPy v2.0.2 and pandas v2.2.3 was used for data analysis, and Matplotlib v3.9.2 and seaborn v0.13.2 for visualization. TensorFlow v2.18.0 was used to train the model. Differential expression analysis was conducted using PyDESeq2 v0.5.1. SciPy v1.14.1 was used for all statistical tests. All p values from the corresponding statistical tests are reported without adjustment, except in the differential expression analysis, where p values were −log10-transformed.
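As an illustration of the cancer-versus-normal K–S comparison described in the evaluation protocol, `scipy.stats.ks_2samp` can be applied directly to the predicted probabilities of the two groups. The wrapper name and the toy inputs are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

def compare_prediction_distributions(p_cancer, p_normal):
    """Two-sided two-sample K-S test on predicted hallmark probabilities."""
    res = ks_2samp(np.asarray(p_cancer), np.asarray(p_normal))
    return res.statistic, res.pvalue

# Toy inputs: fully separated distributions give the maximal statistic D = 1
d, p = compare_prediction_distributions(np.linspace(0.6, 1.0, 50), np.linspace(0.0, 0.3, 50))
print(d, p)
```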

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.