Introduction

Gastric cancer is a major global health burden and remains among the top causes of cancer-related deaths worldwide1. Despite advances, many patients still present with advanced disease due to limited early detection, complicating treatment2. Accurate preoperative T staging guides decisions from endoscopic resection for T1 to multimodal therapy for T3/T4 disease3,4,5. Accurate T staging thus enables more tailored interventions, potentially improving survival and reducing unnecessary morbidity.

Contrast-enhanced CT is the standard for preoperative evaluation, but conventional interpretation has known limitations. The overall accuracy of CT-based gastric T staging is often reported around 65–75%, with particular difficulty in discriminating T2 from T3 tumors and in recognizing subtle serosal invasion6. Endoscopic ultrasound (EUS) can visualize distinct layers of the gastric wall, thereby helping to distinguish T1 disease from deeper invasion; however, its accuracy declines in advanced tumors and proximal lesions, and the technique is highly operator-dependent7,8. Other adjunct imaging methods, including double contrast-enhanced ultrasound, may improve diagnostic performance in experienced hands but are not universally available. Consequently, there is a clear clinical need for new tools to enhance the objectivity and accuracy of T staging.

Artificial intelligence (AI), especially deep learning using convolutional neural networks (CNNs), has emerged as a powerful approach to analyzing medical images9. CNNs can learn multi-scale textural and morphological features from large volumes of data, surpassing traditional machine learning or radiologist interpretation in tasks such as tumor segmentation, disease classification, and prognostic modeling. Notably, deep learning models have achieved high diagnostic performance in areas like breast lesion classification on MRI, prediction of EGFR mutations in lung cancer using CT, and detection of peritoneal carcinomatosis9,10,11. In gastric cancer imaging research, many published efforts have either focused on binary distinctions (e.g., early vs. advanced stages) or relied on radiomics approaches requiring manual tumor segmentation12,13. Manual segmentation can be time-consuming and subject to significant inter-observer variability, which hampers clinical implementation at scale.

To address these gaps, we developed an end-to-end deep learning system—termed Gastric Cancer T-stage ResNet Network (GTRNet)—that provides a fully automated, four-class T stage classification (T1, T2, T3, T4) from standard portal venous phase CT scans. Unlike many prior methods requiring manual segmentation, our model processes a single axial slice of the largest tumor cross-section, using a modified ResNet-152 backbone to extract relevant features. We hypothesize that GTRNet will improve staging accuracy, reduce reliance on operator skill, and demonstrate generalizability across multiple institutions.

In this study, we collected data from three tertiary centers, resulting in a total of 1792 patients with pathologically confirmed gastric adenocarcinoma. We trained and tested GTRNet using internal datasets and performed external testing in two independent cohorts. We additionally designed a comparative reader study in one external cohort to compare the model’s performance against expert gastrointestinal radiologists. To promote transparency, we incorporated Grad-CAM (Gradient-weighted Class Activation Mapping) to highlight image regions that drive the network’s predictions, providing clinicians with intuitive saliency maps that localize tumor invasion. Finally, we integrated the network output into a combined nomogram, incorporating clinical and pathologic factors known to correlate with disease aggressiveness, thus creating a more holistic preoperative risk stratification tool.

Below, we detail our methodology, results, and clinical implications, focusing on three goals: (1) evaluating GTRNet’s staging accuracy, (2) demonstrating interpretability, and (3) assessing its impact on neoadjuvant therapy decisions.

Results

Study design and patient selection

This retrospective, multicentre study analyzed patients who underwent curative‑intent resection for gastric adenocarcinoma between January 2015 and December 2021 at three tertiary hospitals. Eligible cases met four criteria: (i) histologically confirmed T1–T4 gastric cancer (AJCC 8th edition)14; (ii) pre‑operative contrast‑enhanced abdominal CT available; (iii) no prior chemotherapy or radiotherapy; and (iv) diagnostic image quality without severe artefacts. We excluded patients with incomplete records, non‑diagnostic CT, or distant metastasis precluding curative surgery. The screening process is summarized in Fig. 1. Ultimately, 1792 patients were allocated to a training cohort (n = 953) and an internal test cohort (n = 239) from Hospital A, plus two external test cohorts (n = 360, Hospital B; n = 240, Hospital C).

Fig. 1: Patient inclusion flowchart.
figure 1

Of 2134 consecutive gastric cancer cases (January 2015 to December 2021), 1792 remained after applying the inclusion/exclusion criteria: 953 formed the training set, 239 the internal test set, and 360 and 240 the two external test sets.

Study population

Baseline demographic and clinicopathologic variables—age, sex, tumor location, size, Lauren classification, differentiation status, serum tumor marker levels, and PD-L1 expression—were extracted from electronic health records and are presented in Table 1. In the training set, cases meeting the inclusion criteria were distributed relatively evenly across T1–T4 stages to minimize potential data bias. The mean age across cohorts was approximately 62 years, and approximately 70–73% of patients were female. The distribution of tumor locations and histologic subtypes varied across centers (e.g., proximal tumors accounted for 9.6% of the internal test set versus 25.4% of external test set 2), but no statistically significant differences in demographic characteristics were identified.

Table 1 Baseline demographics and clinicopathologic features in the training, internal test, and external test cohorts (mean ± standard deviation or n (%))

Diagnostic performance of GTRNet

In the internal test and external cohorts, GTRNet achieved high discriminatory performance for T staging. The internal test accuracy was 89.9%, and external test accuracies were approximately 87–94%. AUCs ranged from 0.97 internally to 0.91–0.95 in the external sets. Stage-specific sensitivities were robust (e.g., 75–95% across T1–T4 in internal testing) and specificities were 83–98%. ROC curves and confusion matrices for each cohort are shown in Fig. 2, and detailed performance metrics are provided in Table 2.

Fig. 2: Model performance.
figure 2

a–d One-vs-rest ROC curves in the training, internal test, and two external cohorts (macro-AUC 0.91–0.98). e–h Normalized confusion matrices; darker blue indicates higher accuracy.

Table 2 Performance of GTRNet in predicting pathologic T stage across cohorts

Evaluation metrics and statistical analysis

All reporting complies with the CLAIM 2024 checklist for AI studies in medical imaging. Model performance was evaluated in the internal test set and the two external test cohorts using standard metrics. For the four-class task, we computed one-vs-rest ROC curves and macro-averaged AUCs, and derived confusion matrices together with class-specific sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV); these results are presented in Fig. 2. All metrics were computed from each patient's final case-level classification: in the external test cohorts, the predictions from three slices per case (the key slice and its adjacent slices) were combined by majority voting, so that evaluation mirrored how the model would be used in clinical practice.
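
For concreteness, the following is a minimal sketch of the one-vs-rest, macro-averaged evaluation described above, using scikit-learn; the array names (`y_true`, `y_prob`) and the wrapper function are illustrative and not taken from the study code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

def evaluate_t_staging(y_true, y_prob):
    """y_true: pathologic stage labels 0-3 (T1-T4); y_prob: (n, 4) softmax outputs."""
    classes = [0, 1, 2, 3]
    # One-vs-rest ROC AUC, macro-averaged over the four T stages
    macro_auc = roc_auc_score(y_true, y_prob, multi_class="ovr",
                              average="macro", labels=classes)
    y_pred = np.argmax(y_prob, axis=1)
    cm = confusion_matrix(y_true, y_pred, labels=classes)
    per_class = {}
    for k in classes:
        tp = cm[k, k]
        fn = cm[k].sum() - tp
        fp = cm[:, k].sum() - tp
        tn = cm.sum() - tp - fn - fp
        per_class[f"T{k + 1}"] = {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "ppv": tp / (tp + fp),
            "npv": tn / (tn + fn),
        }
    return macro_auc, cm, per_class
```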

Performance of GTRNet and radiologists in predicting clinical T stage of gastric cancer

We compared GTRNet with radiologists from the three centers in predicting clinical T stage. As shown in Supplementary Table 1, GTRNet outperformed the radiologists' independent reads in both overall accuracy (92.8% at Hospital A, 93.6% at Hospital B, and 86.7% at Hospital C, versus 58.4%, 59.7%, and 55.3%, respectively; all p < 0.001) and agreement with pathology. With GTRNet assistance, radiologists' accuracy rose to 85.4–91.5%, approaching the standalone model. The weighted kappa between GTRNet and pathologic T stage was 0.87–0.91, compared with 0.41–0.45 for unassisted radiologists; with AI assistance, radiologists' kappa increased to 0.83–0.88, indicating that AI-assisted interpretation significantly reduces staging discrepancies (p < 0.001). For early-stage gastric cancer (T1/T2), GTRNet showed very high sensitivity, 98.5–99.0% for T1 and 93.0–97.8% for T2, versus 65.0–71.0% and 55.3–60.2% for radiologists, a margin that is directly relevant when formulating individualized treatment strategies for early-stage disease. For advanced-stage gastric cancer (T3/T4), GTRNet also showed favorable sensitivity, 83.3% for T3 and 81.7–93.3% for T4, suggesting the model could help optimize treatment decisions and reduce unnecessary interventions in practice.

Grad-CAM heatmap visualization

To enhance the clinical interpretability of GTRNet, we used Grad-CAM heatmaps to visualize the regions of interest (ROIs) the model prioritizes. As illustrated in Fig. 3, warm colors (red, yellow) indicate higher model attention and cool colors (blue, green) indicate lower attention. The attention distribution generated by GTRNet showed a high degree of spatial overlap with the ROIs (red-filled areas) manually annotated by radiologists. For example, in T1-stage lesions the heatmap predominantly highlighted the inner layers of the gastric wall, indicating that the model captured imaging features of superficial submucosal infiltration, whereas in T4-stage lesions the model's attention extended beyond the thickened gastric wall to the interface between the tumor and adjacent organs, suggesting an ability to detect radiological signs of invasion into surrounding structures. These findings support the interpretability and clinical relevance of GTRNet for T staging of gastric cancer.
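
A brief sketch of how such overlays can be produced with the pytorch-grad-cam package listed in the software section is shown below; the target layer (`model.layer4[-1]`, torchvision-style naming), the single-channel-to-RGB handling, and all variable names are assumptions for illustration only.

```python
import numpy as np
import torch
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

def gradcam_overlay(model, ct_slice, predicted_stage):
    """ct_slice: float32 array in [0, 1], shape (224, 224); predicted_stage: class index 0-3."""
    x = torch.from_numpy(ct_slice)[None, None].repeat(1, 3, 1, 1)   # (1, 3, 224, 224)
    # Hook the last residual block of the backbone (layer naming is assumed here)
    cam = GradCAM(model=model, target_layers=[model.layer4[-1]])
    heatmap = cam(input_tensor=x,
                  targets=[ClassifierOutputTarget(predicted_stage)])[0]  # (224, 224) in [0, 1]
    rgb = np.repeat(ct_slice[..., None], 3, axis=-1)                # grayscale slice as RGB
    overlay = show_cam_on_image(rgb, heatmap, use_rgb=True)         # colored heatmap blended on CT
    return heatmap, overlay
```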

Fig. 3: Grad-CAM visualization.
figure 3

For representative T1–T4 cases, portal-venous CT images (left) and color heatmaps (right) highlight mucosa (cT1), muscularis propria (cT2), serosal surface (cT3), and pancreatic invasion (cT4).

To further quantify the spatial alignment between the model’s attention regions and the actual tumor locations, we binarized the Grad-CAM heatmaps and computed the Dice similarity coefficient by comparing them with the gold-standard masks manually segmented by radiologists. The Dice coefficients across different T stages were as follows: 0.56 for T1, 0.59 for T2, 0.60 for T3, and 0.63 for T4, indicating moderate to substantial spatial overlap between the model’s attention maps and the ground truth lesion areas. These findings suggest that the attention mechanism of GTRNet does not erroneously focus on image artifacts or non-pathological regions, but rather accurately identifies key anatomical structures that are critical for T staging. Furthermore, in Supplementary Fig. 1, we illustrate several representative misclassified cases to investigate potential error patterns under conditions of ambiguous staging boundaries or complex anatomical configurations. This analysis provides insights that may guide future improvements in model performance.
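
A minimal sketch of this overlap measure is given below; the 0.5 binarization threshold is an illustrative choice, as the exact threshold is not specified above.

```python
import numpy as np

def dice_from_heatmap(heatmap, expert_mask, threshold=0.5):
    """heatmap: Grad-CAM values in [0, 1]; expert_mask: radiologist's binary tumor mask."""
    pred = (heatmap >= threshold).astype(np.uint8)   # binarize the attention map
    gt = (expert_mask > 0).astype(np.uint8)
    intersection = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0
```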

Nomogram integration and clinical utility

While GTRNet alone performed well, we sought to incorporate clinical and histologic factors to further refine T-stage prediction. After extracting deep features from the penultimate network layer, we computed a continuous Rad-score for each patient. We then built an ordinal logistic-regression model that combined the Rad-score with tumor size (≥ 5 cm), poor differentiation, and diffuse Lauren type. All four predictors contributed significantly to the model (each P < 0.001 in multivariable analysis). The corresponding regression coefficients are listed in Table 3, and inclusion of the Rad-score significantly improved model fit.
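
The following sketch shows one way to fit such an ordinal (proportional-odds) logistic model in Python with statsmodels; the file name, column names, and coding of the binary predictors are hypothetical.

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

df = pd.read_csv("training_cohort.csv")                # hypothetical per-patient table
stage = pd.CategoricalDtype(categories=["T1", "T2", "T3", "T4"], ordered=True)
y = df["pT_stage"].astype(stage)                       # ordered four-class outcome
X = df[["rad_score", "size_ge_5cm",                    # continuous Rad-score, binary flags
        "poor_differentiation", "diffuse_lauren"]]

model = OrderedModel(y, X, distr="logit")              # proportional-odds logistic model
fit = model.fit(method="bfgs", disp=False)
print(fit.summary())                                   # coefficients feed the nomogram points
odds_ratios = np.exp(fit.params.iloc[:X.shape[1]])     # ORs for the four predictors
```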

Table 3 Multivariable ordinal logistic model (nomogram) features for T-stage prediction, with regression coefficients and odds ratios (OR)

The resulting nomogram (Fig. 4) provides an intuitive tool for clinicians: each predictor is allocated points, and the total score maps to estimated probabilities of T1, T2, T3, or T4 disease. Calibration was good across all cohorts, with Hosmer–Lemeshow tests showing no significant lack of fit (P > 0.05). To evaluate the model's value for clinical decision-making, we used decision curve analysis (DCA) to compare the net benefit of the AI model with that of EUS staging for identifying high-risk patients. In the test set, the AI model yielded a higher net benefit across most clinically reasonable threshold probabilities (Fig. 5 and Supplementary Table 2). Specifically, the over-treatment rate of the AI model was 2.09% versus 12.97% for EUS, and the under-treatment rate was 2.51% versus 17.57%. The number needed to treat (NNT) for the AI model was 2.19, markedly lower than 5.09 for EUS. Thus the AI model increases clinical benefit while minimizing unnecessary interventions, indicating greater clinical utility.
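
For reference, the standard net-benefit computation underlying the DCA curves can be sketched as follows for a binary "treat as locally advanced (≥ T3)" decision; the dichotomization and variable names are illustrative assumptions.

```python
import numpy as np

def net_benefit(y_true, risk, thresholds):
    """y_true: 1 if pathologically >= T3, else 0; risk: predicted probability of >= T3."""
    n = len(y_true)
    out = []
    for pt in thresholds:
        treat = risk >= pt                       # patients flagged for intervention at this threshold
        tp = np.sum(treat & (y_true == 1))
        fp = np.sum(treat & (y_true == 0))
        out.append(tp / n - fp / n * pt / (1.0 - pt))   # standard net-benefit formula
    return np.array(out)

thresholds = np.linspace(0.01, 0.80, 80)         # threshold range shown in Fig. 5
# nb_model = net_benefit(y_advanced, p_model, thresholds)
# nb_eus   = net_benefit(y_advanced, p_eus, thresholds)
```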

Fig. 4: Integrated clinical–radiomic nomogram.
figure 4

a Top 30 deep features selected by LASSO, ranked by absolute coefficient. b Nomogram combining the Rad-score, tumor size, differentiation, and Lauren type; higher total points correspond to a more advanced T stage.

Fig. 5: Calibration and decision-curve analysis.
figure 5

a–d Calibration curves compare predicted and observed probabilities (dashed 45° line = perfect fit). e–h Decision curve analysis (DCA) comparing the net clinical benefit of the AI prediction model versus EUS staging across datasets. The curves illustrate the net benefit of intervention across a threshold probability range of 0 to 0.8.

Discussion

Compared with traditional CT interpretation and earlier AI or radiomics work, our end‑to‑end network offers three incremental advantages. First, it addresses the known limitations of existing staging tools. Historically, CT has achieved only moderate accuracy (≈65–75%) for gastric-cancer T staging, particularly when differentiating borderline invasions and across readers6. Endoscopic ultrasound can visualize individual gastric wall layers and improves detection of early (T1) tumors, yet performance drops in proximal or bulky lesions and remains operator-dependent7,8. Dual-energy CT and MRI have recently shown incremental value, but they require specialized scanners and are not universally available15. Several radiomics or AI studies have focused on binary "early vs advanced" tasks or have relied on labor-intensive manual segmentation15,16,17.

Second, in recent years several studies have explored deep learning for gastric cancer T staging (Supplementary Table 3). For example, Tao et al. (2024) developed a CT-based vision Transformer that achieved 75.7% accuracy on an external test set, with further gains after integrating radiomics features; however, the task was limited to a binary T1-versus-T4 classification and required manual tumor segmentation, constraining clinical scalability. Similarly, Chen et al., Guan et al., and Zeng et al. proposed deep learning models based on CT or endoscopic ultrasound (EUS) images that achieved high AUCs for early gastric cancer detection, but most did not perform a complete four-class classification across T1–T4, and they generally lacked integration with clinical workflows and multi-center validation18,19,20. There is therefore a pressing need for an AI model that is independent of manual annotation, supports multi-class T staging, and can be integrated into clinical pipelines. GTRNet addresses this need: it was developed on more than 1700 retrospective cases from three tertiary hospitals and further evaluated in a 270-case reader study designed to simulate real-world clinical workflow, demonstrating strong generalization and potential for clinical deployment. The model eliminates laborious manual segmentation and supports full four-class T staging (T1–T4), directly addressing the clinical need to distinguish borderline T2 from T3 disease. Furthermore, successful multi-center AI implementations in other medical domains—breast MRI9, prediction of EGFR status in lung cancer10, and detection of peritoneal metastasis in colorectal cancer11—show that models combining large-scale heterogeneous data with modern convolutional neural network (CNN) architectures consistently outperform traditional radiomics approaches in multi-task settings, further supporting the feasibility and clinical relevance of the GTRNet framework.

These performance gains translate into several clinically relevant implications for neoadjuvant decision‑making and surgical planning. Current Western and Asian guidelines emphasize accurate discrimination between T2 and ≥T3 disease because perioperative chemotherapy improves survival in locally advanced cases. Understaging may deprive patients of effective therapy, whereas overstaging exposes early tumors to unnecessary toxicity. GTRNet’s high sensitivity for T4 and its integration into a nomogram with tumor size and histology provide an intuitive, probability-based aid to select candidates for neoadjuvant treatment or direct surgery. In practice, the model could flag CT-occult serosal invasion that radiologists frequently miss, or reassure surgeons when a lesion is very likely limited to T1/T2.

Third, interpretable heatmaps strengthen clinician trust by showing that the model attends to the same perigastric margins that experts scrutinize. Adoption of AI hinges on transparent reasoning: Grad-CAM visualizations localized attention to the gastric wall and perigastric fat—areas radiologists inspect for transmural spread—thereby bridging the "black box" gap14. Such heatmaps can also serve as training feedback for junior readers, fostering human–machine synergy.

Ethical considerations. The algorithm may propagate scanner‑ or cohort‑specific bias; therefore, continuous audit, domain‑shift monitoring, and patient informed‑consent procedures will accompany any prospective deployment.

Notwithstanding these findings, this study has several limitations. First, to streamline the clinical workflow we grouped T4a and T4b together; refining this categorization in future work could provide more precise guidance on surgical margins and combined organ resection. Although the T4 cases in this study included some T4b patients (e.g., tumors invading adjacent organs), advanced T4b cases with peritoneal metastasis or other contraindications to radical surgery were excluded, so the model was developed predominantly in surgically treatable patients and may generalize less well to inoperable advanced disease. Future work should incorporate more real-world inoperable T4b cases to extend the model's applicability in complex clinical settings. Second, all validation cohorts were drawn from the Chinese population; cross-regional validation in international populations with different body characteristics and scanning protocols15 is needed to establish broader applicability. Third, this study did not address N or M staging. Multimodal models integrating imaging, pathological, and multi-omics data have shown promise in predicting lymph node status21, tumor mutational burden22,23, and Epstein–Barr virus subtypes24, and GTRNet could be extended into a unified framework covering the full TNM staging system, enabling more comprehensive computer-assisted staging of gastric cancer. Finally, although we applied multiple data augmentation strategies12 and an adaptive early-stopping mechanism13 to mitigate overfitting, prospective deployment studies in multi-center, real-world clinical environments are required to fully establish the model's stability and clinical viability16.

Looking forward, several technical and translational extensions could broaden the model’s utility. Research directions include: (i) transformer or 3D CNN architectures to leverage contextual slices; (ii) spiking deep residual networks that emulate neuromorphic efficiency and may reduce inference latency on edge devices25; (iii) panomic integration—combining imaging, genomics, and histopathology—to build comprehensive, patient-specific digital twins. Multicentre benchmark challenges would accelerate reproducibility and standardize evaluation metrics.

In summary, GTRNet, an interpretable end‑to‑end deep learning framework, achieved robust, externally validated performance for four-class gastric cancer T staging on routine CT, outperforming expert radiologists across centres while preserving transparency via heatmap visualization. Coupled with key clinicopathologic variables, the model underpins a nomogram that can refine preoperative decision-making and optimize allocation of neoadjuvant therapy. Prospective trials and broader geographic validation are warranted to translate these findings into global clinical practice.

Methods

CT imaging protocol and image preprocessing

Patients at all three hospitals underwent standardized contrast-enhanced abdominal CT (CECT) in the portal venous phase, with 1.5 mL/kg of contrast agent administered after fasting and oral water intake to distend the stomach. Although the institutional scanners and acquisition parameters differed (Table 4), all CT images were subjected to rigorous preprocessing to reduce heterogeneity: resampling to isotropic 1 × 1 × 1 mm³ voxel spacing, intensity normalization within a standardized Hounsfield unit (HU) range (−1024 to 1024), application of an abdominal window setting (WL 50, WW 350), N4 bias-field correction, and Z-score normalization. For model training and internal testing, an expert radiologist selected a single representative axial tumor slice per patient; this slice was resampled to 224 × 224 pixels and standardized by z-score normalization (mean subtraction and division by the standard deviation), without manual region-of-interest segmentation, enabling an end-to-end workflow. To enhance model robustness and generalizability, data augmentation—including random rotations, flips, and intensity adjustments—was implemented during the training phase.
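
A condensed sketch of the slice-level steps (HU clipping, abdominal windowing, resizing to 224 × 224, z-score normalization) is shown below; the volume-level resampling and N4 correction are omitted, and the function is illustrative rather than the study's actual pipeline code.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_slice(hu_slice, window_level=50.0, window_width=350.0, out_size=224):
    """hu_slice: 2D array of Hounsfield units for the selected axial tumor slice."""
    img = np.clip(hu_slice.astype(np.float32), -1024, 1024)        # standardized HU range
    lo, hi = window_level - window_width / 2, window_level + window_width / 2
    img = np.clip(img, lo, hi)                                      # abdominal window (WL 50, WW 350)
    factors = (out_size / img.shape[0], out_size / img.shape[1])
    img = zoom(img, factors, order=1)                               # resize to 224 x 224
    return (img - img.mean()) / (img.std() + 1e-8)                  # z-score normalization
```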

Table 4 CT acquisition parameters across three centers

Ethics statement

The study protocol was approved by the institutional review boards of Liaoning Cancer Hospital & Institute (KY20240503, 15 Jan 2024), Shengjing Hospital of China Medical University (2024PS184K, 22 Jan 2024), and Zhejiang Cancer Hospital (2024‑ZJ‑GC‑009, 3 Feb 2024). All procedures conformed to the Declaration of Helsinki and relevant national regulations. Written informed consent was waived because only de‑identified, routinely acquired imaging and clinical data were analyzed.

Multi-slice prediction and training strategy

In the external test sets, we included the slices immediately above and below each case's key slice, together with the key slice itself, yielding three images per case for prediction. The final case-level result was obtained by aggregating the three slice-level outputs with a majority-voting strategy, mitigating the randomness associated with single-slice predictions. During training, standard data augmentation techniques (random flipping, rotation, scaling, and intensity perturbation) were applied to improve generalization12, and an adaptive early-stopping strategy that monitored the validation loss was used to adjust training progress and reduce the risk of overfitting13.
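
A minimal sketch of this case-level vote is shown below; the tie-breaking rule (falling back to the key slice when all three predictions differ) is an assumption, as the text does not specify one.

```python
from collections import Counter

def case_level_stage(slice_predictions):
    """slice_predictions: [stage_above, stage_key, stage_below], each in {1, 2, 3, 4}."""
    counts = Counter(slice_predictions)
    stage, votes = counts.most_common(1)[0]       # most frequent slice-level prediction
    return stage if votes >= 2 else slice_predictions[1]   # fall back to the key slice on a tie
```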

Comparative evaluation of GTRNet and radiologists in predicting gastric cancer T stage

To benchmark GTRNet against expert readers, we recruited radiologists from Liaoning Cancer Hospital, Shengjing Hospital of China Medical University, and Zhejiang Cancer Hospital and compared their predictions with those of the model. From each hospital, three radiologists specializing in gastrointestinal radiology, each with 8–15 years of experience, participated. Under single-blind conditions, they independently assigned the clinical T stage (cT1–cT4) from contrast-enhanced CT images alone, with pathologic T stage (pT) as the reference standard. The study comprised two phases: in the first, the radiologists staged cases unaided; in the second, one month later, they repeated the staging with the aid of the GTRNet model. Accuracy, weighted kappa, and stage-specific sensitivity were calculated, and the numbers of over-staged and under-staged cases were tallied to assess potential clinical impact. For sample selection, at Hospital A (Liaoning Cancer Hospital) the radiologists selected 270 cases from the center's 1192 samples, with non-overlapping halves (50% each) used in the two reading sessions; at Hospitals B and C, each session used a non-overlapping 50% of that center's cases. Detailed metrics are provided in Supplementary Table 1.
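
For illustration, the reader-study agreement metrics could be computed as in the sketch below; the quadratic weighting is an assumption, since the text specifies only a weighted kappa.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

def reader_agreement(ct_stage, pt_stage):
    """ct_stage: assigned clinical T stages (1-4); pt_stage: pathologic T stages (1-4)."""
    return {
        "accuracy": accuracy_score(pt_stage, ct_stage),
        "weighted_kappa": cohen_kappa_score(pt_stage, ct_stage, weights="quadratic"),
    }
```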

Deep learning model: GTRNet

We constructed the GTRNet architecture by modifying the ResNet-152 backbone, a deep residual network with strong feature-extraction capacity (Fig. 6). Our modifications aimed to improve the capture of multi-scale features relevant to T staging: parallel max-pooling and center-cropping streams in the early network layers allow the network to attend to both local tumor detail and the wider context around the gastric wall. Transfer learning was applied by initializing ResNet-152 with ImageNet-pretrained weights and replacing the final dense layer with four softmax outputs corresponding to T1–T4. Training used the Adam optimizer (learning rate ~1 × 10−4), a mini-batch size of 32, and categorical cross-entropy loss, with early stopping if validation performance failed to improve for 10 epochs. All training was performed on an NVIDIA Tesla V100 GPU, allowing relatively fast convergence.
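
A schematic sketch of the transfer-learning setup is given below; it uses a plain torchvision ResNet-152 and omits the parallel max-pooling/center-cropping streams, so it illustrates the training configuration rather than the full GTRNet architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_backbone(num_classes=4):
    # ImageNet-pretrained ResNet-152 with a new four-class head for T1-T4
    net = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

model = build_backbone()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()                 # softmax + categorical cross-entropy

def train_step(images, labels):
    """images: (B, 3, 224, 224) float tensor; labels: (B,) int tensor with values 0-3."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```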

Fig. 6: Model development pipeline.
figure 6

A Portal-venous CT slice showing the largest tumor cross-section. B GTRNet architecture: modified ResNet-152 with parallel max-pool and center-crop branches. C End-to-end workflow linking the deep-learning Rad-score to a clinical–radiomic nomogram. D Summary of model evaluation metrics, including ROC curve, confusion matrix, calibration curve, and decision curve analysis (DCA). In the DCA plot, the y-axis represents the net benefit and the x-axis represents the threshold probability, with comparisons among the treat-all, treat-none, and EUS-prediction strategies.

Analyses were performed in Python 3.10.13 (PyTorch 2.2.0, Torchvision 0.17, NumPy 1.26, SciPy 1.11, scikit‑learn 1.3; Grad‑CAM via pytorch‑grad‑cam 1.4.8); R 4.3.1 and IBM SPSS 26.0 were used for statistical analyses; ROI overlays were created in ITK‑SNAP 4.0.1. Full code and environment files are provided in the “Code availability” statement.