Introduction

Esophageal squamous cell carcinoma (ESCC) remains a substantial global health burden, with disproportionately high incidence and mortality rates in China1,2. For early-stage lesions limited to the mucosa or superficial submucosa (T1a), endoscopic submucosal dissection (ESD) is established as the first-line curative treatment3. Post-ESD histopathological evaluation identifies high-risk features such as lymphovascular invasion (LVI) and tumor budding, which correlate with lymph node metastasis (LNM) and often prompt recommendations for supplemental esophagectomy4. However, the significant invasiveness and morbidity associated with esophagectomy, particularly in older patients or those with multiple comorbidities, raise concerns about overtreatment, as only ~10% of patients with LVI ultimately develop nodal metastases5. These findings underscore the critical need for refined risk stratification in post-ESD specimens to accurately identify occult LNM, thereby enabling personalized management and reducing unnecessary surgery in low-risk cohorts.

The conventional diagnostic paradigm for detecting tumor metastases, particularly micrometastases, relies on labor-intensive manual slide evaluation by pathologists, a process prone to diagnostic uncertainty due to subtle morphological features6. These challenges highlight the need for automated, objective tools to augment histopathological assessment. Over the past decade, artificial intelligence (AI) has emerged as a transformative tool in medical diagnostics, enabling automated or semi-automated analysis of complex imaging data7. Advances in computational pathology, fueled by high-throughput slide scanning, enhanced computing power, and scalable storage solutions, have further expanded AI’s capacity to identify microscopic lesions and interpret gigapixel whole-slide images (WSIs)8. While AI-driven prediction of LNM has been explored in multiple cancers9,10, its application to ESCC remains unexplored, representing a critical gap in optimizing risk stratification for early-stage disease.

A cornerstone of AI implementation in WSI analysis involves segmenting high-resolution images into smaller, computationally manageable patches. Current methodologies predominantly employ supervised learning frameworks, utilizing dichotomized LNM status (positive/negative) as supervisory labels11,12. This methodology offers distinct advantages for LNM prediction in cancer. First, supervised learning leverages histologically validated labels to establish a robust ground truth, enabling models to discern metastasis-associated features with high diagnostic accuracy13. Second, it explicitly models known clinicopathological features such as LVI or tumor budding that correlate strongly with metastatic risk, ensuring biologically relevant feature prioritization14. Third, supervised frameworks enhance interpretability by linking predictions to specific histopathological patterns, a prerequisite for clinical adoption where model transparency and reliability are paramount15. Finally, the flexibility of supervised learning supports integration with advanced architectures, including convolutional neural networks (CNNs) and graph neural networks (GNNs), which excel at capturing spatial and contextual dependencies within WSIs16.

To address the unmet need for precise LNM risk assessment in early-stage ESCC with invasion depths from MM to SM2, we developed an AI-driven GNN model that uses supervised learning to analyze WSIs from ESD specimens. This approach aims to reduce diagnostic subjectivity, improve detection of micrometastases, and ultimately guide personalized post-resection management.

Results

Study population and cohort characteristics

This study was conducted using two independent patient cohorts. The model was developed from a surgical cohort comprising 374 patients who underwent primary esophagectomy without prior ESD. Within this cohort, 72 patients were LNM+, and 302 were LNM–. To address the class imbalance and enhance model generalizability, a balanced training set was constructed, comprising 72 LNM+ and 88 randomly selected LNM– cases. The representativeness of this LNM– subset was confirmed, as no significant differences in key baseline characteristics were observed compared to the remaining 214 LNM– patients (Table S1).

This cohort of 160 patients was then randomly divided into a training/validation set (n = 112, 442 WSIs) and an internal test set (n = 48, 217 WSIs) in a 7:3 ratio. The distribution of critical prognostic factors, including LVI (61.6% vs. 64.6%) and actual LNM rate (46.4% vs. 41.6%), was well-balanced between these sets, with no statistically significant differences in age, sex, tumor size, clinical stage, tumor location, differentiation grade, lymphovascular or perineural invasion status, or lymph node yield (Table 1).

Table 1 Baseline clinicopathological characteristics of the training and validation set and internal test set of surgical cohorts (n = 160)

For external validation, we used a separate cohort of 35 high-risk patients who had previously undergone ESD. This cohort comprised patients who subsequently received esophagectomy with systematic lymphadenectomy (n = 18, 85 WSIs) and those who were managed with surveillance alone (n = 17, 76 WSIs), resulting in a total of 161 WSIs for analysis. The final nodal status, confirmed by histology or follow-up, identified 4 patients as LNM+ and 31 as LNM–. The model’s performance was rigorously evaluated on this independent ESD cohort to assess its clinical applicability.

Validation performance of the AI model

As illustrated in Fig. 1, the optimal cutoff value for the model was determined from the internal test set. At this optimized threshold, the model demonstrated robust performance in predicting LNM in ESCC, achieving an area under the ROC curve (AUC) of 0.949 (95% CI: 0.912–0.986) in the internal validation cohort and 0.866 (95% CI: 0.768–0.964) in the external ESD validation cohort.

Fig. 1: ROC curves of the training and testing set.

a Area under the ROC curve for the attention-based WSIs deep learning model for predicting LNM in ESCC of the validation set; b The model was used in the test set to predict LNM in ESD specimens. ROC receiver operating characteristic, AUC area under the curve.

Test performance and clinical utility

Table 2 summarizes the distribution of histopathological features, including submucosal invasion depth and tumor budding grade, within the external validation cohort, providing context for correlation analyses with model predictions.

Table 2 Clinicopathological characteristics and outcomes of the external ESD test cohort (n = 35)

On a per-slide basis within the external cohort, the AI model achieved an accuracy of 90.1%. Performance metrics included a sensitivity of 81.8%, specificity of 91.4%, an F1-score of 69.2%, and a negative predictive value (NPV) of 96.9% (Fig. 2). This high NPV suggests a potential to reduce unnecessary surgeries by correctly identifying a substantial proportion of non-metastatic cases, highlighting its utility for patient stratification towards non-surgical surveillance.

Fig. 2: The confusion matrix and performance metrics of AI in patients undergoing additional surgery.

The corresponding confusion matrix is detailed in Table 3, which shows 18 true positives (TP), 4 false negatives (FN), 127 true negatives (TN), and 12 false positives (FP). These results underscore the model’s reliability in correctly classifying non-metastatic slides (127 of 139), which limits false-positive predictions.
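The slide-level metrics can be re-derived from the four counts reported above. The following is a minimal sketch for checking that arithmetic, not the authors' evaluation code; the function name `diagnostic_metrics` is hypothetical.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return standard diagnostic metrics (as fractions) from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),                 # share of LNM+ slides detected
        "specificity": tn / (tn + fp),                 # share of LNM- slides cleared
        "ppv": tp / (tp + fp),                         # positive predictive value
        "npv": tn / (tn + fn),                         # negative predictive value
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),             # harmonic mean of PPV and sensitivity
    }


# Slide-level counts reported for the external ESD cohort (161 WSIs).
slide_metrics = diagnostic_metrics(tp=18, fn=4, tn=127, fp=12)
for name, value in slide_metrics.items():
    print(f"{name}: {value:.1%}")
```

The counts sum to 161 slides and reproduce the reported sensitivity, specificity, accuracy, F1-score, and NPV. The positive predictive value, which the text does not quote, follows from the same counts as 18/30.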

Table 3 The confusion matrix and performance metrics of AI in the external test set

Case-level diagnostic performance aligned with clinical practice

Reflecting real-world clinical decision-making, where a single positive slide typically defines a case as high-risk, we aggregated slide-level predictions to the case level using a max-pooling rule. In the external ESD cohort, the model achieved robust case-level performance, with a sensitivity of 100.0% (4/4), a specificity of 83.9% (26/31), and an overall accuracy of 85.7% (30/35). Notably, the NPV at the case level reached 100.0% (26/26) (Table 4). This exceptionally high NPV indicates the model’s high reliability in identifying patients who can safely avoid esophagectomy, while maintaining high sensitivity for the detection of true metastatic cases.
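The max-pooling rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: the case identifiers, slide probabilities, and the 0.5 threshold are hypothetical, and the case-level counts (4 TP, 0 FN, 26 TN, 5 FP) are inferred from the fractions reported in the text.

```python
from collections import defaultdict


def max_pool_cases(slide_probs, threshold=0.5):
    """Label a case positive if its highest slide-level probability reaches the threshold."""
    case_max = defaultdict(float)  # missing cases start at probability 0.0
    for case_id, prob in slide_probs:
        case_max[case_id] = max(case_max[case_id], prob)
    return {case_id: int(p >= threshold) for case_id, p in case_max.items()}


# Hypothetical slide-level probabilities: one suspicious slide makes case_A positive.
toy_slides = [("case_A", 0.12), ("case_A", 0.81), ("case_A", 0.30),
              ("case_B", 0.08), ("case_B", 0.22)]
toy_calls = max_pool_cases(toy_slides)
print(toy_calls)

# Case-level counts implied by the reported fractions (4/4, 26/31, 30/35, 26/26).
tp, fn, tn, fp = 4, 0, 26, 5
case_sensitivity = tp / (tp + fn)
case_specificity = tn / (tn + fp)
case_accuracy = (tp + tn) / (tp + fn + tn + fp)
case_npv = tn / (tn + fn)
print(f"sensitivity {case_sensitivity:.1%}, specificity {case_specificity:.1%}, "
      f"accuracy {case_accuracy:.1%}, NPV {case_npv:.1%}")
```

Because a single positive slide flips the whole case, this rule can only raise case-level sensitivity relative to slide-level calls, at the cost of accumulating false positives across slides.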

Table 4 Case-level diagnostic performance of the AI model in the external ESD cohort

Discussion

The strategic management of T1 ESCC with submucosal invasion (MM-SM2) following ESD remains a considerable clinical challenge, primarily due to the substantial risk of LNM (approximately 15–30%)17,18. Current clinical guidelines rely on conventional histopathological assessment, evaluating features such as depth of invasion, LVI, poor differentiation, and other high-risk histopathological features, to guide decisions regarding additional esophagectomy19,20. Nevertheless, this approach is hampered by considerable interobserver variability and limited reproducibility in identifying features predictive of nodal involvement. The suboptimal discriminative capacity of these morphological criteria can lead to overtreatment of patients with minimal LNM risk, underscoring the urgent need for more precise and objective risk stratification tools21,22.

To address this critical unmet need, we developed an AI-driven model for predicting LNM using computational pathology. Our model utilizes a hierarchical GNN architecture to autonomously learn multi-scale histopathological representations from WSIs, capturing intricate morphological patterns without relying on subjective human interpretation. Due to the scarcity of ESD specimens with surgically confirmed nodal status, model development incorporated surgically resected T1–T4 cases, while external validation was rigorously restricted to T1 ESD cases to ensure clinical relevance. This approach enables a fully automated, objective, and reproducible prediction of metastatic risk. The model demonstrated robust performance in internal validation (AUC: 0.949) and, crucially, in an external cohort of real-world MM-SM2 ESD cases, the most relevant subgroup for post-ESD decision-making. In the external cohort, it achieved a slide-level sensitivity of 81.8% and a high NPV of 96.9%, with case-level max-pooling further enhancing its clinical utility (100% sensitivity, 100% NPV, 83.9% specificity). The consistently high NPV underscores the model’s capability to reliably identify patients at low risk of LNM, for whom conservative management may be appropriate, thereby potentially reducing unnecessary surgeries.

A key innovation of our framework is its ability to transcend the limitations of conventional region-of-interest (ROI) or patch-based analyses10,23,24. By constructing a biologically interpretable k-nearest neighbor graph integrating multimodal features (including color histograms, spatial coordinates, and deep feature embeddings)25,26,27, our GNN architecture effectively models local and global tissue architecture without manual annotation, overcoming the limitations of methods that introduce noise or fail to capture spatial dependencies. This end-to-end, supervised approach explicitly captures spatial relationships among histopathological patches, addressing the “needle-in-a-haystack” challenge inherent in WSI analysis and identifying subtle metastatic signatures potentially overlooked in conventional assessment28,29.

Notably, our AI system autonomously learned prognostically relevant morphological patterns directly from WSIs, without explicit programming of established risk factors30,31. It successfully identified a subset of low-risk patients, confirmed by postoperative histology, who might otherwise have been recommended for surgery under current guidelines32,33. To enhance interpretability and mitigate the “black box” concern, we generated decision heatmaps that visualized model-prioritized regions. These heatmaps consistently highlighted areas concordant with established high-risk features, such as the invasive front and lymphocyte-rich stroma, a finding validated by independent expert pathologists, thereby lending biological plausibility to the model’s predictions. Nevertheless, it should be acknowledged that heatmaps remain indirect proxies of the underlying model reasoning.

Notwithstanding this promising performance, several limitations merit consideration. The single-center, retrospective design may affect generalizability, necessitating future multi-institutional prospective validation. The inclusion of multiple tumor slides per patient, while improving data utilization, introduces analytical complexity regarding intra-patient dependency. Furthermore, the incorporation of more advanced ESCC cases during training, necessitated by the limited availability of node-positive T1 ESD cases, creates a potential domain shift, a common compromise in computational pathology. In the external cohort, the inference of nodal status based on recurrence-free survival for non-surgical patients, while clinically accepted, represents an indirect method of outcome assessment. Future work should also systematically investigate case-level prediction integration, which may yield even higher diagnostic performance.

In conclusion, we developed and validated a pathologist-independent AI model that accurately predicts LNM risk in T1 ESCC from WSIs. This GNN-based framework provides a robust, automated decision-support tool to optimize post-ESD management pathways, facilitating personalized care and potentially improving quality of life. Future efforts should focus on external validation, real-world clinical integration, and the development of hybrid models combining AI predictions with molecular biomarkers for enhanced risk stratification.

Methods

Study design

This retrospective single-center study enrolled 374 patients with ESCC, stages T1–T4, who underwent primary esophagectomy with systematic lymphadenectomy without previous ESD at Zhongda Hospital Affiliated to Southeast University from January 2019 to December 2024 (Fig. 3). Among them, 72 were LNM+ and 302 were LNM–. To address class imbalance, 72 LNM+ and 88 randomly selected LNM– cases were included as the surgical cohort (n = 160) for model training, validation, and internal testing.

Fig. 3: Study workflow.

From January 2019 to December 2024, 160 patients with 1284 WSIs were considered for the surgical cohort. After 625 WSIs were excluded due to no tumor tissue, the remaining 160 patients with 659 WSIs were randomly divided into a training set (112 patients, 442 WSIs) and a test set (48 patients, 217 WSIs) at a 7:3 ratio. An ESD cohort of 35 patients with 161 WSIs was used as an external validation set for the prediction model. Only surgically resected cases were used for model training and internal validation. ESD cases were used solely for external testing to simulate real-world post-ESD clinical scenarios.

An independent external validation cohort comprised 35 patients with T1 ESCC (MM to SM2) who underwent ESD. This cohort included patients with LNM+ status confirmed by subsequent surgical resection, as well as LNM– patients defined by the absence of tumor recurrence during a 3-year follow-up period after ESD34. This follow-up criterion is grounded in established oncological principles, where 3-year recurrence-free survival (RFS) serves as a clinically validated surrogate for confirming true nodal negativity in non-surgically managed patients35.

This AI model employed a supervised GNN framework to analyze histopathological patterns in WSIs. Notably, no handcrafted histologic features (such as submucosal invasion depth, tumor budding, LVI, etc.) were manually extracted or explicitly incorporated as input variables. Instead, the model was trained directly on raw WSIs, allowing it to infer predictive patterns from the underlying morphology in a data-driven manner. The study protocol was approved by the institutional ethics review committee (No. 2024ZDSYLL385-P01).

Conventional histologic assessment

All specimens were fixed in 10% neutral buffered formalin immediately after resection and then cut so that the area of deepest invasion was exposed on the cut surface. ESD specimens were cut into parallel 2–3 mm-thick sections, and esophagectomy specimens into 4–5 mm-thick sections, followed by hematoxylin and eosin (H&E) staining. All specimens were diagnosed on the basis of the 2019 World Health Organization Classification of Tumors, with lesions categorized as well differentiated, moderately differentiated, or poorly differentiated3,5. Submucosal invasion depth was measured vertically from the muscularis mucosae, with cases stratified as SM1 (≤200 μm) or SM2 (>200 μm)17. LVI was assessed through combined immunohistochemical (D2-40) and histochemical methods (Victoria blue staining)17. Tumor budding, defined as isolated cancer cell clusters (≤5 cells) at the invasive margin, was graded as BD1 (0–4 buds/field), BD2 (5–9 buds/field), or BD3 (≥10 buds/field) under 200× magnification4. At our institution, additional surgery following ESD is recommended if any of the following features are present: (1) submucosal invasion depth >200 μm (SM2), (2) presence of LVI, (3) poorly differentiated histology, (4) positive vertical or horizontal resection margins, and (5) tumor budding grade ≥BD2. For surgically resected cases, both the number of metastatic lymph nodes and the total number of dissected lymph nodes were recorded from pathology reports. Lymph node yield was used to assess the adequacy of lymphadenectomy, with reference to guideline standards (≥15 nodes for accurate staging according to AJCC criteria)36.
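The conventional criteria above amount to a simple "any of" rule, sketched below for illustration only. The function and parameter names are hypothetical, and the sketch encodes just the thresholds stated in the text (200 μm for SM1/SM2; 0–4, 5–9, and ≥10 buds per field for BD1–BD3); it is not a clinical tool and is separate from the AI model, which does not use these features as inputs.

```python
def submucosal_category(depth_um):
    """SM1 if submucosal invasion depth is <=200 um, otherwise SM2."""
    return "SM1" if depth_um <= 200 else "SM2"


def budding_grade(buds_per_field):
    """Grade tumor budding from the bud count per field at 200x magnification."""
    if buds_per_field <= 4:
        return "BD1"
    if buds_per_field <= 9:
        return "BD2"
    return "BD3"


def additional_surgery_recommended(depth_um, lvi, poorly_differentiated,
                                   positive_margin, buds_per_field):
    """Return True if any of the five institutional high-risk criteria is met."""
    return any([
        submucosal_category(depth_um) == "SM2",          # (1) depth >200 um
        lvi,                                             # (2) lymphovascular invasion
        poorly_differentiated,                           # (3) poor differentiation
        positive_margin,                                 # (4) positive vertical/horizontal margin
        budding_grade(buds_per_field) in ("BD2", "BD3"), # (5) budding grade >= BD2
    ])


# Hypothetical lesion: shallow invasion, no other risk feature, 2 buds per field.
low_risk = additional_surgery_recommended(150, False, False, False, 2)
# Same lesion with 5 buds per field crosses the BD2 threshold.
high_risk = additional_surgery_recommended(150, False, False, False, 5)
print(low_risk, high_risk)
```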

Data preparation and preprocessing

Among the 1284 WSIs obtained from the surgical cohort, slides without tumor tissue, slides of inadequate quality, or those containing only blank regions were excluded. As a result, 659 WSIs containing sufficient tumor regions were retained for model development. To provide a clinically interpretable workflow, the selected WSIs were then divided into small patches, morphological and spatial features were extracted, and graphs were constructed to represent the histological architecture. The proposed computational framework implements a unified analytical workflow for predicting LNM in T1-stage ESCC by systematically combining multimodal computational histopathological feature extraction with a hierarchical GNN architecture. Multimodal features (color histograms, spatial coordinates, ResNet-50 embeddings) were concatenated and normalized to a shared latent space via a fully connected layer (512 dimensions).

The overall framework of the proposed method is illustrated in Fig. 4. The ‘GNNClassifier’ uses two graph convolutional layers (GCNConv) with ReLU activation and dropout (p = 0.4) to propagate node features across the graph structure, and aggregates them into a slide-level representation via global mean pooling for classification. Results, including predicted probabilities and binary classifications, are logged in ‘prediction_results.txt’ for retrospective analysis. Auxiliary utilities check CUDA compatibility and GPU acceleration prerequisites before the model is run.

Fig. 4: A hierarchical GNN-based model was built for predicting lymph node metastasis.

The workflow comprises: (1) Input WSIs are divided into 224 × 224 pixel patches (stride 112), followed by spatial domain preprocessing and graph construction. (2) ResNet-50 extracts patch features (2048 channels), which are reduced to 512 channels via dimensionality reduction and fused through multimodal integration. (3) A hierarchical architecture with two-stage graph convolutional layers (GCNConv), feature compression, and global context aggregation modules learns spatial dependencies and semantic representations. (4) Global mean pooling and a softmax classifier perform binary classification, outputting LNM probabilities.

WSI acquisition and annotation

H&E-stained slides of all the tissue blocks in each case were selected for further analysis. The slides were captured as WSIs at 40× magnification using NanoZoomer (Hamamatsu Photonics, Hamamatsu, Japan). Two experienced pathologists (T.K.H. and X.J.J.) used QuPath (https://qupath.github.io) to annotate cancerous regions. All annotations were double reviewed, and discordant cases were discussed with an independent and blinded pathologist (W.X.Y.). The captured WSIs were partitioned into non-overlapping 224 × 224 pixel patches. Blank patches and patches without cancerous areas were excluded. Patches were assigned slide-level labels according to the LNM status of the corresponding patient, and patches in cases without LNM were defined as LNM-negative patches.
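The patch partitioning step can be illustrated with a short sketch that enumerates patch origins on a regular grid. This is a minimal example, not the study's code: the function name and the toy region size are hypothetical. The stride is left as a parameter because the text describes non-overlapping 224 × 224 patches (stride 224), whereas the Fig. 4 legend mentions a stride of 112.

```python
def patch_origins(width, height, patch=224, stride=224):
    """Top-left (x, y) pixel coordinates of every full patch inside a width x height region."""
    xs = range(0, width - patch + 1, stride)
    ys = range(0, height - patch + 1, stride)
    return [(x, y) for y in ys for x in xs]  # row-major order


# Hypothetical 1000 x 700 pixel tissue region.
non_overlapping = patch_origins(1000, 700)            # stride equals patch size
overlapping = patch_origins(1000, 700, stride=112)    # 50% overlap between neighbors
print(len(non_overlapping), len(overlapping))
```

Partial patches at the right and bottom edges are dropped in this sketch; whether the original pipeline pads or discards them is not stated in the text.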

Data preprocessing and feature extraction

To balance computational efficiency with tissue representation, a maximum of 1000 patches per WSI was retained. Data augmentation strategies included random horizontal/vertical flipping and 30° rotation to enhance rotational invariance, supplemented by a multi-scale sampling strategy (0.5–1.5× scaling) that randomly selected patches across different WSI pyramid levels to improve scale invariance. For feature extraction, a pretrained ResNet-50 architecture (with final classification layers removed) generated 2048-dimensional feature vectors. These features were subsequently reduced to 512 dimensions via a fully connected layer. To improve illumination invariance, LAB color space-based histogram matching was applied for standardization, with additional random brightness/contrast perturbations (±20%) simulating tissue staining variations under diverse exposure conditions. Spatial coordinates were normalized to the [0,1] range, followed by construction of a 10-nearest neighbor graph (k = 10) using the ‘knn_graph’ function, establishing topological connections to model spatial relationships between adjacent tissue regions. The selection of k = 10 for nearest neighbor graph construction was empirically validated through ablation studies (k = 5, 10, 15, 20). Performance peaked at k = 10, which balances local context capture and computational efficiency.
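The patch cap, coordinate normalization, and k-nearest-neighbor graph described above can be sketched in plain Python. This is a brute-force illustration under assumptions, not the pipeline's implementation: it assumes per-axis min-max scaling and Euclidean distance on the normalized coordinates alone, and all function names and the seed are hypothetical.

```python
import math
import random


def cap_patches(coords, max_patches=1000, seed=0):
    """Randomly keep at most max_patches patches per slide."""
    coords = list(coords)
    if len(coords) <= max_patches:
        return coords
    return random.Random(seed).sample(coords, max_patches)


def normalize_coords(coords):
    """Min-max scale x and y independently to the [0, 1] range."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    x_span = (max(xs) - min(xs)) or 1  # avoid division by zero for a single column
    y_span = (max(ys) - min(ys)) or 1
    return [((x - min(xs)) / x_span, (y - min(ys)) / y_span) for x, y in coords]


def knn_edges(points, k=10):
    """Directed (neighbor, node) pairs linking each node to its k nearest neighbors."""
    edges = []
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        edges.extend((j, i) for _, j in ranked[:k])
    return edges


# Hypothetical 5 x 5 grid of patch origins spaced 224 pixels apart.
origins = [(224 * col, 224 * row) for row in range(5) for col in range(5)]
norm = normalize_coords(cap_patches(origins))
edges = knn_edges(norm, k=10)
print(len(norm), "nodes,", len(edges), "edges")
```

The quadratic neighbor search is acceptable at the stated cap of 1000 patches per slide; a spatial index would be the usual choice for larger graphs.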

Graph neural network architecture

The proposed hierarchical GNN model was trained on graph representations constructed from WSIs. First, each WSI was divided into non-overlapping 224 × 224 pixel patches. Patch-level feature vectors were extracted using a pretrained ResNet-50 backbone (2048 dimensions), followed by a linear compression layer that reduced the features to 512 dimensions. Spatial adjacency among patches was then used to construct a graph, where each patch served as a node and neighboring patches were connected via edges. This hierarchical design was chosen to capture both local tumor microenvironment features and global tissue architecture, which are both critical for predicting LNM. The detailed process of patch division, feature extraction, and graph construction is shown in Table 5.

Table 5 Architecture of the hierarchical graph neural network (GNN)

The resulting graph was processed through two graph convolutional layers (GCNConv) with 512 hidden units, ReLU activation, and dropout. A global mean pooling layer was applied to aggregate node-level information into a slide-level embedding. Finally, a fully connected classification head (512 → 2 units) with softmax activation outputs the predicted probability of LNM. The GCN layers allowed the model to capture spatial patterns within the tumor microenvironment, while global pooling enabled holistic WSI-level prediction based on local features.
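For reference, the forward pass described above can be written compactly using the standard graph convolution rule that GCNConv implements. The notation is introduced here for illustration and is an assumption: A is the adjacency matrix of the k-nearest-neighbor patch graph, H^{(0)} is the N × 512 matrix of node features, W^{(\ell)} are learnable weights, and W_c, b_c belong to the classification head. Bias terms in the convolution and dropout are omitted for brevity.

```latex
\begin{aligned}
\hat{A} &= A + I, \qquad \hat{D}_{ii} = \sum_{j} \hat{A}_{ij}, \\
H^{(\ell+1)} &= \mathrm{ReLU}\!\left(\hat{D}^{-1/2}\,\hat{A}\,\hat{D}^{-1/2}\,H^{(\ell)}\,W^{(\ell)}\right), \qquad \ell = 0, 1, \\
h_{\mathrm{slide}} &= \frac{1}{N}\sum_{i=1}^{N} H^{(2)}_{i,:}, \\
p &= \mathrm{softmax}\!\left(W_c\,h_{\mathrm{slide}} + b_c\right) \in \mathbb{R}^{2}.
\end{aligned}
```

Each convolution averages a patch's features with those of its graph neighbors (with symmetric degree normalization), so two layers give every node a two-hop receptive field before mean pooling collapses the graph into one slide-level vector.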

Supervised training protocol

The model was trained using PyTorch Lightning with class-weighted cross-entropy loss to address class imbalance. Optimization was performed via the Adam optimizer (initial learning rate = 1 × 10⁻⁴) paired with a ‘ReduceLROnPlateau’ scheduler (factor = 0.1, patience = 5 epochs). Early stopping (patience = 4000 epochs) monitored validation accuracy to mitigate overfitting, while mixed-precision training (16-bit) on NVIDIA A100 GPUs improved computational efficiency. Early stopping at 4000 epochs was determined by plateau analysis of validation loss (no improvement for 50 epochs), preventing overfitting while ensuring convergence. Class weighting and early stopping were implemented to reduce bias from class imbalance and to prevent overfitting, thereby improving the generalizability of the model. Five-fold cross-validation demonstrated stable performance across partitions (accuracy: 88.7% ± 1.1%, F1-score: 0.85 ± 0.03). Three independent trials with randomized seeds yielded consistent results (accuracy: 89.2% ± 1.3%, F1-score: 0.87 ± 0.02), confirming low variance. Augmentation robustness tests with randomized parameters (rotation, flipping, multi-scale sampling) showed negligible performance degradation (accuracy decrease <1.5%), underscoring feature invariance under diverse transformations. This multi-faceted validation framework was intended to ensure statistical reliability and minimize bias ahead of clinical deployment.
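The class-weighted cross-entropy loss can be illustrated with a small stand-alone calculation. The text does not state how the class weights were chosen, so the inverse-frequency scheme below is an assumption, as are the function names; the weighted-mean normalization follows the common convention of dividing by the sum of the sample weights.

```python
import math


def inverse_frequency_weights(labels):
    """Assumed weighting scheme: n_samples / (n_classes * class_count) per class."""
    counts = {c: labels.count(c) for c in set(labels)}
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the sum of sample weights."""
    numerator = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator


# Class composition of the balanced surgical cohort: 72 LNM+ (1) and 88 LNM- (0).
cohort_labels = [1] * 72 + [0] * 88
class_weights = inverse_frequency_weights(cohort_labels)

# An uninformative model that always outputs 50/50 gives a loss of ln(2).
uniform_probs = [(0.5, 0.5)] * len(cohort_labels)
baseline_loss = weighted_cross_entropy(uniform_probs, cohort_labels, class_weights)
print(class_weights, round(baseline_loss, 4))
```

With these counts the minority LNM+ class receives a weight of about 1.11 and the LNM– class about 0.91, so errors on positive cases contribute slightly more to the loss.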

Evaluation of the trained model

Model performance was evaluated using the area under the curve (AUC), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). The optimal classification threshold was determined by maximizing Youden’s J statistic. Results were compared against the actual pathological condition. Threshold optimization prioritized NPV (maximizing Youden’s J with NPV > 95%), as false negatives (missed LNM) may lead to under-treatment, whereas false positives (unnecessary surgery) were deemed clinically tolerable.
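The threshold selection rule (maximize Youden's J, defined as sensitivity + specificity − 1, among thresholds whose NPV exceeds 95%) can be sketched as below. The scores and labels are hypothetical toy data, the function names are illustrative, and the exact search procedure used in the study is not described in the text.

```python
def confusion(scores, labels, threshold):
    """Count TP, FN, TN, FP when scores >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp, fn, tn, fp


def select_threshold(scores, labels, min_npv=0.95):
    """Return (threshold, J, NPV) with the largest Youden's J subject to NPV > min_npv."""
    best = None
    for t in sorted(set(scores)):
        tp, fn, tn, fp = confusion(scores, labels, t)
        if tn + fn == 0:
            continue  # no predicted negatives, so NPV is undefined
        npv = tn / (tn + fn)
        if npv <= min_npv:
            continue  # violates the NPV constraint
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if best is None or j > best[1]:
            best = (t, j, npv)
    return best


# Hypothetical predicted probabilities with true LNM status (1 = LNM+).
toy_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.40, 0.70, 0.80, 0.90]
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
best_threshold = select_threshold(toy_scores, toy_labels)
print(best_threshold)
```

In the toy data, raising the threshold above 0.40 would miss a positive case and push the NPV below the 95% floor, so the constraint caps how far the threshold can move even where specificity would still improve.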

Model interpretability and attention heatmap generation

To improve model interpretability, we computed node-level attention scores from the hierarchical GNN and generated attention heatmaps overlaid on the original WSIs. Each node corresponded to a histopathological patch, with color intensity reflecting its relative contribution to the final prediction. These heatmaps consistently highlighted histological regions of interest, such as invasive tumor fronts and areas rich in lymphoid tissue with clustered vessels, features that are known to be associated with LNM risk. All heatmaps were independently reviewed by two gastrointestinal pathologists (both with >10 years of diagnostic experience), who confirmed that the high-attention regions corresponded closely with established pathological risk areas. Representative examples are shown in Fig. 5.

Fig. 5: Representative attention heatmaps generated from the hierarchical GNN model.

Warmer colors indicate regions with higher attention scores, reflecting greater contribution to the model’s final prediction. The highlighted areas frequently correspond to invasive tumor fronts, tumor budding foci, and lymphovascular invasion, which align with established pathological risk features. The marked boxes denote regions independently confirmed by expert gastrointestinal pathologists.

Statistical analysis

The prediction model was developed using Python 3.8 (Python Software Foundation) with PyTorch 1.12.0 and PyTorch Geometric 2.2.0 libraries. The architecture integrated a GNN with CNNs. Statistical analyses were conducted using SPSS 26.0 (IBM, Armonk, NY). Continuous variables were assessed for normality via the Shapiro–Wilk test and homogeneity of variance with Levene’s test. Normally distributed variables were compared using Student’s t-test, while non-parametric data were analyzed with the Mann–Whitney U-test. Categorical variables were evaluated by χ²-test or Fisher’s exact test when expected cell counts fell below 5. The discriminative performance of the predictive model was quantified by the area under the receiver operating characteristic curve (AUC), with 95% confidence intervals (CI) calculated through bootstrap resampling (1000 iterations) using the percentile method. For the primary endpoint of predicting LNM in patients with T1 ESCC undergoing ESD, model performance was evaluated using standard diagnostic metrics, including sensitivity, specificity, accuracy, PPV, NPV, F1-score, and ROC AUC. The results were summarized both in tabular format and through graphical presentations. All P-values were two-sided, and P < 0.05 was considered statistically significant. Predictions were saved to a file detailing filenames, LNM probabilities, and classifications for clinical review.
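The percentile bootstrap for the AUC confidence interval can be sketched as follows. This is an illustration under assumptions, not the analysis code: the data are hypothetical, observations are resampled individually (the text does not say whether slides or patients were the resampling unit), resamples containing a single class are redrawn, and the function names and seed are arbitrary.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a positive scores above a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # AUC needs both classes present
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[round((alpha / 2) * (n_boot - 1))]
    upper = stats[round((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


# Hypothetical predicted probabilities with true LNM status (1 = LNM+).
demo_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.40, 0.70, 0.80, 0.90]
demo_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
point_estimate = auc(demo_scores, demo_labels)
ci_low, ci_high = bootstrap_auc_ci(demo_scores, demo_labels)
print(f"AUC {point_estimate:.3f} (95% CI: {ci_low:.3f}-{ci_high:.3f})")
```

When several slides come from the same patient, resampling whole patients rather than individual slides would respect the intra-patient dependency noted among the study's limitations.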