Abstract
Accurate and efficient grade assessment of tumor budding (TB) in hematoxylin and eosin-stained whole slide images (H&E-stained WSIs) of colorectal cancer (CRC) remains challenging. This study proposes a fully automated TB annotation approach to assist manual grade assessment by extracting tumor invasive front boundaries, annotating TBs in tumor invasive front patches, and transferring the annotations to WSIs. Our approach demonstrates exceptional performance in tumor invasive front boundary extraction, achieving AUCs of 0.988, 0.921, and 0.929 on three different validation datasets. For TB annotations in tumor invasive front patches, the approach achieves recalls of 0.850, 0.753, and 0.720 on the same datasets. The average TB grade assessment time per WSI assisted by the approach is limited to 21 s, 15 s, and 18 s on the respective datasets. These results demonstrate that the approach significantly improves assessment efficiency while guaranteeing accuracy, offering a reliable tool for CRC clinicopathological diagnosis.
Introduction
Colorectal cancer (CRC) is one of the most common malignant tumors worldwide, with its incidence and mortality rates continuously rising in recent years1,2. In the clinicopathological diagnosis of CRC, tumor budding (TB) is regarded as an important biomarker for assessing tumor invasion and patient prognosis3,4,5. It refers to a single tumor cell or a cluster of up to four tumor cells at the tumor invasive front of CRC, reflecting the invasiveness and metastatic potential of the tumor. By grading TB, pathologists can more accurately identify high-risk patients, because higher-grade TB is closely associated with increased lymphovascular invasion, a higher likelihood of lymph node metastasis, and poorer survival rates6,7. Therefore, accurate grade assessment of TB is essential for advancing personalized diagnosis and targeted treatment of CRC.
The most authoritative approach for manual TB grade assessment is the standard recommended by the 2016 International Tumor Budding Consensus Conference (ITBCC)8. According to this standard, the pathological grade assessment of TB is typically conducted under a 20x objective lens, requiring pathologists to count the number of TBs within a designated hotspot of 0.785 mm². The TB grade is ultimately determined by the maximal number of TBs observed in a single field and the specific grading criteria: Bd1 (Low, 0–4 buds), Bd2 (Intermediate, 5–9 buds), and Bd3 (High, ≥10 buds). However, this approach is not only laborious but also susceptible to significant inter-observer discrepancies due to the subjective nature of hotspot identification and bud enumeration, resulting in low consistency of grade assessment results9,10. Thus, achieving accurate and efficient automated grade assessment of TB while alleviating pathologists' workloads is a significant trend in the current development of clinicopathological diagnostics.
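To make the ITBCC grading rule concrete, the following minimal sketch maps a bud count to its grade. The field-area normalization helper is our illustrative assumption of how a count from a non-standard field could be scaled to the 0.785 mm² reference; it is not part of the ITBCC text itself.

```python
def itbcc_grade(bud_count: int) -> str:
    """Map a TB count in a 0.785 mm^2 hotspot to its ITBCC grade."""
    if bud_count <= 4:
        return "Bd1 (Low)"
    if bud_count <= 9:
        return "Bd2 (Intermediate)"
    return "Bd3 (High)"

def normalized_count(raw_count: int, field_area_mm2: float) -> int:
    """Illustrative scaling of a raw count to the 0.785 mm^2 reference field."""
    return round(raw_count * 0.785 / field_area_mm2)

assert itbcc_grade(normalized_count(7, 0.785)) == "Bd2 (Intermediate)"
```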
With the rapid development of artificial intelligence (AI), deep learning technologies such as image classification11, object annotation12, and semantic segmentation13 have shown substantial potential for TB grade assessment in whole slide images (WSIs) of CRC. In recent years, several approaches have attempted to combine various deep learning techniques to automate the grade assessment process. Among them, some approaches rely on immunohistochemically stained (IHC-stained) WSIs14,15,16,17,18,19. Although recognizing and assessing TB in IHC-stained WSIs is easier, this practice wastes staining resources for each patient and does not align with the 2016 ITBCC standard, which mandates that grade assessment be conducted on hematoxylin and eosin-stained (H&E-stained) slides. Moreover, although a few deep learning approaches have achieved automated TB grade assessment in H&E-stained WSIs20,21,22,23,24,25,26, some of them were developed solely on single-center samples and exhibit very limited generalizability20,21,22,23. In contrast, TB grade assessment approaches developed with multi-center H&E-stained WSIs tend to perform more stably24,25,26. However, when confronted with uncertainties such as irregular tumor areas or large tumor stroma in some special WSIs, these approaches often have difficulty localizing the accurate tumor invasive front area, which frequently leads to erroneous grade assessment results.
Considering the various limitations of existing manual and automated approaches to TB grade assessment in CRC, it is important to develop better deep learning approaches that reduce manual assessment time while improving the accuracy of assessment results. Against this background, automating the annotation of TBs in H&E-stained WSIs with deep learning technologies and leaving the final TB grade to pathologists is a viable strategy. To achieve fully automated annotation of TBs in WSIs, it is necessary to extract tumor invasive front boundaries and then annotate TBs within the hotspot at the tumor invasive front27. However, a central challenge remains the lack of specialized approaches for precise tumor invasive front delineation, as most existing boundary extraction approaches focus only on overall tumor area segmentation rather than invasive front margin identification28,29,30. Simultaneously, reliable automated TB annotation in H&E-stained WSIs remains underdeveloped, and conventional object annotation models (e.g., RCNN31, YOLO32, SSD33) struggle with the high pixel density and morphological variability of TBs, leading to highly uncertain annotation results in WSIs.
To directly address these intertwined challenges of accurate tumor invasive front boundary extraction and robust TB annotation in WSIs, we propose a fully automated TB annotation approach explicitly designed to enhance robustness and reduce uncertainty. This approach achieves slide-level automated TB annotation through three main steps: first, the tumor invasive front boundaries are determined by classifying tumor and normal tissue patches, which provides more adaptive and precise margin identification than standard segmentation approaches, especially in areas with irregular stroma or tumor morphology; second, patches at the tumor invasive front are extracted and TBs are annotated using a new strategy optimized for low-certainty, high pixel-density WSI environments; finally, all annotated data in these patches are transferred to WSIs through a reconstruction process that ensures spatial consistency and minimizes annotation errors. This integrated pipeline mitigates the adverse impact of irregular tumor and large stromal areas on TB grade assessment accuracy, directly targeting the shortcomings of existing automated approaches. Fully automated TB annotations in WSIs effectively resolve inconsistencies that may arise in manual assessment and reduce the impact of irregular tumor areas and large tumor stroma on results, significantly improving both the accuracy and efficiency of the TB grade assessment process.
Results
Dataset establishment
This study utilized two types of datasets sourced from different medical centers: internal and external datasets. The process of dataset establishment was shown in Fig. 1. Most images in these datasets were H&E-stained WSIs with a rich presence of TBs and came from CRC patients selected by experienced pathologists. The internal datasets were derived from the Department of Pathology at the Union Hospital, Tongji Medical College, Huazhong University of Science and Technology (referred to as the PUTH), including a model development subset of 356 WSIs from 117 patients (referred to as the PUTH-MD) and an internal validation subset of 184 WSIs from 184 patients (referred to as the PUTH-IV), the latter of which contained comprehensive prognostic information and was used for both technical and clinical validation. In contrast, the external datasets were obtained from the National Cancer Institute and comprised two components: the TCGA-COAD external validation subset of 362 WSIs (referred to as the TCGA-COAD-EV) and the TCGA-READ external validation subset of 139 WSIs (referred to as the TCGA-READ-EV). Because they lacked both WSIs with abundant TBs at the tumor invasive front and the complete survival data required for prognostic analysis, the two external validation datasets were used for technical validation only.
A novel fully automated annotation framework for TBs in WSIs to assist manual grade assessment
In this study, CLAM34 was first used to preprocess each WSI by dividing its H&E-stained area into numerous patches of 2048 × 2048 pixels and recording the absolute coordinates of each patch. On this basis, the proposed fully automated annotation framework achieved accurate and efficient TB annotation in WSIs through three key steps, as shown in Fig. 2. First, a trained tumor and normal tissue classification model (TNTCM) was applied to determine the category of each patch. Based on this categorical information, tumor patches adjacent to normal tissue patches were identified, and the lines connecting their absolute coordinate points in WSIs were defined as the tumor invasive front boundaries. Second, all tumor patches within the invasive front areas were extracted along the tumor invasive front boundaries, and a trained TB annotation model for tumor invasive front patches (TBAM-TIFP) was used to annotate all TBs while recording the relative coordinates of the TB bounding boxes with respect to the patches. Finally, all TB annotations in these patches were mapped back to WSIs by combining the absolute coordinates of the patches in WSIs with the relative coordinates of the TB bounding boxes in the patches, ultimately achieving comprehensive annotation of TBs in WSIs. Based on the annotation results in WSIs, pathologists could quickly accomplish TB grade assessment using the professional software ASAP.
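The following Python skeleton sketches how the three steps could chain together. The `labels` dict layout, the label strings, and the `detector` callable are hypothetical stand-ins for the TNTCM and TBAM-TIFP interfaces described above, not the published implementation.

```python
def neighbors(xy, step=2048):
    """8-neighborhood of a patch anchored at absolute coordinates xy."""
    x, y = xy
    return [(x + i * step, y + j * step)
            for i in (-1, 0, 1) for j in (-1, 0, 1) if (i, j) != (0, 0)]

def annotate_wsi(patch_coords, labels, detector):
    """Sketch of the pipeline: given per-patch TNTCM labels and a TB detector
    (TBAM-TIFP stand-in), return TB boxes in absolute WSI coordinates."""
    annotations = []
    for (x0, y0) in patch_coords:
        # Step 1: an invasive front patch is a tumor patch with a normal neighbor.
        if labels.get((x0, y0)) != "tumor":
            continue
        if not any(labels.get(n) == "normal" for n in neighbors((x0, y0))):
            continue
        # Step 2: detect TBs inside the patch (patch-relative coordinates).
        for (x, y, w, h, score) in detector((x0, y0)):
            # Step 3: map patch-relative boxes back to absolute WSI coordinates.
            annotations.append((x0 + x, y0 + y, w, h, score))
    return annotations
```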
The fully automated TB annotation framework in WSIs yielded several significant benefits. On one hand, it could efficiently handle large volumes of WSI data and significantly reduce manual annotation time, allowing pathologists to focus on more challenging case analyses. On the other hand, as each step employed trained models for the relevant judgments and annotations, the accuracy and consistency of annotation results were ensured, which contributed to the reliability of TB grade assessment. Moreover, the integration of different deep learning models for patch classification and annotation not only enhanced the sensitivity of TB annotations but also reduced the likelihood of false positives (FPs) and false negatives (FNs), providing strong auxiliary support for the clinicopathological diagnosis of CRC.
Performance of the TNTCM
The TNTCM exhibited notable performance in the classification task between tumor and normal tissue patches across different datasets. As detailed in Table 1, the model's performance was particularly outstanding on the PUTH-IV, where specificity reached 0.951 (95% CI: 0.934, 0.968) and sensitivity was 0.935 (95% CI: 0.915, 0.955). This indicated an exceptional ability to distinguish between tumor and normal tissues, significantly reducing the likelihood of misclassifying normal tissues as tumors. Meanwhile, the AUC reached 0.988 (95% CI: 0.979, 0.997), further confirming the model's excellent classification performance. The model's performance declined slightly on the external validation dataset TCGA-COAD-EV, with a specificity of 0.889 (95% CI: 0.809, 0.968) and a sensitivity of 0.848 (95% CI: 0.758, 0.939). Although these metrics remained within an acceptable range, they indicated a diminished capability to distinguish between tumor and normal tissues compared to the PUTH-IV, likely due to sample heterogeneity or different data sources. Nevertheless, the corresponding AUC was 0.921 (95% CI: 0.853, 0.990), reflecting that the model still possessed good classification capability on this dataset. On the TCGA-READ-EV, the TNTCM's specificity dropped to 0.733 (95% CI: 0.620, 0.847), but sensitivity improved to 0.929 (95% CI: 0.862, 0.995), demonstrating that the model remained effective at identifying tumors, albeit with reduced accuracy in excluding normal tissues. The AUC of 0.929 (95% CI: 0.862, 0.995) indicated reasonable classification capability despite the lower specificity.
To further validate the advantages of the TNTCM in the classification of tumor and normal tissue patches, this section selected several typical comparative classification models, including INSIGHT35, CLAM34, DSMIL36, FRMIL37, and AttentionMIL38. Unlike these MIL-based models, which were trained with slide-level labels, the TNTCM achieved more precise and efficient patch-level classification. Its core innovation is the use of more accurate patch-level labels, which results in superior accuracy and computational efficiency compared to the MIL-based models. Comparative results of the different models on the PUTH-IV, TCGA-COAD-EV, and TCGA-READ-EV included the ROC curves in Fig. 3 and the AUC comparison in Fig. 4. The TNTCM stood out across all three datasets, particularly on the PUTH-IV, where its specificity and sensitivity exceeded those of the other models and it achieved an AUC of 0.988, showing remarkable classification ability. In comparison, INSIGHT displayed a relatively average performance on all datasets, with its highest AUC of 0.959 on the PUTH-IV and lower performance elsewhere, especially on the TCGA-COAD-EV, where its AUC was only 0.823, reflecting limitations in handling certain specific tasks. While CLAM lagged behind the TNTCM on the PUTH-IV with an AUC of 0.893, it performed relatively well on the TCGA-READ-EV, achieving an AUC of 0.892 and indicating good adaptability across different tasks. DSMIL demonstrated relatively robust performance across datasets, with AUCs of 0.973, 0.868, and 0.908 on the PUTH-IV, TCGA-COAD-EV, and TCGA-READ-EV, respectively. FRMIL reached an AUC of 0.930 on the TCGA-READ-EV, exhibiting potential despite lower specificity, but performed poorly on the other datasets, particularly the PUTH-IV, with an AUC of only 0.882. AttentionMIL exhibited uneven performance: its AUC on the PUTH-IV was 0.979, but it dropped to 0.852 and 0.858 on the TCGA-COAD-EV and TCGA-READ-EV, respectively. The bar chart in Fig. 4 clearly illustrated that the AUC of the TNTCM surpassed those of the other models, while the performance differences among the other models were more pronounced on the TCGA-COAD-EV and TCGA-READ-EV, where INSIGHT, CLAM, and AttentionMIL performed significantly worse, highlighting their shortcomings in specific tasks.
In summary, the TNTCM introduced a novel patch-specific learning framework that differed fundamentally from slide-oriented MIL models, providing statistically significant improvements in tumor and normal tissue patch classification accuracy, generalizability, and efficiency. These results demonstrate that a dedicated patch-level model offers substantial practical benefits for patch classification.
Qualitative analysis of the tumor invasive front boundary extraction
To evaluate the effect of the TNTCM in extracting tumor invasive front boundaries, this section selected one WSI from each of three datasets: the PUTH-IV, TCGA-COAD-EV, and TCGA-READ-EV, with respective resolutions of 102400 × 94208, 59392 × 36864, and 88064 × 79872 pixels. As shown in Fig. 5, boundary extraction results based on the proposed model were compared with the consensus manual annotations established by three pathologists.
The TNTCM effectively extracted tumor invasive front boundaries from WSIs across the different datasets, and comparisons with manual annotations confirmed its ability to delineate the entire invasive front. This was largely attributable to the high accuracy of the TNTCM in classifying tumor and normal tissue patches: the model accurately determined the category of each patch based on numerous visual features of tumors and normal tissues. However, compared to manual delineation, the TNTCM occasionally misidentified boundaries between the tumor core and the tumor stroma as part of the tumor invasive front. Nonetheless, these errors did not affect the subsequent complete annotation of TBs at the tumor invasive front or the manual grade assessment results. The boundary extraction results on WSIs from the three datasets indicated that the TNTCM maintained superior applicability and stability across varied pathological features, tumor shapes, and staining protocols. This reflected its strong generalization capability and laid a solid foundation for the extraction of tumor invasive front patches and accurate TB annotation.
Performance of the TBAM-TIFP
To evaluate the performance of the TBAM-TIFP on three different datasets, we selected 216, 121, and 105 tumor invasive front patches from the PUTH-IV, TCGA-COAD-EV, and TCGA-READ-EV, respectively. A TB annotation was considered a true positive (TP) only when the predicted bounding box achieved an Intersection over Union (IoU) of at least 0.7 with the ground truth annotation and the associated confidence score exceeded the threshold of 0.4. The TBAM-TIFP was used to annotate TBs in these patches. Based on the manual judgment and statistics of three experienced pathologists, confusion matrices for TB annotations in tumor invasive front patches by the TBAM-TIFP across the datasets were shown in Fig. 6, and Table 2 summarized the corresponding performance parameters. From Fig. 6 and Table 2, it was evident that the TBAM-TIFP exhibited strong TB annotation capability on the PUTH-IV, with 392 TPs, 69 FNs, and 103 FPs, achieving a missed annotation ratio (MAR) of 0.150, a recall of 0.850, a precision of 0.792, and an F1 score of 0.820. These results indicated that the TBAM-TIFP could effectively identify TBs and achieved a balance between precision and recall. On the TCGA-COAD-EV and TCGA-READ-EV, the TPs and corresponding performance metrics decreased to some extent. On the TCGA-COAD-EV, the TPs dropped to 241 and the FPs increased to 137, with 79 FNs; the recall was 0.753, the MAR increased to 0.247, the precision fell to 0.638, and the F1 score was 0.691, reflecting the challenge the model faced under different sample characteristics and background noise. On the TCGA-READ-EV, there were 172 TPs, 67 FNs, and 115 FPs, yielding a recall of 0.720, a MAR of 0.280, a precision of 0.599, and an F1 score of 0.653. These numbers further emphasized the complexity of TB annotation on diverse datasets. Notably, the performance differences of the TBAM-TIFP across datasets not only reflected the variability of background noise and sample characteristics but also offered an opportunity for ongoing optimization and deeper understanding in the field of TB annotation.
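A minimal sketch of how these metrics follow from the confusion-matrix counts; the IoU helper assumes boxes in (x, y, w, h) format, and the example values are taken from the PUTH-IV row above.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x, y, w, h)."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def summarize(tp, fp, fn):
    """Recall, MAR (missed annotation ratio), precision, and F1 from counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "MAR": 1 - recall, "precision": precision, "F1": f1}

# summarize(392, 103, 69) reproduces the PUTH-IV row reported in Table 2:
# recall 0.850, MAR 0.150, precision 0.792, F1 0.820.
```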
To further verify the advantage of the TBAM-TIFP in TB annotation, this section selected five classic object annotation models for comparative analysis: GLEE39, DAF40, TBD-CNDES26, Detclipv241, and Faster R-CNN42. As illustrated in Fig. 7, the TBAM-TIFP consistently achieved superior recall across the datasets. The key innovation of the TBAM-TIFP lay in its training strategy: it was rigorously trained on carefully selected tumor invasive front patches in a fully supervised manner. On the PUTH-IV, the TBAM-TIFP achieved the highest recall of 0.850, compared with 0.842 for GLEE, 0.837 for DAF, 0.830 for TBD-CNDES, 0.825 for Detclipv2, and 0.815 for Faster R-CNN. These data indicated the leading position of the TBAM-TIFP in TB annotation and its effectiveness in improving annotation recall. On the TCGA-COAD-EV, the TBAM-TIFP again performed best, with a recall of 0.753, surpassing GLEE's 0.740, DAF's 0.735, TBD-CNDES's 0.721, Detclipv2's 0.715, and Faster R-CNN's 0.705. On the TCGA-READ-EV, the TBAM-TIFP also maintained its excellent annotation capability and continued to lead with a recall of 0.720, against recalls of 0.711 for GLEE, 0.702 for DAF, 0.695 for TBD-CNDES, 0.688 for Detclipv2, and 0.683 for Faster R-CNN. Across the performance comparisons, the TBAM-TIFP achieved better TB annotation results on both internal and external validation datasets. This series of results not only showed the superiority of the TBAM-TIFP but also provided important justification for optimizing subsequent models, aiding the realization of higher-level TB annotation in the future.
Overall, the TBAM-TIFP demonstrated strong performance and adaptability across different datasets for TB annotation. These results underscored the model's novelty in leveraging specific tumor patches under full supervision, leading to more accurate and reliable TB annotations. The model achieved a balance between recall and precision and exhibited excellent annotation capabilities across diverse samples.
Quantitative analysis of TB annotations in tumor invasive front patches
In the evaluation of TB annotations in tumor invasive front patches, the TBAM-TIFP demonstrated both variability and adaptability in its annotation capabilities across multiple datasets. As illustrated in Fig. 8, the model's predictions were compared with consensus annotations from three pathologists on each dataset. In the PUTH-IV patch, the TBAM-TIFP annotated 3 TBs, exactly matching the 3 TBs annotated manually by the pathologists, with no missed annotations, indicating high accuracy and reliability on this dataset. However, in patches from the TCGA-COAD-EV and TCGA-READ-EV, the annotation performance deviated significantly from the expert annotations. In the TCGA-COAD-EV patch, the model annotated 9 TBs, whereas the pathologists identified 13, resulting in several missed annotations. This was likely due to the irregular shapes and small sizes of some TBs, suggesting that the model is more prone to misclassifying challenging TBs as background in the presence of diverse sample characteristics and background noise. Similarly, in the TCGA-READ-EV patch, the TBAM-TIFP annotated 5 TBs, compared to 7 annotated by the pathologists, again indicating missed annotations. These discrepancies further underscored the impact of inter-dataset variability, including differences in staining protocols and tissue morphology, on model performance. The observed missed annotations appeared closely related to the diversity and complexity of WSIs. Overall, while the TBAM-TIFP showed strong TB annotation performance on the PUTH-IV dataset, the gaps observed on the external validation datasets highlighted the need for further model optimization to improve robustness against varying sample features and background noise. In particular, enhancing the diversity of the training dataset might improve the model's flexibility and annotation accuracy in complex tumor microenvironments.
Qualitative analysis of the fully automated TB annotations in WSIs
This section provided a comprehensive evaluation of the fully automated TB annotation performance in WSIs across three datasets: the PUTH-IV, TCGA-COAD-EV, and TCGA-READ-EV. The aim was to assess the effectiveness and reliability of the fully automated annotation framework.
The average TB grade assessment time per WSI on the three datasets is shown in Table 3; these results were derived from samples of 10 WSIs per dataset. Fully automated annotation assistance (FAAA) yielded a remarkable reduction in total grade assessment time compared to purely manual assessment by three pathologists. While pathologists took between approximately 50 s and 1 min 29 s per WSI across the datasets, the average assessment time under FAAA was only about 15–21 s, an average time saving of over 70%. Such a significant reduction in processing time enabled high-throughput analysis and greatly accelerated the TB grade assessment workflow, underscoring the potential to enhance scalability and reduce observer variability in large-scale WSI analysis across different datasets.
Figure 9 presented a comparison between the fully automated TB annotation results and the consensus manual annotations from three pathologists on a representative WSI from each dataset. The fully automated approach produced complete TB annotations on the PUTH-IV, with obvious hotspots in the WSI. Compared to the manual annotations, the approach demonstrated the ability to identify TBs at the tumor invasive front, although it also exhibited a certain degree of missed and false annotations in these areas. This not only reflected the model's superior performance on internal samples but also revealed its shortcomings in handling intrinsic pathological features. On the TCGA-COAD-EV and TCGA-READ-EV, the annotation performance of the framework exhibited similar instability. In contrast to the manual annotations, the approach incorrectly annotated tumor cells located in the tumor stroma or intratumoral areas as TBs. Although these erroneous annotations did not affect the final grade assessment results, they did lengthen the time required to manually localize accurate TB hotspots. Despite the influence of these erroneous annotations on the overall assessment efficiency for the two external validation datasets, the information extracted from the WSIs indicated that the fully automated approach displayed promising TB annotation performance, providing a high degree of annotation accuracy and consistency. These effective annotations in WSIs significantly enhanced the efficiency of manual TB grade assessment.
Prognostic evaluation of the fully automated TB annotations
To evaluate the prognostic value of the fully automated TB annotation approach, we focused on the relationship between the TB grades determined from the complete annotated data and the survival proportion of CRC patients. Based on FAAA of all WSIs in the PUTH-IV, three experienced pathologists used the professional software ASAP to observe the accurate tumor invasive front areas and count TBs in a single square field normalized to 0.785 mm². The TB grading categories of Bd1 (0–4 buds: Low), Bd2 (5–9 buds: Intermediate), and Bd3 (≥10 buds: High), defined according to the 2016 ITBCC recommendations8, were used for risk stratification based on the actual TB counts in H&E-stained WSIs of CRC. The prognostic significance of TB grade was assessed through Kaplan–Meier survival analysis, and the results derived from FAAA and purely manual assessment were compared in Fig. 10.
As shown in the three figures, the overall survival proportions of Bd1 and Bd2 patients were significantly better than those of Bd3 patients, in line with previous pathological reports. Comparing the survival curves based on FAAA with those from purely manual TB grade assessment, it was evident that patients graded Bd1 and Bd2 with annotation assistance had similar survival proportions at 15–22 months. This phenomenon was largely due to FP annotation errors at the tumor invasive front, which led to the misclassification of some Bd1 cases as Bd2, underestimating the actual survival distinction between the two groups. In other words, the fully automated approach tended to produce false TB annotations in WSIs with low-grade or intermediate-grade TB, causing some Bd1 cases to be incorrectly elevated to Bd2.
As illustrated in Fig. 10C, for patients graded as Bd3 with the assistance of fully automated annotations, there was a distinct difference in survival curves compared with Bd1 and Bd2 patients. This indicated that the fully automated annotation approach had a good auxiliary effect in identifying high-grade TB in WSIs, suggesting that false TB annotations at the tumor invasive front were kept within a relatively low range.
In summary, the prognostic evaluation of the fully automated TB annotation approach showed the potential benefits of incorporating different deep learning models to assist manual TB grade assessment. While there were instances of missed and false annotations in WSIs, this fully automated approach effectively assisted pathologists in determining high-grade TB, significantly reducing manual interpretation time while maintaining satisfactory accuracy in TB grade assessment.
Discussion
TB grade assessment plays a critical role in the clinicopathological diagnosis of CRC, as it is closely associated with adverse pathological features such as lymphovascular invasion, lymph node metastasis, and reduced survival. Traditional manual grade assessment was not only time-consuming but also suffered from substantial inter-observer variability, leading to inconsistent clinical interpretations. Although some deep learning approaches were developed to tackle these limitations, most of them struggled with generalizability across different datasets, accurate localization of the tumor invasive front, and efficient annotation in WSIs. To address these challenges, we developed a fully automated TB annotation approach that integrated two deep learning models, the TNTCM and the TBAM-TIFP, to assist manual grade assessment. This approach systematically localized the tumor invasive front boundaries, identified TBs within the relevant patches, and projected all TB annotations back to WSIs. By retaining the pathologist as the final arbiter of TB grade, our approach minimized erroneous assessments caused by imperfect invasive front localization, significantly accelerating the grade assessment process and enhancing the accuracy of results.
The accurate identification of the tumor invasive front was foundational for reliable TB grade assessment. The TNTCM demonstrated superior performance in distinguishing tumor from normal tissue across multiple datasets, outperforming established MIL models including INSIGHT35, CLAM34, DSMIL36, FRMIL37, and AttentionMIL38. Conceptually, the TNTCM's key advantage was its precise tumor invasive front localization, which mainly benefited from its superior patch-level tumor and normal tissue classification compared to the MIL models. Its high AUC (0.988), specificity (0.951), and sensitivity (0.935) on internal validation underlined its robustness, likely attributable to its patch-level learning architecture, which enhanced focus on local morphological features. Although a decline in performance was observed on the external validation datasets, mainly due to variability in staining protocols, tissue processing, and scanners, the model maintained competitive accuracy, highlighting its generalizability. Qualitative assessments confirmed that the TNTCM completely delineated tumor invasive front boundaries across diverse WSIs, though occasional misclassification between tumor stroma and tumor core did occur. Importantly, these errors did not affect complete TB annotation or grade assessment, demonstrating the clinical utility of the model.
For TB annotations in tumor invasive front patches, the TBAM-TIFP demonstrated a well-balanced performance in both recall and precision on the internal validation dataset, achieving an F1 score of 0.820. While its performance on the external validation datasets showed some reduction, it remained competitive with or even superior to other leading object annotation models such as GLEE39, DAF40, TBD-CNDES26, Detclipv241, and Faster R-CNN42. Unlike generic object annotation models, the TBAM-TIFP was a specialized TB annotation model trained under a fully supervised strategy exclusively on tumor invasive front patches, enabling more complete and accurate TB annotations within critical areas. The decrease in performance on external validation datasets underscored the influence of domain shift caused by factors such as variations in TB morphology, staining intensity, and tissue heterogeneity. Qualitative assessments further indicated that the model sometimes failed to annotate small or irregularly shaped TBs, especially in external WSIs. These shortcomings emphasized the importance of integrating more diverse training instances and enhancing the model's capacity to generalize across variable tissue environments.
The integration of these models into a fully automated approach dramatically reduced the average time required for TB grade assessment. Pathologists assisted by our approach achieved an average time saving of over 70% per WSI compared to purely manual assessment. The sequential, anatomically guided workflow provided a more transparent and interpretable process than other annotation models: it first located the critical areas and then annotated TBs within them, allowing pathologists to quickly verify the TB annotation results and confirm the final grade. While the approach produced clinically consistent and accurate annotations on internal data, external validation revealed occasional FPs, such as misclassifying stromal tumor cells as TBs. Although these errors did not affect the final grade assignments, they occasionally necessitated additional assessment time, suggesting that further refinement is needed to optimize specificity on unfamiliar samples.
From a clinical perspective, our fully automated TB annotation approach demonstrated significant prognostic value. Survival curves generated through FAAA effectively differentiated high-risk Bd3 patients from those with lower TB grades. However, the convergence of the survival curves of Bd1 and Bd2 patients indicated that our model might fail to accurately annotate TBs at the tumor invasive front in lower-grade cases, likely owing to FPs in diagnostically challenging areas of CRC WSIs. This limitation underscored an important consideration for clinical implementation: although the proposed approach performed exceptionally well in assisting pathologists to identify patients with high-grade TB, who are of the greatest clinical concern, additional manual assessment might still be needed to ensure accurate TB grade results within the lower-risk categories.
In conclusion, our study introduced a novel dual-model framework for fully automated TB annotation in WSIs, significantly improving the efficiency and accuracy of grade assessment while maintaining clinical interpretability. The methodologically advanced use of dedicated patch-level classifiers offered greater transparency and computational efficiency. However, several limitations must be acknowledged. First, the performance of the TNTCM exhibited a certain degree of instability across datasets, emphasizing the need to optimize patch classification models to enhance their applicability to heterogeneous samples. Second, although the TBAM-TIFP achieved higher recall in annotating TBs in patches than existing models, missed TB annotations remained a significant challenge, particularly on the external validation datasets; continuous optimization is therefore essential to improve the stability of the annotation model. Finally, while the fully automated TB annotations in WSIs significantly reduced the time required for manual grade assessment, incorrect annotations not only affected the efficiency of determining the final TB grades but also resulted in less distinct survival curves for Bd1/Bd2. Consequently, a critical direction for future studies is to improve the accuracy of TB annotations in WSIs while refining TB grading criteria suited to a hybrid workflow that combines fully automated annotations with final grade determination by pathologists.
Methods
Data preprocessing
We employed an approach similar to CLAM34 to preprocess all WSIs from the different datasets. First, each WSI was downsampled and converted from the RGB to the HSV color space, and all background noise and hole areas were filtered out by adjusting the threshold of the saturation channel, allowing segmentation of the H&E-stained tissue areas. Next, we performed equidistant cutting on all segmented stained areas to obtain a large number of 2048 × 2048 pixel patches, encompassing all tumor tissues as well as various types of normal tissues. Additionally, the absolute coordinates of all patches in the H&E-stained WSIs were saved in a .txt file, ensuring precise localization of each patch in the WSIs. This series of preprocessing steps not only improved the processing speed of WSIs but also laid a solid foundation for subsequent model training.
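A minimal sketch of this preprocessing under stated assumptions: it uses OpenSlide and OpenCV, the saturation threshold and downsampling factor are illustrative values rather than the study's settings, and the coordinate file naming is a placeholder.

```python
import cv2
import numpy as np
import openslide

def tile_wsi(path, patch=2048, sat_thresh=20, down=64):
    """CLAM-style tissue tiling sketch: threshold the saturation channel of a
    downsampled thumbnail, then keep full-resolution patches over tissue."""
    slide = openslide.OpenSlide(path)
    W, H = slide.dimensions
    thumb = np.array(slide.get_thumbnail((W // down, H // down)).convert("RGB"))
    sat = cv2.cvtColor(thumb, cv2.COLOR_RGB2HSV)[:, :, 1]
    mask = cv2.threshold(cv2.medianBlur(sat, 7), sat_thresh, 255,
                         cv2.THRESH_BINARY)[1]
    coords = []
    for y in range(0, H - patch, patch):
        for x in range(0, W - patch, patch):
            ys, xs = y // down, x // down
            # Keep the patch if more than half of its thumbnail area is tissue.
            if mask[ys:ys + patch // down, xs:xs + patch // down].mean() > 128:
                coords.append((x, y))
    # Save absolute patch coordinates to a .txt file (file name is illustrative).
    with open(path + ".coords.txt", "w") as f:
        f.writelines(f"{x},{y}\n" for x, y in coords)
    return coords
```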
Development of the TNTCM
The development of the TNTCM primarily consisted of two key stages: the manual annotation stage and the model training stage.
In the manual annotation stage, pathologists used the professional software ASAP to meticulously annotate WSIs, distinguishing between tumor and normal tissue areas. During the annotation process, polygonal areas were drawn along the boundaries of tumor areas to ensure annotation accuracy, while brush tools were used to delineate normal tissue boundaries, with specific colors assigned to differentiate the two types of areas. Annotation examples were shown in Fig. 11.
The annotation results could be represented as tumor areas and normal tissue areas:

$$A_{\mathrm{tumor}}=\left\{\left({x}_{i},{y}_{i}\right)\right\}_{i=1}^{{N}_{t}},\qquad A_{\mathrm{normal}}=\left\{\left({x}_{i},{y}_{i}\right)\right\}_{i=1}^{{N}_{n}}$$

where \(\left({x}_{i},{y}_{i}\right)\) denoted the coordinates of the vertices of the annotated areas.
In the model training stage, patches of size \(w\times h\), \(w=2048\), \(h=2048\), were first extracted based on the annotated areas and the absolute coordinates of patches in the WSIs. These patches were derived from tumor areas and normal tissue areas, represented as:

$$P_{\mathrm{tumor}}=\left\{{p}_{1},{p}_{2},\ldots ,{p}_{M}\right\},\qquad P_{\mathrm{normal}}=\left\{{q}_{1},{q}_{2},\ldots ,{q}_{K}\right\}$$
Next, all extracted patches were fed into a ResNet3443 convolutional neural network (CNN) for feature extraction, with the extracted feature vectors represented as:

$$f={F}_{\mathrm{ResNet34}}\left(p\right)\in {{\mathbb{R}}}^{d}$$
These feature vectors were used to train a classifier, whose output was:

$$y={\rm{softmax}}\left(Wf+b\right)$$
where \(W\) was the weight matrix, \(b\) was the bias vector, and \(y\) was the predicted class probability distribution.
During training, the error between predicted values and true labels was calculated using the cross-entropy loss function:

$$L=-\mathop{\sum }\limits_{i}{y}_{i}^{* }\log {y}_{i}$$
where \({y}_{i}^{* }\) was the true label and \({y}_{i}\) was the predicted probability.
Model parameters were continuously optimized using gradient descent, with the weight matrix and bias vector updated as:

$$W\leftarrow W-\eta \frac{\partial L}{\partial W},\qquad b\leftarrow b-\eta \frac{\partial L}{\partial b}$$

where \(\eta\) was the learning rate.
Finally, the model’s performance was evaluated using validation and test sets. The evaluation metrics included specificity, sensitivity, and AUC.
Following the above steps, a reliable TNTCM was successfully developed that could effectively distinguish between tumor and normal tissue patches.
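A minimal PyTorch sketch of this training stage; the optimizer settings and whether the ResNet34 backbone is pretrained are our assumptions, not reported values.

```python
import torch
import torch.nn as nn
from torchvision import models

# Sketch of the TNTCM training stage: ResNet34 features + a 2-class head.
model = models.resnet34()                          # pretraining: an assumption
model.fc = nn.Linear(model.fc.in_features, 2)      # tumor vs. normal tissue

criterion = nn.CrossEntropyLoss()                  # the loss L defined above
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(patches: torch.Tensor, labels: torch.Tensor) -> float:
    """One gradient-descent update, i.e. W <- W - eta * dL/dW."""
    optimizer.zero_grad()
    loss = criterion(model(patches), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```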
Development of the TBAM-TIFP
The development of the TBAM-TIFP mainly consisted of three phases: manually annotating TB hotspots and extracting patches, annotating TBs and poorly differentiated clusters (PDCs) in the patches, and training a TB annotation model in the patches.
In the process of manually annotating TB hotspots, pathologists used the professional software ASAP to precisely delineate regions of interest (ROIs) containing many TBs at the tumor invasive front in WSIs of CRC, laying the foundation for subsequent patch-level analysis. An annotation example was shown in Fig. 12.
These annotated areas were then automatically cropped into patches to improve the efficiency of subsequent analysis. Assuming the WSI was \(I\), with a size of \(W\times H\) pixels, the annotated area \(A\) was divided into \(n\) patches \({P}_{i}\), each of size \(w\times h\) pixels. This process could be expressed as:

$$A=\mathop{\bigcup }\limits_{i=1}^{n}{P}_{i}$$

where \(i=1,2,\ldots ,n\), \(w=2048\), and \(h=2048\).
Next, pathologists used the professional software Labeling to annotate TBs and PDCs in each extracted patch, focusing on their location, morphology, and size. The annotations were carefully cross-checked to form a structured, high-quality annotated patch dataset, providing a solid foundation for model training. For each patch \({P}_{i}\), the annotated TBs and PDCs were represented as:

$${B}_{i}=\left\{\left({t}_{j},{x}_{j},{y}_{j},{w}_{j},{h}_{j}\right)\right\}_{j=1}^{{m}_{i}}$$

where \(j=1,2,\ldots ,{m}_{i}\), \({t}_{j}\) was the specific type of the annotation, \(\left({x}_{j},{y}_{j}\right)\) were the coordinates of the bounding box center, and \({w}_{j}\) and \({h}_{j}\) were the width and height of the bounding box, respectively. Examples of the annotations in patches were shown in Fig. 13.
Finally, the YOLOv8 object annotation framework was employed to train the TB annotation model. The dataset was divided into training, validation, and test sets in an 8:1:1 ratio to ensure comprehensive training and effective evaluation of the model. CNNs were used to extract features, while data augmentation techniques were applied to enhance the model's robustness. Model performance was monitored through the loss functions and evaluation metrics. The total loss was defined as:

$$L={L}_{{cls}}+{L}_{{reg}}+{L}_{{obj}}$$

where \({L}_{{cls}}\) was the classification loss, \({L}_{{reg}}\) was the regression loss, and \({L}_{{obj}}\) was the object annotation loss. The learning rate \(\eta\) was dynamically adjusted to optimize the training speed:

$${\eta }_{t}={\eta }_{0}\cdot {\gamma }^{t}$$

where \(\gamma\) was the learning rate decay factor. The model's performance was finally evaluated using metrics including recall, precision, and F1 score.
Through these steps, a robust TB annotation model capable of accurately identifying TBs from patches was ultimately established.
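For illustration, training a YOLOv8 detector with the public Ultralytics API might look like the sketch below; the dataset YAML, model variant, image size, and hyperparameters are placeholders rather than the study's actual settings.

```python
from ultralytics import YOLO

# Hypothetical YOLOv8 training run for the TB/PDC annotation model.
model = YOLO("yolov8s.pt")          # pretrained checkpoint (illustrative choice)
model.train(
    data="tb_patches.yaml",         # paths to the 8:1:1 train/val/test split
    imgsz=1024,                     # training resolution (assumption)
    epochs=100,
    lr0=0.01,                       # initial learning rate eta_0
    lrf=0.01,                       # final lr fraction, i.e. the decay schedule
    augment=True,                   # data augmentation for robustness
)
metrics = model.val()               # reports precision/recall on the val split
```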
Implementation of the fully automated TB annotations in WSIs
The implementation of fully automated TB annotation in WSIs could be divided into three main parts: tumor invasive front boundary extraction, TB annotation, and whole-slide annotation transfer. These three steps were closely connected, forming a complete automated pipeline.
First, tumor invasive front boundary extraction served as the foundation of the entire process. All WSIs were divided into patches of the same size, and these patches were fed into a pre-trained ResNet34 model for feature extraction. Specifically, for each patch \({I}_{p}\), the model extracted its feature vector:

$${f}_{p}=F\left({I}_{p}\right)$$
Then, a classification model was used to classify the feature vector, yielding the category label for each patch:

$${\hat{y}}_{p}={\rm{argmax}}\ {\rm{softmax}}\left(W{f}_{p}+b\right)$$
where \(W\) and \(b\) were model weights and bias. By analyzing the distribution of tumor patches and normal tissue patches, the tumor invasive front boundary points were determined:

$$B=\left\{\left({x}_{i},{y}_{i}\right)\ |\ \left({x}_{i},{y}_{i}\right)\in \mathrm{tumor\; patches},\ \exists \ \mathrm{normal\; patch\; in}\ \left[{x}_{i}\pm \Delta x\right]\times \left[{y}_{i}\pm \Delta y\right]\right\}$$

where \(\left({x}_{i},{y}_{i}\right)\) represented the coordinates of the tumor patch, and the neighborhood range was defined by \(\Delta x=2048\) and \(\Delta y=2048\). This step provided precise boundary information for the subsequent TB annotation.
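A Python sketch of this neighborhood rule, mirroring the front test in the earlier pipeline sketch; the dict keyed by absolute patch coordinates and the label strings are our assumptions about data layout.

```python
def invasive_front_points(labels, dx=2048, dy=2048):
    """Return coordinates of tumor patches that have at least one normal
    patch within the (dx, dy) 8-neighborhood, per the rule above."""
    front = []
    for (x, y), lab in labels.items():
        if lab != "tumor":
            continue
        neighborhood = [(x + i * dx, y + j * dy)
                        for i in (-1, 0, 1) for j in (-1, 0, 1)
                        if (i, j) != (0, 0)]
        if any(labels.get(p) == "normal" for p in neighborhood):
            front.append((x, y))
    return front
```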
Next, along the identified tumor invasive front boundary, all relevant patches \({I}_{p}^{{\prime} }\) at the tumor invasive front were extracted, and the YOLOv8 model was employed for TB annotation. The YOLOv8 model extracted features from the input patches through a CNN and generated a series of candidate areas indicating potential TB locations. The annotation results included bounding box coordinates \(\left({x}_{p},{y}_{p},{w}_{p},{h}_{p}\right)\) and confidence scores \(s\). By setting a confidence threshold \(\tau\), valid detection results were filtered:

$$D=\left\{\left({x}_{p},{y}_{p},{w}_{p},{h}_{p},s\right)\ |\ s\ge \tau \right\}$$
These results were used to draw TB bounding boxes in the patches and were saved in a standardized format, providing data support for the subsequent whole-slide annotation transfer.
Finally, the coordinates of the TB bounding boxes \(\left({x}_{p},{y}_{p},{w}_{p},{h}_{p}\right)\) in the patches were transferred to absolute coordinates \(\left({x}_{s},{y}_{s},{w}_{s},{h}_{s}\right)\) in the WSIs. Specifically, the coordinate transformation was performed using:

$${x}_{s}={x}_{{\rm{offset}}}+{x}_{p},\quad {y}_{s}={y}_{{\rm{offset}}}+{y}_{p},\quad {w}_{s}={w}_{p},\quad {h}_{s}={h}_{p}$$

where \(\left({x}_{{\rm{offset}}},{y}_{{\rm{offset}}}\right)\) represented the absolute coordinates of the patch within the WSI. This step ensured the precise localization of every TB bounding box in the WSIs.
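The transformation is a pure translation, as the small sketch below shows; the example coordinates are illustrative.

```python
def to_wsi_coords(box, offset):
    """Map a patch-relative TB box (x_p, y_p, w_p, h_p) to absolute WSI
    coordinates using the patch's saved offset (x_offset, y_offset)."""
    x_p, y_p, w_p, h_p = box
    x_off, y_off = offset
    return (x_off + x_p, y_off + y_p, w_p, h_p)

# A box at (512, 300) inside the patch anchored at (40960, 61440)
# lands at (41472, 61740) in the WSI; width and height are unchanged.
assert to_wsi_coords((512, 300, 96, 88), (40960, 61440)) == (41472, 61740, 96, 88)
```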
Through this series of rigorous processes, fully automated TB annotation in WSIs was achieved, providing critical data support for TB grade assessment in CRC.
Quantification and statistical analysis
The primary technical validation metrics included AUC, precision, recall, specificity, and sensitivity, enabling a comprehensive performance assessment. For each validation experiment, the mean and standard deviation of model performance were calculated across both internal and external validation datasets, along with 95% confidence intervals derived from bootstrapping with 1000 iterations. To statistically compare performance between internal and external validation datasets, paired t-tests were applied, with p-values < 0.05 considered significant. To ensure data independence in the clinical validation, only one WSI per patient was included, with patients exclusively assigned to the internal validation dataset. External validation datasets from diverse sources were utilized to rigorously evaluate model generalizability. All model development and validation were implemented in Python (version 3.12.0) with PyTorch (version 1.10) on Ubuntu 20.04, ensuring robust and reproducible analysis.
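A minimal sketch of the bootstrap CI computation described here, assuming case-level resampling with replacement; the metric callable (e.g., scikit-learn's roc_auc_score) and the seed are illustrative.

```python
import numpy as np

def bootstrap_ci(y_true, y_score, metric, n_boot=1000, alpha=0.05, seed=0):
    """95% CI for a metric via bootstrapping with 1000 iterations."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if np.unique(y_true[idx]).size < 2:   # AUC needs both classes present
            continue
        stats.append(metric(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)

# Usage (illustrative): bootstrap_ci(y, p, sklearn.metrics.roc_auc_score)
```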
Data availability
The datasets generated or analysed in this study come from two sources. The public TCGA dataset is available at https://www.cancer.gov/ccg/research/genome-sequencing/tcga. The patient data from the PUTH are not publicly available in order to protect patient privacy and confidentiality in compliance with ethical regulations.
References
Eng, C. et al. A comprehensive framework for early-onset colorectal cancer research. Lancet Oncol. 23, 116–128 (2022).
Yin, H. F. et al. Machine learning-based analysis identifies and validates serum exosomal proteomic signatures for the diagnosis of colorectal cancer. Cell Rep. Med. 5, 101689 (2024).
Lugli, A., Zlobec, I., Berger, M. D., Kirsch, R. & Nagtegaal, I. D. Tumour budding in solid cancers. Nat. Rev. Clin. Oncol. 18, 101–115 (2021).
Kemi, N., Eskuri, M., Ikalainen, J., Karttunen, T. J. & Kauppila, J. H. Tumor budding and prognosis in gastric adenocarcinoma. Am. J. Surg. Pathol. 43, 229–234 (2019).
Cyr, D. P. et al. Tumor budding assessment in colorectal carcinoma: normalization revisited. Am. J. Surg. Pathol. 48, 251–265 (2024).
Pihlmann, K. M. et al. The prognostic value of tumor budding in a thoroughly characterized stage II colon cancer population in the context of a national screening program. Hum. Pathol. 146, 15–22 (2024).
Zlobec, I. & Lugli, A. Tumour budding in colorectal cancer: Molecular rationale for clinical translation. Nat. Rev. Cancer 18, 203–204 (2018).
Lugli, A. et al. Recommendations for reporting tumor budding in colorectal cancer based on the International Tumor Budding Consensus Conference (ITBCC) 2016. Mod. Pathol. 30, 1299–1311 (2017).
Regmi, P. et al. Prognostic significance of tumor budding in biliary tract cancer. Eur. J. Surg. Oncol. 48, 160–168 (2022).
Sajjad, U. et al. Enhancing colorectal cancer tumor bud detection using deep learning from routine H&E-stained slides. In Proc SPIE Int Soc Opt Eng. 12933, 129330T (2024).
Wang, X. Y. et al. Transformer-based unsupervised contrastive learning for histopathological image classification. Med. Image Anal. 81, 102559 (2022).
Yang, H. C. et al. An empirical study of ground segmentation for 3-D object detection. IEEE Trans. Intell. Transp. Syst. 26, 3071–3083 (2025).
Wu, L. S. et al. Querying labeled for unlabeled: cross-image semantic consistency guided semi-supervised semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 45, 8827–8844 (2023).
Weis, C. A. et al. Automatic evaluation of tumor budding in immunohistochemically stained colorectal carcinomas and correlation to clinical outcome. Diagn. Pathol. 13, 64 (2018).
Fauzi, M. F. A. et al. Tumor budding detection system in whole slide pathology images. J. Med. Syst. 44, 38 (2019).
Koelzer, V. H., Zlobec, I. & Lugli, A. Tumor budding in colorectal cancer—ready for diagnostic practice? Hum. Pathol. 47, 4–19 (2016).
Haddad, T. S. et al. Combining immunoscore and tumor budding in colon cancer: an insightful prognostication based on the tumor-host interface. J. Transl. Med. 22, 1090 (2024).
Yamadera, M. et al. Differential clinical impacts of tumour budding evaluated by the use of immunohistochemical and haematoxylin and eosin staining in stage II colorectal cancer. Histopathology 74, 1005–1013 (2019).
Takamatsu, M. et al. Immunohistochemical evaluation of tumor budding for stratifying T1 colorectal cancer: Optimal cut-off value and a novel computer-assisted semiautomatic method. Mod. Pathol. 32, 675–683 (2019).
Pai, R. K. et al. Development and initial validation of a deep learning algorithm to quantify histological features in colorectal carcinoma including tumour budding/poorly differentiated clusters. Histopathology 79, 391–405 (2021).
Liu, S. L. et al. Establishment and clinical application of an artificial intelligence diagnostic platform for identifying rectal cancer tumor budding. Front. Oncol. 11, 626626 (2021).
Tavolara, T. E. et al. A modular cGAN classification framework: application to colorectal tumor detection. Sci. Rep. 9, 18969 (2019).
Hou, X. P. et al. Triple-negative breast cancer survival prediction using artificial intelligence through integrated analysis of tertiary lymphoid structures and tumor budding. Cancer 130, 1499–1512 (2024).
Griem, J. et al. Artificial intelligence-based tool for tumor detection and quantitative tissue analysis in colorectal specimens. Mod. Pathol. 36, 100327 (2023).
Tolkach, Y. et al. Artificial intelligence for tumour tissue detection and histological regression grading in oesophageal adenocarcinomas: a retrospective algorithm development and validation study. Lancet Digit. Health 5, 265–275 (2023).
Bokhorst, J. M. et al. Fully automated tumor bud assessment in hematoxylin and eosin-stained whole slide images of colorectal cancer. Mod. Pathol. 36, 100233 (2023).
Gao, L. F. et al. Tumor bud-derived CCL5 recruits fibroblasts and promotes colorectal cancer progression via CCR5-SLC25A24 signaling. J. Exp. Clin. Cancer Res. 41, 81 (2022).
Wang, Z. Z. et al. Label cleaning multiple instance learning: Refining coarse annotations on single whole-slide images. IEEE Trans. Med. Imaging 41, 3952–3968 (2022).
Pati, S. et al. Federated learning enables big data for rare cancer boundary detection. Nat. Commun. 13, 7346 (2022).
Cheplygina, V., Bruijne, M. & Pluim, J. P. W. Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis. Med. Image Anal. 54, 280–296 (2019).
Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc IEEE/CVF Conf. Comput. Vis. Pattern Recognit (CVPR) 580–587 (IEEE, 2014).
Sangaiah, A. K., Yu, F.-N., Lin, Y.-B., Shen, W.-C. & Sharma, A. UAV T-YOLO-Rice: An enhanced Tiny YOLO networks for rice leaves diseases detection in paddy agronomy. IEEE Trans. Netw. Sci. Eng. 11, 5201–5216 (2024).
Gao, Q., Chen, Y. & Ju, Z. Oropharynx visual detection by using a multi-attention single-shot multibox detector for human–robot collaborative oropharynx sampling. IEEE Trans. Hum.-Mach. Syst. 53, 1073–1082 (2023).
Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat. Biomed. Eng. 5, 555–570 (2021).
Chang, X. N. et al. Predicting colorectal cancer microsatellite instability with a self-attention-enabled convolutional neural network. Cell Rep. Med. 4, 100914 (2023).
Li, B., Li, Y. & Eliceiri, K. W. Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. In Proc IEEE/CVF Conf. Comput. Vis. Pattern Recognit (CVPR) 14313–14323 (IEEE, 2021).
Chikontwe, P. et al. FR-MIL: distribution re-calibration-based multiple instance learning with transformer for whole slide image classification. IEEE Trans. Med. Imaging 44, 409–421 (2025).
Ilse, M., Tomczak, J. M. & Welling, M. Attention-based deep multiple instance learning. In Proc. Int. Conf. Mach. Learn (ICML) 2127–2136 (PMLR, 2018).
Wu, J. F. et al. General object foundation model for images and videos at scale. In Proc IEEE/CVF Conf. Comput. Vis. Pattern Recognit (CVPR) 3783–3795 (IEEE, 2024).
Cai, Y. X. et al. A discrepancy aware framework for robust anomaly detection. IEEE Trans. Ind. Inf. 20, 3986–3995 (2024).
Yao, L. W. et al. Detclipv2: scalable open-vocabulary object detection pre-training via word-area alignment. In Proc IEEE/CVF Conf. Comput. Vis. Pattern Recognit (CVPR) 23497–23506 (IEEE, 2023).
Ren, S. Q., He, K. M., Girshick, R. & Sun, J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1137–1149 (2017).
He, K. M., Zhang, X. Y., Ren, S. Q. & Sun, J. Deep residual learning for image recognition. In Proc IEEE/CVF Conf. Comput. Vis. Pattern Recognit (CVPR) 770–778 (IEEE, 2016).
Acknowledgements
This work was supported by grants from the Ministry of Science and Technology of the People's Republic of China (CN) (grant no. 2022YFF1203300) and the National Natural Science Foundation of China (grant no. 62476106).
Author information
Contributions
Song Peng, Xiaona Chang, and Yuehao Luo wrote the main manuscript text. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Peng, S., Chang, X., Luo, Y. et al. A full-automated tumor budding annotation approach in hematoxylin and eosin-stained whole slide images of colorectal cancer. npj Precis. Onc. 10, 33 (2026). https://doi.org/10.1038/s41698-025-01238-4