Introduction

Magnetic Resonance Imaging (MRI) plays a pivotal role in staging rectal cancer and selecting treatment plans, providing valuable information on the extent of tumor infiltration within and beyond the bowel wall and into critical anatomical structures, including the perirectal vessels, the mesorectal fascia (MRF), the peritoneum, and neighboring pelvic organs1,2. T2-weighted imaging (T2WI) forms the mainstay of the MRI protocol because its superior soft-tissue contrast allows the different layers of the rectal wall, the mesorectal fat, adjacent vessels, and the MRF to be discerned for detailed local staging3,4.

Precisely segmenting the tumor is an important task in rectal cancer management. Tumor segmentations are utilized for several purposes including radiation treatment planning, volumetric analysis, and extraction of imaging biomarkers. Tumor delineation by radiologists is considered the current gold standard. Nevertheless, it is time-consuming and subject to substantial intra- and inter-observer variation5,6,7. Developing an accurate, generalizable, and robust rectal tumor segmentation model can help reduce this variability and assist in standardizing several steps of diagnostic and therapeutic rectal cancer management.

Deep learning in rectal tumor segmentation

Deep learning (DL) has seen rapid uptake in several fields, achieving promising results in medical image analysis8. Convolutional Neural Network (CNN)-based DL approaches excel at learning image representations from annotated data using learnable feature-extraction filters and sequential convolution, activation, and pooling operations. Among CNNs, the U-Net9 and its variants are the most popular architectures. Several studies10,11,12 have explored 2D U-Nets and other 2D CNN variants, achieving Dice similarity coefficient (DSC) scores ranging from 0.59 to 0.84 for tumor segmentation. Although 2D CNNs are less computationally expensive, 3D models can leverage richer context to improve predictions13.

Hamabe et al.14 implemented a 3D U-Net, achieving an average DSC of 0.73 (0.60–0.80) over 10-fold cross-validation; however, their study included no external validation. Besides CNN-based models, transformer-based architectures are also being applied in medical image analysis because of their ability to access long-range semantic information15. Li et al.16 proposed RTAU-Net, a 3D dual-path fusion network containing a transformer encoder for extracting the global contour information of the rectal tumor. RTAU-Net achieved average DSCs of 0.80 and 0.68 on data from two medical centers. However, RTAU-Net requires manual removal of tumor-free slices, which hinders fully automated implementation. Additionally, RTAU-Net was not compared against state-of-the-art medical segmentation networks such as nnUNet17, a self-configuring implementation of the U-Net architecture, or nnFormer18, which introduces 3D transformer blocks on top of nnUNet.

Deep learning in rectum and mesorectum segmentation

Besides rectal tumor delineation, some studies have also demonstrated that CNNs can accurately delineate anatomical structures such as the rectum and the mesorectum (perirectal fat), with DSCs above 0.9014,19,20. Automated rectum and mesorectum delineation could potentially improve radiological evaluation, as the prognosis of rectal cancer depends on how far the tumor has infiltrated the layers of the rectal wall and the mesorectum, and on the attainment of negative circumferential resection margins (CRMs) at surgery21. Additionally, Lee et al.22 demonstrated that a 2D model's variance in tumor regions can be reduced by 90% by incorporating rectal segmentation into the model's objective. Integrating rectal anatomical knowledge can provide a more comprehensive representation of the T2WI, allowing the model to learn richer and more nuanced patterns and to perform better on unseen data. However, the impact of adding rectal anatomical structures, including the mesorectum, to rectal tumor segmentation has not been investigated in a multi-institutional setting. Moreover, incorporating rectal anatomical structures has so far been limited to adding auxiliary segmentation tasks, as in Lee et al.22.

Anatomy-aware inpainting for anomaly detection

Most automated medical image segmentation methods rely on supervised learning, which requires large volumes of reliably labeled data—often difficult to obtain in medical imaging. As a result, semi-supervised and unsupervised methods have gained increasing attention. Anomaly detection, which can operate in both settings, has been applied to tasks such as segmentation. Generative models—such as Generative Adversarial Networks (GANs)23, Autoencoders (AEs), and their variants including Variational Autoencoders (VAEs)24 and Vector Quantized VAEs (VQ-VAEs)25—have shown promise in this field26,27,28. These models are typically trained to reconstruct images from a distribution of normal tissue; when presented with anomalies (e.g., tumors), they often fail to reconstruct the affected regions, resulting in higher reconstruction errors.

When trained on full MRI slices, these models are required to reconstruct the entire image—including both relevant and irrelevant anatomy—which can impair accurate modeling of healthy structures and lead to errors in both normal and abnormal regions. Consequently, anomaly maps—pixel-wise error maps designed to localize abnormalities—may incorrectly flag normal anatomical variations or imaging artifacts as anomalies, reducing specificity.

Unlike image-to-image reconstruction, inpainting focuses on restoring missing or occluded image regions using surrounding context, typically with dataset-independent masks. Nguyen et al.29 applied inpainting for brain tumor segmentation in T1-weighted MRI by identifying regions with the highest reconstruction loss. Incorporating anatomical priors allows inpainting to target high-risk areas, minimizing background influence. Yeganeh et al.30 introduced an anatomy-aware masking strategy to improve organ shape learning, while Woo et al.31 proposed a UNet-based model for bone lesion detection in knee MRI via inpainting, showing that identified anomalies can support downstream segmentation.

In anatomical inpainting based anomaly detection, the region of interest (ROI) is masked—typically covering relevant structures—and the model is trained to reconstruct these regions assuming normal anatomy. The difference between the original and reconstructed images is then used to localize anomalies.

Our contributions

In this study, we present several key contributions aimed at advancing rectal tumor segmentation in T2WI. We developed and evaluated a rectal tumor segmentation model that incorporates anomaly maps generated through anatomical inpainting, achieving improved segmentation performance. These anomaly maps were derived using a novel end-to-end inpainting model trained exclusively on prostate T2WI32 and applied to rectal T2WI, challenging conventional domain-specific practices and demonstrating the potential of transfer learning. The generated maps can also support various clinical downstream tasks.

Unlike tasks with public challenge datasets, such as clinically significant prostate lesion segmentation (PICAI)32 or brain tumor segmentation33, rectal cancer lacks a large, publicly available multicenter MRI dataset, which makes it difficult to benchmark different models. An extensive external validation study with multicenter data is therefore highly desirable to compare deep learning segmentation approaches. We benchmarked nine state-of-the-art 3D deep learning models for rectal tumor segmentation using a large multicenter dataset, offering comprehensive comparative insights. A 3D deep learning model was specifically developed to segment rectal anatomical structures, including the rectum and mesorectum (the fatty tissue surrounding the rectum). Additionally, we explored multiple strategies for integrating rectal anatomical information into tumor segmentation, including the use of anomaly maps, auxiliary segmentation tasks, and prior knowledge, demonstrating their potential to improve both segmentation accuracy and clinical utility. These contributions collectively address critical challenges in rectal MRI analysis, advancing the field toward more robust and clinically applicable solutions.

Results

Dataset and patient characteristics

As part of a previous institutional review board-approved multicenter study project34,35,36,37,38,39, clinical and imaging data from 1426 patients with biopsy-proven rectal cancer were retrospectively collected from ten medical centers between 2011 and 2018. The original study was conducted in accordance with the Declaration of Helsinki, and informed consent was waived owing to the retrospective nature of the study. For the current study, the baseline staging T2WI of 705 patients (from 9 centers) was selected from this previous dataset. Patients were excluded for any of the following reasons: non-diagnostic image quality, multiple tumors in the field of view, abscesses surrounding the tumor, unavailability of pre-treatment T2WI, or unavailability of axial images. Table 1 shows the characteristics of all rectal cancer patients, including age, gender, clinical T and N stage, tumor location, and extramural vascular invasion (EMVI) status. To train the anatomical inpainting model for reconstructing the healthy rectum and mesorectum, 300 samples with a healthy rectum and mesorectum were randomly selected from the PICAI dataset.

Table 1 Summary of patient demographic and clinical characteristics of the multicenter dataset.

Rectal anomaly detection

According to Table 2, the inpainting model demonstrated superior performance and lower variance when reconstructing the rectum and mesorectum in prostate T2WI, with an average SSIM (aSSIM) of 86.72 and an average PSNR (aPSNR) of 25.87, compared with the in-house rectal T2WI, which had an aSSIM of 83.38 and an aPSNR of 23.87. These results aligned with expectations: prostate T2WI was in-distribution and represented healthy samples of the rectum and mesorectum, whereas rectal T2WI with tumor regions was out-of-distribution and exhibited greater discrepancies from its pseudo-healthy counterparts. Importantly, the aSSIM (39.32) and aPSNR (17.50) of tumoral regions were noticeably lower than those of tumor-free regions, indicating effective anomaly detection. The anomaly map can be generated from the differences between the original and pseudo-healthy images (Fig. 1).
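The region-restricted aSSIM and aPSNR values above can be reproduced with standard image-quality metrics; the sketch below is a minimal illustration using scikit-image and NumPy, assuming the region (tumoral or tumor-free) is given as a binary mask and that SSIM is averaged over the masked portion of the SSIM map. The exact masking convention used in the study is not specified here, so this is one plausible choice rather than the study's implementation.

```python
import numpy as np
from skimage.metrics import structural_similarity

def region_ssim_psnr(original, reconstructed, region_mask):
    """SSIM/PSNR restricted to a binary region (e.g., tumoral or tumor-free area).

    original, reconstructed: 2D float arrays scaled to [0, 1].
    region_mask: boolean array of the same shape.
    """
    # Full SSIM map, then average only over the region of interest.
    _, ssim_map = structural_similarity(
        original, reconstructed, data_range=1.0, full=True
    )
    region_ssim = float(ssim_map[region_mask].mean())

    # PSNR computed from the MSE of the masked pixels only.
    mse = float(np.mean((original[region_mask] - reconstructed[region_mask]) ** 2))
    region_psnr = 10.0 * np.log10(1.0 / max(mse, 1e-12))

    # SSIM returned on a 0-100 scale, matching how it is reported in the text.
    return 100.0 * region_ssim, region_psnr
```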

Table 2 Rectum and mesorectum inpainting performance in the external test data.
Fig. 1
figure 1

The visualization of anatomical inpainting. The columns from left to right are original T2WI slices, masked slices (grey: rectum; black: mesorectum), inpainted slices, anomaly map, which is the difference between original and reconstructed slices, and tumor masks. The first two rows are prostate T2WI (without tumor), and the last two rows are from rectal T2WI (with tumor). Colorbar: Shows pixel-wise reconstruction error; higher values indicate greater differences and potential anomalies.

The comparison of nine DL models for rectal tumor segmentation

In the internal dataset (Training Cohort 1, 5-fold cross-validation), MedFormer delivered the best overall performance, with an average DSC (aDSC) of 66.3 and a median 95% HD (mHD) of 6.39 mm. U-Mamba achieved the highest median DSC (mDSC), while ResUNet excelled in the average 95% HD (aHD); see Table S1. However, on the external test data (num = 666, 9 centers), as displayed in Table 3, Table S2, and Fig. 2a, nnUNet achieved the best results, with an aDSC of 62.8 and an aHD of 17.28 mm, significantly better than the other models. nnUNet also consistently outperformed the other models in the external test when the number of training cases from the single center was increased to 132 (Tables S3, S4). Additionally, transformer-based networks, including UNetR, SwinUNetR, and nnFormer, and the SSM-inspired U-Mamba underperformed relative to CNN-based architectures in the external test. We also compared the number of trainable parameters and Floating Point Operations (FLOPs) of each model (Fig. 2b). In summary, nnUNet achieved the highest DSC despite having relatively few parameters and a low number of FLOPs. As shown in Fig. 3, nnUNet identified rectal tumors more reliably in the external test and produced markedly fewer false-positive voxels in the displayed cases, especially in the last row of Fig. 3, which shows a tumor-free slice.

Table 3 Comparison of various models on rectal tumor segmentation on the external test (Num = 666, 9 centers).
Fig. 2
figure 2

(a) The DSC and 95% HD boxplots of the nine DL models. The yellow triangle denotes the mean DSC. (b) Average DSC or 95% HD versus the number of trainable parameters and FLOPs. y-axis: DSC or 95% HD; x-axis: number of parameters (M); bubble size and the number under each model: FLOPs (G).

Fig. 3
figure 3

The visualization of the segmentation performance of all nine DL models using T2WI. Each row is a different patient from a different center (external set). The columns from left to right are original T2WI, GT: ground truth, tumor prediction masks from UNet, ResUNet, UNetR, SwinUNetR, Atten-UNet, MedFormer, nnFormer, U-Mamba, and nnUNet.

Rectum and mesorectum segmentation

nnUNet demonstrated superior performance for rectum segmentation, with an aDSC of 0.87 and an mHD of 10.15 mm, compared with mesorectum segmentation (aDSC 0.81, aHD 10.57 mm) in the external cohort of 141 samples (Table S5, Fig. S1). This disparity is attributed to the rectum's more regular shape compared with the mesorectum, consistent with the findings of Hamabe et al.14. Overall, segmenting rectal anatomical structures was significantly easier than segmenting tumors.

MTnnUNet, MCnnUNet, and AAnnUNet (Fully Supervised)

According to Table 4, Table S6, and Fig. 4a, MTnnUNet and AAnnUNet significantly outperformed both nnUNet and MCnnUNet. Even though MCnnUNet showed the best results in the internal validation (Table S7), it exhibited the lowest aDSC and aHD in the external test. This may be due to MCnnUNet's reliance on accurate anatomical inputs: it was trained with ground truth anatomical masks but used AI-generated rectum and mesorectum masks during inference, introducing inconsistencies that likely contributed to the performance drop on the external test. Unlike MCnnUNet, MTnnUNet uses anatomical knowledge only during training. Although AAnnUNet, which fuses anomaly maps that highlight tumoral regions, also relied on the quality of the rectum and mesorectum masks, it slightly outperformed MTnnUNet in terms of aDSC. Unlike MCnnUNet, which directly incorporated anatomical masks as input channels, AAnnUNet utilized anomaly maps derived from healthy distributions. We also introduced a union ensemble combining MTnnUNet and AAnnUNet, which improved the aDSC by 3%; however, it did not reduce the HD. Figure 5 shows that AAnnUNet effectively segmented both small and large tumors, demonstrating the benefit of anomaly fusion. Furthermore, the Grad-CAM saliency map in Fig. S2 indicates that AAnnUNet more effectively captures tumoral features. However, its performance declined when anomaly maps were suboptimal, as seen in the last row. In some cases, all models failed to detect the tumor, though high anomaly-map intensities (Fig. 6) still indicated potential abnormalities.

Table 4 Comparison of nnUNet, MTnnUNet, MCnnUNet, AAnnUNet, and Ensemble for rectal tumor segmentation on the external test (fully supervised) (Num = 666, 9 centers).
Fig. 4
figure 4

(a) The DSC and 95% HD boxplots of nnUNet, MCnnUNet, MTnnUNet, AAnnUNet, and Ensemble in the fully-supervised setting. (b) The DSC and 95% HD boxplots of nnUNet, MCnnUNet, MTnnUNet, AAnnUNet, and Ensemble in the mixed-supervised setting.

Fig. 5
figure 5

The visualization of the segmentation performance of nnUNet, MTnnUNet, MCnnUNet, AAnnUNet, and Ensemble using T2WI, supervised setting. Each row is a different sample from the external test set. The columns from left to right are original T2WI, ground truth, tumor prediction masks from nnUNet, MTnnUNet, MCnnUNet, AAnnUNet, and Ensemble. Colorbar: Shows pixel-wise reconstruction error; higher values indicate greater differences and potential anomalies.

Fig. 6
figure 6

The example (from the external test) where all algorithms failed to segment the rectal tumor, but the anomaly map highlighted the tumoral region. Colorbar: Shows pixel-wise reconstruction error; higher values indicate greater differences and potential anomalies.

MTnnUNet, MCnnUNet, and AAnnUNet (Mixed Supervised)

Mixed supervision, using manually annotated tumors together with AI-generated rectum and mesorectum masks, was applied to train nnUNet, MTnnUNet, MCnnUNet, and AAnnUNet on Training Cohort 2, which comprised 141 cases from a single center. In the internal validation set, MTnnUNet achieved the best performance, while AAnnUNet performed best on the external test. Unlike in the fully supervised setting, MCnnUNet showed improved aDSC and aHD on the external test. This may be attributed to the consistent use of AI-generated rectum and mesorectum masks during both training and inference, leading to more stable performance across datasets (Table 5, Tables S8, S9, Fig. 4b).

Table 5 Comparison of nnUNet, MTnnUNet, MCnnUNet, AAnnUNet, and Ensemble for rectal tumor segmentation on the external test (mixed supervised) (Num = 564, 8 centers).

In both the fully- and mixed-supervised settings, MTnnUNet outperformed the baseline nnUNet, while MCnnUNet showed improvement only under mixed supervision. Notably, incorporating anomaly maps enhanced tumor localization accuracy in both settings. As illustrated in the first row of Fig. S3, while nnUNet misclassified part of the bladder as tumor due to similar intensities, MTnnUNet, MCnnUNet, and AAnnUNet, which fuse anatomical masks or anomaly maps, correctly identified the tumor regions.

Methods

Ground truth segmentation

The ground truth segmentation masks were annotated by a gastrointestinal (GIT) radiologist (M.A.A) with 6–7 years of experience in interpreting rectal MRI. Masks for the rectum and mesorectum were annotated on 180 randomly selected rectal cases and 100 prostate T2W images. Specifically, the entire rectum was annotated, including its lumen, from the anorectal junction to the recto-sigmoid junction, following the definition of the sigmoid take-off as the upper anatomical landmark of the rectum. The mesorectal fat, enveloped by the mesorectal fascia, was identified as the high-T2-signal area surrounding the rectum on all sides, thinner anteriorly than postero-laterally40,41. Moving caudally towards the lower rectum, the thickness of the mesorectal fat enveloping the rectal wall decreases owing to the gradual tapering of the mesorectum41. Tumor segmentation was annotated for all cases (n = 705), with tumors labeled as abnormal mural growth within the rectal lumen, extending outward into the mesorectum41,42. See Table S10 for annotation details.

Rectal anomaly detection

The overall pipeline for detecting rectal anomalies, inspired by previous work43,44, is shown in Fig. 7a. A single inpainting model was trained to generate both the rectum and mesorectum: prostate T2WI with the healthy rectum and mesorectum masked out was used to train the model to reconstruct these regions. The inpainting model was adapted from Han et al.43, an end-to-end MRI sequence generation framework. The framework consists of two stages. In the first stage, only the reconstruction loss is optimized for the encoder and generator. In the second stage, adversarial and cycle-consistency losses are added to the reconstruction loss, and the optimization covers the encoder, generator, and discriminator. Training was based on 2D slices. The inpainting model contains an encoder \(E\) and a decoder \(G\). A masked (rectum and mesorectum) 2D T2WI slice \(X\) is compressed by \(E\) into a latent representation \(z = E(X)\), from which \(G\) reconstructs the original slice. Skip connections were added to recover fine-grained details. To enforce similarity between the generated and actual slices, a supervised reconstruction loss is used:

$$L_{rec} = \lambda_r \left\| X' - X \right\|_1 + \lambda_p L_p\left( X', X \right)$$
(1)

where \(X\) is the original slice and \(X' = G(E(X))\) is the restored slice. \(\|\cdot\|_1\) is the \(L_1\) loss and \(L_p\) is the perceptual loss based on a pre-trained VGG19, which compares high-level features (not just pixel values) of the generated and reference images45. Instead of measuring raw pixel differences, it evaluates how similar the images are in content and style, based on features extracted from different layers of the VGG19 network. \(\lambda_r\) and \(\lambda_p\) are weight factors, set empirically to 10 and 0.01, respectively.
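A minimal PyTorch sketch of this reconstruction loss is given below. It assumes an ImageNet-pretrained VGG19 from torchvision as the perceptual feature extractor; the chosen feature layers and the omission of ImageNet input normalization are illustrative simplifications, not the original implementation.

```python
import torch
import torch.nn.functional as F
from torchvision import models

class PerceptualLoss(torch.nn.Module):
    """L_p: feature-space distance computed with a frozen, pre-trained VGG19."""
    def __init__(self, layer_idx=(3, 8, 17, 26)):  # illustrative ReLU layer indices
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layer_idx = set(layer_idx)

    def forward(self, x, y):
        # Replicate single-channel MRI slices to 3 channels for VGG
        # (ImageNet normalization omitted for brevity).
        x, y = x.repeat(1, 3, 1, 1), y.repeat(1, 3, 1, 1)
        loss = 0.0
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.layer_idx:
                loss = loss + F.l1_loss(x, y)
        return loss

def reconstruction_loss(x_hat, x, perceptual, lam_r=10.0, lam_p=0.01):
    """Eq. (1): weighted sum of the L1 and perceptual losses."""
    return lam_r * F.l1_loss(x_hat, x) + lam_p * perceptual(x_hat, x)
```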

Fig. 7
figure 7

(a) The anatomical inpainting workflow. The inpainting model, containing an encoder E and a decoder G, was trained using prostate T2WI with a healthy rectum and mesorectum. The trained model was then applied across the rectal dataset to generate a reconstructed pseudo-healthy rectum and mesorectum. The difference between the reconstructed and the original image is the anomaly map. The ground truth tumor is shown in red. Colorbar: shows pixel-wise reconstruction error; higher values indicate greater differences and potential anomalies. (b) The workflow connecting the anatomy nnUNet, inpainting, and AAnnUNet/MCnnUNet during inference.

For the second stage of training, the adversarial loss and cycle-consistency loss46 were added on top of the reconstruction loss to ensure that the inpainted images were both realistic and consistent with the original images. The adversarial loss encourages the completed regions to look realistic and blend seamlessly with the surrounding areas, while the cycle-consistency loss preserves the original structure by ensuring that the inpainted image can be accurately reconstructed back to the original.

$$\min_D \max_G \; L_{adv} = \left\| D(X) - 1 \right\|_2 + \left\| D(X') \right\|_2$$
(2)
$$L_{cyc} = \left\| X'' - X \right\|_1$$
(3)

where \(X'' = G(E(X'))\), \(\|\cdot\|_2\) is the \(L_2\) loss, and \(D\) is the discriminator. The anomaly map is then defined as the absolute difference between the original and reconstructed slices,

$$M = \left| X - X' \right|$$
(4)
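The second-stage objectives and the anomaly map can be sketched as follows. This is an illustrative, least-squares-style formulation of Eqs. (2)–(4), assuming hypothetical encoder E, generator G, and discriminator D modules; the actual update rules in the reference framework43 may differ in detail.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(D, x, x_hat):
    """Least-squares-style adversarial terms in the spirit of Eq. (2):
    the discriminator pushes D(X) toward 1 and D(X') toward 0,
    while the generator pushes D(X') toward 1."""
    d_real, d_fake = D(x), D(x_hat.detach())
    d_loss = (d_real - 1).pow(2).mean() + d_fake.pow(2).mean()
    g_loss = (D(x_hat) - 1).pow(2).mean()
    return d_loss, g_loss

def cycle_loss(E, G, x_hat, x):
    """Eq. (3): re-encode and re-decode the inpainted slice X', compare X'' to X."""
    x_cyc = G(E(x_hat))
    return F.l1_loss(x_cyc, x)

def anomaly_map(x, x_hat):
    """Eq. (4): pixel-wise absolute difference between original and inpainted slice."""
    return (x - x_hat).abs()
```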

Let \(I\) be an image with intensity values. The normalization scheme applied before training involved the following steps:

\(l = \mathrm{Percentile}_{0.5}(I)\)

\(h = \mathrm{Percentile}_{99.5}(I)\)

\(I_{norm} = \dfrac{\max(I, l) - l}{h - l}\)

That is, the 0.5th percentile (\(l\)) and the 99.5th percentile (\(h\)) of the intensity values are computed first, and the intensities are then normalized using these percentiles. The rectum was masked with a value of 0, while the mesorectum was masked with a value of 0.5.
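A minimal NumPy sketch of this normalization and masking scheme is shown below; applying the rectum value after the mesorectum value where the masks might overlap is an assumption made for illustration.

```python
import numpy as np

def normalize_and_mask(image, rectum_mask, mesorectum_mask):
    """Percentile normalization followed by ROI masking, as described above.

    image: 2D float array (a T2WI slice); masks: boolean arrays of the same shape.
    """
    l = np.percentile(image, 0.5)    # lower intensity bound
    h = np.percentile(image, 99.5)   # upper intensity bound
    norm = (np.maximum(image, l) - l) / (h - l)

    masked = norm.copy()
    masked[mesorectum_mask] = 0.5    # mesorectum masked with 0.5
    masked[rectum_mask] = 0.0        # rectum masked with 0 (takes precedence if overlapping)
    return norm, masked
```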

To train the inpainting model, 100 prostate T2WI with manually segmented healthy rectum and mesorectum masks were split into 80 for training and 20 for internal validation. The model was externally tested on all slices of 200 prostate T2WI and 705 in-house rectal T2WI. Inference also requires rectum and mesorectum masks; however, only 180 rectal samples have radiologist-annotated rectum and mesorectum masks. To overcome this, the anatomy nnUNet (see "Rectum and mesorectum segmentation") was used to generate the required masks (Fig. 7b).

Rectum and mesorectum segmentation

A 3D nnUNet model, referred to as the anatomy nnUNet, was specifically trained to segment the rectum and mesorectum using Training Cohort 1 with five-fold cross-validation, as illustrated in Fig. 8a. The model was then externally evaluated on the remaining 141 annotated cases from different centers. The anatomy nnUNet was subsequently used to infer rectum and mesorectum masks for all cases; these predictions are referred to as AI-generated rectum and mesorectum masks.

Fig. 8
figure 8

(a) Model definitions. A: nnUNet for tumor segmentation only; B: MTnnUNet (multi-target nnUNet), segmenting tumor, rectum, and mesorectum simultaneously; C: MCnnUNet (multi-channel nnUNet), with rectum and mesorectum masks and T2WI as input; D: AAnnUNet (anomaly-aware nnUNet), with anomaly maps as additional input; E: anatomy nnUNet, segmenting the rectum and mesorectum; F: MTnnUNet trained with AI-generated rectum and mesorectum masks; G: MCnnUNet using AI-generated rectum and mesorectum masks; H: AAnnUNet using AI-generated rectum and mesorectum masks. (b) The training scheme for the different models. 5F-CV: 5-fold cross-validation. Fully-Supervised: training with entirely manual annotations. Mixed-Supervised: training with manually annotated tumors and AI-generated rectum and mesorectum masks.

MTnnUNet, MCnnUNet, and AAnnUNet (Fully Supervised)

We incorporated anomaly maps from the inpainting model into rectal tumor segmentation by adding them as an additional input to nnUNet, referred to as the Anomaly-Aware nnUNet (AAnnUNet). This approach was compared with two alternative strategies for integrating anatomical knowledge: the Multi-Target nnUNet (MTnnUNet), which adds rectum and mesorectum segmentation as auxiliary tasks, and the Multi-Channel nnUNet (MCnnUNet), which uses rectum and mesorectum masks as additional input channels (Fig. 8a). Following the same settings as the benchmark, MTnnUNet, MCnnUNet, and AAnnUNet were trained on Training Cohort 1 using 5-fold cross-validation and externally tested on 666 samples from 9 centers, as illustrated in Fig. 8b. For the inference of MCnnUNet and AAnnUNet, AI-generated rectum and mesorectum masks were used. We also included ensemble results, obtained as the union of the MTnnUNet and AAnnUNet predictions. Results from both the 5-fold internal validation and the external test are presented.
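As a minimal sketch, the input channels of these variants could be assembled as below, assuming co-registered NumPy volumes; the channel ordering is illustrative, since the exact ordering used in the study is not specified here.

```python
import numpy as np

def build_input(t2w, rectum=None, mesorectum=None, anomaly=None):
    """Stack the input channels fed to the nnUNet variants described above.

    nnUNet (A):      [t2w]
    MCnnUNet (C/G):  [t2w, rectum, mesorectum]
    AAnnUNet (D/H):  [t2w, anomaly]
    All inputs are 3D arrays of identical shape; output is (channels, z, y, x).
    """
    channels = [t2w.astype(np.float32)]
    if rectum is not None and mesorectum is not None:   # MCnnUNet-style input
        channels += [rectum.astype(np.float32), mesorectum.astype(np.float32)]
    if anomaly is not None:                             # AAnnUNet-style input
        channels.append(anomaly.astype(np.float32))
    return np.stack(channels, axis=0)
```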

MTnnUNet, MCnnUNet, and AAnnUNet (Mixed Supervised)

Using AI-generated pseudo-anatomical structures, MTnnUNet, MCnnUNet, and AAnnUNet were trained on Training Cohort 2 (141 cases) with 5-fold cross-validation (see Fig. 8). Instead of ground truth rectum and mesorectum masks, AI-generated annotations were combined with the manually labeled tumors. We also included ensemble results from the union of MTnnUNet and AAnnUNet. Results from both the internal validation and the external test (564 cases from eight centers) are presented.

The comparison of nine DL models for rectal tumor segmentation

We established a baseline for rectal tumor segmentation by evaluating the performance of nine 3D deep learning models (see Fig. 9), including UNet47, ResUNet48, UNetR49, SwinUNetR50, AttentionUNet (Atten-UNet)51, MedFormer52, nnFormer18, U-Mamba (bot)53 and nnUNet17.

Fig. 9
figure 9

The rectal tumor segmentation performances of nine 3D DL models were compared. All models were trained with Training Cohort 1, comprising 39 patients from Center 9 only, and externally tested on 666 samples from 9 centers. Manually annotated tumor segmentations by the expert were used for both training and the external test.

  1. UNet47 extends the original U-Net architecture of Ronneberger et al.9 by replacing all 2D operations with their 3D counterparts.

  2. ResUNet48 is a modified version of UNet. It replaces the double convolution layers of UNet with residual blocks from ResNet54, incorporating shortcut connections for faster convergence. This adaptation works in both 2D and 3D settings, enhancing the ability to capture complex patterns.

  3. UNETR49 adopts a ViT-inspired encoder and employs a CNN decoder for 3D image segmentation. The images are initially divided into patches, which are linearly transformed into token embeddings. These tokens are processed by self-attention blocks, akin to ViT. To manage the quadratic complexity of self-attention, the patch size is set relatively large (16) to prevent overly long sequence lengths.

  4. SwinUNETR50 reformulates the segmentation task as a sequence-to-sequence prediction using a Swin Transformer as the encoder. The encoder is connected to a Fully Convolutional Neural Network (FCNN)-based decoder through skip connections.

  5. Attention UNet51 introduces an attention-gating module to UNet to enhance its ability to suppress irrelevant regions and highlight salient features crucial for a given task.

  6. nnFormer18 is a 3D transformer for volumetric medical image segmentation that combines interleaved convolution and self-attention operations. It introduces a special self-attention mechanism to capture both local and global aspects of the image volume. To improve efficiency, it also uses skip attention instead of the usual concatenation or summation operations.

  7. MedFormer52 is a transformer-based architecture designed for scalable 3D medical image segmentation, including three crucial components: a beneficial inductive bias, hierarchical modeling with linear-complexity attention, and multi-scale feature fusion that combines spatial and semantic information globally. MedFormer can learn from both small- and large-scale data without pre-training.

  8. U-Mamba53 is inspired by State Space Sequence Models (SSMs)55, which are known for their ability to handle long sequences. The model is designed specifically for biomedical image segmentation, with a hybrid CNN-SSM block that integrates the local feature extraction of convolutional layers with the ability of SSMs to capture long-range dependencies.

  9. nnUNet17 is a self-configuring framework for medical image segmentation. It uses UNet as its architecture but provides specialized preprocessing, training techniques, and hyper-parameter configuration. nnUNet achieves state-of-the-art performance on several medical image segmentation challenges with a relatively simple architectural design.

  10. Anatomy nnUNet is the nnUNet trained to segment rectal-related anatomical structures, namely the rectum and mesorectum.

  11. MTnnUNet is the nnUNet trained to segment the rectum, mesorectum, and rectal tumor.

  12. MCnnUNet is the nnUNet trained to segment rectal tumors with rectum and mesorectum masks as additional input channels.

  13. AAnnUNet is the nnUNet trained to segment rectal tumors with the anomaly map \(M\) derived from anatomical inpainting as an additional input channel.

All models were trained on Training Cohort 1 (39 cases) with 5-fold cross-validation and subsequently tested externally on the remaining 666 samples from nine centers. Because Training Cohort 1 is relatively small, we repeated the experiments using 132 cases from the same center to evaluate whether increasing the number of training cases would influence the performance of the nine benchmarked models. Results from both the 5-fold internal validation and the external test are presented.

Implementation details

For the training of the inpainting model, the input patch size was (384, 384) with a batch size of 1. We used the AdamW optimizer for both training stages, with \(\beta_1 = 0.9\) and \(\beta_2 = 0.95\), an initial learning rate of 0.0001, a weight decay factor of 0.05, and a polynomial learning-rate decay.
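A minimal PyTorch sketch of this optimizer configuration is shown below; the placeholder module, the decay horizon (total_iters), and the polynomial power are assumptions for illustration only.

```python
import torch

# Placeholder module standing in for the encoder/generator parameters.
model = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,            # initial learning rate
    betas=(0.9, 0.95),  # beta_1, beta_2
    weight_decay=0.05,
)
# Polynomial learning-rate decay; the horizon and power are illustrative.
scheduler = torch.optim.lr_scheduler.PolynomialLR(
    optimizer, total_iters=100_000, power=1.0
)
```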

For tumor segmentation, image preprocessing was adopted from nnUNet, which included z-score normalization of intensities, uniform resampling of all images, and cropping. All segmentation models were trained from randomly initialized weights without transfer learning. The batch size was set to 2, and the models were trained for 1000 epochs with the SGD optimizer. The loss function is the sum of the cross-entropy and Dice losses. During inference, predictions were obtained by averaging the outputs of the five models resulting from the 5-fold cross-validation procedure. All models were implemented in PyTorch (Torch version 2.1.2), and training was conducted on an NVIDIA RTX A6000 GPU.
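A minimal sketch of this compound loss is given below; nnUNet's actual implementation includes further details (e.g., deep supervision and batch-Dice variants) that are omitted here.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target_onehot, eps=1e-5):
    """Soft Dice loss over foreground classes.

    logits, target_onehot: tensors of shape (B, C, ...), target one-hot encoded.
    """
    probs = torch.softmax(logits, dim=1)
    dims = tuple(range(2, logits.ndim))                 # sum over spatial dimensions
    intersection = (probs * target_onehot).sum(dims)
    denominator = probs.sum(dims) + target_onehot.sum(dims)
    dice = (2 * intersection + eps) / (denominator + eps)
    return 1 - dice[:, 1:].mean()                       # skip the background channel

def ce_plus_dice(logits, target_onehot):
    """Training objective described above: cross-entropy plus Dice loss."""
    ce = F.cross_entropy(logits, target_onehot.argmax(dim=1))
    return ce + soft_dice_loss(logits, target_onehot)
```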

Evaluation metrics and statistical analysis

Statistical analysis was conducted in Python (version 3.9) with the SciPy package (version 1.13.1). To assess reconstruction performance, the Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR) were calculated between the reconstructed and original images. To measure segmentation performance, the Dice Similarity Coefficient (DSC) and the 95% Hausdorff Distance (HD) were used for both cross-validation and external testing. To provide a comprehensive comparison of the nine benchmarked models, we report the number of trainable parameters and inference-time floating point operations (FLOPs) for each. Differences in cohort characteristics were compared using the Kruskal-Wallis test. The Mann–Whitney U-test was used to compare metrics among different methods, and model performance differences were assessed with the paired-sample t-test. All statistical analyses were two-sided, and p-values below 0.05 were regarded as statistically significant. 95% confidence intervals were generated using the bootstrap method with 10,000 replications.
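The two segmentation metrics can be implemented directly with NumPy and SciPy; the sketch below uses one common convention for the 95% HD (the 95th percentile of pooled symmetric surface distances, scaled by voxel spacing) and may differ from the exact library routine used in the study.

```python
import numpy as np
from scipy import ndimage
from scipy.spatial import cKDTree

def dice_score(pred, gt):
    """Dice similarity coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    return 2.0 * np.logical_and(pred, gt).sum() / denom if denom else 1.0

def hd95(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """95th-percentile symmetric Hausdorff distance between the surfaces of
    two 3D binary masks, in the physical units of `spacing` (e.g., mm)."""
    def surface(mask):
        eroded = ndimage.binary_erosion(mask)
        return np.argwhere(mask & ~eroded) * np.asarray(spacing)

    sp, sg = surface(pred.astype(bool)), surface(gt.astype(bool))
    if len(sp) == 0 or len(sg) == 0:
        return np.nan
    d_pg, _ = cKDTree(sg).query(sp)   # pred-surface -> gt-surface distances
    d_gp, _ = cKDTree(sp).query(sg)   # gt-surface -> pred-surface distances
    return np.percentile(np.concatenate([d_pg, d_gp]), 95)
```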

Discussion and conclusion

In this study, we successfully developed a rectal anomaly detection model by training an anatomical inpainting model on prostate T2WI with a healthy rectum and mesorectum. The model was then applied to rectal T2WI to generate pseudo-healthy rectal structures, and the anomaly map was defined as the difference between the pseudo-healthy and original slices. The derived anomaly maps were used in the downstream tumor segmentation task and outperformed the baseline nnUNet, MTnnUNet (which jointly predicts tumor, rectum, and mesorectum), and MCnnUNet (which includes rectum and mesorectum masks as additional input channels), in both the fully- and mixed-supervised settings. To automatically generate the rectum and mesorectum masks, we trained an nnUNet to effectively delineate the normal rectum (without tumoral involvement) and the mesorectum.

As part of this research, we benchmarked nine DL models, including CNN-based, transformer-based, and Mamba-based architectures, on a large multicenter dataset. nnUNet achieved the best results on the external test set despite its relatively low model complexity, indicating that increased complexity does not guarantee improved results. By fusing anatomical knowledge into the tumor segmentation model, MTnnUNet, MCnnUNet, and AAnnUNet demonstrated improved performance compared with nnUNet. Research in medical image analysis with AI bears many promises to improve patients' health. However, Varoquaux et al.56 have pointed out that in academia, even though the goal is to solve scientific problems, the emphasis on publication quantity is influenced by Goodhart's law57, which can compromise scientific quality. Researchers, in pursuit of novelty, may introduce unnecessary complexity in methods, contributing to technical debt without substantial improvement in predictions. Isensee et al.58 conducted an extensive benchmark of current segmentation methods across different datasets, and their results revealed a concerning trend: most models introduced in recent years fail to outperform nnUNet. This is consistent with the findings of this study; recently published models with higher complexity did not exhibit higher tumor segmentation performance or better generalization. Although ViTs and SSMs have demonstrated promising results in natural image classification15, in rectal tumor contouring transformer-based and SSM-inspired networks did not outperform CNN-based architectures. One reason could be that transformers rely heavily on large-scale training and remain inferior to CNNs when training data are scarce; unlike natural-image datasets, medical datasets are small, typically in the hundreds or thousands of cases59. Secondly, the quadratic complexity of self-attention poses challenges when dealing with long token sequences50, especially for 3D T2WI.

Beyond architectural optimization, a profound comprehension of medical imaging is integral to advancing tumor segmentation. The rectum and mesorectum provide crucial anatomical context for tumor localization. Several studies60,61 have shown that the mesorectal fat and regions neighboring the tumor contain important prognostic information in rectal cancer patients. There is therefore substantial demand for an accurate and robust rectum and mesorectum segmentation model. While previous studies14,19,20,22 have proposed segmentation networks for these structures, the majority employed 2D models, and an extensive multi-center external test remains highly desirable. In this study, we externally tested the 3D anatomy nnUNet and observed successful rectal structure contouring.

Currently, most rectal tumor segmentation studies rely on retrospective datasets, which do not provide a healthy representation of the rectum. Abdominal imaging, particularly prostate MRI, shares overlapping anatomical structures with rectal MRI, and most prostate MRIs display healthy rectal structures. As a cross-domain application, we utilized a public prostate dataset to train the inpainting model, allowing it to learn the distribution of a healthy rectum and mesorectum. This design ensures that, during inference, any deviation from the learned healthy patterns—such as tumor tissue—results in a higher reconstruction error, which is reflected in the anomaly maps. Compared to simpler fusion strategies like MCnnUNet and MTnnUNet, AAnnUNet demonstrated a stronger anatomical understanding of the rectum and mesorectum, leading to improved tumor localization. MCnnUNet uses rectum and mesorectum masks to guide the model toward relevant anatomy, improving spatial specificity but depending heavily on mask accuracy. MTnnUNet promotes anatomical context learning via multi-task prediction, which can benefit complex cases but may underperform on small tumors. AAnnUNet incorporates anomaly maps to detect tumors—including those missed by other models—and is particularly valuable when manual annotations are limited. These maps also offer diagnostic value to radiologists and hold potential for downstream clinical tasks such as lymph node assessment, treatment response prediction, and staging stratification. However, they may introduce false positives due to reconstruction errors or anatomical variability.

This study has some limitations. First, all T2WI was annotated by a single radiologist; multiple readers would add value to the analysis. Second, this study exclusively included T2W-MRI acquired between 2011 and 2018; some scans exhibited relatively low resolution and large slice thicknesses that would not meet the high-resolution criteria of current protocol recommendations (Table S11). Third, other MRI sequences such as DWI and ADC could potentially improve segmentation performance; Dou et al.62 demonstrated that combining T2WI, ADC, and DWI yielded their best tumor segmentation performance. Fourth, although we used a relatively large and heterogeneous cohort, the data were solely from the Netherlands. Fifth, regarding rectal anatomical structures, the segmentation focused solely on the rectum and mesorectum; in future studies, other anatomical structures such as the lumen could contribute to performance. Lastly, training and testing were performed exclusively on pre-treatment T2WI. The rectal environment in pre-treatment MRI is visually and pathologically distinct from that in post-treatment MRI, and incorporating post-treatment T2WI could enhance tumor characterization and improve downstream analytical workflows.

We proposed an anatomical inpainting model trained on prostate MRI to generate pseudo-healthy rectal images. The resulting anomaly maps, which highlight differences from the original images, strongly aligned with tumor regions. Integrated into a segmentation model (AAnnUNet), they improved accuracy over other anatomy-informed models (MTnnUNet and MCnnUNet). These results show that anomaly maps enhance segmentation and have potential for broader rectal cancer monitoring. Code and annotated prostate masks are publicly available at https://github.com/Liiiii2101/Anatomy-aware-nnUNet-for-Rectal-Tumor-Segmentation.