Fig. 7: Overview of the proposed framework.

The input 3D CT volume is processed by a hierarchical Swin Transformer backbone, yielding multi-scale feature maps (\(F_{Low}\), \(F_{Mid}\), \(F_{High}\)). These features feed a dual-branch decoder. The Structural Core Predictor uses the high-level features \(F_{High}\) to predict a core probability map \(P_{Core}\), from which a structural Core Anchor \(S\) is derived. The Context-Aware Shape Decoder (CAS-Decoder) takes the anchor \(S\) and all three feature maps as input and passes them through a lightweight CNN, the Fast Marching Method (FMM), an Attention Block, and an MLP to generate the final predicted mask. During training only, the Feature Manifold Regularization module provides an additional supervisory signal, \(\mathcal{L}_{Reg}\), to structure the feature space.
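
The data flow in the figure can be traced with a minimal shape-level sketch. It is not the proposed implementation: every function below is a hypothetical stand-in (average pooling replaces the Swin Transformer backbone, a sigmoid replaces the Structural Core Predictor, and a gated feature sum replaces the CNN, FMM, Attention Block, and MLP pipeline). The anchor is assumed here to be a thresholded version of the core probability map, and the threshold of 0.5 is likewise an assumption. The regularization term is omitted because it acts only during training.

```python
import numpy as np

def avg_pool3d(x, k):
    """Downsample a (D, H, W) volume by averaging non-overlapping k*k*k blocks."""
    d, h, w = (s // k for s in x.shape)
    return x[:d * k, :h * k, :w * k].reshape(d, k, h, k, w, k).mean(axis=(1, 3, 5))

def upsample_nearest(x, k):
    """Nearest-neighbour upsampling by an integer factor k along each axis."""
    return x.repeat(k, axis=0).repeat(k, axis=1).repeat(k, axis=2)

def backbone_stub(volume):
    # Hypothetical stand-in for the hierarchical backbone: three scales only.
    return avg_pool3d(volume, 2), avg_pool3d(volume, 4), avg_pool3d(volume, 8)

def core_predictor_stub(f_high):
    # Hypothetical stand-in for the Structural Core Predictor:
    # sigmoid of the standardized high-level features.
    z = (f_high - f_high.mean()) / (f_high.std() + 1e-8)
    return 1.0 / (1.0 + np.exp(-z))

def derive_anchor(p_core, threshold=0.5):
    # Assumption: the anchor S is the set of voxels whose core
    # probability exceeds a fixed threshold.
    return p_core > threshold

def cas_decoder_stub(anchor, f_low, f_mid, f_high):
    # Hypothetical stand-in for the CAS-Decoder: fuse all scales at the
    # resolution of f_low, weight the result by the upsampled anchor,
    # then binarize and return to input resolution.
    fused = f_low + upsample_nearest(f_mid, 2) + upsample_nearest(f_high, 4)
    gate = upsample_nearest(anchor.astype(float), 4)
    score = fused * (0.5 + 0.5 * gate)
    return upsample_nearest(score > score.mean(), 2)

# Synthetic volume standing in for a CT scan.
volume = np.random.default_rng(0).normal(size=(32, 32, 32))

f_low, f_mid, f_high = backbone_stub(volume)            # multi-scale features
p_core = core_predictor_stub(f_high)                    # core probability map
anchor = derive_anchor(p_core)                          # structural anchor S
mask = cas_decoder_stub(anchor, f_low, f_mid, f_high)   # final predicted mask

print(f_low.shape, f_mid.shape, f_high.shape, mask.shape)
```

The sketch preserves only the dependency structure of the figure: the anchor depends solely on the high-level features, whereas the decoder consumes the anchor together with all three scales.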