Introduction

Lung cancer is the leading cause of cancer-related mortality worldwide, and early diagnosis through low-dose computed tomography (CT) significantly improves patient outcomes1. Pulmonary nodules, which are small focal lesions in the lung parenchyma, are often the earliest radiographic sign of lung cancer. Accurate segmentation of these nodules in CT scans is thus critical for early lung cancer screening and intervention. However, automated nodule segmentation remains a challenging task due to several technical factors. First, nodules exhibit high diversity in size, shape, and texture—they can range from tiny ground glass opacities to larger solid masses, and may appear in different anatomical locations, including morphologies that are subtle or obscured by surrounding structures2. Such variability means a segmentation model must capture both fine details and broad context. Additionally, many nodules have low contrast or ambiguous boundaries, making them difficult to distinguish from normal lung anatomy. Small nodules in particular can closely resemble blood vessels or other tissues, so reliable identification often requires contextual reasoning over a larger region3.

Another major challenge is class imbalance and scale disparity: nodules typically occupy only a minute fraction of the CT volume, whereas the vast majority of voxels belong to background lung tissue. This imbalance can bias learning and cause standard models to miss the tiny nodular targets. Moreover, the small physical size of many nodules (often just a few millimeters in diameter) demands extremely precise boundary localization4. In summary, an effective pulmonary nodule segmentation approach must achieve robust multi-scale representation—capturing subtle edge details while also incorporating global contextual information—and must handle the inherent data imbalance and noise in clinical CT scans.

Convolutional neural network (CNN) based architectures (such as U-Net and its variants) have become the de facto standard for medical image segmentation, including lung nodule segmentation, due to their ability to learn rich feature hierarchies. Yet, conventional CNN encoders have limitations in this application. Their intrinsic receptive field is finite, which can hinder the modeling of long-range dependencies needed to discern nodules from surrounding tissue. Consequently, pure CNN models may struggle with very small or highly subtle nodules that require integration of global cues. On the other hand, transformer-based architectures have recently been explored to overcome the locality bias of CNNs by leveraging self-attention for global context capture. Transformers can inherently model long-range relationships without the receptive field constraints of CNNs, but they introduce other drawbacks: their high computational cost and need for large training datasets can be prohibitive, and naive transformer models may over-emphasize global structure at the expense of fine local detail3. In practice, neither pure CNN nor pure transformer solutions alone have fully solved the nodule segmentation problem, especially under limited data and the requirement for both precision and contextual awareness.

Many existing hybrid models employ sequential or late-stage fusion strategies, where features from convolutional and transformer blocks are either alternated or combined only deep within the network.5 This approach can create a “representational clash” by forcing the fusion of disparate feature types that have been learned independently. Such abrupt integration can disrupt the stable propagation of information, leading to optimization difficulties and degraded performance. The central problem is not merely the inclusion of both global and local feature extractors, but the development of a principled framework through which these two complementary representations can harmoniously and continuously inform one another throughout the feature extraction hierarchy.

Recent years have seen a proliferation of deep learning models for lung nodule segmentation, with many efforts devoted to improving the U-Net style encoder-decoder backbone. A central theme is enhancing multi-scale representation and feature fusion to cope with nodules of varying sizes and appearances. For example, Selvadass et al. introduced SAtUNet, which applies a series of atrous (dilated) convolution modules at multiple encoder stages to capture multi-scale context, yielding improved performance on small nodules6,7,8. Other works have proposed deeper and more recursive architectures to enrich feature fusion across scales—one such approach is to employ nested UNet structures with residual connections, as in the channel residual UNet model reported by Wu et al., which better preserves fine details through iterative feature refinement9. In addition to purely convolutional designs, some researchers have explored multi-branch and multi-encoder networks. Xu et al. presented a dual encoding fusion model to handle atypical nodules, wherein two parallel encoders extract complementary features that are later merged for segmentation10. Similarly, incorporating external knowledge or alternate feature streams has shown benefits: Jiang et al.2 developed a dual-branch architecture that integrates a prior feature extraction branch (infusing radiological prior knowledge) alongside a standard image-based branch, significantly improving segmentation of subtle, adherent, or otherwise challenging nodules. These strategies highlight the importance of multi-scale and multi-modal feature learning in advancing pulmonary nodule segmentation beyond the basic U-Net.

The use of attention techniques has become widespread in medical segmentation to focus on critical regions and features. In the context of lung nodules, attention-based modules help the model discriminate nodules from complex backgrounds. Early examples include attention gates that learn to emphasize salient nodule regions while suppressing irrelevant features11. More recent architectures integrate channel-wise and spatial attention to refine feature maps. Yang et al. proposed an uncertainty-guided segmentation model that incorporates a Squeeze and Excitation attention block to adaptively highlight informative feature channels, guided by model uncertainty to target ambiguous nodules12. Ma et al. likewise improved the V-Net architecture by adding a pixel threshold separation module (to enhance features under different intensity ranges) together with 3D channel and spatial attention modules, which encouraged the network to focus on important nodule regions and boundaries4. Beyond these, novel attention blocks such as Coordinate Attention13 have been adopted in nodule segmentation models to better capture long-range dependencies while preserving positional information. By embedding attention mechanisms at various layers (e.g., in skip connections or decoder blocks), these methods boost sensitivity to small or low-contrast nodules and improve the precision of nodule boundaries.

Motivated by the success of Vision Transformers, several works have investigated hybrid encoders that combine CNN and self-attention mechanisms for nodule segmentation. The rationale is to leverage CNNs for local detail extraction and transformers for global context understanding. One notable example is the Swin-UNet model by Ma et al., which fuses a sliding window transformer block (based on the Swin Transformer) into the U-Net framework14. This hybrid design allows the network to capture long-range contextual relationships in the CT while maintaining strong localization ability. Similarly, other studies have inserted transformer modules at specific stages of a convolutional network to enhance its receptive field. For instance, inserting a self-attention block at the encoder bottleneck can propagate global information through the network, improving segmentation in complex cases15. Overall, CNN-Transformer hybrids represent a promising direction, although they must be carefully designed to balance computational cost with segmentation accuracy.

It is important to differentiate this architectural challenge from the concurrent challenge of data scarcity. While architectural design seeks to optimize feature representation (as we address), orthogonal approaches like semi-supervised learning (SSL) aim to improve data efficiency by leveraging unlabeled scans, often through techniques like knowledge distillation16. Such learning paradigms are complementary to model design; an effective architecture should, in principle, perform well in a fully-supervised setting while also serving as a robust backbone for data-efficient training. This study, therefore, focuses on the fundamental architectural problem: establishing a principled framework for global-local feature exchange. We posit that solving this representational bottleneck is a prerequisite for, and will ultimately enhance, future investigations into data-efficient learning strategies.

We propose GLANCE (Continuous Global-Local Exchange with Consensus Fusion), a framework designed to create synergy between global and local features. Our model features parallel global and local branches that continuously exchange information at multiple scales, guided by a consensus fusion mechanism that leverages their complementary strengths. We validate our approach on four public benchmarks (LIDC-IDRI, LNDb, LUNA16, and Tianchi), where GLANCE establishes a new state-of-the-art in both segmentation and detection. The results demonstrate a powerful synergy over single-task baselines, and an ablation study confirms the critical contribution of each architectural component. The following sections detail the related work, methodology, and present the full empirical evidence supporting GLANCE’s superior performance.

Results

This section presents a comprehensive empirical validation of the proposed GLANCE architecture. The quantitative evaluation is designed to rigorously assess two central hypotheses: first, that the multi-task learning framework for joint detection and segmentation yields synergistic performance benefits over specialized single-task models; and second, that the GLANCE model establishes a new state-of-the-art (SOTA) benchmark for both lung nodule segmentation and detection across multiple public datasets. The evidence supporting these claims is detailed in the subsequent subsections, with Table 1 providing a direct comparison against single-task baselines, and Table 2 and Table 3 offering an exhaustive comparison against recent SOTA methods. These results provide definitive validation of GLANCE’s novel design principles, particularly its dual-stream context encoder and continuous cross-scale consensus fusion (CSCF), demonstrating their efficacy in addressing the challenges of automated pulmonary nodule analysis.

Table 1 Detection and segmentation performance of the proposed multi-task model vs. single-task baselines, evaluated on the datasets summarized in Table 8
Table 2 Segmentation performance of recent models on LIDC-IDRI and LNDb
Table 3 Detection performance of recent models on LUNA16 and Tianchi

Overall detection and segmentation performance

To isolate and quantify the benefits of our multi-task learning strategy, we first compared the performance of the unified GLANCE model against two specialized baselines: a Detection-Only model and a Segmentation-Only model. The results, summarized in Table 1, provide unequivocal evidence that joint optimization of both tasks leads to significant improvements in each domain, highlighting a powerful synergistic relationship.

The most profound advantage of the multi-task approach is observed in the detection performance, particularly in the high specificity regime. The Detection-Only baseline achieves a sensitivity of 80.2% at a stringent operating point of 1 false positive per scan (FP/s). In stark contrast, our multi-task GLANCE model reaches a sensitivity of 95.7% at the same 1 FP/s threshold, marking a remarkable absolute improvement of 15.5 percentage points. This substantial gain underscores the model’s enhanced ability to identify true nodules with high confidence. The operating point of 1 FP/s is clinically critical and technically demanding, as it requires the model to assign higher confidence scores to true positive nodules than to nearly all other structures in an entire CT volume. A standard detection model often struggles with ambiguous findings, such as vessel cross sections or pleural attachments, which can be mistaken for nodules. By incorporating a segmentation objective, the shared feature encoder is compelled to learn not only localization cues but also rich morphological features related to nodule shape, texture, and boundary definition. These segmentation-aware features provide the detection head with a more discriminative representation, enabling it to better differentiate true, well-defined nodules from morphologically inconsistent mimics. This reduces ambiguity and boosts the confidence scores for true positives, leading directly to the dramatic increase in sensitivity at low false positive rates. While the improvement at 4 FP/s is more modest (from 92.5% to 94.8%), the overall enhancement is captured by the competition performance metric (CPM), which improves from 90.1 to 92.1, confirming a holistically superior detection capability.

Simultaneously, the segmentation accuracy also benefits from the multi-task framework. The Segmentation-Only baseline achieves a strong Dice coefficient of 95.2%. Our unified GLANCE model surpasses this, attaining a Dice score of 96.4%. Although the absolute increase of 1.2 points is numerically smaller than the gains seen in detection, it represents a significant improvement in a high-performance domain where further gains are notoriously difficult to achieve. This enhancement can be attributed to the detection task acting as a potent spatial regularizer. As detailed in the methodology, the detection head is trained to predict a center heatmap, which explicitly forces the network to learn the concept of a nodule’s centroid. This localization objective imposes a strong spatial constraint on the segmentation output, penalizing masks that are diffuse or deviate significantly from the lesion’s core. Consequently, the model is encouraged to produce more compact, topologically sound segmentations that are accurately centered on the nodule. This “focusing” effect improves adherence to true nodule boundaries and minimizes pixel-level inaccuracies, particularly for lesions with indistinct margins, culminating in a higher Dice score.

SOTA comparison for lung nodule segmentation and detection

To establish the performance of GLANCE relative to the current state-of-the-art, we conducted an extensive comparative analysis against a wide range of recently published models on four standard public benchmarks. The evaluation was performed separately for segmentation (on LIDC-IDRI and LNDb) and detection (on LUNA16 and Tianchi), with the comprehensive results presented in Table 2 and Table 3. Fig. 1 displays our model along with visual segmentation and detection results from recent SOTA models. The findings from this benchmarking exercise consistently position GLANCE as a new leading method in both domains.

Fig. 1: Qualitative comparison of segmentation (top) and detection (bottom).

For each case, we show the original image, the ground truth, our result, and a series of baselines arranged left-to-right in progressively worse visual quality relative to ours ((c) and (k)). Detection baselines: (d) LN-DETR, (e) NDLA, (f) AWEU-Net, (g) CSE-GAN, (h) EHO-Deep CNN. Segmentation baselines: (l) UnetTransCNN, (m) BRAU-Net++, (n) CT-UFormer, (o) UNETR++, (p) Improved V-Net. Our method yields sharper boundaries and fewer false positives/negatives, with competing methods exhibiting increasing boundary erosion, missed lesions, and spurious responses toward the right.

SOTA Segmentation Performance: The segmentation capabilities of GLANCE were benchmarked on the LIDC-IDRI and LNDb datasets, with results detailed in Table 2. On the widely used LIDC-IDRI dataset, GLANCE achieves a Dice similarity coefficient (DSC) of 95.5%, an Intersection over Union (IoU) of 91.4%, a sensitivity of 96.0%, and a 95% Hausdorff Distance (HD95) of 0.7 mm. This performance represents a new state-of-the-art, outperforming all other listed methods across all four evaluation metrics. While the model shows a clear lead in overlap-based metrics over strong competitors like CT-UFormer17 (94.0% DSC) and Wavelet U-Net++18 (93.7% DSC), its most significant advantage lies in the boundary-based HD95 metric. The HD95 score is particularly sensitive to major boundary discrepancies and serves as a crucial indicator of a model’s ability to accurately delineate complex shapes. GLANCE’s HD95 of 0.7 mm constitutes a 30% reduction in maximum boundary error compared to the next best models, which achieve 1.0 mm. This substantial improvement in boundary fidelity is a direct consequence of GLANCE’s core architectural innovations. The Multi-Receptive Grouped Atrous Mixer (MRGAM) is specifically designed to capture fine-scale boundary and texture details, while the Global Context Transformer (GCT) provides the long-range contextual information necessary to distinguish the nodule from adjacent structures like vessels or the pleura. The continuous integration of these complementary feature streams via the cross-scale consensus fusion (CSCF) mechanism ensures that this dual understanding of local detail and global context is maintained throughout the network, resulting in exceptionally precise boundary localization.

To assess the model’s generalization capabilities, we evaluated its performance on the LNDb dataset, which originates from a different clinical environment with distinct scanners and acquisition protocols. As shown in Table 2, many competing models exhibit a discernible drop in performance when transitioning from LIDC-IDRI to LNDb. For example, UnetTransCNN19 sees its DSC fall from 92.6% to 90.7%, and Improved V-Net4 drops from 85.7% to 81.1%. In contrast, GLANCE demonstrates remarkable robustness, with its DSC remaining exceptionally stable at 95.3% (compared to 95.5% on LIDC-IDRI). This consistency indicates that our model has learned more fundamental and invariant representations of pulmonary nodules, rather than overfitting to the specific characteristics of a single dataset. The global context provided by the GCT likely contributes to this robustness by making the model less sensitive to variations in image contrast and noise levels. This proven ability to generalize across diverse data sources is a critical attribute for any model intended for real-world clinical deployment.

SOTA Detection Performance: The detection performance of GLANCE was rigorously evaluated on the LUNA16 and Tianchi datasets, with comparative results presented in Table 3. On the LUNA16 benchmark, GLANCE achieves a precision of 92.3%, a recall of 94.1%, and an F1-score of 93.2. While its F1-score is closely matched by models such as NDLA20 (91.3) and Santone et al.21 (91.3), a deeper analysis of the underlying metrics reveals a superior clinical profile. GLANCE’s recall of 94.1% is the highest among all competing methods, significantly surpassing that of other top-performing models like NDLA (90.7%). In the context of cancer screening, maximizing recall (sensitivity) is paramount to ensure that no potentially malignant lesions are missed. Critically, GLANCE achieves this best-in-class recall without compromising precision, which remains highly competitive. This ability to break the typical trade-off between recall and precision stems from its multi-task design. The concurrently trained segmentation head implicitly acts as a powerful filter; a detection candidate is not only evaluated based on localization signals but also on its ability to be segmented into a morphologically plausible nodule. This helps the model reject anatomically inconsistent false positives, thereby maintaining high precision while identifying more true nodules than any other model.

The superior performance and generalization of GLANCE are further confirmed on the Tianchi dataset, a large-scale dataset created for a competitive challenge. On this benchmark, GLANCE’s dominance is even more pronounced. It achieves a precision of 95.0%, a recall of 93.6%, and an F1-score of 94.3. This F1-score is more than two full points higher than the next best competitor, NDLA20, at 90.8. The widening performance gap on this more heterogeneous dataset reinforces the conclusions drawn from the segmentation analysis on LNDb. GLANCE’s robust architecture, which effectively integrates global anatomical context with fine-grained local analysis, proves to be highly adaptable to variations in data sources and annotation protocols. This decisive victory on a second, distinct detection benchmark provides strong evidence that the model’s high performance is not an artifact of a single dataset but a reflection of a fundamentally more powerful and generalizable approach to nodule detection.

Ablation study

To quantify the contribution of each component in GLANCE, we ablated the Global Context Transformer (GCT), the multi-receptive grouped atrous mixer (MRGAM), and the CSCF on LIDC-IDRI. Table 4 summarizes the results.

Table 4 Ablation study results on LIDC-IDRI

We start from a baseline U-Net without any of the three modules (w/o GCT & w/o MRGAM & w/o CSCF), which reaches 79.0% Dice and 65.3% IoU. Removing the GCT from the full model yields the largest single-module drop, down to 80.0% Dice (−15.2 points vs. 95.2%) and 66.7% IoU, highlighting the importance of long-range context for disambiguating nodules from vessels and airways. Ablating MRGAM reduces performance to 82.5% Dice (−12.7 points) and 70.2% IoU, confirming the role of multi-receptive atrous mixing in capturing fine boundaries and textures. Disabling CSCF (no cross-scale fusion) similarly degrades accuracy to 82.0% Dice (−13.2 points) and 69.5% IoU, showing that continuous cross-scale integration is crucial rather than deferring fusion to the decoder.

CSCF is most effective when fusing complementary signals. With two local streams but no GCT (w/o GCT, w/ CSCF), performance is only 80.4% Dice/67.2% IoU, a marginal +1.4 Dice over the baseline, indicating limited benefit from fusing redundant features. Single-stream variants without CSCF also underperform: MRGAM only (w/o GCT & w/o CSCF) achieves 80.3% Dice/67.1% IoU, and GCT only (w/o MRGAM & w/o CSCF) reaches 81.0%/68.1%; both are better than the baseline but far from the full model. Simply doubling plain encoders (w/o GCT & w/o MRGAM) gives 79.5% Dice/66.0% IoU, showing that gains do not come from model width alone.

Overall, the full GLANCE, integrating global context (GCT) and local detail (MRGAM) through continuous cross-scale fusion (CSCF), achieves 95.2% Dice and 90.8% IoU, substantially outperforming all ablations and evidencing strong component complementarity.

To test whether component synergies generalize beyond LIDC-IDRI segmentation, we replicated the key ablations on detection for LUNA16 and Tianchi (Table 5). Removing GCT, MRGAM, or CSCF reduces F1 from 93.2%/94.3% to 83.7%/84.4% (−9.5/−9.9), 88.0%/86.8% (−5.2/−7.5), and 86.7%/85.3% (−6.5/−9.0), respectively; single-stream controls plateau near 82% F1, confirming that continuous fusion of complementary global and local cues is essential across tasks and datasets.

Table 5 Ablation study results on LUNA16 and Tianchi detection

We further compared branch-weighting mechanisms inside MRGAM (Table 6): our Global+Local softmax gate achieved the best F1 (93.2%/94.3%), outperforming uniform mixing (90.9%/92.5%), Local-SE only (91.3%/93.0%), learned static scalars (91.1%/92.8%), 1 × 1-conv mixing (91.2%/92.9%), and a scaled dot-product attention-based weighting (91.2%/93.0%). These results demonstrate that the claimed GCT-MRGAM-CSCF synergy and our lightweight gating choice are robust on two external detection benchmarks.

Table 6 MRGAM branch-weighting mechanisms on LUNA16 and Tianchi detection

Computational complexity and efficiency

We analyze parameter count, FLOPs, and memory for the main components of GLANCE—the Global Context Transformer (GCT), the Multi-Receptive Grouped Atrous Mixer (MRGAM), and CSCF—within the staged encoder-decoder used in this work (input H × W = 512 × 512, stride-2 downsampling per stage). Architectural details are as introduced in Section 4 and Figs. 3 and 4 (DiSCo, CSCF, PRD).

Let \(X\in {{\mathbb{R}}}^{h\times w\times C}\) be the feature at a given stage, n = hw the number of tokens, and d the token width.

Grouped atrous k × k conv (MRGAM). FLOPs \(\approx 2\,h\,w\,\frac{{k}^{2}{C}_{in}{C}_{out}}{g}\), params \(=\frac{{k}^{2}{C}_{in}{C}_{out}}{g}\). With N parallel dilation branches {di} and context gating (SE), the cost grows linearly in N, plus a negligible SE MLP term of \(\approx \frac{2{C}^{2}}{r}\) params/FLOPs per spatial location (reduction ratio r = 16 used in practice).

Self-attention (GCT). For full attention at resolution (h, w): FLOPs ≈ 4nd² + 2n²d (QKV projections and attention product), plus an MLP term ≈ 2nd². To control cost, we (i) restrict full global attention to mid/low resolutions and (ii) use local windows or token downsampling at high resolutions (Sec. 4.2), yielding O(nM²d) with \(M\ll \sqrt{n}\).

Consensus fusion (CSCF). One 1 × 1 convolution and a channel SE gate: FLOPs ≈ 2hwC² (for the 1 × 1) \(+\,\frac{2hw{C}^{2}}{r}\); params \(\approx {C}^{2}+\frac{2{C}^{2}}{r}\). CSCF is thus O(hwC²) with a small constant, adding <3-5% FLOPs per stage in practice.

Decoder (PRD). Two 3 × 3 convs per level with skip concatenation; standard U-Net-style cost O(hwk²C²), dominated by early decoder stages.

Four design choices keep the dual-stream cost modest: (i) early, high-resolution stages rely on MRGAM (linear in hw) instead of attention; (ii) full global attention is deferred to lower resolutions with fewer tokens; (iii) CSCF uses only 1 × 1 convolutions plus SE gating; and (iv) grouped atrous mixing keeps the k²C² term small at high h, w by using groups g > 1. Together, these yield a dual-stream encoder whose cost is close to a single strong CNN/UNet++ backbone, while retaining global context.

Let FC denote the FLOPs of a single CNN-style stream at a stage, \({F}_{A}^{hi}\) the (windowed) attention FLOPs at high resolution (omitted in GLANCE), and \({F}_{A}^{low}\) the attention FLOPs at lower resolutions (kept). Then a naïve two-stream sum would be \({F}_{C}+{F}_{A}^{hi}+{F}_{A}^{low}\). Our design replaces \({F}_{A}^{hi}\) by an MRGAM mixer with cost \({\widetilde{F}}_{C}\ll {F}_{A}^{hi}\), so the dual-stream ratio is

$$\frac{{F}_{GLANCE}}{{F}_{single}}\,\approx \,\frac{{F}_{C}+{\widetilde{F}}_{C}+{F}_{A}^{low}+{F}_{CSCF}}{{F}_{C}}$$
(1)
$$=1+\underbrace{\frac{{\widetilde{F}}_{C}}{{F}_{C}}}_{{\rm{second}}\,({\rm{local}})\,{\rm{stream}}}+\underbrace{\frac{{F}_{A}^{low}}{{F}_{C}}}_{{\rm{global}}\,{\rm{at}}\,{\rm{low}}\,{\rm{res}}}+\underbrace{\frac{{F}_{CSCF}}{{F}_{C}}}_{{\rm{fusion}}}.$$
(2)

Because \({F}_{A}^{low}\) is computed at \(\frac{1}{4}\)–\(\frac{1}{16}\) of the input resolution and \({F}_{CSCF}\) involves only 1 × 1 convolutions, the empirical overhead remains modest while conferring global context at the right scales.
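To make these formulas concrete, the short calculation below plugs representative numbers into the FLOP estimates above; it is a quick sanity check rather than a profiler measurement. The stage resolution (128 × 128 after a stride-4 embedding of a 512 × 512 slice), channel width C = 64, group size g = 4, and three dilation branches are illustrative assumptions, not the paper's exact configuration.

```python
def conv_flops(h, w, k, c_in, c_out, groups=1, branches=1):
    """Approximate FLOPs of (grouped) k x k convolution branches: 2*h*w*k^2*c_in*c_out/g each."""
    return branches * 2 * h * w * k * k * c_in * c_out / groups

def attention_flops(h, w, d):
    """Approximate FLOPs of full self-attention: 4*n*d^2 (projections) + 2*n^2*d, with n = h*w tokens."""
    n = h * w
    return 4 * n * d * d + 2 * n * n * d

# Hypothetical stage-1 feature map of a 512 x 512 slice after stride-4 embedding.
h = w = 128
c = 64
mrgam = conv_flops(h, w, 3, c, c, groups=4, branches=3)  # dilations {1, 3, 5}
attn = attention_flops(h, w, c)
print(f"MRGAM: {mrgam / 1e9:.2f} GFLOPs, full attention: {attn / 1e9:.2f} GFLOPs")
# The 2*n^2*d attention term dominates at n = 16384 tokens, which is why GLANCE
# defers full global attention to lower-resolution stages.
```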

To validate our choice of λdet in Equation (9), we performed a sensitivity analysis by varying its value within the previously mentioned [0.2, 0.5] range, while holding λseg constant at 1.0. The performance was evaluated on our validation set, with key results summarized in Table 7.

Table 7 Sensitivity analysis for the detection loss weight (λdet)

Discussion

The empirical results presented herein provide compelling evidence for the efficacy of the GLANCE architecture, establishing a new SOTA for both lung nodule segmentation and detection. Our central hypothesis that continuous, multi-scale fusion of complementary global and local features yields synergistic benefits is strongly supported by the ablation study. The full GLANCE model reached 95.2% Dice, whereas the removal of the GCT caused the most significant performance degradation, with the Dice score plummeting by 15.2 points to 80.0%, underscoring the critical role of long-range contextual information in disambiguating nodules from complex surrounding anatomy. Similarly, disabling the CSCF resulted in a sharp drop to 82.0% Dice, confirming that it is not merely the presence of dual feature streams, but their continuous, hierarchical exchange that unlocks the architecture’s full potential.

The model’s superior boundary delineation, evidenced by a state-of-the-art 95% Hausdorff Distance of 0.7 mm on LIDC-IDRI (a 30% reduction in boundary error over the next-best competitor), can be directly attributed to the synergy between the GCT’s contextual awareness and the MRGAM’s proficiency in capturing fine-grained texture.

Furthermore, the multi-task learning framework demonstrated a clear symbiotic relationship between segmentation and detection. The inclusion of a segmentation objective dramatically improved detection sensitivity at a clinically critical threshold of 1 false positive per scan, increasing it from 80.2% to 95.7%. This suggests the segmentation task acts as a potent morphological regularizer, compelling the shared encoder to learn features that help reject anatomically implausible false positives. Conversely, the detection task’s center-heatmap prediction provides a strong spatial prior that refines segmentation boundaries, boosting the Dice score from 95.2% to 96.4%.

This synergy, combined with the model’s demonstrated robustness across datasets with different scanners and annotation protocols (a stable Dice score of 95.5% on LIDC-IDRI and 95.3% on LNDb), underscores the potential of the GLANCE framework as a generalizable and clinically relevant tool for automated pulmonary nodule analysis.

Methods

GLANCE (Continuous Global-Local Exchange with Consensus Fusion), shown in Fig. 2, is a hybrid encoder-decoder architecture engineered for high-fidelity pulmonary nodule segmentation and detection. This architecture is conceived as a direct response to the inherent limitations of prevailing computer vision paradigms. While traditional CNNs excel at extracting rich, localized feature hierarchies, their effectiveness is constrained by a fundamentally local receptive field, which impedes the modeling of long-range spatial dependencies. Conversely, Vision Transformers (ViTs) have demonstrated remarkable capabilities in capturing global context through self-attention mechanisms, yet standard implementations can suffer from a loss of fine-grained local detail and encounter training instabilities, particularly in very deep configurations.

Fig. 2: GLANCE (continuous global-local exchange with consensus fusion) overview.

A dual-stream, encoder-decoder architecture for pixel-precise nodule segmentation and slice-level detection on 2D thoracic CT. GLANCE is built around three original components.

GLANCE is founded upon a set of core principles designed to synergize the strengths of both approaches while mitigating their respective weaknesses. The architectural philosophy is rooted in a modular, problem-driven composition, where each component is selected to address a specific challenge within the nodule segmentation and detection pipeline. The foundational principles are as follows:

Dual-Stream Context Encoder (DiSCo): two co-evolving streams process the image at matched resolutions: (i) a Global Context Transformer (GCT) that models long-range dependencies; (ii) a Multi-Receptive Grouped Atrous Mixer (MRGAM) that captures fine-scale texture and boundaries over multiple receptive fields.

Cross-Scale Consensus Fusion (CSCF): a lightweight integrator that fuses the global and local streams at every scale, ensuring uninterrupted propagation and exchange of global and local evidence throughout the network.

Pyramid Refinement Decoder (PRD): an upsampling pathway with attention-gated skips from CSCF outputs, followed by a dual head that produces (i) a nodule probability map and (ii) a center heatmap for detection.

The dual-stream design with continuous global-local exchange addresses the instability observed when convolutional and transformer features are forced to alternate or to act on mismatched inputs; prior work shows that alternating stacks can disrupt the stable transmission of global and local cues, degrade training, and reduce accuracy. Our fuse-before-each-downsampling strategy explicitly avoids that pitfall by making the two streams mutually conditioning and continuously fused across scales.

Notation and preprocessing

Let a lung CT slice be \(I\in {{\mathbb{R}}}^{H\times W}\). We apply lung windowing ([−1000, 400] HU). This Hounsfield Unit (HU) range is standard in clinical radiology, selected to optimize the visual contrast of lung parenchyma (approx. −700 HU) and soft-tissue nodules (approx. −100 to +100 HU) while clipping signals from dense bone (>400 HU) and external air (<−1000 HU) that are irrelevant to this task. We clamp intensities to this interval and linearly rescale to [0, 1].

To increase the proportion of informative pixels and stabilize learning by removing large, feature-sparse regions, we optionally extract a coarse lung mask and crop to its bounding box. Slices are then resized to a fixed working resolution (H = W = 512). This 512 × 512 resolution is selected as a standard, empirically-validated trade-off: it preserves sufficient spatial detail to resolve small micronodules while remaining computationally tractable for deep learning on modern GPU hardware, balancing fidelity with memory constraints. During training, we sample full slices or patches that include at least one positive nodule mask when available (hard-example mining).
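As an illustration, the following minimal sketch implements this preprocessing for a single slice, assuming the input is a 2D NumPy array of Hounsfield Units; the optional lung-mask crop and positive-patch sampling are omitted, and the function name is ours, not from a released codebase.

```python
import numpy as np
import torch
import torch.nn.functional as F

def preprocess_slice(hu_slice: np.ndarray, size: int = 512) -> torch.Tensor:
    """Lung-window a CT slice to [-1000, 400] HU, rescale to [0, 1], and resize to size x size."""
    lo, hi = -1000.0, 400.0
    x = np.clip(hu_slice.astype(np.float32), lo, hi)  # clamp to the lung window
    x = (x - lo) / (hi - lo)                          # linear rescale to [0, 1]
    t = torch.from_numpy(x)[None, None]               # (1, 1, H, W) for interpolation
    return F.interpolate(t, size=(size, size), mode="bilinear", align_corners=False)
```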

Dual-Stream Context Encoder (DiSCo)

DiSCo shown in Fig. 3 produces scale-aligned feature pyramids \(\{{F}^{(1)},\ldots ,{F}^{(L)}\}\) for both global and local streams. Each stage downsamples by a factor of 2 (spatial) and increases channels.

Fig. 3: Dual Stream Context Encoder (DiSCo).
figure 3

A global transformer path models long-range dependencies while a local convolutional path captures fine structure; their features are aligned across scales and fused through context-aware gating.

Global Context Transformer (GCT): Overlapping patch embedding. We embed I via a shallow convolution E (kernel 4 × 4, stride 4) to produce \({X}^{(1)}\in {{\mathbb{R}}}^{\frac{H}{4}\times \frac{W}{4}\times {C}_{1}}\). Overlapping convolutional embeddings preserve local structure and stabilize early tokenization while serving the transformer with context-rich tokens, a practice shown to be effective in hybrid biomedical segmentation.

Self-attention block: At a generic stage k, flatten X(k) to nk tokens of width dk. Multi-head self-attention (MHSA) is

$$MHA(X)=[{\oplus }_{i=1}^{h}Att(X{W}_{i}^{Q},X{W}_{i}^{K},X{W}_{i}^{V})]{W}^{O},$$
(3)
$$\,Att(Q,K,V)=Softmax(\frac{Q{K}^{\top }}{\sqrt{{d}_{k}}})V.$$
(4)

A position embedding is added at the first transformer stage to retain spatial relations. After MHA, we apply a two-layer MLP with GeLU, with pre-norm and residual connections. To respect computational economy at high resolutions, we restrict full global attention to moderate/low-resolution stages and use local windows or token-downsampling early; replacing early global attention with a specialized convolutional mixer is both effective and efficient when token counts are large. GCT outputs a pyramid \({\{{F}_{g}^{(k)}\}}_{k=1}^{L}\) (global stream).
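For concreteness, a minimal PyTorch sketch of one pre-norm transformer block matching Eqs. (3) and (4) is given below, operating on flattened tokens. The head count and MLP expansion ratio are illustrative assumptions, and the windowed/token-downsampled variants used at high resolution are omitted.

```python
import torch
import torch.nn as nn

class GCTBlock(nn.Module):
    """Pre-norm transformer block as in Eqs. (3)-(4): residual MHSA followed by a residual GeLU MLP."""

    def __init__(self, dim: int, heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) tokens obtained by flattening an h x w feature map.
        t = self.norm1(x)
        x = x + self.attn(t, t, t, need_weights=False)[0]  # first residual connection
        x = x + self.mlp(self.norm2(x))                    # second residual connection
        return x
```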

Multi-Receptive Grouped Atrous Mixer (MRGAM):

Motivation and design: Grouped atrous (dilated) 3 × 3 convolutions at different dilation rates capture multi-scale context while keeping parameters and FLOPs modest. Using multiple grouped atrous branches in parallel has been shown to fuse complementary receptive fields effectively and to reduce computational burden compared with dense convolutions or attention at the same resolution.

Given \({X}^{(k)}\in {{\mathbb{R}}}^{h\times w\times C}\), MRGAM applies N parallel grouped atrous convolutions \({\{{{\mathcal{C}}}_{{d}_{i},g}\}}_{i=1}^{N}\) with dilation \(\{{d}_{i}\}\) and group size g, followed by a residual blend:

$${U}_{i}=\sigma ({{\mathcal{C}}}_{{d}_{i},g}({X}^{(k)})),\,{F}_{\ell }^{(k)}={X}^{(k)}+\mathop{\sum }\limits_{i=1}^{N}{\alpha }_{i}{U}_{i},$$
(5)

where σ is ReLU, and αi are adaptive branch weights computed by a context gate that conditions on both streams (details below). A practical triplet is \(\{{d}_{i}\}=\{1,3,5\}\) with moderate grouping, balancing detail and context.

Complexity. For feature size h × w and kernel k × k, the attention cost grows as \({\mathcal{O}}({(hw)}^{2})\) in token length, whereas grouped atrous mixing costs \({\mathcal{O}}(\frac{hw{k}^{2}{C}^{2}}{g})\) and scales linearly in spatial size; at early stages (large hw), this makes MRGAM a principled mixer.

Context-gated weighting. To let global evidence steer local receptive-field selection, we obtain αi via a squeeze-excite gate:

$$\alpha ={\rm{Softmax}}\left({W}_{2}\,\phi \left({W}_{1}\left[{\rm{GAP}}({F}_{\ell }^{(k)})\parallel {\rm{GAP}}({F}_{g}^{(k)})\right]\right)\right),\quad \alpha \in {{\mathbb{R}}}^{N},\quad \mathop{\sum }\limits_{i}{\alpha }_{i}=1$$
(6)

with global average pooling (GAP), concatenation ∥, an MLP \(({W}_{1},{W}_{2})\), and ϕ = ReLU. Intuitively, when the global stream indicates vessel-like linearity, the gate can upweight larger dilations to better encompass elongated context; for isolated spherical patterns, it prefers d = 1. MRGAM produces \({\{{F}_{\ell }^{(k)}\}}_{k=1}^{L}\) (local stream).
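A compact PyTorch sketch of MRGAM under Eqs. (5) and (6) follows; the group size and reduction ratio are illustrative assumptions (channels must be divisible by the group count), and the MLP gate mirrors the squeeze-excite form above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MRGAM(nn.Module):
    """Parallel grouped atrous 3x3 branches blended by a context gate, as in Eqs. (5)-(6)."""

    def __init__(self, channels: int, dilations=(1, 3, 5), groups: int = 4, reduction: int = 16):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d, groups=groups)
            for d in dilations
        )
        # Gate conditions on GAP of both streams (local X and global F_g), concatenated.
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, len(dilations)),
        )

    def forward(self, x: torch.Tensor, f_g: torch.Tensor) -> torch.Tensor:
        v = torch.cat([x.mean(dim=(2, 3)), f_g.mean(dim=(2, 3))], dim=1)  # (B, 2C)
        alpha = self.gate(v).softmax(dim=1)                               # (B, N) branch weights
        mix = sum(a[:, None, None, None] * F.relu(b(x))
                  for a, b in zip(alpha.unbind(dim=1), self.branches))
        return x + mix                                                    # residual blend, Eq. (5)
```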

Algorithm 1

Global Context Transformer (GCT) Stage

1: procedure GCT-Stage(Fin, k, PE)
2:   Input: Fin (input feature map from the previous stage, or image I for k = 1).
3:   Input: k (current stage index).
4:   Input: PE (positional embedding, used only for k = 1).
5:   Output: \({F}_{g}^{(k)}\) (global context feature map for stage k).
6:   ▷ Overlapping patch embedding and downsampling
7:   X(k) ← ConvEmbedk(Fin)    ▷ e.g., Conv(kernel = 4, stride = 4) for k = 1
8:   if k = 1 then
9:     X(k) ← X(k) + PE    ▷ add positional embedding at the first stage
10:   end if
11:   ▷ Flatten feature map to a sequence of tokens
12:   T ← Flatten(X(k))
13:   ▷ Transformer block with pre-norm
14:   T′ ← LayerNorm(T)
15:   T′ ← MHSA(T′)    ▷ multi-head self-attention as in Eqs. (3) and (4)
16:   T ← T + T′    ▷ first residual connection
17:   T′ ← LayerNorm(T)
18:   T′ ← MLPGeLU(T′)    ▷ two-layer MLP with GeLU activation
19:   T ← T + T′    ▷ second residual connection
20:   ▷ Reshape tokens back to a spatial feature map
21:   \({F}_{g}^{(k)}\) ← ReshapeAs(T, X(k))
22:   return \({F}_{g}^{(k)}\)
23: end procedure

Cross-scale consensus fusion

At each scale k, CSCF shown in Fig. 4 synthesizes the two streams into a single consensus map \({F}_{c}^{(k)}\) and feeds it back as the input to the next stage mixers in both streams. This preserves continuous global-local transmission and avoids the alternate stacking contradiction highlighted in prior analyses of hybrid encoder-decoders.

Fig. 4: Cross-scale consensus fusion and pyramid refinement decoder (CSCF-PRD): integrating global-local features and reconstructing dual outputs.

Global and local features are merged at each scale via CSCF and progressively upsampled through the PRD. The final dual heads produce both a segmentation map and a detection heatmap, enabling precise and interpretable multi-scale nodule analysis.

We implement CSCF as a residual, channel aware fusion:

$$\begin{array}{rcl}{Z}^{(k)} & = & BN([{F}_{g}^{(k)}\parallel {F}_{\ell }^{(k)}])\\ {F}_{c}^{(k)} & = & {\gamma }^{(k)}\odot {Conv}_{1\times 1}({Z}^{(k)})+({F}_{g}^{(k)}+{F}_{\ell }^{(k)})\end{array}$$
(7)

where \({\gamma }^{(k)}\in {{\mathbb{R}}}^{C}\) is a channel gate from a squeeze-excite on Z(k), and ⊙ denotes channel-wise multiplication. The fused \({F}_{c}^{(k)}\) is shared as input to both the next GCT and MRGAM blocks, ensuring that each subsequent block operates on already fused features rather than being forced to extract global (resp. local) information from a purely local (resp. global) input. This parallel-then-fuse pattern matches empirical observations that continuous exchange and multi-scale fusion yield stable training and superior boundary accuracy in biomedical segmentation.
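A minimal PyTorch sketch of Eq. (7) is shown below; the squeeze-excite reduction ratio r = 16 follows the complexity analysis earlier, and the exact normalization placement is our reading of the equation.

```python
import torch
import torch.nn as nn

class CSCF(nn.Module):
    """Residual, channel-gated fusion of the two streams at one scale, as in Eq. (7)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.bn = nn.BatchNorm2d(2 * channels)
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.se = nn.Sequential(                      # squeeze-excite producing the gate gamma
            nn.Linear(2 * channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, f_g: torch.Tensor, f_l: torch.Tensor) -> torch.Tensor:
        z = self.bn(torch.cat([f_g, f_l], dim=1))     # Z = BN([F_g || F_l])
        gamma = self.se(z.mean(dim=(2, 3)))[:, :, None, None]
        return gamma * self.proj(z) + (f_g + f_l)     # gated 1x1 projection + residual sum
```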

Algorithm 2

Multi-Receptive Grouped Atrous Mixer (MRGAM) Stage

1: procedure MRGAM-Stage(X(k), \({F}_{g}^{(k)}\), \({\{{d}_{i}\}}_{i=1}^{N}\), g)
2:   Input: X(k) (input local feature map for stage k).
3:   Input: \({F}_{g}^{(k)}\) (global context feature map from GCT at the same stage k).
4:   Input: \({\{{d}_{i}\}}_{i=1}^{N}\) (set of dilation rates, e.g., {1, 3, 5}).
5:   Input: g (group size for convolutions).
6:   Output: \({F}_{\ell }^{(k)}\) (local context feature map for stage k).
7:   ▷ Context-gated weighting based on both streams
8:   v ← GlobalAveragePooling(X(k))    ▷ squeeze local features
9:   vg ← GlobalAveragePooling(\({F}_{g}^{(k)}\))    ▷ squeeze global features
10:   v ← Concatenate(v, vg)
11:   α ← Softmax(MLP(v))    ▷ compute adaptive branch weights as in Eq. (6)
12:   ▷ Apply parallel grouped atrous convolutions
13:   Fmix ← 0
14:   for i = 1 to N do
15:     Ui ← ReLU(\({{\mathcal{C}}}_{{d}_{i},g}\)(X(k)))    ▷ \({{\mathcal{C}}}_{{d}_{i},g}\) is a grouped conv with dilation di
16:     Fmix ← Fmix + αiUi    ▷ aggregate features with adaptive weights
17:   end for
18:   ▷ Residual blend to produce the final local feature map
19:   \({F}_{\ell }^{(k)}\) ← X(k) + Fmix    ▷ as in Eq. (5)
20:   return \({F}_{\ell }^{(k)}\)
21: end procedure

Pyramid refinement decoder (PRD) and dual heads

Starting from the bottleneck \({F}_{c}^{(L)}\), PRD, shown in Fig. 4 and detailed in Algorithm 3, upsamples by a factor of 2 per stage. At decoder stage k we concatenate the upsampled feature with the encoder’s consensus skip \({F}_{c}^{(k)}\), then refine via a residual block:

$${D}^{(k)}={\mathcal{R}}\left({\rm{Up}}({D}^{(k+1)})\parallel {F}_{c}^{(k)}\right)$$
(8)

where \({\mathcal{R}}\) is two 3 × 3 conv-BN-ReLU units with a residual shortcut. Skips carry both global and local signals (already fused), a design that has been linked to improved contour fidelity in biomedical segmentation.
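The following sketch realizes one decoder stage of Eq. (8); transposed-convolution upsampling and the 1 × 1 shortcut projection are illustrative choices, since the text leaves the exact residual wiring of \({\mathcal{R}}\) unspecified.

```python
import torch
import torch.nn as nn

class PRDStage(nn.Module):
    """One decoder stage of Eq. (8): upsample, concatenate the CSCF skip, refine residually."""

    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.refine = nn.Sequential(  # R: two 3x3 Conv-BN-ReLU units
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.shortcut = nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=1)  # residual projection

    def forward(self, d_prev: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.up(d_prev), skip], dim=1)   # Up(D^(k+1)) || F_c^(k)
        return self.refine(x) + self.shortcut(x)        # residual shortcut around R
```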

Segmentation head. A 1 × 1 convolution maps D(1) to logits \({\mathcal{S}}\in {{\mathbb{R}}}^{H\times W};\widehat{Y}=\sigma ({\mathcal{S}})\in {[0,1]}^{H\times W}\) is the nodule probability.

Detection head. In parallel, a center heatmap head predicts \(\widehat{H}\in {[0,1]}^{H\times W}\) with peaks at nodule centroids. Ground truth heatmaps H are generated by placing a Gaussian of radius proportional to the nodule diameter at each annotated centroid. During inference, non-maximum suppression on \(\widehat{H}\) yields candidate detections, optionally filtered by the segmentation mask to suppress vascular false positives.
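The non-maximum suppression step can be realized with the max-pooling trick commonly used for center heatmaps; the sketch below shows one standard implementation, with the window size and confidence threshold as assumptions rather than reported settings.

```python
import torch
import torch.nn.functional as F

def heatmap_peaks(heat: torch.Tensor, k: int = 3, thresh: float = 0.5):
    """Extract candidate nodule centers from a (1, 1, H, W) heatmap via max-pool NMS.

    A pixel survives if it equals the local maximum in a k x k window and exceeds `thresh`.
    """
    pooled = F.max_pool2d(heat, k, stride=1, padding=k // 2)
    peaks = (heat == pooled) & (heat >= thresh)        # local maxima above threshold
    ys, xs = torch.nonzero(peaks[0, 0], as_tuple=True)
    scores = heat[0, 0, ys, xs]
    return torch.stack([xs, ys], dim=1), scores        # (num_peaks, 2) centers and their scores
```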

Algorithm 3

Pyramid Refinement Decoder (PRD) and Dual Heads

1: procedure PRD-Decode(\({\{{F}_{c}^{(k)}\}}_{k=1}^{L}\))
2:   Input: \({\{{F}_{c}^{(k)}\}}_{k=1}^{L}\) (pyramid of fused encoder features from CSCF).
3:   Output: \(\widehat{Y}\) (nodule probability map), \(\widehat{H}\) (center heatmap).
4:   Define: \({\mathcal{R}}(\cdot )\) (residual refinement block, e.g., two 3 × 3 Conv-BN-ReLU units).
5:   Define: Up(·) (upsampling operation, e.g., transposed conv or bilinear).
6:   Define: σ(·) (sigmoid activation function).
7:   ▷ Initialize decoder bottleneck
8:   D(L) ← \({F}_{c}^{(L)}\)
9:   ▷ Iterative upsampling and refinement loop
10:   for k = L − 1 down to 1 do
11:     Dup ← Up(D(k+1))    ▷ upsample previous decoder stage
12:     Sk ← \({F}_{c}^{(k)}\)    ▷ encoder skip connection at scale k
13:     Dconcat ← Concatenate(Dup, Sk)    ▷ as in Eq. (8)
14:     D(k) ← \({\mathcal{R}}\)(Dconcat)    ▷ refine fused features
15:   end for
16:   ▷ Final dual-head prediction from the highest-resolution D(1)
17:   \({\mathcal{S}}\) ← Conv1×1(D(1))    ▷ segmentation logits
18:   \(\widehat{Y}\) ← σ(\({\mathcal{S}}\))    ▷ nodule probability map
19:   \({{\mathcal{H}}}_{logits}\) ← Conv1×1(D(1))    ▷ detection logits (separate 1 × 1 conv)
20:   \(\widehat{H}\) ← σ(\({{\mathcal{H}}}_{logits}\))    ▷ center heatmap
21:   return \(\widehat{Y}\), \(\widehat{H}\)
22: end procedure

Pulmonary CT-specific adaptations

Small, variable-sized nodules and their frequent adjacency to vessels or pleura put conflicting demands on context and detail. GLANCE addresses these by design:

Scale diversity. MRGAM’s parallel dilations {1, 3, 5} cover micronodules through larger lesions in one operation, with grouped kernels reducing parameter/FLOP growth at high spatial resolutions.

Context disambiguation. GCT supplies long-range cues (lung position, vessel trajectories). Fusing streams before each downsampling preserves this context as features coarsen, combating the global/local alternation issue reported in hybrid stacks.

Boundary precision. Decoder skips bring high-resolution consensus features into PRD, improving boundary accuracy of juxta-pleural nodules without sacrificing robustness.

Implementation and training details

All models were implemented using PyTorch and trained on NVIDIA A100 GPUs. We specify our hyperparameters as follows: we used the AdamW optimizer with an initial learning rate of 1e-4, a weight decay of 0.05, and a batch size of 8. The learning rate was managed by a Cosine Annealing with Warm Restarts schedule with an initial cycle length (T0) of 20 epochs. To enhance model robustness and prevent overfitting, we applied standard data augmentations, including random rotations (±15 degrees), random scaling (±10%), and minor elastic deformations.

Given a ground-truth mask \(Y\in {\{0,1\}}^{H\times W}\) and center heatmap \(H\in {[0,1]}^{H\times W}\), we optimize a compound loss

$${\mathcal{L}}={\lambda }_{seg}({{\mathcal{L}}}_{Dice}(\widehat{Y},Y)+{{\mathcal{L}}}_{BCE}(\widehat{Y},Y))+{\lambda }_{\det }{{\mathcal{L}}}_{focal}(\widehat{H},H),$$
(9)

Based on the sensitivity analysis detailed in Table 7, the weights for all reported results were finalized at λseg = 1.0 and λdet = 0.3.

$${{\mathcal{L}}}_{Dice}(\widehat{Y},Y)=1-\frac{2\sum \widehat{Y}Y+\epsilon }{\sum \widehat{Y}+\sum Y+\epsilon }$$
(10)

Pairing Dice with BCE is a strong baseline in biomedical segmentation under class imbalance and is widely adopted in unified benchmarking.
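A self-contained sketch of the compound loss in Eqs. (9) and (10) follows; the exact focal-loss variant for the center heatmap is not specified in this section, so the sketch assumes the penalty-reduced (CenterNet-style) focal loss commonly paired with Gaussian heatmaps, with illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps: float = 1.0):
    """Soft Dice on probabilities, as in Eq. (10)."""
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def heatmap_focal_loss(pred, target, alpha: float = 2.0, beta: float = 4.0, eps: float = 1e-6):
    """Penalty-reduced focal loss for Gaussian center heatmaps (CenterNet-style, assumed here)."""
    pos = target.eq(1).float()
    loss_pos = -((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    loss_neg = -((1 - target) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps) * (1 - pos)
    n_pos = pos.sum().clamp(min=1)
    return (loss_pos.sum() + loss_neg.sum()) / n_pos

def compound_loss(y_hat, y, h_hat, h, lam_seg: float = 1.0, lam_det: float = 0.3):
    """Eq. (9): lambda_seg * (Dice + BCE) on the mask + lambda_det * focal on the heatmap."""
    seg = dice_loss(y_hat, y) + F.binary_cross_entropy(y_hat, y)
    return lam_seg * seg + lam_det * heatmap_focal_loss(h_hat, h)
```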

Cosine Annealing with Warm Restarts learning rate schedule: To guide the optimization process, a dynamic learning rate schedule known as Cosine Annealing with Warm Restarts is employed. This schedule has two key components:

Cosine annealing. Within a given cycle of epochs (e.g., T0 epochs), the learning rate starts at an initial maximum value and is smoothly decayed to a minimum value (often close to zero) following the shape of a cosine curve. This gradual, non-linear decay allows the optimizer to make large exploratory steps early in the cycle and then slowly fine-tune the parameters as it approaches a minimum in the loss landscape. This smooth descent helps the model converge into broad, stable minima, which are associated with better generalization.

Warm restarts. At the end of each cycle (after T0 epochs), the learning rate is abruptly reset to its initial maximum value. The length of the subsequent cycle, Ti, can be kept the same or progressively increased (Ti = Ti−1 × Tmult). These periodic “restarts” provide a mechanism for the optimizer to escape sharp, suboptimal local minima reached during the annealing phase. The sudden increase in the learning rate propels the optimizer to a different region of the loss landscape, allowing it to continue its search for a more globally optimal solution.

Deep supervision: Auxiliary segmentation heads on intermediate decoder stages are supervised with downsampled masks using a reduced weight (×0.3), accelerating convergence and improving small-object recall.
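This schedule is available directly in PyTorch; a minimal usage sketch with the reported T0 = 20 is shown below, where T_mult = 2 (progressively doubling cycles) is an illustrative choice and the model is a stand-in.

```python
import torch

model = torch.nn.Conv2d(1, 1, 3, padding=1)  # stand-in for the GLANCE network
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
# First cycle of T_0 = 20 epochs; T_mult = 2 doubles each subsequent cycle length.
sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=20, T_mult=2)

for epoch in range(60):
    # ... one training epoch over the data loader ...
    sched.step()  # anneals within a cycle and resets the LR at cycle boundaries
```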

Datasets and annotations

The availability of high-quality annotated chest CT datasets has been crucial for advancing computer-aided diagnosis (CAD) of pulmonary nodules. In particular, three public datasets have become widely used benchmarks for lung nodule detection and analysis: the LIDC-IDRI dataset22, the LNDb dataset23, and the Tianchi lung nodule dataset24. Each of these datasets provides a different scope and annotation protocol, offering complementary resources for developing and validating nodule detection algorithms. We provide here a description of each dataset and a comparative summary of their characteristics (Table 8).

Table 8 Summary of three public lung nodule CT datasets

LIDC-IDRI dataset: The Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) dataset is a landmark collection of thoracic CT scans with comprehensive lesion annotations22. It contains 1018 CT scans from ~1010 patients, acquired from a multi-institutional effort primarily in the United States. Each scan was independently reviewed by four experienced radiologists in a two-round reading process: first, each radiologist marked lesions in a blinded fashion; then, in a second unblinded round, they reviewed each other’s marks to finalize their own annotations (without forced consensus). All pulmonary nodules ≥3 mm in diameter were annotated with detailed segmentations and were given subjective ratings on nine characteristics (such as subtlety, spiculation, malignancy likelihood). Nodules smaller than 3 mm were also noted (as “non-measured” nodules) but not segmented or characterized in detail. The LIDC-IDRI provides a rich reference standard with multi-observer annotations, encompassing over 2600 distinct nodules in total across the dataset.

LNDb dataset: The Lung Nodule Database (LNDb) is a more recent dataset designed to complement LIDC-IDRI by providing an external validation cohort with a focus on radiologist variability and local clinical practice23. LNDb consists of 294 chest CT scans collected from a single hospital (CHUSJ in Porto, Portugal) between 2016 and 2018. The inclusion criteria and annotation protocol for LNDb closely followed those of LIDC-IDRI: radiologists annotated all nodules ≥3 mm with manual segmentations and recorded the same set of nine nodule attributes used in LIDC-IDRI. They also marked smaller nodules (<3 mm) and certain non-nodule radiographic findings that could be mistaken for nodules, providing a more exhaustive labeling of potential findings in each scan. Five radiologists (each with at least 4 years of experience) participated in the LNDb annotations. Each scan was read by at least one radiologist, and a subset of scans received multiple independent readings (90 scans were annotated by three different radiologists, 145 scans by two radiologists, and the remaining by a single reader). This yielded a total of 1429 annotated findings (including all nodules and non-nodule lesions) in the database, reflecting, on average, a few findings per scan. Notably, LNDb also includes eye-tracking data recorded during the radiologists’ reading sessions, which can facilitate studies of observer attention and human-AI interaction, although these gaze data are supplementary to the core CT images and annotations.

Tianchi dataset: The Tianchi lung nodule dataset (named after the Alibaba Tianchi competition platform) was released as part of an international challenge for automated pulmonary nodule detection. It comprises 1000 low-dose chest CT scans (with 600 used for training and 400 for testing in the original challenge) with a total of 1244 labeled nodules24. For each nodule, the dataset provides the lesion’s center coordinates and diameter (approximate size) as the ground truth annotation. No segmentation masks or radiologist subjective ratings are included in this dataset; the annotations are limited to nodule locations and sizes, reflecting its primary use for detection algorithms. The annotated nodules range from ~3 mm up to 30 mm in diameter (with only one nodule slightly below 3 mm), which is a size range comparable to the inclusion criteria of LIDC-IDRI and LNDb. Because the Tianchi data were prepared for competitive benchmarking, the available labels are confined to the training portion (with the test set labels withheld for challenge evaluation), and the emphasis is on providing a large quantity of scans for algorithm development. This dataset offers a valuable independent testbed from a different population (Chinese screening cases) but lacks the multi-rater and richly characterized annotations present in LIDC-IDRI and LNDb.

Comparative summary: Table 8 summarizes key attributes of the LIDC-IDRI, LNDb, and Tianchi datasets. LIDC-IDRI remains the largest public collection with detailed, multi-radiologist annotations, making it a standard reference for training and internal validation of nodule analysis methods. LNDb is smaller in scale but provides a highly curated set of scans from a different clinical setting, useful for external validation and for studying inter-observer variability (given its multiple readings and eye tracking information). The Tianchi dataset approaches LIDC-IDRI in scale and is valuable for testing generalization of algorithms on an external cohort; however, its annotations are limited to nodule locations (without segmentations or radiologist feature ratings). In combination, these three datasets provide complementary resources: researchers can develop algorithms on the extensive and richly annotated LIDC-IDRI set, validate robustness and calibration using LNDb (which simulates a different clinical population and reading process), and evaluate detection performance on a large held-out set via the Tianchi challenge data.

Ethics approval and consent to participate

Not applicable. This study relied exclusively on fully de-identified, publicly available datasets (LIDC-IDRI, LNDb, Tianchi, LUNA16) obtained under their respective data-use policies. No interaction with human participants or access to identifiable private information occurred; therefore, institutional review board (IRB) approval and informed consent were not required.

Consent for publication

Not applicable. This work exclusively utilizes de-identified datasets available from public repositories.

Materials availability

Not applicable. No new biological specimens, cell lines, or other unique materials were generated or used in this work beyond publicly available datasets.