Introduction

Lung cancer is the leading cause of cancer-related death worldwide, and non-small cell lung cancer (NSCLC) is one of the most common types1. Among NSCLC patients, lung adenocarcinoma is a frequently observed histological subtype. Visceral pleural invasion (VPI), defined as tumor penetration beyond the elastic layer of the visceral pleura (levels Pl1 and Pl2), is critical in tumor-node-metastasis (TNM) staging and treatment planning. Specifically, VPI escalates the TNM T stage to T2a when the tumor's solid component measures up to 30 mm, significantly affecting treatment strategies2,3,4. Survival outcomes vary distinctly across stages; for instance, the estimated 5-year overall survival rate for clinical stage IA lung cancer is 82%, whereas it drops to 69% for stage IB (T2aN0M0) disease2. A secondary analysis of a randomized clinical trial indicated that patients with small NSCLC (≤20 mm) exhibiting VPI experienced poorer disease-free and recurrence-free survival, as well as a higher incidence of local and distant disease recurrence5. Additionally, VPI was associated with nodal involvement and skip N2 metastases in small NSCLC, complicating disease management3,4. Thus, precise preoperative assessment of VPI is essential for determining the optimal surgical approach, particularly regarding the timing of surgery and the extent of lymphadenectomy.

However, accurately diagnosing VPI preoperatively poses a significant challenge, particularly in cases of small NSCLC (solid component ≤ 30 mm). In recent years, researchers have investigated specific computed tomography (CT) findings to predict VPI, yielding mixed outcomes6,7,8,9. Sun Q et al.6 initially identified a CT manifestation known as the jellyfish sign, which reliably predicted VPI, showing an odds ratio (OR) of 21.6 (P < 0.001). Onoda H et al.7 examined tumors that do not appear to touch the pleural surface and introduced the bridge tag sign to potentially enhance VPI prediction accuracy. Yang S et al.8 first proposed a parameter called the pleural indentation fraction (PIF), defined as the ratio of the pleural shift distance to the projected length of the involved pleura, to quantify pleural shifts. However, the validity of these CT features as substitutes for clinical T2 descriptors remains controversial and warrants further confirmation. These limitations underscore the necessity for more accurate and thorough methods for VPI prediction.

In the realm of medical imaging analysis, artificial intelligence (AI) has shown considerable promise, integrating techniques such as radiomics and deep learning (DL). Radiomics involves extracting quantitative attributes from images, capturing minute tissue details that might escape visual detection10. Nonetheless, the utility of radiomics is somewhat limited by its dependence on predefined algorithms for feature extraction.

Conversely, DL technology enables in-depth exploration of high-dimensional quantification of radiological images and automatically learns hierarchical features from raw image data. DL is proficient in capturing complex patterns and has demonstrated superior performance in various medical imaging applications11,12. Although several studies have applied DL to predict VPI status, the reported area under the receiver operating characteristic curve (AUC) values were only comparable to those of radiologists' evaluations or clinical models13,14,15. The performance of these DL models is therefore not yet satisfactory, and their clinical benefit needs further assessment11.

Recognizing the complementary strengths of these methodologies, recent studies have investigated the integration of radiomics and DL features16,17. This synergy promises to exploit both the interpretable, biologically relevant features of radiomics and the high-level abstract features acquired by deep neural networks. In the realm of VPI status, such an integrated approach could provide a more comprehensive characterization of tumor properties, potentially enhancing model performance and robustness.

Therefore, our study aimed to utilize the complementary nature of handcrafted radiomics and DL by developing a multi-feature integrated imaging fusion (MIIF) model, which integrates general CT findings, radiomics features, and deep imaging features. This model is designed for risk assessment of VPI status in patients with small NSCLC (solid component ≤ 30 mm) on preoperative CT images. Furthermore, we evaluated the model’s utility in clinical practice using a paired design to compare the diagnostic performance of radiologists, both without and with the assistance of our proposed model. We also investigated the CT findings for predicting VPI.

Results

Basic characteristics of patients

A total of 2698 patients with 2822 pathologically confirmed NSCLCs were enrolled in this study (Fig. 1). The characteristics of the patients are detailed in Table 1. VPI was present in 20.61% (408/1980) of lesions in the training set and 22.64% (91/402) in the validation set. In the internal and external test sets, VPI was observed in 9.75% (27/277) and 15.34% (25/163) of lesions, respectively.

Fig. 1

Patient inclusion and exclusion criteria. NSCLC non-small cell lung cancer, VPI visceral pleural invasion.

Table 1 Baseline characteristics of the four data sets

Diagnostic performance of the proposed model

The diagnostic performances of the DL and MIIF models for predicting VPI are shown in Table 2 and Supplementary Table 1. The MIIF model incorporated 42 significant features identified by the least absolute shrinkage and selection operator (LASSO) algorithm: 29 deep imaging features, 9 radiomics features, and 4 CT findings (Fig. 2). The model demonstrated improved performance in the validation, internal, and external test sets. It yielded an AUC of 0.978 (95% CI: 0.973–0.984) and an accuracy of 0.922 in the training set, along with an AUC of 0.864 (95% CI: 0.828–0.899) and an accuracy of 0.821 in the validation set. In the internal and external test sets, the MIIF model achieved AUCs of 0.869 (95% CI: 0.817–0.921) and 0.785 (95% CI: 0.703–0.867), with accuracies of 0.812 and 0.798, respectively, significantly surpassing the DL model, which showed AUCs of 0.794 (95% CI: 0.722–0.865) and 0.679 (95% CI: 0.575–0.782) and accuracies of 0.690 and 0.644, respectively (P < 0.001).

Fig. 2

Bar plot of significant features associated with VPI status.

Table 2 Diagnostic performance of the established deep learning (DL) model and multimodal integrated imaging fusion (MIIF) model

In the internal and external test sets, the MIIF model achieved smaller Brier scores of 0.122 and 0.154, respectively, compared with the DL model (0.191 and 0.226). The decision curve analysis (DCA) curves (bottom row of Supplementary Fig. 1) demonstrate that both models provide a positive net benefit across a wide range of threshold probabilities (≈0.0–0.2) in all test sets, including the low-prevalence internal test set (prevalence = 9.75%). The results of calibration curve analyses and DCA across all datasets are shown in Supplementary Fig. 1.

Diagnostic performance using CT alone and CT with MIIF model assistance

The diagnostic performance of six radiologists, with and without MIIF model assistance, is detailed in Table 3. The MIIF model achieved higher AUCs than each individual radiologist in both the internal and external test sets (radiologists: 0.767–0.839 and 0.656–0.724, respectively). The average AUC of all radiologists was slightly higher than that of the MIIF model in the internal test set (0.879 vs. 0.869) and slightly lower in the external test set (0.739 vs. 0.785), without statistically significant differences (P > 0.05). The MIIF model generally displayed better accuracy (P = 0.006 and 0.001) and specificity (all P < 0.001) in both test sets, particularly compared with junior radiologists (P = 0.010 and 0.003; all P = 0.002), although sensitivity did not differ significantly (P > 0.05).

Table 3 Comparisons of diagnostic capability values of each metric among six observers without (w/o) and with (w/) MIIF model’s assistance

In the internal test set, the average AUC of radiologists improved from 0.879 without MIIF assistance to 0.921 with MIIF assistance (P = 0.073). Similarly, in the external test set, the average AUC increased from 0.739 to 0.828 (P = 0.003). These improvements were particularly notable for junior radiologists, whose average AUC in the internal test set improved from 0.853 to 0.904 (P = 0.036), and in the external test set from 0.731 to 0.824 (P < 0.001). Senior radiologists also showed improvements (P = 0.105, 0.008). Each radiologist demonstrated higher AUCs with MIIF model assistance (P < 0.05) (Fig. 3).

Fig. 3

The receiver operating characteristic (ROC) curves of our proposed models in the training, validation, internal, and external test sets.

The MIIF model significantly enhanced radiologists’ accuracy and specificity across both test sets. In the internal test set, average accuracy improved from 0.736 to 0.845 (P < 0.001), and specificity increased from 0.720 to 0.836 (P < 0.001). Similar trends were observed in the external test set, where accuracy increased from 0.663 to 0.828 (P < 0.001), and specificity improved from 0.674 to 0.841 (P < 0.001). Sensitivity showed modest improvements, though these were not statistically significant (P > 0.05).

Associations of specific CT findings with VPI

Specific CT findings associated with VPI are summarized in Table 4. In the internal and external test sets, there were 81 ground-glass nodules (GGNs), 246 part-solid nodules (PSNs), and 113 solid nodules (SNs), with VPI present in 11.82% (52/440) of these nodules. No GGNs exhibited VPI, while VPI was observed in 6.91% of PSNs (17/246) and 30.97% of SNs (35/113) (Supplementary Table 2).

Table 4 Comparisons of CT findings between nodules with and without VPI in the internal and external test sets

Among the PSNs and SNs, there were 160 pleural-attached nodules (24 [15.00%] VPI positive), 194 pleural-tag nodules (23 [11.86%] VPI positive), and 5 nodules pushed against the pleura (all VPI positive) (Fig. 4).

Fig. 4: Schematic illustration of pleura-associated nodules on CT.

Representative images of a pleural-attached nodule (A, B), a pleural-tag nodule (C, D), and a nodule pushed against the pleura (E, F). Different slices of the same nodule on CT show a right lower lobe pleural-attached nodule with pleural indentation (A, B); the nodule surface directly touches the pleural surface. Axial CT (C, D) shows a right upper lobe pleural-tag nodule; one or more linear tags connect the nodule and the pleura, but the nodule does not directly touch the pleura. Axial CT (E) and coronal CT (F) show a left lower lobe nodule pushed against the interlobar pleura.

Univariable analysis identified features that differed significantly according to VPI status. Multivariable logistic regression identified nodule type (OR, 4.86; P = 0.036) and solid component mean diameter (cut-off value: 15 mm; OR, 4.67; P = 0.024) of pleural-attached nodules, as well as solid component mean diameter (cut-off value: 13 mm; OR, 3.94; P = 0.026) and PIF (cut-off value: 0.405; OR, 23.32; P = 0.003) of pleural-tag nodules, as independent predictors of VPI; the jellyfish and bridge tag signs were not independent predictors (Supplementary Table 3).

Discussion

In our study, we developed and validated CT-based DL and MIIF models for preoperative VPI prediction in NSCLC with a solid component of 30 mm or smaller. The diagnostic performance of our proposed model was comparable to that of radiologists, with significantly higher accuracy and specificity. The high clinical utility of the model was demonstrated through a paired design, which showed improved radiologist performance with MIIF assistance. Additionally, we categorized subpleural NSCLCs into three groups and identified nodule type, solid component mean diameter, and PIF value as independent predictors of VPI.

These findings have significant clinical implications, especially in guiding treatment decisions. For instance, T1-sized tumors with VPI (stage IB) require more extensive lymph node (LN) dissection than those without VPI (stage IA)18. The MIIF model’s capacity to enhance diagnostic accuracy, particularly for junior radiologists, can standardize VPI assessments and guide surgical planning. In cases where VPI is highly suspected, prompt surgical intervention is crucial. Integrating the MIIF model into multidisciplinary discussions could improve treatment strategies, especially for small subpleural NSCLC, where accurate staging is vital for optimizing patient outcomes19,20.

Previous studies have assessed the role of DL in predicting VPI. Choi H et al.14 developed a DL model that matched radiologist-level performance, achieving an AUC of 0.75, with the advantage of adjustable sensitivity and specificity to meet clinical needs. Similarly, Lim WH et al.15 developed a DL model for VPI prediction with an AUC of 0.79, comparable to the pooled AUC of 0.78 reported for radiologists. While these studies highlight the potential of DL in VPI prediction, they also underscore the limitations of current models, indicating the necessity for more advanced approaches.

In this study, our multi-feature integrated approach significantly enhanced VPI prediction in small NSCLC, achieving excellent performance in the internal test set (AUC = 0.869) and acceptable performance in the external test set (AUC = 0.785). This improvement can be attributed to several key factors in our model construction: 1) the attention mechanism in our DL model focuses on relevant image regions, emulating radiologist assessments; 2) the integration of radiomics features and automatically obtained general CT findings offers a comprehensive depiction of tumor characteristics, bridging computational analysis and radiological expertise; and 3) the fusion of multiple features capitalizes on the strengths of each modality, allowing more refined predictions. This integrated approach not only improves model performance but also enhances its generalizability and clinical interpretability, making it a valuable tool for VPI prediction in small NSCLC.

The incorporation of the MIIF model into radiologists’ workflow demonstrated its potential to significantly enhance diagnostic performance. In the internal test cohort, the mean AUC increased from 0.879 to 0.921 (P = 0.073), and accuracy improved from 0.736 to 0.845 (P < 0.001). These improvements were consistent across multiple observers, demonstrating the model’s ability to standardize diagnostic evaluations and reduce variability.

The radiologists in our study demonstrated unbalanced diagnostic performance. For example, a senior radiologist demonstrated high specificity (0.912) but low sensitivity (0.704) in VPI assessment, while junior radiologists tended to show either high sensitivity with low specificity or vice versa14,15. This variability suggests that inexperience may lead to either overestimation or underestimation of VPI risk on preoperative CT. Accurate identification of VPI in small NSCLC remains complex and challenging in current clinical practice, even with the advent of specific CT findings, underscoring the task's complexity and the need for more reliable diagnostic tools.

The MIIF model effectively addressed these challenges, particularly benefiting junior radiologists. Almost all radiologists exhibited improved accuracies and specificities with the assistance of the MIIF model. Notably, the model also enhanced diagnostic sensitivity for a senior radiologist who initially had low sensitivity (0.704), potentially assisting surgeons in determining the optimal extent of LN dissection. These findings emphasize the high clinical utility of AI-assisted tools in enhancing diagnostic precision and reducing inter-observer variability.

Our analysis of specific CT findings revealed that SNs exhibited a higher rate of VPI (30.97%) compared with PSNs (6.91%), while no VPI was observed in GGNs. These findings align with previous studies6,19. However, a few studies have reported VPI in GGNs at rates ranging from 4.7% to 17.4%21,22, which remains controversial in clinical practice. In our study, NSCLC with VPI was larger than that without VPI; the solid component size served as an independent predictor of VPI, with ORs of 4.67 and 3.94 for pleural-attached and pleural-tag nodules, respectively. The solid component, representing the invasive portion of the tumor2, is closely associated with increased malignancy risk23,24, which may explain this result. NSCLCs pushed against the interlobar pleura in our study suggested tumor penetration through this pleura, with all cases showing VPI, consistent with prior research8. These predictors can be readily assessed on preoperative CT scans, serving as valuable tools for clinical decision-making.

Pleural-attached nodules are in direct contact with the pleura, while pleural-tag nodules have one or more linear tag connections6,7,9,19,25. Our study showed that the nodule (solid component)-pleura attachment distance was significantly greater in tumors with VPI than in those without. Sun Q et al.6 named the multiple linear septations in pleural-attached nodules the 'jellyfish sign', which our study confirmed can potentially identify NSCLCs with VPI.

In pleural-tag nodules, those with VPI frequently exhibited pleural indentation (PI) (95.65% vs. 76.61%). The PIF value was previously defined to measure the degree of pleural indentation caused by the nodule8. In this larger study, the PIF value was identified as a significant independent predictor of VPI (OR, 23.32; P = 0.003). It is hypothesized that a higher PIF value may indicate a greater degree of intratumoral fibrosis, which could be associated with tumor invasiveness7,9. These specific CT findings can be utilized in daily clinical practice and have the potential to predict VPI; moreover, with the aid of our proposed MIIF model, diagnostic performance improved further.

Several limitations were noted in our study. First, the MIIF model demonstrated slightly inferior performance on the internal and external test sets compared with the training and validation sets. The lower incidence of VPI in clinical practice resulted in a small proportion of positive VPI cases in the training and validation sets, potentially affecting the robustness of our model. Second, the retrospective collection of data might introduce selection bias. Despite this, the datasets were compiled from a substantial patient cohort, encompassing internal and external test sets. We plan to gather more data and integrate specific CT findings for further iteration and prospective validation of the MIIF model. Furthermore, cross-center variability in imaging protocols (e.g., reconstruction kernels and slice thickness) likely contributed to residual domain shift, as reflected by the lower external AUC. Although we standardized preprocessing (isotropic resampling and intensity normalization), we did not apply explicit domain adaptation or harmonization in this study. Domain-adaptation and standardization methods, such as feature-space harmonization (e.g., ComBat) and image-level translation (e.g., CycleGAN), are promising for reducing scanner- and protocol-related shifts26,27. In future work, we will systematically investigate and rigorously validate these methods, together with site-specific calibration, to improve robustness and narrow the performance gap in external cohorts.

In conclusion, the CT-based MIIF model developed for identifying VPI in NSCLC (with a solid component of 30 mm or smaller) outperformed the diagnostic accuracy of radiologists, particularly junior radiologists, and its high clinical utility could enhance their diagnostic efficacy. Specific CT findings, including SN type, nodule solid component mean diameter, and PIF value, were identified as predictors of VPI and represent important features in the MIIF model.

Methods

This retrospective study received approval from the institutional review boards of all participating hospitals. The necessity for written informed consent was waived, as data were analyzed retrospectively and anonymously. Clinical and pathological data were reviewed from medical records, and CT images were obtained from the picture archiving and communication system (PACS). Figure 5 illustrates a schematic drawing of the overall study design.

Fig. 5

Schematic drawing of the overall study design.

Study patients

The main dataset was collected from patients with small NSCLCs at six centers over earlier time windows for training and validating the proposed model: Center 1 (Shanghai Zhongshan Hospital, 2015.2–2018.11), Center 2 (Shanghai Public Health Clinical Center, 2015.6–2020.6), Center 3 (Shanghai Sixth People’s Hospital, 2011.12–2019.10), Center 4 (Shanghai Ruijin Hospital, 2015.2–2021.8), Center 5 (Wuhan Union Hospital, 2014.9–2018.12) and Center 6 (Shanghai Xuhui District Central Hospital, 2017.8–2021.7).

A detailed flowchart of patient inclusion and exclusion is shown in Fig. 1. Patients were randomly assigned to the training and validation sets in an 8:2 ratio at the patient level.

Consecutive patients who underwent surgery for small NSCLC at Center 1 (Shanghai Zhongshan Hospital, 2018.12–2023.9) and Center 7 (Shanghai Minhang District Central Hospital, 2020.10–2024.2) were included as the internal and external test sets, respectively. The patient inclusion and exclusion criteria were consistent with those previously described (Fig. 1). Notably, the internal test temporal window at Center 1 does not overlap with Center 1’s contribution to the main dataset, although it partially overlaps the overall multi-center timeframe of the main dataset.

Pathological analysis

All resected tumors were stained with hematoxylin and eosin, and their associated pleura were stained with Masson’s trichrome. Microscopic examination was conducted on specimens sliced at 0.5 cm thickness. Elastica van Gieson staining was applied whenever the elastic layer of the involved visceral pleura was indistinct. VPI was defined as tumor invasion beyond the elastic layer, classified as Pl1 (without exposure on the pleural surface) or Pl2 (with exposure), but excluding involvement of the parietal pleura28.

Data preparation

Detailed CT acquisition parameters for each participating center are provided in Supplementary Table 4. Tumor volumes of interest from all datasets were automatically segmented on CT images using the research platform uAI Research Portal (uRP, United Imaging Intelligence Co., Ltd.)29. A three-dimensional (3D) DL network, VB-Net, was utilized to automatically detect and segment lung tumors. This automated approach achieved a Dice similarity coefficient of 91.5%30. Radiologist S.Y.Y. subsequently reviewed and manually adjusted the segmentation results as needed, utilizing the lung window setting (window center = −450 to −600 HU, window width = 1500 to 2000 HU) on uRP.

Model development

A MIIF model for VPI prediction was proposed in this study. The MIIF model comprises three principal components: 1) image preprocessing, including image cropping, resampling, and normalization; 2) multi-feature extraction, in which an attention-based residual network trained on the training set extracts 256 deep imaging features from its last fully connected layer, 1185 radiomics features are extracted using PyRadiomics (version 3.0.1), and 13 general CT findings are automatically obtained and verified; and 3) MIIF model construction, in which the deep imaging features, radiomics features, and general CT findings are integrated. Z-score normalization, the LASSO algorithm, and a machine learning classifier (quadratic discriminant analysis, QDA) were employed for model construction.

The enrolled CT volumes were first resampled to 0.7 × 0.7 × 1.0 mm3 resolution by trilinear interpolation and cropped around the centers of the lung nodules into patches of 112 × 112 × 96 voxels. The CT intensity, expressed in Hounsfield units (HU), was then normalized using the Z-score standardization method. Finally, the image intensity of each patient was clipped to the range [−1, 1] to facilitate the observation of lung tissue. The equation for CT normalization is given as follows:

$$I=\begin{cases}-1, & \text{if}\;\frac{I_{HU}-mean}{STD} < -1\\ 1, & \text{if}\;\frac{I_{HU}-mean}{STD} > 1\\ \frac{I_{HU}-mean}{STD}, & \text{otherwise}\end{cases}$$
(1)

where the mean value is set to −400 and the STD (standard deviation) to 750. This process ensures that each CT scan is standardized to a uniform resolution and a consistent intensity range.
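As a concrete illustration, Eq. (1) amounts to clipping the Z-scored intensities; the sketch below assumes a NumPy volume already resampled and expressed in HU (`normalize_ct` is an illustrative name, not the authors' code):

```python
import numpy as np

def normalize_ct(volume_hu, mean=-400.0, std=750.0):
    """Z-score normalize a CT volume (in HU) and clip to [-1, 1],
    following Eq. (1) with mean = -400 and STD = 750."""
    z = (np.asarray(volume_hu, dtype=np.float32) - mean) / std
    return np.clip(z, -1.0, 1.0)
```

With these constants, an intensity of −400 HU maps to 0, and intensities more than 750 HU above or below the mean are clipped to ±1.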

We developed a DL model to predict VPI status and extract deep imaging features. The DL model incorporates residual blocks and a class activation mapping (CAM) mechanism, enabling it to focus on the nodule and its surrounding pleural region. The architecture consists of two major components: image augmentation and an attention-based nodule diagnosis module with CAM. The model with the best performance on the validation set was selected, and 256 deep imaging features were extracted from its last fully connected layer.

In this study, we randomly applied flipping along each axis, scaling by a factor of 0.8 to 1.2, and rotation by an angle of −10° to 10° about an axis to each CT patch in the training set, each with a probability of 50%. In addition, the numbers of VPI-negative and VPI-positive samples in the training set were extremely imbalanced, with a ratio close to 4:1. Such imbalance can significantly bias the classification model. We therefore oversampled the minority class to increase its number of samples, achieving a balanced ratio of positive and negative samples fed into the network.
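The oversampling step can be sketched as follows; `oversample_minority` is a hypothetical helper that balances the roughly 4:1 class ratio by resampling the minority class with replacement (the actual sampler and augmentation pipeline are not published):

```python
import numpy as np

def oversample_minority(labels, rng=None):
    """Return sample indices in which the minority class is resampled
    with replacement until both classes have equal counts (sketch)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    labels = np.asarray(labels)
    idx_pos = np.flatnonzero(labels == 1)
    idx_neg = np.flatnonzero(labels == 0)
    minority, majority = (
        (idx_pos, idx_neg) if len(idx_pos) < len(idx_neg) else (idx_neg, idx_pos)
    )
    # Draw extra minority indices with replacement to match the majority count
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    return np.concatenate([majority, minority, extra])
```

The returned index array can then be used to build each balanced training epoch or batch.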

The DL model was built on the foundation of the 3D residual network (ResNet) framework, serving to discriminate VPI-positive from VPI-negative NSCLC. To guide the network to focus on the features of the NSCLC and its surrounding regions, the CAM attention mechanism was introduced into this framework. Online supervision of the network's response regions was implemented during training to optimize classification performance.

The architecture consists of several key components: an input block, four downsampling blocks, a global average pooling (GAP) layer, a fully connected layer, and a softmax layer. Specifically, the input block is a convolutional module comprising a convolutional layer with a 3 × 3 × 3 kernel and a 1 × 1 × 1 stride, followed by a batch normalization (BN) layer and a ReLU layer. For the downsampling blocks, we employed residual structures in which the input and output of each downsampling block are combined through addition and then passed as input to the next downsampling block. Each downsampling block begins with a convolutional block containing a convolutional layer (kernel size = 2 × 2 × 2, stride = 2 × 2 × 2), followed by a BN layer and a ReLU layer. The remaining convolutional blocks in each downsampling block consist of a convolutional layer with a 3 × 3 × 3 kernel and a 1 × 1 × 1 stride, followed by a BN layer and a ReLU layer. The output channels of the input block and the four downsampling blocks are 16, 32, 64, 128, and 256, respectively. The softmax layer has two output channels, representing the probabilities of VPI-negative and VPI-positive for the given input sample.
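A downsampling block of the kind described might be sketched in PyTorch as follows; `DownBlock3D` is illustrative, and the exact placement of the residual addition inside the block is an assumption:

```python
import torch
import torch.nn as nn

class DownBlock3D(nn.Module):
    """One downsampling block (sketch): a strided 2x2x2 conv halves each
    spatial dimension, then a 3x3x3 conv refines features; BN + ReLU
    follow each conv, and a residual addition wraps the 3x3x3 stage."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=2, stride=2),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True))
        self.conv = nn.Sequential(
            nn.Conv3d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        x = self.down(x)
        return x + self.conv(x)  # residual addition
```

Chaining four such blocks with channel widths 32, 64, 128, and 256 after a 16-channel input block reproduces the stated channel progression.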

To enable the model to learn features of the nodule and its pleural region associated with VPI, this study utilizes attention maps from 3D CAM for online learning. Specifically, a 1 × 1 × 1 3D convolutional layer and a ReLU activation function were used to generate attention feature maps corresponding to the network's response regions. The attention maps are computed as follows:

$$A={ReLU}({Convolution}(\,f,w))$$
(2)

where f represents the feature map before the GAP layer and w represents the weight matrix of the fully connected layer. To make the attention generation procedure trainable, a convolutional layer with a 1 × 1 × 1 kernel and a ReLU layer were employed to generate the attention feature map A. The size of A is 1/16 of the corresponding size of the input CT image. We then upsampled the attention feature map to match the input image size, normalized it to the range 0–1, and applied a sigmoid function to obtain the final attention map, as follows:

$$T\left(A\right)={Sigmoid}(A)=\frac{1}{1+exp(-\alpha (A-\beta ))}$$
(3)

where T(A) represents the attention map generated by the online attention module; the values of α and β are set to 100 and 0.4, respectively. During training, the weights of the fully connected layer are assigned to the convolution kernel parameters in Eq. (2), thereby optimizing the online network attention map.
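Equations (2) and (3) together can be sketched as follows; `attention_map` is an illustrative function, collapsing the per-class maps with a maximum is an assumption, and the upsampling step is omitted:

```python
import torch
import torch.nn.functional as F

def attention_map(feature, fc_weight, alpha=100.0, beta=0.4):
    """Online CAM attention (Eqs. 2-3, sketch): a 1x1x1 convolution whose
    kernels are the fully connected layer's weights, then ReLU, min-max
    normalization to [0, 1], and T(A) = 1 / (1 + exp(-alpha * (A - beta)))."""
    # fc_weight: (num_classes, C) reshaped into 1x1x1 conv kernels
    kernel = fc_weight[:, :, None, None, None]
    a = F.relu(F.conv3d(feature, kernel))            # Eq. (2)
    a = a.amax(dim=1, keepdim=True)                  # collapse classes (assumption)
    a = (a - a.min()) / (a.max() - a.min() + 1e-8)   # normalize to [0, 1]
    return torch.sigmoid(alpha * (a - beta))         # Eq. (3)
```

The steep sigmoid (α = 100, β = 0.4) effectively thresholds the normalized response into a soft binary attention mask.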

The loss function of the proposed DL model combines a label-smoothed cross entropy (LSCE) loss and a mean squared error (MSE) loss, written as follows:

$${\rm{Loss}}={\rm{LSCE}}\_{\rm{loss}}+\alpha * {\rm{MSE}}\_{\rm{loss}}$$
(4)
$${\text{LSCE}}\_{\text{loss}}=-\mathop{\sum }\limits_{i=1}^{N}{q}_{i}\,{\text{log}}\,{y}_{i}$$
(5)
$${q}_{i}=\begin{cases}1-\varepsilon, & \text{if}\;i=y\\ \varepsilon/(N-1), & \text{otherwise}\end{cases}$$
(6)
$${\rm{MSE}}\_{\rm{loss}}=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{(T\left(A\right)-G)}^{2}$$
(7)

where \({q}_{i}\) denotes the smoothed target probability for class i, \({y}_{i}\) is the predicted probability for class i, N is the number of classes, and ε is a small constant that prevents the target probabilities in the LSCE loss from being exactly 1 or 0, thus mitigating overfitting of the model. \(T\left(A\right)\) represents the network attention map, and G is the input volume of interest (VOI). The MSE loss encourages the attention map of the network to be as similar as possible to the input VOI, thereby guiding the attention regions of the network. During training, a constant weight α balances the classification task against the attention-guidance task.
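A sketch of the combined loss in Eqs. (4)–(7), assuming two-class logits and an attention map already resized to match the VOI mask (`miif_loss` is an illustrative name, not the authors' code):

```python
import torch
import torch.nn.functional as F

def miif_loss(logits, targets, attn_map, voi_mask, eps=0.1, alpha=100.0):
    """Total loss = LSCE + alpha * MSE between attention map and VOI mask."""
    n_cls = logits.shape[1]
    log_p = F.log_softmax(logits, dim=1)
    # Smoothed targets q_i: 1 - eps for the true class, eps / (N - 1) otherwise
    q = torch.full_like(log_p, eps / (n_cls - 1))
    q.scatter_(1, targets.unsqueeze(1), 1.0 - eps)
    lsce = -(q * log_p).sum(dim=1).mean()            # Eqs. (5)-(6)
    mse = F.mse_loss(attn_map, voi_mask)             # Eq. (7)
    return lsce + alpha * mse                        # Eq. (4)
```

With ε = 0.1 and two classes, the target distribution for a positive sample is (0.1, 0.9) rather than (0, 1), which softens the supervision signal.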

We trained the attention-based 3D ResNet for a maximum of 1001 epochs with early stopping (patience of 100 epochs). All network parameters were initialized with Kaiming initialization. The model was optimized using the Adam optimizer with an initial learning rate of 1 × 10−5, betas of (0.9, 0.999), epsilon of 1 × 10−8, and weight decay of 0.01. The learning rate was scheduled using MultiStepLR with milestones at epochs 50, 100, 200, 400, and 800 and a gamma value of 0.4. The batch size was set to 36, and 16 CPU threads were used for data loading. The smoothing factor (ε) of the LSCE loss was set to 0.1, and the attention-guidance weight (α) of the MSE loss to 100. No dropout was applied in the network architecture. The model with the best performance on the validation set was finally selected.
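These optimizer and scheduler settings translate directly into PyTorch; the `nn.Linear` module below is only a placeholder standing in for the 3D ResNet:

```python
import torch

model = torch.nn.Linear(4, 2)  # placeholder standing in for the 3D ResNet
opt = torch.optim.Adam(model.parameters(), lr=1e-5, betas=(0.9, 0.999),
                       eps=1e-8, weight_decay=0.01)
# Learning rate decays by gamma = 0.4 at epochs 50, 100, 200, 400, and 800
sched = torch.optim.lr_scheduler.MultiStepLR(
    opt, milestones=[50, 100, 200, 400, 800], gamma=0.4)
```

In the training loop, `sched.step()` is called once per epoch after `opt.step()`, so the learning rate drops to 4 × 10−6 after epoch 50.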

A total of 1185 radiomics features were extracted from the preprocessed CT images using the PyRadiomics (version 3.0.1) package implemented in the research platform uAI Research Portal (uRP, United Imaging Intelligence Co., Ltd.). The features fell into three categories: first-order statistics, morphological features, and texture features. They were extracted from the 3D tumor masks on the original CT images and on two types of filtered images (i.e., Laplacian of Gaussian and wavelet-filtered images).

Thirteen general CT findings were automatically obtained for each pulmonary nodule through the uRP platform, including density, size, volume, long and short diameters, and the nodule signs of spiculation, vacuole sign, calcification, lobulation, pleural traction (traditionally, the pleural tag sign/pleural indentation), air bronchogram, spinous protuberance, and vessel convergence. All CT findings were manually verified by a radiologist (S.Y.Y.) and corrected if necessary.

We integrated the deep imaging features, radiomics features, and general CT findings, resulting in 1454 features per pulmonary nodule. To ensure comparability across different feature types, Z-score normalization was applied. Subsequently, the LASSO algorithm was employed to select the features most relevant for distinguishing between VPI-positive and VPI-negative nodules. The LASSO regularization was performed with an L1 penalty using 5-fold stratified cross-validation. We conducted a grid search over the regularization strength α in the range [0.001, 0.01, 0.05, 0.1, 1, 10]. Features that were selected in at least 80% of the cross-validation folds were retained. The selected features (n = 42) were then used to train a QDA classifier. The QDA classifier was implemented with a regularization parameter of 0.1 to prevent overfitting by smoothing the covariance estimates. The priors were set based on the class frequencies in the training set.
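The selection-and-classification pipeline described above can be sketched with scikit-learn. This is an illustrative reconstruction, not the study's code: it uses `Lasso` on the binary outcome (a common choice in radiomics pipelines), selects one α by cross-validated error, and keeps features with nonzero coefficients in at least 80% of the stratified folds:

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import Lasso
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

def select_features(X, y, alphas=(0.001, 0.01, 0.05, 0.1, 1, 10),
                    n_splits=5, stability=0.8, seed=0):
    """Choose alpha by stratified CV, then keep features whose LASSO
    coefficient is nonzero in at least `stability` of the folds."""
    X = StandardScaler().fit_transform(X)          # Z-score normalization
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    best_alpha, best_err = None, np.inf
    for a in alphas:                               # grid search over alpha
        errs = []
        for tr, va in skf.split(X, y):
            m = Lasso(alpha=a, max_iter=10000).fit(X[tr], y[tr])
            errs.append(np.mean((m.predict(X[va]) - y[va]) ** 2))
        if np.mean(errs) < best_err:
            best_alpha, best_err = a, np.mean(errs)
    counts = np.zeros(X.shape[1])
    for tr, _ in skf.split(X, y):                  # stability across folds
        m = Lasso(alpha=best_alpha, max_iter=10000).fit(X[tr], y[tr])
        counts += (m.coef_ != 0)
    return counts / n_splits >= stability          # boolean feature mask

def fit_qda(X, y, mask):
    """QDA on the retained features with smoothed covariances (reg_param=0.1)
    and priors taken from the training-set class frequencies."""
    priors = np.bincount(y) / len(y)
    clf = QuadraticDiscriminantAnalysis(reg_param=0.1, priors=priors)
    return clf.fit(X[:, mask], y)
```

Note that for a rigorous evaluation the normalization and selection steps would be fit on the training folds only; the sketch keeps them global for brevity.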

All experiments were conducted on a high-performance computing cluster with NVIDIA A40 GPUs (48 GB memory). We used PyTorch 1.12.1, CUDA 12.4, and scikit-learn 1.6.1. Random seeds were fixed for PyTorch, NumPy, and Python to ensure deterministic results.
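A typical seed-fixing helper covering all three libraries looks like the following (an illustrative sketch, not the study's code):

```python
import os
import random

import numpy as np

def set_seed(seed=42):
    """Fix random seeds for Python, NumPy, and (if available) PyTorch."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Trade some speed for reproducible cuDNN kernels
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; Python/NumPy seeding still applies
```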

Several strategies were applied to mitigate class imbalance: 1) Oversampling of the minority class. During DL model training, we oversampled VPI-positive cases to achieve a 1:1 ratio per batch, ensuring the network learned balanced features; 2) Loss function design. We used LSCE, which down-weights overconfident predictions and improves minority-class learning; 3) Stratified cross-validation during LASSO. We used LASSO for feature selection on the general CT/radiomics/deep learning features prior to model fusion. Stratified 5-fold cross-validation was performed, i.e., each fold preserved the proportion of VPI-positive and VPI-negative cases present in the full training set. Stratification reduces variability in class composition across folds under imbalance, yielding a more stable selection of the LASSO regularization parameter and preserving minority-informative features; 4) QDA priors set to training frequencies. The MIIF classifier uses QDA, which models class-conditional densities with class-specific covariance matrices and applies Bayes' rule. Aligning priors with the development base rate mitigates bias from assuming equal priors in an imbalanced setting and improves probability calibration within development-like distributions.
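Strategy 1 (per-batch 1:1 oversampling) can be sketched as follows; `pos_idx` and `neg_idx` are hypothetical index lists for the VPI-positive and VPI-negative cases in the training set:

```python
import random

def balanced_batch(pos_idx, neg_idx, batch_size, rng=random):
    """Draw a batch with a 1:1 positive/negative ratio, oversampling the
    smaller (VPI-positive) class with replacement."""
    half = batch_size // 2
    pos = [rng.choice(pos_idx) for _ in range(half)]   # minority, oversampled
    neg = [rng.choice(neg_idx) for _ in range(half)]
    batch = pos + neg
    rng.shuffle(batch)                                 # mix classes within the batch
    return batch
```

In a PyTorch pipeline the same effect is usually achieved with `torch.utils.data.WeightedRandomSampler`; the function above just makes the sampling logic explicit.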

Observer performance test

Six board-certified thoracic radiologists (Y.Z., S.Y.Y., Q.W., W.S., S.Y., F.S., with 6-23 years of experience in chest imaging) independently assessed the presence of VPI in NSCLC using a 5-point scale: 0-unlikely to have VPI (0% possibility); 1-slightly likely to have VPI (0-25% possibility); 2-moderately likely to have VPI (26-50% possibility); 3-very likely to have VPI (51-75% possibility); 4-extremely likely to have VPI (76-100% possibility). A CT-VPI presence score of 3 or 4 defined the presence of VPI. The radiologists were informed of the patients' ID, age, and tumor location. To mirror daily clinical practice, scoring was conducted without prior education on specific pleural-related CT findings; thus, the CT-VPI presence score was determined based on the radiologists' own experience. The six radiologists assessed CT-VPI presence using axial, coronal, and sagittal images for all patients in the internal and external test sets to evaluate the relationship between the tumors and the pleura.

Paired design (sequential session) for comparing diagnostic performance

To compare performance between AI-unassisted and AI-assisted interpretations, interpretation typically occurs without AI in the first session and with AI in the second session11. In this study, a washout period (> one month) was implemented between the two sessions to prevent learning effects from the first session. The order of case review was randomly reshuffled prior to the second session. A 5-point scale was also employed to evaluate the likelihood of VPI presence on CT with the MIIF model results, conducted by the same 6 radiologists.

Specific CT findings evaluation

For the internal and external test sets, tumors were categorized into SN, PSN, and GGN based on CT image analyses. Subpleural nodules identified by CT were classified into three categories (Fig. 4):

  1. Pleural-attached nodules, which were in direct contact with the pleural surface.

  2. Pleural-tag nodules, which were not in contact with the pleura6. These nodules had thin, linear structures (≤2 mm in maximum width) extending from the surface of the nodule to the visceral pleura; the tag must be continuous with both the nodule and the pleura (to distinguish it from unrelated linear opacities such as atelectatic bands).

  3. Nodules that pushed against the pleura, which were also in direct contact with the interlobar fissure. The pulmonary nodule displaced the pleura to the opposite side, or grew across the fissure.

For pleural-attached nodules, specific CT findings were assessed, including the distance between the nodule (solid component) and the pleura, PI, and the jellyfish sign. For pleural-tag nodules, specific CT findings such as the distance between the nodule (solid component) and the pleura, bridge tag sign, PI, PIF, and pleural tag type (Supplementary Fig. 2) were also evaluated.

The CT findings for each subpleural nodule were assessed by two radiologists (S.Y.Y./F.S., with 10/23 years of experience in chest CT imaging), who were blinded to the clinicopathologic data. Any disagreements were evaluated together and resolved by consensus.

Statistical analysis

Model performance was evaluated using AUC, accuracy, sensitivity, specificity, positive and negative predictive values (PPV, NPV), and F1-score. Calibration curves and DCA were also used to evaluate the accuracy of the risk estimates. Additionally, Brier scores, which quantify the distance between predicted probabilities and observed outcomes (a lower score indicates better prediction), were calculated. For both the internal and external test sets, AUC comparisons between the MIIF model and radiologists, and between unassisted and assisted radiologist interpretations, were performed using the DeLong test. AUC interpretations were as follows: 1) acceptable (AUC, 0.70–0.80), 2) excellent (AUC, 0.80–0.90), 3) outstanding (AUC, greater than 0.90)31. Continuous variables were presented as mean ± standard deviation and analyzed using the independent samples t-test, Mann-Whitney U test, or analysis of variance, depending on data distribution. Categorical variables were presented as frequencies with percentages and analyzed using the Pearson χ2 or Fisher exact test, as appropriate. Multivariable logistic regression analysis (Forward: LR) was utilized to identify independent CT features associated with VPI. The McNemar test was employed to compare accuracy, sensitivity, and specificity. Analyses were conducted using SPSS software (version 29.0; IBM) and Python (version 3.9.12). A two-tailed P-value of less than 0.05 was considered statistically significant.
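For reference, the Brier score used above is simply the mean squared difference between the predicted probabilities and the binary outcomes:

```python
def brier_score(probs, labels):
    """Mean squared distance between predicted probabilities and binary
    outcomes; 0 is a perfect forecast, and constant 0.5 predictions
    score 0.25 regardless of the labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)
```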

Sample size calculation

We calculated the required sample size for developing the prediction model for VPI status using the approach described by Riley et al.32. This method aims to minimize model overfitting and ensure precise predictions by considering multiple criteria. The sample size was determined using the following formula for binary outcomes:

$${\rm{n}}={\left(\frac{Z}{\sigma }\right)}^{2}\hat{\Phi }(1-\hat{\Phi })$$

where n represents the required sample size, Z refers to the Z-value, \(\sigma\) refers to the margin of error (generally recommended as ≤0.05), and \(\hat{\Phi }\) is the overall outcome proportion. In this study, Z was set to 1.96 for a 95% confidence level and \(\sigma\) to 0.05. Based on a previous study by Huang et al.33, the VPI incidence rate for lung tumors less than 30 mm in size ranges from 8% to 38%. We calculated the sample size for both the lower and upper bounds of this range: at least 114 participants (i.e., about 10 participants with positive VPI) are required when the outcome proportion (\(\hat{\Phi }\)) is 0.08, and 363 participants (i.e., 138 participants with positive VPI) are required when \(\hat{\Phi }\) is 0.38.
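The two figures follow directly from the formula above, rounding up to the next whole participant:

```python
import math

def required_n(phi, z=1.96, sigma=0.05):
    """Sample size for estimating an outcome proportion phi with margin of
    error sigma at the confidence level implied by z (Riley-style binary
    outcome criterion, rounded up)."""
    return math.ceil((z / sigma) ** 2 * phi * (1 - phi))

def expected_positives(phi, z=1.96, sigma=0.05):
    """Expected number of outcome-positive participants in that sample."""
    return math.ceil(required_n(phi, z, sigma) * phi)
```

With \(\hat{\Phi }\) = 0.08 this yields n = 114 (about 10 positive cases), and with \(\hat{\Phi }\) = 0.38 it yields n = 363 (about 138 positive cases), matching the values reported above.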

To ensure robust model development and validation, we aimed to recruit at least 363 participants (including approximately 138 with positive VPI), corresponding to the upper end of the incidence-rate range.