Introduction

Recent advances in deep learning for medical image analysis have demonstrated significant progress in oncology applications, particularly in lung cancer detection and classification1,2,3,4. For example, one method dynamically inspects casting defect images using deep learning feature matching5,6. Another extracts features from the region of interest in test images and then constructs a multilayer neural network for product defect detection7,8,9. A study proposed a learning-based framework for automatically detecting fabric defects10. Feng et al. proposed a deep active learning system to maximize model recognition performance. Ren et al. proposed a general surface defect detection method based on deep learning, innovative in that it required only a small dataset11. Such research on computer vision–based inspection techniques is increasingly being carried out across industries.

Image recognition is an essential subfield of computer vision and a vital application of deep learning. The development of deep learning has facilitated the use of image recognition in many fields, including object recognition12,13, facial recognition14, automatic driving15, agricultural intelligence16, aerospace17, and others18,19. In recent years, deep learning–based image recognition has opened a broader space for medical and health research, and scholars have used it to conduct in-depth research in medical imaging with remarkable results. Notable research directions include the following aspects:

(1) Detection and diagnosis: Various studies have used image recognition for the automatic detection and diagnosis of medical images20, such as lung nodules21, breast cancer22, and diabetic lesions23. These studies showed that image recognition improved the early diagnosis rate and accuracy of diseases.

(2) Medical image segmentation: This process separates different tissues or organs in medical images. Many studies used image recognition for medical image segmentation, such as tumor segmentation24, blood vessel segmentation25, and so forth. These studies revealed that image recognition could improve the accuracy and efficiency of medical image segmentation.

(3) Medical image alignment: This process aligns medical images acquired at different times or in different scanning directions. Several studies used image recognition for medical image alignment, such as brain image alignment26, heart image alignment27, and so forth. These studies demonstrated that image recognition could improve the accuracy and efficiency of medical image alignment.

(4) Medical image reconstruction: This process recovers medical images from low-resolution or low-quality data. Several studies have used image recognition for medical image reconstruction, such as computed tomography (CT) image reconstruction28, reconstruction of magnetic resonance imaging images29, and so forth. The results showed that image recognition could improve the accuracy and efficiency of medical image reconstruction.

In general, image recognition is widely used in medical imaging, and as new research results continue to emerge, it is expected to play an increasingly important role in clinical diagnosis and treatment. In recent years, lung diseases have become more diverse and complicated, especially since the COVID-19 pandemic. Notable achievements have been made in medical imaging, especially in diagnosing lung diseases: researchers have used the convolutional neural network (CNN) technique to classify and segment lung CT images to identify diseases such as lung cancer, tuberculosis, and emphysema. Unlike other studies, this study focused on identifying and classifying pulmonary ground-glass nodules, to provide a basis for further diagnosis and to enhance the efficiency of automatic diagnosis. To this end, this study designed a multiscale convolutional neural network (MCNN) model based on the traditional CNN technique, with the help of Gaussian-difference pyramidal decomposition of the image. The model was applied to recognize multiple categories of lung nodules, automatically marking the locations of lesion nodules and identifying nodule categories. The feasibility and effectiveness of the model and method were verified in experiments with hospital-specific clinical lung nodule lesion detection imaging data, in search of a scientific and intelligent diagnosis method for lung nodule identification.

Methods

Related theory and model elaboration

Convolutional neural network

The CNN is a classical deep learning algorithm (Alzubaidi et al. 2021), used mainly in image processing, speech processing, natural language processing, and other deep learning fields. CNNs were among the first deep neural networks to be trained successfully and are feed-forward networks. The basic architecture of a general CNN (shown in Fig. 1) contains an input layer, a convolutional layer, a pooling layer, a fully connected (FC) layer, and an output layer.

Fig. 1 Architecture of a convolutional neural network.

Three features make CNNs invariant to displacement, scaling, and distortion in image recognition: local receptive fields, weight sharing, and downsampling30.

(1) Local receptive field: The area of input neurons connected to each hidden-layer neuron, as illustrated in Fig. 2.

Fig. 2 Local receptive fields.

(2) Weight sharing: Connected neurons share weights and biases within local receptive fields. The hidden neuron output is calculated as:

$$f\left(\sum_{n=0}^{4}\sum_{m=0}^{4} w_{n,m}\, a_{j+n,k+m}+b\right)$$
(1)

where \(f\) is the activation function, \(w_{n,m}\) is a 5 × 5 shared weight array, \(a_{j+n,k+m}\) denotes the input activation value at position \((j+n,\ k+m)\), and \(b\) is the shared bias value (a short sketch of this computation follows the list below).

(3) Downsampling: A downsampling (pooling) layer follows the convolutional layer to reduce the dimensionality of the feature maps. Downsampling simplifies the information output by the convolutional layer and reduces the resolution of the features.
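For concreteness, the following is a minimal NumPy sketch of Eq. (1); the 28 × 28 input size, ReLU activation, and random values are illustrative assumptions, not settings from this study.

```python
import numpy as np

def hidden_activation(a, j, k, w, b, f=lambda z: np.maximum(z, 0.0)):
    """Evaluate Eq. (1): f(sum_n sum_m w[n, m] * a[j+n, k+m] + b).
    The 5x5 array `w` and the scalar `b` are shared by every hidden neuron."""
    window = a[j:j + 5, k:k + 5]          # local receptive field at (j, k)
    return f(np.sum(w * window) + b)      # weight sharing: same w, b everywhere

# Slide the shared 5x5 kernel over a toy 28x28 input (random values).
rng = np.random.default_rng(0)
a = rng.random((28, 28))
w = rng.normal(scale=0.1, size=(5, 5))
feature_map = np.array([[hidden_activation(a, j, k, w, b=0.0)
                         for k in range(24)] for j in range(24)])
print(feature_map.shape)  # (24, 24)
```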

Gaussian pyramid decomposition

The difference of Gaussians (DoG) is a variant of the image pyramid that uses Gaussian smoothing and difference operations to extract the scale-space information of an image. After Gaussian-difference pyramid decomposition, an image yields N images of different resolutions. The Gaussian-difference pyramid is composed of multiple groups, each containing several layers; it is constructed from an m-group Gaussian pyramid with n layers per group, yielding n-1 difference layers per group. The Gaussian-difference pyramid decomposition process is as follows:

Step 1: Initialize \(i=0\).

Step 2: Sample the standard image \(I(x,y)\) to obtain the first-layer image \(g_{0,0}\) of the first group of the Gaussian pyramid.

Step 3: Initialize \(j=0,\ x=0\).

Step 4: Convolve the Gaussian kernel \(G_x\) with the image \(g_{i,j}\):

$$G_{x}\left(x,y,\sigma_{x}\right)=\frac{1}{2\pi\sigma_{x}^{2}}\,e^{-\frac{\left(x-x_{0}\right)^{2}+\left(y-y_{0}\right)^{2}}{2\sigma_{x}^{2}}}$$
(2)
$$g_{i,j+1}\left(x,y\right)=g_{i,j}\left(x,y\right)\otimes G_{x}\left(x,y,\sigma_{x}\right)$$
(3)

where \(\sigma_{x}\) is the smoothing parameter.

Step 5: Subtract the Gaussian image \(g_{i,j+1}\left(x,y\right)\) from the Gaussian image \(g_{i,j}\left(x,y\right)\) to obtain the Gaussian-difference image \(d_{i,x}\):

$$d_{i,x}\left(x,y\right)=g_{i,j}\left(x,y\right)-g_{i,j+1}\left(x,y\right)$$
(4)

Step 6: Set \(j=j+1\) and \(x=x+1\), and repeat Steps 4 and 5; when \(j>n-1\) and \(x>n-2\), proceed to Step 7.

Step 7: Downsample the image \(g_{i,0}\) to obtain the Gaussian image \(g_{i+1,0}\) of layer \(i+1\). Set \(i=i+1\) and return to Step 3; the decomposition ends when \(i>m-1\).
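The following Python sketch implements Steps 1–7 under stated assumptions: the base smoothing parameter `sigma0 = 1.6`, the \(\sqrt{2}\) scale step, and the counts `m = 4` groups and `n = 5` layers are illustrative choices, since the text does not fix them.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_difference_pyramid(image, m=4, n=5, sigma0=1.6):
    """Sketch of the decomposition in Steps 1-7: m groups, n Gaussian
    layers per group, and n-1 difference-of-Gaussian layers per group."""
    dog_pyramid = []
    g = image.astype(np.float64)              # g_{0,0}: first layer, first group
    for i in range(m):                        # Step 7 loop over groups
        layers = [g]
        sigma = sigma0
        for j in range(n - 1):                # Steps 4-6: smooth, then difference
            layers.append(gaussian_filter(layers[-1], sigma))
            sigma *= np.sqrt(2)
        dogs = [layers[j] - layers[j + 1] for j in range(n - 1)]  # Eq. (4)
        dog_pyramid.append(dogs)
        g = layers[0][::2, ::2]               # downsample g_{i,0} for next group
    return dog_pyramid

# Usage on a toy 640x640 image:
pyr = gaussian_difference_pyramid(np.random.rand(640, 640), m=4, n=5)
print(len(pyr), len(pyr[0]), pyr[1][0].shape)  # 4 groups, 4 DoGs, (320, 320)
```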

Model elaboration

Because a CNN learns features automatically, without a manually designed feature extractor, it improves model accuracy and robustness. CNNs also have properties such as translational invariance and spatial hierarchy, which make them well suited to processing image data. The Gaussian-difference pyramid is an effective image-processing algorithm with the advantages of scale invariance, feature extraction, computational efficiency, visualization, and so forth. This study combined the CNN structure with Gaussian pyramid decomposition (GPD) features and constructed an MCNN model based on the traditional CNN architecture, with multiscale images as input (shown in Fig. 3).

The MCNN model designed in this study incorporates a nine-layer architecture that processes multiscale inputs derived from Gaussian pyramid decomposition. The network begins with a slice layer that separates the four multiscale images (the original 640 × 640-pixel image and three Gaussian-difference pyramid levels) into independent channels for initial processing. The first convolutional layer (Conv1) applies 96 filters with 9 × 9 kernels and a stride of 4 to each multiscale input, generating feature maps of 158 × 158 × 96. Following max pooling with a 3 × 3 kernel and a stride of 2, the resulting 78 × 78 × 96 feature maps from all four scales are concatenated, creating a unified multiscale feature representation with 384 channels (96 × 4). This concatenated feature map then passes through the second convolutional layer (Conv2), which employs 256 filters with 6 × 6 kernels and a stride of 1, producing 73 × 73 × 256 feature maps. After a second max pooling operation with 6 × 6 kernels and a stride of 2, the 34 × 34 × 256 feature maps are processed by the third convolutional layer (Conv3) using 384 filters with 3 × 3 kernels and a stride of 1, resulting in 32 × 32 × 384 feature maps. A final max pooling layer with 3 × 3 kernels and a stride of 2 reduces the spatial dimensions to 15 × 15 × 384 before the fully connected layers.

Both the convolutional layers and the fully connected (FC) layers in this model used the rectified linear unit (ReLU) activation function. Compared with the hyperbolic tangent (Tanh) and sigmoid functions, ReLU is a nonlinear, nonsaturating function that trains faster than saturating functions. Its piecewise-linear form avoids the vanishing-gradient problem that affects the Tanh and Sigmoid functions during error backpropagation, while retaining nonlinear expressive capability.

The network architecture concludes with three fully connected layers: the first two FC layers each contain 4096 neurons, while the final FC layer contains 5 neurons corresponding to the five classification categories (SN, PGGO, MGGO, special type, and normal type). A dropout operation with a rate of 0.5 was applied in the first two FC layers to avoid overfitting31. Unlike L1 and L2 regularization, dropout does not rely on modifying the cost function; it changes the network itself. This multiscale processing approach enables the MCNN to capture both fine-grained textural details and broader contextual information, allowing it to effectively distinguish between different types of lung nodules while maintaining computational efficiency through the unified processing pipeline after initial multiscale feature extraction.
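To make the layer arithmetic concrete, the following is a hedged PyTorch rendering of the architecture described above (the study itself used Caffe). Whether Conv1 weights are shared across the four scale branches is not stated, so sharing them here is an assumption.

```python
import torch
import torch.nn as nn

class MCNN(nn.Module):
    """Illustrative PyTorch sketch of the nine-layer MCNN described above.
    Assumption: Conv1 weights are shared across the four scale branches."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 96, kernel_size=9, stride=4)    # per scale
        self.pool1 = nn.MaxPool2d(kernel_size=3, stride=2)
        self.conv2 = nn.Conv2d(384, 256, kernel_size=6, stride=1)
        self.pool2 = nn.MaxPool2d(kernel_size=6, stride=2)
        self.conv3 = nn.Conv2d(256, 384, kernel_size=3, stride=1)
        self.pool3 = nn.MaxPool2d(kernel_size=3, stride=2)
        self.relu = nn.ReLU(inplace=True)
        self.fc = nn.Sequential(                                  # FC layers
            nn.Linear(15 * 15 * 384, 4096), nn.ReLU(True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),                         # 5 classes
        )

    def forward(self, x):                      # x: (B, 4, 640, 640)
        scales = torch.chunk(x, 4, dim=1)      # slice layer: 4 scale images
        feats = [self.pool1(self.relu(self.conv1(s))) for s in scales]
        x = torch.cat(feats, dim=1)            # concat: (B, 384, 78, 78)
        x = self.pool2(self.relu(self.conv2(x)))   # (B, 256, 34, 34)
        x = self.pool3(self.relu(self.conv3(x)))   # (B, 384, 15, 15)
        return self.fc(torch.flatten(x, 1))    # logits for the Softmax

logits = MCNN()(torch.randn(1, 4, 640, 640))
print(logits.shape)  # torch.Size([1, 5])
```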

Fig. 3 Multiscale convolutional neural network model.

In this study, Softmax and support vector machine (SVM) classifiers, both suited to multiclass problems, were used for comparison. Assuming that the training samples are \(\left\{\left(x_{1},y_{1}\right),\left(x_{2},y_{2}\right),\dots,\left(x_{N},y_{N}\right)\right\}\) with labels drawn from \(k\) classes (here \(k=5\)), the hypothesis function outputs a \(k\)-dimensional vector representing the classification probability of each class. The specific hypothesis function \(h_{\theta}\left(x\right)\) is as follows:

$$h_{\theta}\left(x^{\left(i\right)}\right)=\left[\begin{array}{c}p\left(y^{\left(i\right)}=1\mid x^{\left(i\right)};\theta\right)\\ p\left(y^{\left(i\right)}=2\mid x^{\left(i\right)};\theta\right)\\ \vdots\\ p\left(y^{\left(i\right)}=k\mid x^{\left(i\right)};\theta\right)\end{array}\right]=\frac{1}{\sum_{j=1}^{k}e^{\theta_{j}^{T}x^{\left(i\right)}}}\left[\begin{array}{c}e^{\theta_{1}^{T}x^{\left(i\right)}}\\ e^{\theta_{2}^{T}x^{\left(i\right)}}\\ \vdots\\ e^{\theta_{k}^{T}x^{\left(i\right)}}\end{array}\right]$$
(5)

where \(x^{\left(i\right)}\) denotes the input, \(y^{\left(i\right)}\) denotes the output, and \(p\left(y^{\left(i\right)}=k\mid x^{\left(i\right)};\theta\right)\) denotes the probability that \(y^{\left(i\right)}\) belongs to category \(k\) given the input sample \(x^{\left(i\right)}\) and model parameters \(\theta\).
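A minimal NumPy sketch of Eq. (5) follows; the parameter matrix shape and the values are purely illustrative.

```python
import numpy as np

def softmax_hypothesis(theta, x):
    """Eq. (5): theta is a (k, d) parameter matrix, x a d-dimensional input.
    Returns the k class probabilities p(y = c | x; theta)."""
    scores = theta @ x
    e = np.exp(scores - scores.max())   # shift by the max for numerical stability
    return e / e.sum()

theta = np.random.default_rng(0).normal(size=(5, 8))  # k = 5 classes, d = 8
p = softmax_hypothesis(theta, np.ones(8))
print(p, p.sum())  # five probabilities summing to 1
```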

An SVM maps multidimensional features into a high-dimensional kernel space, allowing data that are otherwise inseparable to acquire new features more conducive to classification; it is a machine learning classification method that relies on kernel functions. The radial basis function (RBF) kernel, which has good classification performance, was chosen for our lung nodule recognition classification experiments, with parameters \(C=10\) and \(\sigma=0.038\). The kernel function is as follows:

$$K\left(x_{1},x_{2}\right)=e^{-\frac{{\left\|x_{1}-x_{2}\right\|}^{2}}{2\sigma^{2}}}$$
(6)

In CNNs, the size and number of convolutional kernels are design choices. Different convolutional kernels extract different features, making them suitable for diverse lung nodule image recognition applications. At the same time, the Gaussian-difference pyramid can detect image features at different scales, achieving scale invariance: objects can be detected regardless of their size variations, which is beneficial for recognizing lung nodules of the same type but different sizes. This study validated the effectiveness of the MCNN model incorporating GPD in lung nodule image recognition using specific clinical experimental data.

Empirical analysis: lung nodule image recognition experiment as an example

Description of the process of identifying common lung nodules

The incidence of ground glass opacity (GGO) nodules in the lungs has been increasing in recent years, and many patients are found to have multiple GGOs. However, the management of multiple GGOs is still controversial. GGOs in the lungs are an imaging presentation, encompassing various pathological types, some of which are early-stage lung cancers. A pulmonary ground glass shadow is a focal nodular hyperdense shadow in the lung that is not dense enough to obscure the bronchial vascular bundles traveling within it. They are classified as pure ground-glass opacity (PGGO), solid nodule type (SN), mixed ground-glass opacity type (MGGO), special type (S), and normal type (N) (shown in Fig. 4), depending on whether they contain a solid component. Pulmonary GGO encompasses pathological types such as malignancy, benign tumors, inflammation, interstitial lung disease, and intrapulmonary lymph nodes. Several studies suggested a close relationship between the imaging presentation and the pathology of GGO32,33,34.

Fig. 4 Image of the criteria for identifying a lung nodule.

A prospective clinical trial led by the National Cancer Centre in Japan included GGOs with nodules ≤ 3 cm in maximum diameter and solid components ≤ 5 mm35. The study showed that GGO progressed sequentially from PGGO, to solid components visible in the lung window, to solid components visible in both the lung and mediastinal windows, and that this progression was extremely slow. The Fleischner Society36 recommended further positron emission tomography–CT for multiple GGOs when a prominent partially solid GGO lesion of 8–10 mm is present, to facilitate a more accurate assessment of prognosis and to optimize preoperative staging. The maximum diameter of the main multiple-GGO lesion and the ratio of the maximum diameter of the solid component to the maximum diameter of the nodule (C/T value) serve as references for the physician in judging the benignity or malignancy of the nodule and the timing of surgery. Kim et al. reviewed 40 cases of surgically resected PGGO and found that all PGGOs < 5 mm were benign nodules, while only 10.5% of PGGOs of 5–10 mm were lung malignancies37.

Table 1 Types of lung nodules and criteria for identification.

In practice, most lung nodule images were determined through CT scans and visual inspection by physicians. Table 1 provides the common conditions and criteria for determining lung nodule images in a hospital38,39,40.

In this study, a multiscale CNN model was constructed to experimentally validate the classification and detection of lung nodule images based on common conditions and criteria. The main steps involved in the construction were as follows:

Step 1: The Caffe open-source framework for deep learning was selected as the experimental environment for building multiscale CNN models.

Step 2: While the experimental environment was being built, raw lung nodule images were collected using hospital CT detection equipment and combined with publicly available machine learning image library data to form a sample dataset. The obtained lung nodule images were processed and displayed on the monitor.

Step 3: The obtained lung nodule images were cropped, and intact solid nodules, MGGO, PGGO, special types, and untreated normal types were selected. The detected "typical nodule images" were normalized, and the image size was standardized to 640 × 640 pixels. A standard nodule image dataset was created to form the experimental samples.

Step 4: The experimental samples were used to train and validate a multiscale CNN model using Softmax and SVM classifiers. Comparison experiments were conducted, and various nodal images were labeled.

Step 5: Test samples were randomly selected from each category in the standard sample image dataset, and the classification results and accuracy of the model in detecting lung nodule images were analyzed.

Data acquisition and pre-processing

Lung nodule images from hospital patients were acquired using a CT scan unit and an image acquisition card. This system obtained images of the patients’ lung nodules from the CT scan, received and transmitted image data, acquired lung nodule images, and displayed processed images on a monitor. The dataset was obtained from two primary sources: clinical lung nodule images from patients at Chongming Hospital, Shanghai University of Medicine & Health Sciences, and publicly available data from The Cancer Imaging Archive (TCIA) Public Access database. The dataset contained over 4000 raw images extracted from different categories of images across various detection periods and publicly available machine learning image libraries.

The acquired lung nodule images varied with scanning orientation and individual patient condition; therefore, the raw nodule images were preprocessed for normalization. In this study, lung nodule images acquired from CT scans were selected to build a standard image dataset. Complete nodule images and normal-type images were chosen, yielding more than 4000 standard sample images. The standard image dataset was artificially extended by rotating the images to improve model training quality, resulting in more than 10,000 standard sample images covering the SN, MGGO, PGGO, special, and normal types. The standard sample image dataset was divided into training, validation, and test sets in a fixed proportion (shown in Table 2).

Table 2 Image dataset of determination criteria.

In this study, the acquired standard sample images underwent multiscale preprocessing using Gaussian pyramid decomposition. Specifically, we implemented the Gaussian-pyramid construction and difference-of-Gaussians computation components of the SIFT framework to generate multiscale representations. This process, detailed in the Gaussian pyramid decomposition section, created four scale versions of each input image: the original 640 × 640 image and three additional scales derived through iterative Gaussian smoothing and difference operations (shown in Fig. 5). These multiscale representations were then fed into our MCNN architecture to enable scale-invariant feature learning, without using the subsequent keypoint detection and descriptor generation stages of the full SIFT algorithm. The experiment focused on recognizing and classifying five types of lung nodule images: SN, MGGO, PGGO, special, and normal.
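A sketch of this preprocessing step is given below: it stacks the original image with three DoG levels of the first pyramid group into a four-channel MCNN input. The smoothing parameters are illustrative assumptions, as above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def four_scale_input(img, sigma0=1.6):
    """Build the 4-channel MCNN input: the original 640x640 image plus
    three DoG levels (first pyramid group). sigma0 and the sqrt(2) step
    are illustrative choices, not values fixed by this study."""
    g = [img.astype(np.float64)]
    sigma = sigma0
    for _ in range(3):
        g.append(gaussian_filter(g[-1], sigma))
        sigma *= np.sqrt(2)
    dogs = [g[j] - g[j + 1] for j in range(3)]   # three DoG scale images
    return np.stack([g[0]] + dogs, axis=0)       # shape (4, 640, 640)

x = four_scale_input(np.random.rand(640, 640))
print(x.shape)  # (4, 640, 640)
```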

Fig. 5 Original and preprocessed lung CT images.

Model training and validation

The MCNN model used in this study was implemented on the Caffe deep learning open-source framework. Two classification approaches were employed for comparative analysis: direct end-to-end training with a Softmax classifier, and feature extraction followed by SVM classification.

Unified training protocol

Both approaches shared a common initial training pipeline. Lung nodule images were collected from CT scans and public databases, with complete nodule locations extracted and normalized to obtain standardized 640 × 640-pixel images. Gaussian-difference pyramid images were generated through the GPD technique to create multiscale datasets suitable for MCNN training. The multiresolution training images were input into the network through a slice layer that separated the four multiscale images for independent initial processing. Network weights were initialized using the "Gaussian" method, with biases set to "constant." Training proceeded through iterative batch processing: forward propagation through the network layers (including concatenation for feature merging), error computation against ground-truth labels, and backpropagation-based weight parameter tuning according to the minimum-error-cost principle.

Classification-specific procedures

For the MCNN + Softmax approach, the model underwent direct end-to-end training where the final fully connected layer with 5 neurons directly outputs classification probabilities through the Softmax activation function. Training continued until convergence was achieved (approximately 1400 iterations) based on validation performance monitoring.

For the MCNN + SVM approach, the training process diverged after initial MCNN feature extraction. The pretrained MCNN + Softmax network (without the final classification layer) was used to extract feature vectors from the multiresolution training images. These feature vectors served as input for training five binary SVM classifiers using a one-versus-rest strategy, with each classifier distinguishing one nodule type from all others. During testing, the final prediction was determined by selecting the classifier with the highest confidence score among the five SVM outputs.
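A hedged scikit-learn sketch of this one-versus-rest stage is shown below, using the RBF kernel settings stated earlier (C = 10, σ = 0.038, hence gamma = 1/(2σ²)); the feature vectors here are random placeholders standing in for MCNN penultimate-layer outputs.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Placeholder features standing in for MCNN penultimate-layer vectors.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 4096))
labels = rng.integers(0, 5, size=500)    # SN, PGGO, MGGO, special, normal

sigma = 0.038
ovr_svm = OneVsRestClassifier(SVC(kernel="rbf", C=10, gamma=1 / (2 * sigma**2)))
ovr_svm.fit(features, labels)            # trains five binary RBF SVMs

# Prediction selects the binary classifier with the highest decision score.
print(ovr_svm.predict(features[:5]))
```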

Training configuration

Both models used identical training parameters: a batch size of 32, a learning rate of 0.001 with a stepwise learning policy, a weight decay of 0.0005, a momentum factor of 0.9, and a maximum of 2000 iterations with early stopping based on validation performance.
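Expressed in PyTorch terms (the study used Caffe), this configuration corresponds roughly to the following sketch; the scheduler's step size and decay factor are assumptions, since only a "stepwise" policy is reported, and the model and loss are dummies.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)  # dummy stand-in for the MCNN above
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)
# Stepwise policy: drop the learning rate by 10x every 500 iterations (assumed).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.1)

for iteration in range(2000):                   # maximum 2000 iterations
    optimizer.zero_grad()
    loss = model(torch.randn(32, 10)).sum()     # batch size 32, dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```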

Supervised learning configuration

The multiscale pulmonary nodule recognition models were trained in a supervised manner using vector pairs consisting of normalized lung nodule images and their corresponding categorical labels representing the five classification categories (SN, MGGO, PGGO, special type, and normal type). During the training phase, RGB images underwent preprocessing by subtracting the mean RGB value from each pixel to normalize the input distribution. Table 3 provides the detailed parameter configurations used during systematic training, including the specific filter configurations and stride patterns described in the model architecture section.

Table 3 Partial configuration of parameters for training.

The final network parameters for the formal training process were specified as follows, based on established best practices and literature validation. A weight decay coefficient of 0.0005 was used to mitigate overfitting, following the seminal work of Krizhevsky et al., who found that "this small amount of weight decay was important for the model to learn" in their ImageNet classification study41. A momentum factor of 0.9 was employed to aid convergence, the widely adopted standard in deep learning optimization that balances convergence speed and stability42. The learning rate, set at 0.001, governed the magnitude of parameter updates during optimization and was chosen in line with established practice for CNN-based medical image classification, ensuring stable convergence while allowing adequate parameter updates43. The learning policy was stepwise, an approach proven effective for systematically reducing the learning rate during training. A modest batch size of 32 was selected for mini-batch training to balance computational efficiency and model generalization, considering both hardware constraints and gradient estimation quality for medical imaging datasets44. A scale normalization factor was used to speed up training iterations by exploiting GPU (graphics processing unit) acceleration. Notably, training was limited to a maximum of 2000 iterations, determined through empirical validation of the convergence patterns shown in Fig. 6, ensuring convergence within the imposed constraints while preventing over-training.

As depicted in Fig. 6, the accuracy curve, training loss curve, and validation loss curve show the training process of the MCNN + Softmax model. As the number of iterations increased, the model's accuracy on the validation set improved continuously, accompanied by a steady decrease in loss. This trend indicates that the model converged effectively, with accuracy reaching approximately 89% after around 1400 iterations.

Fig. 6 Recognition accuracy and loss curves.

Using the converged model file obtained after 1400 iterations, the multiscale training images were fed into the trained model to obtain the corresponding feature vectors. These feature vectors formed the training dataset for the SVM classifiers: the labels were split five ways in a one-versus-rest manner, and each split was used to train a binary SVM classifier.

Code availability

The custom Python code used to generate the results reported in this paper is available from the corresponding author upon reasonable request. Due to institutional data security policies and intellectual property considerations, the code cannot be made publicly available. Researchers interested in accessing the code for academic purposes should contact the corresponding author (Honglin Xiong, honyex@126.com) with a detailed description of the intended use and institutional affiliation for review and approval.

Results

Model testing and evaluation

To measure the overall performance of the proposed approach and evaluate the results of the statistical analyses, the commonly accepted confusion matrix45 was used. The metric calculations of the confusion matrix are provided in the equations below. The meanings of the abbreviations used in the equations are as follows: TP refers to True Positive, TN refers to True Negative, FP refers to False Positive, and FN refers to False Negative46.

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
(7)
$$Precision=\frac{TP}{TP+FP}$$
(8)
$$Recall=\frac{TP}{TP+FN}$$
(9)
$$F1=\frac{2\times Precision\times Recall}{Precision+Recall}$$
(10)
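The sketch below computes these metrics from a k × k confusion matrix per Eqs. (7)–(10); the matrix values are illustrative, not this paper's results.

```python
import numpy as np

def per_class_metrics(cm):
    """Accuracy, precision, recall, and F1 from a k x k confusion matrix
    `cm` (rows: true class, columns: predicted class), per Eqs. (7)-(10)."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as class c but not class c
    fn = cm.sum(axis=1) - tp          # class c but predicted otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision, recall, f1

# Toy 5-class confusion matrix (illustrative values only):
cm = np.array([[190, 4, 3, 2, 1],
               [5, 185, 6, 2, 2],
               [4, 7, 183, 4, 2],
               [2, 3, 5, 188, 2],
               [1, 2, 2, 3, 192]])
acc, p, r, f1 = per_class_metrics(cm)
print(f"accuracy={acc:.4f}")
```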

For the MCNN + Softmax network model, the model file obtained after reaching convergence at 1400 iterations of training and validation was used to test the samples, and the model's generalization performance was then assessed on a held-out test set. A subset of 200 randomly selected images per class was extracted from the standardized sample image dataset to compose the test samples. The model's classification outcomes on the test set, the corresponding actual class instances, and the accuracy of the classification results were quantified using a confusion matrix, as presented in Table 4.

Table 4 MCNN + Softmax classification test results.

For the MCNN + SVM network model, the testing phase involved converting the image data from the test dataset into feature vectors. The well-trained ensemble of five binary SVM classifiers was then used to perform predictive classification, with the final predicted class determined by the maximum value among the five computed results. The classification outcomes were likewise analyzed using a confusion matrix, facilitating a comprehensive assessment of the model's performance (shown in Table 5).
Table 5 MCNN + SVM classification test results.

Comparing Table 4 with Table 5, a notable disparity emerges in pulmonary nodule image classification: the Softmax approach surpassed the SVM classifier. Specifically, Softmax classification outperformed its SVM counterpart in precision and recall across all nodule classes, and the overall accuracy of the Softmax model exceeded that of the SVM, confirming its superiority for pulmonary nodule image classification.

Analysis of experimental results

This study also compared the MCNN with a traditional CNN for feature extraction. The training and testing of CNN + Softmax were essentially the same as for MCNN + Softmax, using the same Softmax classifier and the same dataset for each network. As this was only a comparison test, the network parameters are not repeated here. The same confusion matrix analysis was applied to the classification results (shown in Table 6).

Table 6 CNN + Softmax classification test results.

As shown in Table 7, MCNN + SVM was compared with MCNN + Softmax; essentially, two different classifiers were used with the same MCNN. The F1 value for each nodule type improved with the MCNN + Softmax model, and the overall classification accuracy improved by more than 1%. This shows that the choice of classifier influences recognition accuracy and that the Softmax classifier was more effective. Our experimental results reveal systematic performance differences attributable to the fundamental approaches of these classification methods. SVM's limitations stem from its one-vs-rest decomposition strategy, in which the multiclass problem is divided into five binary classifications, each potentially experiencing a different class distribution during training. Softmax's advantages derive from its integrated multiclass probabilistic framework, which optimizes all classes simultaneously and provides more stable, consistent classification boundaries across all nodule types. Examination of per-class performance reveals distinct patterns: PGGO nodules show the largest improvement with Softmax (+5.28% F1-score), followed by MGGO (+5.18%), with other nodule types showing consistent gains (+3.23% to +3.48%). This pattern suggests that complex or subtle nodule features benefit significantly from Softmax's unified learning approach compared with SVM's binary decomposition. The probabilistic outputs of Softmax also provide confidence measures crucial for medical applications, while its computational efficiency (a single model versus five binary models) offers practical advantages for clinical deployment.

Subsequently, the two models MCNN + Softmax and CNN + Softmax were compared. The MCNN used in this study was more accurate than the traditional CNN in both the per-class F1 values and the overall classification accuracy. In particular, the improvement in recognizing SN types, PGGO types, and untreated normal types was relatively large, with F1 values improving by more than 2.27%. The main reason is that the multiresolution, multiscale image sampling preprocessing enabled the image features to be better extracted.

Table 7 Comparison of F1 values (%).

Compared with similar methods: one study combined a 3D Faster R-CNN-like architecture for lung nodule detection with CMixNet, a U-Net-like encoder–decoder, for learning nodule features47; the model achieved an accuracy of 96.33%, but classified nodules only as benign or malignant. Other scholars conducted comparative studies of lung nodule detection with different deep CNN models48,49,50, including TLbasedCNN, Ensem2DCNN, Multi-cropCNN, and MMEL-3DCNN; the best of these reached an accuracy of 95.5% and an F1-score of 95.24%. These accuracy and F1-score figures are close to those of our study. However, previous studies did not perform fine-grained classification of different lung nodule types, whereas our MCNN approach targets five common lung nodule types for classification, demonstrating practical applicability and outperforming previous methods in both accuracy and F1-score.

In summary, the MCNN combined with the Gaussian-difference pyramid technique proved feasible for lung nodule shadow recognition and classification scenarios.

Discussion

In this study, the proposed multiscale convolutional neural network model offers several advantages over traditional methods for lung nodule classification. The strengths of CNNs lie mainly in their powerful feature capture, automatic feature learning, parallel computing, scalability, robustness, and generalization, which give them a significant advantage in complex image and vision tasks51. By integrating Gaussian pyramid decomposition, the MCNN can better exclude noise and invalid points, improving classification accuracy. This multiscale approach enhances the model's ability to detect subtle differences between nodule types, leading to improved classification accuracy. As demonstrated in our experiments, the MCNN achieved an overall accuracy of 96.48%, significantly outperforming the traditional CNN model, which had an accuracy of 92.34%. Specifically, the MCNN showed significant improvements in classifying solid nodules and pure ground-glass opacity nodules, with F1 scores increasing by over 2.0%.

Another advantage of the MCNN is reduced reliance on complex preprocessing and manual feature extraction. Traditional methods often require extensive image processing steps and handcrafted features, which can be time-consuming and prone to errors. In contrast, the MCNN leverages the automatic feature learning capabilities of deep learning, streamlining the classification process and minimizing potential sources of error.

Positioning MCNN within deep learning architectures

To contextualize our MCNN's performance within contemporary deep learning architectures, our GPD-based approach offers a fundamentally different paradigm from established state-of-the-art models. EfficientNetV252 achieves high performance through compound scaling and architectural optimization, ConvNeXt53 demonstrates competitive results by modernizing convolutional designs, and Swin Transformers54 utilize hierarchical feature representation with sophisticated attention mechanisms; our method instead prioritizes preprocessing-based multiscale feature enhancement through deterministic mathematical decomposition. This distinction provides significant computational advantages over transformer-based models, which typically require substantial resources for attention computations. It also enables our approach to be integrated with lightweight CNN architectures while maintaining effectiveness, a crucial consideration for resource-constrained clinical environments where deployment feasibility and computational transparency are critical factors.

Multiscale architecture comparison

Our MCNN approach occupies a unique position within the established taxonomy of multiscale CNN architectures for medical image analysis. The field has evolved through several paradigmatic approaches, each addressing multiscale feature extraction through distinct methodological frameworks. Architectural multiscale methods achieve scale diversity through network design modifications, where U-Net family approaches including the foundational U-Net55, UNet++56, and UNet 3+57 employ skip connections and nested architectures for multiscale feature integration. Similarly, Feature Pyramid Networks58 utilize bottom-up and top-down pathways with lateral connections, while the DeepLab series59,60 leverage dilated convolutions and Atrous Spatial Pyramid Pooling to build multiscale features within the network architecture itself.

Attention-based multiscale methods represent a more recent paradigm that combines scale processing with learned attention mechanisms. These approaches, including various attention U-Net variants and multi-scale attention networks, demonstrate high performance through sophisticated attention computations. However, these methods require complex attention parameter tuning and substantial computational resources for attention weight calculation during both training and inference phases.

In contrast, preprocessing-based multiscale methods, as exemplified by our approach, integrate classical pyramid techniques with the CNN architecture at the data preparation stage. Our GPD–CNN combination offers several distinct advantages over existing paradigms. Firstly, it provides a mathematical foundation: Gaussian pyramids deliver a theoretically grounded scale-space representation. Secondly, it achieves computational efficiency by eliminating complex attention computations during training and inference. Thirdly, it ensures deterministic processing, with consistent multiscale feature extraction across all inputs. Finally, it maintains architectural simplicity, employing standard CNN structures with enhanced preprocessing rather than complex multi-branch designs. Performance evaluation demonstrates that our approach achieves 96.48% classification accuracy with F1 improvements exceeding 2.0% for solid and pure ground-glass nodules, delivering competitive results at significantly lower computational complexity than attention-based methods.

Clinical workflow integration and practical implementation

The successful integration of our MCNN model into clinical practice requires careful consideration of existing radiological workflows and practical implementation challenges. In current clinical settings, lung nodule evaluation typically follows a structured approach where radiologists review CT scans, identify potential nodules, assess their characteristics, and make recommendations based on established guidelines such as the Fleischner Society recommendations or Lung-RADS criteria.

Our MCNN model can be seamlessly integrated into this workflow as a computer-aided detection and diagnosis (CAD) tool that operates in parallel with radiologist review. The proposed integration workflow would involve: (1) automatic preprocessing of incoming CT scans through our GPD-based multiscale decomposition, (2) real-time analysis and classification of detected nodules with confidence scores, (3) generation of structured reports highlighting nodules by risk category, and (4) presentation of results through the existing Picture Archiving and Communication System (PACS) interface familiar to radiologists.

Model limitations and future improvements

However, the MCNN model has limitations. One significant challenge is its dependence on large and diverse datasets for training. Although we utilized over 10,000 images with balanced representation across our five studied nodule types (SN, PGGO, MGGO, special-type, and normal), several categories of lung nodules were not specifically included in our classification framework. These include hamartomas, bronchial adenomas, and papillomas, which may limit the model’s applicability to the full spectrum of lung nodule pathology encountered in clinical practice. Additionally, 11% of images required additional preprocessing due to motion artifacts or suboptimal contrast enhancement, and our dataset included images from both hospital CT scanners and public databases with potentially different scanner types and imaging protocols, which may introduce inter-scanner variability and limit generalizability across different clinical systems.

The model is computationally intensive, requiring approximately 18 h of training on an NVIDIA GeForce RTX 3080 with 10 GB of GDDR6X memory, 12 CPU cores, and 32 GB of RAM. Inference averages 0.3 s per image and requires 6 GB of GPU memory, which may exceed the capacity of standard clinical workstations and limit accessibility in resource-constrained clinical environments, particularly in smaller hospitals or developing regions.

Another limitation is the lack of longitudinal data, which restricts the model’s ability to track the evolution of lung nodules over time. Incorporating temporal information could enhance the model’s diagnostic capabilities and provide more insights into disease progression. Future work should focus on collecting more comprehensive datasets, employing techniques to address class imbalance such as data augmentation or weighted loss functions, and optimizing computational efficiency for broader clinical deployment.

Despite these challenges, the MCNN represents a significant advancement in automated lung nodule classification, offering higher accuracy and efficiency. Its ability to handle multi-scale features makes it a robust tool for clinical applications, potentially aiding radiologists in making more accurate and timely diagnoses.

Conclusion

This study addresses the preliminary diagnosis of pulmonary nodules in clinical practice by proposing a novel lung nodule recognition and classification framework rooted in deep learning, specifically the MCNN. The model is validated against clinical data obtained from a public archive and from Chongming Hospital, Shanghai University of Medicine & Health Sciences.

The novel MCNN model integrates Gaussian pyramid decomposition to enhance the classification of lung nodules in CT images. The innovative use of GPD allows the model to extract features at multiple scales, addressing the variability in nodule sizes and improving detection accuracy across different nodule types. This approach represents a significant advancement over traditional single-scale CNN models and other multiscale methods that rely on attention mechanisms. The experimental results reinforce this claim, demonstrating the MCNN's superior performance and validating the methodological advancement.

The MCNN model offers positive opportunities for the field of medical imaging and lung cancer diagnosis. Its high classification accuracy can assist radiologists in making more precise diagnoses, potentially leading to earlier detection and better patient outcomes. By automating the classification process, the MCNN can reduce the workload on medical professionals, allowing them to focus on more complex cases and improving overall efficiency in clinical settings. Moreover, the methodology developed in this study can be extended to other medical imaging tasks where multi-scale feature extraction is beneficial, such as the detection of tumors in other organs or the analysis of different types of medical images, highlighting its broader impact and applicability.

To further enhance the performance and applicability of the MCNN model, several avenues for future research are proposed. First, expanding the dataset to include a larger and more diverse collection of lung nodule images will help improve the model’s generalizability and robustness. Incorporating longitudinal data will enable the model to track changes in nodules over time, providing valuable information on disease progression and response to treatment. Additionally, addressing class imbalance through advanced data augmentation techniques or modified loss functions will enable the model to perform better across all nodule types, including those less represented in the dataset. Furthermore, exploring the integration of other deep learning techniques, such as attention mechanisms or transfer learning, could further enhance the model’s feature extraction capabilities and classification accuracy. Finally, conducting clinical trials to validate the MCNN model’s performance will be crucial for its adoption in clinical practice, ensuring its reliability and effectiveness in diverse clinical environments.