Table 3 Comparison of datasets: strengths and limitations.

From: Optimizing non small cell lung cancer detection with convolutional neural networks and differential augmentation

Dataset

Strengths

Limitations

IQ-OTH/NCCD (Selected Dataset)

Strengths:

- Covers benign, malignant, and normal categories, allowing for a wider scope of diagnosis.

- High-quality 1 mm slice thickness for better resolution.

- Real-world data from a variety of demographic groups.

- Balanced dataset with a sufficient number of images (1,097 CT images).

- Clear labeling of tumors as benign or malignant.

Limitations:

- Limited in size (1,097 images), which may not capture all tumor variations.

- Drawn from a smaller cohort than large-scale datasets.

LC25000 (Histopathological Dataset)

Strengths:

- Contains 25,000 images across five cancer and tissue categories, providing a large and diverse dataset.

- Balanced classes ensure no bias toward any one category.

Limitations:

- Composed of histopathological images, not CT scans, so it may not be directly applicable to tasks involving CT image analysis.

- Does not include normal tissues as explicitly as the IQ-OTH dataset.

Lung-PET-CT-Dx (Large-Scale CT/PET)

Strengths:

- 251,135 de-identified CT/PET-CT images provide a large dataset with expert annotations.

- Focuses on major lung cancer histopathological subtypes.

Limitations:

- The very large dataset size can lead to high computational costs for model training.

- Focuses more on CT/PET-CT images than on distinguishing between benign and malignant lung tissues.

NLST (National Lung Screening Trial)

Strengths:

- Large-scale randomized trial data with high-risk participants for lung cancer screening.

- Provides longitudinal data with follow-up screenings.

Limitations:

- Screening-specific data may not cover the breadth of tumor types.

- Not specifically designed for training models; it lacks labeled tumor images and annotations for model development.