Table 3 Comparison of datasets: strengths and limitations.
Dataset | Strengths | Limitations |
|---|---|---|
IQ-OTH/NCCD (Selected Dataset) | - Covers benign, malignant, and normal categories, allowing for a wider scope of diagnosis. - High-quality 1 mm slice thickness for better resolution. - Real-world data from a variety of demographic groups. - Balanced dataset with a sufficient number of images (1,097 CT images). - Clear labeling of tumors as benign or malignant. | - Limited in size (1,097 images), which may not capture all tumor variations. - Draws on a smaller cohort than large-scale datasets. |
LC25000 (Histopathological Dataset) | - Contains 25,000 images across five cancer and tissue categories, providing a large and diverse dataset. - Balanced classes ensure no bias toward any one category. | - Composed of histopathological images, not CT scans, so it may not be directly applicable to tasks involving CT image analysis. - Does not include normal tissues as explicitly as the IQ-OTH/NCCD dataset. |
Lung-PET-CT-Dx (Large-Scale CT/PET) | - 251,135 de-identified CT/PET-CT images provide a large dataset with expert annotations. - Focuses on major lung cancer histopathological subtypes. | - Very large dataset can lead to high computational costs for model training. - Focuses more on CT/PET-CT images than on distinguishing between benign and malignant lung tissues. |
NLST (National Lung Screening Trial) | - Large-scale randomized trial data with high-risk participants for lung cancer screening. - Provides longitudinal data with follow-up screenings. | - Screening-specific data may not cover the breadth of tumor types. - Not specifically designed for training models, lacking labeled tumor images and annotations for model development. |