Introduction

Chondroid tumors comprise a diverse spectrum of neoplasms characterized by the production of a cartilage-like matrix, ranging from benign forms, such as enchondroma, to malignant forms such as chondrosarcoma1. While benign chondroid tumors are often asymptomatic and readily diagnosed on medical imaging, their malignant counterparts pose significant challenges due to their invasive nature and potential for metastasis2,3. Differentiating between benign and malignant tumors is crucial, as it determines the treatment strategy: conservative monitoring for benign cases versus surgical intervention for malignant ones. However, this distinction is often difficult to make based solely on morphological findings from traditional medical imaging, highlighting the need for advanced diagnostic tools to improve precision and consistency in chondroid tumor classification.

One promising approach is radiomics, a technique that extracts a large array of quantitative features from medical images4,5. By converting qualitative imaging data into measurable features based on shape, structure, and intensity, radiomics enables objective analysis6. This methodology has driven significant advances in the classification of musculoskeletal tumors, including chondroid tumors, where differentiation between benign and malignant lesions is challenging7. Radiomics allows clinicians to conduct a comprehensive analysis of tumor heterogeneity8. However, radiomics alone struggles to capture complex, high-dimensional patterns, and not all radiomics features are clinically relevant9,10.

To address these limitations, deep learning (DL) has emerged as a complementary approach11. DL is increasingly applied to medical imaging due to its ability to automatically extract and learn complex features from large datasets12,13. In the classification of chondroid tumors, DL models can outperform radiologists or radiomics-based approaches by identifying complex patterns in MRI14,15. DL also reduces the time required for labor-intensive processes such as manual segmentation and feature extraction by extracting features automatically16,17. However, when extracting deep, high-dimensional features, DL can overlook clinically relevant handcrafted features that radiomics provides.

Recognizing the complementary strengths of radiomics and deep learning, recent research has focused on combining these two methods to develop robust and high-performance tumor classification models18. This hybrid strategy integrates handcrafted radiomics features with the automatically learned high-dimensional features of DL, leading to significant improvements in classification accuracy19. The combined approach has proven particularly valuable for predicting malignant nodules, where both handcrafted radiomics and deep features are crucial for distinguishing benign from malignant lesions20. Studies have shown that machine learning models integrating both radiomics and deep features outperform models that use either method alone, offering a more comprehensive analysis21,22. This hybrid approach represents a promising direction for improving diagnostic precision and supporting clinical decision-making in musculoskeletal oncology.

The integration of radiomics and deep learning has proven effective not only in 2D but also in 3D imaging, providing robustness and versatility in tumor classification23,24. Recent studies have used 3D radiomics features to extract morphological and textural details from volumetric data, providing a more comprehensive view of tumor characteristics than 2D approaches in breast and lung cancers25,26. These findings demonstrate the growing utility of 3D imaging in capturing complex morphological and textural features, thereby enhancing classification performance in oncology. Therefore, the purpose of this study is to develop a streamlined approach for chondroid bone tumors based on automatic segmentation and classification using combined radiomics and DL features.

Results

Clinical characteristics

A total of 147 patients who underwent MRI for chondroid tumors were included in our study. The MRI sequences consisted of coronal or sagittal fat-suppressed T2-weighted images; axial T1-weighted, T2-weighted, and fat-suppressed T2-weighted images; DWI with ADC map; and contrast-enhanced fat-suppressed T1-weighted images. The data were divided into training and testing sets: the training set consisted of 117 patients and the testing set of 30 patients. The mean age was 42 years (range: 12–78) in the training set and 43 years (range: 20–78) in the testing set. There were 41 male and 76 female patients in the training set, and 8 males and 22 females in the testing set. Tumors were classified into three categories: enchondroma (class 0), grade 1 chondrosarcoma (class 1), and grade 2 & 3 chondrosarcoma (class 2). The training set contained 94 enchondromas, 14 grade 1 chondrosarcomas, and 9 grade 2 & 3 chondrosarcomas. The testing set comprised 24 enchondromas, 3 grade 1 chondrosarcomas, and 3 grade 2 & 3 chondrosarcomas (Table 1).

Table 1 Patient characteristics.

Segmentation performance

Table 2 shows the segmentation performance, which is consistent across the five folds, with an average training loss of 0.1572 (range: 0.1496–0.1695). The validation loss ranged from 0.1993 to 0.3215 across folds, with fold 3 performing the best. Regarding segmentation accuracy, fold 3 also produced the best results, with the highest Dice score (0.9118) and mean validation Dice score (0.8833). Overall, the average Dice score (0.8491) and mean validation Dice score (0.8488) across all folds indicate robust segmentation. This segmentation performance influences the extracted shape features. However, we did not manually correct the segmentations, since our goal was to build an end-to-end automated pipeline from segmentation to classification.

Table 2 Performance Results of 3D U-Net for Chondroid Tumor Segmentation.

Classification performance

Models were evaluated on multiple metrics, including accuracy, weighted kappa, and area under the curve (AUC), which together provide comprehensive insight into classification performance. The following analysis examines the classification capability of each modeling approach and highlights the importance of integrating different types of data. Table 3 presents the classification performance across five models using five different classifiers. Receiver operating characteristic (ROC) curves of the models are shown in Fig. 1. The confusion matrix of the CatBoost classifier, which demonstrated the highest performance among the models, is presented in Fig. 2. Supplementary Fig. 2 illustrates confusion matrices for all five models and classifiers.

Table 3 Classification performance of features combination models.
Fig. 1
figure 1

Comparison of ROC curves between models for chondroid tumor classification. Each curve plots the true positive rate against the false positive rate for five models: Model 1 (Radiomics-Only, blue), Model 2 (Deep Learning-Only, green), Model 3 (Radiomics + Deep Learning, pink), Model 4 (Radiomics + Deep Learning + Clinical, red), and Model 5 (Radiomics + Deep ROI + Clinical, cyan). The dashed line illustrates a random classifier (AUC = 0.5). Model 5 shows the highest AUC, indicating the best classification performance, followed by Models 4, 3, 2, and 1.

Fig. 2
figure 2

Confusion matrices of the CatBoost classifier across different combined-feature datasets. Each matrix represents performance for one of the 5 models: (Top left) Model 1: Radiomics-Only, (Top right) Model 2: Deep Learning-Only, (Middle left) Model 3: Radiomics + Deep Learning, (Middle right) Model 4: Radiomics + Deep Learning + Clinical, (Bottom left) Model 5: Radiomics + Deep ROI + Clinical. The matrices show the number of correctly and incorrectly classified samples in each class (Class 0: Enchondroma, Class 1: Grade 1 chondrosarcoma, and Class 2: Grade 2 & 3 chondrosarcoma). Darker cells along the diagonal indicate larger counts of correct classifications, reflecting the enhanced performance of the feature-integration models, particularly Model 5.

Model 1: radiomics-only features model

This model achieved its highest performance with the CatBoost classifier, yielding an accuracy of 0.81 (95% CI 0.81–0.87), a weighted kappa of 0.69, and an AUC of 0.81. While this represents moderate classification performance, it highlights the limitation of relying solely on radiomics features: the AUC indicates reasonable discriminative power, but the overall performance points to the need to integrate additional features to enhance predictive ability.

Model 2: deep learning-only features model

This model showed slightly improved metrics, with the best performance again achieved by CatBoost: an accuracy of 0.84 (95% CI 0.84–0.89), a weighted kappa of 0.71, and an AUC of 0.79. This performance illustrates the strength of deep learning features, particularly in capturing complex patterns in images, and reinforces that deep learning can outperform traditional methods in capturing intricate features.

Model 3: radiomics + deep learning model

This model demonstrated a marked improvement, achieving its highest performance with the CatBoost classifier: an accuracy of 0.87 (95% CI 0.86–0.92), a weighted kappa of 0.84, and an AUC of 0.85. The increase in classification accuracy demonstrates the benefit of leveraging both radiomics and deep learning features. The high weighted kappa indicates strong agreement between the model predictions and the ground truth, suggesting the model's stability in practical applications.

Model 4: radiomics + deep learning + clinical model

This model again achieved its best scores with CatBoost, attaining an accuracy of 0.89 (95% CI 0.88–0.92), a weighted kappa of 0.78, and an AUC of 0.87. The integration of clinical information further enhanced classification ability and underscores the importance of combining various data types to improve model performance. The high AUC indicates the model's effectiveness in distinguishing different chondroid tumor classes.

Model 5: radiomics + deep ROI + clinical model

This model achieved the highest overall classification performance among all models, again with CatBoost as the best classifier: an accuracy of 0.90 (95% CI 0.90–0.93), a weighted kappa of 0.85, and an AUC of 0.91. This result highlights the effectiveness of integrating deep ROI features, which focus the model on the most important tumor areas. The high AUC and accuracy suggest that this approach can enhance predictive performance and improve the reliability of chondroid tumor classification in clinical practice.

Discussion

In this study, we developed an automatic segmentation and classification model for chondroid bone tumors by integrating radiomics and DL features. Chondroid tumors represent a heterogeneous group of neoplasms characterized by the production of a chondroid matrix and are classified into benign, intermediate, and malignant categories according to the 2020 WHO classification27. While histologic and cytogenetic criteria provide the foundation for this classification, conventional MRI findings—such as the extent of marrow replacement and cortical disruption—often overlap between adjacent tumor grades, making imaging differentiation challenging28,29.

Advanced MRI techniques, such as DWI and dynamic contrast-enhanced MRI (DCE-MRI), have been introduced to provide quantitative parameters related to tumor cellularity and vascularity30. However, these modalities also exhibit considerable overlap in measured parameters among tumor grades, limiting their reliability for precise grading31.

These limitations underscore the need for alternative diagnostic strategies, such as radiomics and deep learning, which can capture subtle imaging patterns and integrate diverse information to enhance diagnostic accuracy. Moreover, adopting a streamlined approach that incorporates automatic tumor segmentation can enhance the consistency of feature extraction and increase clinical usability, further supporting the integration of these advanced methods into routine practice.

Our segmentation model demonstrated strong performance, achieving an average Dice score of 0.849 and a highest Dice score of 0.912 for chondroid tumor segmentation on MRI. These results highlight the effectiveness of the 3D U-Net architecture in accurately segmenting tumor regions in MRI scans. Compared with previous studies, our results are on par with or exceed segmentation performance in related fields. While no prior research has specifically assessed chondroid bone tumor segmentation using MRI, related studies on other bone tumors provide helpful context.

Ye et al.32 developed a multitask deep learning framework for detecting, segmenting, and classifying primary bone tumors and infections using multi-parametric MRI, reporting Dice scores of 0.75 ± 0.26 for T1-weighted imaging and 0.70 ± 0.33 for T2-weighted imaging on an external validation set. Although these scores are slightly lower than our results, their multi-center study demonstrates the broad applicability of MRI-based methods for bone tumors. Similarly, Wang et al.33 achieved Dice scores of 0.871 on a testing set and 0.815 on an independent validation set for both bone and soft tissue tumors using a deep learning model. Other imaging modalities generally show lower segmentation performance. PET-CT and CT-based methods reported lower Dice scores, reflecting the challenges of limited soft-tissue contrast34,35. Similarly, von Schacky et al.36 focused on deep multitask learning for X-ray images and reported a Dice score of 0.6 ± 0.37, emphasizing the limitations of 2D image-based methods. These results highlight the significance of our MRI-based segmentation approach, which uses high-contrast soft-tissue imaging and the advanced 3D U-Net architecture, thereby establishing a benchmark for automated chondroid tumor segmentation.

Our study demonstrates that a hybrid model integrating diverse data types—specifically radiomics features, deep learning features, and clinical parameters—can achieve superior classification performance compared to models relying on a single data source. The progressive improvements observed in accuracy, weighted kappa, and AUC across the models highlight the importance of combining complementary feature sets. Notably, Model 5, which incorporated radiomics features, deep ROI features, and clinical parameters, demonstrated the highest performance, with an accuracy of 0.90 (95% CI 0.90–0.93), a weighted kappa of 0.85, and an AUC of 0.91. In comparison to previous studies, our approach offers several novel contributions. A previous study of bone chondrosarcoma classification using only MRI radiomics features reported an AUC of 0.85 for the training set and 0.78 for the testing set, lower than our results, possibly due to the absence of integrated deep learning and clinical information37. Another study using CT of chondroid bone tumors achieved accuracies ranging from 0.75 to 0.81 and AUCs between 0.78 and 0.89 for distinguishing atypical cartilaginous tumors from chondrosarcomas38. Similarly, radiograph-based models reported accuracies from 0.79 to 0.83. However, direct comparison with CT- and radiograph-based approaches is difficult, as these models may rely primarily on calcification patterns and often lack sufficient soft-tissue contrast39. Co-clinical radiomics studies, such as Roy et al.40, have explored the correlation between clinical and radiomics features, which strongly supports our hybrid approach. Our study advances the field by enabling automated classification based on segmentations, radiomics features, deep learning-derived features, and clinical variables, coupled with subsequent feature selection, to significantly improve classification performance compared to the radiomics-only or deep-feature-only approaches reported in previous studies.

Our study has several limitations. First, external validation was not feasible due to the rarity of chondroid tumors. The consistent segmentation performance across cross-validation folds and the improved classification performance across multiple classifiers suggest that our approach yields reproducible results within the current dataset; however, its robustness and generalizability to broader populations cannot be guaranteed without external validation, preferably using multicenter datasets. Second, the relatively small sample size increases the risk of overfitting during feature selection. We applied LASSO regression nested within a fivefold cross-validation framework to select 15 radiomics features and 15 deep features, aiming to reduce overfitting, although this approach does not fully eliminate the risk. Third, we used the 3D U-Net model for segmentation without comparing alternative segmentation methods, which might offer complementary performance; however, our focus was on improving classification performance by combining feature sets. Fourth, we did not evaluate inter-rater agreement for segmentation. While the intraclass correlation coefficient (ICC) is commonly employed to assess feature reproducibility across raters or repeated measures, our focus in this study was on feature selection stability within a single ground-truth framework. Future studies incorporating multi-rater annotations or test–retest data could further explore ICC-based reproducibility assessments. Lastly, the class imbalance in our dataset reflects the natural prevalence of chondroid tumor subtypes, with benign tumors being significantly more common than intermediate- and high-grade tumors. To mitigate this, we applied SMOTE for data augmentation, which addressed the imbalance to some extent, although this method may not fully resolve the challenges associated with underrepresented classes.

Further studies should focus on validating these models with larger datasets from multiple institutions to establish generalizability. In addition, the integration of other data modalities, such as genomic or histopathological information, may provide a more comprehensive understanding of tumor characteristics, potentially leading to more effective classification models and advancing personalized medicine in bone oncology. In conclusion, our study demonstrates that integrating radiomics, deep learning, and clinical information improves the classification performance for chondroid tumors. The improvement in performance metrics across the models highlights the potential of a multi-modal approach to enhance clinical diagnosis.

Method

This retrospective study was approved by the institutional review board (IRB of The Catholic University of Korea, Seoul St. Mary’s Hospital, approval number: KC21RASI0081). All procedures performed in this study were conducted in accordance with relevant guidelines and regulations, including the principles outlined in the Declaration of Helsinki. The IRB waived the requirement for informed consent due to the retrospective nature of the study.

Study population

In this study, we retrospectively collected data from patients who were consecutively diagnosed with primary chondroid tumors and underwent multi-sequence MRI preoperatively on Vida or Verio scanners (Siemens Healthcare, Erlangen, Germany) from November 2008 to December 2021. A total of 183 patients were initially identified.

After applying exclusion criteria, including small lesion size (< 1 cm, n = 20), motion artifacts (n = 9), incomplete sequences (n = 6), and previous biopsies or treatments (n = 3), 38 patients were excluded from the study. The final dataset comprised 147 patients with primary chondroid tumors, classified as follows: 118 enchondromas, 17 grade 1 chondrosarcomas, and 12 grade 2 or 3 chondrosarcomas (Fig. 3). A sample of MRI sequences with corresponding segmentations is shown in Supplementary Fig. 1, and the complete clinical characteristics are described in Supplementary Table 1.

Fig. 3
figure 3

Patient selection flowchart for chondroid tumor classification based on MRI sequences from 2008 to 2021.

3D U-Net segmentation

Lesion segmentation was performed using the open-source software ITK-SNAP (version 3.9.0, http://www.itksnap.org/). A musculoskeletal radiologist with 3 years of experience (H.P) performed the initial segmentation along the tumor borders on each sequence, which was subsequently validated by another musculoskeletal radiologist with 16 years of experience (J.Y.J). We utilized a 3D U-Net to segment the suspicious lesions, as this architecture has proven to be a standard solution for medical image segmentation41 (Fig. 4). Roy et al.42 also demonstrated the feasibility of computer-aided segmentation for T2-weighted MRI sequences, which supports the use of the 3D U-Net approach. The U-Net architecture was deployed using the nnUNetv2 library43 with the 3d_fullres configuration to handle the high-resolution volumetric data. To cope with the shortage of medical data for segmentation, we applied structured pretraining and transfer learning to nnUNetv2: we fine-tuned a pretrained nnUNetv2 model on our chondroid tumor dataset, transferring learned weights from a publicly available medical tumor segmentation dataset. Different sequences from the same patient were treated as individual training samples in order to train a unified model capable of segmenting tumors across various sequence types. To prevent data leakage, all MRI sequences from a single patient were assigned to the same fold during fivefold cross-validation, ensuring that no patient's data appeared simultaneously in both the training and validation sets.
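A minimal sketch of this patient-level fold assignment, assuming scikit-learn's GroupKFold as the grouping mechanism (file names and patient identifiers are illustrative):

```python
from sklearn.model_selection import GroupKFold

# Illustrative sequence paths: several MRI sequences per patient.
sequences = ["p1_T1.nii", "p1_T2.nii", "p2_T1.nii", "p2_T2.nii",
             "p3_T1.nii", "p3_T2.nii", "p4_T1.nii", "p4_T2.nii",
             "p5_T1.nii", "p5_T2.nii"]
patient_ids = [s.split("_")[0] for s in sequences]

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(gkf.split(sequences, groups=patient_ids)):
    train_patients = {patient_ids[i] for i in train_idx}
    val_patients = {patient_ids[i] for i in val_idx}
    # No patient contributes sequences to both sides of the split.
    assert train_patients.isdisjoint(val_patients)
```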

Fig. 4
figure 4

Illustration of the 3D U-Net architecture for chondroid tumor segmentation. The model operates on 3D image patches of size [64 × 64 × 64] with voxel spacing standardized to [2.0, 0.265, 0.265] mm. The five-stage encoder-decoder structure halves the resolution at each encoder stage, starting from the input of (64 × 64 × 64) and decreasing to (32 × 32 × 32), (16 × 16 × 16), (8 × 8 × 8), and finally (4 × 4 × 4). The decoder path restores the resolution by upsampling, reversing the process back to the original dimensions. Skip connections link each encoder layer to the corresponding decoder layer to retain spatial information.

Features extraction

Tumor delineation

We utilized multiple MRI sequences, including T1-weighted, T2-weighted, gadolinium-enhanced fat-suppressed T1-weighted, and DWI with ADC. These sequences were chosen to provide comprehensive information on chondroid tumors across modalities, which enhances the classification outcomes. After training and fine-tuning the 3D U-Net for automatic segmentation, the segmented regions of interest were used as the primary input for the subsequent feature extraction phase.

Radiomics features (\(f_{radiomics}\))

Radiomics extracts diverse quantitative features representing tumor heterogeneity from the ROI.

$$f_{radiomics} = \left\{ f_{first\text{-}order},\ f_{shape},\ f_{texture},\ f_{wavelet} \right\}$$

First-order statistics \((f_{first\text{-}order})\) capture intensity distribution metrics such as the mean \((\mu)\), standard deviation \((\sigma)\), skewness \((S)\), and kurtosis \((K)\). Shape features \((f_{shape})\) describe geometric characteristics such as tumor volume \((V)\) and surface area \((A)\). Texture features \((f_{texture})\), based on the Gray-Level Co-occurrence Matrix (GLCM), capture spatial patterns in voxel intensities. Wavelet features \((f_{wavelet})\) provide multi-resolution information, capturing both coarse and fine-grained tumor characteristics. The full list of radiomics features is provided in Supplementary Table 2.
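As an illustration, these feature families can be extracted with the open-source PyRadiomics library (an assumed tooling choice; file names and settings are illustrative, and the study's exact extraction parameters may differ):

```python
from radiomics import featureextractor

extractor = featureextractor.RadiomicsFeatureExtractor()
extractor.disableAllFeatures()
extractor.enableFeatureClassByName("firstorder")  # mean, std, skewness, kurtosis, ...
extractor.enableFeatureClassByName("shape")       # volume, surface area, ...
extractor.enableFeatureClassByName("glcm")        # GLCM texture features
extractor.enableImageTypeByName("Wavelet")        # wavelet-filtered variants

# Paths are placeholders for an MRI volume and its tumor mask.
features = extractor.execute("image.nii.gz", "mask.nii.gz")
f_radiomics = {k: v for k, v in features.items()
               if not k.startswith("diagnostics")}
```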

Deep learning features (\(f_{DL}\))

DL features were extracted using a Convolutional Neural Network (CNN) architecture44 trained on image patches of size 64 × 64 × 64 taken from the original MRI scans. We separately extracted two types of deep learning features: one from the whole volume and one from the ROI. The whole-volume features capture global contextual patterns and the relationship of the tumor with its surrounding tissue, while the deep ROI features encode fine-grained, detailed information about the tumor itself. The EfficientNet45 encoder consists of multiple convolutional layers \(l \in \left\{ 1, \ldots, L \right\}\), followed by activation and pooling layers. The activation layers introduce non-linearity and help the network learn complex relationships, while the pooling layers down-sample the feature maps and capture hierarchical representations within the image. Each CNN layer \(l\) computes the feature map \(F_{l}\) as:

$$F_{l} = \sigma \left( W_{l} * F_{l-1} + b_{l} \right)$$

where \(W_{l}\) represents the weight filters for layer \(l\), \(b_{l}\) is the bias term, and \(\sigma\) is the activation function (e.g., ReLU). The final deep feature vector \(f_{DL}\) is obtained by flattening the last convolutional layer’s output:

$$f_{DL} = \mathrm{Flatten}\left( F_{L} \right)$$
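The computation above can be sketched with a small 3D CNN encoder standing in for the EfficientNet encoder (layer sizes are illustrative, not the study's architecture):

```python
import torch
import torch.nn as nn

# Each block computes F_l = ReLU(W_l * F_{l-1} + b_l), then pools.
encoder = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),   # 64 -> 32
    nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),  # 32 -> 16
    nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),  # 16 -> 8
    nn.AdaptiveAvgPool3d(1),  # pool the last feature map F_L
    nn.Flatten(),             # f_DL = Flatten(F_L)
)

patch = torch.randn(1, 1, 64, 64, 64)  # one 64 x 64 x 64 MRI patch
f_dl = encoder(patch)                   # deep feature vector, shape (1, 64)
```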

Deep ROI (\(f_{ROI}\))

In this study, we extracted deep features both from the whole MRI volumes and from the segmented tumor regions of interest (ROI). DL features extracted from the ROI focus on the tumor region and capture detailed, relevant information. This approach can enhance classification performance, since the network prioritizes features from the most important area. Both feature types were used to evaluate the impact of whole-image and ROI-based features on the classification results.

$$f_{ROI} = \mathrm{Flatten}\left( F_{L}^{ROI} \right)$$
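A minimal sketch of one way to restrict the encoder's input to the segmented region (masking and bounding-box cropping are assumed preprocessing choices; arrays are synthetic):

```python
import numpy as np

volume = np.random.rand(64, 64, 64).astype(np.float32)  # MRI patch
mask = np.zeros_like(volume)
mask[20:44, 20:44, 20:44] = 1.0                          # 3D U-Net tumor mask

roi_volume = volume * mask                 # suppress voxels outside the tumor
coords = np.argwhere(mask > 0)
lo, hi = coords.min(axis=0), coords.max(axis=0) + 1
roi_crop = roi_volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
# roi_crop is then resized/padded back to 64^3 and fed to the same encoder.
```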

Clinical features (\(f_{clinical}\))

Clinical features such as age, gender, tumor location, and other relevant parameters were incorporated to complement the imaging features and to evaluate their influence on the model's predictive performance. These included both categorical variables (e.g., gender, tumor location) and quantitative variables (e.g., age). However, we did not include MRI-derived quantitative biomarkers such as ADC, as this information was already represented within the radiomics feature set.

$$f_{clinical} = \left[ age,\ gender,\ tumor\_location,\ \ldots \right]$$
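A minimal sketch of preparing such a clinical feature vector, assuming one-hot encoding of categorical variables with pandas (column names and values are illustrative):

```python
import pandas as pd

clinical = pd.DataFrame({
    "age": [42, 63, 55],                               # quantitative, kept as-is
    "gender": ["F", "M", "F"],                         # categorical
    "tumor_location": ["femur", "humerus", "tibia"],   # categorical
})

# One-hot encode categorical columns to obtain a numeric f_clinical matrix.
f_clinical = pd.get_dummies(clinical, columns=["gender", "tumor_location"])
```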

Data augmentation and feature selection

We used SMOTE46 for data augmentation and LASSO47 for feature selection in our machine-learning pipeline. For data augmentation, we conducted experiments comparing class-balancing techniques, including oversampling and weighted loss functions; the confusion matrices in Supplementary Figs. 4 and 6 illustrate the superior classification results of SMOTE over the other methods. For feature selection, LASSO reduced more than 100 original features to 15 radiomics features (Supplementary Fig. 3) and 15 deep features, yielding 30 selected features in total. This dimension reduction plays an important role in improving the accuracy of machine learning models by focusing on relevant features, thereby optimizing tumor classification for subsequent analysis.

Data augmentation

To address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) was employed for data augmentation. SMOTE generates synthetic samples for a minority class by interpolating between a minority sample \(x\) and one of its k-nearest neighbors \(x_{nearest}\):

$$x_{new} = x + \delta \cdot \left( x_{nearest} - x \right)$$

where \(\delta \in \left[ 0, 1 \right]\) is a random number. This method balances the training samples and enhances the generalizability of the model. In this study, class 0 (enchondroma) is the majority class with 417 training samples, class 1 (grade 1 chondrosarcoma) is a minority class with 58 training samples, and class 2 (grade 2 & 3 chondrosarcoma) is a severely underrepresented class with only 43 training samples. SMOTE selects a sample from a minority class, finds its k-nearest neighbors, randomly selects one of them, and interpolates between the neighbor and the selected sample. This process iterates until the minority class reaches the expected number of samples. Additionally, the SMOTE parameters were fine-tuned to minimize synthetic noise, ensuring alignment with the minority class distribution. Supplementary Fig. 4 demonstrates the effect of SMOTE on classification outcomes with our dataset.
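A minimal sketch of this oversampling step, assuming the imbalanced-learn implementation of SMOTE (the toy data only mirrors the class proportions reported above):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for the training set: ~417 / 58 / 43 samples per class.
X, y = make_classification(n_samples=518, n_features=30, n_informative=10,
                           n_classes=3, weights=[417/518, 58/518, 43/518],
                           random_state=0)

# Interpolate new minority samples between each sample and its k-nearest neighbors.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # minority classes raised to the majority count
```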

Feature selection

After data augmentation, feature selection was performed using LASSO (Least Absolute Shrinkage and Selection Operator) regression to reduce dimensionality. LASSO was nested within each fold of the fivefold cross-validation to prevent bias. The algorithm selected 15 radiomics and 15 deep features by minimizing the following objective function, removing uninformative features:

$$L\left( \beta \right) = \sum_{i=1}^{n} \left( y_{i} - \left( \beta_{0} + \sum_{j=1}^{p} \beta_{j} x_{ij} \right) \right)^{2} + \lambda \sum_{j=1}^{p} \left| \beta_{j} \right|$$

where \(y_{i}\) is the observed value of the \(i\)th sample; \(x_{ij}\) is the \(j\)th feature value of the \(i\)th sample; \(\beta_{j}\) is the coefficient of the \(j\)th feature; \(n\) is the total number of samples; \(p\) is the total number of features; and \(\lambda\) controls the strength of the penalty.

The first term, \(\sum_{i=1}^{n} \left( y_{i} - \left( \beta_{0} + \sum_{j=1}^{p} \beta_{j} x_{ij} \right) \right)^{2}\), is the residual sum of squares, measuring how well the model fits the data. The second term, \(\lambda \sum_{j=1}^{p} \left| \beta_{j} \right|\), is the \(L_{1}\) penalty, which shrinks coefficients toward zero. The selected radiomics features are listed in Table 4 with their names and categories (see the sketch following Table 4).

Table 4 Selected features after applying the LASSO method.
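A minimal sketch of the LASSO-based selection, assuming scikit-learn's Lasso with the tumor class as the regression target and retention of the 15 largest non-zero coefficients (the penalty strength and selection rule are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 120))    # >100 candidate features per sample
y = rng.integers(0, 3, size=100)   # tumor class used as regression target

X_std = StandardScaler().fit_transform(X)      # LASSO is scale-sensitive
lasso = Lasso(alpha=0.05).fit(X_std, y)        # alpha plays the role of lambda

ranked = np.argsort(-np.abs(lasso.coef_))      # rank features by |beta_j|
selected = ranked[:15]                         # indices of the 15 retained features
```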

To understand the impact of different features on the classification performance of each model, we used the built-in feature-importance functionality of the CatBoost classifier. After training the model with CatBoost, the feature importance scores were normalized and adjusted to ensure comparability across feature sets. Summarizing the important features across models, we found that deep ROI features have the most significant impact on classification outcomes, followed by deep learning features and radiomics features; clinical features have the lowest impact. Supplementary Fig. 5 demonstrates these findings.
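A minimal sketch of this step using CatBoost's built-in get_feature_importance method (toy data; the normalization mirrors the adjustment described above):

```python
import numpy as np
from catboost import CatBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))      # 15 radiomics + 15 deep features, illustratively
y = rng.integers(0, 3, size=200)

model = CatBoostClassifier(iterations=200, loss_function="MultiClass", verbose=False)
model.fit(X, y)

scores = model.get_feature_importance()   # one importance score per feature
scores = scores / scores.sum()            # normalize for comparability across sets
top = np.argsort(-scores)[:10]            # most influential features
```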

Model classification and evaluation metrics

This section evaluates five models built by combining features from radiomics, deep learning, and clinical information. The classification algorithms include Random Forest48, Gradient Boosting49, XGBoost50, LightGBM51, and CatBoost52; each classifier has its own strengths in handling complex data and reducing overfitting, allowing us to identify the best method for each feature combination. The dataset was divided into 117 training cases (80%) and 30 testing cases (20%) using stratified sampling to maintain the ratio of chondroid tumor classes (enchondroma, grade 1 chondrosarcoma, grade 2 & 3 chondrosarcoma).

Combining features

We built five classification models by integrating radiomics, deep, and clinical features in various configurations (a minimal sketch of these combinations follows the list):

  • Model 1: Radiomics-Only features model (\(F_{1}\))

    $$F_{1} = f_{radiomics}$$
  • Model 2: Deep Learning-Only features model (\(F_{2}\))

    $$F_{2} = f_{DL}$$
  • Model 3: Radiomics + Deep Learning model (\(F_{3}\))

    $$F_{3} = \left[ f_{radiomics},\ f_{DL} \right]$$
  • Model 4: Radiomics + Deep Learning + Clinical model (\(F_{4}\))

    $$F_{4} = \left[ f_{radiomics},\ f_{DL},\ f_{clinical} \right]$$
  • Model 5: Radiomics + Deep ROI + Clinical model (\(F_{5}\))

    $$F_{5} = \left[ f_{radiomics},\ f_{ROI},\ f_{clinical} \right]$$
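A minimal sketch of these configurations as column-wise concatenations (array shapes are illustrative: 15 radiomics, 15 deep, 15 deep-ROI, and a handful of clinical features):

```python
import numpy as np

rng = np.random.default_rng(0)
f_radiomics = rng.random((100, 15))
f_dl = rng.random((100, 15))
f_roi = rng.random((100, 15))
f_clinical = rng.random((100, 5))

F1 = f_radiomics                                      # Model 1
F2 = f_dl                                             # Model 2
F3 = np.hstack([f_radiomics, f_dl])                   # Model 3
F4 = np.hstack([f_radiomics, f_dl, f_clinical])       # Model 4
F5 = np.hstack([f_radiomics, f_roi, f_clinical])      # Model 5
```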

Fitting and evaluation

Each classification model was trained with a fivefold cross-validation strategy to ensure generalizability and prevent overfitting. The dataset was randomly divided into five subsets; in each iteration, four subsets were used for training and the remaining subset for validation. We used grid search for hyperparameter optimization to maximize classification performance. To evaluate model performance, we employed accuracy, weighted kappa, and AUC. Together, these metrics provide a comprehensive evaluation of the classification models, measuring overall correctness, penalizing ordinal classification errors via the weighted kappa, and assessing the ability to discriminate tumor types via the AUC. The results section presents the performance outcomes of our hybrid approach, including the segmentation performance achieved by the 3D U-Net model and the classification results obtained across the five machine learning classifiers. The overall classification pipeline is illustrated in Fig. 5.
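A minimal sketch of the fitting and evaluation loop for one feature set, assuming scikit-learn's GridSearchCV with CatBoost and quadratic weighting for the kappa (toy data; the actual hyperparameter grid is not specified in the text):

```python
import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(117, 35)), rng.integers(0, 3, size=117)
X_test, y_test = rng.normal(size=(30, 35)), rng.integers(0, 3, size=30)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(
    CatBoostClassifier(loss_function="MultiClass", verbose=False),
    param_grid={"depth": [4, 6], "learning_rate": [0.03, 0.1]},  # illustrative grid
    cv=cv,
)
grid.fit(X_train, y_train)

y_pred = np.ravel(grid.predict(X_test))
acc = accuracy_score(y_test, y_pred)
kappa = cohen_kappa_score(y_test, y_pred, weights="quadratic")   # weighted kappa
auc = roc_auc_score(y_test, grid.predict_proba(X_test), multi_class="ovr")
```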

Fig. 5
figure 5

Classification pipeline incorporating radiomics, deep learning, deep ROI, and clinical features to improve performance.