Introduction

Glioblastoma is the most common primary malignant brain tumor, accounting for approximately 57% of all gliomas and 48% of all primary malignant central nervous system (CNS) tumors1. Despite recent advances in multimodal treatment, the overall prognosis remains poor, with minimal long-term survival2. Only about 5 in 100 glioblastoma patients survive five years after diagnosis, and more than 300,000 people worldwide die from glioma each year3,4. The WHO Classification of Tumors of the Central Nervous System divides gliomas into grades I–IV, with grades I and II considered low-grade gliomas and grades III and IV high-grade gliomas (HGG)5. The prognosis for HGG is even worse: most HGG patients survive less than two years, and the median survival for grade IV disease is only about 14.6 months6. ASCO guidelines recommend that all patients with advanced tumors receive palliative care within 8 weeks of diagnosis, but distinguishing patients with intermediate-grade from those with advanced gliomas remains a challenge7,8.

The exact grading of glioma still needs to be confirmed by histopathology, but preoperative MRI features can substantially improve grading accuracy9. Although biopsy or surgical excision is the primary method for obtaining pathological specimens, sampling error caused by tumor heterogeneity and patients' physical limitations remains a clinical challenge10. With its high resolution and non-invasive, quantifiable characteristics, MRI has become a core imaging tool for grading and prognostic evaluation11,12,13.

Machine learning (ML) is used in various medical fields because of its ability to build robust risk models and improve predictive power14,15. Recent studies have shown that ML algorithms can reliably predict prognostic outcomes from glioblastoma imaging and pathological features16. Currently, the most accurate approach is to build models with radiomics and deep learning from manually or automatically extracted image features17,18,19. However, the classification performance of different ML algorithms for HGG grading is unclear, and the value of handling data imbalance has not been fully established. This study aims to develop and validate an HGG classification model based on MRI radiomics, comparing the performance of various machine learning classifiers, including a stacking fusion model, and to assess the improvement that SMOTE brings in addressing data imbalance. We also conducted an ML-based survival analysis of HGG to explore survival differences and the factors affecting survival.

Materials and methods

Patients

To conduct this prediction study, we used a dataset from Taizhou Cancer Hospital and the Second Affiliated Hospital of Zhejiang University School of Medicine in Zhejiang, China. A total of 238 patients diagnosed with glioblastoma who underwent pathological examination at the two hospitals between March 2013 and June 2018 were initially selected. The exclusion criteria were: (a) missing or incomplete original MRI in Digital Imaging and Communications in Medicine (DICOM) format (incomplete T1WI dataset; n = 52), and (b) unsatisfactory image quality (n = 2). All enrolled cases were classified using the 5th edition of the WHO Classification of Tumors of the Central Nervous System (CNS5). Figure 1 shows the workflow of this study. Finally, 184 patients (106 males and 78 females; age range, 11–80 years; mean age, 51.1 years) were included in the development of the ML prediction model. Pathological assessment identified grade III lesions in 59 cases and grade IV lesions in 125 cases. Because of missing follow-up data in 40 cases, 144 patients were included in the survival analysis. All procedures involving human participants were performed in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. Informed consent was obtained from all individual participants and/or their legal guardians. This study complied with the MI-CLAIM reporting guidelines.

Fig. 1

Patient exclusion criteria flowchart.

Image data acquisition

Patients were examined with an Intera Achieva 1.5T or GE Signa Excite HD 1.5T MRI system. The protocol consisted of axial gradient-echo T1WI and free-breathing DWI. Parallel imaging was used, and fat suppression was applied with spectral pre-saturation inversion recovery (SPIR). The T1WI acquisition parameters were: acquisition matrix 256 × 256, slice thickness 6 mm, slice spacing 1 mm, number of excitations (NEX) 1.0, repetition time (TR) 450 ms, echo time (TE) 21 ms, pixel size 0.86 × 0.86 mm, and field of view (FOV) 22.0 × 22.0 cm20.

Lesion segmentation

Before segmentation, low-frequency intensity inhomogeneity in the MRI data was removed by N4 bias field correction using SimpleITK (version 2.1.1). For T1-weighted MRI, the region of interest (ROI) was obtained by manually delineating the tumor along its border, slice by slice. The delineation included the cystic and hemorrhagic portions of the lesion while avoiding vascular components and adjacent normal tissue. The ROIs were manually segmented on T1-weighted images by a radiologist with 10 years of experience in brain MRI and then reviewed by a radiologist with more than 20 years of experience. If the more experienced radiologist questioned an ROI, it was re-segmented after the two reached agreement. The ROIs were drawn on axial slices using ITK-SNAP (version 3.6.0), covering the entire lesion, ensuring that segmentation for subsequent analyses was as accurate as possible. Figure 2 presents an example case of grade III and one of grade IV glioma.

Fig. 2

Examples of the two grades of high-grade glioma in this study. Case 1 (first column): an 11-year-old male with grade III glioma. Case 2 (second column): a 19-year-old female with grade IV glioma.

Feature extraction

Radiomics features were defined according to the PyRadiomics Python package21. All features in the PyRadiomics feature list were calculated on the T1WI sequence, yielding a total of 107 image features: 18 first-order features, 14 shape features (3D), and 75 texture features. The texture features comprise 24 Gray Level Co-occurrence Matrix (GLCM) features, 16 Gray Level Size Zone Matrix (GLSZM) features, 16 Gray Level Run Length Matrix (GLRLM) features, 5 Neighbouring Gray Tone Difference Matrix (NGTDM) features, and 14 Gray Level Dependence Matrix (GLDM) features. Features were extracted from the segmentations of the two diagnostic imaging physicians, and the intraclass correlation coefficient (ICC) was used to assess intra- and inter-observer agreement. Typically, an ICC below 0.75 is considered to indicate poor reliability22. Features with good repeatability (ICC > 0.90) were retained for the subsequent feature dimensionality reduction and feature selection steps.
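The agreement check described above can be sketched with a minimal two-way random-effects, absolute-agreement ICC(2,1), which is the form commonly used for this repeatability filter (this is an illustrative numpy re-implementation on synthetic data, not the study's actual pipeline or thresholds):

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rating.

    ratings: (n_subjects, n_raters) array, e.g. one radiomics feature
    measured on the same lesions by two readers.
    """
    n, k = ratings.shape
    grand_mean = ratings.mean()
    subject_means = ratings.mean(axis=1)
    rater_means = ratings.mean(axis=0)

    # Mean squares from the two-way ANOVA decomposition
    ms_subjects = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    ms_raters = n * np.sum((rater_means - grand_mean) ** 2) / (k - 1)
    residual = (ratings
                - subject_means[:, None]
                - rater_means[None, :]
                + grand_mean)
    ms_error = np.sum(residual ** 2) / ((n - 1) * (k - 1))

    return (ms_subjects - ms_error) / (
        ms_subjects + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
    )

# Two perfectly concordant readers give ICC = 1
rng = np.random.default_rng(0)
truth = rng.normal(size=30)
perfect = np.stack([truth, truth], axis=1)
print(round(icc_2_1(perfect), 3))
```

In practice each of the 107 features would be passed through such a function and only those with ICC > 0.90 kept.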

Feature selection

Too many features may lead to problems such as model overfitting. Feature selection reduces the dimension of the feature space, i.e., it yields a small number of high-quality features with a low probability of classification error. The least absolute shrinkage and selection operator (LASSO) is a commonly used feature selection method23. We applied the LASSO algorithm to the features with good repeatability (ICC > 0.90) identified by the consistency tests. The selection was implemented in the Python scikit-learn environment (version 0.19.1): LASSO minimises the penalised cost function, and features with non-zero coefficients were retained at a specific penalty coefficient alpha24.
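A dependency-light sketch of this step, using synthetic stand-in data (the feature matrix, label coding, and cross-validated alpha choice here are illustrative assumptions, not the study's actual data or settings; treating the binary grade as a 0/1 regression target is a common radiomics heuristic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

# Synthetic stand-in for 184 patients x 107 radiomics features
X, y = make_classification(n_samples=184, n_features=107,
                           n_informative=10, random_state=0)
X = StandardScaler().fit_transform(X)  # LASSO is scale-sensitive

# Cross-validated choice of the penalty alpha; features whose
# coefficients shrink exactly to zero are discarded.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(f"alpha = {lasso.alpha_:.4f}, "
      f"{selected.size} of {X.shape[1]} features kept")
```

The retained column indices in `selected` would then index the ICC-filtered feature table for downstream modelling.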

Data pre-processing

In the dataset used in this study, grade III patients accounted for only 32.1% of all HGG patients, a class imbalance that degrades classifier performance. To address it, we used the synthetic minority oversampling technique (SMOTE), which generates synthetic minority samples by linear combination of neighbouring samples25. We divided the processed data into a 70% training set and a 30% test set. Count data were presented as values and proportions and analysed using chi-square tests. The training set was used for model development and the test set for estimating the generalisability of the model.
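The core idea of SMOTE can be shown in a few lines (a minimal numpy re-implementation for illustration, not the imbalanced-learn library the field normally uses; the minority data here are random stand-ins):

```python
import numpy as np

def smote_minority(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0):
    """Minimal SMOTE sketch: synthesize n_new minority samples by linear
    interpolation between each sample and one of its k nearest minority
    neighbours."""
    rng = np.random.default_rng(seed)
    out = np.empty((n_new, X_min.shape[1]))
    # Pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours
    for i in range(n_new):
        a = rng.integers(len(X_min))           # random minority sample
        b = nn[a, rng.integers(k)]             # one of its neighbours
        lam = rng.random()                     # interpolation weight
        out[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return out

# e.g. grow 59 grade-III samples to match 125 grade-IV samples
minority = np.random.default_rng(1).normal(size=(59, 8))
synthetic = smote_minority(minority, n_new=125 - 59)
print(synthetic.shape)
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the original feature ranges rather than duplicating exact cases.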

Classifier and model validation

To achieve efficient and stable classification performance, we implemented six ML classifiers in the Python scikit-learn environment (version 0.19.1): logistic regression, Extreme Gradient Boosting (XGBoost), decision tree, random forest (RF), AdaBoost, and the gradient boosting decision tree (GBDT). These six classifiers were chosen because they are commonly used in related classification studies of glioblastoma, bladder, skin, breast, kidney, and colon lesions26,27,28,29,30,31. We used five-fold cross-validation, with baseline features for each cohort as shown in Table 1, and the average performance across folds was used to evaluate classification performance. A Shapley additive explanations (SHAP) explainer was constructed with the Python package shap (v0.39) to rank the features by their contribution32.
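A sketch of this comparison on synthetic data, with the same class ratio as the cohort (32% vs. 68%). To keep the example dependency-light, XGBoost is replaced by scikit-learn's GradientBoostingClassifier; swap in xgboost.XGBClassifier where available:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)

# Synthetic cohort mirroring the 32%/68% grade III/IV split
X, y = make_classification(n_samples=184, n_features=10,
                           weights=[0.32, 0.68], random_state=0)

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),
}
# Mean AUC over 5 cross-validation folds, as in the text
scores = {name: cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
          for name, clf in classifiers.items()}
for name, auc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} mean AUC = {auc:.3f}")
```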

Statistical analysis

The area under the ROC curve (AUC), sensitivity, accuracy, precision and specificity were calculated to assess the classification performance for HGG using the Python scikit-learn environment (version 0.19.1). The ROC curve was based on the mean over all cross-validation sets to reflect generalisation ability. Confidence intervals (CIs) for the AUCs were obtained with 1000 bootstrap replications in the same environment. The trained models were interpreted with SHAP. Survival analyses were performed using the Cox proportional hazards model and the Kaplan–Meier (KM) test. P < 0.05 was considered statistically significant.
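The bootstrap CI scheme described above can be sketched as follows (a percentile bootstrap on synthetic scores; the resample-skipping rule for single-class draws is a common implementation detail, assumed rather than taken from the paper):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC: 1000 replicates, 95% interval."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:   # need both classes to score
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)

# Noisy but informative synthetic scores
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)
score = y + rng.normal(scale=0.8, size=200)
auc, (lo, hi) = bootstrap_auc_ci(y, score)
print(f"AUC = {auc:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```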

Results

Clinical characteristics of the patients

In this study, a prediction model was developed using MRI from 184 HGG patients with 59 grade III lesions and 125 grade IV lesions (mean ages 44.8 and 54.2 years, respectively). There were statistically significant differences between grade III and grade IV lesions in minimum, first-order skewness, inverse difference normalised, informational measure of correlation (IMC1), inverse variance, joint energy, large area low gray level emphasis, and age (Table 1). Because of the imbalance between the numbers of grade III and grade IV lesions, we applied synthetic minority oversampling.

Table 1 Characteristics of participants by glioblastoma grading.

Performance of feature selection methods and classifiers

In our ML model, we tested the six classifiers with the LASSO feature selection method; the LASSO coefficient path is shown in Fig. 3A. As the regularisation strength increases, many feature coefficients are compressed to zero, eliminating most features and gradually simplifying the model. The cross-validation error along the regularisation path is shown in Fig. 3B.

Fig. 3

Variable selection by the LASSO regression model. (A) Choice of the optimal parameter (λ) in the LASSO regression model with logλ as the horizontal coordinate and regression coefficients as the vertical coordinate; (B) Plot of λ vs. number of variables with logλ as the bottom horizontal coordinate, binomial deviance as the vertical coordinate, and number of variables as the top horizontal coordinate.

Model evaluation and comparison

Without SMOTE to handle the data imbalance, XGBoost outperformed the logistic regression, decision tree, random forest, AdaBoost, and GBDT classifiers, with ACCs of 0.74, 0.73, 0.72, 0.72, 0.72, and 0.73, respectively (Fig. 4). Table 2 shows the performance metrics of the six algorithms at each step, including AUC (95% CI), ACC, F1 score, sensitivity, and specificity. After balancing the data with SMOTE, XGBoost again performed best, with ACCs of 0.78, 0.73, 0.74, 0.76, 0.76, and 0.76, respectively (Fig. 5); the balanced data improved the ACC of all six models. A stacking fusion model was then constructed on the SMOTE-balanced data, using the six ML models (GBDT, logistic regression, XGBoost, decision tree, random forest, and AdaBoost) as primary learners (Fig. 6). The primary learners' predictions on the data are used as inputs to a secondary learner (logistic regression) for the final classification. The performance metrics of the seven models, including the stacking fusion model, are shown in Table 3; the stacking fusion model achieved the best AUC, 0.95. Decision curve analysis (DCA), a tool for evaluating the clinical value of a model, is shown without and with oversampling in Figures S1A and S1C, respectively; the XGBoost algorithm yielded the highest net benefit, and the net benefit increased markedly after stacking (Figure S1E). The calibration curves (Figures S1B and S1D) and precision–recall curves (Figures S6 and S7) show that SMOTE improved the sample balance and thus the predictive calibration. In addition, the calibration curve (Figure S1F) and precision–recall curve (Figure S8) of the stacking fusion model also perform well, further verifying its advantages in robustness and prediction accuracy.
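The stacking scheme above maps directly onto scikit-learn's StackingClassifier (added in scikit-learn 0.22, later than the paper's stated 0.19.1, so this mirrors rather than reproduces the pipeline; XGBoost is again omitted to keep the sketch dependency-light, and the data are synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

# Primary learners feed out-of-fold predictions to a logistic-regression
# meta-learner, as in the fusion scheme described in the text.
stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("ada", AdaBoostClassifier(random_state=0)),
                ("gbdt", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)
stack.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print(f"stacked test AUC = {auc:.3f}")
```

Using cross-validated (out-of-fold) base predictions as the meta-learner's inputs is what keeps the fusion model from simply memorising its primary learners' training fits.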

Fig. 4

ROC curves of the six models without SMOTE handling of the data imbalance.

Table 2 Performance metrics of the six algorithms without SMOTE handling of the data imbalance.
Fig. 5

ROC curves of the six models after using SMOTE to handle the data imbalance.

Fig. 6

ROC curve for the stacking fusion model with an AUC of 0.95.

Table 3 Performance metrics of the six algorithms after using SMOTE to handle the data imbalance.

Model interpretation

To better understand the relationship between the model and the data, we used SHAP to provide a more intuitive explanation of how these variables affect model predictions. Because SHAP cannot directly interpret the fusion model, we interpreted the best-performing base model, XGBoost. Figure S2A shows the important features in the model, ranked on the y-axis by their importance to the prediction. The results show a strong association between “SizeZoneNonUniformity”, “Idn”, “Skewness”, “Minimum”, “InverseVariance”, “JointEnergy”, “Imc1”, and “LargeAreaLowGrayLevelEmphasis” and the prediction of grade IV glioma. Figure S2B illustrates the model's decision-making process and the influence of each feature on its predictions. For “SizeZoneNonUniformity”, “Skewness”, “Minimum”, “LargeAreaLowGrayLevelEmphasis”, and “InverseVariance”, the red (high-value) points concentrate on the left side, indicating that when these feature values are small the model is pushed toward predicting grade IV glioma. For the remaining features, the red points concentrate on the right side, indicating that large values make a grade IV prediction more likely. These features therefore play an important role in distinguishing glioma grades. Figure S2C is a decision plot showing each feature's contribution to the model output; the SHAP values of most features are concentrated in the range 0.2–0.8, indicating that they carry high importance in the prediction process and play a key role in the final decision. Figure S2D is a partial dependence plot (PDP), which shows the average influence of a feature on the prediction across its range of values. By simulating the predicted trend as a single feature changes, the PDP helps reveal the relationship between the feature and the target variable and thus the decision logic of the model.

Survival analysis

We analyzed survival in 144 patients with high-grade glioma (HGG); 43 (29.9%) were censored owing to loss to follow-up or non-event endpoints, and 101 (70.1%) reached the endpoint of death. Survival analysis used the KM test and the Cox proportional hazards model; HGG grade was included in the Cox model in addition to the nine screened MRI features. The KM test revealed that sex and MGMT had no significant effect on survival, whereas HGG grade, IDH1 status, KPS score, Ki-67, and surgical mode (total vs. subtotal resection) had significant effects (Figure S3). Specifically, the median survival of grade III patients was 667 days, compared with 531 days for grade IV patients.
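The KM estimator underlying these curves reduces to a short product over event times (a plain-numpy sketch on a toy cohort, standing in for the lifelines/R implementations typically used; censored subjects leave the risk set without contributing an event):

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate (event=1 death, 0 censored).
    Returns the distinct event times and the survival probability
    just after each."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    order = np.argsort(time)
    time, event = time[order], event[order]
    uniq = np.unique(time[event == 1])          # distinct event times
    surv, s = [], 1.0
    for t in uniq:
        at_risk = np.sum(time >= t)             # still under observation
        deaths = np.sum((time == t) & (event == 1))
        s *= 1.0 - deaths / at_risk             # product-limit update
        surv.append(s)
    return uniq, np.array(surv)

# Toy cohort: days to death (event=1) or last follow-up (event=0)
t = [100, 200, 200, 400, 500, 600]
e = [1,   1,   0,   1,   0,   1]
times, s = kaplan_meier(t, e)
for ti, si in zip(times, s):
    print(f"day {ti:.0f}: S(t) = {si:.3f}")
```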

The Cox proportional hazards model was used to investigate the contribution of MRI covariates to predicting OS. Both univariate and multivariate analyses were performed. Table 4 summarises the results, including effect sizes expressed as hazard ratios (HR). In univariate Cox regression, each covariate was assessed independently; IMC1, MCC, IDH1, Ki-67, surgical approach (total vs. subtotal resection), tumor grade, and KPS score were statistically significant.

In the multivariate Cox regression, IDH1, Ki-67, tumor grade, and KPS score remained significant. Stepwise multivariate Cox analysis ultimately retained six statistically significant covariates. Increased IMC1 was associated with shortened OS (HR = 1.37). To verify the robustness of the model, we performed a subgroup analysis (Figure S4): in the clinical subgroups with Ki-67 > 20, KPS score > 70, grade IV glioma, and wild-type IDH1, IMC1 was a valuable imaging feature for predicting prognosis.

Table 4 Risk values for different factors in univariate and multivariate analyses.

To verify the robustness of the model under different sample distributions and clinical contexts, we constructed a balanced dataset by undersampling and repeated the survival analysis (Figures S5–S8, Table 5). The radiomic feature IMC1 consistently showed significant and stable prognostic value, providing strong support for personalised survival prediction and treatment.

Table 5 Risk values for different factors in univariate and multivariate analyses after undersampling.

Discussion

In this study, we developed and validated an MRI-based radiomics approach for classifying HGG and interpreted the model with SHAP values. We identified influential features such as “SizeZoneNonUniformity”, “Skewness”, “Idn”, “Minimum”, “Imc1”, “JointEnergy”, and “InverseVariance”, and demonstrated how each feature affects the model's prediction of glioma grade. Among the non-fusion models, the XGBoost classifier performed best, and using SMOTE to handle the data imbalance improved the performance of all classifiers. This is consistent with previous findings that LASSO with XGBoost outperforms other classifiers in classifying benign versus malignant ovarian cysts and skin lesions28,33. The stacking fusion model performed best overall, with an AUC of 0.95 (sensitivity 0.84; accuracy 0.85; F1 score 0.95). Interpreting the SHAP values, we found that size zone non-uniformity, skewness, inverse difference normalised, minimum, and the informational measure of correlation were highly correlated with the model predictions. In the survival analysis, sex had no significant effect on survival, but survival differed significantly across tumor grades. Using the Cox proportional hazards model, we determined the contribution of MRI covariates to OS prediction and found that IMC1 was associated with survival.

Most ML prediction models use public MRI databases because glioblastoma data are scarce and survival is very short34. However, some studies have found that public MRI repositories have significant shortcomings, including a lack of expert tumour segmentation, which can reduce predictive reliability35. A strength of this study is therefore the use of a larger sample of MRI data from glioblastoma patients at two centers and the application of ML algorithms to construct predictive models with higher reliability. We encountered several important issues while constructing the grading model. First, unbalanced data can significantly affect model performance in the biomedical field36,37. For example, if only 30% of patients have grade III glioblastoma, a model that predicts grade IV for every case still achieves an accuracy of 0.7, which is clearly misleading. Such imbalance also led the model to predict grade IV more frequently, with lower accuracy on grade III samples, and previous studies have not fully accounted for this during model construction. In ML, oversampling or undersampling is recommended to address data imbalance38; we therefore used the SMOTE sampling technique to balance the grade III and grade IV sample sizes, which improved the prediction accuracy and stability of the model. Finally, model performance evaluation is also a challenge. AUC is the most widely used metric; in this study, the AUC of the six ML algorithms ranged from 0.89 to 0.99, and few studies have predicted tumor grade in GBM patients. Overfitting is another issue to consider in model evaluation39: even with methods such as cross-validation, some models still showed very high AUC and accuracy on the test data. However, statistical tests showed differences in AUC between the two datasets, suggesting that the generalisation ability of models trained with internal validation alone is questionable. Therefore, in the absence of fully independent external validation data, we recommend splitting off part of the data before normalisation and variable selection. In conclusion, despite the challenges in constructing grading models for HGG, ML algorithms can address them: by handling data imbalance, efficiently selecting relevant variables, improving accuracy, and controlling overfitting, we built a predictive model for classifying HGG patients with high predictive and generalisation ability.

Our imaging model contributes to the development of tools that can aid clinical decision-making in hospitals. From an interpretability perspective, traditional ML algorithms are often criticised for their lack of transparency40,41. To better understand the internal logic and decision rules behind the model predictions, another strength of this study is the use of SHAP values to interpret the ML models. XGBoost is the best-performing non-fusion model in this study and is therefore the model we focus on explaining. On the test data, we calculated the SHAP value of each feature variable to assess its contribution to the prediction. The SHAP summary plot shows which features influence the predictions positively or negatively, while the feature importance plot provides an average assessment of feature importance across the entire dataset; together they summarise each feature's contribution to the predictions and provide useful information for further analysis and interpretation of the model. The SHAP dependency plots additionally show how the features affect the model output at different value levels, making the influence of each imaging feature on glioma grade clearly visible. Overall, the SHAP values used in this study reduce the difficulty of interpreting ML models and increase their transparency, allowing us to better understand the XGBoost model's grading of HGG patients. By analysing the SHAP values, we can quantitatively assess how strongly each feature influences the predictions, identify potential risk groups of HGG patients from MRI features, and provide a basis for interventions. This helps physicians make early clinical decisions, alleviate the suffering of HGG patients, and reduce the burden on the healthcare system.

A limitation of our study is the lack of external validation and protocol harmonisation with other centers. In addition, we evaluated only one commonly used feature selection method and six classification methods for MRI-based HGG grading. Finally, our results are based on a basic, clinically convenient sequence: only a single T1 sequence was used, and other MRI sequences were not considered. Another study of glioma lesion prediction and classification found that multi-sequence features outperform single-sequence features42; we could not exploit this because some patients lacked other sequences. Future studies could try multiple feature selection approaches and multi-sequence data to build predictive models.

In this two-center study, we also developed a model to predict survival up to 8 months after radiotherapy, designed for cohorts of patients receiving optimal treatment as well as cohorts receiving modified treatment. The T1-based model showed generalisable classification in both the retrospective and the external, prospective test cohorts. If validated in a large prospective study, these methods could be used to differentiate patients with an initial response to radiotherapy from those who require closer imaging surveillance and second-line treatment (or termination of ineffective treatment).

In conclusion, this paper proposes an MRI radiomics-based grading prediction model for HGG patients. The experimental results show that radiomics analysis based on MRI is effective for grading HGG. XGBoost is the optimal non-fusion ML method, the stacking fusion model has the best overall performance, and balancing the data with SMOTE improves the performance of all models. If validated in a large prospective study, this method could be used to differentiate disease grades in HGG patients.