Introduction

Gliomas are the most common primary tumors of the central nervous system. They arise from glial or precursor cells and are characterized by high relapse and mortality rates. Gliomas include astrocytomas, oligodendrogliomas, and ependymomas. According to the 2007 World Health Organization (WHO) classification1, astrocytomas are divided into four grades based on their growth potential and aggressiveness. Grades I (pilocytic astrocytomas) and II (diffuse astrocytomas) correspond to the most benign tumors with a favorable prognosis and are considered low-grade gliomas (LGG), whereas grades III (anaplastic astrocytomas) and IV (glioblastoma multiforme, GBM) are considered high-grade gliomas (HGG). Glioblastoma multiforme is the most common, malignant, aggressive, and challenging type of primary brain tumor; it grows rapidly and has the lowest survival rate, with a 5-year survival of around 5%2. Since LGG and HGG differ in progression, treatment response, and treatment resistance, accurate and early diagnosis and grading are essential to plan appropriate treatment. Furthermore, some subtypes of LGG can progress to GBM within a few months3, so it is crucial to differentiate LGG from GBM as early as possible.

Currently, the standard procedure for diagnosing, classifying, and grading gliomas is based on histopathological analysis of a sample of brain tissue acquired by surgical biopsy or at the time of resection4. However, the potential risks (e.g., damage to a vital brain area can cause neurological deficits) and limitations inherent to biopsy have led to the search for less invasive alternatives without adverse side effects. Thus, significant research efforts have been directed towards the development of neuroimaging techniques that allow the non-invasive extraction of a variety of so-called radiomic features (commonly divided into morphological, textural, functional, and semantic features) for diagnosis, classification of different tumor types, prognosis prediction, and determination of tumor morphology and location5,6. For instance, Cheng et al.7 used radiomic features for prediction of glioma grade, Lee et al.8 for pancreatic cancer, Miranda et al.9 for rectal cancer, Nguyen et al.10 for non-small cell lung cancer, Khanfari et al.11 for prostate cancer grading, and Kim et al.12 for prediction of disease-free survival in triple-negative breast cancer. In the particular case of glioblastoma, radiomics has emerged as a powerful, non-invasive tool to obtain more information about pathogenesis and therapeutic responses, providing significant biological insights into imaging features6.

Magnetic resonance imaging (MRI) and computed tomography (CT) are the most commonly used neuroimaging modalities13. However, other emerging techniques such as functional MRI (fMRI), magnetic resonance spectroscopy (MRS), positron emission tomography (PET), single-photon emission computed tomography (SPECT), combined PET/CT and hybrid PET/MRI are gaining increasing relevance for the diagnosis, prognosis, and monitoring of gliomas14,15,16. Despite the relevance and usefulness of neuroimaging, advances in genomics and proteomics have allowed the identification of prominent molecular biomarkers that contain both diagnostic and prognostic information for tumors of the central nervous system, becoming a pivotal tool for the evaluation of some gliomas and clinical decision making in neuro-oncology17. In 2021, the WHO incorporated molecular data as a primary factor in classifying and grading gliomas; combined with classic clinical and histological characteristics, these data can yield better diagnostic performance18. Both neuroimaging-based methods and those focused on analyzing molecular biomarkers are supported by various machine learning and deep learning models, owing to their ability to process large volumes of data, find the most informative features, and deliver strong performance19,20,21.

Deepak and Ameer22 explored the performance of deep transfer learning with a pre-trained GoogLeNet to extract features from MRI images to discriminate between three types of brain tumors. Analysis of molecular mutations using MRI features proved to be a useful method for diffuse LGG prediction, with the advantage of being a non-invasive procedure23. Alksas et al.24 proposed an imaging-based glioma grading system that uses contrast-enhanced MRI, fluid-attenuated inversion-recovery MRI, and diffusion-weighted MRI to extract morphological, textural, and functional features; the optimal features given by the Gini impurity index are then fed to a multi-layer perceptron (MLP) to discriminate between different grades of glioma. Matsui et al.25 developed a deep learning model to predict the LGG molecular subtype using a mixture of clinical and radiomic data, obtaining an overall accuracy of 68.7% when the imaging data included MRI, PET, and CT data. Gutta et al.26 conducted experiments with a set of 237 patients to demonstrate that the performance of features learned by a convolutional neural network was superior to that of standard radiomic features for glioma grade prediction. Cheng et al.7 used a total of 2153 intratumoral and peritumoral features extracted from preoperative multi-parametric MRI scans of 285 patients to predict glioma grade, reaching an area under the ROC curve (AUC) of 0.975; furthermore, this technique showed strong generalization performance when applied to an independent validation data set with 65 patients.

Sun et al.27 compared several radiomic feature selection algorithms and classification models in glioma grading, concluding that the combination of feature selection based on a support vector machine (SVM) with an MLP performed the best in discriminating between LGG and GBM. Cho et al.28 used the minimum redundancy maximum relevance algorithm with mutual information as the information measure to select the top five features from a total of 468 radiomic features and three classifiers (logistic regression, SVM, and random forest) to distinguish between HGG and LGG images. Bae et al.29 evaluated the performance and generalizability of traditional machine learning and deep learning models for distinguishing glioblastoma from single brain metastasis using radiomic features. Zhao et al.30 applied Cox proportional hazards, SVM and random forest to a large glioma data set with 3462 patients for survival prediction, concluding that the best performance was achieved when incorporating radiation therapy and chemotherapy administration status. Tasci et al.31 introduced a new hierarchical voting-based strategy for feature selection for glioma grading based on clinical and molecular characteristics, improving the performance of using the least absolute shrinkage and selection operator (LASSO) method together with classifier ensembles. Joshi et al.32 proposed a two-stage ensemble for glioma detection and grading based on clinical and histological data. Munquad et al.33 employed a correlation-based feature selection scheme and an SVM to predict LGG and subtypes, achieving an average accuracy of 91%. Ren et al.34 predicted IDH1 (isocitrate dehydrogenase 1) and ATRX (alpha-thalassemia mental retardation X-linked chromatin remodeler) mutations for molecular stratification of LGG using an SVM with a recursive feature elimination algorithm to select an optimal subset of 28 radiomic features. Zheng et al.35 developed a functional deep neural network to identify high-risk IDH1-mutant glioma patients using clinical factors and molecular features, achieving 90% accuracy.

Zhan et al.36 proposed a computer-aided diagnosis system for grading gliomas that consists of a feature extraction step using principal component analysis (PCA) to reduce the dimensionality of the data and a prediction step based on a k nearest neighbors classifier. Wu et al.37 evaluated 50 machine learning algorithms on a data set with 1114 eligible glioma patients and showed that their performance was better than that of the clinical prediction model, concluding that such prediction models can serve as a non-invasive tool for preoperative diagnostic grading of glioma. Ye et al.38 employed four machine learning algorithms (SVM, random forest, extreme gradient boosting, and generalized linear model) to investigate the relationship between overall survival and the clinical history parameters, pathological characteristics, and molecular alterations of gliomas; the experiments showed that extreme gradient boosting performed best on a data set with 198 patients. Zhou et al.39 analyzed the correlation between LGG stemness and clinicopathological characteristics, and used SVM, extreme gradient boosting and LASSO to identify genes critical for stemness subtype prediction. Kha et al.40 used Shapley additive explanations (SHAP) analysis to select the best wavelet radiomics features, which were then fed to extreme gradient boosting to predict the codeletion status of chromosome 1p/19q in LGG patients.

While most cutting-edge research has focused on the model-building stage of the machine learning process, the performance of a model is highly dependent on data quality. It is now widely accepted that performance improvements are primarily achieved through a data-centric approach41. Unlike model-centric systems that focus on how to modify the code, algorithms and representations to improve accuracy and generalization, data-centric approaches focus on curating the data to produce a better performing model. Data-centric machine learning comprises a series of tasks, including standardization and normalization, data cleaning, feature extraction, dimensionality reduction, feature transformation, instance selection, undersampling, data synthesis, and oversampling42. However, even recognizing the importance of data-centric methods, the challenge is to find an appropriate balance between these and model-centric methods to provide a robust machine learning solution43.

This paper presents a data-centric approach applied to The Cancer Genome Atlas (TCGA) data set and explores the potential benefits of oversampling and undersampling algorithms to address class imbalance, comparing their effect on the performance of six machine learning models (k nearest neighbors, support vector machine, multi-layer perceptron, logistic regression, random forest, and CatBoost). Furthermore, we conduct a comprehensive descriptive analysis of the data set to identify some statistical features and discover the most informative attributes using four feature ranking algorithms (information gain, Gini index, Chi-squared, and random forest). Next, the best performing prediction models are compared when using all the features in the data set versus only the five most relevant attributes.

Methodology

This section presents the data set used and its main characteristics, the experimental protocol, and the performance evaluation methods.

Data

All experiments were carried out using a data set31 obtained from the widely used and publicly available repository of The Cancer Genome Atlas (https://www.cancer.gov/tcga). In particular, the data set was built on the basis of the TCGA-LGG and TCGA-GBM projects and consists of three clinical factors (Gender, Age at diagnosis and Race) and 20 frequently mutated molecular biomarkers from 839 patients diagnosed with LGG or GBM. As seen in Table 1, all predictors are categorical, except for the attribute Age at diagnosis, which is numerical. The molecular features are encoded as 0 (not mutated) or 1 (mutated) for each TCGA case. It is worth noting that no deletion or imputation technique was necessary because the data set used in the experiments contained no missing values in any of the attributes (predictor variables).
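As a minimal illustration of the data layout, the following Python sketch loads the data set with pandas and verifies the properties described above. The file name glioma_grading.csv and the column names Grade and IDH1 are assumptions for illustration, not part of the original release.

```python
import pandas as pd

# Hypothetical file name for the data set of ref. 31
df = pd.read_csv("glioma_grading.csv")

print(df.shape)                    # expected: (839, 24) -> 23 predictors + class label
print(df["Grade"].value_counts())  # expected: 487 LGG (0) and 352 GBM (1)
print(df.isna().sum().sum())       # expected: 0, since there are no missing values

# Molecular biomarkers are binary: 0 = not mutated, 1 = mutated
assert set(df["IDH1"].unique()) <= {0, 1}
```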

Table 1 Information on the 23 predictors in the data set.

Descriptive statistics

The data set consists of two classes indicating the glioma grade: 487 (58.05%) patients with LGG (the positive class, 0) and 352 (41.95%) with GBM (the negative class, 1), resulting in an imbalance ratio of 1.38 (i.e., the ratio of majority to minority samples in the data set). Of the total samples, 488 (58.16%) correspond to men (0) and 351 (41.84%) to women (1). Regarding the attribute Race, there are 765 cases of white people (0), 59 of black or African American people (1), 14 of Asians (2), and only 1 American Indian (3). Table 2 reports the distribution of cases according to glioma grades for the clinical factors. The mean age values in Table 2 suggest that there are no significant differences between males and females affected by these brain tumors, regardless of glioma grade. On the other hand, the data regarding patient race could be biased because the vast majority of cases are white people, so any conclusion about the incidence of glioma based on the attribute Race could be erroneous.

Table 2 Distribution of cases (N (%)) according to glioma grades based on clinical factors. For Age at diagnosis, the mean and (standard deviation) are shown.

Table 3 summarizes a series of descriptive statistics for the attribute Age at diagnosis according to the gender of the patients, including measures of central tendency and measures of dispersion: minimum and maximum values, arithmetic mean, median, midrange, standard deviation (SD), standard error (SE), 95% confidence interval (95% CI), first (Q1) and third (Q3) quartiles, interquartile range (IQR), coefficient of skewness, coefficient of kurtosis, kurtosis excess, and coefficient of variation (CV). Additionally, we conducted the Kolmogorov-Smirnov (K-S) test46 (with Lilliefors significance correction) at a significance level of 0.05 to check the normality of the distribution of the samples in each gender; if the p-value is \(> 0.05\), it may be assumed that the data follow a normal distribution. We chose the K-S test instead of the Shapiro-Wilk test because it is more appropriate for large sample sizes (N \(\ge 50\))47.
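As a sketch of how this normality check can be reproduced, assuming the data frame df from the previous sketch and a column name Age_at_diagnosis, the statsmodels package provides the Lilliefors-corrected K-S test:

```python
from statsmodels.stats.diagnostic import lilliefors

for gender, group in df.groupby("Gender"):
    stat, p_value = lilliefors(group["Age_at_diagnosis"], dist="norm")
    # p > 0.05 -> no evidence against normality at the 5% significance level
    print(f"Gender {gender}: K-S statistic = {stat:.4f}, p-value = {p_value:.4f}")
```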

Table 3 Descriptive statistics of the attribute Age at diagnosis.

To visualize the shape of the distributions, Fig. 1 shows histograms and density plots for the attribute Age at diagnosis for both males and females, together with the corresponding Q-Q plots. As can be seen, the points (green dots) deviate from the 45-degree line (blue) only at the tails, suggesting that the data are approximately normally distributed.

Figure 1

Histograms (green boxes), density plots (red line) and Q-Q plots for the attribute Age at diagnosis.

On the other hand, Fig. 2 shows a box-plot with the distribution of the attribute Age at diagnosis for LGG (class 0) and GBM (class 1) cases. The dark blue vertical line and the thin blue lines represent the mean age and the standard deviation, respectively. The median is shown as a yellow vertical line, while the blue highlighted area represents the values between the first and third quartiles.

Figure 2

Distribution of values for the attribute Age at diagnosis.

Table 4 displays counts (frequencies) and proportions (relative frequencies) for the 20 molecular biomarkers. Note that the list is ordered from highest to lowest by the count (or percentage) of cases with a mutation in the corresponding biomarker, ranging from 404 for IDH1 to 22 for PDGFRA.

Table 4 Frequencies and proportions for the molecular biomarkers.

Experimental protocol

The machine learning models used to predict glioma grade in the experiments included four standard classification models and two powerful classifier ensembles. The standard models were k-nearest neighbors (kNN), SVM, MLP, and logistic regression (LR). The ensembles were random forest (RF) and CatBoost. kNN is a non-parametric learning algorithm that assigns the class label of an input sample based on the majority vote of its k closest training cases. SVM is a supervised machine learning model that classifies data by finding the hyperplane that optimally separates the samples of one class from the other, that is, the hyperplane that maximizes the distance (margin) to the closest samples of each class. One of the most interesting features of SVM is that it works for both linear and nonlinear problems and is less prone to overfitting. When data are not linearly separable, a kernel function must be used to transform the training data into a higher-dimensional feature space that allows linear separability. An MLP is an artificial neural network that consists of multiple layers of interconnected neurons: an input layer that receives the input sample as a combination of the feature values, an output layer that performs the classification by using some activation function, and one or more hidden layers (placed between the input and output layers) whose neurons perform computations on the inputs. The logistic regression model makes a prediction based on the probability that an input sample belongs to a particular class: if the probability is greater than 0.5, the sample is assigned to that class; otherwise, it is assigned to the other class.

RF44 is an extension of the bagging method made up of multiple decision trees, each generated from a sample drawn with replacement from the training set (sampling with replacement means that one sample may be selected multiple times, while others may not be selected at all). During the construction of a tree, the best split is selected from a random subset of features, thus ensuring low correlation between decision trees. When classifying new input samples, every tree makes a judgment and the final decision is made by majority vote. CatBoost45 is an improved implementation of gradient boosting on binary decision trees, in which each new tree is trained to minimize the loss function of the previous model (i.e., to reduce the error made by previous trees) using gradient descent. CatBoost handles categorical features not through binary substitution of the categorical values but by performing a random permutation of the training data (this ensures different orderings during different stages of the gradient boosting process) and computing the average label value over the samples with the same category value that appear before the given sample in the permutation.
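A minimal sketch of how the six models could be instantiated in Python with scikit-learn and the catboost package is shown below; the hyperparameter values are illustrative placeholders, not the tuned values reported in Table 5.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier

models = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf", probability=True),  # RBF kernel for nonlinear separation
    "MLP": MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "CatBoost": CatBoostClassifier(verbose=0, random_state=0),
}
```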

Before applying the prediction models, the values of the attribute Age at diagnosis were normalized using the z-score standardization technique so that the mean of all values was 0 and the standard deviation was 1. A raw value x of the feature is converted into a normalized value z by

$$z = \frac{x - \bar{x}}{SD} \tag{1}$$

where \({\bar{x}}\) and SD are the mean and standard deviation of a feature, respectively.

Note that normalization was applied solely to Age at diagnosis because all other attributes were categorical. On the other hand, to find the best hyperparameter values for the machine learning models, we fine-tuned them using an 80-20 stratified holdout (Table 5).
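The sketch below illustrates both steps: the z-score standardization of Eq. (1) applied only to Age at diagnosis, and the 80-20 stratified holdout used for tuning. The scaler is fitted on the training split only to avoid information leakage; the column name Age_at_diagnosis is an assumption.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = df.drop(columns="Grade"), df["Grade"]

# 80-20 stratified holdout used for hyperparameter tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# z-score standardization (Eq. 1) of the only numerical attribute
scaler = StandardScaler()
X_train["Age_at_diagnosis"] = scaler.fit_transform(X_train[["Age_at_diagnosis"]]).ravel()
X_test["Age_at_diagnosis"] = scaler.transform(X_test[["Age_at_diagnosis"]]).ravel()
```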

Table 5 Model hyperparameters.

Performance evaluation

We adopted a stratified 10-fold cross-validation method, where the data set was randomly divided into ten stratified, non-overlapping blocks of roughly equal size. The models were trained on nine of these blocks combined and then applied to the remaining block to estimate performance. This process was repeated for each of the 10 blocks, giving approximately 755 training samples and 84 testing samples in each of the 10 iterations of the cross-validation. Performance was then computed as the average of the 10 estimates thus obtained. We used six scalar indicators to evaluate the prediction performance: classification accuracy (Acc), Precision (Prec), Recall, Specificity (Spec), F1-score (F1), and Matthews correlation coefficient (MCC). All these measures were derived from a \(2 \times 2\) confusion matrix, whose entries are the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
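A sketch of this evaluation loop, reusing the models dictionary and the preprocessed X and y from the earlier sketches; following the convention above, LGG (label 0) is treated as the positive class, so specificity is computed as recall on the GBM class.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.metrics import (make_scorer, matthews_corrcoef,
                             precision_score, recall_score, f1_score)

scoring = {
    "Acc": "accuracy",
    "Prec": make_scorer(precision_score, pos_label=0),  # LGG = positive class
    "Recall": make_scorer(recall_score, pos_label=0),
    "Spec": make_scorer(recall_score, pos_label=1),     # recall on GBM = specificity
    "F1": make_scorer(f1_score, pos_label=0),
    "MCC": make_scorer(matthews_corrcoef),
}

for name, model in models.items():
    cv = cross_validate(model, X, y, cv=10, scoring=scoring)  # stratified for classifiers
    print(name, {m: round(np.mean(cv[f"test_{m}"]), 3) for m in scoring})
```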

In addition to these scalar metrics, we also included the receiver operating characteristic (ROC) and precision-recall curves to visualize how the machine learning models used in the experiments perform in predicting the classes. The ROC curve plots the false positive rate (i.e., 1 − specificity) on the X-axis against the true positive rate on the Y-axis; the closer the curve approaches the upper left corner of the ROC space, the better the model is at predicting the classes. The precision-recall curve plots precision (the proportion of true positives among positive predictions) against recall (the proportion of true positives among actual positives) at different thresholds; ideally, the curve should be as close to the top right corner as possible.
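Both curves can be drawn with scikit-learn's display helpers; a minimal sketch for one fitted model on the holdout split defined earlier:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay

clf = models["RF"].fit(X_train, y_train)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
RocCurveDisplay.from_estimator(clf, X_test, y_test, ax=ax1)         # TPR vs. FPR
PrecisionRecallDisplay.from_estimator(clf, X_test, y_test, ax=ax2)  # precision vs. recall
plt.show()
```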

Results and discussion

This section consists of three blocks. First, we investigate the most informative molecular biomarkers based on the distribution of cases in each glioma grade and check whether these results agree with those of four feature ranking algorithms. The second block analyzes the performance of six standard prediction models and classifier ensembles for glioma grading. Finally, we apply some resampling techniques to handle class imbalance and verify whether this leads to increased performance.

Most informative features

Values in Table 6 show the distribution of cases according to glioma grades for the mutated molecular biomarkers (i.e., feature value = 1). As can be seen, IDH1 mutations are the most common, being detected in 404 patients (48.15% of the total cases studied). However, these mutations occur in 94.31% of cases with LGG and only 5.69% of cases with GBM, confirming previous findings that this is a very informative molecular biomarker for glioma grading17,34,48: IDH1/2 mutations have been largely associated with grade II and III gliomas and secondary glioblastomas49. Looking at the biomarkers with 50 or more cases, similar conclusions can be drawn for ATRX (84.33% LGG), PTEN (phosphatase and tensin homolog; 82.27% GBM), and CIC (capicua transcriptional repressor; 96.40% LGG). Among the biomarkers with a low percentage of patients, we find NOTCH1 (notch receptor 1; 100% LGG), FUBP1 (far upstream element binding protein 1; 95.56% LGG), IDH2 (isocitrate dehydrogenase 2; 91.30% LGG), SMARCA4 (SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a, member 4; 85.19% LGG) and RB1 (retinoblastoma transcriptional corepressor 1; 85.00% GBM).

Table 6 Distribution of cases (N (%)) according to glioma grades based on mutated molecular biomarkers (bold values indicate the most discriminating molecular biomarkers, that is, those with the greatest difference between LGG cases and GBM cases).

To support the conclusions drawn from the values in Table 6, we ran four feature ranking algorithms on the normalized data set with the aim of checking which are the most informative predictors: information gain, Gini index, Chi-squared, and RF. Note that identifying the most informative clinical factors and glioma molecular biomarkers can be valuable in obtaining relevant biological information. On the other hand, in some practical cases, having small feature sets with high prediction accuracy can become paramount to minimize response time.

Information gain (infGain) estimates the relevance of a predictor based on the amount by which the entropy of the class decreases when considering that feature. The Gini index (Gini) estimates the distribution of a predictor in different classes and can be interpreted as a measure of impurity for a feature. Chi-squared (Chi2) measures the relationship strength between each variable and the class label. Note that Chi-squared applies to categorical predictors, and therefore, numerical attributes (as is the case for Age at diagnosis) must first be discretized into several intervals. In the case of RF as a feature ranking method, each tree in the forest calculates the importance of a predictor based on its ability to decrease the weighted impurity in the tree.
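The rankings can be approximated with scikit-learn as sketched below: mutual information for information gain, the chi-squared statistic (with Age at diagnosis discretized first, as noted above), and RF impurity-based importances, which are themselves derived from the decrease in Gini impurity; a standalone Gini index filter is not built into scikit-learn, so the RF column doubles as the impurity-based view here.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif, chi2
from sklearn.ensemble import RandomForestClassifier

# Chi-squared requires non-negative inputs, so the numerical attribute is discretized
X_chi2 = X.copy()
X_chi2["Age_at_diagnosis"] = pd.cut(X_chi2["Age_at_diagnosis"], bins=5, labels=False)

scores = pd.DataFrame({
    "infGain": mutual_info_classif(X, y, random_state=0),
    "Chi2": chi2(X_chi2, y)[0],
    "RF": RandomForestClassifier(random_state=0).fit(X, y).feature_importances_,
}, index=X.columns)

# Rank 1 = most informative feature for each method
print(scores.rank(ascending=False).sort_values("infGain"))
```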

Table 7 Results of feature ranking methods. The ranking of each feature is shown in brackets (bold values indicate the five most informative predictors based on the multiple intersection method).

Since each feature ranking algorithm could yield different rankings, fusing them with a multiple intersection method was necessary to find out which features received the highest rankings across the four algorithms. Thus, looking at the rankings of each algorithm, it was possible to determine the five most relevant attributes, while there were discrepancies in establishing the most informative attributes from the sixth position onwards. From the outputs of the multiple intersection method, Table 7 shows that the four feature ranking algorithms agreed in defining IDH1 as the most informative attribute, followed by Age at diagnosis, PTEN, CIC and ATRX. These results are interesting because they are consistent with the findings of various studies conducted in neuroscience and neuro-oncology17,34,48,49 in which the mutated molecular biomarkers that best discriminate LGG from GBM were determined, as reported in Table 6. The relevance of this lies in the fact that feature selection or ranking algorithms could be used to discover the molecular biomarkers with the greatest discriminating power instead of other methods that are more expensive, time-consuming, and difficult to carry out.
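The fusion procedure is not spelled out in the text, but one plausible reading of the multiple intersection method is sketched below: intersect the top-k sets of all rankings, growing k until the intersection holds the desired number of features. Applied to the rankings in Table 7, it returns IDH1, Age at diagnosis, PTEN, CIC, and ATRX.

```python
def multiple_intersection(rankings, n_features=5):
    """Fuse feature rankings: rankings maps method name -> feature list, best first.
    The top-k sets of all methods are intersected, deepening k until the
    intersection contains at least n_features attributes."""
    k = n_features
    while True:
        common = set.intersection(*(set(order[:k]) for order in rankings.values()))
        if len(common) >= n_features:
            return common
        k += 1
```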

We ran multidimensional scaling50 to visualize in Fig. 3 the samples from both classes as a function of the attribute Age at diagnosis against each of the four most informative biomarkers (IDH1, PTEN, CIC, and ATRX). Each blue dot represents an LGG sample, and each red dot a GBM sample. The regions belonging to each class are shaded in blue or red depending on whether they correspond to the LGG or GBM class, respectively. These graphs allow us to see how patient age and mutations are related to glioma grade. For example, Fig. 3a reveals that most LGG cases present IDH1 mutations and occur at younger ages than GBM cases. For PTEN (Fig. 3b), LGG occurs when there is no mutation, while GBM does not appear to depend on this molecular biomarker, since approximately the same number of cases is seen with and without PTEN mutations.

Figure 3

Scatter plot of Age at diagnosis (X-axis) vs. the most informative molecular biomarkers (Y-axis).

Results of the prediction models

Table 8 reports the results of each of the six evaluation metrics achieved by the prediction models applied to the normalized data set (with all predictors) using the experimental protocol described above. The results revealed that RF was the best performing model, although closely followed by CatBoost and SVM. In contrast, kNN, MLP, and LR obtained the lowest values regardless of the performance evaluation metric used.

Table 8 Prediction performance of the machine learning models (the best values are in bold).

For better analysis of these results, we performed a pairwise comparison of models using a correlated Bayesian t-test51 for each evaluation metric to check whether the difference in scores between each pair of models was significant or not. Unlike the frequentist correlated t-test, where the inference is a p-value, the inference of the Bayesian t-test is a posterior probability. Additionally, this test considers the correlation and the uncertainty (i.e., the standard error) of the results generated by cross-validation. The outputs of the statistical test are summarized in Table 9, where the number in a cell denotes the probability that the model corresponding to the row had a significantly higher score (posterior probability greater than 0.5) than the model corresponding to the column. Values in this table indicate that the results obtained by RF and CatBoost were significantly better than those of kNN, MLP, and LR, regardless of the metric used. When comparing RF and CatBoost with SVM, it can be seen that the differences were not statistically significant when using Prec (0.492 and 0.399) and Spec (0.460 and 0.416). Finally, posterior probabilities of RF being significantly better than CatBoost revealed that the performance differences between both ensembles were very small, so one should not conclude that RF performed better than CatBoost.
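The test has a closed form; below is a self-contained sketch following the formulation in ref. 51, where the posterior over the mean score difference is a Student-t distribution whose variance is corrected for the correlation between overlapping cross-validation training sets.

```python
import numpy as np
from scipy import stats

def prob_a_beats_b(scores_a, scores_b, rho=0.1):
    """Posterior probability that model A scores higher than model B.
    rho is the fold correlation, usually n_test / (n_train + n_test),
    i.e., 1/10 for 10-fold cross-validation."""
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    n = len(diff)
    # Nadeau-Bengio corrected variance accounts for fold correlation
    scale = np.sqrt((1.0 / n + rho / (1.0 - rho)) * diff.var(ddof=1))
    # Posterior: Student-t with n-1 degrees of freedom
    return 1.0 - stats.t.cdf(0.0, df=n - 1, loc=diff.mean(), scale=scale)
```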

Table 9 Pairwise comparison of models.

Figure 4 plots the ROC curves for the RF and CatBoost ensembles separately for each of the two classes (LGG and GBM). The dotted diagonal line represents the behavior of a random classifier, while the solid diagonal line represents an iso-performance line in ROC space, so that all points on the line give the same profit/loss; the closer to the upper left this line lies, the better the classifier. The AUC was 0.923 for RF and 0.924 for CatBoost, that is, the difference between both classifiers was negligible.

Figure 4

ROC curves for the classifier ensembles.

Figure 5 shows the confusion matrix corresponding to each of the six prediction models. Although the imbalance ratio of the data set is moderately low (1.38), the confusion matrix allows us to examine the behavior of the models in each of the classes, that is, to analyze the number of successes and errors per class in order to identify whether there were differences between predicting samples belonging to the majority class and samples of the minority class. Thus, it can be observed that the three best-performing models (RF, CatBoost and SVM) made fewer errors on the minority class (GBM) than the other three classifiers (kNN, MLP and LR). In contrast, the number of misclassifications on the majority class (LGG) was similar across all classifiers.

Figure 5

Confusion matrices of the classifiers.

Explainability of predictions

Due to the “black box” nature of most machine learning models, one of the main problems is their insufficient interpretability or the difficulty in understanding the predictions they make. To shed light on these limitations, some methodologies belonging to the eXplainable Artificial Intelligence (XAI)52 paradigm have been proposed in order to provide a reasonable understanding of the output of machine learning models. In particular, we analyzed the effect of the attributes on the prediction performance using two explainability approaches: global feature importance and SHAP.

Global feature importance estimates the contribution of each individual feature to the prediction by measuring the increase in the prediction error of the model after performing permutations on the feature values across the data set, which breaks the relationship between the feature and the target variable44,53. A feature is important if permuting its values increases the model error, while a feature is of little or no importance if permuting its values does not change the error of the model.
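This corresponds to what scikit-learn implements as permutation importance; a minimal sketch on the holdout split with the fitted model clf from the earlier sketch, scored with the AUC as in Fig. 6:

```python
from sklearn.inspection import permutation_importance

result = permutation_importance(
    clf, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0)

# Mean drop in AUC when a feature is permuted; a larger drop = more important
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(X_test.columns[idx], round(result.importances_mean[idx], 4))
```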

Bar charts in Fig. 6 show feature importances in descending order for each classifier, indicating that the IDH1 biomarker was the most important attribute contributing to the target variable (i.e., glioma grade), regardless of the model used. The second most important feature was Age at diagnosis in all cases except when applying the MLP neural network (note that even in this case the attribute Age at diagnosis was the third most important). It is worth highlighting that these results mostly agree with those reported in Table 7, where these two features were also identified as the most relevant when applying the multiple intersection method.

Figure 6

Feature importance of the top 5 variables according to the AUC of the model.

It should be noted that the global feature importance approach reveals the absolute importance of each attribute, but it does not indicate the direction of the change given by the permutations, that is, it does not report whether the feature increases or decreases the prediction performance of the model. To overcome this limitation, we also employed the SHAP method introduced by Lundberg and Lee54, which is based on the principles of cooperative game theory and can provide broad explanations of model predictions at both local and global levels. This method computes Shapley values, which quantify the average marginal contribution of a feature to the prediction made by the model after considering all possible combinations with other features55; that is, it indicates whether the influence of each feature on the model's prediction is positive (increase) or negative (decrease). The Shapley value of a feature is calculated as the difference between the prediction when the feature is present and the prediction when it is absent.
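A sketch of how such Shapley values and a summary plot can be produced with the shap package for one of the tree ensembles; note that the per-class indexing of the returned values varies across shap releases, so this reflects the older list-based API.

```python
import shap

explainer = shap.TreeExplainer(clf)        # exact Shapley values for tree ensembles
shap_values = explainer.shap_values(X_test)

# Older shap versions return one array per class for binary classifiers;
# index 1 selects the contributions towards the GBM class
shap.summary_plot(shap_values[1], X_test)
```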

Figure 7 shows the SHAP summary plot for each model, which represents the positive or negative impact of each feature on the prediction of one class. The X-axis shows the Shapley value, which denotes how much each feature contributes to the prediction of a patient diagnosed with GBM across all possible combinations. A value less than 0 indicates a negative contribution (i.e., low importance for the prediction of the minority class GBM), a value equal to 0 indicates no contribution, and a value greater than 0 indicates a positive contribution (i.e., high importance for the prediction). The left vertical axis (Y-axis) lists the features ranked in descending order of their relevance to the prediction of class GBM, while the right vertical axis indicates the value of the features from lowest to highest. Each dot represents the Shapley value of a sample (patient) plotted horizontally and is colored red or blue depending on whether the feature value is high or low, respectively.

Figure 7

SHAP summary plots.

From these plots, it can be seen that Age at diagnosis was the most important feature for the prediction of class GBM when the kNN and LR models were used, and the second most relevant with the rest of the classifiers. Samples with higher values of this feature (red) had higher Shapley values, meaning that they contributed to the prediction of class GBM, whereas lower values (blue) contributed against the prediction of this class. The IDH1 biomarker contributed the most to the prediction of the GBM class when using the MLP, SVM, RF and CatBoost models. As IDH1 is a categorical attribute, its impact on the prediction depends on its value (0 = non-mutated, 1 = mutated). Thus, the non-mutated value (0, shown in blue) contributed to the prediction of the GBM class, while the mutated value (1, shown in red) contributed negatively.

Addressing class imbalance

Considering the differences in misclassifications between the majority class and the minority class, we decided to address the class imbalance in order to see if any performance improvement could be obtained. It is well known that training a machine learning algorithm with imbalanced data can favor the majority class, typically leading to higher misclassification rates over the minority class (GBM). Among the various strategies to address imbalanced data, resampling techniques are by far the most widely used approach because they have been proven to be efficient, classifier-independent, and can be easily implemented for any problem56. These are designed to change the composition of the training data set by adjusting the number of majority and/or minority samples until both classes are represented by an approximately equal number of samples. Many researchers have argued that over-sampling is generally superior to under-sampling because under-sampling algorithms can discard potentially useful data and increase classifier variance57. It should be noted that, to avoid overoptimistic results, resampling should be applied only to the training set, not to the entire data set58. In the case of over-sampling, for instance, this means that the testing samples are neither over-sampled nor seen by the machine learning model during training.

Experiments in this section were carried out with two resampling algorithms. The first is an over-sampling algorithm proposed by Chawla et al.59 called SMOTE, which generates artificial samples of the minority class (GBM) by interpolating existing samples that are close together. It first finds the k nearest minority-class neighbors for each minority sample, and then synthetic samples are generated in the direction of some or all of those nearest neighbors; depending on the amount of over-sampling required, a certain number of neighbors are randomly chosen from the k nearest. The second is random under-sampling (RUS), which balances the data set by randomly removing samples from the over-sized class (LGG).
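A sketch using the imbalanced-learn package; wrapping the sampler in an imblearn Pipeline guarantees that resampling is applied only inside each training fold during cross-validation, never to the test folds, as required above. The models and scoring objects are reused from the earlier sketches.

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate

smote_svm = Pipeline([("smote", SMOTE(random_state=0)), ("clf", models["SVM"])])
rus_svm = Pipeline([("rus", RandomUnderSampler(random_state=0)), ("clf", models["SVM"])])

# Resampling happens inside each training fold only; test folds stay untouched
cv_smote = cross_validate(smote_svm, X, y, cv=10, scoring=scoring)
cv_rus = cross_validate(rus_svm, X, y, cv=10, scoring=scoring)
```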

Table 10 reports the performance results obtained after preprocessing the normalized data set with SMOTE and RUS. The first issue worth mentioning is that over-sampling performed better than under-sampling, except when Recall was used. Secondly, unlike the results obtained with the normalized data set without preprocessing (Table 8), the best model after up-sampling was SVM, although the differences with respect to RF and CatBoost were negligible.

Table 10 Prediction performance of the machine learning models using the resampled data sets (the best values are in bold).

To check whether the differences in the means between the results on the normalized training set without preprocessing and those preprocessed with over-sampling and under-sampling were significant, a two-tailed t-test60 was performed at a significance level of 5% (\(\alpha = 0.05\)); the t-values and p-values are shown in Table 11. When comparing the means in Table 8 with those obtained with over-sampling (upper part of Table 10), the differences were statistically significant in all cases except when using specificity to evaluate the models. On the other hand, when comparing them with the means obtained with under-sampling (bottom of Table 10), precision and specificity on the non-preprocessed set were significantly better than on the down-sized set. Therefore, despite the low imbalance ratio, the test indicated that over-sampling the normalized data set with the SMOTE algorithm increases the performance of the prediction models.

Table 11 Statistical comparison between the non-preprocessed data set and the resampled data sets. The first line of each method is the t-value, and the second line corresponds to the p-value (italic values indicate no significant differences, while underline values indicate that the results without resampling were better than those with resampling).

As a further confirmation of the findings using SMOTE, in Fig. 8 we plotted precision-recall curves for the best prediction models (SVM, RF and CatBoost) when applied to the original training sets and the resampled training sets. The area under the precision-recall curve was 0.838, 0.860 and 0.872 for SVM, 0.873, 0.910 and 0.904 for RF, and 0.872, 0.908 and 0.898 for CatBoost using the original, over-sampled and under-sampled training sets, respectively. These values confirm some performance improvements as a result of addressing class imbalance with SMOTE.

Figure 8

Precision-recall curves for SVM and the classifier ensembles applied with the original training sets (a-c), the oversampled training sets (d-f), and the undersampled training sets (g-i).

The last experiment focused on analyzing the behavior of the prediction models on the up-sampled data set using the feature vector with the five most relevant attributes according to the multiple intersection method. Table 12 shows that the best performing models were LR and SVM, which is quite surprising because these results differ from those obtained on the data set containing all attributes. On the other hand, when comparing the results in the upper part of Table 10 with those in Table 12, one can see that the performance of all prediction models worsened when applied to the reduced sets. To check whether the differences were statistically significant, we again ran a two-tailed t-test at a significance level of 0.05: t-value \(= -6.898545\), p-value \(= 0.00098\).

Table 12 Prediction performance of the machine learning models on the oversampled data set using the top five attributes (the best values are in bold).

Conclusions

Glioma grading and prediction constitute a highly relevant practical health problem that is usually addressed using neuroimaging techniques. However, the development of advanced genomics and proteomics methods allows the identification of mutations in certain molecular biomarkers that can support diagnosis, prognosis and prediction of response to therapy. In this study, several data-centric machine learning models have been used to discriminate between LGG and GBM samples using a series of clinical factors and molecular biomarkers. Furthermore, a comprehensive descriptive analysis of the data set used in the experiments has also been carried out, including several statistics of the attributes and the application of four feature ranking algorithms to determine the most relevant characteristics. It has been possible to observe that the molecular biomarkers selected by these algorithms as the most informative agree with the conclusions of previous molecular biology studies; moreover, these algorithms have the important advantage of being much less expensive and faster than genomics and proteomics methods.

Of the different machine learning methods analyzed, the two classifier ensembles (RF and CatBoost) have obtained the best scores regardless of the metric used. The global feature importance approach revealed the absolute relevance of each attribute, while the SHAP analysis of individual samples provided a reasonable interpretation of which attributes contributed most to the prediction of class GBM. On the other hand, when analyzing the confusion matrices, important differences have been observed between the misclassifications on the majority class and the minority class, which suggested the need to apply some techniques to address the class imbalance. In particular, the normalized data set has been preprocessed with an oversampling algorithm (SMOTE) and an undersampling algorithm (RUS) and it has been found that up-sizing the minority class improves the prediction performance. As a final comment, it is worth noting that a model-centric approach applied to the TCGA data set achieved 0.876 accuracy31, while the data-centric method proposed in this study yielded accuracy rates of 0.882 (with oversampling) and 0.881 (with undersampling).

While this study provides valuable insights into the prediction of glioma grades, an interesting avenue for future research concerns the analysis of the possible bias that may arise in predictions against certain sensitive social groups (e.g., by gender, age, or race). The aim would be to quantify any such bias through fairness metrics and, if necessary, apply bias mitigation methods61,62. When the bias is inherited from the way the training set was created, one approach to reduce it is to internally rebalance the class distributions so that they are equal across the class and sensitive attributes.