Abstract
Accurate prediction and grading of gliomas play a crucial role in evaluating brain tumor progression, assessing overall prognosis, and planning treatment. In addition to neuroimaging techniques, molecular biomarkers that can guide diagnosis, prognosis, and prediction of the response to therapy have attracted growing interest for use together with machine learning and deep learning models. Most research in this field has been model-centric, that is, focused on finding better performing algorithms. In practice, however, improving data quality can yield a better model. This study investigates a data-centric machine learning approach to determine its potential benefits in predicting glioma grades. We report six performance metrics to provide a complete picture of model performance. Experimental results indicate that standardization and oversampling the minority class increase the prediction performance of four popular machine learning models and two classifier ensembles applied to a mildly imbalanced data set consisting of clinical factors and molecular biomarkers. The experiments also show that the two classifier ensembles significantly outperform three of the four standard prediction models. Furthermore, we conduct a comprehensive descriptive analysis of the glioma data set to identify relevant statistical characteristics and discover the most informative attributes using four feature ranking algorithms.
Introduction
Gliomas are the most common primary tumors of the central nervous system that arise from glial or precursor cells, characterized by increased relapse and mortality rates. Gliomas include astrocytomas, oligodendrogliomas, and ependymomas. According to the 2007 World Health Organization (WHO) classification1, astrocytomas are classified into four grades based on growth potential and aggressiveness. Grades I (pilocytic astrocytomas) and II (diffuse astrocytomas) correspond to the most benign tumors with a favorable prognosis and are considered low-grade gliomas (LGG), whereas grades III (anaplastic astrocytomas) and IV (glioblastoma multiforme, GBM) are considered high-grade gliomas (HGG). Glioblastoma multiforme is the most common, malignant, aggressive, and challenging type of primary brain tumor; it grows rapidly and has the lowest survival rate, with a 5-year survival of around 5%2. Since LGG and HGG differ in progression, treatment response, and treatment resistance, accurate and early diagnosis and grading are essential to plan appropriate treatment. Furthermore, it should be noted that some subtypes of LGG can lead to GBM in a few months3, so it is crucial to differentiate LGG from GBM as early as possible.
Currently, the standard procedure for diagnosing, classifying, and grading gliomas is based on histopathological analysis of a sample of brain tissue acquired by surgical biopsy or at the time of resection4. However, the potential risks (e.g., damage to a vital brain area can cause neurological deficits) and limitations inherent to biopsy have led to the search for less invasive alternatives without adverse side effects. Thus, significant research efforts have been directed towards the development of neuroimaging techniques that allow the non-invasive extraction of a variety of so-called radiomic features (commonly divided into morphological, textural, functional, and semantic features) for diagnosis, classification of different tumor types, prognosis prediction, and determination of the morphology and location of the tumor5,6. For instance, Cheng et al.7 used radiomic features for prediction of glioma grade, Lee et al.8 for pancreatic cancer, Miranda et al.9 for rectal cancer, Nguyen et al.10 for non-small cell lung cancer, Khanfari et al.11 for prostate cancer grading, and Kim et al.12 for prediction of disease-free survival in triple-negative breast cancer. In the particular case of glioblastoma, radiomics has emerged as a powerful, non-invasive tool to obtain more information about the pathogenesis and therapeutic responses, providing significant biological insights into imaging features6.
Magnetic resonance imaging (MRI) and computed tomography (CT) are the most commonly used neuroimaging modalities13. However, other emerging techniques such as functional MRI (fMRI), magnetic resonance spectroscopy (MRS), positron emission tomography (PET), single-photon emission computed tomography (SPECT), combined PET/CT and hybrid PET/MRI are gaining increasing relevance for the diagnosis, prognosis, and monitoring of gliomas14,15,16. Despite the relevance and usefulness of neuroimaging, advances in genomics and proteomics have allowed the identification of prominent molecular biomarkers that contain both diagnostic and prognostic information for tumors of the central nervous system, becoming a pivotal tool for the evaluation of some gliomas and clinical decision making in neuro-oncology17. In 2021, WHO incorporated molecular data as a primary factor in classifying and determining the grade of gliomas, which, together with classic clinical and histological characteristics, can provide better performance18. Both methods based on neuroimaging techniques and those that focus on analyzing molecular biomarkers are supported by various machine learning and deep learning models due to their ease in processing large volumes of data and finding the most informative features, as well as their strong performance19,20,21.
Deepak and Ameer22 explored the performance of deep transfer learning with a pre-trained GoogLeNet to extract features from MRI images to discriminate between three types of brain tumors. Analysis of molecular mutations using MRI features proved to be a useful method for diffuse LGG prediction, with the advantage of being a non-invasive procedure23. Alksas et al.24 proposed an imaging-based glioma grading system that uses contrast-enhanced MRI, fluid-attenuated inversion-recovery MRI, and diffusion-weighted MRI to extract morphological, textural, and functional features. Then, the optimal features given by the Gini impurity index are fed to a multi-layer perceptron (MLP) to discriminate between different grades of glioma. Matsiu et al.25 developed a deep learning model to predict the LGG molecular subtype using a mixture of clinical and radiomic data. An overall accuracy of 68.7% was obtained when the imaging data included MRI, PET, and CT data. Gutta et al.26 conducted some experiments with a set of 237 patients to demonstrate that the performance of features learned by a convolutional neural network was superior to that of standard radiomic features for glioma grade prediction. Cheng et al.7 used a total of 2153 intratumoral and peritumoral features extracted from preoperative multi-parametric MRI scans of 285 patients to predict glioma grade, reaching an area under the ROC curve (AUC) of 0.975. Furthermore, this technique was shown to have strong generalization performance when applied to an independent validation data set with 65 patients.
Sun et al.27 compared several radiomic feature selection algorithms and classification models in glioma grading, concluding that the combination of feature selection based on a support vector machine (SVM) with an MLP performed the best in discriminating between LGG and GBM. Cho et al.28 used the minimum redundancy maximum relevance algorithm with mutual information as the information measure to select the top five features from a total of 468 radiomic features and three classifiers (logistic regression, SVM, and random forest) to distinguish between HGG and LGG images. Bae et al.29 evaluated the performance and generalizability of traditional machine learning and deep learning models for distinguishing glioblastoma from single brain metastasis using radiomic features. Zhao et al.30 applied Cox proportional hazards, SVM and random forest to a large glioma data set with 3462 patients for survival prediction, concluding that the best performance was achieved when incorporating radiation therapy and chemotherapy administration status. Tasci et al.31 introduced a new hierarchical voting-based strategy for feature selection for glioma grading based on clinical and molecular characteristics, improving the performance of using the least absolute shrinkage and selection operator (LASSO) method together with classifier ensembles. Joshi et al.32 proposed a two-stage ensemble for glioma detection and grading based on clinical and histological data. Munquad et al.33 employed a correlation-based feature selection scheme and an SVM to predict LGG and subtypes, achieving an average accuracy of 91%. Ren et al.34 predicted IDH1 (isocitrate dehydrogenase 1) and ATRX (alpha-thalassemia mental retardation X-linked chromatin remodeler) mutations for molecular stratification of LGG using an SVM with a recursive feature elimination algorithm to select an optimal subset of 28 radiomic features. 
Zheng et al.35 developed a functional deep neural network to identify high-risk IDH1-mutant glioma patients using clinical factors and molecular features, achieving 90% accuracy.
Zhan et al.36 proposed a computer-aided diagnosis system for grading gliomas that consists of a feature extraction step using PCA to reduce the dimensionality of the data and a prediction step based on a k nearest neighbors classifier. Wu et al.37 evaluated 50 machine learning algorithms over a data set with 1114 eligible glioma patients and showed that their performance was better than that of the clinical prediction model. The authors concluded that such prediction models can serve as a non-invasive prediction tool for preoperative diagnostic grading of glioma. Ye et al.38 employed four machine learning algorithms (SVM, random forest, extreme gradient boosting, and generalized linear model) to investigate the relationship between overall survival and the clinical history parameters, pathological characteristics, and molecular alterations of gliomas. The experiments concluded that extreme gradient boosting was the best performing model when applied to a data set with 198 patients. Zhou et al.39 analyzed the correlation between LGG stemness and clinicopathological characteristics. In addition, the authors used SVM, extreme gradient boosting and LASSO to identify genes critical for stemness subtype prediction. Kha et al.40 used Shapley additive explanations (SHAP) analysis to select the best wavelet radiomics features, which were then used with extreme gradient boosting to predict the codeletion status of chromosome 1p/19q in LGG patients.
While most cutting-edge research has focused on the model-building stage of the machine learning process, the performance of a model is highly dependent on data quality. It is now widely accepted that performance improvements are primarily achieved through a data-centric approach41. Unlike model-centric systems that focus on how to modify the code, algorithms and representations to improve accuracy and generalization, data-centric approaches focus on curating the data to produce a better performing model. Data-centric machine learning comprises a series of tasks, including standardization and normalization, data cleaning, feature extraction, dimensionality reduction, feature transformation, instance selection, undersampling, data synthesis, and oversampling42. However, even recognizing the importance of data-centric methods, the challenge is to find an appropriate balance between these and model-centric methods to provide a robust machine learning solution43.
This paper presents a data-centric approach applied to The Cancer Genome Atlas (TCGA) data set and explores the potential benefits of oversampling and undersampling algorithms for addressing class imbalance, comparing their effect on the performance of six machine learning models (k nearest neighbors, support vector machine, multi-layer perceptron, logistic regression, random forest, and CatBoost). Furthermore, we conduct a comprehensive descriptive analysis of the data set to identify relevant statistical characteristics and discover the most informative attributes using four feature ranking algorithms (information gain, Gini index, Chi-squared, and random forest). Next, the best performing prediction models are compared when using all the features in the data set versus using only the five most relevant attributes.
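Before describing the methodology, it is useful to make the class-imbalance handling concrete. The sketch below implements plain random oversampling of the minority class with NumPy only; it is an illustrative assumption, not necessarily the exact resampling algorithm used in the experiments (dedicated libraries such as imbalanced-learn offer more sophisticated options like SMOTE):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples (drawn with replacement) until
    both classes contain the same number of samples."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_needed = counts.max() - counts.min()
    idx_min = np.flatnonzero(y == minority)
    extra = rng.choice(idx_min, size=n_needed, replace=True)
    X_bal = np.vstack([X, X[extra]])
    y_bal = np.concatenate([y, y[extra]])
    return X_bal, y_bal

# Toy data mimicking a mild imbalance (6 majority vs. 4 minority samples)
X = np.arange(20).reshape(10, 2).astype(float)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
X_bal, y_bal = random_oversample(X, y)
```

After resampling, both classes have six samples, removing the imbalance before model training.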
Methodology
This section presents the data set used and its main characteristics, the experimental protocol, and the performance evaluation methods.
Data
All experiments were carried out using a data set31 obtained from the widely used and publicly available repository of genome atlas data on TCGA (https://www.cancer.gov/tccg). In particular, the data set was built on the basis of the TCGA-LGG and TCGA-GBM projects and consists of three clinical factors (Gender, Age at diagnosis and Race) and 20 frequently mutated molecular biomarkers from 839 patients diagnosed with LGG or GBM. As seen in Table 1, all predictors are of categorical type, except for the attribute Age at diagnosis, which is numerical. The molecular features are represented by the values 0 (not mutated) and 1 (mutated) according to the TCGA case number. It is worth noting that it was not necessary to apply any deletion or imputation technique because the data set used in the experiments did not contain missing data on any of the attributes (predictor variables).
Descriptive statistics
The data set consists of two classes indicating the glioma grade: 487 (58.05%) patients with LGG (the positive class, 0) and 352 (41.95%) with GBM (the negative class, 1), resulting in an imbalance ratio of 1.38 (i.e., the ratio of majority to minority samples in the data set). Of the total samples in the data set, 488 (58.16%) correspond to men (0) and 351 (41.84%) to women (1). Regarding the attribute Race, there are 765 cases of white people (0), 59 of black or African American people (1), 14 of Asians (2), and only 1 American Indian (3). Table 2 reports the distribution of cases according to glioma grades for the clinical factors. The mean age values in Table 2 suggest that there are no significant differences between males and females affected by these brain tumors, even regardless of the glioma grade. On the other hand, the data regarding patient race could be biased because the vast majority of cases are white people, so any conclusions about the incidence of glioma based on the attribute Race could be erroneous.
Table 3 summarizes a series of descriptive statistics for the attribute Age at diagnosis according to the gender of the patients, including measures of central tendency and measures of dispersion: minimum and maximum values, arithmetic mean, median, mid range, standard deviation (SD), standard error (SE), 95% confidence interval (95% CI), first (Q1) and third (Q3) quartiles, interquartile range (IQR), coefficient of skewness, coefficient of kurtosis, kurtosis excess, and coefficient of variation (CV). Additionally, we conducted the Kolmogorov-Smirnov (K-S) test46 (with Lilliefors significance correction) at a significance level of 0.05 to check for normality of the distribution of the samples in each gender; if p-value \(> 0.05\), it may be assumed that the data follow a normal distribution. We chose the K-S test instead of the Shapiro–Wilk test because it is more appropriate for large sample sizes (N \(\ge 50\))47.
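The normality check can be sketched with SciPy as follows. Note that `scipy.stats.kstest` does not apply the Lilliefors correction used in the paper, so when the normal parameters are estimated from the sample its p-values are only approximate (the `lilliefors` function in statsmodels would match the paper more closely); the ages below are synthetic stand-ins for the real Age at diagnosis values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical ages standing in for the 488 male patients in the data set
ages = rng.normal(loc=50, scale=15, size=488)

# K-S test against a normal with parameters estimated from the sample.
# Without the Lilliefors correction this p-value is only approximate.
stat, p_value = stats.kstest(ages, "norm", args=(ages.mean(), ages.std(ddof=1)))
normal_at_05 = p_value > 0.05  # fail to reject normality at alpha = 0.05
```

A p-value above 0.05 means the normality hypothesis cannot be rejected at the chosen significance level.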
To visualize the shape of the distributions, Fig. 1 shows histograms and density plots for the attribute Age at diagnosis for both males and females. In addition, it also displays the Q-Q plots for the attribute Age at diagnosis. As can be seen, the sample quantiles (green dots) deviate from the 45-degree line (blue) only at the tails, indicating that the data are approximately normally distributed.
On the other hand, Fig. 2 shows a box-plot with the distribution of the attribute Age at diagnosis for LGG (class 0) and GBM (class 1) cases. The dark blue vertical and the thin blue lines represent the mean age and standard deviation, respectively. The median is shown with a yellow vertical line, while the blue highlighted area represents the values between the first and the third quartiles.
Table 4 displays counts (frequencies) and proportions (relative frequencies) for the 20 molecular biomarkers. Note that the list is ordered from highest to lowest by the count (or percentage) of cases with a mutation in the corresponding biomarker, ranging from 404 for IDH1 to 22 for PDGFRA.
Experimental protocol
The multiple machine learning models used to predict glioma grade in the experiments included four standard classification models and two powerful classifier ensembles. The standard models were k-nearest neighbors (kNN), SVM, MLP, and logistic regression (LR). The ensembles were random forest (RF) and CatBoost. kNN is a non-parametric learning algorithm that produces the class label of an input sample based on the majority vote of its k closest training cases. SVM is a supervised machine learning model that classifies data by finding the hyperplane that optimally separates the samples of one class from the other, that is, the hyperplane that maximizes the distance (margin) between the closest samples of the opposite class. One of the most interesting features of SVM is that it works for both linear and nonlinear problems, as well as being less prone to overfitting. When data are not linearly separable, some kernel function must be used to transform the training data into a higher-dimensional feature space that allows linear separability. An MLP is an artificial neural network that consists of multiple layers of interconnected neurons: an input layer that receives the input sample as a combination of the feature values, an output layer that performs the classification by using some activation function, and one or more hidden layers (placed in between the input and output layers) whose neurons perform computations on the inputs. The logistic regression model makes a prediction based on the probability that an input sample belongs to a particular class: if the probability is greater than 0.5, the sample is assigned to that class; otherwise, the sample is classified to the other class.
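The four standard models described above map directly onto scikit-learn estimators. The sketch below uses synthetic data and illustrative hyperparameter values, not the tuned values the study reports in Table 5:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the glioma data (23 predictors, as in the paper)
X, y = make_classification(n_samples=400, n_features=23, random_state=0)

models = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="rbf", random_state=0),   # RBF kernel for nonlinear data
    "MLP": MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0),
    "LR": LogisticRegression(max_iter=1000),
}
# Training-set accuracy of each fitted model (illustration only)
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
```

In practice the hyperparameters (k, kernel, hidden-layer sizes, regularization) would be tuned as described in the experimental protocol.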
RF44 is an extension of the bagging method made up of multiple decision trees, each generated from a sample drawn with replacement from the training set (i.e., with replacement means that one sample could be selected multiple times, while others could not be selected at all). During the construction of a tree, the best split is selected from a random subset of features, thus ensuring low correlation between decision trees. When classifying new input samples, all trees make a judgment and the final decision is made by majority vote. CatBoost45 is an improved implementation of gradient boosting on binary decision trees, which means that each new tree is trained to minimize the loss function of the previous model (i.e., to reduce the error made by previous trees) using gradient descent. CatBoost handles categorical features not by using a binary substitution of the categorical values but by performing a random permutation of the training data (this ensures different orderings during different stages of the gradient boosting process) and calculating the average label value for the sample with the same class value placed before the given one in the permutation.
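The two ensembles can be instantiated in the same way. CatBoost is a separate package; when it is unavailable, scikit-learn's gradient boosting can serve as a rough stand-in for the boosting paradigm, although it lacks CatBoost's ordered categorical encoding, so this is an illustrative substitution rather than an equivalent model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=23, random_state=0)

# Bagging of decorrelated trees, final decision by majority vote
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Sequential trees, each fitted to the gradient of the previous model's loss
# (stand-in for CatBoost; CatBoost additionally handles categorical features)
gb = GradientBoostingClassifier(random_state=0).fit(X, y)
```

With CatBoost installed, `catboost.CatBoostClassifier` would replace the gradient boosting stand-in and accept categorical feature indices directly.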
Before applying the prediction models, the values of the attribute Age at diagnosis were normalized using the z-score standardization technique so that the mean of all values was 0 and the standard deviation was 1. A raw value x of the feature is converted into a normalized value z by

\(z = \dfrac{x - {\bar{x}}}{SD}\)

where \({\bar{x}}\) and SD are the mean and standard deviation of the feature, respectively.
Note that normalization was applied solely to Age at diagnosis because all other attributes are categorical. On the other hand, to find the best hyperparameter values for the machine learning models, we fine-tuned them using an 80-20 stratified holdout method (Table 5).
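The z-score formula and the stratified 80-20 holdout can be sketched together; the ages and grades below are synthetic stand-ins for the 839 real patients:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
age = rng.normal(50, 15, size=839)     # hypothetical Age at diagnosis values
y = rng.integers(0, 2, size=839)       # hypothetical glioma grades (0=LGG, 1=GBM)

# z-score standardization: subtract the mean, divide by the standard deviation
z = (age - age.mean()) / age.std()

# 80-20 stratified holdout used for hyperparameter tuning: stratify=y keeps
# the LGG/GBM proportions the same in the training and test partitions
X_train, X_test, y_train, y_test = train_test_split(
    z.reshape(-1, 1), y, test_size=0.2, stratify=y, random_state=0
)
```

After standardization the feature has zero mean and unit standard deviation, which puts it on the same scale as the 0/1 biomarker attributes.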
Performance evaluation
We adopted a stratified 10-fold cross-validation method, where the data set was randomly divided into ten stratified non-overlapping blocks of roughly equal size. The models were trained with nine of these blocks combined and then applied to the remaining block to estimate the performance. This process was repeated for each of the 10 blocks, giving a total of 755 training samples and 84 testing samples in each of the 10 iterations of the cross-validation. Performance was then calculated as the average of the 10 estimates thus obtained. We used six scalar indicators to evaluate the prediction performance: classification accuracy (Acc), Precision (Prec), Recall, Specificity (Spec), F1-score (F1), and Matthews correlation coefficient (MCC). All these measures are derived from a \(2 \times 2\) confusion matrix whose entries are the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
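The evaluation protocol above translates directly into scikit-learn's cross-validation utilities. This is a minimal sketch on synthetic data; specificity is obtained as the recall of the negative class, and the model shown is an arbitrary choice:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the 839-patient glioma data set
X, y = make_classification(n_samples=839, n_features=23, random_state=0)

scoring = {
    "Acc": "accuracy",
    "Prec": "precision",
    "Recall": "recall",
    "Spec": make_scorer(recall_score, pos_label=0),  # recall of the negative class
    "F1": "f1",
    "MCC": make_scorer(matthews_corrcoef),
}
# Stratified 10-fold CV: each fold preserves the class proportions
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
res = cross_validate(RandomForestClassifier(random_state=0), X, y,
                     cv=cv, scoring=scoring)
mean_scores = {k: res[f"test_{k}"].mean() for k in scoring}
```

Each entry of `mean_scores` is the average of the 10 per-fold estimates, matching the protocol described above.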
In addition to these scalar metrics, we also included the receiver operating characteristics (ROC) and the precision-recall curves to visualize how the machine learning models used in the experiments perform in predicting the classes. The ROC curve plots the false positive rate (i.e., 1 − specificity) on the X-axis against the true positive rate on the Y-axis; the closer the curve approaches the upper left corner of the ROC space, the better the model is at predicting the classes. The precision-recall curve shows the trade-off between precision (ratio of true positives among positive predictions) and recall (ratio of true positives within the positive class) at different thresholds; ideally, the curve should be as close to the top right corner as possible.
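The points of both curves are computed from the predicted class probabilities swept over all thresholds. A minimal sketch on synthetic data, using logistic regression as an arbitrary probabilistic classifier:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (auc, average_precision_score,
                             precision_recall_curve, roc_curve)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=839, n_features=23, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# Predicted probability of the positive class on the held-out samples
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, probs)                 # ROC curve points
roc_auc = auc(fpr, tpr)                              # area under the ROC curve
prec, rec, _ = precision_recall_curve(y_te, probs)   # PR curve points
ap = average_precision_score(y_te, probs)            # area under the PR curve
```

The `(fpr, tpr)` and `(rec, prec)` arrays are exactly the point sets plotted in ROC and precision-recall figures.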
Results and discussion
This section consists of three blocks. First, we investigated the most informative molecular biomarkers based on the distribution of cases in each glioma grade and checked whether these results agree with the results of four feature ranking algorithms. The second block analyzes the performance of four standard prediction models and two classifier ensembles for glioma grading. Finally, we apply some resampling techniques to handle class imbalance and verify whether this leads to increased performance.
Most informative features
Values in Table 6 show the distribution of cases according to glioma grades for the mutated molecular biomarkers (i.e., feature value = 1). As can be seen, IDH1 mutations are the most common, being detected in 404 patients (48.15% of the total cases studied). However, these mutations occur in 94.31% of cases with LGG and only in 5.69% of cases with GBM, confirming previous findings that this is a very informative molecular biomarker for glioma grading17,34,48: IDH1/2 mutations have been largely associated with grade II and III gliomas and secondary glioblastomas49. Looking at the biomarkers with 50 or more cases, similar conclusions can be drawn for the molecular biomarkers ATRX with 84.33% of LGG, PTEN (phosphatase and tensin homolog) with 82.27% of patients affected by GBM and CIC (capicua transcriptional repressor) with 96.40% LGG. In the case of biomarkers with a low percentage of patients, we find NOTCH1 (notch receptor 1) (100% of LGG), FUBP1 (far upstream element binding protein 1) (95.56% of LGG), IDH2 (isocitrate dehydrogenase 2) (91.30% of LGG), SMARCA4 (SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a, member 4) (85.19% of LGG) and RB1 (retinoblastoma transcriptional corepressor 1) (85.00% of GBM).
To support the conclusions drawn from the values in Table 6, we ran four feature ranking algorithms on the normalized data set with the aim of checking which are the most informative predictors: information gain, Gini index, Chi-squared, and RF. Note that identifying the most informative clinical factors and glioma molecular biomarkers can be valuable in obtaining relevant biological information. On the other hand, in some practical cases, having small feature sets with high prediction accuracy can become paramount to minimize response time.
Information gain (infGain) estimates the relevance of a predictor based on the amount by which the entropy of the class decreases when considering that feature. The Gini index (Gini) estimates the distribution of a predictor in different classes and can be interpreted as a measure of impurity for a feature. Chi-squared (Chi2) measures the relationship strength between each variable and the class label. Note that Chi-squared applies to categorical predictors, and therefore, numerical attributes (as is the case for Age at diagnosis) must first be discretized into several intervals. In the case of RF as a feature ranking method, each tree in the forest calculates the importance of a predictor based on its ability to decrease the weighted impurity in the tree.
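Three of these four rankers are available in scikit-learn; information gain is approximated here by mutual information, and the Gini-based ranking can be read from impurity-based tree importances in much the same way as the RF ranking shown. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2, mutual_info_classif

X, y = make_classification(n_samples=400, n_features=10,
                           n_informative=4, random_state=0)
X_pos = X - X.min(axis=0)   # chi2 requires non-negative feature values

# Indices of features sorted from most to least informative per ranker
rank_ig = np.argsort(mutual_info_classif(X_pos, y, random_state=0))[::-1]
rank_chi2 = np.argsort(chi2(X_pos, y)[0])[::-1]
rank_rf = np.argsort(
    RandomForestClassifier(random_state=0).fit(X, y).feature_importances_
)[::-1]
```

Note that, as stated above, Chi-squared applies to categorical predictors; for the glioma data only Age at diagnosis would first need to be discretized.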
Since each feature ranking algorithm could yield different results (rankings), fusing them using a multiple intersection method was necessary to find out which features got the highest rankings in the output of the four algorithms. Thus, looking at the rankings of each algorithm, it was possible to determine which were the five most relevant attributes, while there were discrepancies in establishing the most informative attributes from the sixth position onwards. From the outputs of the multiple intersection method, Table 7 shows that the four feature ranking algorithms agreed to define IDH1 as the most informative attribute, followed by Age at diagnosis, PTEN, CIC and ATRX. These results are interesting because they are consistent with the findings of various studies conducted in neuroscience and neuro-oncology17,34,48,49 in which the mutated molecular biomarkers that best discriminate LGG from GBM were determined, as reported in Table 6. The relevance of this lies in the fact that feature selection or ranking algorithms could be used to discover molecular biomarkers with the greatest discriminating power instead of other more expensive, time-consuming and difficult to carry out methods.
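The multiple intersection method described above can be sketched as a plain top-k set intersection; the exact fusion procedure and the example rankings below are assumptions for illustration, not the paper's actual outputs:

```python
def top_k_intersection(rankings, k):
    """Return the attributes that appear in the top-k of every ranking."""
    return set.intersection(*(set(r[:k]) for r in rankings))

# Hypothetical outputs of the four ranking algorithms (best attribute first)
r_infgain = ["IDH1", "Age", "PTEN", "CIC", "ATRX", "EGFR"]
r_gini    = ["IDH1", "Age", "CIC", "PTEN", "ATRX", "TP53"]
r_chi2    = ["IDH1", "PTEN", "Age", "ATRX", "CIC", "EGFR"]
r_rf      = ["IDH1", "Age", "PTEN", "ATRX", "CIC", "NF1"]

top5 = top_k_intersection([r_infgain, r_gini, r_chi2, r_rf], k=5)
```

Here all four hypothetical rankings agree on the same five attributes, mirroring the agreement reported in Table 7, while the sixth position onward diverges.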
We ran multidimensional scaling50 to visualize in Fig. 3 the samples from both classes as a function of the attribute Age at diagnosis against each of the four most informative biomarkers (IDH1, PTEN, CIC, and ATRX). Each blue dot represents an LGG sample, and each red dot is a GBM sample. The regions belonging to each class are shaded in blue or red depending on whether they correspond to the LGG or GBM class, respectively. These graphs allow us to see how the age of the patients and mutations are related to the grade of glioma. For example, Fig. 3a reveals that most LGG cases require IDH1 mutations and occur at younger ages than GBM cases. For PTEN (Fig. 3b), LGG occurs when there is no mutation, while GBM does not appear to depend on this molecular biomarker since approximately the same number of cases is seen both with and without PTEN mutations.
Results of the prediction models
Table 8 reports the results of each of the six evaluation metrics achieved by the prediction models applied to the normalized data set (with all predictors) using the experimental protocol described above. The results revealed that RF was the best performing model, although closely followed by CatBoost and SVM. In contrast, kNN, MLP, and LR obtained the lowest values regardless of the performance evaluation metric used.
For better analysis of these results, we performed a pairwise comparison of models using a correlated Bayesian t-test51 for each evaluation metric to check whether the difference in scores between each pair of models was significant or not. Unlike the frequentist correlated t-test, where the inference is a p-value, the inference of the Bayesian t-test is a posterior probability. Additionally, this test considers the correlation and the uncertainty (i.e., the standard error) of the results generated by cross-validation. The outputs of the statistical test are summarized in Table 9, where the number in a cell denotes the probability that the model corresponding to the row had a significantly higher score (posterior probability greater than 0.5) than the model corresponding to the column. Values in this table indicate that the results obtained by RF and CatBoost were significantly better than those of kNN, MLP, and LR, regardless of the metric used. When comparing RF and CatBoost with SVM, it can be seen that the differences were not statistically significant when using Prec (0.492 and 0.399) and Spec (0.460 and 0.416). Finally, posterior probabilities of RF being significantly better than CatBoost revealed that the performance differences between both ensembles were very small, so one should not conclude that RF performed better than CatBoost.
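The correlated Bayesian t-test can be sketched directly from its definition: the posterior of the mean score difference is a Student-t distribution with n − 1 degrees of freedom whose scale is inflated by the correlation between cross-validation folds (ρ = n_test / (n_train + n_test), i.e., 0.1 for 10-fold CV). The per-fold differences below are hypothetical, and this minimal version omits the region of practical equivalence that full implementations (e.g., the baycomp package) support:

```python
import numpy as np
from scipy import stats

def bayesian_correlated_ttest(diff, rho):
    """Posterior probability that model A outperforms model B, given the
    per-fold score differences diff (A minus B) and the fold correlation rho."""
    diff = np.asarray(diff, dtype=float)
    n = len(diff)
    # Correlation-corrected scale of the Student-t posterior (n-1 dof)
    scale = np.sqrt((1.0 / n + rho / (1.0 - rho)) * diff.var(ddof=1))
    # P(mean difference > 0) under the posterior
    return 1.0 - stats.t.cdf(0.0, df=n - 1, loc=diff.mean(), scale=scale)

# Hypothetical per-fold accuracy differences (e.g., RF minus kNN) over 10 folds
diff = [0.03, 0.05, 0.02, 0.04, 0.06, 0.01, 0.03, 0.05, 0.02, 0.04]
p_better = bayesian_correlated_ttest(diff, rho=0.1)
```

A posterior probability above 0.5 favors model A; values close to 1 indicate a significant advantage, as in Table 9.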
Figure 4 plots the ROC curves for the RF and CatBoost ensembles separately for each of the two classes (LGG and GBM). The diagonal dotted line represents the behavior of a random classifier, while the solid diagonal line represents iso-performance in the ROC space so that all the points on the line give the same profit/loss. The closer to the top and further to the left this solid diagonal line is, the better the classifier result. The AUC was 0.923 for RF and 0.924 for CatBoost, that is, the difference between both classifiers was negligible.
Figure 5 shows the confusion matrix corresponding to each of the six prediction models. Although the imbalance ratio of the data set was moderately low (1.38), the confusion matrix allows us to discover the behavior of the models in each of the classes, that is, to analyze the number of successes and errors individually by class in order to identify whether or not there were differences between predicting samples belonging to the majority class and samples of the minority class. Thus, it can be observed that the three models with the best performance (RF, CatBoost and SVM) made fewer errors than the other three classifiers (kNN, MLP and LR) on the minority class (GBM). In contrast, the number of misclassifications on the majority class (LGG) was similar in all classifiers.
Explainability of predictions
Due to the “black box” nature of most machine learning models, one of the main problems is their insufficient interpretability or the difficulty in understanding the predictions they make. To shed light on these limitations, some methodologies belonging to the eXplainable Artificial Intelligence (XAI)52 paradigm have been proposed in order to provide a reasonable understanding of the output of machine learning models. In particular, we analyzed the effect of the attributes on the prediction performance using two explainability approaches: global feature importance and SHAP.
Global feature importance estimates the contribution of each individual feature to the prediction by measuring the increase in the prediction error of the model after performing permutations on the feature values across the data set, which breaks the relationship between the feature and the target variable44,53. A feature is important if permuting its values increases the model error, while a feature is of little or no importance if permuting its values does not change the error of the model.
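This permutation-based notion of global feature importance is implemented by scikit-learn. A minimal sketch on synthetic data (the model and data are illustrative stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8,
                           n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
# Shuffle each feature's values n_repeats times and measure the drop in
# held-out accuracy; a large drop means the feature is important
result = permutation_importance(model, X_te, y_te,
                                n_repeats=10, random_state=0)
importances = result.importances_mean  # one mean importance per feature
```

Sorting `importances` in descending order yields bar charts like those in Fig. 6.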
Bar charts in Fig. 6 show feature importances in descending order for each classifier, indicating that the IDH1 biomarker was the most important attribute contributing to the target variable (i.e., glioma grade), regardless of the model used. The second most important feature was Age at diagnosis in all cases except when applying the MLP neural network (note that even in this case the attribute Age at diagnosis was the third most important). It is worth highlighting that these results mostly agree with those reported in Table 7, where these two features were also identified as the most relevant when applying the multiple intersection method.
It should be noted that the global feature importance approach reveals the absolute importance of each attribute, but it does not indicate the direction of the change given by the permutations; that is, it does not report whether the feature increases or decreases the prediction performance of the model. To overcome this limitation, we also employed the SHAP method introduced by Lundberg and Lee54, which is based on the principles of cooperative game theory and can provide explanations of model predictions at both local and global levels. This method computes Shapley values, which quantify the average marginal contribution of a feature to the prediction made by the model after considering all possible combinations with other features55; in other words, it reports whether the influence of each feature on the model's prediction is positive (increase) or negative (decrease). The Shapley value of a feature is calculated as the difference between the prediction when the feature is present and the prediction when it is absent.
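As an illustration of this definition, the exact Shapley values of a single sample can be computed by enumerating all feature subsets, which is feasible only for a handful of features (SHAP itself uses more scalable estimators). Here "absent" features are replaced by hypothetical baseline values, one common approximation:

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample x by enumerating feature subsets."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                # weight |S|! (n - |S| - 1)! / n! from the Shapley formula
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                with_i, without = baseline.copy(), baseline.copy()
                for j in S:
                    with_i[j] = x[j]
                    without[j] = x[j]
                with_i[i] = x[i]
                phi[i] += w * (predict(with_i) - predict(without))
    return phi

# Toy linear model: contributions should equal coef * (x - baseline)
coef = np.array([2.0, -1.0, 0.5])
predict = lambda v: float(coef @ v)
x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros(3)
phi = shapley_values(predict, x, baseline)   # approx [2.0, -1.0, 0.5]
```

A useful sanity check is the efficiency property: the Shapley values sum to the difference between the prediction for the sample and the prediction for the baseline.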
Figure 7 shows the SHAP summary plot for each model, which represents the positive or negative impact of each feature on the prediction of one class. On the X-axis is the Shapley value, which denotes how much the features contribute to the prediction of a patient diagnosed with GBM across all possible combinations. A value less than 0 indicates a negative contribution (i.e., low importance for the prediction of the minority class GBM), equal to 0 indicates no contribution, and greater than 0 indicates a positive contribution (i.e., high importance for prediction). The left vertical axis (Y-axis) is for features ranked in descending order of their relevance to the prediction of class GBM, while the right vertical axis indicates the value of the features from lowest to highest. Each dot represents the Shapley value of a sample (patient) plotted horizontally and is colored red or blue depending on whether the value is high or low, respectively.
From these plots, it can be seen that Age at diagnosis was the most important feature for the prediction of class GBM when the kNN and LR models were used, and the second most relevant with the rest of the classifiers. Samples with higher values of this feature (red) had higher Shapley values, meaning that they contributed to the prediction of class GBM, whereas lower values (blue) contributed against the prediction of this class. The IDH1 biomarker contributed the most to the prediction of the GBM class when using the MLP, SVM, RF and CatBoost models. As IDH1 is a categorical attribute, its impact on the prediction depends on its value (0 = non-mutated, 1 = mutated). Thus, it can be seen that this biomarker with the non-mutated value for the patient (red color) contributed to the prediction of the GBM class, while the mutated value contributed negatively.
Addressing class imbalance
Considering the differences in misclassifications between the majority class and the minority class, we decided to address the class imbalance in order to see whether any performance improvement could be obtained. It is well known that training a machine learning algorithm with imbalanced data can favor the majority class, typically leading to higher misclassification rates on the minority class (GBM). Among the various strategies to address imbalanced data, resampling techniques are by far the most widely used approach because they are efficient, classifier-independent, and easy to implement for any problem56. They change the composition of the training data set by adjusting the number of majority and/or minority samples until both classes are represented by an approximately equal number of samples. Many researchers have argued that over-sampling is generally superior to under-sampling because under-sampling algorithms can discard potentially useful data and increase classifier variance57. It should be noted that, to avoid overoptimistic results, resampling should be applied only to the training set, not to the entire data set58. In the case of over-sampling, for instance, this means that the testing samples are neither over-sampled nor seen by the machine learning model during training.
Experiments in this section were carried out with two resampling algorithms. The first is an over-sampling algorithm proposed by Chawla et al.59 called SMOTE, which generates artificial samples of the minority class (GBM) by interpolating between existing samples that are close together. It first finds the k minority nearest neighbors of each minority sample, and then synthetic samples are generated in the direction of some or all of those nearest neighbors. Depending on the amount of over-sampling required, a certain number of samples are randomly chosen from the k nearest neighbors. The second is random under-sampling (RUS), which balances the data set by randomly removing samples that belong to the majority class (LGG).
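The SMOTE interpolation step can be sketched in a few lines. This is a simplified stand-in for the Chawla et al. algorithm, applied to hypothetical minority-class data rather than the TCGA samples:

```python
import numpy as np

rng = np.random.default_rng(42)

def smote(X_min, n_new, k=5):
    """Minimal SMOTE sketch: each synthetic sample lies on the segment
    between a random minority sample and one of its k nearest minority
    neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]     # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                       # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Up-size a hypothetical minority class (GBM) towards the majority count
X_gbm = rng.normal(size=(40, 3))
X_new = smote(X_gbm, n_new=15)
```

Because each synthetic point is a convex combination of two minority samples, the generated data stay within the region already covered by the minority class.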
Table 10 reports the performance results obtained after preprocessing the normalized data set with SMOTE and RUS. The first issue worth mentioning is that over-sampling performed better than under-sampling on every metric except Recall. Secondly, unlike the results obtained with the normalized data set without preprocessing (Table 8), the best model after up-sampling the data set was SVM, although the differences with respect to RF and CatBoost were negligible.
To check whether the differences in the means between the normalized training set without preprocessing and those preprocessed with over-sampling and under-sampling were significant, a two-tailed t-test60 was performed at a significance level of 5% (\(\alpha = 0.05\)); the t-values and p-values are shown in Table 11. When comparing the means of Table 8 with those of over-sampling (upper part of Table 10), we found that the differences were statistically significant in all cases, except when using specificity to evaluate the performance of the models. On the other hand, when comparing them with the means obtained with under-sampling (bottom of Table 10), we found that the precision and specificity on the non-preprocessed set were significantly better than those of the downsized set. Therefore, despite the low imbalance ratio, the test supports over-sampling the normalized data set with the SMOTE algorithm to increase the performance of the prediction models.
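A two-tailed t-test of this kind can be reproduced with SciPy. The score vectors below are hypothetical stand-ins, not the values of Tables 8 and 10:

```python
from scipy import stats

# Hypothetical scores of six models on the original vs SMOTE-preprocessed set
scores_original = [0.84, 0.85, 0.83, 0.86, 0.84, 0.85]
scores_smote    = [0.87, 0.88, 0.86, 0.89, 0.87, 0.88]

# Two-tailed two-sample t-test (SciPy's default alternative is two-sided)
t_stat, p_value = stats.ttest_ind(scores_original, scores_smote)
significant = p_value < 0.05        # reject H0 of equal means at alpha = 0.05
```

A negative t-value here indicates that the SMOTE means are higher, and a p-value below 0.05 deems the difference statistically significant.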
As a further confirmation of the findings using SMOTE, in Fig. 8 we plotted precision-recall curves for the best prediction models (SVM, RF and CatBoost) when applied to the original training sets and the over-sampled training sets. The area under the precision-recall curve was 0.838, 0.860 and 0.872 for SVM, 0.873, 0.910 and 0.904 for RF, and 0.872, 0.908 and 0.898 for CatBoost using the original, over-sampled and under-sampled training sets, respectively. These values confirm some performance improvements as a result of addressing class imbalance with SMOTE.
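The area under a precision-recall curve can be computed via the average-precision formulation; below is a minimal sketch with hypothetical labels and scores (the study's exact curve construction may differ):

```python
import numpy as np

def average_precision(y_true, scores):
    """Area under the precision-recall curve as average precision:
    the sum of precision values weighted by recall increments."""
    order = np.argsort(scores)[::-1]            # rank samples by score, descending
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                           # true positives at each threshold
    precision = tp / np.arange(1, len(y) + 1)
    recall = tp / y.sum()
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum(precision * (recall - prev_recall)))

# Hypothetical GBM-class labels and classifier scores
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
ap = average_precision(y_true, scores)
```

Values closer to 1.0 indicate that positive (GBM) samples are ranked above negative ones, mirroring the comparison of areas reported for Fig. 8.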
The last experiment focused on analyzing the behavior of the prediction models on the upsized data set using the feature vector with the five most relevant attributes according to the multiple intersection method. Table 12 shows that the best performing models were LR and SVM, which is surprising, as these results differ from those obtained on the data set containing all attributes. On the other hand, when comparing the results of the upper part of Table 10 with those of Table 12, one can see that the performance of all the prediction models worsened when applied to the reduced sets. To check whether the differences were statistically significant, we again ran a two-tailed t-test at a significance level of 0.05, obtaining t-value \(=-6.898545\) and p-value \(=0.00098\), which confirms that the differences were significant.
Conclusions
Glioma grading and prediction constitute a highly relevant practical health problem that is usually addressed using neuroimaging techniques. However, the development of advanced genomics and proteomics methods allows the identification of mutations in certain molecular biomarkers that can support diagnosis, prognosis and prediction of response to therapy. In this study, several data-centric machine learning models have been used to discriminate between LGG and GBM samples using a series of clinical factors and molecular biomarkers. Furthermore, a comprehensive descriptive analysis of the data set used in the experiments has been carried out. The descriptive analysis has included several statistics of the attributes and the application of four feature ranking algorithms to determine the most relevant characteristics, and the molecular biomarkers selected by these algorithms as the most informative agree with the conclusions of previous molecular biology studies. Moreover, these feature ranking algorithms offer an important advantage: they are much cheaper and faster than genomics and proteomics methods.
Of the different machine learning methods analyzed, the two classifier ensembles (RF and CatBoost) have obtained the best scores regardless of the metric used. The global feature importance approach revealed the absolute relevance of each attribute, while the SHAP analysis of individual samples provided a reasonable interpretation of which attributes contributed most to the prediction of class GBM. On the other hand, when analyzing the confusion matrices, important differences have been observed between the misclassifications on the majority class and the minority class, which suggested the need to apply some techniques to address the class imbalance. In particular, the normalized data set has been preprocessed with an oversampling algorithm (SMOTE) and an undersampling algorithm (RUS) and it has been found that upsizing the minority class improves the prediction performance. As a final comment, it is worth noting that a model-centric approach applied to the TCGA data set achieved 0.876 accuracy31, while the data-centric method proposed in this study yielded accuracy rates of 0.882 (with oversampling) and 0.881 (with both oversampling and undersampling).
While this study provides valuable insights into prediction of glioma grades, an interesting avenue for future research refers to the analysis of the possible bias that may arise in predictions against certain sensitive social groups (e.g., gender, age, race, etc.). With this objective, the aim is to quantify the existence of bias through fairness metrics and, if necessary, apply bias mitigation methods61,62. When the bias is inherited from the way the training set was created, one approach that would reduce the bias is to internally rebalance the class distributions so that they are equal across class and sensitive attributes.
Data availability
The data set used and analyzed during the current study is available in the UCI Machine Learning Repository: Glioma Grading Clinical and Mutation Features [Dataset]. https://doi.org/10.24432/C5R62J.
Code availability
The custom code used in this study is available upon request from the corresponding author.
References
Louis, D. N. et al. The 2007 WHO classification of tumours of the central nervous system. Acta Neuropathol. 114, 97–109 (2007).
Delgado-López, P. D. & Corrales-García, E. M. Survival in glioblastoma: A review on the impact of treatment modalities. Clin. Transl. Oncol. 18, 1062–1071 (2016).
Hanif, F. et al. Glioblastoma multiforme: A review of its epidemiology and pathogenesis through clinical presentation and treatment. Asian Pac. J. Cancer Prev. 18, 3–9 (2017).
Zhuge, Y. et al. Automated glioma grading on conventional MRI images using deep convolutional neural networks. Med. Phys. 47, 3044–3053 (2020).
Kummar, S. & Lu, R. Using radiomics in cancer management. JCO Precis. Oncol. 8, e2400155 (2024).
Taha, B., Boley, D., Sun, J. & Chen, C. C. State of radiomics in glioblastoma. Neurosurgery 89, 177–184 (2021).
Cheng, J. et al. Prediction of glioma grade using intratumoral and peritumoral radiomic features from multiparametric MRI images. IEEE/ACM Trans. Comput. Biol. Bioinform. 19, 1084–1095 (2022).
Lee, J. H. et al. Preoperative prediction of early recurrence in resectable pancreatic cancer integrating clinical, radiologic, and CT radiomics features. Cancer Imaging 24, 6 (2024).
Miranda, J. et al. The role of radiomics in rectal cancer. J. Gastrointest. Cancer 54, 1158–1180 (2023).
Nguyen, H. S. et al. Predicting EGFR mutation status in non-small cell lung cancer using artificial intelligence: A systematic review and meta-analysis. Acad. Radiol. 31, 660–683 (2024).
Khanfari, H. et al. Exploring the efficacy of multi-flavored feature extraction with radiomics and deep features for prostate cancer grading on mpMRI. BMC Med. Imaging 23, 195 (2023).
Kim, S., Kim, M. J., Kim, E. K., Yoon, J. H. & Park, V. Y. MRI radiomic features: Association with disease-free survival in patients with triple-negative breast cancer. Sci. Rep. 10, 3750 (2020).
Pinter, N. K. & Fritz, J. V. Neuroimaging for the neurologist: Clinical MRI and future trends. Neurol. Clin. 38, 1–35 (2020).
Verger, A. & Langen, K. J. PET Imaging in Glioblastoma: Use in Clinical Practice. In Glioblastoma (ed. De Vleeschouwer, S.) (Codon Publications, 2017).
Almansory, K. O. & Fraioli, F. Combined PET/MRI in brain glioma imaging. Br. J. Hosp. Med. (Lond.) 80, 380–386 (2019).
Tiefenbach, J. et al. The use of advanced neuroimaging modalities in the evaluation of low-grade glioma in adults: A literature review. Neurosurg. Focus 56, E3 (2024).
Siegal, T. Clinical impact of molecular biomarkers in gliomas. J. Clin. Neurosci. 22, 437–444 (2015).
Figarella-Branger, D. et al. The 2021 WHO classification of tumours of the central nervous system. Ann. Pathol. 42, 367–382 (2022).
Zlochower, A. et al. Deep learning AI applications in the imaging of glioma. Top. Magn. Reson. Imaging 29, 115–121 (2020).
Buchlak, Q. D. et al. Machine learning applications to neuroimaging for glioma detection and classification: An artificial intelligence augmented systematic review. J. Clin. Neurosci. 89, 177–198 (2021).
Luo, J., Pan, M., Mo, K., Mao, Y. & Zou, D. Emerging role of artificial intelligence in diagnosis, classification and clinical management of glioma. Semin. Cancer Biol. 91, 110–123 (2023).
Deepak, S. & Ameer, P. M. Brain tumor classification using deep CNN features via transfer learning. Comput. Biol. Med. 111, 103345 (2019).
Shboul, Z. A., Chen, J. & Iftekharuddin, K. M. Prediction of molecular mutations in diffuse low-grade gliomas using MR imaging features. Sci. Rep. 10, 3711 (2020).
Alksas, A. et al. A novel system for precise grading of glioma. Bioengineering 9, 532 (2022).
Matsui, Y. et al. Prediction of lower-grade glioma molecular subtypes using deep learning. J. Neurooncol. 146, 321–327 (2020).
Gutta, S., Acharya, J., Shiroishi, M. S., Hwang, D. & Nayak, K. S. Improved glioma grading using deep convolutional neural networks. AJNR Am. J. Neuroradiol. 42, 233–239 (2021).
Sun, P., Wang, D., Mok, V. C. & Shi, L. Comparison of feature selection methods and machine learning classifiers for radiomics analysis in glioma grading. IEEE Access 7, 102010–102020 (2019).
Cho, H. H., Lee, S. H., Kim, J. & Park, H. Classification of the glioma grading using radiomics analysis. PeerJ 6, e5982 (2018).
Bae, S. et al. Robust performance of deep learning for distinguishing glioblastoma from single brain metastasis using radiomic features: model development and validation. Sci. Rep. 10, 12110 (2020).
Zhao, R., Zhuge, Y., Camphausen, K. & Krauze, A. V. Machine learning based survival prediction in glioma using large-scale registry data. Health Inform. J. 28, 14604582221135428 (2022).
Tasci, E., Zhuge, Y., Kaur, H., Camphausen, K. & Krauze, A. V. Hierarchical voting-based feature selection and ensemble learning model scheme for glioma grading with clinical and molecular characteristics. Int. J. Mol. Sci. 23, 14155 (2022).
Joshi, R. C. et al. Ensemble based machine learning approach for prediction of glioma and multi-grade classification. Comput. Biol. Med. 137, 104829 (2021).
Munquad, S., Si, T., Mallik, S., Li, A. & Das, A. B. Subtyping and grading of lower-grade gliomas using integrated feature selection and support vector machine. Brief. Funct. Genom. 21, 408–421 (2022).
Ren, Y. et al. Noninvasive prediction of IDH1 mutation and ATRX expression loss in low-grade gliomas using multiparametric MR radiomic features. J. Magn. Reson. Imaging 49, 808–817 (2019).
Zheng, S. et al. GlioPredictor: A deep learning model for identification of high-risk adult IDH-mutant glioma towards adjuvant treatment planning. Sci. Rep. 14, 2126 (2024).
Zhan, T. et al. An automatic glioma grading method based on multi-feature extraction and fusion. Technol. Health Care 25, 377–385 (2017).
Wu, M. et al. Development and validation of a clinical prediction model for glioma grade using machine learning. Technol. Health Care 32, 1977–1990 (2024).
Ye, L. et al. An online survival predictor in glioma patients using machine learning based on WHO CNS5 data. Front. Neurol. 14, 1179761 (2023).
Zhou, H., Chen, B., Zhang, L. & Li, C. Machine learning-based identification of lower grade glioma stemness subtypes discriminates patient prognosis and drug response. Comput. Struct. Biotechnol. J. 21, 3827–3840 (2023).
Kha, Q. H., Le, V. H., Hung, T. N. K. & Le, N. Q. K. Development and validation of an efficient MRI radiomics signature for improving the predictive performance of 1p/19q co-deletion in lower-grade gliomas. Cancers 13, 5398 (2021).
Kumar, S., Datta, S., Singh, V., Singh, S. K. & Sharma, R. Opportunities and challenges in data-centric AI. IEEE Access 12, 33173–33189 (2024).
Zha, D., Bhat, Z. P., Lai, K.-H., Yang, F. & Hu, X. Data-centric AI: Perspectives and challenges. In Proc. SIAM Int. Conf. on Data Mining (eds Shekhar, S. et al.) 945–948 (SIAM, 2023).
Hamid, O. H. Data-centric and model-centric AI: Twin drivers of compact and robust industry 4.0 solutions. Appl. Sci. 13, 2753 (2023).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. CatBoost: unbiased boosting with categorical features. In Proc. 32nd Int. Conf. on Neural Information Processing Systems (eds Bengio, S. et al.) 6639–6649 (ACM, 2018).
Yap, B. W. & Sim, C. H. Comparisons of various types of normality tests. J. Stat. Comput. Simul. 81, 2141–2155 (2011).
Mishra, P. et al. Descriptive statistics and normality tests for statistical data. Ann. Card. Anaesth. 22, 67–72 (2019).
DeWitt, J. C. et al. Cost-effectiveness of IDH testing in diffuse gliomas according to the 2016 WHO classification of tumors of the central nervous system recommendations. Neuro-Oncology 19, 1640–1650 (2017).
Kan, L. K. et al. Potential biomarkers and challenges in glioma diagnosis, therapy and prognosis. BMJ Neurol. Open. 2, e000069 (2020).
Kruskal, J. B. & Wish, M. Multidimensional Scaling (SAGE, 1978).
Corani, G. & Benavoli, A. A Bayesian approach for comparing cross-validated algorithms on multiple data sets. Mach. Learn. 100, 285–304 (2015).
Gunning, D. et al. XAI-Explainable artificial intelligence. Sci. Robot. 4, 120 (2019).
Fisher, A., Rudin, C. & Dominici, F. All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously. J. Mach. Learn. Res. 20, 177 (2019).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (eds Guyon, I. et al.) 4765–4774 (Curran Associates, 2017).
Alabi, R. O. et al. Machine learning explainability in nasopharyngeal cancer survival using LIME and SHAP. Sci. Rep. 13, 8984 (2023).
López, V., Fernández, A., Moreno-Torres, J. G. & Herrera, F. Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics. Expert Syst. Appl. 39, 6585–6608 (2012).
García, V., Sánchez, J. S., Marqués, A. I., Florencia, R. & Rivera, G. Understanding the apparent superiority of over-sampling through an analysis of local information for class-imbalanced data. Expert Syst. Appl. 158, 113026 (2020).
Santos, M. S., Soares, J. P., Abreu, P. H., Araujo, H. & Santos, J. Cross-validation for imbalanced datasets: Avoiding overoptimistic and overfitting approaches [Research Frontier]. IEEE Comput. Intell. Mag. 13, 59–76 (2018).
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002).
Bland, J. M. & Bland, D. G. Statistics notes: One and two sided tests of significance. BMJ 309, 248 (1994).
Fletcher, R. R., Nakeshimana, A. & Olubeko, O. Addressing fairness, bias, and appropriate use of artificial intelligence and machine learning in global health. Front. Artif. Intell. 3, 561802 (2021).
Giovanola, B. & Tiribelli, S. Beyond bias and discrimination: redefining the AI ethics principle of fairness in healthcare machine-learning algorithms. AI Soc. 38, 549–563 (2023).
Funding
Open access funding provided by Institute of New Imaging Technologies and Department of Computer Languages and Systems, Universitat Jaume I.
Author information
Authors and Affiliations
Contributions
All authors contributed equally to the preparation of this manuscript, read it, and approved the submitted version.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Sánchez-Marqués, R., García, V. & Sánchez, J.S. A data-centric machine learning approach to improve prediction of glioma grades using low-imbalance TCGA data. Sci Rep 14, 17195 (2024). https://doi.org/10.1038/s41598-024-68291-0
This article is cited by
- Hybrid classical and quantum computing for enhanced glioma tumor classification using TCGA data. Scientific Reports (2025)
- Hybrid clustering strategies for effective oversampling and undersampling in multiclass classification. Scientific Reports (2025)