Introduction

Pediatric low-grade gliomas (pLGG) are the most common brain tumors in children, accounting for approximately 40% of pediatric central nervous system (CNS) tumors1. As a diverse set of tumors, pLGGs can occur throughout the CNS and encompass a variety of histopathologies2. Despite this heterogeneity, pLGGs are often driven by a select few molecular alterations in the mitogen-activated protein kinase pathway, most commonly the KIAA1549::BRAF fusion (BRAF Fusion) or the BRAF p.V600E point mutation (BRAF Mutation)3. BRAF status has proven crucial for accurate prognostication and risk assessment of pLGG and for therapeutic decision-making4,5. The importance of genetic status is highlighted in recent editions of the WHO CNS tumor classification6: while CNS tumor classification has historically been based primarily on histological findings, more recent approaches also rely on molecular status7.

The identification of the molecular alterations driving pLGGs has been transformative for their treatment. Therapeutics targeting specific genetic alterations have been developed that can supplement or replace classic cytotoxic treatments4,8. Use of these targeted therapies, optimal prognostication, and risk assessment all require surgically acquired tumor tissue, which is used to determine BRAF status9. Currently, in cases where tissue samples cannot be retrieved, such as with difficult-to-access tumors, the disease cannot be well characterized, and it is difficult to identify an appropriate targeted therapy.

Machine learning (ML) has shown potential as an alternative, non-invasive method of detecting BRAF status in patients with pLGG. Earlier studies used ML to successfully classify patients with pLGG by genetic status based on manually segmented tumor regions of their MRIs10,11,12,13,14,15,16. While encouraging, the pipelines described in these works depend on manual tumor segmentations that have known intra-reader17 and inter-reader18 variability issues. More recent studies have aimed to eliminate the reliance on manual segmentations by introducing more reproducible pipelines that incorporate automated segmentation models19,20,21. However, such an approach is not fully reliable either: earlier studies focusing on automating the segmentation of pLGGs either excluded difficult and outlier cases22 or failed for certain patients, missing the tumor entirely23.

In this study, we propose a non-invasive BRAF status identification pipeline that aims to remedy the weaknesses of earlier studies that relied on manual or automated segmentations. Previous works attempting to automate the segmentation of pLGGs19,23 used deep learning (DL) models based on the U-Net24. These DL architectures, which rely on convolutional neural networks (CNN), were state-of-the-art until recently. More advanced models, such as those relying fully or partially on transformer layers25,26 or on the ConvNeXt27 architecture, have since proven more effective for adult brain tumor segmentation. We aimed to evaluate the performance of each of these DL architectures for pLGG segmentation, in order to develop a more reliable automated segmentation model capable of consistently identifying the tumor region. We also aimed to develop and evaluate an ML pipeline that determines BRAF status from whole-brain fluid-attenuated inversion recovery (FLAIR) MRI sequences, using semiautomatic whole tumor volume segmentations and features derived from them for pretraining only, rather than as model inputs. Once deployed, this pipeline has no reliance on tumor segmentations whatsoever.

We find that modern DL architectures such as MedNeXt outperform earlier CNN and transformer-based models for pLGG segmentation, although automated segmentations remain unreliable as direct inputs for downstream classification. Furthermore, we show that a segmentation-free whole-brain FLAIR MRI classification model achieves performance comparable to manual segmentation-based approaches and improves through pretraining. These results demonstrate that pLGG genetic status can be assessed without reliance on tumor segmentations, providing a practical and non-invasive alternative to tissue sampling.

Methods

This study conformed to the guidelines and regulations of the research ethics board of The Hospital for Sick Children, which approved this retrospective study and waived the need for informed consent due to the retrospective nature of the study. All data were deidentified after being extracted from the electronic health record database.

Data

A total of 513 patients treated for pLGGs located in the brain parenchyma from 1999 to 2023 at The Hospital for Sick Children (SickKids) (Toronto, Canada) were identified using the SickKids electronic health record database. We excluded 15 patients for whom genetic information was unavailable. We excluded 43 additional patients who did not have a 2D axial or coronal FLAIR MR sequence available. We used FLAIR as our primary imaging sequence because it is widely regarded as the most sensitive method of detecting brain tumors28, and thus was the MRI sequence that was available for the most patients in our retrospective dataset. Furthermore, FLAIR is sensitive to leptomeningeal spread and non-contrast-enhancing lesions29, and depicts the tumor and surrounding area better than contrast-enhanced T1-weighted images30. The demographics of the 455 patients included in this study are summarized in Table 1.

Table 1 Demographic information of the patient population

Molecular subtyping

We followed the same stepwise process for molecular characterization as previous works, which relied on subsets of the data we used in this study11,13,16. Formalin-fixed paraffin-embedded tissue obtained during biopsy or resection was used for most patients. If not available, frozen tissue was used. BRAF mutations were detected using immunohistochemistry, sequencing, or droplet digital PCR. BRAF fusions were identified using either fluorescence in situ hybridization or an nCounter Metabolic Pathways Panel (NanoString Technologies). Patients not found to be positive for BRAF mutation or fusion were determined to be “non-BRAF altered”.

Image acquisition, segmentation, and preprocessing

Patients underwent brain MRI at field strengths of 1.5T or 3T using MRI scanners of different vendors (Signa, GE Healthcare; Achieva, Philips Healthcare; Magnetom Skyra, Siemens Healthineers). Whole tumor volumes were segmented semiautomatically using the level tracing effect tool of 3D Slicer31,32 (Version 4.10.2) by a pediatric neuroradiology fellow (MS) and validated by a fellowship-trained pediatric neuroradiologist (MWW). To standardize images and reduce variability due to differences in acquisition methods over time, as well as other sources of heterogeneity, all scans were bias-corrected, resampled to 240 × 240 × 155 voxels, z-score normalized, and registered to the SRI24 atlas33 using 3D Slicer.
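To illustrate the z-score normalization step, a minimal NumPy sketch is given below. The actual pipeline performed bias correction, resampling, and registration within 3D Slicer; the function name and optional brain-mask argument here are hypothetical.

```python
import numpy as np

def zscore_normalize(volume, mask=None):
    """Z-score normalize an MRI volume; if a brain mask is supplied,
    the mean and standard deviation are computed over masked voxels only."""
    voxels = volume[mask > 0] if mask is not None else volume.ravel()
    mu, sigma = voxels.mean(), voxels.std()
    return (volume - mu) / (sigma + 1e-8)  # epsilon guards against flat volumes

# toy example on a random "volume"
vol = np.random.default_rng(0).normal(loc=100.0, scale=15.0, size=(8, 8, 8))
norm = zscore_normalize(vol)
```

After normalization, the volume has approximately zero mean and unit standard deviation, which stabilizes training across scanners and protocols.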

ML pipelines for pLGG genetic status classification

At a high level, we explored two approaches to designing a pipeline capable of classifying pLGGs into three classes: BRAF Fusion, BRAF Mutation, and non-BRAF altered. First, we assessed the potential for advanced DL models to improve the performance of automated segmentation models, making these segmentations more reliable. We then introduced a whole-brain FLAIR-based segmentation-free classification model. Figure 1 outlines the experiments that were used to assess both approaches, details of which can be found below.

Fig. 1: A high-level overview of our methodology.

We started by testing 3 different architectures on the task of pLGG segmentation (1). The architecture that performed best on this task was used for all ensuing experiments. We then tested whether transfer learning from an adult brain tumor dataset could help improve segmentation performance (2). Next, the segmentation architecture was converted into a classification model by adding some layers and removing others. The classification model was trained and evaluated with semiautomatic whole tumor volume segmentations as input (3) to generate baseline results. Finally, the classification model was trained and tested with whole-brain FLAIR MRIs as input using three different initialization schemes: random (4), transfer learning (5), and pretraining (6).

Advanced DL models for more accurate tumor segmentation

We evaluated the performance of three different DL architectures for tumor segmentation on our pLGG dataset: (1) a CNN based on the MedicalNet architecture34, (2) a CNN-transformer hybrid similar to TransBTS26, and (3) a MedNeXt35-style model based on the ConvNeXt27 architecture. A review and comparison of these architectures is provided in the Supplementary Information. For implementation, we relied on the GitHub repositories of the publications in which these architectures were introduced. The number and size of layers were adjusted to obtain a similar number of trainable parameters (approximately 400k) within each model for a fair comparison between architectures. We ran all subsequent experiments with the architecture that was found to perform best on this segmentation task because it showed the highest potential to derive insights from MRIs of pLGGs (Fig. 1).
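Matching trainable parameter counts across architectures can be sketched as follows in PyTorch. The tiny two-layer stand-in model is hypothetical; the real models are 3D MedicalNet, TransBTS, and MedNeXt variants sized to roughly 400k parameters each.

```python
import torch.nn as nn

def count_trainable(model):
    """Number of trainable parameters, used to match model capacity."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# hypothetical stand-in block, far smaller than the ~400k-parameter models compared
tiny = nn.Sequential(
    nn.Conv3d(1, 8, kernel_size=3, padding=1),  # 1*8*27 weights + 8 biases = 224
    nn.Conv3d(8, 8, kernel_size=3, padding=1),  # 8*8*27 weights + 8 biases = 1736
)
print(count_trainable(tiny))  # 1960
```

Layer widths and depths would be tuned per architecture until the counts are approximately equal.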

Next, we explored whether transfer learning (TL) from the Brain Tumor Segmentation (BraTS) 2020 challenge dataset, which consists of MRIs of adult patients with brain tumors, could improve the performance of our segmentation models (Fig. 2). TL aims to improve performance on one task by leveraging knowledge learned from a related task in advance36. It involves pretraining a model on a large dataset before fine-tuning it on a smaller target dataset; the model first learns to extract useful features from the larger dataset, and with this knowledge transferred, it is less likely to overfit than when training from scratch. To assess the effects of TL, the model was first pretrained on the BraTS dataset, where the provided FLAIR images and segmentations were used as model inputs and outputs, respectively. Subsequently, the model was fine-tuned and evaluated on our pLGG dataset.
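The weight-transfer step can be sketched as copying a pretrained model's state dict into an identically shaped target model before fine-tuning. The tiny sequential model below is a hypothetical stand-in for the segmentation network; in the actual experiments, the source weights came from training on BraTS FLAIR volumes.

```python
import torch
import torch.nn as nn

# hypothetical stand-in for the segmentation network
pretrained = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv3d(8, 1, 1))
# ... in the real pipeline, `pretrained` would first be trained on BraTS ...

# identically configured model to be fine-tuned on the pLGG dataset
finetune = nn.Sequential(nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv3d(8, 1, 1))
finetune.load_state_dict(pretrained.state_dict())  # start from transferred weights

# confirm the transfer: every parameter tensor matches
same = all(torch.equal(p, q)
           for p, q in zip(pretrained.parameters(), finetune.parameters()))
print(same)  # True
```

Fine-tuning then proceeds as ordinary training on the target data, starting from the transferred weights rather than from random initialization.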

Fig. 2: Depiction of BraTS TL regimen for the segmentation (left) and classification (right) tasks.

The input to all models was the whole-brain FLAIR MRI from the respective dataset. The model was first trained on the BraTS dataset to identify tumor regions (segmentation task) and to distinguish high-grade gliomas (HGG) from low-grade gliomas (LGG) (classification task). The pretrained weights were then transferred over and trained on the corresponding task using the pLGG data.

Segmentation-free whole-brain FLAIR classification

The architecture that performed best for pLGG segmentation through the experiments detailed above was reconfigured for classification. The decoder was discarded, and a fully connected layer with three outputs, one for each of the three classes (BRAF Mutation, BRAF Fusion, and non-BRAF altered), was added on top of the encoder. The resultant model was randomly initialized and then trained to identify pLGG genetic subtype from whole-brain FLAIR MRI sequences, rather than using the tumor region alone. Next, similar to the experiments detailed above for segmentation, we attempted to improve classification model performance through transfer learning using the BraTS dataset (Fig. 2). The model was first trained to differentiate between high-grade and low-grade gliomas using whole-brain FLAIR MRI sequences from the BraTS dataset, then fine-tuned on our BRAF status classification task.
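The encoder-to-classifier conversion can be sketched as below: the decoder is discarded, the encoder's bottleneck output is pooled, and a three-way fully connected head is attached. The class names and the tiny stand-in encoder are hypothetical; the real encoder is the MedNeXt-style one from the segmentation experiments.

```python
import torch
import torch.nn as nn

class EncoderClassifier(nn.Module):
    """Sketch: reuse a segmentation encoder, discard the decoder, and add
    a 3-output head (BRAF Fusion / BRAF Mutation / non-BRAF altered)."""
    def __init__(self, encoder, bottleneck_channels, n_classes=3):
        super().__init__()
        self.encoder = encoder                  # pretrained or randomly initialized
        self.pool = nn.AdaptiveAvgPool3d(1)     # collapse spatial dimensions
        self.head = nn.Linear(bottleneck_channels, n_classes)

    def forward(self, x):
        z = self.pool(self.encoder(x)).flatten(1)  # (batch, channels)
        return self.head(z)                        # (batch, 3) class logits

# hypothetical tiny encoder standing in for the MedNeXt encoder
enc = nn.Sequential(nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU())
model = EncoderClassifier(enc, bottleneck_channels=16)
logits = model(torch.zeros(2, 1, 16, 16, 16))
print(logits.shape)  # torch.Size([2, 3])
```

The same head structure applies whether the encoder weights are random, transferred from BraTS, or taken from the pretraining regimen described next.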

In another attempt to improve classification performance, we implemented a pretraining regimen (Fig. 3) that relied on a set of tasks we hypothesized would give the model a good understanding of the tumor region. The tasks were tumor segmentation, location identification, and radiomic feature value prediction, and the model was trained on them simultaneously using the sum of the three loss functions. The tumor segmentation task was the same as described in the first set of experiments. The location identification task required the model to predict whether the tumor was located in the supratentorial or infratentorial region of the brain. The label for this binary classification task was manually identified through analysis of the FLAIR MRI sequence by a pediatric neuroradiologist (MWW). Radiomics involves extracting quantitative features from medical images37; it has been shown that these features are predictive of BRAF status11,13. We extracted radiomic features from the semiautomatic whole tumor volume segmentations and used them to train a random forest BRAF status prediction model for patients in the training and validation sets according to the open-radiomics protocol38. The top 10 radiomic features were selected according to permutation importance; the values of these features were used for pretraining. If pretraining worked as expected, the model would recognize both where the tumor was located and characteristics within the tumor that are correlated with BRAF status. As illustrated in Fig. 3, after pretraining, the model was converted into a classification model, with certain pretrained layers kept intact, before the entire model was fine-tuned on the pLGG classification task.
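The combined pretraining objective, the sum of the three task losses, can be sketched as follows. The function names and toy tensor shapes are hypothetical; the segmentation term mirrors the soft Dice plus binary cross-entropy loss used elsewhere in the paper.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(pred_logits, target, eps=1e-6):
    """Soft Dice loss computed on sigmoid probabilities."""
    p = torch.sigmoid(pred_logits)
    inter = (p * target).sum()
    return 1 - (2 * inter + eps) / (p.sum() + target.sum() + eps)

def pretraining_loss(seg_logits, seg_target, rad_pred, rad_target, loc_logit, loc_target):
    """Sum of the three pretraining objectives described in the text."""
    l_seg = (soft_dice_loss(seg_logits, seg_target)
             + F.binary_cross_entropy_with_logits(seg_logits, seg_target))
    l_rad = F.smooth_l1_loss(rad_pred, rad_target)  # 10 radiomic feature values
    l_loc = F.binary_cross_entropy_with_logits(loc_logit, loc_target)  # supra/infratentorial
    return l_seg + l_rad + l_loc

# toy tensors with hypothetical shapes (batch of 1, 8^3 volume, 10 radiomic features)
loss = pretraining_loss(
    torch.randn(1, 1, 8, 8, 8), torch.randint(0, 2, (1, 1, 8, 8, 8)).float(),
    torch.randn(1, 10), torch.randn(1, 10),
    torch.randn(1, 1), torch.randint(0, 2, (1, 1)).float(),
)
```

Backpropagating this single scalar trains the shared encoder on all three tasks at once.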

Fig. 3: Overview of the pipeline for our best-performing BRAF status classification model, which uses semiautomatic whole tumor volume segmentations, radiomic feature values, and tumor locations for pretraining.

The top of the image describes the pretraining phase, where the whole MRI is used as the input. In red is the architecture that was found to be best for tumor segmentation, which processes the image and outputs the predicted tumor region (top right, in green). During pretraining, a fully connected layer was added to the segmentation architecture. The output of the bottleneck flowed into the fully connected layer, which output radiomic feature values and tumor location (middle right, in green). The fully connected layer consisted of 11 neurons (10 radiomic features, one location feature). The bottom of the figure describes the fine-tuning phase, which relied on the pretrained weights in the blue box (encoder and bottleneck layers). These layers were transferred over directly from the pretrained model and used as the starting point for training the final BRAF status classification model. A new fully connected layer consisting of just three neurons (one for each class) was added to take in the bottleneck output and generate a BRAF status prediction (bottom right, in green). The same pipeline described at the bottom of the figure for fine-tuning was used for the experiment where we aimed to identify BRAF status from whole MRIs without pretraining, except the model parameters were randomly initialized.

Classification based on tumor region alone

Most previous studies aiming to identify pLGG genetic status used manually segmented tumor regions as inputs to their ML models. Thus, a final set of experiments, which used semiautomatic whole tumor volume segmentations alone as input to the same classification architecture described above, was run and served as a baseline.

Statistics and reproducibility

To train and evaluate all models, patients were divided into train (80%), validation (10%), and test (10%) sets 25 times using a stratified split to ensure representative proportions of each class within each split. In each of these 25 trials, three separate models were trained on the training set using three different learning rates (0.00001, 0.0001, 0.001). Validation loss was calculated on the validation set at the end of each epoch. The optimal model selected for final evaluation on each trial was chosen according to the learning rate and epoch that resulted in the lowest validation loss. Reported performance metrics were based on the unseen test set, which was only used for model evaluation. All experiments for all models used identical data splits, allowing for direct comparisons of performance across architectures and initialization strategies. The "corrected resampled t-test"39,40 was used to test for significant differences in model performance, because repeated resampling violates the independence assumption of the traditional t-test41, which therefore could not be used here.
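The corrected resampled t-test replaces the variance term of the standard paired t-test with one that accounts for the overlap between resampled training sets. A minimal sketch under our 80/10/10 split follows (the function name is hypothetical; the n2/n1 correction term follows Nadeau and Bengio39).

```python
import math
from statistics import mean, stdev

def corrected_resampled_ttest(scores_a, scores_b, test_frac=0.10, train_frac=0.80):
    """Corrected resampled t-test for k repeated train/test splits.
    The variance is inflated by n2/n1 (test-set over training-set size)
    to account for overlapping training sets. Returns (t, df) with df = k - 1."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    var = stdev(diffs) ** 2  # sample variance of per-trial differences
    t = mean(diffs) / math.sqrt((1 / k + test_frac / train_frac) * var)
    return t, k - 1

# toy example: 25 paired per-trial metrics for two models
a = [0.78 + 0.001 * i for i in range(25)]
b = [0.75 + 0.002 * i for i in range(25)]
t, df = corrected_resampled_ttest(a, b)  # df = 24
```

The resulting t statistic is compared against a t distribution with k - 1 degrees of freedom to obtain the p-values reported in the Results.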

Experiment configuration

Kaiming normal weight initialization was used for all model parameters when pretraining was not employed. Dropout was used during training for all models (0.75 for fully connected layers, 0.25 for convolutional layers, where entire channels were zeroed out) and turned off for inference. Cosine annealing was used to decay the learning rate over a maximum of 200 epochs for classification and 50 epochs for the slower segmentation task. Training was stopped early to conserve resources if validation loss did not decrease for 10 consecutive epochs. The batch size was 8 images for classification experiments, but only 2 for the more memory-intensive segmentation experiments. Gradient accumulation was used in the segmentation experiments to double the effective batch size by updating model parameters only after every other batch.
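The gradient-accumulation scheme can be sketched as below: gradients from consecutive batches accumulate in the parameters' .grad buffers, and the optimizer steps only every second batch, doubling the effective batch size from 2 to 4. The tiny linear model and toy batches are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # hypothetical stand-in for the segmentation model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
accum_steps, updates = 2, 0

batches = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(8)]
opt.zero_grad()
for step, (x, y) in enumerate(batches, start=1):
    # scale so the accumulated gradient equals the average over both batches
    loss = nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()               # gradients accumulate in .grad between steps
    if step % accum_steps == 0:   # update parameters only after every other batch
        opt.step()
        opt.zero_grad()
        updates += 1
print(updates)  # 4 parameter updates for 8 batches
```

This trades extra forward/backward passes for lower peak memory, which the 3D segmentation inputs required.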

Cross-entropy loss, weighted by the proportion of patients in each class, was used to train classification models. Classification models were primarily evaluated according to the One-Vs-Rest area under the receiver operating characteristic curve (AUC). Segmentation models were trained using the sum of soft dice loss and binary cross-entropy loss and evaluated by the Dice score. During pretraining, the radiomic feature value prediction and location identification tasks were trained using a smooth L1 loss and binary cross-entropy loss, respectively. Python 3.11.0 was used to run all experiments, and the PyTorch 1.13.042 library was used for DL.
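The Dice score used to evaluate segmentation models measures overlap between predicted and reference masks. A minimal NumPy sketch (function name hypothetical):

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice score between binary masks: 2|A ∩ B| / (|A| + |B|)."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# toy 2D example: 4-voxel prediction fully inside an 8-voxel reference mask
a = np.zeros((4, 4), dtype=int); a[:2, :2] = 1   # 4 voxels
b = np.zeros((4, 4), dtype=int); b[:2, :4] = 1   # 8 voxels, 4 overlapping
print(round(dice_score(a, b), 3))  # 2*4 / (4+8) = 0.667
```

A score of 1 indicates perfect overlap and 0 indicates none, which contextualizes the mean Dice scores around 0.45 to 0.56 reported in the Results.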

Results

Segmentation

The segmentation results across 25 trials for all three DL architectures are shown in Fig. 4. The MedNeXt segmentation model (mean Dice score: 0.555, 95% confidence interval (CI) for the mean Dice score: [0.529, 0.574]) outperformed both the CNN-based MedicalNet (0.516, [0.492, 0.537]) and the CNN-transformer hybrid TransBTS (0.449, [0.420, 0.482]). The differences between the performance of each of the models were statistically significant. The p-values of the corrected resampled t-tests when comparing the performance of the MedNeXt model to that of MedicalNet and TransBTS were 0.045 and 0.00014, respectively, while comparing the latter two models resulted in a p-value of 0.017.

Fig. 4: Distribution of mean Dice scores for three different DL architectures on the pLGG segmentation task across 25 trials.

The solid orange line is the median of each distribution, while the dashed red line is the mean.

Classification

The mean three-class AUCs for all classification models are listed in Table 2. With weights initialized from scratch, the MedNeXt-style classification model achieved a mean AUC of 0.747 (95% CI for the mean: [0.719, 0.774]) on the whole-brain FLAIR MRI BRAF status classification task, where no segmentation was used. Performance improved (p-value: 0.0466) through pretraining on the segmentation, location, and radiomic feature value prediction tasks, resulting in a mean AUC of 0.779 (95% CI for the mean: [0.759, 0.799]), which was found to be the best classifier according to AUC. Additional performance metrics for this best model are listed in Table 3. Notably, despite a relatively high AUC, overall accuracy was relatively low (57%). The confusion matrix and per-class metrics in Table 3 suggest that this is because the model lacks the ability to differentiate between BRAF Mutation and non-BRAF altered tumors. Overall, the model is best at identifying BRAF Fusion: the AUC (0.867), precision (0.746), and recall (0.747) for this class were the highest.

Table 2 Comparison of the performance of different classification pipelines as measured by AUC
Table 3 Detailed BRAF status classification performance metrics for the model pretrained using semiautomatic whole tumor volume segmentations, radiomics features, and tumor location

The baseline model, which used the tumor region alone as an input, performed similarly to the whole-brain FLAIR MRI models. The mean baseline AUC of 0.756 (95% CI for the mean: [0.735, 0.778]) was only slightly higher than the randomly initialized whole FLAIR MRI model, and the difference between the performance of the two models was not statistically significant (p-value: 0.342). The best-performing whole-brain FLAIR MRI model, pretrained on pLGG data, produced a higher AUC than the baseline, though again the difference was found to not be statistically significant (p-value: 0.141).

Transfer learning

TL from BraTS did not have a statistically significant impact on segmentation or classification performance. For the segmentation task, pretraining the MedNeXt style model on BraTS before fine-tuning it on our pLGG dataset resulted in a mean Dice score of 0.539 (95% CI for the mean: [0.525, 0.553]). This was lower than when training from scratch (0.555), and not significantly different (p-value: 0.180). Similarly, pretraining the MedNeXt classification model on BraTS prior to fine-tuning resulted in lower AUC (mean: 0.725, 95% CI for the mean: [0.704, 0.746]) than when training from scratch (0.747). Statistical testing showed that the difference was not significant (p-value: 0.215).

Discussion

In this study, we introduced a segmentation-free MRI-based pipeline to assess molecular status in patients with pLGG. Our whole-brain FLAIR MRI classification approach reduces the reliance of ML models used for pLGG genetic status classification on tumor segmentations. Our results illustrate that even without pretraining, and thus without any knowledge about tumor location whatsoever, the whole-brain FLAIR MRI classification model performs similarly to the baseline model that takes the tumor region as an input. Pretraining on pLGG tumor segmentations and features derived from them before fine-tuning on the classification task was found to improve model performance. This pretraining approach introduces a novel way of incorporating tumor information into a classification pipeline that does not rely on segmentations after training. The pretrained whole-brain FLAIR MRI classification model was able to accurately identify tumors with a BRAF Fusion but had difficulties differentiating between BRAF Mutations and non-BRAF altered tumors. These results suggest identification of BRAF Fusion, the most common genetic alteration in pLGGs, is the most promising use case for the whole-brain FLAIR MRI classification pipeline. We emphasize that once deployed, the pretrained model does not have any need for tumor segmentations.

We also explored the potential for advanced DL architectures to accurately and reliably segment MRIs of patients with pLGGs. Previous pLGG segmentation studies have relied on CNN-based models19,23; we found that a modern model, based on the MedNeXt architecture, performed better. The MedNeXt model was designed using ConvNeXt27 layers, which combine the inherent inductive biases of CNNs with the ability of transformers to scale efficiently and capture long-range dependencies35. MedNeXt appears to combine the best of CNNs and transformers; it was previously shown to outperform both for adult brain tumor segmentation35. Our results suggest MedNeXt is better at pediatric brain tumor segmentation, too. However, for some patients, segmentations were poor (Fig. 5), suggesting that automated segmentations are too unreliable for direct use in downstream tasks, although they were sufficiently accurate to be helpful in pretraining. Although expert manual segmentations are currently considered the gold standard, they pose practical challenges as classification model inputs due to known inter- and intra-rater variability, as well as scalability concerns. Variability in model inputs may reduce model reliability and harm clinician confidence when predictions are sensitive to subtle differences in input delineations.

Fig. 5: Preprocessed images for 8 patients, with both automated (yellow) and semiautomatic whole tumor volume (green) segmentations. Similar to what has been found in previous works, our model works well for some patients (left) but fails in certain cases (right).

Notably, for the two topmost cases in the right column, the segmentation is completely off, with no tumor identified at all in the slice depicted. A downstream classification model will struggle to identify the genetic status of a tumor accurately if the automated segmentation model it depends on does not correctly identify the tumor.

Notably, TL from BraTS did not improve classification or segmentation model performance. TL has successfully boosted model performance for numerous medical tasks36. However, it has not always proven useful, likely due to domain shift, a mismatch between the dataset used for pretraining and the dataset the model is later fine-tuned on43,44. For example, as described in ref. 45, previous studies focused on TL from BraTS for pLGG automatic tumor segmentation23,46 did not find a consistent, substantial improvement in model performance using TL. Considering the significant differences between adult and pediatric brain tumors47,48,49, it is perhaps unsurprising that knowledge does not transfer between the two patient groups. We hypothesize that our custom pretraining scheme involving semiautomatic whole tumor volume segmentation, tumor location, and radiomic features was successful because it did not rely on any additional data and thus was not susceptible to domain shift.

Currently, the genetic status of pLGG is ascertained through the analysis of tumor tissue acquired through either resection or biopsy9. However, in more than one-third of cases, neither option is possible or clinically advisable20. This is particularly true for tumors in anatomically challenging locations, such as midline pLGG45. As a result, the genetic status of these difficult-to-access tumors often remains unknown. Our model offers a potential non-invasive solution for these patients. Even in cases where biopsy is feasible, it presents limitations: samples can be inadequate for molecular analysis, and the procedure itself is expensive and carries inherent risks. Reported rates of permanent and transient complications are 0.7% and 5.8%, respectively50. Therefore, even when biopsy is an option, our non-invasive approach provides a potentially safer, more accessible alternative for determining the genetic status of pLGG.

Our work is not without limitations. Our observations about the relative performance of different DL models and training schemes are specific to the situations in which we evaluated them, namely, on our hospital's pLGG dataset using a finite set of hyperparameters. Further tests on different tasks, datasets, and model configurations need to be run to determine the generalizability of these claims. However, the fact that similar results have been found for related tasks35 supports our claims. Additionally, our model was best suited for identifying BRAF Fusion and had difficulties differentiating BRAF Mutations from non-BRAF altered tumors. This could potentially be addressed by data augmentation techniques such as synthetic image generation, which has proven successful in improving pLGG genetic status classification51, but was not applied here. Furthermore, uncertainty quantification techniques such as evidential DL52, or dropout as Bayesian approximation53, could be applied to identify uncertain predictions, potentially facilitating the use of our model to distinguish between all three classes more accurately, or even to identify uncertain segmentations, enabling the use of automated segmentations for certain cases. Notably, despite these limitations, our findings suggest that the model in its current form already holds clinical utility. Specifically, it could be deployed as a binary classifier, distinguishing BRAF Fusion, the most common genetic alteration in pLGG, from all other genetic subtypes, with a high degree of accuracy.

Conclusion

We introduced the first model for pLGG genetic status assessment that does not require tumor segmentations as inputs. Our training pipeline uses segmentations and features derived from them for pretraining, but once deployed, the final model does not require any segmentations to determine the genetic status of pLGG.