Introduction

White matter hyperintensity (WMH) is an imaging change indicative of cerebral small vessel disease or cerebral microangiopathy1,2. Cerebral ischemia/hypoxia, hypoperfusion due to altered cerebrovascular autoregulation, disruption of the blood–brain barrier, inflammation, degeneration and amyloid angiopathy have all been suggested as underlying mechanisms of WMH, although their pathophysiology is not completely understood3,4. The clinical significance of WMH has been continuously investigated since Hachinski first described this feature on brain computed tomography (CT), with research revealing an association with future stroke, cognitive decline, depression, disability, and mortality in both general and diseased populations5,6,7. Stroke patients with WMH have been reported to have poor outcomes in the acute and subacute phases, as well as in the long-term period after stroke8.

The Fazekas scale has been widely used as a simplified visual rating method in the clinical field and in research9,10. Although it is fast, easy, and intuitive, its utility is limited in terms of subjectivity and inter-rater reliability. Recently, automatic and semi-automatic methods for WMH segmentation have been introduced to overcome these limitations. Previous studies have demonstrated a strong correlation between the Fazekas score and automatic segmentation of WMH11,12,13. However, these methods are not easy or appropriate for clinical application due to their complexity, time demands, necessity for quality control, and absence of normative data. In addition, these studies were performed predominantly on normally aging populations. The automatic measurement of WMH in stroke patients appears challenging, because brain MRIs show diverse lesions ranging from lacunes, large territorial or multiple lesions, to cerebral hemorrhages that could be mixed and confused with WMH14,15. As such, methods that effectively integrate spatial characteristics and probabilistic modeling to account for lesion variability are crucial for the reliable detection and quantification of WMH in these patients.

In recent studies, significant progress has been made in WMH segmentation using deep-learning techniques, showing high segmentation accuracy without complex procedures. Additionally, some studies have reported that the segmented volumes of WMH via deep learning correspond well with Fazekas ratings. However, while previous research has primarily focused on volumetric quantification, approaches that incorporate both spatial distribution patterns and probabilistic estimation of WMH burden remain limited. Hence, the present study aimed to develop an artificial intelligence (AI) pipeline utilizing deep learning to perform both segmentation and Fazekas ratings, encompassing both spatial and probabilistic characteristics of WMH alongside its structural properties.

Methods

Patient population

This investigation was conducted as a substudy of the Prevention of Cardiovascular Events in Ischemic Stroke Patients with a High Risk of Cerebral Hemorrhage (PICASSO) trial16. PICASSO was a multicenter, randomized, double-blind controlled trial that compared the efficacy and safety of two antiplatelet agents: aspirin and cilostazol. Patients were enrolled within 6 months of experiencing an ischemic stroke or transient ischemic attack and were classified as having either symptomatic or asymptomatic intracranial hemorrhage (ICH) or multiple cerebral microbleeds. Figure 1 shows a flow diagram of the patient enrollment in the study. Among 505 patients with MRI randomized in the PICASSO trial, 497 for whom baseline T2-weighted fluid-attenuated inversion recovery (FLAIR) MR images were collected and recorded in the center as DICOM files were screened for the present study. Owing to the characteristics of the PICASSO study participants, few patients had Fazekas 0 or 1. Therefore, patients with Fazekas 0 or 1 from the Korea University Stroke Registry (KUSR) who met the criteria for PICASSO study participation were also randomly included in this study. In total, 421 patients were included after the exclusion of 226 patients because of low-quality MR images. For the external validation, we included patients from the Samsung Medical Center (SMC) Registry who met the same criteria. This dataset contained only Fazekas scale labels and lacked WMH segmentation labels. All institutional boards of the participating centers, including the Korea University Ansan Hospital (2016AS0051), approved the study protocol, which followed the Standards for Diagnostic Accuracy (STARD) reporting guidelines17. Informed consent was waived due to the retrospective nature of the study, which utilized anonymized and de-identified data. All methods were performed in accordance with the relevant guidelines and regulations.

Fig. 1

Flow diagram of patient enrollment.

Defining ground truth masks and image preprocessing

The WMH were segmented on the FLAIR images by a neurologist (J-MJ) and used as ground-truth masks. All images were preprocessed in the following order: (1) intensity non-uniformity correction using the N4 bias field correction algorithm18; (2) affine transform to the common space with the FLAIR atlas registered in the MNI (Montreal Neurological Institute) space19; (3) skull stripping using an artificial neural network developed for brain extraction in pathologically altered brain MRI20; (4) inter-subject (nonlinear) coregistration for spatial normalization; and (5) adaptive histogram equalization (Supplementary Figure S1). Horizontal flipping was applied to augment the data in the training set.
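The horizontal-flip augmentation can be illustrated with a minimal sketch. The function name `flip_left_right` and the choice of the last array axis as the left-right direction are assumptions made for illustration; the source does not specify the array layout. The key point is that the image and its ground-truth mask must be mirrored together.

```python
import numpy as np

def flip_left_right(image, mask):
    """Mirror an image volume and its mask along the last axis.

    Assumption: the last axis is the left-right direction.
    """
    return np.flip(image, axis=-1).copy(), np.flip(mask, axis=-1).copy()

# Toy 2 x 2 x 2 volume and mask, for illustration only.
image = np.arange(8).reshape(2, 2, 2)
mask = (image % 2 == 1)
flipped_image, flipped_mask = flip_left_right(image, mask)
```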

Outcome variable and data splitting

Two stroke neurologists (K and L) independently reviewed the baseline FLAIR images, and the Fazekas scale was scored as the outcome variable by consensus (inter-rater agreement, quadratic weighted kappa: 0.853). WMH were evaluated in the hemisphere contralateral to the region affected by the prior stroke. In cases where the WMH score differed between the hemispheres, the rating was determined based on the unaffected or less affected side, under the assumption of symmetry. Further, 40% of patients were randomly selected, stratified by their Fazekas scores, to create a separate “internal test set” used solely for performance evaluation. The remaining patients were then randomly divided into three folds stratified by Fazekas scores, and used as the model training and “internal validation sets” following a cross-validation strategy. For the internal validation, each validation fold was used to evaluate the corresponding parts of the model. The same folds were used consistently in the development of both the WMH segmentation model and the Fazekas scoring model to ensure that the subjects were not mixed.
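The splitting scheme described above (a stratified hold-out test set followed by three stratified folds) can be sketched as follows. This is an illustrative reconstruction rather than the study's code: the function name `stratified_split`, the random seed, and the round-robin fold assignment are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.4, n_folds=3, seed=0):
    """Return (test_indices, folds), each stratified by Fazekas grade."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    test, folds = [], [[] for _ in range(n_folds)]
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        # Deal the remaining subjects round-robin so folds stay balanced.
        for k, idx in enumerate(members[n_test:]):
            folds[k % n_folds].append(idx)
    return sorted(test), [sorted(f) for f in folds]

# Hypothetical cohort: 10 subjects per Fazekas grade (0-3).
labels = [grade for grade in range(4) for _ in range(10)]
test_idx, folds = stratified_split(labels)
```

Reusing the same `folds` for both the segmentation model and the rating model is what keeps a subject from appearing in the training data of one step and the validation data of the other.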

Model development

To obtain the Fazekas scale from FLAIR images, we established a two-step pipeline (Fig. 2). The first step comprised WMH segmentation, for which we employed uResNet21, a U-net-based residual network. U-net is a widely used segmentation algorithm with an encoder-decoder architecture based on a fully convolutional neural network (CNN), which uses skip connections to combine high- and low-level feature maps. uResNet implements residual units instead of the plain convolutional blocks in U-net, thus reducing vanishing gradient problems and preventing information loss. We further constructed 2.5-dimensional models for the uResNet model, utilizing five consecutive slices to predict the WMH on the center slice. To further benchmark the segmentation performance, we additionally evaluated five deep learning models, including two transformer-based architectures (SwinUNETR22 and UNETR23) and three convolution-based architectures (AttentionUNet24, DynUNet25, and SegResNet26). Each model used for benchmarking was implemented using the MONAI v1.4.0 library27, with all hyperparameters set to their default options.
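The 2.5-dimensional input can be pictured as a stack of five neighboring slices presented as channels for each center slice. The sketch below shows one way to build such inputs; the function name `make_25d_inputs`, the axis order, and the edge-padding rule at the first and last slices are assumptions, since the source does not describe boundary handling.

```python
import numpy as np

def make_25d_inputs(volume, n_slices=5):
    """Stack n_slices consecutive slices as channels for each center slice.

    volume: array of shape (depth, height, width).
    Returns an array of shape (depth, height, width, n_slices).
    """
    half = n_slices // 2
    # Edge padding repeats the first/last slice (an assumed boundary rule).
    padded = np.pad(volume, ((half, half), (0, 0), (0, 0)), mode="edge")
    stacks = [padded[i:i + volume.shape[0]] for i in range(n_slices)]
    return np.stack(stacks, axis=-1)

# Toy volume with 4 slices of 2 x 2 voxels, for illustration only.
volume = np.arange(4 * 2 * 2, dtype=np.float32).reshape(4, 2, 2)
inputs = make_25d_inputs(volume)
```

The middle channel of each stack is the slice being predicted, and the outer channels supply through-plane context without the memory cost of a full 3D model.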

Fig. 2

Study flow diagram. CNN, Convolutional neural network.

The second step was to rate the Fazekas scale using the predicted information. For this, we employed two baseline methods, adopting a non-spatial approach based solely on the scalar value of the predicted WMH volume. One baseline method employed a rule-based approach derived from a previous study28, in which thresholds for each Fazekas scale were set using one-versus-rest tasks to maximize the Youden index. The other baseline method involved univariate logistic regression with the scalar value serving as an independent variable.
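For one binary task, the Youden index is J = sensitivity + specificity - 1, and the rule-based baseline picks the volume cut-off that maximizes it. A minimal sketch follows; the function name `youden_threshold`, the candidate cut-offs (the observed volumes), and the sample data are hypothetical.

```python
def youden_threshold(volumes, positives):
    """Return the volume cut-off maximizing sensitivity + specificity - 1.

    volumes: predicted WMH volumes (ml); positives: matching booleans
    marking the positive class of one binary task.
    """
    n_pos = sum(positives)
    n_neg = len(positives) - n_pos
    best_j, best_cut = -1.0, None
    for cut in sorted(set(volumes)):
        # A subject is called positive when its volume reaches the cut-off.
        tp = sum(v >= cut and p for v, p in zip(volumes, positives))
        tn = sum(v < cut and not p for v, p in zip(volumes, positives))
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_j, best_cut = j, cut
    return best_cut, best_j

# Hypothetical volumes (ml) and true Fazekas grades, for illustration only.
volumes = [0.5, 1.0, 3.0, 8.0, 12.0, 20.0, 25.0, 40.0]
grades = [0, 0, 1, 1, 1, 2, 2, 3]
cut, j = youden_threshold(volumes, [g >= 2 for g in grades])
```

Repeating the call with `g >= 1` and `g >= 3` as the positive class would give the other two cut-offs.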

In addition, we proposed a spatial-probabilistic approach to rate the Fazekas scale employing a 3-dimensional CNN using raw probability maps of the predicted WMH volumes. Probability maps of the WMH were inferred from the uResNet model and prepared as a 3-dimensional volume equivalent to the original size of the FLAIR images. Considering the ordinal nature of the Fazekas scale, both regression and classification models were developed. The raw output derived from the regression model was finally rounded off and clipped between zero and three. The details of this procedure, including the architectures of the models for segmentation and Fazekas rating, are presented in Supplementary Figure S2 and Figure S3, respectively.
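The final post-processing of the regression output (rounding, then clipping to the range 0 to 3) can be written in a few lines. The function name `regression_to_fazekas` and the rule for exact halves are assumptions; the source states only that the output was rounded and clipped.

```python
import math

def regression_to_fazekas(raw_output):
    """Round a continuous regression output to the nearest grade in 0-3."""
    # floor(x + 0.5) rounds halves up; this tie-breaking rule is an assumption.
    rounded = math.floor(raw_output + 0.5)
    return max(0, min(3, rounded))

# Hypothetical raw outputs, including values outside the valid range.
grades = [regression_to_fazekas(x) for x in (-0.4, 0.49, 1.5, 2.2, 3.7)]
```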

All methods in this study were executed in Python version 3.9.7 using TensorFlow 2.10.1.

Statistical analysis

Normality was evaluated with the Shapiro–Wilk test, and homoscedasticity was checked using Levene’s test. Group comparisons were conducted using the chi-square test, Student’s t-test, or Mann–Whitney U test.

Segmentation performance was further evaluated using the area under the precision-recall curve (AUPRC) based on raw probability maps, voxel-based metrics, and surface distance-based metrics on volume maps after thresholding. The threshold was determined based on the internal validation sets to minimize the absolute error in the volume. The voxel-based metrics included DICE, Jaccard, Recall, Precision, and absolute error in volume, and the surface distance-based metrics included the Hausdorff distance and average symmetric surface distance. In patients with a Fazekas scale score of 0, only the volumetric error of the absolute difference was evaluated because of the absence of true positive WMH voxels. To investigate the influence of ischemic lesion volume on segmentation performance, we performed a subgroup analysis stratifying patients based on an ischemic volume threshold of 10 ml (< 10 ml versus ≥ 10 ml).
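The voxel-based metrics follow directly from the counts of true-positive (TP), false-positive (FP), and false-negative (FN) voxels. The sketch below uses a hypothetical function name `voxel_metrics` and assumes a known volume per voxel. It divides by zero when both masks are empty, which mirrors why only the volumetric error is reported for a Fazekas score of 0.

```python
import numpy as np

def voxel_metrics(truth, pred, voxel_volume_ml=1.0):
    """Compute overlap metrics between two binary masks of equal shape."""
    truth = np.asarray(truth, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    tp = np.count_nonzero(truth & pred)
    fp = np.count_nonzero(~truth & pred)
    fn = np.count_nonzero(truth & ~pred)
    volume_diff = abs(int(pred.sum()) - int(truth.sum()))
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "jaccard": tp / (tp + fp + fn),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "abs_volume_error_ml": volume_diff * voxel_volume_ml,
    }

# Toy masks, for illustration only: TP = 2, FP = 1, FN = 1.
truth = np.array([[1, 1, 0, 0], [1, 0, 0, 0]])
pred = np.array([[1, 1, 1, 0], [0, 0, 0, 0]])
metrics = voxel_metrics(truth, pred)
```

In this toy case the two masks have the same volume, so the volumetric error is zero even though the overlap is imperfect, which is why overlap and volume metrics are reported side by side.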

The goodness of fit between the ground truth volume and the predicted volume after thresholding was evaluated using the R-squared value of the linear regression and the Bland-Altman plot. The concordance correlation coefficient, reproducibility coefficient, and coefficient of variation were also evaluated. The performance of the Fazekas score rating was evaluated using a quadratic weighted kappa to measure the agreement between the true and predicted labels. For comparison with other studies, we additionally evaluated the binary categorization of the Fazekas scale using a one-vs.-rest approach with the area under the receiver operating characteristic curve (AUROC) and AUPRC.
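The quadratic weighted kappa penalizes a disagreement between grades i and j with weight (i - j)^2 / (K - 1)^2, where K is the number of grades, and compares the observed penalty with the penalty expected by chance. A minimal pure-Python sketch, with a hypothetical function name and sample labels:

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=4):
    """Cohen's kappa with quadratic weights for ordinal labels 0..n_classes-1."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            num += weight * observed[i][j]          # observed disagreement
            den += weight * row[i] * col[j] / n     # chance-expected disagreement
    return 1.0 - num / den

# Hypothetical true and predicted Fazekas grades, for illustration only.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 2]
kappa = quadratic_weighted_kappa(y_true, y_pred)
```

Because the weights grow with the square of the grade difference, a prediction that is off by two grades costs four times as much as one that is off by a single grade.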

To visually assess whether spatial information influences the prediction of the Fazekas scale, we employed the Grad-CAM method29 to visualize the saliency maps of two representative convolutional layers within our 3D CNN model for classification tasks. These layers were chosen from two skip connections in the network, comprising a shallow layer proximate to the input layer and a deep layer near the output layer.

Results

Baseline characteristics of study subjects

In total, 471 participants were enrolled in the analysis, including 313 from the PICASSO dataset, 108 from the KUSR for model development, and 50 from the SMC registry for external validation (Fig. 1). Stroke was identified in 419 participants (89.0%), with a lower prevalence in KUSR (60.2%) compared to PICASSO (97.1%) and SMC (100%) (P < 0.001). The median age was 65 years (IQR, 55–75), with participants from PICASSO and SMC being significantly older than those from KUSR (P < 0.001). Male participants comprised 60.7% of the cohort, with no significant differences among groups.

Hypertension (70.7%), diabetes mellitus (32.9%), and atrial fibrillation (7.0%) were prevalent comorbidities, with lower rates observed in KUSR (P < 0.001). Blood pressure was significantly higher in SMC (P < 0.001), and HDL levels were also elevated in this group (P = 0.015). Lesion distribution varied significantly (P < 0.001), with posterior circulation involvement (39.9%) being the most common. Subcortical (16.5%) and brainstem (20.3%) lesions were predominant, and multiple lesions were more frequently observed in SMC. Lesion-related data were not available for the PICASSO dataset. Further details are provided in Supplementary Table S1.

Segmentation performance

The segmentation performances in the internal validation and internal test sets are summarized in Tables 1 and 2, respectively. The segmentation model based on raw probability maps achieved AUPRC values of 0.77 (95% CI, 0.36–0.92) and 0.81 (95% CI, 0.55–0.95) in patients with a Fazekas scale of 1 or higher in the internal validation and internal test sets, respectively.

Table 1 Segmentation performance in the internal validation sets.
Table 2 Segmentation performance in the internal test set.

The threshold for constructing the binary volume maps was calculated to be 0.34, minimizing the absolute volumetric error in the internal validation sets. After applying the threshold, for patients with a Fazekas scale of 1 or higher, the DICE indices were 0.70 (95% CI, 0.41–0.84) in the internal validation sets and 0.73 (95% CI, 0.49–0.87) in the internal test set, respectively. Average symmetric surface distances were 2.0 (95% CI, 1.0–6.2) in the internal validation sets and 1.7 (95% CI, 0.9–4.2) in the internal test set, respectively.
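The threshold search can be sketched as a scan over candidate cut-offs that keeps the one with the smallest volumetric error on the validation data. In this illustration the function name `select_threshold`, the candidate grid, the use of the mean across subjects, and volumes counted in voxels are all assumptions.

```python
import numpy as np

def select_threshold(prob_maps, true_masks, candidates):
    """Pick the probability cut-off with the smallest mean absolute volume error."""
    best_thr, best_err = None, float("inf")
    for thr in candidates:
        # Volume is counted in voxels here; multiply by voxel size for ml.
        errors = [
            abs(int((p >= thr).sum()) - int(m.sum()))
            for p, m in zip(prob_maps, true_masks)
        ]
        mean_err = sum(errors) / len(errors)
        if mean_err < best_err:
            best_thr, best_err = thr, mean_err
    return best_thr, best_err

# Two toy "subjects" with four voxels each, for illustration only.
prob_maps = [np.array([0.1, 0.3, 0.6, 0.9]), np.array([0.2, 0.4, 0.5, 0.8])]
true_masks = [np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1])]
thr, err = select_threshold(prob_maps, true_masks, [0.2, 0.35, 0.55, 0.7])
```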

The absolute errors between the true and predicted WMH volumes were 3.7 ml (95% CI, 0.1 ml–13.6 ml; p = 0.191) and 3.1 ml (95% CI, 0.0 ml–15.9 ml; p = 0.982) in the internal validation and internal test sets, respectively. Among patients with a Fazekas scale of 1 or higher, there were no significant volume differences observed across any of the subgroups categorized by Fazekas scale in the internal test set. In the internal validation sets, no significant volume differences were noted specifically among patients with a Fazekas scale score of 2.

In the benchmark comparison, our 2.5D uResNet achieved the highest AUPRC of 0.77 (95% CI, 0.36–0.92) and the lowest Hausdorff distance of 32.1 (95% CI, 18.7–53.6) in patients with a Fazekas scale of 1 or higher in the internal validation set (Supplementary Table S2). Similarly, in the internal test set, 2.5D uResNet attained the highest AUPRC of 0.81 (95% CI, 0.55–0.95), while SegResNet yielded the highest DICE coefficient of 0.74 (95% CI, 0.39–0.84) and Jaccard index of 0.61 (95% CI, 0.24–0.72) (Supplementary Table S3). Supplementary Figure S4 presents illustrations of four representative cases from the internal test set, depicting stroke lesions, WMH labels, and the WMH predictions generated by the model.

The subgroup analysis revealed that segmentation performance was consistently lower in patients with ischemic lesion volumes of ≥ 10 ml (Supplementary Table S4). In the internal validation set, the absolute error in predicted WMH volume increased from 3.6 (95% CI, 0.1–13.6) to 5.4 (95% CI, 0.5–22.4) ml, and in the internal test set, from 2.8 (95% CI, 0.0–10.9) to 7.3 (95% CI, 0.2–23.7) ml. The best and worst model predictions for WMH segmentation are shown in Supplementary Figure S5 and Figure S6, respectively.

Figure 3 presents the linear regression and Bland-Altman plots depicting the goodness of fit between the true and predicted WMH volumes. The R-squared value obtained from the linear regression between the true and predicted WMH volumes was 0.91, with a concordance correlation coefficient of 0.96 in the internal validation sets. In the internal test set, the R-squared value was 0.94, and the concordance correlation coefficient was 0.97.

Fig. 3

The goodness of fit results in the internal validation sets (A) and the internal test set (B). Abbreviations: RMSE, root mean squared error; CCC, concordance correlation coefficient; RPC, reproducibility coefficient; CV, coefficient of variation.

The Bland-Altman plot demonstrates a minimal systematic difference of 0.33 ml between volumes in the internal validation sets and 0.4 ml in the internal test set. However, it further revealed a significant tendency for the predicted volume to decrease in comparison to the actual volume as the WMH volume increased in both evaluation sets (p < 0.001 and p = 0.008 in the internal validation and internal test sets, respectively). The reproducibility coefficients were 11.9 ml and 10.6 ml in the internal validation and internal test sets.
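The agreement statistics reported here can be computed from paired volumes as sketched below. The bias is the mean of the differences (predicted minus true), and the concordance correlation coefficient follows Lin's formula, 2·cov / (var_true + var_pred + (mean_true − mean_pred)^2). Taking the reproducibility coefficient as 1.96 times the standard deviation of the differences is an assumption, as the source does not state its definition; the function name and sample volumes are hypothetical.

```python
from statistics import mean, pvariance, stdev

def agreement_stats(true_vol, pred_vol):
    """Return (bias, reproducibility coefficient, concordance correlation)."""
    diffs = [p - t for t, p in zip(true_vol, pred_vol)]
    bias = mean(diffs)
    # Assumed definition: RPC = 1.96 x SD of the paired differences.
    rpc = 1.96 * stdev(diffs)

    mt, mp = mean(true_vol), mean(pred_vol)
    cov = mean([(t - mt) * (p - mp) for t, p in zip(true_vol, pred_vol)])
    ccc = 2 * cov / (pvariance(true_vol) + pvariance(pred_vol) + (mt - mp) ** 2)
    return bias, rpc, ccc

# Hypothetical true and predicted WMH volumes (ml), for illustration only.
true_vol = [2.0, 10.0, 20.0, 40.0]
pred_vol = [3.0, 9.0, 21.0, 37.0]
bias, rpc, ccc = agreement_stats(true_vol, pred_vol)
```

Unlike the Pearson correlation, the concordance correlation coefficient drops when predictions are shifted or scaled relative to the truth, so it reflects both precision and systematic deviation from the identity line.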

Fazekas score rating

The baseline rule-based method set thresholds at 2.1 ml for distinguishing Fazekas scores 0 versus 1 or higher, 16.4 ml for scores 0 and 1 versus 2 and 3, and 33.5 ml for scores 2 or lower versus 3. When these thresholds were applied, the quadratic-weighted kappa values between the predicted and true Fazekas scales were 0.926, 0.951, and 0.897 for the internal validation, internal test, and external validation sets, respectively. When the baseline logistic regression-based method was employed, the quadratic weighted kappa values between the predicted and true Fazekas scales were 0.909, 0.962, and 0.916 for the internal validation, internal test, and external validation sets, respectively.
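Applied to a predicted volume, the three reported cut-offs amount to a simple lookup. The sketch below uses the thresholds stated above; the function name `volume_to_fazekas` and the handling of volumes exactly equal to a cut-off are assumptions.

```python
from bisect import bisect_right

# Cut-offs (ml) reported above for the rule-based baseline.
THRESHOLDS_ML = [2.1, 16.4, 33.5]

def volume_to_fazekas(volume_ml):
    """Map a predicted WMH volume (ml) to a Fazekas grade (0-3)."""
    # A volume equal to a cut-off goes to the higher grade (assumed tie rule).
    return bisect_right(THRESHOLDS_ML, volume_ml)

# Hypothetical predicted volumes (ml), for illustration only.
grades = [volume_to_fazekas(v) for v in (0.8, 2.1, 10.0, 20.0, 50.0)]
```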

Regarding the predicted Fazekas scale obtained by the 3D CNN for the regression task, the quadratic weighted kappa values between the predicted and true Fazekas scales were 0.904, 0.951, and 0.898 for the internal validation, internal test, and external validation sets, respectively (Table 3). For the predicted Fazekas scale derived from the 3D CNN for the classification task, the quadratic weighted kappa values between the predicted and true Fazekas scales were 0.917, 0.956, and 0.956 for the internal validation, internal test, and external validation sets, respectively. The 3D CNN achieved AUROC values exceeding 0.9 and AUPRC values exceeding 0.8 across all binary categorizations in all evaluation datasets, regardless of whether regression or classification methods were used.

Table 3 Fazekas rating performance.

Saliency maps were visualized for a sample patient from the external test set (Supplementary Figure S7). To further highlight the added value of spatial‑probabilistic information, Supplementary Figure S8 presents, for three pairs of subjects whose Fazekas grades differ by one point, the FLAIR MRI, the predicted WMH probability map, and the thresholded binary mask. Despite equivalent lesion burdens, cases showing more extensive probability spread into deep subcortical white matter territories distal to the frontal and occipital horns receive higher grades, whereas those with lesion probability confined to the periventricular rim are assigned lower grades.

Discussion

In this pioneering study, we aimed to evaluate the Fazekas scale using FLAIR images by developing a two-step pipeline. Notably, our approach involved a spatial-probabilistic methodology for the Fazekas rating, employing a 3-dimensional deep learning model with predicted probability maps. This differs from previous studies28, which relied on the scalar value of the WMH volume. The probability maps contained positional information corresponding to the FLAIR images along with the continuous-range probability for WMH at each voxel. In the context of the Fazekas scale rating, WMH are evaluated based on their location and categorized into periventricular and deep subcortical areas, indicating that the position of the WMH may influence the Fazekas rating9,10. As such, our approach is anticipated to facilitate a more precise grading of the Fazekas scale by harnessing the spatial relationships of WMH and incorporating detailed probability information at the voxel level.

In patients with ischemic stroke and small vessel disease, WMH has been associated with an increased risk of recurrent vascular events, including ischemic and hemorrhagic stroke30. Although evaluating the burden of WMH in these patients is clinically crucial, distinguishing between WMH and ischemic stroke lesions can be challenging, particularly in cases where they appear similar on FLAIR images and are adjacent to each other. Using a combination of FLAIR and diffusion-weighted imaging (DWI) may aid in distinguishing acute ischemic lesions from WMH; however, its utility is limited for subacute to chronic lesions15.

One study that segmented WMH in patients with acute ischemic lesions using U-net models found no significant differences in lesion identification or segmentation between models using FLAIR alone and those incorporating FLAIR and DWI31. As such, the authors speculated that the CNN-based architecture model may utilize various features from FLAIR images to distinguish between WMH and acute ischemic lesions31.

In the present study, we successfully segmented WMH using a simplified method that utilized only FLAIR images with the uResNet architecture. The DICE in our study was 0.73, showing good performance comparable to previous stroke patient studies with model results of 0.61 (ref. 32), 0.76 (ref. 33), and 0.78 (ref. 34), and similar to other population studies with results of 0.71 and 0.80 (ref. 28). Recall and precision values were also comparable to those reported in previous research, at 0.73 and 0.74, respectively, compared to 0.74 and 0.56 (ref. 31). We expanded our analysis by including an additional five deep learning models for WMH segmentation in our benchmark comparisons. Our results indicate that, although performance varied across metrics, the trade-offs among the architectures highlight the importance of selecting a method that is tailored to specific clinical requirements.

Segmentation performance in our models was better in patients with higher Fazekas scale scores. This trend may have arisen due to class imbalance, which leads to an increase in the number of true-positive voxels. It may also have arisen because larger WMH are easier to detect, while deep-learning models may be less affected by mismatches along the boundary. In the past, several studies have utilized automatic segmentation of WMH for automatic scoring of the Fazekas scale28,34,35. However, no studies on stroke patients have yet used FLAIR images alone for automated scoring28,34,35.

In one study evaluating the predictive value of Fazekas scores from automatic segmentation in a general population-based cohort, ROC analyses were conducted to separate low WMH burdens (Fazekas scores 0,1) from high WMH burdens (Fazekas scores 2,3). The results revealed area under the curve (AUC) values of 0.93 for the Lesion Segmentation Tool and 0.94 for FreeSurfer34. In a study utilizing the U-Net based architecture VUNO Med-DeepBrain, targeting patients with memory complaints excluding those with stroke, the AUC values were 0.921 (normal vs. mild/moderate/severe), 0.956 (normal/mild vs. moderate/severe) and 0.960 (normal/mild/moderate vs. severe)28. Another study involving patients with dementia, excluding those with stroke, demonstrated an average AUROC of 0.80 (ref. 35).

In contrast, in our study, the AUROC values for the regression task were 0.957 (Fazekas scores 0 vs. 1,2,3), 0.982 (Fazekas scores 0,1 vs. 2,3), and 1.000 (Fazekas scores 0,1,2 vs. 3). For the classification task, the AUROC values were 0.972, 0.992, and 1.000. Furthermore, in the external validation set, the baseline rule-based and logistic regression-based methods yielded quadratic weighted kappa values of 0.897 and 0.916, respectively, while the 3D CNN for the regression task and the 3D CNN for the classification task yielded values of 0.898 and 0.956, respectively. Our results also demonstrated a highly similar Fazekas rating performance in both the model development and external validation datasets.

Our research stands apart from other studies that evaluated model performance using metrics such as AUC and AUROC, as we used a quadratic weighted kappa value to assess the automated Fazekas score ratings, which presents unique advantages. This method allows a more nuanced evaluation of agreement, which is particularly important in the context of automated Fazekas score ratings, improving the reliability and interpretability of the findings36. In addition, the quadratic-weighted kappa value is useful for handling ordinal variables, enabling a more accurate representation of the agreement36. As such, our model provides a higher level of precision in grading the Fazekas scale.

Our Fazekas scaling model revealed that the shallow convolutional layers consistently activated in correspondence with the intensity of the predicted WMH probability map across folds. In contrast, the deep convolutional layers, closer to the output layer, exhibited spatially variant activations even in regions with similar intensity levels. Due to a ceiling effect, wherein accurate WMH volume estimation alone permits highly precise Fazekas scale prediction, the performance differences may not be large. Nonetheless, our results suggest that spatial information contributes to the prediction of the Fazekas scale, which may account for the higher predictive performance of our model compared to scalar volumetric measures on the external test set. Subgroup analysis further indicates that the presence of larger ischemic lesions compromises WMH segmentation performance, as evidenced by increased volumetric prediction errors. This observation underscores the importance of developing or adapting segmentation approaches specifically for patients with significant stroke lesion burdens.

Despite the above strengths, our study had several limitations. First, there is currently no definitive reference for WMH segmentation and Fazekas rating. This is a common limitation in studies using these methods, in which the segmentation performed by experienced neurologists and the visual rating of the Fazekas scale serve as the gold standards; however, they inherently have limitations in terms of subjectivity and inter-rater reliability. Second, our study model utilized only FLAIR images. The incorporation of the combined information obtained from T1 sequences, DWI, and FLAIR images could potentially enhance the performance of automated segmentation and Fazekas scoring. Nevertheless, our model showed high or comparable DICE, recall, and precision values compared with models that used DWI and T1 sequences, suggesting that our model plays a role in distinguishing WMH from stroke lesions, even without including other sequences. Third, we did not directly compare our automatic segmentation and Fazekas scoring models with previously developed models. Finally, when rating the Fazekas scale, we did not differentiate between deep and periventricular white matter. However, by using a spatial-probabilistic model that incorporates both the location information of the WMH and voxel-wise probability information, we were able to overcome this limitation and achieve a high quadratic weighted kappa value.

In conclusion, our deep learning pipeline demonstrated accurate automatic WMH segmentation and Fazekas scale grading in patients with stroke. As such, this approach offers a convenient method for evaluating WMH burden using only FLAIR images in stroke patients, potentially aiding in the prediction of future vascular events.