Automated rating of Fazekas scale in fluid-attenuated inversion recovery MRI for ischemic stroke or transient ischemic attack using machine learning

Jeon, Eun-Tae; Kim, Seung Min; Jung, Jin-Man

doi:10.1038/s41598-025-17287-5


Published: 01 September 2025


Scientific Reports volume 15, Article number: 32219 (2025)


Abstract

White matter hyperintensities (WMH) are commonly assessed using the Fazekas scale, a subjective visual grading system. Despite the emergence of deep learning models for automatic WMH grading, their application in stroke patients remains limited. This study aimed to develop and validate an automatic segmentation and grading model for WMH in stroke patients, utilizing spatial-probabilistic methods. We developed a two-step deep learning pipeline to predict Fazekas scale scores from T2-weighted FLAIR images. First, WMH segmentation was performed using a residual neural network based on the U-Net architecture. Then, Fazekas scale grading was carried out using a 3D convolutional neural network trained on the segmented WMH probability volumes. A total of 471 stroke patients from three different sources were included in the analysis. The performance metrics included area under the precision-recall curve (AUPRC), Dice similarity coefficient, and absolute error for WMH volume prediction. In addition, agreement analysis and quadratic weighted kappa were calculated to assess the accuracy of the Fazekas scale predictions. The WMH segmentation model achieved an AUPRC of 0.81 (95% CI, 0.55–0.95) and a Dice similarity coefficient of 0.73 (95% CI, 0.49–0.87) in the internal test set. The mean absolute error between the true and predicted WMH volumes was 3.1 ml (95% CI, 0.0 ml–15.9 ml), with no significant variation across Fazekas scale categories. The agreement analysis demonstrated strong concordance, with an R-squared value of 0.91, a concordance correlation coefficient of 0.96, and a systematic difference of 0.33 ml in the internal validation sets, and 0.94, 0.97, and 0.40 ml, respectively, in the internal test set. In predicting Fazekas scores, the 3D convolutional neural network achieved quadratic weighted kappa values of 0.951 for regression tasks and 0.956 for classification tasks in the internal test set, and 0.898 and 0.956, respectively, in the external validation set.
The proposed deep learning pipeline demonstrated robust performance in automatic WMH segmentation and Fazekas scale grading from FLAIR images in stroke patients. This approach offers a reliable and efficient tool for evaluating WMH burden, which may assist in predicting future vascular events.


Introduction

White matter hyperintensity (WMH) is an imaging change indicative of cerebral small vessel disease or cerebral microangiopathy^1,2. Cerebral ischemia/hypoxia, hypoperfusion due to altered cerebrovascular autoregulation, disruption of the blood–brain barrier, inflammation, degeneration and amyloid angiopathy have all been suggested as underlying mechanisms of WMH, although their pathophysiology is not completely understood^3,4. The clinical significance of WMH has been continuously investigated since Hachinski first described this feature on brain computed tomography (CT), with research revealing an association with future stroke, cognitive decline, depression, disability, and mortality in both general and diseased populations^5,6,7. Stroke patients with WMH have been reported to have poor outcomes in the acute and subacute phases, as well as in the long-term period after stroke⁸.

The Fazekas scale has been widely used as a simplified visual rating method in the clinical field and in research^9,10. Although it is fast, easy, and intuitive, its utility is limited by subjectivity and imperfect inter-rater reliability. Recently, automatic and semi-automatic methods for WMH segmentation have been introduced to overcome these limitations. Previous studies have demonstrated a strong correlation between the Fazekas score and automatic segmentation of WMH^11,12,13. However, these methods are not easy or appropriate for clinical application due to their complexity, time demands, necessity for quality control, and absence of normative data. In addition, these studies were performed predominantly on normally aging populations. The automatic measurement of WMH in stroke patients appears challenging, because brain MRIs show diverse lesions, ranging from lacunes and large territorial or multiple lesions to cerebral hemorrhages, that could be mixed and confused with WMH^14,15. As such, methods that can effectively integrate spatial characteristics and probabilistic modeling to account for lesion variability are crucial for the reliable detection and quantification of WMH in these patients.

In recent studies, significant progress has been made in WMH segmentation using deep-learning techniques, showing high segmentation accuracy without complex procedures. Additionally, some studies have reported that the segmented volumes of WMH via deep learning correspond well with Fazekas ratings. However, while previous research has primarily focused on volumetric quantification, approaches that incorporate both spatial distribution patterns and probabilistic estimation of WMH burden remain limited. Hence, the present study aimed to develop an artificial intelligence (AI) pipeline utilizing deep learning to perform both segmentation and Fazekas ratings, encompassing both spatial and probabilistic characteristics of WMH alongside its structural properties.

Methods

Patient population

This investigation was conducted as a substudy of the Prevention of Cardiovascular Events in Ischemic Stroke Patients with a High Risk of Cerebral Hemorrhage (PICASSO) trial¹⁶. PICASSO was a multicenter, randomized, double-blind controlled trial that compared the efficacy and safety of two antiplatelet agents: aspirin and cilostazol. Patients were enrolled within 6 months of experiencing an ischemic stroke or transient ischemic attack and were classified as having either symptomatic or asymptomatic intracranial hemorrhage (ICH) or multiple cerebral microbleeds. Figure 1 shows a flow diagram of the patient enrollment in the study. Among 505 patients with MRI randomized in the PICASSO trial, 497 for whom baseline T2-weighted fluid-attenuated inversion recovery (FLAIR) MR images were collected and recorded in the center as DICOM files were screened for the present study. Owing to the characteristics of the PICASSO study participants, few patients had Fazekas 0 or 1. Therefore, patients with Fazekas 0 or 1 from the Korea University Stroke Registry (KUSR) who met the criteria for PICASSO study participation were also randomly included in this study. In total, 421 patients were included after the exclusion of 226 patients because of low-quality MR images. For the external validation, we included patients from the Samsung Medical Center (SMC) Registry who met the same criteria. This dataset contained only Fazekas scale labels and lacked WMH segmentation labels. The institutional review boards of all participating centers, including the Korea University Ansan Hospital (2016AS0051), approved the study protocol, which followed the Standards for Diagnostic Accuracy (STARD) reporting guidelines¹⁷. Informed consent was waived due to the retrospective nature of the study, which utilized anonymized and de-identified data. All methods were performed in accordance with the relevant guidelines and regulations.

Defining ground truth masks and image preprocessing

The WMH were segmented on the FLAIR images by a neurologist (J-MJ), and the resulting segmentations were used as ground-truth masks. All images were preprocessed in the following order: (1) intensity non-uniformity correction using the N4 bias field correction algorithm¹⁸; (2) affine transformation to the common space with the FLAIR atlas registered in the MNI (Montreal Neurological Institute) space¹⁹; (3) skull stripping using an artificial neural network developed for brain extraction in pathologically-altered brain MRI²⁰; (4) inter-subject (nonlinear) coregistration for spatial normalization; and (5) adaptive histogram equalization (Supplementary Figure S1). Horizontal flipping was applied to augment the data in the training set.

Outcome variable and data splitting

Two stroke neurologists (K and L) independently reviewed the baseline FLAIR images, and the Fazekas scale was scored as the outcome variable by consensus (inter-rater agreement, quadratic weighted kappa: 0.853). WMH were evaluated in the hemisphere contralateral to the region affected by the prior stroke. In cases where the WMH score differed between the hemispheres, the rating was based on the unaffected or less affected side, under the assumption that WMH are symmetric. Further, 40% of patients were randomly selected, stratified by Fazekas score, to create a separate “internal test set” used solely for performance evaluation. The remaining patients were then randomly divided into three folds stratified by Fazekas scores and used as the model training and “internal validation sets” following a cross-validation strategy. For the internal validation, each validation fold was used to evaluate the model trained on the remaining folds. The same folds were used consistently in the development of both the WMH segmentation model and the Fazekas scoring model to ensure that the subjects were not mixed.
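The splitting scheme described above can be sketched as follows. This is a minimal illustration only, not the authors' code: the function name, the round-robin fold assignment, and the fixed seed are assumptions introduced here, and the hold-out fraction is passed as a parameter.

```python
import random
from collections import defaultdict


def stratified_holdout_and_folds(labels, test_fraction, n_folds, seed=0):
    """Split subject indices into a held-out test set and n_folds folds.

    Both the hold-out and the folds are stratified by the label
    (here, the Fazekas grade of each subject).
    """
    rng = random.Random(seed)
    by_grade = defaultdict(list)
    for idx, grade in enumerate(labels):
        by_grade[grade].append(idx)

    test_ids = []
    folds = [[] for _ in range(n_folds)]
    for grade in sorted(by_grade):
        ids = by_grade[grade]
        rng.shuffle(ids)
        n_test = round(len(ids) * test_fraction)
        test_ids.extend(ids[:n_test])
        # Deal the remaining subjects round-robin so every fold
        # keeps roughly the same grade mix.
        for k, idx in enumerate(ids[n_test:]):
            folds[k % n_folds].append(idx)
    return sorted(test_ids), [sorted(fold) for fold in folds]


# Hypothetical grade distribution, for illustration only.
grades = [0] * 10 + [1] * 20 + [2] * 30 + [3] * 40
test_ids, folds = stratified_holdout_and_folds(grades, test_fraction=0.4, n_folds=3)
```

Reusing the returned folds for both the segmentation model and the grading model is what keeps a subject from appearing in training for one step and validation for the other.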

Model development

To obtain the Fazekas scale from FLAIR images, we established a two-step pipeline (Fig. 2). The first step comprised WMH segmentation, for which we employed the U-Net-based residual network (uResNet)²¹, a U-Net-based deep learning model. U-Net is a widely used segmentation algorithm with an encoder-decoder architecture based on a fully convolutional neural network (CNN), which uses skip connections to combine high- and low-level feature maps. uResNet implements residual units instead of the plain convolutional blocks in U-Net, thus reducing vanishing gradient problems and preventing information loss. We further constructed 2.5-dimensional models for the uResNet model, utilizing five consecutive slices to predict the WMH on the center slice. To further benchmark the segmentation performance, we additionally evaluated five deep learning models, including two transformer-based architectures (SwinUNETR²² and UNETR²³) and three convolution-based architectures (AttentionUNet²⁴, DynUNet²⁵, and SegResNet²⁶). Each model used for benchmarking was implemented using the MONAI v1.4.0 library²⁷, with all hyperparameters set to their default options.
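The 2.5-dimensional input described above, in which five consecutive slices are used to predict the center slice, can be illustrated with a small indexing sketch. The function names are hypothetical, and the handling of the first and last slices (repeating the edge slice) is an assumption; the text does not state how volume boundaries were treated.

```python
def slice_window(center, n_slices, depth):
    """Indices of n_slices consecutive slices centered on `center`.

    Indices falling outside the volume are clamped to the nearest
    edge slice (an assumed padding strategy).
    """
    half = n_slices // 2
    return [min(max(center + offset, 0), depth - 1)
            for offset in range(-half, half + 1)]


def stack_2p5d(volume, n_slices=5):
    """Build one n_slices-deep input stack per slice of the volume.

    `volume` is a sequence of 2D slices; the k-th returned stack is
    the input used to predict the WMH mask of slice k.
    """
    depth = len(volume)
    return [[volume[i] for i in slice_window(center, n_slices, depth)]
            for center in range(depth)]


# A toy "volume" whose slices are just labels, to show the windowing.
stacks = stack_2p5d(["s0", "s1", "s2", "s3", "s4", "s5"])
```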

The second step was to rate the Fazekas scale using the predicted information. For this, we employed two baseline methods, both adopting a non-spatial approach based solely on the scalar value of the predicted WMH volume. One baseline method employed a rule-based approach derived from a previous study²⁸, in which thresholds for each Fazekas scale were set using one-versus-rest tasks to maximize the Youden index. The other baseline method involved univariate logistic regression with the scalar value serving as an independent variable.
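As an illustration of the rule-based baseline, the sketch below picks the volume cut-off that maximizes the Youden index (sensitivity + specificity − 1) for a single binary split. The function name, the candidate cut-offs (every observed volume), and the inclusive boundary are assumptions made for this example; the toy volumes are not study data.

```python
def youden_threshold(volumes, is_positive):
    """Return (cut-off, J) maximizing the Youden index J.

    A subject is predicted positive when volume >= cut-off.
    Requires at least one positive and one negative subject.
    """
    n_pos = sum(1 for p in is_positive if p)
    n_neg = len(is_positive) - n_pos
    best_cut, best_j = None, -1.0
    for cut in sorted(set(volumes)):
        true_pos = sum(1 for v, p in zip(volumes, is_positive) if p and v >= cut)
        true_neg = sum(1 for v, p in zip(volumes, is_positive) if not p and v < cut)
        j = true_pos / n_pos + true_neg / n_neg - 1.0
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


# Toy example: predicted WMH volumes (ml) and whether the grade is >= 1.
volumes = [0.5, 1.0, 1.8, 2.5, 3.0, 9.0]
grade_at_least_1 = [False, False, False, True, True, True]
cut, j = youden_threshold(volumes, grade_at_least_1)
```

Repeating this for each split of the scale yields one cut-off per boundary between adjacent grades.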

In addition, we proposed a spatial-probabilistic approach to rate the Fazekas scale, employing a 3-dimensional CNN that takes raw probability maps of the predicted WMH volumes as input. Probability maps of the WMH were inferred from the uResNet model and prepared as a 3-dimensional volume equivalent to the original size of the FLAIR images. Considering the ordinal nature of the Fazekas scale, both regression and classification models were developed. The raw output derived from the regression model was finally rounded off and clipped between zero and three. The details of this procedure, including the architectures of the models for segmentation and Fazekas rating, are presented in Supplementary Figure S2 and Figure S3, respectively.
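The post-processing of the regression output can be written in a few lines. This is a sketch under one stated assumption: "rounded off" is taken to mean rounding halves upward, which the text does not specify. The function name is hypothetical.

```python
import math


def regression_to_grade(raw_output, lowest=0, highest=3):
    """Convert a continuous regression output to a Fazekas grade.

    Rounds to the nearest integer (halves round up, an assumption;
    Python's built-in round() would round halves to the even integer)
    and clips the result to the valid range of the scale.
    """
    rounded = math.floor(raw_output + 0.5)
    return max(lowest, min(highest, rounded))


grades = [regression_to_grade(x) for x in (-0.7, 0.49, 1.5, 2.5, 3.8)]
```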

All methods in this study were executed in Python version 3.9.7 using TensorFlow 2.10.1.

Statistical analysis

Normality was evaluated with the Shapiro–Wilk test, and homoscedasticity was checked using Levene’s test. Group comparisons were conducted using the chi-square test, Student’s t-test, or Mann–Whitney U test.

Segmentation performance was further evaluated using the area under the precision-recall curve (AUPRC) based on raw probability maps, voxel-based metrics, and surface distance-based metrics on volume maps after thresholding. The threshold was determined based on the internal validation sets to minimize the absolute error in the volume. The voxel-based metrics included DICE, Jaccard, Recall, Precision, and absolute error in volume, and the surface distance-based metrics included the Hausdorff distance and average symmetric surface distance. In patients with a Fazekas scale score of 0, only the volumetric error of the absolute difference was evaluated because of the absence of true positive WMH voxels. To investigate the influence of ischemic lesion volume on segmentation performance, we performed a subgroup analysis stratifying patients based on an ischemic volume threshold of 10 ml (< 10 ml versus ≥ 10 ml).
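The voxel-based metrics and the threshold selection described above can be sketched as follows, assuming flattened binary masks and a uniform voxel volume. All names, the candidate cut-offs, and the use of the mean absolute volume error as the selection criterion are illustrative assumptions rather than the authors' implementation.

```python
def overlap_metrics(truth, pred):
    """Voxel-based overlap metrics for two flat 0/1 masks of equal length.

    Undefined (division by zero) when the truth mask is empty, which is
    why only the volume error is reported for Fazekas 0 in the text.
    """
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "jaccard": tp / (tp + fp + fn),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }


def volume_ml(mask, voxel_ml):
    """Volume of a binary mask given the volume of one voxel in ml."""
    return sum(mask) * voxel_ml


def best_threshold(prob_maps, truth_masks, voxel_ml, candidates):
    """Probability cut-off with the lowest mean absolute volume error."""
    def mean_abs_error(cut):
        errors = []
        for probs, truth in zip(prob_maps, truth_masks):
            pred = [1 if p >= cut else 0 for p in probs]
            errors.append(abs(volume_ml(pred, voxel_ml) - volume_ml(truth, voxel_ml)))
        return sum(errors) / len(errors)
    return min(candidates, key=mean_abs_error)


metrics = overlap_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
cut = best_threshold([[0.9, 0.5, 0.3, 0.1]], [[1, 1, 0, 0]], 1.0, [0.2, 0.4, 0.6])
```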

The goodness of fit between the ground truth volume and the predicted volume after thresholding was evaluated using the R-squared value of the linear regression and the Bland-Altman plot. The concordance correlation coefficient, reproducibility coefficient, and coefficient of variation were also evaluated. The performance of the Fazekas score rating was evaluated using a quadratic weighted kappa to measure the agreement between the true and predicted labels. For comparison with other studies, we additionally evaluated the binary categorization of the Fazekas scale using a one-vs.-rest approach with the area under the receiver operating characteristic curve (AUROC) and AUPRC.
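For readers unfamiliar with the quadratic weighted kappa, the following sketch computes it from scratch for the four Fazekas grades. It follows the standard definition (disagreement weights proportional to the squared grade difference); the function name is hypothetical, and the example labels are toy values, not study data.

```python
def quadratic_weighted_kappa(true_grades, pred_grades, n_classes=4):
    """Cohen's kappa with quadratic disagreement weights.

    Grades must be integers in [0, n_classes). Undefined when the
    expected weighted disagreement is zero (e.g. one grade only).
    """
    n = len(true_grades)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_grades, pred_grades):
        observed[t][p] += 1
    row_totals = [sum(observed[i]) for i in range(n_classes)]
    col_totals = [sum(observed[i][j] for i in range(n_classes))
                  for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    return 1.0 - weighted_observed / weighted_expected


perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3])
reversed_ = quadratic_weighted_kappa([0, 1, 2, 3], [3, 2, 1, 0])
```

Because the weight grows with the square of the grade difference, confusing grade 0 with grade 3 is penalized nine times as heavily as confusing adjacent grades, which suits an ordinal scale.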

To visually assess whether spatial information influences the prediction of the Fazekas scale, we employed the Grad-CAM method²⁹ to visualize the saliency maps of two representative convolutional layers within our 3DCNN model for classification tasks. These layers were chosen from two skip connections in the network, comprising a shallow layer proximate to the input layer and a deep layer near the output layer.

Results

Baseline characteristics of study subjects

In total, 471 participants were enrolled in the analysis, including 313 from the PICASSO dataset, 108 from the KUSR registry for model development, and 50 from the SMC registry for external validation (Fig. 1). Stroke was identified in 419 participants (89.0%), with a lower prevalence in KUSR (60.2%) compared to PICASSO (97.1%) and SMC (100%) (P < 0.001). The median age was 65 years (IQR, 55–75), with participants from PICASSO and SMC being significantly older than those from KUSR (P < 0.001). Male participants comprised 60.7% of the cohort, with no significant differences among groups.

Hypertension (70.7%), diabetes mellitus (32.9%), and atrial fibrillation (7.0%) were prevalent comorbidities, with lower rates observed in KUSR (P < 0.001). Blood pressure was significantly higher in SMC (P < 0.001), while HDL levels were also elevated in this group (P = 0.015). Lesion distribution varied significantly (P < 0.001), with posterior circulation involvement (39.9%) being the most common. Subcortical (16.5%) and brainstem (20.3%) lesions were predominant, and multiple lesions were more frequently observed in SMC. Lesion-related data were not available for the PICASSO dataset. Further details are provided in Supplementary Table S1.

Segmentation performance

The segmentation performances of the internal validation and the internal test sets are summarized in Tables 1 and 2, respectively. The segmentation model based on raw probability maps achieved AUPRC values of 0.77 (95% CI, 0.36–0.92) and 0.81 (95% CI, 0.55–0.95) in patients with a Fazekas scale of 1 or higher in the internal validation and internal test sets, respectively.

Table 1 Segmentation performance in the internal validation sets.


Table 2 Segmentation performance in the internal test set.


The threshold for constructing the binary volume maps was calculated to be 0.34, minimizing the absolute volumetric error in the internal validation sets. After applying the threshold, for patients with a Fazekas scale of 1 or higher, the DICE indices were 0.70 (95% CI, 0.41–0.84) in the internal validation sets and 0.73 (95% CI, 0.49–0.87) in the internal test set. Average symmetric surface distances were 2.0 (95% CI, 1.0–6.2) in the internal validation sets and 1.7 (95% CI, 0.9–4.2) in the internal test set.

The absolute errors between the true and predicted WMH volumes were 3.7 ml (95% CI, 0.1 ml–13.6 ml; p = 0.191) in the internal validation sets and 3.1 ml (95% CI, 0.0 ml–15.9 ml; p = 0.982) in the internal test set. Among patients with a Fazekas scale of 1 or higher, there were no significant volume differences observed across any of the subgroups categorized by Fazekas scale in the internal test set. Concurrently, in the internal validation sets, no significant volume differences were noted, specifically among patients with a Fazekas scale score of 2.

In the benchmark comparison, our 2.5D uResNet achieved the highest AUPRC of 0.77 (95% CI, 0.36–0.92) and the lowest Hausdorff distance of 32.1 (95% CI, 18.7–53.6) in patients with a Fazekas scale of 1 or higher in the internal validation set (Supplementary Table S2). Similarly, in the internal test set, 2.5D uResNet attained the highest AUPRC of 0.81 (95% CI, 0.55–0.95), while SegResNet yielded the highest DICE coefficient of 0.74 (95% CI, 0.39–0.84) and Jaccard index of 0.61 (95% CI, 0.24–0.72) (Supplementary Table S3). Supplementary Figure S4 presents illustrations of four representative cases from the internal test set, depicting stroke lesions, WMH labels, and the WMH predictions generated by the model.

The subgroup analysis revealed that segmentation performance was consistently lower in patients with ischemic lesion volumes of ≥ 10 ml (Supplementary Table S4). In the internal validation set, the absolute error in predicted WMH volume increased from 3.6 (95% CI, 0.1–13.6) to 5.4 (95% CI, 0.5–22.4) ml, and in the internal test set, from 2.8 (95% CI, 0.0–10.9) to 7.3 (95% CI, 0.2–23.7) ml. The best and worst model predictions for WMH segmentation are shown in Supplementary Figure S5 and Figure S6, respectively.

Figure 3 presents the linear regression and Bland-Altman plots depicting the goodness of fit between the true and predicted WMH volumes. The R-squared value obtained from the linear regression between the true and predicted WMH volumes was 0.91, with a concordance correlation coefficient of 0.96, in the internal validation sets. In the internal test set, the R-squared value was 0.94, and the concordance correlation coefficient was 0.97.

The Bland-Altman plot demonstrates a minimal systematic difference of 0.33 ml between volumes in the internal validation sets and 0.40 ml in the internal test set. However, it further revealed a significant tendency for the predicted volume to fall below the actual volume as the WMH volume increased in both evaluation sets (p < 0.001 and p = 0.008 in the internal validation and internal test sets, respectively). The reproducibility coefficients were 11.9 ml and 10.6 ml in the internal validation and internal test sets, respectively.
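The agreement statistics reported above can be reproduced in outline with the standard library. This sketch uses Lin's concordance correlation coefficient and the usual Bland-Altman quantities; taking the reproducibility coefficient as 1.96 times the standard deviation of the differences is an assumption, since the text does not give its exact definition. Function names and the toy volumes are illustrative.

```python
from statistics import mean, pvariance, stdev


def concordance_ccc(x, y):
    """Lin's concordance correlation coefficient for paired measurements."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)
    return 2 * covariance / (pvariance(x) + pvariance(y) + (mean_x - mean_y) ** 2)


def bland_altman(truth, pred):
    """Systematic difference, limits of agreement, and reproducibility coefficient."""
    diffs = [p - t for t, p in zip(truth, pred)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return {
        "bias": bias,
        "limits": (bias - 1.96 * sd, bias + 1.96 * sd),
        "reproducibility": 1.96 * sd,  # assumed definition
    }


# Toy volumes in ml, for illustration only.
true_ml = [1.0, 2.0, 3.0, 4.0]
pred_ml = [2.0, 3.0, 4.0, 5.0]
ccc = concordance_ccc(true_ml, pred_ml)
stats = bland_altman(true_ml, pred_ml)
```

Unlike the R-squared value, the concordance coefficient is lowered by a constant offset between the two measurements, which is why it is reported alongside the regression fit.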

Fazekas score rating

The baseline rule-based method set thresholds at 2.1 ml for distinguishing Fazekas scores 0 versus 1 or higher, 16.4 ml for scores 0 and 1 versus 2 and 3, and 33.5 ml for scores 2 or lower versus 3. When these thresholds were applied, the quadratic-weighted kappa values between the predicted and true Fazekas scales were 0.926, 0.951, and 0.897 for the internal validation, internal test, and external validation sets, respectively. When the baseline logistic regression-based method was employed, the quadratic weighted kappa values between the predicted and true Fazekas scales were 0.909, 0.962, and 0.916 for the internal validation, internal test, and external validation sets, respectively.
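Applying the three reported cut-offs to a predicted volume amounts to counting how many of them the volume reaches. The sketch below uses the thresholds quoted in the text; the function name is hypothetical, and treating a volume exactly equal to a cut-off as belonging to the higher grade is an assumption.

```python
from bisect import bisect_right

# Cut-offs in ml reported in the text:
# grade 0 vs >= 1, grade <= 1 vs >= 2, grade <= 2 vs 3.
THRESHOLDS_ML = (2.1, 16.4, 33.5)


def fazekas_from_volume(volume_ml, thresholds=THRESHOLDS_ML):
    """Rule-based Fazekas grade: the number of cut-offs the volume reaches."""
    return bisect_right(thresholds, volume_ml)


grades = [fazekas_from_volume(v) for v in (0.0, 10.0, 20.0, 40.0)]
```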

For the 3D CNN regression task, the quadratic weighted kappa values between the predicted and true Fazekas scales were 0.904, 0.951, and 0.898 for the internal validation, internal test, and external validation sets, respectively (Table 3). For the 3D CNN classification task, the quadratic weighted kappa values between the predicted and true Fazekas scales were 0.917, 0.956, and 0.956 for the internal validation, internal test, and external validation sets, respectively. The 3D CNN achieved AUROC values exceeding 0.9 and AUPRC values exceeding 0.8 across all binary categorizations in all evaluation datasets, regardless of whether regression or classification methods were used.

Table 3 Fazekas rating performance.


Saliency maps were visualized for a sample patient from the external validation set (Supplementary Figure S7). To further highlight the added value of spatial-probabilistic information, Supplementary Figure S8 presents, for three pairs of subjects whose Fazekas grades differ by one point, the FLAIR MRI, the predicted WMH probability map, and the thresholded binary mask. Despite equivalent lesion burdens, cases showing more extensive probability spread into deep subcortical white matter territories distal to the frontal and occipital horns receive higher grades, whereas those with lesion probability confined to the periventricular rim are assigned lower grades.

Discussion

In this pioneering study, we aimed to evaluate the Fazekas scale using FLAIR images by developing a two-step pipeline. Notably, our approach involved a spatial-probabilistic methodology for the Fazekas rating, employing a 3-dimensional deep learning model with predicted probability maps. This differs from previous studies²⁸, which relied on the scalar value of the WMH volume. The probability maps contained positional information corresponding to the FLAIR images along with the continuous-range probability for WMH at each voxel. In the context of the Fazekas scale rating, WMH are evaluated based on their location and categorized into periventricular and deep subcortical areas, indicating that the position of the WMH may influence the Fazekas rating^9,10. As such, our approach is anticipated to facilitate a more precise grading of the Fazekas scale by harnessing the spatial relationships of WMH and incorporating detailed probability information at the voxel level.

In patients with ischemic stroke and small vessel disease, WMH has been associated with an increased risk of recurrent vascular events, including ischemic and hemorrhagic stroke³⁰. Although evaluating the burden of WMH in these patients is clinically crucial, distinguishing between WMH and ischemic stroke lesions can be challenging, particularly in cases where they appear similar on FLAIR images and lie adjacent to each other. Using a combination of FLAIR and diffusion-weighted imaging (DWI) may aid in distinguishing acute ischemic lesions from WMH; however, its utility is limited for subacute to chronic lesions¹⁵.

One study that segmented WMH in patients with acute ischemic lesions using U-net models found no significant differences in lesion identification or segmentation between models using FLAIR alone and those incorporating FLAIR and DWI³¹. As such, the authors speculated that the CNN-based architecture model may utilize various features from FLAIR images to distinguish between WMH and acute ischemic lesions³¹.

In the present study, we successfully segmented WMH using a simplified method that utilized only FLAIR images with the uResNet architecture. The DICE in our study was 0.73, showing good performance comparable to previous studies in stroke patients, which reported 0.61³², 0.76³³, and 0.78³⁴, and similar to studies in other populations, which reported 0.71 and 0.80²⁸. Recall and precision values were also comparable to those found in previous research, at 0.73 and 0.74, respectively, compared to 0.74 and 0.56³¹. We expanded our analysis by including an additional five deep learning models for WMH segmentation in our benchmark comparisons. Our results indicate that, although performance varied across metrics, the trade-offs among the architectures highlight the importance of selecting a method that is tailored to specific clinical requirements.

Segmentation performance in our models improved with higher Fazekas scale scores. This trend may have arisen due to class imbalance, as higher grades increase the number of true-positive voxels. Moreover, it may also have arisen because larger WMH are easier to detect, while deep-learning models may be less affected by mismatches along the boundary. In the past, several studies have utilized automatic segmentation of WMH for automatic scoring of the Fazekas scale^28,34,35. However, no studies on stroke patients have yet used FLAIR images alone for automated scoring^28,34,35.

In one study evaluating the predictive value of Fazekas scores from automatic segmentation in a general population-based cohort, ROC analyses were conducted to separate low WMH burdens (Fazekas scores 0,1) from high WMH burdens (Fazekas scores 2,3). The results revealed area under the curve (AUC) values of 0.93 for the Lesion Segmentation Tool and 0.94 for FreeSurfer³⁴. In a study utilizing the U-Net based architecture VUNO Med-DeepBrain, targeting patients with memory complaints excluding those with stroke, the AUC values were 0.921 (normal vs. mild/moderate/severe), 0.956 (normal/mild vs. moderate/severe) and 0.960 (normal/mild/moderate vs. severe)²⁸. Another study involving patients with dementia, excluding those with stroke, demonstrated an average AUROC of 0.80³⁵.

In contrast, in our study, the AUROC values for the regression task were 0.957 (Fazekas scores 0 vs. 1,2,3), 0.982 (Fazekas scores 0,1 vs. 2,3), and 1.000 (Fazekas scores 0,1,2 vs. 3). For the classification task, the AUROC values were 0.972, 0.992, and 1.000. Furthermore, in the external validation set, the baseline rule-based and logistic regression-based methods yielded quadratic weighted kappa values of 0.897 and 0.916, respectively, whereas the 3D CNN yielded values of 0.898 for the regression task and 0.956 for the classification task. Our results also demonstrated a highly similar Fazekas rating performance in both the model development and external validation datasets.

Our research stands apart from other studies that evaluated model performance using metrics such as AUC and AUROC, as we used a quadratic weighted kappa value to assess the automated Fazekas score ratings, which presents unique advantages. This method allows a more nuanced evaluation of agreement, which is particularly important in the context of automated Fazekas score ratings, improving the reliability and interpretability of the findings³⁶. In addition, the quadratic-weighted kappa value is useful for handling ordinal variables, enabling a more accurate representation of the agreement³⁶. As such, our model provides a higher level of precision in grading the Fazekas scale.

Our Fazekas scaling model revealed that the shallow convolutional layers consistently activated in correspondence with the intensity of the predicted WMH probability map across folds. In contrast, the deep convolutional layers, closer to the output layer, exhibited spatially variant activations even in regions with similar intensity levels. Due to a ceiling effect, wherein accurate WMH volume estimation alone permits highly precise Fazekas scale prediction, the performance differences may not be large. Nonetheless, our results suggest that spatial information contributes to the prediction of the Fazekas scale, which may account for the higher predictive performance of our classification model compared to scalar volumetric measures on the external validation set. Subgroup analysis further indicates that the presence of larger ischemic lesions compromises WMH segmentation performance, as evidenced by increased volumetric prediction errors. This observation underscores the importance of developing or adapting segmentation approaches specifically for patients with significant stroke lesion burdens.

Despite these strengths, our study had several limitations. First, there is currently no definitive reference for WMH segmentation and Fazekas rating. This is a common limitation in studies using these methods, in which the segmentation performed by experienced neurologists and the visual rating of the Fazekas scale serve as the gold standards; however, both are inherently limited by subjectivity and interrater reliability. Second, our model utilized only FLAIR images. Incorporating combined information from T1 sequences, DWI, and FLAIR images could potentially enhance the performance of automated segmentation and Fazekas scoring. Nevertheless, our model showed high or comparable DICE, recall, and precision values compared with models that used DWI and T1 sequences, suggesting that it can distinguish WMH from stroke lesions even without other sequences. Third, we did not directly compare our automatic segmentation and Fazekas scoring models with previously developed models. Finally, when rating the Fazekas scale, we did not differentiate between deep and periventricular white matter. However, by using a spatial-probabilistic model that incorporates both the location information of the WMH and voxel-wise probability information, we were able to mitigate this limitation and achieve a high quadratic weighted kappa value.

In conclusion, our deep learning pipeline demonstrated accurate automatic WMH segmentation and Fazekas scale grading in patients with stroke. As such, this approach offers a convenient method for evaluating WMH burden using only FLAIR images in stroke patients, potentially aiding in the prediction of future vascular events.

Data availability

The data in this study are available from the corresponding author upon reasonable request.

References

Phuah, C. L. et al. Association of data-driven white matter hyperintensity spatial signatures with distinct cerebral small vessel disease etiologies. Neurology 99, e2535–e2547. https://doi.org/10.1212/wnl.0000000000201186 (2022).
Wardlaw, J. M. et al. Neuroimaging standards for research into small vessel disease and its contribution to ageing and neurodegeneration. Lancet Neurol. 12, 822–838. https://doi.org/10.1016/s1474-4422(13)70124-8 (2013).
Gouw, A. A. et al. Heterogeneity of small vessel disease: a systematic review of MRI and histopathology correlations. J. Neurol. Neurosurg. Psychiatry. 82, 126–135 (2011).
Wardlaw, J. M., Hernández, M. C. V. & Muñoz-Maniega, S. What are white matter hyperintensities made of? J. Am. Heart Assoc. 4, e001140. https://doi.org/10.1161/JAHA.114.001140 (2015).
Sharma, R., Sekhon, S. & Cascella, M. White matter lesions. (2020).
Debette, S. & Markus, H. S. The clinical importance of white matter hyperintensities on brain magnetic resonance imaging: systematic review and meta-analysis. BMJ 341, c3666. https://doi.org/10.1136/bmj.c3666 (2010).
Wong, T. Y. et al. Cerebral white matter lesions, retinopathy, and incident clinical stroke. JAMA 288, 67–74. https://doi.org/10.1001/jama.288.1.67 (2002).
Article PubMed Google Scholar
Fierini, F., Poggesi, A. & Pantoni, L. Leukoaraiosis as an outcome predictor in the acute and subacute phases of stroke. Expert Rev. Neurother. 17, 963–975 (2017).
Article CAS PubMed Google Scholar
Fazekas, F., Chawluk, J. B., Alavi, A., Hurtig, H. I. & Zimmerman, R. A. MR signal abnormalities at 1.5 T in alzheimer’s dementia and normal aging. Am. J. Neuroradiol. 8, 421–426 (1987).
PubMed Central Google Scholar
Wardlaw, J. M. et al. Neuroimaging standards for research into small vessel disease and its contribution to ageing and neurodegeneration. Lancet Neurol. 12, 822–838 (2013).
Article PubMed PubMed Central Google Scholar
Olsson, E. et al. White matter lesion assessment in patients with cognitive impairment and healthy controls: reliability comparisons between visual rating, a manual, and an automatic volumetrical MRI method—the gothenburg MCI study. J. Aging Res. (2013).
Valdés Hernández, M. C. et al. Close correlation between quantitative and qualitative assessments of white matter lesions. Neuroepidemiology 40, 13–22 (2012).
Article PubMed Google Scholar
Cedres, N. et al. Predicting Fazekas scores from automatic segmentations of white matter signal abnormalities. Aging (Albany NY). 12, 894 (2020).
Article PubMed Google Scholar
Shi, L. et al. Automated quantification of white matter lesion in magnetic resonance imaging of patients with acute infarction. J. Neurosci. Methods. 213, 138–146 (2013).
Article PubMed Google Scholar
Tsai, J. Z. et al. Automated segmentation and quantification of white matter hyperintensities in acute ischemic stroke patients with cerebral infarction. PloS One. 9, e104011. https://doi.org/10.1371/journal.pone.0104011 (2014).
Article ADS PubMed PubMed Central Google Scholar
Kim, B. J. et al. Prevention of cardiovascular events in Asian patients with ischaemic stroke at high risk of cerebral haemorrhage (PICASSO): a multicentre, randomised controlled trial. Lancet Neurol. 17, 509–518 (2018).
Article PubMed Google Scholar
Bossuyt, P. M. et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. Radiology 277, 826–832 (2015).
Article PubMed Google Scholar
Tustison, N. J. et al. N4ITK: improved N3 bias correction. IEEE Trans. Med. Imaging. 29, 1310–1320 (2010).
Article PubMed PubMed Central Google Scholar
Winkler, A., Kochunov, P. & Glahn D. (2012).
Isensee, F. et al. Automated brain extraction of multisequence MRI using artificial neural networks. Hum. Brain. Mapp. 40, 4952–4964 (2019).
Article PubMed PubMed Central Google Scholar
Guerrero, R. et al. White matter hyperintensity and stroke lesion segmentation and differentiation using convolutional neural networks. NeuroImage: Clin. 17, 918–934 (2018).
Article CAS PubMed Google Scholar
Hatamizadeh, A. et al. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. International MICCAI brainlesion workshop 272–284 (2021).
Hatamizadeh, A. et al. UNETR: Transformers for 3D Medical Image Segmentation. Proceedings of the IEEE/CVF winter conference on applications of computer vision 574–584 (2022).
Oktay, O. et al. Attention u-net: learning where to look for the pancreas. ArXiv Preprint arXiv 180403999 (2018).
Isensee, F., Jäger, P. F., Kohl, S. A., Petersen, J. & Maier-Hein, K. H. Automated design of deep learning methods for biomedical image segmentation. arXiv preprint arXiv:1904.08128 (2019).
Myronenko, A. in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Revised Selected Papers, Part II 4. 311–320 (Springer).
Cardoso, M. J. et al. Monai: an open-source framework for deep learning in healthcare. ArXiv Preprint arXiv 221102701 (2022).
Joo, L. et al. Diagnostic performance of deep learning-based automatic white matter hyperintensity segmentation for classification of the Fazekas scale and differentiation of subcortical vascular dementia. PloS One. 17, e0274562. https://doi.org/10.1371/journal.pone.0274562 (2022).
Article CAS PubMed PubMed Central Google Scholar
Selvaraju, R. R. et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE international conference on computer vision 618–626, (2017).
Park, J. H. et al. White matter hyperintensities and recurrent stroke risk in patients with stroke with small-vessel disease. Eur. J. Neurol. 26, 911–918. https://doi.org/10.1111/ene.13908 (2019).
Article PubMed Google Scholar
Liu, S. et al. Identification of white matter lesions in patients with acute ischemic lesions using U-net. Front. Neurol. 11, 1008. https://doi.org/10.3389/fneur.2020.01008 (2020).
Article PubMed PubMed Central Google Scholar
Lee, A. R. et al. Fully automated segmentation on brain ischemic and white matter hyperintensities lesions using semantic segmentation networks with squeeze-and-excitation blocks in MRI. Inf. Med. Unlocked. 21, 100440. https://doi.org/10.1016/j.imu.2020.100440 (2020).
Article Google Scholar
Han, L., Liu, L., Hao, Y. & Zhang, L. Diagnosis and treatment effect of convolutional neural Network-Based magnetic resonance image features on severe stroke and mental state. Contrast Media Mol. Imaging. 2021 (8947789). https://doi.org/10.1155/2021/8947789 (2021).
Cedres, N. et al. Predicting Fazekas scores from automatic segmentations of white matter signal abnormalities. Aging 12, 894–901. https://doi.org/10.18632/aging.102662 (2020).
Article PubMed PubMed Central Google Scholar
Rieu, Z. et al. A fully automated visual grading system for white matter hyperintensities of T2-Fluid attenuated inversion recovery magnetic resonance imaging. J. Integr. Neurosci. 22, 57. https://doi.org/10.31083/j.jin2203057 (2023).
Article PubMed Google Scholar
Li, M., Gao, Q. & Yu, T. Kappa statistic considerations in evaluating inter-rater reliability between two raters: which, when and context matters. BMC Cancer. 23, 799. https://doi.org/10.1186/s12885-023-11325-z (2023).
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

We would like to express our gratitude to Min-Hwan Lee, Hyuk-Sung Kwon, Sun U. Kwon, Jong-Ho Park, Yong-Jae Kim, Keun-Sik Hong, Sungwook Yu, Yang-Ha Hwang, Juneyoung Lee, Joung-Ho Rha, Sung Hyuk Heo, Sung Hwan Ahn, Woo-Keun Seo, Jong-Moo Park, Ju-Hun Lee, Jee-Hyun Kwon, Sung-Il Sohn, Hahn Young Kim, Eung-Gyu Kim, Sung Hun Kim, Jae-Kwan Cha, Man-Seok Park, Hyo Suk Nam, Dong-Wha Kang; on behalf of the PICASSO Investigators.

Funding

This work was supported by the K-Brain Project of the National Research Foundation (NRF), funded by the Korean government (MSIT) (No. RS-2023-00265393), the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. RS-2024-00347016), and a VHS Medical Center Research Grant (VHSMC25014), Republic of Korea. The funders had no role in the study design, the collection, analyses, or interpretation of the data, in the writing of the manuscript, or in the decision to publish the results.

Author information

Eun-Tae Jeon and Seung Min Kim contributed equally to this work.

Authors and Affiliations

Department of Neurology, Korea University Ansan Hospital, University College of Medicine, Ansan, South Korea
Eun-Tae Jeon & Jin-Man Jung
Department of Neurology, Veterans Health Service Medical Center, Seoul, South Korea
Seung Min Kim
Korea University Zebrafish Translational Medical Research Center, Ansan, South Korea
Jin-Man Jung

Contributions

ETJ: Conceptualization, Formal analysis, Writing - Original Draft, Writing - Review & Editing. SMK: Conceptualization, Writing - Original Draft, Writing - Review & Editing. JMJ: Conceptualization, Investigation, Writing - Original Draft, Writing - Review & Editing, Supervision.

Corresponding author

Correspondence to Jin-Man Jung.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Jeon, ET., Kim, S.M. & Jung, JM. Automated rating of Fazekas scale in fluid-attenuated inversion recovery MRI for ischemic stroke or transient ischemic attack using machine learning. Sci Rep 15, 32219 (2025). https://doi.org/10.1038/s41598-025-17287-5

Received: 19 October 2024
Accepted: 22 August 2025
Published: 01 September 2025
DOI: https://doi.org/10.1038/s41598-025-17287-5