Abstract
We explored effects of (1) training with various sample sizes of multi-site vs. single-site training data, (2) cross-site domain adaptation, and (3) data sources and features on the performance of algorithms segmenting cerebral infarcts on Magnetic Resonance Imaging (MRI). We used 10,820 annotated diffusion-weighted images (DWIs) from 10 university hospitals. Algorithms based on 3D U-net were trained using progressively larger subsamples (ranging from 217 to 8661), while internal testing employed a distinct set of 2159 DWIs. External validation was conducted using three unrelated datasets (n = 2777, 50, and 250). For domain adaptation, we utilized 50 to 1000 subsamples from the 2777-image external target dataset. As the size of the multi-site training data increased from 217 to 1732, the Dice similarity coefficient (DSC) and average Hausdorff distance (AHD) improved from 0.58 to 0.65 and from 16.1 to 3.75 mm, respectively. Further increases in sample size to 4330 and 8661 led to marginal gains in DSC (to 0.68 and 0.70, respectively) and in AHD (to 2.92 and 1.73 mm). Similar outcomes were observed in external testing. Notably, performance was relatively poor for segmenting brainstem or hyperacute (< 3 h) infarcts. Domain adaptation, even with a small subsample (n = 50) of external data, enabled the algorithm trained with 217 images to perform comparably to an algorithm trained with 8661 images. In conclusion, the use of multi-site data (approximately 2000 DWIs) and domain adaptation significantly enhances the performance and generalizability of deep learning algorithms for infarct segmentation.
Introduction
Diffusion-weighted magnetic resonance imaging (DWI) has been a critical imaging technique for the diagnosis and treatment of acute ischemic stroke because it is highly sensitive in detecting acute cerebral infarcts 1. DWI lesion volume 2 and pattern 3 can predict post-stroke outcomes 4 and future cerebrovascular events 5. Moreover, DWI can guide acute recanalization therapy 6 by triaging patients based on their infarct volumes.
There is a clinical need for automated segmentation of DW images. Since human segmentation of the infarct core is time-consuming and demands clinical expertise, multiple deep learning techniques have been developed for automatic segmentation of DWI lesions 7. However, such techniques are critically dependent on the quantity and quality of the datasets used to build the algorithms, and most studies to date have utilized single-site training data with modest sample sizes (Table 1). Only a few studies have externally tested their deep learning algorithms, reporting, as expected, that Dice similarity coefficients (DSCs) were much higher for internal data than for external data 8.
Large-scale, multi-site training data are needed to avoid two machine learning failure modes: (a) the generalization problem, in which a deep learning algorithm fails to learn patterns that generalize to unseen data 9, and (b) the domain shift problem, in which an algorithm that performs well in the setting where its data were collected (source domain) performs poorly in settings where data characteristics differ (target domain) 10. However, collecting and labeling extensive imaging data from multiple centers is a challenging and labor-intensive process that requires thorough knowledge of neuroimaging. For deep learning algorithms for DWI lesion segmentation in particular, the optimal sample size that minimizes these failure modes is not yet known.
To overcome the domain shift problem, domain adaptation has been successfully applied in various fields 11. Domain adaptation refers to a technique in machine learning where a model trained on a source domain (with a specific data distribution) is adapted to perform effectively on a target domain (with a different data distribution), despite the discrepancy between the two 12. However, studies exploring the effect of domain adaptation on the performance of deep learning algorithms for DWI infarct segmentation have not been reported yet. Clearly, the sample sizes of both source domain data and target domain data would be important variables to consider in such studies.
In this nationwide multi-center study (Fig. 1), 10,820 patients’ DW images (collected consecutively from 10 university hospitals) were used to develop deep learning-based infarct segmentation algorithms. These algorithms were tested using three external datasets. We examined effects of (1) various sample sizes of multi-site vs single-site training data, (2) data features, and (3) domain adaptation on algorithm performance.
Study flow chart. Panels in red and blue show datasets used for developing and testing the multi-site and single-site algorithms, respectively. Panel in green shows datasets used in the domain adaptation experiments with different sample size setups. TIA = transient ischemic attack, DWI = diffusion-weighted image, ISLES = Ischemic Stroke Lesion Segmentation.
Material and methods
Ethical statement
Institutional Review Boards of Dongguk University Ilsan Hospital approved this study (2010-01-083-015). All patients or their legally authorized representatives provided written informed consent for study participation. This study complies with the Declaration of Helsinki.
Training cohort
Multi-site data
This study included brain DW images from the Korean nationwide image-based stroke database project 13,14,15. From 2011 to 2014, we consecutively enrolled 12,013 patients with acute ischemic stroke or transient ischemic attack. Among the 12,013 screened patients, we excluded those with contraindications to MRI (n = 258), poor image quality (n = 78), unavailability of diffusion-weighted images (DWI) (n = 826), and MRI processing errors (n = 31), leaving 10,820 patients available (Fig. 1); these were split 8:2 into the Training-and-validation dataset (n = 8,661) and the Internal test dataset (n = 2,159).
Single-site data
To investigate performance of an algorithm trained using single-site data, we chose one of the 10 hospitals to prepare the Single-site dataset with 476 DW images (Fig. 1), which is comparable to the amounts of training data in previous studies 16,17.
Clinical data including stroke negative cases
To evaluate the clinical utility of the algorithm, we included a consecutive series of 1,194 subjects who underwent DW imaging in the emergency room to rule out ischemic stroke at a single university hospital that was not part of our internal or external testing sites, between March and August 2019. After excluding 356 subjects with any intracranial hemorrhage (n = 218), encephalopathy (n = 53), brain tumor (n = 50), metallic artifact (n = 24), motion artifact (n = 3), post-craniectomy status (n = 3), demyelinating disease (n = 2), and encephalitis (n = 1), 838 cases remained for analysis. For the clinical dataset, the presence of ischemic stroke was retrieved from the formal report by neuroradiologists and confirmed by an experienced vascular neurologist. We defined true positives as any observed overlap between the actual lesion and the segmentation mask.
External test cohorts
First, 2,777 DW images (the External dataset) were consecutively collected from a university hospital during the same period as the training cohort (Fig. 1). Second, the Ancillary test dataset I comprised DW images of 50 patients with ischemic stroke due to atrial fibrillation from three different university hospitals between 2011 and 2014 18. Third, the Ancillary test dataset II (n = 250) comprised Ischemic Stroke Lesion Segmentation Challenge (ISLES) 2022 data 19.
DW images and ischemic lesion segmentation
Images of the Training-and-validation dataset, the Internal test dataset, and the External dataset were obtained using 1.5-Tesla or 3.0-Tesla magnetic resonance imaging (MRI) systems. DWI protocols were: b-values of 0 and 1,000 s/mm2, TR of 2,400–9,000 ms, TE of 50–99 ms, voxel size of 1 × 1 × 3–5 mm3, interslice gap of 0–2 mm, and thickness of 3–7 mm. Ischemic lesions on DW images were segmented by experienced researchers using an in-house software, Image_QNA, under the close guidance of an experienced vascular neurologist, as previously described 13,14. In brief, ischemic stroke lesions on DW images were semi-automatically outlined by researchers at the Korean Brain MRI Data Center, a central imaging lab. The manual segmentation process involved the following steps: (a) loading the patient’s DW images; (b) segmenting areas of high signal intensity on DW images (b1000) by drawing a region of interest (ROI) around the high-intensity lesions and selecting any lesion-related pixel within the ROI, which served as the starting point for Image_QNA’s smart region selection algorithm to approximate the lesions; (c) adjusting the initial segmentation by excluding or including additional pixels with similar signal intensity by scrolling the computer mouse wheel; (d) making fine adjustments to the location, size, and shape of the lesions using image editing tools; and (e) saving the final segmented data. All segmented images were carefully reviewed and edited by an experienced vascular neurologist (W-S. Ryu). For the Ancillary test dataset I, an experienced neurologist manually outlined ischemic lesions. In the Ancillary test dataset II, a hybrid human-algorithm annotation scheme was applied 19.
Image preprocessing and algorithm development
To train the infarct segmentation algorithm, noncontrast b1000 DW images were preprocessed by skull stripping using Otsu’s thresholding 20, N4 bias correction, and signal normalization (Supplementary Methods). To compare segmentation performance as the amount of multi-site training data increases, the Training-and-validation dataset was subsampled at 2.5, 5, 10, 20, 50, and 100% (217, 433, 866, 1,732, 4,330, and 8,661 patients’ images, respectively; Supplementary Fig. 1). Each of the subsampled multi-site datasets and the Single-site dataset (382 of the 476 DWIs) used an 8:2 training-to-validation ratio. For algorithm training, we employed a 3D U-Net 21 with some modifications. Random augmentation was performed during training. During initial training, the models were trained for 1,000 epochs. To prevent overfitting, the model with the lowest loss on the validation set was selected as the final model. Further details are available in the Supplementary material.
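The preprocessing steps above can be sketched as follows. This is a minimal, numpy-only illustration of Otsu thresholding for skull stripping and z-score signal normalization; N4 bias correction is omitted because it requires a dedicated tool (e.g., SimpleITK), and the function names are ours, not those of the study's actual pipeline.

```python
import numpy as np


def otsu_threshold(image, bins=256):
    """Return the Otsu threshold separating dark background from head voxels."""
    hist, edges = np.histogram(image.ravel(), bins=bins)
    prob = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2.0
    best_t, best_var = centers[0], -1.0
    for i in range(1, bins):
        w0, w1 = prob[:i].sum(), prob[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (prob[:i] * centers[:i]).sum() / w0
        mu1 = (prob[i:] * centers[i:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance
        if var_between > best_var:
            best_var, best_t = var_between, centers[i]
    return best_t


def preprocess_dwi(volume):
    """Skull-strip by Otsu thresholding, then z-score normalize brain voxels."""
    t = otsu_threshold(volume)
    mask = volume > t
    out = np.zeros_like(volume, dtype=float)
    brain = volume[mask].astype(float)
    out[mask] = (brain - brain.mean()) / (brain.std() + 1e-8)
    return out, mask
```

In practice, libraries such as SimpleITK provide both Otsu filtering and N4 bias-field correction; the sketch only conveys the order of operations described above.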
Algorithm evaluation
Segmentation performance was evaluated using the average Hausdorff distance (AHD) and the DSC.
Average Hausdorff Distance

$$\mathrm{AHD}(S, G) = \frac{1}{2}\left(\frac{1}{|S|}\sum_{s \in S}\min_{g \in G}\lVert s - g\rVert + \frac{1}{|G|}\sum_{g \in G}\min_{s \in S}\lVert g - s\rVert\right)$$

where S and G are the point sets of the segmentation and the ground truth, respectively.
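For illustration, both metrics can be computed from binary masks as follows. This is a numpy-only sketch assuming isotropic 1 mm voxels; the brute-force pairwise distance computation is fine for small examples, while real pipelines typically use distance transforms.

```python
import numpy as np


def dice(pred, truth):
    """Dice similarity coefficient between two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    return 2.0 * inter / denom if denom else 1.0


def average_hausdorff(pred, truth, spacing=1.0):
    """Symmetric average Hausdorff distance between the voxel point sets
    of the predicted and ground-truth masks (spacing in mm per voxel)."""
    s = np.argwhere(pred) * spacing   # point set S (prediction)
    g = np.argwhere(truth) * spacing  # point set G (ground truth)
    if len(s) == 0 or len(g) == 0:
        return np.inf
    # Pairwise Euclidean distances between every s and every g
    d = np.linalg.norm(s[:, None, :] - g[None, :, :], axis=-1)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```

For two identical masks, `dice` returns 1.0 and `average_hausdorff` returns 0.0; shifting a lesion by one voxel leaves AHD at the voxel spacing while DSC can drop sharply for small lesions, which is the size dependence discussed later in the paper.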
Additionally, voxelwise precision was defined as the proportion of correctly predicted infarct voxels relative to the total number of predicted infarct voxels. We also assessed the algorithm performance depending on the differences in:
1. infarct volume (< 1.7 mL, 1.7–14 mL, and > 14 mL) 8;
2. last-known-well (LKW)-to-imaging time (< 3 h, 3–24 h, and > 24 h);
3. infarct location (cortex, corona radiata, basal ganglia and internal capsule, thalamus, midbrain, pons, medulla, cerebellum, and multiple);
4. MRI vendor;
5. the presence vs. absence of chronic infarct, which was defined as (a) 3–15 mm ischemic lesions outside the basal ganglia, brainstem, thalamus, internal capsule, or cerebral white matter or (b) ischemic lesions larger than 15 mm in any area on fluid-attenuated inversion recovery images 22; and
6. deciles of white matter hyperintensity (WMH) volume, which was quantified as previously described 23.
Infarct location data were retrieved from a prospective web-based stroke registry (strokedb.or.kr). The infarct location for each patient was determined by the attending neurologist at each hospital.
Domain adaptation
To investigate whether domain adaptation affects segmentation performance, we randomly divided the External dataset into the Additional training-and-validation dataset for domain adaptation and the Test dataset for domain adaptation (n = 1,000 and 1,777, respectively; Fig. 1 and Supplementary Fig. 1). During the domain adaptation phase, the model was fine-tuned for 100 epochs, and, as in the full model training, the final model was selected using the validation set. All layers were fine-tuned to rapidly adapt the model (no frozen layers). To assess the effect of the domain adaptation sample size on segmentation performance, the Additional training-and-validation dataset was subsampled at 5, 10, 20, 50, and 100%. These subsampled datasets were used for fine-tuning the algorithm trained with 100% of the (multi-site) 8,661 patients’ DW images. Moreover, to evaluate whether the sample size for initial training affects algorithm performance after domain adaptation, initial deep learning was performed with 2.5, 5, 10, 20, 50, and 100% of the Training-and-validation dataset (n = 8,661) and then fine-tuned with the Additional training-and-validation dataset for domain adaptation (sample size of 50, 100, or 200).
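The fine-tuning setup above (all parameters updated on a small target-domain subsample after initial training on the source domain) can be illustrated with a toy numpy example. A linear model stands in for the 3D U-Net, and all data and numbers are illustrative, not the study's.

```python
import numpy as np

rng = np.random.default_rng(42)


def fit_step(w, X, y, lr=0.1):
    """One gradient-descent step on mean squared error; all weights updated
    (no frozen parameters, mirroring the paper's no-frozen-layers choice)."""
    grad = 2.0 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad


def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))


# Source domain: y = 2*x0 + 1*x1 (plus noise), plenty of data
Xs = rng.normal(size=(500, 2))
ys = Xs @ np.array([2.0, 1.0]) + rng.normal(0, 0.1, 500)

# Target domain: shifted relationship, only a small subsample (cf. n = 50)
Xt = rng.normal(size=(50, 2))
yt = Xt @ np.array([2.5, 0.5]) + rng.normal(0, 0.1, 50)

# "Initial training" on the source domain
w = np.zeros(2)
for _ in range(200):
    w = fit_step(w, Xs, ys)
before = mse(w, Xt, yt)

# "Domain adaptation": fine-tune all parameters on the small target set
for _ in range(100):
    w = fit_step(w, Xt, yt)
after = mse(w, Xt, yt)
```

After fine-tuning, the target-domain error drops well below its pre-adaptation value, at the cost of drifting away from the source-domain solution, the same trade-off the paper reports for internal testing after domain adaptation.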
Statistical analysis
To compare baseline characteristics of the datasets, we used ANOVA, the Kruskal–Wallis test, and the chi-square test as appropriate. To compare infarct volumes (ground truth vs. algorithm prediction), we used Bland–Altman plots and correlation plots. To test whether DSC increases or AHD decreases as the training sample size increases, we used linear regression analysis. Performance differences between algorithms were tested using the paired t-test. All statistical analyses were performed using STATA 16.1 (STATA Corp., Texas, USA), and p < 0.05 was considered statistically significant.
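The trend and comparison tests above can be sketched in numpy. The DSC values below are illustrative approximations of those reported in the Results, and only the t statistic is computed here; in practice the p-values come from the statistical package.

```python
import numpy as np

# Illustrative mean DSCs at the six training sample sizes (approximate values)
n = np.array([217, 433, 866, 1732, 4330, 8661])
dsc = np.array([0.58, 0.62, 0.65, 0.67, 0.68, 0.70])

# Trend test: regress DSC on log10(sample size); a positive slope
# indicates DSC increases with training data
slope, intercept = np.polyfit(np.log10(n), dsc, 1)


def paired_t(a, b):
    """Paired t statistic for per-patient metrics of two algorithms."""
    d = np.asarray(a, float) - np.asarray(b, float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
```

The paired design matters here: each patient's image is scored by both algorithms, so differencing per patient removes between-patient variability before testing.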
Results
Baseline characteristics of study population
Mean age was 67.9 ± 12.9 years (58.9% males) in the Training-and-validation dataset (n = 8,661; Table 1). Median National Institutes of Health Stroke Scale (NIHSS) score was 4 (interquartile range, 2–9) and median infarct volume was 1.95 mL (interquartile range 0.5–11.1). Compared with the Training-and-validation dataset and the Internal test dataset, the External dataset was characterized by more cardioembolic strokes, shorter time intervals from last-known-well (LKW)-to-imaging acquisition, and larger infarct volumes. Moreover, MR vendors, magnetic field strengths, and imaging parameters were different among the Training-and-validation dataset, Internal test dataset, and External dataset (Table 2 and Supplementary Table 1). Estimated background noise and estimated signal-to-noise ratios in the Internal dataset varied widely among the 10 participating hospitals (Supplementary Fig. 2).
Performance of a deep learning algorithm trained using single-site data
Mean age was 68.8 ± 13.2 years (60.8% males) in the Single-site training-and-validation dataset (n = 382). Median infarct volume was 1.70 mL (interquartile range 0.53–11.25 mL; Supplementary Table 2). For the Single-site internal test dataset, the algorithm achieved an AHD of 8.44 ± 8.58 mm and a DSC of 0.70 ± 0.23 with a voxel-wise sensitivity of 0.69 and a precision of 0.78 (Supplementary Table 3). However, the single-site algorithm showed substantially lower performance in the tests using the External dataset, with DSC and AHD values of 0.50 ± 0.31 and 19.98 ± 14.06 mm (both p < 0.001), respectively.
Effects of training data sample size on the performance of deep learning algorithms to segment acute infarcts on DW images
As the sample size of the Training-and-validation dataset increased from 217 to 866, the DSC of the 3D U-net algorithm increased from 0.58 to 0.65 and the AHD decreased from 16.13 to 6.31 mm for the Internal test dataset (Fig. 2). When the sample size was further increased to 1,732, DSC increased less steeply, approaching a plateau (0.67). When the sample size was further increased to 4,330 and 8,661, DSC increased only slightly, to 0.68 and 0.70, respectively. Similarly, AHD gradually decreased as the sample size increased, although the rate of improvement diminished. Similar results were seen in the tests using the External dataset (see Supplementary Fig. 3 for the Ancillary test datasets I and II). When the sample size was 433 or greater, DSC values in the External dataset were significantly higher than those in the Internal test dataset. In both the Internal test dataset and the External dataset, infarct volumes segmented and quantified by the algorithm (trained with 8,661 DW images) showed strong correlations with ground-truth infarct volumes (both r2 = 0.96, p < 0.001; Supplementary Fig. 4), although the algorithm tended to underestimate infarct volumes. In contrast to the steep initial increase in DSC, precision in both the Internal test dataset and the External dataset changed only slightly as the sample size increased (Fig. 2).
Effects of training data sample size on the performance of deep learning algorithms to segment acute infarcts, with or without stratification by lesion volume and last-known-well to imaging time. (a) Dice similarity coefficient (DSC). (b) Average Hausdorff distance (AHD). (c) Voxel-wise precision. (d–f) DSC stratified by infarct volume (< 1.7, 1.7–14, and > 14 mL). (g–i) DSC stratified by last-known-well to imaging time. Dots and bars indicate mean and standard error values, respectively. Data of last-known-well to imaging time were missing for 565 and 1,849 patients in Internal test dataset and External dataset, respectively. See Fig. 1 for a better understanding of datasets. Gray dotted lines indicate data points for the sample sizes of 217, 433, 866, 1,732, 4,330, and 8,661. Precision was calculated voxel-wise. Compared with the DSC or AHD for the algorithm trained with 217 patients’ images, all the other DSCs or AHDs for the algorithms trained with 433, 866, 1,732, 4,330, and 8,661 patients’ images were significantly different.
Effects of sample size on the performance of deep learning algorithms to segment acute DWI lesions according to the volume and location of acute infarcts, presence of chronic ischemic lesions, LKW-to-imaging time, and MRI vendors
When the Internal test dataset and the External dataset were divided into small (< 1.7 mL, n = 994 and 1,046), medium (1.7–14.0 mL, n = 587 and 904), and large (> 14.0 mL, n = 446 and 825) infarct groups, DSCs for the internal and external testing were highest (up to ~ 0.8) in the large infarct group, lower (up to ~ 0.7) in the medium infarct group, and lowest (up to ~ 0.6) in the small infarct group (Fig. 2). Also, AHDs were lowest in the large infarct group (down to ~ 0.37 mm), followed by the medium infarct group (~ 0.89 mm) and the small infarct group (~ 2.61 mm; Supplementary Fig. 5). This finding is consistent with the generally higher performance of our algorithms in the tests using the External dataset as opposed to the Internal test dataset, given that the mean infarct volume in the former was approximately twice that in the latter.
With regard to lesion locations (Fig. 3), DSCs were generally higher for supratentorial lesions (~ 0.65 or higher) than for infratentorial lesions (~ 0.6 or lower), except for cerebellar lesions (in the tests using the Internal test dataset and the External dataset) and thalamus (in the test using the External dataset) with DSCs being about 0.7. However, AHDs were comparable across lesion locations, except for the medulla, where AHDs exceeded 6 mm in both the internal test dataset and the external dataset (Supplementary Fig. 6).
Performance of deep learning algorithms to segment infarcts in the Internal test dataset and the External dataset, with stratification by lesion location. (a) Cortex. (b) Corona radiata. (c) Basal ganglia and internal capsule. (d) Thalamus. (e) Midbrain. (f) Pons. (g) Medulla. (h) Cerebellum. (i) Multiple locations. Dots and bars indicate mean and standard error values, respectively. Gray dotted lines indicate data points for the sample sizes of 217, 433, 866, 1,732, 4,330, and 8,661. Sensitivity and precision were calculated voxel-wise. Note that Y-axis ranges vary in each figure. Compared with supratentorial lesions (a–d), infratentorial lesions (e–g) except for cerebellar lesions (h) show lower Dice similarity coefficients (DSCs). See Fig. 1 for a better understanding of datasets.
When data were stratified based on the presence of chronic infarcts and white matter hyperintensity (WMH) volumes, similar algorithm performances were observed across groups (Supplementary Fig. 7 and Supplementary Fig. 8).
When data were stratified based on the LKW-to-imaging time, DSCs were the highest (up to ~ 0.75) in the > 24-h group, slightly lower (up to ~ 0.7) in the 3–24-h group, and the lowest (up to ~ 0.55 and ~ 0.65) in the < 3-h group (Fig. 2). Similarly, AHDs were highest in the < 3-h group and comparable in the 3–24-h group and > 24-h groups (Supplementary Fig. 5). With respect to MRI vendors, the deep learning algorithm showed better performances for Philips or GE images than for Siemens images (Supplementary Table 4).
In tests of the algorithm trained with 8,661 patients’ DW images, DSCs and AHDs for the Internal test dataset varied, ranging from 0.45 to 0.78 and from 0.27 to 19.48 mm, depending on the participating center and the training data sample size, especially the latter (Supplementary Table 5).
Domain adaptation-related improvement of infarct segmentation in external testing
Domain adaptation using subsamples of the External dataset (target domain) enhanced the algorithm performance (Table 3 and Fig. 4). Before domain adaptation, the mean AHD was higher in the External dataset (3.02 mm) compared to the internal testing (1.73 mm). As the sample size for domain adaptation increased, the mean AHD values in the internal dataset fluctuated between 1.95 and 2.69 mm (all p < 0.001 compared with before domain adaptation). In contrast, in the external dataset, the mean AHD values progressively decreased from 3.02 to 1.89 mm (all p < 0.001 compared with before domain adaptation) with increasing sample size for domain adaptation.
Domain adaptation-related improvement of infarct segmentation in external testing. (a) Dice similarity coefficients (DSCs) for the Internal test dataset. (b) DSCs for the Test dataset for domain adaptation. (c) Average Hausdorff distance (AHD) for the Internal test dataset. (d) AHD for the Test dataset for domain adaptation. Data are presented as mean and standard error. Gray dotted lines indicate data points for the sample sizes of 217, 433, 866, 1,732, 4,330, and 8,661. See Fig. 1 for a better understanding of datasets.
When the sample size of the Training-and-validation (i.e., source domain) dataset was 217, additional training with 50 cases from the Additional training-and-validation (i.e., target domain) dataset for domain adaptation significantly increased DSC from 0.56 to 0.67 (p < 0.001; Fig. 4) in testing with the Test dataset for domain adaptation. When domain adaptation was performed with 200 cases, DSC was higher (0.71) than with 50 cases (p < 0.001). A similar pattern was observed when the sample size of the source domain dataset was 433. However, when this sample size was 866 or higher, there was only a slight improvement in segmentation performance. Thus, in terms of the effectiveness of deep learning algorithms, a training data sample size of 866 without domain adaptation was practically similar to one of 217 with subsequent domain adaptation. Notably, domain adaptation with subsamples of the target domain worsened the algorithm performance in internal testing (i.e., testing with the source domain data). This deterioration could be partly restored by increasing the sample size of the source domain data for initial deep learning to as high as 8,661. As the amount of training-and-validation data increased, AHD decreased gradually. However, the amount of additional training-and-validation data for domain adaptation had little impact on AHD.
Performance of algorithms on clinical datasets, including stroke-negative cases
For the clinical dataset (n = 838), the mean age was 63.8 (SD 16.0), 57.2% were male, and 154 subjects (18.4%) had ischemic lesions. The algorithm achieved a sensitivity of 99.4% (95% CI: 96.4%–99.9%) and a specificity of 43.3% (39.5%–47.1%). The most frequent causes of false positives were asymmetric high signal intensity in the corticospinal tract at the internal capsule (n = 221), T2 shine-through artifacts (n = 122), and detection of subtle high signal intensity on the cortex (n = 14). By setting a threshold for the predicted infarct volume at 0.087 mL, the algorithm demonstrated a sensitivity of 95.0% and a specificity of 73.1%.
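The volume-threshold operating point described above can be computed as follows; this is a small numpy sketch with hypothetical data, and the function name is ours.

```python
import numpy as np


def sens_spec_at_threshold(pred_volumes, has_lesion, threshold_ml):
    """Classify a scan as positive if its predicted infarct volume exceeds
    the threshold, then compute sensitivity and specificity against the
    ground-truth lesion labels."""
    pred_pos = np.asarray(pred_volumes, dtype=float) > threshold_ml
    truth = np.asarray(has_lesion, dtype=bool)
    tp = np.sum(pred_pos & truth)    # lesions correctly flagged
    tn = np.sum(~pred_pos & ~truth)  # lesion-free scans correctly cleared
    sensitivity = tp / truth.sum()
    specificity = tn / (~truth).sum()
    return sensitivity, specificity
```

Sweeping `threshold_ml` over the predicted volumes traces the sensitivity-specificity trade-off; the paper's reported 0.087 mL cut point is one such operating point chosen to raise specificity while keeping sensitivity at 95%.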
Discussion
Hyperacute strokes or tiny ischemic lesions are often difficult for even experts to detect 24. Our algorithm may assist physicians in identifying early infarcts and small ischemic lesions, potentially facilitating stroke workflow. Additionally, the algorithm enables physicians to measure infarct volumes, providing valuable data for further research into the clinical implications of infarct evolution and treatment strategies aimed at preventing infarct expansion 14. In the clinical dataset, the algorithm demonstrated high sensitivity but relatively low specificity. Since the algorithm was trained to detect areas of high signal intensity solely based on DWI, features such as corticospinal tract asymmetry and T2 shine-through artifacts may be mistakenly identified as lesions 25. However, the exceptionally high sensitivity of the algorithm could be advantageous for screening patients presenting with stroke symptoms in the emergency room.
The performance of the deep learning-based DWI lesion segmentation algorithm trained on the single-center dataset (n = 382) was much lower in all three external tests than in the internal test. To develop a more robust algorithm that generalizes well and performs better on unseen data, multi-site training data are needed, as they better reflect the heterogeneity of the ischemic stroke phenotype as well as the diversity of MR equipment and protocols in real-world clinical use. However, it is challenging to obtain, label, and annotate a high volume of multi-center data. Our findings suggest that multi-site data with a sample size of about 866–1,732 might be cost-effective for developing a reasonable deep learning algorithm for DWI lesion segmentation. To enhance deep learning algorithms’ capacity to generalize to new cases, data augmentation can be used to artificially increase the amount and diversity of training data by generating modified copies of existing data. However, this method carries over the biases of the existing data, such as noise- and resolution-related ones, without increasing the variety of infarct locations and patterns 26.
Domain generalization is a well-known challenge for machine learning in healthcare. Model performance in real-world conditions may be lower than expected due to discrepancies between the data used in model development and those encountered during deployment. A recent study has shown that generative models can enhance the fairness of medical classifiers under distribution shifts 27,28. However, the subtle nature of ischemic stroke features on DWI poses significant challenges, limiting the application of this method for our model training. It is clear that additional training data and more advanced image processing techniques are necessary to further improve the generalizability of the proposed model.
Utilizing a small amount of data from the target domain can resolve the domain shift issue, where the algorithm performs poorly on target data acquired from a different source or domain (unseen during training) due to differences in data distributions 29,30. Our study showed that, on the External dataset, the algorithm trained with 217 DW images and then domain-adapted with 50 additional DW images from the target domain performed comparably to the algorithm trained with 866 DW images without subsequent domain adaptation. As a trade-off of steering the deep learning algorithm toward the target domain, domain adaptation may worsen performance in the source domain. However, when a large multi-site dataset was employed for training, this had little impact on the algorithm’s performance in the source domain: the post-domain-adaptation (n = 200) DSC drop on the source-domain internal test data was 0.10 and 0.03 in the algorithms pretrained with 866 and 8,661 patients’ DW images, respectively. Although strategies such as freezing certain layers 31 and using a fixed pretrained sibling network 32 have been proposed to prevent catastrophic forgetting (a phenomenon in which a model trained on new tasks or data loses previously learned information), we did not apply these methods in this study. There is a stability-plasticity trade-off 33, in which freezing layers can reduce the effectiveness of transfer learning. We came down on the side of plasticity and, given the heterogeneity of imaging inputs in the real world, believe this choice is justified for our study.
DSCs for DWI lesion segmentation were low when infarcts were small or MRI was performed early (within 3 h of symptom onset). Given that the External dataset (for external testing) had approximately two-fold bigger infarct volumes than the Internal test dataset, this finding is in line with higher DSCs for the former (vs. the latter) dataset. In addition, training on multi-site data may have led to the robustness to external testing. Deep learning algorithms performed poorly on brainstem infarcts, probably due to the small number of cases even in the large training data (n = 8,661) and a relatively complex anatomical structures and variations of the posterior fossa near the brainstem 34. A strategy for enhancing the segmentation performance for brainstem infarcts should be developed in future research.
Among the metrics used to measure segmentation performance, the DSC is known to be dependent on lesion size 35. The same degree of displacement or error results in a lower DSC for smaller lesions. We observed a paradoxically higher DSC in the external dataset compared to the internal dataset, which was used to train the model, as the external dataset contained larger lesions. In contrast, the AHD, a metric independent of lesion size, exhibited slightly lower but comparable values in the internal test dataset (1.73 mm) and the external test dataset (1.82 mm) when trained on 8,661 samples.
We took a conservative approach and were reluctant to modify original patient data in our inputs, accepting the resulting variability in quality and its attendant difficulties as the price of doing business in the real world. However, recent studies have demonstrated that segmentation algorithms built on datasets with a small number of original images, augmented to thousands of diverse variations, can achieve strong performance 36,37. Expanding the range and diversity of augmentations could potentially reduce the need for large-scale datasets and improve generalizability across domains. This highlights a self-imposed limitation of our current approach, in which augmentation variability was not used to fully address domain-specific biases. Future studies should compare more extensive augmentation strategies with our conservative approach to optimize performance. Augmentation might be of considerable and ethically defensible use when large training datasets are not available, but its clinical use would need to be justified.
This study has strengths, such as the large sample size of multi-site training data and extensive external testing. There are also limitations. First, using apparent diffusion coefficient images for training may have enhanced the segmentation performance. Second, the algorithm’s performance may have been improved by incorporating clinical data, such as NIHSS scores, which reflect specific neurological symptoms related to stroke, into the training process, similar to how physicians use such information in clinical practice. Third, the use of the same raters for segmenting both the internal and external test datasets may limit the generalizability of the algorithm, as inter-rater reliability in segmenting ischemic lesions varies in real-world scenarios. In future studies, our segmentation algorithm should be tested in a multicenter, multi-reader study to better evaluate its robustness and applicability. Fourth, caution should be taken when extrapolating our findings from Korean stroke patients to other ethnic groups, although previous research found no ethnic differences in the pattern of ischemic infarct on DW images 38.
Conclusion
In conclusion, our study demonstrates that either domain adaptation or large (n ≈ 1000) multi-site DWI training datasets are required for a reliable, generalizable infarct segmentation algorithm. In addition, future research should focus on improving the relatively low segmentation performance for small, brainstem, or hyperacute infarcts, a limitation that has not been previously described.
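For reference, the Dice similarity coefficient used to report segmentation performance throughout this work can be computed as follows. This is a minimal illustrative sketch of the standard overlap formula, not the evaluation code used in the study.

```python
# Dice similarity coefficient (DSC) between two binary segmentation masks:
# DSC = 2 * |A ∩ B| / (|A| + |B|), ranging from 0 (no overlap) to 1 (identical).
import numpy as np

def dice(pred, ref):
    """Compute the DSC between predicted and reference binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    inter = np.logical_and(pred, ref).sum()
    denom = pred.sum() + ref.sum()
    # Convention: two empty masks are considered a perfect match.
    return 2.0 * inter / denom if denom else 1.0

a = np.zeros((4, 4), dtype=int); a[1:3, 1:3] = 1   # 4 lesion voxels
b = np.zeros((4, 4), dtype=int); b[1:3, 1:4] = 1   # 6 voxels, 4 overlapping
print(dice(a, b))  # 2*4 / (4+6) = 0.8
```

The average Hausdorff distance (AHD), the study's other metric, complements DSC by measuring boundary-distance error rather than volumetric overlap.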
Data availability
The anonymized data used in this study will be made available upon reasonable request and approval by our Data Steering Committee; requests will be answered within thirty days (contact: Dr. Wi-Sun Ryu). According to the regulations of the Institutional Review Board and national law, the raw image data cannot be uploaded online and are accessible only to the researchers who participated in this study.
Abbreviations
- DWI: Diffusion-weighted image
- DSC: Dice similarity coefficient
- AHD: Average Hausdorff distance
- ISLES: Ischemic stroke lesion segmentation challenge
- NIHSS: National Institutes of Health Stroke Scale
- LKW: Last-known-well
- WMH: White matter hyperintensity
References
Albers, G. W. Diffusion-weighted MRI for evaluation of acute stroke. Neurology 51, S47-49. https://doi.org/10.1212/wnl.51.3_suppl_3.s47 (1998).
Thijs, V. N. et al. Is early ischemic lesion volume on diffusion-weighted imaging an independent predictor of stroke outcome? A multivariable analysis. Stroke 31, 2597–2602. https://doi.org/10.1161/01.str.31.11.2597 (2000).
Bang, O. Y. et al. Specific DWI lesion patterns predict prognosis after acute ischaemic stroke within the MCA territory. J. Neurol. Neurosurg. Psychiatry 76, 1222–1228. https://doi.org/10.1136/jnnp.2004.059998 (2005).
Barrett, K. M. et al. Change in diffusion-weighted imaging infarct volume predicts neurologic outcome at 90 days: Results of the Acute Stroke Accurate Prediction (ASAP) trial serial imaging substudy. Stroke 40, 2422–2427. https://doi.org/10.1161/STROKEAHA.109.548933 (2009).
Wen, H. et al. Multiple acute cerebral infarcts on diffusion-weighted imaging and risk of recurrent stroke. Neurology 63, 1317–1319 (2004).
Nogueira, R. G. et al. Thrombectomy 6 to 24 hours after stroke with a mismatch between deficit and infarct. N. Engl. J. Med. 378, 11–21. https://doi.org/10.1056/NEJMoa1706442 (2018).
Kim, Y. C. et al. A deep learning-based automatic collateral assessment in patients with acute ischemic stroke. Transl. Stroke Res. https://doi.org/10.1007/s12975-022-01036-1 (2022).
Liu, C. F. et al. Deep learning-based detection and segmentation of diffusion abnormalities in acute ischemic stroke. Commun. Med. (Lond) 1, 61. https://doi.org/10.1038/s43856-021-00062-8 (2021).
Kawaguchi, K., Kaelbling, L. P. & Bengio, Y. Generalization in deep learning. arXiv preprint arXiv:1710.05468 (2017).
Van Leemput, K., Maes, F., Vandermeulen, D. & Suetens, P. in Medical Image Computing and Computer-Assisted Intervention—MICCAI’98: First International Conference Cambridge, MA, USA, October 11–13, 1998 Proceedings 1. 1222–1229 (Springer).
Guan, H. & Liu, M. Domain adaptation for medical image analysis: A survey. IEEE Trans. Biomed. Eng. 69, 1173–1185 (2021).
Guan, H. & Liu, M. DomainATM: Domain adaptation toolbox for medical data analysis. Neuroimage 268, 119863. https://doi.org/10.1016/j.neuroimage.2023.119863 (2023).
Ryu, W. S. et al. Stroke outcomes are worse with larger leukoaraiosis volumes. Brain 140, 158–170. https://doi.org/10.1093/brain/aww259 (2017).
Ryu, W. S. et al. Relation of pre-stroke aspirin use with cerebral infarct volume and functional outcomes. Ann. Neurol. 90, 763–776. https://doi.org/10.1002/ana.26219 (2021).
Kim, D. E. et al. Mapping the supratentorial cerebral arterial territories using 1160 large artery infarcts. JAMA Neurol. 76, 72–80. https://doi.org/10.1001/jamaneurol.2018.2808 (2019).
Kim, Y. C. et al. Evaluation of diffusion lesion volume measurements in acute ischemic stroke using encoder-decoder convolutional network. Stroke 50, 1444–1451. https://doi.org/10.1161/STROKEAHA.118.024261 (2019).
Woo, I. et al. Fully automatic segmentation of acute ischemic lesions on diffusion-weighted imaging using convolutional neural networks: Comparison with conventional algorithms. Korean J. Radiol. 20, 1275–1284 (2019).
Kim, D. Y. et al. Covert brain infarction as a risk factor for stroke recurrence in patients with atrial fibrillation. Stroke 54, 87–95 (2023).
Hernandez Petzsche, M. R. et al. ISLES 2022: A multi-center magnetic resonance imaging stroke lesion segmentation dataset. Sci. Data 9, 762 (2022).
Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9, 62–66. https://doi.org/10.1109/tsmc.1979.4310076 (1979).
Çiçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox, T. & Ronneberger, O. in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17–21, 2016, Proceedings, Part II 19. 424–432 (Springer).
Wardlaw, J. M. et al. Neuroimaging standards for research into small vessel disease and its contribution to ageing and neurodegeneration. Lancet Neurol. 12, 822–838. https://doi.org/10.1016/S1474-4422(13)70124-8 (2013).
Ryu, W. S. et al. Grading and interpretation of white matter hyperintensities using statistical maps. Stroke 45, 3567–3575. https://doi.org/10.1161/STROKEAHA.114.006662 (2014).
Guadagno, J. V. et al. The diffusion-weighted lesion in acute stroke: heterogeneous patterns of flow/metabolism uncoupling as assessed by quantitative positron emission tomography. Cerebrovasc. Dis. 19, 239–246. https://doi.org/10.1159/000084087 (2005).
Moseley, M. E. et al. Diffusion-weighted MR imaging of acute stroke: Correlation with T2-weighted and magnetic susceptibility-enhanced MR imaging in cats. AJNR Am. J. Neuroradiol. 11, 423–429 (1990).
Chlap, P. et al. A review of medical image data augmentation techniques for deep learning applications. J. Med. Imaging Radiat. Oncol. 65, 545–563. https://doi.org/10.1111/1754-9485.13261 (2021).
Ktena, I. et al. Generative models improve fairness of medical classifiers under distribution shifts. Nat. Med. 30, 1166–1173. https://doi.org/10.1038/s41591-024-02838-6 (2024).
Yang, Y., Zhang, H., Gichoya, J. W., Katabi, D. & Ghassemi, M. The limits of fair medical imaging AI in real-world generalization. Nat. Med. 30, 2838–2848. https://doi.org/10.1038/s41591-024-03113-4 (2024).
Guan, H. & Liu, M. Domain adaptation for medical image analysis: A survey. IEEE Trans. Biomed. Eng. 69, 1173–1185. https://doi.org/10.1109/TBME.2021.3117407 (2022).
Singh, T. et al. Ftl-CoV19: A transfer learning approach to detect COVID-19. Comput. Intell. Neurosci. 2022, 1953992. https://doi.org/10.1155/2022/1953992 (2022).
Ke, Z., Liu, B., Ma, N., Xu, H. & Shu, L. Achieving forgetting prevention and knowledge transfer in continual learning. Adv. Neural Inf. Process. Syst. 34, 22443–22456 (2021).
Boschini, M. et al. in European Conference on Computer Vision. 692–709 (Springer).
Kim, S., Noci, L., Orvieto, A. & Hofmann, T. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11930–11939.
Luo, W., Li, Y., Urtasun, R. & Zemel, R. Understanding the effective receptive field in deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 29 (2016).
Zou, K. H. et al. Statistical validation of image segmentation quality based on a spatial overlap index. Acad. Radiol. 11, 178–189. https://doi.org/10.1016/s1076-6332(03)00671-8 (2004).
Garcea, F., Serra, A., Lamberti, F. & Morra, L. Data augmentation for medical imaging: A systematic literature review. Comput. Biol. Med. 152, 106391. https://doi.org/10.1016/j.compbiomed.2022.106391 (2023).
Goceri, E. Medical image data augmentation: Techniques, comparisons and interpretations. Artif. Intell. Rev. 1–45. https://doi.org/10.1007/s10462-023-10453-z (2023).
Bang, O. Y. et al. Clinical determinants of infarct pattern subtypes in large vessel atherosclerotic stroke. J. Neurol. 256, 591–599. https://doi.org/10.1007/s00415-009-0125-x (2009).
Wong, K. K. et al. Automatic segmentation in acute ischemic stroke: Prognostic significance of topological stroke volumes on stroke outcome. Stroke 53, 2896–2905 (2022).
Chen, L., Bentley, P. & Rueckert, D. Fully automatic acute ischemic lesion segmentation in DWI using convolutional neural networks. Neuroimage Clin. 15, 633–643. https://doi.org/10.1016/j.nicl.2017.06.016 (2017).
Woo, I. et al. Fully automatic segmentation of acute ischemic lesions on diffusion-weighted imaging using convolutional neural networks: Comparison with conventional algorithms. Korean J. Radiol. 20, 1275–1284. https://doi.org/10.3348/kjr.2018.0615 (2019).
Zhang, R. et al. Automatic segmentation of acute ischemic stroke from DWI using 3-D fully convolutional DenseNets. IEEE Trans. Med. Imaging 37, 2149–2160 (2018).
Winzeck, S. et al. Ensemble of convolutional neural networks improves automated segmentation of acute ischemic lesions using multiparametric diffusion-weighted MRI. AJNR Am. J. Neuroradiol. 40, 938–945. https://doi.org/10.3174/ajnr.A6077 (2019).
Liu, Z. et al. Towards clinical diagnosis: Automated stroke lesion segmentation on multi-spectral MR image using convolutional neural network. IEEE Access 6, 57006–57016 (2018).
Zhao, B. et al. Automatic acute ischemic stroke lesion segmentation using semi-supervised learning. arXiv preprint arXiv:1908.03735 (2019).
Alis, D. et al. Inter-vendor performance of deep learning in segmenting acute ischemic lesions on diffusion-weighted imaging: A multicenter study. Sci. Rep. 11, 12434. https://doi.org/10.1038/s41598-021-91467-x (2021).
Funding
This study was supported by the National Priority Research Center Program Grant (NRF-2021R1A6A1A03038865), the Basic Science Research Program Grant (NRF-2020R1A2C3008295), the Multiministry Grant for Medical Device Development (KMDF_PR_20200901_0098), and the Bioimaging Data Curation Center Program Grant (2022M3H9A2083956) of the National Research Foundation, funded by the Korean government.
Author information
Authors and Affiliations
Contributions
Wi-Sun Ryu: Conceptualization, Methodology, Validation, Formal analysis, Data curation, Writing—original draft, Writing—review & editing, Supervision, Project administration. Dawid Schellingerhout: Writing—original draft, Writing—review & editing. Jonghyeok Park: Methodology, Software, Validation. Jinyong Chung, Sang-Wuk Jeong, Dong-Seok Gwak, Beom Joon Kim, Joon-Tae Kim, Keun-Sik Hong, Kyung Bok Lee, Tai Hwan Park, Sang-Soon Park, Jong-Moo Park, Kyusik Kang, Yong-Jin Cho, Hong-Kyun Park, Byung-Chul Lee, Kyung-Ho Yu, Mi Sun Oh, Soo Joo Lee, Jae Guk Kim, Jae-Kwan Cha, Dae-Hyun Kim, Jun Lee, and Man Seok Park: Data curation, Writing—review & editing. Dongmin Kim: Conceptualization, Methodology, Software, Validation, Writing—review & editing, Project administration, Funding acquisition. Oh Young Bang: Validation, Writing—review & editing, Funding acquisition. Eung Yeop Kim: Writing—review & editing, Supervision. Chul-Ho Sohn: Methodology, Writing—review & editing, Project administration, Funding acquisition. Hosung Kim: Methodology, Formal analysis, Writing—original draft, Writing—review & editing. Hee-Joon Bae: Data curation, Writing—review & editing, Supervision. Dong-Eog Kim: Conceptualization, Methodology, Validation, Formal analysis, Data curation, Writing—original draft, Writing—review & editing, Supervision, Project administration, Funding acquisition.
Corresponding author
Ethics declarations
Competing interests
Wi-Sun Ryu, Jonghyeok Park, and Dongmin Kim are employees of JLK Inc. Hee-Joon Bae and Dong-Eog Kim are stockholders of JLK Inc. The other authors have no conflicts of interest to declare.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ryu, WS., Schellingerhout, D., Park, J. et al. Deep learning-based automatic segmentation of cerebral infarcts on diffusion MRI. Sci Rep 15, 13214 (2025). https://doi.org/10.1038/s41598-025-91032-w