Introduction

Intracranial hemorrhage is a potentially catastrophic neurological emergency requiring prompt attention, as neurological deterioration frequently occurs within the first hours after onset1,2,3. With a mortality rate of up to 45% and severe functional impairment among survivors, this condition underscores the need for timely care for patients presenting to the emergency department (ED)4,5,6,7. Non-enhanced brain computed tomography (CT), being noninvasive and rapid, serves as the primary diagnostic modality in the ED for central nervous system emergencies such as acute traumatic brain injury and intracranial hemorrhagic lesions8. Accurate identification of the type and location of acute intracranial hemorrhage on brain CT is therefore crucial for guiding subsequent clinical management, and it carries significant weight in determining the need for emergent surgical intervention and in selecting the appropriate surgical approach9.

The field of clinical medicine has witnessed marked progress in the integration of deep learning technology. While various deep-learning solutions for diagnosing intracranial lesions are gradually being incorporated into radiology and have demonstrated considerable diagnostic capabilities10,11,12,13,14, their applicability and utility in clinical practice remain largely uncertain.

Deep learning-based assistive algorithms are best positioned to support physicians' decision-making rather than to replace them by making diagnoses autonomously, primarily because such algorithms do not apply equally well to all patients and cannot meet the diverse objectives of all physicians. Wide implementation of these solutions as clinical decision support systems therefore requires comprehensive evaluation within the clinical workflow of the physicians who will use them15,16,17,18.

Therefore, this study aimed to evaluate the impact of utilizing a deep learning-based assistive intracranial hemorrhage detection algorithm (DLHD) on the interpretation of non-enhanced brain CT scans and decision-making through a simulation-based interventional design.

Methods

Study design and participants

This simulation-based prospective interventional study was conducted using a web-based questionnaire. Participants were recruited through an official notice as part of a research project operated by the National IT Industry Promotion Agency under the Government of South Korea. Eligible participants were individuals aged 18 years or older, including board-certified emergency physicians, residents undergoing emergency medicine training, and emergency medical technicians (EMTs) working at the study site’s ED. Individuals who did not comprehend the study content or who withdrew after agreeing to participate were excluded. The study was conducted at an emergency center within a tertiary hospital staffed by 18 board-certified emergency physicians, 29 residents undergoing emergency medicine training, and 5 EMTs with nationally certified licenses. Among them, three board-certified emergency physicians, two senior residents, three junior residents, and two EMTs voluntarily participated, for a total of ten participants. The participants’ work experience in the ED varied: two emergency physicians had 89 months of experience; one, 53 months; two, 41 months; one EMT, 9 months; one EMT, 6 months; and three emergency physicians, 5 months. Emergency medical professionals with less than 24 months of experience were classified as “inexperienced.” All participants received detailed information regarding the study purpose and the mechanics of the simulation system, and informed consent was obtained from all participants before enrollment. The study adhered to the ethical standards outlined in the Declaration of Helsinki and was approved by the Institutional Review Board of Severance Hospital, South Korea (approval number 4-2023-0821), which granted a waiver of documentation, waiving the requirement for written informed consent.

Selection of clinical data

A total of 2596 patients underwent non-enhanced brain CT in the ED between July and December 2022. The study included adult patients aged 18 years or older who were initially assessed in the ED. Of the collected brain CT data, 450 follow-up CT scans ordered by physicians outside the ED were excluded. Because cases without intracranial hemorrhage (ICH) could dominate the remaining 2146 cases given the clinical environment of the ED, the DLHD’s interpretation of every case was compared with the radiologist’s official reading and classified into one of four categories: true positive (DLHD correctly identified ICH), false positive (DLHD identified ICH in its absence), false negative (DLHD failed to identify present ICH), and true negative (DLHD correctly identified absent ICH). For the simulation, relevant information such as the patient’s present illness, vital signs, and past medical history was extracted from electronic medical records, and an unbiased ED physician independently assessed the radiological and clinical evidence presented by each brain CT scan selected for the study. A total of 111 cases were then randomly selected to provide adequate statistical power while maintaining a roughly even distribution across the four categories. Each case was paired with its associated clinical data and three questionnaires, enabling assessment of interpretation performance and the effect on decision-making; this approach allowed a balanced evaluation of clinical decisions across scenarios. These anonymized datasets were obtained automatically using the clinical research analysis portal developed by the hospital’s digital healthcare department.
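To make the selection procedure concrete, the sketch below shows one way the comparison against the official radiology reading and the category-balanced draw of 111 cases could be expressed in Python. The table, its column names, the synthetic labels, and the prevalence values are illustrative assumptions, not the study's actual data pipeline; only the per-category quotas reproduce the published split.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical case table: one row per CT scan, with the DLHD output and the
# radiologist's official reading coded as booleans (column names and the
# synthetic labels are stand-ins for the study data).
cases = pd.DataFrame({
    "case_id": np.arange(2146),
    "dlhd_positive": rng.random(2146) < 0.26,         # DLHD flagged ICH
    "radiologist_positive": rng.random(2146) < 0.21,  # official reading reported ICH
})

def confusion_category(row):
    """Compare the DLHD output with the official radiology reading."""
    if row["dlhd_positive"]:
        return "TP" if row["radiologist_positive"] else "FP"
    return "FN" if row["radiologist_positive"] else "TN"

cases["category"] = cases.apply(confusion_category, axis=1)

# Draw cases per category so the four groups are roughly balanced; the quotas
# below match the published sample (32 TP, 27 TN, 24 FP, 28 FN).
quotas = {"TP": 32, "TN": 27, "FP": 24, "FN": 28}
selected = pd.concat(
    cases[cases["category"] == cat].sample(n=n, random_state=42)
    for cat, n in quotas.items()
)
assert len(selected) == 111
```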

Deep learning-based assistive intracranial hemorrhage detection algorithm

Brain non-enhanced CT images were analyzed using deep learning software (JBS-04K; JLK Inc., Korea) approved by the Korea Food and Drug Administration for clinical use. The algorithm was developed using 6963 brain CT scans with intracranial hemorrhage and 6963 without, obtained from the Artificial Intelligence Hub directed by the Korean National Information Society Agency. All hemorrhage lesions on the CT images were manually segmented by neuroradiologists. To account for intracranial hemorrhage subtypes, five deep learning models were developed using a 2D U-Net with an Inception module: a lesion segmentation model, a lesion-subtype pre-trained segmentation model, a subdural hemorrhage model, a subarachnoid hemorrhage model, and a small (< 5 mL) lesion segmentation model19. The Dice loss function, the Adam optimizer, and a learning rate of 1e-4 were used for model training. The five base models were then ensembled using weighting values derived from a deep learning-based weighted model whose input consisted of the 5-channel segmentation results (range 0–1) from the five base models. Starting from randomly initialized weights, the weighted model was trained to minimize the Dice loss between the predicted and ground-truth segmentations. For each slice, the five hemorrhage detection models and the weighted ensemble model yielded five segmentation outputs and five weight values, respectively; the segmentation outputs and weight values were multiplied, and the pixel probability with the highest value was selected as the maximal probability at the slice level20.
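One possible reading of the described fusion step is sketched below with NumPy: for each slice, the five base-model probability maps are multiplied by the corresponding per-slice weights, and the highest resulting pixel probability is taken as the slice-level score. The function name, array shapes, and toy inputs are assumptions for illustration; the vendor's actual implementation may differ.

```python
import numpy as np

def slice_level_probability(seg_probs: np.ndarray, weights: np.ndarray) -> float:
    """Fuse five base-model outputs for a single CT slice.

    seg_probs : (5, H, W) per-pixel hemorrhage probabilities in [0, 1]
                from the five base segmentation models.
    weights   : (5,) per-slice weighting values produced by the
                deep learning-based weighted model.
    Returns the maximal weighted pixel probability, used here as the
    slice-level hemorrhage score (an assumed reading of the description).
    """
    weighted = seg_probs * weights[:, None, None]  # broadcast weights over pixels
    return float(weighted.max())

# Toy example with random maps standing in for real model outputs.
rng = np.random.default_rng(0)
seg_probs = rng.random((5, 512, 512))
weights = rng.random(5)
print(slice_level_probability(seg_probs, weights))
```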

The protocol of prospective simulation for performance assessment

The prospective simulation sessions were meticulously designed to align with the patient management process in the ED of the study site. The performance assessment was carried out in individual rooms under the supervision of a researcher. Participants were presented with brain CT findings accompanied by the demographic and clinical characteristics of each patient, including age, sex, chief complaint, and vital signs, displayed on a monitor alongside the CT scan. The simulation consisted of two sequential steps, recorded using a web-based form (Google Forms; Google, Mountain View, CA). In the first step, participants scrutinized the provided CT scan for abnormalities and made clinical decisions regarding the need for further diagnostic studies and the appropriate disposition of the patient, relying solely on the provided clinical information without the deep learning algorithm. Participants were then given a washout period of approximately one day before proceeding to the second step21. In the second step, they examined the same cases, but the CT scans were presented together with the deep learning algorithm’s output and the clinical information. In each round, the order of the cases was randomized to minimize potential bias; this reduced the likelihood that the last cases of the unaided arm would be presented in close proximity to the first cases of the aided arm, mitigating the risk of carryover effects given the short washout period. Participants were not allowed to alter their responses from the first step, and all responses were recorded in real time. The brain CT images were provided as scrollable video footage; only the axial view was available, and the brightness and size of the images could not be adjusted. No time constraints were imposed for completing the simulation.
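As a minimal sketch of the per-round randomization of case order, the snippet below shuffles the case list separately for each participant and round. The seeding scheme and identifiers are assumptions made for illustration; the study protocol does not specify how the randomization was implemented.

```python
import random

def randomize_case_order(case_ids, participant_id: str, round_no: int):
    """Return a shuffled copy of the case list for one participant and round.

    Seeding on the participant and round (an implementation assumption) makes
    the order reproducible while still differing between the unaided (round 1)
    and aided (round 2) sessions, so the last cases of one arm are unlikely to
    reappear immediately at the start of the other.
    """
    order = list(case_ids)
    random.Random(f"{participant_id}-{round_no}").shuffle(order)
    return order

# Example: the same 111 cases are presented in different orders per round.
cases = list(range(1, 112))
unaided_order = randomize_case_order(cases, "participant-03", round_no=1)
aided_order = randomize_case_order(cases, "participant-03", round_no=2)
```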

Definition of the reference standard

Retrospective annotation of brain imaging served as the reference standard to determine the presence of intracranial hemorrhage. A panel of seven board-certified neuroradiologists, each with a minimum of 8 years of experience, meticulously reviewed the CT scans while considering relevant previous imaging findings and other pertinent clinical information extracted from medical records. Notably, they were blinded to any information regarding the use of DLHD.

Statistical analyses

Categorical variables are presented as counts and percentages, and differences between groups were examined with the chi-square test. Continuous variables are reported as median (Q1, Q3), and between-group differences were assessed using the Mann–Whitney U test. To evaluate interpretation performance, sensitivity, specificity, accuracy, and the area under the receiver operating characteristic curve (AUROC) were computed for each participant and then aggregated across all participants. The level of agreement in clinical decision-making was assessed using the kappa statistic, with kappa values categorized as minor agreement (< 0.20), fair agreement (0.21–0.40), moderate agreement (0.41–0.60), high agreement (0.61–0.80), and excellent agreement (> 0.80)22. Within-participant comparisons of AUROC estimates were conducted using the DeLong test, and between-participant comparisons were performed using the multi-reader multi-case ROC method. Sensitivity, specificity, and accuracy were compared using generalized estimating equations, and kappa statistics were compared using the bootstrap method. Statistical significance was set at p < 0.05. Analyses were performed with SAS (version 9.4, SAS Institute Inc., Cary, NC, USA) and R (version 4.2.3, http://www.R-project.org) with the “MRMCaov” and “multiagree” packages.
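For orientation, the sketch below shows how per-participant sensitivity, specificity, accuracy, AUROC, and the unaided-versus-aided kappa could be computed in Python with scikit-learn. The study itself used SAS and R; this is only an illustrative re-expression with synthetic stand-in data, and it does not reproduce the DeLong, multi-reader multi-case ROC, generalized estimating equation, or bootstrap comparisons.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix, roc_auc_score

def reader_performance(truth, calls):
    """Sensitivity, specificity, accuracy, and AUROC for one participant's ICH calls."""
    tn, fp, fn, tp = confusion_matrix(truth, calls, labels=[0, 1]).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # With binary calls the ROC curve has a single operating point,
        # so this AUROC equals (sensitivity + specificity) / 2.
        "auroc": roc_auc_score(truth, calls),
    }

# Toy data standing in for one participant's 111 unaided and aided reads
# (0 = no ICH, 1 = ICH); the reference standard is the neuroradiology panel.
rng = np.random.default_rng(1)
truth = rng.integers(0, 2, size=111)
unaided = np.where(rng.random(111) < 0.80, truth, 1 - truth)
aided = np.where(rng.random(111) < 0.85, truth, 1 - truth)

print(reader_performance(truth, unaided))
print(reader_performance(truth, aided))
# Agreement between the unaided and aided interpretations of the same cases.
print("kappa:", cohen_kappa_score(unaided, aided))
```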

Results

Table 1 presents an overview of the diagnostic performance of DLHD assessed on the 2146 CT scans conducted at the study site. DLHD showed a sensitivity of 70.81% (95% confidence interval [CI]: 66.65–74.97%), specificity of 86.72% (95% CI: 85.10–88.34%), accuracy of 83.32% (95% CI: 81.74–84.90%), and an AUROC of 0.788 (95% CI: 0.765–0.810). The positive and negative predictive values of DLHD were 59.20% and 91.61%, respectively.

Table 1 Diagnostic performance of DLHD.

Of the 111 cases selected for the prospective simulation study, there were 32 true-positive cases, 27 true-negative cases, 24 false-positive cases, and 28 false-negative cases for the diagnosis of intracranial hemorrhage (Figs. 1 and 2). The average age of included patients was 62.7 years, with 73.9% being male. Fifteen patients (13.5%) had a history of tumors, and three (2.7%) had previous brain operations.

Fig. 1

Flow chart of dataset selection. CT, computed tomography; ED, emergency department; DLHD, deep learning-based assistive intracranial hemorrhage detection algorithm; ICH, intracranial hemorrhage.

Fig. 2

The protocol of prospective simulation for performance assessment. CT, computed tomography; ED, emergency department; ICU, intensive care unit; ICH, intracranial hemorrhage; DLHD, deep learning-based assistive intracranial hemorrhage detection algorithm.

Table 2 displays the impact of DLHD on the diagnostic performance of emergency medical professionals. When cases read with and without algorithm assistance were compared, the sensitivity of the inexperienced group increased significantly from 59.33% to 72.67% (p < 0.001), while specificity decreased from 65.49% to 53.73% (p < 0.001); no statistically significant differences in accuracy or AUROC were observed with versus without the algorithm. In the experienced group, no statistically significant differences were observed in sensitivity, specificity, accuracy, or AUROC with the use of DLHD.

Table 2 Changes in the CT interpretation performance using DLHD.

Table 3 summarizes the consistency of clinical decision-making with versus without deep learning-assisted interpretation. The overall kappa values were 0.594 (95% CI: 0.533–0.655) for interpretation of the CT images and 0.586 (95% CI: 0.539–0.633) for deciding the disposition of ED patients. For the experienced group, the kappa values were 0.769 (95% CI: 0.702–0.836) for CT interpretation and 0.738 (95% CI: 0.687–0.789) for disposition, indicating high agreement. For the inexperienced group, the values were significantly lower, at 0.421 (95% CI: 0.296–0.546) for CT interpretation and 0.425 (95% CI: 0.362–0.488) for disposition, indicating moderate agreement. Changes of opinion with the deep learning assistive technique were observed in brain CT interpretation, disposition decisions, and the stated reasons for those decisions, and these changes were more frequent among inexperienced emergency medical professionals than among experienced ones (p < 0.001).

Table 3 Consistency in clinical decision-making.

Among experienced emergency medical professionals, the use of DLHD changed the disposition in an average of 19 of the 111 cases; these changes resulted from previously unnoticed hemorrhagic lesions, identified with the aid of DLHD, that required observation or admission, corresponding to an average rate of change of 17.9%. Among inexperienced emergency medical professionals, disposition changed in an average of 38 of the 111 cases, and the average rate of change attributable to hemorrhagic lesions newly identified with DLHD was 45.3%; the difference between groups was statistically significant (p < 0.001) (Fig. 3).

Fig. 3

Clinical decision change before and after utilizing DLHD. DLHD, deep learning-based assistive intracranial hemorrhage detection algorithm.

Discussion

This study evaluated the impact of DLHD on the decision-making process of emergency medical professionals in a clinical environment. Our findings revealed that sensitivity for detecting intracranial hemorrhage increased with DLHD among inexperienced emergency medical professionals. Moreover, the algorithm had minimal influence on experienced participants’ ability to detect hemorrhages and make clinically informed decisions, whereas inexperienced participants were significantly influenced by the algorithm’s output.

DLHD can enhance sensitivity in detecting abnormalities on brain CT scans. When interpreting these scans, experienced clinicians perform targeted interpretation of the imaging study and may identify incidentally discovered abnormal findings alongside cerebral hemorrhages. However, our observations revealed that inexperienced emergency professionals were significantly more influenced by the algorithm in both hemorrhage detection and decision-making, relying more on the annotation aids than their experienced counterparts. Notably, the algorithm occasionally misclassifies not only hemorrhages but also other lesion types, such as tumors and benign abnormalities, as hemorrhages23,24. Differentiating ICH from conditions such as parenchymal calcifications, dural patches, and tumors is challenging because these anomalies produce similar hyperdensities, and deep learning-induced misclassifications can also arise from typical hyperdensities caused by calcification of various brain structures. Clinicians must therefore examine these structures meticulously to distinguish the actual presence of blood from misclassification25,26. Inexperienced emergency professionals have more difficulty distinguishing these lesion types, making them more susceptible to the false positives generated by the algorithm. In this study, across the 24 false-positive cases produced by the algorithm, the average rate of interpretation change was 12.5% in the experienced group and 38.3% in the inexperienced group. Overall, inexperienced professionals showed greater dependence on the algorithm, leading to more changes in clinical decisions. This highlights the need for cautious implementation and comprehensive evaluation of deep learning solutions in the clinical workflow of EDs; deep learning-based assistive technology should therefore be regarded as a screening tool rather than a definitive diagnostic tool.

Previous studies have primarily focused on the diagnostic performance of automatic detection algorithms25,27,28,29. However, conducting practical validation before implementing this technology in clinical practice is crucial to ensure its effectiveness and practicality for end-users30. To the best of our knowledge, this is the first study to investigate the effects of deep learning-based CT annotation solutions on clinical decision-making in emergency medical professionals. To simulate real-world clinical scenarios, we provided participants with comprehensive patient information, including medical histories, chief complaints, vital signs, and brain CT findings, all of which play important roles in clinical practice alongside imaging results. This was vital as emergency medical professionals base their decisions on multiple factors, considering both brain CT results and overall clinical evaluation rather than relying solely on imaging.

The ED prioritizes rapid screening for critical illnesses requiring timely management. Non-enhanced brain CT is particularly effective and efficient for evaluating ED patients with central nervous system symptoms, including headache, deterioration of mental status, and other neurological deficits31,32, and the ability to screen for critical findings on non-enhanced brain CT is crucial in the ED33. Heightened sensitivity is therefore the priority for diagnostic assistance, and a solution that facilitates triage of abnormalities rather than providing definitive diagnoses would be practical. In under-resourced hospitals where a single physician covers the entire department, during frequently understaffed night shifts, or when interpretation falls to residents with limited experience34,35,36,37, deep learning-based CT annotation solutions can enhance the sensitivity of intracranial hemorrhage detection on brain CT and thereby provide valuable support for patient safety.

When interpreting these results, the limitations of this study should be acknowledged. First, because the study was simulation-based, the findings may not precisely reflect real-world cases; the simulation did not include detailed physical and neurological examination, laboratory results, or focused history-taking, all of which are important components of the decision-making process. Second, the deep learning-based CT annotation solution used in this study had a limited target range, which may have restricted the identification of other abnormalities on brain CT; future research should explore algorithms with a broader target range. Third, the changes in clinical decisions reported here do not necessarily translate into improved clinical outcomes. Finally, selection bias is possible because all participants were emergency medical professionals working in the same ED, so the clinical decisions made in the simulation may not generalize to all emergency medical professionals. Furthermore, although the 1-day washout period was intended to prevent participants from acquiring answers or additional information between the two sessions, the short interval between readings could allow participants to remember their initial responses, which could artificially enhance the performance of the AI-assisted second reads. Further studies with varying washout periods are required to validate our findings and address residual concerns about recall bias.

In conclusion, the utilization of DLHD had variable effects on the diagnostic performance and decision-making of emergency medical professionals. The study underscores the importance of comprehensive evaluation and careful integration of deep learning solutions in the clinical workflow, particularly for inexperienced professionals. Further research is warranted to assess the algorithm’s impact on patient outcomes, cost-effectiveness, and its generalizability across diverse clinical settings.