Introduction

The survival rate of very low birth weight infants (VLBWIs), those weighing less than 1500 g at birth, has improved with advancements in neonatal care1. However, intraventricular hemorrhage (IVH) remains strongly associated with adverse neurodevelopmental outcomes2. Despite advanced interventions, such as antenatal corticosteroids, magnesium sulfate for neuroprotection, and improved neonatal resuscitation in the delivery room, IVH remains a significant challenge in the neonatal intensive care unit (NICU). Brain ultrasonography is the most readily available diagnostic tool for cerebral lesions in the NICU. To date, no definitive treatment exists to mitigate the severity of IVH or substantially improve long-term outcomes. IVH is most common among preterm infants3, and its incidence in VLBWIs varies from 20 to 40%4,5,6,7,8,9. Most IVH occurs within the first 72 h after birth10, with over 50% occurring within the first 24 h11. Furthermore, approximately 35% of VLBWIs with IVH develop post-hemorrhagic hydrocephalus, and 15% require surgical intervention9,12. Therefore, identifying risk factors is crucial for preventing and predicting the occurrence of IVH.

Most previous studies have used logistic regression models in population-based cohorts6,13. Heuchan et al. developed a risk-adjustment model for predicting variations in IVH occurrence across institutions in Australia and New Zealand6. Singh et al. developed a predictive model for assessing the risk of severe IVH in VLBWIs using seven clinical variables at birth13. Recently, machine learning models have shown promising results in predicting neonatal mortality14,15,16. Artificial intelligence (AI)-based models have also been effective in identifying conditions such as patent ductus arteriosus and spontaneous intestinal perforation at admission, which are major complications in VLBWIs17,18.

Early prediction of IVH is crucial for timely intervention and improved outcomes. However, resource limitations often hinder early and accurate IVH prediction immediately after birth, and traditional prediction methods often require advanced equipment and specialized expertise that are not universally available. There is therefore a pressing need for a robust AI-based model, suited to resource-limited settings, that leverages readily available clinical data. Such a model would broaden access to critical early warning systems by enabling hospitals with limited resources to implement timely and effective interventions. By focusing on scalability and ease of integration, this study aims to develop an AI model that is practical and effective for deployment in a wide range of healthcare settings, including those with significant constraints on equipment, training, and funding.

This study aims to develop and validate a novel deep neural network-based model with an attention mechanism (DNN-A) for the early prediction of IVH in VLBWIs. The model architecture includes an input layer with eight neurons representing infant characteristics, two hidden layers (eight and six neurons), and an output layer for binary classification. An attention mechanism, situated between the hidden layers, highlights critical input features. The network employs ReLU activation functions in the hidden layers, a sigmoid activation function in the output layer, and the Adam optimizer for parameter optimization. This architecture ensures efficient processing of clinical data, providing timely and accurate IVH risk assessments, even in resource-limited hospital settings. Our model achieved 90% accuracy during training and 87% during testing, with an area under the curve (AUC) of 0.87, indicating strong predictive performance for IVH. Moreover, during validation, the proposed model outperformed several other machine-learning-based models, including the support vector machine (SVM), decision tree, logistic regression, XGBoost, and naïve Bayes models, across multiple well-known evaluation metrics.

Materials and methods

Study population

The clinical records of 877 VLBWIs treated at Keimyung University Dongsan Hospital, Daegu, South Korea, were retrospectively reviewed. All data were anonymized. IVH cases were identified using brain ultrasound performed within the first 7 days after birth. Given that unstable hemodynamic changes at birth are major risk factors for IVH, infants with suspected neonatal asphyxia (APGAR score at 5 min < 5), early-onset sepsis, hypotension requiring medical treatment, or severe respiratory distress with persistent pulmonary hypertension of the newborn were excluded. The dataset was divided into two groups: the affected group (infants with IVH) and the control group (non-IVH). To ensure comparability, the two groups were matched on sex, gestational age at delivery, maternal age, and delivery mode. Because the number of controls was much higher than the number of infants with IVH, all infants with IVH were included in the study. Since IVH predilection is higher in male than in female infants19, sex matching was prioritized between IVH and non-IVH infants. In addition to delivery mode and maternal age, close gestational age was also considered to retain as many samples as possible. After applying these criteria, 189 infants with IVH and 198 controls were enrolled in this study.

Ethics statement

Data collection was approved by the Institutional Review Board (IRB) of Keimyung University Dongsan Hospital, and the requirement for informed consent was waived in this study (IRB Keimyung University Dongsan Hospital No. 2021-060-073). All experiments were performed in accordance with relevant guidelines and regulations of the Keimyung University Dongsan Hospital.

Features pre-processing

The model leverages eight readily available clinical input factors: maternal age (ranging from 18 to 42 years), delivery mode, endotracheal intubation in the delivery room, birth weight (ranging from 490 g to 2,260 g), gestational age at delivery (ranging from 20 to 31 weeks), APGAR scores at 1 and 5 min (ranging from 1 to 9), and sex. The sex variable, originally recorded as “F” for females and “M” for males, was converted into a dummy numeric variable. To prevent very large or small values from negatively affecting model performance, all variables in the dataset were normalized to a range between 0 and 1 using min-max normalization20. These eight variables characterize the fixed/global information of the preterm infants. Normalization was calculated using the following equation:

$$X_{new}=\frac{X - X_{min}}{X_{max} - X_{min}}$$
(1)

where X is the raw input value, Xmin and Xmax are the minimum and maximum values of that variable, and Xnew is the normalized value.
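As a concrete illustration, the following minimal sketch applies the dummy coding and the Eq. (1) min-max normalization described above. The pandas-based layout and the column names are assumptions, since the actual dataset schema is not published.

```python
import pandas as pd

# Hypothetical column names for the eight input factors; the real schema is not published.
FEATURES = ["maternal_age", "delivery_mode", "intubation", "birth_weight",
            "gestational_age", "apgar_1min", "apgar_5min", "sex"]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Dummy-code categorical fields and min-max normalize every feature to [0, 1] (Eq. 1)."""
    df = df.copy()
    df["sex"] = df["sex"].map({"F": 0, "M": 1})               # dummy numeric coding
    # Categorical fields such as delivery mode would be coded analogously before scaling.
    for col in FEATURES:
        x_min, x_max = df[col].min(), df[col].max()
        df[col] = (df[col] - x_min) / (x_max - x_min)          # Eq. (1)
    return df
```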

Fig. 1 Average of each feature for the control (green bars) versus affected (red bars) groups. All statistical comparisons were performed using an unpaired Student's t-test to detect significant differences between group means. (*P < 0.05); (**P < 0.01); (***P < 0.001); (****P < 0.0001).

Statistical analysis

All statistical analyses were performed using GraphPad Prism 8 software (GraphPad Software, Inc., La Jolla, CA, USA; https://www.graphpad.com) to detect significant differences in critical parameters between the two groups. For each parameter, the mean values of the affected and control groups were compared using an unpaired Student's t-test, which is appropriate because the affected and control groups are independent of each other, making a two-sample comparison suitable. Data are presented as mean ± standard deviation (SD) to reflect the central tendency and variability within each group. Comparisons were performed directly on the raw numerical values of the parameters, ensuring accuracy and transparency in the statistical evaluation. All available data samples for affected infants were analyzed; notably, more samples were available for controls than for affected infants. The clinical records from both groups were used to develop and assess the machine learning models without distinguishing between specific individuals, and the model did not consider the age at which the clinical records were completed. The unpaired Student's t-test was used to determine p-values, and categorical factors (such as sex and delivery mode) were compared based on their relative values in percentages. Statistical significance was set at p < 0.05. No significant difference in the relative number of male infants was observed between the groups. Table 1 summarizes the statistical comparisons of the parameters between the two groups, including the sample size for each group and the p-value from the unpaired Student's t-test. The control and affected groups exhibited significant differences in endotracheal intubation, birth weight, gestational age at delivery, and APGAR scores at both 1 and 5 min, suggesting that these factors are associated with the occurrence of IVH. However, maternal age, delivery mode, and sex did not differ significantly between the two groups (Fig. 1).

Table 1 Study variables in the control and affected groups, with p-values determined using the unpaired Student's t-test.
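The group comparisons were run in GraphPad Prism; for readers working in Python, the sketch below reproduces the same unpaired Student's t-test and mean ± SD summary with SciPy, using synthetic birth-weight values rather than the study data.

```python
import numpy as np
from scipy import stats

def compare_groups(control: np.ndarray, affected: np.ndarray):
    """Unpaired (two-sample) Student's t-test on raw parameter values, as in Table 1."""
    t_stat, p_value = stats.ttest_ind(control, affected, equal_var=True)
    summarize = lambda x: f"{x.mean():.1f} ± {x.std(ddof=1):.1f}"   # mean ± SD
    return t_stat, p_value, summarize(control), summarize(affected)

# Synthetic birth weights (grams) for illustration only; not the study data.
rng = np.random.default_rng(0)
t, p, ctrl, aff = compare_groups(rng.normal(1150, 250, 198), rng.normal(1020, 250, 189))
print(f"control {ctrl}, affected {aff}, p = {p:.4f}")
```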

Proposed deep neural network with attention

The proposed approach uses a deep neural network architecture with an attention mechanism to classify VLBWIs into two groups according to the presence or absence of IVH. The model consists of four layers: an input layer with eight neurons representing the infant characteristics (input features), two hidden layers with eight and six neurons that identify and extract pertinent information from the input data, and an output layer with one neuron that predicts IVH (affected) or non-IVH (control) cases (see Fig. 2). The attention mechanism is located between the two hidden layers to highlight important features of the data. Each hidden-layer neuron uses the rectified linear unit (ReLU) activation function, and the single output neuron uses the sigmoid activation function for binary classification. The Adam optimizer, with a learning rate of 0.0001, was used to minimize the training loss21. Although a lower learning rate may cause delayed learning or convergence to inferior solutions, it results in steadier convergence. The model uses a cross-entropy loss function, a statistical tool frequently used in classification, which quantifies the discrepancy between the actual and predicted labels22. The output neuron reflects the predicted probability of each class (affected or control). To minimize the cross-entropy loss during training, the model adjusts its parameters (weights and biases) using the Adam optimizer. Cross-entropy measures the difference between the actual and predicted outputs of the model and is expressed as follows:

$$loss= - \sum\limits_{i=1}^{N} y_i \log \bar{y}_i,$$
(2)

where yi is the ground-truth label; \(\bar{y}_i\) is the probability predicted by the network; i indexes the data points; and N is the total number of samples.
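The paper does not name the deep learning framework beyond a GPU-capable Python learning package; the sketch below expresses the stated layer sizes, sigmoid output, binary cross-entropy loss (Eq. 2), and Adam optimizer (learning rate 0.0001) in PyTorch as an assumption, with the attention block between the hidden layers deferred to the sketch in the next subsection.

```python
import torch
import torch.nn as nn

# Layer sizes follow the text: 8 inputs -> hidden 8 -> hidden 6 -> 1 output neuron.
# The attention block placed between the two hidden layers is sketched in the next subsection.
model = nn.Sequential(
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 6), nn.ReLU(),
    nn.Linear(6, 1), nn.Sigmoid(),        # single neuron, sigmoid for binary output
)
criterion = nn.BCELoss()                   # binary cross-entropy, Eq. (2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # Adam, learning rate 0.0001

x = torch.rand(10, 8)                      # a mini-batch of 10 normalized records
y = torch.randint(0, 2, (10, 1)).float()   # 1 = IVH (affected), 0 = control
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```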

Fig. 2 Architecture of the proposed neural network model with an attention mechanism.

Proposed attention mechanism

In machine and deep learning models, the attention mechanism enables a model to focus on specific segments of the input data throughout the prediction process23. The basic idea behind attention is to assign distinct weights to different segments of the input data, enabling the model to “pay attention” to pertinent information while generating predictions. The additive (general) attention method uses a learned compatibility function with trainable parameters to generate attention scores. Self-attention, often implemented with multiple heads (multi-head attention), enables the model to apply multiple sets of attention weights simultaneously to different sections of the input sequence24.

The proposed deep neural network integrates a general attention mechanism (Fig. 2) that allows the model to focus on the features or patterns in the input data that are most relevant for the classification task. The network incorporates an attention layer that generates a weighted representation of the input features by applying the attention mechanism to the output of the preceding hidden layer. The attention mechanism introduces trainable parameters to compute attention scores, which indicate the importance or relevance of the different features when making predictions. The attention scores are applied to the features to produce a context vector that emphasizes the informative components of the input for the classification task24. The general attention mechanism comprises three main components: queries (Q), keys (K), and values (V), and is expressed as follows:

$$Attention(Q,K,V)=softmax\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
(3)

where Q, K, and V denote the query, key, and value matrices, respectively, and dk denotes the dimension of the key vectors. The dot products of the query and key vectors are computed by the matrix multiplication QKT, where T denotes the transpose of the matrix K. Scaling the result by \(\sqrt{d_k}\) helps mitigate vanishing-gradient issues. The scores are normalized with the softmax function to obtain the attention weights, and the resulting attention output is the weighted sum of the values V computed using these weights.
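The scaled dot-product operation of Eq. (3) is compact enough to state directly in code. The sketch below (PyTorch, continuing the assumption of the previous sketch) implements only the equation itself; how the authors derive Q, K, and V from the hidden-layer activations is not fully specified in the text, so the toy usage is illustrative.

```python
import torch
import torch.nn.functional as F

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Eq. (3): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # scaled compatibility scores
    weights = F.softmax(scores, dim=-1)             # attention weights
    return weights @ V                              # weighted sum of values (context)

# Toy usage: five positions with key/value dimension 8.
Q, K, V = torch.rand(5, 8), torch.rand(5, 8), torch.rand(5, 8)
context = attention(Q, K, V)                        # shape (5, 8)
```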

Model validation

The classification performance of the proposed model was assessed using various well-established metrics, including recall, accuracy, precision, F1-measure, specificity, Matthews correlation coefficient, and markedness, calculated using Eqs. (4) through (10), respectively25. The classification accuracy was calculated as follows:

$$Accuracy(Acc)=\frac{{TP+TN}}{{TP+FP+FN+TN}},$$
(4)

Precision represents the fraction of correctly predicted positive cases among all predicted positives and can be calculated as follows:

$$precision=\frac{{TP}}{{(TP+FP)}},$$
(5)

Recall is the ratio of correctly predicted positive cases among all actual positive cases and can be calculated as follows:

$$recall=\frac{{TP}}{{(TP+FN)}},$$
(6)

With an imbalanced dataset, accuracy alone can be misleading because it is dominated by the majority class. Accordingly, the F1 measure was also used to assess the detection performance of the proposed model and was calculated as follows:

$$F1=\frac{{2 \times precision \times recall}}{{precision+recall}},$$
(7)

The ability of the model to correctly detect real negatives among all actual negatives was measured using specificity, which was calculated using the following equation:

$$Specificity=\frac{{TN}}{{TN+FP}},$$
(8)

Another metric for evaluating the validity of binary classifications is the Matthews correlation coefficient (MCC), a balanced measure that can be applied even when the classes have different sizes because it accounts for true and false positives as well as true and false negatives. MCC measures the correlation between the observed and predicted binary classifications using the following equation:

$$MCC=\frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},$$
(9)

In binary classification problems, markedness is a statistical metric that indicates how well a model predicts both positive and negative occurrences while accounting for false-positive and false-negative rates. It is calculated as follows:

$$Markedness = Precision + Negative\ Predictive\ Value - 1$$
(10)

where negative predictive value measures the proportion of true negative predictions among all negative predictions.
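For reference, the sketch below computes Eqs. (4) through (10) directly from confusion-matrix counts; the counts used in the example are illustrative, not the study results.

```python
import math

def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Eqs. (4)-(10) computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                          # sensitivity
    npv = tn / (tn + fn)                             # negative predictive value
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
        "mcc": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "markedness": precision + npv - 1,
    }

print(metrics(tp=50, fp=8, fn=7, tn=51))             # illustrative counts only
```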

Probability density function (PDF) evaluation

By examining the probability density function (PDF) for each variable, variations or similarities in the distributions between the affected and control groups can be determined. Variations in PDF distribution, form, and position could indicate variations in developmental characteristics, risk factors, or health outcomes between the two groups.
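One way to obtain such PDFs is kernel density estimation; the sketch below uses SciPy's gaussian_kde on synthetic gestational-age values. The choice of estimator is an assumption, since the paper does not state how the densities in Fig. 6 were computed.

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

def plot_pdf(control: np.ndarray, affected: np.ndarray, label: str) -> None:
    """Kernel-density estimates of one variable for the two groups (cf. Fig. 6)."""
    grid = np.linspace(min(control.min(), affected.min()),
                       max(control.max(), affected.max()), 200)
    plt.plot(grid, gaussian_kde(control)(grid), "b-", label="control")
    plt.plot(grid, gaussian_kde(affected)(grid), "r-", label="affected")
    plt.xlabel(label); plt.ylabel("density"); plt.legend()

# Synthetic gestational ages (weeks) for illustration only; not the study data.
rng = np.random.default_rng(1)
plot_pdf(rng.normal(29, 2, 198), rng.normal(27, 2, 189), "gestational age (weeks)")
plt.show()
```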

Results

Proposed model training

The primary task in this step was to train the proposed network model on the preterm infant data. A graphics processing unit (GPU) was used to accelerate training, and the model was implemented using a GPU-capable Python deep learning package. For every training epoch, the training dataset was divided into smaller groups known as mini-batches. Each experiment was repeated three times, and the mean values were reported. All optimal parameters were determined experimentally. The system configuration for training the proposed network model was as follows: CPU: AMD Ryzen 9 7950X 16-core processor, 4.50 GHz; GPU: RTX 4090; programming language: Python; memory: 64 GB; operating system: Windows 10 (64-bit).

Manual grid search for optimal hyperparameters

Hyperparameters are external settings that substantially influence the behavior and performance of a neural network but are not learned during training. Searching over combinations of hyperparameters is crucial to achieving the best results for a specific task. In this study, we used a grid search to tune key hyperparameters, specifically the batch size and the number of epochs26. This process involves systematically evaluating a predefined set of hyperparameter combinations to identify the configuration that yields the best performance. Specifically, we experimented with different batch sizes and training epoch counts to observe their impact on the model's accuracy and generalization. Consequently, a batch size of 10 with 100 epochs yielded the highest accuracy and was used to train and test the proposed model (Table 2).
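A minimal, self-contained version of this manual grid search is sketched below in PyTorch (an assumed framework), training a simplified DNN without the attention block on synthetic stand-in data; the candidate grid is illustrative, with the full grid and accuracies reported in Table 2.

```python
from itertools import product
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def build_model() -> nn.Module:
    # Simplified DNN (attention omitted for brevity): 8 -> 8 -> 6 -> 1.
    return nn.Sequential(nn.Linear(8, 8), nn.ReLU(),
                         nn.Linear(8, 6), nn.ReLU(),
                         nn.Linear(6, 1), nn.Sigmoid())

def accuracy_for(batch_size, epochs, X_tr, y_tr, X_te, y_te) -> float:
    model, criterion = build_model(), nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loader = DataLoader(TensorDataset(X_tr, y_tr), batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for xb, yb in loader:
            optimizer.zero_grad()
            criterion(model(xb), yb).backward()
            optimizer.step()
    with torch.no_grad():
        return ((model(X_te) > 0.5).float() == y_te).float().mean().item()

# Synthetic stand-in data (387 records, 8 normalized features); not the study data.
X, y = torch.rand(387, 8), torch.randint(0, 2, (387, 1)).float()
X_tr, y_tr, X_te, y_te = X[:270], y[:270], X[270:], y[270:]

best = max(((accuracy_for(b, e, X_tr, y_tr, X_te, y_te), b, e)
            for b, e in product([10, 20, 32], [50, 100])), key=lambda t: t[0])
print(best)   # the study reports batch size 10 with 100 epochs as optimal
```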

Table 2 Manual grid search for optimal hyperparameters and corresponding accuracy values.
Fig. 3 The impact of different test data percentages on network training performance: (a) training accuracy curves of the proposed network for different percentages of test data and (b) the corresponding loss curves.

To determine the optimal split between training and test data, we trained the network using different percentages of test data. Experimenting with various training/test ratios is essential for assessing the model's ability to generalize to new data. We used random splitting of the data into training and testing subsets; by setting the test size to 10%, 20%, 30%, or 40%, we varied the proportion of data allocated for testing, with the remaining data used for training. This provides insights into the robustness and reliability of the model by enabling us to evaluate its performance under various data availability scenarios. Owing to various limitations, large-scale data collection may not be possible in real-world situations, especially for premature infants with and without IVH. Testing various data percentages helps evaluate the model's performance with limited data and replicates scenarios of data scarcity. Models trained on smaller datasets often learn more resilient and generalizable patterns, potentially enhancing performance when applied to new datasets. This approach provides valuable insights into how well the model generalizes across various dataset sizes.
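The random splits can be produced, for example, with scikit-learn's train_test_split, as in the sketch below; the stratification and fixed random seed are assumptions added for reproducibility, and the arrays are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholders: 387 records with 8 normalized features and binary IVH labels.
X, y = np.random.rand(387, 8), np.random.randint(0, 2, 387)

for test_size in (0.10, 0.20, 0.30, 0.40):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=42)
    print(f"test {int(test_size * 100)}%: {len(X_tr)} train / {len(X_te)} test samples")
```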

The training accuracy of the proposed DNN-A and the corresponding loss curves for different percentages of test data are illustrated in Fig. 3a,b, respectively. Although training accuracy remained consistent for various percentages of the test data, the AUC values varied with each training percentage, and training loss increased for configurations with 40% test data. To ensure a fair comparison between the different machine-learning-based models and the proposed model, we performed a similar experiment with different training and test percentages for all other models and compared the results using different evaluation metrics (Table 3).

Table 3 Comparisons of different state-of-the-art machine learning models’ performance vs. the proposed model’s performance based on different test data percentages.

Model evaluation

To further validate the performance of the proposed model, we compared it with different machine-learning approaches, including the support vector machine (SVM)27, logistic regression28, naïve Bayes29, decision tree30, and XGBoost models. Classification performance was evaluated using a confusion matrix25, a widely used method for measuring the classification accuracy of machine learning methods; the confusion matrix is explained in Table 4. A prediction is a true positive (TP) when both the actual and predicted classes are 1 (affected), and a true negative (TN) when both are 0 (control). When the predicted class is 1 and the actual class is 0, the prediction is a false positive (FP); when the predicted class is 0 and the actual class is 1, it is a false negative (FN).
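The baseline comparison can be reproduced along the lines of the sketch below, which fits the five comparison models with scikit-learn and xgboost defaults (the authors' exact hyperparameters are not reported) and prints each confusion matrix; the data here are synthetic placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Synthetic placeholders; in practice use the preprocessed 387-infant dataset.
X, y = np.random.rand(387, 8), np.random.randint(0, 2, 387)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

baselines = {
    "SVM": SVC(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision tree": DecisionTreeClassifier(),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}
for name, clf in baselines.items():
    tn, fp, fn, tp = confusion_matrix(y_te, clf.fit(X_tr, y_tr).predict(X_te)).ravel()
    print(f"{name}: TP={tp} TN={tn} FP={fp} FN={fn}")
```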

Table 4 Confusion matrix evaluation measures for the classification results.
Fig. 4 Confusion matrices of the classification results: (a) the proposed method, (b) decision tree, (c) XGBoost, (d) support vector machine, (e) logistic regression, and (f) naïve Bayes.

The confusion matrix of the classification results obtained with the proposed DNN-A is shown in Fig. 4a. The corresponding confusion matrices for the other machine learning-based models, including the decision tree, XGBoost, support vector machine, logistic regression, and naïve Bayes models, each tested with 30% of the dataset as test data, are illustrated in Fig. 4b-f, respectively. The proposed model exhibited a higher accuracy rate than the other models.

Fig. 5 Comparison of the area under the receiver operating characteristic (ROC) curve for various machine learning models against the proposed model across different test data percentages. Subfigures depict ROC curves for (a) 10%, (b) 20%, (c) 30%, and (d) 40% test data.

Table 3 provides an in-depth analysis across the different test data percentages for the various machine-learning models, highlighting the performance metrics of each compared with the proposed DNN-A model. Performance indicators including accuracy, precision, recall, F1 score, specificity, false-positive rate, false-negative rate, MCC, and markedness were evaluated to assess model effectiveness. The results underscore the superior performance of the proposed DNN-A model. Notably, the proposed model outperformed the SVM model with an accuracy of 87.1% using 30% test data, demonstrating that the proposed model remains resilient in handling the classification problem even with limited data. Furthermore, as the percentage of test data increased to 40%, the proposed model maintained its superiority with an accuracy of 81.9%, demonstrating stability and dependability across a range of data sizes. In contrast, the SVM model exhibited slightly lower accuracy with smaller test sets, achieving 80.0% for both 10% and 20% test data. The logistic regression and naïve Bayes models exhibited similar accuracy within this range, indicating sensitivity to the amount of available data. Among the baseline models, XGBoost achieved the highest accuracy with 30% test data, comparable with that of logistic regression, indicating that XGBoost effectively identified intricate patterns within the dataset. The decision tree achieved the highest accuracy among the baselines with 40% test data, suggesting strength in handling larger test sets, possibly because of its capacity to identify complex relationships in the data. The proposed model and the other machine learning models were also compared using receiver operating characteristic (ROC) curves (Fig. 5a-d); the proposed model demonstrated consistently better performance across the various test data percentages. Its higher AUC values reflect a superior balance between sensitivity and specificity, enhancing its ability to distinguish affected from control cases.
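ROC curves and AUC values of the kind shown in Fig. 5 can be generated with scikit-learn, as in the hedged sketch below; the labels and scores are synthetic, and in practice each model's predicted IVH probability on the test set would be used (e.g. predict_proba(X_te)[:, 1] for the scikit-learn baselines).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Synthetic labels and scores for two hypothetical models; not the study results.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 120)
scores = {"model A": np.clip(0.4 * y_true + 0.6 * rng.random(120), 0, 1),
          "model B": rng.random(120)}

for name, s in scores.items():
    fpr, tpr, _ = roc_curve(y_true, s)                 # ROC points
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], "k--")                        # chance line
plt.xlabel("False positive rate"); plt.ylabel("True positive rate")
plt.legend(); plt.show()
```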

Fig. 6 Probability density functions of (a) the APGAR score at 1 min, (b) the APGAR score at 5 min, (c) maternal age, and (d) gestational age for the controls (blue curves) and affected infants (red curves).

Variations in the form and position of the PDFs could indicate differences in developmental characteristics, risk factors, or health outcomes between the two groups. These findings are relevant for understanding the causes of preterm birth and formulating clinical care plans. The distributions of the APGAR scores at 1 and 5 min for the affected and control groups are shown in Fig. 6a,b, with higher APGAR scores indicating better neonatal health and vitality. The PDFs of maternal age and gestational age are illustrated in Fig. 6c,d, indicating that older maternal age and lower gestational age at delivery are associated with an increased risk of IVH.

Fig. 7 Confusion matrices of the classification results for (a) Feature set 1, (b) Feature set 2, (c) Feature set 3, and (d) Feature set 4.

Model evaluation with extra limited resources

To enhance efficiency and streamline the analytical process, we conducted additional experiments under more limited resources, focusing on four primary measurements recorded immediately after birth. These measurements, such as gestational age at delivery and APGAR scores, are recognized as critical indicators of newborn health and are strongly associated with adverse outcomes such as IVH. By reducing the number of features in the analysis, we aimed to decrease the computational and data collection demands while achieving faster prediction. This approach optimizes resource utilization, enabling more targeted and efficient testing. For practical clinical applications, a model based on a minimal set of essential birth measurements offers significant benefits: it facilitates rapid risk assessment and early intervention planning, which are crucial for timely and focused treatment in VLBWIs. By concentrating on a streamlined feature set, this study aims to support the implementation of effective, resource-conscious, and clinically actionable strategies in neonatal care.

Accordingly, we conducted four experiments using different combinations of primary measurements (feature sets 1-4), as listed in Table 5. The results indicated that Feature set 1, which included sex, APGAR score at 1 min, gestational age at delivery, and birth weight, achieved the highest accuracy, highlighting the importance of these variables for predicting IVH. The predictive ability of the different feature sets is shown by the confusion matrices in Fig. 7 for (a) Feature set 1, (b) Feature set 2, (c) Feature set 3, and (d) Feature set 4, and by the ROC curves in Fig. 8. Both the confusion matrices and the ROC curves confirmed that models using Feature set 1 consistently exhibited higher accuracy in predicting IVH than those using the other feature sets.
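The feature-subset experiments follow a simple pattern: restrict the input columns and retrain. The sketch below illustrates that workflow with a logistic-regression stand-in for the DNN-A and hypothetical column names; only the composition of Feature set 1 is taken from the text, the second set shown is purely illustrative, and the actual sets are listed in Table 5.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

COLUMNS = ["maternal_age", "delivery_mode", "intubation", "birth_weight",
           "gestational_age", "apgar_1min", "apgar_5min", "sex"]   # hypothetical names
feature_sets = {
    "Feature set 1": ["sex", "apgar_1min", "gestational_age", "birth_weight"],  # per the text
    "Illustrative set": ["maternal_age", "delivery_mode", "apgar_5min", "intubation"],
}

# Synthetic stand-in data; in practice use the preprocessed study records.
df = pd.DataFrame(np.random.rand(387, 8), columns=COLUMNS)
y = np.random.randint(0, 2, 387)

for name, cols in feature_sets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(df[cols], y, test_size=0.30, random_state=42)
    acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: accuracy = {acc:.2f}")
```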

Table 5 Model evaluation using different feature sets for extra limited resources.
Fig. 8 ROC curves and AUC values for the different feature sets of the proposed network model.

A comparison of the proposed model with several cutting-edge machine learning models for IVH prediction in VLBWIs is shown in Fig. 9, demonstrating that the proposed model consistently outperformed other machine learning models in terms of accuracy, precision, recall, F1 score, AUC value, specificity, MCC, and markedness. From a practical standpoint, this suggests that the proposed model performs better than the current models for identifying IVH in VLBWIs, reducing false positives and false negatives while accurately classifying a higher percentage of true IVH cases. These findings highlight the effectiveness of the proposed model and its potential for clinical use for predicting IVH in VLBWIs.

Fig. 9 Comparison of the proposed model against state-of-the-art machine learning models for IVH prediction in VLBWIs.

Discussion

We developed a novel deep neural network model with an attention mechanism to predict IVH in VLBWIs in low-resource settings that can be easily applied at the time of delivery. The model is built on a multilayer perceptron (MLP), the most fundamental type of neural network31, augmented with an attention mechanism. The prediction was based on eight primary variables: gestational age at delivery, birth weight, maternal age, sex, mode of delivery, APGAR scores at 1 and 5 min, and endotracheal intubation during delivery room resuscitation. These variables were selected based on previous studies examining the risk factors of IVH in cohort studies of VLBWIs. Furthermore, they can be easily collected through a brief history and neonatal evaluation in the delivery room, facilitating the model's applicability in low-resource settings.

IVH in VLBWIs was diagnosed using brain ultrasonography and classified according to Papile's classification32. A recent cohort study in Australia33 reported a higher incidence of cerebral palsy in school-aged children born at less than 28 weeks' gestation with low-grade IVH, which is typically considered benign, and low-grade IVH has been associated with higher rates of cerebral palsy in school-age children33,34. Therefore, infants with grade I IVH, including germinal matrix hemorrhage on brain ultrasonography, were included in the IVH group in our study. IVH begins with injury to the germinal matrix, which is susceptible because of its abundance of immature, friable capillaries and the decreased capacity of the premature brain to autoregulate cerebral blood flow11. Lee et al.35 indicated that gestational age and birth weight at delivery are the strongest predictors of IVH in VLBW infants. Notably, approximately 50% of IVH cases in VLBW infants occur within the first 6-8 h of life36,37.

Intrauterine growth restriction (IUGR) is associated with fetal hypoxemia and an increased risk of adverse postnatal outcomes; however, the impact of IUGR on the incidence of IVH in preterm infants remains unclear38. For this reason, we considered gestational age at delivery and birth weight, which are strongly correlated, as separate factors in the present study. Lee et al. reported that postnatal acidosis within the first hour after birth is associated with a higher risk of severe IVH in VLBWIs35, and hypoxemia or hypoperfusion in the delivery room can lead to metabolic acidosis39. Although serum pH or base deficit is a more accurate indicator of metabolic acidosis in the delivery room, we analyzed the APGAR score at 5 min, which is commonly used in practice. Antenatal steroid administration promotes fetal lung maturation and reduces the occurrence of IVH in VLBWIs40, highlighting the role of glucocorticoids, which promote fetal vasoconstriction and attenuate fetal vasodilation. Because more than 95% of the enrolled pregnant women in the present study received antenatal steroids, an additional analysis of antenatal steroid exposure was not necessary. The ability of the proposed model to make reliable predictions without the need to monitor hemodynamic changes after birth is a major advantage. Compared with the other machine learning models, DNN-A exhibited superior accuracy of 90% for training and 87% for testing, with the highest AUC of 0.87, while relying only on limited, readily available resources. IVH prediction at birth is crucial for providing objective counseling to the infant's parents and for guiding treatment decisions during the critical postnatal period. This study included models representing diverse learning paradigms, such as SVM for margin-based classification, decision trees for rule-based learning, logistic regression for linear probabilistic modeling, XGBoost for gradient-boosted decision trees, and naïve Bayes for probabilistic inference. The feature subset analysis further underscores the importance of feature engineering in predictive modeling. Feature set 1, which consisted of key demographic and neonatal health indicators (sex, APGAR score at 1 min, gestational age at delivery, and birth weight), emerged as the most predictive combination. These variables likely hold strong associations with the underlying physiological and developmental factors that contribute to the risk of IVH, highlighting their critical role in improving model accuracy and interpretability. This result not only underscores the significance of domain-specific feature selection but also provides valuable insights for clinical decision-making and early risk assessment of IVH.

The present study had several limitations. First, brain ultrasonography was not performed at the same time point after birth in all infants. To mitigate this limitation, we classified the occurrence of IVH based on the results of brain ultrasound performed within the first 3 days after birth. Second, the present study did not include postnatal hemodynamic changes, which are major risk factors for IVH. Unlike in other countries, ibuprofen is the most frequently used medication for the treatment of patent ductus arteriosus (PDA) in South Korea, as indomethacin is not available, and the prophylactic use of indomethacin is exceedingly rare in most NICUs across the country; at our institution's NICU, prophylactic treatment for PDA was not practiced. For all VLBWIs included in this study, brain ultrasounds were performed by a single pediatric radiologist within the first 7 days after birth, and the severity of IVH was assessed according to Papile's classification32. Our study aimed to predict the occurrence of IVH at birth, and another notable feature of the current model is that it makes a reliable prediction without the need for continuous vital sign monitoring. Additionally, the small sample size may have limited the accuracy and reliability of the machine-learning models, warranting further studies with a larger training set; different percentages of training and test data were compared to evaluate the proposed model under these constraints. Furthermore, this model was developed using data from a single institution. Although the study population encompassed patients from a range of geographic areas (rural, suburban, and metropolitan), the cohort was restricted to the hospital's service area, where all patients were treated by the same group of healthcare professionals. Future research should involve a more geographically diverse sample, ideally incorporating an international cohort.

Conclusion

A novel deep neural network model with an attention mechanism was developed to predict IVH in VLBWIs, particularly in resource-limited settings. Using eight readily available measurements, and as few as four in a further restricted configuration, this model has the potential to be applied in real time at delivery, making it a valuable tool for early intervention and care decisions. The input variables, chosen on the basis of prior cohort studies, are easily obtained from maternal information and the newborn's condition in the delivery room, which makes our methodology directly relevant to resource-constrained settings.

Although our approach demonstrates good accuracy and repeatability, IVH classification may be affected by variability in the timing and interpretation of brain ultrasonography after birth. The experimental results demonstrated that the infant's sex, 1-minute APGAR score, gestational age at delivery, and birth weight are the most important variables for predicting IVH. Nevertheless, because our goal was to predict IVH at the time of delivery, this topic should be further investigated in future studies. Despite these drawbacks, our approach is a noteworthy development in IVH prediction and provides potential insights into neonatal care in resource-limited environments.