Introduction

Bronchopulmonary dysplasia (BPD) is the most common long-term morbidity for premature infants, and over a quarter of infants with moderate to severe BPD will develop BPD associated pulmonary hypertension (PH) [1,2,3]. Compared with infants with BPD who do not develop PH, those who do develop PH have increased morbidity, worse developmental outcomes, and significantly increased mortality, with over a quarter of infants dying within 2 years of diagnosis [4,5,6].

Prior predictive models have been developed for BPD, with respiratory support, fraction of inspired oxygen (FiO2), and birthweight emerging as the strongest predictors, but models have not yet been developed to specifically predict PH in infants with BPD [7, 8]. While risk factors overlap between BPD and BPD associated PH, including lower birth weight and parameters of respiratory support, there remains a need for BPD associated PH risk estimation to improve precision of preventative and therapeutic interventions [9, 10]. Additionally, machine learning techniques that incorporate complex clinical data and improve predictive ability have not yet been applied towards the outcome of BPD associated PH [11].

To help improve care for these high-risk infants, we used a large multicenter cohort to predict PH in infants at two timepoints: 33 weeks post-menstrual age (PMA) for those receiving mechanical ventilation, and 36 weeks PMA for all infants receiving any respiratory support (diagnosed with BPD per Neonatal Research Network [NRN] 2019 criteria) [12]. We compared the performance of a more interpretable logistic regression model with that of both shallow (multilayer perceptron [MLP]) and deep, time series-based (long short-term memory [LSTM]) neural networks. Finally, we validated model performance on a temporally distinct multicenter cohort.

Methods

Cohort identification

We performed a multicenter cohort study of infants born between 22 and 28 weeks gestational age treated in neonatal intensive care units (NICUs) managed by the Pediatrix Medical Group with discharge year from 2008 to 2022. This database of 462 NICUs in North America collects prospectively entered clinical information from the electronic health record documentation tool, including demographics, lab values, medications, and procedure information [13]. We identified infants at risk for PH at two timepoints: infants receiving mechanical ventilation at 33 weeks PMA; and infants with any respiratory support at 36 weeks PMA (meeting NRN 2019 criteria for BPD, defined as any infant with respiratory support at 36 weeks PMA regardless of supplemental oxygen use before or at 36 weeks PMA) [12]. We included the 33 weeks PMA timepoint because infants mechanically ventilated at this timepoint are at highest risk for developing severe, grade 3 BPD and are therefore most likely to benefit from PH preventative therapies, given the correlation between BPD severity and PH risk [7, 14]. Individual infants could be included at one or both timepoints based on their respiratory support at 33 weeks and 36 weeks PMA (see Supplementary Fig. 1 for cohort inclusion schematic).

The Duke Health IRB approved this study with waiver of informed consent (Pro00106931).

Outcomes and variables

The primary outcome was the development of PH after either 33 or 36 weeks PMA, respective to each timepoint and prior to discharge. We defined the development of PH as the first day of either a clinical diagnosis of PH or of exposure to pulmonary vasodilator medication. We defined clinical diagnosis of PH as documentation of pulmonary hypertension or cor pulmonale by the treating physician. Pulmonary vasodilator medications included inhaled nitric oxide (iNO), sildenafil, epoprostenol, or bosentan. Infants with early PH, defined as meeting the criteria for PH within the first 28 days of life, were eligible for inclusion only if the clinical notes documented an end of the initial diagnosis of PH or a discontinuation of pulmonary vasodilator therapy also within the first 28 postnatal days. We excluded any infants with early PH that continued beyond 28 postnatal days, as well as any infant who otherwise developed PH before 33 weeks PMA or 36 weeks PMA, from analysis at those timepoints, respectively. We also excluded infants with major congenital anomalies, those missing gestational age or discharge documentation, those admitted after postnatal day 0, and those discharged prior to 33 weeks PMA or 36 weeks PMA, from analysis at those timepoints, respectively.
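The outcome and early-PH eligibility rules above can be illustrated with a minimal sketch. It is not the study's code: the function names, the representation of PH as (start day, end day) episodes in postnatal days, and the choice to treat "within the first 28 days" as day <= 28 are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

EARLY_WINDOW_DAYS = 28  # assumption: "first 28 days" is taken as postnatal day <= 28

def first_ph_day(diagnosis_days: Sequence[int],
                 vasodilator_days: Sequence[int]) -> Optional[int]:
    """First postnatal day with a PH diagnosis or a pulmonary vasodilator exposure."""
    days = list(diagnosis_days) + list(vasodilator_days)
    return min(days) if days else None

def eligible_at_timepoint(ph_episodes: List[Tuple[int, Optional[int]]],
                          timepoint_day: int) -> bool:
    """Apply the exclusion rule for PH that begins before the prediction timepoint.

    ph_episodes: hypothetical (start_day, end_day) pairs; end_day is None when
    no resolution was documented. timepoint_day is the postnatal day on which
    the infant reaches 33 or 36 weeks PMA.
    """
    for start, end in ph_episodes:
        if start >= timepoint_day:
            continue  # PH at or after the timepoint is a candidate outcome, not an exclusion
        resolved_early = (start <= EARLY_WINDOW_DAYS
                          and end is not None
                          and end <= EARLY_WINDOW_DAYS)
        if not resolved_early:
            return False  # persistent early PH, or PH that developed before the timepoint
    return True
```

Under these assumptions, an infant with a single episode from day 3 to day 10 stays eligible, while one whose episode runs from day 3 to day 40 is excluded.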

Other clinical information included demographics, medication administration (grouped into vasopressors, diuretics, antibiotics, opioids, caffeine, surfactant, bronchodilators, pulmonary vasodilators, xanthines, and systemic steroids; see Supplementary Methods 1 for included medications), lab values (including blood gas, chemistry, blood count, and coagulation profiles), respiratory support, and comorbidity data (see Supplementary Methods 2 for full list of variables). For lab values, we recorded daily minimum and maximum values. For medication use, we recorded a daily binary variable if any medication within each category was given. For respiratory support, we recorded the daily respiratory support.

Model and risk score development

We used infants discharged from 2008 to 2020 for initial model training and testing and infants discharged from 2021 to 2022 for temporal validation. To develop the predictive models, we included clinical data from postnatal day 0 to 33 weeks PMA or to 36 weeks PMA for the respective timepoint models. We separated variables into static (e.g., demographics and comorbidities) and dynamic (e.g., daily clinical information) inputs. We implemented both cross-sectional and longitudinal models. The cross-sectional models included different variants of regularized logistic regression (LR) models and an MLP model. The longitudinal models used a many-to-one LSTM neural network [15].
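As a minimal sketch of the temporal partition described above, the cohort can be divided by discharge year so that no validation infant was discharged during the development period. The record layout (a dict with a `discharge_year` key) and the function name are hypothetical.

```python
def temporal_split(infants, last_development_year=2020):
    """Split records into a development cohort and a later temporal validation cohort."""
    development = [i for i in infants if i["discharge_year"] <= last_development_year]
    validation = [i for i in infants if i["discharge_year"] > last_development_year]
    return development, validation

# Hypothetical records for illustration only.
example = [
    {"id": "a", "discharge_year": 2008},
    {"id": "b", "discharge_year": 2020},
    {"id": "c", "discharge_year": 2021},
    {"id": "d", "discharge_year": 2022},
]
dev, val = temporal_split(example)
```

Splitting on calendar time rather than at random tests whether performance holds as practice patterns change.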

For the development of cross-sectional models, we initially summarized the dynamic components by calculating the minimum, maximum, and last values for continuous laboratory measurements, the sum of total days of exposure for each medication group, the sum of total days of each respiratory support type, and the last respiratory support. This summarized dynamic information was then combined with the static data and collectively input into the cross-sectional models. For any missing data, we employed mean imputation [16]. For the development of the longitudinal model, we replicated the static data for each day, combined the static and daily dynamic information, and used this combined longitudinal dataset as the inputs of the LSTM model. We implemented missingness embedding as a part of the LSTM model to handle days with missing data [17]. We standardized variables to a mean of 0 and standard deviation of 1. We used Hyperband for hyperparameter optimization in the neural network models [18]. For each timepoint, all models used the same 85-15 train-test split to ensure equal comparison of performance across model types.
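The cross-sectional summarization, mean imputation, and standardization steps can be sketched as follows. This is an illustrative simplification, not the study pipeline: it assumes one value per lab per day (the study recorded daily minimum and maximum values), a hypothetical `resp_support` key, and the use of the population standard deviation.

```python
from statistics import mean, pstdev

def summarize_infant(daily_records, lab_names, med_groups):
    """Collapse an infant's ordered list of per-day dicts into one feature dict."""
    features = {}
    for lab in lab_names:
        values = [d[lab] for d in daily_records if d.get(lab) is not None]
        features[f"{lab}_min"] = min(values) if values else None
        features[f"{lab}_max"] = max(values) if values else None
        features[f"{lab}_last"] = values[-1] if values else None
    for med in med_groups:
        # Total days on which any medication in the group was given.
        features[f"{med}_days"] = sum(1 for d in daily_records if d.get(med))
    support = [d["resp_support"] for d in daily_records if d.get("resp_support")]
    for level in set(support):
        features[f"resp_days_{level}"] = support.count(level)
    features["resp_last"] = support[-1] if support else None
    return features

def impute_and_standardize(rows, column):
    """Mean-impute missing values, then scale the column to mean 0 and SD 1."""
    observed = [r[column] for r in rows if r[column] is not None]
    col_mean = mean(observed)
    filled = [r[column] if r[column] is not None else col_mean for r in rows]
    sd = pstdev(filled)
    return [(v - col_mean) / sd if sd > 0 else 0.0 for v in filled]

# Hypothetical three-day record for one infant.
days = [
    {"pco2": 50, "caffeine": True, "resp_support": "vent"},
    {"pco2": None, "caffeine": True, "resp_support": "cpap"},
    {"pco2": 62, "caffeine": False, "resp_support": "cpap"},
]
feats = summarize_infant(days, ["pco2"], ["caffeine"])
z = impute_and_standardize([{"x": 1.0}, {"x": None}, {"x": 3.0}], "x")
```

In a real pipeline the imputation mean and scaling parameters would normally be estimated on the training split only and then reused for the test and validation data.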

To develop a more interpretable cross-sectional model with as few covariates as possible, we first used least absolute shrinkage and selection operator (LASSO) logistic regression to select 20 candidate predictors based on their coefficients [19]. Subsequently, we employed forward selection using the Akaike Information Criterion to rank the importance of the variables selected by LASSO. We performed the feature selection and ranking operations within a tenfold cross-validation to assess the consistency of feature importance across train-test split folds. Finally, we selected the top 6 most salient predictors to train a parsimonious logistic regression model. We reported both standardized beta coefficients, to allow for interpretation of relative feature importance, and coefficients for the original-scale predictor variables.
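The relationship between the two reported coefficient sets can be made concrete. If a predictor x is standardized as z = (x - mean) / SD, then the original-scale coefficient equals the standardized coefficient divided by the SD, and the intercept shifts accordingly. The sketch below shows this conversion; every numeric value and variable name in it is hypothetical and none is taken from the study's fitted models.

```python
import math

def to_original_scale(intercept_std, betas_std, means, sds):
    """Convert logistic regression coefficients fit on z-scores to the raw scale."""
    betas_orig = {k: betas_std[k] / sds[k] for k in betas_std}
    intercept_orig = intercept_std - sum(
        betas_std[k] * means[k] / sds[k] for k in betas_std
    )
    return intercept_orig, betas_orig

def predicted_risk(intercept, betas, x):
    """Logistic model: risk = 1 / (1 + exp(-(intercept + sum(beta * x))))."""
    logit = intercept + sum(betas[k] * x[k] for k in betas)
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical values for illustration only.
means = {"birth_weight_g": 700.0, "steroid_days": 5.0}
sds = {"birth_weight_g": 150.0, "steroid_days": 4.0}
betas_std = {"birth_weight_g": -0.4, "steroid_days": 0.3}
intercept_std = -2.0

infant = {"birth_weight_g": 600.0, "steroid_days": 10.0}
infant_z = {k: (infant[k] - means[k]) / sds[k] for k in infant}

intercept_orig, betas_orig = to_original_scale(intercept_std, betas_std, means, sds)
risk_from_z = predicted_risk(intercept_std, betas_std, infant_z)
risk_from_raw = predicted_risk(intercept_orig, betas_orig, infant)
```

Both parameterizations give the same predicted risk. The standardized form makes coefficient magnitudes comparable across predictors, while the original-scale form can be applied directly to bedside values.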

Statistical analysis and model assessment

We used summary statistics including count (percentage) and median (interquartile values) for infants at each timepoint. We performed univariate analysis between infants who did and did not develop PH using Wilcoxon rank-sum and chi-square tests. We assessed model performance using area under the receiver operating characteristic curve (AUROC), as an evaluation of overall model discrimination, and area under the precision-recall curve (AUPRC), as an evaluation of the balance between model sensitivity and positive predictive value. We used bootstrapping to determine 95% confidence intervals for AUROC and AUPRC. We assessed performance on the held-out development test set and subsequently on the temporal validation cohort. Given potential variation in clinical indication for iNO use, we performed a sensitivity analysis that evaluated model performance excluding infants who met the PH outcome only from iNO exposure. Infants with a clinical diagnosis of PH as well as iNO exposure remained included. We considered P values < 0.05 statistically significant. We performed predictive modeling and statistical analysis in Python (Python Software Foundation, Beaverton, OR) and R (R Foundation for Statistical Computing, Vienna, Austria).
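For readers less familiar with these metrics, the following is a minimal standard-library sketch of AUROC, AUPRC, and a bootstrap confidence interval. It rests on assumptions: AUPRC is estimated as average precision, the interval is a percentile bootstrap (the text does not specify the bootstrap variant), and tied scores are not specially handled in the average precision calculation.

```python
import random

def auroc(y_true, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """AUPRC estimate: mean of the precision at the rank of each positive case."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

def bootstrap_ci(metric, y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval; resamples lacking either class are redrawn."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:
            stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy labels and scores for illustration only.
y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
ci_low, ci_high = bootstrap_ci(auroc, y, s, n_boot=200)
```

Resampling infants with replacement preserves the pairing between each outcome and its predicted score, so the interval captures sampling variability in the test set rather than variability from refitting the model.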

Results

For the 33 weeks PMA timepoint, we identified 2849 infants on mechanical ventilation across 170 sites (Table 1). Of these infants, 1631 (57%) were male, 1156 (41%) were White, the median gestational age was 25 weeks (24, 27), and 360 (13%) developed PH at a median age of 39 weeks PMA (36, 43). PH was determined by clinical diagnosis alone in 90 infants (25%), by medication use alone in 70 infants (19%), and by both criteria in 200 infants (56%) (Supplementary Table 1). Infants who developed PH were more premature (25 weeks gestational age [GA] [24, 26] vs 26 weeks GA [24, 27]; P < 0.001), had lower birthweight (641 g [554, 769] vs 730 g [610, 870]; P < 0.001), and were disproportionately Black (41% vs 28%; P < 0.001 across all races) compared with those who did not develop PH. Among infants at the 33 weeks PMA timepoint, those who developed PH had significantly higher mortality than those who did not (23% vs 7%, P < 0.001).

Table 1 Infant demographics for the 36 weeks PMA cohort and the 33 weeks PMA cohort.

For the 36 weeks PMA timepoint, we identified 20,173 infants on any respiratory support across 228 sites, including 2385 infants from 156 sites also included in the 33 weeks PMA timepoint (Table 1 and Fig. 1). Of these, 10,982 (54%) were male, 8702 (43%) were White, and the median GA was 26 weeks (interquartile values: 25, 27). Of these infants, 770 (4%) developed PH at a median age of 40 weeks PMA (38, 42). PH was determined by clinical diagnosis alone in 309 infants (40%), by medication use alone in 84 infants (11%), and by both criteria in 377 infants (49%) (Supplementary Table 2). Compared with infants who did not develop PH, infants who developed PH were more premature (median 25 weeks GA [24, 26] vs 26 weeks GA [25, 27]; P < 0.001), had lower birth weight (660 g [566, 782] vs 810 g [670, 970]; P < 0.001), and were disproportionately Black (39% vs 26%, P < 0.001 across all races). Mortality was significantly higher for infants with PH (12% vs 1%, P < 0.001).

Fig. 1: CONSORT flow diagram of infants in developmental cohort.

Details infants removed from the analysis cohort due to exclusion criteria as well as final size and number of infants who met the primary outcome in each cohort.

At the 33 weeks PMA timepoint, the top six features were birth weight, total days of caffeine, total days of systemic steroids, presence of early PH, and last fraction of inspired oxygen and capillary blood gas (CBG) pCO2 prior to 33 weeks PMA (Table 2, and Supplementary Table 3). Using these features, the LR predictive model demonstrated good discrimination with AUROC 0.726 [95% CI, 0.653–0.796] (Fig. 2). The LSTM model performed similarly (AUROC 0.719 [0.639–0.795]), while the MLP model was relatively weaker (AUROC 0.635 [0.544–0.719]). The LR model performed best as assessed by AUPRC (LR AUPRC 0.329 [0.221–0.462]; LSTM 0.267 [0.188–0.398]; MLP 0.197 [0.133–0.306]). Model performance was consistent when applied to the temporal validation cohort of 252 infants across 78 sites (Fig. 3, LR model performance across different prediction thresholds in Supplementary Table 4).

Fig. 2: Model performance at 36 weeks PMA and 33 weeks PMA timepoints in developmental test cohort.

Performance evaluated with receiver operating characteristic (ROC) curve and precision-recall curve for LSTM, MLP, and LR models. LR logistic regression, LSTM long short-term memory, MLP multilayer perceptron.

Fig. 3: Model performance at 36 weeks PMA and 33 weeks PMA timepoints in temporal validation cohort.

Performance evaluated with receiver operating characteristic (ROC) curve and precision-recall curve for LSTM, MLP, and LR models. LR logistic regression, LSTM long short-term memory, MLP multilayer perceptron.

Table 2 Predictors of pulmonary hypertension in final logistic regression models.

For the model at 36 weeks PMA, the top six features were birth weight, presence of an atrial septal defect (ASD), total days of systemic steroids, and last serum chloride, CBG pCO2, and respiratory support prior to 36 weeks PMA (Table 2, and Supplementary Table 5). Model discrimination was robust at the 36 week PMA timepoint (LR model AUROC 0.826 [0.788–0.861], LSTM 0.820 [0.783–0.855], MLP 0.803 [0.758–0.847]) (Fig. 2). Performance as assessed by AUPRC was slightly lower than for the 33 week PMA timepoint, likely reflecting the increased incidence of pulmonary hypertension in the 33 week PMA cohort (Fig. 2). Model performance remained robust in the temporal validation cohort of 2749 infants across 163 sites (LR AUROC 0.778 [0.739–0.814]; LSTM AUROC 0.775 [0.736–0.811]; MLP 0.767 [0.727–0.807]) (Fig. 3, LR model performance across different prediction thresholds in Supplementary Table 6). Model performance was similar across sex and racial demographics (Supplementary Tables 7 and 8). Model performance was also similar in a sensitivity analysis excluding infants who met the PH outcome due to iNO exposure alone (Supplementary Table 9).
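The dependence of AUPRC on outcome incidence can be shown with simple arithmetic. A non-informative classifier has an expected AUPRC equal to the outcome prevalence, so each cohort has a different baseline. The sketch below uses only counts and the 33 weeks PMA LR AUPRC reported above; the variable names are illustrative.

```python
# Counts reported in the Results for the development cohorts.
ph_33, n_33 = 360, 2849
ph_36, n_36 = 770, 20173

# Expected AUPRC of a non-informative model equals the outcome prevalence.
baseline_33 = ph_33 / n_33
baseline_36 = ph_36 / n_36

# Reported LR AUPRC at the 33 weeks PMA timepoint, relative to its baseline.
lr_auprc_33 = 0.329
lift_33 = lr_auprc_33 / baseline_33

print(f"33 weeks PMA baseline AUPRC: {baseline_33:.3f}")
print(f"36 weeks PMA baseline AUPRC: {baseline_36:.3f}")
print(f"33 weeks PMA LR lift over baseline: {lift_33:.1f}x")
```

Because the baseline at 36 weeks PMA is roughly a third of that at 33 weeks PMA, a lower absolute AUPRC at 36 weeks does not by itself indicate weaker enrichment of true cases.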

Discussion

In this study, we used a large multicenter cohort of premature infants to develop models predicting PH at two separate timepoints: at 33 weeks PMA for those receiving mechanical ventilation, and at 36 weeks PMA for all infants meeting NRN 2019 criteria for BPD. Model discrimination was strong at both timepoints and was consistent when applied to the multicenter temporal validation cohort. At both timepoints, mortality was significantly higher in infants who developed PH. Overall, accurately predicting the outcome of PH will be the first step in identifying the highest risk infants to improve monitoring, selection for possible preventative therapies, and outcomes.

The outcomes in our cohort illustrate the significant mortality of BPD associated PH, reaching 23% for infants mechanically ventilated at 33 weeks who develop PH. Additionally, there are currently no evidence-based therapies for either preventing or treating BPD associated PH. A much-needed multicenter dose-escalating safety study for the pulmonary vasodilator sildenafil is currently underway [14]. We envision a twofold benefit for implementing our models. First, predicting PH could help identify the highest risk infants, allowing for enhanced screening or identification of potential candidates for future preventive therapies. Here, we developed models at both the 36-week timepoint, at which the diagnosis of BPD is commonly determined, and an earlier 33-week timepoint, as these younger high-risk infants may have added benefit from potential preventative therapies. Second, predicting PH may improve the success of future clinical trials for BPD associated PH preventive therapeutics. Prognostic enrichment (selecting infants more likely to have the outcome) could allow for the enrollment of fewer infants, increase the power of future studies, and improve the likelihood of trial completion [20].
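The effect of prognostic enrichment on trial size can be illustrated with the standard normal-approximation formula for comparing two proportions. The sketch is hypothetical throughout: the 50% relative risk reduction is an invented effect size, and the control event rates of 4% and 13% are borrowed from the cohort incidences above purely as examples of a lower-risk and a higher-risk population.

```python
import math
from statistics import NormalDist

def n_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate infants needed per arm for a two-sided two-proportion comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2)

RELATIVE_REDUCTION = 0.5  # hypothetical treatment effect, not an estimate from any trial

# Lower-risk population: control event rate of 4%.
unenriched = n_per_arm(0.04, 0.04 * (1 - RELATIVE_REDUCTION))
# Higher-risk (enriched) population: control event rate of 13%.
enriched = n_per_arm(0.13, 0.13 * (1 - RELATIVE_REDUCTION))

print(f"Per-arm sample size at 4% event rate:  {unenriched}")
print(f"Per-arm sample size at 13% event rate: {enriched}")
```

Under these assumptions the required enrollment falls severalfold when the control event rate rises, which is the mechanism by which a risk model could make a prevention trial more feasible.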

To determine the most pragmatic model, we compared the predictive ability of two neural network-based models with that of a logistic regression model from only six features. For both the 33 weeks and 36 weeks PMA timepoints, the LR model achieved similar performance to the time series-based neural network LSTM model. The lack of superior performance of the more complex model may be due to the relatively small sample size for training, particularly in the 33 weeks PMA cohort. Additionally, neural networks excel at incorporating multimodal and unstructured data such as chest x-ray images or bedside telemetry monitoring [21]. It is possible that incorporating these inputs into a future neural network model could provide additional discriminatory power. However, for our current models, the logistic regression model provides the best balance of model performance with parsimony and interpretability, and represents a more functional model [22].

Models at both the 33 weeks and 36 weeks PMA timepoints incorporated pCO2 blood gas data and respiratory support information. Prior risk calculators for developing BPD alone have similarly found that respiratory support at 28 days PMA was the most discriminating predictor [7, 8]. Persistent hypercarbia and an ongoing oxygen requirement are both suggestive of worse gas exchange and an increased level of lung parenchymal disease. These findings are consistent with the putative mechanisms for PH development in infants with BPD, in which alveolar hypoxia drives ongoing inflammation and endothelial dysfunction [1, 3]. Better understanding the molecular pathways responsible for this dysfunction may also provide potential targets for future therapeutics. Notably, gestational age was not included in either model. Gestational age is largely collinear with birthweight, so it was not selected by the regularized regression methods. A prior study of infants with birthweight <750 g found that, compared with infants born appropriate for gestational age, those born small for gestational age (SGA) were born less premature but had a higher probability of developing more severe BPD [23]. A recent meta-analysis also demonstrated a significant association between fetal growth restriction or SGA at birth and increased risk for BPD and PH [24]. It is possible that weight may be a stronger overall indicator of lung maturity at birth, but more research is needed to better understand this relationship.

For the 36 week PMA model, lower chloride values predicted increased risk of pulmonary hypertension. This finding may reflect higher intensity of diuretic use or increased electrolyte derangements from diuretic treatment in infants who subsequently develop PH. Daily diuretic use was also incorporated in the model as a binary variable and was not selected as a significant contributor. International consensus guidelines on pediatric PH recommend judicious fluid management in infants with severe PH, including diuretics [25]. A prior multicenter cohort study found increased duration of furosemide exposure in infants 23–29 weeks gestational age was associated with decreased risk of BPD, although this has not yet been confirmed in prospective trials [26]. It is possible diuretic exposure may reduce risk of BPD, but for individual infants, a higher diuretic requirement may reflect underlying worse disease.

Interestingly, variables for presence of a patent ductus arteriosus (PDA) and need for PDA intervention were included in the initial model but were not selected as final predictors. Presence of an ASD was an independent predictor for pulmonary hypertension in our overall cohort. This finding is consistent with prior studies of infants with BPD, in which—compared with infants without an ASD— those with an ASD had increased odds of developing pulmonary hypertension and developed pulmonary hypertension sooner [27, 28]. Post-tricuspid valve left-to-right shunts, such as a ventricular septal defect or a PDA, may more significantly contribute to pulmonary vascular disease, but the potential for hemodynamically significant pre-tricuspid shunting from an ASD also requires attention and close monitoring [25].

The models at both the 33 and 36 weeks PMA timepoints incorporated medication administration data. Increased systemic steroid exposure predicted increased risk for developing PH for both cohorts, and increased caffeine exposure predicted PH for the 33 weeks PMA model. Recent Cochrane analyses of randomized controlled trial data suggested that both early (within 7 days of birth) and late (7 days or more after birth) steroid exposure may decrease the risk of BPD in premature infants [29, 30]. Similarly, caffeine administration to low-birth-weight infants was associated with decreased BPD at discharge in a randomized trial [31]. The impact of steroid or caffeine exposure on subsequent BPD associated PH development is less well known. Longer duration of medication exposure may reflect a baseline sicker infant population that has increased PH risk, and more research is needed to better understand these associations.

Model discriminatory performance was relatively higher for the 36 weeks PMA model than for that at 33 weeks PMA. Additionally, there was increased variation in feature importance across the tenfold cross validation for the 33 weeks PMA model (Supplementary Table 3). Together, these findings suggest that we may have only a limited understanding of the contributors to the outcome of PH in infants, particularly in the cohort of infants receiving mechanical ventilation at 33 weeks PMA. Our models incorporated many clinical variables including lab values, respiratory support, and comorbidities but were still unable to fully predict the outcome of PH. Development of new biomarkers may be needed to better risk stratify and also identify new targets for therapeutics. For example, prior studies have found proBNP may be a potential biomarker for predicting risk in infants with BPD associated PH, but more work is needed to evaluate its impact in prospective studies as well as its utility for screening for PH [32].

Our study has many strengths. We incorporated a large multicenter cohort of infants as well as granular daily clinical data from the electronic health record. Our cohort reflects practice across both academic and community NICUs; including this diverse population will improve the generalizability of our models. However, our study has notable limitations. First, our primary outcome of PH was based on a composite of either a clinical diagnosis or medication use. The gold standard diagnosis for PH is through heart catheterization, which may not be feasible or safe in many small infants [25]. Echocardiogram can be a reliable non-invasive surrogate, but our database did not include echocardiogram images or reports for analysis. Different centers in our cohort may have had different thresholds for obtaining screening echocardiograms, for diagnosing PH from echocardiograms, and for initiating pulmonary vasodilator therapy [33, 34]. Further, some clinicians may use pulmonary vasodilator therapy, particularly iNO, for indications other than pulmonary hypertension. Reassuringly, our models performed similarly on sensitivity analysis that excluded iNO use from the outcome label. Prospective evaluation of our models should include a consistent definition of the pulmonary hypertension outcome. Second, our study was limited to data during the NICU admission. Infants may remain at risk for PH after discharge, and ongoing outpatient screening is recommended for infants with a persistent oxygen requirement [35]. Third, our models inherently excluded infants diagnosed with pulmonary hypertension prior to either the 33-week or 36-week PMA timepoint. New models would be needed to identify these infants prior to their diagnosis. Finally, while we validated our model performance in a temporally distinct multicenter cohort, additional prospective and external validation will be critical to ensure our model remains reliable and provides a clinically meaningful prediction.

In conclusion, we have developed and validated predictive models for PH in infants at two timepoints: 33 weeks PMA in infants receiving mechanical ventilation and 36 weeks PMA in all infants with BPD. These models use readily available clinical variables to assist with bedside clinical decision making. Prospective implementation of these models could allow for identification of the most at-risk infants.