Introduction

Therapeutic hypothermia (TH; or cooling therapy) improves survival and neurodevelopmental outcomes in children with neonatal encephalopathy secondary to perinatal asphyxia (NE) and is the standard care in high income countries.1 Due to TH, more children survive, and fewer survivors develop severe motor abnormalities like cerebral palsy (CP).2,3 However, children cooled for NE are still at high risk for cognitive deficits at early school age, even in the absence of CP.4 Therefore, it is important to identify those children at risk of cognitive difficulties at school age early on to offer timely and targeted support.

Routine follow-up of children cooled for moderate to severe NE at 12–24 months focuses on the assessment of cognitive and language development.5 In clinical practice, early assessment of development is often done using the Bayley Scales of Infant and Toddler Development to assess cognitive, language, and motor development, and allowing for separate composite scores for these areas to be derived at one-month age intervals.6 At early school age, cognitive abilities are typically assessed using the Wechsler Intelligence Scale for Children.7 It assesses multiple cognitive abilities that estimate a total intelligence quotient (full-scale IQ, FSIQ).7

Some previous studies conducted in children with, and without risk, for cognitive deficits found weak to good associations between Bayley Scales of Infant and Toddler Development, Third Edition (Bayley-III) cognitive scores and Wechsler Intelligence Scale for Children, Fourth Edition (WISC-IV) FSIQ (r between 0.33–0.62).8,9,10,11 The associations tend to be stronger in children with higher risk of disability, as exemplified by a Swedish study where Bayley-III cognitive composite scores at 2.5 years explained up to 38% of the variance in WISC-IV FSIQ at 6.5 years in extremely preterm born children,11 but only 24% in full term born controls (multivariable model).10

While there is some evidence for an association between Bayley and WISC in the NE population from a randomized controlled trial of TH, this work used an earlier version of Bayley (version II), and the results appeared less predictive for scores between 70 and 84.12 The study found that Bayley-II Mental Developmental Index (MDI) < 70 had a high predictive value for FSIQ < 70 at school age (sensitivity = 0.87, specificity = 0.96).12 However, this association was not robust when Bayley-II MDI was between 70 and 84; only half of the children with moderate to severe NE and without CP (13/27) remained in the aforementioned developmental category at 6–7 years of age.12 In contrast, most of the children with CP remained in the same developmental category, but that might be due to the majority (96%) having severe cognitive impairments (MDI and FSIQ < 70).12

Data on the association between cognition assessed using Bayley-III, which is commonly used in clinical practice in the therapeutic hypothermia era, and cognition at school age in children without severe motor disabilities after NE is unknown. Further, the evolution of cognitive scores in this population from 18 months to early school age is not well described. We hypothesized that the Bayley-III CLC scores at 18–21 months will be positively associated with cognitive scores at 6–8 years. The aim of this study was to evaluate the association and individual changes in cognitive scores assessed using Bayley-III average cognitive and language composite (CLC) scores at 18–21 months and WISC-IV FSIQ at 6–8 years in children cooled for moderate to severe neonatal NE.

Methods

Study design

Data was drawn from the CoolMRI study,13 which recruited from a population-based cohort of children aged 6–8 years without cerebral palsy who had received TH for moderate to severe NE after birth. The eligible children were identified through the patient database at St Michael’s Hospital, Bristol, UK. The children had been born at or above 35 weeks gestation between April 2008 and December 2012 in South West England and underwent cooling therapy at St Michael’s Hospital, Bristol. All children had evidence of perinatal asphyxia and moderate-severe encephalopathy confirmed by clinical neurological examination and amplitude-integrated EEG.14 Absence of CP at 18–21 months and at 6–8 years of age, was confirmed following detailed neurological examination by an experienced pediatrician.15 Children were followed up at 18–21 months and 6–8 years of age. Developmental assessments were conducted by an experienced physiotherapist (SJ) at 18–21 months, and an experienced neuropsychologist at 6–8 years of age.4 Assessors were blinded to the clinical details of the children, although at 18–21 months the assessor was aware that children underwent TH.

Cognitive assessments

Recruited children were followed longitudinally, with assessment at two time points. At 18–21 months, children were assessed with the Bayley-III.6 Bayley-III Cognitive and Language Composite (CLC; average of the Bayley-III Cognitive Composite and Bayley-III Language Composite scores) score was used in the analysis as language represents an essential part of cognitive assessment.16 Cognitive abilities at 6–8 years of age were assessed with the WISC-IV.7 Full-Scale IQ was used in the analysis. Both assessments have a standardized mean of 100 and a standard deviation (SD) of 15.

Data collection

Baseline data collected by researchers in the CoolMRI study included gestation at birth, birth weight, measures of asphyxia, and encephalopathy. The severity of asphyxia was defined by acidosis on the blood gas obtained within 1 h after birth, Apgar score, and need for respiratory support at 10 min after birth.14 We used the severity of abnormalities on neurological examination to denote the severity of NE (mild, moderate, or severe). The severity of encephalopathy was also evaluated with amplitude-integrated EEG, acquired prior to cooling therapy, and classified as moderately or severely abnormal by an experienced neonatologist. The total neonatal brain MRI injury score was calculated based on a qualitative assessment of structural neonatal brain T1-weighted MRI scans at 4–15 days post-birth by an experienced perinatal neurologist,15 using the Rutherford classification system.17 A total neonatal brain MRI injury score was used in the analysis (range: 0–11; 11 represents maximum injury) and computed as a sum of injury across basal ganglia and thalami, cortex, white matter, and posterior limb of internal capsule.17 The socio-economic status was estimated by the index of multiple deprivation (IMD).13 IMD is a measure of relative deprivation in small areas in England and is computed as a weighted sum of (1) income deprivation, (2) employment deprivation, (3) education, skills and training deprivation, (4) health deprivation and disability, (5) crime, (6) barriers to housing and services, and (7) living environment deprivation.18 IMD was calculated from the 2019 release for England and reported as deciles, ranging from 1 to 10. A score of 1 represents the most deprived 10% of small areas in England, and a score of 10 represents the least deprived 10% of small areas in England.13 The IMD deciles were calculated from the mother’s postcode at 6–8 years.

Sample size and outcomes

The CoolMRI study identified a total of 69 eligible children through the St Michael’s Hospital patient database, of whom 50 were recruited in the study.15 Primary outcome measure was the FSIQ at 6–8 years of age.

Statistical analysis

Statistical analyses were conducted using R.19 Complete case analyses were used for all statistical analyses. The evaluation of whether the cognitive scores are normally distributed was carried out by quantile–quantile (Q–Q) plots and the Shapiro-Wilk normality test. Density plots were further used to visualize the distribution of cognitive scores. The main analysis tested for association between Bayley-III CLC and WISC-IV FSIQ scores. The means of the Bayley-III CLC and WISC-IV FSIQ were compared with a paired t-test. Cognitive outcomes at 18–21 months and 6–8 years were categorized using the following cut-off points: < 70, 70–84, 85–100, and > 100. The percentage of children with scores remaining in the same category, moving to a lower or higher developmental category between the two assessments was calculated. Univariable and multivariable linear regression models were used to predict WISC-IV FSIQ scores. In the first model, only Bayley-III CLC was used as a predictor. In the second model, the contributions of age at Bayley-III assessment, sex, birth weight, gestation at birth, and IMD were evaluated. In the third and fourth models, the contributions of neonatal brain MRI injury score and NE clinical grade were evaluated. The Akaike Information Criterion (AIC) was used to compare the four models. No external validation data were available. In a further sub-group data analysis, the characteristics of children who deteriorated and children who remained in the same range or progressed to a higher FSIQ score at 6–8 years of age, were compared using the independent sample t-test for continuous and the Pearson’s χ2 with Yates’ continuity correction for categorical variables. The median test (available in R package agricolae) was used to compare the medians of IMD and neonatal brain MRI injury scores between the groups. In a further sensitivity analysis, the regression analyses were repeated using IMD as a binary predictor (IMD lower band: < 6 vs IMD upper band: 6–10).

Results

Participant characteristics

Of the 69 invited children, 19 were excluded at 6–8 years (11 did not respond, 5 moved away, and 3 declined participation).15 Of the remaining 50 recruited children, a total of 49 (98%) children had available data for both Bayley-III CLC and WISC-IV FSIQ. These were used in the main analysis. The sociodemographic information (including IMD), clinical characteristics, neonatal brain MRI injury score, and Bayley-III cognitive and language composite scores of children who were eligible but not recruited in the study were not significantly different to those included in this study (Table S1).

Baseline characteristics of children included in the main analysis are presented in Table 1. The mean gestational age of the group was 39.8 (SD 1.6) weeks, and over half (n = 29, 59%) were male. Most children (n = 39, 80%) were classified at birth with a moderate NE grade, while 10 (20%) with severe NE grade. At 6–8 years, the median IMD was 7 (IQR 4–9). There was no significant association between IMD decile and NE grade (χ²(9) = 8.35, p = 0.499).

Table 1 Baseline characteristics of the study cohort.

Bayley-III CLC & WISC-IV FSIQ scores

The distribution of scores at both time points is presented in Fig. 1. There was little evidence that the scores were not normally distributed (p = 0.193 and p = 0.389), although the FSIQ density plot visually suggested two distinct peaks/FSIQ clusters (Fig. 1). Therefore, a post-hoc K-means cluster analysis of FSIQ (cluster number = 2) was conducted and two clusters were compared by sex, NE clinical grade, neonatal brain MRI injury score, and IMD, using independent sample t-test for continuous and χ2 tests for categorical variables. The FSIQ clusters significantly differed by IMD, t(43.76) = 2.72, p = 0.009. The mean IMD for Cluster 1 (mean FSIQ 104.8, SD 6.6) was 7.3 (less deprived areas), while for Cluster 2 (mean FSIQ 85.6, SD 6.5) was 5.5 (more deprived areas). No statistically significant differences were found in sex, NE clinical grade, and neonatal brain MRI injury score.

Fig. 1
figure 1

Density plots of Bayley-III Cognitive & Language Composite scores at 18–21 months and WISC-IV Full-Scale IQ at 6–8 years.

Cognitive scores at 18–21 months and 6–8 years are reported in Table 2. The Bayley-III CLC scores (mean 103.4, SD 11.3) were significantly higher than WISC-IV FSIQ scores (mean 95.4, SD 11.7), t(48) = 4.61, p < 0.001. Most children had a Bayley-III CLC of 85–100 (n = 16, 32.7%) or > 100 (n = 31, 63.3%) at 18–21 months of age, with two (4.0%) children having a score of < 85. At 6–8 years of age, 23 (46.9%) and 18 (36.7%) children had a WISC-IV FSIQ of 85–100 or > 100, respectively, and 8 (16.3%) children had a WISC-IV FSIQ of < 85.

Table 2 Cognitive scores at 18–21 months and 6–8 years.

Associations between scores

Individual changes in Bayley-III CLC and WISC-IV FSIQ scores over time are presented in Fig. 2. Most children’s scores either remained in the same developmental category (n = 22, 44.9%) or changed to a lower developmental category (n = 22, 44.9%). Only 5 (10.2%) of children’s scores moved to a higher developmental category at early school age. Children with a Bayley-III CLC between 85 and 100 (n = 16) had a range of WISC-IV FSIQ with 6 (37.4%) below 85, 7 (43.8%) between 85 and 100 and 3 (18.8%) above 100. Children with a Bayley-III CLC above 100 (n = 31) had a range of WISC-IV FSIQ with one (3.2%) below 85, 15 (48.4%) between 85–100 and 15 (48.4%) above 100. Of the remaining two children, one moved from a Bayley-III CLC of < 70 to a WISC-IV FSIQ of 70–84, and the other from a Bayley-III CLC of 70–84 to a WISC-IV FSIQ of 85–100.

Fig. 2: Individual changes in Bayley-III CLC scores at 18–21 months to WISC-IV FSIQ scores at 6–8 years of age.
figure 2

Note. CLC average of Cognitive & Language Composite scores, FSIQ Full-Scale IQ.

Regression analyses

The linear association between Bayley-III CLC and WISC-IV FSIQ is presented in Fig. 3 (unadjusted model). Reference lines indicate cut-off thresholds at 70, 85, and 95 for both assessments. Five out of 8 (62.5%) children with an WISC-IV FSIQ below 85 had a Bayley-III CLC above 95.

Fig. 3: Linear relationship between Bayley-III CLC and WISC-IV FSIQ.
figure 3

Scatterplots show the association between Bayley-III CLC scores and WISC-IV FSIQ. a includes all data points, with outliers (absolute residuals > 20 points from zero) shown in red. b excludes these outliers. CLC average of Cognitive & Language Composite scores, FSIQ Full-Scale IQ.

The results of all adjusted linear regression models are presented in Table 3. Bayley-III CLC scores were the only parameters that were significantly associated with WISC-IV FSIQ when adjusting for relevant sociodemographic factors, neonatal brain MRI injury score and NE clinical severity grade. When adding these parameters, the predictive ability of the models increased from 19% to 36% for FSIQ.

Table 3 Regression models of WISC-IV FSIQ at 6–8 years by Bayley-III CLC at 18–21 months.

Sub-group analysis

In a further sub-group analysis, we explored differences in characteristics of children whose scores changed to a lower developmental category (n = 22, 44.9%) compared to children who remained in the same or moved to a higher developmental category at 6–8 years of age (n = 27, 55.1%). Those children whose developmental category reduced between the two time points had significantly lower WISC-IV FSIQ scores than those with scores that remained in the same or moved to a higher developmental category. Addition, male infants were more likely than female infants to increase their scores over the time period (p = 0.040). There were no other significant differences between the two groups regarding clinical characteristics, neonatal brain MRI injury score, IMD, and Bayley-III CLC scores (Table 4).

Table 4 Comparison of baseline characteristics of children with unchanged or increased scores compared to children with scores that decreased between 18–24 months and 6–8 years.

Sensitivity analysis

In a post-hoc analysis, when using IMD as a binary variable (defined as deciles 1–5 or 6–10) we saw similar results to the main analysis with Bayley-III CLC remaining the sole predictor of later FSIQ (Table S2).

Discussion

We explored the associations between cognitive and language development assessed using Bayley-III at 18–21 months and WISC-IV at 6–8 years of age in children cooled for moderate to severe NE. While we found positive associations between Bayley-III CLC and WISC-IV FSIQ, the Bayley-III explained only a small proportion of variance in FSIQ at school age; with limited predictive value at 6–8 years for cognitive outcomes in our cohort of children with NE without CP.

Our findings are remarkably similar to other published research predicting childhood FSIQ from infant Bayley scores. In our sample, over a third of children with a Bayley-III CLC score of 85–100 had FSIQ below 85 at early school age, similar to a study in children cooled for moderate to severe encephalopathy without CP, where 11 of 33 with Bayley-II MDI above 84 at 18–22 had IQ below 84 at 6–7 years.12 In our study, Bayley-III CLC alone explained only 19% of the variance in FSIQ, increasing to 36% when sociodemographic factors (including measure of relative deprivation, IMD), neonatal brain MRI and NE severity, were included; however, the Bayley-III CLC alone better predicted FSIQ as shown by lower AIC. This is comparable to findings in preterm populations.20 Yet over 60% of FSIQ variance remains unexplained when using Bayley scores as a sole predictor.

Our study’s unique contribution to the work of other researchers on the limited association between Bayley and WISC, is that we also investigated individual cognitive trajectories and associated factors. Around 45% of children moved to a lower developmental category, whereas others stayed in the same developmental category or progressed to a higher one at school age. While sex was not a significant predictor of FSIQ, more male infants appeared to increase their score between the two time periods, and further work may be needed to identify the mechanisms behind this. No significant differences were found between children whose developmental category changed between the two time points in basic clinical characteristics, neonatal brain MRI injury score and IMD. Other factors might be driving these changes, such as brain plasticity or other protective mechanisms. Previous work has demonstrated significant positive associations between early environment/life experiences and later brain and cognitive development.21,22

The limited association between Bayley-III and WISC-IV may be explained by multiple factors.10,11,23 Bayley-III is used to assess children’s abilities at different points in development,23 while individual IQ scores on intelligence tests like WISC-IV tend to be relatively stable from childhood onwards and predictive of later cognitive performance.24 The limited association between the two measures may also be due to the lengthy interval between assessments, coupled with potential measurement errors. Yu and colleagues25 found that accounting for measurement errors and shorter intervals showed high stability in intelligence from early infancy through adulthood, challenging previous research. Furthermore, the complexity of skills measured by IQ tests may not be evident at earlier ages since cognitive abilities are still developing. Individual score changes over time may also be patterned by various environmental factors influencing individual cognitive trajectories. For instance, we found that children from less deprived areas had higher FSIQ scores than those from more deprived neighborhoods.

Bayley-III tends to underestimate developmental delay as compared to its previous version (version II) when using a standard cut-off of < 70 for moderate-severe delay.26 Other researchers found that by using higher thresholds (e.g., 95) on Bayley-III cognitive composite scores, the sensitivity to detect developmental delay increased.9,11 In our study, increasing the Bayley-III threshold to 95 did not improve identification of children with a WISC-IV FSIQ < 85, as 62.5% of children still had FSIQ below 85. In 2019, Bayley-IV was published and is being integrated into clinical practice.27 Bayley-IV did not include high-risk populations in the standardization sample. Bayley-III included 10% of children at risk for developmental delay which is thought to be the reason for inflated scores.26,27 The sensitivity of Bayley-IV to detect later developmental delay should be evaluated in future studies.

The results of this study need to be assessed in line with the following limitations. Due to the small sample size, we were limited to performing only simple analyses and the results should be interpreted with caution. Furthermore, the Bayley-III was administered at an early age (at 18–21 months), whereas stronger associations between Bayley-III and WISC-V were shown when Bayley-III was administered at 31–42 (r = 0.474) months than at 6–18 months (r = 0.157).28 Additionally, we did not have a Bayley-III social-emotional scale to be compared to FSIQ at 6–8 years.

Nevertheless, our study’s strengths include its longitudinal design and focus on individual cognitive trajectories, an aspect often overlooked in the studies. A further strength is the inclusion of children without CP who constitute the majority (85–88%) of survivors following TH for NE.29 This is unique as studies often include children with severe developmental problems where predictions of outcomes are easier. Furthermore, most children in our sample had moderate encephalopathy, and only a few had severe developmental delay in FSIQ at school age. Our study suggests that there might be unique challenges in finding associations/predictions in children with subtle problems and moderate delay.

In conclusion, our findings suggest limitations in using Bayley-III scores before 2 years as a prognostic tool for the identification of children at risk for cognitive impairments at school age. Clinicians should be therefore aware of the limited predictive ability of Bayley at an early age and clinical practice should be adapted to include longitudinal following of children at risk into early school age. Longitudinal monitoring, which is not currently part of routine clinical practice, might be beneficial for accurate identification and support of children’s cognitive development.