Introduction

The most frequent cause of neonatal seizures is acute brain injury in the early postnatal period. Seizures typically emerge over the first 72 postnatal hours in term neonates, primarily caused by hypoxic-ischaemic encephalopathy (HIE) or cerebrovascular injury1,2,3. More than half of neonates with moderate or severe HIE develop seizures1,3,4. For those neonates who do develop seizures, approximately 7% to 10% are at risk of death and 23% to 50% are at risk of poor outcome1,5,6.

Seizures can be subtle, often without clinical correlate, and frequently remain undetected7. Continuous electroencephalogram (EEG) monitoring is the gold standard for neonatal seizure surveillance. Yet real-time interpretation of the EEG requires specialised expertise that is not always available, limiting the capacity for continuous review of EEGs. A recent multi-centre study found that, even with continuous EEG or amplitude-integrated EEG readily available, only 11% of seizures were treated within 1 hour of onset4. Prompt treatment can reduce seizure burden and may therefore reduce seizure-mediated neuronal damage and improve outcomes8.

Automated review of the EEG, with expert oversight, would allow for increased monitoring of at-risk neonates. A recent clinical trial of an automated algorithm to detect EEG seizures demonstrated its potential clinical utility9. Yet this seizure detection algorithm, which was developed in 201110, has been comprehensively surpassed in performance by a range of newer methods11,12,13,14,15,16,17,18,19. Most of these contemporary seizure detection methods use deep neural networks. These powerful tools offer increased performance over feature-based machine-learning methods by enabling end-to-end learning, from raw EEG to label class, and by scaling performance with increasing model size and training data. We have identified 4 key challenges in the current literature that may be constraining performance.

First, many deep-learning models are trained with small datasets, a significant limitation in this field11,13,14,15,16,17,19. A widely used open-access dataset contains 112 h of EEG recordings from 79 neonates20, although in many cases, only a subset of 39 neonates with seizures is used15,16,17,19. Second, most methods use global, and not per-channel, seizure annotations11,12,13,14,15,16,17,18,19. This enables faster annotation of the EEG but provides less detailed information for model training. Additionally, this can make the models susceptible to variations in the EEG montage. Third, most methods use a relatively small network architecture, with fewer than 50k parameters11,12,13,14,15,16,17,18,19. This may limit the extent to which a model can capture the complexity of the data. Fourth, validation of models on held-out datasets is frequently omitted, making it difficult to determine how the models would perform on new, unseen data11,15,16,17,19.

In this study, we aim to address these limitations. Our primary goal is to develop a deep-learning model capable of detecting seizures in neonatal EEG with accuracy suitable for clinical application. To this end, we test the hypothesis that increasing both model size and training data will improve performance. Our models are based on a modern convolutional neural network architecture and are trained to detect seizures on a per-channel basis. Additionally, we validate our models on independent, held-out datasets from Cork and Helsinki to determine efficacy on unseen data.

Results

Model and data scaling

We evaluate a wide range of model scales, described in Table 1, from the 39k-parameter Nano variant up to the >500 times larger 21m-parameter Extra-Large (XL) model, and find significant performance improvements. Figure 1b illustrates these gains in Matthews correlation coefficient (MCC), correlation, and error rate, suggesting that model scaling is indeed a viable path to better models for neonatal seizure detection. A representative sample of the model output in Fig. 2b gives a qualitative sense of this improvement.

Table 1 Model variants explored in this work
Fig. 1: Scaling training data and model size yields approximate power-law performance gains.

Metrics are calculated at the segment level for 20% (41/202 neonates) of the development dataset in (a) and (b), and at the neonate level for the held-out validation datasets in (c). a Performance improves with an increasing number of neonates and hours of annotated EEG (counted per channel) in the training set; error bars denote min/max over 3 trials. Prominent datasets from the literature are included for comparison10,13,20. b Scaling model size over 3 orders of magnitude reveals the typical deep double descent pattern, with a performance dip for the Small model before recovery for larger models. Marker size indicates computational cost in giga floating-point operations (GFLOPs). c Scaling model size on the held-out datasets from Cork and Helsinki. We include a linear fit to illustrate the predictability of the performance increase. See Table 8 for a description of the metrics used.

Fig. 2: Segment of EEG with seizure and comparison of different model outputs.

a Sixty-second sample of EEG, from the development dataset, with per-channel seizure annotations shaded. In this example, only 3/8 channels contain seizure. b Annotation and model outputs for 10 h from C4-O2 of the same EEG recording. The EEG sample in (a) corresponds to the first 60 s of the first seizure event in (b). Models of increasing scale (the Nano, Small, Medium, Large, and Extra-Large (XL) models) become more confident, suppressing the output for non-seizure periods while maintaining high agreement in seizure periods. This ease of interpretation would benefit a clinical implementation that uses a real-time model output trace.

One notable feature is a large drop in performance at the first 10-fold scaling step, from the Nano to the Small model. This phenomenon has been observed repeatedly in other applications and is known as deep double descent21. We also see some evidence of it in the data scaling in Fig. 1a, from 1k to 10k hours of EEG.

Figure 1c presents results for the held-out validation sets across different model scales. As these datasets only have global annotations across all channels, we take the maximum over the per-channel outputs to produce a global prediction. This simplification may obscure some of the per-channel performance differences between the models. Nevertheless, the power-law trend of improvement with model scale is clear across many metrics, providing strong validation of the scaling hypothesis. The double-descent dip also appears here, although less pronounced, across several metrics on both datasets.
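For illustration, a minimal sketch of this channel-collapsing step, assuming the per-channel model outputs are stored as a (channels × time) array of seizure probabilities; the array layout and threshold are assumptions for the example only:

```python
import numpy as np

def to_global_prediction(per_channel_probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Collapse per-channel seizure probabilities (channels x time) into a
    global binary prediction: any channel exceeding the threshold counts."""
    global_probs = per_channel_probs.max(axis=0)    # maximum over channels
    return (global_probs >= threshold).astype(int)

# toy example: 3 channels, 5 output time steps
probs = np.array([[0.1, 0.2, 0.7, 0.9, 0.3],
                  [0.0, 0.1, 0.2, 0.4, 0.2],
                  [0.1, 0.6, 0.3, 0.2, 0.1]])
print(to_global_prediction(probs))  # -> [0 1 1 1 0]
```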

We do, however, see indications of diminishing returns for some metrics for the Large and XL models on the Helsinki dataset. We investigate this further later in this section.

We also quantify the effect of increasing dataset size with random sub-sampling by (1) EEG segment and (2) neonate (keeping all segments for the sampled neonates and dropping all others). We do so by training a Medium model with (sub-samples of) 80% of the development dataset and testing on the same left-out 20%. We find that in both cases there are significant performance gains, of up to 50%, from scaling the data, as illustrated in Fig. 1a. Adding more EEG segments improves performance, even for datasets >20 times larger than the nearest published work, indicating that scaling data remains a powerful lever for improving models.

Model performance

Table 2 presents a comprehensive evaluation of the XL model across the 3 datasets. Results on the test set are evaluated per channel. Combining across all channels to form a global annotation increases detection performance: for example, AUC increases from 0.978 to 0.988 and MCC increases from 0.648 to 0.703. In Table 3, we also include the limited set of metrics available for direct comparison with the literature. Despite our relatively simplistic approach to translating from per-channel to global predictions, we find that our models compare quite favourably to those published in the literature. This is true even for models that have been trained on the Helsinki dataset and report a cross-validation result.

Table 2 Performance of the XL model on 3 datasets
Table 3 Comparison of proposed model and other published models tested on the Helsinki dataset

Additionally, to assess performance at a per-neonate level, we analyse the XL model’s ability to estimate seizure burden. Table 4 shows that, on the Cork validation set, the model’s estimate of seizure burden does not differ significantly from that determined by the consensus of experts. On the Helsinki dataset the seizure burden was underestimated, but on both datasets the median difference is small enough that it is unlikely to have clinical relevance. Finally, we also assess performance on neonates without seizures and find median values of ≤0.01 min/h on both datasets, confirming the low false-detection rate of the model.

Table 4 Performance of extra-large (XL) model estimating seizure burden

The XL model attains expert-level equivalence on both the Cork and Helsinki validation datasets. In both cases, the change in agreement when replacing a human expert with the AI model predictions was consistent with 0: Δκ = −0.094 (95% CI: −0.189, 0.005) for Cork and Δκ = −0.082 (−0.156, 0.002) for Helsinki. For the Helsinki dataset, the Medium model also reaches this benchmark and the Large model is only narrowly rejected, but neither model achieves this benchmark on the Cork validation dataset. For the smaller models, such as the Nano and Small, this benchmark is well out of reach (p < 0.001). Results for all models are presented in Table 5.

Table 5 Estimates of Δκ, the change in level of agreement by replacing a human expert with the AI model predictions

Event duration analysis

Figure 3 presents the distribution of model performance for increasing seizure event durations. We find that for long seizures (>300 s) the model performs well, with a detection rate of 100%. Most of the missed events are short seizures (<30 s). The difficulty with short seizures has a more pronounced effect on the Helsinki dataset, where they were more commonly annotated.

Fig. 3: Influence of seizure event duration on detection performance.

Extra-Large (XL) model performance by event duration of consensus seizures for (a) Cork and (b) Helsinki validation datasets. A notable finding is that most of the model errors are for short seizures (<30 s), where perhaps the 16 s input segment size limits detection resolution.

Distribution shift

Although we find strong scaling performance with model size for the Cork validation set, Fig. 1c indicates diminishing returns for the Large and XL models on the Helsinki validation set. For these models, we find that the optimal classification threshold shifts down from 0.5 to 0.4 and 0.3, respectively. This may indicate that the larger, more capable models are learning some features that are useful on the training set but may not generalise to all settings.

One hypothesis for why we observe this effect in the Helsinki dataset, but not in our training data or the Cork validation set, is differences in the clinical protocols applied at different centres. The most obvious difference is that almost 50% (38/78) of the neonates had EEG recorded ≥1 week after birth, in contrast to the Cork validation set, in which all recordings were made within a week of birth. If we divide the Helsinki dataset into two groups, those with EEG recorded within a week (early-EEG group) and those with EEG recorded after the first week of life (late-EEG group), we find significant differences in primary diagnosis. A primary diagnosis of either asphyxia (including HIE) or stroke accounts for 92% (32/37) in the early-EEG group compared with just 32% (10/31) in the late-EEG group, p < 0.001 (n = 68; Fisher’s exact test). This may not be unexpected, as the suspected diagnosis would likely be the main driver for EEG monitoring.
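A minimal check of this comparison, assuming the quoted counts form the two-by-two table of group versus primary diagnosis:

```python
from scipy.stats import fisher_exact

# rows: early-EEG (<1 week) vs late-EEG (>=1 week) group
# cols: asphyxia/HIE or stroke vs other primary diagnosis
table = [[32, 5],
         [10, 21]]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.1f}, p = {p_value:.1e}")  # p well below 0.001
```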

With this division of the dataset, we find a remarkable concordance between this explanation of the distribution shift and the scaling behaviour of the two cohorts. In Fig. 4 we show that for the early-EEG group the same scaling behaviour observed in both Cork datasets is recovered. In contrast, for the late-EEG group, performance peaks at the Medium model and degrades progressively for the Large and XL models. This suggests that these more capable models are indeed learning something specific about the EEG, which may be related to the primary diagnosis or to postnatal age.

Fig. 4: Divergence of scaling behaviour between 2 groups in the Helsinki validation dataset.

a Distribution of postnatal age in weeks. b Matthews correlation coefficient (MCC) for both groups. The scaling for <1 week postnatal age tracks closely with that observed in both Cork datasets, even matching the double-descent dip for the Small model. At ≥1 week, however, we see progressive degradation for the Large and Extra-Large (XL) models relative to the Medium model.

Montage robustness

A feature of our seizure detection model is its independence from the channel montage, both the number of channels and the type of montage. To investigate this robustness, we take our predictions on the Helsinki dataset and simulate data loss or montage changes by randomly inserting contiguous sections of zeros into the per-channel model output. The final prediction is still calculated as the maximum over all channels, so the dropped data cannot contribute to the global estimate. We drop 10%, 25%, 50%, and 100% of the channel data at random in contiguous segments; here, 100% is equivalent to dropping the channel. This was applied across increasing numbers of channels until all but 1 were affected. The procedure was repeated for 20 trials.
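A minimal sketch of this simulation, assuming the per-channel model outputs are held as a (channels × time) array; the stand-in array, segment placement, and output rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_contiguous(outputs: np.ndarray, n_channels: int, fraction: float) -> np.ndarray:
    """Zero one random contiguous section covering `fraction` of each of
    `n_channels` randomly chosen channels (fraction=1.0 drops the whole channel)."""
    out = outputs.copy()
    n_ch, n_t = out.shape
    for ch in rng.choice(n_ch, size=n_channels, replace=False):
        span = int(round(fraction * n_t))
        start = 0 if span >= n_t else rng.integers(0, n_t - span + 1)
        out[ch, start:start + span] = 0.0
    return out

# stand-in for real per-channel model outputs: 18 channels, 1 h at 1 output/s
model_outputs = rng.random((18, 3600))
degraded = drop_contiguous(model_outputs, n_channels=9, fraction=0.5)

# the global prediction is still the maximum over channels,
# so the zeroed sections cannot contribute to it
global_probs = degraded.max(axis=0)
```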

The result of this experiment is shown in Fig. 5, where we summarise the impact as % degradation relative to zero data loss for both the AUC and MCC metrics. We find that the model is remarkably robust: dropping one-half of the channels degrades the AUC (MCC) by only 1.4% (7.0%). If the data loss is partial, the results are even stronger; for example, dropping 25% from 17 of 18 channels produces only a 0.5% (3.0%) drop in AUC (MCC). The upper bound on performance here is of course determined by whether sufficient information remains in the data to recover the global annotation even in principle, which depends on the spatial distribution of the seizure event.

Fig. 5: Summary of the effect of data loss on model performance on the Helsinki dataset.

Degradation is measured relative to zero data loss for the Matthews correlation coefficient (MCC) in (a) and the area under the receiver-operating-characteristic curve (AUC) in (b), using the XL model. Inset figures illustrate data loss for up to 8 channels.

Discussion

We have developed a state-of-the-art convolutional neural network for neonatal seizure detection, improving substantially upon previously published results. We have also verified our hypothesis that scaling is a hitherto under-utilised lever for performance improvement in neonatal EEG analysis. Scaling the dataset size, both by neonate and by duration of EEG, yielded up to 50% increases in MCC. Scaling model size similarly delivered significant performance improvements of up to 15% in MCC. The result of these improvements is that our best model, the 21m-parameter XL variant of the ConvNeXt architecture, attains expert-level equivalence on two independent, fully held-out validation sets (the hypothesis Δκ = 0 could not be rejected; p > 0.05).

Much of the literature focuses on methodological improvements, with specialised architectures trained on very small datasets yielding incremental gains15,16,17,19. Our work challenges this approach and suggests that a more promising path to expert-level models is through data and model scale. A key part of the model scaling strategy is designing an architecture with computationally efficient scaling; failure to do so can lead to prohibitively expensive training iterations. Scaling the fully-convolutional neural network model13, for example, to the equivalent size of the XL model would require >6 times the computational load.

Our scaling results also challenge the conventional wisdom that increasing model size will eventually lead to overfitting and decreased generalisation performance. Indeed, to date, most research in neonatal EEG has focused on relatively small models, with <50k parameters11,12,13,14,15,16,17,18,19. Despite this, model scaling well past the point of over-parameterisation has been a key feature of recent AI progress21,22,23. The observation that performance initially declines before improving with scaling is known as deep double descent and has been found to occur across a range of tasks, model architectures, and optimisation methods21. Figure 1b illustrates this finding in all metrics, with a decrease in performance for the Small model relative to the smaller Nano model. We also see indications of this in data scaling (Fig. 1a), where increasing the size of the training dataset initially decreases performance before improving again with more data. This surprising finding is a corollary of the deep double descent effect on model scale and has also been observed elsewhere21. When operating in a narrow scale range, on the left-hand side of the double-descent dip, it is understandable that smaller models and datasets would seem optimal (as found in other studies13). However, an exploration of a much larger scale range, as we show here, yields substantial benefits by moving past the double-descent trap.

A limitation in the neonatal seizure detection literature is that AUC is almost always presented as the lead, and often only, performance metric10,11,12,13,14,15,16,17,18,19. This metric can be misleading for many reasons24,25,26. For example, with large class imbalance, as is the case for electrographic seizures, false positives are obscured. To illustrate this, our worst-performing model (Nano) has an AUC of 0.980 on the Helsinki set, exceeding the best reported value of 0.96414. Our XL model improves on this only slightly, to 0.982, but has approximately 10 times fewer FD/h and achieves expert-level agreement on both held-out datasets. The Nano model, in contrast, is far from achieving expert-level agreement: its Δκ is approximately 5 times (2 times) larger on the Cork (Helsinki) validation dataset.

Addressing this limitation, we present a comprehensive set of metrics for continuous and binary variables, including more balanced measures of performance, such as MCC, Pearson’s r, and Cohen’s κ24,25,26,27, in addition to metrics with more clinical relevance, such as FD/h, correlation with seizure burden, and expert-level equivalence testing. We have developed an open-source framework for metric calculation to assist with transparency in reporting of performance for this field.

We have also highlighted the utility of developing models with per-channel annotations, making the algorithm adaptable to different clinical montage requirements or protocols. Figure 6 illustrates the heterogeneous, time-varying nature of seizure focus among EEG channels. As a result, global labels obfuscate important channel differences, similar to injecting noise into the training data. Although global labels, or weak labels13, are easier to annotate, they present only summary information without detail and therefore fail to maximise the full potential of the valuable EEG data. Because per-channel annotations provide a strong training label, the resulting models are flexible to different montages and, as Fig. 5 shows, even robust to large amounts of data loss, as is likely to occur in a clinical environment.

Fig. 6: Summary of per-channel EEG seizure annotations for 77 neonates.

a: number of channels involved in each seizure event. b: agreement among seizure annotations across channels for each seizure event, as quantified by Fleiss κ. c: total seizure duration for each neonate's EEG, estimated from each channel separately. For a small number of EEGs, F3 is replaced by Fp1 or Fp3; likewise, F4 is replaced by Fp2 or Fp4.

We found evidence of a distribution shift on the Helsinki validation set. Returns on model scaling appear to diminish after the Medium model, with the best model becoming metric dependent, indicating that the gains for the Large and XL models do not transfer as well to this dataset (see Fig. 1c). The analysis in Fig. 4 indicates that the Large and XL models are learning something specific to the early-EEG group (postnatal age <1 week) compared with the late-EEG group (≥1 week). We speculate that this could be related to subtle differences in the EEG waveforms associated with either postnatal age or, more likely, with a primary diagnosis such as HIE or stroke versus other primary diagnoses such as sepsis, meningitis, or recovery after cardiac surgery20. This suggests that future development of seizure detectors could benefit from more diverse training data, recorded from neonates at different postnatal ages and with more varied pathologies and seizure aetiologies beyond HIE and stroke.

The key result of this work is, for the first time, a thorough demonstration of an expert-level neonatal EEG seizure detector. Although this claim has been made before28, it was accompanied by some important caveats. First, it was a cross-validation result and not a held-out dataset. Second, the model failed to reach expert-level equivalence when validated on a held-out set29. Third, statistical equivalence was found for only one Δκa, when replacing one expert, and not for the overall Δκ, the average over the 3 annotators, which is what our test assesses. In contrast, we report statistical equivalence to experts on two different fully held-out datasets, with a combined total of 130 neonates and over 2.7k hours of EEG. For these reasons, we believe that our claim of expert-level equivalence is the first of its kind for neonatal seizure detection.

This study is not without limitations. The observed distribution shift on the Helsinki validation set suggests the XL model works best within the first week of life. Although seizures are most common during this period1,2,3, we should not assume that this covers all possible use cases. Another possible limitation is that our development dataset is from one centre. A promising direction for improvement on both counts is to train on a more diverse multi-centre dataset of EEG with recordings from a larger postnatal time range. And lastly, although we show that the proposed model attains expert-level agreement on our retrospective validation sets, a clinical investigation of the algorithm cotside is the best way to evaluate utility.

In conclusion, we find strong evidence that scaling training data and model size improves performance for neonatal EEG seizure detection. Held-out validation, on datasets with a combined total of 2.7k hours of multi-channel EEG from 130 neonates, found accurate and reliable generalisation performance. Achieving expert-level performance demonstrates readiness for clinical validation. Automated analysis of long-duration EEG facilitates increased seizure surveillance for at-risk neonates. This, in turn, can assist timely neuroprotective strategies to help improve long-term outcomes for vulnerable neonates in critical care.

Methods

Development dataset

EEG records from 202 term neonates were obtained via a fully-anonymised database of EEG recordings from the Cork University Maternity Hospital (CUMH), Ireland. EEG was recorded as part of ongoing clinical research studies. Informed consent was obtained from the parents or guardians, and ethical approval was obtained from the Clinical Research Ethics Committee of the Cork Teaching Hospitals. EEG recording commenced as soon as possible after birth and continued for hours or days. In most cases, EEG was recorded from term neonates with mixed aetiologies who were at risk of seizures in the neonatal intensive care unit (NICU). We also include a control subset of healthy term newborns recorded in the postnatal wards (≤2 h of EEG per neonate) to use as part of the training data30,31,32,33.

The Neurofax EEG-1200 (Nihon Kohden), NicoletOne ICU Monitor (Natus, USA), or the Lifelines EEG (iEEG Lifelines, Stockbridge, United Kingdom) machines were used to record the EEG. Sampling frequencies were set at 200, 256, or 500 Hz depending on the machine. EEG signals were recorded from the frontal (F3/F4, Fp1/Fp2, or Fp3/Fp4), temporal (T3/T4), central (C3/C4 and CZ), and occipital (O1/O2) or parietal (P3/P4) regions.

A total of 6487 h of multi-channel EEG was reviewed for seizure by two neonatal neurophysiologists (authors SRM and SV). A bipolar montage of 8 channels was used to review seizures, as shown in Fig. 2a. For the control cohort of healthy newborns, the montage was set to F4–T4, T4–P4, P4–CZ, CZ–P3, F3–T3, T3–P3 as these records did not include C3/C4 electrodes31.

Each channel was reviewed and annotated separately, resulting in 50,299 h of annotated EEG. Seizures were identified in 77 neonates. A total of 12,402 individual per-channel seizure events were annotated (see Fig. 2a for an example of per-channel annotations), with a median (interquartile range, IQR) of 48 (19 to 144) distinct seizure events per neonate. Demographic and clinical data are presented in Table 6.

Table 6 Cohort demographics according to the EEG datasets

To estimate inter-rater agreement, EEG from 13 neonates was reviewed by both neurophysiologists. Cohen’s κ indicated high inter-rater agreement, with a median κ of 0.808 (IQR: 0.702–0.874; range: 0.548–0.990). Although this is calculated on a per-channel rather than global annotation, agreement is in keeping with the previously reported estimates of inter-rater agreement: κ = 0.767 for the Helsinki dataset20 and κ = 0.827 for a Cork/London dataset34; both assessments used Fleiss κ to account for the 3 reviewers.
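For reference, a minimal sketch of a per-channel agreement computation between two annotators, assuming the annotations are aligned binary masks of shape (channels × time); whether κ is computed per channel or on channels concatenated per neonate is an assumption here:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def per_channel_kappa(rater_a: np.ndarray, rater_b: np.ndarray) -> list:
    """Cohen's kappa for each channel of two (channels x time) binary annotation masks."""
    return [cohen_kappa_score(a, b) for a, b in zip(rater_a, rater_b)]

# toy example: 2 channels, 10 time steps
a = np.array([[0, 0, 1, 1, 1, 0, 0, 0, 1, 1],
              [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]])
b = np.array([[0, 0, 1, 1, 0, 0, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]])
print(per_channel_kappa(a, b))
```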

Analysis of the per-channel annotations indicates a high degree of variability in the number of EEG channels involved in each seizure event and in the time-synchronisation of seizures across channels, as illustrated in Fig. 6. This figure also indicates that seizure burden is approximately independent of EEG channel, although the frontal channels (F3–C3 and F4–C4) appear to have a slightly lower burden compared to the other channels.

The per-channel annotations were used to develop a channel-independent algorithm. Different centres use different protocols when recording EEG, ranging from a 1-channel amplitude-integrated EEG (aEEG) to a full 10–20 electrode array of 19 channels20. Developing an AI model on a specific number of channels and a specific montage leads to models that are sensitive to that montage only. Electrodes may detach or become unusable due to artefact during recording. Sustaining a long-duration EEG recording, as is needed for seizure surveillance, without degradation of signal quality on some channels may be unrealistic, given the challenging recording environment of the NICU.

Held-out EEG validation sets

To validate the performance of our algorithms, we tested on two held-out, unseen datasets. The first dataset is a cohort consisting of EEG from 51 term neonates with mixed aetiologies at risk of seizures34. EEGs were reviewed independently by three international EEG experts, with a high level of agreement34. Although the EEG data was collected in the same location as the development dataset (CUMH), there is no reviewer overlap between this dataset and the development dataset.

The second validation dataset is an open-access neonatal EEG dataset with seizure annotations20. Again, this was reviewed by three EEG experts. The dataset consists of EEG from 79 term neonates with mixed aetiologies.

For both validation datasets, seizure annotations were global, a single label used to indicate seizure in one or more channels. We refer to the datasets according to geographic origin: the Cork and Helsinki validation sets. Table 6 includes demographic information on both datasets.

Seizure detection model

We develop a modern convolutional neural network, based on the ConvNeXt architecture35, for our seizure detection model. To test our hypothesis that increasing model scale improves performance, we implement several variants of the model related by a simple width-and-depth scaling parameterisation. All models are trained to maximise classification performance on 16 s segments of EEG. The hyperparameters and the pre- and post-processing are the same for each model (these were fixed via experiments using the smallest model). The development of these models is described in more detail in the following.

We adapt the ConvNeXt architecture35, originally designed for 2D computer vision applications, to our 1D time-series EEG data. This architecture was systematically designed for efficiency and performance. Taking inspiration from the recent success of vision transformer architectures, it was designed with purely convolutional components and achieved state-of-the-art performance across several computer-vision tasks35. The basic building block of the model is shown in Fig. 7. Notably, the use of depth-wise convolution and stacked 1 × 1 convolutional layers contributes to increased computational efficiency without sacrificing accuracy.

Fig. 7: ConvNeXt block.

Here, W is an integer parameter we use to control the width of our models. The block includes three convolutional layers: one with depth-wise convolutions (indicated by the d) and a one-dimensional kernel of length 7 samples, followed by two 1 × 1 convolutional layers, an equivalent implementation of a multi-layer perceptron. Notable features are the use of layer normalisation (LN) rather than batch normalisation and a Gaussian error linear unit (GELU) instead of the rectified linear unit (ReLU).

The detailed architecture is described in Table 7. Our parameterisation defines the network by 2 parameters: D for depth and W for width. Due to the residual structure, simply varying these two integer values allows for easy creation of model variants at different scales without any further adjustments. In this work, we explore models ranging in scale from 38.7k to 20.6m parameters; see Table 1 for the depth–width parameter settings for each model.
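As a concrete illustration, a minimal 1D PyTorch sketch of the building block in Fig. 7; the 4× expansion ratio, the normalisation placement, and the omission of details such as layer scale or stochastic depth are assumptions, not the exact configuration of Table 7:

```python
import torch
import torch.nn as nn

class ConvNeXtBlock1d(nn.Module):
    """Sketch of a 1D ConvNeXt-style block: depth-wise conv (kernel length 7),
    LayerNorm, then two 1x1 convs acting as an MLP with GELU, plus a residual
    connection. Widths and expansion ratio are illustrative assumptions."""

    def __init__(self, width: int, expansion: int = 4):
        super().__init__()
        self.dwconv = nn.Conv1d(width, width, kernel_size=7, padding=3, groups=width)
        self.norm = nn.LayerNorm(width)                     # normalise over channels
        self.pwconv1 = nn.Conv1d(width, expansion * width, kernel_size=1)
        self.act = nn.GELU()
        self.pwconv2 = nn.Conv1d(expansion * width, width, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, width, time)
        residual = x
        x = self.dwconv(x)
        x = self.norm(x.transpose(1, 2)).transpose(1, 2)    # LayerNorm on channel dim
        x = self.pwconv2(self.act(self.pwconv1(x)))
        return x + residual

# example: width W = 32 channels, 16 s of EEG at 64 Hz (1024 samples)
block = ConvNeXtBlock1d(width=32)
out = block(torch.randn(1, 32, 1024))   # shape preserved: (1, 32, 1024)
```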

Table 7 Model architecture for the proposed ConvNeXt model

Training methods

For long-duration continuous recordings, seizure events typically occupy a small fraction of the recording time, with the majority of the EEG being seizure free. A study by Rennie and colleagues described a median (IQR) total seizure burden of 69 (28 to 118) minutes over a median (IQR) of 70 (31 to 97) hours of EEG recording4. Additionally, not all neonates with EEG monitoring will have seizures: the same study found that 139 of 214 neonates did not have recorded electrographic seizures, despite the long duration of monitoring. Our development dataset reflects this imbalance, with an approximate class imbalance of 50:1. This imbalance can present a challenge for training machine-learning models, as the models can become biased towards the majority class.

The most common ways to deal with this are (a) oversampling the minority class, (b) undersampling the majority class, and (c) re-weighting the loss function. Oversampling is computationally demanding and, for large datasets such as long-duration EEG recordings, unappealing and wasteful of expensive computational resources. Undersampling is also wasteful, as a large proportion of the diverse EEG records is discarded. Loss re-weighting is usually a good option, but in our case such a large imbalance can result in large loss values which, even with gradient clipping, can destabilise learning.

Instead, we use stratified mini-batch sampling: we keep all data and dynamically undersample the non-seizure examples at random during training. From one training epoch to the next, the model sees a different sample of the non-seizure data but the same seizure data. By selecting a different random sample of non-seizure data each epoch, all of the non-seizure data is exposed to the model given a sufficient number of training epochs. In practice, we found that dynamically undersampling to a ratio of 5:1, combined with loss re-weighting to account for the remaining imbalance, was the most stable and efficient implementation.
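A minimal sketch of this dynamic undersampling, assuming per-segment binary labels; the seeding scheme and the suggestion of a loss-weighting mechanism are illustrative assumptions, not the exact training pipeline:

```python
import numpy as np

def epoch_indices(labels: np.ndarray, ratio: int = 5, seed: int = 0) -> np.ndarray:
    """One epoch's training indices: every seizure segment plus a fresh random
    sample of non-seizure segments at `ratio`:1 (5:1 as described above)."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    neg_sample = rng.choice(neg, size=min(len(neg), ratio * len(pos)), replace=False)
    idx = np.concatenate([pos, neg_sample])
    rng.shuffle(idx)
    return idx

# a different seed per epoch exposes different non-seizure segments each time;
# the residual 5:1 imbalance can then be handled by re-weighting the loss
# (e.g. the pos_weight argument of torch.nn.BCEWithLogitsLoss)
```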

All models are trained with the same hyperparameters using AdamW on a learning-rate schedule. The learning-rate schedule follows a variant of the 1-cycle policy36 with 4 phases: warmup, freeze-at-max, cooldown, then freeze-at-min. The learning rate changes logarithmically during the warmup and cooldown phases. In our experiments, we found this schedule reliably led to training convergence: 10 random initialisations of the Medium model resulted in a mean (standard deviation) relative change in MCC of just −0.083% (0.614%). This eliminated the need for early stopping based on monitoring of validation loss, which, although common practice in many machine-learning applications, we have found to be unreliable. The large variability among neonates made the early-stopping condition highly sensitive to the choice of neonates in the validation set. One approach to mitigate this is to use more than one k-fold13, but this results in several models that must somehow be ensembled. Problematically, this sensitivity of the model to the validation set raises questions about generalisation to unseen data when using this method. Additionally, a consequence of the deep double descent phenomenon, which we observe in the Results section (Fig. 1), is that early stopping will only select the best model in the special case when model size and dataset size are critically balanced21.
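For illustration, a minimal sketch of such a 4-phase schedule; the phase fractions and learning-rate bounds are assumptions, not the values used in this work:

```python
def lr_at_step(step: int, total: int, lr_min: float = 1e-5, lr_max: float = 1e-3,
               warmup: float = 0.1, hold_max: float = 0.3, cooldown: float = 0.4) -> float:
    """4-phase schedule: logarithmic warmup, hold at max, logarithmic cooldown,
    hold at min. All numeric values here are illustrative assumptions."""
    t = step / total
    if t < warmup:                                   # log-space ramp up
        return lr_min * (lr_max / lr_min) ** (t / warmup)
    if t < warmup + hold_max:                        # freeze at max
        return lr_max
    if t < warmup + hold_max + cooldown:             # log-space ramp down
        frac = (t - warmup - hold_max) / cooldown
        return lr_max * (lr_min / lr_max) ** frac
    return lr_min                                    # freeze at min

# this could be attached to AdamW through torch.optim.lr_scheduler.LambdaLR,
# with the lambda returning lr_at_step(step, total) / lr_max as the factor
```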

To improve model robustness, we developed and experimented with several data augmentation techniques. These consisted of several signal-processing transformations: magnitude scaling, magnitude warping, jitter, time warping, and spectral-phase randomisation. In addition, generic transformations such as flip, cutmix37, cutout38, and mixup39 were applied. The parameters of each augmentation were manually adjusted to ensure all transformations were label preserving. Different probabilities were assigned to each transformation for a given batch. From our experimentation, only flip and cutout gave consistent improvements in performance and were therefore included in the model development presented here. The improvement varies somewhat with scale, but the inclusion of augmentation gives a ~5% relative improvement in MCC.
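A minimal sketch of the two retained augmentations applied to a single-channel segment; the interpretation of "flip" as time reversal, the maximum cutout fraction, and the application probability are assumptions:

```python
import numpy as np

rng = np.random.default_rng()

def flip(x: np.ndarray) -> np.ndarray:
    """Time-reverse the EEG segment (the 1D analogue of a horizontal flip);
    polarity inversion (x * -1) is another label-preserving reading of "flip"."""
    return x[::-1].copy()

def cutout(x: np.ndarray, max_fraction: float = 0.2) -> np.ndarray:
    """Zero a random contiguous span of the segment (1D cutout).
    The maximum span fraction is an illustrative assumption."""
    x = x.copy()
    span = rng.integers(1, int(max_fraction * len(x)) + 1)
    start = rng.integers(0, len(x) - span + 1)
    x[start:start + span] = 0.0
    return x

# e.g. applied with some probability per training segment
segment = rng.standard_normal(16 * 64)          # 16 s at 64 Hz
augmented = cutout(flip(segment)) if rng.random() < 0.5 else segment
```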

Pre- and post-processing

Pre-processing of the EEG consisted of bandpass filtering within the 0.3–30 Hz passband, downsampling to 64 Hz, and removal of some artefacts. These artefacts were either periods of contiguous zeros, caused by checking the impedance of electrode–scalp contact, or periods of excessive high-amplitude activity, defined by a standard deviation greater than 1 mV for a segment. The EEG was divided into 16 s segments with a step size of 4 s. These segments were labelled as seizure if ≥8 s of the segment was annotated as seizure, and as non-seizure otherwise. Each channel of the segment was then used as a separate training example and was assigned a positive label if that specific channel contained a seizure annotation. This results in ~42m segments (10.5m without overlap) with a negative:positive ratio of ~50:1.
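A minimal sketch of the filtering and segmentation steps for a single channel, using SciPy/NumPy; the filter order, zero-phase filtering, and the assumption that the per-channel annotation is a per-sample binary mask aligned at 64 Hz are not specified above, and the artefact-removal step is omitted:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

def preprocess(eeg: np.ndarray, fs: int, fs_out: int = 64) -> np.ndarray:
    """Bandpass 0.3-30 Hz, then downsample one EEG channel to 64 Hz
    (4th-order zero-phase filtering is an illustrative assumption)."""
    sos = butter(4, [0.3, 30.0], btype="bandpass", fs=fs, output="sos")
    return resample_poly(sosfiltfilt(sos, eeg), fs_out, fs)

def segment(channel: np.ndarray, annotation: np.ndarray, fs: int = 64,
            win_s: int = 16, step_s: int = 4, min_seiz_s: int = 8):
    """Slice one channel into 16 s windows every 4 s; a window is labelled
    seizure if >= 8 s of its per-channel annotation is marked as seizure."""
    win, step = win_s * fs, step_s * fs
    segments, labels = [], []
    for start in range(0, len(channel) - win + 1, step):
        segments.append(channel[start:start + win])
        labels.append(int(annotation[start:start + win].sum() >= min_seiz_s * fs))
    return np.stack(segments), np.array(labels)
```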

When testing the model with a full EEG recording, we processed 16 s segments with a step size of 0.25 s. The continuous-valued output of the model is then smoothed with a 32 s rectangular window. From this probability-like output, we apply the standard threshold of 0.5 to generate the binary decision mask. Very short segments (<10 s) of seizure (non-seizure) are deleted (filled) in the final mask. We deliberately restrict our post-processing to be simple and limited in contrast to some more involved schemes in previous work10,13,40. While approaches like adding a collar to detected events or optimising the threshold can help with some metrics on some datasets10,13,40, we believe the best way to generalise well to other datasets is to rely on the model to learn the start and end of seizure events directly from the data.
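A minimal sketch of this post-processing for one channel, assuming one model output every 0.25 s; the exact ordering of the short-event delete/fill steps and the edge handling are assumptions:

```python
import numpy as np

def postprocess(probs: np.ndarray, out_fs: float = 4.0, smooth_s: float = 32.0,
                threshold: float = 0.5, min_dur_s: float = 10.0) -> np.ndarray:
    """Smooth the probability trace with a 32 s moving average, threshold at
    0.5, then flip any seizure or non-seizure run shorter than 10 s."""
    win = int(smooth_s * out_fs)
    smoothed = np.convolve(probs, np.ones(win) / win, mode="same")
    mask = (smoothed >= threshold).astype(int)

    min_len = int(min_dur_s * out_fs)
    boundaries = np.flatnonzero(np.diff(mask)) + 1
    for start, stop in zip(np.r_[0, boundaries], np.r_[boundaries, len(mask)]):
        if stop - start < min_len:               # short run: delete or fill it
            mask[start:stop] = 1 - mask[start]
    return mask
```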

All models were designed and built using the development dataset. The set was divided with a random 80:20 split of neonates: 80% of neonates’ EEG used for training and 20% for testing. When development was finalised the models were then trained on all the development dataset and tested on the held-out validation sets. There was no back-and-forth between model development and testing on the held-out datasets.

Evaluating performance

We conduct a comprehensive evaluation of the model using two complementary approaches: (1) performance metrics using human annotations as the gold standard and (2) human-expert equivalence testing. To enable reproducible research, we developed an open-source Python framework to run the evaluations (including both metrics and statistical tests) used in the study (available at https://github.com/CergenX/SPEED, commit c09f60a).

We include a range of performance metrics to avoid reliance on a single metric. Because of the many limitations associated with the area under the receiver-operating-characteristic curve (AUC)24,25,26, we opt to include more transparent measures such as Pearson’s correlation and the Matthews correlation coefficient (MCC)26,27. We also include clinically relevant measures such as false detections per hour (FD/h) and seizure burden per hour. A complete list of metrics is presented in Table 8. With multiple annotators, as we have for both our validation sets, we follow the convention of using a consensus annotation13,19,40.
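For illustration, a minimal sketch of two of the segment-level metrics computed with scikit-learn; the open-source framework above is the reference implementation, and the array layout here is an assumption:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, roc_auc_score

def segment_metrics(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> dict:
    """AUC from the continuous model output and MCC from the thresholded decision."""
    y_pred = (y_prob >= threshold).astype(int)
    return {"AUC": roc_auc_score(y_true, y_prob),
            "MCC": matthews_corrcoef(y_true, y_pred)}

# toy example
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.8, 0.6, 0.2, 0.3])
print(segment_metrics(y_true, y_prob))
```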

Table 8 Description of metrics used in this work

The metrics presented in Table 8 use annotations from a single expert or a consensus of experts as a gold standard. This approach is useful for comparing models but is often hard to interpret for clinical adoption. It also fails to capture the level of agreement among experts or to quantify performance relative to that agreement.

We evaluate performance relative to inter-rater agreement using a test developed for neonatal EEG seizure detection28,29,40. The method measures the impact of replacing each expert with the AI model predictions, quantifying the difference in inter-rater agreement using Fleiss κ to account for agreement by random chance. We define this difference in agreement for our 3-annotator held-out datasets as

$$\Delta\kappa_{a} = \kappa_{\text{experts}} - \kappa_{\text{AI},\,a} \quad \text{for } a = 1, 2, 3$$
(1)

where κexperts is the inter-rater agreement among the 3 experts and κAI,a is the agreement between 2 experts and the AI for the 3 possible combinations. An overall difference in agreement, Δκ, is estimated as the mean value of Δκa over the 3 experts. The condition Δκ = 0 indicates that the AI predictions do not change inter-rater agreement and can therefore be considered equivalent29,40. To test whether Δκ = 0, we generate a distribution of Δκ by bootstrapping: 1000 iterations of random resampling by neonate, computing Δκ for each resample. This allows us to estimate the variability in Δκ introduced by variability in inter-rater agreement as well as by model performance. If the 95% confidence interval (CI) of this distribution includes 0, we do not reject the null hypothesis that the model predictions do not significantly alter inter-rater agreement. Meeting this condition establishes expert-level performance for the AI model.
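A minimal sketch of this bootstrap, using Fleiss κ from statsmodels; the item-level layout of the annotations (e.g. per-epoch binary labels pooled across neonates) and the binary category coding are assumptions:

```python
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa, aggregate_raters

def fleiss(labels: np.ndarray) -> float:
    """Fleiss kappa for an (items x raters) array of binary labels."""
    table, _ = aggregate_raters(labels, n_cat=2)
    return fleiss_kappa(table, method="fleiss")

def delta_kappa(experts: np.ndarray, ai: np.ndarray) -> float:
    """Mean change in agreement when the AI replaces each of the 3 experts,
    as in Eq. (1). `experts`: (items x 3) binary labels, `ai`: (items,)."""
    k_experts = fleiss(experts)
    deltas = []
    for a in range(experts.shape[1]):
        replaced = experts.copy()
        replaced[:, a] = ai
        deltas.append(k_experts - fleiss(replaced))
    return float(np.mean(deltas))

def bootstrap_ci(experts, ai, neonate_ids, n_boot=1000, seed=0):
    """Bootstrap Delta-kappa by resampling neonates with replacement."""
    rng = np.random.default_rng(seed)
    ids = np.unique(neonate_ids)
    stats = []
    for _ in range(n_boot):
        sample = rng.choice(ids, size=len(ids), replace=True)
        idx = np.concatenate([np.flatnonzero(neonate_ids == i) for i in sample])
        stats.append(delta_kappa(experts[idx], ai[idx]))
    return np.percentile(stats, [2.5, 97.5])   # 95% CI; equivalence if it spans 0
```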