Background & Summary

Processing pipeline variability is a critical factor contributing to reproducibility challenges in neuroimaging research. When the same functional imaging dataset is analyzed by a variety of processing pipelines, different conclusions are drawn depending on which approaches were used1. A variety of different processing stream decisions affect final conclusions, including pipeline components on both the structural and functional side2,3. To support reproducible neuroimaging research, benchmarks must be identified for best standards and practices. One of these necessary benchmarks is gold standard manually defined brain tissue segmentations4.

Nowhere are manually defined segmentations more needed than in studying the first 1000 days of life, a dynamically changing period of brain growth and development5,6. Eighty percent of brain growth occurs during the first 1000 days of life, including dramatic synaptogenesis, myelination, and other cellular processes7,8,9. Aggregating over 100,000 participants from over 100 MRI studies, Bethlehem et al. found that the acceleration of brain growth peaks at 7 months of age, with velocity highest around the first three years of life5. Work by Alex et al. confirmed this velocity peak and showed that these trajectories of growth are linked to cognitive and motor outcomes at 2 years of age and that these trajectories differ by sociodemographic factors and adverse birth outcomes6. This dynamic period of growth complicates accurate cortical and subcortical segmentation10. The considerable myelination through the first year of life causes T1-weighted (T1w) scans (which enhance the signal of fatty tissue) and T2-weighted (T2w) scans (which enhance the signal of water) to show a contrast spin-inversion effect during this period11. Existing studies remain limited due to protocols that varied considerably in processing mechanisms, including varied early life segmentation atlases4,12,13. In this context, an atlas refers to a common set of labels for each brain structure within a whole brain MRI scan; a set of atlases can be used to segment new MRI scans and inform where common structures are across brains. Thus, a researcher can be confident that they are examining the same brain region in two different children’s brains.

Standardized infant segmentation atlases have become a critical need within research programs. The NIH has already invested $50+ million, and plans to invest hundreds of millions more, in the HEALthy Brain and Child Development (HBCD) study. This study promises to elucidate neurodevelopmental trajectories with unprecedented precision and rigor14,15 and overcome sample size limitations highlighted by Marek et al.16. This fills a critical need for measuring true effect sizes for brain-wide associations relevant to early-life outcomes. Correct structural brain segmentations are essential to this promise, especially during the first 9 months due to the dynamic processes of growth and myelination occurring17,18,19. Thus, an atlas is needed that supports the dynamic changes within this time period. Yet the availability of manually corrected segmentations from anatomical MRI data across infancy is limited4. Such corrections are time-intensive and require considerable neuroanatomic expertise, including expertise in linking MRI landmarks to neuroanatomic borders.

As field-wide momentum grows for reproducible research standards, a philosophy of open science is a necessary component of research best practices20. Without transparent research, factors that contribute to low reproducibility rates cannot be examined. In this context, as underlying manual segmentations are an impactful part of processing pipelines, it stands to reason that these segmentations should themselves be open and transparent. The primary objective of this resource was to construct a set of manually curated and expert reviewed human infant brain segmentations that adhere to FAIR21 data principles (Findable, Accessible, Interoperable, and Reusable). This dataset can be used to assess existing pipelines and/or develop new ones, such as the recently presented BIBSNet algorithm that was trained on this dataset22. Early life segmentation algorithms already exist within the literature12,23,24,25,26,27,28,29,30,31,32,33. However, many lack coverage across the whole brain (e.g., ID-Seg24, MANTiS27, iSEG challenges25, SDM U-net for subcortical23, ANUBEX30, SegSrgan29), use only T1w or only T2w images as inputs (e.g., Infant FreeSurfer12, MCRIB-S26, ID-Seg24), or are specific to neonatal periods (VINNA31) and are not reliable across the full first years of life4. In addition, the underlying training data for those algorithms is often unavailable to the scientific community (iBEAT32). Finally, widespread disagreements among researchers can exist even for well-established areas like Wernicke’s area or the hippocampus34,35.

Therefore, a lack of high-quality, publicly available training data is a major limitation to improved infant segmentation pipelines, which is often pointed out by the developers of these algorithms themselves24,31. Making such manual corrections available via open repositories would subject such segmentations to broader exposure and review, improving the rigor and fidelity of the manual segmentations. Indeed, such work has already been performed extensively in adults and even in fetal tissue (e.g.,36,37,38,39,40), and numerous segmentations have been made publicly available via repositories like OpenNeuro41.

The Baby Open Brains (BOBs) dataset addresses the need for openly available manually corrected segmentations of MRI data during the earliest periods of life4. Such a resource is critical for developers wishing to create processes for accurate automated segmentations. The curation of such a dataset requires considerable neuroanatomic expertise, including knowledge of anatomical MRI landmarks for accurate segmentations. Until now, the labor and considerable effort required to conduct such work have left much of the methods development without a ‘gold standard’ or benchmark dataset. This lack of proper benchmarks has limited the ability of pipeline developers to generalize infant processing pipelines and ensure the effectiveness of different pipelines across infant age groups, which has subsequently led to constrained pipelines tuned for particular ages.

BOBs manual segmentations will provide a benchmark for evaluating and improving automated segmentations. As infant neuroimaging expands, the number of MRI segmentation approaches is likely to grow rapidly. Already at least a half dozen early life segmentation algorithms exist within the literature12,23,24,25,26,27,28,29,30; however, few have been tested across the early life age span, cover the whole brain, and incorporate a wide breadth of labels beyond gray matter, white matter, and CSF. Combining expert and community review, the BOBs dataset provides a unique foundational benchmark for evaluating and improving image segmentation methods, as well as expanding their scope towards more comprehensive segmentations. Such benchmarks standardize methods development, as methods researchers can evaluate segmentation performance and validate tool capability against a common reference, which in turn facilitates best practices and standards in infant neuroimaging.
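To make the idea of benchmarking against manual segmentations concrete, the following minimal sketch scores an automated segmentation against a manual one using the Dice coefficient, a commonly used overlap measure. The choice of Dice is an assumption made for illustration; this paper does not prescribe a specific metric. The function name, the toy label values, and the convention that 0 means background are likewise hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def dice_per_label(manual: Sequence[int], automated: Sequence[int]) -> Dict[int, float]:
    """Return the Dice coefficient for every non-zero label found in either input.

    Both inputs are flattened label volumes of equal length.
    Label 0 is assumed (for this sketch) to mean background and is skipped.
    """
    if len(manual) != len(automated):
        raise ValueError("label volumes must contain the same number of voxels")

    manual_sizes = Counter(manual)
    automated_sizes = Counter(automated)
    # Count voxels where both segmentations assign the same label.
    overlap = Counter(m for m, a in zip(manual, automated) if m == a)

    labels = (set(manual_sizes) | set(automated_sizes)) - {0}
    return {
        label: 2 * overlap[label] / (manual_sizes[label] + automated_sizes[label])
        for label in sorted(labels)
    }


# Toy example: 8 voxels and two made-up labels (1 and 2).
manual = [0, 1, 1, 1, 2, 2, 0, 0]
automated = [0, 1, 1, 2, 2, 2, 0, 0]
scores = dice_per_label(manual, automated)
print(scores)
```

In practice the label volumes would be loaded from the segmentation images and flattened before scoring; a per-label score makes it possible to see which structures an automated pipeline handles poorly at a given age.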

These algorithms will form a necessary foundation for early-life large-scale studies such as HBCD. Automated MR processing pipelines specifically designed for early development are necessary to allow large-scale studies such as HBCD to create MR outputs unconfounded by age. With the BOBs resource providing a foundational benchmark to evaluate and improve these processing pipelines, HBCD and other future early-life neuroimaging studies will be well-equipped to provide the promised knowledge of nuanced neurodevelopmental trajectories and their complex environmental interactions.

Methods

The dataset comprises Baby Connectome Project (BCP) anatomic and segmentation MRI data

The data for the BOBs dataset is pulled from the Baby Connectome Project (BCP), a longitudinal neuroimaging study in infants 0–5 years old. Detailed methodology has been described previously19. Briefly, infants were recruited from departmental research participant registries based on both state-wide birth records and the broader communities around the University of North Carolina at Chapel Hill and the University of Minnesota. Infants were eligible for the BCP if they 1) were born at a gestational age of 37–42 weeks, 2) had a birth weight appropriate for gestational age, and 3) had an absence of major pregnancy and delivery complications. Parents provided informed consent and permission for their child’s study participation and data sharing prior to participation. All procedures were approved by the University of North Carolina at Chapel Hill (Study #16-1943) and University of Minnesota Institutional Review Board (SITE00000093). For this dataset, 71 MRI visits with good quality data from infants 1–9 months old scanned at the University of Minnesota were used. Images selected for the dataset represented best quality images based on visual review by the authors, which remains the gold standard for quality assurance in comparison to automated methods42,43. Specifically, images were inspected for signs of poor quality such as motion, ghosting, blurriness, ringing, signal drop-off or image cut-offs. MRI data was collected using a 32-channel head coil on a Siemens 3 T Prisma scanner and included high resolution T1w (MPRAGE: TR 2400 ms, TE 2.24 ms, TI 1600 ms, Flip angle 8°, resolution = 0.8 × 0.8 × 0.8 mm³) and T2w (turbo spin-echo sequences: turbo factor 314, Echo train length 1166 ms, TR 3200 ms, TE 564 ms, resolution = 0.8 × 0.8 × 0.8 mm³, with a variable flip angle) structural scans collected during natural sleep.

Segmentations were initialized using two different segmentation pipelines

As a starting point for manual reviewers, segmentations were initialized with one of two segmentation pipelines. The first segmentations were initialized from a joint label fusion (JLF) pipeline44, and then manually curated. However, this procedure required many hours of manual curation because the JLF initializations needed much coarser edits. Therefore, these initial manual segmentations were used to train “BIBSNet”22, a deep neural network built using nnU-Net45 and SynthSeg33. Using BIBSNet, other segmentations were initialized and then manually curated. Iteratively using BIBSNet prototypes as a starting point saved many hours of work, as the prototypes were much more accurate starting points than the JLF pipeline. In both pipelines, Advanced Normalization Tools (ANTs) was used to perform denoising and N4 bias field correction, and T1w and T2w images underwent a rigid-body realignment to remove distortions and improve image quality for the reviewers. Detailed information about preprocessing is provided in ref. 22 and on the BIBSNet GitHub (https://github.com/DCAN-Labs/BIBSnet).

Markers curated segmentations according to a standard operating protocol

A schematic depicting the process of segmentation initialization, correction, and upload is shown in Fig. 1. Markers attended trainings provided by the experts and had regular consultations with expert reviewers throughout the segmentation process. Marker segmentations were reviewed by expert reviewers (EF/SS/JW/DA) and modified as needed. Markers performed image segmentation edits using ITK-SNAP46 software. Initialized segmentations were overlaid on top of structural scans and manually edited. Markers utilized both the T1w and T2w scans to determine correct segmentation boundaries, such that there is one segmentation per session. As infant brains in this age range have increasing amounts of myelination in the white matter, referring to both T1w and T2w scans was critical to determining the extent of white matter. For each brain, the cortical surface and the gray-white matter boundary were edited first and reviewed. Subcortical regions were then edited, including the lateral ventricles, inferior lateral ventricles, cerebellum white matter, cerebellum cortex, thalamus, caudate, putamen, pallidum, amygdala, hippocampus, nucleus accumbens, third ventricle, fourth ventricle, and brainstem. Segmentations were done in phases, with the lateral ventricles, third ventricle, and fourth ventricle segmented first, the nucleus accumbens, caudate, putamen, and pallidum second, the brainstem, thalamus, and cerebellum third, and then the amygdala and inferior lateral ventricles last. The hippocampus was segmented separately, either before or after the rest of the subcortical segmentations. Definitions for the boundaries of these regions were pulled from previously published definitions47,48,49. A full SOP of subcortical boundaries was created (See Supplemental Information) and can be found on the OSF site50 as well as the ReadTheDocs page (https://bobsrepository.readthedocs.io).

Fig. 1

A schematic depicting the process of creating the dataset. Segmentations were initialized with an automated processing pipeline and then manually corrected, utilizing both the T1w and T2w MRI images. Segmentations were then reviewed by expert reviewers who made revisions as necessary. These images were defaced and deidentified, and uploaded to OpenNeuro. OSF acts as a hub to integrate the links to dataset images, protocols, and any other future documentation created as the dataset expands.

Approved anatomic MRI data were deidentified and defaced

Final data was stripped of identifying information and formatted into BIDS format. To deface images, T1w and T2w images were run through PyDeface using MNI infant templates as well as a custom infant mask (https://cdnis-brain.readthedocs.io/deidentification/), which masked out facial features from the scans. Final deidentified and defaced images and segmentations were version controlled with DataLad to enable data provenance.

Data Records

The BOBs dataset is available on OpenNeuro, with 71 BCP visits spanning 1–9 months of age

The BOBs dataset is available on OSF50 and OpenNeuro51. In total, segmentations were manually curated from 71 imaging visits across 51 participants. Of the 51 participants, 34 participants contributed one scan visit, 14 contributed 2 scans, and 3 contributed 3 scans to this set of segmentations. The age at scan ranged from 1–9 months old, with at least 6 scans at each month 1–8 (Fig. 2). The demographics of the dataset participants skewed White, non-Hispanic, and well-resourced (Fig. 2), with 82% of the sample identifying as White, non-Hispanic and 96% of mothers having at least a college degree. The demographics of the 51 participants pulled for the dataset did not differ statistically from the full BCP neuroimaging sample (N = 901 visits across 383 participants). Select neurodevelopmental measures, including the Mullen Scales of Early Learning, the Vineland Adaptive Behavior Scales, and subscales from the Infant Behavior Questionnaire - Revised, showed no differences between dataset participants and the full BCP sample as well (Table 1), suggesting that participants in this dataset can be considered representative of the larger BCP sample.
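As a quick consistency check on the counts reported above, the participant and visit totals can be recomputed from the per-participant visit breakdown. This is only an arithmetic sketch; the variable names are illustrative.

```python
# Number of participants grouped by how many scan visits each contributed,
# taken from the breakdown reported in the text.
participants_by_visit_count = {1: 34, 2: 14, 3: 3}

total_participants = sum(participants_by_visit_count.values())
total_visits = sum(
    n_visits * n_participants
    for n_visits, n_participants in participants_by_visit_count.items()
)

print(total_participants, total_visits)  # 51 71
```

The totals match the 51 participants and 71 imaging visits stated for the dataset.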

Fig. 2

71 scans across 51 participants make up the segmentations in the dataset. All come from the UMN site of the Baby Connectome Project (BCP), and span 1–9 months of age. The sample demographics skew White, non-Hispanic, and well-resourced.

Table 1 No differences were seen on selected neurodevelopmental scores between the participants selected for the BOBs dataset and the full BCP sample from which they were selected.

The current BOBs dataset comprises FreeSurfer-style segmentations for infants

These segmentations comprise cerebral gray/white matter and 23 subcortical structures. Uploaded segmentations went through several review stages before final approval, including a manual check by at least one expert reviewer. Using both the T1w and T2w images, care was taken to label white matter both affected and unaffected by the contrast spin-inversion effect. Diverging from FreeSurfer labels, the ventral thalamic boundary that separates thalamus from ventral diencephalon was defined by the hypothalamic sulcus52. The hippocampal label was used to define the hippocampus proper, excluding the formation at the tail along the lingual gyrus. While we think evaluating whether the SOP is “right” or “wrong” may be beyond the scope of this paper, we chose these definitions to be consistent with prior infant MRI literature53. We welcome the community to inspect and refine existing segmentations to ensure that the “gold standard” benchmarks reflect a community gold standard.

The BOBs dataset follows BIDS formatting standards

Data within the dataset follows the BIDS formatting standards54,55. Each subject folder contains one or more session folders. The “anat” subdirectory within each session folder contains the T1w and T2w image files, the associated segmentation file, and corresponding JSON files containing metadata for each file. In addition to the subject folders, the directory contains a “dataset_description.json” file, containing a description of the dataset, a “dseg.tsv” file containing a lookup table of segmentation label numbers and names, and a phenotype folder with a “sessions.json” and “sessions.tsv” that contain a list of ID numbers, session, chronological age, gestational age at birth, and sex of the participants in the dataset. The dataset also includes two non-BIDS-standard files, “index.html”, a list of links to download individual files, and “V1.0.zip”, a zipped version of the entire repository, that are included for ease of access. File organization can also be found on the BOBs ReadTheDocs page.
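The layout described above can be traversed with a few lines of standard-library Python. The sketch below is illustrative only: it builds a tiny placeholder tree in a temporary directory so that it runs without the real data, and the subject ID, session label, file names, and label entries are all hypothetical. The “index” and “name” column headers for “dseg.tsv” are an assumption based on the BIDS convention for segmentation lookup tables; the actual file in the dataset should be checked before relying on them.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


def list_anat_files(dataset_root: Path) -> Dict[str, List[str]]:
    """Map each 'sub-*/ses-*' session to the sorted file names in its anat folder."""
    sessions = {}
    for anat_dir in sorted(dataset_root.glob("sub-*/ses-*/anat")):
        key = f"{anat_dir.parent.parent.name}/{anat_dir.parent.name}"
        sessions[key] = sorted(p.name for p in anat_dir.iterdir() if p.is_file())
    return sessions


def read_label_table(dseg_tsv: Path) -> Dict[int, str]:
    """Read a tab-separated lookup table with assumed 'index' and 'name' columns."""
    with dseg_tsv.open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {int(row["index"]): row["name"] for row in reader}


def build_toy_dataset(root: Path) -> None:
    """Create a placeholder tree; every name and label here is hypothetical."""
    anat = root / "sub-000001" / "ses-1mo" / "anat"
    anat.mkdir(parents=True)
    stem = "sub-000001_ses-1mo"
    for suffix in ("T1w.nii.gz", "T1w.json", "T2w.nii.gz", "T2w.json",
                   "dseg.nii.gz", "dseg.json"):
        (anat / f"{stem}_{suffix}").touch()
    (root / "dseg.tsv").write_text("index\tname\n0\tUnknown\n1\tExample-Structure\n")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_toy_dataset(root)
    sessions = list_anat_files(root)
    labels = read_label_table(root / "dseg.tsv")

print(sessions)
print(labels)
```

Pointing `list_anat_files` at a local copy of the dataset instead of the temporary directory would list the anatomical files for every session, which is a convenient first check that a download is complete.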

Technical Validation

Manual segmentations show massive qualitative improvement over initial Joint Label Fusion segmentations

Compared to initial Joint Label Fusion segmentations, created from the DCAN infant-ABCD-BIDS pipeline56, manual segmentations show dramatic qualitative improvements. Initial segmentations had three major types of errors that were corrected by markers (see Fig. 3). First, initial segmentations often created major errors in cortical folding patterns (Fig. 3 top). The initial model may not account for differences between infant and adult image intensity, and this model failure may drive folding pattern segmentation errors. These errors required intensive edits to correct the basic gyri and sulci patterns. Additionally, due to the contrast spin inversion occurring at this age from myelination processes, labeling the full extent of unmyelinated white matter required extensive manual segmentation (Fig. 3 middle). Automated segmentations often miss unmyelinated white matter, especially along the lateral surface of the brain where myelination processes occur later in development. Finally, as exemplified in Fig. 4, subcortical regional intensities change dramatically over this time period, and thus subcortical regional boundaries often needed refining (Fig. 3 bottom).

Fig. 3

Manual segmentations show massive improvements over initial JLF segmentations. Three cases are demonstrated, showing that reviewers were able to correct errors such as cortical folding patterns, missing unmyelinated white matter, and incorrect subcortical boundaries.

Fig. 4

T1w and T2w images show dramatic developmental differences across the age range considered. (a) The selected images are from the same participant at three different ages, clearly depicting the transition from unmyelinated to myelinated white matter, and the differing image contrast intensities in the T1w vs. T2w at each age. Red arrows point out cortical gray/white matter changes, blue triangles point out internal capsule white matter changes, and green circles point out nucleus accumbens region changes (b) Cohen’s d values of white-gray matter differentiation are plotted for T1w and T2w MRI images. Considering both the T1w and the T2w images at this age group is critical to fully capture the white matter and subcortical boundaries.
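For readers unfamiliar with the effect size shown in panel (b), the sketch below computes Cohen’s d between two sets of intensity values. It assumes the common pooled-standard-deviation form of Cohen’s d; the caption does not state which variant was used or how voxels were sampled, and the intensity values here are made up purely for illustration.

```python
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d between two samples, using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two values")
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


# Made-up intensity samples standing in for white and gray matter voxels.
white_matter = [2.0, 4.0, 6.0]
gray_matter = [1.0, 3.0, 5.0]
d = cohens_d(white_matter, gray_matter)
print(d)  # 0.5
```

A value of d near zero means the two tissue classes are hard to tell apart on that contrast, while a sign change between T1w and T2w reflects the opposite ordering of white and gray matter intensities on the two contrasts.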

Dynamic brain development in infancy requires dense sampling and segmentations utilizing both T1w and T2w images

As infant brains in this age range have increasing amounts of myelination in the white matter17, referring to both T1w and T2w scans was critical to determining the extent of white matter. This brain growth is exemplified in a single infant in our dataset across three ages in Fig. 4a. In this infant, image contrast develops visibly within and across brain structures. This early time period shows rapid myelination, such that the older ages show much more myelinated white matter, especially along the major white matter tracts. Most dramatically, at 5 months in this infant there is an abundance of unmyelinated white matter that can be easily seen on the T2w image but would be easily missed on the T1w image. These developmental changes require considering both the T1w and the T2w images in this age group to fully capture the white matter and subcortical boundaries. This was especially critical in subcortical regions such as the basal ganglia, where boundaries might be visible in only the T1w or only the T2w image. The symbols on each of the images mark regions whose boundaries are clearer on one contrast than on the other.

As the largest manually curated human infant brain segmentation dataset for the critical 1–9 month age range, the BOBs dataset proved vital in developing BIBSNet22. BIBSNet is an automated segmentation pipeline necessary for HBCD MRI data preprocessing, and critical for infant pipeline development. The BOBs dataset’s critical role in developing BIBSNet establishes further external technical validation for the dataset. Prior efforts towards developing automated segmentation pipelines lacked densely sampled, manually labeled training data that would be critical for early-life longitudinal studies like HBCD4,14,17,18,19. For example, the Developing Human Connectome Project (dHCP) provides extensive anatomical segmentations that are largely restricted to neonatal and preterm infants28,57. Such segmentations are derived from the T2w but do not use the T1w; while they can be used to develop automated segmentation pipelines23,29,30, such pipelines may fail to generalize beyond the neonatal period. The Infant FreeSurfer dataset comprises data from a dozen infant sessions through the first two years of life58, and helped develop Infant FreeSurfer12, but lacks the participant density of the BOBs dataset.

Usage Notes

In addition to the repository dataset on OpenNeuro, BOBs is available at https://bobsrepository.s3.amazonaws.com/index.html. More information and additional download links are available on our ReadTheDocs page. The dataset was also linked to BrainBox (https://brainbox.pasteur.fr/), which allows users to review the dataset online.