Main

Voxel-based morphometry (VBM) is a widely used analytical approach in neuroimaging research that aims to measure differences in the local concentration of brain tissue across multiple brain magnetic resonance imaging (MRI) scans and to investigate their association with biological and psychometric variables1,2. Comparing neuroimaging data is challenging, because the intensity of MRI is not standardized, and brain structures differ across individuals. Standard VBM preprocessing addresses this by segmenting MRI scans into tissue classes and spatially normalizing the resulting tissue map to a template3,4,5,6,7,8. Finally, generalized linear models (GLMs) are fit for each voxel, modeling associations between spatially normalized tissue probabilities and the considered biological (for instance, age and sex) or psychometric variables (for instance, symptom severity or cognitive performance scores). If the GLM reveals a significant association, the corresponding voxel may be considered a region of interest (ROI), indicating a potential neural correlate of the variables under study.
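The voxel-wise statistics described above can be sketched as an ordinary least-squares GLM fit independently at every voxel. The snippet below is a minimal illustration with synthetic data; the variable names and the simulated age effect are ours and not taken from any study.

```python
import numpy as np

def voxelwise_glm_t(tissue, design, contrast_idx):
    """Fit an ordinary least-squares GLM independently at every voxel and
    return the t-statistic of one regressor.

    tissue       : (n_subjects, n_voxels) spatially normalized tissue values
    design       : (n_subjects, n_regressors) design matrix incl. intercept
    contrast_idx : column of `design` whose association is tested
    """
    n, p = design.shape
    beta, *_ = np.linalg.lstsq(design, tissue, rcond=None)
    residuals = tissue - design @ beta
    sigma2 = (residuals ** 2).sum(axis=0) / (n - p)   # per-voxel residual variance
    xtx_inv = np.linalg.inv(design.T @ design)
    se = np.sqrt(sigma2 * xtx_inv[contrast_idx, contrast_idx])
    return beta[contrast_idx] / se                    # one t-value per voxel

# Synthetic example: GM probability decreasing with age at 500 voxels
rng = np.random.default_rng(0)
age = rng.uniform(20, 70, size=100)
design = np.column_stack([np.ones(100), age])
tissue = 0.5 - 0.003 * age[:, None] + 0.01 * rng.standard_normal((100, 500))
t_map = voxelwise_glm_t(tissue, design, contrast_idx=1)
```

Voxels whose t-value survives multiple-comparison correction would form the ROIs mentioned above.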

Given that the effect sizes in VBM-based statistical analyses are typically small9, MRI datasets with thousands of participants are necessary to obtain accurate measurements10. To meet this demand, large-scale datasets have expanded to include more than 40,000 participants11, resulting in datasets that exceed 100,000 MRIs. However, preprocessing these large datasets with existing toolboxes, such as CAT127, can take weeks or even months on standard hardware, delaying scientific progress. Developing a more computationally efficient VBM preprocessing pipeline could alleviate these processing bottlenecks, allowing researchers to focus on the conceptual aspects of their studies and accelerating scientific discovery. Therefore, creating a faster VBM preprocessing pipeline is a critical step forward for structural neuroimaging research.

In recent years, deep learning has emerged as a highly effective approach for various tasks in medical image analysis12, providing state-of-the-art performance in a wide range of image-related applications, such as semantic segmentation13. The neuroimaging community has followed this trend, resulting in neural network-based tools for brain extraction14,15,16, tissue segmentation17,18, registration19,20,21,22 and other neuroimaging-specific tasks23.

However, the adoption of neural network-based methods for preprocessing still lags behind classical toolboxes, such as CAT127, SPM3 or FreeSurfer6, owing to two main reasons. First, deep learning tools often perform poorly in a realistic setting where they are applied to MRIs from scanner sites unseen during model training14. In line with ref. 24, this problem can be addressed by increasing the number of scanner sites25 and the extensive use of data augmentation15,20,24,26. Second, neural network-based tools are often specialized for only one processing step, while CAT12, SPM and FreeSurfer provide full processing pipelines for VBM and other methods such as surface-based morphometry (SBM). Tools, such as SynthMorph24, SynthStrip15 and EasyReg21, attempt to resolve this by being integrated into the FreeSurfer toolbox, serving as alternatives for parts of its preprocessing pipeline. However, to the best of our knowledge, there is no toolbox that has been developed from the ground up to fully harness the potential of neural networks across all preprocessing steps needed for VBM analysis.

We present deepmriprep, a preprocessing pipeline for VBM analysis of structural MRI data that is built to fully leverage deep learning. deepmriprep uses neural networks for the major VBM preprocessing steps: brain extraction, tissue segmentation and spatial registration with a template. Brain extraction is performed by deepbet14, the most accurate existing method to remove nonbrain voxels in T1-weighted (T1w) MRIs of healthy adults. To encompass the full VBM preprocessing, in this work, we additionally develop neural networks for tissue segmentation and registration. For tissue segmentation, we use a patch-based three-dimensional (3D) UNet approach, inspired by Isensee et al.23, which also exploits neuroanatomical properties, such as hemispheric symmetry. Nonlinear image registration is performed using a custom variant of SYMNet22, which uses a 3D UNet in conjunction with DARTEL shooting27 to predict smooth and invertible deformation fields.

The neural networks are trained on 685 MRIs compiled from 137 OpenNeuro datasets in a grouped cross-validation to ensure realistic validation performance. The worst predictions are visually inspected to identify potential weaknesses. In addition, deepmriprep is tested on 18 OpenNeuro datasets with pediatric healthy controls (HCs) and a dataset with synthetic atrophy and synthetic image artifacts. To investigate the effect of preprocessing on VBM-based statistical analyses, the VBM pipelines of deepmriprep and CAT12 are applied to 4,017 participants from three cohorts. In subsequent VBM analyses, associations with biological and psychometric variables are investigated, and the correlation of the resulting t-maps based on deepmriprep and CAT12 is analyzed. Finally, the correlations between deepmriprep- and CAT12-based tissue volume measurements are investigated in an ROI-based setting using the LPBA40 atlas28. In conclusion, our results indicate that deepmriprep is 37 times faster than CAT12 while achieving comparable accuracy in the individual preprocessing steps and strongly correlated results in the final VBM analysis.


Results

Tissue segmentation

OpenNeuro-HD

We first evaluated deepmriprep and CAT12 on 685 high-resolution adult MRIs with strict quality control (OpenNeuro-HD) using cross-dataset validation. deepmriprep demonstrated robust tissue segmentation, achieving a median Dice score DSCmedian of 95.0 across validation MRIs (Fig. 1a, Supplementary Fig. 1, Supplementary Table 1 (left) and Source Data Fig. 1). This high level of agreement with the ground truth—that is, CAT12 tissue segmentation maps derived from high-resolution MRIs (see ‘Preprocessing’ section in the Supplementary Information)—is an improvement compared with CAT12 (in original MRI resolution), which achieves a median Dice score DSCmedian of 93.1.

Fig. 1: Dice scores of tissue segmentations of deepmriprep and CAT12 in OpenNeuro-HD with worst-case segmentation examples.
figure 1

a, Dice scores of deepmriprep and CAT12 with respect to the CSF, GM, WM and foreground (mean of CSF, GM and WM) across 685 MRI scans from OpenNeuro-HD. Violin plot density traces terminate exactly at the observed minima and maxima, and the superimposed box plots represent 25th percentile (lower), median (center) and 75th percentile (upper) with whiskers extending to data points within 1.5× interquartile range (IQR) of the quartiles. b, Sagittal slice of the T1w MRI, the reference tissue map (CAT12 at 0.5 mm3 resolution) and the predicted tissue segmentation map in the sample that resulted in the lowest foreground Dice score for deepmriprep (first row) and CAT12 (second row).

Source data

The high agreement of deepmriprep’s tissue segmentation with the ground truth in terms of the Dice score is confirmed by the foreground probabilistic Dice score (pDSC 84.7) and Jaccard score (JSC 90.6) shown in Supplementary Figs. 2 and 3 and Supplementary Tables 2 and 3. The segmentation of cerebrospinal fluid (CSF) resulted in the lowest median Dice scores \({\mathrm{DSC}}_{\mathrm{median}}^{\mathrm{CSF}}\) of 91.1 for deepmriprep and 85.6 for CAT12 (Fig. 2). Furthermore, the CSF Dice scores showed the strongest outliers across all 685 validation MRIs, with minimal Dice scores \({\mathrm{DSC}}_{\min }^{\mathrm{CSF}}\) of 73.6 for deepmriprep and 62.9 for CAT12. In the tissue maps that resulted in the minimal foreground metrics \({\mathrm{DSC}}_{\min }\), \({\mathrm{pDSC}}_{\min }\) and \({\mathrm{JSC}}_{\min }\) for each method (Fig. 1b and Supplementary Figs. 2 and 3), deepmriprep and CAT12 produced a thicker outer layer of CSF than the ground truth. With respect to gray matter (GM) and white matter (WM), the tissue maps of both methods did not show notable visual differences compared with the reference maps.
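The overlap metrics reported here can be sketched as follows. The hard Dice and Jaccard scores operate on binarized maps; the probabilistic Dice shown is one common definition for continuous maps and may differ in detail from the variant used in this work.

```python
import numpy as np

def dice(pred, ref):
    """Hard Dice score (DSC) between two binary masks."""
    inter = np.logical_and(pred, ref).sum()
    return 2 * inter / (pred.sum() + ref.sum())

def jaccard(pred, ref):
    """Jaccard score (JSC); relates to Dice via JSC = DSC / (2 - DSC)."""
    return np.logical_and(pred, ref).sum() / np.logical_or(pred, ref).sum()

def prob_dice(p, q):
    """One common 'probabilistic Dice' (pDSC) for continuous maps in [0, 1]."""
    return 2 * (p * q).sum() / ((p ** 2).sum() + (q ** 2).sum())

# Per-class evaluation: binarize CSF/GM/WM probability maps via arg-max
rng = np.random.default_rng(1)
pred_probs = rng.dirichlet([1, 1, 1], size=(64, 64))  # toy 2D stand-in for 3D maps
ref_probs = pred_probs.copy()
pred_cls, ref_cls = pred_probs.argmax(-1), ref_probs.argmax(-1)
scores = {c: dice(pred_cls == i, ref_cls == i)
          for i, c in enumerate(["CSF", "GM", "WM"])}
```

The foreground score reported in the figures corresponds to the mean of the three per-class scores.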

Fig. 2: Dice scores of tissue segmentations of deepmriprep and CAT12 in OpenNeuro-Total with worst-case segmentation examples.
figure 2

a, Dice scores between deepmriprep and CAT12 across all 8,279 MRI scans from OpenNeuro with respect to the CSF, GM, WM and foreground (mean of CSF, GM and WM). Violin plot density traces terminate exactly at the observed minima and maxima, and the superimposed box plots represent 25th percentile (lower), median (center) and 75th percentile (upper) with whiskers extending to data points within 1.5× IQR of the quartiles. b, MRI input (first row) and tissue segmentation map (deepmriprep: second row, CAT12: third row) which resulted in the 0.0, 0.1, 0.2, 0.3 and 0.4 percentile foreground Dice scores across all 8,279 MRI scans.

Source data

OpenNeuro-Total

To test robustness in a realistic, heterogeneous setting, we compared deepmriprep and CAT12 on 8,279 scans from 208 datasets with only minimal quality control (OpenNeuro-Total). Despite this challenging setting, the median Dice score DSCmedian of 93.1 between deepmriprep and CAT12 tissue maps—not to be confused with Dice scores of each tool’s output compared with ground truths—showed high agreement for most of the respective tissue maps (Supplementary Fig. 4a). Again, GM and WM segmentation was most consistent with Dice scores of 96.0 for \({\mathrm{DSC}}_{\mathrm{median}}^{\mathrm{GM}}\) and 97.4 for \({\mathrm{DSC}}_{\mathrm{median}}^{\mathrm{WM}}\), while CSF segmentation resulted in a lower median Dice score of 85.9 for \({\mathrm{DSC}}_{\mathrm{median}}^{\mathrm{CSF}}\).

Despite the absence of ground truth, visually comparing tissue maps with low Dice scores—that is, low agreement between deepmriprep and CAT12—enables a qualitative assessment of each tool’s robustness.

Throughout the tissue maps with the 0.0th, 0.1th, 0.2th, 0.3th and 0.4th percentile foreground Dice scores DSC (Supplementary Fig. 4b), deepmriprep showed reasonable results with minor artifacts, while CAT12 was prone to errors. In the 0.0th and 0.1th percentile tissue maps, CAT12 produced unusable results with respective foreground Dice scores DSC of 0.0 and 1.9, compared with the reasonable tissue maps created by deepmriprep. In the 0.2th, 0.3th and 0.4th percentile tissue maps, CAT12 produced less detailed tissue maps than deepmriprep and misclassified tissue at the edge of the brain as background. deepmriprep properly classified the outer edge tissue, but misclassified areas of CSF as background in the 0.4th percentile tissue map. The same characteristic sources of errors could be found across the 16 tissue maps with the lowest agreement between deepmriprep and CAT12 (Supplementary Fig. 5), again measured by DSC. Finally, due to an error, CAT12 did not produce any tissue map for one MRI scan, while deepmriprep processed all 8,279 MRIs without any errors.

Image registration

The registration of tissue probability maps with deepmriprep resulted in a median mean squared error MSEmedian of 9.9 × 10−3 and a median linear elasticity LEmedian of 250 during cross-dataset validation (Supplementary Figs. 6, left, and 7). These metrics indicate that deepmriprep performs on par with CAT12 (MSEmedian 9.2 × 10−3, LEmedian 240). While CAT12 showed slightly better median metrics, the supervised SYMNet used within deepmriprep resulted in a smaller maximal linear elasticity across MRIs, with an \({\mathrm{LE}}_{\max }\) of 366 (CAT12: \({\mathrm{LE}}_{\max }\,386\)), and a smaller 95th percentile LE95p of 280 (CAT12: LE95p 283). This favorable linear elasticity indicates improved regularity of the deformation field for challenging probability maps—that is, maps that exhibit large voxel-wise differences from the template.
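The two registration metrics can be sketched as follows. The MSE is the voxel-wise mean squared difference to the template; for the linear elasticity, the symmetrized-Jacobian (strain) formulation and the weights below are assumptions of this sketch, and the exact LE definition used in the paper may differ.

```python
import numpy as np

def registration_mse(warped, template):
    """Voxel-wise mean squared error between a warped map and the template."""
    return np.mean((warped - template) ** 2)

def linear_elastic_energy(disp, mu=1.0, lam=0.0):
    """Simplified linear-elastic penalty of a displacement field.

    disp holds (3, X, Y, Z) displacements; the energy sums the squared
    symmetrized Jacobian (strain tensor) plus a divergence term.
    """
    grads = np.stack([np.stack(np.gradient(disp[i]), 0) for i in range(3)])
    strain = 0.5 * (grads + grads.transpose(1, 0, 2, 3, 4))   # (i, j, X, Y, Z)
    divergence = grads[0, 0] + grads[1, 1] + grads[2, 2]
    return mu * (strain ** 2).sum() + 0.5 * lam * (divergence ** 2).sum()
```

A constant displacement (pure shift) has zero gradient and therefore zero elastic energy, which is why the metric captures only the irregularity of the deformation, not its magnitude.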

For both registration methods, the same probability map resulted in the largest voxel-wise mean squared error (MSE) after registration (Supplementary Fig. 6, right). Visual inspection of this warped probability map uncovers a small misalignment at the upper edge of the ventricles for deepmriprep, indicating less rigor in aligning the map with the template. Based on the absolute voxel-wise difference to the template, no apparent differences between deepmriprep and CAT12 could be found.

VBM analyses

VBM analysis results for GM based on deepmriprep and CAT12 across all three datasets (Marburg-Münster Affective Disorders Cohort Study (MACS), Münster Neuroimaging Cohort (MNC) and BiDirect) demonstrated high similarity (Supplementary Fig. 8 and Supplementary Data 1), with strong correlation between the respective t-maps (Supplementary Table 7). The correlation of t-maps remained strong even for the psychometric variables—years of education, HC versus major depressive disorder (HC versus MDD) and intelligence quotient (IQ)—despite their smaller effects compared with the biological variables, namely age, sex and body mass index (BMI). The analyses that pooled all three datasets resulted in correlation coefficients of r > 0.8, with BMI (r = 0.75) as the only exception. The equivalence between deepmriprep- and CAT12-based analysis outcomes is also supported by their similar maximal, absolute t-scores \(| t{| }_{\max }\) (Supplementary Tables 8 and 9), especially for age and HC versus MDD. deepmriprep resulted in a larger maximal, absolute t-score for sex and age and smaller maximal, absolute t-scores for IQ, years of education and BMI. The difference in maximal values and the reduced t-map correlation for BMI was primarily driven by a large cluster in the outer cerebellum, which appeared only in the CAT12-preprocessed data (Supplementary Fig. 8).
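The t-map comparison reduces to a Pearson correlation over voxels; a minimal sketch, assuming the maps are restricted to a common brain mask before correlating:

```python
import numpy as np

def tmap_correlation(t_a, t_b, mask=None):
    """Pearson correlation between two t-maps, optionally restricted to a
    brain mask so that background voxels do not inflate the coefficient."""
    if mask is not None:
        t_a, t_b = t_a[mask], t_b[mask]
    return np.corrcoef(t_a.ravel(), t_b.ravel())[0, 1]
```

Masking matters: the large zero background shared by both maps would otherwise push r toward 1 regardless of agreement inside the brain.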

The correlation coefficients of r > 0.8 also hold for the analyses using the MACS dataset and BiDirect dataset individually, again with BMI being the exception due to CAT12-based large clusters in the outer cerebellum (Supplementary Figs. 9 and 11). For analyses using only the MNC dataset, BMI-based t-maps strongly correlated with r = 0.83, while the sex-based t-maps resulted in a correlation coefficient of r = 0.72 (Supplementary Fig. 10).

The VBM results in WM (Supplementary Figs. 12–15) also exhibit strong correlations of r > 0.8 between most t-maps. Again, the BMI-based t-maps showed the lowest correlation caused by CAT12-based large clusters in the cerebellum. In addition, sex-based t-maps for the MNC dataset, BiDirect dataset and the pooled analysis showed lower correlation coefficients of 0.71, 0.71 and 0.73, respectively, and the HC versus MDD analysis for the BiDirect dataset resulted in a correlation coefficient of r = 0.74.

Finally, ROI-based GM volume measurements of deepmriprep and CAT12 exhibit strong correlations of r > 0.9 across all of the 29 ROIs of the LPBA40 atlas, with the only exception being the GM volumes of the brainstem region with a correlation of r = 0.75 (Supplementary Figs. 16–21). For WM, the caudate (r = 0.80), hippocampus (r = 0.81) and cerebellar lobe (r = 0.89) showed the lowest correlation coefficients, with all remaining regions resulting in correlations of r > 0.9.

Processing time

deepmriprep achieved the highest processing speed on both low-end and high-end hardware (Supplementary Fig. 22) across the 8,279 scans from 208 datasets with only minimal quality control (OpenNeuro-Total). On high-end hardware, deepmriprep took an average of 4.6 s per MRI using the graphics processing unit, whereas CAT12, parallelized across all 16 cores of the high-end processor, required an average of 173 s per MRI. On low-end hardware, deepmriprep and CAT12 took 209 s and 1,096 s per MRI, respectively.

Discussion

We present deepmriprep, a neural network-based pipeline specifically built for VBM preprocessing. deepmriprep is 37 times faster than CAT12, a leading toolbox known to be the more efficient alternative to FreeSurfer for SBM preprocessing. For preprocessing a large dataset containing 100,000 MRI scans such as the UK Biobank11, this translates into a reduction of computation time from 6 months to 5 days. Despite this speed, deepmriprep delivers equivalent or better accuracy in tissue segmentation and registration. Most importantly, statistical maps based on deepmriprep preprocessing show strong correlations with the respective CAT12-based VBM results.

It should be highlighted that assigning reliability to any statistical VBM map in the absence of a gold standard or ground truth is inherently difficult29, aggravated by the large amounts of data required to achieve sufficient statistical power10. Therefore, the differences between the results of the deepmriprep- and CAT12-based VBM analyses should be interpreted with caution. Consequently, more research is needed to advance VBM from a scientific tool for detecting group-level differences to a reliable clinical application for individual patient diagnosis.

Furthermore, it should be noted that CAT12 is not solely a VBM toolbox but also offers SBM, while the current version of deepmriprep is limited to VBM, a shortcoming that should be addressed in a future version of deepmriprep to gain adoption in the broad neuroimaging community. Moreover, the nonlinear registration could be improved by optimizing the affine matrix in conjunction with the warping field, thereby avoiding any potential biases of the initial affine registration. Finally, the training data quality could be improved to further increase deepmriprep’s accuracy. One promising, straightforward approach would be to use more training data with higher image quality, for instance, by using MRIs acquired with increased scan time, increased matrix size and reduced slice thickness. In addition, human expert annotations can be used to generate high-quality tissue segmentation maps and deformation fields, which can then be combined with lower-quality MRIs from the same session as input data. This would train the neural networks to predict high-quality tissue segmentation maps and deformation fields, even if the input MRI is of lower quality.

Although deepmriprep’s high processing speed and user-friendly interface are its main advantages, its underlying software design may hold even greater implications for future development (https://github.com/wwu-mmll/deepmriprep)30. The software is structured into small, modular components, each comprising fewer than 1,000 lines of code. This streamlined design improves long-term maintainability and reduces the likelihood of potentially far-reaching bugs5,31. Most importantly, the straightforward software architecture of deepmriprep reduces the barrier for researchers in VBM and other neuroimaging domains, making it easier to understand, adapt and reuse the code for various neuroimaging pipelines. We anticipate that the broader adoption of deepmriprep into other neuroimaging pipelines will advance the underlying methods, thus fostering progress in the broader neuroscience community.

Methods

Datasets

This study uses existing data from 225 datasets published on the OpenNeuro platform32 downloaded via the openneuro-py version 2023.1.0 Python package (https://github.com/hoechenberger/openneuro-py). OpenNeuro data from adult HCs were used for training and validation with cross-validation, while OpenNeuro data from HCs aged 2–12 years were used for testing, along with a Synthetic Atrophy dataset and three patient datasets: the Münster Neuroimaging Cohort (MNC), the Marburg-Münster Affective Disorders Cohort Study (FOR2107/MACS) and the BiDirect study. Data availability is governed by the respective consortia. No new data were acquired for this study.

Training and validation datasets

OpenNeuro-Total

Out of the over 700 datasets available at OpenNeuro at the time of compilation (10 November 2021), each dataset that contained at least five T1w MRIs from at least five adult HCs was included, resulting in 208 datasets. Based on a successive visual quality check, 30 MRIs were excluded, mainly due to improper masking (Supplementary Fig. 23a) and erroneous orientation (Supplementary Fig. 23b). The remaining compilation of 8,279 T1w MRIs is used as the OpenNeuro-Total dataset (Supplementary Fig. 24 and Supplementary Table 10).

OpenNeuro-HD

The 8,279 MRIs from OpenNeuro-Total were preprocessed using the commonly used CAT12 toolbox (https://neuro-jena.github.io/cat/) with default parameters7. To ensure high quality of the training data, strict quality thresholds were set based on the preprocessing quality ratings provided by the toolbox. To be included, all ratings had to be at least a B− grade, resulting in the following thresholds: a surface Euler number below 25, a surface defect area under 5.0, a surface intensity root mean square error below 0.1, and a surface position root mean square error below 1.0. All OpenNeuro datasets that contained fewer than ten adult HCs after this quality control were excluded. In the remaining datasets, MRIs were ranked according to the surface defect number, and finally the top five MRIs per dataset that passed a visual quality check were included in the dataset. This results in a total of 685 MRIs from 137 datasets called OpenNeuro-HD (Supplementary Fig. 24, middle, Supplementary Table 10 and Source Data Fig. 2).
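The threshold-based inclusion can be sketched as a simple table filter. The column names and values below are illustrative placeholders, not CAT12's actual rating fields, and the ranking column stands in for the surface defect number.

```python
import pandas as pd

# Hypothetical quality table; column names are placeholders, not CAT12's
# actual output fields
qc = pd.DataFrame({
    "dataset": ["ds0001"] * 3 + ["ds0002"] * 2,
    "euler_number": [12, 30, 8, 20, 4],
    "defect_area": [2.1, 6.0, 1.0, 4.9, 0.5],
    "intensity_rmse": [0.05, 0.12, 0.03, 0.09, 0.02],
    "position_rmse": [0.6, 1.2, 0.4, 0.9, 0.3],
})
passed = qc[(qc.euler_number < 25) & (qc.defect_area < 5.0)
            & (qc.intensity_rmse < 0.1) & (qc.position_rmse < 1.0)]
# Rank within each dataset and keep the top five scans
top5 = passed.sort_values("defect_area").groupby("dataset").head(5)
```

In the actual pipeline, datasets with fewer than ten remaining adult HCs would additionally be dropped before the per-dataset top-five selection.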

Test datasets

OpenNeuro-Kids

Among the over 700 datasets available at OpenNeuro at the time of compilation (10 November 2021), each dataset that contained at least five T1w MRIs from at least five HCs in the age range from 2 to 12 years was included, resulting in 18 datasets. Based on a successive visual quality check, 300 MRIs were excluded, mainly due to strong motion artifacts (Supplementary Fig. 23c) and improper masking (Supplementary Fig. 23d). The remaining compilation of 867 T1w MRIs is used as the OpenNeuro-Kids dataset (Supplementary Fig. 24, right, Supplementary Table 10 and Source Data Fig. 1). The CAT12 preprocessing for OpenNeuro-Kids used the TMP_Age11.5 template.

Synthetic atrophy and synthetic artifacts

Published by Rusak et al.33, this dataset uses T1w MRIs of 20 HCs from the Alzheimer’s Disease Neuroimaging Initiative34 to synthetically introduce global neocortical atrophy. Simulating ten progressions of atrophy, ranging from 0.1 mm to 1 mm of global thickness reduction, the resulting dataset consists of 220 T1w MRIs (including the 20 originals) and their respective ground-truth tissue maps.

To additionally investigate the influence of scanner effects, we introduce artificial artifacts in the 20 original T1w MRIs using Rician noise, bias field, blurring, ghosting, motion, ringing and spike artifacts (Supplementary Fig. 25). Each of the seven artifacts is applied with medium and strong intensity, resulting in 480 synthetic MRIs.

VBM analysis datasets

For the VBM analyses, we use a total of 4,017 MRIs from three independent German cohorts (Supplementary Fig. 26): the Marburg-Münster Affective Disorders Cohort Study (MACS; N = 1,799), the Münster Neuroimaging Cohort (MNC; N = 1,194) and the BiDirect cohort (N = 1,024). All three cohorts include subsamples with both patients with MDD and HCs who are free from any lifetime mental disorder diagnoses according to DSM-IV criteria.

Marburg-Münster Affective Disorders Cohort Study (FOR2107/MACS)

Patients were recruited through psychiatric hospitals, while the control group was recruited via newspaper advertisements. Patients diagnosed with MDD showed varying levels of symptom severity and underwent various forms of treatment (inpatient, outpatient or none). The FOR2107/MACS was conducted at two scanning sites: University of Münster and University of Marburg. Inclusion criteria for the present study were availability of completed baseline MRI data with sufficient MRI quality. Further details about the structure of the FOR2107/MACS35 and MRI quality assurance protocol36 are provided elsewhere.

Münster Neuroimaging Cohort (MNC)

In MNC, patients were recruited from local psychiatric hospitals and underwent inpatient treatment due to a moderate or severe depressive disorder. Further information regarding this study can be found in refs. 37,38.

BiDirect

The BiDirect Study is a prospective project that comprises three distinct cohorts: patients hospitalized for an acute episode of major depression, patients up to 6 months after an acute cardiac event, and HCs randomly drawn from the population register of the city of Münster, Germany. Further details on the rationale, design and recruitment procedures of the BiDirect study have been described in refs. 39,40.

Preprocessing

All datasets are preprocessed using the VBM pipeline of version 12.8.2 of the CAT12 toolbox, which was the latest version available at the time of analysis, with default parameters7. The affine transformation calculated during this initial CAT12 preprocessing is used such that tissue segmentation (see ‘Tissue segmentation’ section in the Methods) and image registration (see ‘Image registration’ section in the Methods) are consistently applied in the template coordinate space. Image registration is based on GM and WM probability maps in the standard resolution of 1.5 mm (113 × 137 × 113 voxels).

For tissue segmentation, unprocessed MRIs are affinely registered to the template in a high resolution of 0.5 mm (339 × 411 × 339 voxels) using B-spline interpolation, and the CAT12 preprocessing is repeated on the basis of these high-resolution MRIs. This circumvents any potential image degradation caused by additional resizing of the CAT12 tissue map. Because no ground-truth tissue maps exist, these high-resolution tissue maps are used as reference maps for model training and validation. Because the MRIs are skull-stripped before tissue segmentation in deepmriprep’s prediction pipeline (see ‘Prediction pipeline’ section in the Methods), all voxels in the MRI that do not contain tissue in the respective tissue map are set to zero. Furthermore, the standard N4 bias correction41 is applied using the ANTS package4 to avoid interference with potential artificial bias fields introduced during data augmentation (see ‘Data augmentation’ section in the Methods). Finally, min–max scaling between the 0.5th and 99.5th percentile is used as proposed in ref. 23 with one modification: values above the maximum are not clipped to one but scaled via the function \(f(x)=1+{\log }_{10}x\) to prevent any loss of crucial information in areas with extreme intensity values (for example, blood vessels). The code for the input preprocessing is publicly accessible at https://github.com/wwu-mmll/deepmriprep-train (ref. 42).
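The modified min–max scaling can be sketched as follows; treating only nonzero (skull-stripped) voxels when computing the percentiles is an assumption of this sketch.

```python
import numpy as np

def scale_intensity(mri):
    """Min-max scaling between the 0.5th and 99.5th intensity percentile.
    Values above the maximum are log-compressed via f(x) = 1 + log10(x)
    instead of being clipped, so extreme intensities (for example, blood
    vessels) stay distinguishable."""
    lo, hi = np.percentile(mri[mri > 0], [0.5, 99.5])
    x = np.clip((mri - lo) / (hi - lo), 0, None)   # clip below, not above
    high = x > 1
    x[high] = 1 + np.log10(x[high])
    return x

# A single extreme voxel ends up slightly above 1 instead of being clipped
mri = np.concatenate([np.linspace(0.01, 1.0, 999), [10.0]])
scaled = scale_intensity(mri)
```

Note that f(x) = 1 + log10(x) is continuous at x = 1 (f(1) = 1), so the transition from linear to logarithmic scaling introduces no jump.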

Data augmentation

Data augmentation is used during training to artificially introduce image artifacts that may occur in real-world datasets. This increases model generalizability, because effects that are infrequent in the training data can be systematically oversampled with any desired intensity. Data augmentations for the image registration step would have to be consistent with equation (2), requiring specialized implementations. Hence, for the current version of deepmriprep, data augmentation is omitted during image registration model training.

The 12 different data augmentations used during model training (Supplementary Fig. 27) are implemented in the niftiai version 0.3.2 Python package (https://github.com/codingfisch/niftiai)43 and introduce artificial bias fields, motion artifacts, noise, blurring, ghosting, spike artifacts, downsampling, translation, flipping, brightness, contrast and Gibbs ringing. Bias fields are generated by applying an inverse Fourier transform to low-frequency Gaussian noise, whereas motion artifacts, ghosting44, spike artifacts45 and Gibbs ringing46 are achieved by introducing artifacts in the k-space of the T1w MRI. To be MRI-specific, noise is sampled out of a chi distribution47, a generalization of the Rician noise distribution48. Instead of using the full set of affine and nonlinear spatial transformations, only translation and flipping are applied via nearest-neighbor resampling to circumvent any potential image degradation.
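Two of these augmentations can be sketched in simplified form; the actual niftiai implementations may differ (for example, true Rician noise perturbs the complex k-space signal rather than being purely additive), and the cutoff and strength parameters below are illustrative.

```python
import numpy as np

def random_bias_field(shape, cutoff=4, strength=0.3, rng=None):
    """Smooth multiplicative bias field: inverse Fourier transform of
    low-frequency Gaussian noise."""
    rng = rng if rng is not None else np.random.default_rng()
    k = np.zeros(shape, dtype=complex)
    low = (slice(0, cutoff),) * len(shape)
    k[low] = rng.standard_normal((cutoff,) * len(shape)) \
        + 1j * rng.standard_normal((cutoff,) * len(shape))
    field = np.fft.ifftn(k).real
    field /= np.abs(field).max()            # normalize to [-1, 1]
    return 1 + strength * field

def chi_noise(shape, dof=2, sigma=0.05, rng=None):
    """Chi-distributed noise (Rayleigh for dof=2), simplified to an
    additive term for illustration."""
    rng = rng if rng is not None else np.random.default_rng()
    comps = sigma * rng.standard_normal((dof,) + tuple(shape))
    return np.sqrt((comps ** 2).sum(axis=0))

rng = np.random.default_rng(0)
mri = np.ones((16, 16, 16))
augmented = mri * random_bias_field(mri.shape, rng=rng) + chi_noise(mri.shape, rng=rng)
```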

Tissue segmentation

To achieve high-quality tissue segmentation, a cascaded 3D UNet approach, inspired by Isensee et al.23, is applied to a cropped region of 336 × 384 × 336 voxels in the high-resolution MRI (see ‘Preprocessing’ section in the Methods). This specific cropping is chosen to make the image dimensions divisible by 16 (required by UNet), without excluding voxels which potentially contain tissue. The first stage of the cascaded UNet processes the whole image with a reduced resolution of 0.75 mm (224 × 256 × 224 voxels). In the second stage, the original resolution of 0.5 mm is processed using a patchwise approach, which incorporates the prediction from the first stage in its model input. For each patch position, an individual UNet is trained (see ‘Training procedure’ section in the Methods). Both stages of the UNet architecture are identical with respect to the use of the rectified linear unit (ReLU) activation function, instance normalization49, a depth of 4, and the doubling of the number of channels with increasing depth, starting with 8 channels. The implementation of the model and the training procedure is publicly accessible via GitHub at https://github.com/wwu-mmll/deepmriprep-train (ref. 42).

Patchwise UNet

The second stage of the cascaded UNet subdivides the 336 × 384 × 336 voxels of the high-resolution MRI into 27 patches, each containing 128 × 128 × 128 voxels (Supplementary Fig. 28). For each of these 27 patches, a specific UNet model is trained (see ‘Training procedure’ section in the Methods). To minimize the number of voxels in a patch that do not contain any tissue, the positions of the patches are optimized based on the tissue segmentation masks of the 685 MRIs in OpenNeuro-HD. This iterative optimization starts from a regular grid of 3 × 3 × 3 patches that covers the total volume. Then, each patch is moved stepwise by one voxel toward the image center until this would cause a tissue voxel in one of the 685 MRIs not to be covered by the patch. To exploit the brain’s bilateral symmetry, each patch on the left hemisphere is moved in lockstep with its corresponding patch on the right hemisphere.
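The greedy patch placement can be illustrated with a one-dimensional sketch along a single axis; the real implementation operates on 3D masks and additionally pairs hemispheric patches.

```python
import numpy as np

def optimize_patch_starts(tissue_any, patch=128, n=3, size=336):
    """Greedily shift patch start positions toward the volume center until a
    further move would leave a tissue voxel uncovered (1D, single axis).

    tissue_any : boolean profile, True where any training map contains tissue
    """
    starts = np.linspace(0, size - patch, n).round().astype(int)
    center = size // 2
    for i in range(n):
        step = int(np.sign(center - (starts[i] + patch // 2)))
        while step != 0:
            cand = starts[i] + step
            if cand < 0 or cand + patch > size:
                break
            covered = np.zeros(size, dtype=bool)
            for j, s in enumerate(starts):
                s = cand if j == i else s
                covered[s:s + patch] = True
            if (tissue_any & ~covered).any():
                break                     # a tissue voxel would be uncovered
            starts[i] = cand
    return starts

# Tissue spans voxels 60-275: the outer patches move inward until they hit it
tissue = np.zeros(336, dtype=bool)
tissue[60:276] = True
starts = optimize_patch_starts(tissue)
```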

Before patches from the right hemisphere are applied to the UNet, they are flipped along the sagittal axis such that they resemble left-hemisphere patches. The resulting prediction is then flipped back. This approach reduces the number of effective patch positions for which individual UNets need to be trained from 27 to 18.

Close to the border of a patch, the accuracy of the prediction typically decreases. Therefore, predictions close to the border of a patch are weighted less via Gaussian importance weighting23 during accumulation of the final prediction containing 336 × 384 × 336 voxels.
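Gaussian importance weighting can be sketched as follows; the sigma of one-eighth of the patch size follows the nnU-Net convention, and the accumulation helper is our illustration of how the weighted patches are blended.

```python
import numpy as np

def gaussian_importance(patch_size=128, sigma_scale=0.125):
    """3D Gaussian weight map that down-weights predictions near patch
    borders (sigma = patch_size / 8, as in nnU-Net)."""
    ax = np.arange(patch_size)
    g = np.exp(-0.5 * ((ax - (patch_size - 1) / 2) / (sigma_scale * patch_size)) ** 2)
    w = g[:, None, None] * g[None, :, None] * g[None, None, :]
    return w / w.max()

def accumulate(patch_preds, starts, shape, patch=128):
    """Blend overlapping patch predictions into one full-volume map."""
    w = gaussian_importance(patch)
    out, norm = np.zeros(shape), np.zeros(shape)
    for pred, (x, y, z) in zip(patch_preds, starts):
        out[x:x + patch, y:y + patch, z:z + patch] += w * pred
        norm[x:x + patch, y:y + patch, z:z + patch] += w
    return out / np.maximum(norm, 1e-12)
```

In overlap regions, a voxel's final value is thus a weighted average that trusts whichever patch sees the voxel closest to its center.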

Multilevel activation function

Based on SPM3, the tissue maps produced by CAT12 contain continuous values ranging from 0 to 3. The values 1, 2 and 3 correspond to CSF, GM and WM, while 0 indicates that the respective voxel does not contain any tissue. The histogram of the template tissue map (Supplementary Fig. 29, right) shows that values close to 0, 1, 2 and 3 are more frequent than intermediate values. Furthermore, smaller peaks can be observed around the values 1.5 and 2.5, which correspond to the mixed classes CSF–GM and GM–WM that CAT12 introduces. To introduce an inductive bias toward this desired value distribution, the final layer of the tissue segmentation UNet utilizes a custom multilevel activation function inspired by Hu et al.50. This custom multilevel activation is achieved through a summation of sigmoid functions,

$$f(x)=S(\alpha x)+\mathop{\sum }\limits_{i\in [1.5,2.,2.5,3.]}\frac{S(\alpha (x-i))}{2}\,\,\,\,{\mathrm{with}}\,\,\,\,S(x)=\frac{1}{1+{{\rm{e}}}^{-x}},$$

with α being a parameter of the neural network that is optimized during model training. This function (Supplementary Fig. 29, left) successfully maps a normal distribution—that is, the typical output distribution of a neural network—to the desired value distribution with peaks at 0, 1, 1.5, 2, 2.5 and 3 (Supplementary Fig. 29, middle). In combination with an MAE loss, this multilevel activation function facilitates the training of the tissue segmentation model.
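
A NumPy sketch of this activation, with a fixed illustrative α in place of the learned parameter, follows the displayed equation term by term: one full sigmoid step from 0 to 1, plus four half-height sigmoids shifted to 1.5, 2, 2.5 and 3.

```python
import numpy as np

def multilevel_activation(x, alpha=10.0):
    """Multilevel activation: a full sigmoid plus four half-height
    sigmoids, yielding plateaus at 0, 1, 1.5, 2, 2.5 and 3 (alpha is
    learned during training; alpha=10 here is only for illustration)."""
    s = lambda z: 1.0 / (1.0 + np.exp(-z))
    return s(alpha * x) + sum(s(alpha * (x - i)) / 2
                              for i in (1.5, 2.0, 2.5, 3.0))
```

For large α the function saturates at the plateau values, so pre-activations well below 0 map near 0 and those well above 3 map near 3.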

Training procedure

The two stages of 3D UNets are trained in a cascaded fashion. The first stage model is trained for 60 epochs on full-view MRIs with a resolution of 0.75 mm (224 × 256 × 224 voxels) using a batch size of 1.

The training of the 3 × 3 × 3 patchwise approach in the second stage is more complex and results in 18 models, each dedicated to one of the 18 effective patch positions (3 × 3 × 3 = 27 minus the 9 flipped right-hemisphere patches; see ‘Patchwise UNet’ section in the Methods). First, a single UNet is trained on all effective patch positions for two epochs as a foundation model. This foundation model is then fine-tuned for each of the 18 patch positions, using solely the respective patch position for 20 epochs. For all patch-based training, the batch size is set to 2, and the flip augmentation is disabled for patches located on the left and right hemispheres. The input of the second-stage model consists of patches of the original MRI, concatenated with the respective patch of the first-stage predictions upsampled to the image resolution of 0.5 mm. All models are trained with the one-cycle learning rate schedule51 using a maximal learning rate of 0.001, which follows the default settings of the fastai library52.
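
For illustration, the one-cycle schedule can be sketched as a cosine warm-up followed by a cosine anneal (the warm-up fraction and the divisors for the initial and final learning rates assumed here follow fastai's documented defaults; this is a sketch, not the library's implementation):

```python
import numpy as np

def one_cycle_lr(step, total, max_lr=1e-3, pct_start=0.25,
                 div=25.0, final_div=1e5):
    """One-cycle schedule sketch: cosine warm-up from max_lr/div to
    max_lr over the first pct_start of training, then cosine annealing
    down to max_lr/final_div."""
    warm = pct_start * total
    if step <= warm:
        t = step / warm
        start = max_lr / div
        return start + (max_lr - start) * (1 - np.cos(np.pi * t)) / 2
    t = (step - warm) / (total - warm)
    end = max_lr / final_div
    return end + (max_lr - end) * (1 + np.cos(np.pi * t)) / 2
```

The learning rate thus peaks at the configured maximum of 0.001 a quarter of the way through training and decays to a near-zero value by the final step.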

Image registration

To motivate the neural network-based image registration used for deepmriprep, we first review standard image registration approaches. In standard image registration such as DARTEL27, given an input image I and a template J, the sum of the image dissimilarity D and a regularization metric R weighted by a regularization parameter Λ,

$$L({\bf{I}},{\bf{J}},{\boldsymbol{\Phi }})=D({\bf{I}}\cdot {\boldsymbol{\Phi }},{\bf{J}})+\Lambda R({\boldsymbol{\Phi }})$$

is minimized via the deformation field Φ. In the loss function L used in this standard approach, the regularization parameter Λ controls the trade-off between image similarity and the regularity of the deformation field. In CAT12, the default metric for image similarity is simply the MSE between the moving and the target image

$$D({\bf{I}}\cdot {\boldsymbol{\Phi }},{\bf{J}})=\mathrm{MSE}({\bf{I}}\cdot {\boldsymbol{\Phi }},{\bf{J}})=\frac{1}{| \varOmega | }\mathop{\sum }\limits_{{\bf{p}}\in \varOmega }| | {\bf{I}}\cdot {\boldsymbol{\Phi }}({\bf{p}})-{\bf{J}}({\bf{p}})| {| }^{2},$$

and the regularization term is the linear elasticity of the deformation field Φ

$$R({\boldsymbol{\Phi }})=\int \left(\mu \parallel \epsilon ({\boldsymbol{\Phi }}){\parallel }^{2}+\frac{\lambda }{2}{(\mathrm{tr}(\epsilon ({\boldsymbol{\Phi }})))}^{2}\right){\rm{d}}{\bf{x}},$$
(1)

where μ is the weight of the shearing elasticity and λ is the weight of the zoom (volume-change) elasticity.
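
A 2-D finite-difference sketch of this regularizer (a simplification for illustration; the actual implementations operate on 3-D deformation fields) computes the strain tensor ε(Φ) = (∇u + ∇uᵀ)/2 of the displacement field u and sums the two penalty terms over the grid:

```python
import numpy as np

def linear_elasticity_2d(u, mu=1.0, lam=1.0):
    """Linear elasticity of a 2-D displacement field u of shape
    (2, H, W): strain eps = (grad u + grad u^T) / 2, energy
    mu * ||eps||^2 + lam/2 * tr(eps)^2, summed over all grid points."""
    du0 = np.gradient(u[0])  # [d u0 / d axis0, d u0 / d axis1]
    du1 = np.gradient(u[1])
    e00, e11 = du0[0], du1[1]          # normal strains
    e01 = 0.5 * (du0[1] + du1[0])      # shear strain
    frob = e00**2 + e11**2 + 2 * e01**2
    trace = e00 + e11
    return np.sum(mu * frob + 0.5 * lam * trace**2)
```

A constant displacement (pure translation) has zero strain and thus zero penalty, whereas a uniform zoom is penalized through both terms.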

To guarantee that the deformations are invertible, registration frameworks27,53,54 consider the deformation as the solution of an initial value problem of the form

$$\frac{{\rm{d}}{\boldsymbol{\Phi }}(s;{\bf{x}})}{{\rm{d}}s}={\bf{v}}({\boldsymbol{\Phi }}(s;{\bf{x}}),s)\,\,\,\,\mathrm{with}\,\,\,\,{\boldsymbol{\Phi }}(0;{\bf{x}})={\bf{x}}.$$
(2)

The mapping x ↦ Φ(s; x) defines a family of diffeomorphisms for all times s ∈ [0, τ]. Hence, it is guaranteed that an inverse of the mapping exists, which can be computed through backward integration. As proposed in DARTEL (diffeomorphic anatomical registration through exponentiated lie algebra), using a stationary velocity field framework instead of the large deformation diffeomorphic metric mapping (LDDMM) model55,56 allows the velocity field v to be constant over time. With this simplification, the regularity of the deformation field—that is, its smoothness and invertibility—is automatically enforced via forward integration (also called shooting) of this constant velocity field. In this way, a smooth and invertible deformation field can be found by iteratively optimizing the velocity field v with respect to L using a gradient descent approach.

SyN registration57 additionally enforces symmetry between the forward (image to template) deformation field Φ and the backward (template to image) deformation field Φ−1. SyN considers the full forward and backward deformations to be compositions of the half deformations \({{\boldsymbol{\Phi }}}^{\frac{1}{2}}\) and \({{\boldsymbol{\Phi }}}^{-\frac{1}{2}}\) via

$${\boldsymbol{\Phi }}={{\boldsymbol{\Phi }}}^{\frac{1}{2}}\cdot {\left({{\boldsymbol{\Phi }}}^{-\frac{1}{2}}\right)}^{-1}\,\,\,\,{\mathrm{and}}\,\,\,\,{{\boldsymbol{\Phi }}}^{-1}={{\boldsymbol{\Phi }}}^{-\frac{1}{2}}\cdot {\left({{\boldsymbol{\Phi }}}^{\frac{1}{2}}\right)}^{-1}.$$

Based on this consideration, SyN adds the dissimilarity between the image and the backward deformed template D(I, J ⋅ Φ−1) and the dissimilarity between the half forward deformed image and the half backward deformed template \(D({\bf{I}}\cdot {{\boldsymbol{\Phi }}}^{\frac{1}{2}},{\bf{J}}\cdot {{\boldsymbol{\Phi }}}^{-\frac{1}{2}})\) to arrive at the loss function

$$L({\bf{I}},{\bf{J}},\varPhi )=D({\bf{I}}\cdot \varPhi ,{\bf{J}})+D({\bf{I}},{\bf{J}}\cdot {\varPhi }^{-1})+D\left({\bf{I}}\cdot {\varPhi }^{\frac{1}{2}},{\bf{J}}\cdot {\varPhi }^{\frac{-1}{2}}\right)+\Lambda R(\varPhi ).$$
(3)

Using the diffeomorphic mapping in equation (2), velocity fields \({{\bf{v}}}^{\frac{1}{2}}\) and \({{\bf{v}}}^{-\frac{1}{2}}\) are used to generate the half deformations \({{\boldsymbol{\Phi }}}^{\frac{1}{2}}\) and \({{\boldsymbol{\Phi }}}^{-\frac{1}{2}}\).

Model architecture and training

The neural network-based image registration framework used for deepmriprep is based on SYMNet22 and uses a UNet to predict the forward and backward velocity fields \({{\bf{v}}}^{\frac{1}{2}}\) and \({{\bf{v}}}^{-\frac{1}{2}}\). Analogous to the SyN registration, these velocity fields are integrated according to equation (2) to arrive at the half deformation fields \({{\boldsymbol{\Phi }}}^{\frac{1}{2}}\) and \({{\boldsymbol{\Phi }}}^{-\frac{1}{2}}\) via the scaling and squaring method27,53 with τ = 7 time steps (Supplementary Fig. 30).
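
Scaling and squaring can be illustrated in 1-D (a simplification of the 3-D grid-sampling implementation; function names are illustrative): the velocity field is divided by 2^τ, and the resulting small displacement field is composed with itself τ times, which corresponds to 2^τ integration steps.

```python
import numpy as np

def scaling_and_squaring_1d(v, tau=7):
    """Scaling-and-squaring sketch in 1-D: scale the velocity field v
    by 1 / 2**tau, then square (self-compose) the displacement field u
    tau times, giving the deformation phi(x) = x + u(x)."""
    n = v.shape[0]
    x = np.arange(n, dtype=float)
    u = v / 2 ** tau                      # scaling step
    for _ in range(tau):                  # squaring: u <- u + u(x + u)
        u = u + np.interp(x + u, x, u)    # linear interpolation of u at x + u
    return x + u
```

For a zero velocity field the identity deformation is recovered, and a constant velocity field yields a pure translation, as expected for a flow that is constant over time.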

Similar to the neural network architecture used for tissue segmentation (see ‘Tissue segmentation’ section in the Methods), the UNet uses instance normalization49, has a depth of 4, and doubles the number of channels with increasing depth, starting with 8 channels. However, we apply two modifications: (1) LeakyReLU58 instead of ReLU activation layers, and (2) a hyperbolic tangent (tanh) activation function in the final layer, ensuring that the UNet’s output conforms to the value range of −1 to 1 used for image coordinates by PyTorch59. The model is trained for 50 epochs using the one-cycle learning rate schedule with a maximal learning rate of 0.001.

During initial tests, training with the standard SyN loss function (equation (3)) led to major artifacts in the predicted velocity and deformation field (Supplementary Fig. 31b). To avoid these artifacts, we tested supervised approaches (Supplementary Fig. 31c–e) that utilize deformation fields created by CAT12 (Supplementary Fig. 31a). Using an iterative approach, we determined the velocity fields \({{\bf{v}}}_{{\rm{CAT}}}^{\frac{1}{2}}\) and \({{\bf{v}}}_{{\rm{CAT}}}^{-\frac{1}{2}}\) that produce these given deformation fields ΦCAT and \({{\boldsymbol{\Phi }}}_{{\rm{CAT}}}^{-1}\) in our PyTorch-based implementation and used these velocity fields as targets. Using the MSE, the resulting loss function Lv measures disagreements between the predicted velocity fields \({{\bf{v}}}^{\frac{1}{2}}\) and \({{\bf{v}}}^{-\frac{1}{2}}\) and the targets via

$${L}_{{\bf{v}}}\left({{\bf{v}}}^{\frac{1}{2}},{{\bf{v}}}^{-\frac{1}{2}}\right)=\frac{1}{| \varOmega | }\mathop{\sum }\limits_{{\bf{p}}\in \varOmega }| | {{\bf{v}}}_{\mathrm{CAT}}^{\frac{1}{2}}({\bf{p}})-{{\bf{v}}}^{\frac{1}{2}}({\bf{p}})| {| }^{2}+| | {{\bf{v}}}_{\mathrm{CAT}}^{-\frac{1}{2}}({\bf{p}})-{{\bf{v}}}^{-\frac{1}{2}}({\bf{p}})| {| }^{2}.$$

Using this loss function, the predicted velocity fields show fewer artifacts, but based on the resulting Jacobian determinant field JΦ, some inaccuracies remain (Supplementary Fig. 31c). The Jacobian determinant indicates the volume change caused by the deformation for each voxel. By explicitly adding the MSE between the predicted and ground-truth Jacobian determinant fields, the loss function

$$\begin{array}{l}{L}_{{\bf{v}},{{\bf{J}}}_{{\boldsymbol{\Phi }}}}\left({{\bf{v}}}^{\frac{1}{2}},{{\bf{v}}}^{-\frac{1}{2}}\right)=\frac{1}{|\varOmega |}\mathop{\sum }\limits_{{\bf{p}}\in \varOmega }||{{\bf{v}}}_{\mathrm{CAT}}^{\frac{1}{2}}({\bf{p}})-{{\bf{v}}}^{\frac{1}{2}}({\bf{p}})|{|}^{2}\\ +||{{\bf{v}}}_{\mathrm{CAT}}^{-\frac{1}{2}}({\bf{p}})-{{\bf{v}}}^{-\frac{1}{2}}({\bf{p}})|{|}^{2}+||{{\bf{J}}}_{{\boldsymbol{\Phi }},\mathrm{CAT}}({\bf{p}})-{{\bf{J}}}_{{\boldsymbol{\Phi }}}({\bf{p}})|{|}^{2}\end{array}$$

improves the regularity of the predicted velocity fields and of the resulting Jacobian determinant field (Supplementary Fig. 31d). Finally, we reintroduce the original loss function LSyN as

$${L}_{\mathrm{supervised}}\left({{\bf{v}}}^{\frac{1}{2}},{{\bf{v}}}^{-\frac{1}{2}}\right)={L}_{{\bf{v}},{{\bf{J}}}_{\Phi }}\left({{\bf{v}}}^{\frac{1}{2}},{{\bf{v}}}^{-\frac{1}{2}}\right)+\beta {L}_{\mathrm{SyN}}$$

with β set to 2 × 10−5. The fields predicted with this approach, called supervised SYMNet or sSYMNet, do not show any apparent artifacts (Supplementary Fig. 31e). The implementation of the model and the training procedure is publicly accessible via GitHub at https://github.com/wwu-mmll/deepmriprep-train (ref. 42).
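
A minimal NumPy sketch of the combined supervised loss (a simplification for illustration; the SyN term is assumed to be computed elsewhere and passed in, and shapes and names are hypothetical):

```python
import numpy as np

def supervised_loss(v_f, v_b, v_f_cat, v_b_cat, jac, jac_cat,
                    l_syn=0.0, beta=2e-5):
    """Supervised sSYMNet loss sketch: MSE on both predicted velocity
    fields against the CAT12-derived targets, MSE on the Jacobian
    determinant field, plus the original SyN loss weighted by beta."""
    mse_field = lambda a, b: np.mean(np.sum((a - b) ** 2, axis=-1))
    loss = mse_field(v_f, v_f_cat) + mse_field(v_b, v_b_cat)
    loss += np.mean((jac - jac_cat) ** 2)
    return loss + beta * l_syn
```

With perfect velocity and Jacobian predictions, only the β-weighted SyN term remains, reflecting its role as a small corrective regularizer.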

Cross-dataset validation and evaluation metrics

We follow best practices by applying a five-fold cross-dataset validation—that is, cross-validation with datasets grouped together—using the 137 datasets from OpenNeuro-HD (see ‘Training and validation datasets’ section in the Methods). This ensures realistic performance measures, because all reported results are obtained on datasets unseen during training of the respective model. We apply the same folds across tissue segmentation, image registration and GM masking (see ‘Gray matter masking’ in the Supplementary Information) to avoid data leakage between these processing steps. The test datasets (see ‘Test datasets’ section in the Methods) and the datasets in OpenNeuro-Total and OpenNeuro-Kids that are not part of OpenNeuro-HD (see ‘OpenNeuro-Total’ section in the ‘Results’) are evaluated with an additional model trained on the full OpenNeuro-HD dataset.
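
Grouping the folds by dataset can be sketched as follows (a hypothetical helper, not the actual implementation): every image inherits the fold of its dataset, so no dataset contributes to both training and validation.

```python
import numpy as np

def dataset_folds(dataset_ids, n_folds=5, seed=0):
    """Cross-dataset validation sketch: shuffle the unique dataset IDs,
    distribute them round-robin over n_folds groups, and assign every
    image to the fold of its dataset."""
    rng = np.random.default_rng(seed)
    unique = rng.permutation(np.unique(dataset_ids))
    fold_of = {ds: i % n_folds for i, ds in enumerate(unique)}
    return np.array([fold_of[ds] for ds in dataset_ids])
```

All images from one dataset always land in the same fold, which is exactly the grouping property that prevents leakage between training and validation splits.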

Given that the distribution of performance metrics across images is often skewed, the median is used as a measure of central tendency, complemented by a visual inspection of negative outliers.

To evaluate tissue segmentation and GM masking performance, we use the Dice score DSC, the probabilistic Dice score pDSC and the Jaccard score JSC. The image registrations are evaluated based on the regularity of the deformation field and the dissimilarity between the warped input and the template image. This dissimilarity is measured using the voxel-wise MSE between the images. The deformation field’s regularity—that is, its physical plausibility—is quantified via the linear elasticity LE (equation (1)).
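
For reference, the Dice and Jaccard scores for binary masks can be computed as follows (the probabilistic Dice score pDSC applies the same overlap formula to probability maps instead of binary masks):

```python
import numpy as np

def dice(a, b):
    """Dice score: twice the overlap divided by the total mask sizes."""
    inter = np.sum(a * b)
    return 2 * inter / (np.sum(a) + np.sum(b))

def jaccard(a, b):
    """Jaccard score: intersection over union of two binary masks."""
    inter = np.sum(np.logical_and(a, b))
    union = np.sum(np.logical_or(a, b))
    return inter / union
```

Both scores equal 1 for identical masks and 0 for disjoint ones; the Dice score is always at least as large as the Jaccard score.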

Prediction pipeline

The complete deepmriprep pipeline used before the GLM analysis in the ‘VBM analyses’ section in the ‘Results’ consists of six steps: brain extraction, affine registration, tissue segmentation, tissue separation, nonlinear registration and smoothing. After brain extraction using deepbet14 with default settings, affine registration is applied using the sum of the MSE (between image and template) and the Dice loss (between image brain mask and template brain mask). The affine registration is implemented in torchreg60 with zero padding—sensible after brain extraction—and the default two-stage setting with 500 iterations at 12 mm³ and successive 100 iterations at 6 mm³ image resolution. After tissue segmentation (see ‘Tissue segmentation’ section in the Methods) and before image registration (see ‘Image registration’ section in the Methods), we apply GM masking in the ventricles and around the brain stem to conform the probability masks to an undocumented step in the existing CAT12 preprocessing (see ‘Gray matter masking’ section in the Supplementary Information). After image registration of the GM and WM probability masks, Gaussian smoothing with a 6-mm full width at half maximum (FWHM) kernel is applied. In line with all previous steps, smoothing (a simple convolution operation) is implemented in PyTorch, enabling graphics processing unit acceleration throughout the entire prediction pipeline.
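
The FWHM-to-sigma conversion underlying the smoothing step can be sketched as follows (the 1.5-mm voxel size assumed here is illustrative; smoothing a 3-D volume then reduces to three separable 1-D convolutions with this kernel):

```python
import numpy as np

def gaussian_kernel_1d(fwhm_mm, voxel_mm=1.5, truncate=3.0):
    """Build a normalized 1-D Gaussian kernel from a FWHM in
    millimetres: sigma = FWHM / (2 * sqrt(2 * ln 2)), converted to
    voxel units, truncated at +/- truncate * sigma."""
    sigma = fwhm_mm / (2.0 * np.sqrt(2.0 * np.log(2.0))) / voxel_mm
    radius = int(truncate * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()
```

Because the Gaussian is separable, applying this kernel along each of the three axes is equivalent to a full 3-D Gaussian convolution, which keeps the operation a cheap sequence of 1-D convolutions on the graphics processing unit.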

VBM analyses

To investigate the effect of different preprocessing pipelines on the VBM analyses, deepmriprep- and CAT12-preprocessed data are used to examine statistical associations with both biological variables (age, sex and BMI) and psychometric variables (years of education, MDD versus HC, and IQ). To ensure the reliability of results, each VBM analysis is repeated 100 times with a randomly selected 80% subset of the data. Finally, the median t-map across these 100 VBM analyses is used to compare the VBM results of deepmriprep and CAT12.
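
The repeated-subsampling scheme can be sketched as follows (`t_stat_fn` is a hypothetical callable that fits the GLM on one subset and returns a voxel-wise t-map; the helper itself is illustrative):

```python
import numpy as np

def median_t_map(t_stat_fn, data, n_repeats=100, frac=0.8, seed=0):
    """Repeated-subsampling sketch: compute the voxel-wise statistic on
    n_repeats random subsets of size frac * n (drawn without
    replacement) and return the voxel-wise median t-map."""
    rng = np.random.default_rng(seed)
    n = len(data)
    maps = [t_stat_fn(data[rng.choice(n, int(frac * n), replace=False)])
            for _ in range(n_repeats)]
    return np.median(maps, axis=0)
```

Taking the voxel-wise median across the repetitions damps the influence of individual subsets, yielding a more stable t-map for the comparison between the two preprocessing pipelines.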

Statistics and reproducibility

We follow best practices by applying a five-fold cross-dataset validation—that is, cross-validation with datasets grouped together—during training and extensive testing of the models across multiple test datasets. In addition, the number of datasets was maximized by collecting MRI data from OpenNeuro and performing quality checks, resulting in 225 datasets. Each VBM analysis is repeated 100 times with a randomly selected 80% subset of the data. No statistical method was used to predetermine sample size.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.