Abstract
Automated brain volumetry shows promise for improving the screening and monitoring of neurodegenerative diseases. However, the reliability of measurements across different scanners and software remains uncertain. This study assessed the reliability of gray matter, white matter, and total brain volume measurements from seven volumetry tools, using six scanners across two scanning sessions performed within 2 h on the same day in twelve subjects. Generalised estimating equations models showed significant effects of both software and scanner on all measurements, with a stronger effect of software (p < 0.001). The percentage coefficient of variation (CV) was calculated to measure scan-rescan reliability. The median CV across scanners of AssemblyNet and AIRAscore was less than 0.2% for gray and white matter and less than 0.09% for total brain volume, while FreeSurfer, FastSurfer, syngo.via, SPM12, and Vol2Brain had CVs greater than 0.2%. Bland-Altman analysis showed no systematic differences, but limits of agreement differed greatly between methods. Based on these findings, we recommend using the same scanner and software combination across sessions to ensure that observed changes in brain volume are reliable and clinically valuable.
Introduction
With recent advancements in artificial intelligence (AI), an increasing number of brain volumetric tools, including tools certified as medical devices, are available on the market1,2,3. Brain volumetric analyses are promising tools for quantifying brain volume loss in neurodegenerative diseases. For instance, they can be used to assess Alzheimer's disease, other dementias, and subtypes of Parkinson's syndromes4,5,6,7, and to monitor brain and spinal cord atrophy in multiple sclerosis to predict clinical outcomes and monitor therapy response8. For the results of automated volumetry to be of clinical value, it is important to understand the scan-rescan reliability of measurements9. Several factors affect the reliability of volumetric measurements: first, subject movement during the scan; second, the scanner's intrinsic signal-to-noise ratio and inhomogeneities of the B0 and B1 fields, including differences in field strength10; third, the sequence parameters affecting the contrast-to-noise ratio between grey and white matter11; and fourth, the segmentation algorithm used7. These factors impact both scientific research and clinical practice when comparing measurements from temporally spaced volumetric examinations4,5,6. Therefore, there is a need to investigate the reliability of volumetric software with regard to reproducibility of measurements. Several studies have compared research brain volumetric tools with regard to their test-retest performance7,8. Nevertheless, knowledge about the performance of certified medical devices and new scientific AI-based segmentation tools is scarce. This study therefore aims to systematically investigate the effect of different scanners of the same vendor on brain volumetric measurements using seven different segmentation algorithms, including certified medical device software, new well-performing AI-based tools, and established scientific tools (e.g. FreeSurfer, which is still one of the most widely used volumetric research tools12).
Results
Demographics
Twelve healthy subjects (6 women, 6 men) with a mean age of 35.3 years (± 8.5 years) were examined between March 2021 and November 2021.
Generalized estimating equations results
Effect of software and scanner on measured gray matter volume
In the analysis of the effect of session, scanner, and software on gray matter (GM) volume measurement, significant main effects were observed for software (Wald χ² = 22377.50, df = 6, p < 0.001) and scanner (Wald χ² = 91.76, df = 5, p < 0.001) but not for session (Wald χ² = 1.47, df = 1, p = 0.23) – see Table 1. The interaction between session and software was not statistically significant (Wald χ² = 2.10, df = 5, p = 0.834), but a significant interaction was found between session and scanner (Wald χ² = 30.46, df = 6, p < 0.001). However, post-hoc analysis showed that only the interaction between session and the Vida scanner was significant (Wald χ² = 4.224, df = 1, p = 0.040), which most likely represents an alpha error. Moreover, the interaction between scanner and software was significant (Wald χ² = 1.279 × 10¹², df = 13, p < 0.001). Specifically, significant interactions were found between the Aera scanner and AIRAscore software (Wald χ² = 265.229, df = 1, p < 0.001), FastSurfer software (Wald χ² = 5.465, df = 1, p = 0.019), and FreeSurfer software (Wald χ² = 35.167, df = 1, p < 0.001); the Aera3 scanner and AIRAscore software (Wald χ² = 261.100, df = 1, p < 0.001), FastSurfer software (Wald χ² = 4.596, df = 1, p = 0.032), and FreeSurfer software (Wald χ² = 10.623, df = 1, p = 0.001); the Avanto scanner and AIRAscore software (Wald χ² = 293.339, df = 1, p < 0.001), FastSurfer software (Wald χ² = 23.571, df = 1, p < 0.001), and FreeSurfer software (Wald χ² = 32.964, df = 1, p < 0.001); the Vida scanner and AIRAscore software (Wald χ² = 38.278, df = 1, p < 0.001) and FastSurfer software (Wald χ² = 17.227, df = 1, p < 0.001); and the Vidafit scanner and AIRAscore software (Wald χ² = 233.987, df = 1, p < 0.001), FastSurfer software (Wald χ² = 7.889, df = 1, p = 0.005), FreeSurfer software (Wald χ² = 13.692, df = 1, p < 0.001), SPM12 software (Wald χ² = 21.262, df = 1, p < 0.001), syngo.via software (Wald χ² = 16.381, df = 1, p < 0.001), and Vol2Brain software (Wald χ² = 11.859, df = 1, p < 0.001).
Furthermore, the three-way interaction among session, scanner, and software (Wald χ² = 1.445 × 10¹², df = 15, p < 0.001) was also significant.
Effect of software and scanners on measured white matter volume
In the analysis of the effect of session, scanner, and software on white matter (WM) volume measurement, significant main effects were observed for both software (Wald χ² = 2218.32, df = 6, p < 0.001) and scanner (Wald χ² = 255.22, df = 5, p < 0.001) but not for session (Wald χ² = 0.78, df = 1, p = 0.38) – see Table 2. The interaction between session and scanner was not statistically significant (Wald χ² = 9.00, df = 5, p = 0.109). A weakly significant interaction was found between session and software (Wald χ² = 16.91, df = 6, p = 0.01); however, post-hoc analysis showed no significant interaction between any software and session. Moreover, the two-way interaction between scanner and software was significant (Wald χ² = 1.376 × 10¹², df = 12, p < 0.001). Specifically, significant interactions were found between the Aera scanner and AIRAscore software (Wald χ² = 111.928, df = 1, p < 0.001), FastSurfer (Wald χ² = 15.697, df = 1, p < 0.001), and FreeSurfer (Wald χ² = 14.240, df = 1, p < 0.001); the Aera3 scanner and FreeSurfer (Wald χ² = 20.801, df = 1, p < 0.001); the Avanto scanner and AIRAscore software (Wald χ² = 130.165, df = 1, p < 0.001), FastSurfer software (Wald χ² = 36.870, df = 1, p < 0.001), FreeSurfer software (Wald χ² = 34.647, df = 1, p < 0.001), syngo.via software (Wald χ² = 5.167, df = 1, p = 0.023), and Vol2Brain software (Wald χ² = 4.712, df = 1, p = 0.030); the Vida scanner and AIRAscore software (Wald χ² = 69.986, df = 1, p < 0.001), FastSurfer software (Wald χ² = 20.466, df = 1, p < 0.001), FreeSurfer software (Wald χ² = 58.623, df = 1, p < 0.001), and syngo.via software (Wald χ² = 7.710, df = 1, p = 0.005); and the Vidafit scanner and AIRAscore software (Wald χ² = 65.878, df = 1, p < 0.001), SPM12 software (Wald χ² = 9.629, df = 1, p = 0.002), and syngo.via software (Wald χ² = 26.540, df = 1, p < 0.001). The three-way interaction among session, scanner, and software (Wald χ² = 55944.44, df = 11, p < 0.001) was also significant.
Reproducibility of measurements for gray matter volume, white matter volume and total brain volume
For GM volume measurements, Vol2Brain, FastSurfer, FreeSurfer, SPM12, and syngo.via had a median CV of less than 1%. Only AssemblyNet and AIRAscore reached a median CV and an IQR of less than 0.2% (see Table 3; Fig. 1).
For WM volume measurements, AssemblyNet, AIRAscore, and FastSurfer showed a median CV of less than 0.2%, with the first two outperforming FastSurfer on the IQR. FreeSurfer, SPM12, and syngo.via achieved a median CV between 0.23% and 0.37%. Vol2Brain showed considerably lower performance with a median CV of 1.7% (see Table 4; Fig. 2).
Total brain volume (TBV), as the largest volume and with an intrinsic error tolerance to GM versus WM misclassifications, showed, as expected, the smallest CV of the three evaluations. For TBV, AssemblyNet and AIRAscore outperformed the other software solutions with a median CV of less than 0.1% and an IQR of less than 0.2%. FastSurfer, FreeSurfer, and SPM12 had a median CV of less than 0.2% and an IQR less than or around 0.2%. Syngo.via resulted in a median CV of 0.2% but with an IQR of around 0.3%, while Vol2Brain was most affected by the difference in WM volume estimates between measurements and showed the lowest performance (see Table 5; Fig. 3).
Bland-Altman plots with individual subjects and scanners revealed no systematic deviations for individual scanners or subjects and no systematic influence of the size of the measured volume on the difference, while the limits of agreement showed a similar pattern to the CV for the different software solutions (see supplemental material).
Discussion
This study assessed the scan-rescan reliability of seven brain volumetric software solutions by examining within-day scans across six different scanners. First, scanner, software, and session effects were examined with generalized estimating equations (GEE). As observed in previous studies, the software had the largest impact on measured volumes, showing significant variations in measured volumes13,14,15. Additionally, while the scanner was a relevant factor, its effect was smaller than that of the software, underpinning the relevance of using the same setup (scanner, software, and sequences) for follow-up examinations14. In terms of reliability, AssemblyNet and AIRAscore showed the lowest measurement error between scanning sessions using the same scanner, achieving a median CV of less than 0.09% for TBV. Thus, the CV for both solutions falls within the range of TBV changes observed in healthy middle-aged subjects and is below the annual decline of 0.5% to 1% seen in multiple sclerosis (MS) patients of the same age group15. Furthermore, since total brain atrophy correlates with cognitive impairment in MS patients8,16, quantitative TBV measurements during therapy could help identify patients at risk, provided measurement error is taken into account. Moreover, a previous study concluded that a CV below 2% is desirable in Alzheimer's disease patients9. All tools but Vol2Brain had a 75th percentile CV below 2% for gray matter and white matter, suggesting clinical utility in neurodegenerative diseases. Nevertheless, measurement error should be as small as possible to allow for the detection of pathologic volume changes even over shorter follow-up periods.
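As a rough illustration of this reasoning (not a calculation performed in the study), the smallest change distinguishable from scan-rescan noise at the 95% level can be approximated as 1.96 · √2 ≈ 2.77 times the within-subject variability, here taken as the reported CV:

```python
import math

def smallest_detectable_change(cv_percent: float) -> float:
    """Approximate smallest real change (in %) distinguishable from
    scan-rescan noise at the 95% level: 1.96 * sqrt(2) * within-subject CV.
    Assumes the CV approximates the within-subject SD as a percentage."""
    return 1.96 * math.sqrt(2) * cv_percent

# Illustrative values based on the CVs reported above:
print(smallest_detectable_change(0.09))  # TBV CV of AssemblyNet/AIRAscore
print(smallest_detectable_change(2.0))   # the 2% CV threshold cited for Alzheimer's disease
```

Under this approximation, a tool with a TBV CV of 0.09% could distinguish changes of roughly 0.25%, which is below the 0.5% to 1% annual decline cited for MS patients; this is only a sketch of the detectability argument, not a result from the paper.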
In this study, we used six Siemens scanners. We selected the Siemens Prisma scanner as the "reference" due to its widespread use in research and its high signal-to-noise ratio. This approach allowed us to efficiently manage the initial exploration of scanner/software interactions and to provide results for any scanner/software combination, assessing their impact on the final volumetric result. However, deep learning models trained solely on Siemens data, such as AssemblyNet, may perform better on our dataset, which may explain the narrower boxplots in the scan-rescan setting for AssemblyNet compared to AIRAscore or FastSurfer and the relatively small impact of scanner on absolute volumes. Including scanners from different manufacturers could yield different results for all tools; therefore, our results cannot be generalized to non-Siemens scanners17.
Our results indicate that deep learning tools such as AssemblyNet, AIRAscore, and FastSurfer demonstrate high reliability. A previous study investigating the scan-rescan reliability of FreeSurfer showed that the software had high reproducibility for total segmented brain volume regardless of scanners, head coils, or sequences18. Our findings align with these results, showing no significant effect of scanner-coil combinations on volume measurements. Another study that examined the research tools FreeSurfer (v7.3), FSL-FAST, CAT12, and ANTs for gray matter, white matter, and total intracranial volume found that all methods yielded consistent and reproducible measurements across subjects, with variability well below 1%, though there was notable variability between methods19. Similarly, we observed considerable variability between the methods tested in our study. Moreover, our data expand on previous results by comparing the scan-rescan reliability of seven software solutions, including CE-labeled commercially available products and deep learning tools. It is important to emphasize that this study does not aim to endorse any particular tool but to highlight the significant issue of variability in volumetric analysis. We stress the importance of recognizing the variability introduced by different software and scanners over time, which must be considered when quantifying clinically relevant changes in brain volume.
This study has several limitations. One limitation is the lack of datasets from a broader range of subjects; a larger dataset would help in calculating correction factors for volumetric variability. To address this limitation, we used a homogeneous set of participants and scan protocols; however, only a few datasets were available for certain specific scanner-software combinations, limiting statistical analysis. Future studies should aim to acquire more datasets from different subjects using the same software to build a broader database for calculating correction factors for volumetric variability. A higher number of subjects would increase the sample size for specific scanner-software subgroups and therefore result in higher statistical power. Also, we do not yet have data for older patients, as this first study deliberately targeted a very homogeneous group of subjects. To reach more generalizable results, further studies with a larger number of more diverse subjects and different scanner vendors are needed. Additionally, only T1-weighted scans were used in this study for volumetric analysis; future studies could investigate the effects of other sequences, such as synthetic 3D T1 datasets derived from other imaging contrasts20. Due to availability, we only used scanners from one manufacturer. On the one hand, this allowed a very homogeneous design of the measurement protocols; on the other hand, it limits the generalizability of the results. Some software solutions (AssemblyNet) have been trained solely on Siemens data and might therefore perform better in this setting than on data from different scanners. It is also important to assess the impact of hydration status on volumetric measures, considering age and the influence of medications such as cortisone and antipsychotic drugs, which are known to affect brain volume21.
The decision to focus on gray matter and white matter may have obscured differences in smaller brain volumes across software solutions and scanners. A previous study found CVs ranging between 1.6% for the caudate and 6.1% for the thalamus22. Different scanner/software combinations might produce varying volumetric results for different brain regions due to differing anatomic definitions. However, since only total brain volume, gray matter volume, and white matter volume were evaluated here, this effect should be negligible.
In conclusion, accurate volumetric measurements are essential for diagnosing and monitoring neurodegenerative diseases and for therapy planning. New treatments for Alzheimer's disease with anti-Aβ antibodies can affect brain volume, with unknown long-term effects23. Therefore, accurate monitoring of these new therapies with brain volumetry can be an important part of follow-up controls to detect side effects of therapy as well as disease progression24. Our findings show that the reproducibility of volumetric measurements varies significantly across software, with deep learning tools demonstrating higher reliability. As volumetric MRI analysis becomes more common, result interpretation must account for measurement protocol, scanner, software, and patient-specific factors (e.g., hydration status and medications). Establishing guidelines for correction factors would further improve the comparability of volumetric analysis, resulting in earlier, more accurate diagnoses and possibly improved treatment outcomes. Based on our results, we recommend using the same combination of scanner and software across sessions to ensure that observed changes in brain volume are reliable and clinically valuable.
Methods
Ethics approval and participants
The study was approved by the ethics committee at the medical faculty of the Eberhard Karls University and at the University Hospital of Tübingen (approval number: 512/2020BO). All experiments were conducted in accordance with ethics committee guidelines and regulations. Participants provided written, informed consent prior to the examination. Exclusion criteria included age below 18 or above 65, known structural anomalies of the brain, pregnancy, and contraindications for magnetic resonance imaging (MRI).
Scanners and scanning protocol
Six different MRI scanners (all Siemens Healthineers, Erlangen, Germany) were used to acquire a T1-MPRAGE: three 1.5-Tesla scanners (two distinct Aera scanners located in separate rooms, referred to as Aera and Aera3, and one Avanto scanner), all equipped with 20-channel head-neck coils, and three 3-Tesla scanners (Vida, Vida Fit, and Prisma), also fitted with 20-channel head-neck coils. For two 3-Tesla scanners (Prisma and Vida Fit), additional scans with a 64-channel head-neck coil were acquired. For the 1.5-Tesla scanners, the scanning parameters were TR = 2400 ms, TI = 1000 ms, flip angle = 8°, bandwidth = 180 Hz/Px, 176 slices. For the 3-Tesla scanners, the scanning parameters were TR = 2300 ms, TI = 900 ms, flip angle = 9°, bandwidth = 240 Hz/Px, 176 slices. Scanning protocols were as described by Siemens Healthineers for the syngo.via evaluation.
MRI preprocessing and volumetric software
All images were evaluated for visible motion artifacts. Registration to anatomical reference spaces, such as MNI or Talairach, was performed by each software where required. For the volumetric analyses, seven different software programs were used: FreeSurfer Version 7.1.125,26, SPM12 version 777127 running on Matlab 201828, AIRAscore (Version 2.1.0, AIRAmed GmbH, Tübingen, Germany)29, AssemblyNet30, FastSurfer31, Vol2Brain32, and Brain Morphometry as part of the Neurology Workflow in syngo.via (VA40, Siemens Healthineers)33. FastSurfer and FreeSurfer do not provide a total WM label. Therefore, to create a total WM volume comparable to the other solutions, the outputs of the following labels were combined: [Left|Right] Cerebellum-White-Matter, Brainstem, [Left|Right]-VentralDC, [Left|Right]-Cerebral-White-Matter, and the corpus callosum labels (CC_Posterior, CC_Mid_Posterior, CC_Central, CC_Mid_Anterior, CC_Anterior). For FreeSurfer, SPM12, FastSurfer, and AssemblyNet, DICOM raw data were converted into NIFTI-1 files with dcm2niix (version 1.0.20211006). For one dataset (Proband2_Avanto_Messung1), Vol2Brain and AssemblyNet failed segmentation due to a tilted head position. After manual correction to AC-PC alignment, the dataset could be segmented, and these values were used for further evaluations. AIRAscore and syngo.via accepted DICOM images as input.
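As an illustrative sketch (not the study's code), assuming the per-label volumes in mm³ have already been parsed from the segmentation statistics output into a dictionary, the label combination described above amounts to a simple sum. The exact label spellings are assumptions based on common FreeSurfer aseg conventions and may differ between versions:

```python
# Labels combined into a total WM volume, as described in the text.
# Spellings are assumptions; check them against the aseg.stats file
# of the FreeSurfer/FastSurfer version actually used.
WM_LABELS = [
    "Left-Cerebellum-White-Matter", "Right-Cerebellum-White-Matter",
    "Brain-Stem",
    "Left-VentralDC", "Right-VentralDC",
    "Left-Cerebral-White-Matter", "Right-Cerebral-White-Matter",
    "CC_Posterior", "CC_Mid_Posterior", "CC_Central",
    "CC_Mid_Anterior", "CC_Anterior",
]

def total_wm_volume(aseg_volumes: dict) -> float:
    """Sum the WM-related label volumes (mm^3); a KeyError on a missing
    label is preferable to silently under-reporting the total."""
    return sum(aseg_volumes[label] for label in WM_LABELS)
```

Raising on a missing label, rather than skipping it, makes an incomplete segmentation output visible immediately instead of biasing the combined WM volume downwards.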
Procedure
A prospective balanced design was used. Each participant was scanned twice using eight different combinations of scanner and coil on the same day, resulting in a total of 16 scans per participant, except for one participant who had only 12 scans due to fitting issues with the 64-channel coil. Because the scanners were located in different rooms, participants changed locations between scanners. The total duration was approximately 2 h per participant (approximately 5 min per scan). Between each scan, the participant was moved out of the scanner, repositioned, and then moved back into the scanner. A localizer was acquired for each scan (see Fig. 4). The scanner image, computer image, and brain shape depicted in Fig. 4 were incorporated using elements from Canva.com.
The figure illustrates the scanning procedure used to assess the reliability of brain volumetric analysis across multiple scanner-coil combinations and software. Six scanners were used (MAGNETOM: Aera, Aera3, Avanto, Prisma, Vida, Vida Fit). For each scan, a brain volumetric analysis was conducted using each software (SPM12, FreeSurfer, AIRAscore, AssemblyNet, Vol2Brain, FastSurfer, syngo.via). Twelve subjects were scanned in each of the scanners. Each subject was scanned twice in every scanner on the same day. A third and fourth run were conducted on the two 3 T scanners with an additional 64-channel head coil. Between each scan, subjects were moved out of the scanner, asked to reposition themselves, and then moved back into the same scanner. A localizer was acquired for each scan.
Statistical analysis
IBM SPSS Statistics for Windows, Version 29.0.2.0 (IBM Corp., Armonk, NY; released 2023) was used for statistical analysis. Python Version 3.10.12 was used for calculating the coefficient of variation. Statistical analysis was split into two parts. In the first step, the effect of software, scanner, and session (first and second scan) on estimations of gray matter and white matter volume was evaluated using generalized estimating equations; in the second step, the test-retest performance of each software was evaluated, based on the general recommendation not to switch software or scanner for follow-up examinations.
Statistical analysis using generalized estimating equations
A generalized estimating equations (GEE) model was computed to evaluate the effect of scanner, session (first or second scan with the same scanner), and software on the measured volume for gray matter and white matter. The dependency between measurements arising from the same subject being measured under different circumstances (scanner, software, session) was included in the linear model. The model was computed on the full dataset of the 12 subjects using the combinations of scanner and 20-/64-channel coil. For statistical comparison, the AssemblyNet software and the Prisma scanner were used as references. The Prisma scanner was chosen as the reference scanner because it is a widely recognized and commonly used standard in neuroimaging studies, providing a robust baseline for comparison. AssemblyNet was chosen as the reference software because it employs a novel deep learning approach combining two assemblies of U-Nets, each based on a large number of convolutional neural networks (125 CNNs), to achieve fine-grained segmentation of various anatomical regions. Its training dataset of 45 manually segmented cases from the OASIS dataset includes a diverse range of subjects and different Siemens scanners, supporting high accuracy and generalizability in neuroimaging studies34.
Measuring reproducibility of volume measurements
To measure the reproducibility of volume measurements for gray matter, white matter, and total brain volume (TBV), the percentage coefficient of variation \(CV = \frac{SD}{Mean} \times 100\%\) was calculated for all repeated measurements on the same scanner in the same subject. As the CV follows a chi distribution, median and interquartile range (IQR) were used to describe the results, and boxplots and Bland-Altman plots were used for visualization. To account for the systematic volume differences between the different software or scanners, the percentage change was used instead of the absolute volume difference of repeated scans.
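The two reliability measures above can be sketched in a few lines of Python. This is an illustrative reimplementation using only the standard library, not the study's code, and the example volumes are invented:

```python
import statistics

def percent_cv(volumes):
    """Percentage coefficient of variation of repeated measurements
    (same subject, same scanner, same software): 100 * SD / mean."""
    return 100.0 * statistics.stdev(volumes) / statistics.fmean(volumes)

def bland_altman_limits(scan1, scan2):
    """Bias and 95% limits of agreement of paired scan-rescan
    percentage changes, as used for the Bland-Altman plots."""
    pct_change = [100.0 * (b - a) / a for a, b in zip(scan1, scan2)]
    bias = statistics.fmean(pct_change)
    half_width = 1.96 * statistics.stdev(pct_change)
    return bias, bias - half_width, bias + half_width

# Illustrative TBV values in mL (not study data):
first = [1212.0, 1340.0, 1098.0]
second = [1211.0, 1342.5, 1097.1]
bias, lower, upper = bland_altman_limits(first, second)
```

Using the percentage change rather than the absolute difference, as described above, makes subjects with different head sizes directly comparable in one plot.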
Data availability
Due to GDPR limitations, raw imaging data may not be shared; the volumetric results of the different tools for each case are provided within the supplementary information files.
References
Pemberton, H. G. et al. Automated quantitative MRI volumetry reports support diagnostic interpretation in dementia: a multi-rater, clinical accuracy study. Eur. Radiol. 31 (7), 5312–5323 (2021).
Mendelsohn, Z. et al. Commercial volumetric MRI reporting tools in multiple sclerosis: a systematic review of the evidence. Neuroradiology 65 (1), 5–24 (2023).
Lindig, T. et al. Proof of principle for the clinical use of a CE-certified automatic imaging analysis tool in rare diseases studying hereditary spastic paraplegia type 4 (SPG4). Sci. Rep. 12 (1), 22075 (2022).
Struyfs, H. et al. Automated MRI volumetry as a diagnostic tool for Alzheimer's disease: validation of icobrain dm. Neuroimage Clin. 26, 102243 (2020).
Sander, L. et al. Improving accuracy of brainstem MRI volumetry: effects of age and sex, and normalization strategies. Front. Neurosci. 14, 609422 (2020).
Chougar, L. et al. Automated categorization of parkinsonian syndromes using magnetic resonance imaging in a clinical setting. Mov. Disord. 36 (2), 460–470 (2021).
Palumbo, L. et al. Evaluation of the intra- and inter-method agreement of brain MRI segmentation software packages: a comparison between SPM12 and FreeSurfer v6.0. Phys. Med. 64, 261–272 (2019).
Sastre-Garriga, J. et al. MAGNIMS consensus recommendations on the use of brain and spinal cord atrophy measures in clinical practice. Nat. Rev. Neurol. 16 (3), 171–182 (2020).
Wittens, M. M. J. et al. Inter- and intra-scanner variability of automated brain volumetry on three magnetic resonance imaging systems in Alzheimer's disease and controls. Front. Aging Neurosci. 13, 746982 (2021).
Chu, R. et al. Automated segmentation of cerebral deep gray matter from MRI scans: effect of field strength on sensitivity and reliability. BMC Neurol. 17 (1), 172 (2017).
Wang, J. et al. Optimizing the magnetization-prepared rapid gradient-echo (MP-RAGE) sequence. PLoS One. 9 (5), e96899 (2014).
Khadhraoui, E. et al. Automated brain segmentation and volumetry in dementia diagnostics: a narrative review with emphasis on FreeSurfer. Front. Aging Neurosci. 16, 1459652 (2024).
Zaki, L. A. M. et al. Comparing two artificial intelligence software packages for normative brain volumetry in memory clinic imaging. Neuroradiology 64 (7), 1359–1366 (2022).
Calloni, S. F. et al. Combining semi-quantitative rating and automated brain volumetry in MRI evaluation of patients with probable behavioural variant of fronto-temporal dementia: an added value for clinical practise? Neuroradiology 65 (6), 1025–1035 (2023).
Mangesius, S. et al. Qualitative and quantitative comparison of hippocampal volumetric software applications: do all roads lead to Rome? Biomedicines 10 (2), 432 (2022).
Marciniewicz, E. et al. Quantitative magnetic resonance assessment of brain atrophy related to selected aspects of disability in patients with multiple sclerosis: preliminary results. Pol. J. Radiol. 84, e171–e178 (2019).
Takao, H., Hayashi, N. & Ohtomo, K. Effect of scanner in longitudinal studies of brain volume changes. J. Magn. Reson. Imaging. 34 (2), 438–444 (2011).
Knussmann, G. N. et al. Test-retest reliability of FreeSurfer-derived volume, area and cortical thickness from MPRAGE and MP2RAGE brain MRI images. Neuroimage Rep. 2 (2), 100086 (2022).
Singh, M. K. Reproducibility and reliability of computing models in segmentation and volumetric measurement of brain. Ann. Neurosci. 30 (4), 224–229 (2023).
Iglesias, J. E. et al. SynthSR: A public AI tool to turn heterogeneous clinical brain scans into high-resolution T1-weighted images for 3D morphometry. Sci. Adv. 9 (5), eadd3607 (2023).
Dieleman, N., Koek, H. L. & Hendrikse, J. Short-term mechanisms influencing volumetric brain dynamics. Neuroimage Clin. 16, 507–513 (2017).
Maclaren, J. et al. Reliability of brain volume measurements: a test-retest dataset. Sci. Data. 1, 140037 (2014).
Alves, F., Kalinowski, P. & Ayton, S. Accelerated brain volume loss caused by anti-β-amyloid drugs: a systematic review and meta-analysis. Neurology 100 (20), e2114–e2124 (2023).
Filippi, M., Cecchetti, G. & Agosta, F. MRI in the new era of antiamyloid mAbs for the treatment of Alzheimer's disease. Curr. Opin. Neurol. 36 (4), 239–244 (2023).
FreeSurfer. Available from: https://surfer.nmr.mgh.harvard.edu/
Fischl, B. FreeSurfer. Neuroimage 62 (2), 774–781 (2012).
SPM12. Available from: https://www.fil.ion.ucl.ac.uk/spm/software/spm12/
MathWorks - developer of MATLAB and Simulink. Available from: https://mathworks.com/
AIRAmed GmbH. Quantitative neuroradiology - our solutions. Available from: https://www.airamed.de/de/startseite
Coupé, P. et al. AssemblyNet: A large ensemble of CNNs for 3D whole brain MRI segmentation. Neuroimage 219, 117026 (2020).
Henschel, L. et al. FastSurfer - A fast and accurate deep learning based neuroimaging pipeline. Neuroimage 219, 117012 (2020).
Manjón, J. V. et al. vol2Brain: a new online pipeline for whole brain MRI analysis. Front. Neuroinformatics 16, 862805 (2022).
syngo.via. Available from: https://www.siemens-healthineers.com/de/digital-health-solutions/syngovia-view-go
Marcus, D. S. et al. Open access series of imaging studies: longitudinal MRI data in nondemented and demented older adults. J. Cogn. Neurosci. 22 (12), 2677–2684 (2010).
Acknowledgements
We thank Dr. Johann Jacoby (Institute for Clinical Epidemiology and Applied Biometry, University Tübingen) for his input on statistical methods and code for calculation of the GEE.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Authors and Affiliations
Contributions
B.B., T.L. and E.B. methodology, conceptualization and design, E.B. investigation, writing - original draft (lead), A.D. formal analysis (FastSurfer, Vol2Brain, assembly.net), B.B. and E.B. formal analysis (SPM12, FreeSurfer, syngo.via, AIRAscore, statistics), software. A.N. Visualization, Writing - original draft. U.E. resources, supervision. All authors: Writing - review and editing (equal), approval of the final manuscript.
Corresponding author
Ethics declarations
Competing interests
Ahmad Nazzal, Tobias Lindig and Benjamin Bender are employed by AIRAmed. AIRAmed provided segmentations free of charge as part of a research agreement. Tobias Lindig and Benjamin Bender received speaker honoraria from Eisai GmbH, outside the submitted work. Benjamin Bender received honoraria from Medtronic, outside the submitted work. All other authors declare that they do not have any competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Bürkle, E., Nazzal, A., Debolski, A. et al. Scan-rescan reliability assessment of brain volumetric analysis across scanners and software solutions. Sci Rep 15, 29843 (2025). https://doi.org/10.1038/s41598-025-15283-3