Abstract
Over the years, ComBAT has become the standard method for harmonizing MRI-derived measurements, owing to its ability to compensate for site-related additive and multiplicative biases while preserving biological variability. However, ComBAT relies on a set of assumptions that, when violated, can result in flawed harmonization. In this paper, we thoroughly review ComBAT’s mathematical foundation, outlining these assumptions and exploring their implications for the demographic composition necessary for optimal results. Through a series of experiments involving a slightly modified version of ComBAT called Pairwise-ComBAT, tailored for normative modeling applications, we assess the impact of various population characteristics, including population size, age distribution, the absence of certain covariates, and the magnitude of additive and multiplicative factors. Based on these experiments, we present five essential recommendations that should be carefully considered to enhance consistency and support reproducibility, two essential factors for open science, collaborative research, and real-life clinical deployment.
Introduction
In neuroimaging, leveraging multi-site data to analyze general trends and variability within diverse populations is increasingly common1,2,3,4,5. However, variations in imaging protocols across sites introduce systematic biases, posing significant challenges to multi-site and longitudinal studies6,7,8,9,10,11,12,13. To address these biases, various harmonization techniques have been developed7,13,14,15,16,17,18,19. These methods either pre-harmonize diffusion-weighted images via neural networks or other techniques like LinearRISH9,20,21 or post-harmonize MRI-derived measurements, such as through ComBAT18,19,22.
ComBAT is one of the most effective methods (8000+ citations as of June 2025 according to PubMed) for harmonizing MRI-derived measurements due to its ease of application and ability to adjust for site-related additive and multiplicative biases while preserving biological variability22,23,24,25. Its success has led to the development of various ComBAT-like methods, such as Longitudinal ComBAT26, CovBat14, Auto-ComBAT27, GMM-ComBAT16, M-ComBAT28, ComBAT-GAM29, and Cluster-ComBAT30, each offering specific improvements. Despite these advances, the original ComBAT remains a widely used solution in both research and clinical applications22,23,31.
The off-the-shelf implementation of ComBAT relies implicitly on several key assumptions, which researchers have highlighted in previous studies. Failing to meet these assumptions can significantly impact its effectiveness. The primary assumptions include:
1. The covariate effects (e.g. sex, age, handedness) on the data must be consistent across all harmonization sites (more details in the Theory section)32.
2. The population distributions should not display substantial imbalances in key covariates (e.g., sample size, age, sex, medical condition)11,32,33,34.
3. Age distributions must overlap substantially across sites and span a wide age range26,29,32.
4. Data are harmonized in a single step once the data acquisition is completed (cf. the section For the need of a reference site).
Unfortunately, these assumptions are often overlooked in practice. Notably, no study has systematically evaluated their impact on ComBAT’s performance under controlled conditions or provided clear, quantifiable guidelines in diffusion MRI.
In this paper, we thoroughly review ComBAT’s mathematical foundation, clarifying its assumptions and the demographic conditions necessary for optimal performance. We also argue that aligning each site with a common reference dataset, with a method called Pairwise-ComBAT, mitigates some of ComBAT’s limitations. Doing so enhances consistency and supports reproducibility, two essential factors for open science, collaborative research, and real-life clinical deployment.
Throughout our experiments, we use the Cambridge Centre for Ageing Neuroscience dataset (CamCAN)35,36 as a reference dataset and evaluate the impact of the magnitude of the additive and multiplicative biases, as well as various demographic characteristics, including sample size, age distribution, missing covariates, and the presence of a pathological population. In some cases, we replicate our results with other public databases, such as the Alzheimer’s Disease Neuroimaging Initiative (ADNI)37 dataset or the Mind Research Network (MRN, site A) dataset38. To assess the harmonization quality, we also propose a goodness-of-fit measure mathematically derived from Pairwise-ComBAT.
Following these experiments, we present five essential recommendations for using ComBAT. This paper aims to highlight both the limitations and opportunities that ComBAT presents for multi-site data harmonization.
Theory
In this section, we present the theory behind ComBAT. We outline the parameters of ComBAT, how they are estimated, and how they are used to harmonize data. To follow the method, the reader may refer to Fig. 1 for the key harmonization steps. Note that although this figure is tailored to Pairwise-ComBAT (cf. the Pairwise-ComBAT section), it can be used to demystify ComBAT’s formalism.
Illustration of the seven steps of a typical Pairwise-ComBAT harmonization of two sites, highlighting the effect of each variable of Eq. (15), from the raw data in a) to the harmonized data in j). The gray curves in b) illustrate the overall trend of the population from site 1 and site 2, whereas the scatter plots show the values of the \(J_i\) subjects of each site i. Details on how the metric/age population shaded plots are computed are provided at the end of the Experiments section.
Let \(\mathscr {Y}=\{ Y_1,..., Y_I\}\) be a set of I physical MRI sites and \(Y_i=\{ Y_{i1},..., Y_{iJ_i}\}\) the \(J_i\) dMRI derived metric images from the site i. For example, \(Y_i\) may contain a set of FA maps obtained from the same MRI system at site i. Note that \(J_i\) is site dependent, as the number of data is usually different from one site to another. Each map \(Y_{ij}\) can be represented as \(Y_{ij}=[y_{ij1},..., y_{ijV}]\), where V is the number of voxels (or regions) in the common space and \(y_{ijv} \in \mathbb {R}\) and j is the patient index.
To compensate for site-specific biases, ComBAT uses a linear model for the data formation of each voxel (or region) v as follows:
where \(\alpha _v\) is the model intercept of the overall population across all sites, \({\varvec {x}}_{ij}\) is a vector of covariates (e.g. age and sex), \(\varvec{\beta }_v\) is a regression vector, \(\gamma _{iv}\) is the additive site effect, \(\delta _{iv}\) is the multiplicative site effect, and \(\varepsilon _{ijv}\) is independent Gaussian noise with \(\varepsilon _{ijv} \sim \mathscr {N}(0,\sigma ^{2}_{v})\). This model is illustrated for two sites in Fig. 1b where the black line corresponds to the overall population trend represented by \(\varvec{\beta }_v\) in the previous equation. It is crucial to recognize that ComBAT assumes the regression vector \(\varvec{\beta }_v\) (i.e. the slope of the population) is constant for all sites which, as will be demonstrated, is often incorrect and results in flawed harmonization functions. Thus, the harmonization process of two parallel black and red populations illustrated in Fig. 1b is an idealized version of the algorithm.
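As a minimal sketch of this data formation model (not the authors' implementation), the snippet below simulates one region for two sites using \(y = \alpha + \beta \cdot \text{age} + \gamma _i + \delta _i \varepsilon\), the standard ComBAT form consistent with the terms defined above. It assumes a single age covariate; the function names and all numeric values are hypothetical and chosen only for illustration.

```python
import random
import statistics


def simulate_site(n, alpha, beta, gamma, delta, sigma, rng):
    """Draw n (age, y) pairs from y = alpha + beta*age + gamma + delta*eps,
    with eps ~ N(0, sigma^2)."""
    data = []
    for _ in range(n):
        age = rng.uniform(20, 90)
        eps = rng.gauss(0.0, sigma)
        data.append((age, alpha + beta * age + gamma + delta * eps))
    return data


rng = random.Random(0)

# Hypothetical population parameters shared by both sites (same slope).
alpha, beta, sigma = 0.70, 0.002, 0.02

# Site 1 is bias-free; site 2 has an additive and a multiplicative site effect.
site1 = simulate_site(2000, alpha, beta, gamma=0.00, delta=1.0, sigma=sigma, rng=rng)
site2 = simulate_site(2000, alpha, beta, gamma=0.05, delta=1.5, sigma=sigma, rng=rng)


def residuals(data):
    """Remove the shared population trend alpha + beta*age."""
    return [y - alpha - beta * age for age, y in data]


res1, res2 = residuals(site1), residuals(site2)
shift = statistics.mean(res2) - statistics.mean(res1)    # estimates gamma of site 2
scale = statistics.stdev(res2) / statistics.stdev(res1)  # estimates delta of site 2
```

Once the shared trend is removed, the difference in residual means approximates the additive effect and the ratio of residual standard deviations approximates the multiplicative effect, which is only valid here because both sites share the same slope.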
The primary goal of ComBAT is to eliminate the site-specific additive and multiplicative biases (respectively \(\gamma _{iv}\) and \(\delta _{iv}\)) ensuring that the harmonized population profile conforms to the black dotted model depicted in Fig. 1j:
Since the model parameters \(\alpha _v, \varvec{\beta }_v, \gamma _{iv}, \sigma _v\) and \(\delta _{iv}\) are unknown a priori, they must be empirically estimated from the data. One simplistic approach to estimating these parameters is the L/S-ComBAT (Location and Scale) method18. Although this approach is straightforward, it has its share of limitations, which are typically addressed with the Bayesian formulation of ComBAT, simply called ComBAT in Fortin et al.19.
ComBAT starts by computing the regression vector \(\varvec{\beta }_v\) and the intercept \(\alpha _v\) with an ordinary least-squares approach
where X is the matrix of covariate vectors from every participant across all sites, and \(Y_v\) is a vector containing the metric in location v for every participant in all sites.
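The least-squares step can be sketched as follows for the simplest case of a single covariate (age) and one region. This is an illustrative assumption: the general case solves the same problem with the full covariate matrix X, and the function name `ols_fit` is hypothetical.

```python
def ols_fit(x, y):
    """Closed-form ordinary least squares for y = alpha + beta * x.

    x: covariate values (e.g. ages) pooled over all sites
    y: metric values at one location v for the same participants
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta_hat = sxy / sxx
    alpha_hat = mean_y - beta_hat * mean_x
    return alpha_hat, beta_hat


# Noise-free toy data: the fit recovers the generating intercept and slope.
ages = [20.0, 40.0, 60.0, 80.0]
values = [0.70 + 0.002 * a for a in ages]
alpha_hat, beta_hat = ols_fit(ages, values)
```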
ComBAT then standardizes all data in relation to \(\hat{\alpha }_v\) and \(\hat{\sigma }_v\) as depicted in Fig. 1d and represented mathematically as
where \(\hat{\sigma }_v\) is given by
where \(J_I\) is the total number of data points across all sites. Then, ComBAT estimates \(\gamma _{iv}^*\) and \(\delta _{iv}^{2*}\) by maximizing their posterior distribution instead of their likelihood distribution as in L/S-ComBAT18. Please note that the ComBAT notation can be confusing. While \(\gamma _{iv}\) and \(\delta _{iv}^{2}\) are the bias and variance of the rectified population (Fig. 1c), \(\gamma _{iv}^*\) and \(\delta _{iv}^{2*}\) are the bias and variance of the standardized populations of Fig. 1d. Following Bayes’ theorem, the posteriors of \(\gamma ^*_{iv}\) and \(\delta _{iv}^{2*}\) can be written as
where
The hyperparameters of the prior distributions \(\mu _i\), \(\tau ^2_i\), \(\lambda _i\), and \(\theta _i\) are estimated by matching the moments of these distributions. As such, \(\mu _i\) and \(\tau ^2_i\) are the empirical mean and variance across all locations v of site i,
where \(\hat{\gamma }^*_{iv} = \frac{1}{J_i}\sum _j z_{ijv}\) is the location-wise and site-wise average of the standardized data illustrated in Fig. 1d and e.
To estimate \(\lambda _i\) and \(\theta _i\), one needs to compute the voxel-wise and site-wise standardized variance \(\hat{\delta }^{2*}_{iv} = \frac{1}{J_i-1}\sum _j (z_{ijv} - \hat{\gamma }^*_{iv})^2\). The empirical mean and variance of \(\hat{\delta }^{2*}_{iv}\) across all locations v are computed as \(\bar{G}_{i} = \frac{1}{V}\sum _v \hat{\delta }^{2*}_{iv}\) and \(\bar{S}^{2}_{i} = \frac{1}{V-1}\sum _v (\hat{\delta }^{2*}_{iv} - \bar{G}_{i})^2\). Then, by setting \(\bar{G}_i\) and \(\bar{S}^2_i\) equal to the first and second theoretical moments of the inverse gamma distribution, we get
that can be rearranged as follows to get the estimate of the two hyperparameters
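Equating \(\bar{G}_i\) to the inverse gamma mean \(\theta _i/(\lambda _i-1)\) and \(\bar{S}^2_i\) to its variance \(\theta _i^2/((\lambda _i-1)^2(\lambda _i-2))\) gives \(\lambda _i = \bar{G}_i^2/\bar{S}_i^2 + 2\) and \(\theta _i = \bar{G}_i(\lambda _i - 1)\). The sketch below applies this moment matching to a list of per-location variances; the function name and the toy values are hypothetical.

```python
def inverse_gamma_moments(variances):
    """Method-of-moments estimates (lambda_i, theta_i) of an inverse gamma prior.

    variances: the standardized variances of one site, one value per location v.
    """
    n_locations = len(variances)
    g_bar = sum(variances) / n_locations
    s2_bar = sum((d - g_bar) ** 2 for d in variances) / (n_locations - 1)
    lam = g_bar ** 2 / s2_bar + 2.0
    theta = g_bar * (lam - 1.0)
    return lam, theta


# Toy per-location variances: mean 1.0 and variance 0.04.
lam_i, theta_i = inverse_gamma_moments([0.8, 1.0, 1.2])

# The fitted prior reproduces the two empirical moments.
prior_mean = theta_i / (lam_i - 1.0)
prior_var = theta_i ** 2 / ((lam_i - 1.0) ** 2 * (lam_i - 2.0))
```

Note that a small spread \(\bar{S}^2_i\) across locations yields a large \(\lambda _i\), that is, a tight prior.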
Now that the hyperparameters of the likelihood and prior distributions have been estimated, developing Eqs. (6) and (7) and combining them with Eqs. (8) and (10) shows that the mathematical expectation of the posterior distributions leads to the following estimates of \(\gamma ^*_{iv}\) and \(\delta ^{2*}_{iv}\)
Since Eqs. (11) and (12) are inter-dependent, \(\bar{\gamma }_{iv}^*\) and \(\bar{\delta }_{iv}^{2*}\) are estimated through an iterative process by starting with a reasonable value for \(\bar{\delta }_{iv}^{2*}\) (i.e., \(\hat{\delta }^{2*}_{iv}\)), then by calculating \(\bar{\gamma }_{iv}^*\), and then by re-estimating \(\bar{\delta }_{iv}^{2*}\) with the new value of \(\bar{\gamma }_{iv}^*\) and so on. This process is repeated until convergence.
Once all parameters of Eq. (1) have been empirically estimated, the data can be harmonized as follows:
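The alternating scheme can be sketched as below. The two update rules are an assumption: they are the posterior-mean formulas of the standard empirical Bayes ComBAT formulation, written in the notation of this section, and may differ in detail from the exact expressions used here. The function name, tolerance, and toy values are hypothetical.

```python
def eb_estimates(z, mu, tau2, lam, theta, tol=1e-8, max_iter=1000):
    """Iteratively estimate (gamma_bar, delta2_bar) for one site and one location.

    z: standardized values of the J_i subjects (J_i >= 2)
    mu, tau2: hyperparameters of the Gaussian prior on the additive effect
    lam, theta: hyperparameters of the inverse gamma prior on the variance
    """
    n = len(z)
    gamma_hat = sum(z) / n
    # Starting value: the empirical standardized variance.
    delta2 = sum((v - gamma_hat) ** 2 for v in z) / (n - 1)
    gamma = gamma_hat
    for _ in range(max_iter):
        # Update the additive effect given the current variance.
        gamma_new = (n * tau2 * gamma_hat + delta2 * mu) / (n * tau2 + delta2)
        # Update the variance given the new additive effect.
        sum_sq = sum((v - gamma_new) ** 2 for v in z)
        delta2_new = (theta + 0.5 * sum_sq) / (n / 2.0 + lam - 1.0)
        converged = abs(gamma_new - gamma) < tol and abs(delta2_new - delta2) < tol
        gamma, delta2 = gamma_new, delta2_new
        if converged:
            break
    return gamma, delta2


# Toy example: five standardized values with empirical mean 0.2 and a prior centred on 0.
z_site = [0.1, 0.3, 0.2, 0.4, 0.0]
gamma_bar, delta2_bar = eb_estimates(z_site, mu=0.0, tau2=1.0, lam=3.0, theta=2.0)
```

With so few subjects, the additive estimate is pulled from the empirical mean towards the prior mean, and the variance estimate stays close to the prior mean \(\theta /(\lambda -1)\).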
as illustrated in Fig. 1f to i.
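A minimal sketch of this final step for one data point is given below, assuming the standard ComBAT form: standardize, remove the site's additive and multiplicative effects, then restore the population trend. It uses a single age covariate; the function name and numeric values are hypothetical.

```python
import math


def harmonize(y, x, alpha_hat, beta_hat, sigma_hat, gamma_star, delta2_star):
    """Harmonize one value y with covariate x (single-covariate sketch)."""
    trend = alpha_hat + beta_hat * x
    z = (y - trend) / sigma_hat                    # standardization
    z_clean = (z - gamma_star) / math.sqrt(delta2_star)  # remove site effects
    return sigma_hat * z_clean + trend             # back to the original scale


# Toy check: build a biased value from a known noise draw, then harmonize it.
alpha_hat, beta_hat, sigma_hat = 0.70, 0.002, 0.02
gamma_star, delta2_star = 1.5, 4.0
age, noise = 50.0, 0.5
y_biased = alpha_hat + beta_hat * age + sigma_hat * (gamma_star + math.sqrt(delta2_star) * noise)
y_harmonized = harmonize(y_biased, age, alpha_hat, beta_hat, sigma_hat, gamma_star, delta2_star)
y_expected = alpha_hat + beta_hat * age + sigma_hat * noise
```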
Mathematical limitations of ComBAT
The slope \(\varvec{\beta }_v\). As outlined through the previous equations, ComBAT assumes that the slope \(\varvec{\beta }_v\) of region v is the same for all sites. This assumption is problematic when the canonical data formation model of Eq. (2) is affected by a multiplicative factor \(S_i\) and an additive factor \(A_i\) as follows:
where the first equation is the canonical form of a linear distortion model, \(\gamma _{iv}=\alpha _v(S_i-1)+A_i\) and \(\delta _{iv}=S_i\). While Eq. (14) is similar to Eq. (1), it nonetheless contains a multiplicative term \(S_i\) on the slope which is not accounted for by ComBAT. As will be shown later, this results in misaligned populations when the uniform slope assumption is violated.
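For reference, the expansion behind these identities can be written out explicitly. Assuming the distortion scales the whole bias-free signal by \(S_i\) and shifts it by \(A_i\) (the reading consistent with the expressions for \(\gamma _{iv}\) and \(\delta _{iv}\) given above), one obtains:

\[
S_i\left( \alpha _v + \varvec{x}_{ij}^T\varvec{\beta }_v + \varepsilon _{ijv}\right) + A_i
= \alpha _v + S_i\,\varvec{x}_{ij}^T\varvec{\beta }_v + \underbrace{\alpha _v(S_i-1)+A_i}_{\gamma _{iv}} + \underbrace{S_i}_{\delta _{iv}}\,\varepsilon _{ijv}
\]

The factor \(S_i\) multiplying \(\varvec{x}_{ij}^T\varvec{\beta }_v\) is the only term with no counterpart in a model that shares a single slope across sites.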
The variance \(\delta _{iv}^2\). When calculating the variance \(\hat{\delta }_{iv}^{2*}\), two key considerations ought to be taken into account. First, the posterior estimator of Eq. (12) can be viewed as a nonlinear weighted average of the empirical and prior variances with respect to \(J_i\) (i.e. the number of data points at site i). To illustrate this, when \(J_i \rightarrow 0\), Eq. (12) becomes \(\hat{\delta }_{iv}^{2*} \rightarrow \frac{\bar{\theta }_i}{\bar{\lambda }_i - 1}\), which, in combination with Eq. (10), yields \(\hat{\delta }_{iv}^{2*} \approx \bar{G}_i\), representing the prior variance. Conversely, as \(J_i \rightarrow \infty\), Eq. (12) simplifies to \(\hat{\delta }_{iv}^{2*} \rightarrow \frac{1}{J_i} \sum _j (z_{ijv} - \gamma _{iv})^2\), the formula for empirical variance. However, as we will demonstrate empirically, the weight assigned to the prior term remains disproportionately high as \(J_i\) increases. As a consequence, a poor a priori variance \(\bar{G}_i\) will affect \(\hat{\delta }_{iv}^{2*}\) even on large populations of more than \(\sim \!\! 100\) subjects.
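To make the two limits concrete, the sketch below computes the share of the posterior variance that comes from the prior. It assumes the standard ComBAT posterior-mean form \((\theta _i + \frac{1}{2}\sum _j (z_{ijv}-\bar{\gamma }^*_{iv})^2)/(J_i/2 + \lambda _i - 1)\), which can be rewritten as a weighted average of the prior mean \(\theta _i/(\lambda _i-1)\) and the empirical variance. The function names and the value \(\lambda _i = 27\) are hypothetical.

```python
def prior_weight(lam, n_subjects):
    """Weight given to the prior variance when n_subjects are available."""
    return (lam - 1.0) / (lam - 1.0 + n_subjects / 2.0)


def posterior_variance(prior_var, empirical_var, lam, n_subjects):
    """Posterior variance as a weighted average of prior and empirical variances."""
    w = prior_weight(lam, n_subjects)
    return w * prior_var + (1.0 - w) * empirical_var


# Hypothetical tight prior (lambda = 27) evaluated for growing sample sizes.
weights = {n: prior_weight(27.0, n) for n in (0, 10, 100, 1000)}
```

Under these assumptions, a site with 100 subjects would still draw roughly a third of its variance estimate from the prior, which is consistent with the behavior described above.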
The second consideration that one must keep in mind is the fact that the variance \(\hat{\delta }_{iv}^{2*}\) depends on \(z_{ijv}\). Since \(z_{ijv}\) is itself influenced by \(x_{ij}^T \varvec{\beta }_v\) (c.f. Eq. 4), whenever the slope assumption is violated, the term \((z_{ijv} - \bar{\gamma }_{iv}^*)^2\) becomes biased, leading to an inaccurate estimation.
Materials and methods
For the need of a reference site
ComBAT harmonizes multisite data by projecting each data point onto an “average” site, defined by an intercept \(\hat{\alpha }_v\), a slope \(\hat{\varvec{\beta }}_v\), and a variance \(\hat{\sigma }_v^2\) (Eq. 13), illustrated by the black line in Fig. 1b. This approach is ideal for structured studies or clinical trials, in which participants undergo specific interventions according to a predefined protocol, with data collected at multiple sites under controlled conditions. Additionally, such studies are usually time-bound, culminating with a fixed set of sites and participants that are harmonized collectively at the end of the study.
However, many applications do not meet these constraints. We argue that aligning each site to a common reference dataset, ideally one that is well populated across all covariates, offers a more versatile and robust solution. There are three main reasons for adopting this approach.
(i) Reproducibility and open science. Harmonizing data to a reference dataset offers multiple advantages for the scientific community. First, it enhances the comparability of data collected from diverse sources, thereby facilitating the construction of aggregated datasets that can be shared under consistent statistical assumptions. This eliminates the need for future re-harmonization and simplifies data sharing processes. By aligning with a common reference site, researchers can also establish reproducible standards for critical imaging metrics, such as those in diffusion MRI, which play a key role in characterizing disease. This reproducibility is essential to open science, fostering transparent and collaborative research across institutions. Additionally, harmonization to a reference site facilitates longitudinal studies spanning extended periods. When new data are aligned to the same reference, there is no need to re-harmonize all sites collectively, allowing researchers to track disease progression or treatment outcomes with consistent parameters. This approach promotes stability in longitudinal analyses, reduces unnecessary recalibration across sites, and minimizes the computational burden associated with repeated harmonization.
(ii) Normative modeling. Normative modeling quantifies individual variability by assessing how much a subject deviates from a representative population. Typically, a normative model is built by pooling data across sites. Since ComBAT’s parameters (\(\alpha _v, \varvec{\beta }_v\), and \(\sigma _v\)) are computed from all sites’ data, adding new data to the pool shifts the average reference site, creating a “moving target” issue. Although Kia et al. propose a hierarchical Bayesian regression (HBR) approach to mitigate this effect through uncertainty-aware inference39, their method does not yield harmonized measurements that can be directly shared or reused across analyses, as is possible with ComBAT. Also, we suggest that harmonizing each site pairwise to a stable reference dataset is a much simpler approach for building a normative reference model than the iterative and cyclic approach that HBR is built upon.
(iii) Routine clinical practice. Clinical practice entails a continuous flow of data from numerous, and potentially expanding, sources over an indefinite period. For software providers, this results in the need to accommodate unpredictable growth in data sources. This variability challenges the traditional ComBAT harmonization method. By independently harmonizing each site to a common reference, new sites can be seamlessly integrated as they come online, avoiding the common ComBAT issue where the harmonization of a new site disrupts the parameters already established for previously harmonized sites. Furthermore, once harmonization parameters are established for a site, they can be applied to new incoming data, minimizing repeated harmonization and preserving previously harmonized results. This approach maintains clinical consistency for previously treated or diagnosed subjects while reducing the logistical and computational burden of transferring and processing large-scale brain data across multiple sites.
Let us note that the study by Kim et al.32 also performed a thorough evaluation of two-site DTI harmonization. However, Kim et al. treat the two sites symmetrically and focus primarily on the effects of total sample size and differences in age distribution on ComBAT’s performance. In contrast, our experiments assume the presence of a large, well-balanced reference site and concentrate on harmonizing smaller “moving” sites to this reference using the Pairwise-ComBAT strategy (cf. the Pairwise-ComBAT section). Furthermore, we extend the analysis by investigating additional factors impacting harmonization quality, including mismatches in covariate slopes, unbalanced sex distributions, and mixed patient-control populations. Finally, we conceptualize harmonization as a machine learning problem with distinct training and testing sets, allowing us to assess ComBAT’s ability to generalize to new, unseen data and thus provide a more comprehensive evaluation of its effectiveness.
Pairwise-ComBAT
The goal of Pairwise-ComBAT is to harmonize each moving site, denoted as M, independently onto a reference site, R. The reference site acts as a standardized baseline, containing diverse data representative of healthy individuals.
To achieve this, data from site M is harmonized using the parameters of site R, specifically the multiplicative (\(\bar{\delta }_{Rv}^*\)) and additive (\(\bar{\gamma }_{Rv}^*\)) batch effects. This results in a modified version of Eq. (13):
For a more intuitive understanding of Pairwise-ComBAT, the reader may refer to Fig. 1, particularly panels (f) to (i), which illustrate the effect of each parameter. While Pairwise-ComBAT is conceptually similar to Da-Ano et al.’s M-ComBAT28, our harmonization equation (Eq. (15)) differs from theirs (Eq. (4) in ref. 28). Furthermore, in all our experiments, we assume the reference site is well-populated, an assumption not made by Da-Ano et al. A mathematical comparison between Pairwise-ComBAT and M-ComBAT28 is available in the supplementary materials.
The reader shall note that Pairwise-ComBAT is only a reconfiguration of the original ComBAT method, thus producing an equivalent harmonization outcome with the same strengths and weaknesses as the original ComBAT19. A side-by-side comparison between Pairwise-ComBAT and the original ComBAT method is shown in Fig. S1 in the supplementary materials.
Goodness of fit
To gauge the performance of ComBAT, we propose a goodness-of-fit metric which quantifies the overlap between the post-harmonized distributions of the moving (M) and reference (R) populations. This metric is an empirical measure, independent of the model’s assumptions. Its computation relies on the fact that, if harmonization is successful, both populations can be rectified using the reference parameter intercept \(\bar{\gamma }_{Rv}^*\), i.e.:
By removing the effect of the covariates and the biases, this rectification allows us to assume that \(z_{ijv}\) is independent of \(\textbf{x}_{ij}\), so that \(P(z_{Rjv} |\textbf{x}_{Rj})=P(z_{Rjv})\) and \(P(z_{Mjv} |\textbf{x}_{Mj})=P(z_{Mjv})\), both of which are Gaussian as mentioned after Eq. (7). The distance between the two univariate distributions is computed using the Bhattacharyya distance (BD) which, in the case of two Gaussian distributions, has the following closed-form solution
where \(\mu _{Rv}=mean_j(z_{Rjv})\), \(\mu _{Mv}=mean_j(z_{Mjv})\), \(\sigma _{Rv}^2=var_j(z_{Rjv})\), and \(\sigma _{Mv}^2=var_j(z_{Mjv})\). Of course, the BD used in this study assumes Gaussian distributions for analytical tractability. Since ComBAT applies only linear transformations, non-Gaussian features in the raw data (e.g., MD) are generally preserved, and the BD should be interpreted as an approximate measure of distributional similarity.
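For two univariate Gaussians, the Bhattacharyya distance has the well-known closed form \(\frac{1}{4}\frac{(\mu _R-\mu _M)^2}{\sigma _R^2+\sigma _M^2} + \frac{1}{2}\ln \frac{\sigma _R^2+\sigma _M^2}{2\sigma _R\sigma _M}\). A minimal sketch of its computation from rectified samples is shown below; the function names are hypothetical, and the sample variance is used as the variance estimator (an assumption).

```python
import math
import statistics


def bhattacharyya_gaussian(mu_r, var_r, mu_m, var_m):
    """Bhattacharyya distance between N(mu_r, var_r) and N(mu_m, var_m)."""
    mean_term = 0.25 * (mu_r - mu_m) ** 2 / (var_r + var_m)
    var_term = 0.5 * math.log((var_r + var_m) / (2.0 * math.sqrt(var_r * var_m)))
    return mean_term + var_term


def bhattacharyya_from_samples(z_ref, z_mov):
    """Distance between the rectified reference and moving samples of one location."""
    return bhattacharyya_gaussian(
        statistics.mean(z_ref), statistics.variance(z_ref),
        statistics.mean(z_mov), statistics.variance(z_mov),
    )


bd_identical = bhattacharyya_gaussian(0.0, 1.0, 0.0, 1.0)  # same distribution
bd_shifted = bhattacharyya_gaussian(0.0, 1.0, 1.0, 1.0)    # mean shift only
bd_scaled = bhattacharyya_gaussian(0.0, 1.0, 0.0, 4.0)     # variance change only
```

The distance is zero for identical distributions and grows with either a mean shift or a variance mismatch, so it captures residual additive and multiplicative differences in a single number.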
An illustration of the BD computed before and after harmonization is provided in Fig. 2.
The reader should note that any statistical distance, such as the BD, is sensitive to factors like sample size and site composition, and therefore cannot define an absolute threshold for successful harmonization. Instead, these metrics are most informative when used comparatively, allowing assessment of relative harmonization quality across experiments and conditions.
Dataset studied
Reference site
The reference site serves as a standardized baseline, consisting of diverse data representative of healthy individuals and balanced across the covariates relevant to the study. For this work, we selected the publicly available CamCAN Stage 2 dataset35,36. While we do not claim CamCAN as a universal reference, it nonetheless offers a large-scale, multi-modal resource on cognitive aging across the adult lifespan (ages 18–87), making it particularly well suited for our study. With roughly 700 participants, the sample is well-balanced, thanks to a uniform age and sex distribution and recruitment from a larger population-based cohort of 2681 individuals36. Approval for the study was granted by the Research Ethics Committee of Cambridgeshire 2 (reference: 10/H0308/50)35, and participants gave written, informed consent prior to involvement. A total of 441 healthy participants (224 males, 217 females) were selected for this study.
DWI data were acquired on a 3T Siemens TIM Trio Scanner (32-channel head coil) using a twice-refocused spin echo (TRSE) echo-planar imaging sequence to reduce eddy-current artifacts. The acquisition included two b-values of 1000 and \(2000\,\textrm{s}/\textrm{mm}^2\) along 30 directions and three non-diffusion-weighted images (b=0). Additional parameters were: 66 axial slices, voxel size = 2 mm x 2 mm x 2 mm, repetition time (TR) = 9100 ms, echo time (TE) = 104 ms, and an acceleration factor of 2 using GRAPPA36. Please note that since the noncentral chi bias40 was not removed from that population, diffusivity measurements may be affected in CamCAN due to the GRAPPA/sum-of-squares reconstruction and high b-value. Applying a noncentral chi correction may improve results in future work.
Moving sites
Modified-CamCAN. This dataset contains the same data as the CamCAN dataset but with different additive and multiplicative biases following this equation:
where A is used to translate the population, S to change the slope of the population with respect to its associated covariates, and M to change the variance of the population.
An example of CamCAN and modified CamCAN (both with 441 subjects) with \(A=0.8\), \(S=0.8\) and \(M=1.1\) is illustrated in Fig. 2a. Figure 2b shows the two population distributions before (left) and after (right) the Pairwise-ComBAT harmonization. Figure 2c reports the BD metrics and the mean absolute difference (MAD) between the moving and the reference site data points before and after harmonization. This figure is representative of most experiments reported in the Results section.
Pairwise-ComBAT harmonization of the mean diffusivity (MD) in the modified CamCAN dataset against the unbiased CamCAN (N=441). (a) The parameters A (additive), M (multiplicative), S (slope) used to generate the modified CamCAN version (cf. Eq. (16)). (b) (left) raw data with the slopes for CamCAN (solid black line), modified CamCAN (dashed black line), and the slope estimated by Pairwise-ComBAT (solid blue line). (right) The Pairwise-ComBAT harmonization. (c) Distances between the raw and harmonized populations.
ADNI. The ADNI data come from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD).
We selected the 152 subjects (100 Healthy Control (HC), 28 Mild Cognitive Impairment (MCI), 24 Alzheimer’s Disease (AD)) from the 127-GE site of the ADNI-3 cohort37. ADNI-127-GE includes 83 males and 69 females, with a mean age of \(73.4 \pm 7.4\) years (min: 61, max: 90); 142 subjects were left-handed and 11 right-handed.
The diffusion-weighted MRI (DWI) data were acquired on a 3T General Electric (GE) Healthcare scanner using a spin echo imaging sequence. The acquisition includes 6 non-diffusion-weighted images (b=0) and 48 directions with a b-value of \(1000\,\textrm{s}/\textrm{mm}^2\). Additional parameters are: TE = 56 ms, TR = 7800 ms, voxel size = 2 mm x 2 mm x 2 mm and approximate scan time = 7 min 30 s.
NIMH. We used the 157 subjects from the National Institute of Mental Health (NIMH) Intramural Healthy Volunteer Dataset41. The MRI protocol was initially based on the ADNI-3 basic protocol and was modified for DWI acquisition to include the slice-select gradient reversal method and reversed blip scans, and to turn off reconstruction interpolation. The 3D-T1 weighted acquisition from ADNI-3 was replaced by the 3D-T1 acquisition from the ABCD protocol. All participants provided electronic informed consent for online pre-screening, and written informed consent for all other procedures.
Processing
Diffusion MRI data were processed using the TractoFlow pipeline42 with recommended settings. DTI-derived scalar maps were computed using b-values below \(b = 1200\,\textrm{s}/\textrm{mm}^2\), while fiber orientation distribution function (fODF) metrics were derived from b-values above \(b = 700\,\textrm{s}/\textrm{mm}^2\). As noted by Jensen and Helpern43, using b-values below \(1200\,\textrm{s}/\textrm{mm}^2\) helps minimize non-Gaussian effects in the signal, whereas b-values above \(700\,\textrm{s}/\textrm{mm}^2\) improve the estimation of the fiber response function42. Higher b-values also increase anisotropy, resulting in sharper fODFs. Also, following the recommendations by Theaud et al.42, we applied MP-PCA denoising and N4 bias correction to the dMRI signal.
The fODF was generated using a spherical harmonics order of 8 and the same fiber response function44 for all subjects, (15, 4, 4) x \(10 ^4\textrm{ms}/\mathrm {\mu m}^2\). Next, all metrics were registered to the Montreal Neurological Institute (MNI) space. Finally, the IIT Human Brain Atlas (IIT-FA-skeleton, v.5.0)45 was used to extract the averaged value for each major white matter fiber bundle of interest (N=25).
For simplicity and ease of reading, the results section presents only one white matter bundle, the left arcuate fasciculus (AF left), and a single diffusion metric derived from the Diffusion Tensor Imaging (DTI) model, the mean diffusivity (MD).
Harmonization of the experimental data
As mentioned before, we consider ComBAT as a machine learning algorithm. As such, every dataset was divided into 2 subsets: a training subset used to estimate the harmonization parameters \(\hat{\sigma }_v, \bar{\delta }_{iv}^*, \bar{\gamma }_{iv}^*, \varvec{\hat{\alpha }}_v\) and \(\varvec{\hat{\beta }}_v\), and a test subset to gauge the generalization capabilities of ComBAT. In all cases, CamCAN was used as the reference site towards which each dataset is harmonized. In the case of modified-CamCAN, 100 random samples of CamCAN are used for the reference site and the remaining 341 samples are transformed following Eq. (16) into modified-CamCAN. Although large reference datasets are desirable, experiments conducted in preparation of this paper show that a well-balanced dataset is more important than sheer size, and that increasing the reference dataset beyond 100 subjects provides little additional improvement in harmonization accuracy.
For the other datasets, the CamCAN reference site is made of all 441 data samples. All five experiments described below are repeated 30 times with random sampling for each criterion, and the results are averaged.
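The modified-CamCAN partition and its repetition can be sketched as follows. The sizes (441 subjects, 100 reference samples, 30 repetitions) come from the text; the function name and the use of one seed per repetition are assumptions made for illustration.

```python
import random


def split_reference_moving(n_subjects=441, n_reference=100, seed=0):
    """Randomly assign subject indices to the reference and moving sets."""
    rng = random.Random(seed)
    indices = list(range(n_subjects))
    rng.shuffle(indices)
    reference = sorted(indices[:n_reference])
    moving = sorted(indices[n_reference:])  # later transformed into modified-CamCAN
    return reference, moving


# One independent random partition per repetition (seed choice is hypothetical).
splits = [split_reference_moving(seed=repetition) for repetition in range(30)]
```

Each repetition would then estimate the harmonization parameters on its own partition, with the reported error being the average over the 30 runs.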
Experiments
Five experiments were conducted to gauge the performance of ComBAT when pushed to its limits. In the case of modified-CamCAN, since it contains the same data samples as CamCAN (up to a transformation, Eq. (16)), we report the mean absolute difference (MAD) for both the training and the testing data. As for the other datasets’ harmonization, we report the Bhattacharyya goodness-of-fit distance (BD).
Experiment 1: Additive and multiplicative biases. The aim of the first experiment is to illustrate that, as mentioned in the Mathematical limitations of ComBAT section, ComBAT cannot correctly align two populations and properly estimate the variance of the moving site \(\hat{\delta }_{iv}^{2*}\) when the two populations have different covariate slopes. As such, these experiments gauge the effect that the bias, slope and variance multiplicative factors have on ComBAT (respectively, A, S and M in Eq. (16)).
Experiment 2: Training sample size. The effect of the number of training subjects N from the moving site is evaluated by varying its value from 2 to a maximum number of subjects. In each case, the N samples are randomly selected while maintaining an equal number of males and females. In all cases, we report the training and testing MAD error obtained on the modified-CamCAN dataset.
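A sex-balanced random draw of N training subjects can be sketched as below. The subject representation, the function name, and the toy cohort are hypothetical; only the constraint of equal numbers of males and females comes from the text.

```python
import random


def balanced_sample(subjects, n, rng):
    """Draw n subjects with as many males as females.

    subjects: list of (subject_id, sex) tuples with sex in {"M", "F"}
    n: even number of subjects to draw
    """
    if n % 2:
        raise ValueError("n must be even to balance males and females")
    males = [s for s in subjects if s[1] == "M"]
    females = [s for s in subjects if s[1] == "F"]
    half = n // 2
    return rng.sample(males, half) + rng.sample(females, half)


rng = random.Random(42)
# Toy cohort of 341 subjects with alternating sex labels.
cohort = [(i, "M" if i % 2 == 0 else "F") for i in range(341)]
training = balanced_sample(cohort, 32, rng)
```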
Experiment 3: Training age range. The effect of the training age range is evaluated by varying the range in which the moving site data are sampled. We tested age spans from 10 years to 70 years, and, for each age span, we tested various age groups (AG):
- 10 years age span: \(20-30\), \(30-40\), ..., \(70-80\) AG
- 20 years age span: \(20-40\), \(30-50\), ..., \(70-90\) AG
- 30 years age span: \(20-50\), \(30-60\), ..., \(60-90\) AG
- ...
- 60 years age span: \(20-80\), \(30-90\) AG
- 70 years age span: \(20-90\) AG
For modified-CamCAN, samples of 32 subjects were used for each age range, which is the maximum number of subjects possible to achieve a male/female balance with CamCAN data. Note that the \(80-90\) age range was not included due to the small number of subjects.
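The age groups listed above can be enumerated programmatically, as in the sketch below. The 10-year step, the 20 to 90 bounds, and the exclusion of the 80 to 90 group follow the text; the function name and return structure are hypothetical.

```python
def age_groups(min_age=20, max_age=90, step=10, excluded=((80, 90),)):
    """Return {span: [(start, end), ...]} for every age span and sliding window."""
    groups = {}
    for span in range(step, max_age - min_age + 1, step):
        groups[span] = [
            (start, start + span)
            for start in range(min_age, max_age - span + 1, step)
            if (start, start + span) not in excluded
        ]
    return groups


groups = age_groups()
n_conditions = sum(len(windows) for windows in groups.values())
```

Under these settings, the enumeration yields 27 age-range conditions in total, each of which would be sampled with 32 sex-balanced subjects.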
(a) The quadratic error in the estimation of the moving site variance \(\hat{\delta }_{iv}^{2*}\) for different slope and variance multiplicative factors S and M. (b) Harmonization of the mean diffusivity (MD) of the NIMH population (red) onto that of CamCAN (gray). The different slopes between NIMH and CamCAN (left) cannot be compensated for by ComBAT (right). The red cross indicates a population misalignment. Note that while the lack of increase in MD with age in NIMH may reflect a true biological factor (e.g. ethnicity, environmental factor, etc.), the goal of this experiment is to underline that ComBAT alone is mathematically incapable of correcting a covariate slope.
Experiment 4: Sex covariate. When harmonizing two populations, two key covariates are typically considered: age and sex. In this experiment, we specifically evaluated the impact of the sex covariate. The objective is twofold: first, to assess how critical the inclusion of this covariate is in the ComBAT harmonization equations, and second, to understand the importance of maintaining a well-balanced representation of males and females in the populations being harmonized.
To achieve this, we harmonized the modified-CamCAN dataset onto the CamCAN dataset under different conditions. These included scenarios with and without the sex covariate incorporated into the ComBAT equations (typically in variables \({\varvec{x}}_{ij}\) and \(\varvec{\beta }_v\)). Furthermore, we investigated the effect of excluding one of the sexes (i.e., harmonizing populations that are either male-only or balanced). This analysis allows us to measure the influence of both the inclusion of the sex covariate and the population’s gender composition on the harmonization performance.
Experiment 5: Pathological populations The effect of pathology was assessed by generating a “control” group and a “pathological” group from the modified CamCAN training sample. The control group includes 100 subjects, which remain constant throughout the experiment. The pathological group also includes 100 subjects and is modified similarly to the first experiment. The additive and multiplicative factors A and M of Eq. (16) vary from 0.8 to 1.2 times their initial value to simulate a distribution that differs more or less from that of the control subjects. For this experiment, the subjects' ages range from 20 to 90 years, with a 50/50 male/female distribution. The effect of pathology is assessed by calculating the harmonizer training and testing error when the parameters are estimated from the healthy control group only (HC-only) and when they are estimated from the healthy control and pathological groups together (HC-patho).
Complementary experiments As previously mentioned, for readability, the following results focus on a single metric–mean diffusivity (MD)–and one white matter region–the left arcuate fasciculus (AF left). However, the same experiment was conducted on various other diffusion MRI metrics and white matter regions, all exhibiting trends consistent with those reported in the paper. The supplementary material provides results for these additional experiments, including other metrics–fractional anisotropy (FA), isotropic volume fraction (isoVF) from NODDI, and total apparent fiber density (AFDt) from CSD–as well as different brain regions, namely the middle section of the corpus callosum (CC_Mid), the left corticospinal tract (CST_L), and the right inferior fronto-occipital fasciculus (IFOF_R).
The reader should note that the metric/age population distributions in Fig. 1 and throughout the results section are not derived from the analytical form of the ComBAT model but are instead obtained empirically via local quantile regression smoothing. A sliding window of \(\pm 10\) years is applied to select a subset of subjects, from which percentiles are estimated for all distributions (raw, moving and reference data distributions). When fewer than ten subjects fall within the window, its size is dynamically increased to obtain a larger sample and improve the accuracy of the centile estimates. This approach allows for non-linear centile curves that better reflect observed data distributions, particularly in the presence of residual non-Gaussian noise or non-linear age effects, as is the case for several dMRI metrics such as mean diffusivity.
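The sliding-window centile estimation described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: the function name `sliding_centiles`, the 1-year growth step of the window, and the choice of percentiles are assumptions.

```python
import numpy as np

def sliding_centiles(age, value, grid, half_width=10.0, min_subjects=10,
                     grow=1.0, percentiles=(5, 25, 50, 75, 95)):
    """Empirical centiles of `value` at each age in `grid`, from a +/- half_width window."""
    age = np.asarray(age, dtype=float)
    value = np.asarray(value, dtype=float)
    if age.size < min_subjects:
        raise ValueError("fewer subjects than min_subjects: the window can never fill")
    out = np.empty((len(grid), len(percentiles)))
    for k, a in enumerate(grid):
        w = half_width
        mask = np.abs(age - a) <= w
        # Enlarge the window until it holds enough subjects (assumed growth rule).
        while mask.sum() < min_subjects:
            w += grow
            mask = np.abs(age - a) <= w
        out[k] = np.percentile(value[mask], percentiles)
    return out

rng = np.random.default_rng(0)
age = rng.uniform(20, 90, 300)                       # toy ages
md = 0.7 + 0.002 * age + 0.02 * rng.standard_normal(300)  # toy MD-like values
grid = np.arange(20, 91, 10)
centiles = sliding_centiles(age, md, grid)
print(centiles.shape)
```

Because each grid point is processed independently, the resulting curves are free to bend with age, which is the property the text relies on for metrics with non-linear age effects.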
Experiment 1. Harmonization of the Modified-CamCAN population (in gray) on the CamCAN population (in red). Results for different multiplicative factors on (a) the bias (A) (b) the Slope (S) and (c) the noise variance (M) as described in Eq. (16). The left columns are the original data and the right columns are the harmonized data. Green checks indicate correct harmonization, red X’s indicate poor harmonization. Poor harmonization occurs when the moving site’s age covariate slope differs substantially from the reference.
[Left] (a) Mean Absolute Difference (MAD) training and testing harmonization errors for various numbers of training samples N from the Modified-CamCAN moving site. (b) and (c) illustrate the effect of using too few or enough training samples on the training and testing fit. Green checks indicate correct harmonization, red X’s indicate poor harmonization. [Right] Effect of the sex covariate on Pairwise-ComBAT harmonization of the mean diffusivity (MD) metric between the modified CamCAN and CamCAN datasets. (a) Mean Absolute Difference (MAD) harmonization error when considering male-only, female-only and balanced (male+female) populations. Below are harmonization results displaying male-only and female-only MD-versus-age plots after (b) balanced harmonization and (c) harmonization of male subjects from the moving site to female subjects in the reference site.
Results
Experiment 1: additive and multiplicative biases
Figure 4 illustrates the effect of the bias, slope, and variance multiplicative factors (A, S, and M, respectively) from Eq. (16) on ComBAT harmonization. As shown in column (a), ComBAT effectively compensates for bias effects when \(M=S=1\). Across all cases, the harmonized population exhibits negligible Bhattacharyya distances (BD) on the order of \(2.5 \times 10^{-5}\). However, as seen in column (b), ComBAT struggles to harmonize data from two sites with significantly different covariate slopes. While harmonization is adequate for a multiplicative slope factor close to 1, such as \(S=0.8\), the alignment between the red and black populations deteriorates dramatically for \(S=0\) and \(S=1.8\).
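For readers unfamiliar with the Bhattacharyya distance, the sketch below gives its closed form for two univariate Gaussian distributions. Whether the BD values reported here were computed with this Gaussian form or from empirical histograms is not stated in this section, so the function `bhattacharyya_gaussian` should be read as an assumption-laden illustration only.

```python
import numpy as np

def bhattacharyya_gaussian(mu1, var1, mu2, var2):
    """Bhattacharyya distance between N(mu1, var1) and N(mu2, var2)."""
    # Term driven by the difference in means.
    mean_term = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    # Term driven by the difference in variances (zero when var1 == var2).
    var_term = 0.5 * np.log((var1 + var2) / (2.0 * np.sqrt(var1 * var2)))
    return mean_term + var_term

print(bhattacharyya_gaussian(0.0, 1.0, 0.0, 1.0))  # identical distributions
print(bhattacharyya_gaussian(0.0, 1.0, 1.0, 1.0))  # shifted mean
print(bhattacharyya_gaussian(0.0, 1.0, 0.0, 4.0))  # different variance
```

The distance is zero only when both the means and the variances match, which is why it is sensitive to residual additive and multiplicative site effects alike.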
[Top] Effect of the training age range on the harmonization of the mean diffusivity (MD) metric between the modified CamCAN and CamCAN datasets. Green checks indicate correct harmonization, red X’s indicate poor harmonization. [Bottom] Training and testing mean absolute difference (MAD) harmonization error for age ranges from 10 to 70 years of the Modified-CamCAN dataset.
The third column (c) demonstrates that, in the absence of bias and slope factors (\(A=S=1\)), ComBAT performs well in compensating for low and high multiplicative variance factors (i.e., \(M=0.2\) and \(M=1.8\)). However, as depicted in the bottom-right figures, when there is a difference in slope between the populations (i.e., \(S=1.8\)), ComBAT fails to accurately estimate the population variance \(\hat{\delta }_{iv}^{2*}\).
To illustrate the impact of the slope multiplicative factor S on ComBAT, Fig. 3 shows the quadratic error in estimating the moving site variance \(\hat{\delta }_{iv}^{2*}\) for varying slope values \(S \in [0, 2]\). Each curve represents a unique variance multiplicative factor M. Notably, ComBAT achieves zero error in variance estimation when \(S=1\), i.e. when the covariate slopes are the same for both populations. However, as S deviates from 1 (either decreasing towards 0 or increasing towards 2), the quadratic error in \(\hat{\delta }_{iv}^{2*}\) estimation rises.
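The mechanism behind this behaviour can be reproduced with a toy simulation. The sketch below does not use Eq. (16) or the Pairwise-ComBAT code: it assumes a simple linear model in which the moving site has slope \(S\beta\) and noise standard deviation \(M\sigma\), and in which the reference slope \(\beta\) is imposed on the moving site. All numerical values and function names are hypothetical. Under these assumptions, the residual variance of the moving site becomes \(M^2\sigma^2 + (S-1)^2\beta^2\,\mathrm{Var}(\text{age})\), so the estimate is unbiased only at \(S=1\).

```python
import numpy as np

def moving_variance_estimate(S, M, beta=0.002, sigma=0.02, n=5000, seed=0):
    """Residual variance of a simulated moving site once the reference slope is removed."""
    rng = np.random.default_rng(seed)
    age = rng.uniform(20, 90, n)
    # Toy moving site: slope scaled by S, noise standard deviation scaled by M.
    y = 0.7 + S * beta * age + M * sigma * rng.standard_normal(n)
    # Remove the site mean and the reference slope beta (not the site's own slope).
    resid = (y - y.mean()) - beta * (age - age.mean())
    return resid.var(ddof=1)

def expected_variance(S, M, beta=0.002, sigma=0.02, age_var=(90 - 20) ** 2 / 12):
    """Closed form under the same toy model: true variance plus a slope-mismatch term."""
    return (M * sigma) ** 2 + ((S - 1) * beta) ** 2 * age_var

for S in (0.0, 0.5, 1.0, 1.5, 2.0):
    est = moving_variance_estimate(S, M=1.0)
    quad_err = (est - (1.0 * 0.02) ** 2) ** 2
    print(f"S={S:.1f}  estimated variance={est:.2e}  quadratic error={quad_err:.2e}")
```

The mismatch term grows with the square of \(S-1\) and with the age variance of the moving site, which is consistent with the symmetric rise of the error on both sides of \(S=1\).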
A similar scenario is illustrated on the bottom of Fig. 3, which shows the harmonization of the mean diffusivity (MD) of the NIMH population (red) onto that of CamCAN (gray). The difference in covariate slopes between NIMH and CamCAN (left) leads to erroneous harmonization outcomes (right). Please note that the purpose of this experiment is to demonstrate that ComBAT, by design, cannot correct for differences in covariate slopes. Whether the absence of an age-related increase in MD in the NIMH dataset reflects a true biological phenomenon or a selection effect is beyond the scope of this study.
Overall, these results show that ComBAT is effective at correcting both positive and negative biases as well as changes in variance, but that it cannot account for differences in slopes between populations. More results obtained on other dMRI metrics and WM regions are reported in Figs. S2 to S7 in the supplementary materials.
Experiment 2: training sample size
The results presented in Fig. 5 [Left] explore the impact of the number of training subjects N from the moving site on harmonization performance. Figure 5 [Left] (a) highlights the relationship between training and testing errors when harmonizing varying numbers of training subjects, ranging from 2 to 341, from the Modified-CamCAN dataset to the reference site. The dashed line in the figure represents the minimum achievable error, which is obtained when the entire dataset is utilized for harmonization.
When examining these results, several patterns become apparent. When a very small number of subjects (\(N<8\)) is used, the testing error is significantly higher than the training error. This discrepancy suggests overfitting, as the model struggles to generalize with limited data. As the sample size increases beyond \(N=8\) subjects, the testing and training errors begin to decrease, indicating that ComBAT generalizes well even with moderate sample sizes. Importantly, although the training and testing errors approach the minimum achievable error when the sample size reaches 64 subjects or more, it is evident that reasonable harmonization performance can already be achieved with a sample size between 16 and 32 subjects.
This trend is further illustrated in Fig. 5 [Left] b and c, which provide additional insights into the relationship between sample size and harmonization quality. With 16 or fewer subjects, the testing results remain inconsistent and relatively poor, underscoring the limitations of working with such small datasets. However, as the number of subjects increases to 32 or more, both the training and testing results improve substantially, with harmonization achieving a strong fit to the data.
The supplementary material also provides a side-by-side comparison between Pairwise-ComBAT and ComBAT for different values of N (cf. Fig. S1). More results obtained on other dMRI metrics and WM regions are reported in Figs. S8 and S9.
Experiment 3: training age range
The harmonization of the Modified-CamCAN moving dataset across different age ranges is depicted in Fig. 6 [Top] and [Bottom].
Figure 6 [Bottom] reports the training and testing MAD for age ranges from 10 to 70 years. These results emphasize that as the age range widens, the harmonization performance improves, with both training and testing errors stabilizing at a low plateau for age ranges of 40 years or more. This suggests that broader age ranges provide more robust estimates of harmonization parameters, improving generalization for unseen data.
More specific harmonization examples for different age ranges are depicted in Fig. 6 [Top]. From left to right, the panels present results for age ranges spanning 10, 20, and 40 years. The first row illustrates unharmonized data for three specific age groups (10–20, 50–60, and 70–80 years in Fig. 6a), while the second and third rows show the harmonized training and testing data, respectively. In all cases, harmonization outcomes are compared between two approaches: using the full CamCAN age range (20–90 years) or restricting to CamCAN subjects whose ages match those of the moving site (i.e., 10–20, 50–60, and 70–80 years for Fig. 6a).
Across all settings, ComBAT achieves relatively low training errors, particularly when harmonization is performed using reference site subjects whose ages align closely with those in the moving site. This effect is especially pronounced for the 20–30 and 70–80 year age ranges, where the harmonized training distribution deviates less from the original black CamCAN distribution. However, examining the testing results reveals a critical limitation: smaller age ranges, such as 10 or 20 years, often result in poor harmonization fits for the testing data. Despite this, the results in Fig. 6c demonstrate that good training and testing harmonization fits can be achieved when the age range exceeds 40 years.
More results obtained on other dMRI metrics and WM regions are reported in Figs. S10 to S15 in the supplementary materials.
(a) and (b) illustrate a pathological population (purple triangles) exhibiting negative (top left) and positive (middle top) bias relative to a normative population (red). Below these, the harmonization results are shown both with and without the inclusion of pathological cases in the training dataset. Green checks indicate correct harmonization, red X’s indicate poor harmonization. (c) presents the harmonization mean absolute difference (MAD) error, comparing scenarios with [bottom] and without [top] the inclusion of pathological cases during training.
Experiment 4: sex covariate
The top panel of Fig. 5 [Right] illustrates the MAD harmonization error across various scenarios: harmonizing two balanced populations (i.e., equal numbers of males and females), harmonizing a male-only population with a balanced population, harmonizing two male-only populations, and harmonizing a male-only population with a female-only population. Interestingly, the first three scenarios result in low training and testing harmonization errors, while the fourth–harmonizing a male-only population with a female-only population–exhibits significantly larger errors. These findings emphasize the challenges and limitations of harmonizing highly heterogeneous populations, where covariate adjustments alone may not suffice.
The bottom panel of Fig. 5 [Right] displays the raw data (top) and, below, the male and female populations after (b) a recommended balanced harmonization and (c) harmonization of male-only data onto female-only data. Notably, the female population is poorly aligned with the reference CamCAN dataset when harmonization parameters are derived from an unbalanced dataset.
Harmonization of ADNI-127-GE site with CamCAN. (a) Raw data, (b) Pairwise-ComBAT harmonization from HC group only, (c) Pairwise-ComBAT harmonization from HC and pathological group. The gray (target site) and red (moving site) lines show the population mean trend of both sites, and the corresponding light color indicates the standard deviation. Green checks indicate correct harmonization, red X’s indicate poor harmonization.
Experiment 5: pathological populations
Figures 7 and 8 present harmonization results when the moving site includes both healthy control (HC) subjects and subjects affected by a pathology. In Fig. 7a and b, the pathological cases (blue dots) are a subset of subjects from the Modified-CamCAN dataset, where a negative bias is introduced in (a) and a positive bias in (b). In both scenarios, the harmonized training and testing data highlight the critical importance of using only HC subjects for harmonization onto a reference, such as CamCAN. Including pathological cases in the harmonization process results in a compressed population, where the harmonized HC and pathological cases are erroneously aligned with the target population (gray).
Additionally, Fig. 7c demonstrates that including pathological cases in the training set significantly increases the harmonization error. This issue becomes particularly problematic when the pathology’s bias falls below \(A=M=0.9\) or exceeds \(A=M=1.1\). More results obtained on other dMRI metrics and WM regions are reported in Figs. S16 to S21 in the supplementary materials.
Figure 8 illustrates the HC subjects (red-shaded region) and pathological cases (AD and MCI markers) from the 127-GE ADNI site in comparison with the CamCAN dataset. Here again, when the harmonization parameters are computed using only HC subjects, the red and gray-shaded populations are perfectly aligned, and the pathological cases remain appropriately distinct as outliers. Conversely, if both HC and pathological cases are used to estimate the ComBAT harmonization parameters, the pathological population is compressed into the normative range, creating the false impression that these subjects are normal.
Discussion
In this study, we evaluated the often-overlooked influence of various factors on the effectiveness of the ComBAT harmonization method, using a pairwise implementation of the technique. Our experiments investigated the effects of site-wise biases, sample size, age range, the sex covariate, and the presence of pathologies. Results demonstrate that ComBAT’s performance can be impacted by these factors, which underlines how ComBAT cannot be used as a black box.
As a general guideline, ComBAT performs well when harmonizing relatively large, well-balanced populations (in terms of age and sex) with substantial inter-population overlap in age distributions and similar age covariate slopes. ComBAT also performs well in compensating for additive and multiplicative biases (parameters A and M throughout our experiments). Our findings also reiterate ComBAT’s ability to preserve the biological variability of critical covariates, such as age, while correcting for inter-site differences.
However, ComBAT cannot be used indiscriminately, as it may produce suboptimal harmonization under certain conditions. One prominent issue arises when the age covariate slope differs between the populations being harmonized. In such cases, while the additive bias is compensated, the multiplicative bias is incorrectly estimated, and the slope discrepancy remains uncorrected. As discussed in the “Mathematical limitations of ComBAT” section, this limitation stems from the assumption that the slope \(\varvec{\beta }_v\) is consistent across all sites, which is not always valid11,32. That said, slope differences across sites should always be interpreted with caution, as they may reflect true biological variability between cohorts (e.g., recruitment or selection bias, ethnicity, or environmental factors). In practice, readers encountering such cases are encouraged to (i) consider the plausibility of biological explanations (e.g., differences in recruitment criteria or socio-demographic composition), (ii) assess potential site–covariate interactions by carefully examining acquisition protocols and cohort design, and (iii) avoid treating ComBAT (or any post-hoc harmonization method) as a substitute for proper cohort balancing at the acquisition or recruitment stage.
Another challenge occurs when the sample size is too small for the moving site being harmonized to the reference site. Fortin et al.19, the first to apply ComBAT to dMRI data harmonization, recommended a minimum of \(N=20\) subjects per site for reliable harmonization. Similarly, Orlhac et al.33 reported that ComBAT becomes less reliable for harmonization of PET imaging biomarkers across two sites when sample sizes fall below \(N=30\) subjects per site. Our results are consistent with these studies, showing that a reasonable harmonization can already be obtained with 16 to 32 subjects in the moving site and that fewer subjects may result in overfitting (low training error and large testing error). The study by Kim et al.32 recommends using at least \(N=162\) subjects. While this threshold appears substantially higher than our suggested values, Kim et al. treat both sites symmetrically and do not designate a reference site. Our recommendation of 16–32 subjects for the moving site and \(100+\) for the reference site is therefore consistent with their findings, as the larger reference cohort ensures stable harmonization while the moving site requires fewer subjects.
Another issue emerges when the age range of a population is too narrow (\(<20\) years). When reference subjects closely match the age range of the moving site (i.e., 10–20 or 70–80), training errors remain relatively low, indicating a good fit to the training data. However, this age-matched approach of adjusting reference distributions based on moving site characteristics does not necessarily result in optimal generalization, as shown by high test errors. This is particularly the case when the age range is restricted to 10 or 20 years, for which test performance deteriorates, leading to inconsistencies in the harmonization, especially for age ranges not included in the training data. In contrast, both training and test errors are low for age ranges of 40 years or more. This finding is consistent with previous research suggesting that wider age distributions and age range overlap are required for an optimal ComBAT harmonization26,29,32. Note that for applications targeting a narrower age group (e.g., pediatric or geriatric populations), the 40-year range can be proportionally reduced (e.g., a 10-year span for an adolescent cohort) provided that the age distributions of the reference and moving sites are well matched.
Regarding the covariates, Pairwise-ComBAT assumes that all subjects share the same distribution when covariates are not included, i.e., it assumes identical distributions between males and females, which risks suppressing biologically relevant variability by wrongly attributing it to site effects32,33. The inclusion of covariates helps to account for differences in distributions, thereby improving harmonization, especially in cases of demographic imbalance. This is consistent with the findings of Kim et al.32, who reported that ComBAT struggles to harmonize sites with very small and demographically diverse samples unless appropriate covariates are included.
Results also emphasize the importance of handling pathological populations carefully. In clinical studies of brain diseases like Alzheimer’s disease (AD), multiple sclerosis (MS) or brain tumors, the site effect is often influenced by differences in clinical status32,33,46. To prevent site effects from confounding biological variability, previous studies have typically included clinical status as a covariate32,46. In contrast, this study excludes pathological subjects from the estimation of the harmonization parameters, rather than using clinical status as a covariate. When pathological and healthy subjects are mixed, Pairwise-ComBAT tends to compress both pathological and healthy subjects around the target site, thereby reducing population-related variability. The proposed harmonization strategy preserves biological variability while correcting for site effects, thus avoiding bias in the pathological population. Therefore, a recommended strategy is to estimate the harmonization parameters using only control subjects, temporarily excluding pathological cases. Once estimated, these parameters can then be applied to the pathological subjects. This approach minimizes the risk of confounding disease-related variability with site effects33.
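The "estimate on controls, apply to everyone" strategy can be illustrated with a deliberately simplified location-scale harmonizer. The sketch below is not Pairwise-ComBAT: it has no empirical Bayes step, handles a single feature, and takes the shared age slope from the reference site. The functions `fit_site` and `harmonize` and all numerical values are hypothetical.

```python
import numpy as np

def fit_site(age, y, beta):
    """Intercept and residual standard deviation of one site, given a shared age slope."""
    resid = y - beta * age
    return resid.mean(), resid.std(ddof=1)

def harmonize(age, y, beta, moving, reference):
    """Map moving-site values onto the reference site's location and scale."""
    (a_m, s_m), (a_r, s_r) = moving, reference
    return (y - a_m - beta * age) / s_m * s_r + a_r + beta * age

rng = np.random.default_rng(0)
slope = 0.002
age_ref = rng.uniform(20, 90, 300)
y_ref = 0.70 + slope * age_ref + 0.02 * rng.standard_normal(300)
age_hc = rng.uniform(20, 90, 100)           # moving-site healthy controls
y_hc = 0.75 + slope * age_hc + 0.03 * rng.standard_normal(100)
age_pa = rng.uniform(20, 90, 100)           # moving-site pathological subjects
y_pa = 0.75 + 0.08 + slope * age_pa + 0.03 * rng.standard_normal(100)

beta = np.polyfit(age_ref, y_ref, 1)[0]     # shared slope taken from the reference site
ref = fit_site(age_ref, y_ref, beta)
hc_only = fit_site(age_hc, y_hc, beta)
mixed = fit_site(np.concatenate([age_hc, age_pa]),
                 np.concatenate([y_hc, y_pa]), beta)

def gap(params):
    """Mean offset of harmonized pathological subjects from the reference trend."""
    y_h = harmonize(age_pa, y_pa, beta, params, ref)
    return float(np.mean(y_h - (ref[0] + beta * age_pa)))

gap_hc_only, gap_mixed = gap(hc_only), gap(mixed)
print(f"pathology offset, HC-only parameters:  {gap_hc_only:.3f}")
print(f"pathology offset, HC+patho parameters: {gap_mixed:.3f}")
```

In this toy setting the offset of the pathological group shrinks markedly when its subjects contribute to the parameter estimates, because the pathology shifts the estimated site mean and inflates the estimated site variance.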
In conclusion, our findings underscore that the key to successful ComBAT harmonization lies in ensuring that populations are as homogeneous as possible in terms of age range, demographic profiles, sex distribution, covariate slope and health status. Neglecting these considerations may not only lead to poor harmonization during training but also result in catastrophic testing errors.
The reader should note that while CamCAN has been used throughout this paper, it may not be an optimal reference dataset for all studies. For instance, it would be less relevant for sites with only partial age overlap, such as the ABCD cohort (ages 9–20 years). In such cases, a practical strategy would be to first harmonize the ABCD healthy controls to the overlapping age range of CamCAN, then use the combined CamCAN+ABCD healthy cohort as an extended reference.
Recommendations for a good ComBAT harmonization
Given the influence of the factors discussed above, we strongly recommend carefully examining the characteristics of the populations from each site before proceeding with their harmonization. This step ensures that ComBAT’s assumptions are met, particularly concerning observed distributions (i.e., additive and multiplicative effects) and sample size, and also helps identify one or more relevant covariates. These covariates can be demographic (e.g., age, sex, or handedness) or clinical (e.g., neurodegenerative disease, treatment, or the presence of a brain abnormality, such as a tumor).
In the light of our experiments, our recommendations are as follows:
- Data inspection: ComBAT should never be used as a “magic black box.” Always ensure that the populations to be harmonized exhibit distributions that align with ComBAT’s assumptions, particularly uniformity across key variables.
- Covariate slope effect: The slope \(\varvec{\beta }_v\) of key covariates such as age should be approximately the same across all sites. If this condition is not met, the resulting harmonization is likely to be erroneous.
- Sample size: At least 16 to 32 subjects should be included in the moving site when computing the parameters of ComBAT with a reference site to ensure that the resulting harmonization function generalizes well to all data.
- Age distributions: The age range of the moving site’s population should span at least 40 years. If this requirement cannot be met, consider using an age-matched ComBAT harmonization, particularly for applications where proper harmonization of the training population is crucial. Note that for applications targeting a narrower age group (e.g., pediatric or geriatric populations), the 40-year range can be proportionally reduced (e.g., a 10-year span for an adolescent cohort) provided that the age distributions of the reference and moving sites are well matched.
- Sex covariate: The populations from all sites should have a uniform distribution of male and female subjects. Under no circumstances should a male-predominant population be harmonized with a female-predominant population, or vice versa.
- Disease differences: Although this is commonly understood, for normative modeling applications the harmonization parameters should ideally be estimated on healthy subjects only, to avoid entangling biological differences related to pathology with scanner-induced effects. These parameters can later be applied to pathological subjects, reducing the risk of confounding disease-related variability with site effects.
- Reference site: The choice of a reference site should prioritize population compatibility and balance across covariates relevant to the study. In some cases, using multiple reference sites or stratified strategies may be preferable to minimize bias in downstream analyses.
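Several of the recommendations above can be turned into a quick screening step before any harmonization is attempted. The sketch below is one possible way to do so and is not part of the Pairwise-ComBAT code. The thresholds for sample size (16) and age span (40 years) are taken from the recommendations, whereas the tolerances on sex balance and slope ratio, as well as the function names, are assumptions chosen for illustration.

```python
import numpy as np

def site_summary(age, sex, y):
    """Basic descriptors of one site: size, age span, sex ratio and age slope."""
    age = np.asarray(age, dtype=float)
    slope = np.polyfit(age, np.asarray(y, dtype=float), 1)[0]
    return {
        "n": int(age.size),
        "age_span": float(age.max() - age.min()),
        "male_fraction": float(np.mean(np.asarray(sex) == "M")),
        "age_slope": float(slope),
    }

def check_pair(moving, reference, min_n=16, min_span=40.0,
               max_sex_gap=0.2, max_slope_dev=0.2):
    """Flag which recommendations a (moving, reference) pair satisfies."""
    ref_slope = reference["age_slope"]
    ratio = moving["age_slope"] / ref_slope if ref_slope != 0 else np.inf
    return {
        "sample_size_ok": moving["n"] >= min_n,
        "age_span_ok": moving["age_span"] >= min_span,
        # Assumed tolerance: male fractions may differ by at most max_sex_gap.
        "sex_balance_ok": abs(moving["male_fraction"] - reference["male_fraction"]) <= max_sex_gap,
        # Assumed tolerance: slope ratio within 1 +/- max_slope_dev.
        "slope_ok": abs(ratio - 1.0) <= max_slope_dev,
    }

rng = np.random.default_rng(0)
age_ref = np.linspace(20, 90, 300)
sex_ref = np.array(["M", "F"] * 150)
y_ref = 0.70 + 0.002 * age_ref + 0.005 * rng.standard_normal(300)
age_mov = np.linspace(20, 90, 32)
sex_mov = np.array(["M", "F"] * 16)
y_mov = 0.85 + 0.005 * rng.standard_normal(32)   # toy site with a flat age trend

report = check_pair(site_summary(age_mov, sex_mov, y_mov),
                    site_summary(age_ref, sex_ref, y_ref))
print(report)
```

A failed check does not mean that harmonization is impossible, only that the corresponding assumption deserves a closer look, for example by plotting the metric against age for each site.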
As noted in the introduction, numerous variants of ComBAT have been proposed in recent years, each addressing specific limitations of the original method. Table 1 summarizes six of the most widely cited alternatives, along with a brief description and recommendations for when each method is most appropriate.
Conclusions and future works
This study rigorously examined the ComBAT harmonization methodology and its variant, Pairwise-ComBAT, in addressing biases across multi-site diffusion MRI datasets. The findings highlight the critical importance of adhering to foundational assumptions, such as demographic consistency, sufficient sample size, and proper covariate representation, to ensure robust harmonization outcomes. While ComBAT demonstrated remarkable efficacy in compensating for additive and multiplicative biases, challenges arise when site-wise demographic distributions or covariate slopes differ significantly. We also provided seven recommendations for an optimal application of ComBAT in both research and clinical settings, emphasizing its potential for preserving biological variability and facilitating reproducible science.
Data availability
The data and code used to generate the results will be made publicly available upon acceptance of the paper. The dataset is currently hosted on Zenodo at https://zenodo.org/records/15693521, and the code is hosted in the GitHub repository github.com/scil-vital/Pairwise-ComBAT.
References
Di Biase, M. et al. Mapping human brain charts cross-sectionally and longitudinally. Proc. of the National Academy of Sciences 120 (2023).
Marquand, A. F., Rezek, I., Buitelaar, J. & Beckmann, C. F. Understanding heterogeneity in clinical cohorts using normative models: Beyond case-control studies. Biol. Psychiatry 80, 552–561 (2016).
Rutherford, S. et al. Evidence for embracing normative modeling. eLife 12 (2023).
Verdi, S., Marquand, A. F., Schott, J. M. & Cole, J. H. Beyond the average patient: How neuroimaging models can address heterogeneity in dementia. Brain 144, 2946–2953 (2021).
Villalón-Reina, J. E. et al. Multi-site normative modeling of diffusion tensor imaging metrics using hierarchical bayesian regression. In Proc MICCAI, 207–217 (2022).
Cetin-Karayumak, S. et al. Harmonized diffusion mri data and white matter measures from the adolescent brain cognitive development study. Scientific Data 11, 249 (2024).
Hu, F. et al. Image harmonization: A review of statistical and deep learning methods for removing batch effects and evaluation metrics for effective harmonization. NeuroImage 274, 120125 (2023).
Huynh, K. M., Chen, G., Wu, Y., Shen, D. & Yap, P.-T. Multi-site harmonization of diffusion mri data via method of moments. IEEE Trans. Med. Imaging 38, 1599–1609 (2019).
Moyer, D., Steeg, G. V., Tax, C. M. W. & Thompson, P. M. Scanner invariant representations for diffusion mri harmonization. Magnetic Resonance Med. 84, 2174–2189 (2020).
Schilling, K. G. et al. Fiber tractography bundle segmentation depends on scanner effects, vendor effects, acquisition resolution, diffusion sampling scheme, diffusion sensitization, and bundle segmentation workflow. NeuroImage 242, 118451 (2021).
Bayer, J. M. et al. Site effects how-to and when: An overview of retrospective techniques to accommodate site effects in multi-site neuroimaging analyses. Front. Neurol. 13, 923988 (2022).
Helmer, K. G. et al. Multi-site study of diffusion metric variability: effects of site, vendor, field strength, and echo time on regions-of-interest and histogram-bin analyses. In Medical Imaging 2016: Biomedical Applications in Molecular, Structural, and Functional Imaging, 9788, 729–738, SPIE, (2016).
Pinto, M. S. et al. Harmonization of brain diffusion mri: Concepts and methods. Front. Neurosci. 14 (2020).
Chen, A. A. et al. Mitigating site effects in covariance for machine learning in neuroimaging data. Human Brain Mapping 43, 1179–1195 (2022).
Luca, A. D. et al. Cross-site harmonization of multi-shell diffusion mri measures based on rotational invariant spherical harmonics (rish). NeuroImage 259, 119439 (2022).
Horng, H. et al. Generalized combat harmonization methods for radiomic features with multi-modal distributions and multiple batch effects. Scientific Rep. 12, 4493 (2022).
Hu, F. et al. Deepcombat: A statistically motivated, hyperparameter-robust, deep learning approach to harmonization of neuroimaging data. bioRxiv 2023.04.24.537396 (2023).
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics 8, 118–127 (2007).
Fortin, J.-P. et al. Harmonization of multi-site diffusion tensor imaging data. NeuroImage 161, 149–170 (2017).
Tax, C. M. et al. Cross-scanner and cross-protocol diffusion mri data harmonisation: A benchmark database and evaluation of algorithms. NeuroImage 195, 285–299 (2019).
Karayumak, S. C. et al. Retrospective harmonization of multi-site diffusion mri data acquired with different acquisition parameters. NeuroImage 184, 180–200 (2019).
Fortin, J.-P. et al. Harmonization of cortical thickness measurements across scanners and sites. NeuroImage 167, 104–120 (2018).
Radua, J. et al. Increased power by harmonizing structural MRI site differences with the ComBat batch adjustment method in ENIGMA. NeuroImage 218, 116956 (2020).
Cai, L. Y. et al. MASiVar: Multisite, multiscanner, and multisubject acquisitions for studying variability in diffusion weighted MRI. Magnetic Resonance Med. 86, 3304–3320 (2021).
Leithner, D. et al. Impact of ComBat harmonization on PET radiomics-based tissue classification: A dual-center PET/MRI and PET/CT study. J. Nuclear Med. 63, 1611–1616 (2022).
Beer, J. C. et al. Longitudinal ComBat: A method for harmonizing longitudinal multi-scanner imaging data. NeuroImage 220, 117129 (2020).
Carré, A. et al. AutoComBat: a generic method for harmonizing MRI-based radiomic features. Scientific Rep. 12, 12762 (2022).
Da-ano, R. et al. Performance comparison of modified ComBat for harmonization of radiomic features for multicenter studies. Scientific Rep. 10, 10248 (2020).
Pomponio, R. et al. Harmonization of large MRI datasets for the analysis of brain imaging patterns throughout the lifespan. NeuroImage 208, 116450 (2020).
Hoang, B. et al. Distributed harmonization: Federated clustered batch effect adjustment and generalization. In Proc. of SIGKDD, 5105–5115 (2024).
Marzi, C. et al. Efficacy of MRI data harmonization in the age of machine learning: a multicenter study across 36 datasets. Sci. Data 11, 115 (2024).
Kim, M. E. et al. Empirical assessment of the assumptions of ComBat with diffusion tensor imaging. J. Med. Imaging 11, 024011 (2024).
Orlhac, F. et al. A guide to ComBat harmonization of imaging biomarkers in multicenter studies. J. Nuclear Med. 63, 172–179 (2022).
Parekh, P. et al. Sample size requirement for achieving multisite harmonization using structural brain MRI features. NeuroImage 264, 119768 (2022).
Shafto, M. A. et al. The Cambridge Centre for Ageing and Neuroscience (Cam-CAN) study protocol: a cross-sectional, lifespan, multidisciplinary examination of healthy cognitive ageing. BMC Neurol. 14, 204 (2014).
Taylor, J. R. et al. The Cambridge Centre for Ageing and Neuroscience (Cam-CAN) data repository: Structural and functional MRI, MEG, and cognitive data from a cross-sectional adult lifespan sample. NeuroImage 144, 262–269 (2017).
Aisen, P. S. et al. Clinical core of the Alzheimer’s Disease Neuroimaging Initiative: Progress and plans. Alzheimer’s Dementia 6, 239–246 (2010).
Gollub, R. L. et al. The MCIC collection: A shared repository of multi-modal, multi-site brain image data from a clinical investigation of schizophrenia. Neuroinformatics 11, 367–388 (2013).
Kia, S. M. et al. Closing the life-cycle of normative modeling using federated hierarchical Bayesian regression. PLoS One 17 (2022).
Koay, C. G., Ozarslan, E. & Basser, P. J. A signal transformational framework for breaking the noise floor and its applications in MRI. J. Magnetic Resonance 197, 108–119 (2009).
Nugent, A. C. et al. The NIMH healthy research volunteer dataset (2024).
Theaud, G. et al. TractoFlow: A robust, efficient and reproducible diffusion MRI pipeline leveraging Nextflow & Singularity. NeuroImage 218, 116889 (2020).
Jensen, J. & Helpern, J. MRI quantification of non-Gaussian water diffusion by kurtosis analysis. NMR Biomed. 23, 698–710 (2010).
Descoteaux, M., Deriche, R., Knosche, T. R. & Anwander, A. Deterministic and probabilistic tractography based on complex fibre orientation distributions. IEEE Trans. Med. Ima. 28, 269–286 (2008).
Qi, X. & Arfanakis, K. RegionConnect: Rapidly extracting standardized brain connectivity information in voxel-wise neuroimaging studies. NeuroImage 225, 117462 (2021).
Shinohara, R. T. et al. Volumetric analysis from a harmonized multisite brain MRI study of a single subject with multiple sclerosis. Am. J. Neuroradiol. 38, 1501–1509 (2017).
Acknowledgements
This work was supported by the ACUITY/CQDM consortium, the Natural Sciences and Engineering Research Council of Canada through its discovery grant program, and the MITACS Acceleration funding program. We are also thankful for the Institutional Université de Sherbrooke Research Chair in Neuroinformatics and NSERC Discovery Grant. Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
Author information
Contributions
P-M.J. co-funded the work, masterminded the paper, conceived the experiments, and co-wrote the paper. M.E. wrote the code for most of the experiments, co-wrote the paper, and produced all 8 figures. G.G. wrote most of the ComBAT code and co-wrote the paper. F.D., G.T., M.D., J-C.H., and Y.D. structured and processed the data and wrote part of the ComBAT code. ADNI provided some of the data. M.D. co-funded the work and helped write the paper. Data used in preparation of this article were partly obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or writing of this report.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Jodoin, PM., Edde, M., Girard, G. et al. Challenges and best practices when using ComBAT to harmonize diffusion MRI data. Sci Rep 15, 41508 (2025). https://doi.org/10.1038/s41598-025-25400-x
DOI: https://doi.org/10.1038/s41598-025-25400-x