Introduction

Autism spectrum disorder (ASD; autism), attention-deficit/hyperactivity disorder (ADHD), and obsessive-compulsive disorder (OCD) are behaviourally-defined neurodevelopmental conditions [1,2,3] with significant variability and overlap in their neurobiology and phenotypic presentation [4, 5]. To characterize the variability within and across these conditions, a growing body of research has focused on data-driven approaches, including clustering [6,7,8,9], to discover transdiagnostic groups of individuals who share similar neurobiological [7, 9,10,11] or phenotypic features [12]. These studies have consistently found a misalignment between data-driven subgroups and existing diagnostic labels [6, 12]; however, significant variability exists across these studies in the neurobiological features and analytical approaches used in clustering. For example, different measures of brain morphology [12, 13] and function [10] have been used along with a range of clustering approaches including hierarchical, spectral, multi-view, or regression clustering [12, 13].

In addition to differences in data modalities and analytical approaches, significant variability is also found in data acquisition methods (e.g., scanners, scanning parameters, motion), imaging pipelines (e.g., quality control methods, denoising/correction algorithms) [14,15,16], and sample characteristics [17], including diagnoses and sociodemographic composition. Given this, it is not surprising that there is also significant variability in study findings. This includes differences in the suggested number of clusters (e.g., 2-8 cluster solutions [10, 12, 18]), the neurobiological characteristics defining the subgroups (e.g., differences in cortical or subcortical volume [18], or cortical thickness [12]), and the phenotypic presentation of the clusters (e.g., cluster differences in social communication abilities [10, 18], language and attention [18], cognitive ability and hyperactivity [9]). The heterogeneity of findings in the existing literature has raised questions about the replicability of clustering results. In this context, replicability is defined as obtaining consistent findings across studies with the same research question [19]. Among the existing literature, only one study has investigated the replicability of data-driven subgroups across two independent datasets including children with diagnoses of neurodevelopmental conditions [9]. Using resting-state functional connectivity datasets from the Province of Ontario Neurodevelopmental Disorder (POND) Network and the Healthy Brain Network (HBN), this study found two clusters that differed in IQ, hyperactivity, and impulsivity, as well as patterns of segregation and integration within the brain’s networks. These results provide encouraging preliminary evidence that the results of clustering based on these measures of brain function may be replicable in spite of differences in datasets. A critical gap still remains, however, in understanding replicability in clustering results based on measures of brain morphology.
The present study addresses this gap by examining replicability in measures of brain morphology, namely cortical thickness, surface area, and cortical and subcortical volume. To this end, we examined 1) replicability of clustering across these measures, and 2) replicability across two independent datasets. Given the known sex differences in neurodevelopmental conditions, our analysis was disaggregated by sex, allowing us to also examine replicability across the male and female subsets of each dataset.

Methods

Participants

For this study, we used data from two independent datasets, namely, POND (export date May 22, 2023) and HBN (Release 10). Data from participants who were between 5–19 years of age, had a diagnosis of autism, ADHD, or OCD or were neurotypical, and whose neuroimaging data passed quality control were selected for the current study. This resulted in data from 747 participants from POND (autism: n = 312, 22.4% female, median age=12.4 (5.53); ADHD: n = 220, 25.5% female, median age=11.3 (4.08); OCD: n = 70, 40.0% female, median age=11.8 (5.57); neurotypical: n = 145, 41.4% female, median age=11.9 (5.16)), and 582 participants from HBN (autism: n = 60, 8.33% female, age=12.31 (5.86); ADHD: n = 445, 31.2% female, age=10.02 (4.83); OCD: n = 19, 52.6% female, age=8.76 (5.09); neurotypical: n = 58, 41.4% female, age=10.1 (5.24)). For POND, clinical diagnoses were supported by gold-standard assessments: the Autism Diagnostic Observation Schedule-2 (ADOS [20]) and the Autism Diagnostic Interview-Revised (ADI-R [20]) for autism, the Parent Interview for Children Symptoms (PICS) for ADHD, and the Children’s Yale-Brown Obsessive Compulsive Scale (CY-BOCS [21]) for OCD. Children in the neurotypical group did not have a history of neurodevelopmental, psychiatric, or neurological diagnoses, were born after 35 weeks gestation, and had no first-degree relative with a neurodevelopmental condition. For HBN, a computerized web-based version of the Schedule for Affective Disorders and Schizophrenia, Children’s version (KSADS [22]) was administered, which was reviewed alongside all study material by a clinical team to synthesize a consensus clinical diagnosis aligning with the DSM-5 [23]. Individuals who were not given a diagnosis were considered neurotypical.

Both the POND and HBN studies were approved by their respective institutions’ research ethics boards. Written informed consent and/or verbal assent (if written consent was not available) was obtained from the primary caregiver and/or participants as appropriate. The present secondary analysis of POND and HBN data was approved by the Holland Bloorview Research Ethics Board.

Behavioural measures

Phenotypic measures available for both datasets included the Social Communication Questionnaire (SCQ) to quantify autism-like features [24], the Strengths and Weaknesses of ADHD Symptoms and Normal Behaviour Rating Scale (SWAN [25]) to measure inattention and hyperactivity symptoms, the Toronto Obsessive Compulsive Scale (TOCS) to measure obsessive-compulsive traits, and full-scale intelligence quotient (FSIQ) measured by an age-appropriate IQ scale [26, 27]. The internalizing and externalizing subscales of the Child Behaviour Checklist (CBCL [28, 29]) were used to quantify internalizing and externalizing symptoms.

Sociodemographic measures

In addition to age and sex, racial and ethnic identifications in both datasets were collected through self-reported or parent-reported questionnaires. For POND, racial categories were aligned with the standards set by the Canadian Institute for Health Information. These categories encompassed Black, East Asian, Indigenous, Latino, Middle Eastern, South Asian, Southeast Asian, White, and other. Participants with mixed racial backgrounds were coded in multiple categories. For HBN, racial categories followed the US Census guidelines, including American Indian or Alaskan Native, Asian, Black, Hispanic, Native Hawaiian or other Pacific Islander, White, 2 or more races, and other. Given the sample size, we consolidated race into two categories of minoritized and white for both datasets. For both datasets, household income was categorized as low (<$74,999 CAD), medium ($75,000 CAD to $199,999 CAD), and high (≥$200,000 CAD). Education was defined based on the highest educational attainment of the primary caregiver, categorized as: Level 1 (non-completion of high school or high school diploma), Level 2 (associate degree or undergraduate degree), and Level 3 (graduate or professional degree).

Imaging data

For both datasets, measures of cortical surface area, cortical thickness, cortical volume, and subcortical volume were obtained from structural MRI (sMRI). For POND, the sMRI images were collected on Siemens MAGNETOM 3 T Trio and Prisma MRI scanners across three sites, namely, the Hospital for Sick Children (Toronto, Ontario; Trio: n = 233; Prisma: n = 348), Queen’s University (Kingston, Ontario; Trio: n = 100; Prisma: n = 43), and Holland Bloorview Kids Rehabilitation Hospital (Toronto, Ontario; Prisma: n = 23). For HBN, the data were collected using Siemens 3 T Trio and Prisma scanners at three institutions in the New York City area, namely the CitiGroup Cornell Brain Imaging Center (Prisma: n = 345), Rutgers University (Trio: n = 202), and the City University of New York Advanced Science Research Center (Prisma: n = 28), as well as at a mobile site in Staten Island with a 1.5 T Siemens Avanto (n = 7).

To extract surface area, cortical thickness, and cortical volume, the CIVET pipeline (version 2.1.0) [30] was used. These measures were extracted for 76 regions based on the automated anatomical labelling (AAL) atlas [17, 31]. Non-uniformity correction and stereotaxic registration to the Montreal Neurological Institute (MNI ICBM) [32] template (non-linear 6th generation target) were applied. Masking, extraction, and classification were used to separate gray matter, white matter, and cerebrospinal fluid and to obtain their volumes. A surface diffusion kernel was applied [33], and regions were registered to the AAL atlas [34]. Gray matter and white matter surfaces were generated from the tissue classification and registered to the AAL atlas [35], and cortical thickness was calculated as the distance between the two smooth surfaces [14]. Lastly, segmentation using multiple automatically generated templates (MAGeT) [33] was used to calculate the volume of 95 subcortical structures from multiple starting atlases, including 5-atlas subcortical, cerebellum, amygdala, hippocampus-subfields, and striatum and thalamus subdivisions. The CIVET and MAGeT quality control (QC) pipelines were used, and participants were only included if they passed both QC pipelines. Details of the data filtering are provided in eTable 1 in Supplement 1. For each dataset, separately for males and females, the brain measures were corrected for scanner effects using ComBat harmonization [14]. For age correction, the best model fit among linear, quadratic, and cubic effects was used for each brain region [9].
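The age-correction step can be illustrated with a minimal sketch. The text does not state how the "best model fit" was chosen, so the use of the Bayesian information criterion (BIC) here is an assumption, as are the function name `age_correct` and the synthetic data; the sketch simply fits polynomials of degree 1 to 3 for one region and keeps the residuals of the preferred model.

```python
import numpy as np

def age_correct(age, values, degrees=(1, 2, 3)):
    """Residualize one region's measure against age.

    Fits a polynomial of each candidate degree (linear, quadratic, cubic)
    and keeps the model with the lowest BIC. The BIC criterion is an
    assumption; the source only says the best-fitting model was used.
    Returns (residuals, chosen_degree).
    """
    age = np.asarray(age, dtype=float)
    values = np.asarray(values, dtype=float)
    n = age.size
    best = None
    for degree in degrees:
        coeffs = np.polyfit(age, values, degree)
        resid = values - np.polyval(coeffs, age)
        rss = float(np.sum(resid ** 2))
        k = degree + 1  # number of fitted coefficients
        bic = n * np.log(rss / n) + k * np.log(n)
        if best is None or bic < best[0]:
            best = (bic, degree, resid)
    return best[2], best[1]

# Synthetic example: one region with a curved age trajectory (made-up values).
rng = np.random.default_rng(0)
age = rng.uniform(5, 19, size=300)
thickness = 3.2 - 0.002 * (age - 12) ** 2 + rng.normal(0, 0.01, size=300)
residuals, degree = age_correct(age, thickness)
```

In practice the same function would be applied region by region, after scanner harmonization, within each sex-specific subset.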

Analysis pipeline

Data and statistical analyses were performed using Python 3.8.0 and R 3.3.3. An overview of the analysis pipeline is depicted in eFigure 1 in Supplement 1. Given the sex-differences in neurobiology of neurodevelopmental conditions [36,37,38,39], analyses were conducted independently on male and female subsets of each dataset.

To examine between-dataset similarities in the structure of the data, we used Principal Component Analysis (PCA) [40]. PCA is a multivariate approach that transforms the set of measurement variables into a new set of uncorrelated variables (principal components; PCs) that capture the largest variation in the data. The coefficients of the original variables, referred to as loadings, represent the strength of their contribution to the PCs. For this study, PCA was applied independently to the surface area, cortical thickness, cortical volume, and subcortical volume data. To examine similarities in the principal components across POND and HBN, Pearson’s correlation between POND and HBN loadings was computed. For each POND PC, this was taken as the maximum correlation with the corresponding PC in HBN, allowing for a window of 2 in cases where PC numbers were not aligned between the datasets.
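The windowed comparison of loadings can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: PCA is computed with a plain SVD, the absolute value of the correlation is used because the sign of a PC is arbitrary, and the function names and synthetic datasets are hypothetical.

```python
import numpy as np

def pca_loadings(X, n_components):
    """Return PCA loadings (n_components x n_features) via SVD of centred data."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:n_components]

def windowed_loading_correlation(load_a, load_b, window=2):
    """For each PC of dataset A, the largest |Pearson r| between its loadings
    and the loadings of PCs i-window..i+window of dataset B."""
    best = np.zeros(load_a.shape[0])
    for i in range(load_a.shape[0]):
        lo = max(0, i - window)
        hi = min(load_b.shape[0], i + window + 1)
        best[i] = max(abs(np.corrcoef(load_a[i], load_b[j])[0, 1])
                      for j in range(lo, hi))
    return best

# Two synthetic "datasets" that share the same latent structure.
rng = np.random.default_rng(0)
n_features, n_latent = 76, 3
W, _ = np.linalg.qr(rng.normal(size=(n_features, n_latent)))  # orthonormal columns
scales = np.array([10.0, 5.0, 2.5])

def simulate(n):
    Z = rng.normal(size=(n, n_latent)) * scales
    return Z @ W.T + rng.normal(scale=0.5, size=(n, n_features))

load_a = pca_loadings(simulate(400), n_latent)
load_b = pca_loadings(simulate(300), n_latent)
best = windowed_loading_correlation(load_a, load_b, window=2)
```

The window matters when two components explain similar amounts of variance and therefore swap order between datasets; without it, a matching component found one position away would be scored as a mismatch.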

To characterize the clustering structure of the datasets, we used the PCA-transformed data to compute participant similarity networks. These are matrices with entries corresponding to the similarity between pairs of participants (i.e., entry (i, j) corresponds to the similarity between participants i and j). Pairwise similarities were computed using the Gaussian transform of the cosine distance between vectors encoding brain measure values across all regions of the atlas. The cosine distance was selected as it provides a robust method for capturing structural associations in high-dimensional datasets [41]. With this pipeline, we generated four distinct similarity networks (surface area, cortical thickness, cortical volume, and subcortical volume) separately for males and females in each dataset. These matrices were then clustered using spectral clustering [42].
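A minimal sketch of this step is given below, assuming scikit-learn is available. The kernel width `sigma` is not reported in the text, so its value here is an assumption, and the helper names and synthetic participants are hypothetical.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_distances

def similarity_network(X, sigma=0.5):
    """Gaussian transform of pairwise cosine distances.

    X holds one row per participant (e.g., PCA-transformed brain measures).
    sigma is an assumed kernel width; the source does not report one.
    """
    d = cosine_distances(X)
    s = np.exp(-(d ** 2) / (2.0 * sigma ** 2))
    np.fill_diagonal(s, 1.0)
    return s

def cluster_network(similarity, n_clusters=2, seed=0):
    """Spectral clustering on a precomputed participant similarity matrix."""
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                               random_state=seed)
    return model.fit_predict(similarity)

# Synthetic participants: two groups pointing in different directions.
rng = np.random.default_rng(0)
centres = np.zeros((2, 20))
centres[0, 0] = 5.0
centres[1, 1] = 5.0
groups = np.repeat([0, 1], 30)
X = centres[groups] + rng.normal(scale=0.5, size=(60, 20))

similarity = similarity_network(X)
labels = cluster_network(similarity)
```

Because cosine distance ignores vector magnitude, two participants with the same regional profile but different overall scale are treated as similar, which is one reason it is attractive for high-dimensional morphology data.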

Statistical analyses

To examine the existence of clusters within each network, we employed three measures of clusterability: the gap statistic [43], the silhouette coefficient [33], and the Calinski-Harabasz index [44]. The statistical significance of clustering patterns was determined using a permutation test comparing the three measures of clusterability between the observed networks and 200 random networks. The random networks were generated using the same weight distribution as the original networks [45], preserving the degree and strength of the original networks [46]. Alignment between the constructed clusters and diagnostic labels, as well as among clusters obtained from different brain measures, was assessed using the adjusted Rand score [47]. An adjusted Rand score of one indicates full alignment between two sets, whereas a value of zero suggests no alignment.
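The logic of the permutation test can be sketched with one clusterability measure, the silhouette coefficient. This is a simplified illustration, not the study's procedure: the null networks here are built by shuffling edge weights, which keeps the weight distribution but does not preserve node degree or strength as the cited method does, and the helper names and synthetic network are hypothetical. The example runs 50 permutations for speed, whereas the study used 200 random networks.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score, silhouette_score

def cluster_and_score(similarity, n_clusters=2, seed=0):
    """Cluster a similarity network and return (labels, silhouette)."""
    labels = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                                random_state=seed).fit_predict(similarity)
    distance = 1.0 - similarity
    np.fill_diagonal(distance, 0.0)
    return labels, silhouette_score(distance, labels, metric="precomputed")

def shuffle_weights(similarity, rng):
    """Null network with the same edge weights, randomly reassigned to edges."""
    n = similarity.shape[0]
    iu = np.triu_indices(n, k=1)
    upper = np.zeros((n, n))
    upper[iu] = rng.permutation(similarity[iu])
    return upper + upper.T + np.eye(n)

def permutation_p_value(similarity, n_perm=200, seed=0):
    """One-sided p-value: how often a null network is at least as clusterable."""
    rng = np.random.default_rng(seed)
    labels, observed = cluster_and_score(similarity)
    null = np.array([cluster_and_score(shuffle_weights(similarity, rng))[1]
                     for _ in range(n_perm)])
    p = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return labels, observed, p

# Synthetic network with two groups (strong within-group similarity).
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 25)
same = truth[:, None] == truth[None, :]
weights = np.where(same, rng.uniform(0.7, 1.0, (50, 50)),
                   rng.uniform(0.0, 0.3, (50, 50)))
similarity = np.triu(weights, k=1)
similarity = similarity + similarity.T + np.eye(50)

labels, observed, p_value = permutation_p_value(similarity, n_perm=50)
alignment = adjusted_rand_score(truth, labels)  # 1 = full alignment, ~0 = none
```

The same `adjusted_rand_score` call can compare cluster labels against diagnostic labels or against the labels obtained from a different brain measure.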

Univariate and multivariate methods were employed to examine the associations among clusters and behavioural and brain measures. For univariate analysis, measures were compared among clusters using t-tests for normally distributed continuous data, Mann-Whitney tests for non-normally distributed continuous data, and Chi-squared tests for categorical data. Family-wise correction was used for multiple comparisons, and Cohen’s effect size [42] was reported for statistically significant results. For multivariate analysis, we predicted cluster labels from phenotypic measures using a random forest classifier [48]. These phenotypic predictors included scores on the SCQ, SWAN (inattention and hyperactivity), and CBCL (internalizing and externalizing), as well as full-scale IQ, age, race/ethnicity, and household education level.
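The univariate test selection can be sketched as follows, assuming SciPy. The text does not say how normality was assessed or which family-wise correction was applied, so the Shapiro-Wilk check and the Bonferroni adjustment below are assumptions, and the function names and synthetic scores are hypothetical.

```python
import numpy as np
from scipy import stats

def compare_continuous(a, b, alpha=0.05):
    """Two-cluster comparison of a continuous measure.

    Uses a t-test when both groups pass a Shapiro-Wilk normality check
    (an assumed criterion) and a Mann-Whitney test otherwise.
    """
    _, p_norm_a = stats.shapiro(a)
    _, p_norm_b = stats.shapiro(b)
    if p_norm_a > alpha and p_norm_b > alpha:
        return "t-test", float(stats.ttest_ind(a, b).pvalue)
    result = stats.mannwhitneyu(a, b, alternative="two-sided")
    return "Mann-Whitney", float(result.pvalue)

def compare_categorical(table):
    """Chi-squared test on a clusters x categories contingency table."""
    return "Chi-squared", float(stats.chi2_contingency(table)[1])

def bonferroni(p_values):
    """Family-wise correction (Bonferroni is an assumed choice)."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)

# Synthetic scores for two clusters (made-up values).
rng = np.random.default_rng(0)
symmetric = compare_continuous(rng.normal(100, 15, 80), rng.normal(100, 15, 80))
skewed = compare_continuous(rng.exponential(1.0, 80), rng.exponential(1.0, 80))
income = compare_categorical(np.array([[30, 40, 10], [28, 45, 12]]))
adjusted = bonferroni([symmetric[1], skewed[1], income[1]])
```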

Results

Participants

A total of 121 participants in POND and 923 participants in HBN failed either the CIVET or the MAGeT quality control (detailed in eTable 1 in Supplement 1). As a result, 747 participants from POND and 582 participants from HBN remained for the analyses. The demographic characteristics for the POND and HBN participants are shown in Table 1.

Table 1 Demographic characteristics for the POND and HBN datasets.

PCA decomposition

The number of principal components needed to account for 75% of the variance in the data ranged between 14 and 24 across measures and datasets (eTable 2 in Supplement 1). The correlations between the loadings on the principal components of the two datasets are shown in eFigure 2 in Supplement 1. The loadings were significantly correlated for 82.1% of PCs. Of the statistically significant correlations, 40.9% exceeded a correlation coefficient of 0.3. HBN females had the lowest percentage of significant correlations (eTable 3 in Supplement 1).
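For illustration, the number of components needed to reach a variance threshold can be obtained from the cumulative explained-variance ratio. The sketch below uses NumPy only; the function name and the synthetic data are hypothetical.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.75):
    """Smallest number of PCs whose cumulative explained variance
    reaches the threshold (75% in the present analysis)."""
    Xc = X - X.mean(axis=0)
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    ratio = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)
    return int(np.searchsorted(ratio, threshold) + 1)

# Synthetic data: two dominant directions plus eight low-variance ones.
rng = np.random.default_rng(0)
scales = np.array([10.0, 6.0] + [1.0] * 8)
X = rng.normal(size=(500, 10)) * scales
n_pcs = n_components_for_variance(X, threshold=0.75)
```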

Clustering composition of the data

Participant similarity matrices are visualized as network graphs in Fig. 1. As seen, two distinct groupings are evident across datasets, brain measures, and male and female subsets. To determine if clusters existed in the data, we used the gap statistic [43], comparing the within-cluster dispersion of the data to that expected under a null distribution generated from random permutation networks. The gap statistic was significantly larger for our data than for the null distributions for surface area, cortical thickness, and cortical and subcortical volume for both males and females (p < 0.01, eTable 4 in Supplement 1). Silhouette and Calinski-Harabasz scores (eFigure 3 in Supplement 1) suggest that the optimal number of clusters is two for all measures and datasets.

Fig. 1: Replicability of clustering structure. (A) Participant similarity networks, (B) adjusted Rand score.

Clustering results

Across brain measures and datasets, there was very low alignment between diagnostic labels and data-driven groupings (adjusted Rand scores <0.02; eTable 5 in Supplement 1). Alignment among clusters constructed using different brain measures is shown in eFigure 4 in Supplement 1. Across all datasets, clustering solutions were highly aligned for cortical volume and surface area (adjusted Rand score 0.63–0.81), and moderately aligned for cortical thickness and subcortical volume (adjusted Rand score 0.22–0.44). This finding was replicated across datasets and female/male subsets.

Cluster differences in brain measures

For both datasets, we computed the effect size for the differences in brain measures across clusters using Cohen’s d (eFigure 5, eFigure 6, and eTable 6 in Supplement 1). Figure 2 shows the association among these effect sizes between POND and HBN, as well as between male and female subsets. Linear regression analysis revealed a significant association between cluster effect sizes for POND and HBN after controlling for measure and sex (intercept=0.09+/−0.02, p < 0.0001; beta=0.92+/−0.01, p < 0.0001; adjusted R-squared=0.93). Similarly, a significant association was found for cluster effect sizes between males and females (intercept = −0.04+/−0.02, p = 0.04; beta=0.97+/−0.01, p < 0.0001; adjusted R-squared=0.91). This suggests that brain signatures associated with the clusters are highly consistent between datasets and male/female subsets.
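The comparison of regional effect sizes can be sketched as below. This is a simplified illustration: it uses the pooled-standard-deviation form of Cohen's d and an ordinary least-squares line between two synthetic datasets, and it omits the covariates (measure and sex) that the reported regression controlled for. Function names and data are hypothetical.

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d with a pooled standard deviation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    nx, ny = x.size, y.size
    pooled = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                     / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled

def regional_effect_sizes(X, labels):
    """Cohen's d between cluster 0 and cluster 1 for every region (column)."""
    return np.array([cohens_d(X[labels == 0, r], X[labels == 1, r])
                     for r in range(X.shape[1])])

# Two synthetic datasets sharing the same regional cluster differences.
rng = np.random.default_rng(0)
n_regions = 76
true_shift = rng.uniform(-1.0, 1.0, size=n_regions)

def simulate(n_per_cluster):
    labels = np.repeat([0, 1], n_per_cluster)
    X = rng.normal(size=(2 * n_per_cluster, n_regions))
    X[labels == 0] += true_shift
    return X, labels

d_a = regional_effect_sizes(*simulate(100))
d_b = regional_effect_sizes(*simulate(80))
slope, intercept = np.polyfit(d_a, d_b, 1)  # regress dataset B on dataset A
```

A slope near one with an intercept near zero indicates that regions separating the clusters strongly in one dataset do so to a similar degree in the other.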

Fig. 2: Association among between-cluster effect sizes computed for (A) POND and HBN, (B) male and female subsets.

Cluster associations with phenotypic measures

Univariate testing did not reveal any significant between-cluster differences in age, race/ethnicity, family income and education, FSIQ, SCQ, SWAN, or CBCL scores (detailed statistics in eTable 3 in Supplement 1) across datasets or measures (Fig. 3).

Fig. 3: 95% confidence interval of the mean for the cluster difference reported for phenotypic measures for POND and HBN, disaggregated by sex.

The accuracy for multivariable prediction of cluster labels is presented in eTable 7 in Supplement 1. One-sample t-tests revealed that cluster labels were predicted with greater than chance accuracy for subcortical volume for males in both POND (accuracy = 0.65 ± 0.09; p = 0.02) and HBN (accuracy = 0.61 ± 0.05; p = 0.01). For those prediction tasks, feature importance values (calculated based on mean decrease in impurity [48]) are reported in eFigure 7 in Supplement 1. The differentiating features were highly consistent between POND and HBN and between female and male subsets, with the highest importance attributed to age and the phenotypic measures (IQ, CBCL internalizing and externalizing, SCQ, and SWAN scores). The contribution of sociodemographic factors to prediction was significantly smaller.
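A minimal sketch of this kind of analysis is shown below, assuming scikit-learn and SciPy. The cross-validation scheme, number of trees, and two-sided test are assumptions (the text does not report them), the column names are a hypothetical subset of the predictors, and the data are synthetic.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic phenotypes; cluster membership depends on the first two columns only.
rng = np.random.default_rng(0)
feature_names = ["age", "FSIQ", "SCQ", "SWAN", "CBCL", "education"]  # hypothetical
X = rng.normal(size=(240, len(feature_names)))
cluster = (X[:, 0] + 0.8 * X[:, 1]
           + rng.normal(scale=1.0, size=240) > 0).astype(int)

# Cross-validated accuracy of a random forest (10 folds is an assumption).
forest = RandomForestClassifier(n_estimators=200, random_state=0)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracy = cross_val_score(forest, X, cluster, cv=folds, scoring="accuracy")

# One-sample t-test of fold accuracies against chance (0.5 for two clusters).
t_stat, p_value = stats.ttest_1samp(accuracy, popmean=0.5)

# Mean-decrease-in-impurity importances from a forest fit on all data.
forest.fit(X, cluster)
importance = dict(zip(feature_names, forest.feature_importances_))
top_feature = max(importance, key=importance.get)
```

Impurity-based importances sum to one across predictors, so they describe relative rather than absolute contribution to the prediction.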

Discussion

Our study characterized the replicability of the participant similarity networks constructed using surface area, cortical thickness, and cortical and subcortical volume, across the POND and HBN datasets, as well as male and female subsamples.

Replicability across datasets

Despite significant differences in the POND and HBN datasets in demographic and phenotypic composition, our results revealed a high degree of consistency between the data structures for the two datasets. In particular, we found high between-dataset correlations among the principal components obtained using the POND and HBN datasets, suggesting that data structures are similar in both datasets across the brain measures examined. The clustering structure was highly replicable across datasets, with our results revealing a 2-cluster composition across the four brain measures and the female/male subsets. This is at the lower end of previous literature findings, where the number of reported clusters is highly variable, ranging from 2–10 [6, 9,10,11, 18, 49]. A larger number of clusters is likely to be found when multiple brain measures are combined, especially if these measures quantify potentially different biological mechanisms (for example, if two independent groups are found in each of measures A and B, the combination of measures will result in four possible group combinations).

The brain signatures of the clusters were highly consistent across datasets, with high correlations among regional effect sizes for between-cluster differences. Another finding that was replicated was that data-driven clusters were not aligned with diagnostic labels, as indicated by the low adjusted Rand scores (eTable 5 in Supplement 1) across datasets, brain measures, and female/male subsets. This finding is consistent with previous literature [6, 9, 10, 18, 49], further adding to the body of work highlighting the need for enhanced biologically-relevant precision in the characterization of neurodevelopmental conditions, compared to our broad diagnostic categories.

In this study, we did not find statistically significant phenotypic differences between clusters through univariate analysis. However, multivariate analysis showed that cluster labels derived from subcortical volume were predicted with greater than chance accuracy based on a combination of differences in age, IQ, internalizing and externalizing symptoms, autism features, and inattention and hyperactivity/impulsivity, across both POND and HBN. This finding suggests that neurobiological homogeneity may not align with single diagnostic domains of neurodevelopmental conditions, but instead reflects differences in constellations of phenotypic features that are not specific to a single diagnostic category. The null finding of univariate phenotypic differences may also be due to statistical power, as replicability in brain-behaviour associations may require very large sample sizes [50].

Replicability across brain measures

In addition to between-dataset differences, we examined the replicability of clustering structures across different brain measures within a dataset. The two-cluster solution was replicated across surface area, cortical thickness, cortical volume, and subcortical volume; however, participant membership in the clusters was only partially replicated between subcortical volume and cortical thickness, and between surface area and cortical volume, and not at all among other pairs of measures. The misalignment between cortical thickness and surface area is not surprising given that these features are suggested to be genetically distinct determinants of cortical structure [51]. Further, the finding of replicability between cortical volume and surface area is consistent with the suggestion that interindividual variation in gray matter volume is largely driven by differences in surface area rather than cortical thickness [52]. The dissociation between cortical thickness and surface area is particularly important to studies of subgroup structure in neurodevelopmental conditions that integrate multiple measures of cortical morphology. Given that these measures reflect different genetic mechanisms, clustering based on each individual measure may be advantageous to reveal subgroups that share differences in these mechanisms.

Replicability and sex differences

Given the known sex-differences in the neurobiology and phenotypic expression of neurodevelopmental conditions [53], we disaggregated our results by sex. There was high replicability between male and female datasets in the principal component decomposition of the brain measures, overall clustering structure, lack of alignment with the diagnostic labels, and brain signatures of clusters. In terms of brain-phenotype association, replicability was found in HBN, but not POND. It is important to note that overall, we observed higher variability in the female dataset. This may suggest larger variability in neurobiological characteristics or may be the result of our smaller sample size for the female subsets.

Strengths and limitations

This study has several strengths, including our large sample sizes across both datasets. At the same time, there was lower representation of females, matching the expected prevalence in autism and ADHD. This may have limited our ability to detect female-specific patterns. Additionally, our phenotypic measures were limited by what was available in both the POND and HBN datasets. It is possible that brain-behaviour associations can be found in other measures of function or cognition (e.g., response inhibition, memory, affect recognition).

Conclusions

To our knowledge, this is the first study of clustering replicability in structural brain measures across neurodevelopmental conditions. We found evidence of replicability of the clustering structure across two independent datasets; however, when examining replicability across brain measures, only replicability between cortical thickness and subcortical volume, and between surface area and cortical volume, was strongly supported by our results.