Main

Genetic architecture contributes directly and indirectly to the wiring of brain circuits and provides the foundation for the behavioural repertoire1,2. Understanding these genetic underpinnings makes it possible to unravel the origins of individual differences in cognitive processes and behaviours, offering insights into both adaptive capacities and developmental vulnerabilities3. Identifying the biological determinants of brain organization and behavioural differentiation necessitates an integrative approach that cuts across an array of disciplines. Nevertheless, neuroimaging genetics, psychiatric genetics and studies of environmental factors have been conducted in isolated silos.

The genetic underpinnings of phenotypes and disease have traditionally been studied through genome-wide association studies (GWAS). However, GWAS have been restricted to common variants, which mainly reside in non-coding regions and exert only small effects on many phenotypes, including those studied in neuroscience4. Compared with the single nucleotide polymorphism (SNP) analyses that dominate GWAS, protein-coding copy number variations (CNVs) represent rare and consequential genome-wide perturbations leading to a large decrease or increase in gene expression. This class of genetic variation is defined as either a deletion or a duplication of a nucleotide sequence more than 1,000 base pairs long5,6. Notably, CNVs have been associated with neurodevelopmental disorders (for example, autism spectrum disorder7 or attention-deficit/hyperactivity disorder8) and psychological/psychiatric disorders (for example, schizophrenia9,10, bipolar disorder11 or major depressive disorder12).

Many protein-coding CNVs are now understood to have body-wide consequences13,14 and to alter the cortex15. Research indicates that CNVs contribute to cortical changes in the brain, affecting both its structure and function16,17. The observed patterns of robust cortical alterations were largely specific to individual CNVs14,18. These different brain alterations can have ramifications beyond the impairments of cognitive ability or language capacity that dominate the CNV literature19,20. Systemic associations outside the central nervous system, including the cardiovascular system, might contribute to the decreased longevity of CNV carriers in the general population13,14. Since protein-coding CNVs are cumulatively frequent in the population and have the potential for substantial effects on a given phenotype, they represent an emerging and potent imaging-genetics tool.

During the period of adolescence, brain circuits and behavioural tendencies undergo dynamic changes shaped by genetic factors, environmental influences and their interactions21,22. Adolescence is also a life stage during which symptoms of numerous psychiatric disorders become apparent23. Recent findings underscore the necessity of adopting a multidimensional and interdisciplinary approach that cuts across sociology, psychology and biology, conventionally studied in isolation. Such a holistic perspective is essential for a more nuanced understanding of the intricate interplay of genetic, socioeconomic and environmental factors influencing healthy children’s development24. By integrating information from cognitive assessments, genetic information and socio-environmental measures, it is possible to identify potential risk factors as well as unveil protective elements contributing to resilience in individuals navigating the complexities of adolescence25,26. There is an ongoing debate on whether CNVs exhibit specific associations with particular disorders, or rather influence neurodevelopment as a whole. As a result, we should carry out analyses and studies that are open to the possibility that CNVs will impact behaviour in various ways throughout adolescence27. The analysis of CNVs in adolescents is positioned to carve out important interactions between our genetic heritage, the environmental milieu, and the intricacies of cognitive and social development.

In the present study, we leveraged understudied rare genetic alterations (genome-wide CNVs) with strong downstream effects. We interrogated the Adolescent Brain Cognitive Development℠ Study (ABCD Study)28, which represents one of the largest collections of brain images and genetic profiles from over 10,000 children aged 9–11 years at baseline. These adolescents are prospectively deeply phenotyped by means of an extensive battery of cognitive, behavioural, clinical, psychosocial and socioeconomic measures. Benefiting from these comprehensive multimodal data, we investigated the effects of genomic deletions and duplications on patterns of individual participants’ brain architecture linked with cognitive, behavioural, psychosocial and socioeconomic measures in a single unified multivariate analysis. Specifically, we first probed curated data from 7,338 children for the presence of CNVs. We then deployed multivariate pattern-learning algorithms in children without any CNV to estimate modes of population covariation between brain architecture, represented by 148 regional atlas volumes, and ~1,000 behavioural variables spanning 20 rich categories. Finally, we quantified the effects of deletions and duplications on the revealed canonical modes. The robustness of the derived modes and CNV-induced differences was substantiated by cross-validation and permutation testing24,29. This multidimensional and doubly multivariate framework revealed the multifaceted relationships between genes, brain architecture and behaviour, which paves the way for innovation in neuroscience, genetics and personalized medicine.

Results

Genome-wide mutations alter patterns of brain and behaviour

We used a pattern-learning approach to analyse the impact of genome-wide CNVs in the ABCD cohort by means of its uniquely deep phenotype profiling. To this end, in the group of 7,338 children that passed genetic and MRI quality control, we first identified 486 children carrying at least one genomic deletion fully encompassing at least one gene and 1,406 children carrying at least one duplication that fully encompassed at least one gene. In addition, we identified 132 children who carried both a deletion and a duplication, and these individuals were included in both the deletion and duplication groups for subsequent analysis. The remaining 72% (5,314) of the children did not carry any protein-coding CNV larger than 50 kb across the genome (Fig. 1a). These participant groups (deletions, duplications, controls) showed similar proportions of sex (percentage of females: 44–48%) and distributions of age (Fig. 1a).

Fig. 1: Linking genome-wide CNVs and brain–behaviour population covariation.

a, Genome-wide CNV identification in the ABCD population cohort. We investigated 7,338 children from the ABCD database. In total, 5,314 children do not carry any protein-coding CNV, 486 carry a deletion and 1,406 carry a duplication fully encompassing one or more genes. A total of 132 participants carried both deletion and duplication (left plot, outer circle). The ratio of males and females is similar in every group (left plot, inner circle). The histogram (right plot) depicts the age of participants across CNV groups. b, Overlaid histograms showing the distribution of the number of genes encompassed by each CNV. c, A partial least squares model links the brain with behaviour in one holistic model. We estimate a multivariate relationship structure among 148 brain atlas volume measures and ~1,000 behaviour measures spanning 20 categories based on measurements from children without any CNV. The canonical scores represent the latent variable expressions calculated from linear combinations of the original brain and linear combinations of behaviour measurements that maximize the covariance between the two sets of variables. The number in brackets represents the number of phenotypes per category. d, CNV status associated with individual expression strengths of brain and behaviour patterns. The bar plot shows the average brain and behaviour scores of CNV carriers across all modes. Error bars represent 95% confidence intervals based on 1,000 bootstrap resampling replications. Stars indicate significant differences identified through cross-validation testing from control participants who were not used to derive model parameters (cf. Methods). These results reveal that carrying a CNV significantly impacts canonical scores across different modes of brain–behaviour covariation, emphasizing the utility of a multivariate holistic framework that cuts across single disciplines.

Next, we zoomed in on the CNVs that we localized in the children’s genetic profiles (Fig. 1b). Almost 60% of deletions encompassed a single complete gene. Duplications generally encompassed more genes than deletions, although a single-gene duplication was the most common (~30% of cases). Besides the genetic profiling, the ABCD resource provides brain and behaviour measurements for each participant. Brain measurements were represented by 148 regional brain volumes defined according to the Destrieux atlas. Behavioural measures comprised 938 different phenotypes spanning 20 categories for in-depth follow-up analyses.

To investigate how genetic mutations impact brain and behaviour, we first established the link between measurements of brain architecture and behaviour using a single holistic multivariate model. Specifically, we brought to bear a partial least squares (PLS) model that maximizes the covariation between a weighted set (linear combination) of sociodemographic, family wellbeing, physical and behavioural measures and a weighted set (linear combination) of brain structure measures (Fig. 1c). The PLS model parameters were initially estimated in participants without any CNV, as a reference group, to reveal the modes of covariation that reflect the general population. The participant-wise expressions of each brain–behaviour covariation mode are hereafter called ‘scores’. In other words, these ‘scores’ are calculated as a linear combination (weighted sum) of the original variables using the PLS weights. Each identified PLS mode can thus be characterized by a set of brain and behaviour scores for all participants. Using a robust protocol for cross-validation and empirical permutation testing24, we identified three significant PLS modes (Supplementary Fig. 1). These major sources of population covariation in adolescents captured the ways in which brain features are intertwined with early life events, mental wellbeing or environment.
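To make the construction of the scores concrete, the following minimal sketch derives covariance-maximizing weights from the singular value decomposition of the brain–behaviour cross-covariance matrix and computes participant-wise scores as weighted sums. It assumes an SVD-based variant of PLS and runs on synthetic data; the helper names (`standardize`, `fit_pls_svd`), the toy dimensions and the simulated latent factor are illustrative assumptions and are not taken from the analysis pipeline of this study.

```python
import numpy as np

def standardize(a):
    """Column-wise z-scoring (mean 0, s.d. 1)."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

def fit_pls_svd(X, Y, n_modes=3):
    """Hypothetical helper: SVD-based PLS on standardized inputs.

    X: (n, p) brain measures; Y: (n, q) behaviour measures.
    Returns brain weights (p, k), behaviour weights (q, k) and the
    covariance captured by each of the k modes.
    """
    cross_cov = X.T @ Y / (X.shape[0] - 1)
    U, s, Vt = np.linalg.svd(cross_cov, full_matrices=False)
    return U[:, :n_modes], Vt[:n_modes].T, s[:n_modes]

# Synthetic stand-in for the reference (no-CNV) group: one shared latent factor.
rng = np.random.default_rng(0)
n, p, q = 500, 148, 40
latent = rng.normal(size=(n, 1))
X = standardize(latent @ rng.normal(size=(1, p)) + rng.normal(size=(n, p)))
Y = standardize(latent @ rng.normal(size=(1, q)) + rng.normal(size=(n, q)))

w_brain, w_behav, mode_cov = fit_pls_svd(X, Y)

# Scores are linear combinations (weighted sums) of the original variables.
brain_scores = X @ w_brain   # shape (n, 3)
behav_scores = Y @ w_behav   # shape (n, 3)
print(mode_cov)
```

In this formulation, the sample covariance between the brain and behaviour scores of a given mode equals the corresponding singular value, and the sign of each mode is arbitrary.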

In the next step, we wished to evaluate whether carrying a coding CNV led to statistically significant shifts in the observed brain and behaviour patterns. To this end, we devised a cross-validation scheme that compares PLS scores between controls and CNV carriers, all derived from a single PLS model (Supplementary Fig. 2). Specifically, we first estimated a single PLS model using the control group data, which captured population-level brain–behaviour covariation. We then fed the brain and behaviour data of CNV carriers through this same model, yielding analogous estimates in the CNV group. This approach ensures that the same PLS modes represent the same brain–behaviour associations across both groups. In the subsequent CNV–control comparisons, deletion and duplication carriers were pitted against control participants not used to derive PLS parameters to prevent overfitting (cf. Methods for details). The comparison was based on separately testing the difference in the average behaviour scores and the average brain scores. We were thus able to assess mode expression differences separately for deletions and duplications (Fig. 1d). We found significant differences between deletions and controls in behaviour scores for all 3 identified modes (Pmode1 = 0.003, Pmode2 = 0.003, Pmode3 = 0.010 after false discovery rate (FDR) correction). By contrast, duplications showed significant differences only for the first 2 behavioural modes (Pmode1 = 0.003, Pmode2 = 0.016, Pmode3 = 0.051). Furthermore, there was a significant shift in brain scores for all 3 modes in duplications (Pmode1 = 0.003, Pmode2 = 0.001, Pmode3 = 0.037) and for the second mode only in deletions (Pmode1 = 0.349, Pmode2 = 0.018, Pmode3 = 0.161). A sensitivity analysis demonstrated that the obtained differences were not driven by the presence of recurrent CNVs, such as 16p11.2 or 22q11.2 (Supplementary Fig. 3).
To further test the robustness of our findings, we conducted an additional analysis where all but one sibling per family were excluded from the main dataset. This analysis also produced results consistent with those of our primary study, reinforcing the reliability of our findings (Supplementary Fig. 4). Collectively, our results revealed that carrying a CNV significantly impacts the expression of patterns linking brain architecture and diverse aspects of cognitive, psychosocial and socioeconomic measures in our ABCD sample. In other words, genetic factors contribute to individual differences in brain–behaviour correspondences in adolescents.
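The logic of this comparison can be illustrated with a simplified sketch: weights are estimated on a training subset of controls, and held-out controls and CNV carriers are then projected through the same model before their average scores are compared. The single train/test split, the permutation test on the mean difference, the planted group shift and all function names are assumptions made for illustration on synthetic data; the actual cross-validation scheme is described in the Methods.

```python
import numpy as np

def fit_first_mode(X, Y):
    """First pair of PLS weights from the SVD of the cross-covariance."""
    U, _, Vt = np.linalg.svd(X.T @ Y / (len(X) - 1), full_matrices=False)
    return U[:, 0], Vt[0]

def cohens_d(a, b):
    """Standardized mean difference (a minus b) with pooled s.d."""
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                        / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled_sd

def perm_test_mean_diff(a, b, n_perm=1000, seed=0):
    """Two-sided permutation P value for a difference in group means."""
    rng = np.random.default_rng(seed)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        hits += abs(perm[:len(a)].mean() - perm[len(a):].mean()) >= observed
    return (hits + 1) / (n_perm + 1)

def simulate(n, shift, rng, wx, wy):
    """Toy brain (X) and behaviour (Y) data driven by one latent factor."""
    z = rng.normal(loc=shift, size=(n, 1))
    X = z @ wx + rng.normal(size=(n, wx.shape[1]))
    Y = z @ wy + rng.normal(size=(n, wy.shape[1]))
    return X, Y

rng = np.random.default_rng(42)
wx, wy = rng.normal(size=(1, 30)), rng.normal(size=(1, 20))
X_ctrl, Y_ctrl = simulate(600, 0.0, rng, wx, wy)
X_cnv, Y_cnv = simulate(100, 1.0, rng, wx, wy)   # planted shift, illustration only

# Fit on training controls only; reuse their mean/s.d. for everyone else.
train, test = slice(0, 400), slice(400, 600)
mu_x, sd_x = X_ctrl[train].mean(0), X_ctrl[train].std(0, ddof=1)
mu_y, sd_y = Y_ctrl[train].mean(0), Y_ctrl[train].std(0, ddof=1)
w_brain, w_behav = fit_first_mode((X_ctrl[train] - mu_x) / sd_x,
                                  (Y_ctrl[train] - mu_y) / sd_y)

# Project held-out controls and carriers through the same model.
behav_ctrl = ((Y_ctrl[test] - mu_y) / sd_y) @ w_behav
behav_cnv = ((Y_cnv - mu_y) / sd_y) @ w_behav

d = cohens_d(behav_cnv, behav_ctrl)
p = perm_test_mean_diff(behav_cnv, behav_ctrl)
print(f"Cohen's d = {d:.2f}, permutation P = {p:.4f}")
```

Because the sign of a PLS mode is arbitrary, only the magnitude of the resulting effect size is meaningful until the mode has been interpreted through its loadings.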

Tri-modal population modes link brain, behaviour and environment

After identifying robust deviations of brain–behaviour patterns in CNV carriers at population scale, we examined each revealed mode in more detail. The dominant (that is, the first) mode portrayed the ties between large-scale brain networks and sociodemographics and cognition. Specifically, we first re-expressed the difference in PLS scores between controls and CNV carriers using Cohen’s d to provide a standardized measure of the effect size of CNV carriership. The dominant PLS mode was characterized by significantly altered behaviour scores, with the shift being more prominent in CNV deletions (CNV–controls Cohen’s dDEL = 0.17, dDUP = 0.12) (Fig. 2a). To find which phenotypes play a prominent role in the first mode, we calculated brain and behaviour loadings. We obtained each loading as the Pearson’s correlation between the respective PLS score and the original measurement (Supplementary Table 1). As an example, each brain loading indexes the linear association strength between a brain region measurement and the brain scores in our reference group. Among the strongest brain effects, we observed the medial orbital sulcus (average of left and right hemisphere Pearson’s r = 0.30), a part of the frontal lobe which may be involved in various cognitive functions, including decision making, emotional processing and social cognition30. Since duplication carriers displayed higher brain scores compared with controls and since the medial orbital sulcus was associated with a positive loading (higher volume = higher score), this result pointed to increased volume in this region for duplication carriers. Other strong loadings included the middle occipital sulcus, subcallosal area, superior occipital gyrus and right lingual gyrus (Fig. 2b). We subsequently mapped the obtained loadings onto a brain surface. Notably, the temporal lobe, parietal cortex and parts of the frontal cortex played a crucial role in the dominant mode.
We then computed the average absolute loading effects in each of the seven large-scale networks according to Schaefer–Yeo definitions (Fig. 2b). Finally, we submitted brain loadings to a formal bootstrap test to determine whether they were significantly different from zero (cf. Methods). This test was based on 1,000 PLS model instances, each built on a randomly perturbed version of our ABCD sample created by drawing a participant cohort of the same sample size with replacement. We observed that at least 64% of the loadings were significant, highlighting the robustness of this first mode.
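The loading and bootstrap steps can be sketched as follows on synthetic data. Loadings are computed as Pearson’s correlations between each regional measure and the first-mode brain score, and a loading is flagged when its 95% percentile interval across resamples excludes zero. The percentile-interval criterion, the sign alignment of each bootstrap replicate to the reference solution and the function names are assumptions for illustration; the exact test is specified in the Methods.

```python
import numpy as np

def brain_loadings(X, Y):
    """First-mode loadings: Pearson r between each region and the brain score."""
    Xz = (X - X.mean(0)) / X.std(0, ddof=1)
    Yz = (Y - Y.mean(0)) / Y.std(0, ddof=1)
    U, _, _ = np.linalg.svd(Xz.T @ Yz / (len(X) - 1), full_matrices=False)
    score = Xz @ U[:, 0]
    score_z = (score - score.mean()) / score.std(ddof=1)
    return Xz.T @ score_z / (len(X) - 1)

def bootstrap_loadings(X, Y, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    reference = brain_loadings(X, Y)
    boot = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))   # resample with replacement
        est = brain_loadings(X[idx], Y[idx])
        # The sign of a PLS mode is arbitrary, so each replicate is aligned
        # to the reference solution (an assumption of this sketch).
        boot[b] = est if est @ reference >= 0 else -est
    ci_low, ci_high = np.percentile(boot, [2.5, 97.5], axis=0)
    significant = (ci_low > 0) | (ci_high < 0)
    return reference, ci_low, ci_high, significant

# Toy data: the first five "regions" carry signal, the last five are noise.
rng = np.random.default_rng(3)
n = 300
z = rng.normal(size=(n, 1))
X = np.hstack([z + rng.normal(size=(n, 5)), rng.normal(size=(n, 5))])
Y = z + rng.normal(size=(n, 8))

loadings, ci_low, ci_high, significant = bootstrap_loadings(X, Y)
print(f"{100 * significant.mean():.0f}% of loadings differ from zero")
```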

Fig. 2: The leading population mode tracks decreased cognitive functioning in CNV carriers.

a, CNVs significantly impact the revealed dominant behaviour pattern. Cohen’s d values of canonical scores calculated between controls and, separately, carriers of deletions and duplications are plotted for the first canonical mode. Filled dots indicate significantly different average scores from controls, based on cross-validation testing. b, Brain region correlates reveal a whole-brain pattern. Brain loadings were calculated as the correlation between brain scores and 148 regional brain volumes. These loadings are mapped onto the cortical surface to illustrate their spatial distribution, with colours indicating the strength and direction of the associations (red, positive; blue, negative). The accompanying bar plot highlights the 20 regions with the strongest loadings, with 95% confidence intervals estimated by rerunning the PLS model on 1,000 bootstrap resamplings of participants (cf. Methods). G, gyrus; S, sulcus. The radial bar chart shows the average brain loadings in each of the seven large-scale networks defined by Schaefer–Yeo parcellation. c, Behaviour correlates highlight real-life functioning. Behaviour loadings were calculated as the correlation between behaviour scores and ~1,000 behaviour measures. Left: behaviour loadings from a PLS model, grouped by category and colour coded accordingly. Each dot represents an individual phenotype, with the y axis indicating its effect (loading) strength and direction. Right: summary of the results presented by the average absolute loading for each category in a circular bar plot, capturing the overall contribution of each domain to the derived behaviour score. In summary, the first canonical mode highlights the connection between frontoparietal and temporal regions and assessments of cognition and demographics.

Furthermore, we inspected a broad portfolio of behaviour characteristics interlocked with the above-described brain-level effects. To this end, we calculated behaviour loadings similarly to brain loadings. The strongest loadings included family income (Pearson’s r = −0.68), poverty index (Pearson’s r = 0.66), parental education (Pearson’s r = −0.58), measures of cognitive performance (Pearson’s r = −0.56) and also screen time or sleep duration (Pearson’s r = 0.46) (Fig. 2c). To obtain a synopsis of the dominant behavioural profile, we averaged absolute behaviour loadings in each of the 20 categories. Demographics, cognitive and socioeconomic categories had the strongest average loadings (average absolute Pearson’s r > 0.22). Since CNV carriers displayed higher expression compared with controls for this mode characterized by negative loading for measures of cognitive performance (lower performance = higher score), these results thus point to decreased cognitive abilities and real-life functioning, especially in deletion carriers. Collectively, the dominant canonical mode highlighted the crosslinks between (i) frontoparietal and temporal regions and assessments of (ii) cognition and (iii) demographics.

The second PLS mode spotlighted opposing gene dosage effects on brain structure, which we found to be tied to family history of mental health. Specifically, we observed significant and opposing average brain score shifts for deletions and duplications (CNV–controls Cohen’s dDEL = 0.04, dDUP = −0.06) (Fig. 3a), which might reflect the mirroring effect on brain architecture previously reported for CNVs at specific genomic loci31. Similarly to the dominant behavioural mode, we also observed significantly different behavioural scores with stronger effects for deletions (CNV–controls Cohen’s dDEL = −0.07, dDUP = −0.05). According to the calculated brain loadings (Supplementary Table 1), the mirroring brain scores were mainly driven by the precentral gyrus (across-hemisphere average Pearson’s r = −0.37), followed by the supramarginal, postcentral and lingual gyri (Fig. 3b). Despite being part of distinct brain networks, these regions were previously associated with neural mechanisms supporting complex cognitive tasks, especially those involving semantic processing or executive functions32,33. According to the bootstrap significance test, 43% of the brain loadings were significantly different from zero. Collectively, regions with the strongest loadings belonged to the somatomotor, dorsal attention and frontoparietal networks. The interactions and coordinated activity of these networks are known to be essential for the efficient integration and execution of complex cognitive and motor tasks34.

Fig. 3: The second population mode spotlights a shift in brain scores associated with mental wellbeing.

a, Canonical scores reveal distinct and shared effects of gene dosage. The lollipop chart displays brain and behaviour scores for deletions and duplications derived for the second PLS mode. Filled dots mark average scores that differ significantly from controls, as determined by cross-validation testing. b, Three large-scale brain networks dominate the brain loadings. Brain loadings are mapped onto the cortical surface, with colours indicating each region’s contribution to the brain–behaviour pattern (red, positive; blue, negative). The accompanying bar plot highlights the 20 regions with the strongest loadings, with 95% confidence intervals estimated from 1,000 bootstrap resamplings, each involving a rerun of the PLS model. The precentral, postcentral and lingual gyri show the most prominent contributions. The radial bar chart summarizes brain loadings by large-scale networks, highlighting the strongest contributions in the dorsal attention, somatomotor and frontoparietal networks. c, Behaviour loadings highlight the central role of mental wellbeing. Loadings are grouped by category (left) and summarized by average absolute loading in a circular bar chart (right). The strongest contributions come from parent- and child-reported measures of problems, stress, anxiety and depression. Accordingly, this mode is primarily driven by mental wellbeing-related phenotypes, particularly from parental behaviour, child questionnaires and sleep-related categories. Collectively, the second canonical mode proposes decreased mental wellbeing as a prominent marker of deletion carriers. AD/H, attention deficit/hyperactivity; Anx./dep., anxious/depressed; OCD, obsessive–compulsive disorder; ODD, oppositional defiant disorder.

The prominent deviations in behaviour scores in CNV carriers can be explained by elevated assessments of mental wellbeing as revealed by behaviour loadings. Specifically, phenotypes from the Child Behaviour Checklist (CBCL) and the Adult Self Report (ASR) dominated the set of relevant behaviour loadings (Fig. 3c). Particularly, the total scores of CBCL (Pearson’s r = 0.70) and ASR (Pearson’s r = 0.74) emerged as the two strongest loadings. They were followed by measures of both parental and child anxiety, stress and depression, as well as child sleep disorders. Indeed, when averaged across categories, the sleep category joined child behaviour and parental questionnaires as the most prominent (average Pearson’s r = 0.22). The combination of flagged phenotypes from both children and adult assessments suggests that the second mode captures a comprehensive view of the wellbeing intricately tied to the family system. In addition, it points towards potential dynastic effects, that is, the impacts of (inherited) genetic variants on family environments. Collectively, the second canonical mode proposed decreased familial mental wellbeing as a prominent marker of deletion carriers.

In the third and last canonical mode, we observed the relationship of the default mode and frontoparietal networks with environmental measures. Despite the mirrored effects on brain structure, the only significant shifts were found for brain scores in duplication carriers and behaviour scores in deletion carriers (Fig. 4a). The third mode was characterized by a strong contribution of the insula (average Pearson’s r = −0.35) as well as the middle temporal and lateral superior temporal gyri (Fig. 4b). The bootstrap test pointed to lower stability in this mode, with only 10% of brain loadings reaching significance. The strongest brain loadings were part of the default mode network (average absolute Pearson’s r = 0.13). This network belongs to the multimodal end of the unimodal-to-multimodal characterization of large-scale brain networks. Previous research suggests that this network and its underlying regions play crucial roles in several key cognitive processes, including self-reference, social cognition, episodic and autobiographical memory, language and semantic memory35.

Fig. 4: The third population mode links higher-order networks to environment measures.

a, Different deletion and duplication shifts. Average brain and behaviour scores for CNV carriers (deletions in pink, duplications in blue) are plotted for the third canonical mode. Significant differences from control participants are represented by filled points. b, Higher-order networks play a prominent role in the third canonical mode. Loadings are visualized on the cortical surface (bottom), with red indicating positive and blue indicating negative contributions. The bar plot (right) displays the 20 regions with the strongest loadings, with 95% confidence intervals estimated from 1,000 bootstrap resamplings. The radial plot (top left) summarizes average absolute brain loadings across seven canonical functional networks, highlighting strongest contributions from the default mode network. c, Environmental variables characterize behaviour loadings. Behaviour loadings are grouped by category (left) and summarized by average absolute loading per domain in a circular bar chart (right). Measures of neighbourhood (neigh.) violence, safety and crime are among the strongest loadings. These environment-associated phenotypes come primarily from the socioeconomic category. The third significant mode illustrates how deletions shift the expression of the mode linking the environment and higher-order networks. Sum., summary.

Examining the behaviour profile in the third canonical mode highlighted variables associated with the environment (Fig. 4c). Concretely, phenotypes related to the neighbourhood, such as crime reports (Pearson’s r = 0.71), drug possession, violent crimes, adult offence and feelings of safety emerged as strongly associated. These phenotypes reflect social and community dynamics, which might affect the overall quality of life for individuals within that context. In summary, the third significant mode revealed an alteration in how environmental differences link to higher-order networks in adolescent deletion carriers.

Exploring brain–behaviour relationship across genes, population, time

After describing the interconnections between genetic mutations and the expressions of behaviour patterns, we explored how characteristics of the genes encompassed in CNVs shaped behaviour scores. In other words, for each individual with a deletion or duplication, we scored the genes inside a CNV using a total of seven complementary descriptions, including the average temporal expression, number of genes preferentially expressed in the brain, number of genes associated with autism spectrum disorder, schizophrenia, or a broader portfolio of disorders, and a functional intolerance score: the inverse of the loss-of-function observed/expected upper bound fraction (1/LOEUF). The 1/LOEUF score reflects the degree of negative selection pressure against loss-of-function mutations in a gene, with higher values indicating stronger evolutionary constraint and greater intolerance to functional disruption in the general population. We then performed an exploratory analysis using Pearson’s correlation between behaviour scores and the quantitative descriptions of CNVs occurring in the genome (Fig. 5a). For deletions, the strongest observed association was with the sum of 1/LOEUF (Pearson’s rmode3 = −0.09, PFDR = 0.10). For duplications, the strongest observed association was with temporal gene expression (Pearson’s rmode1 = 0.10, PFDR = 0.001). This result suggested a detrimental impact of altering the dosage of genes expressed later during human development. Another strong association was between the dominant mode and the sum of 1/LOEUF scores, where the positive correlation with PLS scores suggested decreased cognitive performance among CNV carriers. Due to the limited number of CNV carriers, only the association with the temporal gene expression profile reached significance after applying FDR correction to control for multiple comparisons. Nevertheless, the reported associations can serve as valuable pointers for further research.
Additional analysis using probability of loss intolerance (pLI) as another measure of functional intolerance of CNVs is in Supplementary Fig. 5. In summary, our findings underscore the intricate relationship between genetic characteristics and behavioural outcomes, highlighting the importance of considering both genetic and temporal dimensions in understanding the aetiology of behavioural patterns and susceptibility to disorders.
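The aggregation and association steps described above can be sketched as follows: gene-level annotations are summed per carrier, correlated with the scores of each mode, and the resulting P values are adjusted with the Benjamini–Hochberg procedure. The LOEUF values and scores are randomly generated, the planted association is deliberately much stronger than any effect reported here, and the permutation-based P value and function names are assumptions of this sketch rather than the procedure used in the study.

```python
import numpy as np

def fdr_bh(pvals):
    """Benjamini-Hochberg adjusted P values."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest P value downwards.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

def pearson_perm(x, y, rng, n_perm=2000):
    """Pearson r with a two-sided permutation P value."""
    r = np.corrcoef(x, y)[0, 1]
    null = np.array([np.corrcoef(rng.permutation(x), y)[0, 1]
                     for _ in range(n_perm)])
    p = (np.sum(np.abs(null) >= abs(r)) + 1) / (n_perm + 1)
    return r, p

rng = np.random.default_rng(1)
n_carriers = 200

# Hypothetical gene-level LOEUF values for the genes inside each carrier's CNV.
loeuf_per_carrier = [rng.uniform(0.2, 2.0, size=rng.integers(1, 6))
                     for _ in range(n_carriers)]
sum_inv_loeuf = np.array([np.sum(1.0 / genes) for genes in loeuf_per_carrier])

# Toy behaviour scores for three modes; only mode 1 depends on the annotation.
scores = rng.normal(size=(n_carriers, 3))
scores[:, 0] += 0.5 * (sum_inv_loeuf - sum_inv_loeuf.mean()) / sum_inv_loeuf.std()

results = [pearson_perm(sum_inv_loeuf, scores[:, m], rng) for m in range(3)]
r_vals = np.array([r for r, _ in results])
p_fdr = fdr_bh([p for _, p in results])
print(np.round(r_vals, 2), np.round(p_fdr, 3))
```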

Fig. 5: Population modes are driven by temporal gene characteristics rather than sociodemographic factors.

a, Behaviour scores are linked to the spatial and temporal expression of CNV genes. Genes encompassed by each CNV were annotated using seven distinct metrics. For each CNV carrier, gene-level annotations were aggregated across all deleted or duplicated genes to generate participant-specific summaries. The heatmaps display associations between these CNV-derived annotations and PLS scores, shown across all PLS modes and separately for deletions and duplications. The three modes are labelled on the basis of their dominant phenotype. Single asterisk (*) denotes significant association (assoc.) after FDR correction. ASD, autism spectrum disorder; SCZ, schizophrenia. b, Behaviour scores are not explained by ethnicity. As an example, behaviour scores of the first canonical mode are plotted for all participants separated by participant ethnicity. The raincloud plot combines a scatterplot, a boxplot (whiskers equal to 1.5× the interquartile range) and a violin plot. c, Canonical modes are not driven by sociodemographic factors. Associations between population variables and PLS scores are shown for the first three canonical modes. FDR-corrected P values are displayed: t-tests for sex, correlations for age and the first ten genetic principal components (PCs), and one-way ANOVAs for ethnicity and site. Darker colours indicate stronger associations. d, Canonical modes capture brain maturation. A single PLS model was applied to brain measurements from 3,715 participants at both baseline and 2-year follow-up. The raincloud plot displays brain scores of all controls from the first PLS mode at each timepoint, along with their correlation. e, Similar cortical aging across all three groups. We examine the difference between baseline and 2-year follow-up measurements for all CNV groups and all 3 canonical modes. The plot presents the average brain scores for each visit, participant group and PLS mode. The arrow direction symbolizes the direction of change. 
In boxplots (b,d), the centre line represents the median; the box spans from the 25th to the 75th percentile; whiskers extend to the minimum and maximum values within 1.5× the interquartile range from the box bounds. Outliers beyond this range are not shown. Both CNV groups display similar patterns of brain aging compared to controls.

In the next step, we explored whether modes of population stratification, that is, specific sociodemographically defined groups, also influence the derived patterns. In other words, we quantified whether ethnicity, sex, age, or genetic background are linked with the shifts in brain and behaviour scores. As a concrete example of this sensitivity analysis, we stratified participant-wise scores for the first mode by reported ethnicity as defined by the ABCD team36 (Fig. 5b). Using one-way analysis of variance (ANOVA), we assessed whether there were significant differences in scores as a function of these diverse ethnic categories. Notably, the scores did not differ significantly among ethnicities (F-statistic = 0.89, P = 0.56). We then extended this post hoc analysis to other modes of population covariation and other metrics of population stratification. Namely, we quantified the difference in scores between males and females using a two-sample t-test and as a function of the 21 recruitment sites using one-way ANOVA. Moreover, we probed the linear association of scores with age and with the ancestry structure of the cohort, measured using the first ten principal components of genotyping data. We collected all P values and applied FDR correction to control for multiple comparisons across the totality of 42 performed tests (14 stratification variables for each of the 3 modes). None of the performed tests revealed a significant association (Fig. 5c). This examination indicated that the observed scores are not driven by any of the tested modes of population stratification, which speaks to the generalizability of the derived patterns37.
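As an illustration of the one-way ANOVA used in this sensitivity analysis, the sketch below computes the F-statistic from between-group and within-group sums of squares and applies it to scores that are, by construction, unrelated to a 21-level grouping variable. The scores and site labels are simulated, and the permutation-based P value is an assumption of this sketch that stands in for the parametric F-test.

```python
import numpy as np

def one_way_f(values, labels):
    """F-statistic of a one-way ANOVA of values across label groups."""
    values, labels = np.asarray(values, dtype=float), np.asarray(labels)
    groups = [values[labels == g] for g in np.unique(labels)]
    grand_mean = values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between, df_within = len(groups) - 1, len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def permutation_p(values, labels, n_perm=1000, seed=0):
    """Observed F and the share of label permutations with F at least as large."""
    rng = np.random.default_rng(seed)
    observed = one_way_f(values, labels)
    null = np.array([one_way_f(values, rng.permutation(labels))
                     for _ in range(n_perm)])
    return observed, (np.sum(null >= observed) + 1) / (n_perm + 1)

# Worked check: three groups with means 2, 3 and 4 give
# SS_between = 6 (2 d.f.) and SS_within = 6 (6 d.f.), hence F = 3.
toy_f = one_way_f([1, 2, 3, 2, 3, 4, 3, 4, 5], [0, 0, 0, 1, 1, 1, 2, 2, 2])

# Simulated scores that do not depend on the (hypothetical) site labels.
rng = np.random.default_rng(7)
scores = rng.normal(size=2000)
site = rng.integers(0, 21, size=2000)
f_site, p_site = permutation_p(scores, site)
print(f"toy F = {toy_f:.1f}; site F = {f_site:.2f}, P = {p_site:.3f}")
```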

As the final step, we extended our analyses by examining longitudinal changes in brain structure between controls and CNV carriers at the 2-years-after-imaging timepoint (Supplementary Fig. 6). We focused on brain structure for this analysis step because the majority of behavioural phenotypes were not available for the second timepoint. Investigating the trajectory of brain development over time can provide insights into whether individuals with CNVs exhibit distinct patterns of structural change. We benefited from the availability of 3,715 brain scans measured 2 years after the first visit (51% of participants passing quality control). We observed a high correlation between regional volumes acquired at these two timepoints (Supplementary Fig. 7). We used the PLS model derived using baseline measurements to re-express brain measurement from the 2-year follow-up (cf. Methods). Put differently, we applied the baseline PLS model to the follow-up brain measurements to assess how the brain measurements at follow-up align with the brain patterns established at baseline. In doing so, the original PLS model provided a holistic summary of brain maturation by calculating a brain score for each participant at each visit, for each of the three modes. We then calculated Pearson’s correlation between the brain scores from the baseline and follow-up measurements. Similar to brain structure measurements, we observed a strong link between PLS scores in the dominant mode between the baseline and follow-up measurements in controls (Pearson’s r = 0.89) (Fig. 5d). Furthermore, we used a linear mixed-effects model (cf. Methods) to examine whether the rate of change in brain scores differed between controls and CNV carriers (Fig. 5e). We found a significant main effect of measurement time point, indicating that brain scores changed significantly over time across all three groups (Supplementary Table 2). 
The interactions between time and group (that is, control or CNV carrier) were not significant after FDR corrections, suggesting that the rate of change in brain scores over time did not differ significantly between controls, duplication carriers and deletion carriers (Supplementary Table 2). Therefore, both CNV groups displayed similar brain maturation patterns compared to controls. Nevertheless, the second mode remains a promising target for further exploration in studies with greater statistical power (coefficient for the interaction between CNV and time, uncorrected PDEL = 0.045, PDUP = 0.076). Given the observed similarity in brain structure developmental patterns of CNV carriers and controls, further exploration of earlier stages of life may provide further valuable insights into distinctions in neurodevelopmental processes.

Discussion

In this quantitative population neuroscience study, we carefully examined the ramifications of carrying an exonic CNV on brain organization and behaviour. To this end, we designed an analytic protocol based on a holistic tri-modal pattern-learning framework that can cleanly dissect the impact of genetic mutations on multimodal measurements that cut across disciplines to untangle the complex genes–brain–behaviour interplay. This multivariate model uncovered three significant modes of covariation between brain volume and behaviour. The first mode connected robust volumetric differences in frontoparietal and temporal regions with measures of cognition and demographics. The second mode linked dorsal attention, somatomotor and frontoparietal networks with mental health measures. Finally, the third mode highlighted associations between the higher-order networks and environmental factors. We then drew a detailed picture of how carrying a genomic deletion or duplication influences the expression of these comprehensive brain and behaviour patterns. Specifically, deletions and duplications were linked with negative effects on family wellbeing, as seen in the adverse effects on cognitive functioning, mental health and socioeconomic outcomes. Our collective results also highlight the similar ramifications for cognition and behaviour associated with deletions and duplications despite their distinct effects on brain anatomy, corroborating some of our earlier CNV-imaging studies on the UK Biobank cohort14,38.

The analyses of genetic influences have long been dominated by univariate frameworks39,40. These standard regression approaches model one input variable at a time and thus focus on individual variables independently while neglecting the complex relationships and synergies that exist among genes, brain and behaviour. In other words, univariate approaches struggle to provide a natural approach to analysing high-dimensional data and harnessing the ‘curse of dimensionality’, making it challenging to capture the joint influence of multiple variables41. The growing availability of variable-rich and multimodal datasets with deep phenotypic profiling prompts a change in our traditional analytic toolkit42. Doubly multivariate techniques, for example, PLS, address several limitations of mass univariate approaches, providing a more nuanced and integrated perspective on the relationships between thousands of measures of brain architecture and behaviour in the general population24,29. Previous research showed that particular genes are an important contributor to the interindividual variability of thus uncovered latent patterns43. Building on the heritability of the latent patterns, we showcased that their expression is further shaped by the presence of genome-wide protein-coding mutations. Our findings address the need for a deeper examination of the relationship between CNVs, brain structure and behaviour, as recently proposed as an important research direction going forward15, revealing their effects on social, familial and environmental factors.

The consequences for various aspects of human health and wellbeing often go unnoticed because analyses of genomic deletions and duplications most commonly focus on intellectual disability and developmental delay19,44. Developmental delay phenotypes, especially language and motor disorders, are the earliest symptoms for which children are clinically referred for assessments and genetic testing45. Recent results showcased potential lifelong implications represented by diminished academic qualifications, occupation or household income for a small set of schizophrenia-associated CNVs46. As an important contribution of our present investigation, our results demonstrated that the genome-wide presence of any coding CNV might be linked to impaired real-life functioning, represented here by cognitive performance, income, education, screen time, or sleep duration. These characteristics played a driving role in our dominant mode of population covariation, which is tightly linked to frontoparietal and temporal networks—regions frequently reported to be altered by previous single-CNV studies, such as 22q11.2 (ref. 15). A similar dominant mode characterized by cognitive measures as well as screen time was identified in the Human Connectome Project population resources29. The stronger influence of deletions on the dominant mode compared with duplications is concordant with the more pronounced effect of deletions on cognitive ability observed in clinical studies47.

We also reveal additional consequences beyond just the dominant population mode, which are at risk of staying hidden in classical analyses48. Concretely, our second mode highlighted impoverished familial mental wellbeing as a prominent marker of CNV carriers. Notably, the presence of phenotypes from both child and parental questionnaires demonstrates how wellbeing is closely tied to the family system. It has been estimated that over 99% of CNVs are inherited49. Therefore, in addition to influencing offspring phenotype through genetic inheritance, the parental genotype can indirectly influence offspring phenotype through its expression in the parental phenotype50. Where this occurs, offspring may be subject to both phenotype-associated CNV and phenotype-associated environments from parents. In conclusion, the CNVs we studied in adolescents have likely been passed down from either parent, which points toward influences on the overall family system. The multigenerational impact where genetic and also environmental legacies contribute to the behavioural outcomes highlights the complex interplay between inherited genetic variations and the environments shaped by parental phenotypes.

Finally, our across-CNV analyses also revealed new associations between genomic deletions, environmental factors and higher-order brain networks, represented here mainly by regions in the temporal lobes. Interestingly, the higher association cortex, especially the default mode network, was suggested to be more ‘life wired’, resulting from differences in the circumstances and contexts in which people grow up and everyday life experiences51. The deeper layers of the neural processing hierarchy, such as the default mode network, allow for greater environmental influence and plasticity, as demonstrated by prolonged maturation and slower myelination compared with sensory/motor circuits in human primates52,53. Our finding adds more evidence for the adaptive and dynamic nature of the recently evolved parts of the human brain, emphasizing the prominent role of genetic and environmental interplay in shaping neural development and function. Importantly, the environmental milieu, here represented by measures of crimes, drug possession, or violence, is related to health through psychological, physiological and behavioural pathways54. Previous research documented chronic health conditions to be more prevalent in low-income neighbourhoods, including those affecting infants (low birth weight), children (asthma) and adults (cardiovascular health)55. Specifically, living in low socioeconomic status neighbourhoods56 and neighbourhoods perceived as unsafe57 displayed elevated physiological risk, which includes indicators of inflammation and neuroendocrine and cardiovascular functioning.

Here we provided a detailed depiction of how environmental and behavioural factors are reflected in brain structure. Moreover, we documented the alterations in these brain-wide patterns in the presence of genetic mutations. In concordance, previous single-CNV studies identified brain-wide patterns of regional alterations that robustly differentiate controls from carriers of clinical CNVs (sensitivity 94.2% and specificity 93.3% in classifying 22q11.2 cases from healthy controls)14,58. Here we broaden the incumbent analysis scope of a few selected CNVs towards any coding CNV present in the genome. The brain pattern corresponding to our second across-CNV population mode highlighted opposing effects of deletions and duplications, recapitulating the mirroring effects observed in clinical studies31. As a primary example, the lingual gyrus here played a dominant role in two altered brain patterns. The effects of CNVs on this region have been documented for carriers of 16p11.2 CNVs59. Similarly, as prominent examples from the frontal lobe, we observed significant contributions of the middle and superior frontal gyrus, which have been shown to be impacted by 1q21.1 (ref. 18) and 15q11.2 (ref. 60) alteration, respectively. Impairments of lingual and frontal gyri have been associated with anxiety–depression severity61 or attention control deficit62—phenotypes often present in CNV carriers63. While these regions recapitulate certain findings from commonly conducted single-CNV studies, our analysis puts forward the concept of shifted multivariate patterns, capturing more complex interactions and revealing how CNVs influence broader brain–behaviour relationships. While the identified whole-brain patterns represent a general trend in each type of genetic mutation, the specific alterations pertinent to a specific CNV (for example, 22q11.2 or 16p11.2) are further moulded by the attributes of genes that are affected by a given CNV. 
According to analyses, the final brain and behaviour profile can be shaped by various attributes of deleted/duplicated genes, including their tolerance to being mutated, or the temporal expression profile of affected genes. This may be part of the reason why previous research found brain patterns associated with deletions at 22q11.2 loci to strongly resemble deletions at 15q11.2 loci while being different from 16p11.2 deletions14. Our findings reveal overarching brain patterns shared across various CNVs, yet the precise alterations associated with individual CNVs vary and are determined by the characteristics of the affected genes, such as their temporal and spatial expression profiles as well as environmental impacts19,64,65.

In conclusion, we developed a multilevel pattern-learning framework to investigate the effects of genome-wide protein-coding mutations on brain organization and behaviour. This approach offers a comprehensive view of the multifaceted impact of rare genetic variations, surpassing limitations of many traditional univariate frameworks. We revealed that both genomic deletions and duplications may contribute to challenges in family wellbeing through associations with increased parental and child stress, anxiety and depression, as well as neighbourhood violence. These behavioural and emotional challenges are mirrored by shifts in brain organization, with alterations predominantly seen in higher-order networks, underscoring the profound cross-associations between genetic mutations, behavioural outcomes and changes in brain structure. Future research building on such approaches, transcending levels of description usually studied in isolation, can better appreciate the complexity of the relationship between genetic determinants and human health.

Methods

ABCD population data source

Brain imaging, behavioural, clinical and genetic data in this study were obtained from the Adolescent Brain Cognitive Development Study (ABCD), representing the most extensive biomedical child development study of its kind. The ABCD Study acquired data from 11,877 children aged 9–10 years (mean age = 9.49 years) from 21 sites across the United States (48% girls; 57% Caucasian, 15% African American, 20% Hispanic, 8% other)66. We leveraged baseline measurements from ABCD Annual curated release 4.0, which contains baseline data on the entire participant cohort as well as early longitudinal data, including 2-year follow-up neuroimaging data (second brain-imaging timepoint). All protocols for ABCD were approved by either a central or site-specific institutional review board committee67. Caregivers provided written, informed consent and children provided verbal assent to all research protocols68,69.

Genetic annotation and CNV calling

Our study is built on the identification of exonic CNVs in the ABCD study sample. The genotyping protocol for the ABCD sample (n = 11,088) has been described previously70. In addition to the quality control (QC) provided by ABCD, we performed several additional steps to ensure high quality of the genetic data. Using PLINK (v.1.9)71, we removed SNP variants with a missing rate >5% as well as SNPs with a Hardy–Weinberg equilibrium exact test P < 0.0001. We only considered arrays with call rate ≥99%, log R ratio s.d. < 0.35, B allele frequency s.d. < 0.08, absolute value of wave factor <0.05 and count of all unfiltered CNV per sample ≤10 (n = 7,896). These thresholds align with the quality control guidelines established by the Psychiatric Genomics Consortium CNV calling pipeline10,19,72.
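
The array-level thresholds listed above can be expressed as a single filter. This is a minimal sketch: the `ArrayQC` record and its field names are hypothetical, and only the numeric thresholds are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class ArrayQC:
    """Hypothetical per-array quality metrics."""
    call_rate: float    # fraction of successfully called SNPs
    lrr_sd: float       # s.d. of the log R ratio
    baf_sd: float       # s.d. of the B allele frequency
    wave_factor: float  # signed wave factor
    n_cnv_calls: int    # count of unfiltered CNV calls for the sample


def passes_array_qc(a: ArrayQC) -> bool:
    """Apply the array-level thresholds described in the text."""
    return (
        a.call_rate >= 0.99
        and a.lrr_sd < 0.35
        and a.baf_sd < 0.08
        and abs(a.wave_factor) < 0.05
        and a.n_cnv_calls <= 10
    )


good_array = ArrayQC(0.995, 0.20, 0.04, -0.01, 3)
noisy_array = ArrayQC(0.995, 0.40, 0.04, -0.01, 3)  # log R ratio too variable
```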

Concurrently, on the basis of genotyping, we identified 73 participants with >5% missing data and 419 participants with a high degree of identity-by-descent (PI_HAT >0.8), indicative of duplicated data or monozygotic twins. For these cases, we retained the array with the highest call rate. On the basis of these criteria, we excluded a total of 238 individuals from the CNV dataset, resulting in a final sample of n = 7,658. No individuals had discordant phenotypic and genetic sex information. Finally, we excluded 51 individuals associated with plate 461 (based on ABCD instructions, n = 7,607).

The identification of CNVs using SNP array (GRCh37/hg19) data followed previously published methods19,20. CNVs were called using the pipeline described at https://github.com/labjacquemont/MIND-GENESPARALLELCNV. In short, we computed PFB-files (Human Genome Build NCBI37/hg19) on the basis of the 500 best arrays in ABCD, and we used GC (content)-model files (https://kentinformatics.com and https://github.com/ucscGenomeBrowser/kent.git). Autosomal CNV detections from either PennCNV73 or QuantiSNP74, or both, were combined using CNVision75. All identified CNVs met stringent quality control criteria: confidence score ≥30 (for at least one of the two detection algorithms), size ≥50 kb, unambiguous type (deletion or duplication), and overlap with segmental duplications, HLA regions or centromeric regions <50%. In addition, we employed our in-house machine-learning algorithm, DigCNV (https://github.com/labjacquemont/DigCNV), which leverages nine CNV characteristics (including array metrics, localization metrics and CNV-specific metrics) to detect additional artefact CNVs. Finally, all carriers (1 participant) of a structural variant ≥10 Mb, a mosaic CNV or a chromosome anomaly (aneuploidy or sex chromosome anomaly) were removed (n = 7,606). For the final set of participants, we calculated the first 10 genetic principal components (PCs) using the --pca function in PLINK (v.2.3)76. After these quality control steps, we identified 668 sibling pairs using the KING protocol (‘king -b file.bed --related’, https://www.kingrelatedness.com/manual.shtml).

All identified CNVs were annotated using Gencode V19 (hg19) with ENSEMBL (https://grch37.ensembl.org/index.html). In this study, we only used exonic CNVs that fully encompassed at least one gene. In addition to the number of encompassed genes, each CNV was further annotated with seven other previously used scores. Specifically, we used an annotation quantifying the tolerance to protein-loss-of-function of each gene: 1/LOEUF77. Each CNV was then characterized by the sum of 1/LOEUF of encompassed genes. Higher scores thus indicated greater intolerance to loss-of-function mutations. Furthermore, CNVs were described using average temporal expression78 and average peak epoch. Gene-wise temporal expression was calculated as the developmental trajectory that the gene follows based on trajectory analysis (gene-specific trajectory coding: ‘Rising’ = 1, ‘Non-transitional’ = 0, ‘Falling’ = −1). The peak epoch corresponds to an epoch of highest expression, where epochs correspond to the developmental period defined previously79. Each CNV was also characterized by the number of genes, for which expressions in the brain were labelled as ‘High’ or ‘Elevated’ according to the GTEx resource (https://www.gtexportal.org). Finally, we quantified how many genes in each CNV were previously associated with autism spectrum disorder80, schizophrenia81 and any disorder by either rare or common variation78. Additional analysis using probability of loss intolerance (pLI) as another measure of functional intolerance of CNVs is provided in Supplementary Fig. 5. The similarity of the seven annotations is summarized in Supplementary Fig. 8.
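
To make two of these annotations concrete, the sketch below computes the summed 1/LOEUF score and the average temporal expression code for one CNV. Gene names, LOEUF values and trajectory labels are hypothetical, and skipping genes that lack an annotation is an assumption rather than a documented choice.

```python
# Trajectory coding taken from the text.
TRAJECTORY_CODE = {"Rising": 1, "Non-transitional": 0, "Falling": -1}


def sum_inverse_loeuf(genes, loeuf):
    """Sum of 1/LOEUF over the genes fully encompassed by a CNV."""
    return sum(1.0 / loeuf[g] for g in genes if g in loeuf)


def mean_temporal_expression(genes, trajectory):
    """Average trajectory code over the annotated genes of a CNV."""
    codes = [TRAJECTORY_CODE[trajectory[g]] for g in genes if g in trajectory]
    return sum(codes) / len(codes) if codes else float("nan")


# Hypothetical annotations for a CNV encompassing three genes.
loeuf = {"GENE_A": 0.25, "GENE_B": 0.5}
trajectory = {"GENE_A": "Rising", "GENE_B": "Falling", "GENE_C": "Rising"}
cnv_genes = ["GENE_A", "GENE_B", "GENE_C"]

intolerance = sum_inverse_loeuf(cnv_genes, loeuf)               # 4 + 2
temporal = mean_temporal_expression(cnv_genes, trajectory)      # (1 - 1 + 1) / 3
```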

As part of our sensitivity analyses, we compiled a list of 85 CNVs previously proposed to be pathogenic10,44,82,83,84 (sum of 1/LOEUF for each gene encompassed in CNV ≥6 or inclusion in ClinGen resource; https://clinicalgenome.org). Regional coordinates are available elsewhere19. A CNV was defined as recurrent if it overlapped by ≥50% with one of the 85 CNVs and/or included the key genes of the corresponding region (see details for each recurrent CNV in Supplementary Table 3). This 50% threshold has been shown to provide excellent sensitivity and specificity to detect recurrent CNVs19.
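
The overlap criterion can be sketched as an interval computation. The coordinates below are hypothetical, the key-gene criterion is omitted, and measuring overlap as the fraction of the reference region covered by the CNV is an assumption, since the text does not state which interval serves as the denominator.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def overlap_fraction(cnv: Region, ref: Region) -> float:
    """Fraction of the reference region covered by the CNV (assumed definition)."""
    if cnv.chrom != ref.chrom:
        return 0.0
    shared = min(cnv.end, ref.end) - max(cnv.start, ref.start)
    return max(0, shared) / (ref.end - ref.start)


def is_recurrent(cnv: Region, references, min_overlap: float = 0.5) -> bool:
    """True if the CNV covers at least min_overlap of any reference region."""
    return any(overlap_fraction(cnv, ref) >= min_overlap for ref in references)


reference_regions = [Region("chr1", 1_000_000, 2_000_000)]  # hypothetical locus
carrier_cnv = Region("chr1", 1_400_000, 2_500_000)          # covers 60% of it
small_cnv = Region("chr1", 1_900_000, 2_300_000)            # covers 10% of it
```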

Detailed profiling of behavioural and cognitive data

We analysed a rich battery of 1,319 cognitive, sociodemographic and environmental measures from 11,879 participants partially reported in previous research24. In line with previous research24, we used robust z-scores for the preprocessing of each phenotype. The robust z-scores were derived by expressing each value’s deviation from the phenotype’s median in units of the median absolute deviation (MAD)85. In other words, the resulting score indicates how far each value lies from the median, on a scale analogous to standard deviations but robust to outliers. Subsequently, we removed values with a z-score >4. We then excluded phenotypes with <90% retained values before excluding participants with <90% retained values across the retained phenotypes. The remaining participants (n = 11,618) were considered for further analysis. The complete list of 962 phenotypes from 20 predefined categories included in the analysis is available in Supplementary Table 4. These categories were defined by the NIH and can be found online (https://nda.nih.gov/general-query.html?). To avoid potential confusion, we adjusted names of three categories as follows: Questionnaires to Parental Questionnaire, Summary to Mental Health Summary, Diagnosis to Diagnosis (K-SADS), where K-SADS stands for the Kiddie Schedule for Affective Disorders and Schizophrenia interview. As the last step, for the purpose of data analysis, missing values were imputed using the KNNImputer function (n_neighbors = 5, weights = ‘uniform’) in the scikit-learn package. All derived phenotypic measures were then adjusted for variation that can be explained by age and sex.
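
The outlier-removal step can be illustrated with a short sketch. Two details are assumptions rather than statements from the text: the MAD is rescaled by 1.4826 so that scores are comparable to standard deviations under normality, and the threshold of 4 is applied to the absolute score.

```python
from statistics import median


def robust_z(values, scale=1.4826):
    """Deviation from the median in units of the (rescaled) MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; robust z-score is undefined")
    return [(v - med) / (scale * mad) for v in values]


def mask_outliers(values, threshold=4.0):
    """Replace values whose absolute robust z-score exceeds the threshold by None."""
    scores = robust_z(values)
    return [v if abs(z) <= threshold else None for v, z in zip(values, scores)]


phenotype = [1, 2, 3, 4, 100]  # hypothetical measurements with one extreme value
cleaned = mask_outliers(phenotype)
```

Masked entries would then count as missing when applying the 90% retention criteria and during imputation.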

MRI imaging-derived phenotypes

Our data sample included expert-curated brain-imaging phenotypes of grey matter morphology. The images were acquired across 21 sites in the United States with harmonized imaging protocols for GE, Philips and Siemens scanners86. We used baseline structural T1-weighted tabulated MRI data from ABCD curated release 4.0. We only included participants who passed quality assurance using the recommended QC parameters (n = 11,723) described in the ABCD 4.0 Imaging Instruments Release Notes. ABCD preprocessing and QC steps are described in detail in the methodological reference for the ABCD study86.

The downloaded tabulated brain-imaging phenotypes were guided by the topographical brain region definitions based on the Destrieux parcellation atlas87. This feature-generation step provides neurobiologically interpretable measures of grey matter volume in 148 regions. For each included regional volume, we calculated the MAD for each brain region and removed values with MAD >4 (ref. 24). Participants with <90% of regional volume retained in any region were excluded from the analysis. The remaining participants (n = 11,681) were included for further analysis. Finally, interindividual variations in the volumes that could be explained by age, sex, total brain volume and scanning site were regressed out.
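
Regressing out confounds amounts to keeping the residuals of a linear fit. The sketch below assumes ordinary least squares with an intercept and uses simulated continuous confounds; a categorical confound such as scanning site would first need to be dummy-coded, which is not shown.

```python
import numpy as np


def regress_out(y, confounds):
    """Residuals of y after an ordinary least-squares fit on confounds plus intercept."""
    design = np.column_stack([np.ones(len(y)), confounds])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta


# Simulated data: a regional volume that depends on age and total brain volume.
rng = np.random.default_rng(0)
age = rng.normal(size=200)
tbv = rng.normal(size=200)
volume = 2.0 * age + 0.5 * tbv + rng.normal(size=200)

residual = regress_out(volume, np.column_stack([age, tbv]))
```

By construction the residuals are uncorrelated with every column of the design matrix.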

To analyse temporal changes in brain structure, we also acquired structural T1-weighted tabulated MRI data during the follow-up 2 years after the first MRI recording. Brain-imaging data from this second timepoint underwent the same cleaning steps as the baseline brain-imaging data. In total, follow-up brain measurements were available for 3,715 participants (2,608 controls, 317 deletion carriers, 790 duplication carriers).

Multivariate pattern analysis protocol

After rigorous quality control of brain, behaviour and genetic data sources, we analysed a total of 7,338 participants that met all established criteria, ensuring robust data integrity across each measurement domain. As a first data preparation step, each brain and behaviour measurement was normalized (z-scored) to ensure comparability across different scales. The normalized measurements were then submitted to principal component analysis (PCA). PCA is known to be robust to noisy, sparse and mixed-valued data, which makes it particularly effective in enhancing the stability of subsequent analyses by addressing rank deficiency and minimizing noise. Furthermore, PCA is ideally suited for mixed data types, as it seamlessly converts continuous, binary and otherwise categorical variables into continuous components, allowing for efficient dimensionality reduction while preserving the essential variability in the dataset88. On the basis of a thorough examination, we extracted the first 50 PCA components for both the brain and behaviour measurements (Supplementary Fig. 9). Notably, even though our primary analysis utilized regional volumes as indicators of brain structure, we found that the behavioural loadings were similar when using regional thickness and area. Furthermore, the brain loadings from our first and third modes were similar to those obtained using brain area, while the brain loadings from our second mode closely matched those derived from thickness (Supplementary Fig. 10).

After initial data cleaning, we focused on participants without CNVs to identify modes of covariation representative of the general population. The first cleaned input dataset included regional brain volumes (5,314 × 50 matrix), and the second comprised behavioural measures (5,314 × 50 matrix). To uncover multivariate relationships between these high-dimensional datasets, we applied canonical PLS analysis—a method well suited for identifying latent structures that maximize covariance between two variable sets89. In other words, PLS identifies canonical modes by finding linear combinations of brain and behavioural variables that co-vary most strongly, solving the generalized eigenvalue problem of their cross-covariance matrix. The resulting participant-level projections are referred to as brain scores and behaviour scores throughout the paper. These scores represent the expression of each latent mode in each individual. The model produces k orthogonal modes, ordered by the amount of brain–behaviour covariance they explain. The first (that is, the dominant) and strongest mode explained the largest fraction of covariance between brain and behaviour measurements, while subsequent modes account for residual covariation unexplained by the preceding ones. PLS draws similarities with canonical correlation analysis (CCA). However, CCA can be prone to instability90. Nevertheless, our obtained PLS solutions strongly resemble those obtained with CCA (Supplementary Fig. 11).
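
The core idea of finding covarying linear combinations can be sketched with a singular value decomposition of the cross-covariance matrix. This PLS-SVD variant is a simplification chosen for brevity and is not claimed to be the implementation used in the study; canonical PLS with iterative deflation yields the same first mode, but later modes can differ. The data are simulated.

```python
import numpy as np


def pls_svd(X, Y, n_modes):
    """Scores and covariances from the SVD of the X-Y cross-covariance matrix."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    cross_cov = Xc.T @ Yc / (len(X) - 1)
    U, s, Vt = np.linalg.svd(cross_cov, full_matrices=False)
    x_weights = U[:, :n_modes]      # one column of brain weights per mode
    y_weights = Vt[:n_modes].T      # one column of behaviour weights per mode
    return Xc @ x_weights, Yc @ y_weights, s[:n_modes]


# Simulated brain and behaviour matrices sharing one latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))
brain = latent @ rng.normal(size=(1, 6)) + rng.normal(size=(300, 6))
behaviour = latent @ rng.normal(size=(1, 4)) + rng.normal(size=(300, 4))

brain_scores, behaviour_scores, covariances = pls_svd(brain, behaviour, n_modes=3)
```

Each singular value equals the covariance between the corresponding brain and behaviour scores, which is why modes come out ordered by explained covariance.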

Contribution of original phenotypes to latent variables

To quantify the contribution of each regional volume and behavioural measure to the construction of the latent population mode, we computed PLS loadings as Pearson’s correlation between a respective PLS score and the original measurement across participants. The thus obtained loadings indicate the strength and direction of the relationship between the original phenotype and the identified PLS score. Stronger loading values signify greater importance in contributing to the latent structures, offering insights into which variables drive the covariation patterns. All brain and behaviour loadings are available in Supplementary Table 1.
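
A loading in this sense is a column-wise Pearson correlation. The following sketch computes it with NumPy for simulated data; the variable names are illustrative.

```python
import numpy as np


def loadings(scores, measurements):
    """Pearson correlation between one score vector and each measurement column."""
    s = (scores - scores.mean()) / scores.std()
    m = (measurements - measurements.mean(axis=0)) / measurements.std(axis=0)
    return (m * s[:, None]).mean(axis=0)


# Simulated example: three measurements with different relations to the score.
rng = np.random.default_rng(1)
score = rng.normal(size=100)
data = np.column_stack([
    score + rng.normal(size=100),  # positively related
    -score,                        # perfectly anti-correlated
    rng.normal(size=100),          # unrelated
])
r = loadings(score, data)
```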

To assess the significance of phenotype contributions beyond chance, we employed a bootstrapping strategy. In each iteration, a perturbed version of the dataset was generated by resampling control participants with replacement, maintaining the original sample size. This procedure was repeated 1,000 times, ensuring consistent ordering and orientation of the PLS modes across iterations. Each bootstrap iteration yielded a new realization of the full analysis pipeline, resulting in 1,000 trained PLS models and corresponding sets of PLS coefficients. A coefficient was considered statistically robust if its two-sided 95% confidence interval—based on the bootstrap distribution (2.5th to 97.5th percentile)—did not include zero, indicating a consistent and significant contribution across resamples.
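
The percentile-bootstrap criterion can be sketched as follows. For brevity the resampled statistic is a simple mean of simulated data, standing in for a PLS coefficient obtained by refitting the full pipeline; the alignment of mode order and sign across iterations is not shown.

```python
import numpy as np


def bootstrap_ci(data, statistic, n_boot=1000, seed=0):
    """Two-sided 95% percentile interval of a statistic under resampling."""
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.array([
        statistic(data[rng.integers(0, n, size=n)])  # resample with replacement
        for _ in range(n_boot)
    ])
    lower, upper = np.percentile(estimates, [2.5, 97.5])
    return lower, upper


def is_robust(lower, upper):
    """True if the interval excludes zero."""
    return lower > 0 or upper < 0


rng = np.random.default_rng(42)
sample = rng.normal(loc=1.0, scale=1.0, size=200)  # simulated stand-in data
ci_low, ci_high = bootstrap_ci(sample, np.mean)
robust = is_robust(ci_low, ci_high)
```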

We used the BrainStat toolbox91 to contextualize obtained patterns with respect to large-scale brain networks based on the Schaefer–Yeo definition92. Specifically, we mapped the brain loadings from the 148 regions to ‘fsaverage5’ vertices. We then computed the average absolute loading in each of the seven resting-state brain networks.

Optimal number of PLS dimensions in the general population

Each identified PLS mode was submitted to statistical significance tests of robustness consistent with an established combination of cross-validation and permutation testing24 (Supplementary Fig. 1). Initially, controls were split into 10 folds, where 9 folds of participants were used as a train set, and 1 fold was used as a test set. The splitting into train and test set was stratified on the basis of genetic relatedness, ensuring that siblings were kept together to avoid scenarios where one sibling is used in training while the other sibling is in the test set. To identify siblings, we used the KING protocol (Kinship-based INference for Gwas) to generate kinship coefficients, which allowed us to define two groups: non-siblings (<0.177 kinship coefficients) and siblings (>0.177 kinship coefficients). The controls in the training set were used to estimate the parameters of all subsequent tools. In the first step, each brain or behaviour measurement was z-scored column-wise across all controls in the training set. PCA then separately reduced the dimension of brain and behaviour measurements to 50 features. In the next step, the preprocessed behaviour and brain measures were used as input variables to estimate a single canonical PLS model, where the output of the model is a set of scores (latent variables). This PLS model can also be characterized by weights (projection matrices used to transform input variables). These training-PLS weights were back projected using the PCA model to obtain brain and behaviour weights in the original non-reduced ambient space.
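
Keeping siblings together during the split can be sketched by first merging related participants into families and then assigning whole families to folds. The participant identifiers and kinship values are hypothetical, and the round-robin assignment of shuffled families is an assumption about how folds might be balanced.

```python
import random


def family_folds(participants, kinship_pairs, n_folds=10, threshold=0.177, seed=0):
    """Assign participants to folds so that related individuals share a fold."""
    parent = {p: p for p in participants}

    def find(p):
        # Union-find lookup with path compression.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Merge every pair whose kinship coefficient exceeds the sibling threshold.
    for a, b, kinship in kinship_pairs:
        if kinship > threshold:
            parent[find(a)] = find(b)

    families = {}
    for p in participants:
        families.setdefault(find(p), []).append(p)

    groups = list(families.values())
    random.Random(seed).shuffle(groups)
    fold_of = {}
    for i, members in enumerate(groups):
        for p in members:
            fold_of[p] = i % n_folds
    return fold_of


participants = [f"P{i}" for i in range(20)]
pairs = [("P0", "P1", 0.25), ("P2", "P3", 0.26), ("P4", "P5", 0.05)]
fold_of = family_folds(participants, pairs)
```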

In the next step, brain and behaviour scores were computed for controls from the test set. Specifically, z-scoring followed by PCA dimensionality reduction was applied with parameters learned using the training set. The resulting preprocessed measurements were multiplied by the original training-PLS weights to obtain PLS scores for test-sample controls. Finally, the covariance between brain and behaviour scores was calculated for each canonical mode. We took the average of these canonical covariances across the 10 folds. This procedure was repeated 100 times with a random fold split of controls to obtain a distribution of out-of-sample covariances for each PLS mode.
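
The separation between parameters learned on the training set and their application to held-out participants can be sketched as follows. The `ZScorePCA` helper is hypothetical, the data are simulated, and the weights come from a plain SVD of the training cross-covariance rather than from the exact PLS routine used in the study.

```python
import numpy as np


class ZScorePCA:
    """Hypothetical helper: z-scoring followed by PCA, both fitted on training data."""

    def fit(self, X, n_components):
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        Z = (X - self.mean_) / self.std_
        _, _, Vt = np.linalg.svd(Z, full_matrices=False)
        self.components_ = Vt[:n_components]
        return self

    def transform(self, X):
        # Reuse the stored training mean, s.d. and components.
        return ((X - self.mean_) / self.std_) @ self.components_.T


rng = np.random.default_rng(0)
latent = rng.normal(size=(400, 1))
brain = latent @ rng.normal(size=(1, 8)) + rng.normal(size=(400, 8))
behaviour = latent @ rng.normal(size=(1, 5)) + rng.normal(size=(400, 5))
train, test = slice(0, 360), slice(360, 400)

brain_prep = ZScorePCA().fit(brain[train], n_components=3)
behav_prep = ZScorePCA().fit(behaviour[train], n_components=3)
Bx = brain_prep.transform(brain[train])
By = behav_prep.transform(behaviour[train])

# Weights estimated on training participants only.
U, s, Vt = np.linalg.svd(Bx.T @ By / (len(Bx) - 1), full_matrices=False)

# Held-out scores: training-derived preprocessing, then training-derived weights.
test_brain_scores = brain_prep.transform(brain[test]) @ U
test_behav_scores = behav_prep.transform(behaviour[test]) @ Vt.T
test_cov = [
    np.cov(test_brain_scores[:, k], test_behav_scores[:, k])[0, 1] for k in range(3)
]
```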

To assess the statistical significance of the resulting PLS modes, we ran 1,000 iterations of the same 10-fold cross-validation procedure described above, where the order of participants of the brain measurements was randomly permuted in each iteration. In contrast to the unpermuted dataset, we collected covariances for the training rather than the testing participants to account for overfitting by the PLS. In other words, using covariance from the permuted train set, and not the test set, represents a more stringent criterion. Finally, P values for each of the PLS modes were calculated as a percentage of cases when permuted covariance was greater than the mean cross-validated covariance.
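
The permutation logic can be sketched in a few lines. Here the statistic is the covariance of two simulated vectors, standing in for the covariance produced by the full cross-validated pipeline; the P value is the proportion of permutations exceeding the observed value, as described in the text.

```python
import numpy as np


def permutation_p_value(x, y, statistic, n_perm=1000, seed=0):
    """Proportion of permutations whose statistic exceeds the observed one."""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    null = np.array([
        statistic(x[rng.permutation(len(x))], y)  # shuffle participant order in x
        for _ in range(n_perm)
    ])
    return float(np.mean(null > observed)), observed


def covariance(a, b):
    return np.cov(a, b)[0, 1]


rng = np.random.default_rng(7)
brain_score = rng.normal(size=200)
behaviour_score = brain_score + rng.normal(size=200)  # simulated association
p_value, observed_cov = permutation_p_value(brain_score, behaviour_score, covariance)
```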

Group differences in brain–behaviour pattern expression

We developed a pipeline quantifying the differences in brain and behaviour scores between controls and CNV carriers for the identified PLS modes (Supplementary Fig. 2). Initially, participants without any CNV were split into a training set (90%) and a test set (10%) with stratification based on genetic relatedness so that siblings were kept in the same set. As outlined above, brain and behaviour scores for controls in the test set were calculated using parameters learned from the training set. Importantly, the same PLS model was applied to CNV carriers (both deletion and duplication), allowing us to compute their brain and behaviour scores without re-estimating the PLS model. In other words, the PLS model trained on the training control group was used to analogously analyse the brain and behaviour data of CNV carriers, ensuring consistency in the interpretation of brain–behaviour associations across groups. More specifically, z-scoring followed by PCA dimensionality reduction was applied with parameters learned using the training set of controls. The resulting preprocessed measurements were multiplied by the precalculated PLS weights to obtain PLS scores for CNV carriers. We then calculated the average across the 10 folds in differences between out-of-sample controls and CNV carriers for both brain and behaviour scores in each canonical mode. This procedure was repeated 1,000 times with a random 90:10 split of controls to obtain a distribution of PLS score differences. Finally, P values for each of the PLS modes were calculated based on the percentage of cases when the difference between mean scores of CNV carriers and mean scores of out-of-sample controls was greater than zero (respectively lower for modes with negative mean expression). Resulting P values were adjusted across all modes using FDR correction to control for multiple comparisons.

Temporal shift in brain pattern expressions

A total of 3,715 participants passed the quality control of genetic data and had brain measurements available at both baseline and the 2-year follow-up. The majority of behavioural phenotypes were not available for the second timepoint. Therefore, rather than training a new PLS model, which would not be directly comparable to the original, we focused on assessing longitudinal changes in brain structure using the original model93. Specifically, we applied the PLS model trained on baseline brain and behavioural data to the follow-up brain measurements. This allowed us to compute brain scores at follow-up for each participant and for each PLS mode by projecting the follow-up brain data onto the baseline-derived brain loadings. In essence, we evaluated how each participant’s follow-up brain measurements aligned with the brain–behaviour patterns identified at baseline.
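A minimal sketch of this projection is given below, assuming the follow-up data pass through the same preprocessing chain described for CNV carriers (z-scoring, PCA, then multiplication by PLS weights), with every parameter frozen at its baseline value. The helper names, the dictionary layout and the random stand-in for the PLS brain weights are hypothetical.

```python
import numpy as np

def fit_baseline_transform(X_base, n_components):
    """Learn z-scoring and PCA parameters from baseline brain data only."""
    mean = X_base.mean(axis=0)
    std = X_base.std(axis=0, ddof=1)
    Z = (X_base - mean) / std
    # PCA axes from the SVD of the z-scored (hence centred) baseline data.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return {"mean": mean, "std": std, "pca_axes": Vt[:n_components].T}

def brain_scores(X, params, pls_weights):
    """Project any brain data (baseline or follow-up) with frozen baseline parameters."""
    Z = (X - params["mean"]) / params["std"]
    return Z @ params["pca_axes"] @ pls_weights

# Toy example: 100 participants, 6 brain features, 3 PCs, 2 PLS modes.
rng = np.random.default_rng(0)
X_base = rng.normal(size=(100, 6))
params = fit_baseline_transform(X_base, n_components=3)
W = rng.normal(size=(3, 2))  # hypothetical stand-in for baseline PLS brain weights

# Simulated follow-up: every feature shifted by half a baseline standard deviation.
X_follow = X_base + 0.5 * params["std"]
scores_base = brain_scores(X_base, params, W)
scores_follow = brain_scores(X_follow, params, W)
```

Because nothing is re-estimated at follow-up, any difference between `scores_follow` and `scores_base` reflects a change in the brain measurements themselves rather than a change in the model.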

The aim of this analysis was to explore whether the rate of change in brain scores differs between controls, duplication carriers and deletion carriers. We employed a linear mixed-effects model for each PLS mode and for each CNV group separately. This approach accounted for repeated measurements at two timepoints. For each PLS mode, we constructed a mixed-effects model with the brain scores as the dependent variable. The fixed effects included time T (coded as 0 for baseline and 1 for follow-up), CNV group (controls as 0, duplication/deletion carriers as 1, with the control group serving as the reference category), and the interaction between time and CNV group. The interaction term was included to assess whether the rate of change in brain scores over time differed between the control group and the CNV carrier groups. Random intercept and slope terms were included to capture variability in baseline levels and time effects across individuals. For each PLS mode and each CNV group, the model was formulated as follows:

$${{\rm{brain}}\; {\rm{score}}}_{{ij}}={\beta }_{0}+{\beta }_{1}{T}_{{ij}}+{\beta }_{2}{\rm{CNV}}+{\beta }_{3}\left({T}_{{ij}}\times{\rm{CNV}}\right)+{u}_{0i}+{u}_{1i}{T}_{{ij}}+{\varepsilon }_{{ij}},$$
(1)

where β0 is the intercept, β1 captures the fixed effect of time, β2 represents the fixed effect of CNV carrier status, β3 is the interaction effect between time and CNV carrier status, \({u}_{0i}\) is the random intercept for participant i, u1i is the random slope for time for participant i, and \({\varepsilon }_{{ij}}\) is the residual error term for participant i at timepoint j.

A significant interaction term \({\beta }_{3}\) would suggest that the rate of change in brain scores differed between the control group and the CNV carrier group under consideration in a given PLS mode.
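To make the meaning of the coefficients in equation (1) concrete: when every participant has exactly two timepoints and no missing data, the fixed-effect point estimates reduce to simple contrasts of group means, and β3 becomes a difference in mean change between carriers and controls. The sketch below computes only these point estimates under that balanced-data assumption. It does not fit the random effects or provide the standard errors and P values that a mixed-model routine would, and the function name and toy values are hypothetical.

```python
import numpy as np

def fixed_effect_estimates(base, follow, is_carrier):
    """Cell-mean contrasts matching the fixed effects of equation (1).

    Assumes complete data with one baseline and one follow-up score per participant.
    """
    base = np.asarray(base, dtype=float)
    follow = np.asarray(follow, dtype=float)
    carrier = np.asarray(is_carrier, dtype=bool)
    change = follow - base
    beta0 = base[~carrier].mean()            # control mean at baseline
    beta1 = change[~carrier].mean()          # mean change in controls
    beta2 = base[carrier].mean() - beta0     # baseline carrier-control gap
    beta3 = change[carrier].mean() - beta1   # extra change in carriers
    return {"beta0": beta0, "beta1": beta1, "beta2": beta2, "beta3": beta3}

# Toy brain scores: two controls followed by two carriers.
est = fixed_effect_estimates(
    base=[1.0, 2.0, 3.0, 4.0],
    follow=[2.0, 3.0, 5.0, 6.0],
    is_carrier=[False, False, True, True],
)
```

Here controls change by 1 on average and carriers by 2, so the interaction estimate is 1: carriers' brain scores drift one unit further over the follow-up interval than those of controls.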

Effect size of CNV carriership

We used Cohen’s d to quantify the effect size of the CNVs on the identified PLS modes. For a given mode and separately for brain and behaviour, Cohen’s d was calculated as:

$$d=\frac{{\bar{x}}_{1}-{\bar{x}}_{2}}{\sqrt{\frac{{s}_{1}^{2}+{s}_{2}^{2}}{2}}},$$
(2)

where \({\bar{x}}_{1}\) corresponds to the mean PLS score across CNV carriers, and \({\bar{x}}_{2}\) corresponds to the mean PLS score across controls. Similarly, s1 and s2 correspond to the standard deviations of the PLS scores of CNV carriers and controls, respectively.
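Equation (2) translates directly into a few lines of code. The sketch below assumes sample standard deviations (denominator n − 1), which the text does not specify, and the function name and toy scores are hypothetical.

```python
import numpy as np

def cohens_d(carrier_scores, control_scores):
    """Cohen's d with the unweighted average of the two group variances, as in equation (2)."""
    x1 = np.asarray(carrier_scores, dtype=float)
    x2 = np.asarray(control_scores, dtype=float)
    s1_sq = x1.var(ddof=1)  # assumption: sample variance (n - 1 denominator)
    s2_sq = x2.var(ddof=1)
    return (x1.mean() - x2.mean()) / np.sqrt((s1_sq + s2_sq) / 2)

# Toy PLS scores: carriers have mean 4 and variance 4, controls mean 2 and variance 1.
d = cohens_d([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

Averaging the two variances without weighting by group size keeps the much larger control group from dominating the denominator, which matters when carriers are rare.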

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.