Introduction

Obsessive-compulsive disorder (OCD) is a heritable complex neuropsychiatric condition characterized by persistent, intrusive thoughts and rituals. Current scientific literature supports a genetic contribution to OCD risk. Population-scale epidemiological studies indicate substantial familial clustering of the condition [1,2,3]. Based on twin study estimates, additive genetic factors account for 47% of variance in obsessive-compulsive symptoms [1]. Analyses of common genetic variation in OCD cases versus controls suggest that common risk variants explain around 28% of the observed phenotypic variance in OCD [4].

Rare variant studies of OCD published in recent years have primarily involved whole exome sequencing (WES) of trio cohorts (consisting of parents and an affected proband). Cappi et al. first described an analysis of WES data for 184 OCD trios and detected an excess of damaging de novo (found in the proband but absent in both parents) coding single nucleotide variants (SNVs) and insertion-deletions (indels) in probands relative to unaffected trios, along with two genes recurrently hit with damaging mutations, CHD8 and SCUBE1 [5]. More recently, Halvorsen et al. described an analysis of 1313 OCD cases, of which 587 were probands in complete trios [6]. In accordance with Cappi et al., the authors noted an excess of de novo damaging coding SNVs and indels in these OCD trios relative to unaffected trios. They also observed a general excess of protein-truncating SNVs and indels in singleton cases relative to ancestry-matched controls. They identified CHD8 as a probable risk gene (Q < 0.3) in their analyses, which utilized summary statistics from Cappi et al., and identified an additional damaging coding de novo variant in this gene within their newly published trios [6]. Notably, neither of these studies assessed copy number variants (CNVs). Because CNVs delete or duplicate thousands of bases, and given the excess of protein-truncating SNVs and indels already seen in OCD, it is reasonable to hypothesize that a sufficiently powered case/control comparison would reveal a similar excess of CNVs impacting protein-coding genes in OCD cases.

Several genome-wide CNV studies of OCD have been published over the years (Supplementary Table 1), but they either have smaller sample sizes or do not focus specifically on OCD as a phenotype. The largest CNV study of OCD is from McGrath et al. in 2014, as part of a joint study of OCD and Tourette Syndrome (TS) cases [7]. The study did not describe a general excess of rare CNVs, but did note that cases had an elevated rate of neurodevelopmental deletions relative to controls, although this difference was not statistically significant [7]. There have been other more recent genome-wide CNV studies of OCD, but none featuring an exhaustive case/control comparison of CNV burden at a scale similar to that of McGrath et al. or the exome study described in Halvorsen et al. [7,8,9,10,11].

Given the excess of protein-truncating SNVs and indels already seen in OCD WES studies, we hypothesized that there might be a similar excess of CNVs in OCD cases, and that these CNVs specifically impact protein-coding genes. To test this hypothesis, we designed a case/control CNV study that benefits from a more recently developed genotype array platform (the Illumina Global Screening Array series; GSA), an ancestrally homogeneous Scandinavian population well suited to genetic study, and rich clinical data available for enrolled cases.

Methods

Samples

All OCD cases included in this study were collected in Sweden and Norway as part of the Nordic OCD and Related Disorders Consortium (NORDiC). The rationale, design and methods of the NORDiC study have been described previously [12]. In Sweden, the case-control arm of the study is referred to as NORDiC-SWE and all samples were collected across Sweden between 2015 and 2019. This study was approved by a local ethics board (Stockholm Regional EPN) and all participants provided informed consent. NORDiC-SWE OCD cases (63% female) have a primary International Classification of Disease, 10th revision [13] and/or Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition [14] diagnosis of OCD from a multidisciplinary specialist OCD team, established with a semi-structured instrument such as the Mini-International Neuropsychiatric Interview [15] or the Structured Clinical Interview for DSM Disorders [16]. All patients were included in the study regardless of psychiatric comorbidity, as long as they fulfilled strict diagnostic criteria for OCD. Patients were excluded in cases of diagnostic uncertainty, such as OCD secondary to a neurological disorder or CNS insult, or where the differential diagnosis between OCD and an alternative condition was unclear. Swedish controls were sampled from LifeGene [17], a prospective population-based cohort of around 50,000 individuals in Sweden. Controls were unrelated to any OCD case to the third degree and unaffected with OCD based on self-report. Because the controls were drawn from those used for a GWAS of anorexia nervosa [18], potential controls had been excluded if they had a lifetime history of anorexia nervosa, and the control set was largely female (~91%). Participants provided either blood or saliva for DNA extraction.

In Norway, the case-control arm of the study is referred to as NORDiC-NOR and all samples were collected across Norway between 2016 and 2019. This study was approved by the regional ethics board (REK West) and all participants provided informed consent. NORDiC-NOR OCD cases (65% female) underwent the same diagnostic process, with the same inclusion and exclusion criteria, as those in Sweden. Norwegian controls (50% female) were selected from NORMENT and were aged 18–65 at time of collection. They were screened for psychiatric illness via questionnaires, and included individuals indicated that neither they nor any first-degree relatives had undergone any formal treatment. Participants provided either blood or saliva for DNA extraction. See [19] for more details.

Array genotyping

The majority of samples in this study were genotyped on the Illumina GSA version 1, 2 or 3 (GSAv1, v2, v3), which include a common core of ~600,000 SNPs [20]. The one exception was the Norwegian controls, which were genotyped on an array derived from GSAv1 that contained the same common core of SNPs [19]. The GSA samples were genotyped at LIFE&BRAIN in Bonn, Germany, and the Norwegian controls were genotyped at deCODE Genetics in Reykjavík, Iceland. Some cohorts were genotyped in multiple waves, and Supplementary Table 2 provides the number of samples for each wave.

Processing genotype array data

We obtained raw genotype array data (IDATs) for cases and controls, and processed them using the gtc2vcf pipeline (https://github.com/freeseek/gtc2vcf). We transformed IDATs into dataset-level variant call format (VCF) files with reference alleles listed relative to human reference build 37 (see Supplementary Methods for details), which simplified the dataset merge process. We merged all 10 input datasets on a subset of variants that have genotype missingness <0.02 in each individual dataset. We identified a total of 542,466 variants that fit the full set of criteria. As a precaution, we took the subset of 537,278 variants that were non-ambiguous (not A/T or G/C), and were not indels for the merger. The raw merged dataset consisted of genotypes across these 537,278 variants for 2885 cases and 4227 controls.
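
The variant selection rule described above can be sketched as follows. This is an illustrative reconstruction rather than the pipeline code used in the study: the function names and the per-dataset missingness list are hypothetical, and only the thresholds (missingness <0.02, no A/T or G/C sites, no indels) come from the text.

```python
# Illustrative sketch only; helper names are hypothetical, thresholds follow the text.
MAX_MISSINGNESS = 0.02
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}  # strand-ambiguous SNVs


def is_snv(ref, alt):
    """True when both alleles are single nucleotides (i.e. not an indel)."""
    return len(ref) == 1 and len(alt) == 1 and ref in "ACGT" and alt in "ACGT"


def is_strand_ambiguous(ref, alt):
    """A/T and G/C sites look identical on either strand, so they are dropped."""
    return frozenset((ref, alt)) in AMBIGUOUS_PAIRS


def keep_variant(ref, alt, missingness_by_dataset):
    """Keep a variant only if it passes missingness in every dataset,
    is an SNV, and is not strand-ambiguous."""
    if any(m >= MAX_MISSINGNESS for m in missingness_by_dataset):
        return False
    if not is_snv(ref, alt):
        return False
    return not is_strand_ambiguous(ref, alt)
```

Requiring the missingness threshold to hold in each dataset separately, rather than in the pooled data, prevents a variant that genotypes poorly on one array version from entering the merge.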

Quality Control (QC)

We carried out several rounds of sample-level QC using PLINK v1.90b4.9 to ensure that any case/control association results were not influenced by poor sample quality, sample swapping, cryptic relatedness or ancestry differences. See Supplementary Methods for a description of the full procedure, and the number of samples removed at each step. After sample-level QC we were left with a total of 2325 cases and 3790 controls suitable for inclusion in a comparison of CNV burden between OCD cases and controls.

CNV calling

We called CNVs on all NORDiC OCD cases and Swedish and Norwegian controls for which we had sample-level Log R ratio (LRR) and B allele frequency (BAF) data. All CNV calls were made on the same set of 537,278 variants common to all datasets, as described previously. The calling procedure utilized PennCNV v1.0.5 and QuantiSNP v2.2 to generate separate CNV callsets for each sample, and then defined CNV loci based on the intersection of these callsets. See the Supplemental Methods section for a description of the full procedure.

Sample-level QC specific to intensity metrics

Within each individual dataset, we performed outlier pruning on sample-level intensity metrics to remove poor-quality samples likely to have aberrant CNV call metrics. All metrics were computed by the PennCNV command ‘detect_cnv.pl’. Our dataset-level outlier pruning was focused on the standard deviation of the Log R Ratio (LRRSD), absolute value of waviness factor (absWF) and BAF drift. For each metric in a given dataset, a sample was marked as an outlier if it fell beyond 3 standard deviations (SDs) of the mean. A sample was removed if it was an outlier for any of these metrics. After dataset-level pruning, all remaining samples had LRRSD ≤ 0.2, absWF ≤ 0.02, and BAF drift ≤ 0.001.
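
As a minimal sketch of the dataset-level pruning rule, the following marks a sample as an outlier when any metric lies more than 3 SDs from the dataset mean. The field names (lrr_sd, abs_wf, baf_drift) and the dictionary layout are assumptions made for illustration; in practice the metrics would be parsed from PennCNV output.

```python
from statistics import mean, stdev

# Hypothetical field names for the three intensity metrics named in the text.
METRICS = ("lrr_sd", "abs_wf", "baf_drift")


def prune_outliers(samples, n_sd=3.0):
    """Return the samples from one dataset that lie within n_sd standard
    deviations of the dataset mean on every metric."""
    bounds = {}
    for metric in METRICS:
        values = [s[metric] for s in samples]
        mu, sd = mean(values), stdev(values)
        bounds[metric] = (mu - n_sd * sd, mu + n_sd * sd)
    return [
        s for s in samples
        if all(bounds[m][0] <= s[m] <= bounds[m][1] for m in METRICS)
    ]
```

The function is meant to be applied to each dataset separately, so that arrays or genotyping waves with systematically different intensity profiles are not penalized against a pooled mean.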

We next produced density plots of total CNV count per sample and total number of bases occupied by CNVs per sample, and noted a small number of samples with unusually high values that dataset-level QC failed to exclude. Based on visual inspection of these kernel density plots, we removed samples that had more than 20 Mb occupied by raw CNV calls or more than 20 separate raw CNV calls.

CNV filtering

The qualifying (≥15 probes, ≥30 kb) CNV callset was put through a series of filter steps using code adapted from [21] in order to produce a set of analysis-ready calls that are rare and do not overlap loci naturally prone to copy number alterations. First, we removed CNVs that overlapped (here defined as >30% of CNV bases) loci within 500 kb of telomeres or within 500 kb of designated centromere loci. Next, we removed CNVs with >30% of bases overlapping “non-defined” (i.e., polyN) portions of the GRCh37 reference. We also removed CNVs overlapping previously described segmental duplication loci [22]. Next, we removed CNVs where >30% of bases overlapped loci from Repeatmasker (http://www.repeatmasker.org) corresponding to simple repeat, low complexity or satellite loci. CNVs that overlapped gene regions for T cell receptors or immunoglobulins were removed next. We removed loci where CNV calls were associated with samples from Epstein Barr Virus-transformed Lymphoblastoid Cell Lines [23], as utilized in Huang et al. [24]. Next, calls were subjected to a series of filters on CNV frequency. Calls were required to be found at <1% frequency in gnomAD v2.1 non-neuro global and subpopulations [25]. In addition, calls were required to be found at <1% frequency in the full combined case/control dataset, along with <1% frequency in each individual input dataset. The final step of our CNV filtering protocol used marker-level BAF and LRR values at the callsite in each carrier sample to validate or reject a call by determining whether the distribution of BAF values was consistent with the reported copy state, as described in https://biopsych.dk/iPsychCNV/ and utilized in [26] (see Supplemental Methods).
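
Several of the filters above rest on the same primitive: the fraction of a CNV's bases covered by a set of excluded regions. The sketch below shows one way to compute it, assuming half-open coordinates on a single chromosome. The function names are hypothetical and this is not the adapted code from [21]; only the 30% threshold is taken from the text.

```python
def overlap_fraction(cnv_start, cnv_end, regions):
    """Fraction of CNV bases covered by the union of `regions`.

    Coordinates are half-open [start, end) and all on one chromosome.
    Overlapping regions are counted once.
    """
    length = cnv_end - cnv_start
    if length <= 0:
        raise ValueError("CNV must have positive length")
    covered = 0
    cursor = cnv_start  # rightmost CNV position already counted
    for start, end in sorted(regions):
        start = max(start, cursor)
        end = min(end, cnv_end)
        if end > start:
            covered += end - start
            cursor = end
    return covered / length


def passes_overlap_filter(cnv_start, cnv_end, regions, max_fraction=0.30):
    """A CNV is removed when more than max_fraction of its bases overlap."""
    return overlap_fraction(cnv_start, cnv_end, regions) <= max_fraction
```

The same helper could be reused for the telomere, centromere, polyN and repeat filters by passing a different region list for each.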

Gene-based and breakpoint-based association tests

We used gene-based and breakpoint-based tests to determine whether there were single genes or loci where overlapping CNVs were associated with OCD case status at a level surviving multiple test correction, while also examining evidence of test statistic inflation. We defined four separate case/control groups for this analysis: 1) Swedish male, 2) Swedish female, 3) Norwegian male, and 4) Norwegian female. For each locus, we compared the proportion of cases with at least one rare, overlapping CNV to the proportion in controls, using a two-sided Cochran–Mantel–Haenszel exact test. To estimate genomic inflation, we used a case/control permutation-based approach described in ref. [6]. Before running the formal tests, we assessed genomic inflation of non-overlapping locus-based test statistics, and excluded CNVs overlapping 3 loci as a result (see Supplemental Methods).
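
To illustrate how carrier counts are combined across the four strata, the sketch below computes the asymptotic Cochran–Mantel–Haenszel statistic. Note the substitution: the study used an exact test, whereas this is the chi-square approximation without continuity correction, chosen because it is short enough to show in full. The function name and input layout are hypothetical.

```python
import math


def cmh_test(strata):
    """Asymptotic Cochran-Mantel-Haenszel test across 2x2 tables.

    Each stratum is (case_carriers, case_noncarriers,
                     control_carriers, control_noncarriers).
    Returns (statistic, two-sided p-value) on 1 degree of freedom.
    This is the chi-square approximation, not the exact test.
    """
    sum_a = sum_expected = sum_var = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # stratum carries no information
        sum_a += a
        sum_expected += (a + b) * (a + c) / n
        sum_var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    if sum_var == 0:
        return 0.0, 1.0
    statistic = (sum_a - sum_expected) ** 2 / sum_var
    # Survival function of a chi-square with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value
```

With rare CNVs most per-locus tables contain very small carrier counts, which is the setting where an exact test is preferable to this approximation.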

We conducted gene-based tests of CNV burden in a manner described previously by others [27]. We tested deletions and duplications separately, and merged neighboring genes into single units if over 50% of overlapping CNVs impact both genes [27], leading to 988 separate tests in total. We used the Benjamini–Hochberg procedure to control for false discovery, defining results where false discovery rate (FDR) < 0.05 as significant.
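
For reference, the Benjamini–Hochberg adjustment can be written in a few lines. This is a generic sketch of the standard step-up procedure, not the code used in the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A test is then called significant when its adjusted value falls below the chosen FDR threshold, 0.05 for the gene-based tests described here.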

We also constructed association tests based on CNV breakpoints. As with gene-based tests, we merged probes into separate units based on whether they are shared by over 50% of CNVs that impact them. We also used the same p-value adjustment procedure here as before. Once again, test statistics are well controlled for deletions (lambda = 1.02) and for duplications (lambda = 1.01).

Power analysis for association tests

We conducted power analyses to estimate the types of potential OCD risk CNVs we might have reasonable power to implicate, given a prevalence estimate of 0.01, a sample size of 2248 cases and 3608 controls and multiple testing correction for 20,000 genes. We had 80% power to detect risk CNVs ranging from a relative risk of 69.8 at a control frequency of 0.0001 to a relative risk of 2.5 at a control frequency of 0.01. When comparing our power curves to risk CNVs identified from a well-powered case/control CNV study of schizophrenia [27], we see that we are largely underpowered to implicate similar CNVs in our study. We also produced power curves for a case/control comparison at the same sample size as the schizophrenia CNV study [27], and in this theoretical study design we would have sufficient power to detect such OCD risk CNVs should they exist. See Supplemental Methods section for an in-depth discussion of the power analysis and plotting of power curves.

Global CNV burden tests

We tested for an association between case status and global CNV burden using linear and logistic regression models, with covariates. We first constructed a null model for the total count of raw CNVs per sample. Without covariates, we saw that case status was associated with the raw number of CNV calls made in a given sample (estimate = −0.17, P = 0.04). We added to this model major principal components representative of ancestry (PCs 1–5) and sex. Notably, PC5 is a clear predictor of Swedish versus Norwegian ancestry, in a manner that is not dataset-specific (see Supplementary Figs. 2 and 3). We considered additional covariates (PCs 6–20, LRRSD) and added them to the model if they were associated with both raw CNV count and with case status. We found PC 7 and LRRSD to be associated with both raw CNV count and case status, and added them to the model as well. The logistic regression model in global comparisons of burden was:

OCD_case ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC7 + SEX + LRR_SD + burden_metric

A linear regression model used these same covariates, but here instead the so-called ‘burden metric’ was the outcome and OCD case status was the critical predictor. All p-values reported from these tests are two-sided.
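
The logistic model above can be sketched on simulated data as follows. Everything here is an assumption made for illustration: the data are randomly generated stand-ins (not study data), the effect size is arbitrary, and the fit uses a plain Newton-Raphson routine written with NumPy rather than whatever statistical package the analysis actually used.

```python
import numpy as np


def fit_logistic(design, outcome, n_iter=25):
    """Fit a logistic regression by Newton-Raphson.

    An intercept column is prepended, so coefficient 0 is the intercept.
    Returns (coefficients, standard errors).
    """
    X = np.column_stack([np.ones(len(design)), design])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        information = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(information, X.T @ (outcome - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    information = X.T @ (X * (p * (1.0 - p))[:, None])
    return beta, np.sqrt(np.diag(np.linalg.inv(information)))


# Simulated stand-in data, NOT study data.
rng = np.random.default_rng(0)
n = 5000
pcs = rng.normal(size=(n, 6))              # PC1-PC5 and PC7
sex = rng.integers(0, 2, size=n)
lrr_sd = rng.normal(0.12, 0.02, size=n)
burden_metric = rng.poisson(0.6, size=n)   # e.g. CNV count per sample
true_log_or = np.log(1.2)                  # arbitrary effect for the simulation
linear_predictor = -0.5 + 0.1 * pcs[:, 0] + true_log_or * burden_metric
ocd_case = rng.binomial(1, 1.0 / (1.0 + np.exp(-linear_predictor)))

# Column order mirrors the model formula; burden_metric is last.
design = np.column_stack([pcs, sex, lrr_sd, burden_metric])
beta, se = fit_logistic(design, ocd_case)
odds_ratio = np.exp(beta[-1])
ci_low = np.exp(beta[-1] - 1.96 * se[-1])
ci_high = np.exp(beta[-1] + 1.96 * se[-1])
print(f"OR per unit of burden: {odds_ratio:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

The reported odds ratio is the exponentiated coefficient on the burden metric, so it reads as the change in odds of case status per additional unit of burden after adjusting for ancestry, sex and array noise.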

Results

Case/control cohort

The raw case/control cohort consisted of 2885 cases and 4227 controls, spread across 10 datasets (Fig. 1, Supplementary Table 2). All samples were genotyped on versions or derivatives of the Illumina GSA (Supplementary Table 2). We found a total of 537,278 probes shared across all datasets within this study, and utilized these probes for both quality control and CNV calling.

Fig. 1: An overview of the study design and principal results from the analyses carried out.

We have constructed a CNV case/control study using samples from Illumina GSA and its derivatives. We use extensive CNV call quality control to bring our CNV call rate down to around 0.6 per sample. Our principal results all point to a contribution to OCD genetic risk from CNVs that are at least 30 kb in size and a frequency <0.01, at a rate per sample of around 0.07 (95% CI 0.02–0.11, P = 2.58 × 10−3).

We isolated a subset of the cohort that were of high technical quality and were suitable for case/control comparisons (see Methods and Supplemental Methods for details). A total of 2248 cases and 3608 controls were included in our formal CNV analyses (Fig. 1, Supplementary Table 2). We did not see evidence of stratification of OCD case status across any of the major principal components (Supplementary Fig. 1), though it was clear that Swedish and Norwegian ancestry did separate across these components and should be accounted for (Supplementary Figs. 2 and 3). Most of the variance within the data was explained by the first 5 PCs (Supplementary Fig. 4).

CNV calling and filtering

We called and analyzed CNVs at least 30 kb in size and spanning at least 15 probes. A large number of these calls were present in sample-level data (1.42 calls per sample, Supplementary Tables 3–5). We retained CNVs that lay outside of genomic loci prone to noisy intensity values and that were found at a frequency <0.01 in the cohort as well as in the gnomAD v2.1 structural variant callset (see Methods). This procedure led to a higher degree of comparability across separate datasets (0.59 calls per sample, Supplementary Tables 3–5).

We compared cases and controls for evidence of systemic differences in raw CNV call count, LRRSD and filtered CNV call count. LRRSD metrics across datasets indicated that data were of good quality. Looking at ANGI controls, which underwent clustering before we received the data, the mean LRRSD metrics were higher, but still in range of other included datasets (Supplementary Fig. 5). As indicated before, raw CNV call counts had appreciable differences between datasets, while QC-pass CNV call counts were well-harmonized across the data (Supplementary Figs. 6 and 7).

Out of an abundance of caution, we compared the global burden of the smallest size bin of CNVs assessed in cases versus controls (30–100 kb) to determine whether there were clusters of calls piling up in a manner suggestive of batch effect. We noted an elevation of CNV deletion signal (lambda = 1.21), which was driven exclusively by 19 calls clustering around 3 loci (see Methods). We excluded these CNVs from further analyses, eliminating genomic inflation for deletions within this size bin (lambda = 0.98, Supplementary Fig. 8). No similar clustering suggestive of batch effect was present in CNV calls between 100 kb and 500 kb in size (Supplementary Fig. 9).

Global CNV burden

We found that OCD cases had an excess burden of rare CNVs at least 30 kb in size relative to unaffected controls (OR = 1.12, P = 1.77 × 10−3). More of this excess burden appears to come from deletions (OR = 1.16, P = 8.41 × 10−3) than from duplications (OR = 1.09, P = 0.06, Supplementary Fig. 10). Leave-one-out analyses showed that these results were not driven by any single input dataset (Supplementary Fig. 10) or by any single covariate (Supplementary Fig. 11). Each additional 100 kb of deletion made a sample more likely to be a case (OR = 1.047 per 100 kb, P = 2.31 × 10−3), as did each additional 100 kb of duplication (OR = 1.033 per 100 kb, P = 1.81 × 10−3). Consistent with this, OCD cases carried an excess burden of large (>1 Mb) CNVs (OR = 2.01, P = 3.35 × 10−4, Supplementary Fig. 12). Ultrarare CNVs observed only once in the case/control cohort conferred greater relative risk for OCD (OR = 1.21, P = 1.30 × 10−3, Supplementary Fig. 13), consistent with particularly penetrant, risk-conferring CNVs being subject to negative selection.

CNV burden is concentrated in protein-coding regions

OCD cases were more likely to carry CNVs that impact protein-coding regions of the genome (OR = 1.19, P = 3.07 × 10−4). There was no evidence for a case burden relative to controls for CNVs not overlapping any protein-coding bases (OR = 1.04, P = 0.50). Consistent with the burden of CNVs in cases, the accumulation of CNV-impacted protein-coding genes increased OCD case risk (OR = 1.07 and P = 1.99 × 10−3 for each deletion-impacted gene, OR = 1.04 and P = 3.48 × 10−3 for each duplication-impacted gene).

Case CNV signal was concentrated within genes that are dosage sensitive. Cases carried an excess of CNVs that overlap at least one protein-coding gene that is more likely to be intolerant to loss-of-function (pLI >0.5, OR = 1.60, P = 6.37 × 10−8, Fig. 2) [28]. There was no difference in burden of CNVs that do not overlap at least one of these genes (OR = 1.04, P = 0.48). We also utilized more recently described [21] sets of data-derived haplosensitive and triplosensitive genes and found that CNV burden was elevated primarily within haplosensitive genes (Supplementary Fig. 14). Finally, we carried out a series of tests to determine whether case sample country-of-origin or sex influences deleterious CNV burden (see Supplementary Methods). We failed to detect a difference in deleterious CNV burden between Swedish and Norwegian cases (Supplementary Fig. 15) or between male and female cases (Supplementary Fig. 16).

Fig. 2: CNV burden (deletions, duplications) partitioned by overlap with protein-coding genes.

The odds ratio estimate for case status for each additional CNV (deletions in red, duplications in blue) is depicted with a dot while the 95% confidence interval for the estimate is depicted with bars. Unadjusted p-values are provided for each test result. There is no evidence for case risk being conferred by CNVs that do not overlap a protein-coding base (P > 0.05 for both deletions and duplications). CNVs that confer OCD risk instead appear to overlap protein-coding regions, specifically genes that are loss-of-function intolerant (pLI > 0.5).

CNV burden impacting evolutionarily constrained bases

OCD cases had a higher number of evolutionarily constrained bases impacted by CNV deletions than controls (OR = 1.03 per kbp, P = 6.34 × 10−3). There was no significant case/control difference in the number of constrained bases impacted by duplications (OR = 0.998 per kbp, P = 0.79). We found that CNV deletion burden preferentially loads onto bases with particularly high phyloP scores (Fig. 3), consistent with deletions impacting genomic loci that are intolerant to variation. Repeating this test on CNVs that did not impact a coding base, we did not note any significant case/control difference in constrained bases burdened by CNVs (Supplementary Fig. 17).

Fig. 3: Number of bases impacted by CNVs (deletions, duplications) partitioned by mammalian constraint score.

The odds ratio estimate for case status for each additional kilobase impacted by CNVs (deletions in red, duplications in blue) is depicted with a dot while the 95% confidence interval for the estimate is depicted with bars. Unadjusted p-values are again provided for each test result. In general, each kilobase of DNA that is deleted increases OCD risk, in a manner where the risk conferred increases when the bases deleted are more constrained. This effect is not observed for duplications.

Gene-based tests of CNV burden

No gene-based test passed the threshold for significance (988 tests, FDR-adjusted P < 0.05). There was no evidence of genomic inflation within deletion test statistics or duplication test statistics (lambda = 1.02 and lambda = 1.01, respectively, Fig. 4A, B). Although no individual loci were implicated, the overall CNV burden described in OCD cases suggests that a larger cohort is likely to provide the power required to do so. In particular, cases were more likely to have CNVs where only one sample overlaps the affected area, and cases also have an elevation of loci impacted by at least two deletions beyond what case/control permutation predicts (Fig. 4C). Summary statistics from these tests have been included (Supplementary Table 7), along with statistics from breakpoint-based tests (Supplementary Table 8).

Fig. 4: Lack of genome-significant CNV impacted loci in OCD cases versus controls likely due to low power.

A QQ plot for clumped gene-based test results specific for deletions. B QQ plot for clumped gene-based test results specific for duplications. C Results from permutation tests of CNV burden in recurrence in OCD cases relative to controls. D Tests of association between neurodevelopmental disorder (NDD) CNV burden and OCD case status. Odds ratio for each additional CNV (deletions in red, duplications in blue) is depicted with a dot while the 95% confidence interval for the estimate is depicted with bars. Unadjusted p-values are also provided. All detectable duplication excess in cases appears to impact neurodevelopmental genes (n = 664, from Fu et al. [30]) while the deletion excess in cases appears to impact both neurodevelopmental genes and unknown genes outside of this geneset.

Burden of neurodevelopmental CNVs

In general, we found that OCD cases carried a higher burden of neurodevelopmental CNVs than controls. Burden of neurodevelopmental CNVs as defined in Kendall et al. [29] increased OCD case risk (OR = 2.49, P = 6.04 × 10−3), as did burden within specific genes implicated with neurodevelopmental disorders from Fu et al. [30] (n = 664, OR = 2.54, P = 1.91 × 10−5). Although the deletion contribution to this result was higher, there was a discernible contribution from duplications as well (Fig. 4D).

Pathway enrichment analyses

We ran a series of pathway enrichment analyses (see Supplemental Methods for details) on genesets derived from 2187 Gene Ontology (GO) terms and 37 tissues from GTEx (13 from the brain) using files generated via Bryois et al. [31]. We included genes with pLI >0.5 (4543 genes) as a positive control given the enrichment described earlier, and a set of tolerant genes (pNull >0.5, 4540 genes) as a negative control. Out of 2226 total tests, only one passed the FDR-corrected significance threshold of 0.1: the positive control, pLI >0.5 (Supplementary Table 9). We attempted to use this signal to add power to our pathway enrichment analyses by restricting each geneset test to genes with pLI >0.5 and controlling for the total number of intolerant genes impacted per person. Across these tests (795 total), 7 survived FDR correction (Supplementary Table 10). Of these, 5 were expression profiles from GTEx brain tissue, an event unlikely by chance given that only 13 brain tissues were tested in total (two-sided Fisher’s exact test P = 9 × 10−9). Significant brain tissues include the substantia nigra, hippocampus and hypothalamus.
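
The brain-tissue enrichment can be framed as a 2×2 table of tests: significant versus not, brain tissue versus other. The sketch below builds that table from the counts in the text (7 significant of 795 tests, 5 of them brain, 13 brain tissues in total) and applies a two-sided Fisher's exact test implemented from the hypergeometric distribution. The table layout is our assumption about how the comparison was set up, so the result need not match the reported P-value to the last digit.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables (with the same margins) that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))


# Assumed layout: rows = significant / not significant, columns = brain / other.
significant_brain, significant_other = 5, 2
nonsignificant_brain = 13 - 5
nonsignificant_other = 795 - 7 - nonsignificant_brain
p_enrichment = fisher_exact_two_sided(
    significant_brain, significant_other,
    nonsignificant_brain, nonsignificant_other,
)
print(f"P = {p_enrichment:.2e}")
```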

Overlap with exome sequencing studies of OCD

We found non-random overlap between genes impacted by case-only single-gene CNVs in our study and prior OCD exome study statistics from Supplementary Table S15 of ref. [6]. We derived a set of genes from our analyses that were impacted by at least one single-gene case CNV and no single-gene control CNV (n = 149 genes). These genes had an elevated count of loss-of-function and damaging missense de novo mutations across 771 trios (observed = 9, expected = 3.94, one-sided Poisson P = 0.02) and an elevated count of loss-of-function variants in 476 cases versus 1761 controls (observed = 26, expected = 17.30, one-sided Poisson P = 0.03). We set up a Transmission and De Novo Association (TADA) analysis using the same methods described previously [6] and the summary statistics from Supplementary Table S15, with NORDiC case/control count statistics added. No genes beyond the already-described CHD8 passed the threshold of Q < 0.3 for being classified as a probable risk gene (Supplementary Table 11), though the gene that comes closest, ZMYM2 (Q = 0.32), has been implicated in neuropsychiatric phenotypes across multiple publications [30, 32, 33].
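
The one-sided Poisson tests above ask how likely it is to see at least the observed count when the expected count is the Poisson mean. A minimal sketch, using the observed and expected values quoted in the text (the function name is our own):

```python
from math import exp


def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    term = exp(-expected)  # P(X = 0)
    cdf = 0.0
    for k in range(observed):
        cdf += term
        term *= expected / (k + 1)  # step from P(X = k) to P(X = k + 1)
    return 1.0 - cdf


p_de_novo = poisson_upper_tail(9, 3.94)        # de novo mutations across trios
p_case_control = poisson_upper_tail(26, 17.30)  # loss-of-function variants in cases
print(f"de novo: P = {p_de_novo:.3f}; case/control: P = {p_case_control:.3f}")
```

Both values round to the P-values reported in the text.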

OCD polygenic risk in deleterious CNV carriers

We hypothesized that individuals carrying deleterious (pLI >0.995, neurodevelopmental as in Kendall et al., or neurodevelopmental as in Fu et al.) CNVs were more likely to have lower neuropsychiatric polygenic risk. This would be consistent with higher-powered studies of other neuropsychiatric conditions [34]. To test this, we utilized polygenic risk scores (PRS) computed from three different GWAS summary statistics: standing height (Pan-UKB, https://pan.ukbb.broadinstitute.org) (N = 360,388, as a negative control), OCD [4] (2688 cases, 7037 controls), and a cross-disorder study of psychiatric conditions [35] (162,151 cases, 276,846 controls). We tested for an association between deleterious CNV burden and normalized PRS, using the same covariates as those in global CNV burden analyses, and performing separate tests for deletions and duplications.

Of the six tests we performed (Supplementary Table 12), we identified one significant (p < 0.05) association, between deleterious CNV deletions and cross-psychiatric condition study PRS. In this comparison, deleterious CNV deletion carriers in our case cohort had lower normalized psychiatric PRS than non-carriers (estimate = −0.45, P = 3.35 × 10−3). While this PRS is not OCD-specific, the summary statistics underlying it do include OCD cases, and given how much larger the sample size is, it likely captures pleiotropic common risk variants that increase risk for multiple psychiatric conditions at once.

Clinical features of carriers of deleterious CNVs

We performed an analysis of clinical features of case carriers of these deleterious CNVs versus non-carrier cases (see Supplementary Table 13 for carrier status per sample). We focused on the Swedish subset of the case cohort (n = 1612) where we had access to detailed clinical information on each participant. Details of the cohort characteristics, treatments and outcome measures are described in detail in the study protocol [12]. Briefly, participants were recruited through a network of specialist OCD clinics that have highly standardized assessment and treatment protocols or via self-referral to a dedicated study website. All diagnoses were confirmed through a structured diagnostic interview. For those receiving treatment, this consisted of specialized cognitive-behavior therapy and/or serotonin reuptake inhibitors delivered by highly experienced teams. The primary outcome measure was the clinician-rated Yale-Brown Obsessive Compulsive Scale (YBOCS; score range 0–40, with higher scores denoting more severe symptoms).

We first explored the association between deleterious deletions and duplications and the presence of key psychiatric comorbidities (ASD, ADHD, TS/chronic tic disorder, schizophrenia, bipolar disorder and eating disorders) through contingency tables and Chi-square statistics (or Fisher exact tests, when relevant). We found that 6 (4.1%) of the 147 individuals with comorbid ASD had neurodevelopmental duplications, compared to 6 (0.4%) of 1465 individuals without comorbid ASD (Chi-square = 24.3, df = 1, P < 0.001). The remaining psychiatric disorders were not significantly associated with neurodevelopmental duplications. No significant associations emerged for neurodevelopmental deletions. We also examined whether neurodevelopmental duplications and deletions were associated with reported age of symptom onset or pretreatment YBOCS scores, but they were not (all P values > 0.05).
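
The ASD comparison corresponds to a 2×2 contingency table that can be rebuilt from the counts in the text. The sketch below computes the Pearson chi-square statistic without continuity correction; whether a correction was applied in the original analysis is not stated, so this choice is an assumption and the output should be read as a cross-check against the reported statistic rather than a reproduction of it.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p-value) on 1 degree of freedom.
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(statistic / 2))  # chi-square survival function, 1 df
    return statistic, p_value


# Rows: comorbid ASD / no ASD. Columns: duplication carrier / non-carrier.
asd_carriers, asd_total = 6, 147
non_asd_carriers, non_asd_total = 6, 1465
chi2, p = chi_square_2x2(
    asd_carriers, asd_total - asd_carriers,
    non_asd_carriers, non_asd_total - non_asd_carriers,
)
print(f"Chi-square = {chi2:.1f}, df = 1, P = {p:.1e}")
```

Because the expected number of carriers among individuals with ASD is close to 1, this is also a table where a Fisher exact test is a reasonable alternative.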

We further explored whether the presence of duplications or deletions was associated with treatment outcomes in a sub-cohort of Swedish individuals with complete treatment data (n = 846). We found that individuals with deletions (but not duplications) in specific neurodevelopmental disorder genes improved on average 16% on the YBOCS, whereas individuals without such deletions improved 47% on the YBOCS, a statistically significant difference (independent samples t-test; t = −3.03, df = 854, 2-sided P = 0.002).
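To make the outcome measure concrete, the sketch below computes percentage improvement on the YBOCS from pre- and post-treatment scores and compares two groups with a pooled-variance independent samples t statistic. All scores are hypothetical and chosen only for illustration. We also assume that improvement is defined as the pre-to-post reduction divided by the pretreatment score, which the text does not state explicitly.

```python
from math import sqrt
from statistics import mean, variance

def percent_improvement(pre, post):
    """Percentage reduction in YBOCS score from pre- to post-treatment."""
    return 100.0 * (pre - post) / pre

def pooled_t(x, y):
    """Student's independent samples t statistic (equal variances) and its df."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))
    return t, df

# Hypothetical (pretreatment, post-treatment) YBOCS scores, not study data.
carriers = [(28, 24), (30, 25), (26, 22)]
noncarriers = [(28, 14), (30, 18), (24, 12), (26, 13)]

carrier_pct = [percent_improvement(pre, post) for pre, post in carriers]
noncarrier_pct = [percent_improvement(pre, post) for pre, post in noncarriers]

t, df = pooled_t(carrier_pct, noncarrier_pct)
print(f"carriers: {mean(carrier_pct):.1f}%, non-carriers: {mean(noncarrier_pct):.1f}%")
print(f"t = {t:.2f}, df = {df}")
```

A negative t statistic here indicates smaller average improvement in carriers than in non-carriers. With very few carriers, as in the comparison above, the estimate of the carrier group variance is imprecise, so the result is sensitive to individual observations.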

Discussion

We have compiled what, to our knowledge, is currently the largest OCD case/control study of rare CNVs, and our results support a contribution of these variants to OCD genetic risk. This contribution came specifically from rare CNVs that overlap protein-coding regions of the genome, as there was no detectable difference in noncoding CNV burden between cases and controls. Large, ultra-rare CNVs appeared to confer the highest relative risk of OCD. Even when controlling for the total number of CNV-impacted bases, OCD cases had a higher number of deleted bases that are under high mammalian evolutionary constraint. Consistent with OCD WES studies, coding region CNVs impacting loss-of-function intolerant protein-coding genes appear to confer more substantial OCD risk than those that do not. There was no single locus in the genome where CNV burden predicted OCD case status at a level that survived multiple-testing correction. The distribution of case CNV calls in the genome was nonetheless non-random, consistent with the existence of distinct CNV risk loci that our current sample size is insufficient to detect.

Our study benefited from the uniquely rich clinical information that is available from the participants in the NORDiC study. In particular, we established a specific association between neurodevelopmental duplications and ASD (not other comorbidities), although no significant associations emerged for neurodevelopmental deletions. The results suggest that whereas neurodevelopmental duplications in OCD can be, at least in part, explained by the presence of comorbid ASD, our findings regarding deletions appear to be independent of key psychiatric comorbidities. We also found a tentative association between deletions and multimodal treatment response in a sub-cohort of individuals who had treatment outcome data, whereby individuals with deletions in neurodevelopmental disorder genes were less likely to respond to treatment. However, these results should be interpreted with caution because this analysis included only 8 cases with deletions in neurodevelopmental disorder genes. Larger samples are needed to confirm this finding.

An attribute of our study that could be interpreted as both a strength and a weakness is the ancestral homogeneity of the cohort. While this led to a cleaner analysis that is unlikely to be influenced by substantial differences in ancestry, it does not address the European bias and resulting inequity present in most genetic studies [36]. We note that the effects of negative selection mean that the results here are likely generalizable across ancestries. OCD cases here are enriched for CNVs overlapping protein-coding genes that are depleted of damaging variation across multiple ancestries [28]. Because of this, these variants are likely subject to negative selection, irrespective of ancestry. Consistent with this, a recent cross-ancestry analysis of schizophrenia cases and unaffected controls [37] saw significant overlap with the rare variant burden in constrained genes already highlighted in the SCHEMA study [33].

The case/control CNV callset here and the previously published gene-based WES summary statistics from Halvorsen et al. 2021 overlap in a nonrandom manner consistent with the presence of multiple OCD risk genes impacted in both datasets [6]. This indicates that calling CNVs from WES data and forming gene-based summary statistics that incorporate SNV, indel and CNV calls is a worthwhile endeavor. Consistent with this, recent large WES analyses have benefited greatly from incorporating CNV call information into gene-based tests, and methods for making CNV calls from WES data and incorporating them into analyses have been optimized [30, 38]. Given a sufficiently large sample size, CNV burden and damaging SNV/indel burden should point to a consistent core set of risk genes in which damaging coding variation substantially increases risk of OCD.