Abstract
The past decade has seen remarkable progress in identifying genes that, when impacted by deleterious coding variation, confer high likelihood for autism spectrum disorder (ASD), intellectual disability and other associated developmental disorders. However, most underlying gene discovery efforts have focused on individuals of European ancestry, limiting insights into genetic liability across diverse populations. To help address this, the Genomics of Autism in Latin American Ancestries (GALA) Consortium was formed, presenting here the largest sequencing study of autism in Latin American individuals (n > 15,000, including 4,717 participants with an ASD diagnosis). We identified 35 genome-wide significant (false discovery rate < 0.05) autism-associated genes, with substantial overlap with findings from European cohorts, and highly constrained genes showing consistent signal across populations. The results provide support for emerging (for example, MARK2, YWHAG, PACS1, RERE, SPEN, GSE1, GLS, TNPO3 and ANKRD17) and established autism genes and for the utility of genetic testing approaches for deleterious variants in individuals from diverse backgrounds; the results also demonstrate the ongoing need for more inclusive genetic research and testing. We conclude that the biology of autism is consistent across populations, with no detectable influence of ancestry.
Main
ASD is characterized by deficits in social communication and the presence of restricted interests and/or repetitive behaviors1. Although the majority of the genetic liability for autism is attributed to common genetic variation, rare variants, often arising de novo, play a substantial role in individual liability2,3. Multiple large-scale studies of rare and common variation associated with autism likelihood are ongoing, and dozens of genes strongly associated with autism have emerged4,5, primarily coding for proteins involved in gene expression regulation, neuronal communication or the cytoskeleton6. These findings have contributed to improved interpretation of genetic tests and represent initial steps in the development of personalized interventions and targeted therapies. Although translation to broad clinical care remains limited, gene-targeted therapeutic strategies for rare genetic disorders associated with autism and other neurodevelopmental disorders (NDDs) have emerged as a very dynamic area of study in both academia and industry7. The overwhelming majority of participants in gene discovery studies are of European (EUR) ancestry, even though individuals of EUR ancestry comprise only 16% of the global population8. This limited window into genetic architecture across ancestries could exacerbate preexisting disparities in diagnostics and service use for autism9. Indeed, recent studies have reported high rates of inconclusive results after genetic testing in non-EUR individuals, likely because of uncertainty in interpreting genomic variants10,11,12.
We established the GALA Consortium to investigate the impact of genetic and environmental factors on autism across Latin Americans, including participants from all of the Americas, corresponding to the Admixed American (AMR) superpopulation in the 1000 Genomes Project13. These AMR individuals comprise the largest recently admixed population in the world and the largest minority in the United States. It is as yet unknown whether the genetic architecture of autism differs across ancestral populations, and the genetic diversity of the AMR group14,15 makes this question especially relevant.
We present, to our knowledge, the largest sequencing study to date of autism in AMR individuals and compare our results to findings from non-AMR cohorts. We show that a common measure of evolutionary impact on gene-level variation—that is, genomic constraint scores—differs by ancestry. However, this is not the case for the most constrained genes, which exhibit less population-level variation than expected based on their sequence composition. This is important because most identified autism-associated genes are evolutionarily constrained4,16, and this applies over diverse populations. Using Bayesian models, we identify 35 genome-wide significant genes associated with autism in Latin American individuals and observe a great degree of overlap with findings in largely EUR cohorts. These results indicate that autism and other NDD genes are shared across ancestries and that existing genetic testing pipelines are effective for the most deleterious variation, especially if information on allele frequency across ancestries is incorporated. We conclude that the biology of autism is consistent across populations and not impacted to any detectable degree by ancestry.
Results
Rare variant landscape in Latin Americans diagnosed with ASD
GALA currently encompasses 10 cohorts across the Americas, with data from eight included in this study (Fig. 1 and Methods, ‘Description of GALA sites’ section). Some GALA samples were contributed to other large-scale whole-exome sequencing (WES) and whole-genome sequencing (WGS) efforts4,17; analyses of 1,613 samples (including 707 ASD probands) are reported here for the first time. The GALA analyses reported here include all sequenced samples from GALA cohorts as well as additional genetically inferred AMR samples from the Autism Sequencing Consortium (ASC)18 and Simons Powering Autism Research (SPARK)19.
a, Map of GALA collection sites across the Americas. b, Pedigree structure of the GALA cohort, comprising 4,717 cases and 10,710 controls. Diamonds represent offspring, and individuals with ASD are shown in pink. Map was generated in R (version 4.3.3) using the ggplot2 and rnaturalearth packages with base map data from Natural Earth (public domain; https://www.naturalearthdata.com/). CHARGE, Childhood Autism Risks from Genetics and Environment; TASC, The Autism Simplex Collection; KP, Kaiser Permanente.
A substantial source of individual autism liability resides in rare deleterious variation in conserved genes4,6, often de novo or very recent. Hence, to maximize power for discovery, we focus on data collected from trios—that is, an affected proband and both unaffected parents and their typically developing sibling(s), when available. When parental DNA samples could not be collected, we incorporated probands using a case−control framework. After extensive quality control (Extended Data Fig. 1), our analysis included 6,977 individuals: 4,717 ASD cases and the remainder consisting of controls and typically developing siblings (Fig. 1b and Supplementary Table 1). In total, 15,427 individuals were sequenced, including parents from trio-based collections who contributed to de novo variant detection but were not themselves analyzed for variant burden. Specifically, 14,359 individuals were sequenced, with WES (n = 14,152) or WGS (n = 207), as part of family-based analysis: 4,450 AMR ASD individuals, 1,459 siblings and 8,450 parents (Supplementary Table 2). For case−control analysis, 267 ASD AMR samples were matched to 801 non-psychiatric AMR controls from the Mount Sinai BioMe biobank20,21.
We identified 6,555 rare (that is, allele frequency < 0.1% in our dataset and in the population-specific non-neuro subsets of gnomAD versions 2.1.1 and 3.1.2 (refs. 22,23)) and unique de novo coding sequence variants (5,062 in ASD probands and 1,493 in siblings) (Supplementary Table 3). We identified 36 de novo variants that occurred twice in individuals with ASD: 18 were found in affected siblings, consistent with germline mosaicism, and 18 occurred in unrelated individuals. Additionally, we observed 211 and 15 rare autosomal de novo small genic copy number variants (CNVs)24 in 2,191 probands and 707 siblings, respectively (Supplementary Tables 4 and 5).
In previous studies, highly constrained genes showed an aggregated signal of variants contributing to autism liability16, and integrating genomic constraint scores has proven powerful for gene discovery4. However, constraint scores are derived from cohorts largely of EUR ancestry. Therefore, we first sought to evaluate the utility of these scores on samples of diverse ancestries.
First, we examined the distribution of de novo variants as a function of a well-established metric of tolerance to loss-of-function variants, the loss-of-function observed/expected upper bound fraction (LOEUF)22, derived from gnomAD version 2.1.1. Genes with low LOEUF scores are depleted for loss-of-function variation compared to expectation as a result of negative natural selection22. Our results demonstrate that rates of de novo variants for both protein truncating (PTV) and deleterious missense (MisB, with a ‘missense badness, PolyPhen-2 and constraint’ (MPC) score25 ≥2) variants are elevated in probands compared to typically developing siblings in genes with low LOEUF scores (Fig. 2). Comparing our findings with previously published results4, we observed that the overall rates of de novo variation in AMR individuals are consistent with those observed in other ancestry groups (Extended Data Fig. 2). Notably, we found a statistically significant enrichment of PTVs in constrained genes among AMR probands. We also observed a trend toward enrichment of missense variants with MPC score ≥ 2 (P = 0.077).
The average number of rare variants per sample, normalized by the synonymous de novo variant rate, is compared between ASD probands (n = 4,450) and unaffected siblings (n = 1,459) of AMR ancestry. a–c, The analysis includes PTVs in highly constrained genes (LOEUF deciles 1–3, 5,363 genes) and less constrained genes (LOEUF deciles 4–10, 12,765 genes) (a); missense variants categorized by predicted functional severity (MPC ≥ 2 for high severity, 1 ≤ MPC < 2 for moderate severity) (b); and missense variants with MPC < 1 (low severity) and synonymous variants (c). Data are presented as mean values ± 95% CIs. Statistical significance was assessed using two-sided z-tests comparing normalized de novo mutation rates between probands and siblings. P values were adjusted for multiple comparisons using the Benjamini–Hochberg FDR method, and exact adjusted P values are shown above the bars.
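As an illustrative sketch of the testing procedure described in this caption (not the authors' code), the Python snippet below compares two normalized rates with a two-sided z-test and then applies the Benjamini–Hochberg adjustment to a set of P values. The function names, the standard-error inputs and any numbers passed to them are hypothetical; how the standard errors of the normalized rates were obtained is not specified in this section.

```python
import math


def two_sided_z_test(rate_a, se_a, rate_b, se_b):
    """Compare two rates given their standard errors (assumed independent).

    Returns the z statistic and the two-sided P value under a normal
    approximation.
    """
    z = (rate_a - rate_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    # For a standard normal, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

In use, one z-test would be run per variant category (for example, PTVs in constrained genes), and the resulting P values would be passed together to `benjamini_hochberg`.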
Second, we examined whether LOEUF is well calibrated across ancestral populations. Effective population size differs across Native American, EUR and African populations26, but current estimates of gene constraint are derived from cohorts that are largely of EUR ancestry. Existing LOEUF scores are modestly over-conservative when applied to AMR samples (Fig. 3, Extended Data Fig. 3 and Karczewski et al.22), but, when focusing on the most constrained (lower) deciles, they correlate well with the observed number of PTVs normalized by sample size and gene length (Fig. 3). Because association signal is concentrated in these lower deciles (Fig. 2), these observations justify the use of existing LOEUF scores for our study and generally for studies focusing on highly constrained genes in other ancestries, including admixed African ancestries (Extended Data Fig. 3).
The sum of observed PTVs is plotted for non-Finnish European (NFE; n = 56,885) and AMR (n = 17,296) populations in gnomAD version 2.1.1, scaled to population size and total coding sequence length for each gnomAD LOEUF decile. LOEUF deciles reflect gene constraint, with lower deciles indicating more constrained genes.
Autism gene discovery in Latin Americans
For gene discovery, we used TADA (transmission and de novo association), an algorithm that integrates de novo, inherited and case−control variants as well as LOEUF scores and small genic CNVs4,6,27,28. Sixteen genes were associated with autism at a false discovery rate (FDR) < 0.01; 35 genes met genome-wide significant association (FDR < 0.05); and 61 genes were associated at FDR < 0.1 (Fig. 4, Table 1 and Supplementary Table 6). To examine the overlap of these findings with those in largely EUR ASD cohorts, we first identified and removed all AMR samples in Fu et al.4, yielding a non-AMR complementary set (FuCOMP) with no overlap with our analyses. Nineteen of the 35 GALA genes with FDR < 0.05 showed significant signal in FuCOMP. We next compared the observed numbers of variants in the GALA cohort with the expected number of variants derived from TADA analysis in FuCOMP. To do this, we compared results for concordant genes, defined as genes that show FDR < 0.05 in GALA and in FuCOMP, and we observed that, overall, the findings are consistent with expectation (Extended Data Table 1). We also compared our gene findings from the GALA cohort with those in a large cohort ascertained for severe developmental disorders29: six of the 16 genes that had an FDR < 0.05 in GALA and an FDR < 0.1 in FuCOMP showed an FDR < 0.05 in the developmental disorder cohort (Table 1).
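To make the FDR thresholds above more concrete, the following Python sketch shows one common way to convert per-gene Bayes factors into Bayesian FDR q-values. It is a simplified illustration and not the TADA implementation used in this study: it does not model mutation rates, variant classes, CNVs or LOEUF, and the prior fraction of associated genes (`prior_pi`), the function names and the gene labels are assumptions introduced here.

```python
def posterior_probability(bayes_factor, prior_pi):
    """Posterior probability of association from a Bayes factor.

    prior_pi is the assumed prior fraction of associated genes.
    """
    weighted = bayes_factor * prior_pi
    return weighted / (weighted + (1.0 - prior_pi))


def bayesian_q_values(bayes_factors, prior_pi):
    """Map each gene to a q-value.

    Genes are ranked by decreasing Bayes factor; the q-value at rank k is
    the mean posterior null probability over the top k genes.
    """
    ranked = sorted(bayes_factors.items(), key=lambda kv: kv[1], reverse=True)
    q_values = {}
    cumulative_null = 0.0
    for rank, (gene, bf) in enumerate(ranked, start=1):
        cumulative_null += 1.0 - posterior_probability(bf, prior_pi)
        q_values[gene] = cumulative_null / rank
    return q_values


# Hypothetical Bayes factors for three placeholder genes.
example = bayesian_q_values(
    {"GENE_A": 99.0, "GENE_B": 9.0, "GENE_C": 1.0}, prior_pi=0.5
)
```

Under this scheme, the set of genes with q-value below 0.05 is expected to contain about 5% false discoveries on average, which is the sense in which a gene list is reported at FDR < 0.05.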
As in previous studies, de novo variation provided a major source of signal for top genes (Extended Data Fig. 4). Similarly, PTVs are a major source of signal, and it was interesting to note that missense variants were also an important source of rare variation association signal (Extended Data Fig. 5). For several of the top genes, the association signal is fully or almost fully derived from missense variants in the GALA cohort, which, for MTOR, YWHAG, GRIN1, PACS1 and CACNA1D, is consistent with previous findings and may suggest a dominant negative or gain-of-function mechanism (Table 1 and Extended Data Table 2). Gene Ontology and Mammalian Phenotype enrichment analyses (Supplementary Tables 7 and 8) highlighted biological processes and phenotypes related to synaptic function, neuronal development and social and repetitive behaviors.
Implications for clinical genetics
With compelling evidence for overlapping autism gene findings in AMR samples, we next asked about the fraction of findings that are identified as pathogenic or likely pathogenic (P/LP) as per American College of Medical Genetics (ACMG) guidelines30. We used VarSome31—minimizing the use of proprietary databases and approaches used by commercial testing laboratories—to evaluate (1) genome-wide de novo variation and (2) inherited variation in X-linked genes associated with autism (Supplementary Table 9). This analysis included all de novo variants observed across the genome, not just those meeting the TADA inclusion criteria. Specifically, we included all protein-truncating, missense and synonymous variants, including those in genes lacking mutation rate or LOEUF estimates. For inherited variants, we focused on rare variants in known X-linked genes associated with autism and/or NDDs. We analyzed all GALA and FuCOMP samples, focusing on genes for which there was a reported association with an autism and/or a broader NDD phenotype.
Among the 20,571 de novo variants in our analysis, 926 (4.5%) were classified by VarSome as P/LP when we focused on genes that included autism among the associated phenotypes (Supplementary Table 10). In the AMR cohort, 195 variants (3.8%, 95% confidence interval (CI): 3.27−4.32%) were identified as P/LP (Supplementary Table 11) compared to 731 out of 15,386 (4.75%, 95% CI: 4.42−5.10%) in non-AMR samples. In terms of participants with findings, 4.31% (95% CI: 3.75−4.96%) of AMR and 5.53% (95% CI: 5.15−5.94%) of non-AMR probands had at least one P/LP variant identified. Comparisons between EUR and non-EUR participants revealed that EUR individuals had a higher rate of de novo P/LP variants. Specifically, EUR participants had 634 (4.83%, 95% CI: 4.47−5.21%) P/LP variants identified compared to 292 (3.92%, 95% CI: 3.50−4.40%) in non-EUR participants. At the participant level, a higher proportion of EUR than non-EUR participants had at least one P/LP variant identified (5.61%, 95% CI: 5.20−6.06% versus 4.54%, 95% CI: 4.05−5.09%).
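For readers who wish to reproduce proportions and confidence intervals of this kind, the Python sketch below computes a Wilson score interval for a binomial proportion. The interval method used for the values reported above is not stated in this section, so the Wilson interval is an assumption made for illustration, and the result may differ slightly from the reported bounds (for example, if a continuity correction was applied). The function name is hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    Returns (lower, upper) as fractions between 0 and 1.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total))
        / denom
    )
    return center - half_width, center + half_width


# P/LP de novo variants in non-AMR samples, as reported in the text.
lower, upper = wilson_interval(731, 15386)
```

Non-overlapping intervals between two groups suggest a difference in yield, although a direct two-sample comparison of proportions is the more appropriate test.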
When broadening our criteria to include other NDD phenotypes, 1,339 de novo variants were deemed to be P/LP (Supplementary Table 12). In AMR, 276 variants were classified as P/LP (5.32%, 95% CI: 4.74−5.98%) versus 1,063 in non-AMR individuals (6.91%, 95% CI: 6.52−7.32%). In terms of participants with de novo findings, 6.07% (95% CI: 5.39−6.82%) of AMR participants and 7.99% (95% CI: 7.53−8.47%) of non-AMR participants had at least one P/LP finding. EUR participants had a notably higher rate of findings (8.22%, 95% CI: 7.72−8.75%) compared to non-EUR participants (6.24%, 95% CI: 5.66−6.87%).
Extending our analysis to include X-linked inherited findings, we observed a further increase in P/LP detection rates. Specifically, 201 de novo or X-linked variants (2.80%, 95% CI: 2.43−3.21%) in AMR samples and 758 variants (3.58%, 95% CI: 3.33−3.84%) in non-AMR samples were classified as P/LP for ASD. When we broadened the terms to include other NDD-related genes, the proportion of P/LP variants rose to 4.10% (95% CI: 3.65−4.58%) in AMR participants and to 5.26% (95% CI: 4.96−5.57%) in non-AMR participants. The rate of participants with at least one P/LP variant increased to 6.47% (95% CI: 5.78−7.24%) in AMR samples and to 8.38% (95% CI: 7.91−8.87%) in non-AMR samples (Supplementary Table 11). EUR participants showed a higher yield of P/LP findings (8.62%, 95% CI: 8.11−9.16%) compared to non-EUR participants (6.63%, 95% CI: 6.04−7.28%).
Qualitatively similar results were obtained when using Neptune32, which uses databases of previously identified variants to call P/LP variants in a set of 73 ACMG-recommended genes with actionable findings33 (Extended Data Fig. 6 and Supplementary Table 11). Although greater numbers of rare variants were identified in individuals from diverse ancestries, the proportion of these that could be classified as P/LP was lower. This combination of higher variant detection but reduced classification rate of P/LP variants contributes to a somewhat lower overall yield of P/LP findings per individual in AMR or non-EUR individuals when compared to non-AMR or EUR ancestries, respectively. Considering the VarSome and Neptune results together, the findings provide support for the translatability of rare genetic findings in autism across ancestries in a clinical setting, albeit with opportunities for improvement.
Discussion
The past decade has seen major advances in deciphering the overall genetic architecture of autism, albeit largely from EUR cohorts. It is not yet known whether the genetic architecture of autism differs across ancestral populations, including in admixed populations. Latin American individuals comprise the largest recently admixed population in the world and the largest minority in the United States. Diverse sites with large AMR representation have joined to form GALA, and here we report a first, large-scale multinational analysis of rare variant liability in Latin Americans with ASD, identifying autism-associated genes in this cohort and comparing genetic architecture with that observed in non-AMR ASD.
As in previous studies, we found that signal for genes strongly associated with autism was concentrated in highly conserved genes and largely driven by very rare de novo variation. For the discovery of autism-associated genes impacted by very rare de novo or case−control variation, it is critical to have reliable estimates of expected genic mutation rates, which can be derived from both cross-species comparisons and empirical data from massive, aggregated sequencing resources, such as gnomAD. Although representation of diverse populations is improving, much of the existing sequence data are skewed toward EUR samples. Thus, there is much more to be done regarding genetic variability within underrepresented populations. Our analyses confirm that metrics of gene-level constraint are overly conservative, due to the overreliance on EUR samples that have a lower effective population size. However, we also demonstrate that the key metric LOEUF, when applied to the most conserved genes, is well calibrated across diverse ancestral populations.
Because deleterious variation in highly conserved genes is subject to strong purifying selection, such variation is both very rare and frequently de novo. Allele frequency filtering based on gnomAD or similar datasets is, hence, an important means to infer very rare variation. However, we observe that relying on overall allele frequency allows for the introduction of more common variation into the analyses, hence reducing power and increasing the false-positive rate. We began our analyses using established best practices for filtering by global allele frequency in the analysis of potentially de novo variants4,6. However, we noticed that some variants initially classified as rare in gnomAD (allele frequency < 0.1%) turned out to be more common in particular populations. To address this heterogeneity, we recommend annotating variants with allele frequencies across all subpopulations in the non-neuro releases of gnomAD, as we have done here. Building upon this strategy, we extended the same annotation to our analysis of inherited variation, adopting a more stringent allele frequency threshold of <0.01%, to ensure even more precision in our findings34,35.
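The filtering strategy described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the function names, the population labels and the example frequencies are hypothetical, and a population with no entry is simply treated as having no observation for that variant.

```python
def max_allele_frequency(dataset_af, population_afs):
    """Highest allele frequency across the study dataset and all populations.

    population_afs maps a population label (for example, "amr") to the
    allele frequency observed in that reference subset.
    """
    return max([dataset_af, *population_afs.values()])


def is_rare(dataset_af, population_afs, threshold=0.001):
    """True if the variant is below the threshold in every population.

    threshold=0.001 corresponds to 0.1%; a stricter value such as 0.0001
    (0.01%) could be passed for inherited variation.
    """
    return max_allele_frequency(dataset_af, population_afs) < threshold


# Hypothetical variant: rare in most populations but more common in one.
population_afs = {"afr": 0.0001, "amr": 0.004, "nfe": 0.0002}
keep = is_rare(dataset_af=0.0003, population_afs=population_afs)
```

Filtering on the maximum across populations, rather than on a single global frequency, is what prevents a variant that is common in one ancestry group from being treated as rare.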
We next used TADA to identify 35 genes associated with autism at an FDR threshold <0.05 in the GALA dataset, 16 with FDR < 0.01 and eight with FDR < 0.001 (Fig. 4 and Table 1). Consistent with previous studies in largely EUR cohorts, gene expression regulation, neuronal communication and cytoskeletal genes are well represented among the autism-associated genes identified in GALA (Table 1 and Supplementary Tables 7 and 8). FDR is well calibrated in TADA6, and genes identified with TADA in smaller cohorts are consistently replicated at expected levels in larger samples. However, it is still important to evaluate the level of confidence in the genes identified. First, as noted above, we compared results for top genes across GALA and a recent large-scale study (FDR < 0.05 in both AMR samples and non-AMR/FuCOMP studies), and we observed that findings are consistent with expectation (Extended Data Table 1). (Note that, although individual gene-level counts may differ, this variation is expected given the rarity of events; by contrast, when we aggregated data from the top genes, the number of observed variants across genes and variant classes in GALA closely matches the expected total derived from FuCOMP.) However, there are multiple genes with evidence in GALA but not in FuCOMP. This can be for one of several reasons, including (1) sparseness of de novo events and, hence, overrepresentation/underrepresentation of de novo events in subsamples; (2) differences in ascertainment; and (3) the possibility that some findings are false-positive findings. Although all three could make some contribution, (1) was extensively evaluated previously4,6, and the analyses suggested that it is likely to be the major contributor to discordance. To further evaluate whether discordant genes may still represent true positives, we first compared GALA findings to results from FuCOMP, a non-AMR cohort.
Although many top (FDR < 0.05) GALA genes were also supported in FuCOMP, a subset of 17 genes showed an FDR > 0.05 in FuCOMP, suggesting weaker support. We, therefore, examined their support in a large cohort of individuals with severe developmental disorders29, and seven of these 17 genes show clear support. Finally, among the 35 autism-associated genes with an FDR < 0.05, most have a dominant neurodevelopmental morbid association in OMIM, ClinGen and/or Gene2Phenotype (Table 1). The concordance of findings between genome-wide studies (GALA, FuCOMP and developmental disorders) and curated clinical databases indicates that our approach is valid for autism gene discovery in AMR samples and that the FDRs are likely well calibrated.
We next examined emerging and known genes found in the GALA analyses, including contrasting results with those seen in non-AMR samples (FuCOMP) and curated databases (Table 1, Fig. 5, Extended Data Table 1 and Extended Data Fig. 7). These genes provide further support for MTOR signaling (for example, MARK2, MTOR, TSC2, YWHAG and GLS), synaptic and cytoskeletal function (for example, DYNC1H1, PAK2, DLG4, GRIN1 and SYNGAP1) and transcriptional regulation (for example, SPEN, RERE and GSE1) in autism. Notably, these pathways are also strongly implicated across intellectual disability and NDDs, underscoring the tremendous overlap in genetic discovery that transcends traditional diagnostic boundaries. Many of these genes are constrained for PTVs and/or missense variants and show support from independent datasets, including de novo events in severe developmental disorder cohorts. A description of top and interesting genes is found in Extended Data Table 2.
Variants observed in GALA analyses of AMR individuals are marked with pink circles, those found in FuCOMP individuals are marked with green and variants found in DECIPHER are marked with purple. Note that there were two instances of V123Wfs*2 variants in GSE1 in GALA, two instances of the pathogenic variant R203W in PACS1 in GALA and 18 in DECIPHER. Figures were generated using the Lollipop software package52.
Altogether, the results are consistent with the assumption that the same set of highly constrained genes identified in ongoing genome-wide studies is associated with autism, regardless of ancestry. This perspective also receives support from common variant studies in complex traits, where causal effects appear to be highly similar across ancestries36,37: Hou et al.36 analyzed 53,001 African-European admixed individuals and observed that causal effects of common variants (allele frequency > 0.5%) for 38 complex traits are largely similar across local ancestries, in agreement with other studies, including a recent analysis showing that cis-genetic effects on gene expression are highly similar between EUR and African individuals37.
We considered whether the observed similarity in deleterious variant burden between AMR-assigned and EUR-assigned individuals could reflect the influence of EUR admixture within AMR genomes. In principle, local ancestry inference (LAI) would allow mapping of individual variants to ancestral tracts, enabling a more granular test of whether such variants preferentially arise on EUR versus non-EUR backgrounds. However, current LAI methods require dense haplotypic data across the genome, typically from WGS. The sparse and uneven coverage of exome data poses considerable challenges for LAI, and performance has been shown to decline substantially in this context38,39. Moreover, because much of our gene discovery relies on de novo rather than inherited variants, the signal is unlikely to be biased by local ancestry tracts, and we also confirmed that variant and gene discovery is clearly driven by the large proportion of individuals with modest overall EUR ancestry (Extended Data Fig. 8). Still, we acknowledge that this is a potential limitation of the study and a valuable direction for future work in cohorts with whole-genome data where LAI can be reliably determined.
Using clinical genetics software platforms, we confirm the overall translatability of clinical genetic approaches when focusing on rare deleterious variation; however, we also reveal differences in the rate of P/LP variants between AMR and non-AMR individuals and between EUR and non-EUR individuals. The causes driving differences in rates of P/LP need to be better understood, as this is a limitation that complicates the interpretation of our analyses. A recent study focusing on pediatric patients with serious neurologic, cardiac or immunologic conditions reported similar diagnostic yield for genome sequencing in European Americans and Latin Americans (19.8% versus 17.2%); however, yields were lower (11.5%) and inconclusive results were higher in African Americans11. In that study, genome sequencing was carried out by commercial diagnostic laboratories, making use of a proprietary pipeline that incorporates variant databases; the degree to which proprietary algorithms and reliance on previously observed variation influenced the higher rate of inconclusive results cannot be determined.
Analysis of pathogenic variation in the All of Us Research Program, which integrates data from a diverse cohort to identify genetic differences across ancestries, further highlights the disparities in variant classification across populations. The study examined P/LP variants in a modest number of genes with actionable findings, showing differences as a function of ancestry, with 42% fewer pathogenic variants identified in Latin American versus EUR individuals (1.32% versus 2.26%)10. All of Us analyses used Neptune, a system developed for clinical genetic reporting32. Neptune relies heavily on variants identified in prior curated data, which will bias the findings in diverse populations. Consistent with this, analyses of the GALA cohort using Neptune show lower rates of findings compared to non-AMR samples. Our results suggest that with a focus on deleterious de novo variation, use of prior results is less necessary, and others have shown that even highly curated variant databases include false-positive findings that can lead to incorrect information being given to subsequent families40,41,42,43. Where possible, we recommend minimizing reliance on previously reported pathogenic variants. In addition, to further improve genetic testing results across diverse populations, our results show that it is of key importance to use allele frequency from all relevant populations, as we have done here.
We should, however, recognize the limitations inherent in our study and in any study that focuses on ancestries beyond EUR and a few other commonly characterized populations. For instance, we focused on de novo variants and their interpretation in AMR populations. Variants called de novo in our sample, and within subjects, are likely a mixture of true and false positives. For populations not deeply characterized for genetic variation, it is reasonable to expect elevation in the false-positive rate, simply because we do not know the frequencies of variants therein and which variants are relatively more common. For this reason, more of the variation called de novo is likely to be inherited variation.
At the same time, it is possible that unknown genomic complexity, such as common structural variants44,45,46, elevates false negatives within these populations, including genomic variation important for phenotypes like autism, which is another limitation of our study. The combination of these three quantities (true positives, false positives and false negatives) determines the total variation that we observe. Based on our results, which show similar patterns to those observed in EUR studies, we can conclude that the vast majority of our results arise from true positives. Nonetheless, we should not conclude that populations are all the same when it comes to calling de novo variation. Indeed, we can be confident that they are not, given what we know about increased genetic diversity in African populations47,48,49,50 and the impact that cryptic structural variation and singleton events have on the reliability of calling ultra-rare variation. Only through deeper genetic studies can we expect completely comparable results to those of EUR population samples, ameliorating the above issues.
In conclusion, our observations are consistent with the neurobiology of autism being shared across ancestries and provide support for the translatability of autism clinical genetic approaches across ancestries.
Methods
Cohort description
GALA comprises multiple sites from North, Central and South America recruiting AMR participants for studies on the genetic architecture of autism. Study procedures were approved by the institutional review board (IRB) of the Program for the Protection of Human Subjects at Mount Sinai (no. 16-01262). Informed consent was obtained from the parents or legal guardians of all study participants.
Study procedures for participant enrollment were approved by the Program for the Protection of Human Subjects at Mount Sinai (no. 16-01262 for the Seaver Center at Mount Sinai, São Paulo, Brazil, and Bogotá, Colombia; and no. 21-00039 for Peru), the University of California, Davis IRB (no. 226028-22) and the University of Miami IRB (no. 20070193). Two cohorts were collected previously: study procedures for participant enrollment in Costa Rica were approved under the guidelines of the Ministry of Health of Costa Rica, the Ethical Committee of the National Children’s Hospital in San Jose and the IRB at Mount Sinai, as described previously53,54; and The Autism Simplex Collection (TASC), which included an estimated 12% of individuals of Latin American ancestry, was recruited across 13 sites in North America and Europe, as described previously55, with local IRB oversight and all consents reviewed before depositing biospecimens and data to the National Institutes of Health repository.
For clarity, we use ‘ASD’ to refer to individuals who received a clinical diagnosis according to the procedure outlined below and ‘autism’ elsewhere. ASD diagnoses are based on expert clinical evaluations using Diagnostic and Statistical Manual of Mental Disorders, 5th Edition (DSM-5) criteria, incorporating all available data, including standardized assessments. Participants can be any age. Individuals with a known genetic condition (for example, fragile X syndrome) are excluded from analyses. Once a diagnosis of ASD is confirmed, the individual and their parents contribute a sample (blood or saliva) for genetic analyses. If both parents are not available, collection of other biological family members is encouraged (siblings, grandparents, etc.). Participating sites generally also collect additional clinical and family history information.
Description of GALA sites
New York, USA
The Seaver Autism Center for Research and Treatment at the Icahn School of Medicine at Mount Sinai, located in New York City, is the main coordinating site within the GALA Consortium. AMR individuals make up almost 30% of the population of New York City. Affected individuals undergo a full diagnostic ASD workup and receive additional assessments, including a cognitive test, adaptive behavior measure, medical checklist and behavioral checklists. Participating families receive $100 USD in compensation.
São Paulo, Brazil
The Human Genome and Stem Cell Research Center (HUG-CELL) at the Universidade de São Paulo in Brazil has over 20 years of experience in clinical and molecular research in autism, with more than 2,000 families seen. Brazil has a multiethnic admixed population, including African and Amerindian ancestry56. The HUG-CELL conducts research in human and medical genetics of rare diseases, providing genetic counseling services and genetic tests for the population. A team of psychiatrists, psychologists and neurologists completes a formal ASD diagnostic workup prior to obtaining samples for genetic testing from the individual and their family members. Financial compensation for participation is not permitted at this site; however, individuals who meet clinical criteria are offered free fragile X testing.
Bogotá, Colombia
The Centro de Investigaciones Genéticas en Enfermedades Humanas (CIGEn) at the Universidad de los Andes in Bogotá, Colombia, in close collaboration with the Instituto Colombiano del Sistema Nervioso, Clínica Montserrat, focuses on unraveling the prevalence and characteristics of autism within the Colombian population. Through ASD referrals, the impact of CIGEn extends beyond Bogotá, reaching out to other cities throughout Colombia (Medellín, Cali, Armenia, Pereira, Bucaramanga, Cartagena, Barranquilla and Santa Marta), with the aim of including families from diverse backgrounds. Financial compensation is not offered for participation.
Mexico City, Mexico
The Children’s Psychiatric Hospital ‘Juan N. Navarro’ (HPIJNN), which is part of the Psychiatric Care Services of the Mexican Government’s Ministry of Health, provides professional care for minors with mental health, psychiatric and behavioral problems. As the largest teaching center in child and adolescent psychiatry in Mexico, it performs diverse biomedical and clinical research activities. One of the main lines of research focuses on autism, in collaboration with the Genetics Department at the National Institute of Psychiatry Ramón de la Fuente Muñiz (INPRFM). The samples from Mexico are being sequenced and were not included in the current analyses. Financial compensation is not provided at this site, in accordance with ethics committee requirements.
Lima, Peru
The Centro Ann Sullivan del Perú is a non-profit center in Lima, Peru, that serves individuals with varying abilities and their families. The center specializes in helping individuals with ASD. GALA investigators from the Seaver Autism Center (M.P.T. and A.K.) traveled to Lima to perform 40 psychiatric evaluations, aid in ASD diagnostics and collect blood samples from individuals with ASD and their families. Behavioral surveys were carried out for all participants, and ASD and attention-deficit/hyperactivity disorder diagnoses were made using DSM-5 criteria. Financial compensation was not offered; instead, participating individuals received their clinical evaluation results.
California, USA (CHARGE)
The Childhood Autism Risks from Genetics and the Environment (CHARGE) cohort is a population-based case−control study collected in California at the University of California, Davis, Center for Children’s Environmental Health laboratories with the intent of addressing the impact of environmental exposures on risk57.
Florida, USA
The John P. Hussman Institute for Human Genomics at the University of Miami, located in Miami, Florida, recruits families through clinical referrals and lay organizations, providing services to families with ASD. Upwards of 70% of the Miami population identifies as AMR. The diagnostic workup included the Autism Diagnostic Interview-Revised (ADI-R) and assessment of adaptive behavior. Discrepancies between ADI-R and clinical findings were resolved using additional clinical measures, including the Autism Diagnostic Observation Schedule (ADOS).
Central Valley, Costa Rica
The founder population of the Central Valley of Costa Rica (CVCR) originated at the end of the 16th century from the intermarriage of 86 Spanish families and Indigenous Americans. The population was geographically isolated until the late 19th century, and the current inhabitants are estimated to descend from fewer than 1,000 founders58. A genetic study on autism in the CVCR was initiated in 2003, and affected individuals were ascertained using the translated Spanish versions of the ADI-R and the ADOS as well as assessment of intellectual abilities and adaptive behavior53.
USA and Europe (TASC)
TASC was a collaboration among 13 sites in North America and Western Europe funded by the National Alliance for Autism Research, now Autism Speaks, and the National Institute of Mental Health. As detailed previously55, more than 1,700 individuals with ASD confirmed with extensive prospective assessment, as well as additional family members including parents, completed this study. Individuals within this study were sequenced, and those who were of AMR ancestry were included in these analyses.
California, USA (Kaiser Permanente)
The Autism Research Program (ARP) at the Kaiser Permanente Northern California (KPNC) Division of Research was established in 2002 by Senior Research Scientist Lisa Croen. The program focuses on research identifying genetic and environmental factors associated with autism and understanding patterns of detection, diagnosis and utilization of health services for individuals with ASD across the lifespan. The ARP created the Autism Family Biobank, a repository including genetic, medical and environmental information from more than 1,000 individuals with ASD and their two biological parents, who donated blood or saliva between 2015 and 2017. This collection is representative of the diverse population served by KPNC, an integrated healthcare system. The samples from Kaiser Permanente are being sequenced and were not included in the current analyses. Participants receive $15 USD per biospecimen, and families receive an additional $15 USD upon completion of the parent surveys.
Ancestry determination and sample-level quality control
Latin American samples analyzed in the current freeze include (1) GALA participants (some published in Fu et al.4); (2) non-overlapping AMR samples in the ASC and SPARK19 reported in Fu et al.4; and (3) additional AMR samples from the new release of SPARK (iWESv2). The current freeze includes trio data from 14,359 AMR samples, including 4,450 affected individuals (609 from GALA and 3,841 from ASC and the SPARK releases) and 1,459 typically developing siblings and case−control data from 267 cases and 801 controls.
To assign ancestry to each individual, we followed an approach modeled after the pipeline used by gnomAD (https://gnomad.broadinstitute.org/news/2021-09-using-the-gnomad-ancestry-principal-components-analysis-loadings-and-random-forest-classifier-on-your-dataset/). Specifically, each of three jointly called datasets, derived from unpublished GALA sequencing, Fu et al. and SPARK (iWESv2), was merged with the Human Genome Diversity Project (HGDP) + 1000 Genomes Project (1KG) subset of gnomAD22, and principal component analysis (PCA) was performed in the joint dataset after it had been restricted to 5,000 ancestry-informative single-nucleotide polymorphisms59. A random forest classifier was trained on the HGDP + 1KG reference samples using the first 10 principal components and used to assign superpopulation/continental ancestry to individuals in our dataset. AMR ancestry classification was based on the predicted ancestry label assigned by the random forest model. Non-AMR cases included any individuals with ASD in ASC or SPARK releases who did not meet our criteria for genetically inferred AMR ancestry (28,818 parents, 13,030 probands and 4,749 typically developing siblings).
Hail 0.2 was used to process the SPARK (iWESv2) and unpublished GALA joint-genotyped variant call files (VCFs). Multiallelic sites were split; variants were annotated using the Variant Effect Predictor (VEP)60; and low-complexity regions (https://github.com/lh3/varcmp/blob/master/scripts/LCR-hs38.bed.gz) were removed. Hail’s pc_relate() function was used to confirm reported pedigrees and identify duplicate samples within and between datasets, which were removed. Sex was imputed using the impute_sex() function, and genotype filters were applied as described in previous methodology6 to generate working datasets (Extended Data Fig. 1).
De novo variants
Previously published de novo calls were extracted from Supplementary Table 20 from Fu et al.4. For the unpublished GALA and SPARK (iWESv2) datasets, de novo variants were called using the my_de_novo_v16() function (https://discuss.hail.is/t/de-novo-calls-on-hemizygous-x-variants/2357/19) with variant frequencies from the non-neuro subset of gnomAD exomes version 2.1.1 as priors. Potential de novo variants were dropped if they were present at a frequency greater than 0.1% within the non-neuro subset of gnomAD version 2.1.1, gnomAD version 3.1.2, in any subpopulation of these gnomAD datasets or the dataset in which they were called. Variants were further excluded if they had ‘ExcessHet’ in the Filters field, exhibited a proband allele balance < 0.3 or demonstrated a depth ratio < 0.3. Only ‘HIGH’-confidence or ‘MEDIUM’-confidence variants were kept, with the MEDIUM-confidence calls limited to a maximum allele count in the dataset of 1. A single variant per person per gene was chosen, giving preference to variants with more damaging consequences. Samples were finally excluded if the count of coding de novo variants was significantly greater than expected.
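The filtering cascade described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the record fields, the severity ranking and the example gene names are assumptions introduced for illustration, and the upstream calling step (performed in Hail) and the final sample-level exclusion are not shown.

```python
from dataclasses import dataclass

# Hypothetical severity ranking (lower = more damaging); the text only states
# that more damaging consequences are preferred.
SEVERITY = {"PTV": 0, "MisB": 1, "MisA": 2, "missense_other": 3, "synonymous": 4}


@dataclass(frozen=True)
class DeNovoCall:
    sample: str
    gene: str
    consequence: str       # a key of SEVERITY
    max_af: float          # highest frequency across gnomAD subsets and the calling dataset
    filters: frozenset     # values of the VCF FILTER field
    allele_balance: float  # proband alternate-allele read fraction
    depth_ratio: float     # proband depth relative to parental depth
    confidence: str        # "HIGH", "MEDIUM" or "LOW"
    dataset_ac: int        # allele count within the calling dataset


def passes_filters(call: DeNovoCall) -> bool:
    """Apply the variant-level thresholds quoted in the text."""
    if call.max_af > 0.001:  # present at a frequency greater than 0.1%
        return False
    if "ExcessHet" in call.filters:
        return False
    if call.allele_balance < 0.3 or call.depth_ratio < 0.3:
        return False
    if call.confidence == "HIGH":
        return True
    if call.confidence == "MEDIUM":
        return call.dataset_ac <= 1  # MEDIUM calls limited to allele count 1
    return False


def select_de_novo(calls):
    """Keep one passing call per person per gene, preferring the most damaging."""
    best = {}
    for call in calls:
        if not passes_filters(call):
            continue
        key = (call.sample, call.gene)
        if key not in best or SEVERITY[call.consequence] < SEVERITY[best[key].consequence]:
            best[key] = call
    return list(best.values())


example_calls = [
    DeNovoCall("p1", "GENE_A", "MisA", 0.0, frozenset(), 0.45, 0.9, "HIGH", 1),
    DeNovoCall("p1", "GENE_A", "PTV", 0.0, frozenset(), 0.50, 1.0, "HIGH", 1),
    DeNovoCall("p2", "GENE_B", "PTV", 0.002, frozenset(), 0.50, 1.0, "HIGH", 1),   # too common
    DeNovoCall("p3", "GENE_C", "MisB", 0.0, frozenset(), 0.40, 0.8, "MEDIUM", 2),  # MEDIUM, AC > 1
]
kept = select_de_novo(example_calls)
```

In this toy input, only the PTV in GENE_A for participant p1 survives: the MisA call in the same gene is superseded by the more damaging PTV, and the other two calls fail the frequency and confidence rules, respectively.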
Inherited variants
Starting with the same working datasets as for de novo calling, counts of transmitted and non-transmitted alleles were generated using Hail’s transmission_disequilibrium_test() function. Variants were filtered out if they were marked ‘ExcessHet’ by GATK4 or had allele frequencies greater than 0.01% within their own dataset, within the non-neuro subset of gnomAD version 2.1.1, gnomAD version 3.1.2 or within any subpopulation of these gnomAD datasets. Variants with an allele count > 6 in the total parents of the dataset were excluded as well. Hard filtering was applied according to GATK recommendations (https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants). Final counts of transmitted and non-transmitted alleles were produced for PTV, MisB, MisA (1 ≤ MPC < 2) and synonymous variants.
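As a simplified illustration of the transmitted versus non-transmitted tally, the Python sketch below counts alleles only for trios in which exactly one parent is heterozygous and the other is homozygous reference, which is the typical configuration for ultra-rare alleles. This is an assumption made for brevity; Hail’s transmission_disequilibrium_test() handles further genotype configurations that are not reproduced here. The rarity check mirrors the thresholds stated above, and the function names are hypothetical.

```python
def passes_rarity(max_af: float, parent_ac: int) -> bool:
    """Keep variants with allele frequency <= 0.01% and parental allele count <= 6."""
    return max_af <= 0.0001 and parent_ac <= 6


def count_transmission(trios):
    """Tally transmitted (t) and non-transmitted (u) alternate alleles.

    trios: iterable of (father, mother, child) genotypes, each coded as the
    number of alternate alleles (0, 1 or 2) at a single autosomal site.
    """
    t = u = 0
    for father, mother, child in trios:
        # Informative configuration in this sketch: one het parent, one hom-ref parent.
        if sorted((father, mother)) != [0, 1]:
            continue
        if child == 1:
            t += 1
        elif child == 0:
            u += 1
        # child == 2 is Mendelian-inconsistent here and is ignored
    return t, u


example_trios = [(0, 1, 1), (1, 0, 0), (1, 1, 1), (0, 0, 1)]
transmitted, untransmitted = count_transmission(example_trios)
```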
Case−control variants
Probands within incomplete trios were identified from the ASC and GALA cohorts and matched using the top 10 principal components (‘Ancestry determination and sample-level quality control’) with non-psychiatric, unrelated controls from BioMe at a ratio of three controls to one case (3:1). Incomplete trios from SPARK (iWESv2) were removed. To ensure genome build standardization between these two cohorts, CRAM files from ASD cases were unmapped using GATK4 (ref. 61) and then remapped to a different version of the hg38 reference genome (https://biobank.ndph.ox.ac.uk/ukb/refer.cgi?id=838) using GATK3.5. Single-nucleotide variants (SNVs) and insertions/deletions (indels) were joint-genotyped across cases using the HaplotypeCaller of GATK4. As for the trio dataset processing, Hail 0.2 (https://hail.is) was used to process the joint-genotyped VCF file. The identity_by_descent() function of Hail was used to test for relatedness, which resulted in the removal of 13 cases. Sex was imputed for every sample using the impute_sex() function of Hail and cross-checked with metadata provided by all sites to ensure sample concordance.
As was done for the previous datasets4, multiallelic sites were split, variants were annotated using the VEP and low-complexity regions were removed. Variants were removed if they had an allele count ≥ 2 in the entire case−control dataset as well as an allele count ≥ 5 in the non-psychiatric subset of gnomAD version 2.1.1. Genotype calls were filtered to genotype quality > 25 and allele balance > 0.3. For case−control coverage harmonization, only variants with high coverage, defined as a call rate ≥ 90%, were kept. Before case−control matching, we excluded one case that was an outlier in the distribution of the number of synonymous variants. Finally, 267 cases were matched to 801 controls by sex and the first 10 principal components using the match_on function of the R package optmatch62.
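To make the 3:1 matching step concrete, the following Python sketch pairs each case with its three nearest same-sex controls in principal component space. It deliberately substitutes a greedy nearest-neighbor rule for the optimal matching performed by optmatch in R, so it illustrates the idea rather than reproducing the study’s procedure; the identifiers and the two-dimensional coordinates are hypothetical.

```python
import math


def greedy_match(cases, controls, ratio=3):
    """Greedy nearest-neighbor matching within sex on principal components.

    cases, controls: dict mapping sample id -> (sex, tuple of PC coordinates).
    Returns a dict mapping each matched case id -> list of control ids.
    Controls are used at most once.
    """
    available = dict(controls)
    matches = {}
    for case_id, (sex, pcs) in cases.items():
        # Rank the remaining same-sex controls by Euclidean distance in PC space.
        candidates = sorted(
            (math.dist(pcs, c_pcs), c_id)
            for c_id, (c_sex, c_pcs) in available.items()
            if c_sex == sex
        )
        chosen = [c_id for _, c_id in candidates[:ratio]]
        if len(chosen) < ratio:
            continue  # too few same-sex controls left; the case stays unmatched
        for c_id in chosen:
            del available[c_id]
        matches[case_id] = chosen
    return matches


example_cases = {"case1": ("F", (0.0, 0.0))}
example_controls = {
    "c1": ("F", (0.1, 0.0)),
    "c2": ("F", (0.2, 0.0)),
    "c3": ("F", (5.0, 5.0)),
    "c4": ("M", (0.0, 0.0)),
    "c5": ("F", (0.3, 0.0)),
}
example_matches = greedy_match(example_cases, example_controls)
```

A greedy rule depends on the order in which cases are processed, which is one reason an optimal matching algorithm is generally preferable in practice.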
CNV analysis
De novo CNVs called in Fu et al.4 coming from AMR samples were extracted (1,861 probands and 680 unaffected siblings). Trio and case−control datasets were analyzed separately, and GATK-gCNV24 was used to detect CNVs. First, raw CRAM files were compressed into read counts that covered the annotated exons to serve as input data. Then, a PCA-based approach that combines density and distance-based clustering was employed on the observed read counts to organize batches of samples for parallel processing. GATK-gCNV was run in cohort mode for 200 samples within the cluster identified through PCA, and the remaining samples were subjected to GATK-gCNV analysis using the case mode, with models specific to the cohort (368 probands and 29 typically developing siblings). For quality control, CNV calls were processed according to Fu et al.4 methodology; CNVs were retained if they had an allele frequency < 1% and spanned more than two captured exons. For homozygous deletions, the quality score threshold was set to the lesser of 400 or 10 times the number of intervals. For heterozygous deletions, the quality score threshold was set to the lesser of 100 or 10 times the number of intervals. For duplications, the quality score threshold was set to the lesser of 50 or four times the number of intervals. For sample-level quality control, samples were retained if the number of raw, autosomal CNV calls detected by GATK-gCNV did not exceed 200 and if the number of calls with quality score ≥ 20 did not exceed 35. After quality control, 291 probands, 25 typically developing siblings, 209 cases and 735 controls remained.
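The call-level and sample-level CNV thresholds above can be written out as a short Python sketch. The function and argument names are hypothetical, and the assumption that a call passes when its quality score is at or above the threshold (rather than strictly above it) is ours, as the text does not specify the boundary.

```python
def qs_threshold(cnv_type: str, n_intervals: int) -> int:
    """Quality score threshold as the lesser of a cap and a per-interval multiple."""
    if cnv_type == "hom_del":
        return min(400, 10 * n_intervals)
    if cnv_type == "het_del":
        return min(100, 10 * n_intervals)
    if cnv_type == "dup":
        return min(50, 4 * n_intervals)
    raise ValueError(f"unknown CNV type: {cnv_type}")


def call_passes(cnv_type, n_intervals, quality_score, allele_freq, n_exons):
    """Call-level QC: quality score, allele frequency < 1%, more than two exons."""
    return (
        quality_score >= qs_threshold(cnv_type, n_intervals)  # boundary is an assumption
        and allele_freq < 0.01
        and n_exons > 2
    )


def sample_passes(n_raw_autosomal_calls: int, n_calls_qs20: int) -> bool:
    """Sample-level QC: at most 200 raw autosomal calls and at most 35 calls with QS >= 20."""
    return n_raw_autosomal_calls <= 200 and n_calls_qs20 <= 35
```

For example, a heterozygous deletion spanning five intervals needs a quality score of 50 under this rule, whereas one spanning 20 intervals is capped at 100.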
A gene was considered impacted by a deletion if at least 10% of its non-redundant exons were overlapped by the deletion. For a duplication, a gene was considered impacted if at least 75% of its non-redundant exons were overlapped. Additionally, CNVs were annotated against a list of 79 curated genomic disorder loci (see Supplementary Table 10 in Fu et al.4), and a CNV call was classified as a genomic disorder CNV if it shared at least 50% reciprocal overlap with an annotated genomic disorder.
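The two overlap rules in this paragraph can be sketched in Python as follows. The sketch assumes half-open coordinates on a single chromosome and treats an exon as overlapped if any base is covered by the CNV; both are simplifying assumptions, and the function names are hypothetical.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of overlapping bases between two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def gene_impacted(cnv_type, cnv, exons):
    """Deletions need >= 10% of exons overlapped; duplications need >= 75%.

    cnv: (start, end); exons: list of (start, end) for non-redundant exons.
    """
    hit = sum(1 for start, end in exons if overlap_bp(cnv[0], cnv[1], start, end) > 0)
    threshold = 0.10 if cnv_type == "DEL" else 0.75
    return hit / len(exons) >= threshold


def reciprocal_overlap(a, b, min_frac=0.5):
    """True if the overlap covers at least min_frac of both intervals."""
    shared = overlap_bp(a[0], a[1], b[0], b[1])
    return shared >= min_frac * (a[1] - a[0]) and shared >= min_frac * (b[1] - b[0])


# Ten toy exons of 100 bp each, spaced 1 kb apart
toy_exons = [(i * 1000, i * 1000 + 100) for i in range(10)]
toy_cnv = (0, 150)  # covers the first exon only
```

With these toy coordinates, the CNV overlaps one of ten exons, so the gene counts as impacted if the CNV is a deletion but not if it is a duplication.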
Genetic association analyses
TADA4,27 was performed for three types of inheritance classes: de novo (PTV, MisB, MisA, deletion (DEL) and duplication (DUP)), inherited (PTV, MisB and MisA) and case−control (PTV, MisB, MisA, DEL and DUP) variation. CNVs resulting from non-allelic homologous recombination (NAHR) were excluded, and only CNVs impacting fewer than nine constrained genes were retained (LOEUF < 0.6) (Supplementary Tables 13–19).
Bayes factors were constructed separately for each variant class (PTV, MisA, MisB, DEL and DUP) as described, accounting for sample size and directly using relative risk priors from Fu et al. (see Supplementary Table 8 in Fu et al.4). Previously published mutation rates were adjusted to align with the observed variant counts in unaffected siblings for each variant type in the dataset4.
Expected versus observed mutations in GALA
As noted in the main text, for top genes in GALA that were also significant in FuCOMP, we compared the observed numbers of variants in the GALA cohort with the expected number of variants derived from TADA analysis in FuCOMP. Although observed and expected counts may vary at the individual gene level, as expected for ultra-rare events, the overall observed and expected totals across all genes are well matched, supporting the consistency of signal with expectation (Extended Data Table 1).
Clinical genetics analyses
In addition to VarSome described in the main text, we also ran Neptune32, which uses databases of previously identified variants to call P/LP variants in a set of target genes; we took a similar approach to a recent study10 carried out in the All of Us Research Program by focusing on 73 actionable ACMG genes33. Of the 12,162 variants in these genes among the 4,450 family-based AMR cases, Neptune provided a classification for 8,501 (69.9%); this compares to 28,262 variants among the 13,030 non-AMR family-based cases, of which 20,750 (73.4%) were classified by Neptune. In AMR participants, 136 variants were classified as P/LP, representing 1.12% (136/12,162) of all variants in these genes and 1.60% (136/8,501) of all classified variants. In non-AMR participants, 344 variants were classified as P/LP, representing 1.22% (344/28,262) of all variants and 1.66% (344/20,750) of all classified variants. Examining the results from the perspective of the participants, in AMR we observed 2.73 variants in these genes per individual, of which 1.91 per individual could be classified by Neptune, and 0.031 per individual were classified as P/LP. The corresponding numbers were 2.17 variants, 1.59 Neptune classified variants and 0.026 P/LP variants per non-AMR individual. The results show that, on the variant level, the differences in AMR versus non-AMR participants trace in part to a reduced ability of Neptune to classify AMR variants (Extended Data Fig. 6 and Supplementary Table 11). However, as also noted above, there are more variants per AMR participant (both total and Neptune classified), leading to an apparent lessening of impact in terms of P/LP variants per individual.
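The percentages and per-individual rates in this paragraph follow directly from the reported counts, as the short Python sketch below recomputes. Only the counts are taken from the text; the dictionary layout and key names are illustrative.

```python
# Counts as reported in the text for the 73 ACMG genes
counts = {
    "AMR": {"cases": 4450, "variants": 12162, "classified": 8501, "plp": 136},
    "non-AMR": {"cases": 13030, "variants": 28262, "classified": 20750, "plp": 344},
}


def summarize(c):
    """Variant-level percentages and per-individual rates for one group."""
    return {
        "pct_classified": 100 * c["classified"] / c["variants"],
        "pct_plp_of_all": 100 * c["plp"] / c["variants"],
        "pct_plp_of_classified": 100 * c["plp"] / c["classified"],
        "variants_per_case": c["variants"] / c["cases"],
        "classified_per_case": c["classified"] / c["cases"],
        "plp_per_case": c["plp"] / c["cases"],
    }


summaries = {group: summarize(c) for group, c in counts.items()}

for group, s in summaries.items():
    print(
        f"{group}: {s['pct_classified']:.1f}% classified, "
        f"{s['pct_plp_of_classified']:.2f}% of classified are P/LP, "
        f"{s['variants_per_case']:.2f} variants and {s['plp_per_case']:.3f} P/LP per individual"
    )
```

Laid out this way, the two opposing effects are easy to see: the classified fraction is lower in AMR (69.9% versus 73.4%), whereas the number of variants per individual is higher (2.73 versus 2.17).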
ACMG interpretation of variants
As noted above, for genetic association analyses, the TADA framework was limited to autosomal genes with available mutation rates and LOEUF scores (n = 18,128 genes) and considered only missense variants with an MPC score ≥ 1. By contrast, the clinical interpretation of variants included all autosomal or X-linked protein-truncating, synonymous and missense variants, regardless of gene annotation. In addition to applying the allele frequency cutoff of 0.1% (‘De novo variants’), X-linked variants were subjected to an allele frequency cutoff of 0.1% in the male non-psychiatric subsets of gnomAD versions 2.1.1 and 3.1.2 and their subpopulations. This resulted in 20,571 de novo variants being included for clinical genetics annotation. Inherited variant analysis was restricted to a list of well-established X-linked genes implicated in autism and/or intellectual disability (Supplementary Table 9) and subjected to the same allele frequency cutoff.
The commercially available VarSome package31 was used to evaluate the clinical impact of both de novo variants and X-linked inherited variation in the selected genes. Given the large number of variants, a batch environment was used, which limited the parameters that could be optimized for each gene. Additionally, as ACMG guidelines30 consider patient phenotype, the focus was placed on genes for which there was a reported relationship with an autism phenotype (Autism Spectrum Disorder, Autism and Autistic Behavior) and/or with a broader NDD phenotype (including the three autism terms as well as Intellectual Disability, Global Developmental Delay, Seizure, Epileptic Encephalopathy and Complex Neurodevelopmental Disorder), without knowing the full spectrum of non-autism phenotypes in the participants. Hence, the results presented here (Supplementary Tables 10–12), although based on a more transparent algorithm, should not be considered fully compliant with ACMG classification guidelines.
The api.batch_lookup function in VarSome was used to obtain germline variant-level information related to ACMG classification, nucleotide substitution and amino acid substitution, along with pathogenicity predictions. When possible, transcripts with the most severe coding impact were selected. Otherwise, the MANE Select transcript, longest canonical transcript, MANE Plus transcript, longest transcript or RefSeq transcript was chosen in that order by default.
For de novo variation, variant lists containing unique sets of variants found in each sex and zygosity were annotated. Inheritance in VarSome was set to ‘Confirmed De Novo’. Output from each list was returned in separate JSON files, which were then read into R for downstream processing into tab-separated tables. Inherited variation was examined in a similar manner; however, inheritance was set to the parent of origin of the variant.
To extend these analyses further, we used Neptune32, examining 73 ACMG actionable genes analyzed in All of Us10. The VIP database used for annotation in Neptune was downloaded from https://gitlab.com/bcm-hgsc/neptune in VCF format, and all variants were lifted over63 from GRCh37 to GRCh38. Clinical significance annotations were parsed from the INFO field, and variants classified as Pathogenic/Likely Pathogenic, Uncertain significance and Benign/Likely Benign were noted. All rare variants in probands, regardless of mode of inheritance, were used in these analyses. Of the 73 genes, Venner et al.10 annotated only biallelic variants as P/LP in three recessive genes (MUTYH, ATP7B and KCNQ1) and only a specific variant as P/LP in HFE; we did not observe P/LP variants in these four genes, so no additional corrections were made.
Inclusion and ethics statement
This study was conducted in accordance with Nature Portfolio’s guidelines on inclusion and ethics in global research. The research was designed to include participants of diverse ancestries, with the goal of improving representation in autism genetics research. Study protocols were approved by the IRBs at all participating sites, including the Program for the Protection of Human Subjects at the Icahn School of Medicine at Mount Sinai (GCO no. 14-1082(0001)) as well as the local IRBs in Brazil, Colombia, Peru, Mexico, Kaiser Permanente and the CHARGE study (see ‘Cohort description’). Written informed consent was obtained from all participants or from parents or legal guardians where necessary. Data collection adhered to relevant ethical and cultural standards, and compensation for participation varied by site as described above. Collaborations between institutions in the United States and Latin America were established to ensure equitable contributions across sites. Local investigators in Brazil, Colombia, Mexico and Peru were involved in data collection and authorship.
Sex was recorded based on self-report at enrollment and confirmed with genetic information. Both male and female participants were included; however, sex-stratified analyses were not conducted, as the primary focus of this study was on de novo and rare variant burden across ancestry groups rather than sex differences. Participant ages varied by cohort, with probands typically enrolled during childhood or adolescence and parents as adults.
Statistics and reproducibility
All statistical analyses were performed using R (version 4.3.3), Hail (version 0.2) and Python (version 3.8). Statistical methods are described in detail in the relevant sections of Methods. Two-sided tests were used throughout unless otherwise specified. Multiple hypothesis testing was corrected using the Benjamini–Hochberg FDR procedure or Bonferroni correction as appropriate. Sample sizes were determined by the number of available participants meeting inclusion criteria in the ASC, GALA and SPARK cohorts; no statistical method was used to predetermine sample size. All available samples passing relatedness and quality control thresholds were included in the analyses. No data were otherwise excluded from the analyses.
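For readers unfamiliar with the two multiple-testing corrections named above, a minimal pure-Python sketch is given here. It is a generic textbook implementation written for illustration and is not taken from the study’s analysis scripts.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def bonferroni(p_values):
    """Return Bonferroni adjusted P values, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


example_p = [0.01, 0.04, 0.03, 0.005]
bh_adjusted = benjamini_hochberg(example_p)
bonf_adjusted = bonferroni(example_p)
```

A hypothesis is then declared significant at FDR < 0.05 when its Benjamini–Hochberg adjusted value falls below 0.05; in the toy input, all four tests pass that threshold, whereas only the two smallest survive Bonferroni correction at 0.05.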
Because this study involved secondary analysis of existing human genomic data, randomization and blinding were not applicable. The investigators were not blinded to sample status during analyses. Scripts for computational analyses performed were deposited in a GitHub repository (https://github.com/buxbaum-lab/GALA) to ensure reproducibility. Key results were independently replicated using validation datasets as described.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
Sequencing data for the ASC and GALA samples are available through controlled access via the Database of Genotypes and Phenotypes (accession number phs002502) and on the National Human Genome Research Institute Genomic Data Science Analysis, Visualization and Informatics Lab-space (AnVIL) under accession number phs002502.v1.p1 (https://anvilproject.org/data). SPARK phenotype and sequencing data are available to authorized users through the SFARI Base (https://www.sfari.org/resource/sfari-base/).
Individual-level data from the ASC and GALA cohorts are not publicly available due to participant privacy restrictions. Researchers may request access by contacting J.D.B. (joseph.buxbaum@mssm.edu). All requests will be reviewed by the Mount Sinai Institutional Data Access Committee to ensure compliance with participant consent and IRB protocols. Reasonable requests will receive a response within 2−4 weeks. Summary variant counts, gene-level burden statistics and figure source data are available in the accompanying Supplementary Tables and at https://github.com/buxbaum-lab/GALA.
Code availability
All software used in this study is publicly available at the cited references. The R code used to generate the TADA analysis and figures is available under the MIT license at https://github.com/buxbaum-lab/GALA.
References
Lord, C. et al. Autism spectrum disorder. Nat. Rev. Dis. Primers 6, 5 (2020).
Klei, L. et al. Common genetic variants, acting additively, are a major source of risk for autism. Mol. Autism 3, 9 (2012).
Gaugler, T. et al. Most genetic risk for autism resides with common variation. Nat. Genet. 46, 881–885 (2014).
Fu, J. M. et al. Rare coding variation provides insight into the genetic architecture and phenotypic context of autism. Nat. Genet. 54, 1320–1331 (2022).
Zhou, X. et al. Integrating de novo and inherited variants in 42,607 autism cases identifies mutations in new moderate-risk genes. Nat. Genet. 54, 1305–1319 (2022).
Satterstrom, F. K. et al. Large-scale exome sequencing study implicates both developmental and functional changes in the neurobiology of autism. Cell 180, 568–584 (2020).
Davidson, B. L. et al. Gene-based therapeutics for rare genetic neurodevelopmental psychiatric disorders. Mol. Ther. 30, 2416–2428 (2022).
Fatumo, S. et al. A roadmap to increase diversity in genomic studies. Nat. Med. 28, 243–250 (2022).
Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51, 584–591 (2019).
Venner, E. et al. The frequency of pathogenic variation in the All of Us cohort reveals ancestry-driven disparities. Commun. Biol. 7, 174 (2024).
Abul-Husn, N. S. et al. Molecular diagnostic yield of genome sequencing versus targeted gene panel testing in racially and ethnically diverse pediatric patients. Genet. Med. 25, 100880 (2023).
Wright, C. F. et al. Genomic diagnosis of rare pediatric disease in the United Kingdom and Ireland. N. Engl. J. Med. 388, 1559–1571 (2023).
1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Moreno-Estrada, A. et al. Human genetics. The genetics of Mexico recapitulates Native American substructure and affects biomedical traits. Science 344, 1280–1285 (2014).
Ongaro, L. et al. The genomic impact of European colonization of the Americas. Curr. Biol. 29, 3974–3986 (2019).
Kosmicki, J. A. et al. Refining the role of de novo protein-truncating variants in neurodevelopmental disorders by using population reference samples. Nat. Genet. 49, 504–510 (2017).
DeFelice, M. et al. Blended genome exome (BGE) as a cost efficient alternative to deep whole genomes or arrays. Preprint at bioRxiv https://doi.org/10.1101/2024.04.03.587209 (2024).
Buxbaum, J. D. et al. The autism sequencing consortium: large-scale, high-throughput sequencing in autism spectrum disorders. Neuron 76, 1052–1056 (2012).
SPARK Consortium. SPARK: a US cohort of 50,000 families to accelerate autism research. Neuron 97, 488–493 (2018).
Abul-Husn, N. S. et al. Implementing genomic screening in diverse populations. Genome Med. 13, 17 (2021).
Belbin, G. M. et al. Toward a fine-scale population health monitoring system. Cell 184, 2068–2083 (2021).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Chen, S. et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature 625, 92–100 (2024).
Babadi, M. et al. GATK-gCNV enables the discovery of rare copy number variants from exome sequencing data. Nat. Genet. 55, 1589–1597 (2023).
Samocha, K. E. et al. Regional missense constraint improves variant deleteriousness prediction. Preprint at bioRxiv https://doi.org/10.1101/148353 (2017).
Browning, S. R. et al. Ancestry-specific recent effective population size in the Americas. PLoS Genet. 14, e1007385 (2018).
He, X. et al. Integrated model of de novo and inherited genetic variants yields greater power to identify risk genes. PLoS Genet. 9, e1003671 (2013).
De Rubeis, S. et al. Synaptic, transcriptional and chromatin genes disrupted in autism. Nature 515, 209–215 (2014).
Kaplanis, J. et al. Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature 586, 757–762 (2020).
Richards, S. et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet. Med. 17, 405–424 (2015).
Kopanos, C. et al. VarSome: the human genomic variant search engine. Bioinformatics 35, 1978–1980 (2019).
Venner, E. et al. Neptune: an environment for the delivery of genomic medicine. Genet. Med. 23, 1838–1846 (2021).
Miller, D. T. et al. ACMG SF v3.0 list for reporting of secondary findings in clinical exome and genome sequencing: a policy statement of the American College of Medical Genetics and Genomics (ACMG). Genet. Med. 23, 1381–1390 (2021).
Arriaga-MacKenzie, I. S. et al. Summix: a method for detecting and adjusting for population structure in genetic summary data. Am. J. Hum. Genet. 108, 1270–1282 (2021).
Gudmundsson, S. et al. Variant interpretation using population databases: lessons from gnomAD. Hum. Mutat. 43, 1012–1030 (2022).
Hou, K. et al. Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nat. Genet. 55, 549–558 (2023).
Saitou, M., Dahl, A., Wang, Q. & Liu, X. Allele frequency impacts the cross-ancestry portability of gene expression prediction in lymphoblastoid cell lines. Am. J. Hum. Genet. 111, 2814–2825 (2024).
Maples, B. K., Gravel, S., Kenny, E. E. & Bustamante, C. D. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet. 93, 278–288 (2013).
Honorato-Mauer, J. et al. Characterizing features affecting local ancestry inference performance in admixed populations. Am. J. Hum. Genet. 112, 224–234 (2025).
Manrai, A. K. et al. Genetic misdiagnoses and the potential for health disparities. N. Engl. J. Med. 375, 655–665 (2016).
Ciesielski, T. H., Sirugo, G., Iyengar, S. K. & Williams, S. M. Characterizing the pathogenicity of genetic variants: the consequences of context. npj Genom. Med. 9, 3 (2024).
Sharo, A. G., Zou, Y., Adhikari, A. N. & Brenner, S. E. ClinVar and HGMD genomic variant classification accuracy has improved over time, as measured by implied disease burden. Genome Med. 15, 51 (2023).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Jun, G. et al. Structural variation across 138,134 samples in the TOPMed consortium. Preprint at Research Square https://doi.org/10.21203/rs.3.rs-2515453/v1 (2023).
Collins, R. L. et al. A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020).
Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
Yilmaz, F. et al. Genome-wide copy number variations in a large cohort of bantu African children. BMC Med. Genom. 14, 129 (2021).
Pereira, L., Mutesa, L., Tindana, P. & Ramsay, M. African genetic diversity and adaptation inform a precision medicine agenda. Nat. Rev. Genet. 22, 284–306 (2021).
Gomez, F., Hirbo, J. & Tishkoff, S. A. Genetic variation and adaptation in Africa: implications for human evolution and disease. Cold Spring Harb. Perspect. Biol. 6, a008524 (2014).
Yu, N. et al. Larger genetic differences within Africans than between Africans and Eurasians. Genetics 161, 269–274 (2002).
Schaaf, C. P. et al. A framework for an evidence-based gene list relevant to autism spectrum disorder. Nat. Rev. Genet. 21, 367–376 (2020).
Jay, J. J. & Brouwer, C. Lollipops in the clinic: information dense mutation plots for precision medicine. PLoS ONE 11, e0160519 (2016).
McInnes, L. A. et al. A genetic study of autism in Costa Rica: multiple variables affecting IQ scores observed in a preliminary sample of autistic cases. BMC Psychiatry 5, 15 (2005).
McInnes, L. A. et al. The NRG1 exon 11 missense variant is not associated with autism in the Central Valley of Costa Rica. BMC Psychiatry 7, 21 (2007).
Buxbaum, J. et al. The Autism Simplex Collection: an international, expertly phenotyped autism sample for genetic and phenotypic analyses. Mol. Autism 5, 34 (2014).
Naslavsky, M. S. et al. Exomic variants of an elderly cohort of Brazilians in the ABraOM database. Hum. Mutat. 38, 751–763 (2017).
Hertz-Picciotto, I. et al. The CHARGE study: an epidemiologic investigation of genetic and environmental factors contributing to autism. Environ. Health Perspect. 114, 1119–1125 (2006).
Mathews, C. A. et al. Genetic studies of neuropsychiatric disorders in Costa Rica: a model for the use of isolated populations. Psychiatr. Genet. 14, 13–23 (2004).
Purcell, S. M. et al. A polygenic burden of rare disruptive mutations in schizophrenia. Nature 506, 185–190 (2014).
McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).
Van der Auwera, G. A. et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr. Protoc. Bioinformatics 43, 11.10.1–11.10.33 (2013).
Hansen, B. B. & Olsen Klopfer, S. Optimal full matching and related designs via network flows. J. Comput. Graph. Stat. 15, 609–627 (2006).
Miller, D. T. et al. Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2021 update: a policy statement of the American College of Medical Genetics and Genomics (ACMG). Genet. Med. 23, 1391–1398 (2021).
Acknowledgements
GALA is currently supported by the National Institutes of Health (grant MH128813, J.D.B.), the Seaver Autism Center for Research and Treatment and the SWT and Seaver Foundations. GALA originated with sites from, and with the support of, the ASC (MH129724, J.D.B.; MH129722, M.D.; MH129725, K.R.; MH129751, S.S.; and prior ASC funding, for example, MH100233 and MH111661). ASC sites continue to support analyses of GALA studies, with additional analyses supported by MH128813. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by Clinical and Translational Science Awards grant UL1TR004419 from the National Center for Advancing Translational Sciences. Research reported in this paper was also supported by the Office of Research Infrastructure of the National Institutes of Health under award numbers S10OD026880 and S10OD030463. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This study makes use of data generated by the DECIPHER community. A full list of centers that contributed to the generation of the data is available from https://deciphergenomics.org/about/stats and via email from contact@deciphergenomics.org. DECIPHER is hosted by the EMBL-EBI, and funding for the DECIPHER project was provided by the Wellcome Trust (grant no. WT223718/Z/21/Z).
Author information
Contributions
K.R., B.D., C.B. and J.D.B. conceived and designed the study. T.L., T.P., C.R.S., C.M.C., J.L.A., G.S.C., H.C., R.C., C.I.S.C., M.L.C., A.D.P.L., M.F., E.F., L.G., A.C.D.E.S.G., A.J.G., L.C.H., N.L., Y.L., D.N.-R., R.O., K.P.P., I.P., R.S., H.M.S., L.T., J.Y.T.W., L.A.-G., L.A.C., C.S.C.-F., I.H.-P., A.K., M.C.L., L.M., M.R.P.-B., M.A.P.-V., P.S., F.T., M.P.T., M.E.T., M.J.D. and J.D.B. contributed samples and generated data. J.D.B., C.B., B.D., B.M., S.D.R., L.K., L.S., J.M.F., F.K.S., S.J. and M.N.A. developed methodology and performed data analyses. J.D.B., B.D., C.B., E.H.C., T.L., L.S. and M.N.A. drafted and revised the paper. All authors reviewed and approved the final version of the paper. J.D.B. supervised the study.
Ethics declarations
Competing interests
L.A.-G. is the main author of the CRIDI-ASD interview; she teaches the training course for the aforementioned instrument and receives payment for the training. The other authors declare no conflicts of interest.
Peer review
Peer review information
Nature Medicine thanks Andres Moreno-Estrada and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Anna Ranzoni, in collaboration with the Nature Medicine team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Data processing for samples from three different data sources.
The figure describes the variant, genotype, and sample quality control steps that were implemented to process the raw, joint-genotyped VCFs and generate the de novo and inherited calls used for downstream analyses. Sample counts are tabulated before downstream ancestry filtering.
Extended Data Fig. 2 Comparison of rare de novo variant counts per sample between ASD probands and unaffected siblings across different ancestries, normalized to synonymous variant rates.
The average number of rare de novo variants per sample, normalized by the synonymous de novo variant rate, is compared between ASD probands and their unaffected siblings for all ancestries (ALL: 17,480 probands and 6,208 siblings), Admixed American (AMR: 4,450 probands and 1,459 siblings), and non-Admixed American (FuCOMP: 13,030 probands and 4,749 siblings). The analysis considers: (a) protein truncating variants (PTVs) in highly constrained genes (LOEUF deciles 1–3, 5,363 genes) and less constrained genes (LOEUF deciles 4–10, 12,765 genes); (b) missense variants categorized by predicted functional severity (MPC ≥ 2 for high severity, 1 ≤ MPC < 2 for moderate severity); and (c) missense variants with MPC < 1 (low severity) and synonymous variants. Data are presented as mean values ± 95% confidence intervals. Statistical significance was assessed using two-sided z-tests comparing normalized de novo mutation rates between probands and siblings. P values were adjusted for multiple comparisons using the Benjamini–Hochberg false discovery rate (FDR) method, and exact adjusted P values are shown above the bars.
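The testing procedure named in this legend (two-sided z-tests followed by Benjamini–Hochberg adjustment) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names are hypothetical, and the legend does not state how the standard errors of the normalized rates were obtained, so the sketch assumes they are supplied as inputs.

```python
import math


def two_sided_z_test(mean_a, se_a, mean_b, se_b):
    """Two-sided z-test for the difference between two independent estimates.

    The means and standard errors are assumed inputs (for example, normalized
    de novo rates in probands and siblings); their derivation is not shown.
    """
    z = (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    # Two-sided P value under the standard normal: P = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    # Indices of the P values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In use, each proband-versus-sibling comparison within a panel would contribute one P value from `two_sided_z_test`, and the resulting list would be passed once to `benjamini_hochberg`.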
Extended Data Fig. 3 Genic burden of PTVs across different ancestries in gnomAD v2.1.1 as a function of gene constraint.
The sum of observed PTVs per ancestry is plotted, scaled to each population’s size and total gene coding sequence length within gnomAD LOEUF deciles. The plot includes African/African American (AFR, n = 8,128), Admixed American (AMR, n = 17,296), East Asian (EAS, n = 9,197), Non-Finnish European (NFE, n = 56,885), and South Asian (SAS, n = 15,308) ancestries. LOEUF deciles represent levels of gene constraint, with lower deciles indicating more constrained genes.
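One plausible reading of the scaling described in this legend is a count divided by both the number of individuals and the total coding sequence length in the decile. The sketch below illustrates that reading only. The function name is hypothetical, the exact normalization used by the authors is not specified here, and the PTV count and coding length in the usage lines are placeholders rather than gnomAD values; the sample sizes are those given in the legend.

```python
# Sample sizes per ancestry as listed in the figure legend.
GNOMAD_N = {"AFR": 8128, "AMR": 17296, "EAS": 9197, "NFE": 56885, "SAS": 15308}


def scaled_ptv_burden(ptv_count, n_individuals, cds_length_bp):
    """PTVs per individual per coding base pair for one ancestry in one decile.

    Assumed form of the scaling: count / (sample size * coding length).
    """
    if n_individuals <= 0 or cds_length_bp <= 0:
        raise ValueError("sample size and coding length must be positive")
    return ptv_count / (n_individuals * cds_length_bp)


# Placeholder inputs for illustration only (not real gnomAD counts).
placeholder_ptv_count = 1_000
placeholder_cds_length_bp = 2_000_000
burden_by_ancestry = {
    ancestry: scaled_ptv_burden(placeholder_ptv_count, n, placeholder_cds_length_bp)
    for ancestry, n in GNOMAD_N.items()
}
```

Scaling by both quantities makes the per-decile values comparable across ancestries with very different sample sizes and across deciles that cover different amounts of coding sequence.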
Extended Data Fig. 4 Relative contribution to TADA signal by mode of inheritance.
The proportional impact of each inheritance mode on the ASD-associated genes is shown at three false discovery rate (FDR) thresholds: ≤0.1 (a, d), ≤0.05 (b, e), and ≤0.01 (c, f). Panels (a–c) display results for the GALA cohort, while panels (d–f) show results for the FuCOMP subset from Fu et al.4. BF, Bayes Factor.
Extended Data Fig. 5 Relative contribution to TADA signal by variant type.
The proportional impact of each variant type on the ASD-associated genes is shown at three false discovery rate (FDR) thresholds: ≤0.1 (a, d), ≤0.05 (b, e), and ≤0.01 (c, f). Panels (a–c) display results for the GALA cohort, while panels (d–f) show results for the FuCOMP subset from Fu et al.4. BF, Bayes Factor.
Extended Data Fig. 6 Classification rates and proportions of P/LP variants across AMR and non-AMR populations using Neptune.
The figure compares the classification rates and proportions of P/LP variants in the indicated subsamples. Left: The ratio of (upper) classified variants (by Neptune) to total variants, (middle) P/LP variants to total variants, and (lower) P/LP variants to Neptune classified variants is shown for AMR, non-AMR, non-European (non-EUR) and EUR ancestries. Right: Comparisons include (upper) the total number of variants, (middle) the number of classified variants, and (lower) the number of P/LP variants, all expressed per proband. AMR participants have more variants per individual (both total and Neptune-classified) compared to non-AMR participants, but a reduced ability of Neptune to classify variants in AMR contributes to a slightly lower proportion of P/LP variants per individual. Similar results are seen for non-EUR versus EUR. Data are presented as mean values ± 95% confidence intervals (error bars show the plotted CI bounds). Statistical analysis: pairwise two-sided z-tests were used to compare groups within each panel; P values were adjusted for multiple comparisons using the Benjamini–Hochberg FDR procedure. Asterisks indicate adjusted P < 0.05.
Extended Data Fig. 7 Lollipop diagrams illustrating variants identified in emerging autism-associated genes.
Variants observed in GALA analyses of AMR individuals are marked with pink circles, those found in FuCOMP individuals are marked with green, and variants found in DECIPHER are in purple. Figures were generated using the lollipop software package52.
Extended Data Fig. 8 Evaluation of ancestry composition and variant burden among GALA probands.
(a) Ancestry proportions for all GALA individuals (left) and among carriers of damaging rare variants (right) inferred using a Random Forest classifier trained on 1000 Genomes + Human Genome Diversity Project (HGDP) reference populations. Most individuals display majority Admixed American (AMR) ancestry. (b) Ternary plots showing the distribution of ancestry proportions among all GALA individuals (left) and among carriers of damaging rare variants (right).
Supplementary information
Supplementary Tables
Supplementary Tables workbook.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Natividad Avila, M., Jung, S., Satterstrom, F.K. et al. Deleterious coding variation associated with autism is shared across ancestries. Nat Med (2026). https://doi.org/10.1038/s41591-026-04228-6
DOI: https://doi.org/10.1038/s41591-026-04228-6







