Introduction

Various congenital anomalies, ranging from congenital heart defect (CHD) to orofacial cleft (OFC), affect approximately 3% of births each year in the United States1 and account for about 20% of infant mortality2. CHD patients have abnormalities in the structure of the heart at birth3, while OFC patients have incomplete fusions of embryonic tissues in their lips or palates4. Improved understanding of their genetic etiology will improve the accuracy of genetic diagnoses and guide potential disease-specific treatment strategies.

Transcription factors (TFs) play key roles in orchestrating differentiation and establishing cell identity during development5,6. Genetic variants that damage TF function can cause various developmental disorders7. Sequence-specific TFs control gene expression programs by binding to recognition sites in the genome and regulating the expression of their target genes. Missense variants in the DNA binding domains of TFs can alter DNA binding activity and cause a wide range of diseases, including Mendelian diseases8. For example, many of the pathogenic variants in NKX2-5 and TBX5 for CHD, and IRF6 for OFC, are found in their DNA binding domains9,10. We thus hypothesized that DNA binding domain variants in other TF genes might also cause these congenital anomalies. Furthermore, we hypothesized that DNA binding domain variants not yet found to be pathogenic but that occur in TFs with DNA binding domain variants previously found to cause CHD or OFC might also cause these conditions.

Searching for genetic causes underlying congenital anomalies requires genetic data from patients. In recent years, the Gabriella Miller Kids First pediatric research program (“Kids First” from here on) funded efforts to sequence the genomes of patients as well as the family trios. Such family trio studies have been a primary strategy to discover disease genes for congenital anomalies11,12,13. The trio design is crucial in detecting de novo variants in probands and ascertaining rare pathogenic variants, as demonstrated by the Deciphering Developmental Disorders (DDD) study14. Most probands for CHD and OFC are sporadic cases with unaffected parents (100% for CHD cohorts and 95.3% for OFC cohorts in this study). Therefore, in this study, we searched for de novo variants and rare inherited variants in the probands.

As many TFs are essential, and their haploinsufficiency causes Mendelian diseases7, there is selective pressure acting against damaging variants in essential TF genes. Therefore, damaging variants in TFs in humans are expected to be present as de novo variants, which have yet to undergo negative selection. Individuals carrying such variants can have genetic conditions, like CHD and OFC, which are often caused by de novo variants. Recently, DNA binding domain variants in three distinct TFs found in ocular congenital cranial dysinnervation disorders were shown to affect DNA binding affinity15. Such findings support the likelihood that analyzing data from cohorts of congenital anomalies, like CHD and OFC, can uncover damaging variants in TFs that are causative.

The aim of our study was two-fold. First, we sought to discover novel disease genes in CHD and OFC because more causal genes likely remain to be found8,12,13,16. While CHD and OFC are distinct congenital anomalies, here we analyzed data for these two congenital anomalies because: (1) they are largely genetic conditions, (2) de novo variants explain the molecular cause in a significant proportion of patients, (3) TF genes have been implicated as disease genes, and (4) there were large cohort data available from multiple studies to increase power of disease gene discovery. We boosted power to discover novel disease genes by combining data from multiple cohorts across the spectrum of syndromic and non-syndromic cases for CHD and OFC, respectively12,13,17,18,19. We utilized the PrimateAI variant effect prediction tool20 to identify missense variants likely to be pathogenic more precisely than earlier studies12,13. Furthermore, we applied the Transmission And De novo Association (TADA)21 test to identify genes that show enrichment of putatively damaging de novo and inherited variants across different variant classes, such as missense and predicted loss-of-function (pLoF) variants (i.e., nonsense, canonical splicing, and frameshift variants). This method has been successfully applied to discover potential autism genes22.

Second, focusing on TFs because of their key roles in development and Mendelian diseases, we surveyed TFs and TF DNA binding domain variants for their potential association with CHD and OFC. The resulting list of TFs and DNA binding domain variants is provided as a resource for future studies to evaluate whether they alter DNA binding activity8,16.

Results

Genetic variants identified from multiple family trio cohorts of CHD and OFC

To maximize power to discover novel disease genes, we combined genetic data from multiple CHD and, separately, OFC cohorts. For CHD, we collected a non-redundant list of de novo variants and heterozygous predicted loss-of-function (pLoF) variants (i.e., nonsense, canonical splicing, and frameshift variants) in probands from three prior studies12,17,18, one of which is part of the Kids First program18. In total, our list included variants from 3835 family trios with a proband with CHD (Supplementary Data 1). For OFC, we assembled genetic data from four Kids First cohorts13,23 and the Deciphering Developmental Disorders (DDD) study19, totaling 1844 family trios (Supplementary Data 1). We combined those data with a list of de novo variants found in 757 family trios from Bishop et al.13, and 603 family trios from Wilson et al.19. For the Kids First cohort samples not analyzed by these two studies, we identified de novo variants from the whole-genome sequencing data using the slivar tool24 (Methods).

Missense variant effect prediction methods prioritized putatively damaging variants

Missense variant effect prediction methods aim to score missense variants according to their likelihood of being benign or pathogenic25,26,27,28,29,30,31,32,33. Disease genes are expected to be enriched for damaging, and not neutral, variants. Therefore, we compared ten variant effect prediction tools in order to select one that best differentiates potentially damaging variants from neutral ones in the context of congenital anomalies. For this, we scored de novo variants in known CHD genes (Supplementary Data 2) from CHD patients12 (3835 families with 113 variants) and unaffected siblings from an autism study34 (2179 families with 26 variants). The autism study was unique in that four members of an autism proband family were sequenced: 2 unaffected parents, 1 unaffected sibling, and 1 proband. This enabled deriving a set of de novo variants that are likely benign in the unaffected siblings. In contrast, the CHD cohorts did not have any genetic data from unaffected siblings, and we can expect that unaffected siblings from an autism study likely did not have CHD diagnoses. Although these variants’ pathogenicity has not all been resolved, we nonetheless expect many of the de novo variants from CHD patients to be pathogenic and most of those from the unaffected siblings in the autism study to be benign for CHD.

We compared the performance of the ten tools in discriminating the two sets of variants at various score thresholds (Fig. 1A). We aimed to select a method that highly enriches potentially pathogenic variants at the top quantile. Overall, PrimateAI20 showed the highest area under the curve for both the receiver operating characteristic (ROC) and precision-recall curves (Supplementary Fig. 1). Although Missense Variant Pathogenicity (MVP)26 performed similarly well, it falsely classified more variants from unaffected children as pathogenic than PrimateAI did. For instance, at a score percentile threshold of 0.75, MVP and PrimateAI predicted 13 and 4, respectively, of the 26 de novo variants from unaffected children to be pathogenic. Moreover, since PrimateAI does not use any disease association information in model training, we anticipate it is less likely to show overfitting. Therefore, we used PrimateAI to infer the likelihood of missense variant pathogenicity in all subsequent analyses in this study.

Fig. 1: Comparison of missense variant prediction methods.

A Number of variants in each score percentile bin, which corresponds to 5% increments, for ten missense variant effect predictions. Only de novo variants in 225 human CHD genes, which are listed in Supplementary Data 2, are considered. The orange line depicts the precision at each percentile threshold. B Enrichment of missense variants in 5% PrimateAI score bins for all de novo variants in CHD patients and unaffected children. The error bars are 95% bootstrap confidence intervals. MisA, missense class A (PrimateAI ≥ 0.9); MisB, missense class B (0.75 ≤ PrimateAI < 0.9).

Next, we determined score thresholds to classify all de novo missense variants. If we add up the mutation rate per generation for all possible missense variants in the human genome, the total missense mutation rate is approximately 0.68 per generation35,36. Then, we inferred the expected number of de novo missense mutations in each 5% PrimateAI score bin (i.e., 0.68 × 0.05). Based on this expected rate, we derived the enrichment of de novo missense variants in CHD versus control samples for each score bin (Fig. 1B). The enrichment was more pronounced at the higher score bins. Therefore, we set two score thresholds: a stringent threshold of 0.9, and a more permissive, albeit still highly enriching, threshold of 0.75, to derive two groups of putatively damaging missense variants (PrimateAI ≥ 0.9 as MissenseA [MisA] and 0.75 ≤ PrimateAI < 0.9 as MissenseB [MisB]). These two subsets were enriched among CHD samples but depleted among control samples (Supplementary Fig. 2). Variants with lower PrimateAI scores showed neither enrichment nor depletion in these samples. This is consistent with the enrichment of de novo missense variants predicted to be damaging in patients with CHD and autism11,34. From here on, we considered de novo and inherited pLoF, de novo MisA, and de novo MisB variants as putatively damaging. We used the same score thresholds for the analysis of the OFC patient cohorts.
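The per-bin enrichment calculation described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name, the input format (a flat list of PrimateAI rank scores for de novo missense variants), and the cohort size in the usage note are assumptions. Only the total rate of 0.68 and the 5% bin width come from the text.

```python
TOTAL_MISSENSE_RATE = 0.68  # approximate de novo missense mutations per generation (from the text)


def enrichment_by_bin(rank_scores, n_families, n_bins=20):
    """Observed / expected de novo missense counts in equal-width rank-score bins.

    rank_scores: rank scores in [0, 1], one per observed de novo missense variant
                 (hypothetical input format).
    n_families:  number of trios from which the variants were ascertained.
    """
    # Each bin holds 1/n_bins of all possible missense variants, so the expected
    # count per bin is 0.68 * (1 / n_bins) per family, summed over families.
    expected_per_bin = TOTAL_MISSENSE_RATE * n_families / n_bins
    counts = [0] * n_bins
    for score in rank_scores:
        index = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the top bin
        counts[index] += 1
    return [count / expected_per_bin for count in counts]
```

For example, calling `enrichment_by_bin(scores, n_families=100)` on a toy cohort returns 20 ratios, where values above 1 indicate more variants in that score bin than the mutation model predicts.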

Detection of genes with enrichment of putatively damaging de novo and rare variants

Next, to identify candidate CHD and OFC genes, we analyzed the de novo pLoF, MisA, and MisB variants and rare inherited pLoF variants using the transmission and de novo association (TADA) model21. This model integrates enrichment of de novo variants based on a mutational model35 and the enrichment of inherited variants from cases compared to those from controls. The test calculates a Bayes factor that captures the enrichment of putatively damaging variants of different types (i.e., higher Bayes factor indicates more statistically significant enrichment). We considered 3578 unaffected parents in an autism cohort as controls because we can expect that they likely do not have CHD or OFC12,37. This approach was used in an earlier study for CHD that aimed to discover genes with enrichment of putatively damaging variants12.
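To make the notion of a Bayes factor for de novo enrichment concrete, the sketch below assumes a Poisson-Gamma formulation: de novo counts in a gene are Poisson distributed with mean 2Nμ under the null, and with mean 2Nμγ under the alternative, where the relative risk γ follows a Gamma prior. This is an illustration under stated assumptions, not the TADA software itself; the function name and the hyperparameter values (`gamma_mean`, `beta`) are hypothetical, and the case/control component is omitted.

```python
import math


def denovo_log_bayes_factor(x, n_trios, mu, gamma_mean, beta):
    """Natural-log Bayes factor for observing x de novo variants in one gene.

    Assumed model (hypothetical hyperparameters):
      H0: x ~ Poisson(2 * n_trios * mu)
      H1: x ~ Poisson(2 * n_trios * mu * gamma), gamma ~ Gamma(shape=gamma_mean * beta, rate=beta)
    mu is taken to be a per-gene, per-chromosome mutation rate for the variant class.
    """
    lam = 2.0 * n_trios * mu
    shape = gamma_mean * beta
    # Marginal likelihood under H1 is negative binomial after integrating out gamma.
    log_h1 = (
        math.lgamma(x + shape) - math.lgamma(shape) - math.lgamma(x + 1)
        + shape * math.log(beta / (beta + lam))
        + x * math.log(lam / (beta + lam))
    )
    log_h0 = x * math.log(lam) - lam - math.lgamma(x + 1)
    return log_h1 - log_h0
```

Under this model a gene with no observed de novo variants receives a Bayes factor below 1, and each additional variant raises it. Per-class Bayes factors (pLoF, MisA, MisB) would be multiplied, or their logarithms summed, to give a per-gene value.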

We detected 46 and 22 significant genes for CHD and OFC, respectively (q value < 0.1, Supplementary Data 3 and 4). Since genes with no depletion of pLoF variants in a healthy population are not likely to be congenital anomaly genes, we excluded genes whose loss-of-function observed/expected upper bound fraction (LOEUF) in gnomAD36 was 1 or greater. Most candidate genes had both pLoF and missense variants contributing to the enrichment (Fig. 2). Thus, integrating the variant types was useful in detecting candidate disease genes.

Fig. 2: Bayes factor for each variant type’s enrichment in candidate disease genes.

(Top) Bayes factor contribution by MisA, MisB, and pLoF variants in TADA for A CHD and B OFC in the “de novo + case/control” setting. Only positive Bayes factor contributions in candidate genes (q value < 0.1) with LOEUF < 1 are displayed (CHD: 46 genes, OFC: 22 genes). (Bottom) Number of variants in each category. BF Bayes factor, TF transcription factor.

Seventeen of the 46 genes identified in the CHD analysis cohorts were not known CHD genes (i.e., not significant in studies of individual cohorts and not annotated as CHD genes; Table 1). Eight of the 22 genes identified in the OFC analysis cohorts were not known OFC genes; known OFC genes were taken from the Genomics England PanelApp38 “Clefting” version 4.0 list (Supplementary Data 5). CHD and OFC patients are at higher risk for other congenital anomalies39,40. Indeed, several of these genes are developmental disorder genes, such as TAOK1, WAC, PACS1, FOXP1, BRAF, SETD5, and ZMIZ1 (phenotype MIM numbers: 619575, 616708, 615009, 613670, 613706, 615761, and 618659, respectively). In a recent study on CHD41, a de novo variant in SETD5 was considered as a positive diagnosis. Similarly, 7 of the 8 novel candidate OFC genes (MED13L, SOX5, KAT6B, ARID1B, MACF1, ADNP, and BRF1) are linked to various developmental disorders (phenotype MIM numbers: 616789, 616803, 616170, 135900, 618325, 615873, and 616202, respectively). These results are consistent with the known associations of CHD and OFC with neurodevelopmental disorders42,43.

Table 1 List of novel candidate disease genes for CHD and OFC

More than half of the significant genes in CHD and OFC showed probands with an inherited pLoF variant in the candidate disease gene (27 out of 46 and 13 out of 22 for CHD and OFC, respectively). Two of the OFC family trios (one with a CTNND1 pLoF variant and another with an ARHGAP29 pLoF variant) had an affected parent who passed on the pLoF variant. However, most inherited pLoF variants in candidate and known disease genes were inherited from unaffected parents, suggesting the possibility of incomplete penetrance. For both CHD and OFC, the contribution of inherited variants from unaffected parents has been documented, consistent with our observation17,44.

De novo missense variants in CHD and OFC genes

Predicting the pathogenic effects of missense variants is challenging, and many are classified as variants of uncertain significance (VUSs) in ClinVar45. Although we selected PrimateAI for this study, predictions by other methods can also be informative. As a resource for clinical researchers, we provide a table of predictions for the de novo missense variants identified in CHD and OFC genes (Supplementary Data 6 and 7). These tables include de novo missense variants in known CHD or OFC genes (Supplementary Data 2 and 5) and candidate CHD or OFC genes in the respective cohorts. In addition to scores from the tools we compared in Fig. 1, we also include scores from the more recent AlphaMissense tool46.

The coding sequence length affects which TADA model detects enrichment in the gene

To evaluate the utility of incorporating inherited pLoF variants in the case/control setting (i.e., “de novo and case/control”), we compared against the enrichment obtained using just de novo variants with TADA (i.e., “de novo only”). Surprisingly, using just the de novo variants yielded more candidate CHD genes (Supplementary Data 3) than using the “de novo and case/control” setting; 24 and 10 genes were exclusively significant in “de novo only” and “de novo and case/control” settings, respectively. The 24 genes that were significant (i.e., TADA q value < 0.1 and LOEUF < 1) only in the “de novo only” setting had no rare inherited pLoF variants in the cohorts, which lowered the Bayes factor estimates when case/control data were incorporated. Since approximately 90% of these genes are highly constrained with LOEUF < 0.3 (i.e., in approximately the top 10% of all protein-coding genes), pLoF variants in these genes are expected to be extremely rare in unaffected individuals. Since longer genes are expected to have more pLoF variants on average, we compared the lengths of genes unique to each setting. The coding sequence lengths of the 10 genes that were uniquely significant in the “de novo and case/control” model were significantly longer than those of the 24 genes uniquely significant in the “de novo only” model (p = 0.019, one-sided Wilcoxon rank-sum test; Fig. 3). The LOEUF estimates of genes in the two sets were not significantly different (P > 0.05, Wilcoxon rank-sum test). We observed similar trends for candidate OFC genes (Supplementary Data 4 and Supplementary Fig. 3). Altogether, these results demonstrate that the coding sequence length of genes affects their identification as significant disease genes by the “de novo only” versus the “de novo and case/control” TADA model. 
This is likely because longer genes have a greater chance that pLoF variants are present in a population and inherited, thereby contributing to increased enrichment in the “de novo and case/control” setting. On the other hand, shorter genes have a lower expected mutation rate for pLoF variants, so each de novo variant contributes a greater amount of enrichment.
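The length comparison between the two gene sets can be illustrated with a small one-sided rank-sum calculation. The sketch below is an approximation under stated assumptions: it uses the normal approximation without a tie correction, whereas sets as small as 10 and 24 genes might be better served by an exact test in a statistics package. The function name and the lengths in the usage note are hypothetical.

```python
import math


def rank_sum_one_sided(longer, shorter):
    """Mann-Whitney U and approximate p-value for H1: `longer` values tend to exceed `shorter`.

    Uses the normal approximation without tie or continuity correction.
    """
    n1, n2 = len(longer), len(shorter)
    # U counts pairs in which the first group's value is larger (ties count one half).
    u = sum((a > b) + 0.5 * (a == b) for a in longer for b in shorter)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability of a standard normal
    return u, p_value
```

With made-up coding sequence lengths such as `rank_sum_one_sided([900, 1200, 1500, 2000], [200, 300, 400, 500, 600])`, every value in the first group exceeds every value in the second, so U equals the number of pairs (20) and the p-value is small.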

Fig. 3: Coding sequence length of significant CHD genes by discovery model.

Distribution of coding sequence length for the significant genes unique to the “de novo and case/control” model and “de novo only” model. The number of genes is labeled below each category. CDS, coding sequence; aa, amino acid. * p < 0.05, one-sided Wilcoxon rank-sum test.

TF DNA binding domain variants identified in candidate CHD and OFC disease genes

Because of the known role of TFs in CHD47 and OFC48, we examined how many significant genes from our analysis were TFs49. For CHD, there were 14 TFs that showed significant enrichment in either “de novo and case/control” or “de novo only” analysis (Table 2 and Fig. 2). For OFC, 7 TFs showed significant enrichment (Table 2 and Fig. 2). For both CHD and OFC, TFs were significantly enriched among the significant genes (p = 0.006 and p = 0.016, respectively, one-sided Fisher’s exact test).
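A one-sided Fisher's exact test for over-representation of TFs among significant genes reduces to an upper-tail hypergeometric probability. The sketch below shows that calculation with the standard library only. The function and argument names are hypothetical, and the choice of background gene set is an assumption that the text does not specify in this section.

```python
from math import comb


def fisher_one_sided(k_tf_sig, n_sig, n_tf, n_total):
    """P(X >= k_tf_sig) when n_sig genes are drawn from n_total genes, n_tf of which are TFs.

    This is the upper tail of the hypergeometric distribution, i.e. the one-sided
    Fisher's exact test for enrichment in a 2x2 table.
    """
    upper = min(n_sig, n_tf)
    numerator = sum(
        comb(n_tf, i) * comb(n_total - n_tf, n_sig - i)
        for i in range(k_tf_sig, upper + 1)
    )
    return numerator / comb(n_total, n_sig)
```

A call would pass the number of significant TFs, the number of significant genes, the number of TFs in the background, and the background size, in that order. Exact integer arithmetic keeps the result stable even for backgrounds of many thousands of genes.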

Table 2 Transcription factors significantly enriched for predicted deleterious de novo variants

There were 5 and 3 candidate CHD and OFC TF genes, respectively, that are not yet established CHD or OFC disease genes. For CHD, we identified KDM5B, FOXP1, KLF2, MEIS2, and CTCF. For OFC, we identified SOX5, ADNP, and GRHL2. Two candidate CHD TF genes—KDM5B and FOXP1—were also statistically implicated in a similar CHD study50 that aggregated de novo variants from two12,17 of the 3 studies that we analyzed. Nevertheless, KDM5B, FOXP1, MEIS2, and CTCF are known developmental disorder genes (phenotype MIM numbers: 618109, 613670, 600987, and 615502, respectively). Some children with mutations in these genes have been reported to show heart defects51,52,53,54. KLF2 has not been directly associated with CHD, but its zebrafish homologue klf2 is required for heart valve formation55. A non-coding variant that causes over-expression of Grhl2 in mice led to orofacial cleft phenotypes56.

Since DNA binding activity plays a crucial role in TF function, we searched for TF DNA binding domain missense variants in known developmental disorder genes. We developed a pipeline to filter for missense variants in the TF DNA binding domains based on a set of 62 DNA binding domain classes in the Pfam database57 (Supplementary Data 8) and the protein domain prediction model HMMer58. Without filtering for disease genes, there were 46 and 11 de novo TF DNA binding domain missense variants in the CHD and OFC cohorts, respectively (Supplementary Data 9); with filtering, there were 17 and 13 DNA binding domain missense variants, respectively (Table 3). Some of these variants are in CHD, OFC, and other developmental disorder genes that are mostly haploinsufficient, characterized by low LOEUF estimates (Table 3). Based on PrimateAI, they were all predicted to be pathogenic (PrimateAI rank score > 0.8). We hypothesize that these variants damage the TFs’ DNA binding activity.

Table 3 De novo TF DNA binding domain missense variants in genes associated with CHD, OFC, or developmental disorder genes

Discussion

We aggregated multiple parent-offspring trio cohorts of CHD and OFC to detect 46 and 22 genes, respectively, with enrichment of damaging de novo variants and inherited pLoF variants. 17 were novel candidate CHD genes and 8 were novel candidate OFC genes (Supplementary Data 3 and 4). It is challenging to unambiguously define a list of known CHD and OFC genes. We defined them based on the list from the Seidman lab and Genomics England PanelApp, but they may still miss some genes with supporting evidence in the literature. In fact, some ‘novel’ genes have support from existing literature, while others do not (Table 1). This means that further studies are needed to validate which of these are true disease genes for CHD and OFC. Moreover, increasing the sample sizes of family trio cohorts will be key to discovering more candidate disease genes; however, thousands of family trios are still insufficient to discover most of the disease genes. As there are likely hundreds of genes causing these congenital anomalies, the likelihood of observing multiple cases with damaging de novo variants in the same gene is still low. Kaplanis and colleagues estimated that sequencing hundreds of thousands of parent-offspring trios will be necessary to reach sufficient power to detect about 80% of developmental disorder genes based on analysis of de novo variants14.

We evaluated the performance of multiple missense variant effect prediction methods to prioritize candidate pathogenic variants. While most methods were able to discriminate de novo missense variants in CHD genes found in CHD patients from those found in unaffected children, PrimateAI was the most effective and led to the identification of more de novo missense variants. De novo variant data from unaffected siblings in autism studies was critical for this analysis, as these siblings are most likely not CHD patients (Fig. 1). We also provide a list of de novo missense variants in known and candidate CHD and OFC genes as a resource (Supplementary Data 6 and 7).

Incorporating the number of inherited pLoF variants in cases and controls into the enrichment analyses identified some significant genes that did not reach significance with de novo variants alone. We point out that the control samples for the variant enrichment analysis were unaffected parents from the autism cohort. They may carry autism-related variants, but they are not likely to carry pathogenic variants in CHD or OFC genes. Despite aggregating data from multiple studies, there were many genes with no inherited pLoF variants, and many of them were only significant in the “de novo only” analysis. These genes were generally shorter than the genes identified uniquely by the “de novo and case/control” analysis, suggesting that gene length affects which model may be better powered. Applying both the “de novo only” and the “de novo and case/control” models is therefore useful for detecting as many candidate disease genes as possible.

In this study, we analyzed only pLoF and missense variants. Copy number variations (CNVs) that increase or decrease gene dosage also play a role in congenital anomalies59. Therefore, calling de novo and inherited CNVs in the affected children and testing their enrichment in individual genes will increase the chance of disease gene discovery in future studies22. In terms of inherited variants, we considered only pLoF variants because the effects of missense variants are more difficult to predict. Including inherited missense variants in the model may potentially increase power, but ensuring high precision in pathogenicity prediction will be essential.

The contribution of inherited variants to risk of CHD and OFC is consistent with earlier reports. For instance, Sifrim et al. described that non-syndromic CHD cases had contributions of inherited damaging variants from unaffected parents, suggesting incomplete penetrance17. Our work does not directly address the reasons for incomplete penetrance, but understanding any genetic or environmental factors affecting the penetrance would be important for patient diagnosis and prognosis. One possible explanation is mosaicism, as a multiplex family study on OFC hypothesized60.

In this study, TFs were enriched among the identified genes. We identified many de novo TF DNA binding domain missense variants in genes that were significantly enriched in CHD or OFC or that are known CHD, OFC, or developmental disorder genes. The identified variants were predicted to be pathogenic by PrimateAI. Some of the TFs with TF DNA binding domain variants in the CHD cohort are known to cause other developmental disorders, such as congenital diaphragmatic hernia and congenital anomalies of kidneys and urinary tract61,62. These results suggest that these TFs are pleiotropic and that other mutations in them may cause heart defects in some patients.

Variant effect prediction tools are only moderately accurate, at best, in distinguishing TF DNA binding domain missense variants with altered DNA binding activity16. Future studies using DNA binding assays, such as protein binding microarrays (PBMs)8,63, will be needed to determine which of the identified CHD and OFC variants alter DNA binding activity and in what manner they do so.

There are several limitations to this study. First, because we directly aggregated de novo variant data from multiple studies, multiple pipelines were used to call these variants. Calling de novo variants altogether would be a more involved but more consistent approach to identify de novo variants for meta-analysis. Second, we relied on computational predictions to stratify missense variants by their importance. Even though we selected PrimateAI as a tool that best discriminated de novo variants found in CHD patients from those found in unaffected siblings in an autism cohort, it is by no means a perfect tool. Moreover, not all de novo coding variants found in CHD patients are pathogenic, nor are all of those found in unaffected children benign. However, this kind of comparison is frequently made to evaluate variant effect prediction tools20,46. Lastly, as noted above, definitive lists of ‘known’ CHD and OFC genes are elusive. However, we followed the classification of CHD genes from a lab leading the efforts to find CHD genes (i.e., Seidman lab) and that of OFC genes from a panel curated by Genomics England that is running a large-scale genomic study on rare disease (i.e., 100,000 Genomes Project). All in all, future studies can benefit from a harmonized de novo variant database, more accurate variant effect prediction tools, and well-curated disease gene lists.

Methods

Genetic data from family trio cohorts of CHD and OFC

We aggregated multiple datasets to maximize statistical power to detect disease genes. For CHD, we downloaded de novo variant data from two exome-sequencing studies12,17 and one genome-sequencing study18. We also downloaded the list of rare inherited pLoF variants from Jin et al.12. We identified overlapping samples by comparing the set of de novo variants from each proband. After removing duplicate samples, there were a total of 3835 unique family trios.
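Identifying overlapping samples from their de novo variant sets could look like the sketch below. It assumes, for illustration only, that two probands sharing an identical non-empty set of de novo variants are the same individual. The exact matching rule used in the study is not stated here, and the identifiers and variant tuples are hypothetical.

```python
def deduplicate_probands(variants_by_proband):
    """Keep the first proband seen for each distinct non-empty de novo variant set.

    variants_by_proband: dict mapping a proband ID to a set of
    (chrom, pos, ref, alt) tuples. Probands with no de novo variants cannot be
    matched this way and are always kept.
    """
    seen = {}
    kept = []
    for proband_id, variants in variants_by_proband.items():
        key = frozenset(variants)
        if key and key in seen:
            continue  # same variant set as an earlier proband: treat as a duplicate sample
        if key:
            seen[key] = proband_id
        kept.append(proband_id)
    return kept
```

A stricter rule, for example requiring at least two shared variants before calling a duplicate, would reduce the chance of merging distinct probands who happen to share a single recurrent de novo variant.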

For OFC, we analyzed data from 4 cohorts from the Gabriella Miller Kids First program64 and an additional cohort from the United Kingdom19 (Supplementary Data 1). For the 4 cohorts from Kids First, their database of Genotypes and Phenotypes (dbGaP) IDs were phs001168 (n = 376 trios), phs001997 (n = 404 trios), phs001420 (n = 262 trios), and phs002595 (n = 351 trios). Of these, data from 374 European (phs001168), 267 Colombian (phs001420), and 116 Taiwanese (phs001997) family trios were analyzed in Bishop et al.13. For these 757 family trios, we downloaded a list of de novo variants in probands from Table S3 of Bishop et al.13. The 133 trios in phs001997 that were not analyzed in Bishop et al. were from the African Craniofacial Anomalies Network, and the 351 trios in phs002595 were from a cohort in the Philippines. We analyzed data from these 484 trios using the genotype calls provided by the Kids First data portal. Lastly, we downloaded a list of de novo variants in probands of 603 family trios in the United Kingdom from Table S4 of Wilson et al.19.

We considered unaffected siblings or parents of probands in an autism cohort as controls without CHD or OFC. We downloaded de novo variant data from unaffected siblings of probands in an autism cohort34 to compare variant effect predictions (Fig. 1). Lastly, we downloaded heterozygous pLoF variants from 3578 unaffected parents in an autism cohort as controls12,37, which we used to test enrichment of putatively damaging variants (Fig. 2). We analyzed all genetic variants based on the GRCh38 human reference genome. The downloaded variants in hg19 were lifted over to the GRCh38 human reference. We performed variant calling and curation just for the 484 OFC samples not included in Bishop et al.13.

Identifying de novo variants and rare inherited variants in the OFC cohorts

For the samples not included in Bishop et al.13 (n = 484), we applied different strategies for identifying de novo predicted loss-of-function (pLoF) and missense variants. pLoF variants consist of nonsense, splice site, and frameshift variants. Since trio-based variant calls (i.e., VCF files) provided in the Gabriella Miller Kids First data portal64 showed false negatives in de novo single nucleotide variants (SNVs), we derived de novo SNVs based on the gVCF files of the three family members in each trio.

For SNVs, which span pLoF and missense variants, we identified de novo variants by (1) merging gVCF files of the three family members in each trio using GLNexus65 with the ‘gatk’ setting and (2) using slivar24 to filter for variants that are heterozygous in the proband but homozygous reference in the two parents. We further filtered for those with the maximum population allele frequency in gnomAD36 of less than 5 × 10⁻⁵, no homozygous individuals in gnomAD, and TOPMed66 allele frequency of less than 5 × 10⁻⁵.
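The genotype and frequency criteria above amount to a simple per-variant predicate. The sketch below expresses them in plain Python for clarity. It is not slivar syntax, and the dictionary keys (`proband_gt`, `gnomad_max_af`, `gnomad_nhomalt`, `topmed_af`) are hypothetical field names chosen for this example.

```python
MAX_ALLELE_FREQUENCY = 5e-5  # threshold stated in the text


def is_rare(variant):
    """Apply the gnomAD and TOPMed rarity criteria to one annotated variant (a dict)."""
    return (
        variant["gnomad_max_af"] < MAX_ALLELE_FREQUENCY
        and variant["gnomad_nhomalt"] == 0  # no homozygous individuals in gnomAD
        and variant["topmed_af"] < MAX_ALLELE_FREQUENCY
    )


def is_de_novo_candidate(variant):
    """Heterozygous in the proband, homozygous reference in both parents, and rare."""
    return (
        variant["proband_gt"] == "0/1"
        and variant["father_gt"] == "0/0"
        and variant["mother_gt"] == "0/0"
        and is_rare(variant)
    )
```

A production pipeline would also handle phased genotypes, missing calls, and quality metrics. Those details are left out to keep the logic of the filter visible.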

In contrast, we used de novo insertions and deletions (indels) identified in the trio-based variant calls. For indel pLoF variants, we (1) downloaded the family-based VCF files from the Gabriella Miller Kids First data portal and (2) filtered for variants that are heterozygous in the proband but homozygous reference in the two parents using slivar24. The variants were filtered for having genotype quality (GQ) greater than 20 and read depth (DP) greater than 6. We also filtered for those with a maximum population allele frequency in gnomAD36 of less than 5 × 10⁻⁵, no homozygous individuals in gnomAD, and TOPMed66 allele frequency of less than 5 × 10⁻⁵.

For all OFC samples, we identified rare inherited pLoF variants by filtering for variants with a heterozygous genotype in the proband and only one parent with a heterozygous genotype using the family-based VCF files from the Gabriella Miller Kids First data portal. We also filtered for those with the maximum population allele frequency in gnomAD36 of less than 5 × 10⁻⁵, no homozygous individuals in gnomAD, and TOPMed66 allele frequency of less than 5 × 10⁻⁵.
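The inheritance criterion for pLoF variants (heterozygous proband, exactly one heterozygous parent) can be sketched as a small helper that also reports which parent transmitted the allele. The function name and the genotype string encodings are assumptions for this example, and the rarity filters described above would be applied separately.

```python
HETEROZYGOUS = {"0/1", "1/0", "0|1", "1|0"}


def transmitting_parent(proband_gt, father_gt, mother_gt):
    """Return 'father' or 'mother' if exactly one parent is heterozygous, else None.

    Returns None when the proband is not heterozygous, when neither parent is
    heterozygous (a de novo candidate rather than an inherited variant), or
    when both parents are heterozygous.
    """
    if proband_gt not in HETEROZYGOUS:
        return None
    father_het = father_gt in HETEROZYGOUS
    mother_het = mother_gt in HETEROZYGOUS
    if father_het == mother_het:
        return None
    return "father" if father_het else "mother"
```

Recording the transmitting parent makes it straightforward to cross-reference parental phenotype, which matters when assessing incomplete penetrance.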

Comparison of missense variant effect prediction methods

We compared the performance of ten missense variant effect prediction methods: PrimateAI20, PolyPhen225, MVP26, PROVEAN27, CADD28, MetaSVM29, REVEL30, VEST431, MPC32, and MutationAssessor33. These tools’ scores for missense variants were accessed from the database for nonsynonymous SNPs’ functional predictions (dbNSFP) version 4.567. To compare scores across tools easily, we utilized the rank scores, which range from 0 to 1 and correspond to the percentile among missense variants. We compared the tools’ performance in discriminating de novo missense variants in CHD genes (Supplementary Data 2) found in CHD patients from those found in unaffected children. There were a total of 3835 CHD family trios12,17,18 and 2179 control family trios34 that carried 113 and 26 de novo variants in CHD genes, respectively. We computed the area under the curve for the receiver operating characteristic (ROC) and precision-recall curves to compare their performance.
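As an illustration of the comparison metrics, the sketch below computes the ROC area under the curve as the probability that a patient-derived variant outranks a control-derived variant, together with the precision at a given rank-score threshold. It is a simplified stand-in for a full ROC and precision-recall analysis; the function names and the scores in the usage note are hypothetical.

```python
def roc_auc(case_scores, control_scores):
    """AUC as the fraction of (case, control) pairs in which the case scores higher.

    Ties count as half a win, which matches the area under the empirical ROC curve.
    """
    wins = sum(
        (case > control) + 0.5 * (case == control)
        for case in case_scores
        for control in control_scores
    )
    return wins / (len(case_scores) * len(control_scores))


def precision_at(threshold, case_scores, control_scores):
    """Fraction of variants at or above the threshold that come from patients."""
    true_positives = sum(score >= threshold for score in case_scores)
    false_positives = sum(score >= threshold for score in control_scores)
    called = true_positives + false_positives
    return true_positives / called if called else float("nan")
```

For toy inputs `cases = [0.95, 0.8, 0.6]` and `controls = [0.7, 0.3]`, five of the six case-control pairs favor the case, giving an AUC of 5/6, and the precision at a threshold of 0.75 is 1.0 because no control variant reaches it.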

Next, we determined the appropriate PrimateAI score thresholds for potentially damaging variants. Across all genes, we estimated the enrichment of de novo missense variants for CHD families and control families in each of the 5% score bins. The expected number of de novo missense variants per family was the sum of all missense mutation rates (~0.68 per generation). Then, we bootstrap-resampled the CHD and control families to establish the respective 95% confidence intervals of the enrichment estimates. Ultimately, based on Fig. 1B, we selected PrimateAI ≥ 0.9 and 0.75 ≤ PrimateAI < 0.9 as the two missense variant groups, MisA and MisB.
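A percentile bootstrap over families, as described above, might be sketched like this. The function name, the number of resamples, the seed, and the input format (one count per family of de novo missense variants falling in a given score bin) are assumptions made for illustration.

```python
import random


def bootstrap_enrichment_ci(per_family_counts, expected_per_family, n_boot=1000, seed=0):
    """95% percentile bootstrap interval for observed/expected enrichment in one score bin.

    per_family_counts:   number of de novo missense variants in the bin for each family.
    expected_per_family: expected count per family, e.g. 0.68 * 0.05 for a 5% bin.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_families = len(per_family_counts)
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(per_family_counts, k=n_families)  # sample families with replacement
        estimates.append(sum(resample) / (n_families * expected_per_family))
    estimates.sort()
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[int(0.975 * n_boot) - 1]
    return lower, upper
```

Resampling whole families rather than individual variants keeps any within-family correlation intact in each bootstrap replicate.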

Testing enrichment of damaging de novo and rare inherited variants

We used the TADA model21 to detect genes with an enrichment of potentially damaging variants (i.e., predicted loss-of-function (pLoF), missense with PrimateAI20 rank score ≥ 0.9 (MisA), or missense with PrimateAI rank score of at least 0.75 and below 0.9 (MisB)) from the number of de novo variants and mutation rate estimates. We derived the per-gene mutation rates for MisA, MisB, and pLoF based on estimates in Samocha et al.35 and gnomAD36. We multiplied the per-gene missense mutation rate μ_Mis,gene by 0.1 and 0.15 to derive μ_MisA,gene and μ_MisB,gene, respectively, as all possible MisA and MisB variants are expected to be 0.1 and 0.15 of all missense variants. We added the per-gene nonsense, splice site, and frameshift mutation rates to derive the per-gene pLoF mutation rates.
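The rate derivation and the score classes above translate directly into code. In this sketch the function names and the example rates in the usage note are hypothetical; the fractions 0.1 and 0.15 and the thresholds 0.9 and 0.75 are taken from the text.

```python
MISA_FRACTION = 0.10  # rank score >= 0.9 covers the top 10% of possible missense variants
MISB_FRACTION = 0.15  # 0.75 <= rank score < 0.9 covers the next 15%


def per_gene_rates(mu_missense, mu_nonsense, mu_splice, mu_frameshift):
    """Split a gene's missense rate into MisA and MisB and sum the pLoF components."""
    return {
        "MisA": mu_missense * MISA_FRACTION,
        "MisB": mu_missense * MISB_FRACTION,
        "pLoF": mu_nonsense + mu_splice + mu_frameshift,
    }


def classify_missense(rank_score):
    """Return 'MisA', 'MisB', or None for a PrimateAI rank score."""
    if rank_score >= 0.9:
        return "MisA"
    if rank_score >= 0.75:
        return "MisB"
    return None
```

For instance, a gene with an illustrative missense rate of 2e-5 would be assigned MisA and MisB rates of about 2e-6 and 3e-6, respectively.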

We applied TADA to 17,488 autosomal genes with LOEUF estimates in gnomAD36. We performed the test once, including inherited pLoF variants, and once without to compare the effect of inherited variants. Multiple hypothesis correction across all genes was applied using the q value estimates. We considered genes with q value < 0.1 and gnomAD’s LOEUF < 1 to be significant. We excluded genes with LOEUF ≥ 1 because it suggests that there is negligible selective constraint against predicted-loss-of-function variants in those genes.
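The significance rule and the comparison between the two TADA settings can be sketched as below. The result format (a dictionary of per-gene `q_value` and `loeuf` entries) and the function names are assumptions for this example, not the output format of the TADA software.

```python
def significant_genes(results, q_max=0.1, loeuf_max=1.0):
    """Genes with q value below q_max and LOEUF below loeuf_max, sorted by name.

    results: dict mapping gene symbol to {"q_value": float, "loeuf": float}
    (hypothetical structure).
    """
    return sorted(
        gene
        for gene, stats in results.items()
        if stats["q_value"] < q_max and stats["loeuf"] < loeuf_max
    )


def setting_specific(denovo_only, denovo_and_case_control):
    """Genes significant in only one of the two analysis settings."""
    a, b = set(denovo_only), set(denovo_and_case_control)
    return sorted(a - b), sorted(b - a)
```

Running `significant_genes` once per setting and passing both lists to `setting_specific` yields the two setting-specific gene sets whose coding sequence lengths are compared in the Results.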

Identifying TF DNA binding domain variants in candidate disease genes

We identified disease-associated TF genes based on a list of 1639 TFs49. Then, we determined the location of the DNA binding domains using a set of 62 DNA binding domain classes in the Pfam database version 35.057 (Supplementary Data 8) and the protein domain prediction model HMMer58. We considered only canonical transcripts and amino acid sequences based on GENCODE68 in annotating whether the missense variants fall within a DNA binding domain.
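Once domain boundaries are known for a canonical protein sequence, checking whether a missense variant falls inside a DNA binding domain is an interval lookup. The sketch below assumes that domain coordinates have already been obtained (for example from a domain search) as 1-based inclusive amino acid ranges and that variants are written in three-letter protein notation. The function names and the coordinates in the usage note are hypothetical and do not describe any real gene.

```python
import re

# Matches simple missense notation such as "p.Arg142His" (three-letter amino acid codes).
# Note: a nonsense change written as "p.Arg142Ter" would also match, so variants
# should already be restricted to missense consequences before this step.
_MISSENSE = re.compile(r"^p\.[A-Z][a-z]{2}(\d+)[A-Z][a-z]{2}$")


def missense_position(protein_change):
    """Extract the amino acid position from a protein change, or None if it does not parse."""
    match = _MISSENSE.match(protein_change)
    return int(match.group(1)) if match else None


def in_dna_binding_domain(protein_change, domains):
    """True if the variant position lies within any (start, end) domain interval."""
    position = missense_position(protein_change)
    if position is None:
        return False
    return any(start <= position <= end for start, end in domains)
```

For example, with a made-up domain spanning residues 138 to 197, `in_dna_binding_domain("p.Arg142His", [(138, 197)])` is true while a variant at residue 20 is not.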

Compliance with ethical regulations

This study complied with all relevant ethical regulations including the Declaration of Helsinki. The research described in this study did not require review by an institutional review board (IRB). We did not directly interact with patients, nor did we collect patient data for this study. All the dbGaP studies, from which we obtained data that we analyzed, state that IRB approval is not required. The data from Jin et al., Richter et al., Sifrim et al., Wilson et al., Bishop et al., etc., are all from the supplementary tables published and made freely, publicly available as part of those papers.