Genome-wide profiling of highly similar paralogous genes using HiFi sequencing

Chen, Xiao; Baker, Daniel; Dolzhenko, Egor; Devaney, Joseph M.; Noya, Jessica; Berlyoung, April S.; Brandon, Rhonda; Hruska, Kathleen S.; Lochovsky, Lucas; Kruszka, Paul; Newman, Scott; Farrow, Emily; Thiffault, Isabelle; Pastinen, Tomi; Kasperaviciute, Dalia; Gilissen, Christian; Vissers, Lisenka; Hoischen, Alexander; Berger, Seth; Vilain, Eric; Délot, Emmanuèle; Eberle, Michael A.

doi:10.1038/s41467-025-57505-2

Article
Open access
Published: 08 March 2025

Nature Communications volume 16, Article number: 2340 (2025)

Abstract

Variant calling is hindered in segmental duplications by sequence homology. We developed Paraphase, a HiFi-based informatics method that resolves highly similar genes by phasing all haplotypes of paralogous genes together. We applied Paraphase to 160 long (>10 kb) segmental duplication regions across the human genome with high (>99%) sequence similarity, encoding 316 genes. Analysis across five ancestral populations revealed highly variable copy numbers of these regions. We identified 23 paralog groups with exceptionally low within-group diversity, where extensive gene conversion and unequal crossing over contribute to highly similar gene copies. Furthermore, our analysis of 36 trios identified 7 de novo SNVs and 4 de novo gene conversion events, 2 of which are non-allelic. Finally, we summarized extensive genetic diversity in 9 medically relevant genes previously considered challenging to genotype. Paraphase provides a framework for resolving gene paralogs, enabling accurate testing in medically relevant genes and population-wide studies of previously inaccessible genes.

Introduction

Population-wide whole-genome sequencing (WGS) studies based on short reads have enabled comprehensive characterization of variants, particularly small variants, in ~90% of the human genome^1,2,3. However, there exist difficult regions and variant classes that remain largely inaccessible to short reads^4,5. A large portion of these difficult regions occur within segmental duplications (SDs)^6,7, where high sequence similarity between copies of SDs results in ambiguous mapping of short reads. In addition to difficulty mapping reads within SDs, high sequence similarity promotes unequal crossing over, resulting in hotspots for copy number variants (CNVs), as well as high rates of gene conversion⁸. These high rates of gene conversion promote sequence exchange between SDs^9,10, further increasing the errors in read alignment. While short-read-based computational methods have been developed to improve the genotyping capability and diagnostic yield in segmental duplications^5,11,12,13,14,15, comprehensive variant calling in these regions remains a challenge, and SDs have not been studied at the population level by the current high throughput technologies.

Many medically relevant genes fall into SDs where traditional alignment-based analysis has not been demonstrated to reliably detect the full diversity of these regions. For example, spinal muscular atrophy is caused by variants in the SMN1 gene, which has a highly similar paralog SMN2¹⁶. Another disease, 21-Hydroxylase-Deficient Congenital Adrenal Hyperplasia (21-OHD CAH), is caused by variants in the CYP21A2 gene¹⁷, which resides in a 30 kb tandem repeat called the RCCX module and has a pseudogene CYP21A1P. Variants in the OPN1LW/OPN1MW gene cluster, which contains 1-5 copies of OPN1LW or its paralog OPN1MW, cause color vision deficiencies^18,19. To date, these medically important SD-encoded genes are studied with multi-step analyses including a combination of low or medium-throughput assays such as multiplex ligation-dependent probe amplification (MLPA), amplicon sequencing, or long-range PCR followed by Sanger sequencing to detect copy number changes or individual variants^20,21. These tests are sometimes limited to a few known variants and may be prone to false negatives if the patient has a pathogenic variant that is not part of the test. There remains a need to fully characterize these genes both for research and clinical testing.

Recently, researchers have begun to study SDs using long-read sequencing. High quality phased assemblies have been generated for a number of samples^22,23,24 using PacBio HiFi and Oxford Nanopore Technologies (ONT) long reads, revealing the sequences of SDs and providing biological and evolutionary insights^7,10. However, SDs with multiple copies of highly similar regions are prone to assembly errors, especially in regions of extended sequence homology^10,25. Alternatively, we developed a phasing approach, Paraphase, that identifies haplotypes of genes and their paralogs, and demonstrated its ability to accurately resolve the highly similar SMN1/SMN2 region²⁶. That study was limited to one difficult region, leaving a need for a genome-wide demonstration.

Here we extended Paraphase to analyze 316 paralogous genes that fall into 160 groups of SD regions across the genome, including many medically relevant genes that were traditionally considered challenging to genotype. Applying Paraphase to 259 individuals from five ancestral populations, we showed the genetic diversity of these regions across populations in copy number (CN) and sequence variation. We note that some of these regions show exceptionally low diversity between genes and paralogs, signaling selective pressures and/or high rates of gene conversion. Finally, we studied the Paraphase derived haplotypes for these paralogous genes in 36 parent-offspring trios and identified 11 de novo events, among which 7 are de novo single nucleotide variants (SNVs) and 4 are consistent with de novo gene conversion events.

Results

Profiling 160 gene-coding paralogous regions with Paraphase

Paraphase resolves highly similar genes by realigning HiFi reads to one, most relevant, gene chosen to represent all copies of the gene and its paralogs. We call this gene the archetype gene. For example, to study SMN1 and SMN2, we realign all of the reads that are aligned to either SMN1 or SMN2 to just SMN1 because that is the fully functional copy. The aligned reads are then phased into haplotypes for variant calling (Fig. 1a). For this study, we identified 160 paralogous regions >10 kb in length with >99% sequence similarity that were found between two and four times in GRCh38 (Supplementary Data 1, also see Methods). These paralogous regions encode 316 genes in total (excluding pseudogenes). In this paper, the term “paralog group” is used to describe a set of genes that are highly similar in sequence and are analyzed by Paraphase as a group.

**Fig. 1: Paraphase design and the regions it analyzes.**

Among the 160 paralog groups (Supplementary Data 1), 149 have genes located on the same chromosome, with 16 in tandem (less than 10 kb apart). To quantify the impact of sequence homology on read alignments, we examined the mapping qualities (MAPQs) in these regions in both short-read and long-read sequence data (Fig. 1b, see Methods) with two metrics: base MAPQ (median MAPQ of all reads overlapping a position) and summary MAPQ (median of base MAPQs across all positions of a paralog group). For short-read data, MAPQs are extremely low (76.4% of the paralog groups have a summary MAPQ ≤ 20, and 98.8% of the paralog groups have some bases with a base MAPQ ≤ 20), indicating the difficulty of mapping short reads to these regions. Even for long-read data, 44.1% of the regions have a summary MAPQ ≤ 20 and 75.2% have some bases with a base MAPQ ≤ 20. For long-read data, there are 25 (15.6%) paralog groups where the summary MAPQ is 60 and there are no bases with a base MAPQ ≤ 20. These are either regions where the sequence similarity is high but the homology extends less than the HiFi read length of ~15-20 kb, or regions which have lower sequence similarity but are included in Paraphase for fusion calling (see Methods). Paraphase analysis can still improve the performance in these high MAPQ regions because: 1) even reads with high MAPQ can be misaligned due to reference genome artifacts, common CNVs, and high rates of gene conversion, 2) gene fusions are hard to detect because split alignments are unlikely to happen in regions of homology and 3) lower MAPQs will be expected in data with shorter read length, such as in HiFi hybrid capture data.

Validation of Paraphase calls

We first validated Paraphase variant calls in 8 medically relevant genes in 21 disease or carrier samples identified using orthogonal methods such as MLPA and Sanger sequencing (Table 1 and Supplementary Data 2, also see Methods). For this validation, Paraphase correctly identified all 30 of the clinical variants in these samples.

Table 1 Validated Paraphase calls in medically relevant genes

We also examined haplotypes called by Paraphase in 36 trios. Among 14,734 full-length haplotypes called in the probands (also requiring full-length haplotypes called in the two parents of each trio), 14,679 (99.6%) agreed exactly with one of the haplotypes observed in the parents. Upon examining the 55 inconsistent cases, 43 (0.29%) are not fully supported by reads and were thus determined to be Paraphase errors (switch errors or missed haplotypes in the parent). The remaining 12 (0.081%) inconsistent haplotypes are fully supported by reads and thus represent true recombination or de novo events (see “Identification of de novo mutations and gene conversion” section).

In addition, we compared Paraphase variant calls against high quality diploid assemblies in 47 HPRC samples (See Supplementary Notes). Paraphase calls were consistent with the assembly in the majority of paralog groups (defining the assembly as the ground truth, 82.4% of paralog groups have >95% recall and >95% precision) (Supplementary Fig. 1). To better understand these differences, we manually reviewed the discrepant calls and determined that the reduced precision and recall in some paralog groups is mostly due to errors in the assembly (Supplementary Figs. 2 and 3, also see Supplementary Notes) such as individual base errors and misassembly of the paralog group leading to incorrect copy numbers.

The performance of Paraphase will depend on the sequence divergence between haplotypes, read length, and sequencing depth. We performed a variety of simulated experiments to estimate how these factors impact the accuracy of Paraphase (See Supplementary Notes). Paraphase maintained high haplotyping accuracy with minimums of 10 kb read length, 10X per-haplotype sequencing depth and 0.05% sequence divergence (Supplementary Fig. 4, also see Supplementary Notes).

Copy number variability of paralog groups

We calculated the distribution of the total CN (defined by the number of unique haplotypes, adjusted by depth) of each paralog group in 259 unrelated individuals across five ancestral populations. We assessed the variability of the total CN by the percentage of individuals having the mode CN. For this study, we say that a paralog group has low CN variability if more than 90% of the individuals have the mode CN value, and medium CN variability if between 80% and 90% of the individuals have the mode CN value. Conversely, a paralog group is defined as having high CN variability if less than 80% of individuals have the mode CN value. Based on these definitions, 79 of the paralog groups have low CN variability, 17 have medium CN variability and 64 have high CN variability (Fig. 2a, Supplementary Data 1). Additionally, 25.6% (41/160) of the paralog groups had significant (Chi-squared test, p < 0.05, with Bonferroni correction) deviations between ancestral populations (Supplementary Fig. 5).
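The CN variability classes defined above can be sketched in a few lines of Python. This is a minimal illustration, not Paraphase code: the function name `classify_cn_variability` is hypothetical, and treating exactly 80% and exactly 90% as "medium" is an assumption, since the text only says "between 80% and 90%".

```python
from collections import Counter


def classify_cn_variability(total_cns):
    """Classify a paralog group by the share of individuals with the mode CN.

    total_cns: one total CN value per individual.
    Returns (mode CN, fraction of individuals with the mode CN, label).
    """
    if not total_cns:
        raise ValueError("need at least one individual")
    mode_cn, mode_count = Counter(total_cns).most_common(1)[0]
    mode_fraction = mode_count / len(total_cns)
    if mode_fraction > 0.90:        # more than 90% share the mode CN
        label = "low"
    elif mode_fraction >= 0.80:     # assumption: boundaries count as medium
        label = "medium"
    else:                           # less than 80% share the mode CN
        label = "high"
    return mode_cn, mode_fraction, label
```

For instance, a group in which 95 of 100 individuals have a total CN of four would be labeled as having low CN variability.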

**Fig. 2: Distribution of the total CN of each paralog group across populations.**

The CN variability can give us a general understanding of the population-level “accuracy” of the reference genome (in this case GRCh38). For example, an SD with two paralogous regions would always have a CN of four in our analysis if the reference is correct and generalizes across the population. Likewise, a paralog group where every individual has a CN of two in the population is likely a false SD in the reference. We identified 22 paralog groups where more than 95% of all individuals have a total CN of two (Supplementary Table 1, Fig. 2b). This suggests that duplications are rare in the population for these genes and these SDs could represent errors in the reference genome. Nineteen of these paralog groups overlap regions that were classified as false duplications in GRCh38 based on the CHM13 T2T assembly²⁷. Three of these paralog groups (DEFB109B and its SD, CNTNAP3/CNTNAP3C and POTED and its SD; see Supplementary Table 1) were not identified as false duplications in GRCh38 by the CHM13 T2T assembly, although they are only present once in the CHM13 assembly (Paraphase analysis of CHM13 data shown in Supplementary Fig. 6).

Conversely, we found three paralog groups (CTAGE8/CTAGE9, OR2A1/OR2A42, and RIMBP3/RIMBP3B/RIMBP3C) that are truly CN variable regions in the population (Fig. 2a) but were attributed to false duplications in GRCh38 due to missing genes in the CHM13 T2T assembly²⁷ (Paraphase analysis of CHM13 data shown in Supplementary Fig. 7). Population CN analysis provides a more accurate assessment of false duplications in a reference genome than the analysis of a single individual.

Paralog groups with exceptionally low within-group diversity

Paraphase identified 159,795 haplotypes from the 160 paralog groups in the 259 samples. Extensive gene conversion and unequal crossing over can result in highly similar gene copies that can no longer be separated into different genes based on sequence alone. For example, SMN1 and SMN2 are different in sequence in Exons 7–8 but are indistinguishable in Exons 1-6 indicating that gene conversion may be much more common in Exons 1-6 than in Exons 7-8²⁶. Thus, a principal component analysis (PCA) of haplotype sequences in Exons 7-8 can separate SMN1 haplotypes and SMN2 haplotypes into distinct clusters, but a PCA of haplotypes in Exons 1-6 does not differentiate the SMN1 haplotypes from SMN2 haplotypes (Supplementary Fig. 8).

To identify paralog groups with low within-group diversity, we developed a metric based on the divergence between individual haplotypes (see Methods). For example, in a paralog group with a gene and a paralog, the gene will evolve independently from the paralog in the absence of gene conversion. This means that the divergence will be lower between two copies of the gene (i.e. gene-gene divergence) or two copies of the paralog in the absence of any selective pressures. Conversely, the gene-paralog divergence will be significantly higher (Supplementary Fig. 9). Increasing rates of gene conversion and unequal crossing over will tend to make the gene more similar to the paralog and thus drive the gene-paralog divergence down (Supplementary Fig. 9).
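The contrast between gene-gene and gene-paralog divergence can be illustrated with a small sketch. It assumes that divergence is simply the fraction of mismatching positions between two equal-length, already aligned haplotype sequences; the actual metric is defined in Methods and may differ. All function names are hypothetical.

```python
from itertools import combinations
from statistics import median


def divergence(hap_a, hap_b):
    """Fraction of positions that differ between two aligned haplotypes."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must be aligned to the same length")
    return sum(a != b for a, b in zip(hap_a, hap_b)) / len(hap_a)


def median_within(haplotypes):
    """Median pairwise divergence among copies of the same gene."""
    return median(divergence(a, b) for a, b in combinations(haplotypes, 2))


def median_between(gene_haps, paralog_haps):
    """Median divergence over all gene-paralog haplotype pairs."""
    return median(divergence(g, p) for g in gene_haps for p in paralog_haps)
```

Under this simplification, a paralog group with little gene conversion would show `median_between` well above `median_within`, while frequent gene conversion would pull the two values together.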

We identified 23 paralog groups (termed low-diversity paralog groups) where the within-group sequence divergence is comparable to the general allelic sequence divergence (See Methods). Among these, 4 are on chrY, 11 on chrX, and 8 on autosomes (Table 2). It is often not easy to assign haplotypes of a paralog group to individual genes without prior knowledge of how genes and paralogs differ from each other. However, among the 23 low-diversity paralog groups, there are five where the phased haplotypes extend into non-homologous regions so that we can assign haplotypes to genes based on their flanking sequence: AMY1A/AMY1B/AMY1C (Fig. 3a), CTAG1A/CTAG1B, BOLA2/BOLA2B, SULT1A3/SULT1A4 and SLX1A/SLX1B (BOLA2/BOLA2B, SULT1A3/SULT1A4, and SLX1A/SLX1B are three paralog groups in tandem and genotyped as one region by Paraphase). PCA of the haplotype sequences shows that haplotypes of the different genes of the same group do not form distinct clusters and thus are indistinguishable from each other by sequence alone (Fig. 3b–d, also see Supplementary Fig. 10).

Table 2 Paralog groups with low within-group diversity

**Fig. 3: Paralog groups with low within-group diversity.**

The 23 low-diversity paralog groups show two different patterns in their genomic structure, CN variability and evolutionary history (Table 2). Those on autosomes have high CN variability and many are human-specific duplications (See Discussion). Conversely, low-diversity paralog groups on sex chromosomes mostly have low CN variability, are arranged in palindrome structures and evolutionarily conserved, i.e. all genes are present in other primates where they are also in palindromes^28,29. Additionally, there are 3 palindromic paralog groups on chrX where the genes and paralogs are in tandem so Paraphase can identify copies on the same chromosome. In these 3 paralog groups, the gene copies in cis are more similar to each other than those in trans (Fig. 3e), suggesting that gene conversion between arms of palindromes happens more frequently in cis (possibly through forming a hairpin structure) than in trans.

Identification of de novo mutations and gene conversion

In 36 parent-offspring trios we identified 12 events (6 paternal and 6 maternal) where a haplotype in the proband is different from the corresponding haplotype in the parent (Supplementary Figs. 11–12). Eleven of these are de novo events where the proband haplotype differs from the parent haplotype by one SNV. Among these, 7 are de novo SNVs (not observed in either parent) and 4 are products of gene conversion (observed in a parent but on a different haplotype). Among the gene conversion cases, 2 are non-allelic (an example is shown in Fig. 4), 1 is allelic and 1 could be either allelic or non-allelic. Among the 11 de novo events, 4 are intergenic, 6 are in introns, and 1 is in an exon (synonymous). The remaining case of the 12 events is a hybrid haplotype between two haplotypes from the same parent, which could arise through equal or unequal crossing over (inconclusive without longer range phasing information in the parent due to the high copy number of the paralog group) (Supplementary Fig. 12).
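The distinction between a de novo SNV and a gene conversion product can be expressed as a simple decision rule. The sketch below is illustrative only: it assumes haplotypes are equal-length strings over the same aligned sites, and the function name and labels are hypothetical. It does not attempt to separate allelic from non-allelic conversion or to detect hybrid haplotypes.

```python
def classify_proband_haplotype(proband, parental_haplotypes):
    """Label a proband haplotype relative to all haplotypes of both parents."""
    if proband in parental_haplotypes:
        return "inherited"

    def mismatches(parent_hap):
        return [i for i, (a, b) in enumerate(zip(proband, parent_hap)) if a != b]

    # Closest parental haplotype is taken as the haplotype of origin.
    closest = min(parental_haplotypes, key=lambda h: len(mismatches(h)))
    diff = mismatches(closest)
    if len(diff) != 1:
        return "complex"  # e.g. a hybrid of two parental haplotypes
    site = diff[0]
    # A base seen on another parental haplotype suggests gene conversion;
    # a base seen in neither parent suggests a de novo SNV.
    seen = any(h[site] == proband[site] for h in parental_haplotypes)
    return "gene_conversion_candidate" if seen else "de_novo_snv"
```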

**Fig. 4: De novo non-allelic gene conversion in a trio.**

Resolving medically relevant paralogous genes

As a demonstration of how Paraphase can be used to study paralog groups in the population, we examined variant and haplotype frequencies across populations in three known medically relevant paralog groups, CYP21A2/CYP21A1P, PMS2/PMS2CL, and OPN1LW/OPN1MW.

Variants in CYP21A2 cause 21-Hydroxylase-Deficient Congenital Adrenal Hyperplasia (21-OHD CAH)¹⁷. CYP21A2 resides in a 30 kb tandem repeat called the RCCX module that includes its pseudogene, CYP21A1P, together with two other pairs of paralogs, C4A/C4B and TNXB/TNXA^17,20 (Fig. 5a). This region is susceptible to gene conversion¹⁷, as well as deletions and duplications of the RCCX module resulting in CN changes and disease-causing hybrid genes between CYP21A2 and CYP21A1P. Here, complete haplotype sequences resolved by Paraphase allowed us to fully reveal the genetic diversity in this region. The total CN of RCCX is highly variable across populations (Fig. 5b) with 38.2% of individuals having a CNV. Figure 5a shows examples of samples with various CNs. In addition, we identified a duplication allele (Fig. 5a, bottom panel) that carries a copy of CYP21A1P, a copy of CYP21A2 with a stop-gain variant Q319X, and a second functional copy of CYP21A2. We found that this allele is at 1-2% frequency in the populations (Supplementary Table 2) and, without phasing the full region, could be misidentified as a pathogenic allele due to the presence of Q319X. Researchers have previously found that individuals with Q319X frequently have a duplication of CYP21A2, which complicates CYP21A2 testing³⁰. Paraphase can distinguish a CYP21A2 + CYP21A2(Q319X) allele vs. a CYP21A2(Q319X) allele.

**Fig. 5: Population results in CYP21A2, PMS2 and OPN1LW/OPN1MW.**

Pathogenic variants in PMS2 cause Lynch syndrome³¹. In its last few exons (Exons 12-15), PMS2 has high sequence similarity to its pseudogene PMS2CL, and gene conversion and unequal crossing overs are known to promote sequence exchange between the two genes^32,33,34. We examined the haplotypes of PMS2 and PMS2CL in the population. In Exon 15, the sequences of PMS2 and PMS2CL are indistinguishable from each other, lacking any differentiating variants (Fig. 5c, also see PCA in Supplementary Fig. 13). For example, a commonly considered PMS2CL-specific variant³⁵, NM_000535.7:c.*92dup, is in 72.8% of PMS2CL haplotypes and 31.1% of PMS2 haplotypes. We found that gene conversion happens between PMS2 and PMS2CL occasionally in Exon 12 and frequently in Exons 13-14 (See Methods and Fig. 5c, d). Interestingly, our analysis showed more evidence of gene conversion in individuals of African ancestry, and more than 75% of African PMS2/PMS2CL haplotypes are partially or fully converted (Fig. 5d).

OPN1LW and its paralog OPN1MW are responsible for red-green color vision deficiencies and other vision conditions such as blue cone monochromacy (BCM)¹⁸. The region is arranged in a gene array and only the first two genes in the array are expressed¹⁸. Paraphase identifies all copies of the repeat, assigns genes to OPN1LW or OPN1MW, and identifies the first two copies in the array on each chromosome. Figure 5e shows an allele with one copy each of OPN1LW and OPN1MW (top panel), and an allele that only has OPN1LW in the first two copies of the array (bottom panel), leading to color vision deficiencies. Our analysis showed that the total CN of this paralog group is highly variable among populations (Fig. 5f). Allele frequencies are summarized in Supplementary Table 3, including alleles that cause color vision deficiencies.

In addition to the three paralog groups described above, we also summarized population results for other medically relevant genes, including SMN1/SMN2 (spinal muscular atrophy¹⁶, Supplementary Fig. 14), STRC (hereditary hearing loss and deafness³⁶, Supplementary Fig. 15), HBA1/HBA2 (Alpha thalassemia³⁷, Supplementary Table 4), IKBKG (Incontinentia Pigmenti³⁸, Supplementary Table 5), the CFH gene cluster (CFH/CFHR1/CFHR2/CFHR3/CFHR4) (atypical hemolytic uremic syndrome³⁹ and age-related macular degeneration^40,41, Supplementary Table 6) and GBA (Gaucher and Parkinson’s disease^42,43, Supplementary Table 7). Together, we identified medically relevant variants in at least one of these 9 paralog groups in 75%, 49.6%, 45.8%, 52.2%, and 17.4% of individuals of African, European, Admixed American, South Asian and East Asian ancestries, respectively.

Discussion

In this paper, we applied Paraphase to 160 segmental duplication regions where large (>10 kb) regions of high (>99%) sequence similarity exist between genes and their paralogs. By phasing reads from the same paralog group together, Paraphase can recover misaligned reads and correctly resolve genes together with their highly similar paralogs/pseudogenes. This method enables high-throughput CN detection and genotyping of SD-encoded genes with only HiFi data at standard WGS depth (30X).

An important benefit of the paralog group-centered approach is that it is not influenced by the CN difference between an individual and the reference. This approach can work even when the CN of a paralog group does not agree with the reference genome in most individuals of the population, such as in the case of false segmental duplications in GRCh38. In addition, by calling variants against the same reference gene within a paralog group, Paraphase outputs gene copies that can be easily compared against each other, allowing us to perform within-group divergence analysis, as well as to detect de novo mutations including gene conversion events between paralogs. Our analysis of 36 trios identified 7 de novo SNVs and 4 de novo gene conversion events, demonstrating the power of long-read sequencing in detecting de novo variations^44,45, particularly in previously inaccessible regions of the genome.

Among regions analyzed by Paraphase, we observed that paralog groups on sex chromosomes are more CN invariable (93.1% of paralog groups on sex chromosomes have low CN variability vs. 39.7% on autosomes) and have drastically lower within-group diversity (median pairwise haplotype divergence 0.00033 on sex chromosomes vs. 0.00187 on autosomes, p-value 4.179e-11). This could be related to the fact that most paralog groups on sex chromosomes are arranged in palindrome structures (86.2% vs. 16.8% on autosomes). Unequal crossing over between arms of a palindrome results in inversions and does not change the copy number. Arm-to-arm gene conversion is known to occur frequently to prevent sex chromosomes from accumulating deleterious mutations in the absence of homologous chromosomes^29,46, and could contribute to the low within-group diversity.

We identified 23 paralog groups with extremely low within-group diversity (Table 2), where genes and their paralogs are as similar as alleles from the same gene. Consistent with most other paralog groups on sex chromosomes, the low-diversity paralog groups on sex chromosomes are all arranged in palindromes and mostly have low CN variability. For these paralog groups, both the genes and the palindrome structure are evolutionarily conserved in other primates. The low-diversity paralog groups on autosomes, however, are not arranged in palindrome structure and mostly have high CN variability. Interestingly, many of these paralog groups are duplicated exclusively in the human lineage, with positive selection detected, e.g. AMY1A/AMY1B/AMY1C⁴⁷, BOLA2/BOLA2B^48,49 and SULT1A3/SULT1A4⁵⁰. It is possible that recent duplication and positive selection and/or gene conversion could play a role in the evolution of these genes, preventing sequence divergence and maintaining an elevated gene dosage in humans. Beyond paralog groups with low within-group diversity throughout the entire gene body, one future direction is to extend this analysis to identify local low-diversity regions resulting from gene conversions, such as the gene conversion found in Exons 1-6 of SMN1/SMN2 (Supplementary Fig. 8), and Exon 15 of PMS2 (Fig. 5c, Supplementary Fig. 13).

The SD-encoded genes presented in this paper were previously inaccessible to population-wide genomic analyses and hence are largely missing from variant annotation databases such as gnomAD³, creating hurdles in variant interpretation. Here we provide a database (https://zenodo.org/doi/10.5281/zenodo.10909886) of variant allele frequencies collected from the population samples used in this paper. This annotation resource can be further expanded as more HiFi data are generated and analyzed with Paraphase.

One limitation of Paraphase is that currently it focuses on paralog groups with 2-4 paralogous genes in GRCh38 and does not include other highly similar genes with even higher CNs. This excludes 79 genes that fall into SDs in our analysis. Nevertheless, Paraphase can be customized to analyze user-specific regions, allowing new targets to be added in the future.

Paraphase, combined with HiFi long reads, provides a single framework for resolving paralogous genes. In medically important genes challenged by pseudogenes or paralogs, Paraphase helps enable more accurate testing to detect pathogenic variants, thus bringing us one step closer to consolidating the numerous currently offered genetic tests into a single test. Furthermore, in previously inaccessible and less studied genes, population-wide sequencing-based analysis with Paraphase will facilitate the discovery of novel gene-disease associations.

Methods

Paraphase: HiFi-based caller for highly similar paralogous genes

Paraphase is designed to work with both PacBio HiFi WGS and target enrichment data. Paraphase resolves a group of highly similar genes by extracting HiFi reads aligned to any member of the paralog group, realigning them to the archetype gene, and phasing them into haplotypes, followed by variant calling on each haplotype²⁶ (Fig. 1a). Briefly, haplotype phasing is achieved by identifying variant sites in the target region and reducing reads to just the bases at variant sites, followed by graph-based assembly of simplified reads into haplotypes. Reads are then assigned to the haplotypes that they correspond to. For each haplotype, variant calling (fully phased) is done by taking the consensus sequence at each position across reads that have been assigned to the same haplotype and reporting the base differences between the consensus and the reference. Realigning all reads from all genes of the same paralog group to one gene bypasses the error-prone process of aligning reads to multiple similar regions. This framework enables all copies of the paralog group, including genes and their paralogs or pseudogenes, to be examined for variants and annotated for functional status.
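Two of the steps described above, reducing reads to the bases at variant sites and calling variants from a per-haplotype consensus, can be sketched as follows. This is a simplified illustration under strong assumptions and is not the Paraphase implementation: reads are gapless strings already aligned to the full reference with `-` marking uncovered positions, the `min_support` threshold is hypothetical, and the graph-based assembly of simplified reads into haplotypes is omitted.

```python
from collections import Counter


def variant_sites(reference, reads, min_support=2):
    """Positions where a non-reference base is seen in >= min_support reads."""
    sites = []
    for pos, ref_base in enumerate(reference):
        alt = Counter(r[pos] for r in reads if r[pos] not in (ref_base, "-"))
        if alt and alt.most_common(1)[0][1] >= min_support:
            sites.append(pos)
    return sites


def reduce_read(read, sites):
    """Keep only the bases of a read at the variant sites."""
    return "".join(read[pos] for pos in sites)


def call_variants(reference, haplotype_reads):
    """Report consensus bases that differ from the reference for one haplotype."""
    variants = []
    for pos, ref_base in enumerate(reference):
        bases = Counter(r[pos] for r in haplotype_reads if r[pos] != "-")
        if not bases:
            continue  # no read of this haplotype covers the position
        consensus = bases.most_common(1)[0][0]
        if consensus != ref_base:
            variants.append((pos, ref_base, consensus))
    return variants
```

Because every read is realigned to the same archetype gene, variants from all copies of a paralog group end up in one coordinate system, which is what makes the per-haplotype consensus directly comparable across copies.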

When two paralogous regions are in tandem, Paraphase uses read-based phasing to further phase gene haplotypes into alleles, i.e. gene copies on the same chromosome, by grouping haplotypes that have an overlapping set of supporting reads. For example, for the RCCX module demonstrated in Fig. 5a, reads are grouped by the haplotypes they originate from and haplotypes of the same color (green or purple) represent those from the same allele.
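The grouping of haplotypes into alleles by shared supporting reads can be sketched with a small union-find. This is an illustrative sketch, not the Paraphase implementation: the function name, the `min_shared` threshold, and the input layout (haplotype name mapped to a set of read names) are all assumptions.

```python
def group_haplotypes_into_alleles(hap_reads, min_shared=1):
    """Group haplotypes that share at least min_shared supporting reads.

    hap_reads: dict mapping haplotype name -> set of supporting read names.
    Returns a sorted list of groups, each a sorted list of haplotype names.
    """
    parent = {hap: hap for hap in hap_reads}

    def find(hap):
        # Follow parent links to the group representative (path halving).
        while parent[hap] != hap:
            parent[hap] = parent[parent[hap]]
            hap = parent[hap]
        return hap

    haps = sorted(hap_reads)
    for i, hap_a in enumerate(haps):
        for hap_b in haps[i + 1:]:
            if len(hap_reads[hap_a] & hap_reads[hap_b]) >= min_shared:
                parent[find(hap_a)] = find(hap_b)  # merge the two groups

    groups = {}
    for hap in haps:
        groups.setdefault(find(hap), []).append(hap)
    return sorted(groups.values())
```

A read that spans two neighboring copies of a tandem repeat supports a haplotype in each copy, so the shared read links those haplotypes to the same chromosome.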

Gene fusions between paralogs are called by detecting haplotypes whose flanking regions (upstream and downstream of the paralogous region) are consistent with two different genes. Fusion breakpoints are called by detecting a switch in bases at the paralogous sequence variant (PSV) sites that have been carefully curated previously.
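Breakpoint detection from a switch in PSV bases can be illustrated as follows. This is a minimal sketch under stated assumptions, not the Paraphase caller: the PSV list layout, the function names, and the requirement of exactly one switch along the haplotype are hypothetical simplifications.

```python
def psv_labels(hap_bases, psvs):
    """Label each PSV site on a haplotype as gene-like or paralog-like.

    hap_bases: dict mapping position -> base observed on the haplotype.
    psvs: list of (position, gene_base, paralog_base), sorted by position.
    Sites matching neither base, or not covered, are skipped.
    """
    labels = []
    for pos, gene_base, paralog_base in psvs:
        base = hap_bases.get(pos)
        if base == gene_base:
            labels.append((pos, "gene"))
        elif base == paralog_base:
            labels.append((pos, "paralog"))
    return labels


def find_breakpoint(labels):
    """Return the PSV interval containing a single gene/paralog switch, else None."""
    switches = [
        (labels[i][0], labels[i + 1][0])
        for i in range(len(labels) - 1)
        if labels[i][1] != labels[i + 1][1]
    ]
    return switches[0] if len(switches) == 1 else None
```

The breakpoint can only be localized to the interval between the two PSV sites that flank the switch, so its resolution depends on PSV density.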

Within Paraphase, there are a few gene-specific callers for medically relevant genes. These callers use gene-specific information during analysis, for example, known sequence differences between genes and paralogs/pseudogenes. In addition, these callers produce gene-specific output information such as hybrid gene structures and known pathogenic variants, including large difficult-to-call structural variants.

For a single WGS sample, across 160 paralog groups, Paraphase requires 4 Gb memory and typically completes analysis in 90 minutes (1 thread) or 15 minutes (16 threads). All results presented in this paper were generated using Paraphase V3.1.2.

Genome-wide identification of highly similar genes for analysis by Paraphase

We extracted 19,394 Ensembl protein-coding genes (>20 kb sequences centered on each gene, adding flanking sequences for shorter genes) and aligned them against GRCh38 (ALT contigs excluded, pseudoautosomal regions (PARs) masked) with blastn⁵¹ (Version 2.16.0+, command “-outfmt = ‘6 std qlen slen’ -word_size 50 -max_hsps 30”). We selected genes that had alignment matches >10 kb in length and >99% in sequence similarity as candidate paralog groups. Genes that do not meet the criteria are considered sufficiently different from the rest of the genome, and they are likely to have no alignment problems and should be genotyped correctly by the standard HiFi workflow. The majority of genes have zero paralogs, and the remaining ones vary in the number of paralogs (Supplementary Fig. 16). Among genes with three or fewer paralogs, which represent the majority of genes with paralogs, we incorporated 155 groups of regions into Paraphase. In addition, we included genes impacted by shorter homology or lower sequence similarity, where gene fusions are highly medically relevant yet difficult to call by conventional SV callers due to homology, including HBA1/HBA2 (Alpha thalassemia), GBA1/GBAP1 (Gaucher and Parkinson’s disease), CYP2D6/CYP2D7 (pharmacogenomics), CYP11B1/CYP11B2 (Glucocorticoid-remediable aldosteronism) and CFH/CFHR1/CFHR2/CFHR3/CFHR4 (atypical hemolytic uremic syndrome and age-related macular degeneration). In total, Paraphase analyzes 160 groups of regions (Supplementary Data 1), which encode 316 genes in total (pseudogenes are not included).
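The length and identity thresholds above can be applied to tabular blastn output with a short filter. The sketch assumes tab-separated output whose columns are the standard 12 fields of BLAST format 6 followed by `qlen` and `slen`; the function name is hypothetical, and the step of excluding each gene's hit to its own locus is not shown.

```python
import csv
import io

# Standard "6 std" columns followed by the two extra fields requested.
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
    "qlen", "slen",
]


def candidate_hits(tsv_text, min_length=10_000, min_identity=99.0):
    """Keep alignments longer than min_length with identity above min_identity."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), fieldnames=BLAST_COLUMNS, delimiter="\t"
    )
    hits = []
    for row in reader:
        if int(row["length"]) > min_length and float(row["pident"]) > min_identity:
            hits.append(row)
    return hits
```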

Calculation of base MAPQ and summary MAPQ

We selected 20 samples from five ancestral populations with both Illumina (data downloaded from the 1000 Genomes Project¹) and HiFi WGS data available to assess alignment MAPQs. For each base position, we calculated a “base MAPQ”, defined as the median of the MAPQs of all reads from all 20 samples aligned to that position. We then calculated a “summary MAPQ” for each paralog group, defined as the median of the base MAPQ values across all base positions of that paralog group.
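The two-level median can be written down in a few lines of Python. This is a sketch of the definitions only: it assumes the MAPQs of all reads covering each position have already been pooled across samples into a dictionary, and the names base_mapq, summary_mapq and the toy values are hypothetical.

```python
from statistics import median

def base_mapq(mapqs_by_position):
    """Median MAPQ at each base position, pooling reads from all samples."""
    return {pos: median(mapqs) for pos, mapqs in mapqs_by_position.items()}

def summary_mapq(base_values):
    """Median of the base MAPQ values across all positions of a paralog group."""
    return median(base_values.values())

# Toy input: position -> MAPQs of every read aligned there (all samples pooled).
pooled = {100: [60, 60, 3, 0, 60], 101: [0, 0, 1, 60], 102: [60, 60, 60]}
per_base = base_mapq(pooled)
group_summary = summary_mapq(per_base)
```

Taking medians at both levels keeps a handful of confidently placed reads, or a few well-aligned positions, from masking a region where most alignments are ambiguous.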

Validation against clinical samples with known variants

The validation set comprised 21 clinical samples (disease or carrier samples) carrying 30 pathogenic variants in 8 disease-causing genes; these variants had previously been validated by orthogonal methods⁵², such as MLPA and Sanger sequencing (Table 1 and Supplementary Data 2). Data (standard 30X HiFi WGS) for these clinical samples were collected from Radboud University Medical Center, GeneDx, and the Genomics Research to Elucidate the Genetics of Rare diseases (GREGoR) Consortium. In addition, we used 36 trios to examine the consistency of haplotypes called in probands vs. parents. Among the 36 trios, 8 were collected from Radboud University Medical Center⁴⁵, 10 were from the 100,000 Genomes Project and 18 were from the GREGoR Consortium.

Comparison against assemblies

We compared Paraphase variant calls against high-quality diploid assemblies in 47 HPRC samples²³. Contigs from assemblies were aligned with Minimap2⁵³ (Version 2.26-r1175, command “-x asm5”) against Paraphase target regions to identify haplotypes of each paralog group in the assembly. Variants in the assembly were called by identifying base differences from the reference with a custom Python script. Segments from contigs were matched against Paraphase-called haplotypes by matching variants. Assembly-based variant calls were compared against Paraphase variant calls among matching haplotypes between Paraphase and the assembly. For this analysis, the assembly was defined as the ground truth. Thus, for Paraphase-called haplotypes that do not have a matching segment in the assembly, all variants called on those haplotypes were considered false positives by Paraphase. For segments in the assembly that do not have a matching Paraphase haplotype, all variants on those segments were considered false negatives by Paraphase.
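The counting rules above, including the treatment of unmatched haplotypes and unmatched assembly segments, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the custom script used in the study: haplotypes are represented as sets of (position, alternate allele) tuples, the haplotype-to-segment matching is supplied as a precomputed dictionary, and all names are hypothetical.

```python
def precision_recall(paraphase_haps, assembly_segs, matches):
    """
    paraphase_haps: dict of haplotype name -> set of called variants.
    assembly_segs:  dict of segment name -> set of assembly (truth) variants.
    matches:        dict of Paraphase haplotype name -> matching segment name.
    """
    tp = fp = fn = 0
    for name, calls in paraphase_haps.items():
        if name in matches:
            truth = assembly_segs[matches[name]]
            tp += len(calls & truth)
            fp += len(calls - truth)
            fn += len(truth - calls)
        else:
            fp += len(calls)  # unmatched haplotype: every call is a false positive
    matched_segments = set(matches.values())
    for name, truth in assembly_segs.items():
        if name not in matched_segments:
            fn += len(truth)  # unmatched segment: every variant is a false negative
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: one matched pair, one unmatched haplotype, one unmatched segment.
paraphase = {"h1": {(100, "A"), (200, "T"), (300, "G")}, "h2": {(150, "C")}}
assembly = {"s1": {(100, "A"), (200, "T"), (250, "C")}, "s2": {(400, "G"), (500, "A")}}
precision, recall = precision_recall(paraphase, assembly, {"h1": "s1"})
```

In the toy example there are 2 true positives, 2 false positives (one on h1, one from the unmatched h2) and 3 false negatives (one on s1, two from the unmatched s2).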

Simulation analysis

We conducted simulation experiments to assess haplotyping accuracy when varying read length, haplotype depth and paralog divergence. Reference sequences were extracted from GRCh38 corresponding to Paraphase target regions (plus 50 kb flanking sequences). Five paralogous copies of each gene were simulated at different divergence levels (0.01%, 0.05%, 0.1%, and 0.15%) with Mutation-Simulator⁵⁴ (Version 3.0.2, command “-sn” for specified divergence levels). HiFi reads of different lengths (5 kb, 10 kb, and 15 kb) and depths (5X, 10X and 15X haplotype depth) were then simulated from those simulated gene copies with a workflow that consists of PBSIM3⁵⁵ (Version 3.0.4, command “--strategy wgs --method qshmm --qshmm QSHMM-RSII.model --pass-num 8”), which simulated CLR reads, and the ccs software (Version 6.4.0, default parameters, https://github.com/PacificBiosciences/ccs), which generated consensus HiFi reads. For each target region, Paraphase was run on simulated data, with reads from all paralogous copies merged. Paraphase variant calls were compared against variants simulated by Mutation-Simulator. Precision and recall were calculated in the same way as described above in the “Comparison against assemblies” section.
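To make the simulated parameter space concrete, the short Python sketch below enumerates every combination of divergence, read length and haplotype depth listed above. The last field is a back-of-envelope quantity added here for illustration, not a statistic from the study: the expected number of simulated variant sites covered by one read, assuming mutations are spread uniformly along the sequence and the read is full length.

```python
from itertools import product

DIVERGENCE = [0.0001, 0.0005, 0.001, 0.0015]  # 0.01%, 0.05%, 0.1%, 0.15%
READ_LENGTH = [5_000, 10_000, 15_000]         # bp
HAP_DEPTH = [5, 10, 15]                       # X per haplotype

grid = [
    {
        "divergence": d,
        "read_length": n,
        "haplotype_depth": x,
        # Illustrative assumption: uniform mutation placement, full-length read.
        "expected_variant_sites_per_read": round(d * n, 6),
    }
    for d, n, x in product(DIVERGENCE, READ_LENGTH, HAP_DEPTH)
]
```

Under these assumptions a 5 kb read at 0.01% divergence covers about 0.5 variant sites on average, while a 15 kb read at 0.15% covers about 22.5, which gives a rough sense of how much phasing information each setting offers.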

Population samples

For frequency calculations, we used HiFi WGS data from 259 unrelated individuals from five ancestral populations (113 Europeans, 52 Africans, 48 Admixed Americans, 23 South Asians, and 23 East Asians), collected from the Human Pangenome Reference Consortium (HPRC)²²,²³, the 100,000 Genomes Project, Radboud University Medical Center⁴⁵, Genomics Research to Elucidate the Genetics of Rare diseases (GREGoR) Consortium, and Genomic Answers for Kids (GA4K) at Children’s Mercy Kansas City.

Paralog groups with low within-group diversity

We searched for paralog groups within which the haplotype diversity is comparable to the general sequence diversity between alleles of the same gene. To profile the average allelic sequence divergence, we used Paraphase to phase haplotype sequences for 600 randomly selected genes (400 on autosomes and 200 on chrX) that fall outside of SDs, i.e. each individual is expected to have two copies of a gene when there is no CNV, in the same set of 259 individuals. For each of the randomly selected genes, we calculated pairwise divergence values between any two haplotypes of the same gene among individuals of the same ancestral population. For each pair of haplotypes, sequence divergence was calculated by dividing the number of base differences (SNVs only) by the length of the region. Focusing only on SNVs in non-homopolymer regions, 90% of the pairwise sequence divergence values among haplotypes were lower than 0.00156 for autosomal genes (Supplementary Fig. 17), and for chrX genes, 90% of the pairwise sequence divergence values among haplotypes were lower than 0.00101 (Supplementary Fig. 17), consistent with a lower mutation rate on chrX⁵⁶. We then calculated pairwise sequence divergence values between any two haplotypes of the same paralog group targeted by Paraphase. To obtain candidate paralog groups where the within-group sequence divergence is as low as the general allelic diversity, we required that 90% of the pairwise sequence divergence values among haplotypes of the same paralog group be lower than 0.00156 and 0.00101 for autosomal and sex chromosome paralog groups, respectively. For both randomly selected genes and Paraphase paralog groups, only haplotypes from the same ancestral population were compared for pairwise divergence calculation. For paralog groups reported in Table 2, we further filtered out paralog groups where the homology does not span the entire gene, i.e. partial paralogs.

Principal component analysis (PCA) within a paralog group was conducted on SNV sites identified across all haplotypes of the paralog group using the prcomp function in R.
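The pairwise divergence criterion used in this section can be sketched in Python as follows. The sketch assumes each haplotype is given as a dictionary of SNV position to alternate allele relative to a shared reference, that homopolymer sites have already been excluded, and that all haplotypes passed in come from one ancestral population. The function names and the toy haplotypes are hypothetical.

```python
from itertools import combinations

def pairwise_divergence(hap_a, hap_b, region_length):
    """Number of SNV sites at which two haplotypes differ, divided by region length."""
    sites = set(hap_a) | set(hap_b)
    diffs = sum(hap_a.get(pos) != hap_b.get(pos) for pos in sites)
    return diffs / region_length

def fraction_below(haplotypes, region_length, threshold):
    """Fraction of haplotype pairs whose divergence is lower than threshold."""
    values = [pairwise_divergence(a, b, region_length)
              for a, b in combinations(haplotypes, 2)]
    return sum(v < threshold for v in values) / len(values)

def is_low_diversity(haplotypes, region_length, threshold=0.00156, min_fraction=0.9):
    """True if at least min_fraction of pairs fall below the divergence threshold."""
    return fraction_below(haplotypes, region_length, threshold) >= min_fraction

# Toy haplotypes over a 1 kb region; the autosomal threshold then allows 1 difference.
a, b, c, d = {}, {10: "A"}, {10: "A", 20: "T"}, {10: "A"}
frac_all = fraction_below([a, b, c, d], 1_000, 0.00156)
low_all = is_low_diversity([a, b, c, d], 1_000)
low_subset = is_low_diversity([a, b, d], 1_000)
```

With four haplotypes, five of the six pairs fall below the threshold (about 83%), so the group fails the 90% requirement; dropping the most divergent haplotype makes every pair pass. For sex chromosome paralog groups the threshold argument would be set to 0.00101.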

PMS2 gene conversion calling

PMS2 and PMS2CL haplotype sequences in the Exon 12 region (chr7:5,981,000-5,985,000, GRCh38) and the Exons 13-14 region (chr7:5,977,000-5,980,000) were separated into two main groups (PMS2-like and PMS2CL-like) based on the PCA (Supplementary Fig. 13). Variants (called against the PMS2 reference sequence) that are present in >95% of the PMS2CL-like group and <5% of the PMS2-like group were selected as signature sites for calling gene conversion. Gene conversion is called when a PMS2CL haplotype has <20% of the signature variants or when a PMS2 haplotype has >80% of the signature variants. A partial conversion in Exons 13-14 is a special haplotype common in the population, called based on a predefined subset of the signature variants (Fig. 5c, middle panel, first haplotype).
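The signature-site selection and the calling thresholds described above can be sketched in Python. This is an illustration of the stated rules, not the Paraphase implementation: haplotypes are represented as sets of variant identifiers, the gene assignment of each haplotype (PMS2 or PMS2CL) is assumed to be known from outside the region, the partial-conversion haplotype is not handled, and all names are hypothetical.

```python
def signature_sites(pms2_like, pms2cl_like, high=0.95, low=0.05):
    """Variants present in >95% of PMS2CL-like and <5% of PMS2-like haplotypes."""
    def freq(variant, group):
        return sum(variant in hap for hap in group) / len(group)
    candidates = set().union(*pms2cl_like)
    return {v for v in candidates
            if freq(v, pms2cl_like) > high and freq(v, pms2_like) < low}

def is_gene_converted(haplotype, gene, signature):
    """Apply the <20% (PMS2CL) and >80% (PMS2) signature-variant rules."""
    if not signature:
        raise ValueError("no signature sites available")
    share = len(haplotype & signature) / len(signature)
    if gene == "PMS2CL":
        return share < 0.20
    if gene == "PMS2":
        return share > 0.80
    raise ValueError(f"unexpected gene label: {gene}")

# Toy groups for illustration; real groups come from the PCA clustering.
pms2cl_group = [{"v1", "v2", "v3", "v4", "v5"} for _ in range(3)]
pms2_group = [set(), {"v9"}, set()]
sig = signature_sites(pms2_group, pms2cl_group)
```

A PMS2 haplotype carrying all five signature variants would be called as converted, whereas one carrying a single signature variant (20%) would not.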

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

HiFi WGS data for HPRC samples are documented in https://github.com/orgs/human-pangenomics/repositories and can be downloaded from https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=working/ for analysis.

Code availability

Paraphase is implemented in Python and is freely available for download from GitHub (https://github.com/PacificBiosciences/paraphase). It can also be installed via Conda or PyPI.

References

The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Mandelker, D. et al. Navigating highly homologous genes in a molecular diagnostic setting: a resource for clinical next-generation sequencing. Genet. Med. J. Am. Coll. Med. Genet. 18, 1282–1289 (2016).
Ebbert, M. T. W. et al. Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight. Genome Biol. 20, 97 (2019).
Bailey, J. A., Yavor, A. M., Massa, H. F., Trask, B. J. & Eichler, E. E. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res 11, 1005–1017 (2001).
Vollger, M. R. et al. Segmental duplications and their variation in a complete human genome. Science 376, eabj6965 (2022).
Chen, J.-M., Cooper, D. N., Chuzhanova, N., Férec, C. & Patrinos, G. P. Gene conversion: mechanisms, evolution and human disease. Nat. Rev. Genet. 8, 762–775 (2007).
Dumont, B. L. Interlocus gene conversion explains at least 2.7 % of single nucleotide variants in human segmental duplications. BMC Genomics 16, 1–11 (2015).
Vollger, M. R. et al. Increased mutation and gene conversion within human segmental duplications. Nature 617, 325–334 (2023).
Alkan, C. et al. Personalized copy-number and segmental duplication maps using next-generation sequencing. Nat. Genet. 41, 1061 (2009).
Sudmant, P. H. et al. Diversity of human copy number variation and multicopy genes. Science 330, 641–646 (2010).
Handsaker, R. E. et al. Large multiallelic copy number variations in humans. Nat. Genet. 47, 296–303 (2015).
Prodanov, T. & Bansal, V. Robust and accurate estimation of paralog-specific copy number for duplicated genes using whole-genome sequencing. Nat. Commun. 13, 3221 (2022).
Steyaert, W. et al. Systematic analysis of paralogous regions in 41,755 exomes uncovers clinically relevant variation. Nat. Commun. 14, 6845 (2023).
Lunn, M. R. & Wang, C. H. Spinal muscular atrophy. Lancet Lond. Engl. 371, 2120–2133 (2008).
Merke, D. P. & Auchus, R. J. Congenital adrenal hyperplasia due to 21-hydroxylase deficiency. N. Engl. J. Med. 383, 1248–1261 (2020).
Neitz, J. & Neitz, M. The genetics of normal and defective color vision. Vision Res. 51, 633–651 (2011).
Neitz, M. & Neitz, J. Intermixing the OPN1LW and OPN1MW Genes Disrupts the Exonic Splicing Code Causing an Array of Vision Disorders. Genes 12, 1180 (2021).
Pignatelli, D. et al. The complexities in genotyping of congenital adrenal hyperplasia: 21-hydroxylase deficiency. Front. Endocrinol. 10, 432 (2019).
Haer-Wigman, L. et al. Diagnostic analysis of the highly complex OPN1LW/OPN1MW gene cluster using long-read sequencing and MLPA. Npj Genomic. Med. 7, 1–10 (2022).
Wang, T. et al. The Human Pangenome Project: a global resource to map genomic diversity. Nature 604, 437–446 (2022).
Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
Gao, Y. et al. A pangenome reference of 36 Chinese populations. Nature 619, 112–121 (2023).
Porubsky, D. et al. Gaps and complex structurally variant loci in phased genome assemblies. Genome Res. 33, 496 (2023).
Chen, X. et al. Comprehensive SMN1 and SMN2 profiling for spinal muscular atrophy analysis using long-read PacBio HiFi sequencing. Am. J. Hum. Genet. 110, 240–250 (2023).
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
Trombetta, B. & Cruciani, F. Y chromosome palindromes and gene conversion. Hum. Genet. 136, 605–619 (2017).
Jackson, E. K. et al. Large palindromes on the primate X Chromosome are preserved by natural selection. Genome Res. 31, 1337–1352 (2021).
Lekarev, O. et al. Erroneous prenatal diagnosis of congenital adrenal hyperplasia owing to a duplication of the CYP21A2 gene. J. Perinatol. J. Calif. Perinat. Assoc. 33, 76–78 (2013).
ten Broeke, S. W. et al. Cancer risks for PMS2-Associated lynch syndrome. J. Clin. Oncol. 36, 2961–2968 (2018).
Hayward, B. E. et al. Extensive gene conversion at the PMS2 DNA mismatch repair locus. Hum. Mutat. 28, 424–430 (2007).
van der Klift, H. M. et al. Quantification of sequence exchange events between PMS2 and PMS2CL provides a basis for improved mutation scanning of lynch syndrome patients. Hum. Mutat. 31, 578–587 (2010).
Ganster, C. et al. Functional PMS2 hybrid alleles containing a pseudogene-specific missense variant trace back to a single ancient intrachromosomal recombination event. Hum. Mutat. 31, 552–560 (2010).
Wernstedt, A. et al. Improved multiplex ligation-dependent probe amplification analysis identifies a deleterious pms2 allele generated by recombination with crossover between PMS2 and PMS2CL. Genes. Chromosomes Cancer 51, 819–831 (2012).
Yokota, Y. et al. Frequency and clinical features of hearing loss caused by STRC deletions. Sci. Rep. 9, 4408 (2019).
Farashi, S. & Harteveld, C. L. Molecular basis of α-thalassemia. Blood Cells. Mol. Dis. 70, 43–53 (2018).
Cammarata-Scalisi, F., Fusco, F. & Ursini, M. V. Incontinentia pigmenti. Actas Dermosifiliogr 110, 273–278 (2019).
Zipfel, P. F. et al. Deletion of complement factor H-related genes CFHR1 and CFHR3 is associated with atypical hemolytic uremic syndrome. PLoS Genet 3, e41 (2007).
Klein, R. J. et al. Complement factor H polymorphism in age-related macular degeneration. Science 308, 385 (2005).
Hughes, A. E. et al. A common CFH haplotype, with deletion of CFHR1 and CFHR3, is associated with lower risk of age-related macular degeneration. Nat. Genet. 38, 1173–1177 (2006).
Hruska, K. S., LaMarca, M. E., Scott, C. R. & Sidransky, E. Gaucher disease: mutation and polymorphism spectrum in the glucocerebrosidase gene (GBA). Hum. Mutat. 29, 567–583 (2008).
Sidransky, E. et al. Multicenter analysis of glucocerebrosidase mutations in Parkinson’s disease. N. Engl. J. Med. 361, 1651–1661 (2009).
Noyes, M. D. et al. Familial long-read sequencing increases yield of de novo mutations. Am. J. Hum. Genet. 109, 631–646 (2022).
Kucuk, E. et al. Comprehensive de novo mutation discovery with HiFi long-read sequencing. Genome Med. 15, 34 (2023).
Swanepoel, C. M., Gerlinger, E. R. & Mueller, J. L. Large X-linked palindromes undergo arm-to-arm gene conversion across mus lineages. Mol. Biol. Evol. 37, 1979–1985 (2020).
Perry, G. H. et al. Diet and the evolution of human amylase gene copy number variation. Nat. Genet. 39, 1256–1260 (2007).
Nuttle, X. et al. Emergence of a Homo sapiens-specific gene family and chromosome 16p11.2 CNV susceptibility. Nature 536, 205–209 (2016).
Giannuzzi, G. et al. The human-specific BOLA2 duplication modifies iron homeostasis and anemia predisposition in chromosome 16p11.2 autism individuals. Am. J. Hum. Genet. 105, 947–958 (2019).
Butcher, N. J. et al. Sulfotransferase 1A3/4 copy number variation is associated with neurodegenerative disease. Pharmacogenomics J 18, 209–214 (2018).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
Höps, W. et al. HiFi long-read genomes for difficult-to-detect, clinically relevant variants. Am. J. Hum. Genet. 112, 450–456 (2025).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Kühl, M. A., Stich, B. & Ries, D. C. Mutation-Simulator: fine-grained simulation of random mutations in any genome. Bioinforma. Oxf. Engl 37, 568–569 (2021).
Ono, Y., Hamada, M. & Asai, K. PBSIM3: a simulator for all types of PacBio and ONT long reads. NAR Genomics Bioinforma 4, lqac092 (2022).
Hodgkinson, A. & Eyre-Walker, A. Variation in the mutation rate across mammalian genomes. Nat. Rev. Genet. 12, 756–766 (2011).


Acknowledgements

We thank John Belmont and Evan Eichler for their valuable comments on the manuscript. We acknowledge colleagues from the diagnostic division of the Human Genetics Department from Radboudumc (Genome Diagnostics Nijmegen) and the Radboud Genomics Technology Center for their technical assistance and the library preparation and sequencing of all samples provided from Radboudumc, in particular Ronny Derks, Amber den Ouden, Janneke Weiss, and Lonneke Haer-Wigman. We thank the Human Pangenome Reference Consortium (HPRC) for generating and releasing the HiFi WGS data. We thank the Genomic Answers for Kids (GA4K) program at Children’s Mercy Kansas City for generating the GA4K HiFi sequencing data. This research was made possible through access to the 100,000 Genomes Project data in the National Genomic Research Library, which is managed by Genomics England Limited (a wholly owned company of the Department of Health and Social Care). The National Genomic Research Library holds data provided by patients and collected by the NHS as part of their care and data collected as part of their participation in research. The National Genomic Research Library is funded by the National Institute for Health Research and NHS England. The Wellcome Trust, Cancer Research UK and the Medical Research Council have also funded research infrastructure.

Author information

Authors and Affiliations

PacBio, Menlo Park, CA, USA
Xiao Chen, Daniel Baker, Egor Dolzhenko & Michael A. Eberle
GeneDx, Gaithersburg, MD, USA
Joseph M. Devaney, Jessica Noya, April S. Berlyoung, Rhonda Brandon, Kathleen S. Hruska, Lucas Lochovsky, Paul Kruszka & Scott Newman
Genomic Medicine Center, Children’s Mercy Kansas City, Kansas City, MO, USA
Emily Farrow, Isabelle Thiffault & Tomi Pastinen
UMKC School of Medicine, University of Missouri Kansas City, Kansas City, MO, USA
Emily Farrow, Isabelle Thiffault & Tomi Pastinen
Department of Pediatrics, Children’s Mercy Kansas City, Kansas City, MO, USA
Emily Farrow
Department of Pathology and Laboratory Medicine, Children’s Mercy Kansas City, Kansas City, MO, USA
Isabelle Thiffault
Genomics England Ltd., London, UK
Dalia Kasperaviciute
Department of Human Genetics, Radboud University Medical Center, Nijmegen, The Netherlands
Christian Gilissen, Lisenka Vissers & Alexander Hoischen
Research Institute for Medical Innovation, Radboud University Medical Center, Nijmegen, The Netherlands
Christian Gilissen, Lisenka Vissers & Alexander Hoischen
Radboud Center for Infectious Diseases (RCI), Department of Internal Medicine, Radboud University Medical Center, Nijmegen, The Netherlands
Alexander Hoischen
Radboud Expertise Center for Immunodeficiency and Autoinflammation and Radboud Center for Infectious Disease (RCI), Radboud University Medical Center, Nijmegen, The Netherlands
Alexander Hoischen
Center for Genetics Medicine Research, Children’s National Hospital, Washington, DC, USA
Seth Berger, Andrea Cohen, Arthur Ko, Jamie Fraser, Meghan Delaney & Leandros Boukas
Institute for Clinical and Translational Science, University of California, Irvine, CA, USA
Eric Vilain, Emmanuèle Délot, Georgia Pitsava, Lightning Auriga, Ya Cui, Changrui Xiao, Charles Hadley King, Ivan de Dios, Jonathan LoTempio, Miguel Almalvez, Vincent Fusaro & Wei Li
Ambry Genetics, Aliso Viejo, CA, USA
Bojan Losic, Esther Yun-Hua Hsiao, John Harting, Krista Bluske, Rachid Karam & Sami Belhadj

Authors

Xiao Chen
Daniel Baker
Egor Dolzhenko
Joseph M. Devaney
Jessica Noya
April S. Berlyoung
Rhonda Brandon
Kathleen S. Hruska
Lucas Lochovsky
Paul Kruszka
Scott Newman
Emily Farrow
Isabelle Thiffault
Tomi Pastinen
Dalia Kasperaviciute
Christian Gilissen
Lisenka Vissers
Alexander Hoischen
Seth Berger
Eric Vilain
Emmanuèle Délot
Michael A. Eberle

Consortia

UCI Genomics Research to Elucidate the Genetics of Rare diseases (UCI GREGoR) Consortium

Andrea Cohen, Arthur Ko, Georgia Pitsava, Lightning Auriga, Ya Cui, Changrui Xiao, Charles Hadley King, Ivan de Dios, Jonathan LoTempio, Miguel Almalvez, Vincent Fusaro, Wei Li, Jamie Fraser, Meghan Delaney, Leandros Boukas, Bojan Losic, Esther Yun-Hua Hsiao, John Harting, Krista Bluske, Rachid Karam, Sami Belhadj, Seth Berger, Eric Vilain & Emmanuèle Délot

Contributions

X.C. and M.A.E. designed the study; X.C. implemented the software and performed the analyses; X.C. and M.A.E. wrote the paper; D.B., Egor D., J.M.D., J.N., A.S.B., R.B., K.S.H., L.L., P.K., S.N., E.F., I.T., T.P., D.K., C.G., L.V., A.H., S.B., E.V., and Emmanuèle D. provided samples, data and feedback for the paper.

Corresponding authors

Correspondence to Xiao Chen or Michael A. Eberle.

Ethics declarations

Competing interests

X.C., D.B., Egor D., and M.A.E. are employees of PacBio. J.M.D., J.N., A.S.B., R.B., K.S.H., L.L., P.K. and S.N. are employees of GeneDx. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Matteo Chiara, Mark Ebbert and Timofey Prodanov for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Description of Additional Supplementary Files

Supplementary Data 1 and 2

Reporting Summary

Transparent Peer Review file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Cite this article

Chen, X., Baker, D., Dolzhenko, E. et al. Genome-wide profiling of highly similar paralogous genes using HiFi sequencing. Nat Commun 16, 2340 (2025). https://doi.org/10.1038/s41467-025-57505-2


Received: 24 April 2024
Accepted: 21 February 2025
Published: 08 March 2025
Version of record: 08 March 2025
DOI: https://doi.org/10.1038/s41467-025-57505-2
