VACmap: an accurate long-read aligner for unraveling complex genomic rearrangements

Ding, Hongyu; Sedlazeck, Fritz J.; Proukakis, Christos; Morley, Caoimhe; Toffoli, Marco; Schapira, Anthony HV; Liao, Zhirui; Pu, Lianrong; Zhu, Shanfeng

doi:10.1038/s41467-025-67096-7


Published: 05 January 2026

Nature Communications volume 16, Article number: 11198 (2025)

Abstract

Sequence alignment is essential for genomic research and clinical diagnostics, yet detecting complex rearrangements such as inversions, duplications, and gene conversions remains challenging due to allele complexity and limitations of current methods. We introduce VACmap, a non-linear mapping approach to enhance the detection and representation of all genetic variations. VACmap improves duplication detection from 20% to 90% in the Challenging Medically-Relevant Genes (CMRG) benchmark and improves characterization of complex inversions in repetitive regions and gene conversion events. It also improves the resolution of clinically significant loci, including the LPA gene (with repetitive KIV-2 units linked to coronary heart disease) and the GBA1 and STRC genes (risk factors for Parkinson’s disease and hearing loss, respectively, both affected by recombination with their pseudogenes GBAP1 and STRCP1). Here, we show that VACmap delivers better alignment accuracy and SV detection, providing a robust tool for genomic analysis and clinical insights, with potential to advance understanding of genetic diversity and disease mechanisms.

Introduction

Sequence alignment is a fundamental starting point for genomic research and clinical diagnostics, serving as the critical bridge between raw sequencing data and meaningful biological insights. Accurate alignment directly influences the quality of downstream analyses, such as variant detection, genome assembly, comparative genomics, and personalized medicine. In particular, the detection of structural variations (SVs)—genomic rearrangements such as inversions, duplications, translocations, and complex clustered events—relies heavily on precise alignment¹. However, the inability of existing linear aligners to adequately represent complex SVs remains a barrier to progress in genomic science. Misaligned or misrepresented SVs hinder downstream analyses, leading to gaps in our understanding of genetic variation and its impact on human health and disease.

SVs, defined as genomic alterations of 50 base pairs or larger, are among the most impactful sources of genetic variation^1,2,3,4. They affect more nucleotides in the genome than smaller variants, such as single-nucleotide polymorphisms (SNPs) or small insertions and deletions (indels)^3,5,6,7. Consequently, their influence is increasingly recognized across evolutionary processes and human health, where they contribute to both Mendelian and complex diseases as well as cancer development^8,9,10,11. Despite their critical importance, our understanding of SVs remains limited¹². This is due in part to the intrinsic complexity of complex SVs, but also to the inability of current state-of-the-art aligners to represent them. Research has predominantly focused on simpler SV classes, such as deletions and insertions, while more intricate variations (duplications, inversions, and complex clustered events) are often misrepresented and thus missed. This gap in our analytical approach hinders a comprehensive understanding of these important genomic phenomena^13,14, despite their importance already being highlighted in several studies^{7,8,9,10,11,14,15,16}.

These studies have mainly been driven by long-read sequencing that indeed enables the characterization of tandem repeats and thus regions where SV are predominantly observed¹⁷. Despite their advancements, long-reads are challenging to align due to their length and generally higher sequencing errors. In the past, others have highlighted that specialized aligners are required to accurately align them using predominantly linear alignments to the reference, where multiple linear subalignments represent a potential SV. These subalignments are identified through a process commonly known as the seed-chain-extend algorithm, tailored for long-read mapping. In this approach, seeds (exact matches, such as k-mers) are identified, and a co-linear subset of these seeds is selected to form chains, which are then extended bidirectionally until significant differences between the read and reference sequences are encountered. For reads without SVs, a single chain (alignment) can represent the entire read. However, the presence of SVs requires multiple chains (subalignments) to represent the read fully—for example, a read spanning an inversion typically requires three subalignments to capture the inversion and its flanking regions. Due to the complexity of SVs and their tendency to occur in repetitive regions, the seed-chain-extend approach often generates a pool of redundant subalignments, necessitating an additional step to determine the optimal subset of subalignments. For instance, minimap2 employs a co-linear chaining algorithm to identify the set of all possible co-linear chains. It then uses a greedy strategy during primary chain selection¹⁸, which determines the optimal subset of subalignments to represent the read. 
This process initializes an empty set Q and iteratively processes subalignments from the highest to the lowest chaining score: if a subalignment overlaps with a chain in Q by 50% or more of the shorter subalignment’s length, it is marked as secondary; otherwise, it is added to Q, and the read is ultimately represented by the subalignments in Q¹⁸. In contrast, YAHA adopts a graph-based approach, leveraging the Optimal Query Coverage algorithm to find the optimal set of subalignments that cover the length of the query¹⁹. NGMLR builds on a similar strategy, enhancing it with a refined scoring function to identify the combination of subalignments with the highest joint score²⁰. Despite these advancements, post-alignment processes for determining the ideal set of subalignments are often inadequate because of the complexity of the allele and the underlying repetitive regions. They fail especially for duplications, inversions, and translocations, which are often missed or falsely identified by linear alignment algorithms. Duplications, for example, are often misaligned as insertions because linear alignment algorithms prefer a single continuous alignment and treat a duplication, which would require splitting the read, as an insertion. Similarly, inversions and translocations are often misaligned because splitting reads is penalized and thus avoided.
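The greedy primary chain selection attributed to minimap2 above can be sketched in a few lines. This is a minimal illustration of the described procedure, not minimap2's actual code: the `Chain` type, the function names, and the example coordinates are all hypothetical, and only the 50% overlap rule is taken from the text.

```python
from dataclasses import dataclass

@dataclass
class Chain:
    qstart: int   # start on the read (inclusive)
    qend: int     # end on the read (exclusive)
    score: float  # chaining score

def overlap_fraction(a, b):
    """Overlap on the read, as a fraction of the shorter chain's length."""
    ov = min(a.qend, b.qend) - max(a.qstart, b.qstart)
    shorter = min(a.qend - a.qstart, b.qend - b.qstart)
    return max(ov, 0) / shorter if shorter > 0 else 0.0

def select_primary(chains, min_frac=0.5):
    """Greedy selection: visit chains by decreasing score and keep a chain
    only if it does not overlap an already kept chain by >= min_frac."""
    primary, secondary = [], []   # `primary` plays the role of the set Q
    for c in sorted(chains, key=lambda c: c.score, reverse=True):
        if any(overlap_fraction(c, q) >= min_frac for q in primary):
            secondary.append(c)
        else:
            primary.append(c)
    return primary, secondary

if __name__ == "__main__":
    chains = [Chain(0, 1000, 900), Chain(100, 600, 400), Chain(950, 2000, 800)]
    primary, secondary = select_primary(chains)
    print("primary:", primary)
    print("secondary:", secondary)
```

Because each decision depends only on chains that were already accepted, a high-scoring but wrong subalignment in a repetitive region can suppress the correct one, which is the weakness of post-alignment selection discussed in this paragraph.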

To overcome the challenges posed by existing alignment methods, we introduce VACmap, a long-read mapping tool developed to improve the representation of all types of SVs. VACmap uses a non-linear alignment algorithm that captures an entire read as a unified, non-linear alignment. This approach streamlines the traditional alignment process by eliminating the need for splitting reads and selecting from multiple linear alignments. We demonstrate that this approach improves the representation of complex alleles, providing a more accurate and comprehensive view of SVs.

Results

The workflow of the non-linear alignment algorithm

Figure 1 gives an overview of VACmap’s non-linear mapping approach. The key difference between VACmap and other approaches arises after initial matches between the reference and read sequences have been identified. At this stage, existing methods try to conserve the order of all subalignments by heavily penalizing splits when searching for chains of matches. The linear alignment approach can efficiently model genomic alterations such as deletions, insertions, and substitutions, since these do not break the co-linearity of a chain. However, it penalizes the detection of complex SVs such as duplications, inversions, translocations, or combinations of SVs. In VACmap, we propose a hybrid alignment algorithm that combines both linear and non-linear linkage approaches in a chain. In detail, VACmap represents matches as quadruples called ‘anchors’, which include the start positions in the long read and reference sequences, the strand of the match, and the anchor’s length. They are ordered by their end positions in the long read. VACmap’s non-linear chaining algorithm then promotes the extension of the chain to subsequent anchors that preserve a strictly linear relationship with the preceding anchor, rewarding this connection with a positive score. Conversely, it penalizes extensions to anchors that disrupt this linearity by assigning negative scores to such connections. The optimal non-linear alignment of the entire sequence is then the chain with the highest aggregate score (the longest path). Each linear subalignment can be extracted by dividing the non-linear alignment at the non-linear junctions, eliminating the traditional need for additional post-alignment steps that reconstruct genomic rearrangements from a pool of error-prone independent subalignments (see “Methods” for details).
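The final step, extracting linear subalignments by cutting a non-linear chain at its junctions, can be illustrated as follows. This is a sketch under stated assumptions rather than VACmap's implementation: anchors are tuples `(x, y, s, k)`, `y` is taken to be a forward-strand reference coordinate, a junction is declared when the strand changes or the diagonal shifts by more than a hypothetical tolerance `max_shift`, and the example chain is invented.

```python
def diagonal(anchor):
    """Diagonal of an anchor: constant along a gap-free linear alignment.
    Forward anchors advance in both read and reference (y - x is constant);
    reverse anchors advance in the read while moving backwards in the
    reference (y + x is constant)."""
    x, y, s, _k = anchor
    return y - x if s == 1 else y + x

def split_at_junctions(chain, max_shift=30):
    """Split a chain (ordered by read position) into linear segments."""
    segments = []
    last_strand, last_diag = None, None
    for anchor in chain:
        strand, diag = anchor[2], diagonal(anchor)
        linear = (segments and strand == last_strand
                  and abs(diag - last_diag) <= max_shift)
        if linear:
            segments[-1].append(anchor)
        else:
            segments.append([anchor])   # non-linear junction: start a new segment
        last_strand, last_diag = strand, diag
    return segments

if __name__ == "__main__":
    # Hypothetical chain over a read that spans an inversion of ref[130:160].
    chain = [(0, 100, 1, 15), (15, 115, 1, 15),
             (30, 145, -1, 15), (45, 130, -1, 15),
             (60, 160, 1, 15), (75, 175, 1, 15)]
    for seg in split_at_junctions(chain):
        print(seg)
```

On this toy chain the split yields three segments (flank, inverted block, flank), matching the three subalignments that a read spanning an inversion typically requires.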

**Fig. 1: The workflow of VACmap non-linear alignment algorithm.**

Assessing the impact of VACmap on detecting complex SVs in synthetic data

To assess the impact of VACmap on variant detection in downstream applications, we conducted a series of tests using synthetic datasets. We generated synthetic long-read datasets containing a wide range of SVs, both simple and complex, using a custom tool we developed called VACsim, addressing the absence of simulation tools for complex SVs. VACsim introduced 30,000 SVs, each composed of 1 to 20 basic SV events, including deletions, insertions, duplications, inversions, and translocations.

As illustrated in Fig. 2a, VACmap’s alignments enhanced SVIM’s²⁰ ability to detect complex SVs in simulated PacBio CLR, PacBio HiFi, and ONT data, with F1 score improvements ranging from 29.5 to 73.2 percent (Supplementary Table 1). For complex SVs located within repetitive sequences, VACmap-produced alignments provided gains in precision and recall, outperforming the other methods by about 35.2 to 64.6 percent in F1 score (Supplementary Table 1). Figure 2b shows the recall of SV detection across SV complexities and sequencing technologies. NGMLR, minimap2, Winnowmap2, and LRA²¹ were adequate only for identifying complex SVs comprising up to two simple SV events. Beyond this complexity, the recall of SVIM decreased when using alignments from these tools. Conversely, SVIM with VACmap alignments displayed consistently sensitive and reliable SV detection across the full spectrum of SV complexities.

**Fig. 2: Comparison of five mapping methods in downstream complex SV detection using SVIM on synthetic data.**

For precise gene copy number quantification, particularly in tandem duplications that might influence protein levels, accurate mapping is essential. To investigate the performance of copy number estimations with alignments produced by different aligners, we generated 10,000 tandem duplications on chromosome 1 using VACsim. These duplications had repeat unit sizes ranging from 100 to 500 base pairs and repeat counts between 1 and 20. SVIM was employed to estimate the copy number for each tandem duplication from the various aligners’ alignments. According to the results depicted in Fig. 2c and Supplementary Table 2, alignments from current mapping methods led to a bias in copy number estimation. There was a decline in the linear correlation between the actual and estimated copy numbers as the repeat count grew, especially within repetitive areas. In contrast, alignments from VACmap resulted in more precise copy number estimates across diverse copy number intervals and within repetitive regions. This underscores VACmap’s capability in accurately ascertaining the copy number of tandem duplications, indicating its effectiveness and accuracy in dealing with complex genomic structures.

Evaluation using genome in a bottle benchmark

We evaluated the SV detection performance of VACmap, NGMLR, Winnowmap2, minimap2, and LRA alignments using SVIM and cuteSV with the GIAB benchmark set^{4,20,21,22,23,24,25}. Truvari²⁶ was used to assess precision, recall, and F1 scores. Before evaluation, SVIM’s and cuteSV’s tandem duplication calls were relabeled as insertions to allow for comparability with the GIAB assembly-derived benchmark. As expected, all five alignment approaches demonstrated similar performance in detecting deletions and insertions in both GIAB tier 1 and CMRG regions (Fig. 3a, b). VACmap runs faster than NGMLR, is comparable to Winnowmap2 and LRA, and is slower than minimap2, while requiring less memory than the other aligners (Supplementary Table 3). It should be noted that NGMLR is no longer actively maintained, which may contribute to its performance limitations compared to more actively developed tools.
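For reference, the three reported metrics relate to the counts of true-positive, false-positive, and false-negative calls as sketched below. This is the textbook definition with a single true-positive count; Truvari's own bookkeeping may differ in detail, and the function name and example counts are illustrative only.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions; returns 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    # Invented counts, not taken from the benchmark results.
    p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
    print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```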

**Fig. 3: Comparison of five mapping methods in downstream SV detection using HG002.**

To evaluate SVIM’s and cuteSV’s sensitivity in detecting duplications using alignments from different tools, we isolated tandem duplication calls within the GIAB benchmark set using the REPTYPE annotation. The results (Fig. 3c, d) showed that SVIM, using VACmap-produced alignments, exhibited high sensitivity for duplication detection, identifying approximately 70% to 80% more duplications compared to other alignment approaches in the GIAB tier 1 and CMRG regions, respectively. This is highly important for interpreting the impact of SVs. Additionally, the SV distribution detected with VACmap alignments showed notable differences compared to other aligners (Fig. 3e and Supplementary Fig. 1). VACmap indicated that more than 67% of the sequence gain was due to duplications, consistent with previous findings¹⁴. In contrast, minimap2 attributed only 1% of the total sequence gain to duplications. This discrepancy in SV classification is critical for interpreting the biological impact of SVs, underscoring the importance of accurate SV detection.

VACmap’s ability to accurately map duplicated segments also enabled us to characterize a previously reported de novo variation²⁷ (Fig. 4a–c and Supplementary Fig. 2). This variation, located within a Tandem Repeat (TR) region at chr14:23,280,711 (GRCh38), was originally labeled as a de novo insertion, as different insertion sizes were observed in the child (HG002: 537 bp) and the parents (HG003: 214 bp and HG004: 15 bp). However, with VACmap’s alignment, what was initially thought to be an insertion was revealed to be a 109-bp Variable Number Tandem Repeat (VNTR), with varying repeat counts in the child (five repeats) and the paternal parent (two repeats). TR regions are known to be variable in the number of repeats, often changing between generations due to mechanisms like replication slippage and unequal crossing over during meiosis. These processes can lead to differences in repeat counts, which explains the variation observed between the child and the father in this case.

**Fig. 4: Enhanced detection of complex variants by VACmap.**

This example highlights a limitation of conventional alignment algorithms, which often misinterpret duplications as insertions. Traditional aligners rely on maintaining the relative order of sequences when aligning them. However, duplications disrupt this order, making it difficult for linear aligners to correctly map such regions. As a result, duplications are often misaligned as insertions or entirely ignored. In contrast, VACmap’s non-linear alignment approach accurately handles these complex repeat structures, providing a more precise representation of the true genetic variation.

Enhance the characterization of complex inversions in repetitive regions

We then analyzed the inversion callsets generated by five different SV detection pipelines. The VACmap-SVIM callsets captured nearly all of the inversions (105 out of 116) identified by the combined callsets of minimap2, Winnowmap2, NGMLR and LRA, and additionally uncovered 97 inversions not detected by these approaches (Fig. 3f). When comparing inversions that overlapped with a previously reported callset²⁸, the VACmap-SVIM pipeline identified nearly all the inversions (48 out of 49) detected by the other three pipelines, while also discovering 14 inversions that were missed by the other methods (Supplementary Fig. 3). Upon manual inspection of an inversion missed by VACmap-SVIM, we found a more complex structure—an inversion flanked by an inverted duplication and deletion. While VACmap could resolve this complex structure, SVIM failed to detect it because the intricate structure did not align with SVIM’s predefined rules for identifying inversions (Supplementary Fig. 4).

These results highlight that inversions remain challenging to resolve because they are often surrounded by large segmental duplications. To investigate this further, we analyzed the combined call set of 213 inversion regions from the five aligners. Across all inversions, 32% (68/213) overlap with segmental duplications, and more than half of these (39/68) are only detectable through VACmap alignment. For instance, VACmap alignment enables accurate identification of a homozygous 16-kb inversion located in the SPIDR gene (Fig. 4d), a gene involved in DNA repair and associated with gonadal dysgenesis²⁹. In contrast, the alignments of other aligners are less reliable, as they show more mismatched bases (i.e., a signal of wrongly mapped reads²⁰) and inconsistent breakpoints across different read alignments. The standard deviation of the inversion sizes called by SVIM is 291.4 for VACmap alignments and 2066.4 for NGMLR alignments. SVIM treats a higher variance as an indication of an unreliable SV prediction and assigns a lower quality score: the SVIM quality score for this inversion is 14 with VACmap alignments and 0 with NGMLR alignments, and the NGMLR-based call is therefore discarded.

Figure 4e demonstrates why minimap2 and other linear aligners fail to accurately pinpoint inversion breakpoints. Linear alignment methods, such as minimap2, rely on heuristic strategies like the Z-drop heuristic to infer breakpoints¹⁸. These methods monitor the alignment score and split the alignment when the score drops below a predefined threshold (indicated by the red dashed line in the figure). However, this approach often fails to identify the precise breakpoint because after the inversion, the sequence in the read is not significantly divergent from the reference. As shown in the figure, the alignment score continues to increase slowly rather than showing a sharp drop, leading minimap2 to incorrectly place the breakpoint upstream (marked by the red dashed line).
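The behavior described above can be reproduced with a toy version of a drop-based stopping rule. The sketch below is a deliberate simplification and not minimap2's Z-drop implementation (it ignores gap and diagonal terms): it accumulates hypothetical per-base scores and stops once the running score falls more than `zdrop` below the best score seen so far.

```python
def zdrop_break(per_base_scores, zdrop):
    """Return the read offset at which extension would be cut, or None.

    The cut is placed just after the best-scoring position once the
    running score has dropped by more than `zdrop` from that best."""
    score, best, best_pos = 0, 0, -1
    for i, s in enumerate(per_base_scores):
        score += s
        if score > best:
            best, best_pos = score, i
        elif best - score > zdrop:
            return best_pos + 1
    return None

if __name__ == "__main__":
    # Sharp divergence after 10 matching bases: the drop is detected.
    print(zdrop_break([1] * 10 + [-2] * 10, zdrop=5))
    # Weak divergence (score keeps creeping upward): no break is triggered,
    # so the alignment runs through the true breakpoint.
    print(zdrop_break([1] * 10 + [1, -1, 1] * 5, zdrop=5))
```

The second call illustrates the failure mode in the text: when the sequence past an inversion breakpoint still scores slightly positive against the wrong strand, a threshold on the score drop never fires at the true breakpoint.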

In contrast, VACmap’s non-linear alignment algorithm can simultaneously evaluate both forward and reverse strands (blue and orange curves, respectively) and automatically switch between them to maximize the alignment score. This allows VACmap to correctly identify the true inversion breakpoint, as it can seamlessly align both strands and capture subtle changes in the alignment score. The result is a more accurate alignment and a precise breakpoint, as reflected in the figure, where VACmap’s breakpoint (blue dashed line) aligns with the actual inversion. Supplementary Figs. 5–9 provide further examples of how VACmap performs better than traditional aligners in mapping complex inversions.

Improve identification on SIGLEC11::SIGLEC16 and RHCE::RHD gene conversion

Gene conversion is a challenging form of SV that is difficult to capture accurately using current alignment algorithms and SV detection tools. Figure 5a and Supplementary Fig. 10 illustrate an inversion initially misidentified by SVIM, which was actually a gene conversion event between the SIGLEC11 and SIGLEC16 genes on the maternal haplotype. These two genes share highly similar sequences in the regions encoding their extracellular domains, due to past gene conversion events³⁰. The most recent conversion, which occurred approximately one million years ago, involved regions A in SIGLEC11 and A* in SIGLEC16³⁰ (Fig. 5b). However, VACmap’s alignment revealed a gene conversion event involving different regions, B and B*, in these two genes.

**Fig. 5: Comparison of five aligners on gene conversion events.**

Notably, the B* region in SIGLEC16 had previously been flagged by the GIAB consortium due to a cluster of heterozygous small variants²³. However, GIAB’s alignment methods, which rely on minimap2, were unable to resolve this gene conversion, resulting in numerous false-positive SNP calls in both the GIAB CMRG benchmark set and the draft release of the GIAB T2T SV benchmark (Fig. 5d). This outcome is not surprising given minimap2’s limitations in handling complex rearrangements, as it struggles to split reads or assemblies appropriately to represent gene conversion events, leading to misalignments and erroneous variant calls.

Additionally, VACmap successfully resolved a homozygous gene conversion event between the RHCE and RHD genes, which had been inaccurately represented by existing aligners (Fig. 5c). This correction reduced over a hundred false-positive SNP and indel calls in the GIAB benchmark sets (Fig. 5e and Supplementary Figs. 11 and 12). This highlights VACmap’s ability to detect and accurately characterize gene conversion events that are typically missed or misclassified by conventional linear alignment methods.

Evaluation using the LPA, GBA1, and STRC genes

We next assessed the LPA gene to highlight a medically important region whose analysis is improved by VACmap. The complexity of this region arises from high diversity in the population, with 5–40 copies of the KIV-2 repeat in the LPA gene¹⁰. This copy number is inversely correlated with human lipoprotein(a) levels, which are strongly linked to coronary heart disease¹⁰. However, quantifying the KIV-2 copy number accurately is challenging because of the repetitiveness, and thus the low mappability, of sequencing reads in the LPA gene region³¹. We assessed the performance of five mapping methods by aligning PacBio HiFi and ONT sequencing data from human samples (CHM13 and HG002) against the GRCh38 reference genome. IGV visualizations revealed that NGMLR, Winnowmap2, minimap2, and LRA produced alignments with more mismatches and less informative coverage compared to VACmap (Fig. 6a). VACmap accurately represented the KIV-2 repeats, showing clear and distinct coverage boundaries (Supplementary Figs. 13–15).

**Fig. 6: Comparison of five aligners on the *LPA* gene.**

To simplify KIV-2 copy number determination, we modified the GRCh38 reference by removing the second to sixth KIV-2 repeat units and including the first 1000 bp of the following KIV-1 unit (Fig. 6c). We then realigned the PacBio HiFi and ONT data to the modified reference. The IGV visualizations indicated that VACmap-produced alignments (Fig. 6b and Supplementary Figs. 16–18) showed the expected alignment scheme of both type A and type B KIV-2 units (Fig. 6e, f). Other mapping methods struggled to produce correct alignments despite the reduced complexity of the modified reference. Furthermore, the ONT reads facilitated the resolution of all 23 copies of the KIV-2 repeat unit in the CHM13 sample owing to their longer read length compared to the PacBio HiFi data (Supplementary Fig. 19).

Then, we aligned the GRCh38 assembly, CHM13 assembly, and HG002 assembly to the modified GRCh38 reference. The non-linear alignment of these three assemblies is shown in Fig. 6d. Consistent with previous findings¹⁰, we found the GRCh38 assembly consisted of six copies of KIV-2 repeat units with the pattern “AAABAA” where “A” indicates the type A KIV-2 repeat unit, and “B” indicates the type B KIV-2 repeat unit. In the CHM13 assembly, 23 KIV-2 repeat units were identified, following the pattern: “BBBBBBAABAAAAAAAAAAAAAA”. Similarly, the HG002 paternal assembly contains 24 KIV-2 repeat units with the pattern “BBBBBBAAABAAAAAAAAAAAAAA”, while the HG002 maternal assembly consists of 14 KIV-2 repeat units arranged as “AAAAAAAAAAAAAA”.

To further demonstrate the clinical utility of VACmap, we chose GBA1. GBA1 is a major risk factor for Parkinson’s disease³² and a challenging gene to analyze³³, because it is prone to structural variants caused by recombination with a nearby highly homologous pseudogene (GBAP1). Using ONT long reads with adaptive sampling, we previously detected a pathogenic deletion that could not be correctly called after minimap2 or NGMLR alignment³⁴. In contrast, VACmap allowed SVIM and cuteSV to correctly report the breakpoints (Fig. 7a), which is crucial in determining whether a deletion is pathogenic. Similarly, STRC is a known deafness-associated gene that causes mild-to-moderate hearing loss³⁴ and is inherited in an autosomal recessive manner. However, variants in STRC are hard to detect because the gene lies in a tandem duplication region and has a highly homologous (>99%) pseudogene (STRCP1)³⁵. By examining the GIAB samples using VACmap-produced alignments (Fig. 7b), we identified a heterozygous deletion in NA19240³⁶ involving the loss of one copy of the CKMT1B-STRC-CATSPER2 gene cluster. All four other aligners failed to pinpoint the deletion. In addition, Duplomap³⁷, a specialized aligner for remapping reads in tandem duplications, could not detect the deletion, since it uses minimap2 internally, which tends to avoid splitting reads.

**Fig. 7: Comparison of five aligners on *GBA1/GBAP1* and *STRC/STRCP1* gene.**

Discussion

Sequence alignment is a fundamental starting point for virtually all genomic research and clinical diagnostics. It serves as the crucial bridge between raw sequencing data and the biological insights necessary for understanding genetic variation, evolutionary biology, and the molecular basis of diseases. Accurate alignment of sequencing reads to a reference genome is essential for a wide range of applications, including variant detection, comparative genomics, and personalized medicine. The precision and reliability of this initial alignment process directly influence the quality of downstream analyses, impacting our ability to identify genetic variations such as SNPs, indels, and critically SVs.

Despite significant advancements in sequencing technologies, particularly with the advent of long-read sequencing platforms, accurately aligning reads that encompass complex genomic rearrangements remains a formidable challenge. Traditional linear alignment algorithms are often inadequate for handling large-scale SVs such as inversions, duplications, translocations, and complex combinations of these events. These limitations create a cascade of analytical failures: when alignment is incorrect, subsequent analyses become unreliable or impossible, regardless of the sophistication of downstream tools. As a consequence, crucial SVs—including those with significant medical relevance—may be misrepresented or entirely missed, impeding our ability to fully understand their biological significance and clinical implications.

Graph-based genome representations offer significant advantages over linear reference genomes by providing a flexible framework to encode SVs such as duplications, inversions, and translocations as graph structures, enabling a more comprehensive representation of genetic diversity across populations³⁸. This approach can facilitate the integration of multiple genomes into a single graph, potentially improving variant calling and haplotype resolution in complex regions. However, we show that existing algorithms, such as minigraph³⁹, encounter substantial difficulties in producing correct genome graphs (Supplementary Note 1). Their reliance on co-linear matching during graph construction and read alignment often results in erroneous topologies, misinterpreting non-linear SVs (e.g., duplications as insertions or inversions as misalignments).

To address these challenges, we present VACmap, which overcomes the long-standing barrier of inaccurately represented complex variants. It does so through a non-linear mapping approach, and our results demonstrate the need for such a method, especially for inversions and for medically relevant but challenging genes such as LPA, GBA1, and STRC. Indeed, inversions remain challenging to resolve, especially because they are often surrounded by large segmental duplications²⁸. Furthermore, these regions often harbor events more complex than simple inversions. Neither complex nor simple inversions are routinely detectable with state-of-the-art methods²⁸, despite their clinical importance¹⁵. VACmap enables their detection because its non-linear mapping approach aligns read segments more precisely than the other methods we evaluated. The same approach improves the characterization of complex duplications, as shown for the KIV-2 region in LPA, and of gene/pseudogene recombination, as shown for GBA1 and STRC. VACmap can more precisely recapitulate the exact breakpoints within the reads, which improves detection and should provide further insights. These are only a few examples of the many medically important but challenging genes for which VACmap can deliver a more precise picture of variants that are currently often missed by analytical methods²³.

Methods

Ethics statement

Ethics approval for the GBA1 carrier was provided by the National Research Ethics Service London—Hampstead Ethics Committee as part of the RAPSODI study (www.rapsodi.com)⁴⁰. Informed consent was provided.

Variant-aware chaining algorithm

Algorithm overview

Traditional sequence alignment algorithms^{13,18,19,20,22} rely on linear edits—insertions, deletions, and substitutions—that preserve sequence order and orientation. While effective for point mutations and small insertions or deletions (indels), these methods struggle with complex genomic rearrangements, such as duplications, inversions, and translocations. For example, duplications may be misidentified as insertions, and inversions are often indistinguishable from block substitutions. In cases of clustered rearrangements, linear methods attempt to reconstruct the structure by selecting from a pool of local subalignments, but this approach rarely yields a globally optimal alignment, especially in repetitive or highly rearranged regions⁴¹.

VACmap introduces a hybrid approach to address these limitations. Unlike linear alignment methods, VACmap employs the Variant-aware Chaining (VAC) algorithm, which integrates linear and non-linear edits within a weighted directed acyclic graph (DAG) framework. The algorithm identifies exact k-mer matches (anchors) between a reference and a long-read sequence, constructs a DAG where anchors are nodes, and connects them with edges representing possible alignments. Edges are classified as normal (for co-linear alignments) or variation (for rearrangements), with weights adjusted by penalties to account for indels and rearrangement complexity. By finding the longest path in the DAG, VACmap produces a unified, globally optimal alignment that accurately captures genomic rearrangements without relying on post-alignment selection.
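To make the longest-path idea concrete, the following sketch chains anchors with a simple dynamic program. It is a heavily simplified stand-in for the VAC algorithm, written under explicit assumptions: anchors are tuples `(x, y, s, k)` with forward-strand reference coordinates, overlapping anchors are skipped, a normal edge adds the anchor length, and a variation edge adds the anchor length minus a flat, hypothetical `var_penalty`. The names `maxdiff` and `maxgap` follow the Methods, but their values here are placeholders.

```python
def chain_anchors(anchors, var_penalty=10, maxdiff=30, maxgap=2000):
    """Return (best_score, chain) where chain lists anchor indices in read order."""
    n = len(anchors)
    if n == 0:
        return 0, []
    # Topological order of the DAG: anchors sorted by end position on the read.
    order = sorted(range(n), key=lambda i: anchors[i][0] + anchors[i][3])
    score, prev = [0] * n, [None] * n
    for pos, j in enumerate(order):
        x_j, y_j, s_j, k_j = anchors[j]
        score[j] = k_j                      # a chain may start at any anchor
        for i in order[:pos]:
            x_i, y_i, s_i, k_i = anchors[i]
            if x_i + k_i > x_j:             # simplification: no overlapping anchors
                continue
            readgap = x_j - (x_i + k_i)
            refgap = y_j - (y_i + k_i) if s_i == 1 else y_i - (y_j + k_j)
            normal = (s_i == s_j and 0 <= refgap < maxgap
                      and abs(readgap - refgap) < maxdiff)
            cand = score[i] + k_j - (0 if normal else var_penalty)
            if cand > score[j]:
                score[j], prev[j] = cand, i
    end = max(range(n), key=score.__getitem__)
    chain, node = [], end
    while node is not None:                 # backtrack along the longest path
        chain.append(node)
        node = prev[node]
    return score[end], chain[::-1]

if __name__ == "__main__":
    # Hypothetical read spanning an inversion of ref[130:160].
    anchors = [(0, 100, 1, 15), (15, 115, 1, 15),
               (30, 145, -1, 15), (45, 130, -1, 15),
               (60, 160, 1, 15), (75, 175, 1, 15)]
    print(chain_anchors(anchors, var_penalty=10))
    print(chain_anchors(anchors, var_penalty=30))
```

With a small penalty the single chain passes through the inverted anchors and so represents the inversion directly; with a large penalty the chain skips them and behaves like a purely co-linear chainer. The actual weighting of variation edges in VACmap is described in the Methods and is not reproduced here.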

Anchor identification

VACmap begins by identifying exact k-mer matches between the reference and long-read sequences. These matches, called anchors, are represented as quadruples (x, y, s, k), where:

x: Start position in the long-read sequence.

y: Start position in the reference sequence.

s: Orientation (1 for forward strand, −1 for reverse strand).

k: Length of the matched k-mer.
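The quadruple representation can be illustrated with a minimal sketch. It is not VACmap's implementation: it indexes every reference k-mer exhaustively (no minimizer sampling), and it assumes that for a reverse-strand anchor y is the forward-strand start of the reference k-mer whose reverse complement matches the read. The function names `revcomp` and `find_anchors` are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_anchors(read, ref, k):
    """Return exact k-mer matches as (x, y, s, k) quadruples.

    x: start in the read, y: start in the reference,
    s: +1 for forward strand, -1 for reverse strand (assumed convention).
    """
    # Index every k-mer of the reference by its start position.
    index = defaultdict(list)
    for y in range(len(ref) - k + 1):
        index[ref[y:y + k]].append(y)

    anchors = []
    for x in range(len(read) - k + 1):
        kmer = read[x:x + k]
        for y in index.get(kmer, ()):
            anchors.append((x, y, 1, k))
        for y in index.get(revcomp(kmer), ()):
            anchors.append((x, y, -1, k))
    return anchors


print(find_anchors("ACGTT", "GGACGTTGG", 5))  # [(0, 2, 1, 5)]
```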

Graph construction

Anchors are modeled as nodes in a weighted DAG. A directed edge from node i to node j is created if the nodes satisfy:

$${x}_{i}+{k}_{i}\le {x}_{j}+{k}_{j}$$

(1)

The condition is tested for every candidate node j; it ensures that the alignment respects the read sequence order while allowing nodes to overlap. The overlap size is defined as:

$${{\rm{overlapsize}}}=\max (x_{i}+{k}_{i}-{x}_{j},0)$$

(2)

This accounts for cases where anchors partially overlap in the long-read sequence.
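Equations (1) and (2) translate directly into code. The sketch below assumes anchors are stored as (x, y, s, k) tuples; the function names are hypothetical.

```python
def can_connect(a, b):
    """Eq. (1): an edge a -> b is allowed if a ends no later than b in the read."""
    x_i, _, _, k_i = a
    x_j, _, _, k_j = b
    return x_i + k_i <= x_j + k_j


def overlap_size(a, b):
    """Eq. (2): number of read bases shared by anchors a and b."""
    x_i, _, _, k_i = a
    x_j = b[0]
    return max(x_i + k_i - x_j, 0)


a = (0, 100, 1, 15)
b = (10, 110, 1, 15)
print(can_connect(a, b), overlap_size(a, b))  # True 5
```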

Edge classification and weighting

Edges are classified as normal or variation based on the relative positions and orientations of the connected nodes. The initial edge weight, or bonus, is calculated as:

$${{{\rm{bonus}}}}_{i,j}=\min ({x}_{j}+{k}_{j}-{x}_{i}-{k}_{i},{k}_{j})$$

(3)

This bonus reflects the length of the aligned region between nodes. To determine the edge type, we compute the sequence gain or loss (diff) as:

$${{\rm{diff}}}=\left|{{\rm{readgap}}}-{{\rm{refgap}}}\right|$$

(4)

where readgap and refgap are the distances, in the read and in the reference respectively, between the last base pair of node i and the first base pair of node j that does not overlap node i:

$${{\rm{readgap}}}={\max} (x_{j}-{x}_{i}-{k}_{i},0)$$

(5)

$${{\rm{refgap}}}=\left\{\begin{array}{ll}{y}_{j}+{{\rm{overlapsize}}}-{y}_{i}-{k}_{i}, & {{\rm{if}}}\ {s}_{j}={s}_{i}=1\\ {y}_{i}-{y}_{j}-{{\rm{bonus}}}, & {{\rm{if}}}\ {s}_{j}={s}_{i}=-1\\ {y}_{j}+{{\rm{overlapsize}}}-{y}_{i}+1, & {{\rm{if}}}\ {s}_{j}=1,{s}_{i}=-1\\ {y}_{j}+{{\rm{bonus}}}-1-{y}_{i}-{k}_{i}, & {{\rm{if}}}\ {s}_{j}\ne {s}_{i},{s}_{i}=-1\end{array}\right.$$

(6)

An edge is classified as normal if it satisfies:

$${{\rm{diff}}} < {{\rm{maxdiff}}},0\le {{\rm{refgap}}} < {{\rm{maxgap}}},{s}_{i}={s}_{j}$$

(7)

Here, maxdiff (default: 50 for the first round of chaining, 30 for the second) bounds the size of small indels tolerated on a normal edge, and maxgap (default: 1000 for the first round of chaining, 100 for the second) prevents misalignments across large gaps. Edges not meeting these criteria are classified as variation edges, representing rearrangements.
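The classification rule can be sketched as follows. This is an illustration rather than VACmap's code: it implements only the two same-strand cases of Eq. (6), which suffices for classification because Eq. (7) requires s_i = s_j, so any edge joining anchors on different strands is a variation edge. The function names are hypothetical and the defaults are the first-round values quoted above.

```python
def bonus(a, b):
    """Eq. (3): read bases newly covered by b after a, capped at k_j."""
    return min(b[0] + b[3] - a[0] - a[3], b[3])


def read_gap(a, b):
    """Eq. (5): unaligned read bases between a and b."""
    return max(b[0] - a[0] - a[3], 0)


def ref_gap_same_strand(a, b):
    """Eq. (6), first two cases only (both anchors on the same strand)."""
    x_i, y_i, s_i, k_i = a
    x_j, y_j, s_j, _ = b
    if s_i != s_j:
        raise ValueError("mixed-strand cases are not covered by this sketch")
    if s_i == 1:
        overlap = max(x_i + k_i - x_j, 0)  # Eq. (2)
        return y_j + overlap - y_i - k_i
    return y_i - y_j - bonus(a, b)


def classify_edge(a, b, maxdiff=50, maxgap=1000):
    """Eq. (7): return "normal" or "variation" for the edge a -> b."""
    if a[2] != b[2]:
        return "variation"  # Eq. (7) requires equal orientation
    refgap = ref_gap_same_strand(a, b)
    diff = abs(read_gap(a, b) - refgap)  # Eq. (4)
    if diff < maxdiff and 0 <= refgap < maxgap:
        return "normal"
    return "variation"


print(classify_edge((0, 100, 1, 15), (20, 120, 1, 15)))   # normal
print(classify_edge((0, 100, 1, 15), (20, 5000, 1, 15)))  # variation
```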

Penalty calculations

For normal edges, an additional penalty (NP) accounts for small indels:

$${{\rm{\alpha }}}=\left\{\begin{array}{ll}0.5, & {{\rm{if}}}\ {{\rm{diff}}} < 10\\ 2, & {{\rm{if}}}\ {{\rm{diff}}}\ge 10\end{array}\right.$$

(8)

$${{{\rm{NP}}}}_{i,j}=0.01\times {k}_{j}\times {{\rm{diff}}}+{{\rm{\alpha }}}\times {\log }_{2}\left(\max \left({{\rm{diff}}},1\right)\right)+0.1\times {\log }_{2}\left(\max \left({{\rm{readgap}}},1\right)\right)$$

(9)

For a large readgap (default: >30), we use a larger penalty:

$${{{\rm{NP}}}}_{i,j}=0.01\times {k}_{j}\times {{\rm{diff}}}+{{\rm{\alpha }}}\times {\log }_{2}\left(\max \left({{\rm{diff}}},1\right)\right)+0.5\times {{\rm{readgap}}}$$

(10)
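A minimal sketch of Eqs. (8) to (10) follows. It assumes that the "large readgap" threshold of 30 applies to readgap itself and switches between the last terms of Eqs. (9) and (10); the function and parameter names are hypothetical.

```python
import math


def normal_penalty(k_j, diff, readgap, large_readgap=30):
    """Penalty NP for a normal edge, following Eqs. (8)-(10)."""
    alpha = 0.5 if diff < 10 else 2.0                        # Eq. (8)
    shared = 0.01 * k_j * diff + alpha * math.log2(max(diff, 1))
    if readgap > large_readgap:
        return shared + 0.5 * readgap                        # Eq. (10)
    return shared + 0.1 * math.log2(max(readgap, 1))         # Eq. (9)


print(normal_penalty(15, 0, 0))  # 0.0
```

For example, with k_j = 15, diff = 16, and readgap = 40, the three terms are 2.4, 8, and 20, so the penalty is 30.4.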

For variation edges, the penalty (VP) accounts for rearrangements:

$${{{\rm{VP}}}}_{i,j}=\beta+\min \left\{\begin{array}{c}30+0.5\times {\mathrm{ln}}\left(\max (1,{{\rm{diff}}})\right)\\ \min \left(10,\frac{{{\rm{diff}}}}{100}\right)+\min \left(30,\frac{{{\rm{diff}}}}{1000}\right)\\ {{\rm{\gamma }}}\end{array}\right.$$

(11)

Here, the default value of β is 59, 40, 30, and 30 for modes L, H, S, and R, respectively, and the default value of γ is 36. After applying the penalty, the final weights of normal and variation edges are defined in Eqs. (12) and (13), respectively.

$$w\left(i,j\right)={{{\rm{bonus}}}}_{i,j}-{{{\rm{NP}}}}_{i,j}$$

(12)

$$w\left(i,j\right)={{{\rm{bonus}}}}_{i,j}-{{{\rm{VP}}}}_{i,j}$$

(13)
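The variation penalty and the final edge weights can be sketched as below. The code reads Eq. (11) as β plus the minimum of the three listed expressions, which is an interpretation of the typeset formula; the defaults are the mode L value of β and the stated value of γ, and the function names are hypothetical.

```python
import math


def variation_penalty(diff, beta=59.0, gamma=36.0):
    """Penalty VP for a variation edge, one reading of Eq. (11)."""
    option_a = 30 + 0.5 * math.log(max(1, diff))
    option_b = min(10, diff / 100) + min(30, diff / 1000)
    return beta + min(option_a, option_b, gamma)


def edge_weight(bonus, penalty):
    """Eqs. (12) and (13): the bonus minus the applicable penalty (NP or VP)."""
    return bonus - penalty


print(variation_penalty(500))                   # 64.5
print(edge_weight(15, variation_penalty(500)))  # -49.5
```

Under this reading the penalty grows with diff but never exceeds β + γ, so a very large rearrangement costs at most 95 in mode L.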

Optimization

Here, we consider the optimal alignment of the read and reference sequences to be the longest path in the weighted directed graph. The longest path among N nodes can be computed in O(N²) time using dynamic programming. In detail, for a given node j, we determine its maximum score S(j) and its best predecessor node i using Eq. (14):

$$S\left(j\right)=\max \left({\max }_{i < {{\rm{j}}}}\left(S\left(i\right)+w\left(i,j\right)\right),{k}_{j}\right)$$

(14)
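As a concrete reading of Eq. (14), the sketch below runs the exhaustive O(N²) recurrence and backtracks from the highest-scoring node. It assumes the anchors are sorted by read position so that every edge points forward. `toy_weight` is a hypothetical stand-in for w(i, j) (the bonus, minus a flat 50 for anchors off the same diagonal), not the penalties defined above.

```python
def toy_weight(a, b):
    """Hypothetical stand-in for w(i, j); not VACmap's weighting."""
    bonus = min(b[0] + b[3] - a[0] - a[3], b[3])
    same_diagonal = a[2] == b[2] and a[1] - a[0] == b[1] - b[0]
    return bonus if same_diagonal else bonus - 50


def best_chain(anchors, weight):
    """Eq. (14) by exhaustive dynamic programming.

    Returns (best score, indices of the anchors on the best chain).
    """
    n = len(anchors)
    if n == 0:
        return 0, []
    score = [a[3] for a in anchors]  # S(j) starts at k_j
    prev = [-1] * n                  # best predecessor, -1 for none
    for j in range(n):
        for i in range(j):
            candidate = score[i] + weight(anchors[i], anchors[j])
            if candidate > score[j]:
                score[j], prev[j] = candidate, i
    # Backtrack from the highest-scoring node.
    j = max(range(n), key=score.__getitem__)
    best = score[j]
    path = []
    while j != -1:
        path.append(j)
        j = prev[j]
    return best, path[::-1]


anchors = [(0, 100, 1, 10), (10, 110, 1, 10), (20, 500, 1, 10), (20, 120, 1, 10)]
print(best_chain(anchors, toy_weight))  # (30, [0, 1, 3])
```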

However, we do not need to test every node i, because the score S(i) of node j’s best predecessor is always higher than the current highest score S(k) minus $(\beta+\gamma+20)$. So, for node j, we define a set M_j that includes all nodes scoring higher than S(k) minus $(\beta+\gamma+20)$, and we test only the nodes in M_j to identify the best predecessor of node j. This reduces the time complexity to O(hN), where h is the average size of the sets M_j.

We observed that the best predecessor of node j is usually among the top-ranked nodes in set M_j when sorted in descending order by their scores S(i). To exploit this property and enhance computational efficiency, we initialize the score of node j as S(j) = k_j and then iterate over all nodes i in M_j (where i < j) in descending order of their scores S(i) to identify the best predecessor. The iteration stops early if S(i) + k_j < S(j), as this condition indicates that further predecessors cannot improve the score. In practice, we maintain an array of nodes in descending order by their scores using binary search. The overall time complexity of the algorithm is then O(kN + N log N). The average value of k for HiFi, ONT, and CLR data in the first round of chaining is 21, 33, and 44, respectively.
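The descending-order scan with early stopping can be sketched as follows. The sketch implements only the stopping rule S(i) + k_j < S(j), which is exact whenever w(i, j) ≤ k_j; the additional restriction to the set M_j is omitted for brevity. `capped_weight` is a hypothetical weight function that satisfies this bound. Note that `bisect.insort` finds the insertion point by binary search, but inserting into a Python list is itself linear in the list length.

```python
import bisect


def capped_weight(a, b):
    """Hypothetical w(i, j) with w <= k_j, as the stopping rule requires."""
    bonus = min(b[0] + b[3] - a[0] - a[3], b[3])
    same_diagonal = a[2] == b[2] and a[1] - a[0] == b[1] - b[0]
    return bonus if same_diagonal else bonus - 50


def chain_scores_early_stop(anchors, weight):
    """Score anchors (sorted by read position), scanning predecessors
    from highest to lowest S(i) and stopping once S(i) + k_j < S(j)."""
    n = len(anchors)
    score = [0] * n
    prev = [-1] * n
    ranked = []  # (S(i), i) pairs kept in ascending order of score
    for j in range(n):
        k_j = anchors[j][3]
        score[j] = k_j
        for s_i, i in reversed(ranked):  # descending score
            if s_i + k_j < score[j]:
                break  # no remaining predecessor can improve S(j)
            candidate = s_i + weight(anchors[i], anchors[j])
            if candidate > score[j]:
                score[j], prev[j] = candidate, i
        bisect.insort(ranked, (score[j], j))
    return score, prev


anchors = [(0, 100, 1, 10), (10, 110, 1, 10), (20, 500, 1, 10), (20, 120, 1, 10)]
print(chain_scores_early_stop(anchors, capped_weight))
# ([10, 20, 10, 30], [-1, 0, -1, 1])
```

In this example the scan for the last anchor stops after a single predecessor, because the next-best score (10) plus k_j (10) is already below the score of 30 found through the top-ranked node.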

Once the maximum scores of all nodes are computed, we identify the highest-scoring non-colinear chain by backtracking. The optimal set of colinear subalignments is then recovered by discarding the variation edges in this chain.
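Recovering the colinear subalignments amounts to cutting the chain at every variation edge. A minimal sketch, assuming the chain is a list of (x, y, s, k) anchors and that a caller-supplied predicate (here the hypothetical `off_diagonal`) reports whether the edge between two consecutive anchors is a variation edge:

```python
def split_colinear(chain, is_variation):
    """Split a chain into colinear segments by dropping variation edges."""
    if not chain:
        return []
    segments = [[chain[0]]]
    for a, b in zip(chain, chain[1:]):
        if is_variation(a, b):
            segments.append([b])    # start a new subalignment
        else:
            segments[-1].append(b)  # extend the current one
    return segments


def off_diagonal(a, b):
    """Hypothetical predicate: strand or diagonal changes between a and b."""
    return a[2] != b[2] or a[1] - a[0] != b[1] - b[0]


chain = [(0, 100, 1, 10), (10, 110, 1, 10), (20, 500, 1, 10), (30, 510, 1, 10)]
print(len(split_colinear(chain, off_diagonal)))  # 2
```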

Map quality calculations

To compute the map quality of the highest-scoring chain, VACmap uses minimap2’s equation:

$${{\rm{map}}}\,{{\rm{quality}}}=40\times (1-f2/f1)\times \min (1,\frac{m}{10})\times {\mathrm{ln}}(f1)$$

(15)

where m is the number of anchors on the highest-scoring chain, f1 is its chaining score, and f2 is the score of the second-highest-scoring chain.
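Equation (15) can be evaluated directly, as in the sketch below. Returning 0 for a non-positive f1 is an assumption made here to keep the logarithm defined; the source does not say how that case is handled, and the function name is hypothetical.

```python
import math


def map_quality(f1, f2, m):
    """Eq. (15): map quality from the best score f1, the second-best
    score f2, and the number of anchors m on the best chain."""
    if f1 <= 0:
        return 0.0  # assumption: guard against log of a non-positive score
    return 40.0 * (1.0 - f2 / f1) * min(1.0, m / 10.0) * math.log(f1)


print(map_quality(100.0, 100.0, 20))  # 0.0 (two equally good chains)
```

The formula rewards a large margin between the two best chains, a well-supported chain (at least 10 anchors), and a high absolute chaining score.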

Local index and anchor reduction

When sequencing errors or SNPs occur in clusters, accurate mapping becomes challenging, especially with a large k-mer size and a minimizer k-mer sampling strategy. To mitigate this issue, VACmap adopts an approach similar to that of a previous study²² and constructs a local index with a smaller k-mer size, typically a 9-mer. The local index is built by collecting all k-mers from the reference genome regions covered by the previously computed highest- and second-highest-scoring chains. All k-mers in the long read are then used to query the index, yielding a new set of anchors, and the highest-scoring chain is recomputed from these anchors.

Note that the runtime of the VAC algorithm depends on the number of anchors. Using a smaller k-mer size setting often results in an increased number of anchors, consequently escalating computational time. To address this issue, two strategies are employed to reduce the number of anchors and improve runtime efficiency. Firstly, VACmap iterates through the anchors and removes an anchor if the distance between the anchor and the previously computed highest-scoring chain exceeds a certain threshold (2000 bp). The distance between an anchor and the previous chain is defined as the distance between the anchor and its closest anchor in the chain under the reference sequence coordinate. Determining the closest anchor in the chain can be accomplished through a binary search, with a time complexity of O(logN), where N is the number of anchors. Secondly, VACmap merges two anchors (x_i, y_i, s_i, k_i) and (x_j, y_j, s_j, k_j) into a new anchor (x_i, min(y_i, y_j), s_i, x_j + k_j – x_i) if they satisfy certain conditions (Eq. (16)).

$$\left\{\begin{array}{c}{y}_{i}-{x}_{i}={y}_{j}-{x}_{j},\ {x}_{i}+{k}_{i}\ge {x}_{j},\ {s}_{j}={s}_{i}=1,\ {x}_{j} > {x}_{i},\ {x}_{j}+{k}_{j}-{x}_{i}\le 19\\ {{\rm{or}}}\\ {y}_{i}+{x}_{i}+{k}_{i}={y}_{j}+{x}_{j}+{k}_{j},\ {x}_{i}+{k}_{i}\ge {x}_{j},\ {s}_{j}={s}_{i}=-1,\ {x}_{j} > {x}_{i},\ {x}_{j}+{k}_{j}-{x}_{i}\le 19\end{array}\right.$$

(16)

Two anchors can be merged only if they overlap or are directly adjacent and share identical sequences in their overlapping regions. To avoid producing large anchors, which could expand the search space during best predecessor computation, we limit the maximum output size (default: 19) of merged anchors, thereby reducing computational demand in the chaining process without compromising alignment quality.

Sorting the N anchors in ascending order of their position in the long read allows the merging process to be computed in O(N) time using a hash table. The hash table uses an integer key and maintains a list of merged anchors as its value. For anchor_i, if s_i equals 1, the key is set to y_i − x_i; otherwise, it is set to −(y_i + x_i). VACmap tests the last anchor_j in the list corresponding to a key. If the two anchors overlap or are directly adjacent and the covered sequences in the reference and the long read are identical, VACmap updates the last anchor in the list; otherwise, anchor_i is appended to the list. If a key does not exist in the hash table, a new key-value pair is inserted. Finally, the set of merged anchors is obtained by traversing the hash table.
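The two anchor-reduction strategies can be sketched as follows. Several details are assumptions rather than statements from the source: distances are measured between anchor start coordinates in the reference, all input anchors share one k-mer length, and the reverse-strand key is written as y + x + k (the invariant in Eq. (16)), which differs from −(y + x) only by a constant and a sign when k is uniform. Keys are paired with the strand to keep forward and reverse diagonals apart. Function names are hypothetical.

```python
import bisect


def filter_by_distance(anchors, chain, max_dist=2000):
    """Keep anchors whose reference start lies within max_dist of the
    closest anchor on the previous chain (found by binary search)."""
    chain_ref = sorted(a[1] for a in chain)
    if not chain_ref:
        return list(anchors)  # assumption: nothing to filter against
    kept = []
    for a in anchors:
        pos = bisect.bisect_left(chain_ref, a[1])
        neighbours = chain_ref[max(pos - 1, 0):pos + 1]
        if min(abs(a[1] - c) for c in neighbours) <= max_dist:
            kept.append(a)
    return kept


def merge_anchors(anchors, max_len=19):
    """Merge overlapping or adjacent anchors on one diagonal, Eq. (16)."""
    table = {}
    for anchor in sorted(anchors, key=lambda t: t[0]):
        x, y, s, k = anchor
        key = (s, y - x) if s == 1 else (s, y + x + k)
        bucket = table.setdefault(key, [])
        if bucket:
            px, py, _, pk = bucket[-1]
            if x > px and px + pk >= x and x + k - px <= max_len:
                bucket[-1] = (px, min(py, y), s, x + k - px)
                continue
        bucket.append(anchor)
    return [a for bucket in table.values() for a in bucket]


raw = [(0, 100, 1, 9), (3, 103, 1, 9), (6, 106, 1, 9), (12, 112, 1, 9)]
print(merge_anchors(raw))  # [(0, 100, 1, 15), (12, 112, 1, 9)]
```

In the example, the first three 9-mers collapse into one 15-bp anchor; the fourth is kept separate because merging it would produce a 21-bp anchor, above the 19-bp limit.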

Quality control of linear subalignment

The presence of sequencing errors and closely situated SNPs complicates the process of gathering sufficient anchors for chaining. To accommodate sequencing errors and clustered SNPs, VACmap employs a sizeable maxgap parameter, with a default setting of 100, within the VAC algorithm. However, a large maxgap value can cause unrelated sequences to be aligned by mistake. To address this problem, we assess the error ratio within each linear subalignment and exclude those with a high error ratio. The error ratio is obtained by using edlib⁴² to calculate the edit distance between the matched sequences in the reference genome and the long read, and then dividing this edit distance by the length of the shorter of the two sequences. A linear subalignment is deemed unreliable and rejected if its error ratio exceeds a predetermined threshold: 0.2 for sequences generated using PacBio CLR or ONT, and 0.1 for sequences generated using PacBio HiFi.
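The error-ratio filter can be sketched as below. To keep the example self-contained, a plain dynamic-programming edit distance stands in for edlib, which computes the same quantity far more efficiently. Treating an empty segment as maximally erroneous is an assumption, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (stand-in for edlib)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = cur
    return prev[-1]


def error_ratio(read_seg, ref_seg):
    """Edit distance normalized by the length of the shorter sequence."""
    shorter = min(len(read_seg), len(ref_seg))
    if shorter == 0:
        return 1.0  # assumption: an empty segment is treated as unreliable
    return edit_distance(read_seg, ref_seg) / shorter


def passes_qc(read_seg, ref_seg, threshold):
    """Keep a linear subalignment unless its error ratio exceeds threshold."""
    return error_ratio(read_seg, ref_seg) <= threshold


print(error_ratio("ACGTACGTAC", "ACGTTCGTAC"))  # 0.1
```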

Mode selection

VACmap provides four mapping modes (L, H, S, R) tailored for specific alignment purposes. Mode L is recommended for high-accuracy long-read data (e.g., PacBio HiFi), using a large variation penalty (β = 59) and a low sequence divergence threshold (0.1) to retain only high-confidence alignments. Mode H is preferred for noisy long-read data (e.g., ONT), applying a smaller variation penalty (β = 40) and a higher sequence divergence threshold (0.3) to accommodate higher sequencing error rates. Mode S is designed to improve sensitivity for small-scale rearrangements, using a still smaller variation penalty (β = 30) and a high sequence divergence threshold (0.5) to capture subtle variations. Mode R is an experimental mode intended solely for testing purposes, employing a fixed variation penalty (VP_i,j = 30) and a divergence threshold of 0.5. Because it is still under development, Mode R is not recommended for normal usage, and users should rely on Modes L, H, or S for standard applications.

Synthetic data simulation and evaluation

To assess the performance of different mapping methods in downstream SV detection tasks, we employed the CHM13 T2T human reference genome⁴³ to generate two sets of synthetic data. To simulate both simple and complex SVs, we used VACsim to randomly introduce 30,000 SVs into the CHM13 T2T reference and generated the altered genome sequence containing both simple and complex SVs. VACsim determined the complexity of each simulated SV (measured by the number of simple structural variation events within it) by randomly sampling from the range 1–20. The types of simple structural variation events were sampled from five simple SV events, namely deletion, insertion, duplication, inversion, and translocation, with respective probabilities of 0.24, 0.24, 0.24, 0.24, and 0.04. The size of each simple SV event ranged from 100 to 1000 bp. Because SVIM (v1.4.2) employs a pattern-matching strategy to detect SVs, which requires each simple SV event to conform to the “Normal + Variation + Normal” pattern, we separated adjacent simple SV events by a normal sequence of 200 bp. For the copy number estimation task, VACsim was used to simulate and implant 10,000 tandem duplications with repeat unit sizes ranging from 100 to 500 bp and repeat numbers ranging from 1 to 20 into chromosome 1 of the CHM13 T2T reference genome sequence. SNPs were introduced into the altered sequences, and SURVIVOR was used to generate simulated PacBio and ONT long-read datasets.
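The sampling scheme for a single complex SV can be illustrated with a short sketch. It is not VACsim's code: uniform sampling, inclusive range bounds, and the function name `sample_complex_sv` are assumptions, while the event types, probabilities, and size range are those stated above.

```python
import random

EVENT_TYPES = ["deletion", "insertion", "duplication", "inversion", "translocation"]
EVENT_PROBS = [0.24, 0.24, 0.24, 0.24, 0.04]
SPACER_BP = 200  # normal sequence placed between adjacent simple events


def sample_complex_sv(rng):
    """Sample one simulated SV as a list of (event type, size in bp)."""
    n_events = rng.randint(1, 20)  # SV complexity
    kinds = rng.choices(EVENT_TYPES, weights=EVENT_PROBS, k=n_events)
    return [(kind, rng.randint(100, 1000)) for kind in kinds]


rng = random.Random(0)  # fixed seed for reproducibility
sv = sample_complex_sv(rng)
spacer_total = SPACER_BP * (len(sv) - 1)
print(len(sv), spacer_total)
```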

Five long-read mapping methods, namely VACmap, NGMLR, Winnowmap2, minimap2, and LRA, were used to align the simulated long reads to the T2T reference genome sequences. For VACmap, we used mode L to align synthetic PacBio HiFi data and mode H to align synthetic PacBio CLR and ONT data. The SVIM (v1.4.2) SV detection tool was employed to detect SVs from the alignments generated by the different mapping methods. For 40×, 20×, 10×, and 5× read coverage, the minimum read support for SVIM-called SVs was set to 10, 5, 3, and 2, respectively. Following previous studies⁶, we evaluated the performance of CSV calls by decomposing CSVs into individual simple SV events and evaluating each event separately.

Furthermore, following the procedure outlined in a previous study¹³, we employed Mashmap⁴⁴ to identify repetitive regions in the reference genome sequence with sequence similarity exceeding 95% and a sequence length greater than 10,000 bps. Subsequently, we analyzed the performance of downstream SV detection within these repetitive regions using alignments produced by different mapping algorithms.

Evaluation with GIAB callset

We evaluated five mapping methods, namely VACmap, NGMLR, Winnowmap2, minimap2, and LRA, using the GIAB Tier1 (v0.6) and CMRG (v1.00) SV benchmark sets for the HG002 human sample. These benchmark sets are available at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NIST_SV_v0.6/ and https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/, respectively. PacBio CLR, PacBio HiFi, and ONT sequencing data are accessible at s3://giab/data/AshkenazimTrio/HG002_NA24385_son/. For VACmap, we used mode L to align PacBio HiFi data and mode H to align PacBio CLR and ONT data. SVIM (v1.4.2) and cuteSV (v2.1.2) were used to detect SVs. For PacBio CLR (69×), PacBio HiFi (30×), and ONT (50×) data, the minimum read support for SVs called by SVIM and cuteSV was set to 5, except for cuteSV calls on PacBio CLR data, where it was set to 10 because cuteSV needs a higher read support setting to filter out false-positive results (Supplementary Fig. 20). Subsequently, we evaluated the SVs against the GIAB benchmark sets using Truvari (v2.0.0). Command line parameters provided to these tools are listed in Supplementary Table 4.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The GRCh37, GRCh38, CHM13 T2T, and HG002 T2T assemblies used in this study are available at https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz, https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa, https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz, and https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/HG002/assemblies/hg002v1.0.fasta, respectively. The HG002 sequencing data used in this study are available at s3://giab/data/AshkenazimTrio/HG002_NA24385_son. The HG003 PacBio HiFi sequencing data used in this study are available in the SRA database under accession code SRR26402937. The HG004 PacBio HiFi sequencing data used in this study are available in the SRA database under accession code SRR26402936. The CHM13 PacBio HiFi sequencing data used in this study are available in the SRA database under accession codes SRR11292120, SRR11292121, SRR11292122, and SRR11292123. The CHM13 ONT sequencing data used in this study are available at s3://human-pangenomics/T2T/CHM13/nanopore/UW/chm13_UW_Guppy_3.6.0.fastq.gz. The NA19240 PacBio HiFi sequencing data used in this study are available in the SRA database under accession code SRR30717225. The re-analyzed data on the GBA mutation carrier from the previous study³⁴ cannot be made available due to restrictions in the ethical approval and consent obtained from study subjects. Synthetic data and SV callsets are deposited in Zenodo under https://doi.org/10.5281/zenodo.17374865. Source data are provided with this paper.

Code availability

VACmap and VACsim are available on GitHub: https://github.com/micahvista/VACmap⁴⁵.

References

1. Mahmoud, M. et al. Structural variant calling: the long and the short of it. Genome Biol. 20, 246 (2019).
2. Ho, S. S., Urban, A. E. & Mills, R. E. Structural variation in the sequencing era. Nat. Rev. Genet. 21, 171–189 (2020).
3. Sudmant, P. H. et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015).
4. Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020).
5. Redon, R. et al. Global variation in copy number in the human genome. Nature 444, 444–454 (2006).
6. Chaisson, M. J. P. et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 10, 1784 (2019).
7. Audano, P. A. et al. Characterizing the major structural variant alleles of the human genome. Cell 176, 663–675.e619 (2019).
8. Sekar, S. et al. Complex mosaic structural variations in human fetal brains. Genome Res. 30, 1695–1704 (2020).
9. Leija-Salazar, M. et al. Evaluation of the detection of GBA missense mutations and other variants using the Oxford Nanopore MinION. Mol. Genet. Genom. Med. 7, e564 (2019).
10. Schmidt, K., Noureen, A., Kronenberg, F. & Utermann, G. Structure, function, and genetics of lipoprotein (a). J. Lipid Res. 57, 1339–1359 (2016).
11. Aganezov, S. et al. Comprehensive analysis of structural variants in breast cancer genomes using single-molecule sequencing. Genome Res. 30, 1258–1273 (2020).
12. Hanlon, V. C. T., Lansdorp, P. M. & Guryev, V. A survey of current methods to detect and genotype inversions. Hum. Mutat. 43, 1576–1589 (2022).
13. Jain, C. et al. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat. Methods 19, 705–710 (2022).
14. Schloissnig, S. et al. Structural variation in 1,019 diverse humans based on long-read sequencing. Nature 644, 442–452 (2025).
15. Carvalho, C. M. B. & Lupski, J. R. Mechanisms underlying structural variant formation in genomic disorders. Nat. Rev. Genet. 17, 224–238 (2016).
16. Beck, C. R. et al. Megabase length hypermutation accompanies human structural variation at 17p11.2. Cell 176, 1310–1324 (2019).
17. English, A. C. et al. Analysis and benchmarking of small and large genomic variants across tandem repeats. Nat. Biotechnol. 43, 431–442 (2025).
18. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–100 (2018).
19. Faust, G. G. & Hall, I. M. YAHA: fast and flexible long-read alignment with optimal breakpoint detection. Bioinformatics 28, 2417–24 (2012).
20. Sedlazeck, F. J. et al. Accurate detection of complex structural variants using single-molecule sequencing. Nat. Methods 15, 461–8 (2018).
21. Heller, D. & Vingron, M. SVIM: structural variant identification using mapped long reads. Bioinformatics 35, 2907–2915 (2019).
22. Ren, J. & Chaisson, M. J. lra: a long read aligner for sequences and contigs. PLoS Comput. Biol. 17, e1009078 (2021).
23. Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. 40, 672–680 (2022).
24. Jiang, T. et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biol. 21, 189 (2020).
25. Helal, A. A. et al. Benchmarking long-read aligners and SV callers for structural variation detection in Oxford nanopore sequencing data. Sci. Rep. 14, 6160 (2024).
26. English, A. C. et al. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol. 23, 271 (2022).
27. Kirsche, M. et al. Jasmine and Iris: population-scale structural variant comparison and analysis. Nat. Methods 20, 408–417 (2023).
28. Porubsky, D. et al. Recurrent inversion polymorphisms in humans associate with genetic instability and genomic disorders. Cell 185, 1986–2005.e26 (2022).
29. Smirin-Yosef, P. et al. A biallelic mutation in the homologous recombination repair gene SPIDR is associated with human gonadal dysgenesis. J. Clin. Endocrinol. Metab. 102, 681–688 (2017). Erratum in: J. Clin. Endocrinol. Metab. 103, 364 (2018). doi: 10.1210/jc.2017-02413.
30. Wang, X. et al. Evolution of siglec-11 and siglec-16 genes in hominins. Mol. Biol. Evol. 29, 2073–86 (2012).
31. Chin, C.-S. et al. A pan-genome approach to decipher variants in the highly complex tandem repeat of LPA. Preprint at https://www.biorxiv.org/content/10.1101/2022.06.08.495395v2.
32. O’Regan, G., deSouza, R. M., Balestrino, R. & Schapira, A. H. Glucocerebrosidase mutations in Parkinson disease. J. Parkinsons Dis. 7, 411–422 (2017).
33. Mandelker, D. et al. Navigating highly homologous genes in a molecular diagnostic setting: a resource for clinical next-generation sequencing. Genet. Med. 18, 1282–1289 (2016).
34. Toffoli, M. et al. Comprehensive short and long read sequencing analysis for the Gaucher and Parkinson’s disease-associated GBA gene. Commun. Biol. 5, 670 (2022).
35. Yokota et al. Frequency and clinical features of hearing loss caused by STRC deletions. Sci. Rep. 9, 4408 (2019).
36. Alvaro, S. et al. Refining the detection of complex rearrangements in 15q15.3 region involving the STRC gene in hereditary hearing loss patients. J. Hum. Genet. 70, 395–403 (2025).
37. Prodanov, T. & Bansal, V. Sensitive alignment using paralogous sequence variants improves long-read mapping and variant calling in segmental duplications. Nucleic Acids Res. 48, e114 (2020).
38. Liao, W. W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
39. Li, H., Feng, X. & Chu, C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. 21, 265 (2020).
40. Toffoli, M. et al. Phenotypic effect of GBA1 variants in individuals with and without Parkinson’s disease: the RAPSODI study. Neurobiol. Dis. 188, 106343 (2023).
41. Quinlan, A. R. et al. Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res. 20, 623–635 (2010).
42. Šošic, M. & Šikic, M. Edlib: a C/C++ library for fast, exact sequence alignment using edit distance. Bioinformatics 33, 1394–1395 (2017).
43. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
44. Jain, C., Koren, S., Dilthey, A., Phillippy, A. M. & Aluru, S. A fast adaptive algorithm for computing whole-genome homology maps. Bioinformatics 34, i748–i756 (2018).
45. Ding, H. et al. VACmap: an accurate long-read aligner for unraveling complex genomic rearrangements. VACmap. https://doi.org/10.5281/zenodo.17389341 (2025).

Acknowledgements

We would like to thank Dr. Kristoffer Sahlin (Stockholm University) and Dr. Mingfu Shao (Pennsylvania State University) for their helpful suggestions and insightful comments on the manuscript. This work has been supported by the National Natural Science Foundation of China (Grant Nos U24A20257 and 62272105), 111 Project (Grant No. B18015), the ZJ Lab, the Shanghai Research Center for Brain Science and Brain-inspired Intelligence Technology. C.M. is supported by the MSA Trust.

Author information

Authors and Affiliations

Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
Hongyu Ding & Shanfeng Zhu
Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
Fritz J. Sedlazeck
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
Fritz J. Sedlazeck
Department of Computer Science, Rice University, Houston, TX, USA
Fritz J. Sedlazeck
Department of Clinical and Movement Neurosciences, UCL Queen Square Institute of Neurology, London, UK
Christos Proukakis, Caoimhe Morley, Marco Toffoli & Anthony HV Schapira
School of Physics and Electronic Information, Guangxi Minzu University, Nanning, China
Zhirui Liao
Engineering Research Center of Multi-Modal Information Intelligent Sensing, Processing and Application, University of Guangxi, Nanning, China
Zhirui Liao
School of Computer Science and Technology, Shandong University, Qingdao, China
Lianrong Pu
Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
Shanfeng Zhu
Department of Neurology and Rare Disease Center, Huashan Hospital, Fudan University, National Center for Neurological Disorders (NCND), Shanghai, China
Shanfeng Zhu
Shanghai Key Lab of Intelligent Information Processing and Shanghai Institute of Artificial Intelligence Algorithm, Fudan University, Shanghai, China
Shanfeng Zhu
Zhangjiang Fudan International Innovation Center, Shanghai, China
Shanfeng Zhu

Contributions

S.Z. conceived and supervised this study. H.D. designed the study and implemented the software. H.D., F.J.S., C.P., and C.M. performed the data analysis. H.D. drafted the paper. F.J.S., C.P., M.T., A.H.V.S., Z.L., L.P., and S.Z. modified the paper. All authors agree to the content of the final paper.

Corresponding author

Correspondence to Shanfeng Zhu.

Ethics declarations

Competing interests

F.J.S. receives research support from Illumina, PacBio, and Oxford Nanopore. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Ding, H., Sedlazeck, F.J., Proukakis, C. et al. VACmap: an accurate long-read aligner for unraveling complex genomic rearrangements. Nat Commun 16, 11198 (2025). https://doi.org/10.1038/s41467-025-67096-7

Received: 19 September 2024
Accepted: 20 November 2025
Published: 05 January 2026
Version of record: 05 January 2026
DOI: https://doi.org/10.1038/s41467-025-67096-7