Introduction

Sequence alignment is a fundamental starting point for genomic research and clinical diagnostics, serving as the critical bridge between raw sequencing data and meaningful biological insights. Accurate alignment directly influences the quality of downstream analyses, such as variant detection, genome assembly, comparative genomics, and personalized medicine. In particular, the detection of structural variations (SVs)—genomic rearrangements such as inversions, duplications, translocations, and complex clustered events—relies heavily on precise alignment1. However, the inability of existing linear aligners to adequately represent complex SVs remains a barrier to progress in genomic science. Misaligned or misrepresented SVs hinder downstream analyses, leading to gaps in our understanding of genetic variation and its impact on human health and disease.

SVs, defined as genomic alterations of 50 base pairs or larger, are among the most impactful sources of genetic variation1,2,3,4. They affect more nucleotides in the genome than smaller variants, such as single-nucleotide polymorphisms (SNPs) or small insertions and deletions (indels)3,5,6,7. Consequently, their influence is increasingly recognized in evolutionary processes and human health, where they contribute to both Mendelian and complex diseases as well as cancer development8,9,10,11. Despite their critical importance, our understanding of SVs remains limited12. This is due in part to the inherent complexity of complex SVs, but also to the inability of current state-of-the-art aligners to represent them. Research has predominantly focused on simpler SV classes, such as deletions and insertions, while more intricate variations, such as duplications, inversions, and complex clustered events, are often misrepresented and thus missed. This gap in our analytical approach hinders a comprehensive understanding of these important genomic phenomena13,14, even though several studies have already highlighted their importance7,8,9,10,11,14,15,16.

These studies have mainly been driven by long-read sequencing that indeed enables the characterization of tandem repeats and thus regions where SV are predominantly observed17. Despite their advancements, long-reads are challenging to align due to their length and generally higher sequencing errors. In the past, others have highlighted that specialized aligners are required to accurately align them using predominantly linear alignments to the reference, where multiple linear subalignments represent a potential SV. These subalignments are identified through a process commonly known as the seed-chain-extend algorithm, tailored for long-read mapping. In this approach, seeds (exact matches, such as k-mers) are identified, and a co-linear subset of these seeds is selected to form chains, which are then extended bidirectionally until significant differences between the read and reference sequences are encountered. For reads without SVs, a single chain (alignment) can represent the entire read. However, the presence of SVs requires multiple chains (subalignments) to represent the read fully—for example, a read spanning an inversion typically requires three subalignments to capture the inversion and its flanking regions. Due to the complexity of SVs and their tendency to occur in repetitive regions, the seed-chain-extend approach often generates a pool of redundant subalignments, necessitating an additional step to determine the optimal subset of subalignments. For instance, minimap2 employs a co-linear chaining algorithm to identify the set of all possible co-linear chains. It then uses a greedy strategy during primary chain selection18, which determines the optimal subset of subalignments to represent the read. 
This process initializes an empty set Q and iteratively processes subalignments from the highest to the lowest chaining score: if a subalignment overlaps with a chain in Q by 50% or more of the shorter subalignment’s length, it is marked as secondary; otherwise, it is added to Q, so that the read is ultimately represented by the subalignments in Q18. In contrast, YAHA adopts a graph-based approach, leveraging the Optimal Query Coverage algorithm to find the optimal set of subalignments that covers the length of the query19. NGMLR builds on a similar strategy, enhancing it with a refined scoring function to identify the combination of subalignments with the highest joint score20. Despite these advancements, post-alignment processes for determining the ideal set of subalignments are often inadequate because of the complexity of the allele and the underlying repetitive regions. They fail especially for duplications, inversions, and translocations, which are often missed or falsely identified by linear alignment algorithms. Duplications, for example, are often misaligned as insertions because linear alignment algorithms prefer a single continuous alignment and treat a duplication, which would require splitting the read, as an insertion. Similarly, inversions and translocations are often misaligned because splitting reads is penalized and thus avoided.
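The greedy selection described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not minimap2's implementation: the `Chain` record, the `select_primary` function, and the choice to measure overlap in read coordinates are all introduced here for the example.

```python
from typing import List, NamedTuple, Tuple


class Chain(NamedTuple):
    """Hypothetical record for one subalignment (chain) of a read."""
    q_start: int   # start on the read (inclusive)
    q_end: int     # end on the read (exclusive)
    score: float   # chaining score


def select_primary(chains: List[Chain],
                   min_overlap: float = 0.5) -> Tuple[List[Chain], List[Chain]]:
    """Greedy selection: visit chains by descending score and keep a chain
    unless it overlaps an already kept chain by at least `min_overlap`
    of the shorter chain's length (then it is marked secondary)."""
    primary: List[Chain] = []    # the set Q from the text
    secondary: List[Chain] = []
    for c in sorted(chains, key=lambda ch: ch.score, reverse=True):
        c_len = c.q_end - c.q_start
        is_secondary = False
        for p in primary:
            overlap = min(c.q_end, p.q_end) - max(c.q_start, p.q_start)
            shorter = min(c_len, p.q_end - p.q_start)
            if shorter > 0 and overlap >= min_overlap * shorter:
                is_secondary = True
                break
        (secondary if is_secondary else primary).append(c)
    return primary, secondary
```

Because each chain is judged only against chains already accepted, a high-scoring but wrong chain can suppress the correct lower-scoring one, which is one way the selection step can go astray in repetitive regions.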

To overcome the challenges posed by existing alignment methods, we introduce VACmap, a long-read mapping tool developed to improve the representation of all types of SVs. VACmap uses a non-linear alignment algorithm that captures an entire read as a unified, non-linear alignment. This approach streamlines the traditional alignment process by eliminating the need for splitting reads and selecting from multiple linear alignments. We demonstrate that this approach improves the representation of complex alleles, providing a more accurate and comprehensive view of SVs.

Results

The workflow of the non-linear alignment algorithm

Figure 1 gives an overview of VACmap’s non-linear mapping approach. The key difference between VACmap and other approaches arises after the initial matches between the reference and read sequences have been identified. At this stage, existing methods try to conserve the order of all subalignments by heavily penalizing splits when searching for chains of matches. The linear alignment approach can efficiently model genomic alterations such as deletions, insertions, and substitutions, since these do not break the co-linearity of a chain. However, the linear approach penalizes the detection of complex SVs such as duplications, inversions, translocations, or combinations of SVs. In VACmap, we propose a hybrid alignment algorithm that combines both linear and non-linear linkage approaches in a chain. In detail, VACmap represents matches as quadruples called ‘anchors’, which include the start positions in the long read and reference sequences, the strand of the match, and the anchor’s length. Anchors are ordered by their end positions in the long read. VACmap’s non-linear chaining algorithm then promotes the extension of the chain to subsequent anchors that preserve a strictly linear relationship with the preceding anchor, rewarding this connection with a positive score. Conversely, it penalizes extensions to anchors that disrupt this linearity by assigning negative scores to such connections. The optimal non-linear alignment of the entire sequence is then the chain with the highest aggregate score (the longest path). Each linear subalignment can be extracted by dividing the non-linear alignment at the non-linear junctions, eliminating the traditional need for additional post-alignment steps that reconstruct genomic rearrangements from a pool of error-prone independent subalignments (see “Methods” for details).
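The final step, dividing a non-linear chain into linear subalignments at its non-linear junctions, can be illustrated with a minimal Python sketch. The linearity test used here is an assumption made for the example, not VACmap's scoring: two consecutive anchors are treated as linear when they share a strand and lie on nearly the same diagonal (forward strand) or anti-diagonal (reverse strand), within a hypothetical tolerance `max_shift`. Under this simplification a large indel would also be split, whereas VACmap models indels as linear edits.

```python
from typing import List, NamedTuple


class Anchor(NamedTuple):
    x: int  # start position in the read
    y: int  # start position in the reference
    s: int  # strand of the match: +1 forward, -1 reverse
    k: int  # anchor length


def is_colinear(a: Anchor, b: Anchor, max_shift: int = 20) -> bool:
    """Simplified linearity test (an assumption of this sketch)."""
    if a.s != b.s:
        return False
    # Forward matches keep y - x constant; reverse matches keep y + x constant.
    return abs((a.y - a.s * a.x) - (b.y - b.s * b.x)) <= max_shift


def split_chain(chain: List[Anchor], max_shift: int = 20) -> List[List[Anchor]]:
    """Cut a chain (ordered along the read) at every non-linear junction."""
    segments: List[List[Anchor]] = []
    for anchor in chain:
        if segments and is_colinear(segments[-1][-1], anchor, max_shift):
            segments[-1].append(anchor)
        else:
            segments.append([anchor])
    return segments
```

For a read crossing an inversion, for example the anchors `(0, 1000, +1, 15)`, `(100, 1300, -1, 15)`, `(200, 1200, -1, 15)`, `(300, 1300, +1, 15)`, `split_chain` returns three segments: the left flank, the inverted block, and the right flank.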

Fig. 1: The workflow of VACmap non-linear alignment algorithm.

VACmap begins by identifying matching k-mers between the long read and the reference genome (blue indicates the forward strand, orange indicates the reverse strand). Next, it computes non-linear alignments and selects the one with the highest score. Finally, VACmap divides the highest-scoring non-linear alignment into multiple linear subalignments, enabling straightforward interpretation of SV.

Assessing the impact of VACmap on detecting complex SVs in synthetic data

To assess the impact of VACmap on variant detection in downstream applications, we conducted a series of tests using synthetic datasets. We generated synthetic long-read datasets containing a wide range of SVs, both simple and complex, using a custom tool we developed called VACsim, addressing the absence of simulation tools for complex SVs. VACsim introduced 30,000 SVs, each composed of 1 to 20 basic SV events, including deletions, insertions, duplications, inversions, and translocations.

As illustrated in Fig. 2a, VACmap’s alignment data enhanced SVIM’s20 ability to detect complex SVs in simulated data from PacBio CLR, PacBio HiFi, and ONT, with F1 score improvements ranging between 29.5 and 73.2 percent (refer to Supplementary Table 1). For complex SVs located within repetitive sequences, the use of VACmap-produced alignments provided gains in precision and recall, outperforming other methods by about 35.2 to 64.6 percent in F1 score (see Supplementary Table 1). Figure 2b shows the recall rates of SV detection across different SV complexities and sequencing technologies. NGMLR, minimap2, Winnowmap2, and LRA21 were shown to be adequate only for identifying complex SVs comprising up to two simple SV events. Beyond this complexity, the recall rates of SVIM decreased when using alignments from these tools. Conversely, SVIM with VACmap alignments consistently displayed sensitive and reliable SV detection across the full spectrum of SV complexities.

Fig. 2: Comparison of five mapping methods in downstream complex SV detection using SVIM on synthetic data.

a The precision, recall, and F1 scores (dashed line) of SVIM’s complex SV detection performance in all chromosomes and the repetitive region (repeats) across different read depths and sequence technologies. b The recall rates of SVIM’s complex SV detection using alignments produced by five mapping methods under varying SV complexity and sequence technologies. The shaded color represents results in the repetitive region. c Box plots of the estimated tandem duplication copy numbers for each mapping method using 40-fold coverage ONT simulated data. Panels are arranged in two rows: the top row for all regions; the bottom row for repetitive regions. The green dashed line indicates the ideal (true) copy number. The orange dotted line represents the median estimated copy number across data points for each target. Notably, VACmap demonstrates improved performance in the downstream SV detection task across different SV complexities. For box plots, data are presented as median values (centre, horizontal line within each box) with bounds of the box representing the interquartile range (IQR; lower bound: first quartile or 25th percentile [Q1]; upper bound: third quartile or 75th percentile [Q3]). Whiskers extend from the box bounds to the minima (smallest value within 1.5 × IQR of Q1) and maxima (largest value within 1.5 × IQR of Q3); outliers beyond whiskers are not shown (filtered). No error bars are displayed. For all regions: VACmap n = 9557; minimap2 n = 7227; NGMLR n = 8626; Winnowmap2 n = 5674; LRA n = 0. For repetitive regions: VACmap n = 1061; minimap2 n = 387; NGMLR n = 324; Winnowmap2 n = 224; LRA n = 0. Source data are provided as a Source Data file.

For precise gene copy number quantification, particularly in tandem duplications that might influence protein levels, accurate mapping is essential. To investigate the performance of copy number estimations with alignments produced by different aligners, we generated 10,000 tandem duplications on chromosome 1 using VACsim. These duplications had repeat unit sizes ranging from 100 to 500 base pairs and repeat counts between 1 and 20. SVIM was employed to estimate the copy number for each tandem duplication from the various aligners’ alignments. According to the results depicted in Fig. 2c and Supplementary Table 2, alignments from current mapping methods led to a bias in copy number estimation. There was a decline in the linear correlation between the actual and estimated copy numbers as the repeat count grew, especially within repetitive areas. In contrast, alignments from VACmap resulted in more precise copy number estimates across diverse copy number intervals and within repetitive regions. This underscores VACmap’s capability in accurately ascertaining the copy number of tandem duplications, indicating its effectiveness and accuracy in dealing with complex genomic structures.

Evaluation using the Genome in a Bottle benchmark

We evaluated the SV detection performance of VACmap, NGMLR, Winnowmap2, minimap2, and LRA alignments using SVIM and cuteSV with the GIAB benchmark set4,20,21,22,23,24,25. Truvari26 was used to assess precision, recall, and F1 scores. Before evaluation, SVIM’s and cuteSV’s tandem duplication calls were relabeled as insertions to allow comparison with the GIAB assembly-derived benchmark. As expected, all five alignment approaches demonstrated similar performance in detecting deletions and insertions in both GIAB tier 1 and CMRG regions (Fig. 3a, b). VACmap runs faster than NGMLR and is comparable in runtime to Winnowmap2 and LRA, but is slower than minimap2. However, VACmap uses less memory than the other aligners (Supplementary Table 3). It should be noted that NGMLR is no longer actively maintained, which may contribute to its performance limitations compared to more actively developed tools.
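For reference, the three metrics used throughout this evaluation follow the standard definitions and can be computed from the counts of true-positive, false-positive, and false-negative calls. The sketch below is a plain Python restatement of those definitions; the function name `precision_recall_f1` is introduced here and is not part of Truvari.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard definitions; returns 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so an aligner that yields high precision but misses most duplications still receives a low F1.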

Fig. 3: Comparison of five mapping methods in downstream SV detection using HG002.

a–d Performance assessment of SVIM and cuteSV using five aligners’ alignments on GIAB Tier 1 and CMRG benchmarks. e Distribution of SV types (deletions [DEL], duplications [DUP], insertions [INS], and inversions [INV]) and their size ranges detected by SVIM using alignments produced by VACmap and minimap2. VACmap alignments revealed a broader and more balanced distribution of SV types and sizes compared to minimap2, which exhibited biases toward specific SV categories and sizes. These results highlight the advantages of VACmap in comprehensive SV detection. f Venn diagram showing the overlap of inversions detected by SVIM using alignments from VACmap, NGMLR, minimap2, Winnowmap2, and LRA on PacBio HiFi data. VACmap enabled the detection of the highest number of unique inversions compared to the other aligners. Source data are provided as a Source Data file.

To evaluate SVIM’s and cuteSV’s sensitivity in detecting duplications using alignments from different tools, we isolated tandem duplication calls within the GIAB benchmark set using the REPTYPE annotation. The results (Fig. 3c, d) showed that SVIM, using VACmap-produced alignments, exhibited high sensitivity for duplication detection, identifying approximately 70% and 80% more duplications than other alignment approaches in the GIAB tier 1 and CMRG regions, respectively. Additionally, the SV distribution detected with VACmap alignments showed notable differences compared to other aligners (Fig. 3e and Supplementary Fig. 1). VACmap indicated that more than 67% of the sequence gain was due to duplications, consistent with previous findings14. In contrast, minimap2 attributed only 1% of the total sequence gain to duplications. This discrepancy in SV classification is critical for interpreting the biological impact of SVs, underscoring the importance of accurate SV detection.

VACmap’s ability to accurately map duplicated segments also enabled us to characterize a previously reported de novo variation27 (Fig. 4a–c and Supplementary Fig. 2). This variation, located within a Tandem Repeat (TR) region at chr14:23,280,711 (GRCh38), was originally labeled as a de novo insertion, as different insertion sizes were observed in the child (HG002: 537 bp) and the parents (HG003: 214 bp and HG004: 15 bp). However, with VACmap’s alignment, what was initially thought to be an insertion was revealed to be a 109-bp Variable Number Tandem Repeat (VNTR), with varying repeat counts in the child (five repeats) and the paternal parent (two repeats). TR regions are known to be variable in the number of repeats, often changing between generations due to mechanisms like replication slippage and unequal crossing over during meiosis. These processes can lead to differences in repeat counts, which explains the variation observed between the child and the father in this case.

Fig. 4: Enhanced detection of complex variants by VACmap.

a, b VACmap (a) accurately identifies a 109-bp Variable Number Tandem Repeat (VNTR) at chr14:23,280,711 (GRCh38), with five copies in HG002 and two copies in HG003. In contrast, minimap2 (b) misclassifies the same variation as a 537-bp insertion in HG002 and 214-bp insertion in HG003. c Schematic representation of the correct repeat structures identified by VACmap in HG002 and HG003, compared to minimap2’s misinterpretation as insertions. d VACmap detects precise breakpoints for a 16-kb inversion in the SPIDR gene (blue dashed line), while other aligners show more mismatched bases and incorrect breakpoints (red dashed lines). e Alignment score comparison highlighting VACmap’s ability to switch between forward and reverse strands, resulting in more accurate inversion breakpoint detection than minimap2.

This example highlights a limitation of conventional alignment algorithms, which often misinterpret duplications as insertions. Traditional aligners rely on maintaining the relative order of sequences when aligning them. However, duplications disrupt this order, making it difficult for linear aligners to correctly map such regions. As a result, duplications are often misaligned as insertions or entirely ignored. In contrast, VACmap’s non-linear alignment approach accurately handles these complex repeat structures, providing a more precise representation of the true genetic variation.

Enhance the characterization of complex inversions in repetitive regions

We then analyzed the inversion callsets generated by five different SV detection pipelines. The VACmap-SVIM callsets captured nearly all of the inversions (105 out of 116) identified by the combined callsets of minimap2, Winnowmap2, NGMLR and LRA, and additionally uncovered 97 inversions not detected by these approaches (Fig. 3f). When comparing inversions that overlapped with a previously reported callset28, the VACmap-SVIM pipeline identified nearly all the inversions (48 out of 49) detected by the other three pipelines, while also discovering 14 inversions that were missed by the other methods (Supplementary Fig. 3). Upon manual inspection of an inversion missed by VACmap-SVIM, we found a more complex structure—an inversion flanked by an inverted duplication and deletion. While VACmap could resolve this complex structure, SVIM failed to detect it because the intricate structure did not align with SVIM’s predefined rules for identifying inversions (Supplementary Fig. 4).

These results highlight that inversions remain challenging to resolve because they are often surrounded by large segmental duplications. To investigate this further, we analyzed the combined call set of 213 inversion regions from the five aligners. Across all inversions, 32% (68/213) overlap with segmental duplications, and more than half of these (39/68) are detectable only through VACmap alignments. For instance, VACmap alignments enable accurate identification of a homozygous 16-kb inversion located in the SPIDR gene (Fig. 4d), a gene involved in DNA repair and associated with gonadal dysgenesis29. In contrast, the alignments from other aligners are less reliable, as they show more mismatched bases (i.e., a signal of wrongly mapped reads20) and inconsistent breakpoints across different read alignments. The standard deviation of the inversion sizes called by SVIM is 291.4 for VACmap alignments and 2066.4 for NGMLR alignments. SVIM treats a higher variance as a sign of an unreliable SV prediction and assigns a lower quality score: the SVIM quality score for this inversion is 14 with VACmap alignments and 0 with NGMLR alignments, and the latter call will be discarded.

Figure 4e demonstrates why minimap2 and other linear aligners fail to accurately pinpoint inversion breakpoints. Linear alignment methods, such as minimap2, rely on heuristic strategies like the Z-drop heuristic to infer breakpoints18. These methods monitor the alignment score and split the alignment when the score drops below a predefined threshold (indicated by the red dashed line in the figure). However, this approach often fails to identify the precise breakpoint because after the inversion, the sequence in the read is not significantly divergent from the reference. As shown in the figure, the alignment score continues to increase slowly rather than showing a sharp drop, leading minimap2 to incorrectly place the breakpoint upstream (marked by the red dashed line).
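The behaviour described above can be reproduced with a deliberately simplified Z-drop check. The sketch below is an illustration only, not minimap2's implementation: it tracks the running alignment score along one extension and stops when the score falls more than `zdrop` below the best score seen so far, omitting the gap-length term that minimap2 also applies. The function name and the per-position scores are hypothetical.

```python
from typing import List, Optional


def zdrop_break(step_scores: List[int], zdrop: int) -> Optional[int]:
    """Return the index of the best-scoring position reached before the
    running score drops more than `zdrop` below its maximum, or None if
    no such drop occurs and the extension runs to the end."""
    total = 0
    best = 0
    best_idx = -1
    for i, step in enumerate(step_scores):
        total += step
        if total > best:
            best, best_idx = total, i
        elif best - total > zdrop:
            return best_idx  # alignment is cut back to the best position
    return None
```

With matches scored +2 and mismatches -4, `zdrop_break([2] * 10 + [-4] * 5, zdrop=10)` returns 9, the last matching position before the sharp drop. When the sequence past the breakpoint still matches most of the time, as in `[2] * 10 + [2, 2, 2, -4] * 5`, the score keeps creeping upward, the function returns `None`, and no break is placed at the true breakpoint.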

In contrast, VACmap’s non-linear alignment algorithm can simultaneously evaluate both forward and reverse strands (blue and orange curves, respectively) and automatically switch between them to maximize the alignment score. This allows VACmap to correctly identify the true inversion breakpoint, as it can seamlessly align both strands and capture subtle changes in the alignment score. The result is a more accurate alignment and a precise breakpoint, as reflected in the figure, where VACmap’s breakpoint (blue dashed line) aligns with the actual inversion. Supplementary Figs. 5–9 provide further examples of how VACmap performs better than traditional aligners in mapping complex inversions.

Improve identification on SIGLEC11::SIGLEC16 and RHCE::RHD gene conversion

Gene conversion is a challenging form of SV that is difficult to capture accurately using current alignment algorithms and SV detection tools. Figure 5a and Supplementary Fig. 10 illustrate an inversion initially misidentified by SVIM, which was actually a gene conversion event between the SIGLEC11 and SIGLEC16 genes on the maternal haplotype. These two genes share highly similar sequences in the regions encoding their extracellular domains, due to past gene conversion events30. The most recent conversion, which occurred approximately one million years ago, involved regions A in SIGLEC11 and A* in SIGLEC1630 (Fig. 5b). However, VACmap’s alignment revealed a gene conversion event involving different regions, B and B*, in these two genes.

Fig. 5: Comparison of five aligners on gene conversion events.

a The IGV visualization of the SIGLEC11 and SIGLEC16 gene conversion event; the SVIM inversion prediction is shown in the top panel. b Proposed scenario of gene conversions between SIGLEC11 and SIGLEC16 loci. c The IGV visualization of a potential RHD and RHCE gene conversion event. d, e The IGV visualization of SIGLEC11/SIGLEC16 and RHD/RHCE gene conversion events using the HG002 assembly.

Notably, the B* region in SIGLEC16 had previously been flagged by the GIAB consortium due to a cluster of heterozygous small variants23. However, GIAB’s alignment methods, which rely on minimap2, were unable to resolve this gene conversion, resulting in numerous false-positive SNP calls in both the GIAB CMRG benchmark set and the draft release of the GIAB T2T SV benchmark (Fig. 5d). This outcome is not surprising given minimap2’s limitations in handling complex rearrangements, as it struggles to split reads or assemblies appropriately to represent gene conversion events, leading to misalignments and erroneous variant calls.

Additionally, VACmap successfully resolved a homozygous gene conversion event between the RHCE and RHD genes, which had been inaccurately represented by existing aligners (Fig. 5c). This correction reduced over a hundred false-positive SNP and indel calls in the GIAB benchmark sets (Fig. 5e and Supplementary Figs. 11 and 12). This highlights VACmap’s ability to detect and accurately characterize gene conversion events that are typically missed or misclassified by conventional linear alignment methods.

Evaluation using the LPA, GBA1, and STRC genes

We next assessed the LPA gene, a medically important region whose analysis is further improved by VACmap. The complexity of this region arises from its high diversity in the population, with 5–40 copies of the KIV-2 repeat in the LPA gene10. This copy number is inversely correlated with human lipoprotein(a) levels, which are strongly linked to coronary heart disease10. However, accurately quantifying the KIV-2 copy number is challenging because of the repetitiveness, and thus the low mappability, of sequencing reads in the LPA gene region31. We assessed the performance of five mapping methods by aligning PacBio HiFi and ONT sequencing data from human samples (CHM13 and HG002) against the GRCh38 reference genome. IGV visualizations revealed that NGMLR, Winnowmap2, minimap2, and LRA produced alignments with more mismatches and less informative coverage than VACmap (Fig. 6a). VACmap accurately represented the KIV-2 repeats, showing clear and distinct coverage boundaries (Supplementary Figs. 13–15).

Fig. 6: Comparison of five aligners on the LPA gene.

a, b The IGV visualization of alignments produced by five aligners using GRCh38 and modified GRCh38 reference in the KIV-2 region. c The illustration of GRCh38 reference modification and the exon structure of the KIV-2 domain. The exon 2 (red) in the type A KIV-2 repeat unit, type B KIV-2 repeat unit, and KIV-1 repeat unit have 100% identical sequences. The exon 1 (purple) in the KIV-3 repeat unit and type B KIV-2 have 100% identical sequences. The light blue and light orange regions indicate the reserved and removed regions, respectively. d The dot plot depicts non-linear alignments generated by the VACmap algorithm of GRCh38, CHM13, and HG002 assembly against the modified GRCh38 reference. e, f The alignment scheme of the type A KIV-2 sequence and type B KIV-2 sequence against the modified GRCh38 reference.

To simplify KIV-2 copy number determination, we modified the GRCh38 reference by removing the second to sixth KIV-2 repeat units and including the first 1000 bp of the following KIV-1 unit (Fig. 6c). We then realigned the PacBio HiFi and ONT data to the modified reference. The IGV visualizations indicated that VACmap-produced alignments (Fig. 6b and Supplementary Figs. 16–18) showed the expected alignment scheme of both type A and type B KIV-2 units (Fig. 6e, f). Other mapping methods struggled to produce correct alignments despite the reduced complexity of the modified reference. Furthermore, the ONT reads facilitated the resolution of all 23 copies of the KIV-2 repeat unit in the CHM13 sample owing to their longer read length compared to PacBio HiFi data (Supplementary Fig. 19).

Then, we aligned the GRCh38 assembly, CHM13 assembly, and HG002 assembly to the modified GRCh38 reference. The non-linear alignment of these three assemblies is shown in Fig. 6d. Consistent with previous findings10, we found the GRCh38 assembly consisted of six copies of KIV-2 repeat units with the pattern “AAABAA” where “A” indicates the type A KIV-2 repeat unit, and “B” indicates the type B KIV-2 repeat unit. In the CHM13 assembly, 23 KIV-2 repeat units were identified, following the pattern: “BBBBBBAABAAAAAAAAAAAAAA”. Similarly, the HG002 paternal assembly contains 24 KIV-2 repeat units with the pattern “BBBBBBAAABAAAAAAAAAAAAAA”, while the HG002 maternal assembly consists of 14 KIV-2 repeat units arranged as “AAAAAAAAAAAAAA”.
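Repeat-unit patterns of this kind are easier to compare once collapsed into runs. The short Python sketch below run-length encodes a pattern string; the helper name `run_lengths` is introduced here for illustration and is not part of VACmap.

```python
from itertools import groupby
from typing import List, Tuple


def run_lengths(pattern: str) -> List[Tuple[str, int]]:
    """Collapse a repeat-unit pattern into (unit type, run length) pairs."""
    return [(unit, len(list(group))) for unit, group in groupby(pattern)]


def copy_number(pattern: str) -> int:
    """Total number of repeat units in the pattern."""
    return sum(count for _, count in run_lengths(pattern))
```

For the GRCh38 pattern, `run_lengths("AAABAA")` gives `[("A", 3), ("B", 1), ("A", 2)]` and `copy_number("AAABAA")` gives 6, matching the six KIV-2 units reported above. The CHM13 and HG002 patterns can be passed to the same functions.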

To further demonstrate the clinical utility of VACmap, we chose GBA1. GBA1 is a major risk factor for Parkinson’s disease32 and a challenging gene to analyze33, as it is prone to structural variants caused by recombination with a nearby highly homologous pseudogene (GBAP1). Using ONT long reads with adaptive sampling, we previously detected a pathogenic deletion that could not be correctly called after minimap2 or NGMLR alignment34. In contrast, VACmap allowed SVIM and cuteSV to correctly report the breakpoints (Fig. 7a), which is crucial in determining whether a deletion is pathogenic. Similarly, STRC is a known deafness-associated gene that causes mild-to-moderate hearing loss34 and is inherited in an autosomal recessive manner. However, variants in STRC are hard to detect because the gene lies in a tandem duplication region and has a highly homologous (>99%) pseudogene (STRCP1)35. By examining the GIAB samples using VACmap-produced alignments (Fig. 7b), we identified a heterozygous deletion in NA1924036 involving the loss of one copy of the CKMT1B-STRC-CATSPER2 gene cluster. In contrast, none of the four other aligners pinpointed the deletion. In addition, Duplomap37, a specialized aligner for remapping reads in tandem duplications, cannot detect the deletion, since it uses minimap2 internally, which tends to avoid splitting reads.

Fig. 7: Comparison of five aligners on GBA1/GBAP1 and STRC/STRCP1 gene.

a The IGV visualization of alignments produced by five aligners in the GBA1 / GBAP1 region, with the SVIM, cuteSV deletion call shown in the top panel. b The IGV visualization of alignments produced by five aligners in the STRC/STRCP1 region, with the SVIM, cuteSV deletion call shown in the top panel.

Discussion

Sequence alignment is a fundamental starting point for virtually all genomic research and clinical diagnostics. It serves as the crucial bridge between raw sequencing data and the biological insights necessary for understanding genetic variation, evolutionary biology, and the molecular basis of diseases. Accurate alignment of sequencing reads to a reference genome is essential for a wide range of applications, including variant detection, comparative genomics, and personalized medicine. The precision and reliability of this initial alignment process directly influence the quality of downstream analyses, impacting our ability to identify genetic variations such as SNPs, indels, and critically SVs.

Despite significant advancements in sequencing technologies, particularly with the advent of long-read sequencing platforms, accurately aligning reads that encompass complex genomic rearrangements remains a formidable challenge. Traditional linear alignment algorithms are often inadequate for handling large-scale SVs such as inversions, duplications, translocations, and complex combinations of these events. These limitations create a cascade of analytical failures: when alignment is incorrect, subsequent analyses become unreliable or impossible, regardless of the sophistication of downstream tools. As a consequence, crucial SVs—including those with significant medical relevance—may be misrepresented or entirely missed, impeding our ability to fully understand their biological significance and clinical implications.

Graph-based genome representations offer significant advantages over linear reference genomes by providing a flexible framework to encode SVs such as duplications, inversions, and translocations as graph structures, enabling a more comprehensive representation of genetic diversity across populations38. This approach can facilitate the integration of multiple genomes into a single graph, potentially improving variant calling and haplotype resolution in complex regions. However, we show that existing algorithms, such as minigraph39, encounter substantial difficulties in producing correct genome graphs (Supplementary Note 1). Their reliance on co-linear matching during graph construction and read alignment often results in erroneous topologies, misinterpreting non-linear SVs (e.g., duplications as insertions or inversions as misalignments).

To address these challenges, we present VACmap, which overcomes the long-standing barrier of inaccurately represented complex variants. It does so via a non-linear mapping approach, and our results demonstrate the need for this method, especially for inversions and for medically relevant but challenging genes such as LPA, GBA1, and STRC. Indeed, inversions remain challenging to resolve, especially because they are often surrounded by large segmental duplications28. Furthermore, these regions often form more complex events than simple inversions. Neither complex nor simple inversions are routinely detectable with state-of-the-art methods28, despite their clinical importance15. VACmap enables this detection with more precise alignments of read segments than the other aligners evaluated here, owing to its non-linear mapping approach. This further improves the characterization of complex duplications, as shown for the KIV-2 region in LPA, and of gene/pseudogene recombination, as shown for GBA1 and STRC. VACmap can more precisely recapitulate the exact breakpoints within the reads, which improves detection and will thus provide more insight. These are only a few examples of the many medically important but challenging genes for which VACmap can improve analysis, delivering a more precise picture of variants that current analytical methods often miss23.

Methods

Ethics statement

Ethics approval for the GBA1 carrier was provided by the National Research Ethics Service London—Hampstead Ethics Committee as part of the RAPSODI study (www.rapsodi.com)40. Informed consent was provided.

Variant-aware chaining algorithm

Algorithm overview

Traditional sequence alignment algorithms13,18,19,20,22 rely on linear edits—insertions, deletions, and substitutions—that preserve sequence order and orientation. While effective for point mutations and small insertions or deletions (indels), these methods struggle with complex genomic rearrangements, such as duplications, inversions, and translocations. For example, duplications may be misidentified as insertions, and inversions are often indistinguishable from block substitutions. In cases of clustered rearrangements, linear methods attempt to reconstruct the structure by selecting from a pool of local subalignments, but this approach rarely yields a globally optimal alignment, especially in repetitive or highly rearranged regions41.

VACmap introduces a hybrid approach to address these limitations. Unlike linear alignment methods, VACmap employs the Variant-aware Chaining (VAC) algorithm, which integrates linear and non-linear edits within a weighted directed acyclic graph (DAG) framework. The algorithm identifies exact k-mer matches (anchors) between a reference and a long-read sequence, constructs a DAG where anchors are nodes, and connects them with edges representing possible alignments. Edges are classified as normal (for co-linear alignments) or variation (for rearrangements), with weights adjusted by penalties to account for indels and rearrangement complexity. By finding the longest path in the DAG, VACmap produces a unified, globally optimal alignment that accurately captures genomic rearrangements without relying on post-alignment selection.

Anchor identification

VACmap begins by identifying exact k-mer matches between the reference and long-read sequences. These matches, called anchors, are represented as quadruples (x, y, s, k), where:

x: Start position in the long-read sequence.

y: Start position in the reference sequence.

s: Orientation (1 for forward strand, −1 for reverse strand).

k: Length of the matched k-mer.
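As an illustration of the anchor representation, the following minimal sketch collects exact k-mer matches as (x, y, s, k) quadruples. It is not VACmap's implementation: it indexes every reference k-mer rather than using minimizer sampling, the function names `find_anchors` and `reverse_complement` are hypothetical, and reporting reverse-strand matches by their leftmost forward-strand reference coordinate is an assumption made here.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_anchors(read, ref, k):
    """Return all exact k-mer matches as anchors (x, y, s, k).

    x: start in the read, y: start in the reference,
    s: 1 for forward strand, -1 for reverse strand.
    Assumption: y is always given on the forward reference strand.
    """
    # Index every k-mer of the reference by its start position.
    index = defaultdict(list)
    for y in range(len(ref) - k + 1):
        index[ref[y:y + k]].append(y)

    anchors = []
    for x in range(len(read) - k + 1):
        kmer = read[x:x + k]
        # Forward-strand matches.
        for y in index.get(kmer, ()):
            anchors.append((x, y, 1, k))
        # Reverse-strand matches: the read k-mer matches the reverse
        # complement of the reference at position y.
        for y in index.get(reverse_complement(kmer), ()):
            anchors.append((x, y, -1, k))
    return anchors
```

For example, `find_anchors("AAC", "GGAACGG", 3)` yields a single forward anchor at read position 0 and reference position 2.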

Graph construction

Anchors are modeled as nodes in a weighted DAG. A directed edge from node i to node j is created if the nodes satisfy:

$${x}_{i}+{k}_{i}\le {x}_{j}+{k}_{j}$$
(1)

This condition is tested for every candidate node j. It ensures that the alignment respects the read sequence order while allowing overlaps between nodes. The overlap size is defined as:

$${{\rm{overlapsize}}}=\max (x_{i}+{k}_{i}-{x}_{j},0)$$
(2)

This accounts for cases where anchors partially overlap in the long-read sequence.

Edge classification and weighting

Edges are classified as normal or variation based on the relative positions and orientations of the connected nodes. The initial edge weight, or bonus, is calculated as:

$${{\rm{bonus}}}_{i,j}=\min (x_{j}+{k}_{j}-{x}_{i}-{k}_{i},{k}_{j})$$
(3)

This bonus reflects the length of the aligned region between nodes. To determine the edge type, we compute the sequence gain or loss (diff) as:

$${{\rm{diff}}}=\left|{{\rm{readgap}}}-{{\rm{refgap}}}\right|$$
(4)

where readgap and refgap are the distances from the last base pair of node i to the first non-overlapping base pair of node j, measured in the query and the reference, respectively:

$${{\rm{readgap}}}={\max} (x_{j}-{x}_{i}-{k}_{i},0)$$
(5)
$${{\rm{refgap}}}=\left\{\begin{array}{ll}{y}_{j}+{{\rm{overlapsize}}}-{y}_{i}-{k}_{i}, & {{\rm{if}}}\;{s}_{j}={s}_{i}=1\\ {y}_{i}-{y}_{j}-{{\rm{bonus}}}, & {{\rm{if}}}\;{s}_{j}={s}_{i}=-1\\ {y}_{j}+{{\rm{overlapsize}}}-{y}_{i}+1, & {{\rm{if}}}\;{s}_{j}=1,\,{s}_{i}=-1\\ {y}_{j}+{{\rm{bonus}}}-1-{y}_{i}-{k}_{i}, & {{\rm{if}}}\;{s}_{j}\ne {s}_{i},\,{s}_{i}=-1\end{array}\right.$$
(6)

An edge is classified as normal if it satisfies:

$${{\rm{diff}}} < {{\rm{maxdiff}}},0\le {{\rm{refgap}}} < {{\rm{maxgap}}},{s}_{i}={s}_{j}$$
(7)

Here, maxdiff (default: 50 for first-round chaining, 30 for second-round chaining) bounds the size of small indels, and maxgap (default: 1000 for first-round chaining, 100 for second-round chaining) prevents misalignments. Edges that do not meet these criteria are classified as variation edges, representing rearrangements.
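The classification rule in Eqs. (1)–(7) can be sketched as follows. This is an illustrative sketch rather than VACmap's code: anchors are plain (x, y, s, k) tuples, all function names are hypothetical, and refgap is computed only for the two same-strand cases of Eq. (6), since Eq. (7) classifies any strand-switching edge as a variation edge regardless of its gap sizes. The defaults are the first-round values of maxdiff and maxgap.

```python
def overlap_size(a, b):
    """Eq. (2): overlap between anchors a and b in the read."""
    return max(a[0] + a[3] - b[0], 0)


def bonus(a, b):
    """Eq. (3): read bases newly covered by b, capped at b's length."""
    return min(b[0] + b[3] - a[0] - a[3], b[3])


def read_gap(a, b):
    """Eq. (5): gap between a and b in the read."""
    return max(b[0] - a[0] - a[3], 0)


def ref_gap_same_strand(a, b):
    """First two cases of Eq. (6); a and b must share a strand."""
    if a[2] == 1:
        return b[1] + overlap_size(a, b) - a[1] - a[3]
    return a[1] - b[1] - bonus(a, b)


def classify_edge(a, b, maxdiff=50, maxgap=1000):
    """Return None (no edge), "normal", or "variation" for edge a -> b."""
    # Eq. (1): the edge must respect read order.
    if a[0] + a[3] > b[0] + b[3]:
        return None
    # Eq. (7) requires equal orientation for a normal edge.
    if a[2] != b[2]:
        return "variation"
    refgap = ref_gap_same_strand(a, b)
    diff = abs(read_gap(a, b) - refgap)  # Eq. (4)
    if diff < maxdiff and 0 <= refgap < maxgap:
        return "normal"
    return "variation"
```

Two adjacent forward anchors on the same diagonal, such as (0, 100, 1, 10) and (10, 110, 1, 10), give refgap = readgap = 0 and therefore a normal edge, whereas moving the second anchor to reference position 5000 exceeds maxgap and gives a variation edge.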

Penalty calculations

For normal edges, an additional penalty (NP) accounts for small indels:

$$\alpha=\left\{\begin{array}{ll}0.5 & {{\rm{if}}}\;{{\rm{diff}}} < 10\\ 2 & {{\rm{if}}}\;{{\rm{diff}}}\ge 10\end{array}\right.$$
(8)
$${{{\rm{NP}}}}_{i,j}=0.01\times {k}_{j}\times {{\rm{diff}}}+\alpha \times {\log }_{2}\left(\max \left({{\rm{diff}}},1\right)\right)+0.1\times {\log }_{2}\left(\max \left({{\rm{readgap}}},1\right)\right)$$
(9)

For a large readgap (default: >30), we use a larger penalty:

$${{{\rm{NP}}}}_{i,j}=0.01\times {k}_{j}\times {{\rm{diff}}}+\alpha \times {\log }_{2}\left(\max \left({{\rm{diff}}},1\right)\right)+0.5\times {{\rm{readgap}}}$$
(10)

For variation edges, the penalty (VP) accounts for rearrangements:

$${{{\rm{VP}}}}_{i,j}=\beta+\min \left\{\begin{array}{c}30+0.5\times {\mathrm{ln}}\left(\max (1,{{\rm{diff}}})\right)\\ \min \left(10,\frac{{{\rm{diff}}}}{100}\right)+\min \left(30,\frac{{{\rm{diff}}}}{1000}\right)\\ {{\rm{\gamma }}}\end{array}\right.$$
(11)

Here, the default values of β are 59, 40, 30, and 30 for modes L, H, S, and R, respectively, and the default value of γ is 36. After applying the additional penalty, the final weights for normal and variation edges are defined in Eqs. (12) and (13), respectively.

$$w\left(i,j\right)={{{\rm{bonus}}}}_{i,j}-{{{\rm{NP}}}}_{i,j}$$
(12)
$$w\left(i,j\right)={{{\rm{bonus}}}}_{i,j}-{{{\rm{VP}}}}_{i,j}$$
(13)

Optimization

Here, we consider the optimal alignment of the read and reference sequence to be the longest path in the weighted directed graph. The longest path among N nodes can be computed in O(N²) time using dynamic programming. In detail, for a given node j, we determine its maximum score S(j) and its best predecessor node i using Eq. (14):

$$S\left(j\right)=\max \left({\max }_{i < {{\rm{j}}}}\left(S\left(i\right)+w\left(i,j\right)\right),{k}_{j}\right)$$
(14)

However, we do not need to test every node i, because the score S(i) of node j's best predecessor is always higher than the current highest score S(k) minus \((\beta+\gamma+20)\). For node j, we therefore define a set Mj that includes all nodes scoring higher than S(k) minus \((\beta+\gamma+20)\), and we test only the nodes in Mj to identify the best predecessor of node j. This reduces the time complexity to O(hN), where h is the average size of set M.

We observed that the best predecessor of node j is usually among the top-ranked nodes in set Mj when sorted in descending order by their scores S(i). To exploit this property and enhance computational efficiency, we initialize the score of node j as S(j) = kj and then iterate over all nodes i in Mj (where i < j) in descending order of their scores S(i) to identify the best predecessor. The iteration stops early if S(i) + kj < S(j), as this condition indicates that further predecessors cannot improve the score. In practice, we maintain an array of nodes sorted in descending order of score, using binary search for insertion. The overall time complexity of the algorithm is then O(kN + N log N), where k here denotes the average number of predecessors examined per node rather than the k-mer length. The average value of k for HiFi, ONT, and CLR data in the first round of chaining is 21, 33, and 44, respectively.

Once the maximum scores of all nodes are computed, we identify the highest-scoring non-co-linear chain by backtracking. The optimal set of co-linear subalignments is recovered by discarding the variation edges in the highest-scoring chain.
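The recurrence in Eq. (14) and the backtracking step can be sketched in their plain O(N²) form. This sketch omits the pruning by set Mj and the early-stopping heuristic described above; the function name `chain` and the `weight` callback (which returns w(i, j), or None when no edge exists) are hypothetical.

```python
def chain(anchors, weight):
    """Longest-path chaining over anchors (x, y, s, k), per Eq. (14).

    `anchors` must be ordered so that valid predecessors come first.
    `weight(a, b)` returns the edge weight w(a, b) or None if no edge.
    Returns (indices of the best chain, its score).
    """
    n = len(anchors)
    if n == 0:
        return [], 0
    score = [a[3] for a in anchors]  # S(j) starts at k_j
    pred = [-1] * n                  # -1 marks the start of a chain
    for j in range(n):
        for i in range(j):
            w = weight(anchors[i], anchors[j])
            if w is not None and score[i] + w > score[j]:
                score[j] = score[i] + w
                pred[j] = i
    # Backtrack from the highest-scoring node.
    best = max(range(n), key=score.__getitem__)
    best_score = score[best]
    path = []
    while best != -1:
        path.append(best)
        best = pred[best]
    return path[::-1], best_score
```

Variation edges remain part of the returned chain; splitting the chain at those edges would give the co-linear subalignments.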

Map quality calculations

VACmap computes the map quality of the highest-scoring chain using minimap2's equation:

$${{\rm{map}}}\,{{\rm{quality}}}=40\times (1-f2/f1)\times \min (1,\frac{m}{10})\times {\mathrm{ln}}(f1)$$
(15)

where m is the number of anchors on the highest-scoring chain, f1 is its chaining score, and f2 is the score of the second-highest-scoring chain.
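Eq. (15) can be evaluated as in the sketch below. The function name is hypothetical, f1 is assumed to be positive, and any rounding or capping applied in practice is not shown because it is not specified here.

```python
import math


def map_quality(f1, f2, m):
    """Eq. (15): f1 = best chain score, f2 = second-best chain score,
    m = number of anchors on the best chain. Assumes f1 > 0."""
    return 40 * (1 - f2 / f1) * min(1, m / 10) * math.log(f1)
```

When the second-best chain scores as high as the best one (f2 = f1), the map quality is 0, reflecting an ambiguous placement.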

Local index and anchor reduction

When sequencing errors or SNPs occur in clusters, accurate mapping becomes challenging, especially with a large k-mer size and a minimizer-based k-mer sampling strategy. To mitigate this issue, VACmap adopts an approach similar to that of a previous study22 and constructs a local index with a smaller k-mer size, typically 9. The local index is built by collecting all k-mers from the reference regions covered by the previously computed highest- and second-highest-scoring chains. All k-mers in the long read are then used to query the index, and the resulting matches are taken as anchors. The highest-scoring chain is then recomputed from these anchors.

Note that the runtime of the VAC algorithm depends on the number of anchors. A smaller k-mer size usually increases the number of anchors and therefore the computational time. To address this issue, two strategies are employed to reduce the number of anchors. Firstly, VACmap iterates through the anchors and removes an anchor if its distance to the previously computed highest-scoring chain exceeds a threshold (2000 bp). The distance between an anchor and the previous chain is defined as the distance, in reference coordinates, between the anchor and its closest anchor in the chain. The closest anchor in the chain can be found by binary search in O(logN) time, where N is the number of anchors. Secondly, VACmap merges two anchors (xi, yi, si, ki) and (xj, yj, sj, kj) into a new anchor (xi, min(yi, yj), si, xj + kj – xi) if they satisfy the conditions in Eq. (16).

$$\left\{\begin{array}{c}{y}_{i}-{x}_{i}={y}_{j}-{x}_{j},\;{x}_{i}+{k}_{i}\ge {x}_{j},\;{s}_{j}={s}_{i}=1,\;{x}_{j} > {x}_{i},\;{x}_{j}+{k}_{j}-{x}_{i}\le 19\\ {{\rm{or}}}\\ {y}_{i}+{x}_{i}+{k}_{i}={y}_{j}+{x}_{j}+{k}_{j},\;{x}_{i}+{k}_{i}\ge {x}_{j},\;{s}_{j}={s}_{i}=-1,\;{x}_{j} > {x}_{i},\;{x}_{j}+{k}_{j}-{x}_{i}\le 19\end{array}\right.$$
(16)

Two anchors can be merged only if they overlap or are directly adjacent and share identical sequences in their overlapping regions. Large anchors could expand the search space during best-predecessor computation, so we limit the maximum size of merged anchors (default: 19), thereby reducing computational demand in the chaining process without compromising alignment quality.

Once the N anchors are sorted in ascending order of their position in the long read, the merging process can be computed in O(N) time using a hash table. The hash table uses an integer key and stores a list of merged anchors as its value. For anchor i, the key is set to yi − xi if si equals 1; otherwise, it is set to −(yi + xi). VACmap tests anchor i against the last anchor j in the list corresponding to its key. If the two anchors overlap or are directly adjacent and the covered sequences in the reference and long read are identical, VACmap updates the last anchor in the list; otherwise, anchor i is appended to the list. If a key does not exist in the hash table, a new key-value pair is inserted. Finally, the set of merged anchors is obtained by traversing the hash table.
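The merging procedure can be sketched as below. This is not VACmap's code: the function name is hypothetical, and the sketch departs from the description in one labeled respect, using a (strand, diagonal) tuple as the hash key instead of a single signed integer so that forward and reverse diagonals can never collide. The conditions of Eq. (16) are checked explicitly against the last anchor stored under the key.

```python
def merge_anchors(anchors, max_size=19):
    """Merge overlapping or adjacent anchors on the same diagonal.

    anchors: iterable of (x, y, s, k). Returns the merged anchors,
    grouped by diagonal in order of first appearance.
    """
    table = {}  # key -> list of merged anchors on that diagonal
    for x, y, s, k in sorted(anchors):  # ascending read position
        # Assumption: tuple key instead of the integer key in the text.
        key = (s, y - x) if s == 1 else (s, y + x)
        bucket = table.setdefault(key, [])
        if bucket:
            xi, yi, si, ki = bucket[-1]
            if s == 1:
                same_diagonal = (yi - xi == y - x)
            else:
                same_diagonal = (yi + xi + ki == y + x + k)
            if (si == s and same_diagonal
                    and xi + ki >= x          # overlapping or adjacent
                    and x > xi
                    and x + k - xi <= max_size):
                # Eq. (16) holds: extend the last anchor in place.
                bucket[-1] = (xi, min(yi, y), si, x + k - xi)
                continue
        bucket.append((x, y, s, k))
    return [a for bucket in table.values() for a in bucket]
```

Three consecutive forward 9-mers starting at read positions 0, 1, and 2 on one diagonal collapse into a single anchor of length 11, while a run that would exceed 19 bp is split into a new anchor.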

Quality control of linear subalignment

Sequencing errors and closely spaced SNPs make it difficult to gather sufficient anchors for the chaining process. To accommodate sequencing errors and clustered SNPs, VACmap uses a sizeable maxgap parameter, with a default setting of 100, within the VAC algorithm. However, a large maxgap value can cause unrelated sequences to be aligned by mistake. To address this problem, we assess the error ratio within each linear subalignment and exclude subalignments with a high error ratio. The error ratio is obtained by using edlib42 to calculate the edit distance between the matched sequences in the reference genome and the long read, and then dividing this edit distance by the length of the shorter of the two sequences. A linear subalignment is deemed unreliable and rejected if its error ratio exceeds a predetermined threshold: 0.2 for sequences generated using PacBio CLR or ONT, and 0.1 for sequences generated using PacBio HiFi.
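The error-ratio filter can be illustrated with the short sketch below. VACmap uses edlib for the edit distance; to keep the example self-contained, a plain dynamic-programming Levenshtein distance is substituted here, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance (stand-in for edlib in this sketch)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # (mis)match
        prev = cur
    return prev[-1]


def is_reliable(read_seg, ref_seg, max_error_ratio):
    """Keep a subalignment unless its error ratio exceeds the threshold."""
    shorter = min(len(read_seg), len(ref_seg))
    if shorter == 0:
        return False  # assumption: empty segments are rejected
    return edit_distance(read_seg, ref_seg) / shorter <= max_error_ratio
```

Two 10 bp segments that differ at two positions have an error ratio of 0.2, so the subalignment would be kept under the CLR/ONT threshold (0.2) but rejected under the HiFi threshold (0.1).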

Mode selection

VACmap provides four mapping modes (L, H, S, R) tailored for specific alignment purposes. Mode L is recommended for high-accuracy long-read data (e.g., PacBio HiFi) and uses a large variation penalty (β = 59) and a low sequence divergence threshold (0.1) to retain only high-confidence alignments. Mode H is preferred for noisy long-read data (e.g., ONT) and applies a smaller variation penalty (β = 40) and a higher sequence divergence threshold (0.3) to accommodate higher sequencing error rates. Mode S is designed to improve sensitivity for small-scale rearrangements and uses a small variation penalty (β = 30) and a high sequence divergence threshold (0.5) to capture subtle variations. Mode R is an experimental mode intended solely for testing purposes, employing a fixed variation penalty (VPi,j = 30) and a divergence threshold of 0.5. Because it is still under development, Mode R is not recommended for normal usage, and users should rely on Modes L, H, or S for standard applications.

Synthetic data simulation and evaluation

To assess the performance of different mapping methods in downstream SV detection tasks, we used the CHM13 T2T human reference genome43 to generate two synthetic datasets. To simulate both simple and complex SVs, we used VACsim to randomly introduce 30,000 SVs into the CHM13 T2T reference, producing an altered genome sequence containing both simple and complex SVs. VACsim determined the complexity of each simulated SV (measured by the number of simple SV events it contains) by randomly sampling from the range 1–20. The type of each simple SV event was sampled from five event types, namely deletion, insertion, duplication, inversion, and translocation, with respective probabilities of 0.24, 0.24, 0.24, 0.24, and 0.04. The size of each simple SV event ranged from 100 to 1000 bp. Because SVIM (v1.4.2) employs a pattern-matching strategy to detect SVs, which requires each simple SV event to conform to the “Normal + Variation + Normal” pattern, we separated adjacent simple SV events by 200 bp of normal sequence. For the copy number estimation task, VACsim was used to simulate and implant 10,000 tandem duplications, with repeat unit sizes ranging from 100 to 500 bp and repeat numbers ranging from 1 to 20, into chromosome 1 of the CHM13 T2T reference genome sequence. SNPs were then introduced into the altered sequences, and SURVIVOR was used to generate simulated PacBio and ONT long-read datasets.
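The sampling scheme for a single complex SV can be illustrated as follows. This sketch is not VACsim: the function name is hypothetical, and uniform sampling of the event count and of the event sizes is an assumption, since only the ranges are stated above.

```python
import random

EVENT_TYPES = ["deletion", "insertion", "duplication",
               "inversion", "translocation"]
EVENT_PROBS = [0.24, 0.24, 0.24, 0.24, 0.04]


def sample_complex_sv(rng):
    """Sample one simulated SV as a list of (event type, size in bp)."""
    # Assumption: complexity is drawn uniformly from 1..20.
    n_events = rng.randint(1, 20)
    types = rng.choices(EVENT_TYPES, weights=EVENT_PROBS, k=n_events)
    # Assumption: sizes are drawn uniformly from 100..1000 bp.
    return [(t, rng.randint(100, 1000)) for t in types]
```

Passing a seeded `random.Random` instance, for example `sample_complex_sv(random.Random(0))`, makes the simulated event list reproducible.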

Five long-read mapping methods, namely VACmap, NGMLR, Winnowmap2, minimap2, and LRA, were used to align the simulated long reads to the T2T reference genome sequences. For VACmap, we used mode L to align synthetic PacBio HiFi data and mode H to align synthetic PacBio CLR and ONT data. The SVIM (v1.4.2) SV detection tool was used to detect SVs from the alignments generated by the different mapping methods. For 40×, 20×, 10×, and 5× read coverage, the minimum read support for SVIM-called SVs was set to 10, 5, 3, and 2, respectively. Following previous studies6, we evaluated the performance of CSV calls by decomposing CSVs into individual simple SV events and evaluating each event separately.

Furthermore, following the procedure outlined in a previous study13, we employed Mashmap44 to identify repetitive regions in the reference genome sequence with sequence similarity exceeding 95% and a sequence length greater than 10,000 bps. Subsequently, we analyzed the performance of downstream SV detection within these repetitive regions using alignments produced by different mapping algorithms.

Evaluation with GIAB callset

We evaluated five mapping methods, namely VACmap, NGMLR, Winnowmap2, minimap2, and LRA, using the GIAB Tier1 (v0.6) and CMRG (1.00) SV benchmark sets for the HG002 human sample. These benchmark sets are available at the following links: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NIST_STRUCTURAL VARIANT_v0.6/ and https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/, respectively. PacBio CLR, PacBio HiFi, and ONT sequencing data were accessed at s3://giab/data/AshkenazimTrio/HG002_NA24385_son/. For VACmap, we used mode L to align PacBio HiFi data and mode H to align PacBio CLR and ONT data. SVIM (v1.4.2) and cuteSV (v2.1.2) were used to detect SVs. For PacBio CLR (69×), PacBio HiFi (30×), and ONT (50×) data, the minimum read support for SVs called by SVIM and cuteSV was set to 5. The exception was cuteSV on PacBio CLR data, where the read support was set to 10 because cuteSV needs a higher read support setting to filter out false-positive results (Supplementary Fig. 20). Subsequently, we evaluated the SVs against the GIAB benchmark set using Truvari (v2.0.0). Command line parameters provided to these tools are listed in Supplementary Table 4.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.