Abstract
Population-level structural variant (SV) profiling is crucial in the era of pangenomes. However, identifying SVs from genome assemblies and pangenome graphs remains a substantial challenge. Here we present Swave, a sequence-to-image, deep learning-based method that accurately resolves both simple and complex SVs, along with their population characteristics, from assembly-derived pangenome graphs. Swave introduces ‘projection waves’ to summarize the dotplot images that capture mapping patterns between reference and SV-indicating alleles in the pangenome. Then, a recurrent neural network distinguishes true SV signals from background noise introduced by genomic repeats. Swave demonstrates superior performance in both SV-type classification and genotyping compared with existing methods. When applied to healthy cohorts and rare-disease cohorts, Swave reveals complex and polymorphic SV patterns across human populations and identifies potentially pathogenic SVs. These advancements will facilitate the creation of comprehensive population-level SV catalogs, deepening our understanding of SVs in genetic diversity and disease associations.
Data availability
All the published reference genomes, sample assemblies and SV callsets are presented in Supplementary Table 17. The callsets on healthy and disease cohorts produced by Swave are available via Zenodo at https://doi.org/10.5281/zenodo.18229680 and https://doi.org/10.5281/zenodo.18425621 (refs. 54,55).
Code availability
Swave is available via GitHub at https://github.com/songbowang125/Swave.git (ref. 56). The custom scripts and scripts for reproducing the results in this paper are available via GitHub at https://github.com/songbowang125/Swave-Utils.git (ref. 57).
References
Ahsan, M. U., Liu, Q., Perdomo, J. E., Fang, L. & Wang, K. A survey of algorithms for the detection of genomic structural variants from long-read sequencing data. Nat. Methods 20, 1143–1158 (2023).
Wang, S. et al. De novo and somatic structural variant discovery with SVision-pro. Nat. Biotechnol. 43, 181–185 (2025).
Ding, W. et al. Adaptive functions of structural variants in human brain development. Sci. Adv. 10, eadl4600 (2024).
Collins, R. L. & Talkowski, M. E. Diversity and consequences of structural variation in the human genome. Nat. Rev. Genet. 26, 443–462 (2025).
Li, Y. et al. Patterns of somatic structural variation in human cancer genomes. Nature 578, 112–121 (2020).
Mahmoud, M. et al. Structural variant calling: the long and the short of it. Genome Biol. 20, 246 (2019).
Lin, J. et al. SVision: a deep learning approach to resolve complex structural variants. Nat. Methods 19, 1230–1233 (2022).
Chen, Y. et al. Deciphering the exact breakpoints of structural variations using long sequencing reads with DeBreak. Nat. Commun. 14, 283 (2023).
Popic, V. et al. Cue: a deep-learning framework for structural variant discovery and genotyping. Nat. Methods 20, 559–568 (2023).
Smolka, M. et al. Detection of mosaic and population-level structural variants with Sniffles2. Nat. Biotechnol. 42, 1571–1580 (2024).
Denti, L., Khorsand, P., Bonizzoni, P., Hormozdiari, F. & Chikhi, R. SVDSS: structural variation discovery in hard-to-call genomic regions using sample-specific strings from accurate long reads. Nat. Methods 20, 550–558 (2023).
Wang, S. & Ye, K. Deep-learning based representation and recognition for genome variants-from SNVs to structural variants. Natl Sci. Rev. 11, nwae335 (2024).
Olson, N. D. et al. Variant calling and benchmarking in an era of complete human genome sequences. Nat. Rev. Genet. 24, 464–483 (2023).
Liu, Y. H., Luo, C., Golding, S. G., Ioffe, J. B. & Zhou, X. M. Tradeoffs in alignment and assembly-based methods for structural variant detection with long-read sequencing data. Nat. Commun. 15, 2447 (2024).
Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science 372, abf7117 (2021).
Heller, D. & Vingron, M. SVIM-asm: structural variant detection from haploid and diploid genome assemblies. Bioinformatics 36, 5519–5521 (2021).
Liao, W. W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
Gao, Y. et al. A pangenome reference of 36 Chinese populations. Nature 619, 112–121 (2023).
Logsdon, G. A. et al. Complex genetic variation in nearly complete human genomes. Nature 644, 430–441 (2025).
Collins, R. L. et al. A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020).
Groza, C. et al. Pangenome graphs improve the analysis of structural variants in rare genetic diseases. Nat. Commun. 15, 657 (2024).
Yilmaz, F. et al. Reconstruction of the human amylase locus reveals ancient duplications seeding modern-day variation. Science 386, eadn0609 (2024).
Bolognini, D. et al. Recurrent evolution and selection shape structural diversity at the amylase locus. Nature 634, 617–625 (2024).
Plender, E. G. et al. Structural and genetic diversity in the secreted mucins MUC5AC and MUC5B. Am. J. Hum. Genet. 111, 1700–1716 (2024).
Jeffares, D. C. et al. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nat. Commun. 8, 14061 (2017).
Kirsche, M. et al. Jasmine and Iris: population-scale structural variant comparison and analysis. Nat. Methods 20, 408–417 (2023).
English, A. C., Menon, V. K., Gibbs, R. A., Metcalf, G. A. & Sedlazeck, F. J. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol. 23, 271 (2022).
Zheng, Z. Y. et al. A sequence-aware merger of genomic structural variations at population scale. Nat. Commun. 15, 960 (2024).
Jayakodi, M. et al. Structural variation in the pangenome of wild and domesticated barley. Nature 636, 654–662 (2024).
Bian, P. et al. A graph-based goat pangenome reveals structural variations involved in domestication and adaptation. Mol. Biol. Evol. 41, msae251 (2024).
Li, H., Feng, X. & Chu, C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. 21, 265 (2020).
Hickey, G. et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat. Biotechnol. 42, 663–673 (2024).
Garrison, E. et al. Building pangenome graphs. Nat. Methods 21, 2008–2012 (2024).
Cui, Y., Peng, C., Xia, Z., Yang, C. & Guo, Y. A survey of sequence-to-graph mapping algorithms in the pangenome era. Genome Biol. 26, 138 (2025).
Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018).
Andreace, F., Lechat, P., Dufresne, Y. & Chikhi, R. Comparing methods for constructing and representing human pangenome graphs. Genome Biol. 24, 274 (2023).
Garrison, E., Kronenberg, Z. N., Dawson, E. T., Pedersen, B. S. & Prins, P. A spectrum of free software tools for processing the VCF variant call format: vcflib, bio-vcf, cyvcf2, hts-nim and slivar. PLoS Comput. Biol. 18, e1009123 (2022).
Jiang, T. et al. Long-read-based human genomic structural variation detection with cuteSV. Genome Biol. 21, 189 (2020).
Zook, J. M. et al. A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020).
Porubsky, D. et al. Recurrent inversion polymorphisms in humans associate with genetic instability and genomic disorders. Cell 185, 1986–2005 (2022).
Vollger, M. R. et al. Segmental duplications and their variation in a complete human genome. Science 376, abj6965 (2022).
Yang, J. & Chaisson, M. J. P. TT-Mars: structural variants assessment based on haplotype-resolved assemblies. Genome Biol. 23, 110 (2022).
Zhao, X., Weber, A. M. & Mills, R. E. A recurrence-based approach for validating structural variation using long-read sequencing technology. Gigascience 6, 1–9 (2017).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Pajic, P., Lin, Y. L., Xu, D. & Gokcumen, O. The psoriasis-associated deletion of late cornified envelope genes LCE3B and LCE3C has been maintained under balancing selection since human Denisovan divergence. BMC Evol. Biol. 16, 265 (2016).
Ago, Y., Asano, S., Hashimoto, H. & Waschek, J. A. Probing the VIPR2 microduplication linkage to schizophrenia in animal and cellular models. Front. Neurosci. 15, 717490 (2021).
Chen, C. H. et al. Identification of rare mutations of the vasoactive intestinal peptide receptor 2 gene in schizophrenia. Psychiatric Genet. 32, 125–130 (2022).
Pitera, J. E., Scambler, P. J. & Woolf, A. S. Fras1, a basement membrane-associated protein mutated in Fraser syndrome, mediates both the initiation of the mammalian kidney and the integrity of renal glomeruli. Hum. Mol. Genet. 17, 3953–3964 (2008).
Slavotinek, A., Li, C., Sherr, E. H. & Chudley, A. E. Mutation analysis of the FRAS1 gene demonstrates new mutations in a propositus with Fraser syndrome. Am. J. Med. Genet. A 140A, 1909–1914 (2006).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Vollger, M. R. et al. Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads. Ann. Hum. Genet. 84, 125–140 (2020).
Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
Wang, S. Swave call on healthy cohort. Zenodo https://doi.org/10.5281/zenodo.18229680 (2026).
Wang, S. Swave call on disease cohort. Zenodo https://doi.org/10.5281/zenodo.18425621 (2026).
Wang, S. Swave code. Zenodo https://doi.org/10.5281/zenodo.18229263 (2026).
Wang, S. Swave Utils code. Zenodo https://doi.org/10.5281/zenodo.18229275 (2026).
Acknowledgements
K.Y. is supported by the National Key R&D Program of China (grant no. 2022YFC3400300) and National Science Foundation of China (grant nos. 32125009 and 32430017). S.W. is supported by the National Science Foundation of China (grant no. 323B2015).
Author information
Authors and Affiliations
Contributions
K.Y. designed and supervised the research. S.W. developed the algorithm and performed the performance evaluation and downstream analysis. T.X. and P.Z. analyzed the impact of SVs.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Genetics thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Overview of pangenome construction and allele extraction in Swave.
a, Construction of the pangenome graph using Minigraph with both reference and sample assemblies. The resulting graph is saved in GFA format, which encodes node sequences and directed edges between nodes. b, Assembly paths are recovered using the --call function in Minigraph. Regions where paths diverge (snarls) are identified as candidate structural variant loci. Allele sequences for each snarl are reconstructed by extracting the corresponding node sequences from the GFA. c, Based on the Minigraph --call outputs, Swave determines the carrier assemblies for each allele and generates dotplots for each reference–alternative pair in the next processing module. d, Swave’s handling of phasing information and multi-allelic loci. Sample genotypes are obtained by joining all haplotype genotypes.
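The allele reconstruction step in panel b can be illustrated with a minimal sketch. It assumes a toy GFA with hypothetical node names (s1, s2, s3) and walks written with `>` and `<` orientation markers; it is not Swave's own code, and the function names are invented for illustration.

```python
import re

# Assumption: a toy graph; real Minigraph GFA files carry extra tags on each line.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def parse_gfa_segments(gfa_text):
    """Collect node sequences from the S (segment) lines of a GFA file."""
    segments = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
    return segments


def reverse_complement(seq):
    """Reverse-complement a DNA string (used for nodes traversed with '<')."""
    return seq.translate(_COMPLEMENT)[::-1]


def allele_sequence(walk, segments):
    """Concatenate node sequences along an oriented walk such as '>s1>s3'."""
    parts = []
    for orient, node in re.findall(r"([<>])([^<>]+)", walk):
        seq = segments[node]
        parts.append(seq if orient == ">" else reverse_complement(seq))
    return "".join(parts)


# Hypothetical snarl between s1 and s3: one path visits s2, the other skips it.
gfa = (
    "S\ts1\tACGT\n"
    "S\ts2\tGG\n"
    "S\ts3\tTTA\n"
    "L\ts1\t+\ts2\t+\t0M\n"
    "L\ts2\t+\ts3\t+\t0M\n"
    "L\ts1\t+\ts3\t+\t0M\n"
)
segments = parse_gfa_segments(gfa)
ref_allele = allele_sequence(">s1>s2>s3", segments)
alt_allele = allele_sequence(">s1>s3", segments)
```

In this toy case the alternative allele lacks the s2 node, so comparing it with the reference allele would show a 2 bp deletion.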
Extended Data Fig. 2 Dotplot generation and projection.
a, Base-level refinement of k-mer-based dotplots. Initial alignment introduces gaps of (k − 1) bases near SV breakpoints. Swave performs base-level remapping at the boundaries where k-mer matching stops, improving breakpoint resolution for downstream SV classification. b, Influence of genomic repeats on wave patterns. Dense, repetitive regions generate abundant spurious matches in dotplots, resulting in fluctuating wave signals upon projection.
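As a rough illustration of how a dotplot can be collapsed into a one-dimensional signal, the sketch below builds forward k-mer matches between a reference and an alternative allele and counts how many matches cover each reference base. The projection rule, the sequences, and the function names are assumptions made for this example; the legend does not specify Swave's exact definition of a projection wave.

```python
from collections import defaultdict


def kmer_matches(ref, alt, k):
    """Return (ref_pos, alt_pos) pairs where the two sequences share a k-mer."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    matches = []
    for j in range(len(alt) - k + 1):
        for i in index.get(alt[j:j + k], []):
            matches.append((i, j))
    return matches


def project_onto_ref(matches, ref_len, k):
    """Count, for each reference base, how many matching k-mers cover it."""
    wave = [0] * ref_len
    for i, _ in matches:
        for p in range(i, i + k):
            wave[p] += 1
    return wave


# Toy alleles: the alternative allele lacks the middle 8 bp block of the reference.
ref = "ACGTTGCA" + "GATTACAG" + "CCATGGTA"
alt = "ACGTTGCA" + "CCATGGTA"
k = 4
matches = kmer_matches(ref, alt, k)
wave = project_onto_ref(matches, len(ref), k)
```

Here the wave drops to zero across the deleted block and stays positive in both flanks. In a repeat-rich region, the same k-mer would occur at many reference positions and add spurious counts, which is the fluctuation described in panel b.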
Extended Data Fig. 3 Recurrent Neural Network for SV classification in Swave.
a, Projected wave signals are encoded as four-element tuples per genomic segment, comprising span length, background average wave value, and the differences between SV-implying and background waves for both forward and reverse matches. These tuples serve as the input for the RNN classification model. b, A one-layer Bi-LSTM with 64 hidden units forms the core of the RNN, enabling context-aware classification of SV components across the sequence. c, Time and memory consumption were measured on three datasets: HGSVC3 (130 haplotypes), the healthy cohort (HGSVC3 + HPRC + CPC, 334 haplotypes) and the disease cohort (GA4K, 574 haplotypes). On a personal computer (CPU: Intel Core i9-13900K; maximum memory: 32 GB), Swave was run with 8 threads. On a computing cluster node (CPU: Intel Xeon Gold 6240R; maximum memory: 376 GB), Swave could be accelerated by using 24 threads.
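The tuple encoding in panel a might look roughly like the sketch below. How Swave segments the waves and how it computes the background are not stated in the legend, so the segment boundaries, the per-position background wave, and the use of segment means are all assumptions of this example.

```python
from statistics import mean


def encode_segments(boundaries, forward, reverse, background):
    """Encode each segment as (span, background mean, forward diff, reverse diff).

    Assumption: 'difference' is taken as the segment mean of the forward or
    reverse wave minus the segment mean of the background wave.
    """
    features = []
    for start, end in zip(boundaries, boundaries[1:]):
        bg = mean(background[start:end])
        features.append((
            end - start,
            bg,
            mean(forward[start:end]) - bg,
            mean(reverse[start:end]) - bg,
        ))
    return features


# Toy inversion-like pattern: the middle segment matches only in reverse.
forward = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
reverse = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
background = [0] * 12
features = encode_segments([0, 4, 8, 12], forward, reverse, background)
```

A sequence of such tuples, one per segment, is the kind of input a recurrent model such as the Bi-LSTM in panel b could read in order.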
Extended Data Fig. 4 Performance evaluation results.
a, F1-score comparison for simple structural variant (SSV) detection. b, F1-score comparison for complex structural variant (CSV) detection. c, Mendelian consistency across three trio datasets. Average consistencies are noted on the plot. d, Genotyping (GT) missing rate across two population datasets. Average missing rates on the two datasets are noted on the plot. e, Improvements in genotyping performance following PanPop refinement. We applied SVIM-asm followed by three merging tools to each of three trios, giving n = 9. The boxplot shows the median (Q2, 50th percentile), first quartile (Q1, 25th percentile) and third quartile (Q3, 75th percentile). The box spans the interquartile range (IQR), from Q1 to Q3. The minima and maxima are defined as Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, respectively. The whiskers cover values between the minima and Q1 and between Q3 and the maxima. Values falling outside the minima–maxima range are plotted as outliers.
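The Mendelian consistency metric in panel c can be sketched as follows. The genotype representation (a pair of allele indices per individual, with `None` for a missing allele) and the choice to skip loci with missing genotypes are assumptions of this example rather than details given in the legend.

```python
def is_mendelian_consistent(child, father, mother):
    """True if the child's two alleles can be drawn one from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def mendelian_consistency(trio_genotypes):
    """Fraction of fully genotyped loci that obey Mendelian inheritance."""
    # Assumption: loci with any missing allele (None) are excluded.
    complete = [g for g in trio_genotypes if all(None not in gt for gt in g)]
    if not complete:
        return float("nan")
    consistent = sum(is_mendelian_consistent(*g) for g in complete)
    return consistent / len(complete)


# Each locus is (child, father, mother); allele 0 is the reference allele.
loci = [
    ((0, 1), (0, 0), (0, 1)),     # consistent
    ((1, 1), (0, 1), (1, 1)),     # consistent
    ((0, 1), (0, 0), (0, 0)),     # inconsistent: neither parent carries allele 1
    ((0, None), (0, 1), (0, 0)),  # missing genotype, skipped
]
rate = mendelian_consistency(loci)
```

Because alleles are compared by index, the same check works for multi-allelic loci with allele indices above 1.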
Extended Data Fig. 5 Validation and illustration of inversions.
a,b, Validation results for all detected balanced and complex inversions. Three orthogonal metrics were applied: mapping integrity, TT-Mars and VaPoR. c, Comparison of breakpoint accuracy between Swave and PAV for the 52 overlapping inversions. The boxplot shows the median (Q2, 50th percentile), first quartile (Q1, 25th percentile) and third quartile (Q3, 75th percentile). The box spans the interquartile range (IQR), from Q1 to Q3. The minima and maxima are defined as Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, respectively. The whiskers cover values between the minima and Q1 and between Q3 and the maxima. Values falling outside the minima–maxima range are plotted as outliers. d, Illustration of breakpoint distortion caused by inverted segmental duplications (SDs). While PAV’s breakpoints are frequently shifted due to alignment ambiguity, Swave maintains accurate breakpoint placement within repetitive regions.
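The boxplot quantities defined in this legend can be computed as in the sketch below. The quartile interpolation convention is not stated in the legend; the "inclusive" method of Python's `statistics.quantiles` is an assumption, and other conventions give slightly different quartiles.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Quartiles, IQR and the Q1 - 1.5*IQR / Q3 + 1.5*IQR bounds of a sample."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": iqr,
        "lower": lower,
        "upper": upper,
        # Values beyond the two bounds are flagged as outliers.
        "outliers": [v for v in values if v < lower or v > upper],
    }


# Hypothetical measurements with one extreme value.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 40])
```

For this sample the bounds are -3.5 and 14.5, so only the value 40 is flagged.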
Extended Data Fig. 6 Characterization of polymorphic scarred inversions.
a, Example of a polymorphic scarred inversion snarl containing five distinct alleles (1), generated by combinatorial arrangements of five unique internal scars across four genomic regions. The most complex variant includes four separate scars (2). b, Length distribution of all detected scars (n = 81), ranging from 61 bp to 18,451 bp. c, Repeat annotation of all scars (n = 81). d, Example of polymorphic scarred inversions driven by repetitive elements, where two repeat expansions give rise to insertion scars of different lengths.
Extended Data Fig. 7 Rare and complex structural variants revealed by Swave.
a, Pangenome graph structure of snarl ‘>s21910 > s21914’. A rare CSV allele introduced a novel traversal path not observed among the reported alleles. b, IGV snapshot of snarl ‘>s21910 > s21914’, illustrating the co-occurrence of two distinct SSVs and one CSV. The rare SSV (67 kb deletion) extended the common 32 bp deletion, whereas the rare CSV (a duplication flanked by a deletion) added a 58 kb duplication at the right breakpoint of the frequent 32 kb simple deletion. c, Pangenome graph of snarl ‘>s10752 > s10754’, showing a novel CSV locus that is structurally distinct from previously reported variants. d, Illustration of a rare scarred inversion that partially disrupts the coding structure of VIPR2, a gene associated with neuropsychiatric disorders.
Extended Data Fig. 8 How potentially pathogenic structural variants affect genes.
a, Mapping of residue-level disruptions caused by ClinVar pathogenic variants and the CSV detected by Swave. b, Structural annotation of the CYP17A1 protein highlights two functional binding sites, as sourced from UniProt. c, Mapping of residue-level disruptions caused by ClinVar pathogenic variants and the CSV detected by Swave. d, Schematic of a simple structural variant, a 411 bp inversion, disrupting the second exon of HYLS1, a gene implicated in hydrolethalus syndrome. e, Representation of a 43 kb deletion spanning introns 6 to 14 of FRAS1, a gene associated with Fraser syndrome.
Extended Data Fig. 9 Genotyping incompleteness associated with unresolved pangenome graph regions.
a, Mapping results of a carrier assembly exhibiting missing genotypes across four snarls. b, Genome-wide distribution of snarls with missing genotypes across HGSVC samples. The Y-axis indicates the number of assemblies lacking mappable sequence at each snarl.
Extended Data Fig. 10 Misclassification of dispersed duplications as insertions.
Dispersed duplications, in which the source sequence originates from a distant locus on the same chromosome (a) or from a different chromosome (b), pose challenges for Swave. When generating dotplots, Swave extends the reference region by twice the length of the alternative sequence on both sides. Consequently, if the duplicated source sequence lies outside this extended window, Swave fails to capture it and instead reports the variant as an insertion. c, Using the 65 samples from HGSVC, we compared Swave’s outputs with dispersed duplications reported by SVision-pro. We found that Swave misclassified 0–4 duplications with distant same-chromosome sources and 10–51 duplications with cross-chromosome sources as insertions. The boxplot shows the median (Q2, 50th percentile), first quartile (Q1, 25th percentile) and third quartile (Q3, 75th percentile). The box spans the interquartile range (IQR), from Q1 to Q3. The minima and maxima are defined as Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, respectively. The whiskers cover values between the minima and Q1 and between Q3 and the maxima. Values falling outside the minima–maxima range are plotted as outliers.
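The window rule described in this legend can be sketched as a simple interval check. The half-open coordinate convention, the clipping at chromosome ends, and the function names are assumptions of this example; only the "twice the alternative length on both sides" rule comes from the legend.

```python
def reference_window(ref_start, ref_end, alt_len, chrom_len):
    """Extend the reference region by twice the alternative length on each side."""
    flank = 2 * alt_len
    # Assumption: the window is clipped to the chromosome boundaries.
    return max(0, ref_start - flank), min(chrom_len, ref_end + flank)


def source_is_captured(site_chrom, window, source):
    """True if the duplication source lies entirely inside the dotplot window."""
    src_chrom, src_start, src_end = source
    win_start, win_end = window
    return (
        src_chrom == site_chrom
        and win_start <= src_start
        and src_end <= win_end
    )


# Hypothetical 500 bp alternative allele inserted at chr1:100,000.
window = reference_window(100_000, 100_000, alt_len=500, chrom_len=1_000_000)
nearby = source_is_captured("chr1", window, ("chr1", 100_200, 100_700))
distant = source_is_captured("chr1", window, ("chr1", 500_000, 500_500))
other_chrom = source_is_captured("chr1", window, ("chr5", 100_200, 100_700))
```

Only the nearby source falls inside the window here. The distant same-chromosome source and the cross-chromosome source correspond to cases a and b, where the variant would be reported as an insertion.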
Supplementary information
Supplementary Tables
Supplementary Tables 1–17.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, S., Xu, T., Zhang, P. et al. Population-level structural variant characterization using pangenome graphs. Nat Genet 58, 664–672 (2026). https://doi.org/10.1038/s41588-026-02538-6
Received:
Accepted:
Published:
Version of record:
Issue date:
DOI: https://doi.org/10.1038/s41588-026-02538-6


