Building pangenome graphs

Garrison, Erik; Guarracino, Andrea; Heumos, Simon; Villani, Flavia; Bao, Zhigui; Tattini, Lorenzo; Hagmann, Jörg; Vorbrugg, Sebastian; Marco-Sola, Santiago; Kubica, Christian; Ashbrook, David G.; Thorell, Kaisa; Rusholme-Pilcher, Rachel L.; Liti, Gianni; Rudbeck, Emilio; Golicz, Agnieszka A.; Nahnsen, Sven; Yang, Zuyu; Mwaniki, Moses Njagi; Nobrega, Franklin L.; Wu, Yi; Chen, Hao; de Ligt, Joep; Sudmant, Peter H.; Huang, Sanwen; Weigel, Detlef; Soranzo, Nicole; Colonna, Vincenza; Williams, Robert W.; Prins, Pjotr

doi:10.1038/s41592-024-02430-3

Brief Communication
Published: 21 October 2024

Nature Methods volume 21, pages 2008–2012 (2024)

Abstract

Pangenome graphs can represent all variation between multiple reference genomes, but current approaches to build them exclude complex sequences or are based upon a single reference. In response, we developed the PanGenome Graph Builder, a pipeline for constructing pangenome graphs without bias or exclusion. The PanGenome Graph Builder uses all-to-all alignments to build a variation graph in which we can identify variation, measure conservation, detect recombination events and infer phylogenetic relationships.
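
As a rough illustration of the difference between a single-reference design and the all-to-all design described above, the sketch below enumerates which genome pairs would be compared in each case. The genome names and both helper functions are hypothetical, and treating pairs as unordered is an assumption made for simplicity; this is not the PGGB implementation.

```python
from itertools import combinations

def reference_based_pairs(genomes, reference):
    """Pairs compared when every genome is aligned only to one chosen reference."""
    return [(reference, g) for g in genomes if g != reference]

def all_to_all_pairs(genomes):
    """Every unordered pair of genomes, with no genome given a special role."""
    return list(combinations(genomes, 2))

# Hypothetical haplotype names, for illustration only.
genomes = ["hapA", "hapB", "hapC", "hapD"]
ref_pairs = reference_based_pairs(genomes, "hapA")
all_pairs = all_to_all_pairs(genomes)

print(len(ref_pairs), "reference-based comparisons")
print(len(all_pairs), "all-to-all comparisons")
```

In this toy setting, a pair such as hapB and hapC is compared directly only in the all-to-all case, which is the sense in which no single genome acts as the frame of reference.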

Data availability

Pangenomes are available at Zenodo at https://doi.org/10.5281/zenodo.7658895 (ref. ³⁷). Scripts and source data URLs for downloading the pangenomes are available at https://github.com/pangenome/pggb-paper/blob/main/workflows/0.Preparation.md. Accession codes for all pangenomes are listed in Supplementary File 1.

Code availability

PGGB is available at https://github.com/pangenome/pggb. Code used for experiments can be accessed at https://github.com/pangenome/pggb-paper.

References

Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature Biotechnol. 36, 875–879 (2018).
Paten, B., Novak, A. M., Eizenga, J. M. & Garrison, E. Genome graphs and the evolution of genome inference. Genome Res. 27, 665–676 (2017).
Armstrong, J. et al. Progressive cactus is a multiple-genome aligner for the thousand-genome era. Nature 587, 246–251 (2020).
Guarracino, A., Heumos, S., Nahnsen, S., Prins, P. & Garrison, E. ODGI: understanding pangenome graphs. Bioinformatics 38, 3319–3326 (2022).
Li, H., Feng, X. & Chu, C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. https://doi.org/10.1186/s13059-020-02168-z (2020).
Hickey, G. et al. Pangenome graph construction from genome alignments with minigraph-cactus. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01793-w (2023).
Noll, N., Molari, M., Shaw, L. P. & Neher, R. A. PanGraph: scalable bacterial pan-genome graph construction. Preprint at bioRxiv https://doi.org/10.1099/mgen.0.001034 (2022).
Garrison, E. & Guarracino, A. Unbiased pangenome graphs. Bioinformatics https://doi.org/10.1093/bioinformatics/btac743 (2022).
Minkin, I., Pham, S. & Medvedev, P. TwoPaCo: an efficient algorithm to build the compacted de bruijn graph from many complete genomes. Bioinformatics 33, 4024–4032 (2016).
Chin, C.-S. et al. Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes. Nat Methods 20, 1213–1221 (2023).
Sullivan, P. F. et al. Leveraging base-pair mammalian constraint to understand genetic variation and human disease. Science https://doi.org/10.1126/science.abn2937 (2023).
Guarracino, A., Mwaniki, N., Marco-Sola, S. & Garrison, E. wfmash: whole-chromosome pairwise alignment using the hierarchical wavefront algorithm. GitHub https://github.com/waveygang/wfmash (2021).
Lee, C., Grasso, C. & Sharlow, M. F. Multiple sequence alignment using partial order graphs. Bioinformatics 18, 452–464 (2002).
Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
Gao, Y. et al. abPOA: an SIMD-based c library for fast partial order alignment using adaptive band. Bioinformatics 37, 2209–2211 (2020).
Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
Bolognini, D. et al. Global diversity, recurrent evolution, and recent selection on amylase structural haplotypes in humans https://doi.org/10.1101/2024.02.07.579378 (2024).
Heumos, S. et al. Pangenome graph layout by path-guided stochastic gradient descent. Bioinformatics 40, 363 (2024).
Doerr, D., Marijon, P. & Marschall, T. GFAffix identifies walk-preserving shared affixes in variation graphs and collapses them into a non-redundant graph structure. GitHub https://github.com/marschall-lab/GFAffix (2023).
Ewels, P., Magnusson, M., Lundin, S. & Käller, M. Multiqc: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047 (2016).
Marco-Sola, S. et al. Optimal gap-affine alignment in o(s) space. Bioinformatics https://doi.org/10.1093/bioinformatics/btad074 (2023).
Marçais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLoS Comput. Biol. 14, 1005944 (2018).
Cagan, A. et al. Natural selection in the great apes. Mol. Biol. Evol. 33, 3268–3283 (2016).
Guarracino, A. et al. Recombination between heterologous human acrocentric chromosomes. Nature 617, 335–343 (2023).
Crysnanto, D., Leonard, A. & Pausch, H. Comparison of methods for building pangenome graphs. In Proc. 12th World Congress on Genetics Applied to Livestock Production (WCGALP) Technical and Species Orientated Innovations in Animal Breeding, and Contribution of Genetics to Solving Societal Challenges, 1066–1069 (Wageningen Academic Publishers, 2022).
Leonard, A. S., Crysnanto, D., Mapel, X. M., Bhati, M. & Pausch, H. Graph construction method impacts variation representation and analyses in a bovine super-pangenome. Preprint at bioRxiv https://doi.org/10.1186/s13059-023-02969-y (2022).
Zhou, Y. et al. Graph pangenome captures missing heritability and empowers tomato breeding. Nature 606, 527–534 (2022).
Heumos, S. et al. nf-core/pangenome. Zenodo https://doi.org/10.5281/zenodo.8202637 (2024).
Fischer, C. & Garrison, E. chfi/gfaestus: a pangenome graph browser. Zenodo https://doi.org/10.5281/zenodo.6954035 (2022).
Logsdon, G. A. et al. The structure, function and evolution of a complete human chromosome 8. Nature 593, 101–107 (2021).
Jain, C., Koren, S., Dilthey, A., Phillippy, A. M. & Aluru, S. A fast adaptive algorithm for computing whole-genome homology maps. Bioinformatics 34, i748–i756 (2018).
Llamas, B. et al. A strategy for building and using a human reference pangenome. F1000Res 8, 1751 (2021).
Vollger, M. R. et al. Increased mutation and gene conversion within human segmental duplications. Nature 617, 325–334 (2023).
Traag, V. A., Waltman, L. & van Eck, N. J. From louvain to leiden: guaranteeing well-connected communities. Sci. Rep. 9, 1–22 (2019).
Kosugi, S. et al. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biol. 20, 1–18 (2019).
Poplin, R. et al. A universal snp and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
Guarracino, A. Pangenomes of multiple species for the “Building pangenome graphs” publication. Zenodo https://doi.org/10.5281/zenodo.7658895 (2023).

Acknowledgements

The authors gratefully acknowledge support from National Institutes of Health (NIH)/NIDA U01DA047638 (E.G.), NIH/NIGMS R01GM123489 (E.G. and P.P.), NIH/NIGMS R35GM142916 (P.H.S.) and NSF PPoSS Award 2118709 (E.G. and P.P.) and the Center for Integrative and Translational Genomics (E.G.). S.H. acknowledges funding from the Central Innovation Program (ZIM) for SMEs of the Federal Ministry for Economic Affairs and Energy of Germany. This work was supported by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D and 031A538A). A.A.G. acknowledges the Alexander von Humboldt Foundation in the framework of Sofja Kovalevskaja Award and Deutsche Forschungsgemeinschaft (German Research Foundation) project no. 497667402. S.N. acknowledges support from iFIT funded by the Deutsche Forschungsgemeinschaft under Germany’s Excellence Strategy—EXC 2180—390900677 and CMFI under EXC 2124-390838134. The authors also acknowledge funding from the Max Planck Society (Z.B., S.V., C.K. and D.W.). Co-financed by the Connecting Europe Facility of the European Union. M.N.M. is fully funded by the EU H2020 ALPACA ITN under the Marie Skłodowska-Curie grant agreement no. 956229. The authors thank members of the HPRC Pangenome Working Group for their insightful discussion and feedback and members of the HPRC production teams for their development of resources used in our exposition.

Author information

These authors contributed equally: Erik Garrison, Andrea Guarracino.

Authors and Affiliations

Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
Erik Garrison, Andrea Guarracino, Flavia Villani, David G. Ashbrook, Vincenza Colonna, Robert W. Williams & Pjotr Prins
Human Technopole, Milan, Italy
Andrea Guarracino & Nicole Soranzo
Quantitative Biology Center (QBiC) Tübingen, University of Tübingen, Tübingen, Germany
Simon Heumos & Sven Nahnsen
Biomedical Data Science, Dept. of Computer Science, University of Tübingen, Tübingen, Germany
Simon Heumos & Sven Nahnsen
M3 Research Center, University Hospital Tübingen, Tübingen, Germany
Simon Heumos & Sven Nahnsen
Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany
Zhigui Bao, Sebastian Vorbrugg, Christian Kubica & Detlef Weigel
Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
Zhigui Bao & Sanwen Huang
Université Côte d’Azur, CNRS, INSERM, IRCAN, Nice, France
Lorenzo Tattini & Gianni Liti
Data Science Department, EURECOM, Biot, France
Lorenzo Tattini
Computomics, Tübingen, Germany
Jörg Hagmann
Computer Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain
Santiago Marco-Sola
Department of Computer Science, Universitat Politècnica de Catalunya, Barcelona, Spain
Santiago Marco-Sola
Chemistry and Molecular Biology, Faculty of Science, University of Gothenburg, Gothenburg, Sweden
Kaisa Thorell
Earlham Institute, Norwich Research Park, Norwich, UK
Rachel L. Rusholme-Pilcher
Clinical Genomics Gothenburg, Bioinformatics and Data Centre, University of Gothenburg, Gothenburg, Sweden
Emilio Rudbeck
Department of Plant Breeding, Justus Liebig University Giessen, Giessen, Germany
Agnieszka A. Golicz
Institute for Bioinformatics and Medical Informatics (IBMI), Eberhard-Karls University of Tübingen, Tübingen, Germany
Sven Nahnsen
The Institute of Environmental Science and Research, Wellington, New Zealand
Zuyu Yang
Department of Computer Science, University of Pisa, Pisa, Italy
Moses Njagi Mwaniki
School of Biological Sciences, Faculty of Environmental and Life Sciences, University of Southampton, Southampton, UK
Franklin L. Nobrega & Yi Wu
Department of Pharmacology, Addiction Science and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
Hao Chen
Hartwig Medical Foundation, Amsterdam, the Netherlands
Joep de Ligt
Department of Integrative Biology, University of California Berkeley, Berkeley, CA, USA
Peter H. Sudmant
Institute for Bioinformatics and Medical Informatics, University Tübingen, Tübingen, Germany
Detlef Weigel
Wellcome Sanger Institute, Genome Campus, Hinxton, UK
Nicole Soranzo
National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge, UK
Nicole Soranzo
Department of Haematology, Cambridge Biomedical Campus, Cambridge, UK
Nicole Soranzo
British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge, UK
Nicole Soranzo
Institute of Genetics and Biophysics, National Research Council, Naples, Italy
Vincenza Colonna

Contributions

E.G. conceived the project. E.G., S.N., N.S., V.C., R.W.W. and P.P. provided project guidance. E.G., A.G., Simon H. and S.M.S. developed the software. E.G., A.G., Simon H., V.C., R.W.W. and P.P. edited the paper. E.G. designed the experiments. E.G., A.G. and L.T. evaluated quality. E.G., A.G., Simon H., F.V., Z.B., L.T., J.H., S.V., C.K., K.T., R.L.R.P., A.A.G., S.N., Z.Y., M.N.M., F.L.N., H.C., J.d.L. and P.H.S. conducted testing. A.G. executed the experiments. A.G. and Simon H. provided documentation. F.V., D.G.A., H.C. and V.C. worked on Mus musculus and Rattus norvegicus. Z.B. and Sanwen H. contributed to the tomato pangenome. L.T. and G.L. worked on Saccharomyces cerevisiae and S. paradoxus. J.H. contributed to the soy pangenome. S.V., C.K., Z.B. and D.W. worked on A. thaliana. S.V. determined parameter settings. K.T. and E.R. worked on Helicobacter pylori. A.A.G. contributed to Vicia faba. Z.Y. and J.d.L. worked on Neisseria meningitidis. F.L.N. and Y.W. worked on Escherichia coli and coliphages. P.H.S. contributed to the primate pangenome. P.P. managed High Performance Computing.

Corresponding author

Correspondence to Erik Garrison.

Ethics declarations

Competing interests

J.H. is employed by Computomics. D.W. holds equity in Computomics and consults for KWS SE. All other authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 SMOOTHXG.

Overview of the algorithmic steps in SMOOTHXG.

Extended Data Fig. 2 Impact of graph normalization on the human chromosome 6 pangenome graph.

Comparison of the raw SEQWISH graph and its normalized version after SMOOTHXG. The normalization process locally compresses and simplifies the graph structure, resulting in lower node degree and graph depth. This effect is particularly pronounced in repetitive regions such as centromeres and satellite sequences. Node degree and graph depth are visualized for both the original SEQWISH graph and the normalized graph.

Extended Data Fig. 3 Impact of graph normalization on complement component 4 (C4) pangenome graph.

Each bar represents a haplotype and black lines on the bottom represent the graph topology. Paths are colored by using the Spectra color palette with four levels of node depths: white indicates no depth, while gray, red, and yellow indicate depths 1, 2, and greater than or equal to 3, respectively. a) C4 subgraph extracted from the chromosome 6 graph built without SMOOTHXG. b) C4 subgraph extracted from the chromosome 6 graph built without SMOOTHXG and sorted. c) C4 subgraph extracted from the chromosome 6 graph built with SMOOTHXG. The two references present two different allele copies of the C4 genes (red = 2X coverage), both of them including the HERV sequence. The entirely gray paths have one copy of these genes (gray = 1X coverage). HG01071#2 presents three copies of the locus (orange = 3X coverage), of which one contains the HERV sequence (gray in the middle of the orange).

Extended Data Fig. 4 Human chromosome 6 variant calling performance.

Precision, recall, and F1-score of small variants in the H. sapiens chromosome 6 pangenome graph relative to HiFi–DeepVariant calls. Comparisons are made whole-chromosome and then stratified by the GIAB (v.3.0) genomic context. The 44 samples evaluated are colored by superpopulation. AFR = African, AMR = Admixed American, EAS = East Asian, SAS = South Asian.

Extended Data Fig. 5 A. thaliana variant calling performance.

Precision, recall, and F1-score of small variants in the A. thaliana pangenome graph relative to HiFi–DeepVariant calls. Comparisons are made whole-genome and then stratified by genomic context. Easy and Hard regions exclude and include, respectively, rDNA, centromeres, and transposable elements. The 64 samples evaluated are colored by population. The low precision for Lor-16 and Met-6 is due to the high heterozygosity of these two samples.

Extended Data Fig. 6 Tomato variant calling performance.

Precision, recall, and F1-score of small variants in the tomato pangenome graph relative to HiFi–DeepVariant calls. Comparisons are made whole-genome and then stratified by genomic context. Easy and Hard regions exclude and include, respectively, transposable elements. The 5 samples evaluated are colored by group. BIG = S. lycopersicum, big-fruited tomato; CER = S. lycopersicum var. cerasiforme, cherry tomato; PIM = S. pimpinellifolium, the progenitor of cultivated tomatoes.
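
For readers less familiar with the metrics named in Extended Data Figs. 4–6, the sketch below computes precision, recall, and F1-score from counts of true positives, false positives, and false negatives, assuming the standard definitions. The function name and the counts are hypothetical illustrative values, not results from the paper, and the sketch does not reproduce the evaluation tooling the authors used.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions from true-positive, false-positive and false-negative counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up counts: graph-derived calls compared against a truth set.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Under this convention, a sample with many graph-derived calls absent from the truth set would show reduced precision even when recall stays high.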

Supplementary information

Supplementary Information

Supplementary discussion and Figs. 1–9.

Reporting Summary

Peer Review File

Supplementary Table

Tables with list of all accessions for all pangenomes.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Garrison, E., Guarracino, A., Heumos, S. et al. Building pangenome graphs. Nat Methods 21, 2008–2012 (2024). https://doi.org/10.1038/s41592-024-02430-3

Received: 30 July 2023
Accepted: 26 August 2024
Published: 21 October 2024
Version of record: 21 October 2024
Issue date: November 2024
DOI: https://doi.org/10.1038/s41592-024-02430-3
