  • Brief Communication
Building pangenome graphs

Abstract

Pangenome graphs can represent all variation between multiple reference genomes, but current approaches to build them exclude complex sequences or are based upon a single reference. In response, we developed the PanGenome Graph Builder, a pipeline for constructing pangenome graphs without bias or exclusion. The PanGenome Graph Builder uses all-to-all alignments to build a variation graph in which we can identify variation, measure conservation, detect recombination events and infer phylogenetic relationships.

Fig. 1: PGGB and applications.

Data availability

Pangenomes are available at Zenodo at https://doi.org/10.5281/zenodo.7658895 (ref. 37). Scripts and source data URLs for downloading the pangenomes are available at https://github.com/pangenome/pggb-paper/blob/main/workflows/0.Preparation.md. Lists of all accession codes for all pangenomes are reported in Supplementary File 1.

Code availability

PGGB is available at https://github.com/pangenome/pggb. Code used for experiments can be accessed at https://github.com/pangenome/pggb-paper.

References

  1. Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018).

  2. Paten, B., Novak, A. M., Eizenga, J. M. & Garrison, E. Genome graphs and the evolution of genome inference. Genome Res. 27, 665–676 (2017).

  3. Armstrong, J. et al. Progressive cactus is a multiple-genome aligner for the thousand-genome era. Nature 587, 246–251 (2020).

  4. Guarracino, A., Heumos, S., Nahnsen, S., Prins, P. & Garrison, E. ODGI: understanding pangenome graphs. Bioinformatics 38, 3319–3326 (2022).

  5. Li, H., Feng, X. & Chu, C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. https://doi.org/10.1186/s13059-020-02168-z (2020).

  6. Hickey, G. et al. Pangenome graph construction from genome alignments with minigraph-cactus. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01793-w (2023).

  7. Noll, N., Molari, M., Shaw, L. P. & Neher, R. A. PanGraph: scalable bacterial pan-genome graph construction. Preprint at bioRxiv https://doi.org/10.1099/mgen.0.001034 (2022).

  8. Garrison, E. & Guarracino, A. Unbiased pangenome graphs. Bioinformatics https://doi.org/10.1093/bioinformatics/btac743 (2022).

  9. Minkin, I., Pham, S. & Medvedev, P. TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics 33, 4024–4032 (2016).

  10. Chin, C.-S. et al. Multiscale analysis of pangenomes enables improved representation of genomic diversity for repetitive and clinically relevant genes. Nat. Methods 20, 1213–1221 (2023).

  11. Sullivan, P. F. et al. Leveraging base-pair mammalian constraint to understand genetic variation and human disease. Science https://doi.org/10.1126/science.abn2937 (2023).

  12. Guarracino, A., Mwaniki, N., Marco-Sola, S. & Garrison, E. wfmash: whole-chromosome pairwise alignment using the hierarchical wavefront algorithm. GitHub https://github.com/waveygang/wfmash (2021).

  13. Lee, C., Grasso, C. & Sharlow, M. F. Multiple sequence alignment using partial order graphs. Bioinformatics 18, 452–464 (2002).

  14. Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).

  15. Gao, Y. et al. abPOA: an SIMD-based C library for fast partial order alignment using adaptive band. Bioinformatics 37, 2209–2211 (2020).

  16. Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).

  17. Bolognini, D. et al. Global diversity, recurrent evolution, and recent selection on amylase structural haplotypes in humans. Preprint at bioRxiv https://doi.org/10.1101/2024.02.07.579378 (2024).

  18. Heumos, S. et al. Pangenome graph layout by path-guided stochastic gradient descent. Bioinformatics 40, 363 (2024).

  19. Doerr, D., Marijon, P. & Marschall, T. GFAffix identifies walk-preserving shared affixes in variation graphs and collapses them into a non-redundant graph structure. GitHub https://github.com/marschall-lab/GFAffix (2023).

  20. Ewels, P., Magnusson, M., Lundin, S. & Käller, M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047 (2016).

  21. Marco-Sola, S. et al. Optimal gap-affine alignment in O(s) space. Bioinformatics https://doi.org/10.1093/bioinformatics/btad074 (2023).

  22. Marçais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLoS Comput. Biol. 14, 1005944 (2018).

  23. Cagan, A. et al. Natural selection in the great apes. Mol. Biol. Evol. 33, 3268–3283 (2016).

  24. Guarracino, A. et al. Recombination between heterologous human acrocentric chromosomes. Nature 617, 335–343 (2023).

  25. Crysnanto, D., Leonard, A. & Pausch, H. Comparison of methods for building pangenome graphs. In Proc. 12th World Congress on Genetics Applied to Livestock Production (WCGALP) Technical and Species Orientated Innovations in Animal Breeding, and Contribution of Genetics to Solving Societal Challenges, 1066–1069 (Wageningen Academic Publishers, 2022).

  26. Leonard, A. S., Crysnanto, D., Mapel, X. M., Bhati, M. & Pausch, H. Graph construction method impacts variation representation and analyses in a bovine super-pangenome. Preprint at bioRxiv https://doi.org/10.1186/s13059-023-02969-y (2022).

  27. Zhou, Y. et al. Graph pangenome captures missing heritability and empowers tomato breeding. Nature 606, 527–534 (2022).

  28. Heumos, S. et al. nf-core/pangenome. Zenodo https://doi.org/10.5281/zenodo.8202637 (2024).

  29. Fischer, C. & Garrison, E. chfi/gfaestus: a pangenome graph browser. Zenodo https://doi.org/10.5281/zenodo.6954035 (2022).

  30. Logsdon, G. A. et al. The structure, function and evolution of a complete human chromosome 8. Nature 593, 101–107 (2021).

  31. Jain, C., Koren, S., Dilthey, A., Phillippy, A. M. & Aluru, S. A fast adaptive algorithm for computing whole-genome homology maps. Bioinformatics 34, i748–i756 (2018).

  32. Llamas, B. et al. A strategy for building and using a human reference pangenome. F1000Res 8, 1751 (2021).

  33. Vollger, M. R. et al. Increased mutation and gene conversion within human segmental duplications. Nature 617, 325–334 (2023).

  34. Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 1–22 (2019).

  35. Kosugi, S. et al. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biol. 20, 1–18 (2019).

  36. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).

  37. Guarracino, A. Pangenomes of multiple species for the “Building pangenome graphs” publication. Zenodo https://doi.org/10.5281/zenodo.7658895 (2023).

Acknowledgements

The authors gratefully acknowledge support from National Institutes of Health (NIH)/NIDA U01DA047638 (E.G.), NIH/NIGMS R01GM123489 (E.G. and P.P.), NIH/NIGMS R35GM142916 (P.H.S.) and NSF PPoSS Award 2118709 (E.G. and P.P.) and the Center for Integrative and Translational Genomics (E.G.). S.H. acknowledges funding from the Central Innovation Program (ZIM) for SMEs of the Federal Ministry for Economic Affairs and Energy of Germany. This work was supported by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D and 031A538A). A.A.G. acknowledges the Alexander von Humboldt Foundation in the framework of Sofja Kovalevskaja Award and Deutsche Forschungsgemeinschaft (German Research Foundation) project no. 497667402. S.N. acknowledges support from iFIT funded by the Deutsche Forschungsgemeinschaft under Germany’s Excellence Strategy—EXC 2180—390900677 and CMFI under EXC 2124-390838134. The authors also acknowledge funding from the Max Planck Society (Z.B., S.V., C.K. and D.W.). Co-financed by the Connecting Europe Facility of the European Union. M.N.M. is fully funded by the EU H2020 ALPACA ITN under the Marie Skłodowska-Curie grant agreement no. 956229. The authors thank members of the HPRC Pangenome Working Group for their insightful discussion and feedback and members of the HPRC production teams for their development of resources used in our exposition.

Author information

Contributions

E.G. conceived the project. E.G., S.N., N.S., V.C., R.W.W. and P.P. provided project guidance. E.G., A.G., Simon H. and S.M.S. developed the software. E.G., A.G., Simon H., V.C., R.W.W. and P.P. edited the paper. E.G. designed the experiments. E.G., A.G. and L.T. evaluated quality. E.G., A.G., Simon H., F.V., Z.B., L.T., J.H., S.V., C.K., K.T., R.L.R.P., A.A.G., S.N., Z.Y., M.N.M., F.L.N., H.C., J.d.L. and P.H.S. conducted testing. A.G. executed the experiments. A.G. and Simon H. provided documentation. F.V., D.G.A., H.C. and V.C. worked on Mus musculus and Rattus norvegicus. Z.B. and Sanwen H. contributed to the tomato pangenome. L.T. and G.L. worked on Saccharomyces cerevisiae and S. paradoxus. J.H. contributed to the soy pangenome. S.V., C.K., Z.B. and D.W. worked on A. thaliana. S.V. determined parameter settings. K.T. and E.R. worked on Helicobacter pylori. A.A.G. contributed to Vicia faba. Z.Y. and J.d.L. worked on Neisseria meningitidis. F.L.N. and Y.W. worked on Escherichia coli and coliphages. P.H.S. contributed to the primate pangenome. P.P. managed High Performance Computing.

Corresponding author

Correspondence to Erik Garrison.

Ethics declarations

Competing interests

J.H. is employed by Computomics. D.W. holds equity in Computomics and consults for KWS SE. All other authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 SMOOTHXG.

Overview of the algorithmic steps in SMOOTHXG.

Extended Data Fig. 2 Impact of graph normalization on the human chromosome 6 pangenome graph.

Comparison of the raw SEQWISH graph and its normalized version after SMOOTHXG. The normalization process locally compresses and simplifies the graph structure, resulting in lower node degree and graph depth. This effect is particularly pronounced in repetitive regions such as centromeres and satellite sequences. Node degree and graph depth are visualized for both the original SEQWISH graph and the normalized graph.
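As a rough illustration of the two metrics named in this caption, the sketch below computes node degree and node depth from a toy graph in GFA v1 text form. It assumes that degree means the number of link (L) records touching a node and that depth means the number of path (P) steps visiting it; the function name `degree_and_depth` and the toy graph are hypothetical and are not taken from the PGGB or ODGI code.

```python
from collections import Counter

def degree_and_depth(gfa_text):
    """Return (degree, depth) dicts keyed by segment name for a GFA v1 string.

    Assumed definitions (illustrative only):
      degree = number of L records that touch the segment
      depth  = number of P-record steps that visit the segment
    """
    degree = Counter()
    depth = Counter()
    nodes = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":            # segment: S <name> <sequence>
            nodes.append(fields[1])
        elif fields[0] == "L":          # link: L <from> <orient> <to> <orient> <overlap>
            degree[fields[1]] += 1
            degree[fields[3]] += 1
        elif fields[0] == "P":          # path: P <name> <steps> <overlaps>
            for step in fields[2].split(","):
                depth[step[:-1]] += 1   # drop the trailing +/- orientation
    return {n: degree[n] for n in nodes}, {n: depth[n] for n in nodes}

# Hypothetical toy graph: two haplotypes, one of which skips node 2.
TOY_GFA = "\n".join([
    "S\t1\tACG",
    "S\t2\tT",
    "S\t3\tGGA",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t3\t+\t0M",
    "P\thapA\t1+,2+,3+\t*",
    "P\thapB\t1+,3+\t*",
])

degree, depth = degree_and_depth(TOY_GFA)
print(degree)
print(depth)
```

Under these assumed definitions, a collapsed repeat would show up as a node with high depth (many path steps) and often high degree (many alternative neighbors), which is the kind of structure the caption describes as being simplified by normalization.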

Extended Data Fig. 3 Impact of graph normalization on complement component 4 (C4) pangenome graph.

Each bar represents a haplotype and black lines on the bottom represent the graph topology. Paths are colored using the Spectra color palette with four levels of node depths: white indicates no depth, while gray, red, and yellow indicate depths 1, 2, and greater than or equal to 3, respectively. a) C4 subgraph extracted from the chromosome 6 graph built without SMOOTHXG. b) C4 subgraph extracted from the chromosome 6 graph built without SMOOTHXG and sorted. c) C4 subgraph extracted from the chromosome 6 graph built with SMOOTHXG. The two references present two different allele copies of the C4 genes (red = 2X coverage), both of them including the HERV sequence. The entirely gray paths have one copy of these genes (gray = 1X coverage). HG01071#2 presents three copies of the locus (orange = 3X coverage), of which one contains the HERV sequence (gray in the middle of the orange).

Extended Data Fig. 4 Human chromosome 6 variant calling performance.

Precision, recall, and F1-score of small variants in the H. sapiens chromosome 6 pangenome graph relative to HiFi–DeepVariant calls. Comparisons are made whole-chromosome and then stratified by the GIAB (v.3.0) genomic context. The 44 samples evaluated are colored by superpopulation. AFR = African, AMR = Admixed American, EAS = East Asian, SAS = South Asian.

Extended Data Fig. 5 A. thaliana variant calling performance.

Precision, recall, and F1-score of small variants in the A. thaliana pangenome graph relative to HiFi–DeepVariant calls. Comparisons are made whole-genome and then stratified by genomic context. Easy and Hard regions exclude and include, respectively, rDNA, centromeres, and transposable elements. The 64 samples evaluated are colored by population. The low precision for Lor-16 and Met-6 is due to the high heterozygosity of these two samples.

Extended Data Fig. 6 Tomato variant calling performance.

Precision, recall, and F1-score of small variants in the tomato pangenome graph relative to HiFi–DeepVariant calls. Comparisons are made whole-genome and then stratified by genomic context. Easy and Hard regions exclude and include, respectively, transposable elements. The five samples evaluated are colored by group. BIG = S. lycopersicum, big-fruited tomato; CER = S. lycopersicum var. cerasiforme, cherry tomato; PIM = S. pimpinellifolium, the progenitor of cultivated tomatoes.
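For readers unfamiliar with the metrics in Extended Data Figs. 4–6, the sketch below shows the standard definitions of precision, recall and F1-score. It assumes the HiFi–DeepVariant calls are treated as the truth set, so that a true positive is a graph-derived variant also present in that set; the function name and the counts used are hypothetical and do not come from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from variant-comparison counts.

    tp: graph-derived variants also present in the truth set
    fp: graph-derived variants absent from the truth set
    fn: truth-set variants missed by the graph-derived calls
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for one sample, for illustration only.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because precision depends on false positives, a highly heterozygous sample whose truth set misses real variants could show reduced precision without any change in the graph itself.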

Supplementary information

Supplementary Information

Supplementary discussion and Figs. 1–9.

Reporting Summary

Peer Review File

Supplementary Table

Tables with list of all accessions for all pangenomes.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Garrison, E., Guarracino, A., Heumos, S. et al. Building pangenome graphs. Nat Methods 21, 2008–2012 (2024). https://doi.org/10.1038/s41592-024-02430-3

  • DOI: https://doi.org/10.1038/s41592-024-02430-3
