A pangenome reference and population studies link structural variants with breeding traits in Gossypium hirsutum

Abstract

Limited pangenome and ambiguous genomic architecture constrain comprehensive genetic variation discovery and cotton improvement. Here we assembled a telomere-to-telomere (T2T) genome for elite cultivar NDM13 and near-T2T genomes for 27 additional representatives of Gossypium hirsutum over the recent century, with transcriptomic profiling of 15 distinct tissues from each. We uncovered 51,551 one-to-one conserved orthologs across all genomes and landscapes of telomere, centromere, 45S rDNA, segmental duplication and copy number variant. We revealed hotspots of structural variation (SV) and impacts of SV, segmental duplication and copy number variant on gene expression or content alteration, as well as adversity resistances. We identified thousands of divergent SVs and genes implicated in modern breeding evolution. Combining T2T-reference-based pangenome construction and 761,536 SVs identified across 1,671 worldwide accessions with phenotypic data from 22 environments, we captured a number of hidden SVs that potentially influence critical breeding traits. These will boost genetic study and biotechnological improvement of the crop.

Fig. 1: Characterization of T2T genome of NDM13.
Fig. 2: Gene-based pangenome analysis of 28 cottons.
Fig. 3: Landscape and diversification of complex regions in cotton.
Fig. 4: Genome-wide patterns of SDs and CNVs.
Fig. 5: Inferences from SVs from 27 genomes with reference to NDM13 and breeding history.
Fig. 6: Identification of important associated SVs underlying FL and FS.

Data availability

The raw sequencing and transcriptome data for 28 cottons have been deposited in the National Genomics Data Center (NGDC) under the BioProject accession PRJCA023347 and in the NCBI Sequence Read Archive (SRA) under the BioProject accession PRJNA1132390. The genome assemblies of 28 cottons and the CENH3 ChIP–seq data for NDM13 have been deposited in the NGDC under the BioProject accession PRJCA023347. The resequencing data for 1,671 accessions are available in the NCBI SRA under the BioProject accession PRJNA680449 (1,081 cotton accessions) and PRJNA1132397 (590 cotton accessions). Source data are provided with this paper.

Code availability

The scripts and software used in this study are all publicly available online, as described in the Methods and Reporting Summary. All custom scripts and code associated with this project are available via Zenodo at https://doi.org/10.5281/zenodo.18357054 (ref. 121) and GitHub at https://github.com/SLBio/Analysis_pipeleine-NG-A66010.

References

  1. Beckert, S. Empire of Cotton: A Global History (Alfred A. Knopf Press, 2014).

  2. Fang, L. et al. Genomic analyses in cotton identify signatures of selection and loci associated with fiber quality and yield traits. Nat. Genet. 49, 1089–1098 (2017).

  3. The International Wheat Genome Sequencing Consortium et al. Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science 361, eaar7191 (2018).

  4. Walkowiak, S. et al. Multiple wheat genomes reveal global variation in modern breeding. Nature 588, 277–283 (2020).

  5. Goff, S. A. et al. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296, 92–100 (2002).

  6. Yu, J. et al. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296, 79–92 (2002).

  7. Schnable, P. S. et al. The B73 maize genome: complexity, diversity, and dynamics. Science 326, 1112–1115 (2009).

  8. Jiao, Y. et al. Improved maize reference genome with single-molecule technologies. Nature 546, 524–527 (2017).

  9. Hufford, M. B. et al. De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes. Science 373, 655–662 (2021).

  10. Schmutz, J. et al. Genome sequence of the palaeopolyploid soybean. Nature 463, 178–183 (2010).

  11. Paterson, A. H. et al. Repeated polyploidization of Gossypium genomes and the evolution of spinnable cotton fibres. Nature 492, 423–427 (2012).

  12. Wang, K. et al. The draft genome of a diploid cotton Gossypium raimondii. Nat. Genet. 44, 1098–1103 (2012).

  13. Li, F. et al. Genome sequence of the cultivated cotton Gossypium arboreum. Nat. Genet. 46, 567–572 (2014).

  14. Li, F. et al. Genome sequence of cultivated Upland cotton (Gossypium hirsutum TM-1) provides insights into genome evolution. Nat. Biotechnol. 33, 524–530 (2015).

  15. Zhang, T. et al. Sequencing of allotetraploid cotton (Gossypium hirsutum L. acc. TM-1) provides a resource for fiber improvement. Nat. Biotechnol. 33, 531–537 (2015).

  16. Ma, Z. et al. Resequencing a core collection of upland cotton identifies genomic variation and loci influencing fiber quality and yield. Nat. Genet. 50, 803–813 (2018).

  17. Hu, Y. et al. Gossypium barbadense and Gossypium hirsutum genomes provide insights into the origin and evolution of allotetraploid cotton. Nat. Genet. 51, 739–748 (2019).

  18. Wang, M. et al. Reference genome sequences of two cultivated allotetraploid cottons, Gossypium hirsutum and Gossypium barbadense. Nat. Genet. 51, 224–229 (2019).

  19. Chen, Z. J. et al. Genomic diversifications of five Gossypium allopolyploid species and their impact on cotton improvement. Nat. Genet. 52, 525–533 (2020).

  20. Huang, G. et al. Genome sequence of Gossypium herbaceum and genome updates of Gossypium arboreum and Gossypium hirsutum provide insights into cotton A-genome evolution. Nat. Genet. 52, 516–524 (2020).

  21. Ma, Z. et al. High-quality genome assembly and resequencing of modern cotton cultivars provide resources for crop improvement. Nat. Genet. 53, 1385–1391 (2021).

  22. Wang, M. et al. Genomic innovation and regulatory rewiring during evolution of the cotton genus Gossypium. Nat. Genet. 54, 1959–1971 (2022).

  23. Sreedasyam, A. et al. Genome resources for three modern cotton lines guide future breeding efforts. Nat. Plants 10, 1039–1051 (2024).

  24. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).

  25. Chen, J. et al. A complete telomere-to-telomere assembly of the maize genome. Nat. Genet. 55, 1221–1231 (2023).

  26. Naish, M. et al. The genetic and epigenetic landscape of the Arabidopsis centromeres. Science 374, eabi7489 (2021).

  27. Shang, L. et al. A complete assembly of the rice Nipponbare reference genome. Mol. Plant 16, 1232–1236 (2023).

  28. Hu, Y. et al. Post-polyploidization centromere evolution in cotton. Nat. Genet. 57, 1021–1030 (2025).

  29. Hu, G. et al. A telomere-to-telomere genome assembly of cotton provides insights into centromere evolution and short-season adaptation. Nat. Genet. 57, 1031–1043 (2025).

  30. Liu, Y. et al. Pan-genome of wild and cultivated soybeans. Cell 182, 162–176 (2020).

  31. Alonge, M. et al. Major impacts of widespread structural variation on gene expression and crop improvement in tomato. Cell 182, 145–161 (2020).

  32. Jayakodi, M. et al. The barley pan-genome reveals the hidden legacy of mutation breeding. Nature 588, 284–289 (2020).

  33. Qin, P. et al. Pan-genome analysis of 33 genetically diverse rice accessions reveals hidden genomic variations. Cell 184, 3542–3558 (2021).

  34. Shi, J., Tian, Z., Lai, J. & Huang, X. Plant pan-genomics and its applications. Mol. Plant 16, 168–186 (2023).

  35. Tang, D. et al. Genome evolution and diversity of wild and cultivated potatoes. Nature 606, 535–541 (2022).

  36. Zhou, Y. et al. Graph pangenome captures missing heritability and empowers tomato breeding. Nature 606, 527–534 (2022).

  37. Li, N. et al. Super-pangenome analyses highlight genomic diversity and structural variation across wild and cultivated tomato species. Nat. Genet. 55, 852–860 (2023).

  38. He, Q. et al. A graph-based genome and pan-genome variation of the model plant Setaria. Nat. Genet. 55, 1232–1242 (2023).

  39. Yang, Z. et al. Graph pan-genome illuminates evolutionary trajectories and agronomic trait architecture in allotetraploid cotton. Nat. Genet. 58, 218–229 (2026).

  40. Gu, Q. et al. A high-density genetic map and multiple environmental tests reveal novel quantitative trait loci and candidate genes for fibre quality and yield in cotton. Theor. Appl. Genet. 133, 3395–3408 (2020).

  41. Gu, Q. et al. A stable QTL qSalt-A04-1 contributes to salt tolerance in the cotton seed germination stage. Theor. Appl. Genet. 134, 2399–2410 (2021).

  42. Zhang, X. et al. Breeding of high-quality cotton in Hebei province during the past 70 years. China Cotton 47, 1–6 (2020).

  43. Liao, W. W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).

  44. Yang, Z. et al. Recent progression and future perspectives in cotton genomic breeding. J. Integr. Plant Biol. 65, 548–569 (2023).

  45. Zhang, C. Y. et al. High-quality genome of a modern soybean cultivar and resequencing of 547 accessions provide insights into the role of structural variation. Nat. Genet. 56, 2247–2258 (2024).

  46. Yang, Z. et al. Multi-omics provides new insights into the domestication and improvement of dark jute (Corchorus olitorius). Plant J. 112, 812–829 (2022).

  47. Zhang, Y. et al. The telomere-to-telomere gap-free genome of four rice parents reveals SV and PAV patterns in hybrid rice breeding. Plant Biotechnol. J. 20, 1642–1644 (2022).

  48. Aganezov, S. et al. A complete reference genome improves analysis of human genetic variation. Science 376, eabl3533 (2022).

  49. Vollger, M. R. et al. Segmental duplications and their variation in a complete human genome. Science 376, eabj6965 (2022).

  50. Bretani, G. et al. Segmental duplications are hot spots of copy number variants affecting barley gene content. Plant J. 103, 1073–1088 (2020).

  51. Emanuel, B. S. & Shaikh, T. H. Segmental duplications: an ‘expanding’ role in genomic instability and disease. Nat. Rev. Genet. 2, 791–800 (2001).

  52. Hosmani, P. S. et al. Dirigent domain-containing protein is part of the machinery required for formation of the lignin-based Casparian strip in the root. Proc. Natl Acad. Sci. USA 110, 14498–14503 (2013).

  53. Paniagua, C. et al. Dirigent proteins in plants: modulating cell wall metabolism during abiotic and biotic stress exposure. J. Exp. Bot. 68, 3287–3301 (2017).

  54. Wang, Y. et al. A dirigent family protein confers variation of Casparian strip thickness and salt tolerance in maize. Nat. Commun. 13, 2222 (2022).

  55. Yang, X. et al. A loss-of-function of the dirigent gene TaDIR-B1 improves resistance to Fusarium crown rot in wheat. Plant Biotechnol. J. 19, 866–868 (2021).

  56. Deng, J. et al. Dirigent gene family is involved in the molecular interaction between Panax notoginseng and root rot pathogen Fusarium solani. Ind. Crop. Prod. 178, 114544 (2022).

  57. Lin, J. L. et al. Dirigent gene editing of gossypol enantiomers for toxicity-depleted cotton seeds. Nat. Plants 9, 605–615 (2023).

  58. Li, S. et al. Genome-edited powdery mildew resistance in wheat without growth penalties. Nature 602, 455–460 (2022).

  59. Li, Y. B. et al. The thioredoxin GbNRX1 plays a crucial role in homeostasis of apoplastic reactive oxygen species in response to Verticillium dahliae infection in cotton. Plant Physiol. 170, 2392–2406 (2016).

  60. Chen, J. et al. NLR surveillance of pathogen interference with hormone receptors induces immunity. Nature 613, 145–152 (2023).

  61. Wang, N. et al. An F-box protein attenuates fungal xylanase-triggered immunity by destabilizing LRR-RLP NbEIX2 in a SOBIR1-dependent manner. New Phytol. 236, 2202–2215 (2022).

  62. Bian, Y. et al. Cancer SLC43A2 alters T cell methionine metabolism and histone methylation. Nature 585, 277–282 (2020).

  63. Zhai, K. et al. NLRs guard metabolism to coordinate pattern- and effector-triggered immunity. Nature 601, 245–251 (2022).

  64. Porubsky, D. et al. Recurrent inversion polymorphisms in humans associate with genetic instability and genomic disorders. Cell 185, 1986–2005 (2022).

  65. Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018).

  66. Jamshed, M. et al. Identification of stable quantitative trait loci (QTLs) for fiber quality traits across multiple environments in Gossypium hirsutum recombinant inbred line population. BMC Genomics 17, 197 (2016).

  67. Rico, M. & Egelhoff, T. T. Myosin heavy chain kinase B participates in the regulation of myosin assembly into the cytoskeleton. J. Cell Biochem. 88, 521–532 (2003).

  68. Song, X. et al. Genome-wide association analysis reveals loci and candidate genes involved in fiber quality traits under multiple field environments in cotton (Gossypium hirsutum). Front. Plant Sci. 12, 695503 (2021).

  69. Shao, Q. et al. Identifying QTL for fiber quality traits with three upland cotton (Gossypium hirsutum L.) populations. Euphytica 198, 43–58 (2014).

  70. Zhang, Z. et al. Genome-wide quantitative trait loci reveal the genetic basis of cotton fibre quality and yield-related traits in a Gossypium hirsutum recombinant inbred line population. Plant Biotechnol. J. 18, 239–253 (2020).

  71. Ling, J. Karyotype Analysis by Telomere-FISH and Primary Development of High-Resolution Cytological Map in Cotton. PhD thesis, Chinese Academy of Agricultural Sciences (2008).

  72. Dvořáčková, M., Fojtová, M. & Fajkus, J. Chromatin dynamics of plant telomeres and ribosomal genes. Plant J. 83, 18–37 (2015).

  73. Sykorova, E. et al. The absence of Arabidopsis-type telomeres in Cestrum and closely related genera Vestia and Sessea (Solanaceae): first evidence from eudicots. Plant J. 34, 283–291 (2003).

  74. Sykorová, E. et al. Minisatellite telomeres occur in the family Alliaceae but are lost in Allium. Am. J. Bot. 93, 814–823 (2006).

  75. He, S. et al. The genomic basis of geographic differentiation and fiber improvement in cultivated cotton. Nat. Genet. 53, 916–924 (2021).

  76. Yang, Z. et al. Extensive intraspecific gene order and gene structural variations in upland cotton cultivars. Nat. Commun. 10, 2989 (2019).

  77. Harringmeyer, O. & Hoekstra, H. Chromosomal inversion polymorphisms shape the genomic landscape of deer mice. Nat. Ecol. Evol. 6, 1965–1979 (2022).

  78. Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).

  79. Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).

  80. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).

  81. Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).

  82. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010).

  83. Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).

  84. Chang, X. et al. High-quality Gossypium hirsutum and Gossypium barbadense genome assemblies reveal the landscape and evolution of centromeres. Plant Commun. 5, 100722 (2023).

  85. Marçais, G. et al. MUMmer4: a fast and versatile genome alignment system. PLoS Comput. Biol. 14, e1005944 (2018).

  86. Mount, D. W. Using the basic local alignment search tool (BLAST). CSH Protoc. 2007, pdb.top17 (2007).

  87. Smit, A., Hubley, R. & Green, P. RepeatMasker Open-4.0. Institute for Systems Biology http://www.repeatmasker.org (2013–2015).

  88. Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 6, 11 (2015).

  89. Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, 265–268 (2007).

  90. Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21, i351–i358 (2005).

  91. Edgar, R. C. & Myers, E. W. PILER: identification and classification of genomic repeats. Bioinformatics 21, i152–i158 (2005).

  92. Smit, A. & Hubley, R. RepeatModeler Open-1.0. Institute for Systems Biology http://www.repeatmasker.org (2008–2015).

  93. Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res. 14, 988–995 (2004).

  94. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011).

  95. Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).

  96. Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19, ii215 (2003).

  97. Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004).

  98. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).

  99. Guigó, R. Assembling genes from predicted exons in linear time with dynamic programming. J. Comput. Biol. 5, 681–702 (1998).

  100. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).

  101. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).

  102. Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578 (2012).

  103. Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, R7 (2008).

  104. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 46, 2699 (2018).

  105. Finn, R. D. et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, 279–285 (2016).

  106. The Gene Ontology Consortium. Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res. 45, 331–338 (2017).

  107. Kanehisa, M. et al. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. 42, 199–205 (2014).

  108. Išerić, H., Alkan, C., Hach, F. & Numanagić, I. Fast characterization of segmental duplication structure in multiple genome assemblies. Algorithms Mol. Biol. 17, 4 (2022).

  109. Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 238 (2019).

  110. Chakraborty, M., Emerson, J. J., Macdonald, S. J. & Long, A. D. Structural variants exhibit widespread allelic heterogeneity and shape variation in complex traits. Nat. Commun. 10, 4872 (2019).

  111. Goel, M., Sun, H., Jiao, W. B. & Schneeberger, K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol. 20, 277 (2019).

  112. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).

  113. Jeffares, D. C. et al. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nat. Commun. 8, 14061 (2017).

  114. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).

  115. Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).

  116. Putri, G. H., Anders, S., Pyl, P. T., Pimanda, J. E. & Zanini, F. Analysing high-throughput sequencing data in Python with HTSeq 2.0. Bioinformatics 38, 2943–2945 (2022).

  117. Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44, 821–824 (2012).

  118. Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42, 348–354 (2010).

  119. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

  120. Ge, X. et al. Efficient genotype-independent cotton genetic transformation and genome editing. J. Integr. Plant Biol. 65, 907–917 (2023).

  121. Sun, Z. Scripts and code used in ‘A pangenome reference and population studies link structural variants with breeding traits in Gossypium hirsutum’. Zenodo https://doi.org/10.5281/zenodo.18357054 (2026).

Acknowledgements

This work was supported by the National Key Research and Development Program of China (2022YFF1001403) to Y.Z., Z.M. and Xingfen Wang; the Science Research Project of Hebei Education Department (PTZX2026014) to Xingfen Wang, Y.Z., Z.S. and Z.M.; the Natural Science Foundation (C2022204205) to Xingfen Wang, Y.Z. and Z.S.; the Key Research and Development Program (21326314D) to Z.M., Xingfen Wang and Y.Z.; the Top Talent Project (031601801) of Hebei Province to Z.M.; the National Key Project of Bio-breeding of China (2023ZD04039) to Z.M., Xingfen Wang, Y.Z. and Z.S.; the China Agricultural Research System (CARS-15-03) to L.W., Xingfen Wang, Y.Z. and Z.M.; and the Project for National Top Talent (0602019) and Shennong Plan of China to Y.Z.

Author information

Contributions

Y.Z., Z.S., Xingfen Wang and Z.M. performed most of the experiments and analyzed the data. Y.Z., Z.S., Xingfen Wang, L.W., Q.G., H.K., G.Z., B.C., Z.W., J.Z., X.Z., Z.L., J.Y., J.W., G.W., D.Z., Xingyi Wang, C.M., Y.L., Z.Z., W.C., M.J., H.J., J. Li, H.Z., Y.W., M.G., M.X., L.W., Z.L., Y.Y., Y.C. and J. Liu performed field trials, trait determination and sample preparation. S.T., X.L., Y.J., K.Z., Z.S., Y.Z. and Xingfen Wang performed the genome assembly and genomic analyses. Y.Z., Xingfen Wang, Z.S., S.T., X.L. and Z.M. identified genomic variations and constructed tables and figures. Y.Z., Z.S., Q.G. and Xingfen Wang conducted the genetic analyses of breeding traits. Y.Z., Xingfen Wang and Z.M. wrote the paper. Z.M. and Xingfen Wang conceived and supervised the project.

Corresponding authors

Correspondence to Xingfen Wang or Zhiying Ma.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Phylogenetic tree of 1,671 accessions and highly diverse agronomic phenotypes across 28 accessions.

a, Phylogenetic tree built from genome-wide SNP data, with G. barbadense cv. Pima90 as the outgroup. All branch lengths are annotated and are measured in substitutions per site. b–e, Diverse agronomic phenotypes among the 28 accessions, including seed size and color (b), length of boll handle (c), leaf size and shape (d) and boll size and shape (e). f, Fiber length. g, Fiber strength. h, Lint percentage. All scale bars represent 1 cm.

Extended Data Fig. 2 Density of SD blocks identified in the NDM13 and NDM8 genomes.

The blue lines indicate the synteny between NDM13 and NDM8 in each chromosome.

Extended Data Fig. 3 The density of gene models, Copia and Gypsy of the 28 genomes with 1,000 windows.

The vertical dashed lines mark the leftmost and rightmost 10% of windows, respectively.
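As a rough illustration of the windowed density described in this legend, the sketch below splits one chromosome into 1,000 equal windows, counts features per window, and returns the index ranges for the outer 10% of windows on each side. The function names, the toy coordinates and the 1 Mb chromosome length are hypothetical and are not taken from the authors' pipeline; equal-width windows are an assumption.

```python
def window_density(positions, chrom_length, n_windows=1000):
    """Count features per window after splitting a chromosome into equal-width bins."""
    counts = [0] * n_windows
    for pos in positions:
        # Map a 0-based coordinate to a window index; clamp to the last window.
        idx = min(pos * n_windows // chrom_length, n_windows - 1)
        counts[idx] += 1
    return counts


def distal_windows(n_windows=1000, fraction=0.10):
    """Return index ranges for the leftmost and rightmost `fraction` of windows."""
    k = round(n_windows * fraction)
    return range(0, k), range(n_windows - k, n_windows)


# Hypothetical gene start coordinates on a toy 1 Mb chromosome.
gene_starts = [500, 1_500, 250_000, 999_999]
density = window_density(gene_starts, chrom_length=1_000_000)
left, right = distal_windows()
```

Summing `density` over `left` and `right` would give the feature counts in the two distal regions that the dashed lines delimit.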

Extended Data Fig. 4 Expression comparison among core, dispensable and private genes based on the averaged FPKM of 15 tissues in each cotton.

In the box plots, the center line denotes the median; box limits are the upper and lower quartiles; whiskers mark the range of the data. Statistical significance was determined using a two-sided Wilcoxon test.
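For readers unfamiliar with the test named in this legend, the following is a minimal sketch of a two-sided Wilcoxon rank-sum (Mann-Whitney U) comparison of two independent groups. It assumes the rank-sum variant was intended, which the legend does not state, and it uses a normal approximation with tie correction and no continuity correction, so p values for small samples can differ from those of exact implementations. The FPKM values are invented for illustration.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation (tie-corrected)."""
    combined = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j
    # U statistic for the first sample, from its rank sum.
    r1 = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every value identical: no evidence of a difference
        return u1, 1.0
    z = (u1 - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u1, p


# Hypothetical averaged FPKM values for two gene categories.
core = [1.0, 2.0, 3.0]
private = [4.0, 5.0, 6.0]
u, p = rank_sum_test(core, private)
```

With three gene categories, the same function would be applied to each pair of groups, ideally with a multiple-testing adjustment.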

Extended Data Fig. 5 An example for SV hotspots located in chromosome Dt01.

a, SV hotspots in the 60–61 Mb region of chromosome Dt01 from each accession. b, Disease resistance-related genes located in the hotspot.

Supplementary information

Supplementary Information

Supplementary Notes 1 and 2, and Figs. 1–15.

Reporting Summary

Supplementary Tables

Supplementary Tables 1–52.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, Y., Sun, Z., Tian, S. et al. A pangenome reference and population studies link structural variants with breeding traits in Gossypium hirsutum. Nat Genet 58, 928–939 (2026). https://doi.org/10.1038/s41588-026-02523-z

  • DOI: https://doi.org/10.1038/s41588-026-02523-z
