Graph pan-genome illuminates evolutionary trajectories and agronomic trait architecture in allotetraploid cotton

Abstract

Upland cotton (Gossypium hirsutum), one of the world’s major fiber crops, faces challenges from the genetic homogeneity of modern varieties. Here we present 107 gold-standard genome assemblies spanning the wild-to-domesticated continuum, revealing six large-scale structural variations, including a chromosomal reciprocal translocation and five inversions tracing the evolutionary history of cultivated cotton in the Americas. This history also involved continuous introgression from Gossypium barbadense, shaping the genetic diversity of G. hirsutum landraces and cultivars. Leveraging the graph pan-genome, we capture the sequence and structural diversity of nucleotide-binding site–leucine-rich repeat genes, uncovering pathogen-driven selection signatures and loci associated with disease resistance. A presence–absence variation genome-wide association study (GWAS) identified previously overlooked loci for key fiber traits, complementing single-nucleotide polymorphism–GWAS findings. Additionally, we construct a detailed map of large inversions, offering insights into hybridization dynamics and strategies to mitigate linkage drag. This study enhances our understanding of cotton evolution and domestication while delivering a valuable resource to enhance breeding.

Fig. 1: Genomic insights into the origin and domestication of upland cotton inferred from translocations and inversions.
Fig. 2: Large-scale introgressions from Gossypium barbadense enriched the upland cotton gene pool.
Fig. 3: Pan-NLR construction in the super pan-genome.
Fig. 4: PAV–GWAS is more powerful than SNP–GWAS in unveiling the genetic basis underlying traits.
Fig. 5: Pan-genome-based map of large inversions.

Data availability

All raw sequencing data generated in this paper have been deposited into NCBI as studies PRJNA1103937 and PRJNA1111598. Genome assemblies and annotations are available in two repositories: NCBI (under PRJNA1345676) and Zenodo (https://doi.org/10.5281/zenodo.15644881 and https://doi.org/10.5281/zenodo.17046697 (refs. 91,92)). Pan-gene, pan-genome and eQTL datasets are available at Zenodo via https://doi.org/10.5281/zenodo.15646851 (ref. 93).

Code availability

Custom scripts can be accessed via Zenodo at https://doi.org/10.5281/zenodo.15644881 (ref. 91). The software and tools used in this study are detailed in the Methods and Reporting Summary.

References

  1. Wendel, J. F., Brubaker, C. L. & Percival, A. E. Genetic diversity in Gossypium hirsutum and the origin of upland cotton. Am. J. Bot. 79, 1291–1310 (1992).

  2. Chen, Z. J. et al. Genomic diversifications of five Gossypium allopolyploid species and their impact on cotton improvement. Nat. Genet. 52, 525–533 (2020).

  3. Viot, C. R. & Wendel, J. F. Evolution of the cotton genus, Gossypium, and its domestication in the Americas. Crit. Rev. Plant Sci. 42, 1–33 (2023).

  4. Percy, R. G. & Wendel, J. F. Allozyme evidence for the origin and diversification of Gossypium barbadense L. Theor. Appl. Genet. 79, 529–542 (1990).

  5. Yuan, D. et al. Parallel and intertwining threads of domestication in allopolyploid cotton. Adv. Sci. 8, 2003634 (2021).

  6. He, S. et al. The genomic basis of geographic differentiation and fiber improvement in cultivated cotton. Nat. Genet. 53, 916–924 (2021).

  7. Huang, G., Huang, J. Q., Chen, X. Y. & Zhu, Y. X. Recent advances and future perspectives in cotton research. Annu. Rev. Plant Biol. 72, 437–462 (2021).

  8. Yang, Z., Qanmber, G., Wang, Z., Yang, Z. & Li, F. Gossypium genomics: trends, scope, and utilization for cotton improvement. Trends Plant Sci. 25, 488–500 (2020).

  9. Ma, Z. et al. Resequencing a core collection of upland cotton identifies genomic variation and loci influencing fiber quality and yield. Nat. Genet. 50, 803–813 (2018).

  10. Wang, M. et al. Asymmetric subgenome selection and cis-regulatory divergence during cotton domestication. Nat. Genet. 49, 579–587 (2017).

  11. Li, F. et al. Genome sequence of cultivated Upland cotton (Gossypium hirsutum TM-1) provides insights into genome evolution. Nat. Biotechnol. 33, 524–530 (2015).

  12. Hu, Y. et al. Gossypium barbadense and Gossypium hirsutum genomes provide insights into the origin and evolution of allotetraploid cotton. Nat. Genet. 51, 739–748 (2019).

  13. Huang, G. et al. Genome sequence of Gossypium herbaceum and genome updates of Gossypium arboreum and Gossypium hirsutum provide insights into cotton A-genome evolution. Nat. Genet. 52, 516–524 (2020).

  14. Yang, Z. et al. Extensive intraspecific gene order and gene structural variations in upland cotton cultivars. Nat. Commun. 10, 2989 (2019).

  15. Ma, Z. et al. High-quality genome assembly and resequencing of modern cotton cultivars provide resources for crop improvement. Nat. Genet. 53, 1385–1391 (2021).

  16. Yang, Z. et al. Recent progression and future perspectives in cotton genomic breeding. J. Integr. Plant Biol. 65, 548–569 (2023).

  17. Sreedasyam, A. et al. Genome resources for three modern cotton lines guide future breeding efforts. Nat. Plants 10, 1039–1051 (2024).

  18. Zhou, Y. et al. Graph pangenome captures missing heritability and empowers tomato breeding. Nature 606, 527–534 (2022).

  19. Liu, Y. et al. Pan-genome of wild and cultivated soybeans. Cell 182, 162–176 (2020).

  20. Jin, S. et al. Structural variation (SV)-based pan-genome and GWAS reveal the impacts of SVs on the speciation and diversification of allotetraploid cottons. Mol. Plant 16, 678–693 (2023).

  21. Li, J. et al. Convergence and divergence of diploid and tetraploid cotton genomes. Nat. Genet. 56, 2562–2573 (2024).

  22. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).

  23. Chang, X. et al. High-quality Gossypium hirsutum and Gossypium barbadense genome assemblies reveal the landscape and evolution of centromeres. Plant Commun. 5, 100722 (2024).

  24. Wang, M. et al. Reference genome sequences of two cultivated allotetraploid cottons, Gossypium hirsutum and Gossypium barbadense. Nat. Genet. 51, 224–229 (2019).

  25. Huang, G. et al. A telomere-to-telomere cotton genome assembly reveals centromere evolution and a Mutator transposon-linked module regulating embryo development. Nat. Genet. 56, 1953–1963 (2024).

  26. Chen, J. et al. A complete telomere-to-telomere assembly of the maize genome. Nat. Genet. 55, 1221–1231 (2023).

  27. Gan, Y. et al. Chromosomal locations of 5S and 45S rDNA in Gossypium genus and its phylogenetic implications revealed by FISH. PLoS ONE 8, e68207 (2013).

  28. Cheng, Y. et al. Gossypium purpurascens genome provides insight into the origin and domestication of upland cotton. J. Adv. Res. 56, 15–29 (2024).

  29. Li, X. et al. Large-scale gene expression alterations introduced by structural variation drive morphotype diversification in Brassica oleracea. Nat. Genet. 56, 517–529 (2024).

  30. Morrell, P. L., Buckler, E. S. & Ross-Ibarra, J. Crop genomics: advances and applications. Nat. Rev. Genet. 13, 85–96 (2011).

  31. Hickey, G. et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat. Biotechnol. 42, 663–673 (2024).

  32. Yuan, Y., Bayer, P. E., Batley, J. & Edwards, D. Current status of structural variation studies in plants. Plant Biotechnol. J. 19, 2153–2163 (2021).

  33. Reinisch, A. J. et al. A detailed RFLP map of cotton, Gossypium hirsutum × Gossypium barbadense: chromosome organization and evolution in a disomic polyploid genome. Genetics 138, 829–847 (1994).

  34. Li, L. et al. Genomic analyses reveal the genetic basis of early maturity and identification of loci and candidate genes in upland cotton (Gossypium hirsutum L.). Plant Biotechnol. J. 19, 109–123 (2021).

  35. Lee, C. R. et al. Young inversion with multiple linked QTLs under selection in a hybrid zone. Nat. Ecol. Evol. 1, 119 (2017).

  36. Wendel, J. F., Brubaker, C. L. & Seelanan, T. in Physiology of Cotton (eds Stewart, J. M. C. D. et al.) 1–18 (Springer, 2010).

  37. Edelman, N. B. & Mallet, J. Prevalence and adaptive impact of introgression. Annu. Rev. Genet. 55, 265–283 (2021).

  38. Wang, N. et al. High-resolution sequencing of nine elite upland cotton cultivars uncovers genic variations and breeding improvement targets. Plant J. 113, 145–159 (2023).

  39. Tang, D. et al. Genome evolution and diversity of wild and cultivated potatoes. Nature 606, 535–554 (2022).

  40. Ma, S. et al. Direct pathogen-induced assembly of an NLR immune receptor complex to form a holoenzyme. Science 370, 3069 (2020).

  41. Zhang, Y. et al. Genome-wide association analysis reveals a novel pathway mediated by a dual-TIR domain protein for pathogen resistance in cotton. Genome Biol. 24, 111 (2023).

  42. Zhang, Y. et al. Promotion of apoplastic oxidative burst by artificially selected GhCBSX3A enhances Verticillium dahliae resistance in upland cotton. Plant J. 118, 2154–2168 (2024).

  43. Van de Weyer, A. L. et al. A species-wide inventory of NLR genes and alleles in Arabidopsis thaliana. Cell 178, 1260–1272 (2019).

  44. Yanagisawa, M. et al. Patterning mechanisms of cytoskeletal and cell wall systems during leaf trichome morphogenesis. Nat. Plants 1, 15014 (2015).

  45. Ding, M. et al. The hairless stem phenotype of cotton (Gossypium barbadense) is linked to a Copia-like retrotransposon insertion in a homeodomain-leucine zipper gene (HD1). Genetics 201, 143–154 (2015).

  46. Yuan, R. et al. Differentiation in the genetic basis of stem trichome development between cultivated tetraploid cotton species. BMC Plant Biol. 21, 115 (2021).

  47. Hinchliffe, D. J. et al. The GhTT2_A07 gene is linked to the brown colour and natural flame retardancy phenotypes of Lc1 cotton (Gossypium hirsutum L.) fibres. J. Exp. Bot. 67, 5461–5471 (2016).

  48. Wen, T. et al. Linkage and association mapping reveals the genetic basis of brown fibre (Gossypium hirsutum). Plant Biotechnol. J. 16, 1654–1666 (2018).

  49. Yan, Q. et al. Up-regulation of GhTT2-3A in cotton fibres during secondary wall thickening results in brown fibres with improved quality. Plant Biotechnol. J. 16, 1735–1747 (2018).

  50. Walkowiak, S. et al. Multiple wheat genomes reveal global variation in modern breeding. Nature 588, 277–283 (2020).

  51. Wellenreuther, M. & Bernatchez, L. Eco-evolutionary genomics of chromosomal inversions. Trends Ecol. Evol. 33, 427–440 (2018).

  52. Lowry, D. B. & Willis, J. H. A widespread chromosomal inversion polymorphism contributes to a major life-history transition, local adaptation, and reproductive isolation. PLoS Biol. 8, e1000500 (2010).

  53. Shang, L. et al. A super pan-genomic landscape of rice. Cell Res. 32, 878–896 (2022).

  54. Du, X. et al. Resequencing of 243 diploid cotton accessions based on an updated A genome identifies the genetic basis of key agronomic traits. Nat. Genet. 50, 796–802 (2018).

  55. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26, 1641–1650 (2009).

  56. Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31, 1119–1125 (2013).

  57. Seppey, M., Manni, M. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness. Methods Mol. Biol. 1962, 227–245 (2019).

  58. Ou, S., Chen, J. & Jiang, N. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. 46, e126 (2018).

  59. Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).

  60. Li, K., Xu, P., Wang, J., Yi, X. & Jiao, Y. Identification of errors in draft genome assemblies at single-nucleotide resolution for quality assessment and improvement. Nat. Commun. 14, 6556 (2023).

  61. Han, Y. & Wessler, S. R. MITE-Hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Res. 38, e199 (2010).

  62. Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268 (2007).

  63. Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 18 (2008).

  64. Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics 4, 4.10.1–4.10.14 (2009).

  65. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).

  66. Beier, S., Thiel, T., Munch, T., Scholz, U. & Mascher, M. MISA-web: a web server for microsatellite prediction. Bioinformatics 33, 2583–2585 (2017).

  67. Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19, 215–225 (2003).

  68. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).

  69. Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).

  70. Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004).

  71. Keilwagen, J., Hartung, F. & Grau, J. GeMoMa: homology-based gene prediction utilizing intron position conservation and RNA-seq data. Methods Mol. Biol. 1962, 161–177 (2019).

  72. Pertea, M., Kim, D., Pertea, G. M., Leek, J. T. & Salzberg, S. L. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat. Protoc. 11, 1650–1667 (2016).

  73. Besemer, J., Lomsadze, A. & Borodovsky, M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 29, 2607–2618 (2001).

  74. Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).

  75. Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 8, 1494–1512 (2013).

  76. Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, R7 (2008).

  77. Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 238 (2019).

  78. Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018).

  79. Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).

  80. Hu, G. et al. A telomere-to-telomere genome assembly of cotton provides insights into centromere evolution and short-season adaptation. Nat. Genet. 57, 1031–1043 (2025).

  81. Li, P. et al. RGAugury: a pipeline for genome-wide prediction of resistance gene analogs (RGAs) in plants. BMC Genomics 17, 852 (2016).

  82. Saucet, S. B. et al. Two linked pairs of Arabidopsis TNL resistance genes independently confer recognition of bacterial effector AvrRps4. Nat. Commun. 6, 6338 (2015).

  83. Marcais, G. et al. MUMmer4: a fast and versatile genome alignment system. PLoS Comput. Biol. 14, e1005944 (2018).

  84. Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44, 821–824 (2012).

  85. Li, M. X., Yeung, J. M., Cherny, S. S. & Sham, P. C. Evaluating the effective numbers of independent tests and significant p-value thresholds in commercial genotyping arrays and public imputation reference datasets. Hum. Genet. 131, 747–756 (2012).

  86. Yang, L. et al. Regulatory networks of coresident subgenomes during rapid fiber cell elongation in upland cotton. Plant Commun. 5, 101130 (2024).

  87. Ge, X. et al. Efficient genotype-independent cotton genetic transformation and genome editing. J. Integr. Plant Biol. 65, 907–917 (2023).

  88. Zhang, T., Liu, G., Zhao, H., Braz, G. T. & Jiang, J. Chorus2: design of genome-scale oligonucleotide-based probes for fluorescence in situ hybridization. Plant Biotechnol. J. 19, 1967–1978 (2021).

  89. Lindgren, V. V. In situ hybridization protocols. Arch. Pathol. Lab. Med. 124, 1387–1388 (2000).

  90. Huang, Y. et al. The formation and evolution of centromeric satellite repeats in Saccharum species. Plant J. 106, 616–629 (2021).

  91. Zhang, M. & Gao, C. The pan-genome annotations and assemblies for cotton. Zenodo https://doi.org/10.5281/zenodo.15644881 (2025).

  92. Gao, C. The pan-genome annotations, assemblies and scripts for cotton. Zenodo https://doi.org/10.5281/zenodo.17046697 (2025).

  93. Gao, C. The pan-gene, pangenome, and eQTL dataset for cotton. Zenodo https://doi.org/10.5281/zenodo.15646851 (2025).

Acknowledgements

This study was supported by the National Key Research and Development Program of China (2022YFF1001400 and 2021YFF1000102-1), the Natural Science Foundation of Henan (232300421010), the Key Research and Development Project of Henan Province (231111110400), the National Natural Science Foundation of China (32472146), Xinjiang Science and Technology Major Project of China (2024A02002), Innovation Program of the Chinese Academy of Agricultural Sciences (CAASASTIP-IVFCAAS) and the Fundamental Research Funds of State Key Laboratory of Cotton Biology (CB2024C06).

Author information

Contributions

Zhaoen Yang, X.G. and F.L. conceived and designed the research. X.G., Zhaoen Yang and Zuoren Yang managed the project. Zhaoen Yang, R.L., Z.W., X.M., C.G. and H.Z. prepared the samples. Y.Z. and Z.Z. performed phenotyping. Zhaoen Yang, C.G., M.Z., L.Y., B.G. and X.L. performed data analysis. Y.Z. and M.M. designed and performed the molecular experiments. Zhaoen Yang, C.G. and M.Z. prepared the figures and tables. Zhaoen Yang, F.L., X.G., G.H. and J.F.W. wrote and revised the paper.

Corresponding authors

Correspondence to Zhaoen Yang, Xiongfeng Ma, Jonathan F. Wendel, Xiaoyang Ge or Fuguang Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Michael Bevan, Tianzhen Zhang and Yuxian Zhu for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Genomic landscapes of the 107 accessions.

a, Phylogenetic analysis of 3256 accessions collected worldwide. Purple and orange lines represent the 7 semi-wild and 100 improved upland cotton accessions, respectively, selected for de novo assembly. Blue lines represent G. barbadense, used as the outgroup. b, Genomic landscape and assembly quality evaluation of the 107 accessions, including genome assembly size, contig N50 length, LAI, percentage of genic regions, exons, introns, repeats, LTRs, Gypsy LTRs, Copia LTRs, SINEs + LINEs and complete BUSCO genes. The phylogenetic tree in the left panel was inferred from PAVs and rooted with the wild accession TX1236 (= N301 here).

Extended Data Fig. 2 Characterization of 5S rDNA and 45S rDNA.

a-b, Schematic representation of 5S rDNA and 45S rDNA in upland cotton. c, Histograms showing the copy number of 5S rDNA, 18S rDNA, 5.8S rDNA, and 28S rDNA for each accession. d-e, Illustrations of 5S rDNA locations on chromosomes A09 (d) and D09 (e). The upper panel shows the long terminal repeats (LTRs) surrounding the 5S rDNA region, while the lower panel displays the reads mapping across the 5S rDNA locus. f, Depiction of a 45S rDNA locus on chromosome D07. The blue boxes in the upper panel highlight the rDNA loci, while the lower panel shows the reads mapping across the 45S rDNA region. The color bars in the upper panel represent the LTRs. g-h, Diversity analysis of 5S rDNA from A09 of N200 (g) and 45S rDNA from D07 of N302 (h). The 5S rDNA or 45S rDNA sequences were clustered using CD-HIT; sequences within the same cluster are completely identical, and each such cluster is referred to as a feature sequence. We calculated the frequency of each feature sequence relative to all sequences. Red lines denote the Levenshtein distance between a feature sequence and the most abundant feature sequence. Blue bars indicate frequency values. The right y-axis measures Levenshtein distance, while the left y-axis measures frequency per feature sequence cluster. Levenshtein distance, commonly known as edit distance, is the minimum number of edit operations required to transform one string into another.
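
As a rough illustration of the feature-sequence profile in panels g and h, the following Python sketch groups identical sequences, computes the frequency of each group and measures its Levenshtein distance to the most abundant group. It is not the authors' pipeline: exact-match grouping with `Counter` stands in for CD-HIT clustering, and the function names `levenshtein` and `feature_profile` and the toy sequences are hypothetical.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def feature_profile(sequences):
    """Return (sequence, frequency, distance to most abundant sequence) tuples.

    Assumption: identical sequences form one 'feature sequence'; this exact-match
    grouping is a stand-in for the CD-HIT clustering named in the legend.
    """
    counts = Counter(sequences)
    total = len(sequences)
    reference = counts.most_common(1)[0][0]
    return [
        (seq, n / total, levenshtein(seq, reference))
        for seq, n in counts.most_common()
    ]


if __name__ == "__main__":
    toy_repeats = ["ACGT", "ACGT", "ACGA", "AGT"]  # hypothetical rDNA units
    for seq, freq, dist in feature_profile(toy_repeats):
        print(seq, freq, dist)
```

Plotting frequency as bars and distance as a line over the same ordered clusters would give the dual-axis layout described in the legend.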

Extended Data Fig. 3 Pan and core genome analysis of 107 upland cotton accessions.

a, Variation of gene families in the pan-genome and core genome as additional cotton genomes are added. b, Composition of the pan-genome. The histograms display the number of gene families present at varying frequencies in the 107 accessions. The pie chart indicates the proportion of gene families in each category. c, Presence and absence information of pan-gene families across the 107 cotton genomes. d, Number of genes in each category in individual genomes. e, Comparison of nucleotide diversity (π) across core, softcore and dispensable genes. π values were evaluated in 5-kb windows. The bin numbers for each category are shown below the x-axis. The P-values for Core_vs_Softcore, Core_vs_Dispensable, and Softcore_vs_Dispensable are 0.0625, 1 × 10^−15 and 1 × 10^−15, respectively. f, Comparison of dN/dS ratios across core, softcore and dispensable genes, where dN represents the number of non-synonymous SNP mutations and dS represents the number of synonymous SNP mutations. The P-values for Core_vs_Softcore, Core_vs_Dispensable, and Softcore_vs_Dispensable are 2.4 × 10^−14, 1 × 10^−15 and 1.9 × 10^−14, respectively. g, Comparison of gene expression across core, softcore and dispensable genes. The number n indicates the number of genes within each category. The P-values for Core_vs_Softcore, Core_vs_Dispensable, and Softcore_vs_Dispensable are 1 × 10^−15, 1 × 10^−15 and 1 × 10^−15, respectively. Multiple comparisons were conducted using Tukey’s Honestly Significant Difference test with α = 0.05 for panels e-g. Center line, median; box limits, first and third quartiles; whiskers, 1.5 × interquartile range.
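
Nucleotide diversity (π) in panel e can be read as the average number of pairwise differences per site. The sketch below is a minimal Python illustration under stated assumptions: the input is a set of pre-aligned, equal-length sequences, and windows are non-overlapping. Only the 5-kb window size comes from the legend; how π was actually derived from the variant calls is not described here, and the function names are hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average pairwise differences per site among aligned sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(seqs, 2))
    differences = sum(
        sum(a != b for a, b in zip(x, y)) for x, y in pairs
    )
    return differences / (len(pairs) * length)


def windowed_pi(seqs, window=5000):
    """pi for consecutive non-overlapping windows (default 5 kb, as in the legend)."""
    length = len(seqs[0])
    return [
        nucleotide_diversity([s[start:start + window] for s in seqs])
        for start in range(0, length, window)
    ]


if __name__ == "__main__":
    toy_alignment = ["AAAA", "AAAT", "AATT"]  # hypothetical haplotypes
    print(nucleotide_diversity(toy_alignment))
    print(windowed_pi(toy_alignment, window=2))
```

The per-window values for genes in each category (core, softcore, dispensable) would then be the observations summarized by the box plots.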

Extended Data Fig. 4 Graph-based pan-genome construction.

a, Variation in the number of inversions (INV), translocations (TRANS), duplications (DUP) and copy number variations (CNV) as additional cotton genomes are added. b, An example of the graph-based pan-genome for chromosome A07 within the 127895 to 128015 bp region. c, Frequency of combined PAVs of varying lengths. The original PAVs in the pan-genome were combined based on their locations on reference genome N244. d, Increase in PAV number with the addition of cotton genomes. e, Number of different polymorphic PAVs. f, Composition of PAVs in semi-wild/wild and improved lines. g, Comparison of insertions (INS) and deletions (DEL) across the 26 chromosomes, partitioned by subgenome. h, Comparison of expression levels between genes with INS and DEL. Based on PAV location, PAV-associated genes were categorized into four groups: exon, intron, 5 kb upstream, and 5 kb downstream. Center line, median; box limits, first and third quartiles; whiskers, 1.5 × interquartile range. Two-tailed Student’s t-tests were used to evaluate the P-values.

Extended Data Fig. 5 A reciprocal translocation contributes to the diversity of upland cotton.

a, Chromosome collinearity analysis between A03 and A09 among four AD1 accessions and six sister species (AD2-AD7). The A03-A09 translocation is observed only in semi-wild/wild AD1 accessions. b, Validation of the A03-A09 translocation using Hi-C interaction matrices. Red boxes highlight chromosomes A03 and A09, with arrows showing the interactions between them. c, Phylogenetic (left) and principal component (right) analyses of the translocation distribution, based on a previously reported study of wild, landrace 1 and landrace 2 populations of upland cotton (ref. 5). The map shows the geographic distribution of the TRANS and non-TRANS populations. L1: Landrace 1, L2: Landrace 2, W: Wild. The map was generated in R using the ggplot2 and maps packages.

Extended Data Fig. 6 Pan-NLRs analysis based on the 107 assemblies.

a, Composition of the five subgroups in each accession. b, Total number of NLRs, cluster NLRs and singletons in each accession (n = 107 samples). c, Comparison of paired NLRs between semi-wild/wild (n = 7) and cultivated accessions (n = 100). d, Comparison of cluster NLRs between semi-wild/wild (n = 7) and cultivated accessions (n = 100). P-values were assessed by two-tailed Student’s t-tests. Data are presented as mean values ± SD in panels b-d. e, Composition of the Pan-NLR. The histograms display the number of gene families in the 107 accessions, categorized by frequency. The right panel is a zoomed-in view of the blue histograms in the left panel. f, Comparison of expression levels across core, softcore and dispensable NLRs. Gene numbers are shown below the x-axis. The P-values for Core_vs_Softcore, Core_vs_Dispensable, and Softcore_vs_Dispensable are 0.014, 0.034 and 0.89, respectively. g, Comparison of nucleotide diversity (π) across core, softcore and dispensable NLRs. Polymorphic nucleotide sites used for π evaluation are indicated below the x-axis. The P-values for Core_vs_Softcore, Core_vs_Dispensable, and Softcore_vs_Dispensable are 1 × 10^−15, 1 × 10^−15 and 3.2 × 10^−5, respectively. Multiple comparisons were conducted using Tukey’s Honestly Significant Difference test with α = 0.05 for panels f and g. Center line, median; box limits, first and third quartiles; whiskers, 1.5 × interquartile range. h, Frequency of structural variants (SVs) within each NLR cluster. i, Comparison of nucleotide diversity among core, softcore, and dispensable NLRs. j, Comparison of π values of different domains between YER and other regions. k, Comparison of LRR diversity between YER and other regions across core, softcore and dispensable NLRs. P-values were evaluated by two-tailed Student’s t-tests, and data are presented as mean ± SD in panels i-k. Polymorphic nucleotide sites used for π evaluation are shown below the x-axis for panels i-k.

Extended Data Fig. 7 PAV-GWAS captures an 806-bp insertion associated with FS and SI.

a, Fiber strength (FS) and seed index (SI) are co-located on chromosome A07. The shaded box indicates the co-location signal. The horizontal lines represent the genome-wide significance threshold after Bonferroni correction (α = 1/n, where ‘n’ is the total number of independent SVs). Thresholds were −log(P) = 4.6 (419 accessions) and −log(P) = 4.5 (1226 accessions). b, Zoomed-in view of the signal interval identified by SNPs, InDels, and PAVs. Blue dots represent SNPs, green dots indicate InDels, and light coral dots are PAVs. A significance threshold of −log(P) = 4.6 for 419 accessions was used. c, Haplotype analysis at the SI locus. Three haplotypes were identified. d, An 806-bp insertion detected by PAV-GWAS in gene N244A07G24740 divides upland cotton into Hap. 1 and Hap. 2. A significance threshold of −log(P) = 4.6 for 419 accessions was used. e, Comparison of FS between accessions carrying Hap. 1 (n = 57 accessions) and Hap. 2 (n = 358 accessions). P-values were evaluated by two-tailed Student’s t-tests. Center line, median; box limits, first and third quartiles; whiskers, range from minimum to maximum. f, Accessions carrying SI3 exhibit higher expression levels and SI compared to those without any QTLs. The central lines indicate the mean value, whereas the lines above and below represent mean ± SD. Accessions for each category are shown above the x-axis. The number of different QTL combinations is shown above the x-axis. g, Overexpression of N244A07G024740 leads to increased seed size. The bar represents 1 cm. h-j, Comparison of SI (h), seed length (i), and seed width (g) among WT (n = 10 samples), OE-2 (n = 10 samples), OE-12 (n = 10 samples) and OE-13 (n = 10 samples). Multiple comparisons were conducted using the Tukey’s Honestly Significant Difference test with α = 0.05. Data are present as the mean ± SD. 
The P-values in panel h for WT_vs_OE-2, WT_vs_OE-12, WT_vs_OE-13, OE-2_vs_OE-12, OE-2_vs_OE-13 and OE-12_vs_OE-13 are 2×10−14, 5.3×10−13, 8.7×10−10, 1.6×10−5, 1×10−10 and 4.6×10−5, respectively. The P-values in panel i for WT_vs_OE-2, WT_vs_OE-12, WT_vs_OE-13, OE-2_vs_OE-12, OE-2_vs_OE-13 and OE-12_vs_OE-13 are 2.4×10−5, 5.1×10−6, 7.8×10−3, 0.59, 0.21 and 0.013, respectively. The P-values in panel j for WT_vs_OE-2, WT_vs_OE-12, WT_vs_OE-13, OE-2_vs_OE-12, OE-2_vs_OE-13 and OE-12_vs_OE-13 are 3.6×10−4, 9.9×10−4, 0.30, 0.98, 0.04 and 0.09, respectively.
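
The Bonferroni threshold quoted in panels a, b and d (α = 1/n) can be sketched as a short calculation. This is a minimal illustration, assuming that −log(P) denotes a base-10 logarithm, as is conventional for Manhattan plots. The captions do not report n, so the value of `n_independent_svs` below is hypothetical, chosen only because it gives a threshold close to 4.6. The function names are likewise illustrative.

```python
import math


def bonferroni_threshold(n_tests: int) -> float:
    """Return the -log10(P) cutoff for alpha = 1/n_tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    alpha = 1.0 / n_tests
    return -math.log10(alpha)


def is_significant(p_value: float, n_tests: int) -> bool:
    """True if a variant's P-value reaches the alpha = 1/n cutoff."""
    return -math.log10(p_value) >= bonferroni_threshold(n_tests)


# Hypothetical number of independent SVs (not taken from the paper).
n_independent_svs = 40_000
print(f"threshold: -log10(P) = {bonferroni_threshold(n_independent_svs):.2f}")
print(is_significant(1e-6, n_independent_svs))   # well below alpha
print(is_significant(1e-3, n_independent_svs))   # well above alpha
```

Because α = 1/n, the cutoff simplifies to log10(n), so a different number of independent SVs in another panel or trait yields a slightly different threshold.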

Extended Data Fig. 8 A rare allele of CESA7 contributes to stronger fiber in lines derived from distant hybridization.

a, GWAS for fiber strength using PAVs with a minor allele frequency greater than 0.01. The horizontal lines represent the genome-wide significance threshold after Bonferroni correction (α = 1/n, where ‘n’ is the total number of independent SVs), with −log(P) = 4.7 for 419 accessions. b, Zoomed-in view of the signal on A07. A significance threshold of −log(P) = 4.7 for 419 accessions was used. c, Two PAVs located in N24407G004420 (CESA7) cluster upland cotton into Hap. A and Hap. B. d, Comparison of fiber strength (FS) and fiber length (FL) between the two haplotypes across two panels. Hap. A consistently exhibits superior FS and FL compared to Hap. B. Panel information indicates that accessions in Hap. A are primarily from distant hybridization, predominantly found in the Suyuan and Suyou series resources. Center line, median; box limits, first and third quartiles; whiskers, range from minimum to maximum. The “n” represents the number of accessions with Hap. A and Hap. B. P-values were evaluated by two-tailed Student’s t-tests.

Extended Data Fig. 9 Impact of the A06 inversion on leaf trichome density and stem trichome phenotypes.

a, Leaf and stem trichome phenotyping. From left to right: low LTA, high LTA, low STA, and high STA. LTA refers to leaf trichome amount; STA refers to stem trichome amount. The bar represents 500 μm. b, Manhattan plots for GWAS of LTA and STA, with the associated genomic regions highlighted by a blue box. The horizontal lines represent the genome-wide significance threshold after Bonferroni correction (α = 1/n, where ‘n’ is the total number of independent SVs), with −log(P) = 4.6 for 419 accessions. c, Zoomed-in view of the genomic region on A07: the upper panel shows GWAS signals, the middle panel displays gene models, and the lower panel provides a detailed view of the candidate gene region. A significance threshold of −log(P) = 4.6 for 419 accessions was used. d, Haplotype comparison of the gene N244A06G021950, indicating amino acid variants. The numerical values above denote the positions of these variations. e, Haplotype effect on LTA, demonstrating differences associated with distinct haplotypes. f, Haplotype effect on FPKM (Fragments Per Kilobase of transcript per Million mapped reads), illustrating the gene expression levels of the two haplotypes. The numbers of accessions carrying INV2 or no-INV2 are shown below the x-axis for panels e and f. Center line, median; box limits, first and third quartiles; whiskers, range from minimum to maximum. The P-value was evaluated by Student’s t-test.

Extended Data Fig. 10 eGWAS analysis of flavonoid biosynthesis genes reveals co-location with GhTT2.

a, Grey lines represent trans-eQTL regulation, while red lines indicate local eQTL regulation. An asterisk marks genes (ANS and ANR) whose regulatory relationship with GhTT2 has been experimentally validated in a previous study. b, Comparison of expression levels between the two haplotypes of the GhTT2 cis-eQTL. “n” indicates sample size. Center line, median; box limits, first and third quartiles; whiskers, range from minimum to maximum. The P-value was calculated by a two-tailed Student’s t-test.

Supplementary information

Supplementary Information

Supplementary Figs. 1–33.

Reporting Summary

Peer Review File

Supplementary Tables

Supplementary Table 1. Detailed information of the germplasm used in this study.
Supplementary Table 2. Summary statistics of PacBio sequencing reads for the 107 accessions.
Supplementary Table 3. Summary of genomic landscape for the 107 assemblies.
Supplementary Table 4. Sequencing quality of 12 Hi-C libraries in this study.
Supplementary Table 5. The telomere assembly of each chromosome.
Supplementary Table 6. The centromere location of each chromosome.
Supplementary Table 7. Chi-squared tests for the Caribbean, South America, North America and Yucatan regions.
Supplementary Table 8. Introgression from G. barbadense in the 107 accessions.
Supplementary Table 9. Haplotype carried by the wild–semi-wild cotton.
Supplementary Table 10. Summary of the gene number of the five subgroups in the 107 assemblies.
Supplementary Table 11. QTLs identified by GWAS using SNP and PAV.
Supplementary Table 12. List of genes.
Supplementary Table 13. Summary of the length of the PAVs.
Supplementary Table 14. The primers used in this study.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, Z., Yang, Z., Gao, C. et al. Graph pan-genome illuminates evolutionary trajectories and agronomic trait architecture in allotetraploid cotton. Nat Genet 58, 218–229 (2026). https://doi.org/10.1038/s41588-025-02462-1

  • DOI: https://doi.org/10.1038/s41588-025-02462-1
