Graph pan-genome illuminates evolutionary trajectories and agronomic trait architecture in allotetraploid cotton

Yang, Zhaoen; Yang, Zuoren; Gao, Chenxu; Zhang, Mingjun; Hu, Guanjing; Yang, Lan; Zhang, Yihao; Ma, Meng; Liu, Renju; Wang, Zhi; Gao, Baibai; Zhang, Zhibin; Zhao, Hang; Liu, Xuan; Ma, Xiongfeng; Wendel, Jonathan F.; Ge, Xiaoyang; Li, Fuguang

doi:10.1038/s41588-025-02462-1

Article
Published: 02 January 2026

Zhaoen Yang ORCID: orcid.org/0000-0003-2323-4654^1,2^na1,
Zuoren Yang^1,2,3^na1,
Chenxu Gao^1,2^na1,
Mingjun Zhang^1,2^na1,
Guanjing Hu ORCID: orcid.org/0000-0001-8552-7394^2,4^na1,
Lan Yang^1,2,
Yihao Zhang^1,2,
Meng Ma^1,2,
Renju Liu^1,2,
Zhi Wang^1,2,
Baibai Gao^1,2,
Zhibin Zhang^1,2,
Hang Zhao^1,2,5,
Xuan Liu^2,4,
Xiongfeng Ma ORCID: orcid.org/0000-0003-1005-7281^1,2,
Jonathan F. Wendel ORCID: orcid.org/0000-0003-2258-5081^6,
Xiaoyang Ge ORCID: orcid.org/0000-0003-3428-2942^1,2 &
…
Fuguang Li ORCID: orcid.org/0000-0002-9057-0554^1,2

Nature Genetics volume 58, pages 218–229 (2026)

Abstract

Upland cotton (Gossypium hirsutum), one of the world’s major fiber crops, faces challenges from the genetic homogeneity of modern varieties. Here we present 107 gold-standard genome assemblies spanning the wild-to-domesticated continuum, revealing six large-scale structural variations, including a chromosomal reciprocal translocation and five inversions tracing the evolutionary history of cultivated cotton in the Americas. This history also involved continuous introgression from Gossypium barbadense, shaping the genetic diversity of G. hirsutum landraces and cultivars. Leveraging the graph pan-genome, we capture the sequence and structural diversity of nucleotide-binding site–leucine-rich repeat genes, uncovering pathogen-driven selection signatures and loci associated with disease resistance. A presence–absence variation genome-wide association study (GWAS) identified previously overlooked loci for key fiber traits, complementing single-nucleotide polymorphism–GWAS findings. Additionally, we construct a detailed map of large inversions, offering insights into hybridization dynamics and strategies to mitigate linkage drag. This study enhances our understanding of cotton evolution and domestication while delivering a valuable resource to enhance breeding.

**Fig. 1: Genomic insights into the origin and domestication of upland cotton inferred from translocations and inversions.**

**Fig. 2: Large-scale introgressions from *Gossypium barbadense* enriched the upland cotton gene pool.**

**Fig. 3: Pan-NLR construction in the super pan-genome.**

**Fig. 4: PAV–GWAS is more powerful than SNP–GWAS in unveiling the genetic basis underlying traits.**

**Fig. 5: Pan-genome-based map of large inversions.**

Data availability

All raw sequencing data generated in this paper have been deposited into NCBI as studies PRJNA1103937 and PRJNA1111598. Genome assemblies and annotations are available in two repositories: NCBI (under PRJNA1345676) and Zenodo (https://doi.org/10.5281/zenodo.15644881 and https://doi.org/10.5281/zenodo.17046697; refs. 91,92). Pan-gene, pan-genome and eQTL datasets are available at Zenodo via https://doi.org/10.5281/zenodo.15646851 (ref. 93).

Code availability

Custom scripts can be accessed via Zenodo at https://doi.org/10.5281/zenodo.15644881 (ref. 91). The software and tools used in this study are detailed in the Methods and Reporting Summary.

References

Wendel, J. F., Brubaker, C. L. & Percival, A. E. Genetic diversity in Gossypium hirsutum and the origin of upland cotton. Am. J. Bot. 79, 1291–1310 (1992).
Chen, Z. J. et al. Genomic diversifications of five Gossypium allopolyploid species and their impact on cotton improvement. Nat. Genet. 52, 525–533 (2020).
Viot, C. R. & Wendel, J. F. Evolution of the cotton genus, Gossypium, and its domestication in the Americas. Crit. Rev. Plant Sci. 42, 1–33 (2023).
Percy, R. G. & Wendel, J. F. Allozyme evidence for the origin and diversification of Gossypium barbadense L. Theor. Appl. Genet. 79, 529–542 (1990).
Yuan, D. et al. Parallel and intertwining threads of domestication in allopolyploid cotton. Adv. Sci. 8, 2003634 (2021).
He, S. et al. The genomic basis of geographic differentiation and fiber improvement in cultivated cotton. Nat. Genet. 53, 916–924 (2021).
Huang, G., Huang, J. Q., Chen, X. Y. & Zhu, Y. X. Recent advances and future perspectives in cotton research. Annu. Rev. Plant Biol. 72, 437–462 (2021).
Yang, Z., Qanmber, G., Wang, Z., Yang, Z. & Li, F. Gossypium genomics: trends, scope, and utilization for cotton improvement. Trends Plant Sci. 25, 488–500 (2020).
Ma, Z. et al. Resequencing a core collection of upland cotton identifies genomic variation and loci influencing fiber quality and yield. Nat. Genet. 50, 803–813 (2018).
Wang, M. et al. Asymmetric subgenome selection and cis-regulatory divergence during cotton domestication. Nat. Genet. 49, 579–587 (2017).
Li, F. et al. Genome sequence of cultivated Upland cotton (Gossypium hirsutum TM-1) provides insights into genome evolution. Nat. Biotechnol. 33, 524–530 (2015).
Hu, Y. et al. Gossypium barbadense and Gossypium hirsutum genomes provide insights into the origin and evolution of allotetraploid cotton. Nat. Genet. 51, 739–748 (2019).
Huang, G. et al. Genome sequence of Gossypium herbaceum and genome updates of Gossypium arboreum and Gossypium hirsutum provide insights into cotton A-genome evolution. Nat. Genet. 52, 516–524 (2020).
Yang, Z. et al. Extensive intraspecific gene order and gene structural variations in upland cotton cultivars. Nat. Commun. 10, 2989 (2019).
Ma, Z. et al. High-quality genome assembly and resequencing of modern cotton cultivars provide resources for crop improvement. Nat. Genet. 53, 1385–1391 (2021).
Yang, Z. et al. Recent progression and future perspectives in cotton genomic breeding. J. Integr. Plant Biol. 65, 548–569 (2023).
Sreedasyam, A. et al. Genome resources for three modern cotton lines guide future breeding efforts. Nat. Plants 10, 1039–1051 (2024).
Zhou, Y. et al. Graph pangenome captures missing heritability and empowers tomato breeding. Nature 606, 527–534 (2022).
Liu, Y. et al. Pan-genome of wild and cultivated soybeans. Cell 182, 162–176 (2020).
Jin, S. et al. Structural variation (SV)-based pan-genome and GWAS reveal the impacts of SVs on the speciation and diversification of allotetraploid cottons. Mol. Plant 16, 678–693 (2023).
Li, J. et al. Convergence and divergence of diploid and tetraploid cotton genomes. Nat. Genet. 56, 2562–2573 (2024).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Chang, X. et al. High-quality Gossypium hirsutum and Gossypium barbadense genome assemblies reveal the landscape and evolution of centromeres. Plant Commun. 5, 100722 (2024).
Wang, M. et al. Reference genome sequences of two cultivated allotetraploid cottons, Gossypium hirsutum and Gossypium barbadense. Nat. Genet. 51, 224–229 (2019).
Huang, G. et al. A telomere-to-telomere cotton genome assembly reveals centromere evolution and a Mutator transposon-linked module regulating embryo development. Nat. Genet. 56, 1953–1963 (2024).
Chen, J. et al. A complete telomere-to-telomere assembly of the maize genome. Nat. Genet. 55, 1221–1231 (2023).
Gan, Y. et al. Chromosomal locations of 5S and 45S rDNA in Gossypium genus and its phylogenetic implications revealed by FISH. PLoS ONE 8, e68207 (2013).
Cheng, Y. et al. Gossypium purpurascens genome provides insight into the origin and domestication of upland cotton. J. Adv. Res. 56, 15–29 (2024).
Li, X. et al. Large-scale gene expression alterations introduced by structural variation drive morphotype diversification in Brassica oleracea. Nat. Genet. 56, 517–529 (2024).
Morrell, P. L., Buckler, E. S. & Ross-Ibarra, J. Crop genomics: advances and applications. Nat. Rev. Genet. 13, 85–96 (2011).
Hickey, G. et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat. Biotechnol. 42, 663–673 (2024).
Yuan, Y., Bayer, P. E., Batley, J. & Edwards, D. Current status of structural variation studies in plants. Plant Biotechnol. J. 19, 2153–2163 (2021).
Reinisch, A. J. et al. A detailed RFLP map of cotton, Gossypium hirsutum × Gossypium barbadense: chromosome organization and evolution in a disomic polyploid genome. Genetics 138, 829–847 (1994).
Li, L. et al. Genomic analyses reveal the genetic basis of early maturity and identification of loci and candidate genes in upland cotton (Gossypium hirsutum L.). Plant Biotechnol. J. 19, 109–123 (2021).
Lee, C. R. et al. Young inversion with multiple linked QTLs under selection in a hybrid zone. Nat. Ecol. Evol. 1, 119 (2017).
Wendel, J. F., Brubaker, C. L. & Seelanan, T. in Physiology of Cotton (eds Stewart, J. M. C. D. et al.) 1–18 (Springer, 2010).
Edelman, N. B. & Mallet, J. Prevalence and adaptive impact of introgression. Annu. Rev. Genet. 55, 265–283 (2021).
Wang, N. et al. High-resolution sequencing of nine elite upland cotton cultivars uncovers genic variations and breeding improvement targets. Plant J. 113, 145–159 (2023).
Tang, D. et al. Genome evolution and diversity of wild and cultivated potatoes. Nature 606, 535–554 (2022).
Ma, S. et al. Direct pathogen-induced assembly of an NLR immune receptor complex to form a holoenzyme. Science 370, 3069 (2020).
Zhang, Y. et al. Genome-wide association analysis reveals a novel pathway mediated by a dual-TIR domain protein for pathogen resistance in cotton. Genome Biol. 24, 111 (2023).
Zhang, Y. et al. Promotion of apoplastic oxidative burst by artificially selected GhCBSX3A enhances Verticillium dahliae resistance in upland cotton. Plant J. 118, 2154–2168 (2024).
Van de Weyer, A. L. et al. A species-wide inventory of NLR genes and alleles in Arabidopsis thaliana. Cell 178, 1260–1272 (2019).
Yanagisawa, M. et al. Patterning mechanisms of cytoskeletal and cell wall systems during leaf trichome morphogenesis. Nat. Plants 1, 15014 (2015).
Ding, M. et al. The hairless stem phenotype of cotton (Gossypium barbadense) is linked to a Copia-like retrotransposon insertion in a homeodomain-leucine zipper gene (HD1). Genetics 201, 143–154 (2015).
Yuan, R. et al. Differentiation in the genetic basis of stem trichome development between cultivated tetraploid cotton species. BMC Plant Biol. 21, 115 (2021).
Hinchliffe, D. J. et al. The GhTT2_A07 gene is linked to the brown colour and natural flame retardancy phenotypes of Lc1 cotton (Gossypium hirsutum L.) fibres. J. Exp. Bot. 67, 5461–5471 (2016).
Wen, T. et al. Linkage and association mapping reveals the genetic basis of brown fibre (Gossypium hirsutum). Plant Biotechnol. J. 16, 1654–1666 (2018).
Yan, Q. et al. Up-regulation of GhTT2-3A in cotton fibres during secondary wall thickening results in brown fibres with improved quality. Plant Biotechnol. J. 16, 1735–1747 (2018).
Walkowiak, S. et al. Multiple wheat genomes reveal global variation in modern breeding. Nature 588, 277–283 (2020).
Wellenreuther, M. & Bernatchez, L. Eco-evolutionary genomics of chromosomal inversions. Trends Ecol. Evol. 33, 427–440 (2018).
Lowry, D. B. & Willis, J. H. A widespread chromosomal inversion polymorphism contributes to a major life-history transition, local adaptation, and reproductive isolation. PLoS Biol. 8, e1000500 (2010).
Shang, L. et al. A super pan-genomic landscape of rice. Cell Res. 32, 878–896 (2022).
Du, X. et al. Resequencing of 243 diploid cotton accessions based on an updated A genome identifies the genetic basis of key agronomic traits. Nat. Genet. 50, 796–802 (2018).
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26, 1641–1650 (2009).
Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31, 1119–1125 (2013).
Seppey, M., Manni, M. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness. Methods Mol. Biol. 1962, 227–245 (2019).
Ou, S., Chen, J. & Jiang, N. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. 46, e126 (2018).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).
Li, K., Xu, P., Wang, J., Yi, X. & Jiao, Y. Identification of errors in draft genome assemblies at single-nucleotide resolution for quality assessment and improvement. Nat. Commun. 14, 6556 (2023).
Han, Y. & Wessler, S. R. MITE-Hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Res. 38, e199 (2010).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268 (2007).
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 18 (2008).
Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics 4, 4.10.1–4.10.14 (2009).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
Beier, S., Thiel, T., Munch, T., Scholz, U. & Mascher, M. MISA-web: a web server for microsatellite prediction. Bioinformatics 33, 2583–2585 (2017).
Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19, 215–225 (2003).
Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).
Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004).
Keilwagen, J., Hartung, F. & Grau, J. GeMoMa: homology-based gene prediction utilizing intron position conservation and RNA-seq data. Methods Mol. Biol. 1962, 161–177 (2019).
Pertea, M., Kim, D., Pertea, G. M., Leek, J. T. & Salzberg, S. L. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat. Protoc. 11, 1650–1667 (2016).
Besemer, J., Lomsadze, A. & Borodovsky, M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 29, 2607–2618 (2001).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 8, 1494–1512 (2013).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, R7 (2008).
Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 238 (2019).
Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
Hu, G. et al. A telomere-to-telomere genome assembly of cotton provides insights into centromere evolution and short-season adaptation. Nat. Genet. 57, 1031–1043 (2025).
Li, P. et al. RGAugury: a pipeline for genome-wide prediction of resistance gene analogs (RGAs) in plants. BMC Genomics 17, 852 (2016).
Saucet, S. B. et al. Two linked pairs of Arabidopsis TNL resistance genes independently confer recognition of bacterial effector AvrRps4. Nat. Commun. 6, 6338 (2015).
Marcais, G. et al. MUMmer4: a fast and versatile genome alignment system. PLoS Comput. Biol. 14, e1005944 (2018).
Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44, 821–824 (2012).
Li, M. X., Yeung, J. M., Cherny, S. S. & Sham, P. C. Evaluating the effective numbers of independent tests and significant p-value thresholds in commercial genotyping arrays and public imputation reference datasets. Hum. Genet. 131, 747–756 (2012).
Yang, L. et al. Regulatory networks of coresident subgenomes during rapid fiber cell elongation in upland cotton. Plant Commun. 5, 101130 (2024).
Ge, X. et al. Efficient genotype-independent cotton genetic transformation and genome editing. J. Integr. Plant Biol. 65, 907–917 (2023).
Zhang, T., Liu, G., Zhao, H., Braz, G. T. & Jiang, J. Chorus2: design of genome-scale oligonucleotide-based probes for fluorescence in situ hybridization. Plant Biotechnol. J. 19, 1967–1978 (2021).
Lindgren, V. V. In situ hybridization protocols. Arch. Pathol. Lab. Med. 124, 1387–1388 (2000).
Huang, Y. et al. The formation and evolution of centromeric satellite repeats in Saccharum species. Plant J. 106, 616–629 (2021).
Zhang, M. & Gao, C. The pan-genome annotations and assemblies for cotton. Zenodo https://doi.org/10.5281/zenodo.15644881 (2025).
Gao, C. The pan-genome annotations, assemblies and scripts for cotton. Zenodo https://doi.org/10.5281/zenodo.17046697 (2025).
Gao, C. The pan-gene, pangenome, and eQTL dataset for cotton. Zenodo https://doi.org/10.5281/zenodo.15646851 (2025).

Acknowledgements

This study was supported by the National Key Research and Development Program of China (2022YFF1001400 and 2021YFF1000102-1), the Natural Science Foundation of Henan (232300421010), the Key Research and Development Project of Henan Province (231111110400), the National Natural Science Foundation of China (32472146), Xinjiang Science and Technology Major Project of China (2024A02002), Innovation Program of the Chinese Academy of Agricultural Sciences (CAASASTIP-IVFCAAS) and the Fundamental Research Funds of State Key Laboratory of Cotton Biology (CB2024C06).

Author information

These authors contributed equally: Zhaoen Yang, Zuoren Yang, Chenxu Gao, Mingjun Zhang, Guanjing Hu.

Authors and Affiliations

Zhengzhou Research Base, State Key Laboratory of Cotton Bio‑breeding and Integrated Utilization, School of Agricultural Sciences, Zhengzhou University, Zhengzhou, China
Zhaoen Yang, Zuoren Yang, Chenxu Gao, Mingjun Zhang, Lan Yang, Yihao Zhang, Meng Ma, Renju Liu, Zhi Wang, Baibai Gao, Zhibin Zhang, Hang Zhao, Xiongfeng Ma, Xiaoyang Ge & Fuguang Li
State Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Institute of Cotton Research, Chinese Academy of Agricultural Sciences, Anyang, China
Zhaoen Yang, Zuoren Yang, Chenxu Gao, Mingjun Zhang, Guanjing Hu, Lan Yang, Yihao Zhang, Meng Ma, Renju Liu, Zhi Wang, Baibai Gao, Zhibin Zhang, Hang Zhao, Xuan Liu, Xiongfeng Ma, Xiaoyang Ge & Fuguang Li
Xinjiang Key Laboratory of Crop Gene Editing and Germplasm Innovation, Institute of Western Agricultural of CAAS, Changji, China
Zuoren Yang
Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
Guanjing Hu & Xuan Liu
College of Life Sciences, Qufu Normal University, Qufu, China
Hang Zhao
Department of Ecology, Evolution and Organismal Biology, Iowa State University, Ames, IA, USA
Jonathan F. Wendel

Contributions

Zhaoen Yang, X.G. and F.L. conceived and designed the research. X.G., Zhaoen Yang and Zuoren Yang managed the project. Zhaoen Yang, R.L., Z.W., X.M., C.G. and H.Z. prepared the samples. Y.Z. and Z.Z. performed phenotyping. Zhaoen Yang, C.G., M.Z., L.Y., B.G. and X.L. performed data analysis. Y.Z. and M.M. designed and performed the molecular experiments. Zhaoen Yang, C.G. and M.Z. prepared the figures and tables. Zhaoen Yang, F.L., X.G., G.H. and J.F.W. wrote and revised the paper.

Corresponding authors

Correspondence to Zhaoen Yang, Xiongfeng Ma, Jonathan F. Wendel, Xiaoyang Ge or Fuguang Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Michael Bevan, Tianzhen Zhang and Yuxian Zhu for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Genomic landscapes of the 107 accessions.

a, Phylogenetic analysis of 3,256 accessions collected worldwide. Purple and orange lines represent, respectively, the 7 semi-wild and 100 improved upland cotton accessions selected for de novo assembly. Blue lines represent the G. barbadense accessions used as the outgroup. b, Genomic landscape and assembly quality evaluation of the 107 accessions, including genome assembly size, contig N50 length, LAI, percentage of genic regions, exons, introns, repeats, LTRs, Gypsy LTRs, Copia LTRs, SINEs + LINEs and complete BUSCO genes. The phylogenetic tree in the left panel was inferred from PAVs and rooted with the wild accession TX1236 (= N301 here).
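
As a reminder of what the contig N50 statistic in panel b measures, the following minimal Python sketch computes it from a list of contig lengths: the N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function name `n50` and the toy lengths are illustrative assumptions, not values or code from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half of the
        # assembly size defines the N50.
        if running >= half_total:
            return length


# Toy contig lengths (hypothetical, not from the paper).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

A larger N50 indicates that more of the assembly sits in long contiguous sequences, which is why it is reported alongside LAI and BUSCO completeness as a quality indicator.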

Extended Data Fig. 2 Characterization of 5S rDNA and 45S rDNA.

a-b, Schematic representation of 5S rDNA and 45S rDNA in upland cotton. c, Histograms showing the copy number of 5S rDNA, 18S rDNA, 5.8S rDNA and 28S rDNA for each accession. d-e, Illustrations of 5S rDNA locations on chromosomes A09 (d) and D09 (e). The upper panel shows the long terminal repeats (LTRs) surrounding the 5S rDNA region, while the lower panel displays the reads mapping across the 5S rDNA locus. f, Depiction of a 45S rDNA locus on chromosome D07. The blue boxes in the upper panel highlight the rDNA loci, while the lower panel shows the reads mapping across the 45S rDNA region. The color bars in the upper panel represent the LTRs. g-h, Diversity analysis of 5S rDNA from A09 of N200 (g) and 45S rDNA from D07 of N302 (h). The 5S rDNA or 45S rDNA sequences were clustered with CD-HIT such that sequences within the same cluster are completely identical; each such cluster is referred to as a feature sequence. The frequency of each feature sequence was calculated relative to all sequences. Red lines denote the Levenshtein distance between a feature sequence and the most abundant feature sequence. Blue bars indicate frequency values. The right y-axis shows Levenshtein distance, while the left y-axis shows the frequency of each feature sequence cluster. Levenshtein distance, commonly known as edit distance, is the minimum number of single-character edit operations (insertions, deletions or substitutions) required to transform one string into another.
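
The two quantities plotted in panels g-h can be illustrated with a short Python sketch. It assumes the rDNA copies are available as plain strings; grouping exactly identical strings with `collections.Counter` is used here only as a stand-in for the CD-HIT clustering named in the legend, and the function names and toy sequences are hypothetical.

```python
from collections import Counter


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def feature_sequence_profile(copies):
    """For each distinct sequence, return (sequence, frequency among all
    copies, Levenshtein distance to the most abundant sequence)."""
    counts = Counter(copies)
    total = len(copies)
    most_abundant, _ = counts.most_common(1)[0]
    return [(seq, n / total, levenshtein(seq, most_abundant))
            for seq, n in counts.most_common()]


# Toy rDNA copies (hypothetical, far shorter than real units).
toy_copies = ["ACGT", "ACGT", "ACGA", "ACT"]
for seq, freq, dist in feature_sequence_profile(toy_copies):
    print(seq, freq, dist)
```

In this toy input the most abundant sequence has distance 0 to itself, and each variant is one edit away, which mirrors how the red line in the panels tracks divergence from the dominant unit.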

Extended Data Fig. 3 Pan and core genome analysis of 107 upland cotton accessions.

a, Variation in the number of gene families in the pan-genome and core genome as additional cotton genomes are added. b, Composition of the pan-genome. The histograms display the number of gene families present at varying frequencies across the 107 accessions. The pie chart indicates the proportion of gene families in each category. c, Presence and absence information of pan gene families across the 107 cotton genomes. d, Number of genes in each category in individual genomes. e, Comparison of nucleotide diversity (π) across core, softcore and dispensable genes. π values were evaluated in 5-kb windows. The numbers of bins for each category are shown below the x-axis. The P-values for Core_vs_Softcore, Core_vs_Dispensable, and Softcore_vs_Dispensable are 0.0625, 1×10⁻¹⁵ and 1×10⁻¹⁵, respectively. f, Comparison of d_N/d_S ratios across core, softcore and dispensable genes, where d_N represents the number of non-synonymous SNP mutations and d_S represents the number of synonymous SNP mutations. The P-values for Core_vs_Softcore, Core_vs_Dispensable, and Softcore_vs_Dispensable are 2.4×10⁻¹⁴, 1×10⁻¹⁵ and 1.9×10⁻¹⁴, respectively. g, Comparison of gene expression across core, softcore and dispensable genes. Number n indicates the number of genes within each category. The P-values for Core_vs_Softcore, Core_vs_Dispensable, and Softcore_vs_Dispensable are 1×10⁻¹⁵, 1×10⁻¹⁵ and 1×10⁻¹⁵, respectively. Multiple comparisons were conducted using Tukey’s honestly significant difference test with α = 0.05 for panels e-g. Center line, median; box limits, first and third quartiles; whiskers, 1.5 × interquartile range.
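
The core, softcore and dispensable categories used throughout this legend are defined by how many of the 107 genomes carry a gene family. The sketch below shows one way such a classification could be written in Python. The cut-offs are assumptions for illustration only (core = present in every genome; softcore = present in at least 90% of genomes but not all); the legend does not state the thresholds applied in the study, and the function name is hypothetical.

```python
def classify_gene_family(n_present, n_genomes, softcore_fraction=0.9):
    """Assign a gene family to a pan-genome category from its presence count.

    softcore_fraction is an assumed cut-off, not a value from the paper.
    """
    if not 0 < n_present <= n_genomes:
        raise ValueError("n_present must be between 1 and n_genomes")
    if n_present == n_genomes:
        return "core"
    if n_present / n_genomes >= softcore_fraction:
        return "softcore"
    return "dispensable"


# Presence counts out of 107 genomes (toy values).
for count in (107, 97, 96, 1):
    print(count, classify_gene_family(count, 107))
```

Keeping the threshold as a parameter makes it easy to check how sensitive the category sizes in panel b would be to the chosen cut-off.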

Extended Data Fig. 4 Graph-based pan-genome construction.

a, Variation in the numbers of inversions (INV), translocations (TRANS), duplications (DUP) and copy number variations (CNV) as additional cotton genomes are added. b, An example of the graph-based pan-genome for the 127,895–128,015-bp region of chromosome A07. c, Frequency of combined PAVs of varying lengths. The original PAVs in the pan-genome were combined based on their locations on the reference genome N244. d, Increase in PAV number with the addition of cotton genomes. e, Number of different polymorphic PAVs. f, Composition of PAVs in semi-wild/wild and improved lines. g, Comparison of insertions (INS) and deletions (DEL) across the 26 chromosomes, partitioned by subgenome. h, Comparison of expression levels between genes with INS and DEL. Based on PAV location, PAV-associated genes were categorized into four groups: exon, intron, 5 kb upstream and 5 kb downstream. Center line, median; box limits, first and third quartiles; whiskers, 1.5 × interquartile range. P-values were evaluated using two-tailed Student’s t-tests.
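
Accumulation curves such as those in panels a and d track how many distinct variants have been seen after each additional genome. A minimal Python sketch of the idea follows, assuming each genome is represented as a set of variant identifiers; it reports the cumulative union (all variants seen so far) and intersection (variants shared by every genome so far), averaged over random genome orderings. The data layout, function names and toy identifiers are assumptions and do not reproduce the study's pipeline.

```python
import random


def accumulation_curve(presence, order):
    """Return [(union_size, intersection_size), ...] as genomes are added
    in the given order. presence maps genome name -> set of variant IDs."""
    seen, shared = set(), None
    curve = []
    for name in order:
        variants = presence[name]
        seen |= variants
        shared = set(variants) if shared is None else shared & variants
        curve.append((len(seen), len(shared)))
    return curve


def mean_accumulation(presence, n_permutations=100, seed=0):
    """Average the curve over random genome orderings (seeded for
    reproducibility) to remove dependence on any single ordering."""
    rng = random.Random(seed)
    names = sorted(presence)
    totals = [[0, 0] for _ in names]
    for _ in range(n_permutations):
        order = names[:]
        rng.shuffle(order)
        for k, (n_seen, n_shared) in enumerate(accumulation_curve(presence, order)):
            totals[k][0] += n_seen
            totals[k][1] += n_shared
    return [(s / n_permutations, c / n_permutations) for s, c in totals]


# Toy presence sets (hypothetical variant IDs).
toy = {"g1": {"v1", "v2"}, "g2": {"v2", "v3"}, "g3": {"v2", "v4", "v5"}}
print(accumulation_curve(toy, ["g1", "g2", "g3"]))
print(mean_accumulation(toy))
```

A union curve that keeps rising indicates that new genomes still contribute unseen variants, whereas a plateau suggests the sampled accessions capture most of the variation.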

Extended Data Fig. 5 A reciprocal translocation contributes to the diversity of upland cotton.

a, Chromosome collinearity analysis between A03 and A09 among four AD₁ accessions and six sister species (AD₂–AD₇). The A03–A09 translocation is observed only in semi-wild/wild AD₁ accessions. b, Validation of the A03–A09 translocation using Hi-C interaction matrices. Red boxes highlight chromosomes A03 and A09, with arrows showing the interactions between them. c, Phylogenetic (left) and principal component (right) analyses of the translocation distribution, based on a previously reported study of wild, landrace 1 and landrace 2 populations of upland cotton (ref. 5). The map shows the geographic distribution of the TRANS and non-TRANS populations. L1, landrace 1; L2, landrace 2; W, wild. The map was generated in R using the ggplot2 and maps packages.

Extended Data Fig. 6 Pan-NLRs analysis based on the 107 assemblies.

a, Composition of the five subgroups in each accession. b, Total number of NLRs, cluster NLRs and singletons in each accession (n = 107 samples). c, Comparison of paired NLRs between semi-wild/wild (n = 7) and cultivated accessions (n = 100). d, Comparison of cluster NLRs between semi-wild/wild (n = 7) and cultivated accessions (n = 100). P-value was assessed by a two-tailed Student’s t-test. Data are presented as mean values ± SD in panels b–d. e, Composition of the Pan-NLR. The histograms display the number of gene families in the 107 accessions, categorized by frequencies. The right panel is a zoomed-in view of the blue histograms in the left panel. f, Comparison of expression levels across core, softcore and dispensable NLRs. Gene numbers are shown below the x-axis. The P-values for Core_vs_Softcore, Core_vs_Dispensable, and Softcore_vs_Dispensable are 0.014, 0.034 and 0.89, respectively. g, Comparison of nucleotide diversity (π) across core, softcore and dispensable NLRs. Polymorphic nucleotide sites used for π evaluation are indicated below the x-axis. The P-values for Core_vs_Softcore, Core_vs_Dispensable, and Softcore_vs_Dispensable are 1×10⁻¹⁵, 1×10⁻¹⁵ and 3.2×10⁻⁵, respectively. Multiple comparisons were conducted using Tukey’s honestly significant difference test with α = 0.05 for panels f and g. Center line, median; box limits, first and third quartiles; whiskers, 1.5 × interquartile range. h, Frequency of structural variants (SVs) within each NLR cluster. i, Comparison of nucleotide diversity among core, softcore, and dispensable NLRs. j, Comparison of π-values of different domains between YER and other regions. k, Comparison of LRR diversity between YER and other regions across core, softcore and dispensable NLRs. P-values were evaluated by two-tailed Student’s t-tests, and data are presented as mean ± SD in panels i–k. Polymorphic nucleotide sites used for π evaluation are shown below the x-axis for panels i–k.

Extended Data Fig. 7 PAV-GWAS captures an 806-bp insertion associated with FS and SI.

a, Fiber strength (FS) and seed index (SI) are co-located on chromosome A07. The shaded box indicates the co-location signal. The horizontal lines represent the genome-wide significance threshold after Bonferroni correction (α = 1/n, where ‘n’ is the total number of independent SVs). Thresholds were −log(P) = 4.6 (419 accessions) and −log(P) = 4.5 (1226 accessions). b, Zoomed-in view of the signal interval identified by SNPs, InDels, and PAVs. Blue dots represent SNPs, green dots indicate InDels, and light coral dots are PAVs. A significance threshold of −log(P) = 4.6 for 419 accessions was used. c, Haplotype analysis at the SI locus. Three haplotypes were identified. d, An 806-bp insertion detected by PAV-GWAS in gene N244A07G024740 divides upland cotton into Hap. 1 and Hap. 2. A significance threshold of −log(P) = 4.6 for 419 accessions was used. e, Comparison of FS between accessions carrying Hap. 1 (n = 57 accessions) and Hap. 2 (n = 358 accessions). P-values were evaluated by two-tailed Student’s t-tests. Center line, median; box limits, first and third quartiles; whiskers, range from minimum to maximum. f, Accessions carrying SI3 exhibit higher expression levels and SI than those without any QTLs. The central lines indicate the mean value, whereas the lines above and below represent mean ± SD. Accessions for each category are shown above the x-axis. The number of different QTL combinations is shown above the x-axis. g, Overexpression of N244A07G024740 leads to increased seed size. The bar represents 1 cm. h–j, Comparison of SI (h), seed length (i), and seed width (j) among WT (n = 10 samples), OE-2 (n = 10 samples), OE-12 (n = 10 samples) and OE-13 (n = 10 samples). Multiple comparisons were conducted using Tukey’s honestly significant difference test with α = 0.05. Data are presented as the mean ± SD.
The P-values in panel h for WT_vs_OE-2, WT_vs_OE-12, WT_vs_OE-13, OE-2_vs_OE-12, OE-2_vs_OE-13 and OE-12_vs_OE-13 are 2×10⁻¹⁴, 5.3×10⁻¹³, 8.7×10⁻¹⁰, 1.6×10⁻⁵, 1×10⁻¹⁰ and 4.6×10⁻⁵, respectively. The P-values in panel i for WT_vs_OE-2, WT_vs_OE-12, WT_vs_OE-13, OE-2_vs_OE-12, OE-2_vs_OE-13 and OE-12_vs_OE-13 are 2.4×10⁻⁵, 5.1×10⁻⁶, 7.8×10⁻³, 0.59, 0.21 and 0.013, respectively. The P-values in panel j for WT_vs_OE-2, WT_vs_OE-12, WT_vs_OE-13, OE-2_vs_OE-12, OE-2_vs_OE-13 and OE-12_vs_OE-13 are 3.6×10⁻⁴, 9.9×10⁻⁴, 0.30, 0.98, 0.04 and 0.09, respectively.
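
The Bonferroni rule quoted in this legend (α = 1/n, with n the number of independent SVs) maps directly onto the plotted threshold. The sketch below shows that arithmetic only. It assumes the −log(P) axis is base 10, which the legend does not state, and the function names are hypothetical. The test counts it back-calculates are illustrative and are not values reported in the source.

```python
import math


def bonferroni_neg_log_threshold(n_tests: int) -> float:
    """Return -log10(alpha) for alpha = 1 / n_tests.

    Assumption: the -log(P) axis in the legend is base 10.
    """
    if n_tests < 1:
        raise ValueError("n_tests must be a positive integer")
    return math.log10(n_tests)  # equals -log10(1 / n_tests)


def implied_test_count(neg_log_threshold: float) -> int:
    """Back-calculate the approximate n implied by a plotted threshold."""
    return round(10 ** neg_log_threshold)


if __name__ == "__main__":
    # Thresholds quoted in the legend; the implied n values are estimates.
    for threshold in (4.5, 4.6):
        n = implied_test_count(threshold)
        print(f"-log10(P) = {threshold} -> n of roughly {n} independent tests")
```

Under the base-10 assumption, a threshold of 4.6 corresponds to roughly 4 × 10⁴ independent tests, and the slightly lower threshold of 4.5 for the 1226-accession panel implies a somewhat smaller n.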

Extended Data Fig. 8 A rare allele of CESA7 contributes to stronger fiber in lines derived from distant hybridization.

a, GWAS for fiber strength using PAVs with a minor allele frequency greater than 0.01. The horizontal lines represent the genome-wide significance threshold after Bonferroni correction (α = 1/n, where ‘n’ is the total number of independent SVs), with −log(P) = 4.7 for 419 accessions. b, Zoomed-in view of the signal on A07. A significance threshold of −log(P) = 4.7 for 419 accessions was used. c, Two PAVs located in N244A07G004420 (CESA7) cluster upland cotton into Hap. A and Hap. B. d, Comparison of fiber strength (FS) and fiber length (FL) between the two haplotypes across two panels. Hap. A consistently exhibits superior FS and FL compared to Hap. B. Panel information indicates that accessions in Hap. A are primarily from distant hybridization, predominantly found in the Suyuan and Suyou series resources. Center line, median; box limits, first and third quartiles; whiskers, range from minimum to maximum. The “n” represents the number of accessions with Hap. A and Hap. B. P-values were evaluated by two-tailed Student’s t-tests.

Extended Data Fig. 9 Impact of the A06 inversion on leaf trichome density and stem trichome phenotypes.

a, Leaf and stem trichome phenotyping. From left to right: low LTA, high LTA, low STA, and high STA. LTA refers to leaf trichome amount; STA refers to stem trichome amount. The bar represents 500 μm. b, Manhattan plots for GWAS of LTA and STA, with the associated genomic regions highlighted by a blue box. The horizontal lines represent the genome-wide significance threshold after Bonferroni correction (α = 1/n, where ‘n’ is the total number of independent SVs), with −log(P) = 4.6 for 419 accessions. c, Zoomed-in view of the genomic region on A06: the upper panel shows GWAS signals, the middle panel displays gene models, and the lower panel provides a detailed view of the candidate gene region. A significance threshold of −log(P) = 4.6 for 419 accessions was used. d, Haplotype comparison of the gene N244A06G021950, indicating amino acid variants. The numerical values above denote the positions of these variations. e, Haplotype effect on LTA, demonstrating differences associated with distinct haplotypes. f, Haplotype effect on FPKM (fragments per kilobase of transcript per million mapped reads), illustrating the gene expression levels of the two haplotypes. The numbers of accessions carrying INV2 or no-INV2 are shown below the x-axis for panels e and f. Center line, median; box limits, first and third quartiles; whiskers, range from minimum to maximum. The P-value was evaluated by Student’s t-test.

Extended Data Fig. 10 eGWAS analysis of flavonoid biosynthesis genes reveals co-location with GhTT2.

a, Grey lines represent trans-eQTL regulation, while red lines indicate local eQTL regulation. An asterisk marks genes (ANS or ANR) whose regulatory relationship with GhTT2 has been experimentally validated in a previous study. b, Comparison of expression levels between the two haplotypes of the cis-eQTL from GhTT2. “n” indicates sample size. Center line, median; box limits, first and third quartiles; whiskers, range from minimum to maximum. The P-value was calculated by a two-tailed Student’s t-test.

Supplementary information

Supplementary Information

Supplementary Figs. 1–33.

Reporting Summary

Peer Review File

Supplementary Tables

Supplementary Table 1. Detailed information of the germplasm used in this study.
Supplementary Table 2. Summary statistics of PacBio sequencing reads for the 107 accessions.
Supplementary Table 3. Summary of genomic landscape for the 107 assemblies.
Supplementary Table 4. Sequencing quality of 12 Hi-C libraries in this study.
Supplementary Table 5. The telomere assembly of each chromosome.
Supplementary Table 6. The centromere location of each chromosome.
Supplementary Table 7. Chi-squared tests for the Caribbean, South America, North America and Yucatan regions.
Supplementary Table 8. Introgression from G. barbadense in the 107 accessions.
Supplementary Table 9. Haplotype carried by the wild–semi-wild cotton.
Supplementary Table 10. Summary of the gene number of the five subgroups in the 107 assemblies.
Supplementary Table 11. QTLs identified by GWAS using SNP and PAV.
Supplementary Table 12. List of genes.
Supplementary Table 13. Summary of the length of the PAVs.
Supplementary Table 14. The primers used in this study.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yang, Z., Yang, Z., Gao, C. et al. Graph pan-genome illuminates evolutionary trajectories and agronomic trait architecture in allotetraploid cotton. Nat Genet 58, 218–229 (2026). https://doi.org/10.1038/s41588-025-02462-1

Received: 01 August 2024
Accepted: 21 November 2025
Published: 02 January 2026
Version of record: 02 January 2026
Issue date: January 2026
DOI: https://doi.org/10.1038/s41588-025-02462-1