Abstract
Upland cotton (Gossypium hirsutum), one of the world’s major fiber crops, faces challenges from the genetic homogeneity of modern varieties. Here we present 107 gold-standard genome assemblies spanning the wild-to-domesticated continuum, revealing six large-scale structural variations, including a chromosomal reciprocal translocation and five inversions tracing the evolutionary history of cultivated cotton in the Americas. This history also involved continuous introgression from Gossypium barbadense, shaping the genetic diversity of G. hirsutum landraces and cultivars. Leveraging the graph pan-genome, we capture the sequence and structural diversity of nucleotide-binding site–leucine-rich repeat genes, uncovering pathogen-driven selection signatures and loci associated with disease resistance. A presence–absence variation genome-wide association study (GWAS) identified previously overlooked loci for key fiber traits, complementing single-nucleotide polymorphism–GWAS findings. Additionally, we construct a detailed map of large inversions, offering insights into hybridization dynamics and strategies to mitigate linkage drag. This study enhances our understanding of cotton evolution and domestication while delivering a valuable resource to enhance breeding.
Data availability
All raw sequencing data generated in this paper have been deposited into NCBI as studies PRJNA1103937 and PRJNA1111598. Genome assemblies and annotations are available in two repositories: NCBI (under PRJNA1345676) and Zenodo (https://doi.org/10.5281/zenodo.15644881 and https://doi.org/10.5281/zenodo.17046697 (refs. 91,92)). Pan-gene, pan-genome and eQTL datasets are available at Zenodo via https://doi.org/10.5281/zenodo.15646851 (ref. 93).
Code availability
Custom scripts can be accessed via Zenodo at https://doi.org/10.5281/zenodo.15644881 (ref. 91). The software and tools used in this study are detailed in the Methods and Reporting Summary.
References
Wendel, J. F., Brubaker, C. L. & Percival, A. E. Genetic diversity in Gossypium hirsutum and the origin of upland cotton. Am. J. Bot. 79, 1291–1310 (1992).
Chen, Z. J. et al. Genomic diversifications of five Gossypium allopolyploid species and their impact on cotton improvement. Nat. Genet. 52, 525–533 (2020).
Viot, C. R. & Wendel, J. F. Evolution of the cotton genus, Gossypium, and its domestication in the Americas. Crit. Rev. Plant Sci. 42, 1–33 (2023).
Percy, R. G. & Wendel, J. F. Allozyme evidence for the origin and diversification of Gossypium barbadense L. Theor. Appl. Genet. 79, 529–542 (1990).
Yuan, D. et al. Parallel and intertwining threads of domestication in allopolyploid cotton. Adv. Sci. 8, 2003634 (2021).
He, S. et al. The genomic basis of geographic differentiation and fiber improvement in cultivated cotton. Nat. Genet. 53, 916–924 (2021).
Huang, G., Huang, J. Q., Chen, X. Y. & Zhu, Y. X. Recent advances and future perspectives in cotton research. Annu. Rev. Plant Biol. 72, 437–462 (2021).
Yang, Z., Qanmber, G., Wang, Z., Yang, Z. & Li, F. Gossypium genomics: trends, scope, and utilization for cotton improvement. Trends Plant Sci. 25, 488–500 (2020).
Ma, Z. et al. Resequencing a core collection of upland cotton identifies genomic variation and loci influencing fiber quality and yield. Nat. Genet. 50, 803–813 (2018).
Wang, M. et al. Asymmetric subgenome selection and cis-regulatory divergence during cotton domestication. Nat. Genet. 49, 579–587 (2017).
Li, F. et al. Genome sequence of cultivated Upland cotton (Gossypium hirsutum TM-1) provides insights into genome evolution. Nat. Biotechnol. 33, 524–530 (2015).
Hu, Y. et al. Gossypium barbadense and Gossypium hirsutum genomes provide insights into the origin and evolution of allotetraploid cotton. Nat. Genet. 51, 739–748 (2019).
Huang, G. et al. Genome sequence of Gossypium herbaceum and genome updates of Gossypium arboreum and Gossypium hirsutum provide insights into cotton A-genome evolution. Nat. Genet. 52, 516–524 (2020).
Yang, Z. et al. Extensive intraspecific gene order and gene structural variations in upland cotton cultivars. Nat. Commun. 10, 2989 (2019).
Ma, Z. et al. High-quality genome assembly and resequencing of modern cotton cultivars provide resources for crop improvement. Nat. Genet. 53, 1385–1391 (2021).
Yang, Z. et al. Recent progression and future perspectives in cotton genomic breeding. J. Integr. Plant Biol. 65, 548–569 (2023).
Sreedasyam, A. et al. Genome resources for three modern cotton lines guide future breeding efforts. Nat. Plants 10, 1039–1051 (2024).
Zhou, Y. et al. Graph pangenome captures missing heritability and empowers tomato breeding. Nature 606, 527–534 (2022).
Liu, Y. et al. Pan-genome of wild and cultivated soybeans. Cell 182, 162–176 (2020).
Jin, S. et al. Structural variation (SV)-based pan-genome and GWAS reveal the impacts of SVs on the speciation and diversification of allotetraploid cottons. Mol. Plant 16, 678–693 (2023).
Li, J. et al. Convergence and divergence of diploid and tetraploid cotton genomes. Nat. Genet. 56, 2562–2573 (2024).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Chang, X. et al. High-quality Gossypium hirsutum and Gossypium barbadense genome assemblies reveal the landscape and evolution of centromeres. Plant Commun. 5, 100722 (2024).
Wang, M. et al. Reference genome sequences of two cultivated allotetraploid cottons, Gossypium hirsutum and Gossypium barbadense. Nat. Genet. 51, 224–229 (2019).
Huang, G. et al. A telomere-to-telomere cotton genome assembly reveals centromere evolution and a Mutator transposon-linked module regulating embryo development. Nat. Genet. 56, 1953–1963 (2024).
Chen, J. et al. A complete telomere-to-telomere assembly of the maize genome. Nat. Genet. 55, 1221–1231 (2023).
Gan, Y. et al. Chromosomal locations of 5S and 45S rDNA in Gossypium genus and its phylogenetic implications revealed by FISH. PLoS ONE 8, e68207 (2013).
Cheng, Y. et al. Gossypium purpurascens genome provides insight into the origin and domestication of upland cotton. J. Adv. Res. 56, 15–29 (2024).
Li, X. et al. Large-scale gene expression alterations introduced by structural variation drive morphotype diversification in Brassica oleracea. Nat. Genet. 56, 517–529 (2024).
Morrell, P. L., Buckler, E. S. & Ross-Ibarra, J. Crop genomics: advances and applications. Nat. Rev. Genet. 13, 85–96 (2011).
Hickey, G. et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat. Biotechnol. 42, 663–673 (2024).
Yuan, Y., Bayer, P. E., Batley, J. & Edwards, D. Current status of structural variation studies in plants. Plant Biotechnol. J. 19, 2153–2163 (2021).
Reinisch, A. J. et al. A detailed RFLP map of cotton, Gossypium hirsutum × Gossypium barbadense: chromosome organization and evolution in a disomic polyploid genome. Genetics 138, 829–847 (1994).
Li, L. et al. Genomic analyses reveal the genetic basis of early maturity and identification of loci and candidate genes in upland cotton (Gossypium hirsutum L.). Plant Biotechnol. J. 19, 109–123 (2021).
Lee, C. R. et al. Young inversion with multiple linked QTLs under selection in a hybrid zone. Nat. Ecol. Evol. 1, 119 (2017).
Wendel, J. F., Brubaker, C. L. & Seelanan, T. in Physiology of Cotton (eds Stewart, J. M. C. D. et al.) 1–18 (Springer, 2010).
Edelman, N. B. & Mallet, J. Prevalence and adaptive impact of introgression. Annu. Rev. Genet. 55, 265–283 (2021).
Wang, N. et al. High-resolution sequencing of nine elite upland cotton cultivars uncovers genic variations and breeding improvement targets. Plant J. 113, 145–159 (2023).
Tang, D. et al. Genome evolution and diversity of wild and cultivated potatoes. Nature 606, 535–554 (2022).
Ma, S. et al. Direct pathogen-induced assembly of an NLR immune receptor complex to form a holoenzyme. Science 370, 3069 (2020).
Zhang, Y. et al. Genome-wide association analysis reveals a novel pathway mediated by a dual-TIR domain protein for pathogen resistance in cotton. Genome Biol. 24, 111 (2023).
Zhang, Y. et al. Promotion of apoplastic oxidative burst by artificially selected GhCBSX3A enhances Verticillium dahliae resistance in upland cotton. Plant J. 118, 2154–2168 (2024).
Van de Weyer, A. L. et al. A species-wide inventory of NLR genes and alleles in Arabidopsis thaliana. Cell 178, 1260–1272 (2019).
Yanagisawa, M. et al. Patterning mechanisms of cytoskeletal and cell wall systems during leaf trichome morphogenesis. Nat. Plants 1, 15014 (2015).
Ding, M. et al. The hairless stem phenotype of cotton (Gossypium barbadense) is linked to a Copia-like retrotransposon insertion in a homeodomain-leucine zipper gene (HD1). Genetics 201, 143–154 (2015).
Yuan, R. et al. Differentiation in the genetic basis of stem trichome development between cultivated tetraploid cotton species. BMC Plant Biol. 21, 115 (2021).
Hinchliffe, D. J. et al. The GhTT2_A07 gene is linked to the brown colour and natural flame retardancy phenotypes of Lc1 cotton (Gossypium hirsutum L.) fibres. J. Exp. Bot. 67, 5461–5471 (2016).
Wen, T. et al. Linkage and association mapping reveals the genetic basis of brown fibre (Gossypium hirsutum). Plant Biotechnol. J. 16, 1654–1666 (2018).
Yan, Q. et al. Up-regulation of GhTT2-3A in cotton fibres during secondary wall thickening results in brown fibres with improved quality. Plant Biotechnol. J. 16, 1735–1747 (2018).
Walkowiak, S. et al. Multiple wheat genomes reveal global variation in modern breeding. Nature 588, 277–283 (2020).
Wellenreuther, M. & Bernatchez, L. Eco-evolutionary genomics of chromosomal inversions. Trends Ecol. Evol. 33, 427–440 (2018).
Lowry, D. B. & Willis, J. H. A widespread chromosomal inversion polymorphism contributes to a major life-history transition, local adaptation, and reproductive isolation. PLoS Biol. 8, e1000500 (2010).
Shang, L. et al. A super pan-genomic landscape of rice. Cell Res. 32, 878–896 (2022).
Du, X. et al. Resequencing of 243 diploid cotton accessions based on an updated A genome identifies the genetic basis of key agronomic traits. Nat. Genet. 50, 796–802 (2018).
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26, 1641–1650 (2009).
Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31, 1119–1125 (2013).
Seppey, M., Manni, M. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness. Methods Mol. Biol. 1962, 227–245 (2019).
Ou, S., Chen, J. & Jiang, N. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. 46, e126 (2018).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).
Li, K., Xu, P., Wang, J., Yi, X. & Jiao, Y. Identification of errors in draft genome assemblies at single-nucleotide resolution for quality assessment and improvement. Nat. Commun. 14, 6556 (2023).
Han, Y. & Wessler, S. R. MITE-Hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Res. 38, e199 (2010).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268 (2007).
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 18 (2008).
Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics 4, 4.10.1–4.10.14 (2009).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
Beier, S., Thiel, T., Munch, T., Scholz, U. & Mascher, M. MISA-web: a web server for microsatellite prediction. Bioinformatics 33, 2583–2585 (2017).
Stanke, M. & Waack, S. Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19, 215–225 (2003).
Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
Burge, C. & Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94 (1997).
Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004).
Keilwagen, J., Hartung, F. & Grau, J. GeMoMa: homology-based gene prediction utilizing intron position conservation and RNA-seq data. Methods Mol. Biol. 1962, 161–177 (2019).
Pertea, M., Kim, D., Pertea, G. M., Leek, J. T. & Salzberg, S. L. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat. Protoc. 11, 1650–1667 (2016).
Besemer, J., Lomsadze, A. & Borodovsky, M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 29, 2607–2618 (2001).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 8, 1494–1512 (2013).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, R7 (2008).
Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 238 (2019).
Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
Hu, G. et al. A telomere-to-telomere genome assembly of cotton provides insights into centromere evolution and short-season adaptation. Nat. Genet. 57, 1031–1043 (2025).
Li, P. et al. RGAugury: a pipeline for genome-wide prediction of resistance gene analogs (RGAs) in plants. BMC Genomics 17, 852 (2016).
Saucet, S. B. et al. Two linked pairs of Arabidopsis TNL resistance genes independently confer recognition of bacterial effector AvrRps4. Nat. Commun. 6, 6338 (2015).
Marcais, G. et al. MUMmer4: a fast and versatile genome alignment system. PLoS Comput. Biol. 14, e1005944 (2018).
Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44, 821–824 (2012).
Li, M. X., Yeung, J. M., Cherny, S. S. & Sham, P. C. Evaluating the effective numbers of independent tests and significant p-value thresholds in commercial genotyping arrays and public imputation reference datasets. Hum. Genet. 131, 747–756 (2012).
Yang, L. et al. Regulatory networks of coresident subgenomes during rapid fiber cell elongation in upland cotton. Plant Commun. 5, 101130 (2024).
Ge, X. et al. Efficient genotype-independent cotton genetic transformation and genome editing. J. Integr. Plant Biol. 65, 907–917 (2023).
Zhang, T., Liu, G., Zhao, H., Braz, G. T. & Jiang, J. Chorus2: design of genome-scale oligonucleotide-based probes for fluorescence in situ hybridization. Plant Biotechnol. J. 19, 1967–1978 (2021).
Lindgren, V. V. In situ hybridization protocols. Arch. Pathol. Lab. Med. 124, 1387–1388 (2000).
Huang, Y. et al. The formation and evolution of centromeric satellite repeats in Saccharum species. Plant J. 106, 616–629 (2021).
Zhang, M. & Gao, C. The pan-genome annotations and assemblies for cotton. Zenodo https://doi.org/10.5281/zenodo.15644881 (2025).
Gao, C. The pan-genome annotations, assemblies and scripts for cotton. Zenodo https://doi.org/10.5281/zenodo.17046697 (2025).
Gao, C. The pan-gene, pangenome, and eQTL dataset for cotton. Zenodo https://doi.org/10.5281/zenodo.15646851 (2025).
Acknowledgements
This study was supported by the National Key Research and Development Program of China (2022YFF1001400 and 2021YFF1000102-1), the Natural Science Foundation of Henan (232300421010), the Key Research and Development Project of Henan Province (231111110400), the National Natural Science Foundation of China (32472146), Xinjiang Science and Technology Major Project of China (2024A02002), Innovation Program of the Chinese Academy of Agricultural Sciences (CAASASTIP-IVFCAAS) and the Fundamental Research Funds of State Key Laboratory of Cotton Biology (CB2024C06).
Author information
Contributions
Zhaoen Yang, X.G. and F.L. conceived and designed the research. X.G., Zhaoen Yang and Zuoren Yang managed the project. Zhaoen Yang, R.L., Z.W., X.M., C.G. and H.Z. prepared the samples. Y.Z. and Z.Z. performed phenotyping. Zhaoen Yang, C.G., M.Z., L.Y., B.G. and X.L. performed data analysis. Y.Z. and M.M. designed and performed the molecular experiments. Zhaoen Yang, C.G. and M.Z. prepared the figures and tables. Zhaoen Yang, F.L., X.G., G.H. and J.F.W. wrote and revised the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Genetics thanks Michael Bevan, Tianzhen Zhang and Yuxian Zhu for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Genomic landscapes of the 107 accessions.
a, Phylogenetic analysis of 3256 accessions collected worldwide. Purple and orange lines represent the 7 semi-wild and 100 improved upland cotton accessions, respectively, selected for de novo assembly. Blue lines represent the G. barbadense accessions used as the outgroup. b, Genomic landscape and assembly quality evaluation of the 107 accessions, including genome assembly size, contig N50 length, LAI, percentage of genic regions, exons, introns, repeats, LTRs, Gypsy LTRs, Copia LTRs, SINEs + LINEs and complete BUSCO genes. The phylogenetic tree in the left panel was inferred from PAVs and rooted with the wild accession TX1236 (= N301 here).
Extended Data Fig. 2 Characterization of 5S rDNA and 45S rDNA.
a-b, Schematic representation of 5S rDNA and 45S rDNA in upland cotton. c, Histograms showing the copy number of 5S rDNA, 18S rDNA, 5.8S rDNA, and 28S rDNA for each accession. d-e, Illustrations of 5S rDNA locations on chromosomes A09 (d) and D09 (e). The upper panel shows the long terminal repeats (LTRs) surrounding the 5S rDNA region, while the lower panel displays the reads mapping across the 5S rDNA locus. f, Depiction of a 45S rDNA locus on chromosome D07. The blue boxes in the upper panel highlight the rDNA loci, while the lower panel shows the reads mapping across the 45S rDNA region. The color bars in the upper panel represent the LTRs. g-h, Diversity analysis of 5S rDNA from A09 of N200 (g) and 45S rDNA from D07 of N302 (h). The 5S rDNA or 45S rDNA sequences were clustered with CD-HIT so that sequences within the same cluster are completely identical; each such cluster is referred to as a feature sequence. The frequency of each feature sequence was calculated relative to all sequences. Red lines denote the Levenshtein distance between a feature sequence and the most abundant feature sequence. Blue bars indicate frequency values. The right y-axis measures Levenshtein distance, while the left y-axis measures the frequency of each feature sequence. Levenshtein distance, commonly known as edit distance, is the minimum number of edit operations required to transform one string into another.
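The feature-sequence frequencies and Levenshtein distances described for panels g-h can be illustrated with a minimal Python sketch. It assumes the rDNA copies are already available as plain strings and groups exact duplicates with a `Counter` as a stand-in for the CD-HIT clustering step; the function names and the toy input are hypothetical and are not taken from the authors' scripts.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def feature_sequence_profile(copies):
    """Return (sequence, frequency, distance to the most abundant sequence)
    for every distinct sequence, most abundant first.

    Assumption: identical copies form one 'feature sequence'; this simple
    grouping replaces the CD-HIT step named in the caption.
    """
    counts = Counter(copies)
    total = sum(counts.values())
    reference = counts.most_common(1)[0][0]
    return [(seq, n / total, levenshtein(seq, reference))
            for seq, n in counts.most_common()]


# Toy input (hypothetical sequences, not real rDNA copies)
profile = feature_sequence_profile(["ACGT", "ACGT", "ACGA", "ACG"])
```

Each tuple in `profile` corresponds to one blue bar (frequency) and one point on the red line (distance) in the kind of plot the caption describes.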
Extended Data Fig. 3 Pan and core genome analysis of 107 upland cotton accessions.
a, Variation of gene families in the pan-genome and core genome with the addition of each cotton genome. b, Composition of the pan-genome. The histograms display the number of gene families in the 107 accessions with varying frequencies. The pie chart indicates the proportion of gene families in each category. c, Presence and absence information of pan gene families across the 107 cotton genomes. d, Number of genes in each category in individual genomes. e, Comparison of nucleotide diversity (π) across core, softcore and dispensable genes. π values were evaluated in 5-kb windows. The bin numbers for each category are shown below the x-axis. The P-values for Core_vs_Softcore, Core_vs_Dispensable, and Softcore_vs_Dispensable are 0.0625, 1 × 10⁻¹⁵ and 1 × 10⁻¹⁵, respectively. f, Comparison of dN/dS ratios across core, softcore and dispensable genes, where dN represents the number of non-synonymous SNP mutations and dS represents the number of synonymous SNP mutations. The P-values for Core_vs_Softcore, Core_vs_Dispensable, and Softcore_vs_Dispensable are 2.4 × 10⁻¹⁴, 1 × 10⁻¹⁵ and 1.9 × 10⁻¹⁴, respectively. g, Comparison of gene expression across core, softcore and dispensable genes. Number n indicates gene number within each category. The P-values for Core_vs_Softcore, Core_vs_Dispensable, and Softcore_vs_Dispensable are 1 × 10⁻¹⁵, 1 × 10⁻¹⁵ and 1 × 10⁻¹⁵, respectively. Multiple comparisons were conducted using Tukey’s Honestly Significant Difference test with α = 0.05 for panels e-g. Center line, median; box limits, first and third quartiles; whiskers, 1.5 × interquartile range.
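Panel a follows how the pan-genome (the union of gene families) and the core genome (families shared by every genome added so far) change as genomes are added. The following Python sketch shows that accumulation under simple assumptions: gene families are given as one set per accession, and a single fixed order of addition is used. The function name and the toy data are hypothetical; the caption does not state how the authors ordered or resampled the genomes.

```python
def pan_core_curve(families_by_genome, order=None):
    """Return a list of (pan_size, core_size) after each genome is added.

    families_by_genome: dict mapping accession name -> iterable of gene
    family identifiers present in that accession.
    order: optional explicit order of accessions (assumption: if omitted,
    the dict's insertion order is used).
    """
    names = list(order) if order is not None else list(families_by_genome)
    pan = set()
    core = None
    curve = []
    for name in names:
        families = set(families_by_genome[name])
        pan |= families                                   # union grows
        core = families if core is None else core & families  # intersection shrinks
        curve.append((len(pan), len(core)))
    return curve


# Toy example with three hypothetical accessions
toy = {
    "g1": {"famA", "famB", "famC"},
    "g2": {"famA", "famB", "famD"},
    "g3": {"famA", "famE"},
}
curve = pan_core_curve(toy)
```

In practice the curve depends on the order in which genomes are added, so repeating the calculation over several shuffled orders and summarizing the results is a common way to smooth it.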
Extended Data Fig. 4 Graph-based pan-genome construction.
a, Variation in the number of inversions (INV), translocations (TRANS), duplications (DUP) and copy number variations (CNV) with the addition of cotton genomes. b, An example of the graph-based pan-genome for chromosome A07 within the 127895 to 128015 bp region. c, Frequency of combined PAVs of varying lengths. The original PAVs in the pan-genome were combined based on their locations on reference genome N244. d, Increase in PAV number with the addition of cotton genomes. e, Number of different polymorphic PAVs. f, Composition of PAVs in semi-wild/wild and improved lines. g, Comparison of insertions (INS) and deletions (DEL) across the 26 chromosomes, partitioned by subgenome. h, Comparison of expression levels between genes with INS and DEL. Based on PAV location, PAV-associated genes were categorized into four groups: exon, intron, 5 kb upstream, and 5 kb downstream. Center line, median; box limits, first and third quartiles; whiskers, 1.5 × interquartile range. P-values were evaluated by two-tailed Student’s t-tests.
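Panel c refers to PAVs that were combined according to their locations on the N244 reference. The caption does not give the exact rule, so the Python sketch below assumes the simplest one: PAVs on the same chromosome whose reference intervals overlap, or lie within a hypothetical `max_gap` of each other, are merged into one interval. The function name, the tuple layout and the coordinates are illustrative only.

```python
def combine_pavs(pavs, max_gap=0):
    """Merge PAV intervals that overlap (or are within max_gap bp) on the
    same reference chromosome.

    pavs: iterable of (chromosome, start, end) tuples in reference
    coordinates, with start <= end.
    Returns a list of merged (chromosome, start, end) tuples sorted by
    chromosome and start.
    """
    merged = []
    for chrom, start, end in sorted(pavs):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2] + max_gap:
            last[2] = max(last[2], end)   # extend the current merged PAV
        else:
            merged.append([chrom, start, end])
    return [tuple(interval) for interval in merged]


# Hypothetical coordinates for illustration
combined = combine_pavs([
    ("A07", 150, 300),
    ("A07", 100, 200),
    ("A07", 400, 450),
    ("D07", 100, 120),
])
lengths = [end - start for _, start, end in combined]
```

The `lengths` list is the kind of input from which a length-frequency histogram such as panel c could be drawn.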
Extended Data Fig. 5 A reciprocal translocation contributes to the diversity of upland cotton.
a, Chromosome collinearity analysis between A03 and A09 among four AD1 accessions and six sister species (AD2-AD7). The A03-A09 translocation is observed only in semi-wild/wild AD1 accessions. b, Validation of the A03-A09 translocation using Hi-C interaction matrices. Red boxes highlight chromosomes A03 and A09, with arrows showing the interactions between them. c, Phylogenetic (left) and principal component analyses (right) of the translocation distribution, based on a previously reported study of wild, landrace 1 and landrace 2 populations of upland cotton (ref. 5). The map shows the geographic distribution of the TRANS and non-TRANS populations. L1: Landrace 1, L2: Landrace 2, W: Wild. The map was generated in R using the ggplot2 and maps packages.
Extended Data Fig. 6 Pan-NLRs analysis based on the 107 assemblies.
a, Composition of the five subgroups in each accession. b, Total number of NLRs, cluster NLRs and singletons in each accession (n = 107 samples). c, Comparison of paired NLRs between semi-wild/wild (n = 7) and cultivated accessions (n = 100). d, Comparison of cluster NLRs between semi-wild/wild (n = 7) and cultivated accessions (n = 100). P-values were assessed by a two-tailed Student’s t-test. Data are presented as mean values ± SD in panels b-d. e, Composition of the pan-NLR. The histograms display the number of gene families in the 107 accessions, categorized by frequencies. The right panel is a zoomed-in view of the blue histograms in the left panel. f, Comparison of expression levels across core, softcore and dispensable NLRs. Gene numbers are shown below the x-axis. The P-values for Core_vs_Softcore, Core_vs_Dispensable, and Softcore_vs_Dispensable are 0.014, 0.034 and 0.89, respectively. g, Comparison of nucleotide diversity (π) across core, softcore and dispensable NLRs. Polymorphic nucleotide sites used for π evaluation are indicated below the x-axis. The P-values for Core_vs_Softcore, Core_vs_Dispensable, and Softcore_vs_Dispensable are 1 × 10⁻¹⁵, 1 × 10⁻¹⁵ and 3.2 × 10⁻⁵, respectively. Multiple comparisons were conducted using Tukey’s Honestly Significant Difference test with α = 0.05 for panels f and g. Center line, median; box limits, first and third quartiles; whiskers, 1.5 × interquartile range. h, Frequency of structural variants (SVs) within each NLR cluster. i, Comparison of nucleotide diversity among core, softcore, and dispensable NLRs. j, Comparison of π-values of different domains between YER and other regions. k, Comparison of LRR diversity between YER and other regions across core, softcore and dispensable NLRs. P-values were evaluated by two-tailed Student’s t-tests, and data are presented as mean ± SD in panels i-k. Polymorphic nucleotide sites used for π evaluation are shown below the x-axis for panels i-k.
Extended Data Fig. 7 PAV-GWAS captures an 806bp-insertion associated with FS and SI.
a, Fiber strength (FS) and seed index (SI) are co-located on chromosome A07. The shaded box indicates the co-location signal. The horizontal lines represent the genome-wide significance threshold after Bonferroni correction (α = 1/n, where ‘n’ is the total number of independent SVs). Thresholds were −log(P) = 4.6 (419 accessions) and −log(P) = 4.5 (1226 accessions). b, Zoomed-in view of the signal interval identified by SNPs, InDels, and PAVs. Blue dots represent SNPs, green dots indicate InDels, and light coral dots are PAVs. A significance threshold of −log(P) = 4.6 for 419 accessions was used. c, Haplotype analysis at the SI locus. Three haplotypes were identified. d, An 806-bp insertion detected by PAV-GWAS in gene N244A07G024740 divides upland cotton into Hap. 1 and Hap. 2. A significance threshold of −log(P) = 4.6 for 419 accessions was used. e, Comparison of FS between accessions carrying Hap. 1 (n = 57 accessions) and Hap. 2 (n = 358 accessions). P-values were evaluated by two-tailed Student’s t-tests. Center line, median; box limits, first and third quartiles; whiskers, range from minimum to maximum. f, Accessions carrying SI3 exhibit higher expression levels and SI compared to those without any QTLs. The central lines indicate the mean value, whereas the lines above and below represent mean ± SD. Accessions for each category are shown above the x-axis. The number of different QTL combinations is shown above the x-axis. g, Overexpression of N244A07G024740 leads to increased seed size. The bar represents 1 cm. h-j, Comparison of SI (h), seed length (i), and seed width (j) among WT (n = 10 samples), OE-2 (n = 10 samples), OE-12 (n = 10 samples) and OE-13 (n = 10 samples). Multiple comparisons were conducted using Tukey’s Honestly Significant Difference test with α = 0.05. Data are presented as the mean ± SD.
The P-values in panel h for WT_vs_OE-2, WT_vs_OE-12, WT_vs_OE-13, OE-2_vs_OE-12, OE-2_vs_OE-13 and OE-12_vs_OE-13 are 2 × 10⁻¹⁴, 5.3 × 10⁻¹³, 8.7 × 10⁻¹⁰, 1.6 × 10⁻⁵, 1 × 10⁻¹⁰ and 4.6 × 10⁻⁵, respectively. The P-values in panel i for WT_vs_OE-2, WT_vs_OE-12, WT_vs_OE-13, OE-2_vs_OE-12, OE-2_vs_OE-13 and OE-12_vs_OE-13 are 2.4 × 10⁻⁵, 5.1 × 10⁻⁶, 7.8 × 10⁻³, 0.59, 0.21 and 0.013, respectively. The P-values in panel j for WT_vs_OE-2, WT_vs_OE-12, WT_vs_OE-13, OE-2_vs_OE-12, OE-2_vs_OE-13 and OE-12_vs_OE-13 are 3.6 × 10⁻⁴, 9.9 × 10⁻⁴, 0.30, 0.98, 0.04 and 0.09, respectively.
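The Bonferroni threshold quoted in panel a (α = 1/n, with n the number of independent SVs) converts directly into the horizontal line on a Manhattan plot. The short Python sketch below shows that conversion and its inverse, assuming the "−log(P)" in the caption is a base-10 logarithm; the function names are hypothetical, and the implied test count is only what the reported threshold suggests under that assumption, not a figure stated in the source.

```python
import math


def bonferroni_threshold(n_independent_tests: float) -> float:
    """-log10 of the significance level alpha = 1 / n."""
    alpha = 1.0 / n_independent_tests
    return -math.log10(alpha)


def implied_test_count(threshold: float) -> float:
    """Number of independent tests n implied by a -log10(P) threshold,
    assuming alpha = 1 / n and a base-10 logarithm."""
    return 10 ** threshold


# Under these assumptions, a threshold of 4.6 corresponds to roughly
# forty thousand independent tests.
n_for_419_panel = implied_test_count(4.6)
threshold_check = bonferroni_threshold(n_for_419_panel)
```

Because the threshold equals log10(n), small differences such as 4.5 versus 4.6 between panels reflect different numbers of independent variants in each population.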
Extended Data Fig. 8 A rare allele of CESA7 contributes to stronger fiber in lines derived from distant hybridization.
a, GWAS for fiber strength using PAVs with a minor allele frequency greater than 0.01. The horizontal lines represent the genome-wide significance threshold after Bonferroni correction (α = 1/n, where ‘n’ is the total number of independent SVs), with −log(P) = 4.7 for 419 accessions. b, Zoomed-in view of the signal on A07. A significance threshold of −log(P) = 4.7 for 419 accessions was used. c, Two PAVs located in N244A07G004420 (CESA7) cluster upland cotton into Hap. A and Hap. B. d, Comparison of fiber strength (FS) and fiber length (FL) between the two haplotypes across two panels. Hap. A consistently exhibits superior FS and FL compared to Hap. B. Panel information indicates that accessions in Hap. A are primarily from distant hybridization, predominantly found in the Suyuan and Suyou series resources. Center line, median; box limits, first and third quartiles; whiskers, range from minimum to maximum. The “n” represents the number of accessions with Hap. A and Hap. B. P-values were evaluated by two-tailed Student’s t-tests.
Extended Data Fig. 9 Impact of the A06 inversion on leaf trichome density and stem trichome phenotypes.
a, Leaf and stem trichome phenotyping. From left to right: low LTA, high LTA, low STA, and high STA. LTA refers to leaf trichome amount; STA refers to stem trichome amount. The bar represents 500 μm. b, Manhattan plots for GWAS of LTA and STA, with the associated genomic regions highlighted by blue boxes. The horizontal lines represent the genome-wide significance threshold after Bonferroni correction (α = 1/n, where ‘n’ is the total number of independent SVs), with −log(P) = 4.6 for 419 accessions. c, Zoomed-in view of the genomic region on A07: the upper panel shows GWAS signals, the middle panel displays gene models, and the lower panel provides a detailed view of the candidate gene region. A significance threshold of −log(P) = 4.6 for 419 accessions was used. d, Haplotype comparison of the gene N244A06G021950, indicating amino acid variants. The numerical values above denote the positions of these variations. e, Haplotype effect on LTA, demonstrating differences associated with distinct haplotypes. f, Haplotype effect on FPKM (fragments per kilobase of transcript per million mapped reads), illustrating the gene expression levels of the two haplotypes. The numbers of accessions carrying INV2 or no-INV2 are shown below the x-axis for panels e and f. Center line, median; box limits, first and third quartiles; whiskers, range from minimum to maximum. The P-value was evaluated by Student’s t-test.
Extended Data Fig. 10 eGWAS analysis of flavonoid biosynthesis genes reveals co-location with GhTT2.
a, Grey lines represent trans-eQTL regulation, while red lines indicate local eQTL regulation. An asterisk indicates that the regulatory relationship between GhTT2 and the marked gene (ANS or ANR) has been experimentally validated in a previous study. b, Comparison of expression levels between the two haplotypes of the cis-eQTL from GhTT2. “n” indicates sample size. Center line, median; box limits, first and third quartiles; whiskers, range from minimum to maximum. The P-value was calculated by a two-tailed Student’s t-test.
Supplementary information
Supplementary Information
Supplementary Figs. 1–33.
Supplementary Tables
Supplementary Table 1. Detailed information of the germplasm used in this study.
Supplementary Table 2. Summary statistics of PacBio sequencing reads for the 107 accessions.
Supplementary Table 3. Summary of genomic landscape for the 107 assemblies.
Supplementary Table 4. Sequencing quality of 12 Hi-C libraries in this study.
Supplementary Table 5. The telomere assembly of each chromosome.
Supplementary Table 6. The centromere location of each chromosome.
Supplementary Table 7. Chi-squared tests for the Caribbean, South America, North America and Yucatan regions.
Supplementary Table 8. Introgression from G. barbadense in the 107 accessions.
Supplementary Table 9. Haplotype carried by the wild–semi-wild cotton.
Supplementary Table 10. Summary of the gene number of the five subgroups in the 107 assemblies.
Supplementary Table 11. QTLs identified by GWAS using SNP and PAV.
Supplementary Table 12. List of genes.
Supplementary Table 13. Summary of the length of the PAVs.
Supplementary Table 14. The primers used in this study.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, Z., Yang, Z., Gao, C. et al. Graph pan-genome illuminates evolutionary trajectories and agronomic trait architecture in allotetraploid cotton. Nat Genet 58, 218–229 (2026). https://doi.org/10.1038/s41588-025-02462-1
DOI: https://doi.org/10.1038/s41588-025-02462-1


