Abstract
The tea plant stands as a globally cherished nonalcoholic beverage crop, but the genetic underpinnings of important agronomic and metabolomic traits remain largely unexplored. Here we deep-resequenced 802 newly collected accessions of tea plants and their relatives from around the world. By integrating public Camellia accessions, we constructed a comprehensive genome-wide genetic variation map and annotated deleterious mutations for 1,325 accessions. Population genetic analyses provided insights into the genetic divergence of the tea plant from its relatives, different evolutionary bottlenecks, interspecific introgression and conservation of wild relatives. Our findings suggest the pivotal role of southwest China as the origin of tea plants, revealing the genetic diversity and domestication status of ancient tea plants. Genome-wide association studies herein identified thousands of substantial associations with leaf shape and metabolite traits, pinpointing candidate genes for crucial agronomic and flavor traits. This study illuminates the tea plant’s evolution and provides references for tea plant design breeding.
Main
As the world’s most popular nonalcoholic beverage, enjoyed daily by over 3 billion people in 160 countries, tea has left a substantial mark on the history and culture of numerous societies1,2,3. This widespread popularity likely stems from the health benefits associated with its unique chemical compounds, such as catechins, theanine and caffeine4,5. However, a crucial gap in our understanding lies in the genetic basis underlying the diverse and rich metabolic pathways encoded within the tea genome.
China, recognized as tea’s birthplace and domestication center, holds the richest tea plant (Camellia sinensis) germplasms and Theaceae species. Despite its importance, our understanding of the tea plant's origin, population differences, variety classification, selection genes and trait differentiation remains incomplete. Previous population analyses have been hampered in understanding these issues due to insufficient sampling (lack of ancient tea plant accessions), limited genetic data resolution (restricted nuclear markers) and the absence of reliable outgroups. For instance, the lack of ancient tea plant accessions and appropriate outgroup samples has likely led to erroneous inferences about tea plant origins6,7.
Precise and rapid tea plant breeding (Breeding 4.0) hinges on accumulating superior alleles, eliminating deleterious mutations and introducing beneficial exogenous genes from wild relatives. However, limited sample availability and phenome throughput have hampered the link between genetic variation and traits, hindering the identification of major genes or superior alleles responsible for essential traits, especially complex metabolomic traits. To our knowledge, multigenerational asexual reproduction in tea plants leads to the accumulation of harmful mutations5. But the exact quantity, distribution and impact of these mutations within the tea plant population remain unclear, hindering the ability to eliminate them through potato-like design breeding programs8,9. In addition, the weak reproductive isolation among Camellia species facilitates interspecific hybridization, resulting in interspecific hybrids or gene introgression into C. sinensis. Moreover, overlapping distribution areas of Camellia species and deliberate interspecific crosses further promote interspecific hybridization. These complicate tea plant genomic variation by introducing exogenous gene pools that can affect environmental adaptation and flavor formation. However, population analyses of interspecific populations are currently lacking, hindering our understanding of evolutionary divergence and gene flow between the tea plant and its relatives.
To address these gaps and propel tea plant breeding into ‘Breeding 4.0’, our study aggregated an unprecedented collection of 1,325 Camellia accessions from 14 major tea-producing countries. This collection spans wild, transitional, landrace and modern accessions, aiming to dissect the tea plant population genetic variation at a high-resolution scale. This includes exploring diversity, origins, variety classification, evolutionary history, selective sweeps, deleterious mutations and introgression. We also examined the population variation characteristics of metabolites and agronomic traits across hundreds of accessions, identifying key genes for essential flavor substances and agronomic traits. Our thorough analysis deepens the understanding of tea plant population differentiation and equips breeders with critical genomic data and gene targets for future breeding efforts.
Results
Genome-wide variation map
To fully address the evolutionary and genetic diversity, we performed deep resequencing on 802 new accessions (including 22 ancient tea plant accessions) of tea plants and their relatives, generating 37.79 terabytes (TB) of data (average of 14.96×/accession; Supplementary Table 1). By integrating these data with 523 publicly available deep resequencing datasets (>8× coverage)3,5,10,11,12, we used 1,325 Camellia accessions from 14 major tea-producing countries for our population genetic variation mapping (Fig. 1a and Supplementary Fig. 1). These comprised 1,251 C. sinensis accessions (870 C. sinensis var. sinensis (CSS), 356 C. sinensis var. assamica (CSA) and 25 C. sinensis var. pubilimba (CSP)) as well as 74 accessions of C. sinensis relative (CSR) species (40 Camellia taliensis (CSR.CT), 17 Camellia tachangensis var. remotiserrata (CSR.CTR), 15 Camellia quinquelocularis (CSR.CQ), 1 Camellia sasanqua (CSR.CS) and 1 Camellia oleifera (CSR.CO)). We identified 24,393,021 high-quality single-nucleotide polymorphisms (SNPs) and 1,079,254 insertions and deletions (InDels) by mapping reads to the Tieguanyin (TGY) genome5 (Supplementary Fig. 2 and Supplementary Tables 2 and 3). Of these, 7,366 SNPs and 6,425 InDels were large-effect variations, resulting in frameshift variations or start/stop codon changes (Supplementary Tables 4 and 5).
a, Worldwide distribution of all Camellia accessions. The depth of the green color of the circle represents sample numbers. b, The ML phylogenetic tree and ADMIXTURE ancestry proportions of all Camellia accessions were inferred using 93,212 SNPs at fourfold degenerate sites. C. sasanqua and C. oleifera in ML phylogenetic tree were employed as outgroup species. K means the number of ancestry kinship. Samples marked in red diamonds were 25 CSP accessions and samples marked with green circles were collected from foreign countries. C. quinquelocularis, C. tachangensis var. remotiserrata, C. taliensis were abbreviated as CSR.CQ (15 accessions), CSR.CTR (17 accessions) and CSR.CT (40 accessions), respectively. CS.CA (40 accessions), CSA (257 CSA.G1 and 59 CSA.G2 accessions) and CSS (66 CSS.G1, 141 CSS.G2, 367 CSS.G3 and 181 CSS.G4 accessions) represented the cultivated ancient C. sinensis, C. sinensis var. assamica, and C. sinensis var. sinensis, respectively. CSS.G1 x CSS.G2 (50 accessions), CSA.G1 x CSS.G3 (54 accessions) and CSS. Mixed (36 accessions) represented three intervarietal-derived subpopulations inferred based on ADMIXTURE ancestry proportions. c, The PCA plot of all Camellia accessions using the first two PCs. d, The ADMIXTURE ancestry proportions of all 25 CSP accessions (the red diamonds labeled sample in b). e, The PCA plot of tea plant accessions from different countries/regions using the first two PCs. China NYR, China SYR, China SC and China SWC means accessions from NYR, SYR, SC and SWC, respectively. f, The ADMIXTURE ancestry proportions of all foreign tea plant accessions (the green circles labeled sample in b). NYR, Northern Yangtze River; SYR, Southern Yangtze River; SC, South China; SWC, Southwest China.
Phylogenetic relationship and population structure
To investigate the genetic history and population structure, a subset of 93,212 fourfold degenerated SNPs was extracted to construct a maximum likelihood (ML) phylogenetic tree for 1,325 Camellia accessions, with CSR.CS and CSR.CO as outgroup species. The analysis revealed four distinct clades corresponding to CSR species, including CSR.CQ, CSR.CTR and CSR.CT, cultivated ancient C. sinensis (CS.CA), CSS, and CSA populations (Fig. 1b and Supplementary Table 1).
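As an illustration of what a fourfold degenerate site is, the following minimal sketch scans a coding sequence for third codon positions at which any nucleotide encodes the same amino acid under the standard genetic code. The function name and the example sequence are hypothetical, and this is not the pipeline used in the study; it assumes an in-frame coding sequence on the forward strand.

```python
# First two codon bases for which the third base never changes the amino acid
# under the standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_sites(cds: str) -> list[int]:
    """Return 0-based positions of fourfold degenerate sites in an in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    sites = []
    for start in range(0, usable, 3):
        if cds[start:start + 2] in FOURFOLD_PREFIXES:
            sites.append(start + 2)  # third position of this codon
    return sites


# Hypothetical toy sequence: Met-Ala-Lys-Leu-stop
example_sites = fourfold_sites("ATGGCTAAACTGTAA")
```

SNPs falling on such positions are synonymous by construction, which is why they are commonly treated as approximately neutral markers for phylogenetic and ancestry inference.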
ADMIXTURE analysis effectively separated CSS accessions (blue background) from CSA, CS.CA and CSR accessions at K = 2. At K = 3, CSA ancestry (red) became distinct from CS.CA and CSR accessions (Fig. 1b). At higher K values (4–7), further subdivision was observed within CSS (CSS.G1-CSS.G4) and CSA (CSA.G1 and CSA.G2) populations. Notably, the Fuyun family accessions (in CSA.G1 x CSS.G3 group), known hybrids derived from artificial crosses between CSA and CSS, have biparental genetic components that can be clearly highlighted based on their ancestry proportions at K = 3–7 (Fig. 1b and Supplementary Table 1). However, inferred ancestral proportions became less informative at K = 8 and 9, with Fuyun accessions containing unidentifiable ancestral components that contradicted breeding records. Collectively, the evolutionary tree and ADMIXTURE analysis supported the subdivision of C. sinensis into two CSA subpopulations, four CSS subpopulations and three intervarietal hybrid-derived subpopulations. In addition, we constructed an ML evolutionary tree and population structure based on the SNPs shared by all groups, and the results (Supplementary Figs. 3 and 4) highly supported the ML evolutionary tree and population structure results based on 93,212 fourfold degenerated SNPs.
Interestingly, the CSP, traditionally classified as a separate tea plant variety by morphology13, failed to show the expected segregation pattern. All 25 CSP accessions lacked a single, unified cluster in the phylogenetic tree and were instead interspersed among CSS and CSA accessions (Fig. 1b and Supplementary Fig. 5). They shared highly similar genetic backgrounds with neighboring accessions (Fig. 1b,d). In principal component analysis (PCA), CSP accessions did not segregate from CSA and CSS populations (Fig. 1c and Supplementary Fig. 6). These findings suggested that CSP may not warrant separate variety status, and its trichome trait may not be a reliable marker for defining distinct varieties. Consequently, CSP accessions were reclassified into CSS and CSA groups for subsequent analyses.
The CS.CA accessions represent ancient cultivated tea plants (semiwild type), some of which are thousands of years old1,3, including the ‘Jinxiu’ ancient tea plant, over 3200 years old. ADMIXTURE analysis revealed that CS.CA accessions shared ancestral components with CSR species (CSR.CT, CSR.CTR and CSR.CQ) and possessed genetic components of both CSS and CSA (Fig. 1b). Furthermore, PCA (second principal component (PC2)) positioned CS.CA accessions in a transitional region between CSR species and modern cultivated tea plants (Fig. 1c). These observations collectively suggested that CS.CA accessions represent an intermediate stage in the domestication process of tea plants, providing clear genetic evidence for tea plant domestication within China.
We further found that all foreign accessions displayed high genetic similarity to China accessions: they were difficult to separate from China accessions in the evolutionary tree and PCA (Fig. 1b,e and Supplementary Fig. 5) and lacked distinct ancestral components in the ADMIXTURE analysis (Fig. 1b,f). These foreign accessions clustered with China accessions from documented historical export regions. For instance, accessions from Southeast Asia (Vietnam and Laos) clustered with accessions from China’s southwestern regions (Fig. 1e). Importantly, CSR.CT, CSR.CTR, CSR.CQ, CS.CA and CSA at the base of the evolutionary tree were almost all from China’s southwest region (Yunnan, Sichuan and Guizhou provinces; Supplementary Fig. 7). Notably, tea plants belong to the genus Camellia, and nearly 85% of the Camellia species are densely distributed in China’s southwest region14,15. This observation aligns with the principle that a species and its relatives are most densely distributed in that species’ center of origin. Collectively, these results strongly support the conclusion that southwest China is the origin and domestication center of tea plants.
Genetic diversity and population history
To investigate genetic diversity between tea plant and CSR species, we analyzed heterozygosity, nucleotide diversity (π), population divergence (Fst) and identity-by-descent (IBD) values within and between groups (Fig. 2a,b and Supplementary Figs. 8 and 9) and found that C. sinensis exhibited elevated levels of heterozygosity and π values in comparison to all CSR species (Fig. 2a,b). Using SIFT4G16, we identified a substantial number (14,918–14,991) of deleterious mutations on 15 chromosomes across all groups (Supplementary Figs. 10 and 11), impacting 9,487–9,518 genes (Fig. 2c). Notably, the C. sinensis groups displayed a higher prevalence of deleterious mutations compared to CSR groups. Kyoto Encyclopedia of Genes and Genomes (KEGG) annotation further revealed that deleterious mutant genes were involved in environmental adaptation and in the synthesis of many secondary metabolites, including important flavor-related metabolites such as the diterpenoids (Fig. 2d), suggesting that deleterious mutations have important effects on stress adaptation and flavor formation in tea plants.
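The diversity statistics above can be illustrated with a minimal sketch for biallelic SNPs. The text does not state which estimators were used, so the choices here are assumptions: per-site π with the usual n/(n − 1) sample-size correction, and Hudson's ratio-of-averages estimator as a simple stand-in for Fst. Function names and inputs (alternate-allele counts per site and the number of sampled chromosomes) are hypothetical.

```python
def nucleotide_diversity(alt_counts, n_chrom, seq_len):
    """Mean pairwise differences per site (pi) from biallelic SNP allele counts."""
    total = 0.0
    for count in alt_counts:
        p = count / n_chrom
        # Expected heterozygosity with the n/(n-1) small-sample correction.
        total += 2 * p * (1 - p) * n_chrom / (n_chrom - 1)
    return total / seq_len


def hudson_fst(counts1, n1, counts2, n2):
    """Hudson's Fst between two groups, as a ratio of sums across sites."""
    num = den = 0.0
    for c1, c2 in zip(counts1, counts2):
        p1, p2 = c1 / n1, c2 / n2
        num += ((p1 - p2) ** 2
                - p1 * (1 - p1) / (n1 - 1)
                - p2 * (1 - p2) / (n2 - 1))
        den += p1 * (1 - p2) + p2 * (1 - p1)
    return num / den if den > 0 else float("nan")
```

With this estimator, a site fixed for different alleles in the two groups gives Fst = 1, whereas groups with identical allele frequencies give values at or slightly below zero because of the sampling correction.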
a, The heterozygosity of each group. b, Nucleotide diversity (π) of each group. c, Number of deleterious genes in C. sinensis and CSR species. d, The KEGG pathways of deleterious genes. e, The LD decay estimation of C. sinensis and CSR species. f, The historical effective population size (neff) of C. sinensis and CSR species. For a and b, n represents the sample size in each group. In the box plots, the upper and lower edges represent the 75% and 25% quartiles, the central line denotes the median, the whiskers extend to 1.5× the interquartile range and the points outside indicate the outliers.
The average genome-wide Fst between groups ranged from 0.05 to 0.45. CSR species had high pairwise Fst values (0.32–0.45) and low pairwise IBD values with cultivated C. sinensis (Supplementary Figs. 8 and 9), indicating substantial genetic divergence between CSR species and C. sinensis. Furthermore, we found that different groups exhibited different linkage disequilibrium (LD) decay distances (the distance at which r² falls to half of its maximum value), ranging from 273 bp in CSR.CT to 6,452 bp in CSS.G4 (Fig. 2e and Supplementary Table 6).
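The half-maximum LD decay distance described above can be read off a binned decay curve as in the following minimal sketch. It assumes the curve is already summarized as distance bins (sorted in increasing order) with a mean r² per bin; the function name and the toy numbers are hypothetical.

```python
def ld_half_decay_distance(distances_bp, mean_r2):
    """Return the first distance after the peak at which r2 <= half of max r2.

    distances_bp must be sorted in increasing order and aligned with mean_r2.
    Returns None if the curve never decays to half of its maximum.
    """
    peak = max(range(len(mean_r2)), key=mean_r2.__getitem__)
    half = mean_r2[peak] / 2
    for i in range(peak + 1, len(mean_r2)):
        if mean_r2[i] <= half:
            return distances_bp[i]
    return None


# Hypothetical toy decay curve.
decay = ld_half_decay_distance([100, 300, 1000, 5000], [0.6, 0.4, 0.25, 0.1])
```

The resolution of the result is limited by the bin spacing; interpolating between the two bins that bracket the half-maximum value would give a finer estimate.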
To analyze the population histories of different groups, we estimated the effective population size (neff) for each group using SMC++ (ref. 17). All CSR groups exhibited a similar population history distinct from the seven C. sinensis groups (Fig. 2f). Their effective population history dates back approximately 300,000 years before present (YBP), including two bottleneck events at 300,000–130,000 YBP and 60,000–2,000 YBP. In contrast, during the period 300,000–2,000 YBP, the effective population size of all C. sinensis groups continued to decline, and these groups also experienced an ancient bottleneck event around 1000,000 YBP (Fig. 2f). This suggested that C. sinensis may have diverged from the common Camellia ancestor earlier than CSR species, and that C. sinensis groups have undergone distinct evolutionary histories from the CSR species. Because crop domestication dates back only about 10,000 years18, the substantial divergence between C. sinensis and CSR species predated human intervention (Fig. 2f). Notably, C. sinensis underwent earlier and larger population expansion in the last few thousand years compared to CSR species (Fig. 2f). This suggested that human cultivation practices may have contributed to the expansion of C. sinensis populations, aligning with extensive historical records of tea plant usage in China19,20,21.
Interspecific introgression may contribute to flavor variation
The f-branch (fb) statistic based on triplet topologies revealed unequal introgression of CSR.CT with CSA and CSS and the probability of gene flow was higher between CSR.CT and CSS (average fb value = 0.4) than between CSR.CT and CSA (average fb value = 0.2; Fig. 3a). To further investigate this difference, we performed an additional analysis using the fd statistic on 20-kb nonoverlapping windows and found the higher proportion of windows/gene number with high fd values (≥0.2) in CSS.G4-CSR.CT compared to CSA.G1-CSR.CT, suggesting greater introgressions between CSR.CT and CSS.G4 relative to CSA.G1 (Fig. 3b,c and Supplementary Fig. 12a,b).
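For readers unfamiliar with the fd statistic, the sketch below computes it per 20-kb nonoverlapping window from derived-allele frequencies, following its commonly used definition: the observed ABBA minus BABA excess divided by the excess expected under complete introgression, where the donor frequency at each site is the larger of the P2 and P3 frequencies. The assignment of tea plant groups to P1, P2, P3 and the outgroup is not given in the text and is left to the caller; all function names are hypothetical.

```python
from collections import defaultdict


def abba_minus_baba(p1, p2, p3, po):
    """Sum over sites of ABBA - BABA from derived-allele frequencies."""
    return sum((1 - a) * b * c * (1 - o) - a * (1 - b) * c * (1 - o)
               for a, b, c, o in zip(p1, p2, p3, po))


def fd(p1, p2, p3, po):
    """fd for the topology (((P1, P2), P3), O)."""
    # Donor frequency per site: whichever of P2 and P3 is higher.
    donor = [max(b, c) for b, c in zip(p2, p3)]
    denom = abba_minus_baba(p1, donor, donor, po)
    return abba_minus_baba(p1, p2, p3, po) / denom if denom > 0 else float("nan")


def windowed_fd(positions, p1, p2, p3, po, window=20_000):
    """Group sites into nonoverlapping windows and compute fd in each."""
    bins = defaultdict(list)
    for i, pos in enumerate(positions):
        bins[pos // window * window].append(i)
    return {start: fd(*[[freqs[i] for i in idx] for freqs in (p1, p2, p3, po)])
            for start, idx in sorted(bins.items())}
```

Windows with few informative sites give unstable values, so in practice a minimum site count per window is usually required before a window is interpreted.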
a, The fb statistic indicated different introgression signals between groups. The value size of fd is indicated by the depth of the red color. b, Kernel density estimation of fd values in the CSA.G1 and CSS.G4 groups, using 20-kb windows. The presence of a green vertical line marked the top 5% fd regions for the CSA.G1 group, while a blue vertical line highlights the top 5% fd values for the CSS.G4 group. c, The percentage in regions within fd value bins for either CSA.G1 or CSS.G4 group. d, The KEGG pathway enrichment result for genes in the highest 1% fd regions of the CSS.G4-CSR.CT groups. e, A visualization of the region that has undergone introgression between CSR.CT and CSS.G4 group. The FLS gene was highlighted in yellow background. Fst, Dxy and π ratio were calculated per 20-kb window between the CSR.CT and CSS.G4 group. f, The phylogenetic tree of FLS gene; red, green and purple lines denoted the CSR.CT, CSR.CTR and CSS.G4 groups. FLS, flavonol synthase.
We identified 107 and 725 introgressed genes in the highest 1% fd regions for CSA.G1-CSR.CT and CSS.G4-CSR.CT, respectively, and KEGG annotation revealed that these interspecific introgressed genes were involved in the synthesis of terpenoids and flavonoids, suggesting that interspecific gene introgression may contribute to flavor variation in tea plants (Fig. 3d). For instance, one introgression region on chromosome 4 (chr4; 90.00–91.00 Mb) contained one FLS gene encoding a key enzyme in flavonoid biosynthesis that was associated with catechin content in tea leaves. This region exhibited low Fst and Dxy values (Fig. 3e), suggesting high genomic similarities between CSS and CSR.CT populations, which further supported the hypothesis of gene flow from CSR.CT to CSS. Also, the local phylogenetic tree of this introgression region delineated three distinct clades: Clade 1 consisting of CSR.CT and CSR.CTR, Clade 2 including CSR.CT and CSS, and Clade 3 comprising exclusively CSS.G4. Clade 2 strongly suggested introgression from CSR.CT into certain CSS.G4 accessions. Similarly, the gene for geranylgeranyl diphosphate synthase (GGPS), which functions in the synthesis of aromatic terpenoids, is located in an introgression region of chr6 (20.82–20.86 Mb), and population parameters and the evolutionary tree also showed that the CSR.CT-type GGPS gene introgressed into some CSS.G4 accessions (Supplementary Fig. 12c,d).
Selective sweep genes related to tea flavor and environmental adaptation
CS.CA accessions reflected the intermediate domestication state of the tea plant from wild to cultivated varieties, so that domestication history can be divided into two stages: early and recent domestication. We approximated the wild CSR as the wild ancestor of the tea plant, and the adequate number of CS.CA samples gave us the opportunity to characterize the differences in selection signals between different domestication stages. Based on 20-kb moving windows, we identified a total of 57.16 Mb of selective sweep regions that were codetected by Fst and cross-population composite likelihood ratio (XP-CLR; both in top 3%), covering 2,482 genes in CSR/CS.CA (early domestication), and 43.14–70.64 Mb of selective sweep regions containing 1,533–2,482 genes in CS.CA/CSA.G1-CSS.G4 (recent domestication; Fig. 4a,b and Supplementary Table 7). The UpSet analysis revealed that selective sweep genes shared by different groups accounted for only a very small percentage, and group-specific selective sweep genes accounted for a high percentage (Fig. 4b). Although shared genes were rare, KEGG annotation revealed that selective gene sets in all groups contained a large number of genes related to flavor synthesis, leaf color and environmental adaptation. This interesting result highlighted that tea flavor, leaf color and environmental adaptation have been focal traits for human selection throughout the whole domestication and cultivar population differentiation process, but that different gene sets were selected during the population differentiation process (Fig. 4c). Compared to early domestication, some genes related to caffeine, unsaturated fatty acids and long-chain fatty acid synthesis were highlighted during the recent domestication phase, which may reflect the change in tea flavor preferences between modern and ancient people.
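The codetection criterion described above (windows in the top 3% for both Fst and XP-CLR) can be sketched as a rank-based intersection. The sketch assumes per-window scores keyed by a shared window identifier; the function names are hypothetical, and ties and missing windows are handled naively.

```python
def top_fraction(scores, fraction=0.03):
    """Return the window IDs whose scores rank in the top `fraction`."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def codetected_windows(fst, xpclr, fraction=0.03):
    """Windows that are outliers for both statistics."""
    return top_fraction(fst, fraction) & top_fraction(xpclr, fraction)
```

Requiring agreement between a differentiation-based statistic and a likelihood-based one is more conservative than either alone, because a window must be an outlier under both signals to be retained.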
a, The Manhattan plot of XP-CLR values for 20-kb moving windows across all chromosomes. The horizontal line represented the 97th percentile. The green dots above the horizontal line represented windows that can be codetected by the Fst and XP-CLR methods (both in top 3%). b, The UpSet plot of selective sweep genes in different groups. c, The KEGG pathway annotation of selective sweep genes in different groups.
Importantly, we found that selective sweep genes of all groups contained cloned functional genes involved in flavor substance synthesis, abiotic stress tolerance, growth and development, fertility and other processes, demonstrating the important roles of these selective sweep genes in the tea plant (Fig. 4a). For example, selective sweep genes in all groups contained UGT family members, including UGT84A22,23, CsUGT72AM1 (ref. 24) and UGT91C1 (ref. 25), which are closely associated with tea quality and abiotic stress tolerance26,27. In addition, we found selective sweep genes associated with dwarfing, including DWF3 (ref. 28), DWF4 (ref. 29), DWF5 (ref. 30) and DWARF27 (ref. 31), in all recent domesticated groups. Moreover, selective sweep genes in CS.CA/CSA.G2, CS.CA/CSS.G1 and CS.CA/CSS.G2 contained CsS-RNase genes related to self-incompatibility32, suggesting that recent domestication may affect tea plant fertility.
The development of leaf and internode-related genes
Leveraging the genome-wide variation map, we conducted a genome-wide association study (GWAS) for six key agronomic traits, namely, leaf length (LLe), leaf width (LWi) and leaf length–width ratio (LLWR) from mature leaves, as well as the weight of hundred buds (WHB), first internode length (IL1; including first bud), and second internode length (IL2) from young shoots (Fig. 5a). Pearson correlation analysis revealed strong positive correlations between IL1 and IL2, as well as between LLe and LWi (Supplementary Fig. 13). The GWAS identified 154, 23, 126 and 7 significant SNPs (sSNPs) associated with LLe, LWi, LLWR and WHB, respectively, overlapping with a total of 346 candidate genes (Fig. 5b and Supplementary Figs. 14 and 15).
a, Schematic of traits in young shoots. b, Genome-wide sSNPs (the lowest P value from MLM and FarmCPU was used for the Manhattan plot) associated with LLe, LWi, LLWR and WHB. The significance threshold is 1/total SNPs. c, Genome-wide sSNPs associated with IL1 (including first bud) and IL2. The significance threshold is 1/total SNPs. d, Manhattan plot of sSNPs associated with the WAK10 gene (CsTGY06G0001293). Nine sSNPs associated with WAK10 were highlighted by red dots (some red dots overlapped due to extremely close positions and the same P values). e, Box plots of IL1 trait of different haplotypes of the lead sSNP (chr6: 62,329,859) of the WAK10 gene. n represents the sample size in each haplotype. Significance (P value) of differences between haplotypes was determined by two-sided Student’s t test. The upper and lower edges of the box plots represent the first and third quartiles, the central line denotes the median and the whiskers extend to 1.5× the interquartile range. b–d, The horizontal dashed line represents the significance threshold (P < 5.08 × 10−8). The statistical analysis was performed with Bonferroni correction.
The Gene Ontology (GO) enrichment analysis revealed 14 genes involved in seven biological processes related to leaf development (Supplementary Table 8), implying the complexity of leaf development regulation. Beyond these GO-annotated leaf development genes, we identified CsTGY01G0000394, a homolog of Arabidopsis REVOLUTA gene, as a potential candidate affecting LLWR. Previous studies have demonstrated that mutations in REVOLUTA alter leaf morphology in Arabidopsis33,34. To investigate the role of this gene in tea plant leaf morphology, we analyzed LLWR values of different chr1: 22,487,501 genotypes (the sSNP of CsTGY01G0000394) among 139 additional accessions35. Different genotyped accessions exhibited significant differences in LLWR, suggesting that REVOLUTA also stably influences leaf morphology in different populations of tea plant (Supplementary Fig. 16).
We also identified 51 and 23 sSNPs for IL1 and IL2, respectively. Notably, the rice homologs of six tea plant genes linked to lead sSNPs have been experimentally confirmed to be involved in internode length development (Fig. 5c). For example, WAK10 encodes a receptor-like kinase, and wak10 mutants exhibit shortened internodes and decreased cell length-to-width ratios36. We found that nine IL1-associated sSNPs were located upstream of the WAK10 gene (CsTGY06G0001293; Fig. 5d) and that different haplotypes of the lead sSNP (chr6: 62,329,859) exhibited significant differences in IL1 (Fig. 5e).
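The haplotype comparison above relies on a two-sided Student's t test. As a minimal sketch, the pooled-variance t statistic and its degrees of freedom can be computed with the Python standard library alone; converting the statistic to a P value requires a t-distribution function, which a library routine such as `scipy.stats.ttest_ind` provides. The genotype labels and IL1 values below are hypothetical toy data, not measurements from this study.

```python
from math import sqrt
from statistics import mean, variance


def student_t(group_a, group_b):
    """Pooled-variance (Student's) t statistic and degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df


# Hypothetical IL1 measurements grouped by genotype at a lead SNP.
il1_by_genotype = {"AA": [2.1, 2.4, 2.2, 2.6], "GG": [3.0, 3.3, 2.9, 3.4]}
t_stat, dof = student_t(il1_by_genotype["AA"], il1_by_genotype["GG"])
```

Student's test assumes equal variances in the two groups; when haplotype classes differ strongly in size or spread, Welch's unequal-variance variant is a common alternative.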
Complex genetic basis of metabolite biosynthesis
To characterize population metabolite content variations, we performed untargeted metabolomics analysis using ultra-performance liquid chromatography-quadrupole time-of-flight mass spectrometry (UPLC-QTOF MS) on young leaves (one bud and two leaves) from 300 accessions representing 7 CSA, 13 CSS.G1, 59 CSS.G2, 147 CSS.G3 and 74 CSS.G4 accessions. After filtering and database matching, we annotated 643 and 919 metabolites belonging to 12 superclasses in positive and negative electrospray ionization (ESI) modes, respectively (Supplementary Tables 9 and 10 and Supplementary Fig. 17). Of these 1,562 annotated metabolites, 70% exhibited coefficients of variation greater than 0.5, indicating substantial variation among accessions (Supplementary Fig. 18). PCA of all metabolites, 11 superclass metabolites and flavonoids did not clearly differentiate accessions based on their group classification (Fig. 6a and Supplementary Figs. 19–21). However, the pairwise Student’s t tests between groups can identify 336 group-specific enriched metabolites, with high enrichment observed in one/two groups (Fig. 6b and Supplementary Fig. 22). These results suggested that while there is no significant differentiation between groups at the metabolomic superclass level, a small number of metabolites, including important flavor-related compounds such as catechins (high in CSS.G1 and CSS.G2) and caffeine (high in CSS.G1), exhibit significant content differences among groups and can serve as molecular markers for group distinction (Fig. 6c). This may reflect the relatively limited intensity of human selection on tea plants due to their long breeding cycles and self-incompatibility, resulting in subtle metabolomic alterations between groups rather than large-scale changes. 
Nevertheless, human selection appears to have influenced the differential distribution of important flavor-related metabolites among groups, supporting its role in shaping tea flavor and aligning with the suitability of different groups for producing distinct tea types.
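The coefficient-of-variation summary reported above (the share of metabolites with a coefficient of variation greater than 0.5) can be sketched as follows. The sketch assumes a table mapping each metabolite to its relative abundances across accessions and uses the sample standard deviation; both the function names and that choice are assumptions rather than details given in the text.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def fraction_highly_variable(table, cutoff=0.5):
    """Fraction of metabolites whose CV across accessions exceeds `cutoff`."""
    cvs = [coefficient_of_variation(v) for v in table.values()]
    return sum(cv > cutoff for cv in cvs) / len(cvs)
```

Because the coefficient of variation is scale-free, it allows metabolites with very different absolute signal intensities to be compared on the same footing.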
a, The PCA plot result of all accessions based on all 1,562 annotated metabolites. b, Relative contents of group-specific enriched metabolites among different groups. c, Content variability among groups for important flavor metabolites (catechins and caffeine) in tea leaves. n represents the sample size in each group. The upper and lower edges of the box plots represent the first and third quartiles, the central line denotes the median and the whiskers extend to 1.5× the interquartile range. The upper and lower edges of the violin plot represent the maximum and minimum values. d, Micro-synteny plot depicting copy number variation of CHS genes on CHS tandem region on chr9 among six C. sinensis accessions. e, Pan-genomic copy number variation of CHS genes on CHS tandem region on chr9. f, Genome-wide sSNPs associated with variation of (−)-epicatechin content. g,h, Box plots of (−)-epicatechin (g) and epicatechin (h) content of different haplotypes of the sSNP (chr7: 121,313,741) in bHLH162 gene. i, In validation population independent of this study, Box plot of (−)-epicatechin content of different haplotypes of the sSNP (chr7: 121,313,741) in bHLH162 gene. The horizontal dashed line represents the significant threshold (P < 5.24 × 10−8). The statistical analysis was performed with Bonferroni correction. g–i, Significance (P values) of differences between haplotypes was determined by two-sided Student’s t test. n represents the sample size in each group. In the box plots, the upper and lower edges represent the 75% and 25% quartiles, the central line denotes the median, the whiskers extend to 1.5× the interquartile range and the points outside indicate the outliers.
To identify important loci associated with metabolome variation, we further performed metabolite GWAS (mGWAS) on 1,562 metabolites in young leaves, identifying 135,176 nonredundant sSNPs with phenotypic variance explained greater than 10% for 1,523 metabolites (Supplementary Figs. 23 and 24 and Supplementary Table 11). Notably, 79.45% of metabolites (1,210/1,523) were associated with multiple (>5) sSNPs and 27.66% of multi-effect sSNPs (37,383/135,176) were linked to more than two metabolites (Supplementary Figs. 25 and 26), suggesting a complex genetic basis for population metabolomic variation. Visualization of genome-wide sSNP distribution using a 1-Mb sliding window with a 5-kb step size revealed nonuniform distribution across 15 chromosomes, identifying 298 super-class hotspots by combining windows containing over 50 sSNPs within the 1-Mb window (Supplementary Fig. 27 and Supplementary Table 12). For example, KEGG enrichment analysis of phenylpropanoids and polyketides hotspots revealed seven related genes involved in their biosynthesis, namely, CHI, Chalcone synthase (CHS), DFR, F5H, CCoAOMT and IF7MaT (Supplementary Table 12). Additionally, we identified many transcription factors (TFs) potentially affecting phenylpropanoid and polyketide biosynthesis, including the well-known MYB and bHLH TFs, which can either form MBW complexes with WD40 or independently regulate flavonoid synthesis37. Quantitative analysis of TFs across all hotspots revealed a relative abundance of MYB/MYB-related and FAR1 TFs, suggesting their important regulatory roles in tea plant metabolite biosynthesis (Supplementary Fig. 28).
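The hotspot scan described above can be sketched for a single chromosome as a sliding-window count followed by merging of qualifying windows. The window size, step and cutoff default to the values given in the text (1 Mb, 5 kb, more than 50 sSNPs); merging overlapping qualifying windows into one interval is an assumed reading of "combining windows", and the function name is hypothetical.

```python
from bisect import bisect_left


def find_hotspots(positions, chrom_len, window=1_000_000, step=5_000, min_snps=50):
    """Merge sliding windows [start, start + window) holding > min_snps SNPs."""
    positions = sorted(positions)
    hotspots = []
    last_start = max(chrom_len - window, 0)
    for start in range(0, last_start + 1, step):
        end = start + window
        # Count SNPs with start <= position < end via binary search.
        count = bisect_left(positions, end) - bisect_left(positions, start)
        if count > min_snps:
            if hotspots and start <= hotspots[-1][1]:
                hotspots[-1][1] = end  # overlaps the previous hotspot: extend it
            else:
                hotspots.append([start, end])
    return [tuple(h) for h in hotspots]
```

Because the step is much smaller than the window, a single dense cluster of sSNPs qualifies many consecutive windows, and merging collapses them into one reported hotspot.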
To dissect the genetic determinants of flavonoid biosynthesis, mGWAS identified 40,244 sSNPs associated with 208 flavonoids, encompassing flavonoid glycosides (128), flavans (31) and O-methylated flavonoids (20). The KEGG analysis of 3,482 sSNP-linked genes revealed 105 key genes enriched in the phenylpropanoid, flavonoid and isoflavonoid biosynthesis pathways and 663 genes coding TFs including FAR1, MYB and AP2/ERF-ERF (Supplementary Fig. 29). On chr9, we noticed a region containing three copies of the CHS gene in the TGY genome, enriched with numerous lead SNPs (Supplementary Fig. 29). This suggested that variations within this region substantially contribute to population flavonoid content variation. Further collinearity analysis of this region using the published pan-genome2 revealed substantial variation in CHS copy number, ranging from 0 to 6 copies across different accessions, exposing the underlying cause of flavonoid content variation in the tea plant population (Fig. 6d,e and Supplementary Fig. 30). (−)-Epicatechin content exhibited substantial variability in tea plants, with relatively higher levels observed in CSS.G2 accessions (Fig. 6c). The GWAS identified the following three TFs influencing (−)-epicatechin synthesis: MYB36 (CsTGY04G0001692), bHLH62 (CsTGY07G0001184) and NF-YB (CsTGY11G0000241; Fig. 6f and Supplementary Fig. 31). For instance, the sSNP (chr7: 121,313,741) within the bHLH62 gene divided the population into three haplotypes (CC, TC and TT), with CC accessions exhibiting substantially higher levels of (−)-epicatechin and epicatechin than TC and TT accessions (Fig. 6g,h). To validate the role of the bHLH62 gene in epicatechin synthesis, we used an additional population consisting of 92 accessions for chr7: 121,313,741 genotyping and (−)-epicatechin content determination35.
This result confirmed that accessions with different chr7: 121,313,741 phenotypes displayed similar differential characteristics as observed in our population (Fig. 6i), supporting bHLH62 as a regulator of epicatechin synthesis.
Discussion
This comprehensive study of genetic variation of the 1,325 Camellia accessions including all types (relatives, transitional, wild, landrace and elite) from 14 major tea-producing countries updates our knowledge on key issues such as variety classification, origin and domestication, population differentiation of tea plants and provides important targets for future precision breeding. For example, we recommend excluding CSP as a tea plant variety based on genetic analysis of a broader range of CSP accessions.
Although the direct ancestor of the tea plant is still unknown, the gradual evolutionary progression from wild ancestral type to transitional type to cultivated type is recognized12. However, several recent studies of tea plant populations have not adequately collected and categorized ancient tea plants more than 1,000 years old5,11,12,38. This study found that these ancient tea plants from southwest China were situated between CSR species and cultivated-type tea plants on the evolutionary tree and in the PCA result, reflecting a transitional type of tea plant, which is consistent with their great age and their combination of wild and cultivated physiological characteristics. Furthermore, with the broader collection of C. sinensis accessions, we found many genetically highly similar samples within China for each foreign accession, largely coinciding with China’s documented historical exports. These results strengthen the theory that the tea plant originated and was domesticated in southwest China. Importantly, ancient tea plants and CSR species are important gene pools for cultivated tea plants, but we noticed that the effective population sizes of CS.CA and CSR.CT have continued to decline over the past 10,000 years, which suggests that we should emphasize the conservation of ancient tea plants and CSR species while vigorously developing the tea industry.
In the domestication and improvement of species from wild to semiwild and then to cultivated plants, the genetic bottleneck caused by artificial selection pressure generally reduces the nucleotide diversity of the population and results in a longer LD distance39,40,41,42,43. However, we found two exceptions in the tea plant population: CSR.CQ and CSR.CTR, as wild species, had longer LD lengths than semiwild (CS.CA) and cultivated accessions; and CSS.G4, a cultivated subpopulation, combined nucleotide diversity comparable to that of CS.CA (semiwild type) with the longest LD length. The unusual LD pattern in tea plants may be attributed to factors including sample collection diversity, hybridization, introgression and propagation methods, which affect genome-wide LD length and nucleotide diversity more than purifying selection pressures from domestication, causing LD length and nucleotide diversity to deviate from the pattern expected for model-like plants44,45. Thus, the longer LD lengths of CSR.CQ and CSR.CTR could result from limited sample collection in genetically homogeneous regions, such as Guizhou’s mountainous forests. For CSS.G4, breeding history has a crucial role. This group primarily consists of Fujian Oolong tea accessions derived from a few backbone parents, leading to a homogeneous genetic background and longer LD lengths. Additionally, some CSS × CSA-derived progenies contained in CSS.G4 increase nucleotide diversity and influence LD patterns, similar to findings in human populations of mixed ancestry, which show longer LD lengths and higher nucleotide diversity45. Introgression between CSS.G4 and CSR.CT can also increase the nucleotide diversity of CSS.G4. These departures from the LD length and nucleotide diversity patterns of model-like plants may reflect the complex domestication and usage history of the tea plant.
Domestication, the process by which humans selectively breed organisms for desirable traits, has led to directed changes in agronomic and metabolite traits, as well as in the frequency of genetic variation among subpopulations. Previous studies have revealed large-scale changes in metabolite profiles between SNP-defined subpopulations during the domestication of crops like maize, jujube, tomato and millet39,40,41,46,47. In contrast, although SNP analysis can categorize tea plant accessions into distinct groups, metabolite profiles, even at the superclass level, do not show clear differentiation among these groups. This may suggest that, unlike other crops, tea plants have undergone relatively limited metabolite-level domestication. Despite the lack of large-scale metabolite alterations, our results indicated that domestication has substantially influenced tea plant flavor profiles and environmental adaptations. Differential selection signals during different domestication stages and the identification of characteristic flavor metabolites in each group support this conclusion. These selected genes and metabolites align with human preferences, further highlighting the role of domestication.
To propel tea plant breeding into ‘Breeding 4.0,’ this multi-omics study investigated potential effects on growth and flavor of deleterious mutations in the tea plant population, also identifying major-effect genes associated with agronomic traits, key flavor metabolites and environmental adaptation. This population-level information on favorable alleles and deleterious mutations allows us to breed superior tea plant varieties by design breeding in the future, similar to the potato breeding project8,9.
In conclusion, this large-scale resequencing dataset of tea plant accessions, together with the important genes and molecular markers identified here, can accelerate tea plant improvement, enhancing desirable traits and ultimately contributing to a more robust and flavorful crop.
Methods
Ethics statement
Ethics approval was not required for this research.
Sampling and whole-genome resequencing
A total of 802 Camellia accessions, including 11 CSR.CT, 18 CS.CA and 773 C. sinensis accessions, were collected from around the world (Supplementary Table 1), and the plants grew in the planting base at the Fujian tea germplasm resources depository in Fuan City (27°N, 119°E), Fujian Province or Yunnan Pu’er tea plantation in Pu’er City (22°N, 100°E), Yunnan Province. Fresh young leaves were frozen in liquid nitrogen and genomic DNA was extracted using the DNeasy Plant Mini Kit (Qiagen). A total of 5 μg of genomic DNA for each sample was used to construct libraries with an insert size of 500 bp. Each sample was sequenced with the paired-end 150 bp reads on the DNBSEQ-T7 platform (Beijing Genomics Institute).
In addition, deep resequencing data (sequencing depth >8× relative to the TGY genome) for 523 public Camellia accessions were downloaded from the National Genomics Data Center (https://ngdc.cncb.ac.cn/) and the National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov/) based on the sample IDs provided in publications (Supplementary Table 1)3,5,10,11,12. In total, 1,325 Camellia accessions including 1,251 C. sinensis accessions, 40 CSR.CT, 17 CSR.CTR, 15 CSR.CQ, 1 CSR.CS and 1 CSR.CO were used in the population analysis for this project. In this study, no statistical methods were used to predetermine sample sizes, but our sample sizes exceeded those of previous publications and encompassed their deep resequencing samples (>8×)3,5,10,11,12.
In this study, the sampling process for DNA sequencing and phenotype data collection was randomly conducted. Additionally, for phenotype data collection, all plants were exposed to the same growth condition and treatment.
Variant calling
All raw resequencing reads of the 1,325 Camellia accessions were filtered to remove sequencing adapters and low-quality bases using fastp (v.0.20.0)48 with default parameters, and all clean reads were then mapped onto the monoploid reference genome of TGY5 using BWA49 with default parameters. After read mapping, we used Picard (v1.117, http://broadinstitute.github.io/picard/) to remove PCR duplicates from the BAM files. Next, Samtools (v1.3.1)50 was used to convert the SAM files, sort the BAM files and filter by mapping quality with the ‘q30’ parameter. Then, the population SNPs and small InDels (1–10 bp) were identified using GATK (v.4.2.5.0, https://github.com/broadinstitute/gatk)51.
To obtain high-quality SNPs and InDels, the initial hard filtering on the population SNPs and InDels was performed using GATK (v.4.2.5.0) with the parameters of SNPs—quality by depth (QD) < 2.0 || mapping quality (MQ) < 40.0 || Fisher strand (FS) > 60.0 || strand odds ratio (SOR) > 3.0 || MQRankSum < −12.5 || ReadPosRankSum < −8.0 and with the parameters of InDels—QD < 2.0 || FS > 200.0 || SOR > 10.0 || MQRankSum < −12.5 || ReadPosRankSum < −8.0. Then, hard-filtered SNPs and InDels with high missing values were further filtered using Plink (v1.90b6.21)52 with parameters --maf 0.05 and --geno 0.1. Finally, we annotated the filtered SNPs and InDels using SnpEff (v.4.3t)53 based on the gene annotation file (gff3 file) of the monoploid TGY genome5.
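As an illustration of how the hard-filter expressions above act on a single variant, the following Python sketch evaluates the listed thresholds against a dictionary of INFO annotations. It is a hedged re-implementation for clarity, not GATK code: the function and table names are hypothetical, and skipping annotations that are absent from a record is an assumption of this sketch.

```python
import operator

# Thresholds copied from the filter expressions in the text; a variant is
# filtered out if ANY condition is true (the '||' in the expression).
THRESHOLDS = {
    "SNP": [
        ("QD", operator.lt, 2.0),
        ("MQ", operator.lt, 40.0),
        ("FS", operator.gt, 60.0),
        ("SOR", operator.gt, 3.0),
        ("MQRankSum", operator.lt, -12.5),
        ("ReadPosRankSum", operator.lt, -8.0),
    ],
    "INDEL": [
        ("QD", operator.lt, 2.0),
        ("FS", operator.gt, 200.0),
        ("SOR", operator.gt, 10.0),
        ("MQRankSum", operator.lt, -12.5),
        ("ReadPosRankSum", operator.lt, -8.0),
    ],
}

def fails_hard_filter(info, variant_type):
    """Return True if the variant would be removed by the hard filter.

    `info` maps annotation names to floats; annotations missing from the
    record are skipped (an assumption of this sketch).
    """
    return any(
        name in info and compare(info[name], cutoff)
        for name, compare, cutoff in THRESHOLDS[variant_type]
    )
```

The same FS value can therefore pass as an InDel but fail as a SNP, because the two variant classes use different cutoffs.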
Population structure analyses
The fourfold degenerate sites are more evolutionarily conserved54. Thus, we extracted 93,212 fourfold degenerated SNPs from 24,393,021 SNPs based on the SnpEff annotation result and then constructed an ML phylogenetic tree of these fourfold degenerated SNPs using IQ-TREE 2.0 (v2.3.6)55 with the GTR + ASC model and 1,000 nonparametric bootstrap replicates. To validate the fourfold degenerated SNP-based evolutionary tree, we also extracted 7,229,749 SNPs shared by all groups from the total vcf containing all the samples and further thinned the SNPs and samples by LD filtering (Plink --geno 0.1 --maf 0.01 --biallelic-only strict, --indep-pairwise 50 10 0.2) and intersample IBD values (the samples with IBD > 0.98, only one was randomly retained) yielding 542,918 LD-filtered SNPs from 1,195 accessions. We then constructed an additional ML phylogenetic tree based on 542,918 LD-filtered SNPs from 1,195 accessions using the same IQ-TREE 2.0 (v2.3.6) parameters—the GTR + ASC model and 1,000 nonparametric bootstrap replicates. Based on 93,212 fourfold degenerated SNPs and 542,918 LD-filtered SNPs from 1,325 accessions, the individual ancestry estimation of population was further analyzed by ADMIXTURE (v1.3.0)56 with K values ranging from 2 to 8. For the PCA, the top ten PCs of 93,212 fourfold degenerated SNPs from all accessions were extracted by Plink (v1.90b6.21)52 and the top three PCs were plotted using package ggplot2 (v.4.1.0)57.
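For readers unfamiliar with fourfold degenerate sites, the short Python sketch below shows how they can be recognized from a coding sequence: a third codon position is fourfold degenerate when all four possible bases at that position encode the same amino acid under the standard genetic code. In the study these sites were taken from the SnpEff annotation; the helper names here are hypothetical and the sketch assumes an in-frame sequence containing only A, C, G and T.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (third base varies fastest).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def is_fourfold_third_position(codon):
    """True if any base at the third position yields the same amino acid."""
    prefix = codon[:2].upper()
    return len({CODON_TABLE[prefix + base] for base in BASES}) == 1

def fourfold_sites(cds):
    """Return 0-based indices of fourfold degenerate sites in an in-frame CDS."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return [
        i + 2
        for i in range(0, usable, 3)
        if is_fourfold_third_position(cds[i:i + 3])
    ]
```

Under the standard code, eight codon families (for example GCN for alanine and CTN for leucine) carry a fourfold degenerate third position.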
Population differentiation analyses
To estimate and compare the patterns of LD between different groups, the squared correlation coefficient (r²) between all 24,393,021 SNPs was computed and plotted using PopLDdecay (v3.41)58 with the following parameters: maximum distance between two SNPs = 500 kb; the LD decay distance was defined as the distance at which r² decreased to 0.1. In addition, we used SMC++ (v.1.15.2)17 to infer the population history of different groups. To obtain more accurate results, before running SMC++ we first identified the repetitive sequence-enriched regions in the genome using SNPable (https://lh3lh3.users.sourceforge.net/snpable.shtml) (splitfa parameter of 80) to generate the masked genome bed file. Then, during the conversion from vcf to smc format using SMC++ (v.1.15.2) vcf2smc, the SNPs in the repetitive sequence-rich regions were marked as missing according to the masked genome bed file, to avoid the adverse effect of inaccurate SNPs on the estimation of population history. Next, the historical effective population size for each group was estimated using SMC++ (v.1.15.2) estimate with the following parameters: mutation rate per generation = 6.5 × 10⁻⁹ (refs. 5,11) and knots = 15. Finally, the SMC++ results were visualized using SMC++ (v.1.15.2) plot, with the generation time set to 5 (refs. 5,11).
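The LD decay distance used above (the distance at which r² falls to 0.1) can be read off a binned decay curve as in the following Python sketch. This is an illustrative helper, not part of PopLDdecay: the function name is hypothetical, and linear interpolation between neighboring distance bins is an assumption.

```python
def ld_decay_distance(distances, r2_values, threshold=0.1):
    """Return the distance at which mean r2 first drops to `threshold`.

    `distances` must be increasing and paired with the mean r2 per bin.
    The crossing point is linearly interpolated between bins (an assumption
    of this sketch); None is returned if r2 never reaches the threshold.
    """
    pairs = list(zip(distances, r2_values))
    if not pairs:
        return None
    if pairs[0][1] <= threshold:
        return pairs[0][0]
    for (d0, r0), (d1, r1) in zip(pairs, pairs[1:]):
        if r0 > threshold >= r1:
            return d0 + (r0 - threshold) * (d1 - d0) / (r0 - r1)
    return None
```

For a curve with mean r² of 0.3 at 10 kb and 0.05 at 100 kb, the interpolated decay distance is about 82 kb.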
To avoid false-positive differences between populations caused by high SNP missingness in individual groups, we extracted 7,229,749 SNPs that were present in all subpopulations for the analysis of π, Fst, IBD, deleterious mutations and the historical effective population size. The π and Fst values were calculated using pixy (1.2.11.beta1)59 with a window size of 10 kb. IBD values were calculated using Plink (v1.90b6.21) --const-fid52 with the parameter: --indep-pairwise 1000 50 0.2. Heterozygosity was calculated using VCFtools (v0.1.16) --het. In addition, we used SIFT4G (v2.0.0)16 to identify deleterious SNPs/genes in different groups (SIFT_SCORE < 0.05).
Introgression signals
We used Dsuite (v0.5 r53, https://github.com/millanek/Dsuite)60 to calculate Patterson’s D (ABBA-BABA) and f4-ratio statistics among different populations, aiming to evaluate the probability of genome-wide introgression. The SNP data and a phylogenetic tree that records the group relationships served as input files for this analysis. The fb statistic, derived from the f4-ratio values, was also used to assess branch-specific introgression levels, with CSR.CTR as the outgroup. To estimate introgression across genomic windows, we used the ABBABABAwindows.py script from the Genomics General repository (https://github.com/simonhmartin/genomics_general) with parameters: -f phased, -w 20000, -s 10000, -m 100 and -T 10. This approach allowed the fd and fdM statistics to be calculated for each 20-kb genomic window, enabling the identification of regions suggestive of introgression and quantification of the extent of gene flow between two populations.
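For orientation, Patterson’s D can be written in terms of derived-allele frequencies in the four taxa (P1, P2, P3 and the outgroup O). The Python sketch below implements this standard frequency-based form; it is a simplified illustration rather than Dsuite’s implementation, and the function name and input layout are assumptions.

```python
def patterson_d(site_freqs):
    """Compute Patterson's D from per-site derived-allele frequencies.

    `site_freqs` is an iterable of (p1, p2, p3, po) tuples, one per biallelic
    site, giving the derived-allele frequency in P1, P2, P3 and the outgroup.
    D > 0 indicates excess allele sharing between P2 and P3; D < 0 between
    P1 and P3.
    """
    abba = 0.0
    baba = 0.0
    for p1, p2, p3, po in site_freqs:
        abba += (1 - p1) * p2 * p3 * (1 - po)
        baba += p1 * (1 - p2) * p3 * (1 - po)
    total = abba + baba
    return (abba - baba) / total if total else float("nan")
```

Equal numbers of ABBA and BABA sites give D = 0, the expectation under incomplete lineage sorting without gene flow.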
Additionally, absolute sequence divergence (Dxy) and Fst were calculated using the popgenWindows.py script (https://github.com/simonhmartin/genomics_general), with the parameters set to --windSize 20000, -s 10000, -m 100 and -T 10. The π-ratio value was determined using VCFtools (v0.1.16) with the --window-pi 20000 parameter.
Selective signals during different domestication stages
To explore the genome-wide selection signals at different domestication stages of tea plants, we used two methods, Fst in VCFtools (v0.1.16) and XP-CLR in xpclr (v1.1.2, https://github.com/hardingnj/xpclr)61, to identify the selection signals between CSR and CS.CA, as well as between CS.CA and six cultivated tea plant groups (CSA1, CSA2, CSS1, CSS2, CSS3 and CSS4). The analysis window was set to 20 kb with a step size of 2 kb; only windows containing more than four SNPs were used for Fst and XP-CLR analyses. To ensure the accuracy of the results, only windows in which both the Fst value and the XP-CLR value exceeded the 97th percentile were adopted as selection windows.
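The joint-outlier rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the window records, their keys and the linear-interpolation percentile are hypothetical choices, and the percentile cutoffs are computed after discarding windows with four or fewer SNPs.

```python
import math

def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def selection_windows(windows, q=97, min_snps=4):
    """Return ids of windows whose Fst AND XP-CLR both exceed the q-th percentile.

    `windows` is a list of dicts with keys 'id', 'n_snps', 'fst' and 'xpclr'
    (hypothetical layout). Windows with `min_snps` or fewer SNPs are dropped
    before the cutoffs are computed.
    """
    kept = [w for w in windows if w["n_snps"] > min_snps]
    if not kept:
        return []
    fst_cut = percentile([w["fst"] for w in kept], q)
    xpclr_cut = percentile([w["xpclr"] for w in kept], q)
    return [w["id"] for w in kept if w["fst"] > fst_cut and w["xpclr"] > xpclr_cut]
```

Requiring both statistics to be extreme is more conservative than taking either alone, because a window that is an outlier for Fst only, or XP-CLR only, is discarded.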
Phenotyping of leaf shape traits and metabolomic traits
A total of 265 tea plant accessions in the same planting base of the Fujian tea germplasm resources depository in Fuan City (27°N, 119°E) were selected for leaf shape trait analysis in May 2023. LLe, LWi and LLWR were derived from the mean of 30 randomly selected mature leaves. For each accession, six copies of 100 one-bud two leaves from current year’s young shoots were collected separately to count the WHB and the average WHB of the six copies was recorded as the WHB of this accession. In addition, IL1 including first bud and IL2 were obtained from the mean of 30 randomly selected current year’s young shoots.
Similarly, we collected young leaves (one bud and two leaves) of 300 accessions from the same planting base for untargeted metabolomics analysis by UPLC-QTOF MS4. To ensure the accuracy of the results, we set up four independent biological replicates for each accession. The Pearson’s correlation coefficients among the four biological replicates of each accession in this study were greater than 0.9.
Sample collection and preparation
The young leaves (one bud and two leaves; >3 g) of each tea plant accession were quickly frozen in liquid nitrogen and ground into a fine powder with a mortar and pestle. Then, 80 mg of fine powder was taken and added to 1000 μl of methanol/acetonitrile/H2O (2:2:1, vol/vol/vol) for metabolite extraction. The mixture was centrifuged for 20 min (14,000g, 4 °C). The supernatant was dried in a vacuum centrifuge. For liquid chromatography–mass spectrometry (LC–MS) analysis, the samples were redissolved in 100 μl acetonitrile/water (1:1, vol/vol) solvent and centrifuged at 14,000g at 4 °C for 15 min, then the supernatant was injected.
LC–MS/MS analysis
Analyses were performed using ultra-high performance LC (UHPLC; 1290 Infinity, Agilent Technologies) coupled to a QTOF (AB Sciex TripleTOF 6600) at Shanghai Applied Protein Technology. The samples were separated by Agilent 1290 Infinity UHPLC on a C-18 column; the column temperature was 40 °C. The flow rate was set at 0.4 ml min⁻¹ and the injection volume was 2 μl. Mobile phase A consisted of 25 mM ammonium acetate and 0.5% formic acid in water; mobile phase B was methanol. The gradient elution procedure was as follows: 0–0.5 min, 5% B; 0.5–10 min, B increased linearly from 5% to 100%; 10–12 min, B was held at 100%; 12–12.1 min, B decreased linearly from 100% to 5%; 12.1–16 min, B was maintained at 5%. During the whole analysis, the samples were kept in an autosampler at 4 °C. To avoid the influence of instrument fluctuation, the samples were analyzed in random order. Quality control samples were inserted into the sample queue to monitor and evaluate the stability and reliability of the data. The ESI source conditions were set as follows: Ion Source Gas1 (Gas1) as 60, Ion Source Gas2 (Gas2) as 60, curtain gas (CUR) as 30, source temperature = 600 °C and IonSpray Voltage Floating = ±5,500 V. In MS-only acquisition, the instrument was set to acquire over the m/z range 60–1,000 Da, and the accumulation time for the time-of-flight MS scan was set at 0.20 s per spectrum. In auto MS/MS acquisition, the instrument was set to acquire over the m/z range 25–1,000 Da, and the accumulation time for the product ion scan was set at 0.05 s per spectrum. The product ion scan was acquired using information-dependent acquisition with high-sensitivity mode selected. The parameters were set as follows: the collision energy was fixed at 35 V with ±15 eV; declustering potential = 60 V (+) and −60 V (−); exclude isotopes within 4 Da and candidate ions to monitor per cycle = 10.
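The gradient elution program above is a piecewise-linear function of time, which the following Python sketch encodes so that the mobile-phase composition at any time point can be checked. The breakpoints are taken from the text; the table and function names are hypothetical.

```python
# (time in min, % mobile phase B) breakpoints taken from the gradient program.
GRADIENT = [
    (0.0, 5.0),
    (0.5, 5.0),
    (10.0, 100.0),
    (12.0, 100.0),
    (12.1, 5.0),
    (16.0, 5.0),
]

def percent_b(t_min):
    """Return % mobile phase B at time `t_min` by linear interpolation."""
    if not GRADIENT[0][0] <= t_min <= GRADIENT[-1][0]:
        raise ValueError("time outside the 0-16 min run")
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
```

For example, midway through the linear ramp (5.25 min) the eluent is 52.5% B, and the final 3.9 min at 5% B re-equilibrate the column.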
Data processing
The raw MS data (wiff.scan files) were converted to mzXML files using ProteoWizard’s (https://github.com/ProteoWizard/pwiz) MSConvert before being imported into the freely available XCMS (v4, https://github.com/sneumann/xcms) software. For peak picking, the following parameters were used: centWave m/z = 10 ppm, peakwidth = c(10, 60) and prefilter = c(10, 100). For peak grouping, bw = 5, mzwid = 0.025 and minfrac = 0.5 were used. Collection of Algorithms of MEtabolite pRofile Annotation (CAMERA; v1.58.0) was used for annotation of isotopes and adducts. Among the extracted ion features, only variables with more than 50% nonzero measurement values in at least one group were kept. Compound identification of metabolites was performed by comparing accurate m/z values (<10 ppm) and MS/MS spectra with an in-house database (Applied Protein Technology) established with available authentic standards.
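The 10-ppm mass accuracy criterion used for compound identification can be made concrete with a short Python sketch. The function names are hypothetical, and the m/z values in the usage note are illustrative numbers rather than measurements from this study.

```python
def ppm_error(observed_mz, reference_mz):
    """Signed mass error of an observed m/z relative to a reference, in ppm."""
    return (observed_mz - reference_mz) / reference_mz * 1e6

def within_tolerance(observed_mz, reference_mz, tol_ppm=10.0):
    """True if the absolute mass error is below `tol_ppm` (10 ppm in the text)."""
    return abs(ppm_error(observed_mz, reference_mz)) < tol_ppm
```

With an illustrative reference m/z of 291.0863, an observed peak at 291.0880 deviates by about 5.8 ppm and would pass, whereas a peak at 291.0900 (about 12.7 ppm) would not. In the described workflow, an m/z match alone is not sufficient: MS/MS spectra are also compared with the database.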
GWAS analysis and gene function enrichment
We first extracted the association analysis population vcf from the vcf file of all the samples using VCFtools (v0.1.16), and then filtered out SNPs with high missing rates and retained the bi-allelic SNPs using Plink (v1.90b6.21) with the parameters --maf 0.05 --geno 0.2 --biallelic-only52, resulting in 19,671,641 SNPs of 265 tea plant accessions for GWAS of tea plant agronomic traits and 19,070,598 SNPs of 300 tea plant accessions for mGWAS. We used the MLM and FarmCPU models built into rMVP (v1.0.0)62 for GWAS analysis of leaf shape traits (LLe, LWi and LLWR), internode length traits (IL1 and IL2) and all metabolite traits. In the GWAS analysis, we used the population kinship matrix and the first three principal components of the population PCA as covariates to correct for population structure by setting the parameters: K = kinship, nPC.GLM = 3, nPC.MLM = 3 and nPC.FarmCPU = 3. The population kinship matrix and the first three principal components were obtained using rMVP’s (v1.0.0) built-in MVP.Data (fileKin = TRUE, filePC = TRUE). rMVP applied Bonferroni multiple-testing correction across all SNPs to obtain the final association analysis results. In this study, only SNPs with a P value of less than 1/total number of SNPs and phenotypic variance explained > 10% were adopted as SNPs substantially associated with a trait. Genes overlapping the 50-kb intervals upstream and downstream of these SNPs were recognized as candidate genes. In this study, the GO and KEGG enrichment analyses of candidate genes or selective genes were all performed in TBtools (v2.042)63. The P values of KEGG or GO annotations were obtained by Fisher’s exact test, and only KEGG pathways and GO terms with P < 0.05 were used for visualization.
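The two post-GWAS rules described above, the 1/n P value cutoff combined with phenotypic variance explained > 10%, and the ±50-kb candidate gene window, can be sketched in Python as follows. The record layout, key names and gene identifiers are hypothetical, and the functions operate on already computed association results rather than calling rMVP.

```python
def significant_snps(results, n_snps, min_pve=0.10):
    """Keep SNPs with P < 1/n_snps and phenotypic variance explained > min_pve.

    `results` is a list of dicts with keys 'p' and 'pve' (PVE as a fraction).
    """
    cutoff = 1.0 / n_snps
    return [r for r in results if r["p"] < cutoff and r["pve"] > min_pve]

def candidate_genes(snp, genes, flank=50_000):
    """Return ids of genes overlapping the interval [pos - flank, pos + flank].

    `snp` has keys 'chrom' and 'pos'; each gene has 'id', 'chrom', 'start'
    and 'end' (inclusive coordinates).
    """
    lo, hi = snp["pos"] - flank, snp["pos"] + flank
    return [
        g["id"]
        for g in genes
        if g["chrom"] == snp["chrom"] and g["start"] <= hi and g["end"] >= lo
    ]
```

With the 19,671,641 SNPs used for the agronomic-trait GWAS, the cutoff 1/n corresponds to a P value of roughly 5.1 × 10⁻⁸.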
Effect validation of sSNPs related to leaf shape and epicatechin in other populations
In this study, we selected one LLWR sSNP (an sSNP of CsTGY01G0000394) and one epicatechin-related sSNP (the sSNP in the bHLH62 gene) to validate their phenotypic effects in another population. We obtained resequencing reads of the 191 tea plant accessions published in ref. 35 and genotyped these two sSNPs by mapping the reads to the monoploid TGY genome using the mapping method described above. We then measured LLe and LWi of mature leaves and the epicatechin content in young leaves from accessions with clear genotypes at these two sSNPs. The epicatechin content was determined by HPLC according to the national standard method35. Finally, a two-sided Student’s t test was used to compare trait differences between accessions with different genotypes to assess the phenotypic effects of the sSNPs.
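The genotype comparison above relies on Student’s t test; the Python sketch below computes the pooled-variance t statistic and its degrees of freedom for two genotype groups. It is an illustration only: the function name is hypothetical, the equal-variance form is an assumption, and the two-sided P value would still have to be obtained from a t distribution (for example with t.test in R, which the study used for statistics).

```python
from math import sqrt
from statistics import mean, variance

def student_t(group_a, group_b):
    """Return (t statistic, degrees of freedom) for a pooled-variance t test.

    Each group needs at least two observations; `variance` is the sample
    variance (n - 1 denominator).
    """
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df
```

Here the two groups would hold the trait values (for example, epicatechin content) of accessions carrying different genotypes at the tested sSNP.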
Statistical analysis
All statistics applied in this study were performed using R (v.4.1.2) and provided alongside the respective analysis in Methods, Main and figure legends in Fig. 5e and Fig. 6g–i. The statistical tests of significance in GWASs, haplotype difference analysis and intergroup difference analysis have been described above.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The raw sequence data in this paper have been deposited in the GSA database (https://bigd.big.ac.cn/gsa/, BioProject, accession PRJCA026608). The other data generated in the previous studies are publicly available in the National Center for Biotechnology Information database under the accessions PRJNA665594 (ref. 5) and PRJCA011312 (ref. 10), and in the GSA database under the accessions PRJCA003090 (ref. 5), PRJNA646044 (ref. 11), PRJNA716079 (ref. 3) and PRJNA597714 (ref. 12). The phenotypes used in GWAS analysis have been deposited in Zenodo at https://doi.org/10.5281/zenodo.14880636 (ref. 64).
Code availability
All codes and tools used in this study are described in Methods and the Reporting summary. The customized scripts and codes used in the present study are available via Zenodo at https://doi.org/10.5281/zenodo.14880636 (ref. 64).
References
Hayat, K., Iqbal, H., Malik, U., Bilal, U. & Mushtaq, S. Tea and its consumption: benefits and risks. Crit. Rev. Food Sci. Nutr. 55, 939–954 (2015).
Chen, S. et al. Gene mining and genomics-assisted breeding empowered by the pangenome of tea plant Camellia sinensis. Nat. Plants 9, 1986–1999 (2023).
Lu, L. et al. Genome-level diversification of eight ancient tea populations in the Guizhou and Yunnan regions identifies candidate genes for core agronomic traits. Hortic. Res. 8, 190 (2021).
Yu, X. et al. Metabolite signatures of diverse Camellia sinensis tea populations. Nat. Commun. 11, 5586 (2020).
Zhang, X. et al. Haplotype-resolved genome assembly provides insights into evolutionary history of the tea plant Camellia sinensis. Nat. Genet. 53, 1250–1259 (2021).
Meegahakumbura, M., Wambulwa, M., Li, D.-Z. & Gao, L. Preliminary investigations on the genetic relationships and origin of domestication of the tea plant (Camellia sinensis (L.)) using genotyping by sequencing. Trop. Agric. Res. 29, 230 (2018).
Meegahakumbura, M. K. et al. Domestication origin and breeding history of the tea plant (Camellia sinensis) in China and India based on nuclear microsatellites and cpDNA sequence data. Front. Plant Sci. 8, 2270 (2017).
Wu, Y. et al. Phylogenomic discovery of deleterious mutations facilitates hybrid potato breeding. Cell 186, 2313–2328.e15 (2023).
Li, D. et al. The multi-omics basis of potato heterosis. J. Integr. Plant Biol. 64, 671–687 (2022).
Lei, Y. et al. Whole-genome resequencing reveals the origin of tea in Lincang. Front. Plant Sci. 13, 984422 (2022).
Wang, X. C. et al. Population sequencing enhances understanding of tea plant evolution. Nat. Commun. 11, 4447 (2020).
Xia, E. et al. The reference genome of tea plant and resequencing of 81 diverse accessions provide insights into its genome evolution and adaptation. Mol. Plant 13, 1013–1026 (2020).
Chen, L., Yu, F. L. & Tong, Q. Q. Discussions on phylogenetic classification and evolution of section Thea. J. Tea Sci. 20, 89–94 (2000).
Mondal, T. K. in Wild Crop Relatives: Genomic and Breeding Resources: Plantation and Ornamental Crops (ed. Kole, C.) 15–39 (Springer, 2011).
Zan, T. et al. Phylogenomic analyses of Camellia support reticulate evolution among major clades. Mol. Phylogenet. Evol. 182, 107744 (2023).
Vaser, R., Adusumalli, S., Leng, S. N., Sikic, M. & Ng, P. C. SIFT missense predictions for genomes. Nat. Protoc. 11, 1–9 (2016).
Terhorst, J., Kamm, J. A. & Song, Y. S. Robust and scalable inference of population history from hundreds of unphased whole genomes. Nat. Genet. 49, 303–309 (2017).
Doebley, J. F., Gaut, B. S. & Smith, B. D. The molecular genetics of crop domestication. Cell 127, 1309–1321 (2006).
Hara, Y., Luo, S. J., Wickremasinghe, R. L. & Yamanishi, T. Special issue on tea. Food Rev. Int. 11, 371–542 (1995).
Lu, H. Y. et al. Earliest tea as evidence for one branch of the Silk Road across the Tibetan Plateau. Sci. Rep. 6, 18955 (2016).
Zhang, W., Rong, J., Wei, C., Gao, L. & Chen, J. Domestication origin and spread of cultivated tea plants. Biodivers. Sci. 26, 357–372 (2018).
Yonekura-Sakakibara, K. et al. Two glycosyltransferases involved in anthocyanin modification delineated by transcriptome independent component analysis in Arabidopsis thaliana. Plant J. 69, 154–167 (2012).
Ni, B. B. et al. A chromosome-scale genome of Rhus chinensis Mill. provides new insights into plant-insect interaction and gallotannins biosynthesis. Plant J. 118, 766–786 (2024).
He, X. J. et al. Isolation and characterization of key genes that promote flavonoid accumulation in purple-leaf tea (Camellia sinensis L.). Sci. Rep. 8, 130 (2018).
Huang, X. X. et al. A novel UDP-glycosyltransferase 91C1 confers specific herbicide resistance through detoxification reaction in Arabidopsis. Plant Physiol. Biochem. 159, 226–233 (2021).
Wang, J. et al. Evolution and functional divergence of glycosyltransferase genes shaped the quality and cold tolerance of tea plants. Plant Cell 37, koae268 (2024).
Jing, T. T. et al. UGT89AC1-mediated quercetin glucosylation is induced upon herbivore damage and enhances Camellia sinensis resistance to insect feeding. Plant Cell Environ. 47, 682–697 (2024).
Vogler, F., Schmalzl, C., Englhart, M., Bircheneder, M. & Sprunck, S. Brassinosteroids promote Arabidopsis pollen germination and growth. Plant Reprod. 27, 153–167 (2014).
Sakaguchi, J. & Watanabe, Y. Light perception in aerial tissues enhances DWF4 accumulation in root tips and induces root growth. Sci. Rep. 7, 1808 (2017).
Choe, S. et al. Lesions in the sterol Δ7 reductase gene of Arabidopsis cause dwarfism due to a block in brassinosteroid biosynthesis. Plant J. 21, 431–443 (2000).
Wen, C. et al. Physiological controls of chrysanthemum DgD27 gene expression in regulation of shoot branching. Plant Cell Rep. 35, 1053–1070 (2016).
Zhang, C.-C. et al. Cloning and characterization of an S-RNase gene in Camellia sinensis. Sci. Hortic. 207, 218–224 (2016).
Talbert, P. B., Adler, H. T., Parks, D. W. & Comai, L. The REVOLUTA gene is necessary for apical meristem development and for limiting cell divisions in the leaves and stems of Arabidopsis thaliana. Development 121, 2723–2735 (1995).
Preciado, J., Begcy, K. & Liu, T. The Arabidopsis HDZIP class II transcription factor ABA INSENSITIVE TO GROWTH 1 functions in leaf development. J. Exp. Bot. 73, 1978–1991 (2022).
Fang, K. X. et al. Genome-wide association analysis identified molecular markers associated with important tea flavor-related metabolites. Hortic. Res. 8, 42 (2021).
Cai, W. G. et al. A receptor-like kinase controls the amplitude of secondary cell wall synthesis in rice. Curr. Biol. 33, 498–506.e6 (2023).
Xu, W. J., Dubos, C. & Lepiniec, L. Transcriptional control of flavonoid biosynthesis by MYB-bHLH-WDR complexes. Trends Plant Sci. 20, 176–185 (2015).
Zhang, W. et al. Genome assembly of wild tea tree DASZ reveals pedigree and selection history of tea varieties. Nat. Commun. 11, 3719 (2020).
Xu, G. H. et al. Evolutionary metabolomics identifies substantial metabolic divergence between maize and its wild ancestor, teosinte. Plant Cell 31, 1990–2009 (2019).
Zhu, G. T. et al. Rewiring of the fruit metabolome in tomato breeding. Cell 172, 249–261.e12 (2018).
Li, X. et al. Multi-omics analyses of 398 foxtail millet accessions reveal genomic regions associated with domestication, metabolite traits, and anti-inflammatory effects. Mol. Plant 15, 1367–1383 (2022).
Wang, W. S. et al. Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature 557, 43–49 (2018).
Wu, X. Y. et al. Genomic footprints of sorghum domestication and breeding selection for multiple end uses. Mol. Plant 15, 537–551 (2022).
Nei, M. & Li, W. H. Linkage disequilibrium in subdivided populations. Genetics 75, 213–219 (1973).
Huang, X. et al. Efficient estimation for large-scale linkage disequilibrium patterns of the human genome. eLife 12, RP90636 (2023).
Zhang, Z. et al. Jujube metabolome selection determined the edible properties acquired during domestication. Plant J. 109, 1116–1133 (2022).
Zhao, H. et al. Rewiring of the seed metabolome during tartary buckwheat domestication. Plant Biotechnol. J. 21, 150–164 (2023).
Chen, S. F., Zhou, Y. Q., Chen, Y. R. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, 884–890 (2018).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
McKenna, A. et al. The Genome Analysis Toolkit: a mapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6, 80–92 (2012).
Dutoit, L., Burri, R., Nater, A., Mugal, C. F. & Ellegren, H. Genomic distribution and estimation of nucleotide diversity in natural populations: perspectives from the collared flycatcher (Ficedula albicollis) genome. Mol. Ecol. Resour. 17, 586–597 (2017).
Minh, B. Q. et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37, 1530–1534 (2020).
Alexander, D. H. & Lange, K. Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics 12, 246 (2011).
Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2009).
Zhang, C., Dong, S. S., Xu, J. Y., He, W. M. & Yang, T. L. PopLDdecay: a fast and effective tool for linkage disequilibrium decay analysis based on variant call format files. Bioinformatics 35, 1786–1788 (2019).
Korunes, K. L. & Samuk, K. pixy: unbiased estimation of nucleotide diversity and divergence in the presence of missing data. Mol. Ecol. Resour. 21, 1359–1368 (2021).
Malinsky, M., Matschiner, M. & Svardal, H. Dsuite—fast D-statistics and related admixture evidence from VCF files. Mol. Ecol. Resour. 21, 584–595 (2021).
Chen, H., Patterson, N. & Reich, D. Population differentiation as a test for selective sweeps. Genome Res. 20, 393–402 (2010).
Yin, L. L. et al. rMVP: a memory-efficient, visualization-enhanced, and parallel-accelerated tool for genome-wide association study. Genom. Proteom. Bioinform. 19, 619–628 (2021).
Chen, C. J. et al. TBtools: an integrative toolkit developed for interactive analyses of big biological data. Mol. Plant 13, 1194–1202 (2020).
Kong, W. Genomic analysis of 1,325 Camellia accessions sheds light on agronomic and metabolic traits for tea plant improvement. Zenodo https://doi.org/10.5281/zenodo.14880636 (2025).
Acknowledgements
This work was supported by the ‘5511’ Collaborative Innovation Project (XTCXGC2021004), the Key-Area Research and Development Program of Guangdong Province (2020B020220004), the Shenzhen Science and Technology Program (RCYX20210706092103024), the National Tea Industry Technology System Project (CARS-19), Fujian Public Welfare Projects (2021R1029003 and 2022R1029003) and the National Natural Science Foundation of China (grant 32222019). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
Author information
Contributions
X.Z. and C.C. designed this project and coordinated research activities. W.K., X.K., C.C., R.S., Z.C., X.Y., Y. Zhao, Y.H., S. Zheng, S. Zhong, Y.W. and G.-B.C. collected or provided plant materials. K.F. and H.W. provided genotype and phenotype data for the validation population. W.K. performed DNA extraction, read mapping, variant calling, population structure, phylogenetic tree, population differentiation, metabolomics, population GWAS and mGWAS analyses. Z.X. performed introgression analysis. Xiaofeng Li analyzed selective sweeps among populations. S. Zhang, Yanbing Zhang and H.L. visualized individual results. C.C., Yazhen Zhang and Xinlei Li provided advice on paper writing. W.K., X.K., X.Z., Z.X., Xinlei Li and Xiaofeng Li interpreted the data and contributed to the manuscript writing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Genetics thanks the anonymous reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–31.
Supplementary Tables
Supplementary Tables 1–10.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kong, W., Kong, X., Xia, Z. et al. Genomic analysis of 1,325 Camellia accessions sheds light on agronomic and metabolic traits for tea plant improvement. Nat Genet 57, 997–1007 (2025). https://doi.org/10.1038/s41588-025-02135-z
DOI: https://doi.org/10.1038/s41588-025-02135-z