Background & Summary

There are more than 1200 camellia species in the world, most of which bear red flowers; only a small number (83 species) bear golden flowers (https://camellia.iflora.cn/)1 and belong to Camellia sect. Chrysantha. These golden camellias are evergreen shrubs or small trees, often praised as the “Queen of Camellias”. However, because it is seriously threatened, C. fascicularis has been listed as a second-class key protected wild plant in China2. The golden-flower camellias are precious resources that combine ornamental, medicinal, and edible uses. In addition, they are valuable resources for studying flower color formation and for breeding yellow-flowered camellia cultivars1,3.

According to the IUCN evaluation criteria, C. fascicularis was categorized as critically endangered (CR) on the China Biodiversity Red List of higher plants (2020). It was also listed as one of the 62 and 101 target species in the Yunnan provincial conservation action plan for plants with extremely small populations (PSESP) in 2010 and 2021, respectively4. In addition to its high ornamental value in gardens, it is an edible and medicinal plant whose leaves are rich in nutritional and health-promoting compounds. These leaves have antioxidant, anti-inflammatory, and other beneficial physiological activities5,6. In recent years, despite intensified conservation efforts for C. fascicularis, limitations in natural regeneration and genetic rescue persist. As a flagship PSESP endemic to southeastern Yunnan, its successful genomic conservation would establish a model for rescuing co-occurring endangered species in this biodiversity hotspot7,8,9,10. Therefore, developing genomic-level strategies is critical to enable effective protection mechanisms for this species and other PSESPs.

Plant genome sequencing has developed rapidly in the past 20 years, and by early December 2024, over 1,500 genome sequences of higher plant taxa had been published11. Sequencing genomes can provide insights and evidence for a better understanding of plant genome biology and evolution12,13. Although the genomes of many plant species have been studied, only a few studies have sequenced the genomes of threatened plant species with a focus on their conservation (such as Anisodus tanguticus14, Firmiana kwangsiensis15, Zanthoxylum nitidum16, C. nitidissima17, Rhododendron griersonianum18 and Magnolia sinica19).

The de novo assembly of plant genomes is pivotal for advancing our understanding of plant evolutionary processes, ornamental plant breeding, and conservation efforts15,20. In this study, we generated a high-quality chromosome-level genome assembly of C. fascicularis (Fig. 1) based on PacBio HiFi reads (89.22 Gb) and Hi-C reads (249.45 Gb) (Fig. 2; Supplementary Tables 1, 2). The final assembly (~5.65 Gb) consisted of two complete haplotypes: haplotype A (~2.89 Gb) and haplotype B (~2.76 Gb), with contig N50 lengths of 196.04 Mb and 104.27 Mb, respectively (Table 1). The two haplotypes were anchored onto 30 pseudochromosomes (15 per haplotype), with 99.67% of the assembled sequences successfully placed (Fig. 2; Table 1). Telomeric repeat motifs (TTTAGGG)n were identified at the ends of all chromosomes, and the scaffold N50 reached 191.08 Mb. In addition, a total of 7,943,983 repetitive sequences were identified, with a total length of 3,853,635,256 bp, accounting for 68.21% of the genome. Among them, the most abundant were long terminal repeat (LTR) elements, with a total of 4,349,137 elements and a cumulative length of 3,162,073,259 bp, accounting for 55.97% of the entire genome (Table 2; Supplementary Table 3). The genome of C. fascicularis contains a total of 88,796 genes, including 85,316 protein-coding genes and 3,480 non-coding genes (Table 3). A total of 85,316 protein-coding genes were functionally annotated (Table 4). This study obtained a high-quality genome of C. fascicularis, which is of great significance for the conservation of this species.

Fig. 1

Morphological Characteristics of C. fascicularis. (a,b) Flowers. (c) Seedlings. (d) Stems. (e) Leaves. (f) Roots of Seedlings.

Fig. 2

The genomic landscape of 30 pseudo-chromosomes of C. fascicularis (labeled as Chr01-15[a, b]). Circles from outside to inside: (a) chromosome length, (b) density of Class I transposable elements (TEs), including long terminal repeats (LTRs) and long/short dispersed elements, (c) density of Class II TEs (DNA and Helitrons), (d) density of coding genes (mRNA), (e) proportion of tandem repeat sequences, (f) GC content, and (g) synteny blocks.

Table 1 Statistics of the C. fascicularis genome assembly.
Table 2 Summary of the repetitive sequences in C. fascicularis genome assembly.
Table 3 Summary of C. fascicularis genome annotations.
Table 4 Functional annotation of protein coding genes in C. fascicularis.

Methods

Plant material

For genomic sequencing, we collected fresh leaf samples from healthy adult plants of C. fascicularis growing in the wild in Hekou County, Yunnan Province (103°58′E, 22°52′N) (Fig. 1). In addition to the leaf samples, flower, root, and young stem samples were collected for full-length transcriptome sequencing. After collection, the samples were immediately wrapped in aluminum foil and quickly frozen in liquid nitrogen, then stored at −80 °C until further processing.

Genomic DNA extraction and sequencing

Genomic DNA was sequenced on several platforms in parallel to ensure accurate assembly. The specific steps were as follows:

High-quality genomic DNA was isolated from the fresh leaves using the CTAB method, and the DNA quality and concentration were tested by 0.75% agarose gel electrophoresis, NanoDrop One spectrophotometer (Thermo Fisher Scientific) and Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA).

The libraries used for single-molecule real-time (SMRT) Pacific Biosciences (PacBio) genome sequencing were constructed according to standard protocols. A library with a DNA-fragment insert size of ~15 kb was prepared from 3 μg of high-quality genomic DNA and sequenced on the Revio system (PacBio, USA). A total of 89.22 Gb (~5.06 M reads) of HiFi sequencing data was obtained, which was used for the subsequent genome assembly of C. fascicularis (Supplementary Table 1).

High-quality gDNA was randomly fragmented by ultrasonic oscillation (Covaris, USA) and used for DNBSEQ-T7 (BGI Inc., China) short-read sequencing. To increase continuity of the genome, the gDNA was used to construct the Hi-C libraries according to the standard library preparation protocol21. The Hi-C libraries were enriched, A-tailed, and then subjected to PCR amplification (12–14 cycles). Subsequently, the Hi-C libraries were sequenced using the DNBSEQ-T7 platform, operating in PE150 mode. Approximately 249.45 Gb (~1663 M reads) of Hi-C data were generated for subsequent pseudochromosome assembly (Supplementary Table 2). Hi-C data were integrated with the PacBio assembly to resolve chromosomal structures.
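As a rough sanity check on the yields reported above, the mean read length and nominal sequencing depth can be derived from the stated totals. The following is a minimal sketch only: the function names are hypothetical, and the haploid size of ~2.89 Gb is taken from the haplotype A assembly size reported in this paper rather than from an independent genome size estimate.

```python
# Illustrative arithmetic on the reported sequencing yields.
# All function names are hypothetical; the haploid size is an assumption
# borrowed from the haplotype A assembly size (~2.89 Gb).

def mean_read_length(total_bases: float, n_reads: float) -> float:
    """Average read length in bp."""
    return total_bases / n_reads


def nominal_depth(total_bases: float, genome_size: float) -> float:
    """Sequenced bases divided by genome size (fold coverage)."""
    return total_bases / genome_size


HIFI_BASES, HIFI_READS = 89.22e9, 5.06e6   # PacBio HiFi yield
HIC_BASES, HIC_READS = 249.45e9, 1663e6    # Hi-C yield (PE150)
HAPLOID_SIZE = 2.89e9                      # assumption: haplotype A size

hifi_mean_len = mean_read_length(HIFI_BASES, HIFI_READS)
hic_mean_len = mean_read_length(HIC_BASES, HIC_READS)
hifi_depth = nominal_depth(HIFI_BASES, HAPLOID_SIZE)
hic_depth = nominal_depth(HIC_BASES, HAPLOID_SIZE)

print(f"HiFi mean read length: {hifi_mean_len / 1e3:.1f} kb")
print(f"Hi-C mean read length: {hic_mean_len:.0f} bp")
print(f"HiFi depth per haploid genome: {hifi_depth:.1f}x")
print(f"Hi-C depth per haploid genome: {hic_depth:.1f}x")
```

The Hi-C mean read length of 150 bp obtained this way agrees with the PE150 mode stated above, and the HiFi mean read length is in line with the ~15 kb insert size.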

RNA extraction and sequencing

For transcriptome sequencing, fresh tissue samples including stems, young and mature leaves, roots, and petals were collected from C. fascicularis and immediately frozen in liquid nitrogen. Total RNA was subsequently extracted using the TRIzol® Reagent (Invitrogen). The concentration and quality of RNA were assessed using a NanoDrop 2000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and a Bioanalyzer 2100 system (Agilent Technologies, CA, USA). High-quality RNA from different tissues was mixed in equal amounts and used to construct the cDNA libraries according to the manufacturer’s instructions. Libraries were then loaded onto an R9.4 sequencing chip and sequenced on a PromethION sequencer (Oxford Nanopore Technologies, UK). Finally, a total of 17.25 Gb (~13.84 M reads) of full-length RNA-seq data was obtained (Supplementary Table 4).

De novo genome assembly

First, contigs were assembled from PacBio HiFi reads using hifiasm (v0.19.8-r602), with haplotype assembly selected for subsequent analysis22. Next, Hi-C reads were aligned to the assembled contigs using Juicer23, followed by preliminary Hi-C-assisted chromosome scaffolding and anchoring with 3D-DNA24. The chromosome segmentation boundaries and assembly errors were manually checked and adjusted using Juicebox25. After manual review, the final chromosome framework and scattered sequences were generated. Additionally, the chloroplast and mitochondrial genomes were assembled using the GetOrganelle toolkit26.

To further optimize the assembly, the quarTeT software was used to fill in missing gaps based on the HiFi reads, thereby improving the completeness of the assembly. Additionally, by aligning the HiFi reads to the regions near the chromosome telomeres, the terminal sequences were assembled using hifiasm, and the chromosomes were extended to recover the telomere sequences as completely as possible. To remove redundancy and foreign contamination, redundans was employed to align scattered contigs with chromosomes and organelle genomes, eliminating redundant sequences, particularly the large fragments that might arise from organelle genomes and rDNA27. Finally, low-coverage fragments or haplotigs were identified and removed to ensure the purity and accuracy of the assembled data.

Finally, based on the reference tea genome28, the chromosomes were numbered as chr01-15[ab], where “a” and “b” do not represent sequences from the same parent; the “a” set was chosen as the primary assembly result. All chromosome sequences were adjusted to ensure they were oriented in the same direction. According to the assembly results (Supplementary Table 5), a total of 30 chromosomes were identified, accounting for 99.67% of the total genome length. Chromosomes 1b, 2b, 8a, 9b, and 14a were fully assembled without gaps, including their terminal sequences, and thus showed the best assembly quality (Supplementary Table 6).

Genome annotation

Repetitive sequence annotation

Several methods were employed to annotate the repetitive sequences. First, transposable elements were identified de novo using EDTA29 (parameters: --sensitive 1 --anno 1) to generate a TE library. Then, RepeatMasker (http://www.repeatmasker.org/RepeatMasker/) was used to identify repetitive regions in the genome. A total of 3.85 Gb (68.21%) of the assembled sequences were annotated as TEs, with LTR (55.97%) and TIR (7.62%) being the two most abundant TE superfamilies (Fig. 2; Table 2; Supplementary Table 3).
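The repeat percentages can be cross-checked against the base-pair totals given in the Background & Summary. The sketch below is illustrative arithmetic only; it assumes a total assembly size of about 5.65 Gb (both haplotypes, a rounded value), so the computed percentages are expected to match the reported ones only to within rounding. The function and constant names are hypothetical.

```python
# Illustrative cross-check of the reported repeat fractions.
# ASSEMBLY_BP is an assumption: the rounded ~5.65 Gb total for both haplotypes.

ASSEMBLY_BP = 5.65e9
REPEAT_BP = 3_853_635_256   # total length of annotated repeats
LTR_BP = 3_162_073_259      # cumulative length of LTR elements


def percent_of_assembly(bp: float, assembly_bp: float = ASSEMBLY_BP) -> float:
    """Share of the assembly occupied by a feature class, in percent."""
    return 100.0 * bp / assembly_bp


repeat_pct = percent_of_assembly(REPEAT_BP)
ltr_pct = percent_of_assembly(LTR_BP)
ltr_share_of_repeats = 100.0 * LTR_BP / REPEAT_BP

print(f"Repeats: {repeat_pct:.2f}% of the assembly")
print(f"LTRs:    {ltr_pct:.2f}% of the assembly")
print(f"LTRs make up {ltr_share_of_repeats:.1f}% of all repeat sequence")
```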

Protein-coding gene annotation

For gene annotation, public protein sequences from C. sinensis, Diospyros lotus, Rhododendron ovatum, Gilia yorkii, Vitellaria paradoxa, Aegiceras corniculatum, Coffea canephora, Solanum lycopersicum, Cornus controversa, Vitis vinifera, and Arabidopsis thaliana were collected. A total of 339,202 non-redundant protein sequences were used as homologous protein evidence for gene annotation.

For comprehensive genome annotation, RNA-seq data were first aligned to the genome using minimap230, followed by transcript assembly with StringTie31 (Supplementary Table 7). Subsequently, the PASA pipeline32 was used for gene structure annotation of the transcripts, and full-length genes were identified through reference protein alignment. Based on the full-length gene set, AUGUSTUS33 and SNAP34 were trained to optimize the prediction models. In the MAKER235 pipeline, de novo predictions, transcript evidence, and homologous protein evidence were integrated. Repetitive regions were masked using RepeatMasker, and de novo (ab initio) gene predictions were made using AUGUSTUS33 and SNAP34. Meanwhile, the transcripts and protein sequences were aligned to the genome using BLASTN and BLASTX, and the alignment results were further optimized with Exonerate36, generating hint files to integrate the gene models. Since the annotation accuracy of the MAKER pipeline is relatively low37, the annotations from MAKER and PASA were further integrated using EVidenceModeler38 (EVM) to generate a consistent gene annotation. To avoid introducing TE coding regions, TE protein domains on the genome were identified using TEsorter39, and these regions were masked using EVM.

Non-coding RNA annotation

For non-coding RNA annotation, tRNA genes were annotated using tRNAScan-SE40, rRNA genes were annotated using barrnap (https://github.com/tseemann/barrnap/), and various non-coding RNAs, including miRNA and snRNA, were annotated through RfamScan. All annotations were then merged and redundant annotations were removed (coding genes were prioritized, and overlapping genes were excluded) to obtain the complete gene set. Finally, both coding and non-coding genes were uniformly named.

Functional annotation

To further annotate the functions of protein-coding genes, three strategies were employed: (1) eggNOG-mapper41 annotation: Gene functions were annotated by comparing them with the eggNOG homologous gene database, including GO, KEGG, and other categories. (2) Sequence similarity search: DIAMOND42 was used to align protein sequences to protein databases (e.g., Swiss-Prot, TrEMBL, NR, and the Arabidopsis database) to identify the best matches for genes. The alignment criteria were an identity percentage greater than 30% and an E-value less than 1e-5. (3) Domain similarity search: InterProScan43 was used to compare with sub-databases within InterPro, such as PRINTS, Pfam, SMART, PANTHER, and CDD, to obtain conserved amino acid sequences, motifs, and domains of the proteins.
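The similarity-search thresholds in strategy (2) can be illustrated with a short filter over tabular alignment output. This is a minimal sketch, not the pipeline used in this study: it assumes the standard 12-column BLAST tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), it assumes that the "best match" is the passing hit with the highest bit score, and the function name and the example records are hypothetical.

```python
import csv

# Assumed column positions in 12-column BLAST tabular output.
QSEQID, SSEQID, PIDENT, EVALUE, BITSCORE = 0, 1, 2, 10, 11


def best_hits(tsv_lines, min_identity=30.0, max_evalue=1e-5):
    """Keep, per query, the highest-bit-score hit that has
    identity > min_identity and E-value < max_evalue."""
    best = {}
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        pident = float(row[PIDENT])
        evalue = float(row[EVALUE])
        bitscore = float(row[BITSCORE])
        if pident <= min_identity or evalue >= max_evalue:
            continue  # fails the stated alignment criteria
        query = row[QSEQID]
        if query not in best or bitscore > best[query][3]:
            best[query] = (row[SSEQID], pident, evalue, bitscore)
    return best


# Hypothetical example records (identifiers are made up).
example = [
    "gene1\tP12345\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t350",
    "gene1\tQ99999\t40.0\t180\t108\t0\t1\t180\t1\t180\t1e-10\t120",
    "gene2\tA00001\t25.0\t150\t112\t0\t1\t150\t1\t150\t1e-8\t60",
    "gene3\tB00002\t55.0\t90\t40\t0\t1\t90\t1\t90\t1e-3\t45",
]
hits = best_hits(example)
for query, (subject, pident, evalue, bitscore) in hits.items():
    print(query, subject, pident, evalue, bitscore)
```

In this toy input, gene2 is dropped for low identity and gene3 for a weak E-value, so only gene1 retains a best match.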

Data Records

The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive44 at the National Genomics Data Center45, China National Center for Bioinformation / Beijing Institute of Genomics, Chinese Academy of Sciences (GSA: CRA024719)46, and are publicly accessible. The genome assembly and annotation data can also be accessed at the Genome Warehouse (GWH) using the accession number GWHFSQE00000000.147. The genome assembly has also been submitted to the National Center for Biotechnology Information (NCBI) under the accession numbers GCA_051225895.1 (CfHapA)48 and GCA_051225905.1 (CfHapB)49.

Technical Validation

Evaluation of the assembled genomes

In the genome assembly of C. fascicularis, we assembled two complete haplotypes with a total size of 5.65 Gb (Fig. 2; Supplementary Tables 1 and 2). The scaffold N50 was 191.08 Mb, indicating good continuity. The N50 of the contigs reached 115.51 Mb, with 57 gaps, and these contigs were anchored to 30 pseudochromosomes, with an anchoring rate of 99.67%. Additionally, the telomeric sequence (TTTAGGG)n was detected at the ends of most chromosomes, and the 18S and 5S rDNA arrays were detected on chromosomes 7, 8, 11, and 13 (Fig. 4). All chromosomes contained a high-copy tandem repeat sequence (designated as cent2), with the unit sequence being AGAATTTACTGGGAATTTACTGAGTAATTTACTGAGAATTTACTG (Fig. 4). This sequence may be related to centromere function. Its sequence characteristics and distribution provide important clues for studying chromosome segregation mechanisms, centromere function, and chromosome evolution.
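For readers less familiar with the contiguity metric used here, the N50 is the length of the sequence at which the cumulative length, counted from the longest sequence downward, first reaches half of the total. The sketch below shows one common way to compute it; the function name and the toy lengths are hypothetical, and the assembly statistics in this paper were not necessarily produced with this code.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches at least
    half of the summed length.
    """
    lengths = sorted(lengths, reverse=True)
    if not lengths:
        raise ValueError("no sequence lengths supplied")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length


# Toy example: total = 20, and 6 + 5 = 11 is the first running sum >= 10.
toy_n50 = n50([2, 3, 4, 5, 6])
print(toy_n50)
```

Applied to contig lengths this gives the contig N50, and applied to scaffold or pseudochromosome lengths it gives the scaffold N50.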

To further validate the assembly quality, the third-generation sequencing data (HiFi reads and Iso-Seq reads) were aligned to the genome using bwa27 and minimap230, respectively (Supplementary Table 8). HiFi reads covered 99.64% of the assembly at ≥5× depth, indicating that nearly all of the assembly is supported by read evidence. The Hi-C interaction map showed a strong intrachromosomal interaction signal along the diagonal. The BUSCO (Benchmarking Universal Single-Copy Orthologs) assessment showed high completeness of the core genes (Supplementary Tables 9 and 10). A total of 99.13% of the core genes (including both single-copy and multi-copy genes) were successfully assembled, with a gene missing rate of 0.43%. Additionally, there were no significant heterozygous peaks in the BUSCO core single-copy and multi-copy gene regions, and their distribution was consistent (Fig. 3a,b). The Hi-C interaction heatmaps generated by Juicer show high resolution for each chromosome (Fig. 3c), with no obvious noise observed outside the diagonals, further supporting the quality of the assembly. Additionally, no anomalies were observed between each pair of homologous chromosomes (Fig. 3d), indicating no obvious switch errors.
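The coverage statistic quoted above (the share of the assembly covered at ≥5× depth) can be expressed as a simple fraction over per-base depths. The sketch below is illustrative only: it assumes that per-base depth values are already available as a sequence of integers, and the function name and toy values are hypothetical.

```python
def fraction_covered(depths, min_depth=5):
    """Fraction of positions whose read depth is at least min_depth."""
    depths = list(depths)
    if not depths:
        raise ValueError("no positions supplied")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Toy per-base depths: 7 of the 10 positions reach 5x.
toy_depths = [0, 3, 5, 12, 30, 31, 29, 4, 6, 28]
covered_fraction = fraction_covered(toy_depths)
print(f"{100 * covered_fraction:.2f}% of positions at >=5x")
```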

Fig. 3

Distribution of genome-wide (a) and BUSCO core gene regions (b) coverage depth evaluated using HiFi data; Hi-C interaction heatmaps for haplotypes A and B, depicting interactions for (c) reads with mapping quality of 0 or higher (including duplicates) and (d) mapping quality of 1 or higher (excluding duplicates). The color scale represents interaction intensity, with yellow signifying weak and red indicating strong interactions.

Fig. 4

Profiles of repetitive sequence distribution across the chromosomes of C. fascicularis, highlighting telomeric TTTAGGG sequences, tandem repeats, 18S rDNA, and 5S rDNA. The vertical scale denotes the frequency of repetitive sequences in 20 kb segments. Black triangles signify the positions of gaps.
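A profile of the kind shown in Fig. 4 can be approximated by counting motif occurrences in fixed-size windows along each chromosome. The following is a minimal sketch under stated assumptions, not the procedure used to draw the figure: it counts the telomeric unit TTTAGGG and its reverse complement in non-overlapping windows (20 kb by default), ignores occurrences that span a window boundary, and uses hypothetical function names and a toy sequence.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def motif_counts_per_window(sequence, motif="TTTAGGG", window=20_000):
    """Count a motif (both strands) in consecutive non-overlapping windows.

    Occurrences that straddle a window boundary are not counted, which is
    a simplification of this sketch.
    """
    seq = sequence.upper()
    motifs = {motif, reverse_complement(motif)}
    counts = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        counts.append(sum(chunk.count(m) for m in motifs))
    return counts


# Toy chromosome: CCCTAAA repeats at one end, TTTAGGG repeats at the other.
toy = "CCCTAAA" * 3 + "ACGT" * 5 + "TTTAGGG" * 2
toy_counts = motif_counts_per_window(toy, window=30)
print(toy_counts)
```

High counts in the first and last windows of a chromosome are what one would expect when telomeric repeats have been recovered at both ends.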

In this study, the assembled genome is characterized by minimal gaps, high telomere sequence integrity, elevated mapping rates, exceptional completeness, absence of redundancy, and high-resolution Hi-C interaction heatmaps, collectively demonstrating the superior quality of the genome assembly.

Evaluation of the gene annotation

The integrated protein annotation was evaluated using BUSCO, and the results are presented in Supplementary Tables 9 and 10. The evaluation indicated that complete core gene coverage was 99.44%, including 1.98% single-copy genes and 97.46% duplicated genes. The annotation also contained few fragmented genes (0.31%) and missing genes (0.25%), together indicating a high-quality annotation.
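The BUSCO categories reported above are mutually exclusive, so the complete fraction should equal the sum of the single-copy and duplicated fractions, and all categories together should sum to 100%. The sketch below performs this consistency check on the reported percentages; the function name is hypothetical. The high duplicated share is what one would expect when both haplotypes are evaluated together, although that interpretation is an assumption rather than a statement from the results.

```python
def busco_totals(single, duplicated, fragmented, missing):
    """Return (complete, total) percentages from the four BUSCO categories."""
    complete = single + duplicated
    total = complete + fragmented + missing
    return round(complete, 2), round(total, 2)


# Percentages reported for the protein annotation.
complete_pct, total_pct = busco_totals(
    single=1.98, duplicated=97.46, fragmented=0.31, missing=0.25
)
print(f"Complete: {complete_pct:.2f}%  Sum of all categories: {total_pct:.2f}%")
```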