Background & Summary

The species Citrus reticulata, categorized within the Rutaceae family, is recognized as one of the three fundamental species of the Citrus genus1. Archaeobotanical evidence and historical records suggest that the cultivation of this species dates back over 4000 years, with early references appearing in ancient texts such as the “Yu Gong” section of the “Xia Shu,” which chronicled the history of the Xia Dynasty2,3. The peelability of mandarin fruits, a trait highly valued by consumers, facilitates easier access to their nutrient-dense flesh. This characteristic, coupled with their high content of vitamin C and dietary fiber, positions mandarins as one of the best-loved fruits4,5. Moreover, studies have demonstrated that certain local or wild mandarin varieties, which are rich in phenolic compounds and antioxidants, exhibit potential for medicinal applications and serve as beneficial food ingredients6,7.

The south-central regions of China are recognized as the primary centers of origin for the genus Citrus8. Over the past several decades, numerous wild mandarins indigenous to China have been documented, including Citrus mangshanensis, the Mangshan mandarin, and Citrus daoxianensis8,9,10. Wild Citrus species are known for their high genetic diversity, which constitutes a significant genetic resource for the breeding of Citrus crops. This genetic variability provides a wealth of alleles that can be harnessed to improve traits of cultivated varieties, such as disease resistance, fruit quality, and adaptability to different environmental conditions11,12. Studying wild Citrus species not only sheds light on the evolutionary relationships within the genus but also improves our understanding of how key quality traits changed during Citrus domestication.

Xingan mandarin is a wild mandarin discovered on Mao’er Mountain in Xing’an County, northern Guilin (Fig. 1a–d). In December 2013, Deng Chongling’s research group from the Guangxi Academy of Specialty Crops first found wild Citrus reticulata on Mao’er Mountain, the main peak of the Yuechengling Mountains in the Nanling Mountains of Guangxi. This discovery marks the first record of wild mandarin in the Yuechengling Mountains. The site is located in Huajiang Yao Township, Xing’an County, Guilin City, Guangxi Zhuang Autonomous Region, at 25°51′51″N, 110°51′51″E and an altitude of 601 m.

Fig. 1

Photo and genomic characteristics of Xingan mandarin. (a) Xingan mandarin tree. (b) Flowers and leaves during the peak flowering period. (c) Anatomical diagram of Xingan mandarin flower. (d) Xingan mandarin mature fruits. (e) Characterization of the Xingan mandarin genome. The circle from inside to outside, respectively, represents Chromosome ideograms (I), TE density (II), SSR density (III), gene density (IV), and GC density (V).

Xingan mandarin is an arbor with medium tree vigor, a semi-circular crown, an upright growth habit, and fine branches and spines. The leaves are long-oval with an acuminate apex and a wedge-shaped base; leaves of spring shoots are 5.4 cm long and 2.4 cm wide. The fruit is oblate, 3.6 cm in height and 4.6 cm in diameter, with an average weight of 46.2 g. The peel is yellow and rough, with large, distinctly raised oil cells, and the flesh is orange-yellow. Fruits contain 9.8 seeds on average, and the seeds are nearly spherical. The tree exhibits favorable traits such as vigorous growth, strong resistance to pests and diseases, and good fruiting habits. In particular, Xingan mandarin shows obvious resistance to citrus Huanglongbing: after infection, its bacterial content was significantly lower than that of other materials, and growth remained normal. This germplasm is thought to harbor valuable genes associated with high yield, quality, disease resistance, and abiotic stress tolerance. It thus serves as an important genetic resource for research on the origin, classification, zoning, genetic breeding, and high-yield cultivation of citrus.

In the present study, we combined Illumina short-read sequencing, PacBio long-read sequencing, and Hi-C sequencing. This integrated approach enabled the assembly, annotation, and chromosome-level anchoring of the Xingan mandarin genome (Fig. 1e). The assembly should facilitate the discovery of key genes associated with agronomic traits and provides a resource for exploring the potential medicinal applications of this germplasm.

Methods

Sample collection

All samples designated for sequencing were procured from the Citrus germplasm resource garden at the Guangxi Academy of Specialty Crops (Guilin). We utilized tender leaves from Xingan mandarin trees for Illumina, HiFi, and Hi-C sequencing. To ensure exhaustive capture of transcriptomic data, we collected a variety of tissues from the Xingan mandarin, including roots, stems, leaves, flowers, seeds, as well as fruits at both immature and mature stages. These samples were rapidly frozen in liquid nitrogen and subsequently stored at −80 °C, awaiting the extraction of DNA and RNA.

Library construction and sequencing

Genomic DNA was extracted from the leaf tissue of the Xingan mandarin using the established CTAB method. After extraction, short-read libraries, each with a read length of 350 base pairs, were constructed using a dedicated library construction kit. The libraries were then sequenced on the HiSeq 2500 platform (Illumina, CA, USA). Sequencing produced a total of 18.89 Gb of data, corresponding to an overall sequencing depth of approximately 64× of the genome. The GC content was approximately 36.19%, and the Q20 and Q30 ratios surpassed 97.27% and 94.91%, respectively. The cleaned reads were used for genomic surveys, encompassing assessments of genome size, GC content, and heterozygosity.

To obtain HiFi sequencing data, after the samples passed the quality control test, the genomic DNA fragments were selected using BluePippin, then subjected to end repair and A-tailing. Subsequently, adapters were ligated to both ends of the fragments to prepare a DNA library. After the library passed qualification, the sequencing operation was conducted using the PacBio Revio platform (Pacific Biosciences, Menlo Park, CA, USA), guided by the library’s effective concentration and the specifications for data output. PacBio HiFi sequencing generated approximately 12.71 Gb of clean data, with an overall sequencing depth of approximately 46 × of the genome. The reads exhibited an N50 of 16.26 kb and an average read length of 15.99 kb.

For Hi-C sequencing, the library underwent quality assessment covering library concentration, insert size, and molar concentration. Quality control comprised three steps: 1) initial evaluation of library concentration using Qubit 2.0 (Invitrogen, CA, USA); 2) assessment of library DNA fragment integrity and insert size using Agilent 2100 (Agilent Technologies, CA, USA); 3) exact quantification of the effective library concentration by qPCR. After library qualification, high-throughput sequencing was performed on the Illumina platform with a read length of PE150. In total, 35.25 Gb of clean data were generated, with an overall sequencing depth of approximately 108× of the genome, and the Q20 and Q30 ratios surpassed 95.55% and 92.34%, respectively.

Genome survey and assembly

The HiFi long reads generated on the PacBio platform were quality-filtered using fastp (v0.23.4)13 under default parameters. The filtered reads were then used for genome size estimation. We used Jellyfish (v2.2.10)14 to count 17-mers and assessed the genome characteristics using GenomeScope (v2.0)15 (Fig. 2). The genome size of Xingan mandarin was estimated to be 328.20 Mb, with approximately 55.62% repeat sequences and a heterozygosity rate of ~1.32%. The genome of this species is therefore classified as highly heterozygous and complex.

Fig. 2

K-mer distribution plot of Xingan mandarin. The plot shows the frequency distribution of 17-mers in the Xingan mandarin genome; the x-axis gives the k-mer depth and the y-axis gives the frequency of k-mers at that depth.
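The idea behind a k-mer-based genome survey can be sketched in a few lines of Python. This is a simplified illustration, not the Jellyfish/GenomeScope workflow used in this study: the function names are hypothetical, k-mers are not canonicalized, and genome size is taken as total k-mer occurrences divided by the depth of the main histogram peak, whereas GenomeScope fits a mixture model that also accounts for heterozygosity and repeats.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across an iterable of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map k-mer depth -> number of distinct k-mers observed at that depth."""
    return Counter(counts.values())


def estimate_genome_size(hist, min_depth=2):
    """Estimate genome size as (total k-mer occurrences) / (main peak depth).

    Depths below min_depth are treated as sequencing errors and ignored
    (an assumption of this sketch). Returns (estimated_size, peak_depth).
    """
    usable = {depth: n for depth, n in hist.items() if depth >= min_depth}
    if not usable:
        raise ValueError("histogram has no depths at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth, peak_depth
```

In a highly heterozygous genome the histogram has two peaks (heterozygous and homozygous), so picking a single maximum as above can misestimate the size; this is one reason a model-based tool is preferred in practice.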

The HiFi long reads were assembled using Hifiasm (v0.19.8-r603)16, yielding a total contig length of 340.74 Mb and a contig N50 of 29.32 Mb. The contigs were compared against the NCBI nucleotide sequence database and the mitochondrial and plastid databases (https://www.ncbi.nlm.nih.gov/refseq/) to filter out mitochondrial and plastid sequences from the assembly. After filtering, the total contig length was 332.40 Mb, and the contig N50 remained 29.32 Mb (Table 1).
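For readers unfamiliar with the N50 statistic reported here, the following minimal Python sketch computes it from a list of contig lengths. The function name and the toy lengths are illustrative only and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    account for at least half of the total length.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy example (arbitrary units): dropping the two smallest contigs
# changes the total length but leaves the N50 unchanged.
before = [30, 29, 20, 5, 1]
after = [30, 29, 20]
```

With these toy values, `n50(before)` and `n50(after)` both return 29, which illustrates how removing short organellar contigs can reduce the total assembly length without changing the contig N50.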

Table 1 Statistics of Hi-C assembly data.

To anchor contigs into chromosomal scaffolds, we first generated clean read pairs from the Hi-C library and aligned them to the polished Xingan mandarin genome using BWA (v0.7.17)17 with default parameters. Paired-end reads mapping to different contigs were then used for Hi-C-based scaffolding. A stringent filtering step was applied to remove invalid reads, including those derived from random breaks, self-ligation, non-ligation events, and fragments with abnormally large or small sizes. Subsequent ordering and orientation of the filtered contigs were performed using Lachesis (https://github.com/shendurelab/LACHESIS)18. This workflow yielded the first high-quality chromosomal-level assembly of Xingan mandarin, with individual chromosome lengths ranging from 25.43 Mb to 49.65 Mb and collectively accounting for 93.08% of the total assembly length (Fig. 3; Table 1). The final chromosome-scale genome assembly spanned 325.12 Mb, with contig N50 and scaffold N50 values reaching 29.32 Mb and 29.62 Mb, respectively. In this study, the final assembled genome size is found to align closely with both the previously reported citrus genomes and the k-mer-based estimates.

Fig. 3

The Hi-C interaction heatmap illustrates the chromosomal interactions in Xingan mandarin.

Repeat element identification

Transposable elements (TEs) and tandem repeats were annotated separately. TEs were identified using an integrated approach combining homology-based and de novo strategies. We first constructed a de novo repeat library of the genome using RepeatModeler (v1.0.5)19, which runs two de novo repeat discovery tools, RECON (v1.0.8)20 and RepeatScout (v1.0.6)21. We then identified and characterized full-length long terminal repeat retrotransposons (fl-LTR-RTs) using both LTRharvest (v1.5.9)22 and LTR_finder (v2.8)23, and generated high-quality, intact fl-LTR-RTs and a non-redundant LTR library using LTR_retriever (v2.9.0)24.

To create a species-specific, non-redundant TE library, we combined the de novo TE sequences with the Dfam database (v3.5). The final TE sequences in the Xingan mandarin genome were then identified and classified by homology search against this library using RepeatMasker (v4.12)25. For the annotation of tandem repeats, we used Tandem Repeats Finder (v4.09)26 and the Microsatellite Identification Tool (MISA, v2.1)27. In the Xingan mandarin genome, we identified 112.96 Mb (34.75% of the genome) as TEs and 39.61 Mb (12.18% of the genome) as tandem repeats. Most of the TEs (26.27% of the genome) were Class I retrotransposons, with Gypsy elements being the most prevalent (11.47% of the genome). Class II DNA transposons made up 8.47% of the Xingan mandarin genome (Table 2).

Table 2 Statistical information of transposable element sequences.

Protein-coding genes prediction

Protein-coding genes were annotated by combining three complementary approaches: de novo prediction, homology search, and transcript-based assembly. De novo gene models were generated using two ab initio gene prediction tools, Augustus (v3.1.0)28 and SNAP (v2006-07-28)29. For the homology-based strategy, we used GeMoMa (v1.7)30 with reference gene models derived from other Citrus species.

For transcript-based prediction, RNA-sequencing data were aligned to the reference genome using Hisat (v2.1.0)31 and assembled using Stringtie (v2.1.4)32. Following transcript assembly, we used GeneMarkS-T (v5.1)33 to perform gene prediction. Moreover, gene prediction based on unigenes and full-length transcripts (derived from PacBio sequencing) was performed using PASA (v2.4.1)34 software. These sequences were assembled using Trinity (v2.11)35. The gene models resulting from these methods were integrated using EVM (v1.1.1)36 and subsequently updated with PASA. In total, 30,581 protein-coding genes were predicted in the Xingan mandarin genome, with an average length of 3,360.50 bp (see Table 3 for details).

Table 3 Statistics of gene structure prediction.

Functional annotation of protein-coding genes

Gene function was inferred from the best alignment match against several protein databases, including NR, EggNOG37, KOG, TrEMBL38, InterPro, and Swiss-Prot38. These alignments, together with searches against the Kyoto Encyclopedia of Genes and Genomes (KEGG) database40, were performed using diamond blastp (v0.9.29.130)39 with an E-value threshold of 1E−3. Protein domains were annotated using InterProScan (v5.34-73.0)41 based on the InterPro protein databases. Motifs and domains within gene models were identified using the Pfam42 database. The corresponding Gene Ontology (GO) IDs for each gene were collated from TrEMBL, InterPro, and EggNOG. Altogether, 89.08% (27,242) of the predicted protein-coding genes were functionally annotated with known genes, conserved domains, or GO terms (Table 4).
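The best-match assignment with an E-value cutoff can be illustrated with a short Python sketch. It assumes the standard 12-column BLAST tabular output (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and assumes that the "best" hit is the one with the highest bit score; the function name and both assumptions are illustrative rather than a description of the exact procedure used here.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-3):
    """Pick the best hit per query from BLAST-style tabular text.

    Returns {query_id: (subject_id, evalue, bitscore)}, keeping for each
    query the highest-bitscore hit whose E-value is <= max_evalue.
    Assumes the 12 standard tab-separated columns (evalue in column 11,
    bitscore in column 12).
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # hit fails the E-value threshold
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

Genes with no hit passing the threshold are simply absent from the returned dictionary, which corresponds to genes left without a functional annotation from that database.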

Table 4 Summary of gene function annotations.

Annotation of non-coding RNA genes

Non-coding RNAs, which include functionally characterized RNAs such as miRNA, rRNA, and tRNA, do not encode proteins. To predict these non-coding RNAs, we adopted different strategies based on their structural characteristics. For tRNA identification, we used tRNAscan-SE (v1.3.1)43 with default parameters; rRNAs were predicted using barrnap (v0.9)44 with default parameters. miRNAs, snoRNAs, and snRNAs were predicted using the Rfam (v14.5)45 database and Infernal (v1.1)46 software. In total, we identified 2,293 ncRNAs in the Xingan mandarin genome, comprising 728 rRNAs, 415 tRNAs, 166 miRNAs, 325 snRNAs, and 659 snoRNAs.

Data Records

The raw Hi-C short reads, Illumina DNA short reads, PacBio DNA long reads, and RNA short reads have been deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive database with accession numbers SRR3182379947, SRR3182379848, SRR3182379749, and SRR3182379650. The genome assembly has been deposited in NCBI under accession number JBKFGA00000000051. The annotation files have been uploaded to figshare52.

Technical Validation

To evaluate the accuracy of gene annotation, we compared the distributions of gene length, CDS length, exon length, and intron length in Xingan mandarin with those of Arabidopsis thaliana, pummelo (C. grandis), Mangshan mandarin, C. reticulata cv. Ponkan, and sweet orange (C. sinensis). The results showed that the gene structural features of Xingan mandarin are highly similar to those of other Citrus species, as illustrated in Fig. 4. Furthermore, we performed a BUSCO assessment on the predicted gene set using the embryophyta_odb10 database. The results showed that the predicted gene set has a completeness of 99.01%, including 97.71% complete and single-copy orthologs and 1.30% complete and duplicated orthologs, with only 0.37% fragmented orthologs and 0.62% missing orthologs. These BUSCO results indicate that our predicted gene set is highly complete, thus reflecting the reliability of the genome annotation.

Fig. 4

Distribution of gene structural features across six species. Subpanels (a–d) show comparative analyses of (a) gene length, (b) CDS length, (c) exon length, and (d) intron length between Xingan mandarin (Citrus reticulata ‘Xingan’) and five other species: Arabidopsis thaliana, C. grandis, C. reticulata ‘Mangshan’, C. sinensis, and C. reticulata cv. Ponkan.

Chromosomal synteny analysis using MCScanX53 (with BLAST E-value ≤ 1E−10) demonstrated strong collinearity among Xingan mandarin, Mangshan mandarin, and the mandarin haplotype of C. sinensis, characterized by nearly one-to-one chromosomal correspondence (Fig. 5). Notably, Xingan mandarin chromosomes exhibited a higher number of mapping fragments in Mangshan mandarin than in the mandarin haplotype of C. sinensis, providing empirical evidence for the reliability of the assembled chromosomal sequences.

Fig. 5

Linear collinearity plot among Xingan mandarin, Mangshan mandarin, and Mandarin haplotype of C. sinensis.

To assess genome assembly quality, we ran BUSCO (v5.2.1)54 with the embryophyta_odb10 database, a curated set of single-copy orthologs conserved across major evolutionary lineages. This gene set was compared with the assembled genome to quantify the proportion and completeness of orthologous genes. The final assembly achieved a BUSCO completeness score of 99.01%, reflecting high gene space integrity. Additionally, the assembly had a long terminal repeat (LTR) assembly index (LAI)55 of 20.83, exceeding the threshold of 20 that defines a “golden reference” genome and indicating high structural integrity for LTR sequences. Using Merqury v1.356, we assessed the accuracy of the genome assembly with short-read sequencing data, obtaining a QV score of 46.65, which corresponds to a per-base error rate of approximately 0.002% and reflects high base-level accuracy.
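The relationship between a consensus quality value and the per-base error rate follows the standard Phred definition, QV = −10 · log10(P), where P is the error probability. The minimal Python sketch below converts in both directions; the function names are illustrative and are not part of Merqury.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if error_rate <= 0:
        raise ValueError("error rate must be positive")
    return -10 * math.log10(error_rate)
```

For a QV of 46.65, `qv_to_error_rate` gives roughly 2.2 × 10⁻⁵, that is, about one error in every 46,000 bases (approximately 0.002%).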

Finally, to gauge the completeness of the assembly and the uniformity of sequencing coverage, Illumina short reads and HiFi reads were mapped to the assembled genome using BWA17 and Minimap257, respectively. Completeness and coverage uniformity were evaluated based on alignment rates, the proportion of the genome covered, and the distribution of sequencing depths. The Illumina short reads showed an alignment rate of 96.68%, a coverage of 99.81%, and an average sequencing depth of 50×. The HiFi reads showed an alignment rate of 99.61%, a coverage of 99.99%, and an average sequencing depth of 35×. Together, these analyses provide strong empirical evidence for the completeness and accuracy of the Xingan mandarin genome assembly, establishing a robust foundation for subsequent functional genomic studies and comparative analyses.
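The three mapping metrics used above can be made concrete with a small Python sketch. It assumes a per-base depth value is available for every position of the assembly (for example, as produced by a depth-reporting tool that includes zero-depth positions) and that mapped and total read counts are known; the function names and this input layout are assumptions made for illustration, not the pipeline used in this study.

```python
def alignment_rate(mapped_reads, total_reads):
    """Fraction of reads that aligned to the assembly."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads / total_reads


def coverage_stats(depths):
    """Summarize per-base depths over the whole assembly.

    depths: sequence of read depths, one value per assembly position
    (positions with no reads must be present as 0).
    Returns the fraction of positions covered by at least one read
    and the mean depth across all positions.
    """
    n_positions = len(depths)
    if n_positions == 0:
        raise ValueError("no depth values given")
    covered = sum(1 for d in depths if d > 0)
    return {
        "coverage": covered / n_positions,
        "mean_depth": sum(depths) / n_positions,
    }
```

For example, `coverage_stats([0, 10, 20, 30])` reports a coverage of 0.75 and a mean depth of 15.0. Whether the mean is taken over all positions or only covered ones is a convention that varies between tools, so the choice made here is only one possibility.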