Background & Summary

Pitaya, also known as “pitahaya” or “dragon fruit”, refers to the fruits of several species in the genus Selenicereus of the family Cactaceae. It is a rapidly growing and economically promising emerging fruit. Although there is no definitive evidence, it is widely believed that pitaya originated in North America and has been a traditional fruit in that region1. In recent years, pitaya has gained popularity in Asian countries, especially in Vietnam and China, where it has become a major fruit crop. In 2021, China surpassed Vietnam in terms of cultivation area, becoming the country with the largest pitaya cultivation area2. Pitaya is not only a delicious fruit but also rich in nutrients such as vitamin C, fiber, and various minerals. However, our understanding of the biological background of pitaya is still limited. Genome sequencing has played a crucial role in advancing various aspects of basic biology, and a high-quality reference genome can greatly enhance our understanding of the genetic basis and evolutionary processes underlying the biological characteristics of pitaya. While two versions of the pitaya genome, with genome sizes of 1.41 Gb and 1.33 Gb, N50 values of 127.15 Mb and 109.7 Mb, have recently been published3,4, further enhancements are required in terms of genome assembly completeness and annotation. In-depth studies on the pitaya genome will help uncover its genetic characteristics, interspecific differences, and gene regulatory networks, thus promoting genetic improvement and commercial cultivation of pitaya.

The aim of this study was to sequence, assemble, and annotate the genome of pitaya to obtain comprehensive genomic information and explore its genetic diversity and biological characteristics, providing strong support for genetic improvement and conservation of pitaya. Using PacBio HiFi long reads, Oxford Nanopore ultra-long reads, and high-throughput chromosome conformation capture (Hi-C) data, we assembled the genome with hifiasm5, producing a chromosome-level, haplotype-resolved, high-quality genome assembly for pitaya. The two haplotypes had sizes of approximately 1.47 Gb and 1.44 Gb, respectively, and each contained 21 telomeres, nearly reaching the telomere-to-telomere (T2T) level, with N50 values of 133.4 Mb and 132.6 Mb, respectively. Subsequent gene prediction and annotation identified a total of 34,673 genes, covering most regions of the pitaya genome. Through functional annotation and analysis of the pitaya genome, we identified numerous genes related to its biological characteristics and agronomic traits, including fruit color, texture, quality, and disease resistance.

In addition, the genome assembly showed a high level of completeness, with BUSCO completeness scores of 97.7% and 94.4% for the hap1 and hap2 genomes, respectively. Repetitive sequences accounted for 58% of the genome assembly. A total of 29,139 protein-coding genes were identified, with functional annotations available for 27,117 genes. Transcriptome analysis revealed that at least 24,893 genes were expressed in at least one tissue.

This study successfully sequenced, assembled, and annotated the genome of pitaya, providing an important foundation for further research on its genetic characteristics, biological features, and cultivation improvement. The decoding of the pitaya genome helps reveal its genetic characteristics and interspecific differences, thereby providing strong support for breeding improvement, variety identification, and resource conservation of pitaya, and promoting the sustainable development of the pitaya industry.

Methods

Sample collection and preparation

‘Dahong’ (Selenicereus polyrhizus) is one of the main cultivated varieties in China and is known for its excellent quality and high yield. The sequencing material selected for this study was a five-year-old ‘Dahong’ plant obtained from the germplasm nursery of the Fruit Research Institute, Guangdong Academy of Agricultural Sciences (113.3708° E, 23.1502° N). Multiple tissues, including stem, tender shoots, flower bud, calyx, petals, style, stigma, filament, anther, pollen, fruit peel, fruit flesh, and root, were collected from the same individual plant. These 13 samples were immediately frozen in liquid nitrogen and stored at −80 °C. Genomic DNA extraction and sequencing, as well as RNA sequencing, were carried out by Biomarker (Biomarker Technologies Co., LTD in Beijing, China).

Genome and transcriptome sequencing

The genomic DNA for PacBio HiFi sequencing was extracted using the QIAamp DNA Mini Kit (QIAGEN) and processed according to the standard protocol provided by PacBio, which included sample quality assessment, library preparation, library quality assessment, and sequencing. In this project, the average CCS length exceeded 20,038 bp, with the longest CCS read achieving 56,993 bp. A total of 81.62 Gb (~56×) of valid data was obtained.

The Monarch® HMW DNA Extraction Kit for Tissue (New England Biolabs, T3060) was used to extract high-quality DNA for ultra-long DNA sequencing, following the kit protocol. The DNA library was constructed using the SQK-LSK109 Kit (Oxford Nanopore Technologies, Oxford, UK), followed by single-molecule sequencing on the GridION X5/PromethION. All procedures were conducted in accordance with the instruction manual. The effective ONT data output for this project was 52.65 Gb (~36×), with an average read length of 101,938 bp and a longest read of 907,782 bp. The statistical results are summarized in Table 1, and the read length distribution is shown in Fig. 1.

Table 1 Length distribution of ONT ultra-long and PacBio HiFi CCS reads.
Fig. 1 Read length and base quality distribution of the ONT ultra-long (A,C) and PacBio HiFi CCS (B,D) sequencing data.

Hi-C library preparation, sequencing and genome assembly

The Hi-C method is a chromosome conformation capture technique that integrates high-throughput sequencing6. The Hi-C library was prepared through the following steps: DNA cross-linking using formaldehyde as a fixative agent, restriction enzyme cleavage using DpnII, end-repair with biotin-labeled bases, circularization, and DNA capture and purification. The concentration and insert size of the library were examined using Qubit 2.0 and Agilent 2100, respectively. qPCR was then used for accurate quantification of the library to ensure an adequate amount of DNA. For this Hi-C library, truncated reads accounted for about 30% of total reads, well above the general standard of at least 10%, indicating a high-quality library. After sequencing, a total of 144.64 Gb (~100×) of clean data was obtained, with a Q30 ratio exceeding 93.70%.
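The approximate depths quoted for the three DNA data sets (~56×, ~36×, and ~100×) follow from dividing the data yield by the genome size. The sketch below reproduces that arithmetic. The genome size of 1.45 Gb is an assumption chosen here because it is consistent with the quoted depths; the text does not state which value was used.

```python
# Assumed genome size in Gb. This value is NOT given in the text; it is
# chosen because it is consistent with the depths quoted there.
ASSUMED_GENOME_SIZE_GB = 1.45

# Data yields (Gb) as reported in the text.
yields_gb = {
    "PacBio HiFi": 81.62,
    "ONT ultra-long": 52.65,
    "Hi-C": 144.64,
}


def sequencing_depth(yield_gb: float, genome_size_gb: float) -> float:
    """Return the mean depth (fold coverage) for a given data yield."""
    return yield_gb / genome_size_gb


depths = {
    name: sequencing_depth(gb, ASSUMED_GENOME_SIZE_GB)
    for name, gb in yields_gb.items()
}

for name, depth in depths.items():
    print(f"{name}: ~{depth:.0f}x")
```

Under this assumed genome size, the rounded depths match the values given in the text.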

RNA library construction and transcriptome sequencing

Equal amounts of total RNA from all 13 samples (stem, tender shoots, flower bud, calyx, petals, style, stigma, filament, anther, pollen, fruit peel, fruit flesh, and root) were pooled and sent to Biomarker Technologies Co., LTD in Beijing, China for library construction and sequencing. Two sequencing platforms were employed. Sequencing on the NovaSeq 6000 platform (150 bp paired-end mode) produced a total of 41.08 M reads and 12.29 Gb of clean data, with a Q30 base percentage of 94.05%. Full-length mRNA sequencing was conducted on the Oxford Nanopore sequencing platform, yielding 12.87 Gb of clean data comprising a total of 10,729,994 sequences, with an N50 value of 1,423 bp.

Chromosome-level genome assembly

The primary assembly was generated with hifiasm5 v0.19.6-r595, which integrated PacBio HiFi reads, Oxford Nanopore Technologies ultra-long reads, and Hi-C reads to produce two distinct haplotype-resolved primary assemblies. Hi-C reads were aligned to the two haplotype assemblies using chromap7 v0.2.5-r473 and samtools8 v1.20, and contigs were then anchored to chromosomes with YaHS9 v1.2a.1 using the Hi-C data. The anchored results were manually adjusted in Juicebox10 v1.11.08 to refine chromosome segment boundaries and correct visible assembly errors. Finally, NextPolish211 v0.2.0 was applied with HiFi reads to fill gaps between contigs, yielding two chromosome-level, haplotype-resolved genome assemblies.

Haplotype 1 consists of 11 chromosomes and 519 unanchored scaffolds, with an N50 value of 133.4 Mb and a genome size of 1.48 Gb. The 11 chromosomes range in size from 96.03 Mb to 164.45 Mb, collectively representing 96.7% of the total genome size (Fig. 2, Tables 2, 3).

Fig. 2 An overview of the haplotype-resolved genome assembly of Selenicereus polyrhizus. (A) Chromosome ID and size, (B) GC content, (C) protein-coding gene density, and (D) analysis of collinearity.

Table 2 Assembly statistics for the two haplotypes.
Table 3 Chromosome and corresponding centromere statistics for the two haplotypes.

Haplotype 2 comprises 11 chromosomes and 200 unanchored scaffolds, with an N50 value of 133.4 Mb and a genome size of 1.44 Gb. The 11 chromosomes range in size from 87.84 Mb to 165.42 Mb, collectively representing 98.4% of the total genome size (Fig. 2, Tables 2, 3).
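For readers unfamiliar with the N50 statistic reported for each haplotype, the following is a minimal sketch of the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The input lengths are toy values for illustration, not the chromosome or scaffold lengths of this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


# Toy lengths (arbitrary units), not taken from the assembly.
toy_lengths = [50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

Because 11 chromosomes account for more than 96% of each haplotype, the N50 of such an assembly is necessarily the length of one of the chromosomes.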

The CentroMiner and TeloExplorer functions in quarTeT12 v1.2.1 were used with the “-c plant” option to identify centromeres and telomeres in the genome. Across the 22 chromosomes of the two haplotypes, this detected 22 centromeres ranging from 123.5 kb to 9.86 Mb and 42 telomeres, in which the sequence unit (AAACCCT) is repeated at least 170 times within the 10 kb regions at the chromosome ends (Tables 3, 4). Interestingly, each haplotype lacks one telomere on chromosome 11 (Fig. 3). In summary, this assembly can be regarded as having nearly achieved telomere-to-telomere and gap-free completeness.
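The telomere criterion described above (at least 170 copies of the AAACCCT unit within 10 kb of a chromosome end) can be illustrated with a simple counting sketch. This is not the TeloExplorer algorithm; it is a simplified illustration that counts exact, non-overlapping matches of the unit or its reverse complement in each terminal window. The function names and the toy sequence are hypothetical.

```python
TELOMERE_UNIT = "AAACCCT"   # repeat unit reported in the text
REVERSE_UNIT = "AGGGTTT"    # its reverse complement


def count_unit(window: str, unit: str) -> int:
    """Count non-overlapping exact occurrences of a repeat unit."""
    return window.upper().count(unit)


def telomere_ends(seq: str, window: int = 10_000, min_repeats: int = 170):
    """Return (start_has_telomere, end_has_telomere) for one sequence.

    Simplified criterion: the terminal window must contain at least
    `min_repeats` exact copies of the unit in either orientation.
    """
    flags = []
    for region in (seq[:window], seq[-window:]):
        hits = max(count_unit(region, TELOMERE_UNIT),
                   count_unit(region, REVERSE_UNIT))
        flags.append(hits >= min_repeats)
    return tuple(flags)


# Toy chromosome: a telomere at the start only (illustrative data).
toy_chromosome = TELOMERE_UNIT * 200 + "ACGT" * 10_000
print(telomere_ends(toy_chromosome))
```

With 11 chromosomes per haplotype, a complete assembly would carry 11 × 2 × 2 = 44 telomeres; the 42 reported here correspond to one missing end per haplotype.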

Table 4 Repeat number of the telomere sequence unit on the chromosomes of the two haplotypes.
Fig. 3 Telomere distribution in the assembly of the two haploid genomes (A, Haplotype 1; B, Haplotype 2).

Identification of repetitive elements

To identify repeat sequences within the genome, de novo prediction and homology-based prediction were combined. For de novo prediction, long terminal repeat (LTR) retrotransposons were first predicted using LTRharvest13 (part of genometools14 v1.6.5) and LTRfinder15, and the predictions were filtered and integrated using LTR_retriever16 v2.9.8. MITE-Hunter17 was then used to predict miniature inverted-repeat transposable elements (MITEs), followed by a further de novo prediction using RepeatModeler18 v2.0.4. The results from these steps were combined to produce the final de novo prediction. For homology-based prediction, RepeatMasker19 v4.1.6 and Tandem Repeats Finder20 (TRF v4.09) were used. The de novo and homology-based predictions were then merged, and bedtools21 v2.31.0 was used to generate a soft-masked genome. Haplotype 1 was found to contain 58.94% repeat sequences and haplotype 2 58.37%, indicating that approximately 58% of the ‘Dahong’ genome consists of repeat sequences.
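In a soft-masked genome, repeat regions are conventionally written in lowercase, so the repeat proportion can be estimated as the fraction of lowercase bases. The sketch below illustrates this under that assumption; it is not the procedure the authors used to derive the reported percentages, and the input records are toy data.

```python
def softmasked_fraction(fasta_lines):
    """Return the fraction of lowercase (soft-masked) bases in FASTA lines."""
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Toy FASTA content (illustrative only).
toy_fasta = [">chr1", "ACGTacgtAC", "ggttACGTAC"]
print(f"{softmasked_fraction(toy_fasta):.2%}")
```

For a real assembly, the same function could be applied to an open file handle, for example `softmasked_fraction(open(path))`, where `path` is a hypothetical path to the soft-masked FASTA file.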

Gene identification and functional annotation

Gene identification is a complex process, and three strategies were combined to ensure the accuracy and completeness of gene prediction and annotation. First, ab initio prediction was conducted using SNAP22 (Semi-HMM-based Nucleic Acid Parser, version 2013-11-29) and glimmerHMM23 v3.0.4. Second, homology-based prediction was performed using GeMoMa24, with the homologous protein library sourced from the model plants Arabidopsis thaliana, rice, and Populus trichocarpa. Third, prediction based on direct RNA-seq evidence was carried out: the RNA-seq data were assembled using Trinity25 and StringTie26, and the assembly results were integrated using PASA27. In addition, the RNA-seq data were aligned to the genome using hisat228 v2.31.0, and genes were predicted from the alignments using the BRAKER329 pipeline, which integrates GeneMark-EFP30, AUGUSTUS31 and TSEBRA32. Finally, the results obtained from the three methods were integrated using EVidenceModeler27 v2.1.0 (EVM) to obtain the final prediction. A total of 29,139 protein-coding genes were predicted in the hap1 genome and 28,538 in the hap2 genome. The completeness of the prediction results was evaluated using BUSCO33 (v5.3.2) with the embryophyta_odb10 dataset, which indicated that the completeness of the protein sequences for the two haplotypes reached 97.6% and 97.5%, respectively.

For functional annotation, we integrated the results from three independent methods. Firstly, the protein sequences were locally annotated for functional analysis using InterProScan34 v5.60 with the InterPro database (v100.0). Subsequently, BLAST comparisons were performed against UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, NCBI nonredundant protein (NR), and Arabidopsis databases with an e-value threshold of 1e-5. Finally, gene sequences were mapped to the eggNOG35 v6.0 database using eggNOG-mapper36 v2.1.14 for protein annotation based on gene ontology (GO) terms. Upon statistical analysis of the annotation results, it was observed that 27,256 protein-coding genes (93.5%) in haplotype 1 and 26,703 protein-coding genes (93.6%) in haplotype 2 were annotated by at least one of the databases.
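The annotation rates quoted above can be checked directly from the reported gene counts. The short sketch below performs that arithmetic using only numbers stated in the text.

```python
# (annotated genes, predicted protein-coding genes), as reported in the text.
annotation_counts = {
    "haplotype 1": (27_256, 29_139),
    "haplotype 2": (26_703, 28_538),
}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


annotation_rates = {
    name: round(percent(annotated, predicted), 1)
    for name, (annotated, predicted) in annotation_counts.items()
}

for name, rate in annotation_rates.items():
    print(f"{name}: {rate}% of protein-coding genes annotated")
```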

For non-coding RNA genes, we employed additional strategies for annotation and identification. Initially, Infernal37 v1.074 was used in conjunction with the Rfam database38 to retrieve non-coding RNA genes. Subsequently, tRNAscan-SE39 v2.0.972 and barrnap40 v0.973 (https://github.com/tseemann/barrnap) were employed to annotate tRNA and rRNA, respectively, using default parameters. Finally, the annotation results were merged, and redundant annotations were removed. In haplotype 1, we annotated 18,378 non-coding genes (11,723 rRNA, 5,215 tRNA, 101 miRNA, 396 snoRNA, and 943 others). In haplotype 2, we annotated 19,458 non-coding genes (14,948 rRNA, 2,870 tRNA, 102 miRNA, 684 snoRNA, and 854 others). Details can be found in Table 5.
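As a consistency check on the non-coding RNA counts above, the per-category numbers can be summed and compared with the reported totals. The sketch uses only values stated in the text.

```python
# Per-category non-coding RNA gene counts, as reported in the text.
ncrna_counts = {
    "haplotype 1": {"rRNA": 11_723, "tRNA": 5_215, "miRNA": 101,
                    "snoRNA": 396, "others": 943},
    "haplotype 2": {"rRNA": 14_948, "tRNA": 2_870, "miRNA": 102,
                    "snoRNA": 684, "others": 854},
}

# Totals as reported in the text.
reported_totals = {"haplotype 1": 18_378, "haplotype 2": 19_458}

computed_totals = {
    name: sum(categories.values()) for name, categories in ncrna_counts.items()
}

for name, total in computed_totals.items():
    status = "matches" if total == reported_totals[name] else "differs from"
    print(f"{name}: {total} ({status} the reported total)")
```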

Table 5 Protein-coding gene and non-coding RNA annotation.

Synteny analysis between haplotype assemblies

The two haplotypes and a previously published chromosome-level genome3 (herein referred to as the reference genome) were aligned using minimap241 with the parameter “-ax asm5”, and a dot plot illustrating the collinearity was generated using the R package ‘pafr’, highlighting the high similarity between the two haplotypes and the reference genome (Fig. 4A,B). Subsequently, the two haplotypes were aligned to each other using minimap2 with the parameters “-ax asm5 --eqx”, and SyRI42 (Synteny and Rearrangement Identifier) v1.6.3 was used with default parameters to detect synteny and structural variations between them. The results were visualized using Plotsr43 v1.1.1, revealing a total of 1,946 syntenic regions (approximately 930 Mb), 214 inversions (approximately 173 Mb), and 2,810 translocations (approximately 25.5 Mb), with 606 duplications (5.5 Mb) identified on haplotype 1 and 2,237 duplications (9.7 Mb) detected on haplotype 2 (Fig. 4C, Table 6). Additionally, 9,487,343 SNPs, 337,977 insertions, and 347,207 deletions were identified (Table 6). We speculate that the genetic variation observed between the two haplotypes may be attributed to interspecific hybridization among several Selenicereus species; in particular, we cannot dismiss the possibility that ‘Dahong’ is itself a product of multi-species hybridization, which would be a source of variation between the two haploid genomes.

Fig. 4 Comparison of two haplotypes with the reference genome. (A) Dot-plot of synteny blocks between Haplotype 1 and the reference genome; (B) Dot-plot of synteny blocks between Haplotype 2 and the reference genome; (C) Structural variations between Haplotype 1 and Haplotype 2.

Table 6 Details of structural and sequence variations between the two assemblies.

Data Records

The whole-genome sequence data reported in this paper have been deposited in the Genome Warehouse44 in the National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation, under accession numbers GWHEUSQ00000000.1 and GWHEUSR00000000.1, which are publicly accessible at https://ngdc.cncb.ac.cn/gwh. The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive45 in the National Genomics Data Center (Nucleic Acids Res 2022), China National Center for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences (GSA: CRA017679) under the BioProject accession number PRJCA026852 and are publicly accessible at https://ngdc.cncb.ac.cn/gsa. They include PacBio HiFi sequencing data (CRX1135011), ONT ultra-long sequencing data (CRX1135012), Hi-C sequencing data (SRR25859129), Illumina mixed-sample RNA sequencing data (CRR1232453), and ONT mixed-sample RNA sequencing data (CRR1232454).

The data mentioned above has also been submitted to the National Center for Biotechnology Information (NCBI) SRA database. The data can be accessed under the bioproject PRJNA1192404, with the accession numbers SRR31557923 to SRR3155792746,47,48,49,50. Furthermore, the genome assembly results have been deposited in the DDBJ/ENA/GenBank, with accession numbers for the two haploid genomes designated as JBJNHD00000000051 and JBJNHE00000000052, respectively. Additionally, genome repeat annotations and functional annotations—including files for coding sequences (CDS), protein-coding regions, repeat annotations, and documentation of genetic variations between the two haploids—have been submitted to Figshare53.

Technical Validation

Evaluation of the sequenced read and quality control

First, the raw sequencing data were filtered to remove adapters and low-quality reads, yielding high-quality clean data. For the PacBio Circular Consensus Sequencing (CCS) data, we employed the ccs tool54. For the Nanopore ultra-long DNA sequencing data, we used Porechop (v0.2.4)55. For second-generation sequencing files, Trimmomatic was applied for quality trimming56. Subsequently, the clean data were assessed for contamination. Contamination can affect the accuracy of subsequent analyses, causing significant deviations in estimates of genome features such as genome size and heterozygosity rate and ultimately impairing genome assembly. Therefore, 2,000 reads were randomly selected from the clean data and aligned against the NT database. According to the evaluation criteria, potential contamination is indicated if more than 1% of reads align to evolutionarily distant species. In this assessment, only approximately 0.3% of the reads aligned to Chenopodium quinoa, demonstrating the absence of significant contamination. Finally, base quality was evaluated using visNano v0.1.1 (https://github.com/renzilin/visNano), which showed that the original sequencing data were of very high quality (Fig. 1). Overall, the sequencing data were deemed to be of high purity and quality.
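The contamination criterion described above amounts to sampling a fixed number of reads and comparing the fraction assigned to distant species with a 1% threshold. The sketch below illustrates that logic. The function names and read identifiers are hypothetical, the alignment step against the NT database is not shown, and the count of 6 off-target reads is an assumption derived from the "approximately 0.3%" of 2,000 reads quoted in the text.

```python
import random


def sample_reads(read_ids, k=2000, seed=0):
    """Randomly pick up to k read IDs without replacement (seeded)."""
    rng = random.Random(seed)
    read_ids = list(read_ids)
    return rng.sample(read_ids, min(k, len(read_ids)))


def off_target_fraction(n_off_target: int, n_sampled: int) -> float:
    """Fraction of sampled reads assigned to evolutionarily distant species."""
    return n_off_target / n_sampled


def is_contaminated(fraction: float, threshold: float = 0.01) -> bool:
    """Apply the criterion from the text: more than 1% indicates contamination."""
    return fraction > threshold


# Hypothetical read IDs standing in for the clean data set.
subset = sample_reads(f"read_{i}" for i in range(10_000))

# Assumed count: ~0.3% of 2,000 reads corresponds to about 6 reads.
fraction = off_target_fraction(6, len(subset))
print(len(subset), f"{fraction:.1%}", is_contaminated(fraction))
```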

Evaluation of the assembled genome

We employed multiple methods to assess the accuracy and completeness of the assembly results. Firstly, based on high-quality ONT Ultra-long reads, PacBio HiFi reads, and Hi-C-assisted assembly, the preliminary assembly was scaffolded using YaHS9 (Yet Another Hi-C Scaffolding tool) v1.2a.1, generating a chromatin contact matrix. Visualization was performed using Juicebox10 Assembly Tools v2.20.00, revealing that both haplotypes of the assembly exhibited good contiguity, distinctly resolving 11 chromosomes (Fig. 5).

Fig. 5 Hi-C heatmap of chromosome interactions in Selenicereus polyrhizus. Panels A and B correspond to haplotype 1 and haplotype 2, respectively.

Secondly, TeloExplorer from quarTeT12 v1.2.1 was utilized to identify telomeres in the two haplotypes of the assembly, revealing the presence of 42 telomeres. Interestingly, both haplotypes of the assembly exhibited a lack of one telomere on chromosome 11 (Fig. 3).

Thirdly, the completeness of the genome was assessed using BUSCO33 v5.3.2 (genome mode, default parameters) with the embryophyta_odb10 dataset, validating that the two haplotypes achieved scores of 97.7% and 97.4%, respectively, including single copy and duplicated BUSCOs.

Lastly, merqury57 v1.3 was used to evaluate the consensus accuracy and completeness of the two haplotype assemblies using PacBio HiFi reads, following the recommended merqury procedure with a k-mer size of 20. The quality values (QV) for haplotype 1 and haplotype 2 were 54.5 and 56.0, respectively. The completeness of haplotype 1, haplotype 2, and the combined set was 72.19%, 69.77%, and 96.58%, respectively. Together, these analyses provide evidence of the accuracy and completeness of the genome assembly.
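The consensus quality values above are on the Phred scale, where QV = −10·log10(error rate). As a rough guide to what QV 54.5 and 56.0 mean in terms of per-base error, the sketch below converts between the two. It only applies the Phred definition and does not reproduce merqury's k-mer based estimation.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error rate to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


# QV values as reported in the text.
for name, qv in {"haplotype 1": 54.5, "haplotype 2": 56.0}.items():
    rate = qv_to_error_rate(qv)
    print(f"{name}: QV {qv} -> error rate {rate:.2e} "
          f"(about 1 error per {1 / rate:,.0f} bases)")
```

Both values lie above QV 50, which corresponds to fewer than one expected error per 100,000 bases.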

Evaluation of the gene annotation

The two haplotypes were annotated with 29,139 and 28,538 protein-coding genes, respectively. The completeness of the annotated proteins was then evaluated using BUSCO33 v5.3.2 (protein mode, default parameters). The two haplotypes achieved completeness scores of 97.6% and 97.5%, respectively, indicating high-quality gene annotation.