Background & Summary

Dipteris, commonly known as the broad-leaf fern, is an early-diverging genus of leptosporangiate ferns comprising only eight species worldwide1,2. Its distribution is restricted to the Indo-Malay archipelago, ranging from northeastern India, southern China and the southern Ryukyu Islands to northeastern Queensland and the Fiji Islands3. In contrast to the extant taxa, Dipteris fossils are extremely abundant and widely distributed throughout the world, and they are important indicators of global climate warming and plant geography during the Mesozoic era4,5. Furthermore, Dipteris is a key transitional group in the evolution of the sporangial annulus, a key morphological trait, from horizontal to vertical6, and it is one of the most controversial branches in the fern phylogeny7,8. The rhizomes of Dipteris plants can also be used to treat edema, kidney deficiency, low back pain and other diseases9, and the plant extracts have antioxidant and antibacterial activities, effective cholesterol degradation and anti-lipid solubility activities9,10, and show potential for treating Alzheimer’s disease11. However, the lack of genomic resources has hindered research on the systematic evolution, paleoclimate, paleogeology, and ornamental and medicinal value of this genus.

Dipteris shenzhenensis is a critically endangered plant endemic to China12, with distinctive and attractive leaves split into two fan-shaped halves (Fig. 1A). Its chromosome number is 2n = 2x = 66 according to the Chromosome Counts Database (CCDB, https://ccdb.tau.ac.il)13, and its genome size was estimated as 2.14 Gb by flow cytometry (Fig. 1B) and 1.94 Gb by genome survey (Fig. 1C). In this study, we sequenced and assembled its chromosome-level genome using Illumina short-read sequencing (56× according to the genome survey), PacBio single-molecule real-time (SMRT) long-read sequencing (35×) and high-throughput chromosome conformation capture (Hi-C) technologies (134×) (Table 1). The assembled genome was 1.9 Gb, with a contig N50 length of 4.75 Mb and a GC content of 42.28% (Table 2). Of the assembled sequences, 98.37% were anchored onto 33 pseudochromosomes (Figs. 1D, 2), and 1.37 Gb (71.97%) of the genome was predicted to be repetitive, including 699.52 Mb (36.82%) of LTR retrotransposons and 424.14 Mb (22.33%) of DNA transposons (Table 3). LTR insertions occurred mainly about 0.24 million years ago (MYA). A total of 45 telomeres were identified across the 33 pseudochromosomes: 15 pseudochromosomes had paired telomeres, 15 had only one telomere, and 3 had no identifiable telomeres (Table 2). A total of 26,471 protein-coding genes were predicted, with an average CDS length of 1,164 bp, and 24,485 (92.5%) of these genes could be functionally annotated. In addition, 11,215 non-coding RNAs were identified in the genome, including 5,063 miRNAs, 4,700 tRNAs, 580 rRNAs and 872 snRNAs (Table 3). This first high-quality Dipteris genome will be of great significance for studies of plant evolution, paleoclimate and paleogeology since the Mesozoic era, and it provides important genomic resources for understanding the systematic evolution and the ornamental and medicinal value of ferns.

Fig. 1 The genome size estimation and chromosome assembly of Dipteris shenzhenensis. (A) The mature plant of D. shenzhenensis. (B) Flow cytometry results using Solanum lycopersicum (~900 Mb) as an internal reference. (C) Genome survey results estimated by GenomeScope with a k-mer size of 17. (D) Hi-C interaction heatmap of the D. shenzhenensis genome showing the interactions among the 33 pseudochromosomes.

Table 1 Summary of the genome sequencing data of Dipteris shenzhenensis.
Table 2 Summary of the genome size estimation and genome assembly of Dipteris shenzhenensis.
Fig. 2 The genome landscape of 33 pseudochromosomes of Dipteris shenzhenensis. Circles from outside to inside are pseudochromosome length (A), gene density (B), repeat density (C), GC content (D), syntenic blocks across pseudochromosomes (E).

Table 3 Summary of the genome annotation of Dipteris shenzhenensis.

Methods

Plant materials and genome sequencing

Fresh leaves were collected from a mature plant of D. shenzhenensis (voucher specimen number: YYH24624) at the China National Orchid Conservation Center (CNOCC), Shenzhen, China, and sent to Novogene Co., Ltd. (Tianjin, China) for genome sequencing. DNA was extracted using a modified cetyltrimethylammonium bromide (CTAB) protocol. Short-read sequencing libraries with an insert size of 350 bp were pooled and sequenced on Illumina HiSeq platforms with a PE150 strategy. After quality control and filtering, 108.72 Gb of Illumina short reads (56×) and 260.68 Gb of Hi-C reads (134×) were generated. PacBio long-read sequencing libraries with fragment sizes of 15–18 kb were sequenced on PacBio Sequel II/IIe platforms in circular consensus sequencing (CCS) mode, yielding 68.31 Gb of HiFi reads (Table 1).
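The sequencing depths quoted for each library follow from dividing the data yield by the estimated genome size. The short Python sketch below reproduces that arithmetic. It assumes that the 1.94 Gb genome survey estimate is the denominator for all three libraries; the text states this explicitly only for the Illumina data. The function and variable names are illustrative.

```python
# Sequencing depth = total bases sequenced / estimated genome size.
# Assumption: the 1.94 Gb k-mer (genome survey) estimate is used for all libraries.
GENOME_SIZE_GB = 1.94

# Data yields reported in the text (Gb).
yields_gb = {
    "Illumina": 108.72,
    "Hi-C": 260.68,
    "PacBio HiFi": 68.31,
}


def depth(yield_gb: float, genome_size_gb: float = GENOME_SIZE_GB) -> float:
    """Return the fold coverage for a given data yield."""
    return yield_gb / genome_size_gb


depths = {name: depth(gb) for name, gb in yields_gb.items()}

for name, fold in depths.items():
    print(f"{name}: {fold:.0f}x")
```

Rounded to whole numbers, these ratios agree with the 56×, 134× and 35× depths given for the Illumina, Hi-C and HiFi data.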

Genome size estimation

The genome size of D. shenzhenensis was estimated by flow cytometry (BD FACSCalibur) and k-mer analysis. For flow cytometry, Solanum lycopersicum L. (1C = 0.9 Gb) was used as the internal reference, the coefficient of variation (CV%) was kept within 5%, and Modifit v3.0 was used to calculate the ratio and plot the histogram. Flow cytometry gave an estimated genome size of 2.14 Gb (Fig. 1B, Table 2). For the k-mer analysis, the high-quality Illumina HiSeq sequencing data (108.72 Gb) were analyzed with jellyfish v.2.3.014, and the 17-mer spectrum was fitted using GenomeScope15, which indicated a genome size of 1.94 Gb (Fig. 1C).
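Both estimates rest on simple ratios, sketched below in Python. For flow cytometry, the sample genome size is the reference genome size scaled by the ratio of the fluorescence peak positions. For k-mer analysis, a first-order estimate is the total number of k-mers divided by the depth of the main peak. GenomeScope fits a fuller model that also accounts for heterozygosity and sequencing errors, so the second function is only an approximation of what it does. The function names are illustrative, and the peak ratio shown is derived from the reported sizes, not taken from the measured histogram.

```python
def flow_cytometry_size(sample_peak: float, reference_peak: float,
                        reference_size_gb: float) -> float:
    """Scale the reference genome size by the ratio of fluorescence peaks."""
    if reference_peak <= 0:
        raise ValueError("reference_peak must be positive")
    return reference_size_gb * sample_peak / reference_peak


def kmer_size_estimate(total_kmers: float, peak_depth: float) -> float:
    """First-order estimate: total k-mer count divided by the main peak depth."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return total_kmers / peak_depth


# Peak ratio implied by the reported values (2.14 Gb sample, 0.9 Gb reference).
implied_ratio = 2.14 / 0.9
print(f"Implied sample/reference peak ratio: {implied_ratio:.2f}")

# Feeding that ratio back in recovers the reported flow cytometry estimate.
print(f"{flow_cytometry_size(implied_ratio, 1.0, 0.9):.2f} Gb")
```

The flow cytometry estimate (2.14 Gb) is about 10% larger than the k-mer estimate (1.94 Gb).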

Genome assembly and annotation

The raw PacBio data were split at the adapters, and the adapter sequences were filtered out to obtain subreads with a minimum length of 50 bp. High-quality HiFi reads were then generated with the ccs software (https://github.com/PacificBiosciences/ccs) using the criteria min-passes = 3 and min-rq = 0.99. The quality-controlled HiFi reads were assembled using Hifiasm16, and the resulting contigs were combined with the Hi-C data for chromosome clustering, orientation and ordering using ALLHiC v0.9.817 (parameters: enz = DpnII, CLUSTER = n). Juicebox was then used for manual correction based on the chromosome interaction strength to obtain the chromosome-level genome. In total, 98.37% of the assembled genome (1.87 Gb) was anchored to 33 pseudochromosomes. The completeness of the genome assembly was evaluated with BUSCO v5.2.218 using the viridiplantae_odb10 database and with OMArk19 using the Viridiplantae.h5 database, and QV scores were calculated with MERQURY v1.320 to measure the assembly accuracy.
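The contig N50 reported for this assembly (4.75 Mb) is a continuity statistic computed from the lengths of the assembled contigs. The Python sketch below shows the standard definition: the length of the contig at which the cumulative length, counted from the longest contig downwards, first reaches half of the total assembly length. The function name and the toy lengths are illustrative and are not taken from the D. shenzhenensis assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that brings the cumulative sum to >= 50%.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy example: total length is 20, and 6 + 5 = 11 first exceeds half of it.
toy_contigs = [2, 3, 4, 5, 6]
print(n50(toy_contigs))
```

In practice the lengths would be read from the contig FASTA index produced after assembly.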

Tandem repeats in the genome were predicted de novo using TRF v4.09.121. LTR_FINDER v1.0722, RepeatScout v1.0.523 and RepeatModeler v2.0.324 were then used to predict the repeat sequences of the D. shenzhenensis genome, and sequences with a length of less than 100 bp and an unknown base (N) content greater than 5% were filtered out to construct the unique repeat database. The UCLUST method in USEARCH v1025 was used to merge this repeat database with the Repbase database26 to obtain a non-redundant repeat sequence database, and RepeatMasker v4.1.227 was used to predict the repeats in the genome based on homologous sequence alignment.

De novo prediction of gene structure was performed with Augustus v2.5.528, GlimmerHMM v3.0.429, SNAP30, Geneid v1.4.431 and GENSCAN32 based on statistical characteristics of the genome sequence, such as codon frequency and exon and intron distribution. BLAST v.2.2.2633 was used to align the D. shenzhenensis genome against a homologous gene dataset constructed from the protein-coding sequences of Alsophila spinulosa (Figshare, 19075346.v6), Ceratopteris richardii (Phytozome v13, C.richardii v2.1), Adiantum capillus-veneris (Figshare, 24619215.v1), Salvinia cucullata (FernBase, Salvinia_asm_v1.2), and Arabidopsis thaliana (NCBI, TAIR10.1). The protein-coding sequences of the D. shenzhenensis genome were then predicted with GeneWise v.2.4.134. To further refine the structural annotation, transcriptome data from different tissues (root, bulb, and leaf) were aligned to the genome sequence using HISAT2 v2.2.135 to identify exon regions and splice sites. Based on the alignment results, transcripts were assembled using StringTie v1.3.3b36, and gene prediction was performed using PASA v2.5.237. Finally, EvidenceModeler v1.1.138 was used to combine the three gene datasets with weights (TRANSCRIPT: 50, PROTEIN: 20, ABINITIO_PREDICTION: 2) to obtain the final non-redundant gene set. InterProScan v5.54–87.039 was used to annotate the conserved motifs and domains of the proteins and to obtain the GO terms of each gene. The gene set was compared with the KEGG database (https://www.genome.jp/kegg) to annotate the metabolic pathways of each gene. Transcription factors were predicted with iTAK v1.740.

Telomeres were identified with the TeloExplorer module of quarTeT v1.2.541 using the parameter “-c plant”. EDTA v2.2.242 was used to estimate the LTR insertion time with the parameters “--anno 1 --u 6.5e-9 --force 1 --sensitive 1”, and the LTR Assembly Index (LAI)43 was calculated with the LAI program using the parameters “-genome genome.fa -intact genome.fa.mod.pass.list -all genome.fa.mod.out”.
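The mutation rate of 6.5e-9 substitutions per site per year passed to EDTA determines how the divergence between the two LTRs of an intact element is converted into an insertion time. The text does not spell out the formula, so the Python sketch below assumes the convention commonly used for LTR dating: T = K / (2μ), where K is the Jukes-Cantor-corrected divergence between the 5′ and 3′ LTRs. The function names are illustrative.

```python
import math

MUTATION_RATE = 6.5e-9  # substitutions per site per year, as given in the text


def jukes_cantor_distance(p: float) -> float:
    """Correct an observed proportion of differing sites for multiple hits."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor model")
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_time_years(p: float, mu: float = MUTATION_RATE) -> float:
    """Assumed convention: T = K / (2 * mu), K being the corrected divergence."""
    return jukes_cantor_distance(p) / (2 * mu)


# Under this assumption, the reported peak at 0.24 MYA corresponds to a
# corrected LTR divergence of K = 2 * mu * T.
k_at_peak = 2 * MUTATION_RATE * 0.24e6
print(f"K at 0.24 MYA: {k_at_peak:.5f}")
```

Under this convention, an insertion peak at 0.24 MYA corresponds to LTR pairs that differ at roughly 0.3% of sites, which indicates very recent insertion activity.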

Data Records

The whole genome sequencing datasets have been deposited in the Genome Sequence Archive44 (GSA) at the National Genomics Data Center45 (NGDC), China National Center for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences. The raw Illumina reads, PacBio HiFi reads and Hi-C reads can be found under the GSA accession numbers CRA02001546, CRA01994047 and CRA01999248, respectively, which correspond to the BioProject accession number PRJCA03159749. The genome assembly has been deposited at DDBJ/ENA/GenBank under the accession JBLQTB00000000050. The genome assembly and annotation files have been deposited in Figshare51.

Technical Validation

The sequencing depth was sufficient, with 108.72 Gb of Illumina short reads (56×), 260.68 Gb of Hi-C reads (134×) and 68.31 Gb of PacBio HiFi reads (35×). The genome assembly was evaluated in three ways: continuity was assessed by the contig N50 (4.75 Mb); sequence accuracy was measured with MERQURY v1.320 (QV = 37.986), which exceeds the 99.9% accuracy corresponding to QV = 30; and consistency with the raw data was assessed by the mapping rate of the Illumina paired-end reads (99.1%). The completeness and consistency of the genome assembly were estimated with BUSCO18 using the viridiplantae_odb10 database and with OMArk19 using the Viridiplantae.h5 database. In total, 98.3% of BUSCOs and 92.58% of conserved HOGs were present in the D. shenzhenensis genome, and 65.03% of gene families were consistent with the known gene families of the Viridiplantae.h5 database, while only 1.7% of BUSCOs were missing and no contamination was detected among the protein-coding genes of D. shenzhenensis. The LAI value was 12.16, and 45 telomeres were identified across the 33 pseudochromosomes, with 15 pseudochromosomes having paired telomeres.
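The relationship between the QV score and base-level accuracy quoted above follows the Phred scale, QV = -10 log10(error rate). The Python sketch below converts between the two. The function names are illustrative.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Phred scale: error rate = 10 ** (-QV / 10)."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Inverse of the Phred transformation."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


assembly_qv = 37.986  # value reported in the text
assembly_accuracy = 1 - qv_to_error_rate(assembly_qv)
threshold_accuracy = 1 - qv_to_error_rate(30)

print(f"Accuracy at QV {assembly_qv}: {assembly_accuracy:.4%}")
print(f"Accuracy at QV 30: {threshold_accuracy:.4%}")
```

A QV of 30 corresponds to one error per 1,000 bases (99.9% accuracy), so the reported QV of 37.986 implies an accuracy above that threshold.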