Background & Summary

Fraxinus mandshurica Rupr., a member of the genus Fraxinus in the family Oleaceae, is both economically and ecologically significant. It is mainly distributed in Northeast China, with smaller populations found in Northwest China, Russia, Japan, and North Korea1. F. mandshurica is a deciduous tree with pinnate leaves bearing ovate leaflets with serrulate margins. Its seeds are samaras with twisted wings extending to the base of the nutlet2 (Fig. 1). It is a valuable timber species whose wood is widely used for furniture, flooring, and handicrafts. Beyond its economic importance, the species holds substantial medicinal value. Its bark, leaves, and other parts have been extensively used in traditional Chinese medicine, as well as in the traditional remedies of other Asian countries, such as South Korea and Japan3.

Fig. 1

Morphological features of Fraxinus mandshurica Rupr. (A) Seedlings of F. mandshurica used for sequencing. (B) Left – adaxial (upper) surface of F. mandshurica leaf; Right – abaxial (lower) surface. (C) Seed of F. mandshurica.

However, in recent decades, environmental degradation has led to a significant decline in the wild populations of F. mandshurica4. Furthermore, the lack of a high-quality reference genome has hampered in-depth research on this species. Genomes of several Fraxinus species have been published5, including F. americana, F. excelsior, and F. pennsylvanica, but a telomere-to-telomere (T2T) genome assembly of F. mandshurica is still lacking. Transcriptome analyses6, gene family studies7, and structural variation analyses5 in F. mandshurica are steadily increasing, and all of them depend on a high-quality reference genome. A T2T genome provides a comprehensive view of tandem repeat regions such as centromeres and telomeres, and enables more accurate and complete prediction of protein-coding genes8. To address this gap, we constructed a high-quality T2T reference genome for F. mandshurica. This genomic resource will facilitate conservation genetics research and provide insights into the molecular mechanisms underlying the species’ economically important traits, ultimately aiding its conservation and sustainable utilization.

In this work, we used Illumina (144.80×), PacBio HiFi (115.34×), and Hi-C (126.10×) data to construct the T2T genome. The high-coverage HiFi data alone were sufficient to assemble 19 complete T2T chromosomes without Hi-C scaffolding (Table 1). With the assistance of Hi-C, we anchored the remaining chromosomes and filled the gaps, resulting in the first T2T genome assembly of F. mandshurica. The final assembly consists of 23 chromosomes, with a total genome size of 781.40 Mb, a contig N50 of 34.29 Mb, and a contig L50 of 10. A total of 35,009 protein-coding genes were predicted, of which 34,735 were functionally annotated (Fig. 2). These results suggest that sufficient sequencing depth is important for obtaining an accurate and complete assembly. Compared with the previously published genome in the China National Genomics Data Center (Accession: GWHFDPP00000000.1)9, our assembly has a nearly equivalent level of BUSCO completeness but improved contig number, contig N50, and contig L50, indicating higher continuity (Table 2).
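The contig N50 and L50 statistics quoted above can be computed from a list of contig lengths. The following is a minimal sketch of the standard definitions; the function name `n50_l50` and the toy lengths are illustrative assumptions and are not taken from the actual assembly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running total of
    lengths, sorted from longest to shortest, first reaches half of
    the total assembly size. L50 is the number of contigs needed to
    reach that point.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Hypothetical toy contig lengths (arbitrary units), not real data.
toy_contigs = [50, 40, 30, 20, 10]
print(n50_l50(toy_contigs))  # (40, 2): 50 + 40 = 90 >= 150 / 2
```

A larger N50 together with a smaller L50 means that fewer, longer contigs cover half of the assembly, which is why both values are used as continuity indicators.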

Table 1 Sequencing data used for the Fraxinus mandshurica Rupr genome assembly and annotation.
Fig. 2

Genome characteristics of Fraxinus mandshurica Rupr. (A) A Circos plot illustrating, from the outside inwards: chromosomes (with each tick representing 1 Mb), gene density, repeat sequence density, GC content, and intra-genomic synteny. (B) Schematic representation of chromosome structure. (C) UpSet plot summarizing the functional annotation of the genome.

Table 2 Comparison of the Fraxinus mandshurica Rupr genome assemblies.

Methods

Sample collection

The F. mandshurica plant from which the tissue-cultured samples were derived was collected, cultivated, and preserved at the National Forest Seed Base of Jilin Forest Industry Hongshi Forestry Co., Ltd., Huadian, China. Genomic DNA was extracted from young leaves at three degrees of tenderness: the most tender leaf (from the middle of the apical part), a moderately tender leaf, and a slightly less tender leaf. These were used for Illumina short-read sequencing, PacBio HiFi long-read sequencing (CCS), and Hi-C sequencing. RNA samples were extracted from three tissues (leaves, bark, and roots) collected at the same developmental stage of the plant. All samples used for DNA and RNA extraction were obtained from the same individual plant.

DNA and RNA sequencing

The phenol/chloroform extraction protocol was applied to fresh leaf tissue to obtain genomic DNA for sequencing library preparation. Short-read sequencing was used for the genome survey, providing a preliminary estimate of genome characteristics such as size, heterozygosity, and repeat content. Long-read sequencing was used for de novo genome assembly because of its ability to span repetitive regions and produce longer contigs. Finally, Hi-C data were used to correct and scaffold the contigs.

The integrity of the extracted genomic DNA was assessed using agarose gel electrophoresis. A paired-end sequencing library with an insert size of 300–400 bp was constructed and sequenced on the Illumina NovaSeq 6000 platform, generating a total of 113.15 Gb of raw short reads (150 bp in length). Additionally, PacBio HiFi long reads were obtained using the PacBio Sequel II platform in Circular Consensus Sequencing (CCS) mode, which provides both long read lengths and high base-level accuracy. In total, 90 Gb of HiFi sequencing data were generated for genome assembly.

We also performed Hi-C sequencing. Genomic DNA conformation was fixed in cells using paraformaldehyde, followed by cell lysis and digestion of the crosslinked DNA with restriction enzymes to generate sticky ends. The DNA ends were then repaired, during which biotin was introduced to label the oligonucleotide ends. Subsequently, the DNA fragments were ligated using DNA ligase, and protein digestion was performed to reverse the DNA-protein crosslinks. The DNA was then purified and randomly fragmented into 300–500 bp fragments, and the biotin-labeled DNA was captured using streptavidin magnetic beads and prepared for short-read library sequencing. In total, 98.54 Gb of raw Hi-C data were generated, comprising 656,944,166 reads.
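The sequencing depths quoted in the Background & Summary follow from dividing the data yield by the genome size. The sketch below illustrates this arithmetic under the assumption that the 781.40 Mb assembly size is the denominator (the text does not state this explicitly). Because the yields quoted in the text are rounded, the computed depths only approximate the reported values.

```python
def sequencing_depth(total_bases, genome_size):
    """Average depth = sequenced bases / genome size (both in bp)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Assumption: depth is expressed relative to the final assembly size.
ASSEMBLY_SIZE = 781.40e6  # bp

# Raw data yields as quoted in the text (rounded), in bp.
yields = {
    "Illumina": 113.15e9,
    "PacBio HiFi": 90e9,
    "Hi-C": 98.54e9,
}

for library, bases in yields.items():
    depth = sequencing_depth(bases, ASSEMBLY_SIZE)
    print(f"{library}: {depth:.2f}x")
```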

Total RNA was extracted from F. mandshurica leaves, bark, and roots. RNA quality was assessed based on appearance, purity (NanoDrop), concentration (Qubit), and integrity (agarose gel electrophoresis). mRNA was enriched using oligo(dT) magnetic beads, fragmented, and reverse-transcribed to cDNA. After end repair, A-tailing, and adapter ligation, libraries were purified and optionally PCR-amplified. The libraries were then quality-checked, pooled, and sequenced on the Illumina NovaSeq 6000 platform.

T2T genome assembly

The short reads were quality-controlled and filtered using fastp (v0.23.4)10 to obtain clean data. Based on 97.69 Gb of clean short reads, we used k-mer-based analysis to estimate genome size, heterozygosity, and the percentage of repetitive sequences. We counted the occurrences of each 17-mer with GCE (v1.0.0)11. The analysis estimated the F. mandshurica genome size at approximately 810 Mb, with an adjusted size of 804 Mb. The heterozygosity rate was 0.82%, and the proportion of repetitive sequences was 55.94%. We also used Smudgeplot (v0.2.3dev)12 with the parameters -k21 -m100 -ci1 -cs10000 to analyze genome structure. The predominance of the AB pattern indicated that F. mandshurica is most likely diploid (Fig. 3).
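The principle behind k-mer-based genome size estimation can be illustrated with a short sketch: the total number of k-mers divided by the depth of the main peak of the k-mer frequency histogram approximates the genome size. This is a simplified illustration, not the model implemented in GCE, which additionally accounts for heterozygosity and repeats. The function names, the `min_depth` cutoff, and the toy histogram are assumptions made for this example; reverse-complement (canonical) k-mers are not collapsed here.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Map each observed depth to the number of distinct k-mers seen at that depth."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as (total k-mers) / (peak depth).

    K-mers below min_depth are treated as sequencing errors and ignored.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


# Hypothetical histogram: a low-depth error tail plus a main peak at 30x.
toy_histogram = {1: 1000, 2: 50, 29: 50, 30: 100, 31: 50}
print(estimate_genome_size(toy_histogram))  # 6000 k-mers / 30 = 200.0
```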

Fig. 3

The k-mer count distribution used for genome size estimation and the genomic haplotype analysis.

We used an integrated genome assembly strategy combining hifiasm (v0.25.0)13 with the parameter --telo-m AAACCCT and Verkko (v2.2.1)14,15 with the parameter --telomere-motif AAACCCT to assemble the HiFi sequencing data, with both tools running in Hi-C mode. The telomere motif used is a common plant telomeric sequence16, and telomeric sequence identification using tidk (v0.2.0)17 confirmed its presence without variation. By comparing the number of telomeres recovered by the two assemblers (Table 2), we selected the hifiasm assembly as the preliminary assembly. Owing to the high heterozygosity and repeat content of the genome, the initial assembly was larger than expected. We therefore employed purge_dups (v1.2.5)18 to remove redundant sequences, using minimap2 (v2.28)19 to align the raw reads back to the assembly. Redundant contigs were identified and removed based on read depth distribution and sequence similarity, resulting in 39 contigs. We then used quarTeT (v1.2.5)20 with the parameters te -c plant to detect telomeric signals, identifying 19 contigs as complete chromosomes with telomeric repeats at both ends.
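The idea of classifying a contig as telomere-to-telomere can be sketched as a scan of both contig ends for tandem copies of the telomeric motif or its reverse complement. This is only an illustration of the principle and not the algorithm used by quarTeT or tidk; the function names, the `window` and `min_repeats` thresholds, and the toy sequences are assumptions.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def telomere_ends(seq, motif="AAACCCT", window=200, min_repeats=10):
    """Return (start_has_telomere, end_has_telomere) for one contig.

    An end counts as telomeric when the motif, or its reverse
    complement, occurs at least min_repeats times within the
    terminal window.
    """
    seq = seq.upper()
    rc_motif = revcomp(motif)

    def is_telomeric(region):
        hits = max(region.count(motif), region.count(rc_motif))
        return hits >= min_repeats

    return is_telomeric(seq[:window]), is_telomeric(seq[-window:])


# Hypothetical toy contigs: telomeric repeats flank a filler sequence.
body = "ACGT" * 100
complete = "CCCTAAA" * 20 + body + "TTTAGGG" * 20
one_sided = "CCCTAAA" * 20 + body

print(telomere_ends(complete))   # (True, True)
print(telomere_ends(one_sided))  # (True, False)
```

Counting occurrences rather than requiring an exact repeat phase means the scan tolerates any rotation of the motif, such as CCCTAAA at the start of a contig.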

To generate a complete T2T genome assembly, the clean Hi-C data were aligned to the assembly using Chromap (v0.2.7-r494)21, and the contigs were scaffolded with YaHS (v1.2.2)22. The resulting scaffolds were manually curated using Juicebox (v1.11.08; GitHub: aidenlab/Juicebox) to correct misassemblies and improve chromosome-level organization. As a result, the F. mandshurica genome was anchored onto 23 chromosomes. Gaps remaining after scaffolding were filled using TGS-GapCloser (v1.2.1)23, with additional gap filling assisted by Verkko-assembled contigs. This process yielded a T2T genome of F. mandshurica comprising 45 telomeres and no gaps (Fig. 4).

Fig. 4

Hi-C contact map and contig distribution map on chromosomes of the genome. (A) The map shows scaffolded and independently assembled chromosomes at 500 kb resolution in 5 Mb windows. (B) Contig distribution map on chromosomes of the genome.

Repeat elements identification

Before performing genome annotation, to comprehensively identify and mask repetitive elements in the F. mandshurica genome, we employed a combination of de novo and homology-based approaches. Tandem repeats were first identified using Tandem Repeats Finder (TRF) (v4.09.1)24. For de novo transposable element (TE) discovery, we applied RepeatModeler (v2.0.1)25, which constructs a custom repeat library specific to the genome. In parallel, we utilized EDTA (v2.2.2)26, an integrated pipeline optimized for plant genomes, to annotate and classify TEs with high sensitivity. The custom repeat libraries generated by de novo prediction tools were then used as input for RepeatMasker (v4.1.7)27 to screen the genome for repetitive sequences. For homology-based repeat annotation, we conducted masking with RepeatMasker using both the Dfam28 and Repbase29 libraries, ensuring comprehensive coverage of known repetitive elements. Additionally, RepeatProteinMask was used to detect TEs at the protein level, further improving annotation accuracy. To ensure accuracy and reduce redundancy, we removed overlapping regions among the results obtained from the various methods, retaining only non-redundant repeat annotations. In total, 61.01% of the genome was masked as repetitive (Table 3), a proportion consistent with that reported in currently published Fraxinus genomes5. This indicates that our integrative strategy enabled robust and reliable repeat identification and masking.
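Removing overlaps among repeat annotations from several tools amounts to merging overlapping intervals before summing their lengths. The sketch below shows this step for a single sequence, assuming 0-based half-open coordinates; the function names and toy intervals are illustrative and do not reproduce the actual pipeline.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) intervals on one sequence."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def masked_fraction(intervals, sequence_length):
    """Fraction of the sequence covered by non-redundant repeat intervals."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / sequence_length


# Hypothetical repeat calls from different tools on a 1,000 bp sequence.
calls = [(0, 100), (50, 150), (300, 400)]
print(merge_intervals(calls))        # [(0, 150), (300, 400)]
print(masked_fraction(calls, 1000))  # 0.25
```

Summing the raw interval lengths would give 300 bp here, so merging first avoids counting the 50 bp overlap twice.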

Table 3 Repetitive annotations statistics.

Gene annotation

To ensure high-quality annotation, we employed EviAnn (v2.0.2)30, which performs purely evidence-based annotation rather than relying on gene prediction models. The annotation was based on two sources of evidence: protein sequences from closely related species and transcriptome data from the same species. The protein dataset, comprising 335,013 protein sequences from the Oleaceae family, was obtained from NCBI31. For transcriptome evidence, to minimize annotation omissions caused by insufficient transcriptome coverage, we supplemented our own RNA-seq data (generated from root, bark, and leaf tissues) with additional transcriptome datasets from other F. mandshurica tissues, including flowers (Accession: SRP513361)32, pollen (Accession: SRP522725)33, and stigmas (Accession: SRP559292)34, which were downloaded from NCBI. All RNA-seq reads were aligned to the reference genome using HISAT2 (v2.2.1)35. Samtools (v1.9)36 was used to convert the SAM files to BAM format, and the BAM files were then sorted and merged to generate a consolidated, coordinate-sorted alignment dataset. Both protein and transcriptome evidence were then supplied to EviAnn to generate the final gene annotation. A total of 35,009 protein-coding genes and 57,195 transcripts were ultimately predicted, consistent with the previously published genome of Fraxinus pennsylvanica37 (Fig. 5A and B).
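Gene and transcript totals such as those reported above can be tallied from the third column of a GFF3 annotation file. The following minimal sketch assumes that genes and transcripts are labelled `gene` and `mRNA`, which is common in GFF3 but should be checked against the actual file; the toy records and the function name are hypothetical.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count GFF3 records by feature type (column 3)."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9:
            counts[fields[2]] += 1
    return counts


# Hypothetical toy annotation: two genes, three transcripts.
toy_gff = """##gff-version 3
chr1\ttoy\tgene\t100\t900\t.\t+\t.\tID=g1
chr1\ttoy\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1
chr1\ttoy\tmRNA\t100\t800\t.\t+\t.\tID=g1.t2;Parent=g1
chr1\ttoy\tgene\t2000\t2500\t.\t-\t.\tID=g2
chr1\ttoy\tmRNA\t2000\t2500\t.\t-\t.\tID=g2.t1;Parent=g2
""".splitlines()

counts = count_feature_types(toy_gff)
print(counts["gene"], counts["mRNA"])  # 2 3

# With the totals reported in the text: transcripts per gene.
print(round(57195 / 35009, 2))  # 1.63
```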

Fig. 5

Structural characteristics of the Fraxinus mandshurica Rupr. gene models and their GO and KEGG classification. (A) Gene element length distribution plot of the gene set. (B) Cumulative gene element length distribution plot of the gene set. (C) GO classification bar chart. (D) KEGG classification bar chart.

To functionally annotate the predicted genes, we first performed sequence similarity searches against major protein databases, including NR38, SwissProt39, and TrEMBL39, using DIAMOND (v0.9.25.126)40 (parameters: --max-target-seqs 1 --evalue 1e-5). These alignments allowed for the initial functional characterization of the gene set based on homology. To further investigate potential protein functions, we applied InterProScan (v5.61-93.0)41 and HMMER3 (v3.3.1)42 to identify conserved domains by querying the InterPro43 and Pfam44 databases. The annotated domains were subsequently mapped to Gene Ontology (GO) terms45, and GO enrichment analysis was performed (Fig. 5C). In addition, KofamScan (v1.3.0)46 was used to assign KEGG47 identifiers to the genes, providing pathway-level insights into gene functions (Fig. 5D). In total, 99.22% of the genes were functionally annotated (Table 4).
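Homology-based annotation of this kind typically reduces to keeping, for each gene, the best database hit below an E-value cutoff. The sketch below parses DIAMOND or BLAST tabular output in the standard 12-column format (query, subject, identity, length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). It assumes that format was used; the function name and toy hits are hypothetical.

```python
def best_hits(lines, max_evalue=1e-5):
    """Return {query: (subject, bitscore)} keeping the top-scoring hit per query."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit is not significant enough
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Hypothetical toy hits: g1 has two significant hits, g2 has none.
toy_hits = [
    "g1\tP1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "g1\tP2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t250",
    "g2\tP3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20",
]
print(best_hits(toy_hits))  # {'g1': ('P2', 250.0)}

# Annotation rate from the totals reported in the text.
print(round(100 * 34735 / 35009, 2))  # 99.22
```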

Table 4 Gene functional annotation statistics.

We also annotated non-coding RNAs using several methods: tRNA sequences were identified using tRNAscan-SE (v1.3.1)48; rRNA sequences were annotated using BLASTN (v2.14.1+)49 with reference sequences from closely related species, owing to their high conservation; and miRNA and snRNA sequences were annotated using INFERNAL (v1.1.5)50 with the Rfam (v14.8)51 database (Table 5).

Table 5 The statistics of the annotation of non-coding RNAs.

Data Records

The sequencing datasets and genome assembly were deposited in public repositories. All raw data have been submitted to NCBI under the BioProject accession number PRJNA1273044.

The genomic Illumina (SRR33856795), PacBio HiFi (SRR33856794), and Hi-C (SRR33856793) sequencing data were deposited in the NCBI Sequence Read Archive. The transcriptome Illumina sequencing data from leaves (SRR33856792), bark (SRR33856791), and roots (SRR33856790) were also deposited in the NCBI Sequence Read Archive (Accession: SRP590381)52.

The final genome assembly was submitted to NCBI Assembly under accession number GCA_050941835.1 (ref. 53). The GFF file is available on Figshare (https://doi.org/10.6084/m9.figshare.29424683.v1)54.

Technical Validation

Identification of telomeric and centromeric regions

Telomeres and centromeres were located using quarTeT (v1.2.5)20 with the parameters --TE and --gene. The TE annotations were derived from the annotation files generated by EDTA during repeat masking, while the gene annotations were obtained by integrating mRNA annotations from EviAnn and non-coding RNA annotations from Infernal. A total of 45 telomeres and 23 centromeres were identified (Table 6). Notably, the only missing telomere was at one end of chromosome 23, where a high density of rRNA annotations was observed. This suggests that the highly repetitive rDNA region may have hindered the complete assembly of the terminal telomeric sequence.

Table 6 Telomere and centromere statistics for the 23 chromosomes.

Quality assessment of the genome and proteins

In the raw sequencing data, adapter sequences, low-quality bases, and unsequenced bases (represented by N) can significantly interfere with downstream analysis. These were removed by filtering and redundancy removal with fastp (v0.23.2)10.

The completeness of the assembled genome was assessed using BUSCO (v5.8.2)55 and Compleasm (v0.2.7)56 with the embryophyta_odb12 dataset. BUSCO and Compleasm reported 98.7% and 99.9% completeness, respectively (1,999 and 2,024 BUSCO genes; Table 7), confirming the completeness of the genome assembly.

Table 7 Quality evaluation for Fraxinus mandshurica Rupr genome assembly.

We also aligned NGS short reads and PacBio HiFi long reads to the genome using bwa (v0.7.12) and minimap2 (v2.24), respectively. A total of 99.55% of the reads were mapped, achieving 99.82% genome coverage, which reflects the high quality of the sequencing data. Assembly accuracy was further evaluated with Merqury (v1.4.1)57; the quality values (QV) obtained from the short-read and long-read data were 50.78 and 67.13, respectively, further demonstrating the high quality of the assembled genome.
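The QV reported by Merqury is a Phred-scaled consensus accuracy, QV = -10 log10(E), where E is the estimated per-base error rate. The short sketch below converts between the two so that the reported values can be read as error rates; the function names are illustrative.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error rate to a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


# QV values reported in the text.
for label, qv in [("short reads", 50.78), ("long reads", 67.13)]:
    print(f"QV {qv} ({label}): about {qv_to_error_rate(qv):.2e} errors per base")
```

On this scale, a QV of 50 corresponds to roughly one error per 100,000 bases, so both reported values indicate fewer errors than that.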