Background & Summary

Magnolia zenii W. C. Cheng, a Tertiary relict specie, persists as a critically endangered (CR)1 plant with only 115 mature individuals2 remaining in fragmented distribution across Baohua Mountain of Jiangsu Province, eastern China, as documented in the China Species Red List1. This species is valued in horticulture because of its elegant pale blossoms and stately upright trunk3 (Fig. 1a). Despite its ecological importance, while previous research have mainly concentrated on chloroplast genome sequencing for phylogenetic analyses and only a limited number of microsatellite markers developed for population genetic diversity analyses4,5, the absence of whole-genome investigations hinders comprehensive understanding of its reproductive limitations6, genetic bottlenecks, and adaptive mechanisms. Magnoliaceae is one of the oldest extant angiosperm families, yet genomic studies on this family have been relatively limited. Globally and within China, nearly 50% of Magnoliaceae species are threatened7,8. Despite recent advancements, only eight species (Liriodendron chinense (Hemsley) Sargent, Magnolia biondii Pamp., M. kwangsiensis Figlar & Noot., M. hypoleuca Sieb. & Zucc, M. officinalis Rehd., M. sieboldii K. Koch, M. sinica (Law) Noot., Michelia alba DC) have been subjected to in-depth genomic analysis in the past five years9,10,11,12,13,14,15,16. This limited genomic resource significantly hinders both the understanding of the endangered mechanisms of Magnoliaceae species and their evolutionary relationships.

Fig. 1
figure 1

The floral characteristics (a) and genomic features of M. zenii (b). From outer to inner layers (excluding the outermost chromosome coordinates): Class I TE density, Class II TE density, protein-coding gene density, tandem repeat proportion, GC content, and collinearity blocks (minimum length of 100 kb). Window size: 500 kb.

Whole-genome sequencing would enable systematic identification of key reproductive development genes and functional variations, while elucidating the molecular basis of pollen and embryo sac abortion6. Furthermore, it would facilitate assessment of population genetic diversity and genomic architecture, providing critical genetic insights for conservation strategies such as artificial propagation (e.g., seedling transplantation), population restoration optimization, and stress resistance improvement. Particularly crucial given current population decline and ecological competition pressures17, genomic data integration with population dynamics and niche analysis may help to inform precision conservation measures. This integrated approach would enhance the species’ long-term survival prospects and support sustainable utilization efforts.

Here, we combined PacBio HiFi, ONT ultra-long reads, and Hi-C interaction maps to generate a haplotype-resolved chromosome-level assembly and performed robust gene annotation with ONT transcriptomic data. This approach addresses challenges in resolving complex regions and ensures high contiguity and annotation accuracy. The assembly spans 3.6 Gb with a scaffold N50 of 86 Mb and 7 gaps (Table 1), achieving near telomere-to-telomore resolution. Benchmarking Universal Single-Copy Orthologs (BUSCO18, v. 5.3.2) assessment confirmed 99.4% genome completeness and 98.4% gene annotation completeness, with 74,782 protein-coding genes identified. Repetitive elements (72.87%) were dominated by LTR retrotransposons (53.59%). The mitochondrial (915,126 bp) and chloroplast (160,079 bp) genomes were also fully characterized. The assembly provides a foundational resource for investigating its genomic basis of endangerment, identifying key genomic variants for breeding, and reconstructing evolutionary history. It also advances horticultural trait research by enabling precise genetic and genomic analysis of M. zenii.

Table 1 Summary of the M. zenii genome assembly data.

Methods

Sample preparation and DNA/RNA extraction

Sampled fresh flowers and leaves from a mature M. zenii indivadual in Baohua Mountain, Jiangsu, China. Flowers and leaves were preserved in liquid nitrogen and stored at −80°. DNA was extracted from leaves using a modified CTAB methods19. Total RNA was extracted from flowers and leaves using the R6827 Plant RNA Kit (Omega Bio-Tek, Norcross, GA, USA) following the manufacturer’s instructions. Subsequently, the concentration of the extracted DNA and RNA samples was determined using the NanoDrop One spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and the Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA), while their integrity was assessed via 1% agarose gel electrophoresis.

Library construction and sequencing

DNA integrity was assessed using the Femto Pulse system. Genomic DNA (8 μg) was fragmented using the Megaruptor 3, followed by purification with AMPure PB magnetic beads. SMRT bell libraries were constructed using the Pacific Biosciences SMRT Bell Express Template Prep Kit 2.0. Libraries were size-selected on the BluePippin™ system to target 15 kb insert sizes, followed by primer annealing and binding of SMRT bell templates to polymerases using a DNA/polymerase binding kit. The sequencing was performed on the PacBio Sequel II/IIe platform at Benagen Genomics. This yielded 88 Gb (49×) of HiFi data with N50 of 17 kb.

ONT library construction and sequencing

A 2.5 μg aliquot of quality-checked genomic DNA was purified with 1 × magnetic beads. A 1 μL sample was quantified via Qubit fluorometry. DNA underwent end repair and damage remediation, followed by 1 × magnetic bead purification. DNA was eluted with 61 μL EB buffer, and a 1 μL aliquot was re-quantified. Adapter ligation was performed, followed by 0.4 × magnetic bead purification. Elution was conducted with 25 μL Elution Buffer (SQK-LSK110), and a final 1 μL sample was quantified. Libraries were loaded onto R9.4 sequencing chips and sequenced on the PromethION sequencer (Oxford Nanopore Technologies, Oxford, UK) for 48–72 hours. This generated 182 Gb (101×) of data with an N50 of 60 kb.

Hi-C library construction and sequencing

Input Hi-C DNA was fragmented, followed by end blunting. 5′ ends were phosphorylated, and 3′ dA tails were added. Adapters were ligated to fragmented DNA ends. Streptavidin magnetic beads were used to capture biotin-labeled regions. Captured Hi-C DNA underwent PCR amplification, followed by quantification and quality control. Libraries were circularized and validated. Qualified libraries were sequenced on the DNBSEQ-T7 platform using paired-end sequencing protocols.

ONT transcriptome sequencing

A 500 ng aliquot of quality-checked total RNA was diluted to 9 μL with nuclease-free water. Reverse transcription primers were ligated, followed by cDNA synthesis and amplification. Libraries were purified using AMPure beads and adapter ligated. Final libraries were loaded onto R9.4 sequencing chips and sequenced on the PromethION sequencer (Oxford Nanopore Technologies, Oxford, UK) for 48–72 hours. This produced 18 Gb of transcriptome data. This produced 18 Gb of transcriptome data.

Chromosome-level genome assembly

Quality filtering of second-generation Hi-C data was performed using the fastp20 (v. 0.21.0) software, including adapter trimming, removal of low-quality regions, and discarding of short reads. PacBio HiFi reads were initially assembled using Hifiasm21 (v. 0.19.8-r602) to generate a draft genome. Hi-C reads were then aligned to this draft genome using Juicer22 (v. 0.6.14), followed by preliminary Hi-C-assisted chromosome scaffolding with 3D-DNA23 (v. 180922). Subsequent manual validation and iterative refinement were performed in Juicebox24 (v. 3.1.4), focusing on optimizing chromosomal partition boundaries, correcting structural misassembles, and resolving switch errors. Each chromosome was then individually re-scaffolded using 3D-DNA to enhance contiguity, with further manual adjustments in Juicebox to remove spurious insertions, correct orientation discrepancies, and address other detectable errors. Following rigorous manual curation, the final assembly comprised a chromosome-scale framework and dispersed sequences, with gaps fixed to 100 bp. Statistical details are provided in Table S1-4.

To further refine the assembly, gaps were closed using HiFi reads with the QuarTeT25 (v. 1.1.3) software, and telomeric extensions were implemented. Chromosome termini with incomplete (TTTAGGG)n repeats were extended by re-aligning HiFi reads to the assembly using Minimap226 (v. 2.24-r1122). Telomeric regions were reassembled into contigs and integrated into the final scaffolds to achieve near-telomere-to-telomere continuity. Given the incomplete assembly of chloroplast (160 Kbp) and mitochondrial (915 Kbp) genomes generated by Hifiasm, we employed OATK27 (v. 0.9) to independently assemble these organelle genomes using HiFi reads.

Redundancy and contamination filtering

Isolated contigs were aligned to chromosome, mitochondrial, and chloroplast sequences using Redundans28 (v. 0.13c). This workflow identified redundant fragments (e.g., haplotigs) and high-copy regions (e.g., rDNA arrays) prone to fragmentation in Hifiasm assemblies. Contigs with a typical low coverage or rDNA signatures were identified and flagged, followed by manual validation and systematic removal of suspected redundancies and contaminants in a consolidated workflow. In addition, the chromosome number (2n = 38) was determined based on the results of karyotype analysis. Finally, we generated a high-quality, haplotype-phased gapless genome of M. zenii (Fig. 1b).

Collinearity and synteny

Collinear analysis between the M. zenii assembly and reference genomes (Pachylarnax sinica (M. W. Law) N. H. Xia & C. M. Wu) using Minimap2 confirmed consistent chromosome ordering (Fig. 2).

Fig. 2
figure 2

Chromosome-level collinearity between the current assembly and the reference (P. sinica) genome and haplotype assembly. (a) Collinearity with the reference genome (P. sinica); (b) Collinearity with the haplotype assembly. The x-axis represents the current assembly.

Identification of chromosomal landmarks

Key genomic features were mapped onto chromosomes. Telomere repeats: The telomeric motif (TTTAGGG) was identified at termini of most chromosomes. Centromeric tandem repeat: A highly repetitive centromeric-like sequence (CCAACGGAGATGCTCATAACTCCCTCGTTTGATGTCGGATTGGGTCGATTTTTGGTCATCTGAAAGGTACTTTAATGCTCTTTCATTCCTGATTGGAATCGTGGTATTCCGAGACCATTTGGGCATCCGTAAGTGGGATCGCAGTCGTCGGTCCTGAATTTTTCCTTCATCTCCTTGTCCAATAGTAACCTTCATAACTCTCTCTTCAGTGGTCGAATTTAGACTATTTTGAGCTCATTGGAATGGTAACTCGACAAGCTTCCAAATGGGACCAGAATTGTCTAATTCTGAGACCGTTCGATATCCCAAAAGCCCGTTGGAACTCTTCGATCGTCATTTTGCAATTTGGATTGTTAATCTTTCCAAATCGCGCGA) was detected across multiple chromosomes, consistent with centromeric regions. rDNA arrays: The 18-5.8-28S rDNA cluster was distributed across multiple chromosomes, as were 5S rDNA arrays (Supplementary Figure S1).

Annotation

Repetitive element characterization

Repetitive elements were identified using the EDTA29 (v. 2.2.0; parameters: --sensitive 1--anno 1) pipeline to generate a transposable element (TE) library and annotate repetitive regions in the genome. In total, 5,100,109 repetitive sequences were identified, spanning 2.62 Gb and accounting for 72.87% of the genome. Among these, LTR retrotransposons were the most abundant class, representing 2,588,781 elements and 1.93 Gb, or 53.59% of the repetitive content (Table S5).

Gene annotation

A multi-evidence strategy was employed for gene annotation. Homology-based evidence was generated using non-redundant protein sequences from 14 related species (M. biondii Pampan., M. sinica (Y. W. Law) Noot., Liriodendron chinense (Hemsl.) Sargent., Chimonanthus salicifolius H. H. Hu, Aristolochia fimbriata Cham., Annona glabra Linn., Piper nigrum Linn., Saururus chinensis (Lour.) Baill., Warburgia ugandensis Sprague, Cinnamomum kanehirae (Hayata) K.F.Chung & C.L.Hsieh, Acorus tatarinowii Schott, Amborella trichopoda Baill., Tetracentron sinense Oliv., Ceratophyllum demersum L.) in an aggregated database (398,887 sequences). For transcriptome evidence, ONT long-reads were mapped with Minimap2 and assembled into transcripts via StringTie30 (v. 1.3.5), assembly statistics are presented in Table 2.

Table 2 Gene annotation statistics of M. zenii.

Transcriptome assembly results were first annotated using the PASApipeline31 (v. 2.5.3) to define gene structures, followed by alignment to the reference protein dataset of 398,887 non-redundant sequences. Full-length genes were identified through this alignment, and their sequences were used to train both AUGUSTUS32 (v. 3.5.0) and SNAP33, with five rounds of iterative optimization.

The MAKER234 (v. 2.31.9) annotation workflow was then applied, integrating ab initio predictions (via AUGUSTUS and SNAP), transcript evidence, and homologous protein evidence. Key steps included: (1) masking repetitive regions with RepeatMasker; (2) de novo gene prediction using AUGUSTUS and SNAP; (3) aligning transcript sequences to the genome via BLASTN and protein sequences via BLASTX; (4) refining alignments with Exonerate35 (v. 2.2.0); and (5) generating hints files to guide the integration of gene models by AUGUSTUS and SNAP.

To address the lower precision of MAKER process annotations36, EVidenceModeler37 (EVM, v. 2.0.0) was employed to consolidate MAKER and PASA results into a unified gene set. TE protein domains identified by TESorter38 (v. 1.4.6) were masked during EVM integration to exclude transposable element (TE) coding regions. PASA was further applied to refine EVM annotations by incorporating untranslated regions (UTRs) and alternative splicing events. Final annotations were filtered to exclude genes with disrupted coding frames (e.g., internal stop codons, ambiguous bases, or missing start/stop codons) or lengths less than 50 amino acids. NcRNA annotation was performed with tRNAScan-SE39 (v. 2.0.12) for tRNAs, barrnap (https://github.com/tseemann/barrnap) for rRNAs, and RfamScan for other ncRNAs, ensuring comprehensive functional characterization.

Integrated annotation and functional characterization

Protein-coding genes were named systematically using a species-specific nomenclature (e.g., Yuzen01aG0000100), while non-coding RNAs were annotated separately.

Functional annotation combined homology-based approaches

The functional annotation of protein-coding genes was conducted using three strategies. Gene functions were annotated by aligning against the EggNOG-Mapper40 homologous gene database, incorporating functional information such as Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG). Protein sequences were aligned against protein databases (Swiss-Prot, TrEMBL, NCBI NR, and Arabidopsis) using DIAMOND41 (v. 2.4.0; parameters: Identity ≥30%, E-value < 1e-5) to identify the best hit for each gene. Finally, the sequence was aligned with the sub databases in InterPro (PRIMTS, Pfam, SMART, PANTHER, and CDD) using InterProScan42 (v. 5.14-53.0) to identify the conserved domains and motifs. Ultimately, 15,883 genes (21.24%) remained unannotated. Full annotation statistics are provided in Table S6.

Data Record

All data associated with this study have been deposited in the National Genomics Data Center (NGDC)43,44, Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation, under the BioProject (accession number PRJCA037928) and the BioSample (accession number SAMC5173072), accessible via https://ngdc.cncb.ac.cn/gwh/. The raw sequencing data (Hi-C reads, ONT reads, PacBio HiFi reads and Iso-Seq ONT reads) have been deposited in the Genome Sequence Archive (GSA) in NGDC under the accession number CRR187751845, CRR187752046, CRR187752147 and CRR187751948, respectively. The complete genome assembly and annotation have been submitted in the Genome Warehouse (GWH) in NGDC under the accession number GWHGEBD00000000.149. The sequencing data have been also deposited in the NCBI Sequence Read Archive (SRA) with accession number SRR3387190150, SRR3387190251, SRR3387190352 and SRR3387190453 under BioProject accession number PRJNA1273192. The two haplotypes of the final assembled genome have also been submitted to GenBank, with assembly accession numbers GCA_052055185.154 and GCA_052055145.155.

Technical Validation

Assessment of the genome assembly

The final genome assembly spans 3.6 Gb, containing two complete haplotypes consistent with the k-mer based size estimation. The assembly achieved 85 Mb contig N50 and 7 gaps, demonstrating high continuity (Table 1). To evaluate coverage, HiFi reads were mapped using BWA56 (v. 0.7.17-r1188), while transcriptomic ONT reads were mapped to the genome using Minimap2. Non-primary alignments were filtered out, both HiFi reads and Iso-seq demonstrated very high mapping rates of 99.14% and 95.29%, respectively. This indicates that our sequencing data provides relatively high coverage of the genome. Redundancy was assessed by remapping sequencing data to the assembled genome (filtered for non-primary alignments). In redundant regions, coverage depths were approximately half of the expected values due to allelic phasing. The absence of prominent peaks preceding the main coverage peak (determined by genome size estimation) indicated no significant redundancy (Fig. 3a). Similarly, BUSCO analysis of single-copy and multi-copy core genes showed uniform coverage distributions with no overlapping peaks (Fig. 3b). GC content distribution across HiFi sequencing data at varying coverage depths showed no detectable bias (Fig. 4), ensuring even representation of genomic regions. Remapping Hi-C data to the final assembly using BWA demonstrated strong chromosome clustering (Fig. 5), with no evident misassemble artifacts.

Fig. 3
figure 3

Distribution of coverage depth assessed using HiFi data for the whole genome (a) and BUSCO core gene regions (b).

Fig. 4
figure 4

Coverage depth distribution of HiFi sequencing data across varying GC content.

Fig. 5
figure 5

Hi-C contact maps of the final assembled genome. (a) Minimum mapping quality of 1; (b) Minimum mapping quality of 0.

Assessment of the gene annotation

The genome was evaluated using BUSCO, revealing that 99.4% of core genes were complete (including both single-copy and duplicated genes), with only 0.3% missing, indicating a high level of genome completeness. For the transcriptome data integrated with annotated proteins, BUSCO analysis showed that 98.4% of core genes were complete. Additionally, 0.4% were fragmented (F) and 1.2% were missing (M), with a total of 1,614 BUSCOs assessed.