Background & Summary

Long-fruited mulberry (Morus macroura), a member of the Moraceae family, is native to China, Malaysia, and India, and is distributed primarily in South and Southwest China1,2. Preliminary surveys indicate that it grows at altitudes of 100 to 1,800 meters. Notably, a rare variant with exceptionally long fruits (up to 18 cm) has been discovered; it is high-yielding and has significant potential for development and utilization1,3. The plant is generally prolific and flowers readily, and its fruit matures from late March to late May, coinciding with the off-season for fruit supply4. Its fresh berries can help alleviate seasonal shortages and, with suitable cultivation practices, can be produced year-round. The brightly colored berries have a unique flavor, are widely favored, and are rich in fructose, glucose, seven types of vitamins, 21 amino acids, mineral salts, and trace elements5,6. As a “medicinal-edible dual-purpose” fruit in China, mulberries have high medicinal and economic value5,6. In China, economic development has shifted the sericulture industry northward and southward, and with global warming, germplasm with a low chilling requirement has become key to the industry's advancement. Chromosome-level genomes have recently been published for several Morus species, including M. notabilis, M. alba, M. atropurpurea, and M. yunnanensis7,8,9,10. Although M. macroura, which has a low chilling requirement, is extensively used for fresh consumption and processing (Fig. 1), and its cultivation techniques and nutritional quality have been studied in depth, its genetic basis remains underexplored.

Fig. 1

Morphological characteristics of M. macroura cv. ‘Sijiguo 72 C’: (a) different developmental stages of mulberry fruit, (b) mulberry flowering branches, (c) mulberry fruit branches.

This study combined PacBio HiFi11 and Hi-C12 data to generate a high-quality chromosome-scale genome assembly of the cultivated long-fruited mulberry variety M. macroura cv. ‘Sijiguo 72 C’ (Fig. 2). The assembled genome spanned 318.59 Mb, with a contig N50 of 17.98 Mb and a scaffold N50 of 21.88 Mb (Table 1). Approximately 99.34% (316.47 Mb) of the contig sequences were anchored onto 14 pseudo-chromosomes, consistent with the known haploid chromosome number of M. alba (Table 1 and Fig. 2). Repetitive sequences accounted for 173.34 Mb of the genome, and 21,824 protein-coding genes were annotated (Table 6). These results indicate a contiguous and accurate genome assembly and annotation. Furthermore, comparative genomic analyses with six other Morus species provided insights into their phylogenetic relationships, divergence times, and evolutionary history.

Fig. 2

Genomic landscape of the chromosome-scale Morus macroura assembly. The Circos plot shows, from the outer to the inner layers: (1) lengths of the 14 pseudo-chromosomes at the Mb scale; (2) GC content per Mb; (3) gene density per Mb and repeat density per Mb; (4) transposable element density; (5) Copia (blue) and Gypsy (purple) LTR retroelement density per Mb; and (6) center: intragenomic syntenic relationships.

Table 1 Statistics of the Morus macroura genome assembly.

This chromosome-scale genome assembly serves as a pivotal resource for characterizing agronomic traits in mulberry fruits and accelerating genetic breeding of Morus spp. It further establishes a foundation for elucidating regulatory mechanisms of winter bud dormancy and flowering, alongside comparative genomic analyses between M. macroura and other Morus species.

Methods

Sample collection and sequencing

For genome sequencing, fresh young leaves of M. macroura cv. ‘Sijiguo 72 C’ were collected from an adult individual at the Institute of Environment and Plant Protection, Chinese Academy of Tropical Agricultural Sciences (Danzhou City, Hainan Province; N 19° 35′, E 109° 29′). High-molecular-weight genomic DNA was extracted using the CTAB method13 and purified with the Grandomics Genomic Kit according to the manufacturer’s protocol. DNA degradation and contamination were assessed by electrophoresis on 1% agarose gels. Purity was measured using a NanoDrop™ One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA), with acceptable ratios of 1.8–2.0 (OD₂₆₀/₂₈₀) and 2.0–2.2 (OD₂₆₀/₂₃₀). DNA concentration was quantified using a Qubit® 4.0 Fluorometer (Invitrogen, USA). Size selection and cleanup were performed using the SMRTbell Prep Kit 3.0, and sequencing was conducted on the PacBio Revio platform following the manufacturer’s manual. Raw reads were processed with CCS software (https://github.com/PacificBiosciences/ccs) under default parameters (min passes = 3, min RQ = 0.99), generating high-accuracy HiFi reads (Q > 20)14. This yielded 17.56 Gb of HiFi data, representing ~52.63× genome coverage (Table 2).

Table 2 Statistics of sequencing data for Morus macroura genome assembly and annotation.

Total RNA was extracted from four tissues (leaf, stem, flower, and fruit) sampled from the same M. macroura plant. The mRNA was reverse-transcribed into cDNA, and four libraries with an insert size of 350 bp were constructed using the MGIEasy Universal DNA Library Prep Kit V1.0 (CAT#1000005250, MGI) following the manufacturer’s instructions. The qualified libraries were sequenced on the DNBSEQ-T7RS platform in PE150 mode. Raw reads were filtered using fastp v0.23.4 with the following criteria: removal of reads containing adapters; removal of reads in which N bases exceed 10%; and removal of reads in which low-quality bases (quality value < 20) exceed 50%. This generated an average of 8.97 Gb of clean reads per sample (Table 2).
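The two content-based filtering criteria above can be illustrated with a minimal Python sketch. It is not fastp's implementation and it omits adapter removal; the function names `phred33` and `passes_filter` are hypothetical, and Phred+33 quality encoding is an assumption.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality_string]


def passes_filter(seq, quals, max_n_frac=0.10, min_q=20, max_lowq_frac=0.50):
    """Return True if a read passes the N-content and low-quality criteria.

    A read is rejected when more than `max_n_frac` of its bases are N, or when
    more than `max_lowq_frac` of its bases have a quality value below `min_q`.
    """
    if not seq or len(seq) != len(quals):
        return False
    n_frac = seq.upper().count("N") / len(seq)
    lowq_frac = sum(q < min_q for q in quals) / len(quals)
    return n_frac <= max_n_frac and lowq_frac <= max_lowq_frac
```

For example, `passes_filter("NNACGTACGT", [30] * 10)` is rejected because 20% of the bases are N, whereas the same read without N bases would be kept.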

Contig-level genome assembly

HiFi reads were generated from filtered subreads using the CCS module in SMRT Link v12.015 with the parameters --maxLength = 50000, --minPasses = 3, and --minPredictedAccuracy = 0.99. After adapter sequences were removed and low-quality reads (average read quality rq < 0.99) were discarded, the resulting HiFi dataset (reads with base quality ≥ Q20 and average rq > 0.99) comprised 17.56 Gb of CCS reads (52.63× coverage), with an average length of 18.38 kb and an N50 of 18.96 kb (Table 2). These reads were converted from BAM to FASTQ format using bam2fastx v1.3.1 (https://github.com/pacificbiosciences/bam2fastx/). The genome was assembled from HiFi reads alone, using Hifiasm v0.16.011 under default parameters to generate contigs. After assembly, redundant sequences were removed to produce a non-redundant preliminary assembly (322.62 Mb). Within this assembly, two contigs (104,112 bp; 0.03% of total length) were identified as bacterial contaminants and excised. After exogenous sequences (non-target taxa, mitochondrial DNA, and chloroplast DNA) were eliminated, the final genome assembly reached 320.58 Mb with a contig N50 of 17.98 Mb (Table 1).
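The N50 values reported for reads and contigs follow the standard definition: the length of the sequence at which the cumulative length, counted from the longest sequence downward, first reaches half of the total. A minimal Python sketch of that definition is shown below; the function name `n50` is illustrative and is not taken from any tool used in this study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length of
    the sequence at which the running total first reaches half the total length.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
```

For the toy input `[2, 3, 4, 5, 6]` (total 20), the two longest sequences already sum to 11, so the N50 is 5.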

Hi-C library construction and pseudo-chromosome anchoring

To anchor hybrid scaffolds to chromosomes, genomic DNA was isolated from tender leaves of M. macroura for Hi-C library construction. Leaves were sectioned into ~2 cm² pieces and cross-linked with 2% formaldehyde. Purified DNA was digested with DpnII restriction enzyme, biotinylated using biotin-14-dCTP, sheared to 300–400 bp fragments, and subjected to blunt-end repair. The Hi-C library was sequenced on the Illumina NovaSeq/MGI-2000 platform16. Raw reads were filtered with fastp v0.23.417, generating ~60.64 Gb (190.34 × coverage) of data for pseudomolecule assembly. Cleaned Hi-C data were aligned to contig assemblies using Bowtie2 v2.3.218 under parameters --very-sensitive -L 30. After merging paired-end reads, 480,988 uniquely aligned read pairs (45.64% of clean data) were obtained (Table S1). Hi-C scaffolding employed HiC-Pro v3.1.012, yielding 370,804 valid interaction pairs (77.09% of uniquely mapped reads; 35.18% of total clean data; Table S1). Quality-controlled data were aligned to the reference genome using BWA v0.7.1719 with mem -5SP, followed by filtering via filter_bam v2.0.0 (--nm 3) to retain reads with mapping quality ≥ 1 and edit distance < 3. Pseudomolecule construction utilized HapHiC v1.0.520 (command: HapHiC pipeline Genome bam Chrnumber) for clustering, reassignment, ordering, orientation, and assembly. Contigs totaling 316.47 Mb were anchored to 14 chromosomes (range: 14.19–58.92 Mb; Table 3), achieving a scaffolding rate of 99.34% (Table 1). The final pseudochromosome-level genome size was 318.59 Mb with 9 gaps (total gap length: 0.9 Mb) and a scaffold N50 of 21.88 Mb (Table 1). To validate anchoring accuracy, pseudochromosomes were partitioned into 100-kb bins for genome-wide interaction matrix construction, visualized as a heatmap using HiC Plotter v0.6.63621 (Fig. 3).
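The interaction matrix used for validation is built by assigning each read pair to a pair of fixed-size bins and counting contacts per bin pair. The sketch below illustrates that binning step only, under simplifying assumptions: positions are 0-based coordinates on a single sequence, and the function name `bin_contacts` is hypothetical. It does not reproduce the internals of HapHiC or HiC Plotter.

```python
from collections import Counter


def bin_contacts(pairs, bin_size=100_000):
    """Count Hi-C contacts per (bin_i, bin_j), with bin_i <= bin_j.

    `pairs` is an iterable of (pos1, pos2) coordinates for the two ends of
    each valid read pair. The matrix is symmetric, so only the upper
    triangle is stored.
    """
    counts = Counter()
    for pos1, pos2 in pairs:
        i, j = sorted((pos1 // bin_size, pos2 // bin_size))
        counts[(i, j)] += 1
    return counts
```

In a well-ordered assembly, counts concentrate along the diagonal (small `j - i`), which is the pattern a heatmap such as Fig. 3 is used to inspect.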

Table 3 Statistics of Morus macroura genome assembly result by Hi-C.
Fig. 3

Heat map of genome-wide Hi-C intra-chromosome interactions in M. macroura. The interaction density is measured by the number of supporting Hi-C reads and illustrated by the color bar from dark red (high density) to light pink (low density).

Genome annotation and functional prediction

Identifying repeat sequences

The M. macroura genome assembly harbored abundant repetitive sequences, broadly classified as tandem repeats and interspersed repeats according to their distribution patterns. (1) Microsatellite (SSR) analysis: GMATA v2.222 identified 117,316 SSR loci spanning 630,253 bp (0.20% of the genome). (2) Tandem repeat annotation: Tandem Repeats Finder (TRF) v4.07b23, run with parameters 2 7 7 80 10 50 500 -f -d -h -r, detected 50,263 tandem repeats totaling 2,931,480 bp (0.92% of the genome; Table 4). (3) Transposable element (TE) annotation: MITE-hunter24 (-n 20 -P 0.2 -c 3) generated a MITE library; LTR_FINDER v1.0.725 and LTR_HARVEST v1.6.526 predicted LTR retrotransposons; LTR_retriever v1.927 consolidated the MITEs and LTRs into TE.lib; and RepeatModeler v1.0.1128 produced RepMod.lib. TE.lib, RepMod.lib, and the RepBase database were merged into a composite repeat library, which RepeatMasker v4.1.2-p128 used to annotate repeats. The combined annotations revealed 173.34 Mb of repetitive sequences (54.41% of the genome), dominated by LTR retrotransposons (34.92%, 110.92 Mb), followed by DNA transposons (12.40%, 39.40 Mb), LINEs (2.48%, 7.88 Mb), and SINEs (0.18%, 0.57 Mb); the complete classification is given in Table 4.

Table 4 Statistics of repeat elements in the genome of Morus macroura.

Identifying non-coding RNA (ncRNA) gene

Non-coding RNA (ncRNA) annotation combined three approaches. Rfam alignment: the genome was aligned against the Rfam database (release 14.9)29 using Infernal v1.130 with default E-value thresholds. tRNA prediction: tRNAscan-SE v1.3.131 was run with standard parameters. rRNA identification: rRNA genes and subunits were detected by homology with RNAmmer v1.232. The integrated analysis revealed the following ncRNA repertoire (Table 5): 1,170 ribosomal RNA (rRNA) genes (including subunits such as 5S, 18S, and 28S); 1,026 small RNA loci (e.g., snoRNAs, snRNAs, and miRNAs); 7 regulatory RNA elements (e.g., riboswitches or lncRNAs with regulatory roles); and 767 transfer RNA (tRNA) genes (covering tRNAs for all 20 standard amino acids).

Table 5 Statistics for non-coding RNA genes in the genome of Morus macroura.

Gene structure prediction

Gene structure prediction integrated three complementary approaches: homology-based prediction, transcriptome-based prediction, and ab initio prediction. (1) Homology-based prediction: GeMoMa v1.6.133 was used to infer gene models through cross-species comparison with six reference genomes: M. alba heyebai8, M. alba zhenzhubai10, M. notabilis7, Oryza sativa34, Arabidopsis thaliana35, and Prunus persica36. (2) Transcriptome-based prediction: quality-controlled RNA-seq reads were aligned to the reference genome using STAR 2.7.3a37; transcript coordinates were generated with StringTie v1.3.4d38 under default parameters; Trinity-assembled transcripts and full-length cDNAs were mapped to the soft-masked genome with GMAP v2014-10-239; and PASA v2.3.340 integrated the aligned data to assemble transcripts, with open reading frames predicted by GeneMark-ST v5.141. This pipeline yielded 24,189 transcriptome-supported genes (Table 6). (3) Ab initio prediction: Augustus v3.3.142 was trained on 3,000 high-confidence transcriptome-derived genes to generate a species-specific model, and de novo prediction with this model identified 22,413 candidate genes (Table 6). Integration and refinement: predictions from PASA, GeMoMa, and Augustus were consolidated using EVidenceModeler v1.1.1 (EVM)40 with weighted evidence priorities (PASA > GeMoMa > Augustus). The initial gene set was filtered with TransposonPSI43 to remove sequences containing potential transposable elements or coding errors. The final curated annotation contained 21,824 protein-coding genes with an average gene length of 4,034 bp, CDS length of 1,305 bp, exon length of 230 bp, and intron length of 583 bp (Table 6).
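Summary statistics such as average exon and intron length can be derived directly from the exon coordinates of each gene model. The following Python sketch shows one way to do this, assuming 1-based inclusive coordinates as used in GFF files; the helper names `introns_from_exons` and `mean_length` are hypothetical and do not correspond to the scripts used in this study.

```python
def introns_from_exons(exons):
    """Derive intron intervals from the exons of one transcript.

    `exons` is a list of (start, end) tuples with 1-based inclusive
    coordinates. Introns are the gaps between consecutive exons.
    """
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
        if next_start - prev_end > 1
    ]


def mean_length(intervals):
    """Mean length of 1-based inclusive intervals (0.0 if there are none)."""
    if not intervals:
        return 0.0
    return sum(end - start + 1 for start, end in intervals) / len(intervals)
```

For a transcript with exons at 1–100, 201–300, and 501–600, the introns are 101–200 and 301–500, giving a mean exon length of 100 bp and a mean intron length of 150 bp.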

Table 6 Summary of gene structure prediction by three strategies of Morus macroura.
Table 7 Statistics for the Morus macroura functionally annotated protein-coding genes.

Gene function annotation

Functional annotation of the protein-coding genes was performed using Blastp v2.7.144 (-evalue 1e-5, -max_target_seqs 1) against the following public databases: Non-redundant protein (NR; ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz), SwissProt, Kyoto Encyclopedia of Genes and Genomes (KEGG)45, and Eukaryotic Orthologous Groups (KOG)46. Gene Ontology (GO)47 terms were assigned by annotation transfer from the BLASTP matches. Protein domains were identified with InterProScan v5.32-71.048 under default parameters by alignment against the Pfam database. InterPro entries were mapped to GO annotations across three domains: Biological Process (BP), Cellular Component (CC), and Molecular Function (MF). This integrated approach assigned functional terms to 21,181 genes (97.05% of the total gene set; Table 7).
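The effect of the E-value cutoff and single-target setting can be illustrated by post-filtering BLAST results in Python. This sketch assumes the standard 12-column tabular output (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the output format is an assumption, since the text does not state which format was used, and the function name `best_hits` is hypothetical.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-5):
    """Keep the highest-bitscore hit per query among hits with E-value <= cutoff.

    Returns a dict mapping query ID to (subject ID, evalue, bitscore).
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

The fraction of annotated genes is then the number of queries with at least one retained hit in any database divided by the total gene count.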

Comparative genomic analysis

For comparative genomic analysis, we selected 9 species that met three criteria: a publicly available high-quality genome sequence with annotation, an extensive research background, and documented medicinal or edible value. The taxa comprise 8 Moraceae species7,8,9,10,49 and P. persica (peach)35. Orthologs were identified using OrthoFinder v2.5.450 based on the longest transcripts of the protein-coding genes from the 9 species. In the M. macroura genome, 17,326 gene families encompassing 20,959 genes were identified (Fig. 4a). We also detected 387 expanded and 4,727 contracted gene families. OrthoFinder generated a rooted species tree, from which ultrametric trees were constructed with r8s v1.81. The estimated divergence between M. macroura and M. alba was ≈ 4.162 Mya (Fig. 4c). M. macroura and M. atropurpurea showed minimal evolutionary distance (recent divergence), indicating close phylogenetic affinity.

Fig. 4

Comparative genomic and evolutionary analysis of Morus species. (a) Venn diagram of specific and shared orthologs among the 9 species. (b) Gene-based genome collinearity comparison between M. macroura and M. alba. Conserved syntenic blocks are highlighted in grey and correspond to the fourteen pseudo-chromosomes, indicating that visible genome rearrangements occurred during the evolution of Morus species. (c) Phylogeny and divergence time analysis among the 9 species. Divergence times are labelled on the right (million years ago, Mya). To better visualize the branching patterns of Morus, the divergence distance between P. persica and F. chinensis was rescaled from 76 to 12. Similarly, the divergence time between F. chinensis and the common ancestor of Morus was adjusted from 31.218 to 9.218.

Whole-genome synteny

To understand the extent of genomic rearrangement in M. macroura during evolution, whole-genome synteny analysis was conducted between M. macroura and M. alba. The protein sequences of the two species were compared using blastp with an E-value cutoff of 1 × 10⁻⁵. Syntenic blocks were identified by MCScanX51 with the parameter -s 15 (number of genes required to call a collinear block) and visualized by jcvi v1.2.87552 with a minimum span threshold of 30 genes (--minspan=30). The analysis revealed interweaving conserved syntenic blocks across all fourteen M. macroura pseudochromosomes (Fig. 4b), indicating potential large-scale genomic rearrangements during Morus divergence.

Data Records

The raw sequencing data including the Hi-C sequencing, PacBio HiFi and Illumina NGS RNA-seq have been submitted to the NCBI Sequence Read Archive (SRA) under accession numbers SRR3502011253, SRR3502011354, and SRR34997372 to SRR3499737555,56,57,58, respectively. The final chromosome-level assembled genome sequences were deposited in the NCBI Assembly database under Accession Number PRJNA130563759. The genome annotation results, including repeated sequences, gene structure, and functional predictions were deposited in the Figshare database (https://doi.org/10.6084/m9.figshare.30143464)60.

Technical Validation

Genome assembly quality assessment (integrating alignment statistics, ortholog completeness, and k-mer metrics). (1) HiFi read alignment: Minimap2 v2.2661 (-ax map-hifi) mapped 955,429 HiFi reads to the post-assembly error-corrected genome, achieving a mapping rate of 99.64% (952,000 reads); in addition, 99.9% of the filtered short reads mapped to the assembly. Depth analysis with Samtools v1.1862 showed an average depth of 52.63× for the third-generation data, and 99.94% of the genome was covered at a depth of at least 1×. (2) Evolutionarily conserved gene analysis: BUSCO v5.1.363 with the embryophyta_odb10 database indicated high completeness, with 97.21% of BUSCOs recovered at the genome level (Table 1) and 97.46% at the gene-model level (Fig. 5). (3) K-mer-based quality metrics: Merfin64 and Merqury v1.365 with a 21-mer profile gave a consensus quality value (QV) of 62.55 and a k-mer completeness of 80.90%. A QV above 60 corresponds to fewer than one error per million bases (base accuracy > 99.9999%).
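The consensus QV is a Phred-scaled value, so it converts to a per-base error probability as p = 10^(−QV/10). The short Python sketch below shows the conversion in both directions; the function names are illustrative and are not part of Merqury.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if error_rate <= 0:
        raise ValueError("error rate must be positive")
    return -10 * math.log10(error_rate)


# A QV of 62.55 corresponds to roughly 5.6e-07 errors per base,
# i.e. fewer than one error per million bases.
assembly_error_rate = qv_to_error_rate(62.55)
```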

Fig. 5

Completeness of the M. macroura genome assembly and annotation, evaluated by BUSCO with the embryophyta_odb10 database, which includes 1,614 genes.

Comparison of protein-coding genes with related species. To assess the accuracy and reliability of gene prediction, the distributions of gene length, CDS length, exon length, exon number, intron length, and intron number in M. macroura were compared with those of other species (M. alba8, M. alba zhenzhubai10, Morus notabilis7, O. sativa34, A. thaliana35, and P. persica36). The consistent distribution trends among all species further support the reliability of the annotated gene set of M. macroura (Fig. 6).

Fig. 6

Comparison of annotated genes: distribution of gene length (a), CDS length (b), exon length (c), exon number (d), intron length (e), and intron number (f) in M. macroura and other closely related species. The x-axis represents the length or number and the y-axis represents the density of genes.

In addition, the LTR Assembly Index (LAI) was calculated using LTR_retriever v2.9.067 to evaluate the contiguity of the assembly. The LAI score for the whole genome was estimated to be 26.45, surpassing the quality standard for reference genomes.

Taken together, these results indicate that the M. macroura genome was assembled and annotated with high completeness and accuracy in the present study.