Background & Summary

The genus Zizania, belonging to the rice tribe (Oryzeae) of the grass family Poaceae, is closely related to the genus Oryza, along with Leersia1,2,3,4. Of the four species within Zizania, three are native to North America, including the annual Zizania palustris, commonly known as wild rice, which has recently been domesticated as a grain crop3,5,6. In contrast, the only East Asian species, the perennial Zizania latifolia, is prevalent in freshwater wetlands of eastern China and plays a crucial role in emergent plant communities7,8,9. Interestingly, the young stems of Z. latifolia can be infected by the smut fungus Ustilago esculenta, resulting in the formation of fleshy, edible galls. This phenomenon, observed and documented in China more than 2,000 years ago, led to the domestication of the Z. latifolia-U. esculenta complex as an aquatic vegetable called “Jiaobai”8,10,11,12.

The domestication of cultivated Z. latifolia has unique characteristics compared with that of other cultivated plants. According to historical literature, Z. latifolia was domesticated as a vegetable crop in the late Tang Dynasty (more than 1,000 years ago). Infection by U. esculenta abolished the host's ability to reproduce sexually, forcing cultivated Z. latifolia to rely on asexual tillers for propagation11,13. This reproductive constraint led to extremely low genetic diversity among cultivated varieties7,8. Notably, the domestication of Z. latifolia deviates from the usual binary co-evolutionary relationship between a single domesticated crop and humans14. Instead, Z. latifolia was domesticated as a plant-fungus complex, in which two closely associated species are simultaneously subjected to human selective pressure8,10,15. This unique domestication process makes cultivated Z. latifolia a potentially novel model for studying host-parasite co-evolution and the response of symbiotic systems to artificial selection8,13,16. Additionally, Z. latifolia, a close relative of Z. palustris, has historical significance as a former grain crop. This ancestral usage, combined with its perennial nature, suggests that Z. latifolia has the potential to be domesticated de novo as a new perennial grain crop12,17,18,19,20.

Significant progress has been made in understanding the genomic structure of Z. latifolia. A draft genome and a chromosome-level genome of wild Z. latifolia have been sequenced successively, providing valuable resources for exploring the origin of cultivated Z. latifolia and dissecting potential agronomic traits in wild Z. latifolia germplasm12,13,20. We have also previously made preliminary inferences about the possible domestication scenarios of cultivated Z. latifolia using molecular markers8. Despite these advances, a high-quality genome of cultivated Z. latifolia is still needed to resolve the origin of the crop and to infer the genetic basis of its domesticated traits.

This study presents the first near-complete chromosome-scale genome assembly of cultivated Z. latifolia, generated from long-read sequencing data and Hi-C data. The assembly yielded a 578.42 Mb genome with a contig N50 of 33.75 Mb, and the contigs were clustered into 17 chromosome-sized scaffolds with only one gap. The quality of the assembly was validated through Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis, which revealed 98.39% completeness. Furthermore, 39,934 protein-coding genes were predicted, 88.79% of which were functionally annotated.

This genome assembly and annotation provide a reference for comparative genomics in the genus Zizania. They will help researchers to reconstruct the domestication history of cultivated Z. latifolia and serve as an important resource for future conservation and breeding of Z. latifolia. These genomic resources pave the way for a deeper understanding of the evolutionary history of Z. latifolia and its potential for agricultural improvement.

Methods

Sampling and genomic sequencing

In 2022, a landrace of cultivated Z. latifolia was collected from a rural area near Tonglu city (29.78°N, 119.57°E) in Zhejiang province, China. The collected sample was transplanted to the Z. latifolia germplasm collection in Lushan Botanical Garden, and young leaves were then harvested for DNA extraction and genome sequencing. Genomic DNA was extracted following the CTAB method. DNA quality and concentration were examined using a NanoDrop ND2000 spectrophotometer (Thermo Fisher Scientific, USA) and a Qubit 3.0 Fluorometer (Thermo Fisher Scientific, USA).

For the genome survey, a paired-end (PE 150 bp) library was generated using the DNBSEQ-T7RS High-throughput Sequencing FCL PE150 Kit (MGI Tech, China), and the library was sequenced on a DNBSEQ-T7 platform (MGI Tech, China) following the manufacturer’s instructions. This yielded ~60.86 Gb of paired-end reads, covering about 110.7× of the estimated Z. latifolia genome (Supplementary Table S1). PacBio HiFi sequencing was then performed on the PacBio Revio platform (Pacific Biosciences, USA) according to the manufacturer’s instructions. It produced ~127.37 Gb of HiFi reads, equivalent to about 231.6× coverage of the Z. latifolia genome (Supplementary Table S1). To prepare the library for high-throughput chromosome conformation capture (Hi-C) sequencing, fresh leaves were crosslinked with formaldehyde. Subsequently, the Hi-C library was constructed according to the manufacturer’s instructions and sequenced on the DNBSEQ-T7 platform, generating ~111.89 Gb of raw reads (Supplementary Table S1). For RNA-seq, diverse tissues, including stems, leaves, inflorescences and roots, were collected and immediately frozen in liquid nitrogen, with three biological replicates. Total RNA was extracted and purified from each sample. RNA integrity was assessed on an Agilent 2100 Bioanalyzer (Agilent, USA). After DNase treatment, RNA-seq libraries were constructed and sequenced on the DNBSEQ-T7 platform with 150 bp paired-end reads according to the manufacturer’s recommended protocol. A total of ~21.65 Gb of RNA-seq reads were obtained to assist the subsequent analyses (Supplementary Table S1).
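The coverage values quoted above are the sequencing yield divided by the genome size. The following minimal sketch reproduces that arithmetic. It assumes a genome size of 550 Mb, a rounded form of the survey estimate reported in this paper; the exact denominator used for the published figures is not stated, and the function name is illustrative.

```python
def fold_coverage(yield_gb: float, genome_size_mb: float) -> float:
    """Return sequencing depth (x-fold) from yield in Gb and genome size in Mb."""
    return yield_gb * 1000.0 / genome_size_mb


# Assumption: 550 Mb, rounded from the ~550.84 Mb genome survey estimate.
GENOME_SIZE_MB = 550.0

for label, yield_gb in [("MGI short reads", 60.86), ("PacBio HiFi reads", 127.37)]:
    print(f"{label}: {fold_coverage(yield_gb, GENOME_SIZE_MB):.1f}x")
```

Under this assumption the two values round to 110.7× and 231.6×, matching the figures in the text.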

Genome estimation and chromosome-level assembly

Prior to genome assembly, a genome survey was conducted using the filtered MGI short reads to assess the main features of the Z. latifolia genome, including genome size, heterozygosity, and repetitive sequence content. The k-mer analyses (k = 17–31) were conducted using Jellyfish v2.1.421. The genome was then evaluated from the k-mer frequency distribution at k = 23 using GenomeScope22. The survey estimated a genome size of ~550.84 Mb with a heterozygosity of 0.39% and a repeat content of 43.59%.
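The principle behind k-mer-based size estimation is that the genome size is approximately the total number of k-mers divided by the depth of the main peak in the k-mer frequency distribution. The sketch below illustrates only this principle on a toy histogram. GenomeScope fits a statistical model to the distribution and is considerably more involved, so this is not a re-implementation of it. The error cutoff `min_depth` and the rule "the highest peak above the cutoff is the homozygous peak" are simplifying assumptions.

```python
def estimate_genome_size(histogram: dict[int, int], min_depth: int = 5) -> tuple[int, float]:
    """Estimate genome size from a k-mer depth histogram.

    `histogram` maps k-mer depth to the number of distinct k-mers seen at
    that depth (the two columns written by `jellyfish histo`).
    Depths below `min_depth` are treated as sequencing errors (assumed cutoff).
    """
    kept = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no k-mers above the error cutoff")
    peak_depth = max(kept, key=kept.get)  # assumed to be the homozygous peak
    total_kmers = sum(depth * count for depth, count in kept.items())
    return peak_depth, total_kmers / peak_depth


# Toy histogram: low-depth error k-mers plus a single peak centred on depth 50.
toy = {1: 1000, 2: 200, 48: 50, 49: 80, 50: 100, 51: 80, 52: 50}
peak, size = estimate_genome_size(toy)
print(peak, size)
```

For a genome with appreciable heterozygosity the highest peak may be the heterozygous one, which is one reason a model-based tool is preferable to this shortcut.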

The PacBio HiFi reads were assembled de novo using hifiasm v0.19.623 with default parameters. This initial assembly had a total size of ~583.74 Mb and comprised 41 contigs with a contig N50 of ~33.75 Mb. Hi-C sequencing data were then used to anchor the assembled contigs into pseudochromosomes. The filtered Hi-C data were first mapped to the polished genome assembly with Juicer v1.624. The uniquely mapped reads were then used as input for the 3D-DNA pipeline v18092225 with the parameter “-r 0”. Afterward, the Hi-C contact map was carefully inspected and visible misassemblies were corrected manually using Juicebox v1.11.0826. As a result, seventeen pseudochromosomes were identified by distinct interaction signals in the Hi-C interaction heatmap (Supplementary Fig. S1).
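For readers unfamiliar with the contiguity metric used here, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of the calculation follows, using toy lengths rather than the actual contig lengths of this assembly.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a list of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings the running sum to half the total.
        if running * 2 >= total:
            return length
    raise ValueError("empty length list")


# Toy example: the total is 290, so the halfway point (145) is reached by the second contig.
print(n50([80, 70, 50, 40, 30, 20]))
```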

We finally obtained a chromosome-level genome of ~578.42 Mb, close to the ~550.84 Mb estimated in the genome survey (Fig. 1, Table 1). This assembly incorporated 99.44% of the assembled contigs, resulting in a scaffold N50 of ~34.71 Mb. The GC content of the cultivated Z. latifolia genome was 43.26%. Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.4.327 was used to assess the integrity and completeness of the genome with the embryophyta gene set (odb10). Of the 1,614 BUSCOs, 1,588 (98.39%) were identified as complete, including 1,261 (78.07%) single-copy and 327 duplicated BUSCOs; 8 were fragmented and 18 were missing (Table 1).
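The BUSCO completeness figure is the share of complete (single-copy plus duplicated) orthologs among all orthologs searched. The sketch below recomputes the total and the completeness from the four counts given in the text. The function and key names are illustrative and are not part of the BUSCO software.

```python
def busco_summary(single: int, duplicated: int, fragmented: int, missing: int) -> dict:
    """Summarise BUSCO counts as totals and percentages of all searched BUSCOs."""
    total = single + duplicated + fragmented + missing
    complete = single + duplicated

    def pct(n: int) -> float:
        return round(100.0 * n / total, 2)

    return {
        "total": total,
        "complete": complete,
        "complete_pct": pct(complete),
        "fragmented_pct": pct(fragmented),
        "missing_pct": pct(missing),
    }


# Counts reported for the genome assembly in the text.
summary = busco_summary(single=1261, duplicated=327, fragmented=8, missing=18)
print(summary["total"], summary["complete"], summary["complete_pct"])
```

With these counts the sketch returns 1,614 BUSCOs in total, 1,588 complete, and a completeness of 98.39%.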

Fig. 1

A circular visualization of chromosomes in Z. latifolia genome. The outermost plot represents ideograms of 17 chromosomes (scale mark = 5 Mb). Moving from the second outermost track to the innermost track, each concentric circle denotes the GC content, density of protein-coding genes, repeat sequence density, Gypsy-like element distribution and Copia-like element distribution. The innermost track indicates genomic synteny among the chromosomes.

Table 1 Analytical summary of genome assembly and genome estimation analysis.

Repeat elements prediction

Repeat elements in the assembled genome were identified by combining de novo and homology-based methods. Tandem repeat sequences were annotated using Tandem Repeat Finder (TRF v4.09)28 with default parameters. For de novo searches, RepeatModeler v1.0.1129 and LTR_FINDER v1.0730 were used with default parameters to construct de novo repeat libraries. Subsequently, RepeatMasker v4.0.931 was applied to detect repeat sequences based on these libraries. For homology-based searches, RepeatMasker v4.0.9 was run against the known repeat library Repbase v23.0832.

In total, we identified ~276.82 Mb of repetitive sequence, representing 47.59% of the entire genome. The majority of these repeats were long terminal repeat (LTR) elements, which accounted for 35.08% of the genome. DNA transposons, long interspersed nuclear elements (LINEs), and short interspersed nuclear elements (SINEs) accounted for 10.17%, 0.92%, and 0.01% of the genome, respectively (Supplementary Table S2).

Gene prediction and functional annotation

To annotate protein-coding genes in the cultivated Z. latifolia genome, we combined three approaches: ab initio prediction, homology-based prediction and transcriptome-based prediction.

The assembled genome was masked with RepeatMasker v4.0.931 to prevent repetitive sequences from interfering with gene prediction. Ab initio gene prediction was performed with AUGUSTUS v3.2.232 and GlimmerHMM v3.0233, which rely on statistical models of gene structure, using default settings. For homology-based gene annotation, Exonerate v2.2.034 was used to align protein sequences from wild Z. latifolia (NGDC Genome Warehouse, GWHBFHI00000000)12,35, Z. palustris (NCBI database, GCA_019279435.1)5, Oryza sativa (MSU 7.0)36 and Aegilops tauschii (NCBI database, GCF_002575655.2)37 to the genome. For transcriptome-based gene prediction, quality-controlled RNA-seq reads were mapped to the wild Z. latifolia genome with HISAT2 v2.1.038, and StringTie v1.3.539 was used to generate transcripts by reference-guided assembly. In addition, Trinity v2.15.140 was used for de novo transcript assembly from the RNA-seq data. The resulting transcripts were merged, and redundant sequences were removed using CD-HIT v4.8.141. TransDecoder v5.5.0 (https://github.com/TransDecoder/TransDecoder) was then used to predict open reading frames (ORFs) in the assembled transcripts.

Maker2 v2.31.1042 was used with default parameters to integrate the three sets of gene predictions into a consensus gene set. The integration resulted in the prediction of 39,934 protein-coding genes distributed across the genome, with a mean gene length of 5,087.29 bp. Functional annotation was carried out by aligning the predicted protein sequences against public functional databases, including TrEMBL44, NCBI-nr45, KEGG46, InterPro47, KOG48 and SwissProt49, using BLAST v2.11.043 (e-value < 1e-5). This annotation process yielded 35,458 functionally annotated genes, representing 88.79% of the protein-coding genes (Supplementary Table S3). Gene Ontology (GO) annotation was performed using InterProScan v5.55–88.050 (Supplementary Fig. S2).
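The functionally annotated fraction can be illustrated as a set union over per-database hits. The sketch below assumes that a gene counts as annotated if it has a hit in at least one database. This rule is an assumption made for illustration, since the text does not spell out the criterion, and the gene identifiers are hypothetical.

```python
def annotated_fraction(hits_by_db: dict[str, set[str]], all_genes: set[str]) -> tuple[int, float]:
    """Count genes with a hit in at least one database and return (count, percent)."""
    annotated = set().union(*hits_by_db.values()) & all_genes
    return len(annotated), 100.0 * len(annotated) / len(all_genes)


# Toy example with hypothetical gene identifiers.
genes = {"g1", "g2", "g3", "g4", "g5"}
hits = {
    "SwissProt": {"g1", "g2"},
    "KEGG": {"g2", "g3"},
    "InterPro": {"g3", "g4"},
}
print(annotated_fraction(hits, genes))

# The reported proportion follows directly from the two published counts.
print(round(100.0 * 35458 / 39934, 2))
```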

To provide a comprehensive visual representation of the cultivated Z. latifolia genome, we employed Circos v0.69-951 to create a circular genome map. This visualization depicts the distribution of several key genomic features across the 17 chromosomes, including the GC content, density of protein-coding genes, repeat sequence density, Gypsy-like element, Copia-like element and intra-genomic synteny (Fig. 1). In addition to protein-coding genes, we also annotated various non-coding RNA elements in the genome. tRNAscan-SE v1.3.152 software was used to predict tRNAs. The rRNA, miRNA, and snRNA were predicted using INFERNAL v1.1.253 software through searches against the Rfam database v14.854. The non-coding RNA annotation yielded 228 miRNAs, 2,805 rRNAs, 659 tRNAs, and 756 snRNAs in the cultivated Z. latifolia genome (Supplementary Table S4).

Data Records

The sequencing data and genome assembly were deposited in the National Genomics Data Center (NGDC), Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation, under BioProject accession number PRJCA02078655. The MGI short reads, PacBio HiFi long reads, RNA-seq reads and Hi-C reads were deposited in the Genome Sequence Archive (GSA) of NGDC under accession numbers CRA01318656, CRA01798857, CRA01809158 and CRA01798759, respectively. The genome assembly was deposited in GenBank under accession number GCA_043380935.160 and in the Genome Warehouse (GWH) of NGDC under accession number GWHFFOM0000000061. Furthermore, the assembled genome and annotation data were deposited in the figshare database for broader accessibility62.

Technical Validation

Genome assembly assessment

Two approaches were used to evaluate the robustness and completeness of the assembled genome. First, the conserved protein models from the lineage database embryophyta_odb10 were searched against the genome using Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.4.3. In total, 98.39% of the genes were present in the assembled genome, indicating that the great majority of essential, conserved genes were captured. Second, the MGI paired-end short reads generated for the genome survey were mapped to the final genome using BWA v0.7.1263 with default settings. Approximately 99.59% of the short reads aligned to the genome, covering 98.50% of the assembly.

In addition, the plant-type telomeric repeat (TTTAGGG)n was identified in all seventeen chromosome sequences. Thirteen chromosomes harbored telomeric repeats at both ends, and the remaining four had telomeric repeats at one end (Supplementary Table S5), underlining the near-complete assembly of chromosome ends.
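A simple way to screen chromosome ends for telomeric repeats is to count copies of the repeat unit, on either strand, within a terminal window. The sketch below shows this idea only. The window size and the minimum copy number are illustrative assumptions, as the detection criteria used for Supplementary Table S5 are not described in the text.

```python
import re

TELOMERE_FWD = "TTTAGGG"  # plant-type telomeric repeat unit
TELOMERE_REV = "CCCTAAA"  # its reverse complement

_PATTERN = re.compile(f"(?:{TELOMERE_FWD}|{TELOMERE_REV})")


def has_telomere(seq: str, window: int = 10_000, min_copies: int = 10) -> tuple[bool, bool]:
    """Return (start_has_telomere, end_has_telomere) for one chromosome sequence.

    `window` and `min_copies` are assumed thresholds chosen for illustration.
    """
    def enough(region: str) -> bool:
        return len(_PATTERN.findall(region.upper())) >= min_copies

    return enough(seq[:window]), enough(seq[-window:])


# Toy sequences: repeats at both ends, and repeats at the 3' end only.
both_ends = "CCCTAAA" * 20 + "ACGT" * 5000 + "TTTAGGG" * 20
one_end = "ACGT" * 5000 + "TTTAGGG" * 20
print(has_telomere(both_ends), has_telomere(one_end))
```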

We further compared the assembly metrics of the newly assembled cultivated Z. latifolia genome with those of two published wild Z. latifolia genomes12,13 and found that it has better integrity and contiguity (Table 2). We also investigated the syntenic relationships between the cultivated Z. latifolia genome and the other two published chromosome-level Zizania genomes5,12 using JCVI v1.2.764. The results indicate that our assembly of cultivated Z. latifolia has superior sequence continuity and correctness (Supplementary Fig. S3).

Table 2 Analytical summary of the three published whole-genome assemblies of Zizania latifolia.

Assessment of the gene annotation

The integrated set of annotated proteins was also evaluated using BUSCO v5.4.3 with the lineage dataset embryophyta_odb10. Briefly, 98.10% of the core genes were complete (1,218 single-copy and 365 duplicated genes), with only a few fragmented (1.40%) and missing (2.40%) genes, indicating high-quality annotation of the predicted gene models.