De novo assembly of a near-complete genome of aquatic vegetable Zizania latifolia in the Yangtze River Basin

Zhao, Yao; Liao, Li-bing; Zhu, Zi-wei; Zhang, Li-dong; Xiong, Zi-dong; Song, Zhi-ping; Yan, Ning; Zhong, Ai-wen; Zhang, Jian; Zhou, Cheng-chuan; Rong, Jun

doi:10.1038/s41597-024-04220-0

Data Descriptor
Open access
Published: 18 December 2024

Yao Zhao^1,2,3,
Li-bing Liao^3,
Zi-wei Zhu^2,4,
Li-dong Zhang^1 (ORCID: orcid.org/0009-0000-8141-881X),
Zi-dong Xiong^1,
Zhi-ping Song^5,
Ning Yan^6,
Ai-wen Zhong^3,
Jian Zhang^1,2,
Cheng-chuan Zhou^4 &
…
Jun Rong^1,2,3 (ORCID: orcid.org/0000-0003-1408-2898)

Scientific Data volume 11, Article number: 1341 (2024)

Abstract

The cultivated Zizania latifolia, an aquatic vegetable prevalent in the Yangtze River Basin, represents a unique plant-fungus complex whose domestication is associated with host-parasite co-evolution. In this study, we present a high-quality, chromosome-scale genome assembly of cultivated Z. latifolia. We employed PacBio long-read sequencing and Hi-C technology to generate a ~578.42 Mb genome assembly with a contig N50 of ~33.75 Mb, of which 47.59% consists of repeat sequences. The contigs were clustered into 17 chromosome-sized scaffolds with a GC content of 43.26%, showing 98.39% completeness in BUSCO analysis. In total, we predicted 39,934 protein-coding genes, 88.79% of which could be functionally annotated. This genome assembly provides a valuable resource for unraveling the domestication process of Z. latifolia and advances our understanding of its evolutionary history and agricultural potential.

Background & Summary

The genus Zizania, belonging to the rice tribe (Oryzeae) of the grass family Poaceae, is closely related to the genus Oryza, along with Leersia^1,2,3,4. Of the four species within Zizania, three are native to North America, including the annual Zizania palustris, commonly known as wild rice, which has recently been domesticated as a grain crop^3,5,6. The only East Asian species, the perennial Zizania latifolia, is prevalent in freshwater wetlands of eastern China and plays a crucial role in emergent plant communities^7,8,9. Interestingly, the young stems of Z. latifolia can be infected by the smut fungus Ustilago esculenta, resulting in the formation of fleshy, edible galls. This phenomenon, observed and documented in ancient China over 2,000 years ago, led to the domestication of the Z. latifolia-U. esculenta complex as an aquatic vegetable called “Jiaobai”^8,10,11,12.

The domestication of cultivated Z. latifolia presents unique characteristics compared to other cultivated plants. According to the historical literature, Z. latifolia was domesticated as a vegetable crop in the late Tang Dynasty (more than 1000 years ago). Infection by U. esculenta abolished the plant's ability to reproduce sexually, forcing the cultivated Z. latifolia to rely on asexual tillers for reproduction^11,13. This reproductive constraint led to extremely low genetic diversity among cultivated varieties^7,8. Notably, the domestication of Z. latifolia deviates from the ordinary binary co-evolutionary relationship consisting of a sole domesticated crop and humans¹⁴. Instead, Z. latifolia was domesticated as a plant-fungus complex, involving two closely associated species that are simultaneously subjected to human selective pressure^8,10,15. This unique domestication process makes cultivated Z. latifolia a potentially novel model for studying host-parasite co-evolution and the response of symbiotic systems to artificial selection^8,13,16. Additionally, as a close relative of Z. palustris, Z. latifolia has historical significance as a former grain crop. This ancestral usage, combined with its perennial nature, suggests that Z. latifolia has the potential to be de novo domesticated into a new perennial grain crop^12,17,18,19,20.

Significant progress has been made in understanding the genomic structure of Z. latifolia. The draft genome and the chromosome-level genome of wild Z. latifolia have been sequenced successively, providing valuable resources for exploring the origin of cultivated Z. latifolia and dissecting potential agronomic traits in wild Z. latifolia germplasm^12,13,20. We have also previously made preliminary inferences on the possible domestication scenarios of cultivated Z. latifolia using molecular markers⁸. Despite these advances, a high-quality genome of cultivated Z. latifolia remains indispensable for resolving questions about its origin and for inferring the genetic basis of domesticated traits.

This study presents the first near-complete chromosomal-scale genome assembly of cultivated Z. latifolia using long-read sequencing data and Hi-C sequencing technologies. The assembly yielded a 578.42 Mb genome with a contig N50 of 33.75 Mb, and the contigs were successfully clustered into 17 chromosomal-sized scaffolds with only one gap. The assembly’s quality was validated through Benchmarking Universal Single-Copy Ortholog (BUSCO) analysis, which revealed 98.39% completeness. Furthermore, 39,934 protein-coding genes were predicted, with 88.79% of these genes being functionally annotated.

This genome assembly and annotation provide a reference and a milestone for comparative genomics in the genus Zizania. They enable researchers to investigate the domestication of cultivated Z. latifolia and serve as an important resource for future conservation and breeding efforts. These genomic insights pave the way for a deeper understanding of the evolutionary history of Z. latifolia and its potential for agricultural improvement.

Methods

Sampling and genomic sequencing

In 2022, a landrace of cultivated Z. latifolia was collected from a rural area near Tonglu city (29.78°N, 119.57°E), Zhejiang province, China. The collected sample was transplanted to the Z. latifolia germplasm collection at Lushan Botanical Garden, and young leaves were then harvested for DNA extraction and genome sequencing. Genomic DNA was extracted following the CTAB method. DNA quality and concentration were examined using a NanoDrop ND2000 spectrophotometer (Thermo Fisher Scientific, USA) and a Qubit 3.0 Fluorometer (Thermo Fisher Scientific, USA).

For the genome survey, a paired-end (PE 150 bp) library was generated using the DNBSEQ-T7RS High-throughput Sequencing FCL PE150 Kit (MGI Tech, China), and the library was sequenced on a DNBSEQ-T7 platform (MGI Tech, China) following the manufacturer’s instructions. This yielded ~60.86 Gb of paired-end reads, covering about 110.7× of the estimated Z. latifolia genome (Supplementary Table S1). PacBio HiFi sequencing was then performed on the PacBio Revio platform (Pacific Biosciences, USA) according to the manufacturer’s instructions. It produced ~127.37 Gb of HiFi reads, equivalent to about 231.6× coverage of the Z. latifolia genome (Supplementary Table S1). To prepare the library for high-throughput chromosome conformation capture (Hi-C) sequencing, fresh leaves were crosslinked with formaldehyde. Subsequently, the Hi-C library was constructed according to the instructions and sequenced on the DNBSEQ-T7 platform, generating ~111.89 Gb of raw reads (Supplementary Table S1). For RNA-seq, diverse tissues, including stems, leaves, inflorescences and roots, were collected and immediately frozen in liquid nitrogen, with three biological replicates. Total RNA was extracted and purified from each sample. RNA integrity was assessed on an Agilent 2100 Bioanalyzer (Agilent, USA). After DNase treatment, RNA-seq libraries were constructed and sequenced on the DNBSEQ-T7 platform with 150 bp paired-end reads according to the manufacturer’s recommended protocol. A total of ~21.65 Gb of RNA-seq reads were obtained to assist the subsequent analysis (Supplementary Table S1).
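The coverage figures quoted above follow from dividing the sequenced bases by the genome size. The sketch below is only an illustration of that arithmetic: the function name is hypothetical, and the genome size of about 550 Mb is an assumption (the survey estimate, rounded), not a value taken from the authors' pipeline.

```python
def fold_coverage(total_bases: float, genome_size_bp: float) -> float:
    """Return sequencing depth as total sequenced bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size_bp


# Assumption: ~550 Mb genome (rounded survey estimate).
GENOME_SIZE_BP = 550e6

# Read yields in Gb, as reported in the text.
YIELDS_GB = {"MGI short reads": 60.86, "PacBio HiFi": 127.37}

for name, gigabases in YIELDS_GB.items():
    depth = fold_coverage(gigabases * 1e9, GENOME_SIZE_BP)
    print(f"{name}: {depth:.1f}x")
```

Under that assumed genome size, both values land within a fraction of a fold of the reported 110.7× and 231.6×.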

Genome estimation and chromosome-level assembly

Prior to the actual genome assembly, a genome survey was conducted using the filtered MGI short reads to assess the main features of the Z. latifolia genome, including genome size, heterozygosity, and repetitive sequence content. K-mer analyses (k = 17–31) were conducted using Jellyfish v2.1.4²¹. Genome features were then evaluated from the k-mer frequency distribution at k = 23 using GenomeScope²². The survey estimated a genome size of ~550.84 Mb, with a heterozygosity of 0.39% and a repeat content of 43.59%.
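The idea behind a k-mer survey can be sketched with the simplest estimator: genome size is roughly the total number of k-mer occurrences divided by the depth of the main peak. This is not the model that GenomeScope fits (which also accounts for heterozygosity and repeats); the function name, the error cutoff `min_depth`, and the toy histogram are all assumptions made for illustration.

```python
def estimate_genome_size(histogram: dict[int, int], min_depth: int = 5) -> tuple[int, float]:
    """Estimate genome size from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored.
    Returns (peak_depth, estimated_size_bp).
    """
    solid = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # The main peak is the depth with the most distinct k-mers.
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences = sum of depth * number of k-mers at that depth.
    total_kmers = sum(depth * n for depth, n in solid.items())
    return peak_depth, total_kmers / peak_depth


# Toy histogram (hypothetical): an error tail at depth 1-2 and a peak at depth 20.
toy_histogram = {1: 1000, 2: 300, 18: 50, 19: 80, 20: 100, 21: 80, 22: 50}
peak, size = estimate_genome_size(toy_histogram)
print(peak, size)
```

A real histogram would have one row per depth, and the choice of error cutoff noticeably affects the estimate, which is one reason model-based tools are preferred.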

The PacBio HiFi reads were assembled de novo using hifiasm v0.19.6²³ with default parameters. This initial assembly had a total size of ~583.74 Mb and comprised 41 contigs with an N50 of ~33.75 Mb. Hi-C sequencing data were then used to anchor the assembled contigs into pseudochromosomes. The filtered Hi-C data were first mapped to the polished genome assembly with Juicer v1.6²⁴. The uniquely mapped reads were then taken as input for the 3D-DNA pipeline v180922²⁵ with the parameter “-r 0”. Afterward, the scaffolding was manually inspected, and visible errors in the contact map were corrected using Juicebox v1.11.08²⁶. As a result, seventeen pseudochromosomes were identified by distinct interaction signals in the Hi-C interaction heatmap (Supplementary Fig. S1).
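Contig N50, the contiguity statistic reported above, is the length L such that contigs of length L or longer contain at least half of the assembled bases. A minimal sketch of the calculation follows; the contig lengths are toy values, not the actual assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (hypothetical), in Mb.
toy_contigs = [40, 30, 20, 10]
print(n50(toy_contigs))
```

The same function applies to scaffold lengths when computing scaffold N50.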

We finally obtained a chromosome-level genome of ~578.42 Mb, close to the genome size of ~550.84 Mb estimated in the initial survey (Fig. 1, Table 1). This assembly incorporated 99.44% of the assembled contigs, resulting in a scaffold N50 of ~34.71 Mb. The GC content of the cultivated Z. latifolia genome was 43.26%. Benchmarking Universal Single-Copy Ortholog (BUSCO) v5.4.3²⁷ was employed to assess the integrity, purity and completeness of the genome using the embryophyta gene set (odb10). Out of the 1614 BUSCOs, 1588 (98.39%) were identified as complete, including 1261 (78.07%) single-copy BUSCOs. Additionally, 327 BUSCO genes were identified as duplicated, 8 as fragmented and 18 as missing (Table 1).
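The BUSCO completeness figure can be reproduced from the category counts, since complete BUSCOs are the sum of single-copy and duplicated ones. The small class below is a hypothetical helper for that bookkeeping, not part of BUSCO itself; the counts are those given in the text.

```python
from dataclasses import dataclass


@dataclass
class BuscoCounts:
    single: int
    duplicated: int
    fragmented: int
    missing: int

    @property
    def complete(self) -> int:
        # Complete = single-copy + duplicated.
        return self.single + self.duplicated

    @property
    def total(self) -> int:
        return self.complete + self.fragmented + self.missing

    def percent_complete(self) -> float:
        return 100 * self.complete / self.total


# Counts reported for the genome assembly (embryophyta_odb10).
genome_busco = BuscoCounts(single=1261, duplicated=327, fragmented=8, missing=18)
print(genome_busco.complete, genome_busco.total,
      round(genome_busco.percent_complete(), 2))
```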

Table 1 Analytical summary of genome assembly and genome estimation analysis.

Repeat elements prediction

Repeat elements in the assembled genome were identified by combining de novo and homology-based methods. Tandem repeat sequences were annotated using Tandem Repeats Finder (TRF v4.09)²⁸ with default parameters. For de novo searches, RepeatModeler v1.0.11²⁹ and LTR_FINDER v1.07³⁰ were used with default parameters to construct the de novo repeat libraries. Subsequently, RepeatMasker v4.0.9³¹ was applied to detect repeat sequences based on these libraries. For homology-based searches, RepeatMasker v4.0.9 was run against the known repeat library Repbase v23.08³².

Together, these analyses identified ~276.82 Mb of repeat sequences, representing 47.59% of the entire genome. The majority of these repeats were long terminal repeats (LTRs), which accounted for 35.08% of the genome. DNA transposons, long interspersed nuclear elements (LINEs), and short interspersed nuclear elements (SINEs) accounted for 10.17%, 0.92%, and 0.01% of the genome, respectively (Supplementary Table S2).

Gene prediction and functional annotation

To annotate protein-coding genes in the cultivated Z. latifolia genome, we combined three approaches: ab initio prediction, homology-based prediction and transcriptome-based prediction.

The assembled genome was masked by RepeatMasker v4.0.9³¹ to prevent repetitive sequences from interfering with gene prediction. Ab initio gene prediction was performed with default settings using AUGUSTUS v3.2.2³² and GlimmerHMM v3.02³³, which are based on statistical models of gene structure. For homology-based gene annotation, Exonerate v2.2.0³⁴ was employed to search against protein sequences from wild Z. latifolia (NGDC Genome Warehouse, GWHBFHI00000000)^12,35, Z. palustris (NCBI database, GCA_019279435.1)⁵, Oryza sativa (MSU 7.0)³⁶ and Aegilops tauschii (NCBI database, GCF_002575655.2)³⁷. For transcriptome-based gene prediction, quality-controlled RNA-seq reads were mapped to the wild Z. latifolia genome by HiSat2 v2.1.0³⁸, and StringTie v1.3.5³⁹ was used to generate transcripts by reference-guided assembly. Moreover, Trinity v2.15.1⁴⁰ was employed for de novo transcript assembly from the RNA-seq data. The resulting transcripts were consolidated, with redundancies removed using CD-HIT v4.8.1⁴¹. TransDecoder v5.5.0 (https://github.com/TransDecoder/TransDecoder) was then used to predict open reading frames (ORFs) from the assembled transcripts.

Maker2 v2.31.10⁴² was used with default parameters to integrate the three sets of gene predictions into a consensus gene set. The integration resulted in the prediction of 39,934 protein-coding genes distributed across the genome, with a mean gene length of 5,087.29 bp. Gene functional annotation was carried out by aligning the predicted protein sequences against public functional databases using BLAST v2.11.0⁴³ (e-value < 10⁻⁵), including TrEMBL⁴⁴, NCBI-nr⁴⁵, KEGG⁴⁶, InterPro⁴⁷, KOG⁴⁸ and SwissProt⁴⁹. This annotation process resulted in 35,458 functionally annotated genes, representing 88.79% of the protein-coding genes (Supplementary Table S3). Gene Ontology (GO) annotation was performed using InterProScan v5.55-88.0⁵⁰ (Supplementary Fig. S2).
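One way to apply an e-value cutoff such as the one above is to filter BLAST tabular output, where the e-value is the eleventh of the twelve default columns. The sketch below assumes that default tabular layout; the function name, gene IDs and hit lines are hypothetical, and it is not a reconstruction of the authors' annotation scripts.

```python
def annotated_queries(blast_lines, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit below max_evalue.

    Expects tab-separated BLAST tabular output with the 12 default columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    queries = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if float(fields[10]) < max_evalue:
            queries.add(fields[0])
    return queries


# Toy hits (hypothetical IDs): geneB fails the e-value cutoff.
toy_hits = [
    "geneA\tsp|P1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t200",
    "geneB\tsp|P2\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t30",
    "geneC\tsp|P3\t70.0\t80\t24\t0\t1\t80\t1\t80\t2e-06\t90",
]
kept = annotated_queries(toy_hits)
print(sorted(kept))

# Annotated fraction from the counts reported in the text.
percent_annotated = 100 * 35458 / 39934
print(round(percent_annotated, 2))
```

Taking the union of such sets across databases gives the number of genes with at least one functional annotation.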

To provide a comprehensive visual representation of the cultivated Z. latifolia genome, we employed Circos v0.69-9⁵¹ to create a circular genome map. This visualization depicts the distribution of several key genomic features across the 17 chromosomes, including the GC content, density of protein-coding genes, repeat sequence density, Gypsy-like element, Copia-like element and intra-genomic synteny (Fig. 1). In addition to protein-coding genes, we also annotated various non-coding RNA elements in the genome. tRNAscan-SE v1.3.1⁵² software was used to predict tRNAs. The rRNA, miRNA, and snRNA were predicted using INFERNAL v1.1.2⁵³ software through searches against the Rfam database v14.8⁵⁴. The non-coding RNA annotation yielded 228 miRNAs, 2,805 rRNAs, 659 tRNAs, and 756 snRNAs in the cultivated Z. latifolia genome (Supplementary Table S4).
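Tracks such as GC content or feature density in a circular genome map are usually computed per window along each chromosome before plotting. The following sketch computes GC fraction in non-overlapping windows; the window size and function name are assumptions, and the source does not state how the authors binned their tracks.

```python
def gc_content_windows(seq, window=1000):
    """Yield (start, end, gc_fraction) for consecutive non-overlapping windows.

    Only A, C, G and T are counted, so runs of N do not dilute the estimate.
    """
    seq = seq.upper()
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        acgt = sum(chunk.count(base) for base in "ACGT")
        gc = chunk.count("G") + chunk.count("C")
        yield start, start + len(chunk), (gc / acgt if acgt else 0.0)


# Toy sequence (hypothetical): one GC-balanced window, one AT-only window.
toy_seq = "GGCCAATT" + "ATATATAT"
toy_track = list(gc_content_windows(toy_seq, window=8))
print(toy_track)
```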

Data Records

The sequencing data and genome assembly were deposited in the National Genomics Data Center (NGDC), Beijing Institute of Genomics, the Chinese Academy of Sciences/China National Center for Bioinformation with BioProject accession number PRJCA020786⁵⁵. The sequencing data of MGI short reads, PacBio HiFi long-reads, RNA-seq data, Hi-C reads were deposited in the Genome Sequence Archive (GSA) of NGDC under accession numbers CRA013186⁵⁶, CRA017988⁵⁷, CRA018091⁵⁸ and CRA017987⁵⁹, respectively. The genome assembly was deposited in GenBank under the accession number GCA_043380935.1⁶⁰, and it was also deposited in Genome Warehouse (GWH) of NGDC under the accession number GWHFFOM00000000⁶¹. Furthermore, the assembled genome and annotation data were deposited in the figshare database for broader accessibility⁶².

Technical Validation

Genome assembly assessment

Two approaches were used to evaluate the robustness and completeness of the assembled genome. First, the conserved protein models from the lineage database embryophyta_odb10 were searched against the genome using Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.4.3. In total, 98.39% of the genes were present in the assembled genome, which suggests that a substantial majority of the essential and conserved genes were successfully captured. Second, the MGI short paired-end reads generated in the genome survey were mapped to the final genome using BWA v0.7.12⁶³ with default settings. Approximately 99.59% of the short reads were aligned to the genome, covering 98.50% of the assembled genome.

In addition, the plant-specific telomeric repeats (T₃AG₃) were identified in all seventeen chromosome sequences. Thirteen chromosomes harbored telomeric repeats at both ends, and the remaining four had telomeric repeats at one end (Supplementary Table S5), underlining the near-complete assembly of chromosome ends.
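A telomere check of this kind can be sketched by counting copies of the plant-type repeat TTTAGGG, or its reverse complement CCCTAAA, near each end of a chromosome sequence. The window size, the minimum repeat count and the function name below are illustrative assumptions; the source does not describe the thresholds the authors used.

```python
import re

# Plant-type telomeric repeat (T3AG3) and its reverse complement.
FORWARD = "TTTAGGG"
REVERSE = "CCCTAAA"
TELOMERE = re.compile(f"(?:{FORWARD}|{REVERSE})")


def telomere_ends(seq, window=5000, min_repeats=10):
    """Return (start_has_telomere, end_has_telomere) for one chromosome.

    An end is called telomeric when at least min_repeats repeat units
    occur within the terminal `window` bases.
    """
    seq = seq.upper()
    head, tail = seq[:window], seq[-window:]
    return (len(TELOMERE.findall(head)) >= min_repeats,
            len(TELOMERE.findall(tail)) >= min_repeats)


# Toy chromosomes (hypothetical): telomeres at both ends, and at one end only.
body = "ACGT" * 500
both_ends = REVERSE * 20 + body + FORWARD * 20
one_end = body + FORWARD * 20
print(telomere_ends(both_ends, window=200))
print(telomere_ends(one_end, window=200))
```

Tallying the two flags per chromosome gives the kind of both-ends versus one-end summary reported above.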

We further compared the assembly statistics of the newly assembled cultivated Z. latifolia genome with those of two published wild Z. latifolia genomes^12,13 and found that it has better assembly integrity and contiguity (Table 2). We also investigated the syntenic relationships between the cultivated Z. latifolia genome and the other two published chromosome-level Zizania genomes^5,12 using JCVI v1.2.7⁶⁴. The results indicate that our genome assembly of cultivated Z. latifolia shows superior sequence continuity and correctness (Supplementary Fig. S3).

Table 2 Summary of the three published whole-genome assemblies of Zizania latifolia.

Assessment of the gene annotation

The annotated and integrated proteins were also evaluated using BUSCO v5.4.3 with the lineage dataset embryophyta_odb10. Briefly, the proportion of complete core genes was 98.10% (1218 single-copy genes and 365 duplicated genes), and there were only a few fragmented (1.40%) and missing (2.40%) genes, indicating high-quality annotation of the predicted gene models.

Code availability

No custom code was used in this study. All bioinformatics tools and software applications were executed in accordance with their respective manuals and protocols. The specific software versions and parameters used are detailed in the Methods section.

References

Kellogg, E. A. The evolutionary history of Ehrhartoideae, Oryzeae, and Oryza. Rice. 2, 1–14 (2009).
Xu, X. et al. Phylogeny and biogeography of the eastern Asian-North American disjunct wild-rice genus (Zizania L., Poaceae). Mol. Phylogenet. Evol. 55, 1008–1017 (2010).
Porter, R. in North American crop wild relatives: important species. Vol.2 (eds. Greene, S. L., Williams, K. A., Khoury, C. K., Kantar, M. B., Marek, L. F.) Ch.3 (Springer International Publishing 2019).
Zhang, T. et al. Phylogenomic profiles of whole-genome duplications in Poaceae and landscape of differential duplicate retention and losses among major Poaceae lineages. Nat. Commun. 15, 3305 (2024).
Haas, M. et al. Whole-genome assembly and annotation of northern wild rice, Zizania palustris L., supports a whole-genome duplication in the Zizania genus. Plant J. 107, 1802–1818 (2021).
McGilp, L., Castell-Miller, C., Haas, M., Millas, R. & Kimball, J. Northern Wild Rice (Zizania palustris L.) breeding, genetics, and conservation. Crop Sci. 63, 1904–1933 (2023).
Xu, X., Ke, W., Yu, X., Wen, J. & Ge, S. A preliminary study on population genetic structure and phylogeography of the wild and cultivated Zizania latifolia (Poaceae) based on Adh1a sequences. Theor. Appl. Genet. 116, 835–843 (2008).
Zhao, Y. et al. Inferring the origin of cultivated Zizania latifolia, an aquatic vegetable of a plant-fungus complex in the Yangtze River Basin. Front. Plant Sci. 10, 1406 (2019).
Wagutu, G. K. Genetic structure of wild rice Zizania latifolia in an expansive heterogeneous landscape along a latitudinal gradient. Front. Ecol. Evol. 10, 929944 (2022).
Chan, Y. S. & Thrower, L. The host-parasite relationship between Zizania caduciflora Turcz. and Ustilago esculenta P. Henn. I. structure and development of the host and host-parasite combination. New Phytol. 85, 201–207 (1980).
Guo, H. B., Li, S. M., Peng, J. & Ke, W. D. Zizania latifolia Turcz. Cultivated in China. Genet. Resour. Crop Evol. 54, 1211–1217 (2007).
Yan, N. et al. Chromosome-level genome assembly of Zizania latifolia provides insights into its seed shattering and phytocassane biosynthesis. Commun. Biol. 5, 36 (2022).
Guo, L. B. et al. A host plant genome (Zizania latifolia) after a century-long endophyte infection. Plant J. 83, 600–609 (2015).
Purugganan, M. D. & Fuller, D. Q. The nature of selection during plant domestication. Nature. 457, 843–848 (2009).
Zhang, J. Z., Chu, F. Q., Guo, D. P., Hyde, K. D. & Xie, G. L. Cytology and ultrastructure of interactions between Ustilago esculenta and Zizania latifolia. Mycol. Prog. 11, 499–508 (2012).
Guttman, D. S., McHardy, A. C. & Schulze-Lefert, P. Microbial genome-enabled insights into plant-microorganism interactions. Nat. Rev. Genet. 15, 797–813 (2014).
Zhai, C. K., Jiang, X. L., Xu, Y. S. & Lorenz, K. J. Protein and amino acid composition of Chinese and North American wild rice. J. Food Compos. Anal. 14, 371–382 (1994).
Zhao, Y. et al. Seed characteristic variations and genetic structure of wild Zizania latifolia along a latitudinal gradient in China: implications for neo-domestication as a grain crop. AoB. PLANTS. 10, ply072 (2018).
Yan, N. et al. A comparative UHPLC-QqQ-MS-based metabolomics approach for evaluating Chinese and North American wild rice. Food Chem. 275, 618–627 (2019).
Xie, Y. N. et al. Domestication, breeding, omics research, and important genes of Zizania latifolia and Zizania palustris. Front. Plant Sci. 14, 1183739 (2023).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 27, 764–770 (2011).
Vurture, G. W. et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics. 33, 2202–2204 (2017).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with Hifiasm. Nat. Methods. 18, 170–175 (2021).
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 3, 95–98 (2016).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science. 356, 92–95 (2017).
Dudchenko, O. et al. The Juicebox assembly tools module facilitates de novo assembly of mammalian genomes with chromosome-length scaffolds for under $1000. Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 2018.
Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update: Novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol. 38, 4647–4654 (2021).
Benson, G. Tandem Repeats Finder: a program to analyze DNA sequences. Nucleic. Acids. Res. 27, 573–580 (1999).
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics. 21(Suppl 1), i351–i358 (2005).
Xu, Z. & Wang, H. LTR_Finder: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic. Acids. Res. 35, W265–W268 (2007).
Tempel, S. Using and understanding Repeatmasker. Totowa, NJ: Humana Press, 29–51 (2012).
Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA. 6, 11 (2015).
Stanke, M. et al. Augustus: ab initio prediction of alternative transcripts. Nucleic. Acids. Res. 34, W435–W439 (2006).
Majoros, W. H., Pertea, M. & Salzberg, S. L. Tigrscan and Glimmerhmm: two open source ab initio eukaryotic gene-finders. Bioinformatics. 20, 2878–2879 (2004).
Slater, G. S. C. & Birney, E. Automated Generation of Heuristics for Biological sequence comparison. BMC Bioinform. 6, 31 (2005).
Ouyang, S. et al. The TIGR Rice Genome Annotation Resource: improvements and new features. Nucleic. Acids. Res. 35, D883–D887 (2007).
Wang, L. et al. Aegilops tauschii genome assembly Aet v5.0 features greater sequence contiguity and improved annotation. G3-Genes. Genom. Genet. 11, jkab325 (2021).
Kim, D., Langmead, B. & Salzberg, S. L. Hisat: a fast spliced aligner with low memory requirements. Nat. Methods. 12, 357–360 (2015).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
Grabherr, M. G. et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-seq data. Nat. Biotechnol. 29, 644–652 (2011).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 28, 3150–3152 (2012).
Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinform. 12, 491 (2011).
Boratyn, G. M. et al. Blast: a more efficient report with usability improvements. Nucleic. Acids. Res. 41, W29–W33 (2013).
Coudert, E. et al. Annotation of biologically relevant ligands in UniProtKB using ChEBI. Bioinformatics. 39, btac793 (2023).
Coordinators, N. R. Database resources of the national center for biotechnology information. Nucleic. Acids. Res. 44, D7–D19 (2016).
Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic. Acids. Res. 28, 27–30 (2000).
Blum, M. et al. The Interpro protein families and domains database: 20 years on. Nucleic. Acids. Res. 49, D344–D354 (2021).
Tatusov, R. L., Koonin, E. V. & Lipman, D. J. A genomic perspective on protein families. Science. 278, 631–637 (1997).
Bateman, A. et al. Uniprot: the universal protein knowledgebase in 2021. Nucleic. Acids. Res. 49, D480–D489 (2021).
Jones, P. et al. Interproscan 5: Genome-scale protein function classification. Bioinformatics. 30, 1236–1240 (2014).
Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome. Res. 19, 1639–1645 (2009).
Lowe, T. M. & Eddy, S. R. TRNAscan-Se: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic. Acids. Res. 25, 955–964 (1997).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 29, 2933–2935 (2013).
Griffiths-Jones, S. Rfam: annotating non-coding RNAs in complete genomes. Nucleic. Acids. Res. 33, D121–D124 (2004).
National Genomics Data Center (NGDC) BioProject https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA020786 (2023).
National Genomics Data Center (NGDC) Genome Sequence Archive https://ngdc.cncb.ac.cn/search/all?&q=CRA013186 (2024).
National Genomics Data Center (NGDC) Genome Sequence Archive https://ngdc.cncb.ac.cn/search/all?&q=CRA017988 (2024).
National Genomics Data Center (NGDC) Genome Sequence Archive https://ngdc.cncb.ac.cn/search/all?&q=CRA018091 (2024).
National Genomics Data Center (NGDC) Genome Sequence Archive https://ngdc.cncb.ac.cn/search/all?&q=CRA017987 (2024).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_043380935.1 (2024).
NGDC Genome Warehouse https://ngdc.cncb.ac.cn/search/all?q=GWHFFOM00000000 (2024).
Zhao, Y. The de novo assembled chromosome-scale genome of cultivated Zizania latifolia. figshare. Dataset. https://doi.org/10.6084/m9.figshare.26384776.v5 (2024).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 25, 1754–1760 (2009).
Tang, H. et al. Synteny and collinearity in plant genomes. Science. 320, 486–488 (2008).

Acknowledgements

This research received financial support from the National Natural Science Foundation of China (Grant Nos. 31600293 and 32260091) and the Natural Science Foundation of Jiangxi province (Grant No. 20212BAB205029).

Author information

Authors and Affiliations

Key Laboratory of Poyang Lake Environment and Resource Utilization, Ministry of Education, Center for Watershed Ecology, School of Life Sciences, Nanchang University, Nanchang, 330031, Jiangxi, P. R. China
Yao Zhao, Li-dong Zhang, Zi-dong Xiong, Jian Zhang & Jun Rong
Jiangxi Poyang Lake Wetland Conservation and Restoration National Permanent Scientific Research Base, National Ecosystem Research Station of Jiangxi Poyang Lake Wetland, Nanchang University, Nanchang, 330031, Jiangxi, P. R. China
Yao Zhao, Zi-wei Zhu, Jian Zhang & Jun Rong
Jiangxi Province Key Laboratory of Wetland Plant Resources Conservation and Utilization, Lushan Botanical Garden, Jiangxi Province and Chinese Academy of Sciences, Jiujiang, 332900, P. R. China
Yao Zhao, Li-bing Liao, Ai-wen Zhong & Jun Rong
Jiangxi Academy of Forestry, Nanchang, 330013, Jiangxi, P. R. China
Zi-wei Zhu & Cheng-chuan Zhou
Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, National Observations and Research Station for Wetland Ecosystems of the Yangtze Estuary, Fudan University, Shanghai, 200438, P. R. China
Zhi-ping Song
Tobacco Research Institute of Chinese Academy of Agricultural Sciences, Qingdao, 266101, P. R. China
Ning Yan

Contributions

Y.Z., C.Z. and J.R. conceived and led the research. L.L., Z.Z., L.Z., Z.X., Z.S., N.Y., J.Z. and A.Z. were involved in sample collection, preparation and genome assembly. C.Z. and Y.Z. contributed to gene prediction and annotation, data visualization and other bioinformatics analysis. Y.Z. and C.Z. wrote the manuscript, and all authors read, revised and approved the final version of the manuscript.

Corresponding authors

Correspondence to Cheng-chuan Zhou or Jun Rong.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zhao, Y., Liao, Lb., Zhu, Zw. et al. De novo assembly of a near-complete genome of aquatic vegetable Zizania latifolia in the Yangtze River Basin. Sci Data 11, 1341 (2024). https://doi.org/10.1038/s41597-024-04220-0

Received: 30 July 2024
Accepted: 02 December 2024
Published: 18 December 2024
Version of record: 18 December 2024
DOI: https://doi.org/10.1038/s41597-024-04220-0