Background & Summary

Scirpus × mariqueter (Tang & F.T.Wang) Tatanov, which is endemic to China1, belongs to the species-rich Cyperaceae family. This species contributes critically to the integrity and health of coastal ecosystems. Because of its vigorous and resilient underground rhizome system, S. mariqueter is often a pioneer species in intertidal zones and forms single-species patches covering vast areas (Fig. 1), thereby promoting the development of coastlines and offshore islands2,3.

Fig. 1

Scirpus × mariqueter in the field. (a) S. mariqueter inflorescence bearing only one spikelet; (b) Seedling and rhizome at the stem base; (c) a large mono-species patch of S. mariqueter in a coastal area.

Many migratory birds routinely rely on its corms and achenes as a source of food4. It also has a crucial effect on the carbon budget of coastal wetlands. A previous study has shown that S. mariqueter emits substantial amounts of methane (CH4) and responds significantly to tidal variation5. Following an invasion by Spartina alterniflora Loisel. (Poaceae), S. mariqueter can enhance plant–soil feedback and mitigate the negative effects of biological invasions6.

Although S. mariqueter is a key species in coastal ecosystems, its evolutionary trajectory remains controversial. It is considered to be a hybrid species derived from Bolboschoenus planiculmis (F. Schmidt) T.V. Egorova and Schoenoplectus triqueter (L.) Palla7,8. However, previous research has shown that S. mariqueter is much more closely related to B. planiculmis than to S. triqueter, with no intermediate individuals confirmed in the field7,9. These results raise questions about the validity of hybrid speciation as the sole mechanism underlying the origin of S. mariqueter. To date, the absence of an S. mariqueter genome assembly has greatly limited our understanding of the biological mechanisms and evolutionary significance of this species.

In this data descriptor, we report a chromosome-level Scirpus × mariqueter genome assembly. The current formal nomenclature for this species is × Bolboschoenoplectus mariqueter (https://powo.science.kew.org). However, we herein mainly use S. mariqueter in accordance with how this species is most commonly referred to in the published literature.

The final genome assembly comprises 227.75 Mb, with a contig N50 of 3.89 Mb and a scaffold N50 of 4.07 Mb. The overall BUSCO completeness score is 95.10%. Most of the complete BUSCOs are single-copy (93.40%), with few duplicated BUSCOs (1.70%) (Table 1). By integrating valid Hi-C data (Table 2), we determine the associations among most contigs and cluster these contigs into pseudochromosomes (i.e., approximations of the actual chromosomes, particularly in the order and orientation of their constituent sequences), thereby improving our genome assembly to chromosome level (Fig. 2, Supplementary Table S1).
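The N50 statistic quoted above can be illustrated with a minimal sketch. It assumes the common definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly); the function name and the toy lengths are hypothetical and are not taken from this assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    summed length of all sequences.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for non-empty input


# Toy contig lengths (hypothetical values, in bp).
toy_contigs = [8, 6, 5, 4, 3, 2, 2]
print(n50(toy_contigs))  # 5
```

The same function applies to contigs or scaffolds; only the input list of lengths differs.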

Table 1 Key information for the Scirpus × mariqueter genome assembly. “Anchored rate” refers to the percentage of bases that were incorporated into pseudochromosomes.
Table 2 Assessment of Hi-C library quality according to the ratio of read pairs with valid interaction.
Fig. 2

Heatmap of Hi-C interactions of S. mariqueter pseudochromosomes. The resolution is 300 kb. Color gradient from yellow to red indicates the frequencies of Hi-C interactions (low to high, respectively).

The constructed pseudochromosomes reveal a haploid karyotype (n = 54). According to a key study of cyperid evolution by Elliot et al.10, the genome size of our sample approximates the average value in the genus Bolboschoenus (223.82 Mb ± 13.54 Mb). Analyses of random reads indicate that B. planiculmis is the most frequently matched species (41.15% of all matched reads) (Table 3). The average pseudochromosome size in our genome (4.05 Mb) also approaches the lowest mean chromosome size in cyperid species (3.7 Mb from Bolboschoenus robustus10).

Table 3 Results of the random sequence component check.

Annotation results show that repetitive elements constitute 54.12% (~123.25 Mb) of the S. mariqueter genome. Approximately 35.27% of the genome is composed of transposable elements (TEs), including long terminal repeat (LTR) retrotransposons (15.87%) and DNA transposons (13.91%) (Supplementary Table S2). Tandem repeats make up 18.85% of the genome (Table 4). By analyzing the repeat-masked genome, we identify 25,239 protein-coding genes in the S. mariqueter genome. Approximately 94.66% of the predicted genes are annotated using canonical databases (Table 5). We also establish a non-coding RNA (ncRNA) library consisting of 3,039 rRNAs, 1,090 tRNAs, 163 miRNAs, and 243 snRNAs (Table 6). A Circos plot (Fig. 3) is provided to display key information, including GC content, gene density, intra-genome collinearity, and TEs.

Table 4 Detection and classification of tandem repeats in the Scirpus × mariqueter genome.
Table 5 Summary of the gene prediction and annotation results.
Table 6 Summary of non-coding RNAs (ncRNAs) prediction and annotation results.
Fig. 3

Circos plot of the distribution of S. mariqueter genomic features. Four circular tiers represent (a) chromosome ideograms, (b) gene density, (c) transposable element density, and (d) GC content. Central lines indicate putative homology among linked sections. Colors were arbitrarily selected.

Methods

Sampling and pretreatment

We selected a healthy S. mariqueter individual on Yonglongsha Island (31.709°N, 121.618°E). We transported this individual to a plantation at the Chinese Academy of Forestry, where it was maintained for long-term research purposes. We carefully collected and cleaned the sampled tissues to prevent exogenous contamination. All samples were promptly transferred to the laboratory and stored at −80 °C.

Genome sequencing

We followed the conventional CTAB (cetyltrimethylammonium bromide) approach to extract genomic DNA, after which DNA quality was assessed via agarose gel electrophoresis. For PacBio long-read sequencing, SMRTbell (15 kb in length) DNA libraries were generated using the standard protocol for the SMRTbell Express Template Prep Kit 2.0 (PacBio, CA, USA). After filtering and quantification, libraries were sequenced using a PacBio Revio platform. Raw data were processed using SMRTLink v.8.0 (https://www.pacb.com/support/documentation/). For Illumina short-read sequencing, we chose the Nextera DNA Flex Library Prep Kit (Illumina, CA, USA) to create paired-end libraries (insert size 250 bp). Sequencing was performed using a NovaSeq 6000 platform. Raw reads were filtered using SOAPnuke v.2.1.411 (-n 0.01 -l 20 -q 0.1 -i -Q 2 -G -M 2 -A 0.5 -d).

Transcriptome sequencing

For gene prediction, samples were taken from five different tissues (root, stem, leaf, bract, and spikelet). All the samples were mixed to form a pooled sample. We then extracted total RNA from this pooled sample following the conventional procedure for the RNAprep Pure Plant Kit (Tiangen Biotech, Beijing, China). Sequencing was carried out using an Illumina NovaSeq 6000 platform, with paired-end libraries constructed following a standard Illumina protocol (San Diego, CA, USA). The insert size was 250 bp.

K-mer analysis and genome assembly

A K-mer analysis was performed using GenomeScope v.2.012 and Jellyfish v. 2.2.11313 (count -m 19 -C -c 7 -t 96 -s 1 G -f 2). Based on a 19-mer model, the estimated genome size of S. mariqueter was 202.19 Mb, with a heterozygosity of 0.73% (Fig. 4).

Fig. 4

K-mer analysis results and preliminary estimation of S. mariqueter genome parameters. Results were based on a 19-mer model. Estimated heterozygosity was 0.73%.
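The idea behind the K-mer analysis can be sketched as follows. This is a simplified illustration, not the GenomeScope model: it counts canonical k-mers (the -C option of Jellyfish treats a k-mer and its reverse complement as one), builds a multiplicity histogram, and divides the total k-mer count by the depth of the main peak. All function names, the min_depth cutoff, and the toy inputs are assumptions introduced here for illustration.

```python
from collections import Counter
from typing import Dict, Iterable

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = frozenset("ACGT")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_canonical_kmers(reads: Iterable[str], k: int = 19) -> Counter:
    """Count canonical k-mers across reads, skipping k-mers with non-ACGT bases."""
    counts: Counter = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if _BASES.issuperset(kmer):
                counts[canonical(kmer)] += 1
    return counts


def kmer_histogram(counts: Counter) -> Counter:
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def naive_genome_size(histogram: Dict[int, int], min_depth: int = 2) -> float:
    """Estimate genome size as (total k-mers) / (peak depth).

    Multiplicities below min_depth are ignored as likely sequencing errors.
    This is a rough estimate; it does not model heterozygosity or repeats.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=lambda d: (usable[d], d))
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth
```

For real data, a dedicated counter such as Jellyfish is needed for memory efficiency; the sketch only shows the logic on small inputs.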

Following this estimation, we obtained a primary assembly using the NextDenovo pipeline14 (read_cutoff = 1k, genome_size = 0.5 g, sort_options = -m 128 g -t 96, nextgraph_options = -a 1 -q 10) (https://github.com/Nextomics/NextDenovo). The primary genome had a length of 227.75 Mb, which approximated the estimated size from K-mer modeling. Two rounds of error correction were performed for the primary assembly using Pilon v.1.2315 (https://github.com/broadinstitute/pilon), after which heterozygous sequences were removed using the Purge_haplotigs pipeline v.1.0.416 to decrease ambiguities. The HindIII enzymatic digestion method detailed by Xie et al.17 was selected to guide our Hi-C library construction. Clean Hi-C data were aligned to the primary assembly using the Burrows-Wheeler Aligner (BWA) v.0.7.1718. Uniquely aligned reads with valid interactions were then retained using HiCUP v.0.8.019. ALLHiC v.0.9.820 was used to group the contigs of the draft assembly into pseudochromosomes with reference to the valid interaction information. By applying this Hi-C methodology, we were able to detect the associations among most contigs and cluster them into pseudomolecules reflecting real chromosomes. Thus, the primary assembly was improved to produce a chromosome-level genome assembly. Additionally, we used 3D-DNA v.18092221 and Juicebox v.1.11.0822 to further improve contig orientation and order.

Detection of repetitive elements

We built a de novo repeat library using RepeatModeler v.2.0.123. We refined this library using a combination of RepeatMasker v.4.15 (http://www.repeatmasker.org) and RepBase v.2018102624. We performed further predictions for two major repeat components bearing evolutionary significance: LTRs and tandem repeats. By integrating LTR_FINDER, LTRharvest, and LTR_retriever, we acquired high-quality LTRs following the instructions of Ou & Jiang25. Tandem repeats were predicted using TRF v.4.1.026 and MISA v.2.127. We also identified the potential sites of centromeres and telomeres, as they are vital factors affecting speciation in cyperid species, which commonly have holocentric chromosomes10. We executed the standard Python scripts of quarTeT28 to detect centromeres, telomeres, and gaps. Based on the results from quarTeT, we plotted a comprehensive karyotype ideogram using RIdeogram v.0.2.229 to present these patterns visually.
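To illustrate what a microsatellite search of the kind performed by tools such as MISA looks for, the following sketch finds perfect simple sequence repeats with a regular expression. It is not the MISA implementation. The minimum repeat counts per unit size are illustrative assumptions, since the thresholds used in this study are not stated in the text, and the function name is hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Assumed thresholds: unit size -> minimum number of consecutive copies.
MIN_REPEATS: Dict[int, int] = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(sequence: str,
              min_repeats: Dict[int, int] = MIN_REPEATS
              ) -> List[Tuple[int, int, str, int]]:
    """Return perfect SSRs as (start, end, motif, copies), 0-based, end exclusive."""
    seq = sequence.upper()
    hits = []
    for unit_len, min_copies in sorted(min_repeats.items()):
        # One captured unit followed by at least (min_copies - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT"),
            # because those runs are reported under the shorter unit size.
            if any(unit_len % d == 0 and motif == motif[:d] * (unit_len // d)
                   for d in range(1, unit_len)):
                continue
            copies = (match.end() - match.start()) // unit_len
            hits.append((match.start(), match.end(), motif, copies))
    return sorted(hits)


# Toy sequence (hypothetical): a 7-copy AT repeat and a 12-base A run.
toy = "GG" + "AT" * 7 + "CC" + "A" * 12 + "G"
print(find_ssrs(toy))  # [(2, 16, 'AT', 7), (18, 30, 'A', 12)]
```

Imperfect and compound repeats, which dedicated tools such as TRF also report, are outside the scope of this sketch.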

Gene prediction and annotation

After masking the repeat content, we used Augustus v.3.5.0 to produce de novo gene models30. We inferred homology on the basis of high-quality genome assemblies for the following five species: Rhynchospora breviuscula (GCA_027562975.1)31, Bolboschoenus planiculmis (GCA_031770325.1)32, Oryza sativa (GCA_034140825.1)33, Schoenoplectus tabernaemontani (GCA_037127355.1)34, and Arabidopsis thaliana (GCA_000001735.2)35. We identified coding regions in the transcripts using TransDecoder v.5.7.1 (https://github.com/TransDecoder/TransDecoder). Finally, we reconciled these results using the Maker v.3.0136 pipeline to obtain the final gene set (https://github.com/Yandell-Lab/maker?tab). Gene functions were annotated using the NR, InterPro, UniProt, GO, KEGG, and Pfam databases with an e-value threshold of 1e-5. For ncRNAs, tRNAs were predicted using tRNAscan-SE v.1.3.137. rRNAs were identified using RNAmmer v.1.238 (https://services.healthtech.dtu.dk/services/RNAmmer-1.2/). We further identified miRNAs, snoRNAs, and snRNAs using Infernal 1.139 with reference to the Rfam database (v.14.9). Detailed procedures and parameters can be found in the manual for ncRNA analysis using the Rfam database40.

Data Records

The genome assembly and all sequence data are deposited in the NCBI database. The genome assembly accession number is GCA_037678475.141. The BioProject ID is PRJNA1079027. The BioSample ID is SAMN40029249. Raw reads used to generate the genome assembly are stored in the Sequence Read Archive (SRP491792)42. The complete genome annotation files in gff3 format, including coding sequences, ncRNA sequences, and repeat sequences, are shared in the Figshare database (https://doi.org/10.6084/m9.figshare.25479922.v1)43.

Technical Verification

Data volume and quality

Sequence data volume and quality were sufficient for constructing a high-quality genome assembly. We produced 21.12 Gb of PacBio long-read data for the primary assembly. Among the long reads, 60.27% had a Phred quality score (Q score) better than Q30. Details regarding PacBio long-read quality are provided in Fig. 5.

Fig. 5

Distribution of quality scores and lengths of PacBio reads. The x-axis presents the read length (bp), whereas the y-axis presents the predicted base calling accuracy. Dot colors reflect count abundance (scale bar on the right). The plot presents a high volume of reads with a Phred quality score (Q score) of 28–41 and a length of 13,500–25,000 bp.
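For readers less familiar with Phred scores, the relationship between a Q score and the estimated error probability is Q = −10 log10(P), so Q30 corresponds to one expected error in 1,000 bases. The short sketch below encodes this standard relationship; the function names and the toy quality values are hypothetical, and the threshold is treated as inclusive (Q ≥ 30) as an assumption.

```python
import math
from typing import Sequence


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to the estimated error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)


def fraction_at_least(qualities: Sequence[float], threshold: float = 30) -> float:
    """Return the fraction of quality values at or above the threshold."""
    if not qualities:
        raise ValueError("no quality values given")
    return sum(q >= threshold for q in qualities) / len(qualities)


# Toy per-read mean qualities (hypothetical values).
print(fraction_at_least([35, 28, 41, 30]))  # 0.75
```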

We also generated a total of 132.85 Gb of Illumina short reads for genome profiling, gene prediction, pseudochromosome construction, and back-mapping checks. The average Q30 value for the short-read data was 93.74% (Table 7). A total of 50,000 random short reads were searched against the Nucleotide Sequence Database (NT) using BLASTN v. 2.11.0 (-evalue 1e-5 -max_target_seqs 1). Of these, 7,510 reads had matches. The top 20 matched species are listed in Table 3. The results showed that all the matched records belong to Viridiplantae, with approximately 67% of the matched reads belonging to Cyperaceae, indicating that our sample was free of detectable extraneous contamination. Thus, our genome assembly had a robust foundation in data volume, data quality, and data source.
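Drawing a fixed number of random reads from a large read set can be done in a single pass with reservoir sampling. The sketch below shows one such approach (Algorithm R) as an illustration only; the text does not state how the 50,000 reads were drawn, and the function name and seed parameter are assumptions.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return k items chosen uniformly at random from a stream of unknown length.

    If the stream holds fewer than k items, all of them are returned.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for index, item in enumerate(items):
        if index < k:
            reservoir.append(item)
        else:
            # Pick a slot in [0, index]; replace only if it falls in the reservoir.
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir


# Toy usage: sample 5 hypothetical read identifiers out of 100.
read_ids = (f"read_{i}" for i in range(100))
subset = reservoir_sample(read_ids, k=5, seed=42)
print(len(subset))  # 5
```

With the counts reported above, 7,510 matched reads out of 50,000 corresponds to a match rate of 15.02%.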

Table 7 Summary of data volume and quality.

Continuity and completeness of the genome assembly

We assembled a high-quality genome with the following characteristics: (1) There were relatively few gaps. The total length of the detected gaps was 500 Ns, consisting of one gap (100 Ns) in each of Chr3, Chr19, Chr21, Chr23, and Chr44 (Fig. 6).

Fig. 6

Chromosome ideograms of S. mariqueter. Ideograms present a haploid karyotype (n = 54). Ideogram length is proportional to chromosome size. Background color scales represent gene densities (100 kb window). Five gaps (orange box) in the genome assembly are shown. Putative telomeres are indicated by green triangles, whereas centromeres are indicated by purple circles.

Gaps were not detected in the other 49 pseudochromosomes (i.e., 90.74% of the total number); (2) In our final genome, 36 pseudochromosomes showed telomeres at both ends, whereas 15 pseudochromosomes had a telomere at only one end. The remaining three pseudochromosomes showed no detectable telomeres. Considering the observed gaps, we assembled 32 telomere-to-telomere pseudochromosomes, accounting for 59.25% of the total number; (3) The overall BUSCO score was 95.1% (Table 1), similar to that of some recently published cyperid genomes44. In terms of the back-mapping rate, 99.03% of the sequenced reads were aligned to our final genome. Overall, 99.88% of the final genome was covered through back-mapping. Specifically, 99.61% of the final genome was covered at least 4×, 99.25% of the final genome was covered at least 10×, and 98.84% of the final genome was covered at least 20×.
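Gap statistics of the kind reported in point (1) can be derived by scanning each pseudochromosome for runs of N. The following minimal sketch assumes that gaps are encoded as consecutive N characters; it is not the quarTeT implementation, and the function names and toy sequences are hypothetical.

```python
import re
from typing import Dict, List, Tuple

GAP_RUN = re.compile(r"[Nn]+")


def find_gaps(sequence: str) -> List[Tuple[int, int]]:
    """Return (start, end) of each run of N, 0-based with end exclusive."""
    return [(m.start(), m.end()) for m in GAP_RUN.finditer(sequence)]


def summarize_gaps(chromosomes: Dict[str, str]) -> dict:
    """Summarize gaps per sequence, total gap bases, and gap-free sequences."""
    if not chromosomes:
        raise ValueError("no sequences given")
    gaps = {name: find_gaps(seq) for name, seq in chromosomes.items()}
    total_gap_bases = sum(end - start
                          for runs in gaps.values()
                          for start, end in runs)
    gap_free = [name for name, runs in gaps.items() if not runs]
    return {
        "gaps": gaps,
        "total_gap_bases": total_gap_bases,
        "gap_free": gap_free,
        "gap_free_percent": 100 * len(gap_free) / len(chromosomes),
    }


# Toy sequences (hypothetical): one with a 100-N gap, one without gaps.
toy = {"chrA": "ACGT" + "N" * 100 + "ACGT", "chrB": "ACGTACGT"}
summary = summarize_gaps(toy)
print(summary["total_gap_bases"], summary["gap_free_percent"])  # 100 50.0
```

With the counts reported above, 49 gap-free pseudochromosomes out of 54 gives 100 × 49 / 54 ≈ 90.74%.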