Background & Summary

Agricultural weeds compete with crops for essential resources such as light, nutrients, and water, resulting in reduced crop yields and significant economic, environmental, and ecological consequences1,2. It is estimated that weeds cause annual losses of […] in Australia, 11 billion USD in India, and 33 billion USD in the United States, contributing to a global loss of around 200 million tons in food production2. The Cyperaceae family comprises about 3,000 species, with approximately 300 classified as weeds, and about 42% of these belong to the genus Cyperus3. Among these, C. rotundus stands out as the most aggressive weed worldwide, owing to its widespread distribution and its ability to outcompete crops4. Furthermore, this species is resistant to conventional control methods, making it one of the most problematic weeds globally, responsible for yield losses ranging from 20% to 90% in both field crops and horticultural plants3,5. Consequently, C. rotundus exerts severe impacts on agricultural ecosystems across various regions.

C. rotundus (Family: Cyperaceae) derives its genus name from the ancient Greek word “Cypeiros” and its species epithet, “rotundus”, from the Latin term for “round,” referencing its round-shaped tubers6. Native to South Asia, Africa, Central and Southern Europe, and tropical, subtropical, and temperate regions of Australia7, C. rotundus is an upright, hairless, grass-like perennial herb characterized by fibrous roots and slender, scaly, creeping rhizomes (Fig. 1). This species is commonly found in temperate, tropical, and subtropical regions, including China, India, South Africa, Korea, Japan, Egypt, and Iran8,9,10,11,12. Its primary mode of reproduction is asexual, through underground tubers, rhizomes, and basal bulbs, although sexual reproduction occurs via seeds13. C. rotundus exhibits remarkable reproductive and survival capabilities, thriving in dryland crop fields such as sugarcane14, maize, soybeans15, cotton16 and peanuts17, where it significantly affects crop growth and quality. In warm climates, C. rotundus is particularly difficult to manage due to its perennial nature, rapid growth, and high tuber production, making it a highly invasive species18. Despite the economic and ecological significance of this noxious weed, its complete genome has not yet been sequenced, and its biological characteristics and adaptive mechanisms remain inadequately understood. Therefore, the chromosome-level genome sequence of C. rotundus will help further elucidate its biological characteristics, adaptive mechanisms, and phylogenetic relationships.

Fig. 1

Morphological characteristics of C. rotundus: (a) growth habit in cassava fields, (b) whole plant, (c) seedling, (d) root, (e) stem, (f) leaf, (g) tuber, (h) flower.

In this study, we constructed and annotated a high-quality chromosome-level reference genome using integrated data (Fig. 2). The genome was initially assembled at the contig level using PacBio HiFi long reads and the hifiasm v0.1819 tool. To achieve chromosome-level assembly, we employed Illumina Hi-C paired-end reads, processed with HiCUP v0.9.220. After masking repetitive sequences, we combined three gene annotation strategies using EVidenceModeler (EVM)21: homology-based prediction from closely related species, transcriptome-based prediction from Illumina paired-end RNA-seq short reads via the PASA22 pipeline, and de novo prediction relying on genomic sequence features. Following functional annotation of protein-coding genes and protein domains, we validated the results against relevant databases. The genomic features were then visualized with Circos v0.69.823 (Fig. 2). Additionally, we performed Benchmarking Universal Single-Copy Orthologs (BUSCO)24 analysis and evaluated genome mapping and coverage using Illumina paired-end short reads to assess the completeness and quality of both the genome assembly and annotation. These results indicate that the current genome assembly and annotation are both continuous and accurate. Thus, the present C. rotundus genomic resource will lay the foundation for further research on plants within this genus.

Fig. 2

Chromosome-scale genome assembly map of C. rotundus. From outermost to innermost, the Circos plot represents the following: (1) the length of 54 pseudo-chromosomes at the Mb scale; (2) gene density per Mb; (3) GC content per Mb; and (4) center: collinearity within C. rotundus.

Methods

Plant material collection and preparation

In March 2024, mature and healthy C. rotundus plants were collected from the Danzhou campus of Hainan University in Danzhou City, Hainan Province (19°54′37.26″N, 109°31′48.67″E). The roots, stems, leaves, flowers, and tubers were harvested after being washed with deionized water. All tissues were immediately placed in liquid nitrogen and stored in a cryogenic freezer until further use.

DNA library construction and genome sequencing

DNA extraction was carried out using a modified cetyltrimethylammonium bromide (CTAB) method25. The quality and concentration of the extracted DNA were assessed through 0.75% agarose gel electrophoresis, NanoDrop One spectrophotometry (Thermo Fisher Scientific, Wuhan), and Qubit 3.0 fluorometry (Life Technologies, Carlsbad, USA). For Illumina sequencing, high-quality DNA was defined as having an OD260/280 ratio of 1.6–1.8, no visible viscosity, a total amount ≥0.2 µg, a Qubit concentration ≥5 ng/µL, and intact bands on agarose gel electrophoresis. For ONT sequencing, high-quality DNA was defined as being clear and transparent, with no insoluble particles or viscosity, an Nc/Qc ratio of 0.95–1.5, A260/280 of 1.8–2.0, A260/230 ≥ 1.5, no degradation or only slight degradation, and a total amount ≥5 µg. After obtaining high-quality and purified genomic DNA, a SMRT cell sequencing library containing approximately 15–20 kb fragments was constructed and sequenced on the DNBSEQ-T7 platform. For PacBio HiFi sequencing, circular consensus sequencing (CCS) reads were generated using the ccs tool26 in SMRT Link with parameters --min-passes 3 --min-length 10 --min-rq 0.99 to ensure high accuracy and remove low-quality data. Raw sequencing reads were processed to remove adapter sequences, low-quality reads (Q score < 20), and short fragments (<1 kb). A total of 12 Gb of clean data were generated, covering approximately 40.95× of the genome (Table 1).

Table 1 Genomic Assembly and Annotation Sequencing Data Statistics of C. rotundus.

For Oxford Nanopore sequencing, the SQK-LSK110 ligation kit and standard protocol were used to prepare the library. The purified library was loaded onto an initialized R9.4 Spot-On flow cell and sequenced using the PromethION sequencer (Oxford Nanopore Technologies, Oxford, UK). Basecalling was conducted using Guppy v6.3.827, during which reads with an average Q score below 7 were automatically filtered. Raw reads were filtered to remove adapter sequences, short reads (<1 kb), and low-quality reads (Q score <20), resulting in 12 Gb of clean data, which provided approximately 40.95× genome coverage. Concurrently, an Illumina second-generation library (Illumina, San Diego, USA) was constructed with an insert fragment size of 350 bp. A total of 12 Gb of clean data, corresponding to approximately 40.95× genome coverage, was obtained after filtering with fastp v0.23.428, which removed low-quality reads, adapter sequences, and polyG tails using default settings. These short-read data were then used for genomic analysis.
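As a quick consistency check on the depth figures, sequencing depth is simply the clean data volume divided by the genome size, so each reported pair of values implies a genome size. The sketch below is illustrative only: the function and variable names are ours, and the Hi-C values are taken from the Hi-C paragraph of this section.

```python
def implied_genome_size_gb(clean_bases_gb: float, depth: float) -> float:
    """Genome size (Gb) implied by a clean data volume (Gb) and its reported depth (x)."""
    return clean_bases_gb / depth


# (clean data in Gb, reported depth) as quoted in the text; labels are illustrative.
datasets = {
    "WGS reads": (12.0, 40.95),
    "Hi-C reads": (23.0, 78.49),
}

implied = {
    name: implied_genome_size_gb(gb, depth) for name, (gb, depth) in datasets.items()
}

for name, size in implied.items():
    print(f"{name}: implied genome size ~{size:.3f} Gb")
```

Both pairs of values imply a genome of roughly 0.29 Gb, which suggests that the depths were computed against the same genome size.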

For chromosomal scaffolding, a Hi-C library was prepared from C. rotundus tissues and sequenced using the Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA). Raw sequencing data were processed using fastp28 to remove adapter sequences, low-quality reads (Q < 20), and reads with ambiguous bases. After initial filtering, 23 Gb of clean Hi-C data were obtained, achieving approximately 78.49× coverage. Hi-C data preprocessing was performed using HiCUP20. First, the reference genome was digested in silico to simulate MboI restriction enzyme cleavage, using: hicup_digester --genome genome --re1 ^GATC,MboI genome.fasta. The paired-end Hi-C reads were then aligned to the draft genome assembly using Bowtie2 within the HiCUP pipeline via: hicup --bowtie2 bowtie2 --digest Digest_genome*.txt --format Sanger --index genome --outdir $PWD --threads 4 $HiC_data_dir/*.fq.gz. Invalid read pairs (self-ligated, re-ligated, circularized, and uninformative pairs) were automatically filtered. Valid paired-end reads were extracted using: samtools fastq -1 hicup.sam.tmp_R1.fastq -2 hicup.sam.tmp_R2.fastq hicup.sam, and all clean pairs were concatenated into final read files for scaffolding. Sequencing was performed by Wuhan Baita Gene Technology Co., Ltd. (Wuhan, China) (Table 1).
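To illustrate what the in silico digestion step does conceptually, the following minimal sketch cuts a toy sequence immediately before every GATC site, which is what the ^GATC notation for MboI denotes. It is not HiCUP's implementation; the function name and the toy sequence are assumptions made for illustration.

```python
import re


def digest(seq: str, site: str = "GATC", cut_offset: int = 0) -> list[str]:
    """Split seq at every occurrence of site.

    cut_offset is the cut position relative to the start of the site;
    0 corresponds to ^GATC (cut just before the G).
    """
    seq = seq.upper()
    # A lookahead finds every site start, including overlapping occurrences.
    cuts = [m.start() + cut_offset for m in re.finditer(f"(?={site})", seq)]
    bounds = [0] + cuts + [len(seq)]
    # Skip empty fragments (for example, when a site sits at position 0).
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


toy = "AAGATCTTTTGATCCC"  # toy sequence, not real data
fragments = digest(toy)
print(fragments)
```

Read pairs whose two ends map to the same or adjacent restriction fragments are the kind of uninformative pairs that a Hi-C filtering step discards.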

RNA library construction and transcriptome sequencing

Total RNA was extracted from the roots, stems, leaves, flowers, and tubers following the standard TRIzol protocol (Invitrogen, USA)29. Approximately 100 mg of tissue was ground into a powder in liquid nitrogen and transferred to a 2.0 mL tube, followed by the addition of 1000 μL TRIzol. The solution was incubated for about 5 minutes, then 200 μL of chloroform was added. The mixture was vigorously shaken for 30 seconds and allowed to stand for 3 minutes. After centrifugation at 12,000 rpm for 15 minutes at 4 °C, the upper aqueous phase was collected. This phase was transferred to a 1.5 mL tube, combined with 500 μL isopropanol, and gently inverted to mix. After standing for approximately 10 minutes, the mixture was centrifuged at 12,000 rpm for 10 minutes. The supernatant was discarded, the pellet was washed twice with 75% ethanol, and the final pellet was dissolved in 50 μL of DNase- and RNase-free water for further analysis.

For Illumina paired-end sequencing, mRNA was reverse-transcribed into cDNA, and four libraries with an insert size of 350 bp were constructed following the manufacturer’s instructions using the TruSeq RNA library preparation kit (Illumina, USA). Sequencing was performed on the NovaSeq 6000 platform in PE150 mode. Raw sequencing data were processed using fastp28 with parameters “--detect_adapter_for_pe --qualified_quality_phred 5 --unqualified_percent_limit 50 --n_base_limit 5 --dedup” to remove adapter sequences, reads in which more than 50% of bases had a Phred score below 5, reads containing more than 5 ambiguous bases (N), and putative PCR duplicates. After filtering, a total of 40 Gb of clean data were obtained from the RNA-seq libraries (Table 1).
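The read-level thresholds in the fastp command can be sketched as a simple predicate. This is not fastp's code: it assumes fastp's documented semantics, in which a base is "qualified" when its Phred score is at least the given value, a read is discarded when the percentage of unqualified bases exceeds the limit, and a read is discarded when its N count exceeds the N limit. Adapter trimming and deduplication are omitted, and the function name is hypothetical.

```python
def passes_filter(
    seq: str,
    quals: list[int],
    qualified_phred: int = 5,
    unqualified_percent_limit: float = 50.0,
    n_base_limit: int = 5,
) -> bool:
    """Return True if a read survives the quality and N-content thresholds."""
    if not seq or len(seq) != len(quals):
        return False
    # Discard reads with more than n_base_limit ambiguous bases.
    if seq.upper().count("N") > n_base_limit:
        return False
    # Discard reads in which too large a share of bases is below the quality cut-off.
    unqualified = sum(q < qualified_phred for q in quals)
    if 100.0 * unqualified / len(quals) > unqualified_percent_limit:
        return False
    return True


print(passes_filter("ACGT", [30, 30, 30, 30]))   # high-quality toy read
print(passes_filter("NNNNNNAC", [30] * 8))       # six N bases
print(passes_filter("ACGT", [2, 2, 2, 30]))      # 75% of bases below Q5
```

With a cut-off as low as Q5, this filter removes only reads dominated by very poor base calls, so most of the filtering effect in practice would come from adapter removal and deduplication.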

Genome assembly and Hi-C scaffolding

Initially, we estimated the genome size and heterozygosity of C. rotundus using k-mer analysis with Jellyfish v1.1.1030 (parameter “-m 21”) and GenomeScope v2.031 (parameter “k = 21”) using clean Illumina short-read data. The k-mer analysis indicated that the genome size of C. rotundus is approximately 0.29 Gb (Fig. S1). To assemble third-generation long-read sequencing data into genomic sequences, tools such as Flye32, Canu33, and Hifiasm19 are widely used. Genome assembly was performed using a hybrid approach combining PacBio HiFi and Oxford Nanopore (ONT) long reads. The assembly pipeline utilized Hifiasm v0.1819, with parameters optimized for integrating ultra-long ONT data into the HiFi assembly. Specifically, the following command was used for hybrid genome assembly: nohup /software/hifiasm/hifiasm-0.18/hifiasm/hifiasm -t 4 -l 3 --ul ../ONT.fastq.gz -o HIFI_ONT_assembly ../HIFI_data.fastq 2> error.txt &. After assembly completion, the primary contig FASTA file was extracted from the GFA output using the command: awk '/^S/{print ">"$2;print $3}' HIFI_ONT_assembly.bp.p_ctg.gfa > genome.fasta. This approach enabled us to leverage the high base accuracy of PacBio HiFi reads and the ultra-long coverage of ONT reads, producing a draft genome suitable for subsequent scaffolding and annotation. To correct errors in the initial assembly, Illumina-derived short reads were employed for correction using Pilon v1.2334. The final C. rotundus genome assembly comprises 201 scaffolds, each corresponding to a chromosome (including ChrUN). This number reflects the scaffold-level assembly. However, since the assembly has not reached telomere-to-telomere (T2T) completeness, gaps still exist within certain chromosomes. These gaps divide chromosomes into multiple contiguous sequences (contigs), resulting in a total of 237 contigs (Table 2).
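The GFA-to-FASTA conversion performed by the awk one-liner can be written out step by step as follows. This is an equivalent sketch for illustration rather than part of the published pipeline; the function name and the toy records (including the contig names) are assumptions.

```python
def gfa_to_fasta(gfa_lines):
    """Yield FASTA lines for every segment (S) record of a GFA file.

    An S record has the form: S <name> <sequence> [optional tags...]
    """
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            yield ">" + fields[1]   # header line from the segment name
            yield fields[2]         # sequence line


# Toy GFA content; non-segment records (here an A line) are ignored.
toy_gfa = [
    "S\tptg000001l\tACGTACGT\tLN:i:8\n",
    "A\tptg000001l\t0\t+\tread1\t0\t8\n",
    "S\tptg000002l\tGGCC\tLN:i:4\n",
]

fasta = list(gfa_to_fasta(toy_gfa))
print("\n".join(fasta))
```

For a real assembly the same generator can be fed an open file handle, for example `gfa_to_fasta(open("HIFI_ONT_assembly.bp.p_ctg.gfa"))`, and the yielded lines written to genome.fasta.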

Table 2 Statistics of the C. rotundus genome assembly.

The assembled genome was further evaluated using Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.4.324 with the embryophyta_odb10 dataset. The results indicated that 93.4% of BUSCO genes were successfully detected in the genome assembly, with 1,460 single-copy genes, 46 duplicated genes, 16 fragmented genes, and 92 missing genes (Fig. 3).

Fig. 3

Quality assessment of the genome using the embryophyta_odb10 database, showing a genome BUSCO score of 93.4% and a protein BUSCO score of 93.3%. C: the number of complete genes, S: the number of complete and single-copy genes, D: the number of complete and duplicated genes, F: the number of fragmented genes, M: the number of missing genes.

To achieve a chromosome-level genome, the 237 contigs from the draft assembly were anchored onto 54 chromosomes using HapHiC v1.0.635,36. Hi-C reads were then aligned to the modified genome, and erroneous links, order, and orientation were manually corrected using Juicer v1.537. After processing, HapHiC was rerun. Ultimately, the Hi-C scaffolding resulted in chromosome-length scaffolds, producing 54 chromosomes (Fig. 4). Haplotypic duplications were removed during the hybrid assembly with HiFi and ONT reads, followed by chromosome-level scaffolding. Haplotypes were distinguished using synteny analysis with MUMmer software and the Hi-C heatmap, and the best-quality sequences from each haplotype were selected as the final genome assembly. The assumption of 54 chromosomes was based on our cytogenetic (karyotyping) analysis, which revealed that the organism has 162 chromosomes in total (Fig. S2). Furthermore, ploidy evaluation using the Smudgeplot38 tool confirmed that C. rotundus is a triploid species (Fig. S3), and on this basis the haploid chromosome number was inferred to be 54 (162 divided by 3). Notably, haploid chromosome numbers around 54 have also been reported in other species of the Cyperus genus, and the presence of peaks at 18, 36, and 54 in haploid counts has been interpreted as potential evidence of polyploid series in this group39. The haploid chromosome number of 54 was used as the input parameter for HapHiC.

Fig. 4

The heatmap represents 54 chromosomes of the C. rotundus genome.

Genome annotation and functional prediction

Repeat Sequence Annotation

To identify repetitive sequences in the C. rotundus genome, we employed both homology-based methods and de novo prediction strategies.

Homology-based analysis

We first used RepeatMasker v1.32340 and the Repbase TE library41 to identify known transposable elements (TEs) in the C. rotundus genome.

De novo prediction

Next, we constructed a de novo repeat library for the genome using RepeatModeler open-1.0.842, which integrates two core de novo repeat detection tools, RECON v1.0843 and RepeatScout v1.0.544, to identify and optimize dispersed repeat regions. To further explore long terminal repeat (LTR) retrotransposons, we conducted dedicated de novo searches using LTR_FINDER v1.0.745, LTR_harvest v1.5.1146, and LTR_retriever v2.747. Additionally, we employed Tandem Repeat Finder (TRF)48 to identify tandem repeat sequences and MISA v1.049 to identify simple sequence repeats (SSRs).

Finally, we merged the repeat libraries obtained from the above methods and used RepeatMasker to annotate the repetitive content in the genome. The results indicated that 44.78% of the assembly consisted of repetitive sequences (Table 3). The four main types of repetitive sequences were long terminal repeat (LTR) elements (14.38% of the genome size), simple repeats (3.12%), DNA elements (2.46%), and unclassified elements (23.86%) (Table 3).

Table 3 Statistics of repeat elements in the genome of C. rotundus.

Non-coding RNA annotation

We used Rfam v14.050 to predict ribosomal RNA (rRNA), small nuclear RNA (snRNA), and microRNA (miRNA) by comparing the C. rotundus genome with known non-coding RNA libraries. The tRNAscan-SE v1.3.151 algorithm with default parameters was used to identify tRNA-related genes. In total, 3,650 non-coding RNAs (ncRNAs) were identified in the C. rotundus genome, including 1,187 tRNAs, 1,208 rRNAs, 23,280 miRNAs, 120 snRNAs, and 1,135 snoRNAs (Table 4).

Table 4 Non-coding RNAs in the C. rotundus assembly.

Gene structure prediction

We used three strategies to predict the gene structure of the repeat-masked C. rotundus genome: ab initio gene prediction, homology-based gene prediction, and RNA-Seq guided gene prediction. Prior to gene prediction, the assembled C. rotundus genome was subjected to both hard and soft repeat masking using RepeatMasker40 in order to improve annotation accuracy. Hard masking was applied to eliminate potential false gene predictions in highly repetitive regions, while soft masking was retained to preserve sequence context for downstream analyses that are sensitive to repeat content.

For ab initio gene prediction, we used Augustus v3.3.352. Each gene prediction model was trained with a set of high-quality proteins generated from the RNA-Seq dataset. For homology-based gene prediction, we employed Maker v2.31.1053 to align protein and transcript sequences to our genome assembly and predict the coding genes.
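The difference between the two masking modes applied before gene prediction can be shown on a toy sequence: hard masking replaces repeat bases with N, whereas soft masking keeps the bases but writes them in lowercase. The sketch below is illustrative only and does not reproduce RepeatMasker; the function name, the interval convention (0-based, half-open), and the toy data are assumptions.

```python
def mask(seq: str, repeats: list[tuple[int, int]], hard: bool) -> str:
    """Mask repeat intervals in seq.

    repeats holds 0-based, half-open (start, end) intervals.
    hard=True replaces repeat bases with 'N'; hard=False lowercases them.
    """
    out = list(seq.upper())
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(out))):
            out[i] = "N" if hard else out[i].lower()
    return "".join(out)


toy = "ACGTACGTACGT"      # toy sequence
toy_repeats = [(4, 8)]    # one annotated repeat covering positions 4..7

soft_masked = mask(toy, toy_repeats, hard=False)
hard_masked = mask(toy, toy_repeats, hard=True)
print(soft_masked)
print(hard_masked)
```

Soft masking is lossless, since the original sequence is recovered by uppercasing, which is why it suits tools that need to see repeat content; hard masking discards the underlying bases.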

For RNA-Seq guided gene prediction, we first used Hisat2 v2.0.054 to align the cleaned RNA-Seq reads to the genome. We then used Trinity v2.3.255, TransDecoder v2.01 (github.com/TransDecoder/TransDecoder), and Maker53 to construct the gene structures.

Finally, EVidenceModeler (EVM) v1.1.121 was employed to integrate the predictions from the three methods and generate the final gene models. The output consisted of consistent, non-overlapping sequence assemblies to define the gene structure. In total, 23,280 protein-coding genes were predicted in the C. rotundus genome, with an average gene length of 2,280 bp (Table 5).

Table 5 Statistics of gene structure prediction results.

Gene function annotation

Gene functions were inferred according to the best match of the alignments to the National Center for Biotechnology Information (NCBI) Non-Redundant (NR)56, TrEMBL57, KOG58 and Swiss-Prot protein databases59 using BLASTP v2.6.060 and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database61 with an E-value threshold of 1E-5. The protein domains were annotated using PfamScan v1.662 based on the Pfam database63 and the InterPro protein database64. Gene Ontology (GO) IDs for each gene were obtained from Blast2GO65. In total, approximately 97% of the predicted protein-coding genes of the C. rotundus genome could be functionally annotated with known genes, conserved domains, and Gene Ontology terms (Table 6).

Table 6 Statistics for the C. rotundus functionally annotated protein-coding genes.

Data Records

Raw sequencing data have been deposited in the NCBI Sequence Read Archive (SRA) (Table 1) under BioProject accession number PRJNA1280253 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1280253), including transcriptomic, PacBio HiFi, Hi-C, Oxford Nanopore Technologies (ONT), and Illumina sequencing datasets (accessions SRR34082898–SRR34082906)4,66,67,68,69,70,71,72,73,74.

The chromosome-level genome assembly of Cyperus rotundus is available at NCBI GenBank under the accession GCA_052426515.1 (https://www.ncbi.nlm.nih.gov/assembly/GCA_052426515.1)75, linked to BioProject PRJNA1283543 and BioSample SAMN49699731.

The genome annotation files are accessible via Figshare (https://doi.org/10.6084/m9.figshare.29435915)76.

Technical Validation

Genomic integrity, fragmentation, and possible loss rates were measured using BUSCO. The completeness of the protein sequences aligns with the genomic assessment results, indicating that the assembled genome has high integrity in terms of protein-coding genes. Among the 1,614 conserved core genes in the Embryophyta database, 1,506 (93.3%) were identified as complete BUSCO, and 16 (0.99%) as fragmented BUSCO, indicating that the assembled genome exhibits high integrity and validity, making it suitable for further analysis (Fig. 3).
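The BUSCO percentages quoted above follow directly from the counts. The short sketch below recomputes them; the variable names are ours, and the missing count is derived here by subtraction rather than taken from the paragraph.

```python
# Counts quoted in the paragraph above (embryophyta_odb10).
total_buscos = 1614
complete = 1506
fragmented = 16

# Derived by subtraction (an inference, not a value quoted above).
missing = total_buscos - complete - fragmented


def pct(count: int, total: int = total_buscos) -> float:
    """Percentage of the BUSCO set represented by count."""
    return 100.0 * count / total


summary = {
    "complete": round(pct(complete), 1),
    "fragmented": round(pct(fragmented), 2),
    "missing": round(pct(missing), 1),
}

for label, value in summary.items():
    print(f"{label}: {value}%")
```

The complete and fragmented percentages reproduce the 93.3% and 0.99% reported above.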