Background & Summary

Reynoutria japonica, also known as Polygonum cuspidatum and commonly referred to as Huzhang in Chinese and as Japanese knotweed in Japan1, is a perennial herbaceous species belonging to the family Polygonaceae. In Europe, it is recognized as one of the most invasive alien plant species and is currently prevalent across nearly all European countries2. However, in the Qinba mountain region of central China, it is valued as a traditional medicinal plant and a vegetable crop3. Over the past five years, rapid advancements in sequencing technologies have significantly expanded our understanding of the complete genomes of medicinal plants4. To date, the genomes of approximately 126 key Chinese herbs have been described4, including Artemisia argyi5, Dendrobium officinale6, Taxus wallichiana7, Coptis chinensis8,9, and Andrographis paniculata10. Among these efforts, a previous study using next-generation short-read Illumina sequencing and transcriptome-assisted annotation produced a draft genome assembly for P. cuspidatum, revealing a genome size of 2.56 Gb and identifying 55,075 functional genes11. However, because of the high abundance of transposable elements (TEs) in the R. japonica genome, this draft genome remains incomplete11, hindered by technological limitations inherent to the sequencing platform. These TEs complicate de novo assembly, leading to numerous gaps and errors, particularly in complex genomic regions. These challenges emphasize the need for further refinement to achieve a more accurate and complete representation of the R. japonica genome.

In this study, to overcome the difficulties associated with assembling the R. japonica genome, we employed a combination of Illumina sequencing, high-throughput chromosome conformation capture (Hi-C) sequencing, and single molecule real-time (SMRT) sequencing. Subsequently, the completeness and contiguity of the assembled genome were evaluated. The final assembled genome spans approximately 3.30 Gb with a contig N50 of 1.39 Mb. In total, 99.22% of the assembled sequences were anchored to 22 pseudo-chromosomes, and 74.79% of the genome consisted of repeat elements. Genome annotation revealed 68,646 protein-coding genes and 14,788 non-coding RNAs. The present high-resolution genome of R. japonica provides a valuable reference for the entire Polygonum genus, offering insights into comparative genomics and advancing our understanding of evolutionary relationships and gene functions across closely related species.

Methods

Sample collection and DNA/RNA extraction

R. japonica plants were cultivated in the Qinba Mountains of Shiyan, Hubei Province, China12. Fresh young leaves from one-year-old plants were harvested and immediately frozen in liquid nitrogen (Fig. 1a,b). Genomic DNA was extracted using an improved CTAB method13. Five tissue types (leaves, stems, flowers, roots, and fruits) were collected from a single individual for RNA extraction (Fig. 1c–e). The samples were promptly frozen in liquid nitrogen and stored at −80 °C until extraction. Total RNA was extracted using the TruSeq Stranded mRNA preparation kit, according to the manufacturer’s instructions.

Fig. 1

Photographs of the sampled Reynoutria japonica plants. (a) Cultivation field; (b) One-year-old plants of R. japonica; (c) Inflorescences with floral buds; (d) Fruits; (e) Underground tubers.

Genome sequencing

The sequencing library (DNBSEQ) was constructed using the MGIEasy Universal DNA Library Prep Set (MGI) and quantified using the Qubit™ dsDNA BR Assay Kit (Invitrogen) and Qubit® ssDNA Assay Kit (Invitrogen); sequencing was conducted on the MGISEQ-2000, generating 150-bp paired-end reads (PE150). For PacBio sequencing, DNA libraries were prepared using the SMRTbell® prep kit 2.0, following the manufacturer’s instructions. Sequencing was performed on the PacBio Sequel II platform. For Hi-C sequencing, DNA was purified using the QIAamp DNA Mini Kit (CAT#51306, Qiagen) according to the manufacturer’s protocol. The Hi-C library was subsequently sequenced on the MGISEQ-2000 platform. All genome sequencing and Hi-C sequencing data were derived from a single plant. Details of the data from each platform are provided in Table 1. Transcriptome libraries were sequenced on the Illumina NovaSeq 6000 platform, generating 5.24–6.91 Gb of paired-end reads. These transcriptomic data were used for subsequent gene structure annotation.

Table 1 Data Output Statistics for Genome Sequencing.

In total, we generated 276.14 Gb (~84× coverage) of Illumina reads with a Q20 rate of 96.18%, 322.36 Gb (~98× coverage) of Hi-C reads with a Q20 rate of 92.57%, 252.45 Gb (~77× coverage) of PacBio reads, and 123.30 Gb of RNA data with a stable GC content of 39.60%. These quality metrics support the reliability of our sequencing data (Table 1).
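The coverage figures above follow from dividing each data yield by the genome size. The minimal sketch below reproduces that arithmetic, assuming the ~3.30 Gb assembled size is the denominator (the text does not state which genome size was used); the function name is illustrative only.

```python
def sequencing_depth(total_bases_gb: float, genome_size_gb: float) -> float:
    """Approximate fold coverage: total sequenced bases / genome size."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


# Assumption: coverage is computed against the ~3.30 Gb assembled genome.
ASSEMBLY_SIZE_GB = 3.30

# Data yields (Gb) as reported in the text.
yields_gb = {"Illumina": 276.14, "Hi-C": 322.36, "PacBio": 252.45}

for library, gb in yields_gb.items():
    depth = sequencing_depth(gb, ASSEMBLY_SIZE_GB)
    print(f"{library}: ~{depth:.1f}x coverage")
```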

Genomic survey

The generated Illumina sequencing data were processed using Fastp software (v0.23.3)14 with default parameters. This process included discarding reads with adapter contamination, trimming low-quality bases from both the 5′ and 3′ ends using a sliding window approach, and correcting mismatched base pairs in the overlapping regions of paired-end reads. The clean data were then used for K-mer analysis with GCE software (v1.0.2)15. Based on the 17-mer distribution (Fig. 2), the peak depth (86) and the number of 17-mers (241,952,873,067) were obtained and used to estimate the genome size (2,813 Mb) (Table 2). The estimation was carried out using the following formula: Genome size = K-mer num/Peak depth16. Additionally, based on K-mer analysis, the heterozygosity rate (0.35%) and the proportion of repeat sequences (81.28%) were calculated according to the methods described by Liu et al.16.
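As a worked check of the formula above, the short sketch below divides the reported 17-mer count by the reported peak depth. It only reproduces the stated arithmetic and is not a substitute for GCE; the function name is illustrative.

```python
def estimate_genome_size(kmer_count: int, peak_depth: int) -> float:
    """Genome size (bp) = total number of K-mers / K-mer peak depth."""
    if peak_depth <= 0:
        raise ValueError("peak depth must be positive")
    return kmer_count / peak_depth


# Values reported for the 17-mer distribution.
KMER_COUNT = 241_952_873_067
PEAK_DEPTH = 86

size_bp = estimate_genome_size(KMER_COUNT, PEAK_DEPTH)
print(f"Estimated genome size: {size_bp / 1e6:,.0f} Mb")
```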

Fig. 2

Frequency distribution of depth and K-mer Species.

Table 2 Genome survey statistics for Reynoutria japonica.

Genome assembly and quality assessment

The 252.45 Gb of PacBio subreads were initially assembled using Canu v2.1.117. The primary assembly was polished using PacBio long reads processed with Arrow (Arrow: https://github.com/PacificBiosciences/GenomicConsensus) and short reads processed with Pilon18 with default parameters. Based on this primary genome assembly, Hi-C short reads were subsequently employed to construct chromosomes for R. japonica. Approximately 84,647,123 valid read pairs, filtered from the total pool of 331.31 Gb of clean Hi-C reads (Table 1), were used for assembly and error correction in scaffold extension and chromosome assembly. Quality control measures were applied to the Hi-C reads using Juicer (v1.6)19. The contigs were subsequently organized into chromosome-level scaffolds using 3D-DNA20 (v180922, parameter -r2). The Hi-C interactions were visualized with 3D-DNA and further examined in Juicebox (v1.11.08)21. The de novo genome assembly generated a draft genome of approximately 3,297.29 Mb, consisting of 9,085 contigs with a contig N50 of about 1.39 Mb and a scaffold N50 of roughly 158.33 Mb (Table 3). Finally, a total of 22 pseudochromosomes were obtained (Figs. 3 and 4), encompassing 99.22% (3,271.86 Mb) of the assembled contigs (Table 3). The GC content of these pseudochromosomes was approximately 38.40% (Table 3), ranging from 38.10% to 38.58% (Table 4).
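For readers unfamiliar with the N50 statistic reported above, the following minimal sketch computes it from a list of sequence lengths. The toy lengths are hypothetical and are not taken from this assembly; the function illustrates the standard definition rather than the code used by any assembly tool cited here.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the running sum to half the total.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))
```

The same function applies to contigs or scaffolds; only the input list changes.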

Table 3 Features of the R. japonica genome assembly and annotation.
Fig. 3

Circos diagram of the R. japonica genome. Circles (a) to (g) represent the 22 pseudochromosomes (a), GC content (b), gene density (c), repeat density (d), Copia element density (e), Gypsy element density (f), and collinearity between the pseudochromosomes (g), respectively. All calculations were done within 1 Mb windows.

Fig. 4

Hi-C interactive heatmap (bin size = 100 kb). Genome-wide chromatin interactions in the R. japonica genome at 100-kb resolution. Color blocks represent interaction strength, ranging from white (low) to red (high).

Table 4 Summary of the structure of 22 pseudochromosomes.

The completeness of the genome assembly was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO v5.4.3)22. In a search against the eudicots_odb10 database, 95.20% of the 2,326 BUSCO groups were identified as complete in the R. japonica genome (Table 5). These findings collectively demonstrate the high sequence integrity, continuity, and accuracy of the R. japonica assembly, meeting reference-quality standards.

Table 5 BUSCO assessment result.

Repeat annotation

A combined strategy of homology-based and de novo prediction methods was used to identify the repeat elements (REs) in the R. japonica genome. In the homology-based approach, RepeatMasker v4.0.6 (-e rmblast) and RepeatProteinMask v4.0.6 (-pvalue 0.0001)23 were employed to identify repeats at both the DNA and protein levels by searching against the RepBase library24 and the TE protein database. Tandem repeats were characterized using Tandem Repeats Finder (TRF, v4.07)25. Additionally, LTR_FINDER v1.0.626 with default parameters was used for the de novo prediction of novel repetitive elements.

In this study, the annotated 2,465.7 Mb of repetitive sequences accounted for 74.79% of the assembled R. japonica genome (Table 6). Among these sequences, Long Terminal Repeats (LTR) constituted the greatest proportion (47.918%, comprising 6.831% Copia, 16.958% Gypsy, and 24.129% other LTRs), followed by DNA transposons (DNA) (3.750%), Long Interspersed Nuclear Elements (LINE) (2.707%), and Short Interspersed Nuclear Elements (SINE) (0.004%) (Table 6). The repetitive regions of the genome were then masked before gene prediction.
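The percentages quoted above can be cross-checked with simple arithmetic. The sketch below confirms that the three LTR subclasses sum to the stated LTR total and computes how much of the 74.79% repeat content is not itemized in the text; the variable and function names are illustrative, and no claim is made here about how Table 6 categorizes that remainder.

```python
from typing import Mapping

# Percentages of the assembled genome, as reported in the text.
LTR_SUBCLASSES = {"Copia": 6.831, "Gypsy": 16.958, "Other LTR": 24.129}
OTHER_CLASSES = {"DNA": 3.750, "LINE": 2.707, "SINE": 0.004}
TOTAL_REPEAT_PCT = 74.79


def percent_sum(classes: Mapping[str, float]) -> float:
    """Sum class percentages, rounded to the three decimals used in the text."""
    return round(sum(classes.values()), 3)


ltr_pct = percent_sum(LTR_SUBCLASSES)
itemized_pct = round(ltr_pct + percent_sum(OTHER_CLASSES), 3)
# Share of the genome annotated as repeats but not broken down in the text.
not_itemized_pct = round(TOTAL_REPEAT_PCT - itemized_pct, 3)

print(f"LTR total: {ltr_pct:.3f}%")
print(f"Itemized classes: {itemized_pct:.3f}%")
print(f"Not itemized in the text: {not_itemized_pct:.3f}%")
```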

Table 6 Classification of repeat annotation in R. japonica.

Non-coding gene annotation

In this study, we examined the gene structures of tRNAs, rRNAs, and other non-coding RNAs. tRNAs were predicted using the tRNAscan-SE v1.427 program (http://lowelab.ucsc.edu/tRNAscan-SE/). Given the high conservation of rRNAs, we chose reference rRNA sequences from closely related species and used BLAST (blastn, evalue 1e-05) for rRNA sequence prediction. We also identified additional ncRNAs such as miRNAs and snRNAs by searching the Rfam28 database with Infernal v1.129 using default parameters. This analysis resulted in the annotation of 14,788 noncoding genes, comprising 339 miRNAs, 7,508 tRNAs, 1,355 rRNAs, and 5,586 snRNAs (Table 7).

Table 7 Statistics of noncoding genes.

Protein-coding genes prediction and functional annotation

To ensure precise gene prediction, a comprehensive approach combining de novo prediction, homology-based prediction, and transcriptome-based prediction was used. First, de novo gene structures were predicted with AUGUSTUS v3.2.130 and GlimmerHMM v.3.0.431. Second, homologous protein sequences of three other plants in the order Caryophyllales (Fagopyrum tataricum, Beta vulgaris, and Spinacia oleracea), obtained from NCBI, were aligned to the R. japonica genome with TBLASTN. Third, the RNA-seq data from five tissues were mapped onto the assembled genome with HISAT2 v.2.2.17832. Prior to mapping, the RNA-seq data were filtered using SOAPnuke software (v2.1.0)33 with the following parameters: -lowQual = 20, -nRate = 0.005, and -qualRate = 0.5. This filtering removed paired reads containing adapters, discarded those with more than 0.5% Ns, and eliminated low-quality reads in which over 50% of bases had a quality score (Q) ≤ 20. Subsequently, StringTie v.2.1.67934 identified potential exon regions, and ORFs were predicted via TransDecoder v.5.1.0 using the transcript sequences. Finally, the gene sets were integrated with BRAKER v2.1.535.
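The read-filtering thresholds described above can be illustrated with a small sketch. This is not SOAPnuke's implementation: it only mirrors the two stated criteria (N content and fraction of low-quality bases), omits adapter detection entirely, and uses hypothetical function names and an assumed input of a base string plus per-base Phred scores.

```python
from typing import Sequence, Tuple

# Thresholds taken from the parameters stated in the text.
N_RATE = 0.005     # discard reads with more than 0.5% Ns
LOW_QUAL = 20      # a base counts as low quality when Q <= 20
QUAL_RATE = 0.5    # discard reads with more than 50% low-quality bases

Read = Tuple[str, Sequence[int]]  # (bases, per-base Phred scores)


def read_passes(seq: str, quals: Sequence[int]) -> bool:
    """Return True if a single read satisfies both thresholds."""
    if not seq or len(seq) != len(quals):
        raise ValueError("sequence and quality lists must be non-empty and equal in length")
    n_fraction = seq.upper().count("N") / len(seq)
    low_fraction = sum(q <= LOW_QUAL for q in quals) / len(quals)
    return n_fraction <= N_RATE and low_fraction <= QUAL_RATE


def pair_passes(read1: Read, read2: Read) -> bool:
    """A read pair is kept only if both mates pass (adapter check not modeled)."""
    return read_passes(*read1) and read_passes(*read2)


# Hypothetical 100-bp reads for illustration.
good = ("ACGT" * 25, [30] * 100)
one_n = ("N" + "ACGT" * 25, [30] * 101)
print(pair_passes(good, good), pair_passes(good, one_n))
```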

In this study, we identified 68,646 protein-coding genes in the R. japonica genome. The gene structure and gene elements, including average transcript length, average CDS length, and average exon and intron length, were compared with the above three related species in the order Caryophyllales (Table 8).

Table 8 Comparative analysis of gene elements.

Gene functions were assigned by aligning all predicted protein-coding genes against multiple publicly available databases: Nr (http://www.ncbi.nlm.nih.gov/protein/), UniProt, InterPro, Pfam, Swiss-Prot, GO, and KEGG. Overall, 65,774 protein-coding genes were functionally annotated in at least one database (Fig. 5, Table 9). Among these annotated genes, 65,441 were annotated in the Nr database36, 65,312 in the UniProt database37, 58,797 in the InterPro database38, 54,309 in the Pfam database39, 48,078 in the Swiss-Prot database40, 37,217 in the GO database41, and 32,456 in the KEGG database42 (Fig. 5, Table 9).

Fig. 5

The UpSet plot of Gene function annotations. The intersection size of genes with functional annotation using multiple public databases.

Table 9 Summary of gene function annotations.

Data Records

The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive (GSA43) in the National Genomics Data Center44. These data are publicly accessible under the accession number PRJCA030379, with the following GSA IDs: CRA01925145, CRA01918246, CRA01918347, CRA01945148. The assembled genome sequence has been made available in GenBank under the accession number JBLJBX00000000049. Additionally, the annotation data have been deposited in the Figshare repository50.

Technical Validation

DNA quality was assessed using 1% agarose gel electrophoresis, and DNA concentration was measured with a Qubit 3.0 Fluorometer; the 260/280 absorbance ratio was around 2.0. We used Fastp14 to assess the quality scores of all bases in the raw sequencing data. Additionally, the 17-mer distribution analysis was performed on the clean data to estimate the target genome size. The genome size estimated by the survey (2,813 Mb) was broadly comparable to the assembled genome size (3,297.29 Mb), further supporting the reliability of the evaluation results.

The genome-wide Hi-C interaction heatmap was generated using Juicebox. In the heatmap, the coordinates represent bins across individual chromosomes, with the color of each point reflecting the logarithmic value of the interaction strength between corresponding bin pairs (Fig. 4). Regions with higher interaction strength are represented by deeper colors, and the diagonal shows markedly stronger interactions than the ends.

The scaffold N50, the length such that scaffolds of at least that length contain half of the assembly, reached 158.33 Mb, indicating high assembly contiguity (Table 3). For the genome evaluation, 95.20% of BUSCOs were classified as complete, with 19.20% being single-copy and 76.00% being duplicated. Fragmented BUSCOs made up only 1.30%, while 3.50% were missing. The gene set evaluation similarly shows a high level of completeness at 94.30%, with 22.80% single-copy and 71.50% duplicated BUSCOs. Compared with the genome evaluation, fragmented BUSCOs were slightly lower at 0.60%, and missing BUSCOs were higher at 5.10%. The BUSCO analysis indicates high assembly and annotation quality, with over 94% of BUSCOs complete in both the genome and gene set, suggesting minimal fragmentation and high completeness in the assembly. The high proportion of duplicated BUSCOs may indicate some degree of redundancy, but the low percentage of missing and fragmented BUSCOs further confirms the robustness of the assembly (Table 5).
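The BUSCO percentages above are internally consistent: complete equals single-copy plus duplicated, and the four categories sum to 100%. The minimal sketch below encodes that relationship using the reported values; the class name is illustrative and is not part of the BUSCO software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuscoSummary:
    """BUSCO category percentages for one evaluation."""
    single: float
    duplicated: float
    fragmented: float
    missing: float

    @property
    def complete(self) -> float:
        # Complete BUSCOs are the single-copy and duplicated ones combined.
        return self.single + self.duplicated

    @property
    def total(self) -> float:
        return self.complete + self.fragmented + self.missing


# Values reported in the text for the genome and the predicted gene set.
genome = BuscoSummary(single=19.20, duplicated=76.00, fragmented=1.30, missing=3.50)
gene_set = BuscoSummary(single=22.80, duplicated=71.50, fragmented=0.60, missing=5.10)

for name, summary in (("genome", genome), ("gene set", gene_set)):
    print(f"{name}: complete {summary.complete:.2f}%, total {summary.total:.2f}%")
```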

Usage Notes

The final assembled R. japonica genome spans approximately 3.30 Gb, larger than the 2.56 Gb draft genome of P. cuspidatum11. Although both assemblies contain a high proportion of repetitive sequences, the present assembly has a slightly higher percentage (74.79% compared to 71.54%). The present assembly also exhibits superior quality, with a contig N50 of 1.39 Mb and 99.22% of the sequences anchored to 22 pseudo-chromosomes, demonstrating a high level of assembly integrity. Future research could explore gene functions in R. japonica that are linked to its invasiveness and pharmacological properties, and could use this reference genome for selective breeding initiatives.