Background & Summary

Biodiversity is the backbone of Earth’s life-support system: it sustains life, maintains ecosystem functioning, and underpins numerous benefits from nature that are essential for human well-being1,2,3. Ecosystems contain an enormous variety of organisms, which raises the question of how animals adapt to different habitats. The ability to fly or to live in water, for example, enables animals to occupy novel niches and escape predators. For a number of animals, the molecular bases underlying adaptive evolution have been uncovered, including marine mammals4, the Asian honeybee5, the water strider6 and the poultry shaft louse7. However, the adaptive mechanisms of the majority of extant species remain unknown.

Insects are a crucial component of biodiversity and play a vital role in maintaining ecological balance; there are estimated to be as many as 5.5 million insect species on Earth8. They make up around 80% of animals and around half of all living species, and occupy nearly every terrestrial habitat on the planet8,9. The true bugs (Hemiptera: Heteroptera) are one of the most diverse lineages of insects and the most diversified lineage among the Hemimetabola. They possess a diversity of feeding strategies, ranging from predation on other arthropods and hematophagy on vertebrates to mycetophagy and phytophagy. Hemipterans also vary in habitat, occupying various terrestrial, aquatic and even marine environments10,11. This enormous diversity makes Hemiptera an ideal lineage for exploring adaptive evolution. Previous studies have considered the common ancestor of Heteroptera to be terrestrial and to have undergone multiple independent habitat transitions, from terrestrial to water-surface, aquatic and other habitats, and from aquatic to shoreline11,12. The infraorder Nepomorpha in particular, which contains many species living in water, is useful for understanding water adaptation and habitat transition. Exploring the water adaptation of water bugs can pave the way for subsequent studies of habitat adaptation. However, the lack of chromosome-level genomes prevents deep genomic analysis of aquatic true bugs and impedes the discovery of their adaptive mechanisms.

The family Nepidae (Heteroptera: Nepomorpha), or water scorpions, can be recognized by the antennae hidden under the head and by the long, tail-like respiratory siphon at the rear end of the body, which cannot be retracted into the apex of the abdomen and is unique among aquatic insects13,14,15. They are normally found in ponds, lakes, marshes and riversides, and often move far from their aquatic habitat at night16. As predatory insects, they usually reside in the shade of aquatic plants or between stones, waiting for their prey, which includes small fish, aquatic insects and other aquatic invertebrates16. Nepidae have thus evolved multiple derived traits compared with their ancestors. They are also potential natural enemies for the biocontrol of health pests, because they have been found to control mosquito populations by preying on the larvae17,18. Ranatra chinensis, also known as the Chinese water scorpion, is a representative species of the family Nepidae with a long caudal breathing tube. Its body size ranges from 41 mm to 48 mm, and its respiratory siphon can be as long as its torso18. Ranatra chinensis is widely distributed in China, from Heilongjiang Province in the far north to Guangdong Province in the far south, making it a relatively accessible specimen for scientific studies.

No chromosome-level genome is yet available for the family Nepidae. Analysis of the R. chinensis genome will benefit our interpretation of the molecular mechanisms underlying habitat transition and adaptive evolution in water bugs. With the development of sequencing technologies and bioinformatic tools, obtaining a high-quality reference genome has become feasible for most organisms, enabling investigation of genome evolution and its underlying molecular mechanisms. In this study, we assembled and annotated the chromosome-level genome of R. chinensis by combining PacBio long-read sequencing, Illumina short-read sequencing, and chromosome conformation capture (Hi-C) technologies. Our data provide genomic resources for future exploration of the biology and evolution of water bugs, and will facilitate the understanding of habitat transition and water adaptation.

Methods

Sample preparation

The R. chinensis specimens were collected from Xichuan County, Henan Province, China (33.24° N, 111.02° E), and were placed on dry ice for transportation. All samples were stored at −80°C until further use. A female adult was used for genomic DNA extraction with the CTAB method, and the extracted DNA was purified using a Blood and Cell Culture DNA Midi Kit (QIAGEN, Germany). Degradation and contamination of the extracted DNA were monitored on 1% agarose gels. DNA purity was assessed using a NanoDrop™ One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA), and DNA concentration was measured using a Qubit® 4.0 Fluorometer (Invitrogen, USA).

Genomic DNA sequencing

An Illumina short-read library was first constructed with an insert size of about 350 bp and then sequenced on the Illumina NovaSeq 6000 platform to generate 150 bp paired-end reads. We obtained 59.89 Gb of raw Illumina short-read data and retained 57.55 Gb of clean data after removing adapters and low-quality reads using Fastp version 0.21.019 with default parameters (Table 1).

Table 1 Library sequencing data and methods used in this study to assemble the Ranatra chinensis genome.

For long-read sequencing, a PacBio HiFi library with an insert size of 15 kb was generated, and the long DNA fragments were sequenced using a SMRT cell on the PacBio Sequel II platform (Pacific Biosciences, Menlo Park, USA). A total of 60.41 Gb of clean data was obtained from the raw long reads generated in Circular Consensus Sequencing (CCS) mode (Table 1).

A single female adult was used for chromosome conformation capture (Hi-C) sequencing, and the library was prepared according to the standard protocol described by Belton with minor modifications20. The sample was cut into pieces and mixed with 2% formaldehyde solution for cross-linking, and then treated with New England Biolabs (NEB) buffer to digest nuclei. Biotinylated nucleotides were used to fill the cohesive ends, and after ligation the purified DNA was sheared into fragments of 350 bp. DNA purification was performed using QIAamp DNA Mini Kits (Qiagen). The final Hi-C library was sequenced on the Illumina NovaSeq 6000 platform with 150 bp paired-end reads. Sequencing yielded 59.18 Gb of raw data, and 57.87 Gb of clean data was obtained after applying the same filtering criteria as for the short reads (Table 1).
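As a rough cross-check of the yields above, approximate fold coverage can be computed as sequenced bases divided by genome size. The sketch below is a back-of-the-envelope illustration only: the function name `sequencing_depth` is hypothetical, and the genome size used is the 867.89 Mb contig-level assembly size reported in the assembly section, which is an assumption about the appropriate denominator (the k-mer estimate would give higher values).

```python
def sequencing_depth(yield_gb, genome_size_mb):
    """Approximate fold coverage: sequenced bases divided by genome size."""
    if genome_size_mb <= 0:
        raise ValueError("genome_size_mb must be positive")
    # Convert Gb to Mb so that both quantities share the same unit.
    return (yield_gb * 1000.0) / genome_size_mb


# Clean-data yields (Gb) as reported in the text.
clean_yield_gb = {"Illumina": 57.55, "PacBio HiFi": 60.41, "Hi-C": 57.87}

# Assumption: use the reported contig-level assembly size as the genome size.
ASSEMBLY_SIZE_MB = 867.89

depths = {
    name: sequencing_depth(gb, ASSEMBLY_SIZE_MB)
    for name, gb in clean_yield_gb.items()
}

for library, depth in depths.items():
    print(f"{library}: ~{depth:.1f}x")
```

These figures are derived arithmetic rather than values reported by the authors, and they shift if a different genome-size estimate is substituted.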

Transcriptome sequencing

Total RNA was extracted from two adults (one female and one male) with TRIzol reagent (Thermo Fisher Scientific, USA) for transcriptome sequencing. A paired-end library was constructed using the TruSeq RNA Library Preparation Kit (Illumina, USA). Transcriptome sequencing was performed on an Illumina NovaSeq 6000 platform, yielding a total of 4.86 Gb of clean RNA-seq data (Table 1).

Genome size estimation

Genomic characteristics, including genome size, heterozygosity, and duplication, were estimated from the 57.55 Gb of clean Illumina short reads. The k-mer copy-number distribution was calculated in JELLYFISH version 2.1.321. Genome size and heterozygosity were then estimated from 17-mer depth analysis in GenomeScope version 2.022 with default parameters, giving 605.54 Mb and 2.27%, respectively (Fig. 1).

Fig. 1 The k-mer distribution of Illumina paired-end reads using GenomeScope version 2.0 based on a k value of 17. The k-mer distribution showed double peaks: the first peak with a coverage of ~40 indicates genome duplication, and the second peak with a coverage of ~60 represents a genome size peak.
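The idea behind k-mer-based size estimation can be illustrated with a minimal sketch: count all k-mers in the reads, find the modal depth, and divide the total number of k-mers by that depth. This is a simplified illustration on toy data, not how JELLYFISH or GenomeScope are implemented (GenomeScope fits a mixture model to the full k-mer spectrum); the function names and the `min_depth` error cutoff are assumptions made for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(counts, min_depth=2):
    """Estimate genome size as (total k-mers) / (modal k-mer depth).

    k-mers seen fewer than `min_depth` times are treated as likely
    sequencing errors and ignored (an assumed, simplistic cutoff).
    """
    histogram = Counter(c for c in counts.values() if c >= min_depth)
    if not histogram:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(histogram, key=histogram.get)
    total_kmers = sum(depth * n for depth, n in histogram.items())
    return total_kmers / peak_depth


# Toy data: a 24-bp "genome" sequenced 10 times without errors.
genome = "ACGTTGCAAGCTTAGGCTAACCGT"
reads = [genome] * 10
kmer_counts = count_kmers(reads, k=5)
estimated_size = estimate_genome_size(kmer_counts)
print(estimated_size)
```

On this toy input the estimate equals the number of k-mer positions in the sequence (length minus k plus 1), which approaches the true length when the genome is much longer than k.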

Chromosome-level genome assembly

The initial de novo assembly of the R. chinensis genome was performed from the PacBio sequence data using Hifiasm version 0.13 with default parameters23. The assembly was then processed with the Purge_Dups pipeline24 to remove alternative haplotypes and redundant fragments. Subsequently, the contigs were polished using the Illumina sequencing data to improve their quality. Finally, an 867.89 Mb contig-level genome assembly of R. chinensis was obtained, containing 689 contigs with contig N50 and N90 sizes of 26.48 Mb and 3.80 Mb, respectively, and a GC content of 39.50% (Table 2).
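For readers unfamiliar with the contiguity metrics quoted above, the following sketch shows how N50 and N90 are conventionally computed from a list of contig lengths: sort the contigs from longest to shortest and report the length at which the running total first reaches 50% (or 90%) of the assembly. The function name `nx` and the toy lengths are illustrative assumptions, not the authors' code or data.

```python
def nx(lengths, percent):
    """Return the Nx statistic for a list of contig lengths.

    Nx is the length L such that contigs of length >= L together
    contain at least `percent` % of all assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding issues.
        if running * 100 >= percent * total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy contig lengths (arbitrary units), not the real assembly.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = nx(toy_lengths, 50)
toy_n90 = nx(toy_lengths, 90)
print(toy_n50, toy_n90)
```

The same function applies unchanged to scaffold lengths when computing a scaffold N50.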

Table 2 Statistics of the Ranatra chinensis genome assembly.

The high-quality chromosome-scale genome was generated using a scaffolding pipeline based on Durand25. First, BWA-MEM version 0.7.1726 with the parameters ‘mem -SP5M’ was used to map the Hi-C data to the contig assembly. The DpnII sites were generated using the ‘generate_site_positions.py’ script in Juicer version 1.525. Subsequently, contigs were assembled into chromosome-level scaffolds using the 3D-DNA pipeline with the parameter ‘-r 2’27. After inspection of the Hi-C contact maps, the chromosome interaction matrix was manually adjusted and corrected using Juicebox version 1.11.0828. Ultimately, we anchored the contigs into 23 pseudo-chromosomes, and the final chromosome-level genome assembly of R. chinensis had a scaffold N50 of 29.80 Mb (Fig. 2; Table 2).
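To make the restriction-site step concrete, the sketch below locates every occurrence of the DpnII recognition motif (GATC) in a set of contig sequences. It is a conceptual illustration only and is not the Juicer ‘generate_site_positions.py’ script, whose exact output format may differ; the function name, the 1-based coordinate convention and the toy contigs are assumptions.

```python
def restriction_site_positions(sequence, motif="GATC"):
    """Return 1-based start positions of every motif occurrence.

    Overlapping occurrences are reported, and matching is
    case-insensitive so soft-masked sequence is handled.
    """
    sequence = sequence.upper()
    motif = motif.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start + 1)
        start = sequence.find(motif, start + 1)
    return positions


# Toy contigs (hypothetical), keyed by name.
toy_contigs = {
    "contig_1": "AAGATCTTGATCGATCAA",
    "contig_2": "CCCCGGGG",
}

site_table = {
    name: restriction_site_positions(seq)
    for name, seq in toy_contigs.items()
}
for name, sites in site_table.items():
    print(name, *sites)
```

Because GATC is its own reverse complement, scanning a single strand is sufficient to find all sites.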

Fig. 2 The Circos atlas of the Ranatra chinensis chromosome-level genome. Tracks represent (a) the distribution of chromosome karyotypes, (b) gene density, (c) transposable element content, (d) DNA transposon and (e) GC density. Densities were calculated in 100-kb windows. Chr8, Chr21, Chr22 and Chr23 are predicted to be the sex chromosomes and the remaining are autosomes.

Synteny analysis and determination of sex chromosomes in R. chinensis

Using JCVI v1.1.17 with default parameters29, we performed synteny analysis to identify the sex chromosomes of R. chinensis against public chromosome-level genomes whose sex chromosomes have been verified, namely Rhynocoris fuscipes (Hemiptera: Reduviidae)30, Triatoma rubrofasciata (Hemiptera: Reduviidae)31 and Riptortus pedestris (Hemiptera: Alydidae)32. Chr8, Chr21, Chr22 and Chr23 of R. chinensis exhibited high homology with Chr12 and Chr13 of T. rubrofasciata, Chr13, Chr14 and Chr15 of R. fuscipes, and ChrX of R. pedestris (Fig. S1). This result indicates that R. chinensis has the same sex chromosome system as Nepa cinerea (Heteroptera: Nepidae, N = 14 A + X1X2X3X4) and Ranatra linearis (Heteroptera: Nepidae, N = 19 A + X1X2X3X4) reported in a previous study33, and that Chr8, Chr21, Chr22 and Chr23 correspond to the X1, X2, X3, and X4 chromosomes.

Prediction of repetitive elements

Repeat sequences in the R. chinensis genome were detected with Extensive de novo TE Annotator (EDTA) version 1.9.434. LTR retrotransposons were identified with LTR FINDER version 1.0735, LTRharvest36, and LTR retriever version 2.9.037 with default parameters. DNA transposons were classified using TIR Learner38 and HelitronScanner39 with default parameters. RepeatMasker version 4.0.7 (parameters: -gff -xsmall -no_is)40 and RepeatProteinMask version 4.0.7 (parameters: -engine wublast) were used to find interspersed repeats against the RepBase database41 (http://www.girinst.org/repbase). In addition, Tandem Repeats Finder version 4.0942 was used to identify tandem repeats de novo with parameters ‘2 7 7 80 10 50 500 -f -d -m’. RepeatModeler version 2.0.443 (parameters: ‘-engine ncbi -pa 4’) was used to construct a repetitive sequence library, and RepeatMasker version 4.0.740 was used to annotate repeat elements against this library. In total, 391.73 Mb (45.13%) of the genomic sequence was identified as repetitive elements, mainly comprising 22.96% retrotransposons, 3.35% DNA transposons and 11.26% tandem repeats (Table 3). Retrotransposons include LTR, SINE, and LINE elements, and LTR elements are further classified into Copia, Gypsy, and other LTR.

Table 3 Statistics of repeat elements in the Ranatra chinensis genome.

Protein-coding gene prediction and functional annotation

Protein-coding genes (PCGs) were predicted by combining homology-based, ab initio and transcriptome-based prediction. HISAT2 version 2.2.144 was used to map the RNA-seq short reads to the genome with the parameter ‘-k 2’, and StringTie version 2.4.045 was then used to assemble the mapped reads into transcripts with default parameters. For homology-based prediction, the protein sequences of eight representative insect species were downloaded from the NCBI GenBank database (Table S1). Homologous proteins were aligned against the R. chinensis genome using Exonerate version 2.4.0 with default parameters to train the gene sets. Additionally, the bam2hints program (parameter: -intronsonly) in AUGUSTUS version 3.2.346 was used to convert the sorted BAM file of mapped RNA-seq reads into a hints file. AUGUSTUS version 3.2.346 was then run with default parameters to predict coding genes from the assembled genome, taking the trained gene sets and the hints file as input. Finally, MAKER version 2.31.1047 was used to merge the homology-based, ab initio and transcript-based predictions into a consensus high-confidence gene set. We predicted a total of 18,424 genes in the R. chinensis genome, with an average gene length of 5,900.48 bp (Table 4). The average lengths of the coding sequences (CDS) and protein sequences were 1,189.79 bp and 396.60 AA, respectively (Table 4). The corresponding statistics for sex chromosomes and autosomes are provided in Table S2.
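Summary statistics of this kind (gene count, mean gene length, mean CDS length) are typically derived from the annotation file. The sketch below shows one way to compute them from GFF3-style records; it is an illustrative assumption, not the authors' pipeline. It assumes one transcript per gene, CDS features whose `Parent` attribute names the transcript, and a hypothetical toy annotation. Mean protein length is approximated as mean CDS length divided by three, which is consistent with the reported values (1,189.79 / 3 ≈ 396.60).

```python
from collections import defaultdict


def gene_stats(gff_lines):
    """Summarize gene and CDS lengths from GFF3-formatted lines."""
    gene_lengths = []
    cds_by_parent = defaultdict(int)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.rstrip("\n").split("\t")
        feature, start, end = fields[2], int(fields[3]), int(fields[4])
        length = end - start + 1  # GFF3 coordinates are 1-based, inclusive
        if feature == "gene":
            gene_lengths.append(length)
        elif feature == "CDS":
            attrs = dict(
                item.split("=", 1)
                for item in fields[8].split(";")
                if "=" in item
            )
            # Sum CDS segments belonging to the same transcript.
            cds_by_parent[attrs["Parent"]] += length
    if not gene_lengths or not cds_by_parent:
        raise ValueError("no gene or CDS features found")
    mean_cds = sum(cds_by_parent.values()) / len(cds_by_parent)
    return {
        "genes": len(gene_lengths),
        "mean_gene_bp": sum(gene_lengths) / len(gene_lengths),
        "mean_cds_bp": mean_cds,
        "mean_protein_aa": mean_cds / 3,  # three nucleotides per codon
    }


# Hypothetical toy annotation with two genes.
toy_gff = [
    "##gff-version 3",
    "chr1\ttoy\tgene\t1\t1000\t.\t+\t.\tID=g1",
    "chr1\ttoy\tmRNA\t1\t1000\t.\t+\t.\tID=t1;Parent=g1",
    "chr1\ttoy\tCDS\t1\t300\t.\t+\t0\tID=c1;Parent=t1",
    "chr1\ttoy\tCDS\t701\t1000\t.\t+\t0\tID=c2;Parent=t1",
    "chr1\ttoy\tgene\t2001\t2600\t.\t-\t.\tID=g2",
    "chr1\ttoy\tmRNA\t2001\t2600\t.\t-\t.\tID=t2;Parent=g2",
    "chr1\ttoy\tCDS\t2001\t2600\t.\t-\t0\tID=c3;Parent=t2",
]

toy_stats = gene_stats(toy_gff)
print(toy_stats)
```

Annotations with multiple isoforms per gene would need an extra step to pick a representative transcript before averaging.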

Table 4 Statistics of predicted protein-coding genes in the Ranatra chinensis genome assembly.

The predicted genes were functionally annotated using multiple tools, including eggnog-mapper48 (parameters: -m diamond --tax_scope auto --go_evidence experimental --target_orthologs all --seed_ortholog_evalue 0.001 --seed_ortholog_score 60 --query-cover 20 --subject-cover 0 --override), InterProScan version 5.049 (parameters: -iprlookup -goterms -appl Pfam -f TSV), BLAST version 2.2.2850 (parameter: -evalue 1e-5), and HMMER version 3.3.251 (parameters: --noali --cut_ga Pfam-A.hmm). These tools were used to search several public databases: Gene Ontology (GO), Clusters of Orthologous Groups of Proteins (COG), Kyoto Encyclopedia of Genes and Genomes (KEGG), NCBI non-redundant protein (Nr), Swiss-Prot, and Pfam. Overall, 16,262 genes were functionally annotated in at least one public database (Table 5).

Table 5 Number of functionally annotated protein-coding genes in the Ranatra chinensis genome.

Data Records

Genomic Illumina short-read data were deposited at the NCBI Sequence Read Archive database under accession number SRR2878529252. Genomic PacBio HiFi sequencing data were deposited at the NCBI Sequence Read Archive database under accession number SRR2878938053. RNA-seq data were deposited at the NCBI Sequence Read Archive database under accession number SRR2878888054. The Hi-C sequencing data were deposited at the NCBI Sequence Read Archive database under accession number SRR2878753855.

The final chromosome assembly was submitted to GenBank at NCBI under accession number JBFDAA00000000056. The genome sequence and raw reads have been deposited in GenBank and Sequence Read Archive at NCBI under BioProject PRJNA110371857.

Technical Validation

To assess the accuracy of the final genome assembly, we mapped the Illumina short reads to the R. chinensis genome with BWA-MEM version 0.7.1726; 97.08% of the short reads mapped successfully. Benchmarking Universal Single-Copy Orthologs (BUSCO version 3.0.2)58 was used to evaluate genome completeness against the insecta_odb10 database, giving a completeness of 95.7%. Of the 1,309 complete orthologs, 1,299 were single-copy and 10 were duplicated; a further eight genes were fragmented and 48 were missing (Table 6).
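BUSCO completeness is the share of benchmark orthologs recovered as complete (single-copy plus duplicated) out of all orthologs searched. The sketch below shows that calculation with deliberately hypothetical counts, not the values in Table 6; the function name `busco_summary` and the layout of the returned dictionary are assumptions made for illustration.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages of the total."""
    total = single + duplicated + fragmented + missing
    if total <= 0:
        raise ValueError("at least one ortholog must be counted")

    def pct(count):
        return round(100.0 * count / total, 1)

    return {
        "complete": pct(single + duplicated),
        "single": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
        "n": total,
    }


# Hypothetical counts chosen for easy arithmetic (not the study's data).
toy_busco = busco_summary(single=90, duplicated=5, fragmented=2, missing=3)
print(
    "C:{complete}%[S:{single}%,D:{duplicated}%],"
    "F:{fragmented}%,M:{missing}%,n:{n}".format(**toy_busco)
)
```

The four categories are mutually exclusive, so their counts should sum to the size of the lineage dataset used.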

Table 6 BUSCO evaluation for the final genome assembly of Ranatra chinensis.