Background & Summary

Members of the genus Chaunax (Lophiiformes: Chaunacidae) are commonly referred to as sea toads, and they are widely distributed across deep-sea areas at mid- to low latitudes1,2,3. These sea toads are particularly prevalent on the outer continental shelf and the upper continental slope, inhabiting depths ranging from 200 to 2,500 m3,4. Their distinctive morphological traits include a head that is rounded to slightly cuboidal; skin interspersed with minute, spine-like scales; and lateral-line neuromasts on both the head and body3. These features are thought to enhance their sensory capabilities, which are crucial for survival in challenging deep-sea habitats5,6.

Similar to other members of the order Lophiiformes, sea toads are benthic and use their pectoral and pelvic fins for support and locomotion across the seafloor, a movement often likened to ‘walking’7. They use a sit-and-wait strategy to capture their prey and can employ their esca (a specialized lure at the end of the first dorsal fin spine) to entice prey within striking distance8,9. These distinctive morphological and behavioral traits make Chaunax species excellent models for studies of evolutionary adaptations to deep-sea environments.

Genetic and evolutionary insights into Chaunax have been limited by a lack of deep-sea sampling and, consequently, of genomic data for Chaunacidae. Here, we collected a sea toad specimen in the genus Chaunax from the Zhenbei seamount at a depth of 555.3 m using a submersible vehicle (Faxian) (Fig. 1). The Chaunax specimen had a pinkish-red body with bright-white patches on the dorsal surface, and its skin was covered with a mix of numerous bifurcated and simple spinules. Its specialized pectoral fins, which resemble ‘little feet’, supported the body and allowed it to move quickly across the sea floor. We employed PacBio long-read sequencing, Illumina sequencing, and high-throughput chromosome conformation capture (Hi-C) technology to generate an annotated chromosome-level genome assembly of this Chaunax specimen. This high-quality genome assembly will aid future studies of the deep-sea adaptation and phenotypic evolution of this species, as well as genomic analyses of related species. Our assembly will also advance our understanding of the genomic evolution and phylogenetic relationships within the order Lophiiformes.

Fig. 1 In situ observations of the Chaunax sp.

Methods

Sample collection and DNA extraction

The Chaunax specimen was collected using a submersible vehicle (Faxian) from the Zhenbei seamount (15°04′45.888″N, 116°33′55.210″E, 555.3 m deep) in the South China Sea (Fig. 1). The sample was kept in a closed sample chamber inside the sample basket of the submersible and was then immediately stored in liquid nitrogen. All experimental protocols were approved in accordance with the relevant guidelines and regulations established by the Institutional Animal Care and Use Committee of the Institute of Oceanology, Chinese Academy of Sciences. Genomic DNA was extracted from muscle tissue using the sodium dodecyl sulfate method10. The quality and quantity of DNA were determined using agarose gel electrophoresis and a Qubit Fluorometer, respectively.

Illumina sequencing and genome size estimation

Illumina paired-end libraries with insert sizes of 350 bp were constructed and sequenced on an Illumina NovaSeq6000 platform (Illumina, CA, USA). Low-quality reads and reads contaminated with sequencing adaptors were trimmed using fastp11 with the following parameters: -q 10 -u 50 -y -g -Y 10 -e 20 -l 100 -b 150 -B 150. A total of 32.41 Gb of Illumina short reads (clean data) were generated and retained for the genome survey (Table 1). The genome survey of Chaunax sp. was performed using the K-mer method. K-mer analysis was conducted using jellyfish v2.2.712 with an optimal K-value of 17 (Table 2). A K-mer frequency distribution was generated to estimate the genome size, heterozygosity, and proportion of repetitive sequences using GenomeScope software13. A total of 24,428,214,893 K-mers were obtained, with the main K-mer peak at a depth of 35 (Table 2). The genome size estimated using the formula Knum/Kdepth was approximately 683.46 Mb. The heterozygosity rate and proportion of repetitive sequences were 0.38% and 36.16%, respectively (Table 2).
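As a worked check of the Knum/Kdepth formula, the short Python sketch below divides the total K-mer count by the peak depth, both taken from the text. The function and variable names are illustrative. The plain ratio gives about 697.95 Mb, slightly above the reported 683.46 Mb; the reported value presumably includes a correction (for example, exclusion of low-frequency, error-derived K-mers) that is not described here, so that explanation is an assumption rather than the authors' stated procedure.

```python
# Worked example of the K-mer genome size formula: genome size = Knum / Kdepth.
K_NUM = 24_428_214_893  # total 17-mers reported in the text
K_DEPTH = 35            # depth of the main K-mer peak reported in the text


def estimate_genome_size_mb(k_num: int, k_depth: int) -> float:
    """Return the uncorrected genome size estimate in Mb (1 Mb = 1e6 bp)."""
    if k_depth <= 0:
        raise ValueError("k_depth must be positive")
    return k_num / k_depth / 1e6


raw_estimate_mb = estimate_genome_size_mb(K_NUM, K_DEPTH)
print(f"Uncorrected estimate: {raw_estimate_mb:.2f} Mb")  # 697.95 Mb
```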

Table 1 Sequencing data used for the assembly and annotation of the Chaunax sp. genome.
Table 2 Statistics of the genome survey, genome assembly and quality assessments.

PacBio long-read sequencing and Hi-C sequencing

The genomic DNA was fragmented to 20 kb and sequenced using the PacBio Sequel II platform (Pacific Biosciences, USA) following the manufacturer’s protocols. Briefly, the gDNA was sheared to the target fragment size using the g-TUBE device (Covaris) for the construction of 20-kb libraries. Damage repair and end repair were performed on the sheared DNA fragments using the SMRTbell Damage Repair Kit. After dumbbell adapters were attached, the fragments were digested by exonuclease. BluePippin electrophoresis (Sage Science, MA, USA) was used for size selection of the sequencing library, with the cutoff threshold set to 20 kb. A total of 23.67 Gb of high-quality HiFi reads were produced using the circular consensus sequencing mode on the PacBio Sequel II platform (Table 1).

A Hi-C library was constructed by chromatin cross-linking, restriction endonuclease (DpnII) cleavage, end repair and biotin labeling, ligation, and DNA purification and shearing; interacting DNA fragments were captured with streptavidin magnetic beads for library construction. A Qubit 2.0 Fluorometer and an Agilent 2100 Bioanalyzer were used to measure the concentration and insert size of the library, and Q-PCR was used to accurately quantify the effective concentration of the library to ensure its quality. The Hi-C library was sequenced on an Illumina NovaSeq (PE150) platform. A total of 70.63 Gb of Hi-C data were generated after quality control (Table 1).
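Sequencing depth can be approximated by dividing each data yield by the genome size. The sketch below does this for the three DNA datasets, assuming the final assembly length of 706.94 Mb (reported in the assembly section) as the genome size. The names are illustrative, and the resulting depths are rough estimates rather than values stated by the authors.

```python
# Approximate fold coverage = data yield / genome size.
ASSEMBLY_MB = 706.94  # final assembly length from the text, used here as genome size (assumption)

YIELDS_GB = {
    "Illumina": 32.41,     # clean short reads
    "PacBio HiFi": 23.67,  # HiFi reads
    "Hi-C": 70.63,         # Hi-C data after quality control
}


def fold_coverage(yield_gb: float, genome_mb: float) -> float:
    """Convert a yield in Gb and a genome size in Mb into fold coverage."""
    return yield_gb * 1000.0 / genome_mb


for library, yield_gb in YIELDS_GB.items():
    print(f"{library}: ~{fold_coverage(yield_gb, ASSEMBLY_MB):.1f}x")
```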

Chromosome-level genome assembly

The high-accuracy PacBio HiFi reads were assembled into the initial set of contigs using Hifiasm v0.1914 with the following parameters: -l 2 -n 4. Contaminant and mitochondrial sequences were identified and removed by comparing the genome assembly with the nucleotide sequence database (nt) and mitochondrial database (https://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/) from the National Center for Biotechnology Information (NCBI). The initial genome assembly was 709.63 Mb in length, comprising 354 contigs with an N50 of 15.24 Mb (Table 2). The size of the assembled genome was close to the estimate from the genome survey, supporting the accuracy and integrity of the assembly.
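For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows how it is usually computed: contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the assembly length. The contig lengths in the example are hypothetical and are not taken from the Chaunax sp. assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig where the cumulative length covers half the assembly.
        if running * 2 >= total:
            return length
    return ordered[-1]  # unreachable for non-empty input; kept for completeness


toy_contigs = [40, 30, 20, 10]  # hypothetical lengths, arbitrary units
print(n50(toy_contigs))  # 30
```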

To generate the chromosome-level assembly, unmapped paired-end reads, singleton reads, and other invalid reads were filtered using HiC-Pro v2.10.015. Only unique valid interaction pairs (79.12%) were retained for further assembly. A total of 70.63 Gb of clean read pairs were obtained from the Hi-C library and mapped to the assembled genome using BWA v0.7.1716. The uniquely mapped data were retained, and the clustered contigs were sorted and oriented using LACHESIS software17. After Hi-C assembly and manual heat map adjustment, a total of 679.46 Mb of genome sequences were anchored and oriented onto 24 pseudo-chromosomes (Fig. 2a and Table 3). These pseudo-chromosomes, ranging in size from 13.00 to 35.46 Mb (Fig. 2b and Table 3), comprised approximately 96.11% of the total genome. The final chromosome-level genome assembly had a total length of 706.94 Mb and showed a high level of continuity, with a scaffold N50 of 29.42 Mb (Table 2).
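The anchoring rate quoted above follows directly from the two lengths given in the text. The sketch below reproduces that arithmetic; the function name is illustrative.

```python
ANCHORED_MB = 679.46  # sequence anchored onto the 24 pseudo-chromosomes (from the text)
ASSEMBLY_MB = 706.94  # total length of the final assembly (from the text)


def anchoring_rate(anchored_mb: float, total_mb: float) -> float:
    """Percentage of the assembly that is placed on pseudo-chromosomes."""
    return anchored_mb / total_mb * 100.0


rate = anchoring_rate(ANCHORED_MB, ASSEMBLY_MB)
unanchored_mb = ASSEMBLY_MB - ANCHORED_MB
print(f"Anchored: {rate:.2f}%")                # 96.11%
print(f"Unanchored: {unanchored_mb:.2f} Mb")   # 27.48 Mb
```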

Fig. 2 Chromosome-level genome assembly of Chaunax sp. (a) Contact map of chromosomal interactions in the Chaunax sp. genome using Hi-C data. (b) Circos plot of 24 chromosomes in Chaunax sp. genome. The tracks from outside to inside are 24 chromosome ideograms, transposable element (TE) density, simple sequence repeat (SSR) density, gene density, GC content, and co-linearity relationship.

Table 3 Summary of the 24 assembled chromosomes of Chaunax sp.

Transcriptome sequencing and assembly

RNA was extracted from multiple tissue samples, including muscle, kidney, liver, gonad, and cholecyst, using the TRIzol reagent (Thermo Fisher Scientific). RNA integrity and quality were assessed using agarose gel electrophoresis, and the quantity of RNA was determined using a Qubit Fluorometer. Sequencing libraries were constructed using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, USA) per the manufacturer’s instructions and sequenced on an Illumina NovaSeq6000 platform to generate 150 bp paired-end reads. After quality-based trimming using Trimmomatic-0.3917, a total of 6.77 Gb of clean transcriptome data were obtained (Table 1). The clean reads were aligned to the sea toad genome using HISAT2 v2.2.118.

Annotation of repetitive sequences

Transposable elements (TEs) and tandem repeats were annotated via the following workflows. TEs were identified via a combination of homology-based searches and de novo prediction. A de novo repeat library was generated using RepeatModeler2 v2.0.119, along with its two embedded programs RECON v1.0.820 and RepeatScout v1.0.621, all of which were run with default parameters. Long terminal repeats (LTRs) were identified using LTR_retriever v2.9.022, mainly based on the predictions of LTRharvest v1.5.1023 and LTR_FINDER v1.0724. A non-redundant species-specific TE library was constructed by combining the de novo TE sequence library above with the known Dfam v3.5 database, and RepeatClassifier19 was used to classify the prediction results. Finally, RepeatMasker v4.1.225 was used to predict the TEs of the genome based on the constructed repeat sequence database. Tandem repeats were annotated using TRF v4.0926 and MISA v2.127. Repetitive sequences accounted for 30.20% of the Chaunax sp. genome: TEs totaled 143.93 Mb (20.36% of the genome), and tandem repeats totaled 69.54 Mb (9.84% of the genome) (Table 4).
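The repeat percentages above can be reproduced from the reported lengths. The sketch below assumes that the denominator is the final assembly length of 706.94 Mb, which is consistent with the quoted percentages but is not stated explicitly in this section; names are illustrative.

```python
ASSEMBLY_MB = 706.94  # final assembly length (assumed denominator)
TE_MB = 143.93        # total TE length from the text
TANDEM_MB = 69.54     # total tandem repeat length from the text


def percent_of_genome(length_mb: float, genome_mb: float = ASSEMBLY_MB) -> float:
    """Share of the genome occupied by a feature class, in percent."""
    return length_mb / genome_mb * 100.0


te_pct = percent_of_genome(TE_MB)
tandem_pct = percent_of_genome(TANDEM_MB)
total_pct = percent_of_genome(TE_MB + TANDEM_MB)

print(f"TEs: {te_pct:.2f}%")                 # 20.36%
print(f"Tandem repeats: {tandem_pct:.2f}%")  # 9.84%
print(f"All repeats: {total_pct:.2f}%")      # 30.20%
```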

Table 4 Statistics of the annotated Chaunax sp. repeat sequences.

Gene prediction and functional annotation

Protein-coding genes were annotated using a combination of ab initio prediction, homology-based searches, and RNA sequencing (RNA-seq) (Table 5). Augustus v.3.1.028 and SNAP v2006-07-2829 were used for ab initio gene prediction with default parameters. GeMoMa v.1.730 was used for homology-based prediction, with the protein sequences of Lophius litulon31, Lophiodes sp.31, Solocisquama erythrina31, Takifugu rubripes32, and Thamnaconus septentrionalis33 downloaded from the NCBI and figshare databases. RNA-seq-based gene prediction was performed by mapping clean RNA-seq reads to the reference genome using HISAT2 v2.2.118, and the transcripts were assembled using StringTie v.1.2.334. GeneMarkS-T v5.135 was used to predict genes based on the assembled transcripts. PASA v2.4.136 was used to predict genes based on the unigenes (and full-length transcripts from the PacBio HiFi sequencing) assembled by Trinity v.2.1137. Gene models from these different approaches were combined using EVM v1.1.138 and updated by PASA. The final gene models were annotated by searching their sequences against the GenBank NR, eggNOG, GO, KEGG, TrEMBL, KOG, Pfam, and SwissProt databases using an E-value cut-off of 1 × 10−5. We predicted a total of 25,280 protein-coding genes in the Chaunax sp. genome (Table 5), with a total gene length of 398,546,511 bp, exon length of 55,947,487 bp, coding sequence length of 42,599,371 bp, and intron length of 342,599,024 bp (Table 6). A total of 23,457 genes (92.79% of the total) were functionally annotated using public databases (Table 7).
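The gene statistics above are internally consistent, which can be checked with a few lines of arithmetic. The sketch below assumes that the reported lengths are genome-wide totals summed over all predicted genes; the mean gene length it prints is derived here and is not a value reported in the text.

```python
N_GENES = 25_280          # predicted protein-coding genes
N_ANNOTATED = 23_457      # genes with a functional annotation
GENE_BP = 398_546_511     # total gene length (assumed to be a genome-wide sum)
EXON_BP = 55_947_487      # total exon length
INTRON_BP = 342_599_024   # total intron length

# Exons and introns should partition the gene space exactly.
exon_plus_intron = EXON_BP + INTRON_BP
print(exon_plus_intron == GENE_BP)  # True

annotated_pct = N_ANNOTATED / N_GENES * 100.0
print(f"Functionally annotated: {annotated_pct:.2f}%")  # 92.79%

mean_gene_bp = GENE_BP / N_GENES  # derived value, not reported in the text
print(f"Mean gene length: {mean_gene_bp:,.0f} bp")
```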

Table 5 Gene prediction through integrating multiple methods.
Table 6 Comparison of gene models annotated in the Chaunax sp. genome with those of other closely related species.
Table 7 Functional annotation of the predicted genes of Chaunax sp.

Non-coding RNA annotation and pseudogene prediction

For non-coding RNA annotation, tRNAs and rRNAs were predicted using tRNAscan-SE v1.3.139 and Barrnap v0.9 (run with the parameter --kingdom euk)40, respectively; miRNAs, snRNAs, and snoRNAs were identified using Infernal v1.141 against the Rfam v14.542 database. A total of 2,462 tRNAs, 1,811 rRNAs, 202 miRNAs, 440 snRNAs, and 175 snoRNAs were predicted (Table 8).

Table 8 Statistics of non-coding RNAs and pseudogenes in the Chaunax sp. genome.

For pseudogene prediction, GenBlastA v1.0.443 was used to scan the whole genome after masking predicted functional genes. Putative candidates were then screened for premature stop codons and frame-shift mutations using GeneWise v2.4.144. In total, 67 pseudogenes were identified, with a combined length of 496,033 bp and an average length of 7,403 bp (Table 8).
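The counts reported for non-coding RNAs and pseudogenes can be tallied as follows. The total ncRNA count printed by this sketch is derived here from the per-class numbers and is not a figure given in the text; variable names are illustrative.

```python
NCRNA_COUNTS = {
    "tRNA": 2_462,
    "rRNA": 1_811,
    "miRNA": 202,
    "snRNA": 440,
    "snoRNA": 175,
}
PSEUDOGENE_COUNT = 67
PSEUDOGENE_TOTAL_BP = 496_033

total_ncrna = sum(NCRNA_COUNTS.values())  # derived total across the five classes
mean_pseudogene_bp = PSEUDOGENE_TOTAL_BP / PSEUDOGENE_COUNT

print(f"Total ncRNAs: {total_ncrna}")                          # 5090
print(f"Mean pseudogene length: {mean_pseudogene_bp:.0f} bp")  # 7403 bp
```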

Data Records

The Chaunax sp. genome has been deposited in GenBank under the accession number JBAGJB000000000 (ref. 45) and the BioProject number PRJNA1068823. The Illumina, PacBio HiFi, Hi-C, and RNA-seq data have been deposited in the NCBI Sequence Read Archive under the accession numbers SRR27768100-SRR27768103 (ref. 46). The genome-related annotation files can be accessed through Figshare at https://doi.org/10.6084/m9.figshare.25100474 (ref. 47). The specific accessions are provided in the Material and Methods sections describing the data and analyses.

Technical Validation

Genome assembly and annotation completeness evaluation

The completeness of the final genome assembly was evaluated using Benchmarking Universal Single-Copy Orthologs (BUSCO v5.2.1)48 with the actinopterygii_odb10 database and the Core Eukaryotic Genes Mapping Approach (CEGMA v2.5)49. The assembly contained 97.06% of the BUSCO genes as complete and 99.56% of the core eukaryotic genes (Table 2). The trimmed Illumina short reads and PacBio long reads were mapped against the assembled genome using BWA to evaluate the accuracy of the assembly, and the mapping rates were 99.63% and 99.75%, respectively (Table 3). The BUSCO completeness of the predicted gene models was assessed against the actinopterygii_odb10 database in protein mode; 97.20% of the BUSCO genes were recovered as complete, indicating the high completeness of the gene annotation (Table 2).
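BUSCO reports its result as a one-line summary in the notation C:…%[S:…%,D:…%],F:…%,M:…%,n:…, where C is complete, S single-copy, D duplicated, F fragmented, M missing, and n the number of orthologs searched. The sketch below parses such a line and checks that its categories add up. The example string contains made-up values for illustration only and is not the BUSCO output of this study; the function names are hypothetical.

```python
import re

# Illustrative summary line only; these values are NOT results from the paper.
EXAMPLE_SUMMARY = "C:97.0%[S:95.5%,D:1.5%],F:1.0%,M:2.0%,n:3640"

SUMMARY_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line: str) -> dict:
    """Extract the percentage categories and ortholog count from a summary line."""
    match = SUMMARY_PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    scores = {key: float(value) for key, value in match.groupdict().items()}
    scores["n"] = int(scores["n"])
    return scores


def is_consistent(scores: dict, tolerance: float = 0.2) -> bool:
    """Check that C = S + D and C + F + M = 100, allowing for rounding."""
    complete_ok = abs(scores["C"] - (scores["S"] + scores["D"])) <= tolerance
    total_ok = abs(scores["C"] + scores["F"] + scores["M"] - 100.0) <= tolerance
    return complete_ok and total_ok


scores = parse_busco_summary(EXAMPLE_SUMMARY)
print(scores["C"], is_consistent(scores))
```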

The interaction strength among chromosomes was evaluated and a Hi-C interaction heat map was constructed using HiCPlotter software50. The assembled sequences were anchored and oriented onto 24 pseudo-chromosomes (Fig. 2); within each pseudo-chromosome group, the interaction strength was higher at diagonal positions than at non-diagonal positions, suggesting that the quality of our genome assembly was high.
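The diagonal pattern described above can be quantified by averaging contact counts at each distance from the diagonal within one pseudo-chromosome; in a well-ordered scaffold, the averages should fall as the distance grows. The sketch below illustrates this on a small hypothetical contact matrix. It is not the HiCPlotter workflow used in the study, and the matrix values and function names are invented for illustration.

```python
from typing import List


def mean_contact_by_distance(matrix: List[List[float]]) -> List[float]:
    """Mean contact count at each bin distance |i - j| in a square matrix."""
    size = len(matrix)
    means = []
    for distance in range(size):
        values = [matrix[i][i + distance] for i in range(size - distance)]
        means.append(sum(values) / len(values))
    return means


def decays_from_diagonal(matrix: List[List[float]]) -> bool:
    """True if the mean contact strictly decreases with distance from the diagonal."""
    means = mean_contact_by_distance(matrix)
    return all(near > far for near, far in zip(means, means[1:]))


# Hypothetical 4-bin contact matrix for a single pseudo-chromosome.
toy_matrix = [
    [50, 20, 8, 3],
    [20, 48, 19, 7],
    [8, 19, 52, 21],
    [3, 7, 21, 49],
]

print(mean_contact_by_distance(toy_matrix))  # [49.75, 20.0, 7.5, 3.0]
print(decays_from_diagonal(toy_matrix))      # True
```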