Background & Summary

External pressures such as fishing, climate change, habitat degradation and pollution severely threaten marine biodiversity and ecosystem stability1, leading to a widespread decline in global fishery resources2,3. As one of the most active fishing areas in China, the Yellow Sea is currently experiencing significant declines in fisheries resources due to anthropogenic activities4. The Yellow Sea ecosystem has shown rapid responses to these pressures, particularly a continual change in fish community structure with frequent replacement of dominant species5. However, the Tanaka’s snailfish (Liparis tanakae) has been a dominant species since the 1980s, flourishing for more than 40 years with a relatively stable population size4.

In addition to its long-term dominance, the Tanaka’s snailfish is one of the top predators in the Yellow Sea, playing an important ecological role in regulating the biomass of other species through a top-down effect6. The Tanaka’s snailfish is therefore of high ecological importance, contributing to the stability of the Yellow Sea ecosystem. Moreover, the Tanaka’s snailfish belongs to the family Liparidae, which includes species that survive in extreme hadal environments7,8,9. Because the Tanaka’s snailfish inhabits muddy bottom regions at depths of 50–90 m5, comparative analyses between the Tanaka’s snailfish and hadal snailfish species can help to decipher the evolutionary mechanisms by which vertebrates adapt to hadal environments9, which is of great importance for exploring the survival strategies of organisms in extreme environments.

Previous studies have generally focused on ecological aspects of the Tanaka’s snailfish4,5,6,10,11,12, whereas limited genetic and genomic resources have largely constrained evolutionary investigations of this species. For example, the molecular mechanisms underlying the long-term dominance of the Tanaka’s snailfish are still unknown, mainly due to the limited availability of genomic data. In addition, genetic information such as the degree of genetic diversity, evolutionary history and population genetic differentiation, which could provide valuable reference information for fishery resource management and conservation of the Tanaka’s snailfish, is poorly understood. The development of sequencing techniques and genome-scale analytical approaches has facilitated genomic studies of marine fish species13,14, including phylogenomics15, population genomics16, evolutionary genomics17 and conservation genomics18, among others. To date, two genome assemblies of the Tanaka’s snailfish have been deposited in the NCBI Genome database: one chromosome-level assembly (GenBank accession no. GCA_036178185.1) and one scaffold-level assembly (GCA_006348945.1). Although the chromosome-level assembly was chosen as the reference genome, it consists of 926 scaffolds, including 24 chromosomes and 902 unplaced scaffolds. Furthermore, the reference genome sequence was assembled using only Oxford Nanopore sequencing data, without gap closing or genome polishing based on short-read sequencing data, which to some extent limits the continuity of the assembly.

In this study, we assembled a chromosome-scale genome sequence of the Tanaka’s snailfish using Illumina short reads, PacBio HiFi long reads and Hi-C data (Table 1). The initial genome assembly had a total length of 574.97 Mb with 1,626 contigs and a contig N50 of 1.35 Mb (Table 2). After Hi-C scaffolding, 97.87% of the initial assembled sequences were anchored to 24 pseudo-chromosomes (Fig. 1), and the total length of the final genome assembly was 574.44 Mb, with 126 scaffolds and a scaffold N50 of 24.64 Mb (Table 2). Our assembly was 20.18 Mb longer than the NCBI reference genome, with a higher scaffold N50 (24.64 vs. 23.04 Mb), fewer scaffolds (126 vs. 926) and a higher chromosome anchoring rate (Table 3), indicating relatively high assembly quality. Higher assembly completeness, continuity and integrity were also observed in comparison with the scaffold-level assembly GCA_006348945.1 (Table 3). In our assembled sequence, a total of 162.47 Mb of repetitive sequences were annotated, representing 28.28% of the genome assembly (Table 2). The repetitive sequences (Table 4) were dominated by DNA transposons (46.32 Mb, 8.06%), long interspersed elements (LINEs, 28.93 Mb, 5.04%) and long terminal repeats (LTRs, 12.03 Mb, 2.09%). In addition, by combining ab initio, homology-based and RNA-seq assisted gene prediction approaches, a total of 20,933 protein-coding genes were predicted, among which 20,376 (97.11%) were functionally annotated (Table 2, Fig. 2). A total of 46,587 non-coding RNA (ncRNA) genes were predicted, including 1,583 miRNAs, 32,466 tRNAs, 10,955 rRNAs and 1,583 snRNAs (Table 5). The assembled genome sequence and associated annotation information provide valuable resources for elucidating the genetic adaptation and underlying molecular basis of the long-term dominance of the Tanaka’s snailfish. These genomic data can also be used in future comparative genomics studies to investigate the genomic evolution and phylogeny of snailfishes.

Table 1 Sequencing data for the Tanaka’s snailfish genome assembly.
Table 2 Assembly and annotation statistics of the Tanaka’s snailfish genome.
Fig. 1 The Hi-C contact map of the Tanaka’s snailfish genome assembly in this study. Chr 1–24 represent the 24 pseudo-chromosomes. The color bar shows the contact density from white (low) to black (high).

Table 3 Comparison of assembly statistics of three Tanaka’s snailfish genome sequences.
Table 4 Statistics of repetitive sequences in the Tanaka’s snailfish genome assembly.
Fig. 2 Venn diagram of functional annotation of the Tanaka’s snailfish genome assembly in this study.

Table 5 Classification of ncRNA genes in the Tanaka’s snailfish genome assembly in this study.

Methods

Sample collection and sequencing

An adult female Tanaka’s snailfish was sampled from the Yellow Sea (123°10′E, 38°33′N) in May 2023. Muscle tissue below the dorsal fin was taken and stored in liquid nitrogen until DNA extraction. Genomic DNA was isolated using the cetyltrimethylammonium bromide (CTAB) method. High-quality DNA was used for library preparation and high-throughput sequencing.

Illumina short-insert (350 bp) libraries were prepared according to the protocol and sequenced in paired-end mode (PE150) on the Illumina NovaSeq 6000 platform (Illumina, Inc., San Diego, CA, USA). HiFi long-read sequencing was performed using the PacBio Sequel II sequencer (Pacific Biosciences, Menlo Park, CA, USA). For Hi-C sequencing, fresh muscle was fixed with formaldehyde at a concentration of 1%, and the fixation was terminated using 0.2 M glycine. A Hi-C library was prepared following the Hi-C library protocol19 and then sequenced on an Illumina NovaSeq 6000 platform. We also constructed four RNA-seq libraries to facilitate the prediction of protein-coding genes. The RNA-seq libraries were then sequenced on an Illumina sequencing platform.

Genome assembly

A total of 26.95 Gb of PacBio HiFi long-read data (Table 1) were used for de novo genome assembly using Hifiasm20 with default parameters. Genome polishing was performed using BWA v0.7.1021 and Pilon v1.2322 with Illumina short reads (43.00 Gb of clean data, Table 1). This resulted in a 574.97 Mb assembly with 1,626 contigs and a contig N50 of 1.35 Mb (Table 2). The draft genome contigs were then anchored and oriented into a chromosome-scale assembly using the Hi-C data. A total of 49.00 Gb of clean Hi-C data (Table 1) were aligned to the draft genome assembly using BWA. Duplicate removal, sorting and quality control were performed using HiC-Pro v2.8.023. Only uniquely mapped valid read pairs were used for further analysis. LACHESIS24 was then used to cluster, order and orient the contigs into a chromosome-scale assembly. Finally, 97.87% of the initial assembled sequences were anchored to 24 pseudo-chromosomes (Fig. 1), and the total length of the genome assembly was 574.44 Mb, with 126 scaffolds and a scaffold N50 of 24.64 Mb (Table 2).
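The contig and scaffold N50 values reported above follow the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The following is a minimal illustrative sketch of that calculation, not the script used in this study; the function name `n50` and the toy lengths are assumptions made for the example.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running sum reaches at least half of the total length; the length
    of the sequence that crosses this threshold is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for non-empty positive input


# Toy lengths (hypothetical, not from the snailfish assembly).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

In practice the lengths would be read from the assembly FASTA index or a similar per-sequence length table.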

Repetitive sequence annotation

A combined strategy based on homology alignment and de novo search was applied in our repeat annotation pipeline. A de novo repetitive element database was built using LTR_FINDER25, RepeatScout26 and RepeatModeler (www.repeatmasker.org/RepeatModeler.html) with default parameters. Tandem repeats were extracted ab initio using TRF v4.0927. All repeat sequences with lengths >100 bp and a gap (‘N’) content of less than 5% constituted the raw transposable element (TE) library. For homology-based prediction, the genome was searched against the Repbase28 database using RepeatMasker v3.3.029 and its in-house script RepeatProteinMask (v3.2.2) with default parameters. The combination of Repbase and our de novo TE library was processed with uclust30 to yield a non-redundant library, and RepeatMasker was then used to identify DNA-level repeats. The results of repetitive sequence annotation are listed in Table 4.
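The filtering rule for the raw TE library (length >100 bp and ‘N’ content below 5%) can be expressed compactly. The sketch below is only an illustration of that rule under the thresholds stated above; the function name `keep_repeat` and the example sequences are hypothetical and do not come from the pipeline itself.

```python
def keep_repeat(seq: str, min_len: int = 100, max_n_frac: float = 0.05) -> bool:
    """Return True if a candidate repeat passes the raw TE library filter.

    A sequence is kept when it is strictly longer than `min_len` bp and
    the fraction of gap characters ('N', case-insensitive) is strictly
    below `max_n_frac`.
    """
    if len(seq) <= min_len:
        return False
    n_fraction = seq.upper().count("N") / len(seq)
    return n_fraction < max_n_frac


# Hypothetical candidates: name -> sequence.
candidates = {
    "rep_ok": "ACGT" * 30,            # 120 bp, no gaps
    "rep_short": "ACGT" * 25,         # exactly 100 bp, too short
    "rep_gappy": "A" * 190 + "N" * 10,  # 200 bp, 5% N, not below threshold
}
raw_library = {name: s for name, s in candidates.items() if keep_repeat(s)}
print(sorted(raw_library))
```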

Protein-coding gene prediction and annotation

We employed ab initio, homology-based and RNA-seq assisted prediction to detect protein-coding genes. For homology-based prediction, protein sequences of Gasterosteus aculeatus, Oryzias sinensis, Gadus morhua, Danio rerio and Takifugu rubripes were downloaded from the Ensembl database31. The protein sequences were aligned against the genome assembly using TBLASTN v2.2.2632 (E-value ≤ 1e-5), and the matching proteins were then aligned to the homologous genome sequences for accurate spliced alignments with GeneWise v2.4.133. The ab initio prediction was performed using Augustus v3.2.334, GeneID v1.435, Genscan v1.036, GlimmerHMM v3.0437 and SNAP v2013-11-2938 based on the repeat-masked genome sequences. RNA-seq data were mapped to the genome using HISAT2 v2.1.039. Transcript structures were predicted using StringTie v1.3.340, and candidate coding regions were predicted using TransDecoder v5.5.0 (https://github.com/TransDecoder/TransDecoder). Finally, genes predicted by the above three methods were merged into a non-redundant reference gene set using EvidenceModeler v1.1.141 with identical weights, yielding a total of 20,933 protein-coding genes (Table 2).
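The E-value cutoff applied to the TBLASTN alignments can be illustrated with a small filter over tabular BLAST output. This sketch assumes the standard 12-column tabular layout, in which the E-value is the 11th column; the function name `filter_hits` and the hit records shown are hypothetical and are not results from this study.

```python
from typing import Iterable, List


def filter_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> List[List[str]]:
    """Keep tabular BLAST hits whose E-value is at most `max_evalue`.

    Assumes the standard 12-column tabular layout: query, subject,
    % identity, alignment length, mismatches, gap opens, query start,
    query end, subject start, subject end, E-value, bit score.
    Blank lines and comment lines starting with '#' are skipped.
    """
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:
            kept.append(fields)
    return kept


# Hypothetical hits (identifiers and values invented for illustration).
example = [
    "# comment line",
    "protA\tchr1\t85.0\t200\t30\t0\t1\t200\t1000\t1600\t1e-30\t250",
    "protB\tchr2\t40.0\t60\t36\t0\t5\t64\t500\t680\t2e-04\t38",
    "protC\tchr3\t70.0\t150\t45\t0\t1\t150\t9000\t9450\t1e-05\t120",
]
for hit in filter_hits(example):
    print(hit[0], hit[1], hit[10])
```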

Protein-coding genes were annotated by aligning the gene sequences to the SwissProt, NT, NR, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases using BLAST+ v2.2.2842 with an E-value threshold of 1e-5. InterProScan v5.3143 was used to predict protein function based on conserved domains and motifs by searching against ProDom, PRINTS, Pfam, SMART, PANTHER and PROSITE. Ultimately, 20,376 (97.11%) predicted genes were successfully annotated (Table 2, Fig. 2).

For non-coding RNA (ncRNA) annotation, Infernal44 (v1.1.4) was used with the Rfam database (https://rfam.org/). Four types of ncRNA were identified in the Tanaka’s snailfish genome (46,587 genes in total), including 1,583 miRNAs, 32,466 tRNAs, 10,955 rRNAs and 1,583 snRNAs (Table 5).

Data Records

The sequencing datasets and genome assembly were deposited in public repositories. The raw sequencing data, including Illumina, PacBio, Hi-C and RNA-seq data, were submitted to the National Center for Biotechnology Information (NCBI) SRA database under BioProject accession number PRJNA123158045. The assembled genome has been deposited at GenBank under accession JBMEBB000000000.146, and the associated genome annotation results are stored in the Figshare database47.

Technical Validation

Evaluation of the quality of genomic DNA and RNA

After DNA extraction, DNA quality was assessed using agarose gel electrophoresis (1%) and pulsed-field gel electrophoresis (1%), and DNA concentration was measured using a Qubit 3.0 fluorometer (Thermo Fisher Scientific, Inc., Carlsbad, CA, USA). For RNA, integrity and quantity were evaluated using the Agilent 2100 Bioanalyzer (Agilent, USA). Subsequently, high-quality DNA and RNA were used for library preparation and high-throughput sequencing.

Evaluation of the completeness of genome assembly

The completeness of the assembled genome sequence was evaluated using BUSCO v3.0.148. The BUSCO analysis against the vertebrata_odb10 database found that 97.3% of the conserved single-copy orthologous genes, comprising 95.7% complete and 1.6% fragmented genes, were present in the genome assembly (Table 2). To evaluate the quality of the initial genome assembly, a total of 43.00 Gb of Illumina short reads from the same individual were mapped to it using BWA v0.7.10; the mapping rate and coverage were 98.10% and 99.66%, respectively (Table 2), showing high consistency of our assembly. Additionally, using Merqury k-mer analysis49 based on short reads, the quality value (QV) of our assembly was estimated as 36.98 (Table 2), corresponding to a base accuracy of >99.9% and indicating high assembly accuracy.
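The relationship between the QV score and base accuracy follows the Phred scale, QV = -10·log10(error rate). The short sketch below illustrates the conversion for the reported QV of 36.98; the function names are chosen for this example only.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value to per-base accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error rate to a Phred-scaled quality value."""
    return -10.0 * math.log10(error_rate)


# QV reported for the assembly in Table 2.
accuracy = qv_to_accuracy(36.98)
print(f"base accuracy: {accuracy:.4%}")
```

A QV of 30 corresponds to an accuracy of 99.9%, so any QV above 30, including the value reported here, implies an accuracy above that threshold.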