Background & Summary

The Small snakehead (Channa asiatica, Channidae) is a fish of significant culinary and ornamental value, favoured by both aquaculture farmers and fish enthusiasts1,2. The species is primarily distributed across Southeast Asia and the regions south of the Yangtze River in China. Its body surface is striking, marked by distinctive stripes and silver-white spots3. The Small snakehead is known for its strong vitality, fast growth rate, and ability to reach market size within the first year of cultivation, and farmers and consumers value it for its abundant meat, few bones, and delicious taste4. Because of its small size and its colourful stripes and white spots, it is also highly ornamental and popular among native fish enthusiasts, who keep it in aquariums. The Small snakehead has strong environmental adaptability and can breathe air through its suprabranchial organ, which facilitates high-density intensive cultivation and live fish transportation5.

Previous research on the Small snakehead has primarily focused on mitochondrial sequences, colour variation, ecology, physiology, and toxicology6,7,8. The complete mitochondrial DNA sequence of the Small snakehead has been determined, revealing a mitochondrial genome of 16,550 base pairs that provides essential genetic information for molecular identification6. Whole-genome resequencing has also identified a nonsense mutation in the csf1ra gene associated with the white phenotype, which affects pigmentation and sheds light on the genetic basis of albinism in this species3. Recent studies have explored the molecular mechanisms and environmental adaptability of the Small snakehead, particularly its responses to hexavalent chromium (Cr6+)7. These findings offer insights for reproduction and molecular breeding programs in Small snakehead cultivation. Despite the ecological and economic significance of the species, its genomic data remain limited. To date, chromosome-level genomes have been sequenced for several species within the genus Channa, including C. argus and C. maculata9,10. Additionally, the mitochondrial genomes of C. siamensis, C. burmanica, and C. aurantimaculata have been determined, providing further insights into the genetic diversity of the genus11,12.

In this study, we used PacBio HiFi long-read sequencing, Illumina short-read sequencing, and Hi-C technologies to generate a high-quality chromosome-level genome of C. asiatica. This reference genome is expected to advance population genetics and facilitate the identification of functional genes linked to key economic traits in the Small snakehead. It also provides a solid foundation for molecular breeding and gene editing applications in this species.

Methods

Sample collection and DNA extraction

A mature male C. asiatica specimen was collected from the Pearl River (Guangzhou, China). Muscle tissue from this specimen was used to extract DNA for whole-genome sequencing, which included Illumina short-read sequencing, PacBio HiFi long-read sequencing, and Hi-C sequencing. Genomic DNA was isolated from muscle tissue with the Qiagen DNeasy Blood and Tissue Kit (Qiagen, USA) in accordance with the manufacturer’s instructions. DNA quality was evaluated by 1% agarose gel electrophoresis, and concentrations were measured with a NanoDrop One spectrophotometer (Thermo Scientific, USA). All procedures strictly followed the guidelines approved by the Ethics Committee of the Pearl River Fisheries Research Institute, Chinese Academy of Fishery Sciences (Approval No. PRFRI-2024-012).

Genome sequencing

For short-read sequencing, a 350 bp paired-end library was constructed using the Illumina TruSeq DNA PCR-Free Kit and sequenced on an Illumina NovaSeq 6000 platform (Illumina, CA, USA), yielding 56.47 Gb (84.27x) of paired-end raw sequence data (Table 1). For long-read sequencing, a library was prepared with the SMRTbell Express Template Prep Kit 2.0 and sequenced on the PacBio Sequel II system (Pacific Biosciences, USA). A total of 38.74 Gb of high-fidelity (HiFi) reads with an average length of 15.96 kb were generated (Table 1). Hi-C library construction involved dissection of approximately 1 g of muscle tissue, followed by in situ chromatin proximity ligation using the DpnII restriction enzyme according to the manufacturer’s protocol (Arima Genomics, USA). The resulting Hi-C library was sequenced on the Illumina NovaSeq 6000 platform, producing 101.46 Gb (149.54x) of raw Hi-C reads (Table 1).

Table 1 Statistics of sequencing data.

RNA extraction and transcriptome sequencing

Total RNA was isolated from nine tissues (muscle, liver, spleen, kidney, intestine, heart, brain, swim bladder, testis) using TRIzol reagent. RNA integrity was verified with an Agilent 2100 Bioanalyzer, and RNA was quantified using a NanoDrop 2000 spectrophotometer. Equal quantities of high-quality RNA from each tissue were pooled to construct a strand-specific cDNA library using the TruSeq RNA Library Prep Kit v2 (Illumina, CA, USA). The library was sequenced on an Illumina NovaSeq 6000 platform (Illumina, CA, USA), yielding 22.64 Gb of transcriptomic data for genome annotation (Table 1).

Genome size and heterozygosity estimation

To estimate the genome size of C. asiatica, a k-mer analysis was conducted using the Illumina clean reads. First, Jellyfish (v2.3.0)13 was used to count 17-mers and generate the k-mer frequency table. GenomeScope (v2.0)14 was then used to analyze this table, yielding an estimated genome size of 622,512,009 bp, a heterozygosity rate of 0.379%, and 65% unique sequences (Fig. 1).

Fig. 1 The 17-mer frequency distribution of the Channa asiatica genome.
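The idea behind a k-mer based genome size estimate can be sketched in a few lines. The example below is a simplified peak-based approximation (total k-mers divided by the depth of the main peak), not the mixture-model fit that GenomeScope performs; the function name, the `min_depth` error cut-off, and the toy histogram are all assumptions made for illustration.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Rough genome size estimate from a k-mer histogram.

    histogram maps a k-mer depth to the number of distinct k-mers seen
    at that depth. Depths below min_depth are treated as sequencing
    errors and ignored (an assumed, illustrative cut-off).
    """
    usable = {depth: count for depth, count in histogram.items()
              if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The main peak approximates the k-mer coverage of the genome.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences divided by coverage approximates genome size.
    total_kmers = sum(depth * count for depth, count in usable.items())
    return total_kmers // peak_depth, peak_depth


# Toy histogram (hypothetical values, not data from this study).
toy_histogram = {1: 1000, 2: 200, 9: 50, 10: 100, 11: 50}
size, peak = estimate_genome_size(toy_histogram)
print(size, peak)
```

In practice the histogram would come from the Jellyfish output, and the model-based estimate is preferable because it also accounts for heterozygosity and repeats.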

Genome assembly

The genome was assembled de novo using Hifiasm (v0.19.5)15 with default parameters, generating 60 contigs with a total length of 659.44 Mb, a maximum contig size of 47.37 Mb, and an N50 of 23.92 Mb (Table 2). For chromosome-scale scaffolding, Juicer (v1.6)16 and 3D-DNA (v201008)17 were used in combination. The contig-level assembly was first indexed with BWA (v0.7.17)18, and Juicer was used to identify restriction enzyme cutting sites. Clean Hi-C paired-end reads were then mapped to the contigs using Juicer, and 3D-DNA was applied following standard protocols to generate the initial chromosome assembly. Manual curation was performed using Juicebox (v1.11.08)19 to refine chromosome boundaries and correct scaffold misassemblies, resulting in 23 resolved chromosomes (Figs. 2, 3). The revised output from Juicebox was reprocessed through 3D-DNA for per-chromosome scaffolding. The final assembly consisted of 103 scaffolds with a maximum size of 47.37 Mb and an N50 of 29.61 Mb (Tables 2, 3).

Table 2 Summary statistics of Channa asiatica genome assembly.

Fig. 2 Hi-C contact map produced by 3D-DNA.

Fig. 3 Features of the Channa asiatica genome. From outside to inside: (a) the 23 pseudo-chromosomes, (b) GC content, (c) gene density, (d) repeat content, (e) LTR content, (f) LINE content and (g) DNA-TE content.

Table 3 Pseudo-chromosome length statistics after Hi-C assisted assembly.
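For readers less familiar with the N50 statistic reported for the contigs and scaffolds, the following minimal sketch shows how it is commonly computed: the sequences are sorted from longest to shortest, and N50 is the length of the sequence at which the running total first reaches half of the assembly size. The function name and the toy lengths are illustrative assumptions and are not taken from the assembly itself.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # First sequence at which the cumulative length covers half the assembly.
        if running >= half_total:
            return length


# Toy example with hypothetical contig lengths (in Mb).
toy_lengths = [10, 8, 6, 4, 2]
print(n50(toy_lengths))
```

Applied to the full list of contig or scaffold lengths in a FASTA index, the same function would give the values summarised in the assembly statistics.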

Repeat annotation

Given the biological significance of tandem repeats, a genome-wide survey was performed using GMATA (v2.2.1)20 and Tandem Repeats Finder (TRF; v4.10.0)21 with default parameters. GMATA was used to detect simple sequence repeats (SSRs) with short repeat units, while TRF was used to identify all classes of tandem repeats. For dispersed repetitive sequences, MITE-hunter22 was first used to detect miniature inverted-repeat transposable elements (MITEs), generating a MITE library file (MITE.lib). The genome was then hard-masked (repeats converted to “N”), followed by de novo repeat discovery with RepeatModeler23 to construct a second library (RepMod.lib). Because RepMod.lib contained unclassified elements, these sequences were subsequently classified using TEclass (v2.4)24. A comprehensive repeat library was created by integrating MITE.lib, RepMod.lib, and Repbase25, and this combined library was used to annotate repetitive sequences across the entire genome with RepeatMasker26.

The annotation results revealed that dispersed repeats accounted for 25.29% of the genome (Table 4). Among transposable elements (TEs), DNA transposons were the most abundant class (7.01%), followed by long interspersed nuclear elements (LINEs, 4.34%), long terminal repeat (LTR) retrotransposons (2.15%), and short interspersed nuclear elements (SINEs, 0.84%). Collectively, repetitive sequences spanned 182,862,722 bp, representing 27.72% of the total genome length (Table 4).

Table 4 Repetitive sequences in the genome of Channa asiatica.
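The hard-masking step described in this section can be illustrated with a minimal sketch. It assumes repeat coordinates are given as 0-based, half-open intervals; the function names and the toy sequence are hypothetical, and real pipelines would apply the masking tool's own output rather than this helper.

```python
def hard_mask(sequence, intervals):
    """Replace every base inside the given intervals with 'N'.

    intervals are (start, end) pairs, 0-based and half-open
    (an assumed convention for this example).
    """
    masked = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence bounds so the length never changes.
        start = max(0, start)
        end = min(len(masked), end)
        for position in range(start, end):
            masked[position] = "N"
    return "".join(masked)


def masked_fraction(sequence):
    """Fraction of the sequence that is hard-masked."""
    return sequence.count("N") / len(sequence) if sequence else 0.0


# Toy sequence with two hypothetical repeat intervals.
toy = hard_mask("ACGTACGTAC", [(2, 5), (8, 10)])
print(toy, masked_fraction(toy))
```

Hard-masking discards the underlying bases, which is why it suits the de novo discovery step here: already-identified elements no longer contribute to the next round of repeat detection.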

Gene prediction and function annotation

Gene annotation was performed using a pipeline that integrated three types of evidence: transcriptomic evidence, homologous protein evidence, and ab initio predictions. Transcriptomic evidence was obtained by aligning Illumina RNA-seq reads to the genome assembly with HISAT2 (v2.2.1)27, followed by transcript assembly using StringTie (v2.2.3)28. Putative coding sequences (CDS) were identified with TransDecoder (v5.7.1)29 using default parameters. Homology-based annotation was carried out using protein sequences from five reference species, with protein-to-genome alignments conducted with miniprot30. For ab initio prediction, BRAKER (v2.1.5) was used to run Augustus (v3.5.0) and GeneMark-ETP (v1)31 with reference proteins from the OrthoDB v12 database32. The three evidence streams were consolidated using EVidenceModeler (v2.1.0), yielding 26,603 high-confidence protein-coding genes with an average gene length of 20,822.68 bp, an average coding sequence length of 1,675.78 bp, and an average of 10.08 exons per gene (Table 5).

Table 5 Gene structures and function annotation.

Functional annotation of the predicted protein sequences was performed using Diamond (v2.1.10)33 against the SwissProt34, KEGG35, KOG36, GO37 and NR38 databases with an e-value cut-off of 1e-5. A total of 25,133 genes were annotated in at least one database, accounting for 94.47% of all predicted genes (Fig. 4 and Table 5).

Fig. 4 Venn diagram of function annotations from various databases.
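The e-value filtering and the "annotated in at least one database" count can be sketched as follows. The example assumes hits in the standard 12-column BLAST tabular layout, in which the e-value is the eleventh field; the gene and subject identifiers and the hit lines are hypothetical, and only the cut-off (1e-5) and the two gene counts come from the text.

```python
E_VALUE_CUTOFF = 1e-5  # cut-off stated in the text


def annotated_genes(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Collect query IDs with at least one hit at or below the cut-off.

    Assumes 12-column BLAST tabular lines: the query ID is field 1
    and the e-value is field 11.
    """
    genes = set()
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or empty lines
        if float(fields[10]) <= cutoff:
            genes.add(fields[0])
    return genes


# Hypothetical hits against two databases.
swissprot_hits = [
    "gene1\tP1\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200",
    "gene2\tP2\t30.0\t50\t30\t2\t1\t50\t1\t50\t0.01\t30",
]
nr_hits = [
    "gene2\tN1\t80.0\t100\t10\t0\t1\t100\t1\t100\t1e-20\t150",
    "gene3\tN2\t25.0\t40\t28\t1\t1\t40\t1\t40\t1e-3\t25",
]

# A gene counts as annotated if it passes the cut-off in any database.
annotated = annotated_genes(swissprot_hits) | annotated_genes(nr_hits)
print(sorted(annotated))

# Percentage reported in the text: 25,133 annotated of 26,603 predicted genes.
reported_percent = round(100 * 25133 / 26603, 2)
print(reported_percent)
```

Taking the union across databases corresponds to the outer boundary of the Venn diagram, while the per-database sets correspond to its individual regions.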

Genome synteny analysis

To assess whole-genome synteny, the chromosome-level genomes of four species (Oryzias latipes, Anabas testudineus, Channa argus and Channa maculata) were compared with the genome assembly of C. asiatica using MCscan (v0.8)39, and syntenic relationships were plotted using JCVI (v1.1.12)40. The analysis revealed extensive chromosomal collinearity between C. asiatica and the four other bony fish species, although numerous chromosomal rearrangements were also observed (Fig. 5).

Fig. 5 Synteny analysis of the chromosomes of Channa asiatica and four other fish species. (a) Channa asiatica vs Oryzias latipes and Anabas testudineus, (b) Channa asiatica vs Channa argus and Channa maculata.

Data Records

The raw sequencing reads of all libraries have been deposited in the NCBI SRA database41 under accession number PRJNA1139011. The assembled genome has been deposited in GenBank42 under accession number GCA_041146785.1. The genome annotations, predicted coding sequences and protein sequences are available at Figshare43.

Technical Validation

Assessment of genome assembly

The completeness of the Small snakehead genome assembly was assessed with BUSCO (v5.4.3)44 using the ray-finned fish lineage dataset ‘actinopterygii_odb10’. Overall, 98.93% of the BUSCO genes were complete: 98.24% were complete and single-copy and 0.69% were complete and duplicated, while 0.35% were fragmented and 0.71% were missing (Table 6). To further evaluate assembly quality, we calculated the mapping rates of both the PacBio HiFi and Illumina reads against the assembled genome. Illumina short reads were aligned using BWA (v0.7.17-r1188), while HiFi reads were mapped with Minimap2 (v2.24). A total of 99.58% of Illumina reads and 99.75% of PacBio HiFi reads aligned to the assembly, indicating high assembly accuracy and completeness.

Table 6 BUSCO analysis of the genome assembly.
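As a quick consistency check on the assembly BUSCO figures quoted above, the categories can be summed: complete single-copy plus complete duplicated should give the overall completeness, and all four categories should total close to 100%, allowing for rounding. The helper name below is an assumption made for illustration; the percentages are those stated in the text.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Return (overall complete %, sum of all categories %)."""
    complete = round(single + duplicated, 2)
    total = round(complete + fragmented + missing, 2)
    return complete, total


# Assembly-level percentages reported in the text.
complete, total = busco_summary(single=98.24, duplicated=0.69,
                                fragmented=0.35, missing=0.71)
print(complete, total)

# Each value is rounded to two decimals, so the total may differ
# from 100 by a few hundredths of a percent.
assert abs(total - 100.0) <= 0.05
```

The same check can be applied to any BUSCO summary before it is tabulated, since a total far from 100% usually signals a transcription error.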

Gene annotation validation

To evaluate the completeness of the annotated gene set, we conducted a BUSCO44 analysis using the conserved single-copy orthologues in the ‘actinopterygii_odb10’ dataset. Approximately 99.29% of the complete gene elements were present in the annotated gene set, indicating a high level of completeness in the conserved gene predictions. Specifically, 98.35% of the genes were complete and single-copy BUSCOs, with only 0.93% fragmented and 0.63% missing from the annotated gene set (Table 7). These findings indicate that the conserved gene content of the Small snakehead genome is well captured by the annotation, supporting confidence in the predicted gene models.

Table 7 BUSCO analysis of the genome annotation.