Background & Summary

The whitespotted conger (Conger myriaster) is a member of the order Anguilliformes, family Congridae, and is widely distributed in the East China Sea, Yellow Sea, and Bohai Bay in China, from southern Hokkaido to northern Okinawa in Japan, and around the Korean Peninsula1. Despite its ecological and economic importance, knowledge of its life history is limited. This is exemplified by its complex leptocephalus larval stage, which exhibits unique ecological traits distinct from those of juveniles and adults and utilizes major ocean currents, such as the North Equatorial Current and the Kuroshio Current, for dispersal, a strategy also observed in eels such as Anguilla japonica and Anguilla marmorata1,2,3. Moreover, successful artificial breeding has yet to be achieved4. At present, C. myriaster remains entirely reliant on wild catches for consumption, with limited genomic resources available to support conservation and breeding efforts5.

C. myriaster faces significant threats from overfishing and marine pollution, which have led to a steady decline in its population6. The need for improved management and conservation measures in primary fishing grounds has become increasingly evident, driven by ecological pressures and the imperative to ensure sustainable fisheries7. While the Yellow Sea Fisheries Research Institute has made progress in artificial breeding and aquaculture research8,9, immature individuals from offshore fisheries present significant challenges for further studies. Underlying many of these challenges is the lack of published genomic resources for C. myriaster.

In this study, we present the first high-quality chromosome-level genome assembly of C. myriaster, produced with an integrated strategy that combines Whole Genome Sequencing (WGS), 10X Genomics, PacBio Continuous Long Reads (CLR), and Hi-C data. The assembled genome has a total size of 1.09 Gb, with 97.49% of the sequences anchored to 19 chromosomes, consistent with the number previously identified through karyotype analysis10. The assembly achieved N50 lengths of 16.76 Mb for contigs and 58.40 Mb for scaffolds, and 34.80% of the genome was annotated as repetitive sequences. A total of 24,063 protein-coding genes were predicted, 99.80% of which were functionally annotated. This high-quality genome assembly provides a robust foundation for developing molecular markers and advancing conservation and aquaculture efforts for C. myriaster. Furthermore, it serves as a critical resource for investigating the evolutionary dynamics and phylogenetics of eel species.

Methods

Sample collection and sequencing

The present study was dedicated to the genome sequencing of an adult female C. myriaster. The specimen was sourced from a local aquaculture facility (Haiyang Yellow Sea Fisheries Co., Ltd.), with stringent selection criteria to ensure optimal health status. Prior to sampling, the fish was subjected to a mild anesthetic protocol to alleviate stress and ensure humane handling.

For CLR data, DNA from blood tissue was sequenced on the PacBio Sequel II platform in CLR mode. This mode produces long reads, which are essential for accurate genome assembly and for resolving complex genomic regions. The sequencing run generated 104.76 Gb of raw data, providing approximately 95-fold depth of the fish genome (Table 1). This high depth ensures robust representation of the genome and facilitates the detection of low-frequency variants and structural variations.

Table 1 Statistics of the sequencing data.

For Hi-C data, the library was prepared from blood tissue by first crosslinking cells with formaldehyde. DNA was then digested using the MboI restriction enzyme, followed by end-filling and biotin labeling. The resulting blunt-end fragments were ligated, purified, and sheared into 300–500 bp fragments. Quality control was conducted using Qubit 2.0, an Agilent 2100 instrument (Agilent Technologies, CA, USA), and q-PCR. Finally, 150 bp paired-end sequencing was performed on the Illumina platform, yielding 117.70 Gb of Hi-C data, which provided approximately 107-fold depth of the fish genome (Table 1).

For WGS data, genomic DNA extracted from muscle tissue was fragmented to approximately 350 bp using a Covaris E220 instrument (Covaris Inc., USA). The fragmented DNA underwent 3′ end-repair, adaptor ligation, and amplification via ligation-mediated polymerase chain reaction (LM-PCR). Single-stranded DNA molecules were separated and circularized, followed by rolling-circle amplification (RCA) to generate DNA nanoballs (DNBs). These DNBs were loaded onto patterned nanoarrays of the BGI-Seq 500 platform and sequenced using PE100 + 10 chemistry, yielding 100 bp paired-end reads with 10 bp dual-index barcodes for sample demultiplexing. A total of 39.46 Gb of short-read data was generated, achieving approximately 36-fold depth of the fish genome (Table 1).

For 10X Genomics data, DNA from muscle tissue was processed using the 10X Genomics Chromium platform to generate a library with long-range genomic information. The library was then sequenced on the BGI-Seq 500 platform, yielding a total of 203.73 Gb of data and achieving approximately 186-fold depth of the fish genome (Table 1).

For full-length transcript sequencing, RNA was extracted from a mixed tissue sample (eye, intestine, spleen, kidney, testis, ovary, pituitary, liver, muscle, brain, skin, gill, heart, and stomach), and qualified RNA samples underwent reverse transcription, end repair, DNA fragmentation, adapter ligation, and amplification to construct the library. Sequencing was performed on the PacBio Sequel IIe platform, yielding 32.76 Gb of Circular Consensus Sequencing (CCS) data (Table 1), thereby facilitating precise genome annotation by resolving complex gene structures.
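The fold-depth values quoted above for the four DNA libraries follow from dividing each library's yield by the genome size. The short sketch below reproduces them under three assumptions that the text does not state explicitly: that 1 Gb means 10^9 bases, that the denominator is the 1,093,778,583 bp assembly length given in the Fig. 2 legend, and that the values are truncated rather than rounded. The function name fold_depth is illustrative only.

```python
import math

# Assumption: the final assembly length (bp) from the Fig. 2 legend is the denominator.
ASSEMBLY_BP = 1_093_778_583

# Yields in Gb as reported in the text; 1 Gb is assumed to be 1e9 bases.
YIELD_GB = {
    "PacBio CLR": 104.76,
    "Hi-C": 117.70,
    "WGS": 39.46,
    "10X Genomics": 203.73,
}


def fold_depth(yield_gb, genome_bp=ASSEMBLY_BP):
    """Return sequencing depth as total sequenced bases divided by genome size."""
    return yield_gb * 1e9 / genome_bp


depths = {name: fold_depth(gb) for name, gb in YIELD_GB.items()}
for name, depth in depths.items():
    # Truncation is an assumption; it matches the approximate values in the text.
    print(f"{name}: {depth:.2f}x (about {math.floor(depth)}-fold)")
```

With truncation, the four libraries come out at 95-, 107-, 36-, and 186-fold, matching the approximate depths reported above.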

Genome estimate and assembly

Genome size was estimated from the 10X Genomics reads. The 19-mer frequency distribution was counted with JELLYFISH11 (v2.2.3) and analyzed with GenomeScope12 (http://genomescope.org/) to generate the k-mer frequency distribution plot. The k-mer analysis suggested a genome of approximately 936.65 Mb, of which 259.55 Mb consisted of repeated sequences, with an estimated heterozygosity rate of 1.04% (Fig. 1).

To assemble a high-quality genome for the whitespotted conger, we first generated a draft genome from the CLR data with NextDenovo13 (v2.5.2). The draft genome was then polished with error-corrected CLR, 10X Genomics, and WGS data using NextPolish14 (v1.4.1). Scaffolding was performed with YaHS15 (v1.2.2) using the Hi-C data, resulting in 19 chromosomes that were substantially longer than the other scaffolds, consistent with the chromosome number reported in prior karyotype analyses10 (Fig. 2a,b and Table 2). Final manual refinement was completed with Juicebox16 (v1.91), producing a 1.09 Gb genome (Fig. 3a and Table 3) with a scaffold N50 of 58.40 Mb (Fig. 2a). This genome size aligns with the C-value recorded in the Animal Genome Size Database (https://www.genomesize.com). Synteny plots generated with MCScanX17 and SynVisio18 revealed extensive synteny between the genomes of C. myriaster and its congener Conger conger19, supporting the reliability of the assembled chromosomes (Fig. 3b).
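The idea behind a k-mer-based size estimate can be illustrated with a much simpler heuristic than the statistical model GenomeScope fits: divide the total number of k-mer occurrences by the depth of the main peak of the histogram. The sketch below is not the authors' procedure. The function name, the error_cutoff parameter, and the toy histogram are hypothetical, and heterozygosity is ignored.

```python
def estimate_genome_size(histogram, error_cutoff=5):
    """Estimate genome size (bp) from a k-mer depth histogram.

    histogram maps depth -> number of distinct k-mers observed at that depth.
    error_cutoff is a hypothetical threshold below which k-mers are treated
    as sequencing errors and ignored.
    """
    usable = {d: n for d, n in histogram.items() if d >= error_cutoff}
    if not usable:
        raise ValueError("no k-mers at or above the error cutoff")
    # The main peak is the depth shared by the most distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences divided by the peak depth approximates genome size.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (hypothetical values): an error tail at depths 1-3,
# a main peak near depth 20, and a small repeat peak at depth 40.
toy_histogram = {
    1: 50_000, 2: 8_000, 3: 1_000,
    18: 200, 19: 900, 20: 1_000, 21: 900, 22: 200,
    40: 50,
}
size_bp, peak = estimate_genome_size(toy_histogram)
print(f"peak depth {peak}, estimated size {size_bp:.0f} bp")
```

In the toy data the 50 k-mers at depth 40 are counted twice in the estimate, which is how repeated sequence contributes to the total. A model-based tool additionally separates heterozygous and homozygous peaks, which this heuristic does not attempt.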

Fig. 1

Genome size estimate for C. myriaster using a k-mer-based method.

Fig. 2

C. myriaster genome snail plot and Circos plot. (a) The BlobToolKit snail plot displays the scaffold N50 metric and BUSCO gene completeness. The main plot is divided into 1,000 size-ordered bins, with each bin representing 0.10% of the 1,093,778,583 bp assembly. The gray area shows the distribution of sequence lengths, and the radius of the plot is scaled to the longest sequence in the assembly (91,834,031 bp, highlighted in red). The orange and light orange arcs represent the scaffold N50 and N90 sequence lengths (58,402,862 bp and 39,753,706 bp, respectively). The light gray spiral displays the cumulative sequence count on a logarithmic scale, with white tick marks indicating successive orders of magnitude. The outer blue and light blue regions illustrate the distribution of GC, AT, and N percentages within the same bins as the inner plot. In the upper right corner, a summary of BUSCO genes from the actinopterygii_odb10 dataset is shown, including complete, fragmented, duplicated, and missing BUSCOs. (b) The Circos plot illustrates the GC content, gene density, repetitive sequences, and collinearity between chromosomes in the assembled genome of C. myriaster. The outermost layer displays the chromosome names and their lengths. Moving inward, the purple band represents GC content, the blue band indicates gene density, and the green band denotes repetitive sequences. The innermost part illustrates the synteny among chromosomes within C. myriaster.

Fig. 3

Hi-C heatmap and collinearity dot plot of the genome. (a) The Hi-C heatmap illustrates the chromosome interaction frequencies of the C. myriaster genome, with each blue contour representing a chromosome. (b) The dot plot displays the collinearity relationships between the genome assemblies of C. myriaster and C. conger.
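The scaffold N50 and N90 values reported for the assembly follow the usual definition: the length L such that sequences of length L or longer together cover at least 50% (or 90%) of the assembly. A minimal sketch of that calculation is given below. The function name nx_length and the toy lengths are hypothetical, and integer arithmetic is used to avoid floating-point edge cases at the threshold.

```python
def nx_length(lengths, percent=50):
    """Return the Nx length of an assembly.

    This is the length L such that sequences of length >= L together
    cover at least `percent` percent of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison: running / total >= percent / 100
        if running * 100 >= percent * total:
            return length


# Hypothetical scaffold lengths (arbitrary units), total = 300
toy_lengths = [90, 60, 50, 40, 30, 20, 10]
n50 = nx_length(toy_lengths, 50)
n90 = nx_length(toy_lengths, 90)
print(f"N50 = {n50}, N90 = {n90}")
```

In the toy set the two longest sequences reach exactly half of the total, so the N50 is the second length; five sequences are needed to reach 90%.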

Repetitive sequence annotation

To annotate the repetitive elements in the genome of C. myriaster, we combined de novo prediction with homology-based annotation. For de novo prediction, LTR retrotransposons were predicted using LTR_FINDER_parallel20 (v1.1), while de novo repetitive elements were identified with RepeatModeler21 (v2.0.6). The resulting predictions were merged into a genome-specific repeat library for C. myriaster. For the homology-based method, the RepeatProteinMask and Repbase modules of RepeatMasker22 (v4.1.7) were used to identify repeats based on homology to sequences in the RepBase database (http://www.girinst.org/repbase) and the Dfam database23 (v3.8). In total, 380.59 Mb consisted of repetitive sequences, accounting for 34.80% of the genome assembly (Fig. 2b and Table 4).
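When repeats come from more than one prediction method, the same bases are often annotated twice, so a total repeat length is usually computed on the union of the annotated intervals. The sketch below shows one common way to do this; it is not taken from the authors' pipeline. The interval coordinates are hypothetical, intervals are assumed to be half-open [start, end), and 1 Mb is assumed to be 10^6 bases. The last lines check that 380.59 Mb over the 1,093,778,583 bp assembly length from the Fig. 2 legend gives the reported percentage.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def masked_fraction(intervals, genome_bp):
    """Fraction of the genome covered by the union of the intervals."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_bp


# Hypothetical hits on a 1,000 bp toy sequence from two annotation methods.
de_novo_hits = [(100, 300), (250, 400)]
homology_hits = [(350, 450), (700, 800)]
toy_fraction = masked_fraction(de_novo_hits + homology_hits, 1000)
print(f"toy masked fraction: {toy_fraction:.2%}")

# Consistency check on the reported totals (assumes 1 Mb = 1e6 bases).
reported_pct = 380.59e6 / 1_093_778_583 * 100
print(f"reported repeat content: {reported_pct:.2f}%")
```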

Genomic structure and functional annotation

Building upon the repeat-masked genome of C. myriaster, we employed BRAKER324 (v3.0.8) with default parameters for de novo gene prediction. For homology-based annotation, we used Miniprot25 (v0.13) to align protein sequences from three species, C. conger, Anguilla anguilla, and Anguilla rostrata, to the genome. For transcript-based annotation, we aligned RNA-seq data (SRP36125626) from the National Center for Biotechnology Information (NCBI) to the genome using HISAT227 (v2.2.1) and assembled the transcriptome with StringTie28 (v2.2.3); TransDecoder29 (v5.7.1) was then used to identify open reading frames (ORFs) within the assembled transcripts and to predict potential coding regions. Full-length transcriptome data were processed with the ISOseq 330 (v4.0.0) pipeline and aligned to the genome using GMAP31 (2021-08-25) to generate further annotation evidence. Finally, we integrated the four lines of evidence (de novo, homology-based, transcript-based, and full-length transcriptome) using EvidenceModeler32 (v1.1.1). We also used the annotations from EGAPx33 (v0.3.1-alpha) and RNA-seq data as references, comparing them with the C. myriaster annotations to further refine the final gene set, which comprised 24,063 protein-coding genes.
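The ORF identification step can be illustrated with a minimal scan for the longest open reading frame in a transcript. This sketch only conveys the basic idea and is not TransDecoder's algorithm, which also scores coding potential. It searches the forward strand only; the function name longest_orf and the toy transcript are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(sequence):
    """Return the longest forward-strand ORF, from ATG through the stop codon.

    Returns an empty string if no complete ORF is found.
    """
    seq = sequence.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                # First start codon in this open stretch.
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


# Hypothetical transcript containing two ORFs in different frames.
toy_transcript = "CCATGAAATTTGGGTAACCATGCCCTAG"
orf = longest_orf(toy_transcript)
print(orf, len(orf))
```

A production tool would also scan the reverse complement, handle partial ORFs at transcript ends, and apply a minimum length, none of which is attempted here.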

We used DIAMOND34 (v2.1.6) to align the predicted protein sequences against several functional databases with an E-value threshold of 1e-5. These databases included KEGG35, Swiss-Prot36, EggNOG37, Pfam38, Kofam39, and the Non-Redundant40 (NR) database, and the alignments were used to extract potential gene functional information for subsequent statistical analysis. A total of 24,016 genes, accounting for 99.80% of the predicted protein-coding genes, were annotated by at least one of these databases (Fig. 4 and Table 5).
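A gene counts as functionally annotated here if it has a hit in at least one database, which amounts to taking the union of the per-database hit sets. The sketch below shows that bookkeeping on hypothetical gene identifiers and hit sets, then checks the reported proportion (24,016 of 24,063 genes). The function name annotated_fraction is illustrative.

```python
def annotated_fraction(hits_by_database, all_genes):
    """Count genes with a hit in at least one database.

    hits_by_database maps a database name to the set of gene IDs it annotated.
    Returns (number of annotated genes, fraction of all genes annotated).
    """
    gene_set = set(all_genes)
    # Union of all hit sets, restricted to genes in the gene set.
    annotated = set().union(*hits_by_database.values()) & gene_set
    return len(annotated), len(annotated) / len(gene_set)


# Hypothetical gene IDs and database hits.
genes = [f"gene{i}" for i in range(1, 11)]
hits = {
    "KEGG": {"gene1", "gene2"},
    "Pfam": {"gene2", "gene3", "gene4"},
    "NR": {f"gene{i}" for i in range(1, 10)},
}
n_annotated, fraction = annotated_fraction(hits, genes)
print(f"{n_annotated} of {len(genes)} genes annotated ({fraction:.0%})")

# Consistency check on the reported figures.
reported_pct = 24_016 / 24_063 * 100
print(f"reported annotation rate: {reported_pct:.2f}%")
```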

Fig. 4

Upset plot of gene functional annotation using data from EggNOG, Pfam, KEGG, NR, Kofam, and SwissProt.

Data Records

The WGS (SRR32021150), PacBio CLR (SRR32021152), Hi-C (SRR32021151), 10X Genomics (SRR32021149), and RNA (SRR32021148) data used for genome assembly and annotation have been deposited in the NCBI Sequence Read Archive (SRA) under the accession number SRP55777641. The chromosome-level assembly of C. myriaster has been deposited in the National Center for Biotechnology Information (NCBI) under the accession number GCA_047653785.142. The chromosome assembly and genomic annotation results are also available in the figshare dataset at https://doi.org/10.6084/m9.figshare.2812451943.

Technical Validation

To assess the accuracy of gene annotation, we compared the distributions of gene length, exon length, CDS length, and intron length in C. myriaster with those of C. conger, A. anguilla44, and A. rostrata45. The results indicated a high degree of similarity in these distributions between C. myriaster and the three related species (Fig. 5). We used BUSCO46 (v5.3) with the Actinopterygii database (actinopterygii_odb10) to evaluate the completeness of the genome assembly. The BUSCO analysis indicated an overall completeness of 98.00%, comprising 91.80% single-copy and 6.20% duplicated genes, with 0.70% fragmented and 1.30% missing (Fig. 2a and Table 2). Using the WGS data, Merqury47 (v1.3) yielded a consensus quality value (QV) of 32.63, and read mapping with minimap248 (v2.28-r1209) yielded a read-to-contig alignment rate of 99.49%. Additionally, we used CRAQ49 (v1.0.9-alpha), which assesses assembly quality by aligning raw CLR reads to the assembled genome. The assembly quality indices for small regions (R-AQI) and large structural segments (S-AQI) were 97.70 and 100, respectively, indicating that the genome has achieved reference quality (>90).
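Two of the figures above can be read more concretely with a little arithmetic. A consensus QV is a Phred-scaled error probability, QV = -10 log10(E), so it converts directly into an expected per-base error rate, and the BUSCO categories should sum consistently. The sketch below performs both checks using the values reported in the text; the function names are illustrative.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(rate)


error_rate = qv_to_error_rate(32.63)
print(f"QV 32.63 corresponds to {error_rate:.2e} errors per base")
print(f"that is roughly one error every {1 / error_rate:,.0f} bases")

# BUSCO percentages as reported in the text.
busco = {"single": 91.80, "duplicated": 6.20, "fragmented": 0.70, "missing": 1.30}
complete = busco["single"] + busco["duplicated"]
total = sum(busco.values())
print(f"complete = {complete:.2f}%, all categories = {total:.2f}%")
```

A QV of 32.63 therefore corresponds to an error rate of about 5.5 × 10^-4 per base, and the single-copy and duplicated BUSCO fractions add up to the reported 98.00% completeness.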

Fig. 5

Comparison of gene, exon, CDS, and intron lengths between C. myriaster and three closely related species (C. conger, A. anguilla, and A. rostrata).

Table 2 Assembly statistics of C. myriaster.
Table 3 Assembly statistics of chromosomes.
Table 4 Statistics of repeat content.
Table 5 Statistics of gene annotation.