Background & Summary

The East Asian fourfinger threadfin (Eleutheronema rhadinum), a member of the family Polynemidae, was historically misidentified as E. tetradactylum because the two species are morphologically very similar1. It was not until the taxonomic revision of the genus Eleutheronema by Motomura2 that E. rhadinum was formally recognized as a distinct species, based on several diagnostic characteristics, including the coloration of the pectoral fins (dense black in E. rhadinum vs. vivid yellow in E. tetradactylum), the presence of lateral line squamation on the caudal-fin membrane, and differences in scale counts along the pored lateral line as well as above and below it (Fig. 1). E. rhadinum is highly valued for its fast growth rate, excellent flesh quality, and high nutritional content, which give it substantial economic value. However, in recent years, wild populations have been affected by overfishing, marine pollution, and revised fishing bans, leading to insufficient protection, limited utilization, and a lack of sustainable development of this species’ natural resources.

Fig. 1

The map of E. rhadinum.

At present, research on the East Asian fourfinger threadfin focuses mainly on the diversity of its habitat use3, age determination based on otolith microstructure and length4, population fluctuations5, genetic burden5, adaptive divergence5, and responses to environmental stressors6. Although artificial breeding technologies for this species have gradually advanced, its biological and genomic background remains largely unexplored.

In recent years, with rapid advances in genome sequencing technologies, genomic tools have been increasingly applied to studies of species conservation7. Whole-genome sequencing enables the comprehensive acquisition of genomic sequences and gene function information, providing critical insights into the genetic mechanisms underlying species evolution and environmental adaptation.

In this study, we performed high-fidelity (HiFi) long-read sequencing on the PacBio Sequel II platform, generating 57.34 Gb of high-quality data. The initial assembly had a total contig length of 586.87 Mb and a contig N50 of 24.2 Mb. In addition, Hi-C sequencing on the DNBSEQ platform yielded 85.39 Gb of clean reads after quality filtering, which were used for chromosome-level scaffolding. Based on these datasets, we constructed a high-quality, chromosome-level reference genome for E. rhadinum. In the final assembly, 585.27 Mb (99.73% of the assembled sequence) was anchored to 26 chromosomes, with a contig N50 of 24.22 Mb. This high-quality reference genome provides a valuable foundation for future studies on the evolutionary biology, population genetics, and molecular breeding of E. rhadinum.

Methods

Sample collection

A two-year-old male E. rhadinum was obtained from a local aquaculture farm in Zhanjiang City, Guangdong Province, China. Its body length measured 19.6 cm, total length 24.3 cm, and weight 104.2 g. Tissue samples, including heart, liver, spleen, gill, kidney, intestine, eye, brain, and muscle, were collected from this individual for genome and transcriptome sequencing. All tissues were immediately snap-frozen in liquid nitrogen and subsequently stored at −80 °C until further processing. The sampling procedures were approved by the Institutional Review Board for Bioethics and Biosafety of UBM Shenzhen (Approval No. FT18134).

Library construction and sequencing

In fish genomics research, muscle tissue is the preferred source for extracting high-quality, high-molecular-weight genomic DNA. Its main advantages are its large volume and sample homogeneity, and it effectively reduces the risk of DNA contamination from other biological sources (such as gut microbiota), thereby ensuring the purity and accuracy of the genome assembly8,9,10. Genomic DNA was extracted from muscle tissue for SMRT (Single Molecule Real-Time) sequencing, Hi-C sequencing, and downstream genomic analyses.

For SMRT sequencing, high-quality DNA was used to construct libraries with an insert size of 15–20 kb, following the standard protocol provided by Pacific Biosciences (PacBio, Menlo Park, CA, USA). Sequencing was performed on the PacBio Sequel II platform in CCS (circular consensus sequencing) mode. The raw data were filtered to obtain high-accuracy HiFi reads.

Hi-C library preparation was carried out according to previously published protocols11, with minor modifications. Briefly, muscle tissue ground in liquid nitrogen was cross-linked with formaldehyde and then digested with a restriction enzyme. The resulting DNA fragments were biotin-labeled, ligated to form chimeric junctions, and reverse cross-linked with SDS and proteinase K. The purified DNA was subsequently sheared to 300–400 bp fragments, followed by paired-end library construction and sequencing on the DNBSEQ platform.

Additionally, total RNA was extracted separately from eye, brain, liver, heart, spleen, kidney, muscle, and gill tissues using TRIzol reagent (Invitrogen). Paired-end sequencing was performed on the MGI-SEQ 2000 platform.

In total, 57.34 Gb of high-quality HiFi data, 85.39 Gb of clean Hi-C data, and 16.65 Gb of RNA-seq data were generated for genome assembly, scaffolding, and annotation (Table 1).

Table 1 Summary of sequencing data used for the E. rhadinum genome assembly.
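
As a rough illustration of what these yields mean in terms of sequencing depth, the sketch below divides each data volume by the assembly length. It assumes the 586.87 Mb contig-level assembly length reported in this paper as the genome size; the function name `fold_coverage` is introduced here for illustration only.

```python
def fold_coverage(data_gb: float, genome_mb: float) -> float:
    """Return approximate sequencing depth (x) for a yield in Gb and a genome size in Mb."""
    return data_gb * 1_000 / genome_mb


# Assumption: the contig-level assembly length stands in for the true genome size.
ASSEMBLY_MB = 586.87

# Data yields as reported in the text.
yields_gb = {"HiFi": 57.34, "Hi-C": 85.39}

for library, gb in yields_gb.items():
    print(f"{library}: {fold_coverage(gb, ASSEMBLY_MB):.1f}x")
```

Under this assumption, the HiFi data correspond to a depth of roughly 98×.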

Genome survey and assembly

A short-insert (300–400 bp) paired-end DNA library was constructed and sequenced on the DNBSEQ platform to perform a genome survey. Raw reads were quality-filtered using Fastp (v0.23.2) with default parameters12. K-mer frequency analysis was conducted using Jellyfish (v2.3.0) with a k-mer size of 17 (parameters: -m 17 -s 1000000000)13. Subsequently, the 17-mer distribution was modeled using GenomeScope14 to estimate basic genomic features. The genome size of E. rhadinum was preliminarily estimated to be approximately 564 Mb, with a peak 17-mer depth of 140. The genome was characterized by a heterozygosity rate of 0.39% and a duplication rate of 33.74% (Fig. 2).

Fig. 2

17-mer frequency distribution of the E. rhadinum genome. The x-axis represents the k-mer depth (coverage), and the y-axis shows the frequency of each k-mer at a given depth. This distribution was used to estimate genome size, heterozygosity, and repeat content.
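
The principle behind a k-mer-based genome size estimate can be sketched in a few lines: the genome size is approximately the total number of k-mers divided by the depth at the main peak. The example below is a simplified illustration, not the model fit performed by GenomeScope; the function name, the `min_depth` error cutoff, and the toy histogram are all hypothetical.

```python
def estimate_genome_size(kmer_histogram: dict[int, int], min_depth: int = 5) -> tuple[int, float]:
    """Return (peak_depth, estimated_size_bp) from a {depth: number_of_distinct_kmers} histogram.

    K-mers below min_depth are treated as sequencing errors and ignored
    (a simplifying assumption; real tools model the error peak explicitly).
    """
    usable = {d: c for d, c in kmer_histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)            # depth with the most distinct k-mers
    total_kmers = sum(d * c for d, c in usable.items())  # total k-mer occurrences
    return peak_depth, total_kmers / peak_depth


# Toy histogram: a low-depth error peak, a main peak at depth 10, and a few repeats at depth 20.
toy_histogram = {1: 1000, 9: 10, 10: 80, 11: 10, 20: 5}
peak, size = estimate_genome_size(toy_histogram)
print(peak, size)
```

Applied in reverse to the values reported here, a peak depth of 140 and an estimate of about 564 Mb imply roughly 7.9 × 10^10 usable 17-mers.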

De novo genome assembly was performed using Hifiasm (v0.19.6; default parameters)15. Subsequently, purge_haplotigs (v1.0.419; parameters: -a 70 -j 80 -d 200)16 was used to remove redundant sequences. The initial assembly yielded a total contig length of 586.87 Mb (46 contigs), with a contig N50 of 24.2 Mb (Table 2).

Table 2 Summary statistics of the preliminary genome assembly of E. rhadinum.
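
For readers unfamiliar with the N50 statistic used above, the following minimal sketch shows how it is defined: the length of the contig at which the cumulative length, counted from the longest contig downwards, first reaches half of the total assembly length. The contig lengths in the example are toy values, not data from this study.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")


# Toy example: total length 20, so N50 is the contig that takes the running sum to >= 10.
toy_contigs = [8, 5, 4, 2, 1]
print(n50(toy_contigs))
```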

To upgrade the contig-level assembly to a chromosome-level genome, Hi-C data were integrated using Juicer17 and 3D-DNA18 with default parameters. As a result, 585.27 Mb of the assembled sequences were anchored to 26 pseudo-chromosomes, achieving a high anchoring rate of 99.73%. The final assembly exhibited a scaffold N50 of 24.32 Mb and a contig N50 of 24.22 Mb, indicating high continuity and consistency between the contig and scaffold levels (Table 3). The Hi-C contact heatmap (Fig. 3) further confirmed the quality of the chromosomal assembly, showing clear interaction signals along the diagonal.
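
The reported anchoring rate can be reproduced with simple arithmetic. The sketch below assumes that the denominator is the 586.87 Mb total contig length reported for the initial assembly; the function name is introduced here for illustration.

```python
def anchoring_rate(anchored_mb: float, total_mb: float) -> float:
    """Return the percentage of assembled sequence placed on pseudo-chromosomes."""
    return anchored_mb / total_mb * 100


# Assumption: the total is the contig-level assembly length given in the text.
rate = anchoring_rate(anchored_mb=585.27, total_mb=586.87)
print(f"{rate:.2f}%")
```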

Table 3 Summary statistics of the E. rhadinum genome assembly.

Fig. 3

Hi-C interaction heatmap of the E. rhadinum genome assembly. The x- and y-axes correspond to genomic positions represented as bins (N × bin size). Color intensity ranges from yellow (low interaction frequency) to red (high interaction frequency), indicating the strength of chromatin interactions. The first 26 squares along the diagonal correspond to the 26 assembled chromosomes, followed by unanchored scaffolds.

Repeats annotation

Repeat sequences are identical or symmetrical segments within the genome that play crucial roles in gene regulatory networks, gene expression, and transcriptional regulation, while also influencing evolutionary processes, heredity, and genetic variation. De novo prediction was performed primarily using RepeatModeler (v1.0.4, default parameters)19 and LTRharvest20 to construct a species-specific repeat library, which was subsequently employed by RepeatMasker (default parameters)21 for repeat identification. In parallel, Tandem Repeats Finder (default parameters)22 was used to detect tandem repeats within the genome.

Homology-based annotation relied on the RepBase database23, where sequences homologous to known repetitive elements were identified and classified using RepeatMasker21 and RepeatProteinMask21. By integrating and de-duplicating results from these four approaches (Tandem Repeats Finder22, RepeatMasker21, RepeatProteinMask21, and de novo prediction), we found that 18.53% of the assembled E. rhadinum genome consisted of repetitive sequences (Fig. 4). Specifically, DNA transposons, LINEs, SINEs, and LTR elements comprised about 12.85%, 6.24%, 0.47%, and 3.99% of the genome, respectively (Table 4). The overall repeat content of E. rhadinum is comparable to that of Lates calcarifer (18.53%)24.
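
One common way to integrate and de-duplicate repeat calls from several tools is to take the union of the annotated intervals, so that bases reported by more than one tool are counted once. The sketch below illustrates that idea on toy coordinates; it is not the authors' pipeline, and the function name and interval values are hypothetical.

```python
def merged_length(intervals: list[tuple[int, int]]) -> int:
    """Return the number of bases covered by half-open (start, end) intervals, counting overlaps once."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# Toy repeat calls from four approaches on a single 1,000-bp sequence.
calls = {
    "TandemRepeatsFinder": [(0, 100)],
    "RepeatMasker": [(50, 200), (400, 500)],
    "RepeatProteinMask": [(450, 550)],
    "de_novo": [(180, 220)],
}
all_intervals = [iv for tool_calls in calls.values() for iv in tool_calls]
repeat_fraction = merged_length(all_intervals) / 1_000
print(f"{repeat_fraction:.1%}")
```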

Fig. 4

Circos plot of the E. rhadinum genome assembly. The tracks from outside to inside are GC content; 26 chromosome-level scaffolds; gene density; repeat density; LTR retroelement density; LINE density; DNA transposon density.

Table 4 Classification and statistics of repetitive sequences in the E. rhadinum genome.

Gene prediction and function annotation

To annotate the genes in the E. rhadinum genome, we conducted both structural gene prediction and functional annotation. Structural prediction identifies gene locations and structures, while functional annotation assigns biological roles and metabolic pathways to the predicted gene products.

Gene structure prediction was performed by integrating three complementary approaches: homology-based prediction, de novo prediction, and transcriptome-assisted prediction. For homology-based prediction, the E. rhadinum genome was aligned against the protein-coding sequences of closely related species (Dicentrarchus labrax, Larimichthys crocea, Lateolabrax maculatus, Lates calcarifer, Oreochromis niloticus, and Paralichthys leopardus) using GeMoMa (default parameters)25. This approach allowed inference of gene structures based on conserved regions across species. Subsequently, structurally intact genes identified from the homology-based results were used to train de novo gene prediction tools, specifically Augustus (default parameters)26 and SNAP (default parameters)27. Concurrently, transcriptome-assisted prediction was conducted by aligning RNA-Seq reads to the genome using HISAT2 (ref. 28), followed by transcript assembly with StringTie29. Full-length transcripts obtained from third-generation ISO-seq data were further aligned with GMAP30 or Minimap2 (ref. 31) and assembled using PASA32. These multiple sources of evidence were then integrated using MAKER 2 (ref. 33) to produce a high-quality, non-redundant gene set.

For gene functional annotation, predicted protein sequences were compared against multiple databases, including GO34, NR35, InterPro36, KEGG37, TrEMBL38, SwissProt39, and KOG (https://ftp.ncbi.nih.gov/pub/COG/KOG/), using Diamond (default parameters)40. In parallel, InterProScan41 was employed to identify conserved protein domains, enabling comprehensive functional characterization.

In total, we predicted 23,090 genes with an average gene length of 14,314.71 bp, an average coding sequence (CDS) length of 1,680.97 bp, an average of 10.1 exons per gene, an average exon length of 166.51 bp, and an average intron length of 1,389.06 bp (Table 5). Functional annotation was successfully assigned to 20,970 genes, representing 90.82% of the predicted gene set (Table 6).
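
The summary statistics above can be cross-checked with simple arithmetic. The sketch below recomputes the annotated fraction and compares the reported average gene length with a value derived from the exon and intron averages. It assumes, for illustration, that a gene with n exons has n − 1 introns and that exons are essentially all coding; the variable names are introduced here.

```python
# Values as reported in the text.
total_genes = 23_090
annotated_genes = 20_970
exons_per_gene = 10.1
mean_exon_bp = 166.51
mean_intron_bp = 1_389.06
reported_gene_bp = 14_314.71

# Fraction of predicted genes with a functional annotation.
annotated_pct = annotated_genes / total_genes * 100

# Assumption: n exons imply n - 1 introns, and UTRs are ignored.
approx_cds_bp = exons_per_gene * mean_exon_bp
approx_gene_bp = approx_cds_bp + (exons_per_gene - 1) * mean_intron_bp

print(f"annotated: {annotated_pct:.2f}%")
print(f"approximate gene length: {approx_gene_bp:.0f} bp (reported {reported_gene_bp} bp)")
```

The derived gene length agrees with the reported average to within about 1%, which suggests the individual statistics are mutually consistent.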

Table 5 Statistics of gene structure prediction for E. rhadinum.

Table 6 Functional annotation statistics of predicted genes in the E. rhadinum genome.

Non-coding RNAs (RNAs that are not translated into proteins), including rRNA, tRNA, snRNA, and miRNA, were also predicted using BLASTN (v2.11.0+; parameters: -evalue 1e-5)42, tRNAscan-SE (v1.3.1; default parameters)43, and Rfam (v14.8; parameters: cmscan --rfam --nohmmonly)44. This analysis identified 657 miRNAs, 1,840 tRNAs, 1,576 rRNAs, and 899 snRNAs in the E. rhadinum genome (Table 7).

Table 7 Statistics of non-coding RNAs in the E. rhadinum genome.

Data Records

The final chromosome-level genome assembly of E. rhadinum is available under GenBank accession GCA_052924935.1 (ref. 45), and comprehensive annotation files, including structural annotations in GFF3 format and genomic sequences in FASTA format, are provided via Figshare (https://doi.org/10.6084/m9.figshare.30164752)46. The raw sequencing data generated in this study are available in the NCBI Sequence Read Archive (SRA) under the following accession numbers: SRR32309642 (ref. 47; HiFi sequencing), SRR32309643 (ref. 48; Hi-C sequencing), SRR32309641 (ref. 49; genome survey sequencing), and SRR32309640 (ref. 50; RNA-seq).

Technical Validation

Genome assembly and gene annotation quality assessment

The completeness of the genome assembly was evaluated using BUSCO (v5.4.3; default parameters)51 with the vertebrata_odb10 database. The contig-level assembly contained 98.96% complete BUSCOs, and the chromosome-level assembly contained 99.05% complete BUSCOs, demonstrating a high degree of genome completeness and integrity (Table 8). Collectively, these results confirm that a high-quality genome assembly of E. rhadinum was successfully generated.
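
BUSCO reports its results as a one-line summary string, and completeness values such as those above are read from its "C" field. The sketch below parses a line in that format; the example line is illustrative and is not output from this study, and the function and field names are introduced here.

```python
import re

# One-line BUSCO summary format: C (complete) = S (single-copy) + D (duplicated),
# F (fragmented), M (missing), n (number of BUSCO groups searched).
_BUSCO_LINE = re.compile(
    r"C:(?P<complete>[\d.]+)%"
    r"\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(line: str) -> dict:
    """Extract percentage scores and the BUSCO group count from a one-line summary."""
    match = _BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("line does not look like a BUSCO summary")
    scores = {name: float(value) for name, value in match.groupdict().items()}
    scores["total"] = int(scores["total"])
    return scores


# Hypothetical summary line, used only to demonstrate the parser.
example_line = "C:99.0%[S:98.1%,D:0.9%],F:0.4%,M:0.6%,n:3354"
scores = parse_busco_summary(example_line)
print(scores["complete"], scores["total"])
```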

Table 8 BUSCO assessment of genome assembly completeness.