Background & Summary

The fourfinger threadfin (Eleutheronema tetradactylum), a member of the family Polynemidae1, is a euryhaline pelagic fish inhabiting offshore waters of the Indo-West Pacific, spanning from the Persian Gulf to Australia2,3. Distinguished by its elongated, flattened body and four filamentous pectoral fins (Fig. 1), this species employs these fins for sensory detection of prey. As a commercially valuable species, E. tetradactylum is highly regarded in aquaculture for its rapid growth rate and high-quality meat4. However, overfishing and habitat degradation have led to significant population declines, resulting in its classification as Near Threatened on the International Union for Conservation of Nature (IUCN) Red List in 20145. Despite its importance, research has been constrained by a lack of high-quality genomic data, although studies have explored its population structure6,7, disease control8,9, farming techniques10, reproduction, basic biology11,12, and responses to environmental stressors2,13.

Fig. 1 The map of Eleutheronema tetradactylum.

Although chromosome-level genome assemblies of E. tetradactylum have been published14, they are constrained by gaps, structural discontinuities, and incomplete annotations. Recent advancements in long-read sequencing technologies, such as PacBio and ONT, combined with improved assembly algorithms, have enabled the production of gap-free telomere-to-telomere (T2T) genomes. These assemblies resolve fragmented regions, enhance chromosomal continuity, and facilitate comprehensive variant detection15,16.

In this study, we generated a near-T2T genome assembly for E. tetradactylum using PacBio HiFi reads, ONT ultra-long reads, and Hi-C data. The assembly totals 585.38 Mb, comprises 83 contigs with a contig N50 of 22.14 Mb, and anchors 98.76% of sequences to 26 chromosomes. We annotated 22,362 protein-coding genes and obtained BUSCO completeness scores of 99.53% for the genome and 99.04% for the annotation. This resource surpasses prior assemblies in contiguity and completeness, providing a foundational tool for advancing genetic, evolutionary, and conservation research on E. tetradactylum.

Methods

Sample collection

A two-year-old male E. tetradactylum, sourced from a local fishery farm in Zhanjiang, Guangdong Province, China, was used for this study. Tissues including muscle, eye, brain, liver, heart, spleen, kidney, and gill were collected for genomic and transcriptomic sequencing. All samples were flash-frozen in liquid nitrogen and stored at −80 °C. All procedures were approved by the Institutional Review Board on Bioethics and Biosafety of BGI-Shenzhen, China (No. FT18134).

Library construction and sequencing

Genomic DNA was extracted from muscle tissue for PacBio Single Molecule Real-Time (SMRT), Hi-C, Oxford Nanopore Technologies (ONT), and short-read genome survey sequencing. For SMRT sequencing, high-quality DNA was used to construct genomic libraries according to PacBio’s standard protocol (Pacific Biosciences, CA, USA) and sequenced on the PacBio Sequel II platform (Pacific Biosciences, Menlo Park, CA, USA) in Circular Consensus Sequence (CCS) mode. The raw data was filtered to obtain high-precision HiFi reads.

For ONT sequencing, an ultra-long library was constructed and sequenced on a PromethION flow cell (Oxford Nanopore Technologies Co., UK). Raw reads were filtered for quality (QV ≥ 7) using base-calling software, adapters were removed with Porechop (https://github.com/rrwick/Porechop), and reads shorter than 30 kb or with mean quality <90% were discarded using Filtlong (https://github.com/rrwick/Filtlong).
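The length and quality thresholds described above can be illustrated with a minimal Python sketch. It is not Filtlong's implementation: the `Read` type, the function names, and the way mean accuracy is derived from Phred+33 scores are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    """Hypothetical in-memory FASTQ record (illustration only)."""
    name: str
    seq: str
    qual: str  # Phred+33 encoded quality string


def mean_accuracy_percent(qual: str) -> float:
    """Mean per-base accuracy (%) implied by Phred+33 quality characters."""
    if not qual:
        return 0.0
    # Convert each quality character to an error probability, then average.
    error_probs = [10 ** (-(ord(c) - 33) / 10) for c in qual]
    return 100.0 * (1.0 - sum(error_probs) / len(error_probs))


def filter_reads(reads: Iterable[Read],
                 min_len: int = 30_000,
                 min_acc: float = 90.0) -> Iterator[Read]:
    """Yield reads that pass both the length and the mean-accuracy threshold."""
    for read in reads:
        if len(read.seq) >= min_len and mean_accuracy_percent(read.qual) >= min_acc:
            yield read
```

The defaults mirror the 30 kb and 90% cut-offs stated in the text; a real pipeline would stream records from FASTQ files rather than hold them in memory.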

Hi-C libraries were constructed from muscle tissue following established protocols17. Tissue was cross-linked with formaldehyde, digested with a restriction enzyme, biotin-labeled, and ligated. After reversing cross-links and purifying DNA, fragments were sheared to ~300 bp, and paired-end libraries were sequenced on the DNBSEQ platform.

Total RNA was extracted from eight tissues (eye, brain, liver, heart, spleen, kidney, muscle, gill) using TRIzol reagent (Invitrogen). Paired-end sequencing was performed on the MGISEQ-2000 platform.

Sequencing generated 32.77 Gb of HiFi data, 31.74 Gb of ONT data, 139.39 Gb of Hi-C data, and 21.94 Gb of RNA-seq data (Table 1).

Table 1 Sequencing data for the E. tetradactylum genome assembly.

Genome survey and assembly

For the genome survey, DNA libraries with 300–400 bp inserts were constructed, purified, quantified, and sequenced from both ends on the DNBSEQ platform. Raw reads were quality-filtered with fastp (v0.23.2; parameters: default)18, and k-mer frequencies (k = 21) were counted with Jellyfish (v2.3.0; parameters: -m 21 -s 1000000000)19. Based on the k-mer distribution, GenomeScope 2.0 (v2.0; parameters: -k 21 -p 2)20 estimated a genome size of 543.84 Mb, with a peak 21-mer depth of 120 (Fig. 2). The estimated heterozygosity and repeat rates were 0.545% and 10.328%, respectively. Smudgeplot (v0.2.3dev; parameters: -k21 -m100 -ci1 -cs1000)20 identified the dominant haplotype structure as AB, indicating diploidy (Table 2).
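The basic idea behind k-mer-based genome size estimation can be sketched in a few lines of Python. This is a simplified estimator (total k-mers divided by the main peak depth), not the mixture model that GenomeScope 2.0 fits; the function name, the `min_depth` error cut-off, and the toy histogram are assumptions for illustration.

```python
def estimate_genome_size(hist: dict[int, int], min_depth: int = 5) -> tuple[int, float]:
    """Return (peak depth, estimated genome size) from a k-mer histogram.

    `hist` maps k-mer depth to the number of distinct k-mers seen at that
    depth. K-mers below `min_depth` are treated as sequencing errors, which
    is an assumed heuristic rather than a fitted threshold.
    """
    solid = {depth: count for depth, count in hist.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # The main peak approximates the sequencing depth of the genome.
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences divided by depth approximates the genome size.
    total_kmers = sum(depth * count for depth, count in solid.items())
    return peak_depth, total_kmers / peak_depth


# Toy histogram: low-depth error k-mers plus one peak around depth 20.
toy_hist = {1: 1000, 2: 200, 19: 50, 20: 100, 21: 50}
peak, size = estimate_genome_size(toy_hist)
```

In a heterozygous diploid genome the histogram has two peaks, so choosing the right one matters; model-based tools handle this explicitly.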

Fig. 2 Overview of the 21-mer frequency distribution in the E. tetradactylum genome. The x-axis indicates k-mer coverage (depth), and the y-axis represents the frequency of k-mers at a given depth.

Table 2 Smudgeplot analysis statistics for ploidy determination.

The draft assembly of E. tetradactylum was generated from HiFi data combined with ONT ultra-long reads and Hi-C reads. Assembly was carried out with hifiasm (v0.19.6; parameters: default)21, followed by redundancy removal with Purge Haplotigs (v1.0.4; parameters: default)22. This draft assembly served as the foundation for subsequent chromosome construction using the Hi-C reads.

Hi-C reads were aligned to the draft23, and the 3D-DNA pipeline, which included splitting, anchoring, sorting, orienting, and merging contigs or scaffolds, was employed to achieve chromosome-level scaffolding24. An interaction matrix was generated with Juicer (v1.5; parameters: chr_num 24)25 and manually refined using Juicebox (v1.11.08; parameters: default)26.

Ultra-long ONT reads were aligned to the chromosomes using minimap2 (v2-2.24; parameters: -ax map-ont for ONT reads, -ax map-hifi for CCS reads)27 to generate consensus sequences. These consensus sequences were then aligned to the chromosome ends using blastn (v2.11.0+; parameters: -outfmt 7), and sequences with coverage ≥90% were used to replace the telomeric sequences on the chromosomes according to their alignment positions. Gaps between contigs were filled using TGS-GapCloser (v1.2.0; parameters: --min_nread 10)28, which uses the overlap between ultra-long ONT reads and the assembled contigs to extend contigs. The extended and gap-filled genome was then polished with Pilon (v1.23; parameters: --fix all --changes)29 using short-read sequencing data, yielding the final telomere-to-telomere assembly of E. tetradactylum.

The final assembly spans 585.38 Mb across 77 scaffolds (26 chromosomes), with a scaffold N50 of 23.88 Mb, a contig N50 of 22.14 Mb, and an anchoring rate of 98.76% (Table 3). A Hi-C interaction heatmap confirmed high-quality chromosome assignments (Fig. 3). A total of 36 telomeric sequences were identified at the ends of the 26 chromosomes by searching the entire genome for the telomeric repeat motif (TTAGGG) (Table 4). The genomic positions of these telomeres and their distribution across contigs were annotated and visualized (Fig. 4).
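For readers unfamiliar with the contiguity metrics quoted here, the following Python sketch shows how N50 and an anchoring rate are conventionally computed from sequence lengths. The function names and toy lengths are illustrative assumptions, not the tooling used in this study.

```python
def n50(lengths: list[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the assembly."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # unreachable for positive lengths; kept for safety


def anchoring_rate(chromosome_lengths: list[int], unplaced_lengths: list[int]) -> float:
    """Percentage of assembled bases that are placed on chromosomes."""
    anchored = sum(chromosome_lengths)
    total = anchored + sum(unplaced_lengths)
    return 100.0 * anchored / total
```

Applied to contig lengths the first function gives the contig N50, and applied to scaffold lengths it gives the scaffold N50.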

Table 3 Statistics of the E. tetradactylum genome assembly and comparison with a prior assembly.
Fig. 3 Hi-C interaction heatmap of the E. tetradactylum genome. The x- and y-axes represent genomic positions (N*bin). Color intensity (yellow: low; red: high) indicates interaction strength. The first 26 squares represent the 26 chromosomes, followed by unanchored sequences.

Table 4 Assembly statistics of chromosomes.
Fig. 4 An overview of the T2T gap-free reference genome of E. tetradactylum.

Repeats annotation

Repetitive elements were identified using a combination of de novo and homology-based approaches. Tandem repeats were predicted with TRF (v4.09; default)30. Homology searches employed RepeatMasker (v4.0.9; default)31 against the RepBase library (http://www.girinst.org/repbase). Additionally, RepeatModeler (open-4.0.9; parameters: default)32 and LTR_FINDER_parallel (v1.0.7; parameters: default)33 were used to construct a de novo repeat library for E. tetradactylum, followed by a further de novo prediction using RepeatModeler. By integrating results from TRF, RepeatMasker, RepeatProteinMask, and de novo methods, and subsequently removing redundancies, we determined that repeat sequences and transposable elements (TEs) constitute approximately 18.09% and 16.69% of the E. tetradactylum genome, respectively (Fig. 5). Among these, DNA transposons, LINEs, SINEs and LTRs covered 8.69%, 3.19%, 0.29% and 1.70% of the entire genome, respectively (Table 5). This repeat content is comparable to that of Lates calcarifer (18.6%)34 but higher than that of Oreochromis niloticus (14%)35.
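Removing redundancy when combining several repeat predictors amounts to merging overlapping intervals so that each base is counted once. The sketch below illustrates that step in Python for a single sequence; the function names, the half-open coordinate convention, and the toy intervals are assumptions, and the actual integration pipeline used here may differ.

```python
def merged_coverage(intervals: list[tuple[int, int]]) -> int:
    """Number of bases covered by a set of half-open [start, end) intervals."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Disjoint from the current block: close it and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered


def repeat_percent(intervals: list[tuple[int, int]], sequence_length: int) -> float:
    """Non-redundant repeat content of one sequence, as a percentage."""
    return 100.0 * merged_coverage(intervals) / sequence_length


# Toy example: two overlapping calls from different tools plus one separate call.
calls = [(0, 10), (5, 15), (20, 30)]
```

Summing the raw interval lengths in the toy example would give 30 bases, whereas the merged coverage is 25, which is why per-tool percentages cannot simply be added.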

Fig. 5 Genomic landscape of the E. tetradactylum chromosome-level assembly. Metrics were calculated using a window size of approximately 200 kb. Circos plot from the outer to the inner layers represents the following: (a) GC content (range: 36%–56%); (b) gene density (range: 0%–100%); (c) repeat density (range: 0%–100%); (d) LTR retroelement density (range: 0%–24%); (e) LINE density (range: 0%–61%); and (f) DNA transposon density (range: 0%–88%).

Gene prediction and functional annotation

To annotate genes in the E. tetradactylum genome, we conducted both structural and functional annotation. Gene structure annotation aimed to predict gene positions and structures through homology-based and de novo approaches, while functional annotation determined the biological roles and metabolic pathways associated with predicted gene products.

For gene structure annotation, we combined three strategies: homology-based prediction, de novo prediction, and RNA-seq-assisted prediction. For homology-based prediction, we used Exonerate (v2.2.0; parameters: model protein2genome)36 and Liftoff (v1.6.3; parameters: showtargetgff 1)37 to align protein sequences from closely related species (Paralichthys olivaceus, Oryzias latipes, Carassius gibelio, and Danio rerio) to the E. tetradactylum genome. De novo predictions were performed using AUGUSTUS (v3.3.2; parameters: default)38 and Genscan (v1.0; parameters: default)39. Additionally, RNA-seq data were mapped onto the E. tetradactylum genome, and transcripts and protein-coding genes were predicted using StringTie (v1.3.5; parameters: default)40 and TransDecoder (v5.5.0; https://github.com/TransDecoder/TransDecoder; parameters: default), respectively. The predictions from these methods were integrated into a high-quality, non-redundant gene set using MAKER 2 (v2.31.10; parameters: default)41.

For gene function annotation, we compared the predicted protein sequences against several databases, including GO42, KEGG43, Swissprot44, TrEMBL45, NR46, KOG (https://ftp.ncbi.nih.gov/pub/COG/KOG/) and AnimalTFDB47. This analysis, conducted using DIAMOND (v2.0.14; parameters: --evalue 1e-05)48, provided insight into protein functions, metabolic pathways, and additional characteristics. To further identify conserved sequences, motifs, and structural domains, we searched the Pfam49 and InterPro50 databases using InterProScan (v5.61–93.0; parameters: --seqtype p --formats TSV --goterms --pathways -dp)51. Pathway annotation was performed using KOBAS (v3.0; parameters: -t blastout:tab -s ko)52 against the KEGG database.

Table 5 Statistics of repeat sequence classification in the E. tetradactylum genome.

Overall, we predicted 22,362 protein-coding genes, with an average gene length of 14,620.72 bp, an average CDS length of 1,811.74 bp, 10.80 exons per gene, and an average exon length of 270.42 bp (Table 6). Of these, 22,046 genes (98.59% of the predicted genes) and 37,591 mRNAs (98.71% of the predicted transcripts) were functionally annotated (Fig. 6 and Table 7).
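Summary statistics of this kind can be derived directly from exon coordinates. The Python sketch below shows one way to do so, assuming GFF-style 1-based inclusive coordinates; the `summarise_genes` function and the toy gene models are hypothetical, and only the two gene counts in the final line are taken from the text.

```python
from statistics import mean


def summarise_genes(genes: dict[str, list[tuple[int, int]]]) -> dict[str, float]:
    """Summarise gene models given as {gene_id: [(exon_start, exon_end), ...]}.

    Coordinates are assumed to be 1-based and inclusive, as in GFF3,
    so an exon spanning 1..100 is 100 bp long.
    """
    exon_counts = [len(exons) for exons in genes.values()]
    exon_lengths = [end - start + 1
                    for exons in genes.values()
                    for start, end in exons]
    return {
        "genes": len(genes),
        "mean_exons_per_gene": mean(exon_counts),
        "mean_exon_length": mean(exon_lengths),
    }


# Toy gene models (hypothetical identifiers and coordinates).
toy_genes = {"g1": [(1, 100), (201, 300)], "g2": [(1, 40)]}
toy_summary = summarise_genes(toy_genes)

# Share of predicted genes with a functional annotation, using the counts above.
annotated_pct = round(100 * 22_046 / 22_362, 2)
```

The last line reproduces the 98.59% annotation rate from the reported gene counts.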

Table 6 Statistics of protein-coding gene predictions in the E. tetradactylum genome.
Fig. 6 UpSet plot of gene functional annotations across nine databases: NR, SwissProt, TrEMBL, KOG, InterPro, GO, KEGG-ALL, KEGG-KO, and Pfam.

Table 7 Functional annotation of protein-coding genes in the E. tetradactylum genome.

Non-coding RNAs were predicted using BLASTN (v2.11.0+; parameters: -evalue 1e-5)53 for rRNAs, tRNAscan-SE (v1.3.1; parameters: default)54 for tRNAs, and Infernal (v1.3.3) against Rfam (v14.8; parameters: cmscan --rfam --nohmmonly)55 for miRNAs and snRNAs. We identified 791 miRNAs, 1,594 tRNAs, 1,102 rRNAs, and 651 snRNAs (Table 8).

Table 8 Statistics of non-coding RNAs in the E. tetradactylum genome.

Genome collinearity analysis

To investigate the conservation of genome structure, a synteny analysis was performed between the coding genes of E. tetradactylum and a related species, E. rhadinum, using JCVI (v1.1.22; parameters: “jcvi.compara.catalog ortholog --dbtype=prot --cscore 0.99; jcvi.compara.synteny screen --minspan=70 --align-chromosomes”)56. Both species share a 2n = 52 karyotype and exhibit high collinearity, indicating conserved synteny (Fig. 7).

Fig. 7 Synteny analysis between E. tetradactylum and E. rhadinum genomes.

Data Records

The final telomere-to-telomere genome assembly for E. tetradactylum has been deposited in the National Center for Biotechnology Information (NCBI) GenBank database under accession number JBIEKL000000000 (ref. 57). Annotated coding sequences and protein sequences have been submitted to Figshare (https://doi.org/10.6084/m9.figshare.30164734)58. Raw sequencing reads (HiFi, Hi-C, ONT, genome survey, and RNA-seq) are deposited in the NCBI Sequence Read Archive (SRA) under accession number SRP538810 (ref. 59). All data are publicly accessible without restriction.

Technical Validation

Genome assembly and gene annotation quality assessment

Assembly and annotation completeness were evaluated with BUSCO (v5.4.3; parameters: default)60 against the actinopterygii_odb10 lineage. The genome assembly recovered 99.53% of BUSCOs as complete (99.29% single-copy, 0.25% duplicated), with 0.44% fragmented and 0.03% missing (Table 9). The annotation recovered 99.04% as complete (84.09% single-copy, 14.95% duplicated), with 0.33% fragmented and 0.63% missing (Table 9).
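BUSCO completeness is the sum of the single-copy and duplicated categories, and all four categories together account for every ortholog searched. The Python sketch below makes that relationship explicit; the function name and the toy counts are assumptions, not values from Table 9.

```python
def busco_percentages(single: int, duplicated: int,
                      fragmented: int, missing: int) -> dict[str, float]:
    """Convert BUSCO category counts into percentages of all orthologs searched."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("no BUSCO groups counted")

    def pct(count: int) -> float:
        return round(100.0 * count / total, 2)

    return {
        "complete": pct(single + duplicated),  # complete = single-copy + duplicated
        "single": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
    }


# Toy counts for illustration only.
toy = busco_percentages(single=90, duplicated=5, fragmented=3, missing=2)
```

Because each percentage is rounded independently, the rounded components can differ from the rounded total by 0.01, which may account for 99.29% + 0.25% being reported as 99.53% for the genome.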

Table 9 BUSCO completeness and accuracy evaluation of the E. tetradactylum genome and annotations.

The PacBio HiFi reads were aligned to the assembly using minimap2 (v2.12; parameters: -ax map-pb)27, achieving a mapping rate of 99.72% and a coverage of 99.84%. GC content and depth were uniform across 100-kb windows (Fig. 8). Short reads were aligned and processed with samtools (v1.17; parameters: sort -m 1G)27, picard (v2.25.6; https://broadinstitute.github.io/picard/), and GATK (v4.4.0.0; https://gatk.broadinstitute.org/), revealing heterozygous SNP and InDel rates of 0.279% and 0.111%, respectively, with no homozygous variants detected.
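Windowed GC content of the kind plotted against depth can be computed with a short routine such as the Python sketch below. The function name is an assumption, and ambiguous bases (for example N) are excluded from the denominator, which is a common convention rather than a detail confirmed by the text.

```python
from typing import Optional


def gc_windows(seq: str, window: int = 100_000) -> list[tuple[int, Optional[float]]]:
    """GC percentage in consecutive, non-overlapping windows.

    Returns (window_start, gc_percent) pairs; gc_percent is None when a
    window contains no unambiguous A/C/G/T bases.
    """
    results = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window].upper()
        gc = chunk.count("G") + chunk.count("C")
        acgt = gc + chunk.count("A") + chunk.count("T")
        results.append((start, 100.0 * gc / acgt if acgt else None))
    return results
```

Pairing each window's GC value with its mean read depth gives the points of a GC-depth plot; a single compact cloud suggests the absence of contamination or strong coverage bias.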

Fig. 8 GC content and sequencing depth distribution. The x-axis represents the GC content; the y-axis represents the average depth.

In this study, we successfully achieved a T2T assembly for ten chromosomes: Chr1, Chr4, Chr5, Chr6, Chr7, Chr9, Chr10, Chr11, Chr12, and Chr26. For the remaining chromosomes, telomeres were identified at only one terminus. The difficulty in achieving complete T2T status for these sequences is likely attributable to the presence of recalcitrant genomic regions characterized by high complexity and extreme repetitive content (Fig. 5). Despite the utilization of current ONT ultra-long reads, the structural intricacy of these regions remains challenging to fully resolve. We anticipate that future advancements in sequencing read lengths and the continuous refinement of T2T assembly algorithms will eventually overcome these limitations, enabling the complete, gap-free assembly of the entire E. tetradactylum genome.
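The T2T status of a chromosome depends on finding telomeric repeats at both termini. The Python sketch below illustrates such a check for the TTAGGG motif and its reverse complement; the window size, the minimum repeat count, and the function names are assumptions rather than the parameters used in this study.

```python
import re

MOTIF = "TTAGGG"
MOTIF_REVCOMP = "CCCTAA"  # the same repeat read from the opposite strand


def has_telomere(end_seq: str, min_repeats: int = 10) -> bool:
    """True if the sequence contains at least `min_repeats` tandem motif copies."""
    pattern = re.compile(
        rf"(?:{MOTIF}){{{min_repeats},}}|(?:{MOTIF_REVCOMP}){{{min_repeats},}}"
    )
    return pattern.search(end_seq.upper()) is not None


def classify_chromosome(seq: str, window: int = 10_000, min_repeats: int = 10) -> str:
    """Label a chromosome by how many of its two termini carry telomeric repeats."""
    left = has_telomere(seq[:window], min_repeats)
    right = has_telomere(seq[-window:], min_repeats)
    if left and right:
        return "T2T"
    if left or right:
        return "one end"
    return "none"


# Counts reported in the text: 10 T2T chromosomes and 16 with a single telomere.
expected_telomeres = 10 * 2 + 16 * 1
```

The final line shows that ten chromosomes with two telomeres and sixteen with one give 36 telomeres in total, matching the number reported for the 26 chromosomes.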