Background & Summary

The oyster pompano (Trachinotus anak) is one of the most economically significant marine aquaculture species in China. It belongs to the family Carangidae and has often been misidentified as one of its close relatives, T. ovatus or T. blochii, both of which are commonly referred to as the “golden pompano” by aquaculture practitioners in the country1,2. Notably, T. anak can be distinguished from its congeners by morphology and distribution: T. blochii possesses elongated fin rays in the dorsal and anal fins1, and T. ovatus has a native distribution restricted to the Atlantic Ocean2,3.

T. anak is primarily distributed in tropical and subtropical waters of the western Pacific Ocean3. Owing to its tender flesh, absence of intermuscular spines, and favorable taste, it is highly favored by consumers. In addition, its rapid growth rate and strong environmental adaptability make it an appealing species for aquaculture producers. These combined traits have driven the rapid expansion of T. anak aquaculture in recent years, particularly along the southeastern coast of China, which has become the main farming region. As a result, its annual production has increased substantially, reaching over 290,000 tons in 2023 and ranking first among all marine aquaculture fish species in the country4.

Although several genome assemblies of T. anak have been previously reported, they are all limited to the chromosome level and contain numerous gaps and incomplete telomeric sequences5,6,7. In this study, we present two haplotype-resolved, telomere-to-telomere (T2T) genome assemblies of T. anak, representing the first gap-free and fully resolved genomes for this species. These two assemblies were generated using a combination of PacBio HiFi, ONT ultra-long, and Hi-C reads, enabling accurate reconstruction of both haplotypes at the chromosomal level with complete telomeric structures. Compared to previous assembly versions, our T2T-level genome assemblies substantially improve assembly continuity and completeness, offering invaluable resources for molecular breeding, functional genomics, and evolutionary studies of T. anak and other Carangidae species.

Methods

Sample collection and nucleic acid extraction

Following phenotypic and genotypic sex identification based on morphological observation and a previously described sex-specific marker8 with minor modifications (Table 1), a two-year-old female T. anak was obtained from Hainan Lanliang Agriculture Technology Co., Ltd. (Sanya, Hainan Province, China) for genomic DNA and total RNA extraction (Fig. 1). High-quality genomic DNA was isolated from muscle tissue using the cetyltrimethylammonium bromide (CTAB) method and used for the construction of MGI short-read, PacBio HiFi long-read, and ONT ultra-long-read sequencing libraries. Total RNA was extracted from nine tissues (heart, liver, spleen, kidney, muscle, hypothalamus, brain, pituitary, and gonad) using TRIzol reagent (Invitrogen, USA) for transcriptome library preparation. The concentration and quality of the extracted DNA and RNA were assessed using a NanoDrop One spectrophotometer (Thermo Fisher Scientific, USA), a Qubit 3.0 fluorometer (Life Technologies, USA), and agarose gel electrophoresis.

Table 1 The sequences of female-specific primers used for genetic sex identification.
Fig. 1

Sample collection and sex identification of a female T. anak used for T2T genome assembly. (a) An adult T. anak individual collected from a marine aquaculture base in southeastern China. (b) Dissected gonad of the sequenced individual, showing a well-developed ovary. (c) PCR-based sex determination of the sequenced individual and its parents. The T2T sample shows a female-specific band consistent with the maternal parent.

Library preparation and sequencing

For MGI short-read sequencing, a paired-end genomic library with an average insert size of ~350 bp was constructed using the MGIEasy Universal DNA Library Preparation Kit v.1.0 (MGI, China) and sequenced on the DNBSEQ-T7 platform to generate 150 bp paired-end reads. Quality control of raw short reads was performed using fastp (version 0.23.2)9 with default parameters to remove adapter sequences and low-quality reads, resulting in 76.96 Gb of clean data (Table 2).

Table 2 Sequencing data generated for T. anak genome assembly.

For PacBio HiFi long-read sequencing, a circular consensus sequencing (CCS) library was prepared using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, USA) and sequenced on the PacBio Revio platform. Raw subreads were processed using CCS software (version 6.0.0, https://github.com/PacificBiosciences/ccs) with the parameters “--min-passes 3 --min-snr 2.5 --top-passes 60”, resulting in 137.54 Gb of high-accuracy HiFi reads with an average length of 21.07 kb (Table 2).

For ONT ultra-long sequencing, high-molecular-weight (HMW) genomic DNA was used to construct libraries using an SQK-ULK001 Kit (Oxford Nanopore Technologies, UK) and sequenced on the PromethION P48 platform. Raw sequencing data were processed with Porechop (version 0.2.4, https://github.com/rrwick/Porechop) to remove adapter sequences and with Filtlong (version 0.2.1, https://github.com/rrwick/Filtlong) to filter low-quality reads with the parameters “--min_length 30000 --min_mean_q 90”, resulting in 85.72 Gb of high-quality ultra-long reads with an average length of 80.39 kb (Table 2).

For Hi-C sequencing, fresh liver tissue was collected for library preparation as previously described10, with minor modifications. Briefly, chromatin was cross-linked with 1% formaldehyde, digested with DpnII, end-repaired with biotin-14-dCTP, and proximity-ligated using T4 DNA ligase. After reversing cross-links and purifying DNA, the DNA was sheared to fragments of 300–700 bp and enriched using streptavidin magnetic beads. The Hi-C library was constructed using the MGIEasy Universal DNA Library Preparation Kit v.1.0 (MGI, China) and sequenced on the DNBSEQ-T7 platform. Raw sequencing data were processed with fastp and HICUP (version 0.8.0, http://www.bioinformatics.babraham.ac.uk/projects/hicup/) to remove adapter sequences and low-quality reads, resulting in 158.46 Gb of clean reads (Table 2).

Genome assembly

Prior to the genome assembly of T. anak, a genome survey was conducted using MGI short-read sequencing data. Clean reads were used to perform k-mer (k = 19) frequency analysis using Jellyfish (version 2.2.10)11. The genome size and heterozygosity rate were subsequently estimated using GCE (version 1.0.2)12 and GenomeScope (version 2.0)13. Based on the k-mer analysis, the genome size of the female T. anak was estimated to be approximately 641‒642 Mb, with a heterozygosity rate of 0.23‒0.24% (Table 3).
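
The k-mer survey rests on a simple relation: genome size is approximately the total number of k-mer occurrences divided by the depth of the main (homozygous) peak. The following is a minimal Python sketch of that relation only, assuming a two-column depth histogram of the kind a k-mer counter can export. The function name, the error cutoff `min_depth`, and the toy histogram are illustrative assumptions; GCE and GenomeScope fit fuller statistical models that also account for heterozygosity and repeats.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram: list of (depth, number_of_distinct_kmers) pairs.
    min_depth: hypothetical cutoff; k-mers seen fewer times are treated
               as sequencing errors and ignored.
    Returns (estimated_size, peak_depth).
    """
    kept = [(depth, count) for depth, count in histogram if depth >= min_depth]
    # Main peak: the depth at which the most distinct k-mers are observed.
    peak_depth = max(kept, key=lambda pair: pair[1])[0]
    # Total k-mer occurrences is the sum of depth * count over the histogram.
    total_kmers = sum(depth * count for depth, count in kept)
    return total_kmers / peak_depth, peak_depth


# Toy histogram (not real data): an error spike at low depth and a peak at 50x.
toy = [(1, 900), (2, 100), (48, 50), (49, 120), (50, 200), (51, 120), (52, 50)]
size, peak = estimate_genome_size(toy)
print(size, peak)
```

In this sketch the low-depth error k-mers are discarded before the peak is located, because they would otherwise dominate the histogram.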

Table 3 Genome survey of the female T. anak.

We first assembled the high-quality PacBio HiFi, ONT ultra-long, and Hi-C reads into initial contigs using Hifiasm (version 0.19.9-r616)14 with default parameters. To obtain a chromosome-level genome of T. anak, the contigs of the two haplotype-resolved assemblies were scaffolded jointly using Hi-C data. Specifically, ALLHiC (version 0.9.8)15, with BWA-MEM (version 0.7.17, https://github.com/lh3/bwa) for Hi-C read alignment, was employed as the primary scaffolding tool to cluster, order, and orient the contigs into 48 chromosomal groups. To assist in manual correction, Juicer (version 1.6)16 and 3D-DNA (version 180419)17 were subsequently used to convert interaction data into specific binary files, which were visualized and manually curated in Juicebox (version 1.11.08)18. Finally, we generated two sets of haplotype-resolved chromosome-level genome assemblies, comprising a total of 48 chromosomes (derived from 76 contigs) and 131 unanchored scaffolds, with gaps between adjacent contigs represented by strings of 100 ‘N’ characters (Table S1). The two haplotype genomes were designated hapA (chr##A) and hapB (chr##B).

To achieve high-accuracy T2T gap-free genome assemblies, we performed telomere repair, gap closure, and genome polishing based on the chromosome-level assemblies. Specifically, telomeric regions were refined by aligning ONT reads to the genome using Winnowmap2 (version 2.03)19 with the parameters “k = 15, --MD”, extracting reads mapped to the terminal 50 bp of each chromosome, identifying those enriched in canonical telomeric repeats (TTAGGG), and generating consensus sequences with Medaka (version 1.5.0, https://github.com/nanoporetech/medaka) for end replacement of each chromosome based on high-identity (identity > 80%) MUMmer (version 4.0.0)20 alignments. Gaps were then filled manually using Winnowmap2 alignments, drawing first on sequences from the other initial assembly versions and then on ONT and HiFi reads. Finally, the gap-filled genome was polished using HiFi reads with one round of Racon (version 1.4.3)21 followed by two rounds of NextPolish2 (version 0.2.1)22. The final assemblies of the two haplotypes successfully anchored 663.78 Mb (hapA) and 661.09 Mb (hapB) onto 24 chromosomes, with all chromosomes in both assemblies achieving T2T continuity (Fig. 2, Table 4).

Fig. 2

Genome assemblies of the two haplotype-resolved genomes of T. anak. (a) Hi-C contact map of the two haplotype genome assemblies at a bin size of 500 kb, showing clear chromosome-scale scaffolding and interaction signals. (b) Circos plot of genomic features across both haplotype assemblies. Tracks from outer to inner rings represent: gene density (blue), repeat density (purple), LTR element density (brown), DNA transposon density (red), and GC content (turquoise). Central ribbons represent syntenic relationships between homologous chromosomes of the two haplotypes.

Table 4 Assembly statistics of two haplotype-resolved genomes of T. anak.

Telomeric and centromeric regions analysis

The identification of telomeric and centromeric regions was conducted using the quarTeT (version 1.2.1)23 toolkit. For telomeric region prediction, the TeloExplorer module was used to scan each chromosome end for canonical telomeric repeats. All chromosomal telomeres of both haplotype genomes were successfully predicted by identifying >100 tandem repeats of the canonical 6-bp motif “TTAGGG” at both ends of the chromosomes (Fig. 2, Tables S2, S3). The detection of canonical telomeric repeats at both ends of all 24 chromosomes (48 telomeres in total) confirms the completeness of chromosome ends and supports the gap-free status of the assemblies.
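
The ">100 tandem repeats" criterion can be illustrated with a short Python sketch that counts the longest uninterrupted run of the canonical motif (or its reverse complement, as it appears at the opposite chromosome end) within a terminal window. This is not the TeloExplorer algorithm; the function names, the window size, and the toy sequence are assumptions made for illustration.

```python
import re

MOTIF = "TTAGGG"
MOTIF_REVCOMP = "CCCTAA"


def longest_tandem_run(seq, motif):
    """Return the largest number of back-to-back copies of motif in seq."""
    runs = re.findall(f"(?:{motif})+", seq.upper())
    return max((len(run) // len(motif) for run in runs), default=0)


def best_telomere_run(seq):
    """Longest run of the motif on either strand."""
    return max(longest_tandem_run(seq, MOTIF),
               longest_tandem_run(seq, MOTIF_REVCOMP))


def telomere_repeat_counts(chrom_seq, window=10_000):
    """Longest telomeric run in the first and last `window` bases.

    `window` is a hypothetical search distance, not a value from the study.
    """
    return best_telomere_run(chrom_seq[:window]), best_telomere_run(chrom_seq[-window:])


# Toy chromosome (not real data): 120 repeats at the start, 150 at the end.
toy = "CCCTAA" * 120 + "ACGT" * 500 + "TTAGGG" * 150
left, right = telomere_repeat_counts(toy, window=1000)
print(left, right)
```

Counting only uninterrupted runs is a deliberately strict choice; real telomeres can contain variant repeats, so a production tool would normally tolerate some interruptions.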

For centromeric region prediction, the CentroMiner module was employed. Putative centromeres were successfully predicted on 19 chromosomes in both haplotypes, while 5 chromosomes lacked definitive centromeric signals (Fig. 3, Table S4). These undetected centromeres may correspond to non-canonical satellite repeat regions, which often consist of complex insertions involving multiple satellite families, long terminal repeat (LTR) retrotransposons, or other unidentified transposable elements that are beyond the resolution of tandem repeat-based prediction tools.

Fig. 3

Telomeres and centromeres of the two haplotype-resolved assemblies of T. anak.

Repeat element and non-coding RNA annotation

To comprehensively annotate repetitive elements in the T. anak genome, both dispersed and tandem repeats were identified using a combination of de novo and homology-based approaches. For dispersed repeats, a de novo repeat library was first generated using RepeatModeler (version 2.0.4)24. To enhance the sensitivity and accuracy of LTR element annotation, LTR_FINDER (version 1.07)25 and LTRharvest (version 1.62)26 were used, and their results were integrated and made non-redundant using LTR_retriever (version 2.9.0)27. The merged LTR sequences and the RepeatModeler library formed a comprehensive de novo library. Unknown elements were further classified using TEclass (version 2.1.3)28. This de novo library was combined with the RepBase database (version 20181026, https://www.girinst.org/repbase/), and repetitive elements were annotated using RepeatMasker (version 4.1.5)29. Additionally, RepeatProteinMask, a protein-based module of RepeatMasker, was used to detect TE-related coding regions. All outputs were merged and filtered to produce the final set of dispersed transposable elements. For tandem repeats, Tandem Repeats Finder (version 4.09)30 and MISA (version 2.1)31 were employed for identification. In total, 161.26 Mb (24.29%) and 152.64 Mb (23.09%) of dispersed repetitive sequences and 26.68 Mb (4.02%) and 25.99 Mb (3.93%) of tandem repetitive sequences were detected in the hapA and hapB assemblies, respectively. Among the dispersed repeats, DNA transposons accounted for 11.56% and 10.77% of the genome, long interspersed nuclear elements (LINEs) for 3.92% and 3.35%, short interspersed elements (SINEs) for 0.17% and 0.17%, and LTRs for 8.23% and 8.75%, respectively (Table 5).
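
As a quick consistency check, the reported repeat percentages can be reproduced from the repeat lengths and the assembly lengths given in the text (663.78 Mb for hapA and 661.09 Mb for hapB). The Python sketch below does only that arithmetic; the variable and function names are illustrative.

```python
# Lengths in Mb, taken from the figures reported in the text.
assembly_mb = {"hapA": 663.78, "hapB": 661.09}
dispersed_mb = {"hapA": 161.26, "hapB": 152.64}
tandem_mb = {"hapA": 26.68, "hapB": 25.99}


def percent(part, whole):
    """Share of `whole` taken by `part`, as a percentage with two decimals."""
    return round(100 * part / whole, 2)


dispersed_pct = {hap: percent(dispersed_mb[hap], assembly_mb[hap]) for hap in assembly_mb}
tandem_pct = {hap: percent(tandem_mb[hap], assembly_mb[hap]) for hap in assembly_mb}
print(dispersed_pct, tandem_pct)
```

The computed values agree with the percentages stated above to two decimal places.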

Table 5 Statistics of repetitive sequence.

Non-coding RNAs (ncRNAs) were identified using a combination of structure- and homology-based approaches. Transfer RNAs (tRNAs) were predicted based on conserved secondary structure features using tRNAscan-SE (version 2.0.12)32. Other classes of ncRNAs, including ribosomal RNAs (rRNAs), small nuclear RNAs (snRNAs), microRNAs (miRNAs), and small nucleolar RNAs (snoRNAs), were identified using INFERNAL (version 1.1.2)33 against the Rfam database. In total, 2,742 miRNAs, 1,373 tRNAs, 4,842 rRNAs, and 1,569 snRNAs were identified in hapA (Table S5), and 2,353 miRNAs, 1,404 tRNAs, 3,607 rRNAs, and 2,278 snRNAs were detected in hapB (Table S6).

Gene prediction and functional annotation

Gene structure prediction was performed on the repeat-masked genome by integrating evidence from transcriptome-based, homology-based, and ab initio approaches. For transcriptome-based prediction, RNA-seq data from nine different tissues were aligned to the genome using HISAT2 (version 2.1.1)34, followed by transcript assembly with StringTie (version 2.2.1)35. Protein-coding regions were then predicted from assembled transcripts using TransDecoder (version 5.7.0, https://github.com/TransDecoder/TransDecoder). For homology-based prediction, protein sequences from five related species—Danio rerio, Oryzias latipes, Seriola dumerili, Seriola lalandi dorsalis, and a previously assembled T. anak genome (GCF_046630095.1)—were downloaded from the Ensembl and NCBI databases, and aligned to the genome using TBLASTN (version 2.7.1)36, and gene structures were inferred with Exonerate (version 2.4.0, https://github.com/nathanweeks/exonerate). For ab initio gene prediction, Augustus (version 3.5.0)37 and GlimmerHMM (version 3.0.4)38 were employed on the repeat-masked genome. All prediction results were integrated using MAKER (version 3.01.03)39 to generate the final gene models.

Functional annotation of protein-coding genes was performed using both sequence similarity and motif/domain-based approaches. For similarity-based annotation, protein sequences were aligned against the UniProt, NR, and KEGG databases using Diamond (version 2.1.8)40, and KOBAS (version 3.0)41 was used to assign KEGG Orthology (KO) terms and associated pathway information. Gene Ontology (GO) annotations were derived based on UniProt mappings. For motif and domain annotation, InterProScan (version 5.55–88.0)42 was used to identify conserved protein motifs and domains.

As a result, 23,118 and 23,119 protein-coding genes were predicted in the hapA and hapB assemblies, respectively, of which 23,069 (99.79%) and 23,068 (99.78%) genes were annotated by at least one functional database (Tables 6, 7).

Table 6 Statistics of the predicted protein-coding genes of T. anak.
Table 7 Statistics of functional annotation.

Data Records

The two telomere-to-telomere haplotype-resolved genome assemblies of T. anak have been deposited in the European Nucleotide Archive (ENA), an INSDC member repository, under the BioProject accession PRJEB100546. The corresponding assembly accession numbers are GCA_977005155.1 (haplotype 1)43 and GCA_977005145.1 (haplotype 2)44. In addition, for broader accessibility, the same genome assemblies together with their annotation data have been co-deposited in the Genome Warehouse (GWH) database of the National Genomics Data Center (NGDC, https://ngdc.cncb.ac.cn) under BioProject PRJCA042885 with accession numbers GWHGEPT00000000.1 (ref. 45) and GWHGEPU00000000.1 (ref. 46), and are also available on Figshare47. The raw sequencing data used for genome assembly and annotation, including PacBio HiFi, ONT ultra-long, Hi-C, MGI, and RNA-seq reads, are available in the NGDC Sequence Read Archive (SRA) database under accession number CRA027715 (ref. 48).

Data Overview

To place the T. anak genome in a broader evolutionary context, we surveyed publicly available teleost genomes to illustrate the utility of our assemblies for future comparative and phylogenomic studies. These two haplotype-resolved, gap-free genomes will serve as valuable references for evolutionary biology, molecular breeding, and comparative genomics.

Technical Validation

Quality evaluation of the initial assembly

A high-quality initial assembly is essential for successful T2T genome construction. Therefore, we evaluated multiple assembly workflows to identify the optimal version as the reference backbone for assembly refinement. These workflows were based on different combinations of input sequencing data and assemblers. Two data combinations were used: (1) PacBio HiFi reads and Hi-C data; and (2) PacBio HiFi, ONT ultra-long reads, and Hi-C data. Each data combination was assembled using both Hifiasm and Verkko (version 2.2)49, resulting in four initial assemblies: Hifiasm (HiFi + Hi-C), Hifiasm (HiFi + ONT + Hi-C), Verkko (HiFi + Hi-C), and Verkko (HiFi + ONT + Hi-C). To comprehensively evaluate the quality of the four initial assemblies, we assessed several key metrics, including contig N50 and total contig number calculated using Assembly-stats (version 1.0.1, https://github.com/sanger-pathogens/assembly-stats), k-mer-based consensus quality value (QV) estimated with Merqury (version 1.3)50, and Benchmarking Universal Single-Copy Orthologs (BUSCO) completeness scores estimated with BUSCO (version 5.7.1)51. The detailed results are summarized in Table S7. Among the four assemblies, the Hifiasm (HiFi + ONT + Hi-C) assembly exhibited the highest overall quality, with the longest contig N50 of 27.62 Mb, the fewest contigs (205), the highest QV score of 61.07, and high BUSCO completeness scores (C:98.9% [S:0.2%, D:98.7%]). Based on these comprehensive evaluations, this assembly was selected as the reference backbone for subsequent T2T genome construction.

Quality evaluation of the final assembly

A comprehensive quality evaluation (covering continuity, accuracy, and completeness) was performed on the final genome assembly. Assembly continuity was assessed using three key metrics: (1) contig N50 values, (2) gap number, and (3) Genome Continuity Inspector (GCI, version 2.28-r1209)52 scores. The assembled haplotypes exhibited high continuity, with total lengths of 663.78 Mb (hapA) and 661.09 Mb (hapB), and contig N50 values of 28.62 Mb and 29.02 Mb, respectively (Table 8). Both haplotypes were completely gap-free (0 gaps detected) (Table 8). GCI analysis revealed overall continuity scores of 94.44 (hapA) and 85.16 (hapB), with the majority of individual chromosomes achieving 99.99 (Table 8, Table S8). The slightly lower GCI score observed for hapB likely reflects differences in repeat-rich regions between the two haplotypes.
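
Two of the continuity metrics above, contig N50 and gap number, have simple definitions that can be sketched in Python. The sketch assumes that gaps are encoded as runs of 'N' inside a scaffold; the function names and toy sequences are illustrative and do not reproduce the Assembly-stats implementation.

```python
import re


def split_contigs(scaffold):
    """Split a scaffold into contigs at runs of N (gap placeholders)."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


def count_gaps(scaffold):
    """Count internal runs of N; leading or trailing Ns are not counted."""
    return len(re.findall(r"[Nn]+", scaffold.strip("Nn")))


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy scaffolds (not real data): one with a 100-N gap, one gap-free.
scaffolds = ["A" * 80 + "N" * 100 + "C" * 20, "G" * 50]
contig_lengths = [len(c) for s in scaffolds for c in split_contigs(s)]
gaps = sum(count_gaps(s) for s in scaffolds)
contig_n50 = n50(contig_lengths)
print(contig_lengths, gaps, contig_n50)
```

For a gap-free assembly, `count_gaps` returns 0 for every chromosome, so each chromosome is a single contig.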

Table 8 Comparative quality assessment of the two haplotype-resolved assemblies and three previously published assemblies of T. anak.

Assembly accuracy was also assessed using three complementary metrics: (1) QV scores, (2) mapping rates of high-accuracy sequencing reads, and (3) Hi-C interaction patterns. The k-mer-based QV scores reached 70.45 for hapA and 68.66 for hapB (Table 8). MGI short reads, ONT reads, and HiFi reads were aligned to the genome assemblies, with mapping rates of >99.7%, 100.0% and >99.9%, respectively (Table 8). The Hi-C heatmaps demonstrated strong chromosomal interaction signals and clear diagonal patterns (Fig. 2a).
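
To make the QV scores easier to interpret, the Python sketch below converts a Phred-scaled QV into a per-base error rate and shows the k-mer-based estimate in the form used by Merqury: the fraction of assembly k-mers absent from the reads is converted into a per-base error probability. The function names, the default k = 21, and the example counts are assumptions for illustration; the k-mer size used in this study is not stated in the text.

```python
import math


def kmer_based_qv(asm_only_kmers, total_asm_kmers, k=21):
    """Phred-scaled consensus quality from k-mer counts.

    asm_only_kmers: k-mers present in the assembly but absent from the reads.
    total_asm_kmers: all k-mers in the assembly.
    k: k-mer size (21 is an assumed value, not taken from the study).
    """
    if asm_only_kmers == 0:
        return math.inf
    # Probability that a single base is correct, given the k-mer survival rate.
    p_base_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    return -10 * math.log10(1 - p_base_correct)


def qv_to_error_rate(qv):
    """Per-base error probability implied by a Phred-scaled QV."""
    return 10 ** (-qv / 10)


# Reported QVs of 70.45 (hapA) and 68.66 (hapB) imply roughly 1e-7 errors per base.
for hap, qv in {"hapA": 70.45, "hapB": 68.66}.items():
    print(hap, qv_to_error_rate(qv))
```

On this scale, a QV of 60 corresponds to about one error per million bases, and each additional 10 units is a further tenfold reduction.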

Assembly completeness was assessed based on BUSCO scores and telomere region analysis. BUSCO evaluation with the “actinopterygii_odb10” reference dataset demonstrated high completeness, showing 98.9% complete BUSCOs for hapA and 99.0% for hapB (Table 8). Additionally, telomere analysis successfully detected all chromosomal telomeres through identification of >100 repeats of the 6-bp “TTAGGG” motif (Fig. 3, Table 8). These results collectively highlight the high quality of our two haplotype-resolved genome assemblies.

Contamination assessment

To ensure the reliability and purity of the genome assemblies, multiple strategies were used to assess potential contamination. First, 10,000 randomly selected MGI short reads were aligned to the NCBI NT database (version 202407) using BLAST (version 2.11.0+, parameters: -evalue 1e-5 -max_target_seqs 1). Apart from hits classified as “Unknown”, all matched sequences belonged to the Metazoa clade, indicating no detectable exogenous contamination. In addition, we assessed potential contamination using the NCBI Foreign Contamination Screen for Genomes (FCS-GX, version 0.5.5)53 with default parameters. This analysis identified no putative contaminant divisions, and the contamination summary reported zero contaminated sequences and bases, confirming that the genome assemblies are free from detectable contamination.
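
The text does not state how the 10,000 reads were drawn. One common way to take a uniform random subset from a large FASTQ file in a single pass is reservoir sampling, sketched below in Python. The function name, the seed, and the toy records are hypothetical, and the sketch assumes unwrapped four-line FASTQ records.

```python
import random


def sample_fastq_records(lines, n, seed=0):
    """Reservoir-sample n records from an iterable of FASTQ lines.

    Assumes four lines per record (header, sequence, '+', qualities).
    Each record ends up in the sample with equal probability.
    """
    rng = random.Random(seed)
    reservoir = []
    record = []
    seen = 0
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            seen += 1
            if len(reservoir) < n:
                reservoir.append(record)
            else:
                # Replace an existing entry with probability n / seen.
                slot = rng.randrange(seen)
                if slot < n:
                    reservoir[slot] = record
            record = []
    return reservoir


# Toy input (not real data): 100 records, from which 10 are sampled.
toy_lines = []
for i in range(100):
    toy_lines += [f"@read{i}", "ACGT", "+", "IIII"]
sample = sample_fastq_records(toy_lines, n=10)
print(len(sample))
```

Because the input is consumed line by line, the same function can be fed an open file handle without loading the whole file into memory.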