Telomere-to-telomere haplotype-resolved genome assembly of a female oyster pompano (Trachinotus anak)

Wang, Tao; Guo, Yuwen; Wang, Yan; Chen, Rong; Ge, Si; Chen, Huapu

doi:10.1038/s41597-025-06345-2

Download PDF

Data Descriptor
Open access
Published: 02 December 2025

Telomere-to-telomere haplotype-resolved genome assembly of a female oyster pompano (Trachinotus anak)

Tao Wang ORCID: orcid.org/0009-0002-6034-6014¹,
Yuwen Guo¹,
Yan Wang¹,
Rong Chen²,
Si Ge¹ &
…
Huapu Chen¹

Scientific Data volume 13, Article number: 33 (2026) Cite this article

1517 Accesses
Metrics details

Subjects

Abstract

Oyster pompano (Trachinotus anak) is currently the highest-yielding marine aquaculture fish species in China. However, previously available genome assemblies were fragmented and incomplete, hindering advances in genomic research and breeding applications. Here, we generated two haplotype-resolved, telomere-to-telomere (T2T) genome assemblies for a female T. anak using PacBio HiFi, ONT ultra-long, and Hi-C data. The resulting assemblies (haplotype A and B) span 663.78 Mb and 661.09 Mb with contig N50 of 28.62 Mb and 29.02 Mb, respectively. Both haplotypic assemblies were anchored into 24 gap-free chromosomes, each comprising a complete set of 48 telomeres. The quality value (QV) scores were 70.45 and 68.66, and Benchmarking Universal Single-Copy Orthologs (BUSCO) completeness scores were 98.9% and 99.0% for haplotype A and B, respectively. Genome annotation identified 187.94 Mb and 178.63 Mb of repetitive sequences, and predicted 23,118 and 23,119 protein-coding genes in haplotypes A and B, respectively, with 99.79% and 99.78% functionally annotated. These two high-quality T2T assemblies provide invaluable resources for molecular breeding and advancing biological and evolutionary research in T. anak.

A telomere-to-telomere gap-free genome assembly of the protandrous maroon clownfish (Premnas biaculeatus)

Article Open access 04 February 2026

A telomere-to-telomere chromosome-scale genome assembly of glass catfish (Kryptopterus vitreolus)

Article Open access 23 March 2025

A telomere-to-telomere gap-free genome assembly of the striped catfish (Pangasianodon hypophthalmus)

Article Open access 28 November 2025

Background & Summary

The oyster pompano (Trachinotus anak) is one of the most economically significant marine aquaculture species in China. It belongs to the family Carangidae, and has often been misidentified as its closely related sister species, T. ovatus or T. blochii, both of which are commonly referred to the “golden pompano” by aquaculture practitioners in the country^1,2. Notably, T. anak can be distinguished from its congeners in morphology and distribution: T. blochii possesses elongated fin rays in the dorsal and anal fins¹, and T. ovatus has a native distribution restricted to the Atlantic Ocean^2,3.

T. anak is primarily distributed in tropical and subtropical waters of the western Pacific Ocean³. Owing to its tender flesh, absence of intermuscular spines, and favorable taste, it is highly favored by consumers. In addition, its rapid growth rate and strong environmental adaptability make it an appealing species for aquaculture producers. These combined traits have driven the rapid expansion of T. anak aquaculture in recent years, particularly along the southeastern coast of China, which has become the main farming region. As a result, its annual production has increased substantially, reaching over 290,000 tons in 2023 and ranking first among all marine aquaculture fish species in the country⁴.

Although several genome assemblies of T. anak have been previously reported, they are all limited to the chromosome level and contain numerous gaps and incomplete telomeric sequences^5,6,7. In this study, we present two haplotype-resolved, telomere-to-telomere (T2T) genome assemblies of T. anak, representing the first gap-free and fully resolved genomes for this species. These two assemblies were generated using a combination of PacBio HiFi, ONT ultra-long, and Hi-C reads, enabling accurate reconstruction of both haplotypes at the chromosomal level with complete telomeric structures. Compared to previous assembly versions, our T2T-level genome assemblies substantially improve assembly continuity and completeness, offering invaluable resources for molecular breeding, functional genomics, and evolutionary studies of T. anak and other Carangidae species.

Methods

Sample collection and nucleic acid extraction

Following phenotypic and genotypic sex identification based on morphological observation and a previously described sex-specific marker⁸ with minor modifications (Table 1), a two-year-old female T. anak was obtained from Hainan Lanliang Agriculture Technology Co., Ltd. (Sanya, Hainan Province, China) for genomic DNA and total RNA extraction (Fig. 1). High-quality genomic DNA was isolated from muscle tissue using the cetyltrimethylammonium bromide (CTAB) method and used for the construction of MGI short-read, PacBio HiFi long-read, and ONT ultra-long-read sequencing libraries. Total RNA was extracted from nine different tissues—including heart, liver, spleen, kidney, muscle, hypothalamus, brain, pituitary, gonad—using TRIzol reagent (Invitrogen, USA) for transcriptome library preparation. The concentration and quality of extracted DNA and RNA were assessed using a NanoDrop One spectrophotometer (Thermo Fisher Scientific, USA), a Qubit 3.0 fluorometer (Life Technologies, USA), and agarose gel electrophoresis.

Table 1 The sequences of female-specific primers used for genetic sex identification.

Full size table

Library preparation and sequencing

For MGI short-read sequencing, a paired-end genomic library with an average insert size of ~350 bp was constructed using MGIEasy Universal DNA Library Preparation Kit v.1.0 (MGI, China) and sequenced on DNBSEQ-T7 platform with 150 bp paired-end reads generated. Quality control of raw short reads was performed using fastp (version 0.23.2)⁹ with default parameters to remove adapter sequences and low-quality reads, resulting in 76.96 Gb of clean data (Table 2).

Table 2 Sequencing data generated for T. anak genome assembly.

Full size table

For PacBio HiFi long-read sequencing, a circular consensus sequencing (CCS) library was prepared using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, USA) and sequenced on the PacBio Revio platform. Raw subreads were processed using CCS software (version 6.0.0, https://github.com/PacificBiosciences/ccs) with the parameters “–min-passes 3–min-snr 2.5–top-passes 60”, resulting in 137.54 Gb of high-accuracy HiFi reads with an average length of 21.07 kb (Table 2).

For ONT ultra-long sequencing, high-molecular-weight (HMW) genomic DNA was used to construct libraries using an SQK-ULK001 Kit (Oxford Nanopore Technologies, UK) and sequenced on the PromethION P48 platform. Raw sequencing data was processed with Porechop (version 0.2.4, https://github.com/rrwick/Porechop) to remove the adapter sequences and Filtlong (version 0.2.1, https://github.com/rrwick/Filtlong) to filter low-quality reads with parameter “–min_length 30000–min_mean_q 90”, resulting in 85.72 Gb of high-quality ultra-long reads with an average length of 80.39 kb (Table 2).

For Hi-C sequencing, fresh liver tissue was collected for library preparation as previously described¹⁰, with minor modifications. Briefly, chromatin was cross-linked with 1% formaldehyde, digested with DpnII, end-repaired with biotin-14-dCTP, and proximity-ligated using T4 DNA ligase. After reversing cross-links and purifying DNA, the DNA was sheared to fragments of 300–700 bp and enriched using streptavidin magnetic beads. The Hi-C library was constructed using with MGIEasy Universal DNA Library Preparation Kit v.1.0 (MGI, China) and sequenced on DNBSEQ-T7 platform. Raw sequencing data was processed with fastp and HICUP (version 0.8.0, http://www.bioinformatics.babraham.ac.uk/projects/hicup/) to remove the adapter sequences and low-quality reads, resulting in 158.46 Gb of clean reads (Table 2).

Genome assembly

Prior to the genome assembly of T. anak, a genome survey was conducted using MGI short-read sequencing data. Clean reads were used to perform k-mer (k = 19) frequency analysis using Jellyfish (version 2.2.10)¹¹. The genome size and heterozygosity rate were subsequently estimated using GCE (version 1.0.2)¹² and GenomeScope (version 2.0)¹³. Based on the k-mer analysis, the genome size of the female T. anak was estimated to be approximately 641‒642 Mb, with a heterozygosity rate of 0.23‒0.24% (Table 3).

Table 3 Genome survey of the female T. anak.

Full size table

We firstly assembled high-quality PacBio HiFi, ONT ultra-long, and Hi-C reads into initial contigs using Hifiasm (version 0.19.9-r616)¹⁴ with default parameters. To obtain a chromosome-level genome of T. anak, the primary contigs were scaffolded jointly on the haplotype-resolved assemblies using Hi-C data. Specifically, ALLHiC (version 0.9.8)¹⁵, using BWA-MEM for Hi-C reads alignment (version 0.7.17, https://github.com/lh3/bwa), was employed as primary scaffolding tool to cluster, order, and orient the contigs into 48 chromosomal groups. To assist in manual correction, Juicer (version 1.6)¹⁶ and 3D-DNA (version 180419)¹⁷ were subsequently used to convert interaction data into specific binary files, which were visualized and manually curated in Juicebox (version 1.11.08)¹⁸. Finally, we generated two sets of haplotype-resolved chromosome-level genome assemblies, comprising a total of 48 chromosomes (derived from 76 contigs) and 131 unanchored scaffolds, with gaps between adjacent contigs filled by 100 ‘N’ strings (Table S1). The two haplotype genomes were designated hapA (chr##A) and hapB (chr##B).

To achieve high-accuracy T2T gap-free genome assemblies, we performed telomere repair, gap closure, and genome polishing based on the chromosome-level assemblies. Specifically, telomeric regions were refined by aligning ONT reads to the genome using Winnowmap2 (version 2.03)¹⁹ with parameter “k = 15, –MD”, extracting reads mapped to the terminal 50 bp of each chromosome, identifying those enriched in canonical telomeric repeats (TTAGGG), and generating consensus sequences with Medaka (version 1.5.0, https://github.com/nanoporetech/medaka) for end replacement of each chromosome based on high-identity (identity > 80%) MUMmer (version 4.0.0)²⁰ alignments. Then, using Winnowmap2, gaps were manually filled with aligned other primary assembly versions, followed by ONT and HiFi reads. Finally, the gap-filled genome was polished using HiFi reads with one round of Racon (version 1.4.3)²¹ followed by two rounds of NextPolish2 (version 0.2.1)²². The final assemblies of the two haplotypes successfully anchored 663.78 Mb (hapA) and 661.09 Mb (hapB) onto 24 chromosomes, with all chromosomes in both assemblies achieving T2T continuity (Fig. 2, Table 4).

Table 4 Assembly statistics of two haplotype-resolved genomes of T. anak.

Full size table

Telomeric and centromeric regions analysis

The identification of telomeric and centromeric regions was conducted using the quarTeT (version 1.2.1)²³ toolkit. For telomeric region prediction, the TeloExplorer module was used to scan each chromosome end for canonical telomeric repeats. All chromosomal telomeres of both haplotype genomes were successfully predicted by identifying >100 tandem repeats of the canonical 6-bp motif “TTAGGG” at both ends of the chromosomes (Fig. 2, Tables S2, S3). The detection of canonical telomeric repeats at both ends of all 24 chromosomes (48 telomeres in total) confirms the completeness of chromosome ends and supports the gap-free status of the assemblies.

For centromeric region prediction, the CentroMiner module was employed. Putative centromeres were successfully predicted on 19 chromosomes in both haplotypes, while 5 chromosomes lacked definitive centromeric signals (Fig. 3, Table S4). These undetected centromeres may correspond to non-canonical satellite repeat regions, which often consist of complex insertions involving multiple satellite families, long terminal repeat (LTR) retrotransposons, or other unidentified transposable elements that are beyond the resolution of tandem repeat-based prediction tools.

Repeat element and non-coding RNA annotation

To comprehensively annotate repetitive elements in the T. anak genome, both dispersed and tandem repeats were identified using a combination of de novo and homology-based approaches. For dispersed repeats, a de novo repeat library was first generated using RepeatModeler (version 2.0.4)²⁴. To enhance the sensitivity and accuracy of LTR elements annotation, LTR_FINDER (version 1.07)²⁵ and LTRharvest (version 1.62)²⁶ were used, with results integrated and de-redundantized using LTR_retriever (version 2.9.0)²⁷. The merged LTR sequences and RepeatModeler library formed a comprehensive de novo library. Unknown elements were further classified using TEclass (version 2.1.3)²⁸. This de novo library was combined with RepBase database (version 20181026, https://www.girinst.org/repbase/), and repetitive elements were annotated using RepeatMasker (version 4.1.5)²⁹. Additionally, RepeatProteinMask, a protein-based module of RepeatMasker, was used to detect TE-related coding regions. All outputs were merged and filtered to produce the final set of dispersed transposable elements. For tandem repeats, Tandem Repeats Finder (version 4.09)³⁰ and MISA (version 2.1)³¹ were employed for identification. Finally, a total of 161.26 Mb (24.29%) and 152.64 Mb (23.09%) of dispersed repetitive sequences, and a total of 26.68 Mb (4.02%) and 25.99 Mb (3.93%) of tandem repetitive sequences were detected in the hapA and hapB assemblies, respectively. Among them, DNA transposons accounted for 11.56% and 10.77%, long interspersed nuclear elements (LINEs) for 3.92% and 3.35%, short interspersed elements (SINEs) for 0.17% and 0.17%, LTRs for 8.23% and 8.75%, respectively (Table 5).

Table 5 Statistics of repetitive sequence.

Full size table

Non-coding RNAs (ncRNAs) were identified using a combination of structure- and homology-based approaches. Transfer RNAs (tRNAs) were predicted based on conserved secondary structure features using tRNAscan-SE (version 2.0.12)³². Other classes of ncRNAs, including ribosomal RNAs (rRNAs), small nuclear RNAs (snRNAs), microRNAs (miRNAs), and small nucleolar RNAs (snoRNAs), were identified using INFERNAL (version 1.1.2)³³ against the Rfam database. Finally, a total of 2,742 miRNAs, 1,373 tRNAs, 4,842 rRNAs, and 1,569 snRNAs were identified in hapA (Tables S5), and 2,353 miRNAs, 1,404 tRNAs, 3,607 rRNAs, and 2,278 snRNAs were detected in hapB (Table S6).

Gene prediction and functional annotation

Gene structure prediction was performed on the repeat-masked genome by integrating evidence from transcriptome-based, homology-based, and ab initio approaches. For transcriptome-based prediction, RNA-seq data from nine different tissues were aligned to the genome using HISAT2 (version 2.1.1)³⁴, followed by transcript assembly with StringTie (version 2.2.1)³⁵. Protein-coding regions were then predicted from assembled transcripts using TransDecoder (version 5.7.0, https://github.com/TransDecoder/TransDecoder). For homology-based prediction, protein sequences from five related species—Danio rerio, Oryzias latipes, Seriola dumerili, Seriola lalandi dorsalis, and a previously assembled T. anak genome (GCF_046630095.1)—were downloaded from the Ensembl and NCBI databases, and aligned to the genome using TBLASTN (version 2.7.1)³⁶, and gene structures were inferred with Exonerate (version 2.4.0, https://github.com/nathanweeks/exonerate). For ab initio gene prediction, Augustus (version 3.5.0)³⁷ and GlimmerHMM (version 3.0.4)³⁸ were employed on the repeat-masked genome. All prediction results were integrated using MAKER (version 3.01.03)³⁹ to generate the final gene models.

Functional annotation of protein-coding genes was performed using both sequence similarity and motif/domain-based approaches. For similarity-based annotation, protein sequences were aligned against the UniProt, NR, and KEGG databases using Diamond (version 2.1.8)⁴⁰, and KOBAS (version 3.0)⁴¹ was used to assign KEGG Orthology (KO) terms and associated pathway information. Gene Ontology (GO) annotations were derived based on UniProt mappings. For motif and domain annotation, InterProScan (version 5.55–88.0)⁴² was used to identify conserved protein motifs and domains.

As a result, 23,118 and 23,119 protein-coding genes were predicted in the hapA and hapB assemblies, respectively, of which 23,069 (99.79%) and 23,068 (99.78%) genes were annotated by at least one functional database (Tables 6, 7).

Table 6 Statistics of the predicted protein-coding genes of T. anak.

Full size table

Table 7 Statistics of functional annotation.

Full size table

Data Records

The two telomere-to-telomere haplotype-resolved genome assemblies of T. anak have been deposited in the European Nucleotide Archive (ENA), an INSDC member repository, under the BioProject accession PRJEB100546. The corresponding assembly accession numbers are GCA_977005155.1 (haplotype 1)⁴³ and GCA_977005145.1 (haplotype 2)⁴⁴. In addition, for broader accessibility, the same genome assemblies together with their annotation data have been co-deposited in the Genome Warehouse (GWH) database of the National Genomics Data Center (NGDC, https://ngdc.cncb.ac.cn) under BioProject PRJCA042885 with accession numbers GWHGEPT00000000.1⁴⁵ and GWHGEPU00000000.1⁴⁶, and are also available on Figshare⁴⁷. The raw sequencing data used for genome assembly and annotation, including PacBio HiFi, ONT ultra-long, Hi-C, MGI, and RNA-seq reads are available in the NGDC Sequence Read Archive (SRA) database with accession numbers CRA027715⁴⁸.

Data Overview

To place the T. anak genome in a broader evolutionary context, we surveyed publicly available teleost genomes to illustrate the utility of our assemblies for future comparative and phylogenomic studies. These two haplotype-resolved, gap-free genomes will serve as valuable references for evolutionary biology, molecular breeding, and comparative genomics.

Technical Validation

Quality evaluation of the initial assembly

A high-quality initial assembly is essential for successful T2T genome construction. Therefore, we evaluated multiple assembly workflows to identify the optimal version as the reference backbone for assembly refinement. These workflows were based on different combinations of input sequencing data and assemblers. Two data combinations were used: (1) PacBio HiFi reads and Hi-C data; and (2) PacBio HiFi, ONT ultra-long reads, and Hi-C data. Each data combination was assembled using both Hifiasm and Verkko (version 2.2)⁴⁹, resulting in four initial assemblies: Hifiasm (HiFi + Hi-C), Hifiasm (HiFi + ONT + Hi-C), Verkko (HiFi + Hi-C), and Verkko (HiFi + ONT + Hi-C). To comprehensively evaluate the quality of the four initial assemblies, we assessed several key metrics, including contig N50 and total contig number calculated using Assembly-stats (version 1.0.1, https://github.com/sanger-pathogens/assembly-stats), k-mer-based consensus quality value (QV) estimated with Merqury (version 1.3)⁵⁰, and Benchmarking Universal Single-Copy Orthologs (BUSCO) completeness scores estimated with BUSCO (version 5.7.1)⁵¹. The detailed results are summarized in Table S7. Among the four assemblies, the Hifiasm (HiFi + ONT + Hi-C) assembly exhibited the highest overall quality, with the longest contig N50 of 27.62 Mb, the fewest contigs (205), the highest QV score of 61.07, and high BUSCO completeness scores (C:98.9% [S:0.2%, D:98.7%]). Based on these comprehensive evaluations, this assembly was selected as the reference backbone for subsequent T2T genome construction.

Quality evaluation of the final assembly

A comprehensive quality evaluation (covering continuity, accuracy, and completeness) was performed on the final genome assembly. Assembly continuity was assessed using three key metrics: (1) contig N50 values, (2) gap number, and (3) Genome Continuity Inspector (GCI, version 2.28-r1209)⁵² scores. The assembled haplotypes exhibited high continuity, with total lengths of 663.78 Mb (hapA) and 661.09 Mb (hapB), and contig N50 values of 28.62 Mb and 29.02 Mb, respectively (Table 8). Both haplotypes were completely gap-free (0 gaps detected) (Table 8). GCI analysis revealed overall continuity scores of 94.44 (hapA) and 85.16 (hapB), with the majority of individual chromosomes achieving 99.99 (Table 8, Table S8). The slightly lower GCI score observed for hapB likely reflects differences in repeat-rich regions between the two haplotypes.

Table 8 Comparative quality assessment of the two haplotype-resolved assemblies and three previously published assemblies of T. anak.

Full size table

Assembly accuracy was assessed also using other three complementary metrics: (1) QV scores, (2) high-accuracy sequencing reads mapping rates, and (3) Hi-C interaction patterns. The k-mer QV scores reached up to 70.45 for hapA and 68.66 for hapB (Table 8). MGI short reads, ONT reads, and HiFi reads were aligned to the genome assemblies, with mapping rates of >99.7%, 100.0% and >99.9%, respectively (Table 8). The Hi-C heatmaps demonstrated strong chromosomal interaction signals and clear diagonal patterns (Fig. 2a).

Assembly completeness was assessed based on BUSCO scores and telomere region analysis. BUSCO evaluation with the “actinopterygii_odb10” reference dataset demonstrated high completeness, showing 98.9% complete BUSCOs for hapA and 99.0% for hapB (Table 8). Additionally, telomere analysis successfully detected all chromosomal telomeres through identification of >100 repeats of the 6-bp “TTAGGG” motif (Fig. 3, Table 8). These results collectively highlight the high quality of our two haplotype-resolved genome assemblies.

Contamination assessment

To ensure the reliability and purity of the genome assemblies, multiple strategies were used to assess potential contamination. First, 10,000 randomly selected MGI short reads were aligned to the NCBI NT database (version 202407) using BLAST (version 2.11.0+, parameters: -evalue 1e-5, -max_target_seqs. 1). Apart from hits classified as “Unknown”, all matched sequences belonged to the Metazoa clade, indicating no detectable exogenous contamination. In addition, we further assessed potential contamination using the NCBI Foreign Contamination Screen for Genomes (FCS-GX, version 0.5.5)⁵³ with default parameters. This analysis identified no putative contaminant divisions, and the contamination summary reported zero contaminated sequences and bases, confirming that the genome assembly is free from detectable contamination.

Data availability

The telomere-to-telomere haplotype-resolved genome assemblies of T. anak are available in the ENA under BioProject PRJEB100546 (GCA_977005155.1 and GCA_977005145.1). The same assemblies and annotations are also available in the GWH at NGDC under BioProject PRJCA042885 (GWHGEPT00000000.1 and GWHGEPU00000000.1). Raw sequencing data are deposited in the NGDC SRA database with accession number CRA027715, and all supporting data are available from Figshare (https://doi.org/10.6084/m9.figshare.29526377.v3).

Code availability

No custom scripts were developed for this study. All analyses were conducted using publicly available bioinformatics software, with parameters applied according to recommended practices or specified where necessary. Full details of the tools and configurations are provided in the Methods section.

References

Fan, B. et al. A single intronic single nucleotide polymorphism in splicing site of steroidogenic enzyme hsd17b1 is associated with phenotypic sex in oyster pompano, Trachinotus anak. Proc. R. Soc. B. 288, 20212245, https://doi.org/10.1098/rspb.2021.2245 (2021).
Article CAS PubMed PubMed Central Google Scholar
Shadrin, A. M., Semenova, A. V. & Thanh, N. T. H. Are there Atlantic species of the genus Trachinotus (Carangidae), T. falcatus and T. ovatus, in Asian mariculture? J. Ichthyol. 64, 854–863, https://doi.org/10.1134/S0032945224700577 (2024).
Article Google Scholar
Froese, R. & Pauly, D. (eds). FishBase: Trachinotus anak. https://fishbase.mnhn.fr/summary/Trachinotus_anak.html (accessed 20 Jun 2025).
Ministry of Agriculture and Rural Affairs of the People’s Republic of China. Chinese Fishery Statistical Yearbook 2024. China Agriculture Press (2024).
Zhang, D. C. et al. Chromosome-level genome assembly of golden pompano (Trachinotus ovatus) in the family Carangidae. Sci. Data 6, 216, https://doi.org/10.1038/s41597-019-0238-8 (2019).
Article CAS PubMed PubMed Central Google Scholar
Guo, L. et al. Turnovers of sex-determining mutation in the golden pompano and related species provide insights into microevolution of undifferentiated sex chromosome. Genome Biol. Evol. 16, evae037, https://doi.org/10.1093/gbe/evae037 (2024).
Article CAS PubMed PubMed Central Google Scholar
Luo, H. L. et al. The male and female genomes of golden pompano (Trachinotus ovatus) provide insights into the sex chromosome evolution and rapid growth. J. Adv. Res. 65, 1–17, https://doi.org/10.1016/j.jare.2023.11.030 (2024).
Article CAS PubMed Google Scholar
Zhang, K. X. et al. Development and verification of sex‐specific molecular marker for golden pompano (Trachinotus blochii). Aquac. Res. 53, 3726–3735, https://doi.org/10.1111/are.15876 (2022).
Article CAS Google Scholar
Chen, S. F., Zhou, Y. Q., Chen, Y. R. & Gu, J. Fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890, https://doi.org/10.1093/bioinformatics/bty560 (2018).
Article CAS PubMed PubMed Central Google Scholar
Gong, G. R. et al. Chromosomal-level assembly of yellow catfish genome using third-generation DNA sequencing and Hi-C analysis. Gigascience 7, 1–9, https://doi.org/10.1093/gigascience/giy120 (2018).
Article CAS Google Scholar
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770, https://doi.org/10.1093/bioinformatics/btr011 (2011).
Article CAS PubMed PubMed Central Google Scholar
Liu, B. H. et al. Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. arXiv 1308.2012, https://doi.org/10.48550/arXiv.1308.2012 (2013).
Vurture, G. W. et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics 33, 2202–2204, https://doi.org/10.1093/bioinformatics/btx153 (2017).
Article CAS PubMed PubMed Central Google Scholar
Cheng, H. Y., Concepcion, G. T., Feng, X. W., Zhang, H. W. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175, https://doi.org/10.1038/s41592-020-01056-5 (2021).
Article CAS PubMed PubMed Central Google Scholar
Zhang, X. T., Zhang, S. C., Zhao, Q., Ming, R. & Tang, H. B. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nat. Plants 5, 833–845, https://doi.org/10.1038/s41477-019-0487-8 (2019).
Article CAS PubMed Google Scholar
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 3, 95–98, https://doi.org/10.1016/j.cels.2016.07.002 (2016).
Article CAS PubMed PubMed Central Google Scholar
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95, https://doi.org/10.1126/science.aal3327 (2017).
Article ADS CAS PubMed PubMed Central Google Scholar
Dudchenko, O. et al. The Juicebox Assembly Tools module facilitates de novo assembly of mammalian genomes with chromosome-length scaffolds for under $1000. bioRxiv 254797, https://doi.org/10.1101/254797 (2018).
Jain, C., Rhie, A., Hansen, N. F., Koren, S. & Phillippy, A. M. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat. Methods 19, 705–710, https://doi.org/10.1038/s41592-022-01457-8 (2022).
Article CAS PubMed PubMed Central Google Scholar
Marçais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLoS Comput. Biol. 14, e1005944, https://doi.org/10.1371/journal.pcbi.1005944 (2018).
Article CAS PubMed PubMed Central Google Scholar
Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746, https://doi.org/10.1101/gr.214270.116 (2017).
Article CAS PubMed PubMed Central Google Scholar
Lin, Y. Z. et al. quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification. Hortic. Res. 10, 8, https://doi.org/10.1093/hr/uhad127 (2023).
Article Google Scholar
Hu, J. et al. NextPolish2: a repeat-aware polishing tool for genomes assembled using HiFi long reads. Genom. Proteom. Bioinform. 22, qzad009, https://doi.org/10.1093/gpbjnl/qzad009 (2024).
Article Google Scholar
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. USA 117, 9451–9457, https://doi.org/10.1073/pnas.1921046117 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Ou, S. J. & Jiang, N. LTR_FINDER_parallel: parallelization of LTR_FINDER enabling rapid identification of long terminal repeat retrotransposons. Mobile DNA 10, 48, https://doi.org/10.1186/s13100-019-0193-0 (2019).
Article CAS PubMed PubMed Central Google Scholar
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 18, https://doi.org/10.1186/1471-2105-9-18 (2008).
Article CAS PubMed PubMed Central Google Scholar
Ou, S. J. & Jiang, N. LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 176, 1410–1422, https://doi.org/10.1104/pp.17.01310 (2018).
Article CAS PubMed Google Scholar
Abrusán, G., Grundmann, N., DeMester, L. & Makalowski, W. TEclass—a tool for automated classification of unknown eukaryotic transposable elements. Bioinformatics 25, 1329–1330, https://doi.org/10.1093/bioinformatics/btp084 (2009).
Article CAS PubMed Google Scholar
Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics 25, Unit 4.10, https://doi.org/10.1002/0471250953.bi0410s25 (2009).
Article Google Scholar
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580, https://doi.org/10.1093/nar/27.2.573 (1999).
Article CAS PubMed PubMed Central Google Scholar
Beier, S., Thiel, T., Münch, T., Scholz, U. & Mascher, M. MISA-web: a web server for microsatellite prediction. Bioinformatics 33, 2583–2585, https://doi.org/10.1093/bioinformatics/btx198 (2017).
Article CAS PubMed PubMed Central Google Scholar
Chan, P. P., Lin, B. Y., Mak, A. J. & Lowe, T. M. tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes. Nucleic Acids Res. 49, 9077–9096, https://doi.org/10.1093/nar/gkab688 (2021).
Article CAS PubMed PubMed Central Google Scholar
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935, https://doi.org/10.1093/bioinformatics/btt509 (2013).
Article CAS PubMed PubMed Central Google Scholar
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915, https://doi.org/10.1038/s41587-019-0201-4 (2019).
Article CAS PubMed PubMed Central Google Scholar
Shumate, A., Wong, B., Pertea, G. & Pertea, M. Improved transcriptome assembly using a hybrid of long and short reads with StringTie. PLoS Comput. Biol. 18, e1009730, https://doi.org/10.1371/journal.pcbi.1009730 (2022).
Article ADS CAS PubMed PubMed Central Google Scholar
Mount, D. W. Using the basic local alignment search tool (BLAST). Cold Spring Harb. Protoc. 2007, pdb.top17, https://doi.org/10.1101/pdb.top17 (2007).
Article Google Scholar
Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644, https://doi.org/10.1093/bioinformatics/btn013 (2008).
Article CAS PubMed Google Scholar
Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879, https://doi.org/10.1093/bioinformatics/bth315 (2004).
Article CAS PubMed Google Scholar
Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12, 491, https://doi.org/10.1186/1471-2105-12-491 (2011).
Article PubMed PubMed Central Google Scholar
Buchfink, B., Reuter, K. & Drost, H. G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368, https://doi.org/10.1038/s41592-021-01101-x (2021).
Article CAS PubMed PubMed Central Google Scholar
Xie, C. et al. KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases. Nucleic Acids Res. 39, W316–W322, https://doi.org/10.1093/nar/gkr483 (2011).
Article CAS PubMed PubMed Central Google Scholar
Hunter, S. et al. InterPro: the integrative protein signature database. Nucleic Acids Res. 37, D211–D215, https://doi.org/10.1093/nar/gkn785 (2009).
Article CAS PubMed Google Scholar
European Nucleotide Archive https://identifiers.org/insdc.gca:GCA_977005155.1 (2025).
European Nucleotide Archive https://identifiers.org/insdc.gca:GCA_977005145.1 (2025).
NGDC Genome WareHouse https://ngdc.cncb.ac.cn/gwh/Assembly/98259/show (2025).
NGDC Genome WareHouse https://ngdc.cncb.ac.cn/gwh/Assembly/98260/show (2025).
Wang, T. Telomere-to-telomere haplotype-resolved genome assembly and annotation of a female oyster pompano (Trachinotus anak). Figshare https://doi.org/10.6084/m9.figshare.29526377.v3 (2025).
NGDC Genome Sequence Archive https://ngdc.cncb.ac.cn/gsa/browse/CRA027715 (2025).
Rautiainen, M. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat. Biotechnol. 41, 1474–1482, https://doi.org/10.1038/s41587-023-01662-6 (2023).
Article CAS PubMed PubMed Central Google Scholar
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245, https://doi.org/10.1186/s13059-020-02134-9 (2020).
Article CAS PubMed PubMed Central Google Scholar
Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol. 38, 4647–4654, https://doi.org/10.1093/molbev/msab199 (2021).
Article CAS PubMed PubMed Central Google Scholar
Chen, Q. Y., Yang, C. T., Zhang, G. J. & Wu, D. Y. GCI: a continuity inspector for complete genome assembly. Bioinformatics 40, btae633, https://doi.org/10.1093/bioinformatics/btae633 (2024).
Article CAS PubMed PubMed Central Google Scholar
Astashyn, A. et al. Rapid and sensitive detection of genome contamination at scale with FCS-GX. Genome Biol. 25, 60, https://doi.org/10.1186/s13059-024-03198-7 (2024).
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

This study was supported by the Natural Science Foundation of China (32273131), the Youth Science and Technology Innovation Talent of Guangdong TeZhi plan talent (2023TQ07A888), the Science and Technology Plan of Zhanjiang City (2024E03007, 2025A0301004 and 2025R02104), and the Science and Technology Plan of Yangjiang City (SDZX2023027 and BQW2024013).

Author information

Authors and Affiliations

Guangdong Research Center on Reproductive Control and Breeding Technology of Indigenous Valuable Fish Species, Guangdong Provincial Key Laboratory of Aquatic Animal Disease Control and Healthy Culture, Fisheries College, Guangdong Ocean University, Zhanjiang, 524088, China
Tao Wang, Yuwen Guo, Yan Wang, Si Ge & Huapu Chen
Wuhan Huabiology Co., Ltd., Wuhan, Hubei, 430000, China
Rong Chen

Authors

Tao Wang
View author publications
Search author on:PubMed Google Scholar
Yuwen Guo
View author publications
Search author on:PubMed Google Scholar
Yan Wang
View author publications
Search author on:PubMed Google Scholar
Rong Chen
View author publications
Search author on:PubMed Google Scholar
Si Ge
View author publications
Search author on:PubMed Google Scholar
Huapu Chen
View author publications
Search author on:PubMed Google Scholar

Contributions

Tao Wang and Huapu Chen conceived and supervised the study. Tao Wang collected samples. Tao Wang, Yuwen Guo, Yan Wang, and Si Ge performed the experiments. Tao Wang and Rong Chen performed bioinformatics analysis. Tao Wang drafted the manuscript. Huapu Chen revised the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Tao Wang or Huapu Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Tables

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Wang, T., Guo, Y., Wang, Y. et al. Telomere-to-telomere haplotype-resolved genome assembly of a female oyster pompano (Trachinotus anak). Sci Data 13, 33 (2026). https://doi.org/10.1038/s41597-025-06345-2

Download citation

Received: 14 July 2025
Accepted: 20 November 2025
Published: 02 December 2025
Version of record: 15 January 2026
DOI: https://doi.org/10.1038/s41597-025-06345-2