Background & Summary

The genus Choerodon, belonging to the family Labridae, includes approximately 24 species, widely recognized for their vibrant coloration and unique dental morphology1. Among them, Choerodon schoenleinii, the blackspot tuskfish (hereafter, the tuskfish), is the largest species in the genus. Characterized by a small black spot on its dorsal fin, the tuskfish can grow up to one meter in length and weigh as much as 15.5 kilograms, making it ecologically and economically significant. This species inhabits sandy or weedy areas near lagoons and seaward reefs, where it primarily preys on hard-shelled organisms such as crustaceans, mollusks, and sea urchins2,3. Individuals have also been observed using tools during feeding4. The species is a protogynous hermaphrodite, transitioning from female to male as it matures, a process influenced by body size, age, and environmental factors5,6,7.

The tuskfish is predominantly found in the South China Sea, the coastal waters of Indonesia, and northern Australia. However, due to overfishing and marine environmental degradation, wild populations have experienced a sharp decline, leading to its classification as “Near Threatened” by the International Union for Conservation of Nature and Natural Resources (IUCN)8,9. To address this issue, comprehensive surveys of its wild germplasm resources and the development of artificial domestication and breeding techniques are urgently needed10.

In this study, we present a high-quality near telomere-to-telomere (T2T) genome assembly of the tuskfish, constructed using PacBio HiFi data at ~100× and Hi-C data at ~135×. The assembly spans 865.99 Mb across 24 chromosome-level scaffolds, of which 14 are gapless assemblies with terminal telomeric repeats. Notably, telomeres were resolved at both ends of 23 chromosomes, demonstrating high assembly completeness. A total of 24,524 protein-coding genes were annotated, with 96.25% assigned functional annotations. This dataset provides a valuable resource for genetic and evolutionary research, as well as for conservation and breeding programs aimed at restoring natural populations and promoting sustainable aquaculture production.

Methods

Sample collection and DNA extraction

Genomic DNA and total RNA were extracted from muscle and other tissues of a single C. schoenleinii specimen (female, body length: 24 cm, weight: 296 g) collected from the waters surrounding the Qizhou Archipelago. DNA was extracted using the Blood & Tissue DNA Kit (Qiagen 69504), while RNA was extracted with TRIzol reagent (Invitrogen) following the manufacturer’s protocols. DNA and RNA integrity were assessed using gel electrophoresis and an Agilent 2100 Bioanalyzer (Agilent Technologies), and purity and concentration were measured with a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific). High-quality DNA and RNA were used for library preparation and sequencing.

Library construction and sequencing

For whole-genome sequencing (WGS), genomic DNA was fragmented into ~350 bp fragments using a Covaris ultrasonicator. After end-repair, adapter ligation, single-strand separation, and circularization, the library was amplified by rolling circle amplification (RCA) to generate DNA nanoballs (DNBs). Qualified DNB libraries were sequenced on the DNBSEQ platform, generating 143.70 Gb of data (~166×) (Table 1).

Table 1 Statistics of the sequencing data.

For PacBio HiFi sequencing, genomic DNA was used to construct a SMRTbell library with the SMRTbell Express Template Prep Kit 2.0. The library was sequenced on the PacBio Revio system, producing 86.54 Gb of HiFi data (~100×) (Table 1).

For Hi-C sequencing, muscle tissue was fixed with 2% formaldehyde to cross-link DNA and proteins. Cross-linked DNA was digested with MboI, ligated with biotin-labeled adapters, circularized, fragmented, and enriched by biotin pull-down. Size-selected DNA was used to construct Hi-C libraries, which were sequenced on the DNBSEQ platform, producing 117.52 Gb of data (~135×) (Table 1).

For transcriptome sequencing, RNA from 10 tissues (brain, gill, gut, heart, kidney, liver, muscle, ovary, skin, and spleen) was used to construct transcriptome libraries. Poly-A mRNA was enriched using magnetic oligo(dT) beads, fragmented, and reverse-transcribed into cDNA. Libraries were prepared with adapter ligation and sequenced on the Illumina NovaSeq 6000 platform, yielding 66.66 Gb of transcriptome data (Table 1).
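The sequencing depths quoted for the WGS, HiFi, and Hi-C libraries follow from dividing each data yield by the genome size. A minimal sketch of that arithmetic is given below. It assumes the 865.99 Mb final assembly size as the denominator (the text does not state which genome size was used), and the names `ASSEMBLY_SIZE_BP`, `YIELDS_GB`, and `depth` are illustrative only.

```python
# Assumption: depth = total sequenced bases / genome size, with the
# 865.99 Mb final assembly size standing in for the genome size.
ASSEMBLY_SIZE_BP = 865.99e6

# Data yields in Gb as reported in the text (Table 1).
YIELDS_GB = {"WGS": 143.70, "HiFi": 86.54, "Hi-C": 117.52}


def depth(yield_gb: float, genome_size_bp: float = ASSEMBLY_SIZE_BP) -> float:
    """Return the mean sequencing depth (fold coverage)."""
    return yield_gb * 1e9 / genome_size_bp


# Round to one decimal place for reporting.
depths = {name: round(depth(gb), 1) for name, gb in YIELDS_GB.items()}
for name, d in depths.items():
    print(f"{name}: {d}x")
```

Under this assumption the computed depths are in line with the approximate values quoted in the text.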

Genome survey and assembly

A genome survey was conducted prior to assembly to assess the basic characteristics of the tuskfish genome. Short reads generated by the DNBSEQ platform were used for k-mer analysis, estimating the genome size at 815.81 Mb (Fig. 1A).
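As a rough illustration of how a k-mer spectrum yields a genome size estimate, the sketch below counts k-mers and divides the total number of non-error k-mers by the depth of the main peak. This is a simplified estimator, not the mixture model that GenomeScope fits; the function names, the error cutoff `min_depth`, and the toy inputs are assumptions made for illustration.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Return {depth: number of distinct k-mers observed `depth` times}.

    Simplification: k-mers are not collapsed with their reverse complements.
    """
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return dict(Counter(counts.values()))


def estimate_genome_size(spectrum, min_depth=5):
    """Estimate genome size as (total k-mers) / (depth of the main peak).

    Depths below `min_depth` (a hypothetical cutoff) are treated as
    sequencing errors and ignored.
    """
    kept = {d: n for d, n in spectrum.items() if d >= min_depth}
    peak_depth = max(kept, key=kept.get)
    total_kmers = sum(d * n for d, n in kept.items())
    return total_kmers / peak_depth


# Toy spectrum: an error peak at depth 1 and a main peak at depth 30.
toy_spectrum = {1: 500, 29: 10, 30: 100, 31: 10}
print(estimate_genome_size(toy_spectrum))
```

In practice a dedicated k-mer counter would produce the spectrum from the short reads; the pure-Python counter here is only meant for small inputs.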

Fig. 1

Summary of the blackspot tuskfish genome assembly and quality assessment. (A) GenomeScope k-mer analysis (k = 19) of whole-genome sequencing reads. (B) Snail plot of assembly statistics. The plot displays, from the center outwards: log-scaled scaffold count (purple spiral), scaffold length distribution (dark gray, longest scaffold in red), N50 (orange) and N90 (light orange) lengths, and GC/AT content (blue/light blue rings). The Actinopterygii BUSCO score is shown in the upper right.

For genome assembly, HiFi reads were first assembled into a draft genome using Hifiasm (v0.20.0)11, resulting in a highly contiguous assembly with a total size of 879.13 Mb. To improve the quality of the assembly, redundant sequences and haplotigs were removed using purge_haplotigs12 and kmerDedup13, reducing the genome size to 865.99 Mb with a contig N50 of 38.17 Mb (Fig. 1B). Hi-C sequencing data were then integrated using HapHiC (v1.0.6)14 and further refined using Juicer (v1.6)15 to anchor the contigs to 24 chromosomes (Fig. 2A), consistent with the chromosome number reported for closely related species16,17,18.
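For readers unfamiliar with the contig N50 statistic reported above, the following minimal sketch shows how it is computed from a list of sequence lengths. The function name and the toy lengths are illustrative; they are not the tuskfish contigs.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence downwards until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy contig lengths in Mb (hypothetical values).
toy_contigs = [6, 5, 4, 3, 2]
print(n50(toy_contigs))
```

The N90 shown in the snail plot is defined the same way with a 90% threshold in place of 50%.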

Fig. 2

Hi-C contact maps and genomic feature overview of the blackspot tuskfish genome. (A) Interchromosomal interaction matrix. A heatmap representing Hi-C interaction frequencies (100-kb bins) across the 24 blackspot tuskfish chromosomes (chr1-chr24). (B) Circos plot summarizing genomic features. Concentric rings display, from outermost to innermost: (1) chromosome ideograms; (2) DNA transposon density; (3) LINE density; (4) SINE density; (5) LTR density; (6) gene density; (7) GC content. A central image depicts a blackspot tuskfish. All feature densities are presented along the chromosomes.

To further enhance genome quality, the assembly was polished with NextPolish (v1.4.1; https://github.com/Nextomics/NextPolish) to correct residual base-level errors. Gaps were then closed and telomeres identified using quarTeT (v1.23)19. The final assembly contained 14 completely gapless chromosomes, and telomeres were detected at both ends of 23 chromosomes (Table 2).
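The two properties summarized here, gaplessness and terminal telomeres, can be checked on any chromosome sequence with a few lines of code. The sketch below is not quarTeT's algorithm. It assumes that gaps are represented as runs of N and that telomeres consist of the canonical vertebrate repeat TTAGGG (or its reverse complement CCCTAA); the window size and repeat threshold are hypothetical values chosen for illustration.

```python
import re

# Canonical vertebrate telomeric repeat and its reverse complement.
MOTIFS = ("TTAGGG", "CCCTAA")


def count_gaps(seq: str) -> int:
    """Count gaps, treating each uninterrupted run of N as one gap."""
    return len(re.findall(r"N+", seq.upper()))


def telomere_at_end(seq: str, end: str, window: int = 10_000,
                    min_repeats: int = 50) -> bool:
    """Report whether a terminal window holds >= min_repeats motif copies.

    `end` is "start" or "end"; `window` and `min_repeats` are
    hypothetical thresholds, not values taken from the text.
    """
    seq = seq.upper()
    region = seq[:window] if end == "start" else seq[-window:]
    return any(region.count(motif) >= min_repeats for motif in MOTIFS)


# Toy chromosome: telomeric repeats at both ends, no gaps.
toy_chrom = "CCCTAA" * 60 + "ACGT" * 1000 + "TTAGGG" * 60
print(count_gaps(toy_chrom),
      telomere_at_end(toy_chrom, "start", window=500),
      telomere_at_end(toy_chrom, "end", window=500))
```

A chromosome would count as gapless with telomeres at both ends when `count_gaps` returns zero and `telomere_at_end` is true for both ends.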

Table 2 Assembly statistics of chromosomes.

Genome annotation

Transposable element (TE) annotation was performed using HiTE20, which identified and masked repetitive elements across the genome. A total of 32.07% of the genome was annotated as repetitive sequences, consistent with the genome survey results (Fig. 2B; Table 3).

Table 3 Statistics of the different types of annotated repeats.

Protein-coding gene annotation combined de novo prediction, homology-based annotation, and transcriptome-based strategies. De novo gene prediction was performed using Augustus (v3.5.0)21 and GALBA (v1.0.11)22, while homology-based annotation employed miniprot (v0.13)23 to align protein sequences from closely related species, including Cheilinus undulatus, Labrus bergylta, and Notolabrus celidotus, to the genome, generating homology-based gene models. Transcriptome data were integrated with homology-based annotations using EGAPx (v0.3.1-alpha; https://github.com/ncbi/egapx), which prepared input data for downstream integration. The final gene models were produced by integrating results from all annotation strategies using EvidenceModeler (EVM; v2.10)24 and further refined using the PASApipeline (v2.5.3)25, resulting in the annotation of 24,524 protein-coding genes (Table 4).

Table 4 Genome function annotation result.

Functional annotation of the protein-coding genes was conducted by aligning protein sequences against multiple databases, including KEGG, NR, SwissProt, Kofam, and EggNOG, using DIAMOND. A total of 96.25% of the protein-coding genes were functionally annotated, providing valuable insights into the biological roles of the identified genes (Table 4).
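An overall annotation rate across several databases is typically the fraction of genes with a hit in at least one of them. The sketch below illustrates that calculation under this assumption, which the text does not state explicitly; the function name and the toy gene identifiers are hypothetical.

```python
def annotated_fraction(all_genes, hits_by_database):
    """Fraction of genes with a hit in at least one database.

    Assumption: a gene counts as functionally annotated if any
    database returns a hit for it.
    """
    annotated = set().union(*hits_by_database.values()) & set(all_genes)
    return len(annotated) / len(all_genes)


# Toy data: four genes, three databases (hypothetical identifiers).
toy_genes = ["g1", "g2", "g3", "g4"]
toy_hits = {"KEGG": {"g1"}, "SwissProt": {"g1", "g2"}, "NR": {"g2", "g3"}}
print(f"{annotated_fraction(toy_genes, toy_hits):.2%}")
```

Applied to the reported totals, 96.25% of 24,524 genes corresponds to roughly 23,600 annotated genes.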

Data Records

The genome assembly is available in GenBank under accession number JBKFGB000000000.1 (ref. 26).

The raw transcriptome, PacBio HiFi, Hi-C, and WGS sequencing data of the blackspot tuskfish have been deposited at the National Center for Biotechnology Information (NCBI) under BioProject accession number PRJNA1204159 (ref. 27). The genome assembly, genome annotation files, gene CDS, and protein sequences have been submitted to Figshare28.

Technical Validation

Genome assembly completeness was evaluated with Benchmarking Universal Single-Copy Orthologs (BUSCO), which reported 99.8% completeness. Inspector (v1.3)29 calculated a QV of 48.54, reflecting high base-level accuracy, while GCI (v1.0)30 yielded a value of 43.681, indicating high assembly continuity. CRAQ (v1.0.9)31 further assessed assembly accuracy, reporting AQI metrics of 95.71 (R-AQI) and 99.88 (S-AQI). Additionally, 14 chromosomes were completely gapless, and telomeres were detected at both ends of 23 chromosomes, demonstrating near telomere-to-telomere assembly quality.
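To put the QV in more familiar terms, it can be converted to a per-base error rate. The sketch below assumes the standard Phred-scaled definition, QV = -10·log10(error rate); the function names are illustrative.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability implied by a Phred-scaled QV."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Phred-scaled QV for a given per-base error probability."""
    return -10 * math.log10(error_rate)


rate = qv_to_error_rate(48.54)  # QV reported in the text
print(f"error rate ~ {rate:.2e}; about one error per {1 / rate:,.0f} bp")
```

Under that definition, a QV of 48.54 implies about 1.4 × 10^-5 errors per base, on the order of one error per 70 kb.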

Annotation quality was validated with BUSCO, which reported 98.5% completeness for the gene set. OMArk (v0.3.0)32 showed that 96.19% (23,590 proteins) of the annotated genes were consistent with the clade Teleostei, and 96.25% of the genes were functionally annotated across databases such as KEGG, NR, and SwissProt, indicating high functional and structural completeness.