Background & Summary

Sex determination is a genetic or epigenetic process that initiates and regulates the developmental trajectory of sexual differentiation, whereas sex differentiation encompasses the cascade of morphological and physiological events through which a bipotential gonad progressively develops into either a testis or an ovary, culminating in the establishment of species-specific secondary sexual characteristics1. In contrast to the highly conserved sex determination systems of mammals and birds, fishes exhibit remarkably diverse sex determination modes, including genetic sex determination (GSD), environmental sex determination (ESD), and the coexistence of both2,3. Among environmental cues, temperature is the most influential exogenous factor modulating sexual development in fishes. Numerous species across different taxa have been documented to exhibit thermally sensitive sex determination, in which incubation temperature during critical developmental windows can override genotypic sex determinants. Well-studied examples include European seabass (Dicentrarchus labrax)4, Nile tilapia (Oreochromis niloticus)5, and Atlantic halibut (Hippoglossus hippoglossus)6,7, whose sex ratios can change significantly with variations in environmental temperature during the hatching period.

In addition to gonochorism (separate sexes), fishes also exhibit hermaphroditism as an important reproductive strategy. Approximately 2% of teleost fishes are hermaphroditic, distributed across 27 families within 7 orders8. Sex change is a biological process in which an organism transitions from its original sex to the other through specific physiological mechanisms. Organisms capable of naturally undergoing sex change are referred to as hermaphrodites, which are typically categorized as protandrous (male-to-female) or protogynous (female-to-male)9. Common examples among fishes include groupers, black seabream, clownfish, and ricefield eel10,11,12,13.

Asian seabass holds substantial cultural and economic value throughout the tropical Indo-West Pacific region, serving as both a key fishery resource and a commercially important aquaculture species14. As a protandrous hermaphroditic fish15, it usually first develops into a male at 3–4 years of age, and approximately 90% of individuals then undergo natural sex change to female by age 616. Despite this remarkable reproductive strategy, the genetic mechanisms underlying sex change in Asian seabass remain poorly understood, as is the case for most hermaphroditic species. Genomic resources, including DNA markers, high-resolution linkage maps, transcriptomes, and reference genome sequences with comprehensive annotations, play a pivotal role in supporting aquaculture. These resources provide a foundation for comprehensive genetic investigations and for the development of sophisticated artificial breeding strategies. Ultimately, they contribute to the sustainable expansion and increased productivity of the international aquaculture industry14. Given the economic value of Asian seabass and its natural sex change, a high-quality genome assembly of this species is essential.

In this study, we combined MGI short-read, PacBio HiFi long-read, ONT (Oxford Nanopore Technologies) ultra-long, and Hi-C sequencing data to generate a high-fidelity telomere-to-telomere (T2T) genome assembly of Asian seabass. This assembly was rigorously assessed for quality, and its key genomic features were systematically characterized. This gap-free and complete reference assembly represents a substantial improvement over the previous assemblies of this species17. It will not only facilitate population genetic and evolutionary studies, but also provide an important genetic resource for molecular breeding and for investigating the molecular mechanisms of sex change in this economically important fish.

Methods

Sample collection

A male Asian seabass (Fig. 1A) was collected from a local aquaculture facility of the South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, located in Guangzhou City, Guangdong Province, China. Muscle tissue was sampled for whole-genome sequencing with the MGI short-read, PacBio HiFi long-read, ONT ultra-long and Hi-C technologies. Additionally, seven tissues (gill, brain, liver, muscle, eye, testis, and skin) were collected for transcriptome sequencing (Table 1). After dissection into small fragments, the tissue samples were washed with ice-cold PBS (pH 7.4) to remove blood residues and contaminants. After the surface liquid was removed by blotting, the samples were rapidly frozen in liquid nitrogen and stored at −80 °C before use. For transcriptome sequencing, frozen specimens were shipped on dry ice to the sequencing company (BGI, Shenzhen, Guangdong, China).

Fig. 1

Asian seabass and its whole-genome sequence distribution. (A) A morphological image of the sequenced Asian seabass. (B) A k-mer (21-mer) distribution curve for estimation of the genome size.

Table 1 Sequencing data of the Asian seabass genome and transcriptomes.

DNA extraction and genome sequencing

Genomic DNA (gDNA) was extracted from muscle tissue using a QIAamp DNA Mini Kit (Qiagen, Valencia, CA, USA) following the manufacturer’s protocols18. Fragment size, purity, and quantification of the extracted gDNA were assessed via 0.75% agarose gel electrophoresis, an Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA) and a Qubit Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA), respectively.

For the MGI short-read sequencing, gDNA was randomly fragmented and a library with an insert size of 350 bp was constructed using an MGIEasy Universal DNA Library Preparation Kit (MGI, Shenzhen, China). Sequencing was performed on a DNBSEQ-T7 platform (MGI), generating 37.4 Gb of raw 150-bp paired-end reads, which were then filtered with fastp v0.12.619 (parameters: -n 0 -f 5 -F 5 -t 5 -T 5) to remove adaptor sequences and low-quality reads. Finally, a total of 33.69 Gb of clean reads (Table 1) were obtained for subsequent error correction and genome-size estimation.
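The effect of the trimming parameters listed above can be illustrated with a minimal Python sketch. It is not fastp's implementation; it assumes the usual meaning of the fastp options (-f/-F trim bases from the front of read 1/read 2, -t/-T trim bases from the tail, and -n sets the maximum number of N bases allowed), and it further assumes that trimming precedes the N filter and that a pair is kept only if both mates pass. The function names are hypothetical.

```python
def trim_read(seq, qual, front=5, tail=5):
    """Drop `front` bases from the 5' end and `tail` bases from the 3' end."""
    end = max(front, len(seq) - tail)
    return seq[front:end], qual[front:end]


def passes_n_limit(seq, n_base_limit=0):
    """Keep a read only if it has at most `n_base_limit` ambiguous (N) bases."""
    return seq.upper().count("N") <= n_base_limit


def filter_pair(read1, read2, front=5, tail=5, n_base_limit=0):
    """Trim both mates, then keep the pair only if both pass the N filter.

    Each read is a (sequence, quality) tuple. Returns the trimmed pair,
    or None if the pair is discarded. Assumption: same trim lengths for
    both mates, as with -f 5 -F 5 -t 5 -T 5.
    """
    trimmed = [trim_read(seq, qual, front, tail) for seq, qual in (read1, read2)]
    if all(passes_n_limit(seq, n_base_limit) for seq, _ in trimmed):
        return tuple(trimmed)
    return None
```

Under these assumptions, a 150-bp read is shortened to 140 bp, which accounts for part of the difference between the raw and clean data volumes; the remainder comes from adaptor removal and discarded reads.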

For the PacBio HiFi sequencing, approximately 10 μg of high-quality gDNA was used to construct a SMRTbell library following the manufacturer’s standard protocol (SMRTbell Express Template Prep Kit 2.0; Pacific Biosciences, Menlo Park, CA, USA), which was then sequenced on a PacBio Sequel II System in circular consensus sequencing (CCS) mode. A total of 90.47 Gb of HiFi reads with an N50 of 18,366 bp were obtained (Table 1) using the CCS v6.0.020 software with the parameter “--min-passes 3”.

Two ultra-long read libraries were constructed using Oxford Nanopore Technologies (ONT) protocols and sequenced on a PromethION platform (Oxford Nanopore Technologies Co., Littlemore, Oxford, UK). Raw reads with a quality value (QV) lower than 7 were removed using NanoFilt v2.8.021. Finally, a total of 1.54 million clean reads were retained, totaling 61.32 Gb. The average read length was 39.69 kb, with an N50 length of 71.17 kb (Table 1).
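The read-length N50 reported for the HiFi and ONT data follows the standard definition: the length L such that reads of length L or longer contain at least half of all sequenced bases. A minimal Python sketch of that definition is given below; the function name and the toy input are illustrative only and are not taken from the sequencing pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of read (or contig) lengths.

    Lengths are sorted from longest to shortest and accumulated until
    the running total reaches half of the total number of bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example with five reads (hypothetical lengths in bp).
toy_lengths = [2000, 3000, 4000, 5000, 6000]
print(n50(toy_lengths))
```

The same function applies unchanged to contig or chromosome lengths when summarizing an assembly.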

For the high-throughput chromosome conformation capture (Hi-C) sequencing, one Hi-C library was generated using a GrandOmics Hi-C kit (GrandOmics, Wuhan, Hubei, China) following the manufacturer’s protocol. In brief, gDNA was first cross-linked using a 4% formaldehyde solution to stabilize chromatin structures and then digested with the restriction enzyme MboI. The resulting DNA fragments were labeled with biotin-14-dCTP to allow subsequent enrichment and then ligated using T4 DNA ligase. Following ligation, the DNA was further fragmented to a size range of 200 to 600 bp. The library was sequenced on a DNBSEQ-T7 platform (MGI, Shenzhen, China) in 150-bp paired-end mode, generating 102.32 Gb of raw data. Subsequently, fastp v0.12.619 was applied to remove adaptor sequences and low-quality reads. Finally, 93.8 Gb of clean Hi-C data were retained (Table 1) for chromosome assembly.

RNA extraction and transcriptome sequencing (RNA-seq)

Total RNA was extracted from each of the seven tissues separately according to a standard Trizol protocol (Invitrogen, Frederick, MD, USA), followed by purification with a Qiagen RNeasy Mini Kit (Qiagen, Germantown, MD, USA). RNA concentration and integrity were measured using a NanoDrop 8000 Spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) and an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA), respectively. Only RNA samples with OD260/280 ≥ 1.8 and RNA integrity ≥ 7.0 were selected for transcriptome sequencing. cDNA libraries were constructed following the manufacturer’s guidelines and then sequenced on a HiSeq X Ten platform (Illumina, San Diego, CA, USA). A total of 48.07 Gb of raw transcriptome data were generated (Table 1), which aided the annotation of protein-coding genes and the prediction of gene structures.

Genome-size estimation and construction of a T2T genome assembly

To estimate the genome size of Asian seabass, we employed jellyfish (v2.2.10)22 to count k-mers with k = 21 (parameters: ‘-m 21 -s 10G -C’). The resulting k-mer histogram was then used as the input to GenomeScope v2.023 to obtain a sequence-derived estimate of the genome characteristics prior to assembly. This analysis indicated a genome size of approximately 576.74 Mb, with an estimated heterozygosity of about 0.46% (Fig. 1B) and repetitive sequences accounting for 32.79 Mb (5.69%).
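The principle behind k-mer-based genome-size estimation can be sketched as follows. GenomeScope fits a statistical model to the k-mer spectrum; the Python example below uses only the simpler textbook approximation (total number of k-mers divided by the depth of the main peak) and is therefore an illustration, not a reimplementation. It assumes a two-column histogram of "depth count" lines, that the main peak corresponds to the homozygous coverage, and that k-mers below a hypothetical `min_depth` threshold are sequencing errors.

```python
def parse_histogram(text):
    """Parse 'depth count' lines into a {depth: count} dictionary."""
    histogram = {}
    for line in text.splitlines():
        if line.strip():
            depth, count = line.split()
            histogram[int(depth)] = int(count)
    return histogram


def estimate_genome_size(histogram, min_depth=5):
    """Approximate genome size as total k-mers / main peak depth.

    Depths below `min_depth` are ignored as likely sequencing errors
    (an assumed cutoff; a real spectrum should be inspected first).
    Returns (estimated_size, peak_depth).
    """
    solid = {d: c for d, c in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(depth * count for depth, count in solid.items())
    return total_kmers / peak_depth, peak_depth


# Toy spectrum (hypothetical numbers, not the Asian seabass data).
toy = parse_histogram("1 1000\n2 300\n10 50\n20 100\n30 50\n")
size, peak = estimate_genome_size(toy)
print(f"peak depth = {peak}, estimated size = {size:.0f} bp")
```

For a heterozygous genome the spectrum has two peaks, and the simple approximation is only meaningful when the homozygous peak is the one selected, which is one reason a model-based tool is preferred in practice.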

Primary contigs were generated by assembling the PacBio HiFi and ONT data using Hifiasm v0.19.824 with default parameters. Then, purge_dups v1.2.525 was employed to remove haplotypic and heterozygous duplications from the de novo assembly, yielding a contig-level assembly with a total length of 614.08 Mb.

Using the preliminary assembly as the reference, Hi-C clean reads were utilized to construct chromosomes for Asian seabass. First, the Hi-C reads were mapped to the assembled contigs using bowtie2 v2.2.5 (--very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end)26. Subsequently, the HiC-Pro v2.8.127 pipeline was applied to detect ligation products, retaining only valid read pairs for downstream analysis. Based on these valid reads, the primary assembly was clustered, ordered, and oriented into chromosomes using the Juicer v1.528 and 3D-DNA v3.029 software with parameters -m haploid -r 2 -c 24. Juicebox v1.11.0830 was employed to visualize the candidate assemblies and to adjust them manually.

To fill the remaining gaps, the corrected ultra-long ONT reads were applied to generate a gap-free genome assembly using TGS-GapCloser v1.2.131 with the parameters “--min_match 1000 --min_nread 3” and LR_Gapcloser v1.032 with the parameters “-t 35 -m 1000000 -v 500”. The final genome assembly spans 614.19 Mb and is anchored onto 24 chromosomes (Fig. 2), of which the longest and the shortest are 31.85 Mb and 14.85 Mb, respectively (Table 2).

Fig. 2

The first T2T genome assembly of Asian seabass. (A) Genome-wide chromatin interactions at a 500-kb resolution. Color blocks represent the corresponding interactions, with strengths ranging from yellow (low) to red (high). (B) A Circos plot of the main genome features. The tracks from outside to inside show the 24 chromosomes, gene density, GC content, repetitive sequence density, and collinear relationships among chromosomes of the Asian seabass genome assembly. Densities were calculated in 100-kb windows.

Table 2 Comparison of the available genome assemblies for Asian seabass.

Identification of the centromere and telomere sequences

Telomeres were identified by searching for the target sequence (CCCTAA/TTAGGG) at both ends of each chromosome using the Telomere-to-Telomere Toolkit quarTeT v1.1.133. Centromeres, the specialized DNA regions that connect sister chromatids, exhibit complex structures in most animals and plants, with highly repetitive satellite DNA and scattered retrotransposon sequences. In this study, repeat sequences were first identified with TRF v4.0.434 and RepeatMasker v4.0.635 to obtain a TE annotation file, and quarTeT v1.1.133 was then applied to identify centromeres and predict the candidate interval of each one. Ultimately, we determined that the Asian seabass genome assembly contains a complete set of 24 centromeres and 48 telomeres (Table 3; Fig. 3).
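The idea of searching chromosome ends for the telomeric repeat can be illustrated with a short Python sketch. This is not quarTeT's algorithm; it simply counts non-overlapping copies of CCCTAA and TTAGGG within a terminal window at each end of a sequence. The window size, the function name and the toy sequence are assumptions made for illustration.

```python
def telomere_repeat_counts(chrom_seq, window=10000, motifs=("CCCTAA", "TTAGGG")):
    """Count telomeric repeat copies in the terminal windows of a sequence.

    Returns a dict with the highest motif count found in the first
    (`left`) and last (`right`) `window` bases. Both strand forms of
    the repeat are checked at each end.
    """
    seq = chrom_seq.upper()
    left_end = seq[:window]
    right_end = seq[-window:]
    return {
        "left": max(left_end.count(motif) for motif in motifs),
        "right": max(right_end.count(motif) for motif in motifs),
    }


# Toy chromosome (hypothetical): 50 repeats, an internal region, 40 repeats.
toy_chrom = "CCCTAA" * 50 + "ACGT" * 100 + "TTAGGG" * 40
print(telomere_repeat_counts(toy_chrom, window=300))
```

A chromosome with a high repeat count at both ends would be a candidate for a telomere-to-telomere sequence; any threshold for calling a telomere would have to be chosen separately.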

Table 3 Telomere and centromere positions in the assembled genome.
Fig. 3

Genome-wide localization of repetitive elements (REs), telomeres and centromeres. The triangles at both ends of each chromosome represent the telomere regions, and the gully area within each chromosome stands for the centromere region.

Annotation of repeat elements

For prediction of repetitive elements (REs), tandem repeats were first annotated using TRF v4.0.434 and GMATA v2.236. GMATA was employed to identify simple sequence repeats (SSRs), whereas TRF was used to recognize all tandem REs across the entire genome.

Transposable elements (TEs) in the assembled genome were predicted using a combination of homology-based and de novo methods. For the homology approach, TEs were identified using RepeatMasker v4.0.6 and RepeatProteinMask v4.0.635. For the de novo approach, RepeatModeler v1.0.837 and LTR_FINDER v1.0.638 were employed to generate a de novo repeat library, and RepeatMasker was applied to annotate REs against this library. The annotation results for all repetitive sequences were then merged into a comprehensive dataset, which revealed 111.64 Mb of repetitive sequences, accounting for 18.18% of the assembled Asian seabass genome (Fig. 3). DNA transposons were the most abundant class at 9.00% (55.26 Mb), followed by long interspersed nuclear elements (LINEs) at 2.89% (17.76 Mb) and long terminal repeats (LTRs) at 2.46% (15.07 Mb) (see Table 4).

Table 4 Classification of repetitive sequences in Asian seabass genome.

Prediction and functional annotation of protein-coding genes

Repetitive regions of the assembled genome were masked prior to prediction of genes and their structures. Protein-coding genes were annotated by combining three methods: de novo, homology-based and RNA-seq-based annotation. First, AUGUSTUS v3.2.139 and GlimmerHMM v3.0.440 were employed to perform ab initio gene structure prediction. Second, GeMoMa v1.6.441 was applied for the homology-based prediction, using homologous proteins from five representative fish species downloaded from NCBI: Epinephelus fuscoguttatus (brown-marbled grouper, GCA_011397635.1), Epinephelus moara (kelp grouper, GCA_006386435.1), Lates japonicus (Japanese lates, GCA_033238685.1), Perca flavescens (yellow perch, GCA_004354835.1) and Sebastes umbrosus (honeycomb rockfish, GCA_015220745.1). Third, the RNA-seq data from the seven tissues were assembled into contigs using Trinity v2.5.142, and gene structures were then identified using PASA v2.3.343. Finally, the gene sets were integrated using the Evidence Modeler (EVM) pipeline v1.044.

A total of 25,093 protein-coding genes were annotated, with an average gene length of 13.81 kb and an average coding sequence (CDS) length of 1,721.49 bp (Table 5). Protein-coding genes were evaluated using BUSCO with the actinopterygii_odb10 database as the reference. More than 98.8% of complete BUSCOs were identified within the predicted protein-coding genes.

Table 5 Summary of the predicted gene structures using three methods.

Functional annotation of the protein-coding genes was performed using Blastp v2.2.2645, which aligned deduced protein sequences against five public databases including NCBI Non-Redundant Protein Sequence (NR), SwissProt46, Gene Ontology (GO)47, Kyoto Encyclopedia of Genes and Genomes (KEGG)48 and EuKaryotic Orthologous Groups (KOG)49, with an E-value cutoff of <1e−5. Ultimately, 23,711 protein-coding genes (94.49% of the total predicted genes) were functionally annotated, with at least one hit for each gene in the searched databases (Table 6).

Table 6 Functional annotation of predicted protein-coding genes.

Data Records

The MGI, PacBio, ONT, Hi-C and transcriptome sequencing data and the assembled genome of Asian seabass were deposited at NCBI under the accession number PRJNA1245135. Raw reads are available in the Sequence Read Archive (SRA) with the accession numbers SRR32997291 to SRR3299730550. The genome assembly, predicted coding sequences and functional annotation files of Asian seabass were stored in Figshare (No: m9.figshare.28735226)51. The genome assembly has also been deposited at NCBI/GenBank under the accession number GCA_051027255.152.

Technical Validation

To evaluate the quality of our genome assembly, we employed four approaches. First, BUSCO v5.2.253 was employed to examine completeness. A total of 100% (single-copy complete genes (S): 99.84%; duplicated complete genes (D): 0.16%) of the BUSCOs in the actinopterygii_odb10 database were identified as complete. Second, Merqury v1.32854 was applied to estimate the base-level accuracy and completeness on the basis of k-mer counts generated from short reads and PacBio HiFi reads, resulting in QVs of 40.59 and 57.80, respectively. Third, Clipping information for Revealing Assembly Quality (CRAQ, v1.09)55 was used to assess the accuracy of our genome assembly based on PacBio HiFi and short reads, resulting in an R-AQI (assembly quality indicator) of 98.42 and an S-AQI of 99.45. Fourth, we mapped the sequencing data to the assembled genome using bwa v0.7.1756 and minimap2 v2.2657, which showed mapping rates of 99.46% for the MGI data, 99.99% for the PacBio data, and 98.43% for the ONT data. These results collectively support the high quality of the Asian seabass genome assembly.

For the predicted protein-coding genes of Asian seabass, the BUSCO completeness value was 98.8% (Table 7). To further evaluate the quality of these predicted genes, we aligned the transcriptome data to the assembled genome using STAR v2.7.11b58, and then calculated the exonic coverage rate with bedtools v2.29.259. We observed that 94.71% of the exonic regions were covered by sequencing reads, indicating high annotation accuracy (see Table 7).
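The consensus quality values (QVs) reported above are on the Phred scale, in which QV = −10·log10(E) and E is the estimated per-base error rate. A minimal Python sketch of the conversion is shown below as an aid to interpreting the two QVs; the function names are illustrative and the script is not part of the validation pipeline.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error rate (0 < E <= 1) to a Phred-scaled QV."""
    if not 0 < error_rate <= 1:
        raise ValueError("error rate must be in the interval (0, 1]")
    return -10 * math.log10(error_rate)


# The two QVs reported in the text for the two k-mer sets.
for qv in (40.59, 57.80):
    error = qv_to_error_rate(qv)
    print(f"QV {qv}: about {error:.2e} errors per base "
          f"(roughly 1 error in {1 / error:,.0f} bases)")
```

On this scale, QV 40 corresponds to one expected error per 10,000 bases and each additional 10 units lowers the error rate tenfold, so both reported values indicate fewer than one error per 10,000 bases.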

Table 7 Assessment metrics of the genome assembly and annotation.