Background & Summary

Humphead wrasse (Cheilinus undulatus), also commonly known as Maori or Napoleon wrasse, is an endangered fish species with significant ecological importance for coral reef ecosystems. As a member of the family Labridae within the order Perciformes, it is characterized by its large size and striking appearance1,2. It has a sparse population, slow growth, a long lifespan (over 30 years), and intricate reproductive behaviors, and it contributes significantly to bioerosion and sand production3. Previous investigations indicate that its populations are declining alarmingly because of ongoing overharvesting, habitat degradation, and climate change impacts4,5. Humphead wrasse has been classified as ‘Endangered’ on the IUCN (International Union for Conservation of Nature) Red List and is included in CITES (Convention on International Trade in Endangered Species of Wild Fauna and Flora) Appendix II6.

Like many other environmentally sensitive sex-changing species, humphead wrasse exhibits protogynous hermaphroditism, i.e., it transitions from female to male at approximately 8–9 years old, after attaining female sexual maturity at 5–7 years7,8. Males arise through two distinct developmental pathways (diandry): they either develop directly from juveniles into small males (smaller than the smallest mature female) or transition from adult females through sex change to become large males (exceeding female size)4. However, the detailed molecular mechanisms of its sex change remain largely unknown. Additionally, due to its unique visual system and fused pharyngeal bones, it serves as an excellent model organism for studying opsin evolution in coral reef fishes, and for comparative studies with other fish genomes to demonstrate specific opsin gene expansions in humphead wrasse9.

In a previous genome study of humphead wrasse, a draft chromosome-level genome assembly was reported9. Nevertheless, that assembly contains numerous gaps and has low BUSCO values, and the resulting loss of sequence impairs both genome completeness and annotation accuracy. Here, using PacBio HiFi and Oxford Nanopore Technologies (ONT) ultra-long sequencing, we produced a refined telomere-to-telomere (T2T) chromosome-level genome assembly for humphead wrasse. This improved assembly has a higher scaffold N50 and higher BUSCO scores, and it provides a gap-free genome sequence with telomeres and centromeres identified. This new genome assembly not only provides a valuable genetic resource for in-depth investigations on population genetics and conservation biology of humphead wrasse, but also supports comparative and molecular studies on the regulation of natural sex change and opsin evolution in various vertebrates.

Methods

Sample collection

We obtained an adult humphead wrasse (Fig. 1a) from Guangdong Marine Fisheries Experimental Centre, an offsite facility of the Agro-Tech Extension Center of Guangdong Province, which is situated in Huizhou city, Guangdong province, China. Muscle tissue was collected for whole-genome sequencing, and ten tissues (including intestine, spleen, lung, heart, liver, muscle, gill, eye, skin, and gonad) were sampled for transcriptome sequencing. The sampling procedure and experimental workflow were performed in accordance with the guidelines and approval from the Animal Ethics Committee of Shenzhen University (Shenzhen, China).

Fig. 1

A T2T genome assembly of the humphead wrasse. (a) An image of the sequenced fish. (b) A GenomeScope k-mer plot. (c) The Hi-C contact matrices contain 24 unique blocks in total. (d) A Circos plot displaying the primary genome characteristics, with features annotated from the outermost to innermost rings, (I) the 24 chromosomes, (II) gene density, (III) GC skew, (IV) GC content, (V) tandem repeats, and (VI) transposons.

DNA extraction and genome sequencing

Genomic DNA (gDNA) was extracted from the muscle tissue using a modified CTAB method10. The extracted gDNA was used to construct a BGISeq PCR-free DNA library, which was then sequenced on a BGI T7 platform (MGI, Shenzhen, China). A total of 57.98 Gb of raw reads (150 bp in length) were generated, from which low-quality reads and adaptor sequences were filtered using Trimmomatic (v0.40)11 with default settings. Finally, we obtained 56.64 Gb of clean reads for genome-size estimation and assembly.

Moreover, we prepared long-read libraries using the PacBio Sequel II System and SMRTbell Express Template Prep Kit 3.0 (Pacific Biosciences, Menlo Park, CA, USA) for HiFi sequencing. The CCS software (SMRT Link v9.0)12 was then applied to generate consensus sequences. This yielded approximately 113.26 Gb of consensus reads, with an average length of 19.07 kb.

For ONT sequencing, an ultra-long library was constructed and sequenced on one flow cell of a PromethION platform (Oxford Nanopore Technologies Co., UK). Raw reads with a quality value (QV) below 7 were first removed. Subsequently, Porechop (https://github.com/rrwick/Porechop) was applied to trim adaptors, and Filtlong (https://github.com/rrwick/Filtlong) was employed to remove reads shorter than 30 kb and reads with mean quality scores below 90%. Finally, we obtained a total of 27.88 Gb of clean reads, with an average read length of 96.15 kb and an N50 length of 100 kb.
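The length filter and the N50 statistic mentioned above can be illustrated with a minimal Python sketch. This is not the Filtlong implementation; the function names and the toy read lengths are hypothetical, and only the 30 kb threshold is taken from the text.

```python
def filter_by_length(lengths, min_len=30_000):
    """Keep only reads at least min_len bases long (30 kb in the text)."""
    return [n for n in lengths if n >= min_len]


def n50(lengths):
    """Return the length L such that reads of length >= L
    together contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


# Hypothetical read lengths in bases, for illustration only.
reads = [120_000, 100_000, 95_000, 40_000, 25_000, 8_000]
kept = filter_by_length(reads)
read_n50 = n50(kept)
```

The same `n50` function applies unchanged to contig or scaffold lengths.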

DNA libraries for Hi-C sequencing were constructed with a GrandOmics Hi-C kit (GrandOmics, China), employing DpnII as the restriction enzyme, in accordance with the manufacturer’s instructions. Using the Illumina NovaSeq system (Illumina Inc., San Diego, CA, USA), we produced 74.82 Gb of raw reads from the Hi-C libraries. We then employed Trimmomatic (v0.40)11 to remove low-quality reads (quality scores <20), adapter sequences, and reads shorter than 36 bp. After filtering, 71.53 Gb of clean data were available for subsequent chromosome scaffolding.
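To put these data volumes in context, sequencing depth can be approximated as the clean data volume divided by the genome size. The sketch below assumes the final assembly length reported in this paper (1.25 Gb) as the denominator; the resulting depths are derived here for illustration and are not values reported by the authors, and the dictionary keys are hypothetical labels.

```python
ASSEMBLY_SIZE_GB = 1.25  # final assembly length reported in this paper

# Clean data volumes (Gb) as stated in the text.
clean_data_gb = {
    "BGI short reads": 56.64,
    "PacBio HiFi": 113.26,
    "ONT ultra-long": 27.88,
    "Hi-C": 71.53,
}


def sequencing_depth(data_gb, genome_gb):
    """Approximate fold coverage: sequenced bases / genome size."""
    if genome_gb <= 0:
        raise ValueError("genome size must be positive")
    return data_gb / genome_gb


depths = {
    name: round(sequencing_depth(gb, ASSEMBLY_SIZE_GB), 1)
    for name, gb in clean_data_gb.items()
}

for name, depth in depths.items():
    print(f"{name}: ~{depth}x")
```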

RNA extraction and transcriptome sequencing

We used poly‐T oligo‐attached magnetic beads to purify mRNAs from the ten sampled tissues (intestine, spleen, lung, heart, liver, muscle, gill, eye, skin, and gonad). Sequencing libraries were generated from the purified mRNAs using the VAHTS Universal V6 RNA-seq Library Kit for MGI (Vazyme, Nanjing, China) following the manufacturer’s recommendations with unique index codes. Library concentration and fragment size were assessed using a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA) and a Bioanalyzer 2100 system (Agilent Technologies, CA, USA). Subsequently, sequencing was performed on an MGISEQ-2000 platform by Frasergen Bioinformatics Co. Ltd. (Wuhan, China).

To obtain clean reads, adaptor sequences and low-quality raw reads were filtered using SOAPfilter (v2.2)13 with default parameters. The clean reads from the ten tissues (intestine, spleen, lung, heart, liver, muscle, gill, eye, skin, and gonad) amounted to 5.77, 5.58, 6.29, 8.56, 5.64, 8.38, 6.10, 6.14, 5.96 and 13.58 Gb, respectively. These data were used for annotation of gene structures.

Genome assembly

Genome-size estimation

We employed Jellyfish (v2.2.6)14 and GenomeScope (v2.0)15 to analyze the k-mer frequency distribution of the BGI clean reads. The humphead wrasse genome was estimated to be 1.17 Gb in length, with a heterozygosity rate of 0.27% (Fig. 1b).
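The idea behind k-mer-based genome-size estimation can be sketched in a few lines of Python. This is a simplified peak-depth estimate (total solid k-mers divided by the depth of the main peak), not the mixture-model fit that GenomeScope performs, and it ignores canonical (strand-collapsed) k-mers. The function names, the toy sequences, and the `min_depth` cutoff are all hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(counts, min_depth=2):
    """Estimate genome size as (total k-mer occurrences) / (peak depth).

    k-mers seen fewer than min_depth times are treated as sequencing
    errors and excluded from both the peak search and the total.
    """
    histogram = Counter(counts.values())  # depth -> number of distinct k-mers
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        return 0.0
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth


# Toy data: a 12-bp "genome" sequenced 10 times, plus one erroneous read.
toy_genome = "ACGTTGCAAGCT"
toy_reads = [toy_genome] * 10 + ["ACGA"]
estimate = estimate_genome_size(kmer_counts(toy_reads, k=4))
```

For the toy genome the estimate equals the number of k-mer positions (genome length minus k plus 1); for real genomes, where k is negligible relative to genome length, this approximates the genome size.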

De novo genome assembly

We applied hifiasm (v0.19.5)16 to assemble the HiFi and ONT long reads into contigs, which were then polished with the BGI short reads using T2T-polish17 with the parameter task = best. The initial genome assembly had a total length of 1.253 Gb and comprised 53 contigs, with a contig N50 of 54.5 Mb.

Construction of chromosomes and gap filling

Hi-C reads were aligned to the primary genome assembly using Bowtie2 (v2.3.2)18, followed by identification of valid contact paired reads through the HiC-Pro (v2.8.1) pipeline19. The assembled contigs were anchored to chromosomes using these Hi-C valid reads through the 3D-DNA pipeline20 with the parameter -r 0, followed by manual refinement of the chromosome-level scaffolds in JuiceBox21. To close nucleotide gaps in the chromosome-level genome assembly, we utilized TGS-GapCloser (v1.1.1)22 with default parameters, leveraging both HiFi and ONT long reads. The final genome assembly spans 1.25 Gb, with 99% of the primary sequences anchored to 24 chromosomes, achieving a contig N50 of 54.51 Mb (Fig. 1c,d and Table 1).

Table 1 Positions of telomeres and centromeres across all chromosomes in the assembled genome of the humphead wrasse.

Identification of centromere and telomere sequences

Telomere sequences were identified by detecting (TTAGGG/CCCTAA) repeats in telomeric regions, while centromeres were localized using the Centromics program (https://github.com/ShuaiNIEgithub/Centromics) to analyze HiFi sequencing data, Hi-C data, and the final genome assembly. In total, we identified 48 telomeres and 24 centromeres across the humphead wrasse chromosomes (see more details in Fig. 2 and Table 1).
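A minimal Python sketch of the telomere check described above: it looks for tandem runs of the TTAGGG unit (or its reverse complement CCCTAA) within a window at each chromosome end. The window size, the minimum copy number, and the toy sequence are assumptions made for illustration; they are not the thresholds used by the authors.

```python
import re


def telomere_ends(chrom, window=10_000, min_copies=4):
    """Return (start_has_telomere, end_has_telomere) for one sequence.

    A telomere is called when the terminal window contains at least
    min_copies consecutive TTAGGG or CCCTAA units.
    """
    pattern = re.compile(
        r"(?:TTAGGG){%d,}|(?:CCCTAA){%d,}" % (min_copies, min_copies)
    )
    start = chrom[:window].upper()
    end = chrom[-window:].upper()
    return bool(pattern.search(start)), bool(pattern.search(end))


# Toy chromosome: telomeric repeats at both ends, filler in the middle.
toy_chrom = "CCCTAA" * 6 + "ACGTACGTGG" * 10 + "TTAGGG" * 6
both_ends = telomere_ends(toy_chrom, window=60)

# A sequence without telomeric repeats.
no_ends = telomere_ends("ACGTACGTGG" * 20, window=60)

# Counting telomeres over a set of sequences: 24 chromosomes with
# telomeres at both ends would give 48.
telomere_total = sum(sum(telomere_ends(c, window=60)) for c in [toy_chrom] * 24)
```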

Fig. 2

An overview of the T2T gap-free genome assembly of humphead wrasse. The telomere regions are shown as orange segments at both ends of each chromosome, while the centromere region appears as a constriction within the chromosome.

Genome annotation

Repeat annotation

Tandem repeats were identified using Tandem Repeats Finder (TRF, v4.09.1)23 with the following parameters: 2 7 7 80 10 50 2000. Moreover, transposable elements (TEs) were detected through an integration of de novo prediction and homology searches at both DNA and protein levels. LTR retrotransposons were initially identified using LTR_FINDER (v1.0.7)24 at the DNA level, while RepeatModeler (v2.0.1)25 generated a classified de novo repeat library. Subsequently, RepeatMasker (v4.1.2)26 performed comparative analyses against both the Repbase TE database27 and the newly constructed repeat library. Protein-level TE annotation was performed using RepeatProteinMask26 against the transposable element protein database. A total of 658.03 Mb repetitive sequences were detected in the humphead wrasse genome assembly (Table 2).

Table 2 The proportion of repetitive sequences identified in the humphead wrasse genome assembly.

Gene annotation

Protein-coding genes were predicted using a combination of homology-based, ab initio and transcriptome-assisted annotation approaches. The homology-based annotation was initiated by performing TBLASTN (v2.11.0+)28 searches against our assembly using protein sequences from four representative species: yellowfin seabream (Acanthopagrus latus), sharksucker (Echeneis naucrates), zebrafish (Danio rerio), and medaka (Oryzias latipes). The high-quality alignments were subsequently refined using Exonerate (v2.4.0)29 for precise gene model prediction. The ab initio annotation was performed using Augustus (v3.4.0)30 and GlimmerHMM (v3.0.4)31. For the transcriptome-assisted annotation, RNA-seq reads were first aligned to the genome assembly using HISAT2 (v2.2.1)32, followed by genome-guided transcript assembly with StringTie (v2.1.7)33. Predicted gene models were integrated and refined using MAKER (v3.01.03)34 to generate a non-redundant gene set. Final annotation improvements, including UTR annotation and alternative splicing variant prediction, were accomplished through the PASA pipeline (v2.4.1)35. Ultimately, we annotated a total of 25,064 protein-coding genes, with an average gene length of 27.84 kb and a mean coding sequence (CDS) size of 1,745.64 bp (see Table 3).

Table 3 Gene structures and functional annotations.

Functional annotations

Functional annotation was performed by aligning protein sequences against multiple databases (NCBI NR, KEGG36, GO37, TrEMBL and Swiss-Prot38) using DIAMOND BLASTP (v2.0.7)39, with assignments based on best matches. Functional annotations were assigned to 24,789 genes (98.90%) with supportive evidence from at least one database (see more details in Table 3).

Annotation of non-coding RNA genes

tRNAscan-SE (v2.0.9)40 with default settings was utilized to detect tRNA genes. Moreover, we applied RNAmmer (v1.2)41 to identify rRNA sequences. Annotation of miRNA and snRNA genes was performed using Infernal (v1.1.2)42 through homology searches against the Rfam database (v14.6)43. Finally, a total of 2,221 rRNAs, 3,020 tRNAs, 781 miRNAs, and 646 snRNAs were predicted (see Table 4 for more details).

Table 4 Statistics of the non-coding RNA annotations.

Data Records

All genomic data are publicly available from China National GeneBank DataBase (CNGBdb) under the project ID no. CRA02360944. The genome assembly has been submitted to the GenBank database with the accession number JBMUSF01000000045. In addition, comprehensive documentation regarding the genome assembly, gene structures, functional annotations, and repeat elements of humphead wrasse has been deposited on Figshare46.

Technical Validation

Evaluation of the genome assembly

Genome completeness was assessed using BUSCO (v5.2.2)47 against the actinopterygii_odb10 database (3,640 single-copy orthologs). The assembly contained 98.9% complete BUSCOs (98.4% single-copy and 0.5% duplicated), with only 0.4% fragmented (see Table 5). Moreover, the Merqury (v1.3)48 analysis estimated a genome assembly quality value (QV) of 53.45. Genome assembly accuracy was further evaluated by aligning the sequencing datasets to the assembly, revealing mapping rates of 96.58% (RNA-seq reads), 99.64% (BGI reads), 99.93% (PacBio reads), and 100% (ONT reads). These analyses collectively validate the high quality of this humphead wrasse genome assembly.
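For readers less familiar with Phred-scaled quality values, the reported QV can be converted to an approximate per-base error rate using the standard relation QV = -10 log10(p). The Python sketch below is only an illustration of that arithmetic; the function names are hypothetical, and the derived error rate is not a figure reported by the authors.

```python
import math


def error_rate_from_qv(qv):
    """Convert a Phred-scaled quality value to an error probability."""
    return 10 ** (-qv / 10)


def qv_from_error_rate(p):
    """Convert an error probability to a Phred-scaled quality value."""
    return -10 * math.log10(p)


assembly_qv = 53.45  # Merqury estimate reported in the text
per_base_error = error_rate_from_qv(assembly_qv)
errors_per_mb = per_base_error * 1_000_000

print(f"QV {assembly_qv} ~ {errors_per_mb:.1f} errors per Mb")
```

Under this relation, a QV of 53.45 corresponds to roughly 4.5 consensus errors per million bases.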

Table 5 Statistics of BUSCO results of the T2T genome assembly.

Collinearity analysis

GenomeSyn (v1.2.7)49 was employed for whole-genome synteny comparison between the newly assembled genome and the previously published version (GCF_018320785.1)9. The comparison revealed good one-to-one chromosomal synteny between the two assemblies (Fig. 3), which further supports the high quality of the present humphead wrasse genome assembly.

Fig. 3

Good synteny of chromosomes between the newly assembled genome and the previously published version9.