Background & Summary

The maroon clownfish (Premnas biaculeatus), also known as the spinecheek anemonefish, is a marine teleost in the order Perciformes and the family Pomacentridae. It is the only clownfish species that does not fall within the genus Amphiprion1. As a representative of coral reef ecosystems, the maroon clownfish is distinguished by its vivid body coloration, species-specific behaviors, and obligate symbiotic relationship with sea anemones2. These traits make it not only a globally popular ornamental fish but also a classic model organism for investigating the ecological and molecular mechanisms of marine symbiosis3,4.

Evolution of sex change strategies in fish is tightly associated with their mating systems, as these strategies directly impact reproductive fitness5,6. Previous field observations and experimental studies have consistently demonstrated that, to maximize reproductive output under the constraints of social structure (e.g., dominance hierarchy), some fish species have evolved a genetic feature of sex change7. Sex change and hermaphroditic strategies in bony fishes are taxonomically and functionally diverse, and they can be categorized into four primary types based on the direction and timing of sexual differentiation: protogyny (female-first sex change), protandry (male-first sex change), bidirectional sex change (flexible transition between sexes), and synchronous hermaphroditism (simultaneous possession of functional male and female gonads)8. Representative species include Asian seabass (Lates calcarifer)9, black seabream (Acanthopagrus schlegelii)10, and orange-red pygmygoby (Trimma okinawae)11. Within the family Pomacentridae, fishes of the genera Amphiprion and Premnas are typically protandrous hermaphrodites, which initially develop as functional males and can later undergo a transition to females. In contrast, some species of the genus Dascyllus display protogynous sex change, wherein individuals first appear as females and subsequently shift to males6. Despite sharing similarities in reproductive behaviors (e.g., egg guarding) and parental care strategies with other Pomacentridae species, these genera possess highly specialized mating systems, rendering them ideal models for investigating the evolution of sex change and the structure of social behaviors.

The maroon clownfish is the only anemonefish species in the genus Premnas. It exhibits strict host specificity to the bubble-tip anemone (Entacmaea quadricolor)12. Previous studies have shown that the maroon clownfish holds a competitive advantage over other organisms inhabiting Entacmaea quadricolor13. Compared with other anemonefish species, the maroon clownfish has a distinctive social structure: it typically lives solely in monogamous pairs rather than in social groups that include immature subadults. Breeding females are generally twice the size of breeding males14. Like other anemonefishes, the maroon clownfish is a protandrous hermaphrodite, and its social structure is characterized by female dominance15.

In the present study, we combined MGI short-read, PacBio HiFi long-read, ONT (Oxford Nanopore Technologies) ultra-long, and Hi-C sequencing data to construct a high-fidelity telomere-to-telomere (T2T) genome assembly of the maroon clownfish. The assembly was rigorously assessed for quality, and its key genomic features were characterized. This gap-free reference genome will not only facilitate population genetic and evolutionary studies, but also provide an important genetic resource for molecular breeding and for investigating the molecular mechanisms of sex change in this economically important fish.

Methods

Sample collection

The maroon clownfish used in this study (Fig. 1A) was obtained from a local base of the South China Sea Fisheries Research Institute (SCSFRI), Chinese Academy of Fishery Sciences (CAFS), in Shenzhen city, Guangdong province, China. It had a body length of 4.6 cm and a body weight of 6.5 g. For whole-genome sequencing, muscle was dissected from the individual and subjected to multi-platform sequencing, including MGI short-read, PacBio HiFi long-read, ONT ultra-long-read, and Hi-C sequencing technologies. Additionally, transcriptome sequencing (RNA-seq) was performed using four distinct tissue samples (gill, eye, muscle and skin) (Table 1). All sampling procedures and experimental workflows were conducted in compliance with the guidelines established by the Animal Ethics Committee of SCSFRI, CAFS (No. nhdf2025-30).

Fig. 1

Maroon clownfish and its whole-genome sequence distribution. (A) A morphological image of the sequenced fish. (B) A k-mer (21-mer) distribution curve for estimation of the genome size.

Table 1 Sequencing data of the maroon clownfish genome and transcriptomes.

DNA extraction and genome sequencing

Genomic DNA (gDNA) was extracted from the muscle tissue using a QIAamp DNA Mini kit (Qiagen, Valencia, CA, USA) in accordance with the manufacturer’s protocols16. Quality of the extracted gDNA was evaluated using three complementary approaches. Fragment size was determined via 0.75% agarose gel electrophoresis, molecular integrity was assessed by an Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA), and absolute quantification was performed using a Qubit Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA).

For the MGI sequencing, a paired-end library with an insert size of 350 bp was constructed using an MGIEasy Universal DNA Library Preparation kit (MGI, Shenzhen, China). Sequencing was then conducted on a DNBSEQ-T7 platform (MGI) to generate 67.56 Gb of paired-end reads. These raw reads were filtered using fastp v0.12.617 with the following parameters: -n 0 -f 5 -F 5 -t 5 -T 5 -q 20. This filtering step removed adapter sequences, reads with excessive Ns, and low-quality bases (Phred quality score < 20), resulting in 59.92 Gb of clean reads (Table 1) that were used for subsequent data error correction and genome size estimation.

For the PacBio HiFi sequencing, approximately 10 μg of high-quality gDNA was used to construct a SMRTbell library following the standard protocol provided with the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA). The library was sequenced on a PacBio Sequel II System using the circular consensus sequencing (CCS) technology. HiFi reads were generated using CCS v6.0.018 with the optimized parameter “--min-passes 3”. This process yielded a total of 101.76 Gb of HiFi reads with an N50 of 17,331 bp (Table 1).

Two ultra-long read libraries were prepared following Oxford Nanopore Technologies (ONT) standard protocols and sequenced on a PromethION platform (Oxford Nanopore Technologies Co., Littlemore, Oxford, UK). Raw ONT reads were quality-filtered using NanoFilt v2.8.019 to remove reads with a quality value (QV) < 7. After filtering, 72.04 Gb of clean ultra-long reads were retained, which had an average length of 40,256 bp and an N50 of 62,144 bp (Table 1).
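The N50 values reported for the HiFi and ONT reads follow the usual definition: the length L such that reads of length L or longer together contain at least half of all sequenced bases. A minimal Python sketch of that calculation is given below. The function name and the toy read lengths are illustrative assumptions and are not taken from the actual dataset.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    account for at least half of the total number of bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical read lengths (bp), for illustration only.
toy_reads = [2, 3, 4, 5, 6]
print(n50(toy_reads))
```

The same function applies unchanged to contig or chromosome lengths when summarizing an assembly.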

For the high-throughput chromosome conformation capture (Hi-C) sequencing, a Hi-C library was constructed using a GrandOmics Hi-C kit (GrandOmics, Wuhan, Hubei, China) according to the manufacturer’s protocol. This library was sequenced on a DNBSEQ-T7 platform (MGI) using the 150-bp paired-end sequencing mode to generate 106.47 Gb of raw data. These raw reads were filtered using fastp v0.12.617 to remove adapters and low-quality sequences, resulting in 104.07 Gb of clean data (Table 1) for subsequent chromosome-level genome assembly.

RNA extraction and transcriptome sequencing

Total RNA was extracted from four tissues (Table 1) separately according to a standard Trizol protocol (Invitrogen, Frederick, MD, USA), followed by purification with a Qiagen RNeasy Mini Kit (Qiagen, Germantown, MD, USA). RNA concentration and integrity were measured using a NanoDrop 8000 Spectrophotometer (Thermo Fisher Scientific) and an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA), respectively. Only RNA samples with OD260/280 ≥ 1.8 and RNA integrity ≥ 7.0 were selected for transcriptome sequencing. A cDNA library was constructed from the RNA following the manufacturer’s guidelines and then sequenced on a HiSeq X Ten platform (Illumina, San Diego, CA, USA). A total of 48.07 Gb of raw transcriptome data were generated (Table 1), which aided in the annotation of protein-coding genes and prediction of gene structures.

Genome-size estimation and construction of a T2T genome assembly

To estimate the genome size of the maroon clownfish, k-mer counting was performed using Jellyfish v2.2.1020 with a k-mer length of 21. The analysis parameters were configured as follows: -m 21 -s 10G -C. The k-mer frequency histogram generated by Jellyfish was used as the input for GenomeScope v2.021, a tool designed to infer genome-wide genetic characteristics from k-mer distribution patterns. This workflow allowed sequence-based estimation of genome features prior to de novo assembly, avoiding potential biases introduced by assembly artifacts. As shown in Fig. 1B, the genome size of the maroon clownfish was estimated to be approximately 842.2 Mb, with an estimated heterozygosity of 0.387%.
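The idea behind k-mer-based size estimation can be illustrated with a deliberately simplified calculation: the genome size is approximated as the total number of non-error k-mers divided by the depth of the main peak in the histogram. The sketch below is not the GenomeScope model, which fits a mixture model that also accounts for heterozygosity and repeats. The function names, the error cutoff `min_depth`, and the toy histogram are assumptions made for illustration. The parser expects the two-column "depth count" lines that a Jellyfish histogram file contains.

```python
def parse_histo(lines):
    """Parse 'depth count' lines into a {depth: count} dictionary."""
    histogram = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        histogram[int(fields[0])] = int(fields[1])
    return histogram


def estimate_genome_size(histogram, min_depth=5):
    """Naive estimate: total solid k-mers / main peak depth.

    k-mers seen fewer than `min_depth` times are treated as
    sequencing errors and ignored (an assumed, adjustable cutoff).
    Returns (estimated_size, peak_depth).
    """
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram, for illustration only (not the real data).
toy_lines = ["1 1000", "2 200", "19 50", "20 100", "21 50", "40 10"]
size, peak = estimate_genome_size(parse_histo(toy_lines))
print(size, peak)
```

In a heterozygous genome the tallest peak may be the heterozygous one rather than the homozygous one, which is one reason a model-based tool is preferred over this shortcut for real data.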

Primary contigs were initially constructed via de novo assembly of PacBio HiFi reads and ONT ultra-long reads using Hifiasm v0.19.822 with default parameters. Subsequent to contig generation, the purge_dups v1.2.523 pipeline (--low 65 --mid 115 --high 150 --min-length 1000) was applied to filter out haplotypic duplications and heterozygous redundant sequences from the initial de novo assembly. This quality control step resulted in a genome assembly with a total length of 884.09 Mb.

Using this preliminary genome assembly as a reference, Hi-C clean reads were utilized to perform chromosome-level scaffolding. The workflow was implemented as follows: (1) Hi-C clean reads were mapped to the assembled contigs using Bowtie2 v2.2.524 with the following parameter settings: --very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end. (2) The HiC-Pro v2.8.1 pipeline25 was applied to detect Hi-C ligation products, with only valid paired-end reads retained for downstream analysis to ensure high reliability of interaction signals. (3) Based on the valid Hi-C reads, the preliminary contig assembly was clustered into chromosome groups, ordered, and oriented to construct chromosome-level scaffolds using Juicer v1.526 and 3D-DNA v3.027. Juicebox v1.11.0828 was employed to visualize the chromosome-level assembly. Manual adjustments were further conducted to correct misclusters, misorders, or misorientations in the candidate chromosome assemblies, ensuring good consistency with Hi-C interaction patterns.

To eliminate residual gaps in the chromosome-level assembly, corrected ultra-long ONT reads were used for gap filling. Two complementary tools were applied: TGS-GapCloser v1.2.129 with optimized parameters (--min_match 1000 --min_nread 3), and LR_Gapcloser v1.030 with the parameter settings -t 35 -m 1000000 -v 500. The final gap-free genome assembly of the maroon clownfish has a total length of 884.39 Mb (Table 2), with all sequences anchored onto 24 chromosomes (Chr; Fig. 2). The longest Chr is 45.43 Mb, while the shortest Chr is 26.18 Mb (Table 3). The number of assembled chromosomes (haploid number, n = 24) is consistent with the previously established diploid karyotype (2n = 48) of this species31.

Table 2 Summary of the genome assembly for maroon clownfish.
Fig. 2

T2T genome assembly of the maroon clownfish. (A) Genome-wide chromatin interactions at a 500-kb resolution. Color blocks represent the corresponding interactions, with strengths ranging from white (low) to red (high). (B) A Circos plot of the main genomic features. From outside to inside, the tracks show (1) the 24 chromosomes, (2) gene density, (3) GC content, (4) repetitive sequence density, and (5) collinear relationships among chromosomes of the maroon clownfish genome assembly. The density calculation window is 100 kb.

Table 3 Telomere and centromere positions in the assembled genome.

Identification of centromere and telomere sequences

Telomeres were annotated by searching for the conserved telomeric repeat motif (CCCTAA/TTAGGG) at both terminal regions of each chromosome using the Telomere-to-Telomere (T2T) Toolkit quarTeT v1.1.132. To annotate centromeres in maroon clownfish, we first identified repetitive sequences using TRF v4.0.433 and RepeatMasker v4.0.634, generating a comprehensive transposable element (TE) annotation file. This TE annotation was then used as an input for quarTeT v1.1.132 to predict candidate centromeric intervals based on the enrichment of centromere-associated repetitive elements. Collectively, our annotations confirmed that the assembled genome contains a complete set of 24 centromeres and 48 telomeres (Fig. 3), consistent with the previously established diploid karyotype for this species31.
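The telomere search described above can be pictured with a small Python sketch that counts how many bases in the terminal window at each chromosome end are covered by tandem copies of the TTAGGG/CCCTAA motif. This is only an illustration of the principle and not the quarTeT implementation. The function name, the window size, and the requirement of at least two consecutive motif copies are assumptions.

```python
import re

# Two or more consecutive copies of the telomeric motif on either strand.
TELOMERE_PATTERN = re.compile(r"(?:TTAGGG){2,}|(?:CCCTAA){2,}")


def terminal_telomere_bases(seq, window=100):
    """Return (left, right): bases covered by telomeric repeats
    within the first and last `window` bases of a sequence."""
    seq = seq.upper()

    def covered(region):
        return sum(m.end() - m.start()
                   for m in TELOMERE_PATTERN.finditer(region))

    return covered(seq[:window]), covered(seq[-window:])


# Toy chromosome with repeats at both ends, for illustration only.
toy_chr = "CCCTAA" * 5 + "ACGTACGTAC" * 10 + "TTAGGG" * 4
print(terminal_telomere_bases(toy_chr, window=30))
```

A chromosome would be counted as telomere-to-telomere in this toy setting when both returned values exceed some chosen threshold. Applying that to all 24 chromosomes corresponds to the 48 telomeres reported in the text.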

Fig. 3

Genome-wide localization of repetitive elements (REs), telomeres, and centromeres. Triangles at both ends of each chromosome represent the telomere regions, and the gully area within each chromosome stands for a centromere region.

Annotation of repeat elements

To annotate repetitive elements (REs) in the maroon clownfish genome, tandem repeats were identified using Tandem Repeats Finder (TRF) v4.0.433 and Genome-wide Microsatellite Analyzing Tool Package (GMATA) v2.235. Specifically, TRF was employed to detect simple sequence repeats (SSRs), while GMATA was applied to characterize all tandem repeat families across the entire assembled genome, ensuring comprehensive coverage of tandemly repeated sequences.

For transposable element (TE) annotation, a combined strategy of homology-based and de novo prediction was adopted to minimize false negatives and improve annotation accuracy. In the former approach, RepeatMasker v4.0.6 and RepeatProteinMask v4.0.634 were utilized: RepeatMasker aligned the assembled genome against the RepBase database (v26.04) to identify known TE families, while RepeatProteinMask detected TE-related sequences via homology to conserved TE protein domains. In the latter approach, RepeatModeler v1.0.836 and LTR_FINDER v1.0.637 were used to construct a de novo repeat library with candidate TE consensus sequences. This custom library was then used as a reference for RepeatMasker to annotate de novo-identified REs. Finally, the integrated annotation revealed a total of 296.37 Mb of repetitive sequences in the maroon clownfish genome, accounting for 33.51% of the assembled genome (Table 4, Fig. 3). Among these REs, DNA transposons were the most abundant component, representing 17.49% of the genome (154.69 Mb), followed by long interspersed nuclear elements (LINEs; 6.84%, 60.48 Mb) and long terminal repeats (LTRs; 3.29%, 29.08 Mb) (Table 4).
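As a quick consistency check, the percentages quoted above can be recomputed from the reported lengths. The short Python sketch below assumes that the denominator is the final assembly length of 884.39 Mb. The variable and function names are illustrative.

```python
ASSEMBLY_MB = 884.39  # final assembly length reported in the text

# Repeat lengths (Mb) as reported in the text.
repeat_mb = {
    "All repeats": 296.37,
    "DNA transposons": 154.69,
    "LINEs": 60.48,
    "LTRs": 29.08,
}


def percent_of_assembly(length_mb, total_mb=ASSEMBLY_MB):
    """Return a length as a percentage of the assembly, to 2 decimals."""
    return round(100 * length_mb / total_mb, 2)


for name, length in repeat_mb.items():
    print(f"{name}: {percent_of_assembly(length)}%")
```

Under that assumption, the recomputed values agree with the percentages given for total repeats, DNA transposons, LINEs, and LTRs.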

Table 4 Classification of repetitive sequences in maroon clownfish genome.

Prediction and functional annotation of protein-coding genes

Prior to protein-coding gene prediction and structural annotation, repetitive regions in the assembled genome were masked to minimize interference from repetitive sequences on gene model construction. An integrated strategy of de novo prediction, homology-based annotation, and RNA-seq-supported annotation was applied for comprehensive protein-coding gene annotation. First, AUGUSTUS v3.2.138 and GlimmerHMM v3.0.439 were employed to perform ab initio gene structure prediction. Second, GeMoMa v1.6.440 was employed to transfer gene annotations from five evolutionarily related fish species. For this step, homologous protein sequences were downloaded from the NCBI database for the following species: Amphiprion ocellaris (clown anemonefish, GCA_022539595.1), Amphiprion clarkii (yellowtail clownfish, GCA_027123335.1), Lates calcarifer (Asian seabass, GCA_051027255.1), Perca flavescens (yellow perch, GCA_004354835.1) and Stegastes partitus (bicolor damselfish, GCA_000690725.1). These protein sequences were aligned to the masked genome to guide homology-based gene structure prediction. Third, RNA-seq data derived from the four tissues (Table 1) were assembled into transcript contigs with Trinity v2.5.141 using default parameters. These assembled transcripts were then used as evidence for gene structure refinement via PASA v2.3.342. Finally, gene models generated from the three approaches were integrated using the Evidence Modeler (EVM) pipeline v1.043, with weights assigned to each evidence type (RNA-seq > homology > de novo) to prioritize reliable gene structures.

This integrated annotation yielded a total of 24,556 high-confidence protein-coding genes, with an average gene length of 19.72 kb and an average coding sequence (CDS) length of 1,734.11 bp (Table 5). Completeness of the predicted gene set was evaluated using Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.4.744 against the actinopterygii_odb10 database, revealing that 99.12% of the complete BUSCO orthologs were successfully recovered, which confirms the high quality of the gene annotation.

Table 5 Summary of the predicted gene structures.

Functional annotation of these predicted protein-coding genes was conducted using Blastp v2.2.2645, with deduced protein sequences aligned against five publicly available databases, including NCBI Non-Redundant Protein Sequence (NR) database, Swiss-Prot46, Gene Ontology (GO)47 (for functional classification), Kyoto Encyclopedia of Genes and Genomes (KEGG)48 (for pathway annotation), and EuKaryotic Orthologous Groups (KOG)49 (for orthologous gene family classification). All alignments were performed with an E-value threshold of 1e-5 to ensure significant sequence homology. In total, 23,361 protein-coding genes (95.13% of the total predicted gene set) were successfully annotated with at least one functional hit in at least one of the five searched databases (Table 6).

Table 6 Functional annotation of predicted protein-coding genes.

Data Records

Sequencing data (including MGI, PacBio, ONT, Hi-C, and transcriptome sequencing) and the assembled genome of the maroon clownfish were deposited in the National Center for Biotechnology Information (NCBI) database under the BioProject accession number PRJNA130608050. Raw sequencing reads are available in the Sequence Read Archive (SRA) under accession number SRP61780451. The genome assembly, predicted coding sequences, and functional annotation files of the maroon clownfish were deposited in Figshare (https://doi.org/10.6084/m9.figshare.30104977)52. The complete genome assembly has also been deposited at the NCBI under accession number GCA_053813585.153.

Technical Validation

To evaluate the quality of maroon clownfish genome assembly, four complementary approaches were employed (Table 7). First, BUSCO v5.4.744 was employed to evaluate assembly completeness against the actinopterygii_odb10 database. Our results showed that 99.98% of the complete BUSCOs were recovered, including 99.73% as single-copy complete genes (S) and 0.25% as duplicated complete genes (D), confirming high overall completeness of the assembly. Second, Merqury v1.32854, a k-mer-based quality evaluation tool, was applied to assess base-level accuracy and assembly completeness. K-mers were generated from MGI short reads and PacBio HiFi long reads, respectively. This analysis yielded a Quality Value (QV) of 44.37 (short reads) and 71.01 (long reads), indicating high base-level accuracy of the assembly. Third, Clipping information for Revealing Assembly Quality (CRAQ) v1.0955, a tool leveraging read clipping signals to quantify assembly accuracy, was used with PacBio HiFi and Illumina reads. The analysis determined a Reference-based Assembly Quality Indicator (R-AQI) of 97.93 and a Sample-based Assembly Quality Indicator (S-AQI) of 98.98, further verifying high accuracy of the assembled genome. Fourth, we mapped the sequencing data to the assembled genome using bwa v0.7.1756 and minimap2 v2.2657, which demonstrated mapping rates of 99.46% for the MGI data, 99.99% for the PacBio data, and 98.43% for the ONT data. Collectively, these quality metrics confirm that the maroon clownfish genome assembly is of high quality.
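The consensus quality values above are Phred-scaled, so a QV relates to the per-base error probability E by QV = -10 log10(E). The following Python sketch converts between the two to make the reported numbers easier to interpret. The function names are illustrative.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if error_rate <= 0:
        raise ValueError("error_rate must be positive")
    return -10 * math.log10(error_rate)


# QVs reported in the text for short-read and long-read k-mers.
for label, qv in [("short reads", 44.37), ("long reads", 71.01)]:
    print(f"{label}: QV {qv} -> error rate {qv_to_error_rate(qv):.2e}")
```

On this scale, a QV of 44.37 corresponds to roughly 3.7 × 10^-5 errors per base, and a QV of 71.01 to about 8 × 10^-8 errors per base.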

Table 7 Assessment metrics of the genome assembly and annotation.

For the predicted protein-coding genes, additional quality evaluations were performed. BUSCO assessment against the actinopterygii_odb10 database showed a completeness value of 99.12% (Table 7), indicating comprehensive gene prediction. To validate annotation accuracy, transcriptome data were aligned to the assembled genome using STAR v2.7.11b58, and exonic coverage was calculated using BEDTools v2.29.259. Our results showed that 95.89% of exonic regions were covered by sequencing reads, further confirming the high accuracy of the gene annotation (Table 7).
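One way to think about the exonic coverage metric is as the fraction of exon bases that overlap at least one read-covered interval. The Python sketch below illustrates this on a single sequence with half-open (start, end) intervals. It is an assumption about how such a fraction can be defined and not a reproduction of the BEDTools commands actually used. All names and intervals are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_fraction(exons, covered):
    """Fraction of exon bases overlapped by read-covered intervals."""
    exons = merge(exons)
    covered = merge(covered)
    total = sum(end - start for start, end in exons)
    if total == 0:
        raise ValueError("exons must span at least one base")
    overlap = 0
    for e_start, e_end in exons:
        for c_start, c_end in covered:
            overlap += max(0, min(e_end, c_end) - max(e_start, c_start))
    return overlap / total


# Toy intervals on one chromosome, for illustration only.
toy_exons = [(0, 100), (200, 300)]
toy_covered = [(50, 250)]
print(covered_fraction(toy_exons, toy_covered))
```

Merging both interval sets first ensures that overlapping exons from alternative gene models and overlapping read alignments are not counted twice.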