Background & Summary

Coral reefs are among the most biologically diverse and productive marine ecosystems in the world, providing an ideal habitat for marine life1. As key consumers, coral reef fishes play a crucial role in cycling carbon (C), nitrogen (N), and phosphorus (P), as well as in processes such as biomass production, herbivory, and piscivory (secondary consumption)2. These functions are intrinsically linked, and the combined actions of reef fishes enhance ecosystem resilience and help maintain ecological balance. However, due to the combined effects of marine heatwaves, ocean acidification, pollution, and overfishing, global coral reef coverage has declined significantly in recent decades3,4. This degradation has reduced reef fish biomass and altered community structure5. For instance, on Australia’s Great Barrier Reef, a mass coral bleaching event in 2016 killed about 30% of coral communities and was followed by a continuing decline in local fish species richness6. Recent studies have demonstrated that the spatial covariation patterns of herbivore functional roles can significantly affect coral reef resilience. Following coral mortality events, browsers play a more critical role than grazers in removing established macroalgae from reef substrates7. As shifts from coral to macroalgal dominance become increasingly common, herbivorous reef fishes have received growing attention for their role in regulating competition between corals and algae8.

Herbivorous fishes are considered key contributors to maintaining coral reef health, as they regulate the competition between algae and scleractinian corals for substrate space by controlling benthic algae, thereby enhancing reef resilience and preventing phase shifts9,10,11. Among them, Acanthuridae and Scaridae are prominent herbivores in coral reef ecosystems12. Acanthurus nigrofuscus is recognized for its broad dietary range, which enables it to thrive under various environmental conditions, including degraded coral reefs13. In contrast, Ctenochaetus striatus primarily feeds on detritus; however, the accumulation of fine sediments can reduce its feeding efficiency, thereby diminishing its critical role in sediment removal and redistribution within coral reef ecosystems14,15. Species such as Scarus rivulatus and Scarus taeniopterus prefer to feed on crustose coralline algae and the epilithic algal matrix associated with these substrates. This feeding behavior suppresses the accumulation of tall filamentous and late-successional macroalgae, maintaining early-stage algal communities dominated by short filamentous algae and crustose coralline algae, which do not inhibit coral growth16,17. A previous study found that the growth rate of Caribbean coral reefs declined in both prehistoric and historical periods as parrotfish declined18. Thus, parrotfish play a critical role in maintaining coral-dominated reef habitats, and restoring parrotfish populations is urgently needed for reef persistence.

The family Scaridae comprises 10 genera and 90 recognized species, of which the genus Scarus is the largest and most diverse, accounting for about 50 species19. Parrotfishes of the genus Scarus are primarily scrapers owing to their specialized mouth structure, the fused dental plate. Their beaks are strong enough to grind hard corals and rocks20. Among them, S. rivulatus is a widely distributed species on Indo-Pacific coral reefs, recognized for its crucial ecological functions and increasing commercial exploitation21. Owing to rising demand from fisheries and the aquarium trade, populations of S. rivulatus are facing mounting pressure from overfishing, which could compromise their ecological role in reef systems22. Despite its ecological and economic importance, genetic and genomic resources for S. rivulatus remain limited, constraining our understanding of its adaptive capacity in the face of ongoing coral reef decline.

In recent years, genomics has become an increasingly important tool in conservation biology for understanding the genetic diversity of threatened species. For economically significant species, high-quality reference genomes are essential foundational genetic resources, which also hold considerable value for applications in aquaculture. In this study, we constructed a high-quality, chromosome-level genome assembly of S. rivulatus by integrating Illumina short-read sequencing, Nanopore long-read sequencing, and high-throughput chromosome conformation capture (Hi-C) technology. The final assembly consisted of 24 chromosomes, with a total length of 1.58 Gb and a scaffold N50 of 67.2 Mb. We annotated 41,823 protein-coding genes, of which 73.91% (30,910 genes) were functionally annotated. Repetitive elements accounted for 48.84% of the genome, with DNA and LTR elements being particularly abundant. This reference genome fills an important gap in S. rivulatus genomic resources, providing a fundamental basis for exploring the genetic mechanisms underlying the adaptation of parrotfish to current reef degradation and supporting the conservation and restoration of both parrotfish populations and coral reef ecosystems.

Methods

Sample collection and DNA extraction

A single adult female S. rivulatus specimen was collected from Xincun Harbor, Hainan Province, China, in May 2019. Muscle tissue was excised and immediately snap-frozen in liquid nitrogen before storage at –80 °C. High-quality genomic DNA (gDNA) was extracted from freshly harvested muscle using the DNeasy Blood & Tissue Kit (Qiagen, Hilden, Germany) following the manufacturer’s protocol23. The integrity of the DNA was verified by 1% agarose gel electrophoresis, and DNA concentrations were quantified using the Quant-iT™ PicoGreen® dsDNA assay (Thermo Fisher Scientific, Waltham, MA, USA).

Short-read library construction and sequencing

According to the method described previously24, gDNA was sheared to an average fragment size of 300–500 bp using a Covaris 2000 Ultrasonicator (Covaris, USA) for short-read sequencing. Fragmented DNA was size-selected, end-repaired, and PCR-amplified to produce sequencing libraries. The prepared libraries were sequenced on an Illumina HiSeq 2500 platform (Illumina, San Diego, CA, USA) in paired-end mode (150 bp), generating approximately 96 Gb of raw data.

Long-read library construction and sequencing

For long-read sequencing, high-molecular-weight genomic DNA was size-selected (~20 kb) using the BluePippin system (Sage Science, USA). Library preparation followed the 1D Ligation Sequencing Kit (SQK-LSK109) protocol (Oxford Nanopore Technologies, UK). The final library concentration was measured using a Qubit 3.0 fluorometer (Thermo Fisher Scientific). Sequencing was carried out on a single flow cell of the PromethION platform (Oxford Nanopore Technologies), yielding approximately 196 Gb of raw data.

Hi-C library construction and sequencing

A Hi-C library was also prepared from the same genomic DNA sample to enable chromosome-level scaffolding. Following a previously described standard protocol with specific modifications25, we digested the DNA with MboI and enriched the resulting biotin-labeled Hi-C fragments using streptavidin C1 magnetic beads. Subsequent library preparation involved adding A-tails to the fragment ends and ligating them with Illumina PE sequencing adapters. The final libraries were amplified by PCR and sequenced on the Illumina HiSeq X Ten to generate 150 bp paired-end reads. In total, 63 Gb of Hi-C sequencing data were obtained.

RNA library construction and sequencing

Total RNA was extracted from ten tissues (fins, gonads, heart, intestines, blood, liver, muscles, brain, spleen, and kidneys) and treated with DNase I (Thermo Fisher Scientific, Wilmington, DE, USA) to remove any genomic DNA contamination24. The integrity of the RNA from each tissue was verified using a Bioanalyzer 2100 (Agilent Technologies, Santa Clara, USA). RNA-sequencing libraries with a 300 bp insert size were constructed for each sample and sequenced on the Illumina HiSeq platform in 150 bp paired-end mode. In total, 90 Gb of raw data were generated.

Sequencing data processing and genome survey

The Illumina short-read data were first assessed for quality using FastQC (v0.11.9)26, and low-quality reads and adapters were removed with SOAPnuke (v2.X)27. After filtering, the clean reads were used for genome size and heterozygosity estimation for S. rivulatus. K-mer frequency analysis was conducted with GCE (v1.0.2)28 using a k-mer size of 17. The resulting k-mer frequency distribution (Fig. 1) showed a major peak at a depth of 80. After excluding low-frequency k-mers, the genome size was estimated using the formula: genome size = total k-mer count/peak depth. The final estimated genome size was 1.55 Gb, with a heterozygosity rate of 0.77%.
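The genome size formula above can be illustrated with a minimal Python sketch. It is not the GCE implementation, which applies a more detailed model; the function name, the toy histogram, and the `min_depth` cutoff used to exclude low-frequency k-mers are all hypothetical choices made for illustration.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as total k-mer count / peak depth.

    histogram maps a k-mer depth to the number of distinct k-mers
    observed at that depth. Depths below min_depth (a hypothetical
    cutoff) are treated as sequencing errors and excluded.
    """
    kept = {d: n for d, n in histogram.items() if d >= min_depth}
    if not kept:
        raise ValueError("no k-mers left after filtering")
    # The main peak is the depth with the most distinct k-mers.
    peak_depth = max(kept, key=kept.get)
    # Total k-mer count = sum over depths of depth * distinct k-mers.
    total_kmers = sum(d * n for d, n in kept.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (illustrative values only, not data from this study).
toy_histogram = {1: 1000, 2: 300, 9: 10, 10: 50, 11: 12}
toy_size, toy_peak = estimate_genome_size(toy_histogram)
```

With the toy histogram, the error k-mers at depths 1 and 2 are dropped, the peak is found at depth 10, and the estimate is 722 / 10 = 72.2 bases.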

Fig. 1 K-mer frequency distribution of the S. rivulatus genome. 17-mer frequency distribution generated from S. rivulatus Illumina data, with k-mer depth plotted on the x-axis and frequency on the y-axis.

De novo genome assembly

The S. rivulatus genome was assembled de novo from Nanopore long reads together with Illumina short reads. First, an initial assembly was generated from the Nanopore data with NextDenovo (v2.5.0) (https://github.com/Nextomics/NextDenovo) under default parameters. Then, to improve base-level accuracy, the assembly was polished with the Illumina data using NextPolish (v1.4.1)29. The final contig-level assembly comprised 495 contigs, with a total length of 1.71 Gb, an N50 of 18.5 Mb, and a GC content of 39.36%. The longest contig reached 60.9 Mb (Table 1).
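For readers unfamiliar with the N50 statistic reported here, the following is a minimal Python sketch of the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The function name and the toy lengths are illustrative assumptions, not values from this study.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example: total = 30, and 10 + 8 = 18 covers at least half of it.
toy_n50 = n50([10, 8, 6, 4, 2])  # 8
```

The same function applies to contig and scaffold lengths alike; only the input list changes.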

Table 1 Statistics of the assembled genome for S. rivulatus.

The Hi-C sequencing data were used to build the chromosome-level assembly of the S. rivulatus genome. Initially, low-quality and duplicate Hi-C raw reads were removed using Trimmomatic (v0.39)30. The resulting high-quality reads were aligned to the contig-level assembly with Juicer (v1.6)31. Chromosome-level scaffolds were then generated from the genomic proximity information captured by the Hi-C data. To further scaffold the genome, the 3D-DNA32 pipeline was applied, followed by manual correction of misassemblies using Juicebox (v1.11.08)33. The final chromosome-level assembly reached a total size of 1.58 Gb, with a scaffold N50 of 67.17 Mb (Table 1). It comprised 24 chromosomes (Fig. 2), ranging from 36.08 Mb to 81.71 Mb in length, with an average chromosome size of 66.02 Mb (Table 2). The GC content of the final assembly was 39.31%. Notably, the total assembled genome size (1.58 Gb) closely matched the genome size estimated in the genome survey (1.55 Gb), supporting the integrity and completeness of this assembly.

Fig. 2 Assembly results of the S. rivulatus genome. (A) Heat map of interaction intensity between chromosome sequences anchored by Hi-C. The width of each column reflects the relative length of the corresponding chromosome, while the intensity of the red color indicates the contact density. (B) Circos plot of the genome features. From the outermost to the innermost rings: (a) chromosomes, (b) GC content, (c) repeat content, and (d) gene density. Chromosome lengths and positions were used to draw the outermost ring. GC content was estimated by dividing the genome into 100 kb windows and calculating the GC ratio within each window, and its distribution is displayed in the second ring. Repeat content was estimated by calculating the overlap between each window and known repetitive regions, and is shown in the third ring. Gene density was estimated by counting the number of genes within each window and is shown in the innermost ring, reflecting the distribution of genes across the genome.
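The windowed GC calculation described in the caption can be sketched in a few lines of Python. This is only an illustration of the idea, not the script used to draw the figure: the function name is hypothetical, and excluding non-ACGT characters (such as N) from the denominator is an assumption.

```python
def gc_by_window(seq, window=100_000):
    """Return (window start, GC ratio) pairs for fixed-size windows."""
    seq = seq.upper()
    ratios = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        acgt = gc + chunk.count("A") + chunk.count("T")
        # Assumption: ambiguous bases (e.g. N) are ignored entirely.
        ratios.append((start, gc / acgt if acgt else float("nan")))
    return ratios


# Toy sequence with a tiny window so the result is easy to follow.
toy_gc = gc_by_window("GGCCAATTGCGC", window=4)
# [(0, 1.0), (4, 0.0), (8, 1.0)]
```

Repeat content and gene density per window follow the same pattern, with the per-window value replaced by overlap length or gene count.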

Table 2 Chromosome length information of S. rivulatus.

Repetitive sequences annotation

Prior to annotating protein-coding genes, repetitive regions in the genome were masked using a combination of de novo and homology-based methods. We constructed an S. rivulatus-specific repeat library using RepeatModeler (v2.0.3)34. Initially, this library contained 3,173 consensus sequences, among which 2,510 were categorized as unknown transposable element (TE) families. These unknown sequences were subsequently classified using DeepTE35, resulting in a reduction to 566 consensus sequences.

With this refined consensus sequence library, we annotated repetitive regions in the S. rivulatus genome using RepeatMasker (v4.1.2)36. This approach revealed that TEs make up 48.84% of the genome. The most prevalent DNA transposon family was hobo-Activator, comprising 8.09% of the genome, followed by the Tc1 family (7.77%). Among LTR elements, Gypsy/DIRS1 retrotransposons accounted for 7.41% (Table 3). Overall, S. rivulatus possesses a notably high proportion of TEs.

Table 3 Repetitive element annotations in the genome of S. rivulatus.

Additionally, we calculated the Kimura two-parameter divergence (K) using the calcDivergenceFromAlign.pl script from RepeatMasker (v4.1.2). The insertion time of each consensus sequence was estimated using the formula T = K/2r, where K denotes the divergence calculated by the script and r represents the neutral mutation rate for teleosts (2.5 × 10⁻⁹ substitutions/site/year). Our findings indicated that, although S. rivulatus has a high overall TE content, most of these TEs underwent significant expansions approximately 10 million years ago (Mya) (Fig. 3A). Active TE expansion currently appears to be limited.
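As a worked illustration of T = K/2r, the short Python sketch below converts a divergence value into an insertion time. The function name is hypothetical, and K is assumed to be expressed as a fraction (substitutions per site); a divergence reported as a percentage would first need to be divided by 100.

```python
# Neutral mutation rate for teleosts used in the text
# (substitutions/site/year).
TELEOST_NEUTRAL_RATE = 2.5e-9


def insertion_time_years(k, r=TELEOST_NEUTRAL_RATE):
    """Estimate insertion time in years as T = K / (2r).

    k is the Kimura divergence as a fraction (assumption), and r is
    the neutral mutation rate per site per year.
    """
    if r <= 0:
        raise ValueError("mutation rate must be positive")
    return k / (2 * r)


# An illustrative divergence of 0.05 (5%) corresponds to about
# 10 million years under this rate.
example_time = insertion_time_years(0.05)
```

Under this rate, each 1% of divergence corresponds to roughly 2 million years, so an expansion near 10 Mya would appear around 5% divergence.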

Fig. 3 Repeat and protein-coding gene annotations of the S. rivulatus genome. (A) Distribution of divergence rate for each type of TE in the S. rivulatus genome. (B) Venn diagram of the functionally annotated protein-coding genes based on different databases.

Gene prediction and functional annotation

A repeat sequence library was built using RepeatModeler (v2.0.3) and applied with RepeatMasker (v4.1.2) to identify repetitive elements in the S. rivulatus genome. Redundant and overlapping sequences were removed to improve accuracy. The resulting masked genome was used for subsequent gene annotation.

To comprehensively predict protein-coding genes in the assembled genome, three complementary strategies were employed. First, de novo gene prediction was conducted using the self-training mode of Augustus (v3.4.0)37, with subsequent annotation refinement performed via the SNAP_to_GFF3.pl and augustus_GTF_to_EVM_GFF3.pl scripts from EVidenceModeler (EVM, v1.1.1)38. Second, for transcriptome-based prediction, RNA-seq data were aligned to the S. rivulatus genome with HISAT2 (v2.1.0)39 and the transcriptome was assembled with StringTie (v2.1.4)40. TransDecoder (v5.7.0, https://github.com/TransDecoder/TransDecoder) was then used to predict open reading frames (ORFs). Third, homology-based predictions were carried out by aligning protein sequences from Labrus bergylta, Cheilinus undulatus, Notolabrus celidotus, Sparus aurata, and Acanthopagrus latus, downloaded from the NCBI database (Table 4), to the S. rivulatus genome. These alignments were further analyzed using GeneWise41 to precisely determine exon-intron structures. Finally, the gene predictions from all three strategies were integrated using EVM to generate a high-confidence consensus set of protein-coding genes.

Table 4 Genomic information of the species used in annotation analysis.

Functional annotation of the predicted protein-coding genes in the S. rivulatus genome was conducted by aligning their protein sequences to commonly used protein databases, including SwissProt, NR, TrEMBL, COG, and KEGG, using BLAST (blastp) with an e-value cutoff of 1e-5. Motifs and domains were further annotated using InterProScan (v4.8)42, and partial non-coding RNAs were identified with Infernal (v1.1.2)43.

A total of 41,823 protein-coding genes were predicted in the S. rivulatus genome, among which 30,910 genes (73.91%) were functionally annotated in at least one of the utilized databases (Table 5). Specifically, 30,567 genes (73.09%) were annotated in the NR database, 30,557 genes (73.06%) in TrEMBL, 23,536 genes (56.28%) in SwissProt, 25,970 genes (62.10%) in KEGG, and 8,474 genes (20.26%) in COG (Fig. 3B). In addition to protein-coding genes, non-coding RNAs were comprehensively identified, including 1,794 tRNAs, 284 rRNAs, 543 snRNAs, and 338 miRNAs (Table 6).

Table 5 Statistics of gene functional annotation for S. rivulatus.
Table 6 Statistics of non-coding RNA annotation for S. rivulatus.

Data Records

The sequencing data and genome assembly have been submitted to public databases. The Illumina short-read sequencing, Nanopore long-read sequencing, Hi-C sequencing, and RNA-seq data have been deposited in the NCBI Sequence Read Archive (SRA) database under the accession number SRP600033 (ref. 44). The genome assembly has been deposited at NCBI GenBank under the accession GCA_051912175.1 (ref. 45). Moreover, the genome annotations, predicted coding sequences, and protein sequences have been deposited at Figshare (https://doi.org/10.6084/m9.figshare.29642444.v1)46.

Technical Validation

Evaluation of the genome assembly

To assess the completeness and quality of the genome assembly, we ran BUSCO (v5.4.4)47 against the Actinopterygii database, which contains 3,640 conserved single-copy orthologs. A total of 3,573 (98.16%) complete BUSCO genes were detected, including 3,455 complete single-copy genes and 118 complete duplicated genes. In addition, 21 (0.58%) genes were fragmented and only 46 (1.26%) were missing (Table 7). To evaluate the accuracy of the assembly, we aligned the Illumina short reads to the assembled genome using BWA (v0.7.17)48 and summarized the alignments using SAMtools (v1.13)49. In total, 99.66% of the short reads were successfully aligned, and 96.48% of them were correctly aligned, indicating high consistency between the reads and the assembly. These results suggest that the S. rivulatus genome assembled in this study is of high quality and completeness relative to other published teleost genomes.
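The BUSCO percentages reported above follow directly from the four category counts, as the minimal Python sketch below shows. The function and key names are hypothetical; only the counts come from the text.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Summarize BUSCO category counts as totals and percentages."""
    total = single + duplicated + fragmented + missing
    complete = single + duplicated

    def pct(n):
        # Percentage of all searched orthologs, to two decimals.
        return round(100 * n / total, 2)

    return {
        "total": total,
        "complete": complete,
        "complete_pct": pct(complete),
        "fragmented_pct": pct(fragmented),
        "missing_pct": pct(missing),
    }


# Counts reported for the S. rivulatus assembly.
summary = busco_summary(single=3455, duplicated=118, fragmented=21, missing=46)
# summary["total"] == 3640 and summary["complete_pct"] == 98.16
```

The four categories sum to the 3,640 orthologs in the database, which serves as a quick consistency check on the reported figures.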

Table 7 BUSCO analysis statistics in the genome of S. rivulatus.

Genome collinearity analysis

We performed genome synteny analysis between S. rivulatus and C. undulatus50 using JCVI (v1.4.21)51. The results revealed a strong collinearity between the two species, highlighting the high quality of the S. rivulatus genome assembly (Fig. 4).

Fig. 4 Genome synteny analysis between S. rivulatus and Cheilinus undulatus.

Ethics statement

The animal experiment was approved by the Committee of the South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences (Project No. 201810825825) SCSFRI96-253 and carried out according to applicable standards.