Background & Summary

Sea cucumbers, which belong to the phylum Echinodermata and the class Holothuroidea, are one of the largest groups of marine invertebrates and are found in diverse habitats on the sea floor worldwide1. More than 1,800 sea cucumber species are known globally2, of which over 80 hold significant economic value for both dietary and medicinal applications3. In addition, sea cucumbers play crucial roles in the marine ecosystem through bioturbation, organic matter processing, nutrient recycling, balancing seawater chemistry, supporting biodiversity and transferring energy through the food chain1. Sea cucumbers have also been reported to safeguard coral reefs by mitigating disease4. However, enormous commercial demand has led to overfishing, resulting in a significant decline in many sea cucumber populations5. Artificial breeding and the release of seedlings have therefore been used to restore wild sea cucumber resources6,7.

Stichopus monotuberculatus is a tropical sea cucumber that is commonly distributed in the coral reefs of the Indo-Pacific Ocean, ranging from the Red Sea and Madagascar to Easter Island, and from Japan to Australia (Fig. 1a)3,8. The adult S. monotuberculatus has a distinctive morphology: it typically grows to ~20 cm in length and displays a brownish-yellow body wall and a quadrangular shape (Fig. 1a)8,9. S. monotuberculatus resembles Stichopus horrens and Stichopus naso in appearance, and genetic barcoding indicates that S. monotuberculatus and S. horrens fall within the same clade10,11. The abundant high-quality protein and collagen fibers in the body wall of S. monotuberculatus make it highly valuable for consumption12, and it is regarded as a premium edible sea cucumber in Asia3. To meet commercial demand and restore wild resources damaged by overfishing, artificial spawning of S. monotuberculatus has recently been developed9, and related aquaculture techniques have been continually improved8,13,14. Moreover, research has been performed on taxonomic clarification10,15, sex determination16, genetic diversity17, food resources18, gut microbiota19 and immune mechanisms20,21,22,23,24,25 in S. monotuberculatus. Although mitochondrial DNA26, transcriptomic sequencing27, and a preliminary unassembled genome draft28 of S. monotuberculatus have been reported, there remains a significant gap in integrated genomic research, restricting a more comprehensive understanding of the evolutionary and genetic traits of this species.

Fig. 1

S. monotuberculatus and its genomic features. (a) Appearance and global distribution of S. monotuberculatus; (b) 21-mer frequency distribution in the S. monotuberculatus genome; (c) Hi-C contact heatmap of the S. monotuberculatus genome.

In this study, we combined Nanopore long-read sequencing, Illumina short-read sequencing, and Hi-C technology to produce a high-quality chromosome-level genome of S. monotuberculatus. The final assembly spans 810.54 Mb, with a contig N50 of 10.15 Mb and a scaffold N50 of 35.36 Mb, and consists of 23 pseudochromosomes with 99.82% assembly coverage. The Benchmarking Universal Single-Copy Orthologs (BUSCO) integrity assessment indicates that 97.8% of the conserved metazoan genes are complete. The S. monotuberculatus genome contains 29,596 protein-coding genes, of which 94.43% (27,948 genes) are annotated with functional information. This high-quality genome assembly will facilitate a deeper understanding of the genomic structure and genetic traits of S. monotuberculatus, laying a solid foundation for future wild resource conservation, genetic breeding and aquaculture. Furthermore, this study may also provide new insights into the evolution and ecological adaptation mechanisms of echinoderms.

Methods

Sample collection and nucleic acid extraction

All samples used in this study were obtained from adult female S. monotuberculatus, collected from the natural habitat at Tanmen Port, Qionghai City, Hainan Province (19.33° N, 110.49° E). The longitudinal muscles of a female specimen were excised, washed three times with phosphate-buffered saline (PBS), quickly frozen in liquid nitrogen, and subsequently stored at −80 °C. High-quality DNA was extracted from the longitudinal muscles using the QIAamp DNA Mini Kit (QIAGEN, Hilden, Germany) to facilitate both long-read and short-read whole-genome sequencing.

Transcriptomic sequencing was carried out on 9 tissues: the body wall, coelomocytes, intestine, muscle, oral tentacles, Polian vesicles, respiratory tree, rete mirabile and skin. The coelomocytes were harvested from coelomic fluid that was filtered through a 100-μm sterile nylon mesh and immediately centrifuged at 4 °C and 1,000 × g for 10 min. The other tissues were carefully separated using scissors and forceps, then processed in the same way as the muscle specimen. Total RNA was extracted using the RNAprep Pure Plant Plus Kit (Tiangen Biotech Co. Ltd., Beijing, China) for transcriptomic sequencing.

Library preparation and sequencing

For Illumina short-read sequencing, high-quality DNA was randomly fragmented using the Covaris ultrasonic disruptor (Woburn, MA, USA). Paired-end sequencing libraries with a 350 bp insert size were prepared using the Nextera DNA Flex Library Prep Kit (Illumina, San Diego, CA, USA). Sequencing was performed on the Illumina NovaSeq6000 platform. Low-quality reads were removed from the raw reads using SOAPnuke (v2.1.4)29, and the clean data were used for subsequent analyses. A total of 148.04 Gb of short-read data (equivalent to a coverage of 182.64×) were obtained (Table 1).

Table 1 Statistics of the sequencing data used for genome assembly.

For long-read Nanopore sequencing, libraries were prepared using the SQK-LSK110 ligation kit (Oxford Nanopore Technologies, Oxford, UK). The libraries were loaded onto primed R9.4 Spot-On Flow Cells and sequenced on a PromethION sequencer (Oxford Nanopore Technologies) with a 48-h run. Base-calling analysis of raw data was performed using the Oxford Nanopore GUPPY (v0.3.0)30. A total of 126.07 Gb of Nanopore continuous long-read data (equivalent to a coverage of 155.54×) were obtained (Table 1, Table 2).

Table 2 Statistics of the sequencing data for Nanopore.
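The coverage figures quoted for the Illumina and Nanopore data can be checked with a short calculation. The sketch below assumes that coverage is the data yield divided by the 810.54 Mb assembly size; this denominator is inferred from the arithmetic and is not stated explicitly in the text. The function name `sequencing_depth` is illustrative only.

```python
def sequencing_depth(data_gb: float, genome_mb: float) -> float:
    """Return mean sequencing depth (fold coverage).

    data_gb   -- sequencing yield in gigabases
    genome_mb -- genome (or assembly) size in megabases
    """
    if genome_mb <= 0:
        raise ValueError("genome size must be positive")
    return data_gb * 1000.0 / genome_mb


# Assumption: the assembly size is used as the denominator.
ASSEMBLY_MB = 810.54

illumina_depth = sequencing_depth(148.04, ASSEMBLY_MB)
nanopore_depth = sequencing_depth(126.07, ASSEMBLY_MB)

print(f"Illumina depth: {illumina_depth:.2f}x")
print(f"Nanopore depth: {nanopore_depth:.2f}x")
```

Under this assumption, both values round to the reported 182.64× and 155.54×.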

For Hi-C sequencing, freshly harvested muscle samples were cross-linked with formaldehyde and then digested with the DpnII enzyme (NEB, Ipswich, MA, USA). In situ Hi-C chromosome conformation capture was performed following the DNase-based method31. The libraries were sequenced in 150 bp paired-end mode on an Illumina NovaSeq, resulting in 194.07 Gb of clean data (Table 1; Table 3).

Table 3 Statistics of sequencing data from Hi-C.

For transcriptomic sequencing, 2 μg of total RNA was used per sample. Sequencing libraries were generated using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB), and an index code was added to each sample. Sequencing was performed for 9 tissues, employing a strand-switching approach on the Illumina HiSeq X Ten platform. In total, 174.53 Gb of clean data were generated.

Genome assessment and assembly

K-mer analysis was conducted to estimate the genome size, proportion of repetitive sequences and heterozygosity using Jellyfish (v2.3.0)30 and GenomeScope (v2.0.0)32. Jellyfish was used to count K-mers from the raw sequencing reads, with the K-mer size set to 21 and the hash size set to 100 M (Fig. 1b; Table 4). The analysis used 10 threads and accounted for both DNA strands. The resulting K-mer histogram was further analyzed with GenomeScope, setting the K-mer size at 21 and the ploidy at 2. After discarding K-mers with abnormal depth, the analysis yielded 120,544,041,989 K-mers, with a peak depth of 159 (Table 4). Based on these results, the S. monotuberculatus genome size was estimated to be 735.23 Mb, with about 1.86% heterozygosity and 29.67% repetitive sequences.

Table 4 K-mer frequency and genome size evaluation of the S. monotuberculatus genome.
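For orientation, the simplest K-mer-based size estimate divides the total number of K-mers by the depth of the main peak. The sketch below applies that ratio to the figures quoted in the text. It is a simplification and not the GenomeScope model; the helper name `naive_genome_size_mb` is hypothetical.

```python
def naive_genome_size_mb(total_kmers: int, peak_depth: int) -> float:
    """Estimate genome size in Mb as total K-mers / main peak depth.

    This ignores error K-mers, heterozygosity and repeat structure,
    all of which a fitted model such as GenomeScope accounts for.
    """
    if peak_depth <= 0:
        raise ValueError("peak depth must be positive")
    return total_kmers / peak_depth / 1e6


# Values quoted in the text for the 21-mer analysis.
estimate_mb = naive_genome_size_mb(120_544_041_989, 159)
print(f"Naive estimate: {estimate_mb:.2f} Mb")
```

The naive ratio gives roughly 758 Mb, somewhat above the reported 735.23 Mb. The reported value presumably reflects model fitting and corrections that this sketch does not reproduce.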

The initial long-read assembly of the genome was performed using SMARTdenovo33, followed by polishing with the Nanopore sequencing data using Racon (v1.4.11)34. Subsequent error correction was carried out with 148.04 Gb of Illumina sequencing data using Pilon (v1.23)35. Redundancy arising from heterozygosity was addressed using the Purge_haplotigs (v1.0.4)36 pipeline. Hi-C data were processed through the ALLHiC (v1.1)36 pipeline to organize scaffolds into chromosome-length sequences, supporting accurate placement of scaffolds into 23 pseudo-chromosomes (Fig. 1c; Table 3). The chromosomal architecture and diverse genomic features were visualized using Circos (v0.69)37 (Fig. 2; Table 5). The assembly spans 810.55 Mb across 42 scaffolds, with a scaffold N50 of 35.36 Mb (Table 6).

Fig. 2

Circos plot depicting the features of the S. monotuberculatus genome. From inner to outer circles: chromosomes, gene densities, repeat sequences, SNP percentage, and NGS sequencing depth. Average values for gene density, repeat sequences, SNP percentage and NGS sequencing depth are 0.600, 0.200, 0.019 and 173, respectively. The window size for all circles was 500 kb.

Table 5 Chromosome and corresponding statistical results after Hi-C assisted assembly.
Table 6 Assembly statistics of the S. monotuberculatus genome.
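The contig and scaffold N50 values reported here follow the usual definition: the length of the sequence at which the cumulative length, summed from the longest sequence downward, first reaches half of the total assembly length. A minimal sketch is given below; the toy lengths are hypothetical and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running sum to half
        # of the total length defines N50.
        if running >= half_total:
            return length


# Hypothetical scaffold lengths (arbitrary units) for illustration.
toy_lengths = [50, 40, 30, 20, 10]
print("N50 of toy assembly:", n50(toy_lengths))
```

In practice the lengths would be read from the assembly FASTA or its index file instead of being typed in.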

Repetitive element annotation

RepeatMasker38 (v 4.09 with RepBase 20181026) and EDTA39 (a whole-genome de novo TE annotation pipeline) were used to create high-quality TE annotations. The EDTA pipeline included LTRharvest, the parallel version of LTR_FINDER, LTR_retriever, GRF, TIR-Learner, HelitronScanner, and RepeatModeler, along with customized filtering scripts. Gene-like sequences and redundant results were removed using nucleotide coding sequences (CDS). Overall, sequences constituting 25.02% of the assembled genome were identified as repeats, of which the most abundant were DNA elements (20.3%), followed by long interspersed nuclear elements (LINEs, 3.24%), short interspersed nuclear elements (SINEs, 0.16%) and long terminal repeats (LTRs, 0.06%) (Fig. 3a; Table 7).

Fig. 3

Prediction and annotation of repetitive elements (a) and noncoding RNAs (b).

Table 7 Statistics of repetitive sequence of the S. monotuberculatus genome.

Noncoding RNA annotation

Ribosomal RNAs (rRNAs) and transfer RNAs (tRNAs) were predicted using Barrnap (v0.9) and tRNAscan-SE (v2.0.11)40, respectively, with default parameters. Other non-coding RNAs, such as small nuclear RNAs (snRNAs) and microRNAs (miRNAs), were identified by alignment against the Rfam database (v14.8)41, followed by annotation using Infernal (v1.1.4)42. A total of 19 miRNAs, 2,023 tRNAs, 444 rRNAs, and 202 snRNAs were identified in the S. monotuberculatus genome (Fig. 3b; Table 8).

Table 8 Statistics of non-coding RNA annotation.

Gene prediction and functional annotation

To obtain a high-quality gene set from the genome, three methods, namely RNA-seq, homology, and de novo prediction, were employed, and the results were integrated using MAKER3 (v3.01.03)43. For RNA-seq, a total of 28 samples from 9 different tissues were sequenced, generating 186.82 Gb of data (Table 9). The reads were aligned to the genome using HISAT2 (v2.2.1)44 and possible gene structures were assembled using StringTie (v2.1.7)45. For homology-based prediction, protein sets selected from 6 representative genomes, including Homo sapiens (GCA_000001405.29), Drosophila melanogaster (GCA_000001215.4), Strongylocentrotus purpuratus (GCF_000002235.5)46, Acanthaster planci (GCF_001949145.1)47, Apostichopus japonicus (GCA_037975245.1)48, and Holothuria leucospilota (GCA_029531755.1)49, were aligned to the S. monotuberculatus genome with BLASTx50. For the de novo method, genes were predicted using Augustus (v3.3.2)51 and Fgenesh, and integrated within MAKER3. Genes supported only by the de novo method were retained if they aligned to UniProt, as they may represent fast-evolving or novel genes in the species. A total of 29,596 genes were identified, with 79.4% supported by at least two methods, and 3,468 genes (12.5%) supported only by the de novo method. A comparison of the number and features of predicted genes in the S. monotuberculatus genome with those of 4 other echinoderm species, S. purpuratus46, A. planci47, A. japonicus48, and H. leucospilota49, is shown in Fig. 4a and Table 10.

Table 9 Statistics of tissue transcriptomic sequencing data.
Fig. 4

Gene prediction and annotation, and gene family analysis. (a) Prediction and annotation of genes; (b) Venn diagram of common and unique gene families in 4 sea cucumbers; (c) Phylogenetic and gene family evolution analysis among 9 echinoderm species. The scale below represents the divergence time. The numbers of expanded (+green) and contracted (−red) gene families are shown alongside the species.

Table 10 Comparison of the number and features of predicted genes in the S. monotuberculatus genome with those of four other echinoderm species.
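The retention rule applied to the integrated gene set (keep genes with only de novo support when they align to UniProt) can be illustrated with a minimal sketch. The evidence labels, the `keep_gene` helper and the toy records are hypothetical and are not part of the MAKER3 workflow. The sketch also assumes that any gene with RNA-seq or homology support is retained, which the text implies but does not state.

```python
def keep_gene(evidence, has_uniprot_hit):
    """Decide whether a predicted gene is retained.

    evidence        -- iterable of labels among 'rnaseq', 'homology', 'denovo'
    has_uniprot_hit -- True if the protein aligns to a UniProt entry
    """
    evidence = set(evidence)
    if not evidence:
        return False
    if evidence == {"denovo"}:
        # De novo-only models need an external protein hit to be kept.
        return has_uniprot_hit
    # Assumption: RNA-seq or homology support is sufficient on its own.
    return True


# Hypothetical gene models: (supporting evidence, UniProt hit?)
toy_genes = {
    "g1": ({"rnaseq", "homology", "denovo"}, True),
    "g2": ({"denovo"}, True),
    "g3": ({"denovo"}, False),
    "g4": ({"homology", "denovo"}, False),
}

retained = sorted(g for g, (ev, hit) in toy_genes.items() if keep_gene(ev, hit))
multi_supported = sorted(g for g in retained if len(toy_genes[g][0]) >= 2)
print("retained:", retained)
print("supported by at least two methods:", multi_supported)
```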

Functional annotation of the predicted protein-coding genes was performed with BLASTp (v2.11.0, e-value 1e-5)52 against entries in the UniProt53 and UniProtKB/Swiss-Prot databases. Domain and GO information was identified using InterProScan (v5.52-86.0)54, resulting in successful annotation of 96.05% of the genes against existing databases (Table 11). A common set of 8,844 genes is shared among all 4 species belonging to the class Holothuroidea, namely A. japonicus48, S. monotuberculatus, H. leucospilota49, and H. scabra. The numbers of species-specific genes are 893, 727, 1,380, and 688 in A. japonicus, S. monotuberculatus, H. leucospilota, and H. scabra, respectively. A Venn diagram illustrates the shared and species-specific genes of these 4 sea cucumber species (Fig. 4b).

Table 11 Function annotation of predicted protein-coding genes.
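As an illustration of how an e-value cutoff of 1e-5 might be applied to BLASTp results, the sketch below filters tabular output in the standard 12-column format (`-outfmt 6`) and keeps the highest-scoring hit per query. The helper name `best_hits` and the toy rows are hypothetical, and whether the authors used exactly this post-processing is not stated. The final lines recompute the annotated fraction quoted in the Background (27,948 of 29,596 genes).

```python
def best_hits(rows, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for the best hit per query.

    rows are tab-separated lines in BLAST tabular format:
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore
    """
    best = {}
    for row in rows:
        if not row.strip() or row.startswith("#"):
            continue
        fields = row.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit is weaker than the cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical BLAST rows for two genes.
toy_rows = [
    "gene1\tP1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "gene1\tP2\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-30\t150",
    "gene2\tP3\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t30",
]
hits = best_hits(toy_rows)
print(hits)

# Annotated fraction from the counts given in the text.
annotated_pct = 27_948 / 29_596 * 100
print(f"Annotated: {annotated_pct:.2f}%")
```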

Gene family analysis

Gene family clustering was conducted with 9 species belonging to the phylum Echinodermata, including A. planci (GCF_001949145.1), Lytechinus variegatus (GCA_018143015.1), S. purpuratus (GCF_000002235.5), Chiridota heheva (GCA_020152595.1), S. monotuberculatus (PRJNA1123322), A. japonicus (GCA_037975245.1), Holothuria glaberrima (GCA_009936505.2), Holothuria scabra (PRJNA1074116) and H. leucospilota (GCA_029531755.1), using OrthoFinder55 with MMseqs256, followed by phylogenetic tree construction based on single-copy genes. The analysis of gene family expansion and contraction across these species was carried out using CAFE (v2.1)57 (Fig. 4c). The number of gene families in the most recent common ancestor (MRCA) was determined to be 8,915. Within the family Stichopodidae, there were 351 expanded gene families and 269 contracted gene families. Among them, 29 expansions and 22 contractions were found to be statistically significant (p < 0.01). In the case of S. monotuberculatus, there were 687 expanded gene families and 956 contracted gene families, with 68 expansions and 155 contractions exhibiting statistical significance (p < 0.01), possibly indicating adaptations to varying environmental conditions or ecological niches.
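The selection of single-copy genes for tree construction can be sketched as follows: an orthogroup qualifies when every species contributes exactly one gene. The function name, the abbreviated species labels and the toy counts are hypothetical; in practice the per-species counts would come from the clustering output.

```python
def single_copy_orthogroups(counts, species):
    """Return orthogroup IDs with exactly one gene in every species.

    counts  -- {orthogroup_id: {species_name: gene_count}}
    species -- list of all species that must be represented
    """
    return sorted(
        og
        for og, per_species in counts.items()
        if all(per_species.get(sp, 0) == 1 for sp in species)
    )


# Hypothetical counts for three species.
toy_species = ["S_mono", "A_japo", "H_leuc"]
toy_counts = {
    "OG0001": {"S_mono": 1, "A_japo": 1, "H_leuc": 1},  # single copy
    "OG0002": {"S_mono": 2, "A_japo": 1, "H_leuc": 1},  # expanded in one species
    "OG0003": {"S_mono": 1, "A_japo": 1},               # missing in one species
}
single_copy = single_copy_orthogroups(toy_counts, toy_species)
print(single_copy)
```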

Data Records

All the sequencing reads and the chromosome-level genome assembly sequences related to this project have been deposited in NCBI BioProject PRJNA112332258. In this study, whole genome sequencing reads are available under the accession numbers SRR29673227 to SRR29654456, while RNA-seq sequencing reads can be found under the accession numbers SRR29673218 to SRR29673229. The whole genome shotgun project has been deposited at DDBJ/ENA/GenBank under the accession number JBEVTQ00000000059. Further related data, including assembly, annotation, functional annotation, and gene families, have been submitted to the Figshare database60.

Technical Validation

Nucleic acid quality

The DNA quality and concentration were assessed using 0.75% agarose gel electrophoresis, NanoDrop One spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA), and Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA). DNA samples showing slight degradation were deemed suitable for sequencing library preparation. RNA purity was evaluated with the kaiaoK5500® Spectrophotometer (Kaiao, Beijing, China). RNA integrity and concentration were determined using the RNA Nano 6000 Assay Kit on the Bioanalyzer 2100 system (Agilent Technologies, Santa Clara, CA, USA). Samples with an RNA integrity number (RIN) exceeding 9.50 were considered appropriate for library construction.

Genome assembly and annotation quality

The QV pipeline of Merqury61 was used to estimate the assembly QV based on k-mer analysis. The script “best_k.sh” in Merqury was employed to determine the optimal k-mer length, which was found to be 19. Meryl was then utilized to calculate the number of k-mers in the Illumina WGS reads with default settings. The QV evaluation was performed in Merqury using the output from Meryl and the assembly. The findings demonstrated a k-mer completeness of 79.10% and a k-mer-based QV of 43.36.
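A k-mer-based QV is a Phred-scaled consensus error rate, so the reported QV of 43.36 can be converted back to an approximate per-base error probability. The sketch below shows this conversion together with the k-mer survival formulation commonly described for Merqury, where the error rate is derived from the fraction of assembly k-mers absent from the reads. The function names and the toy k-mer counts are hypothetical, and the formula should be checked against the Merqury documentation before reuse.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled QV to a per-base error probability."""
    return 10 ** (-qv / 10)


def kmer_based_qv(asm_only_kmers: int, total_asm_kmers: int, k: int) -> float:
    """Estimate QV from k-mers found only in the assembly.

    A base is assumed correct if all k-mers covering it are supported
    by the reads, giving error = 1 - (1 - asm_only/total) ** (1/k).
    """
    if asm_only_kmers == 0:
        return math.inf  # no unsupported k-mers observed
    error = 1 - (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    return -10 * math.log10(error)


# Reported QV from the text.
reported_error = qv_to_error_rate(43.36)
print(f"QV 43.36 ~ {reported_error:.2e} errors per base")

# Hypothetical counts, using the k = 19 chosen in the text.
toy_qv = kmer_based_qv(asm_only_kmers=1_000, total_asm_kmers=1_000_000, k=19)
print(f"Toy QV: {toy_qv:.2f}")
```

A QV of 43.36 corresponds to roughly one error per 20,000 bases.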

To assess the integrity and accuracy of the genome, the sequenced reads were aligned against the assembly; a high alignment rate indicates a complete assembly. Illumina short-read sequencing at a coverage of 182.64× generated 148 Gb of high-quality reads. These reads achieved a mapping rate of 99.57%, with 90.43% of reads properly paired. The reads covered 99.67% of the genome, with regions exceeding 30× coverage representing 99.01% of the genome (Table 12).

Table 12 Reads mapping of NGS data.
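The two coverage statistics in this validation step (fraction of the genome covered by at least one read, and fraction covered at 30× or more) can be derived from per-base depth values. The sketch below is illustrative only: the function name and the toy depths are hypothetical, and treating "exceeding 30×" as depth ≥ 30 is an assumption.

```python
def coverage_summary(depths, min_depth=30):
    """Return (% of positions with depth > 0, % with depth >= min_depth)."""
    n = len(depths)
    if n == 0:
        raise ValueError("no depth values supplied")
    covered = sum(1 for d in depths if d > 0)
    deep = sum(1 for d in depths if d >= min_depth)
    return 100 * covered / n, 100 * deep / n


# Hypothetical per-base depths for five positions.
toy_depths = [0, 10, 30, 45, 180]
covered_pct, deep_pct = coverage_summary(toy_depths)
print(f"covered: {covered_pct:.1f}%, >=30x: {deep_pct:.1f}%")
```

For a whole genome, the depths would typically be streamed from a per-base depth file so that they are not held in memory all at once.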

Genome completeness was further evaluated through BUSCO analysis using the metazoa_odb10 dataset, which contains 954 conserved genes across 65 genomes (Table 13). The analysis revealed an overall completeness of 97.8%, with 94.3% of BUSCOs being complete and minor proportions being fragmented (3.5%) or missing (2.2%). Together, these metrics support the completeness and accuracy of our genome assembly, which provides valuable data for further wild resource conservation, genetic breeding and aquaculture of S. monotuberculatus.

Table 13 BUSCO evaluation results of the genome assembly.