Background & Summary

Colochirus anceps (synonym: Cercodemas anceps) is a benthic sea cucumber within the order Dendrochirotida and the family Cucumariidae1. This species is characterized by its elongated body and leathery skin, features typical of the class Holothuroidea2. The striking coloration of C. anceps serves as an aposematic signal to deter predators. Chromatic vision, which detects hue and chroma, enables predators to perceive these warning colors, while achromatic vision processes luminance for contrast detection. Studies demonstrate that both chromatic and achromatic visual cues in C. anceps contribute significantly to predator avoidance3. Widely distributed in tropical regions, particularly in the waters of Vietnam and Malaysia1, it plays an essential ecological role in intertidal and seagrass ecosystems4. As a deposit feeder, C. anceps contributes to nutrient recycling and sediment oxygenation, which, in turn, enhances infaunal biodiversity and primary productivity4,5. Its adaptability to various substrates, including sandy and muddy bottoms, further underscores its ecological versatility. C. anceps has also garnered attention for its biomedical potential. It is a rich source of holostane-type triterpene saponins, bioactive compounds with potent cytotoxic properties1. Specifically, cercodemasoide A, a saponin derived from this species, has demonstrated remarkable anticancer activity against various cancer cell lines, including hepatoma, melanoma, and breast cancer6. These findings highlight the therapeutic potential of C. anceps in cancer research and drug development.

In recent years, the growing number of assembled and annotated sea cucumber genomes has provided a critical foundation for investigating their evolutionary history, behavioral traits, and physiological adaptations, while significantly enriching the echinoderm genomic database. According to the latest classification of sea cucumbers by Miller et al. (2017), the class Holothuroidea is divided into seven orders7. Among these, the orders Synallactida and Holothuriida have been the most extensively studied, with the highest number of genome assemblies8,9,10,11,12,13,14,15 (Table 1). The order Apodida has also seen progress in genomic research, with the genome of the representative species Chiridota heheva successfully sequenced and analyzed16,17 (Table 1). In contrast, the order Dendrochirotida remains relatively underexplored in genomic terms. Preliminary genome survey analyses have revealed that species within this order have unusually large genomes, ranging from 2,238 to 3,754 Mb, with a high proportion of repeat sequences18. These genomic features suggest that Dendrochirotida species may have evolved distinct adaptive mechanisms.

Table 1 Chromosome-level reference genomes assembled in Holothuroidea.

In this study, we selected Colochirus anceps as a representative species of the order Dendrochirotida and utilized PacBio long-read sequencing combined with Hi-C technology to assemble its chromosome-level reference genome. The assembly results revealed that the genome size of C. anceps is 2,238.33 Mb, comprising a total of 1,433 contigs with a contig N50 of 15.09 Mb. Using Hi-C technology, these contigs were further clustered and ordered into 23 chromosomes (2n = 46), consistent with the chromosome number reported for other published sea cucumber genomes. A total of 24,102 protein-coding genes were predicted, of which 96.6% (23,288 genes) were successfully annotated with functional information. In summary, the C. anceps genome represents the first chromosome-level reference genome assembled for a sea cucumber species in the order Dendrochirotida. This genome serves as a significant addition to the sea cucumber genomic database, offering an essential resource for studying genomic diversity and evolutionary relationships among different sea cucumber orders. Moreover, it provides a foundation for investigating the genetic mechanisms underlying the unique physiological traits and ecological adaptations of C. anceps.

Methods

Sample collection, library construction and sequencing

A healthy adult individual of the sea cucumber Colochirus anceps was collected from Xiamen, Fujian Province, China. Muscle, body wall, intestine, respiratory tree, nerve ring, tentacle, and gonad tissues were sampled, immediately frozen in liquid nitrogen, and stored at −80 °C. Muscle samples were used for next-generation sequencing (NGS), PacBio sequencing, and Hi-C sequencing, while both muscle and other tissue samples were used for transcriptome sequencing to assist genome annotation. All processes involved in DNA extraction, library construction, and sequencing were carried out by Novogene (Beijing, China) in strict accordance with the manufacturer’s protocols. Genomic DNA was extracted from muscle tissue to construct short-insert libraries (350 bp), which were sequenced on the Illumina platform, generating 374.0 Gb (160.28×) of 150 bp paired-end reads (Table 2). For high-fidelity (HiFi) reads, high-molecular-weight (HMW) DNA was extracted from muscle tissue using Novogene’s SDS method19. High-quality DNA samples (main band > 30 kb) were fragmented into 15–18 kb pieces using a Covaris ultrasonicator20. Large DNA fragments were enriched and purified using magnetic beads, followed by damage repair and end repair. Sequencing adapters were ligated to the DNA fragments, forming circular templates, and unligated fragments were removed with exonuclease21. The prepared libraries were sequenced on the PacBio Sequel II platform, generating 91.4 Gb of PacBio CCS reads with a coverage depth of 39.17× (Table 2). For Hi-C sequencing, Hi-C library preparation involved crosslinked DNA digestion, biotin labeling, proximity ligation, and DNA purification. The Hi-C libraries were sequenced on the Illumina NovaSeq platform, producing 249.9 Gb of Hi-C reads with a sequencing depth of 107.25× (Table 2). For transcriptome sequencing, mRNA was enriched from total RNA using Oligo dT magnetic beads. After fragmentation, the first strand of cDNA was synthesized with random hexamer primers, followed by the synthesis of the second strand. Library preparation involved end repair, A-tailing, adapter ligation, fragment selection, amplification, and purification22. The prepared libraries were sequenced on the Illumina platform, generating a total of 51.39 Gb of raw data, including 6.25 Gb from muscle tissue, 6.99 Gb from body wall tissue, 7.90 Gb from intestinal tissue, 9.42 Gb from respiratory tree tissue, 6.03 Gb from gonad tissue, 6.40 Gb from tentacle tissue, and 8.39 Gb from nerve ring tissue23 (Table 2).

Table 2 Sequencing data used for the assembly and annotation of Colochirus anceps genome.

Genome survey, assembly and quality assessment

Before assembling the genome, a genome survey analysis was conducted using short-read sequencing data. K-mer analysis was performed using Jellyfish (v2.2.7) with a k-mer size of 17. The analysis revealed a main peak at a depth of approximately 30, yielding a total of 68,054,868,485 k-mers24. Based on the formula K-mer number/depth, the estimated genome size was calculated to be approximately 2,268.5 Mb, with a corrected size of 2,238.33 Mb25. Notably, this genome size is approximately 2–3 times larger than those reported for species in the order Synallactida. The substantial size difference may reflect distinct evolutionary trajectories in genome architecture, potentially involving differential transposable element activity. The genome heterozygosity rate was 1.06%, and the proportion of repetitive sequences was 69.39% (Table 3). The raw PacBio sequencing data were quality-controlled using CCS (v5.0.0) with the parameter min-rq = 0.99, generating HiFi reads26. Genome assembly was subsequently performed with hifiasm (v0.19.5), assembling the HiFi reads into contigs27. The assembly produced a total of 1,433 contigs, with a genome size of 2,407,851,961 bp and a contig N50 length of 15.09 Mb (Table 3). The contigs were then clustered, oriented, and ordered to a near-chromosomal level by integrating Hi-C sequencing data with ALLHiC (v0.9.8) (parameters: enz = DpnII, CLUSTER = n)28. Finally, manual refinement based on chromosomal interaction intensity was carried out using Juicebox (v1.11.08), resulting in a chromosome-level genome assembly29. After Hi-C-assisted assembly, a total of 23 sequences were assembled to the chromosome level (Table 4), matching the karyotypic characteristics observed in species from the orders Synallactida and Holothuriida. Although 1,191 sequences remained unassembled at this level, the anchored sequences (2,289,615,330 bp) represent 95.09% of the total genome length (2,407,881,661 bp) (Table 3 and Fig. 1). The assembled genome was evaluated using BUSCO (Benchmarking Universal Single-Copy Orthologs)30, and the results showed that 94.3% of the BUSCO genes were complete, indicating a high level of genome assembly completeness (Table 3 and Fig. 2a).
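The arithmetic behind the survey estimate, the contig N50, and the anchoring rate can be sketched as follows. This is a minimal illustration rather than the pipeline used in this study: the function names are our own, the contig lengths in the N50 example are hypothetical, and the correction step that yields 2,238.33 Mb is not reproduced because its details are not given here.

```python
def genome_size_from_kmers(total_kmers, peak_depth):
    """Estimate genome size (bp) as total k-mer count divided by main-peak depth."""
    return total_kmers / peak_depth


def n50(lengths):
    """Return the length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


def anchored_percent(anchored_bp, total_bp):
    """Percentage of the assembly placed on chromosome-level sequences."""
    return 100 * anchored_bp / total_bp


# Values reported in the text above.
survey_mb = genome_size_from_kmers(68_054_868_485, 30) / 1e6
anchoring = anchored_percent(2_289_615_330, 2_407_881_661)
print(f"uncorrected survey estimate: {survey_mb:.1f} Mb")
print(f"anchoring rate: {anchoring:.2f}%")

# Hypothetical contig lengths, used only to show how N50 is computed.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print("toy N50:", n50(toy_contigs))
```

With the reported inputs, the first two values agree with the 2,268.5 Mb and 95.09% stated above; the toy N50 is 70 because the two longest contigs already hold half of the 300 total bases.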

Table 3 Statistics of the genome survey, assembly, Hi-C scaffolding and quality assessments.
Table 4 Statistics of chromosome clustering counts and lengths.
Fig. 1 Genome-wide Hi-C heatmap of C. anceps genome.

Fig. 2 Sequencing and assembly information statistics of C. anceps genome.

Repeat and ncRNA annotation

Repetitive sequences in the C. anceps genome were predicted using both homology-based and de novo methods. For the homology-based prediction, RepeatMasker (v4.1.2) was used to identify sequences in the C. anceps genome that shared similarity with those in the RepBase database (parameters: -nolow -no_is -norna -pa 30)31. For the de novo prediction, a repeat library for C. anceps was constructed using RepeatModeler (v2.0.3) (parameters: -engine ncbi -pa 30 -LTRStruct)32, and repetitive sequences were subsequently identified against this library using RepeatMasker (v4.1.2). The de novo repeat library was then integrated with the RepBase database, and RepeatMasker (v4.1.2) was run again to annotate repetitive sequences in the C. anceps genome. The results showed that 70.95% of the C. anceps genome consisted of repetitive sequences (Table 5). Among them, 56,123 SINEs (short interspersed nuclear elements) accounted for 0.52%, 824,798 LINEs (long interspersed nuclear elements) accounted for 12.18%, and 1,057,255 LTRs (long terminal repeats) accounted for 11.88% (Table 5). tRNAs in the C. anceps genome were identified from their structural characteristics using tRNAscan-SE (v1.4)33. For rRNA identification, the highly conserved rRNA sequences of a closely related species, Apostichopus japonicus, were used as reference sequences, and BLAST (v2.2.26) was employed to search for rRNAs in the C. anceps genome (parameters: -e 1e-10, -v 10000, -b 10000). To predict miRNAs and snRNAs, the covariance models from Rfam were applied using Infernal (v1.1.5)34. The analysis identified 1,458 miRNAs (0.007%), 38,549 tRNAs (0.120%), 7,136 rRNAs (0.057%), and 2,300 snRNAs (0.015%) in the C. anceps genome (Table 6).
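When homology-based and de novo annotations are combined, the same bases can be reported more than once, so a genome-wide repeat percentage is usually derived from merged intervals. The sketch below illustrates that idea under the assumption that overlapping annotations are counted once; the function names and the toy intervals are hypothetical, and it does not parse RepeatMasker output.

```python
def merged_coverage(intervals):
    """Total bases covered by half-open (start, end) intervals, counting overlaps once."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block before starting a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered


def percent_of_genome(covered_bp, genome_bp):
    """Express covered bases as a percentage of genome length."""
    return 100 * covered_bp / genome_bp


# Hypothetical repeat annotations on a 1,000 bp toy sequence.
toy_repeats = [(0, 100), (50, 200), (400, 450), (450, 500), (900, 1000)]
toy_covered = merged_coverage(toy_repeats)
print(toy_covered, percent_of_genome(toy_covered, 1000))
```

In the toy case, summing the raw interval lengths would give 450 bp, whereas merging gives 400 bp (40% of the sequence), which is why merging matters when several annotation sources are integrated.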

Table 5 Statistics of repetitive sequences annotation.
Table 6 Statistics of non-coding RNA annotation.

Protein-coding gene prediction, functional annotation, and genome structure visualization

Gene structure prediction was performed using three methods: de novo prediction, homology-based prediction, and transcript-based prediction. De novo prediction was conducted using Augustus (v3.5) and SNAP (v2013.11.29), which predict gene structures based on the statistical features of the genome, such as codon frequency and exon/intron distribution35. Homology-based prediction utilized alignment tools, including BLAST (v2.2.26) and GeneWise (v2.4.1), to identify and predict gene structures by aligning known protein-coding sequences from homologous species to the C. anceps genome36. The reference species used for homology alignment included Apostichopus japonicus (Ajap)12, Holothuria leucospilota (Hleu)8, and Synapta maculata (Smac). Transcript-based prediction relied on RNA-seq data from various tissues of C. anceps. ORF prediction and protein alignment were carried out using HISAT2 (v2.2.1), StringTie (v2.2.1), and TransDecoder (v5.7.1)37,38. Finally, the gene sets predicted by the above methods were integrated into a non-redundant gene set using EVidenceModeler (EVM)39. The annotation results from EVM were further refined using PASA (v2.4.1), which added information such as UTRs and alternative splicing, resulting in the final gene set40. A total of 24,102 protein-coding genes were predicted in the C. anceps genome. The average transcript length, average CDS length, average exon length, and average intron length were 35,790.07 bp, 1,401.02 bp, 205.39 bp, and 5,907.41 bp, respectively (Table 7; Fig. 3; Fig. 4a). Using blastp (v2.2.26) and DIAMOND (v0.8.22)41, the gene set was aligned against commonly used protein databases, including SwissProt, NR, Pfam, KEGG, and InterPro. Functional annotation of protein-coding genes was performed with InterProScan (v5.59-91.0)42. The results showed that 22,211 genes were aligned to the NR database, 16,181 genes to the SwissProt database, 16,864 genes to the KEGG database, and 21,944 genes to the InterPro database. After integration, 96.6% of the genes (a total of 23,288 genes) in the gene set were successfully assigned functional annotations (Fig. 4b). The chromosome sizes were calculated using the sequence processing tools seqtk and pyfaidx. Bedtools was employed to create a BED file with 100 kb intervals (parameter: makewindows -g -w 100000) and to calculate the number of genes and GC content within each interval43. The number and location of repetitive sequences were analyzed using RepeatMasker, while LTR_finder and LTR_retriever (v2.9.0) were utilized to identify and integrate LTR information (parameters: -D 15000 -d 1000 -L 700 -l 100 -p 20 -C -M 0.9)44. Finally, the aforementioned data were integrated and visualized as a circular plot (Fig. 5) using Circos (v0.69-9)45.
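The windowing step used to prepare the circular plot can be illustrated with a short pure-Python sketch. It mimics fixed-width, half-open BED windows and per-window gene counts and GC content; it is not the Bedtools implementation. The chromosome size, gene coordinates, and the rule that a gene is counted in every window it overlaps are assumptions made for the example.

```python
def make_windows(chrom_sizes, width=100_000):
    """Yield (chrom, start, end) half-open windows of at most `width` bp per chromosome."""
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, width):
            yield chrom, start, min(start + width, size)


def count_genes(window, genes):
    """Count genes that overlap the window by at least 1 bp (assumed counting rule)."""
    chrom, start, end = window
    return sum(
        1
        for g_chrom, g_start, g_end in genes
        if g_chrom == chrom and g_start < end and g_end > start
    )


def gc_fraction(seq):
    """GC fraction over unambiguous bases (A, C, G, T); N and other symbols are ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


# Hypothetical inputs: one 250 kb chromosome and three genes in BED-style coordinates.
toy_sizes = {"Chr1": 250_000}
toy_genes = [("Chr1", 5_000, 20_000), ("Chr1", 95_000, 105_000), ("Chr1", 210_000, 220_000)]

toy_windows = list(make_windows(toy_sizes))
toy_counts = [count_genes(w, toy_genes) for w in toy_windows]
for window, n_genes in zip(toy_windows, toy_counts):
    print(*window, n_genes, sep="\t")
print(round(gc_fraction("ACGTNNGG"), 3))
```

The last window of each chromosome is shorter than 100 kb whenever the chromosome length is not a multiple of the window width, which is worth keeping in mind when densities are compared across windows.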

Table 7 Statistics of gene structure prediction.
Fig. 3 Comparisons of the genomic elements of closely related species.

Fig. 4 Prediction of genes and functional annotation. (a) Venn diagram of gene sets predicted by different methods. Venn diagram of the gene sets predicted using a gene overlap greater than 50% between two or more of the following methods: de novo, genes supported by EVM-integrated de novo predictions; homolog, genes supported by EVM-integrated homologous predictions; RNA, genes supported by EVM-integrated RNA-seq data predictions. Gene predictions had a baseline gene expression, RPKM, greater than 1. (b) Genes annotated with functions in four databases.

Fig. 5 The genome structure of C. anceps. From outer to inner circles: (a) chromosomes (Chr1-Chr23), (b) gene density, (c) GC content, (d) repeat number, and (e) LTR number.

Data Records

The genomic Illumina sequencing data were deposited in the NCBI SRA (accession SRR31917503)46.

The genomic PacBio sequencing data were deposited in the NCBI SRA (accession SRR31917502)47.

The transcriptomic sequencing data were deposited in the NCBI SRA (accessions SRR31917493–SRR31917499)48,49,50,51,52,53,54.

The Hi-C sequencing data were deposited in the NCBI SRA (accessions SRR31917492, SRR31917500, and SRR31917501)55,56,57.

The final chromosome assembly and genome annotation files are available in GenBank58 and Figshare59.

Technical Validation

Genome completeness was assessed using BUSCO. The results showed that 94.3% of the BUSCO genes were complete, comprising 89.8% complete single-copy BUSCOs and 4.5% complete duplicated BUSCOs, while 3.2% were fragmented and 2.5% were missing, indicating a relatively high completeness of the assembly (Fig. 2a). Short-read libraries were aligned to the assembled genome using BWA to evaluate the alignment rate, genome coverage, and depth distribution, and thereby to assess completeness and sequencing uniformity. The results showed a read alignment rate of approximately 98.37% and genome coverage of about 99.36%, indicating strong consistency between the reads and the assembled genome (Fig. 2b and Table 8). To further analyze potential GC bias and contamination, the GC content and average depth of the assembled genome were calculated using 10 kb windows. The results showed a GC content concentrated around 41.65%, with no obvious separation of clusters in the scatterplot, suggesting no significant GC bias or external contamination in the genome (Fig. 2c). Genome quality was assessed using Merqury based on k-mer analysis, yielding a QV (quality value) of 32.7327, indicating an accuracy greater than 99.9% (Table 8). In summary, multiple evaluation methods confirmed that the assembled genome exhibits high consistency, completeness, and accuracy.
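The two headline validation figures can be checked with simple arithmetic. The sketch below assumes the standard Phred-style definition of a quality value, QV = −10·log10(error rate), and uses the BUSCO percentages reported above; the function and variable names are ours.

```python
def qv_to_accuracy(qv):
    """Convert a Phred-scaled quality value to per-base accuracy (1 - error rate)."""
    error_rate = 10 ** (-qv / 10)
    return 1 - error_rate


# BUSCO category percentages as reported in the text.
busco = {"single_copy": 89.8, "duplicated": 4.5, "fragmented": 3.2, "missing": 2.5}
busco_complete = round(busco["single_copy"] + busco["duplicated"], 1)
busco_total = round(sum(busco.values()), 1)

accuracy = qv_to_accuracy(32.7327)
print(f"complete BUSCOs: {busco_complete}% (all categories sum to {busco_total}%)")
print(f"per-base accuracy at QV 32.7327: {accuracy:.4%}")
```

Under this definition a QV of 30 corresponds to exactly 99.9% accuracy, so any QV above 30, including the value reported here, implies an accuracy greater than 99.9%.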

Table 8 Statistics of genome reads coverage.