Background & Summary

The sea cucumber belongs to Echinodermata, which occupy an important phylogenetic position together with their sister phylum, Hemichordata. Evolutionary studies of sea cucumber are crucial for understanding the origin of chordates1. Sea cucumbers have evolved special behaviors, including super regenerative capacity, aestivation, and anatomy, among others, to adapt to various oceanic environments. This adaptation has taken place over a long evolutionary history, which can be traced back to the early Cambrian era2. Sea cucumbers play crucial roles in maintaining the health of the ocean floor by consuming dead plant and animal matter, which helps to keep the sediment in a balanced state3. In addition, sea cucumbers are important economic aquaculture species and have been deemed as one of the most valuable functional foods in the sea due to their nutritional and pharmacological properties4,5. Among approximately 60 species with exploitation value, Apostichopus japonicus is the most economically important species6. In China alone, around 200,000 tons of adult A. japonicus are produced each year, with an estimated whole industrial chain value of about 15 billion dollars (Sea cucumber Industry Branch of China Fisheries Association, 2010–2022).

For the reasons listed above, A. japonicus is one of the most studied echinoderms, and a total of 1,358 research papers on A. japonicus were published between 2000 to 20217. Such research occurs in fisheries8,9, immunology10,11, food science and biological medicine4,5, ecological function3, as well as biochemistry and molecular biology12,13,14. With the deepening of research requirements and development of biological technologies, studies of the mechanisms on phenotype formation, physiological responses, and behavioral regulation is rapidly developing. Hence, the genetic resources of A. japonicus have been exploited, including the coding and noncoding RNA transcriptome15,16,17, proteome18, epigenome19, genetic linkage map20,21 and genome1,22. Among them, the genome is the most basic data that is essential for most multi-omics analyses.

In 2017, Zhang et al. constructed the A. japonicus genome with the assembly of 805 Mb (contig N50 of 190 Kb and scaffold N50 of 486 Kb)22 (Table 1). A total of 30,350 protein-coding genes were annotated in that genome version22 (Table 1). In 2018, Li et al. published 952 Mb (contig N50 of 45 Kb and scaffold N50 of 196 Kb) of the A. japonicus genome with 29,451 protein-coding genes1 (Table 1). In 2022, Wang et al. constructed a chromosome-level A. japonicus genome using Hi-C technology based on the 2017 genome assembly version23. However, the existing genome versions are far behind the growing demand of in-depth study on A. japonicus biology. Genomes with a low contig N50 are generally highly fragmented, resulting in the poor annotation of protein-coding genes and non-coding sequences24. For example, the sea cucumber breeding industry is currently primarily focused on traditional breeding methods, with molecular marker-assisted breeding playing a supporting role. A. japonicus breeding is in the transition from Breeding 2.0 (Statistical and experimental design to improve selection effort) to Breeding 3.0 (Integration of genetic and genomic data)25. Genome-wide association studies (GWAS) and genomic selection (GS) are required to identify genes related to economic traits and accurately evaluate them in A. japonicus breeding programs. Therefore, genomes with long contig N50s and long continuity are essential. Moreover, the annotation of functional genes and regulatory elements plays important roles in understanding evolutionary mechanisms and genetic regulation, which depend on the high quality of genome.

Table 1 Comparison of A. japonicus genome assemblies in 2017, 2018 and 2023.

We applied multiple sequencing technologies, generating 56.54 Gb of Illumina data, 43.22 Gb of PacBio data, and 3.18 Gb of HiC data, to reconstruct the chromosome-level A. japonicus genome (Fig. 1), which is the first known chromosome-level genome of sea cucumber (Table 1). The final assembly was 671.60 Mb in total length with a contig N50 length of 17.20 Mb and scaffold N50 length of 29.65 Mb (Table 1). The assembly quality was much better than those of the previous genome versions. The genome developed herein will be an excellent tool to better investigate the mechanisms that drive evolution and biodiversity23. By analyzing their genetic basis, researchers can identify the compounds responsible for these benefits and develop new treatments for human diseases4. Moreover, scientists can better understand the genetic basis of their responses to environmental stressors by studying their genomes, and develop new tools for monitoring and conserving the ocean’s resources26. As an important aquaculture species, the accurate genetic analysis of economic traits can help to improve the genetic stability of target traits and the success rate of genetic improvement operations20. Overall, constructing a high-quality genome for sea cucumbers is a crucial step in advancing the understanding of these unique animals and their importance to the health of the ocean and to human well-being.

Fig. 1
figure 1

A circos plot of 23 chromosomes of A. japonicus genome. The tracks from inside to outside are: bar plot for gene density profile, the distributions of transposable element and 23 chromosomes.

Methods

Sample collection and sequencing

The longitudinal muscle of a female A. japonicus was collected in Rushan, Shandong Province, China, in 2021. The sample was washed three times with phosphate buffered saline (PBS), quickly frozen in liquid nitrogen, and stored at −80 °C until DNA extraction. After DNA extraction, a short fragmented library was prepared with an insert size of 350 bp and sequenced using the Illumina Platform to generate 150-bp paired-end reads. For HiFi read generation, high-molecular-weight (HMW) gDNA was sheared to approximately 15 Kb before preparing a PacBio HiFi library. The genomic library was sequenced in CCS mode on the PacBio Sequel II system at Novogene (Beijing, China). After trimming the low-quality reads and adaptor sequences from the raw data, 56.54 Gb of Illumina data and 43.22 Gb of PacBio data with a mean read length of 14.7 Kb were obtained, resulting in 83.58-fold and 63.89-fold coverage of the A. japonicus genome respectively (Table 2). Such coverage was sufficient for haplotype-resolved assembly.

Table 2 Statistical analysis of sequencing reads from Illumina and PacBio.

Genome assembly

The genome was assembled using the default parameters of Hifiasm (v0.15.4-r343)27. Hifiasm calculates from the uncollapsed genome, allowing it to preserve haplotype information as much as possible. The HiFi long reads were provided to Hifiasm to generate the monoploid and a pair of haplotype-resolved assembly contig graphs. We assembled 178 contigs with a total length of 671.63 Mb. The maximum contig size and N50 were 38.08 and 17.20 Mb (Table 3), respectively.

Table 3 Assembly statistics at the contig level.

Hi-C library preparation, sequencing, and chromosome anchoring

A Hi-C library was prepared following the Hi-C library protocol28. After grinding with liquid nitrogen, fresh muscle was cross-linked using 4% formaldehyde solution at room temperature in a vacuum for 30 min. The fixation was terminated using 2.5 M glycine. Following cell lysis, cross-linked DNA was digested using the four-cutter restriction enzyme MboI. The DNA ends were subsequently labeled with biotin-14-dCTP and subjected to blunt-end ligation of the cross-linked fragments. DNA was extracted and purified using the phenol-chloroform extraction method. Sonication was employed to generate fragments ranging from 200 to 600 base pairs, and the ends of these fragments were repaired using a combination of T4 DNA polymerase, T4 polynucleotide kinase, and Klenow DNA polymerase. Streptavidin C1 magnetic beads were utilized for the specific enrichment of biotin-labeled Hi-C samples29,30. Following the addition of A-tails to the fragment ends and ligation with Illumina paired-end (PE) sequencing adapters, Hi-C sequencing libraries were subjected to PCR amplification (12–14 cycles) and subsequently sequenced on an Illumina PE150 platform. The raw sequence data were filtered to obtain a total of 73.50 Gb of clean data, with Q20 = 96.28% and Q30 = 91.25% (Table 4), which was used to assist chromosome assembly.

Table 4 Statistical analysis of sequencing data from Hi-C.

HiCUP (v0.8.1) was used to process the Hi-C data31. The clean Hi-C data was assembled using the ALLHiC pipeline, which contained a total of five steps: pruning, partitioning, rescuing, optimizing and building32,33. Finally, 99.94% of the initial assembled sequences were anchored to 23 pseudo-chromosomes (Fig. 2) with lengths ranging from 18.21 to 46.02 Mb. The total length of the genome assembly was 671.63 Mb, with 34 scaffolds and a scaffold N50 of 29.65 Mb (Table 5).

Fig. 2
figure 2

Genome-wide Hi-C heatmap of Apostichopus japonicus.

Table 5 Assembly statistics for Hi-C.

Genomic repeat annotation and ncRNA annotation

Repeat sequences of the A. japonicus genome were identified by homology-based and de novo strategies34. First, we integrated the repetitive sequence database predicted by Denovo with the homologous repetitive sequence database, Repbase35. Then, we used RepeatScout (v1.0.5)34, RepeatModeler (v2.0.1)36, Piler (v1.0)37 and LTR-FINDER (v1.0.6)38 to identify transposable element (TE) families. Repeatmasker (v4.1.0)36, RepeatProteinMask (v4.1.0) and TRF (v4.0.9)39 were used to identify and classify different repetitive elements by aligning the A. japonicus genome sequences against the integrated database. After removing the redundancy results obtained using the above three methods, the total length of the repeat sequences accounted for 47.33% of the A. japonicus genome. In addition, the Kimura divergence value of TE was calculated using calcDivergenceFromalign.pl40. TE landscapes were drawn using createRepeatLandscape.pl41 (Fig. 3). Among the repeat elements, short interspersed nuclear elements (SINEs) accounted for 0.02% of the genome and long interspersed nuclear elements (LINEs) accounted for 2.94% of the genome. Long terminal repeats (LTRs) and DNA elements accounted for 27.03% and 3.74% of the genome, respectively (Table 6).

Fig. 3
figure 3

Distribution of divergence rates for TEs in the A. japonicus genome.

Table 6 Classification of repetitive sequences and ncRNAs in the A. japonicus genome.

For the annotation of noncoding RNA (ncRNA), tRNAScan (v1.4)42 and blast (v2.2.26)43 were used for tRNA and rRNA prediction, respectively. Other noncoding RNAs, including miRNA and snRNA were detected by alignment to the Rfam database44 using INFERNAL (v1.0)45. Four types of noncoding RNAs, including 1,066 miRNAs, 4,963 tRNAs, 3,379 rRNAs, and 1,088 snRNAs, were identified from the A. japonicus genome (Table 6).

Protein-coding gene prediction and annotation

Gene structures were predicted using three basic strategies: de novo, homology-based, and transcriptome sequencing-based prediction. Based on the genome sequence, we used Augustus (v3.2.3)46, GlimmerHMM (v3.0.4)47, SNAP (v2013.11.29)48, Geneid (v1.4)49 and Genscan (v1.0)50 for ab initio gene prediction. For homology-based gene prediction, the protein sequences of L. variegatus51, S. purpuratus52, A. planci53, H. sapiens54, D. rerio55, S. chloronotus56 and S. kowalevskii57 were downloaded from the National Center for Biotechnology Information (NCBI). Blast (v2.2.26)43 and Genewise (v2.4.1)58 were used to align the protein sequences of A. japonicus to the seven other species for homology-based gene prediction. A total of 10,707, 11,249, 9,982, 7,184, 7,374, 15,566 and 11,377 genes were identified for L. variegatus, S. purpuratus, A. planci, H. sapiens, D. rerio, S. chloronotus and S. kowalevskii, respectively (Table 7). We also compared the gene, CDS, and exon and intron lengths to those of the seven other species (Fig. 4). For A. japonicus, the average lengths of the transcript, exons, and introns were 7,736.67, 193.65, and 1,286.06 bp, respectively.

Table 7 Statistical analyses of the gene structure annotation of the A. japonicus genome.
Fig. 4
figure 4

Comparisons of the genomic elements of closely related species.

The clean RNA-seq data underwent two types of assembly methods. For transcript assembly, we relied on the reference genome, while de novo assembly was carried out using Trinity (v2.11.0)59. Open reading frames (ORFs) were detected using PASA (v2.1.0)60. Based on the predictions, we used EvidenceModeler(v1.1.1)61 to integrate the gene sets predicted using different strategies into a non-redundant and complete gene set of 19,828 protein-coding genes (Table 7 & Fig. 5a).

Fig. 5
figure 5

Gene prediction and functional annotation of the A. japonicus genome. (a) Venn diagram of the gene set prediction. (b) Venn diagram of functional annotation based on different databases.

To perform functional annotation of protein-coding genes, Blastp (v2.2.26)62 and Diamond (v0.8.22)63 were used to align protein-coding genes to the SwissProt64, NCBI Non-redundant protein(NR) (ftp://ftp.ncbi.nih.gov/pub/nrdb/), KEGG65, InterPro66, GO Ontology (GO)67 and Pfam68 protein databases using an E-value threshold of 1E-5. The protein domains and motifs were annotated using InterProScan (v5.52-86.0)69. Finally, 19,122 (96.40%) of the 19828 predicted genes were annotated by at least one database (Table 8). Of the functional proteins, 12,767 (64.39%) were supported by all four databases (Fig. 5b).

Table 8 Statistical analysis of the functional gene annotations of the A. japonicus genome.

Data Records

The genomic Illumina sequencing data were deposited in the SRA at NCBI SRR2252357870.

The genomic PacBio sequencing data were deposited in the SRA at NCBI SRR2279926171, SRR2364010672-SRR2364010773.

The transcriptomic sequencing data were deposited in the SRA at NCBI SRR1705608474.

The Hi-C sequencing data were deposited in the SRA at NCBI SRR23362389- SRR2336239275,76,77,78.

The final chromosome assembly and genome annotation files are available in Figshare79.

Technical Validation

Evaluation of the genome assembly and annotation

We evaluated the genome assembly quality through the following measures: (i) The BUSCO (V4.1.2)80 evaluation was performed using a single-copy orthologous gene library, combined with software tools such as tblastn, augustus, and hmmer, to assess the assembled genome. The result showed that 97.2% of gene orthologs were detected in A. japonicus. Among them, 96.7% achieved complete scores, while 0.5% obtained fragment scores. This indicates a relatively comprehensive assembly outcome (Supplementary fig. 1). (ii) Employing the Core Eukaryotic Genes Mapping Approach (CEGMA) (v2.5)81, we identified 458 core eukaryotic genes, including 248 highly-conserved core genes used to assess genome and annotation completeness (Supplementary table 1). By aligning A. japonicus genes to these 248 core genes, we observed homologous genes in the A. japonicus gene sets for 228 core genes, accounting for 91.94% of the total. These findings further support the relatively complete assembly results. (iii) By aligning Illumina sequencing reads to the nuclear genome using BWA (v0.7.8)82, we determined a read mapping rate of 96.86% and a coverage rate of 99.83%, indicating high mapping efficiency and comprehensive coverage. (iv) The consensus quality value (QV) of genomes representing per-base consensus accuracy was estimated by Merqury83, and the QV of the A. japonicus genome exceeded 45 (48.86), which indicated the high accuracy of the genome assembly. Thus, all of the above results indicated that we obtained the high-quality genome of A. japonicus.