Background & Summary

Although the majority of sea anemones are sessile, certain species have evolved a burrowing habit, embedding themselves in soft sediments1,2. These burrowing sea anemones play crucial roles in benthic ecosystems. Their movements create microhabitats for small organisms and facilitate biogeochemical cycling and material exchange between water and sediment layers. Unlike sessile species, which are surrounded solely by seawater, burrowing sea anemones inhabit a heterogeneous microenvironment composed of water mixed with sand or mud. To survive under such conditions, they have evolved tolerance to stressors such as hypoxia, pathogens and strongly reducing conditions3,4. Despite their distinct adaptations and ecological significance, genomic resources for burrowing sea anemones are limited, hindering our understanding of their molecular adaptation mechanisms.

Paracondylactis sinensis is a typical burrowing sea anemone and also an economically important species in China. However, due to overfishing and other anthropogenic disturbances, its wild populations have declined5. Our previous SSR-based analysis revealed signs of inbreeding in both northern and southern populations, highlighting the urgent need for conservation and artificial breeding efforts6. In this study, we present a chromosome-level genome assembly of P. sinensis, which provides a valuable resource for investigating its adaptive strategies and supports future studies on population genetics and resource management.

Beyond their ecological roles and economic value, sea anemones are recognized as a natural source of diverse bioactive polypeptides7. These include antimicrobial peptides used to defend against external pathogens and venom-derived toxins employed for prey capture, both of which hold promise for pharmaceutical applications such as infection control and wound healing8,9. Compared to sessile species, burrowing sea anemones like P. sinensis are exposed to a more complex environment with low oxygen content and distinct microbial communities. These unique conditions may drive the evolution of lineage-specific bioactive compounds with previously uncharacterized biological activities. The chromosome-level genome generated in this study provides a foundational resource for identifying such compounds and advancing bioprospecting and gene-mining efforts in this ecologically and pharmacologically valuable species.

Methods

Sample collection and DNA extraction

Specimens of the sea anemone Paracondylactis sinensis were dug by hand at ebb tide on the coast of Taizhou, China (121°39.53′E, 28°20.12′N). The sampling procedures complied with the relevant regulations of the Institutional Animal Care and Use Committee of the Institute of Oceanology, Chinese Academy of Sciences. After collection, the animals were transported to the laboratory and cultured in an aerated tank at 20°C and a salinity of 33‰ (Fig. 1). One active, healthy individual was selected and rinsed with sterile seawater for subsequent DNA extraction. The body column was dissected with sterile scissors and forceps and then flash-frozen in liquid nitrogen. DNA for PacBio HiFi sequencing was extracted using the MagAttract® HMW DNA Kit (Qiagen, Germany) according to the manufacturer’s instructions. The quantity, integrity and purity of the DNA were measured using a Qubit 3.0 fluorometer, pulsed-field gel electrophoresis (5–80 kV, 17 h) and a NanoDrop® instrument, respectively. The A260/280 ratio of the extracted DNA was 1.82, and a dominant high-molecular-weight DNA band larger than 48 kb was observed (Fig. 2), indicating its suitability for long-read sequencing. The extracted DNA was sent to Novogene Bioinformatics Technology Co., Ltd. (Tianjin, China) for library construction and sequencing.

Fig. 1

Paracondylactis sinensis buried in the sandy substrate of a laboratory culture tank.

Fig. 2

Pulsed-field gel electrophoresis (PFGE) assessment of the extracted high-molecular-weight genomic DNA. Lane M: DNA size marker (10–48 kb); lane 1: the extracted DNA.

Illumina sequencing and genome survey

A paired-end DNA library for genome survey analysis was constructed with an insert size of 350 bp using a Rapid Plus DNA Lib Prep Kit for Illumina (ABclonal, USA). The library was subsequently sequenced on an Illumina NovaSeq X Plus sequencer (2 × 150 bp paired-end reads) (Illumina, USA). A total of 21.92 Gb of raw data was generated. The Illumina short reads were cleaned with Trimmomatic 0.36.310 using the parameters “ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:15 TRAILING:15 SLIDINGWINDOW:4:15 MINLEN:40”, and duplicates were then removed with FastUniq 1.111. Finally, 18.48 Gb of clean data was obtained (Table 1). GenomeScope 2.0 and Jellyfish 2.2.312 were used for the genome survey. A 17-mer frequency distribution was constructed from the clean Illumina short reads (Fig. 3). According to the formula genome size = knum/kdepth, where knum is the total number of k-mers and kdepth is the expected k-mer depth, the genome size was estimated to be 209.71 Mb, with a heterozygosity rate of 0.72% and a repeat rate of 31.47%.
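The formula genome size = knum/kdepth can be illustrated with a minimal Python sketch. The histogram below is invented toy data, and the function name, the `min_count` error cutoff and the choice of the tallest peak as kdepth are assumptions made for illustration; GenomeScope fits a statistical model to the spectrum rather than applying this simple ratio.

```python
def estimate_genome_size(histogram, min_count=3):
    """Estimate genome size from a k-mer histogram.

    histogram maps k-mer multiplicity -> number of distinct k-mers
    seen that many times. Multiplicities below min_count are treated
    as sequencing errors and ignored (an assumed, simplistic cutoff).
    """
    solid = {c: n for c, n in histogram.items() if c >= min_count}
    # kdepth: the multiplicity at the main peak of the distribution.
    k_depth = max(solid, key=solid.get)
    # knum: total number of k-mer occurrences, counted with multiplicity.
    k_num = sum(c * n for c, n in solid.items())
    return k_num / k_depth, k_depth


# Toy histogram (hypothetical values, not the P. sinensis data).
toy_histogram = {1: 900, 2: 120, 8: 40, 9: 90, 10: 120, 11: 90, 12: 40, 20: 10}
size, depth = estimate_genome_size(toy_histogram)
print(f"peak depth = {depth}, estimated size = {size:.0f} bp")
```

In this toy case the low-multiplicity k-mers are discarded as errors, the peak sits at depth 10, and the remaining 4,000 k-mer occurrences give an estimate of 400 bp.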

Table 1 Sequencing data used for the assembly and annotation of Paracondylactis sinensis genome.
Fig. 3

17-mer frequency distribution in the Paracondylactis sinensis genome.

PacBio HiFi sequencing and Hi-C sequencing

For HiFi sequencing, a 15 kb SMRTbell library was constructed from genomic DNA fragments larger than 15 kb, following the instructions of the SMRTbell Express Template Prep Kit 3.0 (PacBio, USA). The fragment size distribution of the library, measured on an Agilent 5200 (Agilent, USA), showed a dominant insert peak at approximately 16 kb, with the majority of fragments distributed between 10 and 40 kb (Fig. 4). The library was subsequently sequenced on a PacBio Revio system with a Revio SPRQ polymerase kit (PacBio, USA). A total of 8.34 Gb of high-quality HiFi reads (39.77× genome coverage) was produced (Table 1). The length distribution of the HiFi reads showed a major peak at approximately 11 kb (Fig. 5), with a mean read length of 11,344 bp and an N50 of 13,891 bp.
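The coverage and N50 statistics quoted above follow standard definitions, sketched here in Python. The function names and the toy read lengths are illustrative assumptions; the coverage call uses the HiFi yield and the surveyed genome size reported in the text.

```python
def coverage(total_bases, genome_size):
    """Sequencing depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("n50() requires at least one length")


# 8.34 Gb of HiFi reads over the 209.71 Mb surveyed genome size.
hifi_cov = coverage(8.34e9, 209.71e6)
print(f"HiFi coverage: {hifi_cov:.2f}x")

# Toy read lengths (hypothetical): total 20, so N50 is reached at the read of length 5.
toy_reads = [2, 3, 4, 5, 6]
print("toy N50:", n50(toy_reads))
```

The same `n50` function applies unchanged to contig or scaffold lengths.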

Fig. 4

Quality check of the SMRTbell library by Agilent 5200.

Fig. 5

Length distribution diagram of HiFi-reads.

To support a haploid assembly, the individual used for Hi-C scaffolding was the same one used for PacBio HiFi and Illumina sequencing. Dissected body column tissue was cross-linked (4% formaldehyde, 5 min, RT), quenched (0.20 M glycine, 1 min, RT), chilled (>15 min, ice), lysed (10 mM NaCl, 0.20% IGEPAL CA-630, 10 mM Tris–HCl, 1× protease inhibitors), digested with DpnII, biotin-labeled, and end-repaired13. A library with an insert size of 350 bp was prepared with the Rapid Plus DNA Library Prep Kit for Illumina (ABclonal, USA) and sequenced on an Illumina NovaSeq X Plus sequencer (2 × 150 bp paired-end reads) (Illumina, USA). After quality control with Trimmomatic 0.36.310 and FastUniq 1.111, a total of 15.92 Gb Hi-C reads with 88.12× genome coverage was generated (Table 1).

Chromosome-level genome assembly

The genome was preliminarily assembled from the PacBio HiFi reads using hifiasm 0.20.014 with the parameters “-l 3 -s 0.5”. Sequencing contamination was screened with FCS-GX 0.5.515, and no sequences of foreign origin were detected. The contig-level genome assembly was 210.63 Mb in length, closely matching the genome size estimated by the genome survey (209.71 Mb). This initial assembly comprised 181 contigs, with a contig N50 of 8.70 Mb (Table 2). The GC content of the assembly was 39.02%, close to that of the model sea anemone species Nematostella vectensis (40.72%; GCA_93252622516), suggesting a base composition comparable to that of published actiniarian genomes.

Table 2 Statistics of the genome survey, genome assembly and quality assessments.

Hi-C reads were used to scaffold the draft assembly to the chromosome level. Paired-end Hi-C reads were aligned to the contig-level genome using Chromap 0.2.317 with the “--preset hic” and “--remove-pcr-duplicates” options enabled. The resulting SAM alignments were sorted and converted to BAM format using SAMtools 1.1618. Contig clustering and scaffolding were performed with YaHS 1.2.219 under default parameters. The scaffolded assembly and Hi-C contact maps were further processed using the pre and post modules of Juicer Tools 1.22.0120 to generate .hic files and finalize chromosome-scale scaffolds through manual review. Finally, a total of 196.80 Mb (93.44%) of the genome was anchored to 19 pseudochromosomes (Fig. 6, Table 3), with sizes ranging from 5.40 Mb to 25.01 Mb. The scaffold N50 of the chromosome-level genome was 9.41 Mb (Table 2). The unanchored scaffolds totaled 13.83 Mb across 147 sequences, of which 86 (58.50%) were between 30 and 60 kb in length (Fig. 7). These unplaced scaffolds lacked sufficient Hi-C linkage evidence for confident anchoring, and their regions of the contact map were largely blank (Fig. 8). This may be attributable to repeat-associated multi-mapping and/or conflicting placements across pseudochromosomes, which precluded reliable assignment to a unique chromosomal locus.
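The anchoring rate and the size-bin share of unanchored scaffolds are simple ratios over scaffold lengths, as the following Python sketch shows. The function name, the toy lengths and the rule that the longest sequences are the pseudochromosomes are assumptions for illustration only; a real pipeline would identify pseudochromosomes by name in the final assembly.

```python
def anchoring_summary(lengths, n_chromosomes, size_bin=(30_000, 60_000)):
    """Return (anchored fraction, share of unanchored scaffolds within size_bin).

    Assumes the n_chromosomes longest sequences are the pseudochromosomes
    and that everything else is unanchored.
    """
    ordered = sorted(lengths, reverse=True)
    anchored = ordered[:n_chromosomes]
    unanchored = ordered[n_chromosomes:]
    anchored_rate = sum(anchored) / sum(ordered)
    low, high = size_bin
    in_bin = [x for x in unanchored if low <= x <= high]
    bin_share = len(in_bin) / len(unanchored) if unanchored else 0.0
    return anchored_rate, bin_share


# Toy assembly (hypothetical lengths in bp): two pseudochromosomes, three leftovers.
toy_lengths = [900_000, 600_000, 50_000, 40_000, 10_000]
rate, share = anchoring_summary(toy_lengths, n_chromosomes=2)
print(f"anchored: {rate:.2%}, unanchored within 30-60 kb: {share:.2%}")

# The bin share reported in the text: 86 of 147 unanchored scaffolds.
reported_bin_share = 86 / 147
print(f"reported share: {reported_bin_share:.2%}")
```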

Fig. 6

Circos plot of 19 pseudochromosomal linkage groups of Paracondylactis sinensis genome showing the marker distribution in 1 Mb sliding windows (from the outer to inner circle: GC content, gene density and repeat density; lines in the middle depict intrachromosomal synteny).

Table 3 Length of 19 pseudomolecules in the assembly of Paracondylactis sinensis genome.
Fig. 7

Length distribution of the unanchored scaffolds.

Fig. 8

Unanchored scaffolds shown in the Hi-C contact map.

Annotation of the repetitive sequences

Repetitive sequences were annotated using a combination of homology-based and de novo approaches. A species-specific repeat library was built from the assembled genome using RepeatModeler 2.021 with the -LTRStruct option enabled. Homology-based annotation was performed with RepeatMasker 4.122 against the Actiniaria repeat database using the parameters -e rmblast, -nolow, -s, -a, and -gff. De novo repeat annotation was then conducted by masking the genome with the species-specific library generated by RepeatModeler 2.0. Results from the homology-based and de novo runs were combined using the combineRMFiles.pl utility supplied with RepeatMasker 4.1 to produce a comprehensive repeat annotation. In total, 55.68 Mb (26.43% of the genome) was annotated as repeats. Among the repetitive sequences, 43.11 Mb were transposable elements (TEs), accounting for 20.47% of the genome (Table 4). The proportion of repeats in the genome of P. sinensis was similar to that of the model actiniarian species Exaiptasia diaphana (24.24%; GCA_02432226523) predicted in our previous study24.
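When annotations from two runs are combined, overlapping repeat intervals must be counted only once to obtain the masked fraction of the genome. The Python sketch below shows one way to do this by taking the union of intervals. It is an illustration of the arithmetic only, not a description of how combineRMFiles.pl works internally; the function name, the half-open coordinate convention and the toy intervals are assumptions.

```python
def merged_length(intervals):
    """Total bases covered by a set of half-open (start, end) intervals,
    counting overlapping regions only once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Toy repeat calls on a hypothetical 1,000 bp sequence.
homology_hits = [(0, 100), (200, 300)]
de_novo_hits = [(50, 150), (400, 450)]
masked = merged_length(homology_hits + de_novo_hits)
repeat_pct = masked / 1000 * 100
print(f"masked bases: {masked}, repeat content: {repeat_pct:.1f}%")
```

Summing the four intervals naively would give 350 bp; merging the overlap between (0, 100) and (50, 150) gives the correct 300 bp.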

Table 4 Repeat composition of Paracondylactis sinensis genome.

Gene prediction and functional annotation

Gene structure prediction was performed by integrating evidence from transcriptome data, protein homology, and ab initio prediction. RNA-Seq data from different tissues of P. sinensis were downloaded (NCBI SRA accessions: SRR28341359, SRR28327126, SRR28327127, and SRR28327128)6 and aligned to the soft-masked genome using HISAT2 2.2.125 with default parameters. The alignments were sorted and indexed using SAMtools 1.1618. Gene prediction was then conducted with BRAKER3 3.0.326, which integrates GeneMark-ETP 1.00, AUGUSTUS 3.4.0, and extrinsic evidence. RNA-Seq alignments were incorporated via the --bam option, while protein homology information was provided using the --prot_seq option. Protein sequences from seven closely related sea anemone species (Actinostola sp.27, Alvinactis sp.28, Exaiptasia diaphana23, Nematostella vectensis29, Scolanthus callimorphus29, Actinia tenebrosa30, and Actinoscyphia liui24) were downloaded from the NCBI and Figshare databases and used to guide evidence-based gene structure prediction. To further improve the structural annotation by incorporating untranslated regions (UTRs), PASA 2.5.331 was employed for transcript-based refinement. Transcripts assembled by Trinity 2.15.132 were aligned to the genome using minimap2 2.3033 and integrated into the annotation framework in two rounds. The first round generated transcript alignments and preliminary structural models, while the second round updated the BRAKER3 gene models by adding UTR annotations and refining exon-intron boundaries based on transcript evidence. The representative transcript for each gene was then selected using AGAT 1.4.134. This integrative workflow produced a total of 19,420 genes (Table 5), with a total gene length of 123,281,089 bp, exon length of 43,511,086 bp, CDS length of 30,009,737 bp, and intron length of 79,770,003 bp.
After searches against public databases (NR, SwissProt, eggNOG, KEGG, Pfam and GO), a total of 17,740 genes (91.35%) were functionally annotated (Table 6). The total number of predicted genes in P. sinensis (19,420) is comparable to that reported for other sea anemones, such as Nematostella vectensis (19,231 predicted genes; GCA_93252622516) and Anthopleura sola (GCA_02334942535,36), for which 19,899 genes were estimated in our previous study29, although this information was not reported in the original genome publication33.
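The gene statistics above can be cross-checked with simple arithmetic, as in the following Python sketch. All input values are taken from the text; the variable names and the derived quantities (non-coding exonic length, mean gene length) are illustrative, and the exon-plus-intron identity assumes that the totals refer to one representative transcript per gene.

```python
# Totals reported for the P. sinensis annotation (from the text).
n_genes = 19_420
gene_bp = 123_281_089
exon_bp = 43_511_086
cds_bp = 30_009_737
intron_bp = 79_770_003
n_annotated = 17_740

# With one representative transcript per gene, exons and introns
# should partition the gene span exactly.
assert exon_bp + intron_bp == gene_bp

# Exonic sequence that is not coding (assumed here to be UTR).
noncoding_exon_bp = exon_bp - cds_bp

mean_gene_len = gene_bp / n_genes
annotated_pct = n_annotated / n_genes * 100

print(f"non-coding exonic length: {noncoding_exon_bp:,} bp")
print(f"mean gene length: {mean_gene_len:,.0f} bp")
print(f"functionally annotated: {annotated_pct:.2f}%")
```

The exon and intron totals sum exactly to the reported gene length, and the annotated fraction reproduces the 91.35% given in the text.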

Table 5 Gene prediction through BRAKER 3 integrating.
Table 6 Functional annotation of the predicted genes.

Data Records

The sequencing data generated in this study have been deposited in the National Center for Biotechnology Information (NCBI) under BioProject number PRJNA1334318. Specifically, the Illumina, PacBio HiFi, and Hi-C datasets are available in the NCBI Sequence Read Archive under accession numbers SRR35627867, SRR35627868, and SRR3562786637, respectively. The chromosome-level assembly of Paracondylactis sinensis genome was deposited in the NCBI genome database under accession number GCA_05449177538 and the corresponding BioSample ID is SAMN51854516. The genome annotation results are accessible from the Figshare database (https://doi.org/10.6084/m9.figshare.30209509)39.

Technical Validation

The completeness of the chromosome-level genome was evaluated by Benchmarking Universal Single-Copy Orthologs (BUSCO 5.4.7) analysis40, the Core Eukaryotic Genes Mapping Approach (CEGMA 2.5)41 and a k-mer-based approach implemented in Merqury 1.342. BUSCO analysis with the metazoan (odb10) gene set revealed that the genome of P. sinensis and its predicted genes covered 95.91% and 97.69% of the “core” metazoan genes, respectively (Table 2). CEGMA analysis indicated that the genome assembly covered 96.37% of highly conserved proteins in eukaryotic genomes. The k-mer-based evaluation was performed using a 21-mer database constructed from Illumina short reads. The result showed a high consensus accuracy (QV = 47.45, i.e., an error rate of 1.80 × 10⁻⁵) but a relatively low completeness of 87.82% for the assembly. The spectra-cn plot indicated that most missing k-mers (the “read-only” part) were concentrated around the heterozygous peak (Fig. 9a), potentially representing heterozygous regions absent from the haploid assembly. To confirm this, we evaluated the completeness of a diploid representation that included both haplotype-resolved assemblies generated by hifiasm 0.20.014. The k-mer completeness of the diploid representation increased to 99.31%, and the “read-only” k-mers were not aggregated around the primary or heterozygous peaks (Fig. 9b). This result indicates that the lower completeness of the primary assembly mainly reflects unrepresented allele-specific k-mers rather than widespread assembly errors.
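The relationship between the consensus quality value and the per-base error rate is the standard Phred transformation, QV = −10 · log10(error rate). The Python sketch below converts between the two and also illustrates k-mer completeness as a set ratio. The function names and the toy k-mer sets are assumptions; Merqury's own completeness is computed over reliable read k-mers, which this simplified ratio only approximates.

```python
import math


def qv_to_error_rate(qv):
    """Phred-scaled quality value -> per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Per-base error probability -> Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


def kmer_completeness(read_kmers, assembly_kmers):
    """Fraction of read k-mers that are also present in the assembly."""
    return len(read_kmers & assembly_kmers) / len(read_kmers)


# QV reported for the P. sinensis assembly.
error_rate = qv_to_error_rate(47.45)
print(f"QV 47.45 -> error rate {error_rate:.2e}")

# Toy k-mer sets (hypothetical): one heterozygous k-mer is missing
# from a haploid assembly but present in a diploid representation.
reads = {"ACG", "CGT", "GTA", "GTC"}
haploid = {"ACG", "CGT", "GTA"}
diploid = haploid | {"GTC"}
print(kmer_completeness(reads, haploid), kmer_completeness(reads, diploid))
```

The toy sets mirror the pattern described above: k-mers from the alternative allele count as missing in a haploid assembly and are recovered once both haplotypes are included.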

Fig. 9

Merqury copy number spectrum (spectra-cn) plots for the haploid (a) and diploid (b) assemblies of the Paracondylactis sinensis genome. Colors represent the copy number of each k-mer in the assembly. “read-only” denotes k-mers found only in the Illumina reads.

The trimmed Illumina short reads and PacBio long reads were mapped against the assembled genome using BWA 0.7.1943. The mapping rates were 99.70% and 99.93%, respectively (Table 2), also indicating high accuracy of the assembly.