Background & Summary

The deep sea, an integral but underexplored part of the marine ecosystem characterized by darkness, high pressure, hypoxia, low temperatures, and limited food, was once considered barren1,2. However, recent technological advances in deep-sea exploration have revealed an unexpected abundance and diversity of animals inhabiting this extreme environment, with over 600 invertebrate species reported in hydrothermal vents and cold seeps alone, dominated by molluscs (e.g., mussels, limpets, snails) and crustaceans (e.g., shrimps)3,4,5. These very extremes make the deep sea an unparalleled natural laboratory for probing the limits of life, its origins, and genomic adaptation. Despite these discoveries, how animals evolved and adapted to these harsh conditions remains a major, poorly understood question. Genomic analysis provides the foundation for addressing it. Chromosome-level assemblies are indispensable in this respect, as they enable the identification of large-scale rearrangements, conserved synteny, and the organization of regulatory elements that shape genome structure and evolution, none of which can be resolved from contig-level data6. The advent of high-throughput sequencing has now ushered deep-sea biology into the genomics era7,8,9,10, with chromosome-level assemblies enabling the dissection of adaptive mechanisms across multiple genomic levels11,12,13.

Molluscs, the second-largest animal phylum and an evolutionarily successful group that has survived multiple mass extinctions since the Cambrian, are well adapted to diverse environments. While genomic and transcriptomic analyses have enhanced our understanding of their molecular adaptations, most studies focus on shallow-water species14,15,16,17,18. Recent genomic studies of deep-sea molluscs have begun to reveal critical mechanisms of adaptation, providing deeper insights into the genetic basis of survival in extreme deep-sea environments7,11,19,20. Limpets (Patellogastropoda), which belong to one of the most primitive gastropod lineages and serve as emerging evolutionary models, are particularly valuable for such studies21,22,23. The existing contig-level genome assembly of the deep-sea limpet Bathyacmaea lactea (B. lactea) reveals adaptive signatures through expanded gene families and rapidly evolving genes associated with heterocyclic compound metabolism, membrane-bounded organelles, metal ion binding, and nitrogen and phosphorus metabolism, illuminating molecular adaptations to cold seeps24. However, the current lack of chromosome-level genomic resources for deep-sea limpets impedes essential analyses such as synteny-based phylogenomics, adaptive structural variant detection, and chromatin topology mapping, which are critical for fully deciphering the genetic basis of extreme environment colonization.

In this study, we constructed a high-quality chromosome-level genome assembly for B. lactea using PacBio long reads and Hi-C sequencing data. The genome assembly spans 769.80 Mb with a scaffold N50 of 82.05 Mb, a substantial improvement in contiguity over the previously reported contig-level genome assembly, which had a contig N50 of 1.57 Mb24. Of the total assembly, 95.22% was anchored to 10 chromosomes, and the BUSCO completeness score is 96.90%. In summary, this high-quality chromosome-level genome assembly will serve as a valuable resource for future studies on the adaptation and evolution of animals in cold seeps.

Methods

Sample collection and sequencing

B. lactea specimens were collected in the South China Sea (22.10°N, 119.28°E, 1,168 m depth) using the remotely operated vehicle (ROV) of the research vessel Kexue. Tissues were frozen in liquid nitrogen and then stored at −80 °C until further use.

We applied a k-mer-based method to estimate the genome size of B. lactea25. First, a paired-end sequencing library with an insert size of 350 bp was constructed and sequenced on the Illumina HiSeq X Ten platform, generating a total of 36.95 Gb of short-read sequencing data (Table 1). Second, we used the PacBio platform to generate long genomic reads for constructing the B. lactea reference genome. A total of 105.13 Gb of data were generated with a mean read length of 9.00 kb. Third, we applied Hi-C sequencing to scaffold the assembly to the chromosome level. For the Hi-C library, formaldehyde was used to crosslink cells, and DpnII restriction enzyme was employed to digest the DNA. The DNA was then labeled and processed through terminal repair. The Hi-C library was constructed using the Dovetail Omni-C Kit (Cantata Bio, USA) following the manufacturer’s protocol. Quality control of the library was performed using Qubit 2.0, the Agilent 2100 system, and qPCR. Hi-C sequencing was carried out on the Illumina NovaSeq 6000 platform (Illumina, USA). The resulting sequencing data were incorporated into the genome assembly process and are summarized in Table 1.

Table 1 Comparison of B. lactea genomic quality between the past version and this study.

Genome assembly and Hi-C scaffolding

The PacBio long reads were assembled using Canu version 2.226 with parameters optimized for eukaryotic genomes (genomeSize = 792.9 m). The assembly was then polished using Pilon v1.2327 (--fix snps,indels) with the Illumina short-read data. Duplicate sequences were removed using Purge_dups v1.2.528, with cutoff parameters estimated automatically from the read-depth histogram. For Hi-C scaffolding, the Hi-C sequencing data were first processed with Juicer v1.629 (-s DpnII -y /path/to/DpnII.txt). Next, 3D-DNA v20100830 was used to scaffold the contigs with default settings. Juicebox v1.11.0829 was then used to visualize the chromosome assembly, select the best result from the 3D-DNA output, and refine the chromosome boundaries based on the interaction heatmap. Finally, we reran 3D-DNA with the corrected assembly and exported the final chromosome-level genome. After Hi-C scaffolding, 95.22% of the assembled sequences were anchored to 10 chromosomes, which are clearly delineated in the interaction heatmap (Fig. 1b). Unless otherwise specified, the tools in this section were run with default parameters.
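The contiguity and anchoring figures quoted for the assembly can be recomputed from a list of scaffold lengths. The following is a minimal sketch rather than the script used in this study: the function names are illustrative, the toy lengths are hypothetical, and the anchoring helper assumes that the chromosomes are simply the longest scaffolds in the assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no sequence lengths given")


def anchored_fraction(lengths, n_chromosomes):
    """Fraction of assembled bases held in the n longest scaffolds.

    Assumption: the chromosome-scale scaffolds are the longest ones.
    """
    ordered = sorted(lengths, reverse=True)
    return sum(ordered[:n_chromosomes]) / sum(ordered)


# Toy scaffold lengths in Mb (hypothetical, not the B. lactea scaffolds).
toy = [90, 80, 70, 40, 10, 5, 5]
print(n50(toy))                              # 80
print(round(anchored_fraction(toy, 3), 3))   # 0.8
```

Applied to the real scaffold lengths, the same two functions should return values comparable to the scaffold N50 and the anchoring rate reported above.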

Fig. 1

Chromosome-level genome assembly of B. lactea. (a) Genome overview of the 10 chromosomes. Tracks from inner to outer represent assembled chromosomes, GC skew (heatmap with significance bars showing GC preference), GC content, and gene density, respectively, with densities calculated within a 100-kb window. (b) Hi-C interaction heatmap of the B. lactea genome. Color blocks indicate interaction intensity from yellow (low) to red (high).
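The windowed GC tracks in panel (a) rest on two simple quantities: GC content, (G + C) / (A + C + G + T), and GC skew, (G − C) / (G + C). The sketch below shows one way to compute both per window. It is an illustration under stated assumptions, not the plotting code behind Fig. 1: the function name is hypothetical, ambiguous bases such as N are ignored, and the toy sequence uses an 8-bp window in place of the 100-kb window used for the figure.

```python
def gc_windows(seq, window=100_000):
    """Yield (start, gc_content, gc_skew) for consecutive windows."""
    seq = seq.upper()
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        a, t = chunk.count("A"), chunk.count("T")
        acgt = a + c + g + t
        # Guard against windows made up entirely of ambiguous bases.
        gc_content = (g + c) / acgt if acgt else 0.0
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        yield start, gc_content, gc_skew


# Toy sequence (hypothetical): a G-rich window followed by a C-rich window.
toy_seq = "GGGCAATT" + "CCCGAATT"
for row in gc_windows(toy_seq, window=8):
    print(row)
# (0, 0.5, 0.5)
# (8, 0.5, -0.5)
```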

Repeat annotation

We used RepeatModeler2 v2.0.131 with default parameters to build a de novo repeat library. LTR_FINDER v1.0732 (-D 15000 -d 1000 -L 7000 -l 100 -p 20) and LTR_retriever v2.9.033 (default parameters) were used to identify long terminal repeat (LTR) sequences in the genome. Repeats in the B. lactea genome were then identified using RepeatMasker v4.134 (-e rmblast -xsmall -s -a) with the RepBase database, the LTR library, and the species-specific de novo library. Transposable elements (TEs) make up 43.68% of the B. lactea genome, with retroelements accounting for 25.68% and DNA transposons for 16.74% (Table 2).

Table 2 Statistics of repeat elements in the genome of B. lactea.

Gene prediction and annotation

Protein-coding genes were predicted using three approaches: ab initio prediction, homology-based prediction, and transcript-based prediction. Ab initio prediction was carried out with Braker2 v2.1.635 (--round 2). For homology-based prediction, protein sequences were retrieved from the NCBI database for eight species: Homo sapiens (GCF_000001405.40)36, Nematostella vectensis (GCF_932526225.1)37, Drosophila melanogaster (GCF_000001215.4)38, Patinopecten yessoensis (GCF_002113885.1)17, Haliotis discus (GCA_044707095.1)39, Elysia chlorotica (GCA_003991915.1)40, Pomacea canaliculata (GCF_003073045.1)41, and Lottia gigantea (GCF_000327385.1)42. These sequences were aligned to the genome using TblastN v2.13.043 with an e-value ≤ 1e-5. For transcript-based annotation, clean RNA-seq reads were aligned to the B. lactea genome assembly using HISAT2 v2.2.144, and the gene set was predicted using PASA v2.5.245 (-C -R --ALIGNERS blat,gmap). The results from these three methods were integrated with EvidenceModeler v1.1.146 (--segmentSize 100000 --overlapSize 10000) to generate a final, non-redundant gene set. A total of 21,122 protein-coding genes were annotated in the B. lactea genome.
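The e-value threshold applied to the homology search can be illustrated with a short filter. This sketch assumes that the hits are available in the standard 12-column BLAST tabular format (-outfmt 6); the output format actually used in this study is not stated, and the function name and toy hits are hypothetical.

```python
import io

# Column order of the standard BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, max_evalue=1e-5):
    """Yield hits (as dicts) whose e-value is at or below the cutoff."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        hit = dict(zip(FIELDS, line.split("\t")))
        if float(hit["evalue"]) <= max_evalue:
            yield hit


# Two toy hits (hypothetical): one strong, one weak.
toy_hits = (
    "prot1\tchr1\t85.0\t120\t18\t0\t1\t120\t1000\t1359\t1e-30\t150\n"
    "prot2\tchr2\t40.0\t50\t30\t0\t1\t50\t500\t649\t0.5\t30\n"
)
kept = list(filter_hits(io.StringIO(toy_hits)))
print([hit["qseqid"] for hit in kept])  # ['prot1']
```

The same cutoff can usually be applied at search time instead, which avoids writing weak hits to disk in the first place.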

Functional annotation of the predicted protein-coding genes was conducted using six public databases: Kyoto Encyclopedia of Genes and Genomes (KEGG), Gene Ontology (GO), NCBI-NR (non-redundant protein database), Swiss-Prot, SMART, and InterProScan. BLASTP v2.2.2347 was used with an e-value cutoff of 1e-5. The results indicated that 20,387 (96.50%) of the predicted genes were successfully annotated by at least one of the public databases (Table 3).

Table 3 The statistics of functional annotation in the deep-sea limpet B. lactea.

Non-coding RNAs were annotated using tRNAscan-SE48 and the Rfam database49 based on Infernal50. The non-coding RNA statistics are summarized in Table 4.

Table 4 The statistics of ncRNA annotation in the deep-sea limpet B. lactea.

Data Records

The raw sequencing data of B. lactea have been deposited in the NCBI Sequence Read Archive (SRA) under BioProject accession number PRJNA1237989 with SRA study accession number SRP57168351. The final genome assembly is available in the European Nucleotide Archive (ENA) under accession number GCA_977020065.152. In addition, the genome assembly data and annotations have been deposited in Figshare53.

Technical Validation

Evaluation of the genome assembly

The genome size of B. lactea was estimated to be approximately 792.90 Mb based on 19-mer frequency distribution analysis. This estimate is close to the final genome assembly size of 769.80 Mb (Table 1). The assembly produced 10 chromosomes, which is consistent with previous karyotyping studies of shallow-water limpets, including species from the Patelloidea superfamily54,55.
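The principle behind a k-mer-based size estimate is that genome size is approximately the total number of k-mers divided by the depth at the main peak of the k-mer frequency distribution. The sketch below illustrates that calculation only. It is not the tool cited above: the function name, the error-depth cutoff, and the toy histogram are assumptions, and it ignores heterozygous peaks and repeat structure, which dedicated tools model explicitly.

```python
def estimate_genome_size(histogram, min_depth=3):
    """Estimate genome size from a k-mer depth histogram.

    histogram maps depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and skipped
    (an assumed cutoff; real data need it chosen from the histogram).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


# Toy histogram (hypothetical): an error spike at depth 1-2, main peak at 20.
toy_hist = {1: 1000, 2: 50, 18: 80, 19: 100, 20: 120, 21: 100, 22: 80}
print(estimate_genome_size(toy_hist))  # 480.0
```

As a rough cross-check on the real data, 36.95 Gb of short reads over an estimated 792.90 Mb genome corresponds to a raw base coverage of about 47×.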

To evaluate the completeness of the B. lactea genome assembly, we ran BUSCO v556 with the “metazoa_odb10” database. The BUSCO completeness of the genome was 96.9% (C: 96.9%, S: 95.5%, D: 1.4%, F: 1.5%, M: 1.6%, n: 954) (Table 5). Additionally, the Illumina short reads used for the genome survey were aligned to the B. lactea assembly using Bowtie2 v2.4.557, with 98.20% of the reads mapping to the genome. Together, these results support the high quality of the genome assembly.

Table 5 Statistics of BUSCO assessment after Hi-C.

Gene annotation validation

To assess the completeness of the annotated gene set, we conducted a BUSCO analysis using the “metazoa_odb10” database. The analysis showed that 87.60% of the conserved single-copy orthologs were complete (86.70% single-copy and 0.90% duplicated), 4.70% were fragmented, and 7.70% were missing (Table 5). Additionally, functional annotation indicated that 96.50% of the predicted genes were annotated by at least one public database (Table 3).
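The BUSCO categories obey two identities that make reported values easy to sanity-check: C = S + D, and C + F + M = 100%. The sketch below applies both checks to the genome-level and gene-set values reported in this section and converts the percentages to approximate ortholog counts. The function name and tolerance are illustrative, and n = 954 is assumed for the gene set because the same lineage database was used.

```python
def check_busco(c, s, d, f, m, n, tol=0.15):
    """Verify BUSCO percentage identities and return approximate counts.

    tol absorbs rounding of the reported percentages to one decimal place.
    """
    if abs((s + d) - c) > tol:
        raise ValueError("S + D does not equal C")
    if abs((c + f + m) - 100.0) > tol:
        raise ValueError("C + F + M does not sum to 100%")
    percentages = {"complete": c, "single": s, "duplicated": d,
                   "fragmented": f, "missing": m}
    # Counts are approximate because the inputs are rounded percentages.
    return {name: round(value / 100 * n) for name, value in percentages.items()}


genome = check_busco(96.9, 95.5, 1.4, 1.5, 1.6, 954)
gene_set = check_busco(87.6, 86.7, 0.9, 4.7, 7.7, 954)  # n assumed to be 954
print(genome["complete"], gene_set["complete"])
```

Both sets of reported values pass the two checks. Because the counts are derived from rounded percentages, they may differ by one or two orthologs from the values in the BUSCO summary files.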