Background & Summary

Protobranchia constitute a relatively understudied yet phylogenetically primitive subclass of bivalve molluscs, offering valuable insights into early bivalve evolution and ecological diversification [1,2]. Within this group, the Solemyidae represent one of the most ancient lineages, with fossil evidence tracing their origins back to the Early Ordovician [3]. As one of the earliest-diverging bivalve lineages, solemyids may have retained ancestral traits that were subsequently lost in other clades, thereby providing a critical window into the origin and evolutionary history of Bivalvia. However, genomic resources for Solemyidae remain extremely limited; to date, only the genome of the shallow-water solemyid Solemya velum has been published, which severely constrains our understanding of this lineage’s evolutionary trajectory [4].

Morphologically, solemyids are characterized by a distinctive thickened frill of radially pleated periostracum that extends beyond the calcified shell margins [5]. Their valves are equilateral, anteriorly elongate, and subcylindrical, with generally parallel dorsal and ventral margins and an edentate hinge, the latter representing an early apomorphic trait [6]. The family Solemyidae comprises approximately 30 described extant species belonging to the genera Solemya Lamarck, 1818 and Acharax Dall, 1908 [5]. Species of Solemya typically inhabit anaerobic muddy and sandy substrates on continental shelves and upper slopes at depths of 0–600 m [7], whereas members of Acharax are primarily restricted to deep-sea chemosynthetic ecosystems, including cold seeps and hydrothermal vents, at depths of 400–5,379 m [8,9]. Consequently, Acharax species are rarely collected alive, and knowledge of their phylogeny, diversity, and genomic architecture remains scarce.

The genus Acharax currently comprises approximately 20 fossil species and nine extant species [10,11]. Recent morphological examinations combined with molecular phylogenetic analyses led to the identification of Acharax haimaensis as a new species from the Haima cold seep in the South China Sea [12]. Over long evolutionary timescales, A. haimaensis has developed a suite of adaptive strategies that enable survival in extreme deep-sea habitats characterized by hypoxia, low temperatures, and high hydrostatic pressure. A key feature of this adaptation is its obligate symbiosis with chemoautotrophic, gill-associated sulfur-oxidizing bacteria, which provide an alternative carbon source in nutrient-limited environments [12]. These microbial partners fulfill dual ecological roles: they sustain primary production within the host and facilitate detoxification and metabolic flexibility in sulfide-rich, reducing conditions [13]. Owing to this symbiotic association and its habitat specialization, A. haimaensis is a valuable model system for elucidating the ecological, physiological, and genomic mechanisms underlying adaptation to extreme environments in deep-sea organisms. Despite this evolutionary and ecological significance, the paucity of genomic resources remains a major obstacle to understanding the origin and evolution of A. haimaensis, its metabolic interactions with endosymbionts, and the molecular mechanisms underlying deep-sea adaptation.

Here, we report a high-quality chromosome-level reference genome of A. haimaensis, generated using PacBio HiFi long reads, Illumina short reads, and Hi-C scaffolding (Table 1). The genome size of A. haimaensis was estimated at ~4.0 Gb using Jellyfish v2.3.1, with a heterozygosity of 1.47%. The initial assembly comprised 2,866 contigs with a contig N50 of 9.45 Mb (Table 2). Following Hi-C scaffolding, 91.57% of contigs were anchored to 22 chromosomes (Fig. 1; Table 3), yielding a final assembly of 4.27 Gb with a contig N50 of 10.53 Mb and a scaffold N50 of 195.52 Mb (Tables 2 and 3). The mapping rate of Illumina short reads to the assembly was 99.34%. Genome completeness was further supported by the identification of 937 out of 954 metazoan Benchmarking Universal Single-Copy Orthologs (BUSCOs), corresponding to 98.2% completeness (Table 2). Repetitive element annotation revealed that transposable elements (TEs) account for 50.17% of the genome, with LINEs, SINEs, LTRs, and DNA transposons contributing 14.20%, 0.35%, 5.91%, and 8.56%, respectively (Table 4). We predicted 38,343 protein-coding genes, of which 87.25% could be functionally annotated against at least one public database (Table 5). In addition, 35,928 tRNAs, 18,341 rRNAs, 3,972 snRNAs, and 81 miRNAs were annotated (Table 5). Macrosynteny analysis showed that each chromosome of A. haimaensis comprises segments derived from two to four ancestral linkage groups (ALGs) of bilaterians, cnidarians, and sponges (BCnS), indicating extensive chromosomal breakage and fusion during its evolutionary history (Fig. 2). Phylogenetic analysis revealed that A. haimaensis diverged from the common ancestor of Autobranchia ~550 Mya, underscoring its basal position and primitive evolutionary status within Bivalvia (Fig. 3).

Table 1 Overview of sequencing data.
Table 2 Statistics of genome assembly.
Fig. 1

Genomic characteristics of A. haimaensis. (A) Hi-C contact heatmap of chromosome interactions. The scale bar indicates interaction intensity from yellow (low) to red (high). (B) Circos view of the assembled chromosomes showing marker distributions at 1-Mb sliding windows from outer to inner circle: (a) chromosome length, (b) GC content, (c) gene density, (d) long interspersed nuclear element (LINE) density, (e) short interspersed nuclear element (SINE) density, (f) long terminal repeat (LTR) density, and (g) DNA transposon density.

Table 3 Statistics of Hi-C scaffolding.
Table 4 Statistics of transposable elements (TEs) annotation.
Table 5 Statistics of genome annotation.
Fig. 2

Chromosome-level macrosynteny between A. haimaensis and ancestral linkage groups (ALGs) of bilaterians, cnidarians, and sponges (BCnS). Distinct colors represent individual ALGs. Colored dots denote genes with statistically significant conserved synteny (P < 0.05), whereas black dots indicate non-significant synteny.

Fig. 3

Phylogenetic tree constructed from the single-copy orthologs identified in A. haimaensis and 21 other bivalve species.

Methods

Sampling and sequencing

Individuals of A. haimaensis were collected from the Haima cold seep in the South China Sea at a depth of 1,375 m (16°43′43″ N, 110°28′21″ E) using a TV grab. Adductor muscle tissue from one individual was aseptically dissected, immediately flash-frozen in liquid nitrogen, and stored at –80 °C for subsequent Illumina, PacBio, and Hi-C sequencing. High-molecular-weight genomic DNA was extracted using the SDS method and further purified with a QIAGEN Genomic Kit (QIAGEN, Germany). DNA integrity and contamination were assessed by 0.75% agarose gel electrophoresis, while DNA purity was evaluated using a NanoDrop spectrophotometer (Thermo Fisher Scientific, USA). High-quality DNA was then used for library preparation and sequencing. For Illumina sequencing, a short-insert library was first constructed to estimate genome complexity. Approximately 2 μg of genomic DNA was fragmented to an average size of 350 bp via ultrasonication, and the resulting library was sequenced on the Illumina NovaSeq 6000 platform (Illumina, USA), yielding 221.71 Gb of 150 bp paired-end reads with a coverage of ~52× (Table 1). For PacBio HiFi long-read sequencing, 10 μg of genomic DNA was sheared into 10–20 kb fragments using a g-TUBE (Covaris, USA). The fragments were subjected to damage repair, end polishing, and ligation with stem–loop adapters. Unligated DNA was removed by exonuclease treatment, and target fragments were size-selected using a BluePippin system (Sage Science, USA). Sequencing on the PacBio Sequel II platform in CCS mode produced seven SMRT cells of data, generating 143.67 Gb of clean HiFi reads (~34× coverage) (Table 1). For Hi-C library construction, muscle cells from another individual were crosslinked with 1% formaldehyde and quenched with 0.2 M glycine. Crosslinked chromatin was digested with the MboI restriction enzyme, end-repaired, and biotin-labeled. Proximity-ligated chimeric DNA fragments were circularized with T4 DNA ligase, purified, and sheared into smaller fragments. Biotin-labeled fragments were enriched using streptavidin-coated magnetic beads (Invitrogen, USA), and the resulting Hi-C library was sequenced on the Illumina NovaSeq 6000 platform, producing 230.89 Gb of clean reads with ~54× coverage (Table 1).
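The fold-coverage values quoted above can be reproduced by dividing each sequencing yield by the genome size. The short sketch below assumes the 4.27 Gb assembled size was used as the denominator; this is an inference from the reported numbers, not something the text states. The function and variable names are illustrative only.

```python
# Sequencing yields (Gb) as reported in the text.
YIELDS_GB = {
    "Illumina": 221.71,
    "PacBio HiFi": 143.67,
    "Hi-C": 230.89,
}

# Assumption: coverage is expressed relative to the 4.27 Gb assembly.
ASSEMBLY_SIZE_GB = 4.27


def fold_coverage(yield_gb: float, genome_gb: float = ASSEMBLY_SIZE_GB) -> float:
    """Return sequencing depth as total sequenced bases / genome size."""
    return yield_gb / genome_gb


coverage = {name: fold_coverage(gb) for name, gb in YIELDS_GB.items()}

for name, depth in coverage.items():
    print(f"{name}: ~{depth:.0f}x")
```

With the ~4.0 Gb k-mer estimate as the denominator instead, the Illumina depth would come out near 55×, which is why the assembled size is assumed here.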

RNA-seq was performed to facilitate genome annotation. Total RNA was extracted from gill, foot, and mantle tissues using the TRIzol reagent (Invitrogen, USA). RNA integrity was evaluated by 1% agarose gel electrophoresis, while RNA purity and concentration were assessed with a NanoDrop spectrophotometer (Thermo Fisher Scientific, USA). Poly(A)+ mRNA was enriched using oligo(dT)-attached magnetic beads, and first-strand cDNA was synthesized with M-MuLV reverse transcriptase. Library construction was carried out using Phusion High-Fidelity DNA Polymerase. Sequencing was performed on the Illumina NovaSeq platform, generating a total of 37.20 Gb of paired-end clean reads with an average Q30 value of 93.25% (Table 1).

Genome assembly and Hi-C scaffolding

To estimate genome size, heterozygosity, and repeat content, k-mer analysis was performed using Jellyfish v2.3.1 with k = 21 [14]. The resulting k-mer frequency distribution was analyzed using GenomeScope v2.0.1 with the parameters -m 17 -p 2 [15]. The genome size of A. haimaensis was estimated to be approximately 4.0 Gb, with a heterozygosity of 1.47%. Based on PacBio HiFi reads, a contig-level assembly was generated using Hifiasm v0.19.4 with default parameters [16]. HiFi reads were aligned to the draft assembly with Minimap2 v2.24 [17], and redundant sequences were removed using Purge_Dups v1.2.3 (https://github.com/dfguan/purge_dups), followed by haplotype purging with Purge Haplotigs v1.1.2. The initial assembly consisted of 2,866 contigs with a contig N50 of 9.45 Mb (Table 2). For chromosome-level scaffolding, raw Hi-C reads were first filtered using HiC-Pro v3.1.0 and processed with Juicer v1.6 [18,19]. Contigs were then assembled into pseudo-chromosomes using 3D-DNA v201008 [20]. Contact maps were visualized and manually curated with Juicebox v1.11.08, during which misassemblies were corrected and chromosomal boundaries refined based on Hi-C interaction heatmaps [19]. A final chromosome-level assembly was produced by re-running 3D-DNA after manual correction. Following Hi-C scaffolding, 97.08% of the reads were anchored to 22 chromosomes (Fig. 1; Table 3), resulting in a 4.27 Gb assembly with a scaffold N50 of 195.52 Mb (Table 2). Assembly completeness was assessed using BUSCO v5.7.1 with the metazoan odb10 dataset, which showed a high assembly completeness of 98.2% (Table 2) [21].
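GenomeScope fits a statistical model to the k-mer spectrum. The intuition behind k-mer-based size estimation can be shown with a much simpler estimator: the total number of k-mer occurrences divided by the depth of the main coverage peak. The sketch below implements only that simplification, not GenomeScope's model. The function name, the `min_depth` error cutoff, and the toy histogram are hypothetical. For a heterozygous genome such as this one (1.47%), the tallest peak may be the heterozygous peak, in which case this naive estimate would need the homozygous peak to be chosen explicitly.

```python
def estimate_genome_size(histogram: dict[int, int], min_depth: int = 5) -> float:
    """Naive genome size estimate from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored.
    """
    kept = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    # Total k-mer occurrences contributed by the retained depths.
    total_kmers = sum(depth * n for depth, n in kept.items())
    # Depth with the most distinct k-mers, taken as the main coverage peak.
    peak_depth = max(kept, key=kept.get)
    return total_kmers / peak_depth


# Toy histogram (hypothetical): low-depth error k-mers plus one peak at depth 20.
toy_histogram = {1: 1000, 2: 200, 18: 10, 19: 30, 20: 50, 21: 30, 22: 10}
toy_size = estimate_genome_size(toy_histogram)
print(toy_size)
```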

Repeat and gene annotation

Transposable elements (TEs) were identified using a combination of de novo and homology-based approaches. A de novo repeat library was constructed with MITE-Tracker and RepeatModeler v2.0.1 under default parameters [22,23]. Long terminal repeat (LTR) elements were detected with LTR_Finder v1.07 and LTR_retriever v2.9.0 [24,25]. The resulting de novo repeat library and LTR dataset were merged to generate a comprehensive repeat library. For homology-based detection, RepeatMasker v4.1 was applied to screen the genome against both the Dfam/Repbase databases and the custom repeat library [26,27]. Gene prediction was performed on the repeat-masked genome by integrating de novo, homology-based, and transcriptome-based strategies. For homology-based annotation, BLAST v2.16.0 was used to align protein sequences from representative bivalves (Archivesica marissinica, Mercenaria mercenaria, Sinonovacula constricta, Tegillarca granosa, Gigantidas platifrons, and Patinopecten yessoensis), and gene structures were subsequently inferred using GeneWise v2.4.1 [28]. RNA-seq reads were aligned to the genome using HISAT2 v2.2.1 [29], and StringTie v2.2.1 [30] was used to reconstruct transcripts as transcript-based evidence for genome annotation. Assembled transcripts were further integrated with PASA v2.5.2 [31]. De novo gene prediction was conducted with AUGUSTUS v3.5.0 and BRAKER2 v2.1.6 [32,33]. The final consensus gene set was generated using EvidenceModeler v2.1.0 [34], and untranslated regions as well as alternative splicing events were annotated with GUSHR and GeMoMa based on RNA-seq alignments [35]. In total, 38,343 protein-coding genes were predicted from the A. haimaensis genome (Table 5). The gene set exhibited a BUSCO completeness score of 96.0% (Table 2). Functional annotation revealed that 33,458 (87.25%) of the predicted genes matched entries in at least one major public database, including Swiss-Prot, TrEMBL, NR, Pfam, KEGG, GO, and COG/KOG (Table 5). Additionally, non-coding RNAs were annotated: transfer RNAs (tRNAs) were identified using tRNAscan-SE v1.3.1 [36], microRNAs (miRNAs) and small nuclear RNAs (snRNAs) were predicted using Infernal with the Rfam database [37,38], and ribosomal RNAs (rRNAs) were detected using RNAmmer and Barrnap [39].

Data Records

The raw Illumina, PacBio, Hi-C, and RNA-seq sequencing data of Acharax haimaensis were deposited in the NCBI SRA under the accession number SRP617901 [40]. Specifically, Illumina sequencing data are available under accession number SRX30378101; PacBio sequencing data under accession numbers SRX30378102, SRX30378104, SRX30378094, SRX30378095, SRX30378096, and SRX30378097; Hi-C sequencing data under accession number SRX30378098; and RNA-seq data under accession numbers SRX30378099, SRX30378100, and SRX30378103. The chromosome-level genome assembly has been deposited in the NCBI GenBank database under accession number GCA_054131465.1 [41]. In addition, the assembled genome sequence and genome annotation files are publicly available at Figshare [42].

Technical Validation

Evaluating genome assembly and annotation completeness

The DNA fragments used for PacBio sequencing were predominantly distributed around 16.82 kb, with an N50 read length of 17.54 kb (Table 1). The assembled genome size of A. haimaensis was 4.27 Gb, consistent with the estimate from Jellyfish v2.3.1 (~4.0 Gb) [14]. The assembly achieved a quality value (QV) of 62.28, as calculated with Merqury v1.347 [43], reflecting high assembly accuracy. The initial assembly comprised 2,866 contigs with a contig N50 of 9.45 Mb (Table 2). Of the Hi-C reads, 19.87% were deemed valid, and after scaffolding, 91.57% (239/261) of contigs were successfully anchored to 22 chromosomes (Table 3). Illumina short reads were mapped to the final assembly using BWA v0.7.19 [44], with mapping statistics calculated by samtools flagstat v1.22.1 [45], yielding a mapping rate of 99.34%. RNA-seq reads and PacBio HiFi reads were further aligned to the assembly using HISAT2 and Minimap2, respectively [17,29], with mapping rates of 81.47% and 99.91%. Assembly completeness was evaluated using BUSCO v5.7.1 with the metazoan odb10 dataset [21], which recovered 937 of 954 single-copy orthologs, corresponding to 98.2% completeness. Gene set quality was also assessed with BUSCO, yielding a completeness score of 96.0%, consistent with that of the assembly (Table 2). Collectively, these results demonstrate that the A. haimaensis genome assembly and annotation are of high quality and completeness.
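For readers less familiar with the contiguity metric used throughout this section, N50 is the sequence length at which the longest sequences, taken in descending order, first cover at least half of the total bases. The following is a minimal sketch of that standard definition; the function name and the toy lengths are illustrative and are not taken from the assembly itself.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a set of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length of
    the sequence at which the running total first reaches half of all bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy contig lengths (hypothetical, arbitrary units): total = 300, half = 150.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

The same function applies to contigs, scaffolds, or reads; only the input list of lengths changes.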

Macrosynteny analysis

Oxford Dot Plot (ODP) software (https://github.com/conchoecia/odp?tab=readme-ov-file) was employed to investigate chromosome-level synteny between A. haimaensis and BCnS ALGs [46]. Macrosyntenic relationships were visualized in ODP, with each dot representing an orthologous gene pair. The analysis revealed relatively weak syntenic conservation between A. haimaensis and BCnS ALGs. Most ALGs were fragmented into two or more parts, each mapping to different chromosomes of A. haimaensis. Each A. haimaensis chromosome comprised segments derived from two to four ALGs, indicating extensive chromosomal breakage and fusion during its evolutionary history (Fig. 2).
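Fig. 2 distinguishes ortholog placements with significant conserved synteny (P < 0.05) from non-significant ones. One common way to test whether an ALG is over-represented on a given chromosome is a one-sided Fisher's exact test on a 2×2 table of ortholog counts. The sketch below assumes that kind of test for illustration; it is not taken from the ODP source code, it applies no multiple-testing correction, and the counts in the usage lines are hypothetical.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided (enrichment) Fisher's exact test for the 2x2 table

        [[a, b],
         [c, d]]

    a = orthologs on this chromosome that belong to this ALG
    b = orthologs on this chromosome that belong to other ALGs
    c = orthologs on other chromosomes that belong to this ALG
    d = orthologs on other chromosomes that belong to other ALGs

    Returns P(X >= a) under the hypergeometric null distribution.
    """
    row1 = a + b          # orthologs on this chromosome
    col1 = a + c          # orthologs in this ALG
    total = a + b + c + d
    upper = min(row1, col1)
    tail = sum(
        comb(row1, k) * comb(total - row1, col1 - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, col1)


# Hypothetical counts: 40 of the 50 orthologs of one ALG sit on one chromosome
# that carries 100 of the 1,000 orthologs considered.
p_value = fisher_one_sided(40, 60, 10, 890)
print(p_value < 0.05)
```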

Phylogenetic analysis

A phylogenetic tree was constructed using A. haimaensis together with 22 additional bivalve species, with the chiton Acanthopleura granulata designated as the outgroup. Coding gene and protein sequences were retrieved from the MolluscDB 2.0 database (http://mgbase.qnlm.ac/home). Orthologous genes were identified with OrthoFinder v2.5.5 under default parameters [47], and the single-copy orthologous groups (OGs) were selected for phylogenetic reconstruction. Amino acid sequences were aligned using MUSCLE v3.8.1551, and conserved regions were extracted with Gblocks v0.91b under the parameters -b4=5 -b5=h -t=p -e=.2 [48]. A maximum-likelihood phylogenetic tree was constructed using IQ-TREE v2.3.0 under the Q.insect+I+R4 substitution model, with nodal support assessed using 1,000 bootstrap replicates [49]. Divergence times were estimated with the MCMCtree module in PAML v4.10.7 [50], calibrated with fossil records at the following nodes: G. platifrons–Modiolus philippinarum (130–139 Mya), Venerida–Mytilida (230–524 Mya), and A. granulata–Bivalvia (480–559 Mya) (https://timetree.org/). As shown in Fig. 3, A. haimaensis was placed as a basal lineage within Protobranchia, diverging early from the ancestral node, suggesting its primitive evolutionary status among bivalve molluscs.
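Selecting single-copy orthologous groups amounts to keeping the orthogroups in which every species contributes exactly one gene. The sketch below illustrates that filter on a tab-separated gene-count table. The assumed layout (an "Orthogroup" column, one column per species, and a "Total" column) resembles OrthoFinder's gene-count output, but that layout, the function name, and the toy table are assumptions made for illustration, not the authors' script.

```python
import csv
import io


def single_copy_orthogroups(tsv_text: str) -> list[str]:
    """Return IDs of orthogroups with exactly one gene in every species.

    Assumed layout: a header row with 'Orthogroup', one column per species,
    and a final 'Total' column; every count cell holds an integer.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    species = [
        col for col in (reader.fieldnames or [])
        if col not in ("Orthogroup", "Total")
    ]
    return [
        row["Orthogroup"]
        for row in reader
        if all(int(row[sp]) == 1 for sp in species)
    ]


# Toy table with three hypothetical species.
toy_counts = (
    "Orthogroup\tsp_A\tsp_B\tsp_C\tTotal\n"
    "OG0000000\t3\t1\t0\t4\n"
    "OG0000001\t1\t1\t1\t3\n"
    "OG0000002\t1\t2\t1\t4\n"
    "OG0000003\t1\t1\t1\t3\n"
)
selected = single_copy_orthogroups(toy_counts)
print(selected)
```

Only orthogroups passing this filter would then be aligned, trimmed, and concatenated for tree inference.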