Background & Summary

Lucinidae (Bivalvia: Lucinida) is the most species-rich family of chemosymbiotic invertebrates1. All known species of this bivalve family have established symbiotic relationships with chemosynthetic Gammaproteobacteria2,3. Lucinids are widely distributed in marine ecosystems from 70° N to 55° S, occupying intertidal zones, shallow-water sediments, and deep-sea sediments4. Previous studies have characterized the evolutionary adaptations of deep-sea bivalves to chemosymbiosis5,6,7, but coastal species may have adapted differently because photosynthetically derived organic matter is more available in coastal ecosystems than in deep-sea habitats. However, the specific evolutionary adaptations of coastal bivalves to chemosymbiosis remain largely unknown. Furthermore, Lucinidae and Thyasiridae (Lucinida) have long been considered closely related because of their shared morphological features, whereas phylogenetic trees based on rRNA genes supported the monophyly of each group8. Genomic studies of both Thyasiridae7 and Lucinidae species will help to resolve these questions.

Lucinids have been shown to play a pivotal ecological role in coastal ecosystems. Large-scale genomic studies have shown that coastal lucinid symbionts mainly belong to the genus Ca. Thiodiazotropha and are universally capable of sulfur oxidation and carbon fixation9,10, enabling lucinid holobionts to effectively remove sulfides from sediment. The presence of lucinid clams significantly reduces the concentration of sulfides in sediment, as demonstrated in both mesocosm and field experiments11,12,13. This process is crucial for maintaining the health of plants in coastal areas, as high levels of hydrogen sulfide can severely affect root development in seagrasses and mangroves11,14. Therefore, lucinids and their bacterial symbionts are of great ecological importance in coastal ecosystems11.

Despite the importance of lucinids in evolution and ecology, the lack of genomic data has hindered the study of their phylogenetic relationships, evolutionary adaptations, and the regulatory mechanisms behind their ecological functions. Here, we assembled the chromosome-level genome of Indoaustriella scarlatoi (Lucinidae) based on reads from whole genome sequencing (WGS), PacBio HiFi sequencing, and Hi-C sequencing (Table 1). The I. scarlatoi genome is 1.58 Gb in size and contains 690 contigs with an N50 length of 9.00 Mb (Table 2). After Hi-C scaffolding, 99.41% of contigs were anchored to 17 chromosomes with a scaffold N50 length of 94.81 Mb (Tables 2, 3, Fig. 1). The mapping rate of WGS reads is 98.15%. In total, 938 of the 954 metazoan Benchmarking Universal Single-Copy Orthologs (BUSCO), including 911 complete ones, were successfully located in the final assembly, indicating that the genome completeness is 95.4% (Table 2). Transposable elements occupied 56.02% of the genome, with long terminal repeat (LTR) elements accounting for 42.66% of the genome (Table 4). We predicted 34,469 protein-coding genes in the I. scarlatoi genome, and 74.43% of these genes could be functionally annotated using at least one public database (Table 5). Non-coding RNAs (ncRNAs), including tRNA, rRNA, miRNA, and snRNA, were annotated with a total length of 1.35 Mb (Table 6). Overall, the I. scarlatoi genome is of high quality and will provide a valuable resource for studies on phylogeny and adaptive evolution.

Table 1 Statistics of sequencing data.
Table 2 Statistics of genome assembly.
Table 3 Statistics of Hi-C scaffolding.
Fig. 1 Genomic characteristics of Indoaustriella scarlatoi. (A) Genome-wide all-by-all Hi-C matrix. (B) Circos view of the assembled chromosomes showing marker distributions in 2-Mb sliding windows, from the outer to the inner circle: GC content, gene density, tandem repeat density, and transposable element density.

Table 4 Statistics of transposable elements (TE) annotation.
Table 5 Statistics of gene functional annotation.
Table 6 Statistics of ncRNA annotation.

Methods

Sampling and sequencing

Individuals of Indoaustriella scarlatoi were collected from peri-mangrove sediment in Wenchang, China (19°24′44″ N, 110°44′50″ E). Samples were fixed using RNAlater (Thermo Fisher Scientific) and stored at −80 °C.

The muscle tissue of one individual was used to extract total DNA for WGS and PacBio HiFi sequencing. Genomic DNA was extracted using the QIAamp DNA Mini Kit (Qiagen). For WGS, a Covaris E220 was used to fragment the DNA, and DNA fragments of around 200 bp were selected using AMPure XP beads (Beckman). Selected fragments were amplified for eight PCR cycles and sequenced on the DNBSEQ sequencing platform (BGI) in a paired-end 150 bp layout (Table 1). Long-read sequencing was performed on the PacBio Sequel II system (PacBio). After examining the DNA using Qubit (Thermo Fisher Scientific) and a pulsed-field electrophoresis system (BioRad), a 15-kb PacBio library was constructed by g-TUBE (Covaris) shearing, end-repair, and BluePippin (Sage Science) size selection. Two SMRT cells were sequenced in circular consensus sequencing (CCS) mode (Table 1). For Hi-C library construction, cells dissociated from I. scarlatoi muscle tissue were crosslinked with 1% formaldehyde, and the reaction was quenched with 0.2 M glycine. The fixed powder was resuspended in nuclei isolation buffer and then incubated in 0.5% SDS for 10 min at 62 °C, and the nuclei were collected by centrifugation. The DNA in the nuclei was digested with MboI (NEB), and the overhangs were filled in and biotinylated prior to ligation by T4 DNA ligase (NEB). After purification, the DNA was sheared, and biotin-containing fragments were captured using Dynabeads MyOne Streptavidin T1 (Invitrogen). The captured DNA was then amplified and sequenced on a NovaSeq 6000 (Illumina) in a paired-end 150 bp layout (Table 1). To better annotate the genome assembly, RNA-seq of tissues from a whole clam was performed. Total RNA was extracted with TRIzol (Invitrogen) and used to generate cDNA with HiscriptII (Vazyme). The cDNA fragments were sequenced on the DNBSEQ platform, and 7.32 Gb of 150-bp paired-end data were generated.

Genome assembly and Hi-C scaffolding

A genome survey was conducted with WGS data using Jellyfish v2.2.615 with a k-mer size of 17; the estimated genome size of I. scarlatoi was 1.48 Gb and the heterozygosity was 1.69%. The genome was assembled from PacBio data by hifiasm v0.16.1 (-k 45 -r 2 -a 2 -m 2,000,000 -p 20,000 -l 0)16. After that, the PacBio long reads were realigned to the assembly using minimap2 v2.1417, and duplications in the assembly were removed using Purge_Dups v1.2.3 (https://github.com/dfguan/purge_dups) with default parameters. Kraken218 was used to identify potential contaminant contigs, and contigs assigned to Bacteria were removed. The decontaminated contig-level assembly was assessed using BUSCO v5.2.219 with metazoan odb10 (Table 2). Quality control of the Hi-C data was performed using HiC-Pro v3.220 (Table 1), and the assembled contigs were then scaffolded by 3D-DNA21. Assembled chromosomes were visualized and adjusted in Juicebox v1.922, and 99.41% of the contigs were anchored to 17 chromosomes (Table 3, Fig. 1A). The final assembly is 1.58 Gb with a scaffold N50 length of 94.81 Mb (Table 2, Fig. 1B).
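
The contig and scaffold N50 values reported above follow the usual definition: sort the sequences from longest to shortest and report the length at which the running total first reaches half of the assembly size. The following Python sketch illustrates that calculation only; the function name `nx_length` and the toy lengths are hypothetical and are not taken from the assembly or from any tool used in this study.

```python
def nx_length(lengths, percent=50):
    """Return the Nx length of a set of sequences.

    Nx is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least `percent` % of the
    total length (percent=50 gives N50, percent=90 gives N90).
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")

    ordered = sorted(lengths, reverse=True)  # longest first
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length


# Toy example with made-up lengths (total = 300).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = nx_length(toy_contigs, 50)
toy_n90 = nx_length(toy_contigs, 90)
```

In practice the lengths would be read from the assembly FASTA index or a similar source, and a more contiguous assembly yields a larger N50 for the same total size.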

Repeat and gene annotation

Tandem repeats were annotated using Tandem Repeats Finder v4.0.7 (MaxPeriod set to 2000)23. Transposable elements (TEs) were identified with both homology-based and de novo prediction methods. LTR_Finder v1.0.624 with the parameter “-C” and RepeatModeler v1.0.825 with default parameters were used for the de novo search. For the homology-based search, RepeatMasker v4.0.626 was run with parameters “-nolow -norna -no_is” against Repbase v21.0127 and the results of the de novo search (Table 4).

Ab initio, homology-based, and gene expression evidence were combined to annotate protein-coding genes. Augustus v3.128 was used for ab initio gene prediction. Blast v2.2.2629 was used to align gene sets from 10 molluscan species (Archivesica marissinica6, Argopecten concentricus30, Conchocele bisecta7, Crassostrea gigas31, Gigantidas platifrons5, Lutraria rhynchaena32, Mactra quadrangularis33, Margaritifera margaritifera34, Modiolus philippinarum5, Pecten maximus35) onto the genome of I. scarlatoi, and the alignment hits were linked into candidate gene regions by GenBlastA36. GeneWise v2.2.037 was employed to determine gene models from the sequences of the candidate genes and their 2-kb flanking regions. RNA-seq data were mapped to the genome assembly by HISAT v2.1.038, and Stringtie v1.3.439 and Transdecoder v5.7.1 (github.com/TransDecoder/TransDecoder) with the parameter “--complete_orfs_only” were used to generate the gene annotation from transcript evidence. EVM v1.1.140 was employed to integrate the results generated by the three methods with parameters “--segmentSize 5000000 --overlapSize 200000”, and the weights used for integration were “AUGUSTUS 1, GeneWise 3, transdecoder 10”. All annotated protein-coding genes were searched against the following databases: Swiss-Prot v201709, KEGG v87.0, InterPro v55.0, and TrEMBL v201709 (Table 5). Completeness of the gene set was assessed using BUSCO v5.2.219 (Table 2).

Non-coding RNAs (ncRNAs), including tRNA, rRNA, snRNA, and miRNA, were predicted. tRNAscan-SE-1.3.141 was used to predict tRNAs in the assembly with default parameters. We aligned invertebrate rRNA sequences against the assembly using BLAST software29 with “-e 1e-5”. For miRNA and snRNA annotation, we first aligned the assembly against the Rfam database42 (v14.1) using BLAST software29 (-e 1) to find candidate alignments, and then used INFERNAL43 v1.1.1 to annotate snRNAs and miRNAs with default parameters (Table 6).

Data Records

All sequencing data, including WGS, PacBio, Hi-C, and RNA-seq data, as well as the assembly (JBIWQA000000000)44, have been deposited at the NCBI (National Center for Biotechnology Information) repository under project PRJNA1181275, SRP54367445. Genome assemblies and annotations of I. scarlatoi are also available at Figshare46.

Technical Validation

The lengths of DNA fragments for PacBio sequencing were mainly distributed around 50 kb, and the N50 length of PacBio reads is 17.7 kb. The size of the assembly is 1.58 Gb, while the estimated genome size by Jellyfish is 1.49 Gb. The quality value of the assembly, calculated using Merqury v1.347, was 63.66, indicating high assembly accuracy. The assembled genome contains 690 contigs, with an N50 length of 9.0 Mb and an N90 length of 2.2 Mb. The rate of valid Hi-C reads was 19.36%. After Hi-C scaffolding, 99.41% of the contigs were successfully anchored to 17 chromosomes. The BWA (v0.7.17, github.com/lh3/bwa) MEM algorithm was used to align the WGS reads to the final assembly, and the mapping rate was calculated using the flagstat command of samtools v1.948 with the secondary mapping records removed. The mapping rate of WGS reads was 98.15%. In addition, we aligned RNA-seq data and PacBio HiFi reads against the assembly using hisat238 and minimap217 (“-ax map-hifi”), respectively; the mapping rate of the RNA-seq data was 81.15%, while that of the HiFi reads was 99.78%. Using BUSCO software (v5.2.2)19, 938 of 954 BUSCOs were identified in the genome, including 911 complete ones, and the completeness of the final assembly was estimated as 95.4%. Compleasm v0.2.649 was also used to test the completeness of the assembly, and the result was 97.7% (Table 2). We used both BUSCO v5.2.219 and OMArk v0.3.050 to evaluate the quality of the gene annotation; the BUSCO score of the gene set (95.1%) was similar to that of the assembly, while the OMArk completeness was 90.69%.
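
As a rough illustration of why secondary records are removed before computing a mapping rate, the Python sketch below derives the rate from three record counts of the kind an alignment summary reports (total, mapped, and secondary). The function name and the counts are hypothetical; this is not the command used in this study, only the arithmetic under the assumption that every secondary record is also counted as a mapped record.

```python
def primary_mapping_rate(total_records, mapped_records, secondary_records):
    """Percentage of alignment records mapped, ignoring secondary records.

    Assumption: each secondary record is counted in both `total_records`
    and `mapped_records`, so it is subtracted from both before dividing.
    """
    primary_total = total_records - secondary_records
    primary_mapped = mapped_records - secondary_records
    if primary_total <= 0:
        raise ValueError("no non-secondary records to evaluate")
    if primary_mapped < 0:
        raise ValueError("secondary count exceeds mapped count")
    return 100.0 * primary_mapped / primary_total


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
naive_rate = 100.0 * 1030 / 1050               # secondary records included
rate = primary_mapping_rate(1050, 1030, 50)    # (1030 - 50) / (1050 - 50)
```

With these made-up counts, leaving the secondary records in would slightly inflate the rate, because each one adds a mapped record without adding a new read.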