Background & Summary

The pen shell Atrina pectinata is a large, wedge-shaped bivalve belonging to the family Pinnidae, order Ostreida, and is distributed along the Indo-West Pacific coast from southeast Africa to Melanesia and New Zealand1. It lives mainly in muddy to sandy substrates, either in patches or in small clusters, and plays an important role in maintaining the ecological stability of seagrass beds, lagoons and coral reefs2,3. The pen shell is characterized by a well-developed byssus, which anchors the animal firmly against tidal forces but can also be shed rapidly and reversibly under adverse environmental conditions4. In addition, the large adductor muscle of the pen shell is tasty and commercially valuable. However, pen shell resources have declined continuously because of overfishing, habitat loss, pollution and other factors1. There is therefore an urgent need for artificial cultivation of the pen shell to increase its numbers, which is ecologically important for the recovery of wild populations. Unfortunately, no breakthrough in artificial breeding technology has been achieved so far, owing to the difficulty of inducing spawning and the problem of larval conglutination5. Previous studies of A. pectinata have focused predominantly on taxonomy, genetics, ecology, morphology and biochemistry2,6,7,8,9,10,11, whereas few reports have addressed its reproductive biology12. Hence, to accelerate the artificial breeding of A. pectinata, the first task is to elucidate the biological mechanisms of its reproduction.

High-quality genomes facilitate an in-depth understanding and comprehensive screening of the genetic basis and variants associated with key traits. This knowledge makes it possible to understand and effectively utilize the biological characteristics of a species for various purposes13,14,15. To date, the genome of the pen shell has not been sequenced, which impedes exploration of the genetic basis of its biological features and behaviour. A high-quality chromosome-level genome will therefore contribute to a deeper understanding of the genetic mechanisms underlying reproduction in A. pectinata.

Here, we constructed a high-quality chromosome-scale genome of A. pectinata using a combined strategy involving Illumina short-read, PacBio HiFi and Hi-C sequencing technologies. We then performed structural and functional annotation of the assembled genome by integrating full-length transcriptome data from the adductor muscle tissue of A. pectinata. This high-quality reference genome provides a valuable resource for exploring the reproductive biology of A. pectinata.

Methods

Sample collection and DNA preparation

A four-year-old pen shell with a length of 20.81 cm and a height of 10.12 cm was collected from Changdao, Yantai, Shandong Province. The adductor muscle was dissected and preserved in liquid nitrogen until DNA or RNA extraction. Genomic DNA was extracted for sequencing using the standard phenol/chloroform extraction protocol. Total RNA was isolated using Trizol (Invitrogen, Carlsbad, CA, USA) following the manufacturer’s instructions.

Illumina library and PacBio library construction, sequencing and contig-level assembly

To obtain short reads, the DNA sample was evaluated by 1% agarose gel electrophoresis and the Pultton DNA/Protein Analyzer (Plextech). A paired-end library with an insert size of 500 bp was then constructed following the manufacturer’s instructions. The DNA sample was purified, quantified and sequenced from both ends on the Illumina Novaseq 6000 platform, yielding a total of 57.37 Gb of raw reads. The raw reads were filtered with fastp (v0.20.1) using default parameters16: reads were removed if the number of N bases in a single-end read exceeded 3, if ≥20% of the bases in a read had quality values below 5, or if they contained adapters. A total of 56.66 Gb of clean reads were obtained (Table 1). K-mer analysis was then performed with GCE (v1.0.0)17 to estimate the genome size, repeat content and heterozygosity of A. pectinata, which were 911.74 Mb, 62.79% and 0.941%, respectively (Fig. 1a).

To obtain PacBio long reads, the DNA sample was first evaluated using Nanodrop, Qubit and agarose gel electrophoresis. A library with a fragment size of 15 kb was then prepared with the SMRTbell template preparation kit 2.0 (Pacific Biosciences, CA, USA). Library construction included DNA shearing, AMPure PB bead purification, removal of ssDNA overhangs, damage repair, end repair, hairpin adapter ligation and bead purification of the library. After quality control, the SMRTbell library was sequenced on a single 8M SMRT Cell on the PacBio Sequel II platform (Pacific Biosciences, CA, USA). The PacBio SMRT-Analysis package (https://www.pacb.com) was used for quality control of the raw polymerase reads. After low-quality sequences were removed using SMRTLink 9.0 with the parameters --min-passes=3 --min-rq=0.99, a total of 33.29 Gb of high-precision reads with an N50 value of 5.29 Mb were obtained (Tables 1, 2).

For de novo assembly, Hifiasm18 was used to generate the draft assembly from the high-precision reads. Briefly, all-vs-all pairwise alignment of the reads was first performed for self error correction. After haplotype-aware error correction, the assembly string graph was built from the error-corrected reads, and bubbles were generated in the assembly graph. A modified “best overlap graph” strategy was then used to obtain the contigs.
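
As a rough illustration of how a genome size can be derived from a k-mer frequency distribution, the sketch below divides the total number of k-mers by the depth of the main peak. This is a simplified textbook estimate and not the model implemented in GCE; the function name, the `min_depth` cutoff used to discard low-depth error k-mers and the toy histogram are all assumptions made for illustration.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Simplified k-mer genome size estimate (hypothetical helper).

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    K-mers below min_depth are treated as sequencing errors and ignored.
    Returns (estimated_size, peak_depth).
    """
    usable = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The main peak is the depth with the largest number of distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of depth * distinct k-mers.
    total_kmers = sum(depth * count for depth, count in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (not real data): error k-mers at depth 1-2, main peak at depth 20.
toy_histogram = {1: 1000, 2: 300, 18: 50, 19: 80, 20: 100, 21: 80, 22: 50, 40: 10}
size, peak = estimate_genome_size(toy_histogram)
print(size, peak)  # 380.0 20
```

On real data the histogram would come from a k-mer counter, and heterozygous and repeat peaks would need separate handling, which is what dedicated tools such as GCE are designed for.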

Table 1 Summary of genome sequencing data for A. pectinata.
Fig. 1 Genome size estimation and heatmap of genome-wide Hi-C interaction. (a) K-mer frequency analysis was performed for genome size estimation of A. pectinata using GCE (v1.0.2) based on Illumina genome sequencing data. (b) The heatmap shows the scaffolding result of the A. pectinata genome based on the Juicer and 3D-DNA pipeline.

Table 2 Summary statistics of A. pectinata genome assembly.

Hi-C library preparation, sequencing and chromosomal-level assembly

The Hi-C library was prepared as described previously19,20. Briefly, the library was constructed through the following steps: crosslinking cells with formaldehyde, digesting the DNA with a 4-cutter restriction enzyme (MboI), filling in the ends and marking them with biotin, ligating the resulting blunt-end fragments, and purifying and randomly shearing the DNA into 300–500 bp fragments. After quality control of the library using Qubit 2.0, an Agilent 2100 instrument (Agilent Technologies, CA, USA) and qPCR, 150 bp paired-end sequencing of the Hi-C library was performed on the Illumina Novaseq 6000 platform by Berry Genomics Company, and a total of 170 Gb of clean data were acquired (Table 1). The filtered Hi-C reads were aligned to the initial draft genome with BWA21, as integrated into Juicer22. Only uniquely mapped, valid paired-end reads were used for scaffolding with 3D-DNA23. Juicebox24 was used to manually order the scaffolds to obtain the final chromosome assembly, and contact maps were plotted with HiCExplorer25. Finally, 98.87% of the genome sequence was successfully anchored onto 17 chromosomes (Fig. 1b, Table 2). The final assembled genome was 951.24 Mb, close to the genome size estimated from the k-mer frequency distribution (911.74 Mb), with a scaffold N50 of 52.64 Mb (Table 2). The completeness of the A. pectinata genome assembly was assessed with BUSCO (v5.2.216)26 based on the “mollusca_odb” BUSCO gene set collection27. The results showed that 98.8% of complete BUSCOs were captured by the genome assembly, comprising 98.0% single-copy and 0.8% duplicated BUSCOs (Table 2).
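
For readers unfamiliar with the N50 statistic reported above, the following minimal sketch shows one common way to compute it: sequences are sorted from longest to shortest, and N50 is the length of the sequence at which the running total first reaches half of the assembly size. The function name and the toy lengths are illustrative assumptions, not values from this assembly.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths (hypothetical helper)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy scaffold lengths: total = 20, so the running sum reaches 10 at length 5.
print(n50([2, 3, 4, 5, 6]))  # 5
```

The same function applies to contigs or scaffolds; only the input list of lengths changes.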

RNA library construction and transcriptome sequencing

To facilitate genome annotation, PacBio Iso-Seq was performed on RNA pooled from three samples. Specifically, the concentration, integrity and purity of the RNA isolated from muscle were confirmed using Qubit, Agilent 2100 and Nanodrop, and the RNAs were then pooled at equimolar concentrations. A double-stranded cDNA library was prepared with the SMARTer® PCR cDNA Synthesis Kit (Clontech, USA) and sequenced on the PacBio Sequel IIe platform. SMRTLink (https://www.pacb.com/support/software-downloads/) was used to generate high-quality full-length transcripts from the full-length transcriptome sequencing data, and a total of 24.45 Gb of data were generated (Table 1).

Gene annotation

A comprehensive strategy combining ab initio prediction, protein-based homology searches and RNA sequencing was applied to annotate gene structure. Gene structures in the repeat-masked genome were predicted ab initio using AUGUSTUS28, SNAP29, GlimmerHMM30 and GeneMark-ET31. Homology-based prediction was performed using GeMoMa32, and exon and intron boundary information was obtained by comparing the transcripts with the genome. Open reading frames (ORFs) were predicted using PASA33 based on the full-length transcript sequences. The above predictions were integrated using EvidenceModeler34, and UTRs and alternative splicing were annotated with PASA. The final gene set contained 29,326 protein-coding genes (Fig. 2a, Table 2).

Fig. 2 Genomic information visualization of A. pectinata. Circular map of the A. pectinata genome. From outer to inner circles: the innermost circle shows the 17 haploid chromosomes at the Mb scale; i represents the heat map of repeat density on each chromosome, with darker colors indicating more repeats; ii represents ncRNA density on each chromosome; iii represents GC content (yellow line); and iv represents gene density on each chromosome, with the red line showing gene density on the positive strand and the green line showing gene density on the negative strand. All tracks are drawn with 1-Mb sliding windows.

All protein-coding genes were aligned against three protein sequence databases: NR (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz), SwissProt (https://ftp.uniprot.org/pub/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz) and eggNOG (http://eggnog5.embl.de/#/app/home). Protein domains were annotated with InterPro, and Gene Ontology (GO) terms for each gene were obtained from the corresponding InterPro entries. The pathways in which the genes might be involved were assigned by BLAST35 against the KEGG database (https://www.genome.jp/kegg/brite.html). The functional annotations from the above methods were then merged. As a result, 84.25% (24,708 genes) of the predicted genes were successfully annotated (Table 3). Based on the eukaryota_odb10 library, the completeness of the genome annotation was 97.7%, including 97.3% single-copy genes.
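
To make the merging step concrete, the sketch below treats a gene as annotated if it has a hit in at least one database and reports the resulting proportion. It is only an illustration: the function name, the dictionary layout and the toy gene identifiers are assumptions, and the actual merging procedure may differ. The last line simply re-derives the reported percentage from the gene counts given in the text.

```python
def annotation_rate(total_genes, hits_by_database):
    """Count genes with a hit in any database (hypothetical helper).

    hits_by_database maps a database name to the set of annotated gene IDs.
    Returns (number_of_annotated_genes, percentage_of_total).
    """
    if total_genes <= 0:
        raise ValueError("total_genes must be positive")
    # A gene annotated by several databases is counted only once.
    annotated = set().union(*hits_by_database.values())
    return len(annotated), 100.0 * len(annotated) / total_genes


# Toy example with made-up gene IDs.
toy_hits = {
    "NR": {"g1", "g2"},
    "SwissProt": {"g2", "g3"},
    "KEGG": set(),
}
print(annotation_rate(4, toy_hits))  # (3, 75.0)

# Consistency check on the figures reported in the text.
print(round(100.0 * 24708 / 29326, 2))  # 84.25
```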

Table 3 Functional annotation of the predicted genes in A. pectinata.

Repeat annotation and non-coding RNA annotation

A combination of de novo and homology-based methods was applied to predict repeat elements. For de novo annotation, MITE-Hunter36 was used to identify miniature inverted-repeat transposable elements (MITEs), which are widespread in the genome and belong to the type II transposable elements. LTRharvest37 and LTR Finder38 were used to detect long terminal repeats (LTRs), and LTR retriever39 was used to integrate the predictions of these two programs. For homology-based evidence, RepeatMasker40 was used to search the genome for sequences similar to known repeats in the RepBase database (http://www.girinst.org/repbase). The de novo and known repetitive sequences were then combined, and RepeatMasker was used to mask them in the genome. RepeatModeler41 was used to identify other repetitive sequences de novo in the repeat-masked genome. Ultimately, a total of 498.78 Mb of repeat sequences were identified, accounting for 52.43% of the whole genome (Table 4).
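
Because repeat predictions from several tools can overlap, a total repeat length would typically be counted on merged, non-overlapping intervals so that no base is counted twice. The sketch below illustrates that bookkeeping under the assumption of half-open (start, end) coordinates on a single sequence; the function name and toy intervals are hypothetical, and this is not the procedure implemented in RepeatMasker. The final line checks the reported proportion against the 951.24 Mb assembly size given in the Hi-C assembly section.

```python
def merged_length(intervals):
    """Total length covered by possibly overlapping half-open intervals."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block, if any, and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Toy intervals: (0, 10) and (5, 15) merge into (0, 15); (20, 30) stays separate.
print(merged_length([(0, 10), (5, 15), (20, 30)]))  # 25

# Consistency check on the figures reported in the text (values in Mb).
print(round(100.0 * 498.78 / 951.24, 2))  # 52.43
```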

Table 4 Classification of repetitive sequences and ncRNAs of A. pectinata genome.

tRNAscan-SE42 was used with default parameters to predict tRNAs in the genome sequence. The cmscan tool of Infernal was used to align the genomic sequence against the RNA models in the Rfam database, and ncRNAs were obtained with the parameters -Z --cut_ga --nohmmonly --fmt 2. For cmscan, Z is defined as the length of the current query sequence in nucleotides, multiplied by 2, multiplied by the number of models in the target CM database. Consequently, 94 miRNAs, 766 tRNAs, 709 rRNAs and 243 sRNAs were predicted in the A. pectinata genome (Table 4).
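
The definition of Z given above can be written as a one-line calculation. The sketch below follows that definition literally and expresses the result in megabases, on the assumption that the -Z option of cmscan expects a value in Mb; the function name and the example numbers are hypothetical, and the value actually passed to -Z in this study is not stated in the text.

```python
def cmscan_search_space_mb(query_length_nt, n_models):
    """Search space size following the definition in the text (hypothetical helper).

    query_length_nt: length of the query sequence in nucleotides.
    n_models: number of covariance models in the target CM database.
    The factor 2 is the multiplication by 2 stated in the definition.
    Returns the size in megabases (assumed unit for -Z).
    """
    if query_length_nt < 0 or n_models < 0:
        raise ValueError("inputs must be non-negative")
    return 2 * query_length_nt * n_models / 1_000_000


# Toy example: a 1 Mb query searched against 3 models.
print(cmscan_search_space_mb(1_000_000, 3))  # 6.0
```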

Data Records

The raw Illumina, PacBio and Hi-C sequencing data have been deposited in the NCBI Sequence Read Archive (SRA) under accession numbers SRR30552289-SRR3055229143,44,45. The assembled genome sequence has been deposited in NCBI GenBank under accession number ASM4548941v146. The genome annotation file is available from the Figshare repository47. The transcriptome data have been deposited in the SRA under accession number SRR3068405948.

Technical Validation

Evaluating genome assembly and annotation completeness

The genome assembly of A. pectinata was 951.01 Mb and consisted of 17 chromosomes, with a contig N50 of 5.29 Mb and a scaffold N50 of 52.64 Mb (Fig. 2 and Table 2). The genome size is similar to that reported for other bivalves, such as Pinctada fucata martensii49. For quantitative assessment of the assembly, BUSCO analysis with the mollusca_odb dataset identified 98.8% of complete BUSCO genes (Table 2). Collectively, these assessments indicate that the A. pectinata genome is a high-quality reference genome.