Abstract
The pen shell Atrina pectinata is a bivalve known for its exceptionally large adductor muscle and well-developed byssus. It is now threatened in East Asia, and artificial breeding is needed to boost yield. However, the lack of a high-quality genome hinders our understanding of its reproductive biology, so artificial breeding of the pen shell remains an unsolved scientific and technological problem. Here, we produced a high-quality chromosome-level genome assembly of A. pectinata by combining PacBio, Illumina, and high-resolution chromosome conformation capture (Hi-C) sequencing. The final assembly has a size of 951.01 Mb with a scaffold N50 of 52.64 Mb; 98.87% of the sequence was anchored onto 17 chromosomes, and the BUSCO completeness score was 98.8%. We identified 29,326 protein-coding genes, of which 24,708 (84.25%) were functionally annotated. The BUSCO completeness score for the predicted protein-coding genes was 97.7%. This work provides a reference genome for A. pectinata, laying a solid foundation for future genomic and biological studies of this species.
Background & Summary
The pen shell Atrina pectinata is a bivalve with a large wedge-shaped shell, belonging to the family Pinnidae, order Ostreida, and distributed along the Indo-Western Pacific coast from southeast Africa to Melanesia and New Zealand1. It is found mainly in muddy to sandy substrates, either in patches or in small clusters, and plays an important role in maintaining the ecological stability of seagrass beds, lagoons and coral reefs2,3. The pen shell is characterized by a well-developed byssus, which can hold the animal firmly in place against tidal forces but can also be shed rapidly and reversibly under adverse ocean conditions4. In addition, the pen shell’s large adductor muscle is tasty and commercially valuable. However, pen shell resources have declined continuously because of overfishing, habitat depletion, pollution, and other factors1. Therefore, there is an urgent need for artificial cultivation of the pen shell to increase its numbers, which is ecologically important for the recovery of the population. Unfortunately, there has so far been no breakthrough in artificial breeding technology, owing to the difficulty of inducing spawning and the problem of larval conglutination5. Previous reports on A. pectinata have focused predominantly on taxonomy, genetics, ecology, morphology and biochemistry2,6,7,8,9,10,11, whereas few have addressed reproductive biology12. Hence, to speed up the artificial breeding of A. pectinata, the first task is to elucidate the biological mechanisms underlying breeding.
High-quality genomes facilitate in-depth understanding and comprehensive screening of the genetic basis and variants associated with key traits. This knowledge makes it possible to understand and effectively use the biological characteristics of a species for various purposes13,14,15. Prior to this study, the genome of the pen shell had not been sequenced, impeding exploration of the genetic basis of its biological features and behaviour. A high-quality chromosome-level genome will therefore contribute to a deeper understanding of the genetic mechanisms underlying breeding in A. pectinata.
Here, we constructed a high-quality chromosome-scale genome of A. pectinata using a combined strategy involving Illumina short-read, PacBio HiFi and Hi-C sequencing technologies. Subsequently, we performed structural and functional annotation of the assembled genome by integrating full-length transcriptome data from adductor muscle tissue of A. pectinata. This high-quality reference genome provides a valuable resource for exploring the reproductive biology of the species.
Methods
Sample collection and DNA preparation
A four-year-old pen shell with a length of 20.81 cm and a height of 10.12 cm was collected from Changdao, Yantai, Shandong Province. Adductor muscle was obtained and preserved in liquid nitrogen until DNA or RNA extraction. Genomic DNA was extracted for sequencing following a standard phenol/chloroform extraction protocol. Total RNA was isolated using Trizol (Invitrogen, Carlsbad, CA, USA) following the manufacturer’s instructions.
Illumina library and PacBio library construction, sequencing and contig-level assembly
To obtain short reads, the DNA sample was evaluated by 1% agarose gel electrophoresis and the Pultton DNA/Protein Analyzer (Plextech). Subsequently, a paired-end library with an insert size of 500 bp was constructed following the manufacturer’s instructions. The DNA sample was then purified, quantified, and sequenced from both ends on the Illumina NovaSeq 6000 platform. Sequencing yielded a total of 57.37 Gb of raw reads. Reads were filtered using fastp (v0.20.1) with default parameters16: a read was removed if a single-end read contained more than 3 N bases, if ≥20% of its bases had quality values lower than 5, or if it contained adapters. After filtering, a total of 56.66 Gb of clean reads were obtained (Table 1). K-mer analysis was then performed using GCE (v1.0.0) software17 to estimate the genome size, repeat content and heterozygosity of A. pectinata, which were 911.74 Mb, 62.79% and 0.941%, respectively (Fig. 1a).
To obtain PacBio long reads, the DNA sample was first evaluated using Nanodrop, Qubit and agarose gel electrophoresis. A library with a fragment size of 15 kb was then prepared using the SMRTbell template preparation kit 2.0 (Pacific Biosciences, CA, USA). Library construction included DNA shearing, AMPure PB bead purification, removal of ssDNA overhangs, damage repair, end repair, hairpin adapter ligation, and bead purification of the library. After quality control, a SMRTbell library was obtained. The library was sequenced using a single 8 M SMRT Cell on the PacBio Sequel II platform (Pacific Biosciences, CA, USA). The PacBio SMRT-Analysis package (https://www.pacb.com) was used for quality control of the raw polymerase reads. After removing low-quality sequences using the SMRTLink 9.0 software with the parameters --min-passes=3 --min-rq=0.99, a total of 33.29 Gb of high-precision reads with an N50 value of 5.29 Mb were obtained (Tables 1, 2).
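The short-read filtering criteria described above can be illustrated with a minimal Python sketch. It is not the fastp implementation: the `Read` class, the function name `passes_filter` and the adapter string are assumptions introduced only for illustration, while the three thresholds mirror the criteria stated in the text.

```python
from dataclasses import dataclass
from typing import List

# Assumed adapter sequence for illustration only; the adapters actually
# used for this library are not stated in the text.
ADAPTER = "AGATCGGAAGAGC"


@dataclass
class Read:
    seq: str          # base calls
    quals: List[int]  # per-base Phred quality values


def passes_filter(read, max_n=3, low_q=5, max_low_frac=0.20, adapter=ADAPTER):
    """Return True if the read survives the three criteria from the text."""
    # Criterion 1: discard reads with more than `max_n` N bases.
    if read.seq.count("N") > max_n:
        return False
    # Criterion 2: discard reads in which >= 20% of bases have quality < 5.
    if read.quals:
        low = sum(1 for q in read.quals if q < low_q)
        if low / len(read.quals) >= max_low_frac:
            return False
    # Criterion 3: discard reads that contain the adapter sequence.
    if adapter in read.seq:
        return False
    return True


reads = [
    Read("ACGT" * 10, [30] * 40),                  # clean read
    Read("NNNN" + "ACGT" * 9, [30] * 40),          # too many N bases
    Read("ACGT" * 10, [2] * 8 + [30] * 32),        # 20% low-quality bases
    Read("ACGT" + ADAPTER + "ACGT", [30] * 21),    # adapter contamination
]
clean = [r for r in reads if passes_filter(r)]
```

In this toy input only the first read is retained; a real pipeline would stream FASTQ records and handle paired reads together.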
For de novo assembly, Hifiasm18 software was applied to generate the draft assembly from the high-precision reads. Briefly, all-vs-all pairwise alignment of the reads was first performed for self error correction. After haplotype-aware error correction, an assembly string graph was built from the error-corrected reads, and bubbles were generated in the assembly graph. A modified “best overlap graph” strategy was then used to obtain the contigs.
Genome size estimation and heatmap of genome-wide Hi-C interaction. (a) K-mer frequency analysis was performed for genome size estimation of A. pectinata using GCE (v1.0.2) based on Illumina genome sequencing data. (b) The heatmap shows the scaffolding result of the A. pectinata genome based on the Juicer and 3D-DNA pipeline.
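The k-mer-based size estimate in panel (a) builds on a simple relation: the genome size is roughly the total number of k-mers divided by the depth of the main peak of the k-mer frequency distribution. The Python sketch below illustrates only this basic idea. It is not the GCE algorithm, and the function name, the `min_depth` error cutoff and the toy histogram are hypothetical.

```python
def estimate_genome_size(hist, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    hist maps a depth (how often a k-mer occurs in the reads) to the
    number of distinct k-mers observed at that depth.
    """
    # Low-depth k-mers mostly come from sequencing errors; the cutoff
    # used here is an assumption, not a value taken from the paper.
    usable = {d: c for d, c in hist.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Depth with the largest number of distinct k-mers (main peak).
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of depth * count.
    total_kmers = sum(d * c for d, c in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram: an error spike at depth 1-2 and a main peak at depth 50.
toy_hist = {1: 1_000_000, 2: 200_000, 48: 100, 49: 300, 50: 500, 51: 300, 52: 100}
size, peak = estimate_genome_size(toy_hist)
```

For a heterozygous genome, a second peak at about half the main depth may appear and can even be the tallest one, so the choice of peak should be checked against the plotted distribution rather than taken automatically.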
Hi-C library preparation, sequencing and chromosomal-level assembly
The Hi-C library was prepared using a previously described method19,20. In short, the library was constructed through the following steps: crosslinking cells with formaldehyde, digesting DNA with a 4-cutter restriction enzyme (MboI), filling in the ends and marking them with biotin, ligating the resulting blunt-end fragments, and purifying and randomly shearing the DNA into 300–500 bp fragments. After quality control of the library using Qubit 2.0, an Agilent 2100 instrument (Agilent Technologies, CA, USA) and qPCR, 150 bp paired-end sequencing of the Hi-C library was performed on the Illumina NovaSeq 6000 platform by Berry Genomics Company, yielding a total of 170 Gb of clean data (Table 1). The filtered Hi-C reads were aligned to the initial draft genome by BWA software21, which is integrated into Juicer software22. Only uniquely mapped and valid paired-end reads were used for assembly by 3D-DNA23. Juicebox24 was used to manually order the scaffolds to obtain the final chromosome assembly. Contact maps were plotted with HiCExplorer25. The final assembled genome is 951.24 Mb, which is close to the genome size estimated from the k-mer frequency distribution (911.74 Mb), with a scaffold N50 of 52.64 Mb (Table 2). In total, 98.87% of the genome sequence was anchored onto 17 chromosomes (Fig. 1b, Table 2). The completeness of the genome assembly of A. pectinata was assessed by BUSCO (v5.2.216)26 based on the “mollusca_odb” BUSCO gene set collection27. The results showed that 98.8% of complete BUSCOs were captured by our genome assembly, including 98.0% single-copy and 0.8% duplicated BUSCOs (Table 2).
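The two summary statistics quoted above, scaffold N50 and the proportion of sequence anchored onto chromosomes, can be computed from sequence lengths alone. The following Python sketch shows the standard definitions; the function names and the toy lengths are illustrative assumptions and do not come from the assembly itself.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings the cumulative sum to >= 50%.
        if running * 2 >= total:
            return length
    return 0  # empty input


def anchored_fraction(chromosome_lengths, all_lengths):
    """Fraction of the assembly placed on chromosome-level scaffolds."""
    return sum(chromosome_lengths) / sum(all_lengths)


# Toy example with hypothetical scaffold lengths (arbitrary units).
toy_scaffolds = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_scaffolds)
toy_anchored = anchored_fraction([80, 70], [80, 70, 50])
```

The same `n50` function applies to contigs or scaffolds; only the input list of lengths changes.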
RNA library construction and transcriptome sequencing
To facilitate genome annotation, PacBio Iso-Seq was performed on RNA from three samples mixed in equimolar amounts. Specifically, the concentration, integrity, and purity of the RNA isolated from muscle were confirmed using Qubit, Agilent 2100 and Nanodrop, and the RNAs were then pooled at equimolar concentrations. A double-stranded cDNA library was prepared with the SMARTer® PCR cDNA Synthesis Kit (Clontech, USA). Subsequently, the cDNA library was sequenced on the PacBio Sequel IIe platform. SMRTLink (https://www.pacb.com/support/software-downloads/) was used to generate high-quality full-length transcripts from the full-length transcriptome sequencing data; a total of 24.45 Gb of data were generated (Table 1).
Gene annotation
A comprehensive strategy combining ab initio prediction, protein-based homology searches, and RNA sequencing was applied to annotate gene structures. Ab initio gene prediction on the repeat-masked genome was performed using AUGUSTUS28, SNAP29, GlimmerHMM30 and GeneMark-ET31. Homology-based prediction was performed using GeMoMa32, and exon and intron boundary information was obtained by comparing the transcripts with the genome. Open reading frames (ORFs) were predicted using PASA software33 based on the full-length transcript sequences. The above prediction results were integrated using EVidenceModeler34, and UTRs and alternative splicing were annotated using PASA. The final gene set contained 29,326 protein-coding genes (Fig. 2a, Table 2).
Genomic information visualization of A. pectinata. Circular map of the A. pectinata genome. From the outer to the inner circles: the 17 haploid chromosomes at the Mb scale; (i) heat map of repeat density on each chromosome (the darker the color, the more repeats); (ii) ncRNA density on each chromosome; (iii) GC content (yellow line); and (iv) gene density on each chromosome, where the red line represents gene density on the positive strand and the green line represents gene density on the negative strand. All tracks are drawn with 1-Mb sliding windows.
All protein-coding genes were aligned against three protein sequence databases: NR (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz), SwissProt (https://ftp.uniprot.org/pub/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz) and eggNOG (http://eggnog5.embl.de/#/app/home). Protein domains were annotated by InterPro, and the Gene Ontology (GO) terms for each gene were obtained from the corresponding InterPro entry. The pathways in which the genes might be involved were assigned by BLAST35 against the KEGG database (https://www.genome.jp/kegg/brite.html). The functional annotation results from the above methods were merged. As a result, 84.25% (24,708 genes) of the predicted genes were successfully annotated (Table 3). Based on the eukaryota_odb10 library, the completeness of the genome annotation was 97.7%, with 97.3% being single-copy genes.
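Merging the functional annotation results amounts to taking the union of the genes that received a hit in at least one database. A minimal Python sketch of this step is given below, assuming each database result has already been reduced to a list of gene identifiers; the function name, gene identifiers and per-database hit lists are hypothetical.

```python
def annotation_summary(all_genes, hits_by_db):
    """Return (number of annotated genes, percentage of the gene set annotated).

    A gene counts as annotated if it has a hit in at least one database.
    """
    gene_set = set(all_genes)
    annotated = set()
    for hits in hits_by_db.values():
        # Ignore identifiers that are not part of the predicted gene set.
        annotated |= set(hits) & gene_set
    percent = 100 * len(annotated) / len(gene_set) if gene_set else 0.0
    return len(annotated), round(percent, 2)


# Toy input with hypothetical gene identifiers.
toy_genes = ["g1", "g2", "g3", "g4"]
toy_hits = {"NR": ["g1", "g2"], "SwissProt": ["g2"], "KEGG": ["g3"]}
toy_count, toy_percent = annotation_summary(toy_genes, toy_hits)

# The percentage reported in the text follows from the two gene counts.
reported_percent = round(100 * 24708 / 29326, 2)
```

Counting the union rather than summing per-database totals avoids counting a gene twice when it has hits in several databases.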
Repeat annotation and non-coding RNA annotation
A combination of de novo and homology-based methods was applied to predict repeat elements. For de novo annotation, MITE-Hunter36 was used to identify miniature inverted-repeat transposable elements (MITEs), which are widespread in the genome and belong to the type II transposable elements. LTRharvest37 and LTR_FINDER38 were used to detect long terminal repeats (LTRs) in the genome, and LTR_retriever39 was used to integrate the predictions of the two programs. For homology-based evidence, RepeatMasker40 was used to search the genome for sequences similar to known repeats in the RepBase database (http://www.girinst.org/repbase). The de novo and known repeats were then combined, and RepeatMasker was used to mask the repetitive sequences of the genome. RepeatModeler41 was used for de novo identification of additional repeats in the repeat-masked genome. Ultimately, a total of 498.78 Mb of repeat sequences were identified, accounting for 52.43% of the whole genome (Table 4).
tRNAscan-SE42 was used to predict tRNAs in the genome sequence with default parameters. The cmscan tool of Infernal was used to align the genome sequence against the RNA models in the Rfam database, and ncRNAs were identified with the parameters -Z --cut_ga --nohmmonly --fmt 2. For cmscan, Z is the length of the current query sequence in nucleotides, multiplied by 2 and by the number of models in the target CM database. Consequently, 94 miRNAs, 766 tRNAs, 709 rRNAs and 243 sRNAs were predicted in the A. pectinata genome (Table 4).
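The search-space value Z described above can be worked out with a few lines of Python. This is a sketch under two assumptions: that the value passed to -Z is expressed in megabases, as the Infernal user guide describes, and that the genome length and model count used in the example are hypothetical placeholders rather than the values used in this study.

```python
def cmscan_z_megabases(query_length_nt, n_models):
    """Search-space size for cmscan, in megabases.

    Both strands are searched, hence the factor of 2; the result is
    scaled by the number of covariance models in the target database.
    """
    if query_length_nt < 0 or n_models < 0:
        raise ValueError("length and model count must be non-negative")
    return query_length_nt * 2 * n_models / 1_000_000


# Hypothetical example: a 1 Mb query sequence searched against 10 models.
example_z = cmscan_z_megabases(1_000_000, 10)
```

Fixing Z explicitly keeps E-values comparable across runs in which the genome is split into chunks and searched separately.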
Data Records
The raw Illumina, PacBio, and Hi-C sequencing data have been deposited in the NCBI Sequence Read Archive under accession numbers SRR30552289-SRR30552291 (refs. 43,44,45). The assembled genome sequence has been deposited in NCBI GenBank with accession number ASM4548941v1 (ref. 46). The genome annotation file is available from the Figshare repository47. The transcriptome data are available in the SRA database under accession number SRR30684059 (ref. 48).
Technical Validation
Evaluating genome assembly and annotation completeness
The genome assembly of A. pectinata was 951.01 Mb, comprising 17 chromosomes, with a contig N50 of 5.29 Mb and a scaffold N50 of 52.64 Mb (Fig. 2 and Table 2). The genome size is similar to that reported for other bivalves, such as Pinctada fucata martensii49. For quantitative assessment of the assembly, BUSCO analysis with the mollusca_odb dataset identified 98.8% of complete BUSCO genes (Table 2). Collectively, these assessments indicate that the A. pectinata genome serves as a high-quality reference genome.
Code availability
No specific script was used in this work. The codes and pipelines used for genome sequencing data analysis were run following the software manuals. The versions and parameters of the software are listed in the Methods.
References
An, H. S., Lee, J. W. & Dong, C. M. Population genetic structure of Korean pen shell (Atrina pectinata) in Korea inferred from microsatellite marker analysis. Genes Genom. 34, 681–688 (2012).
Tabata, T., Hiramatsu, K., Harada, M. & Hirose, M. Numerical analysis of convective dispersion of pen shell Atrina pectinata larvae to support seabed restoration and resource recovery in the Ariake Sea, Japan. Ecol. Eng. 57, 154–161 (2013).
Sanmartí, N. et al. Exploring coexistence mechanisms in a three-species assemblage. Mar. Environ. 178, 105647 (2022).
Sivasundarampillai, J. et al. A strong quick-release biointerface in mussels mediated by serotonergic cilia-based adhesion. Science 382, 829–834 (2023).
Zhang, H. Y. 2010. Research on reproductive biology and artificial breeding technology of Atrina pectinata Linnaeus. Ji Mei University. (In Chinese) (2022).
Xue, D. X., Wang, H. Y. & Zhang, T. Phylogeography and taxonomic revision of the pen shell Atrina pectinata species complex in the south China sea. Front. Mar. Sci. 8, 753553 (2021).
Yang, H. S. et al. First report on the occurrence of the comb pen shell, Atrina pectinata (Linnaeus, 1767) (Bivalvia: Pinnidae) in Ulleungdo Island in the East Sea: Ecology and molecular identification of the species using COI gene sequence. Ocean Sci. J. 50, 649–655 (2015).
Yoon, J. M. Genetic distances of binary pen shell Atrina pectinata populations. Development & Reproduction 26, 127–133 (2022).
Suck, A. H. et al. Comparison between wild and hatchery populations of Korean pen shell (Atrina pectinata) Using microsatellite DNA markers. Int. J. Mol. Sci. 12, 6024–6039 (2021).
Jela, H., Monteclaro, H. M., Añasco, N. C., Quinitio, G. F. & Babaran, R. P. Identification of pen shells (Bivalvia: Ostreida: Pinnidae) collected off northern Iloilo, Philippines using their morphological characters. Acta Ichthyol. Piscat. 54, 49–61 (2024).
Jeon, C. Adsorption behavior of cadmium ions from aqueous solution using pen shells. J. Ind. Eng. Chem. 58, 57–63 (2018).
Awaji, M. et al. Artificial fertilisation method for the production of pen shell Atrina pectinata juveniles in hatcheries. Aquaculture. 553, 738101 (2022).
Wu, B. et al. Chromosome-level genome and population genomic analysis provide insights into the evolution and environmental adaptation of jinjiang oyster Crassostrea ariakensis. Mol. Ecol. Resour. 22, 1529–1544 (2022).
Bai, C. M. et al. Chromosomal-level assembly of the blood clam, Scapharca (Anadara) broughtonii, using long sequence reads and Hi-C. Gigascience. 8 (2019).
Song, H. et al. Chromosome-level genome assembly of the caenogastropod snail Rapana venosa. Sci. Data. 10, 539 (2023).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Liu, B. H. et al. Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. arXiv: 1308 (2012).
Cheng, H. Y., Concepcion, T. G., Feng, X. W., Zhang, H. W. & Li, H. Haplotype-resolved de novo assembly with phased assembly graphs. bioRxiv. (2020).
Rao, S. S., Huntley, M. H. & Durand, N. C. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
Xie, T. et al. De novo plant genome assembly based on chromatin interactions: a case study of Arabidopsis thaliana. Mol. Plant 8, 489–492 (2015).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 3, 95 (2016).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92 (2016).
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3, 99–101 (2016).
Wolff, J. et al. Galaxy HiCExplorer 3: a web server for reproducible Hi-C, capture Hi-C and single-cell Hi-C data analysis, quality control and visualization. Nucleic Acids Res. 48, W177–W184 (2020).
Seppey, M., Manni, M. & Zdobnov, E. M. BUSCO: Assessing genome assembly and annotation completeness. Methods Mol Biol. 31, 227–245 (2019).
Kriventseva, E. V. et al. OrthoDB v8: update of the hierarchical catalog of orthologs and the underlying free software. Nucleic Acids Res. 43, 250–256 (2015).
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34, W435–W439 (2006).
Ian, K. Gene finding in novel genomes. BMC Bioinform. 5, 59 (2004).
Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 (2004).
Alexandre, L., Burns, P. D. & Mark, B. Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res. 42, e119 (2014).
Keilwagen, J., Frank, H. & Grau, J. GeMoMa: homology-based gene prediction utilizing intron position conservation and RNA-seq Data. Methods Mol. Biol. 1962, 161–177 (2019).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
Haas, B. J., Salzberg, S. L., Zhu, W. & Pertea, M. Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biol. 9, R7 (2008).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Han, Y. J. & Wessler, S. R. MITE-Hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Res. 38, e199 (2010).
Ellinghaus, D., Kurtz, S. L. & Willhoeft, U. LTRharvest: an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinform. 9 (2008).
Zhao, X. & Hao, W. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268 (2007).
Ou, S. J. & Ning, J. LTR_retriever: A highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 176, 1410–1422 (2017).
Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Current protocols in bioinformatics/editorial board, Andreas D. Baxevanis. Chapter 4: Unit 4.10. (2004).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. PNAS. 117, 9451–9457 (2020).
Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).
NCBI sequence read archive https://identifiers.org/ncbi/insdc.sra:SRR30552289 (2024).
NCBI sequence read archive https://identifiers.org/ncbi/insdc.sra:SRR30552290 (2024).
NCBI sequence read archive https://identifiers.org/ncbi/insdc.sra:SRR30552291 (2024).
Li, Z. Atrina pectinata isolate ZL-2024a, whole genome shotgun sequencing project. Genbank https://identifiers.org/ncbi/insdc:JBFXME000000000 (2024).
Li, Z. Z. The annotation file of the chromosome-level genome of Atrina pectinata. figshare. Dataset. https://doi.org/10.6084/m9.figshare.26493232.v1 (2024).
NCBI sequence read archive https://identifiers.org/ncbi/insdc.sra:SRR30684059 (2024).
Du, X. D. et al. The pearl oyster Pinctada fucata martensii genome and multi-omic analyses provide insights into biomineralization. GigaScience 6, 1–12 (2017).
Acknowledgements
This work was supported by the Key Research and Development Project of Shandong Province (2022CXPT002), the State Key Laboratory of Mariculture Biobreeding and Sustainable Goods (BRESG202302), the National Key R&D Program of China (2022YFD2400105), the Agricultural Industrial Technology System in Shandong Province (SDIT-14), and National Marine Genetic Resource Center.
Author information
Contributions
Biao Wu, Bo Liu and Zhihong Liu conceived and designed the study and provided the funding support; Zhuanzhuan Li and Liqing Zhou analyzed the data; Xi Chen processed the comparative genome; Peizhen Ma and Xiujun Sun visualized the data; Yanxin Zheng and Tao Yu collected the samples; Zhuanzhuan Li discussed the results and wrote the original draft; Jianfeng Ren revised the manuscript. All authors have read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, Z., Liu, B., Chen, X. et al. The chromosomal-level genome assembly and annotation of pen shell Atrina pectinata. Sci Data 12, 617 (2025). https://doi.org/10.1038/s41597-025-04978-x