Background & Summary

Bivalves, a diverse group of organisms that includes clams, oysters, mussels, and scallops, serve dual ecological and economic roles across aquatic ecosystems. Ecologically, they act as natural biofilters that purify water through nutrient recycling, and their environmental sensitivity makes them early-warning indicators of aquatic ecosystem change1,2,3. Their population viability serves as an integrated metric of ecosystem stressors, including chemical contamination, climate change, and habitat alteration4,5. Beyond their ecological significance, numerous bivalve species, including oysters, mussels, and scallops, are economically important in aquaculture, with a production of 2,700 tons in 2022 representing a commercial value of nearly 138.5 million dollars6. Additionally, pearls and shells produced by bivalves are highly valued in the jewelry and decorative industries, further emphasizing their economic importance.

The pearl oyster Pinctada maxima, an important tropical aquaculture species, is naturally distributed in the central Indo-Pacific region from Myanmar to the Solomon Islands, including Australia, Southeast Asia, the Philippines, and the South China Sea7. P. maxima is a vital economic resource for mariculture, valued for its ability to produce high-quality pearls of high economic value8,9. It is known for generating the largest pearls in the world, which often exceed 10 mm in size. The large size of these pearls, the highly sought-after “South Sea” pearls10, makes them especially desirable in the luxury market. Regions that cultivate these oysters, such as Australia, Indonesia, the Philippines, and French Polynesia, have gained substantial economic benefits from pearl farming11.

However, overfishing and environmental changes have led to a steep decline in pearl oyster populations12. China has designated it as a national second-class protected species13. Although artificial breeding techniques allow the production of pearl oysters, growth of the culture industry has been hindered by high larval mortality in mariculture14,15. Genomic resources are crucial for the conservation of P. maxima and for the development of its aquaculture industry. In addition, the pearl oyster serves as a crucial model organism for investigating the genetic mechanisms of biomineralization16, a field of considerable scientific importance. However, the limited genomic resources available for this key bivalve species have hindered the identification of genes that regulate critical pearl quality traits and unique biological characteristics such as biomineralization. Furthermore, genomic data are also of great value for the study of evolution, adaptation, longevity, gonad development, and sex determination in bivalves17,18,19.

In this study, the first chromosome-level genome of P. maxima was generated using PacBio long-read sequencing, Illumina short-read sequencing, and Hi-C technology. Repeats and protein-coding genes were annotated, and comparative genomic analyses, including molecular phylogenetic and genome synteny analyses, were conducted. This high-quality reference genome is of great value for genome-based breeding programs, for understanding complex biological processes, and for conserving germplasm resources, and it represents a significant advance in bivalve genomics.

Methods

Sample collection

An adult P. maxima was collected from Lingshui County, Hainan Province, China. After dissection, the adductor muscle, smooth muscle, gonad, mantle, gill, hepatopancreas, foot, and intestine tissues were immediately frozen in liquid nitrogen and stored at −80 °C until DNA and RNA extraction for subsequent sequencing.

DNA extraction and genome sequencing

High molecular weight genomic DNA was extracted from adductor muscle using the traditional phenol-chloroform protocol20. DNA purity and concentration were measured using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific), with acceptable thresholds of A260/A280 > 1.8 and A260/A230 > 2.0. DNA integrity and size were verified by electrophoresis on a 1% agarose gel, confirming fragments >30 kb. For survey sequencing, libraries with an approximate insert size of 300 bp were constructed using the VAHTS Universal Plus DNA Library Prep Kit for Illumina, followed by paired-end 150 bp sequencing on the Illumina NovaSeq 6000 platform. A total of 67.3 Gb of data was generated, corresponding to 51.56× genome coverage.

Long-read sequencing was performed using the PacBio Sequel II system (Pacific Biosciences, California, USA). PacBio libraries were prepared using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, California, USA) according to the manufacturer’s guidelines. Briefly, the library was constructed through several steps: magnetic bead enrichment, DNA repair and A-tailing, DNA purification, adapter ligation, and a final purification to remove small DNA fragments and excess reagents. A total of 52.1 Gb of data was generated, with an N50 read length of 17 kb.

Chromosome-level assembly was achieved using the Hi-C technique. Flash-frozen adductor muscle was processed to construct a Hi-C library using the Arima Genomics Hi-C Kit (San Diego, California, USA) following the manufacturer’s instructions. The samples underwent formaldehyde cross-linking, enzyme digestion, biotin marking of DNA ends, blunt-end ligation, and DNA purification. The Hi-C library was subjected to paired-end (2 × 150 bp) sequencing on the Illumina NovaSeq 6000 platform, yielding a total of 121.5 Gb of sequencing data.

RNA extraction and transcriptome sequencing

RNA was extracted from the gonad, mantle, gill, hepatopancreas, foot, adductor muscle, smooth muscle, and intestine tissues using TRIzol reagent. A NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, Waltham, MA) was used to determine RNA concentration, and an Agilent 2100 Bioanalyzer (Agilent, Santa Clara, CA) was used to assess RNA integrity. mRNA was captured from total RNA using poly-T oligo-attached magnetic beads. Libraries were prepared using the NEBNext Ultra RNA library preparation kit (NEB) and sequenced on the Illumina NovaSeq 6000 platform (Novogene, Sacramento, CA). In total, 50.8 Gb of high-quality reads was generated. To acquire more comprehensive information on full-length transcripts, a third-generation full-length transcriptome (PacBio isoform sequencing, Iso-seq) library was prepared using PacBio SMRT sequencing technology. Equal amounts of RNA from the eight sampled tissues were pooled to prepare the Iso-seq library. The library was prepared using the Clontech SMARTer PCR cDNA Synthesis Kit (Clontech, Mountain View, CA, USA) and the BluePippin Size Selection System (Sage Science, Beverly, MA, USA), as described in the Pacific Biosciences protocol (PN 100-092-800-03). The constructed PacBio library was sequenced on the PacBio Sequel II platform.

De novo genome assembly

K-mer analysis was conducted with Jellyfish21 and GenomeScope222 to estimate genome size, repeat content, and heterozygosity, based on 17-mer frequency profiles derived from 67.3 Gb of Illumina raw data. A total of 60,102,996,443 k-mers were identified, with a depth of 49×. The haploid genome size was estimated at 1.21 Gb, with a heterozygosity rate of 1.01% and repetitive sequences accounting for 61.75% of the genome. A draft contig assembly was generated using PacBio HiFi sequencing data. Subreads obtained from the PacBio Sequel II platform were processed with SMRT Link v10.2 to generate Circular Consensus Sequences (CCS) via multi-pass subread integration. CCS reads were refined using the CCS algorithm (minimum passes = 3, minimum read quality = 0.99) to eliminate adapter sequences and low-quality reads. De novo genome assembly was performed using Hifiasm v0.20.023, which assembles PacBio HiFi reads with high accuracy and contiguity.
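
The relationship between the k-mer count, the k-mer depth, and the genome size can be illustrated with a minimal sketch. It assumes the simplest estimator (total k-mers divided by the depth of the main peak); the function name estimate_genome_size is hypothetical and this is not the model fitted by GenomeScope2.

```python
def estimate_genome_size(total_kmers: int, peak_depth: float) -> float:
    """Return a naive haploid genome size (bp): total k-mers / main-peak depth.

    Assumption: every genomic k-mer is sampled peak_depth times on average,
    with no correction for sequencing errors or heterozygosity.
    """
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return total_kmers / peak_depth


# Values reported in the text: 60,102,996,443 17-mers at a depth of 49x.
size_bp = estimate_genome_size(60_102_996_443, 49)
print(f"{size_bp / 1e9:.2f} Gb")  # prints "1.23 Gb"
```

This naive value (about 1.23 Gb) is close to, but not identical with, the 1.21 Gb reported above, presumably because a model-based estimator fits the whole k-mer spectrum rather than dividing raw totals.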

Hi-C analysis and chromosome assembly

To assemble a high-quality chromosome-level genome, the preliminary assembly was anchored using Juicer24 and 3D-DNA25, followed by manual refinement in Juicebox24. Hi-C chromatin interaction patterns resolved 14 chromosomal scaffolds (Fig. 1A), yielding a final assembly of 1,264.93 Mb with a GC content of 36.18%. Approximately 97.94% of the genome was anchored to these 14 chromosomes, with a contig N50 of 649 kb and a scaffold N50 of 89.19 Mb. The genome architecture is visualized in a Circos plot (Fig. 1B).
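
For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows how it is conventionally computed from a list of sequence lengths. The function name n50 and the toy lengths are illustrative assumptions, not values from this assembly.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths): total = 300, and 80 + 70 = 150 reaches half.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # prints 70
```

Applied to contig lengths, this gives the contig N50; applied to scaffold lengths after Hi-C anchoring, it gives the scaffold N50.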

Fig. 1 Characteristics of the Pinctada maxima genome. (A) Hi-C interaction heat map of Pinctada maxima. (B) Circos plot of the P. maxima genome assembly. (a) Chromosome length in Mb; (b) gene density in 500 kbp windows; (c) GC content in 500 kbp windows; (d) depth of coverage of PacBio HiFi reads in 100 kbp windows; (e) depth of coverage of Illumina short reads in 100 kbp windows; (f) distribution of heterozygous SNPs in 500 kbp windows.

Repeat annotation

Repeat elements were annotated through a hybrid approach combining de novo prediction and structural features, using RepeatModeler v2.0.326 (https://github.com/Dfam-consortium/RepeatModeler), EDTA v2.0.027 and RepeatMasker v4.1.228 (https://www.repeatmasker.org/). Candidate LTR retrotransposons (LTR-RTs) were identified using LTR_finder29 with parameters ‘-size 5000000 -time 1500 -w 2 -C -D 15000 -d 1000 -L 7000 -l 100 -p 20 -M 0.85’ and LTRharvest v1.6.230 with parameters ‘-similar 90 -vic 10 -seed 20 -seqids yes -minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6 -motif TGCA -motifmis 1’. The LTR-RT candidates were filtered with LTR_retriever v3.0.131 using default parameters. EDTA v2.0.0, LTR_retriever v3.0.1 and RepeatModeler v2.0.3 were used to build de novo repeat libraries. Finally, the Perl script make_panTElib.pl from EDTA v2.0.0 was used to merge these into a combined repeat library, which served as the final library for identifying repeat sequences with RepeatMasker v4.1.2. Repeat sequences accounted for 65.46% of the genome, with DNA transposons making up the largest proportion (41.00%), followed by LTR (10.25%), LINE (7.52%) and SINE (0.16%) elements (Table 2).
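
As a quick consistency check on the percentages above, the sketch below sums the four named repeat classes and derives the share left for all other repeat categories. It assumes that the percentages are fractions of the final 1,264.93 Mb assembly; the variable names are illustrative.

```python
# Percentages reported in the text for the four named repeat classes.
repeat_pct = {"DNA": 41.00, "LTR": 10.25, "LINE": 7.52, "SINE": 0.16}
TOTAL_REPEAT_PCT = 65.46   # total repeat content reported in the text
ASSEMBLY_MB = 1264.93      # assumption: percentages refer to the final assembly

named_pct = sum(repeat_pct.values())
other_pct = TOTAL_REPEAT_PCT - named_pct          # repeats outside the four classes
repeat_mb = ASSEMBLY_MB * TOTAL_REPEAT_PCT / 100  # approximate repeat span in Mb

print(f"named classes: {named_pct:.2f}%")  # prints "named classes: 58.93%"
print(f"other repeats: {other_pct:.2f}%")  # prints "other repeats: 6.53%"
print(f"repeat span:   {repeat_mb:.1f} Mb")  # prints "repeat span:   828.0 Mb"
```

Under this assumption, roughly 6.5% of the genome falls into repeat categories not itemized in the text.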

Gene prediction and annotation

Protein-coding gene annotation was performed using the BRAKER v3.0.832 pipeline, which integrates multiple lines of evidence, including de novo prediction, homology-based searches and transcriptome-assisted methods. RNA-seq data generated in this study (SRP54613133) and published RNA-seq data (PRJNA362291, PRJNA636870 and PRJNA761869) were both used for de novo gene prediction. All RNA-seq data were mapped to the soft-masked genome using HISAT2 v2.2.134, with alignment rates ranging from 93.93% to 98.17%. BRAKER v3.0.8 and StringTie v2.2.135 were then used to build transcript models from all mapping results. The transcript models were fed into AUGUSTUS v3.5.036 for gene model training and prediction. Homology-based annotation was conducted using the amino acid sequences of Pinctada fucata37, Crassostrea gigas38, Patinopecten yessoensis39, Argopecten irradians40 and Chlamys farreri41. These amino acid sequences were aligned to the P. maxima genome using TBLASTN with an e-value threshold of 1e-10, and the aligned sequences were selected and provided to BRAKER v3.0.8. For the transcriptome-assisted annotation, Iso-seq data generated in this study were used to obtain full-length transcripts. The aligned HiFi reads were collapsed using the IsoSeq3 v3.8.2 collapse pipeline (https://isoseq.how/classification/workflow.html) to remove redundant transcripts resulting from 5′ RNA degradation. The script gmst.pl from GeneMarkS v5.142 was then used to predict the coding regions of the transcripts, and the predictions were integrated using the script gmst2globalCoords.py from BRAKER v3.0.8. Finally, all evidence was merged into a consensus gene set using TSEBRA v1.1.2.543 with parameters ‘--ignore_tx_phase -kl -f’. The weights assigned to each type of evidence were as follows: RNA-seq hints, 0.15; manual hints, 0.5; long-read hints, 0.5; protein hints, 3. In total, 26,315 protein-coding genes were identified (Table 3).

NR, Pfam and SwissProt annotation of the predicted protein-coding genes in P. maxima was performed using BLASTP with an e-value threshold of 1e-2. KEGG annotation was performed using KofamScan v1.344. GO annotation was obtained by mapping the annotation results from SwissProt. In total, 87.04% (22,905) of the protein-coding genes were functionally annotated (Table 3). The functional annotation results are available on Figshare (https://doi.org/10.6084/m9.figshare.28053659).
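
The e-value thresholds used above (1e-10 for TBLASTN, 1e-2 for BLASTP) can be applied to search results with a simple filter. The sketch below assumes hits in the standard 12-column BLAST tabular format, in which the e-value is the eleventh column; the text does not state which output format was used, and the function name filter_hits and the two example hits are hypothetical.

```python
def filter_hits(lines, max_evalue):
    """Keep (query, subject, evalue) for hits with evalue <= max_evalue.

    Assumption: tab-separated BLAST tabular output with 12 standard columns
    (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend,
    sstart, send, evalue, bitscore).
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue <= max_evalue:
            kept.append((fields[0], fields[1], evalue))
    return kept


# Two made-up hits for illustration only.
example = [
    "protA\tchr1\t85.0\t200\t30\t0\t1\t200\t1000\t1600\t1e-50\t180",
    "protB\tchr2\t40.0\t60\t36\t0\t1\t60\t500\t680\t1e-4\t45",
]
strict = filter_hits(example, 1e-10)  # homology-evidence threshold
lenient = filter_hits(example, 1e-2)  # functional-annotation threshold
print(len(strict), len(lenient))  # prints "1 2"
```

The stricter threshold retains only the strong hit, which illustrates why different cut-offs suit evidence gathering and functional assignment.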

Non-coding RNA genes, including rRNAs, tRNAs, snRNAs and miRNAs, were screened using INFERNAL v1.1.245 and tRNAscan-SE v2.0.1246. Four types of non-coding RNAs, comprising 43 miRNAs, 4,042 tRNAs, 241 rRNAs and 609 snRNAs, were identified in the P. maxima genome.

Comparative genome analysis

We conducted a systematic comparison with the four chromosome-level Pteriidae genomes (Pinctada fucata, Pinctada imbricata, Pteria penguin, and Pinctada radiata) and four well-assembled oyster genomes (Crassostrea gigas, Crassostrea virginica, Crassostrea hongkongensis and Ostrea edulis) available on NCBI to outline the distinguishing features of our assembly. Our newly assembled Pinctada maxima genome exhibits high scaffold-level contiguity, with a scaffold N50 of 89.19 Mb that is comparable to that of Ostrea edulis (94.3 Mb) and exceeds those of the other seven genomes. Although the contig N50 was lower than that of some species, possibly owing to the large genome size and the high proportion of repeat sequences combined with high heterozygosity, the high scaffold N50 demonstrates effective gap-closing during assembly. Our assembled genome shows excellent completeness, achieving 97.38% and 95.26% in the Metazoa and Mollusca BUSCO assessments, respectively, comparable to the available bivalve genomes (Table 1). The number of protein-coding genes in P. maxima (26,315) is lower than in the other two pearl oysters (36,588 and 36,733) but comparable to that of two oysters (25,901 and 27,763) (Table 3). Interestingly, the proportion of functionally annotated genes is higher in oysters than in pearl oysters (Table 3). The reason for the lower functional annotation rate in pearl oysters needs further investigation.

Table 1 Comparison of genome assembly metrics between Pinctada maxima and other bivalve genomes.
Table 2 Statistics of the genome-wide annotations in Pinctada maxima.
Table 3 The statistics of functional annotation for Pinctada maxima and five other bivalve species.

The genomes of P. maxima and 21 other species (Acanthopleura granulata, Argopecten purpuratus, Bathymodiolus platifrons, Caenorhabditis elegans, Ciona intestinalis, Crassostrea gigas, Danio rerio, Drosophila melanogaster, Gallus gallus, Homo sapiens, Laticauda laticaudata, Lottia gigantea, Mus musculus, Octopus bimaculoides, Patinopecten yessoensis, Pictodentalium vernedei, Pinctada fucata, Pinctada imbricata, Scapharca broughtonii, Sinonovacula constricta, Xenopus tropicalis) were used for gene family identification with OrthoFinder v2.5.547 using default parameters. Protein sequences were aligned using MUSCLE v3.8.3148, and the alignments were then refined with GBLOCKS 0.91b49 using stringent parameters (-b4=5 -b5=h -t=p). The optimal amino acid substitution model (LG+I+G+F) was determined with ProtTest3 v3.4.250 prior to maximum likelihood tree construction in RAxML v8.2.1251 with 1,000 bootstrap replicates. Divergence times were estimated using mcmctree in PAML52. Gene family contraction and expansion analysis was conducted in CAFE v5.0.053 using the result file generated by OrthoFinder. The phylogenetic tree was visualized with the online interactive tool iTOL v7 (Interactive Tree Of Life) (https://itol.embl.de/). Syntenic genomic blocks between P. maxima and P. fucata were identified and visualized using MCScan as implemented in jcvi v1.4.1154 with the parameter --cscore=0.99.

Data Records

The assembled genome has been deposited at GenBank under the accession JBLANZ00000000055. The raw Illumina PE150, PacBio, and Hi-C sequencing data have been deposited in the Sequence Read Archive (SRA) under the accession number SRP55285956. The raw RNA-seq and Iso-Seq data have been deposited in the SRA under the accession number SRP54613133. The assembled genome, functional annotation, and gene annotation files were uploaded to Figshare (https://doi.org/10.6084/m9.figshare.28053659)57.

Technical Validation

QUAST v5.3.058 was employed to assess the quality of the genome assembly, focusing on its size and contiguity. The total assembly size was 1,264.93 Mb, with a contig N50 of 649 kb and a scaffold N50 of 89.19 Mb (Table 1). Subsequently, we evaluated the completeness of the genome assembly using Benchmarking Universal Single-Copy Orthologs (BUSCO v5.8.1) with the metazoa_odb10 and mollusca_odb10 databases. For the metazoa_odb10 database, 97.38% of the core genes were found to be complete, with 96.02% single-copy and 1.36% duplicated (Table 1). The mollusca_odb10 database contains a total of 5,295 conserved core genes for Mollusca, and our assembled genome included 5,044 (95.26%) of these, with 4,964 (93.75%) single-copy and 80 (1.51%) duplicated (Table 1). We also used BUSCO to evaluate the completeness of the gene annotations, recovering 93.50% and 90.59% of the expected metazoan and molluscan genes, respectively (Table 4). Merqury v1.359 was used to evaluate genome quality with PacBio HiFi reads, yielding a consensus quality value (QV) of 55.64. In addition, Illumina paired-end clean reads and PacBio HiFi reads were mapped to the final assembly with BWA v0.7.1860 and Minimap2 v2.161, respectively, to evaluate the assembly, yielding very high mapping rates of 98.89% and 99.99%. The high quality of the genome assembly is further supported by the successful mapping of 95.39% ± 1.73% of transcriptome reads.
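
To make the validation metrics above easier to interpret, the following sketch converts a consensus QV into an approximate per-base error rate using the standard Phred relationship QV = −10·log10(p), and recomputes the BUSCO percentages from the reported gene counts. The function names are illustrative and are not part of Merqury or BUSCO.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Phred-scaled quality value -> per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(p: float) -> float:
    """Per-base error probability -> Phred-scaled quality value."""
    return -10 * math.log10(p)


def busco_percent(count: int, total: int) -> float:
    """Share of BUSCO genes in a category, as a percentage."""
    return 100 * count / total


err = qv_to_error_rate(55.64)
print(f"{err:.2e} errors per base")  # roughly 2.7e-06, i.e. about 3 per Mb

# Mollusca counts reported in the text (5,295 core genes in total).
for label, count in [("complete", 5044), ("single-copy", 4964), ("duplicated", 80)]:
    print(label, round(busco_percent(count, 5295), 2))
```

Under this conversion, a QV of 55.64 corresponds to fewer than three expected consensus errors per million bases.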

Table 4 BUSCO assessment of the completeness of gene annotations.

Gene family clustering identified a total of 35,646 orthogroups, of which 119 were single-copy orthogroups. The orthology analysis revealed that 24,684 genes in P. maxima were clustered into orthogroups, with 894 genes belonging to species-specific orthogroups. Among the three pearl oysters (P. maxima, P. fucata, and P. imbricata), 1,332, 1,860, and 1,281 genes were assigned to Pinctada-specific orthogroups, respectively. Branch support for the resulting ML topology was evaluated with 1,000 bootstrap replicates. Phylogenetic analysis indicated that P. maxima is most closely related to P. fucata, with an estimated divergence time of approximately 90 million years ago. Furthermore, 10,479 gene families were identified as having undergone expansion or contraction events. Specifically, 231 expanded and 633 contracted gene families were observed in P. maxima (Fig. 2A), of which 28 expanded and 48 contracted families were statistically significant (p < 0.05). The collinearity analysis between P. maxima and P. fucata identified 17,191 highly matched genomic block pairs and detected no large-scale rearrangements (fission, fusion, or deletion). These results indicate highly conserved genome synteny between P. maxima and P. fucata, with a generally one-to-one correspondence between their 14 chromosomes (Fig. 2B).
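
The orthogroup categories reported above can be illustrated with a minimal sketch that classifies orthogroups from a per-species gene-count table. It uses the usual definitions (single-copy: exactly one gene in every species; species-specific: genes from one species only) and assumes that every species has an entry in each row. The function names and the toy table are hypothetical and do not reproduce OrthoFinder's implementation.

```python
def single_copy_orthogroups(gene_counts):
    """Orthogroups with exactly one gene in every species.

    gene_counts maps orthogroup ID -> {species: gene count}; each inner
    dict is assumed to list all species, using 0 where a species is absent.
    """
    return [og for og, counts in gene_counts.items()
            if all(n == 1 for n in counts.values())]


def species_specific_orthogroups(gene_counts, species):
    """Orthogroups whose genes all come from the given species."""
    return [og for og, counts in gene_counts.items()
            if counts.get(species, 0) > 0
            and all(n == 0 for sp, n in counts.items() if sp != species)]


# Toy table with three species and three orthogroups (made-up counts).
toy_counts = {
    "OG1": {"P_maxima": 1, "P_fucata": 1, "C_gigas": 1},
    "OG2": {"P_maxima": 3, "P_fucata": 0, "C_gigas": 0},
    "OG3": {"P_maxima": 2, "P_fucata": 1, "C_gigas": 1},
}
print(single_copy_orthogroups(toy_counts))                   # prints ['OG1']
print(species_specific_orthogroups(toy_counts, "P_maxima"))  # prints ['OG2']
```

Because a single-copy orthogroup must contain exactly one gene in all 22 species, the number of such groups (119 here) is typically far smaller than the total number of orthogroups.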

Fig. 2 Comparative genome analysis. (A) Maximum likelihood tree of P. maxima and 21 other species with 1,000 bootstrap iterations. The scale bar represents 100 million years per unit. In the pie charts, red indicates contracted gene families, green indicates expanded gene families, and blue denotes gene families that have neither contracted nor expanded. The stacked bar chart displays the number of genes in each species categorized as shared single-copy orthogroups, shared multiple-copy orthogroups, species-specific orthogroups, Pinctada-specific orthogroups, and other genes. (B) Chord plot showing the chromosomal synteny between P. maxima and P. fucata.