Background & Summary

Oysters are among the most important aquaculture species worldwide, accounting for an annual production of ~7 million metric tons (World Food and Agriculture – Statistical Yearbook 2024 (fao.org)). Genetic improvement, including selective breeding, hybridization and polyploidization, plays an important role in supporting oyster aquaculture1. One of the most significant advances over the last four decades is the development of triploid oysters. Triploid oysters have three sets of chromosomes and grow significantly faster than diploids due to their increased heterozygosity, polyploid gigantism and sterility1,2. Sterility is desired for aquaculture as it prevents uncontrolled reproduction of cultured stocks. Although incomplete, sterility in oysters inhibits excessive gonad development and improves meat quality during the reproductive season when mature diploids are undesirable. Because of their sterility, superior growth and improved meat quality, triploids have become one of the most popular stocks for oyster aquaculture1,2,3. The commercialization of triploid oysters, especially of the Pacific oyster Crassostrea gigas and the Eastern oyster Crassostrea virginica, has contributed significantly to meeting market demand around the world4,5. Triploid oysters now account for 30–70% of the cultured oysters in major producing countries such as France, Australia, USA and China1. Originally, triploids were induced by retaining the second polar body in newly fertilized eggs with chemicals such as cytochalasin B (CB) or 6-dimethylaminopurine (6-DMAP)6,7. However, chemical induction had low efficiency, which hindered commercial production. The successful development of tetraploid oysters by Guo and Allen8 (1994) made it possible to produce triploids by mating diploids with tetraploids, which is 100% effective and requires no toxic chemicals5. Nowadays, triploid oysters are commercially produced through diploid × tetraploid crosses. Thus, the successful production and breeding of tetraploids are critical for oyster aquaculture that is heavily dependent on triploids.

The Guo and Allen method for tetraploid induction involves blocking the release of polar body I in eggs from triploid Pacific oysters fertilized by haploid sperm (3n♀ × 2n♂), which produced the first autotetraploid Pacific oysters8,9. While it is difficult to replicate, tetraploids can also be obtained using normal diploid eggs (2n × 2n)10. Induction of tetraploids has been reported in several oyster species, including C. gigas, C. virginica, C. angulata, Crassostrea hongkongensis, Crassostrea sikamea, and the tropical oysters Crassostrea belcheri (Sowerby) and Crassostrea iredalei (Faustino)11,12,13,14, although it is not clear whether breeding populations of tetraploids have been established in the latter two species. Most of the tetraploid oysters produced so far are autotetraploids. Allotetraploids can also be produced between species that can hybridize. Tetraploid genomes represent a new state of whole genome duplication that may be unstable and undergo rapid reorganization and evolution. With two different genomes, allotetraploids may be more stable because of the preferential pairing of homologs that reduces multivalent formation. The presence of two genomes also provides a rare opportunity for studying genome interaction. Allotetraploids may also generate new genotypes by combining characteristics of two species and produce superior triploids for aquaculture. High-quality assemblies of diploid genomes have been produced for several oyster species and have led to advances in our understanding of oyster biology and environmental adaptation15,16,17,18,19,20,21,22. The sequencing and analysis of tetraploid genomes may provide insights into the biology and evolutionary potential of tetraploids.

We previously produced allotetraploid oysters between the Pacific oyster C. gigas and the Portuguese oyster C. angulata, two closely related species that dominate oyster aquaculture production23. In this study, we used long reads generated by PacBio sequencing, short reads generated by Illumina sequencing, and high-throughput chromosome conformation capture (Hi-C) analysis to construct a high-quality chromosome-level genome assembly of the allotetraploid oyster. The final assembly is 1,230.39 Mb in 717 contigs, with a contig N50 length of 2.56 Mb and a scaffold N50 length of 57.22 Mb. More than 90% of contigs (1,108.13 Mb) were anchored on 20 chromosomes. The assembly contains 571.24 Mb (46.43%) of repetitive sequences and 7,961 noncoding RNAs. Using de novo, transcript-based and homology-based strategies, a total of 58,330 protein-coding genes were predicted, of which 98.34% (57,360) were annotated in the publicly available NCBI RefSeq non-redundant protein, eggNOG, KEGG, SWISS-PROT, Pfam, TrEMBL, GO, and KOG databases. This allotetraploid oyster genome assembly provides a valuable resource for studying interactions between two genomes after duplication and hybridization, which are important for our understanding of the evolutionary biology of polyploids. The interaction or reorganization of the two genomes will likely create novel genotypic combinations or structural variations that affect the phenotype; these can be used to study the genetic control of production traits and to improve the aquaculture performance of polyploid oysters.

Methods

Sample and sequencing

The allotetraploid oyster was artificially induced between the Portuguese oyster C. angulata and the Pacific oyster C. gigas. First, allotriploids were produced in 2015 by mating diploid C. angulata females with autotetraploid C. gigas males; the latter were produced by blocking the release of polar body I in eggs from triploid Pacific oysters fertilized by haploid sperm, and then went through several generations of random 4n × 4n mating from 2009 to 2015 involving several lines or populations. Second, allotetraploids were produced in 2018 with the Guo and Allen method8 using eggs from the allotriploids and sperm from diploid C. angulata. Subsequently, allotetraploids were reproduced by 4n × 4n crosses for three generations. For this study, one allotetraploid oyster was sampled on 27 May 2024 from the F3 allotetraploids that were produced in 2022 (Fig. 1). The tetraploidy of the sampled oyster was confirmed by flow cytometry. Adductor muscle was collected, flash-frozen in liquid nitrogen, and then used for genomic DNA extraction (~30 mg tissue) with the DNeasy Blood & Tissue Kit (Qiagen, Hilden, Germany). Agarose (1.0%) gel electrophoresis, a Qubit 3 Fluorometer (Invitrogen) and a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) were used to determine DNA concentration and quality. The genomic DNA was used to build sequencing libraries, including a 15-kb insert PacBio HiFi library and a 150-bp insert Illumina paired-end library.

Fig. 1

Breeding history leading to the sequenced allotetraploid oyster in this study.

High-molecular-weight (HMW) gDNA was prepared for PacBio HiFi read production, and libraries were constructed using the PacBio Template Prep Kit 1.0 according to the standard template preparation protocol with BluePippin size selection (Pacific Biosciences, USA). The genomic libraries were sequenced on two cells in circular consensus sequencing (CCS) mode on the PacBio Sequel II system. A total of 34.37 Gb of HiFi long-read data with a read N50 length of 16.59 kb (average read length of 16.18 kb) was obtained, corresponding to 27.94-fold coverage of the allotetraploid oyster genome.

The short-insert library was constructed using the NR604-VAHTS Universal V6 RNA-seq Library Prep Kit (Vazyme) and then sequenced on the Illumina NovaSeq 6000 platform in paired-end mode (PE 150) following the standard protocol (Illumina Inc., San Diego, CA, USA). A total of 156.82 Gb (127.50-fold coverage) of clean reads with a Q30 of 93.52% were obtained to assess the allotetraploid oyster genome size.

Hi-C libraries were also constructed for genome assembly24,25. The same fresh adductor muscle was crosslinked with 1.0% formaldehyde, and the reaction was terminated with 0.2 M glycine. Libraries were generated according to the manufacturer’s instructions: (1) digestion with the HindIII restriction enzyme, (2) labeling with Biotin-14-dATP (Thermo Fisher Scientific, USA), (3) ligation with T4 DNA ligase, (4) physical shearing into 300–700 bp fragments, and (5) selective capture with streptavidin magnetic beads. Illumina HiSeq 6000 platform was used for sequencing. We obtained 147.05 Gb (119.55-fold coverage) of clean data.
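The fold-coverage figures quoted for the three libraries follow from dividing data yield by genome size. The sketch below reproduces that arithmetic; using the final assembly size (1,230.39 Mb) as the denominator is an assumption, since the text does not state which size estimate was used, so the results agree with the reported values only approximately.

```python
# Final assembly size reported in this study (Mb).
# Using it as the coverage denominator is an assumption.
ASSEMBLY_SIZE_MB = 1230.39


def fold_coverage(yield_gb: float, genome_size_mb: float = ASSEMBLY_SIZE_MB) -> float:
    """Return sequencing depth: total sequenced bases divided by genome size."""
    return (yield_gb * 1000.0) / genome_size_mb


# Data yields (Gb) as reported in the text.
libraries = {
    "PacBio HiFi": 34.37,
    "Illumina PE150": 156.82,
    "Hi-C": 147.05,
}

for name, yield_gb in libraries.items():
    print(f"{name}: {fold_coverage(yield_gb):.1f}-fold")
```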

For genome annotation, we collected tissues from four organs (gill, mantle, adductor muscle and labial palp) for RNA-seq. Total RNA was extracted from each organ and then pooled in equal amounts into one sample. The RNA mixture was used for library construction and sequencing on the Illumina NovaSeq 6000 platform following the standard protocol (Illumina Inc., San Diego, CA, USA). A total of 7.17 Gb of clean data was obtained.

Genome assessment and assembly

Illumina paired-end clean reads (156.82 Gb) were used to survey the genome features of the allotetraploid oyster via the k-mer method. Jellyfish v2.1.427 (parameter: -h 1000000000) and GenomeScope v2.0026 (parameters: -k 19 -p 4 -m 1000000000) were used to generate and analyze the k-mer count histogram (k = 19) (Fig. 2). Genome size was estimated with the formula $G = N_{k\text{-mer}} / D_{k\text{-mer}}$, where $N_{k\text{-mer}}$ is the total number of k-mers, $D_{k\text{-mer}}$ is the average k-mer depth, and $G$ is the genome size. The survey estimated the haploid genome size of the allotetraploid oyster at 544.56 Mb, with heterozygosity, repetitive sequence ratio and GC content of 5.50%, 46.69% and 34.61%, respectively (Table 1). In total, 34.37 Gb of HiFi long reads were assembled using Hifiasm v0.1928 with default parameters, resulting in a total length of 1,815.36 Mb comprising 1,610 contigs with a contig N50 length of 2.29 Mb for the allotetraploid oyster (Table 2).
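The k-mer formula above can be illustrated with a few lines of Python. This is a minimal sketch rather than the procedure implemented in GenomeScope, which fits a statistical model to the histogram: here the depth of the tallest histogram peak stands in for the average k-mer depth, and k-mers below a hypothetical `min_depth` cutoff are discarded as likely sequencing errors. Both choices are assumptions made for illustration.

```python
from typing import Dict


def estimate_genome_size(histogram: Dict[int, int], min_depth: int = 5) -> float:
    """Estimate genome size as G = N_kmer / D_kmer.

    `histogram` maps k-mer depth -> number of distinct k-mers seen at that
    depth (the two columns of a k-mer count histogram).
    """
    # Discard low-depth k-mers, which mostly arise from sequencing errors
    # (the cutoff value is an assumption).
    kept = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")

    # N_kmer: total number of k-mer occurrences.
    total_kmers = sum(depth * count for depth, count in kept.items())

    # D_kmer: depth of the tallest peak, used here as a stand-in for the
    # average k-mer depth (assumption).
    peak_depth = max(kept, key=kept.get)

    return total_kmers / peak_depth


# Toy histogram: an error peak at depth 1 and a main peak at depth 30.
toy = {1: 1000, 29: 10, 30: 100, 31: 10}
print(estimate_genome_size(toy))
```

For a polyploid genome the histogram has several peaks, so which peak is taken as the reference depth determines whether the result approximates the haploid or the full genome size.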

Fig. 2

The 19-mer frequency distribution in the allotetraploid oyster genome. The percentages of different genotypes are: aaaa [94.5%], aaab [3.15%], aabb [1.99%], aabc [0.001%] and abcd [0.361%]. The X-axis is the k-mer coverage, and the Y-axis is the product of k-mer frequency and coverage.

Table 1 Characteristics of the allotetraploid oyster genome based on k-mer analysis.
Table 2 Assembly statistics of the allotetraploid oyster genome.

Chromosomal-level assembly with Hi-C

To anchor contigs, 491,784,314 clean reads generated from the Hi-C data were mapped to the assembly using BWA v0.7.17-r118829 with default parameters. Valid interaction pairs (125,471,951 pairs), defined as paired reads whose mates mapped to different contigs, were then used for Hi-C-assisted scaffolding with HiC-Pro v2.10.030, which also filters out invalid interaction pairs including self-ligation, non-ligation, PCR amplification, random break, and extreme fragments. LACHESIS v2.0.131 was used for agglomerative hierarchical clustering, sorting and orientation (cluster_min_re_sites = 544; cluster_max_link_density = 2; order_min_n_res_in_trunk = 908; order_min_n_res_in_shreds = 870). All 1,610 contigs were clustered into 717 groups (contigs after Hi-C) with a contig N50 length of 2.56 Mb and a scaffold N50 length of 57.22 Mb (Table 2), of which 91.35% (655) were anchored on 20 chromosomes. Finally, 583 contigs were successfully sorted and oriented, with a total length of 1,108.13 Mb for the allotetraploid oyster (Table 3). A chromatin contact matrix was built with Juicebox v1.532; the 20 chromosomes are clearly distinguishable in the heatmap, with distinct interaction signals around the diagonal within chromosomes and between adjacent chromosomes (Fig. 3). Moreover, we carried out collinearity analysis of the assembled allotetraploid genome with the original diploid C. gigas (GCF_963853765.1) and C. angulata (GCA_025765675.3) reference genomes using Diamond v0.9.29.13033 (e < 1e−5, C score > 0.5) and MCScanX34 (MCScanX -s 5 -m 5). The pronounced collinearity indicated highly conserved gene blocks among the allotetraploid oyster, diploid C. gigas and C. angulata (Fig. 4).
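The contig and scaffold N50 values reported here use the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch of that calculation follows; the function name and the toy lengths are illustrative and are not taken from the authors' pipeline.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")

    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: the cumulative sum always reaches half")


# Toy example with eight sequences totalling 32 units.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Applied to contig lengths this gives the contig N50; applied to scaffold (chromosome-anchored) lengths it gives the scaffold N50.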

Table 3 Statistics of allotetraploid oyster genome sequence length (chromosome level).
Fig. 3

Genome assembly of the allotetraploid oyster. (a) Chromosome interaction heat map of the Hi-C assembly. Color indicates the intensity of interaction from yellow (low) to dark red (high). (b) Circos plot of the assembly. Tracks a, b, c, d and e indicate chromosome ideograms, TE density, SSR density, gene density and GC content, respectively.

Fig. 4

Collinearity analysis among genomes of the allotetraploid oyster, diploid C. gigas and C. angulata.

Repeat sequences annotation

Whole-genome repeat sequences, including tandem repeats and transposable elements (TEs), were annotated using a combined strategy of ab initio prediction and homology alignment. The MIcroSAtellite identification tool (MISA v2.135) and Tandem Repeats Finder (TRF v4.0936; parameters: 2 7 7 80 10 50 500 -d -h) were used to predict tandem repeats, which yielded a total of 83.22 Mb of tandem repeats (6.76% of the genome assembly) (Table 4). For TEs, a customized repeat library was built using RepeatModeler v2.0.137 (BuildDatabase -name; RepeatModeler -pa 12), which runs two de novo repeat-finding programs, RECON v1.0.838 and RepeatScout v1.0.639. The library was then classified by RepeatClassifier with default parameters according to the public databases Dfam v3.540 and Repbase v19.0641. LTRharvest v1.5.1042 and LTR_finder v2.843 (ltr_finder -w 2 -C -D) were used to identify full-length long terminal repeat retrotransposons (fl-LTR-RTs). High-quality intact fl-LTR-RTs and a non-redundant LTR library were then generated by LTR_retriever v2.9.044. We combined the above de novo TE sequence libraries with public databases to construct a non-redundant species-specific TE library, which was then used to identify and classify the final TE sequences by homology search with RepeatMasker v4.1.245 (repeatmasker -nolow -no_is -norna -engine wublast -parallel 8 -qq). A total of 488.03 Mb of TEs were identified, accounting for 39.67% of the genome assembly. Among TEs, DNA transposons and retroelements accounted for 26.85% (330.34 Mb) and 12.82% (157.68 Mb) of the genome assembly, respectively (Table 4).

Table 4 Annotation of repeat sequences for the assembled allotetraploid oyster genome.

Noncoding RNAs and pseudogene annotation

For noncoding RNA annotation, miRNA, rRNA, tRNA, snoRNA and snRNA were identified by specific approaches. miRNAs were identified against the miRBase database46. Based on the Rfam v14.547 database, tRNA and rRNA were identified by tRNAscan-SE v1.3.148 and barrnap v0.949 (barrnap --kingdom euk --threads 1), respectively, and snoRNA and snRNA were identified by Infernal v1.150 (cmscan --cpu 3 --rfam). In total, 7,710 tRNA, 179 rRNA and 72 miRNA were predicted (Table 5).

Table 5 Noncoding RNAs and pseudogene annotation of the assembled allotetraploid oyster genome.

GenBlastA v1.0.451 and GeneWise v2.4.152 were used to identify homologous pseudogenes after excluding functional genes (genblasta -P wublast -pg tblastn) and to search for premature stop codons and frameshift mutations (genewise -both -pseudo), respectively. We obtained 362 pseudogenes with an average length of 5.64 kb (Table 5).

Protein-coding gene prediction and functional annotation

Three approaches, de novo prediction, homology-based prediction, and mRNA-based prediction, were applied for protein-coding gene prediction in the allotetraploid genome. Two ab initio gene-prediction programs, Augustus v3.1.053 and SNAP v2006-07-2854, were used for de novo gene model prediction in the repeat-masked (hard-masked) assembly. For homology-based prediction, protein sequences of four well-annotated species (C. angulata (GCA_025765675.3), C. ariakensis (GCA_020567875.1), C. virginica (GCF_002022765.2) and Danio rerio (GCA_049306965.1)) were downloaded and aligned to the repeat-masked genome assembly. GeMoMa v1.755 (run.sh mmseqs) was then used to predict gene models based on the sequence alignments. The 7.17 Gb of clean RNA-seq data was used for mRNA-based prediction. Hisat2 v2.1.056 (hisat2 --dta -p 10) and StringTie v2.1.457 (stringtie -p 2) were used to assemble transcripts, and GeneMarkS-T v5.158 was used to predict genes based on the transcripts. Finally, EVidenceModeler (EVM) v1.1.159 was used to integrate all gene models predicted by the above methods, and the result was then refined by PASA v2.4.160 to generate a weighted and non-redundant gene set. A total of 58,330 protein-coding genes (Table 6) were predicted, with an average exon number of 7.84 per gene and an average gene length of 8.27 kb (Table 7).
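Summary statistics such as the average exon number per gene and the average gene length can be derived from a GFF3 annotation file. The sketch below shows one way to do so; it assumes a conventional gene > mRNA > exon hierarchy with one transcript per gene, and the function name and toy records are hypothetical. It is not the authors' script, and their exact counting rules are not stated in the text.

```python
from collections import Counter
from typing import Iterable, Tuple


def summarize_gff3(lines: Iterable[str]) -> Tuple[float, float]:
    """Return (mean gene length in bp, mean exon count per transcript)."""
    gene_lengths = []
    exons_per_transcript = Counter()

    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        feature, start, end, attrs = cols[2], int(cols[3]), int(cols[4]), cols[8]

        if feature == "gene":
            # GFF3 coordinates are 1-based and inclusive.
            gene_lengths.append(end - start + 1)
        elif feature == "exon":
            fields = dict(f.split("=", 1) for f in attrs.split(";") if "=" in f)
            # An exon may list several parents separated by commas.
            for parent in fields.get("Parent", "").split(","):
                if parent:
                    exons_per_transcript[parent] += 1

    mean_length = sum(gene_lengths) / len(gene_lengths) if gene_lengths else 0.0
    mean_exons = (
        sum(exons_per_transcript.values()) / len(exons_per_transcript)
        if exons_per_transcript
        else 0.0
    )
    return mean_length, mean_exons


# Toy annotation with two genes (records are illustrative only).
toy = [
    "##gff-version 3",
    "chr1\tEVM\tgene\t1\t1000\t.\t+\t.\tID=g1",
    "chr1\tEVM\tmRNA\t1\t1000\t.\t+\t.\tID=t1;Parent=g1",
    "chr1\tEVM\texon\t1\t200\t.\t+\t.\tID=e1;Parent=t1",
    "chr1\tEVM\texon\t801\t1000\t.\t+\t.\tID=e2;Parent=t1",
    "chr1\tEVM\tgene\t2001\t2500\t.\t-\t.\tID=g2",
    "chr1\tEVM\tmRNA\t2001\t2500\t.\t-\t.\tID=t2;Parent=g2",
    "chr1\tEVM\texon\t2001\t2500\t.\t-\t.\tID=e3;Parent=t2",
]
print(summarize_gff3(toy))
```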

Table 6 Gene prediction for the allotetraploid oyster genome using three methods.
Table 7 The comparison of gene models predicted from the allotetraploid oyster, Pacific oyster (C. gigas) and Portuguese oyster (C. angulata).

For functional annotation of gene models, we searched against public biological functional databases, including Non-Redundant (NR), Evolutionary Genealogy of Genes: Non-supervised Orthologous Groups (eggNOG)61, Gene Ontology (GO), TrEMBL, EuKaryotic Orthologous Groups (KOG), Kyoto Encyclopedia of Genes and Genomes (KEGG)62, SWISS-PROT63 and Pfam64, using Diamond blastp (Diamond v0.9.29.13033, diamond blastp --masking 0 -e 0.001). A total of 57,360 genes (98.34% of the total predicted genes) were functionally annotated (Table 8).

Table 8 Statistics of gene functional annotation of the allotetraploid oyster genome assembly.

Data Records

The raw PacBio, Hi-C, and Illumina sequencing data are deposited in the NCBI Sequence Read Archive database under the accession numbers SRR32607952, SRR32607953, SRR32459897, SRR32456008 and SRR3245587665. The genome assembly has been deposited in the NCBI GenBank database under the accession number JBPJCZ00000000066. Moreover, the genomic annotation results have been deposited in the figshare database67.

Technical Validation

Four methods were used to evaluate the genome assembly: mapping of Illumina reads, mapping of PacBio HiFi reads, BUSCO assessment, and core gene integrity. The Illumina short reads and PacBio HiFi reads were mapped to the assembly using BWA v0.7.17-r118829 and Minimap2 v2.2868, respectively. As shown in Table 9, 99.09% of short reads and 99.95% of HiFi reads were mapped to the allotetraploid genome. The completeness of the assembly was evaluated with the Core Eukaryotic Genes Mapping Approach (CEGMA) v2.569 database and Benchmarking Universal Single-Copy Orthologs (BUSCO) v2.070 against metazoa_odb10. A total of 447 (97.60%) of the 458 conserved eukaryotic core genes from the CEGMA database and 938 (98.32%) of the full set of 954 BUSCO orthologous groups were identified in the assembled genome (Table 10). All single-copy genes are expected to be duplicated in the allotetraploid, so the fact that 12.4% of BUSCO orthologs are present in single copies indicates significant gene loss after the whole genome duplication. Moreover, we randomly selected 36 genes from the allotetraploid oyster and aligned them with the genome assemblies of C. gigas and C. angulata. A total of 33 genes showed high identity (≥ 90%) and 1 gene showed low identity (~80%) with both species. Two genes aligned with only one of the two species (Table 12). These findings indicate that some genome reorganization has occurred, which may alter the fitness and aquaculture performance of allotetraploids. The Hi-C heatmap shows strong interactions within chromosomes and between paired chromosomes (Fig. 3). Taken together, these results confirm that the allotetraploid oyster genome assembly is of high quality considering its high heterozygosity and repeat content.
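The completeness percentages quoted for CEGMA and BUSCO are simple ratios of genes found to genes searched. A short sketch that recomputes them from the counts given in the text (the helper name is hypothetical):

```python
def completeness(found: int, total: int) -> float:
    """Return the percentage of reference genes recovered, rounded to 2 decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * found / total, 2)


# Counts as reported in the text.
cegma_pct = completeness(447, 458)  # conserved eukaryotic core genes
busco_pct = completeness(938, 954)  # metazoa_odb10 orthologous groups

print(f"CEGMA: {cegma_pct}%  BUSCO: {busco_pct}%")
```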

Table 9 Statistical results of short-read (Illumina) and HiFi-reads (PacBio) alignment.
Table 10 The CEGMA and BUSCO assessment of allotetraploid oyster genome assembly.

Alignment of 4 randomly selected genes confirms the presence of both C. gigas and C. angulata alleles (Table 11). Some heterozygous alleles between the A and B subgenomes clearly originated from the Pacific oyster and Portuguese oyster genomes, respectively (black boxes in Fig. 5), confirming allotetraploidy of the sequenced oyster. Some homozygous alleles between the A and B subgenomes originated from a single parental genotype (C. gigas or C. angulata), and some loci had alleles absent in the reference genomes of both parental species. In addition, we aligned species-specific COI sequences (C. gigas: TAGTAGCAGACATGCAATTTCCTCGA; C. angulata: CGTGATAATTGGGGGGTTTGGTAACT) against the 156.82 Gb of Illumina short-read data. A total of 20,006 reads mapped to the C. angulata-specific COI sequence, while only 3 reads mapped to the C. gigas-specific COI sequence. This result indicates that the mitochondrial genome is from C. angulata, consistent with the known pedigree of the allotetraploid oyster.
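The COI check can be approximated by counting reads that contain either species-specific sequence on either strand. The sketch below uses exact substring matching, which is an assumption: the text does not name the alignment tool or its mismatch tolerance, so counts obtained this way need not equal the reported ones. The two marker sequences are those given above; the function names are hypothetical.

```python
from typing import Dict, Iterable

# Species-specific COI marker sequences, as given in the text.
MARKERS = {
    "C. gigas": "TAGTAGCAGACATGCAATTTCCTCGA",
    "C. angulata": "CGTGATAATTGGGGGGTTTGGTAACT",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_marker_reads(reads: Iterable[str], markers: Dict[str, str] = MARKERS) -> Dict[str, int]:
    """Count reads containing each marker exactly, on either strand."""
    probes = {name: (seq, reverse_complement(seq)) for name, seq in markers.items()}
    counts = {name: 0 for name in markers}
    for read in reads:
        read = read.upper()
        for name, (forward, reverse) in probes.items():
            if forward in read or reverse in read:
                counts[name] += 1
    return counts


# Toy reads: two carry the C. angulata marker (one per strand), one carries neither.
toy_reads = [
    "AAA" + MARKERS["C. angulata"] + "CCC",
    reverse_complement(MARKERS["C. angulata"]),
    "ACGT" * 10,
]
print(count_marker_reads(toy_reads))
```

In practice the reads would be streamed from the FASTQ files rather than held in a list, and an aligner that tolerates mismatches would be less sensitive to sequencing errors within the marker.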

Table 11 One-to-one correspondence of 4 randomly selected genes among the genome assemblies of C. gigas and C. angulata and the A and B subgenomes of the allotetraploid oyster in this study.
Table 12 Identity in CDS sequence of 36 randomly selected genes of the allotetraploid oyster with that of the parental species.
Fig. 5

Sequence alignment of 4 randomly selected gene segments among C. gigas, C. angulata, and the A and B subgenomes of the allotetraploid oyster in this study.