Background & Summary

The Chinese soft-shelled turtle (Pelodiscus sinensis) belongs to the order Testudines, family Trionychidae, and genus Pelodiscus, and is distributed across many Asian countries, including China, Japan, Korea, and Vietnam1. Owing to its high nutritional and medicinal value, the P. sinensis breeding industry has developed rapidly in recent years. According to FAO data, total production of P. sinensis reached 375,000 tons in 2022, making it one of the most important aquatic species2. In China, previous studies classified P. sinensis populations into strains based on their geographical distribution, including the northern strain from the northern region of Hebei Province, the Yellow River strain from the Yellow River basin, the Dongting Lake, Poyang Lake, and Taihu Lake strains from the Yangtze River basin, the southwestern strain from Guangxi Province, and the Taiwan strains from southern and central Taiwan3,4. With the expansion of aquaculture production, cross-regional breeding between farms has led to the degradation of P. sinensis germplasm resources5. Furthermore, overfishing and unregulated introductions have reduced the wild resources of P. sinensis6. The species is listed as “Vulnerable” on the International Union for Conservation of Nature (IUCN) Red List of Threatened Species7.

At present, research on the evaluation of P. sinensis germplasm resources focuses mainly on morphological characterization, mitochondrial diversity, and phylogenetic relationships between strains4,8,9. However, the degree of genetic differentiation among the geographical populations of P. sinensis remains unclear. It has been suggested that different habitats and a long evolutionary history might account for the genetic differentiation of P. sinensis3. With the development of sequencing technology, whole-genome sequencing has largely overcome the limitations of traditional genetic methods, such as the lack of molecular markers, and provides a reference for germplasm resource conservation and genetic differentiation research10,11,12. Although a genome of the soft-shelled turtle was published in 2013, it was a fragmented draft with a scaffold N50 of 3.33 Mb13. A high-quality reference genome of P. sinensis can advance conservation genetics and research on the molecular mechanisms underlying economically important traits of this species.

This study combined Illumina paired-end sequencing, PacBio HiFi sequencing, and high-throughput chromosome conformation capture (Hi-C) to generate sequencing data for constructing a chromosome-level genome of P. sinensis. The total length of the genome is about 2.24 Gb, with a contig N50 of 107.61 Mb, and 97.2% of the BUSCO genes were detected as complete, indicating excellent completeness and sequence continuity. A total of 21,532 protein-coding genes were predicted in the assembled genome, and 98.22% of them were functionally annotated. In recent years, genomes of several turtle and tortoise species have been reported, including Chelonia mydas13, Mauremys mutica14, Mauremys reevesii15, Rafetus swinhoei16, Gopherus agassizii17, Trachemys scripta elegans18, Platysternon megacephalum19, Chrysemys picta bellii20, Aldabrachelys gigantea21, and Pelochelys cantorii22. The high-quality chromosome-level genome provided in this study may further serve as a valuable resource for evolutionary research on reptiles.

Methods

Sample collection and sequencing

A healthy 1-year-old female P. sinensis was collected from a breeding farm in Huzhou, Zhejiang Province, China (37.0750 °N, 113.9221 °E) in June 2022. Muscle, spleen, kidney, heart, lung, and liver tissues were collected, quickly frozen in liquid nitrogen for one hour, and then stored at −80 °C. Liver tissue was used for DNA sequencing and genome assembly, while all six tissues were used for RNA sequencing. Genomic DNA and RNA were extracted using the Genomic DNA Extraction Kit (Takara Bio Inc., Dalian, China) and RNAiso Plus Reagent (Takara Bio Inc., Dalian, China), respectively.

For short-read sequencing, the Illumina HiSeq X platform (Illumina, San Diego, CA, USA) was used to perform paired-end sequencing with an insert size of 350 bp. The quality of the raw reads was evaluated using fastp v0.21.0 with default parameters23, and clean reads were obtained by removing reads containing adapters, low-quality bases, or poly-N. For long-read DNA sequencing, PacBio HiFi sequencing was performed on a PacBio Sequel II platform in circular consensus sequencing (CCS) mode24. To anchor scaffolds onto the chromosomes, a Hi-C library was constructed according to a previously described protocol25,26. The liver tissue of P. sinensis was crosslinked using paraformaldehyde solution and digested with the MboI restriction enzyme. The ends of the restriction fragments were labeled with biotinylated nucleotides, and the ligated DNA was extracted, purified, and sheared into 350 bp fragments for Hi-C library construction. Finally, the library was quantified by Q-PCR and sequenced on the Illumina HiSeq X platform (Illumina, San Diego, CA, USA). After removing adapters and low-quality short reads, a total of 241.66 Gb (109.84×) of Hi-C data was generated. In addition, total RNA was extracted from muscle, spleen, kidney, heart, lung, and liver tissues. RNA quality and quantity were assessed for all tissues using a NanoDrop spectrophotometer (NanoDrop products, Wilmington, DE, USA), a 2100 Bioanalyzer (Agilent Technologies, CA, USA), and 1% agarose gel electrophoresis. Six RNA-seq libraries were then constructed and sequenced on the Illumina HiSeq X platform (Illumina, San Diego, CA, USA). Additionally, RNA from all tissues was mixed in equal amounts for Iso-Seq, and the cDNA library was sequenced on the PacBio Sequel II platform. In total, we obtained 471.77 Gb of sequencing data, which included 104.21 Gb (47.36×) of Illumina reads, 87.28 Gb (39.67×) of PacBio HiFi reads, 241.66 Gb (109.84×) of Hi-C data, and 38.62 Gb of RNA sequencing data.
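The depth values quoted above follow from dividing each data yield by a genome size. The short sketch below illustrates that arithmetic. The genome size of 2.2 Gb is an assumption made here because it approximately reproduces the reported depths; the text does not state which value was used.

```python
# Sketch: convert sequencing yield (Gb) into mean depth (fold coverage).
# ASSUMED_GENOME_SIZE_GB is an assumption, not a value stated in the text;
# about 2.2 Gb approximately reproduces the reported depths.
ASSUMED_GENOME_SIZE_GB = 2.2

# Data yields reported in the text, in Gb.
yields_gb = {
    "Illumina": 104.21,
    "PacBio HiFi": 87.28,
    "Hi-C": 241.66,
    "RNA sequencing": 38.62,
}


def depth(yield_gb, genome_size_gb=ASSUMED_GENOME_SIZE_GB):
    """Return mean sequencing depth for a given data yield."""
    return yield_gb / genome_size_gb


# Total yield across all libraries, rounded to two decimals.
total_gb = round(sum(yields_gb.values()), 2)

for name in ("Illumina", "PacBio HiFi", "Hi-C"):
    print(f"{name}: {depth(yields_gb[name]):.2f}x")
print(f"Total yield: {total_gb} Gb")
```

Depth is only meaningful for the genomic libraries, so the RNA sequencing yield enters the total but not the depth calculation.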

De novo assembly and chromosome construction of the P. sinensis genome

A k-mer analysis of the Illumina short reads was used to survey the genome features of P. sinensis27. Genome size, heterozygosity, and duplication rate were estimated using GenomeScope v2.028. The 17-mer analysis estimated a genome size of approximately 2.14 Gb, with a duplication rate of 52.49% and a heterozygosity of 0.81%. The initial assembly of the PacBio HiFi long reads was generated using Hifiasm v0.19.8 with default parameters29. Heterozygous sequences were removed using Purge_haplotigs v1.1.1 with default parameters30. The draft genome had a total size of 2.24 Gb and comprised 220 contigs with a contig N50 of 107.61 Mb. Hi-C sequencing generated approximately 805.56 million read pairs. To assemble a chromosome-level genome, the Hi-C reads were mapped to the assembled genome and filtered with Juicer v1.631. The contigs were ordered and anchored into chromosomes using 3D-DNA32 and manually adjusted using Juicebox33. The Hi-C interaction heatmap indicated that the genome assembly was of excellent quality (Fig. 1A). A previous study reported that P. sinensis has a diploid chromosome number of 2n = 66 (n = 33)34. Circos35 was used to visualize the 33 chromosomes together with total TE density, DNA-TE density, LINE density, LTR density, and GC% density (Fig. 1B). The longest and shortest chromosomes were 336.74 Mb and 13.04 Mb in length, respectively (Table 1). For the final genome assembly, the contig N50 and scaffold N50 reached 107.61 Mb and 129.58 Mb, respectively (Table 2).
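For readers unfamiliar with the N50 statistic used above, the following minimal sketch shows how it is computed from a list of sequence lengths. The toy lengths are hypothetical and are not taken from this assembly.

```python
def n50(lengths):
    """Return the N50 of a set of sequence lengths.

    The N50 is the length L such that sequences of length >= L together
    account for at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # prints 70
```

The same function applies to contigs or scaffolds; only the input list differs.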

Fig. 1

Genome-wide Hi-C chromosomal interaction heatmap (A) and Circos plot of the genome (B). The rings from inside to outside indicate (a) pseudochromosome length of the genome, (b) gene density, (c) total transposable element (TE) density, (d) DNA-TE density, (e) long interspersed nuclear element (LINE) density, (f) long terminal repeat (LTR) density, and (g) GC% density.

Table 1 Statistics of assembled chromosomes sequence length.
Table 2 Statistics of P. sinensis genome assembly.

The completeness and accuracy of the assembled genome were assessed by short-read mapping and Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis. The short reads were aligned to the genome using BWA v0.7.10-r78936; 98.43% of the reads were aligned, demonstrating a high mapping ratio for the short-read sequencing data. Furthermore, the completeness of the assembled P. sinensis genome was assessed using BUSCO v5.4.6 with the vertebrata_odb10 database37. Among the 3354 single-copy orthologous genes, 3260 (97.2%) and 27 (0.8%) were identified as complete and fragmented BUSCOs, respectively, indicating that the assembled P. sinensis genome is of high quality (Table 3).
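The BUSCO percentages above can be reproduced from the reported counts. The sketch below does this and also derives the number of missing BUSCOs, which is implied by the other counts rather than stated in this paragraph.

```python
def busco_summary(total, complete, fragmented):
    """Return BUSCO percentages (one decimal) and the implied missing count."""
    missing = total - complete - fragmented

    def pct(n):
        return round(100 * n / total, 1)

    return {
        "complete_pct": pct(complete),
        "fragmented_pct": pct(fragmented),
        "missing_count": missing,
        "missing_pct": pct(missing),
    }


# Counts reported in the text for the vertebrata_odb10 lineage.
summary = busco_summary(total=3354, complete=3260, fragmented=27)
print(summary)
```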

Table 3 BUSCO evaluation of P. sinensis genome.

Repetitive and non-coding gene prediction

Repetitive elements were annotated by combining two approaches: de novo prediction and homology-based alignment38. For de novo prediction, repetitive elements and long terminal repeats were identified in the genome using RepeatModeler39 and LTR_FINDER40 with default parameters. Homology-based alignment was then performed against the RepBase database41. DNA-level and protein-level transposable elements (TEs) were detected using RepeatMasker and RepeatProteinMask42, respectively. Tandem repeats were identified with Tandem Repeats Finder43. The repetitive element annotations are listed in Table 4. By combining the RepBase and de novo datasets, we obtained approximately 1.03 Gb of nonredundant repetitive sequences, accounting for 45.81% of the genome.
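When annotations from several sources are combined, overlapping hits must be counted only once to obtain a nonredundant total. The sketch below illustrates the idea by merging overlapping intervals. It is an illustration of the concept, not the procedure used in this study, and the coordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def covered_bases(intervals):
    """Return the number of bases covered after removing redundancy."""
    return sum(end - start for start, end in merge_intervals(intervals))


# Hypothetical repeat coordinates on one sequence from two sources.
homology_hits = [(0, 100), (250, 400)]
de_novo_hits = [(50, 150), (380, 500), (900, 1000)]

combined = homology_hits + de_novo_hits
print(merge_intervals(combined))  # [(0, 150), (250, 500), (900, 1000)]
print(covered_bases(combined))    # 500
```

Dividing the nonredundant length by the assembly length then gives the repeat fraction of the genome.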

Table 4 Classification of repetitive sequences and ncRNAs.

For noncoding RNA (ncRNA) annotation, rRNAs and tRNAs were predicted using RNAmmer v1.244 and tRNAscan-SE v1.345, respectively. Other ncRNAs were detected by searching against the Rfam database46. Six types of ncRNAs, comprising 24 lncRNAs, 837 miRNAs, 2958 rRNAs, 721 snRNAs, 10 ribozymes, and 7394 tRNAs, were identified in the P. sinensis genome (Table 4).

Gene prediction and functional annotation

Gene structures were predicted by combining three approaches: de novo, homology-based, and RNA-seq-based prediction. For de novo prediction, AUGUSTUS v3.4.047, GlimmerHMM v3.0.448, Genscan v3.149, GeneID v1.450, and SNAP (version 2006-07-28)51 were run with default parameters. For homology-based prediction, the protein sequences of Alligator sinensis, Chelonia mydas, Chrysemys picta bellii, Deinagkistrodon acutus, Gallus gallus, Gekko japonicus, and P. sinensis (previously published)13 were downloaded from Ensembl52 and used as references. For the RNA-seq-based method, the full-length transcriptome sequences generated by PacBio sequencing were aligned to the genome using TopHat v2.1.153, and gene structures were predicted using Cufflinks v2.2.154. All gene models were merged, and redundancy was removed using MAKER255. Overall, a total of 21,532 protein-coding genes were predicted, with an average transcript length of 40,287.42 bp, an average CDS length of 1597.32 bp, an average exon length of 167.95 bp, an average intron length of 4546.19 bp, and an average of 9.51 exons per gene (Table 5).

Table 5 Statistics of gene structure and functional annotation of P. sinensis genome.

For functional annotation, DIAMOND v2.0.656 was used to align all protein-coding genes to the non-redundant protein (NR) and Swissprot databases with an E-value threshold of 1e-5. Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways were annotated using Blast2GO57. Protein motifs and domains were identified using the Pfam database58.
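As an illustration of how an E-value threshold of 1e-5 acts on alignment results, the sketch below keeps the best-scoring hit per query from tab-separated output. It assumes the default 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score). The sample rows and identifiers are hypothetical, and in practice the threshold is usually applied by the aligner itself.

```python
import csv
import io

EVALUE_CUTOFF = 1e-5  # threshold stated in the text


def best_hits(tsv_text, cutoff=EVALUE_CUTOFF):
    """Return {query: (subject, evalue, bitscore)} for the best hit per query.

    Hits with an E-value above the cutoff are discarded; among the
    remaining hits, the one with the highest bit score is kept.
    """
    best = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > cutoff:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical alignment rows, for illustration only.
sample = (
    "geneA\tprotX\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t550\n"
    "geneA\tprotY\t60.0\t280\t100\t2\t5\t285\t3\t282\t1e-40\t160\n"
    "geneB\tprotZ\t30.0\t50\t35\t0\t1\t50\t10\t60\t0.5\t25\n"
)
hits = best_hits(sample)
print(hits)
```

Here geneB is dropped because its only hit has an E-value above the cutoff, so it would remain unannotated by this database.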

A total of 21,149 genes (98.22% of the predicted protein-coding genes) were annotated in at least one of the above databases, with 89.59%, 82.51%, 94.79%, 84.01%, and 65.34% annotated in Swissprot, Pfam, NR, KEGG, and GO, respectively (Table 5). A total of 12,880 genes were annotated in all five databases (Fig. 2).
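The two summary counts above correspond to a set union (genes annotated in at least one database) and a set intersection (genes annotated in every database). The sketch below shows both operations on hypothetical gene identifiers and reproduces the reported annotation rate from the stated gene counts.

```python
# Hypothetical gene identifiers per database, for illustration only.
annotations = {
    "Swissprot": {"g1", "g2", "g3"},
    "Pfam": {"g1", "g2", "g4"},
    "NR": {"g1", "g2", "g3", "g4"},
    "KEGG": {"g1", "g2"},
    "GO": {"g1", "g5"},
}

# Genes annotated in at least one database (union).
annotated_any = set.union(*annotations.values())
# Genes annotated in every database (intersection, the Venn centre).
annotated_all = set.intersection(*annotations.values())


def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


print(len(annotated_any), sorted(annotated_all))
# Annotation rate from the counts reported in the text.
print(percent(21149, 21532))  # prints 98.22
```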

Fig. 2

Venn diagram of the number of genes from P. sinensis genome functional classification using multiple public databases.

Ethics statement

This study was approved by the Institutional Animal Care and Use Committee (IACUC) of the Zhejiang Institute of Freshwater Fisheries. All methods used in this study were conducted in accordance with the approved guidelines.

Data Records

All raw sequencing data used in this study were submitted to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database under BioProject accession number PRJNA1149904. The Illumina WGS, PacBio HiFi, Iso-Seq, and Hi-C data were deposited under the accession numbers SRR3030500559, SRR3030500460, SRR3032361761, and SRR3030500662, respectively. The RNA-seq data were archived under the accession numbers SRR3030499863, SRR3030499964, SRR3030500065, SRR3030500166, SRR3030500267, and SRR3030500368 for the kidney, spleen, lung, muscle, liver, and heart tissues, respectively. The genome assembly has also been deposited at NCBI under the accession number GCA_049634645.169. The genome annotation has been deposited at Figshare70.

Technical Validation

To verify the completeness and accuracy of the genome assembly, a BUSCO v5.4.6 assessment was conducted with the vertebrata_odb10 database. The final genome assembly showed a BUSCO completeness of 97.2%, with 95.9% single-copy BUSCOs, 1.3% duplicated BUSCOs, 0.8% fragmented BUSCOs, and 2.0% missing BUSCOs (Table 3). Furthermore, the PacBio HiFi reads were mapped to the genome using BWA to calculate the mapping ratio. The mapping ratio was 98.43%, and the genome coverage was 99.66%. In addition, a total of 21,532 nonredundant protein-coding genes were predicted by combining de novo, homology-based, and RNA-seq-based prediction, of which 21,149 were functionally annotated. Together, the high mapping ratio, genome coverage, recovery rate of single-copy orthologues, and gene number indicate the high quality of the P. sinensis genome.