Background & summary

Starry flounder (Platichthys stellatus, FishBase ID: 1787), a member of the family Pleuronectidae in the order Pleuronectiformes, has garnered attention as a promising aquaculture flatfish species along the coasts of Korea and North China. This cold-water flatfish is naturally distributed in coastal waters of the North Pacific and Arctic oceans, but its range extends beyond marine habitats to include estuarine transition zones, brackish lagoons, and fully freshwater rivers and lakes1,2,3, indicating strong adaptability to euryhaline conditions. In addition, studies have shown that starry flounder can survive normally at salinities of 0-33 ppt4. Starry flounder is therefore an ideal model for studying the molecular genetic mechanisms of euryhaline adaptation in teleost fishes. However, no high-quality starry flounder reference genome has been reported so far.

High-quality genome sequences are the molecular basis for understanding the genetic mechanisms of environmental adaptation in fish. In recent years, a large number of fish genome sequences have been decoded, revealing the genetic basis of fish adaptation to different environments, including salinity (Dicentrarchus labrax, Tenualosa ilisha, and Takifugu obscurus)5,6,7, high altitude (Triplophysa bleekeri, Glyptosternon maculatum, and Oxygymnocypris stewartii)8,9,10, low temperature (Notothenia coriiceps, Parachaenichthys charcoti, and Chionodraco myersi)11,12,13, heat (Gadus morhua)14, light (Thunnus orientalis)15, deep sea (Coryphaenoides rupestris and Pseudoliparis swirei)16,17, and extreme alkaline environments (Leuciscus waleckii)18. The initial genome assembly of the starry flounder, generated solely by Illumina short-read sequencing (GCA_016801935.1)19, exhibited limited continuity (contig N50: 33.2 kb) owing to the limitations of short-read sequencing technology. A chromosome-scale reference built with third-generation long-read sequencing is therefore needed to resolve these deficiencies and to support evolutionary-developmental studies and aquaculture genomics applications.

In the present study, we assembled an improved, high-quality, chromosome-scale starry flounder genome by combining Illumina short-read sequencing, PacBio Circular Consensus Sequencing (CCS), and high-throughput chromosome conformation capture (Hi-C) sequencing technologies (Fig. 1). To our knowledge, this is the highest-quality genome sequence of starry flounder reported so far. The genomic resources obtained in this study provide a basis for genetic research in starry flounder and lay a robust foundation for the development of molecular breeding technology for this species.

Fig. 1

The genome snail plot of P. stellatus.

Methods

Sample collection and genome sequencing

A two-year-old female starry flounder was obtained from Yantai, Shandong, China. Genomic DNA was extracted from fresh muscle samples for short-read sequencing, long-read PacBio HiFi sequencing, and Hi-C sequencing. The quality and the concentration of genomic DNA were determined by agarose gel electrophoresis and NanoDrop 2000, respectively. All procedures including the sample collection and handling of the starry flounder in this study conformed to the ethical principles of the Animal Care and Use Committee of Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences (CAFS).

For short-read sequencing, qualified genomic DNA was randomly fragmented, and a library with a 350 bp insert size was constructed using the Illumina DNA PCR-Free Prep kit (Illumina, USA). Sequencing was performed on the Illumina NovaSeq 6000 platform in 150 bp paired-end (PE) mode. A total of 57.84 Gb of raw data, corresponding to about 90× depth of the genome, was generated (Table 1).

Table 1 Summary of sequencing data for P. stellatus genome assembly.

For PacBio HiFi sequencing, qualified genomic DNA was used to construct a PacBio HiFi library with the SMRTbell prep kit 2.0 (PacBio, USA) according to the manufacturer's protocols, and the qualified library was then sequenced on the PacBio Sequel II platform in Circular Consensus Sequencing (CCS) mode. In total, 34.95 Gb (55×) of PacBio HiFi long reads were produced for the subsequent genome assembly (Table 1). The average length of the HiFi reads was 15.94 kb (Table 1).

To construct the chromosome-level genome of the starry flounder, a Hi-C library was prepared. The Hi-C library construction process included formaldehyde crosslinking, cell lysis, enzymatic digestion, end repair, biotin labeling, blunt-end ligation, crosslinking reversal, and DNA purification20. The qualified Hi-C library was then sequenced in 150 bp PE mode on the Illumina NovaSeq 6000 platform. As a result, 113.21 Gb (180×) of Hi-C sequencing data were generated (Table 1).
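The depth figures quoted for the three libraries follow from dividing data yield by genome size. The sketch below redoes that arithmetic under one labeled assumption: it uses the 643.56 Mb assembly size reported later in this paper as the denominator, whereas the genome size the authors actually used for the 90×, 55×, and 180× figures is not stated. The results therefore land close to, but not exactly on, the reported values.

```python
# Sequencing depth = total bases sequenced / genome size.
# ASSUMPTION: the assembled genome size (643.56 Mb) is used as the denominator;
# the text does not say which genome size estimate the reported depths rely on.
GENOME_SIZE_MB = 643.56

# Data yields in Gb, as reported in the text for each library type.
yields_gb = {
    "Illumina": 57.84,
    "PacBio HiFi": 34.95,
    "Hi-C": 113.21,
}


def depth(yield_gb: float, genome_mb: float = GENOME_SIZE_MB) -> float:
    """Return fold coverage for a data yield in Gb and a genome size in Mb."""
    return yield_gb * 1000.0 / genome_mb


depths = {name: depth(gb) for name, gb in yields_gb.items()}

if __name__ == "__main__":
    for name, value in depths.items():
        print(f"{name}: {value:.1f}x")
```

The small gap between these values and the rounded depths in the text would be expected if a slightly different genome size estimate was used.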

Genome assembly

The PacBio HiFi data described above were used for draft genome assembly with Hifiasm (v0.19.5)21 using default parameters. Then, purge_dups (v1.2.5)22 was applied to identify and remove haplotypic duplications from the primary draft assembly. Pilon (v1.23) was then used to polish the draft genome with the Illumina data. After initial assembly and polishing, we obtained a 643.56 Mb reference genome of starry flounder with a contig N50 length of 10.00 Mb, which greatly improves continuity and completeness compared with the current reference genome (GCA_016801935.1; contig N50: 33.20 kb) (Table 2) and represents an approximately 301-fold increase in contig N50. To construct the chromosome-level genome, the 3D-DNA pipeline23 and Juicebox (v1.91)24 were used to examine and visualize the interaction frequencies among chromosomes and to anchor the initially assembled scaffolds to pseudochromosomes with the Hi-C data. As a result, 605.10 Mb of sequence, covering 94.02% of the genome assembly, was anchored and oriented into 24 pseudochromosomes with a scaffold N50 length of 26.19 Mb (Fig. 2 and Table 2). We further searched for telomeric repeat motifs (CCCTAA/TTAGGG) in the starry flounder genome assembly using quarTeT25. In total, 18 telomeres were identified, with telomeres detected at both ends of one chromosome (Table S1). These findings indicate that the new starry flounder genome assembly is a significant improvement over the current reference genome.
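For readers less familiar with the continuity metrics used here, the following minimal sketch shows how an N50 is computed from a list of sequence lengths and reproduces the fold-improvement and anchoring-rate arithmetic from the values reported above. The toy contig lengths are hypothetical and serve only to illustrate the definition; they are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together cover at least half of the total length."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy contigs (total 300): 80 + 70 = 150 reaches half the total.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_lengths)

# Arithmetic on values reported in the text.
new_contig_n50_kb = 10.00 * 1000   # 10.00 Mb expressed in kb
old_contig_n50_kb = 33.20          # GCA_016801935.1
fold_improvement = new_contig_n50_kb / old_contig_n50_kb

anchored_mb = 605.10
assembly_mb = 643.56
anchored_percent = 100 * anchored_mb / assembly_mb

if __name__ == "__main__":
    print(f"toy N50: {toy_n50}")
    print(f"contig N50 fold improvement: {fold_improvement:.0f}")
    print(f"anchored: {anchored_percent:.2f}%")
```

Sorting in descending order and accumulating until half the total is reached is the conventional definition; the same function applies to contigs or scaffolds.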

Table 2 Comparative statistics of genome assembly in P. stellatus.
Fig. 2

The Hi-C heatmap of chromosome interactions in P. stellatus.

Repeat annotation

Repetitive elements were annotated by combining homology-based and de novo prediction. In detail, RepeatMasker (v4.0.5)26 and RepeatProteinMasker (v4.0.5) were used to detect interspersed repeats and low-complexity sequences against the Repbase database (21.01)27 at the nucleotide and protein levels, respectively. Then, RepeatMasker was used to detect species-specific repeat elements using a custom database generated by RepeatModeler (v1.0.8)28 and LTR-FINDER (v1.0.6)29. Moreover, Tandem Repeat Finder (v4.0.7)30 was employed to predict tandem repeats. All predicted repeat annotations were integrated into a non-redundant repetitive sequence set of 227.87 Mb, representing 35.41% of the assembled genome (Table 3). Among them, DNA transposons, long terminal repeats (LTRs), long interspersed nuclear elements (LINEs), and short interspersed nuclear elements (SINEs) accounted for 19.02%, 9.04%, 8.76%, and 0.97% of the genome, respectively (Table 3).
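Integrating calls from several tools into a non-redundant set amounts to taking the union of overlapping intervals, so that bases annotated by more than one tool are counted once. The sketch below illustrates that idea on hypothetical coordinates; it is not the authors' integration script, and the tool names serve only as labels for the made-up intervals. It also rechecks the reported repeat fraction from the two figures given in the text.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def covered_bases(intervals):
    """Count bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


# HYPOTHETICAL calls on a single sequence, one list per tool.
calls = {
    "RepeatMasker": [(100, 300), (600, 700)],
    "RepeatModeler": [(250, 400)],
    "TRF": [(650, 680), (900, 950)],
}
all_calls = [iv for tool_calls in calls.values() for iv in tool_calls]

naive_total = sum(end - start for start, end in all_calls)  # double counts overlaps
nonredundant_total = covered_bases(all_calls)               # counts each base once

# Reported values: 227.87 Mb of repeats in a 643.56 Mb assembly.
repeat_percent = 100 * 227.87 / 643.56

if __name__ == "__main__":
    print(merge_intervals(all_calls))
    print(naive_total, nonredundant_total)
    print(f"{repeat_percent:.2f}%")
```

Because elements of different classes can overlap or nest, per-class percentages computed separately need not sum to the non-redundant total.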

Table 3 Classification statistics of repeated elements in P. stellatus.

Protein-coding gene prediction and functional annotation

Protein-coding gene prediction was performed using a combination of de novo, homology-based, and transcriptome-based prediction strategies. For de novo prediction, Genscan31 and Augustus32 with default settings were used for the gene structure prediction. For homology prediction, protein sequences of Cynoglossus semilaevis, Paralichthys olivaceus, Amphiprion ocellaris, Anabas testudineus, and Acanthochromis polyacanthus were downloaded from NCBI and Ensembl, and were aligned to the starry flounder genome for homology-based annotation using Exonerate (v2.4.0)33. For transcriptome-based prediction, RNA-seq data downloaded from NCBI Sequence Read Archive (SRA) database (accession number: SRP216013) were aligned to the starry flounder genome using HISAT2 (v2.0.5)34, and the coding sequences were identified using TransDecoder (v5.5.0, https://github.com/TransDecoder/TransDecoder). Finally, MAKER (v3.01.03) was used to integrate the above prediction results, and a consensus protein-coding gene set consisting of 22,835 genes was obtained (Table 4). The distribution patterns of gene length, coding sequence (CDS) length, exon length, and intron length in starry flounder were similar to those of the other five fish species (Fig. 3).

Table 4 Statistics of predicted protein-coding genes in P. stellatus.
Fig. 3

Distribution of the gene length, coding sequence (CDS) length, exon length, and intron length among P. stellatus, C. semilaevis, P. olivaceus, Amphiprion ocellaris, Anabas testudineus, and Acanthochromis polyacanthus.

Functional annotation of the predicted genes was performed by aligning them to seven databases, namely InterPro35, GO36, KEGG37, Swissprot38, TrEMBL38, Pfam39, and NR40, using DIAMOND (v2.1.8)41 or the corresponding built-in software35. As a result, a total of 22,835 genes (95.18% of all predicted genes) were annotated (Table 5).

Table 5 Statistics of functional annotation of protein-coding genes in P. stellatus.

For non-coding RNA annotation, 5,761 tRNAs and 13,189 rRNAs were identified using tRNAscan-SE (v2.0.12)42 and BLASTN, respectively. In addition, 1,715 miRNAs and 2,417 snRNAs were predicted using INFERNAL43 based on the Rfam database (Table 6).

Table 6 Statistics of non-coding RNA in P. stellatus.

Data Records

The PacBio HiFi sequencing data, the Hi-C sequencing data, and the Illumina sequencing data have been deposited in the NCBI SRA database under accession number SRP564291 (ref. 44). The assembled genome has been submitted to NCBI GenBank under accession number JBLIWB000000000 (ref. 45). The assembly statistics of chromosomes and the assembly annotation files have been deposited at Figshare46.

Technical Validation

Completeness and quality assessment of genome assembly

The completeness of the starry flounder genome assembly was evaluated using BUSCO (v5.2.2)47 with the actinopterygii_odb10 database, which includes 3,640 BUSCOs. Of these, 3,579 (98.3%) were complete, comprising 3,542 (97.3%) single-copy and 37 (1.0%) duplicated BUSCOs. Only 18 (0.5%) fragmented BUSCOs and 43 (1.2%) missing BUSCOs were detected. The genome quality value (QV) was assessed with Merqury48, and the QV score was 37.68, indicating a high-quality assembly.
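To put the QV in more intuitive terms, the short sketch below converts it to a per-base error rate, assuming the standard Phred scaling QV = -10 log10(E) that Merqury uses for its consensus quality value. The function names are illustrative and are not part of Merqury.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


reported_qv = 37.68                      # value reported in the text
error_rate = qv_to_error_rate(reported_qv)
bases_per_error = 1 / error_rate         # average spacing between errors
accuracy_percent = 100 * (1 - error_rate)

if __name__ == "__main__":
    print(f"error rate: {error_rate:.2e}")
    print(f"about one error per {bases_per_error:.0f} bases")
    print(f"accuracy: {accuracy_percent:.3f}%")
```

Under this scaling, a QV between 30 and 40 corresponds to between one error per thousand and one error per ten thousand bases.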

Evaluation of the gene annotation

The completeness of the gene annotation was evaluated using BUSCO (v5.2.2) with the actinopterygii_odb10 database containing 3,640 BUSCOs. The results showed 3,498 (96.1%) complete BUSCOs, comprising 3,459 (95.0%) single-copy and 39 (1.1%) duplicated BUSCOs, together with 31 (0.9%) fragmented and 111 (3.0%) missing BUSCOs.
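The BUSCO percentages reported for the genome assembly and for the gene annotation can be rechecked directly from the category counts. The following sketch does so; the `BuscoCounts` class is a hypothetical helper written for this illustration and is not part of BUSCO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuscoCounts:
    """Category counts from a BUSCO summary (hypothetical helper class)."""
    single: int
    duplicated: int
    fragmented: int
    missing: int

    @property
    def complete(self) -> int:
        return self.single + self.duplicated

    @property
    def total(self) -> int:
        return self.complete + self.fragmented + self.missing

    def percentages(self) -> dict:
        """Return each category as a percentage of all BUSCOs, to one decimal."""
        counts = {
            "complete": self.complete,
            "single": self.single,
            "duplicated": self.duplicated,
            "fragmented": self.fragmented,
            "missing": self.missing,
        }
        return {name: round(100 * n / self.total, 1) for name, n in counts.items()}


# Counts as reported in the text (actinopterygii_odb10, 3,640 BUSCOs).
genome = BuscoCounts(single=3542, duplicated=37, fragmented=18, missing=43)
annotation = BuscoCounts(single=3459, duplicated=39, fragmented=31, missing=111)

if __name__ == "__main__":
    for label, counts in (("genome", genome), ("annotation", annotation)):
        print(label, counts.total, counts.percentages())
```

Both sets of counts sum to 3,640, matching the size of the lineage dataset stated in the text.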