Background & Summary

Crucian carp (Carassius carassius) is a widespread species in Northern Europe, normally found in smaller ponds and lakes with rather harsh environmental conditions. In some ponds crucian carp may even be the only fish species present. Due to the small surface area and little or no current in some ponds, ice forms during the winter and prevents oxygen from diffusing into the water from the air. When the layer of ice becomes covered with snow, sunlight is effectively blocked, preventing photosynthesis and thus replenishment of the oxygen that continues to be used by all remaining organisms. Consequently, the ponds eventually become depleted of oxygen (anoxic) until the ice melts in the spring. Unlike most other vertebrates1, the crucian carp can survive anoxia for months, explaining why it is often the sole fish species in ponds with seasonal anoxia. The physiological adaptations allowing it to survive anoxia are fairly well characterized2,3, with one key trait being the ability to convert the anaerobic end product lactate into ethanol, which can be excreted to the water via the gills, unlike lactate, which would accumulate in tissues and lead to severe acidosis. It has been shown that the pyruvate dehydrogenase complex of crucian carp has an additional and modified subunit of the E1 enzyme4, which is highly expressed in muscle tissue during anoxia and thought to have pyruvate decarboxylase activity, i.e. converting pyruvate into acetaldehyde, which can then be converted into ethanol by alcohol dehydrogenase. The carp-specific whole-genome duplication5,6,7 has been hypothesized to play a central role in the development of anoxia tolerance by enabling neofunctionalization of gene paralogs such as the extra E1 subunit4.

Here, we present a high-quality reference genome (74x coverage) that has been scaffolded using chromosome conformation capture (Hi-C) sequencing, and structurally and functionally annotated based on transcriptomic evidence. This genome assembly will open opportunities to study the molecular and evolutionary basis of anoxia tolerance in the crucian carp. The genome will also be useful for furthering research on evolutionary and genomic aspects of fish species that have undergone genome duplications. Interestingly, the anoxia tolerance of the closely related goldfish (Carassius auratus) is markedly lower than that of crucian carp (i.e. shorter survival time8), and anoxia tolerance in silver crucian carp (Carassius gibelio) has to our knowledge not been reported. Similarly, the common carp (Cyprinus carpio), from the same family of fish that underwent the carp-specific whole-genome duplication6, is only somewhat hypoxia tolerant9 and not anoxia tolerant. A specific comparison of the genes involved in known physiological and metabolic functions of anoxia tolerance, between crucian carp, silver crucian carp, goldfish, and common carp, can be the first step to shed light on what is still present in the anoxia-tolerant fish, and what was lost in the less tolerant fish. A comparison with the existing genomes of a farmed-type crucian carp10, silver crucian carp11, goldfish12, and common carp13 indicates that the genome we present here is more contiguous and more complete. This genome thus represents a necessary contribution to the larger effort of investigating anoxia tolerance from the genomic and transcriptomic point of view and elucidating the evolutionary history of the physiological mechanisms.
Having a high-quality genome of a crucian carp specimen from a population known to be recurrently exposed to anoxia14 (such as Tjernsrudtjernet in Oslo, Norway) and with an extensively characterised physiology and response to anoxia2,3,4,15,16,17,18,19,20,21,22,23 will be valuable for future studies linking physiological adaptations to molecular regulation and evolution. The genome will also be useful for studies of population genomics. Crucian carp from different ponds in Norway have been shown to have different morphology24 related to the presence or absence of predators, and population genomics can be used to investigate the genomic basis of these differences. It would also be useful to compare with other populations in Northern Europe (e.g. the farmed UK populations and wild populations in Finland), where the habitats may vary with regard to the extent of seasonal anoxia. In summary, the high-quality genome presented here will be an important resource for the fields of comparative animal physiology and fish ecology and evolution.

Methods

Sample acquisition

Specimens

The male crucian carp specimen selected for whole genome sequencing stems from a batch of crucian carp collected from a small pond in Oslo (‘Tjernsrudtjernet’; N 59.922886 E 10.609834) using nylon net cages. The fish were captured in October 2019 and held at 10–12 °C in the InVivo Aquarium facility (Department of Biosciences, University of Oslo) for approximately three months. The fish were fed by hand to satiation with commercial carp pellets twice daily (Tetrapond, Tetra, Melle, Germany), and kept under a 12 h:12 h light-dark cycle in 750 L tanks with a semi-closed recirculation system of aerated and dechlorinated tap water. At the time of sampling (in January 2020), the selected specimen was euthanized with a sharp blow to the head, after which blood was sampled by caudal puncture. A portion of the blood was preserved in ethanol while the remaining portion was flash frozen using liquid nitrogen, as were remaining tissues (brain, liver, red muscle, white muscle, gills, gonad, spleen, kidney, heart).

For structural annotation (using short-read RNA sequencing), samples were taken from multiple individuals exposed to normoxia and from different tissues. This collection included samples from kidney, spleen, gills, gonad (male and female), skin, scales, intestine, eye, liver, red muscle, and white muscle (from a batch collected 12 October 2021 from the same pond as mentioned above and sampled 11 August 2022). Additionally, brain tissues were taken from a batch collected 13 September 2013 and sampled 19 November 2014, and heart tissues were taken from a batch collected 23 September 2022 and sampled 8 June 2023. The fish were given a minimum of 2 weeks to acclimatize to holding conditions prior to any experiment or sampling. Individuals of both sexes were included to increase genetic diversity of the transcriptomic data. For the brain, samples were from three individuals, exposed to 6 days normoxia, 6 days anoxia, or 6 days anoxia followed by 1 day re-oxygenation, respectively. For the heart, one sample was from a fish exposed to normoxia for 1 day and another sample was from a fish exposed to anoxia for 2 days. Tissues were flash frozen in liquid nitrogen and stored at −80 °C. The anoxia-exposure experiments were carried out according to Norwegian animal research guidelines (‘Forskrift om bruk av dyr i forsøk’) at the InVivo Aquarium facility approved by the Norwegian Food Safety Authorities (approval no. 155/2008).

DNA and RNA extraction

For preparation of long-read and short-read DNA libraries, genomic DNA was extracted from 25 mg of blotted dry-weight muscle tissue using the Circulomics Nanobind HMW Tissue DNA kit (Handbook v06.16 3/2019), to obtain 263 ng/µL of DNA with modal peak size distribution of 47 kb. This DNA was used for the library preparation of PacBio long reads (for genome assembly) and Illumina short reads (for error correction of the genome assembly). For preparation of the Hi-C library (cross-linked DNA in close proximity for chromosome conformation capture), genomic DNA was extracted from blood in ethanol using the Arima-HiC kit with a modified version of the mammalian blood protocol. Specifically, the sample was washed with PBS and the ethanol removed, and then continued from step 12 in the Arima blood protocol.

The RNA for both short- and long-read sequencing was extracted using the TRIzol reagent (Cat. no. 15596026 and 15596018), following instructions from the manufacturer. The extracted RNA from different tissues was pooled (except the brain samples that were processed previously) and checked for integrity using a Bioanalyzer.

Library preparation and sequencing

The library preparation and sequencing were provided by the Norwegian Sequencing Centre (www.sequencing.uio.no), a national technology platform hosted by the University of Oslo.

Long-read and short-read DNA sequencing

The long-read DNA library was prepared using the Pacific Biosciences Express library preparation protocol without any fragmentation of the sample prior to library preparation. Size selection of the final library was performed using BluePippin with a 15 kb cut-off. The long-read library was sequenced on one 8 M SMRT cell on the Sequel II instrument using Sequel II Binding kit 2.0 and Sequencing chemistry v2.0. Loading was performed by diffusion (movie time: 15 hours). The sequencing yielded 5.8 M reads with an N50 insert length of 23 kb. The short-read DNA library was built from 1000 ng of genomic DNA using the Kapa Hyper prep PCR free workflow. For quality check, the library was amplified with PCR, purified, and checked with Fragment Analyzer and NGS kit. The library was sequenced on one lane Illumina HiSeq 4000 with 300 cycles (150 bp reads, paired end), yielding 297 M read pairs.

Hi-C sequencing

A quality control of the genomic DNA (following the Arima protocol) confirmed that the sample included correctly cross-linked proximal DNA, and therefore was ready for library preparation using the Arima library protocol. First, 3.4 µg of cross-linked DNA sample (from Arima kit) were sheared in Covaris tubes and Covaris E220 instrument. Then, after size selection, the biotin enrichment step used 382 ng of sheared DNA, followed by ligation using Illumina unique adaptors. The library was amplified with 10 cycles of PCR, checked in fragment analyzer (FA) and NGS kit. Finally, the Kapa Quantification kit was used for assessment of library concentration. The library was sequenced on one lane Illumina HiSeq 4000 with 300 cycles (150 bp read paired end), yielding 343 M read pairs.

Long-read and short-read RNA sequencing

The long-read RNA libraries were prepared from total RNA from each tissue using Pacific Biosciences protocol for Iso-Seq™ Express Template Preparation for Sequel® and Sequel II Systems. The libraries were multiplexed and sequenced on the PacBio Sequel II instrument using one SMRT cell with Sequel II Binding kit 2.0 and Sequencing chemistry v2.0 (loading by diffusion) and yielding 5,860,324 subreads with an average subread length of 3,093 bp. IsoSeq analysis to obtain full-length transcripts from subreads was performed using the IsoSeq pipeline (SMRT Link v9.0) with default parameters. Reads were demultiplexed prior to filtering for full-length reads and clustering of isoforms. This processing resulted in 223,902 high-quality isoforms. Polished ccs reads (3,137,504) were later created from the raw subreads using the PacBio command-line tool ‘ccs’ (SMRT Tools v10.1) with default filtering parameters.

RNA samples from diverse tissues of the crucian carp were pooled into 5 different sets to be sequenced as independent libraries to ensure sufficient read coverage from all sets. Each set was prepared with the strand-specific TruSeq™ mRNA-seq library prep, and all sets were sequenced together on one quarter of an S4 flow cell on an Illumina NovaSeq 6000. Heart samples from other fish in normoxia and anoxia were included in the same sequencing run. Additionally, already available brain RNA-seq data from a previous project were included (strand-specific TruSeq mRNA libraries multiplexed on 4 lanes Illumina HiSeq 2500; 250 cycles, paired end).

Genome assembly and annotation

Table 1 lists all the software and versions used in our pipelines, as described in more detail below. A schematic overview of the assembly and annotation steps is provided in Fig. 1. Unless otherwise indicated, computations were carried out on a high-performance computing cluster.

Table 1 Software packages and pipelines used for assembly and annotation.
Fig. 1

Overview of bioinformatic assembly and annotation pipelines. (a) De novo assembly of PacBio long DNA reads (1) followed by error correction using 150 bp Illumina DNA reads (2), mapping and scaffolding with proximity-ligated (Hi-C) DNA reads (3,4), visualization of contact maps and manual curation (5), resulting in a fully assembled genome with 290 scaffolds. Sequences were renamed and contigs > 2995 bp were kept, resulting in a final subset genome with 262 scaffolds (6) and a soft-masked version (7). (b) Ribosomal RNA, rRNAs (8a) and transfer RNAs, tRNAs (8b) were annotated from the genomic sequences, while models of protein coding genes were predicted from mapped RNA-seq reads (9) and a protein dataset consisting of the OrthoDB v10 ‘Actinopterygii’ dataset plus predicted proteins from the farmed UK crucian carp genome (fCarCar2; GCF_963082965.1) (10). Protein-coding gene models were further revised to improve UTR annotation and gene-to-isoform relationships using full-length transcript (PacBio IsoSeq clustered isoforms) (13), resulting in a final set of protein-coding gene models (12). (c) Functional annotation was carried out by extracting transcript and protein sequences (14) and searching for the protein sequences in different databases (15). Details of software packages and scripts used, including versions, are provided in Table 1, while details of input and final output files are provided in Table 2.

Draft de novo genome assembly

For general quality control of the input raw reads, we conducted a k-mer analysis of the Illumina short-read data (subsequently used for polishing). First, KmerGenie25 was used to estimate the appropriate k-mer size for our sample, and then k-mers were counted using Jellyfish26 to produce a k-mer profile (histogram), which was then plotted by GenomeScope. We selected the GenomeScope27 pipeline for k-mer profiling due to the capabilities of this software to provide overall genome characteristics from raw, short-read DNA sequencing data, without the need of a reference genome. From the resulting k-mer profile produced by GenomeScope, the presence of repeats should be visible as pronounced peaks, while potential presence of sequencing errors and repeat duplicates would distort the appearance of the k-mer histogram, due to increased variances and low frequency k-mers27. The genome assembly pipeline (Fig. 1a) started with an initial draft assembly of PacBio long reads using Flye 2.928, followed by a polishing step (error correction) using the tool POLCA from MaSuRCA29, with short-read data as input. Next, the Arima pipeline (https://github.com/ArimaGenomics/mapping_pipeline) was used to map Hi-C paired-end reads against the assembly, followed by the ALLHiC pipeline for scaffolding of polyploid genomes30. We chose the ALLHiC pipeline because it is designed to prevent Hi-C signals from erroneously linking allelic haplotypes in polyploids (or in species with recent whole-genome duplications such as the crucian carp). With the scaffolded assembly, Juicebox31 was used to visualize Hi-C contact points, as well as to correct visibly misassembled scaffolds. BUSCO (Benchmarking Universal Single-Copy Orthologs32) scores were compared before and after Juicebox curation, and across different levels of curation (minimum, medium, and high), to assess whether manual curation improved the assembly.
QUAST33 was used to obtain length statistics of the draft genome at different steps of the assembly process. The final assembly was also checked with FCS-GX34 to detect potential contamination with genetic material from other organisms.
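To make the idea of a k-mer profile concrete, the following minimal Python sketch counts canonical k-mers in a few toy reads and tabulates how many distinct k-mers occur at each multiplicity, which is the kind of histogram a k-mer counter passes on for plotting. It is an illustration only and not the pipeline used in this study; the function names and toy reads are hypothetical.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_spectrum(reads, k):
    """Map multiplicity -> number of distinct canonical k-mers seen that many times."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


# Hypothetical toy reads, far shorter than real Illumina reads.
toy_reads = ["ACGTACGT", "ACGTACGA"]
spectrum = kmer_spectrum(toy_reads, k=4)
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

In a real data set, the main peak of such a histogram sits near the sequencing depth, and low-multiplicity k-mers are dominated by sequencing errors.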

The structural annotation pipeline (described below) required filtering of the primary assembly. All contigs longer than 3000 bp were kept, plus one contig that was only 2959 bp long but had more than 100 reads mapping (to investigate read support, RNA-seq data were mapped to the draft genome using STAR35, and the samtools36 command ‘idxstats’ was used to extract the number of reads aligning to each scaffold/contig). A total of 262 contigs were kept. After filtering, but before structural annotation was carried out, the assembled scaffolds were reordered by decreasing size and renamed using Funannotate (https://github.com/nextgenusfs/funannotate). Synteny between the largest scaffolds was visualized in Synvisio37, based on intra-genomic collinearity blocks calculated using MCScanX38. This resulted in pairs that were renamed as their corresponding chromosome and sub-genome (A or B), with a total of 50 scaffolds (chromosomes), as expected from previous knowledge of the crucian carp and goldfish39. The remaining scaffolds were named with the prefix “scaffold”. The final subset genome assembly was soft-masked using RepeatModeler2 (https://www.repeatmasker.org/RepeatModeler/)40.
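The filtering step above can be sketched as follows, assuming the tab-separated layout written by ‘samtools idxstats’ (sequence name, sequence length, mapped reads, unmapped reads). The generalised rule "keep if longer than 3000 bp or supported by more than 100 reads" is an assumption made for illustration, as are the function name and the toy table; the text only states that a single short contig was rescued by read support.

```python
def select_contigs(idxstats_text, min_length=3000, min_reads=100):
    """Return names of contigs that pass the length or read-support filter.

    idxstats_text: tab-separated lines of (name, length, mapped, unmapped),
    as written by 'samtools idxstats'.
    """
    kept = []
    for line in idxstats_text.strip().splitlines():
        name, length, mapped, _unmapped = line.split("\t")
        if name == "*":
            # The final '*' row only reports reads without a reference.
            continue
        if int(length) > min_length or int(mapped) > min_reads:
            kept.append(name)
    return kept


# Hypothetical toy table; names and counts are illustrative only.
example = (
    "contig_A\t5000\t10\t0\n"
    "contig_B\t2959\t150\t0\n"
    "contig_C\t1200\t3\t0\n"
    "*\t0\t0\t42\n"
)
print(select_contigs(example))
```

Here contig_B is retained despite its length because its read support exceeds the threshold, mirroring the exception described in the text.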

Structural annotation

The sequence representing the mitochondrial genome was identified using BLAST+41 with an existing crucian carp mitochondrial genome42,43 as the query sequence and the de novo genome assembly as the target database. This search matched one scaffold (renamed ‘scaffold_107_mito’). The mitochondrial genes were annotated on the scaffold using MitoAnnotator from the MitoFish database44,45,46. Transfer RNAs were predicted using tRNAscan-SE47 while ribosomal RNAs were predicted using RNAmmer48 (Fig. 1b).

To use RNA-seq data in the structural annotation of protein-coding genes (Fig. 1b), low-quality reads and adapters were trimmed from the libraries using Trim Galore (https://github.com/FelixKrueger/TrimGalore), whereafter read coverage was normalized using the Trinity pipeline49 script ‘insilico_read_normalization.pl’ (https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Insilico-Normalization) with option ‘--max_cov 30’ to reduce the total number of reads included for annotation, while maximizing information across the genome, including regions with low expression. After coverage normalization, the reads (128.7 million pairs) were mapped to the filtered genome (262 scaffolds) using STAR35, with the following parameters: ‘--twopassMode Basic --outFilterMultimapNmax 1 --outSJfilterReads Unique --outSJfilterCountUniqueMin 6 3 3 3 --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outSAMattributes All’. By using only uniquely mapping reads and increasing the number of alignments needed for splice junctions to be included, we lowered the risk of including spurious gene models in the annotation. In the final alignment map (.bam) used for annotation, 120.8 million read pairs (93.84%) were uniquely mapped and properly paired.
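As an illustration of how the mapping parameters listed above fit into a full invocation, the sketch below assembles a STAR argument list in Python without running it. The alignment options are copied from the text; the index directory, read file names, and thread count are hypothetical placeholders, and the wrapper function is not part of the published pipeline.

```python
import shlex

# Alignment options exactly as listed in the text.
SOURCE_OPTIONS = (
    "--twopassMode Basic --outFilterMultimapNmax 1 --outSJfilterReads Unique "
    "--outSJfilterCountUniqueMin 6 3 3 3 --outSAMtype BAM SortedByCoordinate "
    "--outSAMstrandField intronMotif --outSAMattributes All"
)


def build_star_command(genome_dir, reads_1, reads_2, threads=8):
    """Return the STAR argument list for one paired-end library (not executed)."""
    command = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", reads_1, reads_2,
    ]
    command += shlex.split(SOURCE_OPTIONS)
    return command


# Hypothetical paths, for illustration only.
cmd = build_star_command("star_index", "norm_R1.fq", "norm_R2.fq")
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` on a system where STAR and a genome index are available.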

For the final structural annotation of protein-coding genes (Fig. 1b), we first performed ab initio gene prediction with BRAKER350. Training of the gene detection was performed with protein sequences from ray-finned fishes (OrthoDB v10 Actinopterygii dataset51) combined with proteins predicted from a genome of a farmed crucian carp from the United Kingdom sequenced by the Wellcome Sanger Institute for the Darwin Tree of Life project10. The PASA pipeline52 was used to obtain an updated structural annotation that included annotation of untranslated regions (UTRs) and improved gene-isoform relationships. Exon and transcript lengths were obtained with gFACs53.

Functional annotation of protein-coding genes

For the functional annotation (Fig. 1c), we used AGAT (https://github.com/NBISweden/AGAT) to extract predicted transcript and protein sequences from the final assembly (using the final structural annotation), and then those proteins were searched for in the UniProtKB/Swiss-Prot database54 using BLAST+ and in the InterPro55 database using InterProScan56. The latter included searches against several databases focused on protein motifs. Gene ontology (GO) terms were extracted for genes based on the matching UniProtKB/Swiss-Prot protein entry. In addition, predicted proteins were searched for in the KEGG ortholog database using BlastKOALA57.

Data Records

A list of input and final output data is given in Table 2, including relevant step in the pipeline (Fig. 1), name of the repository where data are available, type of data, and accession information or file name. All sequence data are deposited in the NCBI sequence read archive (SRA) under BioProject number PRJNA111939458. The Whole Genome Shotgun project (i.e. the full genome assembly) has been deposited at GenBank under the accession JBEDAC000000000. The version described in this paper is version JBEDAC01000000059. The subset and soft-masked assemblies, together with structural and functional annotation files, as well as clustered high-quality transcript isoforms, are deposited in DataverseNO60. The files in DataverseNO are organised into six subfolders: 01_genome, 02_structural_annotation, 03_predicted_sequences, 04_functional_annotation, and 05_mitochondrial_genome_annotation.

Table 2 Data record details for input and output files.

Technical Validation

We obtained a genome assembly from a wild-caught Norwegian crucian carp (Fig. 2a), with an estimated length of 1.65 Gbp, predicted by the GenomeScope k-mer plot based on short-read DNA data (Fig. 2b). The k-mer plot showed one main frequency peak at just below 40x coverage, indicating a high level of heterozygosity, with a much smaller secondary peak at 80x coverage. Furthermore, the k-mer plot indicated that most of the reads were included in the assembly. Assembly quality metrics are summarized in the Blobtools snail plot61 (Fig. 2c), and showed a high degree of completeness in terms of BUSCO. The longest scaffold of the genome was 51.1 Mbp (red line), while the shortest scaffold at 50% of the total assembly length (N50) was 31.7 Mbp (dark orange), and the shortest scaffold at 90% of the total assembly length (N90) was 26.8 Mbp (light orange). Among the 290 scaffolded contigs, after manual curation of the draft genome using Hi-C data, 50 scaffolds appeared that were markedly larger than the remaining scaffolds and covered 98.6% of the total length of the assembly. Specifically, when sorted by length the 50th scaffold was 21 Mbp while the 51st scaffold was 2.2 Mbp, and taken together the 50 largest scaffolds can therefore be assumed to correspond to the expected 50 chromosomes of the crucian carp (Fig. 2d). Based on the protein sequences predicted through functional annotation of the genome (see further below), blocks of collinearity could be identified, and showed the expected pairing of the 25 chromosome pairs reflecting the two sub-genomes (Fig. 2e), originating from the whole genome duplication specific to carps.
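For readers unfamiliar with the N50 and N90 statistics quoted above, the following short Python sketch shows one common way to compute Nx (the length of the shortest sequence needed to reach a given fraction of the total assembly length, counting from the longest) together with Lx (how many sequences that takes). The function name and toy lengths are hypothetical and are not values from this assembly.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of sequence lengths.

    Nx: length of the sequence at which the cumulative length, counted from
        the longest sequence downwards, first reaches `fraction` of the total.
    Lx: number of sequences needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, rank
    raise ValueError("lengths must contain at least one positive value")


# Hypothetical toy lengths (arbitrary units).
toy_lengths = [50, 40, 30, 20, 10]
print(nx_lx(toy_lengths, 0.5))  # N50 and L50
print(nx_lx(toy_lengths, 0.9))  # N90 and L90
```

Applied to scaffold lengths this gives scaffold-level metrics; applied to contig lengths it gives the contig-level N50 and L50.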

Fig. 2

Genome of the wild crucian carp from Norway. (a) Wild crucian carp specimen collected in Tjernsrudtjernet pond (Oslo) by our research group during autumn. (b) GenomeScope k-mer spectrum showing the fingerprint of a diploid without contamination. (c) Snail-plot visualization of the crucian carp assembly metrics. (d) Visualisation of chromatin contact points after mapping of Hi-C reads. After Juicebox curation, 50 scaffolds that were significantly larger than the remaining scaffolds emerged, corresponding to the 50 chromosomes. (e) Collinearity analysis of the 50 scaffolds and synteny plotting reveals a pairing of the 50 scaffolds into two sub-genomes, which is expected in the crucian carp genome (collinearity blocks filtered with E value 1e-10 and minimum 7 genes). Note that in this figure, chromosomes named ccar-ua1 to ccar-ua25 in the assembly and annotation files are referred to as wc1 to wc25, while ccar-ub1 to ccar-ub25 are referred to as wc26 to wc50 (due to requirements of MCScanX and Synvisio, which were used for plotting).
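The renaming between assembly names and plot labels described in the caption can be expressed as a small helper. This is a sketch under the assumption that names follow exactly the patterns given in the caption (ccar-ua1 to ccar-ua25 and ccar-ub1 to ccar-ub25); the function name is hypothetical.

```python
import re

# Pattern for assembly chromosome names as given in the caption.
_NAME = re.compile(r"^ccar-u([ab])(\d+)$")


def to_plot_name(assembly_name, per_subgenome=25):
    """Convert e.g. 'ccar-ua3' -> 'wc3' and 'ccar-ub3' -> 'wc28'."""
    match = _NAME.match(assembly_name)
    if match is None:
        raise ValueError(f"unexpected chromosome name: {assembly_name}")
    subgenome, number = match.group(1), int(match.group(2))
    if not 1 <= number <= per_subgenome:
        raise ValueError(f"chromosome number out of range: {assembly_name}")
    # Sub-genome B labels continue after the last sub-genome A label.
    offset = 0 if subgenome == "a" else per_subgenome
    return f"wc{number + offset}"


for name in ("ccar-ua1", "ccar-ua25", "ccar-ub1", "ccar-ub25"):
    print(name, "->", to_plot_name(name))
```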

In addition to the successful assembly of near-complete chromosomes, the mitochondrial genome was identified among the contigs assembled by Flye (i.e. one contig with no gaps). The length of the sequence (16 603 bp) was similar to the expected size of mitochondrial genomes, and the expected number and identity of protein-coding and non-coding genes were annotated (Fig. 3).

Fig. 3

Crucian carp mitochondrial genome. The contig representing the mitochondrial genome was identified by running blastn (BLAST+) of an available crucian carp mitochondrial genome against a database of the sequences in the present genome assembly. Plot created with MitoFish.

The structural annotation pipeline using BRAKER3 and PASA resulted in a total of 82,557 protein-coding transcripts contained in 45,667 genes (Table 3). The number of transcripts went up from 63,098 before the PASA step, indicating that the PASA pipeline using IsoSeq full-length transcripts helped significantly to resolve gene-isoform relationships and likely also recovered splice variants not detected, or discarded, by BRAKER3. We also compared the final structural annotation with an earlier version obtained using the previous version of BRAKER (consisting of running BRAKER162 and BRAKER263 separately, then merging them with TSEBRA64, followed by PASA). This earlier approach resulted in a larger total number of genes, of which a large proportion were mono-exonic (Table 3). The exon length (Fig. 4a,b) and total transcript length (Fig. 4c,d) were also improved with the final annotation, compared to both BRAKER3 alone and the previous version of BRAKER. The most notable effect of refining transcripts using PASA was on the length of multi-exonic transcripts, which almost doubled, likely due to the addition of UTRs and inclusion of some exons previously annotated as separate, mono-exonic genes. Overall, the PASA annotation was considered a worthwhile improvement of the annotation quality.

Table 3 Annotation metrics.
Fig. 4

Exon and transcript lengths after structural annotation. The graphs show violin plots (a,c) and length density distributions (b,d) for exon lengths (a,b) and transcript lengths (c,d) after structural annotation using three different methods. The annotations being compared are ‘old_BRAKER’ in green (BRAKER1 and BRAKER2 merged by TSEBRA, and followed by refinement by PASA), ‘BRAKER3’ in blue (output from BRAKER3 pipeline alone), and ‘PASA’ (BRAKER3 followed by PASA).

Recently, a chromosome-level genome assembly generated using PacBio HiFi data from a farmed crucian carp from the UK was released by the Darwin Tree-of-Life initiative (https://portal.darwintreeoflife.org/)10, and was therefore compared in more detail to the genome assembly of the present study. Snailplots (Fig. 2c vs. 5a) indicated that scaffold-level length metrics were only marginally better for the HiFi assembly. A dot-plot made with D-GENIES65 (Fig. 5b) revealed high levels of sequence identity, an equal number of chromosomes, and similar sizes of scaffolds between the assemblies. Collinearity analysis38 and visualization of synteny between the assemblies37 (Fig. 5c) also showed the expected pairing of chromosomes within the two sub-genomes. These comparisons also indicate that there could be some structural differences between the assemblies (e.g. translocations), which is expected due to the variation that exists between the methods used to obtain sequencing data and the assembly pipelines, but also the likely biological differences between the source populations of the specimens, where the wild-type crucian carp population is known to be exposed to seasonal anoxia, which is unlikely to be the case for the farmed crucian carp. While the assemblies were similar in many aspects, the contiguity of our assembly was substantially better when compared across a number of different contig-level metrics (Table 4): the contig-level N50 was 15 Mbp for our CLR-Flye assembly, compared to 3.8 Mbp for the HiFi-Hifiasm assembly, and the contig L50 in our genome was 40, while it was 135 for the HiFi-Hifiasm assembly. A better contiguity may explain why the present assembly, despite the slightly shorter total length and scaffold-level N50, still obtained a higher BUSCO score and annotated a larger number of protein-coding genes (Table 4).
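Contig-level metrics such as those compared above are obtained by breaking each scaffold at its gaps, which are stored as runs of N in the sequence. The sketch below shows one way to do this in Python; the minimum gap length of 10 N is an assumption chosen for illustration and is not taken from this study, and the function name and toy sequence are hypothetical.

```python
import re


def split_into_contigs(scaffold, min_gap=10):
    """Split a scaffold sequence into contigs at runs of at least `min_gap` N."""
    gap_pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    # Drop empty pieces that arise when a scaffold starts or ends with a gap.
    return [piece for piece in gap_pattern.split(scaffold) if piece]


# Hypothetical toy scaffold: one real gap (10 N) and one short run (3 N).
toy_scaffold = "ACGT" + "N" * 10 + "GGCC" + "N" * 3 + "TT"
contigs = split_into_contigs(toy_scaffold)
print([len(contig) for contig in contigs])
```

Short runs of N below the threshold stay inside a contig, so the choice of threshold affects the reported contig N50 and should be kept identical when comparing assemblies.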

Fig. 5

Comparison of genome assemblies from farmed (UK) and wild (NO) crucian carp. (a) Blobtools snailplot summary of the farmed crucian carp genome10. This assembly was built using PacBio HiFi reads, and it shows that the genome we have obtained using PacBio long-read sequencing with short-read error correction has a similarly high quality (shown in Fig. 2c). (b) D-GENIES dotplot of the 50 chromosomes of farmed crucian carp compared to the crucian carp genome presented in this study. The plot indicates, as expected, a high degree of similarity and continuity but also some chromosomes with possible structural differences. (c) Synvisio synteny plot (collinearity blocks filtered with E value 1e-10 and minimum 7 genes) of similarity between the crucian carp from the present study (chromosome names wc01 to wc25 for subgenome A, and wc26 to wc50 for subgenome B) and the farmed crucian carp (chromosome names fc01 to fc25 for subgenome A, and fc26 to fc50 for subgenome B).

Table 4 Assembly metrics with comparison to other cyprinid genomes.

Chromosome-level genome assemblies are also available for the related species goldfish, silver crucian carp and common carp5,6,66, and the present genome of crucian carp is similar in terms of overall size, number of chromosomes and GC content to these genomes, but importantly is much less fragmented (Table 4), with a contig-level N50 of 15 Mbp, which is 3- to 18-fold longer than the other assemblies. This is particularly visible when inspecting the cumulative length of contigs (Fig. 6). Here, it can be seen that the wild-type, farmed and silver crucian carp genomes perform best at the scaffold level, with common carp following closely behind. At the contig level, the present crucian carp genome is markedly above the other assemblies. It is also noted that the reference genome for goldfish (i.e. labelled as reference genome in NCBI and available at ensembl.org) appears to be of lower quality and has 59 chromosomes, which is not the expected number based on the evolutionary history and relatedness to common carp and crucian carp, which both have the expected 50 (twice as many as zebrafish, Danio rerio). Furthermore, while the reference genome for goldfish appears to be longer (1.8 Gb) than both crucian and common carp, the scaffold L90 is very large, and not close to the number of chromosomes as is the case for the common carp and both crucian carp genomes.

Fig. 6

Cumulative length of contigs (Nx). Data are shown for the scaffolded assemblies as well as for the ‘broken’ assemblies (contig level). Plot generated using QUAST optimised for large genomes with option “--large”. C. carassius (w) is the genome presented in this paper.

Taken together, these results show that our sequencing efforts have resulted in a high-quality chromosome-level reference genome for the wild-type crucian carp. Considering the additional data used for structural annotation, specifically the full-length transcripts from multiple tissues and mRNA sequencing from both a variety of tissues and anoxia treatments, we are confident that our genome assembly is representative of the wild, anoxia-tolerant crucian carp and represents a significant resource for future studies regarding the evolution of mechanisms involved in anoxia survival.