Background & Summary

The Patagonian toothfish (Dissostichus eleginoides), also known as the Chilean sea bass, belongs to the family Nototheniidae and the suborder Notothenioidei. It is a large, slow-growing fish species typically found in cold, deep waters in the Southern Hemisphere, particularly around South America and the sub-Antarctic islands1,2, growing up to 2 meters in length, brown-gray in color and living up to 50 years, with unique anti-freeze glycoproteins in its blood that allow it to survive freezing temperatures. There is one of only two species in the genus Dissostichus, the other being the Antarctic toothfish with which it shares morphological similarities3,4. Despite their similarities, however, these two species have distinct genetic and ecological traits, such as the presence of the antifreeze glycoprotein (AFGP) genes in the Antarctic toothfish but not in the Patagonian toothfish, as well as separate habitats. We confirmed the presence of the AFGP gene based on exact locus in the genome sequence. The Patagonian toothfish is a commercially important species, often caught by longline fishing vessels that target high-value species in the region. Because of its delicate, sweet flavour and firm texture, which make it a delicacy in many countries, overfishing has depleted the population to the extent that it is considered a threatened species5,6. By comparing the genomes of these two species, we can understand the genetic factors that contribute to their differences and use this knowledge to inform more effective conservation and management strategies7. The analysis of this study was performed using similar methods utilized in our previous study8. We reveal chromosome-level genome assembly and gene annotation information that can be used as a basis for a genome study of the Patagonian toothfish. Previous study had not been able to analyze genome assembly at the chromosome level9. In this study, we obtained 24 unambiguous chromosome sequences by complete Hi-C analysis and confirmed the homogeneity of the chromosome level by comparison with the ecologically closest species, D. mawsoni10. This will provide a valuable resource for future genomic studies of Antarctic and sub-Antarctic fish species, particularly when comparing the Antarctic and Patagonian toothfish. This data can be used to inform the management and conservation of these ecologically and economically important species and will contribute to our understanding of the genetic and physiological mechanisms that enable animals to survive in extreme environments. The findings of this study highlight the importance of understanding the genomic and ecological differences between closely related species and the potential benefits of studying the genomes of Antarctic fish species beyond fisheries management.

Methods

Sample preparation and sequencing

Adult Patagonian toothfish were obtained from commercial fisheries (captured at −56.91° −70.69°), and we had extracted genomic DNA from muscle from a single individual. For short-read and long-read genome sequencing, genomic DNA was extracted following the manufacturer’s protocol using the MagAttract HMW DNA Kit (Qiagen, catalog no.67563). Library preparation was performed using the Illumina Truseq Nano DNA Library prep kit for short-read genome sequencing. Approximately 350 bp fragments of high-molecular-weight genomic DNA were generated by random shearing using the Covaris S2 system (Covaris Inc., Woburn, MA, USA). The resulting DNA fragments were end-repaired and ligated to Illumina-specific adaptor. Indexed libraries were pooled in equimolar concentrations. Whole-genome sequencing (WGS) was carried out using an Illumina NovaSeq 6000 system (Illumina Inc., San Diego, CA, USA) with a paired-end of 2 × 150 bp reads. For the construction of the PacBio library, 20Kb fragments were generated by shearing genomic DNA using the Covaris G-tube, following the manufacturer’s recommended protocol for long-read genome sequencing. A total of 5 μg of DNA from a sample was used as input for the library preparation. The SMRTbell library was constructed using SMRTbell™ Template Prep Kit 1.0. After annealing the sequencing primer to the SMRTbell template, DNA polymerase was bound to the complex using Sequel Binding Kit 2.0. Excess unbound polymerase molecules and small DNA inserts were removed during a purification step following polymerase binding. The polymerase-SMRTbell-adapter complex was then loaded into zero-mode waveguides (ZMWs). Finally, the SMRTbell library was sequenced using nine Sequel™ SMRT® Cell 1 M v2 and the Sequel Sequencing Kit 2.1. A 600-minute movie was captured for each SMRT cell using the Sequel System (Pacific Biosciences, Menlo Park, CA, USA)10. The Hi-C library was constructed according to the manufacturer protocol to generate pseudo-chromosomes. The library was prepared using the Dovetail™ Hi-C Library Preparation Kit (Dovetail Genomics, Santa Cruz, CA, USA). Nuclear chromatin from the muscle tissue was cross-linked with formaldehyde and extracted. The fixed chromatin was digested with DpnII, and the sticky ends were filled in with biotinylated nucleotides, followed by ligation. The cross-links were then reversed, and purified DNA was treated to remove any unbound biotin from the ligated fragments. The DNA was subsequently sheared to an average size of ~350 bp using Covaris S2 System (Covaris Inc., Woburn, MA, USA), and biotinylated fragments were enriched through streptavidin bead pull-down. Finally, index PCR was performed to generate the library, which was then sequenced using Illumina NovaSeq platform8. RNA was extracted from the muscle tissue of the same individual for genomic DNA extraction. A total of 1 μg of RNA was used as input for cDNA synthesis using the SMARTer PCR cDNA Synthesis Kit (Clontech, Catalog No. 634925). Although 1~5 μg of pooled cDNA was required for library construction, the SMRTbell library was prepared using SMRTbell™ Template Prep Kit 1.0-SPv3 and sequenced on three SMRT cells per library using the Sequel Sequencing Kit 2.0. Sequencing condition included a 600-minute movies run time and a 240- minute pre-extension time on the Sequel System (Pacific Biosciences, Menlo Park, CA, USA). After sequencing, high-quality Iso-Seq reads were extracted using SMRTLink v.8.011 (Table 1).

Table 1 Sequencing data generated for short-read, long-read and Hi-C data of the Patagonian toothfish.

Chromosome-level genome assembly with long-read sequences and Hi-C

The draft de novo assembly was constructed using the FALCON-Unzip assembler12 with filtered subread sequences. The length cut-off (length_cutoff = 23800) was specified based on the subreads’ N50 value of 23.8 Kb. The primary assembly contigs were generated through phased diploid assembly using an unzipping approach, followed by polishing with the Arrow consensus algorithm. To improve genome assembly quality, we corrected errors using WGS reads aligned with Pilon v.1.2.2313, based on the haplotig-merged primary contigs and fixed error bases. The Hi-C raw sequence data were aligned to the draft assembly using BWA-MEM14, and Juicer v.1.5.715 was used to generate Hi-C contact matrices after duplicate removal from the linking data. The alignment was then processed with 3D-DNA16 for chromosome-level scaffolding, and inter-contig linkage information was used to arrange, merge and classify contigs into chromosomes based on linkage density. Initial scaffolding results were manually reviewed using Juicebox to correct any mis-joined and unplaced contigs17. Ultimately, 3D-DNA was used to generate a high-quality, chromosome-level genome assembly. The draft genome assembly of 1,224 contigs, N50 of 4.3 Mega base-pair (Mbp), and a total length of 842 Mbp, constructed using 90x long-read data and corrected errors with 50x short-read data. The Hi-C analysis allowed the draft genome assembly to be upgraded to chromosome-level genome assembly within 24 chromosomal sequences. The longest contig length was 42.8 Mbp, with a contig N50 of 36 Mbp, and the total number of contigs decreased to 495. Additionally, 24 scaffolds longer than 10 Mbp were identified, consistent with known karyotype of Patagonian toothfish chromosomes (2n = 48)18. The N50 scaffold identified in the previous study9 is 3.5 Mbp, which is 10% of N50 contig length of 36 Mbp in this study, and the number of contigs is 447, but only 11 contigs are longer than 10 Mbp, which is not enough to complete the genome (Table 2A). The Hi-C scaffolds were validated using the Hi-C contact map (Fig. 1a,b). The coverage of 24 chromosomal sequences was 95.14% and the total size of the unplaced scaffolds was 41.05 Mbp (Table 3).

Table 2 Summary of the Patagonian toothfish genome assembly and comparison of previous assembly.
Fig. 1
figure 1

Summary of the final genome assembly results. (a) Contact map plot of the Patagonian toothfish genome. The Hi-C raw read pairs were aligned with the genome sequences; the x and y axes indicate their positions; the red dots indicate the position of the read pairs, and a high density of red dots denotes that they are located on the same chromosome. (b) Overview of the Patagonian toothfish genome. The features are arranged in chromosomes [a], gene density [b], DNA element density [c], LINE density [d], SINE density [e], GC contents [f], and GC skew [g] at 1-Mbp intervals across the 24 chromosomes. (c) Phylogenetic analysis with 17 species, including D. eleginoides. The number of expanded (red) and contracted (blue) gene families are indicated at each node and end, and the length of each node is indicated in million years old (MYA). Species of the Notothenioidei suborder, comprising Antarctic, sub-Antarctic, and non-Antarctic species, were binned into color zones. A divergence time of the Patagonian toothfish was indicated red dot and label.

Table 3 Summary of chromosome sequences of the final assembly.

Repeat analysis

A de novo repeat library was constructed using RepeatModeler v.1.0.319, which incorporated RECON and RepeatScout v.1.0.520 with default parameters. Additionally, Tandem Repeats Finder21 was utilized to predict consensus sequences and generate classification data for each repeat elements. All repeats identified by RepeatModeler were further analyzed against the UniProt/SwissProt database22. To accurately identify long terminal repeat retrotransposons (LTR-RTs), an LTR library was constructed using LTR_retriever23 by integated raw LTR data from both LTRharvest24 and LTR_FINDER25. Repetitive elements were subsequently annotated using RepeatMasker v.4.0.9 with the combined repeat library generated from RepeatModeler and LTR-retriever. The repetitive elements in the chromosome-level genome were analyzed, revealing that they constitute 39.08% of the entire genome. The distribution of repetitive elements consisted of 12.25% DNA transposons, 5.09% long interspersed nuclear elements (LINEs), 0.49% short interspersed nuclear elements (SINEs), 9.96% long terminal repeats (LTRs), and 9.72% other repetitive sequences. (Table 4).

Table 4 Summary of annotated transposable elements of the Patagonian toothfish.

Gene prediction and annotation

Gene prediction was performed using EVidenceModeler (EVM) v.1.1.126, which integrates the results of multiple gene predictions. Repeat-masked genomes were used for ab initio gene prediction using GeneMark-ES v.4.6827 and Augustus v.3.4.028. Next, the hints for protein and ab initio predictions were extracted using all protein sequences from Actinopterygii, a clade of bony fishes, in the UniProt/SwissProt protein database29 using ProtHint v.2.6.030. The hints were used to perform protein predictions using GeneMark-EP + v.4.6830 and for ab initio predictions using Augustus. To obtain transcriptome-level evidence, PASA pipeline v.2.3.331 was used with Iso-Seq data. EVM was used to integrate the ab initio, transcriptome, and protein prediction results to obtain the final gene prediction with the weights (ABINITIO_PREDICTION = 1, PROTEIN = 50, and TRANSCRIPT = 50). Finally, the PASA pipeline with Iso-Seq data was used to predict changes in exons by the addition of untranslated regions (UTRs). Genome Annotation Generator v.2.0.132 was used to add start/stop codon data and generate a well-formed gff file. Other noncoding RNAs and putative tRNA genes were identified using Barrnap v.0.9 and tRNAscan-SE v.2.0.533, respectively. The predicted genes were annotated by aligning them with the NCBI non-redundant protein (nr) database34 using NCBI BLAST v.2.9.035 with a maximum e-value of 1e−5. To obtain protein domain information, InterProScan v.5.44.7936 was used along with protein sequences translated from a transcripts. Additionally, comprehensive annotation of transcriptome sequences was performed using Trinotate37, while the Kyoto Encyclopedia of Genes and Genomes (KEGG) was employed with decoded peptide sequences using TransDecoder v.5.5. Protein signal peptide prediction was carried out using SignalP v.5.038, and transmembrane domain prediction was conducted using TMHMM v2.039. Finally, Gene Ontology (GO) terms22 were assigned to the genes using the BLAST2GO pipeline v.4.040. A total of 32,224 genes and 32,471 coding sequences (CDSs) were analysed in the Patagonian toothfish genome. The average length of CDSs was 1,287 bp, and the average number of exons per gene was 8.01 (Table 3B). A total CDSs were annotated from a minimum of 10.10% to a maximum of 76.90% in seven databases for functional annotation. In one or more databases, 87.84% of CDSs were annotated (Table 5).

Table 5 Summary of functional annotations of the Patagonian toothfish genome.

Phylogenetic analysis

Orthologous gene clusters were classified within the genomes of 17 species, including D. eleginoides (Table 6) using OrthoMCL (OrthoMCL-DB: Ortholog Groups of Protein Sequences)41. The protein sequence was extracted from longest isoforms of each species. To construct a phylogenetic tree, we performed orthologous gene analysis using Orthofinder42 with an e-value cut-off 1e-5 and an all-to-all BLASTP analysis of 17 species. MAFFT v.6.861b43 was used to align each gene family, and the phylogenetic tree was inferred with FastTree v.2.1.1020, with divergence time calibration performed using both PATHd844 and TimeTree45. Finally, CAFE v.4.2.146 was used to predict the likelihood of gene family expansion and contraction with P < 0.01 and automatic searching for the λ value with default parameters load -p 0.01; lambda -s -t and time tree information. Phylogenetic analysis was performed using single-copy ortholog genes, and 17 species, including the Patagonian toothfish, were found to have branched from the most recent common ancestor (MRCA) with 24,069 orthogroups. The 17 species used in the analysis belong to the Actinopterygii class, and the Notothenioidei suborder diverged about 73 million years ago. In this suborder, the Antarctic and sub-Antarctic fishes diverged about 61 million years ago, while the Antarctic toothfish and Patagonian toothfish, which have the most morphological similarity, diverged about 15 million years ago, at roughly the same time. However, the difference in the number of expanded and contracted gene families is 871 and 3,414, respectively, showing that there are ecological differences between the Antarctic toothfish and Patagonian toothfish (Fig. 1c).

Table 6 Genome assemblies used for phylogenetic analysis.

Genomic comparative analysis

To compare genome assembly sequences at the chromosome level, nucmer in the MUMmer software package v.4.02b47 was used with the parameters -c 1000 -l 1000 and add–mum for unique matching and avoiding repeat regions. Long sequences corresponding to chromosomes were extracted and compared for a clear chromosome comparison; any unordered contig or scaffold sequences were excluded. Circos48 is a useful tool for comparing genome sequences based on homogeneous coordinates. We converted the coordinate data obtained through nucmer into a readable format in Circos. The results of chromosome comparison between two genomes were diagrammed using Circos. We performed a genomic comparative analysis to determine the genetic similarities between the Patagonian toothfish49 and the Antarctic toothfish50 and found that all chromosome regions of both species matched identically without any chromosome segregation (Fig. 2a). The regions containing AFGP and trypsinogen genes were extracted from the whole-genome sequence using NCBI BLAST11 v.2.9.035 against transcript and protein sequences of the Antarctic toothfish51. AFGP genes evolved from trypsinogen genes in Antarctic fish52. The prediction of gene features of AFGP genes cannot be accomplished using automated prediction methods, because the AFGP gene sequence has a high incidence of tandem repeats53. Instead, we developed a customized process to predict complete AFGP gene features and analysed exons and CDSs of AFGP and trypsinogen genes. The AFGP–trypsinogen locus was located between genes encoding transmembrane protein 145 (tmem145) – mitochondrial 39S ribosomal protein L17 (mrpl17) and Cbl proto-oncogene, E3 ubiquitin-protein ligase (cbl) – BCL3 transcription coactivator (bcl), as reported in a previous study54. Using previously published Antarctic toothfish genome data10, we constructed a haplotype-resolved genome and annotated multiple trypsinogens and AFGP genes using the AFGP gene prediction method. However, only seven copies of the trypsinogen genes and one copy of the trypsinogen-like protease gene were predicted at the exon/CDS level in the Patagonian toothfish. This result confirmed the largest genetic difference between the Antarctic toothfish and the Patagonian toothfish (Fig. 2b).

Fig. 2
figure 2

(a) Genomic comparative between D.eleginoides and D. mawsoni. The regions of high similarity between the genomic segments in D.eleginoides and B are plotted as color-coded lines for each chromosome. D.eleginoides chromosomes are located at the top and D. mawsoni chromosomes at the bottom. (b) The region of AFGP genes family. Gene locations within the AFGP gene family region of two haplotypes of D. mawsoni and D.eleginoides were compared. Color-coded according to gene type, and the orientation of the arrow reflected the orientation of the gene. The position of each gene on the genome was shown on the same scale. Trypsinogen and AFGP gene regions were color-coded.

Data Records

The genome assembly, annotation data and raw data of the Patagonian toothfish have been deposited in NCBI under the accession number GCA_031216635.149 and SRP52497111.

Technical Validation

Quality control of nucleic acids and libraries

The extracted DNA quality and quantity were estimated using Qubit 2.0 fluorometer (Invitrogen, Life Technologies, Carlsbad, CA, USA) and Fragment Analyzer (Agilent Technologies, Santa Clara, CA, USA). The genomic DNA above size 20 Kb fragments were used to construct the SMRTbell library. Size of the Hi-C fragments were sheared to a size of ~350 bp. The quality and quantity of RNA were assessed using 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA) and Qubit 2.0 fluorometer (Invitrogen, Life Technologies, CA, USA). The value of the RNA integrity number (RIN) was 8.8 and the Iso-seq average library size was ~2,800 bp.

Genome assembly and annotation evaluation

Benchmarking Universal Single-Copy Orthologs (BUSCO) v.5.4.455 with default parameters and the Actinopterygii lineage dataset (3,640 single-copy orthologs; OrthoDB v.10) were used to assess the completeness of the genome assembly. The Actinopterygii dataset can be universally applied to teleosts, such as the Patagonian toothfish. The completed BUSCO was identified of 3,563 (97.9%). Of these, 3,450 (94.8%) were single-copy BUSCO and 113 (3.1%) were duplicates. The numbers of partially matched and missing BUSCO were 48 (1.3%) and 215 (5.9%), respectively. Additionally, BUSCO was used in transcriptome mode with CDSs to confirm the gene prediction results. The percentage of complete BUSCO was 84.0% while missing BUSCO comprised 10.2% (Table 7). The k-mer completeness and quality value (QV) were evaluated by Merqury v1.356. Merqury analysis were QV of 32.89 and completeness of 91.77.

Table 7 Assessment of the Patagonian toothfish final assembly and transcriptome using BUSCO.