Abstract
The Patagonian toothfish (Dissostichus eleginoides) belongs to the Actinopterygii class, and the suborder Notothenioidei, which lives in cold waters in the Southern Hemisphere. We performed assembly and annotation, and we integrated the Illumina short-read sequencing for polishinng, PacBio long-read sequencing for contig-level assembly, and Hi-C sequencing technology to obtain high-quality of chromosome-level genome assembly. The final assembly analysis resulted in a total of 495 scaffolds, a genome size of 844.7 Mbp and an N50 length of 36 Mbp. Among these data, we confirmed 24 scaffolds exceeded 10 Mbp and classified as chromosome-level. The completeness of BUSCO rate was over 97%. A total gene set of 32,224 was identified. Furthermore, we analyzed the presence of AFGP genes, classified into Antarctic and sub-Antarctic categories through phylogenetic analysis. This study provides a useful resource for the genomic analysis of Patagonian toothfish and genetic insights into the comparison with Antarctic fishes.
Similar content being viewed by others
Background & Summary
The Patagonian toothfish (Dissostichus eleginoides), also known as the Chilean sea bass, belongs to the family Nototheniidae and the suborder Notothenioidei. It is a large, slow-growing fish species typically found in cold, deep waters in the Southern Hemisphere, particularly around South America and the sub-Antarctic islands1,2, growing up to 2 meters in length, brown-gray in color and living up to 50 years, with unique anti-freeze glycoproteins in its blood that allow it to survive freezing temperatures. There is one of only two species in the genus Dissostichus, the other being the Antarctic toothfish with which it shares morphological similarities3,4. Despite their similarities, however, these two species have distinct genetic and ecological traits, such as the presence of the antifreeze glycoprotein (AFGP) genes in the Antarctic toothfish but not in the Patagonian toothfish, as well as separate habitats. We confirmed the presence of the AFGP gene based on exact locus in the genome sequence. The Patagonian toothfish is a commercially important species, often caught by longline fishing vessels that target high-value species in the region. Because of its delicate, sweet flavour and firm texture, which make it a delicacy in many countries, overfishing has depleted the population to the extent that it is considered a threatened species5,6. By comparing the genomes of these two species, we can understand the genetic factors that contribute to their differences and use this knowledge to inform more effective conservation and management strategies7. The analysis of this study was performed using similar methods utilized in our previous study8. We reveal chromosome-level genome assembly and gene annotation information that can be used as a basis for a genome study of the Patagonian toothfish. Previous study had not been able to analyze genome assembly at the chromosome level9. In this study, we obtained 24 unambiguous chromosome sequences by complete Hi-C analysis and confirmed the homogeneity of the chromosome level by comparison with the ecologically closest species, D. mawsoni10. This will provide a valuable resource for future genomic studies of Antarctic and sub-Antarctic fish species, particularly when comparing the Antarctic and Patagonian toothfish. This data can be used to inform the management and conservation of these ecologically and economically important species and will contribute to our understanding of the genetic and physiological mechanisms that enable animals to survive in extreme environments. The findings of this study highlight the importance of understanding the genomic and ecological differences between closely related species and the potential benefits of studying the genomes of Antarctic fish species beyond fisheries management.
Methods
Sample preparation and sequencing
Adult Patagonian toothfish were obtained from commercial fisheries (captured at −56.91° −70.69°), and we had extracted genomic DNA from muscle from a single individual. For short-read and long-read genome sequencing, genomic DNA was extracted following the manufacturer’s protocol using the MagAttract HMW DNA Kit (Qiagen, catalog no.67563). Library preparation was performed using the Illumina Truseq Nano DNA Library prep kit for short-read genome sequencing. Approximately 350 bp fragments of high-molecular-weight genomic DNA were generated by random shearing using the Covaris S2 system (Covaris Inc., Woburn, MA, USA). The resulting DNA fragments were end-repaired and ligated to Illumina-specific adaptor. Indexed libraries were pooled in equimolar concentrations. Whole-genome sequencing (WGS) was carried out using an Illumina NovaSeq 6000 system (Illumina Inc., San Diego, CA, USA) with a paired-end of 2 × 150 bp reads. For the construction of the PacBio library, 20Kb fragments were generated by shearing genomic DNA using the Covaris G-tube, following the manufacturer’s recommended protocol for long-read genome sequencing. A total of 5 μg of DNA from a sample was used as input for the library preparation. The SMRTbell library was constructed using SMRTbell™ Template Prep Kit 1.0. After annealing the sequencing primer to the SMRTbell template, DNA polymerase was bound to the complex using Sequel Binding Kit 2.0. Excess unbound polymerase molecules and small DNA inserts were removed during a purification step following polymerase binding. The polymerase-SMRTbell-adapter complex was then loaded into zero-mode waveguides (ZMWs). Finally, the SMRTbell library was sequenced using nine Sequel™ SMRT® Cell 1 M v2 and the Sequel Sequencing Kit 2.1. A 600-minute movie was captured for each SMRT cell using the Sequel System (Pacific Biosciences, Menlo Park, CA, USA)10. The Hi-C library was constructed according to the manufacturer protocol to generate pseudo-chromosomes. The library was prepared using the Dovetail™ Hi-C Library Preparation Kit (Dovetail Genomics, Santa Cruz, CA, USA). Nuclear chromatin from the muscle tissue was cross-linked with formaldehyde and extracted. The fixed chromatin was digested with DpnII, and the sticky ends were filled in with biotinylated nucleotides, followed by ligation. The cross-links were then reversed, and purified DNA was treated to remove any unbound biotin from the ligated fragments. The DNA was subsequently sheared to an average size of ~350 bp using Covaris S2 System (Covaris Inc., Woburn, MA, USA), and biotinylated fragments were enriched through streptavidin bead pull-down. Finally, index PCR was performed to generate the library, which was then sequenced using Illumina NovaSeq platform8. RNA was extracted from the muscle tissue of the same individual for genomic DNA extraction. A total of 1 μg of RNA was used as input for cDNA synthesis using the SMARTer PCR cDNA Synthesis Kit (Clontech, Catalog No. 634925). Although 1~5 μg of pooled cDNA was required for library construction, the SMRTbell library was prepared using SMRTbell™ Template Prep Kit 1.0-SPv3 and sequenced on three SMRT cells per library using the Sequel Sequencing Kit 2.0. Sequencing condition included a 600-minute movies run time and a 240- minute pre-extension time on the Sequel System (Pacific Biosciences, Menlo Park, CA, USA). After sequencing, high-quality Iso-Seq reads were extracted using SMRTLink v.8.011 (Table 1).
Chromosome-level genome assembly with long-read sequences and Hi-C
The draft de novo assembly was constructed using the FALCON-Unzip assembler12 with filtered subread sequences. The length cut-off (length_cutoff = 23800) was specified based on the subreads’ N50 value of 23.8 Kb. The primary assembly contigs were generated through phased diploid assembly using an unzipping approach, followed by polishing with the Arrow consensus algorithm. To improve genome assembly quality, we corrected errors using WGS reads aligned with Pilon v.1.2.2313, based on the haplotig-merged primary contigs and fixed error bases. The Hi-C raw sequence data were aligned to the draft assembly using BWA-MEM14, and Juicer v.1.5.715 was used to generate Hi-C contact matrices after duplicate removal from the linking data. The alignment was then processed with 3D-DNA16 for chromosome-level scaffolding, and inter-contig linkage information was used to arrange, merge and classify contigs into chromosomes based on linkage density. Initial scaffolding results were manually reviewed using Juicebox to correct any mis-joined and unplaced contigs17. Ultimately, 3D-DNA was used to generate a high-quality, chromosome-level genome assembly. The draft genome assembly of 1,224 contigs, N50 of 4.3 Mega base-pair (Mbp), and a total length of 842 Mbp, constructed using 90x long-read data and corrected errors with 50x short-read data. The Hi-C analysis allowed the draft genome assembly to be upgraded to chromosome-level genome assembly within 24 chromosomal sequences. The longest contig length was 42.8 Mbp, with a contig N50 of 36 Mbp, and the total number of contigs decreased to 495. Additionally, 24 scaffolds longer than 10 Mbp were identified, consistent with known karyotype of Patagonian toothfish chromosomes (2n = 48)18. The N50 scaffold identified in the previous study9 is 3.5 Mbp, which is 10% of N50 contig length of 36 Mbp in this study, and the number of contigs is 447, but only 11 contigs are longer than 10 Mbp, which is not enough to complete the genome (Table 2A). The Hi-C scaffolds were validated using the Hi-C contact map (Fig. 1a,b). The coverage of 24 chromosomal sequences was 95.14% and the total size of the unplaced scaffolds was 41.05 Mbp (Table 3).
Summary of the final genome assembly results. (a) Contact map plot of the Patagonian toothfish genome. The Hi-C raw read pairs were aligned with the genome sequences; the x and y axes indicate their positions; the red dots indicate the position of the read pairs, and a high density of red dots denotes that they are located on the same chromosome. (b) Overview of the Patagonian toothfish genome. The features are arranged in chromosomes [a], gene density [b], DNA element density [c], LINE density [d], SINE density [e], GC contents [f], and GC skew [g] at 1-Mbp intervals across the 24 chromosomes. (c) Phylogenetic analysis with 17 species, including D. eleginoides. The number of expanded (red) and contracted (blue) gene families are indicated at each node and end, and the length of each node is indicated in million years old (MYA). Species of the Notothenioidei suborder, comprising Antarctic, sub-Antarctic, and non-Antarctic species, were binned into color zones. A divergence time of the Patagonian toothfish was indicated red dot and label.
Repeat analysis
A de novo repeat library was constructed using RepeatModeler v.1.0.319, which incorporated RECON and RepeatScout v.1.0.520 with default parameters. Additionally, Tandem Repeats Finder21 was utilized to predict consensus sequences and generate classification data for each repeat elements. All repeats identified by RepeatModeler were further analyzed against the UniProt/SwissProt database22. To accurately identify long terminal repeat retrotransposons (LTR-RTs), an LTR library was constructed using LTR_retriever23 by integated raw LTR data from both LTRharvest24 and LTR_FINDER25. Repetitive elements were subsequently annotated using RepeatMasker v.4.0.9 with the combined repeat library generated from RepeatModeler and LTR-retriever. The repetitive elements in the chromosome-level genome were analyzed, revealing that they constitute 39.08% of the entire genome. The distribution of repetitive elements consisted of 12.25% DNA transposons, 5.09% long interspersed nuclear elements (LINEs), 0.49% short interspersed nuclear elements (SINEs), 9.96% long terminal repeats (LTRs), and 9.72% other repetitive sequences. (Table 4).
Gene prediction and annotation
Gene prediction was performed using EVidenceModeler (EVM) v.1.1.126, which integrates the results of multiple gene predictions. Repeat-masked genomes were used for ab initio gene prediction using GeneMark-ES v.4.6827 and Augustus v.3.4.028. Next, the hints for protein and ab initio predictions were extracted using all protein sequences from Actinopterygii, a clade of bony fishes, in the UniProt/SwissProt protein database29 using ProtHint v.2.6.030. The hints were used to perform protein predictions using GeneMark-EP + v.4.6830 and for ab initio predictions using Augustus. To obtain transcriptome-level evidence, PASA pipeline v.2.3.331 was used with Iso-Seq data. EVM was used to integrate the ab initio, transcriptome, and protein prediction results to obtain the final gene prediction with the weights (ABINITIO_PREDICTION = 1, PROTEIN = 50, and TRANSCRIPT = 50). Finally, the PASA pipeline with Iso-Seq data was used to predict changes in exons by the addition of untranslated regions (UTRs). Genome Annotation Generator v.2.0.132 was used to add start/stop codon data and generate a well-formed gff file. Other noncoding RNAs and putative tRNA genes were identified using Barrnap v.0.9 and tRNAscan-SE v.2.0.533, respectively. The predicted genes were annotated by aligning them with the NCBI non-redundant protein (nr) database34 using NCBI BLAST v.2.9.035 with a maximum e-value of 1e−5. To obtain protein domain information, InterProScan v.5.44.7936 was used along with protein sequences translated from a transcripts. Additionally, comprehensive annotation of transcriptome sequences was performed using Trinotate37, while the Kyoto Encyclopedia of Genes and Genomes (KEGG) was employed with decoded peptide sequences using TransDecoder v.5.5. Protein signal peptide prediction was carried out using SignalP v.5.038, and transmembrane domain prediction was conducted using TMHMM v2.039. Finally, Gene Ontology (GO) terms22 were assigned to the genes using the BLAST2GO pipeline v.4.040. A total of 32,224 genes and 32,471 coding sequences (CDSs) were analysed in the Patagonian toothfish genome. The average length of CDSs was 1,287 bp, and the average number of exons per gene was 8.01 (Table 3B). A total CDSs were annotated from a minimum of 10.10% to a maximum of 76.90% in seven databases for functional annotation. In one or more databases, 87.84% of CDSs were annotated (Table 5).
Phylogenetic analysis
Orthologous gene clusters were classified within the genomes of 17 species, including D. eleginoides (Table 6) using OrthoMCL (OrthoMCL-DB: Ortholog Groups of Protein Sequences)41. The protein sequence was extracted from longest isoforms of each species. To construct a phylogenetic tree, we performed orthologous gene analysis using Orthofinder42 with an e-value cut-off 1e-5 and an all-to-all BLASTP analysis of 17 species. MAFFT v.6.861b43 was used to align each gene family, and the phylogenetic tree was inferred with FastTree v.2.1.1020, with divergence time calibration performed using both PATHd844 and TimeTree45. Finally, CAFE v.4.2.146 was used to predict the likelihood of gene family expansion and contraction with P < 0.01 and automatic searching for the λ value with default parameters load -p 0.01; lambda -s -t and time tree information. Phylogenetic analysis was performed using single-copy ortholog genes, and 17 species, including the Patagonian toothfish, were found to have branched from the most recent common ancestor (MRCA) with 24,069 orthogroups. The 17 species used in the analysis belong to the Actinopterygii class, and the Notothenioidei suborder diverged about 73 million years ago. In this suborder, the Antarctic and sub-Antarctic fishes diverged about 61 million years ago, while the Antarctic toothfish and Patagonian toothfish, which have the most morphological similarity, diverged about 15 million years ago, at roughly the same time. However, the difference in the number of expanded and contracted gene families is 871 and 3,414, respectively, showing that there are ecological differences between the Antarctic toothfish and Patagonian toothfish (Fig. 1c).
Genomic comparative analysis
To compare genome assembly sequences at the chromosome level, nucmer in the MUMmer software package v.4.02b47 was used with the parameters -c 1000 -l 1000 and add–mum for unique matching and avoiding repeat regions. Long sequences corresponding to chromosomes were extracted and compared for a clear chromosome comparison; any unordered contig or scaffold sequences were excluded. Circos48 is a useful tool for comparing genome sequences based on homogeneous coordinates. We converted the coordinate data obtained through nucmer into a readable format in Circos. The results of chromosome comparison between two genomes were diagrammed using Circos. We performed a genomic comparative analysis to determine the genetic similarities between the Patagonian toothfish49 and the Antarctic toothfish50 and found that all chromosome regions of both species matched identically without any chromosome segregation (Fig. 2a). The regions containing AFGP and trypsinogen genes were extracted from the whole-genome sequence using NCBI BLAST11 v.2.9.035 against transcript and protein sequences of the Antarctic toothfish51. AFGP genes evolved from trypsinogen genes in Antarctic fish52. The prediction of gene features of AFGP genes cannot be accomplished using automated prediction methods, because the AFGP gene sequence has a high incidence of tandem repeats53. Instead, we developed a customized process to predict complete AFGP gene features and analysed exons and CDSs of AFGP and trypsinogen genes. The AFGP–trypsinogen locus was located between genes encoding transmembrane protein 145 (tmem145) – mitochondrial 39S ribosomal protein L17 (mrpl17) and Cbl proto-oncogene, E3 ubiquitin-protein ligase (cbl) – BCL3 transcription coactivator (bcl), as reported in a previous study54. Using previously published Antarctic toothfish genome data10, we constructed a haplotype-resolved genome and annotated multiple trypsinogens and AFGP genes using the AFGP gene prediction method. However, only seven copies of the trypsinogen genes and one copy of the trypsinogen-like protease gene were predicted at the exon/CDS level in the Patagonian toothfish. This result confirmed the largest genetic difference between the Antarctic toothfish and the Patagonian toothfish (Fig. 2b).
(a) Genomic comparative between D.eleginoides and D. mawsoni. The regions of high similarity between the genomic segments in D.eleginoides and B are plotted as color-coded lines for each chromosome. D.eleginoides chromosomes are located at the top and D. mawsoni chromosomes at the bottom. (b) The region of AFGP genes family. Gene locations within the AFGP gene family region of two haplotypes of D. mawsoni and D.eleginoides were compared. Color-coded according to gene type, and the orientation of the arrow reflected the orientation of the gene. The position of each gene on the genome was shown on the same scale. Trypsinogen and AFGP gene regions were color-coded.
Technical Validation
Quality control of nucleic acids and libraries
The extracted DNA quality and quantity were estimated using Qubit 2.0 fluorometer (Invitrogen, Life Technologies, Carlsbad, CA, USA) and Fragment Analyzer (Agilent Technologies, Santa Clara, CA, USA). The genomic DNA above size 20 Kb fragments were used to construct the SMRTbell library. Size of the Hi-C fragments were sheared to a size of ~350 bp. The quality and quantity of RNA were assessed using 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA) and Qubit 2.0 fluorometer (Invitrogen, Life Technologies, CA, USA). The value of the RNA integrity number (RIN) was 8.8 and the Iso-seq average library size was ~2,800 bp.
Genome assembly and annotation evaluation
Benchmarking Universal Single-Copy Orthologs (BUSCO) v.5.4.455 with default parameters and the Actinopterygii lineage dataset (3,640 single-copy orthologs; OrthoDB v.10) were used to assess the completeness of the genome assembly. The Actinopterygii dataset can be universally applied to teleosts, such as the Patagonian toothfish. The completed BUSCO was identified of 3,563 (97.9%). Of these, 3,450 (94.8%) were single-copy BUSCO and 113 (3.1%) were duplicates. The numbers of partially matched and missing BUSCO were 48 (1.3%) and 215 (5.9%), respectively. Additionally, BUSCO was used in transcriptome mode with CDSs to confirm the gene prediction results. The percentage of complete BUSCO was 84.0% while missing BUSCO comprised 10.2% (Table 7). The k-mer completeness and quality value (QV) were evaluated by Merqury v1.356. Merqury analysis were QV of 32.89 and completeness of 91.77.
Code availability
The bioinformatic software and pipeline utilized in this study were implemented following the guidelines provided by the developers. Details regarding the versions and parameters of each software are outlined in the Methods section. Unless specified otherwise, default parameters were utilized.
References
Fischer, W. & Hureau, J. C. Southern Ocean: Fishing Areas 48, 58 and 88 (CCAMLR Convention Area). Vol. 1 (Food and agriculture organization of the United nations, 1985).
DeWitt, H., Heemstra, P. & Gon, O. Nototheniidae. Fishes of the southern ocean. JLB Smith Institute of Ichthyology, Grahamstown, 279–331 (1990).
Eastman, J. T. Antarctic fish biology: evolution in a unique environment. (Academic Press, 2013).
Policansky, D. Southernmost Fauna: Antarctic Fish Biology. Evolution in a Unique Environment. Joseph T. Eastman. Illustrations and graphics by Danette Pratt. Photographs by William Winn. Academic Press, San Diego, CA, 1993. xiv, 322 pp., illus. 74.95or£57.;AntarcticFishandFisheries.Karl-HermannKock.CambridgeUniversityPress,NewYork,1992.xvi,359pp.,illus. 1 10 or£ 60. Studies in Polar Research.; History and Atlas of the Fishes of the Antarctic Ocean. Richard Gordon Miller. With contributions by Philip A. Hastings and Josette Gourley. Foresta Institute of Ocean and Mountain Studies, Tucson, AZ, 1993. xx, 792 pp., illus. 95;laminatedcover, 78. Science 264, 1002–1004 (1994).
Clover, C. The end of the line: how overfishing is changing the world and what we eat. (Univ of California Press, 2008).
Brandão, A. & Butterworth, D. S. A proposed management procedure for the toothfish (Dissostichus eleginoides) resource in the Prince Edward Islands vicinity. (2009).
Seung Jae Lee, J. K., Choi, E., Jo, E. & Cho, M. Hyun Park. The Application of Genome Research to Development of Aquaculture. Journal of Marine Life Science 6, 47–57 (2021).
Lee, S. J. et al. A chromosome-level reference genome of the Antarctic blackfin icefish Chaenocephalus aceratus. Scientific Data 10, 657 (2023).
Ryder, D. et al. De novo assembly and annotation of the Patagonian toothfish (Dissostichus eleginoides) genome. BMC genomics 25, 233 (2024).
Lee, S. J. et al. Chromosomal assembly of the Antarctic toothfish (Dissostichus mawsoni) genome using third-generation DNA sequencing and Hi-C technology. Zoological research 42, 124 (2021).
NCBI Sequence Read Archive http://identifiers.org/ncbi/insdc.sra:SRP524971 (2024).
Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nature methods 13, 1050–1054 (2016).
Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PloS one 9, e112963 (2014).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997 (2013).
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell systems 3, 95–98 (2016).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
Dudchenko, O. et al. The Juicebox Assembly Tools module facilitates de novo assembly of mammalian genomes with chromosome-length scaffolds for under $1000. BioRxiv, 254797 (2018).
Ghigliotti, L. et al. The two giant sister species of the Southern Ocean, Dissostichus eleginoides and Dissostichus mawsoni, differ in karyotype and chromosomal pattern of ribosomal RNA genes. Polar Biology 30, 625–634 (2007).
Bao, Z. & Eddy, S. R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome research 12, 1269–1276 (2002).
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21, i351–i358 (2005).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research 27, 573–580 (1999).
Dimmer, E. C. et al. The UniProt-GO annotation database in 2011. Nucleic acids research 40, D565–D570 (2012).
Ou, S. & Jiang, N. LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant physiology 176, 1410–1422 (2018).
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC bioinformatics 9, 1–14 (2008).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic acids research 35, W265–W268 (2007).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome biology 9, 1–22 (2008).
Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y. O. & Borodovsky, M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic acids research 33, 6494–6506 (2005).
Stanke, M., Schöffmann, O., Morgenstern, B. & Waack, S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC bioinformatics 7, 1–11 (2006).
Consortium, U. UniProt: a worldwide hub of protein knowledge. Nucleic acids research 47, D506–D515 (2019).
Bruna, T., Lomsadze, A. & Borodovsky, M. GeneMark-EP and-EP+: automatic eukaryotic gene prediction supported by spliced aligned proteins. bioRxiv, 2019.2012. 2031.891218 (2020).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic acids research 31, 5654–5666 (2003).
Geib, S. M. et al. Genome Annotation Generator: a simple tool for generating and correcting WGS annotation tables for NCBI submission. Gigascience 7, giy018 (2018).
Chan, P. P. & Lowe, T. M. tRNAscan-SE: searching for tRNA genes in genomic sequences. (Springer, 2019).
Marchler-Bauer, A. et al. CDD: a Conserved Domain Database for the functional annotation of proteins. Nucleic acids research 39, D225–D229 (2010).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of molecular biology 215, 403–410 (1990).
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
Bryant, D. M. et al. A tissue-mapped axolotl de novo transcriptome enables identification of limb regeneration factors. Cell reports 18, 762–776 (2017).
Almagro Armenteros, J. J. et al. SignalP 5.0 improves signal peptide predictions using deep neural networks. Nature biotechnology 37, 420–423 (2019).
Möller, S., Croning, M. D. & Apweiler, R. Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics 17, 646–653 (2001).
Conesa, A. et al. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21, 3674–3676 (2005).
Li, L., Stoeckert, C. J. & Roos, D. S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome research 13, 2178–2189 (2003).
Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome biology 20, 1–14 (2019).
Katoh, K., Asimenos, G. & Toh, H. Multiple alignment of DNA sequences with MAFFT. Bioinformatics for DNA sequence analysis, 39–64 (2009).
Britton, T., Anderson, C. L., Jacquet, D., Lundqvist, S. & Bremer, K. Estimating divergence times in large phylogenetic trees. Systematic biology 56, 741–752 (2007).
Kumar, S., Stecher, G., Suleski, M. & Hedges, S. B. TimeTree: a resource for timelines, timetrees, and divergence times. Molecular biology and evolution 34, 1812–1819 (2017).
Han, M. V., Thomas, G. W., Lugo-Martinez, J. & Hahn, M. W. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3. Molecular biology and evolution 30, 1987–1997 (2013).
Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome biology 5, 1–9 (2004).
Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome research 19, 1639–1645 (2009).
Park, H. Genebank https://identifiers.org/insdc.gca:GCA_031216635.1 (2023).
Jae Lee, S., et al Genebank https://identifiers.org/insdc.gca:GCA_011823955.1 (2021).
Nicodemus-Johnson, J., Silic, S., Ghigliotti, L., Pisano, E. & Cheng, C.-H. C. Assembly of the antifreeze glycoprotein/trypsinogen-like protease genomic locus in the Antarctic toothfish Dissostichus mawsoni (Norman). Genomics 98, 194–201 (2011).
Chen, L., DeVries, A. L. & Cheng, C.-H. C. Evolution of antifreeze glycoprotein gene from a trypsinogen gene in Antarctic notothenioid fish. Proceedings of the National Academy of Sciences 94, 3811–3816 (1997).
Chen, L., DeVries, A. L. & Cheng, C.-H. C. Convergent evolution of antifreeze glycoproteins in Antarctic notothenioid fish and Arctic cod. Proceedings of the National Academy of Sciences 94, 3817–3822 (1997).
Kim, B. M. et al. Antarctic blackfin icefish genome reveals adaptations to extreme environments. Nat Ecol Evol 3, 469–478, https://doi.org/10.1038/s41559-019-0812-7 (2019).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome biology 21, 1–27 (2020).
Acknowledgements
This work was supported by Korea Institute of Marine Science & Technology Promotion(KIMST) grant funded by the Ministry of Oceans and Fisheries(KIMST RS-2022-KS221661), the National Institute of Fisheries Science (NIFS; R2024003), and a grant from Korea University.
Author information
Authors and Affiliations
Contributions
H.P. and J.-H.K. conceived the study. S.J.L., M.C., J.K., E.K.C., S. Choi, S. Chung, and J.L. performed genome sequencing and assembly. S.J.L., M.C., J.-H.K. and H.P. wrote the manuscript. All the authors contributed to writing and editing the manuscript, collating the supplementary information, and preparing the figures.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lee, S.J., Cho, M., Kim, J. et al. Chromosome-level genome assembly and annotation of the Patagonian toothfish Dissostichus eleginoides. Sci Data 11, 1240 (2024). https://doi.org/10.1038/s41597-024-04119-w
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41597-024-04119-w