Chromosome-level genome assembly and annotation of the Patagonian toothfish Dissostichus eleginoides

Lee, Seung Jae; Cho, Minjoo; Kim, Jinmu; Choi, Eunkyung; Choi, Soyun; Chung, Sangdeok; Lee, Jaebong; Kim, Jeong-Hoon; Park, Hyun

doi:10.1038/s41597-024-04119-w

Data Descriptor
Open access
Published: 16 November 2024

Seung Jae Lee¹^na1,
Minjoo Cho¹^na1,
Jinmu Kim¹,
Eunkyung Choi¹,
Soyun Choi¹,
Sangdeok Chung²,
Jaebong Lee ORCID: orcid.org/0000-0002-9719-3376²,
Jeong-Hoon Kim³ &
…
Hyun Park ORCID: orcid.org/0000-0002-8055-2010¹

Scientific Data volume 11, Article number: 1240 (2024)

Abstract

The Patagonian toothfish (Dissostichus eleginoides) belongs to the class Actinopterygii and the suborder Notothenioidei, and lives in cold waters of the Southern Hemisphere. We assembled and annotated its genome by integrating PacBio long-read sequencing for contig-level assembly, Illumina short-read sequencing for polishing, and Hi-C sequencing for scaffolding, to obtain a high-quality chromosome-level genome assembly. The final assembly comprised 495 scaffolds, with a genome size of 844.7 Mbp and an N50 length of 36 Mbp. Of these, 24 scaffolds exceeded 10 Mbp and were classified as chromosome-level. BUSCO completeness was over 97%. A total gene set of 32,224 was identified. Furthermore, we analyzed the presence of AFGP genes, classified into Antarctic and sub-Antarctic categories through phylogenetic analysis. This study provides a useful resource for the genomic analysis of Patagonian toothfish and genetic insights for comparison with Antarctic fishes.

Background & Summary

The Patagonian toothfish (Dissostichus eleginoides), also known as the Chilean sea bass, belongs to the family Nototheniidae and the suborder Notothenioidei. It is a large, slow-growing fish species typically found in cold, deep waters in the Southern Hemisphere, particularly around South America and the sub-Antarctic islands^1,2. It grows up to 2 meters in length, is brown-gray in color and lives up to 50 years, with unique anti-freeze glycoproteins in its blood that allow it to survive freezing temperatures. It is one of only two species in the genus Dissostichus, the other being the Antarctic toothfish, with which it shares morphological similarities^3,4. Despite their similarities, however, these two species have distinct genetic and ecological traits, such as the presence of the antifreeze glycoprotein (AFGP) genes in the Antarctic toothfish but not in the Patagonian toothfish, as well as separate habitats. We assessed the presence of AFGP genes based on the exact locus in the genome sequence. The Patagonian toothfish is a commercially important species, often caught by longline fishing vessels that target high-value species in the region. Its delicate, sweet flavour and firm texture make it a delicacy in many countries, and overfishing has depleted the population to the extent that it is considered a threatened species^5,6. By comparing the genomes of these two species, we can understand the genetic factors that contribute to their differences and use this knowledge to inform more effective conservation and management strategies⁷. The analysis in this study used methods similar to those of our previous study⁸. We present a chromosome-level genome assembly and gene annotation that can serve as a basis for genomic studies of the Patagonian toothfish. A previous study did not achieve a chromosome-level assembly⁹. 
In this study, we obtained 24 unambiguous chromosome sequences by complete Hi-C analysis and confirmed the homogeneity at the chromosome level by comparison with the ecologically closest species, D. mawsoni¹⁰. This will provide a valuable resource for future genomic studies of Antarctic and sub-Antarctic fish species, particularly when comparing the Antarctic and Patagonian toothfish. These data can be used to inform the management and conservation of these ecologically and economically important species and will contribute to our understanding of the genetic and physiological mechanisms that enable animals to survive in extreme environments. The findings of this study highlight the importance of understanding the genomic and ecological differences between closely related species and the potential benefits of studying the genomes of Antarctic fish species beyond fisheries management.

Methods

Sample preparation and sequencing

Adult Patagonian toothfish were obtained from commercial fisheries (captured at −56.91°, −70.69°), and genomic DNA was extracted from the muscle of a single individual. For short-read and long-read genome sequencing, genomic DNA was extracted using the MagAttract HMW DNA Kit (Qiagen, catalog no. 67563) following the manufacturer’s protocol. For short-read genome sequencing, the library was prepared using the Illumina TruSeq Nano DNA Library Prep Kit. Fragments of approximately 350 bp were generated from high-molecular-weight genomic DNA by random shearing using the Covaris S2 system (Covaris Inc., Woburn, MA, USA). The resulting DNA fragments were end-repaired and ligated to Illumina-specific adaptors. Indexed libraries were pooled in equimolar concentrations. Whole-genome sequencing (WGS) was carried out on an Illumina NovaSeq 6000 system (Illumina Inc., San Diego, CA, USA) with 2 × 150 bp paired-end reads. For long-read genome sequencing, the PacBio library was constructed from 20 kb fragments generated by shearing genomic DNA with the Covaris G-tube, following the manufacturer’s recommended protocol. A total of 5 μg of DNA was used as input for library preparation. The SMRTbell library was constructed using the SMRTbell™ Template Prep Kit 1.0. After annealing the sequencing primer to the SMRTbell template, DNA polymerase was bound to the complex using the Sequel Binding Kit 2.0. Excess unbound polymerase molecules and small DNA inserts were removed during a purification step following polymerase binding. The polymerase-SMRTbell-adapter complex was then loaded into zero-mode waveguides (ZMWs). Finally, the SMRTbell library was sequenced using nine Sequel™ SMRT® Cell 1M v2 cells and the Sequel Sequencing Kit 2.1. A 600-minute movie was captured for each SMRT cell using the Sequel System (Pacific Biosciences, Menlo Park, CA, USA)¹⁰. The Hi-C library was constructed according to the manufacturer’s protocol to generate pseudo-chromosomes. 
The library was prepared using the Dovetail™ Hi-C Library Preparation Kit (Dovetail Genomics, Santa Cruz, CA, USA). Nuclear chromatin from the muscle tissue was cross-linked with formaldehyde and extracted. The fixed chromatin was digested with DpnII, and the sticky ends were filled in with biotinylated nucleotides, followed by ligation. The cross-links were then reversed, and the purified DNA was treated to remove any unbound biotin from the ligated fragments. The DNA was subsequently sheared to an average size of ~350 bp using the Covaris S2 System (Covaris Inc., Woburn, MA, USA), and biotinylated fragments were enriched through streptavidin bead pull-down. Finally, index PCR was performed to generate the library, which was then sequenced on the Illumina NovaSeq platform⁸. RNA was extracted from the muscle tissue of the same individual used for genomic DNA extraction. A total of 1 μg of RNA was used as input for cDNA synthesis using the SMARTer PCR cDNA Synthesis Kit (Clontech, catalog no. 634925). Library construction required 1–5 μg of pooled cDNA; the SMRTbell library was prepared using the SMRTbell™ Template Prep Kit 1.0-SPv3 and sequenced on three SMRT cells per library using the Sequel Sequencing Kit 2.0. Sequencing conditions included a 600-minute movie run time and a 240-minute pre-extension time on the Sequel System (Pacific Biosciences, Menlo Park, CA, USA). After sequencing, high-quality Iso-Seq reads were extracted using SMRTLink v.8.0¹¹ (Table 1).

Table 1 Sequencing data generated for short-read, long-read and Hi-C data of the Patagonian toothfish.

Chromosome-level genome assembly with long-read sequences and Hi-C

The draft de novo assembly was constructed using the FALCON-Unzip assembler¹² with filtered subread sequences. The length cut-off (length_cutoff = 23800) was set based on the subread N50 of 23.8 kb. The primary contigs were generated through phased diploid assembly using an unzipping approach, followed by polishing with the Arrow consensus algorithm. To improve assembly quality, WGS short reads were aligned to the haplotig-merged primary contigs and erroneous bases were corrected with Pilon v.1.2.23¹³. The Hi-C raw sequence data were aligned to the draft assembly using BWA-MEM¹⁴, and Juicer v.1.5.7¹⁵ was used to generate Hi-C contact matrices after removing duplicates from the linking data. The alignments were then processed with 3D-DNA¹⁶ for chromosome-level scaffolding, in which inter-contig linkage information was used to arrange, merge and classify contigs into chromosomes based on linkage density. Initial scaffolding results were manually reviewed using Juicebox to correct mis-joined and unplaced contigs¹⁷. Ultimately, 3D-DNA was used to generate a high-quality, chromosome-level genome assembly. The draft genome assembly comprised 1,224 contigs with an N50 of 4.3 megabase pairs (Mbp) and a total length of 842 Mbp; it was constructed from 90× long-read data and error-corrected with 50× short-read data. The Hi-C analysis upgraded the draft assembly to a chromosome-level assembly with 24 chromosomal sequences. After scaffolding, the longest sequence was 42.8 Mbp, the N50 was 36 Mbp, and the total number of sequences decreased to 495. Additionally, 24 scaffolds longer than 10 Mbp were identified, consistent with the known karyotype of the Patagonian toothfish (2n = 48)¹⁸. By comparison, the scaffold N50 of the previous assembly⁹ is 3.5 Mbp, about 10% of the 36 Mbp N50 obtained in this study; that assembly contains 447 sequences, but only 11 are longer than 10 Mbp, which is not enough to represent the complete genome (Table 2A). 
The Hi-C scaffolds were validated using the Hi-C contact map (Fig. 1a,b). The coverage of 24 chromosomal sequences was 95.14% and the total size of the unplaced scaffolds was 41.05 Mbp (Table 3).
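
The summary statistics quoted above can be illustrated with a short Python sketch. It computes an N50 from a list of sequence lengths and recomputes the unplaced-scaffold size from the reported assembly size (844.7 Mbp) and chromosome coverage (95.14%). The function name n50 and the toy lengths are illustrative assumptions; this is not the authors' pipeline.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example with hypothetical lengths (not the toothfish scaffolds).
toy_lengths = [40, 30, 20, 5, 5]
toy_n50 = n50(toy_lengths)

# Values reported in the text.
total_mbp = 844.7             # final assembly size
chromosome_fraction = 0.9514  # share covered by the 24 chromosomal sequences
unplaced_mbp = total_mbp * (1 - chromosome_fraction)

print(toy_n50, round(unplaced_mbp, 2))
```

The same function applies to read lengths, which is how a subread N50 such as the 23.8 kb value used for length_cutoff would be obtained.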

Table 2 Summary of the Patagonian toothfish genome assembly and comparison of previous assembly.

Table 3 Summary of chromosome sequences of the final assembly.

Repeat analysis

A de novo repeat library was constructed using RepeatModeler v.1.0.3¹⁹, which incorporates RECON and RepeatScout v.1.0.5²⁰, with default parameters. Additionally, Tandem Repeats Finder²¹ was used to predict consensus sequences and generate classification data for each repeat element. All repeats identified by RepeatModeler were further searched against the UniProt/SwissProt database²². To accurately identify long terminal repeat retrotransposons (LTR-RTs), an LTR library was constructed using LTR_retriever²³ by integrating raw LTR data from both LTRharvest²⁴ and LTR_FINDER²⁵. Repetitive elements were subsequently annotated using RepeatMasker v.4.0.9 with the combined repeat library generated from RepeatModeler and LTR_retriever. Repetitive elements constitute 39.08% of the chromosome-level genome. They consist of 12.25% DNA transposons, 5.09% long interspersed nuclear elements (LINEs), 0.49% short interspersed nuclear elements (SINEs), 9.96% long terminal repeats (LTRs), and 9.72% other repetitive sequences (Table 4).

Table 4 Summary of annotated transposable elements of the Patagonian toothfish.

Gene prediction and annotation

Gene prediction was performed using EVidenceModeler (EVM) v.1.1.1²⁶, which integrates the results of multiple gene predictions. The repeat-masked genome was used for ab initio gene prediction with GeneMark-ES v.4.68²⁷ and Augustus v.3.4.0²⁸. Next, hints for protein-based and ab initio predictions were extracted with ProtHint v.2.6.0³⁰ using all protein sequences from Actinopterygii, a clade of bony fishes, in the UniProt/SwissProt protein database²⁹. The hints were used for protein-based prediction with GeneMark-EP+ v.4.68³⁰ and for ab initio prediction with Augustus. To obtain transcriptome-level evidence, the PASA pipeline v.2.3.3³¹ was used with Iso-Seq data. EVM was used to integrate the ab initio, transcriptome, and protein prediction results into the final gene prediction with the weights ABINITIO_PREDICTION = 1, PROTEIN = 50, and TRANSCRIPT = 50. Finally, the PASA pipeline with Iso-Seq data was used to update exon structures by adding untranslated regions (UTRs). Genome Annotation Generator v.2.0.1³² was used to add start/stop codon data and generate a well-formed GFF file. Other noncoding RNAs and putative tRNA genes were identified using Barrnap v.0.9 and tRNAscan-SE v.2.0.5³³, respectively. The predicted genes were annotated by aligning them to the NCBI non-redundant protein (nr) database³⁴ using NCBI BLAST v.2.9.0³⁵ with a maximum e-value of 1e−5. To obtain protein domain information, InterProScan v.5.44.79³⁶ was run on protein sequences translated from the transcripts. Additionally, comprehensive annotation of transcriptome sequences was performed using Trinotate³⁷, and Kyoto Encyclopedia of Genes and Genomes (KEGG) annotation was applied to peptide sequences decoded using TransDecoder v.5.5. Protein signal peptide prediction was carried out using SignalP v.5.0³⁸, and transmembrane domain prediction was conducted using TMHMM v.2.0³⁹. 
Finally, Gene Ontology (GO) terms²² were assigned to the genes using the BLAST2GO pipeline v.4.0⁴⁰. A total of 32,224 genes and 32,471 coding sequences (CDSs) were identified in the Patagonian toothfish genome. The average CDS length was 1,287 bp, and the average number of exons per gene was 8.01 (Table 3B). Across the seven databases used for functional annotation, between 10.10% and 76.90% of CDSs were annotated per database, and 87.84% of CDSs were annotated in at least one database (Table 5).
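
As a hedged illustration of the evidence weighting described above, the Python sketch below writes the three stated weights in the three-column, tab-separated layout (evidence class, source label, weight) that EVM reads from its weights file. The evidence classes and weights come from the text; the source labels in the second column are hypothetical placeholders and would have to match the source column of the GFF3 evidence files actually used.

```python
# Evidence classes and weights are those stated in the text. The source
# labels (second element) are hypothetical and must match column 2 of the
# GFF3 evidence files that are supplied to EVM.
WEIGHTS = [
    ("ABINITIO_PREDICTION", "Augustus", 1),
    ("ABINITIO_PREDICTION", "GeneMark.hmm", 1),
    ("PROTEIN", "protein_alignment", 50),
    ("TRANSCRIPT", "pasa_assemblies", 50),
]


def format_weights(rows):
    """Render rows as tab-separated lines: class, source, weight."""
    return "".join(f"{cls}\t{src}\t{weight}\n" for cls, src, weight in rows)


weights_text = format_weights(WEIGHTS)
print(weights_text, end="")
```

With these weights, protein and transcript evidence each count 50 times as much as a single ab initio prediction when EVM chooses a consensus gene structure.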

Table 5 Summary of functional annotations of the Patagonian toothfish genome.

Phylogenetic analysis

Orthologous gene clusters were classified across the genomes of 17 species, including D. eleginoides (Table 6), using OrthoMCL (OrthoMCL-DB: Ortholog Groups of Protein Sequences)⁴¹. Protein sequences were extracted from the longest isoforms in each species. To construct a phylogenetic tree, we performed orthologous gene analysis using OrthoFinder⁴² with an e-value cut-off of 1e−5 and an all-to-all BLASTP analysis of the 17 species. MAFFT v.6.861b⁴³ was used to align each gene family, and the phylogenetic tree was inferred with FastTree v.2.1.10²⁰, with divergence times calibrated using both PATHd8⁴⁴ and TimeTree⁴⁵. Finally, CAFE v.4.2.1⁴⁶ was used with the time tree to estimate gene family expansion and contraction at P < 0.01, with automatic searching for the λ value (commands: load -p 0.01; lambda -s -t) and otherwise default parameters. Phylogenetic analysis was performed using single-copy orthologous genes, and the 17 species, including the Patagonian toothfish, were found to have branched from the most recent common ancestor (MRCA) with 24,069 orthogroups. The 17 species used in the analysis belong to the class Actinopterygii, and the suborder Notothenioidei diverged about 73 million years ago. Within this suborder, the Antarctic and sub-Antarctic fishes diverged about 61 million years ago, while the Antarctic toothfish and Patagonian toothfish, which are the most similar morphologically, diverged about 15 million years ago. However, the numbers of expanded and contracted gene families differ by 871 and 3,414, respectively, suggesting ecological differences between the Antarctic toothfish and Patagonian toothfish (Fig. 1c).
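
The longest-isoform step mentioned above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the input is taken to be (gene ID, isoform ID, protein sequence) tuples, and the function name longest_isoforms and the toy records are hypothetical rather than the authors' implementation.

```python
def longest_isoforms(proteins):
    """Keep one protein per gene: the longest isoform.

    proteins: iterable of (gene_id, isoform_id, sequence) tuples.
    Returns {gene_id: (isoform_id, sequence)}; on ties the first seen is kept.
    """
    best = {}
    for gene_id, isoform_id, sequence in proteins:
        current = best.get(gene_id)
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (isoform_id, sequence)
    return best


# Toy records (hypothetical identifiers and sequences).
records = [
    ("geneA", "geneA.t1", "MKT"),
    ("geneA", "geneA.t2", "MKTLLV"),
    ("geneB", "geneB.t1", "MSS"),
]
selected = longest_isoforms(records)
print(selected)
```

Reducing each gene to a single protein avoids treating alternative isoforms of one gene as separate members of an orthogroup.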

Table 6 Genome assemblies used for phylogenetic analysis.

Genomic comparative analysis

To compare genome assemblies at the chromosome level, nucmer from the MUMmer software package v.4.02b⁴⁷ was used with the parameters -c 1000 -l 1000 and the additional --mum option, for unique matching and to avoid repeat regions. For a clear chromosome comparison, only the long sequences corresponding to chromosomes were extracted and compared; unordered contig or scaffold sequences were excluded. Circos⁴⁸ is a useful tool for comparing genome sequences based on homogeneous coordinates. We converted the coordinate data obtained from nucmer into a format readable by Circos, and the results of the chromosome comparison between the two genomes were plotted using Circos. We performed a genomic comparative analysis to determine the genetic similarities between the Patagonian toothfish⁴⁹ and the Antarctic toothfish⁵⁰ and found that all chromosome regions of both species matched identically without any chromosome segregation (Fig. 2a). The regions containing AFGP and trypsinogen genes were extracted from the whole-genome sequence using NCBI BLAST¹¹ v.2.9.0³⁵ against transcript and protein sequences of the Antarctic toothfish⁵¹. AFGP genes evolved from trypsinogen genes in Antarctic fish⁵². Automated prediction methods cannot predict the gene features of AFGP genes, because the AFGP gene sequence has a high incidence of tandem repeats⁵³. Instead, we developed a customized process to predict complete AFGP gene features and analysed the exons and CDSs of AFGP and trypsinogen genes. The AFGP–trypsinogen locus was located between the genes encoding transmembrane protein 145 (tmem145) – mitochondrial 39S ribosomal protein L17 (mrpl17) and Cbl proto-oncogene, E3 ubiquitin-protein ligase (cbl) – BCL3 transcription coactivator (bcl), as reported in a previous study⁵⁴. Using previously published Antarctic toothfish genome data¹⁰, we constructed a haplotype-resolved genome and annotated multiple trypsinogen and AFGP genes using the AFGP gene prediction method. 
However, only seven copies of the trypsinogen genes and one copy of the trypsinogen-like protease gene were predicted at the exon/CDS level in the Patagonian toothfish. This result confirmed the largest genetic difference between the Antarctic toothfish and the Patagonian toothfish (Fig. 2b).
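
The conversion of nucmer coordinates into Circos input mentioned above could look like the following Python sketch. It assumes tab-delimited alignment coordinates with start and end positions in the first four columns, the reference alignment length in the fifth, and the reference and query sequence IDs in the last two, and it emits one Circos link per alignment as "chrA start end chrB start end". The function name, the species prefixes de_ and dm_, the length filter and the sample rows are all illustrative assumptions, not the authors' script.

```python
def coords_to_circos_links(coords_text, ref_prefix="de_", qry_prefix="dm_",
                           min_len=1000):
    """Convert tab-delimited alignment coordinates into Circos link lines."""
    links = []
    for line in coords_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # skip blank or truncated lines
        try:
            s1, e1, s2, e2, len1 = (int(x) for x in fields[:5])
        except ValueError:
            continue  # skip column-title lines
        if len1 < min_len:
            continue  # drop short alignments
        ref_id, qry_id = fields[-2], fields[-1]
        # Reverse-strand matches have start > end; sort so that start <= end
        # (orientation is discarded in this sketch).
        r_start, r_end = sorted((s1, e1))
        q_start, q_end = sorted((s2, e2))
        links.append(
            f"{ref_prefix}{ref_id} {r_start} {r_end} "
            f"{qry_prefix}{qry_id} {q_start} {q_end}"
        )
    return links


# Hypothetical sample rows.
sample = (
    "1\t5000\t10\t5010\t5000\t5001\t99.50\tchr1\tchrA\n"
    "200\t700\t9000\t8500\t501\t501\t98.00\tchr1\tchrB\n"
    "6000\t9000\t20000\t17000\t3001\t3001\t97.10\tchr2\tchrB\n"
)
links = coords_to_circos_links(sample)
print("\n".join(links))
```

The prefixes keep chromosome names from the two species distinct within a single Circos karyotype.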

Data Records

The genome assembly, annotation data and raw data of the Patagonian toothfish have been deposited in NCBI under the accession numbers GCA_031216635.1⁴⁹ and SRP524971¹¹.

Technical Validation

Quality control of nucleic acids and libraries

The quality and quantity of the extracted DNA were estimated using a Qubit 2.0 fluorometer (Invitrogen, Life Technologies, Carlsbad, CA, USA) and a Fragment Analyzer (Agilent Technologies, Santa Clara, CA, USA). Genomic DNA fragments larger than 20 kb were used to construct the SMRTbell library. The Hi-C fragments were sheared to a size of ~350 bp. The quality and quantity of RNA were assessed using a 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA) and a Qubit 2.0 fluorometer (Invitrogen, Life Technologies, CA, USA). The RNA integrity number (RIN) was 8.8 and the average Iso-Seq library size was ~2,800 bp.

Genome assembly and annotation evaluation

Benchmarking Universal Single-Copy Orthologs (BUSCO) v.5.4.4⁵⁵ with default parameters and the Actinopterygii lineage dataset (3,640 single-copy orthologs; OrthoDB v.10) was used to assess the completeness of the genome assembly. The Actinopterygii dataset is applicable to teleosts such as the Patagonian toothfish. A total of 3,563 complete BUSCOs (97.9%) were identified. Of these, 3,450 (94.8%) were single-copy and 113 (3.1%) were duplicated. The numbers of partially matched and missing BUSCOs were 48 (1.3%) and 215 (5.9%), respectively. Additionally, BUSCO was run in transcriptome mode on the CDSs to confirm the gene prediction results. The percentage of complete BUSCOs was 84.0%, while missing BUSCOs comprised 10.2% (Table 7). The k-mer completeness and quality value (QV) were evaluated with Merqury v1.3⁵⁶, which reported a QV of 32.89 and a completeness of 91.77%.
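
As a worked check of the figures above, the Python sketch below recomputes the reported BUSCO percentages from the counts and the 3,640-ortholog dataset size, and converts the Merqury QV into a per-base error rate using the Phred relation QV = −10 log10(error rate). The helper names percent and qv_to_error_rate are illustrative assumptions.

```python
def percent(count, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value into a per-base error rate."""
    return 10 ** (-qv / 10)


TOTAL_BUSCOS = 3640  # Actinopterygii lineage dataset size stated in the text

complete = percent(3563, TOTAL_BUSCOS)
single_copy = percent(3450, TOTAL_BUSCOS)
duplicated = percent(113, TOTAL_BUSCOS)
partial = percent(48, TOTAL_BUSCOS)

error_rate = qv_to_error_rate(32.89)

print(complete, single_copy, duplicated, partial)
print(f"{error_rate:.2e}")
```

A QV of 32.89 corresponds to roughly five erroneous bases per 10,000, that is, a consensus accuracy slightly above 99.9%.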

Table 7 Assessment of the Patagonian toothfish final assembly and transcriptome using BUSCO.

Code availability

The bioinformatic software and pipeline utilized in this study were implemented following the guidelines provided by the developers. Details regarding the versions and parameters of each software are outlined in the Methods section. Unless specified otherwise, default parameters were utilized.

References

Fischer, W. & Hureau, J. C. Southern Ocean: Fishing Areas 48, 58 and 88 (CCAMLR Convention Area). Vol. 1 (Food and agriculture organization of the United nations, 1985).
DeWitt, H., Heemstra, P. & Gon, O. Nototheniidae. Fishes of the southern ocean. JLB Smith Institute of Ichthyology, Grahamstown, 279–331 (1990).
Eastman, J. T. Antarctic fish biology: evolution in a unique environment. (Academic Press, 2013).
Policansky, D. Southernmost Fauna: Antarctic Fish Biology. Evolution in a Unique Environment. Joseph T. Eastman. Illustrations and graphics by Danette Pratt. Photographs by William Winn. Academic Press, San Diego, CA, 1993. xiv, 322 pp., illus. $74.95 or £57.; Antarctic Fish and Fisheries. Karl-Hermann Kock. Cambridge University Press, New York, 1992. xvi, 359 pp., illus. $110 or £60. Studies in Polar Research.; History and Atlas of the Fishes of the Antarctic Ocean. Richard Gordon Miller. With contributions by Philip A. Hastings and Josette Gourley. Foresta Institute of Ocean and Mountain Studies, Tucson, AZ, 1993. xx, 792 pp., illus. $95; laminated cover, $78. Science 264, 1002–1004 (1994).
Clover, C. The end of the line: how overfishing is changing the world and what we eat. (Univ of California Press, 2008).
Brandão, A. & Butterworth, D. S. A proposed management procedure for the toothfish (Dissostichus eleginoides) resource in the Prince Edward Islands vicinity. (2009).
Seung Jae Lee, J. K., Choi, E., Jo, E. & Cho, M. Hyun Park. The Application of Genome Research to Development of Aquaculture. Journal of Marine Life Science 6, 47–57 (2021).
Lee, S. J. et al. A chromosome-level reference genome of the Antarctic blackfin icefish Chaenocephalus aceratus. Scientific Data 10, 657 (2023).
Ryder, D. et al. De novo assembly and annotation of the Patagonian toothfish (Dissostichus eleginoides) genome. BMC genomics 25, 233 (2024).
Lee, S. J. et al. Chromosomal assembly of the Antarctic toothfish (Dissostichus mawsoni) genome using third-generation DNA sequencing and Hi-C technology. Zoological research 42, 124 (2021).
NCBI Sequence Read Archive http://identifiers.org/ncbi/insdc.sra:SRP524971 (2024).
Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nature methods 13, 1050–1054 (2016).
Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PloS one 9, e112963 (2014).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997 (2013).
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell systems 3, 95–98 (2016).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
Dudchenko, O. et al. The Juicebox Assembly Tools module facilitates de novo assembly of mammalian genomes with chromosome-length scaffolds for under $1000. BioRxiv, 254797 (2018).
Ghigliotti, L. et al. The two giant sister species of the Southern Ocean, Dissostichus eleginoides and Dissostichus mawsoni, differ in karyotype and chromosomal pattern of ribosomal RNA genes. Polar Biology 30, 625–634 (2007).
Bao, Z. & Eddy, S. R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome research 12, 1269–1276 (2002).
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21, i351–i358 (2005).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research 27, 573–580 (1999).
Dimmer, E. C. et al. The UniProt-GO annotation database in 2011. Nucleic acids research 40, D565–D570 (2012).
Ou, S. & Jiang, N. LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant physiology 176, 1410–1422 (2018).
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC bioinformatics 9, 1–14 (2008).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic acids research 35, W265–W268 (2007).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome biology 9, 1–22 (2008).
Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y. O. & Borodovsky, M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic acids research 33, 6494–6506 (2005).
Stanke, M., Schöffmann, O., Morgenstern, B. & Waack, S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC bioinformatics 7, 1–11 (2006).
Consortium, U. UniProt: a worldwide hub of protein knowledge. Nucleic acids research 47, D506–D515 (2019).
Bruna, T., Lomsadze, A. & Borodovsky, M. GeneMark-EP and-EP+: automatic eukaryotic gene prediction supported by spliced aligned proteins. bioRxiv, 2019.2012. 2031.891218 (2020).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic acids research 31, 5654–5666 (2003).
Geib, S. M. et al. Genome Annotation Generator: a simple tool for generating and correcting WGS annotation tables for NCBI submission. Gigascience 7, giy018 (2018).
Chan, P. P. & Lowe, T. M. tRNAscan-SE: searching for tRNA genes in genomic sequences. (Springer, 2019).
Marchler-Bauer, A. et al. CDD: a Conserved Domain Database for the functional annotation of proteins. Nucleic acids research 39, D225–D229 (2010).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of molecular biology 215, 403–410 (1990).
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
Bryant, D. M. et al. A tissue-mapped axolotl de novo transcriptome enables identification of limb regeneration factors. Cell reports 18, 762–776 (2017).
Almagro Armenteros, J. J. et al. SignalP 5.0 improves signal peptide predictions using deep neural networks. Nature biotechnology 37, 420–423 (2019).
Möller, S., Croning, M. D. & Apweiler, R. Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics 17, 646–653 (2001).
Conesa, A. et al. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21, 3674–3676 (2005).
Li, L., Stoeckert, C. J. & Roos, D. S. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome research 13, 2178–2189 (2003).
Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome biology 20, 1–14 (2019).
Katoh, K., Asimenos, G. & Toh, H. Multiple alignment of DNA sequences with MAFFT. Bioinformatics for DNA sequence analysis, 39–64 (2009).
Britton, T., Anderson, C. L., Jacquet, D., Lundqvist, S. & Bremer, K. Estimating divergence times in large phylogenetic trees. Systematic biology 56, 741–752 (2007).
Kumar, S., Stecher, G., Suleski, M. & Hedges, S. B. TimeTree: a resource for timelines, timetrees, and divergence times. Molecular biology and evolution 34, 1812–1819 (2017).
Han, M. V., Thomas, G. W., Lugo-Martinez, J. & Hahn, M. W. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3. Molecular biology and evolution 30, 1987–1997 (2013).
Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome biology 5, 1–9 (2004).
Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome research 19, 1639–1645 (2009).
Park, H. GenBank https://identifiers.org/insdc.gca:GCA_031216635.1 (2023).
Lee, S. J. et al. GenBank https://identifiers.org/insdc.gca:GCA_011823955.1 (2021).
Nicodemus-Johnson, J., Silic, S., Ghigliotti, L., Pisano, E. & Cheng, C.-H. C. Assembly of the antifreeze glycoprotein/trypsinogen-like protease genomic locus in the Antarctic toothfish Dissostichus mawsoni (Norman). Genomics 98, 194–201 (2011).
Chen, L., DeVries, A. L. & Cheng, C.-H. C. Evolution of antifreeze glycoprotein gene from a trypsinogen gene in Antarctic notothenioid fish. Proceedings of the National Academy of Sciences 94, 3811–3816 (1997).
Chen, L., DeVries, A. L. & Cheng, C.-H. C. Convergent evolution of antifreeze glycoproteins in Antarctic notothenioid fish and Arctic cod. Proceedings of the National Academy of Sciences 94, 3817–3822 (1997).
Kim, B. M. et al. Antarctic blackfin icefish genome reveals adaptations to extreme environments. Nat Ecol Evol 3, 469–478, https://doi.org/10.1038/s41559-019-0812-7 (2019).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome biology 21, 1–27 (2020).

Acknowledgements

This work was supported by a Korea Institute of Marine Science & Technology Promotion (KIMST) grant funded by the Ministry of Oceans and Fisheries (KIMST RS-2022-KS221661), the National Institute of Fisheries Science (NIFS; R2024003), and a grant from Korea University.

Author information

These authors contributed equally: Seung Jae Lee, Minjoo Cho.

Authors and Affiliations

Department of Biotechnology, College of Life Sciences and Biotechnology, Korea University, Seoul, 02841, Korea
Seung Jae Lee, Minjoo Cho, Jinmu Kim, Eunkyung Choi, Soyun Choi & Hyun Park
National Institute of Fisheries Science (NIFS), Busan, 46083, Korea
Sangdeok Chung & Jaebong Lee
Korea Polar Research Institute (KOPRI), Yeonsu-gu, Incheon, 21990, Korea
Jeong-Hoon Kim

Contributions

H.P. and J.-H.K. conceived the study. S.J.L., M.C., J.K., E.K.C., S. Choi, S. Chung, and J.L. performed genome sequencing and assembly. S.J.L., M.C., J.-H.K. and H.P. wrote the manuscript. All the authors contributed to writing and editing the manuscript, collating the supplementary information, and preparing the figures.

Corresponding authors

Correspondence to Jeong-Hoon Kim or Hyun Park.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Lee, S.J., Cho, M., Kim, J. et al. Chromosome-level genome assembly and annotation of the Patagonian toothfish Dissostichus eleginoides. Sci Data 11, 1240 (2024). https://doi.org/10.1038/s41597-024-04119-w

Received: 30 April 2024
Accepted: 12 November 2024
Published: 16 November 2024
DOI: https://doi.org/10.1038/s41597-024-04119-w