Background & Summary

The Japanese cutlassfish, Trichiurus japonicus, also known as large-head hairtail, is widely distributed in the northwestern Pacific Ocean, inhabiting coastal waters at depths of 100–200 meters and exhibiting diurnal vertical migration1. It is a commercially important species in East Asia’s capture fisheries and the dominant Trichiuridae species in most fishing ports across Taiwan, accounting for 40% to 100% of reported catches2. The global capture production of Trichiuridae, which often aggregates the output of T. japonicus under the globally distributed species T. lepturus as it was previously regarded as a subspecies, has exceeded one million tons annually since 19943, with over 80% caught from the East Asian Seas. Compared to other Trichiuridae species, the broad latitudinal distribution of the Japanese cutlassfish suggests that this temperate species is well adapted to cold waters4, leading to the hypothesis of temperature-driven speciation5.

A decline in the Japanese cutlassfish population has been documented in the literature1, putatively caused by overfishing and the impacts of global climate change. To assess the stock structure for resource monitoring, it is crucial to use a methodology that can accurately assign species rather than relying solely on phenotypic features. Using molecular features such as the mitochondrial control region, cytochrome b (cyt b), or cytochrome c oxidase I (COI), researchers successfully assessed the cutlassfish population composition2,3. In light of the heavy exploitation of Japanese cutlassfish, developing sophisticated genome-derived molecular markers to elucidate genetic diversity within populations is critical for informing resource management strategies.

Another T. japonicus genome assembly (GenBank assembly: GCA_046254865.1) was released in the NCBI database by Shanghai Ocean University in December 2024. That assembly, consisting of 24 chromosomes (931.81 Mb) and 1,014 unplaced scaffolds with a scaffold N50 of 39.2 Mb, was constructed by integrating Oxford Nanopore long reads (112 Gb), Illumina short reads (66.37 Gb), and Hi-C scaffolding data (104.11 Gb). This manuscript presents a more complete genome assembly of T. japonicus, built from PacBio HiFi reads, Oxford Nanopore, 10x, Omni-C, and Hi-C data. To maximize the utility of read information from different sequencing platforms, we combined the assembly workflows of HiFiasm6,7 and SALSA28, and further enhanced genome continuity using our gap-filling algorithm, LOCLA9. The final genome assembly comprises 24 chromosomes and 160 scaffolds, totaling 1,138 Mb in length, with a scaffold N50 length of 47.10 Mb, an average scaffold length of 6.18 Mb, and a complete mitogenome of 16,796 bp. Comprehensive annotation was performed on the nuclear and mitochondrial genomes to identify protein-coding and non-coding genes. This assembly provides a foundational resource for further studies of the species.

Methods

Sample collection and library construction

A female Japanese cutlassfish (Trichiurus japonicus), measuring approximately 58 cm in length and weighing 117 grams, was captured offshore of Kaohsiung, Taiwan (22°15.754′ N, 120°38.784′ E) in May 2020. Tissue samples from the brain, heart, blood, spleen, kidney, liver, and muscle were collected on board the fishing vessel. High-quality genomic DNA was extracted from various tissues using the DNeasy Blood & Tissue Kit (Qiagen, USA) following the manufacturer’s instructions. DNA quality and concentration were assessed using a NanoDrop One spectrophotometer (Thermo Scientific, USA) and an automated pulsed-field capillary electrophoresis system (Agilent, USA). Liver-derived genomic DNA demonstrated the highest quality among all tissue sources and was used for library preparation for short-read, long-read, and Hi-C sequencing platforms.

Genome sequencing

A single-molecule real-time (SMRT) library prepared from extracted liver genomic DNA (gDNA) was sequenced on the PacBio Sequel II platform using an 8 M SMRT Cell chip with a 30-hour movie time. HiFi reads (Q > 20) were generated from circular consensus (CCS) reads, processed with SMRT Link (version: 11.1.0.166339). Two PacBio HiFi cells generated 922,338 and 910,177 reads, respectively, producing 16,520,702,091 bp (average read length: 17,911 bp) and 16,053,456,774 bp (average read length: 17,637 bp), with an average base quality score of 30.
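The per-cell HiFi statistics above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original workflow: it assumes that the reported average read lengths were truncated to whole base pairs, and it uses the final assembly length (1,138,003,112 bp) as a stand-in for genome size when estimating sequencing depth.

```python
# Read counts and total bases for the two HiFi SMRT cells, as reported in the text.
hifi_cells = [
    {"reads": 922_338, "bases": 16_520_702_091},
    {"reads": 910_177, "bases": 16_053_456_774},
]

# Assumption: the final assembly length is used as a proxy for genome size.
ASSEMBLY_BP = 1_138_003_112


def mean_read_length(cell):
    """Average read length in bp, truncated to an integer (assumed convention)."""
    return cell["bases"] // cell["reads"]


total_bases = sum(cell["bases"] for cell in hifi_cells)
approx_depth = total_bases / ASSEMBLY_BP  # approximate HiFi coverage of the assembly

for cell in hifi_cells:
    print(mean_read_length(cell))
print(total_bases, round(approx_depth, 1))
```

Under these assumptions, the two cells together contribute roughly 32.6 Gb of HiFi data, which corresponds to a depth in the high twenties over the assembled genome.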

A liver-derived genomic DNA library was constructed and sequenced for Nanopore long-read sequencing using the Oxford Nanopore Technologies (ONT) R9.4 flow cell. Reads longer than 10 kb were retained for subsequent genome assembly. A total of 532,138 reads were generated, yielding approximately 12 Gb of sequence data, with an average read length of 22,575 bp.

The experimental design included high-throughput chromosome conformation capture (Hi-C) sequencing for scaffolding. Hi-C libraries were prepared using chromatin crosslinking protocols to preserve long-range genomic interactions. Sequencing on the Illumina HiSeq 2500 platform generated 42,628 megabase pairs (Mbp) of 151 bp paired-end reads.

Genome assembly and quality evaluation

A contig-level draft genome, based on PacBio HiFi and Hi-C reads, was constructed using HiFiasm (version: 0.19.5-r587)6,7 and is summarized in Table 1. It comprises 1,010 scaffolds with a maximum scaffold length of 28,308,023 bp and a contig N50 of 7,895,182 bp. A two-step scaffolding strategy was employed to connect contigs toward a chromosome-level assembly. First, Hi-C reads were reused for scaffolding with SALSA2 (docker version 2.3)8. Subsequently, the global-contig-based (GCB) scaffolding module of LOCLA9 was applied to the SALSA2-derived scaffolds with ONT long reads (longer than 10 kb) to further improve scaffold continuity. After the scaffolding process, 184 scaffolds remained, with the maximum scaffold length increased to 55,973,360 bp. The contig N50 and mean scaffold length increased to 46,998,866 bp and 6,168,641 bp, respectively (Table 1). In the final gap-filling step, we applied the LOCLA global-contig-based (GCB) gap-filling module with PacBio long-read data. Among the final 184 scaffolds, 29 were longer than 1 Mb, comprising 1,138,003,112 bp with an N-base ratio of 0.02%. Meanwhile, the complete mitochondrial genome, with a length of 16,796 bp, was identified and reconstructed with PacBio HiFi reads and Canu.

Table 1 Basic statistics of genome assembly in different stages.

The Hi-C contact map (Fig. 1) reveals strong interactions within chromosomes. The final assembly was then scanned with FCS-Adapter and FCS-GX (the database was constructed on March 27, 2024). No adapters or contamination were found in the representative nuclear and mitochondrial genomes. In addition, analysis of the HiFi reads and the final assembly with Merqury (version: 1.3)10 revealed an average quality value (QV) of 38.59 and a k-mer completeness of 86.82%.
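For readers less familiar with the continuity metric used throughout this section, N50 is the length of the sequence at which the cumulative length, counted from the longest sequence downward, first reaches half of the total assembly length. The sketch below illustrates that standard definition on hypothetical toy lengths; it is not the tool used to produce Table 1.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches at least half
    of the summed length. Returns 0 for an empty input.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Hypothetical toy lengths (in bp), for illustration only.
toy_lengths = [2, 3, 4, 5, 6]
print(n50(toy_lengths))
```

The same function applies to contigs or scaffolds; only the list of input lengths changes.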

Fig. 1

Hi-C interaction heatmap for the final T. japonicus genome assembly. Interactions are resolved at 2 Mbp, and the data are normalized by read coverage. The grid spacing in the plot is 50 Mbp. The highlighted blocks represent the 24 chromosomes of the T. japonicus nuclear genome.

Repeat element annotation

Repeat elements were annotated through homology-based and de novo prediction approaches. A combined homology database was established using partitions 0, 2, and 6 of Dfam 3.8 (Dfam TE Tools Container v1.89.2)11 and Repbase (2018/10/26)12, limited to the taxonomic level of Actinopterygii. TRF (v4.09)13 and RepeatModeler (v2.0.5)11, which in turn utilizes RepeatScout (v1.0.6)12, RECON (v1.08)14, and LTR_retriever (v2.9.0)15, were used for de novo repeat discovery. In total, 4,672,592 elements were identified, spanning 695,266,837 bp and covering 61.10% of the genome (Table 2). These included Class II DNA transposons at 20.77%, long terminal repeats (LTRs) at 9.00%, long interspersed nuclear elements (LINEs) at 8.36%, simple repeats at 2.95%, rolling circle transposons (RC) at 1.77%, short interspersed nuclear elements (SINEs) at 0.87%, and satellites at 0.80%. Notably, repeat patterns that did not match any known transposable elements (classified as “Unknown”) occupied 15.67% of the genome (Fig. 2).
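The overall repeat fraction can be reproduced from the numbers given above. The sketch below assumes that the denominator is the final assembly length reported in the assembly section (1,138,003,112 bp); the class percentages are copied from the text, and any remainder is presumably accounted for by categories itemized only in Table 2.

```python
# Values taken from the text; the denominator is an assumption (final assembly length).
REPEAT_BP = 695_266_837
ASSEMBLY_BP = 1_138_003_112

# Percent of the genome covered by each repeat class listed in the text.
class_pct = {
    "DNA transposons": 20.77,
    "LTR": 9.00,
    "LINE": 8.36,
    "Simple repeats": 2.95,
    "RC": 1.77,
    "SINE": 0.87,
    "Satellite": 0.80,
    "Unknown": 15.67,
}

total_pct = round(100 * REPEAT_BP / ASSEMBLY_BP, 2)
listed_pct = round(sum(class_pct.values()), 2)
unlisted_pct = round(total_pct - listed_pct, 2)

print(total_pct, listed_pct, unlisted_pct)
```

The classes named in the text sum to slightly less than the overall coverage, so the small difference should be read against the full breakdown in Table 2 rather than as a discrepancy.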

Table 2 Summary of repeat elements and non-coding sequences in the Trichiurus japonicus nuclear genome.
Fig. 2

The genome annotation of the Japanese cutlassfish. The genomic features of Trichiurus japonicus were summarized using a Circos plot comprising 24 chromosomes. From the outermost to the innermost track, the plot displayed the density of protein-coding genes, Class II DNA transposons, SINEs, LTRs, LINEs, and GC content, calculated in 0.1 Mb windows.

Annotation of protein-coding and non-coding RNA genes

The Braker316 annotation pipeline (docker image: braker3: v3.0.7.6) was applied to the genome from the previous section after additional TRF soft-masking for gene structure prediction17,18,19,20, with both transcriptional data and protein alignments as evidence. RNA-seq data were downloaded from the NCBI SRA, including SRR32402605 (adult blood), SRR32402606 (adult muscle), SRR32402607 (adult kidney), SRR32402608 (adult swim bladder), SRR32402609 (adult skin), SRR32402610 (adult eye), SRR32402611 (adult gill), SRR32402612 (adult brain), SRR32402614 (adult liver), and SRR32402615 (adult heart). Raw reads, totaling 64.3 Gbp, were preprocessed with Cutadapt (v4.9)21 to remove library adapters, reads shorter than five bp, and low-quality bases at the 3′-end, and were aligned to the Japanese cutlassfish reference genome using STAR (v2.7.11b)22. Properly mapped read pairs were retained using samtools (version 1.21)23. Protein sequences from related teleost species were compiled as evidence for protein-coding gene prediction, including proteins in OrthoDB (version 12, https://bioinf.uni-greifswald.de/bioinf/partitioned_odb12/), Actinopterygii (Taxon ID: 7898) proteins from UniProtKB24, and Scombriformes (Taxonomy ID: 1489894) proteins from the annotated typical genome in NCBI. Sequence preparation followed the Braker3 team’s instructions16. Furthermore, the BUSCO lineage “actinopterygii_odb10” was used to help reduce the error rate. For species with multiple full genome assemblies available in NCBI, the primary assembly was used for protein sequence extraction. As a result, 26,541 protein-coding genes comprising 488,952 exons were annotated with a median length of 7,391 bp (Table 3), from which 39,048 mRNAs were derived; 2,233 genes were single-exon genes.
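Summary counts such as the numbers of genes, mRNAs, and exons can be tallied directly from a GFF3 annotation file. The following is a minimal sketch, assuming a standard nine-column GFF3 layout with `Parent` attributes on exon features; the function name `summarize_gff3` and the toy records are hypothetical and are not taken from the deposited annotation.

```python
from collections import Counter


def summarize_gff3(lines):
    """Count features by type and transcripts that carry exactly one exon."""
    feature_counts = Counter()
    exons_per_transcript = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 record
        feature_type, attributes = cols[2], cols[8]
        feature_counts[feature_type] += 1
        if feature_type == "exon":
            attr = dict(kv.split("=", 1) for kv in attributes.split(";") if "=" in kv)
            for parent in attr.get("Parent", "").split(","):
                if parent:
                    exons_per_transcript[parent] += 1
    single_exon = sum(1 for n in exons_per_transcript.values() if n == 1)
    return feature_counts, single_exon


# Hypothetical toy records for illustration only.
toy = [
    "##gff-version 3",
    "chr1\tBRAKER\tgene\t1\t900\t.\t+\t.\tID=g1",
    "chr1\tBRAKER\tmRNA\t1\t900\t.\t+\t.\tID=g1.t1;Parent=g1",
    "chr1\tBRAKER\texon\t1\t300\t.\t+\t.\tID=g1.t1.e1;Parent=g1.t1",
    "chr1\tBRAKER\texon\t600\t900\t.\t+\t.\tID=g1.t1.e2;Parent=g1.t1",
    "chr1\tBRAKER\tgene\t2000\t2500\t.\t-\t.\tID=g2",
    "chr1\tBRAKER\tmRNA\t2000\t2500\t.\t-\t.\tID=g2.t1;Parent=g2",
    "chr1\tBRAKER\texon\t2000\t2500\t.\t-\t.\tID=g2.t1.e1;Parent=g2.t1",
]
counts, single_exon_transcripts = summarize_gff3(toy)
print(dict(counts), single_exon_transcripts)
```

This sketch counts single-exon transcripts; counting single-exon genes, as reported in the text, would additionally require a rule for genes with several isoforms, which is not specified here.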

Table 3 Basic statistical results of nuclear protein-coding gene prediction.

The unmasked nuclear genome assembly was used as input for predicting non-coding RNA (ncRNA) genes. The tRNA genes and pseudogenes were identified using tRNAscan-SE (v2.0.12) with a two-step filtering process. First, tRNAscan-SE was applied with parameters -E -I --detail -H to detect all potential tRNA sequences. Second, the EukHighConfidenceFilter was applied with default cutoff settings to distinguish high-confidence tRNAs from pseudogenes. The EukHighConfidenceFilter default criteria were: (i) secondary filtering domain-specific model score ≥50, (ii) secondary filtering secondary structure score ≥10, (iii) secondary filtering isotype-specific model score ≥70, (iv) tertiary filtering isotype-specific model starting score ≥70, and (v) tertiary filtering maximum isotype-specific model score ≤95. Sequences not meeting these criteria were classified as pseudogenes. Infernal (v1.1.5, parameters: -Z 2276.006224 --cut_ga --rfam --nohmmonly --fmt 2 --clanin Rfam.clanin --oskip Rfam.cm)25, referring to the Rfam database (v15.0)26, was used to predict other ncRNA types, including rRNAs, snRNAs, snoRNAs, miRNA precursors, and signal recognition particle RNAs. This approach detects ncRNA families with well-defined primary sequences and conserved secondary structure motifs. Because most lncRNAs lack conserved secondary structures and are not comprehensively represented in Rfam, only a small number of known lncRNA families were annotated in this study. A more complete lncRNA annotation will require full-length transcriptome sequencing and specialized discovery pipelines in future work.
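The -Z value passed to Infernal appears to be the search-space size in megabases, computed as twice the assembly length (both strands) divided by 10^6, which is the convention recommended for genome-wide Rfam searches. This reading is an inference from the numbers in the text rather than a statement from the authors; the helper name `cmscan_z` is hypothetical.

```python
# Final assembly length reported in the assembly section (bp).
ASSEMBLY_BP = 1_138_003_112


def cmscan_z(total_bp):
    """Search-space size in Mb: both strands of the query, divided by 1e6."""
    return total_bp * 2 / 1_000_000


z_value = cmscan_z(ASSEMBLY_BP)
print(f"-Z {z_value:.6f}")
```

Setting -Z explicitly keeps E-values comparable when the genome is searched in chunks, because each chunk is then scored against the size of the whole search space.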

A total of 16,383 ncRNA genes were predicted, including 12,075 tRNAs, 1,728 rRNAs, 1,554 snRNAs, 849 miRNAs, 40 Metazoan signal recognition particle RNA (Metazoa SRP), 113 MALAT1-associated small cytoplasmic RNAs/MEN beta RNAs (mascRNA-menRNA), and other minor ncRNA types, accounting for 0.1694% of the genome. Additionally, 6,699 pseudogenes and 244 cis-regulatory elements (cis reg elements) were predicted in tRNAscan-SE (Table 4).

Table 4 Summary of annotated non-coding RNA genes, pseudogenes, and cis-regulatory elements in the Trichiurus japonicus nuclear genome.

The mitogenome of T. japonicus was annotated using MITOS2 (v2.1.9) with the --best parameter, applying the vertebrate mitochondrial genetic code (code 2) and referencing the RefSeq 89 Metazoa dataset. As a result, 13 protein-coding genes, 21 tRNA genes, and 2 rRNA genes were identified (Fig. 3, Table 5). The annotation plot of the mitochondrial genome was generated with OGDRAW (https://chlorobox.mpimp-golm.mpg.de/OGDraw.html).

Fig. 3

Circular map of the Trichiurus japonicus mitochondrial genome (16,796 bp). Protein-coding genes are color-coded by respiratory complex: Complex I (yellow), Complex IV (pink), ATP synthase (green), and other genes (purple). Transfer RNAs (blue) and ribosomal RNAs (red) are also shown. The inner grey ring represents the GC content calculated in 4-bp sliding windows.

Table 5 Basic statistical results of mitochondrial genes.

Functional annotation of proteins

Functions and structural domains were annotated for the putative proteins of both the nuclear and mitochondrial genomes of the Japanese cutlassfish, comprising a total of 39,061 proteins from 26,554 genes. The following tools were applied: DIAMOND (v2.1.10.164, parameters: --header simple --max-target-seqs 1 --outfmt 6 qseqid stitle pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids --evalue 0.001)27 for sequence homology searches against the NCBI non-redundant (nr) protein database28; the KEGG Automatic Annotation Server (KAAS, v2.1, model: ghostz model with BBH method)29,30 for KEGG pathway assignment against 38 fish species comprising 1,034,578 sequences; InterProScan (v5.73–104.0, parameters: -goterms)31,32 for protein domain detection; and DeepTMHMM (v1.0.24)33 accompanied by the DeepTMHMM_parser (GitHub: https://github.com/soldatsm/DeepTMHMM_parser) for transmembrane feature detection. Gene Ontology terms were derived from the InterProScan results.
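The tabular DIAMOND output described above can be loaded with a short parser. The sketch below assumes the 13 columns listed in the --outfmt option, in that order, and skips a leading header row if one is present; the function name `parse_hits` and the example record are hypothetical and do not come from the actual results.

```python
import csv
import io

# Column order as given in the --outfmt 6 option in the text.
COLUMNS = [
    "qseqid", "stitle", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore", "staxids",
]


def parse_hits(handle):
    """Return {query_id: hit_record} from tab-separated DIAMOND output."""
    hits = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0] == "qseqid" or row[0].startswith("#"):
            continue  # skip blank lines and header rows
        record = dict(zip(COLUMNS, row))
        for key in ("pident", "evalue", "bitscore"):
            record[key] = float(record[key])
        hits[record["qseqid"]] = record
    return hits


# Hypothetical example: one header row and one made-up hit.
example = (
    "\t".join(COLUMNS) + "\n"
    + "g1.t1\thypothetical protein [Thunnus maccoyii]\t91.5\t200\t17\t0\t"
    + "1\t200\t1\t200\t1e-50\t350.0\t8240\n"
)
hits = parse_hits(io.StringIO(example))
print(hits["g1.t1"]["stitle"], hits["g1.t1"]["evalue"])
```

Because the search was run with --max-target-seqs 1, each query is expected to contribute at most one row, so a dictionary keyed by query identifier is sufficient in this sketch.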

The DIAMOND blastp results showed that 98.70% of the protein sequences, corresponding to 98.15% of the genes (Table 6), had homologs in the nr database. Of these identified proteins, 88.54% were informative, originating from 85.39% of the total genes. The top three species most frequently assigned by DIAMOND blastp were Thunnus maccoyii, Thunnus albacares, and Thunnus thynnus, accounting for 50.70% of all annotated proteins (Table 7).

Table 6 Basic statistics of functional annotation with DIAMOND blastp.
Table 7 Top three species contributing to the functional annotation of Trichiurus japonicus proteins based on Diamond blastp results.

The performance of annotations from these tools was visualized using a Venn diagram generated with the R package Venn (GitHub: https://github.com/dusadrian/venn). All applied tools annotated proteins from 17,700 genes (66.69% of all genes), whereas 1,506 genes (5.67%) received no annotation (Fig. 4).

Fig. 4
Fig. 4
Full size image

Venn diagram illustrating the annotation coverage across different tools. The pink area with a red border represented all protein-coding genes (“all”, for nuclear + mitochondrial, 26,554 genes). More than half of the genes were annotated by all applied methods, whereas 1,506 genes (5.67%) lacked any annotation information.

DeepTMHMM analysis detected 5,746 (14.72%) alpha-helical transmembrane proteins without a signal peptide (TM), 2,764 (7.08%) alpha-helical transmembrane proteins with a signal peptide (SP + TM), 3,192 (8.17%) globular proteins with a signal peptide (SP), 27,346 (70.03%) globular proteins without a signal peptide (GLOB), and 13 (0.03%) beta-barrel transmembrane proteins (BETA) (Table 8).

Table 8 DeepTMHMM annotation result summary.

Data Records

This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession JBDZUI000000000. The version described in this paper is JBDZUI010000000. The associated raw sequence data are under accession SRP58275134 and the assembly under accession GCA_050140585.135. The raw reads of PacBio HiFi (SRR33408711), Nanopore (SRR33408712), 10x (SRR33408710), and Hi-C (SRR33408713) can be retrieved under this SRP archive. Genome annotation files (GFF3, GTF and FASTA formats), repeat sequence annotations, and the complete mitochondrial genome assembly are available through Zenodo36. All datasets are publicly accessible and adhere to open-access data sharing principles, ensuring reproducibility and enabling future research applications.

Technical Validation

Genome completeness of the assembly built in this study was assessed using BUSCO (version: 5.8.2)37 with the actinopterygii_odb10 reference dataset, which comprised 3,640 conserved single-copy orthologs (BUSCOs) derived from 26 teleost genomes. The genome recovered 3,617 (99.4%) markers completely, including 2,925 (80.4%) complete single-copy BUSCOs and 692 (19.0%) complete and duplicated BUSCOs (Table 9). Fragmented and missing BUSCOs were 15 (0.4%) and 8 (0.2%), respectively. For the predicted proteome, 3,612 (99.2%) BUSCOs were recovered as complete BUSCOs; fragmented and missing BUSCOs were 10 (0.3%) and 18 (0.5%), respectively.
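The BUSCO percentages follow directly from the category counts and the size of the actinopterygii_odb10 dataset. The short check below recomputes them from the values in the text, assuming rounding to one decimal place.

```python
N_BUSCO = 3640  # markers in actinopterygii_odb10, as stated in the text

# Category counts reported for the genome assembly and the predicted proteome.
genome = {"single": 2925, "duplicated": 692, "fragmented": 15, "missing": 8}
proteome = {"complete": 3612, "fragmented": 10, "missing": 18}


def pct(count, total=N_BUSCO):
    """Percentage of BUSCO markers, rounded to one decimal place."""
    return round(100 * count / total, 1)


genome_complete = genome["single"] + genome["duplicated"]
print(genome_complete, pct(genome_complete))
print({name: pct(count) for name, count in genome.items()})
print({name: pct(count) for name, count in proteome.items()})
```

In both cases the categories sum to the full set of 3,640 markers, which is a quick sanity check when transcribing BUSCO summaries into a table.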

Table 9 Assessment of completeness of T. japonicus assembly and its nuclear structural annotation of protein-coding genes.

The assembled mitochondrial genome was compared with the published complete T. japonicus mitochondrial genome available in NCBI (accession: NC_011719.1) using the BLASTn algorithm, revealing a sequence similarity of 99.32%. To assess the suitability of the assembly as a representative genome for sequence-based studies, two raw read datasets from the NCBI SRA repository, originally generated for a genome assembly pipeline, were used to evaluate mapping efficacy: SRR32395546, a 150 bp paired-end HiSeq read set generated on the Illumina platform, and SRR32395604, an Oxford Nanopore Technologies long-read dataset. The Illumina short reads were preprocessed with Cutadapt (v4.9)21 (parameters: -cut 60 --cut -60 --discard --error-rate 0.25 --overlap 10 -m 500 --max-aer 0.15), and the ONT reads were preprocessed with Porechop (v0.2.4, https://github.com/rrwick/Porechop). Mapping rates and other indicators for both read sets were assessed using QUAST (v5.2.0, parameter: -large)38 (Table 10). The mapping rates of short and long reads were 99.39% and 99.94%, respectively, indicating that most reads could be successfully aligned to the genome assembly presented in this study. The quarTeT (v1.2.5)39 TeloExplorer and CentroMiner modules were applied to identify telomeres and centromeres (Fig. 5). Motifs for telomere and centromere detection were based on species-specific or known consensus sequences. Evidence from transposable elements (TEs), generated by EDTA (version: 2.2.2)40, was used to assist centromere determination. TEs overlapping with coding DNA sequences (CDS) predicted by BRAKER3 were excluded from the analysis.

Table 10 Mapping Statistics and Genome Coverage of Trichiurus japonicus Using Illumina and Nanopore Data.
Fig. 5

Telomeres and centromeres identified in the Japanese cutlassfish genome.