Background & Summary

Genomic resources, particularly genome sequences, are important for a wide range of genetic studies. Whole-genome sequences help in examining chromosomal evolution through comparative genomics, dissecting the genomic architecture of ecological adaptation, pinpointing the genes responsible for notable phenotypes, and elucidating the divergence and speciation of organisms1,2,3. High-throughput sequencing technologies and cost-effective, accurate genome assembly algorithms have promoted the assembly and release of numerous genome sequences and have substantially advanced genomics, offering comprehensive and novel insights into the fundamental mechanisms behind various biological questions of interest4,5.

The Japanese anchovy (Engraulis japonicus) is a small marine finfish of the order Clupeiformes, distributed in the marginal seas of the northwest Pacific, from the Sea of Japan in the north to the East China Sea in the south6. With its large biomass in the region, this anchovy plays a pivotal role in the food chain as both a forage fish and a food fish7. During the late 1990s, its peak annual catch was about one million tons8. However, due to high fishing pressure and the adverse effects of global climate change on marine ecosystems, its population size has declined substantially8,9, and the species has recently been classified as overexploited. Like some other migratory fish in the region such as Larimichthys polyactis and L. crocea10, E. japonicus migrates between spawning and overwintering grounds11. So far, the presence of genetic variation among different migratory stocks of E. japonicus remains controversial, primarily because studies have used different genetic markers and analytical methods of differing resolution12,13,14,15. Population genetic studies based on sequence variation in mitochondrial cytochrome b (Cyt b) and mitochondrial DNA control region fragments revealed no significant genetic structure across the wide-ranging populations of E. japonicus in the northwestern Pacific12,13. However, another molecular analysis using fragments of the Cyt b gene revealed considerable genetic variation among populations in the southern East China Sea14. Similarly, a study utilizing six microsatellite loci detected weak but significant genetic differentiation between populations from the northeastern and southwestern coasts of Taiwan15.
Marginally significant genetic differentiation was also observed between regional populations, such as between the “Bohai Sea population (BHS)” and the “Japan Sea population (JPS)”, and between the “North Yellow Sea population (NYS)” and the JPS, using restriction-site associated DNA sequencing (RADseq)16. As these studies show, traditional approaches that rely on limited genetic data from narrow genomic regions may not fully capture the population structure of E. japonicus. The discrepancies between these studies may therefore hinder the accuracy and effectiveness of fisheries management and conservation efforts. Recently, genome scans based on whole-genome sequencing data have identified numerous loci under putative natural selection. These loci, which show significant genetic differentiation among stocks, can be used to discriminate among the stocks within a given population, which is helpful for the management and conservation of fishery resources10,16,17,18. Such genomic resources are therefore invaluable for investigations of adaptive evolution, population dynamics, and genetic conservation.

Despite its ecological and commercial importance, the genomic features of this species remain unknown. Previous investigations were mostly concerned with identifying population structure using microsatellites15, mitochondrial DNA markers12,13, and RADseq16. So far, no transcriptome or genome sequence datasets have been reported for this species. Moreover, genomic data for anchovies in general are limited, with genome sequences available for only six species: Coilia nasus, C. grayii, Encrasicholina punctifer, Engraulis encrasicolus, Setipinna tenuifilis, and Thryssa baelama. This scarcity has greatly hindered our understanding of the evolutionary processes and environmental adaptations within the family Engraulidae and the broader order Clupeiformes.

To address this, we used Pacific Biosciences (PacBio) HiFi long-read, Hi-C (chromosome conformation capture), and Illumina short-read sequencing technologies to construct a high-quality chromosome-level genome sequence of the Japanese anchovy. We also annotated the genome and analysed it in comparison with related species. The workflow of de novo genome assembly and annotation is shown in Fig. 1. This highly accurate, chromosome-level reference genome will support research on the population genetics and evolutionary biology of this species and will enable comparative genomic studies among species of the order Clupeiformes.

Fig. 1

Overview of the chromosome-level genome assembly and annotation. Chrs: chromosomes. We first used 95.0 Gb of short-read sequencing data for K-mer analysis, which predicted a genome size of approximately 1,045.1 Mb, with repeat content and heterozygosity of approximately 54.0% and 2.3%, respectively. Then, the 51.3 Gb of PacBio CCS data were assembled into a 1,467.6 Mb assembly with a contig N50 of 456.3 kb. With the assistance of 109.5 Gb of Hi-C reads, the contigs were anchored into 24 pseudo-chromosomes covering roughly 95.2% of the genome assembly. The final assembly consisted of 24 pseudo-chromosomes totalling 1,423.3 Mb of the E. japonicus genome, with a scaffold N50 of 55.0 Mb. The genome contained 54.9% repeat sequences, and 23,709 (97.15%) of the 24,405 predicted protein-coding genes were functionally annotated. Genes were predicted by combining RNA-seq and Iso-Seq evidence, the genome sequence, and homologous proteins.

Methods

Ethics statement

All experiments were performed according to the Guidelines for the Care and Use of Laboratory Animals in China. All experimental procedures and sample collection methods were approved by the Institutional Animal Care and Use Committee (IACUC) of Yellow Sea Fisheries Research Institute, CAFS under approval No. YSFRI-2022041.

Sample collection and sequencing

A mature female E. japonicus (Fig. 2) was obtained from the coastal waters of the Yellow Sea, close to Qingdao, China. Its dorsal muscle was collected for DNA extraction using a standard sodium dodecyl sulfate (SDS) extraction method. The concentration and quality of the extracted genomic DNA (gDNA) were then quantified and assessed using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific) and by running a 0.8% agarose gel, respectively. The high-quality gDNA was first used to construct a short-insert library of approximately 350 bp using the TruSeq DNA PCR-Free kit (Illumina, USA). The library was sequenced on the Illumina NovaSeq 6000 platform (Illumina, USA), and approximately 101 Gb of 2 × 150 bp reads were generated (Table 1). Long-read sequencing was carried out on the same sample using PacBio HiFi sequencing technology (Pacific Biosciences, USA). A standard PacBio library with an insert size of 20 kb was prepared using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, USA). The library was sequenced on a PacBio Sequel II system (Pacific Biosciences, USA), yielding a total of 51.3 Gb of PacBio HiFi reads, with an N50 length of 17.4 kb (Table 1). Lastly, a Hi-C library was constructed according to a previous protocol19 with some modifications20. In brief, muscle samples from the same sequenced individual were cross-linked using 4% formaldehyde. The fixed samples were then homogenized to isolate the nuclei, and the DNA was digested with the MboI restriction enzyme (NEB, USA). The digested products underwent sequential treatments for end repair, biotin labelling, and ligation of blunt-end fragments. The ligated DNA was sheared into fragments with a peak size of 400 bp, which were then used to construct a standard DNA library using the TruSeq DNA Sample Prep Kit (Illumina, USA). The Hi-C library was sequenced for 2 × 150 bp reads on the Illumina NovaSeq 6000 platform, generating a total of 109.5 Gb of reads (Table 1).

Fig. 2

The mature female Japanese anchovy (Engraulis japonicus) obtained from the coastal waters of the Yellow Sea.

Table 1 Summary statistics of sequencing libraries and reads used in this study.

For transcriptome sequencing, samples of the brain, ovary, heart, muscle, and liver were obtained from the same sequenced individual for RNA extraction using TRIzol™ Reagent (Thermo Fisher Scientific, USA). The concentration and quality of the total RNA were quantified and evaluated using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, USA) and by running a 1.0% agarose gel, respectively. Total RNA from each tissue sample was used to construct mRNA libraries using the TruSeq RNA Library Prep Kit v2 (Illumina, USA). The libraries were then sequenced on the Illumina NovaSeq 6000 platform (Illumina, USA), yielding an average of 5.58 Gb of 2 × 150 bp reads for each transcriptome sample (Table 1).

Chromosome-level genome assembly

The Illumina reads were first cleaned using NGSQCToolkit v2.321. The cleaned reads were then used to estimate genome parameters from the 17-mer frequency distribution using GenomeScope v2.0922. The estimated genome size, heterozygosity, and repeat content were 1,045.1 Mb, 2.3%, and 54.0%, respectively. Subsequently, the PacBio HiFi reads were assembled into contigs using Hifiasm v0.19.523 with default parameters. The assembled contigs were then polished using Pilon v1.2224, also with default parameters. The total length and N50 of the assembled contigs were approximately 1,467.6 Mb and 456.3 kb, respectively (Table 2).
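The idea behind a K-mer-based genome size estimate can be illustrated with a minimal Python sketch. It is not the model used by GenomeScope, which fits a mixture distribution that also accounts for heterozygosity and repeats. It only shows the simplest form of the estimate, in which the total number of K-mer occurrences is divided by the depth of the main peak. The function names, the `min_depth` error cut-off, and the toy histogram are assumptions made for illustration.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Count every k-mer in the reads and return {depth: number of distinct k-mers}."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Naive estimate: total k-mer occurrences divided by the main peak depth.

    k-mers seen fewer than min_depth times are treated as sequencing errors
    (a hypothetical cut-off chosen for this illustration).
    """
    usable = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth


# Toy histogram: an error peak at depth 1 and a main peak at depth 20.
toy_histogram = {1: 5000, 19: 50, 20: 900, 21: 50}
print(estimate_genome_size(toy_histogram))  # 1000.0
```

In a highly heterozygous genome such as this one (2.3%), the histogram typically shows an additional peak at half the main depth, which is one reason why model-based tools are preferred over this naive calculation.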

Table 2 Summary statistics of the assembled contigs and scaffolds of Engraulis japonicus.

To achieve a chromosome-level assembly, raw Hi-C sequencing reads were first filtered using HiC-Pro v2.8.025. The cleaned reads were then used to anchor the assembled contigs into scaffolds using the Juicer26 and 3D-DNA19 pipelines. The assembled scaffolds were manually curated using Juicebox27, with a prior setting of 24 haploid chromosomes28. Consequently, 95.2% of the assembled contigs were anchored to 24 pseudochromosomes (Fig. 3A), with individual chromosome lengths ranging from 47.0 Mb to 69.1 Mb (Fig. 3B and Table 3). The total length of the chromosome-level genome assembly was 1,423.3 Mb, with a scaffold N50 of 55.0 Mb (Table 2). The assembly is larger than the genome size predicted by K-mer analysis (1,045.1 Mb); this discrepancy can be attributed to the tendency of short-read-based estimates to underestimate the size of highly repetitive and heterozygous genomes29.
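For readers unfamiliar with the N50 statistic reported for contigs and scaffolds, the following minimal Python sketch shows the standard definition: the length of the sequence at which the cumulative length, counted from the longest sequence downwards, first reaches half of the total assembly length. The function name and the toy lengths are illustrative and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy example: total = 30, half = 15; 10 + 8 = 18 reaches the halfway point.
print(n50([10, 8, 6, 4, 2]))  # 8
```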

Fig. 3

Chromosome-level assembly and features of the Engraulis japonicus genome. (A) Genome-wide chromatin interactions in the E. japonicus genome revealed by heatmap. (B) Circos plot of genomic features in the E. japonicus genome in a 100-kb window size. Each circle from outside to inside represents GC content along individual pseudochromosomes with indicated length (a), gene density (b), density of repetitive sequences (c), density of LTR elements (d), density of LINE elements (e) and density of DNA transposable elements (f).

Table 3 Summary statistics of the length of pseudochromosomes of Engraulis japonicus.

Repetitive sequence annotation

Repetitive sequences were annotated using RepeatMasker v4.0.630, based on the RepBase database v20210131 and a custom repeat library. The custom repeat library was generated using RepeatModeler v2.0.532 with default parameters. Additionally, LTR_FINDER v1.0633 and Tandem Repeats Finder v4.0734 were independently employed to identify long terminal repeats and tandem repeats, using default parameters. The predictions of these programs were then consolidated into a nonredundant library of repetitive sequences within the genome, which was subsequently used for annotation with RepeatMasker. A total of 780.9 Mb, constituting 54.9% of the assembled genome, was annotated as repetitive sequences (Table 4). Among these repeats, long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), and long terminal repeats (LTRs) accounted for 6.3%, 0.9%, and 9.1% of the genome, respectively (Table 4).

Table 4 Summary statistics of the predicted sequence repeats in the assembled genome of Engraulis japonicus.

Gene prediction and functional annotation

Protein-coding genes were predicted on the repeat-masked genome using homology-, evidence- and ab initio-based prediction methods. For the homology-based gene prediction, protein sequences of Alosa alosa (GCF_017589495.1), A. sapidissima (GCF_018492685.1), S. tenuifilis (v1)35, C. nasus (v1)36, and Danio rerio (NCBI, GCF_000002035.6) were aligned to the E. japonicus genome assembly using BLASTP v2.2.2437 with default parameters. For the evidence-based annotation, the transcriptome data described above were assembled using Trinity v2.1.138 with default parameters and then condensed into a nonredundant transcript dataset that served as supporting evidence for prediction. The Maker v2.53 pipeline39 was employed to consolidate the predictions from both the homology- and evidence-based approaches. Predicted gene models were iteratively trained using SNAP v2006.07.2840, GeneMark-EP v4.7241, and Augustus v3.3.242 for three iterations. Subsequently, predicted gene models that contained transposable element (TE) domains and lacked transcript support were removed. As a result, a total of 24,405 nonredundant protein-coding genes were predicted. Comparison of the gene set of E. japonicus with those of A. alosa, A. sapidissima, S. tenuifilis, C. nasus, and D. rerio showed a similar distribution pattern in the length of genes (Fig. 4A), exons (Fig. 4B), and coding sequences (CDS) (Fig. 4C) among these fish species.

Fig. 4

Features of the predicted protein coding genes in the Engraulis japonicus genome. (A) Distribution of the length of genes among six studied species. (B) Distribution of the length of exons. (C) Distribution of the length of coding sequences (CDS). (D) Summary statistics of the number of genes annotated by different databases: NR, InterPro, KEGG, KOG and SwissProt.

Additionally, all predicted genes were functionally annotated by mapping to public databases including SwissProt, Nr, KEGG, InterPro, COG, KOG, and Pfam. In total, 23,709 genes were classified by at least one of these databases, accounting for 97.1% of all the predicted protein-coding genes in the E. japonicus genome (Table 5 and Fig. 4D). Furthermore, genes coding for tRNA were predicted using tRNAscan-SE v1.3.143 with default parameters. Genes for rRNA were predicted by aligning to invertebrate template rRNA sequences using BLASTN v2.2.2437 with an E-value of 1e-5. Genes for both snRNAs and miRNAs were then identified using INFERNAL v1.1.144 against the Rfam database (release 12.0). In total, 23,984 non-coding RNAs (ncRNAs) were predicted, including 19,120 tRNAs, 229 rRNAs, 1,492 miRNAs, and 3,143 snRNAs (Table 6).
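The annotation summary figures above can be cross-checked with a few lines of Python. The sketch below uses only the counts reported in the text. The variable names are illustrative, and the check covers only internal consistency, not the underlying annotation.

```python
# Counts as reported in the text.
predicted_genes = 24_405
annotated_genes = 23_709
ncrna_counts = {"tRNA": 19_120, "rRNA": 229, "miRNA": 1_492, "snRNA": 3_143}

# Share of predicted protein-coding genes with at least one database hit.
annotated_pct = 100 * annotated_genes / predicted_genes

# The ncRNA classes should add up to the reported total of 23,984.
ncrna_total = sum(ncrna_counts.values())

print(f"{annotated_pct:.2f}% annotated; {ncrna_total} ncRNAs")
# 97.15% annotated; 23984 ncRNAs
```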

Table 5 Summary statistics of the numbers of predicted protein coding genes in the assembled genome of Engraulis japonicus.
Table 6 Summary statistics of noncoding RNAs in the genome assembly of Engraulis japonicus.

Data Records

All raw sequencing data are available at NCBI under BioProject PRJNA108287745. The genome assembly and annotations are available on figshare46 and at the CNGB under accession number CNP000537747. The assembled genome is also available in NCBI GenBank under accession number GCA_040112795.148.

Technical Validation

Evaluation of the genome assembly

To evaluate the quality of the genome assembly, the completeness of the genome sequence was first assessed against the Actinopterygii database (actinopterygii_odb10) of Benchmarking Universal Single-Copy Orthologs (BUSCO, v5.7.1). The genome assembly exhibited a high level of completeness, with a complete BUSCO value of 94.07%, of which 88.71% were complete and single-copy and 5.36% were complete and duplicated. Only 1.7% of BUSCOs were fragmented, and 4.2% were missing from the genome assembly (Table 7). We retrieved the genome assemblies of Clupeiformes archived in NCBI and found only 23 species with available genome sequences, of which only 10 had chromosome-level genome assemblies (Table 8). The complete BUSCO value of E. japonicus (94.07%) is comparable to those of the high-quality chromosome-level genome assemblies of Clupeiform species archived in NCBI, which range from 84.5% to 95.6% with a median value of 92% (Table 8). Furthermore, both the PacBio HiFi long reads and Illumina short reads were aligned to the genome assembly using minimap2. The mapping rates for PacBio and Illumina reads were 99.91% and 97.97%, respectively (Table 9). Finally, the consensus quality value (QV), representing per-base consensus accuracy, was estimated using Merqury (v1.3), resulting in a QV of 49.74. Taken together, these data indicate that the genome assembly of E. japonicus is of both high completeness and high quality.
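To make the QV figure more tangible, the sketch below converts it to a per-base error probability, assuming the usual Phred scaling QV = -10 log10(error rate), and also checks that the BUSCO categories add up as reported. The function and variable names are illustrative and are not part of Merqury or BUSCO.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Inverse conversion: per-base error probability to Phred-scaled QV."""
    return -10 * math.log10(error_rate)


# Values as reported in the text.
busco = {"single_copy": 88.71, "duplicated": 5.36}
complete = busco["single_copy"] + busco["duplicated"]
error = qv_to_error_rate(49.74)

print(f"complete BUSCOs: {complete:.2f}%")  # complete BUSCOs: 94.07%
print(f"per-base error: {error:.2e}")       # per-base error: 1.06e-05
```

Under this scaling, a QV of 49.74 corresponds to roughly one consensus error per 10^5 bases.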

Table 7 Assessment of the completeness of the genome assembly of Engraulis japonicus using BUSCO.
Table 8 Comparison of the genome assemblies of Clupeiform species.
Table 9 Coverage statistics of PacBio HiFi long reads and Illumina short reads.