Background & Summary

The alligator gar (Atractosteus spatula, Lacepède 1803) is one of seven extant species of the ancient family Lepisosteidae, which includes two genera: Atractosteus with three species (tropical, Cuban, and alligator gars) and Lepisosteus with four species (spotted, Florida, longnose, and shortnose gars)1. The slowly evolving genomes of gars have attracted increasing attention in recent years. Studies of the spotted gar (Lepisosteus oculatus) genome have revealed the value of holostean genomes in comparative research, offering significant insights into the evolution of vertebrate immunity, development, and the roles of regulatory sequences2. The draft genome of the alligator gar was used to examine the transition of vertebrates from aquatic to terrestrial environments3. More recently, the genome of the longnose gar (Lepisosteus osseus) highlighted the potential of holostean genomes for understanding the evolution of vertebrate repetitive elements and provided a critical reference for comparative genomic studies using ray-finned fish models4.

Recent reports of alligator gar invasions underscore the urgency of prioritizing management of this species. Alligator gar is native to northern and central parts of the United States and Mexico5,6. It has been distributed to numerous countries through the aquarium industry. Considered invasive in China, Singapore, Indonesia, Turkmenistan, and several other nations, it has been documented as invading 47,287 locations worldwide7,8. It was first detected in Baiyun Lake, Guangzhou, Guangdong Province, in February 2019 and subsequently expanded its range to other provinces, including Hunan, Guangxi, Shandong, Sichuan, Qinghai, Jiangsu, and Yunnan7. Owing to their inherent biological characteristics, alligator gars have three primary advantages for survival in freshwater ecosystems. Firstly, they have a large body size and overlapping ganoid scales. As the largest species within the gar family, typical adult alligator gars reach a length of about 2 m (6.5 feet) and weigh over 45 kg (100 pounds)9. Their tough bone-like scales are covered by an enamel-like substance, rendering them nearly impenetrable10,11. Secondly, alligator gars display high fecundity and produce toxic eggs. Compared with the spotted gar and longnose gar, the alligator gar lays the greatest number of eggs per gram of body weight12. In addition, their eggs and yolk sacs are extremely toxic to crustaceans and to vertebrates other than teleosts13. Lastly, alligator gars are voracious predators that feed on blue crabs, waterfowl, turtles, small mammals, carrion, and discarded waste around docks and piers14.

Currently, high-quality sequencing technologies offer immense potential for unraveling the genetic basis of biological characteristics in many species at the genome-wide level15,16. Although a draft genome of the alligator gar based on second-generation sequencing is available, its fragmented nature largely limits studies of speciation and chromosome evolution3. In this study, we generated a chromosome-level genome assembly of the alligator gar with high contiguity, completeness, and accuracy using Oxford Nanopore Technology and Hi-C sequencing. The assembled genome was 1.05 Gb, with a contig N50 of 15.7 Mb and a scaffold N50 of 56.8 Mb. Using the Hi-C data, the assembled sequences were further clustered and ordered into 28 pseudochromosomes (2n = 56; lengths from 10.2 Mb to 76.3 Mb). Approximately 323 Mb of sequence was annotated as repeat elements, constituting 30.91% of the genome. We predicted 19,103 protein-coding genes, of which 99.83% were functionally annotated. In summary, the genomic resources presented in this study will deepen our understanding of the genomic basis of the ecology, evolution, and invasiveness of alligator gars.

Methods

Sample collection and ethics statement

A six-month/1-year-old female alligator gar weighing 2.8 kg and measuring 41.3 cm in body length was collected by the Northeast Institute of Geography and Agroecology, Chinese Academy of Sciences, Jilin Province, China (Supplementary Fig. 1). The alligator gar was captured with standard mini-fyke nets (0.6 m × 1.2 m frame, with a 4.6-m-long lead and 3-mm mesh17) and electrofishing in Dehui, Jilin Province, China, and then placed in a live well for further processing. The otoliths of the alligator gar were removed to determine its age18. All experiments on the alligator gar were approved under the project ID “DLS20220131-001” by the Northeast Institute of Geography and Agroecology, Chinese Academy of Sciences.

Nucleic acid extraction, library construction and sequencing

For the Nanopore library, a total of 8–10 µg of high-quality genomic DNA was extracted from a muscle sample, and DNA molecules >50 kb were selected with BluePippin (Sage Science, Beverly, MA, USA). A standard library was constructed with the Ligation Sequencing Kit 1D following the Nanopore library construction protocol. ONT long reads were sequenced on the PromethION P48 sequencer (Grandomics, Beijing, China). RNA was extracted from three tissues (blood, muscle, and skin) of the same individual using TRIzol reagent (Invitrogen, USA) according to the manufacturer’s instructions. cDNA libraries were prepared by reverse transcription of 200 to 400 bp RNA fragments and sequenced on the Illumina NovaSeq 6000 platform (Grandomics, Beijing, China). For short-insert whole-genome sequencing (WGS), we first isolated total genomic DNA from muscle samples (~2 g) using a phenol-chloroform protocol together with ethanol precipitation19 and prepared the DNBSEQ libraries following the manufacturer’s instructions. The libraries were then sequenced on the DNBSEQ-T1 sequencer (MGI Tech, Guangdong, China) with paired-end 100 bp reads. For Hi-C library construction, freshly collected liver samples (~2 g) were crosslinked with formaldehyde to fix the chromatin conformation, and the crosslinked DNA was digested with the DpnII restriction endonuclease. The Hi-C library with a 350 bp insert size was sequenced on a DNBSEQ-T7 sequencer (Grandomics, Beijing, China).

To obtain a high-quality reference genome of the alligator gar, we generated 65.43 Gb raw long reads (60.68 Gb pass long reads, 57.8-fold) using Oxford Nanopore Technology (ONT) for de novo assembly (Table 1). The average read length of long reads was 18.0 kb and the N50 was 30.5 kb. We also generated 206.46 Gb whole-genome sequencing (WGS) data (196.6-fold) for the genome survey and polishing (Table 1). We generated 8.82 Gb transcriptomic data for genome annotation (Table 1).
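The fold-coverage values quoted above can be reproduced with a short calculation. The sketch below assumes that depth was computed as total sequenced bases divided by the 1.05 Gb assembly size; the text does not state the denominator, and the function name is illustrative only.

```python
def fold_coverage(total_bases_gb: float, genome_size_gb: float) -> float:
    """Sequencing depth = total sequenced bases / genome size (both in Gb)."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_gb / genome_size_gb


# Assumption: the 1.05 Gb assembly size is the denominator used for depth.
ASSEMBLY_SIZE_GB = 1.05

ont_depth = fold_coverage(60.68, ASSEMBLY_SIZE_GB)   # pass ONT long reads
wgs_depth = fold_coverage(206.46, ASSEMBLY_SIZE_GB)  # DNBSEQ short reads

print(f"ONT: {ont_depth:.1f}x, WGS: {wgs_depth:.1f}x")
# prints "ONT: 57.8x, WGS: 196.6x"
```

Both results agree with the 57.8-fold and 196.6-fold figures in the text, which supports the assumed denominator.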

Table 1 Summary of genome assemblies and gene annotations in the alligator gar genome.

Genome size estimation and de novo assembly

Before the de novo genome assembly, we performed a genome survey to estimate the genome size from the DNBSEQ short-read data using KmerFreq v1.020 with a k-mer size of 17. The de novo genome assembly was performed with NextDenovo (v2.5.0; https://github.com/Nextomics/NextDenovo) using default parameters. We used NextCorrect and NextGraph, two core modules of NextDenovo, to correct the raw Nanopore long reads, extract consensus sequences, and build the initial assembly. We then improved the single-base accuracy of the draft genome assembly with NextPolish v1.4.021, using both ONT long reads and DNBSEQ short reads for six rounds of polishing. For chromosome-level assembly, Hi-C reads were aligned to the polished genome assembly using Burrows-Wheeler Aligner (BWA, v0.7.17)22. Juicer v1.523 was used for Hi-C data quality control, and the 3D-DNA pipeline v19071624 was applied to concatenate the scaffolds into a chromosome-level genome. Juicebox v1.11.0825 was used for final manual correction.

The genome size was estimated to be 1.19 Gb from the 17-mer frequency distribution (Fig. 1a). A total of 107.81 Gb of Hi-C reads were generated to concatenate primary contigs into a chromosome-level genome assembly. We then anchored scaffolds to a cluster map, capturing 98.26% of assembled bases in 28 pseudochromosomes (2n = 56; lengths from 10.2 Mb to 76.3 Mb; Table 2 and Supplementary Fig. 2), which is consistent with the karyotype study of Echelle et al.26. The remaining 11 scaffolds had a total length of 18.2 Mb (1.74%; lengths from 5 kb to 5.7 Mb). In summary, the alligator gar genome assembled in this study was 1.05 Gb in length, with a contig N50 of 15.7 Mb and a scaffold N50 of 56.8 Mb. The completeness of the final chromosome-level genome reached 96.7% (96% complete and in single copy) by BUSCO analysis (Fig. 1b).
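The k-mer-based size estimate can be checked from the values given in the Fig. 1a caption. The sketch below uses the common approximation that genome size equals the total number of k-mers divided by the depth at the main peak. Whether KmerFreq applies additional corrections (for example for sequencing errors or heterozygosity) is not described in the text, so this is a simplified illustration rather than the tool's exact procedure.

```python
def estimate_genome_size(total_kmers: int, peak_depth: float) -> float:
    """Approximate genome size (bp) as total k-mer count / main-peak depth."""
    if peak_depth <= 0:
        raise ValueError("peak depth must be positive")
    return total_kmers / peak_depth


# Values taken from the Fig. 1a caption (17-mers, peak depth 73X).
size_bp = estimate_genome_size(86_674_739_928, 73)
print(f"{size_bp / 1e9:.2f} Gb")
# prints "1.19 Gb"
```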

Fig. 1

Genome assembly of the alligator gar. (a) K-mer frequency distribution at a k-mer size of 17. A k-mer is a subsequence of k nucleotides obtained by artificially dividing a read. The peak depth was 73X. The total number of 17-mers present in this subset was 86,674,739,928. (b) BUSCO scores of the draft, polished, and final chromosome-level genomes. (c) General view of the alligator gar genome in nonoverlapping 500 kb windows, with the following tracks: (a) circular map of 28 chromosomes; (b) heat map of gene density (the darker the colour, the higher the density); (c) GC content; (d) long terminal repeat (LTR) number; (e) repeat number.

Table 2 Statistical results of the 28 pseudochromosomes of the alligator gar genome.

By combining third-generation ONT long reads with large-scale Hi-C data, the chromosome-level genome assembled in this study showed substantial improvements over the previously released assembly (GCA_016984175.1, Supplementary Table 1)3 in the following respects. (1) The number of scaffolds was reduced from 81,747 to 39. (2) The contig N50 and scaffold N50 were 785-fold and 41-fold higher, respectively, than those of the previously released assembly. (3) The proportion of gap regions (Ns) decreased by a factor of 459, from 5.967% in the previously released assembly to 0.013% in our assembly. (4) The BUSCO score increased by 2.1%, indicating higher completeness of our assembly. These findings demonstrate the reliability and improved quality of our chromosome-level assembly. Furthermore, the combination of homology-based protein alignment, de novo prediction, and transcriptomic mapping in our study yielded more PCGs than were reported for the previously published alligator gar genome (GCA_016984175.1, 18,839 PCGs)3, suggesting that the enhanced contiguity and completeness of our assembled genome allowed for more precise gene prediction.
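For readers unfamiliar with the contiguity metrics compared above, the following minimal sketch shows how N50 and a fold change are computed. The sequence lengths in the example are hypothetical toy values, not data from either assembly; only the gap percentages (5.967% and 0.013%) come from the text.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def fold_change(new_value: float, old_value: float) -> float:
    """Ratio of a metric between two assemblies."""
    return new_value / old_value


# Hypothetical toy contig lengths (arbitrary units), for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))  # prints 70

# Gap proportions quoted in the text: 5.967% (previous) vs 0.013% (this study).
print(round(fold_change(5.967, 0.013)))  # prints 459
```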

Genome annotation

Prior to gene prediction and annotation, repetitive elements in the genome were annotated by integrating homology-based and de novo strategies. For the de novo method, RepeatModeler v2.027 and LTR_retriever28 were used to identify repeat elements, which were then added to the known repeat database REPBASE v21.0129. The genome was then aligned to REPBASE using RepeatMasker v4.0.530, RepeatProteinMask, and TRF v4.07b31 at both the DNA and protein levels. Finally, we obtained a nonredundant repeat set. The gene density, GC content, repeat number, and LTR number of the 28 chromosomes were further analyzed in nonoverlapping 500 kb windows and visualized using CIRCOS v0.69-832.
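The per-window statistics shown in the Circos plot can be illustrated with GC content as an example. The text does not say how the window values were computed before plotting, so the sketch below is one plausible approach under stated assumptions: ambiguous bases (N) are excluded from the denominator, and a shorter final window is kept rather than dropped. The function name and toy sequence are hypothetical.

```python
def gc_content_windows(sequence: str, window: int = 500_000):
    """Return (start, end, gc_fraction) for nonoverlapping windows.

    Assumptions: N and other ambiguous bases are ignored, and the last
    window may be shorter than `window`.
    """
    results = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window].upper()
        called = sum(chunk.count(base) for base in "ACGT")
        gc = chunk.count("G") + chunk.count("C")
        fraction = gc / called if called else float("nan")
        results.append((start, start + len(chunk), fraction))
    return results


# Toy sequence with a tiny window so the output is easy to inspect.
toy = "GGCCAATT" + "ATATATAT" + "GCNN"
for start, end, gc in gc_content_windows(toy, window=8):
    print(start, end, gc)
# prints:
# 0 8 0.5
# 8 16 0.0
# 16 20 1.0
```

The resulting tuples map directly onto the `chromosome start end value` rows that Circos reads for heat map and histogram tracks.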

After masking the repeat elements in the genome, three strategies were used for protein-coding gene prediction. Firstly, for the de novo strategy, we ran the prediction using Augustus v3.0.333. Secondly, for the transcriptome-based strategy, transcripts were assembled from the RNA-seq data using StringTie v1.3.3b34. Finally, for the homology-based strategy, protein sequences of the spotted gar (LepOcu1, GenBank ID: GCF_000242695.1)2, coelacanth (Latimeria chalumnae, LatCha1, GenBank ID: GCF_000225785.1)35, bichir (Polypterus senegalus, ASM1683550v1, GenBank ID: GCF_016835505.1)3, and paddlefish (Polyodon spathula, ASM1765450v1, GenBank ID: GCF_017654505.1)36 were mapped to the alligator gar genome using TBLASTN v2.9.037 with an E-value cutoff of 1e-5, and GeneWise v2.4.138 was used to predict the potential gene structures (Supplementary Table 2). The final protein-coding gene set was produced by combining the results of these three strategies using the MAKER pipeline v3.01.0339. For functional annotation, this gene set was searched against five publicly available resources, namely Swiss-Prot, TrEMBL, InterProScan v5.52-86.040, GO terms, and KEGG, using BLAST v2.2.2637 (E-value cutoff of 1e-5). For ncRNA prediction, miRNAs and snRNAs were identified by searching the Rfam database (Release 12.0)41. The tRNA genes were predicted with tRNAscan-SE v1.3.142, and the rRNA genes were identified by aligning human rRNA sequences to the genome using BLAST.
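The E-value cutoff of 1e-5 used for both the homology search and the functional annotation can be applied as a simple post-filter. The sketch below assumes that the BLAST results are in the standard 12-column tabular format (`-outfmt 6`), in which the E-value is the 11th column; the text does not state which output format was used, and the sequence identifiers in the example are hypothetical.

```python
E_VALUE_CUTOFF = 1e-5  # cutoff stated in the text


def filter_hits(lines, cutoff: float = E_VALUE_CUTOFF):
    """Keep (query, subject, evalue) for hits with E-value at or below cutoff.

    Assumes BLAST tabular output (-outfmt 6): qseqid, sseqid, pident,
    length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows
        evalue = float(fields[10])
        if evalue <= cutoff:
            kept.append((fields[0], fields[1], evalue))
    return kept


# Hypothetical hits for illustration; IDs and scores are made up.
example = [
    "protA\tchr1\t85.0\t200\t30\t0\t1\t200\t1000\t1600\t1e-30\t250",
    "protB\tchr2\t40.0\t50\t30\t0\t1\t50\t500\t650\t0.5\t30",
]
print(filter_hits(example))
# prints [('protA', 'chr1', 1e-30)]
```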

Combining the de novo and homology-based predictions, we identified 323 Mb of repeat elements, accounting for 30.91% of the alligator gar genome (Supplementary Table 2). The predominant repeat types were long terminal repeats (LTRs, 11.73%), long interspersed elements (LINEs, 5.69%), and DNA transposons (4.96%) (Table 3, Supplementary Table 3, and Supplementary Fig. 3).

Table 3 Statistics of repeat elements identified by the de novo method.

We predicted 19,103 protein-coding genes (PCGs) through the combination of homology-based protein alignment, de novo prediction, and transcriptomic mapping. The average lengths of PCGs, exons, and introns were 2,120.55 bp, 166.71 bp, and 2,231.85 bp, respectively (Table 1 and Fig. 2). Of these predicted PCGs, 19,070 (99.83%) received at least one functional assignment (Table 4 and Fig. 3). We further plotted the distribution of gene density, GC content, and repeat density across the 28 pseudochromosomes (Fig. 1c). We also predicted 22,559 noncoding RNAs (ncRNAs), including 191 microRNAs (miRNAs), 10,015 transfer RNAs (tRNAs), 9,524 ribosomal RNAs (rRNAs), and 2,829 small nuclear RNAs (snRNAs) (Table 5).
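The annotation totals in this paragraph are internally consistent, as the following short check shows. All numbers are taken from the text; the variable names are illustrative.

```python
# ncRNA counts by class, as reported in the text.
ncrna_counts = {"miRNA": 191, "tRNA": 10_015, "rRNA": 9_524, "snRNA": 2_829}
total_ncrna = sum(ncrna_counts.values())

# Functionally annotated PCGs out of all predicted PCGs.
annotated_pcgs, predicted_pcgs = 19_070, 19_103
annotated_pct = annotated_pcgs / predicted_pcgs * 100

print(total_ncrna)             # prints 22559
print(f"{annotated_pct:.2f}")  # prints 99.83
```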

Fig. 2

Comparisons of CDS length, mRNA length, intron length, and exon length among five species (Atractosteus spatula, Lepisosteus oculatus, Polypterus senegalus, Polyodon spathula, and Latimeria chalumnae).

Table 4 Statistics on functional annotation of the alligator gar gene set.
Fig. 3

Venn diagram representing the functional annotation of the alligator gar gene set.

Table 5 Statistics of non-coding RNA annotation.

Data Records

The raw sequencing data for this study are deposited in the NCBI under BioProject ID: PRJNA116104143. Illumina, transcriptome, and PacBio sequencing data are available under the Sequence Read Archive (SRA) with the accession number SRP53704644. The assembled genome has been deposited in the GenBank database under the accession number GCA_043380575.145. Additionally, assembled genome and annotations can be downloaded from Figshare46 under https://doi.org/10.6084/m9.figshare.27193392.

All assemblies and raw sequencing data generated in this study have also been deposited in the CNGB Sequence Archive (CNSA)47 (https://db.cngb.org/cnsa/) of the China National GeneBank DataBase (CNGBdb)48 under accession number CNP0003816.

Technical Validation

Benchmarking Universal Single-Copy Orthologs (BUSCO) v3.1.049 was used to evaluate the completeness of the draft, polished, and final chromosome-level genomes in genome mode (-m genome) with the 3,354 core vertebrate genes of the vertebrata_odb10 dataset. The completeness of the final chromosome-level genome reached 96.7% (96% complete and in single copy) by BUSCO analysis. Gene set completeness was also evaluated against the vertebrata_odb10 dataset using the protein mode (-m protein) of BUSCO. For the predicted PCGs, 94.5% of the core vertebrate genes were complete, 2.5% were fragmented, and 3.0% were missing. Furthermore, the total size of the assembled genome is similar to that estimated by Jellyfish.
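BUSCO reports complete genes as the sum of single-copy and duplicated genes, and the complete, fragmented, and missing categories together account for the whole dataset. The small sketch below uses only the percentages quoted above to derive the duplicated fraction for the genome and to confirm that the gene set categories sum to 100%. The helper names are illustrative and are not part of BUSCO.

```python
def duplicated_pct(complete: float, single_copy: float) -> float:
    """Complete BUSCOs = single-copy + duplicated, so D = C - S."""
    return round(complete - single_copy, 1)


def categories_sum_to_100(complete: float, fragmented: float,
                          missing: float, tol: float = 0.1) -> bool:
    """Check that C + F + M covers the whole BUSCO dataset (within rounding)."""
    return abs(complete + fragmented + missing - 100.0) <= tol


# Final chromosome-level genome: 96.7% complete, 96% single-copy.
print(duplicated_pct(96.7, 96.0))            # prints 0.7

# Predicted gene set: 94.5% complete, 2.5% fragmented, 3.0% missing.
print(categories_sum_to_100(94.5, 2.5, 3.0))  # prints True
```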