Background & Summary

The scalloped spiny lobster, Panulirus homarus, belongs to the genus Panulirus (Crustacea: Decapoda: Palinuridae) and comprises three economically valuable subspecies: Panulirus h. homarus (Fig. 1a), Panulirus h. megasculptus, and Panulirus h. rubellus1,2. P. h. rubellus has also been reported to be a distinct species3,4. The species is an “engineer species” in coral reef and rocky ecosystems: by preying on benthic invertebrates (e.g., sea urchins, shellfish), it controls their population sizes, maintains ecological balance, and prevents the overpopulation of certain species that could otherwise destroy habitat structure5. In addition, P. homarus is widely distributed in the Indo-West Pacific (Fig. 1b), making it a good model for genetic comparisons among geographic populations and for resolving gene flow, species differentiation, and ecological adaptation mechanisms4,6.

Fig. 1

Photograph and geographic distribution of the long-tailed marine-living scalloped spiny lobster, P. homarus. (a) Photograph of an adult P. h. homarus. (b) Natural distribution map of P. homarus, indicated by the red star.

With increasing pressure on marine ecosystems globally, aquaculture is increasingly recognized as an important strategy for mitigating the depletion of wild fisheries resources7. P. homarus grows more rapidly than other lobsters and has been farmed successfully on a large scale in Vietnam and Indonesia, making it the optimal candidate for intensive large-scale lobster aquaculture8,9. Currently, artificial nursery technology for the scalloped spiny lobster remains underdeveloped, with larvae mainly captured from the wild and then farmed in cages or industrial recirculating aquaculture systems9,10. Aquaculture research on P. homarus covers a range of topics, including resource assessment9 and the effects of nutrition11,12, salinity13, light14, and temperature15 on growth and reproduction. In addition, high-throughput sequencing technologies have advanced the study of lobster genomes.

Currently, only the genomes of Homarus americanus16 and P. ornatus17 have been formally reported. Genome assembly data for 13 Panulirus species are available in the NCBI database (Table S1); however, the average size of these assemblies is only 1.5 Gb, which is considerably smaller than the anticipated genome size for lobsters. For instance, the existing P. h. homarus assembly is merely 1.3 Gb, with a scaffold N50 of only 2.9 kb, which is inadequate for further research (Table S1). A comprehensive understanding of P. homarus is essential for effective management of its resources and the development of sustainable aquaculture practices.

In the present study, we adopted a multi-platform sequencing approach, combining Illumina short-read sequencing, PacBio long-read sequencing, and Hi-C chromosome conformation capture, to generate a chromosome-level genome assembly for P. h. homarus (Fig. 2). The project generated 140.56 Gb of Illumina short-read data, 341.51 Gb of PacBio long-read data, and 364.32 Gb of Hi-C data, culminating in a final assembled genome with a size of 2.61 Gb, a contig N50 of 5.43 Mb, and a scaffold N50 of 36.69 Mb (Tables 1, 2). The chromosome-level assembly substantially enhances the genomic resources available for lobsters and provides a crucial reference genome.

Fig. 2

Genomic landscape of P. h. homarus. Circos plot illustrating the genomic features of P. h. homarus. From the outermost to innermost rings: (a) gene density, (b) GC content, (c) density of DNA transposons, (d) density of Long Terminal Repeats (LTRs), (e) density of Long Interspersed Nuclear Elements (LINEs), and (f) density of Short Interspersed Nuclear Elements (SINEs), all calculated in 200-kb genomic windows.

Table 1 Statistics of the sequencing data.
Table 2 Assembly statistics of the P. h. homarus genome.

Methods

Sample collection

An adult male P. h. homarus specimen was sourced from Hainan Yonghe Biotechnology Co., Ltd. (Qionghai, Hainan, China). Muscle tissue was collected after the lobster had been anesthetized using cryogenic methods. The surface of the tissue was washed several times with sterile phosphate-buffered saline to remove bacteria and impurities. Genomic DNA (gDNA) for the genome survey and library construction was extracted from the muscle tissue using the AMPure bead cleanup kit (Beckman Coulter, High Wycombe, UK) in accordance with the manufacturer’s instructions.

Total RNA was isolated from eye stalk, hemocyte, liver, muscle, intestine, and gills of the same specimen using TRIzol reagent, according to the manufacturer’s protocol. The integrity and quality of the RNA were evaluated by 1.5% agarose gel electrophoresis, whereas the concentrations were quantified precisely using a Qubit fluorometer (Thermo Fisher Scientific, Waltham, MA, USA).

Genome sequencing

A short-read library with an insert size of 350 bp was constructed and sequenced on the Illumina NovaSeq 6000 platform (Illumina Inc., San Diego, CA, USA), generating 2 × 150-bp paired-end reads. In total, 0.08 μg of gDNA per sample was used as input material for DNA library preparation. Libraries were prepared using the NEBNext® Ultra™ DNA Library Prep Kit (New England Biolabs, Ipswich, MA, USA), in accordance with the Illumina second-generation sequencing protocol, resulting in 140.56 Gb of raw data (Table 1).

For PacBio sequencing, gDNA was used to construct SMRTbell libraries, which were sequenced on the PacBio Sequel platform (PacBio, Menlo Park, CA, USA) using single molecule real-time (SMRT) technology. In brief, the genomic DNA was first sheared into 6–20-kb fragments using g-TUBE. Subsequently, ExoVII (New England Biolabs, Beverly, MA, USA) was used to remove single-strand overhangs, followed by DNA damage repair with the SMRTbell Express Template Preparation Kit 2.0 (PacBio). T4 DNA polymerase and T4 PNK (New England Biolabs) were used to repair the ends, making them suitable for ligation of SMRTbell hairpin adapters. After ligation, ExoIII and ExoVII (New England Biolabs) were used to remove imperfect templates, and AMPure PB beads were used for purification. Sequencing primers were then annealed to the SMRTbell templates, and polymerase was bound to the template ends using the Binding Kit (PacBio). Finally, the library was loaded onto SMRT Cells for sequencing. A total of 341.51 Gb of continuous long reads was obtained, corresponding to approximately 131-fold coverage of the P. h. homarus genome (Table 1).

For Hi-C sequencing, high molecular weight gDNA was first cross-linked and then digested using the MboI restriction enzyme (New England Biolabs). Following 5′ overhang biotinylation and blunt-end ligation, the DNA was mechanically sheared into 300–500-bp fragments. Finally, the Hi-C library was sequenced on the Illumina NovaSeq 6000 platform (Illumina Inc., San Diego, CA, USA) using a 2 × 150-bp paired-end strategy, yielding 364.32 Gb of raw reads, with a sequencing depth of 140× (Table 1).
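The fold-coverage figures quoted above follow from dividing the sequencing yield by the genome size. The sketch below illustrates that arithmetic; it assumes that the denominator is the 2.61-Gb final assembly size (the text does not state which genome size was used), and the function name `fold_coverage` is introduced here purely for illustration.

```python
def fold_coverage(data_gb: float, genome_gb: float) -> float:
    """Return sequencing depth as total sequenced bases / genome size."""
    if genome_gb <= 0:
        raise ValueError("genome size must be positive")
    return data_gb / genome_gb


# Assumption: depth is computed against the 2.61-Gb final assembly.
ASSEMBLY_GB = 2.61

# Raw data yields reported in the text (Gb).
yields_gb = {"Illumina": 140.56, "PacBio": 341.51, "Hi-C": 364.32}

depths = {name: fold_coverage(gb, ASSEMBLY_GB) for name, gb in yields_gb.items()}

for name, depth in depths.items():
    print(f"{name}: {depth:.1f}x")
```

Under this assumption, the PacBio and Hi-C values round to the reported 131-fold and 140× depths, respectively.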

For RNA sequencing, libraries of the six tissues were prepared using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (New England Biolabs), following the manufacturer’s protocol. The RNA-seq libraries were subsequently sequenced on the Illumina NovaSeq 6000 platform using a 2 × 150-bp paired-end strategy, producing 55.7 Gb of clean reads.

Genome survey and assembly strategy

Prior to genome assembly, adapter sequences and low-quality reads were filtered from the short-read sequencing data using Fastp (v0.23.1)18 with default parameters, so that only high-quality clean reads were retained for downstream analyses. A genome survey was then performed to estimate essential genomic characteristics, including genome size, heterozygosity, and repeat content, by analyzing the 17-mer frequency distribution using SOAPec (v2.01)19 and GenomeScope (v2.0)20. Based on these analyses, the genome size of P. h. homarus was estimated to be 3,127.74 Mb, with 1.04% heterozygosity and 66.75% repetitive sequences, at a dominant peak depth of 26 (Table S2, Fig. S1).
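The principle behind a k-mer-based genome size estimate can be illustrated with the classical estimator, in which the total number of k-mers is divided by the depth of the dominant peak. The following is a minimal sketch of that estimator only; it is not the model fitting performed by SOAPec or GenomeScope, and the toy histogram, the function name `estimate_genome_size`, and the `min_depth` error cutoff are all hypothetical.

```python
def estimate_genome_size(histogram: dict[int, int], min_depth: int = 3) -> tuple[int, float]:
    """Estimate genome size from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (a hypothetical cutoff chosen for this illustration).
    Returns (peak_depth, estimated_genome_size).
    """
    usable = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Dominant peak: the depth shared by the largest number of distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of depth * distinct k-mers.
    total_kmers = sum(depth * count for depth, count in usable.items())
    return peak_depth, total_kmers / peak_depth


# Toy histogram (not real data): an error spike at depths 1-2,
# a main peak at depth 26, and a small repeat peak at depth 52.
toy_histogram = {1: 5000, 2: 800, 24: 200, 25: 900, 26: 1500, 27: 900, 28: 200, 52: 50}

peak, size = estimate_genome_size(toy_histogram)
print(f"peak depth = {peak}, estimated size = {size:.0f} bp")
```

With real data, the histogram would come from a k-mer counter run on the clean Illumina reads, and heterozygosity and repeat content would require a fitted model rather than this single ratio.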

Two independent assemblers, Wtdbg2 (v2.5)21 and Flye (v2.9)22, were run with default parameters to assemble the P. h. homarus genome. Each draft assembly was polished using Arrow (v8.0)23, and the two polished drafts were then merged using Quickmerge (v0.3)24. The merged assembly was refined further with two rounds of Arrow polishing followed by two rounds of Pilon (v1.22)25 polishing, both with default parameters. PacBio subreads were used for Arrow polishing, whereas Illumina short reads were used for Pilon polishing to ensure high sequence accuracy. This iterative process produced 7,135 contigs with a total assembly length of 2,720,451,987 bp (Table 2).

Chromosome-level assembly refinement

During Hi-C scaffolding, the Juicer pipeline26 was used to align Hi-C reads that had been quality-filtered with Fastp (v0.23.1)18 to the draft genome assembly. Contigs were then clustered into chromosomes, and ordered and oriented within each chromosome, using the 3D-DNA pipeline27. The assembly was refined further by manual error correction using Juicebox Assembly Tools (v2.13.06)26. In total, 2,613.14 Mb of sequence was anchored to 73 chromosomes (Fig. 3), representing 96.05% of the total genome assembly (Table S3). The final assembly achieved a scaffold N50 of 36.69 Mb, reflecting a high level of continuity (Table 2), and 37 of the chromosomes contain no more than 30 gaps each (Table 3).
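For readers unfamiliar with the continuity metrics used here, the sketch below shows how an N50 value is computed from a list of sequence lengths and how an anchoring rate can be derived from the totals given in the text. The toy lengths and the function name `n50` are hypothetical; the anchoring calculation assumes that the denominator is the 2,720,451,987-bp contig assembly, which may differ slightly from the total used in Table S3.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("no sequence lengths supplied")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy scaffold lengths (arbitrary units, not real data).
toy_lengths = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_lengths)
print(f"toy N50 = {toy_n50}")

# Anchoring rate from the figures reported in the text.
anchored_bp = 2613.14e6          # sequence anchored to 73 chromosomes
assembly_bp = 2_720_451_987      # assumption: contig assembly total as denominator
anchoring_rate = anchored_bp / assembly_bp * 100
print(f"anchoring rate = {anchoring_rate:.2f}%")
```

Under the stated assumption, the anchoring rate comes out within a few hundredths of a percentage point of the reported 96.05%.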

Fig. 3

Hi-C heatmap (200-kb resolution) displaying interaction frequencies between different chromosomes of the P. h. homarus genome.

Table 3 Assembly statistics for chromosomes.

Comprehensive annotation of repetitive elements and noncoding RNAs

Repetitive sequences in the P. h. homarus genome were predicted using two strategies: de novo prediction and homology-based searching28. For de novo prediction, RepeatScout (v1.0.5)28, RepeatModeler (v2.0.1)30, Piler (v1.0)31, and LTR-FINDER (v1.0.6)32 were employed to identify transposable element (TE) families, and the resulting de novo repeat library was integrated with the Repbase homologous repetitive sequence database29. Subsequently, distinct repetitive elements were classified using RepeatMasker (v4.1.0)30, RepeatProteinMask (v4.1.0), and TRF (v4.0.9)33 by aligning the P. h. homarus genome sequences against the integrated repetitive sequence database. After removing redundant entries across methods, repetitive sequences constituted 69.67% of the P. h. homarus genome (Table S4). Additionally, the Kimura divergence values of TEs were calculated using the ‘calcDivergenceFromAlign.pl’ script34, and TE landscapes were visualized with ‘createRepeatLandscape.pl’35. The repeat elements identified included DNA transposons (9.82% of the genome), long interspersed nuclear elements (37.44%), short interspersed nuclear elements (0.02%), and long terminal repeats (30.06%) (Table 4 and Fig. 4).
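The Kimura divergence underlying a repeat landscape is based on the Kimura two-parameter distance, K = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions between a TE copy and its consensus. The sketch below implements this plain formula for a pair of pre-aligned sequences; it is an illustration only, omits the CpG adjustment applied by the RepeatMasker utility script, and uses a hypothetical function name and toy sequences.

```python
import math

# Purine<->purine and pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def kimura_2p(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance between two aligned sequences.

    Columns containing gaps or ambiguous bases are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    p = transitions / compared    # transition proportion
    q = transversions / compared  # transversion proportion
    if 1 - 2 * q <= 0 or 1 - 2 * p - q <= 0:
        raise ValueError("sequences too divergent for the K2P model")
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Toy example: one transition (A->G) and one transversion (C->A) in 20 columns.
consensus = "ACGTACGTACGTACGTACGT"
te_copy = "GAGTACGTACGTACGTACGT"
distance = kimura_2p(consensus, te_copy)
print(f"K2P distance = {distance:.4f}")
```

Because the model corrects for multiple substitutions at the same site, the distance is slightly larger than the raw mismatch proportion (0.10 in this toy case).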

Table 4 Classification of repetitive sequences in the P. h. homarus genome.
Fig. 4

Distribution of divergence rates for transposable elements (TEs) in the P. h. homarus genome.

Specialized tools were used to annotate non-coding RNAs (ncRNAs) in the P. h. homarus genome. tRNAs were identified using tRNAScan (v1.4)36, whereas rRNAs were predicted via BLAST (v2.2.26)37. Additional ncRNAs, including miRNAs and snRNAs, were annotated by searching against the Rfam database38 using INFERNAL (v1.0)39. In total, four classes of ncRNAs were annotated, comprising 20,765 miRNAs, 3,608 tRNAs, 1,421 rRNAs, and 3,066 snRNAs (Table 5).

Table 5 Classification of non-coding RNAs in the P. h. homarus genome.

Integrated gene structure prediction and functional annotation

An integrated approach combining ab initio prediction, homology-based prediction, and transcriptome-based prediction was used to predict gene structures in the P. h. homarus genome. For ab initio (de novo) gene prediction, AUGUSTUS (v3.2.3)40, GlimmerHMM (v3.02)41, SNAP (v2013.11.29)42, Geneid (v1.4)43, and Genscan (v1.0)44 were employed to predict gene structures directly from the genome sequence. For homology-based annotation, protein sequences from Drosophila melanogaster (fruit fly), Penaeus chinensis (Chinese shrimp), Eriocheir sinensis (Chinese mitten crab), Litopenaeus vannamei (Pacific white shrimp), Marsupenaeus japonicus (Kuruma shrimp), P. ornatus (ornate spiny lobster), Portunus trituberculatus (swimming crab), and H. americanus (American lobster) were retrieved from NCBI’s GenBank database and aligned to the P. h. homarus genome using BLAST (v2.2.26)37 and Genewise (v2.4.1)45. This integrated strategy enabled comprehensive prediction of protein-coding genes. Depending on the reference species, between 8,545 and 178,660 homologous genes were identified (Table 6). Gene length, along with CDS, exon, and intron lengths, was analyzed and compared with those of other species (Fig. 5). The mean transcript length in P. h. homarus was 31,472.77 bp, with CDS, exon, and intron lengths averaging 1,613.73 bp, 279.37 bp, and 6,251.44 bp, respectively (Table S5).

Table 6 Statistical analyses of gene structure annotation of the P. h. homarus genome.
Fig. 5

Comparisons of genomic elements across closely related species.

To further refine transcriptome assembly, two distinct methods were employed: genome-guided transcript assembly and de novo assembly using Trinity software (v2.11.0)46. Gene structures were identified using PASA (v2.1.0)47, and gene sets predicted through various methods were integrated into a non-redundant comprehensive gene set of 25,580 protein-coding genes using Evidence Modeler (v1.1.1)48 (Table 7 and Fig. 6a).

Table 7 Summary of functional gene annotation in the P. h. homarus genome.
Fig. 6

Gene prediction and functional annotation of the P. h. homarus genome. (a) Venn diagram illustrating the integration of gene set predictions from various methods. (b) Venn diagram showing overlap of functional annotations based on different databases.

Functional annotation of these protein-coding genes was performed using BLASTp (v2.2.26)37 and Diamond (v0.8.22)49 to align sequences against several key protein databases, including SwissProt50, NCBI Nonredundant protein (NR), KEGG51, InterPro52, Gene Ontology (GO)53, and Pfam54, with an E-value cutoff of 1E-5. Protein domains and motifs were annotated using InterProScan (v5.52-86.0)55. Among the 25,580 predicted genes, 25,575 (99.98%) were annotated in at least one database (Table 7), and 16,526 (64.61%) were supported by all four databases compared in the Venn diagram (Fig. 6b).

Data Records

We have deposited the Hi-C sequencing data (SRR30872734), Illumina sequencing data (SRR30872735), PacBio sequencing data (SRR30872736), and transcriptomic sequencing data (SRR3105790260 - SRR3105790765) in the SRA at NCBI56.

The Whole Genome Shotgun project has been deposited in DDBJ/ENA/GenBank under accession number GCA_043589495.157, and the genome assembly along with its annotation has been made available on Figshare58.

Technical Validation

The quality of the P. h. homarus genome assembly was validated through several complementary evaluations. First, completeness was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO) (v5.8.0)59 with the arthropoda_odb12 database. Using tools such as tBLASTn (v2.2.26)37, AUGUSTUS (v3.2.3)40, and HMMER60, 98.2% of the BUSCO orthologs were detected (97.2% complete and 1.0% fragmented), indicating a highly complete assembly (Table S6). Second, using the Core Eukaryotic Genes Mapping Approach (CEGMA) (v2.5)61, we identified homologs for 229 of the 248 highly conserved core eukaryotic genes (92.34%) in P. h. homarus, further supporting the completeness of the assembly (Table S7). Third, the consensus quality value and k-mer (k = 21) completeness of the assembly, evaluated using Merqury62, were 31.78 and 87.59%, respectively (Table S8). In addition, alignment of Illumina sequencing reads to the nuclear genome using BWA (v0.7.8)63 yielded a read mapping rate of 98.60% and a coverage rate of 94.85%, supporting the integrity of the assembled genome and the consistency of the sequencing data (Table S9). Finally, for genome-wide synteny analysis, we used MCScanX within the JCVI toolkit (v1.1.12) (https://github.com/tanghaibao/jcvi) to compare the genomes of P. h. homarus and P. ornatus, and visualized the macro-syntenic relationships using Circos (v0.69)64. The results showed that the 73 chromosome-level scaffolds of P. h. homarus exhibited significant synteny with the corresponding chromosomes of P. ornatus (Fig. S2). Together, these results support the high quality and completeness of the P. h. homarus genome assembly.
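To put the consensus quality value in context, a Phred-scaled QV relates to the per-base error probability E through QV = -10 log10(E). The sketch below converts the reported QV of 31.78 into an error rate and checks that the BUSCO categories sum as stated; the function name `qv_to_error_rate` is hypothetical, and the calculation is an illustration of the scale rather than a reproduction of the Merqury analysis.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


# Values reported in the text.
REPORTED_QV = 31.78
BUSCO_COMPLETE = 97.2
BUSCO_FRAGMENTED = 1.0
BUSCO_DETECTED = 98.2

error_rate = qv_to_error_rate(REPORTED_QV)
accuracy_pct = (1 - error_rate) * 100
bases_per_error = 1 / error_rate

print(f"error rate      = {error_rate:.2e}")
print(f"base accuracy   = {accuracy_pct:.3f}%")
print(f"bases per error = {bases_per_error:.0f}")

# Consistency check: complete + fragmented should equal detected.
busco_consistent = math.isclose(BUSCO_COMPLETE + BUSCO_FRAGMENTED, BUSCO_DETECTED)
print(f"BUSCO categories consistent: {busco_consistent}")
```

A QV just above 30 thus corresponds to roughly one consensus error per 1,500 bases, that is, a base-level accuracy above 99.9%.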