Background & Summary

Diatoms are unicellular, photosynthetic, eukaryotic microalgae that are primarily diploid during their vegetative phase. They are widely distributed across marine and freshwater systems1 and exhibit a vast array of shapes and sizes across their tens of thousands of species2. Despite this explosive diversification, diatoms emerged only in the early Mesozoic period3. Notably, diatom cells possess a unique silica-based cell wall called a frustule. Vegetative cells reproduce asexually, but the mechanism of frustule formation reduces cell size. This linear decrease in valve diameter with each cell division is termed the MacDonald–Pfitzer rule4. A substantial reduction in cell size prevents further division, resulting in cell death5,6,7. Consequently, diatoms restore their cell size through auxospore formation (considered a true sexual process) and vegetative cell enlargement (pseudo-auxospore formation, i.e., an asexual process)1,7. Diatoms are the most dominant microalgal group in coastal waters and sporadically form dense blooms8,9. The global net primary production (NPP) of all terrestrial and marine autotrophs is around 105 petagrams (Pg) of carbon annually10. Marine diatoms are estimated to contribute up to 25% (26 Pg C yr−1) of this total, surpassing the annual primary production of any terrestrial biome10,11,12. Diatoms also account for up to 50% of marine primary production and are considered essential components of the biological carbon pump13.

Since the release of the Thalassiosira pseudonana and Phaeodactylum tricornutum genomes14,15, additional diatom genomes representing both centric and pennate species have been published16,17. Genome information has accumulated, and advanced technologies such as Hi-C, optical genome mapping, and long-read sequencing have been developed16,17,18. Thanks to these technological innovations, genome models are now available with higher accuracy than ever before.

Investigations employing electron microscopy and rDNA sequences from marine strains have revealed that Skeletonema costatum sensu lato (s.l.) consists of a series of genetically and morphologically distinct species18,19,20,21. Since then, 27 Skeletonema species have been identified and taxonomically accepted22. This study conducted a comparative genome analysis to obtain basic genome information for Skeletonema, a genus characterized by a wide distribution range, rapid growth, and the formation of extensive blooms. We sequenced the genomes of nine species (11 strains) of the genus Skeletonema using a hybrid assembly pipeline that combines Illumina short-reads and Nanopore long-reads (Table 1). The analytical flow of the whole genome in Skeletonema is shown in Fig. 1.

Table 1 Basic information on the strains of Skeletonema and the sequencing data of Illumina short-reads and Nanopore long-reads.
Fig. 1 The analytical flow of the whole genome in Skeletonema.

KMC v3.2.023 and GenomeScope v2.024 estimated the haploid genome sizes as 43.3–85.5 Mbp (Table 2). Smudgeplot v0.2.424 revealed that the vegetative cells of Skeletonema are diploid (Fig. 2). After scaffolding, the number of contigs was 94–348, and the assembled haploid genome sizes were 40.3–69.3 Mbp. The N50 lengths were 0.35–1.09 Mbp, and the longest scaffolds were 2.1–4.6 Mbp, as calculated by QUAST25. GC contents were 44.7–45.6% (45.3 ± 0.7, average ± SD), and a one-sample t-test indicated no significant differences between the strains. BUSCO assessment showed that 90–98% (92.4 ± 2.3, average ± SD) of the orthologs conserved in Stramenopiles were present in the genome assemblies (sum of the percentages of single-copy and duplicated orthologs), suggesting that our draft genomes possess a sufficient gene repertoire for Stramenopiles (Table 2).
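As an illustration of the assembly statistics quoted above, the following minimal sketch computes an N50 and a GC percentage from sequence data. It uses the common textbook definitions and is not the QUAST implementation; the function names `n50` and `gc_percent` are hypothetical, and QUAST may handle details such as ambiguous bases differently.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold lengths.

    N50 is the length L such that scaffolds of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest scaffold downwards until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def gc_percent(sequence):
    """Return GC content (%) over unambiguous bases (A, C, G, T) only.

    Assumption for this sketch: N and other ambiguity codes are ignored.
    """
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    at = sequence.count("A") + sequence.count("T")
    if gc + at == 0:
        return 0.0
    return 100.0 * gc / (gc + at)


if __name__ == "__main__":
    toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]  # toy values, not from Table 2
    print("N50:", n50(toy_lengths))
    print("GC%:", gc_percent("GGCCAATT"))
```

Because N50 depends only on the sorted length distribution, it can be recomputed directly from the sequence lengths of any deposited assembly.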

Table 2 Statistics of the genome assembly and gene models in Skeletonema.
Fig. 2 The result of the analysis by Smudgeplot in Skeletonema.

The complete chloroplast and mitochondrial genome sequences were determined, and their sizes were 126.9–127.4 kbp and 36.1–41.4 kbp, respectively (Figs. 3, 4, Table 3). Gene organization of the plastid and mitochondrial genomes was similar and conserved among species, but non-coding sequences in the mitochondrial genome were variable. Repetitive regions of Skeletonema were also analysed. Interestingly, unclassified elements were the most abundant and accounted for 56–78% of the total repeat elements (Fig. 5). The estimated repeat regions had total lengths of 5.1–28.4 (16.0 ± 7.9, mean ± SD) Mbp and accounted for 11.0–41.1 (27.9 ± 10.3) % of the genomes (Fig. 6). Repetitive elements were reported to contribute 11.3 Mbp (19.9%) and 2.55 Mbp (7.6%) to the de novo Phaeodactylum tricornutum and Thalassiosira pseudonana genome assemblies, respectively14, and in this study the repeat content was larger in seven strains of five Skeletonema species than in T. pseudonana. The core genome size, defined here as the genome size excluding repeat elements, was 32.1–41.1 (38.4 ± 2.4, average ± SD) Mbp in Skeletonema and was more consistent among strains than the repeat element sizes (Fig. 6). Interestingly, the Spearman test detected a significant correlation between core genome size and protein-coding gene number, suggesting that gene number affects the core genome size in Skeletonema (Fig. 7). In contrast, whole genome size is largely determined by the repeat element sizes (Fig. 6).
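The core genome size and rank correlation described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the analysis code used in the study: the helper names are hypothetical, the input values in the usage example are invented placeholders, and no p-value is computed (a statistics package would normally be used for the significance test).

```python
import math


def core_genome_size(assembly_mbp, repeat_mbp):
    """Core genome size = assembly size minus total repeat length (both in Mbp)."""
    return assembly_mbp - repeat_mbp


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Invented placeholder values (Mbp and gene counts), NOT data from Table 2.
    assemblies = [45.0, 52.0, 60.0, 48.0]
    repeats = [8.0, 13.0, 20.0, 10.5]
    genes = [15500, 17000, 19800, 16200]
    cores = [core_genome_size(a, r) for a, r in zip(assemblies, repeats)]
    print("core sizes (Mbp):", cores)
    print("Spearman rho:", spearman_rho(cores, genes))
```

Spearman's rho measures only monotonic association between the two variables, so it supports a relationship between gene number and core genome size without establishing the direction of the effect.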

Fig. 3 The result of chloroplast genome annotation by GeSeq in Skeletonema.

Fig. 4 The result of mitochondrial genome annotation by GeSeq in Skeletonema.

Table 3 Genome sizes and accession numbers for sequencing and complete organelle genomes and nuclear rRNA genes in Skeletonema.
Fig. 5 The frequency of each repeat element identified by RepeatMasker in Skeletonema.

Fig. 6 The comparison of repeat size frequency in the whole genome size in Skeletonema.

Fig. 7 The relationship between the core genome size and protein-coding gene number in Skeletonema.

The predicted protein-coding gene numbers were 15,275–21,376. Bacterial genome contamination was <1% in eight strains, indicating successful cultivation under axenic conditions. More than 90% of the annotated genes originated from eukaryotes, mainly from Bacillariophyceae, indicating that the gene models are consistent with the systematic position of Skeletonema. In contrast, 3–30% bacterial genome contamination was confirmed in the remaining three strains (Fig. 8). Nevertheless, the BUSCO analysis with the Stramenopiles conserved gene database found 90–98% (93.8 ± 2.5, average ± SD) completeness in our annotation dataset (Table 2), which was similar to the values estimated for the genome assemblies, suggesting that our draft genomes possess a sufficient gene repertoire for Stramenopiles. We believe that these Skeletonema genomes will be crucial references and help deepen our understanding of diatom genome structure and function.

Fig. 8 Krona charts representing the taxonomic composition based on the results of the Diamond search of the predicted genes in Skeletonema. Bacterial genome contamination was <1% in eight strains, indicating successful cultivation under axenic conditions.

Methods

Sample collection and DNA and RNA extractions

Eleven clonal strains of nine Skeletonema species were established from blooms in various regions of Japan, Vietnam, Taiwan, and Sweden by micropipetting single chains (Table 1). The f/2 medium was modified by adding 10 μM selenious acid (H2SeO3) and omitting copper sulfate hydrate from the stock solution of the metal mixture26. Each clonal strain was maintained in 25 mL of f/2 medium prepared by enriching natural seawater collected from Tokyo Bay (salinity adjusted to 30 PSU) in a 75 mL capacity plastic tube at a temperature of 25 °C under an irradiance of 100 μmol m−2 s−1 provided by cool-white fluorescent lamps with a 12:12 h L:D cycle. For the whole genome analysis, each strain was incubated in 3–4 × 400 mL of the modified f/2 medium in 500 mL plastic flasks under the same conditions as the maintenance culture for 5–7 days, until the cultures reached the exponential or stationary growth phase. The vegetative cells were harvested by filtration through 1-μm-pore-size polycarbonate filters (Nuclepore membrane, GE Healthcare, Tokyo, Japan). Genomic DNA was extracted from the harvested cells on the filter with a modified SDS-Proteinase K method (TE buffer: 10% SDS: 20 mg mL−1 Proteinase K (Qiagen) = 16:15:1)27. Purified precipitates were dissolved in TE buffer (pH 8.0) and stored at −30 °C until further processing.

For RNA extraction, the strains were incubated under the same conditions as for DNA sequencing, but in only one 500 mL plastic culture flask containing 400 mL of the modified f/2 medium. At the end of the incubation, the cultures were harvested by filtration as described for DNA sequencing, and total RNA was immediately extracted with the PureLink RNA Mini Kit with TRIzol (Thermo Fisher Scientific, MA, USA) and stored at −80 °C until further processing.

Library preparation and sequencing

Before library preparation, quality checks (quantity of DNA, fragment size, and contamination by DNA/RNA, protein, or salt ions) were performed by gel electrophoresis, a Qubit fluorometer (Life Technologies, Carlsbad, CA, USA), and a general spectrophotometer. Gel electrophoresis of genomic DNA and RNA was performed using an agarose gel system and an Agilent 2100 Bioanalyzer (Agilent, Tokyo, Japan), respectively. Whether genomic DNA and RNA samples qualified was judged mainly by the sequencing companies, based on the quantity and concentration of DNA/RNA and the presence or absence of degradation. Only qualified samples were used for library preparation.

Sequencing of DNA and RNA was performed using the next-generation sequencers NovaSeq 6000 (Illumina, San Diego, CA, USA) or DNBSEQ-G400 (MGI Tech, Shenzhen, Guangdong, China). For genome sequencing, paired-end libraries (150PE) were constructed using the default protocol and used for sequencing (Table 1). The library preparation kits and insert sizes for genome and transcriptome sequencing by the short-read sequencers are shown in Table 4. A total of 9.1–30.0 Gbp of sequences were obtained (Table 1), corresponding to approximately 131–644× coverage of the Skeletonema genomes (40.3–69.3 Mbp, see below). For long-read sequencing using MinION (Oxford Nanopore Technologies, Oxford, UK), the extracted genomic DNA was fragmented to ~20 kbp using a Covaris g-TUBE (Covaris, Woburn, MA, USA). After purification using AMPure XP beads (Beckman Coulter, Brea, CA, USA), libraries were prepared using the SQK-LSK109 Ligation Sequencing Kit (Oxford Nanopore Technologies) according to the manufacturer’s protocol. The libraries were loaded onto an R9.4.1 chemistry flow cell (FLO-MIN106) and sequenced using MinKNOW v19.06.7. After sequencing, Guppy v3.2.2 (Oxford Nanopore Technologies) was used for base calling. A total of 1.1–22.1 Gbp of long-read data were obtained (Table 1), corresponding to 20–445× coverage of the Skeletonema genomes. The raw reads were checked using Seqkit v2.1.028 and quality filtered using Seqtk v1.3-r117-dirty29. For RNA sequencing, a total of 6.0–15.4 Gbp of sequences were obtained, corresponding to approximately 96–222× coverage of the Skeletonema genomes (Table 1). Quality checking and trimming of the raw reads were done with fastp v0.22.030 with default settings.
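The coverage values quoted above follow from dividing the sequencing yield by the genome size. A minimal sketch of that arithmetic is given below; the function name `fold_coverage` is hypothetical, and the example pairs the lowest short-read yield (9.1 Gbp) with the largest assembly (69.3 Mbp) purely for illustration, since the per-strain pairings are listed in Table 1 rather than in the text.

```python
def fold_coverage(yield_gbp, genome_size_mbp):
    """Return sequencing depth (x-fold) for a yield in Gbp and a genome size in Mbp."""
    if genome_size_mbp <= 0:
        raise ValueError("genome size must be positive")
    # Convert Gbp to Mbp so that both quantities share the same unit.
    return yield_gbp * 1000.0 / genome_size_mbp


if __name__ == "__main__":
    # Illustrative pairing of range endpoints taken from the text.
    print(round(fold_coverage(9.1, 69.3)))  # prints 131
```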

Table 4 Library preparation kits and insert sizes for genome and transcriptome sequencing by short-read sequencers.

Genome assembly

We estimated the overall characteristics of the Skeletonema genomes, including genome size, heterozygosity, ploidy, and repeat content, from the Illumina short-reads. The analytical flow of the whole genome in Skeletonema is shown in Fig. 1. KMC v3.2.028 and GenomeScope v2.024 estimated the haploid genome sizes as 43.3–85.5 Mbp (Table 2). Smudgeplot v0.2.424 was employed to estimate the ploidy of the vegetative cells in Skeletonema. We applied a hybrid de novo assembly approach based on Illumina short-reads and Nanopore long-reads. Short- and long-reads were assembled into contigs using MaSuRCA v4.0.831. For gap closing, the assembled contigs were scaffolded into the draft genomes using HaploMerger2 v2018060332. The total lengths, scaffold numbers, N50 values, and longest scaffold lengths of the resultant draft haploid genomes were calculated by QUAST v5.1.0rc132 (Table 2). We evaluated the gene completeness of our draft genomes using BUSCO v5.3.033,34. GetOrganelle v1.7.5.035 and NOVOPlasty v4.3.136 were used to determine the complete genome sequences of the chloroplast and mitochondria. Organelle gene annotation was done with GeSeq37. The rRNA gene sequences, including the intergenic spacer (IGS) regions, were also obtained with GetOrganelle, and a Nucleotide BLAST search at NCBI (standard database, Nucleotide collection (nr/nt)) identified the species of each strain with >99.0% identity (Table 3).
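The principle behind k-mer-based genome size estimation can be sketched as follows. This is a deliberately naive illustration, not the KMC or GenomeScope method: GenomeScope fits a mixture model to the k-mer spectrum to separate heterozygous and homozygous peaks, whereas this sketch simply divides the total number of k-mers by the depth of the main histogram peak. The function names, the `min_depth` cutoff, and the toy reads are assumptions, and reverse-complement (canonical) k-mers are not handled.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def count_kmers(reads, k):
    """Count every k-mer made only of A/C/G/T across all reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= VALID_BASES:
                counts[kmer] += 1
    return counts


def estimate_genome_size(kmer_counts, min_depth=2):
    """Naive estimate: total k-mers divided by the main peak depth.

    k-mers seen fewer than `min_depth` times are treated as sequencing
    errors and ignored (an assumption of this sketch).
    """
    # histogram: depth -> number of distinct k-mers observed at that depth
    histogram = Counter(kmer_counts.values())
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # Main peak = depth shared by the most distinct k-mers (ties: lower depth).
    peak_depth = max(solid, key=lambda d: (solid[d], -d))
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth


if __name__ == "__main__":
    toy_genome = "ACGTTGCATC"      # toy sequence, 7 distinct 4-mers
    toy_reads = [toy_genome] * 5   # five error-free copies, i.e. 5x depth
    counts = count_kmers(toy_reads, k=4)
    print(estimate_genome_size(counts))
```

For a diploid genome with appreciable heterozygosity, the spectrum has two peaks, which is why a model-based tool is needed to report a haploid size.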

Repeat analysis

Repetitive regions of Skeletonema were identified using a combination of de novo and homology-based approaches. For homology-based prediction, known repetitive elements were identified using RepeatMasker v4.1.238 to search against published Repbase sequences. For de novo prediction, RepeatModeler v2.0.339 was executed on the Skeletonema assemblies to build a de novo repeat library for each species. Then, RepeatMasker was used to annotate repetitive elements using these libraries.
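Once repeats have been annotated, the repeat fraction of an assembly can be checked independently of the RepeatMasker summary table. The sketch below assumes a soft-masked FASTA file, in which repeat bases are written in lowercase; whether the assemblies in this study were soft- or hard-masked is not stated, so this is an assumption, and the function name `masked_fraction` is hypothetical.

```python
def masked_fraction(fasta_lines):
    """Return the fraction of bases written in lowercase (soft-masked).

    `fasta_lines` is any iterable of FASTA lines, e.g. an open file handle.
    Header lines (starting with '>') and blank lines are skipped.
    Hard-masked bases (uppercase N) are NOT counted as masked here.
    """
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    if total == 0:
        raise ValueError("no sequence data found")
    return masked / total


if __name__ == "__main__":
    toy_fasta = [">scaffold_1", "ACGTacgt", "NNnn"]  # toy record
    print(f"{100 * masked_fraction(toy_fasta):.1f}% masked")
```

Multiplying the fraction by the assembly length gives the total repeat length, and subtracting that from the assembly length gives the repeat-free size.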

Gene prediction and annotation

The organelle sequences were excluded from the assembly data, and repeat regions were masked before gene prediction. RNA-seq reads were mapped to the assembled genome sequences using HISAT2 v2.2.140 with default settings, and gene prediction was performed using the BRAKER2 v2.1.6 pipeline41, which integrates RNA-seq data through GeneMark-ET (GENEMARK v.4.68)42 and refines gene models using AUGUSTUS v3.4.043, trained with the protein sequence data of Thalassiosira pseudonana, the closest species to Skeletonema. This process resulted in the annotation of 15,275–21,376 protein-coding genes in the Skeletonema genomes (Table 2). BRAKER was run with default parameters, incorporating transcript evidence to enhance the precision of the predicted gene structures. The closest protein homolog of each entry in the gene models of Skeletonema was searched using Diamond v2.0.1344, and the results were visualized with Krona45.
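The contamination check based on the closest homolog of each predicted gene amounts to tallying best hits by taxon. A minimal sketch is shown below, assuming the best hit of each gene has already been resolved to a taxon label; the input format, the gene identifiers, and the function name `taxon_percentages` are hypothetical and do not reproduce the Diamond or Krona file formats.

```python
from collections import Counter


def taxon_percentages(best_hit_taxon):
    """Return the percentage of genes assigned to each taxon.

    `best_hit_taxon` maps a gene identifier to the taxon label of its
    closest homolog (hypothetical, pre-resolved input for this sketch).
    """
    if not best_hit_taxon:
        raise ValueError("no genes given")
    counts = Counter(best_hit_taxon.values())
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


if __name__ == "__main__":
    # Invented example assignments, not results from this study.
    toy_hits = {
        "g1": "Eukaryota", "g2": "Eukaryota", "g3": "Eukaryota",
        "g4": "Bacteria",
    }
    shares = taxon_percentages(toy_hits)
    print(f"bacterial share: {shares.get('Bacteria', 0.0):.1f}%")
```

The bacterial share computed this way is the quantity compared against a contamination threshold such as the <1% value reported for the axenic strains.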

Data Records

All DNA and RNA raw reads have been deposited in DDBJ under the accession numbers DRR539406–DRR539437 (see Table 1)46. The complete chloroplast and mitochondrial genome sequences and the nuclear rRNA gene sequences have been deposited under the accession numbers LC814759–LC814791 (see Table 3)47.

The genome assembly data have also been deposited under the accession numbers BAAHPM010000001-BAAHPM010000104, BAAHPN010000001-BAAHPN010000346, BAAHPO010000001-BAAHPO010000158, BAAHPP010000001-BAAHPP010000130, BAAHPQ010000001-BAAHPQ010000150, BAAHPR010000001-BAAHPR010000169, BAAHPS010000001-BAAHPS010000286, BAAHPT010000001-BAAHPT010000096, BAAHPU010000001-BAAHPU010000093, BAAHPV010000001-BAAHPV010000330, and BAAHPW010000001-BAAHPW010000121 (see Table 5)48.

Table 5 Accession numbers of the genome assemblies of the Skeletonema strains.

Technical Validation

Quality assessment of the genome assembly

The total assembly lengths are 40.3–69.3 Mbp, and the scaffold N50s are 0.3–1.1 Mbp (Table 2). BUSCO analysis was performed with the Stramenopiles conserved gene database to assess the completeness of the genome assemblies, resulting in values of 90–98%.
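The completeness value reported here is the sum of the single-copy and duplicated BUSCO percentages. As a minimal sketch, the one-line score string that BUSCO writes in its short summary (of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`) can be parsed as follows; the function names are hypothetical, the example line is invented rather than taken from Table 2, and the exact summary layout should be checked against the BUSCO version in use.

```python
import re

# Pattern for the compact BUSCO score line, e.g.
# "C:92.0%[S:90.0%,D:2.0%],F:3.0%,M:5.0%,n:100"
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_line(line):
    """Parse a BUSCO score line into a dict of percentages plus the total n."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("not a BUSCO score line: %r" % line)
    scores = {key: float(match.group(key)) for key in "CSDFM"}
    scores["n"] = int(match.group("n"))
    return scores


def completeness(scores):
    """Completeness as used in the text: single-copy + duplicated (%)."""
    return scores["S"] + scores["D"]


if __name__ == "__main__":
    example = "C:92.0%[S:90.0%,D:2.0%],F:3.0%,M:5.0%,n:100"  # invented example
    parsed = parse_busco_line(example)
    print("completeness:", completeness(parsed), "% of", parsed["n"], "BUSCOs")
```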

Gene prediction and annotation validation

Gene models within the assemblies were predicted using AUGUSTUS, trained with the BUSCO assessment results. The final gene sets comprised 15,275–21,376 genes (Table 2). The BUSCO values, ranging from 90% to 98%, closely paralleled those observed for the genome assemblies. This congruence supports the reliability of the generated genome models.