Background & Summary

Diatoms are a diverse group of algae that significantly contribute to global carbon fixation and marine and freshwater ecosystem function1. In addition to their ecological role, their ability to tolerate and quickly acclimate to rapidly changing environmental conditions is remarkable2. These photosynthetic microalgae can capture \(\mathrm{CO}_{2}\) and convert it into diverse compounds, including lipids, omega-3 fatty acids, pigments, antioxidants, and polysaccharides3. They produce a variety of phytosterols, which offer possible health benefits such as cholesterol-lowering properties4. Diatoms can be cultivated indoors and outdoors, and their biomass productivity can be doubled in high-technology photobioreactors. A few selected species are used as model organisms in genetics and biochemistry research, while several taxa could serve as a bioprocess platform for biofuels3.

Diatoms play a critical role in the global carbon cycle3,5,6. Through photosynthesis, diatoms convert carbon dioxide into organic carbon, forming the basis of marine food webs and assisting in the sequestration of carbon in ocean sediments6. Diatoms fix atmospheric carbon dioxide, accounting for around 20% of the world’s primary production7. Their silica-based cell walls contribute to long-term carbon storage as they cause diatom cells to sink and settle on the ocean floor or the bottom of lakes and rivers. This process may be especially important during diatom blooms, which characterise temperate ocean margin zones and freshwater bodies in the spring. Various environmental factors in interactions with marine ecosystems affect the onset and progression of blooms, such as temperature, light intensity, and fluctuations of nutrients8,9.

Interaction and coexistence with bacterial communities are an integral part of diatom life. Diatoms and bacteria form consortia and heterogeneous cohorts, building networks of numerous cell-to-cell interactions, e.g. for nutrient exchange. In this mutually beneficial relationship, bacteria assimilate nutrients from the water and efficiently sequester minerals released by diatoms. Further, bacteria supply nutrients that diatoms cannot produce themselves, for example vitamins and fixed nitrogen10. Additionally, diatom blooms influence bacterial communities, showcasing their interconnectedness in marine ecosystems (e.g.11,12,13). At the same time, bacteria impact the dynamics of diatom growth14. The ecological roles of diatoms and their interactions with other organisms are now better understood thanks to molecular techniques, which have provided new insights into cell death, silicon metabolism, environmental sensing, and community-level interactions15.

However, despite the frequency and importance of diatoms in the ecosystem, complete genetic resources for diatoms are scarce. When starting this study, we found 89 Bacillariophyta genome assemblies at the National Center for Biotechnology Information (NCBI) Datasets (https://www.ncbi.nlm.nih.gov/datasets/, April 1st, 2024, see Supplementary Table S1). Of these, 66 were flagged as “representative genomes”. In total, 13 of these genome assemblies had an annotation of protein coding genes, but only seven of the genome assemblies flagged as “representative genomes” had such an annotation. This means that for four of the annotated assemblies, a newer and better but as yet unannotated genome assembly existed (the annotated assembly of Thalassiosira pseudonana was not flagged as representative, but no alternative representative genome assembly was available). For three species available at the NCBI, we found an annotation of protein coding genes in PhycoCosm16 but not at the NCBI. Knowledge about the protein coding genes is essential to fully exploit genome sequences17, and thus we made it our mission to annotate previously unannotated genome assemblies of the Bacillariophyta.

Initially, we set out to annotate the genome assemblies of all Bacillariophyta that did not have an annotation of protein coding genes, or for which a newer and better representative genome assembly had been made available without annotation. Where more than one genome assembly was available for the same species, we selected one assembly per species. However, we later decided to exclude 10 genome assemblies (see Supplementary Table S2), either due to technical problems during download or annotation, or due to data quality. We ended up successfully annotating 49 Bacillariophyta genome assemblies18 (references to the original sequence data publications are listed in Table 1, genome assembly details are given in Supplementary Table S3, and a taxonomic tree is shown in Fig. 1).

Table 1 References for sequence data used in this study, either for genome annotation or for comparison with previously existing genome annotations.
Table 2 Top: Statistics on the raw and intermediate BRAKER gene sets, bottom: statistics of the filtered and final BRAKER gene sets.
Fig. 1

Taxonomy tree of selected Bacillariophyta genomes. This tree displays species of selected Bacillariophyta genome assemblies available from NCBI Datasets between June 14th and 26th 2024. The tree was generated by PhyloT (https://phylot.biobyte.de/, August 21st 2024) and visualised with iTOL57. Species with representative genome assemblies with a previously existing annotation at NCBI are labelled in grey. Genomes that we annotated are coloured in different shades of blue. From lightest to darkest blue: with BRAKER3; with BRAKER3 including proteins from the same species that were already available for an older assembly at NCBI or from PhycoCosm; with BRAKER2; with BRAKER2 including proteins from the same species that were already available for an older assembly at NCBI or from PhycoCosm. We excluded “uncultured” entries and those matching only two letters followed by a dot, e.g. “sp.”.

With this study, we present the annotation data of protein coding genes for 49 Bacillariophyta genome assemblies that were previously stored as unannotated at NCBI Datasets. Combined with the previously existing annotations, this now makes a total of 58 Bacillariophyta genome annotations accessible for further studies (Fig. 2 visualises how these 58 species cover the taxonomic clades of Bacillariophyta). Together, these data can be applied to various scientific problems and help researchers better understand many of the processes in diatom algae.

Fig. 2

Stacked bar plot showing the distribution of species with structurally annotated genomes (9 previously annotated, 49 newly annotated in this study) across taxonomic subclades of Bacillariophyta. The lower portion of each bar represents species with annotated genomes, while the full bar height represents the total number of known species according to NCBI Taxonomy.

Methods

The genome annotations presented here were generated using publicly available genome, transcriptome, and protein data. Data analysis was performed in three steps: (1) data preparation, (2) structural genome annotation, and (3) functional genome annotation. After annotation, we performed assembly contamination analysis (4) and identified horizontal gene transfer candidates (5). Steps 1 and 2 were executed using a semi-automated and reproducible Snakemake workflow19 that is publicly available at https://github.com/KatharinaHoff/braker-snake (August 30th, 2024). Singularity20 was employed to manage software dependencies. Steps 3–5 were performed manually. In addition to genome annotation, we also estimated ploidy in a large number of genome assemblies. All software version numbers are listed in Supplementary Table S4.

Data preparation

In short, we used the NCBI Datasets tool to retrieve Bacillariophyta genome assembly information from the NCBI database (accessible via web browser at https://www.ncbi.nlm.nih.gov/datasets/). Assembly information was filtered to exclude ‘uncultured’ samples and species names ending in ‘sp.’. If multiple assemblies were available for the same species, we prioritized the ‘representative’ assembly or, if unavailable, the assembly with the largest N50. Genomes with 1,000 or fewer annotated proteins were selected as candidates for further annotation. This threshold was chosen so that genome assemblies whose only protein coding gene annotation covers organelle genomes were included. For each candidate genome, we checked if an older assembly had existing protein-coding gene annotations (referred to as ‘legacy proteins’) and stored this information. All genome assemblies and any associated legacy proteins were downloaded using the datasets tool.
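The selection rules described above can be illustrated with a short Python sketch. It is not the code of the published workflow; the record fields (`species`, `accession`, `representative`, `n50`, `n_proteins`) and the example records are hypothetical and only serve to show the order in which the rules apply.

```python
def select_candidates(assemblies, max_proteins=1000):
    """Pick one assembly per species and keep those that still need annotation."""
    best = {}
    for asm in assemblies:
        name = asm["species"].strip()
        # Exclude 'uncultured' samples and species names ending in 'sp.'
        if "uncultured" in name.lower() or name.endswith("sp."):
            continue
        # Prefer the representative assembly; otherwise the largest N50 wins.
        rank = (asm["representative"], asm["n50"])
        current = best.get(name)
        if current is None or rank > (current["representative"], current["n50"]):
            best[name] = asm
    # Assemblies with at most max_proteins annotated proteins are candidates
    # (this also catches assemblies with organelle-only annotations).
    return [a for a in best.values() if a["n_proteins"] <= max_proteins]


# Hypothetical example records
assemblies = [
    {"species": "Species a", "accession": "ASM_1", "representative": False,
     "n50": 500_000, "n_proteins": 0},
    {"species": "Species a", "accession": "ASM_2", "representative": True,
     "n50": 100_000, "n_proteins": 130},
    {"species": "Species b", "accession": "ASM_3", "representative": True,
     "n50": 900_000, "n_proteins": 21_000},
    {"species": "Genus sp.", "accession": "ASM_4", "representative": True,
     "n50": 700_000, "n_proteins": 0},
    {"species": "uncultured diatom", "accession": "ASM_5", "representative": True,
     "n50": 50_000, "n_proteins": 0},
]
selected = [a["accession"] for a in select_candidates(assemblies)]
print(selected)
```

In this toy input, only the representative assembly of "Species a" remains: "Species b" already carries a full annotation, and the other two entries are removed by the name filter.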

The workflow automatically retrieves the appropriate OrthoDB v11 partition21 for the specified taxon from https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/. For Bacillariophyta, this corresponds to the Stramenopiles partition (https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Stramenopiles.fa.gz), which we combined with the Viridiplantae partition (https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Viridiplantae.fa.gz) to obtain a larger sequence set.

For species lacking genome annotations, RNA-seq data availability was verified using the Biopython/Entrez API to query the Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra)22. Up to six Illumina paired-end libraries were selected (the top six entries from the Entrez results), and downloaded using fasterq-dump (https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software, accessed August 21st, 2024). RNA-seq data were aligned to the genome using HISAT223. Co-culture libraries were not excluded, as they often provide critical data for diatoms, but libraries with an alignment rate below 20% were discarded. The resulting SAM files were converted to BAM, merged if multiple files existed, sorted, and indexed using SAMtools24.
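The two library-selection rules in this step (take the top six search results, then discard libraries with an alignment rate below 20%) can be sketched as follows. The run identifiers and alignment rates are made up, and the function names are hypothetical; in the actual workflow the rates would come from the HISAT2 alignment summaries.

```python
def pick_libraries(search_results, max_libraries=6):
    """Keep the first entries of the search results (up to six libraries)."""
    return list(search_results[:max_libraries])


def discard_poorly_aligned(alignment_rates, min_rate=20.0):
    """Keep libraries whose overall alignment rate (in percent) is at least min_rate."""
    return [acc for acc, rate in alignment_rates.items() if rate >= min_rate]


# Hypothetical run identifiers and alignment rates
search_results = [f"RUN_{i}" for i in range(1, 9)]
chosen = pick_libraries(search_results)
rates = {"RUN_1": 85.2, "RUN_2": 19.9, "RUN_3": 20.0,
         "RUN_4": 3.1, "RUN_5": 64.0, "RUN_6": 47.5}
kept = discard_poorly_aligned({acc: rates[acc] for acc in chosen})
print(chosen)
print(kept)
```

Low alignment rates are expected for some co-culture libraries, which is why the filter is applied after alignment rather than on library metadata.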

Before proceeding with automated annotation, we manually queried the PhycoCosm portal (Joint Genome Institute) for existing protein-coding gene annotations for species in our dataset. For Cyclotella cryptica25,26, Nitzschia putrida27 and Pseudo-nitzschia multiseries, we downloaded available protein sequences and included them as ‘legacy proteins’ in the BRAKER annotation process.

The final output of this data preparation phase was a CSV file that specifies the input files required for the subsequent annotation workflow for each species.

Structural genome annotation

Each selected genome assembly was processed individually using a consistent pipeline. First, RepeatModeler228 was used to construct a species-specific repeat library, followed by RepeatMasker (http://www.repeatmasker.org, accessed August 21st, 2024) to soft mask the repeats in the genome. Depending on the availability of extrinsic data, either BRAKER2 or BRAKER329,30 was employed to predict protein-coding gene structures from the soft-masked genome.

Protein evidence was always used during annotation. For many genomes, the combined Stramenopiles/Viridiplantae protein partition was used as input. Additionally, legacy proteins were incorporated when available. In cases where RNA-seq data were absent, BRAKER2 was run with an option to enrich the predicted gene set using BUSCOs from the Stramenopiles_odb10 dataset31, enhanced with compleasm32. BRAKER2 first uses GeneMark-EP+33, which self-trains GeneMark-ES34,35 to identify seed gene sequences. These sequences are then compared to the protein database using DIAMOND36, followed by accurate spliced alignment with Spaln237. GeneMark-EP+ generates an intermediate gene set based on protein evidence, which is refined using AUGUSTUS38,39. TSEBRA40 then combines and filters the predictions using protein evidence and BUSCOs as guides41.

When RNA-seq alignments were available, BRAKER3 was used. This workflow employed GeneMark-ETP42, which processes RNA-seq alignments using StringTie243 to assemble transcripts. GeneMarkS-T44 then screens the assembled transcripts for potential genes. DIAMOND and the protein evidence pipeline of GeneMark-EP+ were used to filter the genes, and GeneMark-ETP also performed initial gene predictions based on self-training. AUGUSTUS was again trained on a reliable subset of predicted genes, and the final gene set was merged using TSEBRA.
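The choice between the two pipelines depends only on which extrinsic evidence is available. A minimal sketch of this decision is given below; the function name and the evidence labels are hypothetical, and the sketch deliberately does not construct actual BRAKER command lines.

```python
def plan_braker_run(has_rnaseq, has_legacy_proteins):
    """Return the pipeline name and the list of evidence sources for one genome."""
    pipeline = "BRAKER3" if has_rnaseq else "BRAKER2"
    evidence = ["OrthoDB protein partition"]
    if has_legacy_proteins:
        evidence.append("legacy proteins")
    if has_rnaseq:
        evidence.append("RNA-seq alignments")
    else:
        # Without RNA-seq, the gene set is enriched with BUSCOs instead.
        evidence.append("BUSCO enrichment (stramenopiles_odb10)")
    return pipeline, evidence


for rnaseq in (True, False):
    for legacy in (True, False):
        print(rnaseq, legacy, plan_braker_run(rnaseq, legacy))
```

The four combinations of the two flags correspond to the four annotation categories distinguished by colour in Fig. 1.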

Not all BRAKER jobs completed successfully; assemblies affected by these failures were excluded from further analysis (see Supplementary Table S2).

For quality control, we ran BUSCO with the stramenopiles_odb10 dataset on both the genome assemblies and the predicted protein sequences. Genomes were excluded if there was a significant discrepancy between BUSCO completeness scores at the genome level and the predicted protein level. For example, despite a 95% BUSCO completeness score at the genome level, Pseudo-nitzschia delicatissima achieved only 72% completeness at the annotation level and was excluded (see Fig. 3). Additionally, Thalassiosira sundarbana was excluded due to low genome BUSCO completeness (15%) and contamination in the database. Epithemia catenata was also excluded due to low genome BUSCO completeness (56%).
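The comparison of genome-level and protein-level BUSCO completeness can be expressed as a simple difference in percentage points. The text does not state a numerical cutoff for a "significant discrepancy", so the `max_gap` value in the sketch below is purely an assumption chosen for illustration; only the two completeness values for Pseudo-nitzschia delicatissima are taken from the text.

```python
def busco_gap(genome_complete, protein_complete):
    """Difference in BUSCO completeness (percentage points), genome minus proteins."""
    return genome_complete - protein_complete


def has_discrepancy(genome_complete, protein_complete, max_gap=10.0):
    """Flag an annotation whose completeness falls far below that of its genome.

    max_gap is a hypothetical threshold, not a value from the study.
    """
    return busco_gap(genome_complete, protein_complete) > max_gap


# Pseudo-nitzschia delicatissima: 95% (genome) versus 72% (annotation)
print(busco_gap(95, 72))
print(has_discrepancy(95, 72))
```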

Fig. 3

BUSCO scores of Pseudo-nitzschia delicatissima. We decided to exclude this species from further analysis because of the discrepancy of BUSCO scores between genome and protein level.

Fig. 4

Rooted species tree of Bacillariophyta with an annotation of protein coding genes. Major diatom lineages are labelled on the right. The previously annotated species (C. tenuissimus, C. closterium, F. crotonensis, M. pseudoterrestris, N. inconspicua, P. tricornutum, Pseudo-nitzschia multistriata, S. robusta, and T. pseudonana) are labeled with a star. The numbers displayed on branches correspond to support values according to the Shimodaira-Hasegawa-like method242.

Functional gene annotation

The EnTAP functional annotation software was employed to provide functional descriptors and identify potential contaminants for the predicted proteins45. EnTAP was configured with two curated databases, NCBI’s RefSeq Protein46 and UniProtKB/Swiss-Prot47, for similarity searches, utilising a 50% target and query coverage minimum, and a DIAMOND E-value threshold of 0.00001. An optimal alignment was selected for each protein query based on phylogenetic relevance, informativeness, and standard alignment quality metrics. Additionally, EnTAP performed independent searches against the EggNOG database48 using the EggNOG-mapper toolbox49. The resulting gene family assignments, along with high-quality similarity search alignments, facilitated the subsequent connections to Gene Ontology terms50,51, protein domains from Pfam52, and pathway associations via KEGG53.

Contamination and HGT analysis

We screened each assembly for potential contamination by leveraging the EnTAP classification of individual transcripts as either contaminated or uncontaminated. In EnTAP, contaminant transcripts aligned with high confidence to the NCBI RefSeq microbial database or exclusively to the microbial gene families housed in EggNOG. Annotated transcripts in each assembly were mapped back to their corresponding contigs, and the proportion of “contaminated” versus “uncontaminated” transcripts was computed per contig. Any contig with more than 75% of its transcripts flagged as contamination was classified as potentially contaminated and a Note was added to each CDS feature in the gff3 file for this assembly. We detected between 1 and 318 contaminated contigs in 39 of the assemblies (see Supplementary Table S6).
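The per-contig rule (more than 75% of transcripts flagged as contaminants) can be sketched as follows. The input format, a list of (contig, flag) pairs, is an assumption made for this illustration; in practice these values would be derived from the EnTAP output and the GFF3 file.

```python
from collections import defaultdict


def flag_contigs(transcripts, threshold=0.75):
    """Return contigs on which more than `threshold` of the transcripts are contaminants.

    transcripts: iterable of (contig_id, is_contaminant) pairs.
    """
    total = defaultdict(int)
    contaminated = defaultdict(int)
    for contig, is_contaminant in transcripts:
        total[contig] += 1
        contaminated[contig] += int(is_contaminant)
    return sorted(c for c in total if contaminated[c] / total[c] > threshold)


# Hypothetical transcript classifications
transcripts = (
    [("contig_1", True)] * 4
    + [("contig_2", True)] * 3 + [("contig_2", False)]
    + [("contig_3", True)] + [("contig_3", False)] * 4
)
flagged = flag_contigs(transcripts)
print(flagged)
```

Note that contig_2, with exactly 75% contaminant transcripts, is not flagged, because the rule requires the proportion to exceed 75%.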

Furthermore, we evaluated each predicted proteome, using the longest isoform per gene, for potential horizontally transferred genes (HGT). Specifically, we identified HGT candidates that occurred in one or more Bacillariophyta but were not conserved in other members of the Ochrophyta. For this, we performed additional DIAMOND searches against donor databases (NCBI RefSeq microbial and plant) and a recipient database (NCBI RefSeq Ochrophyta with all Bacillariophyta removed), using coverage thresholds of at least 50% for both query and subject and an e-value cutoff of 1e-5. Candidate HGTs were initially identified as those aligning to the microbial donor database while failing to align to either the plant donor database or the recipient database. We then filtered the HGT candidates by removing those lacking two flanking neighboring genes belonging to the target species, those with at least one flanking gene identified as a contaminant, and those at the end of a scaffold (which lack two flanking genes for evaluation). The methodology for HGT identification and downstream filtering is available in EnTAP (v.2.3.0). The remaining genes were retained as HGT candidates (see Supplementary Table S7) and each corresponding CDS feature in the gff3 file was tagged accordingly. We identified between 1 and 129 HGT candidates per species in 42 of the annotations.
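The two-stage logic, set-based candidate detection followed by filtering on flanking genes, can be illustrated with the sketch below. It is not the EnTAP implementation. It assumes that the DIAMOND hits have already been filtered by coverage and e-value and reduced to sets of gene identifiers, and it treats any non-contaminant neighbour as belonging to the target species; all identifiers are hypothetical.

```python
def initial_candidates(microbial_hits, plant_hits, recipient_hits):
    """Genes with a microbial donor hit but no plant donor or recipient hit."""
    return set(microbial_hits) - set(plant_hits) - set(recipient_hits)


def filter_by_neighbours(candidates, scaffold_genes, contaminants):
    """Keep candidates that have two flanking genes, neither of them a contaminant.

    scaffold_genes: dict mapping scaffold ID to its gene IDs in genomic order.
    """
    kept = []
    for genes in scaffold_genes.values():
        for i, gene in enumerate(genes):
            if gene not in candidates:
                continue
            # A gene at a scaffold end lacks one of its two flanking genes.
            if i == 0 or i == len(genes) - 1:
                continue
            if genes[i - 1] in contaminants or genes[i + 1] in contaminants:
                continue
            kept.append(gene)
    return kept


# Hypothetical identifiers
candidates = initial_candidates(
    microbial_hits={"g2", "g5", "g7", "g9"}, plant_hits={"g7"}, recipient_hits=set()
)
scaffolds = {"s1": ["g1", "g2", "g3"], "s2": ["g4", "g5", "g6"], "s3": ["g8", "g9"]}
hgt = filter_by_neighbours(candidates, scaffolds, contaminants={"g4"})
print(sorted(candidates))
print(hgt)
```

Here g5 is dropped because its left neighbour is a contaminant, and g9 is dropped because it lies at the end of its scaffold.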

Orthogroup analysis

We used OrthoFinder to identify orthologous gene groups across species by performing an all-versus-all comparison of protein sequences, using the longest isoform of each gene. Beforehand, we removed proteins located on genomic sequences suspected to be contaminants, as well as horizontal gene transfer candidates. Based on sequence similarities, genes were grouped into orthogroups, which represent sets of genes descended from a common ancestor. To ensure the reliability of the species tree, we included species from nine publicly available annotations (see Table 3) and, as an outgroup, the Oomycota species Phytophthora cinnamomi, Phytophthora infestans, Phytophthora ramorum, Phytophthora sojae, and Bremia lactucae (see Table 4).

Table 3 Descriptive statistics of the previously existing annotations of protein coding genes in representative genome assemblies at NCBI.
Table 4 Descriptive statistics of genome assemblies and protein-coding gene annotations in Oomycota.

A species tree (Fig. 4) was generated using OrthoFinder with the -M msa option, which builds gene trees based on multiple sequence alignments (using MAFFT54) and infers their topology with FastTree55. FastTree uses an approximate maximum-likelihood approach and provides SH-like (Shimodaira–Hasegawa-like) support values for each branch, which offer a fast estimate of how reliable each split is—though they are not traditional bootstrap values. These gene trees were then combined using the STAG56 (Species Tree from All Genes) algorithm, which reconstructs the species tree by integrating information from genome-wide orthogroup data, including multi-copy gene families. The support values shown on internal nodes of the species tree reflect how often each grouping is supported across all gene trees. Finally, the tree was rooted using STRIDE (Species Tree Root Inference from Duplication Events)57, which uses gene duplication patterns to determine the most likely root. Altogether, this approach combines gene family structure and duplication history to produce a comprehensive view of species relationships.

The OrthoFinder results files, including orthogroups, are available at58.

Filtering of false positive single exon genes

Descriptive statistics of the raw BRAKER output (see Table 2) and the EnTAP annotation rate (see Supplementary Table S5) suggested that BRAKER overpredicted single-exon genes in some cases. This issue has previously been reported in land plant annotations59.

To address this and filter out potential false positive single-exon gene predictions—while retaining gene models that may be of scientific interest—we applied the following filtering approach: We discarded single-exon gene models that lacked a functional annotation by EnTAP, did not have a significant hit in a DIAMOND search against the NCBI RefSeq non-redundant proteins (NR) database (February 2nd, 2024), and were not part of an orthologous group spanning more than one species in the OrthoFinder results.
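The filter amounts to a single Boolean rule: a single-exon gene model is discarded only if all three lines of evidence are missing. A minimal sketch, with hypothetical argument names, is shown below.

```python
def keep_gene(n_exons, has_entap_annotation, has_nr_hit, orthogroup_species):
    """Decide whether a gene model survives the single-exon filter.

    orthogroup_species: number of species in the gene's orthogroup
    (0 if the gene is not assigned to any orthogroup).
    """
    if n_exons > 1:
        return True  # multi-exon genes are never filtered
    # Single-exon genes need at least one line of supporting evidence.
    return has_entap_annotation or has_nr_hit or orthogroup_species > 1


print(keep_gene(1, False, False, 1))   # unsupported single-exon gene
print(keep_gene(1, False, False, 3))   # rescued by a cross-species orthogroup
print(keep_gene(4, False, False, 0))   # multi-exon gene
```

Because the three criteria are combined with a logical OR, the filter is conservative: any one source of support is enough to retain a gene model.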

File processing

In order to prepare NCBI-compliant GFF3 files, the filtered BRAKER output files were decorated with product names and notes according to EnTAP results (command lines at https://github.com/Gaius-Augustus/Diatom_annotation_scripts).

Ploidy Estimation with Smudgeplot

GenBank accessions were used to retrieve additional metadata from NCBI, including read type, DNA SRA accessions, genome size, and assembly level. Ploidy was not estimated if the SRA accession was unavailable or corresponded to long-read data (i.e., PacBio, ONT). A Nextflow pipeline (available at https://github.com/Gaius-Augustus/Diatom_annotation_scripts) was developed to estimate the ploidy for all remaining individuals in parallel. Paired-end SRA accessions were first fetched using sra-tools and filtered for fungal, bacterial, archaeal, and viral contaminants using Kraken’s60 default parameters. Coverage was calculated before and after contaminants were removed and ranged from 12x to 499x. Next, FastK built a database for each contaminant-free library using a k-mer size of 21. With the FastK database, Smudgeplot ‘hetmers’ found all k-mer pairs61. The lower k-mer threshold (-L) was estimated with Smudgeplot ‘cutoff’. The final ploidy estimate and the proportion of heterozygosity carried by paralogs were extracted from the verbose summary text file resulting from the ‘plot’ module (results in Supplementary Table S8).
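The text does not specify how coverage was computed. One common estimate, shown here only as an assumption, divides the total number of sequenced bases by the genome size; the numbers in the example are invented.

```python
def estimate_coverage(n_reads, read_length, genome_size):
    """Approximate sequencing depth as total sequenced bases / genome size.

    For paired-end data, n_reads counts both mates of each pair.
    """
    return n_reads * read_length / genome_size


# Hypothetical library: 20 million reads of 150 bp on a 100 Mbp genome
coverage = estimate_coverage(n_reads=20_000_000, read_length=150, genome_size=100_000_000)
print(f"{coverage:.0f}x")
```

Applying the same function to the read counts before and after contaminant removal shows how much of the apparent depth was contributed by non-target reads.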

Data Records

The data set is available at Zenodo (ref. 62 version v6). It consists of an archive file called Bacillariophyta_annotations.tar.gz. After extraction, the resulting folder Bacillariophyta_annotations contains gff3 format files with gene models (summarized in Table 2) that each correspond to a FASTA format genome file. The accession numbers of the genome assemblies are for user convenience listed in the additionally included file README.md.

Technical Validation

We performed a genome annotation study focusing on 49 diatom species, aiming to create a robust genomic dataset that supports future research into diatom biology and evolution. To emphasize the need for our work, we plotted the distribution of existing Bacillariophyta genome annotations in the context of all known species within this taxon (Fig. 2). This analysis highlights the limited representation of annotated diatom species in current genomic resources. Our work significantly expands the number of annotated assemblies from 9 (or 15, including legacy assemblies) to 58, providing a valuable resource for diatom research.

Descriptive statistics for the gene structures of the newly annotated genomes are provided in Table 2. Previously annotated diatom genomes at NCBI contain between 10,321 and 38,391 protein-coding gene models (see Table 3). The gene numbers in the newly generated gene sets fall within this range. Vuruputoor et al. (2023)59 recommend using the ratio of mono-exon to multi-exon genes as a quality measure for genome annotations, with a suggested ratio of 0.2 for land plants. In contrast, diatom genomes exhibit a higher proportion of single-exon genes, with ratios ranging from 0.66 to 2.14 (based on existing annotations; see Table 3). The BRAKER2 and BRAKER3 pipelines tend to overpredict single-exon genes, and we hypothesise that this phenomenon extends to diatom genomes as well. After applying our filtering approach, only five species - Porosira glacialis (3.34), Skeletonema marinoi (2.46), Skeletonema tropicum (2.6), Thalassiosira delicatula (2.87), and Thalassiosira mediterranea (2.71) - exceeded this range. These deviations are modest and may partly be attributed to selfish DNA elements, such as unmasked transposons and inserted retroviruses. The exon structure of the novel annotations aligns with previously annotated genomes in terms of the median number of exons per gene (2–3) and the largest number of exons per transcript (13–96) (compare Tables 2 and 3).
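The mono-exon to multi-exon ratio used above is straightforward to compute from a list of exon counts. The sketch below assumes one exon count per gene, which is our reading of the measure; the reference range of 0.66 to 2.14 is the one reported for existing diatom annotations, and the toy exon counts are invented.

```python
def mono_multi_ratio(exon_counts):
    """Ratio of single-exon genes to multi-exon genes."""
    mono = sum(1 for n in exon_counts if n == 1)
    multi = sum(1 for n in exon_counts if n > 1)
    return mono / multi


def outside_reference_range(ratio, low=0.66, high=2.14):
    """True if the ratio lies outside the range seen in existing diatom annotations."""
    return not (low <= ratio <= high)


exon_counts = [1, 1, 1, 2, 3, 5]  # hypothetical gene set
ratio = mono_multi_ratio(exon_counts)
print(ratio, outside_reference_range(ratio))
print(outside_reference_range(3.34))  # value reported for Porosira glacialis
```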

Evaluating the quality of novel genome annotations is challenging. We used BUSCO to assess genome completeness at both the genome and protein levels (only the longest isoform per gene), following Earth BioGenome Project guidelines17. BUSCO estimates the proportion of genes typically present as single copies within a clade. However, the stramenopiles_odb10 dataset applicable to diatoms is relatively small (100 marker genes). While BUSCO scores measure sensitivity within this limited dataset (see Fig. 5), a close agreement between genome- and protein-level scores suggests that the new annotations do not lack a significant portion of BUSCO genes detectable at the genome level. This is expected, as the stramenopiles_odb10 dataset was used as input for BRAKER.

Fig. 5

BUSCO results of genomes and protein sets (only the longest isoform per gene was used in this analysis). This plot demonstrates the quality of genome assemblies (G = Genome) and predicted protein sets (B = BRAKER) across all species annotated here; species are ordered alphabetically. The categories Complete (single copy or duplicated), Fragmented, and Missing BUSCOs are shown.

We also applied OMArk63 to further assess the quality of protein-coding gene annotations. OMArk uses conserved Hierarchical Orthologous Groups (HOGs) from the OMA database64 and the OMAmer software for fast protein placement65. For Bacillariophyta, the relatively small Ochrophyta subset of 942 HOGs is applicable. While this is a limited number of marker genes, OMArk provides additional metrics, assessing contamination, consistency, and fragmentation. Figure 6 shows OMArk results for our newly annotated genomes, while Fig. 7 displays results for previously available reference genomes. Unlike BUSCO, OMArk correctly handles alternative transcript isoforms, suggesting that the observed duplicates are likely real. Notably, we observed a high level of HOG completeness across most assemblies. However, Thalassiosira profunda and Fistulifera solaris showed a surprisingly high number of duplicate HOGs. For T. profunda, this is consistent with BUSCO scores at the genome level, indicating agreement between different metrics. In contrast, the source of duplicates in F. solaris remains unclear. We explored the genome assembly statistics (Table 5) but found no obvious explanation. Additionally, OMArk identified a significant level of contamination in the genome of Licmophora abbreviata, which had not been flagged as contaminated in public databases (Fig. 8).

Fig. 6

OMArk results of newly annotated Bacillariophyta genomes. The top bar graph displays the number of canonical proteins per proteome, the middle graph presents completeness metrics based on single-copy, duplicated, or missing conserved genes, and the bottom graph illustrates the consistency assessment. Proteins are categorized as consistent, contamination, inconsistent, unknown, partial mapping, or fragments. Consistent proteins align with taxonomically expected gene families, while contamination refers to proteins matching gene families from other species. Inconsistent proteins belong to gene families outside the expected lineage but are not contaminants. Unknown proteins cannot be assigned to known gene families and may represent novel or misannotated sequences. Partial mapping indicates proteins aligning with gene families over less than 80% of their sequence, and fragments are proteins shorter than half the median length of their gene family.

Fig. 7

OMArk results of previously annotated Bacillariophyta reference genome assemblies. Since it was not straightforward to extract alternative isoform nesting from the GFF3 files, we extracted the longest isoform for each locus with TSEBRA instead of generating an isoform information file for OMArk. The top bar graph displays the number of canonical proteins per proteome, the middle graph presents completeness metrics based on single-copy, duplicated, or missing conserved genes, and the bottom graph illustrates the consistency assessment. Proteins are categorized as consistent, contamination, inconsistent, unknown, partial mapping, or fragments. Consistent proteins align with taxonomically expected gene families, while contamination refers to proteins matching gene families from other species. Inconsistent proteins belong to gene families outside the expected lineage but are not contaminants. Unknown proteins cannot be assigned to known gene families and may represent novel or misannotated sequences. Partial mapping indicates proteins aligning with gene families over less than 80% of their sequence, and fragments are proteins shorter than half the median length of their gene family. These metrics provide a comprehensive evaluation of annotation quality beyond completeness alone.

Table 5 Assembly statistics according to seqstats (https://github.com/clwgg/seqstats) of Bacillariophyta genome assemblies annotated with BRAKER.
Fig. 8

Assignment rate of proteins to orthogroups. The figure shows the percentage of genes assigned to cross-species orthogroups, to single-species orthogroups (paralog only groups), or not assigned to orthogroups across different species as a stacked barplot. The light blue bars represent the genes assigned to orthogroups, while the orange bars represent the unassigned genes. The dark blue bars represent the genes assigned to single-species orthogroups. Previously annotated species are marked in bold face.

To better explain variation in genome-level BUSCO duplication across the diatoms, ploidy was estimated. Fistulifera has already been recognized as an allopolyploid66,67, yielding BUSCO duplication rates between 21% and 89%. While elevated BUSCO duplication can indicate polyploidy in some cases, it may also be a result of incomplete purging, mixed samples, or elevated heterozygosity. Skeletonema marinoi and Thalassiosira profunda, for example, have BUSCO duplication rates of 29% and 26%, respectively, but are still estimated to be diploid. Roberts et al. (2024)68 report the same ploidy levels. Interestingly, the only exception is Stephanodiscus minutulus, which was estimated to be triploid in this study (see Supplementary Table S8).

In the current study, we mainly used the orthogroups constructed by OrthoFinder to filter likely false positive predicted single exon genes. However, the OrthoFinder results themselves are also an interesting result of this study. Across species, the percentage of genes assigned to orthogroups ranged from 85.4% to 99.1%, indicating a generally high rate of orthogroup recovery. For most species, over 90% of genes were successfully assigned, with especially high assignment rates observed in Skeletonema marinoi (99.1%), Discostella pseudostelligera (98.9%) and Skeletonema menzelii (98.9%) (Fig. 8). A few species, such as Thalassiosira delicatula (85.4%) and Bremia lactucae (species from the outgroup used for the OrthoFinder analysis) (89.8%), showed slightly lower assignment rates, potentially reflecting lineage-specific gene content. It should be noted that OrthoFinder also constructs intra-species orthogroups, which consist of genes from a single species.

In total, 1,115,003 genes (96.8% of the dataset) were assigned to inter-species orthogroups, emphasizing the significant degree of genetic overlap among the species included in this study. Orthogroup inference resulted in a total of 7,092 species-specific orthogroups, comprising 29,717 genes, which represents 2.6% of all input genes. This points to potential species-specific adaptations, with these gene families possibly linked to unique ecological roles or environmental responses. The mean orthogroup size was 32.7 genes, while the median size was 6.0, reflecting a skewed distribution with some large, highly conserved orthogroups. The G50 (i.e., the orthogroup size above which 50% of all assigned genes are found) was 87 for assigned genes and 119 when considering all input genes. The corresponding O50 values (the number of the largest orthogroups that together contain half of the genes) were 2,577 and 4,381, respectively. Notably, only 178 orthogroups included genes from all species. The relatively low number of orthogroups containing genes from all species (262) suggests a high level of gene family diversification, likely reflecting extensive evolutionary divergence and possible lineage-specific expansions or losses across the dataset. The total number of genes per species varied widely, from fewer than 10,000 in Discostella pseudostelligera to almost 36,000 in Seminavis robusta (see statistics per species in the Supplementary Table S9), highlighting the diversity in gene number across the dataset.
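The G50 and O50 statistics are defined analogously to N50 and L50 for assemblies. The following sketch implements the definitions as worded above on a toy list of orthogroup sizes; it is an illustration of the definitions, not OrthoFinder's own code, and OrthoFinder's handling of unassigned genes may differ in detail.

```python
def g50_o50(orthogroup_sizes, total_genes=None):
    """Return (G50, O50) for a list of orthogroup sizes.

    G50: size of the orthogroup at which the cumulative gene count, going from
         the largest orthogroup downwards, first reaches half of the total.
    O50: number of orthogroups needed to reach that point.
    total_genes: total to take half of; defaults to the number of assigned genes.
    """
    sizes = sorted(orthogroup_sizes, reverse=True)
    total = sum(sizes) if total_genes is None else total_genes
    running = 0
    for rank, size in enumerate(sizes, start=1):
        running += size
        if 2 * running >= total:
            return size, rank
    raise ValueError("orthogroups contain fewer than half of total_genes")


sizes = [10, 6, 4, 3, 2]                       # toy orthogroups, 25 assigned genes
print(g50_o50(sizes))                          # relative to assigned genes
print(g50_o50(sizes, total_genes=40))          # relative to all input genes
```

In the toy example, the two largest orthogroups already hold more than half of the 25 assigned genes, so G50 is 6 and O50 is 2; relative to 40 input genes, three orthogroups are needed.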

OrthoFinder’s analysis is based on the construction of gene trees, allowing for the classification of orthologous and paralogous relationships. The gene trees can be summarized in species trees, which are particularly useful for identifying variable rates of sequence evolution (through branch lengths) and the order in which sequences diverged (tree topology). The resulting species tree for Bacillariophyta gene sets, including both novel and previously annotated genomes from the International Nucleotide Sequence Database Collaboration (INSDC), is shown in Fig. 4.

The species were grouped into major sub-lineages: Coscinodiscophyceae, Mediophyceae, Fragilariophyceae, and Bacillariophyceae. Consistent with findings from earlier phylogenetic research69,70, not all diatom sub-lineages are recovered as monophyletic groups: radial centrics (Coscinodiscophyceae) form a paraphyletic group, while Mediophyceae and pennate diatoms (Fragilariophyceae and Bacillariophyceae) form separate, well-supported clades. Chaetoceros muelleri and Chaetoceros tenuissimus are often placed outside the main Coscinodiscophyceae (radial centric) clade and instead fall within the Mediophyceae, a group of polar centric diatoms69,70,71. Mediophyceae regularly emerge as the sister group to pennate diatoms (Fragilariophyceae and Bacillariophyceae), rather than to radial centrics. Chaetoceros species cluster with other Mediophyceae such as Thalassiosira, Biddulphia, and Rhizosolenia, forming a distinct group separate from radial centrics and generally closer to pennate diatoms. This placement supports earlier morphological and phylogenetic studies72,73 showing that chain-forming centrics like Chaetoceros are more closely related to pennates than to traditional radial centrics.

In some cases, excluding contaminant and HGT candidate sequences may have been slightly too stringent, potentially leading to overfiltering. To illustrate this, we provide BUSCO scores for both the original genome assemblies and gene sets and the versions without contaminant- and HGT-labeled sequences (see Supplementary Table S10).

While the PhycoCosm database includes additional annotated Bacillariophyta genomes, our workflow was specifically designed to rely on automatic querying of NCBI datasets for genome downloads. Therefore, we did not include PhycoCosm genomes in this study.

The novel annotations presented here will be valuable for studying interactions between diatoms and bacteria, particularly in the context of algal blooms that play a significant role in global carbon cycling. Given that methods for recovering full eukaryotic genomes from metagenomes are still developing, reference-based binning approaches, such as BlobTools74 using DIAMOND, may provide a viable strategy, especially as databases like NCBI NR expand for this clade.