A curated protein dataset for taxonomic classification of Prochlorococcus and Synechococcus in metagenomes

Coe, Allison; Mullet, James I.; Vo, Nhi N.; Berube, Paul M.; Anjur-Dietrich, Maya I.; Salcedo, Eli; Parker, Sierra M.; VonEmster, Konnor; Bliem, Christina; Arellano, Aldo A.; Castro, Kurt G.; Becker, Jamie W.; Chisholm, Sallie W.

doi:10.1038/s41597-025-06164-5

Download PDF

Data Descriptor
Open access
Published: 03 December 2025

A curated protein dataset for taxonomic classification of Prochlorococcus and Synechococcus in metagenomes

Scientific Data volume 12, Article number: 1895 (2025) Cite this article

1665 Accesses
1 Altmetric
Metrics details

Subjects

Abstract

Prochlorococcus and Synechococcus are abundant marine picocyanobacteria that contribute significantly to ocean primary production. Recent genome sequencing efforts, including those presented here, have yielded a large number of high-quality reference genomes, enabling the classification of these picocyanobacteria in marine metagenomic sequence data at high phylogenetic resolution. When combined with environmental data, these classifications can guide cluster/clade/grade assignments and offer insights into niche differentiation within these populations. Here we present ProSynTax, a curated protein sequence dataset and accompanying classification workflow aimed at enhancing the taxonomic resolution of Prochlorococcus and Synechococcus classification. ProSynTax includes proteins from 1,260 genomes of Prochlorococcus and Synechococcus, including single-amplified genomes, high-quality draft genomes, and newly closed genomes. Additionally, ProSynTax incorporates proteins from 41,753 genomes of marine heterotrophic bacteria, archaea, and viruses to assess microbial and viral communities surrounding Prochlorococcus and Synechococcus. This resource enables accurate classification of picocyanobacterial clusters/clades/grades in metagenomic data – even when present at 0.15% of reads for Prochlorococcus or 0.03% of reads for Synechococcus.

Genomes of Prochlorococcus, Synechococcus, bacteria, and viruses recovered from marine picocyanobacteria cultures based on Illumina and Qitan nanopore sequencing

Article Open access 12 April 2025

Differential global distribution of marine picocyanobacteria gene clusters reveals distinct niche-related adaptive strategies

Article Open access 25 February 2023

A standardized archaeal taxonomy for the Genome Taxonomy Database

Article 21 June 2021

Background & Summary

Prochlorococcus and Synechococcus are oxygenic photosynthetic bacteria that are key microbes at the base of the marine microbial food web. Together, they have colonized much of the surface ocean¹, with Synechococcus having a broad geographic range and Prochlorococcus dominating the tropical and subtropical open ocean. Prochlorococcus and Synechococcus are part of a diverse community of marine phytoplankton, which have a key role in regulating the transformation of energy and matter in the oceans. Studies of marine ecosystems have generated large metagenomic datasets^2,3,4,5, which are a rich resource for investigating the distributions of microorganisms in the marine environment⁶. Many classification tools have been developed to process growing volumes of sequence data. Some use broad reference databases (e.g., NCBI and GTDB), while others are specialized to specific groups of microbes^7,8. Accurate species identification is challenging due to the high diversity of natural samples and the limited availability of taxonomically classified genomes, particularly for distinguishing between very closely related organisms.

Determining the distributions of the picocyanobacteria Prochlorococcus and Synechococcus is of particular importance for understanding marine ecosystems. These organisms are exceptionally diverse^{9,10,11,12,13}, thus making the accurate differentiation of clusters/clades/grades within each genus difficult. As a result, Prochlorococcus and Synechococcus populations are often classified as a single strain¹⁴, obscuring the finely-tuned niche partitioning that is well documented in these groups^{15,16,17,18,19,20}. Over the past decade, a substantial amount of high-quality genomic reference data has become available to address these gaps. These data enable the linkage between the functional and taxonomic structure of Prochlorococcus and Synechococcus populations and the features of the environment in which they are embedded. They also include genomes from uncultivated single cells, dwarfing the number of available reference genomes from cultured cells by an order of magnitude^21,22,23. Single-cell genomes have substantially expanded our understanding of picocyanobacterial diversity while mitigating biases inherent in cultivation-dependent methods, which tend to select for strains adaptable to existing isolation protocols.

Here, we leveraged these extensive datasets to present a comprehensive protein dataset – ProSynTax – that has been manually curated to enable finely resolved taxonomic classification of Prochlorococcus and Synechococcus, along with the microbial and viral communities associated with these picocyanobacteria in short-read metagenomic datasets. The curated taxonomy is based on the core protein similarity of individual genomes as well as decades of research on niche partitioning and the diverse ecology and physiology of different phylogenetic branches of Synechococcus and Prochlorococcus^12,13,15,17. While we have chosen to retain this traditional clade/ecotype-based framework for its ecological relevance, the classification workflow is taxonomy-flexible and allows users to apply alternative classification systems, such as the genome-based taxonomies proposed by Tschoeke et al.²⁴ and Salazar et al.²⁵, if desired.

Accompanying this protein dataset is a new classification workflow that links high-throughput quality control of short-read sequences and taxonomic classification using dedicated classifier algorithms^26,27. This dataset leverages protein reference sequences, demonstrated to perform comparably to nucleotide references for metagenomic classification²⁸, and aligns well with our normalization approach, which utilizes single copy core genes. Custom scripts are included for reporting both the absolute and normalized read abundances (the latter referred to as genome equivalents) of Prochlorococcus and Synechococcus at user-defined taxonomic levels in metagenomic samples, ranging from broad groupings of high-light and low-light adapted Prochlorococcus to more fine-grained distinctions between individual clades/grades. The classification workflow can accurately identify the major clusters/clades/grades/ecotypes of these picocyanobacteria in short-read sequence libraries consisting of a minimum of 1 million reads in which Prochlorococcus represents at least 0.80% of reads, and Synechococcus at least 0.09% of reads. ProSynTax incorporates protein sequences from 1,260 genomes of Prochlorococcus and Synechococcus, encompassing single-amplified genomes, high-quality draft genomes, and newly closed genomes. Among its contents are newly closed circular genomes from 39 Prochlorococcus, 12 Synechococcus, and 12 marine heterotrophic bacterial strains that were co-isolated from Prochlorococcus cultures, including 29 previously partially assembled and 10 unpublished isolate Prochlorococcus genomes. Furthermore, ProSynTax includes proteins from 41,753 genomes of marine heterotrophic bacteria, archaea, and viruses to analyze the microbial and viral communities associated with Prochlorococcus and Synechococcus. The closed genomes reported here will be particularly valuable for studies of microbial evolution that rely on fully assembled genomic islands and regions of the genome that are subject to remodeling.

Methods

Enrichment and isolation of picocyanobacteria

Metadata for culture isolates included in this dataset, including both newly closed genomes and previously published partial genomes that were missing detailed isolation information, are defined here. Specifically, the strains were isolated from seawater collected from 3 cruises in the North Pacific Subtropical Gyre and 1 cruise in the South Pacific Subtropical Gyre. Ten low-light (LL) adapted Prochlorococcus strains representing the LLI and LLIV clades and 4 heterotrophic bacteria representing classes Gammaproteobacteria and Alphaproteobacteria were isolated and sequenced (Fig. 1, Table 1). These 4 marine heterotrophic bacteria were co-isolated from Prochlorococcus cultures. Synechococcus strain Cu2B8 was also isolated and sequenced; however, records of its isolation location and depth are not available (Table 1).

Table 1 Culture isolates of Prochlorococcus, Synechococcus, and marine heterotrophic bacteria.

Full size table

All Prochlorococcus strains were isolated from raw seawater, except MIT1227 and MIT1418 which were isolated from the filtrate of seawater put through a 1 or 1.2 µm filter. Different media formulations were used, with specific nutrient amendments aimed at selecting for various clades of Prochlorococcus, as described hereafter. Isolate MIT0916 was amended with Pro2 nutrients²⁹, replacing the nitrogen source with 50 µM nitrate. Isolates MIT1011, MIT1012, and MIT1013 were isolated on media containing 20 µM ammonium chloride, 1 µM sodium phosphate, 0.1x Pro99 trace metal mix²⁹, and 1 mM sodium bicarbonate. Isolates MIT1201 and MIT1205 were isolated on media containing 16 µM ammonium chloride, 1 µM sodium phosphate, and 0.1x Pro99 trace metal mix. MIT1227 was isolated on media containing 32 µM ammonium chloride, 2 µM sodium phosphate, and 0.1x Pro99 trace metal mix. Finally, MIT1418 was isolated on media containing 0.1x Pro2 nutrient and metal concentrations. The chemicals used in media preparation were of the highest purity (i.e. Sigma BioUltra) to prevent contaminants that could affect culture growth.

Heterotrophic bacteria MIT1350, MIT1358, MIT1370, MIT1388, and MIT1392 were isolated on 0.3% agar pour plates consisting of Pro99 medium amended with 0.05% pyruvate and 3.75 mM TAPS buffer. Colonies were picked and subcultured three times before being transferred into liquid Pro99 medium. MIT1350 was sequenced before freezing, while the remaining strains were directly frozen. Upon thawing MIT1358, MIT1370, MIT1388, and MIT1392 for this study, the cells were repurified using the same methods, with the exception that colonies were transferred into liquid ProMM medium³⁰ before pelleting. Purity was then assessed through genomic sequencing to verify the absence of contaminants. Heterotrophic bacteria strain MIT0238 was isolated from xenic Prochlorococcus MIT9313; however, the records detailing this isolation are not available.

Culture conditions and biomass preparation

Cultures were grown in 200 mL of Pro99 medium²⁹ at 24 °C using continuous light or a 13:11 light-dark cycle at light intensities ranging from 10 to 50 µmol photons m⁻² s⁻¹. Cells were harvested for sequencing during late exponential growth phase, with 0.003% Pluronic F-68 Polyol (MP Biomedicals, Cat# 2750049) added to promote cell aggregation and improve biomass yields during centrifugation. Cells were centrifuged at 7,197 × g for 15–20 min at room temperature. After centrifugation, the supernatant was decanted, and the resulting pellets were flash frozen in liquid nitrogen and stored at −80 °C.

Extraction and sequencing

Genomic DNA was extracted from the frozen cell pellets using a modified version of a phenol:chloroform based extraction method³¹ in order to obtain high molecular weight DNA suitable for sequencing on Pacific Biosciences (PacBio) platforms. Modifications included minimizing agitation to prevent DNA shear (e.g., gently shaking the sample tube by hand instead of vortexing), adding 1 µl of 50 mg/mL lysozyme (cat# 90082, Thermo Scientific) to the cell pellet during the digestion step, performing a second chloroform/isoamyl alcohol step to remove residual phenol, and increasing the salt concentration by the addition of 0.1 volume of 3 M sodium acetate or 3 M ammonium acetate before precipitation with 0.6–0.8 volumes of isopropanol. The tube was then gently inverted multiple times and incubated at room temperature for 30 min. DNA was recovered by centrifugation at 10,000 × g for 3 min at 4 °C. The supernatant was then carefully removed by pipetting. The pellet was then washed twice with 75% ethanol before allowing the pellet to dry at room temperature for approximately 10 min inside a biological safety hood to prevent contamination. It was then dissolved in 1x TE buffer (pH 8) and stored at −80 °C.

Genomic DNA samples were diluted and fragmented to 10–12 Kb using a gTube (cat# 520079, Covaris), followed by a 0.45X SPRIselect (cat# B23317, Beckman Coulter Life Sciences) bead cleanup. The samples were then made into indexed SMRTBell libraries following guidelines for the template prep kit v 1.0, 2.0, or 3.0 (cat# 100-259-100, 100-938-900, 102-141-700, respectively, Pacific Biosciences). These libraries were assembled into a single pool, and cleaned using MinElute reaction cleanup kit (cat# 28204, Qiagen), followed by additional cleaning using 0.4X SPRIselect beads. The libraries were then bound using Sequel Binding Kit v 2.1, Sequel II Binding Kit v 2.1, and Sequel II binding Kit v 3.2 (cat#100-369-800, 101-820-500, 102-194-100, respectively, Pacific Biosciences) and sequenced on PacBio Sequel I, Sequel II, or Sequel IIe flowcells at either the MIT BioMicro Center or Harvard Bauer core facilities.

Genome assembly

PacBio sequence data were demultiplexed by barcode using PacBio SMRTtools, generating both circular consensus sequencing reads and subreads. To maximize the likelihood of closing genomes, reads were assembled using multiple tools, including Flye v2.9.5, Canu v2.3, and SPAdes v4.1.0^32,33,34. Contigs were classified using MMseqs. 2 v17.b804f against the GTDB v207 genome database^35,36. The resulting genomes were submitted to NCBI³⁷ and IMG, with accessions provided in ProSynTax_genomes.csv³⁸. The GitHub repository, which includes documentation and code for genome assembly, is provided in the Code Availability section.

In addition to closing genomes of cultured strains, six heterotrophic bacterial genomes were successfully assembled from sequencing xenic cultures of Prochlorococcus and Synechococcus. However, since pure axenic isolates for these heterotroph strains are unavailable, we implemented a new nomenclature by appending an additional two-digit identifier (“01”) to the picocyanobacterial strain name from which they were derived. For example, Thalassospira MIT121401 was obtained from sequencing a xenic culture of Prochlorococcus MIT1214.

Dataset construction

The dataset consists of single-amplified genomes, finished genomes of culture isolates, and high-quality draft genomes of cultured isolates. These genomes were obtained from several sources, including IMG-ProPortal²¹, GORG-Tropics³⁹, NCBI RefSeq (July 2025), Cyanorak⁴⁰, and this study³⁸. No metagenome-assembled genomes were included in the dataset to ensure only high-quality reference genomes were present. Additionally, we filtered the NCBI RefSeq database to include all viral genomes and only bacterial and archaeal reference genomes, which typically includes a single reference genome per species. For any genomes without annotated protein sequences, prodigal v2.6.3⁴¹ was used to identify open reading frames. Stop codons were removed from these predicted protein sequences to remain consistent across all genomes. All protein sequences were concatenated, and Kaiju v1.10.1 was used to construct the finalized dataset (kaiju-mkbwt and kaiju-mkfmi with default parameters)²⁶.

Picocyanobacteria taxonomy

The taxonomy of Prochlorococcus and Synechococcus was manually curated based on a concatenated multiple-sequence alignment of proteins encoded by any single-copy core genes annotated in each genome. Single-copy core genes were defined as those present in exactly one copy (i.e., genes with multiple copies within any genome were removed) in 100% of 92 genomes derived from cultures in the CyCOG v6.0 database²¹ and estimated to be > 99% complete using checkM⁴². All proteins encoded by Prochlorococcus and Synechococcus genomes were aligned against proteins in the CyCOG v6.0 database (blastp v2.16.0 -evalue 0.001), selecting the best hit for each gene. Following the annotation of all genomes using the CyCOG v6.0 database, protein sequences for the 424 single-copy core genes were aligned using ClustalOmega v1.2.4⁴³, concatenated with MEGA v11.0.13⁴⁴, and used to construct a phylogenetic tree (Fig. 1) with FastTreeMP v2.1.11 (-lg -boot 100)⁴⁵. Genomes missing more than 95% data among positions with <50% gaps or genomes with less than 5 single-copy core genes were removed from the dataset and phylogenetic tree. The classified genomes were then analyzed using CheckM v1.2.3⁴² to determine the completeness and contamination for each genome.

Data Records

ProSynTax data files are accessible online through a Zenodo repository³⁸, which includes the following files:

Reference genomes in ProSynTax

ProSynTax_genomes.csv

Table of genomes included in the ProSynTax dataset and their associated metadata. Data fields are as follows:

organism: The name of the organism recorded in NCBI when available. For genomes/organisms obtained from sources other than NCBI, the organism’s name is provided in NCBI format

genome_short_name: The genome name used in the ProSynTax dataset

domain: Bacteria, Archaea, Eukarya, or Virus

genus: The genus of the organism in NCBI

clade: The major cluster/clade/grade of Prochlorococcus or Synechococcus based on phylogenetic reconstruction using a concatenated alignment of proteins encoded by single-copy core genes

NCBI_BioProject: The NCBI BioProject accession number associated with the organism, when available

NCBI_BioSample: The NCBI BioSample accession number associated with the organism, when available

NCBI_GenBank: The NCBI GenBank accession number associated with the genome sequence data, when available

IMG_Genome_ID: The IMG Genome ID accession number, also known as the IMG Taxon ID, corresponds to the genome or organism in the Joint Genome Institute’s (JGI) Integrated Microbial Genomes (IMG) repository, when available

New_genome_SRA_accession: SRA accession numbers for new genomes generated for this study

Taxonomic classification files for Kaiju

ProSynTax_names.dmp

Names taxonomy file for use with the ProSynTax dataset

ProSynTax_nodes.dmp

Nodes taxonomy file for use with the ProSynTax dataset

ProSynTax_v1.1.fmi.bz2

Index file containing contents of ProSynTax_v1.faa for use with ProSynTax dataset

Raw protein files used to build.fmi file for Kaiju

ProSynTax_v1.1.faa.bz2

File containing protein sequences used by Kaiju for classification of reads. Each protein sequence contains a header starting with “ > ”

ProSynTax_v1.1_without_refseq.faa.bz2 File containing protein sequences found in ProSynTax with NCBI RefSeq genomes removed. Each protein sequence contains a header starting with “ > ”.

Files for read count normalization

CyCOG6.dmnd

Database containing orthologous groups of proteins used in the cluster/clade/grade normalization step

average_cycog_length.csv

Comma separated file containing the average length for each protein sequence used in the normalization step. Data fields are as follows:

cycog: name for single-copy core gene

mean_AA_length: the average length of amino acids in the protein sequence of the gene

Technical validation files

ProSynTax-workflow_benchmarking_genomes.tsv

This tab-delimited file contains a list of subsetted genomes used in each benchmarking experiment reported in the Technical Validation section. Data fields are as follows:

Experiment Name: name of benchmarking experiment conducted

Subset ID: unique ID from the random genome subsetting

Genome Name: name of genome used in benchmarking experiment

ProSynTax-workflow_benchmarking_composition.tsv

This tab-delimited file contains the taxon composition of all samples used in each benchmarking experiment reported in the Technical Validation section. Data fields are as follows:

Experiment Name: name of benchmarking experiment conducted

Sample Name: unique sample name

Percent Prochlorococcus: percent of reads in simulated sample originating from Prochlorococcus genomes

Percent Synechococcus: percent of reads in simulated sample originating from Synechococcus genomes

Percent Heterotroph: percent of reads in simulated sample originating from marine heterotrophic bacterial genomes

Experiment Description: description of each benchmarking experiment which corresponds to varying taxonomic compositions

Notes: additional information about the simulated sample

Technical Validation

Generation of mock metagenome data

To assess the accuracy of cluster/clade/grade-level classification of metagenomic reads using ProSynTax, we generated mock metagenome sequence datasets with known genome inputs. The description for each benchmarking experiment and accompanying composition for each taxonomic group in the mock sequence datasets generated can be found in the ProSynTax-workflow_benchmarking_composition.tsv file. Using a custom script³⁸, 20% of genomes from each Prochlorococcus and Synechococcus cluster/clade/grade and 20% of marine heterotrophs from the reference genome dataset were randomly selected for read simulation. This process of random selection was done 10 times for each benchmarking experiment. The list of genomes resulting from this random selection and their associated subset ID (1–10) can be found in the ProSynTax-workflow_benchmarking_genomes.tsv file. These genomes were then used as input for mason2 v2.0.0-beta1⁴⁶ to create simulated mock metagenomes with 1 million paired-end 150 nt sequences. The ProSynTax_v1.1.faa.bz2 file used to generate the ProSynTax_v1.1.fmi.bz2 file for classification, was modified to exclude sequences from genomes included in the mock metagenomes by using the Kaiju kaiju-mkbwt and kaiju-mkfmi functionalities²⁶. This ensures that genomes used to create mock metagenomes were excluded from the set used for classification.

Assessment of misclassification rates

To evaluate how reads were classified across taxonomic groups, we analyzed the generated mock sequence dataset (described in the Generation of Mock Metagenome Data above) using the classification workflow. Reads for each taxonomic group (i.e., Prochlorococcus, Synechococcus, and Heterotrophs) were simulated independently, processed through the classification workflow, and results from 10 random subsets were averaged. Classification accuracy was measured as the proportions of reads correctly classified versus misclassified (Table 2). For example, when Prochlorococcus reads are processed through the classification workflow, ~81% are correctly classified as Prochlorococcus, ~0.7% are misclassified as Synechococcus, ~3% as Heterotrophs, and ~15% cannot be classified (“Unclassified”).

Table 2 Misclassification rates of simulated metagenomic reads for Prochlorococcus, Synechococcus, and marine heterotrophic bacteria.

Full size table

Limit of detection

To estimate the minimum detectable abundance of Prochlorococcus in a mock sample, we simulated datasets with varying proportions of Prochlorococcus and heterotroph reads, and processed them through the classification workflow. The taxonomic composition for the mock datasets are available on Zenodo (ProSynTax-workflow_benchmarking_composition.tsv)³⁸. The Prochlorococcus misclassification rate was calculated as the number of heterotroph reads misclassified as Prochlorococcus divided by the total reads classified as Prochlorococcus. To maintain a misclassification rate below 10%, Prochlorococcus read abundance must exceed 0.08% of all classified reads (Fig. 2A). We applied the same simulation-based approach for Synechococcus, generating datasets with varying proportions of Synechococcus and heterotroph reads and calculated the misclassification rate in the same manner (ProSynTax-workflow_benchmarking_composition.tsv)³⁸. Using this approach, Synechococcus requires a read abundance greater than 0.02% to maintain a misclassification rate below 10% (Fig. 2C). For more sensitive cluster/clade/grade-level identification with a target misclassification rate below 5%, the minimum required read abundance is 0.15% for Prochlorococcus and 0.03% for Synechococcus (Fig. 2A,C).

To establish the minimum Prochlorococcus:Synechococcus ratio and vice versa, we simulated samples with varying proportions of the two genera and processed them through the classification workflow. This was done to determine the detection limtis and minimize the risk of misclassifying one cyanobacterium as the other. Misclassification rates were calculated by dividing misclassified reads by the total classified reads for each taxon. To keep the Prochlorococcus misclassification rate below 10%, the Prochlorococcus:Synechococcus ratio must be greater than 0.23 (Fig. 2B). Similarly, to maintain a Synechococcus misclassification rate below 10%, the Synechococcus:Prochlorococcus must be greater than 0.10 (Fig. 2D). For a more sensitive classification with a 5% misclassification rate, the required thresholds are a Prochlorococcus:Synechococcus ratio of 0.40 and a Synechococcus:Prochlorococcus ratio of 0.20 (Fig. 2B,D). Depending on the user-defined tolerances, classification results should be filtered to exclude samples that do not meet the ratio requirements for the desired sensitivity.

Cluster/clade/grade accuracy

To evaluate the classification workflow’s accuracy in estimating the composition of different Prochlorococcus and Synechococcus clusters/clades/grades, we processed simulated samples with equal proportions of all clusters/clades/grades with and without marine heterotrophic bacteria, following the composition outlined in the Zenodo file (ProSynTax-workflow_benchmarking_composition.tsv)³⁸. Specifically, we generated simulated samples containing only the respective picocyanobacterium, as well as mixtures where the respective picocyanobacterium was present at 2% with 98% marine heterotrophic bacteria, and at 1% with 98% marine heterotrophic bacteria plus 1% of the other picocyanobacterium.

The classification results for both Prochlorococcus and Synechococcus closely matched expected values, with most clade, grade, or cluster classifications differing by less than 5% (Fig. 3). Discrepancies were minor, all under 10%, and largely attributable to the underrepresentation of certain groups in the reference genome dataset (e.g., Prochlorococcus grade LLVIII and AMZ-III; Fig. 3A; Synechococcus clusters 5.1A-III, 5.1B-V, 5.1B-VI, and 5.2; Fig. 3B).

Validation with field data

To evaluate the accuracy and effectiveness of ProSynTax and the associated classification workflow on field data, we analyzed the fraction of classified Prochlorococcus reads relative to the total reads (classified + unclassified) and assessed the clade/grade diversity throughout a depth profile at Station ALOHA using metagenomic data from cruises HOT224-238⁴⁷. Using the 5% misclassification rate (e.g., 0.15% Prochlorococcus abundance; Prochlorococcus:Synechococcus ratio 0.43) established in Fig. 2, we filtered the data to ensure reliable estimation of total Prochlorococcus read abundance at each depth (Fig. 4A). We then classified clade/grade diversity at each depth and found predominantly high-light (HL) Prochlorococcus clades at <75 m and low-light (LL) clades/grades at >125 m (Fig. 4B). These results are consistent with many studies on Prochlorococcus diversity at Station ALOHA^22,48,49,50, demonstrating that the ProSynTax dataset and workflow accurately capture ecological patterns consistent with previous studies.

Usage Notes

The ProSynTax dataset and classification workflow assigns taxonomic classifications to metagenomic reads based on the taxonomic hierarchy in ProSynTax and uses these classified reads to estimate the genome equivalents of Prochlorococcus and Synechococcus clusters, clades, or grades within a metagenome sample. This dataset can be customized by allowing users to add new genomes as needed as well as modify the taxonomy in order to improve classification accuracy and adapt the classification workflow to specific research needs.

To setup and run the ProSynTax classification workflow, first clone the ProSynTax classification workflow repository into a suitable directory. Then, download the protein dataset index file (ProSynTax_v1.1.fmi), the names (ProSynTax_names.dmp), the nodes (ProSynTax_nodes.dmp), and the diamond blast database (CyCOG6.dmnd) from the Zenodo repository³⁸. Once the repository is cloned and the dataset files are in place, update the input files to specify the correct file paths and directories. To execute the classification workflow, navigate to the scripts directory and run the run_classify_smk.sbatch script.

The classification workflow begins by trimming low-quality regions and removing adapter sequences from paired-end FASTQ files using BBDuk v39.18²⁷, followed by read classification with Kaiju v1.10.1²⁶. A custom Python script³⁸ summarizes raw read counts by genus, after which reads identified as Prochlorococcus and Synechococcus are extracted into separate FASTA files using Seqtk vr82⁵¹. To correct for genome length variation across clusters/clades/grades, raw reads are aligned to single-copy core genes (SCCG) from the CyCOG v6.0 database²¹ using DIAMOND Blastx v2.1.11⁵². Genome equivalents are estimated by normalizing total SCCG read residues to the average SCCG length.

It is recommended to filter out samples with low Prochlorococcus or Synechococcus abundance and to adjust parameters when working with low-coverage samples. Samples showing high proportions of unclassified reads or significant taxonomic imbalances may also benefit from parameter tuning. For broader Prochlorococcus ecotype comparisons, a 10% misclassification rate is generally sufficient, whereas a stricter 5% misclassification rate is recommended for cluster/clade/grade delineations (Fig. 2). For detailed guidance, comprehensive tutorials are available on GitHub (https://github.com/jamesm224/ProSynTax-workflow).

Data availability

All ProSynTax files are available on Zenodo (https://doi.org/10.5281/zenodo.14889680)³⁸, including the main dataset, files for the complementary classification workflow, and benchmarking files used the in Technical Validation section. Raw sequencing reads for genomes originating from this study were submitted in the NCBI Sequence Read Archive under accession SRP569671³⁷. Individual SRA accession numbers for genomes generated in this study are available in the ProSynTax_genomes.csv file on Zenodo³⁸.

Code availability

Code and documentation for ProSynTax associated classification workflow, along with the accompanying technical validation are available on the GitHub repository: https://github.com/jamesm224/ProSynTax-workflow Code and documentation for Genome Closing workflow are available on the GitHub repository: https://github.com/konnorve/HIFI-genome-closing-improved.git.

References

Flombaum, P. et al. Present and future global distributions of the marine Cyanobacteria Prochlorococcus and Synechococcus. PNAS 110, 9824–9829 (2013).
Article ADS CAS PubMed PubMed Central Google Scholar
Carradec, Q. et al. A global ocean atlas of eukaryotic genes. Nat Commun 9, 373 (2018).
Article ADS PubMed PubMed Central Google Scholar
Clayton, S. et al. Bio-GO-SHIP: The Time Is Right to Establish Global Repeat Sections of Ocean Biology. Front. Mar. Sci. 8 (2022).
Sunagawa, S. et al. Ocean plankton. Structure and function of the global ocean microbiome. Science 348, 1261359 (2015).
Article PubMed Google Scholar
Sunagawa, S. et al. Tara Oceans: towards global ocean ecosystems biology. Nat Rev Microbiol 18, 428–445 (2020).
Article CAS PubMed Google Scholar
Caputi, L. et al. Community-Level Responses to Iron Availability in Open Ocean Plankton Ecosystems. Global Biogeochemical Cycles 33, 391–419 (2019).
Article ADS CAS Google Scholar
Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927 (2019).
Article PubMed PubMed Central Google Scholar
Groussman, R. D., Blaskowski, S., Coesel, S. N. & Armbrust, E. V. MarFERReT, an open-source, version-controlled reference library of marine microbial eukaryote functional genes. Sci Data 10, 926 (2023).
Article CAS PubMed PubMed Central Google Scholar
Mazard, S., Ostrowski, M., Partensky, F. & Scanlan, D. J. Multi-locus sequence analysis, taxonomic resolution and biogeography of marine Synechococcus. Environmental Microbiology 14, 372–386 (2012).
Article CAS PubMed Google Scholar
Kashtan, N. et al. Single-cell genomics reveals hundreds of coexisting subpopulations in wild Prochlorococcus. Science 344, 416–420 (2014).
Article ADS CAS PubMed Google Scholar
Farrant, G. K. et al. Delineating ecologically significant taxonomic units from global patterns of marine picocyanobacteria. Proc Natl Acad Sci USA 113, E3365–3374 (2016).
Article CAS PubMed PubMed Central Google Scholar
Doré, H. et al. Evolutionary Mechanisms of Long-Term Genome Diversification Associated With Niche Partitioning in Marine Picocyanobacteria. Front Microbiol 11, 567431 (2020).
Article PubMed PubMed Central Google Scholar
Becker, J. W. et al. Novel isolates expand the physiological diversity of Prochlorococcus and illuminate its macroevolution. mBio 15, e0349723 (2024).
Article PubMed Google Scholar
Tarn, J., Peoples, L. M., Hardy, K., Cameron, J. & Bartlett, D. H. Identification of Free-Living and Particle-Associated Microbial Communities Present in Hadal Regions of the Mariana Trench. Front Microbiol 7, 665 (2016).
Article PubMed PubMed Central Google Scholar
Johnson, Z. I. et al. Niche partitioning among Prochlorococcus ecotypes along ocean-scale environmental gradients. Science 311, 1737–1740 (2006).
Article ADS CAS PubMed Google Scholar
Larkin, A. A. et al. Niche partitioning and biogeography of high light adapted Prochlorococcus across taxonomic ranks in the North Pacific. ISME J 10, 1555–1567 (2016).
Article PubMed PubMed Central Google Scholar
Kent, A. G. et al. Parallel phylogeography of Prochlorococcus and Synechococcus. ISME J 13, 430–441 (2019).
Article PubMed Google Scholar
Thompson, A. W., Kouba, K. & Ahlgren, N. A. Niche partitioning of low-light adapted Prochlorococcus subecotypes across oceanographic gradients of the North Pacific Subtropical Front. Limnology and Oceanography 66, 1548–1562 (2021).
Article ADS Google Scholar
Ustick, L. J., Larkin, A. A. & Martiny, A. C. Global scale phylogeography of functional traits and microdiversity in Prochlorococcus. ISME J 17, 1671–1679 (2023).
Article PubMed PubMed Central Google Scholar
Hunter-Cevera, K. R., Post, A. F., Peacock, E. E. & Sosik, H. M. Diversity of Synechococcus at the Martha’s Vineyard Coastal Observatory: Insights from Culture Isolations, Clone Libraries, and Flow Cytometry. Microb Ecol 71, 276–289 (2016).
Article ADS CAS PubMed Google Scholar
Berube, P. M. et al. Single cell genomes of Prochlorococcus, Synechococcus, and sympatric microbes from diverse marine environments. Sci Data 5, 180154 (2018).
Article CAS PubMed PubMed Central Google Scholar
Malmstrom, R. R. et al. Temporal dynamics of Prochlorococcus ecotypes in the Atlantic and Pacific oceans. ISME J 4, 1252–1264 (2010).
Article PubMed Google Scholar
Ulloa, O. et al. The cyanobacterium Prochlorococcus has divergent light-harvesting antennae and may have evolved in a low-oxygen ocean. PNAS 118 (2021).
Tschoeke, D. et al. Unlocking the Genomic Taxonomy of the Prochlorococcus Collective. Microb Ecol 80, 546–558 (2020).
Article ADS PubMed Google Scholar
Salazar, V. W. et al. A new genomic taxonomy system for the Synechococcus collective. Environ Microbiol 22, 4557–4570 (2020).
Article CAS PubMed Google Scholar
Menzel, P., Ng, K. L. & Krogh, A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun 7, 11257 (2016).
Article ADS CAS PubMed PubMed Central Google Scholar
Bushnell, B. BBMap: A Fast, Accurate, Splice-Aware Aligner. Conference: 9th Annual Genomics of Energy & Environment Meeting (2014).
Ye, S. H., Siddle, K. J., Park, D. J. & Sabeti, P. C. Benchmarking Metagenomics Tools for Taxonomic Classification. Cell 178, 779–794 (2019).
Article CAS PubMed PubMed Central Google Scholar
Moore, L. R. et al. Culturing the marine cyanobacterium Prochlorococcus. Limnol. Oceanogr. Methods 5, 353–362 (2007).
Article CAS Google Scholar
Berube, P. M. et al. Physiology and evolution of nitrate acquisition in Prochlorococcus. ISME J 9, 1195–1207 (2015).
Article CAS PubMed Google Scholar
Wilson, K. Preparation of Genomic DNA from Bacteria. Current Protocols in Molecular Biology 56, 2.4.1–2.4.5 (2001).
Article Google Scholar
Kolmogorov, M. et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat Methods 17, 1103–1110 (2020).
Article CAS PubMed PubMed Central Google Scholar
Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 27, 722–736 (2017).
Article CAS PubMed PubMed Central Google Scholar
Prjibelski, A., Antipov, D., Meleshko, D., Lapidus, A. & Korobeynikov, A. Using SPAdes De Novo Assembler. Curr Protoc Bioinformatics 70, e102 (2020).
Article CAS PubMed Google Scholar
Parks, D. H. et al. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Research 50, D785–D794 (2022).
Article CAS PubMed Google Scholar
Steinegger, M. & Söding, J. MMseqs. 2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35, 1026–1028 (2017).
Article CAS PubMed Google Scholar
NCBI Sequence Read Archive. https://www.ncbi.nlm.nih.gov/sra/SRP569671 (2025).
Coe, A. et al. ProSynTax: Prochlorococcus and Synechococcus Taxonomy Database. Zenodo https://doi.org/10.5281/zenodo.14889680 (2025).
Pachiadaki, M. G. et al. Charting the Complexity of the Marine Microbiome through Single-Cell Genomics. Cell 179, 1623–1635.e11 (2019).
Article CAS PubMed PubMed Central Google Scholar
Garczarek, L. et al. Cyanorak v2.1: a scalable information system dedicated to the visualization and expert curation of marine and brackish picocyanobacteria genomes. Nucleic Acids Research 49, D667–D676 (2021).
Article CAS PubMed Google Scholar
Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).
Article PubMed PubMed Central Google Scholar
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25, 1043–1055 (2015).
Article CAS PubMed PubMed Central Google Scholar
Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7, 539 (2011).
Article PubMed PubMed Central Google Scholar
Tamura, K., Stecher, G. & Kumar, S. MEGA11: Molecular Evolutionary Genetics Analysis Version 11. Molecular Biology and Evolution 38, 3022–3027 (2021).
Article CAS PubMed PubMed Central Google Scholar
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol Biol Evol 26, 1641–1650 (2009).
Article CAS PubMed PubMed Central Google Scholar
Holtgrewe, M. & Mason, - Tools for Biological Sequence Simulation (v3.0.0-beta1). GitHub https://github.com/seqan/seqan/blob/main/apps/mason2/README (2014).
Mende, D. R. et al. Environmental drivers of a microbial genomic transition zone in the ocean’s interior. Nat Microbiol 2, 1367–1373 (2017).
Article CAS PubMed Google Scholar
Mende, D. R., Boeuf, D. & DeLong, E. F. Persistent Core Populations Shape the Microbiome Throughout the Water Column in the North Pacific Subtropical Gyre. Front. Microbiol. 10 (2019).
Coe, A. et al. Survival of Prochlorococcus in extended darkness. Limnology and Oceanography 61, 1375–1388 (2016).
Article ADS Google Scholar
Thompson, A. W. et al. Dynamics of Prochlorococcus Diversity and Photoacclimation During Short-Term Shifts in Water Column Stratification at Station ALOHA. Front. Mar. Sci. 5 (2018).
Li, H. SeqTK, Toolkit for processing sequences in FASTA/Q formats v1.4lh3. GitHub https://github.com/lh3/seqtk (2023).
Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods 18, 366–368 (2021).
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

We thank the crews and research teams of the KM0915, MV1015, KM1217, KM1309, and KOK1404 cruises on the research vessels R/V Kilo Moana, R/V Melville, and R/V Kaʻimikai-O-Kanaloa for their support in facilitating the Prochlorococcus isolations used in this study. This work was supported by grants from the National Science Foundation (OCE-1153588 to S.W.C.; DBI-0424599 to S.W.C.; OCE-2048470 to P.M.B.), the Gordon and Betty Moore Foundation (GBMF495 to S.W.C.; GBMF4511 to S.W.C.), the Robert and Ardis James Foundation to S.W.C., and the Simons Foundation (Life Sciences Project Award ID 337262, S.W.C.; SCOPE Award ID 329108, S.W.C.; Marine Microbial Ecology Postdoctoral Fellowship 984601, M.A.D.). This paper is a contribution from the Simons Collaboration on Ocean Processes and Ecology (SCOPE).

Author information

These authors contributed equally: Allison Coe, James I. Mullet, Nhi N. Vo.

Authors and Affiliations

Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
Allison Coe, James I. Mullet, Nhi N. Vo, Paul M. Berube, Maya I. Anjur-Dietrich, Eli Salcedo, Sierra M. Parker, Konnor VonEmster, Christina Bliem, Aldo A. Arellano, Kurt G. Castro, Jamie W. Becker & Sallie W. Chisholm
Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
Sallie W. Chisholm

Authors

Allison Coe
View author publications
Search author on:PubMed Google Scholar
James I. Mullet
View author publications
Search author on:PubMed Google Scholar
Nhi N. Vo
View author publications
Search author on:PubMed Google Scholar
Paul M. Berube
View author publications
Search author on:PubMed Google Scholar
Maya I. Anjur-Dietrich
View author publications
Search author on:PubMed Google Scholar
Eli Salcedo
View author publications
Search author on:PubMed Google Scholar
Sierra M. Parker
View author publications
Search author on:PubMed Google Scholar
Konnor VonEmster
View author publications
Search author on:PubMed Google Scholar
Christina Bliem
View author publications
Search author on:PubMed Google Scholar
Aldo A. Arellano
View author publications
Search author on:PubMed Google Scholar
Kurt G. Castro
View author publications
Search author on:PubMed Google Scholar
Jamie W. Becker
View author publications
Search author on:PubMed Google Scholar
Sallie W. Chisholm
View author publications
Search author on:PubMed Google Scholar

Contributions

P.M.B., M.A.D., A.C., S.W.C., J.I.M., N.N.V. conceived and designed experiments. P.M.B. and J.W.B. isolated cultures and A.C. and P.M.B. maintained cultures. A.C., K.G.C., S.M.P., E.S., C.B. and A.A.A. grew cultures and performed DNA extractions. N.N.V., P.M.B., K.V.M. and A.C. closed genomes and submitted data to repositories. J.I.M., P.M.B. and N.N.V. developed code, workflow, and dataset. N.N.V. and J.I.M. performed validation of the dataset and M.A.D., A.C. and P.M.B. analyzed validation results. All authors contributed to writing and editing the manuscript.

Corresponding author

Correspondence to Allison Coe.

Ethics declarations

Competing interests

The authors declare no competing interests. The funders had no role in the design or execution of the study nor the decision to submit the work for publication.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Coe, A., Mullet, J.I., Vo, N.N. et al. A curated protein dataset for taxonomic classification of Prochlorococcus and Synechococcus in metagenomes. Sci Data 12, 1895 (2025). https://doi.org/10.1038/s41597-025-06164-5

Download citation

Received: 26 March 2025
Accepted: 21 October 2025
Published: 03 December 2025
Version of record: 03 December 2025
DOI: https://doi.org/10.1038/s41597-025-06164-5