Metagenome-assembled-genomes recovered from the Arctic drift expedition MOSAiC

Boulton, William; Salamov, Asaf; Grigoriev, Igor V.; Calhoun, Sara; LaButti, Kurt; Riley, Robert; Barry, Kerrie; Fong, Allison A.; Hoppe, Clara J. M.; Metfies, Katja; Oetjen, Kersten; Eggers, Sarah Lena; Müller, Oliver; Gardner, Jessie; Granskog, Mats A.; Torstensson, Anders; Oggier, Marc; Larsen, Aud; Bratbak, Gunnar; Toseland, Andrew; Leggett, Richard M.; Moulton, Vincent; Mock, Thomas

doi:10.1038/s41597-025-04525-8

Download PDF

Data Descriptor
Open access
Published: 04 February 2025

Metagenome-assembled-genomes recovered from the Arctic drift expedition MOSAiC

William Boulton¹,
Asaf Salamov²,
Igor V. Grigoriev ORCID: orcid.org/0000-0002-3136-8903^2,3,
Sara Calhoun²,
Kurt LaButti²,
Robert Riley²,
Kerrie Barry²,
Allison A. Fong⁴,
Clara J. M. Hoppe ORCID: orcid.org/0000-0002-2509-0546⁴,
Katja Metfies⁴,
Kersten Oetjen⁴,
Sarah Lena Eggers⁴,
Oliver Müller⁵,
Jessie Gardner⁶,
Mats A. Granskog ORCID: orcid.org/0000-0002-5035-4347⁷,
Anders Torstensson⁸,
Marc Oggier⁹,
Aud Larsen¹⁰,
Gunnar Bratbak⁵,
Andrew Toseland ORCID: orcid.org/0000-0002-6513-956X¹¹,
Richard M. Leggett¹²,
Vincent Moulton¹ &
…
Thomas Mock ORCID: orcid.org/0000-0001-9604-0362¹¹

Scientific Data volume 12, Article number: 204 (2025) Cite this article

5807 Accesses
4 Citations
16 Altmetric
Metrics details

Subjects

Abstract

The Multidisciplinary Observatory for Study of the Arctic Climate (MOSAiC) expedition consisted of a year-long drifting survey of the Central Arctic Ocean. The ecosystems component of MOSAiC included the sampling of molecular data, with metagenomes collected from a diverse range of environments. The generation of metagenome-assembled-genomes (MAGs) from metagenomes are a starting point for genome-resolved analyses. This dataset presents a catalogue of MAGs recovered from a set of 73 samples from MOSAiC, including 2407 prokaryotic and 56 eukaryotic MAGs, as well as annotations of a near complete eukaryotic MAG using the Joint Genome Institute (JGI) annotation pipeline. The metagenomic samples are from the surface ocean, chlorophyll maximum, mesopelagic and bathypelagic, within leads and under-ice ocean, as well as melt ponds, ice ridges, and first- and second-year sea ice. This set of MAGs can be used to benchmark microbial biodiversity in the Central Arctic Ocean, compare individual strains across space and time, and to study changes in Arctic microbial communities from the winter to summer, at a genomic level.

Compendium of 530 metagenome-assembled bacterial and archaeal genomes from the polar Arctic Ocean

Article 15 November 2021

Marine picoplankton metagenomes and MAGs from eleven vertical profiles obtained by the Malaspina Expedition

Article Open access 01 February 2024

Recovery of 240 metagenome-assembled genomes from coastal mariculture environments in South Korea

Article Open access 20 August 2024

Background & Summary

The Central Arctic Ocean is one of the most understudied biomes on Earth and it is home to a significant amount of microbial diversity relative to its area. Furthermore, the Arctic Ocean is warming at a higher rate than lower latitudes, which has detrimental effects on the diversity of Arctic ocean biomes¹. Microbes that underpin these biomes play an important role as primary producers and therefore as the base of the marine food web and central for biogeochemical cycles. They are also a source of novel genes, many of which have been of interest to the biotechnology industry². It is therefore important to understand the effect of changing environmental conditions on these organisms including the evolution of their genes and genomes. However, due to the inaccessibility of the Central Arctic Ocean, there is a large gap in our understanding of microbial communities which thrive in this habitat. The MOSAiC expedition provided invaluable opportunities to address this knowledge gap and is the first year-long survey of the Central Arctic Ocean, facilitating sampling throughout the entire Arctic winter.

The MOSAiC expedition (2019–2020) was a Lagrangian drift survey, based around the icebreaker RV Polarstern³, which moved from north of the Laptev Sea northwards during leg 1 (19^th September to 15^th December 2019), before drifting southwards toward the Fram Strait (legs 2 to 4, 15^th December 2019 to 12^th August 2020) and finally returning to the Central Arctic Ocean (leg 5, 12^th August to 12^th October 2020) at the end of the expedition. During each leg, metagenomes were collected from both sea ice and pelagic waters, with a minority of samples collected from sediment traps under sea ice. Samples were collected either as part of a core time series, from intense observation periods of opportunistic sampling, or as part of the HAVOC project (Ridges – safe HAVens for ice-associated flora and fauna in a seasonally ice-covered Arctic Ocean), collecting samples from ice-ridges, under-ice water, and from sediment traps beneath sea-ice ridges and level ice⁴.

Sea-ice ridges are characteristic features covering 25 to 45% of the Arctic sea ice area⁵. Ridges are formed by pressure from drifting ice. When ice floes are forced together, they break up and are pushed up and down to form a sail above and a keel below the surface water level. The keel consists of ice blocks separated by voids, described as macroporosity, making up ~15–30% of the volume^6,7. The voids may be empty (in the sail) or filled with liquid or frozen seawater or meltwater (in the keel). While the bottom of level sea ice is known as an important habitat for Arctic marine biodiversity and activity, much less is known of the life within ridge keel voids which constitute unique habitats and biological hotspots in the Arctic Ocean^8,9,10. As ridges are logistically harder to navigate and take samples from, they are relatively understudied compared to level ice. The aim of the HAVOC project was to better understand how ice ridges act as a refuge for microbial biodiversity and activity, and how food web and biogeochemical processes at the ice-ocean interface and the underlying water column differ between ridges and level sea ice⁴.

This study consists of metagenome-assembled-genomes (MAGs) assembled from a pilot sequencing project of 15 metagenomes collected during Leg 2 (15^th December to 3^rd March) of the voyage in the Arctic winter as part of the core time series, and 58 metagenomes associated with the HAVOC project. Advances in metagenomic sequencing, assembly and binning, have generated a wealth of MAG datasets, even for Arctic environments such as in¹¹. However, challenges associated with the assembly and binning of eukaryotes have meant that these generally focus on prokaryotic MAGs. Two exceptions are Duncan et al. and Delmont et al.^12,13, which generated 21 and 25 eukaryotic medium and high quality MAGs from Arctic metagenomes respectively (i.e. above 50% completeness). The data presented here includes both prokaryotic and eukaryotic MAGs, using coassembly to improve the coverage of eukaryotes. These MAGs can be used to gain insights into microbial diversity and the metabolic potential of microbiomes during the Arctic winter, and the role of ice ridges in maintaining the microbial biodiversity of the Arctic Ocean.

Methods

Sampling

Of the 73 samples presented here, 15 were collected as part of a year-round time-series, during leg 2 of the expedition (between 13 January 2019 and 7 February 2020) as described in Winder et al.¹⁴, and sequenced as part of a pilot sequencing project (hereon called pilot samples). Of these samples, 8 were from pelagic layers and the remaining 7 from sea ice. The pelagic pilot samples were collected using a CTD rosette, on three different days. Sequenced pilot samples from a sampling event on February 6^th 2020 consist of 2 co-located samples (i.e. replicates, from the same CTD cast and sampled at the same depth, but from different Niskin bottles) taken from a depth of 20 m, and one sample from a depth of 202 m. One sample was collected on February 7^th 2020, at a depth of 4082 m. Further, 2 replicates from a depth of 50 m sampled on January 16^th 2020 were sequenced.

Additionally, for each of the two biological replicates, a third technical replicate was generated by pooling remaining material from the two replicate samples. These data are summarised in Supplementary Table 1, with the two samples generated through pooling identified with a sample identifier suffix ‘pool’ (column E).

The remaining 7 pilot samples from sea ice were taken from the first-year and second-year MOSAiC ice-coring sites on the floe, as described in Nicolaus et al.¹⁵ and Fong et al.¹⁶. First and second-year sea-ice chemical and physical properties are available in Oggier et al.^17,18, Lei et al.¹⁹, and properties for snow in Macfarlane et al.²⁰. Generally, 2–4 cores were collected on a weekly basis, cut in 10 cm sections, except the bottom where two 5 cm thick section were cut, and pooled per section to allow for enough biomass in the DNA samples. For the pilot study, samples per depth interval always represent pools of three individual cores collected after each other in the same location (adjacent within 40 cm). Of these, five samples were collected from different 5–10 cm thick sections of cores from the same coring site on February 3^rd 2020 from first-year ice; three from the upper part (20–50 cm from the top), one from the middle section (70–80 cm from the top), and one from the bottom-most section (122 to 127 cm from the top) of the sea ice, i.e. at the sea-ice interface. The remaining two samples were second-year ice, also from the bottom most 5 cm section, i.e. from the sea–ice interface, 1.23–1.28 m and 1.43–1.48 m from the top, collected on January 13^th and 27^th 2020.

The 58 metagenomes from the HAVOC project were collected during legs 2, 3 and 4 (collection dates between 22^nd January 2020 and 26^th July 2020), either: from sediment traps, directly below an ice ridge (7 samples) or level ice (10 samples) at depths of 5, 15 and 50 m, in the water column at 20 m (2 samples), from under ice water below an ice ridge (2 samples) and level ice (3 samples), from seawater taken from voids in the ice ridge (10 samples), or from a 10 cm ice core section at different depths of the ice ridge, as either ridge bottom ice (3 samples), top of void ice (5 samples), bottom of void ice (6 samples), refrozen void ice (4 samples) and ice samples at irregular depths (6 samples). Samples from the same location, depth, and time (Supplementary Table 1) are considered replicates, with samples taken from a total of 24 distinct locations and depths. The number of pooled core sections for each sample, and section thickness, are recorded in Supplementary Table 1. Figure 1 summarises the locations of the samples.

For both the HAVOC and pilot sea ice samples, ice cores were sectioned in the field, transferred to sterile plastic bags, and brought back to the RV Polarstern. On board, 50 ml 0.22 µm filtered sea water was added per cm sea ice, and the sea ice samples melted within 24–36 hours in the dark at around 17–22°C. The use of filtered seawater was made necessary due to logistical constraints during the drift; this is a possible source of contamination for the sea ice samples. For both sea ice and pelagic samples, water was filtered through a Sterivex 0.22 µm filter or when volumes were <500 mL (HAVOC sediment trap samples and three HAVOC ice samples) onto 0.22 µm Durapore filters. The filters were immediately flash frozen in liquid nitrogen and stored at −80 °C on board the RV Polarstern, and subsequently shipped to either the Alfred Wegener Institute (pilot samples), or the University of Bergen (HAVOC samples), at a temperature of −80 °C.

DNA extraction, purification, and sequencing

Following shipping, DNA from Sterivex filters was extracted using the Qiagen PowerWater DNA kit, following the QIAGEN DNeasy Power Water SOP v1 for the ice and water samples, and the QIAGEN Dneasy Power Soil SOP v1 (QIAGEN N.V., Hilden, Germany) used for the DNA extraction from Durapore filters.

Plates were shipped to the Joint Genome Institute (JGI, CA, USA) under dry ice, and sequenced using 151 base pair paired-end reads with an Illumina NovaSeq S4 device. The library type used was in all cases either the Illumina low or regular concentration protocol, with between 0 and 15 rounds of PCR applied to samples, though in all but 5 cases, this was restricted to only either 0 or 5 rounds.

Supplementary Table 3 outlines the library preparation steps and sequencing protocols used for each of the samples.

Genome assembly and binning

Samples were first assembled individually using the JGI MAP pipeline²¹, with prokaryotic bins recovered on a per-sample basis. In brief, samples were filtered for quality with BBDuk, error corrected using BBCMS, assembled using SPAdes²², and reads mapped back to contigs using BBMap (38.86)²³. Binning was performed using MetaBAT 2²⁴ (v2.1.15) and assessed for quality using CheckM²⁵ (v1.1.3). Software and pipeline versions used are listed in Supplementary Table 3.

To extract further eukaryotic bins, we used a coassembly method. We used a custom filtering pipeline to identify the eukaryotic fraction of reads from each sample, before pooling these reads for coassembly and binning. To extract the eukaryotic fraction of reads, we used MMSeqs2²⁶ (version 0188988235c6f1a8e90f327827c73f981db8a19a) with the default parameters (length cutoff of 500 base pairs) to taxonomically identify contigs that had already been assembled using a per-sample assembly method, using a combination of MMETSP²⁷ and NCBI NR²⁸ as a reference database. Contigs identified at the domain level as anything other than Bacteria, Archaea, or viruses were retained, leaving a list of putatively eukaryotic contigs. Contigs identified as belonging to already existing prokaryotic bins were removed from this list. Next, the quality-filtered reads were mapped to this subset of contigs with BBMap (v.3.17). These reads were pooled, depending on whether they were from the HAVOC or pilot dataset, and assembled with metahipmer²⁹ (version 2.1.0.1.380-gf770aca-dirty-master) for the HAVOC samples, or SPAdes (v3.14.0) with the metaspades.py–only-assembler option for the pilot samples. Samples with no pre-existing metagenome bins from their single assembly were excluded. Finally, the pooled reads from each dataset were mapped to their respective coassemblies, and then the new contigs were binned using MetaBAT 2 (version 2:v2.15-30-g4ec2ab8), and checked for quality with EukCC³⁰ (2.1.1, database version 1.1). Bins of over 90% completeness and less than 5% contamination, and with at least 18 tRNA genes, and with 23S, 16S and 5S genes, were designated as high-quality MAGs, those with above 50% completeness and less than 10% contamination were designated medium-quality, and, for the eukaryotic MAGs, those above 30% completeness and less than 10% contamination were retained and designated as low quality, as per Alexander et al.³¹. Figure 2 shows a schematic diagram of the bioinformatics pipeline, and an overview of MAG completeness and contamination is provided in Fig. 3.

Functional and taxonomic annotation

MAGs recovered from single-sample assemblies were annotated using the IMG/M annotation pipeline (versions ranging between 5.0.23 and 5.1.11), using Genemark (v1.05) and Prodigal (V2.6.3)^32,33 for gene calling, and HMMer³⁴ (3.1b2) to combine analyses from the COG, PFam³⁵ (v.34), TIGRFAM³⁶ (v.15.0), Cath-Funfam³⁷ (v.4.1.0), SuperFamily³⁸ (v.1.75), and SMART³⁹ (01_06_2016) databases. CRISPRs, and tRNAs were identified with CRT⁴⁰ (1.8.2) and tRNAscan-SE⁴¹ (2.0.4) respectively. Prokaryotic MAGs were taxonomically placed with GTDB-tk (v.2.4.0, database release 220)⁴². GO terms were included based on the PFam2GO mapping provided by Interpro^43,44.

To identify genes within the coassembled eukaryotic MAGs within the coassemblies, we used MetaEuk⁴⁵ (version f32e8dfc6b994025627326d0f461f3e9903e997e) with the–easy-annotate option, using a custom database of the combined Phycocosm⁴⁶, MMETSP, and UniRef⁴⁷ databases, with UniRef clustered at a 50% identity level. These were combined with genes identified through Genemark-ES (version 4.71_lic, gmes_pepal.pl–ES), with genes from MetaEuk given priority and retained if overlapping with genes from Genemark-ES. PFam (v.35.0), PANTHER (v.17.0), SMART (v.9.0), NCBIfam (12.0) and SuperFamily (v.1.75) domains were then annotated using InterProScan⁴⁸ (version 5.63–95.0).

Eukaryotic MAGs were placed on a phylogenetic tree (Fig. 4), using a set of 100 concatenated BUSCO⁴⁹ genes (BUSCO v5.1.1 odb_eukaryota_10 gene set, aligned using MUSCLE⁵⁰ v3.8.1551), alongside a set of 140 eukaryotic reference genomes from Phycocosm and NCBI RefSeq²⁸. A maximum-likelihood tree was generated using FastTree (2.1.11).

Case-Study of a high-quality MAG

From our eukaryotic coassembly, we flagged bin havoc.90 from the HAVOC dataset as having good contamination and completeness scores, as measured by BUSCO (v5.1.1, odb_eukaryota_10 gene set), with completeness and contamination of 73% and 0.4% respectively. This MAG, Bacilliariophyceae sp. MOSAICH1_1, was annotated using the Phycocosm pipeline; in brief, scaffolds were masked for repeats with RepeatMasker, and genes were called using a combination of Genemark-ES (version 4), GeneWise⁵¹ (version 1), and fgenesh⁵² (fgenesh1_pg), with the best fitting model for each locus picked to form a set of filtered gene models. Genes were functionally annotated for signal peptides, transmembrane domains, and assigned functional descriptions including assignment of PFam, GO, KOG and KEGG terms, and genes were formed into gene clusters using the MCL algorithm⁵³.

This genome had a length of 32.07 Mbp, contained in 2963 scaffolds, with 13169 genes called among the set of filtered gene models. The most closely related isolate genome in Phycocosm was the diatom Pseudo-nitzschia multiseries CLN-47, with an average nucleotide identity of 76% (estimated with fastANI v1.33).

Data Records

All MAGs are available through Figshare⁵⁴, at NCBI BioProject PRJNA1160706 (Data Citation 2)⁵⁵, and replicated in the GOLD database⁵⁶ (https://gold.jgi.doe.gov), Study ID 505419. Annotations of Bacilliariophyceae sp. MOSAICH1_1 are also available through the Phycocosm web portal (https://phycocosm.jgi.doe.gov/Mosaich1_1/Mosaich1_1.home.html), and via Figshare.

Individual read files for samples are stored in the NCBI SRA, with BioSample, BioProject, and SRP accessions listed in Supplementary Table 1, and as citations^{57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129}.

Technical Validation

Prokaryotic MAG completeness and contamination was estimated using CheckM²⁵, and only those MAGs within the MIMAG standards of 50% completeness and below 10% contamination were retained. Eukaryotic MAG completeness and contamination was estimated using EukCC – those above 30% completion and less than 10% contamination were retained, with those under 50% completion designated as low-quality MAGs.

Code availability

The custom pipelines used for eukaryotic MAG binning and annotation are available at https://github.com/willboulton/mosaic-pilot-havoc-mags.

References

Ardyna, M. & Arrigo, K. R. Phytoplankton dynamics in a changing Arctic Ocean. Nat. Clim. Chang. 10, 892–903 (2020).
Article ADS CAS MATH Google Scholar
Dalmaso, G. Z. L., Ferreira, D. & Vermelho, A. B. Marine Extremophiles: A Source of Hydrolases for Biotechnological Applications. Mar Drugs 13, 1925–1965 (2015).
Article CAS PubMed PubMed Central MATH Google Scholar
Knust, R. Polar Research and Supply Vessel POLARSTERN Operated by the Alfred-Wegener-Institute. Journal of large-scale research facilities JLSRF 3, A119–A119 (2017).
Article MATH Google Scholar
Granskog, M. & Müller, O. A peek beneath the surface of Arctic sea ice. EU RES 37, 38–39 (2024).
Article MATH Google Scholar
Mårtensson, S., Meier, H. E. M., Pemberton, P. & Haapala, J. Ridged sea ice characteristics in the Arctic from a coupled multicategory sea ice model. Journal of Geophysical Research: Oceans 117 (2012).
Timco, G. W. & Burden, R. P. An analysis of the shapes of sea ice ridges. Cold Regions Science and Technology 25, 65–77 (1997).
Article MATH Google Scholar
Strub-Klein, L. & Sudom, D. A comprehensive analysis of the morphology of first-year sea ice ridges. Cold Regions Science and Technology 82, 94–109 (2012).
Article Google Scholar
Gradinger, R., Bluhm, B. & Iken, K. Arctic sea-ice ridges—Safe heavens for sea-ice fauna during periods of extreme ice melt? Deep Sea Research Part II: Topical Studies in Oceanography 57, 86–95 (2010).
Article ADS Google Scholar
Syvertsen, E. E. Ice algae in the Barents Sea: types of assemblages, origin, fate and role in the ice-edge phytoplankton bloom. Polar Research 10, 277–288 (1991).
Article Google Scholar
Fernández-Méndez, M. et al. Algal Hot Spots in a Changing Arctic Ocean: Sea-Ice Ridges and the Snow-Ice Interface. Frontiers in Marine Science 5 (2018).
Royo-Llonch, M. et al. Compendium of 530 metagenome-assembled bacterial and archaeal genomes from the polar Arctic Ocean. Nat Microbiol 6, 1561–1574 (2021).
Article CAS PubMed MATH Google Scholar
Duncan, A. et al. Metagenome-assembled genomes of phytoplankton microbiomes from the Arctic and Atlantic Oceans. Microbiome 10, 67 (2022).
Article CAS PubMed PubMed Central MATH Google Scholar
Delmont, T. O. et al. Functional repertoire convergence of distantly related eukaryotic plankton lineages abundant in the sunlit ocean. Cell Genomics 2, 100123 (2022).
Article CAS PubMed PubMed Central MATH Google Scholar
Winder, J. C. et al. Genetic and Structural Diversity of Prokaryotic Ice-Binding Proteins from the Central Arctic Ocean. Genes 14, 363 (2023).
Article CAS PubMed PubMed Central MATH Google Scholar
Nicolaus, M. et al. Overview of the MOSAiC expedition: Snow and sea ice. Elementa: Science of the Anthropocene 10, 000046 (2022).
MATH Google Scholar
Fong, A. A. et al. Overview of the MOSAiC expedition: Ecosystem. Elementa: Science of the Anthropocene 12, 00135 (2024).
MATH Google Scholar
Oggier, M. et al. First-year sea-ice salinity, temperature, density, oxygen and hydrogen isotope composition from the main coring site (MCS-FYI) during MOSAiC legs 1 to 4 in 2019/2020. PANGAEA https://doi.org/10.1594/PANGAEA.956732 (2023).
Oggier, M. et al. Second-year sea-ice salinity, temperature, density, oxygen and hydrogen isotope composition from the main coring site (MCS-SYI) during MOSAiC legs 1 to 4 in 2019/2020. PANGAEA https://doi.org/10.1594/PANGAEA.959830 (2023).
Lei, R. et al. Seasonality and timing of sea ice mass balance and heat fluxes in the Arctic transpolar drift during 2019–2020. Elementa: Science of the Anthropocene 10, 000089 (2022).
Google Scholar
Macfarlane, A. R. et al. A Database of Snow on Sea Ice in the Central Arctic Collected during the MOSAiC expedition. Sci Data 10, 398 (2023).
Article CAS PubMed PubMed Central MATH Google Scholar
Clum, A. et al. DOE JGI Metagenome Workflow. mSystems 6, https://doi.org/10.1128/msystems.00804-20 (2021).
Bankevich, A. et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology 19, 455–477 (2012).
Article MathSciNet CAS PubMed PubMed Central MATH Google Scholar
Bushnell, B. BBMap: A Fast, Accurate, Splice-Aware Aligner (2014).
Kang, D. et al. MetaBAT 2: An adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).
Article PubMed PubMed Central MATH Google Scholar
Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
Article CAS PubMed PubMed Central Google Scholar
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35, 1026–1028 (2017).
Article CAS PubMed MATH Google Scholar
Keeling, P. J. et al. The Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP): Illuminating the Functional Diversity of Eukaryotic Life in the Oceans through Transcriptome Sequencing. PLoS Biol 12, e1001889 (2014).
Article PubMed PubMed Central Google Scholar
O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 44, D733–745 (2016).
Article PubMed MATH Google Scholar
Hofmeyr, S. et al. Terabase-scale metagenome coassembly with MetaHipMer. Sci Rep 10, 10689 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Saary, P., Mitchell, A. L. & Finn, R. D. Estimating the quality of eukaryotic genomes recovered from metagenomic analysis with EukCC. Genome Biol 21, 244 (2020).
Article PubMed PubMed Central MATH Google Scholar
Alexander, H. et al. Eukaryotic genomes from a global metagenomic data set illuminate trophic modes and biogeography of ocean plankton. mBio 14, e01676–23 (2023).
Article PubMed PubMed Central Google Scholar
Lukashin, A. V. & Borodovsky, M. GeneMark.hmm: New solutions for gene finding. Nucleic Acids Research 26, 1107–1115 (1998).
Article CAS PubMed PubMed Central MATH Google Scholar
Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).
Article PubMed PubMed Central MATH Google Scholar
Mistry, J., Finn, R. D., Eddy, S. R., Bateman, A. & Punta, M. Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Res 41, e121 (2013).
Article CAS PubMed PubMed Central Google Scholar
Mistry, J. et al. Pfam: The protein families database in 2021. Nucleic Acids Research 49, D412–D419 (2021).
Article CAS PubMed MATH Google Scholar
Haft, D. H. et al. TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Res 29, 41–43 (2001).
Article CAS PubMed PubMed Central MATH Google Scholar
Sillitoe, I. et al. New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures. Nucleic Acids Res 41, D490–D498 (2013).
Article CAS PubMed MATH Google Scholar
Wilson, D. et al. SUPERFAMILY–sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res 37, D380–386 (2009).
Article CAS PubMed Google Scholar
Letunic, I., Doerks, T. & Bork, P. SMART 6: recent updates and new developments. Nucleic Acids Res 37, D229–232 (2009).
Article CAS PubMed Google Scholar
Bland, C. et al. CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics 8, 209 (2007).
Article PubMed PubMed Central MATH Google Scholar
Chan, P. P. & Lowe, T. M. tRNAscan-SE: Searching for tRNA genes in genomic sequences. Methods Mol Biol 1962, 1–14 (2019).
Article CAS PubMed PubMed Central MATH Google Scholar
Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36, 1925–1927 (2020).
Article CAS Google Scholar
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat Genet 25, 25–29 (2000).
Article CAS PubMed PubMed Central MATH Google Scholar
The Gene Ontology Consortium. et al. The Gene Ontology knowledgebase in 2023. Genetics 224, iyad031 (2023).
Article Google Scholar
Levy Karin, E., Mirdita, M. & Söding, J. MetaEuk—sensitive, high-throughput gene discovery, and annotation for large-scale eukaryotic metagenomics. Microbiome 8, 48 (2020).
Article CAS PubMed PubMed Central Google Scholar
Grigoriev, I. V. et al. PhycoCosm, a comparative algal genomics resource. Nucleic Acids Research 49, D1004–D1011 (2021).
Article CAS PubMed MATH Google Scholar
Suzek, B. E., Wang, Y., Huang, H., McGarvey, P. B. & Wu, C. H. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).
Article CAS PubMed Google Scholar
Quevillon, E. et al. InterProScan: protein domains identifier. Nucleic Acids Res 33, W116–W120 (2005).
Article CAS PubMed PubMed Central Google Scholar
Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Molecular Biology and Evolution 38, 4647–4654 (2021).
Article CAS PubMed PubMed Central Google Scholar
Edgar, R. C. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 113 (2004).
Article PubMed PubMed Central MATH Google Scholar
Birney, E., Clamp, M. & Durbin, R. GeneWise and Genomewise. Genome Res 14, 988–995 (2004).
Article CAS PubMed PubMed Central Google Scholar
Salamov, A. A. & Solovyev, V. V. Ab initio gene finding in Drosophila genomic DNA. Genome Res 10, 516–522 (2000).
Article CAS PubMed PubMed Central MATH Google Scholar
Enright, A. J., Van Dongen, S. & Ouzounis, C. A. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30, 1575–1584 (2002).
Article CAS PubMed PubMed Central MATH Google Scholar
Boulton, W. et al. Metagenome-assembled-genomes recovered from the Arctic drift expedition MOSAiC. Figshare https://doi.org/10.6084/m9.figshare.27879576 (2024).
NCBI BioProject http://identifiers.org/ncbi/bioproject:PRJNA1160706 (2024).
Mukherjee, S. et al. Genomes OnLine Database (GOLD) v.8: overview and updates. Nucleic Acids Research 49, D723–D733 (2021).
Article CAS PubMed MATH Google Scholar
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497145 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497146 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497147 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497148 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497149 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497150 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497151 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497152 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497153 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497154 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497155 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497157 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497160 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497161 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497165 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497166 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497167 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497168 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497174 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497178 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497181 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497183 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497184 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497185 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497186 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497187 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497189 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497190 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497191 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497192 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497193 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497194 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497195 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497196 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497197 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497198 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497199 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497200 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497201 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497202 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497203 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497204 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497205 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497207 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497209 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497213 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497214 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497216 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP497220 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506343 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506344 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506347 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506348 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506350 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506351 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506352 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506353 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506355 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506356 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506357 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506359 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506366 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506368 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506369 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506371 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506373 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506374 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506375 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506376 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506379 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506382 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506389 (2024).
NCBI sequence read archive http://identifiers.org/insdc.sra:SRP506779 (2024).
Nixdorf, U. et al. MOSAiC Extended Acknowledgement (2021).

Download references

Acknowledgements

W.B. was supported by the Natural Environment Research Council and ARIES DTP [grant number NE/S007334/1]. A.T., V.M. and T.M. were supported by the Natural Environment Research Council grant [NE/W005654/1]. Data used in this manuscript were produced as part of the international Multidisciplinary drifting Observatory for the Study of the Arctic Climate (MOSAiC) with the tag MOSAiC20192020 (expedition AWI_PS_122_00). We thank all land and ship based contributors of the MOSAiC campaign¹³⁰. This work was also supported through the Research Council of Norway through project HAVOC (grant no 280292), and the National Science Foundation through grant OPP-1735862. The work (10.46936/10.25585/60001271) conducted by the US Department of Energy Joint Genome Institute (https://ror.org/04xm1d337), a DOE Office of Science User Facility, was supported by the Office of Science of the US Department of Energy under Contract No. DE-AC02-05CH11231.

Author information

Authors and Affiliations

School of Computing Sciences, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK
William Boulton & Vincent Moulton
U.S. Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
Asaf Salamov, Igor V. Grigoriev, Sara Calhoun, Kurt LaButti, Robert Riley & Kerrie Barry
Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, CA, 94720, USA
Igor V. Grigoriev
Alfred Wegener Institute, Am Handelshafen 12, 27570, Bremerhaven, Germany
Allison A. Fong, Clara J. M. Hoppe, Katja Metfies, Kersten Oetjen & Sarah Lena Eggers
University of Bergen, Thormøhlens gate 53 A/B, 5006, Bergen, Norway
Oliver Müller & Gunnar Bratbak
UiT the Arctic University of Norway, Hansine Hansens veg 18, 9019, Tromsø, Norway
Jessie Gardner
Norwegian Polar Institute, Fram Centre, Hjalmar Johansens gate 14, 9296, Tromsø, Norway
Mats A. Granskog
Department of Aquatic Sciences and Assessment, Section for Ecology and Biodiversity, Swedish University of Agricultural Sciences, Uppsala, Sweden
Anders Torstensson
University of Alaska Fairbanks, 1731 South Chandalar Drive, AK, 99775, Fairbanks, USA
Marc Oggier
NORCE Norwegian Research Centre, Nygårdsgaten 112, NO-5008, Bergen, Norway
Aud Larsen
School of Environmental Sciences, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK
Andrew Toseland & Thomas Mock
Earlham Institute, Norwich Research Park, Colney Lane, Norwich, NR4 7UZ, UK
Richard M. Leggett

Authors

William Boulton
View author publications
Search author on:PubMed Google Scholar
Asaf Salamov
View author publications
Search author on:PubMed Google Scholar
Igor V. Grigoriev
View author publications
Search author on:PubMed Google Scholar
Sara Calhoun
View author publications
Search author on:PubMed Google Scholar
Kurt LaButti
View author publications
Search author on:PubMed Google Scholar
Robert Riley
View author publications
Search author on:PubMed Google Scholar
Kerrie Barry
View author publications
Search author on:PubMed Google Scholar
Allison A. Fong
View author publications
Search author on:PubMed Google Scholar
Clara J. M. Hoppe
View author publications
Search author on:PubMed Google Scholar
Katja Metfies
View author publications
Search author on:PubMed Google Scholar
Kersten Oetjen
View author publications
Search author on:PubMed Google Scholar
Sarah Lena Eggers
View author publications
Search author on:PubMed Google Scholar
Oliver Müller
View author publications
Search author on:PubMed Google Scholar
Jessie Gardner
View author publications
Search author on:PubMed Google Scholar
Mats A. Granskog
View author publications
Search author on:PubMed Google Scholar
Anders Torstensson
View author publications
Search author on:PubMed Google Scholar
Marc Oggier
View author publications
Search author on:PubMed Google Scholar
Aud Larsen
View author publications
Search author on:PubMed Google Scholar
Gunnar Bratbak
View author publications
Search author on:PubMed Google Scholar
Andrew Toseland
View author publications
Search author on:PubMed Google Scholar
Richard M. Leggett
View author publications
Search author on:PubMed Google Scholar
Vincent Moulton
View author publications
Search author on:PubMed Google Scholar
Thomas Mock
View author publications
Search author on:PubMed Google Scholar

Contributions

W.B. drafted the manuscript and wrote code for the coassembly pipelines. W.B. and S.C. annotated the near complete eukaryotic MAG. A.S., I.V. G., S.C., K.B., R.R., K.L. provided bioinformatics support for all the JGI infrastructure including assembling, binning, and annotating all the individual samples, and bioinformatics support for the coassembly pipeline. K.B., I.V.G., T.M., K.M. coordinated and managed the metagenome sequencing project. A.F., C.J.M.H., A.T., K.M., O.M., J.G., M.O., collected and processed the samples. M.A.G., A.L. and G.B. coordinated and supervised the HAVOC project. O.M., J.G., S.L.E., K.O. performed the DNA extractions. T.M., V.M., R.L., A.T. supported and supervised the conceptualisation, writing and editing of the manuscript. All authors read, edited, and approved the manuscript.

Corresponding author

Correspondence to Thomas Mock.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Table 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Boulton, W., Salamov, A., Grigoriev, I.V. et al. Metagenome-assembled-genomes recovered from the Arctic drift expedition MOSAiC. Sci Data 12, 204 (2025). https://doi.org/10.1038/s41597-025-04525-8

Download citation

Received: 06 June 2024
Accepted: 24 January 2025
Published: 04 February 2025
Version of record: 04 February 2025
DOI: https://doi.org/10.1038/s41597-025-04525-8