De novo discovery of conserved gene clusters in microbial genomes with Spacedust

Zhang, Ruoshi; Mirdita, Milot; Söding, Johannes

doi:10.1038/s41592-025-02816-x

Published: 15 September 2025

Nature Methods volume 22, pages 2065–2073 (2025)


Abstract

Metagenomics has revolutionized environmental and human-associated microbiome studies. However, the limited fraction of proteins with known biological processes and molecular functions presents a major bottleneck. In prokaryotes and viruses, evolution favors keeping genes participating in the same biological processes colocalized as conserved gene clusters. Conversely, conservation of gene neighborhood indicates functional association. Here we present Spacedust, a tool for systematic, de novo discovery of conserved gene clusters. To find homologous protein matches, Spacedust uses fast and sensitive structure comparison with Foldseek. Partially conserved clusters are detected using novel clustering and order conservation P values. We demonstrate Spacedust’s sensitivity with an all-versus-all analysis of 1,308 bacterial genomes, identifying 72,483 conserved gene clusters containing 58% of the 4.2 million genes. It recovered 95% of antiviral defense system clusters annotated by the specialized tool PADLOC. Spacedust’s high sensitivity and speed will facilitate the annotation of large numbers of sequenced bacterial, archaeal and viral genomes.

Main

In the past decade, metagenomics has accelerated the pace of research into microbial ecology and human-associated microbiomes and their intimate association with human health¹. Hundreds of thousands of microbial and viral genomes assembled from shotgun metagenomics permit the study of microorganisms and their interactions with each other and their environment^2,3,4. However, our ability to extract useful insights from such data is severely limited by the lack of functional information⁵. Even in well-studied ecosystems such as the human gut, for around 40% of genes neither molecular function nor biological process is annotatable².

The standard approach for protein function annotation is by homology inference, that is, by sequence similarity search to find the best match in reference databases such as InterPro, KEGG orthologs, COGs or SEED^6,7,8, and transferring the annotation if certain criteria are met^{9,10,11,12,13}. Earlier approaches relied on sequence–sequence search tools such as BLAST. However, function can remain conserved even at sequence identities much below 20%, which these approaches cannot detect¹⁴. Therefore, modern approaches with increased sensitivity search with the query protein sequences through databases of profile hidden Markov models (HMMs) or sequence profiles. These are precomputed from multiple sequence alignments of protein family members with the same or similar functions. Many databases of orthologous families have been developed to automate the process of clustering orthologous protein sequences together^7,15. This approach is motivated by the ‘orthology conjecture’, which states that orthologous sequences are more likely to be functionally related than paralogous ones, although the difference appears to actually be small^16,17.

Integration of genomic context can improve the precision of ortholog clustering and functional annotation. Proteins do not work in isolation but cooperate with others in biological pathways. Evolution has a tendency to keep functionally associated genes closely together in prokaryotic and viral genomes. This can be a consequence of coexpressed genes sharing regulatory sequences or even forming part of the same transcription unit, an operon¹⁸. Clustering also maximizes the chances of horizontal transfer of useful gene modules, and it minimizes disruptions of functionally associated genes by genomic recombination^19,20,21. Some methods exploit gene neighborhood conservation to increase specificity for identifying orthologs^22,23,24,25. Others used the ‘guilt by association’ principle to find functionally associated genes (for example, see refs. ^26,27).

Many methods have been designed to detect a specific type of cluster, such as biosynthetic gene clusters (BGCs)^28,29,30,31, phage defense systems^32,33,34, virulence and antibiotic resistance factors^35,36 or xenobiotic degradation pathways³⁷. Most search the protein sequences from the query genome against a pre-assembled database of profile HMMs representing protein families typically occurring in these clusters. They then apply heuristic rules for what constitutes a valid cluster match.

A few tools aim to find genomic neighborhoods similar to a query neighborhood^38,39,40,41. The sensitivity of these de novo cluster detection methods is severely limited by the use of sequence–sequence comparison tools such as BLAST or DIAMOND^42,43, compromising their ability to detect all but closely related conserved clusters⁴⁴. They also do not scale up to more than a few hundred genomes in all-versus-all search mode, and some require strict conservation of gene order (colinearity). Some approaches find conserved clusters by first searching each of the genomes to be analyzed against a profile HMM database of orthologous groups, and then finding clusters of genes with a similar composition of orthologs^45,46. While this type of approach has improved sensitivity over the first, it requires a reference database of orthologous groups and, therefore, excludes the many proteins from as-yet-unknown families⁴⁷.

Spacedust is a tool for systematic de novo discovery of conserved gene clusters across multiple genomes. It finds all gene clusters significantly conserved between any two genomes in a set of input genomes. Conserved clusters are found by maximizing the statistical significance measured with two novel statistics assessing the degree of clustering and the degree of order and strand conservation. Because remote homologies are critical to achieve high sensitivity for detecting conserved gene clusters, Spacedust performs its homology searches with our new structure-based search tool Foldseek. Foldseek has similar sensitivity as the best structural comparison tools and much higher than sequence–sequence, sequence–profile and profile HMM searches^48,49.

Spacedust improves upon previous methods in several ways: (i) It is reference-free and can discover conserved clusters of any type and composition; (ii) its structure-based search maximizes its sensitivity for finding remotely related conserved gene clusters; (iii) its high speed allows for analyzing a large number of genomes for conserved gene clusters using all-versus-all searches; (iv) it integrates functional annotation of proteins to facilitate inference of function from cluster members; and (v) it offers a user-friendly Google Colab notebook.

We demonstrate the utility of Spacedust by detecting conserved clusters in an all-versus-all comparison of 1,308 representative bacterial genomes from different genera with a total of 4.2 million protein-coding genes. Spacedust recovers previously annotated gene clusters, for example, operons, antiviral defense systems and BGCs. It is able to assign 58% of all 4.2 million genes and 35% of genes without any annotation to conserved gene clusters. Spacedust also discovers the vast majority of antiphage defense systems in this dataset and achieves better results in identifying 207 manually annotated BGCs than three specialized tools.

Results

Spacedust algorithm

Spacedust takes as input a set Q of query genomes and a set T of target genomes (which may be equal to Q) and, for each pair (q, t) ∈ Q × T of query and target genome, it finds all gene clusters whose gene arrangement is at least partially conserved between q and t (Fig. 1 and Methods). For that purpose, Spacedust first identifies homologous matches (‘hits’) between proteins in Q and proteins in T using our sensitive structure search tool Foldseek⁴⁸ and our sequence search tool MMseqs2 (ref. ⁵⁰; steps 1–5 in Fig. 1a). For every Q–T pair, it detects clusters of hits with significant conservation of gene neighborhood, using a greedy cluster detection algorithm (step 6 in Fig. 1a,b). The cluster detection starts with each protein hit in its own cluster and adds protein hits to the cluster matches one at a time. If the significance score of the cluster match improves, the addition is accepted and the algorithm continues, until the significance of the cluster matches cannot be improved further. The significance score is calculated as the sum of the negative logarithms of a clustering P value and an ordering P value. The clustering P value is the probability of finding ‘by chance’ at least k matches within a window of at most m genes in both the query and the target genome. The ordering P value is the probability to find ‘by chance’ at least n pairs of genes of the cluster match in conserved order in both genomes. The cluster detection algorithm thereby identifies positionally conserved clusters between all Q–T pairs of genomes. Optionally, the cluster matches for each query genome, aggregated across multiple reference genomes, can be visualized as a measure of conservation strength.
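The greedy growth step described above can be illustrated with a deliberately simplified Python sketch. It is not Spacedust's implementation: it scores a cluster only with the negative logarithm of the clustering P value from equation (1) in Methods and omits the ordering P value entirely. The names `Hit`, `neg_log_p_clu` and `grow_cluster`, the seed-based growth strategy, the capping of the score at zero and the example positions are all assumptions made for illustration. It also assumes that every query gene contributes at most one hit, so that the span m is never smaller than k.

```python
import math
from typing import List, Tuple

# A hit pairs a gene index in the query genome with one in the target genome.
Hit = Tuple[int, int]


def neg_log_p_clu(cluster: List[Hit], q0: float = 1e-3) -> float:
    """Score = -ln of the clustering P value of equation (1), floored at 0.

    The ordering P value that Spacedust adds to the score is omitted here.
    """
    k = len(cluster)
    q_pos = [q for q, _ in cluster]
    t_pos = [t for _, t in cluster]
    # Span m: larger of the two windows (in genes), never below k.
    m = max(max(q_pos) - min(q_pos), max(t_pos) - min(t_pos)) + 1
    m = max(m, k)
    log_p = (2.0 * (math.lgamma(m + 1) - math.lgamma(m - k + 1))
             - math.lgamma(k + 1) + k * math.log(q0))
    return -min(0.0, log_p)  # the approximation can exceed 1; cap P at 1


def grow_cluster(seed: Hit, hits: List[Hit], q0: float = 1e-3) -> List[Hit]:
    """Greedily add the hit that most improves the score; stop when none does."""
    cluster = [seed]
    remaining = [h for h in hits if h != seed]
    score = neg_log_p_clu(cluster, q0)
    while remaining:
        best = max(remaining, key=lambda h: neg_log_p_clu(cluster + [h], q0))
        new_score = neg_log_p_clu(cluster + [best], q0)
        if new_score <= score:
            break  # no addition improves the significance
        cluster.append(best)
        remaining.remove(best)
        score = new_score
    return sorted(cluster)


# Hypothetical hits: three colocalized in both genomes, one far away.
example_hits: List[Hit] = [(10, 200), (11, 201), (12, 203), (500, 40)]
```

In this toy example, growing from the seed `(10, 200)` collects the three neighboring hits and rejects the distant one, because including it would inflate the span m and make the cluster no longer significant.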

A reference set of remotely conserved bacterial gene clusters

Despite the availability of tens of thousands of complete bacterial genomes, the gene cluster conservation landscape has yet to be surveyed systematically. To address this gap, we curated a dataset of 1,308 bacterial reference genomes, covering a broad phylogenetic range. These genomes were selected such that they belong to different bacterial genera (Methods), to focus on detecting remote homology and globally conserved clusters across higher taxonomic ranks. This choice means species-specific and genus-specific gene clusters cannot be found in this analysis. We subjected all predicted genes (4.19 million) from the 1,308 genomes to an all-versus-all Foldseek+MMseqs2 search using Spacedust. The all-versus-all homology search and hit-filtering process took 72 h to complete on two servers with two 64-core AMD EPYC ROME 7742 CPUs each, or 150 ms per genome–genome comparison (Supplementary Information), and the subsequent cluster detection required 51 min. The runtime of Spacedust scales quadratically with the total number of genomes and proteins in the genomes, owing to the all-versus-all search and cluster detection. The search yielded 321.2 million cluster hits in 106.6 million cluster matches, with an average of three genes per cluster match (Fig. 2a).
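The per-comparison figure quoted above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes that the 72 h is wall-clock time and that all 1,308 × 1,308 ordered genome pairs are counted. The function `projected_hours` is a hypothetical helper that extrapolates under pure quadratic scaling on the same hardware, which is an idealization.

```python
N_GENOMES = 1308
SEARCH_HOURS = 72.0

# Number of ordered query-target genome pairs in an all-versus-all search.
n_pairs = N_GENOMES ** 2

# Average time per genome-genome comparison, in milliseconds.
ms_per_pair = SEARCH_HOURS * 3600.0 * 1000.0 / n_pairs


def projected_hours(n_genomes: int) -> float:
    """Idealized search time if runtime grows with the square of genome count."""
    return SEARCH_HOURS * (n_genomes / N_GENOMES) ** 2
```

Under these assumptions `ms_per_pair` comes out at roughly 150 ms, in line with the reported value, and doubling the number of genomes would quadruple the projected search time.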

**Fig. 2: Conservation of gene clusters identified by Spacedust predicts functional association.**

These pairwise cluster matches were subsequently grouped on the level of genome and genes, which yielded 72,483 nonredundant clusters comprising 2.45 million genes, representing 58% of the dataset. We classified 4.19 million genes based on their eggNOG-mapper annotations: ‘annotated’ if the gene was assigned a specific function (3.13 million, or 75% of the dataset), or ‘unannotated’ if the gene was labeled as ‘hypothetical protein’, ‘protein with unknown function’, or lacking any annotation (1.06 million, or 25% of the dataset). Notably, 66% of the annotated genes were found in nonredundant clusters present in more than one genome. Additionally, 35% of the unannotated genes were found in nonredundant clusters (Fig. 2b).

To evaluate the functional associations within the nonredundant clusters, we evaluated the congruence of KEGG module IDs for gapped gene pairs separated by up to four genes, (i, i + 1), …, (i, i + 4) (Fig. 2c and Extended Data Fig. 1). A gene pair is considered a true positive if both genes share a common KEGG module ID, and a false positive if not. The area under the precision–recall curve indicates that Spacedust identifies cluster matches with considerably higher accuracy than the baseline model, which assumes any neighboring gene pair (i, i + x) to be functionally associated. Similarly, we assessed the functional association within the nonredundant clusters predicted by the Foldseek-only search mode (Fig. 2e,f), with the area under the precision–recall curve for gapped gene pairs (i, i + 1), …, (i, i + 4) slightly higher than that of the default Foldseek+MMseqs2 search mode.
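The pair-level labeling described above could be sketched as follows in Python. This is a hedged illustration, not the evaluation code used for Fig. 2: the function name `count_pairs`, the data layout (one set of KEGG module IDs and one optional cluster ID per gene, in genome order) and the example values are assumptions. In particular, skipping pairs in which either gene has no KEGG module is an assumption, since the text does not state how such pairs are handled.

```python
from typing import List, Optional, Set, Tuple


def count_pairs(modules: List[Set[str]],
                cluster_ids: List[Optional[int]],
                gap: int) -> Tuple[int, int]:
    """Count true and false positives among gene pairs (i, i + gap).

    A pair is evaluated only if both genes lie in the same predicted cluster.
    It is a true positive if the genes share at least one KEGG module ID.
    """
    tp = fp = 0
    for i in range(len(modules) - gap):
        j = i + gap
        if cluster_ids[i] is None or cluster_ids[i] != cluster_ids[j]:
            continue  # not predicted to be associated
        if not modules[i] or not modules[j]:
            continue  # assumption: pairs lacking KEGG modules are skipped
        if modules[i] & modules[j]:
            tp += 1
        else:
            fp += 1
    return tp, fp


# Illustrative toy genome with five genes; module IDs are placeholders.
toy_modules = [{"M00001"}, {"M00001", "M00002"}, {"M00003"}, set(), {"M00003"}]
toy_clusters = [0, 0, 0, None, 1]
```

Precision at a given gap is then `tp / (tp + fp)`. Sweeping a threshold on cluster conservation and recomputing these counts would trace out a precision–recall curve of the kind summarized in the text.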

Global functional conservation of a cyanobacterium genome

To illustrate how Spacedust can support functional annotation of genomes, we took one example genome of a unicellular cyanobacterium Synechocystis sp. Pasteur Culture Collection (PCC) 6803 from the reference database with the 1,308 genomes. This genome comprises one chromosome (GenBank accession: BA000022.2) and four plasmids (AP004311.1, AP004312.1, AP004310.1, AP006585.1), totaling 3,551 protein-coding genes. All the detected clusters are visualized as an interactive cluster heat map (Extended Data Figs. 2–4). For better visibility, we zoomed in on a specific region spanning protein location indices 500 to 800 (Fig. 3a; corresponding to 0.007% of the total dataset) and integrated functional annotation data obtained from eggNOG-mapper⁵¹. This allowed us to assign functions to many of the proteins. From this selected genomic region containing 300 genes, we identified three distinct cyanobacteria-specific clusters, indicative of functional conservation across related species. Additionally, we detected 21 clusters shared with other phyla (Supplementary Tables 1–3). Some clusters corresponded to single operons, while others spanned multiple operons.

**Fig. 3: Evolutionary conservation of gene clusters in an example cyanobacterium.**

Cluster 1 (Fig. 3b and Extended Data Fig. 5) comprises genes associated with photosystem II (PSII), the protein–pigment complex that drives oxygenic photosynthesis. The first two genes, rubredoxin and ycf48, are crucial for PSII activity and assembly. The remaining genes, psbEFLJ, form an operon encoding components of the core PSII complex. In many cases, psbL and psbJ are absent, possibly owing to poor conservation or the short length of their sequences.

Cluster 2 (Extended Data Fig. 6) forms an operon encompassing components of the phycobilisome (PBS) complex rod, a large protein complex in cyanobacteria responsible for capturing sunlight and transferring energy to the photosynthetic reaction centers. The genes cpcA and cpcB encode two major subunits of the rod, while cpcD, cpcC and cpcC2 encode linker components connecting the rod to the PBS core⁵². By contrast, homologous clusters in some other cyanobacteria contain only one copy of the cpcC gene, suggesting that cpcC and cpcC2 might have been created by gene duplication. In some genomes, the genes are still colocalized even though their order is only partially conserved.

In cluster 3 (Extended Data Fig. 7), the first two genes are both annotated as spkA by eggNOG-mapper, encoding a eukaryotic-type serine/threonine protein kinase involved in signal transduction and motility. Alignment with other clusters revealed gene fusion of these two genes in other cyanobacteria⁵³. The third gene in the cluster is highly conserved in other genomes but could not be annotated using eggNOG-mapper.

De novo identification of specialized gene clusters

To further assess the ability of Spacedust to identify conserved gene clusters, we focused on two categories of known, specialized gene clusters, antiviral defense systems and BGCs.

We used PADLOC (v1.1.0)³³ to identify all known antiviral defense systems in the 1,308 bacterial reference genomes. We removed any predicted region consisting only of single genes. Spacedust was able to recover 5,255 (95%) of 5,520 multi-gene defense system clusters detected by PADLOC (Fig. 4a,b), with 93% (4,888) of the clusters matching fully and 7% (367) matching partially to the PADLOC prediction. Most partial cluster matches resulted from missing matches to one or two short genes at the edge of longer clusters such as CRISPR–Cas systems. For 73 of the 106 defense system types, more than 90% of all defense system clusters were discovered in their entirety by Spacedust (Fig. 4a). Restriction–modification type II clusters are the most abundant type of defense system in the dataset, yet they are the most challenging for Spacedust to detect.

**Fig. 4: Spacedust recovers the vast majority of antiviral defense systems predicted by specialized tools.**

To evaluate Spacedust’s ability to recover BGCs, we utilized a gold-standard dataset consisting of nine complete genomes available at the NCBI that were fully annotated with BGC and non-BGC regions²⁹. We queried these nine genomes against our reference set of 1,308 bacterial genomes using Spacedust. We compared the results with three tools specialized in identifying BGCs, ClusterFinder²⁹, DeepBGC³⁰ (using a cutoff of 10% false positive rate as recommended) and GECCO³¹. As Spacedust returns all conserved clusters and not exclusively BGCs, it was not feasible to compare the precision of BGC detection based on all predictions. Therefore, we evaluated the F1 score, equal to the harmonic mean of precision and recall, for each of the annotated BGCs (Fig. 5). The precision is the fraction of genes in the overlapping predicted region that were annotated as BGC genes, and the recall is the fraction of genes in the annotated BGC region that were predicted as BGC region by the tool. Spacedust achieves higher F1 scores than ClusterFinder, DeepBGC and GECCO (Fig. 5a–c) owing to its higher precision than DeepBGC and GECCO and its higher recall than ClusterFinder (Extended Data Figs. 8 and 9). All tools failed to detect a few instances of BGCs that were identified by Spacedust. Figure 5d shows the cumulative distribution of the F1 score for the four tools. The average F1 score over all BGCs is 0.44 for ClusterFinder, 0.39 for DeepBGC, 0.43 for GECCO and 0.61 for Spacedust.
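The per-BGC score defined above can be written out as a short Python sketch. It is a minimal illustration under stated assumptions rather than the benchmark code: genes are represented by hypothetical string identifiers, `predicted` is assumed to be the union of a tool's predicted regions that overlap the annotated BGC, and the function name `bgc_f1` is invented for this example.

```python
from typing import Set


def bgc_f1(annotated: Set[str], predicted: Set[str]) -> float:
    """F1 score for one annotated BGC.

    precision = share of predicted genes that are annotated BGC genes
    recall    = share of annotated BGC genes that were predicted
    """
    overlap = annotated & predicted
    if not overlap:
        return 0.0  # a BGC that is missed entirely scores zero
    precision = len(overlap) / len(predicted)
    recall = len(overlap) / len(annotated)
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical example: a five-gene BGC and a six-gene prediction sharing three genes.
annotated_bgc = {"g1", "g2", "g3", "g4", "g5"}
predicted_region = {"g3", "g4", "g5", "g6", "g7", "g8"}
example_f1 = bgc_f1(annotated_bgc, predicted_region)
```

In this toy case the precision is 3/6 and the recall is 3/5, so the F1 score is 6/11, or about 0.55. Averaging such per-BGC scores over all annotated BGCs gives a summary comparable in form to the averages reported in the text.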

**Fig. 5: Prediction of 207 manually annotated BGCs from nine genomes.**

We used AntiSMASH (version 8)²⁸, a tool for profile-based BGC detection, to functionally annotate the genes as either ‘biosynthetic-related’ (biosynthetic, biosynthetic additional, transport, regulatory) or ‘other genes’. We manually inspected the clusters reported by Spacedust, ClusterFinder, DeepBGC and GECCO. We observed that the regions reported by Spacedust often miss the transport and regulatory genes but cover the core and additional biosynthetic-related genes, sometimes with multiple short gene clusters within the annotated BGC region (Extended Data Fig. 10).

Expansion of CRISPR–Cas subtype III-E single effector Cas7-11

Next, we investigated Spacedust’s utility in identifying new instances of known gene cluster families. One such example is the recently discovered CRISPR subtype III-E, which comprises a single effector protein known as Cas7-11 (ref. ⁵⁴). Notably, Cas7-11 is a protein fusing four Cas7 proteins with a putative Cas11-like protein. The fusion yields a single-protein programmable RNase that shows high sequence specificity and no evidence of collateral activity. Previous screening for Cas7-11 across bacterial genomic sequences led to the identification of subtype III-E systems in 17 loci.

To expand our knowledge of subtype III-E systems beyond the reported loci, we queried the proteins in the 17 loci reported by ref. ⁵⁴ against the GTDB database⁵⁵. Because we were unable to map a substantial portion of the query proteins and GTDB proteins to known structures, we used the MMseqs2 iterative search with three iterations to perform the homology search. We identified an additional seven instances of subtype III-E clusters in the GTDB database by demanding the presence of the gene encoding Cas7-11 (Fig. 6). In three out of seven genomes, all components of the respective system were identified, demonstrating the high sensitivity of the method.

**Fig. 6: Additional instances of CRISPR–Cas subtype III-E clusters identified in GTDB.**

Spacedust Colab notebook

To facilitate the use of Spacedust for a broad user base, we have set up a Google Colaboratory environment, which allows users to easily run tests and reproduce results without requiring a local installation or configuration. We provide a comprehensive IPython notebook (ipynb file) that includes steps for installing all dependencies and databases, executing the program, and interactively visualizing the clusters. Within the Colab framework, users can either run Spacedust in an all-versus-all mode or annotate query genomes against a pre-compiled reference database with just a single click. With the help of interactive visualization, they can explore the evolutionary conservation of their genome of interest in other genomes at different resolutions and generate gene neighborhood plots for any gene clusters.

Discussion

Exploiting the conservation of gene neighborhoods to predict functional association between genes is an old idea. So far, the main limitations have been (1) the rather low fraction of genes that are part of a conserved gene cluster⁴⁴, and (2) the low reliability of the inference of functional association.

Spacedust addresses limitation 1 in three ways. First, it finds homologous proteins using protein structure comparison with Foldseek, which is much more sensitive than the sequence search-based methods used so far. The increased sensitivity yields a higher number of conserved cluster matches between genomes (Extended Data Figs. 2–4). Second, owing to Foldseek’s high speed and Spacedust’s clustered search mode, it can analyze large sets of genomes in an all-versus-all fashion. The large number of genomes increases the chances that a gene will be part of a gene cluster that is conserved in another genome. In principle, all gene clusters that are laterally transferred as a functional unit, such as BGCs⁵⁶, will be detectable by Spacedust if a sufficiently large number of genomes is analyzed. Third, Spacedust does not require exact synteny but can find partially conserved neighborhoods (for example, Fig. 6). The success of these measures is demonstrated by the high fraction (58%) of genes that are part of a conserved gene cluster among the 4.2 million proteins from the reference genomes of 1,308 bacterial genera (Fig. 2b), as well as the high sensitivities attained for the de novo discovery of antiviral defense systems and BGCs (Figs. 4 and 5).

Spacedust partially addresses limitation 2, the low reliability of functional association, by computing two novel P-value statistics for the significance of gene cluster conservation, one assessing the strength of positional clustering of the matched genes and the other assessing the degree of their strand and order conservation. These P values enable flexibility to find partially conserved gene clusters while still ensuring their statistical significance. Statistically significant conservation alone does not guarantee functional association, however. The fraction of gene pairs within a cluster match that are part of the same KEGG pathway can be as low as 50% when conserved between only two genomes (Fig. 2c), but it rises to over 80% for the 25% of the 4.2 million genes that are part of a cluster conserved in at least 50 of 1,308 genomes (Fig. 2b,c).

The relatively high fraction of KEGG-discordant gene pairs could be due to ‘false false positives’, which are functionally associated genes that are not labeled with the same KEGG pathway ID. However, we suspect that most discordant pairs are indeed not functionally associated. This observation was referred to as ‘genomic hitchhiking’ or ‘carpooling’ in ref. ⁴⁶ and was later rationalized by Fang et al.¹⁹: Hitchhiking, the conservation of genomic neighborhoods containing groups of genes without obvious functional links between them, occurs mainly between core (‘persistent’) genes as a side effect of keeping functionally associated, accessory (‘non-persistent’) genes clustered together. As a consequence, accessory genes are less involved in hitchhiking. Hitchhiking generally limits the reliability of predicting functional association from conservation of gene neighborhoods.

To assess the sensitivity of Spacedust for finding functionally associated clusters of genes, we compared it with PADLOC, a tool specialized for finding antiviral defense systems. PADLOC relies on a hand-built library of approximately 3,800 HMMs to search for the proteins forming part of one of the 210 defense systems. Spacedust discovered de novo 95% of the defense system clusters annotated by PADLOC, of which 93% were discovered in their entirety (Fig. 4). Similarly, when assessing Spacedust on its ability to discover BGCs manually annotated in nine genomes, it performed better than GECCO, DeepBGC and ClusterFinder, which are trained on a large dataset of known BGCs (Fig. 5). In summary, Spacedust has similar sensitivity for de novo discovery of functional modules as dedicated tools trained for the discovery of a specific type of gene cluster. However, it is important to note that these specialized tools provide additional value by annotating clusters with specific biosynthetic classes or defense system types, which is not feasible with Spacedust.

The following four limitations of Spacedust need to be addressed in future work. First, partial conservation of a gene cluster in a certain number of genomes predicts functional conservation with only moderate precision (Fig. 2c,d). We are working on an improved conservation score that takes the evolutionary divergence times between genomes into account. We also plan to increase precision by integrating operon predictions on all input genomes. Second, while the fast transformer tool ProstT5 can reliably predict three-dimensional interaction (3Di) sequences for well-studied reference bacterial proteins, its accuracy is less consistent for viral and metagenomic sequences. Although we provide the ProstT5 model to enable full Foldseek structure searches as an alternative to mapping precomputed structures, we emphasize this limitation to guide users in selecting appropriate modes based on their dataset. We are actively working to update the model and will ensure users can easily download the improved version once available. Third, Spacedust cannot find protein members of functional modules encoded outside a conserved gene cluster²⁶. We will address this limitation by applying Spacedust for building a database of module-specific protein families and profile HMMs with HMM-specific acceptance thresholds (similar to Pfam⁵⁷), which should allow us to also identify positionally isolated members of functional modules with high specificity. Fourth, the runtime of Spacedust scales quadratically with the total number of genomes and proteins in the genomes to be analyzed, owing to the quadratic time complexity of the all-versus-all comparison of proteomes and of the cluster detection algorithm.

Positional orthologs, that is, orthologs that also have conserved gene neighborhoods, are under much stronger evolutionary constraints than orthologs without gene neighborhood conservation⁵⁸. Therefore, we expect functional module-specific protein families to show high functional conservation and to become highly useful for automatic functional annotation using a comprehensive profile HMM database of such modules. We also plan to use Spacedust for the systematic discovery of uber-operons or extended gene neighborhoods^46,59,60, sets of genes that tend to co-occur in each other’s neighborhood more often than by chance and that tend to participate in the same or related processes in the cell.

In conclusion, Spacedust is a sensitive and fast tool for finding conserved gene clusters in large numbers of genomes. It can be used for the large-scale discovery of modules of functionally associated genes in prokaryotic and viral genomes and metagenome-assembled genomes, as demonstrated here with various examples. Its de novo approach and high sensitivity make it particularly interesting for the discovery of novel types of functional modules. It can visualize conserved clusters in a query genome across hundreds of target genomes (Fig. 3a). Its tabular output facilitates its integration into current genome annotation pipelines. It can thereby accelerate the identification of the functional capabilities of the millions of prokaryotes and viruses that live in and on our bodies and populate all natural environments.

Methods

Spacedust workflow

Input

Spacedust accepts genomic sequences as multiple FASTA files, each containing a single prokaryotic genome or metagenome-assembled genome. Users can either predict the protein-coding sequences from the input genome using Prodigal (v2.6.3)⁶¹ or provide the corresponding GFF3 annotation files of protein-coding regions. Contigs belonging to the same assembly should be contained in a single FASTA file. All protein-coding regions are extracted and translated. For each protein sequence, the location index, strand and nucleotide coordinates are stored. For all-versus-all comparisons, only one set of genomes is required. For query-to-reference comparisons, a custom database can be built analogously, or a pre-compiled reference database can be automatically downloaded by Spacedust.

Mapping to structure database

To enable structure comparisons, query protein sequences are mapped to the reference structure database provided by Foldseek⁴⁸. Currently, Foldseek supports several structure databases such as AlphaFold (UniProt, Proteome, Swiss-Prot), Protein Data Bank and ESMAtlas30. For each protein, in addition to the amino acid sequence, the structure information, including the 3Di sequence and the C_α coordinates, is stored in the Foldseek database. Each query protein sequence is searched against the amino acid sequences of the structure database using MMseqs2 (ref. ⁵⁰) with a stringent sequence identity cutoff of 0.9 and sequence coverage cutoff of 0.9 (--min-seq-id 0.9 -c 0.9), and the respective 3Di sequence and the C_α coordinates of the best match are retained to build a Foldseek-compatible database. Alternatively, Foldseek supports translation between protein sequences and 3Di sequences using the ProstT5 protein language model⁶², with which a Foldseek-compatible database can be created for all query protein sequences.

Homology search

Spacedust conducts a sensitive search of all query proteins against target proteins using Foldseek and/or MMseqs2. If a Foldseek-compatible database is available, the mapped or translated structure sequences will be used in a Foldseek search, while any remaining unmapped sequences are searched using MMseqs2. The E-value cutoff is set at 0.001, and the query sequence coverage is set at 0.8 (-e 0.001 -c 0.8 --cov-mode 2). The results from both searches are merged. Users can also opt for an iterative profile (PSI-BLAST like) search by specifying the number of iterations.

Clustered search

Searching against a large sequence or structure database is a time-consuming and memory-demanding task. To improve the search speed while maintaining high sensitivity, we implemented a clustered search workflow similar to the strategy used in ColabFold⁶³. The query sequences are searched against the consensus sequence or structure of the clustered version of the reference database. For each query hit to a consensus sequence, we realign the query to its respective cluster members and expand the search results. An additional advantage is the higher sensitivity attained using cluster consensus sequences. The clustered search approach results in a fourfold speed-up, because only 1 million cluster consensus structures are searched instead of 3.8 million structures from the bacterial reference database.

Hit filtering and grouping

Each protein in each genome of the query set is searched against the proteins of each target genome T, which contains N_T proteins. For each query protein, the hit in T with the lowest P value p is identified. If p ≤ p₀ (default value 10⁻⁷), we compute the best-hit P value, that is, the probability that in a comparison with N_T nonhomologous proteins at least one P value would fall below p, given by the first-order P-value statistics, ${p}_{{\rm{bh}}}=1-{(1-p)}^{{N}_{T}}$. The best hits between each pair of query and target genomes are then grouped for subsequent cluster detection.
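As a small numerical illustration of the best-hit P value, the following Python sketch evaluates 1 − (1 − p)^{N_T}. The function name is hypothetical, and the use of `log1p`/`expm1` is our choice for numerical stability at very small p, not something stated in the text.

```python
import math


def best_hit_pvalue(p: float, n_target: int) -> float:
    """Return p_bh = 1 - (1 - p)**n_target for a best-hit P value p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n_target < 0:
        raise ValueError("n_target must be non-negative")
    if p == 1.0:
        return 1.0 if n_target > 0 else 0.0
    # log1p/expm1 avoid the cancellation that the naive formula
    # suffers when p is many orders of magnitude below 1.
    return -math.expm1(n_target * math.log1p(-p))
```

For example, a hit with p = 10⁻⁷ in a target genome of 4,000 proteins gives a best-hit P value of roughly 4 × 10⁻⁴.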

Cluster detection

For each pair of genomes, Spacedust uses a probabilistic approach to identify conserved neighborhoods of genes, which are clusters of homologous hits with partially conserved clustering and ordering. We assess the conservation of gene neighborhood by combining two P-value statistics, a clustering P value and an ordering P value. For any cluster of hits ${\mathcal{C}}$, k denotes the number of hits. The span m is the maximum of the number of genes (including unmatched ones) in the query genome cluster and in the target genome cluster. We also define q₀ (default value 10⁻³) as the probability for an arbitrary protein q to hit an arbitrary protein t in T. The clustering P value of the given cluster is the probability of observing k hits, each occurring with probability q₀, within a square of span m (Supplementary Information), as given by equation (1):

$${p}_{{\rm{clu}}}({\mathcal{C}})\approx \frac{m{!}^{2}}{(m-k){!}^{2}k!}\,{q}_{0}^{k}\,.$$

(1)
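Equation (1) can be evaluated in log space to avoid overflow of the factorials for large spans. The Python sketch below is illustrative only: the function name is hypothetical, and capping the result at log 1 = 0 is our assumption, made because the approximation can exceed 1 for large m and small k.

```python
import math


def log_clustering_pvalue(k: int, m: int, q0: float = 1e-3) -> float:
    """Natural log of p_clu ~ (m!)^2 / (((m-k)!)^2 * k!) * q0^k."""
    if not 1 <= k <= m:
        raise ValueError("need 1 <= k <= m")
    if not 0.0 < q0 <= 1.0:
        raise ValueError("q0 must lie in (0, 1]")
    log_p = (
        2.0 * math.lgamma(m + 1)        # log (m!)^2
        - 2.0 * math.lgamma(m - k + 1)  # log ((m-k)!)^2
        - math.lgamma(k + 1)            # log k!
        + k * math.log(q0)              # log q0^k
    )
    # Assumption: cap at 0 so the value remains a valid log-probability.
    return min(log_p, 0.0)
```

With k = 3 hits in a span of m = 4 and the default q₀, the prefactor is 4!²/(1!² · 3!) = 96, giving p_clu ≈ 9.6 × 10⁻⁸.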

The ordering P value assesses the statistical significance of the conservation of order and strandedness between two cluster matches. We define the directionality and ordering statistic n ∈ {0, 1, 2, …} as the number of neighboring query protein pairs whose matched target proteins are also direct neighbors and whose relative orientation is conserved. The ordering P value is the probability of observing, in a randomly occurring cluster match with k matched proteins, at least n neighboring pairs with conserved order and strandedness. As derived in the Methods section of the Supplementary Information, this P value is given by equation (2):

$${p}_{{\rm{ord}}}({\mathcal{C}})=\frac{1-n/k}{{2}^{n}\,n!}\,.$$

(2)
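Equation (2) translates directly into code. In the Python sketch below the function name is hypothetical, and the bound n ≤ k − 1 is our assumption, based on the observation that k matched proteins form at most k − 1 neighboring pairs.

```python
import math


def ordering_pvalue(n: int, k: int) -> float:
    """Return p_ord = (1 - n/k) / (2**n * n!) from equation (2)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Assumption: k matched proteins form at most k - 1 neighboring pairs.
    if not 0 <= n <= k - 1:
        raise ValueError("need 0 <= n <= k - 1")
    return (1.0 - n / k) / (2 ** n * math.factorial(n))
```

With no conserved neighboring pair (n = 0) the P value is 1, and it falls rapidly as n grows because of the 2ⁿ n! denominator.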

The two statistics are independent of each other and of the strength of individual pairwise sequence homology and thus should improve the specificity of the search. Moreover, neither statistic is influenced by the size of the query and target sets, making them suitable for fragmented contigs with small numbers of genes.

The clusters are detected with a greedy agglomerative hierarchical clustering algorithm. Because the two P values are independent random variables under the null model, we can combine them using the product of P values $p:= {p}_{{\rm{clu}}}({\mathcal{C}})\,{p}_{{\rm{ord}}}({\mathcal{C}})$⁶⁴, which yields the cluster match P value, $p\times (1-\log p)$. We define a cluster score $S({\mathcal{C}})$ as the negative logarithm of the cluster match P value: $S({\mathcal{C}})=-\log {p}_{{\rm{clu}}}({\mathcal{C}})-\log {p}_{{\rm{ord}}}({\mathcal{C}})-\log \left(1-\log {p}_{{\rm{clu}}}({\mathcal{C}})-\log {p}_{{\rm{ord}}}({\mathcal{C}})\right)$. The algorithm first treats each hit as a singleton cluster and then iteratively merges the hits that yield the highest cluster match score while satisfying the clustering criteria. Users can adjust the stringency by defining different clustering criteria, such as the maximum number of gaps (non-cluster genes) allowed and the minimum number of genes in a cluster. The probabilistic nature of the algorithm accounts for micro-rearrangements between genomes, gene insertions/losses and misannotated genes.
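The cluster score can be sketched in Python as the negative natural logarithm of the cluster match P value p(1 − ln p), where p is the product of the clustering and ordering P values. The function name is hypothetical, and working with log P values as inputs is our choice to avoid underflow; the natural logarithm is used because the product-of-two-uniform-P-values formula requires it.

```python
import math


def cluster_score(log_p_clu: float, log_p_ord: float) -> float:
    """Return S = -ln(p * (1 - ln p)) with ln p = log_p_clu + log_p_ord."""
    log_p = log_p_clu + log_p_ord
    if log_p > 0.0:
        raise ValueError("log P values must be <= 0")
    # -ln(p * (1 - ln p)) = -ln p - ln(1 - ln p)
    return -log_p - math.log1p(-log_p)
```

For instance, ln p_clu = −7 and ln p_ord = −3 give S = 10 − ln 11 ≈ 7.6. The correction term lowers the score relative to −ln p, reflecting that a product of two P values is itself not uniformly distributed under the null model.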

Output

Spacedust outputs a tab-separated text file. Each reported cluster consists of one summary line followed by one line for each pairwise hit. The summary line starts with ‘#’ and lists a unique cluster identifier, query genome accession, target genome accession, cluster match P value (joint P value of clustering and ordering), multi-hit P value and number of hits in the cluster. Each following line describes an individual member hit of the cluster in MMseqs2 alignment-result-like format with the following columns: query protein accession, target protein accession, best-hit P value p_bh, sequence identity, pairwise E-value, query protein start, end and length, target protein start, end and length, and alignment traceback string.
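A reader for this output could look like the following Python sketch. It assumes the column order exactly as listed above and that the ‘#’ is attached to, or separated by whitespace from, the first summary field; these layout details, like all class and function names here, are assumptions and should be checked against an actual result file.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Hit:
    query_protein: str
    target_protein: str
    best_hit_pvalue: float
    seq_identity: float
    evalue: float
    q_start: int
    q_end: int
    q_len: int
    t_start: int
    t_end: int
    t_len: int
    traceback: str


@dataclass
class Cluster:
    cluster_id: str
    query_genome: str
    target_genome: str
    cluster_pvalue: float
    multihit_pvalue: float
    n_hits: int
    hits: List[Hit] = field(default_factory=list)


def parse_clusters(lines: Iterable[str]) -> List[Cluster]:
    """Parse summary ('#') and member-hit lines into Cluster records."""
    clusters: List[Cluster] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            f = line.lstrip("#").strip().split("\t")
            clusters.append(
                Cluster(f[0], f[1], f[2], float(f[3]), float(f[4]), int(f[5]))
            )
        else:
            if not clusters:
                raise ValueError("hit line found before any summary line")
            f = line.split("\t")
            if len(f) < 12:
                raise ValueError("expected 12 columns in a hit line")
            clusters[-1].hits.append(
                Hit(f[0], f[1], float(f[2]), float(f[3]), float(f[4]),
                    int(f[5]), int(f[6]), int(f[7]),
                    int(f[8]), int(f[9]), int(f[10]), f[11])
            )
    return clusters
```

Passing an open file handle, for example `parse_clusters(open(path))` with `path` pointing to a result file, yields one `Cluster` per summary line with its member hits attached.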

Bacterial reference database

The bacterial reference database was assembled from the KEGG GENOME collection⁶⁵, which comprises 7,167 complete bacterial genomes. The genomes were downloaded from NCBI GenBank in September 2022. Genomes were filtered for redundancy using pairwise average amino acid identities (AAIs). Specifically, we used Mash (v2.3)⁶⁶ to perform all-versus-all comparisons using the amino acid alphabet (-a) with default parameters and computed the pairwise AAI as (1 − Mash distance). Next, genomes with AAIs of at least 70% were clustered using SciPy’s hierarchical clustering function⁶⁷,⁶⁸, and the longest sequence within each cluster was selected as the representative. The threshold roughly corresponds to genus-level clustering, meaning the representatives belong to different bacterial genera⁶⁹. This resulted in 1,308 representative genomes that make up the reference database. We predicted the protein sequences using Prodigal v2.6.3 and constructed the Foldseek-compatible database as described above. Around 90.9% (3.8 million of 4.2 million) of the protein sequences could be mapped to a structure in the AlphaFold database. Comprehensive functional annotation of all protein sequences is included using eggNOG-mapper (v2.0)⁵¹ with MMseqs2 search and default parameters.
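The redundancy filtering can be illustrated with a dependency-free Python sketch. It is a stand-in rather than the procedure used here: instead of SciPy’s hierarchical clustering it groups genomes by single linkage with a union-find structure (the linkage method is not stated in the text), it interprets the “longest sequence” as the largest genome, and all names are hypothetical.

```python
from typing import Dict, List, Tuple


def pick_representatives(
    genome_sizes: Dict[str, int],
    mash_distances: Dict[Tuple[str, str], float],
    min_aai: float = 0.70,
) -> List[str]:
    """Group genomes with AAI >= min_aai and keep the largest per group."""
    parent = {g: g for g in genome_sizes}

    def find(g: str) -> str:
        # Path-halving union-find lookup.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), dist in mash_distances.items():
        # AAI is approximated as 1 - Mash distance, as described above.
        if 1.0 - dist >= min_aai:
            parent[find(a)] = find(b)

    best: Dict[str, str] = {}
    for genome, size in genome_sizes.items():
        root = find(genome)
        if root not in best or size > genome_sizes[best[root]]:
            best[root] = genome
    return sorted(best.values())
```

Single linkage merges two groups as soon as any pair of genomes passes the threshold, so other linkage methods may yield a different number of representatives.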

Reducing the size of the reference database

For large reference databases, the search step’s memory requirements would make local runs infeasible. The reference genomes, even after the AAI-based redundancy filtering step, still contain sequence redundancy. Therefore, we further reduced the size of the database by clustering the protein and structure sequences. We clustered the protein sequences with MMseqs2 at 70% sequence identity and 80% bidirectional coverage (--min-seq-id 0.70 -c 0.8 --cov-mode 0). For structures, we first clustered the protein sequences at 30% sequence identity and 90% bidirectional coverage with MMseqs2 (--min-seq-id 0.30 -c 0.8 --cov-mode 0), and then further clustered the structure sequences with Foldseek without a sequence identity threshold but with 90% bidirectional coverage and an E-value of less than 0.01. Spacedust provides the search mode --profile-cluster-search. In this mode, Spacedust performs the MMseqs2 and Foldseek searches only against the cluster consensus sequences and then expands the results to the other members of each cluster so that no hits are lost.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Data used in this work were obtained from public sources and are freely accessible. The bacterial genome dataset was assembled from the KEGG GENOME collection (https://www.genome.jp/kegg/tables/br08606.html), with genomes downloaded from NCBI GenBank in September 2022. The protein structure database used for mapping was compiled from the AlphaFold database (https://alphafold.ebi.ac.uk/). The genomes and datasets used for the analysis in Fig. 5 and Extended Data Figs. 8–10 are publicly accessible via the supplementary material of Hannigan, G. D. et al.³⁰. The genome used as a query in Fig. 6 is available from NCBI GenBank under accession number NZ_BEXT01000001.1. The target database is publicly available from GTDB (https://gtdb.ecogenomic.org/).

Code availability

Spacedust is implemented in C++ and is available as an open-source (GPLv3), user-friendly, command-line software for Linux and macOS. The Spacedust source code, compilation instructions and a user guide are available at https://github.com/soedinglab/Spacedust/. The dataset of the 1,308 representative bacterial genomes and scripts to reproduce the search and visualize results are available at https://wwwuser.gwdg.de/~compbiol/spacedust/.

References

1. Quince, C., Walker, A. W., Simpson, J. T., Loman, N. J. & Segata, N. Shotgun metagenomics, from sampling to analysis. Nat. Biotechnol. 35, 833–844 (2017).
2. Almeida, A. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat. Biotechnol. 39, 105–114 (2021).
3. Nayfach, S. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat. Microbiol. 6, 960–970 (2021).
4. Chen, J. Global marine microbial diversity and its potential in bioprospecting. Nature 633, 371–379 (2024).
5. Thomas, A. M. & Segata, N. Multiple levels of the unknown in microbiome research. BMC Biol. 17, 48 (2019).
6. Richardson, L. Genome properties in 2019: a new companion database to InterPro for the inference of complete functional attributes. Nucleic Acids Res. 47, D564–D572 (2019).
7. Galperin, M. Y. COG database update: focus on microbial diversity, model organisms, and widespread pathogens. Nucleic Acids Res. 49, D274–D281 (2020).
8. Overbeek, R. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res. 42, D206–D214 (2014).
9. Richardson, L. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Res. 51, D753–D759 (2023).
10. Wilke, A. The MG-RAST metagenomics database and portal in 2015. Nucleic Acids Res. 44, D590–D594 (2016).
11. Franzosa, E. A. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nat. Microbiol. 4, 293–305 (2019).
12. Mahlich, Y., Steinegger, M., Rost, B. & Bromberg, Y. HFSP: high speed homology-driven function annotation of proteins. Bioinformatics 34, i304–i312 (2018).
13. Chen, I.-M. A. IMG/M: integrated genome and metagenome comparative data analysis system. Nucleic Acids Res. 45, D507–D516 (2017).
14. Rost, B. Twilight zone of protein sequence alignments. Protein Eng. 12, 85–94 (1999).
15. Huerta-Cepas, J. Fast genome-wide functional annotation through orthology assignment by eggNOG-Mapper. Mol. Biol. Evol. 34, 2115–2122 (2017).
16. Altenhoff, A. M., Studer, R. A., Robinson-Rechavi, M. & Dessimoz, C. Resolving the ortholog conjecture: orthologs tend to be weakly, but significantly, more similar in function than paralogs. PLoS Comput. Biol. 8, e1002514 (2012).
17. Stamboulian, M., Guerrero, R. F., Hahn, M. W. & Radivojac, P. The ortholog conjecture revisited: the value of orthologs and paralogs in function prediction. Bioinformatics 36, i219–i226 (2020).
18. Moreno-Hagelsieb, G. in Bioinformatics: Volume II: Structure, Function, and Applications 41–63 (ed. Keith, J. M.) (Springer, 2017).
19. Fang, G., Rocha, E. P. & Danchin, A. Persistence drives gene clustering in bacterial genomes. BMC Genomics 9, 4 (2008).
20. Overbeek, R., Fonstein, M., D’Souza, M., Pusch, G. D. & Maltsev, N. The use of gene clusters to infer functional coupling. Proc. Natl Acad. Sci. USA 96, 2896–2901 (1999).
21. Huynen, M., Snel, B., Lathe, W. & Bork, P. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res. 10, 1204–1210 (2000).
22. Jahangiri-Tazehkand, S., Wong, L. & Eslahchi, C. OrthoGNC: a software for accurate identification of orthologs based on gene neighborhood conservation. Genomics Proteomics Bioinformatics 15, 361–370 (2017).
23. Georgescu, C. H. SynerClust: a highly scalable, synteny-aware orthologue clustering tool. Microb. Genom. 4, e000231 (2018).
24. Tang, H. SynFind: compiling syntenic regions across any set of genomes on demand. Genome Biol. Evol. 7, 3286–3298 (2015).
25. Fouts, D. E., Brinkac, L., Beck, E., Inman, J. & Sutton, G. PanOCT: automated clustering of orthologs using conserved gene neighborhood for pan-genomic analysis of bacterial strains and closely related species. Nucleic Acids Res. 40, e172 (2012).
26. Shmakov, S. A. Systematic prediction of functionally linked genes in bacterial and archaeal genomes. Nat. Protoc. 14, 3013–3031 (2019).
27. Szklarczyk, D. The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 51, D638–D646 (2023).
28. Blin, K. antiSMASH 6.0: improving cluster detection and comparison capabilities. Nucleic Acids Res. 49, W29–W35 (2021).
29. Cimermancic, P. Insights into secondary metabolism from a global analysis of prokaryotic biosynthetic gene clusters. Cell 158, 412–421 (2014).
30. Hannigan, G. D. A deep learning genome-mining strategy for biosynthetic gene cluster prediction. Nucleic Acids Res. 47, e110 (2019).
31. Carroll, L. M. et al. Accurate de novo identification of biosynthetic gene clusters with GECCO. Preprint at bioRxiv https://doi.org/10.1101/2021.05.03.442509 (2021).
32. Tesson, F. Systematic and quantitative view of the antiviral arsenal of prokaryotes. Nat. Commun. 13, 2561 (2022).
33. Payne, L. J. Identification and classification of antiviral defence systems in bacteria and archaea with PADLOC reveals new system types. Nucleic Acids Res. 49, 10868–10878 (2021).
34. Doron, S. Systematic discovery of antiphage defense systems in the microbial pangenome. Science 359, eaar4120 (2018).
35. Ho Sui, S. J., Fedynak, A., Hsiao, W. W. L., Langille, M. G. I. & Brinkman, F. S. L. The association of virulence factors with genomic islands. PLoS ONE 4, e8094 (2009).
36. Li, J. VRprofile: gene-cluster-detection-based profiling of virulence and antibiotic resistance traits encoded within genome sequences of pathogenic bacteria. Brief Bioinform. 19, 566–574 (2018).
37. Awasthi, G., Kumari, A., Pant, A. B. & Srivastava, P. In silico identification and construction of microbial gene clusters associated with biodegradation of xenobiotic compounds. Microb. Pathog. 114, 340–343 (2018).
38. Marcet-Houben, M. & Gabaldón, T. EvolClust: automated inference of evolutionary conserved gene clusters in eukaryotes. Bioinformatics 36, 1265–1266 (2019).
39. Medema, M. H., Takano, E. & Breitling, R. Detecting sequence homology at the gene cluster level with MultiGeneBlast. Mol. Biol. Evol. 30, 1218–1223 (2013).
40. Gilchrist, C. L. M. cblaster: a remote search tool for rapid identification and visualization of homologous gene clusters. Bioinform. Adv. 1, vbab016 (2021).
41. Svetlitsky, D., Dagan, T., Chalifa-Caspi, V. & Ziv-Ukelson, M. CSBFinder: discovery of colinear syntenic blocks across thousands of prokaryotic genomes. Bioinformatics 35, 1634–1643 (2019).
42. Altschul, S. F. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
43. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
44. Wolf, Y. I., Rogozin, I. B., Kondrashov, A. S. & Koonin, E. V. Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Res. 11, 356–372 (2001).
45. Winter, S. Finding approximate gene clusters with Gecko 3. Nucleic Acids Res. 44, 9600–9610 (2016).
46. Rogozin, I. Connected gene neighborhoods in prokaryotic genomes. Nucleic Acids Res. 30, 2212–2223 (2002).
47. Pavlopoulos, G. A. Unraveling the functional dark matter through global metagenomics. Nature 622, 594–602 (2023).
48. van Kempen, M. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2024).
49. Ruperti, F. Cross-phyla protein annotation by structural prediction and alignment. Genome Biol. 24, 113 (2023).
50. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
51. Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 38, 5825–5829 (2021).
52. Domínguez-Martín, M. A. Structures of a phycobilisome in light-harvesting and photoprotected states. Nature 609, 835–845 (2022).
53. Kamei, A., Yuasa, T., Orikawa, K., Geng, X. X. & Ikeuchi, M. A eukaryotic-type protein kinase, SpkA, is required for normal motility of the unicellular cyanobacterium Synechocystis sp. strain PCC 6803. J. Bacteriol. 183, 1505–1510 (2001).
54. Özcan, A. Programmable RNA targeting with the single-protein CRISPR effector Cas7-11. Nature 597, 720–725 (2021).
55. Parks, D. H. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 50, D785–D794 (2021).
56. Penn, K. Genomic islands link secondary metabolism to functional adaptation in marine Actinobacteria. ISME J. 3, 1193–1203 (2009).
57. Mistry, J. Pfam: the protein families database in 2021. Nucleic Acids Res. 49, D412–D419 (2020).
58. Lemoine, F., Lespinet, O. & Labedan, B. Assessing the evolutionary rate of positional orthologous genes in prokaryotes using synteny data. BMC Evol. Biol. 7, 237 (2007).
59. Lathe, W., Snel, B. & Bork, P. Gene context conservation of a higher order than operons. Trends Biochem. Sci. 25, 474–479 (2000).
60. Che, D., Li, G., Mao, F., Wu, H. & Xu, Y. Detecting uber-operons in prokaryotic genomes. Nucleic Acids Res. 34, 2418–2427 (2006).
61. Hyatt, D. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).
62. Heinzinger, M. Bilingual language model for protein sequence and structure. NAR Genom. Bioinform. 6, lqae150 (2024).
63. Mirdita, M. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
64. Bailey, T. L. & Grundy, W. N. Classifying proteins by family using the product of correlated p-values. In Proc. Third Annual International Conference on Computational Molecular Biology 10–14 (ACM, 1999).
65. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2015).
66. Ondov, B. D. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).
67. Virtanen, P. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
68. Müllner, D. Modern hierarchical, agglomerative clustering algorithms. Preprint at https://doi.org/10.48550/arXiv.1109.2378 (2011).
69. Rodriguez-R, L. M. The Microbial Genomes Atlas (MiGA) webserver: taxonomic and gene diversity analysis of Archaea and Bacteria at the whole genome level. Nucleic Acids Res. 46, W282–W288 (2018).

Acknowledgements

We thank M. Steinegger for the suggestion to use Foldseek in Spacedust. We thank H. Su and E. L. Karin for their valuable feedback on the implementation and insightful comments on the manuscript. We used the Scientific Compute Cluster at GWDG, the joint data center of the Max Planck Society (MPG) and University of Göttingen. The work was supported by the BMBF CompLifeSci project horizontal4meta. M.M. acknowledges support from the National Research Foundation of Korea (NRF) (grant RS-2023-00250470). R.Z. was supported by the International Max Planck Research School (IMPRS) for Genome Science.

Funding

Open access funding provided by Max Planck Society.

Author information

Authors and Affiliations

Quantitative and Computational Biology, Max Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
Ruoshi Zhang & Johannes Söding
School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
Milot Mirdita
Campus Institute Data Science (CIDAS), University of Göttingen, Göttingen, Germany
Johannes Söding

Contributions

R.Z. and J.S. designed the Spacedust algorithm, benchmarks and biological applications. R.Z. and M.M. developed the software. R.Z. performed benchmarks and generated figures. R.Z. and J.S. wrote the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Johannes Söding.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Kai Blin and Daniel R. Mende for their contribution to the peer review of this work. Primary Handling Editor: Lei Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Precision-recall (PR) of functional association of non-redundant conserved clusters.

(including the ribosomal genes) for (A) Foldseek+MMseqs2 search and (B) Foldseek-only search with 3Di sequences predicted by ProstT5, assessed by congruence of KEGG module IDs of Spacedust cluster matches for all gene pairs separated by up to 4 genes (i, i+1), …, (i, i+4).

Extended Data Fig. 2 Evolutionary conservation of gene clusters in a cyanobacterium Synechocystis sp. PCC6803.

Clustered hits of Synechocystis sp. PCC6803 (Genome ID 527) against 1308 bacterial reference genomes using Spacedust Foldseek+MMseqs2 search.

Extended Data Fig. 3 Evolutionary conservation of gene clusters in a cyanobacterium Synechocystis sp. PCC6803 (ProstT5).

Clustered hits of Synechocystis sp. PCC6803 (Genome ID 527) against 1308 bacterial reference genomes using Spacedust Foldseek-only search with 3Di sequences predicted by ProstT5.

Extended Data Fig. 4 Evolutionary conservation of gene clusters in a cyanobacterium Synechocystis sp. PCC6803 (MMseqs2).

Clustered hits of Synechocystis sp. PCC6803 (Genome ID 527) against 1308 bacterial reference genomes using Spacedust MMseqs2 search.

Extended Data Fig. 5 Gene neighborhood of Cyanobacteria-specific cluster 1.

(Protein ID 510–515), centered around protein 512.

Extended Data Fig. 6 Gene neighborhood of Cyanobacteria-specific cluster 2.

(Protein ID 648–652), centered around protein 649.

Extended Data Fig. 7 Gene neighborhood of Cyanobacteria-specific cluster 3.

(Protein ID 655–657), centered around protein 655.

Extended Data Fig. 8 Scatter plots of precision versus recall for the 207 annotated BGCs.

for (A) ClusterFinder, (B) DeepBGC, (C) GECCO and (D) Spacedust.

Extended Data Fig. 9 Contig view of 9 reference genomes with genomic regions.

predicted by ClusterFinder (green), DeepBGC (orange), GECCO (yellow) and Spacedust (blue) overlapping with the annotated BGCs (grey).

Extended Data Fig. 10 Example BGC regions (DS999641.1).

identified by ClusterFinder (green), DeepBGC (orange), GECCO (yellow) and Spacedust (blue), superimposed upon annotated BGCs (grey) along with antiSMASH (version 8) predictions and functional categories.

Supplementary information

Supplementary Information

Supplementary Notes, Methods and Tables 1–3

Reporting Summary

Peer Review File

Supplementary Data 1

Sample result of Spacedust.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhang, R., Mirdita, M. & Söding, J. De novo discovery of conserved gene clusters in microbial genomes with Spacedust. Nat Methods 22, 2065–2073 (2025). https://doi.org/10.1038/s41592-025-02816-x

Received: 02 October 2024
Accepted: 13 August 2025
Published: 15 September 2025
Issue date: October 2025
DOI: https://doi.org/10.1038/s41592-025-02816-x