Introduction

The recent development of high-throughput sequencing technologies has permitted the generation of omics data at an exponential scale. Integrating these high-dimensional biological data holds great potential for elucidating the complex mechanisms underlying diverse biological systems1,2,3. Among these, the interactions between microorganisms and metabolites have garnered growing interest due to their involvement in a wide range of biological functions. In human health, shifts in microbiome-metabolome relationships have been linked to conditions such as cardio-metabolic diseases4 and autism spectrum disorders5, potentially reflecting disruptions in microbiota-derived metabolic processes6. Beyond human health, microbiome-metabolome interactions also play key roles in plant productivity, soil nutrient cycling, and ecosystem resilience7,8,9.

Microbiome data, generated through metagenomic sequencing, provide a broad overview of microbial community composition based on either functional gene content or taxonomic assignment10. However, they present unique analytical challenges due to inherent properties such as over-dispersion, zero inflation, high collinearity between taxa, and their compositional nature11,12. Proper handling of this compositionality, often through transformations like centered log-ratio (CLR) or isometric log-ratio (ILR), is crucial for avoiding spurious results11,12,13. Metabolomics, on the other hand, offers a comprehensive snapshot of the small molecules within a biological system, typically measured using techniques like LC-MS14. Similar to microbiome data, metabolomic profiles often exhibit over-dispersion and complex correlation structures. Integrating metabolomics data with pathway information alongside other omics, such as metagenomics, is vital for identifying disease-related biomarkers and metabolic signatures, with tools like MetaboAnalyst facilitating such analyses15.

The integration of microbiome and metabolome data into statistical frameworks provides key insights into how diet and lifestyle shape species-metabolite interactions, but it also brings its own challenges16. The growing number of statistical models available for this integration makes choosing the most appropriate method a daunting task for many researchers.

Typical research questions involve either inferring the interactions that occur between species and metabolites or predicting one omic layer from the biological information contained in the other omic layers17,18. The inference of interactions focuses on understanding and identifying the complex relationships between microbial communities and the metabolites they influence. It aims to uncover causal dependencies and mechanisms underlying microbial-metabolite interactions in biological systems19. While advanced machine learning and deep learning methods are increasingly used for predictive tasks, they often lack the mechanistic interpretability needed for generating novel biological hypotheses and thus will not be the primary focus of this paper.

To elucidate the complex entanglement between microorganisms and metabolites, several strategies can be exploited, each revealing associations occurring at different scales. Consistent with a recent report20, traditional workflows include distinct types of analysis that address complementary biological questions2, including the detection of global associations, data summarization, individual associations, and identification of the most relevant associated features across datasets.

Firstly, researchers often aim to determine the presence of an overall association between the two omic datasets. Multivariate methods like Procrustes analysis21, the Mantel test22, and MMiRKAT23 are commonly employed for this purpose. This global assessment often comes before more specific analyses such as data summarization or the identification of the most relevant associated features across datasets2.

Subsequently, mirroring approaches used in single-omics studies, summarizing the information within each dataset is a common objective, facilitating visualization and interpretation1. The presence of two distinct omic layers allows for the exploration of both intra- and inter-correlations between their features. Data summarization methods like canonical correlation analysis (CCA)24, Partial Least Squares (PLS)25, redundancy analysis (RDA)26, and MOFA227 help identify the most relevant associated features across datasets explaining significant data variability and have been successful in uncovering associations in diseases like Type 2 diabetes28. However, these approaches often lack the resolution to pinpoint specific microorganism-metabolite relationships, which is crucial for identifying context-relevant core features.

To address this limitation, methods for detecting individual associations may prove relevant for the identification of metagenomic taxonomic features associated with diet-associated metabolites2. One common strategy involves computing association measures (correlation or regression) for each metabolite-species pair. While straightforward, these methods face challenges due to the multiple testing burden. Alternative approaches include univariate or multivariate feature selection methods like LASSO29, sparse CCA (sCCA)30, and sparse PLS (sPLS)31, which aim to identify the most relevant associated features across datasets while addressing multicollinearity. However, the inherent complexities of microbiome and metabolome data can still limit the biological interpretability of results obtained from standard methods.

Thus, integrating microbiome and metabolome data requires particular attention. Approaches to deal with compositional data either as an outcome or explanatory variable have already been proposed12,32,33, covering applications of global association methods, data summarization, individual associations, or identification of core features. Conventional strategies involve applying standard statistical methods after appropriate transformations, such as CLR or ILR for microbiome data, or log transformation for metabolomics data. Alternatively, purely compositional approaches, such as Dirichlet regression or balances derived from the ILR transformation, explicitly account for the compositional nature of microbiome data34,35,36,37. Given the many available analysis methodologies, determining which strategy is best for a given research question remains an open problem with major implications for practitioners.

Despite recent efforts to evaluate statistical methods for joint microbiome and metabolome integration, these have often been limited to literature reviews or real-data applications lacking a known ground truth38. This absence of unbiased comparisons is a significant gap in the field20, hindering the establishment of best practices for result interpretability and reproducibility. For instance, while transformations like CLR and ILR are commonly used for microbiome data normalization, a clear consensus on their relative performance is lacking.

Therefore, in this paper, we present a comprehensive comparison of nineteen different integrative methods to decipher the complex relationships between microorganisms and metabolites.

Our selection of methods and analytical strategies is based on a recent review20, covering common research aims such as global associations, data summarization, individual associations, and feature selection. Through extensive simulation studies, we provide valuable insights into the strengths and limitations of methods commonly used in practice. Subsequently, we apply the top-performing methods, identified based on power, robustness, and interpretability, to real gut microbiome and metabolome data from Konzo disease39,40, revealing a complex multi-scale architecture between the two omic layers. Specifically, the best methods were selected based on their performance in: (i) global associations: detecting significant overall correlations while controlling false positives; (ii) data summarization: capturing and explaining shared variance; (iii) individual associations: detecting meaningful pairwise species-metabolite relationships with high sensitivity and specificity; and (iv) feature selection: identifying stable and non-redundant most relevant associated features across datasets. Finally, we offer general and specific recommendations for researchers based on their data and research objectives, and we highlight potential avenues for future methodological development. A comprehensive user guide with all associated code is provided to facilitate the application of these methods in diverse contexts, promoting scientific replicability and reproducibility.

Methods

In this section, we describe the data simulation process, the methods tested for the four key analytical questions, and the evaluation metrics used to assess their performance.

Simulation setups

Microbiome and metabolome data were simulated using the Normal to Anything (NORtA) algorithm, which allows for generating data with arbitrary marginal distributions and correlation structures41,42,43. Three real microbiome-metabolome datasets were used as templates:

  • Konzo dataset39: This high-dimensional dataset comprises 171 samples, 1,098 taxa, and 1,340 metabolites, collected from individuals affected by Konzo and matched controls. Microbiome data follow a negative binomial distribution, and metabolome data follow a Poisson distribution.

  • Adenomas dataset44: This intermediate-size dataset includes 240 samples, 500 taxa, and 463 metabolites, from patients with advanced adenomas and matched controls. Microbiome data exhibit zero-inflated negative binomial distributions, and metabolome data follow a log-normal distribution.

  • Autism spectrum disorder dataset45: This small dataset consists of 44 samples, 322 microbial taxa, and 61 metabolites, from children with autism spectrum disorder and neurotypical controls. Microbiome data exhibit zero-inflated negative binomial structures, and metabolome data follow a Poisson distribution.

To estimate the marginal distributions and correlation structures used in the simulations, we pooled all samples from each dataset regardless of study group, without explicitly modeling group-specific effects.

Correlation networks for species and metabolites were estimated using SpiecEasi46, and normal distributions were converted into correlated distributions matching the original data structures. Simulations accounted for microbiome transformations (CLR, ILR, and alpha) and evaluated their impact on method performance (Table S1).
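The three NORtA steps (correlated Gaussians, conversion to uniforms, conversion to the target marginals) can be illustrated with a minimal two-variable sketch. It is not the simulation code used in this study: the function names, the choice of Poisson marginals for both variables, and the fixed latent correlation `rho` are assumptions made for illustration, and only the Python standard library is used.

```python
import math
import random
from statistics import NormalDist


def poisson_quantile(u, lam):
    """Smallest k with P(Poisson(lam) <= k) >= u, by direct summation."""
    k = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < u and k < 10_000:  # hard cap guards against rounding at u ~ 1
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k


def norta_pair(n, rho, lam_x, lam_y, seed=0):
    """Draw n pairs of Poisson counts whose latent Gaussian correlation is rho."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        # Step 1: correlated standard normals via a 2x2 Cholesky factor.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Step 2: map each normal to a uniform through the normal CDF.
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        # Step 3: map each uniform to the target marginal via its quantile function.
        pairs.append((poisson_quantile(u1, lam_x), poisson_quantile(u2, lam_y)))
    return pairs
```

Note that the correlation of the resulting counts is generally somewhat weaker than the latent `rho`, because the quantile mapping is nonlinear.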

To assess Type-I error control, null datasets with no associations were generated. For alternative scenarios, we varied the number and strength of associations between microorganisms and metabolites. Methods were tested under three realistic scenarios with varying sample sizes, feature numbers, and data structures, with 1000 replicates per scenario. Simulation details and additional scenarios are provided in the supplementary material, and key steps are illustrated in Fig. 1.

Fig. 1: Overview of the simulation setup based on real datasets.

A Three microbiome-metabolome datasets were selected, each exhibiting different data structures and correlations. We reported the sample size (N) and the number of features (P), as N × P, for each dataset. B Realistic datasets were simulated using the “Normal-to-Anything” (NORtA) framework. First, we estimated sparse microbiome and metabolome correlation networks using SpiecEasi. Second, correlated multivariate Gaussian distributions were generated for both microbiome and metabolome datasets using the correlation structures estimated in the previous step. Third, Gaussian distributions were converted into arbitrary distributions matching the original data structures. C Associations between species and metabolites were specified, mimicking the complex entanglement between the two omic layers. For each dataset, proportions of associated features vary between 1% and 10%, with association strengths randomly picked from a Gaussian distribution.

Additional scenarios were generated for methods requiring specific assumptions, detailed in the supplementary material "Additional simulation settings" section.

Statistical analyses

Let X and Y denote the microbiome and metabolome matrices, collected on the same set of samples, of size n × p and n × q, respectively, where n is the number of samples, p the number of metagenomic features (i.e., species-level assignments in our scenario), and q the number of metabolites. Xij represents the jth microorganism in the ith sample, with j = 1, 2, ..., p, while Yik is the kth metabolite in the ith sample, with k = 1, 2, ..., q. For the sake of simplicity, we considered the case where p = q.

Prior to applying the methods, microbiome data were transformed using compositional data approaches, which are detailed in the "Data and distance kernel transformation" section.

The following sections describe the different categories of methods benchmarked in this study: (i) global association methods; (ii) data summarization methods; (iii) individual association methods; and (iv) feature selection methods. Since these methods rely on different statistical principles, their outputs are not directly comparable. We therefore used distinct performance metrics tailored to each category, which are described in their respective sections.

Data and distance kernel transformation

Most methods selected below require a normalization step or a distance-based transformation to handle compositional or over-dispersed data. We applied three microbiome normalizations and one metabolome transformation, selected based on the research objective.

To account for microbiome compositionality while preserving feature count, we used the CLR transformation47,48, computed as:

$$CLR({X}_{j})=\log \left(\frac{{X}_{j}}{g(X)}\right)$$

where g(X) is the geometric mean across microorganisms in a sample. CLR maintains interpretability and was used as the reference normalization for individual associations and feature selection. However, it does not ensure feature independence or sub-compositional coherence, limiting its suitability for distance-based methods.
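As an illustration of the CLR formula, the sketch below transforms a single sample. The function name and the pseudocount used to handle zeros are assumptions for illustration; the zero-replacement strategy is not specified in this section.

```python
import math


def clr(sample, pseudocount=0.5):
    """Centered log-ratio of one sample (a list of counts or abundances)."""
    # Assumption: zeros are handled by adding a pseudocount to every entry.
    logs = [math.log(x + pseudocount) for x in sample]
    log_gmean = sum(logs) / len(logs)  # log of the geometric mean g(X)
    return [v - log_gmean for v in logs]
```

By construction, the CLR coordinates of a sample sum to zero, and without a pseudocount they are unchanged when the sample is multiplied by a constant, which is what makes the transformation insensitive to sequencing depth.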

For global associations and data summarization, we applied the ILR35 and alpha transformations34, which project the data into a (D − 1)-dimensional quasi-orthogonal space, where D is the number of parts in the composition. For j = 1, ..., D − 1, the jth ILR coordinate is defined as:

$$ILR{(X)}_{j}=\sqrt{\frac{j}{j+1}}\log \left(\frac{{\left({\prod }_{i = 1}^{j}{X}_{i}\right)}^{1/j}}{{X}_{j+1}}\right)$$

while the alpha transformation is a Box-Cox type transformation ensuring multivariate normality34.
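A minimal sketch of the ILR transformation is given below, assuming the standard pivot-coordinate basis, in which the jth coordinate contrasts the geometric mean of the first j parts with part j + 1. The function name is hypothetical, and the input is assumed to be strictly positive (zeros already replaced).

```python
import math


def ilr_pivot(sample):
    """Pivot-coordinate ILR of a strictly positive composition of D parts."""
    logs = [math.log(x) for x in sample]
    coords = []
    for j in range(1, len(logs)):  # j = 1, ..., D - 1
        mean_log_first_j = sum(logs[:j]) / j  # log geometric mean of parts 1..j
        # logs[j] is part j + 1 because Python lists are 0-indexed.
        coords.append(math.sqrt(j / (j + 1)) * (mean_log_first_j - logs[j]))
    return coords
```

The output has D − 1 coordinates, and its Euclidean norm equals that of the CLR-transformed sample, which is the isometry property that gives the transformation its name.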

While we considered different normalization strategies for microbiome data, metabolome normalization can also significantly impact data analysis results. Various normalization approaches exist, including probabilistic quotient normalization, median fold change normalization, and total area normalization, each addressing specific biases in metabolomic measurements49. However, in practice, log-transformation remains the most commonly used approach for metabolomic data, as it stabilizes variance and improves normality50,51,52. Consequently, and because metabolome data are log-normally distributed13, our evaluation focused on assessing the impact of applying, or not applying, a natural log-transformation to metabolomic data.

Additionally, for global and individual association analyses, we considered Euclidean, Canberra, and Manhattan distances on original and log-transformed metabolome data, and Euclidean distance on microbiome data. Notably, Euclidean distance on CLR-transformed data corresponds to Aitchison distance, which has been shown to outperform Bray–Curtis dissimilarity in stability and interpretability11.
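The distance kernels mentioned here can be sketched as follows for two samples. This is an illustrative implementation with hypothetical function names: the Canberra distance is written as the plain sum of term-wise ratios, which may differ by a scaling factor from the implementation in a given package, and the Aitchison distance is computed as the Euclidean distance between CLR-transformed samples, assuming strictly positive inputs.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def canberra(a, b):
    # Assumption: terms where both entries are zero are skipped (0/0 treated as 0).
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0:
            total += abs(x - y) / denom
    return total


def aitchison(a, b):
    """Aitchison distance: Euclidean distance between CLR-transformed samples."""
    def clr(v):
        logs = [math.log(x) for x in v]
        mean_log = sum(logs) / len(logs)
        return [value - mean_log for value in logs]
    return euclidean(clr(a), clr(b))
```

Two samples that differ only by a constant factor, such as sequencing depth, have an Aitchison distance of zero, whereas their raw Euclidean distance can be large.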

All transformations and distance kernels used per method are summarized in Table S1.

Global associations

We define global association methods as statistical approaches assessing overall correlations between microbiome and metabolome data. Following Deek et al. 2024, we evaluated the Mantel test22, MMiRKAT23, and Procrustes Analysis21, using various data transformations and distance kernels.

Selected methods

The Mantel test22 measures global correlation between two datasets from the same samples, using distance or dissimilarity matrices. We applied it with Pearson’s and Spearman’s correlation on Euclidean, Canberra, and Manhattan distances for the original and log-transformed metabolome data. For microbiome data, Euclidean distance was used on a normalized matrix for better interpretability11 (Table S1). P values were empirically derived from 1000 permutations. The analysis was conducted using the vegan R package53.
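The permutation logic of the Mantel test can be sketched as below. This is not the vegan implementation: the function names are hypothetical, only Pearson's correlation is shown, and the one-sided p-value with the (hits + 1) / (n_perm + 1) correction is an assumption about the convention used.

```python
import math
import random


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def upper_triangle(dist, order):
    """Flatten the upper triangle of a distance matrix after reordering samples."""
    n = len(order)
    return [dist[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel_test(dist_x, dist_y, n_perm=999, seed=0):
    """Return (observed Mantel r, one-sided permutation p-value)."""
    rng = random.Random(seed)
    identity = list(range(len(dist_x)))
    flat_x = upper_triangle(dist_x, identity)
    r_obs = pearson(flat_x, upper_triangle(dist_y, identity))
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # permute rows and columns of dist_y jointly
        if pearson(flat_x, upper_triangle(dist_y, perm)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)
```

Samples, rather than individual distances, are permuted because the entries of a distance matrix are not independent of one another.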

MMiRKAT extends MiRKAT to assess global associations between a distance-transformed microbiome dataset (kernel function) and a low-dimensional multivariate phenotype23. The model regresses the multivariate outcome on the non-parametric kernel-transformed microbiome with a Gaussian error term. Consistent with the Mantel test, we used Euclidean, Canberra, and Manhattan distances as kernel-based transformations for microbiome data, while the original or log-transformed metabolome matrix served as the outcome (Table S1). Significance was assessed using the exact Davies method. MMiRKAT was implemented via the MiRKAT R package54.

Procrustes Analysis enables high-dimensional visualization by aligning datasets through linear translation, scaling, and rotation to minimize Euclidean distance (Procrustes superimposition). Prior to alignment, both the microbiome and metabolome matrices were independently subjected to principal component analysis using the rda() function from the vegan R package, to obtain low-dimensional representations suitable for Procrustes comparison. Once aligned, inter-omics distances were computed for each sample, with smaller distances indicating stronger agreement between the two omic layers. Statistical significance was assessed via 999 permutations, comparing the observed sum of squared deviations to the resampled distribution. Procrustes Analysis was performed using the vegan R package53.

Method evaluation

For global and individual association methods, we evaluated performance using Type-I error control and power, as these methods follow frequentist frameworks.

Type-I error control measures a method’s ability to maintain the expected false positive rate (e.g., 5% at a 0.05 significance threshold). Values above this rate indicate a liberal method, while lower values suggest conservatism. This was assessed via quantile-quantile plots of -log10 p-values.

Power quantifies a method’s ability to detect true associations while maintaining Type I error control. Ideally, methods should maximize power without inflating false positives.

For this category, the ground truth was defined at the global level, meaning that a structured dependence was introduced between microbiome and metabolome data without specifying individual species-metabolite pairs. A method was considered effective if it maintained proper Type-I error control while maximizing power.
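Both metrics reduce to a rejection rate computed over simulation replicates: on null datasets it estimates the Type-I error rate, and on datasets with simulated associations it estimates power. The sketch below, with hypothetical function names, also returns the coordinates of a quantile-quantile plot of -log10 p-values; it assumes all p-values are strictly positive.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Proportion of replicates with p <= alpha."""
    return sum(p <= alpha for p in p_values) / len(p_values)


def qq_points(p_values):
    """(expected, observed) -log10 p-value pairs, smallest p-value first."""
    n = len(p_values)
    observed = sorted(p_values)
    # Expected quantiles of a uniform distribution under the null: i / (n + 1).
    return [
        (-math.log10((i + 1) / (n + 1)), -math.log10(p))
        for i, p in enumerate(observed)
    ]
```

Observed values falling above the expected ones point to inflation (a liberal method), while values falling below point to conservatism.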

Data summarization

Data summarization methods aim to capture shared variability between microbiome and metabolome data, facilitating dimensionality reduction and the identification of key underlying structures. We evaluated several approaches to extract latent components summarizing these complex relationships.

Selected methods

CCA24 identifies linear combinations of two matrices that maximize correlation and was implemented using the CCA R package55,56.

PLS maximizes covariance instead25, with two variants: canonical PLS and regression PLS, applied via mixOmics.

RDA combines multivariate linear regression with PCA26: regression is first applied to predict responses from predictors, followed by PCA on predicted values. RDA was performed using vegan.

MOFA2 is a Bayesian probabilistic framework that identifies shared sources of variability across omic layers27, applied with default settings in MOFA2. For MOFA2, the number of latent factors was selected by the model, while for other methods, we retained all components up to the smallest feature set in either dataset.

Method evaluation

To ensure fair comparisons between data summarization methods, we considered the redundancy index initially proposed for CCA24,57 and extended it to PLS25. This index measures the proportion of variance in each dataset explained by the latent variables. In practice, we report the percentage of variance explained in each omic, which is a more interpretable metric across methods and aligns with common practice in dimensionality reduction and data integration frameworks.

Individual associations between species and metabolites

This section evaluates methods designed to identify direct, pairwise associations between individual microbial taxa and metabolites. Each taxon-metabolite pair is tested independently to assess its association strength, aiming to recover true links introduced in the simulation while maintaining control over false positives. We additionally include pairwise Pearson and Spearman correlation tests as standard baseline methods for evaluating taxon-metabolite relationships.

Selected methods

We evaluated four methods: a log-linear regression on CLR-transformed microbiome data, a log-contrast model32, MiRKAT23, and HALLA58.

The CLR-based log-linear model (CLR-lm) is defined as:

$$E({Y}_{ik}^{* }| {X}_{ij}^{* },{\beta }_{j})={\beta }_{0}+{X}_{ij}^{* }{\beta }_{j}+{\epsilon }_{i},\,\forall (j,k)$$

where Y* is the log-transformed metabolome matrix and X* the CLR-transformed microbiome data. While CLR accounts for compositionality, it does not preserve sub-compositionality32. The log-contrast model addresses this by enforcing \(\mathop{\sum }_{j = 1}^{p}{\beta }_{j}=0\):

$$E({Y}_{ik}^{* }| {X}_{i.},\beta )={X}_{i.}\beta +{\epsilon }_{i},\quad \mathop{\sum }_{j=1}^{p}{\beta }_{j}=0$$

To test for associations, we applied the global significance F-test via the compositions R package.

MiRKAT uses a semi-parametric kernel machine regression framework to summarize microbiome relationships23. Unlike other methods, it allows multiple distance kernels, addressing uncertainty in kernel selection. We tested Euclidean, Canberra, and Manhattan distances on original and transformed microbiome data, with the original or log-transformed metabolome as the outcome, using the MiRKAT R package.

Finally, we applied HALLA, a hierarchical All-against-All statistical framework58, designed for both homogeneous and heterogeneous datasets while controlling the False Discovery Rate. HALLA was implemented via the halla Python package with default parameters, using CLR-transformed and original microbiome data, along with log-transformed and original metabolome data.

Method evaluation

The performance of individual association methods was assessed using Type-I error rate and power, as described earlier for global association methods. Type-I error rate evaluates a method’s ability to control false positives, while power measures its capacity to detect true associations.

For this category, the ground truth was defined as the set of true species-metabolite associations introduced during data simulation. A method was considered effective if it maximized power while maintaining proper Type-I error control.

To account for multiple testing in individual association methods, we combined p-values using the Aggregated Cauchy-based test (ACAT)59 for CLR-lm, HALLA, and correlation-based approaches. ACAT, provided by the ACAT R package60, aggregates p-values across microorganisms for each metabolite, reducing the multiplicity burden while controlling the Type-I error rate. Unlike Fisher’s method, ACAT remains robust to p-value correlations and is computationally efficient for large datasets.
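The Cauchy combination underlying ACAT can be sketched as follows. This is an illustrative version with a hypothetical function name, not the ACAT R package: it assumes equal weights by default and omits the numerical approximations that a production implementation would typically use for extremely small p-values.

```python
import math


def acat(p_values, weights=None):
    """Combine p-values with the Cauchy combination (ACAT) statistic."""
    d = len(p_values)
    if weights is None:
        weights = [1.0 / d] * d  # assumption: equal weights
    total_w = sum(weights)
    # Each p-value is mapped to a standard Cauchy variate under the null.
    stat = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values))
    # The weighted sum is compared with a Cauchy distribution of scale total_w.
    return 0.5 - math.atan(stat / total_w) / math.pi
```

For example, `acat([0.01, 0.9])` returns roughly 0.022: the combined p-value is dominated by the smallest input, which is why the test keeps power when only a few microorganisms are associated with a metabolite.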

Feature selection: univariate

Univariate feature selection methods aim to identify relevant direct 1:1 associations between taxa and metabolites by evaluating their associations independently, across all possible pairs. In contrast to the previous section, which assesses each pair’s significance, the focus here is on selecting the most informative features (those involved in at least one true association) based on individual association scores. Methods are evaluated using sensitivity, specificity, and sparsity to quantify their ability to prioritize relevant features.

While feature selection is often associated with supervised modeling, we use the term here to denote the identification of taxa or metabolites involved in true inter-omic associations, reflecting its broader usage in multi-omics studies1,19,61,62.

Selected methods

Adapted from ref. 33, we evaluated two models incorporating compositional predictors when modeling metabolites as outcomes.

First, we applied CLR-LASSO, which performs a CLR transformation on microbiome data before fitting a univariate or multivariate LASSO log-linear regression29. In the Results section, we refer to these models as LASSO and MLASSO. Formally, for a metabolite k, the LASSO log-linear model is:

$${\sum }_{i=1}^{n}{\left({Y}_{ik}^{* }-{\sum }_{j = 1}^{p}{X}_{ij}^{* }{\beta }_{j}\right)}^{2}+\lambda {\sum }_{j=1}^{p}| {\beta }_{j}| $$

where Y* is the log-transformed metabolome matrix and X* the CLR-transformed microbiome data. The penalty parameter λ was selected via 10-fold cross-validation over a grid from 0.01 to 1. Models were implemented using the glmnet R package63.
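To make the penalized objective concrete, the sketch below minimizes the residual sum of squares plus λ times the L1 norm by cyclic coordinate descent for a single metabolite. It is an illustration rather than the glmnet implementation: the function names are hypothetical, no intercept or standardization is included, λ is fixed rather than chosen by cross-validation, and glmnet scales its loss differently, so λ values are not directly comparable.

```python
def soft_threshold(value, threshold):
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize sum_i (y_i - sum_j X_ij b_j)^2 + lam * sum_j |b_j| (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # residual = y - X @ beta, with beta = 0 initially
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Inner product of column j with the partial residual (feature j excluded).
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n))
            # The threshold is lam / 2 because the squared loss carries no 1/2 factor.
            new_beta = soft_threshold(rho, lam / 2.0) / col_ss[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta
```

Here `X` would hold the CLR-transformed microbiome matrix and `y` one column of the log-transformed metabolome matrix; coefficients shrunk exactly to zero correspond to unselected taxa.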

We also applied coda-LASSO, a penalized log-contrast model ensuring scale invariance through a zero-sum constraint on regression coefficients. This approach, extending ref. 64, selects the most contributive features by fitting a two-stage model on all possible log-ratios between species, leading to sparse solutions. The coda4microbiome R package65 was used with default parameters.

Method evaluation

The performance of univariate feature selection methods was assessed using sparsity, sensitivity, and specificity.

These metrics are used to evaluate how well the methods recover simulated associations. Specifically, sensitivity and specificity were calculated by comparing the selected features with the ground truth, which is the set of taxa or metabolites directly involved in the simulated associations (i.e., those with nonzero coefficients). While we do not include an explicit phenotype or classification task in our framework, the focus of the evaluation is on the ability of the methods to identify these predefined associations from the microbiome and metabolome data.

Specificity measures the proportion of true negatives among all non-associated features, while sensitivity quantifies the proportion of true positives among all associated features. Based on this, the confusion matrix (Table 1) is defined as follows:

Table 1 Confusion table

Specificity and sensitivity are computed as:

$$Specificity=\frac{TN}{TN+FP},\quad Sensitivity=\frac{TP}{TP+FN}$$

For univariate feature selection methods, the ground truth corresponds to the set of taxa or metabolites directly involved in simulated 1:1 associations, meaning those with nonzero coefficients. Specificity and sensitivity are then evaluated based on the correct or incorrect identification of these associations.

Sparsity is also reported to quantify model parsimony, i.e., the proportion of selected features. While the ground truth is known, sparsity remains relevant to assess whether a method identifies only a minimal and informative set of features rather than an overly large or noisy subset.

Formally, the sparsity is given by:

$$Sparsity={\sum }_{i=1}^{p}\frac{I({\beta }_{i}\ne 0)}{p}$$

where β is either the coefficient for an association from a univariate feature selection method or a loading from a multivariate feature selection method, and p is the total number of possible associations/features.
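The three metrics can be computed from a vector of estimated coefficients (or loadings) and the simulated ground truth, as in the sketch below. The function name and the input format are assumptions for illustration.

```python
def selection_metrics(coefficients, truly_associated):
    """Sensitivity, specificity and sparsity of a set of selected features.

    coefficients: estimated coefficients or loadings, one per feature.
    truly_associated: booleans marking the features in the ground truth.
    """
    selected = [c != 0 for c in coefficients]
    pairs = list(zip(selected, truly_associated))
    tp = sum(1 for s, t in pairs if s and t)
    fp = sum(1 for s, t in pairs if s and not t)
    fn = sum(1 for s, t in pairs if not s and t)
    tn = sum(1 for s, t in pairs if not s and not t)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "sparsity": sum(selected) / len(selected),  # proportion of selected features
    }
```

For instance, with coefficients `[0.5, 0, -1.2, 0, 0]` and the first two features truly associated, the function returns a sensitivity of 0.5, a specificity of 2/3, and a sparsity of 0.4.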

Feature selection: multivariate

Multivariate feature selection methods aim to identify subsets of taxa and metabolites that jointly contribute to the global covariance structure across omic layers. Unlike univariate approaches that test each pair independently, multivariate methods capture many-to-many relationships and select features that participate in shared latent structures, while ensuring model sparsity and interpretability.

Selected methods

sCCA30 and sPLS31 extend CCA and PLS by incorporating penalization to perform both data summarization and feature selection.

For sCCA, an L1 penalty was applied to both datasets, retaining features contributing to the first two components. Optimal penalties were determined via 25 permutation-based samples with a 0.1-step grid search from 0.01 to 1. sCCA was implemented using the PMA R package66.

Similarly, canonical and regression sPLS were tuned using 10-fold cross-validation with a 5-step grid search (5-25 for low-dimensional data, 10-50 for high-dimensional data). A maximum of two components was retained to select the most relevant features. sPLS was applied using the mixOmics R package1.

For both sCCA and sPLS, only features with non-null loadings on the first two components were considered informative for performance evaluation.

Method evaluation

The performance of multivariate feature selection methods was assessed using sparsity, sensitivity, and specificity, as described earlier for univariate feature selection.

For multivariate feature selection methods, the ground truth is defined as the set of features contributing to the latent structures underlying the simulated associations, meaning those with nonzero loadings. Sensitivity and specificity are assessed in relation to these identified features.

Konzo data analysis workflow

Konzo is a paralytic neurological condition affecting children and women in sub-Saharan Africa, linked to chronic consumption of improperly processed cassava containing cyanogenic glucosides. While cyanide exposure is a key factor, not all consumers develop konzo, suggesting additional contributing factors.

Fecal samples from 171 individuals in Masi-Manimba (n = 65) and Kahemba (n = 106), Democratic Republic of the Congo, were analyzed using metagenomics and metabolomics, regardless of disease status. Shotgun metagenomic sequencing (~50 million reads per sample) was performed on DNA extracted from ~250 mg of stool, with Kraken2 and Bracken used for taxonomic classification39. Metabolite profiling was conducted by Metabolon using a validated reference database. The dataset includes 1098 microorganisms and 1340 metabolites.

Microbiome data at the genus level were CLR-transformed, and metabolome data were log-transformed. The workflow included (1) global association, (2) data summarization, (3) univariate and multivariate feature selection, and (4) individual associations. Microorganisms were analyzed as both explanatory variables and outcomes. Given the high dimensionality, we applied the Mantel test for global associations and RDA to identify key contributors. Core microorganisms and metabolites were extracted using regression sPLS, selecting features with nonzero loadings on the first two components. Finally, CLR-lm regression and CODA-LASSO were used to assess microbial contributions to metabolite variations. The workflow is summarized in Fig. S13.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Results

We systematically evaluated the performance of microbiome-metabolome integration strategies by comparing different computational approaches across multiple simulated datasets. The following sections present the results for each category of methods, using evaluation metrics specific to each approach to highlight their strengths and limitations.

Global associations

In our main setting, the Mantel test and Procrustes Analysis controlled the false positive rate comparably across microbiome data normalizations in the Adenomas and Konzo simulation-based scenarios (Fig. 2C). In the Autism simulated datasets, by contrast, the Mantel test with the Euclidean distance provided the best control of the Type-I error rate, whereas Procrustes Analysis and the other distance kernels showed inflation patterns (Fig. 2A). Nonetheless, power varied with the scenario, microbiome normalization, and distance kernel considered (Fig. 2B). In the Konzo simulation-based scenario, the alpha transformation detected fewer than half as many associations as the CLR or ILR normalizations, for both the Mantel test and Procrustes Analysis (Fig. 2B). Additionally, the Canberra distance showed the weakest power across all three settings, whereas no clear distinction between the Euclidean and Manhattan distances could be observed (Fig. 2B). As an example, in the Adenomas simulation-based scenario with the CLR normalization, the Manhattan and Euclidean distances offered up to 2.72-fold higher power than the Canberra kernel (Power Manhattan = 68%; Power Euclidean = 67%; Power Canberra = 25%). In an additional scenario with fewer features than individuals, the Mantel test was, on average, the best method across microbiome data normalizations and distance kernels, showing good control of the Type-I error rate and good power (Fig. 2C, D). Briefly, with the Euclidean distance, the Mantel test was at least 1.5 times more powerful than either MMiRKAT with the same distance kernel or Procrustes Analysis (Fig. 2D). This result was consistent across other microbiome normalizations. Our results were confirmed in additional scenarios and when considering other types of correlations (Figs. S1–4).
Our findings point to different method capabilities to detect global associations depending on the data at hand. Importantly, our results suggest better performances for the Mantel test regarding both Type-I error rate and power when compared to MMiRKAT or the Procrustes analysis. Particularly, applying the Mantel test using the ILR normalization on microbiome data and the log transformation on metabolites was a robust strategy, offering the best results to detect global associations in a variety of scenarios.
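To make the recommended strategy concrete, the following is a minimal, dependency-free sketch of a Mantel test using Spearman's correlation between two Euclidean distance matrices, computed on CLR-transformed microbial abundances and log-transformed metabolites. It is an illustration under stated assumptions rather than the code used in this benchmark: the function names, the pseudocount, the one-sided permutation p value, and the toy data are all hypothetical choices.

```python
import math
import random


def clr(row, pseudocount=0.5):
    """Centered log-ratio transform of one sample (the pseudocount is an assumption)."""
    logs = [math.log(v + pseudocount) for v in row]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def euclidean_matrix(rows):
    """Pairwise Euclidean distances between samples (one row per sample)."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def ranks(values):
    """Average ranks, so that ties are handled as in Spearman's correlation."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; applied to ranks it gives Spearman's correlation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def upper(mat, perm):
    """Upper-triangle entries of a distance matrix after relabelling samples."""
    n = len(perm)
    return [mat[perm[i]][perm[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(d1, d2, n_perm=999, seed=0):
    """Spearman Mantel statistic with a one-sided permutation p value."""
    rng = random.Random(seed)
    identity = list(range(len(d1)))
    r1 = ranks(upper(d1, identity))
    observed = pearson(r1, ranks(upper(d2, identity)))
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # permute sample labels of the second matrix only
        if pearson(r1, ranks(upper(d2, perm))) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): one latent factor drives half of the taxa and all metabolites.
toy = random.Random(1)
latent = [toy.gauss(0, 1) for _ in range(15)]
abundances = [[math.exp(1 + 0.8 * z * (j % 2) + toy.gauss(0, 0.3)) for j in range(6)]
              for z in latent]
metabolites = [[math.exp(0.5 * z + toy.gauss(0, 0.3)) for _ in range(4)] for z in latent]

d_micro = euclidean_matrix([clr(row) for row in abundances])
d_metab = euclidean_matrix([[math.log(v) for v in row] for row in metabolites])
r_stat, p_value = mantel(d_micro, d_metab, n_perm=199)
```

Swapping `clr` for another transformation or `euclidean_matrix` for another distance function is enough to reproduce the kind of normalization-by-kernel comparison described above on one's own data.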

Fig. 2: Performance of the multivariate methods for global associations or data summarization.

A QQ-Plot of the Mantel test and the Procrustes Analysis across microbiome normalizations and distance kernels. For the Mantel test, we considered Spearman’s method for computing the global association between the two datasets. P values for both the Mantel test and Procrustes Analysis were obtained empirically based on 1000 replicates. B Power of the Mantel test and the Procrustes Analysis across microbiome normalizations and distance kernels. For the Mantel test, we considered Spearman’s method for computing the global association between the two datasets. P values for both the Mantel test and Procrustes Analysis were obtained empirically based on 1000 replicates. P values ≤ 0.05 were considered significant. C QQ-Plot of MMiRKAT, the Mantel test, and the Procrustes Analysis across microbiome normalizations and distance kernels. Points below the straight line refer to conservative behavior in the result section. To accommodate MMiRKAT (fewer number of features than sample size), we considered scenarios with a smaller number of features in both omics layers than the number of individuals (See supplementary methods). D Power of MMiRKAT, the Mantel test, and the Procrustes Analysis across microbiome normalizations and distance kernels. To accommodate MMiRKAT (fewer number of features than sample size), we considered scenarios with a smaller number of features in both omics layers than the number of individuals (See supplementary methods). P values for both the Mantel test and Procrustes Analysis were obtained empirically based on 1000 replicates. P values ≤ 0.05 were considered significant. E Proportion of explained variance for the data summarization methods across different data structures and normalizations considering the log metabolome. Data summarization methods were compared considering scenarios with a number of features half the number of individuals (See supplementary methods).
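As a reading aid for the QQ-plots and power panels, the sketch below shows one common way to compute an empirical rejection rate (Type-I error under the null, power under the alternative) and the coordinates of a QQ-plot against the uniform distribution. The function names and the plotting position (i + 0.5)/n are assumptions made for illustration, not details taken from the Methods.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Share of replicates with p <= alpha: Type-I error under H0, power under H1."""
    return sum(p <= alpha for p in p_values) / len(p_values)


def qq_points(p_values):
    """(expected, observed) pairs on the -log10 scale under a uniform null.

    Points below the diagonal (observed < expected) indicate conservative
    behavior; points above it indicate inflation. P values must be > 0.
    """
    n = len(p_values)
    observed = sorted(p_values)
    return [(-math.log10((i + 0.5) / n), -math.log10(p))
            for i, p in enumerate(observed)]
```

For instance, `rejection_rate(null_p_values)` close to 0.05 would indicate a well-calibrated test at the nominal 5% level, where `null_p_values` is a hypothetical list of p values obtained from replicates simulated under the null.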

Data summarization

As described in the Methods, we primarily focused on the first two components for all methods except MOFA2, where the number of components was inferred by the model. Regardless of the microbiome data normalization considered, method performances were consistent across the three simulation setups. For example, RDA exhibited weak variations in the explained variance of the metabolome across data normalizations, with standard errors of 0.003, 0.003, and 0.002 for the Adenomas, Autism, and Konzo simulation-based scenarios, respectively. The impact of the choice of normalization on result variability for other methods is provided in Fig. S5. This result suggests that the normalization used for dealing with the compositionality observed in metagenomics data does not affect the capability of methods to capture data variability. Importantly, we observed that no single method consistently outperformed the others across all data summarization techniques and simulation scenarios, highlighting that performance is context-dependent and influenced by the specific characteristics of each dataset. However, RDA provided consistent performances compared to other methods in most scenarios, showing an average explained variance of 52% depending on the omics layer of interest (Fig. 2E). Moreover, MOFA2 exhibited the most variable results, with explained variances spanning from 0% to 100%. This variability suggests that MOFA2’s performance is highly sensitive to specific data properties, such as signal strength, noise level, and sample size. For regression-based methods, such as PLS-Reg or RDA, the directionality of the integration affected performance, producing up to 6-fold higher explained variances (Fig. 2E). We confirmed our findings when considering the original metabolome (Fig. S6). Our results pointed to RDA as the best trade-off to summarize data variability through latent factors, while PLS-Reg could be exploited when strong assumptions about the effect direction can be made.
As a whole, our findings suggested that RDA and PLS-Reg are versatile and robust under scenarios commonly encountered in practice, and that the choice between them should be guided by the research question (See Discussion).

Individual associations between species and metabolites for compositional prediction

Studying the relationship between metabolites and microorganisms may represent an important challenge when accounting for the compositionality induced by microbiome datasets. Indeed, the perfect correlation brought by the compositionality makes the application of standard methods incorrect. This is particularly true when species are incorporated as covariates in statistical models analyzing microbiome-metabolome associations11,12,13,14,15,32. To address this challenge while focusing on individual (1:1) associations, we evaluated four methods that test each microbe-metabolite pair independently: HALLA58, the Log-contrast model32, MiRKAT23, and a linear regression on CLR-transformed microbial abundances (referred to as CLR-lm). Overall, under the null hypothesis, methods exhibited different behaviors across our three scenarios regarding the control of the Type-I error rate, showing accurate control in the Adenomas simulation-based setting and being either conservative or slightly liberal in the Autism and Konzo simulation-based scenarios (Fig. 3A). We observed a similar pattern when investigating Pearson’s and Spearman’s correlations, with accurate control or modest inflation of false positives depending on the setting considered (Fig. S7). These results highlight that Type-I error control can be sensitive to dataset-specific characteristics, and no method maintained consistent performance across all settings. At the nominal 5% level, methods showed weak to modest powers, ranging from 7% to 35% depending on the method and simulated dataset considered, and these powers were drastically reduced after correcting for multiplicity (Figs. 3B and S8). These results are partially explained by the simulation setting, where a low signal-to-noise ratio was assumed, mimicking complex microbiome-metabolome relationships. In certain replicates, null powers were observed, while in others, a high percentage of associations were significant (Fig. S9).
Then, comparing models to Pearson’s and Spearman’s pairwise correlations (straight lines in Fig. 3B), we found no clear advantage of the Log-contrast, HALLA, MiRKAT, or CLR-lm over standard approaches, as higher powers were mainly due to uncalibrated false positive rates. Interestingly, MiRKAT was robust to microbiome normalization, offering consistent control of the Type-I error rate and power across CLR, ILR, or alpha transformations (Figs. 3A and S10). Collectively, our results align with a poor calibration of the Type-I error rate in our Konzo and Autism-based simulated scenarios for HALLA, the CLR-lm, or log-contrast regression, while MiRKAT exhibited consistent results across the considered scenarios. Across methods, well-calibrated QQ plots and reliable false positive control were only observed in the Adenomas-based simulations, indicating that dataset-specific features had a strong impact on method performance. Our evaluation suggested that MiRKAT offers the best trade-off between false-positive control and power in all our simulation settings. The pros and cons of each strategy are further elaborated in the Discussion section.

Fig. 3: Performance of the individual association methods for compositional predictors.

To accommodate long running times due to the number of pairs between species and metabolites, we considered scenarios with a number of features half the number of individuals (See supplementary methods). A QQ-plots of the individual association methods across our three simulation settings. B Power of the individual association methods across our two main scenarios. P values ≤ 0.05 were considered significant. For the CLR-lm method and HALLA, p-values were combined using ACAT in order to provide similar comparisons with the log-contrast regression and MiRKAT (See Methods). For MiRKAT, we reported Type-I error rate and power using the ILR transformed microbiome data and the log transformed metabolites, while for HALLA, we considered the CLR transformed microbiome and the log metabolome. The straight line represents the background ACAT-combined power using Spearman’s correlation on the CLR microbiome and the log metabolome. Powers were averaged over 1000 replicates.
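The ACAT combination mentioned in this caption can be sketched as follows. This is a minimal illustration of the aggregated Cauchy association test formula with equal weights by default; the function name and the toy p values are hypothetical, and the sketch omits the numerical safeguards for extremely small p values that production implementations usually include.

```python
import math


def acat(p_values, weights=None):
    """Combine p values with the Cauchy combination rule (ACAT).

    Each p value must lie strictly between 0 and 1. Weights default to
    equal weights and are normalized to sum to one.
    """
    if weights is None:
        weights = [1.0] * len(p_values)
    total = sum(weights)
    statistic = sum(w / total * math.tan((0.5 - p) * math.pi)
                    for w, p in zip(weights, p_values))
    return 0.5 - math.atan(statistic) / math.pi


# Hypothetical pairwise p values for one metabolite tested against three species.
combined = acat([0.001, 0.5, 0.9])
```

The combined p value is dominated by the smallest individual p values, which is why a single strong species-metabolite pair can drive the metabolite-level result.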

Univariate feature-selection for compositional predictors

Feature selection methods have gained increasing interest from researchers in order to identify a subset of species associated with a variable of interest2,33,67,68. However, due to the compositional nature of microbiome data, traditional methods have been shown to lead to results that may not accurately reflect the underlying biological structure, though they can still provide valuable insights in other contexts11. Thus, we compared univariate feature selection methods accounting for compositional predictors: CODA-LASSO, CLR-LASSO, and CLR-MLASSO33. First, we evaluated whether methods were able to provide sparse sets of microorganisms across our three scenarios. We found that CODA-LASSO exhibited higher sparsity than CLR-LASSO and CLR-MLASSO, meaning that it retained a smaller subset of the original number of species (Fig. 4A). For example, in our Konzo-based simulation setting, CODA-LASSO was the sparsest method, with on average 0.2% of features selected (sd = 0.001) compared to 2% (sd = 0.003) and 41% (sd = 0.09) for CLR-LASSO and CLR-MLASSO, respectively. This result was confirmed in the two other scenarios with different underlying data structures. Then, we assessed how accurately the methods find true associations, evaluating them based on specificity and sensitivity (see Methods). Overall, methods exhibited different patterns of results, with weak to high sensitivities depending on the method and scenario considered (Fig. 4A). As an example, CODA-LASSO showed sensitivities of 0.3%, 10%, and 30% in the Adenomas-based, Autism-based, and Konzo-based simulated scenarios, respectively, suggesting method performance discrepancies under realistic data structures. We observed comparable results when considering CLR-LASSO or CLR-MLASSO. Additionally, because the great majority of features were non-associated, methods showed highly specific behaviors.
Interestingly, CLR-MLASSO, due to the underlying feature selection process, could be highly sensitive at the price of lower sparsity or specificity compared to the two other methods. Collectively, our results highlight the importance of selecting univariate feature selection methods based on the data structure to ensure accurate and interpretable identification of key microbial signatures associated with metabolites. Specifically, methods incorporating compositional constraints, such as CODA-LASSO, demonstrated improved feature selection in microbiome data.
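The three quantities discussed above can be computed from a set of selected features and a set of truly associated features as in the sketch below. The exact definitions used in this benchmark are given in the Methods; here, as an assumption, sparsity is reported as the fraction of features selected (lower means sparser), and the function and variable names are hypothetical.

```python
def selection_metrics(selected, truth, n_features):
    """Fraction selected, sensitivity, and specificity of a feature selection."""
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)   # true features that were selected
    fp = len(selected - truth)   # null features that were selected
    fn = len(truth - selected)   # true features that were missed
    tn = n_features - tp - fp - fn
    return {
        "selected_fraction": len(selected) / n_features,
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Toy example: 100 species, 10 truly associated, 4 selected (3 of them true).
metrics = selection_metrics(selected={0, 1, 2, 50}, truth=set(range(10)), n_features=100)
```

In this toy example the specificity is 89/90 even though only 3 of the 10 true features are recovered, which illustrates why specificity is high by construction when the great majority of features are non-associated.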

Fig. 4: Performance (Sparsity (Spa.), Sensitivity (Sens.), Specificity (Spe.)) of the feature selection methods for providing a sparse and reliable subset of elements across our two scenarios.

A Performance of univariate feature selection methods considering microorganisms as covariates across our three settings. Metabolites were log-transformed before running the methods. Performances were calculated on 100 replicates. For CODA-LASSO in the Konzo scenario, we adapted the simulation setting, selecting 300 species and 600 metabolites to accommodate running times of the method (See supplementary methods). B Performance of multivariate feature selection methods. Metabolites were log-transformed before running the methods. sPLS-Reg1 and sPLS-Reg2 correspond to the sPLS-Reg with X = microbiome and X = metabolome, respectively.

Multivariate feature-selection

Across our three scenarios, either sPLS-Can or sPLS-Reg provided lower sparsity scores compared to sCCA. While sPLS selected on average 12% of the total number of features, sCCA tended to keep either all or no variables, suggesting poor feature-selection performance, as confirmed by the patterns of specificity and sensitivity (Fig. 4B). The 12% of features selected by sPLS exceeds the upper bound of true associations assumed by the simulation scenario (10%), suggesting that methods tend to provide a higher proportion of false associations than expected. Further investigation revealed that differences in sparsity could be attributed to the distribution of penalty parameters, which varied across simulation settings. In some cases, we observed strong or uniform penalization patterns, leading to overly sparse solutions (Fig. S11). Interestingly, both sPLS-Can and sPLS-Reg exhibited the same behavior, showing high specificity and modest sensitivity, while sCCA showed inconsistent performances across our three simulation settings. For example, in the Konzo-based simulated scenario, the two methods detected on average 12% of the true signals (sensitivity), with much higher variability for sCCA, as demonstrated by the shape of the boxplots in Fig. 4B. Finally, when focusing on sPLS-Reg, we found that the directionality of integration did not produce discrepancies in results, suggesting robustness to the underlying outcome structure. These results were confirmed when considering metabolites on the original scale (Fig. S12). Collectively, our findings indicate that multivariate feature-selection methods have limited ability to distinguish informative from uninformative features when leveraging intra- and inter-correlations between omic layers.

Real-data application

Our systematic evaluation of strategies to jointly analyze microbiome and metabolome data has permitted the identification of the best methods depending on the research question. Thus, we illustrated the best approaches through an application on metabolomics and metagenomics data of the Konzo disease39. We presented the exact workflow in the Konzo data analysis section and Fig. S13. Firstly, we asked whether the pattern of global association between the two omics layers differs between cases and controls and found a stronger relationship in Konzo-affected individuals (Mantel statistic r: 0.4272; Spearman’s permutation p value: 9.999e-05) than in healthy individuals (Mantel statistic r: 0.2838; Spearman’s permutation p value: 0.0026997). Then we applied the RDA and found that the first two components explained roughly 26% of metabolome variability across the two conditions, while these proportions remained stable when considering microbiome (Fig. 5A). Moreover, the top-20 most contributing features in each omic on the first two RDA factors highlighted a large panel of associations between species and metabolites with distinct patterns of correlations occurring in affected and unaffected subjects (Figs. 5B and S14). For example, in unaffected individuals, RDA identified mevalonate as strongly positively associated with species and 3-hydroxyisobutyrate as exhibiting a moderate negative association with microorganisms. These two metabolites have been shown to be linked to inflammatory- or oxidative-based processes potentially involved in Konzo69,70. Consistently, the application of RDA allowed the detection of 15 Prevotella species, which are negatively correlated with metabolites in unaffected samples but positively correlated with metabolites in Konzo-affected samples. Moreover, Bifidobacterium pseudocatenulatum, B. adolescentis, and B. angulatum are positively correlated in unaffected samples, whereas B. pseudocatenulatum, B. catenulatum, and B. longum are negatively correlated in affected patients. A high number of Streptococcus species have a positive correlation uniquely in Konzo-affected samples. This result points to distinct patterns of associations between species and metabolites in healthy and affected subjects. Subsequently, we used the sPLS regression and identified 30 and 45 metabolites and 235 and 130 microorganisms significantly contributing to the first two components in cases and controls, respectively (Fig. 5C). Interestingly, of the 130 species kept by the sPLS regression in healthy individuals, 64% were also found in affected individuals, while 46% of metabolites were preserved between the two conditions. Notably, both mevalonate and 3-hydroxyisobutyrate were found to be contributing metabolites in Konzo-affected subjects. To investigate the implication of interactions between metabolites and species in Konzo, we sequentially applied MiRKAT and CODA-LASSO on the subset of unique metabolites found in affected individuals. Consequently, we found that mevalonate and 3-hydroxyisobutyrate were significantly associated with 14 and 16 species, respectively, both exhibiting a large panel of associations (Fig. 5D, E). We identified species that could play a role in Konzo by affecting the oxidative response of the metabolism. For example, Desulfovibrio desulfuricans was positively associated with mevalonate, suggesting that increases in the microbial abundance are associated with increased metabolite levels. Also, Clostridioides difficile exhibited a consistent effect across mevalonate and 3-hydroxyisobutyrate, suggesting common microbial dynamics between the two metabolites. Clostridioides difficile has already been reported to have an impact on oxidative stress-related pathways potentially involved in Konzo71. These associations were missed when applying the CLR-lm regression.
We validated our findings at a larger scale by a systematic network analysis from the CLR-lm regression and CODA-LASSO (Figs. 5F and S15–17). Our results from metagenomics and metabolomics data from Konzo disease highlight distinct patterns of interactions between microorganisms and metabolites occurring in both affected and unaffected individuals, where different microbial dynamics are involved.

Fig. 5: Application of best strategies highlights complementary biological interactions between microorganisms and metabolites in Konzo data.

A Proportion of cumulative explained variance in Metabolome and Microbiome datasets in both affected and unaffected individuals. B Top-20 of the most contributing species and metabolites on the first RDA component in healthy and affected samples. Positive correlations were identified by a + sign, while negative correlations were identified by a - sign. Projection of metabolites (red) and microorganisms (blue) into the 2D regression sPLS space in C affected and D unaffected individuals. Features with null loadings were removed from the analysis. E Coefficients provided by the CODA-LASSO for mevalonate and 3-hydroxyisobutyrate, which were identified only in Konzo by the regression sPLS. Positive coefficients were identified by a + sign, while negative coefficients were identified by a - sign. F Network between mevalonate and 3-hydroxyisobutyrate and their corresponding associated species found by CODA-LASSO. Positive associations were represented by green edges and negative associations by pink edges.

Discussion

The integration of microbiome and metabolome datasets within statistical frameworks has become a valuable resource for researchers to comprehensively understand the underlying biological mechanisms involved in diseases. Indeed, recent studies in inflammatory bowel disease72 or cardiometabolic traits4 have highlighted that pathoetiology may result in the disruption of the complex architecture occurring between the two omic layers. Understanding these interactions represents, therefore, a critical avenue for unraveling the biology of complex phenotypes. However, there are currently no standards on how to integrate these two omic layers, forcing researchers to spend considerable time on method selection. Thus, deciding which method best fits a specific biological question remains a daunting task, critically limiting the interpretation and replicability of results. In this paper, we extensively benchmarked nineteen existing integrative methods to disentangle microbiome-metabolome interactions, covering most of the researchers' aims: global associations, data summarization, individual associations, and feature selection. Based on a comprehensive and realistic simulation study and a real data application, we highlighted the best methods depending on the research question and data at hand, providing important insights about good statistical practices (Table 2) and avenues for future methodological developments (Table 4).

Table 2 Summary of best methods depending on the research question
Table 3 Overview of the research questions, corresponding methodological approaches, selected models, and performance metrics used in this benchmark

Despite its realism, our simulation framework has some limitations. In particular, it does not explicitly remove correlations driven by case-control group effects, meaning that some simulated associations may reflect group-level signals rather than true biological links. However, this limitation is mitigated in the context of a benchmark, where the primary goal is to compare methods under controlled yet realistic conditions, rather than to reproduce ground-truth biological mechanisms.

Furthermore, a user guide and all the scripts used in this work are available in the GitHub repository associated with the paper to facilitate reproducibility and further advancements in the field.

When evaluating global association methods, our results have pointed to important lessons for practitioners. Indeed, across our three realistic simulation scenarios, the Mantel test is the most powerful method to find associations occurring at a global scale compared to MMiRKAT and the Procrustes Analysis. Also, the method exhibits adequate control of the Type-I error rate, showing robustness across a variety of scenarios, considering several underlying data structures, normalizations, and distance kernels. This is an appealing feature in practice since choosing the right data transformation or distance metric may represent an important challenge for practitioners. However, by exploiting regression-based frameworks, MMiRKAT, unlike the Mantel test, can adjust for confounding factors, making it possible to correct for certain biases induced by individual characteristics, such as age, sex, lifestyle, or even batch effects23. However, the method is unable to deal with scenarios with a larger number of features than individuals, limiting applications in most multi-omics scenarios, as pointed out by ref. 20. We therefore recommend using the Mantel test in most cases, or MMiRKAT after filtering out features with a feature selection approach when confounding is expected. The Procrustes Analysis could be exploited to obtain graphical representations, but shows no advantages over the Mantel test or MMiRKAT in our benchmark. Importantly, when using the Mantel test, our results suggest that the Canberra distance on metabolome data is the poorest choice for detecting global associations across all our scenarios (Fig. 2B). The application of the Mantel test in our study identifies distinct patterns of global associations between microbiome and metabolome occurring in affected and unaffected individuals. Thus, our recommendation here is to apply the Euclidean distance on the transformed microbiome data while applying the Euclidean or Manhattan distances on metabolites in most cases.
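For reference, the three distance kernels compared in this benchmark can be written as in the sketch below, using their standard textbook definitions. The function names and the toy vectors are hypothetical, and skipping coordinates where both values are zero in the Canberra distance is a common convention assumed here rather than a detail taken from the Methods.

```python
import math


def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def canberra(a, b):
    """Sum of absolute differences, each scaled by |x| + |y| (0/0 terms skipped)."""
    return sum(abs(x - y) / (abs(x) + abs(y))
               for x, y in zip(a, b) if abs(x) + abs(y) > 0)


# Toy vectors standing in for two samples described by three features.
sample_a, sample_b = (1.0, 2.0, 0.0), (3.0, 6.0, 0.0)
distances = (euclidean(sample_a, sample_b),
             manhattan(sample_a, sample_b),
             canberra(sample_a, sample_b))
```

Each Canberra term is bounded by one, so every feature contributes on a relative rather than absolute scale; this is a property of the definition and is offered only as context, not as an established explanation of the lower power observed here.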

Dimensionality reduction is often used by practitioners for summarizing information through a small number of components. Having an efficient method that recapitulates variability across two omic layers is critical for facilitating subsequent analyses, such as visualization or clustering. We considered five different methods exhibiting specific features to summarize omic layer information and found that, in addition to being robust to data normalization, RDA is the most reliable method, showing consistency across simulation scenarios and data normalizations (Fig. 2E). Interestingly, our results point to an important impact of directionality when applying regression-based methods, such as PLS-Reg or RDA, with performances drastically increasing depending on whether species or metabolites are considered as the outcome. This result could be explained by the underlying complex structure of microbiome data, strongly impacting the capability of methods to capture data variability. Consistent with this idea, we observed unpredictable performances for MOFA2, with explained variances going from 0% to 100% in all our scenarios (Fig. 2E). Although MOFA2 has been shown to capture complex relationships for gene expression or methylation73, our results align with a distinct pattern of performance for metagenomics and metabolomics data. This behavior is explained by the underlying assumptions of MOFA2, where the use of the method could be restricted to cases where quasi-normality of data is expected, since other types of data require statistical approximations27. We then applied RDA to our Konzo dataset and found distinct species-metabolite associations involved in healthy and affected samples (Fig. 5A, B). In unaffected individuals, the metabolites 3-(4-hydroxyphenyl)lactate and 5-hydroxyindoleacetate are identified in positive correlations and may be linked to the fermentation process of the Bifidobacterium spp., which are positively correlated in unaffected samples and negatively correlated in affected samples.

In practice, another important question for researchers is to determine the relationship between microbial communities and a variable of interest39,74. However, the underlying compositional structure of microbiome data is an important challenge for model performance. In this paper, we have compared four methods accounting for the compositionality of predictors with different strategies: a linear regression applied on the CLR transformed microbiome data, MiRKAT, the log-contrast model, and HALLA. Compared to correlations, these methods have not been shown to be significantly more powerful (Fig. 3B). As already pointed out in the result section, weak powers are mostly explained by the simulation scenario where we assumed a low signal-to-noise ratio. This behavior could therefore be expected in real-data applications. Thus, we recommend applying univariate methods to only a relevant subset of features, avoiding systematic applications across all metabolite-species pairs. Also, certain univariate methods suffer from inflation of the Type-I error rate in our scenarios (Fig. 3A), suggesting that method performances are impacted by the underlying data structures. For example, the log-contrast model exhibits three different patterns of results across our three scenarios, showing well-calibrated, conservative, and liberal behaviors in our Adenomas-, Autism-, and Konzo-based scenarios, respectively (Fig. 3A). Our general recommendation to the reader is to use MiRKAT, since the method has shown accurate control of the Type-I error rate across microbiome normalizations while permitting correction for confounding factors. This is particularly important since the choice of the right normalization could be difficult. However, microbiome transformations could have critical impacts on result transferability and interpretability, where CLR provides still-correlated synthetic features and ILR or alpha reduce the dimensional space33,34.
This result highlights the need for new compositional data transformations that keep the original number of features while making them linearly independent (Table 4).
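The contrast between the two transformations can be illustrated with a short sketch: CLR keeps one coordinate per taxon, but the coordinates sum to zero and are therefore linearly dependent, whereas ILR returns one coordinate fewer. The ILR basis used below (pivot coordinates) is one of many valid choices and is an assumption of this example, as are the function names; inputs must be strictly positive, so zeros would need to be replaced beforehand.

```python
import math


def clr(x):
    """Centered log-ratio: D coordinates that always sum to zero."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def ilr_pivot(x):
    """Isometric log-ratio in pivot coordinates: D - 1 orthonormal coordinates."""
    logs = [math.log(v) for v in x]
    coords = []
    for i in range(len(logs) - 1):
        rest = logs[i + 1:]
        k = len(rest)
        # Contrast of part i against the geometric mean of the remaining parts.
        coords.append(math.sqrt(k / (k + 1)) * (logs[i] - sum(rest) / k))
    return coords


composition = [10.0, 20.0, 30.0, 40.0]  # hypothetical abundances of four taxa
clr_coords = clr(composition)           # 4 values, summing to zero
ilr_coords = ilr_pivot(composition)     # 3 values, same squared norm as CLR
```

Both transformations are unchanged when the whole sample is multiplied by a constant, which is what makes them suitable for compositional data; the difference lies in whether the number of features or their linear independence is preserved.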

Table 4 Overview of avenues for future methodological developments to jointly analyze metagenomics and metabolomics data

Also, one important contribution of this work is to extensively evaluate feature selection methods. This is particularly crucial for researchers to accurately select metabolites and microorganisms involved in a specific biological context. Our results point to moderate performance of multivariate feature selection methods, with inconsistent performances across scenarios and the data transformations considered (Figs. 4B and S12). The best performances are achieved by univariate feature-selection methods for compositional predictors, with CODA-LASSO offering a good trade-off between sparsity and classification performances (Fig. 4A), while being sub-compositionally coherent. However, the performance of all methods varies with the data structure. For example, CODA-LASSO is strongly affected by the proportion of zeros in microbiome data, as suggested by our results in the Adenomas-based scenario. Thus, in practice, we recommend using CODA-LASSO for scenarios with microbial predictors after removing taxa with a high proportion of zeros, to ensure the good performance of the method, whereas sPLS-Reg could be used to select features by exploiting the inter-omics correlation. We therefore applied both regression sPLS and CODA-LASSO on the Konzo dataset. Regression sPLS has permitted the detection of a distinct set of metabolites and microorganisms in affected and unaffected individuals. From the subset of metabolites contributing to Konzo, CODA-LASSO has highlighted different microbial dynamics of effects (Fig. S17). This result is aligned with the model where microorganisms may be connected to a large set of metabolites. This complex microbiome-metabolome crosstalk has been shown to be associated with diseases75. However, the identified associations may include artifact signals, as many feature selection methods benchmarked in this study exhibit limited sparsity and variability in their selected features, affecting result consistency.
This result is aligned with previous reports where authors have shown poor performances of traditional feature selection models76. Indeed, most penalized methods are built upon cross-validation, where small perturbations in data may yield drastic changes in results. Similarly to ref. 76, extending sparse multivariate or univariate methods to the knockoff framework77 or stability selection78 should represent interesting avenues for improving both sparsity and reliability for compositional data79 (Table 4).

Although our work is focused on evaluating methods for inferring associations between metagenomics and metabolomics data, additional work is still required to comprehensively compare methods in the context of prediction. Indeed, predicting metabolome levels from microbiome data is a flourishing research topic with critical implications for clinical applications. Many of the associated challenges, such as integrating and analyzing diverse omics data types, dealing with high-dimensional data, addressing data heterogeneity, and developing robust computational models that capture complex relationships between different molecular layers, could be part of further investigation.

Although our simulation setup can realistically simulate microbiome and metabolome data, our framework is limited to the “Normal-to-Anything” framework. However, as discussed in ref. 41, simulating pure compositional data from a Dirichlet distribution induces only a small correlation between features, which is often unrealistic regarding the biology of the microbial communities and metabolites. We therefore prioritized the correlation, zero-inflation, and overdispersion characteristics over a purely compositional structure, with no major impacts expected on the conclusions. Also, methods selected through the benchmark assumed a directionality of effects from species to metabolites; other types of approaches could be considered depending on the biological question. Indeed, in some contexts, linking metabolome to microbiome using models such as the Dirichlet regression could be achieved36,37. We explored the performance of such approaches and found poor results. While we did not elucidate whether the results were explained by the methods themselves or the simulation setting, we decided to omit these approaches in the current version of the benchmark since they are not widely used in practice, but they could be explored in future works. Our illustration could identify biological species and metabolites involved in Konzo-related processes; our application is, however, limited to intersection-based analyses between affected and unaffected individuals. We are aware that explicit multi-omics models incorporating the disease information, such as DIABLO80 or MDiNE42, are available. However, future evaluations of differential multi-omics integrative strategies are required to mechanistically link microbiome and metabolome to diseases from a dynamic perspective at a large scale (Table 4). We argue this aspect is particularly critical to pinpoint the underlying biological mechanisms, hence facilitating precision medicine applications81,82.

To conclude, this study presents one of the most comprehensive benchmarks of purely statistical frameworks for integrating metagenomics and metabolomics data. We focus exclusively on purely statistical methods that naturally fall into four main categories, while acknowledging that other integrative approaches exist, including those leveraging external knowledge or addressing different research questions17,83,84. Through extensive and realistic simulations, we systematically compared nineteen methods across key research questions, identifying the most effective approaches. Their applicability was demonstrated on the Konzo dataset, highlighting complementary biological processes across different scales. This comprehensive evaluation provides practical guidance tailored to specific scientific questions and data characteristics. By providing a robust and replicable comparative framework, we aim to enhance research standards and foster the development of new integrative models. We anticipate that systematic applications of these methods to unified metagenomics-metabolomics resources85 will be instrumental in identifying common and unique species-metabolite interactions, essential for understanding disease mechanisms.

Key Points

  • We systematically evaluated nineteen statistical methods for jointly analyzing metagenomics and metabolomics data across a wide range of scientific questions, including global associations, data summarization, individual associations, and feature selection.

  • We provided general guidelines for practitioners for properly analyzing microbiome and metabolome data together, facilitating result interpretation and replicability (Box 1).

  • We illustrated the best methods through an application to metagenomics and metabolomics data in the gut for Konzo disease, highlighting complementary biological results.