Introduction

Lipoproteins, fatty acids, amino acids and ketone bodies are circulating markers of metabolic processes essential for human health. Reliable quantification of absolute concentrations of these metabolites can now be achieved through high-throughput nuclear magnetic resonance (NMR) spectroscopy1. Metabolomics data in large population samples such as the UK Biobank (UKB), coupled to national health records, have allowed researchers to identify numerous associations between patterns of metabolite concentrations and a wide range of common medical conditions2. These metabolites hold potential for precision medicine as they have been shown to predict long-term outcomes3, and could aid in combatting key public health issues, including the adverse effects of the worldwide obesity epidemic4.

Charting the pleiotropic genetic architecture of metabolic biomarkers, through the effects of common and rare variants, is key to understanding interindividual differences in metabolic processes. Genome-wide association studies (GWAS) of metabolomics data have confirmed there is a substantial genetic component to these metabolite concentrations and have identified hundreds of genetic variants associated with individual metabolites5,6,7. The sets of metabolites included in metabolomics panels are strongly genetically correlated to each other8; joint analysis through a multivariate approach may improve discovery of variants with widespread effects by leveraging shared genetic signal across the metabolites9. Additionally, characterizing the influence of rare variants on metabolites through whole exome sequencing (WES) data complements previous GWAS efforts, as rare variants are likely to be particularly impactful and point towards promising drug targets10,11.

Obesity and sex are likely important moderators of the relation between an individual’s genetic make-up and metabolic health. As obesity and its downstream medical conditions co-occur with changes in metabolite concentrations12, disentangling the causal role of obesity in determining these levels can aid in devising treatment strategies. Biological sex is a further important determinant of metabolic activity13, yet there is little knowledge about sex-dependent genetic influences. Males and females differ substantially in basal metabolic activity, as well as in their propensity to develop prominent metabolic conditions, such as obesity, coronary artery disease (CAD) and type 2 diabetes (T2D)14. Previous studies have shown that there is a genetic basis for sex differences in metabolism, beyond the impact of gonadal hormones15.

Here, we take advantage of the latest generation of targeted metabolomics technology available in the UKB and Estonian Biobank (EstBB), to perform a large GWAS of circulating metabolic traits, leveraging NMR spectroscopy data from over 300,000 individuals. We expand on previous work on this data by employing a multivariate approach to boost discovery of variants with widespread shared effects across metabolites, and perform quantification of the global genetic architecture. We further incorporate WES data to increase knowledge about the impact of both common and rare variants. Lastly, we identify widespread sex-specific effects and estimate the influence of obesity (indexed by body mass index, BMI) to provide insight into the influence of individual, clinically relevant factors.

Results

We conducted GWAS of 249 circulating metabolites from the Nightingale NMR metabolomic platform, charting their shared and specific genetic architectures. This panel encompasses 228 lipids, lipoproteins and fatty acids, and 21 non-lipids, including amino acids, ketone bodies, fluid balance, glycolysis- and inflammation-related metabolites. See Supplementary Data 1 for an overview of these circulating metabolites, their categories, and sample sizes. For the main analyses we used data from UKB, including 207,836 White British participants, with a mean age of 57.4 years (standard deviation (SD) 8.0 years), 53.7% female. Additionally, there were data on 27,509 non-White British UKB participants, with a mean age of 54.5 years (SD = 8.4 years), 54.3% female. From EstBB, we included 92,661 unrelated White European participants, with a mean age of 50.9 years (SD = 16.2 years), 65.7% female, which we used to test for generalization of the discovered loci across different populations. For each of these subsets, identical analyses were carried out, covarying for age, sex, and the first twenty genetic principal components to control for population stratification16.

Univariate GWAS

We estimated the effective number of independent traits in our analyses to be 96, based on matrix spectral decomposition17 of the phenotypic correlation between all 249 metabolite concentrations. We therefore set the univariate GWAS significance threshold at α = 5 × 10−8/96 = 5.2 × 10−10. The GWAS of all 249 individual metabolites revealed a median of 63 loci discovered per metabolite (range 8 to 98), for a total of 15,585 loci when summing over the individual univariate GWAS, as shown in Fig. 1a. Accounting for locus boundary overlap across the univariate GWAS, there were 465 unique genomic regions involved, suggesting high numbers of shared genetic variants across the metabolites. Of these, 166 regions were novel, in that they did not overlap with the 276 regions identified by the previously largest GWAS of 233 metabolites of the Nightingale metabolomics panel5. The most significant novel loci were rs4760682, mapped to the PFKM gene, and rs7584089, mapped to PDK1. Both showed strongest association with pyruvate levels (B = −0.11, p = 2.6 × 10−187; B = −0.09, p = 8.9 × 10−117). This befits the central role that the mapped genes play in glycolysis and fatty acid metabolism, with their overexpression previously coupled to T2D18, cancer19, and Alzheimer’s disease20. Supplementary Data 1 lists the number of significant loci and lead SNPs for each of the 249 individual metabolites. Supplementary Data 2 and 3 list information on all discovered regions (per metabolite and aggregated), and whether they are novel. All Manhattan plots are provided in Supplementary Data 19.
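The threshold arithmetic in this paragraph can be sketched in a few lines. This is a minimal illustration that assumes the Nyholt-style variance-of-eigenvalues form of matrix spectral decomposition; the exact estimator behind reference 17 is not spelled out in the text, and the function names and toy eigenvalues are hypothetical.

```python
import statistics


def effective_number_of_traits(eigenvalues):
    """Estimate the effective number of independent traits.

    Assumption (not confirmed by the text): the Nyholt-style formula
    M_eff = 1 + (M - 1) * (1 - Var(eigenvalues) / M), with the sample
    variance of the eigenvalues of the trait correlation matrix.
    """
    m = len(eigenvalues)
    var = statistics.variance(eigenvalues)
    return 1 + (m - 1) * (1 - var / m)


def corrected_threshold(m_eff, alpha=5e-8):
    """Divide the genome-wide threshold by the effective number of traits."""
    return alpha / m_eff


# Toy eigenvalues: uncorrelated traits keep all M tests,
# perfectly correlated traits collapse to a single test.
independent = effective_number_of_traits([1.0, 1.0, 1.0, 1.0])
redundant = effective_number_of_traits([4.0, 0.0, 0.0, 0.0])

# Value reported in the text: 96 effective traits.
threshold = corrected_threshold(96)
```

With the 96 effective traits reported above, `corrected_threshold(96)` reproduces the stated threshold of roughly 5.2 × 10−10.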

Fig. 1: Discovered loci for individual metabolites.

a Scatterplot displaying the effect sizes (y-axis) of all 15,585 locus lead variants identified through univariate genome-wide association study (GWAS) of 249 metabolites, ordered by their minor allele frequency (x-axis) and colour coded by metabolite category. The two loci with the strongest effects on individual GWAS are demarcated by vertical dashed lines and annotated. On the right side of the figure, the local genomic regions of these two loci are depicted, with b showing rs7412 mapped to APOE and c showing rs1047891 mapped to CPS1, both of which were also the only fine-mapped variants in these regions. The chromosomal location is on the x-axis and −log10(p-value) on the y-axis, with colours reflecting the linkage disequilibrium between the lead and surrounding variants, as indicated in the legend. The bottom panel shows the genes located in these regions. The vertical dashed lines highlight the position of the lead variants in their respective genes.

We applied a combination of PolyFun and FINEMAP, Bayesian fine-mapping procedures bundled in the SAFFARI pipeline21, to each of the univariate GWAS summary statistics, to obtain a set of high-confidence causal variants and genes. Out of the original 15,585 loci, we retained 2629 variants that had a posterior probability >0.95 of being part of a credible set. We then mapped these variants to 2498 protein-coding genes using OpenTargets22. The two fine-mapped loci with the largest effect sizes (rs7412, APOE, and rs1047891, CPS1) had strongly divergent patterns of associations, being highly pleiotropic (influencing nearly all lipid measures) versus one individual association (with glycine). APOE is well-known for playing a central role in lipid homeostasis, and variation in its gene has been associated with a wide variety of traits23. CPS1 on the other hand is an enzyme involved in a specific pathway degrading choline to urea, with variation in its gene linked to blood pressure and CAD through its strong effects on glycine levels24. Figure 1b and c illustrate the mapping of these genomic regions. Supplementary Data 4 lists the fine-mapping results in more detail, including all mapped genes and their coupling to individual metabolites.

We checked cross-population generalization of the effects of fine-mapped variants in the White EstBB cohort and in the non-White British UKB subset. For the EstBB replication set (n = 92,645 individuals), 99.0% of the 2,019 available variants showed the same direction of effect, and 91.3% of these effects were nominally significant. In the additional UKB subset (n = 27,509 individuals), we found that 95.9% of all 2,207 available fine-mapped variants showed the same direction of effects, and 75.3% were nominally significant. Thus, our results suggest cross-population generalization of the discovered genetic associations. Supplementary Fig. 1 shows the relationship between the number of discovered loci in UKB and replicated loci in EstBB, per metabolite.
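The two replication summaries reported here (direction concordance and nominal significance) can be computed along the following lines. This is a hedged sketch, not the study's code: the function name and toy values are hypothetical, and it assumes that nominal significance (p < 0.05) is counted among the direction-concordant variants only.

```python
def replication_summary(discovery_betas, replication_betas,
                        replication_pvalues, alpha=0.05):
    """Return (share with same sign, share of those with p < alpha)."""
    triples = list(zip(discovery_betas, replication_betas, replication_pvalues))
    # A variant is concordant when both effect estimates have the same sign.
    concordant = [p for d, r, p in triples if d * r > 0]
    concordance_rate = len(concordant) / len(triples)
    # Assumption: nominal significance is evaluated among concordant variants.
    nominal = sum(1 for p in concordant if p < alpha)
    nominal_rate = nominal / len(concordant) if concordant else 0.0
    return concordance_rate, nominal_rate


# Toy data for four variants (illustrative values only).
concordance, nominal = replication_summary(
    discovery_betas=[0.10, -0.20, 0.05, 0.30],
    replication_betas=[0.08, -0.10, -0.01, 0.20],
    replication_pvalues=[0.001, 0.20, 0.60, 0.03],
)
```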

Multivariate GWAS

Genetic variants are likely to have pleiotropic effects across the metabolites, given these metabolites are components of the same biological system, as also indicated by the univariate GWAS findings. We therefore jointly analysed all measures with the Multivariate Omnibus Statistical Test (MOSTest)9, which prioritizes the identification of pleiotropic variants by leveraging shared genetic signal across the univariate measures, yielding a multivariate association with each genetic variant.
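To make the idea of a multivariate association concrete, the sketch below computes a Mahalanobis-type omnibus statistic, z'R⁻¹z, from one variant's vector of univariate z-scores and a trait correlation matrix. It is a simplified illustration only: the published MOSTest procedure additionally estimates the correlation matrix from permuted genotypes, regularizes it, and fits the null distribution, none of which is reproduced here. All names and toy numbers are hypothetical.

```python
import math


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def omnibus_statistic(z_scores, correlation):
    """Compute z' R^-1 z by solving L y = z and summing the squares of y."""
    lower = cholesky(correlation)
    y = []
    for i, z in enumerate(z_scores):
        partial = sum(lower[i][k] * y[k] for k in range(i))
        y.append((z - partial) / lower[i][i])
    return sum(value * value for value in y)


# Two positively correlated traits (toy correlation of 0.5).
corr = [[1.0, 0.5], [0.5, 1.0]]
stat_same_sign = omnibus_statistic([2.0, 2.0], corr)
stat_opposite_sign = omnibus_statistic([2.0, -2.0], corr)
```

Neither toy z-score would be genome-wide significant on its own, yet both patterns contribute to the joint statistic regardless of sign; the opposite-sign pattern scores higher here because it runs against the assumed positive trait correlation.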

For the primary sample, MOSTest revealed 12,216 independent significant SNPs and 2690 lead SNPs across all metabolites, for a total of 534 loci covering 8.3% of the genome, see Fig. 2a. The lead SNPs of 96 of these loci did not show genome-wide significant effects on any of the individual metabolites, i.e. they were detected only through MOSTest due to their distributed signal across the metabolites. Supplementary Fig. 2 summarizes the significance of the locus lead SNPs across all metabolites, illustrating the pervasive pleiotropy of most discovered variants. Indeed, 48 of these SNPs showed a genome-wide significant association with more than 100 metabolites, as summarized in Supplementary Data 5. This table also lists the comparison to the previous GWAS, showing that 274 of these 276 regions overlap with the MOSTest-discovered loci, while MOSTest uncovered another 241 loci not reported in this previous GWAS.

Fig. 2: Discovery of pleiotropic variants and their relationship to disease.

a Manhattan plot of the output of the multivariate genome-wide association study (GWAS) on all 249 metabolites, with the observed −log10(p-value) for each variant shown on the y-axis. The x-axis shows the relative genomic location, grouped by chromosome, and the red dashed line indicates the genome-wide significance threshold of 5 × 10−8. The colour coding represents the number of genome-wide significant associations of each variant with metabolites at the univariate level, ranging from 0 (in black) to 214 (in red), illustrating the extent of pleiotropy. Loci with p < 1 × 10−300 have been annotated with mapped gene names. b Heatmap showing 30 multivariate GWAS-identified loci with the highest number of Bonferroni-corrected significant associations with diseases, based on published GWAS. On the x-axis are the locus lead variants with mapped genes, and on the y-axis the categories of diseases as compiled by OpenTargets. The cell colouring indicates the −log10(p-value) of the variant-disease association, as indicated by the legend on the right.

We performed phenome-wide association studies (pheWAS) of each of the 534 loci identified through MOSTest, querying GWAS Catalog and FinnGen GWAS summary statistics through the ‘otargen’ R package, leveraging the OpenTargets ‘diseases’ categorization in order to determine clinical relevance25. There were a total of 1253 Bonferroni-corrected significant associations, with 341 unique traits across 372 studies. The results, fully listed in Supplementary Data 6, show that many of the discovered variants are associated with cardiovascular diseases, as expected, as well as commonly comorbid conditions. Specific variants also show high pleiotropy across diseases, as illustrated in Fig. 2b. For instance, the pleiotropic variant rs3184504 was mapped to SH2B3 on chromosome 12, which is a key regulator of signalling pathways involved in inflammatory responses26. Among the ten most pleiotropic variants was also a novel locus at chromosome 7 mapped to IRF5, which encodes a transcription factor that induces proinflammatory cytokines, and has been named a potential therapeutic target for a wide range of autoimmune diseases27. These examples reflect the well-known coupling of low-grade inflammation to metabolic dysregulation, contributing to patterns of comorbidity28.

Gene-based analyses

Next, we ran gene-based analyses to identify the most significant genes and their enrichment among specific biological pathways. Aggregating across all common variants within 17,849 protein-coding genes, we found 1921 significant genes based on the 249 individual univariate GWAS summary statistics, and 2590 genes from the multivariate GWAS. Tests of tissue-specificity, covarying for mean expression across all tissues, revealed differential gene expression in the liver for nearly all metabolites (243 out of 249), in line with its central role in metabolism of both lipids and amino acids. Differential expression in other tissues was more specific to a metabolite category, as can be seen for the spleen, summarized in Fig. 3a. Competitive gene-set analysis for each individual metabolite GWAS, testing for 7522 Gene Ontology (GO) biological processes, primarily uncovered associations with lipoprotein particle modification, organization, and homeostasis, see Fig. 3b. Supplementary Data 7 lists all identified genes, and Supplementary Data 8 contains the complete results of the gene-set analyses for each individual metabolite.

Fig. 3: Functional annotation of gene-based tests.

a Stacked bar plot summarizing the output of tests of tissue-specific gene expression, with the top 15 tissues on the x-axis. b Competitive gene-set analysis of Gene Ontology biological processes, with top 15 pathways listed on the x-axis. For both plots, the number of significant associations with metabolites is shown on the y-axis and the colours indicate metabolite categories. c Venn diagram of the number of genes identified through gene-based tests of the multivariate GWAS, univariate GWAS and rare variant WES data.

Rare variant data

We ran SKAT-O29 gene burden tests on the UKB WES data (N = 200,330), restricted to intragenic variants with MAF < 0.005, to characterize the impact of rare exonic variation on metabolites. There were 338 protein-coding genes with a significant burden, see Supplementary Data 9. Figure 4 lists the top genes identified, showcasing the widespread impact of apolipoprotein genes, well-known for their association with obesity and Alzheimer’s disease (AD), on the lipid metabolites. Many of the additional identified highly pleiotropic genes with impactful rare variants also play roles in metabolic conditions and healthy ageing through interrelated pathways30,31,32, directly or indirectly. For instance, ZPR1 codes for a zinc finger protein that regulates axonal growth, which has been coupled to neuronal cell death following a high-fat diet33. As impactful rare variation signals potential for manipulation of these pathways, we queried the drug-gene interaction database (DGIdb v5.0.6)34 and found significant enrichment for targets of nine drugs. These include three lipid-lowering agents, three antibacterial drugs, a drug used to treat leukemia, an anticonvulsant, and a platelet aggregation inhibitor (Supplementary Data 10).

Fig. 4: Genes associated with metabolites based on rare variants.

a Stacked bar plot showing the number of significant associations (x-axis) with genes identified through whole-exome sequencing-based gene burden tests (y-axis), coloured by metabolite category. b Grouped scatterplot, showing the -log10(p-value) of Bonferroni-corrected significant genes on the y-axis, grouped by category on the x-axis, with the dot colours also reflecting category. The top three most significant genes per category are annotated.

See Fig. 3c for overlapping and unique components of the identified sets of genes through the three different gene-level analyses, capturing genes with effects driven by rare, common, and/or pleiotropic variants. The unique and shared sets of genes were coupled to the GWAS Catalog through hypergeometric tests, with results summarized in Supplementary Data 11.
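A hypergeometric enrichment test of this kind asks how likely it is to see at least the observed overlap between an identified gene set and a reference gene list by chance. The sketch below is a generic one-sided version using only the standard library; the function name and the numbers in the second call are illustrative assumptions, apart from the universe of 17,849 protein-coding genes mentioned earlier in the text.

```python
from math import comb


def hypergeom_enrichment_p(overlap, set_size, list_size, universe):
    """One-sided P(X >= overlap) under the hypergeometric distribution.

    universe  : number of genes that could have been identified
    list_size : number of universe genes on the reference list
    set_size  : number of genes in the identified set
    overlap   : number of identified genes that are on the reference list
    """
    upper = min(set_size, list_size)
    total = comb(universe, set_size)
    tail = sum(
        comb(list_size, k) * comb(universe - list_size, set_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Small worked case that can be checked by hand: (36 + 4) / 120 = 1/3.
p_small = hypergeom_enrichment_p(overlap=2, set_size=3, list_size=4, universe=10)

# Hypothetical overlap of 5 genes between a 29-gene set and a 200-gene list.
p_example = hypergeom_enrichment_p(overlap=5, set_size=29, list_size=200,
                                   universe=17849)
```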

Global genetic architecture

We determined the SNP-based heritability, h2, as well as the polygenicity and average magnitude of non-null effects (‘discoverability’)35,36 for each metabolite. Overall, the output showed that the metabolites vary widely in their global genetic architecture; h2 ranged from 0.02 (acetoacetate, standard error (SE) = 0.002) to 0.21 (triglycerides to total lipids in very large HDL, SE = 0.001), all p-values < 1.1 × 10−21. Polygenicity estimates spread across two orders of magnitude, from oligogenic (phenylalanine with an estimated 17 causal variants) to moderately polygenic (creatinine with an estimated 1529 causal variants). Similarly, discoverability ranged from 1.2 × 10−4 (lactate) to 2.4 × 10−3 (omega-3%). Supplementary Fig. 3 depicts the estimated proportion of h2 explained as a function of sample size, for 37 metabolites validated for clinical use1. This showed that over 70% of h2 for omega-3 and omega-6 fatty acid concentrations is explained by the currently discovered variants. All estimates, for each of the 249 metabolites, are listed in Supplementary Data 1.

Analyses of BMI and sex

Given BMI and sex have been associated with substantial interindividual variation in metabolic activity12,13, we sought to determine the phenotypic, causal and genetic relation of these individual determinants with the metabolites. First, we conducted linear regression analyses, regressing each metabolite onto BMI, sex, BMI*sex, and age. These models produced highly significant associations with BMI, sex, and their interaction across nearly all metabolites, as summarized in Supplementary Data 12. This underscores the importance of sex and BMI as individual-level influences when investigating the biological underpinnings of metabolic processes. Notably, there was a very high correlation between the coefficients of BMI and sex (r = 0.87, p = 4.1 × 10−79), indicating that these factors share mechanisms that in turn impact metabolites.

As BMI is a modifiable factor, we next sought to estimate the causal nature of the identified relationships between BMI and metabolites. We ran bidirectional two-sample Mendelian randomization (MR), combining inverse variance weighted (IVW) MR with the weighted median and MR Egger approach37. There were no instances where the metabolites had a significant causal effect on BMI consistently across the different MR methods. BMI had a multiple comparisons-corrected significant (p < 0.05/96) causal effect on 79 metabolites for the IVW and weighted median approach. However, when further thresholded by the MR Egger approach, sensitive to horizontal pleiotropy, the causal effect of BMI on only six metabolites remained: albumin, phenylalanine, average diameter for LDL particles, cholesterol % in small LDL, tyrosine, and valine, see Fig. 5. The full results, in both directions, are provided in Supplementary Data 13–15.
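The contrast between the IVW and MR Egger estimators can be illustrated with summary statistics for a handful of instruments. The sketch below uses textbook forms of both estimators (a fixed-effect IVW slope through the origin, and a weighted regression with an intercept for Egger); it is not the study's pipeline, the weighted median is left out for brevity, and all names and toy values are hypothetical.

```python
def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW slope through the origin and its standard error."""
    weights = [1.0 / s ** 2 for s in se_outcome]
    numerator = sum(w * x * y for w, x, y in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * x * x for w, x in zip(weights, beta_exposure))
    return numerator / denominator, (1.0 / denominator) ** 0.5


def egger_estimate(beta_exposure, beta_outcome, se_outcome):
    """Weighted regression with an intercept; returns (slope, intercept).

    A non-zero intercept points to directional horizontal pleiotropy.
    """
    # Orient every instrument so that its exposure effect is positive.
    rows = [(abs(x), y if x >= 0 else -y, 1.0 / s ** 2)
            for x, y, s in zip(beta_exposure, beta_outcome, se_outcome)]
    w_sum = sum(w for _, _, w in rows)
    x_bar = sum(w * x for x, _, w in rows) / w_sum
    y_bar = sum(w * y for _, y, w in rows) / w_sum
    slope = (sum(w * (x - x_bar) * (y - y_bar) for x, y, w in rows)
             / sum(w * (x - x_bar) ** 2 for x, _, w in rows))
    return slope, y_bar - slope * x_bar


# Toy instruments: true causal slope 0.5 plus a constant pleiotropic effect of 0.02.
bx = [0.1, 0.2, 0.3, 0.4]
by = [0.5 * x + 0.02 for x in bx]
se = [0.01, 0.01, 0.01, 0.01]

ivw_slope, ivw_se = ivw_estimate(bx, by, se)
egger_slope, egger_intercept = egger_estimate(bx, by, se)
```

In this toy case the IVW slope is pulled upward by the constant pleiotropic term, while the Egger regression absorbs it into the intercept, which is the behaviour that motivates using MR Egger as an additional filter.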

Fig. 5: Causal influences of body mass index (BMI) on the metabolites.

Plot listing coefficients from two-sample Mendelian randomization (MR) analyses on the x-axis, and the 6 different metabolites that showed a significant influence of BMI on the y-axis. The dots and lines represent the point estimates with their standard error around the mean, colour-coded by the MR method used.

Sex-specific genetic influences

Our identification of significant interactions between BMI and sex on metabolite concentrations underlines the need for sex-specific research into metabolic health. We therefore first ran univariate GWAS within both sexes separately, to compare the overall genetic architecture between men (n = 96,281) and women (n = 111,560). Through paired t-tests applied to sex-specific LDSC heritability estimates, we found that the mean h2 was significantly higher for women than for men (h2 = 0.148 vs. 0.132, t = 12.8, p < 1 × 10−16). Men’s h2 was still higher than that of the overall GWAS (h2 = 0.132 vs. 0.128, t = 7.5, p = 9 × 10−13), suggesting heritability estimates may be lowered by combining two subsamples (men and women) with differing genetic influences. We further calculated genetic correlations between the two sets of sex-specific GWAS and found that these ranged between 0.85 and 1. While these correlations were high, the majority differed significantly from 1, as reported in Supplementary Data 16.
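The paired comparison of heritability estimates treats each metabolite as one pair (its estimate in women and its estimate in men). A minimal sketch of the paired t-statistic follows, with hypothetical names and toy values; it returns the statistic and degrees of freedom and leaves the p-value lookup to a statistics package.

```python
import math
import statistics


def paired_t_statistic(first, second):
    """Paired t-statistic and degrees of freedom for two matched lists."""
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    mean_diff = statistics.fmean(differences)
    # Standard error of the mean difference, using the sample standard deviation.
    se_diff = statistics.stdev(differences) / math.sqrt(n)
    return mean_diff / se_diff, n - 1


# Toy per-metabolite heritability estimates (illustrative values only).
h2_women = [0.15, 0.20, 0.10, 0.12]
h2_men = [0.13, 0.17, 0.09, 0.12]

t_stat, dof = paired_t_statistic(h2_women, h2_men)
```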

Given the identification of sex-specific genetic components through LDSC, we ran multivariate GWAS with an interaction term between sex and each genetic variant, to discover individual variants with sex-specific effects. We found 31 loci with a genome-wide significant interaction effect. Of these, 8 loci had no whole-genome significant interaction effect on any individual metabolite. Next, we mapped the loci to 29 genes through OpenTargets, see Fig. 6a. Functional annotation of the 29 mapped genes revealed tissue-specific upregulation in kidney, liver, and heart tissues based on GTEx v8 data, and enrichment for GO pathways involved primarily in cholesterol regulation. Coupling these 29 genes to the GWAS Catalog showed enrichment among gene lists reported for metabolic syndrome, CAD, T2D, and steatotic liver disease, which are well-known for having sex differences in prevalence and etiology14.

Fig. 6: Genome-wide interactions between sex and genetic variants.

a Manhattan plot of the multivariate genome-wide association study with an interaction term with sex on all 249 metabolites, with the observed −log10(p-value) of each interaction shown on the y-axis. The x-axis shows the relative genomic location, grouped by chromosome, and the red dashed line indicates the genome-wide significance threshold of 5 × 10−8. The y-axis is clipped at −log10(p) = 150. Loci have been annotated with mapped gene names. b Illustration of an identified significant cross-over interaction between sex and rs1065853 on chromosome 19, showing opposite effects on phospholipid concentrations (y-axis) in men and women (x-axis). c An interaction effect of rs964184 on chromosome 11, illustrating effects on very low density lipoprotein (VLDL) cholesterol concentrations only in women. In both plots, the line colours indicate genotypes, and error bars represent standard error of the mean. Male n = 96,281, female n = 111,560.

Follow-up in the univariate summary statistics showed that the interaction effects were often present for numerous metabolites, with one interaction effect (rs1065853, APOE) being genome-wide significant across 110 metabolites. Figure 6b provides an example of univariate cross-over interaction between sex and the rs1065853 genetic variant on lipid levels, with highly significant effects in females (B = 0.145, p < 1 × 10−16), but not males (B = −0.002, p = 0.84). Figure 6c shows another significant sex*gene variant interaction effect of rs964184 (ZPR1), which also influences cholesterol levels only in females (B = 0.110, p = 1 × 10−16) and not males (B = 0.008, p = 0.21). In total, there were 496 univariate genome-wide significant interactions. In EstBB, the concordance rate was 99.3%, and 113 out of 158 (71.5%) of the available lead variants were nominally significant. In the non-White UKB subset, the concordance rate was 90.7%, with 268 out of the 496 lead variants being nominally significant (54.0%). The lists of all multivariate and univariate loci with significant interactions are provided in Supplementary Data 17 and 18.

Discussion

Here, we reported results from a large-scale GWAS of circulating metabolite concentrations. This led to the identification of the largest number of discovered genetic determinants across these metabolites to date, mapped to genes with roles in lipid homeostasis. Using a multivariate approach, our findings emphasized the pervasive pleiotropy across metabolic measures, underscoring and expanding the findings from other GWAS using this data7. We further went beyond previous studies by integrating WES data in our analyses to uncover a sizeable role for rare variants. We identified the causal effect of BMI on specific amino acids, indicating obesity as a primary target for improving metabolic health. Last, we discovered sex-specific genetic effects on metabolite concentrations, which may explain the substantial sex differences in metabolic health.

Locus discovery was high, in line with the estimated genetic architecture. The complementary univariate and multivariate GWAS approaches employed in this study particularly emphasized the pervasive pleiotropy across the set of included metabolites, in accordance with previous findings8. Joint analyses of these interrelated measures are essential to boost discovery of variants with small, yet distributed effects. The clinical relevance of this discovery is underscored by the results of the pheWAS analyses, showing the association of many of these pleiotropic variants with medical conditions across domains. This likely contributes to the extensive comorbidity across complex medical conditions with a cardiometabolic component38,39, which is an important determinant of clinical outcomes39,40.

The gene-based analyses illustrated the relative contributions of common and rare variation, with extensive pleiotropy, to determining metabolite levels. The WES gene burden tests, aggregating across rare variants, identified 335 genes with widespread associations across both lipid and non-lipid metabolite categories. Among the most pleiotropic were apolipoprotein genes, well-known for their involvement in diabetes and CAD as well as in brain disorders41. Particularly notable in this context is the identification of BACE1 on chromosome 11 among the most influenced genes, the protein product of which is central to the generation of amyloid-β peptides in neurons and a key enzyme in the pathophysiology of AD42. Overall, this rare variant data confirms the presence of impactful rare variants with high potential for druggability, as confirmed by the coupling to DGIdb. The generated data on the specificity of these genetic effects on metabolites is important information for research into comorbidities and for predicting utility as a biomarker and drug target.

The findings of the gene-by-sex interaction analyses underscore the substantial differences between males and females in metabolism13. This is likely to be a strong explanatory factor of sex differences in the prevalence of a wide array of cardiometabolic conditions14, advocating for the investigation of sex-specific mechanisms. The notoriously low power of interaction effects43 is counteracted by our multivariate approach. MOSTest is insensitive to differences in the directions of these interactions across the univariate measures, which would hamper other approaches to aggregation across measures. The identification of the widespread sex-dependent effects of rs1065853 showcases the potential of these interaction terms to identify variants that explain interindividual variation beyond their main effects. This SNP, located in a known enhancer of APOE, is well known for its association with numerous metabolic and clinical outcomes, including AD and CAD44,45. The identification of such non-linear effects represents a new frontier in genomics, which needs to be explored in order to further resolve interindividual heterogeneity. Our findings particularly suggest the value of additional sex-specific research into obesity and metabolic health.

The MR analyses provided evidence for the causal effect of BMI, as a proxy of obesity, on circulating metabolic biomarkers, emphasizing the importance of obesity as a primary target for treatment of cardiometabolic conditions. In accordance with previous findings in smaller samples, we show that BMI has a significant causal effect on levels of several metabolites46, primarily amino acids, while there was no evidence of effects of any metabolites on BMI. Obesity therefore appears to drive changes in these amino acids, which may then cause complications47. Branched chain amino acids, including valine, have been robustly associated with an increased risk of T2D48, which may be driven by higher BMI and insulin resistance49. The importance of targeting BMI is further underscored by our finding that higher BMI lowers albumin levels, which is a key marker of liver function and general nutritional status, as well as a predictor of a wide range of cardiovascular outcomes50. Notably, the use of MR methods that safeguard against horizontal pleiotropy substantially reduced the number of causal relationships identified with lipid-related measures. This suggests a sizeable role for pleiotropic effects complicating the relationship between genetically mediated obesity and these measures, in line with our GWAS findings. It speaks, for instance, to the complex role of the GLP-1 secretory system, currently hailed as a highly promising therapeutic target for treatment of obesity, with divergent findings across both human and animal studies51. A better understanding of the role of genetic susceptibility and sources of interindividual variation is needed to optimize individual outcomes.

Strengths of this study are the large sample size and the use of high quality, accurately measured metabolomics data. We further complemented the analysis of common variants influencing individual metabolites with a multivariate approach for greater discovery of pleiotropic variants and inclusion of WES data to uncover the role of rare variants. While this allowed for greater insight into the overall genetic architecture of metabolism, focused follow-up studies are needed to generate a deeper understanding of the specific determinants of subsets of metabolites. Genetic discovery was based on a single large cohort, with a relatively homogeneous population of White British individuals. We included two replication samples with varying genetic ancestry, enabling estimation of generalization of the findings. However, given known ethnic differences in the association between obesity and metabolic conditions such as T2D52, the role of ethnicity should be investigated in further detail. It should also be noted that the data collection was not done under fasting conditions, which has been shown to obscure associations between genetic variation and metabolites5. Ideally, future studies include gene-by-time interaction analyses to further increase our understanding of the genetic regulation of metabolite concentrations.

To conclude, metabolic health is central to the most prevalent and impactful medical conditions in our society, indicating a strong need for new therapeutic targets. Knowledge about causal individual-level determinants is central to developing effective strategies that optimally treat the individual. Here, we showed that accurate NMR-derived circulating metabolite concentrations share genetic influences that can be leveraged to boost discovery of pleiotropic variants of high relevance for cardiometabolic diseases. The summary statistics made freely available can be used by follow-up studies to further enhance our understanding of metabolism and related diseases, identify potential drug targets for these diseases, and contribute to the development of more effective interventions by identifying individual-level determinants.

Methods

The conducted research complies with all relevant ethical regulations. It has been approved by the UK’s National Health Service National Research Ethics Service (ref. 11/NW/0382) and the Estonian Council on Bioethics and Human Research (24 March 2020, nr 1.1-12/624). The study design and conduct complied with all relevant regulations regarding the use of human study participants and was conducted in accordance with the criteria set by the Declaration of Helsinki.

Participants

For the UKB, we obtained data under accession number 27412. The composition, set-up, and data gathering protocols of the UKB have been extensively described elsewhere53. The UKB has obtained informed consent from its participants. For the primary analyses, we selected unrelated White Europeans (KING cut-off 0.05)54 who had Nightingale metabolomics data, as well as genetic and complete covariate data, available (N = 207,836, mean age 57.4 years (SD = 8.0), 53.7% female). BMI was taken from UKB field 21001, with a mean of 27.4 (SD = 4.8). For the generalization analyses, we made use of data from non-White European UKB participants (N = 27,509, mean age 54.5 years (SD = 8.4), 54.3% female). Ethnicity was based on self-report confirmed by genetics (UKB field 22006).

EstBB is a volunteer-based biobank composed of ~213,000 individuals with data available on genotype, phenotype and electronic health records (EHR)55. All EstBB participants have signed an informed consent form. All analyses were conducted using data from EstBB release S60. Specifically, individuals were selected under conditions identical to those used for the UKB data for filtering and quality control, resulting in 92,661 unrelated White European participants, with a mean age of 50.9 years (SD = 16.2 years), 65.7% female. BMI values (mean 26.1, SD = 5.3) were either calculated at the time of recruitment and blood donation or taken from EHR within a year of enrollment.

Data collection and pre-processing

We included all 249 metabolites from the Nightingale NMR metabolomics panel, encompassing 228 lipids, lipoproteins or fatty acids and 21 non-lipid traits, namely amino acids, ketone bodies, fluid balance, glycolysis-, and inflammation-related metabolites, as QC’ed and released by UKB2. We applied additional pre-processing through the ‘ukbnmr’ R package, to remove sources of technical noise56.

We applied rank-based inverse normal transformation57 to each measure, leading to normally distributed measures as input for the GWAS.
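As an illustration of this step, the sketch below implements one common form of the rank-based inverse normal transformation. The Blom offset (3/8), the average-rank handling of ties, and the example concentrations are assumptions made for illustration; the text does not state which variant was used.

```python
from statistics import NormalDist


def rank_inverse_normal(values, offset=3.0 / 8.0):
    """Rank-based inverse normal transform (Blom offset by default; an assumption)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        # Find the run of tied values and give each of them the average rank.
        end = pos
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2.0 + 1.0  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    std_normal = NormalDist()
    # Map each rank to a quantile of the standard normal distribution.
    return [std_normal.inv_cdf((r - offset) / (n - 2.0 * offset + 1.0)) for r in ranks]


# Hypothetical, right-skewed metabolite concentrations.
concentrations = [0.8, 1.1, 1.3, 2.0, 9.5]
transformed = rank_inverse_normal(concentrations)
```

Because only ranks enter the transformation, the outlying value 9.5 is mapped to the same quantile as any other largest observation, which is why the transformed measures are approximately normal regardless of the original skew.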

Univariate GWAS and univariate interaction GWAS

We made use of the UKB v3 imputed data, which has undergone extensive quality control procedures as described by the UKB genetics team58. After converting the BGEN format to PLINK binary format59, we set a minor allele frequency threshold of 0.005, leaving 11,144,506 SNPs.

We carried out univariate GWAS on each of the 249 metabolites through PLINK2, which were then combined into a multivariate GWAS through the freely available MOSTest software (https://github.com/precimed/mostest). Details about the procedure and its extensive validation have been described previously9. GWAS on each of the normalized measures were carried out using the standard additive model of linear association between genotype vector, \({g}_{j}\), and phenotype vector, \(y\). In all analyses we covaried for mean-centered age and twenty genetic principal components. We additionally covaried for biological sex, except in the sex-specific analyses.
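For reference, the additive model described above can be written out as follows. The symbols for the intercept \(\alpha\), the covariate matrix \(C\), its coefficients \(\gamma\) and the residual \(\varepsilon\) are notation introduced here for illustration and do not come from the PLINK2 or MOSTest documentation.

```latex
y=\alpha +{g}_{j}{\beta }_{j}+C\gamma +\varepsilon ,\qquad \varepsilon \sim N\left(0,{\sigma }^{2}I\right)
```

In this notation, \(C\) would hold mean-centered age, the genetic principal components and, outside the sex-specific analyses, biological sex; the per-variant test statistic is the estimate of \({\beta }_{j}\) divided by its standard error.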

The association of the genotype*sex interaction with each of the 249 metabolites was tested with PLINK2, including genotype, sex, mean-centered age and 20 genetic principal components as covariates. The resulting univariate interaction GWAS were then combined into a multivariate MOSTest analysis. The null distribution for this MOSTest analysis was calibrated by permuting genotypes and sex independently.

Clumping

For both univariate and multivariate GWAS, independent significant variants and genomic loci were identified in accordance with the Psychiatric Genomics Consortium locus definition60. First, we selected the subset of variants that passed the genome-wide significance threshold, and used PLINK to perform a clumping procedure at LD r2 = 0.6 to identify the list of independent significant variants. Second, we queried the reference panel for all candidate variants in LD r2 of 0.1 or higher with any independent significant variant. Third, for each independent significant variant, its corresponding genomic locus was defined as a contiguous region of that variant’s chromosome containing all candidate variants in LD r2 of 0.1 or higher with it. Adjacent genomic loci were merged if separated by less than 250 kb. A subset of independent significant variants with LD r2 < 0.1 was selected as lead variants (with potentially more than one lead variant per locus). Finally, for each locus, the most significant among all lead variants was defined as the locus lead variant. Allele LD correlations were computed from a random subset of 10% of the study population to lower the computational burden. The number of unique significant loci across all univariate GWAS was determined through the min-P approach61.
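The locus-merging rule in this procedure can be sketched as a simple interval merge. The function name, the tuple layout and the example coordinates below are hypothetical; the sketch covers only the step that merges loci separated by less than 250 kb, not the LD-based clumping itself.

```python
def merge_loci(loci, max_gap_bp=250_000):
    """Merge loci on the same chromosome that are separated by less than max_gap_bp."""
    merged = []
    for chrom, start, end in sorted(loci):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] < max_gap_bp:
            # Extend the previous locus instead of opening a new one.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(locus) for locus in merged]


# Hypothetical loci as (chromosome, start_bp, end_bp).
example_loci = [
    (1, 1_000_000, 1_200_000),
    (1, 1_400_000, 1_500_000),  # 200 kb after the first locus: merged
    (1, 2_000_000, 2_100_000),  # 500 kb after the merged locus: kept separate
    (2, 1_100_000, 1_300_000),  # different chromosome: kept separate
]
merged_loci = merge_loci(example_loci)
```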

Gene mapping

We used the Variant-to-Gene (V2G) pipeline from Open Targets Genetics, to map lead variants to genes based on the strongest evidence from quantitative trait loci (QTL) experiments, chromatin interaction experiments, in silico functional prediction, and proximity of each variant to the canonical transcription start site of genes22.

PheWAS

We used the ‘otargen’ R package to conduct the pheWAS analyses on each of the 534 MOSTest-identified locus lead SNPs. We restricted the analyses to the FinnGen and GWAS Catalog study sources, and selected only traits that had the term ‘disease’ in the trait category. The results were thresholded to associations of each locus lead SNP at p < 0.05 divided by the number of unique traits included (n = 7684).

Fine-mapping procedure

We used the SAFFARI pipeline to perform statistical and functional fine mapping21. This consisted of applying PolyFun+FINEMAP to each of the GWAS to identify sets of functionally informed, highly credible causal variants; variants that were part of a credible set with a posterior probability >0.95 were prioritized for follow-up. By default, SAFFARI excludes the major histocompatibility complex (MHC) region on chromosome 6 (28–34 Mb).

WES gene burden tests

We used Regenie (v3.1.1) to perform omnibus SKAT-O tests, which combine variance component tests and burden tests, for each of the 249 metabolites, with age, sex and 20 genetic principal components as covariates. We merged the genotype data of chromosomes 1 to 22 into a single PLINK file, lifted the genomic build from GRCh37 to GRCh38, and filtered with PLINK (--maf 0.01 --mac 20 --geno 0.1 --hwe 1e-15 --mind 0.1) to select 591,260 SNPs for step 1. Step 2 variants were rare (MAF < 0.005) with the following annotation masks: LoF, missense (0/5), missense (5/5), missense (>=1/5), and synonymous. We used relevant annotation files described elsewhere: https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=916. We included the same set of protein-coding genes and multiple comparisons correction as used for the MAGMA gene-based analyses. The analyses were conducted on the Research Analysis Platform (https://ukbiobank.dnanexus.com).

Gene-set analyses

We carried out gene-based analyses using MAGMA v1.08 with default settings, which entails the application of a SNP-wise mean model62. We used a randomly selected set of 10,000 White British UKB participants as reference panel. Gene-set analyses were done in a similar manner, restricting the sets under investigation to those that are part of the Gene Ontology biological processes subset (n = 7522), as listed in the Molecular Signatures Database (MSigDB; c5.bp.v7.1).

For tissue-specificity analyses, we applied MAGMA gene-property analyses to test relationships between tissue-specific gene expression profiles and the identified gene associations. This encompassed running one-sided tests for each of 30 general tissue types, testing whether the association between each tissue’s known gene expression levels and the gene-based Z-scores is greater than 0, corrected for the average expression across all tissue types and a set of technical confounders. We used preprocessed and normalized GTEx v8 tissue expression values63 as provided through FUMA’s downloads (https://fuma.ctglab.nl/).

Multiple comparisons correction for these analyses consisted of a Bonferroni correction for the number of protein-coding genes, with \(\alpha =0.05/17{,}849=2.8\times {10}^{-6}\).
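The Bonferroni arithmetic used here, and for the pheWAS threshold described earlier in the Methods, can be reproduced in a few lines. The helper name is hypothetical; the numbers of tests are those reported in the text.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Gene-based analyses: 17,849 protein-coding genes.
gene_threshold = bonferroni_threshold(0.05, 17_849)
# PheWAS: 7684 unique traits.
phewas_threshold = bonferroni_threshold(0.05, 7_684)

print(f"{gene_threshold:.1e}")    # 2.8e-06
print(f"{phewas_threshold:.1e}")  # 6.5e-06
```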

Drug enrichment analysis

The Drug Gene Interaction Database (DGIdb; https://www.dgidb.org/), v.5.0.6 (04/04/2024)34, was used to identify drug-gene interactions among the genes identified from the WES gene burden tests. The DGIdb provides information on drug-gene interactions from 28 diverse sources that are aggregated and normalized. The database collects drug-gene interactions based on information about therapeutic targets and their corresponding drugs, knowledge from clinical trials, as well as potentially clinically actionable drug-gene associations based on metadata such as molecule structure and molecular weight34. Gene-set enrichment analysis (GSEA) was performed to test if the genes identified from the WES gene burden tests were significantly (FDR < 0.05) enriched for targets of specific drugs.
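To illustrate how an FDR < 0.05 criterion is typically applied to a set of enrichment p-values, the sketch below uses the Benjamini-Hochberg procedure. The choice of Benjamini-Hochberg is an assumption, since the text does not name the FDR method, and the drug names and p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical enrichment p-values for four drugs.
drug_p = {"drug_A": 0.001, "drug_B": 0.012, "drug_C": 0.030, "drug_D": 0.400}
q_values = dict(zip(drug_p, benjamini_hochberg(list(drug_p.values()))))
significant = [drug for drug, q in q_values.items() if q < 0.05]
```

With these hypothetical inputs, the first three drugs pass the threshold and the fourth does not.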

LDSC

We applied univariate64 and cross-trait65 LDSC to estimate narrow-sense heritability and genetic correlations, respectively. For this, we formatted the GWAS summary statistics using our standardized pipeline, including ‘munging’ and removal of all variants in the extended MHC region (chr6:26–34 Mb), in accordance with recommendations (https://github.com/precimed/python_convert/blob/master/sumstats.py).

MiXeR analysis

We applied a causal mixture model35,36 to each of the main univariate GWAS summary statistics, with the extended MHC region excluded, to estimate the percentage of variance explained by genome-wide significant SNPs as a function of sample size. For each SNP, \(i\), MiXeR models its additive genetic effect of allele substitution, \({\beta }_{i}\), as a point-normal mixture, \({\beta }_{i}=\left(1-{\pi }_{1}\right)N\left(0,0\right)+{\pi }_{1}N\left(0,{\sigma }_{\beta }^{2}\right)\), where \({\pi }_{1}\) represents the proportion of non-null SNPs (‘polygenicity’) and \({\sigma }_{\beta }^{2}\) represents the variance of effect sizes of non-null SNPs (‘discoverability’). Then, for each SNP, \(j\), MiXeR incorporates LD information and allele frequencies for 9,997,231 SNPs extracted from the EUR population of the 1000 Genomes Phase 3 data to estimate the expected probability distribution of the signed test statistic, \({z}_{j}={\delta }_{j}+{\epsilon }_{j}=\sqrt{N}{\sum }_{i}\sqrt{{H}_{i}}{r}_{{ij}}{\beta }_{i}+{\epsilon }_{j}\), where \(N\) is the sample size, \({H}_{i}\) indicates the heterozygosity of the i-th SNP, \({r}_{{ij}}\) indicates the allelic correlation between the i-th and j-th SNPs, and \({\epsilon }_{j}\sim N\left(0,{\sigma }_{0}^{2}\right)\) is the residual term. The three parameters, \({\pi }_{1},{\sigma }_{\beta }^{2},{\sigma }_{0}^{2}\), are fitted by direct maximization of the likelihood function. Finally, given the estimated parameters of the model, the power curve \(S\left(N\right)\) is calculated from the posterior distribution \(p\left({\delta }_{j}|{z}_{j},N\right)\).
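The point-normal mixture for the effect sizes can be made concrete with a small simulation. This is only a sketch of the prior on \({\beta }_{i}\), not of the MiXeR fitting procedure; the function name and all parameter values are hypothetical and chosen purely for illustration.

```python
import random


def simulate_point_normal(n_snps, pi1, sigma2_beta, seed=0):
    """Draw SNP effects: zero with probability 1 - pi1, else N(0, sigma2_beta)."""
    rng = random.Random(seed)
    sd = sigma2_beta ** 0.5
    return [rng.gauss(0.0, sd) if rng.random() < pi1 else 0.0 for _ in range(n_snps)]


# Hypothetical parameter values, chosen only for illustration.
betas = simulate_point_normal(n_snps=100_000, pi1=0.01, sigma2_beta=1e-4)
n_causal = sum(1 for b in betas if b != 0.0)
observed_pi1 = n_causal / len(betas)  # close to pi1 for large n_snps
```

In such a draw, most SNPs have exactly zero effect, and the fraction with a non-zero effect approximates the polygenicity parameter \({\pi }_{1}\).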

For quality control of the MiXeR results, we used the Akaike Information Criterion (AIC), comparing the Gaussian mixture model fit to that of the infinitesimal model. In this study, the AIC values of all 249 metabolites were positive, i.e. the Gaussian mixture had better model fit, warranting the inclusion of the results.
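The AIC comparison can be sketched as below, using the standard definition AIC = 2k - 2 ln L. The parameter counts (three for the mixture model, two for an infinitesimal model with the polygenicity fixed) and the log-likelihood values are assumptions for illustration; the sign convention follows the text, with positive values favoring the mixture model.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def aic_difference(loglik_mixture, loglik_infinitesimal,
                   k_mixture=3, k_infinitesimal=2):
    """AIC(infinitesimal) - AIC(mixture); positive values favor the mixture model."""
    return aic(loglik_infinitesimal, k_infinitesimal) - aic(loglik_mixture, k_mixture)


# Hypothetical log-likelihoods for one metabolite.
delta_aic = aic_difference(loglik_mixture=-1000.0, loglik_infinitesimal=-1010.0)
mixture_preferred = delta_aic > 0
```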

Mendelian randomization

We ran bidirectional MR with the TwoSampleMR R package to investigate the causal relationships between BMI and the 249 metabolites. For this, we combined BMI GWAS summary statistics from the GIANT consortium that included no UKB participants (N = 339,224)66, to prevent sample overlap, with the metabolomics GWAS summary statistics generated in this study. We selected only genome-wide significant variants for the analysis, clumped using PLINK with clump_p = 1, clump_r2 = 0.001, clump_kb = 10000 against the 1000 Genomes Phase 3 503 EUR samples, keeping other settings at their defaults. We calculated MR regression coefficients using the inverse variance weighted method and the weighted median method. To ensure robust findings, we retained only relationships that were significant after multiple comparisons correction (p < .05/96) with both of these methods. As an additional check, we ran MR-Egger and selected those relationships with nominal significance on this test.
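For readers unfamiliar with the inverse variance weighted method, the sketch below shows its fixed-effect form on per-variant summary statistics. It is not the TwoSampleMR implementation, which offers additional options; the function name and the example effect sizes are hypothetical.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse variance weighted MR estimate and its standard error."""
    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx ** 2 / se ** 2
                      for bx, se in zip(beta_exposure, se_outcome))
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Hypothetical instruments: each outcome effect is half the exposure effect.
bx = [0.10, 0.20, 0.15]   # variant effects on the exposure (e.g. BMI)
by = [0.05, 0.10, 0.075]  # variant effects on the outcome (e.g. a metabolite)
se = [0.01, 0.02, 0.01]   # standard errors of the outcome effects
estimate, std_err = ivw_estimate(bx, by, se)
```

Because every hypothetical outcome effect is exactly half the exposure effect, the estimate recovers a causal effect of 0.5; with real data the per-variant ratios differ and the weights determine how they are averaged.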

Statistical analyses

All pre-processing steps and analyses performed outside the above-mentioned tools and software, e.g. formatting the data and creating the graphs, were carried out in R, v4.2.

Sensitivity analyses

We ran two sets of variations on the primary GWAS, to investigate the role of medication and of the preprocessing pipeline. First, we re-ran the primary GWAS controlling for insulin, blood pressure, and cholesterol-lowering medication. Second, we re-ran without the ‘ukbnmr’ pre-processing pipeline, directly on the originally released metabolomics data. For both variations, the produced summary statistics were highly comparable with the primary GWAS, with median genetic correlations of 0.992 and 0.998 across the metabolites, respectively.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.