Main

Diabetes affects 589 million adults globally, and the number is estimated to increase to more than 853 million by 2050 (ref. 1). Type 2 diabetes (T2D) accounts for more than 90% of all diabetes cases, and its pathogenesis involves both polygenic susceptibility and environmental risk factors (for example, diet and lifestyle)2. Manifested by insulin resistance, β-cell dysfunction and consequent hyperglycemia, the progression of T2D is characterized by comprehensive yet integrative metabolic changes orchestrated across several organ systems3. Identifying the molecular profile characterizing the dysregulated metabolism contributing to T2D, as well as the genetic and environmental determinants of such a metabolic profile, is crucial for understanding T2D etiology, and may inform the design of more effective preventive strategies targeting specific metabolic pathways.

The circulating metabolome is the quantitative collection of small molecules in the blood and provides a comprehensive functional readout of the metabolic homeostasis in each person. In past decades, prospective studies examining circulating metabolites from preselected pathways or, more recently, a broader spectrum of the metabolome, have identified more than 100 metabolites associated with T2D risk4,5,6. These studies collectively highlight important roles of several metabolites in T2D pathophysiology, such as branched-chain amino acids (BCAAs), tryptophan and lysine, specific phospholipids (PLs) and ceramides4,5,6. Recent research further indicated that the blood metabolome can be influenced by genetics7,8, health conditions, diet and lifestyle6,9,10, among other factors10,11. As such, there is a strong need to integrate multimodal data to better understand how various risk factors are related to disease-associated metabolites to advance precision prevention. However, systematic evaluations of the circulating metabolome associated with T2D risk are lacking, and the genetic and nongenetic contributors to the T2D metabolome have not been investigated in a comprehensive manner.

To fill these knowledge gaps, we examined 469 circulating metabolites in a pooled study of 23,634 initially T2D-free and racially/ethnically diverse people from ten prospective cohorts, to identify metabolites associated with incident T2D over up to 26 years of follow-up. We further conducted integrative analyses combining genomic data and diet/lifestyle factors, to systematically elucidate genetic determinants, functional enrichments and potential tissue origin for T2D-associated metabolites; and to illustrate the relationships among diet/lifestyle factors, circulating metabolites and incident T2D. Finally, we derived and validated a multi-metabolite signature that reflected the complex metabolic states predictive of future T2D risk, with the potential to facilitate risk stratification and precision prevention (Fig. 1).

Fig. 1: Study overview.

a, To identify blood metabolites associated with incident T2D, we analyzed 469 harmonized metabolites in up to 23,634 participants from ten prospective cohort studies. At baseline, participants were free of T2D and other chronic diseases; and blood metabolome was profiled using the metabolomic platforms at Broad Institute or Metabolon Inc. A metabolome-wide association study (MWAS) for incident T2D was conducted in each cohort; and results from the ten cohorts were combined using meta-analysis, identifying 235 metabolites associated with T2D risk. b, We curated meta-analyzed genome-wide association studies (GWASs) for each metabolite using data of up to 18,590 people from eight cohorts, followed by functional analyses, colocalization analyses and Mendelian randomization analyses. c, We conducted MWASs for major modifiable risk factors in up to 16,883 participants from five cohorts, identifying metabolites that potentially mediated the associations between risk factors and T2D risk. d, We used machine learning analyses to develop a metabolomic signature reflecting the complex metabolic states predictive of long-term T2D risk, which may facilitate the identification of high-risk individuals and precision prevention.


Results

Metabolome-wide association analysis of incident T2D

Our primary analysis included 23,634 participants from ten prospective cohorts free of T2D at study baseline. During up to 26 years of follow-up, 4,000 incident T2D cases were identified (Fig. 1 and Extended Data Table 1). Metabolomic profiling was conducted at either the Broad Institute or Metabolon Inc., and 469 metabolites were harmonized across cohorts for analyses (Supplementary Fig. 1). We conducted metabolome-wide association analysis in each cohort stratified by major racial/ethnic groups (Supplementary Table 1; Methods). In meta-analysis of all subsets, we identified 235 metabolites associated with incident T2D (false discovery rate (FDR) < 0.05), after adjusting for demographic, socioeconomic and clinical factors, including body mass index (BMI) and waist–hip ratio (WHR) (Fig. 2, Extended Data Fig. 1a and Supplementary Table 2). These include 168 previously reported associations and 67 additional significant associations identified in this study (Supplementary Table 3). Aside from glucose, multivariable-adjusted risk ratio (RR) for incident T2D per s.d. increment in circulating levels of metabolites ranged from 0.67 (95% confidence interval (CI), 0.61–0.72) for C22:4 cholesterol ester (CE), to 1.71 (95% CI, 1.60–1.83) for C32:0 diacylglycerol (DAG).
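The pooling and multiple-testing steps described above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes inverse-variance fixed-effect pooling of cohort-level ln(RR) estimates and the Benjamini-Hochberg procedure for the FDR, neither of which is specified in this section. The function names are hypothetical and the numbers in the usage note are made up.

```python
import math


def pool_fixed_effect(estimates):
    """Inverse-variance fixed-effect pooling of (log_rr, se) pairs.

    Assumption: a fixed-effect model; the section only says that cohort
    results were "combined using meta-analysis".
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)


def rr_with_ci(log_rr, se, z=1.96):
    """Convert a pooled ln(RR) and its s.e. to an RR with a 95% CI."""
    return (
        math.exp(log_rr),
        math.exp(log_rr - z * se),
        math.exp(log_rr + z * se),
    )


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For one metabolite, `pool_fixed_effect([(0.2, 0.1), (0.4, 0.1)])` pools two hypothetical cohort estimates of ln(RR) per s.d.; applying `benjamini_hochberg` to the pooled P values of all 469 metabolites and keeping those below 0.05 would mirror the FDR criterion used here.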

Fig. 2: Associations between 235 metabolites and incident T2D in meta-analysis of ten prospective cohorts.

Circular plots illustrate metabolites associated with incident T2D at FDR < 0.05, by biochemical category. a, Results for complex lipids including monoacylglycerols (MAG) and DAG, TAG, LP, PC, PE, other PLs, PL plasmalogens and sphingolipids (SG). b, Results for other metabolites, including amino acids, carbohydrates, bioenergetic metabolites, nucleotides (NTs), xenobiotics (XBs), as well as other lipid metabolites including carnitines, BAs, CEs and nonesterified fatty acids. Each bar represents results for one metabolite; red and blue indicate positive and inverse associations, respectively; color depth indicates association magnitude, that is, ln(RR) per s.d. increment in the metabolite, capped at −0.3 to 0.3; and bar height indicates association significance, capped at 10⁻²⁰ in a and 10⁻¹⁵ in b. Analyses were conducted in each cohort by racial/ethnic groups adjusting for age, sex, smoking, alcohol consumption, fasting status, hypertension, dyslipidemia, lipid-lowering medication use, anti-hypertensive medication use, BMI, WHR, family history of T2D and cohort-specific variables, and results were combined using meta-analysis.


The large number of metabolites associated with T2D risk is expected, given our large sample size and the correlations among metabolites, particularly among lipids (Supplementary Fig. 2). Results were consistent in multiple sensitivity analyses, including a basic model adjusting for only demographic and socioeconomic factors, and models further adjusting for diet quality and physical activity (PA), blood pressure, blood lipids or estimated glomerular filtration rate (eGFR) (Supplementary Figs. 3 and 4 and Supplementary Table 2). In stratified meta-analysis by major racial/ethnic groups, associations between most metabolites and T2D risk were comparable between non-Hispanic white individuals and those of other racial and ethnic groups, although a few discrepancies deserve further investigation (for example, acisoga was associated with T2D risk only in Black participants, and C36:1 phosphatidylcholine (PC) only in Hispanic/Latino adults) (Extended Data Fig. 1b–d and Supplementary Table 4). Further comparison between cohorts applying Broad Institute versus Metabolon platforms also yielded consistent findings for overlapping metabolites (Extended Data Fig. 2 and Supplementary Table 5).

The 235 metabolites associated with incident T2D

We examined 205 complex lipids, identifying 130 significantly associated with incident T2D (including 18 associations not previously reported) and providing a comprehensive characterization of the lipidomic association patterns (Fig. 2a and Supplementary Tables 2 and 3). Positive associations with T2D risk were observed for triacylglycerols (TAG; the main form of energy storage in adipose tissue12), DAG (which activate signaling cascades triggering hepatic insulin resistance13), ceramides (involved in β-cell apoptosis and impaired insulin signaling14), as well as PC, phosphatidylethanolamine (PE) and phosphatidylinositol (related to insulin resistance13,15,16). Inverse associations with T2D risk were found for CEs, lysophospholipids (LPs), unsaturated PL plasmalogens (which function as endogenous antioxidants17), some sphingomyelins (which may counteract effects of ceramides14) and alpha-glycerophosphate (involved in glycolysis18). In addition, we noted that more double bonds were correlated with lower T2D risk among DAG, TAG and sphingomyelins, but with higher T2D risk among plasmalogens (Extended Data Fig. 3 and Supplementary Table 6), possibly due to functions of the constituent fatty acids and/or the insulin-mediated regulation of fatty acid desaturases19.

Among other lipid signaling pathways, we identified 34 metabolites associated with T2D risk, including 26 significant associations not reported previously (Fig. 2b and Supplementary Tables 2 and 3). Notably, detrimental associations were identified for four primary and three secondary bile acids (BAs), including taurocholate, glycochenodeoxycholate, taurochenodeoxycholate, deoxycholate and taurodeoxycholate, that were not linked previously to T2D risk, adding further evidence to the role of BAs in T2D pathogenesis20,21. Significant associations with T2D were also noted for eight carnitine metabolites, including six significant associations not reported before (that is, C3, C5:1, C14, C16, C5–DC, butyrobetaine), supporting the role of carnitines in energy metabolism22. Of the 19 free fatty acids associated with T2D risk, 18 showed positive associations, whereas some of their CE counterparts were inversely associated with T2D risk.

Across other pathways, we identified 43, five, seven and eight metabolites involved in amino acid, carbohydrate, energy and nucleotide metabolism, respectively, and eight other metabolites associated with T2D risk, comprising 23 significant associations not reported previously (Fig. 2b and Supplementary Tables 2 and 3). Key amino acid pathways underlying T2D risk indicated by these associations encompassed alanine, aspartate and glutamate, glycine, serine and threonine, urea cycle, arginine and proline, histidine, BCAA, tryptophan, lysine, phenylalanine and creatine metabolism. Notably, although coffee consumption has been consistently related to lower T2D risk23, metabolites derived from phytochemicals in coffee showed both positive (for example, caffeine, theophylline and 1,3-dimethylurate) and inverse (for example, trigonelline and hippuric acid) associations with T2D risk.

Shared genetic architecture between metabolites and T2D

To offer new biological insights, we examined the shared genetic architecture between T2D-associated metabolites and T2D risk. Through meta-analyses in up to 18,590 people from eight cohorts, we curated genome-wide association study (GWAS) summary statistics for 458 harmonized (including 233 T2D-associated) metabolites (Fig. 1 and Supplementary Table 7; Methods). We identified one or more genetic loci for 165 T2D-associated metabolites at P < 1.09 × 10⁻¹⁰ (5 × 10⁻⁸ correcting for 458 metabolites), with 45% of the identified metabolite–locus pairs not reported by previous major metabolite quantitative trait locus (mQTL) studies8,24,25,26 (Supplementary Table 8; Methods).
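The significance threshold quoted above follows from dividing the conventional genome-wide threshold by the number of metabolites tested. The short Python check below reproduces that arithmetic; the function name is hypothetical and the snippet is purely illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Genome-wide threshold (5e-8) corrected for the 458 metabolites analyzed.
gwas_threshold = bonferroni_threshold(5e-8, 458)
print(f"{gwas_threshold:.2e}")
```

Rounded to three significant figures, the result matches the P < 1.09 × 10⁻¹⁰ cutoff stated in the text.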

Most mQTLs were annotated to genes involved in cellular metabolism, synthesis, and/or transportation of the corresponding metabolites. Approximately 23% of the identified loci were associated with multiple T2D-associated metabolites, especially those from the same or closely related pathways (for example, GCKR, FADS1-3 and ZNF259) (Extended Data Fig. 4). A brief comparison between racial/ethnic groups suggested that some mQTLs may be specific to certain groups (for example, OPLAH for pyroglutamate in Hispanic/Latino adults) (Supplementary Figs. 5 and 6), warranting confirmation by larger trans-ancestry studies. Several mQTLs for T2D-associated metabolites overlapped with known T2D risk loci27, with significant genetic colocalizations observed at several loci (posterior probability for H4 (PPH4) > 0.8). For example, 53 lipids, ten amino acids and two carbohydrates colocalized with T2D at the GCKR locus (probably driven by rs1260326). Similar colocalizations with T2D were observed at LDL and APOE for several complex lipids and at L17REL for leucine and valine (Extended Data Fig. 4 and Supplementary Tables 8 and 9).

The number of independent variants and the proportion of variance explained by genetics per metabolite were generally similar between T2D-associated and non-associated metabolites (mean r² = 4.4% versus 5.1%; P for difference = 0.47) (Fig. 3a, Supplementary Figs. 7 and 8 and Supplementary Table 10). The top enriched canonical pathways for mQTLs of T2D-associated metabolites, however, were notably different from those of non-associated metabolites. Genes annotated to mQTLs of T2D-associated metabolites were enriched in pathways closely relevant to T2D pathogenesis, such as lipoprotein metabolism28,29, adiponectin functions30, BAs in glucose/lipid regulation31, insulin response and Rac1 activation32, as well as glycine and l-serine33, l-lysine34, l-cysteine35 and l-phenylalanine metabolism, whereas most of these pathways were not enriched for mQTLs of non-associated metabolites (Fig. 3b, Extended Data Fig. 5 and Supplementary Table 11).

Fig. 3: Genetic determinants of T2D-associated metabolites.

We examined genetic architectures of T2D-associated metabolites based on genome-wide meta-analysis summary statistics. a, R² explained by genetics comparing the T2D-associated metabolites versus other metabolites by biochemical category (Wilcoxon test, statistical significance defined as two-sided P < 0.00625, correcting for eight categories). b, Top enriched canonical pathways for genes mapped to mQTLs of T2D-associated metabolites, most of which were not enriched for genes mapped to mQTLs of non-associated metabolites. AS, atherosclerosis; CAR, constitutive androstane receptor; FXR, farnesoid X receptor; MetS, metabolic syndrome; PXR, pregnane X receptor; VLDL, very low-density lipoprotein. c, Percentages of metabolites showing nominally significant (P < 0.05) genetic correlations (rg) with traits reflecting T2D pathophysiology, comparing T2D-associated versus non-associated metabolites (two-sided chi-squared test). Barplot shows results for all metabolites (**FDR < 0.05, correcting for 22 traits); and heatmap shows percentage among T2D-associated metabolites by biochemical category (**FDR < 0.05 correcting for 121 comparisons; *P < 0.05). ALT, alanine aminotransferase; AST, aspartate aminotransferase; GGT, gamma-glutamyltransferase; HDLC, HDL cholesterol; LDLC, LDL cholesterol; TC, total cholesterol; TG, triglycerides. d, Proportions of metabolites colocalized (PPH4 > 0.8) with tissue-specific gene expression across 47 human tissues. We tested whether the proportions were higher among T2D-associated metabolites (colors: organ systems) versus non-associated metabolites (gray) using univariate logistic regression (**one-sided FDR < 0.05 correcting for 47 tissue types; *P < 0.05). e, For tissue types showing enriched genetic colocalizations with T2D-associated metabolites (seven tissue types with FDR < 0.05, plus the main metabolic organ liver with P < 0.05), we detailed the enrichment by biochemical category (color depth: proportions among T2D-associated versus non-associated metabolites; *one-sided P < 0.05).


We then examined genetic correlations (rg) between metabolites and 21 traits indicative of various T2D pathophysiologic mechanisms, leveraging summary statistics from large-scale GWAS for these traits (Methods). T2D-associated versus non-associated metabolites demonstrated substantially more significant rg with fasting insulin (17-fold), BMI-adjusted insulin secretion and sensitivity indices (10- to 32-fold), liver enzymes (10- to 11-fold), intrahepatic and visceral fat (9- and 6-fold, respectively), obesity and blood lipids. Such an enrichment seemed to be driven by complex lipids, lipid signaling metabolites and amino acids (Fig. 3c). Significant rg demonstrated a potential link between metabolites and physiological functions; for example, BCAAs were genetically correlated with traits reflecting insulin resistance, ectopic fat and impaired liver function (Supplementary Table 12).

Genetic colocalization between circulating metabolites and tissue-specific transcriptome

We hypothesized that levels of circulating metabolites may partially reflect biological homeostasis and gene regulation of related metabolic pathways across different organ systems. We therefore conducted a colocalization analysis between circulating mQTLs and tissue-specific cis-gene expression quantitative trait loci (eQTLs) of 47 human tissue types (leveraging Genotype-Tissue Expression v.8 data36) (Methods). Genetic colocalizations were observed across all tissue types (PPH4 > 0.8), supporting our hypothesis. T2D-associated metabolites, compared to non-associated metabolites, had a significantly higher (FDR < 0.05) percentage of colocalizations in seven digestive and metabolic/endocrine tissues, including thyroid (62%), esophagus mucosa (45%), esophagus–gastroesophageal junction (58%), visceral fat (55%), whole blood (55%), pancreas (54%) and salivary gland (21%), and a nominally higher (P < 0.05) percentage of colocalizations in another 13 tissue types including liver (Fig. 3d and Supplementary Table 13). Such an enrichment of colocalizations seemed to be driven primarily by T2D-associated amino acids, fatty acids and complex lipids (Fig. 3e). Further, each T2D-associated metabolite seemed to be colocalized with gene expression within several, instead of one specific, metabolic/endocrine tissue types (Extended Data Fig. 6), consistent with the cumulative evidence that T2D development involves integrative biological changes across liver, fat, pancreas and digestive organ systems3.

We observed several instances where tissue-specific gene expression, circulating metabolites and T2D colocalized at the same potential causal variants, highlighting potential genes and tissue types underlying the observed metabolite–T2D associations. For example, of the 65 metabolites colocalized with T2D at the GCKR/PPM1G/IFT172 locus, 61 also colocalized with PPM1G expression in pancreas, IFT172 in thyroid and/or NRBP1 in esophagus–gastroesophageal junction (likely causal variant rs1260326). Similarly, 34:4 PC colocalized with T2D, as well as FADS1 expression in liver, visceral fat and esophagus–gastroesophageal junction, and TMEM258 expression in thyroid, by rs174545 (Supplementary Fig. 9a,b).

Bidirectional Mendelian randomization analysis

We conducted two-sample Mendelian randomization (MR) analyses to infer the potential causal relationships between 233 T2D-associated metabolites and T2D risk, leveraging a published consortium GWAS for T2D27 (Methods). Genetically predicted circulating levels of 42 lipids and five amino acids were associated with T2D risk (FDR < 0.05), supported by several MR methods (Supplementary Table 14a). Sensitivity analyses selecting genetic instruments using a more stringent P threshold did not change the results, but removing variants on the three most recurrent loci (that is, GCKR, ZNF259 and FADS1-3) attenuated results especially for lipids, which was expected given the roles of these genes in lipogenesis and lipid metabolism (Supplementary Fig. 10). Of note, genetically predicted T2D was not associated with any metabolite except for glucose—a known diagnostic criterion, rather than an etiological biomarker of T2D (Supplementary Table 14b), supporting that our prospective analysis findings are less likely to be due to reverse causation.
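To make the two-sample MR logic concrete, the sketch below computes per-variant Wald ratios and an inverse-variance weighted (IVW) causal estimate from summary statistics. IVW is used here as an assumption for illustration: this section states only that "several MR methods" were applied, without naming them. The function names and the input layout are hypothetical.

```python
import math


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Per-variant causal estimate: outcome effect divided by exposure effect.

    The s.e. uses the first-order approximation that ignores uncertainty
    in the variant-exposure association.
    """
    ratio = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)
    return ratio, se


def ivw_estimate(instruments):
    """Fixed-effect IVW estimate across genetic instruments.

    instruments: iterable of (beta_exposure, beta_outcome, se_outcome),
    where the exposure is a metabolite and the outcome is T2D.
    """
    rows = list(instruments)
    numerator = sum(bx * by / s ** 2 for bx, by, s in rows)
    denominator = sum(bx ** 2 / s ** 2 for bx, _, s in rows)
    return numerator / denominator, math.sqrt(1.0 / denominator)
```

Swapping the roles of exposure and outcome (T2D variants as instruments, metabolite level as outcome) gives the reverse direction of the bidirectional analysis described above.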

Modifiable risk factors and T2D-associated metabolites

Lifestyle and dietary factors play a pivotal role in metabolism and T2D development37,38,39. We next examined relationships between modifiable risk factors (that is, BMI, smoking, PA and intakes of 15 main food groups, mutually adjusted for one another) with circulating metabolites in up to 16,883 participants (Fig. 1; Methods). BMI accounted for more between-person variation in T2D-associated versus non-associated metabolites (r² = 1.52% versus 0.55%, P for difference = 1 × 10⁻¹³), which seemed to be driven by glycerolipids (GLs), PLs and several amino acids (Fig. 4), consistent with their strong genetic correlation with BMI (Fig. 3c). Behavioral factors (especially PA, and red meat, vegetable and coffee/tea consumption) in total explained more variation in T2D-associated versus non-associated metabolites (r² = 7.73% versus 6.57%, P for difference = 0.029), especially for GLs, fatty acids, amino acids and bioenergetic metabolites (Fig. 4 and Supplementary Table 15).

Fig. 4: Variance of metabolites explained by modifiable risk factors.

a, Boxplots comparing variance explained by age, sex and modifiable risk factors (including smoking, PA and intakes of 15 main food groups), for T2D-associated metabolites versus non-associated metabolites. b, Boxplots showing several specific biochemical categories of metabolites that drove the differential R². Each box shows the IQR, line in box indicates median and whiskers extend from the box to the smallest and largest value within 1.5 × IQR from the lower and upper quartiles. Wilcoxon test was used to compare R² of the T2D-associated versus that of other metabolites; **two-sided P < 0.0025 (Bonferroni correction for 20 examined factors); *two-sided P < 0.05. For each metabolite, we first fitted a linear regression to regress inverse normal transformed metabolite on age, sex, BMI (standardized), PA (MET-hours per week; standardized), all 15 main food groups (red meat, processed meat, poultry, fish and seafood, egg, total dairy, total vegetables, total fruits, potato, nuts and legume, whole grain, refined grain, sugary drinks, coffee and tea and alcohol; servings per day), fasting status and other cohort-specific variables simultaneously. We then calculated R² of the metabolites explained by each of the risk factors based on association coefficients and the variance of metabolite and risk factors. The analyses were conducted in NHS, NHS2, HPFS, SOL and WHI separately (n = 16,883) by main racial/ethnic groups and R² were averaged for the comparison.
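One common way to obtain a per-factor variance explained from a regression coefficient and the two variances, as the legend describes, is β² × Var(risk factor) / Var(metabolite). The Python sketch below implements that formula and the averaging across cohorts. The exact formula used by the authors is not given in this section, so this should be read as an assumption; the function names are hypothetical.

```python
from statistics import mean, pvariance


def variance_explained(beta, factor_values, metabolite_values):
    """Share of metabolite variance attributed to one risk factor.

    Assumed formula: beta^2 * Var(x) / Var(y), where beta is the mutually
    adjusted regression coefficient of the metabolite on the risk factor.
    """
    var_y = pvariance(metabolite_values)
    if var_y == 0:
        raise ValueError("metabolite has zero variance")
    return beta ** 2 * pvariance(factor_values) / var_y


def average_across_cohorts(r2_by_cohort):
    """Unweighted mean of cohort-specific values (for example NHS, SOL, WHI)."""
    return mean(r2_by_cohort.values())
```

With a standardized risk factor and an inverse normal transformed metabolite, both variances are close to 1 and the quantity reduces to approximately β².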


Metabolites mediating associations between modifiable risk factors and incident T2D

T2D-associated metabolites (versus non-associated metabolites) seemed to show stronger associations with several baseline risk factors, in a direction that is consistent with the epidemiological associations between risk factors and T2D risk (Fig. 5a–c). For example, among the 235 T2D-associated metabolites, there was a strong, positive correlation (r = 0.86) between their association coefficients with baseline BMI and their prospective association coefficients with incident T2D (Fig. 5a). Likewise, positive correlations of association coefficients were observed for risk-increasing behavioral factors such as smoking, and higher consumption of red meat and sugary drinks. In contrast, metabolites associated with higher levels of PA, and higher consumption of coffee/tea and vegetables, tended to be associated with lower T2D risk (r = −0.65, −0.46 and −0.34, respectively) (Fig. 5b–c, Extended Data Fig. 7 and Supplementary Table 16).

Fig. 5: Metabolites that potentially mediate associations between modifiable risk factors and T2D risk.

a–c, Scatterplots compare the associations of metabolites with the risk factors BMI (a), PA (b) and coffee and/or tea consumption (c) versus their associations with T2D risk. Each dot represents a metabolite (colored: associated with the risk factor and incident T2D at FDR < 0.05 by biochemical category; dark gray: associated with incident T2D but not the risk factor; light gray: not associated with incident T2D); and the two trend lines are for T2D-associated (dark gray) and non-associated metabolites (light gray) separately. Association coefficients (betas) for risk factors are from MWASs in which all risk factors were mutually adjusted (including age, sex, BMI, PA, consumption of 15 main food groups, fasting status and other cohort-specific variables). For metabolites associated with a risk factor and incident T2D in an epidemiologically expected direction, we conducted mediation analysis testing the indirect effect (risk factor–T2D association via a metabolite). d–f, For metabolites whose indirect effects were in the same direction as the total effect, we present the distribution of proportion mediated (indirect effect/total effect) for BMI (d), PA (e) and coffee and/or tea consumption (f). All analyses were conducted separately in NHS, NHS2, HPFS, SOL and WHI (n up to 16,883 for individual metabolites) and results were combined using meta-analysis. g, For metabolites showing significant mediating effects between risk factors and incident T2D, we highlighted the top tissue types where these metabolites showed the most genetic colocalizations with tissue-specific gene expression, and the top clinical traits with which these metabolites have most genetic correlation.


Four risk factors (BMI, PA, coffee/tea consumption and red meat intake) demonstrated expected prospective associations with T2D risk consistently across our study cohorts (Supplementary Table 17a). We therefore employed a mediation analysis to identify which metabolites, and to what degree, mediated the associations between these risk factors and incident T2D. For BMI and PA, we identified 148 and 50 metabolites, respectively, potentially mediating their associations with T2D risk (Fig. 5d–e and Supplementary Table 17a). Notably, many of these metabolites have been linked, in our genetic analyses, to T2D-related traits such as intrahepatic and visceral fat, lipids and liver enzymes, and to tissue types such as visceral fat, pancreas and thyroid, among others (Fig. 5g and Supplementary Tables 12, 13 and 17a). We identified eight metabolites (including C22:0 ceramide, C32:0 DAG and C36:2 PC plasmalogen) as potentially causal mediators between BMI and T2D risk, based on mediation analysis and two-step MR analysis (Supplementary Fig. 11 and Supplementary Table 17b). These findings suggest that obesity and PA may affect T2D risk through metabolic modulations related to visceral and intrahepatic fat deposition, liver and endocrine dysfunction, and lipid dysregulation.

We identified 74 metabolites as potential mediators between coffee/tea consumption and lower T2D risk, comprising several complex lipids, hippuric acid, isoleucine and glycine (Fig. 5f and Supplementary Table 17a). Hippuric acid is formed through hepatic glycine conjugation of benzoic acid, which is generated by the gut microbiota from polyphenols such as chlorogenic acids and epicatechins (abundant in coffee and tea)40,41, highlighting a potential host–microbe interplay in polyphenol metabolism and metabolic health. We also identified six lipids as potential mediators between red meat intake and T2D risk, including lipids linked to ectopic fat and lipid dysregulation in our genetic analyses (Supplementary Tables 12, 13 and 17a).
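The proportion mediated reported for these analyses is defined in the Fig. 5 legend as the indirect effect divided by the total effect, presented only for metabolites whose indirect effect runs in the same direction as the total effect. The minimal Python sketch below encodes that definition and the direction filter; the function name and the example values in the usage note are hypothetical, and how the indirect and total effects themselves were estimated is not covered here.

```python
def proportion_mediated(indirect_effect, total_effect):
    """Indirect effect / total effect, or None if directions disagree.

    Both effects are assumed to be on the same scale (for example, ln(RR)
    for incident T2D per unit of the risk factor).
    """
    if total_effect == 0:
        raise ValueError("total effect must be non-zero")
    if indirect_effect * total_effect <= 0:
        # Opposite sign or null indirect effect: not summarized as mediation.
        return None
    return indirect_effect / total_effect
```

For instance, `proportion_mediated(0.05, 0.25)` returns a share of about 20%, whereas an indirect effect of opposite sign to the total effect returns `None` and would be excluded from the distributions shown.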

A metabolomic signature to reflect the complex metabolic states predictive of T2D risk

Finally, we developed a multi-metabolite signature reflecting the complex metabolic states predictive of future T2D risk using elastic net regression, focusing on T2D-associated metabolites shared between the two metabolomic platforms to facilitate translational applicability of our findings. A leave-one-cohort-out cross-validation approach was applied to avoid overfitting (Methods and Supplementary Fig. 18a). In independent testing cohorts, the metabolomic signature alone demonstrated reasonable prediction performance for incident T2D risk, with an area under the receiver operating characteristic (ROC) curve (AUC) ranging from 0.62 to 0.86. Compared to a conventional model with traditional risk factors, the model that additionally included the metabolomic signature substantially improved T2D risk prediction, with the AUC ranging from 0.69 to 0.92 (AUC increment P < 0.05 in all cohorts, except P = 0.054 in SOL) (Fig. 6a–c, Extended Data Fig. 8, Supplementary Fig. 12 and Supplementary Table 18b). In secondary analyses of five datasets with available fasting glucose, the addition of the metabolomic signature improved the model AUC significantly (P < 0.05 in three datasets) to marginally (P = 0.06 in SOL) beyond traditional risk factors and fasting glucose, except for PREDIMED (P = 0.18) (Extended Data Fig. 9).
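The evaluation scheme described above, in which each cohort is held out in turn and the signature is scored on it, can be sketched as follows. This illustration covers only the cohort-wise split and a rank-based AUC; it does not fit the elastic net itself, and the function names are hypothetical. Any model-fitting routine would be applied to the training indices inside the loop.

```python
def leave_one_cohort_out(cohort_labels):
    """Yield (held_out_cohort, train_indices, test_indices) per cohort."""
    for held_out in sorted(set(cohort_labels)):
        train = [i for i, c in enumerate(cohort_labels) if c != held_out]
        test = [i for i, c in enumerate(cohort_labels) if c == held_out]
        yield held_out, train, test


def auc(scores, outcomes):
    """Probability that a random case outscores a random non-case.

    Ties count as 0.5; outcomes are coded 1 (incident T2D) or 0.
    This pairwise form is quadratic in sample size and meant for clarity.
    """
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    controls = [s for s, y in zip(scores, outcomes) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))
```

Because the held-out cohort never contributes to model fitting, the AUC computed on its test indices reflects performance in an independent population, which is the property the leave-one-cohort-out design is meant to provide.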

Fig. 6: A multi-metabolite signature for T2D risk prediction.

a, AUC for T2D risk prediction in each cohort. Yellow: the model with metabolomic signature only, acquired using a leave-one-cohort-out cross-validation approach to avoid overfitting (within WHI, the signature was acquired using a leave-one-out cross-validation); blue: the model with conventional risk factors including age, sex, smoking, BMI, dyslipidemia, hypertension, lipid-lowering medication use, anti-hypertensive medication use and family history of T2D; red: the model with conventional risk factors plus the metabolomic signature. For cohorts analyzed with Cox model, we plotted AUC estimated at the median follow-up time. We compared the AUC of the conventional plus metabolomic signature model to that of the conventional model; **two-sided P < 0.01, ^two-sided P < 0.1. b,c, Two examples of ROC curves and two-sided P values from WHI (b) and Black participants from ARIC (c). d, Crude incidence rate of T2D by cohort, across deciles of the metabolomic signature, with a smooth trendline and 95% CI (gray band) from locally estimated scatterplot smoothing (LOESS). e, Risk ratio (points) and 95% CI (lines) for incident T2D, comparing participants in higher versus the lowest deciles of the metabolomic signature. Analyses were conducted separately in NHS, NHS2, HPFS, SOL, WHI, PREDIMED and Black and white participants from ARIC, adjusting for age, sex, smoking, alcohol consumption, fasting status, hypertension, dyslipidemia, lipid-lowering medication use, anti-hypertensive medication use, BMI, WHR, family history of T2D and cohort-specific variables. We plotted risk ratios from the meta-analysis (n = 20,930). f, In multivariable analysis, BMI, red meat intake and sugary drink consumption (purple) were associated positively with the metabolomic signature, whereas PA, and intakes of coffee/tea, whole grains and wine (green), were associated inversely with the metabolomic signature (FDR < 0.05). A Sankey plot was used to demonstrate the associations between each of the 44 metabolites constituting the final metabolomic signature with these risk factors and with T2D risk (band-width proportional to the association coefficients).


Across cohorts, crude incidence of T2D increased from 7.7% in the lowest to 37.7% in the highest decile of the metabolomic signature (Fig. 6d). In a multivariable-adjusted analysis combining all cohorts, participants in the highest decile had a 5.1-fold higher risk of T2D compared to those in the lowest decile (RR = 5.07; 95% CI, 4.02–6.39) (Fig. 6e and Supplementary Table 18c). Further assessing associations with modifiable diet/lifestyle factors, we found that greater BMI and higher consumption of red meat and sugary drinks were associated with a higher metabolomic signature score, whereas more PA and higher intakes of whole grains, coffee/tea and wine were associated with a lower signature score (Fig. 6f and Supplementary Table 18d).

The final metabolomic signature model, derived based on all study cohorts, comprised 44 metabolites (including 20 amino acids, 19 involved in lipid/energy metabolism and five others), with many potentially linking modifiable risk factors to T2D risk (Fig. 6f and Supplementary Table 18a). For instance, alanine, which connected higher BMI and intakes of red meat and sugary drinks with higher T2D risk, was identified as a potential mediator between BMI and T2D risk by our mediation and two-step MR analyses (Supplementary Tables 14a and 17a,b). Several metabolites, including trigonelline, hippuric acid, isoleucine and glycine, connected higher coffee/tea intake to lower T2D risk (Fig. 6f). Taken together, this metabolomic signature may serve as a prediction/monitoring biomarker to facilitate risk prediction, risk stratification and evaluation of effects of diet/lifestyle interventions on T2D prevention.

Discussion

This is one of the largest and most comprehensive investigations of metabolomic profiles associated with T2D risk, integrating blood metabolomic, genomic and diet/lifestyle data across racially and ethnically diverse cohorts. Collectively, our study identified a profile of 235 metabolites that reflects a dysregulated metabolism driven by both genetics and modifiable risk factors and that predicts future T2D risk.

A key strength of this study is the harmonized analysis of individual-level data from ten prospective cohort studies using standardized protocols. This design provided high statistical power, enabling the identification of 235 metabolites prospectively associated with T2D risk, offering a comprehensive view of the metabolic landscape underlying T2D pathogenesis and substantially expanding upon the 123 metabolites reported in a recent literature-review-based meta-analysis of more than 60 studies4. Our identified significant associations include 34 that were only nominally significant in previous studies and 33 never linked to T2D risk. The use of individual-level data also allowed consistent adjustments of covariates and result comparisons across population groups and metabolomic platforms, which are not feasible in literature-review-based meta-analyses. Notably, associations between the identified metabolites and T2D risk remained robust after adjustments for obesity/adiposity, blood lipids, blood pressures, lifestyle factors or kidney function, and were generally consistent across popular liquid chromatography–tandem mass spectrometry (LC–MS) platforms and major racial and ethnic groups.

Previous mQTL studies have advanced our understanding of genetic regulation of metabolic homeostasis7,8,24,42,43. Our study offers additional insights into the shared genetic architectures between metabolites and T2D. First, genetic determinants of T2D-associated metabolites were enriched in pathways central to T2D pathogenesis, including regulatory signaling of glucose response, insulin resistance and lipid homeostasis, despite their modest contributions to the overall metabolite variation. In addition, many of these metabolites were genetically correlated with traits reflecting T2D pathophysiology, such as insulin secretion, insulin resistance, obesity, ectopic fat deposition and liver function. Furthermore, circulating levels of T2D-associated metabolites may reflect biological regulation within specific tissue types relevant to nutrient metabolism (digestive tract, pancreas and liver), endocrine/metabolic regulation (thyroid, pancreas and adipose tissues), and inflammation (whole blood and visceral fat). Mapping metabolites, particularly those with strong genetic regulation, to relevant tissues and physiological functions can facilitate mechanistic interpretation. For example, TAGs 46:1 and 46:2 were linked to visceral but not subcutaneous fat, gene expression in pancreas, and insulin secretion and sensitivity indices, suggesting a role in visceral adiposity-related insulin resistance44. Notably, although dyslipidemia is often viewed as a consequence of diabetes45, our findings and recent evidence4,28,29 indicate a complex interplay between lipid and amino acid metabolism and glucose homeostasis. Future studies may leverage our results to further explore mechanisms linking circulating metabolites to T2D risk.

Obesity, diet and lifestyle can directly influence the circulating metabolome9,10,11. We showed that obesity, PA and diet may have substantial impacts on the subset of metabolites associated with T2D risk, which is consistent with the notion that environmental factors need to disturb causal pathways to affect T2D risk46. We also identified specific metabolites probably mediating risk factor–T2D associations. These findings, together with our genetic results, highlight potential causal pathways underlying T2D that deserve further mechanistic investigations. For instance, several metabolites mediating the inverse association between PA and T2D risk seem to be involved in ectopic fat-related insulin resistance and liver function impairment, whereas metabolites mediating the association between coffee/tea consumption and T2D risk were linked to polyphenol metabolism, glucose response, insulin resistance, ectopic fat deposition and liver function. Future clinical trials and functional studies could prioritize these pathways when investigating the causal effects of PA and coffee (or tea) consumption on metabolic health.

The blood metabolome reflects overall biological states and may serve as a prediction or monitoring tool in T2D prevention and therapeutic interventions. In the final step, we developed a multi-metabolite signature that robustly predicted future T2D risk, either used alone or in combination with conventional risk factors, and could identify people with extremely high risk of T2D before T2D diagnosis. The metabolomic signature is also associated with key modifiable risk factors and comprises metabolites that may mediate the associations between various diet/lifestyle factors and T2D risk. Collectively, this metabolomic signature captures the complex metabolic states associated with T2D risk, and is applicable in future clinical and research settings, as either a prediction tool to identify people with high risk of T2D for early prevention, or an intermediate biomarker to evaluate the efficacy of dietary and lifestyle interventions.

We acknowledge several limitations. First, although metabolomic data were harmonized between two LC–MS platforms, some metabolites were unique to one platform, limiting their sample sizes to specific cohorts. Second, although MR analysis is used frequently to infer causality between metabolites and diseases47,48,49, its results should be interpreted cautiously, because some metabolites have weak genetic instruments and many molecules within the same pathways share genetic loci. To minimize false positives, we used the conservative mode-based estimate as our primary method, and confirmed findings with another three MR methods. We note that the lack of significant MR results does not preclude potential biological connections between a metabolite and T2D. Third, due to the observational design, our study cannot establish causality. Randomized trials are warranted to assess how diet/lifestyle affect T2D-associated metabolites and T2D risk. Finally, although our study included people with racially and ethnically diverse backgrounds, and associations were generally consistent across groups, 77% of our participants were non-Hispanic white individuals, highlighting the need for further replication and additional investigations in more diverse populations.

In summary, we identified 235 metabolites associated with incident T2D, potentially reflecting the influence of genetic and modifiable factors (especially diet, PA and adiposity) on metabolic pathways underlying T2D risk. This included 67 significant associations not previously reported encompassing BA, lipid, carnitine, urea cycle and arginine/proline, glycine and histidine metabolic pathways. As a resource, our findings may aid mechanistic and clinical research to investigate pathways underlying T2D pathophysiology. Our metabolomic signature may serve as a powerful tool for risk stratification and as a monitoring biomarker to inform precision T2D prevention and early intervention.

Methods

Study participants and ethics approval

Our MWAS for incident T2D used data from ten prospective cohorts, including the Nurses’ Health Study (NHS; initiated in 1976 with 121,701 female nurses aged 30–55 years9,50), NHS2 (started in 1989 with 116,429 female nurses aged 25–42 years9,50), Health Professionals Follow-Up Study (HPFS; started in 1986 with 51,529 male health professionals aged 40–75 years9), Hispanic Community Health Study/Study of Latinos (SOL; enrolled 16,415 Hispanic/Latino adults aged 18–74 years during 2008–201151,52), Women’s Health Initiative (WHI; initiated in 1993 enrolling 68,132 women aged 50–79 years to one of three clinical trials or an observational study53), Atherosclerosis Risk in Communities (ARIC) study (enrolled 15,792 mostly Black and white US adults aged 45–64 years during 1987–198954), Framingham Heart Study Offspring cohort (FHS; enrolled 5,124 adults; we focused on those who attended the fifth examination during 1991–1995), Multi-Ethnic Study of Atherosclerosis (MESA; initiated in 2000 with 6,814 adults aged 45–84 years55,56), the Boston Puerto Rican Health Study (BPRHS; enrolled 1,500 self-identified Puerto Rican adults aged 45–75 years) and the Prevención con Dieta Mediterránea Study (PREDIMED; a 5-year dietary trial with 7,447 adults aged 55–80 years57). In each cohort, comprehensive data on demographics, medical and family history, diet, lifestyle and other health information were collected at baseline and were updated during longitudinal follow-ups. Blood samples were collected at baseline and/or during follow-ups. Our MWAS for incident T2D included participants who had qualified metabolomics data and were free of diabetes, cardiovascular disease and cancer at study baseline. The final analysis included 6,890 participants from NHS; 3,692 from NHS2 and 2,529 from HPFS; 2,821 from SOL; 1,392 from WHI; 1,288 white and 1,433 Black participants from ARIC; 1,424 from FHS; 902 from MESA; 378 from BPRHS and 885 from PREDIMED (Extended Data Table 1). Each study was approved by Institutional Review Boards at respective institutions or study centers, and all participants provided informed consent. Our GWAS for metabolites included participants from eight cohorts comprising NHS, NHS2, HPFS, SOL, WHI, ARIC, FHS and, in addition, the Cardiovascular Health Study (CHS; enrolled 5,201 adults during 1989–1990 and 678 predominantly Black participants in 1992–199358,59) (Supplementary Table 7). The detailed descriptions of the design, data collection, ethical review of each cohort, and our inclusion and exclusion criteria are provided in Supplementary Methods.

Ascertainment of T2D

In all cohorts, incident T2D was defined when a participant was free of diabetes at baseline but was identified as having T2D during longitudinal follow-up. Detailed information on diagnosis criteria in each cohort is included in Supplementary Methods, and follow-up years and numbers of incident cases are listed in Extended Data Table 1. Briefly, in NHS/HPFS, T2D cases were identified by follow-up questionnaires, and confirmed through a supplementary questionnaire based on diagnostic criteria from the National Diabetes Data Group before 199860 and the American Diabetes Association (ADA) criteria after 199861,62. In SOL, T2D was defined if a participant had fasting glucose ≥7.0 mmol l−1, fasting ≤8 h and nonfasting glucose ≥11.1 mmol l−1, post oral glucose tolerance test glucose ≥11.1 mmol l−1, HbA1c ≥ 6.5%, current use of antidiabetic medications or self-reported physician-diagnosed diabetes63. In WHI, T2D was determined based on self-reported history of diabetes or use of antidiabetic medications (pills or shots) in any visits/interviews. In ARIC and FHS, T2D was diagnosed if a person had fasting glucose ≥7.0 mmol l−1, fasting ≤8 h and nonfasting glucose ≥11.1 mmol l−1, or current use of antidiabetic medications, with ARIC further considering self-reported physician-diagnosed diabetes64,65. T2D cases in MESA and BPRHS were determined according to the ADA criteria66, which included fasting plasma glucose level ≥7.0 mmol l−1 or the use of antidiabetic medications or insulin56,67. In PREDIMED, T2D was adjudicated through blind assessment by a Clinical Endpoint and Adjudication of Events Committee, based on the ADA criteria68.

Assessment of diet, lifestyle factors and covariates

Detailed information on data collection in each cohort is in Supplementary Methods. Briefly, demographic factors (for example, self-reported sex, and race and ethnicity), socioeconomic status, health information (for example, medical conditions and family history), lifestyle (for example, smoking history and PA), anthropometrics and blood pressure were collected at baseline and follow-up visits, through self-administered questionnaires, or in-person or telephone-based interviews by trained staff. PA was quantified as metabolic equivalent (MET) in hours per week. We calculated BMI based on baseline weight and height, and WHR based on waist and hip circumferences. Blood clinical biomarkers were measured using standard assays. Among participants with serum creatinine data, eGFR was estimated using the CKD-EPI (Chronic Kidney Disease Epidemiology Collaboration) formula, based on age, sex and race in NHS/HPFS, WHI, ARIC and PREDIMED69, and standard reference equations for Hispanics adjusting for age and sex in SOL. In PREDIMED, two propensity scores were estimated to account for the probability of assignment to intervention groups57.

In NHS/HPFS, diet was assessed using a semi-quantitative food frequency questionnaire (FFQ) every 4 years; in our analysis we averaged the intakes from the two FFQs closest to the time of blood draw (NHS: 1986 and 1990; NHS2: 1995 and 1999; HPFS: 1994 and 1998). In WHI, ARIC, FHS, MESA and BPRHS, diet was similarly assessed by FFQs designed and validated for application to their targeted populations (for example, multiethnic and geographically diverse populations in WHI70,71,72 and Puerto Rican population in BPRHS73). In SOL, diet was assessed using two 24-h dietary recalls and a food propensity questionnaire74. The overall dietary quality was assessed by the Alternate Healthy Eating Index-2010 (AHEI-2010)75 in all cohorts except for the PREDIMED trial, in which it was assessed by a 14-item Mediterranean Diet Adherence Screener score57. In NHS/HPFS, SOL and WHI, we also calculated baseline consumptions of 15 main food groups in the unit of servings per day.

Metabolomic profiling, quality control and data harmonization

Metabolomic profiling in NHS/HPFS, WHI, MESA, PREDIMED, FHS and CHS was conducted with the Metabolomics Platforms at the Broad Institute of MIT and Harvard University, using three to four complementary LC–MS methods9,65,76. Metabolomic profiling in SOL and ARIC (serum samples) and BPRHS (plasma samples) was conducted using LC–MS-based methods with the Metabolon DiscoveryHD4 Panel at Metabolon Inc.63,77,78. Detailed protocols for both platforms have been described previously53,79.

Data processing was conducted within each study and, if applicable, separately within each batch (or substudy) if several batches/substudies were conducted within a cohort. Samples were removed if their metabolite detection rate was <80%, or if they were identified as outliers by multidimensional scaling analysis within a specific race/ethnic group. Metabolites were filtered out if their detection rate across samples was <80% and, if applicable, had a coefficient of variation >20% for quality control (QC) samples. After quality filtering, missing values of each metabolite were imputed using half of the minimum value, and the data were then standardized for analysis. Across all cohorts, we matched metabolites by their HMDB ID and/or PubChem ID, provided by the corresponding metabolomic laboratories. A total of 1,273 named metabolites were initially qualified for analysis in at least one cohort. To reduce single-study bias, we limited our analyses to 469 metabolites that were available in at least four independent cohorts, or available in at least three independent cohorts if the three cohorts covered both metabolomic platforms. Finally, 407 metabolites from NHS, 363 from NHS2, 291 from HPFS, 364 from WHI, 327 from MESA, 274 from PREDIMED, 188 from FHS, 283 from SOL, 139 from ARIC and 231 from BPRHS were harmonized for our analysis (Extended Data Table 1). In CHS, 411 metabolites were included in genetic analyses (Supplementary Table 7). Details of the metabolomic profiling, QC and data processing are in the Supplementary Methods.
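The metabolite-level filtering, half-minimum imputation and standardization described above can be sketched as follows. This is a minimal illustration and not the study's pipeline: the function name `qc_and_impute`, the dictionary layout and the use of `None` for undetected values are assumptions, the sketch treats either a low detection rate or a high QC coefficient of variation as grounds for exclusion, and sample-level filtering and outlier detection are omitted.

```python
import statistics

def qc_and_impute(data, qc_cv=None, min_detection=0.8, max_cv=0.2):
    """Filter, impute and standardize metabolites (hypothetical helper).

    data:  dict mapping metabolite name -> list of values, one per sample,
           with None marking an undetected value.
    qc_cv: optional dict mapping metabolite name -> coefficient of variation
           measured in QC samples (as a fraction, e.g. 0.15).
    """
    cleaned = {}
    for name, values in data.items():
        detected = [v for v in values if v is not None]
        # Drop metabolites detected in fewer than 80% of samples.
        if len(detected) / len(values) < min_detection:
            continue
        # Drop metabolites that are too variable in QC samples, when known.
        if qc_cv is not None and qc_cv.get(name, 0.0) > max_cv:
            continue
        # Half-minimum imputation of the remaining missing values.
        half_min = min(detected) / 2.0
        filled = [half_min if v is None else v for v in values]
        # Standardize to mean 0 and unit (sample) standard deviation.
        mean = statistics.fmean(filled)
        sd = statistics.stdev(filled)
        cleaned[name] = [(v - mean) / sd for v in filled]
    return cleaned

example = {
    "metabolite_a": [2.0, None, 4.0, 6.0, 8.0],   # 80% detected: kept
    "metabolite_b": [1.0, None, None, 3.0, 5.0],  # 60% detected: dropped
}
result = qc_and_impute(example)
```

In practice this step would be run within each study and batch, as the text describes, before metabolites are matched across cohorts.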

Metabolome-wide association analysis for incident T2D

Details of analytical approaches and models are provided in Supplementary Methods and Supplementary Table 1. Briefly, all association analyses were conducted separately for each cohort, stratified by major racial/ethnic groups when sample sizes permitted. Metabolites were inverse normal transformed by each substudy and racial/ethnic group (if applicable) in each cohort. To analyze the association between each metabolite and T2D risk, we applied Cox regression for studies of longitudinal cohort design (NHS excluding the T2D nested case–control substudy, NHS2, HPFS, SOL, ARIC, WHI, FHS, MESA and BPRHS); logistic regression for the NHS T2D nested case–control substudy; and Cox regression with Barlow weights80 and robust estimators for the PREDIMED T2D nested case–cohort study. The basic multivariable model (model 1) was adjusted for age, sex, smoking status, alcohol consumption and, if applicable, education, family income, fasting status, lipid-lowering medications, anti-hypertensive medications, family history of diabetes, self-reported physician-diagnosed hypertension, self-reported physician-diagnosed dyslipidemia and study-specific covariates. The main model was further adjusted for BMI and WHR (model 2). In sensitivity analyses, model 1 was further adjusted for PA and dietary quality index (model 3); high-density lipoprotein (HDL)-cholesterol, low-density lipoprotein (LDL)-cholesterol and triglycerides (model 4), or systolic and diastolic blood pressures (model 5). In another sensitivity analysis, model 2 was further adjusted for eGFR in NHS, NHS2, HPFS, SOL, ARIC, WHI and PREDIMED. For each metabolite, association results from all available cohorts and racial/ethnic groups were combined using a fixed-effect, inverse-variance-weighted (IVW) meta-analysis, and a meta-analyzed FDR < 0.05 was considered statistically significant. In secondary analyses, meta-analysis was conducted combining results from the same racial/ethnic groups, or cohorts using the same platforms.
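As an illustration of the pooling step, the sketch below combines cohort-specific coefficients with a fixed-effect IVW meta-analysis and then adjusts the resulting P values for multiple testing. The function names are hypothetical, the P value uses a normal approximation, and the Benjamini-Hochberg procedure is assumed as the FDR method because the text does not name one.

```python
import math

def ivw_fixed_effect(betas, ses):
    """Fixed-effect inverse-variance-weighted pooled estimate.

    betas, ses: per-cohort coefficients (e.g. log hazard ratios) and their
    standard errors. Returns (pooled beta, pooled SE, two-sided P value).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    # Two-sided P value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, p

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pooled_beta, pooled_se, pooled_p = ivw_fixed_effect([0.2, 0.4], [0.1, 0.1])
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

A metabolite would be flagged when its adjusted value falls below 0.05, matching the significance rule stated in the text.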

To annotate the novelty of the identified associations, we reviewed previous prospective cohort studies linking circulating metabolites to T2D risk. We used a literature-review-based meta-analysis4 that included all studies published before 6 March 2021 as an anchor, and searched for additional studies published from 2021 to 202421,81,82,83,84,85,86,87,88,89,90,91,92,93,94. We considered an association as ‘previously reported,’ if the association was statistically significant in a published study after multiple testing correction based on the study’s prespecified analysis plan.

GWAS of metabolites

Detailed information on genotyping arrays, imputation methods, sample size and GWAS and meta-analysis methods is provided in Supplementary Methods and Supplementary Table 7. Briefly, genotyping was conducted using several types of array by previous studies in NHS/HPFS95, SOL96, ARIC7, WHI97, CHS98 and FHS43. Imputation was conducted based on the HRC reference panel in NHS/HPFS and CHS; 1000 Genomes Project phase 3 worldwide reference panel in SOL, 1000 Genomes Project phase 3 v.5 in WHI and HapMap CEU population release v.22 in FHS with comprehensive pre- and postimputation QC. GWAS of metabolites were conducted previously in the NHS/HPFS (median n = 6,610, range 971–8,054) and WHI (n = 1,256) using the RVTESTS tool6,42,99, in SOL (n = 3,933) using a linear mixed-effect model in GMMAT7 and in ARIC (n = 1,772 and n = 1,509 for African American and non-Hispanic white participants, respectively)7, CHS (n = 263) and FHS (n = 1,802)43, with detailed analysis procedures described in previous publications7,42,43.

GWAS summary statistics from each cohort were lifted over to Genome Build v.37 and filtered, retaining single nucleotide polymorphisms with a minor allele frequency ≥ 0.01 and imputation ratio ≥0.3. For each metabolite, an IVW fixed-effect meta-analysis, implemented in METAL100, was used to combine GWAS results from the cohorts in which the metabolite was available. Genomic control was implemented before and after meta-analysis100. The final GWAS were available for 458 out of 469 harmonized metabolites, with the total sample size ranging from 1,074 to 18,590 (median n = 8,611). We compared significant mQTLs identified at P < 5 × 10−8 and 1.09 × 10−10 (that is, 5 × 10−8 further correcting for 458 metabolites) levels. Manhattan plots were derived using the R package CMplot and regional plots were drawn with LocusZoom101. In a secondary analysis, we compared genetic effect heterogeneity between racial/ethnic groups at the identified mQTLs for T2D-associated metabolites (Supplementary Methods).
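The stricter significance level quoted above follows from dividing the genome-wide threshold by the number of metabolites with GWAS results. The short sketch below reproduces that arithmetic and the variant filter; the constant and function names are illustrative only.

```python
GENOME_WIDE_P = 5e-8   # conventional genome-wide significance level
N_METABOLITES = 458    # metabolites with final GWAS results

# Bonferroni-style correction across metabolites: about 1.09e-10.
strict_threshold = GENOME_WIDE_P / N_METABOLITES

def keep_variant(maf, imputation_quality, min_maf=0.01, min_quality=0.3):
    """Variant filter applied to each cohort's summary statistics."""
    return maf >= min_maf and imputation_quality >= min_quality
```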

We annotated the novelty of our significant mQTLs for the 165 T2D-associated metabolites at P < 1.09 × 10−10, by comparing our results to eight previous studies (with N ≥ 4,000 and using LC–MS-based metabolomic platforms)8,24,25,26,102,103,104,105. We considered a locus for a specific metabolite as ‘previously reported’ if the reported lead genetic variant was the same lead variant, or not the same lead variant but was significant in our study; or not in our study but within the clumping range of our identified locus. We considered a locus for a metabolite as potentially new if our locus was not previously reported for this metabolite, or this metabolite was not previously reported in these studies.

Lead variants for metabolites, pathway analysis and proportion of variance explained

We used the PLINK clumping function (P < 5 × 10−8 and r2 < 0.01 in a 1,000-kb window) to identify independent genetic variants associated with each metabolite. For metabolites with no variant at P < 5 × 10−8, a single lead variant with the smallest P was selected. Gene annotation for top variants was conducted using the SNPNexus web tool106. Canonical pathway enrichment analyses were conducted using the MetaCore software with the default background107; and we compared top enriched pathways for genes annotated to mQTLs of T2D-related metabolites versus those of non-associated metabolites. We calculated the R2 of each metabolite explained by independent lead genetic variants using the formula \(\sum_{i=1}^{k} 2\beta_{i}^{2}\,\mathrm{MAF}_{i}(1-\mathrm{MAF}_{i})\), in which k is the number of independent lead variants, β_i is the association coefficient between variant i and the metabolite, and MAF_i is the minor allele frequency of variant i. We compared the R2 distribution for the T2D-associated versus non-associated metabolites using the Wilcoxon test.
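The variance-explained formula can be evaluated directly from the lead-variant summary statistics. The sketch below assumes the metabolite was analyzed on a standardized scale, so each term is already a proportion of variance; the function name and the tuple layout are hypothetical.

```python
def variance_explained(lead_variants):
    """Sum of 2 * beta^2 * MAF * (1 - MAF) over independent lead variants.

    lead_variants: iterable of (beta, maf) pairs, one per lead variant.
    """
    return sum(2.0 * beta ** 2 * maf * (1.0 - maf) for beta, maf in lead_variants)

# Two illustrative lead variants (made-up numbers, not study results).
r2 = variance_explained([(0.1, 0.5), (0.2, 0.25)])
```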

Genetic correlation (rg) between metabolites and T2D-related traits

We acquired publicly available GWAS summary statistics from large consortium studies for T2D (180,834 cases and 1,159,055 controls)27, fasting insulin (N = 98,210)108, proinsulin (N = 45,861)109, HOMA-IR and HOMA-B (N = 51,750)110, BMI-adjusted insulin sensitivity index (ISI, N = 53,657) and insulin fold-change (IFC; N = 55,124)111, BMI and WHR (N = 700,000)112 and lipids (N = 1,500,000)113. We conducted GWAS for HbA1c (N = 390,982), subcutaneous fat volume (N = 37,912), visceral fat volume (N = 37,912), liver proton density fat fraction (PDFF; N = 29,512), pancreas PDFF (N = 28,624) and liver enzymes (N = 390,000) in the UK Biobank using BOLT-LMM (Supplementary Methods). We calculated rg between each metabolite and each clinical trait using linkage disequilibrium score regression, based on their GWAS summary data overlapping with the 1.2 M HapMap3 variants after excluding the major histocompatibility complex region in the European population114. For each trait, we compared the distribution of its rg with T2D-associated versus non-associated metabolites, using the chi-squared test, and considered FDR < 0.05 (correcting for numbers of comparisons tested) as statistically significant.

Genetic colocalization

We obtained tissue-specific cis-eQTL summary statistics from the GTEx project v.8115,116. The shared causal variants between each metabolite and the tissue-specific transcriptome from 47 tissue types were examined using colocalization analysis implemented in the coloc.abf() function in R package ‘coloc’ v.5117. For each metabolite, we input the GWAS summary statistics for all variants within ±500 kb of its independent lead variants (Supplementary Methods). A posterior probability of H4 (PPH4) > 0.8 was considered as strong evidence for genetic colocalization. Within each tissue type, we used univariate logistic regression to test whether the proportions of mQTL–eQTL colocalizations are higher for the T2D-associated versus non-associated metabolites, and a one-sided FDR < 0.05 (correcting for 47 tissue types) was considered as statistically significant. We applied a similar coloc approach to examine genetic colocalizations between circulating metabolites and T2D27. We then aligned mQTL–T2D colocalizations with tissue-specific eQTL–mQTL colocalizations by metabolites and shared causal variants, to interpret the potential functionality of metabolites in T2D pathogenesis.

MR analysis

To infer the potential causal relationships between 233 T2D-associated metabolites (with genetic data) and T2D risk, we applied four MR methods implemented in the MendelianRandomization R package118: we used mode-based estimate (MBE) as the main method as it is generally conservative and robust to outliers; we further applied weighted-median, IVW and MR-Egger to indicate result consistency119. When testing the direction from metabolites to T2D, we used independent variants from clumping (P < 5 × 10−8 and r2 < 0.01 in a 1,000-kb window) excluding the HLA region as genetic instrumental variables. If fewer than three variants were identified, we relaxed the clumping P threshold until at least three variants were identified. We considered a potential causal relationship when MBE–FDR < 0.05 and at least two other MR methods showed the same effect directions as those from MBE. Sensitivity analyses were conducted, either to remove variants mapped to the top three recurrent loci (GCKR, ZNF259, FADS cluster) from the instrumental variables, or to use only independent variants clumped at P < 1.09 × 10−10 as the instrumental variables of metabolites, using the IVW MR method (due to fewer variants retained). When testing the direction from T2D to metabolites, we used independent lead variants associated with T2D at P < 5 × 10−8 as the instrumental variables. For the 148 metabolites that are potential mediators between BMI and T2D risk, we applied MR analysis to test the direction from BMI to metabolites. Details are provided in Supplementary Methods.
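The rule for calling a potential causal relationship combines the MBE result with directional agreement from the other methods. A minimal sketch of that decision rule is given below; the function name and the dictionary keys are hypothetical, and the MR estimates themselves are assumed to come from a separate analysis.

```python
def is_potentially_causal(mbe_beta, mbe_fdr, other_betas, fdr_threshold=0.05):
    """Apply the stated rule: MBE FDR < 0.05 and at least two of the other
    MR methods agreeing in direction with the MBE estimate.

    other_betas: dict mapping method name -> effect estimate.
    """
    if mbe_fdr >= fdr_threshold or mbe_beta == 0:
        return False
    # Count methods whose estimate has the same sign as the MBE estimate.
    same_direction = sum(1 for b in other_betas.values() if b * mbe_beta > 0)
    return same_direction >= 2

# Made-up estimates for illustration only.
others = {"weighted_median": 0.12, "ivw": 0.08, "mr_egger": -0.02}
call = is_potentially_causal(mbe_beta=0.10, mbe_fdr=0.01, other_betas=others)
```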

MWASs for modifiable risk factors

We fitted linear models to regress inverse normal transformed metabolite levels on age, sex (only in SOL), current smoking status, BMI, PA, intakes of 15 main food groups and fasting status, simultaneously together with cohort-specific covariates. Analyses were conducted in NHS/HPFS, SOL and WHI, separately, further stratified by substudies or racial groups (Supplementary Methods). Association coefficients between metabolites and each particular risk factor were then combined across analytical sets using a fixed-effect IVW meta-analysis. The R2 of each metabolite explained by specific risk factors was first calculated in each analytical set using the formula \(\beta^{2}\times \mathrm{variance}(\mathrm{risk}\;\mathrm{factor})/\mathrm{variance}(\mathrm{metabolite})\), with β being the association coefficient between the metabolite and the risk factor, and then averaged across all analytical sets. We compared the distributions of R2 for T2D-associated versus non-associated metabolites using the Wilcoxon test.
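The per-set R2 and its average across analytical sets can be sketched as below. The function names are hypothetical, sample variances are used as an assumption, and the numbers are made up for illustration.

```python
import statistics

def r2_for_risk_factor(beta, risk_factor_values, metabolite_values):
    """beta^2 * variance(risk factor) / variance(metabolite) in one set."""
    return (beta ** 2 * statistics.variance(risk_factor_values)
            / statistics.variance(metabolite_values))

def average_r2(per_set_r2):
    """Unweighted mean of R2 across analytical sets."""
    return statistics.fmean(per_set_r2)

r2_one_set = r2_for_risk_factor(0.5, [1.0, 2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.0])
r2_mean = average_r2([0.02, 0.04])
```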

Mediation analysis between risk factors, metabolites and T2D risk

Details for mediation analysis are described in Supplementary Methods. Briefly, our analysis focused on BMI, PA, coffee/tea consumption and red/processed meat intake. For each risk factor, metabolites (1) that were associated with both the risk factor and T2D risk and (2) whose association directions with the risk factor and T2D risk were consistent with the pre-assumed epidemiological relationships between the risk factor and T2D risk, were considered. We tested whether, and to what degree, each metabolite mediated the association between a risk factor and T2D risk using the CMAverse R package120, adjusting for age, sex, smoking, BMI and PA (if not the tested risk factor), calorie intake and other cohort-specific covariates, separately in NHS/HPFS, SOL and WHI. We combined total, indirect and direct effects, respectively, from each analytical set using a fixed-effect meta-analysis. The mediated proportion was calculated by dividing the indirect effect by the total effect. Metabolites with an indirect effect FDR < 0.05 and a consistent effect direction between the indirect and total effects were considered as potential mediators between a risk factor and T2D risk.
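Once the meta-analyzed indirect and total effects are available, the mediated proportion and the mediator call reduce to a few lines. The sketch below assumes effects on an additive scale (for example, log hazard ratios); the function names are hypothetical and the numbers are illustrative.

```python
def mediated_proportion(indirect_effect, total_effect):
    """Indirect effect divided by total effect."""
    return indirect_effect / total_effect

def is_potential_mediator(indirect_effect, total_effect, indirect_fdr,
                          fdr_threshold=0.05):
    """Indirect-effect FDR < 0.05 and same sign as the total effect."""
    return indirect_fdr < fdr_threshold and indirect_effect * total_effect > 0

proportion = mediated_proportion(0.03, 0.12)
flag = is_potential_mediator(0.03, 0.12, indirect_fdr=0.01)
```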

A multimetabolite signature for incident T2D prediction

We used metabolites shared between the Broad Institute and the Metabolon platforms (excluding glucose) to develop the signature to increase its generalizability to future studies. To avoid overfitting in model development and testing, we employed a leave-one-cohort-out cross-validation approach, in which we set aside one cohort as the testing set each time, and trained a prediction model for the set-aside cohort using data from all other cohorts (Extended Data Fig. 8). Given the heterogeneity of our cohorts, we did not pool individual-level data for model training. Instead, we applied a two-step approach to train the prediction model in a representative cohort (that is, WHI, which assessed the most shared metabolites for all its participants) but also leveraged association data from several other cohorts. In each iteration (that is, for each held-out testing cohort), we first conducted a metabolome-wide meta-analysis for T2D risk using all cohorts except WHI and the held-out cohort. Then, metabolites associated with T2D risk at FDR < 0.05 in the first step and shared between the two metabolomic platforms were used as input in a Cox regression with elastic net regularization, implemented using the glmnet R package121, to construct a metabolomic signature model for T2D prediction in WHI. The derived model was further applied to the held-out cohort to calculate a metabolomic signature score. Within WHI, a leave-one-out cross-validation approach was used to acquire the unbiased metabolomic signature score. For details, please see Supplementary Methods.
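The partitioning of cohorts in each leave-one-cohort-out iteration can be sketched as below. This only lays out which cohorts play which role; the meta-analysis and the elastic net Cox model are not implemented here. The function name and the cohort labels are illustrative assumptions.

```python
def leave_one_cohort_out(cohorts, training_cohort="WHI"):
    """List the role of each cohort in every cross-validation iteration.

    For each held-out cohort, the discovery set (used for the metabolome-wide
    meta-analysis) excludes both the held-out cohort and the training cohort.
    The training cohort itself is skipped here because, per the text, it is
    scored by a separate within-cohort leave-one-out procedure.
    """
    plans = []
    for held_out in cohorts:
        if held_out == training_cohort:
            continue
        discovery = [c for c in cohorts if c not in (held_out, training_cohort)]
        plans.append({
            "held_out": held_out,         # scored with the trained model
            "discovery": discovery,       # step 1: select metabolites (FDR < 0.05)
            "training": training_cohort,  # step 2: fit the regularized Cox model
        })
    return plans

labels = ["NHS", "NHS2", "HPFS", "SOL", "WHI", "PREDIMED", "ARIC_Black", "ARIC_white"]
plans = leave_one_cohort_out(labels)
```

Keeping the held-out cohort out of both the selection step and the fitting step is what prevents information from the test set leaking into the signature.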

The metabolomic signature scores, calculated in each held-out cohort, were then standardized. To evaluate whether the signature improved the T2D risk prediction, we fitted three sets of logistic (in SOL, and T2D nested case–control substudy in NHS) or Cox models (all other datasets): one model including only the metabolomic signature; a conventional risk factor model including age, sex, smoking, lipid-lowering medication use, anti-hypertensive medications, family history of diabetes, hypertension, dyslipidemia and BMI; and a third model including all conventional risk factors and the metabolomic signature. We compared the AUC between the conventional model versus the conventional plus metabolomic signature model. In a secondary analysis, we further included blood glucose (from metabolomic assays) in the conventional model to evaluate the added value of the metabolomic signatures beyond blood glucose.

In each cohort, we calculated the crude incidence rate of T2D across deciles of the signature score. We fitted logistic or Cox models to estimate the relative risk of T2D, comparing higher deciles with the lowest decile of the metabolomic signature and adjusting for the same covariates as in model 2 of the main analysis. In NHS/HPFS, SOL and WHI, we examined associations of the metabolomic signature with baseline risk factors by regressing the signature score on age, sex (if appropriate), current smoking status, BMI, PA, intakes of 15 main food groups and fasting status simultaneously, together with cohort-specific covariates, using linear regression. All analyses were conducted separately in each cohort, and results were combined using meta-analysis. FDR < 0.05 was considered statistically significant.
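As an illustration of the FDR threshold, the sketch below computes Benjamini-Hochberg adjusted P values in Python. This section does not state which FDR procedure was used, so Benjamini-Hochberg is an assumption here, and the P values are invented for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Invented P values for four hypothetical risk factors (assumption).
raw_p = [0.01, 0.04, 0.03, 0.50]
fdr = benjamini_hochberg(raw_p)
significant = [i for i, q in enumerate(fdr) if q < 0.05]
print(fdr, significant)
```

In this toy example only the first test passes FDR < 0.05, although three raw P values are below 0.05.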

We conducted two sensitivity analyses during model development. The first used SOL (which measured the most metabolites using the Metabolon platform) as the representative training cohort instead of WHI; this showed similar, albeit slightly weaker, model performance in held-out cohorts (Extended Data Fig. 8). The second compared elastic net with lasso regularization121, and showed that elastic net regression had comparable but slightly better performance than lasso regression (Supplementary Fig. 13). Separately from the leave-one-cohort-out cross-validation, we presented a final metabolomic signature model for future studies, developed using data from all study cohorts. For this model, we first conducted a metabolome-wide meta-analysis for T2D risk in all cohorts except WHI, and then used significant metabolites (FDR < 0.05) as input in a Cox regression with elastic net regularization for T2D prediction in WHI. The selected metabolites and their coefficients in this final model were highly consistent with those of the models applied to each held-out cohort (Supplementary Table 18a).
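The difference between the two regularizations can be made concrete with a short Python sketch of the penalty term in the form used by glmnet, lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(|beta|)), where alpha = 1 gives the lasso. The function name, coefficients, lambda and alpha values below are illustrative assumptions; this section does not report the tuning values used.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Penalty added to the model's loss in the glmnet parameterization.

    alpha = 1 is the lasso (L1 only); alpha = 0 is ridge (L2 only);
    values in between mix the two.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * ((1.0 - alpha) / 2.0 * l2 + alpha * l1)


# Illustrative coefficients and tuning values (assumptions).
beta = [1.0, -2.0]
lasso_penalty = elastic_net_penalty(beta, lam=0.5, alpha=1.0)
mixed_penalty = elastic_net_penalty(beta, lam=0.5, alpha=0.5)
print(lasso_penalty, mixed_penalty)
```

The quadratic part of the mixed penalty tends to keep groups of correlated predictors together, whereas the lasso tends to select one of them, which is one common reason to compare the two on correlated metabolite data.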

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.