Introduction

Metabolomics refers to comprehensive studies of amino acids, lipids, organic acids, or nucleotides, collectively known as metabolites, in biological systems. Metabolites’ levels change in response to genetic or environmental changes. Metabolomics can thus provide detailed information about the biochemical mechanisms happening in an organism, potentially leading to discoveries of important biomarkers that can be used to diagnose, monitor, or predict the risk of diseases1,2,3,4,5,6,7,8.

Generated by nuclear magnetic resonance (NMR) and/or mass spectrometry (MS), metabolomics datasets commonly involve tens of metabolites. Methods for analyzing static metabolomics data have been well developed9,10. However, longitudinal studies that track how metabolite levels evolve over time are becoming increasingly common11,12,13,14. Identifying the important metabolites responsible for the onset or progression of certain diseases is a challenging task, especially when dealing with complex datasets that include missing values. Efficient statistical methods that address these challenges within a coherent framework are, however, lacking. This need for sophisticated inference methods in high-dimensional longitudinal metabolomics datasets with missing values is the main motivation behind this article. The methods presented here, however, are broadly applicable to longitudinal studies beyond metabolomics, so the exposition of the statistical approach is kept fairly general.

The EarlyBird study: overview, inferential goals and analytical challenges

The rising prevalence of pre-diabetes and type 2 diabetes is a growing and alarming problem, associated with several short-term and long-term metabolic and cardiovascular complications15,16,17. However, the understanding of the underlying mechanisms that link the regulation of glucose and insulin in the early years of life is still incomplete.

The EarlyBird cohort study18,19 is a non-interventional prospective longitudinal study of healthy UK children, designed to explore how anthropometrics and clinical and metabolic processes are associated with glucose control during childhood and adolescence. The full cohort comprises 307 children, 170 of whom are boys (the sub-cohort considered in this study due to availability of metabolomics data comprises 129 subjects, 92 boys and 37 girls), who were followed up with medical examination on an annual basis from 5 to 16 years of age (12 time points). The collected data included anthropometrics, glucose, and insulin measures. In addition, a metabolic profiling approach was applied to the serum samples collected from the children at each time point, using proton nuclear magnetic resonance (1H-NMR) spectroscopy. This method allowed the collection of quantitative information on the serum content in lipoprotein-bound fatty acyl groups found in triglycerides, phospholipids and cholesteryl esters, and major low molecular weight molecules present in blood, such as amino acids and other major organic acids. Based on internal databases, several 1H-NMR signals are assigned and representative peaks are integrated to provide quantitative information on the different biochemical compounds. 1H-NMR signals that could not be assigned based on experimental datasets and internal reference databases are coded as U.X, where X corresponds to the 1H-NMR chemical shift where the signal was detected. The final dataset contains 82 1H-NMR-derived measurements (metabolites) for each time point, 4 of which were highly correlated and were removed to generate a second analysis set. The Materials and Methods section provides additional details.

Figure 1

Missingness patterns of samples in the subcohort of the EarlyBird study. The grey cells represent observed values, and the white cells represent missing values. From left to right - missingness patterns in fasting glucose levels, in metabolite samples, and in clinical variables. The red cells on the left panel show additional missing values present only in fasting glucose levels.

Figure 2

Number of missing values in the EarlyBird study in clinical variables (variables 1–10 on the y-axes) and metabolite samples (variable 11 on the y-axes) across different ages (left panel) and in total (right panel).

Figure 3

Mean longitudinal trajectories of four different standardized covariates. Clockwise from top left: unsaturated fatty acids, 3-hydroxybutyrate, creatinine, and acetate. The whiskers around the mean show one standard deviation unit in each direction.

There is evidence in both adults and children that glucose levels in the upper part of the normal range are indicative of future diabetes. One-third of children showing transient hyperglycemia in the absence of any serious illness can be expected to develop diabetes within one year20,21. Therefore, we consider serum glucose to be the principal indicator of disease propensity. We also consider fasting blood glucose at the age of 16 years to be closer to adult concentration values. Likewise, impaired fasting glycemia (IFG) identified at age 16 is considered to be a strong predictor of type 2 diabetes later in young adulthood. IFG is defined by the American Diabetes Association criteria as a fasting glucose level from 5.6 mmol/L (100 mg/dL) to 6.9 mmol/L (125 mg/dL).
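As a minimal illustration of the IFG range quoted above, the following sketch classifies a single fasting glucose value. The function names, the category labels, and the conversion factor of about 18.016 mg/dL per mmol/L (from the molar mass of glucose) are illustrative assumptions and are not part of the EarlyBird analysis.

```python
# Illustrative sketch only: the thresholds follow the IFG range quoted in the
# text; names, labels, and the unit-conversion factor are assumptions.

MGDL_PER_MMOLL = 18.016  # approx. mg/dL per mmol/L for glucose (molar mass ~180.16 g/mol)

IFG_LOWER_MMOLL = 5.6  # lower bound of the IFG range
IFG_UPPER_MMOLL = 6.9  # upper bound of the IFG range


def mmoll_to_mgdl(glucose_mmoll: float) -> float:
    """Convert a glucose concentration from mmol/L to mg/dL."""
    return glucose_mmoll * MGDL_PER_MMOLL


def classify_fasting_glucose(glucose_mmoll: float) -> str:
    """Return 'below_ifg', 'ifg', or 'above_ifg' for a fasting glucose value."""
    if glucose_mmoll < IFG_LOWER_MMOLL:
        return "below_ifg"
    if glucose_mmoll <= IFG_UPPER_MMOLL:
        return "ifg"
    return "above_ifg"
```

The rounded mg/dL figures in the text (100 and 125) differ slightly from exact conversions of 5.6 and 6.9 mmol/L, which is why the sketch compares on the mmol/L scale only.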

In the present study, we employ novel longitudinal models to assess the association between fasting glucose concentration (the response y) and individual clinical variables, metabolites and select anthropometric measurements (the covariates \({\textbf{x}}\)). Different metabolomic pathways have been postulated to elucidate the roles metabolites play in glucose production in the liver. The Cahill or glucose-alanine cycle22,23, for example, refers to the metabolic pathway of the transport of carbons and amino groups from the muscle to the liver, whereas the Cori or lactic acid cycle24 corresponds to the set of reactions that transports lactate from the muscle to the liver, where it is converted to glucose in the absence of oxygen and is then metabolized back to lactate to later re-enter the liver. The Krebs or citric acid cycle25,26, on the other hand, is important in synthesizing glucose into other key biochemical products. Assessing the relevance of different metabolites in influencing glucose concentration in the EarlyBird study helps quantitatively elucidate these relationships in glucose metabolism in children and adolescents.

Significantly compounding the analytical challenges, glucose concentration, clinical variables, and metabolites all contain missing values with different missingness patterns (Figs. 1 and 2). The metabolites have missing values when the blood sample is not present, either because the subject did not attend the annual visit or because the sample was of poor quality after storage or not of enough volume for the NMR analysis. Clinical data are also missing for similar reasons. Specifically, the clinical variables measuring physical activity and respiratory quotient (variables 5 and 9 in Fig. 2) required additional visits and additional effort from the participants. In the case of the first variable, the children had to wear a device to determine their physical activity. The respiratory quotient measures the basal metabolic rate and was obtained by putting a face-mask on the children for 30 minutes. These intrusive measurements showed poorer compliance than the rest of the measurements obtained during their annual visit and were missing when the participants did not attend the additional visits. Glucose concentration had some additional missing values, shown in red in Fig. 1, attributable to human recording errors. The missingness mechanisms may be assumed to not depend on the true unobserved values of the missing data points and hence ignorable in nature (See Background and Section S.1 in Supplementary Information).

In addition to missingness, the high correlation between some of the metabolites is also an issue. Multicollinearity is evident from plots of variance inflation factors (Supplementary Figs. S.11 and S.13). Furthermore, the trajectories of the predictors are also widely different, with variability changing over time (Fig. 3).

The challenges presented by this complex dataset therefore include the confounding effect of growth in the metabolic signal, the presence of differently patterned missing values in the predictors and the response, the high dimensionality of the predictors, and their widely variable trajectories.

Our inferential tasks include the imputation of the missing values in y and \({\textbf{x}}\), which can be used in a variety of downstream analyses, inference on the relationships between y and \({\textbf{x}}\), and the selection of important covariates.

Background

The literature on longitudinal data and missing values is extensive. See, for example, books27,28,29,30,31,32,33,34, and review papers35,36,37,38,39,40, and the references therein.

The Bayesian paradigm provides a useful framework for handling missing data29,41,42. Specifying an appropriate joint probability model for the observed data, missing data, missingness mechanism, and the associated model parameters, Bayesian inferential machinery can naturally accommodate problems with missing data. Uncertainty in imputing the missing values is taken into account, and their finite sample estimates can also be readily obtained from samples drawn from the posterior.

The missingness mechanism can be ignored when the missing values are missing at random (MAR), i.e., the missingness does not depend on the missing data conditional on the observed data43. Bayesian inference then naturally relies on working with a joint model \(p(y,{\textbf{x}})\). It is often convenient to factorize \(p(y,{\textbf{x}})\) as \(p({\textbf{x}}) p(y \vert {\textbf{x}})\) and then focus separately on \(p({\textbf{x}})\) and \(p(y\vert {\textbf{x}})\). Jointly imputing the missing \({\textbf{x}}\) values using a model \(p({\textbf{x}})\) that properly accommodates their dependencies is especially beneficial when the components are strongly correlated, as is the case in the EarlyBird cohort. It is also natural to exploit the relationship encoded in \(p(y \mid {\textbf{x}})\) to impute the missing y’s. This regression model can also play a role in imputing the missing \({\textbf{x}}\), although it is typically not very informative in that context. Flexible and innovative modeling strategies for \(p({\textbf{x}})\) and \(p(y \vert {\textbf{x}})\) are crucial in further simplifying the imputation tasks and making the inferential exercises more robust.

We first discuss the challenges in building flexible and useful marginal models \(p({\textbf{x}})\). For time-invariant multivariate covariates with missing values, it may be practically convenient to work with \(p(x_{j} \vert x_{1},\dots ,x_{j-1},x_{j+1},\dots ,x_{p})\) for each j44,45. However, such sequential regression models may not correspond to a valid joint probability model for \({\textbf{x}}\). A related strategy, without this limitation, factors the joint distribution as \(p(x_{1},\dots ,x_{p}) = p(x_{p} \vert x_{1},\dots ,x_{p-1}) \dots p(x_{2} \vert x_{1}) ~ p(x_{1})\) and then models each one-dimensional conditional distribution separately46,47. With an increase in the dimension of the covariates, specifying a separate model for each component quickly becomes a difficult task. Additional complications arise when the covariates also evolve temporally, as in the EarlyBird study. The longitudinal trajectories of the different covariates, which look widely different, now have to be additionally modeled. The correlation structure among the \(x_{j}\)’s may also be changing with time. The development of flexible and automated models for longitudinally evolving high-dimensional covariates, which accommodate widely varying individual trajectories without requiring a separate model to be specified and fine-tuned for each covariate while also allowing easy missing data imputation, is thus extremely challenging. Addressing this problem is an important focus of this article.

Building a regression model \(p(y \vert {\textbf{x}})\) poses additional challenges. For a small number of covariates, traditional linear mixed models can be considered48,49,50. Unconstrained analysis involving all covariates, however, leads to complicated models with inflated variance, loss of predictive power, and difficulty in interpretation. The identification of important predictors is also of practical and scientific importance. The literature on variable selection in static, complete data settings is enormous. In recent years, it has become common to rely on optimizing a goodness-of-fit loss function having a penalty added to favor parsimony. Famous methods of this type include LASSO51, adaptive LASSO52, SCAD53, and elastic net54 among others. Adaptations to missing-data problems are not straightforward as they involve computing and optimizing penalized log-likelihood functions based only on the observed data. However, the EM algorithm can potentially be used55,56,57.

Popular Bayesian strategies for variable selection include placing two-component ‘spike and slab’ priors on the regression coefficients58,59,60. Such priors place a point mass or spike at zero characterizing the redundant covariates and a continuous component or slab representing the signals. Adapting such approaches to missing data problems is conceptually straightforward by placing a probability model on the high-dimensional covariates and conducting posterior computation under the resulting joint model using MCMC61. However, the computational burden can be quite daunting, as even computation without missing predictors, or a model on \(p({{\textbf{x}}})\), is pretty challenging. To reduce the computational burden in the absence of missing data, it is popular to rely on continuous shrinkage priors that are concentrated at zero with heavy tails62,63,64,65. These methods can induce variable selection via thresholding63 or use of appropriate loss functions66,67. Adapting these techniques to build parsimonious regression models for longitudinal datasets involving high-dimensional covariates with ignorable missing values is another important goal of this article.

Toward these goals, we developed a detailed joint model to analyze the EarlyBird data set. The model was carefully designed to conform to a more general modeling framework, described in Section S.1 in the Supplementary Information, which simplifies the imputation tasks in both response and covariate values in longitudinal data with ignorable missingness. Specifically, we used nonparametric mean processes to capture widely varying covariate trajectories. We used latent factor formulations of the residual process to accommodate time-varying correlation structures, while meeting the related dimensionality challenges. Importantly, the implied conditional independence relationships also greatly simplified the imputation tasks. Taking a more structured approach, we model the response variable using a parametric regression model that accommodates the effects of the covariates and those of the aging process. Different classes of shrinkage priors stabilize estimation and help automate selection of the latent factors and important predictors via posterior variable selection summaries.

Materials and methods

Study population

The EarlyBird cohort study incorporates a 1995/1996 birth cohort recruited in 2000/2001 when the children were 5 years old (307 children, 170 boys). Several clinical and anthropometric variables were measured on an annual basis from the age of 5 to the age of 16. Untargeted metabolomics was performed in a subset of the full cohort (129 subjects, 92 boys), for a total number of 82 metabolites. The study was conducted in accordance with the ethics guidelines of the Declaration of Helsinki II; ethics approval was granted by the Plymouth Local Research Ethics Committee (1999), and parents gave written informed consent and children verbal informed assent.

Anthropometric variables

The 4 anthropometric variables are described in Table 1. BMI was derived from direct measurement of height (Leicester Height Measure; Child Growth Foundation, London, U.K.) and weight (Tanita Solar 1632 electronic scales), performed in blind duplicate and averaged. BMI SD scores were calculated from the British 1990 standards.

Table 1 Anthropometric and clinical variables recorded in the EarlyBird cohort.

Clinical variables

The 10 clinical variables recorded in the EarlyBird study are described in Table 1. Peripheral blood was collected annually into EDTA tubes after an overnight fast and stored at \(-80^\circ\)C. Insulin resistance (IR) and beta cell function were determined each year from fasting glucose (Cobas Integra 700 analyzer; Roche Diagnostics) and insulin (DPC IMMULITE; cross-reactivity with proinsulin, 1%) using the homeostasis model assessment (HOMA-IR and HOMA-B, respectively).

Serum metabolomics

To measure the metabolites, 400\(\mu\)L of blood serum were mixed with 200\(\mu\)L of deuterated phosphate buffer solution 0.6 M KH2PO4, containing 1mM of sodium 3-(trimethylsilyl)-[2,2,3,3-2H4]-1-propionate (TSP, chemical shift reference \(\delta\)H = 0.0ppm). 550\(\mu\)L of the mixture were transferred into 5mm NMR tubes. 1H-NMR metabolic profiles of serum samples were acquired with a Bruker Avance III 600 MHz spectrometer equipped with a 5mm cryoprobe at 310K (Bruker Biospin, Rheinstetten, Germany) and processed using the TOPSPIN software package (version 2.1, Bruker Biospin, Rheinstetten, Germany) as reported previously. A standard 1H-NMR one-dimensional pulse sequence with water suppression, a Carr-Purcell-Meiboom-Gill (CPMG) spin-echo sequence with water suppression, and diffusion-edited sequences were acquired using 32 scans with 98K data points. The spectral data (from \(\delta\)0.2 to \(\delta\)10) were imported into Matlab software with a resolution of 22K data-points (version R2013b, the Mathworks Inc, Natick, MA) and normalized to the total area after solvent peak removal. Poor quality or highly diluted spectra were discarded from the subsequent analysis.

The 1H-NMR spectrum of human blood plasma enables the monitoring of signals related to lipoprotein-bound fatty acyl groups found in triglycerides, phospholipids, and cholesteryl esters, together with peaks from the glyceryl moiety of triglycerides and the choline head group of phosphatidylcholine. These data also cover quantitative profiling of major low molecular weight molecules present in blood. Based on an internal database, representative signals of metabolites assignable on 1H CPMG NMR spectra were integrated, including asparagine, leucine, isoleucine, valine, 2-ketobutyric acid, 3-methyl-2-oxovaleric acid, alpha-ketoisovaleric acid, (R)-3-hydroxybutyric acid, lactic acid, alanine, arginine, lysine, acetic acid, N-acetyl glycoproteins, O-acetyl glycoproteins, acetoacetic acid, glutamic acid, glutamine, citric acid, dimethylglycine, creatine, citrulline, trimethylamine, trimethylamine N-oxide, taurine, proline, methanol, glycine, serine, creatinine, histidine, tyrosine, formic acid, phenylalanine, threonine, and glucose. In addition, in diffusion-edited spectra, signals associated with different lipid classes were integrated, including phospholipids containing choline, VLDL subclasses, and unsaturated and polyunsaturated fatty acids. The signals are expressed in arbitrary units corresponding to a peak area normalized to the total metabolic profile, which is representative of relative change in metabolite concentration in the serum.

Statistical analysis

Let \(y_{it}\) denote the response for subject i at time t, and \(x_{ijt}\) the associated \(j^{th}\) covariate, \(i=1,\dots ,n; t=1,\dots ,T; j=1,\dots ,p\).

Modeling the covariates

Simple parametric models are insufficiently flexible for accommodating the wide variety of shapes of the longitudinal covariate trajectories (Fig. 3). Fine-tuning such models individually for each separate predictor to adapt them to different shapes is practically infeasible in high-dimensional applications like ours. Ideally, we would want to build flexible automated models which can accommodate widely varying shapes without any supervision. To this end, we model the covariate-generating process as

$$\begin{aligned} & {\varvec{\mu }}_{x,t} = {\varvec{\mu }}_{x,t-1} + {\varvec{\epsilon }}_{t}, ~~~{\varvec{\epsilon }}_{t} \sim \hbox {MVN}_{p}({\textbf{0}},{\varvec{\Delta }}_{\epsilon }), \\ & {\textbf{x}}_{it} = {\varvec{\mu }}_{x,t} + {\textbf{b}}_{x,i} + {\varvec{\xi }}_{it}, ~~~{\textbf{b}}_{x,i} \sim \hbox {MVN}_{p}({\textbf{0}},{\varvec{\Delta }}_{x,b}), ~~~{\varvec{\xi }}_{it} \sim \hbox {MVN}_{p}({\textbf{0}},{\varvec{\Sigma }}_{x,t}). \end{aligned}$$

Here \(\hbox {MVN}_{p}({\varvec{\mu }},{\varvec{\Sigma }})\) denotes a p-variate normal distribution with mean vector \({\varvec{\mu }}\) and covariance matrix \({\varvec{\Sigma }}\). The Markovian but otherwise unstructured mean process \({\varvec{\mu }}_{x,t}^\textrm{T}=[\mu _{x,1t},\dots ,\mu _{x,pt}]\), that characterizes the temporal evolution of \([x_{1},\dots ,x_{p}]^\textrm{T}\), is crucial in accommodating widely varying trajectories of different \(x_{j}\)’s in an automated way. The associated error process \({\varvec{\epsilon }}_{t}\) has covariance \({\varvec{\Delta }}_{\epsilon }=\hbox {diag}\{\sigma _{\epsilon ,1}^{2},\dots ,\sigma _{\epsilon ,p}^{2}\}\). The random vector \({\textbf{b}}_{x,i}^\textrm{T}=[b_{x,i1},\dots ,b_{x,ip}]\), with covariance \({\varvec{\Delta }}_{x,b}=\hbox {diag}\{\sigma _{x,b,1}^{2},\dots ,\sigma _{x,b,p}^{2}\}\), collects individual-specific random effects. We assign conjugate inverse-Gamma priors on the variance parameters as

$$\begin{aligned} \sigma _{x,b,j}^{2} \sim \hbox {Inv-Ga}(a_{x,b,\sigma },b_{x,b,\sigma }), ~~~~\sigma _{\epsilon ,j}^{2} \sim \hbox {Inv-Ga}(a_{\epsilon ,\sigma },b_{\epsilon ,\sigma }). \end{aligned}$$

Here \(\hbox {Inv-Ga}(a,b)\) denotes an inverse-Gamma distribution with shape parameter a and scale parameter b.
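To make the generative structure above concrete, the following sketch simulates covariates from a random-walk mean process plus subject-specific random effects and residual noise. It is an illustration under simplifying assumptions, not the fitted model: the mean process is started at zero, the variances are fixed rather than drawn from their inverse-Gamma priors, and the residual covariance \({\varvec{\Sigma }}_{x,t}\) is taken to be diagonal and constant over time. The function name and default values are hypothetical.

```python
import random


def simulate_covariates(n, T, p, sd_eps=0.3, sd_b=0.5, sd_xi=0.2, seed=0):
    """Simulate x[i][t][j] = mu[t][j] + b[i][j] + xi[i][t][j].

    Assumptions for illustration: mu starts at zero, and the residual
    covariance Sigma_{x,t} is diagonal and constant in t.
    """
    rng = random.Random(seed)

    # Random-walk mean process: mu_t = mu_{t-1} + eps_t, independent across j.
    mu = []
    prev = [0.0] * p
    for _ in range(T):
        prev = [m + rng.gauss(0.0, sd_eps) for m in prev]
        mu.append(prev)

    # Subject-specific random effects b_i with diagonal covariance.
    b = [[rng.gauss(0.0, sd_b) for _ in range(p)] for _ in range(n)]

    # Observations: population mean + random effect + residual.
    x = [
        [
            [mu[t][j] + b[i][j] + rng.gauss(0.0, sd_xi) for j in range(p)]
            for t in range(T)
        ]
        for i in range(n)
    ]
    return mu, b, x
```

For example, `simulate_covariates(n=129, T=12, p=4)` returns trajectories for 129 subjects at 12 time points; because the mean process is left unstructured apart from its Markov dependence, each covariate can follow a differently shaped trajectory.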

Exploratory analysis indicated some of the covariates to be highly correlated with changing correlation patterns over time. Time indexed covariance matrices for the \({\varvec{\xi }}_{it}\), namely \({\varvec{\Sigma }}_{x,t}\), greatly improve model flexibility but their large dimensions also present significant modeling challenges. To address this issue, we consider factor analytic representations as

$$\begin{aligned} {\varvec{\xi }}_{it} = {\varvec{\Lambda }}_{t}{\varvec{\eta }}_{it} + {\textbf{u}}_{it}, ~~~{\varvec{\eta }}_{it} \sim \hbox {MVN}_{q_{t}}({\textbf{0}},\hbox {I}), ~~~{\textbf{u}}_{it} \sim \hbox {MVN}_{p}({\textbf{0}},{\varvec{\Delta }}_{u,t}), \end{aligned}$$

where \({\varvec{\Lambda }}_{t}=((\lambda _{tjh}))_{j=1,h=1}^{p,q_{t}}=[{\varvec{\lambda }}_{1t},\dots ,{\varvec{\lambda }}_{pt}]^\textrm{T}\) are \(p\times q_{t}\) loading matrices, \({\varvec{\eta }}_{it}^\textrm{T}=[\eta _{it1},\dots ,\eta _{itq_{t}}]\) are latent factors and \({\textbf{u}}_{it}\) are associated idiosyncratic errors with covariance matrix \({\varvec{\Delta }}_{u,t}=\hbox {diag}({\varvec{\sigma }}_{u,t}^{2})=\hbox {diag}(\sigma _{u,1t}^{2},\dots ,\sigma _{u,pt}^{2})\). Marginalizing out the latent factors, we have \({\varvec{\Sigma }}_{x,t}={\varvec{\Lambda }}_{t}{\varvec{\Lambda }}_{t}^\textrm{T}+{\varvec{\Delta }}_{u,t}\). Since any positive definite matrix \({\varvec{\Sigma }}^{p \times p}\) admits a low rank and diagonal matrix decomposition \({\varvec{\Sigma }}= {\varvec{\Lambda }}{\varvec{\Lambda }}^\textrm{T}+{\varvec{\Delta }}\) for some \({\varvec{\Lambda }}^{p \times q}\) and \({\varvec{\Delta }}= \hbox {diag}[\sigma _{1}^{2},\dots ,\sigma _{p}^{2}]\) for some \(0 \le q \le p\), in theory, the latent factor model is completely flexible. \({\varvec{\Sigma }}\) involves \(p(p+1)/2\) elements, whereas the number of parameters in a latent factor specification with q columns is \(p(q+1)\). Often \(q \ll p\) produces very good approximations of \({\varvec{\Sigma }}\) while achieving a significant reduction in the number of parameters. In practice, data-driven and automated selection of q can be achieved using sparsity-inducing priors as described below. The overall strategy is particularly relevant in our application with a high-dimensional \({\varvec{\Sigma }}_{x,t}\) at every t.
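The parameter counts stated above can be checked with a few lines of arithmetic. The helper names below are hypothetical, and the value of q used in the note that follows is chosen purely for illustration.

```python
def full_covariance_params(p: int) -> int:
    """Number of distinct elements in an unstructured p x p covariance matrix."""
    return p * (p + 1) // 2


def factor_model_params(p: int, q: int) -> int:
    """Number of parameters in Lambda (p x q) plus the p diagonal variances."""
    return p * (q + 1)
```

Taking p = 82 (the number of metabolite measurements) and a hypothetical q = 5, the unstructured covariance has 3403 distinct elements per time point, whereas the factor representation has 492.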

Separate latent factors \({\varvec{\eta }}_{it}\) for each time point t are still too flexible for high-dimensional settings like ours, especially since entire samples of covariates can be missing (Fig. 1), in which case the \({\varvec{\eta }}_{it}\) are informed entirely by the regression of \(y_{it}\) on \({\textbf{x}}_{it}\). Taking a middle path between a restrictive diagonal covariance matrix and a fully flexible model, we allow the latent factors \({\varvec{\eta }}_{i}\) to be shared across time points, further greatly reducing model complexity. Integrating out both \({\varvec{\eta }}_{i}\) and \({\textbf{b}}_{x,i}\) thereby induces flexible variance and cross-covariance structures \(\hbox {cov}({\textbf{x}}_{it}) = {\varvec{\Lambda }}_{t}{\varvec{\Lambda }}_{t}^\textrm{T}+ {\varvec{\Delta }}_{u,t} + {\varvec{\Delta }}_{x,b}\) for all t and \(\hbox {cov}({\textbf{x}}_{it},{\textbf{x}}_{it'}) = {\varvec{\Lambda }}_{t}{\varvec{\Lambda }}_{t'}^\textrm{T}+ {\varvec{\Delta }}_{x,b}\) for all \(t \ne t'\).
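The two covariance expressions above can be evaluated directly for small toy inputs. The sketch below does so with plain nested lists; the helper names are hypothetical, diagonal matrices are passed as lists of their diagonal entries, and the covariances are understood conditionally on the mean process.

```python
def matmul_t(A, B):
    """Return A @ B^T for list-of-lists matrices with equal column counts."""
    return [[sum(a * b for a, b in zip(row_a, row_b)) for row_b in B] for row_a in A]


def add_diag(M, d):
    """Return M + diag(d) without modifying M."""
    size = len(M)
    return [[M[j][k] + (d[j] if j == k else 0.0) for k in range(size)] for j in range(size)]


def marginal_cov(L_t, delta_u_t, delta_b):
    """cov(x_it) = Lambda_t Lambda_t^T + Delta_{u,t} + Delta_{x,b}."""
    return add_diag(add_diag(matmul_t(L_t, L_t), delta_u_t), delta_b)


def cross_cov(L_t, L_s, delta_b):
    """cov(x_it, x_is) = Lambda_t Lambda_s^T + Delta_{x,b} for t != s."""
    return add_diag(matmul_t(L_t, L_s), delta_b)
```

The idiosyncratic variances enter only the same-time covariance, while the shared factors and the random effects both contribute to the dependence between different time points.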

Precluding the necessity to pre-specify the number of latent factors, we next allow the loading matrices to have a-priori a potentially infinite number of columns. Sparsity-inducing priors, that favor more shrinkage as the column index increases, can then be used to shrink the redundant columns toward zero. We do this via the multiplicative gamma process shrinkage priors68 (MGPS) that allow easy posterior computation. For \(t=1,\dots ,T\) and \(h=1,\dots ,\infty\), we assign priors as follows

$$\begin{aligned} \lambda _{tjh}\sim & \hbox {Normal}(0,\phi _{\lambda ,tjh}^{-1}\tau _{\lambda ,th}^{-1}), ~~~~~\phi _{\lambda ,tjh} \sim \hbox {Ga}(\nu _{\lambda }/2,\nu _{\lambda }/2), \\ \tau _{\lambda ,th}= & \textstyle \prod _{\ell =1}^{h} \delta _{\lambda ,t\ell }, ~~~~~\delta _{\lambda ,t\ell } \sim \hbox {Ga}(a_{\lambda ,\ell },1), ~~~~~\sigma _{u,jt}^{2} \sim \hbox {Inv-Ga}(a_{u,\sigma },b_{u,\sigma }). \end{aligned}$$

Here \(\hbox {Normal}(\mu ,\sigma ^{2})\) denotes a Normal distribution with mean \(\mu\) and variance \(\sigma ^{2}\), and \(\hbox {Ga}(\alpha ,\beta )\) denotes a Gamma distribution with shape parameter \(\alpha\) and rate parameter \(\beta\). The parameters \(\{\phi _{\lambda ,tjh}\}_{j=1}^{p}\) control the local shrinkage of the elements in the \(h^{th}\) column of \({\varvec{\Lambda }}_{t}\), whereas \(\tau _{\lambda ,th}\) controls the global shrinkage. When \(a_{\lambda ,h} > 1\) for \(h=2,\dots ,\infty\), the sequence \(\{\tau _{\lambda ,th}\}_{h=1}^{\infty }\) is stochastically increasing and thus favors more shrinkage as the column index h increases. The shrinkage prior also helps alleviate rotational non-identifiability by assigning the strongest effects to the foremost factors.
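The increasing-shrinkage behavior of the MGPS prior can be seen by drawing the column-wise precisions for a single time point. The sketch below is illustrative only: it truncates the infinite sequence at a finite H, uses one shape parameter `a1` for the first column and a common `a2` for all later columns, and all function names are hypothetical. Note that Python's `random.gammavariate` takes a scale argument, which is the reciprocal of the rate used in the text.

```python
import random


def sample_mgps_tau(a1, a2, H, rng):
    """Draw tau_1..tau_H with tau_h = prod_{l<=h} delta_l, where
    delta_1 ~ Ga(a1, rate 1) and delta_l ~ Ga(a2, rate 1) for l >= 2."""
    taus, tau = [], 1.0
    for h in range(1, H + 1):
        shape = a1 if h == 1 else a2
        tau *= rng.gammavariate(shape, 1.0)  # second argument is the scale (= 1/rate)
        taus.append(tau)
    return taus


def expected_mgps_tau(a1, a2, H):
    """E[tau_h] = a1 * a2**(h-1): the deltas are independent and E[Ga(a, 1)] = a."""
    return [a1 * a2 ** (h - 1) for h in range(1, H + 1)]


def sample_loading(tau_h, nu, rng):
    """Draw lambda ~ Normal(0, 1 / (phi * tau_h)) with phi ~ Ga(nu/2, rate nu/2)."""
    phi = rng.gammavariate(nu / 2.0, 2.0 / nu)  # scale = 1 / rate
    return rng.gauss(0.0, (phi * tau_h) ** -0.5)
```

With `a2 > 1`, the expected precisions grow geometrically in the column index, so loadings in later columns are drawn with progressively smaller prior variance.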

Modeling the response

Conditional on the covariates \({\textbf{x}}_{it}\), we model the response generating process as

$$\begin{aligned}&y_{it} = \mu _{y,t} + b_{y,i} + {\textbf{x}}_{it}^\textrm{T}{\varvec{\beta }}+ v_{it}, ~~~b_{y,i} \sim \hbox {Normal}(0,\sigma _{y,b}^{2}), ~~~v_{it} \sim \hbox {Normal}(0,\sigma _{v}^{2}). \end{aligned}$$

The mean process \(\mu _{y,t}\) captures the temporal evolution of y and is modeled as

$$\begin{aligned} \mu _{y,t} = {\textbf{p}}_{s,t}^\textrm{T}{\varvec{\alpha }}, \end{aligned}$$

where \({\textbf{p}}_{s,t}^\textrm{T}=[1,f(t,1),\dots ,f(t,s)]\), \(f(t,\ell )\) being a normalized version of \(t^{\ell }\), and \({\varvec{\alpha }}^\textrm{T}=[\alpha _{0},\alpha _{1},\dots ,\alpha _{s}]\) denotes the associated coefficients. Sparsity-inducing priors, that favor more shrinkage as the degree of the polynomial increases, are used to favor simpler lower-degree relationships. We do this by adapting the MGPS priors as

$$\begin{aligned} & \alpha _{k} \sim \hbox {Normal}(0,\tau _{\alpha ,k}^{-1}),~~~\tau _{\alpha ,k} = \textstyle \prod _{\ell =1}^{k} \delta _{\alpha ,\ell }, ~~~~~\delta _{\alpha ,\ell } \sim \hbox {Ga}(a_{\alpha ,\ell },1). \end{aligned}$$

As before, when \(a_{\alpha ,\ell } > 1\) for \(\ell =2,\dots ,s\), the sequence \(\{\tau _{\alpha ,k}\}_{k=1}^{s}\) is stochastically increasing, thereby favoring more shrinkage of the higher order coefficients towards zero.
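A small sketch of the polynomial mean process may help fix ideas. The text does not specify how \(t^{\ell }\) is normalized, so the code below assumes \(f(t,\ell ) = (t/T)^{\ell }\), which keeps all basis terms in [0, 1]; this choice and the function names are assumptions made for illustration.

```python
def poly_basis(t, T, s):
    """Return [1, f(t,1), ..., f(t,s)] with f(t, l) = (t / T) ** l (assumed normalization)."""
    u = t / T
    return [u ** l for l in range(s + 1)]


def mean_response(t, T, alpha):
    """Evaluate mu_{y,t} = p_{s,t}^T alpha for alpha = [alpha_0, ..., alpha_s]."""
    basis = poly_basis(t, T, len(alpha) - 1)
    return sum(b * a for b, a in zip(basis, alpha))
```

If the higher-order entries of `alpha` are shrunk to values near zero, `mean_response` reduces to an approximately linear function of time, which is the simpler alternative the shrinkage prior favors.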

Spline-based semiparametric regression models69 could also be used to flexibly model \(\mu _{y,t}\). The polynomial regression model with MGPS shrinkage priors on the coefficients, however, allows us to straightforwardly assess departures from simpler parametric alternatives, including a first-degree linear model.

The variables \(b_{y,i}\), with variance \(\sigma _{y,b}^{2}\), denote individual specific random effects in \(y_{it}\). The effect of the predictor \({\textbf{x}}_{it}\) on \(y_{it}\) is captured via linear regression with coefficients \({\varvec{\beta }}^\textrm{T}=[\beta _{1},\dots ,\beta _{p}]\). To stabilize inference and favor the selection of important predictors in the presence of high-dimensional covariates with many possibly insignificant components, we use sparsity-inducing continuous priors on the regression coefficients. Unlike the columns of factor loading matrices and elements of \({\varvec{\alpha }}\), there is no natural prior ordering of the elements of \({\varvec{\beta }}\) making MGPS priors an inappropriate choice here.

The starting point of our search for an appropriate prior for \({\varvec{\beta }}\) is a ‘spike and slab’ prior \(\beta _{j} \sim \pi \delta _{0} + (1-\pi )\hbox {Normal}(0,\sigma _{\beta }^{2}), j=1,\dots ,p\). Such priors, however, often have poor performance in high-dimensional settings, especially when the parameter vector is highly sparse. Choosing \(\pi =1/2\), for example, leads to an exponentially small prior probability of \(2^{-p}\) assigned to the null model. Although this issue can be mitigated by assigning a hierarchical beta prior on \(\pi\)70, posterior sampling in high-dimensional settings will still require a stochastic search over an enormous space, leading to slow mixing and convergence71. Continuous shrinkage priors such as the Bayesian LASSO62, the horseshoe64 and the Dirichlet-Laplace (DL)65 mimic the spike and slab strategy by having a peak at zero to capture sparsity and heavy tails to capture signals, but alleviate the computational issues by allowing efficient posterior sampling through hierarchical auxiliary variable constructions. Importantly, however, unlike the Bayesian LASSO and the horseshoe that only mimic the marginal behavior of point mass mixture priors, the DL prior also mimics the joint behavior of hierarchical spike and slab type mixture priors, thereby having better control of the joint sparsity of the parameter vector.
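The prior mass that a spike and slab prior with fixed \(\pi\) places on the null model is easy to compute, which makes the scale of the problem tangible. The helper names below are hypothetical; the calculation assumes the p coefficients are a priori independent, as in the prior written above.

```python
def prior_prob_null_model(p: int, pi: float = 0.5) -> float:
    """Prior probability that all p coefficients are exactly zero when each
    coefficient is zero independently with probability pi."""
    return pi ** p


def prior_expected_model_size(p: int, pi: float = 0.5) -> float:
    """Prior expected number of nonzero coefficients, p * (1 - pi)."""
    return p * (1.0 - pi)
```

For instance, with p = 82 and \(\pi = 1/2\) the prior expects 41 active predictors and assigns probability \(2^{-82}\) to the model with none, which is at odds with a belief that only a handful of covariates matter.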

A DL prior on the regression coefficients can be specified as

$$\begin{aligned} \beta _{j} \sim \hbox {DE}(\phi _{\beta ,j}\tau _{\beta }), ~~~ (\phi _{\beta ,1},\dots ,\phi _{\beta ,p}) \sim \hbox {Dir}(a_{\beta },\dots ,a_{\beta }), ~~~\textstyle \tau _{\beta }\sim \hbox {Ga}(p a_{\beta },1/2). \end{aligned}$$

Here \(\hbox {DE}(\sigma )\) denotes a double exponential distribution with location 0 and scale \(\sigma\), and \(\hbox {Dir}(\alpha _{1},\dots ,\alpha _{p})\) denotes a Dirichlet distribution with concentration parameter \((\alpha _{1},\dots ,\alpha _{p})\). For \(a_{\beta }<1\), with \(\phi _{\beta ,j}\) integrated out, the marginal distribution of \(\beta _{j}\) given \(\tau _{\beta }\) has a singularity at zero. The parameter \(\tau _{\beta }\) determines how the tails of the marginal distribution decay as \(\left| \beta _{j}\right|\) increases. The hyper-prior on \(\tau _{\beta }\) allows uncertainty in this parameter and learning from the data. The DL prior can be equivalently represented as

$$\begin{aligned} & \beta _{j} \sim \hbox {Normal}(0,\psi _{\beta ,j}\phi _{\beta ,j}^{2}\tau _{\beta }^{2}), ~~~\psi _{\beta ,j} \sim \hbox {Exp}(1/2), \\ & (\phi _{\beta ,1},\dots ,\phi _{\beta ,p}) \sim \hbox {Dir}(a_{\beta },\dots ,a_{\beta }), ~~~\textstyle \tau _{\beta }\sim \hbox {Ga}(p a_{\beta },1/2), \end{aligned}$$

which facilitates straightforward posterior computation via an efficient block Gibbs sampler.
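
As an illustration of the Gaussian scale-mixture representation, the following Python sketch draws one realization of \({\varvec{\beta }}\) from the DL prior. It assumes that \(\hbox {Exp}(1/2)\) and \(\hbox {Ga}(p a_{\beta },1/2)\) are rate-parameterized; the function name `sample_dl_prior` and the values of `p` and `a` are hypothetical choices for illustration, and the block is not the sampler used in our analysis.

```python
import math
import random

def sample_dl_prior(p, a, rng):
    """Draw (beta, phi, tau) from the DL prior via its Gaussian scale-mixture form.

    Assumes rate parameterizations: Exp(1/2) has mean 2, Ga(p*a, 1/2) has mean 2*p*a.
    """
    # A Dirichlet(a, ..., a) draw is obtained by normalizing independent Gamma(a, 1) variates.
    g = [rng.gammavariate(a, 1.0) for _ in range(p)]
    total = sum(g)
    phi = [x / total for x in g]
    # random.gammavariate takes (shape, scale), so rate 1/2 becomes scale 2.
    tau = rng.gammavariate(p * a, 2.0)
    beta = []
    for j in range(p):
        psi = rng.expovariate(0.5)  # expovariate takes a rate
        sd = math.sqrt(psi) * phi[j] * tau
        beta.append(rng.gauss(0.0, sd))
    return beta, phi, tau

rng = random.Random(1)
beta, phi, tau = sample_dl_prior(p=10, a=0.5, rng=rng)
print([round(b, 3) for b in beta])
```

Integrating out \(\psi _{\beta ,j}\) recovers the double exponential form, since a normal distribution whose variance is an exponentially distributed multiple of \(s^{2}\) with rate \(1/2\) is \(\hbox {DE}(s)\).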

Finally, we assign conjugate priors on the variance parameters as

$$\begin{aligned} \sigma _{y,b}^{2} \sim \hbox {Inv-Ga}(a_{y,b,\sigma },b_{y,b,\sigma }), ~~~~\sigma _{v}^{2} \sim \hbox {Inv-Ga}(a_{v,\sigma },b_{v,\sigma }). \end{aligned}$$
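
The convenience of such conjugate priors can be seen in a generic example. The Python sketch below assumes zero-mean Gaussian residuals \(r_{1},\dots ,r_{n}\) with common variance \(\sigma ^{2} \sim \hbox {Inv-Ga}(a,b)\), in which case the full conditional is \(\hbox {Inv-Ga}(a+n/2,\, b+\sum _{i} r_{i}^{2}/2)\). The function names are hypothetical, and the actual full conditionals of our sampler are those given in the Supplementary Information.

```python
import random

def inv_gamma_posterior(a, b, residuals):
    """Posterior (shape, scale) of sigma^2 ~ Inv-Ga(a, b) given r_i ~ Normal(0, sigma^2)."""
    n = len(residuals)
    ss = sum(r * r for r in residuals)
    return a + n / 2.0, b + ss / 2.0

def sample_inv_gamma(shape, scale, rng):
    """If G ~ Gamma(shape, rate=scale), then 1/G ~ Inv-Ga(shape, scale)."""
    # random.gammavariate takes (shape, scale), so the rate is inverted here.
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)

rng = random.Random(7)
shape, scale = inv_gamma_posterior(2.0, 1.0, [1.0, -1.0, 2.0, 0.0])
print(shape, scale, sample_inv_gamma(shape, scale, rng))
```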

In many applications, especially in epidemiological studies with children and young adults as subjects, the natural growing or aging process might be expected to have some effect on the outcome of primary interest y. The mean process \(\mu _{y,t}\) separates this effect from that of the covariates. Unlike \({\varvec{\mu }}_{x,t}\), which captures the temporal evolution of \({\textbf{x}}_{it}\) but is of no independent interest to us and hence is left unstructured, \(\mu _{y,t}\) describes the natural evolution of y as the subjects age, and understanding it is often an important goal in epidemiological studies. To this end, we took a more structured approach towards modeling \(\mu _{y,t}\) and used a polynomial function of time, which is easy to interpret and for which related tests of hypotheses can, if needed, be easily performed.

Posterior inference and missing data imputation

For posterior inference, we rely on samples drawn from the posterior using a Markov chain Monte Carlo (MCMC) sampler, missing data imputation being a naturally integrated part of the sampler. The model described above is designed carefully to conform to a more generic framework for longitudinal data with ignorable missingness, described in Section S.1 in the Supplementary Information, which simplifies the imputation steps of the MCMC sampler for both missing response and covariate values.

The design of the sampler exploits the conjugacy of the priors and the conditional independence relationships encoded in different layers of the hierarchy. The latent factor formulation of the model for the covariates plays a particularly important role in inducing conditional independence between different components of \({\textbf{x}}_{it}\) for each i at each t, simplifying the sampling of their missing values. We also evaluate post-processing model and variable selection criteria based on samples drawn from the posterior. See Section S.2 in the Supplementary Information for additional details.
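
The role of conditional independence in the imputation step can be sketched as follows. The Python block assumes a generic factor model in which, given the latent factors, component j of a covariate vector is distributed as Normal with mean `mu[j]` plus the inner product of `loadings[j]` and `eta`, and standard deviation `resid_sd[j]`. All names here are hypothetical stand-ins for the corresponding model quantities, and the block is a simplified illustration rather than the sampler detailed in Section S.2.

```python
import random

def impute_missing(x, mu, loadings, eta, resid_sd, rng):
    """Fill None entries of x with draws from Normal(mu_j + loadings_j . eta, resid_sd_j^2).

    Given the latent factors eta, the components are conditionally independent,
    so each missing entry is drawn separately without any matrix inversion.
    """
    out = list(x)
    for j, value in enumerate(x):
        if value is None:
            mean_j = mu[j] + sum(l * e for l, e in zip(loadings[j], eta))
            out[j] = rng.gauss(mean_j, resid_sd[j])
    return out

rng = random.Random(11)
x_obs = [1.0, None, None]            # one subject at one time point
mu = [0.0, 1.0, 2.0]
loadings = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
eta = [2.0, 4.0]
print(impute_missing(x_obs, mu, loadings, eta, [0.1, 0.1, 0.1], rng))
```

Within an MCMC sampler, a step of this kind would be repeated at every iteration with the current values of the factors and parameters, so that imputation uncertainty propagates to the posterior.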

Results

This section summarizes the results of our proposed approach applied to the EarlyBird dataset. To our knowledge, this is the first report of the comprehensive contribution of specific metabolic processes to overall blood glucose variations, assessed in a longitudinal and continuous manner. The findings highlight the importance of specific metabolites in amino acid, ketone body, glycolysis, and fatty acid metabolism in describing the variations of blood glucose throughout childhood.

We performed two sets of analyses: first with all 96 predictors included, and then after removing the last 4 metabolites, which were the most strongly collinear. Some degree of multicollinearity was still present in the second set of 92 predictors (Supplementary Fig. S.13). Our proposed shrinkage prior based approach is robust to the presence of multicollinearity, and the results obtained in the two cases were very similar. A competitor method, described later in this section and referred to as the ‘lme’ method, cannot handle multicollinearity well. The results produced by this method were generally unstable, markedly more so in the first scenario. Owing to space constraints, we summarize here only the results of the second set of analyses, which were more favorable to the lme method.

Figure 4 illustrates the excellent performance of our model in capturing widely varying individual and average trajectories of the time-varying covariates. In Fig. 4 and similar figures, the ages 5, 6, ..., 16 are represented by the time points 1, 2, ..., 12. Additional figures presented in the Supplementary Information (boxplots of observed and fitted values of the time-varying predictors across different time points in Fig. S.9 and plots of empirical correlations between time-varying predictors and the corresponding model estimates in Fig. S.13) show the effectiveness of our latent factor based approach in capturing the widely different and time-varying variance-covariance structures of the high-dimensional time-varying predictors.

Figure 5 illustrates the results of the regression model, relating fasting glucose concentrations to age, 4 baseline anthropometric variables, 10 time-varying clinical variables and 78 time-varying metabolites. The average fasting glucose concentration levels vary nonlinearly with age (Fig. 5a and b). Among the key contributors to glucose trajectories, we observed several positive contributors to glucose variations, including glucose-derived variables (HOMA-B, HOMA-IR), BMI z scores, and 1H-NMR derived quantitative signals from glucose, lactate, and alanine (time-varying predictors 3, 4, 10, 27, 28, 66, 76 and 80 in Fig. 5c). In addition, we identified other metabolites with a negative contribution to glucose variation, including 1H-NMR derived quantitative signals from citrate, the ketone body 3-D-hydroxybutyrate, leucine, asparagine, and lipoprotein-bound fatty acyl groups (time-varying predictors 11, 16, 23, 46, 49, 56, 64 and 77 in Fig. 5c). These results confirm the expected relationships among glucose, insulin resistance (HOMA-IR), and beta cell function (HOMA-B), as well as cross-platform measurement relationships (serum glucose by enzymatic assay and 1H-NMR spectroscopy).

Figure 4

Results for the EarlyBird study: Observed (solid lines) and fitted (dotted lines) trajectories for the first 2 time-varying predictors for 5 randomly selected subjects super-imposed over time-specific sample means across all subjects (solid black line) and the corresponding fitted values (dotted black line). The bullets represent the mean imputed missing values, assumed to be equal to the unknown true values for plotting purposes.

Figure 5

Results for the EarlyBird study: (a) Estimated posterior means of components of \({\varvec{\alpha }}\) and their \(90\%\) credible intervals. (b) Observed (solid lines) and fitted (dotted lines) trajectories of y for 5 randomly selected subjects super-imposed over time-specific sample means across all subjects (solid black line) and the corresponding fitted values (dotted black line). The bullets represent mean imputed missing values, assumed to equal the unknown true values for plotting purposes. (c) Estimated posterior means of components of \({\varvec{\beta }}\) and their \(90\%\) credible intervals. The first 4 components correspond to anthropometric baseline predictors. The remaining 88 components correspond to the time-varying predictors, comprising 10 clinical variables and 78 metabolites.

Figure 6

Post-processing variable selection results for the EarlyBird study: Model size vs the corresponding excess error \(\psi _{\lambda }\). See Section S.2.4 in the Supplementary Information for additional details.

The model thus allowed us to link childhood blood glucose variations with very different circulating levels of metabolites in the blood that correspond to the different central energy metabolic cascades. This is also reflected by the increasing contribution of lactate and alanine (time-varying predictors 27 and 28 in Fig. 5c), which correspond to a higher contribution of the Cori and Cahill cycles for glucose production, respectively. It is worth noting that these molecular variations occur in a period of high metabolic flexibility during which children's bodies switch from a high fat-protein basal metabolism towards a more carbohydrate-dependent metabolic state. For instance, the model strongly highlights how circulating levels of the ketone body 3-D-hydroxybutyrate and citrate (a key intermediate in the Krebs cycle) (time-varying predictors 23, 46, 49 and 77 in Fig. 5c) decrease concomitantly with the decrease in the circulation of some fatty acids over time. This corresponds to a trend towards lower fluxes of fatty acids to the liver for energy production. The production of the amino acid asparagine is closely connected to another Krebs cycle intermediate, oxaloacetate, and therefore the model may be describing these additional biological relationships.

In our discussion of the results so far, we have considered covariates whose coefficients have marginal posterior credible intervals not including zero to be potentially important predictors of glucose concentrations. Taking a liberal approach, we also considered the covariates associated with highly variable coefficients, taking non-negligible values at least in some MCMC iterations, to be potentially important. Conservative post-processing variable selection guidelines, described in Section S.2.4 in the Supplementary Information, lead instead to a model with 12 variables (Fig. 6). One such model, with predictors chosen to correspond to the coefficients with the largest absolute posterior means, includes (normalized versions of) \([1,t,t^{2},t^{5}]\) and the clinical variables and metabolites HOMA-B, HOMA-IR, BMI, lipids (mainly HDL, fatty acid CH3 moieties), an unknown metabolite (U.2.21), citrate, asparagine and 1H-NMR derived quantitative signals from glucose (time-varying predictors 3, 4, 10, 11, 44, 49, 56 and 66).
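
The credible-interval criterion used above can be sketched in a few lines of Python. The block assumes equal-tailed intervals computed from posterior draws by linear interpolation of order statistics; the function names and the toy draws are hypothetical, and the more conservative procedure of Section S.2.4 is not reproduced here.

```python
def credible_interval(draws, level=0.90):
    """Equal-tailed interval from posterior draws, interpolating between order statistics."""
    s = sorted(draws)

    def quantile(q):
        pos = q * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])

    tail = (1.0 - level) / 2.0
    return quantile(tail), quantile(1.0 - tail)

def excludes_zero(draws, level=0.90):
    """Flag a coefficient whose credible interval lies entirely on one side of zero."""
    lower, upper = credible_interval(draws, level)
    return lower > 0.0 or upper < 0.0

# Toy posterior draws for two coefficients (illustrative values only).
draws_a = [0.2 + 0.01 * k for k in range(100)]    # concentrated away from zero
draws_b = [-0.5 + 0.01 * k for k in range(100)]   # straddles zero
print(excludes_zero(draws_a), excludes_zero(draws_b))
```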

Table 2 Deviance information criterion (DIC) and log pseudo marginal likelihood (LPML) estimates for the proposed Bayesian semiparametric latent factor based model (BSP-LF), and a related sub-model with diagonal covariance matrices for the covariates (BSP-Diag).

We also fitted a simpler sub-model for the covariates, referred to as the BSP-Diag model henceforth, \({\textbf{x}}_{it} = {\varvec{\mu }}_{x,t} + {\textbf{b}}_{x,i} + {\textbf{u}}_{it}, ~{\textbf{b}}_{x,i} \sim \hbox {MVN}_{p}({\textbf{0}},{\varvec{\Delta }}_{x,b}), ~{\textbf{u}}_{it} \sim \hbox {MVN}_{p}({\textbf{0}},{\varvec{\Delta }}_{u,t})\), excluding latent factors and assuming diagonal covariance matrices \({\varvec{\Sigma }}_{x,t} = {\varvec{\Delta }}_{u,t}\) for all t and hence independence of the covariate components.

Table 2 reports the estimated deviance information criterion (DIC)72 and log pseudo marginal likelihood (LPML)73 for the two Bayesian methods: the proposed latent factor based method and the diagonal covariance model described above. Compared to its diagonal covariance restriction, the proposed latent factor based method clearly provides a much better fit to the EarlyBird data set.
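
For readers unfamiliar with these criteria, the Python sketch below computes both from MCMC output using their standard definitions: \(\hbox {DIC} = {\bar{D}} + p_{D}\) with \(p_{D} = {\bar{D}} - D({\bar{\theta }})\) and \(D = -2 \times\) log-likelihood, and LPML as the sum of log conditional predictive ordinates, each estimated by the harmonic mean of the likelihood across draws. The input layout (`loglik[s][i]` for draw s and observation i) and the function names are assumptions made for illustration, and the estimators used for Table 2 may differ in detail.

```python
import math

def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def lpml(loglik):
    """loglik[s][i] = log p(y_i | theta^(s)); returns the sum over i of log CPO_i."""
    n_obs = len(loglik[0])
    total = 0.0
    for i in range(n_obs):
        # CPO_i is the harmonic mean of the likelihood of y_i across draws.
        neg = [-row[i] for row in loglik]
        total += -log_mean_exp(neg)
    return total

def dic(loglik, loglik_at_mean):
    """DIC = Dbar + pD, where loglik_at_mean is the log-likelihood at the posterior mean."""
    d_bar = -2.0 * sum(sum(row) for row in loglik) / len(loglik)
    d_hat = -2.0 * loglik_at_mean
    return d_bar + (d_bar - d_hat)

toy = [[-1.0, -2.0], [-1.2, -1.8], [-0.9, -2.1]]  # 3 draws, 2 observations
print(lpml(toy), dic(toy, loglik_at_mean=-2.9))
```

Smaller DIC values and larger LPML values indicate a better fit.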

We also implemented a standard lme approach, in which we first imputed the missing values using cross means, implemented using the longitudinalData package in R, and then fitted a linear mixed model to the completed dataset, implemented using the lme4 package in R. Owing to space constraints, we omit the results produced by the BSP-Diag method and discuss here in greater detail only the results produced by the lme method.
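
The imputation step of this competitor can be illustrated with a short Python sketch. It assumes that cross-mean imputation replaces each missing value with the mean of the values observed at the same time point across subjects; the block does not call the longitudinalData package, and the function name `cross_mean_impute` is hypothetical.

```python
def cross_mean_impute(data):
    """data[i][t] is the value for subject i at time t, with None marking a missing entry."""
    n_times = len(data[0])
    means = []
    for t in range(n_times):
        observed = [row[t] for row in data if row[t] is not None]
        # A time point with no observed values is left missing.
        means.append(sum(observed) / len(observed) if observed else None)
    return [[means[t] if v is None else v for t, v in enumerate(row)] for row in data]

toy = [[1.0, None], [3.0, 4.0], [None, 6.0]]  # 3 subjects, 2 time points
print(cross_mean_impute(toy))
```

Unlike the imputation integrated in the MCMC sampler, a single deterministic fill-in of this kind ignores both subject-specific information and the uncertainty in the imputed values.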

Figure 7

Results for the EarlyBird study obtained by the lme method: (a) Estimated values of \({\varvec{\alpha }}\) and their \(90\%\) confidence intervals. (b) Observed (solid lines) and fitted (dotted lines) trajectories of y for 5 randomly selected subjects super-imposed over time-specific sample means across all subjects (solid black line) and the corresponding fitted values (dotted black line). The bullets represent mean imputed missing values, assumed to equal the unknown true values for plotting purposes. (c) Estimated values of \({\varvec{\beta }}\) and their \(90\%\) confidence intervals. The first 4 components correspond to anthropometric baseline predictors. The remaining 88 components correspond to the time-varying predictors, comprising 10 clinical variables and 78 metabolites.

The results for the lme method are summarized in Fig. 7. Most regression coefficients had high estimated variance (Fig. 7; note the much larger y-axis scales compared to Fig. 5). The covariates important according to the lme method (coefficients with confidence intervals not including zero) were the anthropometric variables sex and weight (baseline predictors 1, 2) and the clinical variables and metabolites BMI SD, insulin derived measures, HOMA-B, HOMA-IR, physical activity, skin thickness, waist circumference, isoleucine, N-acetylcysteine, glutamate, polyunsaturated fatty acids and phenylalanine (time-varying predictors 1, 2, 3, 4, 5, 6, 8, 10, 17, 38, 47, 50, 61). However, the lme method did not perform well in realistic simulation settings (Section S.3, Figs. S.2, S.3, S.7, and S.8 in the Supplementary Information), which undermines the reliability of these results.

Discussion

In this article, we considered the problem of estimating the longitudinal evolution of an outcome of interest in the presence of high-dimensional predictors comprising baseline covariates as well as longitudinally evolving ones when both the response and the time-varying covariates included missing values. We developed flexible statistical frameworks for inference in such problems when the missingness is ignorable in nature. Nonparametric mean processes captured widely varying covariate trajectories. Flexible, yet parsimonious, latent factor formulations of high-dimensional time-varying covariance matrices helped meet daunting dimensionality challenges, while also greatly simplifying the imputation tasks. Different classes of shrinkage priors automated the selection of latent factors and significantly important predictors, effectively guarding against model overfitting. An efficient Markov chain Monte Carlo algorithm accounted for uncertainty in various aspects of the analysis, including the imputation tasks.

Our assumption of ignorable missingness could be justified by the design of the EarlyBird study but may not be realistic in other scenarios, including studies involving mass-spectrometry-based metabolic data. Ongoing directions of research include extensions of the proposed methodology to accommodate time-varying covariate effects, more flexible covariance patterns, more general mixed effects regression models, unequally spaced time points, non-ignorable missing values, and nonlinear regression relationships.

Some of the metabolites and clinical variables identified as the best time-varying predictors of fasting glucose were known and expected, such as BMI, glucose-derived variables (HOMA-IR and HOMA-B) and a by-product of glucose metabolism (lactate). Additionally, metabolites like citrate, known for its impact on energy metabolism, and leucine, which influences insulin secretion, were also among the identified predictors. Others, however, such as asparagine, may have indirect and context-dependent effects that are worth exploring in more detail.

The sparsity-inducing priors in our work, originally designed for high-dimensional \(n \ll p\) settings, are scalable to hundreds of predictors. We thus anticipate our method to also perform well under more extreme values of n and p. However, prior studies have not examined these methods in complex, longitudinal settings with missing data, as we do here. Testing their limits would thus require large-scale simulations with extreme n and p values, which, lacking a real-world problem at this scale, lies beyond the current scope but could be pursued in future work.

Additional information

Supplementary Information accompanying this paper presents a more generic modeling framework for longitudinal data with ignorable missingness, which simplifies the imputation tasks for both missing response and covariate values, details of the MCMC algorithm used to sample from the posterior, post-processing model selection and variable selection procedures, a simulation study evaluating the proposed method in synthetic settings, and a few additional figures summarizing the results of the EarlyBird application and the simulation studies.