Introduction

The goal of precision medicine is to provide medical care in which disease prevention and treatment is tailored to the genetic background and lifestyle of each person. At its core lies the challenge of identifying biomarkers as a means to stratify patients.

Multi-omic analysis is emerging as a new paradigm to obtain a more comprehensive understanding of pathological mechanisms since it provides a holistic approach to deciphering disease etiology. For instance, genomics allows us to determine the genetic makeup in each person, and owing to the availability of catalogs of genetic variants associated with a myriad of traits, we can create individual risk profiles. Other omic data like transcriptomics, metabolomics, and proteomics provide a more refined view into regulatory networks, allowing for a deeper understanding of the biological processes at work within an individual. Multi-omic analysis has been implemented in various longitudinal studies, generating a wealth of useful information regarding the modulation of biomolecular levels and pathways during healthy and pathological states1,2, as well as modeling health outcomes in relation to microbiome interactions3.

In recent years, there has been a growing recognition of the importance of understanding processes that occur subclinically, before the onset of symptoms. These early molecular and cellular changes often precede overt pathology and represent critical opportunities for intervention and prevention4. Nevertheless, despite the considerable amount of information that has accumulated over the past years, and the numerous phenotypic databases available, we have not yet established the biological insights we can derive from a cross-sectional integration analysis of multi-omic data obtained from healthy individuals.

For personal omic profiling to be actionable in health, we need to establish three key points: a) the capability of each omic layer to stratify healthy individuals, b) the possibility of predicting prospective phenotypic manifestations for a certain group (referred to as practical stratification in the manuscript) and c) the kind of early prevention strategies we can implement based on multi-omic integration from a healthy state. To address these gaps, we set out to perform a cross-sectional integrative study of three omic layers on a cohort of 162 individuals without pathological phenotypes. We analyzed the omic layers individually and after integration, to evaluate their ability to stratify individuals into relevant subgroups and establish the potential strengths and weaknesses of precision medicine applications focused on stratification. Moreover, a subset of 61 individuals was followed longitudinally and provided omic data at two further visits, two years after the first visit and one year after the second visit. Since temporal stability has previously been observed in certain omic layers derived from peripheral blood of healthy individuals5, the availability of these longitudinal data allowed us to validate the stability of the molecular profiles for each identified subgroup, a critical aspect for prevention strategies as highlighted recently6.

Overall, our results show that integration of different omic layers was necessary to obtain practical stratification, and that certain clusters displayed classification consistency over time, highlighting the stability of their molecular profiles.

Results

In Fig. 1 we show the experimental design of the multi-omic integration that led to the stratification of the cohort into subgroups of individuals with potentially different underlying health predispositions. The analysis also evaluated the temporal stability of these stratifications across timepoints, providing insights into the consistency of the molecular profiles. The following sections present the steps that governed the stratification process.

Fig. 1: Flow chart representation of the analytical pipeline.

A cohort of 162 ostensibly healthy individuals provided three omic layers on the first visit. These omic layers became the input of an analytical pipeline that stratified the cohort into four clusters. Moreover, a subset of 61 individuals provided serum lipoproteomic data at two additional timepoints. These data were used to evaluate the robustness of the initial classification.

Cohort characteristics

We generated genomic, metabolomic, and lipoproteomic profiles for a cohort of healthy individuals. The cohort consisted of 162 individuals, with a majority (61.1%) being females. The participants were primarily in their fifth decade (mean = 44.39, sd = 8.46) and exhibited an average BMI of 24.5 kg/m² (sd = 3.95). On average, the levels of glucose, cholesterol, TG, and liver enzymes, along with other common analytical features, all remained within the normal range (Table 1). The collection of the omic data for the whole cohort occurred in 2019. For a subset of 61 individuals, serum lipoproteomic data were also collected at two additional time-points. One occurred around 2 years from the first visit (mean = 769 days, sd = 5 days) and the other around 1 year from the second visit (mean = 379 days, sd = 3 days).

Table 1 Cohort characteristics and analytical features from a basic biochemical blood analysis

Cohort displayed close genetic ancestry with European populations

We first evaluated the genetic ancestral composition of the individuals in our cohort. After merging the genomic datasets (WES and GSA), we removed indels and identified 557,317 single-nucleotide variants (SNVs), which we imputed to 9,620,831 variants using the Michigan Imputation Server. From the latter callset, we extracted 161,157 independent variants (r² ≤ 0.1) with allele frequency ≥0.05 and performed PCA against the 1000 Genomes reference populations in order to evaluate the ancestral background of the cohort. The cohort presented close genetic ancestry with European populations, showing the strongest similarity with the Iberian population in Spain. Interestingly, we observed a continuous cline that reflects the variable amount of Basque genetic ancestry present in the current gene pool of the Spanish Basque Country region (Fig. 2).
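The selection of common, approximately independent variants can be illustrated with a minimal sketch. This is not the pipeline used in the study: the function names, the toy genotypes and the greedy all-pairs pruning are assumptions made for illustration, whereas dedicated tools usually prune within sliding genomic windows. Only the two thresholds (allele frequency ≥0.05 and r² ≤ 0.1) are taken from the text.

```python
def minor_allele_frequency(dosages):
    """MAF from per-individual alternate-allele counts (0, 1 or 2)."""
    alt_freq = sum(dosages) / (2 * len(dosages))
    return min(alt_freq, 1 - alt_freq)


def r_squared(x, y):
    """Squared Pearson correlation between two dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic variant carries no LD information
    return sxy * sxy / (sxx * syy)


def prune_variants(genotypes, maf_min=0.05, r2_max=0.1):
    """Keep common variants, then greedily drop any variant that is
    correlated (r^2 > r2_max) with a variant that was already kept."""
    kept = []
    for name, dosages in genotypes.items():
        if minor_allele_frequency(dosages) < maf_min:
            continue
        if all(r_squared(dosages, genotypes[k]) <= r2_max for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy genotypes for eight individuals (not study data).
genotypes = {
    "var_a": [0, 0, 2, 2, 0, 0, 2, 2],
    "var_b": [0, 0, 2, 2, 0, 0, 2, 1],  # in strong LD with var_a
    "var_c": [0, 2, 0, 2, 0, 2, 0, 2],  # uncorrelated with var_a
    "var_d": [0, 0, 0, 0, 0, 0, 0, 0],  # monomorphic, fails the MAF filter
}

print(prune_variants(genotypes))
```

In this toy example only `var_a` and `var_c` survive, since `var_b` is almost a copy of `var_a` and `var_d` has no minor allele.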

Fig. 2: PCA using cohort’s 161,157 independent variants and 1000 Genomes reference populations.

Population annotations in panel B: cohort (COH), African (AFR), Admixed American (AMR), East Asian (EAS), European (EUR), South Asian (SAS). Population annotations in panel A: cohort (COH), British in England and Scotland (GBR), Finnish in Finland (FIN), Iberian population in Spain (IBS), Utah residents with Northern and Western European ancestry (CEU) and Toscani in Italia (TSI).

Loss-of-function variants provided limited ability for phenotypic characterization of the cohort

Since we wanted to evaluate the stratification capability of each omic layer, we studied them separately starting with the genomic layer. We identified 164 LoF variants (see Methods) that fall in a non-canonical splice site and eight in a single exon transcript. The genes associated with the latter are genes related to either protein-protein interactions (WDR5B), taste receptors (TAS2R3), olfactory signaling (OR51B5, OR56A5, OR4C16, OR5AR1 and OR4M2) or diseases such as scleral staphyloma and peach allergy (NACA2) that have not been clinically reported in the cohort. On the other hand, the LoF variants that fall in a non-canonical splice site have especially been reported in the central nervous system7. Overall, the interpretation in the context of our healthy cohort is not unequivocal and it carries limited practical actionability.

The cohort was characterized by diverse Mendelian conditions with unknown probability of manifestation

Continuing with the genomic layer, we identified 257 pathogenic variants related to 62 Mendelian conditions and involving 39 genes (see Methods). These alleles had an average minor allele frequency (MAF) of 0.15 (sd = 0.11), suggesting limited evolutionary control by natural selection and hence few implications for the carrying individual. Regarding damaging variants, we identified 3639 variants (MAF: mean = 0.12, sd = 0.12) linked to 607 Mendelian conditions and 429 genes, whereas 84 variants (MAF: mean = 0.14, sd = 0.12) and 11 genes were annotated as having an association with 13 disorders.

As a way to filter these results, we focused on the top-5 Mendelian conditions that were related to those genes that accumulated the highest number of damaging variants based on MetaSVM. This strategy identified myopathy (TTN, 13/186 damaging variants, mean MAF 0.03), bronchiectasis and pancreatitis (CFTR, 9/12 damaging variants, mean MAF 0.04), macular degeneration (ABCA4, 6/15 damaging variants, mean MAF 0.05), erythermalgia (SCN9A, 5/10 damaging variants, mean MAF 0.02) and Alport syndrome 3 (COL4A3, 5/15 damaging variants, mean MAF 0.11). As expected in the context of a healthy cohort, we did not observe any symptoms related to the aforementioned conditions, and we consider the probability to develop such manifestations low given the age of the participants.

Regarding drug-related conditions, we identified toxicity towards 5-fluorouracil (DPYD, 1/3 damaging variants, mean MAF 0.12), warfarin sensitivity (CYP2C9, 0/6 damaging variants, mean MAF 0.04), warfarin resistance (VKORC1, 0/3 damaging variants, mean MAF 0.08), codeine sensitivity (CYP2D6, 0/3 damaging variants, mean MAF 0.15) and debrisoquine sensitivity (CYP2D6, 0/3 damaging variants, mean MAF 0.15). These findings provide useful information regarding the selection of treatment, with sensitivity to debrisoquine representing a particular case. More specifically, one of the drug therapies reported in the cohort concerned medication for hypertension, with 7 individuals being under treatment and two of them carrying minor alleles of the rs3892097 and rs1065852 variants of CYP2D6.

Bringing it all together, although this kind of approach may provide some useful information for prescribing clinicians, overall it fails to provide actionable information for the development of a prevention strategy since it yields a diverse phenotypic landscape for each individual that lacks a probabilistic context that is related to manifestation.

Polygenic scores highlight certain biomolecular traits with potential ability to stratify the cohort, albeit with substantial uncertainty

As an alternative to the previous approaches on the genomic layer, polygenic scores (PGS) provide a probabilistic framework for stratification by estimating genetic predispositions to various phenotypes. For each biomolecular trait available in our serum metabolomic and lipoproteomic datasets, we assessed the odds of each individual to have levels ≥75th percentile observed in the cohort provided that their corresponding PGS was ≥75th percentile observed in the cohort. We capitalized on the genetic estimation models provided by OMICSPRED to calculate the corresponding PGS (see Methods) and performed logistic regression using biomolecular traits as response variables and PGS as predictors, correcting at the same time for age, gender and BMI (Supplementary note table ST1).
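The quantity being estimated can be illustrated with a simplified sketch. The code below computes an unadjusted odds ratio with a Wald 95% confidence interval from a 2×2 table of upper-quartile PGS against upper-quartile trait levels. It is not the covariate-adjusted logistic regression described above: the function names, the toy values and the Haldane-Anscombe correction for empty cells are assumptions made for illustration.

```python
import math
import statistics


def upper_quartile_flags(values):
    """Flag values at or above the 75th percentile of the sample."""
    cutoff = statistics.quantiles(values, n=4, method="inclusive")[2]
    return [v >= cutoff for v in values]


def contingency_counts(high_pgs, high_trait):
    """Cells of the 2x2 table: (both high, PGS only, trait only, neither)."""
    pairs = list(zip(high_pgs, high_trait))
    a = sum(1 for p, t in pairs if p and t)
    b = sum(1 for p, t in pairs if p and not t)
    c = sum(1 for p, t in pairs if not p and t)
    d = sum(1 for p, t in pairs if not p and not t)
    return a, b, c, d


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.
    Adds 0.5 to every cell when any cell is empty (Haldane-Anscombe)."""
    if 0 in (a, b, c, d):
        a, b, c, d = (v + 0.5 for v in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical PGS and trait levels for eight individuals (not study data).
pgs = [1, 2, 3, 4, 5, 6, 7, 8]
trait = [2, 1, 4, 3, 6, 5, 9, 7]
counts = contingency_counts(upper_quartile_flags(pgs), upper_quartile_flags(trait))
print(counts, odds_ratio_with_ci(*counts))

# A hypothetical larger table: the interval stays wide with few high/high cases.
print(odds_ratio_with_ci(6, 4, 4, 26))
```

With so few individuals in the high/high cell, the interval is wide even when the point estimate is large, which is the pattern a small cohort is expected to produce.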

We identified 28 traits with the potential to stratify the cohort. However, most traits displayed odds-ratios close to 2, with confidence intervals often including 1 (Fig. 3A). Two traits, namely glycine and triglycerides in medium HDL, presented an odds-ratio close to 6. However, in both cases the 95% CI was wide, implying uncertainty regarding the conditions under which an individual displays high levels. This heterogeneity can be linked to the small sample size of the cohort, as well as to factors associated with lifestyle, since glycine and triglyceride levels are influenced by diet.

Fig. 3: Odds-ratios for biomolecular traits with potential to stratify the cohort and distributions of cohort genotypes for GLDC variants.

A Odds-ratios expressing the odds of observing levels ≥75th percentile if PGS is ≥75th percentile. Error bars represent the 95% confidence interval of the odds-ratio. B Boxplots for glycine levels based on the number of minor alleles for three variants of the GLDC gene.

Triglycerides in HDL particles have been positively associated with high risk of myocardial infarction8, whereas glycine has been postulated to have a protective role9. After querying the LitVar2 portal10 for the variants associated with the glycine OMICSPRED model, we identified three variants (rs10975629, rs1061407, rs4419859) of the glycine decarboxylase gene (GLDC), which has been associated with non-ketotic hyperglycinemia (NKH). Inspection of the cohort genotypes (Fig. 3B) showed that only carriers of rs4419859 and rs1061407 displayed the highest levels of glycine. Nevertheless, this observation concerned only a very small fraction of the carriers. This indicates the presence of confounding factors that would remain unknown even if we increased the sample size. Additionally, of the two variants, rs1061407 is classified as benign for NKH by dbSNP, whereas rs4419859 is an intron variant with unknown clinical significance.

Cohort displayed small variability across metabolomic and lipoproteomic features

We then moved to the metabolomic and lipoproteomic layers to assess their ability to effectively stratify the cohort. We performed principal component analysis (PCA) after standardizing all included features (Fig. 4A). The first two PCs captured 42% of the observed variance, with gender mainly separating individuals across PC1 as expected due to physiological differences between males and females.

Fig. 4: Exploratory analysis of the metabolomic and lipoproteomic features of the cohort.

A Left panel: PCA of the cohort using the factors age, BMI, urine, and serum metabolomic data and serum lipoproteomic data. PCA was divided into four quadrants Q1, Q2, Q3 and Q4. Right panel: Correlation of the factors included in the PCA with the first two PCs. For the omic data, the loadings of their features have been averaged. B Comparison of the individuals clustered in each PCA quadrant for the factors included in the PCA. The levels of the omic data have been log2-transformed. Horizontal lines represent median value.

We then divided the PCA into four quadrants (Q1, Q2, Q3, and Q4) to study the individuals based on their colocation. Based on the feature loadings, age correlated equally with both PCs and influenced clustering towards the Q1 quadrant. Closer inspection revealed that this quadrant accumulated male and female individuals with higher median age than the rest of the individuals of the same gender (Fig. 4B). BMI presented a strong correlation with PC1, especially impacting clustering towards Q2. We observed that this factor drove the clustering of male and female individuals with high BMI towards the Q1 and Q2 quadrants, with the latter accumulating females with a higher median BMI than the rest of this gender (Fig. 4B).

For the omic data, we evaluated their net impact on the clustering by averaging the loadings of their features. Lipoproteomic data had a stronger impact on the clustering of individuals than the metabolomic data (Fig. 4A), with serum lipoproteins influencing the clustering towards Q1 (Fig. 4B). However, the differences in lipoprotein levels between these individuals and the rest of the cohort were very small, corroborating the high uncertainty we previously reported on the odds-ratios using PGS. In a similar manner, we observed small inter-quadrant variability in the serum and urine metabolomic data (Fig. 4B), which led to the smallest impact on the clustering (Fig. 4A).

Multi-omic integration separated individuals into four distinct clusters, with one cluster accumulating individuals with dyslipoproteinemias

We integrated the lipoproteomic layer, which showed the highest clustering effect during PCA, with the abovementioned polygenic scores from the OMICSPRED models. Initially, we regressed out the factors age, gender and BMI from each lipoprotein due to their strong influence on the cohort clustering. We then combined the resulting data with the previously calculated PGS into a topological analysis with UMAP11 to study non-linear relationships within the cohort. Finally, we performed k-means clustering, which divided the cohort into four distinct clusters (C1, C2, C3, and C4) (Fig. 5A).
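The covariate adjustment that precedes the embedding can be sketched as ordinary least squares followed by extraction of the residuals. The example below is a minimal illustration and not the study's implementation: the function names and the toy values are assumptions, gender is encoded as a hypothetical 0/1 indicator, and the subsequent UMAP and k-means steps are not reproduced.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def residualize(y, covariates):
    """Residuals of y after a least-squares fit on the covariates
    (an intercept column is added automatically)."""
    design = [[1.0] + list(row) for row in covariates]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    return [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(design, y)]


# Hypothetical (age, gender indicator, BMI) rows and one lipoprotein level.
covariates = [
    (30, 0, 22.0), (45, 1, 27.5), (52, 0, 24.1),
    (38, 1, 30.2), (61, 0, 26.3), (47, 1, 21.8),
]
lipoprotein = [98.3, 110.1, 104.9, 112.6, 109.8, 103.2]

residuals = residualize(lipoprotein, covariates)
print([round(r, 3) for r in residuals])
```

By construction the residuals are uncorrelated with each covariate in a linear sense, so any remaining dependence on age or BMI must be non-linear.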

Fig. 5: UMAP with 70 PGS from OMICSPRED models and serum lipoproteomic data after regressing out age, gender and BMI.

A Individuals were annotated based on the cluster they fell into (C1, C2, C3, and C4) after performing k-means clustering with four centroids. B Individuals were annotated based on the assigned dyslipoproteinemias identified with the diagnostic algorithm of Sniderman. Contours represent percentiles concerning the number of individuals that are included in each cluster.

We next investigated whether the accumulation of individuals in the same cluster was driven by underlying biological factors. As a first step, we identified genes whose variants are associated with a high number of OMICSPRED models that we used for the PGS calculation, with the rationale that these genes might have an important role in driving the clustering. This strategy highlighted six variants (mean MAF = 0.19, sd = 0.16) that are associated with the gene APOB, fall within an intergenic region and are related to more than 40 out of the 70 biomolecular traits.

Since APOB is linked to vascular diseases, we implemented the diagnostic algorithm provided by Sniderman et al.12 to identify individuals with potential underlying dyslipoproteinemias. The algorithm resembles a decision-tree classifier that groups individuals into six major dyslipoproteinemias, based on their levels of total cholesterol, triglycerides and apolipoprotein B (ApoB). Out of the 162 individuals, 13 individuals were identified with chylomicron and VLDL remnants, four with increased VLDL particles, three with increased LDL particles and two with increased VLDL and LDL particles. Interestingly, the vast majority of these individuals were collocated in the C4 cluster (Fig. 5B). This indicates that UMAP yielded practical clustering from a biological and phenotypic perspective.

Sniderman et al. provide diverse primary and secondary causes that could be associated with such manifestations. Some are related to familial incidence, for instance familial dysbetalipoproteinemia and hypertriglyceridemia, with genes such as APOB, APOE, APOC1 and PCSK9 being linked to such familial conditions13,14,15,16. Variants associated with familial conditions represent rare mutations that would require much larger sample sizes in order to evaluate their impact. However, driven by an interest to observe the distribution of the individuals that might carry such rare mutations, we focused on the variants rs11591147 (MAF = 0.01) and rs148195424 (MAF = 0.01) that have been associated with hypocholesterolemia16. We identified six individuals that carry one risk allele of rs11591147 and two individuals with one risk allele of rs148195424. Interestingly, from the eight carriers, none of them was collocated in the C4 cluster, whereas the distribution for the rest of the clusters was two carriers in C1, four in C2 and two in C3. This serves to emphasize even further the biological underpinnings behind the formation of the UMAP clusters, since the C4 cluster that accumulated individuals with dyslipoproteinemias did not include carriers of variants related to hypocholesterolemia.

The derived insights from the multi-omic data integration were transferable to measurements derived from a routine hematological examination

To assess whether the observed stratification could be replicated with a conventional blood test, we incorporated a dataset derived from a routine hematological examination from a different blood sample collected the same day as the serum omic data. This dataset included complete blood counts and measurements from a basic metabolic panel, and it was not incorporated in any step of the thus far presented analytical pipeline.

After regressing out the factors age, gender and BMI from the measurements, we compared the levels of the basic metabolic panel between the UMAP clusters that were previously identified (Fig. 6). Interestingly, we observed a distinct profile for the individuals of the C4 cluster, who presented the highest levels of triglycerides and the lowest levels of high-density lipoprotein (HDL). Similar levels of HDL were also observed in the C3 cluster; however, this cluster displayed a different profile from C4 when HDL is subtracted from the total cholesterol (noHDL), as well as lower levels of low-density lipoprotein (LDL) than the rest of the clusters. This highlights the consistency of the clustering across different datasets.

Fig. 6: Measurements of a basic metabolic panel derived from a routine blood test of the cohort after regressing out the factors age, gender and BMI.

The measurements were used to compare the UMAP clusters (x-axis) that were identified after the data integration. Individuals have been annotated based on the dyslipoproteinemias identified by the Sniderman diagnostic algorithm. Horizontal lines denote Tukey-Kramer’s post-hoc test p-value ≤ 0.01.

The individuals that were classified with dyslipoproteinemias displayed the highest levels of triglycerides, LDL and noHDL. Nevertheless, the high uncertainty previously observed during the calculation of the odds-ratios (Fig. 3A) became apparent, since we also observed overlap between the levels of classified and unclassified individuals (Normal).

Furthermore, the clustering was influenced by the factors age and BMI, even though they were controlled for (Fig. 7). More specifically, the C4 cluster accumulated individuals with overall age and BMI above the cohort’s median, whereas the opposite was observed for the individuals in the C2 and C3 clusters. This indicates that linear approaches are not sufficient to fully control for these factors, unless higher order polynomial terms are introduced. However, such strategies increase the risk of overfitting and hinder interpretation.

Fig. 7: Characteristics of UMAP clusters for adjusted factors.

Comparison of UMAP clusters for the factors age, gender and BMI.

Individuals in the C4 cluster displayed the smallest conversion rate according to Naive Bayes classification

To evaluate the temporal stability of the initial clustering from the first visit, we trained a Gaussian Naive Bayes (GNB) classifier (see Methods) using the cluster label as output and the levels of 59 lipoproteins, quantified across all visits, as predictors. The classifier was then used to predict the cluster labels for a subset of 61 individuals who provided data of serum lipoproteins for two additional visits.
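The classification step can be illustrated with a small from-scratch Gaussian Naive Bayes sketch. It is an assumption-laden toy and not the study's classifier (see Methods): the function names, the two-feature inputs and the variance floor are hypothetical, whereas the study used 59 lipoproteins as predictors.

```python
import math
from collections import defaultdict


def fit_gnb(features, labels, var_floor=1e-9):
    """Estimate, for each class, the log-prior and the per-feature
    mean and variance (features are treated as independent Gaussians)."""
    grouped = defaultdict(list)
    for row, label in zip(features, labels):
        grouped[label].append(row)
    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(labels)), means, variances)
    return model


def predict_gnb(model, row):
    """Return the class with the highest posterior log-probability."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))


# Hypothetical first-visit lipoprotein levels with their cluster labels.
train_features = [
    [1.0, 2.0], [1.2, 1.8], [0.8, 2.2],   # cluster C1
    [3.0, 0.5], [3.2, 0.7], [2.8, 0.4],   # cluster C4
]
train_labels = ["C1", "C1", "C1", "C4", "C4", "C4"]

model = fit_gnb(train_features, train_labels)

# Hypothetical follow-up measurements for two individuals.
print(predict_gnb(model, [1.1, 2.1]), predict_gnb(model, [2.9, 0.6]))
```

Training on first-visit labels and predicting on later visits, as sketched here, yields one predicted label per individual and visit, which is the input needed to assess label stability.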

We assessed cluster stability using the conversion rate, defined as the percentage of individuals whose predicted cluster label changed at least once across visits. After adjusting for the size of each cluster, we observed that individuals in the C4 cluster displayed the smallest conversion rate, with 96% of the individuals retaining the same cluster label across the three visits (Fig. 8), compared with 83% in C1, 83% in C2 and 93% in C3. This indicates that the multi-omic profile of the C4 cluster presents higher consistency over time, supplementing the previous observations associated with the distinctiveness of this cluster.
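The conversion rate as defined above can be computed with a few lines of code. In this sketch, the function name, the identifiers and the labels are hypothetical, and expressing the rate as a percentage within each baseline cluster is our reading of the adjustment for cluster size.

```python
from collections import defaultdict


def conversion_rates(baseline_cluster, predicted_by_visit):
    """Percentage of individuals, per baseline cluster, whose predicted
    label changed at least once across the available visits."""
    totals = defaultdict(int)
    changed = defaultdict(int)
    for person, cluster in baseline_cluster.items():
        totals[cluster] += 1
        if len(set(predicted_by_visit[person])) > 1:
            changed[cluster] += 1
    return {cluster: 100.0 * changed[cluster] / totals[cluster] for cluster in totals}


# Hypothetical baseline clusters and predicted labels over three visits.
baseline_cluster = {"p1": "C1", "p2": "C1", "p3": "C4", "p4": "C4"}
predicted_by_visit = {
    "p1": ["C1", "C1", "C1"],
    "p2": ["C1", "C2", "C1"],  # converts at the second visit
    "p3": ["C4", "C4", "C4"],
    "p4": ["C4", "C4", "C4"],
}

rates = conversion_rates(baseline_cluster, predicted_by_visit)
print(rates)
```

The retention percentage reported for each cluster is simply 100 minus the corresponding conversion rate.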

Fig. 8: Results from Naive Bayes classification of individuals in each cluster across three visits using serum lipoproteomic data and 70 PGS from OMICSPRED models.

Individuals who have been identified with dyslipoproteinemias are annotated with a red outline. The size of each data point in the panels expresses the classification probability assigned by the classifier.

APOE, PCSK9, and LPA highlighted the difference in the C4 cluster regarding LDL levels

To identify which genes influence the formation of the clusters, we performed a ridge regression on each cluster separately (see Methods), considering the main fractions of the lipoproteins HDL, IDL, LDL, and VLDL (Supplementary note Fig. SF2) and calculated monogenic scores for 28 genes. We then used the estimated coefficients of the latter to perform a hierarchical clustering (Fig. 9).
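The regression step can be illustrated with a closed-form ridge estimate, obtained by solving (XᵀX + λI)β = Xᵀy on centered data. This is a sketch under stated assumptions and not the procedure in Methods: the function names, the penalty values and the two toy predictors (standing in for monogenic scores) are hypothetical, and the subsequent hierarchical clustering of the coefficients is not reproduced.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def ridge_coefficients(x_rows, y, penalty):
    """Ridge slopes on centered data; the intercept is not penalized."""
    n, p = len(y), len(x_rows[0])
    col_means = [sum(row[j] for row in x_rows) / n for j in range(p)]
    y_mean = sum(y) / n
    xc = [[row[j] - col_means[j] for j in range(p)] for row in x_rows]
    yc = [value - y_mean for value in y]
    gram = [
        [sum(r[i] * r[j] for r in xc) + (penalty if i == j else 0.0) for j in range(p)]
        for i in range(p)
    ]
    xty = [sum(r[i] * v for r, v in zip(xc, yc)) for i in range(p)]
    return solve_linear_system(gram, xty)


# Hypothetical scores for two genes and an LDL fraction built without
# noise as 1 + 2 * score_1 - 1 * score_2.
scores = [(0, 1), (1, 0), (2, 1), (1, 2), (0, 0), (2, 2)]
ldl = [1 + 2 * s1 - s2 for s1, s2 in scores]

print(ridge_coefficients(scores, ldl, penalty=0.0))   # ordinary least squares
print(ridge_coefficients(scores, ldl, penalty=10.0))  # shrunk towards zero
```

With a zero penalty the sketch recovers the generating slopes, while a larger penalty shrinks both coefficients towards zero, which stabilizes estimates when many correlated predictors are fitted on a small cluster.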

Fig. 9: Hierarchical clustering using the estimated coefficients of ridge regressions from each cluster.

A Ridge regression used main fractions of HDL as response variable and monogenic scores for 28 genes as predictors. B Ridge regression used main fractions of IDL as response variable and monogenic scores for 28 genes as predictors. C Ridge regression used main fractions of LDL as response variable and monogenic scores for 28 genes as predictors. D Ridge regression used main fractions of VLDL as response variable and monogenic scores for 28 genes as predictors.

Initially, we did not observe a statistically significant accumulation of minor allele carriers within a specific cluster after conducting a chi-square test. Nevertheless, the hierarchical clustering revealed a dichotomous effect of the transcript variants of the considered genes on the clusters for the lipoproteins IDL and VLDL, with the genes ABCA1 and LPA having a pronounced impact. Moreover, the C4 cluster displayed a different profile from the rest of the clusters regarding LDL, with the genes APOE, PCSK9, and LPA driving the observed differences. As far as HDL is concerned, even though the C1 and C2 clusters displayed the highest levels (Supplementary Note Fig. SF2), their levels were influenced positively by different genes, namely ABCA1 and PCSK9, respectively.

The mean effects of transcript variants of the genes ABCA1, LPA, APOE, and PCSK9, as well as the distribution of the minor alleles across the four clusters, are given in the Supplementary note Figs. SF3, SF4, SF5 and SF6 respectively.

Discussion

The main goal of precision medicine is the identification of patterns from biological data that could lead to patient stratification and targeted treatments. There are many remaining challenges, including data integration, privacy, and gaps in understanding genomic functions and patient recruitment17,18.

To the best of our knowledge, our study is the first endeavor to explore cross-sectional multi-omic integration under a non-pathological state, aiming at providing insights to assist the outset of precision medicine implementations towards prevention. To evaluate the current prospects for precision medicine, our main focus was to implement an integrative approach of three omic layers derived from a cohort of 162 ostensibly healthy individuals, and identify patterns that are relevant to certain groups of the cohort, instead of rare characteristics of a few individuals.

Multi-omic integration allowed us to stratify the cohort into four distinct clusters, even though each omic layer displayed weak stratification power when examined separately. For instance, the initial clustering using the urine and serum omic data met our expectations related to the absence of any apparent pathological condition in the cohort, as prima facie we did not observe groups with distinct characteristics. Instead, the minor interindividual differences were amplified by demographic factors such as age, gender, and BMI whose discerning impact has been previously reported on the serum bile acids profiles of healthy persons19. Thus, we accounted for these factors during the downstream analysis.

The genomic data revealed a large number of potential Mendelian conditions with unknown probability of manifestation. Additionally, even if we were to evaluate these findings on an individual basis, an extensive family history would have been necessary. On the other hand, variant annotations provided by ClinVar and MetaSVM aim to elucidate the pathogenicity of the identified variants and reduce the phenotypic complexity. However, conflicting annotations and lack of manifestation for conditions associated with common damaging variants hindered the interpretation of the obtained results.

The calculated PGS for 70 serum biomolecular features highlighted the discriminatory potential of the traits triglycerides in medium HDL and glycine; however, we observed great uncertainty over the estimates, which can be linked to the lifestyle of each individual, the small sample size, and the lack of individuals with high or pathogenic levels. Nevertheless, the PGS guided the downstream analysis towards phenotypes associated with dyslipoproteinemias due to their relation to the highlighted biomolecular traits.

Regarding the identified clusters after integration, even though we controlled for the impact of age, gender and BMI, the influence of the latter persisted, causing a statistically non-significant but noticeable aggregation of individuals with BMI above the cohort’s median in a distinct cluster (C4). This indicates the need for refined methodologies to minimize the residual effects of such factors. Nevertheless, these individuals presented noteworthy differences associated with an aggravated dyslipidemic profile. Although this could be regarded as an outcome of a given lifestyle, this cluster displayed higher classification consistency over time, higher triglyceride levels in the peripheral blood and a distinct profile associated with the impact on LDL serum levels of transcript variants of genes involved in lipid metabolism. Even though we did not observe a statistically significant accumulation of certain transcript variants in this cluster, the obtained results indicate that the clustering is driven by the cumulative influence of numerous subtle features rather than by the dominant impact of a few features. This, combined with the narrow variability in feature levels across individuals in the cohort, makes it challenging to extract and interpret the latent components that drive interindividual differences under an ostensibly healthy state.

Through the integrative exercise in this challenging setting, we identified several weaknesses that hinder the development of precision medicine in healthy contexts towards prevention. These include the difficulty in evaluating the pathogenicity of certain aforementioned findings, which contrasts with the ease of discovery through already available genomic databases, and the significant amount of information that needs to be parsed, curated and structured to allow the emergence of relationships and the interpretation of the findings from the integration of the multi-omic layers. Developing unified bioinformatic frameworks for such implementations is a must, even more so considering the increasing tendency towards localized efforts to develop precision medicine approaches in specific regional communities.

Regarding limitations, our study lacks an external replication. While our primary objective focused on exploring the potential of multi-omic integration within a cohort of healthy individuals, the lack of an independent validation dataset restricts the generalizability of our findings. However, robust validation of multi-omic integration results is hindered by the fact that it necessitates datasets generated using consistent methodologies, particularly concerning NMR-based metabolomics, and ideally, from populations with similar ancestries, given the known population-specific performance of PGS. Moreover, even as multi-omic profiling becomes more accessible, most available datasets remain cross-sectional and lack harmonization across platforms and populations, hence limiting our ability to validate integration strategies in external cohorts.

Thus, even though this study does not provide direct actionable insights to be implemented in the clinical practice, our exercise confirms that integration of different omic layers is a promising methodology to stratify healthy individuals, filter the immense phenotypic diversity that is initially assigned to each group and provide phenotypic clues that assist towards the development of prevention strategies.

Methods

Cohort ethics approval and consent to participate

Participants belong to the AKRIBEA cohort, a large-scale precision medicine initiative from the Basque Country led by CIC bioGUNE20. The cohort is recruited through annual medical check-ups in the Basque region, and hence it is particularly enriched for Basque genetic ancestry. Following the principles of the Declaration of Helsinki, all individuals provided informed consent for clinical research, and the study was evaluated by the ethics committee for investigation with medicinal products of Euskadi-Basque Country (CEIC-E 16–114 and CEIC-E 19-13). To protect patient confidentiality, all data were double-coded.

Whole-exome library preparation and sequencing

The quantity and quality of the DNAs were evaluated with Qubit dsDNA Broad Range Assay Kit and Agarose gels, respectively. Sequencing libraries were prepared following Nextera Flex for Enrichment Reference Guide using the corresponding Illumina kit for library preparation (Nextera DNA Flex Pre-Enrichment Library Prep and Enrichment Reagents-96 samples, Illumina Exome Panel - Enrichment Oligos only and indexes IDT for Illumina Nextera DNA Unique Dual Indexes Set A and B).

Input genomic DNA (300 ng) was tagmented by incubation with Enrichment Bead-Linked Transposomes (eBLT) for 5 min at 55 °C. After the samples were cooled to 10 °C, neutralization reagent was added to inactivate the tagmentation reaction. In the next step, after sample cleanup, the Unique Dual Index adapters and PCR Master Mix were added to the tagmented gDNA and the libraries were amplified by PCR (72 °C for 3 min, 98 °C for 3 min and 9 cycles of: 98 °C for 20 s, 60 °C for 30 s, 72 °C for 1 min and a final extension step of 72 °C for 5 min). Subsequently, amplified libraries were purified using Agencourt AMPure XP beads, visualized on an Agilent 2100 Bioanalyzer using the Agilent High Sensitivity DNA kit, and quantified using the Qubit dsDNA Broad Range Assay Kit. Afterwards, libraries were pooled by mass according to their concentration, and enrichment was performed overnight by incubation with the enrichment oligos panel at 58 °C; the libraries were then captured with streptavidin beads and cleaned up. Enriched libraries were then amplified by a second PCR (98 °C for 30 s and 10 cycles of: 98 °C for 10 s, 60 °C for 30 s, 72 °C for 30 s and a final extension step of 72 °C for 5 min). Finally, after bead purification, the concentration and quality of the enriched library pools were checked by Qubit and Bioanalyzer, respectively.

Genomic data

We characterized the genome of the individuals using two strategies, namely whole-exome sequencing (WES) and genome-wide genotyping with a commercial array. The sequencing libraries for WES were prepared following the Nextera Flex enrichment reference guide using the corresponding Illumina kit for library preparation, as described above. The resulting libraries were sequenced with NovaSeq 6000, generating paired-end reads of 101 base-pairs in length. The raw reads were aligned to the human reference GRCh38.p13, where we observed a depth of coverage of ≥80× in up to 50% of the genomic positions across all samples (Supplementary note Fig. SF1), and variants were discovered following the GATK best practices pipeline for germline short variant discovery21. The analysis was performed using GATK v4.1.9.022, Cutadapt v1.18, BWA v0.7.1723, Samtools v1.724, VCFTools v0.1.1625, Fastqc v0.11.926, and Picard v2.24.127; more information on the processing steps is given in the supplementary note (SM1). As a result, a joint callset was generated in VCF format and is referred to as VCF-WE in the downstream analysis.

We further included genotypes from genotyping arrays (GSA) in the genomic dataset. The samples were processed using Illumina’s Infinium global screening array 24.v3.0.a1 BeadChip kit following the Infinium HTS Assay Manual Protocol. Raw data were decoded based on the corresponding decoding files and the GSA-24.v3.0.A2.bpm manifest using the GenomeStudio 2.0 Genotyping Module. The resulting PED and MAP files were converted into a callset in VCF format with PLINK28. This VCF is referred to as VCF-GSA in the downstream analysis.

Genotype imputation

Genotype imputation was performed using the cloud-based implementation of the Michigan Imputation Server (MIS)29. Prior to imputation, we lifted the genomic positions of VCF-WE and VCF-GSA over to the GRCh37 human reference using the command line tool LiftOver30. Subsequently, we followed the recommendations provided in the sections on data preparation and quality control of the MIS documentation (https://imputationserver.readthedocs.io/). Upon completion, the quality-controlled VCF-WE included 83,022 SNPs, whereas VCF-GSA included 489,414 SNPs. Afterwards, we merged VCF-WE and VCF-GSA, generating 557,317 unique SNPs, and the merged VCFs per chromosome became the input of MIS for genotype imputation. From the resulting imputed VCFs, we removed SNPs with R2 < 0.7 and lifted the genomic positions over to the GRCh38 human reference with LiftOver. Finally, we merged all imputed chromosomal VCFs, producing a unified callset that included 9,620,831 phased SNPs.
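The post-imputation quality filter described above can be illustrated with a minimal sketch. It is not the pipeline used in the study: it assumes that each imputed record stores its imputation quality under an `R2=` key in the VCF INFO column, and the function names `parse_info` and `filter_by_r2`, as well as the toy records, are hypothetical.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'AF=0.1;R2=0.93;IMPUTED' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value  # flag entries without '=' map to an empty string
    return info


def filter_by_r2(vcf_lines, min_r2=0.7):
    """Keep header lines and records whose INFO R2 is at least min_r2."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        r2 = parse_info(fields[7]).get("R2")  # INFO is the 8th VCF column
        # Assumption: records without an R2 value are dropped.
        if r2 and float(r2) >= min_r2:
            kept.append(line)
    return kept


# Toy input (hypothetical positions and identifiers).
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\tsnp_a\tA\tG\t.\tPASS\tAF=0.12;R2=0.95;IMPUTED",
    "1\t2000\tsnp_b\tC\tT\t.\tPASS\tAF=0.03;R2=0.42;IMPUTED",
    "1\t3000\tsnp_c\tG\tA\t.\tPASS\tAF=0.30;R2=0.70;IMPUTED",
]
kept = filter_by_r2(example)
kept_ids = [line.split("\t")[2] for line in kept if not line.startswith("#")]
```

In this toy input, only the record with R2 = 0.42 falls below the threshold and is removed; a record exactly at 0.7 is retained, matching the "R2 < 0.7" removal criterion.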

Metabolomic and lipoproteomic data

The metabolomic dataset included measured levels of metabolites from urine and blood serum samples. The metabolite quantification was performed using Bruker’s NMR platform available at CIC bioGUNE. After excluding metabolites with zero levels across all samples or correlation between signal and calculated fit <50%, 27 urine and 29 blood serum metabolites were finally incorporated into the analysis. On the other hand, the lipoproteomic dataset encompassed 112 blood serum lipoproteins whose quantification was performed with the B.I.LISA platform from Bruker.

Identification of loss-of-function variants

Using the Ensembl variant effect predictor (VEP)31 along with the plugin LOFTEE32, we identified 567 variants that were annotated as loss-of-function (LoF) variants with high confidence and allele frequency ≤0.05. Finally, we removed those with unknown gene function.

Identification of variants associated with Mendelian conditions

We used the Online Mendelian Inheritance in Man (OMIM) portal33 and the dbNSFP database34 to query information for whole-exome variants that are associated with Mendelian conditions. We annotated the variants according to mode of inheritance and classified them as either pathogenic, damaging, associated with a condition or associated with a drug according to annotations provided by ClinVar35. Additionally, variants were characterized as tolerable or damaging by MetaSVM36. Finally, we considered only those variants whose mode of inheritance (dominant or recessive) could be matched based on the individual’s genotypes.

Polygenic scores calculation

We generated polygenic scores (PGS) for 70 molecules in our blood serum datasets with genetic score models available at the OMICSPRED repository37. The OMICSPRED models we selected concerned serum metabolomic and lipoproteomic traits that were quantified with NMR using the Nightingale platform. Using the model weights for each biomolecular trait, we calculated PGS using PLINK 2.038, following the instructions provided by the OMICSPRED platform for new cohorts.
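Conceptually, a polygenic score for one trait is a weighted sum of effect-allele dosages. The sketch below illustrates that arithmetic only; it does not reproduce the PLINK 2.0 run or any score scaling PLINK may apply. The function name `polygenic_score`, the SNP identifiers and the weights are hypothetical, and skipping SNPs that are absent from the genotype data is an assumption of this example.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages over SNPs present in both inputs.

    dosages: dict mapping SNP id -> effect-allele dosage (0 to 2) for one person
    weights: dict mapping SNP id -> per-allele weight from a score model
    Returns the score and the number of SNPs that contributed to it.
    """
    shared = [snp for snp in weights if snp in dosages]
    score = sum(weights[snp] * dosages[snp] for snp in shared)
    return score, len(shared)


# Toy model with three SNPs; the individual lacks a genotype for snp_3.
toy_weights = {"snp_1": 0.5, "snp_2": -0.2, "snp_3": 0.1}
toy_dosages = {"snp_1": 2, "snp_2": 1, "snp_4": 0}

score, n_used = polygenic_score(toy_dosages, toy_weights)
```

Here two of the three model SNPs are available, so the score is 0.5 × 2 + (−0.2) × 1. Reporting the number of contributing SNPs alongside the score makes it easy to spot models with poor variant overlap in a new cohort.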

Training of Gaussian Naive Bayes classifier

A subset of 61 individuals provided serum lipoproteomic data at two additional time-points. We utilized these data to assess the robustness of the initial clustering by training a Gaussian Naive Bayes (GNB) classifier, which assumes independence between the predictors. Although this assumption might not hold for biological data due to interactions between biomolecules, the GNB classifier often performs well even when the independence condition is violated, and it provides several advantages, such as a simpler model that is less prone to overfitting and requires less data to be trained. These advantages make this classifier a suitable choice for our cohort.

For the training process, we used 59 serum lipoproteins that were quantified across all visits, and trained the classifier on the whole cohort using the cluster label as output and the lipoprotein levels from the first visit as predictors. Then, we used the lipoprotein levels from the second and third visits to predict the cluster label for each of the 61 individuals and to evaluate the overall divergence from the initial cluster assignment.
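The train-on-first-visit, predict-on-later-visits procedure can be sketched as follows. This is a minimal from-scratch Gaussian Naive Bayes written with the Python standard library for illustration; it is not the implementation used in the study, where an established library implementation would be the more natural choice. The function names `fit_gnb` and `predict_gnb`, the additive variance floor `var_smoothing`, and the two-feature toy data are all assumptions of this example.

```python
import math
from collections import defaultdict


def fit_gnb(X, y, var_smoothing=1e-9):
    """Estimate the log prior, feature means and feature variances per class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Small constant added to each variance for numerical stability.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_smoothing
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gnb(model, row):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(params):
        log_prior, means, variances = params
        # Features are treated as independent Gaussians within each class.
        log_lik = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return log_prior + log_lik
    return max(model, key=lambda label: log_posterior(model[label]))


# Toy first-visit data: two hypothetical lipoprotein features, two clusters.
visit1 = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1], [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]]
clusters = ["A", "A", "A", "B", "B", "B"]
model = fit_gnb(visit1, clusters)

# Toy follow-up data for three individuals and their first-visit clusters.
visit2 = [[1.1, 1.9], [3.1, 0.5], [2.8, 0.7]]
baseline = ["A", "B", "A"]
predicted = [predict_gnb(model, row) for row in visit2]
concordance = sum(p == b for p, b in zip(predicted, baseline)) / len(baseline)
```

In this toy example, the third individual started in cluster A but has a cluster-B-like profile at follow-up, so the concordance between predicted and initial labels is two out of three. The complement of this concordance is one simple way to express divergence from the initial assignment.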

Ridge regression on serum lipoproteome

To perform ridge regression, we considered the main fractions of HDL, IDL, LDL, and VLDL that are measured in mg/dL and collectively represent the serum lipoproteome. Then, using the Reactome database39, we queried 817 genes that are involved in biological processes such as the metabolism of lipids and assembly, remodeling, and clearance of plasma lipoproteins. This gene set was later reduced to 28 genes, representing genes that are involved in the OMICSPRED models and have identified SNPs in the cohort that fall within the transcript region based on GENCODE annotations40. Finally, the mean effect across all OMICSPRED models was computed for every SNP of the 28 genes, and used to calculate 28 monogenic scores for each individual in the cohort.
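The construction of per-gene scores from averaged SNP effects can be sketched as below. The text does not specify every detail, so this example makes two labeled assumptions: a SNP's effect is averaged over the models in which it appears, and a gene's score is the sum of averaged effect times effect-allele dosage over that gene's SNPs. All names (`mean_effects`, `gene_scores`, the model, SNP and gene identifiers) are hypothetical.

```python
from collections import defaultdict


def mean_effects(model_weights):
    """Average each SNP's weight over the models in which it appears.

    model_weights: dict mapping model name -> {SNP id: weight}
    """
    collected = defaultdict(list)
    for weights in model_weights.values():
        for snp, weight in weights.items():
            collected[snp].append(weight)
    return {snp: sum(ws) / len(ws) for snp, ws in collected.items()}


def gene_scores(dosages, snp_to_gene, effects):
    """Sum mean effect x dosage over the SNPs assigned to each gene."""
    scores = defaultdict(float)
    for snp, gene in snp_to_gene.items():
        if snp in dosages and snp in effects:
            scores[gene] += effects[snp] * dosages[snp]
    return dict(scores)


# Toy inputs: two score models, three SNPs, two genes, one individual.
toy_models = {
    "model_1": {"snp_1": 0.2, "snp_2": 0.4},
    "model_2": {"snp_1": 0.4, "snp_3": -0.1},
}
toy_snp_to_gene = {"snp_1": "GENE_A", "snp_2": "GENE_A", "snp_3": "GENE_B"}
toy_dosages = {"snp_1": 1, "snp_2": 2, "snp_3": 2}

effects = mean_effects(toy_models)
scores = gene_scores(toy_dosages, toy_snp_to_gene, effects)
```

Averaging over all models, counting a SNP as zero where it is absent, would be an equally plausible reading and would shrink the effects of SNPs that appear in few models.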

The individuals in the cohort were separated based on the identified UMAP cluster, and a ridge regression was performed on each cluster using the main fractions of a given serum lipoprotein (HDL, IDL, LDL, or VLDL) as the response variable and the 28 monogenic scores as predictors, correcting for age, sex, and BMI.
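A per-cluster ridge fit can be sketched with the closed-form solution of the penalized least-squares problem. This standard-library example is illustrative only and makes several assumptions not stated in the text: the intercept is left unpenalized by centering, covariates such as age, sex and BMI would simply be appended as extra predictor columns and penalized like the scores, and the penalty strength `alpha` is a free parameter. The function names and toy data are hypothetical.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ridge_fit(X, y, alpha=1.0):
    """Return (intercept, coefficients) minimising ||y - Xb||^2 + alpha*||b||^2."""
    n, p = len(X), len(X[0])
    x_means = [sum(col) / n for col in zip(*X)]
    y_mean = sum(y) / n
    Xc = [[v - m for v, m in zip(row, x_means)] for row in X]
    yc = [v - y_mean for v in y]
    # Normal equations with the ridge penalty on the diagonal.
    gram = [
        [sum(Xc[k][i] * Xc[k][j] for k in range(n)) + (alpha if i == j else 0.0)
         for j in range(p)]
        for i in range(p)
    ]
    xty = [sum(Xc[k][i] * yc[k] for k in range(n)) for i in range(p)]
    coefs = solve_linear_system(gram, xty)
    intercept = y_mean - sum(c * m for c, m in zip(coefs, x_means))
    return intercept, coefs


def ridge_per_cluster(rows, alpha=1.0):
    """rows: iterable of (cluster, predictors, response); one fit per cluster."""
    grouped = {}
    for cluster, predictors, response in rows:
        X, y = grouped.setdefault(cluster, ([], []))
        X.append(predictors)
        y.append(response)
    return {c: ridge_fit(X, y, alpha) for c, (X, y) in grouped.items()}


# Toy data: response = 1 + 2*x1 + 3*x2 in cluster "c1", flat in cluster "c2".
toy_rows = [
    ("c1", [0, 0], 1), ("c1", [1, 0], 3), ("c1", [0, 1], 4), ("c1", [1, 1], 6),
    ("c2", [0, 0], 5), ("c2", [1, 0], 5), ("c2", [0, 1], 5), ("c2", [1, 1], 5),
]
fits = ridge_per_cluster(toy_rows, alpha=1.0)
```

With `alpha=1.0` the coefficients in cluster "c1" are shrunk towards zero relative to the unpenalized values of 2 and 3, which is the intended effect of the penalty when the number of predictors is large relative to the cluster size.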