Multi-omic integration sets the path for early prevention strategies on healthy individuals

Kioroglou, Dimitrios; Gil-Redondo, Rubén; Embade, Nieves; Bizkarguenaga, Maider; Conde, Ricardo; Millet, Oscar; Mato, José M.; Marigorta, Urko M.

doi:10.1038/s41525-025-00491-7


Published: 03 May 2025


npj Genomic Medicine volume 10, Article number: 35 (2025)


Abstract

Precision medicine requires biomarkers that stratify patients and improve clinical outcomes. Although longitudinal multi-omic analyses provide insights into pathological states, their utility in stratifying healthy individuals remains underexplored. We performed a cross-sectional integrative study of three omic layers, including genomics, urine metabolomics, and serum metabolomics/lipoproteomics, on a cohort of 162 individuals without pathological manifestations. We studied each omic layer separately and after integration, concluding that multi-omic integration provides optimal stratification capacity. We identified four subgroups and, for a subset of 61 individuals, longitudinal data for two additional time-points allowed us to evaluate the temporal stability of the molecular profiles of each identified subgroup. Additional functional annotation uncovered accumulation of risk factors associated with dyslipoproteinemias in one subgroup, suggesting targeted monitoring could reduce future cardiovascular risks. Overall, our methodology uncovers the potential of multi-omic profiling to serve as a framework for precision medicine aimed at early prevention strategies.


Introduction

The goal of precision medicine is to provide medical care in which disease prevention and treatment is tailored to the genetic background and lifestyle of each person. At its core lies the challenge of identifying biomarkers as a means to stratify patients.

Multi-omic analysis is emerging as a new paradigm to obtain a more comprehensive understanding of pathological mechanisms since it provides a holistic approach to deciphering disease etiology. For instance, genomics allows us to determine the genetic makeup in each person, and owing to the availability of catalogs of genetic variants associated with a myriad of traits, we can create individual risk profiles. Other omic data like transcriptomics, metabolomics, and proteomics provide a more refined view into regulatory networks, allowing for a deeper understanding of the biological processes at work within an individual. Multi-omic analysis has been implemented in various longitudinal studies, generating a wealth of useful information regarding the modulation of biomolecular levels and pathways during healthy and pathological states¹,², as well as modeling health outcomes in relation to microbiome interactions³.

In recent years, there has been a growing recognition of the importance of understanding processes that occur subclinically, before the onset of symptoms. These early molecular and cellular changes often precede overt pathology and represent critical opportunities for intervention and prevention⁴. Nevertheless, despite the considerable amount of information that has accumulated over the past years, and the numerous phenotypic databases available, we have not yet established the biological insights we can derive from a cross-sectional integration analysis of multi-omic data obtained from healthy individuals.

For the actionability of personal omic profiling in health, we need to establish three key goals: a) the capability of each omic layer to stratify healthy individuals, b) the possibility of predicting prospective phenotypic manifestations for a certain group (referred to as practical stratification in the manuscript) and c) the kind of early prevention strategies we can implement based on multi-omic integration from a healthy state. To address these gaps, we set out to perform a cross-sectional integrative study of three omic layers on a cohort of 162 individuals without pathological phenotypes. We analyzed the omic layers individually and after integration, to evaluate their ability to stratify individuals into relevant subgroups and establish the potential strengths and weaknesses of precision medicine applications focused on stratification. Moreover, a subset of 61 individuals was followed longitudinally and provided omic data two years after the first visit and one year after the second visit. Since temporal stability has previously been observed in certain omic layers derived from peripheral blood of healthy individuals⁵, the availability of these longitudinal data allowed us to validate the stability of the molecular profiles for each identified subgroup, a critical aspect for prevention strategies as highlighted recently⁶.

Overall, our results show that integration of different omic layers was necessary to obtain practical stratification, and that certain clusters displayed classification consistency over time, highlighting the stability of their molecular profiles.

Results

In Fig. 1 we show the experimental design of the multi-omic integration that led to the stratification of the cohort into subgroups of individuals with potentially different underlying health predispositions. The analysis also evaluated the temporal stability of these stratifications across timepoints, providing insights into the consistency of the molecular profiles. The following sections present the steps that governed the stratification process.

**Fig. 1: Flow chart representation of the analytical pipeline.**

Cohort characteristics

We generated genomic, metabolomic, and lipoproteomic profiles for a cohort of healthy individuals. The cohort consisted of 162 individuals, with a majority (61.1%) being females. The participants were primarily in their fifth decade (mean = 44.39, sd = 8.46) and exhibited an average BMI of 24.5 kg/m² (sd = 3.95). On average, the levels of glucose, cholesterol, TG, and liver enzymes, along with other common analytical features, all remained within the normal range (Table 1). The collection of the omic data for the whole cohort occurred in 2019. For a subset of 61 individuals, serum lipoproteomic data were also collected at two additional time-points. One occurred around 2 years from the first visit (mean = 769 days, sd = 5 days) and the other around 1 year from the second visit (mean = 379 days, sd = 3 days).

Table 1 Cohort characteristics and analytical features from a basic biochemical blood analysis


Cohort displayed close genetic ancestry with European populations

We first evaluated the genetic ancestral composition of the individuals in our cohort. After merging the genomic datasets (WES and GSA), we removed indels and identified 557,317 single-nucleotide variants (SNVs) and imputed to 9,620,831 variants using the Michigan Imputation Server. From the latter callset, we extracted 161,157 independent variants (r² ≤ 0.1) with allele frequency ≥0.05 and performed PCA against the 1000 Genomes reference populations in order to evaluate the ancestral background of the cohort. The cohort presented close genetic ancestry with European populations, showing strongest similarity with the Iberian population in Spain. Interestingly, we observed a continuous cline that reflects the variable amount of Basque genetic ancestry present in the current gene pool of the Spanish Basque Country region (Fig. 2).
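The variant selection step described above (allele frequency ≥0.05, pairwise r² ≤ 0.1) can be illustrated with a minimal, dependency-free sketch. The source does not name the tool used for pruning, so the greedy strategy, the function names and the toy genotypes below are assumptions for illustration only; dedicated genetics software applies windowed pruning that scales to millions of variants.

```python
def minor_allele_frequency(genotypes):
    """MAF from diploid genotypes coded as 0, 1 or 2 copies of the alternate allele."""
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def r_squared(a, b):
    """Squared Pearson correlation of two genotype vectors (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov * cov / (var_a * var_b)


def prune_variants(variants, min_maf=0.05, max_r2=0.1):
    """Keep common variants, then greedily drop any variant correlated with one already kept."""
    kept = []
    for name, genotypes in variants.items():
        if minor_allele_frequency(genotypes) < min_maf:
            continue
        if all(r_squared(genotypes, variants[k]) <= max_r2 for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: nine individuals, four variants.
variants = {
    "snp_a": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "snp_b": [0, 0, 0, 1, 1, 1, 2, 2, 2],  # perfectly correlated with snp_a
    "snp_c": [0, 1, 2, 0, 1, 2, 0, 1, 2],  # uncorrelated with snp_a
    "snp_d": [0, 0, 0, 0, 0, 0, 0, 0, 0],  # monomorphic, fails the MAF filter
}
print(prune_variants(variants))
```

Because the sketch compares every candidate with every retained variant, its cost grows quadratically; it is meant to show the logic of the two thresholds rather than to be applied to a callset of this size.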

**Fig. 2: PCA using cohort’s 161,157 independent variants and 1000 Genomes reference populations.**

Loss-of-function variants provided limited ability for phenotypic characterization of the cohort

Since we wanted to evaluate the stratification capability of each omic layer, we studied them separately starting with the genomic layer. We identified 164 LoF variants (see Methods) that fall in a non-canonical splice site and eight in a single exon transcript. The genes associated with the latter are genes related to either protein-protein interactions (WDR5B), taste receptors (TAS2R3), olfactory signaling (OR51B5, OR56A5, OR4C16, OR5AR1 and OR4M2) or diseases such as scleral staphyloma and peach allergy (NACA2) that have not been clinically reported in the cohort. On the other hand, the LoF variants that fall in a non-canonical splice site have especially been reported in the central nervous system⁷. Overall, the interpretation in the context of our healthy cohort is not unequivocal and it carries limited practical actionability.

The cohort was characterized by diverse Mendelian conditions with unknown probability of manifestation

Continuing with the genomic layer, we identified 257 pathogenic variants related to 62 Mendelian conditions and involving 39 genes (see Methods). These alleles had an average minor allele frequency (MAF) of 0.15 (sd = 0.11), suggesting limited evolutionary control by natural selection and hence few implications for the carrying individual. Regarding damaging variants, we identified 3639 variants (MAF: mean = 0.12, sd = 0.12) linked to 607 Mendelian conditions and 429 genes, whereas 84 variants (MAF: mean = 0.14, sd = 0.12) and 11 genes were annotated as having an association with 13 disorders.

As a way to filter these results, we focused on the top-5 Mendelian conditions that were related to those genes that accumulated the highest number of damaging variants based on MetaSVM. This strategy identified myopathy (TTN, 13/186 damaging variants, mean MAF 0.03), bronchiectasis and pancreatitis (CFTR, 9/12 damaging variants, mean MAF 0.04), macular degeneration (ABCA4, 6/15 damaging variants, mean MAF 0.05), erythermalgia (SCN9A, 5/10 damaging variants, mean MAF 0.02) and Alport syndrome 3 (COL4A3, 5/15 damaging variants, mean MAF 0.11). As expected in the context of a healthy cohort, we did not observe any symptoms related to the aforementioned conditions, and we consider the probability to develop such manifestations low given the age of the participants.

Regarding drug-related conditions, we identified toxicity towards 5-fluorouracil (DPYD, 1/3 damaging variants, mean MAF 0.12), Warfarin sensitivity (CYP2C9, 0/6 damaging variants, mean MAF 0.04), Warfarin resistance (VKORC1, 0/3 damaging variants, mean MAF 0.08), codeine sensitivity (CYP2D6, 0/3 damaging variants, mean MAF 0.15) and debrisoquine sensitivity (CYP2D6, 0/3 damaging variants, mean MAF 0.15). These findings provide useful information regarding the selection of treatment, with sensitivity to debrisoquine representing a particular case. More specifically, one of the drug therapies reported in the cohort was medication for hypertension, with seven individuals under treatment and two of them carrying minor alleles of the rs3892097 and rs1065852 variants of CYP2D6.

Bringing it all together, although this kind of approach may provide some useful information for prescribing clinicians, overall it fails to provide actionable information for the development of a prevention strategy since it yields a diverse phenotypic landscape for each individual that lacks a probabilistic context that is related to manifestation.

Polygenic scores highlight certain biomolecular traits with potential ability to stratify the cohort, albeit with substantial uncertainty

As an alternative to the previous approaches on the genomic layer, polygenic scores (PGS) provide a probabilistic framework for stratification by estimating genetic predispositions to various phenotypes. For each biomolecular trait available in our serum metabolomic and lipoproteomic datasets, we assessed the odds of each individual to have levels ≥75th percentile observed in the cohort provided that their corresponding PGS was ≥75th percentile observed in the cohort. We capitalized on the genetic estimation models provided by OMICSPRED to calculate the corresponding PGS (see Methods) and performed logistic regression using biomolecular traits as response variables and PGS as predictors, correcting at the same time for age, gender and BMI (Supplementary note table ST1).
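To make the quantity being estimated concrete, the following sketch computes an unadjusted odds ratio with a Woolf (log-scale) 95% confidence interval from two top-quartile indicators. This is a deliberate simplification and not the authors' model: the study used logistic regression with age, gender and BMI as covariates, whereas the code below ignores covariates. The function names, the inclusive quantile method and the toy counts are assumptions.

```python
from math import exp, log, sqrt
from statistics import quantiles


def top_quartile_flags(values):
    """True where a value is at or above the 75th percentile of the list."""
    threshold = quantiles(values, n=4, method="inclusive")[2]
    return [v >= threshold for v in values]


def odds_ratio_with_ci(exposed, outcome, z=1.96):
    """Unadjusted odds ratio and Woolf confidence interval from two boolean lists."""
    pairs = list(zip(exposed, outcome))
    a = sum(e and o for e, o in pairs)            # high PGS, high trait level
    b = sum(e and not o for e, o in pairs)        # high PGS, trait not high
    c = sum(o and not e for e, o in pairs)        # PGS not high, high trait level
    d = sum(not e and not o for e, o in pairs)    # neither
    if 0 in (a, b, c, d):
        raise ValueError("empty cell: the odds ratio is undefined without a correction")
    estimate = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return estimate, exp(log(estimate) - z * se), exp(log(estimate) + z * se)


# Hypothetical cohort of 40: 10 individuals in the top PGS quartile, 5 of them with high levels.
exposed = [True] * 10 + [False] * 30
outcome = [True] * 5 + [False] * 5 + [True] * 5 + [False] * 25
estimate, lower, upper = odds_ratio_with_ci(exposed, outcome)
print(f"OR = {estimate:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

With cell counts this small, the interval spans more than an order of magnitude even for a sizeable point estimate, which is the kind of uncertainty a cohort of this size can be expected to produce.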

We identified 28 traits with the potential to stratify the cohort. However, most traits displayed odds-ratios close to 2, with confidence intervals often including 1 (Fig. 3A). Two traits, namely glycine and triglycerides in medium HDL, presented an odds-ratio close to 6, but in both cases the 95% CI was wide, implying uncertainty regarding the conditions under which an individual displays high levels. This heterogeneity can be linked to the small sample size of the cohort, as well as to factors associated with lifestyle, since glycine and triglyceride levels are influenced by diet.

**Fig. 3: Odds-ratios for biomolecular traits with potential to stratify the cohort and distributions of cohort genotypes for GLDC variants.**

Triglycerides in HDL particles have been positively associated with high risk of myocardial infarction⁸, whereas glycine has been postulated to have a protective role⁹. After querying the LitVar2 portal¹⁰ for the variants associated with the glycine OMICSPRED model, we identified three variants (rs10975629, rs1061407, rs4419859) of the gene glycine decarboxylase (GLDC), which has been associated with non-ketotic hyperglycinemia (NKH). Inspection of the cohort genotypes (Fig. 3B) showed that only carriers of the rs4419859 and rs1061407 variants displayed the highest levels of glycine. Nevertheless, this observation concerned only a very small fraction of the carriers. This indicates the presence of confounding factors that would still remain unknown even if we increased the sample size. Additionally, of the two variants, rs1061407 is classified as benign for NKH by dbSNP, whereas rs4419859 is an intron variant with unknown clinical significance.

Cohort displayed small variability across metabolomic and lipoproteomic features

We then moved to the metabolomic and lipoproteomic layers to assess their ability to effectively stratify the cohort. We performed principal component analysis (PCA) after standardizing all included features (Fig. 4A). The first two PCs captured 42% of the observed variance, with gender mainly separating individuals across PC1 as expected due to physiological differences between males and females.

**Fig. 4: Exploratory analysis of the metabolomic and lipoproteomic features of the cohort.**

We then divided the PCA plot into four quadrants (Q1, Q2, Q3, and Q4) to study the individuals based on their colocation. Based on the feature loadings, age correlated equally with both PCs and influenced clustering towards the Q1 quadrant. Closer inspection revealed that this quadrant accumulated male and female individuals with higher median age than the rest of the individuals of the same gender (Fig. 4B). BMI presented a strong correlation with PC1, especially impacting clustering towards Q2. We observed that this factor drove the clustering of male and female individuals with high BMI towards the Q1 and Q2 quadrants, with the latter accumulating females with a higher median BMI than the rest of this gender (Fig. 4B).

For the omic data, we evaluated their net impact on the clustering by averaging the loadings of their features. Lipoproteomic data had the strongest impact on the clustering of individuals compared to the metabolomic data (Fig. 4A), with serum lipoproteins influencing the clustering towards Q1 (Fig. 4B). However, the differences in lipoprotein levels between these individuals and the rest of the cohort were very small, corroborating the high uncertainty we previously reported on the odds-ratios using PGS. In a similar manner, we observed small inter-quadrant variability with the serum and urine metabolomic data (Fig. 4B) that led to the smallest impact on the clustering (Fig. 4A).

Multi-omic integration separated individuals into four distinct clusters, with one cluster accumulating individuals with dyslipoproteinemias

We integrated the lipoproteomic layer, which showed the highest clustering effect during PCA, with the abovementioned polygenic scores from the OMICSPRED models to study non-linear relationships within the cohort. Initially, we regressed out the factors age, gender and BMI from each lipoprotein due to their strong influence on the cohort clustering. We then combined the resulting data with the previously calculated PGS into a topological analysis with UMAP¹¹. Finally, we performed k-means clustering that divided the cohort into four distinct clusters (C1, C2, C3, and C4) (Fig. 5A).
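The covariate adjustment step can be sketched as an ordinary least-squares fit whose residuals replace the original measurements. The code below is a minimal, dependency-free illustration under the assumption that "regressing out" means a linear fit on an intercept plus age, gender and BMI; the helper names and the toy values are hypothetical, and in practice a numerical library routine such as `numpy.linalg.lstsq` would normally be preferred over hand-written elimination.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def regress_out(values, covariates):
    """Residuals of `values` after an OLS fit on an intercept plus `covariates`."""
    design = [[1.0] + list(row) for row in covariates]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, values)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    return [y - sum(b * x for b, x in zip(beta, r)) for r, y in zip(design, values)]


# Hypothetical covariates per individual: (age in years, gender coded 0/1, BMI in kg/m^2).
covariates = [(30, 0, 22.0), (45, 1, 27.5), (52, 0, 24.1),
              (38, 1, 21.3), (60, 1, 30.2), (41, 0, 25.0)]
lipoprotein = [98.0, 121.5, 110.2, 101.7, 135.9, 108.4]  # made-up levels
adjusted = regress_out(lipoprotein, covariates)
print([round(value, 2) for value in adjusted])
```

The residuals are centred on zero and linearly uncorrelated with the covariates, which is why any remaining age or BMI structure in downstream clusters points to non-linear dependence rather than a failed adjustment.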

**Fig. 5: UMAP with 70 PGS from OMICSPRED models and serum lipoproteomic data after regressing out age, gender and BMI.**

We next investigated whether the accumulation of individuals in the same cluster was driven by underlying biological factors. As a first step, we identified genes whose variants are associated with a high number of OMICSPRED models that we used for the PGS calculation, with the rationale that these genes might have an important role in driving the clustering. This strategy highlighted six variants (mean MAF = 0.19, sd = 0.16) that are associated with the gene APOB, fall within an intergenic region and are related to more than 40 out of the 70 biomolecular traits.

Since APOB is linked to vascular diseases, we implemented the diagnostic algorithm provided by Sniderman et al.¹² to identify individuals with potential underlying dyslipoproteinemias. The algorithm resembles a decision-tree classifier that groups individuals into six major dyslipoproteinemias, based on their levels of total cholesterol, triglycerides and apolipoprotein B (ApoB). Out of the 162 individuals, 13 individuals were identified with chylomicron and VLDL remnants, four with increased VLDL particles, three with increased LDL particles and two with increased VLDL and LDL particles. Interestingly, the vast majority of these individuals were collocated in the C4 cluster (Fig. 5B). This indicates that UMAP yielded practical clustering from a biological and phenotypic perspective.

Sniderman et al. provide diverse primary and secondary causes that could be associated with such manifestations. Some are related to familial incidence, for instance familial dysbetalipoproteinemia and hypertriglyceridemia, with genes such as APOB, APOE, APOC1 and PCSK9 being linked to such familial conditions¹³,¹⁴,¹⁵,¹⁶. Variants associated with familial conditions represent rare mutations that would require much larger sample sizes in order to evaluate their impact. However, driven by an interest to observe the distribution of the individuals that might carry such rare mutations, we focused on the variants rs11591147 (MAF = 0.01) and rs148195424 (MAF = 0.01) that have been associated with hypocholesterolemia¹⁶. We identified six individuals that carry one risk allele of rs11591147 and two individuals with one risk allele of rs148195424. Interestingly, from the eight carriers, none of them was collocated in the C4 cluster, whereas the distribution for the rest of the clusters was two carriers in C1, four in C2 and two in C3. This serves to emphasize even further the biological underpinnings behind the formation of the UMAP clusters, since the C4 cluster that accumulated individuals with dyslipoproteinemias did not include carriers of variants related to hypocholesterolemia.

The derived insights from the multi-omic data integration were transferable to measurements derived from a routine hematological examination

To assess whether the observed stratification could be replicated with a conventional blood test, we incorporated a dataset derived from a routine hematological examination from a different blood sample collected the same day as the serum omic data. This dataset included complete blood counts and measurements from a basic metabolic panel, and it was not incorporated in any step of the thus far presented analytical pipeline.

After regressing out the factors age, gender and BMI from the measurements, we compared the levels of the basic metabolic panel between the UMAP clusters that were previously identified (Fig. 6). Interestingly, we observed a distinct profile of the individuals of the C4 cluster, who presented the highest levels of triglycerides and the lowest levels of high-density lipoprotein (HDL). Similar levels of HDL were also observed in the C3 cluster; however, this cluster displayed a different profile from C4 when HDL is subtracted from the total cholesterol (noHDL), and lower levels of low-density lipoprotein (LDL) than the rest of the clusters. This highlights the consistency of the clustering across different datasets.

**Fig. 6: Measurements of a basic metabolic panel derived from a routine blood test of the cohort after regressing out the factors age, gender and BMI.**

Regarding the individuals that were classified with dyslipoproteinemias, they displayed the highest levels of triglycerides, LDL and noHDL. Nevertheless, the high uncertainty previously observed during the calculation of the odds-ratios (Fig. 3A) became apparent, since we also observed overlap between the levels of classified and unclassified individuals (Normal).

Furthermore, the clustering was influenced by the factors age and BMI, even though they were controlled for (Fig. 7). More specifically, the C4 cluster accumulated individuals with overall age and BMI above the cohort’s median, whereas the opposite was observed for the individuals in the C2 and C3 clusters. This indicates that linear approaches are not sufficient to fully control for these factors, unless higher order polynomial terms are introduced. However, such strategies increase the risk of overfitting and hinder interpretation.

**Fig. 7: Characteristics of UMAP clusters for adjusted factors.**

Individuals in the C4 cluster displayed the smallest conversion rate according to Naive Bayes classification

To evaluate the temporal stability of the initial clustering from the first visit, we trained a Gaussian Naive Bayes (GNB) classifier (see Methods) using the cluster label as output and the levels of 59 lipoproteins, quantified across all visits, as predictors. The classifier was then used to predict the cluster labels for a subset of 61 individuals who provided data of serum lipoproteins for two additional visits.
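The classification step can be illustrated with a compact Gaussian Naive Bayes sketch: per-class priors, feature means and variances are estimated from the first visit, and later visits are assigned to the class with the highest log posterior. This is not the authors' implementation (the Methods describe their setup); the function names, the variance floor and the two-feature toy data are assumptions, and libraries such as scikit-learn offer an equivalent `GaussianNB` estimator.

```python
from collections import defaultdict
from math import log, pi


def fit_gnb(features, labels, var_floor=1e-9):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(features, labels):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((x - m) ** 2 for x in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (log(n / len(labels)), means, variances)
    return model


def predict_gnb(model, row):
    """Return the class with the highest log posterior for one feature vector."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * log(2 * pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(row, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))


# Hypothetical first-visit data: two lipoprotein features per individual.
train = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
         [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]]
labels = ["C1", "C1", "C1", "C4", "C4", "C4"]
model = fit_gnb(train, labels)

# Hypothetical measurements of two individuals at a later visit.
print(predict_gnb(model, [1.1, 1.9]), predict_gnb(model, [3.1, 0.5]))
```

The naive independence assumption keeps the number of estimated parameters small, which is a practical consideration when the training set is a single visit of a modest cohort.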

We assessed cluster stability using the conversion rate, defined as the percentage of individuals whose predicted cluster label changed at least once across visits. After adjusting for the size of each cluster, we observed that individuals in the C4 cluster displayed the smallest conversion rate, with 96% of the individuals retaining the same cluster label across the three visits (Fig. 8), compared with lower retention in the rest of the clusters (C1 = 83%, C2 = 83% and C3 = 93%). This indicates that the multi-omic profile of the C4 cluster presents higher consistency over time, supplementing the previous observations associated with the distinctiveness of this cluster.
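The conversion rate defined above translates directly into code. In the sketch below, the per-cluster breakdown groups individuals by their label at the first visit, which is one plausible reading of "adjusting for the size of each cluster" and should be treated as an assumption, as should the function names and the toy predictions.

```python
def conversion_rate(labels_by_individual):
    """Percentage of individuals whose predicted label changes at least once across visits."""
    changed = sum(len(set(visits)) > 1 for visits in labels_by_individual.values())
    return 100.0 * changed / len(labels_by_individual)


def conversion_rate_by_cluster(labels_by_individual):
    """Conversion rate within each baseline cluster (label at the first visit)."""
    groups = {}
    for person, visits in labels_by_individual.items():
        groups.setdefault(visits[0], {})[person] = visits
    return {cluster: conversion_rate(members) for cluster, members in groups.items()}


# Hypothetical predicted labels for four individuals over three visits.
predictions = {
    "id1": ["C4", "C4", "C4"],
    "id2": ["C4", "C4", "C4"],
    "id3": ["C1", "C2", "C1"],
    "id4": ["C1", "C1", "C1"],
}
print(conversion_rate(predictions))
print(conversion_rate_by_cluster(predictions))
```

The retention percentage quoted in the text is the complement of this quantity, that is, 100 minus the conversion rate of a cluster.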

**Fig. 8: Results from Naive Bayes classification of individuals in each cluster across three visits using serum lipoproteomic data and 70 PGS from OMICSPRED models.**

APOE, PCSK9, and LPA highlighted the difference in the C4 cluster regarding LDL levels

To identify which genes influence the formation of the clusters, we performed a ridge regression on each cluster separately (see Methods), considering the main fractions of the lipoproteins HDL, IDL, LDL, and VLDL (Supplementary note Fig. SF2) and calculated monogenic scores for 28 genes. We then used the estimated coefficients of the latter to perform a hierarchical clustering (Fig. 9).

**Fig. 9: Hierarchical clustering using the estimated coefficients of ridge regressions from each cluster.**

Initially, we did not observe a statistically significant accumulation of minor allele carriers within a specific cluster after conducting a chi-square test. Nevertheless, the hierarchical clustering revealed a dichotomic effect of the transcript variants of the considered genes on the clusters for the lipoproteins IDL and VLDL, with the genes ABCA1 and LPA having a pronounced impact. Moreover, the C4 cluster displayed a different profile from the rest of the clusters regarding LDL, with the genes APOE, PCSK9, and LPA driving the observed differences. As far as HDL is concerned, even though the C1 and C2 clusters displayed the highest levels (Supplementary Note Fig. SF2), their levels were influenced positively by different genes, namely ABCA1 and PCSK9, respectively.

The mean effects of transcript variants of the genes ABCA1, LPA, APOE, and PCSK9, as well as the distribution of the minor alleles across the four clusters, are given in the Supplementary note Figs. SF3, SF4, SF5 and SF6 respectively.

Discussion

The main goal of precision medicine is the identification of patterns from biological data that could lead to patient stratification and targeted treatments. There are many remaining challenges, including data integration, privacy, and gaps in understanding genomic functions and patient recruitment¹⁷,¹⁸.

To the best of our knowledge, our study is the first endeavor to explore cross-sectional multi-omic integration under a non-pathological state, aiming at providing insights to assist the outset of precision medicine implementations towards prevention. To evaluate the current prospects for precision medicine, our main focus was to implement an integrative approach of three omic layers derived from a cohort of 162 ostensibly healthy individuals, and identify patterns that are relevant to certain groups of the cohort, instead of rare characteristics of a few individuals.

Multi-omic integration allowed us to stratify the cohort into four distinct clusters, even though each omic layer displayed weak stratification power when examined separately. For instance, the initial clustering using the urine and serum omic data met our expectations related to the absence of any apparent pathological condition in the cohort, as prima facie we did not observe groups with distinct characteristics. Instead, the minor interindividual differences were amplified by demographic factors such as age, gender, and BMI whose discerning impact has been previously reported on the serum bile acids profiles of healthy persons¹⁹. Thus, we accounted for these factors during the downstream analysis.

The genomic data revealed a large number of potential Mendelian conditions with unknown probability of manifestation. Additionally, even if we were to evaluate these findings on an individual basis, an extensive family history would have been necessary. On the other hand, variant annotations provided by ClinVar and MetaSVM aim to elucidate the pathogenicity of the identified variants and reduce the phenotypic complexity. However, conflicting annotations and lack of manifestation for conditions associated with common damaging variants hindered the interpretation of the obtained results.

The calculated PGS for 70 serum biomolecular features highlighted the discriminatory potential of the traits triglycerides in medium HDL and glycine, however we observed great uncertainty over the estimates which can be linked to the lifestyle of each individual, the small sample size, and the lack of individuals with high or pathogenic levels. Nevertheless, the PGS provided a bridging guide to the downstream analysis towards phenotypes associated with dyslipoproteinemias due to their relation to the highlighted biomolecular traits.

Regarding the identified clusters after integration, even though we controlled for the impact of age, gender and BMI, the influence of the latter persisted, causing a statistically non-significant but noticeable aggregation of individuals with median BMI above the cohort’s median in a distinct cluster (C4), indicating the need for refined methodologies to minimize the residual effects of such factors. Nevertheless, these individuals presented noteworthy differences associated with an aggravated dyslipidemic profile. Although this could be regarded as an outcome of a given lifestyle, this cluster displayed higher classification consistency over time, higher triglyceride levels in the peripheral blood and a distinct profile associated with the impact on LDL serum levels of transcript variants of genes involved in lipid metabolism. Even though we did not observe a statistically significant accumulation of certain transcript variants in this cluster, the obtained results indicate that the clustering is driven by the cumulative influence of numerous subtle features rather than by the dominant impact of a few features. This, combined with the narrow variability in feature levels across individuals in the cohort, makes it challenging to extract and interpret the latent components that drive interindividual differences under an ostensibly healthy state.

Through the integrative exercise in this challenging setting, we identified several weaknesses that hinder the development of precision medicine in healthy contexts towards prevention. These include the difficulty in evaluating the pathogenicity of certain aforementioned findings, which contrasts with the ease of discovery through already available genomic databases, and the significant amount of information that needs to be parsed, curated and structured to allow the emergence of relationships and the interpretation of the findings from the integration of the multi-omic layers. Developing unified bioinformatic frameworks for such implementations is a must, even more so considering the increasing tendency towards localized efforts to develop precision medicine approaches in specific regional communities.

Regarding limitations, our study lacks an external replication. While our primary objective focused on exploring the potential of multi-omic integration within a cohort of healthy individuals, the lack of an independent validation dataset restricts the generalizability of our findings. However, robust validation of multi-omic integration results is hindered by the fact that it necessitates datasets generated using consistent methodologies, particularly concerning NMR-based metabolomics, and ideally, from populations with similar ancestries, given the known population-specific performance of PGS. Moreover, even as multi-omic profiling becomes more accessible, most available datasets remain cross-sectional and lack harmonization across platforms and populations, hence limiting our ability to validate integration strategies in external cohorts.

Thus, even though this study does not provide directly actionable insights to be implemented in clinical practice, our exercise confirms that integration of different omic layers is a promising methodology to stratify healthy individuals, filter the immense phenotypic diversity that is initially assigned to each group and provide phenotypic clues that assist towards the development of prevention strategies.

Methods

Cohort ethics approval and consent to participate

Participants belong to the AKRIBEA cohort, a large-scale precision medicine initiative from the Basque Country led by CIC bioGUNE²⁰. The cohort is recruited through annual medical check-ups in the Basque region, and hence it is particularly enriched for Basque genetic ancestry. Following the principles of the Declaration of Helsinki, all individuals provided informed consent for clinical research, with the consequent evaluation of the ethics committee for Investigation with medicinal products of Euskadi-Basque Country (CEIC-E 16–114 and CEIC-E 19-13). To protect patient confidentiality, all data has been double codified.

Whole-exome library preparation and sequencing

The quantity and quality of the DNAs were evaluated with Qubit dsDNA Broad Range Assay Kit and Agarose gels, respectively. Sequencing libraries were prepared following Nextera Flex for Enrichment Reference Guide using the corresponding Illumina kit for library preparation (Nextera DNA Flex Pre-Enrichment Library Prep and Enrichment Reagents-96 samples, Illumina Exome Panel - Enrichment Oligos only and indexes IDT for Illumina Nextera DNA Unique Dual Indexes Set A and B).

Input genomic DNA (300 ng) was tagmented by incubation with Enrichment Bead-Linked Transposomes (eBLT) for 5 min at 55 °C. After the samples were cooled to 10 °C, neutralization reagent was added to inactivate the tagmentation reaction. After sample cleanup, the Unique Dual Index adapters and PCR Master Mix were added to the tagmented gDNA, and the libraries were amplified by PCR (72 °C for 3 min, 98 °C for 3 min and 9 cycles of: 98 °C for 20 s, 60 °C for 30 s, 72 °C for 1 min and a final extension step of 72 °C for 5 min). Subsequently, amplified libraries were purified using Agencourt AMPure XP beads, visualized on an Agilent 2100 Bioanalyzer using the Agilent High Sensitivity DNA kit, and quantified using the Qubit dsDNA Broad Range Assay Kit. Afterwards, libraries were pooled by mass according to their concentration, and enrichment was performed overnight by incubation with the enrichment oligo panel at 58 °C; the libraries were then captured with streptavidin beads and cleaned up. Enriched libraries were amplified by a second PCR (98 °C for 30 s and 10 cycles of: 98 °C for 10 s, 60 °C for 30 s, 72 °C for 30 s and a final extension step of 72 °C for 5 min). Finally, after bead purification, the concentration and quality of the enriched library pools were checked by Qubit and Bioanalyzer, respectively.

Genomic data

We characterized the genome of the individuals using two strategies: whole-exome sequencing (WES) and genome-wide genotyping with a commercial array. The sequencing libraries for WES were prepared following the Nextera Flex enrichment reference guide using the corresponding Illumina kit for library preparation, as described above. The resulting libraries were sequenced on a NovaSeq 6000, generating paired-end reads of 101 base pairs in length. The raw reads were aligned to the human reference GRCh38.p13, where we observed a depth of coverage of ≥80× in up to 50% of the genomic positions across all samples (Supplementary note Fig. SF1), and variants were discovered following the GATK best practices pipeline for germline short variant identification²¹. The analysis was performed using GATK v4.1.9.0²², Cutadapt v1.18, BWA v0.7.17²³, Samtools v1.7²⁴, VCFTools v0.1.16²⁵, Fastqc v0.11.9²⁶, and Picard v2.24.1²⁷; more information on the processing steps is given in the supplementary note (SM1). The end result was a joint callset in VCF format, referred to as VCF-WE in the downstream analysis.

We further included genotypes from a genotyping array (GSA) in the genomic dataset. The samples were processed using Illumina’s Infinium global screening array 24.v3.0.a1 BeadChip kit following the Infinium HTS Assay Manual Protocol. Raw data were decoded using the corresponding decoding files and the GSA-24.v3.0.A2.bpm manifest with the GenomeStudio 2.0 genotyping module. The resulting PED and MAP files were converted into a callset in VCF format with PLINK²⁸. This VCF is referred to as VCF-GSA in the downstream analysis.

Genotype imputation

Genotype imputation was performed using the cloud-based implementation of the Michigan Imputation Server (MIS)²⁹. Prior to imputation, we lifted the genomic positions of VCF-WE and VCF-GSA over to the GRCh37 human reference using the command-line tool LiftOver³⁰. Subsequently, we followed the recommendations provided in the data preparation and quality control sections of the MIS documentation (https://imputationserver.readthedocs.io/). Upon completion, the quality-controlled VCF-WE included 83,022 SNPs and VCF-GSA included 489,414 SNPs. We then merged VCF-WE and VCF-GSA, yielding 557,317 unique SNPs, and the merged per-chromosome VCFs became the input to MIS for genotype imputation. From the resulting imputed VCFs, we removed SNPs with R² < 0.7 and lifted the genomic positions over to the GRCh38 human reference with LiftOver. Finally, we merged all imputed chromosomal VCFs, producing a unified callset that included 9,620,831 phased SNPs.
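The post-imputation quality filter can be illustrated with a short Python sketch. This is not the pipeline used in the study: the function name `filter_by_r2` is hypothetical, the toy records are invented, and the sketch assumes that imputation quality is stored in the VCF INFO column under the key `R2`.

```python
from typing import Iterable, Iterator


def filter_by_r2(vcf_lines: Iterable[str], min_r2: float = 0.7) -> Iterator[str]:
    """Yield header lines and records whose INFO R2 value is >= min_r2.

    Assumption: imputation quality is written as `R2=<float>` in the INFO
    column (8th tab-separated field). Records without an R2 entry are dropped.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep meta-information lines and the column header
            continue
        fields = line.rstrip("\n").split("\t")
        # Parse key=value INFO entries; flag-only entries (no "=") are ignored.
        info = dict(
            item.split("=", 1) for item in fields[7].split(";") if "=" in item
        )
        r2 = info.get("R2")
        if r2 is not None and float(r2) >= min_r2:
            yield line


# Toy records (invented for illustration only).
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1000\trsA\tA\tG\t.\tPASS\tAF=0.12;R2=0.95\n",
    "1\t2000\trsB\tC\tT\t.\tPASS\tAF=0.03;R2=0.41\n",
]
kept = list(filter_by_r2(example))
```

With these toy records, the variant with R² = 0.41 is removed while both header lines and the variant with R² = 0.95 are retained.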

Metabolomic and lipoproteomic data

The metabolomic dataset included measured levels of metabolites from urine and blood serum samples. The metabolite quantification was performed using Bruker’s NMR platform available at CIC bioGUNE. After excluding metabolites with zero levels across all samples or correlation between signal and calculated fit <50%, 27 urine and 29 blood serum metabolites were finally incorporated into the analysis. On the other hand, the lipoproteomic dataset encompassed 112 blood serum lipoproteins whose quantification was performed with the B.I.LISA platform from Bruker.
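The two exclusion criteria can be written as a simple filter. The sketch below is illustrative only: the function name `select_metabolites`, the toy values, and the assumption that fit quality is summarized as a single percentage per metabolite are ours, not details reported for the platform.

```python
def select_metabolites(levels, fit_correlation, min_fit=50.0):
    """Return metabolites that pass both exclusion criteria.

    levels: {metabolite: [level per sample]}
    fit_correlation: {metabolite: signal-to-fit correlation in percent}
        (assumption: one summary value per metabolite)
    """
    kept = []
    for name, values in levels.items():
        all_zero = all(v == 0 for v in values)       # zero in every sample
        poor_fit = fit_correlation[name] < min_fit   # fit correlation < 50%
        if not all_zero and not poor_fit:
            kept.append(name)
    return kept


# Toy values, invented for illustration.
levels = {
    "citrate": [1.2, 0.8, 1.0],
    "ethanol": [0.0, 0.0, 0.0],
    "glycine": [0.5, 0.0, 0.7],
}
fit = {"citrate": 92.0, "ethanol": 88.0, "glycine": 35.0}
selected = select_metabolites(levels, fit)
```

In this toy example only `citrate` is retained: `ethanol` is zero in every sample and `glycine` falls below the fit threshold.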

Identification of loss-of-function variants

Using the Ensembl variant effect predictor (VEP)³¹ along with the plugin LOFTEE³², we identified 567 variants that were annotated as loss-of-function (LoF) variants with high confidence and allele frequency ≤0.05. Finally, we removed those with unknown gene function.

Identification of variants associated with Mendelian conditions

We used the Online Mendelian Inheritance in Man (OMIM) portal³³ and the dbNSFP database³⁴ to query information for whole-exome variants that are associated with Mendelian conditions. We annotated the variants according to mode of inheritance and classified them as either pathogenic, damaging, associated with a condition or associated with a drug according to annotations provided by ClinVar³⁵. Additionally, variants were characterized as tolerable or damaging by MetaSVM³⁶. Finally, we considered only those variants whose mode of inheritance (dominant or recessive) could be matched based on the individual’s genotypes.

Polygenic scores calculation

We generated polygenic scores (PGS) for 70 molecules in our blood serum datasets with genetic score models available at the OMICSPRED repository³⁷. The OMICSPRED models we selected concerned serum metabolomics and lipoproteomics that were quantified with NMR using the Nightingale platform. Using the model weights for each biomolecular trait, we calculated PGS using PLINK 2.0³⁸, after following the instructions suggested by the OMICSPRED platform for new cohorts.
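Conceptually, a polygenic score is a weighted sum of effect-allele dosages, PGS = Σⱼ wⱼ·dⱼ, taken over the SNPs of a model. The following minimal Python sketch shows this arithmetic; the function name, SNP identifiers and weights are hypothetical, and the exact scaling reported by a given tool may differ from a plain sum.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages over SNPs present in both inputs.

    dosages: {snp_id: effect-allele dosage between 0 and 2}
    weights: {snp_id: per-allele model weight}
    """
    shared = dosages.keys() & weights.keys()  # SNPs missing from either side are skipped
    return sum(weights[snp] * dosages[snp] for snp in shared)


# Hypothetical model weights and one individual's dosages.
weights = {"rs1": 0.10, "rs2": -0.25, "rs3": 0.05}
individual = {"rs1": 2.0, "rs2": 1.0, "rs4": 0.0}
score = polygenic_score(individual, weights)
```

Here only `rs1` and `rs2` contribute, giving 0.10 × 2 − 0.25 × 1 = −0.05.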

Training of Gaussian Naive Bayes classifier

A subset of 61 individuals provided serum lipoproteomic data at two additional time-points. We used these data to assess the robustness of the initial clustering by training a Gaussian naive Bayes (GNB) classifier, which assumes independence between the predictors. Although this assumption might not hold for biological data due to interactions between biomolecules, GNB classifiers often perform well even when the independence condition is violated, and offer several advantages, such as a simpler model that is less prone to overfitting and requires less training data. These advantages make this classifier a suitable choice for our cohort.

For the training process, we used 59 serum lipoproteins that were quantified across all visits, and trained the classifier on the whole cohort by using cluster label as output and lipoprotein levels from the first visit as predictors. Then, we used the lipoprotein levels from the second and third visits to predict the cluster label for each one of the 61 individuals and evaluate the overall divergence from the initial cluster assignment.
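The train-on-first-visit, predict-on-later-visit procedure can be sketched as follows. This is a from-scratch NumPy illustration on synthetic data, not the code used in the study; the class name, the cluster centres, the noise levels and the five toy features are all assumptions. In practice a library implementation such as scikit-learn's `GaussianNB` fits the same model.

```python
import numpy as np


class GaussianNaiveBayes:
    """Minimal Gaussian naive Bayes: one independent normal per feature and class."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # A small constant guards against zero variance in constant features.
        self.vars_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + 1e-9
        self.log_priors_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        diff = X[:, None, :] - self.means_[None, :, :]           # (n, classes, features)
        log_lik = -0.5 * (
            np.log(2 * np.pi * self.vars_)[None] + diff**2 / self.vars_[None]
        ).sum(axis=2)                                             # (n, classes)
        return self.classes_[np.argmax(log_lik + self.log_priors_[None, :], axis=1)]


# Synthetic stand-in for the cohort: 4 clusters, 5 toy "lipoprotein" features.
rng = np.random.default_rng(0)
n_per_cluster, n_features = 40, 5
labels = np.repeat(np.arange(4), n_per_cluster)
centers = 5.0 * np.arange(4)[:, None] * np.ones(n_features)
visit1 = centers[labels] + rng.normal(size=(labels.size, n_features))

# Train on the whole cohort at the first visit.
model = GaussianNaiveBayes().fit(visit1, labels)

# Predict cluster labels for a followed-up subset at a later visit.
followed = rng.choice(labels.size, size=61, replace=False)
visit2 = visit1[followed] + rng.normal(scale=0.3, size=(61, n_features))
agreement = float(np.mean(model.predict(visit2) == labels[followed]))
```

The value `agreement` is the fraction of followed-up individuals whose predicted label matches their initial cluster, which is one simple way to summarize divergence from the original assignment.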

Ridge regression on serum lipoproteome

To perform ridge regression, we considered the main fractions of HDL, IDL, LDL, and VLDL that are measured in mg/dL and collectively represent the serum lipoproteome. Then, using the Reactome database³⁹, we queried 817 genes that are involved in biological processes such as the metabolism of lipids and assembly, remodeling, and clearance of plasma lipoproteins. This gene set was later reduced to 28 genes, representing genes that are involved in the OMICSPRED models and have identified SNPs in the cohort that fall within the transcript region based on GENCODE annotations⁴⁰. Finally, the mean effect across all OMICSPRED models was computed for every SNP of the 28 genes, and used to calculate 28 monogenic scores for each individual in the cohort.
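The construction of the monogenic scores can be sketched in two steps: average each SNP's effect across models, then sum dosage-weighted effects over the SNPs of each gene. The Python sketch below is illustrative only; the function names, gene label and values are hypothetical, and it assumes that the mean is taken over the models in which a SNP appears, which the text does not specify.

```python
from collections import defaultdict
from statistics import mean


def mean_snp_effects(model_weights):
    """Average each SNP's effect over the models that include it (assumption)."""
    per_snp = defaultdict(list)
    for weights in model_weights.values():
        for snp, effect in weights.items():
            per_snp[snp].append(effect)
    return {snp: mean(effects) for snp, effects in per_snp.items()}


def monogenic_scores(dosages, gene_snps, snp_effects):
    """One score per gene: sum of mean effect x dosage over that gene's SNPs."""
    return {
        gene: sum(
            snp_effects[snp] * dosages.get(snp, 0.0)
            for snp in snps
            if snp in snp_effects
        )
        for gene, snps in gene_snps.items()
    }


# Hypothetical models, gene-to-SNP mapping and dosages.
model_weights = {
    "model_A": {"rs1": 0.2, "rs2": 0.4},
    "model_B": {"rs1": 0.4},
}
gene_snps = {"GENE_A": ["rs1", "rs2"]}
dosages = {"rs1": 1.0, "rs2": 2.0}

effects = mean_snp_effects(model_weights)
scores = monogenic_scores(dosages, gene_snps, effects)
```

In this toy case the mean effects are 0.3 for `rs1` and 0.4 for `rs2`, so the score for `GENE_A` is 0.3 × 1 + 0.4 × 2 = 1.1.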

The individuals in the cohort were separated based on the identified UMAP cluster, and a ridge regression was performed on each cluster using the main fractions of a given serum lipoprotein (HDL, IDL, LDL, or VLDL) as the response variable and the 28 monogenic scores as predictors, correcting for age, sex, and BMI.
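A per-cluster ridge fit can be sketched with the closed-form solution β = (XᵀX + Λ)⁻¹Xᵀy. The code below is a minimal illustration on synthetic data rather than the analysis code of the study. The function names, the penalty strength `lam`, and the choice to leave the intercept and the covariates (age, sex, BMI) unpenalized are assumptions.

```python
import numpy as np


def ridge_fit(scores, covariates, y, lam=1.0):
    """Closed-form ridge regression that penalises only the score columns.

    Assumption: the intercept and the covariates are left unpenalised.
    Returns the coefficients of the monogenic scores.
    """
    n, n_cov = scores.shape[0], covariates.shape[1]
    X = np.column_stack([np.ones(n), covariates, scores])
    penalty = np.zeros(X.shape[1])
    penalty[1 + n_cov:] = lam
    beta = np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ y)
    return beta[1 + n_cov:]


def ridge_per_cluster(scores, covariates, y, clusters, lam=1.0):
    """Fit one ridge model per cluster label."""
    return {
        int(c): ridge_fit(
            scores[clusters == c], covariates[clusters == c], y[clusters == c], lam
        )
        for c in np.unique(clusters)
    }


# Synthetic data: 4 clusters of 30 individuals, 28 scores, 3 covariates.
rng = np.random.default_rng(1)
n = 120
clusters = np.repeat(np.arange(4), 30)
scores = rng.normal(size=(n, 28))
age = rng.uniform(25, 65, size=n)
sex = np.tile([0.0, 1.0], n // 2)
bmi = rng.normal(25, 3, size=n)
covariates = np.column_stack([age, sex, bmi])
y = 50 + 2.0 * scores[:, 0] + 0.1 * age + rng.normal(size=n)  # toy response

coefs = ridge_per_cluster(scores, covariates, y, clusters, lam=1.0)
```

Each entry of `coefs` holds 28 coefficients, one per monogenic score, for the corresponding cluster. Note that with 30 individuals and 28 predictors per cluster the unpenalized least-squares problem would be underdetermined, whereas the penalized system remains solvable.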

Code availability

The bioinformatic pipeline that was used for the identification of germline short variants from the whole-exome data is provided in the supplementary note.

Abbreviations

WES: whole-exome sequencing
MIS: Michigan imputation server
GSA: global screening array
VCF: variant call format
PGS: polygenic scores
GNB: Gaussian naive Bayes classifier
VEP: variant effect predictor
MAF: minor allele frequency

References

Shen, X. et al. Nonlinear dynamics of multi-omics profiles during human aging. Nat. aging 1–16 (2024).
Zhou, W. et al. Longitudinal multi-omics of host–microbe dynamics in prediabetes. Nature 569, 663–671 (2019).
The integrative human microbiome project. Nature 569, 641–648 (2019).
Ritchie, S. C. et al. Integrative analysis of the plasma proteome and polygenic risk of cardiometabolic diseases. Nat. Metab. 3, 1476–1483 (2021).
Tabassum, R. et al. Omic personality: Implications of stable transcript and methylation profiles for personalized medicine. Genome Med. 7, 1–17 (2015).
Foy, B. H. et al. Haematological setpoints are a stable and patient-specific deep phenotype. Nature 1–9 (2024).
Sibley, C. R., Blazquez, L. & Ule, J. Lessons from non-canonical splicing. Nat. Rev. Genet. 17, 407–421 (2016).
Holmes, M. V. et al. Lipids, lipoproteins, and metabolites and risk of myocardial infarction and stroke. J. Am. Coll. Cardiol. 71, 620–632 (2018).
Ding, Y. et al. Plasma glycine and risk of acute myocardial infarction in patients with suspected stable angina pectoris. J. Am. Heart Assoc. 5, e002621 (2015).
Allot, A. et al. Tracking genetic variants in the biomedical literature using LitVar 2.0. Nat. Genet. 1–3 (2023).
McInnes, L., Healy, J. & Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018).
Sniderman, A., Couture, P. & De Graaf, J. Diagnosis and treatment of apolipoprotein b dyslipoproteinemias. Nat. Rev. Endocrinol. 6, 335–346 (2010).
Berbée, J. F., van der Hoogt, C. C., Sundararaman, D., Havekes, L. M. & Rensen, P. C. Severe hypertriglyceridemia in human APOC1 transgenic mice is caused by apoC-I-induced inhibition of LPL. J. lipid Res. 46, 297–306 (2005).
Ricci, C. et al. STAT3 inhibition induces PCSK9 in hepatic cell line: Possible involvement in hypertriglyceridemia associated with insulin resistance. Atherosclerosis 241, e46–e47 (2015).
Kypreos, K. E. & Zannis, V. I. LDL receptor deficiency or apoE mutations prevent remnant clearance and induce hypertriglyceridemia in mice. J. lipid Res. 47, 521–529 (2006).
Cameron, J. et al. Effect of mutations in the PCSK9 gene on the cell surface LDL receptors. Hum. Mol. Genet. 15, 1551–1558 (2006).
Liu, X., Luo, X., Jiang, C. & Zhao, H. Difficulties and challenges in the development of precision medicine. Clin. Genet. 95, 569–574 (2019).
Martinez-Garcia, M. & Hernández-Lemus, E. Data integration challenges for machine learning in precision medicine. Front. Med. 8, 784455 (2022).
Xie, G. et al. Profiling of serum bile acids in a healthy chinese population using UPLC–MS/MS. J. proteome Res. 14, 850–859 (2015).
Bizkarguenaga, M. et al. Prospective metabolomic studies in precision medicine: The AKRIBEA project. in Metabolomics and its impact on health and diseases 275–297 (Springer, 2022).
Van der Auwera, G. A. & O’Connor, B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra. (O’Reilly Media, 2020).
Poplin, R. et al. Scaling accurate genetic variant discovery to tens of thousands of samples. BioRxiv 201178 (2018).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997 (2013).
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
Andrews, S. et al. FastQC: A quality control tool for high throughput sequence data. (2010).
Broad Institute. Picard tools. http://broadinstitute.github.io/picard/ (Accessed: 2018/02/21; version 2.17.8).
Purcell, S. et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
Hinrichs, A. S. et al. The UCSC genome browser database: Update 2006. Nucleic acids Res. 34, D590–D598 (2006).
McLaren, W. et al. The ensembl variant effect predictor. Genome Biol. 17, 1–14 (2016).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A. & McKusick, V. A. Online mendelian inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic acids Res. 33, D514–D517 (2005).
Liu, X., Li, C., Mou, C., Dong, Y. & Tu, Y. dbNSFP v4: A comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Med. 12, 1–8 (2020).
Landrum, M. J. et al. ClinVar: Improving access to variant interpretations and supporting evidence. Nucleic acids Res. 46, D1062–D1067 (2018).
Dong, C. et al. Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Hum. Mol. Genet. 24, 2125–2137 (2015).
Xu, Y. et al. An atlas of genetic scores to predict multi-omic traits. bioRxiv (2022).
Chang, C. C. et al. Second-generation PLINK: Rising to the challenge of larger and richer datasets. Gigascience 4, s13742–015 (2015).
Croft, D. et al. Reactome: A database of reactions, pathways and biological processes. Nucleic acids Res. 39, D691–D697 (2010).
Frankish, A. et al. GENCODE 2021. Nucleic acids Res. 49, D916–D923 (2021).

Acknowledgements

This work was supported by grants from the Spanish Ministry of Science MCIN/AEI/10.13039/501100011033 and ERDF/EU (PID2019-108244RA-I00, RTI2018-101269-B-I00, PID2023-151379OB-I00 and PID2021-124171OB-I00, RYC2020-030632-I, CEX2021-001136-S), la Caixa Foundation (ID 100010434, fellowship code LCF/BQ/LI18/11630001 to U.M.M.). O.M. and J.M.M. acknowledge CIBERehd, the European Research Council (Consolidator grant 819242) and the NIH for grants NIH 1R01DK119437-01A1, R01DK123763-20.

Author information

Authors and Affiliations

Integrative Genomics Lab, Center for Cooperative Research in Biosciences (CIC bioGUNE), Basque Research and Technology Alliance (BRTA), Bizkaia Technology Park, Derio, Basque Country, Spain
Dimitrios Kioroglou & Urko M. Marigorta
Precision Medicine and Metabolism Laboratory, Center for Cooperative Research in Biosciences (CIC bioGUNE), Basque Research and Technology Alliance (BRTA), Bizkaia Technology Park, Derio, Basque Country, Spain
Rubén Gil-Redondo, Nieves Embade, Maider Bizkarguenaga, Ricardo Conde, Oscar Millet & José M. Mato
CIBER Enfermedades Hepáticas y Digestivas, Madrid, Spain
José M. Mato
Ikerbasque, Basque Foundation for Science, Bilbao, Bizkaia, Spain
Urko M. Marigorta

Authors

Dimitrios Kioroglou
Rubén Gil-Redondo
Nieves Embade
Maider Bizkarguenaga
Ricardo Conde
Oscar Millet
José M. Mato
Urko M. Marigorta

Contributions

O.M., J.M.M., and U.M.M. conceived and supervised the study. R.C., M.B., and N.E. performed laboratory experiments. O.M. and J.M.M. supervised experiments. R.G.B. collected and preprocessed data. D.K. analyzed data. D.K. and U.M.M. wrote the manuscript. All authors have reviewed the manuscript.

Corresponding author

Correspondence to Urko M. Marigorta.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Kioroglou, D., Gil-Redondo, R., Embade, N. et al. Multi-omic integration sets the path for early prevention strategies on healthy individuals. npj Genom. Med. 10, 35 (2025). https://doi.org/10.1038/s41525-025-00491-7

Received: 17 January 2025
Accepted: 15 April 2025
Published: 03 May 2025
Version of record: 03 May 2025
DOI: https://doi.org/10.1038/s41525-025-00491-7