Unsupervised learning reveals novel disease-associated proteins in high-dimensional human proteomic data

Bernard, Elvis; Wang, Yiling; Chen, Manlin; Xu, Shunqing

doi:10.1038/s41598-026-41385-7

Article
Open access
Published: 22 February 2026

Scientific Reports volume 16, Article number: 10185 (2026)

Abstract

Modern advancements in precision medicine have led to the generation of vast proteomic datasets, capturing the concentrations of thousands of proteins across tens of thousands of participants. These datasets are traditionally processed using supervised learning methods because these methods are relatively simple to implement and their output is easy to assess. However, this approach can sometimes overlook subtle patterns that might offer deeper insights. In contrast, unsupervised learning, while capable of revealing hidden relationships, struggles with the challenge of high dimensionality, meaning that brute-force analysis could take millennia to complete. In this study, we developed the Dimensionality Reduction with Avoidance of Missing/COmmunity Detection (DIRAM/COD) framework to address this problem by combining dimensionality reduction techniques with unsupervised learning to analyze the massive proteomic dataset of the UK Biobank, which includes the concentrations of 2,923 plasma proteins from 52,691 participants. By applying this novel approach, we not only confirmed well-established biomarkers for diseases such as hypertension (UBE2L6) and leukemia (LRCH4) but also identified novel protein candidates. For instance, we identified IGF2BP3, a protein previously linked to intestinal barrier function, in connection with celiac disease, along with several other proteins not yet associated with these diseases. This approach opens up exciting possibilities for future research and may pave the way for the discovery of new biomarkers and therapeutic targets.

Introduction

Proteins circulating in the human bloodstream provide critical insights into an individual’s health status¹. These plasma proteins play essential roles in cell signaling, transport, growth, repair, and immune defense² and include proteins released from damaged cells³. The human blood proteome offers a comprehensive snapshot of an individual’s health by capturing the interactions between thousands of circulating molecules⁴. These interactions reflect the combined influences of genetics, lifestyle, environment, comorbidities, and medications^5,6,7.

Given the dynamic nature of the plasma proteome and the accessibility of blood sampling, these proteins serve as invaluable tools for diagnosing and predicting diseases⁸, identifying therapeutic targets⁹, and understanding disease mechanisms¹⁰. Proteomics has led to significant discoveries in gene‒protein interactions¹¹, disease biomarkers¹², aging¹³, and pharmacology¹⁴; however, its potential for systematically studying and assessing the risk of multiple future health conditions remains largely underexplored.

Historically, most predictive research using proteomics has been performed through supervised learning, relying on case-control studies¹⁵ to compare the plasma proteomes of healthy individuals and those diagnosed with diseases such as dementia¹⁶, Alzheimer’s¹⁷, coronary heart disease¹⁸, and type 1 diabetes¹⁹. Additionally, although shared molecular pathways have been identified in closely related diseases, limited knowledge exists concerning potential common mechanisms linking seemingly unrelated conditions. This gap highlights the need for a more systematic approach to understanding disease progression in humans.

The complexity of proteomic data presents both opportunities and challenges^20,21. High-dimensional datasets contain vast amounts of information that can reveal intricate patterns of biological variation^22,23. However, extracting meaningful insights from these data is difficult, particularly when the underlying structure is not immediately apparent²⁴. Unsupervised learning algorithms excel at identifying hidden structures in high-dimensional datasets²⁵. This capability makes them particularly useful in proteomics²⁶, where they can detect subgroups of individuals with similar protein expression profiles²⁷. By segmenting populations based on the similarity of their proteomic profiles, unsupervised learning can help uncover novel biological subtypes or identify biomarkers for disease stratification without prior knowledge of specific disease classifications or biological pathways^28,29.

Here, using the UK Biobank Proteomic dataset containing the concentrations of 2,923 plasma proteins from 52,691 participants, we explored the potential of proteomic segmentation to cluster together individuals with similar health profiles. We used two different methods, one based on density-based clustering, which we named Dimensionality Reduction with Avoidance of Missing (DIRAM), and one based on community detection, Dimensionality Reduction with COmmunity Detection (DIRCOD), to create different clusters^30,31. We showed that the clusters created using these two methods were biologically meaningful and can provide new insights into the biology of different diseases because of the unbiased nature of the unsupervised machine learning methods. With respect to the three diseases we chose to analyze in more depth, we were able to highlight known actors while also bringing attention to unknown but plausible actors. Finally, we compared the advantages and disadvantages of these two different methods, with DIRAM delivering more compact clusters and DIRCOD creating larger clusters that are more suitable for data analysis.

Results

Unsupervised analysis of a large population

The two primary challenges in identifying patterns in high-dimensional real-world data are missing values and the large number of dimensions³². Each additional dimension increases data sparsity, making patterns more difficult to detect. Imputation is a common strategy for handling missing data³³; however, in large-scale proteomic datasets, missing data often reflect technical detection limits or biological variability, making reliable imputation infeasible. To mitigate both challenges, we developed two distinct clustering workflows (see Fig. 1).

The first approach, DIRAM, focused on avoiding missing values. During our initial analysis, we observed that certain groups of protein expression data shared missing values for the same participants. In these groups, removing protein expression data with missing values did not result in any loss of information. For the remaining variables, we grouped them based on the correlation of their missing values, minimizing the impact of removing incomplete observations. Each of these subdatasets was then processed through the same workflow: (1) dimensionality reduction to two dimensions using UMAP projection³⁴; (2) cluster detection using the density-based clustering algorithm DBSCAN³⁵, which allows the detection of an unspecified number of clusters of arbitrary shape in the presence of noise; and (3) merging clusters from different subdatasets that contained similar observations. This process was applied three times: once for the first release of proteomic data (8 clusters), once for the second release (13 clusters), and once using the combined dataset from both releases (16 clusters). As a result of this approach, 37 clusters have been defined.
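The first grouping step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it only covers the case where proteins share exactly the same missing participants (the correlation-based grouping of similar patterns is not shown), and the protein names, values and helper names are hypothetical.

```python
from collections import defaultdict


def group_by_missingness(table):
    """Group proteins that are missing for exactly the same participants.

    `table` maps a protein name to its list of values, one per participant;
    None marks a missing value.
    """
    groups = defaultdict(list)
    for protein, values in table.items():
        pattern = frozenset(i for i, v in enumerate(values) if v is None)
        groups[pattern].append(protein)
    return dict(groups)


def complete_rows(table, proteins):
    """Indices of participants with no missing value across `proteins`."""
    n = len(next(iter(table.values())))
    return [i for i in range(n)
            if all(table[p][i] is not None for p in proteins)]


# Toy data with hypothetical protein names.
toy = {
    "P1": [1.0, None, 0.3, 0.8],
    "P2": [0.5, None, 0.1, 0.9],
    "P3": [0.2, 0.4, None, 0.7],
}
groups = group_by_missingness(toy)
rows_p1_p2 = complete_rows(toy, ["P1", "P2"])
```

Because P1 and P2 are missing for the same participant, dropping that participant from their subdataset removes no additional measurements, which is the property the workflow relies on.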

The second approach, DIRCOD, aimed to reduce the impact of extra dimensions. This strategy involved two main steps. First, we applied the Leiden community detection algorithm to the dataset imputed using KNN^36,37. KNN is a very common method of imputation that has already been used successfully by You et al. on half of the UK Biobank proteomic data⁸. The detected communities were subsequently analyzed in the full raw dataset (non-imputed and unfiltered) to identify protein expression variables (dimensions) that exhibited significantly different distributions within each community compared with the rest of the population. These distinct dimensions were then reintroduced into the first step. This iterative process was repeated 50 times without any sign of convergence. Consequently, the first 20 iterations were used to assign observations to clusters based on patterns appearing in these 20 community detection results. As a result of this approach, 18 additional clusters have been defined, resulting in a total of 55 clusters.

Characteristics of the unsupervised population clusters

In a first attempt to characterize the participants in each cluster, we looked at the age, sex and expression of the different proteins for each cluster (see Tables S1–S3). The age distribution of 14 clusters significantly differed from that of the whole population; among them, the age distribution of 6 clusters exhibited a p value lower than 10^−5 (see Table S1). Most clusters had a balanced population in terms of sex, while for 19 clusters, one sex represented less than 40% of the population (see Table S2). All the clusters varied greatly in terms of protein concentration. We took a very conservative approach: for a protein to be considered present at a different concentration in a group, the concentrations in the top 75% of the participants in this group had to be higher than the concentrations in the bottom 75% of the participants in the other group (see Table S3).

To better describe the participants in the different clusters, we used the ICD10 code of the diagnosis linked to each participant (see Table S4). Remarkably, the results of the analysis provided different insights: both clustering methods can identify similar clusters, as shown in Clusters A8 and B9, both of which contain a high proportion of very sick people (organ failure, transplant, cancer, etc.). Interestingly, compared with Clusters B5 and B6, which had a higher prevalence of hypertension and were characterized by higher protein concentrations, Clusters B1 and B2 had a low prevalence of hypertension-related diseases and lower concentrations of a set of proteins.

We then conducted a GWAS for each of the clusters. The most significant associations are reported in Table S5. While it is clear that some variants were positively selected in different clusters, as reflected by their extremely low associated p value, their relationship with the observed phenotype was not clear (see Fig. S1).

We finally evaluated the questionnaire data, medication prescription and blood chemistry. We did not find any significant associations between these data and the different clusters. Surprisingly, we found that the medication prescription records were most likely incomplete. Although we were expecting to find some strong medication prescribed to the participants from Clusters A8 and B9, as these participants suffer very severe disease (e.g., organ transplant, cancer, or amputation), we did not find any of these associations.

Identification of disease-related clusters

We next wanted to investigate whether any biological insights could be found from these groups of clusters (see Fig. 2). As a first step, we looked at the proteins that were common between the clusters with the same disease. We chose to focus our attention on celiac disease (ICD10 code: K90), hypertension (ICD10 code: I10) and leukemia (ICD10 code: C91). For each of these diseases, we identified several common proteins with high or low abundance (see Table S6).

To assess whether these proteins could reflect the identity of the disease cluster, we tried to create disease-rich clusters based on these proteins. First, we regrouped all the participants of the original clusters into a single cluster; then, for highly abundant proteins, we sliced the full population above a threshold corresponding to the different percentiles in the single cluster, and for the proteins with low abundance, we looked at the population below the threshold (see Fig. 3a). Using this method, we were able to create clusters with significantly higher or lower odds ratios than the original population. For celiac disease, we were able to create clusters with odds ratios of 5 or higher (see Fig. 3b). For hypertension, if we only looked at the common proteins that were differentially distributed in Clusters B1, B2, B5 and B6 for this analysis (77 proteins), we reached an odds ratio of 1.05 or higher for the overly abundant proteins and 0.8 or lower for the less abundant proteins (see Fig. 3c and d). We were also able to reach an odds ratio of 0.9 or lower using only the proteins with low abundance in the case of Cluster B15 for hypertension (see Fig. S2a and b). Finally, the same analysis was performed in the case of leukemia, and we were able to reach an odds ratio of 50 or higher, but the lower number of participants diagnosed with leukemia affected the robustness of the analysis (see Fig. 3e).

We then used this approach slightly differently by removing one protein for each of the analyses. If the removed protein has no impact on the disease, then removing it from the analysis will improve the result, but if the protein is important for the disease, then the result will appear worse. In the early iterations of the analysis, we used the Benjamini–Hochberg correction for the p value, and Cluster A7 was included in the analysis for hypertension; we noticed that removing UBE2L6 from the analysis had a dramatic effect on the odds ratio, decreasing it by 0.01 (from 1.036 to 1.026) (see Fig. S2c and d). When we moved from the Benjamini–Hochberg correction to the Bonferroni correction, we had to remove Cluster A7, and the dramatic effect of UBE2L6 was lost; however, removing it from the analysis still had a strong effect, indicating its importance for hypertension. In addition to UBE2L6, HNRNPUL1 and BECN1 had similar effects (see Table S7). These 3 proteins have already been shown to be involved in hypertension or related diseases^38,39,40. A similar analysis was conducted on the clusters with celiac disease, and IGF2BP3 was the protein with the strongest effect. This protein has been shown to be involved in intestinal barrier function⁴¹. Other proteins with a strong effect were NRXN3 and CACNB1 (see Table S7).

As the concentrations of the different common proteins seemed to be linked to the probability of having a disease, we then tried to combine the different protein concentrations into one unique value by looking at the first dimension of a PCA reduction⁴². We first applied this approach to the case of hypertension where clusters with opposite values were available. As shown in Fig. 4a, the first dimension of the PCA seemed to capture the relationship between the concentration of the proteins and the prevalence of the disease for the participants in the clusters used. When applied to the full population, the relationship between the first dimension of the same PCA reduction and the incidence of hypertension was conserved, indicating that the relationship was not specific to the population used in the PCA (see Fig. 4b). We performed the same analysis for celiac disease and found similar results (see Fig. 4c and d). Unfortunately (from a data analytical point of view), owing to the low number of participants suffering from leukemia, we were not able to conduct the same analysis.

In an attempt to better understand the clusters with a higher prevalence of leukemia, we looked at the correlation between the different proteins differentially expressed in these clusters and compared them with those in the rest of the population (see Fig. 5a). These results revealed few proteins whose correlations between the participants in our cluster were very similar to their correlations in the rest of the population (see Fig. 5b). The correlations of some pairs of proteins decreased by almost 0.3, indicating a loss of coregulation, whereas those of most of the proteins increased by up to 0.7, indicating stronger coregulation of these proteins in these participants and suggesting a potential role in leukemia (a correlation of 0.16 was significant for a sample of that size after Bonferroni correction for each pair of proteins tested). Interestingly, the main proteins whose expression levels decreased were LRCH4, WDR46, SERPINB1 and NUB1. LRCH4 has been associated with leukemia⁴³, and WDR46 has been associated with the development of gastric carcinoma, colorectal cancer and hepatocarcinogenesis⁴⁴; the serpin family is known to be involved in cancer, and NUB1 has been identified as a biomarker in cancer^45,46.

The same analysis in the case of hypertension revealed less dramatic results, with only a decrease of 0.05 and an increase of 0.12, which is in accordance with the view that hypertension is a widely multifactorial health problem (significant correlation after Bonferroni correction for that sample: 0.05) (see Fig. 5c). In the case of the clusters representative of celiac disease, we did not observe any significant decrease in correlation, but we observed a strong increase of 0.7 in correlation for almost all the proteins considered (significant correlation after Bonferroni correction for that sample: 0.09) (see Fig. 5d). These results clearly indicated that some changes in the coregulation of different proteins occur in leukemia and celiac disease.

Discussion

Plasma protein cohort datasets have primarily been analyzed using supervised machine learning approaches, and multiple predictive models have already been developed based on these data^8,16,47,48. The strong interest in plasma protein datasets underscores their significant promise for research discovery. However, to the best of our knowledge, these datasets have not yet been explored using unsupervised machine learning methods.

This research was designed to explore how unsupervised machine learning can be used to extract meaningful insights from large proteomic datasets. These datasets often contain the concentrations of thousands of proteins, and sifting through all possible combinations to find patterns is an incredibly time-consuming task. In fact, owing to the sheer volume of potential combinations (2^n for n proteins), such an exhaustive search could take longer than a human lifetime, even with thousands of iterations performed every second. In this study, we evaluated the effectiveness of the DIRAM/COD framework using two different machine learning techniques to segment the UK Biobank proteomic data into distinct clusters that have biological significance. Our goal was to identify ways in which these methods could help reveal patterns within the data that are both insightful and actionable for future research.
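The scale of this search space can be checked with a few lines of arithmetic. The sketch below counts the subsets of the 2,923 proteins and estimates the duration of an exhaustive search; the throughput of 10^12 subsets per second is a hypothetical, deliberately generous assumption and does not come from the study.

```python
# Number of protein subsets for the 2,923 measured proteins.
n_proteins = 2923
n_subsets = 2 ** n_proteins            # exact Python integer
digits = len(str(n_subsets))           # number of decimal digits

# Hypothetical throughput (assumption): 10**12 subsets tested per second.
rate_per_second = 10 ** 12
seconds_per_year = 365 * 24 * 3600
years = n_subsets // (rate_per_second * seconds_per_year)

print(f"2^{n_proteins} has {digits} decimal digits")
print(f"exhaustive search would take a number of years with {len(str(years))} digits")
```

Even at this rate, the number of years required has hundreds of digits, which is why dimensionality reduction is needed before any pattern search.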

Although both methods aim to reveal the same underlying biological patterns, they result in clusters of vastly different sizes. In the DIRAM method, where the dataset was divided to avoid missing data, we observed smaller clusters. Conversely, in the DIRCOD method, where missing data were imputed, the clusters were much larger. Although larger clusters benefit from a larger sample size, making them more likely to include individuals with rare diseases and thus offering more robust statistical power, they come with a trade-off: the imputed values may not represent the true biological reality but are instead estimates of the “best value”. To address this, we introduced a validation step using the nonimputed data, helping to minimize the bias introduced by imputation and ensuring that our findings are as accurate and reliable as possible.

To minimize false positives, we applied conservative Bonferroni corrections to the p values in this study. Although this approach ensures a high level of statistical rigor, it may not be ideal for researchers focused on more specific phenotypes or diseases, who could benefit from a less stringent correction. For instance, adjusting the p value correction affected the inclusion of Cluster A7 within the hypertension group, leading to different results. Similarly, Cluster A8, which initially had an uncorrected p value of 10⁻⁴ for celiac disease, had an increase in this value to 0.06 after correction. For more targeted studies, the Bonferroni correction may be too conservative, and researchers focused on specific conditions will likely find that less strict p value corrections are more appropriate for drawing meaningful conclusions without unnecessarily discarding relevant clusters.
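The difference in stringency between the two corrections discussed here can be illustrated with a short sketch. Both functions follow the standard textbook definitions of the adjusted p values; the raw p values are hypothetical and are not taken from the study.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values (false discovery rate control)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p values for five tests.
raw = [0.0001, 0.004, 0.02, 0.03, 0.5]
n_bonferroni = sum(p < 0.05 for p in bonferroni(raw))
n_bh = sum(p < 0.05 for p in benjamini_hochberg(raw))
print(n_bonferroni, n_bh)
```

With these toy values, fewer tests remain significant under Bonferroni than under Benjamini–Hochberg, which mirrors how a cluster can be included under one correction and excluded under the other.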

DIRAM/COD successfully identified clusters with biological significance, providing valuable insights into a variety of diseases. For example, in the case of celiac disease, our analysis highlighted several key proteins, including IGF2BP3, NRXN3, LRP2BP, and CACNB1. IGF2BP3 has already been implicated in maintaining the intestinal barrier⁴¹, whereas LRP2BP is closely related to LRP1, which has been previously associated with celiac disease⁴⁹. NRXN3, which is known to be active in the abdominal region⁵⁰, has also emerged as an important candidate. In the case of leukemia, the analysis revealed the misregulation of proteins such as LRCH4, WDR46, SERPINB1, and NUB1. LRCH4 has been previously linked to leukemia⁴³, and WDR46 has been associated with gastric carcinoma, colorectal cancer, and hepatocarcinogenesis⁴⁴. Serpin family members are well known for their involvement in cancer⁴⁵, and NUB1 has been identified as a cancer biomarker⁵¹. With respect to hypertension, we identified UBE2L6, which has been implicated in this disease³⁸, HNRNPUL1, for which some variants have been identified as risk factors for coronary heart disease³⁹, and BECN1, which has been implicated in pulmonary hypertension⁴⁰. In summary, our analysis not only reinforced the roles of well-established biomarkers but also identified potential new candidates, offering a fresh perspective on these diseases and their underlying biology.

GWAS revealed that PNLIPRP1 and PNLIPRP2 played major roles in the segmentation observed in the second method, with variants in these genes showing extremely low p values in association with several clusters. However, owing to the highly conservative approach used to determine significant protein concentration differences, proteins linked to these genes were largely excluded from the analysis, and their potential roles were not fully explored. For example, Clusters B1 and B2 presented high levels of PNLIPRP1, whereas Clusters B5 and B6 presented lower levels. These contrasting concentrations of PNLIPRP1 suggest a protective role for this protein in the context of hypertension, warranting further investigation.

Interestingly, Cluster B15 had a low incidence of hypertension despite having some characteristics of a cluster with a high incidence of hypertension (high levels of protein shown in Figs. 3 and S2 and the presence of similar SNPs as B5 and B6). This finding is in accordance with the view that hypertension is a multifaceted disease and that having some predisposing factors can be compensated by other protective factors.

Another interesting finding from this project involved Clusters B3 and B6. These clusters had odds ratios of 13.29 for hepatitis (K73) and 9.46 for disease of the spleen (D73), respectively, both among women. Although the number of cases was too low to allow us to perform a thorough analysis, these results do not appear to be coincidental, suggesting that some of the 420 common proteins that are differentially distributed could be involved in both diseases.

Two clusters that particularly attracted our attention were A8 and B9. Despite having a high number of severely ill participants, we were unable to pinpoint a clear reason behind these groupings. These clusters were characterized by a high incidence of complex, nontrivial diseases. For example, A8 had prevalence rates of 49.6% for acute renal failure, 38% for surgical procedures with later complications, and 25% for individuals with transplanted organs. The main common factor we initially considered was the potential use of heavy medication. However, a surprising finding from the drug prescription records revealed that these individuals did not appear to be prescribed strong medications or, in some cases, any medications at all. Given that we have ruled out all other available variables, we believe that the use of heavy medication may still be the most likely unifying factor behind these two clusters. This suggests that the UK Biobank’s drug prescription data may be incomplete, warranting further investigation.

In conclusion, this study demonstrates that unsupervised learning applied to proteomic datasets can provide valuable new insights into various diseases. The framework we have developed, DIRAM/COD, can help scientists achieve this goal. One key limitation we encountered was the relatively small number of participants with specific diseases. However, this issue is being actively addressed by the UK Biobank, which is expanding its dataset from 2,923 proteins in 50,000 participants to 5,400 proteins in 600,000 participants. With this significant increase in data, we believe that this approach will uncover unexpected insights into the biology of rarer diseases, offering exciting potential for future research.

Materials and methods

Study population

Plasma samples collected from 54,265 UK Biobank participants at their baseline visit were measured using Olink Explore 3072 as a part of UKB-PPP (UK Biobank application number 65851)⁵². All participants provided informed consent. A large majority of the samples were randomly selected across the UK Biobank, and only those were used for the analysis presented here. In addition, the second delivery of data containing the last 1,462 proteins had an issue for the participants from batch 8. Unless specified otherwise, these participants were excluded from the analysis, resulting in the number of participants decreasing to 45,174.

GWAS analysis

GWAS analysis was performed using the UK Biobank Research Analysis Platform (UKB RAP)⁵³. The previously created clusters were used to extract phenotypic information. The genomic data used were array data (field 22418) and imputed data (field 21008). SNPs with high missing call rates, low minor allele frequencies (MAFs) or deviations from the Hardy-Weinberg equilibrium (HWE), as well as individuals with high genotype missing call rates, were filtered out. Variants were retained if the HWE exact test p value was greater than 10^−15, and the missing call rates for the variant and sample did not exceed 0.1. In the QC of the array data, the MAF was greater than 0.01, and the minor allele count (MAC) was greater than 100. For the imputed data, the corresponding thresholds were 0.0001 and 10, respectively. Age and sex were included as covariates in the GWAS. Candidate gene mining included linkage disequilibrium (LD) clumping analysis and gene annotation.
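The variant-level thresholds above can be written as a simple predicate. This is an illustrative sketch only: the `Variant` class and `passes_qc` function are hypothetical names, the actual filtering was presumably done with dedicated GWAS tooling, and reading the imputed-data thresholds as MAF > 0.0001 and MAC > 10 is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    maf: float           # minor allele frequency
    mac: int             # minor allele count
    hwe_p: float         # Hardy-Weinberg exact test p value
    missing_rate: float  # fraction of samples with a missing call


# (minimum MAF, minimum MAC) per data source; the imputed values are read
# as MAF and MAC thresholds, which is an assumption.
THRESHOLDS = {"array": (0.01, 100), "imputed": (0.0001, 10)}


def passes_qc(variant, source):
    """Return True if the variant satisfies the thresholds for `source`."""
    min_maf, min_mac = THRESHOLDS[source]
    return (variant.maf > min_maf
            and variant.mac > min_mac
            and variant.hwe_p > 1e-15
            and variant.missing_rate <= 0.1)


# Hypothetical rare variant: kept in the imputed data, dropped from the array data.
rare = Variant(maf=0.005, mac=50, hwe_p=0.3, missing_rate=0.01)
print(passes_qc(rare, "imputed"), passes_qc(rare, "array"))
```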

Health outcome coding (diagnostic codes)

The health outcomes in the UK Biobank are defined by the International Classification of Diseases (ICD-10) and are divided into 22 disease chapters covering primary care, hospital inpatient data, etiology, location, pathology and clinical manifestations. Symptoms and signs, classification, and onset times of acute and chronic diseases were recorded. Our analysis included 237 grouped 3-character ICD10s and 1725 3-character ICD10s (field 41270).

Cluster construction

Two different methods were used to create the different clusters used in this study. The clusters named A1–A37 were built using the DIRAM method described below, and the clusters named B1–B18 were built using the DIRCOD method described below.

DIRAM method. Variables having identical or similar missing values were grouped together. Proteins with missing values for each group were discarded. The dimensionality of each of these groups was reduced to 2 dimensions using the scikit-learn implementation of the UMAP algorithm with default parameters except for the random seed. The resulting projection using 60 different random seeds (from 1 to 60) for each group was visualized to ensure that the projection used was not the result of the algorithm being trapped in a local minimum. The clusters were then detected using the scikit-learn implementation of the DBSCAN algorithm, with the parameters min_samples = 10 and eps = 0.1, with the condition that the size of a cluster should be between 100 and 20,000 participants. Finally, the clusters issued from different groups were merged together if the list of participants included was similar, meaning that the difference between the expected distribution under the hypothesis that the probability of belonging to either cluster was independent and the observed distribution was greater than 40%.
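The size condition on the detected clusters can be sketched as a post-processing step on the DBSCAN labels. The function name is hypothetical, and relabeling rejected clusters as noise (label -1, the convention scikit-learn uses for noise points) is an assumption about how the condition was applied.

```python
from collections import Counter


def keep_clusters(labels, min_size=100, max_size=20_000):
    """Relabel clusters outside [min_size, max_size] as noise (-1).

    `labels` is one cluster label per participant, for example the output of
    sklearn.cluster.DBSCAN(eps=0.1, min_samples=10).fit_predict(embedding),
    where `embedding` would be the 2-dimensional UMAP projection.
    """
    sizes = Counter(label for label in labels if label != -1)
    kept = {label for label, size in sizes.items()
            if min_size <= size <= max_size}
    return [label if label in kept else -1 for label in labels]


# Toy labels: one cluster of 150, one of 40 (too small), and 10 noise points.
labels = [0] * 150 + [1] * 40 + [-1] * 10
filtered = keep_clusters(labels)
counts = Counter(filtered)
```

In this toy case the cluster of 40 participants falls below the lower bound and is treated as noise, leaving a single retained cluster.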

DIRCOD method. The full dataset, with 45,174 participants, was used as a data object in scanpy⁵⁴. The missing values were imputed using the scanpy built-in preprocessing function nearest neighbors using the parameter (n_neighbors = 15). The resulting data frame was subjected to the Leiden community detection algorithm implemented in scanpy using default parameters. The resulting communities were characterized using nonimputed data, and proteins with different distributions in any of the clusters were selected. A dataframe using only the selected proteins was subjected to the same Leiden algorithm. The resulting communities were characterized again using nonimputed data, and proteins with different distributions were selected. This iteration of Leiden selection of proteins was repeated 20 times. Participants with similar patterns of community assignment were grouped together into the same cluster.
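The final grouping step can be sketched as follows, under a simplifying assumption: participants are grouped only when their community assignments are identical across all iterations, whereas the study groups participants with similar patterns. The function name and the toy labels are hypothetical.

```python
from collections import defaultdict


def group_by_assignment_pattern(assignments):
    """Group participants sharing the same sequence of community labels.

    `assignments[i]` lists the community label of participant i in each
    iteration. Labels only need to be consistent within an iteration.
    """
    clusters = defaultdict(list)
    for participant, pattern in enumerate(assignments):
        clusters[tuple(pattern)].append(participant)
    return list(clusters.values())


# Toy example: 5 participants, 2 iterations of community detection.
runs = [["a", "x"], ["a", "x"], ["b", "x"], ["a", "y"], ["b", "x"]]
clusters = group_by_assignment_pattern(runs)
```

Because the whole label sequence is used as the grouping key, community labels do not need to be matched between iterations.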

Definition of high concentration and low concentration of protein

In this study, we used the following definition for proteins with different concentrations (see Fig. 1): a protein was considered to have a high concentration distribution in a population if the top 3 quartiles of its distribution were higher than the upper quartile (75th percentile) of the other population. Alternatively, a protein was considered to have a low concentration distribution if the bottom 3 quartiles of the distribution were lower than the lower quartile (25th percentile) of the other population.
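One reading of this definition is that the 25th percentile of the group must exceed the 75th percentile of the other population for a "high" call, and symmetrically for a "low" call. The sketch below implements that reading; the function names, the quantile interpolation method and the toy values are assumptions.

```python
from statistics import quantiles


def lower_upper_quartiles(values):
    """Return the 25th and 75th percentiles of `values`."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3


def concentration_call(group, others):
    """Classify a protein as 'high', 'low' or 'not different' in `group`."""
    group_q1, group_q3 = lower_upper_quartiles(group)
    others_q1, others_q3 = lower_upper_quartiles(others)
    if group_q1 > others_q3:   # top 75% of group above bottom 75% of others
        return "high"
    if group_q3 < others_q1:   # bottom 75% of group below top 75% of others
        return "low"
    return "not different"


# Toy concentrations (hypothetical values).
call_high = concentration_call([10, 11, 12, 13, 14], [1, 2, 3, 4, 5])
call_low = concentration_call([1, 2, 3, 4, 5], [10, 11, 12, 13, 14])
call_none = concentration_call([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
```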

Cluster recreation

To recreate the different clusters, we looked at the values of the differently distributed proteins inside the group of clusters of interest. We then used the values corresponding to the different percentiles for each protein to apply a filter to the whole population and reported the odds ratio of the disease in the selected population. We performed the same analysis but omitted one different protein each time to determine the contribution of that specific protein.
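The procedure above can be sketched for highly abundant proteins as follows. Several details are assumptions rather than the authors' implementation: participants must exceed the threshold for every protein simultaneously, the odds ratio compares the selected participants with the rest of the population, and all names and toy values are hypothetical. The odds ratio is undefined when one of the denominator cells is zero.

```python
from statistics import quantiles


def percentile_threshold(values, pct):
    """Value at the `pct`-th percentile (1-99) of `values`."""
    return quantiles(values, n=100, method="inclusive")[pct - 1]


def select(population, thresholds):
    """True for participants above the threshold of every listed protein."""
    n = len(next(iter(population.values())))
    return [all(population[p][i] > t for p, t in thresholds.items())
            for i in range(n)]


def odds_ratio(selected, disease):
    """Odds of disease among selected participants versus the rest."""
    a = sum(s and d for s, d in zip(selected, disease))
    b = sum(s and not d for s, d in zip(selected, disease))
    c = sum((not s) and d for s, d in zip(selected, disease))
    e = sum((not s) and (not d) for s, d in zip(selected, disease))
    return (a * e) / (b * c)


def leave_one_out(population, thresholds, disease):
    """Odds ratio obtained when each protein is omitted in turn."""
    return {omitted: odds_ratio(
                select(population,
                       {p: t for p, t in thresholds.items() if p != omitted}),
                disease)
            for omitted in thresholds}


# Toy population of 8 participants; thresholds would normally come from
# percentile_threshold() applied to the clusters of interest.
population = {"P1": [5, 6, 7, 1, 2, 8, 1, 6], "P2": [9, 8, 7, 1, 9, 2, 1, 1]}
disease = [True, True, False, True, False, True, False, False]
thresholds = {"P1": 4, "P2": 5}
full = odds_ratio(select(population, thresholds), disease)
loo = leave_one_out(population, thresholds, disease)
```

In this toy example, omitting P1 lowers the odds ratio while omitting P2 leaves it unchanged, so P1 would be flagged as the more informative protein.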

PCA

PCA was performed using the scikit-learn implementation of the PCA solver. The PCA transformation was fitted using the participants present in the clusters of interest in the first step. The participants in these clusters were subsequently binned into different groups of the same size depending on their value on the first dimension of the PCA, and the prevalence of the disease was subsequently calculated for each bin (see Fig. 4a and c). The same transformation was then applied to the whole population and reported separately (see Fig. 4b and d). A linear regression was fitted through all the data points except the lowest and the highest points due to the noise associated with these points.
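The binning step can be sketched independently of the PCA itself. Here `scores` stands for each participant's value on the first principal component (for example obtained with scikit-learn's `PCA(n_components=1)`); the function name, the handling of the remainder in the last bin and the toy values are assumptions.

```python
def prevalence_by_bin(scores, disease, n_bins):
    """Disease prevalence in equal-size bins of participants ordered by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size = len(order) // n_bins
    prevalence = []
    for b in range(n_bins):
        # The last bin absorbs any remainder (assumption).
        members = order[b * size:(b + 1) * size] if b < n_bins - 1 else order[b * size:]
        prevalence.append(sum(disease[i] for i in members) / len(members))
    return prevalence


# Toy scores on the first principal component and disease status (1 = diagnosed).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
disease = [0, 0, 0, 1, 0, 1, 1, 1]
bins = prevalence_by_bin(scores, disease, n_bins=4)
```

A linear regression through these per-bin prevalences, excluding the two extreme bins as described above, would then summarize the trend.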

Correlation analysis

Proteins present in all considered clusters or in all but one of the considered clusters were used for this analysis. Proteins with a correlation of 0.8 or higher with any other protein were kept; the others were discarded. Two correlation matrices were constructed for these proteins, the first one (A) using the values inside the group of clusters and the second one (B) using all the other participants. The result presented in this study is the matrix resulting from the difference between these two matrices (A–B) (see Fig. 5a).
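The difference matrix A–B can be sketched with a plain Pearson correlation. This is a minimal illustration that assumes complete data (no missing values) and uses hypothetical protein names and values; it omits the filtering of proteins described above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_matrix(data, proteins):
    """Pairwise correlations between the listed proteins."""
    return [[pearson(data[p], data[q]) for q in proteins] for p in proteins]


def difference(matrix_a, matrix_b):
    """Element-wise difference A - B."""
    return [[a - b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(matrix_a, matrix_b)]


proteins = ["P1", "P2"]
inside = {"P1": [1, 2, 3, 4], "P2": [2, 4, 6, 8]}    # group of clusters (A)
outside = {"P1": [1, 2, 3, 4], "P2": [4, 3, 2, 1]}   # all other participants (B)
delta = difference(correlation_matrix(inside, proteins),
                   correlation_matrix(outside, proteins))
```

Positive entries indicate pairs that are more strongly correlated inside the group of clusters than in the rest of the population, and negative entries indicate a loss of correlation.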

Statistical analysis and data manipulation

Unless stated otherwise, all statistical analyses were performed using the Python language in the Jupyter environment, and all reported p values were adjusted for multiple testing with the Bonferroni correction. The following software versions were used: Python: 3.9.23; pandas: 2.3.0; numpy: 1.24.4; scipy: 1.13.1; scikit-learn: 1.6.1; matplotlib: 3.8.4; seaborn: 0.13.2; forestplot: 0.4.1; gwaslab: 3.6.8; and REGENIE: 2.0.0. Unless stated otherwise, all data analyses used the nonimputed data to avoid any bias that could have been introduced by imputation. Notebooks containing the code used in this study are available at https://github.com/Elvisbernard/unsupervised-learning-on-proteomics.
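
For reference, the Bonferroni adjustment multiplies each p value by the number of tests in the family and caps the result at 1. The sketch below shows this with numpy; the function name `bonferroni` is hypothetical, and which tests were grouped into one family in the study is not specified here.

```python
import numpy as np


def bonferroni(p_values):
    """Bonferroni-adjusted p values: p * number of tests, capped at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)
```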

Data availability

The data used in the present study are available from the UK Biobank with restrictions applied. Data were used under license and are thus not publicly available. Access to the UK Biobank data can be requested through a standard protocol (https://www.ukbiobank.ac.uk/register-apply/).

Acknowledgements

We thank the participants and investigators of the UK Biobank study who made this work possible (Resource Application Number 96054). We would like to thank Gaga Mahai, Lin Hai, Hongyan Yin and Xinyi Wan for their useful comments.

Funding

S.X. was funded by the Collaborative Innovation Center of One Health, Hainan University (XTCX2022JKA02), and the Innovation Fund for Scientific and Technological Personnel of Hainan Province (KJRC2023B02).

Author information

These authors contributed equally: Elvis Bernard and Yiling Wang.

Authors and Affiliations

School of Environmental Science and Engineering, Hainan University, Haikou, 570228, China
Elvis Bernard & Shunqing Xu
School of Life and Health Sciences, Hainan University, Haikou, 570228, China
Yiling Wang & Manlin Chen

Contributions

Conceptualization, E.B.; Methodology, E.B.; Data curation, E.B. and Y.W.; Analysis, E.B., Y.W. and M.C.; Writing, E.B., Y.W. and M.C.; Supervision, E.B. and S.X.; Funding acquisition, S.X.

Corresponding authors

Correspondence to Elvis Bernard or Shunqing Xu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethics approval

This study used UK Biobank Resources, which have ethics approval from the North West Multicenter Research Ethics Committee (REC reference number 21/NW/0157). Under this approval, no further ethical approval is required for registered secondary analyses of the UK Biobank data. The study was performed in accordance with the Declaration of Helsinki. All the participants provided written informed consent to the UK Biobank, confirming their willingness to participate in the study.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (CSV)
Supplementary Material 2 (CSV)
Supplementary Material 3 (XLSX)
Supplementary Material 4 (XLSX)
Supplementary Material 5 (XLSX)
Supplementary Material 6 (XLSX)
Supplementary Material 7 (XLSX)
Supplementary Material 8 (DOCX)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Bernard, E., Wang, Y., Chen, M. et al. Unsupervised learning reveals novel disease-associated proteins in high-dimensional human proteomic data. Sci Rep 16, 10185 (2026). https://doi.org/10.1038/s41598-026-41385-7

Received: 03 December 2025
Accepted: 19 February 2026
Published: 22 February 2026
Version of record: 26 March 2026
DOI: https://doi.org/10.1038/s41598-026-41385-7