Introduction

Osteoporosis (OP) is a systemic bone disease characterized by decreased bone mass, destruction of bone microarchitecture, and increased bone fragility1. The imbalance in the tightly coupled process of osteoclast resorption and osteoblast formation is an essential reason for its occurrence. With the aging population, the incidence of OP is rising, presenting significant challenges for patients, their families and social healthcare systems. The most severe consequence of OP is osteoporotic fractures, with hip and vertebral fractures being the most common types. As a global public health issue, OP has garnered considerable attention from the medical community and society due to its high incidence and fracture risk.

In recent years, biomarkers developed for the clinical management of OP have demonstrated sensitivity and reliability in identifying individuals at high risk of fracture. To date, various studies have evaluated whether serum levels of specific biomarkers mirror bone phenotypes and which of these biomarkers are associated with osteoporosis and/or fragility fracture2,3,4. Some biomarkers, such as miR-21-5p, have been identified in more than one study and are invaluable for predicting the risk of OP, identifying potential therapeutic targets, and exploring underlying mechanisms5,6,7. Furthermore, the role of the immune system in orthopedic diseases has been confirmed, leading to the emergence and development of the field of ‘bone immunology’. Research indicates that immune cells contribute significantly to the onset and progression of OP8. A comprehensive understanding of the complex interplay between immune regulation and bone homeostasis will clarify the molecular pathological mechanisms of OP and facilitate the development of new immunotherapeutic targets.

Due to the heterogeneity of samples and tissues in independent studies, the results of most gene array analyses have been limited or inconsistent, resulting in poor reproducibility and reliability of findings. To address these shortcomings, multiple machine learning methods and expression profiling techniques were employed. This study screened differentially expressed genes between the OP and control groups using the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/). The Least Absolute Shrinkage and Selection Operator (LASSO), Support Vector Machine Recursive Feature Elimination (SVM-RFE), and Random Forest (RF) algorithms were applied to identify hub genes, and the receiver operating characteristic (ROC) curve was used to verify the performance of the hub genes. In addition, the hub genes were validated in external datasets and clinical samples. Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Gene Set Enrichment Analysis (GSEA) were performed to analyze the enrichment of hub genes associated with OP. Finally, the CIBERSORT algorithm was applied to analyze the relationship between hub genes and immune cell composition, thereby enhancing our understanding of the molecular and immune mechanisms underlying the occurrence and progression of OP.

Methods

Data collection

For this analysis, three publicly available datasets of peripheral blood mononuclear cells (PBMCs) were utilized: GSE7158, GSE13850, and GSE230665. The selection of data series was based on the following criteria: (1) each data series must contain a minimum of 15 samples; (2) it must include a comprehensive whole-genome expression matrix9. These datasets are accessible through the public GEO microarray database. Specifically, the combined data from GSE7158 and GSE13850 served as the training set. The GSE7158 dataset (GPL570 platform) comprised 12 women with OP and 14 controls with normal bone mass, while the GSE13850 dataset (GPL96 platform) included 10 women with OP and 10 controls with normal bone mass. The GSE230665 microarray dataset (GPL10332 platform) was employed as the validation dataset to verify hub genes, which consisted of 12 women with OP and 3 controls. The research methods employed in this study are illustrated in Fig. 1.

Fig. 1

The workflow chart of this study. LASSO, Least Absolute Shrinkage and Selection Operator; SVM-RFE, Support Vector Machine Recursive Feature Elimination; RF, Random Forest; OP, osteoporosis; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes; GSEA, Gene Set Enrichment Analysis; qRT-PCR, quantitative Reverse Transcription-Polymerase Chain Reaction.

Data preprocessing and differential expression analysis

Data analysis was conducted using R (version 3.6.1; https://www.r-project.org/) along with relevant software packages. The avereps function of the limma package was used to average the expression values of replicate probes. Negative gene expression values were set to zero, followed by a log transformation of the data. The expression profiles were normalized with the normalizeBetweenArrays function provided by the limma package10. To address inter-batch variation between the two datasets, the ComBat function from the sva package was used to minimize possible confounding factors10,11. Principal Component Analysis (PCA) was conducted using the bioPCA function, and scatter plots of the PCA results illustrated the effects of batch correction. Differential expression analysis was performed on the corrected data, followed by empirical Bayesian estimation using the eBayes function to identify differentially expressed genes. The significance thresholds, p < 0.05 and |log2 fold change (FC)| > 0.585, were set following a previous study12 and were also used to select genes for the heatmap. The heatmap was generated using the ‘pheatmap’ package in R.
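The thresholding step can be illustrated with a minimal sketch. It is written in Python rather than R purely for illustration, it is not the pipeline used in this study, and the gene names and values in `toy_results` are hypothetical. Only the two cut-offs (p < 0.05 and |log2FC| > 0.585) come from the text; a log2 fold change of 0.585 corresponds to roughly a 1.5-fold change on the linear scale.

```python
# Thresholds quoted in the text: p < 0.05 and |log2FC| > 0.585.
P_CUTOFF = 0.05
LOG2FC_CUTOFF = 0.585


def fold_change_from_log2(log2fc):
    """Convert a log2 fold change back to a linear fold change."""
    return 2.0 ** log2fc


def is_differentially_expressed(log2fc, p_value):
    """Apply both cut-offs; the sign of log2fc is ignored."""
    return p_value < P_CUTOFF and abs(log2fc) > LOG2FC_CUTOFF


def select_degs(results):
    """results: iterable of (gene, log2fc, p_value) tuples."""
    return [gene for gene, lfc, p in results
            if is_differentially_expressed(lfc, p)]


# Hypothetical rows for illustration only (not values from the study).
toy_results = [
    ("GENE_A", -0.90, 0.003),  # passes both cut-offs
    ("GENE_B", 0.40, 0.010),   # significant, but fold change too small
    ("GENE_C", 0.75, 0.200),   # large fold change, not significant
    ("GENE_D", 0.62, 0.040),   # passes both cut-offs
]
selected = select_degs(toy_results)
```

In this toy input only GENE_A and GENE_D satisfy both criteria, which shows that the two conditions are applied jointly rather than as alternatives.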

Machine learning model construction and hub gene screening

To identify hub genes associated with OP, the LASSO, SVM-RFE, and RF algorithms were used. LASSO regression is a widely used machine learning algorithm for fitting generalized linear models11. The parameter λ controls the complexity of the model: as λ increases, linear models with many variables incur a harsher penalty. This technique simplifies the model to prevent overfitting and performs well in high-dimensional data modeling, feature interpretation, and model stability13. The SVM-RFE algorithm, which performs feature extraction through an embedded method, is widely used in pattern recognition and machine learning. RF is a machine learning technique that combines numerous independent decision trees to predict classification or regression outcomes; constructing multiple decision trees and integrating their predictions enhances model accuracy and robustness. LASSO, SVM-RFE, and RF have been reported to be valuable for identifying hub genes, and these three algorithms have been widely used to identify diagnostic or prognostic factors14. In our study, LASSO regression analysis of candidate hub genes was conducted using the ‘glmnet’ R package with ‘binomial’ as the family to classify the OP and control groups15. The optimal value of λ was determined through tenfold cross-validation, selecting the value that minimized the cross-validation criterion. We used the ‘e1071’ package with a ‘linear’ kernel in R to create SVM-RFE models. The SVM-RFE model was established using tenfold cross-validation, and the best variables were selected based on the minimum 10 × CV error16. We constructed our random forest model using the ‘randomForest’ package in R. The analysis involved 500 trees and used the Gini impurity measure to rank the feature genes; genes with an importance value exceeding 1 were considered potential candidates9. Finally, we created a Venn diagram using the ‘VennDiagram’ package to illustrate the intersection of genes identified by the three methods, which were designated as the hub genes for this analysis. The ‘pROC’ package in R was employed to generate the ROC curve and calculate the Area Under the Curve (AUC) value to evaluate the accuracy and efficiency of the hub genes. To further validate the hub genes, we used the external GSE230665 dataset as a validation dataset for the machine learning model, allowing for an additional assessment of the diagnostic ability of the hub genes.
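The final selection step, taking the genes retained by all three methods, can be sketched as follows. This is an illustrative Python sketch and not the R code used in this study; the placeholder gene names (G1 to G6), the LASSO and SVM-RFE selections, and the importance scores are all hypothetical. Only the importance cut-off of 1 and the intersection rule come from the text.

```python
def rf_candidates(importance, cutoff=1.0):
    """Keep genes whose importance score exceeds the cut-off."""
    return {gene for gene, score in importance.items() if score > cutoff}


def intersect_selections(selections):
    """Return the genes chosen by every method, sorted for stable output."""
    gene_sets = [set(genes) for genes in selections.values()]
    return sorted(set.intersection(*gene_sets))


# Hypothetical selections for illustration only.
lasso_genes = {"G1", "G2", "G3", "G4", "G5"}
svm_rfe_genes = {"G2", "G3", "G4", "G6"}
rf_importance = {"G1": 0.4, "G2": 2.1, "G3": 1.3, "G4": 0.9, "G6": 1.8}

hub_genes = intersect_selections({
    "LASSO": lasso_genes,
    "SVM-RFE": svm_rfe_genes,
    "RF": rf_candidates(rf_importance),
})
```

Here G4 is selected by LASSO and SVM-RFE but falls below the importance cut-off, so it is excluded; requiring agreement across all methods trades sensitivity for a more conservative gene list.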

Patients and samples

The OP patients involved in this study were from the Department of Endocrinology at Gansu Provincial Hospital, and the controls with normal BMD were recruited from individuals undergoing routine health check-ups at our hospital’s Health Examination Center. Participants with other associated metabolic diseases, such as hyperparathyroidism, osteoarthritis, and diabetes, were excluded9. Additionally, patients with cancer, infections, or those who had undergone long-term treatment with immunosuppressants or anticancer drugs were also excluded. All participants underwent dual-energy X-ray absorptiometry (DXA) scans of the total lumbar spine (L1–L4), total hip, and femoral neck. According to the WHO diagnostic classification, osteoporosis is defined by a BMD T-score at the hip or lumbar spine (L1–L4) ≤ −2.5 SD, while normal BMD is defined by a T-score at the hip or lumbar spine (L1–L4) ≥ −1.0 SD17. Ultimately, 5 patients with OP and 5 controls were selected for candidate gene verification. Detailed characteristics of the study subjects are summarized in Supplementary Table 1. The samples obtained from these participants were used solely for quantitative reverse transcription-polymerase chain reaction (qRT-PCR) analysis to validate the expression of the hub genes identified through our prior bioinformatic screening. The research project adhered to the guidelines set forth in the Declaration of Helsinki, and it received ethical approval from the Ethics Committee at Gansu Provincial Hospital18.
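The T-score criteria can be written as a small classification rule. The Python sketch below is illustrative only. Classifying on the lowest T-score across the measured sites is an assumption made here, since the text does not state how discordant sites were handled, and the site names are hypothetical dictionary keys. The intermediate category (T-score between −2.5 and −1.0) is the standard WHO low bone mass range and does not correspond to either study group.

```python
OSTEOPOROSIS_CUTOFF = -2.5  # T-score at or below this value: osteoporosis
NORMAL_CUTOFF = -1.0        # T-score at or above this value: normal BMD


def classify_bmd(t_scores):
    """Classify a participant from a dict of site -> T-score.

    Assumption: the lowest T-score across sites decides the category.
    """
    lowest = min(t_scores.values())
    if lowest <= OSTEOPOROSIS_CUTOFF:
        return "osteoporosis"
    if lowest >= NORMAL_CUTOFF:
        return "normal"
    return "low bone mass"


# Hypothetical participants for illustration only.
example_op = {"lumbar_spine": -2.7, "total_hip": -1.9, "femoral_neck": -2.1}
example_control = {"lumbar_spine": -0.3, "total_hip": 0.1, "femoral_neck": -0.8}
example_between = {"lumbar_spine": -1.6, "total_hip": -0.9, "femoral_neck": -1.2}
```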

qRT-PCR-based validation of candidate genes

Peripheral blood samples were collected and PBMCs were isolated using previously described methods19. A total of 5 mL of peripheral venous blood was drawn into an EDTA anticoagulant blood collection tube20. The entire blood sample was transferred into a 50-mL centrifuge tube, diluted with 10 mL of PBS, and gently mixed. The sample was then centrifuged at 2000 rpm for 20 min. After centrifugation, the blood sample was stratified, and the leukocyte layer, which contained PBMCs, was carefully aspirated with a pipette and transferred into a new 15-mL centrifuge tube containing 10–15 mL of PBS. The suspension was centrifuged at 1500 rpm for 10 min, and the supernatant was discarded to retain the PBMCs19. Total RNA was extracted from the cells using 1 mL of TRIzol reagent. Thereafter, 1 µg of total RNA was reverse transcribed using a cDNA synthesis kit (Yeasen, Shanghai, China). qRT-PCR was conducted using the 1x Hieff qPCR SYBR Green Master Mix (Yeasen, Shanghai, China) on a real-time PCR machine (ABI-7500, Applied Biosystems, USA)18. Specific primers for mRNAs were synthesized by Sangon Biotech (Shanghai, China), and all primer sequences are provided in Supplementary Table 2. Representative melting curves showing a single distinct peak for each primer pair, indicating primer specificity, are included as Supplementary Figure S1. Amplification curves supporting the accuracy of quantification are shown in Supplementary Figure S2. To determine relative mRNA expression levels, GAPDH was used as an internal reference, and the relative expression of each gene was calculated using the 2^(−ΔΔCT) method. All experiments were repeated three times. Average Ct values were calculated as the arithmetic mean of the three technical replicate Ct values.
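The relative quantification can be made concrete with a short worked example. This Python sketch assumes the usual formulation, in which ΔCT is the target-gene Ct minus the reference-gene (GAPDH) Ct and ΔΔCT is the sample ΔCT minus the calibrator (control) ΔCT. All Ct values below are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean


def delta_ct(target_cts, reference_cts):
    """Mean target Ct minus mean reference Ct over technical replicates."""
    return mean(target_cts) - mean(reference_cts)


def relative_expression(sample_target, sample_reference,
                        calibrator_target, calibrator_reference):
    """Relative expression of the sample versus the calibrator: 2^(-ddCt)."""
    ddct = (delta_ct(sample_target, sample_reference)
            - delta_ct(calibrator_target, calibrator_reference))
    return 2.0 ** (-ddct)


# Hypothetical triplicate Ct values (not data from the study).
op_target = [26.0, 26.5, 25.5]       # target gene, OP sample
op_gapdh = [18.0, 18.0, 18.0]        # GAPDH, OP sample
control_target = [25.0, 25.0, 25.0]  # target gene, control sample
control_gapdh = [18.0, 18.0, 18.0]   # GAPDH, control sample

fold = relative_expression(op_target, op_gapdh, control_target, control_gapdh)
```

With these numbers ΔCT is 8 for the OP sample and 7 for the control, so ΔΔCT is 1 and the relative expression is 2^(−1) = 0.5, meaning the target is expressed at half the control level.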

Gene ontology and Kyoto encyclopedia of genes and genomes pathway enrichment analyses

Based on the gene expression data, genes were divided into two groups, “con” and “treat”, and corresponding linear models were created. The lmFit function was used to fit the linear model, while the fit function was employed to compare the models. A threshold of |log2FC| > 1 and an adjusted p-value < 0.05 were used to screen for significantly differentially expressed genes. A differential heat map and co-expression map between gene co-expression groups were generated, and enrichment analysis was performed using the co-expression results. The enrichGO function was used for GO enrichment analysis, and the results were presented from the perspectives of Biological Process (BP), Cellular Component (CC), and Molecular Function (MF). The R package “clusterProfiler” was used for KEGG enrichment analysis21.

GSEA analysis of hub genes

The GSEA method evaluates how the genes of a predefined gene set are distributed within a gene list ranked by its correlation with phenotype. In this study, genes in the integrated GEO dataset (GSE13850 and GSE7158) were first sorted by logFC value, followed by GSEA using the “clusterProfiler (v3.6.1)” package in R software. The reference gene set, which is the annotated gene set (c2.cp.kegg.Hs.symbols.gmt), was obtained from the GSEA website (http://www.broadinstitute.org/gsea/index.jsp). To visualize the GSEA results, the “ggplot2” and “enrichplot” R packages were used10. An adjusted p-value < 0.05 was considered statistically significant.
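The idea behind the enrichment score can be sketched with a simplified running-sum statistic. The Python code below implements the unweighted (Kolmogorov-Smirnov-like) variant, which is a deliberate simplification: GSEA implementations typically weight hits by the ranking metric and assess significance by permutation, neither of which is shown here. The gene names and logFC values are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score over a ranked gene list.

    Walk down the list, stepping up for genes in the set and down
    otherwise; return the running-sum value furthest from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must partially overlap the ranked list")
    running = 0.0
    extreme = 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


# Hypothetical logFC values; genes are ranked from most up- to most down-regulated.
logfc = {"g1": 1.2, "g2": 0.9, "g3": 0.3, "g4": -0.1, "g5": -0.8, "g6": -1.5}
ranked = sorted(logfc, key=logfc.get, reverse=True)

es_top = enrichment_score(ranked, {"g1", "g2"})     # set concentrated at the top
es_bottom = enrichment_score(ranked, {"g5", "g6"})  # set concentrated at the bottom
```

A gene set clustered at the top of the ranking yields a positive score and one clustered at the bottom yields a negative score, which is how the sign of the enrichment score is read.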

Correlation analysis between hub genes and immune cell composition

The CIBERSORT algorithm was employed to investigate immune phenotype shifts. Single sample gene set enrichment analysis (ssGSEA) was used to assess the abundance of immune cell composition in both the OP and control groups22,23. The differences in immune cell composition were visualized using violin plots generated by the ‘ggplot2’ package, while correlation heatmaps depicting the relationships among the 22 immune cell types were created with the ‘ggcorrplot’ package. Finally, Spearman’s correlation analysis was performed to evaluate the relationships between hub genes and the composition of various immune cells, with results visualized using the ‘ggcorrplot’ and ‘ggstatsplot’ packages.
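Spearman’s correlation is the Pearson correlation computed on ranks, which the following self-contained Python sketch makes explicit. It is an illustration rather than the R code used in this study, it omits the p-value calculation, and the expression values and cell fractions are hypothetical.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-sample values (not data from the study).
gene_expression = [5.1, 4.8, 6.0, 5.5, 4.2]
cell_fraction = [0.10, 0.12, 0.02, 0.05, 0.20]
rho = spearman(gene_expression, cell_fraction)
```

Because the toy cell fraction falls monotonically as expression rises, rho is −1 even though the relationship is not linear, which is the reason a rank-based measure suits estimated cell proportions.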

Construction of ceRNA networks

To investigate the interactions and target-binding relationships among various types of ceRNA in OP, we predicted the miRNAs that target the hub genes using miRDB (https://mirdb.org/), miRanda (version 3.3a; http://www.microrna.org/) and TargetScan (version 8.0; https://www.targetscan.org/vert_80/). Additionally, the SpongeScan (http://spongescan.rc.ufl.edu) database was employed to identify the upstream lncRNAs that interact with the selected miRNAs. mRNAs were set as the initial targets, and their corresponding miRNAs and lncRNAs were subsequently linked. The predicted binding relationships among lncRNAs, miRNAs, and mRNAs were visualized through an analysis of ceRNA-targeting relationships11.

Results

The gene expression matrices of the GSE7158 and GSE13850 datasets were normalized and batch-corrected. The PCA scatter plots illustrating the datasets before and after normalization are shown in Fig. 2A and B, respectively. The results indicate that GSE13850 (red dots) exhibits more pronounced clustering following normalization, whereas GSE7158 (blue dots) still shows high variability after normalization, which may be related to sample heterogeneity or technical variation in the dataset. Differential analysis of the corrected data identified 12 statistically significant differentially expressed genes (LGALS4, LOC100996756, KLRC1, IL32, KLRF1, CCR5, GZMM, ZFP69B, GLI1, KIF13B, TMEM176A, TMEM176B) between the OP and control groups (|log2FC| > 0.585, p-value < 0.05). The differentially expressed genes are displayed in a heatmap (Fig. 2C). The heatmap appears to show more pronounced differences in gene expression between OP and control samples in the GSE7158 dataset, which is consistent with the PCA results and may reflect that this dataset is more sensitive to disease-related variation.

Fig. 2

(A) PCA cluster maps of GSE13850 and GSE7158 datasets before sample correction. (B) PCA cluster plots after sample correction for GSE13850 and GSE7158 datasets. (Red represents the GSE13850 dataset, and blue represents the GSE7158 dataset). (C) Heatmap plot of the differential gene analysis. The red represents high gene expression, the blue represents low gene expression, the green bars represent the GSE13850 dataset, and the orange bars represent the GSE7158 dataset. The purple bars indicate OP groups, and the blue bars indicate control groups. The heatmap was generated using R software (version 3.6.1; https://www.r-project.org/) with the pheatmap package (version 1.0.13; https://cran.r-project.org/package=pheatmap). PCA, Principal Component Analysis.

Identification of hub genes through machine learning

Three machine learning algorithms, namely LASSO, RF, and SVM-RFE, were used to further screen hub genes associated with OP. The LASSO algorithm identified seven feature variables when the lambda value was minimized (Fig. 3A and B). SVM-RFE was performed to extract features demonstrating significant diagnostic efficiency. Consequently, we identified seven differentially expressed genes (DEGs) that exhibited the lowest classifier error and the highest classifier accuracy, with a minimal error of 0.176 and a maximal accuracy of 0.824 (Fig. 3C and D). For the RF algorithm, we set the number of trees in the random forest classifier to 500 so that the out-of-bag error was stable (Fig. 3E). The RF algorithm ranked the genes by Gini importance and selected those with an importance score greater than 1, resulting in a total of nine genes (Fig. 3F). Finally, the intersection of the results obtained from the three machine learning methods, depicted in a Venn diagram, identified four hub genes (CCR5, KIF13B, LGALS4, and ZFP69B) (Fig. 3G).

Fig. 3

Identification of hub genes of OP through machine learning. (A) Cross-validation plot of the LASSO regression analysis. The top X-axis represents the number of variables in the model, decreasing from left to right; the bottom X-axis represents the log(λ) value. The Y-axis shows the binomial deviance, which represents the degree of difference between the predicted and true values; the smaller the deviance, the better the model. The left dotted line (lambda.min) marks the X-axis value corresponding to the minimum error. Seven candidate hub genes were selected at the optimal log(λ) value with the minimum cross-validation error. (B) LASSO coefficient distribution plot of the 7 candidate hub genes used to construct the diagnostic signature. Each curve corresponds to a candidate hub gene. (C, D) The SVM-RFE algorithm identified the candidate hub genes with the lowest error rate (10 × CV error rate = 0.176) and the highest accuracy (10 × CV accuracy rate = 0.824). The X-axis represents the number of features, and the Y-axis represents the error rate and accuracy rate. (E) The impact of the number of decision trees on the error rate; green, red, and black represent OP, control, and all samples, respectively. (F) Gini importance plot with mean decrease Gini on the horizontal axis and genes on the vertical axis. (G) Venn diagram showing the hub genes obtained from the intersection of the 3 machine learning methods: LASSO regression (red circle), SVM-RFE (blue circle) and RF (green circle). LASSO, Least Absolute Shrinkage and Selection Operator; SVM-RFE, Support Vector Machine Recursive Feature Elimination; RF, Random Forest; OP, osteoporosis.

Validation of hub gene expression

Violin plots were used to show the expression of hub genes in the OP and control groups in the training set and the validation set. ROC curves were used to determine the AUC and 95% confidence interval (CI) for each gene and to evaluate diagnostic ability in the training set and the validation set. In the training set, CCR5, KIF13B, and ZFP69B were downregulated and LGALS4 was upregulated in the OP group. The differences were statistically significant (p < 0.05) (Fig. 4A-D). The AUC values for CCR5 (AUC: 0.805, 95% CI: 0.667–0.917), KIF13B (AUC: 0.769, 95% CI: 0.623–0.903), LGALS4 (AUC: 0.761, 95% CI: 0.616–0.892) and ZFP69B (AUC: 0.716, 95% CI: 0.557–0.852) were all greater than 0.6 (Fig. 4E-H). To screen for genes with greater diagnostic value, the hub genes were subsequently validated in the GSE230665 validation dataset, and only significant results were retained (p < 0.05, AUC > 0.8). The results revealed that CCR5 and KIF13B were downregulated in the OP group (Fig. 4I and J) and that the AUC values for CCR5 (AUC: 1.000, 95% CI: 1.000–1.000) and KIF13B (AUC: 0.944, 95% CI: 0.778–1.000) exceeded 0.8 (Fig. 4K and L). These findings suggest that the hub genes showed strong diagnostic efficacy and discriminated well between the groups, supporting their potential as biomarkers for OP. The visualization results are shown in Fig. 4.
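The AUC values reported here have a simple interpretation: the probability that a randomly chosen sample from one group has a higher expression value than a randomly chosen sample from the other group. The Python sketch below computes this directly by pairwise comparison. It is an illustration and not the pROC implementation, and the expression values are hypothetical. Reporting max(AUC, 1 − AUC) for a downregulated gene is an assumption made here so that lower expression in cases still counts as good discrimination.

```python
def auc(case_values, control_values):
    """Probability that a case value exceeds a control value (ties count 0.5)."""
    wins = 0.0
    for case in case_values:
        for control in control_values:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


def direction_agnostic_auc(case_values, control_values):
    """AUC regardless of whether cases tend to be higher or lower."""
    value = auc(case_values, control_values)
    return max(value, 1.0 - value)


# Hypothetical expression values for a gene that is lower in cases.
op_expression = [4.1, 4.3, 4.0, 4.6]
control_expression = [5.0, 4.5, 5.2]

raw_auc = auc(op_expression, control_expression)
reported_auc = direction_agnostic_auc(op_expression, control_expression)
```

Only one of the twelve case-control pairs has the case above the control, so the raw value is 1/12 and the direction-agnostic value is 11/12; an AUC of 1.000 would require every pair to be ordered the same way.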

Fig. 4

(A-D) Violin plots of the expression of the four hub genes in the OP and control groups in the training set: (A) CCR5, (B) KIF13B, (C) LGALS4, and (D) ZFP69B. (E-H) ROC curve analysis of the hub genes (E) CCR5, (F) KIF13B, (G) LGALS4, and (H) ZFP69B in the training set. (I, J) Violin plots of the expression of the hub genes CCR5 and KIF13B in the GSE230665 validation set. (K, L) ROC curve analysis of the hub genes (K) CCR5 and (L) KIF13B in the GSE230665 validation set. (*p < 0.05, **p < 0.01, ***p < 0.001). OP, osteoporosis.

qRT-PCR validation

mRNA from PBMCs isolated from whole blood was used to assess the expression of the two hub genes, CCR5 and KIF13B, identified through our bioinformatics analysis. The qRT-PCR results indicated that the mRNA expression levels of these two hub genes were significantly lower in the OP group than in the control group (Fig. 5A and B). The expression levels of CCR5 and KIF13B were in agreement with the findings from the bioinformatics analysis.

Fig. 5

The expression of CCR5 and KIF13B (A and B) in the OP and the control group. (*p < 0.05, **p < 0.01). OP, osteoporosis.

Gene ontology and Kyoto encyclopedia of genes and genomes pathway enrichment analyses

GO and KEGG analyses provide a comprehensive understanding of the biological processes and pathways associated with the hub genes. Bubble plots were used to display the enrichment results. For the differentially expressed gene CCR5, GO functional analysis indicated that it is primarily involved in cellular responses to lipopolysaccharide, responses to molecules of bacterial origin, response to biotic stimuli, and other biological processes. Enrichment was observed in cellular components such as the external side of the plasma membrane and the serine/threonine protein kinase complex. In terms of molecular functions, the gene was enriched in GO terms related to serine-type endopeptidase activity, serine-type peptidase activity, serine hydrolase activity, oxygen binding, phosphoric diester hydrolase activity, cytokine receptor activity, and neurotransmitter receptor activity (Fig. 6A). KEGG annotation revealed enrichment in pathways related to complement and coagulation cascades, morphine addiction, viral protein interactions with cytokine receptors, Chagas disease, and cytokine-cytokine receptor interaction (Fig. 6B). Similarly, the bubble plot of GO terms for the differentially expressed gene KIF13B revealed that KIF13B predominantly participates in potassium ion transport, intracellular calcium ion homeostasis, calcium ion homeostasis, protein maturation, protein targeting, nuclear membrane dynamics, and the nuclear envelope (Fig. 6C). In the KEGG analysis, the pathways primarily involved nucleocytoplasmic transport and neuroactive ligand-receptor interaction (Fig. 6D). Some of these pathways (cytokine-cytokine receptor interaction, neuroactive ligand-receptor interaction) have already been reported in previous studies to be associated with OP and bone metabolic diseases24,25. For instance, the neuroactive ligand-receptor interaction pathway may be involved in osteoclastogenesis26. This agreement lends some support to the reliability of our results, and, building on previous studies, some new pathways related to OP were identified.

Fig. 6

GO and KEGG enrichment analyses. (A) Bubble plot of GO enrichment analysis results of CCR5. (B) Bubble plot of KEGG pathway enrichment analysis of CCR5 (refs. 27, 28, 29). (C) Bubble plot of GO enrichment analysis results of KIF13B. (D) Bubble plot of KEGG pathway enrichment analysis of KIF13B (refs. 27, 28, 29). GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes.

Gene set enrichment analysis

We performed GSEA of the two hub genes and visualised the top five up- or down-regulated pathways with the ‘GSEA’ package; only results with a p-value < 0.05 were retained. For KIF13B, the enrichment score was positive, indicating pathways upregulated in the OP group; these included ECM-receptor interaction, focal adhesion, hypertrophic cardiomyopathy, the p53 signaling pathway, and regulation of the actin cytoskeleton (Fig. 7A). For CCR5, the enrichment score was negative, indicating pathways downregulated in the control group; these included the B-cell receptor signaling pathway, glycosaminoglycan biosynthesis heparan sulfate, leishmania infection, the NOD-like receptor signaling pathway, and pathways in cancer (Fig. 7B).

Fig. 7

GSEA of hub genes in OP. (A) GSEA analysis of KIF13B in OP. (B) GSEA analysis of CCR5 in OP. OP, osteoporosis; KEGG, Kyoto Encyclopedia of Genes and Genomes; GSEA, Gene Set Enrichment Analysis.

Immune phenotype shifts analysis

The composition of immune cells in 46 samples (GSE7158 and GSE13850) was analyzed using CIBERSORT (version 1.03). The proportions of immune cells were displayed using stacked plots, while the correlations among immune cells were illustrated through heatmaps (Fig. 8A and B). Specifically, the primary constituent immune cells were monocytes/macrophages and B cells. A significant positive correlation was observed among follicular helper T cells (cor = 0.98, p < 0.01), gamma delta T cells (cor = 0.74, p < 0.01), and activated NK cells. Additionally, activated mast cells and eosinophils (cor = 0.71, p < 0.01), as well as gamma delta T cells and follicular helper T cells (cor = 0.74, p < 0.01), also exhibited significant positive correlations. Conversely, a significant negative correlation was found between activated mast cells and resting NK cells (cor = −0.71, p < 0.01). The differential analysis of immune cell composition indicated a significant difference in activated NK cells between the OP and control groups (p < 0.05) (Fig. 8C). In addition, we examined the correlation between hub genes and immune cells, which showed that CCR5 was negatively correlated with activated mast cells (r = −0.49, p = 0.021), while KIF13B showed no significant correlation with immune cells (Fig. 8D, E and F).

Fig. 8

Evaluation of immune cell composition and correlation analysis between hub genes and immune cells. (A) Stacked plot of the proportions of immune cells. (B) Violin plot of the differential composition of all 22 immune cell types between osteoporosis and normal samples. (C) Spearman’s correlation heatmap of all 22 immune cell types; red represents a positive correlation, and blue denotes a negative correlation. (D, F) Lollipop charts showing the correlation between different types of immune cells and the hub genes (CCR5 and KIF13B). (E) Dot plot of the correlation between immune cells and CCR5.

Prediction of target lncRNAs and construction of ceRNA networks

Based on two hub genes, CCR5 and KIF13B, relevant miRNAs were predicted in miRDB, miRanda, and TargetScan databases, and the lncRNAs that interacted with selected miRNAs were predicted using the spongeScan database. Finally, we obtained 29 target lncRNAs of 9 target miRNAs of CCR5 and 90 target lncRNAs of 15 target miRNAs of KIF13B. Cytoscape software was used to construct and show ceRNA networks based on the prediction results (Fig. 9A and B).

Fig. 9

Potential RNA regulatory pathway for a ceRNA network. (A) ceRNA network of CCR5. (B) ceRNA network of KIF13B. Red diamonds represent the hub genes, blue squares represent miRNAs, and green circles represent lncRNAs.

Discussion

Although several effective pharmacological therapies exist for osteoporosis, including antiresorptives and anabolic agents, their use is often limited by side effects, cost, or suboptimal patient adherence, underscoring the need for continued research into novel treatment strategies30,31. Early detection of OP is crucial because timely intervention can prevent fractures and improve quality of life. Identifying novel therapeutic targets for OP may facilitate the development of new treatment strategies. Microarray gene expression analysis serves as a valuable tool for pinpointing critical targets involved in the pathogenesis of OP32. In this study, we analyzed pooled microarray data comparing gene expression between OP patients and controls, revealing 12 differentially expressed genes. We subsequently employed machine learning techniques to filter these differentially expressed genes, ultimately identifying four genes associated with OP: CCR5, KIF13B, LGALS4 and ZFP69B. Further validation using an external dataset confirmed that CCR5 and KIF13B were downregulated in OP. To verify the expression levels of these two genes in the OP and control groups, PBMCs were collected, and qRT-PCR analysis confirmed their reduced expression in the OP group, supporting an association between these genes and OP.

Several studies have reported a relationship between CCR5 and bone metabolism33,34,35. CCR5 serves as the receptor for the chemokine CCL5, with approximately 50 endogenous chemokines identified in humans and mice, which are categorized into four subfamilies: CXC, CC, CX3C, and C-chemokines36. The CCR5 receptor is expressed in smooth muscle endothelial cells, epithelial cells, T cells, and parenchymal cells, and it plays a role in regulating inflammation, infectious diseases, angiogenesis, embryogenesis, and tumorigenesis36. Several epidemiological studies have demonstrated that the loss of CCR5 function is associated with a lower incidence of bone-destructive diseases37,38. Furthermore, experimental studies suggest a direct positive role for CCR5 in osteoclastogenesis and in the communication between osteoclasts and osteoblasts39. Conversely, CCR5-deficient mice exhibit reduced osteoblast numbers, increased osteoclasts, and impaired bone formation40. Our study found that the expression of CCR5 was reduced, a result consistent with previous studies, which to some extent supports the reliability of our study24. The kinesin superfamily (KIFs) consists of microtubule-based motor proteins that are crucial for the transport of various intracellular substances. To date, 45 KIFs have been identified in humans, with studies indicating that KIF13B plays a significant role in predicting glioma prognosis41. However, there are few studies on KIF13B, and none related to bone metabolism, suggesting this may represent a novel finding in the field of OP.

In recent years, immune cells have emerged as a crucial factor in the occurrence and progression of OP. To elucidate the role of immune cells in OP, we analyzed the composition of 22 immune cell types and their correlation with the hub genes associated with OP. The findings suggest that various immune cells, including monocytes/macrophages and B cells, may contribute to the pathogenesis of OP. B cells serve as essential stabilizers of bone turnover and play a critical role in regulating peak bone mass in vivo. Research indicates that B cells, when activated by estrogen deficiency and pro-inflammatory conditions, can secrete elevated levels of cytokines such as G-CSF and RANKL, which in turn activate osteoclast formation and enhance bone resorption42. One study observed significant decreases in bone mineral density and bone mass by 3 months of age in B-cell KO mice, and B-cell reconstitution of these mice at 4 weeks of age prevented the loss of bone mass. This demonstrates a novel role for B cells in the maintenance of basal peak bone mass in vivo43. Furthermore, evidence suggests that B cells may inhibit osteoblast differentiation through the CCL3 and TNF signaling pathways44. Additional clinical data regarding low bone mineral density in patients with non-Hodgkin’s lymphoma underscore the importance of maintaining a balance between B-cell numbers and their activation for optimal bone homeostasis45.

Macrophages, derived from the monocytic lineage, are integral components of the innate immune system. They are categorized into M1 and M2 types based on their functional roles and the levels of inflammatory factors they secrete46,47. M1 macrophages are characterized by their secretion of high levels of reactive oxygen species (ROS), nitric oxide (NO), and pro-inflammatory cytokines, such as IL-1 and IL-6, which inhibit osteogenic differentiation48. Conversely, M2 macrophages secrete anti-inflammatory cytokines, including CCL18, CCL22 and IL18, as well as pro-osteogenic molecules such as bone morphogenetic protein 2 (BMP2), transforming growth factor β (TGFβ) and osteopontin, which promote osteogenic differentiation49. Research has indicated that the number of monocytes/macrophages in the bone marrow of postmenopausal women is increased50, and animal studies have demonstrated that bone marrow macrophages in ovariectomized (OVX) mice are activated, with increased M1 polarization and disrupted M2 polarization51,52. These findings suggest that macrophages play an essential role in the pathogenesis of OP.

Mast cells (MCs) are important sensor and effector cells within the immune system, and they significantly influence bone metabolism and bone disorders. Studies have shown that the number of activated mast cells in the bone marrow of OVX mice is higher than in sham-operated mice53. The accumulation of activated mast cells in OVX mice correlates directly with the increased number of osteoclasts47. Another study revealed that mast cell-deficient mice are protected from OVX-induced bone loss through inhibition of the increase in osteoclast number and activity, suggesting that mast cells may stimulate osteoclast production in the context of estrogen deficiency54. Our study demonstrated that CCR5, a key gene related to OP, is negatively correlated with mast cell activation, with reduced CCR5 expression observed in the OP group. This finding further supports the notion that CCR5 may be involved in the occurrence and progression of OP through the mediation of mast cell activation.
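For illustration only, the kind of gene-to-immune-cell correlation discussed above can be sketched as a Spearman rank correlation between a hub gene's expression and an estimated cell fraction. This is a minimal sketch and not the analysis code of this study: the choice of Spearman correlation, the function names, and the toy values are all assumptions made for the example.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data (NOT from this study): per-sample CCR5 expression
# and the estimated fraction of activated mast cells in the same samples.
ccr5_expression = [8.1, 7.4, 6.9, 6.2, 5.8]
mast_cell_fraction = [0.01, 0.02, 0.04, 0.03, 0.06]

rho = spearman(ccr5_expression, mast_cell_fraction)
print(f"Spearman rho = {rho:.2f}")
```

A negative coefficient in such a sketch would correspond to the pattern described above, in which lower CCR5 expression accompanies a higher estimated fraction of activated mast cells. A rank-based measure is used here because cell-fraction estimates are often skewed; this is a design choice for the example, not a statement about the method used in the study.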

Our study included two microarray datasets to identify hub genes and pathways in the pathogenesis of OP. In contrast to many earlier analyses that relied on smaller GEO datasets55,56, our research set a minimum sample size threshold (n > 15) for dataset inclusion. This criterion is supported by methodological research showing that small sample sizes can lead to unstable gene lists and poor prediction accuracy in transcriptomic studies57. It is well known that PBMCs can secrete many potent cytokines that are crucial for osteoclast differentiation, activation, and apoptosis58. The datasets we selected mainly consist of PBMCs, and the results were validated in PBMCs from clinical samples. This reflects, to some extent, the relationship between this cell population and the disease, which was not reported in previous studies. Finally, we conducted clinical sample validation, and the hub genes identified (CCR5 and KIF13B) may serve as new biomarkers for early OP diagnosis in clinical practice, potentially enabling non-invasive screening via peripheral blood tests. Furthermore, targeting these genes with small molecules or biologics could open avenues for precision therapy. Future studies should validate their diagnostic sensitivity and specificity in larger cohorts and explore therapeutic modulation in preclinical models. The potential relationship between OP and immune cells also provides new insights into the molecular mechanisms and therapeutic targets associated with OP. However, our study has certain limitations. Firstly, although we combined two microarray datasets, the small sample size of public databases may introduce bias into our results. Secondly, while qRT-PCR confirmed the expression levels of the hub genes, further validation with more clinical samples is necessary because of the limited size of our clinical sample set. Additionally, the roles of the validated molecules in osteoblast and osteoclast differentiation need to be clearly defined in cell experiments, and animal experiments are needed to determine whether overexpressing these molecules can ameliorate the bone loss caused by OP. Finally, the datasets used in the CIBERSORT immune infiltration analysis (GSE7158 and GSE13850) were specific to PBMCs, so the predominance of these cell types in the results may reflect the composition of the input datasets rather than broader immune involvement in OP. Therefore, this conclusion should be interpreted with caution.
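To make the proposed evaluation of diagnostic sensitivity and specificity concrete, the following minimal sketch computes the area under the ROC curve, and sensitivity and specificity at one cutoff, for a single marker. It is an illustration under stated assumptions, not the code used in this study: the function names, the cutoff, and the toy expression values are hypothetical. Because the hub genes were downregulated in OP, the negated expression value is used as the score so that higher scores indicate OP.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive sample outscores a
    random negative sample, with ties counted as 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(scores, labels, cutoff):
    """Classify a sample as positive when its score is >= cutoff.
    Assumes both classes are present in labels."""
    tp = sum(1 for s, l in zip(scores, labels) if l == 1 and s >= cutoff)
    fn = sum(1 for s, l in zip(scores, labels) if l == 1 and s < cutoff)
    tn = sum(1 for s, l in zip(scores, labels) if l == 0 and s < cutoff)
    fp = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data (NOT from this study): marker expression per sample
# and disease status (1 = OP, 0 = control).
expression = [5.2, 5.9, 6.8, 6.4, 7.0, 7.3]
is_op = [1, 1, 1, 0, 0, 0]

# Lower expression is assumed to indicate OP, so negate it to form the score.
scores = [-e for e in expression]

auc = roc_auc(scores, is_op)
sens, spec = sensitivity_specificity(scores, is_op, cutoff=-6.0)
print(f"AUC = {auc:.3f}, sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

In a larger validation cohort, the same quantities would be estimated with confidence intervals and with the cutoff chosen on data independent of the test samples.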

Conclusion

This study used microarray data to characterize the mRNA expression profiles of patients with OP and to analyze the functional pathways and roles of differentially expressed genes. Machine learning algorithms identified CCR5 and KIF13B as hub genes associated with OP, which were subsequently validated using ROC curves and clinical samples. The results indicated that both CCR5 and KIF13B were significantly downregulated in OP patients and may serve as potential diagnostic markers for the condition. Furthermore, we observed significant correlations between the hub genes and immune cells, as well as differences in immune cell abundance between the OP and control groups. Thus, our findings contribute to a deeper understanding of the pathogenesis of OP and provide a theoretical basis for its diagnosis and treatment.