Introduction

Sepsis is characterized by a dysregulated response of the body to infection, leading to life-threatening organ dysfunction1, and continues to be a major cause of morbidity and mortality in both low- and high-income countries 2,3. Organ dysfunction arises both as a consequence of immunological dysregulation and as a catalyst in the vicious cycle of sepsis development4.

The central nervous system is among the first organs to be affected5. Although there is no fully agreed-upon definition of sepsis-associated encephalopathy (SAE), it is commonly understood as the combination of an extracranial infection and clinical manifestations of neurological dysfunction. Estimates of the incidence of SAE vary widely, from 8 to 70%6, with 20–40% of sepsis patients in intensive care units (ICUs) progressing to SAE7, making it the most common cause of brain dysfunction in ICUs. In terms of long-term prognosis, sepsis hospitalization is associated with a 10% increase in the incidence of cognitive impairment within 8 years8,9.

The development of SAE may be due to external signals activating microglia through a compromised blood–brain barrier10, leading to neuroinflammation, ischemia, and cellular metabolic stress11. However, current studies still have significant limitations, and there is no specific treatment method.

The advent of high-throughput technologies has revolutionized biological research at the microscopic level. With the rapid increase in transcriptomic data generation, bioinformatics and machine learning have become indispensable for extracting meaningful insights from large datasets. These computational methods enable the identification of key biomarkers, the elucidation of complex biological mechanisms, and more accurate disease prediction. In particular, the integration of bioinformatics and machine learning has significantly advanced transcriptomic data analysis, providing robust tools to manage high-dimensional and heterogeneous datasets. In this study, we used machine learning and bioinformatics approaches to organize and analyze existing RNA-sequencing data on SAE, identifying marker genes that could serve as potential therapeutic targets for the treatment of septic encephalopathy. The flow chart of this research is shown in Fig. 1.

Fig. 1

The flow chart of research.

Materials and methods

Data collecting and processing

The raw expression profile datasets of SAE and control groups, namely GSE198862, GSE167610 and GSE253438, were downloaded from the GEO database12. For data integration, the data were normalized to Counts Per Million (CPM) and then log-transformed. Batch effects were then mitigated using the ComBat function from the ‘sva’ package (version 3.48.0)13 in R software (version 4.3.1). The batch effect-removed dataset contains 15 control samples and 15 SAE samples.
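The normalization step described above can be illustrated with a minimal Python sketch. It is not the authors' R pipeline: the pseudocount of 1 and the base-2 logarithm are assumptions (the text only says "logarithmic transformation"), the counts are invented for illustration, and batch correction with ComBat is not reproduced here.

```python
import math

def cpm(counts):
    """Convert the raw read counts of one sample into counts per million."""
    library_size = sum(counts)
    if library_size == 0:
        raise ValueError("sample has no reads")
    return [c * 1_000_000 / library_size for c in counts]

def log2_cpm(counts, pseudocount=1.0):
    """Log2-transform CPM values; the pseudocount (an assumption) avoids log2(0)."""
    return [math.log2(v + pseudocount) for v in cpm(counts)]

# Hypothetical raw counts for four genes in a single sample.
sample_counts = [250, 0, 750, 1000]
normalized = cpm(sample_counts)
logged = log2_cpm(sample_counts)
```

Because CPM rescales each sample to the same total, samples sequenced at different depths become comparable before they are merged.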

Differentially expressed genes (DEGs) analysis

DEGs in GSE198862 were identified using the ‘DESeq2’ package (version 1.40.2)14, with thresholds of adjusted p ≤ 0.05 and |log2FoldChange| ≥ log2(1.5). For GSE167610, the ‘limma’ package (version 3.56.2)15 was used to identify DEGs, applying thresholds of adjusted p ≤ 0.05 and |log2FoldChange| > 1. The expression patterns of the DEGs were visualized as volcano plots using the ‘ggplot2’ package (version 3.4.4).
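As an illustration of how such thresholds partition genes, the following Python sketch labels each gene as upregulated, downregulated, or not significant. The gene names and statistics are hypothetical, and the function is a simplified stand-in for filtering a DESeq2 or limma results table, not the authors' code.

```python
import math

def classify_gene(adj_p, log2_fc, p_cutoff=0.05, lfc_cutoff=math.log2(1.5)):
    """Label a gene 'up', 'down', or 'ns' (not significant)."""
    # Note: this uses |log2FC| >= cutoff; the limma threshold in the text is
    # strict (> 1), which differs only for genes sitting exactly on the cutoff.
    if adj_p > p_cutoff or abs(log2_fc) < lfc_cutoff:
        return "ns"
    return "up" if log2_fc > 0 else "down"

# Hypothetical (adjusted p, log2 fold change) pairs.
toy_results = {
    "GeneA": (0.001, 2.3),
    "GeneB": (0.20, 3.0),
    "GeneC": (0.01, -0.9),
    "GeneD": (0.04, 0.3),
}

# Thresholds in the style of the DESeq2 analysis: |log2FC| >= log2(1.5).
labels_deseq2 = {g: classify_gene(p, lfc) for g, (p, lfc) in toy_results.items()}
# Stricter fold-change cutoff in the style of the limma analysis: |log2FC| of 1.
labels_limma = {g: classify_gene(p, lfc, lfc_cutoff=1.0)
                for g, (p, lfc) in toy_results.items()}
```

With these toy values, GeneC counts as downregulated under the looser cutoff but not under the stricter one, which shows how the choice of fold-change threshold changes the size of the DEG set.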

Weighted gene co-expression network analysis (WGCNA) and key module genes identification

Following the scale-free topology criterion, the co-expression network within the batch effect-removed dataset was constructed using the ‘WGCNA’ package (version 1.72.1)16. To determine an optimal soft-threshold power for the adjacencies, the pickSoftThreshold function from the WGCNA package was employed. Subsequently, the adjacency matrix was transformed into a Topological Overlap Matrix (TOM), and its corresponding dissimilarity was computed for hierarchical clustering analysis. The dynamic tree cutting method, with a minimum module size of 30, was then used to identify co-expressed gene modules.
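The two matrix transformations in this step can be sketched in plain Python. The sketch assumes an unsigned network (adjacency = |Pearson correlation| raised to the soft-threshold power) and the standard unsigned topological overlap formula; the source does not state the network type or the chosen power, so the power of 6 and the expression values below are illustrative only. Hierarchical clustering and dynamic tree cutting are not shown.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def adjacency(expr, power):
    """Unsigned soft-threshold adjacency; the diagonal is left at 0."""
    g = len(expr)
    adj = [[0.0] * g for _ in range(g)]
    for i in range(g):
        for j in range(i + 1, g):
            adj[i][j] = adj[j][i] = abs(pearson(expr[i], expr[j])) ** power
    return adj

def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    g = len(adj)
    k = [sum(row) for row in adj]  # connectivity of each gene
    tom = [[1.0] * g for _ in range(g)]
    for i in range(g):
        for j in range(i + 1, g):
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(g) if u not in (i, j))
            tom[i][j] = tom[j][i] = (shared + adj[i][j]) / (
                min(k[i], k[j]) + 1.0 - adj[i][j])
    return tom

# Hypothetical expression of four genes (rows) across four samples (columns).
expr = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [1, 3, 2, 4]]
adj = adjacency(expr, power=6)
tom = topological_overlap(adj)
diss_tom = [[1.0 - w for w in row] for row in tom]  # input to clustering
```

Raising correlations to a power suppresses weak correlations while preserving strong ones, which is what pushes the network toward scale-free topology.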

Elastic net regression analysis

Elastic net regression analysis is a sophisticated statistical technique that merges the strengths of ridge and lasso regression through a linear combination of their penalties. This method is designed to handle situations where the number of predictors far exceeds the number of observations, a common scenario in genomic data analysis. The elastic net penalty is defined by the formula

$$\lambda \left[\alpha \sum_{j}\left|{\beta }_{j}\right|+\frac{1-\alpha }{2}\sum_{j}{\beta }_{j}^{2}\right],$$

where βj are the coefficients of the predictors, λ is the penalty term that controls the overall strength of the penalty, and α balances the contribution of the lasso (L1) and ridge (L2) penalties. By allowing for both variable selection (like lasso) and shrinkage (like ridge), elastic net can identify a relevant set of predictors from high-dimensional datasets, thus providing a powerful tool for uncovering the genomic features that differentiate disease groups from control groups.
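To make the roles of λ and α concrete, the penalty can be evaluated for a small coefficient vector. This Python sketch assumes that λ multiplies the whole bracketed term, which is the convention used by the glmnet package; the coefficient values are hypothetical.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Return lam * (alpha * sum|b_j| + (1 - alpha) / 2 * sum b_j^2)."""
    l1 = sum(abs(b) for b in beta)   # lasso (L1) part
    l2 = sum(b * b for b in beta)    # ridge (L2) part
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)

# Hypothetical coefficients for four predictors.
beta = [0.5, -1.0, 0.0, 2.0]

pure_lasso = elastic_net_penalty(beta, lam=0.2, alpha=1.0)  # L1 penalty only
pure_ridge = elastic_net_penalty(beta, lam=0.2, alpha=0.0)  # L2 penalty only
mixed = elastic_net_penalty(beta, lam=0.2, alpha=0.5)       # equal mixture
```

Setting α = 1 recovers the lasso penalty and α = 0 the ridge penalty, so intermediate values trade sparsity against stability when predictors are correlated.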

Least absolute shrinkage and selection operator (LASSO)

LASSO is a regression analysis method that performs both variable selection and regularization to enhance the prediction accuracy and interpretability of the statistical model it yields. Distinguished from traditional regression techniques that may lead to overfitting in cases with a high number of predictors, LASSO mitigates this by imposing a constraint on the sum of the absolute values of the model parameters, which effectively reduces some coefficients to zero, thus achieving variable selection. Utilizing the ‘glmnet’ package (version 4.1-8), we conducted LASSO regression to identify the best λ value, followed by coefficient estimation, thus ensuring a rigorous approach to model optimization and analysis.

Random forest

In our study, we employed the random forest algorithm, a robust machine learning technique for both classification and regression tasks, using the ‘randomForest’ package (version 4.7.1.1). Random forest, an ensemble learning method, constructs multiple decision trees during training and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees. This approach improves predictive accuracy and controls over-fitting by aggregating the results of numerous trees, each built on a random subset of the data and features.

Extreme gradient boosting (XGBoost)

In our study, we employed the XGBoost algorithm using the ‘xgboost’ package (version 1.7.5.1). XGBoost is a machine learning technique known for its effectiveness in binary classification tasks. It stands out for its efficiency, scalability, and capacity to handle sparse data, making it well suited to complex predictive modeling. Our XGBoost model was configured as follows: we limited tree depth with a max_depth of 10, and the learning rate, controlled by the eta parameter, was set to 0.5.

Cell culture and treatment

The mouse microglia BV2 cell line was purchased from Procella Life Science & Technology Co. Ltd (Wuhan, China). Cells were cultured in DMEM (GIBCO, USA) supplemented with 10% fetal bovine serum (Procella, China) and 1% penicillin/streptomycin (Solarbio, China) at 37 °C in an atmosphere of 5% CO2.

To establish an in vitro neuroinflammatory model, we used lipopolysaccharide (LPS) to stimulate BV2 cells. With the concentration of LPS (L2880, Sigma) set at 100 ng/ml, we induced an inflammatory response in the cells, thus simulating neuroinflammatory conditions.

Reverse transcription quantitative polymerase chain reaction (RT-qPCR)

Total RNA was extracted from the cells using TRIzol (Invitrogen, CA, USA), followed by reverse transcription using a commercial reverse transcription kit (AG11728, AG, China) to convert the RNA into cDNA. Subsequently, RT-qPCR was performed using specialized kits (AG11701, AG, China). For data analysis, glyceraldehyde-3-phosphate dehydrogenase (GAPDH) was selected as the internal reference gene and relative expression was calculated using the 2^(−ΔΔCt) method. Finally, a one-way ANOVA was performed using GraphPad Prism 9 software.
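The 2^(−ΔΔCt) calculation can be written out as a short worked example. The following Python sketch uses invented Ct values; the function and argument names are hypothetical and not taken from the study.

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene by the 2^(-ddCt) method."""
    # dCt: target gene Ct normalized to the reference gene (e.g. GAPDH).
    dct_treated = ct_target_treated - ct_ref_treated
    dct_control = ct_target_control - ct_ref_control
    # ddCt: treated sample relative to the control sample.
    ddct = dct_treated - dct_control
    return 2.0 ** (-ddct)

# Hypothetical Ct values: the target crosses threshold 3 cycles earlier
# after treatment, while the reference gene is unchanged.
fold_change = relative_expression(
    ct_target_treated=22.0, ct_ref_treated=18.0,
    ct_target_control=25.0, ct_ref_control=18.0,
)
```

Here ΔCt is 4 in the treated sample and 7 in the control, so ΔΔCt = −3 and the relative expression is 2^3, an eight-fold increase.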

Results

Distinct identification of differential genes in sepsis-associated encephalopathy

In the analysis of the GSE198862 dataset, we identified 1413 upregulated and 101 downregulated genes between the control and experimental groups (Fig. 2a). In the GSE167610 dataset, there were 202 upregulated and 16 downregulated genes (Fig. 2b). Upon intersecting these gene sets, we discovered 84 common genes (Fig. 2c). These are named MDEGs, and notably, all of these genes are upregulated in the differential gene sets of both datasets. Through enrichment analysis of the differential genes in each group, we observed biological processes closely related to inflammation. These include cellular chemotaxis, leukocyte migration, regulation of inflammatory response, and cytokine production (Fig. 2d and e).

Fig. 2

Distinct identification of differential genes in sepsis-associated encephalopathy. (a) and (b) Volcano plots illustrating differentially expressed genes in the experimental groups of GSE198862 (a) and GSE167610 (b), respectively, compared to their control groups. (c) Venn diagram depicting the overlap of differentially expressed genes, identifying 84 genes common to both gene sets, namely the MDEGs. (d) Bubble chart representing the Gene Ontology (GO) enrichment analysis for the two sets of differentially expressed genes and the MDEGs. (e) Bar graph illustrating the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis for the two sets of differentially expressed genes and the MDEGs.

Linear model analysis and WGCNA of the overall dataset

We merged the GSE198862 and GSE167610 datasets after eliminating batch effects (Fig. 3a and b). Following data filtration, the consolidated dataset contained expression data for 6,139 genes across 30 samples, evenly divided with half of the samples from the control group and the other half from the experimental group. We utilized a linear model to fit the gene expression data and performed statistical inference using an empirical Bayesian approach. This led to the identification of 138 upregulated differentially expressed genes, which we have named as the sDEG (Selected Differential Expression Gene) set (Fig. 3c). Subsequently, gene expression data from all experimental group samples were extracted for WGCNA (Fig. 3d). A total of 21 gene modules were identified (Fig. 3e and f). Upon comparing sDEG and MDEG with the 21 gene modules, it was observed that the module Coral exhibited the highest number of overlapping genes (Fig. 3g). Enrichment analysis of genes in the coral module revealed their involvement in apoptosis, immune response, and biosynthesis (Fig. 3h). Consequently, we decided to focus on the coral module which contained 728 genes for subsequent analyses.

Fig. 3

Linear model analysis and WGCNA of the overall dataset. (a) Principal Component Analysis (PCA) results of the combined datasets prior to batch effect correction. (b) PCA results of the combined datasets after batch effect correction. (c) Heatmap of the overall differentially expressed genes sDEGs, with genes also present in the MDEGs marked on the right side. (d) WGCNA soft-thresholding power analysis showing scale independence (left) and mean connectivity (right) to determine the threshold for network topology. (e) The dendrogram from WGCNA displays the clustering results of 21 gene modules and genes. (f) The bar graph illustrates the number of genes contained within each of the 21 WGCNA gene modules. (g) The radar chart depicts the degree of overlap between the sDEG and MDEG gene sets and each gene module identified by WGCNA. (h) Results of the GO enrichment analysis for genes in the Coral module.

Utilizing various machine learning methods to identify feature genes

To identify key feature genes, we analyzed the expression data of 728 Coral module genes from 30 samples using Elastic Net, LASSO, Random Forest, and XGBoost. These methods were chosen for their suitability in high-dimensional data analysis and feature selection.

Elastic Net regression, optimized with cross-validation (λ = 0.2154435, Fig. 4a), identified 17 non-zero coefficient genes, 5 of which overlapped with MDEGs (Fig. 4b). LASSO analysis further identified 9 genes, all of which were included in the Elastic Net results, confirming their robustness.

Fig. 4

Utilizing various machine learning methods to identify feature genes. (a) Plot representing the mean squared error across varying log(λ) values in an elastic net regression analysis, highlighting the optimal point of regularization via the minimum criterion. (b) The image displays the 17 characteristic genes with positive results identified through elastic net regression analysis. (c) The dimensionality reduction plot demonstrates the classification capability of the random forest, with two colors representing the experimental and control groups, respectively. (d) Variable importance plots from a random forest model, showing the mean decrease in accuracy (left) and the mean decrease in Gini impurity (right) for the top contributing features. (e) Bar chart of feature importance for genes identified by the XGBoost method, with the bars representing the score which quantifies each feature’s contribution to the model; the unit of measurement on the x-axis is the F-score.

Random Forest evaluated feature importance using two criteria—Mean Decrease in Accuracy and Gini index (Fig. 4c and d). Fourteen overlapping genes were identified as consistently contributing to classification. XGBoost analysis further validated 11 key feature genes, demonstrating their predictive relevance (Fig. 4e).

All feature gene sets identified by the machine learning algorithms, along with the genes overlapping with MDEGs, are listed in Table 1. All models demonstrated strong classification performance on the training set. We calculated the index values and probabilities for each model (Table S1) and analyzed their confusion matrices (Fig. S1).

Table 1 Identified features and their gene intersections by MDEGs.

In summary, the integration of these machine learning methods identified a consistent set of genes with potential diagnostic and therapeutic implications for SAE, including genes validated by multiple algorithms.

ROC curve of the selected genes

We merged the four overlapping gene sets, resulting in a total of 18 biomarkers: Ifitm1, S100a8, Vwf, Saa3, Anxa2, Pglyrp1, Lcn2, Emilin2, S100a9, Ifitm2, Socs3, C4b, Gpx3, Ifitm3, S100a11, Atp10d, Rps21, Lamb3. Subsequently, batch-corrected data from the two datasets were used to generate ROC curves (Fig. S2). In these datasets, all biomarkers exhibited high AUC values, indicating strong classification efficacy. Furthermore, we used the GSE253438 dataset as an external validation set. Among the 18 biomarkers, 17 were expressed in the external validation dataset. By constructing logistic regression models, we evaluated the strength of each biomarker as a classification feature, presenting the results as ROC curves (Fig. 5). Most biomarkers maintained relatively high AUC values.
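For a single biomarker, the AUC can be computed directly from the ranking of its values. This Python sketch uses the pairwise (Mann–Whitney) formulation with hypothetical scores. It assumes higher values indicate SAE; under that assumption a one-predictor logistic model is a monotone transform of the expression value and yields the same AUC. It is an illustration, not the authors' procedure.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive sample outranks a negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical expression-derived scores; label 1 = SAE, 0 = control.
labels = [1, 1, 0, 0]
auc_perfect = roc_auc([0.9, 0.8, 0.4, 0.3], labels)  # classes fully separated
auc_partial = roc_auc([0.9, 0.4, 0.8, 0.3], labels)  # one pair misordered
```

An AUC of 1.0 means every SAE sample scores above every control, whereas 0.5 corresponds to no discrimination.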

Fig. 5

ROC curves of selected genes evaluated on the external validation dataset GSE253438.

In vitro experimental validation of marker expression in SAE

We stimulated BV2 cells with different concentrations of LPS (10, 50, 100, and 500 ng/mL) for 12 h to induce inflammatory responses and damage in the cells. Using a combination of RT-qPCR and ELISA, we confirmed that the expression levels of the inflammatory factors TNF-α and IL-6 were both elevated in the LPS-induced inflammation model of BV2 cells compared to the blank control group (Fig. 6a–d). This indicates that LPS stimulation successfully triggered an inflammatory state in the cells.

Fig. 6

In vitro experimental validation of marker expression in SAE. (a,b) TNF-α and IL-6 levels in control and LPS-stimulated BV2 cells were measured using ELISA. (c–n) The relative expression levels of TNF-α, IL-6, Lcn2, Atp10d, Rps21, Anxa2, Gabarap, S100a11, Lamb3, Fkbp4, and Pglyrp1 were detected by RT-qPCR in control and LPS-stimulated BV2 cells using GAPDH as an internal reference gene. In statistical analyses, significance levels are marked as *P < 0.05, **P < 0.01, and ***P < 0.001.

Further RT-qPCR assays showed that the expression levels of the Lcn2, Atp10d, Rps21, Anxa2, Gabarap, S100a11, Lamb3, and Fkbp4 genes were significantly up-regulated in the inflammatory model of BV2 cells stimulated by LPS (100 ng/mL) (Fig. 6e–n). In addition, we observed that the expression of the Pglyrp1 gene was suppressed in the BV2 cellular inflammation model. These results suggest that these genes may play a critical role in the cellular inflammatory process.

Discussion

SAE is primarily caused by sepsis-induced systemic inflammation characterized by impaired blood–brain barrier (BBB) function, hyperactivation of immune cells such as microglia in the brain, and neuroinflammation 17. These factors can trigger neurotransmitter imbalance, oxidative stress, and apoptosis ultimately leading to microcirculatory impairment, cognitive dysfunction, and neuronal damage 18. Controlling the inflammatory response is essential to improve the prognosis of patients with SAE. Our study identified a number of important inflammation-related molecules, and LCN2 was associated with inflammation, neuroprotection and neurorepair. LCN2 was found to be upregulated in Parkinson’s disease, associated with iron accumulation and neuroinflammation, and may promote dopamine neuron damage19. FKBP4 is an important protein chaperone in neurons. Several studies have revealed the role of FKBP4 in immune-related diseases and inflammation, and FKBP4 may act as a molecular chaperone to regulate immune responses triggered by misfolding or amyloid proteins such as aSyn20. S100A11, a member of the S100 family of proteins, plays an important role in inflammatory processes21. It has a multifaceted impact on the development and resolution of inflammatory responses by regulating cytokine production, promoting immune cell migration, influencing apoptosis, and participating in exocytotic communication22. Serum amyloid A3 (SAA3) is a major acute-phase protein, and aberrant upregulation of SAA3 and other amyloid proteins has been associated with a variety of inflammatory diseases, infections, autoimmune disorders and cancer23,24. The serum amyloid A family has been found to be involved in inflammatory regulation by inducing pathogenic differentiation of Th17 cells25.

Compromised integrity of the blood–brain barrier (BBB) is a central factor contributing to sepsis-related brain dysfunction and systemic injury. The systemic inflammatory response resulting from sepsis poses a major threat to the BBB, chiefly because it increases BBB permeability, permitting inflammatory cells, cytokines, and other harmful substances to enter the brain17. This process involves not only the release of inflammatory mediators and alterations in microvascular permeability, but also direct damage to the structural components of the BBB by oxidative stress and impaired cellular energy metabolism26,27. Damage to the BBB promotes neuroinflammation and neurological damage within the brain, exacerbating the severity of SAE. ANXA2 and LAMB3 are critical to the integrity and function of the BBB. ANXA2, a member of the membrane-bound protein family, affects BBB stability and brain protective mechanisms by regulating cell adhesion, neovascularization, inflammatory responses, and apoptosis28. LAMB3, a subunit of laminin, maintains the selective permeability of the BBB by promoting the formation of tight junctions, affecting endothelial cell stability and function, and participating in BBB repair and reconstruction, thereby protecting the brain from harmful substances29.

The development of SAE is closely related to metabolic processes, as reflected in disorders of energy metabolism, mitochondrial dysfunction, changes in neurotransmitter metabolism, fluctuations in blood glucose, and the effects of inflammatory mediators on metabolism30,28,32. Systemic inflammation and oxidative stress triggered by sepsis damage mitochondria and affect energy production, whereas metabolic changes affect neurotransmission and brain cell survival, leading to cerebral dysfunction. ATP10D belongs to the family of phospholipid-transporting ATPases, which play an important role in lipid metabolism and in the structure and function of cellular membranes. Abnormal expression or mutation of ATP10D may be associated with a number of disorders, especially those involving lipid metabolism and nervous system function33. GABARAP is a protein in the autophagy pathway that helps maintain metabolic homeostasis by participating in autophagy, influencing energy, lipid, and glucose metabolism, coping with metabolic stress, and ensuring mitochondrial function34,32,36. RPS21 is a ribosomal protein that plays a key role in maintaining metabolic homeostasis by indirectly influencing cellular metabolism through its involvement in protein synthesis, regulating energy production, and responding to nutrient changes. PGLYRP1 plays a role in defense against bacterial infections37. It was found that targeting PGLYRP1 inhibited autoimmune neuroinflammation and that PGLYRP1 gene deletion prevented experimental neuritis38,39. These proteins were upregulated in our study, emphasizing the importance of metabolic changes in the development of SAE.