Fig. 5: Predictive modeling of mental disorders using EHR features and knowledge graph embeddings.
From: Large language model powered knowledge graph construction for mental health exploration

a Workflow illustrating the integration of UK Biobank electronic health records (EHRs) and the MDKG for mental disorder prediction. EHRs are first cleaned and preprocessed to extract non-medical factors (e.g., lifestyle, environmental exposures, family history). ICD-10 diagnoses and Phecodes are mapped to knowledge graph (KG) entities, and RDF2Vec is used to generate embeddings for the aligned medical entities. Each patient’s medical history embedding is computed by averaging these vectors. Predictive models are then trained under three input settings: (1) EHR factors only, (2) medical KG embeddings only, and (3) a combination of both.

b SHAP-based feature importance for predicting major depressive disorder (MDD) using environmental factors only (top) and using environmental factors combined with KG embeddings (bottom). Bars represent the average absolute SHAP value of each feature across all predictions.

c Prediction performance (AUC) for MDD, anxiety, and bipolar disorder across different models and input feature settings. Each box plot shows results from 10-fold cross-validation for four classifiers, namely logistic regression (LR), random forest (RF), support vector machine (SVM), and XGBoost, under three experimental conditions: KG Embeddings Only, EHR Factors Only, and EHR + KG Embeddings. Box plots show the median (center line), the interquartile range (IQR; box: 25th–75th percentile), and whiskers extending to 1.5 times the IQR. Each dot represents the AUC score from one cross-validation fold. Points beyond the whiskers are plotted as outliers.

The source data are provided as a Source Data file.
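
The embedding-averaging step and the three input settings described for panel a can be illustrated with a minimal sketch. This is not the authors' code: the entity names, the three-dimensional toy vectors, the function names `patient_embedding` and `build_inputs`, and the choice of a zero vector for patients with no mapped entity are all assumptions made for illustration. Real RDF2Vec embeddings would be loaded from a trained model and have many more dimensions.

```python
from statistics import fmean

# Hypothetical toy embeddings for KG entities (illustrative values only;
# in practice these would come from a trained RDF2Vec model).
ENTITY_EMBEDDINGS = {
    "entity_a": [1.0, 0.0, 2.0],
    "entity_b": [3.0, 2.0, 0.0],
    "entity_c": [0.5, 0.5, 0.5],
}
EMBEDDING_DIM = 3


def patient_embedding(entities, table=ENTITY_EMBEDDINGS, dim=EMBEDDING_DIM):
    """Average the embeddings of a patient's mapped KG entities."""
    vectors = [table[e] for e in entities if e in table]
    if not vectors:
        # Assumption: a patient with no mapped entity gets a zero vector.
        return [0.0] * dim
    # zip(*vectors) groups values by dimension; fmean averages each dimension.
    return [fmean(column) for column in zip(*vectors)]


def build_inputs(ehr_factors, entities):
    """Return one feature vector per input setting."""
    kg = patient_embedding(entities)
    return {
        "ehr_only": list(ehr_factors),           # setting (1)
        "kg_only": kg,                           # setting (2)
        "ehr_plus_kg": list(ehr_factors) + kg,   # setting (3)
    }


inputs = build_inputs(ehr_factors=[1.0, 0.0], entities=["entity_a", "entity_b"])
print(inputs["kg_only"])      # [2.0, 1.0, 1.0]
print(inputs["ehr_plus_kg"])  # [1.0, 0.0, 2.0, 1.0, 1.0]
```

Each of the three vectors would then be passed to a classifier and evaluated fold by fold, so that differences in AUC between settings reflect only the change in input features.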