Figure 1 | Scientific Reports

Figure 1

From: Machine learning approach to dynamic risk modeling of mortality in COVID-19: a UK Biobank study

Workflow for model development and feature selection. (A) Conceptual diagram of the data ingestion pipeline and analysis methods. To combine the databases, several data pre-processing steps were carried out: sanitization (eliminating redacted records and nuanced entries); normalization (scaling values to fit within a reasonable range for further processing); time filtering; duration calculation (computing the time interval between a positive test and death); missing value substitution (replacing missing values or records with the mean value of the UK Biobank database); augmentation (bringing all data for each subject into a single unified record); and one-hot encoding (encoding the presence of each pre-existing condition or symptom as a binary sequence for each subject). This data ingestion process standardized the input features and attributes for all subjects in the study, regardless of their unique and variable conditions, symptoms, vital signs, and records. (B) Illustration of the data-driven and clinically reviewed feature refinement process. (C) Schematic representation of the leave-one-out cross-validation method for feature selection and model validation. In each fold, a different sample is left out (purple), and prediction error estimates are based on the left-out samples. AUC = area under the curve; GP = general practice; LOO = leave-one-out; ROC = receiver operating characteristic.
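
Some of the steps named in the caption can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes toy inputs, uses min-max scaling as one possible form of normalization, and uses a hypothetical nearest-centroid scorer in place of the study's model. All function names are illustrative only.

```python
from statistics import mean


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def min_max_scale(column):
    """Scale values to [0, 1] (assumed normalization); constant columns map to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def one_hot(subject_conditions, vocabulary):
    """Encode the presence of each condition in the vocabulary as 0 or 1."""
    return [1 if c in subject_conditions else 0 for c in vocabulary]


def fit_centroids(xs, ys):
    """Hypothetical stand-in model: one centroid per class (0 and 1)."""
    def centroid(rows):
        return [mean(col) for col in zip(*rows)]
    return (centroid([x for x, y in zip(xs, ys) if y == 0]),
            centroid([x for x, y in zip(xs, ys) if y == 1]))


def score_centroids(model, x):
    """Higher score means x is closer to the positive-class centroid."""
    c0, c1 = model
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return d0 - d1


def loo_scores(features, labels, fit, score):
    """Leave-one-out: train on all samples but one, then score the held-out one."""
    held_out = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        held_out.append(score(model, features[i]))
    return held_out


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `loo_scores(features, labels, fit_centroids, score_centroids)` returns one held-out score per subject, and passing those scores to `auc` gives a single estimate based only on left-out samples, mirroring panel (C). Each training fold must contain at least one sample of each class for the centroid stand-in to work.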
