Fig. 2: CLL-TIM ensemble composition: base-learners, variables and feature encodings. | Nature Communications

From: Machine learning can identify newly diagnosed patients with CLL at high risk of infection

CLL-TIM is composed of 28 base-learners: 20 model the composite outcome, 5 model CLL treatment outcome and 3 model infection as a first event. Base-learners include 13 XGBoost, 7 Random Forests, 4 Extra Trees, 2 Elastic Network and 2 Logistic Regression models. a Distinct variables used in CLL-TIM and b features that these variables are encoded into. Each variable may have multiple feature encodings (right panel in b) and can hence represent multiple features in the ensemble model. For instance, the variable Haemoglobin is encoded into four features: the mean, minimum and variability of the test result and the number of days since the last test. These four features were calculated on patient look-backs of 3 months, 1 year and 7 years. In total, the feature encodings selected for 84 variables in CLL-TIM resulted in a set of 228 features (Supplementary Data 1). For visualization purposes, here we show a condensed representation with only variables that were used by at least 10% of CLL-TIM’s base-learners (left panel in a). CLL-TIM models: rates of change, variability, average values and extremes in the results of several routine lab tests; recency of lab tests and the number of tests taken (modeling doctors’ decisions); several encodings representing the recency and distribution of infection dates; counts of rare pathology codes (i.e. those found in <1% of the CLL training cohort) and inflammation events (pathology diagnoses). ECOG – Eastern Cooperative Oncology Group, IGHV – immunoglobulin heavy chain gene, β-2M – beta-2 microglobulin, TIBC – total iron-binding capacity, MCV – mean cell volume.
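To make the Haemoglobin example concrete, the following is a minimal sketch of how one lab variable could be expanded into the four encodings named in the caption over the three look-back windows. It is not the authors' implementation: the function name `encode_lab_variable`, the feature naming scheme, the window lengths in days, and the use of the population standard deviation as the "variability" measure are all assumptions made for illustration.

```python
from datetime import date, timedelta
from statistics import mean, pstdev

# Look-back windows from the caption, approximated in days (assumption).
LOOKBACKS = {"3m": 91, "1y": 365, "7y": 7 * 365}
STATS = ("mean", "min", "variability", "days_since_last")


def encode_lab_variable(results, prediction_date, name="haemoglobin"):
    """Encode one lab variable into per-window features.

    results: list of (test_date, value) tuples for a single patient.
    Returns a dict with one entry per (statistic, look-back) pair.
    Hypothetical helper; names and definitions are illustrative only.
    """
    features = {}
    for label, days in LOOKBACKS.items():
        start = prediction_date - timedelta(days=days)
        # Keep only tests that fall inside this look-back window.
        window = [(d, v) for d, v in results if start <= d <= prediction_date]
        if not window:
            # No tests in the window: mark every encoding as missing.
            for stat in STATS:
                features[f"{name}_{stat}_{label}"] = None
            continue
        values = [v for _, v in window]
        last_test = max(d for d, _ in window)
        features[f"{name}_mean_{label}"] = mean(values)
        features[f"{name}_min_{label}"] = min(values)
        # "Variability" is taken here as the population standard deviation.
        features[f"{name}_variability_{label}"] = pstdev(values)
        features[f"{name}_days_since_last_{label}"] = (prediction_date - last_test).days
    return features


# Made-up test results for one hypothetical patient.
example_results = [
    (date(2015, 1, 1), 9.0),
    (date(2019, 6, 1), 7.0),
    (date(2020, 1, 1), 8.0),
]
example_features = encode_lab_variable(example_results, date(2020, 1, 31))
```

In this sketch a single variable yields up to 12 candidate features (4 encodings × 3 look-backs). The caption indicates that feature encodings were selected per variable, so the final model need not retain every combination.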
