Fig. 1: Development of CLL-TIM and selection of high-risk patients for the PreVent-ACaLL clinical trial.
From: Machine learning can identify newly diagnosed patients with CLL at high risk of infection

a For each patient, we modeled patient data in three look-back windows. The prediction point was set at 3 months post-diagnosis, and the 2-year risk of infection or CLL treatment (composite outcome) was the target outcome. b We assembled five datasets on 4149 CLL patients from the Nationwide Danish CLL registry, the Danish Microbiology Database, the Persimune data warehouse and health registries. c Using the bag-of-words (BOW) approach, we modeled the frequency of occurrence of 216 diagnoses, 153 pathologies, and 46 microbiology findings (including 9 blood culture findings). We also modeled the distribution of past infections and laboratory test results, and designed latent features that model the urgency of the patient’s condition and the patient’s symptoms as interpreted by the treating physician. d Generating a single base-learner (single outlook) required the random selection of a machine learning algorithm, hyper-parameters, a target outcome, and a feature selection. In total, using this randomized protocol, we generated 2000 base-learners, each with its own unique outlook into a patient’s history. e A genetic algorithm (GA) was designed to generate 29 ensembles of 2–30 base-learners each. The generated ensembles were then post-ranked according to multiple criteria designed to maximize the generalizability of the ensemble. f The top-ranked ensemble, chosen as CLL-TIM, was then invoked to predict the 2-year composite outcome on a previously unseen test cohort and subsequently to select high-risk patients for the PreVent-ACaLL clinical trial.
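The bag-of-words step in panels a and c can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `bow_features`, the window lengths in `WINDOWS_DAYS`, and the toy event codes are all assumptions made for illustration, since the caption states only that three look-back windows were used and that code frequencies were counted.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical look-back window lengths in days. The caption says three
# windows were used but does not give their lengths.
WINDOWS_DAYS = (90, 365, 3650)


def bow_features(events, prediction_point, vocabulary, windows_days=WINDOWS_DAYS):
    """Count how often each code occurs in each look-back window.

    events: iterable of (event_date, code) pairs for one patient.
    Returns a dict mapping (window_days, code) -> count.
    Events on or after the prediction point are ignored.
    """
    features = {(w, code): 0 for w in windows_days for code in vocabulary}
    for w in windows_days:
        start = prediction_point - timedelta(days=w)
        counts = Counter(
            code
            for event_date, code in events
            if start <= event_date < prediction_point and code in vocabulary
        )
        for code, n in counts.items():
            features[(w, code)] = n
    return features


# Toy patient record with illustrative codes (not taken from the study).
diagnosis = date(2015, 1, 1)
prediction_point = diagnosis + timedelta(days=91)  # roughly 3 months post-diagnosis
events = [
    (date(2015, 3, 1), "J18"),
    (date(2014, 9, 15), "J18"),
    (date(2010, 5, 2), "N39"),
    (date(2015, 6, 1), "J18"),  # after the prediction point, so not counted
]
features = bow_features(events, prediction_point, vocabulary=("J18", "N39"))
```

Each `(window, code)` pair becomes one feature column, so the same code can contribute a different count per window. This is one plausible reading of how frequency features over several look-back windows could be assembled before being passed to the base-learners.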