Figure 2

Visual representation of the proposed SNP selection approach in a BC risk prediction task. (A) The genotyped data are partitioned into a training fold and test data in a 4:1 proportion. The training fold is further partitioned using 5-fold stratified CV: one fold (the validation data) is used to evaluate the set of SNPs identified by module 2, and the remaining four folds are merged into a training set for XGBoost model training and for finding the initial candidate BC risk-predictive SNPs (module 1). (B) The training fold data are used for XGBoost hyperparameter optimization. (C) Module 1: the training set is used to learn an XGBoost model and produce an initial list of candidate BC risk-predictive SNPs. (D) Module 2: an adaptive iterative SNP selection process that uses the initial list of candidate SNPs obtained from (C) and the validation data. In this process, SNPs are re-ranked (see Algorithm 1), and the top interacting SNPs yielding the best BC risk prediction accuracy on the validation data are selected. (E) The top interacting SNPs identified in (D) are used to predict BC risk on the test data with an SVM classifier. (F) Performances are averaged to obtain the final BC risk prediction accuracy across the test data. No individual appears in more than one of the training, validation, and test sets.
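
The partitioning in panel (A) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names `stratified_folds` and `nested_partitions`, the toy case/control labels, and the reading of the 4:1 split as the outer loop of a 5-fold CV (suggested by the averaging in panel (F)) are all assumptions. Model training (XGBoost, SVM) and the SNP re-ranking of Algorithm 1 are omitted.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed):
    """Split sample indices into k folds with similar class proportions.

    `labels` maps a sample index to its class label (e.g. 1 = case, 0 = control).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in labels.items():
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the members of each class round-robin across the folds.
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds


def nested_partitions(labels, k_outer=5, k_inner=5, seed=0):
    """Yield (train_idx, val_idx, test_idx) for every outer/inner fold pair.

    Assumption: the 4:1 training-fold/test split is the outer loop of a
    k_outer-fold stratified CV; the inner loop splits the training fold
    into a training set (k_inner - 1 folds) and validation data (1 fold).
    """
    outer = stratified_folds(labels, k_outer, seed)
    for o, test_idx in enumerate(outer):
        # Training fold: every individual not held out as test data.
        training_fold = {
            i: labels[i]
            for j, fold in enumerate(outer) if j != o
            for i in fold
        }
        inner = stratified_folds(training_fold, k_inner, seed + o + 1)
        for v, val_idx in enumerate(inner):
            train_idx = [
                i
                for j, fold in enumerate(inner) if j != v
                for i in fold
            ]
            yield train_idx, val_idx, test_idx


# Toy data (hypothetical): 200 individuals, alternating control (0) / case (1).
toy_labels = {i: i % 2 for i in range(200)}
train, val, test = next(nested_partitions(toy_labels))
print(len(train), len(val), len(test))  # 128 32 40
```

Because the inner folds are drawn only from the training fold, the training set, validation data, and test data are disjoint by construction, which matches the last sentence of the caption.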