Figure 1 | Scientific Reports

Figure 1

From: Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants

A schematic representation of the hyperSMURF method. HyperSMURF divides the majority class (the negative class of probably non-deleterious variants, shown as blue rectangles) into n partitions. For each partition, oversampling techniques generate additional examples of the minority class (the positive class of deleterious variants, shown as green rectangles) that closely resemble the distribution of the actual positive examples in the vector space of genomic attributes, thereby increasing the number of minority-class training examples. At the same time, a comparable number of examples is subsampled from the majority class. HyperSMURF then trains n random forests in parallel, one on each of the resulting balanced data sets, and finally combines the predictions of the n ensembles following a hyper-ensemble (ensemble of ensembles) approach.
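
The data flow in the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several pieces are assumptions made for illustration only: the function and parameter names (`balanced_sets`, `factor`, `ratio`, `k`) are hypothetical, the oversampling is a simple SMOTE-style interpolation between neighbouring positives, and a dependency-free nearest-centroid scorer stands in for the random forest trained on each balanced set. The toy two-dimensional data are invented purely to exercise the code.

```python
import math
import random


def partition(items, n, rng):
    """Shuffle the majority class and split it into n disjoint parts."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::n] for i in range(n)]


def oversample(positives, n_new, k, rng):
    """SMOTE-style assumption: interpolate between a positive example
    and one of its k nearest positive neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(positives)
        others = [p for p in positives if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        t = rng.random()
        synthetic.append(tuple(b + t * (m - b) for b, m in zip(base, mate)))
    return synthetic


def balanced_sets(positives, negatives, n, factor, ratio, k, rng):
    """Build one balanced training set per majority-class partition."""
    sets = []
    for part in partition(negatives, n, rng):
        # Amplify the minority class: real positives plus synthetic ones.
        pos = list(positives) + oversample(positives, factor * len(positives), k, rng)
        # Subsample a comparable number of majority-class examples.
        neg = rng.sample(part, min(len(part), ratio * len(pos)))
        sets.append((pos, neg))
    return sets


def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def fit_stand_in(pos, neg):
    """Hypothetical stand-in for a random forest: scores a point by its
    relative distance to the two class centroids (1.0 = positive-like)."""
    c_pos, c_neg = centroid(pos), centroid(neg)

    def score(x):
        d_pos, d_neg = math.dist(x, c_pos), math.dist(x, c_neg)
        total = d_pos + d_neg
        return 0.5 if total == 0 else d_neg / total

    return score


def hyper_ensemble_score(models, x):
    """Combine the n base models by averaging their scores."""
    return sum(model(x) for model in models) / len(models)


# Toy data (invented): 5 positives near (2, 2), 100 negatives near (0, 0).
rng = random.Random(0)
positives = [(2.0 + 0.1 * i, 2.0 - 0.1 * i) for i in range(5)]
negatives = [((i % 10) * 0.1 - 0.5, (i // 10) * 0.1 - 0.5) for i in range(100)]

sets = balanced_sets(positives, negatives, n=4, factor=2, ratio=1, k=3, rng=rng)
# The base models are independent of each other, so this loop could run in parallel.
models = [fit_stand_in(pos, neg) for pos, neg in sets]
```

Calling `hyper_ensemble_score(models, x)` on a new point `x` returns the average of the n base scores. In practice a real random forest implementation would take the place of `fit_stand_in`, and the averaging step is one plausible way to combine the ensembles.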