
Fig. 1: A systematic auditing framework for ML applications in biology.

From: Systematic auditing is essential to debiasing machine learning in biology

a Presentation of the four modules of the auditing framework. b In Module 1 (Benchmarking), the ML model is trained and tested on a split dataset (Dtr and Dts, respectively) to generate a ‘Test: original’ performance for a given dataset and ML model. Performances are compared across different models and datasets to suggest bias sources that can be examined in subsequent modules, as detailed in Supplementary Note 1 (Systematic Auditing Protocol). c Module 2 (Bias Interrogation) compares the original performance of the model with its performance when tested on an independent generalization dataset (Dg) to detect a bias. d Module 3 (Bias Identification) modifies the data or model used in training and compares the modified performance with the original performance to confirm or reject the formulated bias hypotheses. The auditors here are examples of the bias identification process in paired-input problems.

In the Feature Auditor in d1, the model is trained on the original training dataset but with the features masked (Dtr_m), and tested on the original test set (Dts). The performance of Test: masked is compared to the expected random performance, Test: random; e.g., when AUC is used, the Test: random AUC is 0.5. If Test: masked significantly outperforms Test: random, there is likely a bias in the dataset, independent of the features, that drives the non-random performance.

In the Node-Degree Auditor in d2, each interacting object in the training dataset is represented by its node-degree counts in the positive and negative training datasets to constitute Dtr_d. A model is trained on Dtr_d and tested on the test set Dts_d, in which each object in the original Dts is represented by its node degrees in the training dataset Dtr. The performance of Test: degree is compared to the original performance, Test: original. If there is no significant difference, there is likely a bias related to node-degree recurrence in the original dataset.

The Recurrence Auditor in d3 is similar in structure to the Node-Degree Auditor in d2, except that the ML model is replaced by a function that scores the probability of an interaction between a pair in the test set (Dts) based on the differential node degree of the pair in the positive and negative training sets (recurrence score). These scores are compared against the probabilities generated by the original model, Test: original. If the performance of the recurrence-based scoring function is similar to that of the original model, the model is likely learning from the node-degree bias.

In the Debiasing Auditor in d4, the training dataset is debiased by removing the node-degree bias (node balancing is performed) and the features are masked to create Dtr_mb. The performance of Test: masked, balanced is compared to the expected random performance, Test: random. If Test: masked, balanced is equal to the expected random performance (Test: random; an AUC of 0.5), then the node-degree imbalance is confirmed as the major bias source in this particular data-model combination. If the bias persists, i.e., Test: masked, balanced performs better than random, there is likely another bias driving the learning process. A minimal code sketch of auditors d1 and d3 is given below.
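To make the auditors concrete, here is a minimal Python sketch of the Feature Auditor (d1) and the Recurrence Auditor (d3). It is an illustration under stated assumptions, not the authors' implementation: the synthetic paired-input data, the logistic-regression model, and the use of row shuffling as the masking scheme are all assumptions made for the sketch.

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical paired-input dataset: each example is a pair of object IDs
# plus a feature vector; labels mark interacting (1) / non-interacting (0).
n_objects, n_pairs, n_features = 50, 1000, 20
pairs = rng.integers(0, n_objects, size=(n_pairs, 2))
X = rng.normal(size=(n_pairs, n_features))
y = rng.integers(0, 2, size=n_pairs)

# Split into training (Dtr) and test (Dts) subsets.
split = int(0.8 * n_pairs)
pairs_tr, pairs_ts = pairs[:split], pairs[split:]
X_tr, X_ts = X[:split], X[split:]
y_tr, y_ts = y[:split], y[split:]

def auc(model, X_eval, y_eval):
    return roc_auc_score(y_eval, model.predict_proba(X_eval)[:, 1])

# d1 -- Feature Auditor: train on masked features (Dtr_m), test on Dts.
# Masking is implemented here by shuffling feature rows to break the
# feature-label link (one possible masking scheme). An AUC well above
# 0.5 (Test: random) signals a feature-independent bias in the data.
X_tr_m = rng.permutation(X_tr)
masked_model = LogisticRegression().fit(X_tr_m, y_tr)
print("Test: masked AUC:", auc(masked_model, X_ts, y_ts))

# d3 -- Recurrence Auditor: replace the model with a recurrence score,
# the differential node degree of a pair's members in the positive vs
# negative training pairs. An AUC close to Test: original suggests the
# model is learning the node-degree bias rather than the features.
pos_deg = Counter(pairs_tr[y_tr == 1].ravel())
neg_deg = Counter(pairs_tr[y_tr == 0].ravel())
scores = np.array([pos_deg[a] - neg_deg[a] + pos_deg[b] - neg_deg[b]
                   for a, b in pairs_ts])
print("Test: recurrence AUC:", roc_auc_score(y_ts, scores))
```

With fully random synthetic data, both AUCs hover around 0.5; on a dataset with node-degree imbalance, Test: masked would exceed 0.5 and the recurrence AUC would approach Test: original.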
e Module 4 (Bias Elimination) tests the driving power of the bias identified in Module 3 by debiasing the data (or model) and testing whether the performance generalizes to independent datasets, i.e., whether the performance of the model on the test subset after training on the debiased subset (Dtr_b), Test: debiased, is comparable to its performance on the generalization subset (Dg), Test: debiased, generalization.
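Continuing the sketch above (reusing pairs_tr, X_tr, y_tr, X_ts, y_ts, n_features, rng, and auc), the following illustration of Module 4 node-balances the training pairs, retrains, and compares Test: debiased against Test: debiased, generalization. The greedy balancing heuristic and the synthetic stand-in for Dg are assumptions for the sketch, not the authors' procedure.

```python
# Module 4 sketch: node-balance the negatives so that each object's
# negative node degree does not exceed its positive node degree
# (a simple greedy heuristic; the actual balancing may differ).
def node_balance(pairs_sub, y_sub):
    pos_d = Counter(pairs_sub[y_sub == 1].ravel())
    seen = Counter()
    keep = []
    for i, (a, b) in enumerate(pairs_sub):
        if y_sub[i] == 1:                      # keep all positives
            keep.append(i)
        elif seen[a] < pos_d[a] and seen[b] < pos_d[b]:
            keep.append(i)                     # keep negative if under quota
            seen.update((a, b))
    return np.array(keep)

idx_b = node_balance(pairs_tr, y_tr)           # debiased subset Dtr_b
debiased_model = LogisticRegression().fit(X_tr[idx_b], y_tr[idx_b])

# Independent generalization set Dg (a synthetic stand-in here).
X_g = rng.normal(size=(200, n_features))
y_g = rng.integers(0, 2, size=200)

print("Test: debiased AUC:", auc(debiased_model, X_ts, y_ts))
print("Test: debiased, generalization AUC:", auc(debiased_model, X_g, y_g))
```

If the two printed AUCs are comparable, the debiased model's performance generalizes, supporting the identified bias as the driver of the original (non-generalizing) performance.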