Table 3 Prediction performance results in different model scenarios.

From: Addressing gaps in data on drinking water quality through data integration and machine learning: evidence from Ethiopia

| Model Scenario | Accuracy (95% CI) | F1-score | Sensitivity | Specificity | AUC (95% CI) |
| --- | --- | --- | --- | --- | --- |
| All Features | 0.89 (0.87, 0.91) | 0.93 | 0.95 | 0.64 | 0.91 (0.89, 0.94) |
| Water Source Only | 0.80 (0.80, 0.80) | 0.89 | 1.00 | 0.00 | 0.80 (0.77, 0.84) |
| Water Source & Household Variables | 0.85 (0.83, 0.87) | 0.91 | 0.95 | 0.46 | 0.89 (0.86, 0.91) |
| Geospatial Only | 0.88 (0.85, 0.90) | 0.92 | 0.95 | 0.58 | 0.91 (0.88, 0.93) |
| Geospatial & Household Variables | 0.87 (0.85, 0.89) | 0.92 | 0.95 | 0.56 | 0.90 (0.87, 0.93) |

  1. Results are based on the 2015/16 ESS data. Accuracy is significantly higher than the no information rate (NIR) of 0.80 in every scenario except “Water Source Only”, where the model predicted all water sources to be contaminated. See Supplementary Tables 5 and 6 for train and test results from the RF and XGBoost models.
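
The “Water Source Only” row can be reproduced by simple arithmetic, which may help in reading the NIR comparison. The sketch below assumes that "contaminated" is the positive class and that it makes up 80% of the test samples, as the NIR of 0.80 implies. The sample size of 1000 and the helper name `binary_metrics` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical test set: 1000 samples, 80% contaminated (assumed positive class).
n_total = 1000
n_pos = 800
n_neg = n_total - n_pos

# No information rate: the share of the largest class.
nir = max(n_pos, n_neg) / n_total

# A classifier that labels every water source as contaminated:
# all positives become true positives, all negatives become false positives.
m = binary_metrics(tp=n_pos, fp=n_neg, tn=0, fn=0)

print(f"NIR         = {nir:.2f}")
print(f"Accuracy    = {m['accuracy']:.2f}")
print(f"F1-score    = {m['f1']:.2f}")
print(f"Sensitivity = {m['sensitivity']:.2f}")
print(f"Specificity = {m['specificity']:.2f}")
```

Under these assumptions, a model that always predicts the majority class has accuracy equal to the NIR, sensitivity of 1.00 and specificity of 0.00, which is the pattern reported for the “Water Source Only” scenario. Its F1-score remains high (about 0.89) despite the model carrying no discriminating information.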