Fig. 8: Explanation of the limits of KDE to determine domain.
From: A general approach for determining applicability domain of machine learning models

a shows the assessment of the Friedman data fit with an RF model type where we purposefully shuffled y to acquire an \(M^{prop}\) with no predictive ability. Because X no longer has a strong relationship with y, all data are OD. b shows the relationship between d and \(E^{area}\) when \(\sigma_{u}\) is used instead of \(\sigma_{c}\) for \(M^{unc}\). Because \(M^{unc}\) is poor at estimating uncertainties, all data are OD. The UMAP projections of Friedman and FWODC onto two dimensions are shown in c, d, respectively. One sampling of X yields distinct regions in feature space (left) and the other does not (right). The colors represent the labels for three clusters acquired through agglomerative clustering. e, f show the relationship between d, \({E}^{RMSE/{\sigma }_{y}}\), and \(E^{area}\) for the poorly clustered FWODC data. Note that at least the bin with the highest d for \({A}^{RMSE/{\sigma }_{y}}\) should be OD. We observe that the data closest to our \(X_{ITB}\) for \(A^{area}\) are marked as OD by \({E}_{c}^{area}\), but should ideally be ID.
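The shuffling experiment in panel a can be illustrated with a minimal sketch. It is not the authors' code. It assumes scikit-learn's `make_friedman1` as a stand-in for the Friedman data, a default `RandomForestRegressor` as the RF model, and the standard deviation of the held-out y as the normalizer. The function name `rmse_over_sigma_y`, the sample size, and the split ratio are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse_over_sigma_y(shuffle_y, seed=0):
    """Fit an RF on Friedman-style data and return held-out RMSE / std(y).

    Assumption: make_friedman1 stands in for the Friedman data set, and
    the normalizer is the standard deviation of the held-out targets.
    """
    X, y = make_friedman1(n_samples=500, n_features=5, noise=0.1,
                          random_state=seed)
    if shuffle_y:
        # Permuting y breaks the relationship between X and y.
        y = np.random.default_rng(seed).permutation(y)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    return rmse / np.std(y_te)


intact = rmse_over_sigma_y(shuffle_y=False)
shuffled = rmse_over_sigma_y(shuffle_y=True)
print(f"RMSE/sigma_y, intact y:   {intact:.2f}")
print(f"RMSE/sigma_y, shuffled y: {shuffled:.2f}")
```

A ratio near 1 means the model does no better than predicting the mean of y, which is the situation panel a describes as having no predictive ability.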