Fig. 2: Information entropy measures dataset completeness, compressibility, and sample efficiency in machine learning interatomic potentials.

From: Model-free estimation of completeness, uncertainties, and outliers in atomistic machine learning using information theory

a Information entropy for three example molecules from the rMD17 dataset as a function of the dataset size. Simpler molecules exhibit lower entropy and converge faster, while more diverse molecules require more samples to converge. b Correlation between the error in predicted forces and the information gap for all molecules in the rMD17 dataset (refs. 16, 38). The errors were obtained from the original reference for MACE (ref. 22). Circles indicate errors when 1000 samples are used to train the models; crosses indicate errors when only 50 samples are used. ρ is Pearson's correlation coefficient. c Information entropy (blue bars) of selected subsets of the carbon GAP-20 dataset (ref. 39). The maximum entropy is given by \(\log n\) (gray bars), where n is the number of atomic environments. The results are sorted by dataset entropy, and the numbers give the dataset entropy (in nats). d Information gap obtained by compressing the “Fullerenes” and “Graphene” subsets of GAP-20 by up to 20% of their original sizes. While the information gap of “Graphene” remains close to zero, that of “Fullerenes” increases monotonically as the dataset size decreases. e Test root-mean-squared errors relative to those obtained when a MACE model is trained on the full subset of GAP-20 (ΔRMSE). The results show that the “Graphene” subset can be compressed by up to 20% of its size without loss of performance, whereas this is not the case for the “Fullerenes” subset. f Information entropy (\({\mathcal{H}}\)) and diversity (D) for the ANI-Al dataset (ref. 31), computed for each generation of active learning. Oversampling of certain phases leads to a total reduction of entropy, as demonstrated in (g), which shows decreasing novelty in the samples. Here, novelty is the fraction of environments showing \(\delta {\mathcal{H}} > 0\) when the dataset of all previous generations is taken as the reference. Nevertheless, the diversity of the dataset continues to increase.
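As a concrete illustration of the quantities in this caption, the NumPy sketch below estimates a dataset's information entropy with a Gaussian kernel over fixed-length atomic-environment descriptors, compares it to the maximum entropy \(\log n\) (panel c), and computes a per-environment novelty signal in the spirit of \(\delta {\mathcal{H}} > 0\) (panel g). The function names, the bandwidth h, and the exact kernel estimator are illustrative assumptions made for this sketch, not the paper's implementation; in practice the bandwidth must be tuned to the length scale of the descriptors.

    import numpy as np

    def dataset_entropy(X, h=1.0):
        # X: (n, d) array of atomic-environment descriptors; h: kernel bandwidth
        # (an assumed free parameter). Gaussian-kernel entropy estimate, in nats:
        #   H = -(1/n) * sum_i log( (1/n) * sum_j exp(-|x_i - x_j|^2 / (2 h^2)) )
        d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K = np.exp(-d2 / (2.0 * h ** 2))
        return -np.mean(np.log(K.mean(axis=1)))

    def delta_entropy(X_new, X_ref, h=1.0):
        # Per-environment "surprise" of new samples relative to a reference set,
        # offset by the reference entropy so that positive values flag
        # environments that are novel with respect to the reference
        # (an illustrative stand-in for the delta-H > 0 criterion of panel g).
        d2 = np.sum((X_new[:, None, :] - X_ref[None, :, :]) ** 2, axis=-1)
        K = np.exp(-d2 / (2.0 * h ** 2))
        return -np.log(K.mean(axis=1)) - dataset_entropy(X_ref, h)

    # Demo with synthetic stand-in descriptors.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))
    # Dataset entropy vs. the maximum entropy log n (gray bars in panel c):
    print(dataset_entropy(X), np.log(len(X)))
    # Novelty as in panel g: fraction of new environments with delta-H > 0.
    X_new = rng.normal(size=(50, 8))
    print(np.mean(delta_entropy(X_new, X) > 0.0))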