Figure 4
From: Linguistic measures of chemical diversity and the “keywords” of molecular collections

Estimation of linguistic richness based on partial analyses of datasets and Heaps' law. (a) As progressively larger fractions of a particular collection/database (here, 1,000 drugs) are analyzed, fits based on Heaps' law, V_R(n) = K n^β, converge to the type-token distribution characterizing the entire collection. (b) During this convergence, the exponents β decrease and the prefactors K increase. The inset shows that this relationship is common to different molecular and literature collections: the straight lines on the doubly logarithmic scale indicate a power law, β ~ K^(−γ) (note: similar slopes correspond to similar values of γ). (c) Prediction of the type-token ratios (TTRs) from the partial fits for different types of collections. The true TTR of the entire collection is taken as 100%. "Database group 1" and "database group 2" are the two families of Mcule databases from Fig. 3. The largest discrepancy between the fits and the real diversity is observed for natural products, whose linguistic peculiarity is also manifest in our other analyses (cf. Fig. 2b, where the natural-products curve crosses the curves for drugs and Reaxys molecules). For the other collections, analyzing 30–50% of the content already gives a reasonable estimate of their actual diversity.
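The partial-fit procedure described in the caption can be illustrated with a minimal sketch. It is not the authors' code. It assumes that the Heaps' law parameters are obtained by an ordinary least-squares fit on the log-log scale (the caption does not state the fitting method), and it uses synthetic Zipf-distributed tokens as a stand-in for a tokenized molecular collection. The function names `type_token_curve` and `fit_heaps` are hypothetical. The TTR of the full collection is extrapolated from each partial fit as V(N)/N = K N^(β−1).

```python
import numpy as np


def type_token_curve(tokens):
    """Return V(n), the number of distinct types among the first n tokens."""
    seen = set()
    curve = np.empty(len(tokens), dtype=int)
    for i, tok in enumerate(tokens):
        seen.add(tok)
        curve[i] = len(seen)
    return curve


def fit_heaps(curve):
    """Fit log V = log K + beta * log n by least squares; return (K, beta).

    Assumption: a log-log linear fit; other fitting schemes are possible.
    """
    n = np.arange(1, len(curve) + 1)
    beta, log_k = np.polyfit(np.log(n), np.log(curve), 1)
    return float(np.exp(log_k)), float(beta)


# Synthetic stand-in for a tokenized collection (assumption: Zipf-like
# token frequencies; real data would be fragments or words).
rng = np.random.default_rng(0)
tokens = rng.zipf(1.5, size=20_000).tolist()

curve = type_token_curve(tokens)
n_total = len(tokens)
true_ttr = curve[-1] / n_total

for fraction in (0.1, 0.3, 0.5, 1.0):
    m = int(fraction * n_total)
    K, beta = fit_heaps(curve[:m])
    # Extrapolate the fit to the full collection size: TTR = V(N) / N.
    predicted_ttr = K * n_total ** (beta - 1.0)
    percent = 100.0 * predicted_ttr / true_ttr
    print(f"fraction={fraction:.1f}  K={K:.3f}  beta={beta:.3f}  "
          f"predicted TTR = {percent:.1f}% of true")
```

With real data, `tokens` would be replaced by the sequence of fragments (or words) in the order in which the collection is read; the trends in K and β reported in panel (b) refer to the authors' datasets and need not be reproduced by this synthetic example.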