Fig. 2: Benchmarking the machine learning algorithms.

a Comparison of machine-learning regression performance using descriptors generated by different methods. The error bars are the standard deviations of the prediction errors in 5-fold cross-validation. OHE one-hot encoding, DFT density functional theory, EI electrotopological-state index, CM Coulomb matrix. b Parity plot of Pm values (Pm, probability of meso linkages) predicted by Gaussian process regression (GPR) using the DFT-encoded descriptors against observed Pm values from the literature dataset. The error bars are the predicted standard deviations. Optimization curves for the 12-round search for the maximum observed (c) Pm and (d) Pr values (Pr, probability of racemic linkages). Each optimization process was independently repeated for 10 runs (12 iterations per run). For each run, three initial points were randomly selected, and three new points were proposed per iteration. Data are shown as the mean, with the standard deviation as the band width, of the highest (c) Pm or (d) Pr observed up to each iteration (details in Supplementary Information S4.5). The Bayesian optimization curves in c and d both converged within 7 rounds (i.e., the blue band narrowed, as indicated by the arrows). In contrast, the random search exhibited large standard deviations (i.e., the red band never narrowed) and failed to converge within the 12 rounds.
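The aggregation behind the curves in c and d (mean and standard deviation of the highest value observed up to each iteration, over 10 independent runs) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `best_so_far`, `summarize_runs`, and `random_search` are hypothetical, the candidate pool is synthetic rather than the literature dataset, only the random-search baseline is simulated, and the use of the population standard deviation (NumPy's default) is an assumption.

```python
import numpy as np


def best_so_far(observed):
    """Running maximum along each run (rows), in order of evaluation."""
    return np.maximum.accumulate(np.asarray(observed, dtype=float), axis=1)


def summarize_runs(observed, n_init=3, batch=3):
    """Mean and std (across runs) of the best value seen after the
    initial points and after each subsequent batch of `batch` points."""
    running = best_so_far(observed)
    # Index of the last point evaluated at the end of the initial
    # sampling and at the end of each iteration.
    checkpoints = np.arange(n_init - 1, running.shape[1], batch)
    at_iter = running[:, checkpoints]
    # Assumption: population std (ddof=0); the source does not specify.
    return at_iter.mean(axis=0), at_iter.std(axis=0)


def random_search(pool, n_runs=10, n_iter=12, n_init=3, batch=3, seed=0):
    """Random-search baseline: each run draws points without replacement."""
    rng = np.random.default_rng(seed)
    n_total = n_init + n_iter * batch
    return np.stack(
        [rng.choice(pool, size=n_total, replace=False) for _ in range(n_runs)]
    )


# Synthetic stand-in for a pool of candidate Pm values (not real data).
pool = np.random.default_rng(1).uniform(0.3, 0.95, size=200)
observations = random_search(pool)           # shape (10, 39)
mean_curve, std_curve = summarize_runs(observations)

# One entry for the initial points plus one per iteration (13 in total).
for i, (m, s) in enumerate(zip(mean_curve, std_curve)):
    print(f"iteration {i:2d}: best-so-far = {m:.3f} +/- {s:.3f}")
```

Plotting `mean_curve` with a band of `std_curve` around it gives a curve of the kind shown in c and d. A Bayesian optimization run would be summarized the same way, with the per-iteration points proposed by an acquisition function instead of drawn at random.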