Fig. 2: Schematic of the machine learning pipeline based on the super learner framework for the melanin binding data set.
From: Machine learning-driven multifunctional peptide engineering for sustained ocular drug delivery

a Scheme of a larger microarray, which includes 5499 peptides used to train a regression super learner. Random peptides were generated based on position-dependent amino acid frequencies calculated using the second peptide array data, and their melanin binding levels were predicted. Peptides with desired melanin binding levels were selected for further experimental validation. Created with BioRender.com. b Scheme of the super learner complexity reduction. Holdout predictions of peptides (shown as rows) were generated for each base model (shown as columns) with tenfold cross-validation (CV) on the input data set. A meta-learner (generalized linear model) was fitted on the holdout predictions with another tenfold cross-validation. The number of base models was reduced by applying an iterative reduction procedure (see Methods). The final super learner ensemble was trained on the input data set with the optimal combination of the selected base models. c Scheme of the machine learning pipeline for an unbiased model performance evaluation. The nested cross-validation includes an outer loop for model evaluation and an inner loop for model selection (cyan). The outer loop generated 10 sets of train-test splits using a Monte Carlo method, and the inner loop generated 10 sets of train-test splits using a modulo method. d Plot of the base models of the final melanin binding super learner. Coefficients of determination (R²) are denoted with color and shown as white text on the bars or gray text adjacent to the bars. Base model coefficients are indicated at the bar ends. One model with a zero coefficient is not shown. See Methods and Supplementary Note 2 for model hyperparameters and model performance statistics.
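The split scheme described for panel c can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 10% test fraction, the random seed, and the reading of "modulo method" as assigning the item at position p to fold p mod 10 are all assumptions made for illustration. Only the counts (5499 peptides, 10 outer splits, 10 inner splits) come from the caption.

```python
import random


def monte_carlo_splits(n_samples, n_splits=10, test_fraction=0.1, seed=0):
    """Outer loop: repeated random train/test splits (Monte Carlo CV).

    test_fraction and seed are illustrative assumptions, not values
    taken from the caption.
    """
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    splits = []
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        test = sorted(indices[:n_test])
        train = sorted(indices[n_test:])
        splits.append((train, test))
    return splits


def modulo_folds(indices, n_folds=10):
    """Inner loop: deterministic folds.

    Assumed interpretation of the "modulo method": the item at
    position p in `indices` is held out in fold p % n_folds.
    """
    folds = []
    for k in range(n_folds):
        holdout = [idx for p, idx in enumerate(indices) if p % n_folds == k]
        train = [idx for p, idx in enumerate(indices) if p % n_folds != k]
        folds.append((train, holdout))
    return folds


def nested_cv_splits(n_samples, n_outer=10, n_inner=10,
                     test_fraction=0.1, seed=0):
    """Combine both loops: model selection would use the inner folds,
    model evaluation the outer test set, which the inner loop never sees."""
    plan = []
    outer = monte_carlo_splits(n_samples, n_outer, test_fraction, seed)
    for outer_train, outer_test in outer:
        plan.append({
            "outer_train": outer_train,
            "outer_test": outer_test,
            "inner": modulo_folds(outer_train, n_inner),
        })
    return plan


plan = nested_cv_splits(5499)
```

Because the inner folds are built only from each outer training set, the outer test peptides play no role in model selection, which is what makes the outer-loop performance estimate unbiased.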