Fig. 2: An overview of the data available for each setting in relation to the labeled sequences used for the prediction task.

For each setting, the total number of sequences is given together with the subset used for model training after exclusion of outliers. For each promoter library, the genotypic background is given: heterologous sigma factors (σ) B, F, or W from Bacillus subtilis expressed in Escherichia coli (E. coli), or wild-type (WT) E. coli harboring no heterologous sigma factors. For the binary setting, the positive class represents fluorescence, indicating the loss of orthogonality. An overview of the data samples for each setting is given in Supplementary Table 1. Source data are provided as a Source Data file.