Introduction

The field of chemistry is currently undergoing a paradigm shift driven by the adoption of data-driven modeling techniques, particularly machine learning (ML) and artificial intelligence (AI) algorithms. These algorithms can potentially transform vast amounts of chemical data into valuable predictions, fundamentally changing the way chemists approach research1,2,3. ML algorithms are increasingly being employed to predict catalyst efficiency4,5, regioselectivity6,7, and chemical properties such as toxicity and electrophilicity8,9,10,11,12. Beyond property prediction, ML/AI methods have been successfully applied in drug discovery13,14, the development of new materials15,16, retrosynthetic analysis17, the optimization of reaction conditions18,19,20,21, and even the creation of synthesis-capable robotic platforms22. Furthermore, advances in modern quantum chemistry and computational techniques now enable the accurate calculation of a wide range of molecular properties. These computed descriptors can be combined with experimental output data to further enhance the predictive capabilities of ML algorithms.

Typically, ML/AI methods require large datasets as input. In many fields today, such datasets are increasingly generated through high-throughput experiments (HTEs). HTE is a scientific approach that combines laboratory automation, optimized experimental design, and the rapid execution of parallel or sequential experiments. It allows for the systematic acquisition of large, reproducible datasets using robotics, miniaturized workflows, and computational data management pipelines. This methodology accelerates discovery, enhances reproducibility, and supports data-driven decision-making. Consequently, HTE has had a transformative impact across diverse scientific and engineering disciplines, including chemistry23,24,25, biology26,27,28, materials science29,30, and other related fields31,32,33.

The goal of any HTE is to investigate a specific variable, referred to as the dependent or target variable or a response, e.g., the yield of a chemical reaction, the success probability of a genetic variant, or ionic conductivity. This is achieved by defining a set of experimental conditions under which this target variable is measured. In essence, an HTE involves systematically obtaining measurements of the target variable across combinations of experimental conditions that are relevant to the study.

To be more specific, let us consider a dataset related to palladium-catalyzed C-N cross-coupling, commonly known as the Buchwald-Hartwig amination (Fig. 1a). Despite its importance in organic synthesis, this reaction is notoriously sensitive and heavily dependent on carefully chosen (often empirically determined) reaction conditions23,34,35,36,37. The authors of ref. 23 obtained an impressive Buchwald-Hartwig amination dataset by exploiting HTE reactions, which we use as a case study in this work. In this HTE, ref. 23 selected 22 isoxazole additives (23 were originally considered, but one was excluded from the analysis), 15 aromatic halides, 4 palladium catalyst ligands, and 3 bases, as illustrated in Fig. 1b. The reaction yield was subsequently measured under each of the 22 × 15 × 4 × 3 = 3 960 possible combinations of these reaction conditions. In addition to yield values, a set of descriptors of all reaction conditions was provided: 19 descriptors were computed for additives, 27 for aryl halides, 10 for bases, and 64 for ligands. Note that these values were not obtained in the HTE but rather chosen as descriptors of the selected reaction conditions. Finally, ref. 23 applied various ML techniques in order to model and predict the obtained yield values in terms of the selected descriptors.

Fig. 1: The reaction components of Buchwald-Hartwig amination as used in ref. 23.
figure 1

In (a) the Buchwald-Hartwig amination is shown, while (b) summarizes all explored systems.

The work of ref. 23 has received considerable attention and has inspired numerous follow-up studies. By mid-2025, the publication had garnered over 1 100 citations, according to Google Scholar. Various standard ML/AI methods, including gradient boosting, deep forest, neural networks, and k-nearest neighbors, have since been applied in similar contexts, both to the Buchwald-Hartwig amination and to similar datasets from other chemical reactions generated via HTE24,38,39,40,41,42,43.

To identify the most suitable approaches for analyzing any HTE-generated data, it is first essential to understand the type and structure of the data in order to situate it appropriately within a regression framework. In particular, the selection of estimation or prediction methods should take into account not only the type of the target variable but also that of the explanatory features.

For example, in the Buchwald-Hartwig amination dataset, the target variable is the reaction yield, a continuous variable ranging from 0 to 100. This outcome is modeled as a function of four features: additives, aryl halides, bases, and ligands, which collectively determine the reaction performance. In statistical terms, all four features are categorical variables; that is, the values of each feature correspond to discrete, chemically distinct species or conditions with no inherent order. For instance, different additives interact with catalytic intermediates through fundamentally distinct mechanisms, such as coordination, electron donation or withdrawal, steric hindrance, hydrogen bonding, or dispersion interactions, and thus cannot be meaningfully represented on a continuous or ordered scale. Hence, in this case, we are dealing with a regression problem involving a continuous target variable and four categorical features. As will be demonstrated in this work, the data-generating process underlying such data can be captured by a simple (generalized) linear model, making the application of ML/AI techniques superfluous. Moreover, compared to ML/AI approaches, a linear model is fully interpretable, allowing for deeper insight into the underlying reaction mechanisms. In other words, if the data-generating process of an HTE follows a simple parametric model, ML/AI techniques cannot offer any clear advantage, both in terms of modeling and prediction.

In the broader landscape of HTE-generated data across different scientific fields, all possible combinations of data types can be encountered. While the target variable is typically continuous, the features may be continuous, categorical, or a mixture of both. In settings where both the target and features are continuous, ML/AI methods may offer advantages over simple parametric alternatives in terms of predictive performance, although this typically comes at the expense of interpretability and system-level understanding. Nevertheless, in many disciplines and especially in chemistry, a substantial share of HTE-generated datasets involves categorical features only.

In this work, we focus on HTE-generated datasets with a target variable from the exponential family of distributions and categorical features. Using the Buchwald-Hartwig amination dataset as a case study, we discuss its properties and subsequently propose a simple linear statistical model that accurately captures the corresponding data-generating process. We show that standard estimation techniques are unsuitable for parameter estimation within this model. Subsequently, we develop a tailored estimation method that enables both reliable identification and meaningful interpretation of model parameters, thereby providing deeper insight into the most influential components of the Buchwald-Hartwig amination reaction. Although our approach is demonstrated using a single case study, the proposed method is readily extendable to other datasets characterized by a target variable from the exponential family and categorical features. To substantiate this claim, in the Supplementary Section 1, we provide a link to the Python package that implements our approach for an arbitrary number of features (ranging from two to five), together with the code and analysis of two additional datasets containing three and five categorical features, respectively.

Results

Structures in Buchwald-Hartwig amination data

Let us first examine the data collected in ref. 23 in detail. There, 3 960 values of the yield and of each of the 120 descriptors are available, resulting from measurements under all possible 22 × 15 × 4 × 3 = 3 960 combinations of reaction components. These reaction components will be referred to as additives, halides, ligands, and bases; statistically, they are categorical variables, or factors, that take 22, 15, 4, and 3 possible unordered values (known as factor levels), respectively. Additionally, a matrix of 120 descriptors is provided, containing 19 continuous descriptors of additives, 27 of halides, 10 of bases, and 64 of ligands. First, we discuss the statistical properties of the yield; then, we examine the structures within the matrix of descriptors; and finally, we present a model that adequately captures the data-generating process, allowing for meaningful interpretation.

In a statistical framework, the target variable, in our case, the reaction yield, is typically treated as a random variable and often assumed to follow a normal distribution. However, as the histogram in Fig. 2a clearly shows, the distribution of yields deviates substantially from normality: it has bounded support (restricted to the [0,100] interval, in line with the physical meaning of yield) and is right-skewed (with a greater number of reactions resulting in lower yields). Therefore, we propose modeling the yield using the continuous Bernoulli distribution. Details on this distribution are provided in the Supplementary Section 2. The continuous Bernoulli distribution fits the data reasonably well (see Fig. 2a) and, importantly, belongs to the exponential family.
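For intuition, the continuous Bernoulli density on [0, 1] is p(x | λ) = C(λ) λ^x (1 − λ)^(1−x), with normalizing constant C(λ) = 2 tanh−1(1 − 2λ)/(1 − 2λ) for λ ≠ 1/2. The following is a minimal sketch of fitting λ by maximum likelihood; it runs on synthetic right-skewed data in place of the actual yields (which would first be rescaled to [0, 1]):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cb_log_norm_const(lam):
    """Log normalizing constant of the continuous Bernoulli distribution."""
    if np.isclose(lam, 0.5):
        return np.log(2.0)
    return np.log(2.0 * np.arctanh(1.0 - 2.0 * lam) / (1.0 - 2.0 * lam))

def cb_neg_loglik(lam, x):
    """Negative log-likelihood for observations x in [0, 1]."""
    n = len(x)
    return -(n * cb_log_norm_const(lam)
             + np.sum(x) * np.log(lam)
             + np.sum(1.0 - x) * np.log(1.0 - lam))

rng = np.random.default_rng(0)
# synthetic right-skewed "yields" in [0, 1], a stand-in for the real data
x = rng.beta(1.2, 4.0, size=1000)
res = minimize_scalar(cb_neg_loglik, bounds=(1e-6, 1 - 1e-6),
                      args=(x,), method="bounded")
lam_hat = res.x
```

A fitted λ̂ below 1/2 corresponds to a density with more mass near zero, i.e., the right-skewed shape seen in Fig. 2a.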

Fig. 2: Reaction yield and continuous Bernoulli distribution.
figure 2

The histogram of the yield together with the density of the continuous Bernoulli distribution is shown in (a), and the link function g of the continuous Bernoulli distribution is given in (b).

Let us now examine the matrix of descriptors. It is straightforward to verify that the matrix of 120 descriptors, as provided in ref. 23, has rank 39. In the Supplementary Section 3, we demonstrate that by adding three random descriptors for the additives, the resulting matrix with 123 columns would have rank 41 and can be uniquely linearly transformed into a matrix with dummy (or one-hot) coding, which reflects the underlying experimental design. This implies that the matrix of descriptors contains, up to a linear transformation, only information about combinations of reaction components that lead to the observed yield. Hence, the matrix of descriptors can be ignored.
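The rank deficiency can be illustrated on a toy analogue of such a descriptor matrix (hypothetical factor sizes, not the actual 120 descriptors): when every descriptor is a function of a single factor's level, each block of columns lies in the span of that factor's indicator columns, so the rank of the full matrix is capped at the total number of levels minus (number of factors − 1):

```python
from itertools import product

import numpy as np

rng = np.random.default_rng(1)
levels = (3, 2, 2)  # toy factor sizes (the real design has 22, 15, 4, 3)
# more descriptors than levels per factor, one row per factor level
per_level = [rng.normal(size=(L, L + 1)) for L in levels]

# expand to all factor-level combinations, as in a full factorial design
combos = list(product(*[range(L) for L in levels]))
D = np.hstack([per_level[f][[c[f] for c in combos], :]
               for f in range(len(levels))])

# each block spans its factor's indicator columns; the blocks share only
# the constant vector, so the rank is sum(levels) - (#factors - 1)
print(np.linalg.matrix_rank(D))  # 5, although D has 10 columns
```

With four factors of sizes 22, 15, 4, and 3, the same cap gives 44 − 3 = 41, matching the rank observed after adding three random additive descriptors.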

With this, we arrive at a model in which the target variable follows a continuous Bernoulli distribution, while the four categorical variables (additives, halides, bases, and ligands) serve as explanatory features for this target. This can be formulated as a simple statistical model that adequately captures the data-generating process of the Buchwald-Hartwig amination dataset:

$$g\{{\mathbb{E}}({{\rm{yield}}}_{ijkl})\} = \, {\mu }_{0}+{\alpha }_{i}+{\beta }_{j}+{\gamma }_{k}+{\delta }_{l}+{(\alpha \beta )}_{ij}+{(\alpha \gamma )}_{ik}+{(\alpha \delta )}_{il}+{(\beta \gamma )}_{jk} \\ +{(\beta \delta )}_{jl}+{(\gamma \delta )}_{kl}+{(\alpha \beta \gamma )}_{ijk}+{(\alpha \beta \delta )}_{ijl}+{(\alpha \gamma \delta )}_{ikl}\\ +{(\beta \gamma \delta )}_{jkl}+{(\alpha \beta \gamma \delta )}_{ijkl}.$$
(1)

Here \(g:[0,1]\to {\mathbb{R}}\) is the canonical link function of the continuous Bernoulli distribution, which has no closed-form expression but is shown in Fig. 2b (see also Supplementary Section 2). Furthermore, \({\mathbb{E}}\) denotes expectation, g−1(μ0) is the overall mean of the yield, αi, i = 1, …, 22, is the contribution (also known as a treatment effect) of the i-th level of the factor additives, βj, j = 1, …, 15, is the contribution of the j-th level of the factor halides, γk, k = 1, 2, 3, is the contribution of the k-th level of the factor bases, and δl, l = 1, 2, 3, 4, is the contribution of the l-th level of the factor ligands. Further terms denote the corresponding interaction effects. For example, (αβ)ij denotes the interaction effect of the i-th level of the factor additives and the j-th level of the factor halides. The number of parameters in the model (1) is 7 360, and 3 400 constraints are necessary for identifiability. The full list of these constraints is given in the Supplementary Section 5.
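The parameter count can be verified directly: summing the sizes of the intercept, all main-effect blocks, and all interaction blocks of the saturated four-factor model gives 7 360 parameters, of which 3 400 must be constrained because only 3 960 cells are observed:

```python
from itertools import combinations
from math import prod

levels = (22, 15, 4, 3)  # additives, halides, bases, ligands

# unconstrained parameter count of the full four-way interaction model:
# intercept + main effects + all two-, three-, and four-way blocks
n_params = 1 + sum(
    prod(subset)
    for r in range(1, len(levels) + 1)
    for subset in combinations(levels, r)
)
n_cells = prod(levels)               # 3 960 observed combinations
print(n_params, n_params - n_cells)  # 7360 3400
```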

According to the model (1), each parameter describes the effect of the corresponding factor level (e.g., i-th additive) or factor level combination (e.g., i-th additive with j-th halide) on the yield value compared to the baseline μ0. In particular, a positive parameter indicates that the corresponding factor level or factor level combination leads to higher yield values compared to the baseline (interpreted as a stabilizing effect), whereas a negative parameter indicates lower yields relative to the baseline (interpreted as a destabilizing effect).

The interactions between factor levels are particularly valuable from a chemical standpoint. These interactions may reflect cooperative or antagonistic effects on the catalytic cycle. For example, the interaction of the i-th additive with the j-th base may stabilize or destabilize key intermediates or transition states differently. Statistically and chemically, understanding these interactions allows us to identify conditions that favor reaction selectivity or efficiency. We believe that this modeling approach, integrating statistical design principles with chemical knowledge, provides a robust framework not only for predicting experimental outcomes within the studied domain but also for generating chemically meaningful hypotheses about the underlying mechanisms. This, in turn, can guide future experimentation and mechanistic studies more systematically and efficiently.

At this point, it is important to emphasize that categorical features naturally limit what can be learned from the model (1). In particular, one cannot predict yield values for unseen factor levels outside of the experimental design. However, one can predict, or actually fill in (and this is what ref. 23 and follow-up works ultimately do), the missing values within the experimental design. That is, one can perform only a share of the 3 960 experiments, e.g., 70%, and then predict the yield values for the remaining 30% of combinations of factor levels.

Altogether, model (1) is a (generalized) linear model, and its parameters, once properly estimated, can be used to understand how specific reaction components and their interactions contribute to the yield. For example, one can identify reaction conditions that lead to higher or lower yields or conditions that do not affect the yield at all. This contrasts with black-box ML/AI techniques, which are well-suited for prediction within the experimental design but fail to provide an explanation of the system. Naturally, model (1) can also be used to predict the yield within the experimental design. In the Supplementary Section 7, we compare such predictions with those obtained using several of the most successful ML algorithms, demonstrating that the difference in performance is negligible. We conclude that ML/AI methods cannot bring any advantages over simple linear models if the features are categorical variables.

Estimation results

It turns out that estimating model (1) requires non-standard approaches. In the Methods section, we explain the inherent problems of the model and describe the algorithm we developed.

In the following, we present and analyze the estimated parameters of model (1). The intercept term μ0 in the model (1), which serves as a baseline and represents the transformed overall mean of the yield, was estimated to be  − 6.9. Applying the inverse link function, g−1( − 6.9) = 0.14, gives an estimate of the expected yield, indicating that the mean yield in this HTE is notably low. The first column of Table 1 reports the estimated main effects. More precisely, the values for the 22 additives given in the first column of Table 1 are estimates of the parameters αi, i = 1, …, 22, in equation (1). In the same way, the 15 halide values are estimates of βj, j = 1, …, 15, the 3 base values are estimates of γk, k = 1, 2, 3, and the 4 ligand values are estimates of δl, l = 1, …, 4. Fig. 3 provides a complementary visual representation of the estimated main effects, linking the factor level labels to the corresponding chemical structures.
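This back-transformation can be checked numerically. Assuming the standard natural-parameter form of the continuous Bernoulli, its mean function (the inverse canonical link) is g−1(η) = 1/(1 − e−η) − 1/η, and g itself, which has no closed form, can be obtained by root-finding:

```python
import numpy as np
from scipy.optimize import brentq

def g_inv(eta):
    """Mean of the continuous Bernoulli with natural parameter eta,
    i.e., the inverse canonical link; eta -> 0 gives the limit 1/2."""
    if abs(eta) < 1e-8:
        return 0.5
    return 1.0 / (1.0 - np.exp(-eta)) - 1.0 / eta

def g(mu):
    """Canonical link g: numerical inverse of g_inv on (0, 1)."""
    return brentq(lambda eta: g_inv(eta) - mu, -50.0, 50.0)

print(round(g_inv(-6.9), 2))  # 0.14, the estimated mean yield
```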

Fig. 3: Estimated main effects of reaction components for the Buchwald-Hartwig amination.
figure 3

Shown are the estimated coefficients for all levels of additives (A1–A22), halides (H1–H15), bases (B1–B3), and ligands (L1–L4) corresponding to the model (1). The x-axis indicates the magnitude of the main effects (dimensionless), where positive values enhance and negative values reduce yield relative to baseline. Panel colors match the factor coding of Fig. 1 and serve to distinguish component types.

Table 1 Estimated main effects and the 44 largest (in absolute value) two-way interactions

The second column of Table 1 lists the 44 largest (in absolute value) two-way interaction effects. For example, the interaction of additive 9 with halide 1 is an estimate of the parameter (αβ)91. Notably, all four-way interaction terms were estimated to be exactly zero, and only a few three-way interactions were found to be non-zero. These remaining three-way interactions are relatively minor in magnitude, especially compared to the main effects and the dominant two-way interactions, suggesting that higher-order interactions play a limited role in this setting. A complete table of all estimated coefficients is available through the link provided in the Supplementary Section 1. In total, 415 parameters were estimated to be non-zero out of all 7 360 parameters of the model (1), reflecting a high degree of sparsity in the fitted model. This sparsity substantially simplifies the interpretation of the estimated model.

The estimation results can also be rationalized from a chemical viewpoint. First, it is chemically intuitive that halides exert the most significant influence on the yield, as the nature of the leaving group (Cl, Br, I) directly affects the oxidative addition step44,45,46, which often involves the rate-determining states (turnover-determining intermediate TDI and turnover-determining transition state TDTS) in Buchwald-Hartwig amination47,48. A better leaving group typically lowers the activation energy, thus significantly enhancing reaction efficiency. However, it is worth noting that, in some cases, iodides have been shown to be less effective counterparts in Pd-catalyzed C-N coupling reactions than other halides due to side reactions. This observation further highlights the intrinsic complexity of the catalytic process under consideration49.

The second most influential factor, additives, often plays subtle but critical roles in catalytic reactions. Chemically, the additives considered in this study (mimicking the behavior of complex substrates) can, among others, stabilize catalytic intermediates or transition states, influence the solubility or aggregation of catalyst complexes, or alter the coordination environment around the palladium center50. However, the modest size of the additive effects observed here (ranging around  ± 9 units) aligns with typical scenarios where additives fine-tune rather than fundamentally alter reaction mechanisms51,52,53,54.

Bases and ligands generally have smaller but still meaningful influences, as they primarily facilitate crucial mechanistic steps like deprotonation and oxidative addition/reductive elimination35,55. Thus, these components might be expected to show moderately sized effects, which is precisely reflected in the statistical results obtained. Importantly, the observed strong interactions between halides and bases are highly chemically meaningful. Bases are directly involved in deprotonation steps that become more or less relevant depending on the reactivity of the palladium-halide intermediate47. Hence, the pronounced halide-base interactions are mechanistically justified: the nature of the halide significantly impacts the effectiveness of a base in promoting essential catalytic steps.

From Table 1, one can see another interesting fact: the factor-level interactions of the same halide 1 with two different additives can be dramatically different. Indeed, it is highly negative ( − 32.40) in the case of additive 9, while it is, in contrast, highly positive ( + 27.91) with additive 22. Can this be chemically interpreted? This strongly suggests that these two additives chemically interact in fundamentally different ways with the halide-containing intermediate or a transition state. In Buchwald-Hartwig reactions, the oxidative addition step, where the Pd catalyst inserts into the carbon-halogen bond, is highly sensitive to both electronic and steric effects. Additives can significantly alter this step or subsequent catalytic intermediates. For instance, the highly positive interaction ( + 27.91) indicates that this additive 22 likely stabilizes a crucial catalytic intermediate (such as the Pd(II)-aryl intermediate) formed specifically from halide 1. It may do so through beneficial electronic interactions (electron donation or stabilization through coordination), which enhance reactivity, reduce activation barriers, and thus significantly improve yield. On the other hand, the highly negative interaction ( − 32.40, additive 9) suggests a strong destabilization of the catalytic intermediate or transition state, specifically associated with the same halide 1. This additive might compete with catalytic intermediates for coordination to palladium or otherwise sterically or electronically interfere with essential steps in the catalytic cycle, drastically lowering reaction efficiency. Therefore, the stark contrast between these two interaction values emphasizes how subtle chemical differences between additives can drastically change their roles from strongly beneficial (enhancing catalytic efficiency) to detrimental (interfering with catalytic processes), particularly when combined with specific halides. 
Such chemical insights highlight the practical importance of carefully choosing additive-halide combinations, as minor structural differences in additives can lead to dramatically different outcomes in reaction efficiency.

Finally, the largest (in absolute value) additive-halide interaction coefficients align with a simple mechanistic picture. Additive 22 is strongly positive with halide 1 ( + 27.91), consistent with a coordinating donor facilitating oxidative addition for an electronically demanding aryl chloride, but negative with halide 2 ( − 9.71) and with the reactive halides 10 and 13 ( − 6.24,  − 2.20), in line with over-stabilization or coordination-site competition when additional assistance is unnecessary. In contrast, additive 9 suppresses halide 1 ( − 32.40) yet is modestly positive with the reactive halides 10 and 13 ( + 2.91,  + 2.70). Thus, electron-rich, coordinating additives aid difficult, electron-rich chlorides but can impede already-reactive substrates; weaker or non-coordinating additives show the opposite tendency, providing a general chemical rationale for the observed interaction signs and magnitudes.

Our proposed method is scalable and applicable to a broad range of datasets generated in HTE with target variables from the exponential family of distributions and categorical explanatory features. In the Supplementary Section 1, we provide a link to the analysis of two additional datasets with three and five categorical features, as well as a Python package that enables the analysis of similar datasets.

Methods

In this section, we detail the parameter estimation problem in model (1), which turns out to be a very specific type of generalized linear model. The right-hand side of the model corresponds to the structures used in the analysis of variance (ANOVA) with four factors, where only a single observation is available for each combination of factor levels. However, unlike classical ANOVA models, the response variable in (1) does not follow a normal distribution; instead, it follows a continuous Bernoulli distribution. Model (1) can be rewritten in a matrix form

$$g\{{\mathbb{E}}({Y}_{i})\}={{{\boldsymbol{Z}}}}_{i}{{\boldsymbol{\theta }}},$$
(2)

where Yi is the i-th entry of vector \({{\boldsymbol{Y}}}={({Y}_{1},\ldots ,{Y}_{n})}^{T}\), n = 3 960, which contains all yield values, Zi is the i-th row of a dummy coded matrix \({{\boldsymbol{Z}}}\in {{\mathbb{R}}}^{n\times n}\), corresponding to a four-factor ANOVA experimental design with single replicates (see Supplementary Section 3 for the details on the construction of Z), and \({{\boldsymbol{\theta }}}={({\mu }_{0},{\alpha }_{1},\ldots ,{(\alpha \beta \gamma \delta )}_{21,14,2,3})}^{T}\in {{\mathbb{R}}}^{n}\) is the vector of unknown parameters. If Z were a regular matrix satisfying all the assumptions of classical (generalized) linear models, one could proceed with the estimation of θ using iteratively re-weighted least squares and identify the factor levels and their interactions that are particularly influential for the yield in Buchwald-Hartwig amination.

Unfortunately, the matrix Z is not regular. Due to the availability of only a single observation per factor level combination, there are as many parameters as observations. Since it is reasonable to assume that a large portion of the parameters are zero, one might be tempted to apply a classical Lasso algorithm with a response from the exponential family56. However, it can be easily observed that matrix Z does not satisfy the assumptions of the classical Lasso algorithm. In particular, Z has only 16 distinct singular values (see Supplementary Section 4 for proof) and is ill-conditioned, with a condition number of 62.9. As a result, the Lasso estimator would fail to correctly identify the parameter θ, instead setting entries of θ to zero randomly, as discussed in refs. 57 and 58.

The ill-conditioning of Z arises from the fact that model (1) is a four-way interaction model: it includes both the main effects of factor levels and their interactions, leading to high dependency among the columns of Z. To impose model identifiability and enhance interpretability, it is reasonable to assume that an interaction effect should only be included in the model if the corresponding main effects are non-zero. Such constraints are not novel and are referred to as marginality or hierarchically well-formulated in (generalized) linear models59,60, or as heredity constraints in designed experiments61. For example, in a model with two-way interactions, the condition that either αi = 0 or βj = 0 implies (αβ)ij = 0 is known as the strong heredity condition. Some statisticians argue that interaction models violating strong heredity are nonsensical. For a more detailed discussion of the statistical reasoning behind heredity conditions, see ref. 62.

Statistically, imposing the heredity conditions makes model (1) more parsimonious and easier to interpret. From a chemical perspective, strong heredity has a clear mechanistic interpretation, particularly in the context of catalytic reactions like Buchwald-Hartwig amination. In chemical terms, interactions between factor levels, such as a specific base and an additive, typically represent cooperative molecular effects. For instance, one component may stabilize or activate an intermediate produced by another, significantly altering the reaction energetics and thus influencing the yield. However, each component must exhibit some intrinsic chemical effect individually for such meaningful chemical interactions to occur. If an additive or base alone has no measurable impact on the reaction outcome, it suggests that the component neither interacts with reaction intermediates nor significantly influences the catalytic pathway. Without individual chemical relevance, it becomes mechanistically implausible that combining two chemically inert components would suddenly produce a notable combined effect. Energetically, reaction outcomes are determined by the stabilization or destabilization of intermediates and transition states. If neither component modifies the reaction’s energy profile individually, there is no plausible molecular mechanism by which their combined presence could dramatically alter the reaction energetics or pathway. Therefore, imposing a strong heredity condition in statistical modeling, where interaction effects are allowed only if the main effects are present, aligns naturally with chemical realism. This ensures that statistically identified interactions correspond to chemically meaningful scenarios, thereby enhancing the interpretability and reliability of the model.

It turns out that all existing algorithms for parameter estimation in interaction models under (strong) heredity conditions have been developed and implemented specifically for two-way interactions with continuous descriptors and a normal response variable. The main contribution of this work is the extension of the algorithm from ref. 63 to handle four-way interactions in ANOVA models with a single observation per factor level combination, where the response variable follows a continuous Bernoulli distribution.

In the following, we give the main ideas of our estimation algorithm; more details are provided in the Supplementary Section 6. As previously discussed, model (1) must be estimated under strong heredity conditions to ensure parameter identifiability. Given that our model includes interactions up to four-factor levels, the following strong heredity conditions should be imposed:

  (i) αi = 0 or βj = 0 implies (αβ)ij = 0;

  (ii) (αβ)ij = 0 or (αγ)ik = 0 or (βγ)jk = 0 implies (αβγ)ijk = 0;

  (iii) (αβγ)ijk = 0 or (αβδ)ijl = 0 or (αγδ)ikl = 0 or (βγδ)jkl = 0 implies (αβγδ)ijkl = 0.
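Conditions (i)-(iii) can be read as a pruning rule on fitted coefficients. Below is a small illustration with a hypothetical dictionary layout (sorted tuples of (factor, level) pairs as keys; these are not the data structures of our actual implementation):

```python
from itertools import combinations

def enforce_strong_heredity(main, two, three):
    """Drop interactions whose parent effects are zero or absent.
    Keys are sorted tuples of (factor, level) pairs."""
    two = {k: v for k, v in two.items()
           if all(main.get(p, 0.0) != 0.0 for p in k)}             # condition (i)
    three = {k: v for k, v in three.items()
             if all(pair in two for pair in combinations(k, 2))}   # condition (ii)
    return two, three

main = {('A', 1): 0.8, ('B', 1): -0.5, ('B', 2): 0.0, ('G', 1): 0.3}
two = {(('A', 1), ('B', 1)): 1.2,
       (('A', 1), ('B', 2)): 0.7,   # violates (i): main effect of ('B', 2) is zero
       (('A', 1), ('G', 1)): -0.2,
       (('B', 1), ('G', 1)): 0.1}
three = {(('A', 1), ('B', 1), ('G', 1)): 0.4}  # kept: all parent pairs survive
two, three = enforce_strong_heredity(main, two, three)
```

Condition (iii) extends the same pattern with `combinations(k, 3)` over the surviving three-way terms.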

To fix ideas, let us adopt the matrix form in equation (2), which provides a more convenient representation than the model in equation (1). Additionally, we explicitly highlight the hierarchical structure of the matrix Z via the representation

$$\begin{array}{rcl}g\{{\mathbb{E}}({Y}_{i})\} & = & {\eta }_{i}={\mu }_{0}+{\sum }_{j=2}^{41}{\xi }_{j}{Z}_{i,j}+{\sum }_{f(j) < f(k)}{\Theta }_{jk}{Z}_{i,j}{Z}_{i,k}+{\sum }_{f(j) < f(k) < f(l)}{\Psi }_{jkl}{Z}_{i,j}{Z}_{i,k}{Z}_{i,l}\\ & & +{\sum }_{f(j) < f(k) < f(l) < f(m)}{\Phi }_{jklm}{Z}_{i,j}{Z}_{i,k}{Z}_{i,l}{Z}_{i,m},\,j,k,l,m=2,\ldots ,41,\end{array}$$

where Zi,j denotes the element in the i-th row and j-th column of Z. Each column Zj, j = 2, …, 41, corresponds to a factor level of the main effects; that is, Z2, …, Z22 are columns corresponding to levels of the factor additives, Z23, …, Z36 capture levels of the factor halides, Z37, Z38 belong to the factor bases, and Z39, Z40, Z41 to the factor ligands. Further columns of Z are built as products of all levels of these four factors, such that only interactions between main effects from different factors are included in the model (no interactions between different levels of the same factor are taken). For example, Z42 = Z2Z23, and so on. To exclude products of columns from the same factor, the notation f(j) ∈ {1, 2, 3, 4} is introduced, where f(j) = 1 for j = 2, …, 22 (factor additives), f(j) = 2 for j = 23, …, 36 (factor halides), f(j) = 3 for j = 37, 38 (factor bases), and f(j) = 4 for j = 39, 40, 41 (factor ligands). Altogether, ξ describes the contribution of the main effects, while the two-way, three-way, and four-way factor level interactions are captured by Θ, Ψ, and Φ, respectively.
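The column-to-factor map f described above is straightforward to write down explicitly:

```python
def f(j):
    """Factor index of main-effect column j of Z (j = 2, ..., 41)."""
    if 2 <= j <= 22:
        return 1   # additives
    if 23 <= j <= 36:
        return 2   # halides
    if 37 <= j <= 38:
        return 3   # bases
    if 39 <= j <= 41:
        return 4   # ligands
    raise ValueError(f"column {j} is not a main-effect column")

print([f(j) for j in (2, 22, 23, 36, 37, 38, 39, 41)])  # [1, 1, 2, 2, 3, 3, 4, 4]
```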

Imposing the strong heredity conditions now translates into the following constraints

$${\Theta }_{jk}={\rho }_{jk}{\xi }_{j}{\xi }_{k},\,{\Psi }_{jkl}={\zeta }_{jkl}{\Theta }_{jk}{\Theta }_{jl}{\Theta }_{kl},\,{\Phi }_{jklm}={\tau }_{jklm}{\Psi }_{jkl}{\Psi }_{jkm}{\Psi }_{jlm}{\Psi }_{klm},$$

where ρjk, ζjkl, and τjklm are proportionality constants corresponding to the interaction of main effects j and k, the interaction of main effects j, k, and l, and the interaction of main effects j, k, l, and m, respectively.
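The reparameterization above can be written directly in code. The following sketch (dictionary-keyed helper functions, an illustration rather than the paper's implementation) shows how each higher-order coefficient inherits zeros from its parents, so the strong heredity conditions hold automatically.

```python
def theta(xi, rho, j, k):
    # Θ_jk = ρ_jk ξ_j ξ_k : zero whenever a parent main effect is zero
    return rho[(j, k)] * xi[j] * xi[k]

def psi(xi, rho, zeta, j, k, l):
    # Ψ_jkl = ζ_jkl Θ_jk Θ_jl Θ_kl : zero whenever a parent two-way term is zero
    return (zeta[(j, k, l)] * theta(xi, rho, j, k)
            * theta(xi, rho, j, l) * theta(xi, rho, k, l))

def phi(xi, rho, zeta, tau, j, k, l, m):
    # Φ_jklm = τ_jklm Ψ_jkl Ψ_jkm Ψ_jlm Ψ_klm
    return (tau[(j, k, l, m)] * psi(xi, rho, zeta, j, k, l)
            * psi(xi, rho, zeta, j, k, m) * psi(xi, rho, zeta, j, l, m)
            * psi(xi, rho, zeta, k, l, m))
```
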

Consequently, in order to find a solution that satisfies the strong heredity conditions, the following penalized negative log-likelihood should be minimized:

$$Q({\mu }_{0},{{\boldsymbol{\xi }}},{{\boldsymbol{\rho }}},{{\boldsymbol{\zeta }}},{{\boldsymbol{\tau }}})=-\frac{1}{n}{\sum }_{i=1}^{n}\ell ({\eta }_{i},{y}_{i})+{\lambda }_{\xi }\parallel {{\boldsymbol{\xi }}}{\parallel }_{1}+{\lambda }_{\rho }\parallel {{\boldsymbol{\rho }}}{\parallel }_{1}+{\lambda }_{\zeta }\parallel {{\boldsymbol{\zeta }}}{\parallel }_{1}+{\lambda }_{\tau }\parallel {{\boldsymbol{\tau }}}{\parallel }_{1},$$
(3)

where λξ, λρ, λζ and λτ are four tuning parameters that control the amount of regularization and ℓ(ηi, yi) is the log-likelihood of the i-th sample, given in Supplementary Section 6.2.
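A minimal sketch of the objective in equation (3), assuming a Gaussian working log-likelihood for ℓ (the paper's actual likelihood is specified in its Supplementary Section 6.2) and taking the linear predictor η as already built from the current parameters:

```python
import numpy as np

def Q(eta, y, xi, rho, zeta, tau, lam):
    """Penalized negative log-likelihood of equation (3).
    A Gaussian working likelihood is assumed here for illustration."""
    # -(1/n) Σ ℓ(η_i, y_i), up to constants, in the Gaussian case
    nll = np.mean((y - eta) ** 2) / 2.0
    # four L1 penalties with separate tuning parameters
    penalty = (lam["xi"] * np.abs(xi).sum() + lam["rho"] * np.abs(rho).sum()
               + lam["zeta"] * np.abs(zeta).sum() + lam["tau"] * np.abs(tau).sum())
    return nll + penalty
```
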

Algorithm 1

Estimation under strong heredity conditions

Input: Z, Y, \({\widehat{\mu }}_{0}^{(0)}\), \({\widehat{{{\boldsymbol{\xi }}}}}^{(0)}\), \({\widehat{{{\boldsymbol{\rho }}}}}^{(0)}\), \({\widehat{{{\boldsymbol{\zeta }}}}}^{(0)}\), \({\widehat{{{\boldsymbol{\tau }}}}}^{(0)}\), T, M

Output: \({\widehat{\mu }}_{0}\), \(\widehat{{{\boldsymbol{\xi }}}}\), \(\widehat{{{\boldsymbol{\rho }}}}\), \(\widehat{{{\boldsymbol{\zeta }}}}\), \(\widehat{{{\boldsymbol{\tau }}}}\)

it ← 0;

\({\widehat{\mu }}_{0}\leftarrow {\widehat{\mu }}_{0}^{(0)}\), \(\widehat{{{\boldsymbol{\xi }}}}\leftarrow {\widehat{{{\boldsymbol{\xi }}}}}^{(0)}\), \(\widehat{{{\boldsymbol{\rho }}}}\leftarrow {\widehat{{{\boldsymbol{\rho }}}}}^{(0)}\), \(\widehat{{{\boldsymbol{\zeta }}}}\leftarrow {\widehat{{{\boldsymbol{\zeta }}}}}^{(0)}\), \(\widehat{{{\boldsymbol{\tau }}}}\leftarrow {\widehat{{{\boldsymbol{\tau }}}}}^{(0)}\);

while it < M do

\({Q}_{old}\leftarrow Q({\widehat{\mu }}_{0},\widehat{{{\boldsymbol{\xi }}}},\widehat{{{\boldsymbol{\rho }}}},\widehat{{{\boldsymbol{\zeta }}}},\widehat{{{\boldsymbol{\tau }}}})\);

it ← it + 1;

Update \({\widehat{\mu }}_{0}\): \({\widehat{\mu }}_{0}\leftarrow \arg {\min }_{{\mu }_{0}}Q({\mu }_{0},\widehat{{{\boldsymbol{\xi }}}},\widehat{{{\boldsymbol{\rho }}}},\widehat{{{\boldsymbol{\zeta }}}},\widehat{{{\boldsymbol{\tau }}}})\);

Update each component of \(\widehat{{{\boldsymbol{\xi }}}}\): \({\widehat{\xi }}_{j}\leftarrow \arg \mathop{\min }_{{\xi }_{j}}Q({\widehat{\mu }}_{0},{\xi }_{j},{\widehat{{{\boldsymbol{\xi }}}}}_{-j},\widehat{{{\boldsymbol{\rho }}}},\widehat{{{\boldsymbol{\zeta }}}},\widehat{{{\boldsymbol{\tau }}}})\);

Update each component of \(\widehat{{{\boldsymbol{\rho }}}}\): \({\widehat{\rho }}_{jk}\leftarrow \arg \mathop{\min }_{{\rho }_{jk}}Q({\widehat{\mu }}_{0},\widehat{{{\boldsymbol{\xi }}}},{\rho }_{jk},{\widehat{{{\boldsymbol{\rho }}}}}_{-jk},\widehat{{{\boldsymbol{\zeta }}}},\widehat{{{\boldsymbol{\tau }}}})\);

Update each component of \(\widehat{{{\boldsymbol{\zeta }}}}\): \({\widehat{\zeta }}_{jkl}\leftarrow \arg \mathop{\min }_{{\zeta }_{jkl}}Q({\widehat{\mu }}_{0},\widehat{{{\boldsymbol{\xi }}}},\widehat{{{\boldsymbol{\rho }}}},{\zeta }_{jkl},{\widehat{{{\boldsymbol{\zeta }}}}}_{-jkl},\widehat{{{\boldsymbol{\tau }}}})\);

Update \(\widehat{{{\boldsymbol{\tau }}}}\): \(\widehat{{{\boldsymbol{\tau }}}}\leftarrow \arg \mathop{\min }_{{{\boldsymbol{\tau }}}}Q({\widehat{\mu }}_{0},\widehat{{{\boldsymbol{\xi }}}},\widehat{{{\boldsymbol{\rho }}}},\widehat{{{\boldsymbol{\zeta }}}},{{\boldsymbol{\tau }}})\);

\({Q}_{new}\leftarrow Q({\widehat{\mu }}_{0},\widehat{{{\boldsymbol{\xi }}}},\widehat{{{\boldsymbol{\rho }}}},\widehat{{{\boldsymbol{\zeta }}}},\widehat{{{\boldsymbol{\tau }}}})\);

if \(| {Q}_{old}-{Q}_{new}| \le T\cdot | {Q}_{old}|\) then

  return \({\widehat{\mu }}_{0}\), \(\widehat{{{\boldsymbol{\xi }}}}\), \(\widehat{{{\boldsymbol{\rho }}}}\), \(\widehat{{{\boldsymbol{\zeta }}}}\), \(\widehat{{{\boldsymbol{\tau }}}}\);

else

  Continue;

end if

end while

The minimization problem in equation (3) does not have a closed-form solution but can be solved using a coordinate descent approach. The algorithm takes as input the data matrix Z, the response vector Y, a convergence tolerance level T, a maximum number of iterations M, and sensible initializations for the model parameters: μ0, ξ, ρ, ζ, and τ.

Initializing with estimated main effects and setting all other parameters to zero has been found to be both robust and simple; further details can be found in Supplementary Section 6. The main steps of the algorithm are outlined in Algorithm 1. In the algorithm description, v−j denotes the vector v without the j-th main-effect component, v−jk denotes v without the component corresponding to the interaction between the j-th and k-th main effects, and v−jkl denotes v without the component of the three-way interaction between the j-th, k-th and l-th main effects.
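The outer loop of Algorithm 1 can be sketched generically as follows. Here `update_block` is a caller-supplied stand-in for the componentwise minimizations (a hypothetical interface, not the paper's code); the loop cycles through the parameter blocks and stops once the relative decrease of Q falls below the tolerance T or M iterations are reached.

```python
def coordinate_descent(Q_fn, update_block, params, T=1e-6, M=100):
    """Skeleton of the coordinate descent scheme of Algorithm 1.
    Q_fn maps a parameter dict to the objective value; update_block
    performs the arg-min over one parameter block, holding all others fixed."""
    for _ in range(M):
        Q_old = Q_fn(params)
        for name in ("mu0", "xi", "rho", "zeta", "tau"):
            params = update_block(Q_fn, params, name)  # blockwise arg-min
        Q_new = Q_fn(params)
        # relative-change stopping rule: |Q_old - Q_new| <= T * |Q_old|
        if abs(Q_old - Q_new) <= T * abs(Q_old):
            break
    return params
```

On a toy separable objective, where each blockwise minimizer is available in closed form, the loop converges in two sweeps.
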