Introduction

As a pivotal preparative chromatographic technique in chemistry, column chromatography (CC) is productive in several qualitative and quantitative aspects, including the analysis, separation, and purification of substances1. It spans a wide spectrum of applications, including medicine, the chemical industry, and biochemistry2. Numerous synthetic laboratories worldwide conduct a staggering volume of CC separations daily to purify synthesized compounds and isolate bioactive compounds from natural products3,4. However, the effectiveness of CC depends on a number of critical factors, such as the choice of mobile phases, impurities in the mixtures and compounds of interest, and column specifications, which are currently determined by the experimenter’s experience. In general, prior to CC, researchers often perform a thin-layer chromatography (TLC) analysis to determine what conditions should be used in CC.

As a pilot measure, the retardation factor (RF value) obtained from TLC evaluates the relative polarity of the components in the mixture as compared to the mobile phases, which dominantly affects the separation efficiency achieved in CC. During actual operations, the proportion of mobile phases is usually tailored to maintain the RF value of the compound of interest within the range of 0.2–0.3 (Fig. 1b). Although empirical, this insight has successfully reduced the requirement for repetitive trials, thereby enhancing separation efficiency. Consequently, it has gained widespread acceptance among the global chemistry community as a reliable method for determining the optimal separation conditions in CC. However, the rationale behind chemists’ experiential methods has not been established and analysed. This leads to the phenomenon of “to know what, but not why”, which impedes a deeper understanding of the chemical essence. This phenomenon essentially stems from the challenge of cross-scale modeling in chromatography.

Fig. 1: Overview of the framework.

a Comparison of the human experience and artificial intelligence (AI) experience, exemplified by the task of separating mixtures. b Chemical expertise concerning the determination of the separation conditions in column chromatography (CC) through the retardation factor (RF) values of thin-layer chromatography (TLC) when separating mixtures. c The data-driven rationale for the chemists’ experience via statistics and interpretable machine learning through the experimental dataset acquired from the automated platform. The numbers in the figure are from the statistics of the dataset.

As illustrated in Fig. S1, the dynamics of chromatography involve multiple scales ranging from microscopic to macroscopic5. For each scale, physicochemical mechanistic models have been established through explicit mathematical equations6,7,8,9, which accurately describe the chromatographic process. Nevertheless, the prediction of the chromatographic results, such as the RF values and retention times, remains a challenge for mechanistic models because of the elusive coupling relationships between mechanisms at different scales. Therefore, when determining the separation conditions for chromatography, chemists rely more on expertise than on mechanistic models. The empirical insights can reflect the coupling relationships between the TLC and CC to some extent. However, the separation results of CC are influenced by a multitude of factors, including the choice of mobile phase, column type, and various other experimental settings. Consequently, the value of chemical expertise is limited, which may lead to the failure of CC separation within sophisticated contexts. Moreover, chemical expertise is difficult to directly replicate, which requires repeated attempts under human guidance for acquisition. This underscores the paramount importance of transforming empirical insights into formalized knowledge, with the fundamental distinction lying in the provision of a clear rationale. In other words, it is imperative to elucidate, through quantitative methods, how the experimental conditions and results of TLC experiments (i.e., RF values) influence the outcomes of CC, thereby offering a rationale for chemists to determine the separation conditions.

Statistical and machine learning techniques have been widely adopted in chemistry since they are able to address complex relationships in high-dimensional space10. Although various quantitative structure‒retention relationship (QSRR) models have been established to predict outcomes for a range of chromatographic techniques, including TLC, CC, and high-performance liquid chromatography (HPLC)11,12,13,14,15,16, they often operate as “black boxes”, in which the learned relationship cannot be expressed by explicit concise equations17. This opacity limits their ability to provide interpretability, a crucial aspect for generating fresh chemical insights. Meanwhile, data issues are at the forefront of the challenge. Acquiring a profound understanding of the relationship between TLC and CC necessitates a copious amount of concordant experimental data. Fortunately, in our previous study, an automated TLC platform was established18, and a standardized TLC dataset was proposed for constructing a deep learning model to predict RF values19. Nevertheless, for preparative chromatography techniques such as CC, which are time intensive (ranging from tens of minutes to several hours per experiment), it is impractical to gather an extensive dataset manually.

In this work, we embrace a data-centered viewpoint and discern patterns directly from extensive experimental data, as a chemist would. In other words, we take a step further to discover the coupling relationship between TLC and CC from experimental data and express it in a concise, equation-like form. This kind of equation is termed “artificial intelligence (AI) experience” to differentiate it from the empirical formulas summarized by humans. This method elucidates the relationship between TLC and CC explicitly, thereby offering a rationale for chemists to better comprehend the determination of the separation conditions. Compared with human expertise, the “AI experience” aligns more closely with the experimental data and can be replicated quickly without the human learning process, which enhances its practicality.

An automatic platform is constructed to systematically measure the retention volume of 192 compounds under various experimental conditions, resulting in a comprehensive CC dataset of 5984 data points (Fig. 1c). The compounds are chosen according to their diversity and accessibility and include various types, such as ketone, aldehyde, and phenol. Notably, to investigate the relationship between the TLC and CC, the stationary phases in the two modes are maintained to be identical, and the environmental and operational influences are controlled through automated experiments, which are detailed in the Methods section. Given the inconsistency between the existing TLC dataset and the generated CC datasets, surrogate models have been trained to produce consistent model predictions, thereby efficiently bridging the disparity between the two datasets.

With the experimental dataset, a knowledge discovery technique is proposed to discover an explicit “AI experience” from the machine learning models to describe the relationship between TLC and CC, which aids in explaining how chemists determine separation conditions (Fig. 1c). Through the discovered explicit statistical “AI experience”, we can directly estimate the potential intervals of retention volumes for each component in the mixture during CC through rapid TLC experiments and utilized mobile phases. This estimation allows for an assessment of the potential for successful separation under the given conditions, effectively eliminating the need for multiple time-consuming trials of CC. The proposed framework enables the generation of a dependable “AI experience” in the future, thereby enhancing our exploration of scientific inquiries.

Results

Column chromatography dataset from the automatic platform

Generating the “AI experience” that elucidates the correlation between the TLC and CC is fundamentally challenging because of the necessity of acquiring sufficient experimental data. In prior research19, an automated high-throughput TLC platform was established, enabling the measurement of RF values for 387 compounds across various mobile phases. In this work, we focus on constructing a comprehensive CC dataset. However, given the substantial time and solvent consumption involved in CC processes, manually collecting abundant data is infeasible.

To address this challenge, two strategic approaches are employed. First, an automated platform for CC is developed, which integrates a suite of instrumentation to automate the whole process, including sample loading, mobile phase preparation, chromatographic separation, absorbance detection, and result analysis. The design of the automated CC platform is depicted in Fig. S2, and more details are provided in the Methods section. After an experiment is completed, the absorbance is read and processed by a computer, the peak is identified through the change in absorbance, and the start and end times are obtained from the peak. The identification algorithm is provided in the open source code. The retention volumes are calculated on the basis of the retention time and flow rate. All the processes are controlled via a laptop. Notably, two retention volumes (VS and VE) are recorded. VS represents the volume of the mobile phase when the compound is first detected, whereas VE signifies the volume of the mobile phase when the compound has completely eluted from the column. Consequently, \({V}_{E}-{V}_{S}\) corresponds to the volume of the separated compound solution. This automation significantly reduces the dependency on manual labor, enables continuous data collection day and night, and thereby enhances experimental efficiency. Furthermore, it minimizes human-induced variability, thereby improving the consistency and accuracy of the collected data.
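The conversion from an absorbance trace to the two retention volumes can be sketched as follows. This is a minimal illustration, not the identification algorithm from the authors' open source code: it assumes a single peak, a fixed absorbance threshold, and a constant flow rate, and the function name `retention_volumes` and its parameters are hypothetical.

```python
from typing import List, Optional, Tuple


def retention_volumes(
    times_min: List[float],
    absorbance: List[float],
    flow_rate_ml_min: float,
    threshold: float,
) -> Optional[Tuple[float, float]]:
    """Return (V_S, V_E), or None if the trace never exceeds the threshold.

    Assumption: the peak is the span between the first and the last sample
    whose absorbance is above `threshold` (single-peak trace).
    """
    above = [t for t, a in zip(times_min, absorbance) if a > threshold]
    if not above:
        return None
    t_start, t_end = above[0], above[-1]
    # Retention volume = retention time x flow rate.
    return t_start * flow_rate_ml_min, t_end * flow_rate_ml_min


if __name__ == "__main__":
    times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    trace = [0.0, 0.0, 0.5, 0.9, 0.4, 0.0]
    print(retention_volumes(times, trace, flow_rate_ml_min=10.0, threshold=0.1))
```

A real trace would likely need baseline correction and noise handling before thresholding; those steps are omitted here.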

Nevertheless, commonly used preparative chromatography columns (e.g., 25 g and 40 g columns) have relatively large column lengths and internal diameters, and contain a relatively large amount of packing material. Consequently, they typically require tens of minutes to several hours to complete an experiment. Therefore, even with the adoption of automated platforms, the costs associated with the solvents and time remain substantial when data are collected on these columns. To mitigate this issue, we adopted a strategy that generates “AI experience” from the data acquired through 4 g preparative columns, and then extrapolates it to other column specifications, including tandem 4 g column and 25 g column. Although infrequently utilized, the 4 g column allows for controlled time and solvent volumes in the experiments, which facilitates the acquisition of a substantial amount of experimental data. This strategy is grounded in the fact that the fundamental principles of CC remain the same across divergent column specifications. Therefore, a total of 5984 data points were collected for 192 compounds under a variety of experimental conditions (Fig. 1), including the proportions of the mobile phase (Fig. S3), sample mass, and column specifications. As illustrated in Fig. S4a, within the dataset, the majority of the data are generated via 4 g columns (4999 data), whereas a small amount of data is obtained from two common column specifications, including the tandem 4 g column (457 data) and the 25 g column (528 data). Notably, data acquisition on these larger columns is markedly more time-consuming, thereby limiting the dataset size. For each data point, the retention volumes of the starting and ending points (VS and VE) were automatically identified from their respective absorbance curves, the distributions of which are illustrated in Fig. S4b.

Since the datasets for TLC and CC were independently acquired in distinct experiments, a simple one-to-one mapping between them cannot be established. A comparison of the TLC and CC datasets revealed that there were intersections, including 60 compounds present in both datasets (Fig. S4c). However, for the remaining compounds, the RF values from TLC and the retention volumes from CC under identical eluent ratios are not simultaneously present. This lack of synchronicity between the two datasets poses a significant challenge in deciphering the interrelationship between TLC and CC.

The construction of surrogate models and alignment of asynchronous datasets

To overcome the disconnect between the independently acquired TLC and CC datasets, our methodology entailed the utilization of surrogate models for the generation of model predictions. This facilitated the extrapolation of unobserved values within the datasets, thus establishing a comprehensive and cohesive relationship between the two datasets.

Surrogate models, which function as black-box predictive models, can effectively substitute for the original datasets. Here, two surrogate models were trained, one on the TLC dataset and one on the CC dataset. Each surrogate model was constructed as a five-layer fully connected neural network comprising an input layer, three hidden layers, and an output layer, with each hidden layer composed of 256 neurons. The models take compound information and experimental conditions as inputs and output the experimental results. The output layer has one neuron for the TLC model and two neurons for the CC model. More details about the construction of surrogate models can be found in the Methods section. These surrogate models excel at learning intricate high-dimensional mappings between the input information and output results. This facilitates reliable predictions for a range of inputs, especially for the unobserved data points within the datasets.

Given the role of surrogate models in elucidating quantitative structure‒retention relationships (QSRR), the precise characterization of both compounds and experimental conditions is of paramount importance for enhancing the models’ predictive accuracy. Notably, owing to the sequential nature of the TLC and CC processes, they share a majority of the characteristic features. As delineated in Fig. 2a, 167-dimensional molecular fingerprints and 16 molecular descriptors were employed to characterize the molecular information. The details and descriptions of these molecular descriptors are systematically cataloged in Table S1. These descriptors were carefully selected on the basis of correlation analyses, highlighting the features with significant relevance to both the TLC and CC processes.

Fig. 2: The results of surrogate models.

a Featurization of the compound information and experimental conditions for the thin-layer chromatography (TLC) and column chromatography (CC) datasets. n is the number of samples, m is the sample mass, e is the sample solvent type, and Ve is the solvent volume. b Predictive performance of the surrogate models. c Predicted curves of the mobile phase and the predicted retention volumes and retardation factor (RF) values. d The distribution of retention volume under the experimental conditions with different RF value ranges, and the standard deviation (std) for different RF value ranges. The total number of samples is 4279. e Influence of the loading sample mass. The y-axis represents the mean value of the ratio of the retention volume under the corresponding sample mass (V) to the retention volume under 50 mg (V50mg), with the other experimental conditions remaining the same. The uncertainty is illustrated in Fig. S9. Source data are provided as a Source Data file. Here, PE refers to petroleum ether and EA refers to ethyl acetate. RMSE, MAE and R2 refer to the root mean squared error, mean absolute error and coefficient of determination, respectively. VS, VE and \(\Delta V\) refer to the retention volume at the starting point, the retention volume at the ending point, and their difference.

In terms of experimental conditions, the RF value in TLC is predominantly determined by the proportion of the mobile phase, whereas the outcomes of CC under the same mobile phase are influenced by additional factors such as sample mass, sample solvent type, and solvent volume, thereby making these features unique to the CC process. For the mobile phase, an averaged molecular description is adopted, which constitutes a 6-dimensional vector and is detailed in the Methods section. Consequently, the input of the surrogate model is a 189-dimensional feature vector for TLC, and a 192-dimensional vector for CC. More details about the characterization are provided in the “Methods” section. To train the surrogate model, the datasets are partitioned into 80% for training, 10% for validation, and 10% for testing. The models are trained for up to 10,000 epochs with a learning rate of \({10}^{-3}\). To prevent overfitting, early stopping was employed, where the epoch with the minimum validation error was considered the best epoch. The validation R2 values for the TLC and CC models are 0.948 (RF value), 0.838 (VS), and 0.898 (VE). The predictive performance of both the TLC and CC surrogate models on the test dataset is illustrated in Fig. 2b. The results demonstrate the models’ satisfactory predictive capabilities, where the R2 remains above 0.8 for the prediction of the RF value and retention volume. This can be attributed to the high consistency of datasets sourced from the automated platform and the efficacy of feature characterization. The R2 of \(\Delta V\) is relatively low since \(\Delta V\) is obtained from the predicted VS and VE, which means that their errors accumulate.
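The input dimensions quoted above follow from concatenating the feature groups: 167 + 16 + 6 = 189 for TLC, plus three CC-specific features for 192. The sketch below makes that bookkeeping explicit. The feature order, the function names, and the encoding of the sample solvent type as a single numeric code are assumptions made for illustration; the actual featurization is described in the Methods section.

```python
from typing import List, Sequence

FINGERPRINT_DIM = 167   # molecular fingerprint bits
DESCRIPTOR_DIM = 16     # molecular descriptors (Table S1)
MOBILE_PHASE_DIM = 6    # averaged molecular description of the mobile phase


def tlc_features(
    fingerprint: Sequence[float],
    descriptors: Sequence[float],
    mobile_phase: Sequence[float],
) -> List[float]:
    """Concatenate the shared features into a 189-dimensional vector."""
    sizes = (len(fingerprint), len(descriptors), len(mobile_phase))
    if sizes != (FINGERPRINT_DIM, DESCRIPTOR_DIM, MOBILE_PHASE_DIM):
        raise ValueError(f"unexpected feature sizes: {sizes}")
    return [float(x) for x in (*fingerprint, *descriptors, *mobile_phase)]


def cc_features(
    fingerprint: Sequence[float],
    descriptors: Sequence[float],
    mobile_phase: Sequence[float],
    sample_mass: float,
    solvent_code: float,     # hypothetical numeric code for the sample solvent type
    solvent_volume: float,
) -> List[float]:
    """Append the three CC-specific features to obtain a 192-dimensional vector."""
    shared = tlc_features(fingerprint, descriptors, mobile_phase)
    return shared + [float(sample_mass), float(solvent_code), float(solvent_volume)]


if __name__ == "__main__":
    fp, desc, mp = [0] * 167, [0.0] * 16, [0.0] * 6
    print(len(tlc_features(fp, desc, mp)), len(cc_features(fp, desc, mp, 50.0, 1.0, 0.5)))
```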

On the basis of these accurate surrogate models, predictions of the RF values and retention volumes for the same compound across varying eluent ratios can be conducted. This facilitates the generation of mobile phase-related curves, where datapoints are derived from model predictions rather than direct experimental data (Fig. 2c). Through the application of surrogate models, we effectively reconciled the asynchronous nature of the two datasets, enriching the pool of usable data and establishing a robust foundation for subsequent statistical analysis.

For each data point in the CC dataset, the RF value under the corresponding mobile phase is obtained from the prediction of the TLC surrogate model. Thus, a unified dataset encompassing both RF values and retention volumes is constructed. With this unified dataset, we can analyse the relationship between the RF values and the corresponding retention volumes from a statistical perspective. Since the RF value serves as an approximate measurement of the relative polarity of the compounds, we coarsely categorize it into different ranges, including 0–0.2, 0.2–0.4, 0.4–0.6, 0.6–0.8, and 0.8–1.0. Notably, this categorization criterion is rudimentary and solely intended for the basic differentiation of RF values. The distribution of the retention volumes corresponding to the data within the different RF value ranges is illustrated in Fig. 2d. A conspicuous pattern can be observed in the graph: as the RF value increases, the variance of the retention volume distribution decreases, and vice versa. This pattern indicates a latent, yet significant, relationship between the RF values and retention volumes.
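The grouping behind this kind of analysis can be sketched in a few lines. The example assigns each RF value to one of the five ranges and computes the standard deviation of the retention volumes per range. The half-open bin convention (lower edge included, upper edge excluded, except for RF = 1.0) and the use of the population standard deviation are assumptions, and the function names are hypothetical.

```python
from statistics import pstdev
from typing import Dict, List, Sequence, Tuple

RF_EDGES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]


def rf_range(rf: float) -> Tuple[float, float]:
    """Return the (low, high) RF range that contains `rf`."""
    if not 0.0 <= rf <= 1.0:
        raise ValueError("RF must lie in [0, 1]")
    for lo, hi in zip(RF_EDGES[:-1], RF_EDGES[1:]):
        if rf < hi:
            return (lo, hi)
    return (RF_EDGES[-2], RF_EDGES[-1])  # rf == 1.0 falls in the last range


def std_by_rf_range(
    rf_values: Sequence[float], volumes: Sequence[float]
) -> Dict[Tuple[float, float], float]:
    """Group retention volumes by RF range and return the std of each group."""
    groups: Dict[Tuple[float, float], List[float]] = {}
    for rf, v in zip(rf_values, volumes):
        groups.setdefault(rf_range(rf), []).append(v)
    return {key: pstdev(vals) for key, vals in groups.items()}


if __name__ == "__main__":
    # Toy numbers for illustration only, not taken from the dataset.
    print(std_by_rf_range([0.1, 0.15, 0.7, 0.75], [10.0, 30.0, 4.0, 6.0]))
```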

Moreover, the influence of the loading sample mass can also be investigated from the model predictions, as depicted in Fig. 2e. The figure shows that as the sample mass increases, both the variance and the deviation of the mean value also increase. Interestingly, the trend of the ratio for VS and VE is divergent. The mean ratio of VS decreases to below 1, whereas the mean ratio of VE increases to above 1. This discovery aligns with the experience that a large loading sample amount may result in column overloading, leading to an increased peak width and decreased resolution. More details can be found in Supplementary Information S1.

Rationale for the determination of the separation conditions

In this section, a rationale is established from experimental data via statistics and machine learning, which explicitly describes the latent relationship between the RF values and CC outcomes (i.e., the retention volume).

Here, a knowledge discovery approach is adopted to probe the TLC‒CC nexus, utilizing the unified model predictions derived from trained surrogate models. From the results of the analyses presented in Fig. 2, a potential pattern between the RF values and the distribution of retention volumes is observed. To further investigate this trend, we divided the range of RF values into ten equal intervals of length 0.1, allowing for a more detailed examination. Additionally, the effects of the mobile phase proportion were investigated. Specifically, 10 commonly used proportions of petroleum ether (PE) and ethyl acetate (EA), ranging from 1:0 to 0:1, are studied. A higher proportion of PE indicates a lower polarity of the mobile phase, and vice versa. For each specified mobile phase proportion, model predictions can be generated, and the distributions of the retention volumes for all the compounds across different RF value ranges can be computed and analysed. These distributions are visualized via box plots (Fig. 3a). Notably, each box plot shows the distribution of the retention volumes of compounds whose RF values fall in a specific range under a given eluent.

Fig. 3: Discovery of the relationship between the retention volumes of column chromatography and the retardation factor (RF) values.

a The distribution of retention volumes in diverse ranges of RF values under given mobile phase proportions. The sample size n = 585. The boxplot is defined by the maximum, minimum, and quartiles. The mean values are represented by triangles and the median values are denoted by lines. b The statistically averaged separation index matrix. c The mean retention volumes \({\bar{V}}_{S}\) under different RF value ranges and eluent ratios and the discovered formula. d The mean retention volumes \({\bar{V}}_{E}\) under different RF value ranges and eluent ratios and the discovered formula. The observed mean retention volumes and fitted mean retention volumes by the formula are displayed under the equation. Here, PE refers to petroleum ether and EA refers to ethyl acetate. rPE and rEA are the proportion of PE and EA in the eluent. RMSE and R2 refer to the root mean squared error and coefficient of determination, respectively. Source data are provided as a Source Data file.

Figure 3 shows several interesting findings. Within each proportion of the mobile phase, exemplified by PE:EA = 50:1 (V/V) (more examples are provided in Fig. S5), a trend is discernible, where higher RF values are associated with a constricted range of retention volume fluctuations and a corresponding reduction in their mean values. This statistical observation is in alignment with chemical intuition, since a higher RF value corresponds to a sample with a smaller polarity, where the retention volume is usually small. Moreover, for an identical RF value range, the retention volumes of mobile phases with larger polarities are discovered to have a smaller variance. This implies that, for an identical RF value range, a larger mobile phase polarity will result in a lower uncertainty. Consequently, from a statistical perspective, the retention volume is correlated with both the RF value and the eluent ratio. Through this statistical approach, a quantitative analysis of the selection of separation conditions can be conducted.

Let us assume that the mixture contains the desired product A and an impurity B, for simplicity in the analysis. In TLC experiments, the respective RF values of both compounds can be obtained simultaneously. Given the mobile phase proportion, and the corresponding RF values, the distribution range of their retention volumes (VS and VE) on CC can be determined from Fig. 3a. Evidently, complete separation is only achievable when either the VS of product A is larger than the VE of impurity B, or the VS of impurity B is larger than the VE of product A. By combining the distribution ranges of A’s and B’s respective retention volumes, the separation index for this scenario can be calculated, which is defined as the ratio of the length of the distribution range of the retention volumes in complete separation to the length of the overall distribution range. The detailed definitions and calculation process can be found in Supplementary Information S2. Here, a larger separation index corresponds to a greater possibility of separation in CC. For each mobile phase proportion, the corresponding separation index for A and B in different RF value intervals can be calculated, resulting in a matrix of separation indices. By taking the average of the separation index matrices corresponding to the eight commonly used mobile phase proportions, a statistical separation index matrix can be obtained, which is displayed in Fig. 3b.
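The complete-separation condition stated above translates directly into code. The sketch below encodes that condition and, purely as an illustrative stand-in, reports the fraction of paired (VS, VE) candidates that satisfy it. This fraction is not the separation index of Supplementary Information S2, which is defined through the lengths of the distribution ranges; the function names and the toy values are hypothetical.

```python
from typing import Sequence, Tuple

Window = Tuple[float, float]  # (V_S, V_E) of one compound


def completely_separated(a: Window, b: Window) -> bool:
    """True if one compound starts eluting only after the other has finished."""
    vs_a, ve_a = a
    vs_b, ve_b = b
    return vs_a > ve_b or vs_b > ve_a


def separated_fraction(windows_a: Sequence[Window], windows_b: Sequence[Window]) -> float:
    """Fraction of (A, B) candidate pairs that are completely separated.

    Assumption: each sequence holds plausible (V_S, V_E) pairs for a compound,
    e.g. sampled from the ranges read off the box plots.
    """
    if not windows_a or not windows_b:
        raise ValueError("both compounds need at least one (V_S, V_E) candidate")
    pairs = [(a, b) for a in windows_a for b in windows_b]
    hits = sum(completely_separated(a, b) for a, b in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    product = [(10.0, 20.0), (12.0, 22.0)]
    impurity = [(25.0, 40.0), (15.0, 30.0)]
    print(separated_fraction(product, impurity))
```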

In Fig. 3b, lighter red indicates smaller separation indices, whereas deeper red indicates larger ones. Interestingly, a distinct pattern emerges. The separation indices corresponding to the lower right corner of the matrix are always relatively small. This suggests that when both RF values exceed 0.5, the likelihood of separation is low, even with a significant difference in the RF values of both compounds. Moreover, when one compound has a small RF value (especially less than 0.3), the separation index is generally large, and a greater difference in RF values between the two compounds increases the possibility of separation. Figure 3a shows that a smaller RF value corresponds to a larger mean value of the retention volume distribution, indicating that a longer time is required for separation. Therefore, maintaining the RF value of the desired product within the range of 0.2 to 0.3 is optimal for separation. Through the separation index matrix, we provide a statistical rationale for the empirical determination of the separation conditions in CC. Additionally, we have observed that if there is no discernible difference in RF values among the components of the mixture in TLC, they are most likely not separable in CC.

Furthermore, we attempt to express the inherent relationship between RF values and retention volumes through explicit formulas. As illustrated in Fig. 3c, the mean values of the retention volume distribution across divergent mobile phase proportions and the RF values were calculated. The figure shows that the mean retention volume follows an inversely proportional relationship, where an increase in the mobile phase proportion precipitates a decrease in the inverse proportionality coefficient. To elucidate this complex relationship explicitly, the symbolic regression algorithm, pySR, is employed, which facilitates the derivation of an exceptionally concise and accurate equation20. The number of iterations is 100, the utilized operators are +, ×, and /, and the criterion is the mean squared error. To discover a concise equation, only the outcomes with complexities smaller than 10 are considered. Both the structural complexity and regression loss are considered when choosing the best equation. The discovered equations are written as:

$${\bar{V}}_{S}=\left\{\begin{array}{c}\frac{r}{0.147\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+0.0114},r \, > \, 0\\ 5.147,r=0 \end{array}\right.$$
(1)
$${\bar{V}}_{E}=\left\{\begin{array}{c}\frac{r}{0.069\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+0.0054},r \, > \, 0\\ 10.98,r=0 \end{array}\right.$$
(2)
$$r=\frac{{r}_{PE}}{{r}_{PE}+{r}_{EA}}$$
(3)

where \({\bar{V}}_{S}\) and \({\bar{V}}_{E}\) refer to the mean retention volume, and r is the proportion of PE in the eluent. The R2 values of the symbolic regression for \({\bar{V}}_{S}\) and \({\bar{V}}_{E}\) are 0.882 and 0.900, respectively. Notably, the resemblance in form between \({\bar{V}}_{S}\) and \({\bar{V}}_{E}\) suggests a potential inner relationship, given that the symbolic regression was performed independently during the equation discovery process. Interestingly, \({\bar{V}}_{E}\) is approximately twice the value of \({\bar{V}}_{S}\) in the 4 g column. Fundamentally, this can be rationalized by the fact that a larger VS frequently signifies a slower solute outflow, which typically coincides with a higher VE. Traditionally, experimenters may discern this trend, yet the discovered equations can clearly illustrate this pattern.
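Equations (1)–(3) are simple enough to evaluate directly. The following sketch transcribes them, with the coefficients taken from the equations above; the function names are hypothetical, and the coefficients apply to the 4 g column data from which they were derived.

```python
def pe_fraction(r_pe: float, r_ea: float) -> float:
    """Eq. (3): proportion of PE in a PE:EA eluent."""
    total = r_pe + r_ea
    if total <= 0:
        raise ValueError("eluent proportions must sum to a positive value")
    return r_pe / total


def mean_vs(rf: float, r: float) -> float:
    """Eq. (1): mean retention volume at the starting point."""
    if r == 0:
        return 5.147
    return r / (0.147 * rf + 0.0114)


def mean_ve(rf: float, r: float) -> float:
    """Eq. (2): mean retention volume at the ending point."""
    if r == 0:
        return 10.98
    return r / (0.069 * rf + 0.0054)


if __name__ == "__main__":
    r = pe_fraction(50, 1)  # PE:EA = 50:1 (V/V)
    vs, ve = mean_vs(0.3, r), mean_ve(0.3, r)
    print(f"V_S = {vs:.2f}, V_E = {ve:.2f}, V_E / V_S = {ve / vs:.2f}")
```

For PE:EA = 50:1 and an RF value of 0.3, the two formulas give a ratio of roughly 2.1 between the mean ending and starting volumes, in line with the observation that \({\bar{V}}_{E}\) is approximately twice \({\bar{V}}_{S}\).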

These discovered equations are concise and practically significant, and unravel the statistical interconnection between the RF values and retention volumes, translating conventional experiential insights into a quantifiable, explicit formula. By applying this formula, it becomes feasible to predict the range of outcomes in CC on the basis of the preliminary TLC findings, thus substantially facilitating the determination of the potential for a successful separation under the given conditions. Considering that TLC usually only takes a few minutes, while CC requires more time and solvent, the proposed equation can improve the efficiency of the CC process.

Model generalization across different compounds and column specifications

Utilizing advanced knowledge discovery methodologies, we formulate equations that correlate the RF values with the mean retention volumes \({\bar{V}}_{S}\) and \({\bar{V}}_{E}\) in Eqs. (1) and (2), offering a statistical interpretation of the macroscopic relationship between the TLC and CC. However, even under identical RF values, diverse compounds and column specifications influence the retention volume. In this section, we aim to quantify this influence to achieve a more generalized and accurate “AI experience”. To quantify the influence of compounds, a variable named the CC ratio ε, which is defined as the ratio of the actual retention volume V of a compound to the predicted average retention volume \(\bar{V}\) under given experimental conditions, is proposed in this study.

$$\varepsilon=\frac{V}{\bar{V}}$$
(4)

Notably, under a fixed mobile phase proportion, the CC ratio ε depends exclusively on the structural and property characteristics of the compound. Nonetheless, deriving a direct equation for ε is challenging because of the representation of the compound features by high-dimensional vectors composed of molecular fingerprints and descriptors. Therefore, this study employs a visual tree structure to represent the relationship between the CC ratio ε and the compounds. Owing to the binary nature of molecular fingerprints, tree structures are particularly suitable for depicting this relationship. For each mobile phase proportion, a regression tree is trained on the model predictions, establishing a correlation between the CC ratio and the compound’s structural and property attributes. Regression trees in all the mobile phase proportions collectively form a forest. The maximum depth of the regression tree is 9, the minimum leaf depth is 1, and the criterion is the mean absolute error.

In this context, the visual regression trees for PE:EA ratios of 50:1 (V/V) (Fig. 4a) and 1:1 (V/V) (Fig. 4b) are exemplified via dendrograms. For enhanced clarity, the CC ratio ε is divided into three categories, namely \(\varepsilon \, \le \, 0.5\), \(0.5\, < \, \varepsilon \, \le \, 1\), and \(\varepsilon \, > \, 1\), each indicated by a distinct color. Variables closer to the roots have greater importance and correlation, whereas those closer to the leaves have less importance and correlation. The dendrogram’s branching pattern distinctly separates different ε values, clearly illustrating the decision logic of the regression trees and the influence of pertinent variables on the CC ratio ε. The figure reveals that the molecular descriptors AATSC0P and TPSA are dominant. AATSC0P, which was previously utilized in predicting liquid chromatography retention times without an elucidated mechanism21, was found to inversely correlate with the CC ratio ε. Conversely, TPSA, known for its strong correlation with molecular polarities and TLC outcomes19, demonstrates direct proportionality with ε.
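The CC ratio of Eq. (4) and the three color categories used in the dendrograms can be written out as a short sketch. The function names and category labels are hypothetical; `volume_from_ratio` is simply Eq. (4) rearranged to recover a compound-specific retention volume from a predicted ε.

```python
def cc_ratio(volume: float, mean_volume: float) -> float:
    """Eq. (4): ratio of the actual to the predicted mean retention volume."""
    if mean_volume <= 0:
        raise ValueError("mean retention volume must be positive")
    return volume / mean_volume


def volume_from_ratio(epsilon: float, mean_volume: float) -> float:
    """Eq. (4) rearranged: V = epsilon * mean retention volume."""
    return epsilon * mean_volume


def ratio_category(epsilon: float) -> str:
    """Map epsilon to one of the three categories shown in the dendrograms."""
    if epsilon <= 0.5:
        return "eps <= 0.5"
    if epsilon <= 1.0:
        return "0.5 < eps <= 1"
    return "eps > 1"


if __name__ == "__main__":
    eps = cc_ratio(30.0, 20.0)  # toy values for illustration only
    print(eps, ratio_category(eps), volume_from_ratio(eps, 20.0))
```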

Fig. 4: The adaptation of the model to diversified compounds.
figure 4

a Dendrograms of the visual regression tree of column chromatography ratios under PE:EA = 50:1 (V/V). b Dendrograms for the visual regression tree of column chromatography ratios under PE:EA = 1:1 (V/V). Here, PE refers to petroleum ether and EA refers to ethyl acetate. ε is the column chromatography ratio. The symbols in the dendrogram are explicitly defined in Table S1.

In contrast to neural network models, the simplicity of the regression tree models permits a visual representation, thereby offering superior interpretability. Additionally, limiting the regression tree’s maximum depth automatically selects the most relevant variables from the high-dimensional data, thereby achieving a sparse representation. However, regression trees often fall short in modeling complex relationships and exhibit suboptimal predictive capabilities. In this study, the regression tree model is developed on the basis of insights from Eqs. (1) and (2), thereby reducing relationship complexity for improved data fitting and enhancing both interpretability and chemical relevance. Table S2 demonstrates the accuracy of the CC ratio ε predicted by the regression trees across different mobile phase proportions, confirming their effectiveness in learning the latent relationships between ε and compounds.

With the predicted ε, the retention volumes can be directly calculated from the utilized mobile phase proportion and the corresponding RF values in the TLC through Eqs. (1) and (2). Compared with direct tree model training, this approach not only improves precision but also augments chemical interpretability (Fig. S6). While the regression tree model’s prediction accuracy (R² = 0.678) slightly lags behind that of previous surrogate models, its complete interpretability is paramount. This study offers invaluable insights into the intricacies of the CC process.

The equations can be easily extrapolated to other column specifications with a small amount of data through a transfer learning strategy, which recalibrates the coefficients of the equations learned from the symbolic regression with the data from the target domain. Compared with fine-tuning machine learning models, this strategy has greater computational efficiency since it only needs a simple regression. Crucially, because the column specifications do not influence the fundamental principles of chromatography, the inversely proportional statistical relationship encapsulated by these equations is expected to remain valid, with the inverse proportionality coefficient varying in accordance with the column specification. Moreover, the CC ratio, which is dependent on the compound characteristics, should maintain its consistency across the different column specifications. Therefore, extrapolation to other column specifications can be accomplished by fitting the coefficients in Eqs. (1) and (2), a process detailed in the Methods section.

Here, the tandem 4 g column and 25 g column are taken as examples. As depicted in Fig. 5a, b, the straightforward application of Eqs. (1) and (2) to these distinct cases is impractical, as it would yield substantial errors, particularly when the column specification disparity is pronounced. After recalibrating the coefficients in Eqs. (1) and (2) with data from the target domain, the modified equations can be written as:

$${V}_{S}^{4g+4g}=\left\{\begin{array}{c}\frac{r\cdot \varepsilon }{0.055\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+0.0062},r \, > \, 0\\ 7.832\cdot \varepsilon,r=0\end{array},\right.{V}_{E}^{4g+4g}=\left\{\begin{array}{c}\frac{r\cdot \varepsilon }{0.031\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+0.0036},r \, > \, 0\\ 17.61\cdot \varepsilon,r=0\end{array},\right.$$
(5)
$${V}_{S}^{25g}=\left\{\begin{array}{c}\frac{r\cdot \varepsilon }{0.022\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+0.0027},r \, > \, 0\\ 15.70\cdot \varepsilon,r=0\end{array},\right.{V}_{E}^{25g}=\left\{\begin{array}{c}\frac{r\cdot \varepsilon }{0.013\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+0.0016},r \, > \, 0\\ 26.81\cdot \varepsilon,r=0\end{array},\right.$$
(6)

where \({V}_{S}^{4g+4g}\) and \({V}_{E}^{4g+4g}\) refer to the predicted retention volumes for the tandem 4 g column, and where \({V}_{S}^{25g}\) and \({V}_{E}^{25g}\) refer to the predicted retention volumes for the 25 g column. The predictive capabilities of the modified equations are depicted in Fig. 5a, b. The equations can predict the retention volumes well, but the performance is relatively worse than that of the 4 g column. This may be because the amount of data in larger columns is smaller, and the difference between the large column and the small column is notable, affecting the transfer learning. From an analytical standpoint, as the mass of the column packing increases, the coefficient in the denominator decreases, indicating larger retention volumes.
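As a worked illustration of Eqs. (5) and (6), the following minimal Python sketch evaluates the recalibrated equations for given r, RF, and ε. The coefficients are copied from the equations above; the dictionary and function names, and the example input values, are hypothetical and serve only to illustrate the calculation.

```python
# Coefficients (a, b, c) copied from Eqs. (5) and (6).
# For r > 0 the volume is r * eps / (a * R_F + b); for r == 0 it is c * eps.
COEFFS = {
    ("4g+4g", "start"): (0.055, 0.0062, 7.832),
    ("4g+4g", "end"): (0.031, 0.0036, 17.61),
    ("25g", "start"): (0.022, 0.0027, 15.70),
    ("25g", "end"): (0.013, 0.0016, 26.81),
}


def retention_volume(column, point, r, rf, eps):
    """Evaluate Eq. (5) or (6) for one column ('4g+4g' or '25g') and one
    point ('start' for V_S, 'end' for V_E)."""
    if r < 0 or not 0.0 <= rf <= 1.0:
        raise ValueError("expected r >= 0 and 0 <= R_F <= 1")
    a, b, c = COEFFS[(column, point)]
    if r == 0:
        return c * eps
    return r * eps / (a * rf + b)


# Illustrative inputs only (not measured values).
example = {
    key: retention_volume(key[0], key[1], r=0.5, rf=0.3, eps=1.0)
    for key in COEFFS
}
```

For identical inputs with r > 0, the smaller denominator coefficients of the 25 g column yield larger volumes than those of the tandem 4 g column, which matches the trend described above.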

Fig. 5: The adaptation of the model to diverse column specifications.
figure 5

a The fitted and observed retention volume obtained by directly utilizing discovered equations from the 4 g column (left), and the equations fitted by nonlinear regression and their predictive performance for the extrapolation of the tandem 4 g column (right). b The fitted and observed retention volume obtained by directly utilizing discovered equations from the 4 g column (left), and the equations fitted by nonlinear regression and their predictive performance for the extrapolation of the 25 g column (right). Here, r is the proportion of PE in the eluent, and ε is the column chromatography ratio. RMSE and R² refer to the root mean squared error and coefficient of determination, respectively. VS and VE refer to the retention volume at the starting point and the retention volume at the ending point, respectively. Source data are provided as a Source Data file.

Discussion

This study provides a rationale for the determination of the separation conditions from the perspective of statistics through machine learning techniques. To guarantee the effectiveness and generalizability of the discovered relationship between TLC and CC, automated platforms are established to conduct standardized experiments, where the stationary phase, humidity, temperature, and other environmental and operational influences are controlled. Surrogate models are constructed to align the asynchronous experimental datasets for TLC and CC. Our work introduces a separation index matrix that quantitatively explains the chemical expertise that the RF value of the compound of interest should be tailored to 0.2–0.3 in TLC. Furthermore, the “AI experience” is extracted from the experimental data obtained from automatic platforms via knowledge discovery frameworks, where explicit equations that describe the interplay between RF values and retention volumes are identified for the first time. This process converts conventional chemical insights into formalized equations, enabling quantitative predictions of the outcomes of CC on the basis of preliminary TLC data. The equations can be easily extrapolated to other column specifications via a simple nonlinear regression of coefficients. This advancement not only deepens the understanding of chromatographic separation but also improves experimental efficiency.

Our work is an application of knowledge discovery in experimental chemistry. Prior efforts in knowledge discovery have focused mainly on unearthing complex partial differential equations in the realm of physics22,23,24, which involve fewer variables that embody intricate interrelations and are often expressed through differential and integral forms. In contrast, a unique challenge arises in the realm of chemistry, in which molecular complexity involves a vast array of variables. Each variable has a simpler relationship with the outcomes, but the overall complexity is derived from their collective interactions. Directly applying traditional knowledge discovery approaches in this context is impractical. In our study, we utilized the inherent correlation between TLC and CC outcomes, employing RF values as pivotal variables to uncover statistical equations, which effectively bypasses the challenges posed by molecular complexity. Notably, the knowledge discovery technique is not intended to replace the mechanism model but rather to complement it. Their common goal is to obtain an explicit equation to describe chemical phenomena.

In the domain of cheminformatics, deep learning techniques have been prevalently applied for molecular property prediction, achieving notable successes11,12,13,14,25. However, these models often suffer from a lack of interpretability, which hinders a deeper understanding of chemical phenomena and limits their utility in specific scenarios. Thus, the interpretable modeling approach introduced in our research represents a notable advancement. We managed to extract explicit equations from the experimental data to construct an interpretable model. While there are concessions in precision, it can enhance our comprehension of the relationship between TLC and CC, making it more applicable in experimental contexts. Specifically, the constructed TLC and CC surrogate models can provide accurate predictions of the RF values and retention times, which can guide the experiments. Moreover, the discovered “AI experience” provides explicit equations to bridge TLC and CC, which can facilitate chromatographic separation. Furthermore, the field of chemistry contains a variety of empirical formulas and chemical expertise for which it is difficult to model potential cross-scale relationships through theoretical deduction. Therefore, our method is promising for discovering more “AI experience” in diverse problems.

Nevertheless, several aspects of this research warrant further enhancement. First, owing to the time-intensive nature of chromatographic processes, we developed an automated platform for data collection to minimize the involvement of human resources and maximize efficiency. However, the dataset size, especially for larger columns, is still limited due to time and resource constraints. Moreover, despite our efforts to encompass a diverse range of compounds, the selection of 192 compounds in CC does not cover the extensive spectrum of chemical compounds available. A more diverse dataset would be useful in refining the precision of our interpretable model. Despite these limitations, our research provides invaluable insights into the field of chromatography. We anticipate that this framework will generate a more reliable “AI experience” to facilitate studies of nature and science.

Methods

Automatic column chromatographic platform

In this work, an automated CC platform is developed to measure the retention volumes of different compounds under various experimental conditions to construct a CC dataset. The developed automated CC platform offers greater flexibility and comprehensive automation, enabling fully automated CC analysis and significantly enhancing data collection efficiency.

As shown in Fig. S2, the automated sample loader is one of the core components facilitating the automation process. For the whole automated process, prepared samples are dissolved in the loading solvent, while the eluent is configured onsite through an infusion pump according to a preset ratio. The mixtures of standard solutions are then injected into the column via the automated sample loader, and the eluent is pumped into the column at a predetermined flow rate to perform CC. The detector continuously monitors the absorbance, and terminates the experiment automatically once the solute is fully collected.

The automatic CC system consists of two precision medium-speed infusion pumps equipped with a syringe-based solution mixer and an ultraviolet (UV) detector (Fig. S2). All the devices are connected by tubing. The prepared mixtures of standard solutions are placed in the syringe sample tray. They are then loaded via a syringe needle and transported to the chromatography column under pump pressure. The two pumps deliver the corresponding eluents at set flow rates, which are mixed in the solution mixer. The standard solutions are then eluted and separated, and the UV signal of the effluent solution during the elution process is recorded by a UV detector. In the auto CC system, the start time (tS) and end time (tE) of the sample separation are automatically calculated on the basis of recorded raw data via recognition algorithms. After each experiment, the automated platform proceeds to clean the column, and initiates the next experiment automatically under different conditions.

For each compound, eight experiments are conducted using different eluent ratios, namely, PE:EA = 1:0, 100:1, 50:1, 10:1, 5:1, 2:1, 1:1, and 0:1 (V/V), yielding eight sets of data from the UV detector. To calculate the target data, the raw data are first converted into the corresponding absorbance values by translating the complete 16-byte commands. The start and end times are then determined on the basis of the magnitude of the absorbance. At the end of each test, the eluent ratio is changed and maintained for approximately 30 s. Subsequently, the syringe automatically cleans the needle before proceeding to the next sample. To eliminate the influence of stray peaks, we analyse the data using a specific window width equivalent to 20 s. The calculated start and end retention times are multiplied by the flow rate to obtain the retention volumes VS and VE, respectively.
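The determination of the start and end times can be illustrated with a minimal Python sketch. The recognition algorithm itself is not specified here, so the sketch assumes a simple rule: the peak is the first run of samples whose absorbance exceeds a threshold for at least the 20 s window, which discards shorter stray peaks. The threshold, the sampling interval, the flow rate, and all function names are hypothetical.

```python
def find_peak_window(times, absorbance, threshold, min_width_s=20.0):
    """Return (t_start, t_end) of the first run above `threshold` that lasts
    at least `min_width_s` seconds, or None if no such run exists."""
    run_start = None
    for i, a in enumerate(absorbance):
        if a > threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # The run ended at sample i - 1; keep it only if it is wide enough.
            if times[i - 1] - times[run_start] >= min_width_s:
                return times[run_start], times[i - 1]
            run_start = None  # stray peak, ignore it
    if run_start is not None and times[-1] - times[run_start] >= min_width_s:
        return times[run_start], times[-1]
    return None


def retention_volumes(t_start, t_end, flow_rate):
    """Convert start/end times into V_S and V_E (time multiplied by flow rate).
    `flow_rate` is volume per unit of the time axis (an assumed unit)."""
    return t_start * flow_rate, t_end * flow_rate


# Synthetic trace sampled every 5 s: one stray spike at 20 s, real peak 100-150 s.
times = list(range(0, 200, 5))
absorbance = [0.0] * len(times)
absorbance[4] = 1.0
for i in range(20, 31):
    absorbance[i] = 1.0

window = find_peak_window(times, absorbance, threshold=0.5)
volumes = retention_volumes(window[0], window[1], flow_rate=0.5)
```

In this synthetic trace the single-sample spike at 20 s is rejected because it is narrower than the window, and only the broad peak contributes to VS and VE.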

For TLC, the silica gel plate from the Yinlong brand is utilized, and the stationary phase is standard silica with an average particle size of 40 μm. For CC, the column from the Agela brand is utilized, and the stationary phase is also standard silica with an average particle size of 40 μm. In the TLC automated platform, the whole process is accomplished by a collaborative robot to guarantee the precision of the measured RF values. The development is conducted in a square flat-bottom developing chamber, and the developing time is 300 s. Specifically, the robot can finely control the spotting procedure, which keeps the spot as compact as possible. Moreover, the developed TLC plate is photographed, and a computer vision algorithm is adopted to identify the geometric center of each spot and calculate the RF values. To guarantee the quality of the data, human verification is adopted to correct TLC plates that are incorrectly identified by the computer. For more details of the automated TLC platform, refer to the protocol18.

Surrogate model construction

In this work, two fully connected artificial neural networks are utilized to construct surrogate models to fit the TLC and CC datasets and generate model predictions. For model construction, characterization of the compounds and experimental settings is crucial. The compounds are represented by the Molecular Access System (MACCS) keys with 167 dimensions, where each dimension is a binary code that denotes the existence of certain substructures and properties. In addition to the molecular fingerprint, molecular descriptors are also employed to describe the overall properties of the compounds. In this work, 16 molecular descriptors are selected from thousands of candidate descriptors according to the correlation coefficients. For the mobile phase, which is composed of PE and EA, the averaged descriptor of the mobile phase is utilized. Here, the molecular weight (MW_e), topological polar surface area (TPSA_e), number of rotatable bonds (NRotB_e), number of hydrogen bond donors (HBD_e), number of hydrogen bond acceptors (HBA_e), and lipid‒water partition coefficient (LogP_e) of the PE and EA are weight-averaged according to the proportions used to obtain the descriptors of the mobile phase, as illustrated in Fig. S7. The abbreviations and meanings of the descriptors are provided in detail in Table S1. Therefore, for the surrogate model of TLC, the input is the combination of molecular fingerprints and descriptors with 189 dimensions, and the output is the RF value. For the surrogate model of CC, the input is similar to that of TLC, but the experimental conditions, including the sample mass, sample solvent type, and solvent volume, are incorporated into the input vector, which has 192 dimensions. The outputs are the retention volumes VS and VE. Both neural networks have five layers, including an input layer, three hidden layers, and an output layer. 
Considering that the RF value is between 0 and 1, an additional Sigmoid layer is employed in the TLC surrogate model to guarantee that the output conforms to the range. The activation function is LeakyReLU. The datasets are partitioned into 80% for training, 10% for validation, and 10% for testing. The number of training epochs is 10,000. To prevent overfitting, an early stopping strategy is employed on the basis of the validation error. To facilitate training, the input vector is standardized via max-min standardization.
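The construction of the mobile phase descriptors and the resulting input dimensions can be sketched as follows. This is a minimal illustration under stated assumptions: the weights are taken as the PE and EA parts of the PE:EA ratio, and every numeric descriptor value in the sketch is a placeholder chosen for illustration, not a value from this work or from Fig. S7. The function and variable names are hypothetical.

```python
# Descriptor names follow the text. ALL numeric values are placeholders
# for illustration only and are not taken from the paper.
PE_DESC = {"MW_e": 80.0, "TPSA_e": 0.0, "NRotB_e": 2.0,
           "HBD_e": 0.0, "HBA_e": 0.0, "LogP_e": 3.0}
EA_DESC = {"MW_e": 90.0, "TPSA_e": 26.0, "NRotB_e": 2.0,
           "HBD_e": 0.0, "HBA_e": 2.0, "LogP_e": 0.5}


def mobile_phase_descriptors(pe_parts, ea_parts, pe_desc=PE_DESC, ea_desc=EA_DESC):
    """Proportion-weighted average of the PE and EA descriptors."""
    total = pe_parts + ea_parts
    if total <= 0:
        raise ValueError("at least one solvent must be present")
    w_pe, w_ea = pe_parts / total, ea_parts / total
    return {k: w_pe * pe_desc[k] + w_ea * ea_desc[k] for k in pe_desc}


# Input dimensions stated in the text.
N_MACCS = 167          # MACCS keys
N_MOL_DESC = 16        # selected molecular descriptors
N_ELUENT_DESC = len(PE_DESC)  # six averaged mobile phase descriptors
N_CC_CONDITIONS = 3    # sample mass, sample solvent type, solvent volume

TLC_INPUT_DIM = N_MACCS + N_MOL_DESC + N_ELUENT_DESC
CC_INPUT_DIM = TLC_INPUT_DIM + N_CC_CONDITIONS

eluent_1_1 = mobile_phase_descriptors(1, 1)
eluent_50_1 = mobile_phase_descriptors(50, 1)
```

The two totals reproduce the 189-dimensional TLC input and the 192-dimensional CC input described above.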

In this research, the generation of model predictions serves as a methodological approach to simulate unobserved experiments, thereby bridging the inconsistency between the TLC and CC datasets. Specifically, each data point in the CC dataset refers to a specific experimental scenario involving a target compound under predefined conditions. Considering that a critical factor influencing the separation efficacy of CC is the proportion of solvents in the mobile phase, the generation of model predictions focuses on different solvent ratios, including PE:EA = 1:0, 100:1, 50:1, 10:1, 5:1, 2:1, 1:1, and 0:1 (V/V). For each molecule, while keeping the other experimental conditions unchanged, only the solvent ratio is changed to generate different conditions, which are then input into the CC surrogate model to obtain the predicted retention volumes. Notably, the experimental data for these solvent ratios might not exist in the CC dataset, implying that the study effectively simulates and predicts experimental outcomes under these conditions. Correspondingly, predictions are also made for the same compounds and solvent ratios with the TLC surrogate model to generate TLC model predictions.

The generation of model predictions from both TLC and CC surrogate models enables the simultaneous prediction of RF values and retention volumes for the compounds studied under a spectrum of solvent ratios, which is challenging to obtain directly from the datasets. Therefore, this model prediction generation approach is instrumental in laying the groundwork for subsequent knowledge discovery and analysis. In quantitative terms, 4680 model predictions are generated, including 585 experimental conditions evaluated across 8 solvent ratios.
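The enumeration of the prediction grid can be sketched in a few lines of Python. The solvent ratios and the counts are taken from the text; the record layout, the field names, and the function name are hypothetical, and the surrogate models themselves are not included.

```python
from itertools import product

# PE:EA ratios (V/V) listed in the text.
SOLVENT_RATIOS = [(1, 0), (100, 1), (50, 1), (10, 1),
                  (5, 1), (2, 1), (1, 1), (0, 1)]


def build_prediction_grid(conditions, ratios=SOLVENT_RATIOS):
    """Pair every base experimental condition with every solvent ratio,
    leaving all other fields of the condition unchanged."""
    return [dict(cond, pe=pe, ea=ea)
            for cond, (pe, ea) in product(conditions, ratios)]


# Placeholder records standing in for the 585 experimental conditions.
conditions = [{"condition_id": i} for i in range(585)]
grid = build_prediction_grid(conditions)
```

Each entry of `grid` would then be passed to both the CC and the TLC surrogate models, so that every retention volume prediction has an RF prediction for the same compound and solvent ratio.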

Symbolic regression

In this work, the statistical analysis of the generated model predictions reveals a discernible pattern correlating the retention volume, the RF value, and the ratio of the developing agent. These observed relationships align with the established knowledge in the field of chemistry, yet defined mathematical expressions are lacking to characterize these interdependencies succinctly. Consequently, this study employs symbolic regression as a methodological tool for knowledge discovery, aiming to unearth the most fitting explicit mathematical formulations that capture the essence of these relationships. Symbolic regression integrates a genetic programming algorithm with advanced optimization techniques, ensuring both efficiency and robustness in knowledge discovery. The genetic programming process simulates biological evolution to iteratively evolve mathematical expressions, optimizing their fit to the data. This process begins with a diverse set of random expressions described by symbolic trees, which undergo continuous evolution through operations such as crossover and mutation. Regularization is adopted to balance the exploration of new solutions with the refinement of the existing ones, thus preventing the algorithm from stagnating in a local optimum. In this work, symbolic regression is implemented via the Python package PySR. This tool is simple, user-friendly, and well suited to chemical problems with numerous variables, but it is less effective with complex equations. For problems involving partial differential equations, methods such as DLGA or DISCOVER are recommended24,26.

Visual regression trees

In this work, statistically explicit expressions were identified to characterize the relationship between TLC and CC. Evidently, even when the RF values are identical, the retention volumes in CC can differ between compounds. This influence stems mainly from the structure and properties of the molecule. Therefore, we define a CC ratio under a given mobile phase proportion to represent this influence. In this work, molecular fingerprints and descriptors, which are high-dimensional features, are used to express the structure and properties of the compounds. As such, it is difficult to directly observe the relationship between the CC ratio and these features or to find an explicit expression. To further explore this potential relationship, we employed a visual regression tree approach to provide an explicit explanation. Given the binary nature of molecular fingerprints, their impact on the CC ratio is well-suited for expression in a binary tree format. Here, a binary regression tree with a maximum depth of 9 layers is used to fit the CC ratio, with the input being the molecular fingerprints and descriptors of each molecule, and the output being the CC ratio. Notably, by limiting the maximum depth, the complexity of the binary regression tree is constrained, ensuring the interpretability of the model. Since the CC ratio is assumed to be solely related to molecular properties, data from both VS and VE can be combined when training the regression tree. For each solvent ratio, a binary regression tree can be trained, with all the trees for the commonly used solvent ratios forming a forest. The trained regression trees are represented using visualized tree root graphs, with more important conditions located closer to the roots. 
Visualization can enhance the interpretability and understanding of the model, allowing for intuitive insights into how molecular structures and properties affect the CC ratio and subsequently impact the chromatographic results. In addition to its explanatory capabilities, the regression tree model can also be used for prediction. Given a specific compound and solvent ratio, the CC ratio can be directly predicted using this model, allowing for rapid calculation of its precise retention volume via the formula.
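To make the tree construction concrete, the following didactic Python sketch fits a small binary regression tree with a mean absolute error criterion and a depth limit. It is not the implementation used in this work, whose software is not specified here; a library decision tree configured with the same depth and criterion would be the practical choice. The toy features (two fingerprint bits and one descriptor) and the toy CC ratios are invented for illustration.

```python
from statistics import median


def mae(values):
    """Mean absolute error when every value is predicted by the median."""
    m = median(values)
    return sum(abs(v - m) for v in values) / len(values)


def best_split(X, y):
    """Return the (feature, threshold) with the lowest weighted MAE,
    or None if no split improves on the unsplit node."""
    n, best, best_score = len(y), None, mae(y)
    for j in range(len(X[0])):
        # Dropping the largest value guarantees a non-empty right branch.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            score = (len(left) * mae(left) + len(right) * mae(right)) / n
            if score < best_score - 1e-12:
                best, best_score = (j, t), score
    return best


def fit_tree(X, y, max_depth=9, depth=0):
    """Recursively grow the tree; leaves store the median CC ratio."""
    split = best_split(X, y) if depth < max_depth and len(y) > 1 else None
    if split is None:
        return {"value": median(y)}
    j, t = split
    li = [i for i in range(len(y)) if X[i][j] <= t]
    ri = [i for i in range(len(y)) if X[i][j] > t]
    return {
        "feature": j, "threshold": t,
        "left": fit_tree([X[i] for i in li], [y[i] for i in li], max_depth, depth + 1),
        "right": fit_tree([X[i] for i in ri], [y[i] for i in ri], max_depth, depth + 1),
    }


def predict(tree, x):
    """Follow the decision path from the root to a leaf."""
    while "value" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["value"]


# Toy data: [fingerprint bit 0, fingerprint bit 1, descriptor] -> CC ratio.
X = [[0, 0, 10.0], [0, 1, 12.0], [1, 0, 40.0],
     [1, 1, 45.0], [0, 0, 11.0], [1, 0, 42.0]]
y = [0.4, 0.4, 0.9, 1.3, 0.4, 0.9]

tree = fit_tree(X, y, max_depth=9)
shallow = fit_tree(X, y, max_depth=1)
```

On this toy data the root splits on the first fingerprint bit, which mirrors the statement above that the most important conditions sit closest to the root; the depth-1 tree shows how the depth limit coarsens the prediction.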

Extrapolation to other column specifications

In this work, the relationship between TLC and CC is discovered within the experimental dataset from a 4 g column, which is expressed by Eqs. (1) and (2). Importantly, these equations can be generalized to other column types. Given that the underlying chromatographic principles remain consistent across different columns, the forms of Eqs. (1) and (2) can still be adopted. Additionally, it is assumed that the CC ratio does not vary with column type. Therefore, the task of generalization to other column types can be reformulated as a coefficient regression problem, which can be described as:

$${\bar{V}}_{S}=\left\{\begin{array}{c}\frac{r}{{a}_{1}\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+{b}_{1}},r \, > \, 0\\ {c}_{1},r=0\hfill\end{array}\right.$$
(7)
$${\bar{V}}_{E}=\left\{\begin{array}{c}\frac{r}{{a}_{2}\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+{b}_{2}},r \, > \, 0\\ {c}_{2},r=0\hfill\end{array}\right.$$
(8)
$${V}_{S}={\bar{V}}_{S}\cdot \varepsilon=\left\{\begin{array}{c}\frac{r\cdot \varepsilon }{{a}_{1}\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+{b}_{1}},r\, > \, 0\\ {c}_{1}\cdot \varepsilon,r=0\hfill\end{array}\right.$$
(9)
$${V}_{E}={\bar{V}}_{E}\cdot \varepsilon=\left\{\begin{array}{c}\frac{r\cdot \varepsilon }{{a}_{2}\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+{b}_{2}},r\, > \, 0\\ {c}_{2}\cdot \varepsilon,r=0\hfill\end{array}\right.$$
(10)

where a1, b1, c1, a2, b2, and c2 are the coefficients that need to be determined through the regression. In this study, we examined the use of tandem 4 g and 25 g columns. By leveraging the datasets collected from these columns, we perform nonlinear regressions on Eqs. (9) and (10) to calculate the coefficients for generalization. This approach circumvents the need for fine-tuning of the models, which is typically required in traditional transfer learning methods, and instead necessitates only a small amount of data for simple regression. This not only highlights the advantages of explicit equations but also demonstrates the reliability of the equation forms identified in this paper.
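The coefficient regression of Eqs. (9) and (10) can be sketched in a few lines of Python. The sketch is not the nonlinear regression performed in this study: for r > 0 it substitutes a linearized least-squares fit, exploiting the fact that r·ε/V = a·RF + b is linear in RF, and for r = 0 it fits c as a slope through the origin. This linearization weights the errors differently from a direct nonlinear fit. The function names and the synthetic records are hypothetical; the records are generated noise-free from the VS coefficients of the 25 g column in Eq. (6) purely to check that the fit recovers them.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def fit_coefficients(records):
    """Estimate (a, b, c) of Eq. (9) or (10) from (r, R_F, eps, V) records.

    r > 0:  V = r * eps / (a * R_F + b)  ->  r * eps / V = a * R_F + b
    r == 0: V = c * eps                  ->  least-squares slope through 0
    """
    positive = [(rf, r * eps / v) for r, rf, eps, v in records if r > 0]
    zero = [(eps, v) for r, rf, eps, v in records if r == 0]
    a, b = fit_line([p[0] for p in positive], [p[1] for p in positive])
    c = sum(e * v for e, v in zero) / sum(e * e for e, v in zero)
    return a, b, c


# Synthetic, noise-free records built from the V_S coefficients of Eq. (6).
A_TRUE, B_TRUE, C_TRUE = 0.022, 0.0027, 15.70
records = [
    (r, rf, eps, r * eps / (A_TRUE * rf + B_TRUE))
    for r, rf, eps in [(0.5, 0.1, 0.8), (0.9, 0.3, 1.0),
                       (0.98, 0.6, 1.2), (0.99, 0.8, 0.5)]
]
# For r == 0 the R_F entry is unused by the fit.
records += [(0.0, 0.9, eps, C_TRUE * eps) for eps in (0.5, 1.0, 1.5)]

a_fit, b_fit, c_fit = fit_coefficients(records)
```

Because only three coefficients are fitted per equation, a handful of measurements on the target column is sufficient in this idealized setting, which reflects the data efficiency argued above.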

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.