Explicit relation between thin film chromatography and column chromatography conditions from statistics and machine learning

Xu, Hao; Wu, Wenchao; Chen, Yuntian; Zhang, Dongxiao; Mo, Fanyang

doi:10.1038/s41467-025-56136-x


Article
Open access
Published: 19 January 2025


Nature Communications volume 16, Article number: 832 (2025)


Abstract

In chemistry, empirical paradigms prevail, especially within the realm of chromatography, where the selection of separation conditions frequently relies on the chemist’s experience. However, the underlying rationale for such experiential knowledge has not been established or analysed. This study explicitly elucidates how chemists use thin-layer chromatography (TLC) to determine column chromatography (CC) conditions, employing statistical analysis and machine learning techniques. An experimental dataset of the CC is generated from the automatic platform developed in this study. On this basis, an “artificial intelligence (AI) experience” is generated through a knowledge discovery framework, where the relationship between the retardation factor (R_F) value from TLC and retention volume from CC is unveiled in the form of explicit equations. These equations demonstrate satisfactory accuracy and generalizability, providing a scientific basis for the selection of the experimental conditions, and contributing to a better understanding of chromatography.


Introduction

As a pivotal preparative chromatographic technique in chemistry, column chromatography (CC) is useful in both qualitative and quantitative work, including the analysis, separation, and purification of substances¹. Its applications span medicine, the chemical industry, and biochemistry². Synthetic laboratories worldwide conduct a large volume of CC separations every day to purify synthesized compounds and isolate bioactive compounds from natural products³,⁴. However, the effectiveness of CC depends on a number of critical factors, such as the choice of mobile phases, impurities in the mixtures and compounds of interest, and column specifications, which are currently determined by the experimenter’s experience. In general, prior to CC, researchers often perform a thin-layer chromatography (TLC) analysis to determine what conditions should be used in CC.

As a pilot measure, the retardation factor (R_F value) obtained from TLC evaluates the relative polarity of the components in the mixture as compared to the mobile phases, which dominantly affects the separation efficiency achieved in CC. During actual operations, the proportion of mobile phases is usually tailored to maintain the R_F value of the compound of interest within the range of 0.2–0.3 (Fig. 1b). Although empirical, this insight has successfully reduced the requirement for repetitive trials, thereby enhancing separation efficiency. Consequently, it has gained widespread acceptance among the global chemistry community as a reliable method for determining the optimal separation conditions in CC. However, the rationale behind chemists’ experiential methods has not been established and analysed. This leads to the phenomenon of “to know what, but not why”, which impedes a deeper understanding of the chemical essence. This phenomenon essentially stems from the challenge of cross-scale modeling in chromatography.

As illustrated in Fig. S1, the dynamics of chromatography involve multiple scales ranging from microscopic to macroscopic⁵. For each scale, physicochemical mechanistic models have been established through explicit mathematical equations⁶,⁷,⁸,⁹, which accurately describe the chromatographic process. Nevertheless, the prediction of the chromatographic results, such as the R_F values and retention times, remains a challenge for mechanistic models because of the elusive coupling relationships between mechanisms at different scales. Therefore, when determining the separation conditions for chromatography, chemists rely more on expertise than on mechanistic models. The empirical insights can reflect the coupling relationships between the TLC and CC to some extent. However, the separation results of CC are influenced by a multitude of factors, including the choice of mobile phase, column type, and various other experimental settings. Consequently, the value of chemical expertise is limited, which may lead to the failure of CC separation within sophisticated contexts. Moreover, chemical expertise is difficult to replicate directly, since acquiring it requires repeated attempts under human guidance. This underscores the importance of transforming empirical insights into formalized knowledge, with the fundamental distinction lying in the provision of a clear rationale. In other words, it is imperative to elucidate, through quantitative methods, how the experimental conditions and results of TLC experiments (i.e., R_F values) influence the outcomes of CC, thereby offering a rationale for chemists to determine the separation conditions.

Statistical and machine learning techniques have been widely adopted in chemistry since they are able to address complex relationships in high-dimensional space¹⁰. Although various quantitative structure‒retention relationship (QSRR) models have been established to predict outcomes for a range of chromatographic techniques, including TLC, CC, and high-performance liquid chromatography (HPLC)¹¹,¹²,¹³,¹⁴,¹⁵,¹⁶, they often operate as “black boxes”, in which the learned relationship cannot be expressed by explicit concise equations¹⁷. This opacity limits their ability to provide interpretability, a crucial aspect for generating fresh chemical insights. Meanwhile, data issues are at the forefront of the challenge. Acquiring a profound understanding of the relationship between TLC and CC necessitates a copious amount of concordant experimental data. Fortunately, in our previous study, an automated TLC platform was established¹⁸, and a standardized TLC dataset was proposed for constructing a deep learning model to predict R_F values¹⁹. Nevertheless, for preparative chromatography techniques such as CC, which are time intensive (ranging from tens of minutes to several hours per experiment), it is impractical to gather an extensive dataset manually.

In this work, we endeavor to embrace a data-centered viewpoint and discern patterns directly from extensive experimental data like a chemist. In other words, we attempt to take a step further to discover the coupling relationship between the TLC and CC from experimental data, which is expressed in a concise equation-like form. This kind of equation is termed as “artificial intelligence (AI) experience” to differentiate it from the empirical formulas summarized by humans. This method elucidates the relationship between TLC and CC explicitly, thereby offering a rationale for chemists to better comprehend the determination of the separation conditions. Compared with human expertise, the “AI experience” obtains a closer alignment with the experimental data, and can be quickly replicated without the human learning process, which enhances its practicality.

An automatic platform is constructed to systematically measure the retention volume of 192 compounds under various experimental conditions, resulting in a comprehensive CC dataset of 5984 data points (Fig. 1c). The compounds are chosen according to their diversity and accessibility and include various types, such as ketones, aldehydes, and phenols. Notably, to investigate the relationship between the TLC and CC, the stationary phases in the two modes are kept identical, and the environmental and operational influences are controlled through automated experiments, which are detailed in the Methods section. Given the inconsistency between the existing TLC dataset and the generated CC dataset, surrogate models have been trained to produce consistent model predictions, thereby efficiently bridging the disparity between the two datasets.

With the experimental dataset, a knowledge discovery technique is proposed to discover an explicit “AI experience” from the machine learning models to describe the relationship between TLC and CC, which aids in explaining how chemists determine separation conditions (Fig. 1c). Through the discovered explicit statistical “AI experience”, we can directly estimate the potential intervals of retention volumes for each component in the mixture during CC through rapid TLC experiments and utilized mobile phases. This estimation allows for an assessment of the potential for successful separation under the given conditions, effectively eliminating the need for multiple time-consuming trials of CC. The proposed framework enables the generation of a dependable “AI experience” in the future, thereby enhancing our exploration of scientific inquiries.

Results

Column chromatography dataset from the automatic platform

Generating the “AI experience” that elucidates the correlation between the TLC and CC is fundamentally challenging because of the necessity of acquiring sufficient experimental data. In prior research¹⁹, an automated high-throughput TLC platform was established, enabling the measurement of R_F values for 387 compounds across various mobile phases. In this work, we focus on constructing a comprehensive CC dataset. However, given the substantial time and solvent consumption involved in CC processes, manually collecting abundant data is infeasible.

To address this challenge, two strategic approaches are employed. First, an automated platform for CC is developed, which integrates a suite of instrumentation to implement whole-process automation, including sample loading, mobile phase preparation, chromatographic separation, absorbance detection, and result analysis. The design of the automated CC platform is depicted in Fig. S2, and more details are provided in the Methods section. After an experiment is completed, the absorbance is read and processed by a computer, the peak is identified from the change in absorbance, and the start and end times are obtained from the peak. The identification algorithm is provided in the open source code. The retention volumes are calculated on the basis of the retention time and flow rate. All the processes are controlled and accomplished via a laptop. Notably, two retention volumes (V_S and V_E) are recorded. V_S represents the volume of the mobile phase when the compound is first detected, whereas V_E signifies the volume of the mobile phase when the compound has completely eluted from the column. Consequently, $V_E - V_S$ corresponds to the volume of the separated compound solution. This automation significantly reduces the dependency on manual labor, enabling continuous data collection during the day and night, and thereby enhancing experimental efficiency. Furthermore, it minimizes human-induced variability, thereby improving the consistency and accuracy of the collected data.
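The conversion from an absorbance trace to the two retention volumes can be sketched as follows. This is a minimal illustration, not the authors' identification algorithm (which is in their open source code): the fixed-threshold rule, the function name `retention_volumes`, the units (minutes, mL/min), and the sample trace are all assumptions made for the example.

```python
def retention_volumes(times_min, absorbance, flow_rate_ml_min, threshold):
    """Return (V_S, V_E) for the first contiguous run of samples above threshold.

    Hypothetical helper: a fixed threshold stands in for the real peak detector.
    Volumes are retention time multiplied by flow rate, as stated in the text.
    """
    start = end = None
    for t, a in zip(times_min, absorbance):
        if a > threshold:
            if start is None:
                start = t          # compound first detected
            end = t                # last sample still above threshold
        elif start is not None:
            break                  # peak has ended
    if start is None:
        return None                # no peak found
    return start * flow_rate_ml_min, end * flow_rate_ml_min


# Made-up trace, for illustration only.
times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
signal = [0.01, 0.02, 0.30, 0.80, 0.60, 0.20, 0.02, 0.01]
v_s, v_e = retention_volumes(times, signal, flow_rate_ml_min=10.0, threshold=0.1)
peak_volume = v_e - v_s            # volume of the separated compound solution
```

A production detector would typically smooth the signal and correct the baseline first; the threshold rule is kept here only to make the V_S and V_E definitions concrete.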

Nevertheless, commonly used preparative chromatography columns (e.g., 25 g and 40 g columns) have relatively large column lengths and internal diameters, and contain a relatively large amount of packing material. Consequently, they typically require tens of minutes to several hours to complete an experiment. Therefore, even with the adoption of automated platforms, the costs associated with the solvents and time remain substantial when data are collected on these columns. To mitigate this issue, we adopted a strategy that generates “AI experience” from the data acquired through 4 g preparative columns, and then extrapolates it to other column specifications, including tandem 4 g column and 25 g column. Although infrequently utilized, the 4 g column allows for controlled time and solvent volumes in the experiments, which facilitates the acquisition of a substantial amount of experimental data. This strategy is grounded in the fact that the fundamental principles of CC remain the same across divergent column specifications. Therefore, a total of 5984 data points were collected for 192 compounds under a variety of experimental conditions (Fig. 1), including the proportions of the mobile phase (Fig. S3), sample mass, and column specifications. As illustrated in Fig. S4a, within the dataset, the majority of the data are generated via 4 g columns (4999 data), whereas a small amount of data is obtained from two common column specifications, including the tandem 4 g column (457 data) and the 25 g column (528 data). Notably, data acquisition on these larger columns is markedly more time-consuming, thereby limiting the dataset size. For each data point, the retention volumes of the starting and ending points (V_S and V_E) were automatically identified from their respective absorbance curves, the distributions of which are illustrated in Fig. S4b.

Since the datasets for TLC and CC were independently acquired in distinct experiments, a simple one-to-one mapping between them cannot be established. A comparison of the TLC and CC datasets revealed that there were intersections, including 60 compounds present in both datasets (Fig. S4c). However, for the remaining compounds, the R_F values from TLC and the retention volumes from CC under identical eluent ratios are not simultaneously present. This lack of synchronicity between the two datasets poses a significant challenge in deciphering the interrelationship between TLC and CC.

The construction of surrogate models and alignment of asynchronous datasets

To overcome the disconnect between the independently acquired TLC and CC datasets, our methodology entailed the utilization of surrogate models for the generation of model predictions. This facilitated the extrapolation of unobserved values within the datasets, thus establishing a comprehensive and cohesive relationship between the two datasets.

Surrogate models, which function as black-box predictive models, can effectively substitute for the original datasets. Here, two surrogate models were trained on the TLC and CC datasets. Each surrogate model was constructed as a five-layer fully connected neural network comprising an input layer, three hidden layers, and an output layer, with each hidden layer composed of 256 neurons. The models take compound information and experimental conditions as inputs and output the experimental results. The output layer has one neuron for the TLC model (the R_F value) and two for the CC model (V_S and V_E). More details about the construction of surrogate models can be found in the Methods section. These surrogate models excel at learning and establishing intricate high-dimensional mappings between the input information and output results. This facilitates reliable predictions for a range of inputs, especially for the unobserved data points within the datasets.
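To make the stated architecture concrete, the sketch below lists the weight-matrix shapes implied by three hidden layers of 256 neurons and counts the trainable parameters. The input widths (189 for TLC, 192 for CC) are the values reported in the feature description of this section; the presence of a bias term on every layer and the helper names are assumptions, since the framework and layer details are not given here.

```python
def layer_shapes(n_inputs, n_outputs, n_hidden_layers=3, hidden_width=256):
    """Return (fan_in, fan_out) for each dense layer of the network."""
    widths = [n_inputs] + [hidden_width] * n_hidden_layers + [n_outputs]
    return list(zip(widths[:-1], widths[1:]))


def parameter_count(shapes):
    """Weights plus one bias per output unit (bias terms are an assumption)."""
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in shapes)


tlc_shapes = layer_shapes(n_inputs=189, n_outputs=1)   # predicts R_F
cc_shapes = layer_shapes(n_inputs=192, n_outputs=2)    # predicts V_S and V_E

print(tlc_shapes)                    # four dense layers connect the five layers
print(parameter_count(tlc_shapes))
print(parameter_count(cc_shapes))
```

Under these assumptions each model has roughly 1.8 × 10⁵ parameters, which is small enough to train on a dataset of a few thousand points when early stopping is used.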

Given the role of surrogate models in elucidating quantitative structure‒retention relationships (QSRR), the precise characterization of both compounds and experimental conditions is of paramount importance for enhancing the models’ predictive accuracy. Notably, owing to the sequential nature of the TLC and CC processes, they share a majority of the characteristic features. As delineated in Fig. 2a, a comprehensive array of 167-dimensional molecular fingerprints and 16 molecular descriptors were employed to characterize the molecular information. The details and descriptions of these molecular descriptors are systematically cataloged in Table S1. These descriptors were carefully selected on the basis of correlation analyses, highlighting the features with significant relevance to both the TLC and CC processes.

**Fig. 2: The results of surrogate models.**

In terms of experimental conditions, the R_F value in TLC is predominantly determined by the proportion of the mobile phase, whereas the outcomes of CC under the same mobile phase are influenced by additional factors such as sample mass, sample solvent type, and solvent volume, thereby making these features unique to the CC process. For the mobile phase, an averaged molecular description is adopted in the form of a 6-dimensional vector, which is detailed in the Methods section. Consequently, the input of the surrogate model is a 189-dimensional feature vector for TLC (167 fingerprint bits, 16 molecular descriptors, and 6 mobile-phase features) and a 192-dimensional vector for CC, which additionally includes the CC-specific features. More details about the characterization are provided in the Methods section. To train the surrogate models, the datasets are partitioned into 80% for training, 10% for validation, and 10% for testing. The models are trained for 10,000 epochs with a learning rate of $10^{-3}$. To prevent overfitting, early stopping was employed, where the epoch with the minimum validation error was taken as the best epoch. The validation R² values are 0.948 for the TLC model (R_F value) and 0.838 (V_S) and 0.898 (V_E) for the CC model. The predictive performance of both the TLC and CC surrogate models on the test dataset is illustrated in Fig. 2b. The results demonstrate the models’ satisfactory predictive capabilities, where the R² remains above 0.8 for the prediction of the R_F value and retention volume. This can be attributed to the high consistency of datasets sourced from the automated platform and the efficacy of feature characterization. The R² of $\Delta V$ is relatively low because it is derived from the predicted V_S and V_E, so the errors of both predictions accumulate.
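The data partition and the early-stopping rule described above can be sketched in a few lines. This is an illustrative outline only: the shuffling seed, the rounding of split sizes, the helper names, and the validation-error history are assumptions, and the actual training loop is omitted.

```python
import random


def split_indices(n_samples, seed=0):
    """Shuffle sample indices and split them 80% / 10% / 10%."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # seed is an arbitrary choice
    n_train = int(0.8 * n_samples)
    n_val = int(0.1 * n_samples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]          # remainder goes to the test set
    return train, val, test


def best_epoch(validation_errors):
    """Index of the epoch with the minimum validation error (first on ties)."""
    return min(range(len(validation_errors)), key=validation_errors.__getitem__)


train, val, test = split_indices(5984)        # size of the CC dataset
history = [0.90, 0.50, 0.30, 0.35, 0.40]      # made-up validation errors
chosen = best_epoch(history)                  # weights from this epoch are kept
```

In practice the model weights are checkpointed whenever the validation error improves, so that the parameters from the chosen epoch can be restored after training ends.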

On the basis of these accurate surrogate models, predictions of the R_F values and retention volumes for the same compound across varying eluent ratios can be conducted. This facilitates the generation of mobile phase-related curves, where datapoints are derived from model predictions rather than direct experimental data (Fig. 2c). Through the application of surrogate models, we effectively reconciled the asynchronous nature of the two datasets, enriching the pool of usable data and establishing a robust foundation for subsequent statistical analysis.

For each data point in the CC dataset, the R_F values under the corresponding mobile phase are obtained from the prediction via the TLC surrogate model. Thus, a unified dataset encompassing both R_F values and retention volumes is constructed. With this unified dataset, we can analyse the relationship between the R_F values and the corresponding retention volumes from a statistical perspective. Since the R_F value serves as an approximate measurement of the relative polarity of the compounds, we coarsely categorize it into different ranges, including 0–0.2, 0.2–0.4, 0.4–0.6, 0.6–0.8, and 0.8–1.0. Notably, this categorization criterion is rudimentary and solely intended for the basic differentiation of R_F values. The distribution of the retention times corresponding to the data within the different R_F value ranges is illustrated in Fig. 2d. The graph shows a conspicuous pattern: as the R_F value increases, the variance of the retention volume distribution decreases, and vice versa. This pattern indicates a latent, yet significant, relationship between the R_F values and retention volumes.
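The binning analysis described above can be sketched as follows. The bin edges come from the text; the treatment of values that fall exactly on an edge, the helper names, and the nine data points are assumptions introduced purely for illustration and are not taken from the dataset.

```python
from statistics import pvariance

BIN_EDGES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]


def rf_bin(rf):
    """Index of the R_F range; an edge value goes to the upper bin (assumption)."""
    for i in range(len(BIN_EDGES) - 1):
        if rf < BIN_EDGES[i + 1]:
            return i
    return len(BIN_EDGES) - 2      # rf == 1.0 falls in the last bin


def variance_by_bin(rf_values, volumes):
    """Population variance of the retention volumes within each R_F bin."""
    groups = {}
    for rf, v in zip(rf_values, volumes):
        groups.setdefault(rf_bin(rf), []).append(v)
    return {b: pvariance(vs) for b, vs in sorted(groups.items())}


# Made-up (R_F, retention volume) pairs that mimic the reported trend.
rf = [0.05, 0.15, 0.10, 0.50, 0.55, 0.45, 0.90, 0.95, 0.85]
vol = [40.0, 80.0, 60.0, 14.0, 18.0, 16.0, 6.0, 7.0, 6.5]
result = variance_by_bin(rf, vol)
```

With these illustrative numbers the variance shrinks from the lowest to the highest R_F bin, which is the qualitative pattern the paragraph describes for Fig. 2d.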

Moreover, the influence of the loading sample mass can also be investigated from the model predictions, as depicted in Fig. 2e. The figure shows that as the sample mass increases, both the variance and the deviation of the mean value also increase. Interestingly, the trend of the ratio for V_S and V_E is divergent. The mean ratio of V_S decreases to below 1, whereas the mean ratio of V_E increases to above 1. This discovery aligns with the experience that a large loading sample amount may result in column overloading, leading to an increased peak width and decreased resolution. More details can be found in Supplementary Information S1.

Rationale for the determination of the separation conditions

In this section, a rationale is established from experimental data via statistics and machine learning, which explicitly describes the latent relationship between the R_F values and CC outcomes (i.e., the retention volume).

Here, a methodological approach of knowledge discovery is adopted to probe the TLC‒CC nexus, utilizing the unified model predictions derived from trained surrogate models. From the results of the analyses presented in Fig. 2, a potential pattern between the R_F values and the distribution of retention times is observed. To further investigate this trend, we divided the range of R_F values into ten equal intervals of length 0.1, allowing for a more detailed examination. Additionally, the effects of the mobile phase proportion were investigated. Specifically, 10 commonly used proportions of petroleum ether (PE) and ethyl acetate (EA), ranging from 1:0 to 0:1, are studied. A higher proportion of PE indicates a lower polarity of the mobile phase, and vice versa. For each specified mobile phase proportion, model predictions can be generated, and the distributions of the retention volumes for all the compounds across different R_F value ranges can be computed and analysed. These distributions are visualized via box plots (Fig. 3a). Notably, the boxplots refer to the distribution of the retention volumes of compounds whose R_F values are in specific ranges under a given solvent eluent.

**Fig. 3: Discovery of the relationship between the retention volumes of column chromatography and the retardation factor (RF) values.**

Figure 3 shows several interesting findings. Within each proportion of the mobile phase, exemplified by PE:EA = 50:1 (V/V) (more examples are provided in Fig. S5), a trend is discernible, where higher R_F values are associated with a constricted range of retention volume fluctuations and a corresponding reduction in their mean values. This statistical observation is in alignment with chemical intuition, since a higher R_F value corresponds to a sample with a smaller polarity, where the retention volume is usually small. Moreover, for an identical R_F value range, the retention volumes of mobile phases with larger polarities are discovered to have a smaller variance. This implies that, for an identical R_F value range, a larger mobile phase polarity will result in a lower uncertainty. Consequently, from a statistical perspective, the retention volume is correlated with both the R_F value and the eluent ratio. Through this statistical approach, a quantitative analysis of the selection of separation conditions can be conducted.

Let us assume that the mixture contains the desired product A and an impurity B, for simplicity in the analysis. In TLC experiments, the respective R_F values of both compounds can be obtained simultaneously. Given the mobile phase proportion, and the corresponding R_F values, the distribution range of their retention volumes (V_S and V_E) on CC can be determined from Fig. 3a. Evidently, complete separation is only achievable when either the V_S of product A is larger than the V_E of impurity B, or the V_S of impurity B is larger than the V_E of product A. By combining the distribution ranges of A’s and B’s respective retention volumes, the separation index for this scenario can be calculated, which is defined as the ratio of the length of the distribution range of the retention volumes in complete separation to the length of the overall distribution range. The detailed definitions and calculation process can be found in Supplementary Information S2. Here, a larger separation index corresponds to a greater possibility of separation in CC. For each mobile phase proportion, the corresponding separation index for A and B in different R_F value intervals can be calculated, resulting in a matrix of separation indices. By taking the average of the separation index matrices corresponding to the eight commonly used mobile phase proportions, a statistical separation index matrix can be obtained, which is displayed in Fig. 3b.
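The complete-separation condition stated above translates directly into a predicate. The second function below is only a rough stand-in that counts how often the condition holds over sampled (V_S, V_E) pairs; it is not the separation index of the paper, whose exact definition over distribution ranges is given in Supplementary Information S2. All names and numbers are hypothetical.

```python
def completely_separated(vs_a, ve_a, vs_b, ve_b):
    """True when the elution windows of A and B do not overlap."""
    return vs_a > ve_b or vs_b > ve_a


def separable_fraction(samples_a, samples_b):
    """Fraction of (A, B) sample pairs that are completely separated.

    Illustrative proxy only; NOT the separation index defined in the paper.
    Each sample is a (V_S, V_E) tuple.
    """
    pairs = [(a, b) for a in samples_a for b in samples_b]
    hits = sum(completely_separated(a[0], a[1], b[0], b[1]) for a, b in pairs)
    return hits / len(pairs)


# Made-up elution windows for a product A and an impurity B.
product_a = [(30.0, 45.0), (35.0, 50.0)]
impurity_b = [(10.0, 20.0), (15.0, 32.0)]
fraction = separable_fraction(product_a, impurity_b)
```

The predicate is symmetric in A and B, reflecting that it does not matter which component elutes first as long as the two windows are disjoint.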

In Fig. 3b, lighter red indicates smaller separation indices, whereas deeper red indicates larger ones. Interestingly, a distinct pattern emerges. The separation indices corresponding to the lower right corner of the matrix are always relatively small. This suggests that when both R_F values exceed 0.5, the likelihood of separation is low, even with a significant difference in the R_F values of both compounds. Moreover, when one compound has a small R_F value (especially less than 0.3), the separation index is generally large, and a greater difference in R_F values between the two compounds increases the possibility of separation. Figure 3a shows that a smaller R_F value corresponds to a larger mean value of the retention volume distribution, indicating that a longer time is required for separation. Therefore, maintaining the R_F value of the desired product within the range of 0.2 to 0.3 is optimal for separation. Through the separation index matrix, we provide a statistical rationale for the empirical determination of the separation conditions in CC. Additionally, we have observed that if there is no discernible difference in R_F values among the components of the mixture in TLC, they are most likely not separable in CC.

Furthermore, we attempt to express the inherent relationship between R_F values and retention volumes through explicit formulas. As illustrated in Fig. 3c, the mean values of the retention volume distribution across divergent mobile phase proportions and the R_F values were calculated. The figure shows that the mean retention volume follows an inversely proportional relationship, where an increase in the mobile phase proportion precipitates a decrease in the inverse proportionality coefficient. To elucidate this complex relationship explicitly, the symbolic regression algorithm PySR is employed, which facilitates the derivation of an exceptionally concise and accurate equation²⁰. The number of iterations is 100, the utilized operators are +, ×, and /, and the criterion is the mean squared error. To discover a concise equation, only the outcomes with complexities smaller than 10 are considered. Both the structural complexity and regression loss are considered when choosing the best equation. The discovered equations are written as:

$$\bar{V}_{S}=\begin{cases}\dfrac{r}{0.147\cdot \mathrm{R}_{\mathrm{F}}+0.0114}, & r>0\\ 5.147, & r=0\end{cases}$$

(1)

$$\bar{V}_{E}=\begin{cases}\dfrac{r}{0.069\cdot \mathrm{R}_{\mathrm{F}}+0.0054}, & r>0\\ 10.98, & r=0\end{cases}$$

(2)

$$r=\frac{{r}_{PE}}{{r}_{PE}+{r}_{EA}}$$

(3)

where ${\bar{V}}_{S}$ and ${\bar{V}}_{E}$ refer to the mean retention volume, and r is the proportion of PE in the eluent. The R² values of the symbolic regression for ${\bar{V}}_{S}$ and ${\bar{V}}_{E}$ are 0.882 and 0.900, respectively. Notably, the resemblance in form between ${\bar{V}}_{S}$ and ${\bar{V}}_{E}$ suggests a potential inner relationship, given that the symbolic regression was performed independently during the equation discovery process. Interestingly, ${\bar{V}}_{E}$ is approximately twice the value of ${\bar{V}}_{S}$ in the 4 g column. Fundamentally, this can be rationalized by the fact that a larger V_S frequently signifies a slower solute outflow, which typically coincides with a higher V_E. Traditionally, experimenters may discern this trend, yet the discovered equations can clearly illustrate this pattern.
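Equations (1) to (3) are simple enough to evaluate directly. The sketch below transcribes them as written for the 4 g column; the function names and the example R_F value are illustrative choices, and the results are in whatever volume units the retention volumes of the dataset use.

```python
def pe_fraction(r_pe, r_ea):
    """Eq. (3): proportion of PE in a PE:EA eluent."""
    return r_pe / (r_pe + r_ea)


def mean_vs(rf, r):
    """Eq. (1): mean retention volume at which elution starts."""
    if r == 0:
        return 5.147
    return r / (0.147 * rf + 0.0114)


def mean_ve(rf, r):
    """Eq. (2): mean retention volume at which elution ends."""
    if r == 0:
        return 10.98
    return r / (0.069 * rf + 0.0054)


r = pe_fraction(50, 1)          # PE:EA = 50:1 (V/V)
vs = mean_vs(0.25, r)           # example R_F of 0.25, chosen for illustration
ve = mean_ve(0.25, r)
print(round(vs, 2), round(ve, 2), round(ve / vs, 2))
```

Because both coefficients of Eq. (2) are a little under half of those in Eq. (1), the ratio of the two means stays close to 2.1 for every R_F value when r > 0, which is the "approximately twice" relationship noted above.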

These discovered equations are concise and practically significant, and unravel the statistical interconnection between the R_F values and retention volumes, translating conventional experiential insights into a quantifiable, explicit formula. By applying this formula, it becomes feasible to predict the range of outcomes in CC on the basis of the preliminary TLC findings, thus substantially facilitating the determination of the potential for a successful separation under the given conditions. Considering that TLC usually only takes a few minutes, while CC requires more time and solvent, the proposed equation can improve the efficiency of the CC process.

Model generalization across different compounds and column specifications

Utilizing advanced knowledge discovery methodologies, we formulate equations that correlate the R_F values with the mean retention volumes ${\bar{V}}_{S}$ and ${\bar{V}}_{E}$ in Eqs. (1) and (2), offering a statistical interpretation of the macroscopic relationship between the TLC and CC. However, even under identical R_F values, diverse compounds and column specifications influence the retention volume. In this section, we aim to quantify this influence to achieve a more generalized and accurate “AI experience”. To quantify the influence of compounds, a variable named the CC ratio ε, which is defined as the ratio of the actual retention volume V of a compound to the predicted average retention volume $\bar{V}$ under given experimental conditions, is proposed in this study.

$$\varepsilon=\frac{V}{\bar{V}}$$

(4)

Notably, under a fixed mobile phase proportion, the CC ratio ε depends exclusively on the structural and property characteristics of the compound. Nonetheless, deriving a direct equation for ε is challenging because of the representation of the compound features by high-dimensional vectors composed of molecular fingerprints and descriptors. Therefore, this study employs a visual tree structure to represent the relationship between the CC ratio ε and the compounds. Owing to the binary nature of molecular fingerprints, tree structures are particularly suitable for depicting this relationship. For each mobile phase proportion, a regression tree is trained on the model predictions, establishing a correlation between the CC ratio and the compound’s structural and property attributes. Regression trees in all the mobile phase proportions collectively form a forest. The maximum depth of the regression tree is 9, the minimum leaf depth is 1, and the criterion is the mean absolute error.

In this context, the visual regression trees for PE:EA ratios of 50:1 (V/V) (Fig. 4a) and 1:1 (V/V) (Fig. 4b) are exemplified via dendrograms. For enhanced clarity, the CC ratio ε is divided into three categories, namely $\varepsilon \le 0.5$, $0.5 < \varepsilon \le 1$, and $\varepsilon > 1$, each indicated by a distinct color. Variables closer to the roots have greater importance and correlation, whereas those closer to the leaves have less importance and correlation. The dendrogram’s branching pattern distinctly separates different ε values, clearly illustrating the decision logic of the regression trees, and the influence of pertinent variables on the CC ratio ε. The figure reveals that the molecular descriptors AATSC0P and TPSA are dominant. AATSC0P, which was previously utilized in predicting liquid chromatography retention times without an elucidated mechanism²¹, was found to inversely correlate with the CC ratio ε. Conversely, TPSA, known for its strong correlation with molecular polarities and TLC outcomes¹⁹, demonstrates direct proportionality with ε.
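The CC ratio of Eq. (4) and the three display categories used in the dendrograms can be written compactly as below. The function names, category labels, and numeric inputs are illustrative assumptions; only the definition of ε and the category boundaries come from the text.

```python
def cc_ratio(v_actual, v_mean):
    """Eq. (4): actual retention volume divided by the predicted mean."""
    return v_actual / v_mean


def cc_ratio_category(eps):
    """Assign eps to one of the three ranges used for coloring the trees."""
    if eps <= 0.5:
        return "eps <= 0.5"
    if eps <= 1.0:
        return "0.5 < eps <= 1"
    return "eps > 1"


eps = cc_ratio(v_actual=15.0, v_mean=20.0)    # made-up volumes
label = cc_ratio_category(eps)
```

A value of ε below 1 means the compound elutes earlier than the statistical mean for its R_F value and eluent, and a value above 1 means it elutes later.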

**Fig. 4: The adaptation of the model to diversified compounds.**

In contrast to neural network models, the simplicity of the regression tree models permits a visual representation, thereby offering superior interpretability. Additionally, limiting the regression tree’s maximum depth automatically selects the most relevant variables from the high-dimensional data, thereby achieving a sparse representation. However, regression trees often fall short in modeling complex relationships, and often exhibit suboptimal predictive capabilities. In this study, the regression tree model is developed on the basis of insights from Eqs. (1) and (2), thereby reducing relationship complexity for improved data fitting, and enhancing both interpretability and chemical relevance. Table S2 demonstrates the accuracy of the CC ratio ε predicted by the regression trees across different mobile phase proportions, confirming their effectiveness in learning the latent relationships between ε and compounds.

With the predicted ε, the retention volumes can be directly calculated from the utilized mobile phase proportion and the corresponding R_F values in the TLC through Eqs. (1) and (2). Compared with direct tree model training, this approach not only improves precision but also augments chemical interpretability (Fig. S6). While the regression tree model’s prediction accuracy (R² = 0.678) lags slightly behind that of the previous surrogate models, the model is fully interpretable, which is the priority here. This interpretability offers insights into the intricacies of the CC process.

The equations can be easily extrapolated to other column specifications with a small amount of data through a transfer learning strategy, which recalibrates the coefficients of the equations learned from the symbolic regression with the data from the target domain. Compared with fine-tuning machine learning models, this strategy has greater computational efficiency since it only needs a simple regression. Crucially, because the column specifications do not alter the fundamental principles of chromatography, the inverse proportional statistical relationship encapsulated by these equations is expected to remain valid, with the inverse proportionality coefficient varying in accordance with the column specification. Moreover, the CC ratio, which depends on the compound characteristics, should remain consistent across the different column specifications. Therefore, extrapolation to other column specifications can be accomplished by fitting the coefficients in Eqs. (1) and (2), a process detailed in the Methods section.

Here, the tandem 4 g column and 25 g column are taken as examples. As depicted in Fig. 5a, b, the straightforward application of Eqs. (1) and (2) to these distinct cases is impractical, as it would yield substantial errors, particularly when the column specification disparity is pronounced. After recalibrating the coefficients in Eqs. (1) and (2) with data from the target domain, the modified equations can be written as:

$${V}_{S}^{4g+4g}=\left\{\begin{array}{c}\frac{r\cdot \varepsilon }{0.055\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+0.0062},r \, > \, 0\\ 7.832\cdot \varepsilon,r=0\end{array},\right.{V}_{E}^{4g+4g}=\left\{\begin{array}{c}\frac{r\cdot \varepsilon }{0.031\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+0.0036},r \, > \, 0\\ 17.61\cdot \varepsilon,r=0\end{array},\right.$$

(5)

$${V}_{S}^{25g}=\left\{\begin{array}{c}\frac{r\cdot \varepsilon }{0.022\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+0.0027},r \, > \, 0\\ 15.70\cdot \varepsilon,r=0\end{array},\right.{V}_{E}^{25g}=\left\{\begin{array}{c}\frac{r\cdot \varepsilon }{0.013\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+0.0016},r \, > \, 0\\ 26.81\cdot \varepsilon,r=0\end{array},\right.$$

(6)

where ${V}_{S}^{4g+4g}$ and ${V}_{E}^{4g+4g}$ refer to the predicted retention volumes for the tandem 4 g column, and ${V}_{S}^{25g}$ and ${V}_{E}^{25g}$ refer to the predicted retention volumes for the 25 g column. The predictive capabilities of the modified equations are depicted in Fig. 5a, b. The equations predict the retention volumes well, although the performance is somewhat worse than for the 4 g column. This may be because less data are available for the larger columns, and the difference between the large column and the small column is notable, which affects the transfer learning. From an analytical standpoint, as the mass of the column packing increases, the coefficients in the denominator decrease, indicating larger retention volumes.
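As an illustration, Eqs. (5) and (6) can be evaluated with a few lines of Python. The sketch below is not taken from the authors' repository: the coefficients are copied from the equations above, whereas the function names, the dictionary layout, and the example inputs are assumptions made for illustration. The variable `r` is the mobile-phase proportion variable used in Eqs. (1) and (2), and `eps` is the CC ratio ε.

```python
# Coefficients (a, b, c) copied from Eqs. (5) and (6); the dictionary layout
# and all names below are illustrative assumptions.
COEFFS = {
    "4g+4g": {"start": (0.055, 0.0062, 7.832), "end": (0.031, 0.0036, 17.61)},
    "25g": {"start": (0.022, 0.0027, 15.70), "end": (0.013, 0.0016, 26.81)},
}


def retention_volume(column, which, r, rf, eps):
    """Evaluate V = r*eps / (a*R_F + b) for r > 0, or V = c*eps for r = 0."""
    a, b, c = COEFFS[column][which]
    if r < 0:
        raise ValueError("r must be non-negative")
    if r == 0:
        return c * eps
    return r * eps / (a * rf + b)


def retention_window(column, r, rf, eps):
    """Return (V_S, V_E) for one compound under one mobile phase."""
    return (
        retention_volume(column, "start", r, rf, eps),
        retention_volume(column, "end", r, rf, eps),
    )


if __name__ == "__main__":
    # Example inputs are arbitrary and only exercise both branches.
    print(retention_window("25g", r=0, rf=0.3, eps=1.0))
    print(retention_window("4g+4g", r=0.5, rf=0.2, eps=1.0))
```

Keeping the three coefficients per column in a single table mirrors the transfer strategy described above: moving to a new column specification only changes the table entries, not the functional form.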

**Fig. 5: The adaptation of the model to diverse column specifications.**

Discussion

This study provides a rationale for the determination of the separation conditions from the perspective of statistics through machine learning techniques. To guarantee the effectiveness and generalizability of the discovered relationship between TLC and CC, automated platforms are established to conduct standardized experiments, where the stationary phase, humidity, temperature, and other environmental and operational influences are controlled. Surrogate models are constructed to align the asynchronous experimental datasets for TLC and CC. Our work introduces a separation index matrix that quantitatively explains the chemical expertise that the R_F value of the compound of interest should be tailored to 0.2–0.3 in TLC. Furthermore, the “AI experience” is extracted from the experimental data obtained from automatic platforms via knowledge discovery frameworks, where explicit equations that describe the interplay between R_F values and retention volumes are identified for the first time. This process converts conventional chemical insights into formalized equations, enabling quantitative predictions of the outcomes of CC on the basis of preliminary TLC data. The equations can be easily extrapolated to other column specifications via a simple nonlinear regression of coefficients. This advancement not only deepens the understanding of chromatographic separation but also improves experimental efficiency.

Our work is an application of knowledge discovery in experimental chemistry. Prior efforts in knowledge discovery have focused mainly on unearthing complex partial differential equations in the realm of physics²²,²³,²⁴, which involve fewer variables that embody intricate interrelations and are often expressed in differential and integral forms. In contrast, a unique challenge arises in the realm of chemistry, in which molecular complexity involves a vast array of variables. Each variable has a simpler relationship with the outcomes, but the overall complexity is derived from their collective interactions. Directly applying traditional knowledge discovery approaches in this context is impractical. In our study, we utilized the inherent correlation between TLC and CC outcomes, employing R_F values as pivotal variables to uncover statistical equations, which effectively bypasses the challenges posed by molecular complexity. Notably, the knowledge discovery technique is not intended to replace the mechanism model but rather to complement it. Their common goal is to obtain an explicit equation to describe chemical phenomena.

In the domain of cheminformatics, deep learning techniques have been prevalently applied for molecular property prediction, achieving notable successes¹¹,¹²,¹³,¹⁴,²⁵. However, these models often suffer from a lack of interpretability, which hinders a deeper understanding of chemical phenomena and limits their utility in specific scenarios. Thus, the interpretable modeling approach introduced in our research represents a notable advancement. We managed to extract explicit equations from the experimental data to construct an interpretable model. While there are concessions in precision, it can enhance our comprehension of the relationship between TLC and CC, making it more applicable in experimental contexts. Specifically, the constructed TLC and CC surrogate models can provide accurate predictions of the R_F values and retention times, which can guide the experiments. Moreover, the discovered “AI experience” provides explicit equations to bridge TLC and CC, which can facilitate chromatographic separation. Furthermore, the field of chemistry contains a variety of empirical formulas and chemical expertise for which it is difficult to model potential cross-scale relationships through theoretical deduction. Our method is therefore promising for discovering more “AI experience” in diverse problems.

Nevertheless, several aspects of this research warrant further enhancement. First, owing to the time-intensive nature of chromatographic processes, we developed an automated platform for data collection to minimize the involvement of human resources and maximize efficiency. However, the dataset size, especially for larger columns, is still limited due to time and resource constraints. Moreover, despite our efforts to encompass a diverse range of compounds, the selection of 192 compounds in CC does not cover the extensive spectrum of chemical compounds available. A more diverse dataset would be useful in refining the precision of our interpretable model. Despite these limitations, our research provides invaluable insights into the field of chromatography. We anticipate that this framework will generate a more reliable “AI experience” to facilitate studies of nature and science.

Methods

Automatic column chromatographic platform

In this work, an automated CC platform is developed to measure the retention volumes of different compounds under various experimental conditions to construct a CC dataset. The developed automated CC platform offers greater flexibility and comprehensive automation, enabling fully automated CC analysis and significantly enhancing data collection efficiency.

As shown in Fig. S2, the automated sample loader is one of the core components facilitating the automation process. For the whole automated process, prepared samples are dissolved in the loading solvent, while the eluent is configured onsite through an infusion pump according to a preset ratio. The mixtures of standard solutions are then injected into the column via the automated sample loader, and the eluent is pumped into the column at a predetermined flow rate to perform CC. The detector continuously monitors the absorbance, and terminates the experiment automatically once the solute is fully collected.

The automatic CC system consists of two precision medium-speed infusion pumps equipped with a syringe-based solution mixer and an ultraviolet (UV) detector (Fig. S2). All the devices are connected by tubing. The prepared mixtures of standard solutions are placed in the syringe sample tray. They are then loaded via a syringe needle and transported to the chromatography column under pump pressure. The two pumps deliver the corresponding eluents at set flow rates, which are mixed in the solution mixer. The standard solutions are then eluted and separated, and the UV signal of the effluent solution during the elution process is recorded by a UV detector. In the auto CC system, the start time (t_S) and end time (t_E) of the sample separation are automatically calculated on the basis of recorded raw data via recognition algorithms. After each experiment, the automated platform proceeds to clean the column, and initiates the next experiment automatically under different conditions.

For each compound, eight experiments are conducted using different eluent ratios, namely, PE:EA = 1:0, 100:1, 50:1, 10:1, 5:1, 2:1, 1:1, and 0:1 (V/V), yielding eight sets of data from the UV detector. To calculate the target data, the raw data are first converted into the corresponding absorbance values by translating the complete 16-byte commands. The start and end times are then determined on the basis of the magnitude of the absorbance. At the end of each test, the eluent ratio is changed and maintained for approximately 30 s. Subsequently, the syringe automatically cleans the needle before proceeding to the next sample. To eliminate the influence of stray peaks, we analyse the data using a window width equivalent to 20 s. The calculated start and end times are multiplied by the flow rate to obtain the retention volumes V_S and V_E.
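The recognition step can be sketched as follows. This is one plausible reading of the description above, not the authors' recognition algorithm: the absorbance threshold, the uniform sampling interval, the flow-rate unit (mL/min), and all function names are assumptions. A run of above-threshold samples is accepted as a genuine peak only if it persists for at least the 20 s window; shorter runs are treated as stray peaks.

```python
def find_elution_window(absorbance, dt_s, threshold, window_s=20.0):
    """Return (t_start, t_end) in seconds of the sustained signal, or None.

    A run of consecutive samples above `threshold` counts as a real peak only
    if it lasts at least `window_s` seconds; shorter runs are ignored.
    """
    min_len = max(1, round(window_s / dt_s))
    runs = []          # (first_index, last_index) of each above-threshold run
    run_start = None
    for i, a in enumerate(absorbance):
        if a > threshold and run_start is None:
            run_start = i
        elif a <= threshold and run_start is not None:
            runs.append((run_start, i - 1))
            run_start = None
    if run_start is not None:
        runs.append((run_start, len(absorbance) - 1))
    kept = [(s, e) for s, e in runs if e - s + 1 >= min_len]
    if not kept:
        return None
    return kept[0][0] * dt_s, kept[-1][1] * dt_s


def retention_volumes(t_start_s, t_end_s, flow_ml_per_min):
    """Convert times (s) to volumes (mL): V = t * flow rate."""
    per_s = flow_ml_per_min / 60.0
    return t_start_s * per_s, t_end_s * per_s


if __name__ == "__main__":
    # Synthetic trace: a 3 s stray spike followed by a 40 s peak, 1 sample/s.
    trace = [0.0] * 30 + [1.0] * 3 + [0.0] * 27 + [1.0] * 40 + [0.0] * 20
    window = find_elution_window(trace, dt_s=1.0, threshold=0.5)
    print(window, retention_volumes(*window, flow_ml_per_min=6.0))
```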

For TLC, silica gel plates from the Yinlong brand are utilized, and the stationary phase is standard silica with an average particle size of 40 μm. For CC, columns from the Agela brand are utilized, and the stationary phase is also standard silica with an average particle size of 40 μm. In the TLC automated platform, the whole process is accomplished by a collaborative robot to guarantee the precision of the measured R_F values. The development is conducted in a square flat-bottom developing chamber, and the developing time is 300 s. Specifically, the robot can finely control the spotting procedure, which keeps the spot as compact as possible. Moreover, the developed TLC plate is photographed, and a computer vision algorithm is adopted to identify the geometric center of each spot and calculate the R_F values. To guarantee the quality of the data, human verification is adopted to correct TLC plates that are incorrectly identified by the computer. For more details of the automated TLC platform, refer to the protocol¹⁸.

Surrogate model construction

In this work, two fully connected artificial neural networks are utilized to construct surrogate models to fit the TLC and CC datasets and generate model predictions. For model construction, characterization of the compounds and experimental settings is crucial. The compounds are represented by the Molecular Access System (MACCS) keys with 167 dimensions, where each dimension is a binary code that denotes the existence of certain substructures and properties. In addition to the molecular fingerprint, molecular descriptors are also employed to describe the overall properties of the compounds. In this work, 16 molecular descriptors are selected from thousands of candidate descriptors according to the correlation coefficients. For the mobile phase, which is composed of PE and EA, the averaged descriptor of the mobile phase is utilized. Here, the molecular weight (MW_e), topological polar surface area (TPSA_e), number of rotatable bonds (NRotB_e), number of hydrogen bond donors (HBD_e), number of hydrogen bond acceptors (HBA_e), and lipid‒water partition coefficient (LogP_e) of PE and EA are averaged, weighted by the solvent proportions, to obtain the descriptors of the mobile phase, as illustrated in Fig. S7. The abbreviations and meanings of the descriptors are provided in detail in Table S1. Therefore, for the surrogate model of TLC, the input is the combination of molecular fingerprints and descriptors with 189 dimensions, and the output is the R_F value. For the surrogate model of CC, the input is similar to that of TLC, but the experimental conditions, including the sample mass, sample solvent type, and solvent volume, are incorporated into the input vector, which has 192 dimensions. The outputs are the retention volumes V_S and V_E. Both neural networks have five layers, including an input layer, three hidden layers, and an output layer. Considering that the R_F value is between 0 and 1, an additional Sigmoid layer is employed in the TLC surrogate model to guarantee that the output conforms to the range. The activation function is LeakyReLU. The datasets are partitioned into 80% for training, 10% for validation, and 10% for testing. The number of training epochs is 10,000. To prevent overfitting, early stopping is employed on the basis of the validation error. To facilitate training, the input vector is standardized via max-min standardization.
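The feature preparation described above can be sketched in plain Python. The block below is a minimal illustration under stated assumptions: the descriptor values for PE and EA are placeholders rather than values from Table S1, the weights are taken to be the volume fractions of the two solvents, and the split of the 189 TLC input dimensions into 167 + 16 + 6 is inferred from the counts given in the text. All names are hypothetical.

```python
# Input sizes: 189 and 192 are stated in the text; the 167 + 16 + 6 split
# is an inference from the counts given there.
N_MACCS, N_MOL_DESC, N_ELUENT_DESC, N_CC_CONDITIONS = 167, 16, 6, 3
TLC_INPUT_DIM = N_MACCS + N_MOL_DESC + N_ELUENT_DESC
CC_INPUT_DIM = TLC_INPUT_DIM + N_CC_CONDITIONS


def mobile_phase_descriptors(pe, ea, pe_parts, ea_parts):
    """Proportion-weighted average of per-solvent descriptor dicts."""
    total = pe_parts + ea_parts
    if total <= 0:
        raise ValueError("at least one solvent must be present")
    w_pe, w_ea = pe_parts / total, ea_parts / total
    return {k: w_pe * pe[k] + w_ea * ea[k] for k in pe}


def min_max_scale(rows):
    """Scale each column of a list of equal-length rows to [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Placeholder descriptor values, NOT taken from the paper.
PE = {"MW": 80.0, "TPSA": 0.0}
EA = {"MW": 90.0, "TPSA": 25.0}

if __name__ == "__main__":
    print(mobile_phase_descriptors(PE, EA, pe_parts=1, ea_parts=1))
    print(min_max_scale([[1.0, 5.0], [3.0, 5.0], [2.0, 5.0]]))
```

Constant columns are mapped to zero in this sketch to avoid division by zero; how the original code handles that case is not stated in the text.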

In this research, the generation of model predictions serves as a methodological approach to simulate unobserved experiments, thereby bridging the inconsistency between the TLC and CC datasets. Specifically, each data point in the CC dataset refers to a specific experimental scenario involving a target compound under predefined conditions. Considering that a critical factor influencing the separation efficacy of CC is the proportion of solvents in the mobile phase, the generation of model predictions focuses on different solvent ratios, including PE:EA = 1:0, 100:1, 50:1, 10:1, 5:1, 2:1, 1:1, and 0:1 (V/V). For each molecule, while keeping the other experimental conditions unchanged, only the solvent ratio is changed to generate different conditions, which are then input into the CC surrogate model to obtain the predicted retention volumes. Notably, the experimental data for these solvent ratios might not exist in the CC dataset, implying that the study effectively simulates and predicts experimental outcomes under these conditions. Therefore, predictions are also made for the same compounds and solvent ratios in the TLC dataset to generate TLC model predictions.

The generation of model predictions from both TLC and CC surrogate models enables the simultaneous prediction of R_F values and retention volumes for the compounds studied under a spectrum of solvent ratios, which is challenging to obtain directly from the datasets. Therefore, this model prediction generation approach is instrumental in laying the groundwork for subsequent knowledge discovery and analysis. In quantitative terms, 4680 model predictions are generated, including 585 experimental conditions evaluated across 8 solvent ratios.
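The alignment procedure amounts to querying both surrogate models on an identical grid of inputs. A minimal sketch is given below, in which the surrogate models are represented by arbitrary callables; the function names and the dictionary-based record format are assumptions made for illustration and are not taken from the published code.

```python
from itertools import product

# Solvent ratios PE:EA (V/V) listed in the text, as (PE parts, EA parts).
SOLVENT_RATIOS = [(1, 0), (100, 1), (50, 1), (10, 1), (5, 1), (2, 1), (1, 1), (0, 1)]


def build_prediction_grid(conditions, ratios=SOLVENT_RATIOS):
    """Pair every experimental condition with every solvent ratio."""
    return [dict(cond, pe=pe, ea=ea) for cond, (pe, ea) in product(conditions, ratios)]


def align_predictions(grid, tlc_model, cc_model):
    """Query both surrogates on identical inputs so R_F and V line up."""
    rows = []
    for item in grid:
        v_s, v_e = cc_model(item)
        rows.append({**item, "rf": tlc_model(item), "v_s": v_s, "v_e": v_e})
    return rows


if __name__ == "__main__":
    # 585 conditions x 8 solvent ratios gives the 4680 predictions in the text.
    conditions = [{"condition_id": i} for i in range(585)]
    print(len(build_prediction_grid(conditions)))
```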

Symbolic regression

In this work, the statistical analysis of the generated model predictions reveals a discernible pattern correlating the retention volume, the R_F value, and the ratio of the developing agent. These observed relationships align with the established knowledge in the field of chemistry, yet defined mathematical expressions are lacking to characterize these interdependencies succinctly. Consequently, this study employs symbolic regression as a methodological tool for knowledge discovery, aiming to unearth the most fitting explicit mathematical formulations that capture the essence of these relationships. Symbolic regression integrates a genetic programming algorithm with advanced optimization techniques, ensuring both efficiency and robustness in knowledge discovery. The genetic programming process simulates biological evolution to iteratively evolve mathematical expressions, optimizing their fit to the data. This process begins with a diverse set of random expressions described by symbolic trees, which undergo continuous evolution through operations such as crossover and mutation. Regularization is adopted to balance the exploration of new solutions with the refinement of the existing ones, thus preventing the algorithm from stagnating in a local optimum. In this work, symbolic regression is implemented via the Python package PySR. This tool is simple and user-friendly, and is well suited to chemical problems with numerous variables, but it is less effective with complex equations. For problems involving partial differential equations, methods such as DLGA or DISCOVER are recommended²⁴,²⁶.
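For orientation, a symbolic regression run with PySR might be set up as sketched below. The search settings (number of iterations, operator set, maximum expression size) and the choice of input columns are illustrative assumptions and do not reproduce the configuration used in this study. PySR is imported inside the function because it requires a separate installation with a Julia backend.

```python
# Search settings are illustrative assumptions, not the authors' configuration.
SEARCH_CONFIG = {
    "niterations": 40,
    "binary_operators": ["+", "-", "*", "/"],
    "maxsize": 15,
}


def discover_equation(X, y, config=SEARCH_CONFIG):
    """Fit a PySR model.

    X is a 2D array whose columns are assumed to be [r, R_F];
    y is the corresponding retention volume (V_S or V_E).
    """
    # Imported lazily: PySR and its Julia backend must be installed.
    from pysr import PySRRegressor

    model = PySRRegressor(**config)
    model.fit(X, y)
    # Candidate expressions can be inspected via model.equations_ or model.sympy().
    return model
```

Restricting the operators to the four arithmetic operations and capping the expression size favours short rational forms such as the inverse proportional relationship reported in the Results.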

Visual regression trees

In this work, statistically explicit expressions were identified to characterize the relationship between the TLC and CC. Evidently, different compounds can have different CC retention volumes even when their R_F values are identical. This influence stems mainly from the structure and properties of the molecule. Therefore, we define a CC ratio under a given mobile phase proportion to represent this influence. In this work, molecular fingerprints and descriptors, which are high-dimensional features, are used to express the structure and properties of the compounds. As such, it is difficult to directly observe the relationship between the CC ratio and these features, or to find an explicit expression. To further explore this potential relationship, we employed a visual regression tree approach to provide an explicit explanation. Given the binary nature of molecular fingerprints, their impact on the CC ratio is well suited to expression in a binary tree format. Here, a binary regression tree with a maximum depth of 9 layers is used to fit the CC ratio, with the input being the molecular fingerprints and descriptors of each molecule, and the output being the CC ratio. Notably, by limiting the maximum depth, the complexity of the binary regression tree is constrained, ensuring the interpretability of the model. Since the CC ratio is assumed to be solely related to molecular properties, data from both V_S and V_E can be combined when training the regression tree. For each solvent ratio, a binary regression tree can be trained, with all the trees for the commonly used solvent ratios forming a forest. The trained regression trees are represented using visualized tree root graphs, with more important conditions located closer to the roots. Visualization can enhance the interpretability and understanding of the model, allowing for intuitive insights into how molecular structures and properties affect the CC ratio and subsequently impact the chromatographic results. In addition to its explanatory capabilities, the regression tree model can also be used for prediction. Given a specific compound and solvent ratio, the CC ratio can be directly predicted using this model, allowing for rapid calculation of the retention volume via Eqs. (1) and (2).
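To make the tree construction concrete, the following is a minimal pure-Python regression tree with a depth limit. It is a generic CART-style sketch (greedy splits that minimise the sum of squared errors), not the implementation used in this study, and the toy features and CC ratios in the example are invented for illustration.

```python
def _sse(ys):
    """Sum of squared errors around the mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def fit_tree(X, y, max_depth=9, min_samples=2, depth=0):
    """Grow a binary regression tree by greedy SSE-minimising splits."""
    node = {"value": sum(y) / len(y), "n": len(y)}
    if depth >= max_depth or len(y) < min_samples or _sse(y) == 0.0:
        return node
    best = None
    for j in range(len(X[0])):
        # Candidate thresholds: every distinct value except the largest,
        # so that both children are non-empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            cost = _sse([y[i] for i in left]) + _sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, t, left, right)
    if best is None:  # all features constant: nothing to split on
        return node
    _, j, t, left, right = best
    node.update(
        feature=j,
        threshold=t,
        left=fit_tree([X[i] for i in left], [y[i] for i in left],
                      max_depth, min_samples, depth + 1),
        right=fit_tree([X[i] for i in right], [y[i] for i in right],
                       max_depth, min_samples, depth + 1),
    )
    return node


def predict(node, row):
    """Follow the splits from the root to a leaf and return its mean."""
    while "feature" in node:
        node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
    return node["value"]


if __name__ == "__main__":
    # Toy data: one binary fingerprint bit and one continuous descriptor.
    X = [[0, 10.0], [0, 20.0], [1, 10.0], [1, 20.0]]
    eps = [0.4, 0.4, 1.2, 1.2]
    tree = fit_tree(X, eps, max_depth=9)
    print(tree["feature"], predict(tree, [1, 15.0]))
```

In this toy case the fingerprint bit fully explains ε, so it is selected at the root, which illustrates why variables near the root are read as the most important ones.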

Extrapolation to other column specifications

In this work, the relationship between TLC and CC is discovered within the experimental dataset from a 4 g column, which is expressed by Eqs. (1) and (2). Importantly, these equations can be generalized to other column types. Given that the underlying chromatographic principles remain consistent across different columns, the forms of Eqs. (1) and (2) can still be adopted. Additionally, it is assumed that the CC ratio does not vary with column type. Therefore, the task of generalization to other column types can be reformulated as a coefficient regression problem, which can be described as:

$${\bar{V}}_{S}=\left\{\begin{array}{c}\frac{r}{{a}_{1}\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+{b}_{1}},r \, > \, 0\\ {c}_{1},r=0\hfill\end{array}\right.$$

(7)

$${\bar{V}}_{E}=\left\{\begin{array}{c}\frac{r}{{a}_{2}\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+{b}_{2}},r \, > \, 0\\ {c}_{2},r=0\hfill\end{array}\right.$$

(8)

$${V}_{S}={\bar{V}}_{S}\cdot \varepsilon=\left\{\begin{array}{c}\frac{r\cdot \varepsilon }{{a}_{1}\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+{b}_{1}},r\, > \, 0\\ {c}_{1}\cdot \varepsilon,r=0\hfill\end{array}\right.$$

(9)

$${V}_{E}={\bar{V}}_{E}\cdot \varepsilon=\left\{\begin{array}{c}\frac{r\cdot \varepsilon }{{a}_{2}\cdot {{{{\rm{R}}}}}_{{{{\rm{F}}}}}+{b}_{2}},r\, > \, 0\\ {c}_{2}\cdot \varepsilon,r=0\hfill\end{array}\right.$$

(10)

where a₁, b₁, c₁, a₂, b₂, and c₂ are the coefficients that need to be determined through the regression. In this study, we examined tandem 4 g columns and a 25 g column. By leveraging the datasets collected from these columns, we perform nonlinear regressions on Eqs. (9) and (10) to calculate the coefficients for generalization. This approach circumvents the fine-tuning of models that is typically required in traditional transfer learning methods, and instead requires only a small amount of data for a simple regression. This not only highlights the advantages of explicit equations but also demonstrates the reliability of the equation forms identified in this paper.
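The coefficient regression can be illustrated with a simplified stand-in for the nonlinear fit. For r > 0, Eq. (9) can be rearranged to r·ε/V_S = a₁·R_F + b₁, which is linear in R_F, so a₁ and b₁ follow from ordinary least squares; for r = 0, c₁ is the least-squares slope of V_S against ε through the origin. This linearization is an assumption of the sketch rather than the procedure used in this study: it weights the residuals differently from a direct nonlinear regression on V_S, and may serve mainly as an initial guess. The record format and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def recalibrate(records):
    """Estimate (a, b, c) from dicts with keys r, rf, eps, v."""
    pos = [d for d in records if d["r"] > 0]
    zero = [d for d in records if d["r"] == 0]
    # For r > 0: V = r*eps / (a*R_F + b)  <=>  r*eps / V = a*R_F + b
    a, b = fit_line([d["rf"] for d in pos],
                    [d["r"] * d["eps"] / d["v"] for d in pos])
    # For r = 0: V = c*eps, least squares through the origin
    c = sum(d["v"] * d["eps"] for d in zero) / sum(d["eps"] ** 2 for d in zero)
    return a, b, c


if __name__ == "__main__":
    # Synthetic, noise-free records generated from known coefficients.
    a0, b0, c0 = 0.022, 0.0027, 15.70
    data = [{"r": r, "rf": rf, "eps": e, "v": r * e / (a0 * rf + b0)}
            for r in (0.1, 0.5) for rf in (0.1, 0.3, 0.5, 0.7) for e in (1.0, 1.2)]
    data += [{"r": 0, "rf": 0.05, "eps": e, "v": c0 * e} for e in (0.8, 1.0, 1.3)]
    print(recalibrate(data))
```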

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The dataset generated in this study has been deposited in the GitHub repository, https://github.com/woshixuhao/Discovery_of_column_chromatography/tree/main/data. Source data are provided with this paper.

Code availability

All the original code has been deposited at https://github.com/woshixuhao/Discovery_of_column_chromatography/tree/main/data. The version of record of the GitHub repository is archived at doi:10.5281/zenodo.7623903²⁷. Reproducible code is provided at https://bohrium.dp.tech/notebooks/86427319178. The code in both repositories is the same, and the code on Bohrium can be run online for reproducibility. All repositories are released under the Apache-2.0 license.

References

1. Still, W. C., Kahn, M. & Mitra, A. Rapid chromatographic technique for preparative separations with moderate resolution. J. Org. Chem. 43, 2923–2925 (1978).
2. Kondeti, R. R., Mulpuri, K. S. & Meruga, B. Advancements in column chromatography: A review. World J. Pharm. Sci. 2, 1375–1383 (2014).
3. Sasidharan, S., Chen, Y., Saravanan, D., Sundram, K. M. & Latha, L. Y. Extraction, isolation and characterization of bioactive compounds from plants’ extracts. Afr. J. Tradit. Complement. Alternat. Med. 8, 1–10 (2011).
4. Zhang, Q. W., Lin, L. G. & Ye, W. C. Techniques for extraction and isolation of natural products: A comprehensive review. Chin. Med. 13, 1–26 (2018).
5. Shekhawat, L. K. & Rathore, A. S. An overview of mechanistic modeling of liquid chromatography. Preparative Biochem. Biotechnol. 49, 623–638 (2019).
6. Michel, M., Epping, A. & Jupke, A. Modelling and determination of model parameters. In Preparative Chromatography: Of Fine Chemicals and Pharmaceutical Agents (eds Schmidt-Traub, H., Schulte, M. L. & Seidel-Morgenstern, A.) 215–312 (Wiley, Weinheim, 2005).
7. Püttmann, A., Schnittert, S., Naumann, U. & von Lieres, E. Fast and accurate parameter sensitivities for the general rate model of column liquid chromatography. Comput. Chem. Eng. 56, 46–57 (2013).
8. Brooks, C. A. & Cramer, S. M. Steric mass‐action ion exchange: Displacement profiles and induced salt gradients. AIChE J. 38, 1969–1978 (1992).
9. Osberghaus, A. et al. Determination of parameters for the steric mass action model-A comparison between two approaches. J. Chromatogr. A 1233, 54–65 (2012).
10. Artrith, N. et al. Best practices in machine learning for chemistry. Nat. Chem. 13, 505–508 (2021).
11. Usman, A. G., Işik, S. & Abba, S. I. A novel multi-model data-driven ensemble technique for the prediction of retention factor in HPLC method development. Chromatographia 83, 933–945 (2020).
12. Osipenko, S. et al. Machine learning to predict retention time of small molecules in nano-HPLC. Anal. Bioanal. Chem. 412, 7767–7776 (2020).
13. Domingo-Almenara, X. et al. The METLIN small molecule dataset for machine learning-based retention time prediction. Nat. Commun. 10, 1–9 (2019).
14. Low, D. Y. et al. Data sharing in PredRet for accurate prediction of retention time: Application to plant food bioactive compounds. Food Chem. 357, 129757 (2021).
15. Singh, Y. R. et al. Current trends in chromatographic prediction using artificial intelligence and machine learning. Anal. Methods 15, 2785–2797 (2023).
16. Singh, Y. R., Shah, D. B., Maheshwari, D. G., Shah, J. S. & Shah, S. Advances in AI-Driven retention prediction for different chromatographic techniques: unraveling the complexity. Crit. Rev. Anal. Chem. 1, 11 (2023).
17. Guidotti, R. et al. A survey of methods for explaining black box models. ACM Comput. Surv. 51, 1–42 (2018).
18. Xu, H., Zhang, D. & Mo, F. High-throughput automated platform for thin layer chromatography analysis. STAR Protoc. 3, 101893 (2022).
19. Xu, H. et al. High-throughput discovery of chemical structure-polarity relationships combining automation and machine-learning techniques. Chem 1–13 https://doi.org/10.1016/j.chempr.2022.08.008 (2022).
20. Cranmer, M. Interpretable machine learning for science with PySR and SymbolicRegression.jl. arXiv preprint arXiv:2305.01582 (2023).
21. Parinet, J. Predicting reversed-phase liquid chromatographic retention times of pesticides by deep neural networks. Heliyon 7 (2021).
22. Raissi, M., Yazdani, A. & Karniadakis, G. E. Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. Science 367, 1026–1030 (2020).
23. Kaiser, E., Kutz, J. N. & Brunton, S. L. Sparse identification of nonlinear dynamics for model predictive control in the low-data limit. Proc. R. Soc. A: Math. Phys. Eng. Sci. 474, 1–25 (2018).
24. Xu, H., Chang, H. & Zhang, D. DLGA-PDE: Discovery of PDEs with incomplete candidate library via combination of deep learning and genetic algorithm. J. Comput. Phys. 418, 109584 (2020).
25. Sun, L. et al. A simple method for HPLC retention time prediction: Linear calibration using two reference substances. Chin. Med. 12, 1–12 (2017).
26. Du, M., Chen, Y. & Zhang, D. DISCOVER: Deep identification of symbolically concise open-form partial differential equations via enhanced reinforcement learning. Phys. Rev. Res. 6, 013182 (2024).
27. Xu, H., Wu, W., Chen, Y., Zhang, D. & Mo, F. Discovering explicit relation between thin film chromatography and column chromatography from statistics and machine learning. GitHub, https://doi.org/10.5281/zenodo.7623903 (2024).

Acknowledgements

This work is supported by the Natural Science Foundation of China (Grant Nos. 22071004, 21933001, and 22150013, received by F. M., and 62106116, received by Y. C.), and China Postdoctoral Science Foundation (Grant No. 2024M761535, received by H. X.). F.M. thanks Peking University Shenzhen Graduate School and Shenzhen Government for the start-up funding support. We thank the High-Performance Computing Platform of Peking University and High Performance Computing Centers at Eastern Institute of Technology, Ningbo, and Ningbo Institute of Digital Twin for machine learning model training.

Author information

Authors and Affiliations

AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
Hao Xu, Wenchao Wu & Fanyang Mo
BIC-ESAT, ERE, and SKLTCS, College of Engineering, Peking University, 100871, Beijing, P. R. China
Hao Xu
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, Zhejiang, 315200, P. R. China
Hao Xu & Yuntian Chen
School of Materials Science and Engineering, Peking University, 100871, Beijing, P. R. China
Wenchao Wu & Fanyang Mo
Zhejiang Key Laboratory of Industrial Intelligence and Digital Twin, Eastern Institute of Technology, Ningbo, Zhejiang, 315200, China
Yuntian Chen & Dongxiao Zhang
School of Advanced Materials, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
Fanyang Mo
Guangdong Provincial Key Laboratory of Nano-Micro Materials Research, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
Fanyang Mo

Contributions

W.W. and F.M. established the automated platform. W.W. conducted the experiments and collected the column chromatography dataset. H.X. analysed the data and performed the chemoinformatic and machine learning studies. H.X., D.Z., and F.M. wrote the manuscript. Y.C., D.Z., and F.M. revised the manuscript. F.M. conceived the idea and designed the overall research. F.M. and D.Z. supervised the entire project.

Corresponding authors

Correspondence to Dongxiao Zhang or Fanyang Mo.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Irena Vovk, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting summary

Transparent Peer Review file

Source data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Xu, H., Wu, W., Chen, Y. et al. Explicit relation between thin film chromatography and column chromatography conditions from statistics and machine learning. Nat Commun 16, 832 (2025). https://doi.org/10.1038/s41467-025-56136-x

Received: 29 May 2024
Accepted: 09 January 2025
Published: 19 January 2025
Version of record: 19 January 2025
DOI: https://doi.org/10.1038/s41467-025-56136-x
