Background & Summary

Optimization of ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties plays a pivotal role in drug discovery. These pharmacokinetic properties directly influence a drug’s efficacy, safety, and ultimately clinical success. Early assessment and optimization of ADMET properties are essential for mitigating the risk of late-stage failures and for the successful development of new therapeutic agents1.

The development of computational approaches provides a fast and cost-effective means for drug discovery, allowing researchers to focus on candidates with better ADMET potential and reduce labor-intensive and time-consuming wet-lab experiments2,3,4. One of the key factors contributing to the success of computational approaches in drug discovery is the large volume of compound-related biomedical data5. The number of bioassays is increasing each year, and many of their screening results are publicly accessible in databases such as ChEMBL6, PubChem7, and BindingDB8.

Several manually curated ADMET datasets based on public data sources have been reported, and some of them have been widely used as benchmark datasets for model evaluation. Wu et al.9, who constructed a large-scale benchmark for molecular machine learning named MoleculeNet, included 17 datasets and more than 700,000 compounds covering categories of physical chemistry and physiology related to ADMET experiments. Huang et al.10 published the Therapeutics Data Commons, which includes 28 ADMET-related datasets with over 100,000 entries by integrating multiple curated datasets from previous work. For specific ADMET experiments, Meng et al.11 presented B3DB, which includes 1,058 compounds with log BB values and 7,807 compounds with classification labels for the blood-brain barrier, one of the distribution properties. Meng et al.12 collected seven aqueous solubility datasets and presented a dataset curation workflow to establish datasets for solubility, one of the physicochemical properties.

However, serious concerns about these benchmark datasets still exist. Firstly, most of these benchmarks include only a small fraction of the publicly available bioassay data. For instance, the ESOL dataset13 within MoleculeNet provides water solubility data for 1,128 compounds, while the PubChem7 database contains more than 14,000 relevant entries. Secondly, the entries in these benchmarks differ substantially from those in the industrial drug discovery pipeline. For example, the mean molecular weight of compounds in the ESOL dataset is only 203.9 Dalton, whereas compounds in drug discovery projects typically have molecular weights ranging from 300 to 800 Dalton14.

These limitations of compiled open-source benchmark datasets are primarily due to the high complexity of data annotation for biological and chemical experimental records. Frequently, experimental results for identical compounds can vary significantly under different conditions, even within the same type of experiment15. For example, aqueous solubility can be influenced by various factors, such as the type of buffer, the pH level, and the experimental procedure. Thus, the same compound might be annotated with different solubility values depending on those experimental conditions16. This variability poses a major challenge when merging experimental results from different sources.

Recently developed Large Language Models (LLMs) such as ChatGPT17, PubMedBERT18, and BioBERT19 offer a new way to extract data effectively from a large body of text and are therefore a potential means of addressing data curation challenges. Some of these LLMs demonstrate state-of-the-art performance through one-shot or few-shot learning as a form of multi-task learning17,20,21. Compared with supervised methods or models that require thousands of data points for fine-tuning, this approach allows us to develop condition extraction models more efficiently with only a few examples.

In the current study, we used these LLMs as the core engine to extract experimental conditions from assay descriptions within biomedical databases, and we established an automated data processing framework to facilitate the compilation of ADMET benchmark datasets, as shown in Fig. 1. We implemented the pipeline to process bioassay data from the ChEMBL database and extract the experimental conditions missing from the table descriptions. These data, along with some other public datasets, were standardized and filtered to create PharmaBench22.

Fig. 1

Data processing workflow for building PharmaBench: From left to right, the multi-agent LLM system extracts experimental conditions from the ChEMBL database, combines other data sources and standardizes the data, filters various data types, and validates them through repeated tests, property distribution, and AI modeling.

Eventually, PharmaBench22, a data package covering eleven ADMET properties, was curated and provided to the cheminformatics community as a benchmark set for evaluating ADMET predictive models. These properties are recognized as key factors in real-world drug development efforts, and both the size and diversity of the data are significantly greater than those of previous datasets. We also included multiple validation steps to confirm the data quality, molecular properties, and modeling capabilities of PharmaBench22.

Methods

The Methods section provides a detailed overview of the data processing workflow used in constructing PharmaBench22, as depicted in Fig. 1. The Data Collection subsection outlines the data sources employed to build PharmaBench22. The Data Mining subsection describes the multi-agent LLM system for extracting experimental conditions from assay descriptions. Following the identification of experimental conditions in the Data Mining stage, we merge experimental results from various sources and then standardize and filter the data based on drug-likeness, experimental values, and conditions, as summarized in the Data Standardization and Data Filtering subsections. Finally, we post-process the datasets by removing duplicate test results and dividing each dataset using Random and Scaffold splitting methods for AI modeling purposes.

We establish a final benchmark set that comprises experimental results in consistent units and under standardized experimental conditions. In addition, the data processing workflow described in the Methods section can eliminate inconsistent or even contradictory experimental results for the same compounds, enabling other researchers to effectively construct datasets from public data sources. For code reproduction, all data processing tasks were conducted within a Python 3.12.2 virtual environment, established using Conda on an OSX-64 platform. This environment included pandas 2.2.1, NumPy 1.26.4, Matplotlib 3.8.3, rdkit 2023.9.5, scikit-learn 1.4.1.post1, scipy 1.12.0, seaborn 0.13.2, and openai 1.12.0. A detailed description of the environment requirements can be found on GitHub at https://github.com/mindrank-ai/PharmaBench.

Data collection

Our data primarily originated from the ChEMBL database, a manually curated collection of SAR (Structure-Activity Relationship) and related physicochemical property data, largely sourced from peer-reviewed journal articles. Records in the ChEMBL database typically include the experimental value, chemical structure, assay description, type of experiment, and certain experimental conditions. Table 1 summarizes the original entries we collected, along with the number of bioassays of the ChEMBL database used for PharmaBench22. We analyzed 97,609 raw entries from 14,401 different bioassays in PharmaBench22.

Table 1 Summary of data sources for PharmaBench, from left to right: the broad ADMET category, property name, number of ChEMBL entries and bioassays, number of other entries, and a summary of the sources with references.

These entries from different bioassays in the ChEMBL database were analyzed through our Data Mining workflow to extract the experimental conditions. This is mainly because most of the experimental conditions recorded in ChEMBL are not explicitly specified. For instance, for solubility experiments, entries in the ChEMBL database do not include explicit data columns such as buffer type, pH condition, and experimental procedure, which are critical factors influencing experimental results. Although these conditions can be found in the assay descriptions, they cannot be directly used as a filter to distinguish experiments due to their unstructured nature. Manual mining work would be labor-intensive, which necessitates an automatic data processing framework to identify important experimental conditions from the description texts.

Thus, our multi-agent LLM system uses the entries from the ChEMBL database as the original sources and identifies various conditions for different ADMET experiments as summarised in Table 2. Additionally, we have augmented our datasets with some public datasets that have associated assay descriptions as illustrated in Fig. 1. Table 1 presents a summary of the 59,009 entries we have compiled from various public datasets, along with a delineation of their respective sources.

Table 2 Table of Experimental Conditions and Filters Across Datasets.

Overall, we have used more than 150,000 entries from public data sources to construct PharmaBench22, and the data mining process has analyzed 14,401 different bioassays.

Data mining

GPT-417, a model created by OpenAI, was utilized as the core LLM for the data-mining task. Based on previous research, to obtain optimized results from GPT-4, a prompt with clear instructions and examples is required for every specific task17,23,24,25. As shown in Fig. 2, the prompt for our data-mining process includes both instructions and examples. The instructions summarize the experimental conditions as the data mining goal and specify the requirements for the output formats. The examples, on the other hand, provide few-shot learning examples for the LLM. This prompt engineering is an important process for improving the results of GPT-417.

Fig. 2

Sample Prompt for LLM Interaction: Illustration of a Typical User Query Input, Including Instructions and Example Parts.
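The two-part prompt structure described above (instructions plus few-shot examples) can be illustrated with a minimal sketch. The function name `build_prompt`, the wording of the instruction text, and the layout of the examples are assumptions made for illustration only; the actual prompts are those shown in the figures.

```python
def build_prompt(conditions, examples, descriptions):
    """Assemble a few-shot prompt: instructions, worked examples, then the query.

    conditions   -- names of the experimental conditions to extract
    examples     -- (assay_text, labels_dict) pairs used as few-shot examples
    descriptions -- assay descriptions to be mined in this request
    All wording below is hypothetical, not the prompt used in the study.
    """
    instructions = (
        "Extract the following experimental conditions from each assay "
        "description: " + ", ".join(conditions) + ".\n"
        "Return a Python dictionary keyed by the description index; "
        "use None when a condition is not stated."
    )
    example_block = "\n".join(
        f"Input: {text}\nOutput: {labels!r}" for text, labels in examples
    )
    query_block = "\n".join(f"{i}: {d}" for i, d in enumerate(descriptions))
    return (
        f"{instructions}\n\nExamples:\n{example_block}"
        f"\n\nAssay descriptions:\n{query_block}"
    )
```

Keeping the instruction part and the example part separate makes it straightforward to reuse the same instructions while swapping in a different set of examples for each ADMET experiment.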

However, constructing prompts for various tasks requires domain knowledge of the ADMET experiments, and creating examples for these data mining tasks remains labor-intensive. We wish to explore whether the LLMs can automatically identify key experimental conditions from different types of experiments, generate examples, and complete the complex data mining process with minimal human effort.

As a result, a multi-agent LLM data mining system was proposed in this study to extract experimental conditions from the descriptions of various bioassays26,27,28. An agent is a module or entity that utilizes the LLM to perform specific tasks, such as understanding, generating, or processing natural language texts28. Instead of using a single LLM-powered agent, a multi-agent system26 was proposed to customize LLMs into various agents, each with different capabilities, to automatically complete the complex data mining process, as shown in Fig. 3.

Fig. 3

Overview of the Multi-Agent LLM Data Mining Workflow. This figure presents a summary of the multi-agent LLM data mining workflow, which includes three key components: the Keyword Agent, responsible for identifying experimental conditions; the Example Agent, tasked with generating examples; and the Data Mining Agent, designed to extract experimental conditions from assay descriptions.

The multi-agent system consists of three agents, namely the keyword extraction agent (KEA), the example forming agent (EFA), and the data mining agent (DMA), as illustrated in Fig. 3. The KEA picks out and summarizes the key experimental conditions for ADMET experiments. The EFA then generates examples based on the experimental conditions summarized by the KEA. We manually validate the outcomes of the KEA and EFA to ensure their quality. Finally, the DMA mines through all the assay descriptions and identifies all the experimental conditions within these texts. The following sections introduce these three agents in more detail.

Keyword extraction agent

The KEA is designed to summarize key experimental conditions from various ADMET experiments. A prompt, as illustrated in Fig. 4, along with texts from 50 randomly selected assay descriptions, was created as the model input for the KEA. This prompt instructs GPT-4 to summarize the experimental conditions from selected assay descriptions of bioassays in ChEMBL. The model’s task is to identify and summarize the top five most frequently mentioned experimental conditions. For more complex experiments, such as microsome clearance and CYP inhibition, the model was asked to summarize the top ten conditions. GPT-4 is required to generalize these conditions rather than just list specific ones, and to avoid duplicate or near-duplicate conditions. An example of a Python list is provided to the KEA to illustrate the desired output format for GPT-4. This process leverages GPT-4’s internal knowledge to generate a list of significant experimental conditions. An example of the input and output for the KEA is shown in Fig. 4.

Fig. 4

Sample Prompt for the Keyword Extraction Agent.
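As a rough sketch of the two mechanical steps around the KEA (drawing 50 random assay descriptions and reading back a Python list), the following code may help. The helper names, the fixed seed, and the case-insensitive de-duplication are assumptions for illustration; the call to GPT-4 itself is deliberately omitted.

```python
import ast
import random


def sample_descriptions(descriptions, k=50, seed=0):
    """Pick up to k assay descriptions at random (seed fixed for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(descriptions, min(k, len(descriptions)))


def parse_condition_list(reply, max_items=5):
    """Read the Python list contained in a model reply.

    The reply is assumed to contain exactly one bracketed list of strings.
    Case-insensitive duplicates are dropped and at most max_items are kept
    (five by default, ten for the more complex experiments).
    """
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no Python list found in reply")
    items = ast.literal_eval(reply[start:end + 1])
    seen, unique = set(), []
    for item in items:
        key = str(item).strip().lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(str(item).strip())
    return unique[:max_items]
```

Using `ast.literal_eval` rather than `eval` restricts parsing to Python literals, which is a safer choice when the text comes from a model.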

The experimental conditions summarized by the KEA are listed in the ‘Experimental Condition’ column of Table 2. Domain experts were invited to confirm if these conditions are key conditions for ADMET experiments. These experimental conditions are then used as the primary data mining goal for the DMA to extract from each assay description.

Example forming agent

The EFA focuses on generating examples from assay description texts. The prompt for this agent includes clear instructions incorporating the key experimental conditions summarized by the KEA, along with forty assay descriptions for analysis purposes. The EFA returns a Python dictionary containing the index, original sentences, and key experimental conditions as the keys. It returns ‘None’ if no information is provided within the sentences.

For each ADMET experiment, forty examples are generated through these automatic pipelines. Manual examination is conducted on the examples to eliminate errors and confirm the format. This fast labeling process generates few-shot learning examples for the DMA, which avoids intensive human labeling. The example input and output for this agent are shown in Fig. 5.

Fig. 5

Sample Prompt for the Example Forming Agent.
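Before the manual examination, a simple format check can flag generated examples that are obviously incomplete. The sketch below assumes a hypothetical layout in which each example is a dictionary holding the original sentence under the key `"sentence"` plus one key per experimental condition; the real output format is the one shown in Fig. 5.

```python
def check_examples(examples, conditions, text_key="sentence"):
    """Return (index, problem) pairs for examples that need manual attention.

    examples   -- {index: {text_key: str, condition_name: value_or_None, ...}}
    conditions -- condition names that every example is expected to carry
    A condition whose value is None is accepted (nothing stated in the text);
    only a missing key or an empty sentence is reported.
    """
    problems = []
    for index, record in examples.items():
        if not record.get(text_key):
            problems.append((index, "missing original sentence"))
        for name in conditions:
            if name not in record:
                problems.append((index, f"missing condition key: {name}"))
    return problems
```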

Data mining agent

The DMA aims to complete the mining task for all assay descriptions from the ChEMBL database. As shown in Fig. 2, the prompt for this agent includes instructions containing the experimental conditions summarized by the KEA and forty examples generated by the EFA. As shown in Fig. 5, the prompt defines the data mining task, identifying experimental conditions and outputting them in the desired format. These examples provide few-shot learning data for the DMA to learn how to standardize the output format and improve the overall output quality.

GPT-4 has a limit on the number of tokens to be processed in a single request17. Thus, we divided the assay descriptions with a batch size of twenty to mitigate the risk of overloading the model. This batching technique allows for a more accurate and reliable analysis, especially when dealing with complex assay descriptions.
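The batching step can be sketched in a few lines. Only the batch size of twenty comes from the text; the function name is an assumption.

```python
def batched(items, batch_size=20):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Each yielded slice would then be sent to the model as a single request, so that a long list of assay descriptions never has to fit into one prompt.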

The DMA will return a Python Dictionary for every batch input. A routine was written to convert the Python Dictionary from a Markdown file into a Pandas DataFrame, which is then stored. Eventually, the Data Mining Agent goes through all the assay descriptions in the raw data and stores the output of every batch. The experimental conditions mined based on this multi-agent system are then merged back into the original file for data standardization and filtering.
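A possible shape for the conversion routine is sketched below. It assumes that the reply wraps the dictionary in a Markdown code fence and that the dictionary maps each description index to a dictionary of conditions; both are assumptions, as is the function name. The returned list of row dictionaries can be passed directly to `pandas.DataFrame`.

```python
import ast
import re

# Build the fence marker programmatically so this example can itself be
# shown inside a fenced block.
FENCE_MARK = "`" * 3
FENCE = re.compile(
    re.escape(FENCE_MARK) + r"(?:python)?\s*(.*?)" + re.escape(FENCE_MARK),
    re.DOTALL,
)


def parse_batch_reply(reply):
    """Turn one batch reply into a list of flat row dictionaries.

    Assumed reply layout: {index: {condition_name: value_or_None, ...}, ...},
    optionally wrapped in a Markdown code fence.
    """
    match = FENCE.search(reply)
    payload = match.group(1) if match else reply
    start, end = payload.find("{"), payload.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no Python dictionary found in reply")
    data = ast.literal_eval(payload[start:end + 1])
    rows = []
    for index, record in data.items():
        row = {"index": index}
        row.update(record)
        rows.append(row)
    return rows
```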

Overall, this multi-agent system mines through 14,401 assay descriptions to identify the experimental conditions for seven different ADMET experiments. It greatly reduces the human effort needed to extract structured experimental conditions, which are then used in the following data standardization and filtering procedures.

Data standardization

The data obtained from different sources exhibit significant variability in the format of structure, data type, name of experimental condition, and the unit and range of experimental value. We therefore designed a data standardization workflow to clean the data obtained from the previously described data mining step; it includes standardization of structure format, experimental conditions, and experimental values.

  • Structure Standardization: A standard pipeline using RDKit29 is used to convert compound SMILES into canonical SMILES. This pipeline includes checking validity, stripping salts, and removing molecules containing metal atoms.

  • Standardization of experiment condition: Experiment conditions from various sources are standardized into a unified format. Numerical conditions, such as pH, temperature, and compound concentration, are converted into floating-point numbers. String values, such as buffer type, CYP type, and cell strain type, are standardized using the same naming format. For binary variables, such as the addition of S9 in an Ames experiment, a boolean value of ‘True’ or ‘False’ is used. The experimental conditions across different sources are standardized using a consistent naming strategy, which facilitates the subsequent data filtering step.

  • Standardization of experiment value: A similar standardization procedure is also carried out on experimental readouts. For regression tasks, the experimental results, which may be in varying units, are converted to a consistent unit. In some cases, log transformation is applied to experimental results to reduce data range. For classification tasks, thresholds are defined to assign class labels on datasets.
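The condition and value standardization steps listed above can be sketched as small conversion helpers. The accepted spellings for boolean conditions and the unit table are illustrative assumptions, not the mappings used to build PharmaBench; only the general idea (floats for numerical conditions, booleans for binary ones, one consistent unit with an optional log transformation for readouts) comes from the text.

```python
import math
import re

NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")

# Hypothetical conversion factors to nanomolar.
UNIT_TO_NM = {"nM": 1.0, "uM": 1e3, "mM": 1e6, "M": 1e9}


def to_float(value):
    """Extract the first number from a mined condition, e.g. 'pH 7.4' -> 7.4."""
    if value is None:
        return None
    match = NUMBER.search(str(value))
    return float(match.group()) if match else None


def to_bool(value):
    """Map common spellings of a binary condition to True/False, else None."""
    if value is None:
        return None
    text = str(value).strip().lower()
    if text in {"true", "yes", "with", "present"}:
        return True
    if text in {"false", "no", "without", "absent"}:
        return False
    return None


def to_log10_nm(value, unit):
    """Convert a concentration to nM and apply a log10 transformation."""
    nanomolar = value * UNIT_TO_NM[unit]
    if nanomolar <= 0:
        return None  # the logarithm is undefined for non-positive values
    return math.log10(nanomolar)
```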

Data filtering

A data filtering process aims to filter out entries with abnormal molecules and irregular experimental results, to construct the final benchmark set that contains experimental results in consistent units and experimental conditions.

  • Molecule Filter: Molecules containing metal atoms are removed. In addition, amino acids, peptides, or antibodies are removed.

  • Filter of experiment value: Entries containing results outside the normal data range, e.g. negative values for half-life data, are removed. Additionally, upper and lower limits for experiment results are set. Outliers and abnormal distributions in the regression values are manually validated and eliminated if they cannot be explained.

  • Filter of experiment condition: Entries with extreme experiment conditions are eliminated while the rest are preserved. For experiment conditions that contain only a few ‘None’ values, we typically retain only entries whose condition falls within a specific range, as indicated in the ‘Filter’ column of Table 2, and remove the entries that fall outside of this range or contain ‘None’. For instance, we only preserve entries with a pH value equal to 7.4 for the LogD experiments and remove the rest. We exclude experiment conditions for which the majority of entries are ‘None’, as they do not provide useful filters due to the predominance of unknown information. The details for the Experimental Condition Filter can be found in Table 2.
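The value and condition filters can be combined into a single pass over the entries, as in the sketch below. The entry layout (a dictionary with a `"value"` key plus one key per condition) and the numeric limits in the usage note are assumptions; only the pH 7.4 rule for LogD is taken from the text.

```python
def filter_entries(entries, value_range, required_conditions):
    """Keep entries with an in-range value and the required condition values.

    entries             -- list of dicts with a "value" key and condition keys
    value_range         -- (lower, upper) limits for the experimental value
    required_conditions -- {condition_name: required_value}
    An entry whose condition is None or missing is dropped, because it
    cannot match the required value.
    """
    lower, upper = value_range
    kept = []
    for entry in entries:
        value = entry.get("value")
        if value is None or not (lower <= value <= upper):
            continue
        if any(entry.get(name) != wanted
               for name, wanted in required_conditions.items()):
            continue
        kept.append(entry)
    return kept
```

For a LogD-like dataset, a call such as `filter_entries(entries, (-5.0, 10.0), {"pH": 7.4})` would keep only the pH 7.4 measurements; the limits of -5 and 10 are placeholders chosen for this example.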

Data preparation for AI modeling

After the above data processing workflow, a series of ADMET datasets were constructed from various bioassays. The number of entries is summarized in Table 3. However, after the processing workflow, the datasets still contain multiple experimental results for the same compounds under the same conditions.

Table 3 Summary of Datasets in PharmaBench: ‘Property Name’ refers to the name of the dataset.

Thus, we employed various strategies to unify these repeated results in the final datasets. For regression tasks, for compounds with repeated data, the mean value was taken as the unified value. There are two classification datasets in our benchmark set. For the BBB experiment, we eliminate all compounds with contradictory results, while for AMES, we label the compounds as positive if at least one positive result occurs in these experiments. This approach is primarily due to the fact that AMES is a toxicity-related experiment, which requires the model to be highly sensitive to positive results30.
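The three merging rules described above (mean for regression, removal of contradictory compounds for BBB, and any-positive for AMES) can be written compactly. The sketch assumes that records are `(smiles, value)` pairs that already share the same experimental conditions and that class labels are encoded as 0 and 1; the function names are illustrative.

```python
from collections import defaultdict
from statistics import mean


def group_by_compound(records):
    """Collect all values reported for each compound."""
    groups = defaultdict(list)
    for smiles, value in records:
        groups[smiles].append(value)
    return groups


def merge_regression(records):
    """Regression tasks: use the mean of the repeated values."""
    return {s: mean(v) for s, v in group_by_compound(records).items()}


def merge_bbb(records):
    """BBB: drop every compound whose labels contradict each other."""
    return {s: v[0] for s, v in group_by_compound(records).items()
            if len(set(v)) == 1}


def merge_ames(records):
    """AMES: label a compound positive if any of its results is positive."""
    return {s: int(any(v)) for s, v in group_by_compound(records).items()}
```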

Additionally, we divided the datasets for each property into training and test sets with a ratio of 0.8:0.2 respectively, utilizing both random and scaffold splitting methods. Random splitting involves distributing the compounds arbitrarily across the training and test sets, whereas scaffold splitting is designed to create sets with distinct structural features by allocating compounds that share the same core scaffold exclusively to either the training or the test set10. This approach ensures that the test set is structurally different from the training compounds, making it more challenging for models to predict.
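The difference between the two splitting strategies can be sketched as follows. The code assumes that a scaffold string has already been computed for every compound (for example a Murcko scaffold obtained with RDKit) and uses a common greedy heuristic that fills the training set with the largest scaffold groups first; the exact procedure used for PharmaBench may differ.

```python
import random
from collections import defaultdict


def random_split(n_compounds, test_fraction=0.2, seed=0):
    """Shuffle the compound indices and cut them into train and test."""
    indices = list(range(n_compounds))
    random.Random(seed).shuffle(indices)
    n_test = round(n_compounds * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def scaffold_split(scaffolds, test_fraction=0.2):
    """Assign whole scaffold groups to either train or test.

    scaffolds -- one precomputed scaffold string per compound (assumed input)
    Groups are visited from largest to smallest; a group goes to the training
    set if it still fits within the training quota, otherwise to the test set.
    The resulting ratio is therefore only approximately 0.8:0.2.
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n_train_target = len(scaffolds) - round(len(scaffolds) * test_fraction)
    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= n_train_target:
            train.extend(group)
        else:
            test.extend(group)
    return sorted(train), sorted(test)
```

Because every scaffold group is assigned as a unit, no scaffold appears on both sides of the split, which is what makes the scaffold-split test set structurally distinct from the training set.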

Data Records

We have compiled 11 ADMET datasets to form PharmaBench, which is freely available at figshare22. Table 3 includes the number of entries after the data processing workflow for each dataset and the final entries for AI modeling. The final entries consist of one experimental result for each molecule, based on the experimental condition as described in the ‘Filter’ column of Table 2. The mission type of the different datasets is also summarized in the ‘Mission Type’ column of Table 3, including regression and classification.

Overall, PharmaBench22 comprises a total of 52,482 entries. It is stored in comma separated values (CSV) format and includes a unified SMILES representation, experimental results, property names, and training labels based on both scaffold and random splitting, as summarized in Table 4. The data are also openly accessible on GitHub at https://github.com/mindrank-ai/PharmaBench, along with the processing workflow.

Table 4 List of information in the final datasets.

The following section will introduce different datasets in more detail, including a general introduction to various ADMET properties, the units for different datasets, and the number of molecules.

  • LogD: LogD31 measures a drug’s pH-adjusted lipophilicity, representing the ratio of its total concentration (both ionized and un-ionized) in oil and water phases. This is an important property to consider in drug discovery as it influences a compound’s bioavailability, permeability, and other pharmacokinetic properties. The unit for LogD, which stands for the logarithm of the distribution coefficient (D), is dimensionless. We introduce a regression task that includes 13,068 unique molecules for predicting LogD.

  • Water Solubility: Water solubility32 denotes the maximum amount of a solute that can dissolve in water to form a uniform solution. In drug development, it significantly impacts drug bioavailability, since a drug requires adequate solubility for absorption into the bloodstream. The unit for the water solubility dataset is log10 nM, and it includes 11,701 unique molecules for the regression prediction of these values. We filtered out the dynamic water solubility data in this dataset based on the experimental conditions.

  • The Blood-Brain Barrier (BBB): The BBB33 is a selective barrier that separates the blood from the central nervous system (CNS) and poses significant challenges for drug delivery to the CNS. Predicting BBB penetration is crucial for designing drugs targeting CNS diseases. We have chosen log BB = –1 as the threshold value, as this is the most widely used threshold, as discussed in the B3DB. Overall, there are 8,301 unique molecules for the BBB task.

  • Plasma Protein Binding (PPB): PPB34 is an important pharmacokinetic parameter that characterizes the extent to which a compound binds to proteins in the bloodstream. PPB can influence a compound’s distribution, elimination, and therapeutic efficacy. The experimental results for PPB experiments range from 0 to 1, representing the fraction of the drug in the plasma that is bound. For instance, a PPB of 0.9 (90%) means that 90% of the drug molecules present in the plasma are attached to plasma proteins, leaving only 10% free and active. There are records of 1,262 molecules in the PPB dataset.

  • CYP: Cytochrome P450 (CYP)35 is the primary metabolic enzyme responsible for drug metabolism in the body. CYP enzymes catalyze the oxidation of organic substances, a process that often represents the first step in the metabolism of many drugs. Multiple CYP isoforms exist in the human body, each with unique specificity for various substrates. The unit for the CYP datasets is log10 µM, indicating the binding affinity of compounds to different CYP enzymes. There are three different CYP datasets in this benchmark, namely CYP2C9 (999 molecules), CYP2D6 (1,214 molecules), and CYP3A4 (1,980 molecules).

  • Liver Microsome Clearance (LMC): Liver Microsome Clearance36 refers to the process by which compounds are metabolized and cleared in the liver microsomal system. This in vitro assessment is crucial in drug discovery and development, as it offers an early estimation of a compound’s in vivo clearance rate and potential for drug-drug interactions. The unit for LMC is log10(mL·min⁻¹·g⁻¹), indicating the clearance speed of microsomes for different drugs. We have included three different LMC datasets in this benchmark, namely human LMC (2,286 molecules), rat LMC (1,129 molecules), and mouse LMC (1,403 molecules).

  • AMES: The AMES test30 evaluates a compound’s mutagenic potential by assessing whether specific bacteria regain the ability to grow without histidine. It serves as a cost-effective, preliminary toxicity screening method widely used in various industries, particularly in drug development, to identify potential carcinogens. A positive AMES result indicates that the compound may have mutagenic potential, characterized by abnormal bacterial growth speed. We have included 9,139 molecules for the AMES test.

Technical Validation

Once the data collection is done, we evaluate the datasets from three aspects. Firstly, we use the repeated test results for the datasets before and after the implementation of the data processing workflow to demonstrate the improvement in data quality resulting from this workflow. Secondly, we illustrate the characteristics of PharmaBench22 by showing distributions of various molecular properties. Lastly, we trained various machine learning and deep learning models on the datasets and presented model performance on the test sets.

Repeated test for data quality assessment

A comparison of repeated test results is a methodological approach where the same experiment is conducted multiple times to verify the consistency of the results37. Limited by the scope of this work, we cannot verify each data point through wet lab experiments or review each publication to confirm the data quality of the dataset directly. Thus, we implement an indirect approach, namely repeated testing, to confirm the data quality before and after data processing. A raw dataset often contains multiple records for the same compound due to different sources and varying experimental conditions. Repeated testing compares the maximum and minimum values for the same compound under the same condition to validate the data quality.

As shown in Fig. 6, the repeated test plot is used to analyze regression results, and the confusion matrix is used to analyze the classification results. If the experimental results are consistent for different data sources, the repeated test plot will exhibit higher correlation and a lower mean absolute error (MAE) for regression tests, and the confusion matrix will show higher accuracy (ACC), precision, and recall for classification tests. In contrast, low-quality data will have opposite metric scores.

Fig. 6

Comparison of Data Quality Before and After the Data Processing Workflow Through Repeated Test Plots and Confusion Matrices. (a) Repeated Test Plot for the LogD Experiment Before Data Processing. (b) Repeated Test Plot for the LogD Experiment After Data Processing. (c) Confusion Matrix for the BBB Experiment Before Data Processing. (d) Confusion Matrix for the BBB Experiment After Data Processing. Additional data can be found in Table 5.

We use this method to compare data quality before and after considering the experimental conditions mined through our data mining process, thereby demonstrating improvement in data quality based on our approach. The data quality before and after the data processing workflow can be compared and evaluated through the metrics mentioned above.

Specifically, we group data entries for the same molecules from the raw data to create the ‘before data processing’ plot, and we group data entries for the same molecules under identical conditions for the ‘after data processing’ plot. The maximum and minimum experimental results for each group are selected as the worst-case scenario. We have created a scatter plot for the regression tests and a confusion matrix for the classification tests, as shown in Fig. 6. The R, MAE, and RMSE for regression tasks, and ACC, F1, precision, and recall for classification tasks, have been calculated and are recorded within Tables 5 and 6.
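The regression side of the repeated test can be sketched as follows. Records are assumed to be `(group_key, value)` pairs, where the key is the compound alone for the ‘before data processing’ case and a `(compound, conditions)` tuple for the ‘after data processing’ case; the function names are illustrative.

```python
import math
from collections import defaultdict


def repeated_test_pairs(records):
    """Return (minimum, maximum) for every group with at least two results."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return [(min(v), max(v)) for v in groups.values() if len(v) > 1]


def regression_agreement(pairs):
    """Pearson R, MAE and RMSE between the minimum and maximum values."""
    n = len(pairs)
    xs = [low for low, _ in pairs]
    ys = [high for _, high in pairs]
    mae = sum(abs(x - y) for x, y in pairs) / n
    rmse = math.sqrt(sum((x - y) ** 2 for x, y in pairs) / n)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # R is undefined when either side has no variance.
    r = cov / math.sqrt(var_x * var_y) if var_x > 0 and var_y > 0 else float("nan")
    return {"R": r, "MAE": mae, "RMSE": rmse}
```

Grouping by compound alone mixes measurements taken under different conditions, so the minimum and maximum tend to lie further apart than when the condition is part of the key.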

Table 5 Comparison of Metrics Between the Regression Datasets Before and After the Data Processing Workflow.
Table 6 Comparison of Metrics Between the Classification Datasets Before and After the Data Processing Workflow.

Table 5 presents the results of repeated tests for regression tasks within PharmaBench22, while Table 6 summarizes the classification tasks. All metrics improved following the data processing workflow, confirming the improvement in data quality. For certain experiments, such as the LogD experiment, the agreement between repeated tests improved significantly after the data processing workflow, reaching a level comparable to that of traditional wet lab experiments. However, the agreement between repeated tests for CYP and clearance experiments remains relatively low, due to the complex nature of these in vitro experiments.

Analysis of property distribution

Basic physicochemical properties of the compounds, including atom counts, molecular weight, LogP, and QED, were calculated using RDKit. Histograms representing the frequency of these properties were calculated and are presented in Fig. 7 to illustrate the characteristics of the molecules within PharmaBench22,38.

Fig. 7

Frequency Histograms for All PharmaBench Datasets: Count Distribution of Atom Numbers, Molecular Weights, LogP, and QED Scores Across Different Datasets.

These histograms demonstrate that compounds in PharmaBench22 exhibit a broad distribution. The number of non-hydrogen atoms per molecule typically ranges from 10 to 50, and molecular weights range from 200 to 600 Daltons, which is consistent with the range of drug-like small molecules39. Additionally, the LogP values of these datasets are in the range from 0 to 8, indicating a tendency towards lipophilicity, which is also well aligned with that of drug-like compounds39. QED is a metric that evaluates the potential of a compound to be developed as a successful drug, based on a multi-factorial analysis of molecular properties of marketed drugs39. The QED distribution for PharmaBench22 is skewed towards 1, suggesting that many compounds in the datasets possess favorable physicochemical properties.

Overall, the molecules within PharmaBench22 demonstrate preferable characteristics, which are similar to those in the small molecule drug discovery projects.

Deep learning and machine learning modeling

Modeling protocol

As in the repeated tests described above, we used the mean absolute error (MAE), root mean square error (RMSE), and Pearson correlation coefficient R to evaluate the regression results. For classification results, we used the area under the receiver operating characteristic curve (AUC), accuracy (ACC), and the F1 score (F1) for evaluation.
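For reference, the six metrics can be written out in a few lines of plain Python. This is a sketch of the standard textbook definitions, not the authors' evaluation code; the function names are our own, and AUC is computed through its pairwise-ranking interpretation (ties count as one half), which assumes both classes are present.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient R."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def accuracy(y_true, y_label):
    """Fraction of correctly predicted binary labels."""
    return sum(t == p for t, p in zip(y_true, y_label)) / len(y_true)


def f1_score(y_true, y_label):
    """F1 = 2TP / (2TP + FP + FN) for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_label))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_label))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_label))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true, y_score):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that AUC is computed from continuous scores, whereas ACC and F1 require the scores to be thresholded into class labels first.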

We selected two machine learning approaches and seven deep learning models for this evaluation. The machine learning models are XGBoost40 and Random Forest (RF)41, both using Extended Connectivity Fingerprints (ECFP) as molecular descriptors42. Of the seven deep learning models, some require a pre-training process, and each takes either a 2D graph or a 3D conformation as input. Detailed descriptions of these models can be found in Table 7.

Table 7 Summary of AI Models Utilized in the Validation Process.

All models were built using default parameters, without additional fine-tuning. Although hyper-parameter optimization might improve the results, our intention is not to select the best model but rather to use these models to verify the quality of PharmaBench.

Modeling results

We present the metrics for the regression and classification models trained on randomly split datasets (Table 8) and on scaffold-split datasets (Table 9).

Table 8 Summary of final results for the PharmaBench based on random split.
Table 9 Summary of final results for the PharmaBench based on scaffold split.

For the datasets associated with regression tasks, the predictions for LogD, water solubility, BBB, and microsomal clearance are satisfactory, with relatively high R values and low MAE and RMSE. However, the prediction performance for the CYP datasets remains relatively low, which indicates that further improvements in data quality and modeling approaches are required for these datasets. The metrics for the classification tasks are all relatively high, which indicates that the models can effectively predict the classification results for the BBB and AMES datasets.

With regard to the splitting method, random splitting yields better prediction results than scaffold splitting for the majority of tasks. This is expected, since most models perform worse on compounds with new scaffolds. In addition, deep learning approaches significantly outperform the machine learning approaches in regression tasks. The performance gap between the deep learning and machine learning models widens for datasets with a large amount of data, such as LogD and water solubility, but narrows for smaller datasets, such as mouse microsomal clearance. This indicates that conventional machine learning approaches adapt well to small datasets but are less capable than deep learning models of exploiting large amounts of data. In contrast, the machine learning approaches perform considerably better on classification tasks. The metrics of the XGBoost models for the AMES and BBB datasets surpass those of some deep learning approaches, suggesting that machine learning approaches are well suited to classification tasks.
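To make the difference between the two splitting methods concrete, the sketch below shows one common way to build a scaffold-style split: compounds are grouped by a scaffold key and whole groups are assigned to either the training or the test set, so no scaffold appears in both. This is an illustrative assumption rather than the procedure used to build PharmaBench; the function name `group_split`, the largest-group-first heuristic, and the placeholder scaffold keys are all hypothetical. In practice the keys would typically be Bemis-Murcko scaffolds derived from the SMILES.

```python
from collections import defaultdict


def group_split(scaffolds, test_fraction=0.2):
    """Split indices so that no scaffold key is shared between train and test.

    Hypothetical helper: groups are visited from largest to smallest and added
    to the training set while they still fit; the rest go to the test set.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    n_total = len(scaffolds)
    train_cutoff = n_total - round(test_fraction * n_total)

    train, test = [], []
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(members) <= train_cutoff:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Placeholder scaffold keys for ten compounds (not real scaffolds).
scaffolds = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
train_idx, test_idx = group_split(scaffolds, test_fraction=0.2)
```

Because every test compound then carries a scaffold the model has never seen, such a split probes generalization to new chemical series, whereas a random split lets close analogues of test compounds appear in the training set.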

Among the deep learning approaches, models with pretraining demonstrate the best performance in both regression and classification tasks for the majority of datasets. This indicates that pretraining can be useful for improving model performance in ADMET property prediction. Moreover, there is no significant performance difference between graph-based and transformer-based approaches, or between 2D and 3D feature-based methods.

We encourage further research and modeling work using this benchmark set. For instance, investigating approaches to improve model capabilities in predicting molecules with novel scaffolds would be valuable. The use of transfer learning and pre-training approaches is also recommended for analyses with these datasets. Additionally, applying explainable AI techniques could provide valuable insights into the key pharmacological factors influencing ADMET properties.

Usage Notes

There are eleven ADMET datasets within PharmaBench22. Standardized SMILES representations of the compounds are provided as model inputs, and the experimental values are provided as the prediction targets. For a fair comparison, users may use the labels in scaffold_train_test_label and random_train_test_label as the train-test labels.
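A minimal sketch of how the predefined labels might be used is given below. Only the names scaffold_train_test_label and random_train_test_label come from the text above; the assumption that they are columns of a CSV file, the label values "train" and "test", the column names `smiles` and `value`, and the three rows of placeholder numbers are all hypothetical and should be checked against the released files.

```python
import csv
import io

# Hypothetical excerpt of a dataset file; the values are made-up placeholders,
# not PharmaBench data.
raw = """smiles,value,scaffold_train_test_label,random_train_test_label
CCO,0.31,train,test
c1ccccc1,2.13,test,train
CC(=O)O,-0.17,train,train
"""


def split_rows(rows, label_column):
    """Partition rows into train and test sets using a predefined label column."""
    train = [row for row in rows if row[label_column] == "train"]
    test = [row for row in rows if row[label_column] == "test"]
    return train, test


rows = list(csv.DictReader(io.StringIO(raw)))
scaffold_train, scaffold_test = split_rows(rows, "scaffold_train_test_label")
random_train, random_test = split_rows(rows, "random_train_test_label")
```

Reusing the shipped labels instead of re-splitting the data keeps results comparable across studies.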