Table 1 Data management features
From: Protein–ligand data at scale to support machine learning
Attribute | Description |
---|---|
The AIRCHECK database | Houses Target 2035 screening datasets; supports machine learning (ML)/artificial intelligence (AI) model development, evaluation and reusability; follows FAIR principles (findable, accessible, interoperable, reusable); publishes and documents data and computer code for data processing, quality control and normalization; ensures transparency and allows users to scrutinize data transformation and ML/AI models |
Standardizing experimental data using controlled vocabulary | Links experimental protocols to assay data via electronic lab notebooks and laboratory information management systems; uses commercial tools to allow uptake by the community; shares database architecture and controlled vocabulary across labs; facilitates integration of data (e.g., protein production, screening hit validation) |
Robust versioning | Automatically tracks and documents dataset changes; uses data nutrition labels to visualize and summarize dataset characteristics and updates; transforms datasets for integration into repositories, such as ChEMBL and PubChem |
Reusability | Provides comprehensive documentation, including experimental protocols and lab notebooks; offers analysis code and output files from tutorials and workshops, and fully specified ML/AI models; creates educational materials for users; enables users to understand the data and previous analyses |
Data release | Releases generated and quality-controlled data immediately or at regular intervals (e.g., quarterly); aligns data releases with open benchmarking challenges to encourage use and re-use; releases data in the context of chemical probe collaborations for added scientific value |
Integrating diverse data | Supports ingestion of data from diverse screening platforms (affinity-selection mass spectrometry, DNA-encoded chemical library ML); creates multimodal data objects integrating data for a single target from various platforms; tracks processing pipelines and ensures full traceability of data generation (inspired by the ORCESTRA platform for genomics data, ref. 70) |
Equity and inclusion | Ensures data and computational resources are free to access; cloud implementation allows users with limited resources to run ML/AI methods using free research credits; partners with cloud providers to facilitate resource use for users from low-income countries (ref. 65); develops the Artificial Intelligence-Ready CHEmiCal Knowledge base (AIRCHECK) web application following the Web Accessibility Initiative to maximize inclusion and diversity (ref. 71) |
Data science | Trains ML models using rigorously processed and curated data; represents data in formats optimized for downstream applications; uses random, chronological or other splitting mechanisms to divide data into training, validation and test sets; continuously tests and updates models with new data; enhances predictive accuracy and monitors ‘model drift’ over time; uses active learning to drive design–make–test–analyse cycles; evaluates prediction uncertainty to inform decision-making and reinforce model reliability |
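
The random and chronological splitting mentioned in the Data science row can be illustrated with a minimal Python sketch. This is not AIRCHECK code: the function name `split_records`, the record fields (`compound_id`, `date`, `label`) and the 80/10/10 fractions are all assumptions made for illustration only.

```python
import random
from datetime import date


def split_records(records, fractions=(0.8, 0.1, 0.1), mode="chronological", seed=0):
    """Split records into (training, validation, test) lists.

    Hypothetical helper: each record is assumed to be a dict with a "date" key.
    mode="chronological" puts the oldest records in training and the newest in test;
    mode="random" shuffles reproducibly with the given seed.
    """
    if len(fractions) != 3 or abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must be three values that sum to 1")

    if mode == "chronological":
        ordered = sorted(records, key=lambda r: r["date"])
    elif mode == "random":
        ordered = list(records)
        random.Random(seed).shuffle(ordered)
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    n_train = int(len(ordered) * fractions[0])
    n_valid = int(len(ordered) * fractions[1])
    train = ordered[:n_train]
    valid = ordered[n_train:n_train + n_valid]
    test = ordered[n_train + n_valid:]  # remainder, so no record is dropped
    return train, valid, test


# Toy screening records (made-up values, assumed field names).
records = [
    {"compound_id": f"C{i}", "date": date(2024, 1, 1 + (i * 3) % 10), "label": i % 2}
    for i in range(10)
]

train, valid, test = split_records(records, mode="chronological")
print(len(train), len(valid), len(test))
```

A chronological split evaluates the model on data generated after everything it was trained on, which is closer to prospective use than a random split. A random split is simpler but can let near-duplicate measurements from the same campaign appear on both sides.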