Nature Reviews Chemistry

Table 1 Data management features

From: Protein–ligand data at scale to support machine learning

Attribute	Description
The AIRCHECK database	Houses Target 2035 screening datasets; supports machine learning (ML)/artificial intelligence (AI) model development, evaluation and reusability; follows FAIR principles (findable, accessible, interoperable, reusable); publishes and documents data and computer code for data processing, quality control and normalization; ensures transparency and allows users to scrutinize data transformation and ML/AI models
Standardizing experimental data using controlled vocabulary	Links experimental protocols to assay data via electronic lab notebooks and laboratory information management systems; uses commercial tools to allow uptake by the community; shares database architecture and controlled vocabulary across labs; facilitates integration of data (e.g., protein production, screening hit validation)
Robust versioning	Automatically tracks and documents dataset changes; uses data nutrition labels to visualize and summarize dataset characteristics and updates; transforms datasets for integration into repositories, such as ChEMBL and PubChem
Reusability	Provides comprehensive documentation, including experimental protocols and lab notebooks; offers analysis code and output files from tutorials and workshops, and fully specified ML/AI models; creates educational materials for users; enables users to understand the data and previous analyses
Data release	Releases generated and quality-controlled data immediately or at regular intervals (e.g., quarterly); aligns data releases with open benchmarking challenges to encourage use and re-use; releases data in the context of chemical probe collaborations for added scientific value
Integrating diverse data	Supports ingestion of data from diverse screening platforms (affinity-selection mass spectrometry, DNA-encoded chemical library ML); creates multimodal data objects integrating data for a single target from various platforms; tracks processing pipelines and ensures full traceability of data generation (inspired by the ORCESTRA platform for genomics data⁷⁰)
Equity and inclusion	Ensures data and computational resources are free to access; provides a cloud implementation that allows users with limited resources to run ML/AI methods using free research credits; partners with cloud providers to facilitate resource use for users from low-income countries⁶⁵; develops the Artificial Intelligence-Ready CHEmiCal Knowledge base (AIRCHECK) web application following the Web Accessibility Initiative to maximize inclusion and diversity⁷¹
Data science	Trains ML models using rigorously processed and curated data; represents data in formats optimized for downstream applications; uses random, chronological or other splitting mechanisms to divide data into training, validation and test sets; continuously tests and updates models with new data; enhances predictive accuracy and monitors ‘model drift’ over time; uses active learning to drive design–make–test–analyse cycles; evaluates prediction uncertainty to inform decision-making and reinforce model reliability
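
The "Data science" row mentions random and chronological splitting into training, validation and test sets. The following is a minimal Python sketch of what those two modes could look like, not the AIRCHECK implementation. The function name split_dataset, the record fields compound_id and screened_on, and the 80/10/10 fractions are all assumptions made for illustration.

```python
import random
from datetime import date


def split_dataset(records, mode="random", fractions=(0.8, 0.1, 0.1), seed=0):
    """Split records into (train, validation, test) lists.

    Hypothetical helper: `records` is a list of dicts, and chronological
    mode assumes each dict has a `screened_on` date (an invented field).
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")

    if mode == "random":
        # Shuffle a copy with a fixed seed so the split is reproducible.
        ordered = list(records)
        random.Random(seed).shuffle(ordered)
    elif mode == "chronological":
        # Oldest records go to training, newest to the test set.
        ordered = sorted(records, key=lambda r: r["screened_on"])
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    n_train = round(len(ordered) * fractions[0])
    n_val = round(len(ordered) * fractions[1])
    train = ordered[:n_train]
    validation = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]  # whatever remains
    return train, validation, test


# Toy records, listed newest first (invented data for illustration only).
records = [
    {"compound_id": f"C{i}", "screened_on": date(2024, 1, 10 - i)}
    for i in range(10)
]

train, validation, test = split_dataset(records, mode="chronological")
```

A chronological split keeps every test record later in time than the training records, which is closer to how a model is used prospectively; a random split is simpler but can overstate performance when related compounds fall on both sides of the split.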
