Table 1 Data management features

From: Protein–ligand data at scale to support machine learning

Attribute

Description

The AIRCHECK database

Houses Target 2035 screening datasets

Supports machine learning (ML)/artificial intelligence (AI) model development, evaluation and reusability

Follows FAIR principles (findable, accessible, interoperable, reusable)

Publishes and documents data and computer code for data processing, quality control and normalization

Ensures transparency and allows users to scrutinize data transformation and ML/AI models

Standardizing experimental data using controlled vocabulary

Links experimental protocols to assay data via electronic lab notebooks and laboratory information management systems

Uses commercial tools to allow uptake by the community

Shares database architecture and controlled vocabulary across labs

Facilitates integration of data (e.g., protein production, screening hit validation)

Robust versioning

Automatically tracks and documents dataset changes

Uses data nutrition labels to visualize and summarize dataset characteristics and updates

Transforms datasets for integration into repositories, such as ChEMBL and PubChem

Reusability

Provides comprehensive documentation, including experimental protocols and lab notebooks

Offers analysis code and output files from tutorials and workshops, and fully specified ML/AI models

Creates educational materials for users

Enables users to understand the data and previous analyses

Data release

Releases generated and quality-controlled data immediately or at regular intervals (e.g., quarterly)

Aligns data releases with open benchmarking challenges to encourage use and re-use

Releases data in the context of chemical probe collaborations for added scientific value

Integrating diverse data

Supports ingestion of data from diverse screening platforms (affinity-selection mass spectrometry, DNA-encoded chemical library ML)

Creates multimodal data objects integrating data for a single target from various platforms

Tracks processing pipelines and ensures full traceability of data generation (inspired by the ORCESTRA platform for genomics data; ref. 70)

Equity and inclusion

Ensures data and computational resources are free to access

Cloud implementation allows users with limited resources to run ML/AI methods using free research credits

Partners with cloud providers to facilitate resource use for users from low-income countries (ref. 65)

Develops the Artificial Intelligence-Ready CHEmiCal Knowledge base (AIRCHECK) web application following the Web Accessibility Initiative guidelines to maximize inclusion and diversity (ref. 71)

Data science

Trains ML models using rigorously processed and curated data

Represents data in formats optimized for downstream applications

Uses random, chronological or other splitting mechanisms to divide data into training, validation and test sets

Continuously tests and updates models with new data

Enhances predictive accuracy and monitors ‘model drift’ over time

Uses active learning to drive design–make–test–analyse cycles

Evaluates prediction uncertainty to inform decision-making and reinforce model reliability
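
The random and chronological splitting listed under "Data science" can be illustrated with a short sketch. This is only a minimal example under stated assumptions, not the AIRCHECK implementation: the record fields (`compound_id`, `date`), the 80/10/10 fractions and the function names are all hypothetical and chosen for illustration.

```python
import random
from datetime import date, timedelta


def _cut(ordered, fractions):
    """Cut an already-ordered list into train/validation/test parts."""
    n = len(ordered)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def random_split(records, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle with a fixed seed, then cut, so the split is reproducible."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return _cut(shuffled, fractions)


def chronological_split(records, fractions=(0.8, 0.1, 0.1)):
    """Sort by date so later sets contain only more recent records."""
    ordered = sorted(records, key=lambda r: r["date"])
    return _cut(ordered, fractions)


# Hypothetical screening records: one per day, identifiers are made up.
records = [
    {"compound_id": f"CPD-{i:03d}", "date": date(2024, 1, 1) + timedelta(days=i)}
    for i in range(10)
]

rand_train, rand_val, rand_test = random_split(records, seed=42)
chrono_train, chrono_val, chrono_test = chronological_split(records)
```

A chronological split mimics prospective use, since the model is evaluated only on data generated after its training data, whereas a random split mixes time periods across the three sets.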