Table 2 Sample benchmarking challenges

From: Protein–ligand data at scale to support machine learning

| Data | Challenge | Experimental validation |
| --- | --- | --- |
| SMILES and/or fingerprint and enrichment metrics of DNA-encoded chemical library (DEL) screening hits and negatives from 4–10B compound library | Train machine learning (ML)/artificial intelligence (AI) models on DEL screening data and use them to predict actives from billions of commercial compounds | Procure and test predicted hits with two orthogonal assays |
| 300k affinity-selection mass spectrometry (AS–MS) compound library (SMILES) and protein target | Predict true and false positives | Compare predictions with screening results, annotated with orthogonal assays |
| AS–MS screening and orthogonal hit confirmation data for 80% of a 300k compound library | Challenge 1: predict confirmed hits for the remaining 20% hold-out set; Challenge 2: if successful, predict novel hits from commercial libraries | For challenge 1: unblind existing data from the hold-out set; For challenge 2: procure and test predicted hits with two orthogonal assays |
| 300k AS–MS compound library (SMILES), protein target and annotated screening results (including orthogonal hit verification) | Challenge 1: use target-based and/or receptor-based virtual screening to predict experimental hits; Challenge 2: if successful, predict novel hits from commercial libraries | Challenge 1: unblind existing data; Challenge 2: procure and test predicted hits with two orthogonal assays |
| SMILES and/or fingerprint and enrichment metrics of DNA-encoded chemical library (DEL) screening hits and negatives from 4–10B compound libraries against hundreds of targets | Build a foundation model to predict hits from commercial libraries for targets absent from the training set | Procure and test predicted hits with two orthogonal assays |
| AS–MS screening and orthogonal confirmation data for 80% of >1,000 targets | Predict hits for homologous and/or unrelated targets | Procure and test predicted hits with two orthogonal assays |
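
As a rough illustration of the hold-out design in the third challenge (train on 80% of the screened library, then unblind the remaining 20%), the sketch below assigns compounds to a hold-out set and scores predicted hits against the unblinded labels. The function names, the hash-based assignment and the precision/recall scoring are illustrative assumptions, not a procedure described in the article; an actual benchmark might instead split by chemical scaffold or use other metrics. The hash-based split is only approximately 80/20.

```python
import hashlib


def in_holdout(smiles: str, holdout_fraction: float = 0.2) -> bool:
    """Deterministically assign a compound to the hold-out set by hashing its SMILES.

    Hypothetical helper: the same SMILES string always lands in the same split.
    """
    digest = hashlib.sha256(smiles.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < holdout_fraction


def split_library(library: dict) -> tuple:
    """Split {smiles: confirmed_hit} into (training, blinded hold-out) dicts."""
    train, holdout = {}, {}
    for smiles, is_hit in library.items():
        (holdout if in_holdout(smiles) else train)[smiles] = is_hit
    return train, holdout


def precision_recall(predicted_hits, holdout: dict) -> tuple:
    """Score predicted hits against the unblinded hold-out labels."""
    # Only predictions for compounds that are actually in the hold-out set count.
    predicted = set(predicted_hits) & set(holdout)
    true_hits = {s for s, is_hit in holdout.items() if is_hit}
    true_positives = len(predicted & true_hits)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_hits) if true_hits else 0.0
    return precision, recall
```

In use, a participant would see only the training dictionary, submit a list of predicted hits for the hold-out compounds, and receive the precision and recall once the organizers unblind the hold-out labels.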