Abstract
Target 2035 is a global initiative that aims to develop a potent and selective pharmacological modulator, such as a chemical probe, for every human protein by 2035. Here, we describe the Target 2035 roadmap to develop computational methods to improve small-molecule hit discovery, which is a key bottleneck in the discovery of chemical probes. Large, publicly available datasets of high-quality protein–small-molecule binding data will be created using affinity-selection mass spectrometry and DNA-encoded chemical library screening. Positive and negative data will be made openly available, and the machine learning community will be challenged to use these data to build models and predict new, diverse small-molecule binders. Iterative cycles of prediction and testing will lead to improved models and more successful predictions. By 2030, Target 2035 will have identified experimentally verified hits for thousands of human proteins and advanced the development of open-access algorithms capable of predicting hits for proteins for which there are not yet any experimental data.

Introduction
Chemical probes — potent, selective, cell-active small molecules targeting specific proteins — constitute some of the most impactful research tools in the life sciences arsenal, as evidenced by citations and impact on drug discovery1,2. The broader availability of chemical probes for all human proteins would greatly advance our understanding of the human proteome, as well as help prioritize potential new drug targets. In 2009, the Structural Genomics Consortium (SGC) launched a programme to assemble and invent chemical probes for human proteins related to cell signalling, protein homeostasis and epigenetics. The programme successfully developed and collected new chemical probes for over 200 unique proteins from the academic and industrial communities. The impact of these 200 chemical probes has been profound: more than 60,000 samples have been distributed to scientists around the world, they have collectively been cited at least 13,000 times as assessed by searching for the name of the probe in Google Scholar, and the discoveries they have enabled are being tested in more than 85 clinical trials.
The obligatory first step in creating a chemical probe for a new protein (or a proximity pharmacology tool such as proteolysis-targeting chimeras (PROTACs))3 is to identify a validated, chemically tractable hit. For proteins that belong to precedented classes of drug targets, hits can often be identified quite readily, either by screening focused chemical libraries that are enriched in experimentally verified structural classes4 or by making computational predictions based on pre-existing experimental data5,6,7,8,9,10. By contrast, for lesser-studied proteins, hit-finding is more challenging and is often rate determining. Currently, hit finding is almost always initiated with an experimental screen of large and diverse chemical libraries followed by time- and cost-intensive hit verification and optimization. Although the available experimental hit-finding approaches have expanded greatly over the past 20 years, there has not been a dramatic improvement in their overall success rates or cost effectiveness11,12,13. This situation underscores the need for a radically different approach in the context of the Target 2035 initiative14.
Computational methods, particularly machine learning (ML) and artificial intelligence (AI) strategies, have the greatest potential to deliver cost-effective hit-finding methods for unprecedented targets15. However, the development of hit-finding algorithms is currently limited by the lack of suitable protein–ligand datasets in the public domain16,17: the existing chemical bioactivity datasets are either fragmented across databases such as ChEMBL and PubChem or not available to the public at all18; most have been compiled from non-standardized experimental protocols that introduce noise into training data19; the datasets are not always prepared for ML/AI analysis; and most lack high-quality data on inactive compounds20.
With data paucity identified as the greatest hurdle to the development of hit-finding algorithms, our SGC/Target 2035 working group decided that the next phase of the Target 2035 initiative (2025–2030) should, as a first step, organize a programme that (1) systematically generates large experimental protein–small-molecule binding datasets and provides open access to the well-annotated data, and (2) works with the community to train, develop, refine, test and benchmark hit-finding algorithms.
A scientific and operational plan for the initiative (including target selection, data generation and dissemination, benchmarking of ML/AI predictions, success criteria, governance and funding) was discussed at face-to-face meetings in Frankfurt, Germany, in the autumn of 2023 and in London, UK, in the autumn of 2024. In this Roadmap, we consolidate the outputs from these meetings into an ambitious yet tractable plan to provide sufficient experimentally derived data to transform hit finding into a computational endeavour. We also highlight how the Target 2035 open science initiative is structured to provide ample mechanisms for the greater experimental and computational academic and industry communities to contribute and benefit.
There are conceptual parallels between our proposed approach and the development of the AlphaFold programs for protein structure prediction. The successful application of ML to protein structure prediction was empowered by massive open data generation by the structural biology and genomics communities, longstanding stewardship of the data by the Protein Data Bank and GenBank teams21 and an engaged structure prediction community, whose algorithms were benchmarked by the CASP team (Critical Assessment of Protein Structure Prediction)22 through open challenge competitions. This analogy has limits, though. The immense space of intramolecular interactions that is afforded by 20 amino acids and defines the protein-folding paradigm is relatively constrained when compared with the diversity of possible interactions between proteins and ~10^60 drug-like molecules. Clearly, novel ML strategies will be necessary for a breakthrough in AI-driven drug design, and it is not possible a priori to predict the size and diversity of the protein–ligand datasets that will be required to enable such a breakthrough, or even if it will be possible in the near term. With this caveat, it is nevertheless apparent that high-quality, large-size protein–ligand datasets will be foundational to solving the problem.
Overview of project workflow
This 5-year project will generate high-quality, open datasets describing the binding of millions to billions of small molecules to more than 2,000 diverse proteins. The data will include results from testing both experimentally derived and computationally predicted hit candidates using orthogonal biophysical and functional assays.
The project workflow is outlined in Fig. 1 and is described in more detail below. In brief, the project will:

(1) Generate purified proteins both within the project and by inviting community members to contribute purified proteins. All purified proteins would be subject to strict quality control.

(2) Generate binding data using affinity-selection mass spectrometry (AS–MS) and DNA-encoded chemical library (DEL) screening, each of which measures the binding of small molecules to purified proteins directly. AS–MS and DEL screening are also performed in a standardized way, and the outputs have associated quality metrics. Candidate small-molecule binders will be tested in secondary screens using orthogonal, high-quality biophysical assays.

(3) Make annotated primary screening data openly available in an ML/AI-ready format via a project database called AIRCHECK (Artificial Intelligence-Ready CHEmiCal Knowledge base; https://aircheck.ai/) (an illustrative record layout is sketched after this list).

(4) Challenge the ML/AI and computational chemistry communities to make predictions based on the data and organize a series of benchmarking competitions to help advance the methods.

(5) Experimentally test community predictions using biophysical methods.

(6) Share assay data from predicted binders via AIRCHECK.

(7) Share reagents, protocols, binders and data without restrictions on use.
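As a purely illustrative aid, the sketch below shows the kind of fields an ML/AI-ready screening record might contain, drawing on the elements discussed in this Roadmap (chemical structure, fingerprint, positive/negative label, protein and assay metadata). It is not the actual AIRCHECK schema; all field names and values are hypothetical.

```python
# Hypothetical sketch of a single ML/AI-ready screening record.
# This is NOT the actual AIRCHECK schema; all field names are illustrative.
from typing import List, TypedDict

class ScreeningRecord(TypedDict):
    protein_id: str         # identifier of the screened target (e.g. a UniProt accession)
    smiles: str             # chemical structure of the library member
    fingerprint: List[int]  # precomputed binary fingerprint (e.g. 2,048 bits)
    label: int              # 1 = enriched/positive, 0 = not enriched/negative
    assay: str              # "DEL" or "AS-MS"
    library_id: str         # DEL or compound-pool identifier

example: ScreeningRecord = {
    "protein_id": "P00000",
    "smiles": "CCOc1ccccc1",
    "fingerprint": [0, 1, 0, 1],  # truncated for illustration
    "label": 1,
    "assay": "DEL",
    "library_id": "LIB-EXAMPLE-01",
}
```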
Fig. 1 | The workflow for generating data and binders. (1) Purified proteins are produced in experimental hubs, by partners or by the community. (2) Proteins are screened in project screening laboratories, and data are experimentally annotated in partner laboratories or experimental hubs. (3) Quality-controlled (QC) screening data are deposited into the AIRCHECK database. (4) Computational experts in the project and the community build machine learning (ML)/artificial intelligence (AI) models and make predictions about new or improved binders. (5) Predicted compounds are procured and tested in experimental hubs. (6) The QC’d assay data, including hits and binding data, are deposited into the Artificial Intelligence-Ready CHEmiCal Knowledge base (AIRCHECK). (7) Hits and data are released to the community, freely available for further research and development.
In addition, for as many confirmed binders as feasible, co-crystallization with the cognate target would be attempted, and structure–activity relationships explored by testing structural analogues of the confirmed binders, either purchased from vendors or synthesized by collaborating chemists. Ideally, all binders would be tested in functional assays when available.
The project will have two important outcomes. First, it will generate new small-molecule binders for prioritized proteins. Second, it will create a comprehensive, well-annotated dataset to advance computational methods. This second outcome will be achieved by prioritizing data quality, data consistency and data access, and designing the experimental workflow and outputs in partnership with data scientists23.
Access to diverse high-quality proteins
To generate protein–small-molecule binding datasets of sufficient size and diversity, it will be critical to access and prosecute a structurally diverse set of purified, homogeneous and stable proteins. Given that it is not possible a priori to estimate the number of datasets that will be required to enable computational methods, for planning purposes, we have arbitrarily set a goal to screen a minimum of 2,000 different proteins over 5 years. For perspective, and to attest to the feasibility of the project, this is approximately the number of unique proteins purified within the SGC in the 5-year period between 2007 and 2012.
The selection of which proteins to screen will be guided by the requirement to maximize the structural and functional diversity of the protein targets, as well as by the desire of funders and participants to identify hits for protein targets of their immediate scientific interest. Initially, experimental tractability will be prioritized to establish and optimize project platforms, logistics, data workflows and procedures. Tractable targets are those that can be readily purified in sufficient quantities, are known to have suitable biophysical properties, and for which orthogonal assays are either already available or can be readily developed (Fig. 2). The SGC and the wider protein and structural biology communities have already produced over a thousand human proteins (or domains thereof) that meet these criteria, and these proteins should be rapidly accessible or straightforward to repurify. A graphical representation of ~400 proteins already purified at the SGC is included in Supplementary Fig. 1 and a snapshot of the protein database in Supplementary Table 1. As the project progresses, the number of never-before purified proteins and proteins that are more technically challenging to produce will be increased.
Fig. 2 | The protein pipeline. The pipeline will comprise proteins that have been produced previously by project participants, new proteins that are nominated by the funders, and proteins contributed by experts in the wider community. From these proteins, the project will create a Target List that integrates structural and ligand-binding pocket diversity, funder interests and scientific priorities. Proteins produced previously will be prioritized at the outset to focus on logistics and to generate data, and never previously produced proteins will be added as the project progresses. PDB, Protein Data Bank; SGC, Structural Genomics Consortium.
Protein production
To ensure protein quality and consistency, protein quality criteria have been established and implemented (an exemplar is shown in Supplementary Fig. 2) and the majority of the proteins will be produced in a handful of geographically distributed protein purification hubs that share methodologies. These hubs will probably be organized around protein families and/or scientific themes. To attract a wider diversity of protein targets, experts in the community would be invited to contribute purified proteins that meet the diversity and quality criteria. The incentives for community members to contribute proteins will be to access high-quality chemical screening capabilities and to be able to pursue any small-molecule hits identified in the screens, without precondition (https://public.thesgc.org/protein_registry/protein_intake.php).
Protein–ligand open data generation
All purified proteins that pass quality control will be screened for binders within large chemical libraries. Screens will be carried out in academic or industry hubs, selected for having track records in high-quality data generation. The distribution of proteins to and among the screening hubs will be centrally coordinated to avoid duplication of effort.
A key strategic decision was to select the data-generating modalities. Platform(s) that screen for direct binding of ligands to purified proteins would be implemented, for the following reasons:
(1) Direct-binding assays eliminate the impractical requirement to develop bespoke functional assays for each protein, including the thousands of human proteins with no known activity. For proteins with known function, a functional assay might aid in the hit verification process, and in the further advancement of the chemical matter.

(2) A single preparation of purified protein can be used both for the primary binding screen used for hit identification as well as for the secondary orthogonal biophysical assays24 used for hit verification.

(3) Screening campaigns could begin immediately, using the many hundreds of human proteins that have been purified or can be readily purified in high quality and quantity, by the SGC, by industry, and by the wider protein and structural biology academic communities.
After considering many screening platforms, DEL25,26,27,28 and AS–MS29,30,31 were chosen. These two biophysical screening methods have been used successfully for a wide variety of proteins, have the potential to generate millions of high-quality data points per screen and have already demonstrated efficient hit-finding results in our hands for diverse proteins. In addition, data generated by these methods have a common experimental design and can be represented in a machine-readable format and aggregated into increasingly large datasets. The large size and high dimensionality of these data are also well matched to analytical techniques that have been extensively developed by the ML/AI community32,33 and employed in cheminformatics for drug discovery applications34.
DEL screening
DEL screening is an affinity-mediated technique that has been used for more than two decades as a tool for identifying compounds that bind proteins35,36,37,38. In this technology, pools of compounds, each covalently attached to an oligonucleotide whose sequence encodes its synthetic history (and therefore the presumed compound identity), are incubated with the protein. Proteins are then captured using an affinity tag, and associated library members are separated from non-binders by washing. The DNA encoding the retained library members is then amplified and sequenced, allowing the synthetic history of each compound and its enrichment over the background to be determined. Historically, enriched library compounds were resynthesized off the DNA and tested for binding or activity in an orthogonal assay. The technology allows for probing an enormous chemical library (>1 trillion members), but it has limitations: the presence of DNA induces many false positives, synthesizing the many potential binders off-DNA is time consuming and costly, and the chemical diversity of the library members is restricted by the requirement to use reactions that are compatible with the presence of DNA.
Some of these limitations can be overcome by integrating ML/AI with DEL screening. In this iteration of DEL screening data analysis, the datasets, comprising billions of data points and including both positive and (critically) negative binding data, are used to train ML algorithms and build models to predict the molecular features of a binder26,28,39. These algorithms are then used to search for active molecules within the billions of commercially available compounds or compound collections internal to organizations. The compounds are then acquired and tested for binding to the purified proteins using orthogonal binding and/or functional assays. This strategy offers several potential advantages. First, it is faster and less expensive for most investigators to purchase molecules40 than to synthesize each of the enriched library compounds. Second, predictions are not restricted to the molecules in the DEL but can be made against the large, diverse, and more drug-like chemical space represented in pre-enumerated, synthetically accessible commercial libraries.
This conceptual DEL ML workflow was pioneered by McCloskey et al.26 using three precedented targets, and the scalability and generalizability of the approach have been subsequently confirmed27,28,41,42. These encouraging results emboldened us to imagine a scaled-up process in which DEL screening datasets from hundreds to thousands of proteins, including detailed protocols and metadata, would be provided to the academic and industry communities without restriction in a standardized, ML-ready format43,44. By providing open access to these data (aircheck.ai), the ML/AI community would be enabled to make predictions that can be tested experimentally and to develop methods that can be benchmarked (Fig. 3). In the first datasets in AIRCHECK, the data include a 10:1 ratio of negative to positive training examples, and up to 1 million data points. Negative training examples were distributed in proportion to positive training examples on a per-library basis. These data have already been used to build models that have successfully predicted new micromolar binders for the WDR91 protein45.
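As a rough illustration of this DEL-to-ML loop, the minimal sketch below trains a baseline classifier on fingerprints derived from DEL enrichment labels (with an approximately 10:1 negative:positive ratio) and uses it to rank commercially available compounds for purchase and orthogonal testing. The file names, column layout and choice of model are assumptions for illustration; they do not represent the AIRCHECK format or any specific published model.

```python
# Illustrative sketch of the DEL -> ML -> prediction loop (assumptions noted above).
# Input: a CSV with 'smiles' and 'enriched' columns (1 = enriched, 0 = not enriched).
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def fingerprint(smiles, radius=2, n_bits=2048):
    """Morgan (ECFP-like) bit-vector fingerprint, or None if the SMILES fails to parse."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return list(AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits))

# Assemble training data with roughly a 10:1 negative:positive ratio,
# mirroring the composition described for the first AIRCHECK datasets.
del_data = pd.read_csv("del_screen_example.csv")          # hypothetical file
pos = del_data[del_data["enriched"] == 1]
neg = del_data[del_data["enriched"] == 0].sample(n=10 * len(pos), random_state=0)
train = pd.concat([pos, neg])

feats, labels = [], []
for smi, label in zip(train["smiles"], train["enriched"]):
    fp = fingerprint(smi)
    if fp is not None:
        feats.append(fp)
        labels.append(label)

# A simple baseline classifier; models trained on billions of DEL data points
# would use more sophisticated architectures.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(feats, labels)

# Score commercially available compounds and nominate top-ranked candidates
# for procurement and orthogonal biophysical testing.
catalogue = pd.read_csv("commercial_catalogue_example.csv")  # hypothetical 'smiles' column
scored = [
    (smi, model.predict_proba([fp])[0][1])
    for smi in catalogue["smiles"]
    if (fp := fingerprint(smi)) is not None
]
top_picks = sorted(scored, key=lambda t: t[1], reverse=True)[:500]
```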
Fig. 3 | Affinity-mediated selection of DNA-encoded chemical library (DEL) members leads to the enrichment of potential binders. Deep sequencing is used to identify the DNA barcodes of enriched and unenriched DEL members. Output data are subsequently translated into chemical structures and their corresponding chemical fingerprints. Both positive (enriched) and negative (not enriched) DEL members are included in open machine learning (ML)-ready datasets. The datasets are used to train ML models that are in turn used to recognize and nominate potential small-molecule binders from ultralarge chemical libraries. These compounds are procured, and their binding is tested experimentally in biophysical and/or biochemical assays. All generated data (ML-ready datasets, including chemical structures and/or their corresponding fingerprints, ML models and ligand validation data) are made public in a purpose-built, cloud-based storage system called Artificial Intelligence-Ready CHEmiCal Knowledge base (AIRCHECK). AI, artificial intelligence.
Initially, the DEL screens will be carried out in selected organizations that have a track record of success in applying ML to their DEL data43,44. Over time, any other company or academic that has robust DEL synthesis and screening infrastructure and that agrees to share relevant data openly and in a standardized, ML-ready format43,44 would be welcome to join the initiative.
AS–MS
AS–MS has emerged as a robust hit identification approach in the pharmaceutical industry46. In this method, pools of mass-differentiated compounds, typically up to 2,000, are first incubated with the protein. The protein and small molecules are then resolved chromatographically, and compounds that co-elute with the protein are subjected to liquid chromatography–mass spectrometry and unambiguously identified by their exact masses. Compound binding is then verified using an orthogonal functional or binding assay(s). The current upper limit of detection for compounds in most AS–MS platforms is an affinity constant in the 1–15 micromolar range46.
With some notable exceptions47,48,49,50, AS–MS has not been widely adopted as a small-molecule screening platform in academia, in part due to the significant infrastructure that is required, but mostly because cost-effective use of the infrastructure requires a pipeline of purified proteins in multi-milligram quantities. Given the ability to access thousands of purified proteins in these quantities in this project, AS–MS was prioritized as a screening platform (Fig. 4). To optimize screening capacity and throughput, we have elected to implement an off-line AS–MS method that screens affinity-tagged proteins (His, GFP or biotin) against pools of compounds, and then resolves the protein–compound complexes from the non-binding compounds by binding the tagged protein to the corresponding magnetic affinity microbeads50. This pipeline was piloted by screening a diverse set of 31 proteins against a small chemical library explicitly optimized for mass spectrometry screening, and binders were discovered for 11 proteins51.
Fig. 4 | From left to right: protein affinity purification by pooling 500 compounds in each affinity-selection mass spectrometry (AS–MS) sample; liquid chromatography–mass spectrometry (LC–MS) analysis of AS–MS samples; automatic data processing to identify hits; validation of hits using orthogonal biophysical methods (for example, surface plasmon resonance); and uploading AS–MS data to the Artificial Intelligence-Ready CHEmiCal Knowledge base (AIRCHECK) database, which is freely accessible to the whole community.
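To make the AS–MS hit-calling step concrete, the sketch below matches exact masses observed in the affinity-selected (protein-bound) fraction to the expected monoisotopic masses of the pooled compounds within a mass tolerance. The pool contents, the 5 ppm tolerance and the data structures are illustrative assumptions rather than project parameters, and real pipelines would also subtract a no-protein control.

```python
# Illustrative AS-MS hit calling: match observed exact masses to pool members
# within a ppm tolerance (toy values; not project settings).

def ppm_error(observed, expected):
    """Mass error in parts per million."""
    return abs(observed - expected) / expected * 1e6

def call_hits(observed_masses, pool, tolerance_ppm=5.0):
    """Return pool members whose monoisotopic mass matches an observed mass.

    observed_masses : neutral monoisotopic masses detected by LC-MS in the
                      affinity-selected sample
    pool            : dict mapping compound ID -> expected monoisotopic mass
    """
    hits = []
    for compound_id, expected in pool.items():
        if any(ppm_error(m, expected) <= tolerance_ppm for m in observed_masses):
            hits.append(compound_id)
    return hits

# Toy example; real pools contain hundreds of mass-differentiated compounds.
pool = {"CMPD-001": 312.1474, "CMPD-002": 385.2103, "CMPD-003": 298.1317}
observed = [312.1478, 441.2876]
print(call_hits(observed, pool))  # -> ['CMPD-001']
```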
The primary binding data and metadata from both DEL and AS–MS screens, as well as the results from secondary biophysical assays are now being placed into AIRCHECK without restriction on use. Raw mass spectrometry data will also be made available via Metabolomics Workbench (https://www.metabolomicsworkbench.org/) or a similar vehicle.
Annotation and verification of screening data
Because a priority is to generate screening datasets for ML/AI applications, particular attention will be paid to data quality, data annotation and data availability, drawing on the experience of our industry partners and other public initiatives52. Data quality standards will be made openly available and implemented at three key levels: for the protein samples, for the DEL and AS–MS screening outputs, and for the hit annotation.
Proteins
Proteins entering screens must meet established experimental quality criteria and must also be accompanied by key metadata that might influence data interpretation and model building, such as purification conditions or the presence of metal ions.
Screening datasets
Primary AS–MS and DEL screening-derived datasets will be assessed for technical quality against a set of relevant parameters (Supplementary Fig. 3). For public DEL and AS–MS screens that pass quality checks, all the raw screening data will be placed into the public domain.
Secondary annotation of primary screening data
Both experimental screening platforms will generate false positive and false negative hits, and to maintain the quality of the datasets for ML/AI applications, true and false positives must be distinguished using orthogonal assays53. This is technically challenging because weaker binding compounds are often insoluble at the concentrations used for many biophysical or functional assays54,55, which readily leads to artefacts in any single assay. As a result, many candidate binders may have to be tested in several different assays to gain sufficient confidence in their veracity.
Given the technical challenges in analysing weakly binding compounds, it will be critical to agree on how much effort the project should invest in determining if a screening hit is a true binder and to communicate the limitations of each of the assays54,55 and the resulting data to the modelling community. The strategic decision is how to balance annotating the largest number of true positives in the dataset, which is optimal for model building and also provides practical and valuable insight into the ligandability of a protein, with investing considerable resources in characterizing weakly binding compounds, which reduces the number of proteins that can be screened. The CACHE competition has created a document that explains how to interpret the biophysical binding assays and how to identify potential artefacts56. Constant and close discussion between experimentalists and data scientists in the project will minimize misinterpretation or over-interpretation of the screening and hit-characterization data.
For this project, a generous threshold would be applied when nominating hits from the initial screen. A target affinity threshold of 10 µM (KD value) would be set for the orthogonal assay, potentially with some target-specific leeway20. Ideally, all candidate hits arising from the first orthogonal assay would be tested in an additional assay. The outcome will be a robust and inclusive list of well-annotated positive binders with a KD of ≤10 µM.
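The sketch below illustrates this annotation rule under stated assumptions: only the ≤10 µM threshold and the preference for a second orthogonal assay are taken from the text, whereas the record fields and labels are hypothetical.

```python
# Sketch of the hit-annotation rule described above (field names are hypothetical).
from dataclasses import dataclass
from typing import Optional

KD_THRESHOLD_UM = 10.0  # target affinity threshold for the orthogonal assay

@dataclass
class CandidateHit:
    compound_id: str
    kd_um_assay1: Optional[float]  # e.g. SPR-derived KD, in micromolar
    kd_um_assay2: Optional[float]  # e.g. spectral shift or MST, if run

    def annotation(self) -> str:
        if self.kd_um_assay1 is None or self.kd_um_assay1 > KD_THRESHOLD_UM:
            return "not confirmed"
        if self.kd_um_assay2 is not None and self.kd_um_assay2 <= KD_THRESHOLD_UM:
            return "confirmed in two orthogonal assays"
        return "confirmed in one orthogonal assay"

print(CandidateHit("CMPD-001", 4.2, 6.8).annotation())
```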
Data consistency
To prioritize data consistency, secondary screening and data annotation will be centralized in well-equipped and experienced academic or commercial laboratories that follow standard operating procedures. Samples will be exchanged regularly among laboratories and tested to monitor and eliminate any inter-laboratory variability. These laboratories will have access to a range of orthogonal assay formats including some form of surface binding assay such as surface plasmon resonance57 or grating-coupled interferometry58, and other biophysical methods with reasonable throughput, such as spectral shift, microscale thermophoresis, NMR or thermal shift methodologies59,60,61. One of the complexities of the project is that for many of the novel proteins that will be screened, orthogonal assays will have to be built without the benefit of a positive control binder. If functional assays that confirm target modulation are readily available, they would add another layer of verification to the hit-confirmation process and provide invaluable insight into how to develop the ligand into a chemical probe.
Data management and access
To fully realize the value of the annotated protein–ligand datasets, data management will be treated with the same diligence as the experimental methods. Accordingly, the project will adhere to the data management roadmap recently described by Edfeldt and colleagues23. This will include establishing a controlled vocabulary for experimental data, using automation and electronic laboratory notebooks whenever possible, centralizing the database architecture to facilitate data integration and providing comprehensive documentation. Raw data will be provided whenever possible, and data processing will be transparent and reproducible, including choosing the most relevant data representation, defining the right training and test sets, and providing estimates of prediction uncertainty. The comprehensive data management plan and its attributes are outlined in Table 1.
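As one example of "defining the right training and test sets", the minimal sketch below performs a Bemis–Murcko scaffold split, a common way to test whether a ligand-based model generalizes beyond the chemotypes it was trained on. The split fraction and the plain-SMILES input are illustrative assumptions, not project specifications.

```python
# Minimal sketch of a Bemis-Murcko scaffold split for train/test set definition.
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_fraction=0.2):
    """Group molecules by Bemis-Murcko scaffold and assign whole scaffold
    families to either the training or the test set."""
    groups = defaultdict(list)
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparsable structures
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol)
        groups[scaffold].append(smi)

    # Fill the training set with the largest scaffold families first; the
    # remaining, rarer scaffolds form a lower-similarity test set.
    train, test = [], []
    train_target = (1.0 - test_fraction) * len(smiles_list)
    for _, members in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        (train if len(train) < train_target else test).extend(members)
    return train, test

train_smiles, test_smiles = scaffold_split(
    ["CCO", "CCN", "c1ccccc1O", "c1ccccc1N", "c1ccc2ccccc2c1"]
)
```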
Benchmarking with experimental feedback
The intention of providing large, consistent and high-quality datasets to the community is to enable the development of computational and ML/AI hit-finding and hit-optimization methods. In the near term, models will focus on predicting binders and optimization strategies for proteins in the screening set; in the longer term, the aim is to build foundation models of hit discovery and optimization.
To accelerate the development of these methods, the project will partner with organizations, including CASP, DREAM62 and CACHE15, that launch benchmarking challenges, including those in which predictions from the community will be tested experimentally and compared. Data used as input to challenges would be kept confidential while challenges are in progress, and a regular cadence of challenges and data release would be established. The value of benchmarking initiatives in computational biology was clearly established by CASP, which, for over 30 years, has driven and monitored progressive improvements in computational methods63,64.
Some of the proposed initial benchmarking challenges are listed in Table 2. As the project advances, other types of benchmarking challenges would probably be incorporated, including those that involve combining data from multiple platforms, not only from AS–MS and DEL screening but also from novel hit-finding screening platforms that may arise in the future. A combination of challenges that better represent a typical drug discovery screening pipeline may also have added value, including those that integrate some form of experimental or computational protein structural information. However, as even relatively simple challenges require significant logistics and the associated experimental costs are high, running more elaborate pipelines at the start of the project is probably too ambitious.
Participants will be encouraged to make their models open source and freely available to anyone for use directly from AIRCHECK. To encourage this, the costs of procuring compounds and testing them experimentally would ideally be defrayed, partly or in full, for qualified participants who make their ML/AI models publicly available under permissive licenses.
From pilots to implementation
Pilot projects have laid foundational elements for this project. The capacities are now in place to (1) produce more than 2,000 high-quality human proteins, most ‘never previously liganded’; (2) screen these proteins against compound libraries using AS–MS and DEL; (3) store and disseminate project data, with a robust data management plan and database architecture; (4) annotate screening data and test predictions; and (5) solicit community contributions and participation.
In the first year of the project, the individual elements will be scaled and integrated to create a data generation plan that balances the shorter-term goal of identifying hits for high-priority proteins with the longer-term goal of generating data that will advance computational hit finding. The most likely screening cascade will involve screening each protein first by AS–MS against an exploratory (~15k) library whose composition will be made openly available. The rationale is that this screen is scalable, yields a direct binding readout, is the most cost effective and will most rapidly identify those proteins that are readily ‘ligandable’. The exploratory screen will also flag proteins with physicochemical properties that render them unsuitable for AS–MS or DEL; these proteins will not be screened further. For example, the exploratory screen would flag proteins that appear stable but that in fact have transiently unfolded regions that may bind large numbers of compounds nonspecifically.
Stable and monodisperse proteins that do not yield hits from the AS–MS exploratory screen, or for which greater chemical diversity or large datasets are required, will be channelled into screens with larger chemical libraries, using both AS–MS and DEL. The proposed screening cascade will be reviewed periodically and adjusted to optimize the process or incorporate other screening approaches as needed.
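The sketch below captures the proposed cascade logic in simplified form. The decision criteria follow the text (stability and monodispersity, nonspecific binding flags and exploratory-screen outcome), whereas the data structure and field names are placeholders.

```python
# Simplified sketch of the proposed screening cascade (illustrative fields).
from dataclasses import dataclass

@dataclass
class Protein:
    name: str
    stable_and_monodisperse: bool  # passes biophysical quality control
    nonspecific_binding: bool      # e.g. transiently unfolded regions that bind
                                   # large numbers of compounds nonspecifically
    exploratory_hits: int          # verified hits from the ~15k AS-MS screen

def next_step(protein: Protein) -> str:
    if protein.nonspecific_binding or not protein.stable_and_monodisperse:
        return "exclude from further AS-MS/DEL screening"
    if protein.exploratory_hits > 0:
        return "annotate hits; screen larger libraries only if more data are needed"
    return "screen against larger chemical libraries by AS-MS and DEL"

print(next_step(Protein("example target", True, False, 0)))
```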
Encouraging community contributions
Active participation of the wider scientific community will be essential to meet the project goals. Robust community engagement will be made feasible only by adopting open science principles within the project. For clarity, this means compounds, data and algorithms developed using project resources will be made available without restriction on use, and without intellectual property constraints. This open science position provides a clarity of purpose and short-circuits what could be prolonged and complex discussions over ownership of compounds and algorithms. In keeping with this position, there will also be no restrictions on subsequent research or commercial use of data, chemical structures and algorithms generated using project resources. With this as background, community contributions in the following areas are envisioned (Table 3).
Protein scientists
Structural biologists, and protein scientists more broadly, often have unique expertise in purifying proteins in their scientific areas of interest. Community members would be encouraged to contribute their purified proteins to the screening process. For the project, this will expand the diversity of the protein–ligand datasets. For the contributing scientist, this could provide open access to hits that they can pursue without restriction in their own laboratories. Already, more than 30 protein scientists from countries including Brazil, the UK, Canada, Germany, Sweden and the USA have sent proteins to Toronto for AS–MS screening, and binders for 8 of these community proteins have already been identified, verified by surface plasmon resonance and shared with the contributor (for example, Wang et al.51). Tapping into this diverse community at a larger scale will bring enormous scientific benefit, but will also add logistical burdens, so the project will need to implement this process carefully.
Data generation
Project screening data would be generated initially using the AS–MS and DEL screening platforms in selected hubs. However, there are clear advantages to expanding the number of participating screening laboratories, and the range of data generation technologies. Accordingly, new screening methodologies would be explored on a continual basis. To manage this process, a set of ~25 well-characterized, diverse and ligandable proteins that will have been screened comprehensively through all the initial platforms will serve as a technology test set for new screening hubs or technologies. The project board and its scientific advisers will review all data and provide recommendations about adding new centres or technologies.
Engaging computational scientists worldwide
Each screen will generate multiple GB-scale datasets, which may need to be downloaded and manipulated. The use of cloud resources will ensure the scalability of the AIRCHECK platform while allowing users to easily access the data and the computational resources for ML/AI modelling. It also allows users to leverage education or research credits from large cloud providers to support more equitable, diverse and inclusive access (for example, Google Cloud program for higher education in Africa65). Scientists from resource-poor environments will be actively encouraged to participate. We will also facilitate the development of open-source algorithms by collaborating closely with a project-associated global network of computational scientists, called MAINFRAME66.
Chemists
The synthetic and medicinal chemistry communities will be encouraged (for example, through the SGC’s Open Chemistry Networks) to design and/or generate molecules related to the hits to improve the original binders. Testing these compounds within the project may generate preliminary structure–activity relationships and provide confidence that the binder can be advanced. Chemists will also be encouraged to contribute compounds that are theoretically accessible through their chemistries to the emerging virtual screening library of all compounds that are synthetically accessible67.
Training and networking
The project will be generating data explicitly to promote the development of ML/AI algorithms and as such will be operating at the intersection of experimental, data and computational sciences. This will provide an excellent training environment for scientists seeking a working and operational knowledge of the various domains, and programmes for trainees will be established. Regular project meetings that prioritize scientific exchange between the various communities will be established.
Project structure and governance
The project will be structured as a pre-competitive, open science partnership in which compound assay data generated with project resources, including chemical structures of confirmed hits and algorithms, will be made available to the public under a license that requires attribution but that places no restriction on subsequent use. As stated previously, the rationale is pragmatic and evidence-based: pragmatic, in that it would be almost impossible to imagine a seamless cross-sectoral, cross-disciplinary and multinational collaboration that could operate under an agreement that allowed for the protection of potential intellectual property; and evidence-based, in that the development of ML/AI algorithms, in whatever field, advances most rapidly when provided with open data and with a mechanism to benchmark progress transparently68.
The project needs to involve scientists from both public and private sectors to access the wide range of skill sets and expertise that will be required. It will also involve funding from both public and private sectors to achieve the requisite scale (Fig. 5). The major funders from the public and private sectors will form a governing board that oversees all project activities, including financial, scientific and management matters. The governing board will also oversee risk management, including any potential security risks associated with the data and the algorithms developed in the project. The governing board will be mandated to balance the needs of private sector funders with those of the public sector and its funding bodies, and also to provide a fair and time-limited mechanism for project or community contributors to pursue selected scientific questions. The governance structure that is currently used by the SGC is suitable because it has been used successfully to govern mission-oriented public–private partnerships of this complexity and scale69.
Fig. 5 | The project is designed as a public–private partnership. The governance structure is designed to ensure efficient operation, strategic alignment and excellence in research. It integrates inputs from both public and private sector funders to direct a multitiered management system composed of specialized committees.
A range of outcomes
The long-term aim of this project is to develop efficient computational hit-finding algorithms that can be used to generate freely available, small-molecule binders initially for thousands of proteins, and eventually for all relevant human proteins. However, over the course of the project, intermediate outcomes of considerable value will be generated, and these outcomes should be used as metrics to track and manage the project. Some of the key metrics are listed in Table 4.
A range of benefits to all participants
Open-access public–private partnerships are well suited to projects that require skills distributed among a wide range of academic and industry scientists, that tackle problems spanning the boundary of public and private interests, and that might otherwise be crippled by intellectual property negotiations. However, in return for ceding their potential intellectual property rights to the public good, funders and participants must feel that they gain more than they lose, directly or indirectly. Table 5 lists some of the benefits that this project will generate for participants.
References
Edwards, A. M. et al. Too many roads not taken. Nature 470, 163–165 (2011).
Moustakim, M. et al. Target identification using chemical probes. Methods Enzymol. 610, 27–58 (2018).
Bond, M. J. & Crews, C. M. Proteolysis targeting chimeras (PROTACs) come of age: entering the third decade of targeted protein degradation. RSC Chem. Biol. 2, 725–742 (2021).
Kanev, G. K., de Graaf, C., Westerman, B. A., de Esch, I. J. P. & Kooistra, A. J. KLIFS: an overhaul after the first 5 years of supporting kinase research. Nucleic Acids Res. 49, D562–D569 (2021).
Bender, B. J. et al. A practical guide to large-scale docking. Nat. Protoc. 16, 4799–4832 (2021).
Petrović, D. et al. Virtual screening in the cloud identifies potent and selective ROS1 kinase inhibitors. J. Chem. Inf. Model. 62, 3832–3843 (2022).
Alon, A. et al. Structures of the σ2 receptor enable docking for bioactive ligand discovery. Nature 600, 759–764 (2021).
Stein, R. M. et al. Virtual discovery of melatonin receptor ligands to modulate circadian rhythms. Nature 579, 609–614 (2020).
Ren, F. et al. A small-molecule TNIK inhibitor targets fibrosis in preclinical and clinical models. Nat. Biotechnol. 43, 63–75 (2025).
Lyu, J. et al. Ultra-large library docking for discovering new chemotypes. Nature 566, 224–229 (2019).
Schneider, P. et al. Rethinking drug design in the artificial intelligence era. Nat. Rev. Drug Discov. 19, 353–364 (2020).
Zhu, T. et al. Hit identification and optimization in virtual screening: practical recommendations based on a critical literature analysis. J. Med. Chem. 56, 6560–6572 (2013).
Schneider, G. Virtual screening: an endless staircase? Nat. Rev. Drug Discov. 9, 273–276 (2010).
Carter, A. J. et al. Target 2035: probing the human proteome. Drug Discov. Today 24, 2111–2115 (2019).
Ackloo, S. et al. CACHE (Critical assessment of computational hit-finding experiments): a public–private partnership benchmarking initiative to enable the development of computational methods for hit-finding. Nat. Rev. Chem. 6, 287–295 (2022).
For chemists, the AI revolution has yet to happen. Nature 617, 438 (2023).
Mock, M., Edavettal, S., Langmead, C. & Russell, A. AI can help to speed up drug discovery — but only if we give it the right data. Nature 621, 467–470 (2023).
Martin, E. J. et al. All-assay-max2 pQSAR: activity predictions as accurate as four-concentration IC50s for 8558 novartis assays. J. Chem. Inf. Model. 59, 4450–4459 (2019).
Landrum, G. A. & Riniker, S. Combining IC50 or Ki values from different sources is a source of significant noise. J. Chem. Inf. Model. 64, 1560–1567 (2024).
Martin, E. J. & Zhu, X. W. Collaborative profile-QSAR: a natural platform for building collaborative models among competing companies. J. Chem. Inf. Model. 61, 1603–1616 (2021).
Zardecki, C., Dutta, S., Goodsell, D. S., Voigt, M. & Burley, S. K. RCSB Protein Data Bank: a resource for chemical, biochemical, and structural explorations of large and small biomolecules. J. Chem. Educ. 93, 569–575 (2016).
Moult, J., Pedersen, J. T., Judson, R. & Fidelis, K. A large‐scale experiment to assess protein structure prediction methods. Proteins 23, ii–v (1995).
Edfeldt, K. et al. A data science roadmap for open science organizations engaged in early-stage drug discovery. Nat. Commun. 15, 5640 (2024).
Thorne, N., Auld, D. S. & Inglese, J. Apparent activity in high-throughput screening: origins of compound-dependent assay interference. Curr. Opin. Chem. Biol. 14, 315–324 (2010).
Clark, M. A. et al. Design, synthesis and selection of DNA-encoded small-molecule libraries. Nat. Chem. Biol. 5, 647–654 (2009).
McCloskey, K. et al. Machine learning on DNA-encoded libraries: a new paradigm for hit finding. J. Med. Chem. 63, 8857–8866 (2020).
Li, A. S. M. et al. Discovery of nanomolar DCAF1 small molecule ligands. J. Med. Chem. 66, 5041–5060 (2023).
Ahmad, S. et al. Discovery of a first-in-class small-molecule ligand for WDR91 using DNA-encoded chemical library selection followed by machine learning. J. Med. Chem. 66, 16051–16061 (2023).
Kelly, M. A., McLellan, T. J. & Rosner, P. J. Strategic use of affinity-based mass spectrometry techniques in the drug discovery process. Anal. Chem. 74, 1–9 (2002).
Prudent, R., Annis, D. A., Dandliker, P. J., Ortholand, J. Y. & Roche, D. Exploring new targets and chemical space with affinity selection-mass spectrometry. Nat. Rev. Chem. 5, 62–71 (2021).
Gesmundo, N. J. et al. Nanoscale synthesis and affinity ranking. Nature 557, 228–232 (2018).
L’Heureux, A., Grolinger, K., Elyamany, H. F. & Capretz, M. A. M. Machine learning with big data: challenges and approaches. IEEE Access. 5, 7776–7797 (2017).
Najafabadi, M. M. et al. Deep learning applications and challenges in big data analytics. J. Big Data 2, 1 (2015).
Lo, Y. C., Rensi, S. E., Torng, W. & Altman, R. B. Machine learning in chemoinformatics and drug discovery. Drug Discov. Today 23, 1538–1546 (2018).
Brenner, S. & Lerner, R. A. Encoded combinatorial chemistry. Proc. Natl Acad. Sci. USA 89, 5381–5383 (1992).
Melkko, S., Dumelin, C. E., Scheuermann, J. & Neri, D. Lead discovery by DNA-encoded chemical libraries. Drug Discov. Today 12, 456–471 (2007).
Gironda-Martínez, A., Donckele, E. J., Samain, F. & Neri, D. DNA-encoded chemical libraries: a comprehensive review with successful stories and future challenges. ACS Pharmacol. Transl. Sci. 4, 1265–1279 (2021).
Peterson, A. A. & Liu, D. R. Small-molecule discovery through DNA-encoded libraries. Nat. Rev. Drug Discov. 22, 699–722 (2023).
Lim, K. S. et al. Machine learning on DNA-encoded library count data using an uncertainty-aware probabilistic loss function. J. Chem. Inf. Model. 62, 2316–2331 (2022).
Tingle, B. I. et al. ZINC-22 — a free multi-billion-scale database of tangible compounds for ligand discovery. J. Chem. Inf. Model. 63, 1166–1176 (2023).
Ackloo, S. et al. A target class ligandability evaluation of WD40 repeat-containing proteins. J. Med. Chem. 68, 1092–1112 (2024).
Han, S. et al. Highly selective novel heme oxygenase-1 hits found by DNA-encoded library machine learning beyond the DEL chemical space. ACS Med. Chem. Lett. 15, 1456–1466 (2024).
SGC and HitGen announce research collaboration focused on DNA-encoded library based drug discovery. HitGen https://www.hitgen.com/en/news-details-319.html (2023).
X-chem and structural genomics consortium enter into collaboration to unlock the human proteome and promote open science. X-Chem https://www.x-chemrx.com/about/news/x-chem-and-structural-genomics-consortium-enter-into-collaboration-to-unlock-the-human-proteome-and-promote-open-science/ (2023).
Wellnitz, J. et al. Enabling open machine learning of DNA encoded library selections to accelerate the discovery of small molecule protein binders. Preprint at https://doi.org/10.26434/chemrxiv-2024-xd385 (2024).
Prudent, R., Lemoine, H., Walsh, J. & Roche, D. Affinity selection mass spectrometry speeding drug discovery. Drug Discov. Today 28, 103760 (2023).
Xin, Y. et al. Affinity selection of double-click triazole libraries for rapid discovery of allosteric modulators for GLP-1 receptor. Proc. Natl Acad. Sci. USA 120, e2220767120 (2023).
Liu, J. et al. The omega-3 hydroxy fatty acid 7(S)-HDHA is a high-affinity PPARα ligand that regulates brain neuronal morphology. Sci. Signal. 15, eabo1857 (2022).
Zhang, P. et al. Development of an α-klotho recognizing high-affinity peptide probe from in-solution enrichment. JACS Au 4, 1334–1344 (2024).
Muchiri, R. N. & van Breemen, R. B. Affinity selection–mass spectrometry for the discovery of pharmacologically active compounds from combinatorial libraries and natural products. J. Mass Spectrom. 56, e4647 (2021).
Wang, X. et al. Enantioselective protein affinity selection mass spectrometry (EAS-MS). Preprint at https://doi.org/10.1101/2025.01.17.633682 (2025).
Paillard, G. et al. The ELF Honest Data Broker: informatics enabling public–private collaboration in a precompetitive arena. Drug Discov. Today 21, 97–102 (2016).
Quancard, J. et al. The European Federation for Medicinal Chemistry and Chemical Biology (EFMC) best practice initiative: hit generation. ChemMedChem 18, e202300002 (2023).
Giannetti, A. M., Koch, B. D. & Browner, M. F. Surface plasmon resonance based assay for the detection and characterization of promiscuous inhibitors. J. Med. Chem. 51, 574–580 (2008).
Rich, R. L. & Myszka, D. G. Grading the commercial optical biosensor literature — class of 2008: ‘The Mighty Binders’. J. Mol. Recognit. 23, 1–64 (2010).
Understanding SPR data. Critical Assessment of Computational Hit-Finding Experiments (CACHE) https://cache-challenge.org/sites/default/files/downloadable/forms/understanding_SPR_data.pdf (2024).
Wood, R. W. XLII. On a remarkable case of uneven distribution of light in a diffraction grating spectrum. Lond. Edinb. Dubl. Phil. Mag. J. Sci. 4, 396–402 (1902).
Kartal, Ö., Andres, F., Lai, M. P., Nehme, R. & Cottier, K. waveRAPID — a robust assay for high-throughput kinetic screens with the creoptix WAVEsystem. SLAS Discov. 26, 995–1003 (2021).
Niesen, F. H., Berglund, H. & Vedadi, M. The use of differential scanning fluorimetry to detect ligand interactions that promote protein stability. Nat. Protoc. 2, 2212–2221 (2007).
Sparks, R. P. & Fratti, R. in Methods in Molecular Biology (ed. Fratti, R.) 1860, 191–198 (2019).
Langer, A. et al. A new spectral shift-based method to characterize molecular interactions. Assay Drug Dev. Technol. 20, 83–94 (2022).
Meyer, P. & Saez-Rodriguez, J. Advances in systems biology modeling: 10 years of crowdsourcing DREAM challenges. Cell Syst. 12, 636–653 (2021).
Tunyasuvunakool, K. et al. Highly accurate protein structure prediction for the human proteome. Nature 596, 590–596 (2021).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Manoharan, F. Google cloud expands higher education credits to 8 countries in Africa. Google Cloud https://cloud.google.com/blog/topics/public-sector/google-cloud-expands-higher-education-credits-8-countries-africa/ (2022).
MAchine learning Innovation Network For Research to Advance MEdicinal chemistry. MAINFRAME https://www.aircheck.ai/mainframe (2025).
Bedart, C. et al. The pan-Canadian chemical library: a mechanism to open academic chemistry to high-throughput virtual screening. Sci. Data 11, 597 (2024).
Burley, S. K. & Berman, H. M. Open-access data: a cornerstone for artificial intelligence approaches to protein structure prediction. Structure 29, 515–520 (2021).
Edwards, A. Reproducibility: team up with industry. Nature 531, 299–301 (2016).
Mammoliti, A. et al. Orchestrating and sharing large multimodal data for transparent and reproducible research. Nat. Commun. 12, 5797 (2021).
Accessibility principles. Web Accessibility Initiative (WAI) https://www.w3.org/WAI/fundamentals/accessibility-principles/ (2024).
Acknowledgements
The SGC is a registered charity (no. 1097737) that receives funds from Bayer AG, Boehringer Ingelheim, Bristol Myers Squibb, Genentech, Genome Canada through Ontario Genomics Institute (OGI-196), Janssen, Merck KGaA (aka EMD in Canada and USA), Pfizer, Takeda and the Innovative Medicines Initiative 2 Joint Undertaking (JU) under grant agreement no. 875510. This work was also funded by the Member States of the European Molecular Biology Laboratory. A.A.A. is an ISCIII–Miguel Servet Fellow supported by the Instituto de Salud Carlos III grant CP23/00115 and by the Spanish Ministry of Science and Innovation (MCIN/AEI) (PID2022-136344OA-I00); CERCA Program/Generalitat de Catalunya, and FEDER funds/European Regional Development Fund (ERDF) — a way to Build Europe.
Ethics declarations
Competing interests
D.-A.C., K.S. and D.R.O. are shareholders in Pfizer Inc. The Cernak Lab’s research has been supported by MilliporeSigma, Johnson & Johnson, Relay Therapeutics, Merck & Co., Inc., SPT Labtech, National Defense Medical Center, Shanghai University of Traditional Chinese Medicine, Ministry of Education Taiwan, and Entos, Inc. T.C. has consulted for the University of Dundee Drug Discovery Unit, Scorpion Therapeutics, Relay Therapeutics, Amgen, Genentech, Janssen, Pfizer, Vertex, MilliporeSigma, the US Food & Drug Administration, Gilead, AbbVie, Corteva, Syngenta, Firmenich, Biogen, Bayer, UCB Biopharma, National Taiwan University, AstraZeneca, Grunenthal, and Iambic Therapeutics (previously known as Entos, Inc.). He holds equity in Scorpion Therapeutics and is a co-founder and equity holder at Iambic Therapeutics. B.H.-K. is a co-founder of the MAQC (Massive Analysis and Quality Control) Society and a member of the scientific advisory boards of the Consortium de recherche biopharmaceutique (CQDM), Quebec, Canada; Break Through Cancer, Commonwealth Cancer Consortium, United States; Canadian Institutes of Health Research–Institute of Genetics, Canada; Cancer Grand Challenges, United Kingdom; and Shriners Children’s, United States. He is part of the Executive Committee of the Terry Fox Digital Health and Discovery Platform, Canada, and of the Board of Directors of AACR International–Canada, The American Association for Cancer Research, United States. D.W.Y. is co-founder and shareowner of Deliver Therapeutics. I.V.H. is part of the Board of Directors of TenAces Biosciences. A.K. serves on the SAB of Cilcare, Sulfateq BV and Heartbeat.bio. J.C.M. may hold stock options in AstraZeneca. A.M.-F. is the Board Chair of SGC and Conscience. She is also a shareholder for Bayer AG and an external consultant for Nuvisan ICB GmbH. N.B.-B. is on the SAB for Oxford Vacmedix and holds shares of Exact Sciences. A.T. is co-founder of Predictive LLC. A.S.D. holds stocks in DANAHER. F.K. is a shareholder in Evotec SE. A.F.-M. possesses Bayer AG shares. A.A.A. is a consultant to Darwin Health and has received grant funding from Vivan Therapeutics and AtG Therapeutics. D.P. holds stock in Novartis.
Peer review information
Nature Reviews Chemistry thanks Brian Shoichet, Brent Stockwell and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Cite this article
Edwards, A.M., Owen, D.R. & The Structural Genomics Consortium Target 2035 Working Group. Protein–ligand data at scale to support machine learning. Nat Rev Chem 9, 634–645 (2025). https://doi.org/10.1038/s41570-025-00737-z