Abstract
High-content image-based phenotypic screens (HCSs) provide a scalable approach to characterize biological functions of compounds. The widespread adoption of HCS has led to a growing body of available profile datasets. However, study-specific experimental and computational choices lead to profile datasets that cannot be directly combined. A critical, long-standing challenge is how to integrate these rich but currently isolated HCS dataset resources. Here we introduce a contrastive, deep-learning framework that leverages sparse sets of overlapping profiles as fiducials to align heterogeneous HCS profile datasets in a shared latent space. We demonstrate that this alignment facilitates accurate ‘transitive’ predictions, whereby the function of an uncharacterized compound screened in one dataset can be predicted through comparison with characterized compounds already profiled in other datasets. In silico alignment of HCS resources provides a path to unify fast-growing HCS resources and accelerate early drug discovery efforts.
Data availability
For raw datasets from curated databases, the data and analysis pipeline for dataset 1 were obtained online (https://www.science.org/doi/suppl/10.1126/science.1100709/suppl_file/perlman.som.zip). For datasets 2 and 3, data were shared by the authors. For datasets 4 and 5, analysis code was obtained online (https://static-content.springer.com/esm/art%3A10.1038%2Fnbt.3419/MediaObjects/41587_2016_BFnbt3419_MOESM21_ESM.zip) and data were shared by the authors. For datasets 6 and 7, the processed dataset was downloaded from AWS following instructions from GitHub (https://github.com/carpenter-singh-lab/2022_Haghighi_NatureMethods) and the replicate_level_cp_normalized.csv.gz features were used. For project datasets 8–13, datasets and analysis results were downloaded from Zenodo (https://doi.org/10.5281/zenodo.7352486; ref. 62, Heinrich et al.). For RxRx3, the dataset was obtained online (https://www.rxrx.ai/rxrx3). L1000 transcript datasets were downloaded using the same link as datasets 6 and 7, and the processed transcript data files (replicate_level_l1k.csv) were used. The ChEMBL database (version 34) was accessed online (https://www.ebi.ac.uk/chembl/beta/). Processed data for the hypoxia phenotypic study and the curated drug screen database, including all processed datasets, can be accessed from figshare (https://doi.org/10.6084/m9.figshare.29061038; ref. 63, Bao). Within each dataset folder, the ‘meta’ files list compound names and categories, while the ‘data’ files include the corresponding profile features.
Code availability
CLIPn can be installed from the PyPI repository using ‘pip install clipn’. Detailed documentation, source code and demonstrations are provided on GitHub (https://github.com/AltschulerWu-Lab/CLIPn).
References
Kirkpatrick, P. & Ellis, C. Chemical space. Nature 432, 823–824 (2004).
Reymond, J.-L. The chemical space project. Acc. Chem. Res. 48, 722–730 (2015).
von Lilienfeld, O. A., Müller, K.-R. & Tkatchenko, A. Exploring chemical compound space with quantum-based machine learning. Nat. Rev. Chem. 4, 347–358 (2020).
Mullard, A. The drug-maker’s guide to the galaxy. Nature 549, 445–447 (2017).
Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100–D1107 (2011).
Bender, B. J. et al. A practical guide to large-scale docking. Nat. Protoc. 16, 4799–4832 (2021).
Van de Sande, B. et al. Applications of single-cell RNA sequencing in drug discovery and development. Nat. Rev. Drug Discov. 22, 496–520 (2023).
Ye, C. et al. DRUG-seq for miniaturized high-throughput transcriptome profiling in drug discovery. Nat. Commun. 9, 4307 (2018).
Readhead, B. et al. Expression-based drug screening of neural progenitor cells from individuals with schizophrenia. Nat. Commun. 9, 4412 (2018).
Burbaum, J. & Tobal, G. M. Proteomics in drug discovery. Curr. Opin. Chem. Biol. 6, 427–433 (2002).
Meissner, F., Geddes-McAlister, J., Mann, M. & Bantscheff, M. The emerging role of mass spectrometry-based proteomics in drug discovery. Nat. Rev. Drug Discov. 21, 637–654 (2022).
Sleno, L. & Emili, A. Proteomic methods for drug target discovery. Curr. Opin. Chem. Biol. 12, 46–54 (2008).
Hoelder, S., Clarke, P. A. & Workman, P. Discovery of small molecule cancer drugs: successes, challenges and opportunities. Mol. Oncol. 6, 155–176 (2012).
Agarwal, P., Huckle, J., Newman, J. & Reid, D. L. Trends in small molecule drug properties: a developability molecule assessment perspective. Drug Discov. Today 27, 103366 (2022).
Swamidass, S. J. Mining small-molecule screens to repurpose drugs. Brief. Bioinform. 12, 327–335 (2011).
Perlman, Z. E. et al. Multidimensional drug profiling by automated microscopy. Science 306, 1194–1198 (2004).
Boutros, M., Heigwer, F. & Laufer, C. Microscopy-based high-content screening. Cell 163, 1314–1325 (2015).
Carpenter, A. E. Image-based chemical screening. Nat. Chem. Biol. 3, 461–465 (2007).
Way, G. P., Sailem, H., Shave, S., Kasprowicz, R. & Carragher, N. O. Evolution and impact of high content imaging. SLAS Discov. 28, 292–305 (2023).
Mitchison, T. J. Small‐molecule screening and profiling by using automated microscopy. ChemBioChem 6, 33–39 (2005).
Scheeder, C., Heigwer, F. & Boutros, M. Machine learning and image-based profiling in drug discovery. Curr. Opin. Syst. Biol. 10, 43–52 (2018).
Swedlow, J. R. Innovation in biological microscopy: current status and future directions. Bioessays 34, 333–340 (2012).
Carpenter, A. E. et al. CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biol. 7, R100 (2006).
Yang, S. J. et al. Applying deep neural network analysis to high-content image-based assays. SLAS Discov. 24, 829–841 (2019).
Spitzer, H., Berry, S., Donoghoe, M., Pelkmans, L. & Theis, F. J. Learning consistent subcellular landmarks to quantify changes in multiplexed protein maps. Nat. Methods 20, 1058–1069 (2023).
Gut, G., Herrmann, M. D. & Pelkmans, L. Multiplexed protein maps link subcellular organization to cellular states. Science 361, eaar7042 (2018).
Slack, M. D., Martinez, E. D., Wu, L. F. & Altschuler, S. J. Characterizing heterogeneous cellular responses to perturbations. Proc. Natl Acad. Sci. USA 105, 19306–19311 (2008).
Loo, L.-H. et al. An approach for extensibly profiling the molecular states of cellular subpopulations. Nat. Methods 6, 759–765 (2009).
Bougen‐Zhukov, N., Loh, S. Y., Lee, H. K. & Loo, L. H. Large‐scale image‐based screening and profiling of cellular phenotypes. Cytometry A 91, 115–125 (2017).
Bray, M.-A. et al. Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nat. Protoc. 11, 1757–1774 (2016).
Bray, M.-A. et al. A dataset of images and morphological profiles of 30 000 small-molecule treatments using the Cell Painting assay. GigaScience 6, 1–5 (2017).
Haghighi, M., Caicedo, J. C., Cimini, B. A., Carpenter, A. E. & Singh, S. High-dimensional gene expression and morphology profiles of cells across 28,000 genetic and chemical perturbations. Nat. Methods 19, 1550–1557 (2022).
Georgi, F. et al. A high-content image-based drug screen of clinical compounds against cell transmission of adenovirus. Sci. Data 7, 265 (2020).
Siqueira-Neto, J. L. et al. An image-based high-content screening assay for compounds targeting intracellular Leishmania donovani amastigotes in human macrophages. PLoS Negl. Trop. Dis. 6, e1671 (2012).
Peppard, J. et al. Identifying small molecules which inhibit autophagy: a phenotypic screen using image-based high-content cell analysis. Curr. Chem. Genom. Transl. Med. 8, 3 (2014).
Hale, C. M. et al. Identification of modulators of autophagic flux in an image-based high content siRNA screen. Autophagy 12, 713–726 (2016).
Chandrasekaran, S. N. et al. JUMP Cell Painting dataset: morphological impact of 136,000 chemical and genetic perturbations. Preprint at bioRxiv https://doi.org/10.1101/2023.03.23.534023 (2023).
Young, D. W. et al. Integrating high-content screening and ligand-target prediction to identify mechanism of action. Nat. Chem. Biol. 4, 59–68 (2008).
Williams, E. et al. Image Data Resource: a bioimage data integration and publication platform. Nat. Methods 14, 775–781 (2017).
de Groot, R., Lüthi, J., Lindsay, H., Holtackers, R. & Pelkmans, L. Large‐scale image‐based profiling of single‐cell phenotypes in arrayed CRISPR–Cas9 gene perturbation screens. Mol. Syst. Biol. 14, e8064 (2018).
Tromans‐Coia, C. et al. Assessing the performance of the Cell Painting assay across different imaging systems. Cytometry A 103, 915–926 (2023).
Shariff, A., Kangas, J., Coelho, L. P., Quinn, S. & Murphy, R. F. Automated image analysis for high-content screening and analysis. J. Biomol. Screen. 15, 726–734 (2010).
Krentzel, D., Shorte, S. L. & Zimmer, C. Deep learning in image-based phenotypic drug discovery. Trends Cell Biol. 33, 538–554 (2023).
Radford, A. et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (eds Meila, M. & Zhang, T.) (PMLR, 2021).
Schoenauer-Sebag, A. et al. Multi-domain adversarial learning. In Proc. 7th International Conference on Learning Representations (ICLR, 2019).
Yu, M. et al. Deep learning large-scale drug discovery and repurposing. Nat. Comput. Sci. 4, 600–614 (2024).
Thompson, B. Canonical Correlation Analysis: Uses and Interpretation 1st edn (Sage, 1984).
Stuart, T. & Satija, R. Integrative single-cell analysis. Nat. Rev. Genet. 20, 257–272 (2019).
Ghazanfar, S., Guibentif, C. & Marioni, J. C. Stabilized mosaic single-cell data integration using unshared features. Nat. Biotechnol. 42, 284–292 (2024).
Murtagh, F. Multilayer perceptrons for classification and regression. Neurocomputing 2, 183–197 (1991).
Verdú, S. Total variation distance and the distribution of relative information. In 2014 IEEE Information Theory and Applications Workshop (IEEE, 2014).
Heinrich, L., Kumbier, K., Li, L., Altschuler, S. J. & Wu, L. F. Selection of optimal cell lines for high-content phenotypic screening. ACS Chem. Biol. 18, 679–685 (2023).
Kang, J. et al. Improving drug discovery with high-content phenotypic screens by systematic selection of reporter cell lines. Nat. Biotechnol. 34, 70–77 (2016).
Ersahin, T., Tuncbag, N. & Cetin-Atalay, R. The PI3K/AKT/mTOR interactive pathway. Mol. Biosyst. 11, 1946–1954 (2015).
Morgensztern, D. & McLeod, H. L. PI3K/Akt/mTOR pathway as a target for cancer therapy. Anticancer Drugs 16, 797–803 (2005).
Fay, M. M. et al. RxRx3: phenomics map of biology. Preprint at bioRxiv https://doi.org/10.1101/2023.02.07.527350 (2023).
Subramanian, A. et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell 171, 1437–1452 (2017).
Li, L. et al. A phenopushing platform to identify compounds that alleviate acute hypoxic stress by fast-tracking cellular adaptation. Nat. Commun. 16, 2684 (2025).
Dixit, A. et al. Perturb-Seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866 (2016).
Feldman, D. et al. Optical pooled screens in human cells. Cell 179, 787–799 (2019).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In Proc. 3rd International Conference on Learning Representations (ICLR, 2015).
Heinrich, L., Kumbier, K., Li, L., Altschuler, S. & Wu, L. Selection of optimal cell lines for high-content phenotypic screening. Zenodo https://doi.org/10.5281/zenodo.7352486 (2025).
Bao, F. Datasets used in ‘Transitive prediction of small-molecule function through alignment of high-content screening resources’. figshare https://doi.org/10.6084/m9.figshare.29061038 (2025).
Acknowledgements
This work was supported by the National Institutes of Health (NIH) R01 CA184984 (L.F.W.), Mark Foundation ASPIRE Award and NIH R01 CA300245 (S.J.A.), NIH R38AG070171 and R25MH0602 (S.Q.S.) and Human Frontier Science Program LT000908/2020-C (L.L.).
Author information
Contributions
S.J.A. and L.F.W. conceptualized the project. F.B., S.J.A. and L.F.W. developed the CLIPn machine learning framework and designed the evaluation strategy. F.B. implemented the method, designed the simulation approach, compared alternative methods, processed the data and performed the integration analysis. F.B. and L.L. curated the HCS datasets and conducted the functional prediction analyses. L.L., S.Q.S., S.J.A. and L.F.W. designed the experimental validation strategy. L.L. performed the cell-based assays and validated the predicted drug functions. L.L., S.Q.S., H.H. and S.H. conducted the hypoxia-related experiments and collected the associated data. F.B., S.J.A. and L.F.W. wrote and revised the paper. All authors contributed to revisions and reviewed, edited and approved the final version of the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Biotechnology thanks Trey Ideker, Jason Swedlow and Scott Warchal for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Setups for simulation experiment, comparing approaches and evaluation metrics.
(a) Cartoon illustration of simulated data generation. (b) Sequential integration strategies for CCA (left) and StabMAP (right) to work with multiple datasets (Methods). (c) An overview of evaluation metrics for integration and how different metric values reflect different integration results. (d) Calculation steps using the total variation distance to evaluate the multi-dataset alignments.
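As a point of reference for panel (d), the total variation distance between two discrete distributions is half the sum of the absolute differences of their probabilities. The sketch below is a minimal illustration of that definition only, not the authors' evaluation pipeline; the function names and the example (comparing the dataset-source mix in a local neighborhood with the global mix) are assumptions made for illustration.

```python
from collections import Counter


def proportions(labels):
    """Convert a list of labels into a {label: fraction} distribution."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(p, q):
    """Total variation distance between two discrete distributions.

    p and q are dicts mapping outcomes to probabilities; outcomes missing
    from one distribution are treated as having probability 0.
    """
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in outcomes)


# Hypothetical example: dataset-source labels of all samples versus the
# labels of the samples found in one local neighborhood of the latent space.
global_mix = proportions(["dataset_A"] * 50 + ["dataset_B"] * 50)
local_mix = proportions(["dataset_A"] * 9 + ["dataset_B"] * 1)

# 0.5 * (|0.9 - 0.5| + |0.1 - 0.5|) = 0.4
tvd = total_variation_distance(local_mix, global_mix)
```

A value of 0 means the two distributions are identical and a value of 1 means they share no outcomes, so under this illustrative setup lower values would correspond to better mixing of datasets.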
Extended Data Fig. 2 Performance evaluation using simulated datasets (related to Fig. 2).
(a) Visualization of simulated profiles generated using the parameters from Fig. 2a. (b) Classification evaluation using 5-fold cross-validation. Integration approaches are trained on 80% of the data, with the remaining 20% mapped to the integrated space and evaluated using linear classifiers on integrated embeddings. Boxplot: median (center line), interquartile range (box), and data range (whiskers). (c) Classification performance (F1 score) for each category in each dataset from panel b. (d) Summary of classification performance relative to label frequency, that is, how often a label appears across different datasets. (e) UMAP visualization of integrated embeddings for all datasets (left) or only the dataset with inactive categories (right) for the comparison approaches not shown in Fig. 2f. (f) Overfitting evaluation of the MLP approach, using 5-fold cross-validation to split training and testing sets. (g) Overfitting evaluation of CLIPn on categories that were inherently similar. The same simulation parameters were used as in (a). Profiles from two categories were replaced by profiles generated from the same category. A linear classifier with 5-fold cross-validation was used, and F1 scores for similar and distinct categories were reported separately. *: p < 0.05 based on a two-sided Mann–Whitney U test. (h) Visualization of integrated embeddings for comparison approaches on simulated data with varying sample sizes. Eight datasets were generated, each containing approximately 250 to 5,000 samples. UMAP was used to visualize the data, with colors representing categories (left) or dataset sources (right). (i) Evaluation of integration (top) and classification (bottom) performance on simulated datasets with varying numbers of datasets (left) and noise levels (right).
Extended Data Fig. 3 Performance evaluation using simulated datasets (related to Fig. 2).
(a) Left: Visualization of an example integration result in the presence of a rare category. Right: Comparison of classification performance between the rare category and the rest. Each category in each dataset is iteratively selected as the rare category, with samples from this category in other datasets removed from training. Classification is quantified using 5-fold cross-validation with a linear classifier. Boxplot: same definition as in Extended Data Fig. 2b. (b) Classification evaluation on datasets with varying levels of category overlap. Left: Design of label distribution across datasets. Right: Classification performance based on 10 random simulation repeats. Boxplot: same definition as in Extended Data Fig. 2b. (c) Illustration of the experiment design for transitive predictions. Samples from selected categories are masked from datasets, and the integration model is trained only on available data. The masked categories are mapped to the latent space using the trained model and predicted using the 10 nearest neighbors. (d) Transitive prediction performance based on integrating increasing numbers of datasets. (e) Comparison of different strategies for using MLP with multiple datasets of varying feature matrix shapes. Left: Expansion of feature matrices by feature zero-padding or diagonal concatenation. Middle: Visualization of latent embeddings learned from the two feature expansion strategies. Right: Evaluation of alignment and classification in the latent spaces (as shown in Fig. 2).
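The nearest-neighbor step in panel (c) can be sketched as follows. This is a minimal illustration under stated assumptions, not the CLIPn implementation: it assumes embeddings are plain coordinate tuples, uses Euclidean distance and a simple majority vote, and the function name, category names and coordinates are hypothetical.

```python
import math
from collections import Counter


def knn_predict(query, ref_points, ref_labels, k=10):
    """Predict a label for `query` from its k nearest labeled neighbors.

    Returns (majority label, fraction of the k neighbors carrying it).
    Euclidean distance and majority voting are assumptions of this sketch.
    """
    # Rank reference points by distance to the query embedding.
    order = sorted(range(len(ref_points)),
                   key=lambda i: math.dist(query, ref_points[i]))
    neighbor_labels = [ref_labels[i] for i in order[:k]]
    label, votes = Counter(neighbor_labels).most_common(1)[0]
    return label, votes / len(neighbor_labels)


# Hypothetical 2D latent embeddings of characterized reference compounds.
ref_points = ([(0.1 * i, 0.0) for i in range(8)]
              + [(5.0 + 0.1 * i, 5.0) for i in range(8)])
ref_labels = ["category_A"] * 8 + ["category_B"] * 8

# A masked (uncharacterized) sample mapped into the same latent space.
# Its 10 nearest neighbors are the 8 category_A points plus 2 category_B points.
label, confidence = knn_predict((0.0, 0.0), ref_points, ref_labels, k=10)
```

The returned fraction plays the role of a vote-based confidence: 8 of the 10 neighbors agree here, so the prediction is category_A with a confidence of 0.8.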
Extended Data Fig. 4 Integration of HCS datasets (related to Fig. 3).
(a) Overview of the number of reference drugs and samples in each category and dataset used for the integration analysis. (b) Summary of sample sizes (top) and feature dimensions (bottom) for each reference dataset. (c) UMAP visualization of integrated embeddings from different integration methods, with color annotations indicating reference compound categories (top), datasets (bottom left), and cell lines (bottom right). (d) Heatmap visualization (left) and value distribution comparison (right) of original profiles for active and inactive samples in datasets #6 and #7. P value: two-sided Mann–Whitney U test. (e) UMAP visualization of the CLIPn embeddings, highlighting compounds from the mTOR and PI3K inhibitor categories (colored dots).
Extended Data Fig. 5 Analysis of integration space for 13 HCS datasets (related to Fig. 3).
(a) UMAP visualization of original profiles (first column) and integrated embeddings (remaining columns) for each individual dataset. Raw features or integrated embeddings are independently transformed in UMAP for each dataset. Colors represent drug categories and correspond to those in Fig. 3d. (b) CLIPn embeddings and raw profiles highlighting the neuron-specific receptor categories in Datasets #6 and #7.
Extended Data Fig. 6 Evaluation of transitive predictions using CLIPn integrated space (related to Figs. 4, 5).
(a) Comparison of the classification probability distributions from k-NN predictions using either raw profiles or CLIPn embeddings for previously validated compounds in Fig. 4f. Likelihood was estimated as the percentage of the 10 nearest neighbors sharing the same label as the prediction. P value = 5.8 × 10⁻²⁵, two-sided t-test. (b) Visualization of original profiles for reference compounds (colored dots) and uncharacterized compounds selected for validation (crosses). Prediction confidence is shown as the percentage of the 10 nearest neighbors with the same label as the prediction. (c) Experimental validation results for different categories. Intensities of function-specific markers in each assay are shown for DMSO (blue), reference compounds (orange), and predicted hits (green). P values are derived by comparing DMSO to each predicted hit using a one-sided Mann–Whitney U test. Compound 221019 was predicted to belong to the EGFR category at the high dose (10 µM) and to the proteasome category at the lower dose (1 µM). (d) Left: The number of known compounds in each predicted category from the ChEMBL database. Right: Tanimoto similarities based on MACCS features between pairs of ChEMBL compounds and between ChEMBL compounds and experimentally validated hits.
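For readers unfamiliar with the similarity measure in panel (d), the Tanimoto similarity of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below illustrates only that formula; it represents each fingerprint as a set of on-bit indices, the function name and bit indices are hypothetical, and generating real MACCS keys would require a cheminformatics toolkit that is not shown here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two binary fingerprints.

    Each fingerprint is given as an iterable of on-bit indices.
    """
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        # Convention assumed in this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical on-bit indices for two compounds (not real MACCS keys).
fp_x = {1, 5, 42, 100}
fp_y = {5, 42, 100, 160}

# 3 shared bits / 5 bits set in either fingerprint = 0.6
similarity = tanimoto(fp_x, fp_y)
```

Values range from 0 (no shared bits) to 1 (identical fingerprints), so a low similarity between a validated hit and known compounds in its predicted category indicates a structurally distinct compound.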
Extended Data Fig. 7 Evaluation of different strategies to integrate the new RxRx3 dataset with the existing CLIPn integrated space (related to Fig. 6).
Left: retrain strategy, in which the model is completely retrained from raw profiles. Right: merge strategy, in which the new dataset is aligned with the existing latent space. The same category color palette is used as in Fig. 3.
Extended Data Fig. 8 Evaluation of integration for multimodal profiles (related to Fig. 6).
(a) Dataset information (left), categories (middle), and profile visualizations (right) for transcriptional datasets used in the analysis. (b) Visualization of integrated space with dataset annotations. (c) Comparison of categorical separation before and after integration using a support vector machine (SVM) or multilayer perceptron (MLP). 5-fold cross-validation was used. Dataset colors are consistent with (b). (d) Categorical separation for transcript datasets before and after integration using a linear classifier.
Extended Data Fig. 9 Evaluation of integration for phenotypic profiles under hypoxia perturbations (related to Fig. 6).
(a) Evaluation of alignment for samples with the same hypoxia treatment duration but from different datasets using total variation distance. The boxplot displays the median (center line), interquartile range (box), and data range (whiskers). (b) Evaluation of separation for samples from different treatments using F1 score. A linear classifier was trained on a subset of datasets and tested using the remaining datasets. Each experiment was repeated 10 times. Boxplot: the same definition as in (a). (c) Evaluation of transitive predictions for masked treatments in the integrated space. Each condition in each dataset was sequentially dropped from training. Masked samples were mapped to the latent space and predicted using neighbors with known labels. The median and data range are shown. (d) Performance with fewer conditions in each dataset. From left to right, a randomly selected condition was further removed from each dataset in the integration analysis. Integrated embeddings were visualized.
Supplementary information
Supplementary Information
Supplementary Tables 1–4 and Note.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bao, F., Li, L., Hammerlindl, H. et al. Transitive prediction of small-molecule function through alignment of high-content screening resources. Nat Biotechnol (2025). https://doi.org/10.1038/s41587-025-02729-2
DOI: https://doi.org/10.1038/s41587-025-02729-2