Transitive prediction of small-molecule function through alignment of high-content screening resources

Bao, Feng; Li, Li; Hammerlindl, Heinz; Shen, Susan Q.; Hammerlindl, Sabrina; Altschuler, Steven J.; Wu, Lani F.

doi:10.1038/s41587-025-02729-2

Article
Published: 11 July 2025

Nature Biotechnology (2025)

Abstract

High-content image-based phenotypic screens (HCSs) provide a scalable approach to characterize biological functions of compounds. The widespread adoption of HCS has led to a growing body of available profile datasets. However, study-specific experimental and computational choices lead to profile datasets that cannot be directly combined. A critical, long-standing challenge is how to integrate these rich but currently isolated HCS dataset resources. Here we introduce a contrastive, deep-learning framework that leverages sparse sets of overlapping profiles as fiducials to align heterogeneous HCS profile datasets in a shared latent space. We demonstrate that this alignment facilitates accurate ‘transitive’ predictions, whereby the function of an uncharacterized compound screened in one dataset can be predicted through comparison with characterized compounds already profiled in other datasets. In silico alignment of HCS resources provides a path to unify fast-growing HCS resources and accelerate early drug discovery efforts.

**Fig. 1: Problem statement and overview of CLIPⁿ.**

**Fig. 2: Evaluation of CLIPⁿ on simulated datasets.**

**Fig. 3: Application of CLIPⁿ on a curated database of 13 drug screens generated across 20 years.**

**Fig. 4: Transitive prediction through CLIPⁿ integration for uncharacterized compounds.**

**Fig. 5: Experimental validation of CLIPⁿ-predicted hits.**

**Fig. 6: Extension of CLIPⁿ to integrate additional datasets, profiles under hypoxic perturbations and multimodal profiles.**

Data availability

For raw datasets from curated databases, the data and analysis pipeline for dataset 1 were obtained online (https://www.science.org/doi/suppl/10.1126/science.1100709/suppl_file/perlman.som.zip). For datasets 2 and 3, data were shared by the authors. For datasets 4 and 5, analysis code was obtained online (https://static-content.springer.com/esm/art%3A10.1038%2Fnbt.3419/MediaObjects/41587_2016_BFnbt3419_MOESM21_ESM.zip) and data were shared by the authors. For datasets 6 and 7, the processed dataset was downloaded from AWS following instructions from GitHub (https://github.com/carpenter-singh-lab/2022_Haghighi_NatureMethods) and the replicate_level_cp_normalized.csv.gz features were used. For project datasets 8–13, datasets and analysis results were downloaded from Zenodo (https://doi.org/10.5281/zenodo.7352486)⁶². For RxRx3, the dataset was obtained online (https://www.rxrx.ai/rxrx3). L1000 transcript datasets were downloaded using the same link as datasets 6 and 7, and the processed transcript data files (replicate_level_l1k.csv) were used. The ChEMBL database (version 34) was accessed online (https://www.ebi.ac.uk/chembl/beta/). Processed data for the hypoxia phenotypic study and the curated drug screen database, including all processed datasets, can be accessed from figshare (https://doi.org/10.6084/m9.figshare.29061038)⁶³. Within each dataset folder, the ‘meta’ files list compound names and categories, while the ‘data’ files include the corresponding profile features.

Code availability

CLIPⁿ can be installed from the PyPI repository using ‘pip install clipn’. Detailed documentation, source code and demonstrations are provided on GitHub (https://github.com/AltschulerWu-Lab/CLIPn).

References

1. Kirkpatrick, P. & Ellis, C. Chemical space. Nature 432, 823–824 (2004).
2. Reymond, J.-L. The chemical space project. Acc. Chem. Res. 48, 722–730 (2015).
3. von Lilienfeld, O. A., Müller, K.-R. & Tkatchenko, A. Exploring chemical compound space with quantum-based machine learning. Nat. Rev. Chem. 4, 347–358 (2020).
4. Mullard, A. The drug-maker’s guide to the galaxy. Nature 549, 445–447 (2017).
5. Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100–D1107 (2011).
6. Bender, B. J. et al. A practical guide to large-scale docking. Nat. Protoc. 16, 4799–4832 (2021).
7. Van de Sande, B. et al. Applications of single-cell RNA sequencing in drug discovery and development. Nat. Rev. Drug Discov. 22, 496–520 (2023).
8. Ye, C. et al. DRUG-seq for miniaturized high-throughput transcriptome profiling in drug discovery. Nat. Commun. 9, 4307 (2018).
9. Readhead, B. et al. Expression-based drug screening of neural progenitor cells from individuals with schizophrenia. Nat. Commun. 9, 4412 (2018).
10. Burbaum, J. & Tobal, G. M. Proteomics in drug discovery. Curr. Opin. Chem. Biol. 6, 427–433 (2002).
11. Meissner, F., Geddes-McAlister, J., Mann, M. & Bantscheff, M. The emerging role of mass spectrometry-based proteomics in drug discovery. Nat. Rev. Drug Discov. 21, 637–654 (2022).
12. Sleno, L. & Emili, A. Proteomic methods for drug target discovery. Curr. Opin. Chem. Biol. 12, 46–54 (2008).
13. Hoelder, S., Clarke, P. A. & Workman, P. Discovery of small molecule cancer drugs: successes, challenges and opportunities. Mol. Oncol. 6, 155–176 (2012).
14. Agarwal, P., Huckle, J., Newman, J. & Reid, D. L. Trends in small molecule drug properties: a developability molecule assessment perspective. Drug Discov. Today 27, 103366 (2022).
15. Swamidass, S. J. Mining small-molecule screens to repurpose drugs. Brief. Bioinform. 12, 327–335 (2011).
16. Perlman, Z. E. et al. Multidimensional drug profiling by automated microscopy. Science 306, 1194–1198 (2004).
17. Boutros, M., Heigwer, F. & Laufer, C. Microscopy-based high-content screening. Cell 163, 1314–1325 (2015).
18. Carpenter, A. E. Image-based chemical screening. Nat. Chem. Biol. 3, 461–465 (2007).
19. Way, G. P., Sailem, H., Shave, S., Kasprowicz, R. & Carragher, N. O. Evolution and impact of high content imaging. SLAS Discov. 28, 292–305 (2023).
20. Mitchison, T. J. Small‐molecule screening and profiling by using automated microscopy. ChemBioChem 6, 33–39 (2005).
21. Scheeder, C., Heigwer, F. & Boutros, M. Machine learning and image-based profiling in drug discovery. Curr. Opin. Syst. Biol. 10, 43–52 (2018).
22. Swedlow, J. R. Innovation in biological microscopy: current status and future directions. Bioessays 34, 333–340 (2012).
23. Carpenter, A. E. et al. CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biol. 7, R100 (2006).
24. Yang, S. J. et al. Applying deep neural network analysis to high-content image-based assays. SLAS Discov. 24, 829–841 (2019).
25. Spitzer, H., Berry, S., Donoghoe, M., Pelkmans, L. & Theis, F. J. Learning consistent subcellular landmarks to quantify changes in multiplexed protein maps. Nat. Methods 20, 1058–1069 (2023).
26. Gut, G., Herrmann, M. D. & Pelkmans, L. Multiplexed protein maps link subcellular organization to cellular states. Science 361, eaar7042 (2018).
27. Slack, M. D., Martinez, E. D., Wu, L. F. & Altschuler, S. J. Characterizing heterogeneous cellular responses to perturbations. Proc. Natl Acad. Sci. USA 105, 19306–19311 (2008).
28. Loo, L.-H. et al. An approach for extensibly profiling the molecular states of cellular subpopulations. Nat. Methods 6, 759–765 (2009).
29. Bougen‐Zhukov, N., Loh, S. Y., Lee, H. K. & Loo, L. H. Large‐scale image‐based screening and profiling of cellular phenotypes. Cytometry A 91, 115–125 (2017).
30. Bray, M.-A. et al. Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nat. Protoc. 11, 1757–1774 (2016).
31. Bray, M.-A. et al. A dataset of images and morphological profiles of 30 000 small-molecule treatments using the Cell Painting assay. GigaScience 6, 1–5 (2017).
32. Haghighi, M., Caicedo, J. C., Cimini, B. A., Carpenter, A. E. & Singh, S. High-dimensional gene expression and morphology profiles of cells across 28,000 genetic and chemical perturbations. Nat. Methods 19, 1550–1557 (2022).
33. Georgi, F. et al. A high-content image-based drug screen of clinical compounds against cell transmission of adenovirus. Sci. Data 7, 265 (2020).
34. Siqueira-Neto, J. L. et al. An image-based high-content screening assay for compounds targeting intracellular Leishmania donovani amastigotes in human macrophages. PLoS Negl. Trop. Dis. 6, e1671 (2012).
35. Peppard, J. et al. Identifying small molecules which inhibit autophagy: a phenotypic screen using image-based high-content cell analysis. Curr. Chem. Genom. Transl. Med. 8, 3 (2014).
36. Hale, C. M. et al. Identification of modulators of autophagic flux in an image-based high content siRNA screen. Autophagy 12, 713–726 (2016).
37. Chandrasekaran, S. N. et al. JUMP Cell Painting dataset: morphological impact of 136,000 chemical and genetic perturbations. Preprint at bioRxiv https://doi.org/10.1101/2023.03.23.534023 (2023).
38. Young, D. W. et al. Integrating high-content screening and ligand-target prediction to identify mechanism of action. Nat. Chem. Biol. 4, 59–68 (2008).
39. Williams, E. et al. Image Data Resource: a bioimage data integration and publication platform. Nat. Methods 14, 775–781 (2017).
40. de Groot, R., Lüthi, J., Lindsay, H., Holtackers, R. & Pelkmans, L. Large‐scale image‐based profiling of single‐cell phenotypes in arrayed CRISPR–Cas9 gene perturbation screens. Mol. Syst. Biol. 14, e8064 (2018).
41. Tromans‐Coia, C. et al. Assessing the performance of the Cell Painting assay across different imaging systems. Cytometry A 103, 915–926 (2023).
42. Shariff, A., Kangas, J., Coelho, L. P., Quinn, S. & Murphy, R. F. Automated image analysis for high-content screening and analysis. J. Biomol. Screen. 15, 726–734 (2010).
43. Krentzel, D., Shorte, S. L. & Zimmer, C. Deep learning in image-based phenotypic drug discovery. Trends Cell Biol. 33, 538–554 (2023).
44. Radford, A. et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (eds Meila, M. & Zhang, T.) (PMLR, 2021).
45. Schoenauer-Sebag, A. et al. Multi-domain adversarial learning. In Proc. 7th International Conference on Learning Representations (ICLR, 2019).
46. Yu, M. et al. Deep learning large-scale drug discovery and repurposing. Nat. Comput. Sci. 4, 600–614 (2024).
47. Thompson, B. Canonical Correlation Analysis: Uses and Interpretation 1st edn (Sage, 1984).
48. Stuart, T. & Satija, R. Integrative single-cell analysis. Nat. Rev. Genet. 20, 257–272 (2019).
49. Ghazanfar, S., Guibentif, C. & Marioni, J. C. Stabilized mosaic single-cell data integration using unshared features. Nat. Biotechnol. 42, 284–292 (2024).
50. Murtagh, F. Multilayer perceptrons for classification and regression. Neurocomputing 2, 183–197 (1991).
51. Verdú, S. Total variation distance and the distribution of relative information. In 2014 IEEE Information Theory and Applications Workshop (IEEE, 2014).
52. Heinrich, L., Kumbier, K., Li, L., Altschuler, S. J. & Wu, L. F. Selection of optimal cell lines for high-content phenotypic screening. ACS Chem. Biol. 18, 679–685 (2023).
53. Kang, J. et al. Improving drug discovery with high-content phenotypic screens by systematic selection of reporter cell lines. Nat. Biotechnol. 34, 70–77 (2016).
54. Ersahin, T., Tuncbag, N. & Cetin-Atalay, R. The PI3K/AKT/mTOR interactive pathway. Mol. Biosyst. 11, 1946–1954 (2015).
55. Morgensztern, D. & McLeod, H. L. PI3K/Akt/mTOR pathway as a target for cancer therapy. Anticancer Drugs 16, 797–803 (2005).
56. Fay, M. M. et al. RxRx3: phenomics map of biology. Preprint at bioRxiv https://doi.org/10.1101/2023.02.07.527350 (2023).
57. Subramanian, A. et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell 171, 1437–1452 (2017).
58. Li, L. et al. A phenopushing platform to identify compounds that alleviate acute hypoxic stress by fast-tracking cellular adaptation. Nat. Commun. 16, 2684 (2025).
59. Dixit, A. et al. Perturb-Seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866 (2016).
60. Feldman, D. et al. Optical pooled screens in human cells. Cell 179, 787–799 (2019).
61. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In Proc. 3rd International Conference on Learning Representations (ICLR, 2015).
62. Heinrich, L., Kumbier, K., Li, L., Altschuler, S. & Wu, L. Selection of optimal cell lines for high-content phenotypic screening. Zenodo https://doi.org/10.5281/zenodo.7352486 (2025).
63. Bao, F. Datasets used in ‘Transitive prediction of small-molecule function through alignment of high-content screening resources’. figshare https://doi.org/10.6084/m9.figshare.29061038 (2025).

Acknowledgements

This work was supported by the National Institutes of Health (NIH) R01 CA184984 (L.F.W.), Mark Foundation ASPIRE Award and NIH R01 CA300245 (S.J.A.), NIH R38AG070171 and R25MH0602 (S.Q.S.) and Human Frontier Science Program LT000908/2020-C (L.L.).

Author information

Authors and Affiliations

Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA, USA
Feng Bao, Li Li, Heinz Hammerlindl, Susan Q. Shen, Sabrina Hammerlindl, Steven J. Altschuler & Lani F. Wu
Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, USA
Susan Q. Shen

Contributions

S.J.A. and L.F.W. conceptualized the project. F.B., S.J.A. and L.F.W. developed the CLIPⁿ machine learning framework and designed the evaluation strategy. F.B. implemented the method, designed the simulation approach, compared alternative methods, processed the data and performed the integration analysis. F.B. and L.L. curated the HCS datasets and conducted the functional prediction analyses. L.L., S.Q.S., S.J.A. and L.F.W. designed the experimental validation strategy. L.L. performed the cell-based assays and validated the predicted drug functions. L.L., S.Q.S., H.H. and S.H. conducted the hypoxia-related experiments and collected the associated data. F.B., S.J.A. and L.F.W. wrote and revised the paper. All authors contributed to revisions and reviewed, edited and approved the final version of the paper.

Corresponding authors

Correspondence to Steven J. Altschuler or Lani F. Wu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks Trey Ideker, Jason Swedlow and Scott Warchal for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Setups for simulation experiment, comparing approaches and evaluation metrics.

(a) Cartoon illustration of simulated data generation. (b) Sequential integration strategies for CCA (left) and StabMAP (right) to work with multiple datasets (Methods). (c) An overview of evaluation metrics for integration and how different metric values reflect different integration results. (d) Calculation steps using the total variation distance to evaluate the multi-dataset alignments.
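For readers unfamiliar with the metric in panel (d), the following is a minimal sketch of the standard total variation distance between two discrete distributions, TV(p, q) = ½ Σ |p(k) − q(k)|. It is not the paper's evaluation pipeline, whose exact steps are given in the figure and Methods; the function names and the toy label lists are illustrative assumptions.

```python
from collections import Counter


def label_distribution(labels):
    """Convert a list of labels into a dict mapping label -> empirical probability."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(p, q):
    """Standard total variation distance between two discrete distributions.

    p and q are dicts mapping outcome -> probability; missing outcomes count as 0.
    The result lies in [0, 1]: 0 for identical distributions, 1 for disjoint support.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


# Toy example (hypothetical labels): composition of two sets of samples.
dist_a = label_distribution(["x", "x", "y", "y"])
dist_b = label_distribution(["x", "y", "y", "y"])
tv = total_variation_distance(dist_a, dist_b)
print(tv)
```

A value near 0 means the two sets have nearly the same composition, whereas a value near 1 means they barely overlap.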

Extended Data Fig. 2 Performance evaluation using simulated datasets (related to Fig. 2).

(a) Visualization of simulated profiles generated using the parameters from Fig. 2a. (b) Classification evaluation using 5-fold cross-validation. Integration approaches are trained on 80% of the data, with the remaining 20% mapped to the integrated space and evaluated using linear classifiers on integrated embeddings. Boxplot: median (center line), interquartile range (box), and data range (whiskers). (c) Classification performance (F1 score) for each category in each dataset from panel b. (d) Summary of classification performance relative to label frequencies in datasets, indicating how often the label appears across different datasets. (e) UMAP visualization of integrated embeddings for all datasets (left) or only the dataset with inactive categories (right) for the comparison approaches not shown in Fig. 2f. (f) Overfitting evaluation of the MLP approach, using 5-fold cross-validation to split training and testing sets. (g) Overfitting evaluation of CLIPⁿ on categories that were inherently similar. The same simulation parameters were used as in (a). Profiles from two categories were replaced by profiles generated from the same category. A linear classifier with 5-fold cross-validation was used, and F1 scores for similar and distinct categories were reported separately. *: p < 0.05 based on two-sided Mann–Whitney U test. (h) Visualization of integrated embeddings for comparison approaches on simulated data with varying sample sizes. Eight datasets were generated, each containing approximately 250 to 5,000 samples. UMAP was used to visualize the data, with colors representing categories (left) or dataset sources (right). (i) Evaluation of integration (top) and classification (bottom) performance on simulated datasets with varying numbers of datasets (left) and noise levels (right).
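The per-category F1 score reported in panel (c) can be illustrated with a short sketch using the standard definition, F1 = 2·TP / (2·TP + FP + FN), computed one category at a time. The helper name and the toy labels below are hypothetical and are not taken from the paper's code.

```python
def per_category_f1(y_true, y_pred):
    """Return a dict mapping each category to its one-vs-rest F1 score."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # Assumption: a category with no true or predicted samples scores 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy example with three hypothetical categories.
y_true = ["A", "A", "B", "B", "C"]
y_pred = ["A", "B", "B", "B", "C"]
f1 = per_category_f1(y_true, y_pred)
print(f1)
```

Reporting F1 per category rather than a single average makes it visible when a method does well on common categories but poorly on rare ones.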

Extended Data Fig. 3 Performance evaluation using simulated datasets (related to Fig. 2).

(a) Left: Visualization of an example integration result in the presence of a rare category. Right: Comparison of classification performance between the rare category and the rest. Each category in each dataset is iteratively selected as the rare category, with samples from this category in other datasets removed from training. Classification is quantified using 5-fold cross-validation with a linear classifier. Boxplot: same definition as in Extended Data Fig. 2b. (b) Classification evaluation on datasets with varying levels of category overlap. Left: Design of label distribution across datasets. Right: Classification performance based on 10 random simulation repeats. Boxplot: same definition as in Extended Data Fig. 2b. (c) Illustration of the experiment design for transitive predictions. Samples from selected categories are masked from datasets, and the integration model is trained only on available data. The masked categories are mapped to the latent space using the trained model and predicted using the 10 nearest neighbors. (d) Transitive prediction performance based on integrating increasing numbers of datasets. (e) Comparison of different strategies for using MLP with multiple datasets of varying feature matrix shapes. Left: Expansion of feature matrices by feature zero-padding or diagonal concatenation. Middle: Visualization of latent embeddings learned from the two feature expansion strategies. Right: Evaluation of alignment and classification in the latent spaces (as shown in Fig. 2).
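The nearest-neighbor step in panel (c) can be sketched as follows: once a masked sample has been mapped into the shared latent space, its category is taken as the majority label among its k nearest labeled neighbors. This is a minimal illustration that assumes Euclidean distance and plain majority voting; the function name, toy coordinates and category names are hypothetical, and the trained model that produces the embeddings is not shown.

```python
import math
from collections import Counter


def knn_predict(query, reference_points, reference_labels, k=10):
    """Predict a label for `query` from its k nearest labeled reference points.

    Returns (majority label, fraction of the neighbors carrying that label).
    Assumes Euclidean distance in the latent space.
    """
    order = sorted(
        range(len(reference_points)),
        key=lambda i: math.dist(query, reference_points[i]),
    )
    neighbor_labels = [reference_labels[i] for i in order[:k]]
    label, votes = Counter(neighbor_labels).most_common(1)[0]
    return label, votes / len(neighbor_labels)


# Toy 2D "latent space" with two well-separated hypothetical categories.
reference_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
                    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
reference_labels = ["mTOR"] * 3 + ["proteasome"] * 3

# A masked sample that lands near the second cluster.
prediction = knn_predict((4.8, 5.0), reference_points, reference_labels, k=3)
print(prediction)
```

The neighbor fraction returned alongside the label gives a simple measure of how unanimous the neighborhood is.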

Extended Data Fig. 4 Integration of HCS datasets (related to Fig. 3).

(a) Overview of the number of reference drugs and samples in each category and dataset used for the integration analysis. (b) Summary of sample sizes (top) and feature dimensions (bottom) for each reference dataset. (c) UMAP visualization of integrated embeddings from different integration methods, with color annotations indicating reference compound categories (top), datasets (bottom left), and cell lines (bottom right). (d) Heatmap visualization (left) and value distribution comparison (right) of original profiles for active and inactive samples in datasets #6 and #7. P-value: two-sided Mann-Whitney U test. (e) UMAP visualization of the CLIPⁿ embeddings, highlighting compounds from the mTOR and PI3K inhibitor categories (colored dots).

Extended Data Fig. 5 Analysis of integration space for 13 HCS datasets (related to Fig. 3).

(a) UMAP visualization of original profiles (first column) and integrated embeddings (remaining columns) for each individual dataset. Raw features or integrated embeddings are independently transformed in UMAP for each dataset. Colors represent drug categories and correspond to those in Fig. 3d. (b) CLIPⁿ embeddings and raw profiles highlighting the neuron-specific receptor categories in Datasets #6 and #7.

Extended Data Fig. 6 Evaluation of transitive predictions using CLIPⁿ integrated space (related to Figs. 4, 5).

(a) Comparison of the classification probability distribution from k-NN predictions using either raw profiles or CLIPⁿ embeddings for previously validated compounds in Fig. 4f. Likelihood was estimated based on the percentage of the 10 nearest neighbors sharing the same label as the predictions. P value = 5.8 × 10⁻²⁵, two-sided t test. (b) Visualization of original profiles for reference compounds (colored dots) and uncharacterized compounds selected for validation (crosses). Predicted confidence is shown as the percentage of the 10 nearest neighbors with the same labels as the predictions. (c) Experimental validation results for different categories. Intensities of function-specific markers in each assay are shown for DMSO (blue), reference compounds (orange), and predicted hits (green). P values are derived by comparing DMSO to each predicted hit using a one-sided Mann–Whitney U test. Compound 221019 was predicted in the EGFR category at high dose (10 µM) and in the proteasome category at lower dose (1 µM). (d) Left: The number of known compounds in each predicted category from the ChEMBL database. Right: Tanimoto similarities based on MACCS features between two ChEMBL compounds and between ChEMBL compounds and experimentally validated hits.
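The Tanimoto similarity in panel (d) compares two binary fingerprints as the size of their intersection divided by the size of their union. The sketch below shows only that formula on sets of on-bit indices. Computing MACCS fingerprints from chemical structures needs a cheminformatics toolkit and is not shown, and the bit indices used here are made up for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two fingerprints.

    Each fingerprint is given as an iterable of the indices of its on-bits.
    Assumption: two empty fingerprints are assigned similarity 0.0
    (conventions differ for this edge case).
    """
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical on-bit indices for two compounds (not real MACCS keys).
compound_1 = {1, 5, 9, 42}
compound_2 = {5, 9, 42, 100}
similarity = tanimoto(compound_1, compound_2)  # 3 shared bits / 5 bits in union
print(similarity)
```

Low similarity between a validated hit and the known compounds of its predicted category suggests that the hit is structurally distinct from existing chemistry for that function.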

Extended Data Fig. 7 Evaluation of different strategies to integrate the new RxRx3 dataset with the existing CLIPⁿ integrated space (related to Fig. 6).

Left: retrain strategy – completely retrain the model from raw profiles. Right: merge strategy – align the new dataset with the existing latent space. The same category color palette is used as in Fig. 3.

Extended Data Fig. 8 Evaluation of integration for multimodal profiles (related to Fig. 6).

(a) Dataset information (left), categories (middle), and profile visualizations (right) for transcriptional datasets used in the analysis. (b) Visualization of integrated space with dataset annotations. (c) Comparison of categorical separation before and after integration using a support vector machine (SVM) or a multilayer perceptron (MLP). 5-fold cross-validation was used. Dataset colors are consistent with (b). (d) Categorical separation for transcript datasets before and after integration using a linear classifier.

Extended Data Fig. 9 Evaluation of integration for phenotypic profiles under hypoxia perturbations (related to Fig. 6).

(a) Evaluation of alignment for samples from the same hypoxia treatment durations but different datasets using total variation distance. The boxplot displays the median (center line), interquartile range (box), and data range (whiskers). (b) Evaluation of separation for samples from different treatments using F1 score. A linear classifier was trained on a subset of datasets and tested using the remaining datasets. Each experiment was repeated 10 times. Boxplot: the same definition as in (a). (c) Evaluation of transitive predictions for masked treatments in the integrated space. Each condition in each dataset was sequentially dropped from training. Masked samples were mapped to the latent space and predicted using neighbors with known labels. The median and data range are shown. (d) Performance with fewer conditions in each dataset. From left to right, a randomly selected condition was further removed from each dataset in the integration analysis. Integrated embeddings were visualized.

Supplementary information

Supplementary Information

Supplementary Tables 1–4 and Note.

Reporting Summary

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bao, F., Li, L., Hammerlindl, H. et al. Transitive prediction of small-molecule function through alignment of high-content screening resources. Nat Biotechnol (2025). https://doi.org/10.1038/s41587-025-02729-2

Received: 16 November 2024
Accepted: 29 May 2025
Published: 11 July 2025
DOI: https://doi.org/10.1038/s41587-025-02729-2