Introduction

In image-based drug discovery1,2, particularly with techniques like Cell Painting, learning robust image representations is essential for extracting meaningful insights from complex, high-throughput image datasets. Cell Painting involves using multiple fluorescent dyes to label various organelles and cellular components, producing multi-channel images that capture phenotypic changes in response to different drugs and perturbations3. These high-dimensional images are rich in information, making automated methods for learning image representations crucial. These representations enable the development of predictive models for drug discovery, allowing for better identification of therapeutic compounds, understanding drug mechanisms, and predicting off-target effects2,3,4. Many laboratories and companies have generated extensive Cell Painting datasets5,6,7,8, facilitating the identification of phenotypic changes in response to drug treatments and supporting various downstream applications. For example, analyzing changes in cell morphology can provide insights into drug targets and mechanisms of action9,10, while comparisons of cell morphology before and after treatment can help evaluate drug efficacy and identify promising candidates with significant impacts7,11. Additionally, these image representations can be integrated with other data types, such as gene expression profiles6,12, allowing for comprehensive multimodal analyses that further advance drug discovery and biomedical research8,13.

The high-dimensional nature of Cell Painting images often introduces redundancy and noise, necessitating extensive preprocessing steps such as normalization, segmentation, and artifact removal7,14. Furthermore, the large-scale nature of these datasets demands substantial computational resources for scalable processing, and the extracted morphological features may lack biological interpretability, making it challenging to use these images directly for meaningful analysis. General-purpose models pre-trained on ImageNet15, such as ResNet5016 and ViT17, offer broadly applicable solutions for image processing and representation learning. ResNet50 provides hierarchical feature representations through residual connections, while ViT captures long-range dependencies in image data. ImageNet pre-training further enhances their generalizability for computational image analysis across diverse datasets. To address the specific challenges posed by Cell Painting images, tailored methods including CellProfiler18, DeepProfiler9, SPACe19, and OpenPhenom20 have been developed to extract informative and compact representations of cell morphology. CellProfiler is a versatile, open-source tool for high-throughput image analysis, enabling biologists to extract physical cellular features (e.g., size, shape, intensity, texture) through modular pipelines. SPACe improves computational efficiency with a smaller but refined feature set and GPU-accelerated texture analysis. DeepProfiler adopts a deep learning approach, utilizing the EfficientNet21 model architecture to generate morphological profiles from Cell Painting sub-images. OpenPhenom leverages Vision Transformer (ViT)-based masked autoencoders to analyze large-scale microscopy images. By transforming complex image data into concise and interpretable representations, these methods enable phenotypic profiling, providing valuable insights into drug effects and cellular perturbations.

Despite these advancements, existing methods for morphological representation learning face several critical limitations, particularly when applied to high-dimensional Cell Painting images. First, these methods often process whole multi-channel images by decomposing them into multiple sub-images, resulting in a complex and resource-intensive workflow. These approaches typically involve segmenting the whole images to identify individual cell locations, extracting sub-images for each cell, applying models to extract features from these sub-images, and finally integrating those features to generate a comprehensive representation of the original whole multi-channel image. This multi-step process not only increases computational and acquisition costs but also introduces additional sources of error, such as inaccuracies in segmentation and feature integration. Second, these methods rely on drug treatment conditions as classification labels9,22, which provide limited information for capturing the diversity and complexity of cellular responses. This reliance often results in less biologically meaningful phenotypic representations, as those condition labels may fail to capture subtle morphological changes. Moreover, these labels lack universality and are often specific to certain plates or experimental setups. Consequently, existing models trained on these limited labels struggle to generalize effectively across diverse experimental conditions, reducing their scalability and applicability. These limitations highlight the need for a more streamlined, efficient, and robust method to obtain biologically meaningful representations.

In this paper, we introduce PhenoProfiler, an innovative tool for learning phenotypic representations of cell morphology from high-throughput images. Unlike existing methods, PhenoProfiler functions as an end-to-end framework that directly encodes high-content multi-channel images into low-dimensional feature representations, without the need for extensive preprocessing such as segmentation and sub-image extraction. PhenoProfiler consists of three main modules: a gradient encoder, a transformer encoder, and a multi-objective learning module that integrates classification, regression, and contrastive learning. These modules establish a unified and robust feature space for representing cellular morphology. Additionally, PhenoProfiler incorporates a tailored phenotype correction strategy, designed to emphasize relative changes in cell phenotypes under different treatment conditions, enhancing its ability to capture meaningful biological signals. Extensive benchmarking on nearly 400,000 multi-channel images demonstrates PhenoProfiler’s state-of-the-art performance in extracting accurate and interpretable phenotypic representations. By addressing the complexity, high costs, and limited generalization capabilities of existing workflows, PhenoProfiler represents a significant advancement in phenotypic profiling and offers a powerful tool for accelerating image-based drug discovery. PhenoProfiler is freely accessible at https://phenoprofiler.org.

Results

Overview of the PhenoProfiler model

PhenoProfiler learns morphological representations and extracts phenotypic changes of treatment effects from high-throughput images. Unlike existing methods (Fig. 1a), PhenoProfiler is designed as an end-to-end model with three key modules (Fig. 1b): a gradient encoder based on difference convolution23,24, which enhances cell edge information, improves the clarity and contrast of cell morphology, and thereby strengthens the model’s adaptive perception of cells. A transformer encoder with multi-head self-attention mechanisms22 further captures long-range dependencies and intricate relationships within the data. A multi-objective learning module, composed of two multi-layer perceptron layers, is then utilized to enhance accuracy and generalization. This module integrates classification, regression, and contrastive learning. The classification learning maps image representations to treatment condition categories based on their corresponding labels. The regression learning leverages the rich and continuous supervisory information provided by the regression objective to capture detailed morphological representations across different treatments, thereby significantly improving model performance. The contrastive learning improves robustness and generalization by maximizing similarity among representations of similar treatment conditions and minimizing it among dissimilar ones. After training, PhenoProfiler identifies a unified and robust feature space for representing cell morphology. During the inference phase (Fig. 1c), PhenoProfiler utilizes a phenotype correction strategy to emphasize relative phenotypic changes under different treatment conditions, thereby revealing associated biological matches and treatment-associated representations. The detailed rationale for the design of PhenoProfiler is provided in Supplementary Note 1.

Fig. 1: Framework of the PhenoProfiler for morphology representations.
figure 1

a Flowchart comparison of end-to-end PhenoProfiler with existing non-end-to-end methods. b PhenoProfiler includes a gradient encoder to enhance edge gradients, improving clarity and contrast in cell morphology. A transformer encoder then captures high-dimensional dependencies and intricate relationships, enriching image representations. A designed multi-objective learning module is utilized for accurate morphological representation learning. c For model inference, PhenoProfiler uses a phenotype correction strategy (PCs) with hyperparameter α to identify morphological changes between treated and control conditions. Created in BioRender. Song, Q. (2025) https://BioRender.com/v0nqs11.

Superior performance of PhenoProfiler in biological matching tasks

For a comprehensive and robust evaluation of PhenoProfiler, we compare it with established methods, including DeepProfiler9, OpenPhenom20, ResNet5016, and ViT17 (Benchmarking Methods), using the leave-perturbations-out strategy. Two evaluation metrics (Evaluation Metrics) are used: Folds of Enrichment (FoE) and Mean Average Precision (MAP).

First, we use over 230,000 images from three datasets (BBBC02225, CDRP-BIO-BBBC03626, and TAORF-BBBC03727), covering 231 plates and 4285 treatments, including both compound and gene overexpression perturbations. The experimental results are illustrated in Fig. 2, demonstrating that PhenoProfiler surpasses all competing methods across three benchmarking datasets in both FoE and MAP metrics. As shown in Fig. 2a, PhenoProfiler achieves FoE improvements of 23.8%, 2.1%, and 12.9% over the second-best method (DeepProfiler) on the BBBC022, CDRP-BIO-BBBC036, and TAORF-BBBC037 datasets, respectively. For MAP evaluation, PhenoProfiler outperforms the second-best method by significant margins of 3.3%, 7.3%, and 7.1% across these datasets. Additionally, as depicted in Fig. 2b, we conducted comprehensive recall analysis at multiple thresholds (Recall@1, @3, @5, @10), where recall rates measure the proportion of biologically relevant treatments retrieved within top-ranked predictions (Evaluation Metrics). Using Recall@10 as a representative benchmark, PhenoProfiler exhibits performance enhancements of 17.4%, 4.3%, and 10.5% compared to DeepProfiler on the BBBC022, CDRP-BIO-BBBC036, and TAORF-BBBC037 datasets, respectively.

Fig. 2: Performance analysis of benchmarking methods in biological matching tasks.
figure 2

a Comparison of end-to-end feature representation performance across different methods in biological matching tasks using three benchmark datasets (BBBC022, CDRP-BIO-BBBC036, and TAORF-BBBC037) under the leave-perturbations-out setting, evaluated with two evaluation metrics (MAP, FoE) and four comparison methods (DeepProfiler, ResNet50, ViT, OpenPhenom). b Performance comparison of different methods at different recall rates (recall@1, recall@3, recall@5, and recall@10). c Ablation experiments of PhenoProfiler, showing performance changes after sequential removal of each module. Specifically, “-MSE”, “-Con”, and “-CLS” represent the removal of regression, contrastive, and classification learning in the multi-objective module, while “-Gradient” represents the exclusion of difference operations. d Performance curve of PhenoProfiler under solely classification learning, showing variations in MAP and FoE as the classification loss decreases. e Sensitivity analysis of multi-objective learning of PhenoProfiler, exploring the impact of regression and contrastive learning (\({\lambda }_{2}\) and \({\lambda }_{3}\)) while maintaining fixed classification learning. f Hyperparameter analysis of \({\theta }_{1}\) and \({\theta }_{2}\) in the two parallel branches of the gradient encoder. MAP: Mean Average Precision; FoE: Folds of Enrichment; MSE: Mean Squared Error; Con: Contrastive Learning; CLS: Classification Learning. Source data are provided as a Source Data file.

To further illustrate the contributions of each module within PhenoProfiler, we performed extensive ablation experiments using the BBBC022 dataset in the end-to-end pipeline (Fig. 2c). First, we remove the regression learning component within the multi-objective learning module (i.e., the “-MSE” option), retaining only classification and contrastive learning. The results show that removing regression learning leads to a notable performance drop, with FoE and MAP decreasing by 12.0% and 12.7%, respectively. Next, we test various combinations of loss functions (e.g., “-Con”, “-CLS”, “-MSE-Con”, “-Con-CLS”, and “-CLS-MSE”). For example, the removal of both regression and classification learning leads to a larger performance decrease, with FoE and MAP reduced by 28.0% and 20.6%, respectively. Moreover, judging by the average performance of models without gradient encoding (i.e., “-MSE-Con-Gradient”, “-Con-CLS-Gradient”, and “-CLS-MSE-Gradient”), removing the gradient encoder results in a reduction of 25.2% in FoE and 11.5% in MAP, highlighting the effectiveness of gradient-based feature encoding. Additionally, the measurement metrics do not consistently improve as classification loss decreases. As illustrated in Fig. 2d, while both MAP and FoE initially increase with decreasing classification loss, they eventually decline. This observation highlights the importance of PhenoProfiler’s multi-objective learning design. The optimal weights for multi-objective learning were systematically determined through sensitivity analysis on the BBBC022 dataset. Figure 2e–f presents the tuned parameters with which PhenoProfiler achieves optimal performance. Detailed evaluations under the leave-sites-out setting are provided in Supplementary Notes 2 and 3.

PhenoProfiler demonstrates robust generalization and applicability

To evaluate the generalization of PhenoProfiler, we conduct experiments on the benchmarking datasets using the leave-plates-out and leave-dataset-out evaluation strategies. For the leave-plates-out strategy, a subset of plates is used as the test set while the remaining plates are used for training. For the leave-dataset-out strategy, one dataset is used for training and the other two serve as test sets. For example, “BBBC022 → BBBC036” indicates training on the BBBC022 dataset and validating on BBBC036 (CDRP-BIO-BBBC036). Regarding the leave-plates-out scenario (Fig. 3a), PhenoProfiler consistently surpasses other methods in both FoE and MAP. Specifically, PhenoProfiler achieves higher FoE than the second-best method by 57.0%, 10.7%, and 13.7% on the three datasets, respectively. For the MAP metric, PhenoProfiler outperforms the second-best method by 16.9%, 17.3%, and 6.0%. Figure 3b illustrates the performance comparison in the leave-dataset-out scenario, highlighting the consistently superior performance of PhenoProfiler. For example, in the BBBC022 → BBBC036 scenario, PhenoProfiler outperforms the next best method with higher FoE and MAP metrics by 13.0% and 17.1%, respectively. Similarly, in the BBBC022 → BBBC037 scenario, PhenoProfiler surpasses the next best method with higher FoE and MAP metrics by 6.9% and 5.4%, respectively. Collectively, PhenoProfiler demonstrates better generalization than existing methods, enabling more accurate downstream tasks in drug discovery.

Fig. 3: Benchmarking zero-shot performance across multiple datasets.
figure 3

a Performance evaluation of different models using leave-plates-out validation. b Performance evaluation of different models using leave-dataset-out validation. c Performance evaluation of different models using out-of-distribution validation. The test plates include BR00115125-BR00115134 (cpg0001) and SQ00014812-SQ00014816 (cpg0004). FoE: Folds of Enrichment; MAP: Mean Average Precision; BBBC036: CDRP-BIO-BBBC036; BBBC037: TAORF-BBBC037. Source data are provided as a Source Data file.

To further validate PhenoProfiler’s generalization capability, we performed an out-of-distribution (OOD) evaluation using 10 diverse plates (BR00115125-BR00115134) from the cpg0001 dataset, comprising 76,800 images covering 83 unique treatments and 47 annotated mechanisms of action (MoA). As shown in the upper panel of Fig. 3c, we directly applied PhenoProfiler models pretrained on BBBC022, BBBC036, and BBBC037 to these OOD plates. Remarkably, PhenoProfiler outperformed the second-best model (DeepProfiler) by an average of 45.8% in FoE and 27.3% in MAP across all test plates (upper panel of Fig. 3c), demonstrating robust generalization and OOD robustness. Moreover, since these evaluations involved only the U2OS cell line, we further evaluated PhenoProfiler on five A549 cell line plates (SQ00014812, SQ00014813, …, SQ00014816 from the cpg0004 dataset, lower panel of Fig. 3c). Here, PhenoProfiler outperforms the second-best model (OpenPhenom) by 21.4% (FoE) and 20.3% (MAP) on average, demonstrating robust generalization across different cell lines.

PhenoProfiler effectively removes batch effects for robust phenotypic representations

Variations stemming from technical and instrumental factors9,28 can introduce batch effects between plates subjected to the same treatment, which obscure true biological phenotypic signals and compromise downstream analyses. To evaluate PhenoProfiler’s ability to mitigate batch effects, we employed the Inverse Median Absolute Deviation (IMAD) metric, which quantifies the dispersion of image representations. Higher IMAD values indicate reduced dispersion, reflecting successful batch effect correction (details provided in the Evaluation Metrics section).

Figure 4 illustrates the well-level representation features, with plate IDs distinguished by different colors to highlight batch effects. For DeepProfiler, the representation features extracted from the BBBC022 dataset exhibit clear separation between plate IDs (IMAD = 0.326), indicating the presence of significant plate-specific biases. This separation reflects strong technical variation among the extracted representation features, which can obscure true biological signals and hinder downstream analyses. Upon applying an extra batch correction step to DeepProfiler’s representation features, the UMAP projections displayed improvements in feature cohesiveness (IMAD = 0.458). Notably, the representation features learned by PhenoProfiler exhibit a distinctly more integrated distribution (IMAD = 0.603). This superior performance suggests that PhenoProfiler learns harmonized representation features across different plates, effectively addressing batch effects without additional corrections. This pattern is consistently observed across all three datasets, further validating PhenoProfiler’s reliability in generating phenotypic representations that are robust to technical confounders. The inherent capacity of PhenoProfiler to learn harmonized representations directly from raw data not only reduces the need for computationally intensive post-processing but also ensures that biological signals are preserved.

Fig. 4: Robustness of feature representations across different methods.
figure 4

UMAP visualizations of well-level features predicted by DeepProfiler and PhenoProfiler across three benchmark datasets. Features are colored by plate IDs and treatment conditions (control vs. treatment). IMAD quantifies the cohesiveness of the UMAP patterns, with higher values indicating better batch effects removal. UMAP: Uniform Manifold Approximation and Projection; BC: Batch Correction; IMAD: Inverse Median Absolute Deviation.

Phenotype correction strategy of PhenoProfiler improves biological matches

To effectively capture relative changes under treatments, PhenoProfiler is tailored with a phenotype correction strategy (PCs, Fig. 5a, details in the Methods section) to refine learned phenotypic representations, which distinguishes PhenoProfiler from existing methods. As illustrated in Fig. 5a, PhenoProfiler with PCs corrects image representations by leveraging control and treated wells within one plate and emphasizing relative changes in cell phenotypes under treatments.

Fig. 5: Quantitative analysis of the phenotype correction strategy in PhenoProfiler.
figure 5

a The conceptual motivation for the design of PCs. Phenotypic differences between treated and control wells capture the treatment response. b Ablation experiments demonstrating the impact of PCs across three datasets, showing a consistent increase in the FoE with the inclusion of PCs. c Sensitivity analysis of the hyperparameter α in the PCs. d UMAP visualizations of feature representations generated by PhenoProfiler, with and without PCs, evaluating the harmonization of well-level features using IMAD. PCs: Phenotype correction strategy; IMAD: Inverse Median Absolute Deviation; FoE: Folds of Enrichment; MAP: Mean Average Precision; UMAP: Uniform Manifold Approximation and Projection. Source data are provided as a Source Data file. Created in BioRender. Song, Q. (2025) https://BioRender.com/y4tl93e.

As shown in Fig. 5b, we assess the impact of the PCs through ablation experiments on three benchmark datasets. The results demonstrate that PCs consistently improves the FoE metric with minimal impact on the MAP metric (Supplementary Fig. 3). We also analyze the hyperparameter α in PCs, which determines the relative weight of control versus treated wells. As illustrated in Fig. 5c, the outer circle of the radar chart corresponds to the value of α, with α = 0 representing PhenoProfiler without PCs. As α increases from 0 to 1, the relative weight of control wells increases, resulting in more pronounced differential effects. Meanwhile, the FoE generally trends upward while the MAP metric remains relatively stable. When α is greater than or equal to 0.7, the FoE reaches its maximum. These results demonstrate that PhenoProfiler with PCs effectively captures the relative changes of cell phenotypes under treatments.

Furthermore, we analyze the aggregation of features before and after incorporating PCs across the three benchmark datasets. The dispersion of well-level image representations in UMAP space is quantitatively measured before and after adding PCs using the IMAD metric (Fig. 5d). After implementing PCs, the representation features from different plates become significantly more clustered. Additionally, the IMAD metric increases notably, with substantial improvements of 51.5%, 69.7%, and 11.6% across the three benchmark datasets, respectively.

PhenoProfiler efficiently captures representations of treatment effects

To visually illustrate the treatment effects, we obtain phenotypic representations using PhenoProfiler under various treatment conditions. Figure 6 displays a UMAP projection of PhenoProfiler representations across three benchmark datasets, providing a clear demonstration of how PhenoProfiler captures and organizes biological patterns in both non-end-to-end (Fig. 6a) and end-to-end (Fig. 6b) scenarios. For the BBBC022 and CDRP-BIO-BBBC036 datasets, which involve compound treatments, distinct clusters emerge based on their mechanisms of action (MoA). These clusters represent the functional similarities between compounds with shared MoAs, effectively demonstrating how PhenoProfiler translates phenotypic profiles into meaningful groupings that align with known biological functions. Similarly, the TAORF-BBBC037 dataset, which involves gene overexpression perturbations, reveals clear clusters of treatments corresponding to their genetic pathways, such as MAPK and PI3K/AKT. This clustering reaffirms established biological relationships and validates PhenoProfiler’s capability to accurately discern pathway-specific features. Notably, these groupings remain consistent across different wells, showcasing PhenoProfiler’s robust performance in feature extraction and its reliability in accurately representing treatment effects across experimental variations. However, some clusters are less distinctly discerned in the UMAP projections, suggesting that incorporating additional features, such as compound structures or gene expression data, could enhance the understanding of MoAs.

Fig. 6: Quantitative and qualitative evaluation of feature representations of treatment effects.
figure 6

a UMAP projections of treatment profiles using well-level features provided by PhenoProfiler in the non-end-to-end scenario. b UMAP projections of treatment profiles using well-level features provided by PhenoProfiler in the end-to-end scenario. Well-level profiles, control wells, and treatment-level profiles are included. Text annotations highlight clusters where all or most points share the same biological annotations for treatment-level profiles. UMAP: Uniform Manifold Approximation and Projection. c Heatmap visualization of PhenoProfiler-derived features demonstrates significant feature shifts in drug-treated groups compared to DMSO controls. d Biological interpretability of key drug-responsive phenotypic features captured by PhenoProfiler. The directionality of feature changes captured by PhenoProfiler (with red indicating increase and blue representing decrease) shows strong concordance with established pharmacological mechanisms.

To further assess PhenoProfiler’s ability to identify clinically actionable phenotypic patterns while maintaining biological interpretability, we performed an in-depth evaluation using the cpg0004-LINCS dataset, which includes treatments with well-characterized MoAs. As shown in Fig. 6c, all drug-treated groups exhibited significant feature shifts compared to DMSO controls. Notably, quadruplicate wells of the same drug class demonstrated high consistency, while distinct separation was observed between different drug categories. These results confirm PhenoProfiler’s dual capability: sensitive detection of drug-induced phenotypic perturbations and precise discrimination between pharmacological mechanisms. To elucidate the model’s biological interpretability, we focused on key phenotypic features responsive to pharmacological perturbations. From five major feature categories (granularity, correlation, spatial distribution, intensity, and area shape), we identified the 15 most drug-sensitive biomarkers (three per category). Quantitative analysis in Fig. 6d demonstrates that PhenoProfiler-captured feature dynamics (red for upregulation, blue for downregulation) show remarkable concordance with established pharmacological mechanisms. For example, for aclidinium, PhenoProfiler detected significantly increased cellular compactness and decreased eccentricity, aligning with its known mechanism of vesicle trafficking inhibition through mAChR blockade29,30. More remarkably, PhenoProfiler also captured a decrease in Golgi fluorescence intensity (Cells_Intensity_MeanIntensity_AGP) and an increase in RNA distribution heterogeneity (Cells_Intensity_MADIntensity_RNA), consistent with the secretory dysfunction caused by aclidinium inhibiting muscarinic receptors and leading to signaling blockade29,30. Additionally, we specifically analyzed a typical case of the PKC activator TPA (12-O-Tetradecanoylphorbol-13-acetate). PhenoProfiler accurately captured the cell area expansion induced by this drug, which matches the theory of PKC-mediated cytoskeletal reorganization31,32.

Discussion

In this study, we present PhenoProfiler, an advanced tool for phenotypic representations of cell morphology in drug discovery. PhenoProfiler operates as a fully end-to-end framework, transforming multi-channel images into low-dimensional, biologically meaningful representations. By integrating a multi-objective learning module that incorporates classification, regression, and contrastive learning, PhenoProfiler effectively captures a unified and robust feature space. The extensive benchmarking conducted in this study, involving over 400,000 images from seven publicly available datasets, demonstrates that PhenoProfiler significantly outperforms existing state-of-the-art methods across end-to-end, leave-plates-out, leave-dataset-out, and out-of-distribution validation scenarios. Moreover, PhenoProfiler achieves greater computational efficiency compared to existing non-end-to-end approaches (Supplementary Note 4). Furthermore, in non-end-to-end settings, we utilize over 8.42 million single-cell images for comparative analysis. PhenoProfiler’s superior performance on these large-scale datasets across all scenarios highlights its robustness, generalizability, and scalability, positioning it as a highly effective tool for advancing image-based drug discovery.

A key challenge in phenotypic representation learning lies in the heterogeneity of cellular responses to treatments. To address this, PhenoProfiler incorporates a multi-objective loss function that effectively guides the model’s learning process. First, the classification loss serves as the foundation, capturing phenotypic distinctions across treatments and enabling macro-level differentiation while aligning the learned representations with biologically meaningful semantics. Second, the regression component, supervised by pre-extracted morphological features, directs the model’s attention to cells exhibiting significant phenotypic perturbations, filtering out noise from unaffected cells and mitigating the effects of intracellular heterogeneity. Third, contrastive learning improves robustness to batch effects and morphological variability by encouraging similar representations for similar treatments and distinguishing between different treatments. Together, these objectives enable PhenoProfiler to learn rich hierarchical features, from local morphological details (via regression) to global treatment-specific patterns (via classification and contrastive learning), without relying on segmentation or sub-image extraction. Finally, the Phenotype Correction strategy (PCs) enhances the learned features by contrasting control and treated wells, amplifying biologically meaningful signals related to spatial distributions. In practical applications, PhenoProfiler supports flexible strategies for obtaining the pre-extracted morphological features used in regression supervision. These can be derived either from CellProfiler-extracted features, commonly available in Cell Painting assays, or from morphology profiles generated by a non-end-to-end EfficientNet model (Supplementary Note 5, Supplementary Figs. 4, 5).

While PhenoProfiler sets a new benchmark in phenotypic representation learning, there are several areas for future exploration and enhancement. First, the design of the multi-objective learning module can be further refined by exploring the interconnections and synergies among classification, regression, and contrastive objectives. Understanding the dependencies between these objectives33,34 could lead to more cohesive learning strategies. Additionally, while PhenoProfiler currently employs a stepwise training approach to address conflicts between objectives35,36, future work could focus on developing joint training and optimization techniques to balance these objectives more effectively. Second, recent advancements in large biomedical language models37,38,39,40 offer opportunities to integrate extensive domain knowledge into computational frameworks. Integrating these models’ embeddings into PhenoProfiler could enhance its generalizability, robustness, and effectiveness. Last, future efforts should prioritize integrating assays with complementary data modalities, such as genetic profiles12,41 and chemical structures42,43. Combining these multi-modal data would enable more comprehensive representations, offering a holistic view of cell states and phenotypic responses to various treatments.

The ability of PhenoProfiler to consistently capture and organize complex biological information across diverse datasets and treatment types underscores its versatility and utility in high-throughput drug screening and discovery. By addressing critical challenges in phenotypic profiling, such as scalability, robustness, and interpretability, PhenoProfiler enhances our understanding of treatment effects at the phenotypic level. Additionally, its potential for integrative multi-modal analysis, combining phenotypic data with complementary modalities such as genetic profiles, transcriptomics, and chemical structures, opens new opportunities to explore drug mechanisms and novel drug targets.

Methods

The PhenoProfiler model

PhenoProfiler is an end-to-end model, with input as multi-channel images and output as phenotypic representations under treatments. The overall framework includes three key modules: a gradient encoder, a transformer encoder, and a multi-objective learning module.

Gradient encoder

To enhance the model’s ability to understand cell morphology, we design a gradient encoder based on difference convolution (\({DC}\))23,24. This \({DC}\) function enhances gradient information around cell edges, improving the perception of morphological structures44,45. Formally, let \({p}_{0}\) represent the central position of a local receptive field \(R({p}_{0})\), with \({p}_{{{\rm{i}}}}\in R\left({p}_{0}\right)\). The pixel value at position \({p}_{i}\) in the input image is denoted as \({{{\boldsymbol{x}}}}_{{p}_{{{\rm{i}}}}}\), and the \({DC}\) function is defined as:

$${DC}\left({{{\boldsymbol{x}}}}_{{p}_{0}},\theta \right)={\sum }_{{p}_{i}\in R\left({p}_{0}\right)}{{{\boldsymbol{w}}}}_{{p}_{i}}\bullet \left({{{\boldsymbol{x}}}}_{{p}_{i}}-\theta \bullet {{{\boldsymbol{x}}}}_{{p}_{0}}\right),$$
(1)

where \({{{\boldsymbol{w}}}}_{{p}_{i}}\) is a learnable parameter, and \(\theta \in [0,1]\) is a hyperparameter that controls the balance between semantic and gradient information. When \(\theta=0\), the \({DC}\) function reduces to traditional convolution (\({TC}\)), i.e., \({TC}\left({p}_{0}\right)={\sum }_{{p}_{i}\in R\left({p}_{0}\right)}{{{\boldsymbol{w}}}}_{{p}_{i}}\cdot {{{\boldsymbol{x}}}}_{{p}_{i}}\).
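To make the operation concrete, below is a minimal PyTorch sketch of Eq. (1). It exploits the fact that subtracting \(\theta\) times the center pixel under the kernel weights is equivalent to a 1×1 convolution with the summed kernel weights; the module and argument names are illustrative, not the released implementation.

```python
# Minimal sketch of difference convolution (Eq. 1); theta = 0 recovers
# a traditional convolution, theta > 0 adds gradient (edge) information.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)                      # sum_i w_i * x_i
        if self.theta == 0:
            return out                          # traditional convolution
        # sum_i w_i * (theta * x_0): a 1x1 conv whose kernel is the
        # spatial sum of the learned weights, scaled by theta.
        w_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        center = F.conv2d(x, w_sum)
        return out - self.theta * center
```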

The gradient encoder consists of two components: gradient enhancement and residual feature extraction. The gradient enhancement includes two parallel branches of difference convolution: the Deep Gradient branch, denoted as \({DG}={DC}(\bullet,{\theta }_{1})\), and the Shallow Gradient branch, denoted as \({SG}={DC}(\bullet,{\theta }_{2})\). We conducted a systematic experimental analysis of the hyperparameters \({\theta }_{1}\) and \({\theta }_{2}\) in the gradient encoder module (see Parameter Tuning section, Supplementary Note 3). Through comprehensive evaluation of different weight combinations, we ultimately determined the optimal configuration of \({\theta }_{1}=0.7\) and \({\theta }_{2}=0.3\) for the two-branch architecture. The multi-channel input images, represented as \({{{\boldsymbol{x}}}}_{{in}}\), are processed in parallel through the two branches. The outputs of the two branches are then concatenated and passed through a Multi-Layer Perceptron (MLP) layer to produce the enhanced features \({{{\boldsymbol{H}}}}_{1}\), as follows:

$${{{\boldsymbol{G}}}}_{{DG}}={BN}\left({{\rm{ReLU}}}\left({DG}\left({{{\boldsymbol{x}}}}_{{in}}\right)\right)\right),$$
(2)
$${{{\boldsymbol{G}}}}_{{SG}}={BN}\left({{\rm{ReLU}}}\left({SG}\left({{{\boldsymbol{x}}}}_{{in}}\right)\right)\right),$$
(3)
$${{{\boldsymbol{H}}}}_{1}={BN}\left({{\rm{ReLU}}}\left({{\rm{MLP}}}\left(\left[{{{\boldsymbol{G}}}}_{{DG}},{{{\boldsymbol{G}}}}_{{SG}}\right]\right)\right)\right).$$
(4)

Here \({{{\boldsymbol{H}}}}_{1}\) denotes the enhanced latent features, and batch normalization (\({BN}\)) ensures stable and efficient training by normalizing the intermediate latent features.

The subsequent residual feature extraction component employs the ResNet5016 model, pre-trained on ImageNet15. By incorporating residual connections, the ResNet50 model effectively addresses the vanishing gradient problem encountered in training deep neural networks, enabling the network to extract deeper features. This step outputs a fixed-length feature vector \({{{\boldsymbol{H}}}}_{2}{{\mathbb{\in }}{\mathbb{R}}}^{B\times 2048}\), where \(B\) denotes the batch size.
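Combining the two difference-convolution branches (Eqs. (2)–(4)) with the pre-trained backbone, the gradient encoder can be sketched as follows, building on the DiffConv2d sketch above. The intermediate channel width, the 1×1 fusion layer standing in for the MLP, and the use of the timm library are assumptions for illustration.

```python
# Sketch of the gradient encoder: two parallel DC branches (Eqs. 2-3),
# fusion into H1 (Eq. 4), then an ImageNet-pretrained ResNet50 -> H2.
import torch
import torch.nn as nn
import timm

class GradientEncoder(nn.Module):
    def __init__(self, in_ch=5, mid_ch=32, theta1=0.7, theta2=0.3):
        super().__init__()
        self.dg = DiffConv2d(in_ch, mid_ch, theta=theta1)  # deep gradient
        self.sg = DiffConv2d(in_ch, mid_ch, theta=theta2)  # shallow gradient
        self.act = nn.ReLU(inplace=True)
        self.bn_dg = nn.BatchNorm2d(mid_ch)
        self.bn_sg = nn.BatchNorm2d(mid_ch)
        # 1x1 conv plays the role of the fusing MLP, mapping to the
        # 3 channels the pretrained backbone expects (an assumption).
        self.fuse = nn.Conv2d(2 * mid_ch, 3, kernel_size=1)
        self.bn_out = nn.BatchNorm2d(3)
        # num_classes=0 returns the pooled 2048-d feature vector.
        self.backbone = timm.create_model("resnet50", pretrained=True,
                                          num_classes=0)

    def forward(self, x):                        # x: (B, 5, 448, 448)
        g_dg = self.bn_dg(self.act(self.dg(x)))  # Eq. (2)
        g_sg = self.bn_sg(self.act(self.sg(x)))  # Eq. (3)
        h1 = self.bn_out(self.act(self.fuse(torch.cat([g_dg, g_sg], 1))))
        return self.backbone(h1)                 # H2: (B, 2048)
```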

Transformer encoder

The latent features \({{{\boldsymbol{H}}}}_{2}\) are passed through the transformer encoder17,46 with a multi-head self-attention mechanism, effectively capturing global dependencies among features. Then, layer normalization and residual connections are applied to ensure stable gradient propagation. Subsequently, a Feed-Forward Neural Network (\({FFN}\)) is used to further extract high-level features, resulting in the output feature vector \({{{\boldsymbol{H}}}}_{6}{{\mathbb{\in }}{\mathbb{R}}}^{B\times 2048}\). This process is formulated as below:

$${{{\boldsymbol{H}}}}_{3}={\mbox{MultiHeadAttention}}\left({{{\boldsymbol{H}}}}_{2},{{{\boldsymbol{H}}}}_{2},{{{\boldsymbol{H}}}}_{2}\right),$$
(5)
$${{{\boldsymbol{H}}}}_{4}={\mbox{LayerNorm}}\left({{{\boldsymbol{H}}}}_{2}+{\mbox{Dropout}}\left({{{\boldsymbol{H}}}}_{3}\right)\right),$$
(6)
$${{{\boldsymbol{H}}}}_{5}={\mbox{FFN}}\left({{{\boldsymbol{H}}}}_{4}\right)={\mbox{ReLU}}\left({{{\boldsymbol{W}}}}_{2}\left({\mbox{ReLU}}\left({{{\boldsymbol{W}}}}_{1}{{{\boldsymbol{H}}}}_{4}+{{{\boldsymbol{b}}}}_{1}\right)\right)+{{{\boldsymbol{b}}}}_{2}\right),$$
(7)
$${{{\boldsymbol{H}}}}_{6}={\mbox{LayerNorm}}\left({{{\boldsymbol{H}}}}_{4}+{Dropout}\left({{{\boldsymbol{H}}}}_{5}\right)\right).$$
(8)

Here, \({MultiHeadAttention}\) computes attention weights across multiple attention heads to capture diverse feature dependencies. The \({FFN}\) consists of two fully connected layers with non-linear ReLU activations, parametrized by weights \({{{\boldsymbol{W}}}}_{1}\), \({{{\boldsymbol{W}}}}_{2}\), and biases \({{{\boldsymbol{b}}}}_{1}\), \({{{\boldsymbol{b}}}}_{2}\). This transformer encoder effectively captures global and local information simultaneously, ensuring the generation of rich and informative feature representations.
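A minimal sketch of this encoder step (Eqs. (5)–(8)) is shown below, treating each image’s 2048-d vector as a single token; the head count, FFN width, and dropout rate are illustrative assumptions.

```python
# Sketch of the transformer encoder block over H2 (Eqs. 5-8).
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    def __init__(self, dim=2048, n_heads=8, ffn_dim=2048, p_drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.drop = nn.Dropout(p_drop)
        self.ffn = nn.Sequential(                     # Eq. (7)
            nn.Linear(dim, ffn_dim), nn.ReLU(inplace=True),
            nn.Linear(ffn_dim, dim), nn.ReLU(inplace=True),
        )

    def forward(self, h2):                            # h2: (B, 2048)
        tokens = h2.unsqueeze(1)                      # one token per image
        h3, _ = self.attn(tokens, tokens, tokens)     # Eq. (5)
        h4 = self.norm1(tokens + self.drop(h3))       # Eq. (6)
        h5 = self.ffn(h4)
        h6 = self.norm2(h4 + self.drop(h5))           # Eq. (8)
        return h6.squeeze(1)                          # H6: (B, 2048)
```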

Multi-objective learning

To facilitate efficient multi-objective learning, we project latent features into a lower-dimensional space. Specifically, the latent feature \({{{\boldsymbol{H}}}}_{6}{{\mathbb{\in }}{\mathbb{R}}}^{B\times 2048}\) is first linearly transformed, a GELU activation is applied to introduce non-linearity, and the result is further processed through a fully connected layer with dropout to prevent overfitting. This feature projection yields the final output \(\hat{{{\boldsymbol{Z}}}}{{\mathbb{\in }}{\mathbb{R}}}^{B\times 672}\) as follows:

$${{{\boldsymbol{Z}}}}_{1}={{{\bf{W}}}}_{3}{{{\boldsymbol{H}}}}_{6}+{{{\boldsymbol{b}}}}_{3},$$
(9)
$${{{\boldsymbol{Z}}}}_{2}={\mbox{Dropout}}\left({{{\boldsymbol{W}}}}_{4}\bullet {\mbox{GELU}}\left({{{\boldsymbol{Z}}}}_{1}\right)+{{{\boldsymbol{b}}}}_{4}\right),$$
(10)
$$\hat{{{\boldsymbol{Z}}}}={\mbox{LayerNorm}}\left({{{\boldsymbol{Z}}}}_{1}+{{{\boldsymbol{Z}}}}_{2}\right).$$
(11)

Then a classification head, implemented as a simple linear layer, maps the output representations \(\hat{{{\boldsymbol{Z}}}}\) to \(\hat{{{\boldsymbol{y}}}}\), the predicted treatment labels.
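The projection and classification head (Eqs. (9)–(11)) can be sketched as follows; the dropout rate and number of treatment classes are dataset-dependent assumptions.

```python
# Sketch of the feature projection (Eqs. 9-11) plus the linear
# classification head producing treatment logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    def __init__(self, in_dim=2048, out_dim=672, n_classes=1000, p_drop=0.1):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)        # Eq. (9)
        self.fc = nn.Linear(out_dim, out_dim)
        self.drop = nn.Dropout(p_drop)
        self.norm = nn.LayerNorm(out_dim)
        self.cls_head = nn.Linear(out_dim, n_classes)

    def forward(self, h6):
        z1 = self.proj(h6)
        z2 = self.drop(self.fc(F.gelu(z1)))           # Eq. (10)
        z_hat = self.norm(z1 + z2)                    # Eq. (11), (B, 672)
        return z_hat, self.cls_head(z_hat)            # embeddings, logits
```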

Classification learning

To characterize the phenotypic responses of cells under various treatments and enable the model to learn discriminative features among different treatments, we employ a cross-entropy loss function to quantify the discrepancy between the model’s predictions and the ground truth:

$${{\boldsymbol{{\mathscr{L}}}}}_{{\mbox{CLS}}}=-{\sum }_{i=1}^{N}{{{\boldsymbol{y}}}}_{i}\log \left({\hat{{{\boldsymbol{y}}}}}_{i}\right).$$
(12)

Here, \(N\) represents the number of treatment categories, and \({{{\boldsymbol{y}}}}_{i}\) is the one-hot encoding of the treatment label in ground truth. If the ground truth is treatment category \(i\), then \({{{\boldsymbol{y}}}}_{i}=1\); otherwise, \({{{\boldsymbol{y}}}}_{i}=0\).

Regression learning

Here, we design a regression learning component to learn cell morphological representations. Unlike discrete classification labels, this approach leverages richer and more nuanced feature information, enabling the model to focus on cells exhibiting the most pronounced phenotypic perturbations while effectively disregarding interference from unaffected cell populations. To obtain regression labels, we use the median of the morphology profiles of cpg0019 as the regression morphology labels and train the model using Mean Squared Error (MSE) loss:

$${{\boldsymbol{{\mathscr{L}}}}}_{{\mbox{MSE}}}=\frac{1}{M}{\sum }_{i=1}^{M}{({{{\boldsymbol{Z}}}}_{i}-{\hat{{{\boldsymbol{Z}}}}}_{i})}^{2},$$
(13)

where \({{{\boldsymbol{Z}}}}_{i}\) is the morphology profile of the \(i\)-th image, \({\hat{{{\boldsymbol{Z}}}}}_{i}\) is the predicted morphology features, and \(M\) is the number of images.

Contrastive learning

Contrastive learning enhances the model’s ability to distinguish features by maximizing the similarity between similar images and minimizing the similarity between dissimilar ones. Specifically, contrastive learning does not rely on ground truth labels but focuses on learning feature representations based on the relative relationships between images. This approach not only reduces the negative impact of noisy labels but also improves the model’s robustness and generalization when handling unseen data. The contrastive loss function is formulated as follows:

$${{\boldsymbol{{\mathscr{L}}}}}_{{\mbox{Con}}}=-\frac{1}{B}{\sum }_{i=1}^{B}\log \frac{\exp \left({\hat{{{\boldsymbol{Z}}}}}_{i}\cdot {{{\boldsymbol{Z}}}}_{i}/\tau \right)}{{\sum }_{j=1}^{B}\exp \left({\hat{{{\boldsymbol{Z}}}}}_{i}\cdot {{{\boldsymbol{Z}}}}_{j}/\tau \right)},$$
(14)

where \(B\) is the batch size, \({\hat{{{\boldsymbol{Z}}}}}_{i}\) denotes the predicted morphology features of the \(i\)-th image, \({{{\boldsymbol{Z}}}}_{j}\) denotes the morphology profile of the \(j\)-th image in the batch, and \(\tau\) is the temperature parameter. This objective trains the model to produce discriminative feature vectors by maximizing the similarity between matching image-embedding pairs while minimizing the similarity between non-matching pairs.
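Eq. (14) is an InfoNCE-style objective; a minimal sketch follows, where the \(\ell_2\) normalization of embeddings and the value of \(\tau\) are assumptions.

```python
# Sketch of the contrastive objective (Eq. 14): diagonal pairs
# (prediction, matching profile) are positives, all others negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(z_hat, z, tau=0.07):
    """z_hat, z: (B, D) predicted embeddings and morphology profiles."""
    z_hat = F.normalize(z_hat, dim=1)
    z = F.normalize(z, dim=1)
    logits = z_hat @ z.t() / tau            # (B, B) pairwise similarities
    targets = torch.arange(z_hat.size(0), device=z_hat.device)
    return F.cross_entropy(logits, targets) # positives on the diagonal
```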

Multi-objective loss

By integrating classification, regression, and contrastive learning, PhenoProfiler provides a unified and robust feature space and comprehensively learns the image representations of multi-channel cell images. This multi-objective learning architecture enhances the overall generalization performance of the model. To achieve an effective balance in multi-objective learning, we assign a weight parameter to each objective and adjust these parameters to balance the losses. The final total loss function can be expressed as:

$${{\boldsymbol{{\mathscr{L}}}}}_{{\mbox{total}}}={\lambda }_{1}{{\boldsymbol{{\mathscr{L}}}}}_{{\mbox{CLS}}}+{\lambda }_{2}{{\boldsymbol{{\mathscr{L}}}}}_{{\mbox{MSE}}}+{\lambda }_{3}{{\boldsymbol{{\mathscr{L}}}}}_{{\mbox{Con}}},$$
(15)

where \({\lambda }_{1},{\lambda }_{2},{\lambda }_{3}\) are the weight parameters for the classification, regression, and contrastive learning, respectively. Based on extensive ablation experiments (see Fig. 2e), we set these weights to 0.1, 100, and 1, respectively. This multi-loss balancing strategy enables PhenoProfiler to find the optimal trade-off among the different objectives, thereby enhancing the overall performance and robustness of the model. Additionally, this strategy allows us to flexibly adjust the weight of each objective according to the specific requirements of the application scenario, achieving optimal feature representation and prediction performance.
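Putting the three objectives together, Eq. (15) amounts to a weighted sum; a minimal sketch, reusing the contrastive_loss sketch above:

```python
# Sketch of the multi-objective loss (Eq. 15) with the reported
# weights (lambda1, lambda2, lambda3) = (0.1, 100, 1).
import torch.nn.functional as F

def total_loss(logits, labels, z_hat, z,
               lambdas=(0.1, 100.0, 1.0), tau=0.07):
    l_cls = F.cross_entropy(logits, labels)      # Eq. (12)
    l_mse = F.mse_loss(z_hat, z)                 # Eq. (13)
    l_con = contrastive_loss(z_hat, z, tau)      # Eq. (14), sketch above
    return lambdas[0] * l_cls + lambdas[1] * l_mse + lambdas[2] * l_con
```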

Parameter tuning

PhenoProfiler is an end-to-end multi-channel image encoder designed to convert multi-channel images into corresponding morphology representations. During training, conflicts among objectives in multi-objective joint training caused the model to struggle to converge to an optimal state35,36. To address this, we adopted a stepwise training strategy. Initially, we trained the regression learning using MSE loss to optimize the model. After approximately 100 epochs, we proceeded with joint optimization based on the multi-objective learning architecture. The hyperparameter settings were as follows: a batch size of 300, a maximum of 200 training epochs, and 12 workers. The learning rate followed a staged decay strategy: 2e-3 for the first 10 epochs, 1e-3 for the next 50 epochs, 5e-4 for the subsequent 60 epochs, and 1e-4 for the final 80 epochs. The training environment was Ubuntu 22.04, utilizing four NVIDIA A100 GPUs (40GB version). Details on the tuning of loss function coefficients and the gradient encoder hyperparameters are provided in Supplementary Note 3 and Supplementary Fig. 2.
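The staged decay can be expressed as a multiplicative schedule over the 2e-3 base rate; a minimal sketch follows, with the Adam optimizer and the placeholder model as assumptions (the optimizer type is not stated here):

```python
# Sketch of the staged learning-rate decay via LambdaLR multipliers.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for PhenoProfiler's parameters
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)

def staged_lr(epoch):
    """Multiplier over the 2e-3 base rate, matching the schedule above."""
    if epoch < 10:
        return 1.0     # 2e-3 for the first 10 epochs
    if epoch < 60:
        return 0.5     # 1e-3 for the next 50 epochs
    if epoch < 120:
        return 0.25    # 5e-4 for the subsequent 60 epochs
    return 0.05        # 1e-4 for the final 80 epochs

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=staged_lr)
# call scheduler.step() once at the end of each epoch
```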

Model inference

In model inference, we first clarify the four levels of data involved in this task. The dataset comprises four levels of features: plate-level, treatment-level, well-level, and site-level. Specifically, a dataset contains P plates, each plate includes T treatments, each treatment corresponds to W wells, and each well contains S sites, with each site corresponding to a multi-channel image. Hence, the output of PhenoProfiler is a site-level feature \(\hat{{{\boldsymbol{Z}}}}\). The values of P, T, W, and S vary across datasets. Following the validation procedure used in DeepProfiler, we applied mean aggregation at both stages (site to well, and well to treatment) to obtain the next-level aggregated features, as sketched below. After obtaining well-level features, we employed the Sphering transform as a batch correction method to minimize confounders. Through these aggregation processes, we obtained the final treatment-level features for evaluation.
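A minimal sketch of this two-stage mean aggregation, assuming a pandas data frame with one row per site and illustrative column names (the sphering step is omitted):

```python
# Sketch of site -> well -> treatment mean aggregation.
import pandas as pd

def aggregate(site_df, feat_cols):
    """site_df: one row per site, with 'plate', 'well', 'treatment' columns."""
    well_df = (site_df.groupby(["plate", "well", "treatment"])[feat_cols]
                      .mean().reset_index())      # site-level -> well-level
    # (sphering batch correction would be applied to well_df here)
    treat_df = (well_df.groupby("treatment")[feat_cols]
                       .mean().reset_index())     # well-level -> treatment-level
    return well_df, treat_df
```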

Phenotype Correction Strategy (PCs) directly optimizes the output of PhenoProfiler within a plate. It leverages the differences between treated and control wells within a plate to refine image representations. The implementation process of PCs is detailed in Fig. 1c. First, we calculate the mean of all control wells in the current plate, denoted as \({{{\boldsymbol{W}}}}^{i}\), where \(i\) represents the \(i\)-th plate. Assuming that a plate contains C control wells, the calculation is as follows:

$${{{\boldsymbol{W}}}}^{i}=\frac{1}{C*S}{\sum }_{j=1}^{C}{\sum }_{k=1}^{S}{\hat{{{\boldsymbol{Z}}}}}_{{jk}}.$$
(16)

Then, let \(\alpha\) be a hyperparameter that balances the weight between control and treated wells, and let \({\hat{{{\boldsymbol{E}}}}}_{{ij}}\) represent the predicted embedding vector for the \(j\)-th well in the \(i\)-th plate. \(\alpha*{{{\boldsymbol{W}}}}^{i}\) is subtracted from all predicted embeddings \({\hat{{{\boldsymbol{E}}}}}_{{ij}}\) in the current plate to obtain the refined embeddings \({{{\boldsymbol{E}}}}_{{ij}}\). The process is as follows:

$${{{\boldsymbol{E}}}}_{{ij}}={\hat{{{\boldsymbol{E}}}}}_{{ij}}-\alpha*{{{\boldsymbol{W}}}}^{i}.$$
(17)

PCs serves as a correction and optimization operation applied to the extracted features, making it virtually cost-free and plug-and-play.
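At the well level, Eqs. (16)–(17) reduce to subtracting a scaled mean control embedding; a minimal NumPy sketch, assuming well-level embeddings from a single plate:

```python
# Sketch of the phenotype correction strategy (Eqs. 16-17).
import numpy as np

def phenotype_correction(embeddings, is_control, alpha=0.7):
    """embeddings: (n_wells, D) well-level embeddings from one plate;
    is_control: boolean mask marking control wells in that plate."""
    w_i = embeddings[is_control].mean(axis=0)  # Eq. (16): mean control profile
    return embeddings - alpha * w_i            # Eq. (17): refined embeddings
```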

Web server implementation

The PhenoProfiler web server integrates a JavaScript/TypeScript frontend (Next.js framework) with server-side rendering and Tailwind CSS styling, enhanced with React hooks for dynamic interactivity. The Python Django backend supports scalable data processing, coupled with an SQLite database for lightweight storage. A RESTful API coordinates client-server communication, enabling seamless file uploads and standardized JSON responses. Deployment utilizes the Caddy server with automated HTTPS/SSL and reverse proxy configurations, ensuring secure and efficient operations. The platform is publicly accessible at https://phenoprofiler.org.

Benchmarking methods

To intuitively evaluate PhenoProfiler’s performance within an end-to-end pipeline, we compared it against four benchmarking methods: DeepProfiler9, OpenPhenom20, ResNet5016, and ViT-Base17. DeepProfiler is a deep learning-based phenotyping tool that processes single-cell imaging data using an EfficientNet model architecture to generate single-cell morphological features. To adapt it for end-to-end image processing tasks, we trained the EfficientNet model on our end-to-end image dataset. OpenPhenom is a self-supervised learning method based on masked autoencoders. Since its training includes multi-cell images, we directly used the official pre-trained weights (available at https://huggingface.co/recursionpharma/OpenPhenom) for inference, resizing all input images to 256×256 resolution without further training or fine-tuning. ResNet50 and ViT-Base are widely recognized as general-purpose benchmark models in computer vision, serving as standard references for performance comparison. We made uniform adjustments to apply these methods to high-content multi-channel image processing. Specifically, to handle five-channel Cell Painting images, we included a convolutional layer at the front of each model to adjust the number of input image channels from five to three and reduce the dimensionality. The networks were initialized using the corresponding pre-trained weights provided by the “timm” library. For example, the ViT-Base model utilized the weights for vit_base_patch32_224. Notably, we adopted a fully trainable parameter mode for all models, without freezing any task weights, and maintained consistent training methods and hyperparameters throughout the training process.
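The channel adapter described above can be sketched as follows; the 1×1 kernel size and the timm model name are illustrative assumptions:

```python
# Sketch of adapting a 3-channel ImageNet backbone to 5-channel
# Cell Painting input via a learnable channel-reduction layer.
import torch.nn as nn
import timm

class FiveChannelBaseline(nn.Module):
    def __init__(self, name="vit_base_patch32_224", n_classes=1000):
        super().__init__()
        self.adapter = nn.Conv2d(5, 3, kernel_size=1)   # 5 -> 3 channels
        self.backbone = timm.create_model(name, pretrained=True,
                                          num_classes=n_classes)

    def forward(self, x):                               # x: (B, 5, 224, 224)
        return self.backbone(self.adapter(x))
```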

In the non-end-to-end scenarios, we included three additional comparison methods: CellProfiler18, SPACe19, and EfficientNet21. CellProfiler is an open-source software tool designed for measuring and analyzing cell images, providing a robust platform for extracting quantitative data from biological images. It enables researchers to identify and quantify phenotypic changes effectively. SPACe is an open-source single-cell image analysis platform. Compared to CellProfiler, it achieves faster processing speeds while maintaining high accuracy by optimizing feature extraction and employing CellPose-powered cell segmentation. Following SPACe’s pipeline, we merged the five feature files extracted in its fourth step to obtain 423-dimensional features, encompassing intensity, morphology, and texture measurements, which were then used for downstream performance evaluation. It is worth noting that this tool has a relatively high entry barrier and learning curve, as it requires handling multiple prerequisites. For instance, users need to create customized guideline files tailored to their dataset characteristics and modify corresponding data-loading code accordingly. As for EfficientNet, adhering to the design framework of DeepProfiler, we utilized models pre-trained on the large-scale ImageNet dataset without additional training or fine-tuning on Cell Painting images. By incorporating these models, we aimed to benchmark our pipeline’s performance against established standards in the field, ensuring a comprehensive evaluation of its capabilities.

Benchmarking datasets

The PhenoProfiler model leverages seven distinct datasets: BBBC02225, CDRP-BIO-BBBC03626, TAORF-BBBC03727, LUAD-BBBC04347, LINCS48, cpg00017, and cpg000412. Among these, BBBC022, CDRP-BIO-BBBC036, LINCS, cpg0001, and cpg0004 focus on phenotypic responses to compound treatments, whereas TAORF-BBBC037 and LUAD-BBBC043 address phenotypic responses to gene overexpression. Together, these datasets form a comprehensive dataset of nearly 400,000 multi-channel images. These images encompass two treatment types (compounds and gene overexpression), two control types (empty and DMSO), and two cell lines (A549 and U2OS), collected from 246 plates. Detailed dataset information is summarized in Supplementary Data 1. To facilitate data storage and transmission, we apply image compression and illumination correction to convert the images from TIFF to PNG format, achieving approximately six-fold compression (2× from bit-depth reduction from 16-bit to 8-bit and ~3× from lossless PNG encoding) without significant quality loss9. To further validate the cell morphology representation capability of PhenoProfiler in a non-end-to-end pipeline, we introduce the cpg0019 dataset9. This dataset comprises 8.4 million single-cell images, which were cropped from multi-channel images across the first five datasets. It includes 450 treatments, meticulously selected to represent a diverse array of phenotypic responses.

Given the distinct input requirements of the end-to-end and non-end-to-end pipelines, we trained and validated them on different datasets. For the end-to-end pipeline, we directly used the BBBC022, CDRP-BIO-BBBC036, and TAORF-BBBC037 datasets. For the non-end-to-end pipeline, cells from the first five datasets were segmented into single-cell sub-images and supplemented with a subset from the cpg0019 dataset. Because the other two datasets lacked ground truth, both pipelines were ultimately evaluated on BBBC022, CDRP-BIO-BBBC036, and TAORF-BBBC037, with the main distinction being whether the input images were cropped. To ensure robust evaluation, we applied two complementary validation strategies: leave-perturbations-out (Fig. 2) and leave-sites-out (Supplementary Fig. 2). In the leave-perturbations-out setting, the training and test sets contained entirely distinct perturbations, ensuring evaluation on unseen experimental conditions. The leave-sites-out strategy assessed performance at the site level, the smallest input unit, by systematically holding out a subset of sites from each well, providing an additional layer of fine-grained validation. Finally, the cpg0001 and cpg0004 datasets were used to test out-of-distribution generalization.

Data preprocessing involves three main components: (1) Image data preprocessing: We stack images from different channels in the order of [‘DNA’, ‘ER’, ‘RNA’, ‘AGP’, ‘Mito’] to obtain multi-channel images. The images are resized to a uniform size of (5, 448, 448) pixels and the pixel values are scaled to the range of 0 to 1. (2) Classification label preprocessing: We read the CSV file and initialize the label encoder. Labels are encoded based on the column names (Treatment or pert_name) in the CSV file. (3) Morphology profiles preprocessing for the regression and contrastive learning. For the supervisory labels in regression and contrastive learning, we select the median values of the morphology profiles provided in the cpg0019 dataset.
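A minimal sketch of step (1), assuming one PNG file per channel:

```python
# Sketch of image preprocessing: stack the five channels in a fixed
# order, resize to 448x448, and scale pixel values to [0, 1].
import numpy as np
from PIL import Image

CHANNELS = ["DNA", "ER", "RNA", "AGP", "Mito"]

def load_multichannel(paths_by_channel, size=(448, 448)):
    """paths_by_channel: dict mapping channel name -> PNG file path."""
    planes = []
    for ch in CHANNELS:
        img = Image.open(paths_by_channel[ch]).resize(size)
        planes.append(np.asarray(img, dtype=np.float32) / 255.0)
    return np.stack(planes, axis=0)      # (5, 448, 448), values in [0, 1]
```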

Evaluation metrics

We evaluate the model’s capacity to represent cell morphology using a reference collection of treatments to identify biological matches in treatment experiments. Following strategies outlined in previous studies9,49,50,51, we implement a biological matching task where users can search for treatments associated with the same MoA or genetic pathway, applicable to both compound and gene overexpression perturbations. Initially, we aggregate features from various methods at the treatment level and assess the relationships among these treatments within the feature space, guided by established biological connections. This approach allows us to identify treatments that are proximally situated, suggesting potential similarities in their biological effects or mechanisms of action. This evaluation not only demonstrates the model’s effectiveness in representing cell morphology but also offers a valuable framework for advancing biological research. To quantify the similarity between query treatments, we utilize cosine similarity and generate a ranked treatment list based on relevance, presented in descending order. A positive result is achieved if at least one biological annotation in the sorted list matches the query; otherwise, the result is regarded as negative.
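A minimal sketch of this ranking step, assuming treatment-level feature vectors and MoA/pathway annotations:

```python
# Sketch of the biological matching task: rank reference treatments
# by cosine similarity to a query, in descending order of relevance.
import numpy as np

def rank_matches(query_vec, ref_vecs, ref_annotations):
    """query_vec: (D,); ref_vecs: (n, D); ref_annotations: MoA labels."""
    q = query_vec / np.linalg.norm(query_vec)
    r = ref_vecs / np.linalg.norm(ref_vecs, axis=1, keepdims=True)
    sims = r @ q                         # cosine similarities
    order = np.argsort(-sims)            # most similar first
    return [(ref_annotations[i], sims[i]) for i in order]
```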

For evaluating the quality of results for a given query, we employ two primary metrics: (1) Folds of Enrichment (FoE) and (2) Mean Average Precision (MAP).

(1) Folds of Enrichment (FoE): This metric assesses the overrepresentation of predicted features in the reference set, indicating the model’s ability to identify relevant biological treatments. We calculate the odds ratio using a one-sided Fisher’s exact test for each query treatment, which employs a 2 × 2 contingency table. The first row contains the counts of treatments with the same MoAs or pathways (positive matches) versus those with different MoAs or pathways (negative matches) above a pre-defined threshold. The second row contains the corresponding counts for treatments below the threshold. The odds ratio is computed as the sum of the first row divided by the sum of the second row, estimating the likelihood of observing treatments sharing the same MoA or pathway among the top connections. We average the odds ratios across all query treatments, with the threshold set at the top 1% of connections, thus anticipating significant enrichment for positive matches49.
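A hedged sketch of this per-query computation using SciPy’s one-sided Fisher’s exact test; the quantile-based thresholding is an illustrative reading of the top-1% cutoff:

```python
# Sketch of per-query enrichment: 2x2 table of positive/negative
# matches above and below the top-1% similarity threshold.
import numpy as np
from scipy.stats import fisher_exact

def enrichment(similarities, is_positive, top_frac=0.01):
    """similarities: (n,) cosine similarities of one query to all treatments;
    is_positive: (n,) boolean, True if a treatment shares the query's MoA."""
    thresh = np.quantile(similarities, 1.0 - top_frac)
    above = similarities >= thresh
    table = [[np.sum(above & is_positive),  np.sum(above & ~is_positive)],
             [np.sum(~above & is_positive), np.sum(~above & ~is_positive)]]
    odds_ratio, _ = fisher_exact(table, alternative="greater")
    return odds_ratio  # averaged over all queries to obtain FoE
```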

(2) Mean Average Precision (MAP): For each query treatment, we calculate the average precision, which is the area under the precision-recall curve, following standard practices in information retrieval. The evaluation process starts with the result most similar to the query and continues until all relevant pairs (those with the same mechanism of action or pathway) are identified. MAP effectively captures both the precision and recall of the model’s predictions, offering insights into its reliability and robustness. The specific calculation process is as follows:

$${\mbox{Precision}}=\frac{{\mbox{TP}}}{{\mbox{TP}}+{\mbox{FP}}},$$
(18)
$${\mbox{Recall}}=\frac{{\mbox{TP}}}{{\mbox{TP}}+{\mbox{FN}}}.$$
(19)

Here, \({TP}\) stands for true positives, \({FP}\) for false positives, and \({FN}\) for false negatives. MAP refers to the average precision across multiple queries. For each query, we calculate the area under the precision-recall curve and then average these values across all queries. Given that the number of MoAs or pathways varies, precision and recall are interpolated for each query to cover the maximum number of recall points. The interpolated precision at each recall point is:

$${P}_{{inter}}\left(r\right)={\max }_{{r}^{{\prime} }\ge r}P\left({r}^{{\prime} }\right)$$
(20)

The average precision of a query treatment is the mean of the interpolated precision values \({P}_{{inter}}\) at all recall points. The reported MAP is the average of the average precision values across all queries. Together, these metrics provide a rigorous assessment of PhenoProfiler’s performance in capturing biological relevance in treatment experiments and underscore its utility in phenotypic drug discovery.
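A minimal sketch of interpolated average precision for a single query, assuming a boolean relevance vector over the ranked list (with at least one relevant entry):

```python
# Sketch of interpolated average precision (Eq. 20) for one query.
import numpy as np

def average_precision(is_relevant):
    """is_relevant: boolean array over results ranked by descending similarity."""
    is_relevant = np.asarray(is_relevant, dtype=bool)
    hits = np.cumsum(is_relevant)
    ranks = np.arange(1, len(is_relevant) + 1)
    precision = hits / ranks
    # P_inter(r) = max over r' >= r of P(r'): running max from the tail
    p_inter = np.maximum.accumulate(precision[::-1])[::-1]
    # mean interpolated precision at the recall points (relevant ranks)
    return p_inter[is_relevant].mean()
```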

We also used recall rates at various levels (recall@1, recall@3, recall@5, and recall@10) to evaluate the model’s performance. Recall@K is an important metric in information retrieval and recommendation systems, indicating the proportion of correct results within the top K returned results. For instance, recall@5 represents the proportion of biologically related treatments retrieved within the top five positions of the ranked list of predicted treatments. This is crucial for user experience, as users typically only look at the first few results, and the relevance of these results directly impacts user satisfaction.

In addition to the main metrics mentioned above, we introduced the Inverse Median Absolute Deviation (IMAD) metric to quantitatively evaluate the aggregation degree of features. The calculation steps for the IMAD metric are as follows: First, Principal Component Analysis (PCA) is performed to reduce the dimensionality of the data, retaining 95% of the variance. Then, we use Uniform Manifold Approximation and Projection (UMAP) for further dimensionality reduction and embedding. Next, we combine the UMAP 1 and UMAP 2 columns into a coordinate array and calculate the pairwise distances between all points. Subsequently, we compute the Median Absolute Deviation (MAD) of these distances. Finally, we take the reciprocal of the MAD to obtain the IMAD. A higher IMAD indicates a tighter aggregation of the data.
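Following these steps, a minimal sketch using scikit-learn, umap-learn, and SciPy (UMAP parameters left at library defaults as an assumption):

```python
# Sketch of the IMAD metric: PCA (95% variance) -> UMAP (2-D) ->
# pairwise distances -> median absolute deviation -> reciprocal.
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial.distance import pdist
import umap  # from the umap-learn package

def imad(features):
    reduced = PCA(n_components=0.95).fit_transform(features)
    coords = umap.UMAP(n_components=2).fit_transform(reduced)
    dists = pdist(coords)                          # pairwise distances
    mad = np.median(np.abs(dists - np.median(dists)))
    return 1.0 / mad                               # higher = tighter clusters
```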

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.