Abstract
Biomolecular condensates are essential cellular structures formed via biomacromolecule phase separation. Synthetic condensates allow for systematic engineering and understanding of condensate formation mechanisms and can serve as cell-mimetic platforms. Phase diagrams give comprehensive insight into phase separation behavior, but their mapping is time-consuming and labor-intensive. Here, we present an automated platform for efficiently mapping multi-dimensional condensate phase diagrams. The automated platform incorporates a pipetting system for sample formulation and an autonomous confocal microscope for particle property analysis. Active machine learning is used for iterative model improvement by learning from previous results and steering subsequent experiments towards efficient exploration of the binodal. The versatility of the pipeline is demonstrated by showcasing its ability to rapidly explore the phase behavior of various polypeptides, producing detailed and reproducible multidimensional phase diagrams. The self-driven platform also quantifies key condensate properties such as particle size, count, and volume fraction, adding functional insights to phase diagrams.
Introduction
Organization and compartmentalization are fundamental aspects of nature1. The spatial arrangement of biomolecules is essential for maintaining cellular function and facilitating metabolic processes, such as molecular transport, energy production, and structural support2,3. In this respect, biomolecular condensates have gained significant interest in recent years, as these membrane-less organelles play essential roles in compartmentalization and may contribute to the emergence of cellular complexity4,5. Condensates are phase-separated, micron-sized subcellular droplets that are formed through multivalent interactions between (macro)molecules, such as proteins and nucleic acids6. Their dynamic formation mechanism and complex biochemistry have become a topic of intensive investigation, in particular in the field of molecular and cell biology7,8.
An alternative approach to provide valuable insight into these structures is to engineer synthetic condensates in vitro—outside of the cellular environment9,10,11,12,13,14. This allows a more systematic tuning and study of the physicochemical properties of condensates15 and also enables the development of self-assembled and/or cell-mimetic platforms that can be used for the exploration of novel therapeutic strategies16,17,18,19,20,21,22. Although synthetic condensates circumvent the need to take the cell’s complexity into account, significant challenges still remain in predicting condensate formation and properties from molecular structures and in elucidating the effects of environmental factors, such as pH and ionic strength, on condensate formation and properties (as well as the underlying molecular mechanisms)23,24.
However, it quickly becomes unfeasible to manually navigate the vast combinatorial space, given that it spans diverse molecular structures and environmental factors. This process involves preparing hundreds to thousands of samples, each with precisely controlled conditions (e.g., concentration, pH, and ionic strength), followed by detailed and consistent analysis of the phase separation parameters25,26,27,28,29,30. Often, researchers are interested in measuring the binodal, the boundary in a phase diagram separating the single-phase region from the two-phase region. Beyond this boundary, the homogeneous phase separates into two distinct phases with equal chemical potentials. Identifying the binodal by collecting data across a broad range of conditions without specific guidance (e.g., based on intuition) is not only time-consuming and labor-intensive but also prone to human error, highlighting the need for automated, machine learning-driven, high-throughput methods31,32.
To address these challenges, recent innovations in high-throughput biochemical assays, microfluidics, and automated microscopy and analysis have enabled new methods to study biomolecular condensates under varied conditions33,34,35,36. Notwithstanding these advances, the vast datasets these techniques produce remain to be fully leveraged, opening opportunities to explore condensate behavior more efficiently. Integrating machine learning, particularly active learning37,38, into this field presents a valuable opportunity to enhance data-driven parameter exploration, refine predictive models, and reduce the need for extensive experimental input. Active machine learning iteratively selects the most informative data points to analyze and to steer the next iteration of experiments39,40,41,42, which makes it particularly useful for automation by reducing the amount of data and experimentation needed to achieve accurate results39,43.
In this work, we introduce an automated, high-throughput platform designed to map multi-dimensional phase diagrams of biomolecular condensates. Our platform integrates active machine learning for phase mapping optimization, an automated pipetting system for sample formulation, and an autonomous confocal microscope for high-content imaging and detailed sample characterization. Using this platform, we extensively examine the phase behavior of two well-studied polypeptides across a range of formulations. Beyond reproducibly identifying the binodal, the platform also measures particle size, particle count, and volume fraction—offering deep insight into condensate characteristics. To demonstrate the robustness of the approach, we construct higher-dimensional phase diagrams, allowing us to uncover how multiple factors influence condensate formation. The automated platform not only accelerates and standardizes phase separation behavior mapping but also enhances our understanding of environmental parameter effects on condensate properties. We expect this approach to increase the application potential of synthetic condensates as a platform for the study of their natural analogues and to engineer self-assembled cell-mimetic platforms.
Results
Closed loop navigation of coacervate formation
Condensate formulation typically involves mixing complementary components, for example, of anionic and cationic nature44, at specific speeds and durations, in a pH-controlled aqueous solution to form condensates or, more specifically, complex coacervate microdroplets (Fig. 1A). This process can be laborious, error-prone, and time-consuming, limiting current capabilities to determine detailed phase diagrams and, correspondingly, the optimal conditions for condensate formation. To date, there is no standardized protocol for producing condensates, and scientists often adhere strictly to formulation techniques that work for their specific applications45. Here, we present a generalizable, closed-loop workflow that combines automation and machine learning to (a) standardize and speed up condensate preparation and reduce handling errors, (b) provide an automated characterization approach, and (c) navigate complex coacervate phase diagrams more efficiently, thanks to machine learning predictions. The workflow is based on the following mutually interacting components:
I. Robotic sample production. Efficient, accurate, and contamination-free sample preparation is critical for exploring vast experimental spaces with diverse conditions. Our platform addresses these needs with a cost-effective and versatile robotic pipetting platform (Fig. 1B) that combines adaptable deck space, scalable reservoir options, and an open-source programming interface. These features enable high-throughput automation of condensate formulations in any pre-defined, multi-dimensional experimental space. Custom features (Fig. 1C) allow increased production rates through optimized liquid handling and prevent cross-contamination by using adaptable dispensing heights for contactless dispensing and different contact points for each liquid via a custom touch-tip functionality. Together, these features also reduce plastic consumption by allowing tip re-use for each distinct liquid.
II. Automated particle characterization. High-throughput condensate analysis requires high-throughput imaging with sufficient spatial resolution and consistent focus, which can be challenging due to heterogeneous and varying condensate sizes. In our platform, samples are transferred to a 96-well microscopy plate and imaged using an automated confocal microscope (Fig. 1D). This setup enables high-speed imaging and precise focus tracking through hardware autofocus. After formulation, condensates naturally settle over time on the glass surface, allowing for 3D reconstruction through dynamically acquired Z-stacks at four positions within each sample (Fig. 1D). This approach generates technical replicates and accounts for potential inconsistencies. The automated image analysis pipeline involves (a) applying binary thresholding to detect particles (Fig. 1E), (b) identifying the optimal Z-plane where each particle is in the best focal plane (Fig. 1F), and (c) classifying the sample as phase-separated when a threshold number of particles is observed (Fig. 1G), or as non-phase-separated otherwise (Supplementary Fig. 1). Additionally, condensate properties, such as morphology and volume fraction, are extracted for follow-up analysis and characterization.
III. Active machine learning. In our platform, the collected experimental data (Fig. 1H) are used to train a Gaussian Process Classifier (GPC), a machine learning model that leverages Bayesian probability to make predictions while accounting for uncertainty in classification decisions46. The model is trained to predict whether a pair of polypeptides at specific concentrations (optionally along with other experimental parameters) will phase-separate. The trained model is then used to predict the phase-separation behavior across the pre-defined experimental space (Fig. 1I). Based on the predictions, new experimental points are requested for the next experimental iteration (Fig. 1J). This is achieved by exploring areas of the phase diagram with high prediction uncertainty (quantified as information entropy, see Methods, Eq. 5) and through diversity-based sampling (so-called farthest point sampling47). The selected points are then produced and characterized (via steps I and II) and contribute to the next phase of model training.
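The GPC training and entropy-based candidate selection of step III can be sketched in Python with scikit-learn. The toy labels, RBF kernel, and grid resolution below are illustrative assumptions, not the settings used in this work:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# Toy labelled formulations: (cationic conc., anionic conc.) in mM -> phase label.
# Values and labels are illustrative, not the paper's data.
X_train = np.array([[0.5, 0.5], [1.0, 6.0], [6.0, 1.0], [4.0, 4.0],
                    [7.5, 7.5], [0.2, 7.0], [7.0, 0.2], [2.0, 2.5]])
y_train = np.array([0, 0, 0, 1, 1, 0, 0, 1])  # 1 = phase-separated

gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=2.0),
                                random_state=0).fit(X_train, y_train)

# Dense candidate grid spanning the design space (0.1-8.1 mM per axis).
axis = np.linspace(0.1, 8.1, 20)
grid = np.array([[a, b] for a in axis for b in axis])

# Per-point class probabilities -> information entropy as the acquisition signal.
p = gpc.predict_proba(grid)
entropy = -np.sum(p * np.log(p + 1e-12), axis=1)

# The most uncertain candidates are flagged for the next experimental batch.
n_uncertain = 40
uncertain = grid[np.argsort(entropy)[-n_uncertain:]]
```

In the full closed loop, farthest point sampling would then pick a dispersed subset of `uncertain` as the next batch to formulate and image.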
The workflow consists of three parts: (I) condensate formulation, where samples are automatically prepared, (II) confocal microscopy and sample classification, for characterization, and (III) active machine learning, which learns from the collected data and suggests the next experiments. A Condensate microparticles are formed by mixing cationic and anionic polypeptides, resulting in phase-separated micron-sized droplets. B Schematic representation of the robotic pipetting platform with 16 flexible deck slots. C Formulations are prepared in a conical PCR plate, using contactless dispensing with volume tracking. A custom touch-tip functionality follows a touch-point trajectory to ensure accurate dispensing. D Confocal imaging is performed using dynamic Z-stack acquisition. E Example segmentation of representative confocal microscopy data using automated binary Yen-thresholding for particle detection in each Z-plane. F The optimal Z-plane is selected based on the largest detected area, corresponding to the slice that is best in focus. G Samples with 12 or more particles are labeled as phase-separated (condensates), while those below the threshold are labeled as non-condensates. H Experimentally validated data points are incorporated into the machine learning algorithm for training. I The model predicts a phase diagram based on the acquired experimental data. J The model then guides the selection of new formulations, restarting the automation cycle at (A).
Thanks to this closed-loop make-analyze-predict cycle, the sample production (step I) and characterization (step II) produce data for the machine-learning-driven choice of the next experiments (step III)—this procedure is repeated until convergence. According to self-driving lab autonomy criteria, our pipeline qualifies as a Level 4 platform, since it integrates multiple hardware operations (e.g., liquid handling and imaging) with iterative, software-driven decision-making48. In this framework, the machine learning algorithm autonomously selects future experiments, and the system automatically evolves based on the newly acquired experimental data, while humans are only tasked with defining the initial search space49. This setup goes beyond traditional, trial-and-error based approaches, and it can generalize to virtually any system: once the initial search space is defined, the condensate phase behavior can be automatically explored in a self-driving manner.
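As a rough illustration of the classification rule applied during characterization (Yen thresholding, particle counting, and the 12-particle cutoff of Fig. 1G), the following sketch uses scikit-image on a synthetic plane; the minimum-area cutoff and the synthetic image parameters are assumptions for the example:

```python
import numpy as np
from skimage.filters import threshold_yen
from skimage.measure import label, regionprops

def classify_plane(img, min_area=5, particle_threshold=12):
    """Segment one confocal Z-plane and report a phase-separation call.

    Yen thresholding binarizes the fluorescence image; connected
    components above a small area cutoff are counted as particles.
    The 12-particle cutoff mirrors the rule in Fig. 1G; the area
    cutoff is an illustrative choice to reject single-pixel noise.
    """
    binary = img > threshold_yen(img)
    regions = [r for r in regionprops(label(binary)) if r.area >= min_area]
    n = len(regions)
    return n, n >= particle_threshold

# Synthetic test plane: 16 bright disks on a dim, noisy background.
rng = np.random.default_rng(0)
img = rng.normal(10.0, 1.0, (256, 256))
yy, xx = np.mgrid[0:256, 0:256]
for cy, cx in [(r, c) for r in (32, 96, 160, 224) for c in (32, 96, 160, 224)]:
    img[(yy - cy) ** 2 + (xx - cx) ** 2 <= 36] += 100.0

count, separated = classify_plane(img)
```

Running this per Z-plane and keeping the plane with the largest detected area would reproduce the best-focus selection of Fig. 1F.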
Proof-of-concept: automated construction of phase diagrams
To showcase the potential of our self-driving platform, we applied it to navigate the phase behavior of poly-L-(lysine) and poly-L-(aspartic acid) (Fig. 2), two well-investigated polypeptides in phase separation research30,50,51,52,53,54. Despite their widespread use, however, the detailed phase diagrams that capture their binodal remain underexplored, possibly owing to the need for labor-intensive experiments29,30,51. In this context, this condensate system provided a useful case study for investigating how well our automated workflow could determine its phase behavior.
A Schematic of the active machine learning pipeline used for phase diagram mapping. (I) The initial sample points are selected using farthest point sampling to ensure broad coverage of the design space. (II) A Gaussian Process Classifier is trained on the data, generating a preliminary phase diagram. (III) An uncertainty landscape is computed, highlighting regions with the highest uncertainty. From these regions, new points are sampled using farthest point sampling. (IV) The selected samples are experimentally validated and added to the dataset, refining the phase diagram prediction. (V) Steps I–IV are repeated until convergence is achieved. B Representative confocal micrographs for the first eight experimentally validated samples (scale bar = 20 µm). C The predicted phase diagram for poly-L-(aspartic acid)200 and poly-L-(lysine)100 based on the validated samples in (B). Phase separation is represented by blue points and no separation by red points, with the surface depicting the model’s predictions. D The entropy landscape is constructed based on the prediction in (C), and new samples (white points) are selected using farthest point sampling in the high entropy region of the landscape. The requested points are experimentally classified, and a new phase diagram is predicted from the combined data (E), along with its associated entropy landscape (F). G Subsequent iterations continue until 72 data points are acquired. Phase boundaries are indicated by dotted lines; in some cases, these may be partially obscured by overlapping contour lines. Total polypeptide consumption: 4.0 mg poly-L-(aspartic acid)200 and 4.2 mg poly-L-(lysine)100.
Our experiments followed the make-analyze-predict workflow, as follows:
1. Initialization step (Fig. 2A). We constructed an experimental design space, ranging from 0.1 to 8.1 mM monomer concentration for each polypeptide. As a starting point we used poly-L-(lysine)100 and poly-L-(aspartic acid)200. Eight points for the experimental formulation and characterization were selected by the farthest point sampling algorithm47, which starts from a randomly selected point, and then chooses maximally dispersed samples across the design space.
2. Automated sample production and characterization. The chosen samples were then formulated and characterized experimentally for their phase separation (Fig. 2B, Supplementary Fig. 2). Based on their phase separation behavior, they were labeled as either ‘condensate’ or ‘non-condensate’ for training the machine learning model.
3. Model training and experiment selection. The experimentally determined labels were used to train the model and predict the coacervate behavior across the design space (Fig. 2C). In particular, the GPC algorithm generates a new phase diagram prediction across the design space. The probabilistic nature of the GPC prediction allows us to compute an uncertainty measure per class, which we leverage in the form of the entropy of the class probabilities (the higher the entropy, the higher the uncertainty across the classes; see “Methods”, Eq. (4)). Once the points within the highest-uncertainty regions are identified, farthest point sampling again selects the next batch of points for production and characterization (Fig. 2D).
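Farthest point sampling itself, used both for the initialization batch and for spreading points within the high-uncertainty regions, is straightforward to implement. A minimal NumPy version over the 0.1–8.1 mM design grid might look as follows; the greedy variant below is a common formulation and an assumption about the exact algorithm used:

```python
import numpy as np

def farthest_point_sampling(candidates, n_samples, rng=None):
    """Greedy farthest point sampling: start from a random candidate,
    then repeatedly pick the point farthest from all points chosen so far."""
    rng = np.random.default_rng(rng)
    chosen = [int(rng.integers(len(candidates)))]
    # Distance from every candidate to its nearest already-chosen point.
    d = np.linalg.norm(candidates - candidates[chosen[0]], axis=1)
    for _ in range(n_samples - 1):
        nxt = int(np.argmax(d))
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(candidates - candidates[nxt], axis=1))
    return candidates[chosen]

# An 81 x 81 grid over 0.1-8.1 mM (the 6561-point design space used here);
# eight maximally dispersed starting formulations, as in the initialization step.
axis = np.linspace(0.1, 8.1, 81)
grid = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(-1, 2)
initial_batch = farthest_point_sampling(grid, 8, rng=0)
```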
After the initialization (step 1), steps 2–3 were iteratively repeated, adding the new experimental labels to the training dataset and updating both the phase (Fig. 2E) and uncertainty (Fig. 2F) landscapes for the next cycle. This active learning process continued until a total of 72 samples were measured across nine cycles (Fig. 2A, G).
After approximately 40 samples (five iterations), only minor changes were observed in the predicted phases, suggesting that the model started to stabilize. Collecting a total of 72 samples further reduced the uncertainty of the predicted phase boundaries (Supplementary Figs. 3–5). To minimize nonspecific surface interactions, all experiments were performed in BSA-coated plates, which we verified to have no detectable influence on phase behavior (Supplementary Fig. 6). The automated exploration of the phase diagram was carried out in approximately four hours, whereas conducting these experiments manually would have required more than one week. Additionally, the active learning approach generated a detailed phase diagram, a result that would otherwise have required the intuition of an experienced scientist, and potentially many more data points, to achieve manually (Supplementary Fig. 7). As a further validation, we tested the pipeline on a synthetic phase diagram containing multiple isolated negative phases (Supplementary Fig. 8). The model successfully identified these hidden regions, demonstrating its flexibility and robustness in navigating complex phase landscapes. Together, these results highlight the platform’s effectiveness in reducing time and guiding experimental efforts toward the most relevant areas.
Notably, the resulting shape of the phase diagram is consistent with the physical principles underlying condensate formation. In associative LLPS, droplet formation is driven by multivalent electrostatic interactions and the release of bound solvent molecules, which together must outweigh the entropic cost of reduced chain flexibility55,56. These conditions are optimally met near stoichiometric charge ratios, where attractions between the polypeptides are maximized. When one component is in excess or depleted, the resulting charge imbalance and electrostatic screening hinder the formation of an extended interaction network, thereby suppressing phase separation as reflected in the mapped binodal57. Additionally, turbidity and DLS measurements support the observed phase boundary (Supplementary Fig. 9), although DLS also detected the formation of nanometer-scale assemblies below the resolution limit of confocal microscopy.
Convergence of condensate phase mapping
A desirable feature when automatically mapping phase diagrams is convergence to a unified, reproducible final result regardless of the selection of starting points. In fact, while the design space available for selecting experimental conditions is vast (in this study, a grid of 6561 points), the models are trained in a low-data regime (up to 72 datapoints), which raises questions about how well the underlying patterns and trends are captured58. Moreover, given the iterative nature of the approach, initial decisions (e.g., starting points for training) and automation-related challenges (e.g., equipment inconsistencies) might affect decisions in later cycles. To shed light on this key question, we performed three independent replicates using the poly-L-(lysine)100 and poly-L-(aspartic acid)200 system, such that each replicate was carried out identically (as explained above) but starting from a unique, non-overlapping initial set (step 1) for model training (Fig. 3A).
A Schematic illustration of the experimental workflow used to produce replicated phase diagrams. The initial set of experimental samples is selected by farthest point sampling, resulting in different starting points for each replicate. Each subsequent cycle then follows a unique path to reach the same phase diagram. Reproducibility across replicates is expected only if the machine learning, formulation, and analysis steps are consistent. B Phase diagrams, showing phase separation in blue and no separation in red, with probability prediction fits (background), validated points (colored dots), and new sample selections (white dots) for three experimental replicates of poly-L-(lysine)100 and poly-L-(aspartic acid)200 condensates. A total of 72 data points was experimentally validated across 9 cycles. Representative cycles are shown; remaining cycles and entropy maps are reported in Supplementary Figs. 4, 5 and Supplementary Figs. 10–13. C Balanced accuracy plot showing the accuracy of the prediction for each successive cycle with respect to a “ground truth” phase diagram. Cycle 0 represents the balanced accuracy computed with respect to a randomly generated phase diagram as a baseline comparison. D Average Jensen-Shannon Divergence plot illustrating within-experiment divergence by comparing consecutive cycles for each replicate. This reflects the progressive convergence toward the final phase diagram for each replicate. Cycle 0 is a random phase diagram included as a reference for low similarity. E Average Jensen-Shannon Divergence plot comparing divergence across replicates at each cycle, highlighting inter-experiment variation. Cycle 0 compares three random phase diagrams and is included as a reference for low similarity. Phase boundaries are indicated by dotted lines; in some cases, these may be partially obscured by overlapping contour lines. Polypeptide consumption per phase diagram: 4.0 mg poly-L-(aspartic acid)200 and 4.2–4.8 mg poly-L-(lysine)100.
Since the starting sets differed substantially across replicates, they resulted in different phase and uncertainty landscapes in early cycles (Fig. 3B, Supplementary Figs. 10–13). While each run followed its own ‘prediction route’ across cycles, after approximately 40 samples (cycle number 5), the phase diagrams appeared to converge across the replicates. After collecting 72 samples (cycle number 9), the replicate phase diagrams displayed remarkable similarity and low uncertainty levels.
To further assess the reproducibility of our experiments, we constructed a “ground truth” phase diagram (Supplementary Fig. 14) using all data collected across replicates (Supplementary Figs. 15–20). We quantified the agreement between each replicate’s predictions (at each cycle) and the ground truth via balanced accuracy (the higher, the more similar the predictions; see “Methods”, Eq. (7))59. Across replicates, the balanced accuracy steadily increased over successive cycles (Fig. 3C), which is especially visible from the fifth cycle onwards, where balanced accuracy reached values consistently above 95% across all replicates. This indicates that, no matter the starting point, all replicates converge to a similar phase diagram in a data-efficient way (i.e., by using substantially less data than the “ground truth” diagram). These results agree with existing active learning literature39,58,60, showing the potential of this approach to progressively mitigate the effect of the starting data.
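Balanced accuracy averages per-class recall, so the typically larger single-phase region cannot dominate the score on an imbalanced phase diagram. It can be computed with scikit-learn; the labels below are illustrative, not the paper's data:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Illustrative labels: "ground truth" phase calls vs. one cycle's predictions
# over eight grid points (1 = phase-separated, 0 = not separated).
truth = np.array([1, 1, 1, 1, 1, 1, 0, 0])
pred = np.array([1, 1, 1, 1, 1, 0, 0, 1])

# Recall(separated) = 5/6, recall(not separated) = 1/2 -> mean = 2/3.
score = balanced_accuracy_score(truth, pred)
```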
The Jensen-Shannon divergence61 (see “Methods”, Eqs. (8) and (9)) was computed to directly compare phase diagrams (the lower the divergence, the more similar). A “within-replicate” divergence was calculated by comparing the predicted probabilities of each replicate across consecutive cycles (Fig. 3D). The results showed an exponential decrease in divergence values, with substantial changes in the predicted phase diagrams within the first 32 samples (cycle 4) and minimal changes after 56 samples (cycle 7), suggesting that each phase diagram reached a ‘stable’ state, where additional experiments did not significantly alter predictions. Moreover, we calculated a “between-replicate” divergence (Fig. 3E) by comparing the predictions of each cycle across different replicates. The divergence values decreased sharply during the first three cycles, after which they stabilized. These results indicate that only three cycles were necessary to mitigate the stochastic differences caused by the different starting points, after which the replicates progressively aligned along a common trajectory.
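A minimal sketch of this divergence computation treats each grid point's predicted phase-separation probability as a two-class distribution and averages the per-point divergences over the grid; this is one plausible reading of Eqs. (8) and (9), not necessarily the paper's exact averaging scheme:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Average Jensen-Shannon divergence (in bits) between two probability maps.

    p, q: arrays of predicted phase-separation probabilities over the same
    design-space grid. Each grid point contributes the divergence between
    its two Bernoulli distributions; the result is averaged over the grid.
    """
    p = np.stack([p, 1.0 - p], axis=-1) + eps
    q = np.stack([q, 1.0 - q], axis=-1) + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b), axis=-1)
    return float(np.mean(0.5 * kl(p, m) + 0.5 * kl(q, m)))

# Identical maps diverge by ~0; confidently opposite maps approach 1 bit.
grid_a = np.full(100, 0.9)
grid_b = np.full(100, 0.1)
```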
Mapping condensate properties via phase diagram exploration
Traditional studies on phase separation behavior have primarily focused on determining whether condensates form under specific conditions27,28,30,34,62. Our automated data production and characterization pipeline collects additional information beyond phase separation. In particular, depth-resolved imaging from confocal microscopy allows us to derive several properties of condensates, including particle count, morphology, and volume fraction, within the phase diagram. Here we compounded the data from the previously described replicates, along with data obtained from optimization experiments, totaling 480 experimentally determined samples (Fig. 4A, Supplementary Fig. 14). The collected samples show a wide range of particle morphologies, ranging from densely packed condensate clusters to tiny, barely visible particles (Fig. 4B). This diversity highlights the variability in condensate formation even within a single “simple” phase, underscoring the necessity of collecting a broader set of properties to gain a deeper understanding of phase behavior.
A Combined 2D phase diagram of poly-L-(lysine)100 and poly-L-(aspartic acid)200, based on 480 validated data points. The data is compiled from six independent experiments, represented by varying shades of blue (indicating phase separation) and red (indicating no phase separation). B Representative confocal micrographs that illustrate the wide range of observed condensate phenotypes (scale bar = 20 µm). Each image corresponds to a different sample condition; micrographs are shown to demonstrate morphological diversity rather than biological replicates. (C–E) Quantification of condensate properties overlaid on the phase diagram in (A): (C) number of detected condensates, (D) average particle area, and (E) apparent volume fraction, estimated from particle count and size. White points indicate conditions where no particles were detected. Total polypeptide consumption for 480 samples: 26.4 mg poly-L-(aspartic acid)200 and 29.8 mg poly-L-(lysine)100.
Here, we focused on the following condensate properties: (a) number of detected condensates, (b) average particle area, and (c) total volume fraction, extrapolated by combining particle counts and area. These properties were mapped onto the compounded phase diagram, and all of them showed evident trends across the experimentally determined space. Low particle counts were observable near phase boundaries, for example, while the count increased as both polypeptide concentrations increased (Fig. 4C). Particle size showed a similar trend, with larger condensates forming at higher concentrations (Fig. 4D). Some regions showed fewer but larger particles, suggesting potential fusion (coalescence) of condensates due to surface saturation. Volume fractions were lower near the binodal and higher toward the inner part of the phase-separated region (Fig. 4E), in line with complementary Nanoparticle Tracking Analysis (NTA) measurements (Supplementary Fig. 21).
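One simple way to extrapolate an apparent volume fraction from particle counts and 2D areas is to assume spherical droplets whose equatorial cross-section equals the segmented area. Both the spherical assumption and the imaging-volume figure below are illustrative choices, not necessarily the exact procedure used here:

```python
import numpy as np

def apparent_volume_fraction(areas_um2, field_volume_um3):
    """Estimate an apparent condensate volume fraction from 2D particle areas.

    Assumes each detected particle is a sphere whose equatorial cross-section
    equals its segmented area: r = sqrt(A / pi), V = (4/3) * pi * r^3.
    The summed droplet volumes are divided by the imaged field volume.
    """
    radii = np.sqrt(np.asarray(areas_um2, dtype=float) / np.pi)
    droplet_volumes = (4.0 / 3.0) * np.pi * radii ** 3
    return float(droplet_volumes.sum() / field_volume_um3)

# Example: 50 droplets of ~12 um^2 area in a 200 x 200 x 20 um imaging volume.
phi = apparent_volume_fraction(np.full(50, 12.0), 200 * 200 * 20)
```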
Notably, our property measurements represent standardized snapshot observations taken after a 15-min incubation period of a dynamic, continuously evolving system. While the incubation time can be adjusted to suit the user’s objective or extended to enable kinetic measurements (Supplementary Figs. 22–24), we selected 15 min as a practical and reproducible readout, based on DLS and turbidity data showing that key features for identifying phase separation begin to stabilize around this time (Supplementary Fig. 25). This approach enables systematic mapping of phenotypic variations across the phase diagram, offering insights into condensate behavior beyond the binary presence or absence of phase separation. To further extend the platform’s scope, we also performed preliminary measurements of dense-phase concentrations (Supplementary Fig. 26), providing a basis for future composition-dependent analyses that could approximate partitioning coefficients and the associated thermodynamic driving forces63. By enabling broad, quantitative screening, these metrics lay the groundwork for deeper, targeted analyses using advanced biophysical tools (e.g., FRAP, microrheology, or optical trapping) to probe material properties such as viscosity, dynamics, or mechanical behavior. Collectively, these quantitative descriptors may provide a basis for integration with theoretical models of phase behavior, while also enabling more informed and targeted formulation strategies, particularly in regions of the experimental space where specific material properties rather than phase separation alone are critical for function or downstream application64,65,66.
Identifying structure-separation relationships with automation
To further extend the applicability of our workflow, we applied it to elucidate how polypeptide chain length affects phase separation. We constructed phase diagrams for nine combinations of poly-L-(lysine) and poly-L-(aspartic acid) polypeptides, each differing in chain length (poly-L-(lysine)n: n = 20, 100, 250; poly-L-(aspartic acid)n: n = 30, 100, 200) but with constant overall monomer concentrations. All combinations exhibited phase separation within the tested experimental space (Fig. 5, Supplementary Figs. 27–42). However, although these polypeptides share the same structural monomeric unit, their phase behavior, as well as their properties (Supplementary Figs. 43–45), varied considerably. The machine-learning-guided exploration of these phase diagrams was carried out in approximately one week, whereas conducting these experiments manually, guided by intuition alone, would have been extremely challenging and labor-intensive.
This figure displays nine phase diagrams illustrating the automated mapping of phase separation for combinations of poly-L-(lysine) and poly-L-(aspartic acid) with varying chain lengths. Panels (A–C) show phase diagrams for poly-L-(lysine) with a chain length of 20, combined with poly-L-(aspartic acid) of lengths 30 (A), 100 (B), and 200 (C). Panels (D–F) show poly-L-(lysine) with a chain length of 100, paired with poly-L-(aspartic acid) lengths of 30 (D), 100 (E), and 200 (F). Panels (G–I) depict poly-L-(lysine) with a chain length of 250, combined with poly-L-(aspartic acid) lengths of 30 (G), 100 (H), and 200 (I). Datapoints are marked as dots, with blue indicating phase separation and red indicating no phase separation. Each phase map includes a background color gradient derived from predictions based on 72 datapoints per combination, acquired over nine cycles of eight datapoints. Remaining cycles and entropy maps are reported in Supplementary Figs. 27–42. Phase boundaries are indicated by dotted lines; in some cases, these may be partially obscured by overlapping contour lines. Total polypeptide consumption per phase diagram: 4.0–4.9 mg poly-L-(aspartic acid) and 3.9–5.4 mg poly-L-(lysine).
Notably, even with the more complex and curved diagrams of some of the combinations, we successfully identified well-defined phase boundaries within 72 samples (9 cycles) for all tested conditions. Generally, increasing the length of one polypeptide while keeping the length of the other polypeptide constant enabled phase separation at lower concentrations for the elongated polypeptide, but it required higher concentrations of the fixed-length polypeptide, as visible, for instance, in the case of poly-L-(lysine)20 (Fig. 5A–C). Similarly, when the poly-L-(lysine) length increased from 20 to 100 or 250 repeats (Fig. 5D–I), while maintaining a constant poly-L-(aspartic acid) length, phase separation occurred at lower lysine concentrations, but required higher concentrations of poly-L-(aspartic acid).
These results highlight the delicate balance required in designing polypeptide systems for phase separation. Simply increasing the concentration or length of one polypeptide does not necessarily lead to enhanced phase separation; instead, the process is highly sensitive to the interplay between both polypeptides. Our findings indicate that an optimal balance exists at equal chain lengths of 100 repeats (Fig. 5E), where phase separation occurs extensively across most of the investigated chemical space. In some cases, particularly with poly-L-(lysine)250, phase boundaries showed slight bends, suggesting complex, non-linear dynamics. These complexities highlight the challenges in controlling and predicting condensate formation, as even minor adjustments at the molecular level can lead to pronounced changes in phase behavior.
Navigating phase behavior in complex environments
Building upon these results, we increased experimental complexity by introducing salt (NaCl) as an additional dimension to our system. Salts modulate electrostatic interactions between charged polypeptides and thereby significantly influence condensate phase behavior and properties24,67. This expansion increased the potential experimental space from 6,561 points (two dimensions) to 531,441 points (three dimensions). To evaluate the platform’s performance, we performed two independent replicates using the poly-L-(lysine)100 and poly-L-(aspartic acid)200 system for 20 active learning cycles with 32 samples each (640 measured points per replicate; Supplementary Figs. 46–47). These newly acquired points were combined with previous data to construct a comprehensive 3D “ground truth” phase diagram (Fig. 6A, B). As anticipated, salt greatly influenced condensate formation, promoting phase separation at moderate concentrations (150–700 mM), while disrupting it at higher concentrations (1200–1300 mM)45. Interestingly, some phase-separated regions at higher salt concentrations were identified (Fig. 6B, 270° rotation), which result from salt-induced aggregate phases (Supplementary Fig. 48). As our current analysis detects any contiguous fluorescent signal above the pixel threshold, these aggregates were classified as ‘phase separated’, regardless of internal structure or material state.
A Two independent experiments (Supplementary Figs. 46–47) were conducted to explore the effect of salt (NaCl) on the phase behavior of poly-L-(lysine)100 and poly-L-(aspartic acid)200. The combined dataset, comprising 1760 datapoints, was used to construct the “ground truth” three-dimensional phase diagram reported here. Iso-probability surfaces indicate phase separation (blue, higher opacity) and no phase separation (red, lower opacity). B Four distinct orientations of the phase diagram with non-transparent surfaces are shown to emphasize phase behavior from different perspectives. C Balanced accuracy plot showing the accuracy of the prediction for each successive cycle with respect to the “ground truth” phase diagram in panel A. Cycle 0 represents the balanced accuracy computed with respect to a randomly generated phase diagram as a baseline comparison. D Within-experiment Jensen-Shannon Divergence (JSD) plotted across cycles. This metric tracks convergence by comparing consecutive cycles, illustrating how each replicate approaches the final phase diagram. Cycle 0 reflects divergence from a randomly generated phase diagram. E Between-experiment Jensen-Shannon Divergence (JSD) across replicates at each cycle. Similar to panel D, Cycle 0 serves as a baseline, representing divergence from a randomly generated phase diagram. Total polypeptide consumption for 1280 samples: 85.1 mg poly-L-(aspartic acid) and 82.1 mg poly-L-(lysine).
To assess the pipeline’s reproducibility and performance, we again calculated the balanced accuracy59 (see “Methods”, Eq. (7), Fig. 6C) and within- and between-replicate Jensen-Shannon divergence61 (see “Methods”, Eqs. (8) and (9), Fig. 6D, E). As anticipated, all metrics showed consistent improvements across cycles and rapid convergence toward the global phase diagram, with stabilization occurring after approximately eight cycles (256 samples). Notably, these metrics effectively captured the overall progression in identifying phase behavior but may be less sensitive to minor changes in the large design space during the later stages of optimization. Nonetheless, the balanced accuracy continued to improve slightly in subsequent cycles, primarily enhancing the resolution around the phase boundaries (white areas in Fig. 6A, B; Supplementary Figs. 46–47).
Increasing dimensionality introduces challenges, both for machine learning algorithms and due to the formation of distinct aggregate phases. Despite these challenges, we successfully mapped these 3D phase diagrams in just three days. These results not only demonstrate the platform’s capability to rapidly explore vast and complex design spaces but also highlight the essential role of machine learning in effectively navigating and elucidating such high-dimensional complex assemblies (Supplementary Fig. 7). To accommodate complex chemical spaces, users can flexibly adjust the resolution of the search space depending on their objectives. This is supported by simulations with down-sampled grids, which showed that early-cycle performance remains robust in 2D and 3D even with substantially reduced design spaces (Supplementary Fig. 49).
Discussion
In this work, we presented a versatile, machine learning-driven automated platform that rapidly navigates multi-dimensional phase diagrams of condensates. By integrating (a) active machine learning to optimize sample selection and phase diagram navigation, (b) automated pipetting for precise sample formulation, and (c) advanced and automated confocal microscopy for high-content particle characterization, we examined the phase behavior of polypeptides across various formulations and concentration profiles. Our platform reliably and rapidly identified phase boundaries with high accuracy and reproducibility, demonstrating the robustness of our approach. Additionally, it quantified key condensate properties, such as morphology, particle count, and volume fraction, providing insights beyond the traditional binary classifications of phase separation. Moreover, the platform’s flexibility enabled rapid exploration of complex phase spaces, allowing us to reveal the influence of polypeptide chain length and salt on phase behavior.
Looking forward, numerous opportunities exist to further enhance our platform’s capabilities and broaden its applications. By refining the sampling strategies (e.g., by balancing exploration of uncertain regions with exploitation of high-certainty points), the efficiency of phase diagram navigation could be further improved68. Furthermore, integrating robotics to enhance platform autonomy69,70 and leveraging machine learning for advanced image analysis can significantly improve condensate classification71. Moreover, integrating condensate properties into active machine learning algorithms will allow us to incorporate desirable particle properties in the decision-making process, supporting the design of biomaterials for applications such as drug delivery and tissue engineering. In the future, incorporating molecular information into machine learning models (e.g., via deep learning72,73) will enable linking molecular structure with phase behavior, extending beyond the training sets74. Finally, the platform’s modularity and adaptability make it generalizable to other complex micron-sized assemblies, such as tactoids75 and microgels76. Looking ahead, complementary tools such as fluorescence recovery after photobleaching (FRAP), microrheology, partitioning measurements, or the use of structure-sensitive dyes (e.g., ThT, Amytracker) could be incorporated to quantify condensate diffusivity, viscosity, molecular enrichment, or the presence of β-structured aggregates, further deepening functional insights77,78. This versatility also opens up opportunities to explore how minor structural and compositional changes in natural proteins, resulting from processes like splicing, mutations, and post-translational modifications, influence condensate behavior, offering valuable insights into phase separation principles under diverse conditions79.
Methods
Preparation and dye labeling of polypeptides
All polypeptides used in this study were purchased from Alamanda Polymers. They were dissolved in fresh Milli-Q water (MQ) at 25 mg/mL, then sterile-filtered through a 0.2 µm filter, and stored in aliquots at –20 °C. Further dilutions were prepared in MQ, with stock solutions maintained at 4 °C.
A portion of the poly-L-(lysine) polypeptides was labeled with NHS-Sulfo-Cy5 dye (Lumiprobe) for confocal imaging. The dye was dissolved in DMSO at a concentration of 10 mg/mL and stored at –20 °C. Poly-L-(lysine) was labeled in a reaction buffer consisting of 100 mM HEPES (pH 8.0) and 150 mM NaCl in MQ. The polymer-to-dye ratios were 1:3 for poly-L-(lysine) with a chain length of 100 and 1:6 for chain lengths of 20 and 250. The reaction was carried out for two hours at room temperature while shaking at 550 rpm using an Eppendorf MixMate.
Unbound dye was removed using a PD Minitrap G-25 size exclusion column (Cytiva), which was pre-equilibrated with a storage buffer of 25 mM HEPES (pH 7.4) and 100 mM NaCl. Labeling was, if possible, further confirmed by analyzing the flow-through of the dye-labeled polymer after centrifugation with a 3 kDa spin filter (Amicon). All polypeptides were freeze-dried, weighed, and dissolved in MQ. The dye concentration was determined using a nanodrop spectrophotometer (Thermo Scientific NanoDrop 1000). This measurement, combined with the dry weight of the polypeptide, allowed for the calculation of the Degree of Labeling (DoL). The final dye-labeled polypeptides were sterile-filtered (0.2 μm) and stored at –20 °C, with additional dilutions prepared in MQ and maintained at 4 °C.
Preparation of microscopy plates
For confocal imaging, black 96-well glass-bottom microscopy plates (Cellvis, 1.5, P96-1.5H-N) were used and the glass surface was passivated to prevent wetting of the condensates. To prepare the surface coating, bovine serum albumin (BSA) was dissolved in MQ at 30 mg/mL and then sterile-filtered through a 0.2 µm filter. A volume of 100 µL of this BSA solution was added to each well. The plates were placed on a MixMate shaker (Eppendorf) and incubated at 500 rpm for 60 min at room temperature. After incubation, the BSA solution was discarded, and each well was rinsed three times with 100 µL of MQ water. The plates were then dried overnight, covered with a Kimwipe, and stored at room temperature under a protective cover until use.
Data architecture and general automation workflow
All devices were integrated within a local network and regulated through a central orchestrator workstation, which served as the control hub for the entire platform. The orchestrator contains all necessary protocols and information, and coordinates all device actions and data exchange. Communication with platform components was achieved through USB connections and a local Ethernet network, using TCP-based network communication protocols such as SSH and HTTP.
A centralized data architecture was implemented to manage knowledge transfer between instruments. This architecture included a structured folder system on the orchestrator workstation for organizing Python protocols, instrument logs, raw data storage, and dedicated information transfer files. These information transfer files, detailed below, contained specific instructions for each device—often generated through machine-learning algorithms—and were sent from the orchestrator workstation to individual components. Each device passively listened to the orchestrator workstation, which assigned tasks and actions directly. Devices executed only the actions directed by the orchestrator, forming a streamlined, centralized data workflow across the platform.
Master file
Central, continuously updated database for all sample details, conditions, and results. It logs sample locations and barcoded plates, directing sample creation, handling, and analysis. A versioned copy was made before each update to maintain data integrity.
Barcode file
Output from machine learning, which is cross-referenced with the Master File to identify samples to be processed.
Batch file
Contains a detailed description of polypeptide stocks, including date, version, and degree of labeling for dye-labeled polypeptides. It is essential for calculating component volumes in sample preparation.
Source file
Tracks materials that are stored in a 96-well plate, including their concentrations and volumes. This file was updated after each pipetting step and versioned once per automation cycle to support accurate records.
This system operated in a closed-loop workflow, where each action depended on information from previous steps, all coordinated by the central orchestrator workstation. The workflow began with the machine learning model, which assessed the chemical space and determined the next set of samples to be measured. It appended these new sample conditions to the Master File and created a matching Barcode File. Next, the pipetting platform used information from the Master, Source, and Batch Files to calculate the required volumes and assign target locations for each sample. During sample preparation, the Source File was updated after each pipetting step to keep track of remaining volumes. Once the samples were prepared, their locations were added to the Master File. The microscope then cross-referenced the Barcode File with the updated Master File to find sample locations and imaging coordinates. It automatically acquired and processed confocal micrographs and added the classification results to the Master File. Finally, the machine learning model retrieved these updated classifications, incorporated them into the chemical space, and initiated the next cycle of experiments.
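As a toy illustration of this closed-loop data flow (not the platform’s actual code), the cycle can be modeled with an in-memory dictionary playing the role of the Master File; the selection and measurement functions below are illustrative stand-ins for the active learning, pipetting, and imaging steps.

```python
import random

def select_next_samples(pool, n):
    # Stand-in for the active-learning selection step (illustrative only).
    return random.sample(sorted(pool), n)

def measure(sample):
    # Stand-in for pipetting, imaging, and classification (illustrative only).
    return sample % 2  # dummy binary phase label

def closed_loop(design_space, n_cycles, batch_size):
    master = {}  # plays the role of the Master File: sample -> classification
    for _ in range(n_cycles):
        untested = design_space - master.keys()
        for s in select_next_samples(untested, batch_size):
            master[s] = measure(s)  # classifications appended after each cycle
    return master
```

Each iteration mirrors one automation cycle: new conditions are chosen from the untested pool, processed, and their results fed back before the next selection.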
Automated sample preparation
Instrument setup and configuration
Samples were prepared automatically using an Opentrons Flex pipetting robot equipped with both single- and 8-channel pipettes (5–1000 µL) and 200 µL tips. The deck was configured as follows: 200 µL tip rack in slot B1; 195 mL NEST reservoir filled with MQ in slot C2; Heater-shaker module (Gen 1) with a PCR adapter plate and either a NEST 96-well PCR plate for 2D phase diagrams or an Opentrons Tough 96-well PCR plate for 3D phase diagrams in slot D1; 2 mL 96-well deep-well plate (NEST) containing stock solutions in slot D2; waste chute in slot D3; and a 96-well microscopy plate (Cellvis) in slot C3.
Pipette offset calibration
The Flex platform was calibrated for height and x/y offsets, following the manufacturer’s guidelines.
Source plate setup
Stock solutions of HEPES, NaCl, and polypeptides (labeled and unlabeled) were preloaded in the source plate (D2). The robot tracked and updated each well’s volume (see Data Architecture and Workflow), prompting refills to bring wells up to 1800 µL when volumes dropped below 200 µL.
Liquid handling
Reagents were dispensed sequentially to achieve a final volume of 150 µL per PCR well: MQ water, HEPES buffer (50 mM, pH 7.4), NaCl (150 mM for 2D or 25–2050 mM for 3D diagrams), dye-labeled poly-L-(lysine) (96–250 nM), unlabeled poly-L-(lysine), and poly-L-(aspartic acid) (0.1–8.1 mM monomer concentration). Final calculations accounted for any additional monomers introduced by the dye-labeled poly-L-(lysine) to ensure accurate concentrations. The same tip was used for multi-dispensing reagents, with new tips used for each aspiration step (except MQ).
Mixing
From NaCl addition onward, samples were mixed (1500 rpm) for 10 seconds. After the final component (poly-L-(aspartic acid)), samples were mixed (1500 rpm) for 5 minutes to promote phase separation.
Custom dispensing technique
To improve accuracy, transfers used a minimum of 10 µL, leaving 5 µL of residual volume in the tip. Prior to dispensing, dynamic volume tracking adjusted the pipette height based on the anticipated liquid volume in the well. Dispensing occurred just above the liquid surface to ensure that any residual droplets hanging from the tip were reliably released into the bulk solution (Fig. 1C, side view). After dispensing, a custom touch-tip function guided the pipette to contact specific points along the well wall at the same height to remove remaining droplets. Unlike the default four-point Opentrons routine, our implementation dynamically adjusted both the number and location of contact points based on the number of distinct liquids added to each well (Fig. 1C, top view). As additional liquids were dispensed, new touch points were assigned at progressively higher vertical positions, forming a gentle upward spiral (Fig. 1C, touch point trajectory). This approach enabled accurate multi-liquid dispensing with a single tip, minimized cross-contamination risk, and maintained spatial separation between touch locations regardless of the number of liquids used.
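The spiral touch-point geometry described above can be sketched with plain trigonometry; the well radius, starting height, and vertical step below are illustrative assumptions, not the protocol’s actual values.

```python
import math

def spiral_touch_points(n_liquids, radius_mm=3.4, z0_mm=2.0, dz_mm=0.5):
    """One wall-contact point per dispensed liquid, spread around the well
    circumference and rising with each liquid (illustrative dimensions)."""
    points = []
    for i in range(n_liquids):
        theta = 2.0 * math.pi * i / n_liquids  # spread contacts around the wall
        points.append((radius_mm * math.cos(theta),
                       radius_mm * math.sin(theta),
                       z0_mm + i * dz_mm))      # progressively higher contact
    return points
```

Spreading the contacts angularly while raising them keeps touch locations spatially separated regardless of how many liquids are dispensed with the same tip.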
Final transfer for imaging
After preparation, 100 µL of each sample was transferred to the imaging plate (C3), which was then sealed with an adhesive aluminum foil seal (ThermoFisher) for confocal imaging. A new tip was used for each well, with samples mixed three times by aspiration/dispensing before transfer. Samples were incubated for 15 minutes before analysis, unless indicated otherwise.
Automated confocal microscopy
Confocal microscopy setup and hardware configuration
Imaging was conducted on a custom confocal setup integrated by Confocal NL. The microscope consisted of an open-frame inverted microscope (Zaber), with a Confocal NL line re-scan system (NL5+) mounted on the left-side camera port. Additionally, the microscope was equipped with a motorized filter wheel (Confocal NL) and a laser autofocus module (Zaber). The NL5+ unit was equipped with an sCMOS camera (Teledyne Photometrics BSI Express), providing a large field of view of 18.8 mm (diagonal). Laser excitation from an Oxxius L4Cc laser diode combiner (containing a 638 nm laser) was coupled to the NL5+ module via an optical fiber. All experiments were conducted using a laser power of 7% and a 60x air objective (Nikon, NA 0.95).
Software and connections
All components were controlled via Python. Specifically, pycromanager interacted with Micro-Manager (version 2.0.3) to control laser powers, Z-stacks, and XY positioning. Additionally, the zaber_motion library was connected to the Zaber Launcher (version 2024.11.14) to control the autofocus device.
Automated autofocus adjustments
The autofocus loop involved several steps. First, the objective was directed to a preset Z-position, bringing the autofocus laser within range for the first autofocus attempt. The autofocus was then triggered, aligning the objective with the bottom of the imaging plate. This in-focus height was recorded and served as a reference for the next autofocus loop. After acquisition (see below), each subsequent autofocus routine started 10 µm below the previously recorded focal plane, searching upward to locate the plate bottom.
XY Positioning and Image Acquisition
The 96-well microscopy plate was mapped into 2 × 2 grids (550 µm spacing), creating technical replicates within each well. A well-specific event list was created, associating each well with the correct sample barcodes, coordinates, grid locations, and channel information. The scanning algorithm employed a snake pattern, optimizing acquisition time by minimizing travel distance and positional drift across the microscopy plate. Following autofocus, Z-stacks were captured as height additions on top of the recorded autofocus height, using dynamic spacing: a fine 0.5 µm step for the first 5 µm, increasing to 1.0 µm for the next 5 µm, then 2.5 µm for the following 5 µm, and finally 5.0 µm for deeper layers, spanning a total of 50 µm of Z-depth per position. Acquisitions were performed at 5 frames per second.
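The dynamic Z-spacing described above can be reproduced as an explicit list of offsets; the step sizes and depth intervals are taken from the text, while the helper function itself is our sketch.

```python
import numpy as np

def z_offsets(total_depth_um=50.0):
    """Z offsets (in µm) above the autofocus plane, with dynamic spacing."""
    fine   = np.arange(0.0, 5.0, 0.5)    # 0.5 µm steps for the first 5 µm
    medium = np.arange(5.0, 10.0, 1.0)   # 1.0 µm steps for the next 5 µm
    coarse = np.arange(10.0, 15.0, 2.5)  # 2.5 µm steps for the next 5 µm
    deep   = np.arange(15.0, total_depth_um + 1e-9, 5.0)  # 5.0 µm steps beyond
    return np.concatenate([fine, medium, coarse, deep])
```

This spacing concentrates slices near the plate bottom, where sedimented condensates are expected, while still covering the full 50 µm depth with few frames.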
Verification of imaging completion
To monitor imaging progress, a continuous background process compared the number of saved image slices to the expected slice count based on the number of imaging events (i.e., focal planes across wells). Once the saved slice count matched the target, the acquisition was deemed complete, and Micro-Manager and associated processes were automatically closed.
Automated image analysis and classification
Each acquired micrograph underwent automated analysis to extract sample classifications and particle features. Particles were detected using the scikit-image Python module. Yen thresholding was applied to create a binary mask, which was used to detect particles above 500 pixels (5.87 µm²). The extracted particle properties (e.g., X/Y position, area, mean intensity) were saved for each micrograph. Results were then grouped by grid position and sorted by Z-index. For each particle, the slice with the largest detected area was selected as the representative view, which was used for the property mappings performed in this study. Wells were classified based on particle count and distribution, with 12 or more particles across at least three grid positions indicating “Phase Separation” and fewer particles marking “No Phase Separation”.
Machine learning and computation
Design of the parameters space
The initial dataset (i.e., cycle 0) for any given system formulation was created by computing a regular D-dimensional grid of points (with D being the number of variables considered), where each independent component of the formulation accounts for one dimension. Two dimensions were always assigned to the concentrations of the two oppositely charged polymers, poly-L-(lysine) and poly-L-(aspartic acid). Additional dimensions could be added to account for other variables. The response variable was an integer that mapped the recorded phase to either coacervate or not. In all our experiments, we restricted our formulations to the coacervation of two oppositely charged polymers as a function of the two polymer concentrations and the salt concentration, and we focused only on 2D and 3D datasets. In the former case (2D) the salt concentration is fixed, while in the latter (3D) it is allowed to vary. The range of polymer concentrations was the same for all experiments, regardless of polymer identity: a regularly spaced interval from 0.1 mM to 8.1 mM with steps of 0.1 mM, giving a total of 81 concentration values (end points included). Similarly, the salt concentration was varied from 50 mM to 2075 mM with steps of 25 mM, giving a total of 81 values. All ranges were chosen according to the accuracy of the machines used to formulate the solutions. Finally, the dataset was created by filling a 2D or 3D regular grid with the values of the variables under investigation, yielding a total of 6,561 (81 × 81) points in the 2D case and 531,441 (81 × 81 × 81) in the 3D case. In all experiments, the response variable was initialized to −1, the undefined default value, for all points of the grid.
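The design grid can be reconstructed with NumPy as follows; the polymer axis follows the stated range exactly, while the salt-axis endpoint is an assumption chosen so that the axis has the stated 81 values at 25 mM steps.

```python
import itertools
import numpy as np

def make_design_space(with_salt=False):
    """Regular 2D/3D grid over the formulation space (sketch); the response
    variable is initialized to -1, the undefined default."""
    poly = np.round(np.arange(0.1, 8.1 + 1e-9, 0.1), 1)  # 81 values, 0.1-8.1 mM
    salt = np.linspace(50.0, 2050.0, 81)  # 81 values, 25 mM steps (endpoint assumed)
    axes = [poly, poly] + ([salt] if with_salt else [])
    grid = np.array(list(itertools.product(*axes)))
    labels = np.full(len(grid), -1)       # undefined default response
    return grid, labels
```

The 2D grid has 81 × 81 = 6,561 rows and the 3D grid 81 × 81 × 81 = 531,441, matching the counts quoted above.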
Selection of new points
Starting from cycle 0, and for each cycle, a subset of \(n\) points (i.e., new formulations) was sampled from the available pool of \(N\) points. The chosen sampling technique followed the rules of Farthest Point Sampling (FPS)47. FPS is a sampling technique used to select a subset of points that are maximally spread out from each other within a given dataset. The goal is to retain points that represent the diversity of the data distribution by maximizing the minimum distance between selected points. Given a starting dataset \(X=\{{x}_{1},{x}_{2},\ldots,{x}_{N}\}\) of \(N\) points, a first random point \({r}_{1}\in X\) was selected and added to the set of sampled points \(S=\{{r}_{1}\}\). For each remaining point \(x\in {X\backslash S}\), the minimum distance to any point in \(S\) was computed:

\[d\left(x\right)=\min _{r\in S}\left\Vert x-r\right\Vert \qquad (1)\]
Then, the point \({x}_{i}\) with the largest \(d\left({x}_{i}\right)\) was selected and added to the set of sampled points (i.e., the point farthest from the currently sampled points). This selection was repeated until the desired number \(n\) of points was reached. The result is a subset \(S\subset X\) of \(n\) points distributed so as to maintain maximal separation, thereby capturing the structure of the original dataset more effectively than random sampling in cases where spread is important.
Phase diagram (PD) prediction
At each cycle \(N\), a phase diagram was predicted using the data that had been experimentally tested in cycle \(N-1\). In the case of \(N=0\), no previously tested data were available, the prediction was skipped, and the FPS-selected points were fed directly to the experimental validation pipeline, where their phase was recorded. For all \(N\ge 1\), all points assigned to the sampled set \(S\), after experimental validation, were used as the ground truth for a Gaussian Process Classifier (GPC)46 model, which predicts the phase distribution over the entire input space. The GPC models the probability distribution over classes (e.g., the phases) by defining a latent function \(f:{{\mathbb{R}}}^{d}\to {{\mathbb{R}}}^{K}\) that associates each input \(x\in {{\mathbb{R}}}^{d}\) in the input space with a set of probabilities \(p\left(y={c|x},S\right)\), where \(c\in \left\{1,\ldots,K\right\}\) represents the class labels. The training step used the subset \(S\) to learn the posterior distribution of \(f\), which, in turn, yielded a probabilistic model capable of assigning any point \({x}_{i}\in X\) a probability of belonging to each class.
In our case, for each point in the input space the GPC outputs a probability vector defined as follows:

\[{{{{\boldsymbol{p}}}}}_{i}=\left[p\left(y=1 | {x}_{i},S\right),\;p\left(y=2 | {x}_{i},S\right)\right] \qquad (2)\]
where each component represents the probability of \({x}_{i}\) belonging to either the “non-aggregate” (\(y=1\)) or “coacervate” (\(y=2\)) class. Since \({{{{\boldsymbol{p}}}}}_{i}\) is a probability vector, its components in Eq. (2) must sum to 1. Thus, the GPC trained on the set of all sampled and tested points provides a probabilistic prediction over the entire dataset, defined by concatenating the individual vectors (Eq. (2)) for all points contained in \(X\):

\[{{{\boldsymbol{P}}}}=\left[{{{{\boldsymbol{p}}}}}_{1},{{{{\boldsymbol{p}}}}}_{2},\ldots,{{{{\boldsymbol{p}}}}}_{N}\right] \qquad (3)\]
Equation (3) enabled inference about phase membership across all points, essentially representing the phase diagram.
The GPC algorithm used in our work was defined using a Radial Basis Function (RBF) kernel with length scale \(1.0\), multiplied by a constant kernel with default value of \(1.0\). In each application of the prediction algorithm, we allowed for an automatic internal optimization step by setting the parameters n_restarts_optimizer to \(5\), and the max_iter_predict to \(150\) (more information can be found on the original GPC Scikit-Learn documentation page80).
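The kernel and classifier settings described above can be reproduced with scikit-learn; the four training points and their labels below are toy illustrations, not experimental data.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Kernel and settings as described in the text.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpc = GaussianProcessClassifier(kernel=kernel, n_restarts_optimizer=5,
                                max_iter_predict=150, random_state=0)

# Toy training set: 1 = non-aggregate, 2 = coacervate (illustrative only).
X_train = np.array([[0.5, 0.5], [1.0, 1.0], [6.0, 6.0], [7.0, 7.0]])
y_train = np.array([1, 1, 2, 2])
gpc.fit(X_train, y_train)

# Probability vectors (Eq. 2) over a tiny query "grid".
P = gpc.predict_proba(np.array([[0.6, 0.6], [6.5, 6.5]]))
```

Stacking `predict_proba` outputs over the full design grid yields the concatenated probability vector used as the phase diagram prediction.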
Uncertainty estimation
At each cycle, to estimate the uncertainty in the phase diagram predictions, we computed the information entropy of each point’s probability vector \({{{{\boldsymbol{p}}}}}_{i}\) (Eq. (2)). The uncertainty for \({x}_{i}\) was computed as the (information) entropy \(H\left({x}_{i}\right)\):

\[H\left({x}_{i}\right)=-\sum _{c=1}^{K}p\left(y=c | {x}_{i},S\right)\log p\left(y=c | {x}_{i},S\right)\]
Higher entropy values indicate greater uncertainty, providing an uncertainty measure for each point in the phase diagram that reflects the prediction’s confidence level. The range of entropy values is bounded and depends on the number of independent classes \(K\). In all our cases, \(K=2\), giving values from \(H=0\), when either of the two classes is known for certain, i.e. \({{{{\boldsymbol{p}}}}}_{i}=\left[{{\mathrm{1.0,0.0}}}\right]\), to \(H=\log 2\approx 0.69\), when the two classes are maximally uncertain, i.e. \({{{{\boldsymbol{p}}}}}_{i}=\left[{{\mathrm{0.5,0.5}}}\right]\).
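The entropy calculation can be illustrated with SciPy (which uses the natural logarithm by default, matching the bound of ln 2 ≈ 0.69 quoted above); the three probability vectors are illustrative, not measured data.

```python
import numpy as np
from scipy.stats import entropy

# Shannon entropy (natural log) of illustrative probability vectors.
P = np.array([[1.0, 0.0],    # one class certain         -> H = 0
              [0.5, 0.5],    # maximally uncertain       -> H = ln 2
              [0.9, 0.1]])   # intermediate uncertainty
H = np.array([entropy(p) for p in P])
```

Evaluated over the whole grid, these values form the per-cycle uncertainty landscape used to steer the next round of sampling.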
Highest uncertainty landscape and exploration
The values of \(H\left({x}_{i}\right)\) gave direct access to the so-called uncertainty (phase) landscape, which represented, per cycle, which areas of the design space were most (un)certain. This information was then exploited to select a subset of points \({X}^{{\prime} }\subset X\) exhibiting maximal entropy, within a set range of entropy values:

\[{X}^{{\prime} }=\left\{{x}_{i}\in X\,:\,h\le H\left({x}_{i}\right)\le {H}_{\max }\right\} \qquad (4)\]
In Eq. (4) the upper-bound limit, \({H}_{\max }\), represented the maximum value of entropy, defined as:

\[{H}_{\max }=\log K\]
which for \(K=2\) takes the value \({H}_{\max }=\log 2\approx 0.69\). The lower-bound limit can be freely chosen; in our case it was set to \(h=0.60\), effectively selecting only the highest-uncertainty regions.
The points contained in \({X}^{{\prime} }\) were then used as the new search space for the FPS algorithm, sampling new points for refining the prediction of the phase diagram. In the context of active learning, this is often referred to as the exploration phase of the cycle, where new points are selected to maximize exploration, lowering the overall uncertainty of the predictive algorithm.
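The full exploration step — entropy thresholding followed by farthest point sampling over the high-uncertainty pool — can be sketched in one self-contained function; the function name and the compact inline FPS are ours.

```python
import numpy as np

def select_next_batch(X, H, n, h=0.60, rng=0):
    """Keep points with entropy >= h (the pool X'), then farthest-point-sample
    n new formulations from that pool (sketch of the exploration phase)."""
    pool = X[H >= h]                       # highest-uncertainty region X'
    rng = np.random.default_rng(rng)
    chosen = [int(rng.integers(len(pool)))]
    d = np.linalg.norm(pool - pool[chosen[0]], axis=1)
    while len(chosen) < n:
        i = int(np.argmax(d))              # farthest remaining candidate
        chosen.append(i)
        d = np.minimum(d, np.linalg.norm(pool - pool[i], axis=1))
    return pool[chosen]
```

The returned formulations are the ones passed to the pipetting platform in the next cycle.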
Accuracy measurement
To assess the accuracy of our classification model in a way that accounts for class imbalance, we used a balanced accuracy metric59. At each cycle, a set of labels \({Y}^{(t)}=\left\{{y}_{i}^{(t)}\right\}\) was computed for each point in our dataset from the global vector of probabilities (Eq. (3)). Balanced accuracy was defined as the average of the per-class sensitivities. Since we are dealing with a binary classification problem, we considered the ‘coacervate’ class as the ‘positive’ outcome and the ‘non-aggregate’ class as the ‘negative’ outcome.
Then, the balanced accuracy was defined as:

\[{{\rm{Balanced}}}\;{{\rm{accuracy}}}=\frac{1}{2}\left(\frac{{T}_{P}}{{T}_{P}+{F}_{N}}+\frac{{T}_{N}}{{T}_{N}+{F}_{P}}\right) \qquad (7)\]
In Eq. (7), \({T}_{P}\) and \({T}_{N}\) refer to the true-positive and true-negative predicted labels, while \({F}_{P}\) and \({F}_{N}\) refer to the false-positive and false-negative predicted labels.
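scikit-learn’s implementation of this metric can illustrate the definition; the label arrays below are toy values (2 = coacervate, 1 = non-aggregate), not experimental classifications.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Imbalanced toy example: 4 coacervate vs. 2 non-aggregate samples.
y_true = np.array([2, 2, 2, 2, 1, 1])
y_pred = np.array([2, 2, 2, 2, 1, 2])
# Mean of per-class sensitivities: (4/4 + 1/2) / 2 = 0.75,
# whereas plain accuracy would be 5/6, masking the minority-class error.
ba = balanced_accuracy_score(y_true, y_pred)
```

Averaging per-class sensitivities prevents the majority phase from dominating the score when the two phases occupy very different fractions of the grid.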
Convergence measurements
To monitor the convergence of the model across cycles and/or experimental replicas, we tracked changes in the phase diagram, represented by the concatenated probability vector \({{{{\boldsymbol{P}}}}}^{(t)}\) output by the GPC prediction (Eq. (3)). The superscript \((t)\) indicates a cycle-specific output of the probability vector. Convergence, in terms of Jensen-Shannon divergence (JSD)61, could be computed in two main directions: across cycles and across experimental replicas. The former required two probability vectors belonging to two consecutive cycles of the same experiment, \({{{{\boldsymbol{P}}}}}^{(t)}\) and \({{{{\boldsymbol{Q}}}}}^{(t+1)}\), and was defined as:

\[{{\rm{JSD}}}\left({{{{\boldsymbol{P}}}}}^{(t)}\,\Vert\,{{{{\boldsymbol{Q}}}}}^{(t+1)}\right)=\frac{1}{2}{D}_{{{\rm{KL}}}}\left({{{{\boldsymbol{P}}}}}^{(t)}\,\Vert\,{{{{\boldsymbol{M}}}}}^{(t,t+1)}\right)+\frac{1}{2}{D}_{{{\rm{KL}}}}\left({{{{\boldsymbol{Q}}}}}^{(t+1)}\,\Vert\,{{{{\boldsymbol{M}}}}}^{(t,t+1)}\right) \qquad (8)\]
where \({{{{\boldsymbol{M}}}}}^{(t,t+1)}=\frac{1}{2}\left({{{{\boldsymbol{P}}}}}^{(t)}{{{\boldsymbol{+}}}}{{{{\boldsymbol{Q}}}}}^{(t+1)}\right)\) represents the midpoint distribution. Each term on the right-hand side of Eq. (8) represents the Kullback-Leibler divergence between one of the distributions and the midpoint. By computing the JSD between \({{{{\boldsymbol{P}}}}}^{(t)}\) and \({{{{\boldsymbol{Q}}}}}^{(t+1)}\) over successive iterations of the AL algorithm, we obtained a measure of convergence, with decreasing JSD values indicating stabilization of the model prediction across cycles. To compute the JSD across experiment replicas, we averaged the individual pairwise JSD measurements (Eq. (8)) at each cycle:

\[{\overline{{{\rm{JSD}}}}}^{(t)}=\frac{2}{R\left(R-1\right)}\sum _{r < r{\prime} }{{\rm{JSD}}}\left({{{{\boldsymbol{P}}}}}_{r}^{(t)}\,\Vert\,{{{{\boldsymbol{P}}}}}_{r{\prime} }^{(t)}\right) \qquad (9)\]

where \(R\) is the number of replicas.
The average JSD value represented the convergence trend across multiple experimental replicas, allowing us to qualitatively account for experimental variability.
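SciPy provides this divergence directly; note that `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon *distance* (the square root of the divergence), so it is squared here, and `base=np.e` matches the natural-log entropy used above. The probability vectors are toy values.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Per-point JSD between two consecutive cycle predictions (toy vectors).
P_t  = np.array([[0.9, 0.1], [0.5, 0.5]])   # cycle t
Q_t1 = np.array([[0.8, 0.2], [0.5, 0.5]])   # cycle t + 1
per_point = np.array([jensenshannon(p, q, base=np.e) ** 2
                      for p, q in zip(P_t, Q_t1)])
mean_jsd = per_point.mean()   # within-experiment convergence metric
```

A value of zero for an unchanged prediction and a bound of ln 2 per point make decreasing means a direct signal of stabilization across cycles.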
Overfitting tests and analysis
To check for possible overfitting during the AL cycles, we conducted targeted robustness tests (Supplementary Fig. 50). First, we performed a stress test aimed at challenging the prediction ability of the trained model: we masked an increasing number of points from the screened pool and retrained the model (Supplementary Fig. 50B). The overall prediction remained almost unchanged, and even in the most extreme case it followed the expected result, showing that the model draws its prediction smoothly and avoids overfitting. Second, to test for overfitting more systematically, we employed the “y-scrambling” technique. In this approach, the target response values (the phase labels) of the training set are randomly shuffled, breaking any true relationship between input features and target (Supplementary Fig. 50C). On average, the “scrambled” balanced accuracy was centered around 0.5 or lower, indicating that with the true labels our approach indeed picks up relevant patterns in the underlying data.
Down-sampling search
To test the performance of our pipeline on more coarsely spaced grids, we set up two alternative versions of the 2D and 3D search spaces by selecting every third or fourth point along each axis. This yielded 9- to 15-fold reductions in the number of conditions in 2D (e.g., 6561 → 729 or 441) and 27- to 57-fold reductions in 3D (e.g., 531,441 → 19,683 or 9,261). These new search spaces were used to run additional in-silico experiments comparing their performance to that of the default grid.
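The grid construction above can be sketched as follows, assuming an 81-point axis per dimension (consistent with the stated counts: 81² = 6561 and 81³ = 531,441) and a placeholder concentration range:

```python
import numpy as np

# Default 2D search space: an 81 x 81 grid of conditions (6561 total)
axis = np.linspace(0.0, 1.0, 81)          # placeholder concentration axis
grid_2d = np.array(np.meshgrid(axis, axis)).T.reshape(-1, 2)

# Coarser grids: keep every third or fourth point along each axis
axis_3, axis_4 = axis[::3], axis[::4]     # 27 and 21 points per axis
grid_2d_c3 = np.array(np.meshgrid(axis_3, axis_3)).T.reshape(-1, 2)  # 729 conditions
grid_2d_c4 = np.array(np.meshgrid(axis_4, axis_4)).T.reshape(-1, 2)  # 441 conditions

# The same stride in 3D: 81**3 = 531441 -> 27**3 = 19683 (or 21**3 = 9261)
grid_3d_c3 = np.array(np.meshgrid(axis_3, axis_3, axis_3)).T.reshape(-1, 3)
```

Because the stride applies per axis, the reduction factor compounds with dimensionality: a factor of 3 per axis gives 9× fewer conditions in 2D and 27× fewer in 3D.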
Software and implementation
All code regarding active machine learning was written in Python 3.12. The Python package scikit-learn (v.1.5.0) was used for the implementation of the Gaussian Process Classifier and the calculation of the balanced accuracy. SciPy (v.1.13.1) was used for the computation of the information entropy. pandas (v.2.2.1) was used to handle the datasets. All other operations (e.g., design space creation, farthest point sampling, and convergence calculation) were carried out with custom scripts using NumPy (v. < 2.0.0). For data visualization, Matplotlib (v.3.8.4) and Plotly (v.5.9.0) were used in combination with Adobe Illustrator.
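A minimal sketch of how these components fit together in one acquisition step is shown below. The classifier and entropy routine are the ones named above; the toy labelled conditions, the pool size, the labeling rule, and the batch size of eight are illustrative assumptions, not the study's actual settings.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.gaussian_process import GaussianProcessClassifier

# Toy labelled conditions and a pool of unscreened candidate conditions
rng = np.random.default_rng(1)
X_labelled = rng.uniform(size=(20, 2))
y_labelled = (X_labelled.sum(axis=1) > 1.0).astype(int)  # hypothetical phase labels
X_pool = rng.uniform(size=(200, 2))

# Fit the GPC on screened conditions, then score the pool by predictive entropy
gpc = GaussianProcessClassifier(random_state=0).fit(X_labelled, y_labelled)
probs = gpc.predict_proba(X_pool)   # shape (200, 2): per-class probabilities
H = entropy(probs.T)                # Shannon entropy per candidate condition

# Acquire the most uncertain conditions, i.e., those nearest the decision boundary
batch = np.argsort(H)[-8:]
```

Selecting maximum-entropy candidates concentrates subsequent formulations near the predicted phase boundary, which is what steers the exploration toward the binodal.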
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All data supporting this study are available in the article, the Supplementary Information, and the Source Data file (SourceData.xlsx). Raw and processed datasets from the active machine learning cycles, including those used to generate the manuscript and Supplementary Figs., are available on GitHub [https://github.com/molML/activeML-navigation-of-condensate-phases] and Zenodo [https://doi.org/10.5281/zenodo.17223126], together with instructions on how to replicate the figures. The complete confocal microscopy image dataset is too large for deposition in a public repository but is archived locally and can be made available by the corresponding authors upon request. Source data are provided with this paper.
Code availability
The Python code to replicate and extend our active machine learning framework is openly accessible on GitHub at [https://github.com/molML/activeML-navigation-of-condensate-phases]. The code at the time of publishing is available at [https://doi.org/10.5281/zenodo.17223126].
References
Diekmann, Y. & Pereira-Leal, J. B. Evolution of intracellular compartmentalization. Biochem. J. 449, 319–331 (2013).
Bar-Peled, L. & Kory, N. Principles and functions of metabolic compartmentalization. Nat. Metab. 4, 1232 (2022).
Alberts, B. et al. Molecular biology of the cell. Biochem Educ. 22, 641–695 (1994).
Boeynaems, S. et al. Protein phase separation: A new phase in cell biology. Trends Cell Biol. 28, 420–435 (2018).
Hyman, A. A., Weber, C. A. & Jülicher, F. Liquid-liquid phase separation in biology. Annu Rev. Cell Dev. Biol. 30, 39–58 (2014).
Banani, S. F., Lee, H. O., Hyman, A. A. & Rosen, M. K. Biomolecular condensates: Organizers of cellular biochemistry. Nat. Rev. Mol. Cell Biol. 18, 285–298 (2017).
Lyon, A. S., Peeples, W. B. & Rosen, M. K. A framework for understanding the functions of biomolecular condensates across scales. Nat. Rev. Mol. Cell Biol. 22, 215–235 (2020).
Aguzzi, A. & Altmeyer, M. Phase separation: Linking cellular compartmentalization to disease. Trends Cell Biol. 26, 547–558 (2016).
Wan, L., Ke, J., Zhu, Y., Zhang, W. & Mu, W. Recent advances in engineering synthetic biomolecular condensates. Biotechnol. Adv. 77, 108452 (2024).
Ramm, B. et al. Biomolecular condensate drives polymerization and bundling of the bacterial tubulin FtsZ to regulate cell division. Nat. Commun. 14, 1–24 (2023).
Welles, R. M. et al. Determinants that enable disordered protein assembly into discrete condensed phases. Nat. Chem. 16, 1062–1072 (2024).
Visser, B. S., Lipiński, W. P. & Spruijt, E. The role of biomolecular condensates in protein aggregation. Nat. Rev. Chem. 8, 686–700 (2024).
Buddingh, B. C. & Van Hest, J. C. M. Artificial cells: Synthetic compartments with life-like functionality and adaptivity. Acc. Chem. Res. 50, 769–777 (2017).
Dai, Y., You, L. & Chilkoti, A. Engineering synthetic biomolecular condensates. Nat. Rev. Bioeng. 1, 466–480 (2023).
Erkamp, N. A. et al. Biomolecular condensates with complex architectures via controlled nucleation. Nat. Chem. Eng. 1, 430–439 (2024).
Song, S. et al. Peptide-based biomimetic condensates via liquid-liquid phase separation as biomedical delivery vehicles. Biomacromolecules 25, 5468–5488 (2024).
Mitrea, D. M., Mittasch, M., Gomes, B. F., Klein, I. A. & Murcko, M. A. Modulating biomolecular condensates: A novel approach to drug discovery. Nat. Rev. Drug Discov. 21, 841–862 (2022).
Ambadi Thody, S. et al. Small-molecule properties define partitioning into biomolecular condensates. Nat. Chem. 16, 1794–1802 (2024).
Dai, Y. et al. Programmable synthetic biomolecular condensates for cellular control. Nat. Chem. Biol. 19, 518–528 (2023).
Duro-Castano, A. et al. Capturing “Extraordinary” soft-assembled charge-like polypeptides as a strategy for nanocarrier design. Adv. Mater. 29, 1702888 (2017).
Liu, S. et al. Enzyme-mediated nitric oxide production in vasoactive erythrocyte membrane-enclosed coacervate protocells. Nat. Chem. 12, 1165–1173 (2020).
Dzuricky, M., Rogers, B. A., Shahid, A., Cremer, P. S. & Chilkoti, A. De novo engineering of intracellular condensates using artificial disordered proteins. Nat. Chem. 12, 814–825 (2020).
Chin, K. Y., Ishida, S., Sasaki, Y. & Terayama, K. Predicting condensate formation of protein and RNA under various environmental conditions. BMC Bioinforma. 25, 1–14 (2024).
Patel, A. et al. ATP as a biological hydrotrope. Science 356, 753–756 (2017).
Castelletto, V., Seitsonen, J., Pollitt, A. & Hamley, I. W. Minimal peptide sequences that undergo liquid-liquid phase separation via self-coacervation or complex coacervation with ATP. Biomacromolecules 25, 5321–5331 (2024).
Nobeyama, T., Furuki, T. & Shiraki, K. Phase-diagram observation of liquid-liquid phase separation in the poly(l-lysine)/ATP system and a proposal for diagram-based application strategy. Langmuir 39, 17043–17049 (2023).
Banani, S. F. et al. Compositional control of phase-separated cellular bodies. Cell 166, 651 (2016).
Boeynaems, S. et al. Phase separation of C9orf72 dipeptide repeats perturbs stress granule dynamics. Mol. Cell 65, 1044–1055.e5 (2017).
Poudyal, M. et al. Intermolecular interactions underlie protein/peptide phase separation irrespective of sequence and structure at crowded milieu. Nat. Commun. 14, 1–21 (2023).
Cakmak, F. P., Choi, S., Meyer, M. C. O., Bevilacqua, P. C. & Keating, C. D. Prebiotically-relevant low polyion multivalency can improve functionality of membraneless compartments. Nat. Commun. 11, 1–11 (2020).
Erkamp, N. A., Qi, R., Welsh, T. J. & Knowles, T. P. J. Microfluidics for multiscale studies of biomolecular condensates. Lab Chip 23, 9–24 (2022).
Chen, T., Lei, Q., Shi, M. & Li, T. High-throughput experimental methods for investigating biomolecular condensates. Quant. Biol. 9, 255–266 (2021).
Nakashima, K. K., André, A. A. M. & Spruijt, E. Enzymatic control over coacervation. Methods Enzymol. 646, 353–389 (2021).
Arter, W. E. et al. Biomolecular condensate phase diagrams with a combinatorial microdroplet platform. Nat. Commun. 13, 1–10 (2022).
Bray, M. A. et al. Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nat. Protoc. 11, 1757–1774 (2016).
Bremer, A., Mittag, T. & Heymann, M. Microfluidic characterization of macromolecular liquid–liquid phase separation. Lab Chip 20, 4225–4234 (2020).
Di Fiore, F., Nardelli, M. & Mainini, L. Active learning and bayesian optimization: A unified perspective to learn with a goal. Arch. Comput. Methods Eng. 31, 2985–3013 (2024).
Reker, D. & Schneider, G. Active-learning strategies in computer-assisted drug discovery. Drug Discov. Today 20, 458–465 (2015).
van Tilborg, D. & Grisoni, F. Traversing chemical space with active deep learning for low-data drug discovery. Nat. Comput. Sci. 4, 786–796 (2024).
Khalak, Y., Tresadern, G., Hahn, D. F., De Groot, B. L. & Gapsys, V. Chemical space exploration with active learning and alchemical free energies. J. Chem. Theory Comput 18, 6259–6270 (2022).
Seegobin, N. et al. Optimising the production of PLGA nanoparticles by combining design of experiment and machine learning. Int J. Pharm. 667, 124905 (2024).
Ortiz-Perez, A., van Tilborg, D., van der Meel, R., Grisoni, F. & Albertazzi, L. Machine learning-guided high throughput nanoparticle design. Digital Discov. 3, 1280–1291 (2024).
Tamasi, M. J. & Gormley, A. J. Biologic formulation in a self-driving biomaterials lab. Cell Rep. Phys. Sci. 3, 101041 (2022).
Mason, A. F., Buddingh, B. C., Williams, D. S. & Van Hest, J. C. M. Hierarchical self-assembly of a copolymer-stabilized coacervate protocell. J. Am. Chem. Soc. 139, 17309–17312 (2017).
Alberti, S., Gladfelter, A. & Mittag, T. Considerations and challenges in studying liquid-liquid phase separation and biomolecular condensates. Cell 176, 419–434 (2019).
Rasmussen, C. E. & Williams, C. K. I. Gaussian Processes for Machine Learning (MIT Press, 2005). https://doi.org/10.7551/MITPRESS/3206.001.0001.
Eldar, Y., Lindenbaum, M., Porat, M. & Zeevi, Y. Y. The farthest point strategy for progressive image sampling. IEEE Trans. Image Process. 6, 1305–1315 (1997).
Tom, G. et al. Self-driving laboratories for chemistry and materials science. Chem. Rev. 124, 9633–9732 (2024).
Canty, R. B., Koscher, B. A., McDonald, M. A. & Jensen, K. F. Integrating autonomy into automated research platforms. Digital Discov. 2, 1259–1268 (2023).
van Haren, M. H. I., Visser, B. S. & Spruijt, E. Probing the surface charge of condensates using microelectrophoresis. Nat. Commun. 15, 1–10 (2024).
Sathyavageeswaran, A., Bonesso Sabadini, J. & Perry, S. L. Self-assembling polypeptides in complex coacervation. Acc. Chem. Res. 57, 386–398 (2024).
Fisher, R. S. & Elbaum-Garfinkle, S. Tunable multiphase dynamics of arginine and lysine liquid condensates. Nat. Commun. 11, 1–10 (2020).
Ukmar-Godec, T. et al. Lysine/RNA-interactions drive and regulate biomolecular condensation. Nat. Commun. 10, 1–15 (2019).
Leurs, Y. H. A. et al. Stabilization of condensate interfaces using dynamic protein insertion. J. Am. Chem. Soc. 147, 18412–18418 (2025).
Ribeiro, S. S., Samanta, N., Ebbinghaus, S. & Marcos, J. C. The synergic effect of water and biomolecules in intracellular phase separation. Nat. Rev. Chem. 3, 552–561 (2019).
Dignon, G. L., Best, R. B. & Mittal, J. Biomolecular phase separation: From molecular driving forces to macroscopic properties. Annu. Rev. Phys. Chem. 71, 53–75 (2020).
Milin, A. N. & Deniz, A. A. Reentrant phase transitions and non-equilibrium dynamics in membraneless organelles. Biochemistry 57, 2470–2477 (2018).
van Tilborg, D. et al. Deep learning for low-data drug discovery: Hurdles and opportunities. Curr. Opin. Struct. Biol. 86, 102818 (2024).
Thölke, P. et al. Class imbalance should not throw you off balance: Choosing the right classifiers and performance metrics for brain decoding with imbalanced data. Neuroimage 277, 120253 (2023).
Gangwal, A., Ansari, A., Ahmad, I., Azad, A. K. & Wan Sulaiman, W. M. A. Current strategies to address data scarcity in artificial intelligence-based drug discovery: A comprehensive review. Comput Biol. Med. 179, 108734 (2024).
Nielsen, F. On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy 21, 485 (2019).
Erkamp, N. A. et al. Multidimensional protein solubility optimization with an ultrahigh-throughput microfluidic platform. Anal. Chem. 95, 5362–5368 (2023).
Riback, J. A. et al. Composition-dependent thermodynamics of intracellular phase separation. Nature 581, 209–214 (2020).
Muschol, M. & Rosenberger, F. Liquid–liquid phase separation in supersaturated lysozyme solutions and associated precipitate formation/crystallization. J. Chem. Phys. 107, 1953–1962 (1997).
Minton, A. P. Simple calculation of phase diagrams for liquid-liquid phase separation in solutions of two macromolecular solute species. J. Phys. Chem. B 124, 2363–2370 (2020).
Pappu, R. V., Cohen, S. R., Dar, F., Farag, M. & Kar, M. Phase transitions of associative biomacromolecules. Chem. Rev. 123, 8945–8987 (2023).
Posey, A. E. et al. Biomolecular condensates are characterized by interphase electric potentials. J. Am. Chem. Soc. https://doi.org/10.1021/JACS.4C08946 (2024).
Maruyama, B. et al. Artificial intelligence for materials research at extremes. MRS Bull. 47, 1154–1164 (2022).
Burger, B. et al. A mobile robotic chemist. Nature 583, 237–241 (2020).
Vescovi, R. et al. Towards a modular architecture for science factories. Digital Discov. 2, 1980–1998 (2023).
Chen, L., Shi, D., Kang, X., Ma, C. & Zheng, Q. Deep learning enabled comprehensive evaluation of jumping-droplet condensation and frosting. ACS Appl. Mater. Interfaces 16, 25473–25482 (2024).
Birolo, R. et al. Deep supramolecular language processing for co-crystal prediction. Preprint at ChemRxiv https://doi.org/10.26434/CHEMRXIV-2024-VGVHK-V2 (2024).
Njirjak, M. et al. Reshaping the discovery of self-assembling peptides with generative AI guided by hybrid deep learning. Nat. Mach. Intell. 1–14 (2024). https://doi.org/10.1038/s42256-024-00928-1.
van Mierlo, G. et al. Predicting protein condensate formation using machine learning. Cell Rep. 34, 108705 (2021).
Fu, H. et al. Supramolecular polymers form tactoids through liquid–liquid phase separation. Nature 626, 1011–1018 (2024).
Rovers, M. M. et al. Using a supramolecular monomer formulation approach to engineer modular, dynamic microgels, and composite macrogels. Adv. Mater. 2405868 (2024). https://doi.org/10.1002/ADMA.202405868.
Bracha, D., Walls, M. T. & Brangwynne, C. P. Probing and engineering liquid-phase organelles. Nat. Biotechnol. 37, 1435–1445 (2019).
Lin, Y., Protter, D. S. W., Rosen, M. K. & Parker, R. Formation and maturation of phase-separated liquid droplets by RNA-binding proteins. Mol. Cell 60, 208–219 (2015).
Tsang, B., Pritišanac, I., Scherer, S. W., Moses, A. M. & Forman-Kay, J. D. Phase separation as a missing mechanism for interpretation of disease mutations. Cell 183, 1742–1756 (2020).
Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Acknowledgements
This work was supported by the National Growth Fund “Big Chemistry” funded by the Dutch Ministry of Education, Culture and Science (grant number 1420578 to J.C.M.H., F.G., L.B.). We gratefully acknowledge the Institute for Complex Molecular Systems (ICMS) for providing laboratory facilities. Special thanks to the Chemical Technology IT department, particularly Tom van Teeffelen and Frank Malipaard, for their expert advice and assistance in communication networks. We also extend our gratitude to Cristina Izquierdo Lozano for her support in data management and for the insightful discussions that enriched this work.
Author information
Authors and Affiliations
Contributions
Y.H.A.L., W.H., A.G., J.L.J.D., J.C.M.H., F.G., L.B. designed the automation pipeline. Y.H.A.L., W.H., A.G., and J.L.J.D. developed the automation pipeline. Y.H.A.L., W.H., A.G., A.R-A., N.A.E., J.C.M.H., F.G., L.B. designed the experiments. Y.H.A.L., W.H., A.G., A.R-A, and N.A.E. performed the experiments. Y.H.A.L., W.H., A.G., A.R-A., N.A.E., J.C.M.H., F.G., L.B. analyzed the data. Y.H.A.L., W.H., A.G., J.C.M.H., F.G., and L.B. wrote the manuscript. J.C.M.H., F.G., and L.B. supervised the study. All authors reviewed the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Michael Heymann who co-reviewed with Florian Hiering; Michael Lake, and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Source data
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Leurs, Y.H.A., van den Hout, W., Gardin, A. et al. Automated navigation of condensate phase behavior with active machine learning. Nat Commun 16, 9598 (2025). https://doi.org/10.1038/s41467-025-64617-2
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41467-025-64617-2