Abstract
The selection of high-performing cell lines is crucial for biopharmaceutical production but is often time-consuming and labor-intensive. We investigated label-free multimodal nonlinear optical microscopy for non-perturbative profiling of biopharmaceutical cell lines based on their intrinsic molecular contrast. Employing simultaneous label-free autofluorescence multiharmonic (SLAM) microscopy with fluorescence lifetime imaging microscopy (FLIM), we characterized Chinese hamster ovary (CHO) cell lines at early passages (0–2). A machine learning (ML)-assisted analysis pipeline leveraged high-dimensional information to classify single cells into their respective lines. Remarkably, the monoclonal cell line classifiers achieved balanced accuracies exceeding 96.8% as early as passage 2. Correlation features and the FLIM modality played pivotal roles in early classification. This integrated optical bioimaging and machine learning approach presents a promising solution to expedite the cell line selection process while ensuring identification of high-performing biopharmaceutical cell lines. The techniques have potential for broader single-cell characterization applications in stem cell research, immunology, cancer biology and beyond.

Introduction
The pharmaceutical industry has experienced a substantial surge in investments in biopharmaceutical products research and development in response to the growing demand in recent years1. Among the available biopharmaceutical products, recombinant proteins, including monoclonal antibodies (mAbs), constitute a significant share of these biotechnological medicines and rank as the highest-selling class2. The majority of recombinant proteins are produced in mammalian expression platforms due to their ability to produce diverse, correctly folded, and glycosylated proteins3. Chinese Hamster Ovary (CHO) cell cultures have emerged as the preferred host platform for mAb production4. From 2018 to 2022, 95 of the 107 (89%) approved biopharmaceutical products made in mammalian systems were produced in CHO cells5.
The CHO cell line development for recombinant protein production involves transfecting the gene of interest in CHO host cell lines. Increased heterogeneity is shown among the transfected cell line clones due to random gene integration and the gene amplification process, leading to varying product quality attributes (QAs) and process performance6,7. The selection of high-performing CHO cell lines is thus an essential step in biopharmaceutical product development. Traditionally, high-performing cell lines are selected by screening a large number of clones using the limiting dilution method, which is costly, labor-intensive, and time-consuming8. Considerable efforts have been dedicated to identifying effective solutions for the early identification of high-performing cell lines capable of progressing swiftly to commercial-scale production, which has a major impact on the economics of biopharmaceutical drug development9,10.
The identification of high-performing cell lines is commonly based on a few QAs, such as cell productivity, stability, and product titre11. However, the performance-based cell line selection in the early stages of development can be problematic because culture conditions are usually different from those in large-scale production, potentially leading to the exclusion of promising candidates or the selection of poor producers12,13. To mitigate this, ‘omics techniques, particularly metabolomics, have been used to predict production phenotypes by analyzing intracellular and extracellular metabolites14,15,16. Studies have investigated the differences in metabolic profiles of high- and low-performing CHO cell lines for mAb production17,18,19. Due to the high energy demand for recombinant protein synthesis and maintenance, energy generation is crucial for recombinant CHO cell survival20,21. Key intracellular metabolites, including reduced nicotinamide adenine dinucleotide (phosphate) (NAD(P)H) and flavin adenine dinucleotide (FAD), are closely linked to energy production and mAb productivity, as they participate in the citric acid cycle, oxidative phosphorylation, and glycolysis. Elevated NAD(P)H and FAD levels in high-producing recombinant CHO cells were reported in previous studies, indicating their more active state of glycolysis and/or oxidative phosphorylation21. These findings highlight the potential of intracellular metabolites as indicators of production phenotypes, promising more robust cell line selection in the early stages of cell line development.
Technologies commonly used for metabolic profiling of recombinant CHO cells include Liquid Chromatography-Mass Spectrometry (LC-MS), Gas Chromatography-Mass Spectrometry (GC-MS), and Metabolic Flux Analysis (MFA)17,22. LC-MS identifies and quantifies metabolites by their mass-to-charge ratios, while GC-MS vaporizes and detects volatile metabolites23,24. MFA tracks the flow of metabolites through cellular pathways using stable isotopes22. These techniques have notable limitations. Mass Spectrometry requires destructive sample preparations, which can result in the loss of spatial information and only provides bulk measurements, sacrificing single-cell resolution and cellular heterogeneity. MFA, while useful, is labor-intensive and can produce inaccurate flux estimations due to its dependence on mathematical models and the complexity of metabolism. Nondestructive, high-resolution, single-cell metabolic profiling techniques, offering more detailed information of individual recombinant CHO cells, remain underexplored.
Label-free nonlinear optical microscopy has emerged as a powerful imaging technique, providing high-resolution images with rich structural and functional details using intrinsic molecular contrasts25,26. By utilizing autofluorescence from molecules like FAD and NAD(P)H or capturing optical signals from harmonic generations, it has been applied to investigate a range of biological phenomena27,28,29. Fluorescence lifetime imaging microscopy (FLIM) further enhances this by measuring fluorescence lifetimes, which has been used as a sensitive and nondestructive tool to investigate metabolism under various physiological and pathological conditions30,31,32. FLIM of NAD(P)H, which exhibits a longer lifetime (1–5 ns) when bound to proteins compared to its free state (0.3–0.6 ns), has been used to examine metabolic states in diverse biological samples, ranging from in vitro cell culture to in vivo human and animal studies33,34. Additionally, the optical redox ratio (ORR), calculated from FAD and NAD(P)H autofluorescence intensities, serves as a quantitative metric for assessing shifts between glycolytic and mitochondrial metabolism35,36. These imaging technologies obviate the need for exogenous labeling, thereby mitigating the potential for unforeseen disruptions to biological or physiological processes37. Previously, our lab developed simultaneous label-free autofluorescence multiharmonic (SLAM) microscopy, which allows simultaneous collection of signals from four imaging modalities: two-photon fluorescence (2PF), three-photon fluorescence (3PF), second-harmonic generation (SHG), and third-harmonic generation (THG)38,39. The 2PF and 3PF modalities show the autofluorescence signals from FAD and NAD(P)H, respectively. 
In addition, our prior investigations demonstrated that the fluorescence intensity and lifetime profile of NAD(P)H in CHO cells provided information about cell growth and protein production, highlighting the potential of label-free nonlinear optical microscopy in the characterization of recombinant CHO cell lines40.
Data analytics and machine learning (ML) techniques have become crucial for understanding the complex relationships between cell profiles and the production phenotypes of CHO cell lines. Recent studies have employed various approaches to uncover these connections, such as statistical analysis linking intracellular metabolites to mAb productivity21 and ML approaches predicting cell line performance based on the temporal evolution of metabolic phenotypes at early process timepoints41. Moreover, pattern recognition techniques have been used to predict performance across different scales by analyzing dynamic behavior over time11. However, most investigations have predominantly focused on bulk or low-content analyses, using a limited set of manually selected features. Advancements in ML techniques have greatly enhanced the feasibility and popularity of high-content single-cell profiling42,43. These methods enable the extraction and selection of a diverse array of features from individual cells segmented from images, offering a more detailed understanding of cell behaviors and characteristics.
In this study, we investigated the effectiveness of label-free multimodal nonlinear optical microscopy, combined with ML-assisted high-content analysis, for early identification of biopharmaceutical CHO cell lines with desired production phenotypes. Specifically, a SLAM setup with fluorescence lifetime capabilities in the NAD(P)H channel was employed, enabling simultaneous collection of SLAM and FLIM signals. Four industrially relevant mAb-producing CHO cell lines with varying production phenotypes were imaged during early passages (i.e., passage 0, passage 1, and passage 2). In addition to single-cell-line wells (i.e., monoclonal wells), artificial cell line pools (i.e., artificial pools) were created by combining multiple cell lines to simulate the cell line selection process. An ML-assisted single-cell analysis pipeline was developed to segment individual cells, extract diverse cellular features, and classify them into corresponding cell lines. Our findings highlight the potential of this methodology for differentiating CHO cell lines in early passages, accelerating the cell line selection process, and providing insights into optical signatures associated with desired production phenotypes.
Results
Experimental design and CHO cell imaging
An overview of the experimental design is illustrated in Fig. 1. Four single-origin recombinant CHO cell lines (referred to as A, B, C, and D) with varying production phenotypes were investigated in this study (Fig. 1a). A 15-day ambr®15 (Sartorius Stedim Biotech) production run was performed to quantify their process performances and the measurements for production phenotypes, including Viable Cell Concentration (VCC), Peak Titre, and Specific Productivity Rate (SPR), are reported in Table S1. Among the four cell lines, cell line A had the highest Peak Titre (4095 mg/L) and the highest Specific Productivity Rate (SPR) (21.35 pg/cell/day), whereas cell line B had the lowest Peak Titre (649 mg/L). Cell lines C and D had similar Peak Titre and Peak Viable Cell Concentration (VCC) values, but they had different stability profiles (i.e., cell line C was unstable, while cell line D was stable). Here, unstable cell lines were characterized as those experiencing a titre reduction exceeding 30% over a span of 80–100 generations.
a Description of the four CHO cell lines and the sample preparation process. b Schematic of the multimodal nonlinear optical microscopy system. PMT, photomultiplier tube. HPD, hybrid photodetector. TCSPC, time-correlated single photon counting. c Example multimodal CHO cell images obtained from an artificial pool containing all four cell lines in passage 2. d Overview of the ML-assisted single-cell analysis pipeline developed in this study. e Plot of the numbers and ratios of cells in artificial pools and monoclonal wells across all passages. f SLAM composite images of randomly selected CHO cells from different cell lines in passage 2. These cells were segmented from the original images and rearranged in 2-dimensional grids, with the intensity of each channel normalized across all cell lines.
A lab-built multimodal nonlinear optical microscope was used to image cells in passage 0, passage 1, and passage 2 (Fig. 1b). The simultaneous recording of 2PF, 3PF, THG, SHG, and 3PF FLIM signals provides both structural and metabolic information about the samples (Fig. 1c). Details of the imaging system are further illustrated in Fig. S1. The SHG channel, which mainly shows signals of collagen fibers, was removed during data analysis due to the low intracellular signal. In addition to the original SLAM intensity channels, optical redox ratio (i.e., FAD/(FAD + NAD(P)H)) and NAD(P)H fluorescence lifetime estimations were generated using both a biexponential fitting model (bound lifetime, free lifetime, bound intensity, free intensity) and phasor analysis (phasor component g, phasor component s, mean lifetime). Overall, this resulted in an 11-channel image for each field-of-view (FOV) (Fig. S2). The central aim of multimodal imaging is to use the multi-dimensional data provided by the spatially co-registered multimodal images to extract parameters that can serve as sensitive and meaningful biomarkers.
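To make the derived channels concrete, the following is a minimal sketch of how a redox ratio image and first-harmonic phasor coordinates could be computed. The redox ratio follows the definition given above; the phasor formulas are the standard first-harmonic expressions, and the phase lifetime s/(ωg) is used here only as an assumed stand-in, since the exact definition of the mean lifetime used in this study is not given in this section. The function names and the laser repetition rate parameter are illustrative.

```python
import numpy as np

def optical_redox_ratio(fad, nadph, eps=1e-12):
    """Pixel-wise FAD / (FAD + NAD(P)H); eps avoids division by zero in empty pixels."""
    fad = np.asarray(fad, dtype=float)
    nadph = np.asarray(nadph, dtype=float)
    return fad / (fad + nadph + eps)

def phasor_coordinates(decay, times, rep_rate_hz):
    """First-harmonic phasor (g, s) of one fluorescence decay histogram.

    decay       : photon counts per time bin
    times       : bin times in seconds, spanning one laser period
    rep_rate_hz : laser repetition rate (an assumed, user-supplied value)
    """
    decay = np.asarray(decay, dtype=float)
    times = np.asarray(times, dtype=float)
    omega = 2.0 * np.pi * rep_rate_hz
    total = decay.sum()
    g = (decay * np.cos(omega * times)).sum() / total
    s = (decay * np.sin(omega * times)).sum() / total
    return g, s

def phase_lifetime(g, s, rep_rate_hz):
    """Phase lifetime s / (omega * g); equals tau for a single-exponential decay."""
    omega = 2.0 * np.pi * rep_rate_hz
    return s / (omega * g)
```

For a single-exponential decay sampled over exactly one laser period, the phasor falls on the universal semicircle and the phase lifetime recovers the decay constant up to a small discretization error.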
For each passage, cells were placed in a 96-well plate for imaging (Fig. S3). In addition to the monoclonal wells, which contained cells from one of the four cell lines, artificial pools were created by deliberately mixing cells from different cell lines to simulate the cell line selection process. These artificial pools were composed of real, physically mixed cells to mimic experimental conditions rather than virtual or simulated data. To validate the cell line classification results in artificial pools, different combinations of the A, B, C, and D cell lines were created for each passage (i.e., AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, ABCD). The cell numbers from different cell lines within each artificial pool were kept at comparable levels during mixing. An ML-assisted single-cell analysis pipeline was developed to segment and classify individual cells in both monoclonal wells and artificial pools (Fig. 1d). The whole dataset consisted of 27,929 individual cells segmented from 804 FOVs across the three passages. The number of cells in each FOV varied across passages due to the increasing cell densities. Specifically, passage 0, passage 1, and passage 2 yielded 2,915 (10.4%), 8,714 (31.2%), and 16,300 (58.4%) cells, respectively (Fig. 1e). SLAM composite images of randomly selected cells from each cell line are shown in Fig. 1f. To minimize human bias in sample preparation, data collection, and data analysis, the identities of all cell lines remained unknown until all experiments and data analyses were completed.
Single-cell analysis pipeline and cell line classification
The ML-assisted single-cell analysis pipeline consisted of multiple data processing steps at pixel-, object- and feature-levels, as detailed in Fig. S4. Individual cells were segmented from images using the Cellpose deep neural network44. For each individual cell, 1480 pre-engineered features from six categories (i.e., size, shape, intensity, texture, colocalization, and granularity) were generated using CellProfiler45. Feature engineering was then conducted to clean the cell profiles and select the most informative set of features for each passage. Based on the selected features, ML classifiers were trained to classify individual cells into the four cell line classes (A, B, C, and D). While cells from monoclonal wells were assigned to one of the four classes during the training process, uncertain labels were given to cells in artificial pools. For instance, cells in the AB artificial pool were assigned two candidate classes (i.e., A or B but not C or D).
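The pool compositions and the candidate-class labelling described above can be enumerated in a few lines. This is only an illustrative sketch: the names `CELL_LINES`, `POOLS`, and `candidate_mask` are hypothetical and do not come from the study's code.

```python
from itertools import combinations

CELL_LINES = "ABCD"

# All mixtures of two or more lines: AB, AC, ..., BCD, ABCD (11 pools).
POOLS = ["".join(combo)
         for size in range(2, len(CELL_LINES) + 1)
         for combo in combinations(CELL_LINES, size)]

def candidate_mask(well):
    """One boolean per cell line: True if that line may be present in the well."""
    return [line in well for line in CELL_LINES]

# Monoclonal wells have a single candidate class; pool wells have several.
WELLS = list(CELL_LINES) + POOLS
MASKS = {well: candidate_mask(well) for well in WELLS}
```

With this encoding, a cell from the BC pool carries the mask `[False, True, True, False]`, meaning it may be B or C but not A or D.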
Five ML classifiers were employed and compared for the cell line classification task, including k-nearest neighbors (kNN), random forest (RF), gradient boosting classifier (GBC), support vector machine (SVM), and multi-layer perceptron (MLP). To train these ML classifiers with cells from both monoclonal wells and artificial pools, the Expectation-Maximization-based Iterative Label Refinement (EM-ILR) algorithm was developed, which iteratively refined the uncertain labels of training cells from artificial pools (Fig. S5). The classification performance improved over the EM-ILR iterations, while the label modification ratio decreased throughout the process (Fig. S5d). Ablation studies were conducted to compare the classification performance with and without EM-ILR. The results show that balanced accuracy, calculated exclusively for cells in monoclonal wells, decreased slightly with EM-ILR, whereas the impurity score, which incorporates both monoclonal wells and artificial pools, improved substantially (Fig. S6). These findings highlight the importance of including cells from artificial pools during training to achieve robust classification performance across the entire dataset.
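The general idea of iterative label refinement with candidate classes can be sketched as follows. This is one plausible reading of such a loop, not the EM-ILR procedure itself, whose details are given in Fig. S5 and may differ. A nearest-centroid scorer is swapped in for the ML classifiers purely to keep the example short, and the sketch assumes that every class has at least one monoclonal cell so that no class becomes empty.

```python
import numpy as np

def centroid_scores(X_fit, y_fit, X, n_classes):
    """Stand-in classifier: score = negative squared distance to each class centroid."""
    centroids = np.stack([X_fit[y_fit == c].mean(axis=0) for c in range(n_classes)])
    return -((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

def refine_labels(X, candidates, n_iter=20, seed=0):
    """Iteratively re-label cells without ever leaving each cell's candidate set.

    X          : (n_cells, n_features) feature matrix
    candidates : (n_cells, n_classes) boolean mask of allowed classes
    Returns the refined labels and the fraction of labels changed per iteration.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    candidates = np.asarray(candidates, dtype=bool)
    n_classes = candidates.shape[1]
    # Start from a random guess inside each candidate set.
    labels = np.array([rng.choice(np.flatnonzero(row)) for row in candidates])
    history = []
    for _ in range(n_iter):
        scores = centroid_scores(X, labels, X, n_classes)  # refit on current labels
        scores = np.where(candidates, scores, -np.inf)     # forbid excluded classes
        new_labels = scores.argmax(axis=1)
        history.append(float(np.mean(new_labels != labels)))
        labels = new_labels
        if history[-1] == 0.0:  # no label changed: converged
            break
    return labels, history
```

The `history` list plays the role of a label modification ratio: it records the fraction of labels changed in each iteration and reaches zero once the labels stop changing.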
The evaluation of cell line classification performance was conducted using 10-fold Monte Carlo cross-validation, where the dataset was randomly partitioned into a training set (70% of cells) and a test set (30% of cells) for each cross-validation fold. In addition to conventional evaluation metrics (e.g., balanced accuracy, precision, recall), an impurity score was defined to measure the classification performance on cells with uncertain labels. Specifically, a predicted cell line class that was not among a cell's candidate classes was considered a wrong prediction (e.g., a cell from a BC artificial pool that was predicted as A or D). The impurity score, where a lower value indicates better performance, is then defined as the ratio of incorrectly classified cells among all cells in the test set. The following sections discuss in detail the cell line classification results from the top-performing ML classifier (i.e., MLP trained with EM-ILR) (Fig. S7).
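The evaluation scheme described above can be sketched as follows, assuming class indices 0 to 3 for cell lines A to D and a boolean candidate mask per cell. The function names are illustrative, and balanced accuracy is taken here as the mean of per-class recall.

```python
import numpy as np

def impurity_score(pred, candidates):
    """Fraction of cells whose predicted class lies outside their candidate set."""
    pred = np.asarray(pred)
    candidates = np.asarray(candidates, dtype=bool)
    allowed = candidates[np.arange(len(pred)), pred]
    return 1.0 - float(allowed.mean())

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (computed on monoclonal-well cells in the text)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))

def monte_carlo_splits(n_cells, n_folds=10, test_frac=0.3, seed=0):
    """Yield (train_idx, test_idx) for repeated random 70/30 partitions."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_frac * n_cells))
    for _ in range(n_folds):
        perm = rng.permutation(n_cells)
        yield perm[n_test:], perm[:n_test]
```

As a usage example, a BC-pool cell predicted as A counts towards the impurity score, whereas the same cell predicted as B or C does not, even though its true identity is unknown.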
Cross-validated cell line classification results
Label-free multimodal nonlinear optical microscopy demonstrated remarkable capabilities in differentiating CHO cell lines with varying production phenotypes in early passages, as substantiated by the cross-validated classification results presented in Fig. 2. When comparing the classification performance of each passage, the same number of cells were randomly sampled from passages 1 and 2 to match the cell number in passage 0 (Fig. 2a–c). Among the three passages, comparable impurity scores were observed for passage 1 and passage 2 (p value = 0.209), while passage 0 had the highest impurity scores (0.128 ± 0.013). Similar trends were observed when evaluating cell line predictions in monoclonal wells, where the balanced accuracies of passage 0, passage 1, and passage 2 were 0.826 ± 0.028, 0.902 ± 0.021, and 0.898 ± 0.018, respectively. The p value of the statistical comparison of balanced accuracies between passage 1 and passage 2 is 0.663.
a–c Classification performance using an equal number of cells from each passage for training and testing. d–f Classification performance using all cells from each passage for training and testing. For each cross-validation fold, balanced accuracies were calculated on test set cells from monoclonal wells, while impurity scores were calculated on test set cells from all wells. a, d Scatter plots illustrate the cross-validated classification performance of the three passages. Each data point represents the result of one cross-validation fold. Bar plots in (b, e) show the balanced accuracies (higher is better) of test cells from monoclonal wells, while (c, f) present the impurity scores (lower is better) calculated on test cells from all wells. Error bars represent the standard deviations. Cell line predictions for monoclonal wells and artificial pools in different passages are shown in (g–i), (j–l), and (m–o) for passage 0, passage 1, and passage 2, respectively. g, j, m Confusion matrices illustrating the cell line classifications of monoclonal wells. h, k, n Ratio of cells with different predicted cell line classes in different monoclonal wells and artificial pools. i, l, o Visualization of cell line predictions in randomly selected FOVs from monoclonal wells and artificial pools, where cells are color-coded by their predicted cell line classes.
Enhanced classification performance was observed for passage 1 and passage 2 when using all test set cells from these passages (Fig. 2d–f). The balanced accuracies increased to 0.952 ± 0.005 and 0.968 ± 0.004 for passage 1 and passage 2, respectively (p value = 6.681 × 10⁻⁷), and the impurity scores of passage 1 and passage 2 dropped to 0.041 ± 0.003 and 0.022 ± 0.002, respectively (p value = 2.412 × 10⁻¹¹). Other evaluation metrics (i.e., F1 score, precision, recall, Jaccard index, Area Under the Receiver Operating Characteristic Curve (AUC)) calculated in the two experiments are reported in Tables S2 and S3. Confusion matrices were generated based on cells from monoclonal wells (Fig. 2g, j, m). Notably, cell line B exhibited the highest sensitivity scores across all three passages, while cell line A exhibited a relatively higher rate of misclassification, with 33.9% of cells being incorrectly classified in passage 0. The visualization of cell line predictions in both monoclonal wells and artificial pools is depicted in Fig. 2h, k, and n, demonstrating the ratios of cells with various predicted cell line classes. Additionally, Fig. 2i, l, and o provide detailed representations of single-cell predictions in both monoclonal wells and artificial pools, featuring color-coded cells based on their predicted cell line class.
Visualization of cell line distributions across passages
The characteristics of different cell lines across passages were visualized using dimensionality reduction techniques. The t-distributed Stochastic Neighbor Embedding (t-SNE) plots in Fig. 3a–c depict the distributions of cells from each cell line in passage 0, passage 1, and passage 2, respectively. Cells with similar profiles were grouped close to each other. In addition, the Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) visualizations are shown in Fig. S8. The separation between cell line distributions became more pronounced in passage 1 and passage 2. Additionally, cell line B formed distinct clusters that were clearly separated from the other cell lines in those passages. To provide more details of different cell line distributions, single-cell SLAM composite images corresponding to the dots in the t-SNE plots were sampled and visualized in the same location for all three passages (Fig. 3d–f). The individual channels of SLAM (i.e., 2PF, 3PF, THG) are shown in Fig. S9. In addition, redox ratio images and NAD(P)H mean lifetime images are presented in Fig. 3g–i and Fig. 3j–l, respectively. Based on these results, several key observations were made. In passage 0, the four cell lines exhibited similar characteristics in morphology, SLAM intensities, redox ratio, and NAD(P)H mean lifetime. In passage 1 and passage 2, by contrast, cell line B exhibited unique characteristics, including larger cell size and higher FAD and NAD(P)H intensities compared to the other cell lines. Furthermore, cell line B displayed a higher redox ratio and longer NAD(P)H mean lifetime within the cellular environment. On the other hand, cell line A showed slightly larger cell sizes and longer NAD(P)H mean lifetime compared to cell lines C and D, while cell lines C and D appeared visually similar. The qualitative inspection provided by these visualizations serves as an initial step in identifying cell line differences.
However, heterogeneity within each cell line was observed across passages, posing challenges in visually identifying the distinct characteristics of each cell line. In the following sections, further analyses were conducted to provide a quantitative investigation into the importance of various features and optical characteristics associated with the different cell lines.
a–c t-SNE plots illustrating the distributions of cells from four cell lines in passage 0, passage 1, and passage 2, respectively. Each dot represents one cell in the test set. d–f Single-cell SLAM composite images corresponding to the dots in the t-SNE plot were sampled and visualized in the same location for the three passages. Additionally, g–i present optical redox ratio images, while (j–l) display NAD(P)H mean lifetime images obtained from the phasor analysis. All scale bars represent 100 µm.
Feature importance ranking for cell line classification
Further investigations were conducted to assess the contribution and importance of cell features across six categories, providing insights into the salient optical signatures for differentiation of cell lines at early time points. During the feature selection procedure, the Recursive Feature Elimination (RFE) algorithm was employed to recursively remove features that were less relevant to the cell line classification task (Fig. 4a–c). The importance of features can then be inferred by the order of feature removal, with features removed earlier in the RFE procedure being less relevant to the classification task. Here, we use this information to analyze the importance of different categories of features. Specifically, based on the order of removal, all features (1480 features in total) were sorted and grouped into 15 feature ranking groups, each of which had 100 features, except for the last group, which contained 80 features (Fig. 4d–f). The first feature ranking group has the highest importance for cell line classification.
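The grouping step described above can be sketched as follows. The sketch assumes that the RFE removal order is available as a list of feature indices (first removed first), with any features that survive to the end of RFE appended last. The function names are illustrative.

```python
import numpy as np

def ranking_groups(removal_order, group_size=100):
    """Split features into ranking groups, most important group first.

    removal_order : feature indices in the order RFE removed them
                    (earliest removed = least relevant).
    """
    ranked = np.asarray(removal_order)[::-1]  # last removed comes first
    return [ranked[i:i + group_size] for i in range(0, len(ranked), group_size)]

def category_ratios(group, categories):
    """Share of each feature category within one ranking group."""
    categories = np.asarray(categories)
    names, counts = np.unique(categories[group], return_counts=True)
    return dict(zip(names.tolist(), (counts / counts.sum()).tolist()))
```

With 1480 features and a group size of 100, this yields 15 groups, the last of which holds the 80 features that were removed earliest.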
During feature extraction, six categories of features were calculated for each cell, including area and shape, correlation, granularity, intensity, radial distribution, and texture. a–c The trends of the classification performance during the feature selection process using RFE for each passage. The error bars represent the standard deviations of balanced accuracies at each iteration of RFE. The red dashed lines indicate the number of features and the mean balanced accuracy of the optimal subsets of features. d–f Ratios of features from each category in all feature ranking groups, where group 1 represents the last 100 features retained during RFE (most important). g–i The distribution of features in high-to-low feature ranking groups for each feature category. j–l Pie charts present the ratios and counts of features from different categories in the optimal feature subsets determined by RFE.
Notably, a significant proportion of features in the first feature ranking group belong to the correlation features, which measure the colocalization and co-occurrence of intensities between image channels (Fig. 4d–f). In addition, the ridgeline plots in Fig. 4g–i show the distribution of features in high-to-low feature ranking groups for each feature category. A large portion of correlation features (peaks in the ridgeline plots) were retained in the first five feature ranking groups across all passages. Compared to other feature categories, the peaks of the correlation features are closer to the left. However, the long tails in the plots indicate that not all correlation features were relevant to the cell line classification task. Moreover, Fig. 4j–l show the ratios of different categories of features in the optimal set of features, which yielded the best classification performance. The sizes of the optimal feature set of passage 0, passage 1, and passage 2 are 40, 80, and 70, respectively (Fig. 4a–c). Notably, correlation features constituted 80% or more of the optimal sets of features in all three passages. The rest of the features (less than 20%) were from the intensity and texture categories.
The importance of individual features was analyzed within the optimal set of features selected by RFE, offering fine-grained feature importance rankings. Here, the feature importance was quantitatively measured using permutation feature importance (PIMP), which is defined as the decrease in the classification accuracy when the feature value is randomly permuted46. Features with high PIMP scores are considered more important to the cell line classification task. For passage 2, the top 30 most important features are ranked and illustrated in Fig. 5a. In addition to their PIMP scores, the category of the feature, the related imaging modality (i.e., SLAM or FLIM), and the median profiles of the four cell lines are visualized in Fig. 5a. In addition, for the top 6 most important features, the distributions of feature values of different cell lines are shown in Fig. 5b. To directly visualize the contributions of different features for the differentiation of cell lines, the feature values of individual cells were visualized in the t-SNE plot of passage 2 (Fig. 5c, d). The top 30 feature importance rankings for passage 0 and passage 1 are reported in Fig. S10.
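Permutation feature importance, as defined above, can be sketched with a generic `predict` callable. The sketch uses plain accuracy and averages over a few random permutations; the number of repeats and the function name are assumptions for illustration, not settings reported in the study.

```python
import numpy as np

def permutation_feature_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn.

    predict : callable mapping a feature matrix to predicted class labels
    X, y    : held-out features and true labels
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    baseline = np.mean(predict(X) == y)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break feature-label link
            drops.append(baseline - np.mean(predict(X_perm) == y))
        scores[j] = np.mean(drops)
    return scores
```

A feature that the classifier ignores receives a score of exactly zero, because shuffling it leaves every prediction unchanged.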
a Permutation feature importance of the top 30 most important features for cell line classification in passage 2. For each feature, the associated imaging modality, feature category, and median profiles of the four cell lines are illustrated. The definitions of different types of correlation features can be found in Supplementary Note 1. b Violin plots show the value distributions of the top 6 most important features. Mann–Whitney U tests were employed to compare the feature values of different cell lines. c, d The contribution of the top-6 most important features for the classification of the four cell lines. c Cells in the t-SNE plot of passage 2 are color-coded by their corresponding cell line labels. The feature values of the top 6 features are overlaid onto the t-SNE plots in (d).
Importance ranking of imaging modalities
The imaging modalities used in this study exhibited different levels of importance for the CHO cell line classification task. To rank the importance of imaging modalities, the cell line classification performance was compared when using different subsets of imaging modalities, including three single-channel intensity sets (i.e., 2PF, 3PF, THG) and two combinations of modalities (i.e., SLAM intensities, FLIM). The SLAM subset consists of 2PF, 3PF, THG channels, and the redox channel, which was derived from 2PF and 3PF channel intensities. For the FLIM subset, the intensity channel (3PF) and lifetime estimation results were used. Based on the selected image channel(s), cellular features were extracted and processed following the same procedure as the aforementioned analysis using all imaging modalities. Correlation features, which measure the correlation between the intensities of different image channels, were absent in the experiments using a single imaging modality. Morphological features (i.e., area and shape) that were calculated from the cell masks were shared among all imaging modalities. Using the same classifier (i.e., MLP trained with EM-ILR), the classification performance of all three passages is compared among imaging modalities. For the three passages, the balanced accuracies and impurity scores of individual cross-validation folds are shown in Fig. 6a–c. Bar plots in Fig. 6d–f show the mean and standard deviation of classification performance for different imaging modalities across passages. Additionally, Welch’s unequal variances t-tests were conducted to compare the classification performance between different subsets of imaging modalities. A full list of p values from the statistical analysis is reported in Tables S4–S6.
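A comparison of per-fold scores between two modality subsets with Welch's unequal variances t-test can be sketched with SciPy's `ttest_ind` and `equal_var=False`. The per-fold values below are made up for illustration and are not results from this study.

```python
import numpy as np
from scipy import stats

def welch_t_test(scores_a, scores_b):
    """Welch's unequal variances t-test on two sets of per-fold scores."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    result = stats.ttest_ind(a, b, equal_var=False)
    return float(result.statistic), float(result.pvalue)

# Made-up per-fold balanced accuracies, for illustration only (not from the paper).
example_a = [0.95, 0.96, 0.95, 0.94, 0.96, 0.95, 0.96, 0.95, 0.94, 0.95]
example_b = [0.90, 0.91, 0.89, 0.92, 0.90, 0.88, 0.91, 0.90, 0.89, 0.91]
t_stat, p_value = welch_t_test(example_a, example_b)
```

Welch's test does not assume equal variances, which suits comparisons where one modality subset gives more variable cross-validation scores than another.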
The results from passage 0, passage 1, and passage 2 are shown in (a, d), (b, e), and (c, f), respectively. a–c Classification performance for each passage using individual imaging modalities (i.e., 2PF, 3PF, THG) and combinations of imaging modalities (i.e., SLAM, FLIM, SLAM + FLIM). Each dot in the plots represents the result of one fold of the 10-fold Monte Carlo cross-validation. d–f Balanced accuracies and impurity scores of all experiments are shown in the bar plots. The error bars indicate the standard deviations. Welch’s unequal variances t-tests were used for the statistical comparison of classification performance between different imaging modalities.
Discussion
In this study, we assessed label-free optical microscopy techniques for the early differentiation of biopharmaceutical CHO cell lines with varying production phenotypes. We employed a custom-built multimodal optical imaging system that simultaneously captured SLAM and NAD(P)H FLIM images, providing both structural and metabolic information. Single-cell analysis was performed using an ML-assisted analysis pipeline, extracting a broad range of features to distinguish cell lines with varying productivity and stability. The classification results demonstrated the potential of this approach to effectively differentiate between CHO cell lines in early passages. In addition, the data inspection and feature interpretation steps provided valuable insights into the relationship between optical properties and phenotypic variations among these cell lines.
The inclusion of both monoclonal wells and artificial pools in the training process enhanced the robustness and reliability of the classification model, improving its performance in handling the complexities and variations that exist in more real-world scenarios. The classification results revealed phenotypic differences among CHO cell lines across the three passages, with the poorest performance observed in passage 0 (Fig. 2). This is likely due to the cells not yet reaching their optimal growth state, as they were imaged shortly after revival, resulting in less distinct phenotypic differences and greater variability. In contrast, the classification performance improved in later passages. When the same number of cells from passage 1 and passage 2 were analyzed, impurity scores and balanced accuracies were comparable (Fig. 2a–c). However, including all cells from these passages significantly enhanced classification performance, underscoring the impact of sample size on results (Fig. 2d–f). Despite similar overall performance between passages 1 and 2, subtle differences in predictive accuracy for specific cell lines persisted. These variations may be attributed to inherent disparities in the optical characteristics of these cell lines.
To explore the distinct phenotypic profiles of different cell lines, we utilized various data inspection and feature interpretation techniques. t-SNE visualization provided qualitative insights into cell heterogeneity across passages (Figs. 3 and S9). Cell line B exhibited clear morphological differences in passages 1 and 2, while differences among the other cell lines, particularly C and D, were less pronounced. Moreover, within-cell-line heterogeneity was evident across passages, complicating the visual assessment of differences among cell lines. These findings highlight the importance of quantitative image-based cell profiling for identifying optical characteristics specific to each cell line. The feature category importance analysis revealed that correlation features played a pivotal role in the classification of cell lines, surpassing other feature categories assessed in this study (Fig. 4). This aligned with our t-SNE visualization results, suggesting that morphological features alone are insufficient for distinguishing all cell lines. In contrast, correlation features, which capture the colocalization of signals between channels, proved significant due to their ability to infer and quantify biomolecular interactions47,48. Past research has effectively utilized these features to explore protein-protein interactions, molecular colocalization, and cellular signaling pathways49. The key role of correlation features in cell line classification emphasizes the value of colocalization analysis in extracting valuable insights from multimodal bioimages, as well as the benefits of multimodal imaging for understanding cellular phenomena.
Consistent with our earlier feature category importance analysis, correlation features dominated the top 30 features with the highest PIMP scores across all three passages (Figs. 5 and S10). However, specific features and their rankings varied by passage. In passage 2, where the best classification performance was achieved using all cells, the lower quartile intensity of the redox ratio channel (the redox ratio value below which a quarter of the pixels in the cell fall) was the top feature (Fig. 5a). Cell line B showed significantly higher values for this feature compared to the other cell lines. The Manders overlap coefficient (MOC) of 3PF intensity and phasor mean lifetime was the second most important feature in passage 2 (Fig. 5a). The MOC modifies Pearson’s correlation coefficient by omitting the subtraction of the average pixel intensities from the original intensity values50, leading to a value range of 0 to 1. Higher MOCs between those channels were reported in cell line A, meaning that the co-occurrence of high NAD(P)H intensity and long NAD(P)H mean lifetime was more pronounced in those cells. Further investigation is needed to biologically interpret these top-ranking features and their relation to cell line productivity and stability. Notably, all features listed in the feature importance ranking list had positive PIMP scores. This implies that permuting the value of any of these features would result in a decrease in classification accuracy. Furthermore, through violin plots and statistical analysis, we observed that the majority of top-ranking features are informative for differentiating a subset of cell lines rather than all four cell lines (Fig. 5b). These observations suggest that the ML classifiers relied on the collective contribution of all these features to achieve the reported classification performance.
The relevance of the individual imaging modalities to cell line classification varied across the passages (Fig. 6). In passage 0, the best classification performance was achieved using all imaging modalities. However, in later passages, comparable classification performance was reported between using FLIM only and using all modalities, with FLIM outperforming SLAM in all passages. This superiority of FLIM can be attributed to its ability to capture the fluorescence lifetime of intrinsic fluorophores, rendering it sensitive to cellular metabolic activities and molecular interactions. In addition, compared to the imaging modalities in SLAM, FLIM exhibited greater resilience to variations in illumination conditions, leading to more reliable and consistent feature readouts. When comparing the classification performance between SLAM and its individual channels, notable enhancements were observed in all three passages, demonstrating the added value of multimodal imaging. It is worth noting that the importance ranking of imaging modalities was calculated based on four specific CHO cell lines. Different cell lines might yield different rankings, necessitating a careful and individualized assessment.
While this study has provided valuable insights into the applicability of multimodal optical imaging for biopharmaceutical cell line classification, several limitations must be noted. First, due to constraints in availability, our investigation focused on a specific set of CHO cell lines used in mAb production. The generalizability of our findings to other CHO cell lines or biopharmaceutical production systems requires further investigation. Second, the classification performance may benefit from larger datasets and further refinement of ML models. Imperfections in artificial pool predictions highlight the complexity and heterogeneity within these pools, suggesting the need for further optimization of the single-cell analysis pipeline and ML classifiers to enhance accuracy in order to meet real-world production standards. Moreover, the current analysis pipeline prioritizes interpretability, employing a step-by-step approach with hand-engineered features, which can lead to some degree of information loss and reduced computational efficiency. For future cell line selection applications in larger datasets and high-throughput environments, improvements are necessary to increase both computational efficiency and overall performance. Nonetheless, findings from this study, particularly the significance of correlation features, will inspire the design of more efficient pipelines. Although many studies have indicated a relationship between optical biomarkers and the metabolic status of cells51,52,53, the interpretation of the top-ranking features and their biological significance to biopharmaceutical cell line characterization remains a subject for future research, necessitating in-depth molecular and cellular investigations. Lastly, the throughput of the lab-built multimodal optical imaging system is lower than that of established high-throughput techniques, such as flow cytometry, which can analyze large numbers of cells at a rapid rate.
Although flow cytometry is better suited for large-scale screening, it lacks the spatial and metabolic detail provided by our technique. To bridge this gap, future efforts should aim to increase the speed and scalability of multimodal imaging, potentially through automation or parallelization, while preserving its high-content advantages. This approach could enable real-time, in-line monitoring and early cell line selection in biopharmaceutical manufacturing.
It is important to highlight that the applications of the proposed label-free multimodal nonlinear optical microscopy and ML-assisted single-cell analysis techniques extend beyond biopharmaceutical cell line classification, with significant implications for comprehensive single-cell characterization. These methods have potential applications across various fields, including stem cell research, immunology, and cancer biology, where understanding the intricacies of cellular behavior at the single-cell level is paramount. In stem cell research, they could offer insights into the differentiation pathways and pluripotency of stem cells, enabling researchers to track cellular fate decisions with high precision54. In immunology, they could aid in deciphering the complex interactions between immune cells, characterizing immune cell activation states, and elucidating the mechanisms underlying immune responses and diseases55. In cancer biology, these methodologies hold promise for unraveling cancer cell transformation, resolving the cellular heterogeneity within tumors, and shedding light on the diverse cell populations contributing to tumor progression, metastasis, and therapeutic resistance56. Additionally, they could facilitate the study of developmental biology, neurobiology, and beyond, offering new insights into the fundamental processes governing life at the cellular level.
In conclusion, this study demonstrated the immense potential of integrating label-free multimodal nonlinear optical microscopy with ML-assisted single-cell analysis techniques for biopharmaceutical cell line characterization and selection. The diverse structural and metabolic information provided by these techniques is crucial for capturing cell line heterogeneity and accurately distinguishing different phenotypes. Our findings highlight the significance of employing ML-assisted single-cell analysis techniques to address the limitations of visually inspecting and distinguishing cell lines solely based on morphological features. This study represents a key step toward understanding the relationship between optical characteristics and production phenotypes, paving the way for improved cell line characterization and early selection in biopharmaceutical research and development. The expedited cell line selection process holds potential in reducing the costs of biopharmaceutical drug development, cutting down drug expenses, and improving overall drug accessibility. Looking forward, we envision that the versatility of the proposed methodologies extends far beyond biopharmaceutical cell line selection, promising to reshape our understanding of cellular biology and accelerate advancements across a wide range of scientific disciplines.
Methods
Imaging system setup
A custom-built multimodal nonlinear optical microscopy system was used in this study, similar to previously published systems57,58. Briefly, a 10 MHz, 1040 nm pulsed fs laser source (FemtoTrain, Spectra-Physics) was used to pump a photonic crystal fiber (LMA-PM-15, NKT Photonics) to create a broad supercontinuum. The output of the PCF was directed to a pulse shaper (FemtoJock, Biophotonic Solutions) for temporal compression and selection of optimal wavelengths, resulting in an excitation beam with a bandwidth of 995–1115 nm and a pulse width of 58 fs at the sample plane to simultaneously excite four different types of endogenous contrast38,57. Two galvanometer-mounted mirrors (Cambridge Tech) were used for point raster scanning. An inverted epi-detection setup was used with a 25× water immersion objective lens with 1.05 NA (XLPLN25XWMP2, Olympus). After a 735 nm dichroic mirror (FF735-Di02-25) separated excitation from emission, four separate channels were used to collect emitted nonlinear optical signals from the cells (THG intensity, three-photon excited NAD(P)H autofluorescence intensity and lifetime, SHG intensity, and two-photon excited FAD autofluorescence intensity), with optical filters used to separate the channels listed in Table S7. The three intensity-only channels (THG, SHG, and FAD) were collected using photon-counting PMTs (H7421-40, Hamamatsu) and time-resolved detection was used for the NAD(P)H channel in order to collect information on both intensity and fluorescence lifetime using a hybrid photodetector (PMA 40 Hybrid, PicoQuant) and a time-to-digital converter (HydraHarp 400, PicoQuant) for Time-Correlated Single Photon Counting (TCSPC). Samples were mounted on a motorized XY-axis piezo stage (SLC-24150-LC, SmarAct, Germany) and the objective lens was mounted on a piezoelectric positioning system (SLC-24120-LC, SmarAct) to adjust the focus. Image acquisition was controlled through a custom LabView (National Instruments) program.
CHO cell line characterization
Four single-origin recombinant CHO cell lines with distinct production phenotypes were provided by GSK (Stevenage, United Kingdom). To measure the product phenotypes of the cell lines, an ambr®15 (Sartorius Stedim Biotech) production run was conducted using GSK’s proprietary platform process in a fed-batch mode. The Peak VCC represents the highest viable cell count achieved during the production run, while the Peak Titre indicates the maximum amount of product generated during the process. UpSPR quantifies the antibody production rate on a per-cell basis, providing information about the efficiency of the cell line in generating the product. Stability refers to the sustained performance of the cell line over multiple generations. It was evaluated by observing the percentage drop in titre across a range of 80–100 generations. A stability threshold of 30% serves as an indication of the cell line’s ability to maintain consistent productivity over an extended period.
Cell line preparation and imaging
Upon receipt of the four CHO cell lines (assigned A, B, C, D for blinded investigation), vials with cells were stored in vapor phase liquid nitrogen. For the experiments, CHO cells were thawed and cultured at 5% CO2 at 37 °C in a shaking incubator rotating at 140 rpm. An optimized proprietary chemically defined medium supplemented with 25 μM methionine sulfoximine (MSX) acquired from GSK was used to culture and maintain the cells. Viability and cell count were measured on the second day using a Vi-CELL XR Cell Viability Analyzer from Beckman Coulter (Indianapolis, IN). The cultures were passaged to 0.3 × \({10}^{6}\) cells/mL in 30 mL media at passage 1 (3 days post revival) and passage 2 (7 days post revival).
Cells were imaged at passage 0 (1 day post revival), at passage 1 (3 days post revival), and at passage 2 (7 days post revival). For each experiment, 2 mL of each cell culture was extracted during passaging. After viability and cell count were measured, each cell line was diluted to match the cell line with the lowest concentration, so that all concentrations were equal while the number of cells available for imaging was maximized. Artificial pools were prepared and plated at a volume of 100 μL per well on a #1.5H glass-bottom 96-well plate from CellVis (Mountain View, CA). After plating, the well plate was placed in a stage-top incubator (Tokai Hit, Shizuoka-ken, Japan) in which the temperature, humidity, and CO2 were controlled to 37 °C, >95%, and 5%, respectively. The cells were then imaged on the multimodal nonlinear optical microscope with an average power of 31 mW at the sample. Each field of view (FOV) contained 512 × 512 pixels, covering an area of 180 × 180 μm. Five frames were captured and summed for each FOV. The acquisition time for a single frame was approximately 5.3 seconds (0.19 frames per second), resulting in a total imaging time of around 26.5 seconds per FOV. For each well, cells were imaged approximately 15 μm above the bottom of the glass plate. Six FOVs were imaged per well. Three replicates were created for each monoclonal cell line and each artificial pool across passages. The plate layout design is shown in Fig. S3. After imaging each replicate, FAD (5 mM) and NADH (10 mM) solution samples were prepared and imaged, which were later used in intensity normalization and illumination correction of 2PF and 3PF channels, mitigating potential variability caused by fluorescence signal fluctuations and inhomogeneous background illumination.
Single-cell analysis pipeline
A single-cell analysis pipeline was developed to extract features from individual cells in the multi-channel images. The pipeline includes image preprocessing, cell segmentation, cell selection, feature extraction, feature engineering, and downstream analysis (Fig. S4). During image preprocessing, the 2PF and 3PF channels were normalized using pixel-wise division with the FAD and NADH solution images collected after imaging each replicate. The optical redox ratio was calculated based on the normalized intensity values of the 2PF channel (FAD) and 3PF channel (NAD(P)H) using the formula:
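The expression itself is not reproduced in this text. One widely used normalized form of the optical redox ratio is shown below purely as an assumption; the exact definition used in this study may differ, and the symbols \(I_{\mathrm{FAD}}\) and \(I_{\mathrm{NAD(P)H}}\) (the normalized 2PF and 3PF pixel intensities) are introduced here only for illustration:

\[
\mathrm{Redox\ ratio} = \frac{I_{\mathrm{FAD}}}{I_{\mathrm{FAD}} + I_{\mathrm{NAD(P)H}}}
\]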
For the raw FLIM images, the average photons per pixel (PPP) in cell regions was 28.13 (Fig. S11a). The mean total photon count per cell was 3.17 × \({10}^{4}\) (Fig. S11b). The mean of the average PPP per cell (i.e., total photon count per cell divided by cell size in pixels) was 26.86 (Fig. S11c). The temporal bin size of the raw FLIM images was 8 picoseconds. Temporal binning was applied by summing every 8 bins. This resulted in FLIM images with a temporal bin size of 64 picoseconds. Spatial mean filtering with a 5×5-pixel kernel was applied to the FLIM images. A comparison of the FLIM intensity image and fluorescence decay profiles before and after mean filtering is shown in Fig. S12c, d. The fluorescence lifetime of NAD(P)H was estimated using phasor analysis and a least-squares fitting method (i.e., the Levenberg-Marquardt algorithm, LMA)59. The fluorescence lifetime estimation process generated 7 additional channels (Fig. S2). The phasor analysis provided three additional channels (i.e., phasor g, phasor s, and phasor mean lifetime). The phasor coordinates \({g}_{i,j}\left(\omega \right)\) and \({s}_{i,j}\left(\omega \right)\) are computed using the equations:
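The equations are not reproduced in this text. The standard phasor transform is given below on the assumption that it matches the authors' form; \(I_{i,j}(t)\) denotes the decay histogram at pixel \((i,j)\), a symbol introduced here for illustration, and \(T\) is taken as the upper limit of integration:

\[
g_{i,j}(\omega) = \frac{\int_{0}^{T} I_{i,j}(t)\cos(n\omega t)\,dt}{\int_{0}^{T} I_{i,j}(t)\,dt},
\qquad
s_{i,j}(\omega) = \frac{\int_{0}^{T} I_{i,j}(t)\sin(n\omega t)\,dt}{\int_{0}^{T} I_{i,j}(t)\,dt}
\]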
where \(n\) and \(\omega\) are the harmonic number and the angular frequency of excitation, respectively, and \(T\) is the repetition period of the acquisition. The phasor mean lifetime (τ) was calculated using the formula:
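The formula is not reproduced in this text. The usual phase-lifetime expression for the first harmonic (\(n = 1\)) is shown below as an assumption about the form used:

\[
\tau = \frac{1}{\omega}\cdot\frac{s_{i,j}(\omega)}{g_{i,j}(\omega)}
\]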
To increase the dynamic range of phasor g and s values, we increased the frequency from 10 MHz (laser repetition rate) to 74 MHz by treating the duration of the fitting range (i.e., the analysis window of the decay curve) as one period of the cycle. The fitting range consisted of 210 time bins, which lasted 13.44 ns (64 picoseconds/bin). Importantly, this adjustment enhanced the resolution of the phasor plot without affecting the phasor mean lifetime. The biexponential fluorescence decay curve fitting using LMA can be represented by the following formula:
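The formula is not reproduced in this text. The biexponential model implied by the symbol definitions that follow can be written as:

\[
I(t) = A_{1}\,e^{-t/\tau_{1}} + A_{2}\,e^{-t/\tau_{2}} + C
\]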
where \(I\left(t\right)\) is the fluorescence intensity at time \(t\). \({A}_{1}\) and \({A}_{2}\) are the amplitudes of the respective exponential components, \({\tau }_{1}\) and \({\tau }_{2}\) are the fluorescence lifetimes of the short and long exponential components, respectively, and \(C\) is a constant offset. The two exponential components represented the free and bound NAD(P)H, resulting in four additional channels, which were: free NAD(P)H lifetime (i.e., \({\tau }_{1}\)), free NAD(P)H intensity (i.e., \({A}_{1}\)), bound NAD(P)H lifetime (i.e., \({\tau }_{2}\)), and bound NAD(P)H intensity (i.e., \({A}_{2}\)). Deconvolution was not applied in either the phasor or the LMA analysis due to the short instrument response function (IRF). The starting point of the fluorescence decay curve was set to the time bin with the highest photon count, determined by summing the decay histograms across all image pixels. Meanwhile, the SHG channel from SLAM, which mainly shows the signals from collagen fibers, was removed from the following single-cell analysis due to low intracellular signal, as expected.
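The phasor calculation described above can be illustrated with a minimal sketch. It is not the authors' implementation (the study reports using FlimLib); it assumes a decay histogram already cropped to start at the peak bin, evaluates the first harmonic at bin-center times, and uses a synthetic mono-exponential decay. The names `phasor`, `BIN_NS`, `N_BINS`, and the 2 ns test lifetime are hypothetical.

```python
import math

BIN_NS = 0.064                    # bin width after temporal binning (64 ps)
N_BINS = 210                      # time bins in the fitting range
WINDOW_NS = BIN_NS * N_BINS       # 13.44 ns analysis window
FREQ_MHZ = 1e3 / WINDOW_NS        # one cycle per window, about 74.4 MHz
OMEGA = 2 * math.pi / WINDOW_NS   # angular frequency in rad/ns


def phasor(decay):
    """Return (g, s, tau_ns) for one decay histogram (first harmonic)."""
    total = sum(decay)
    g = s = 0.0
    for k, counts in enumerate(decay):
        t = (k + 0.5) * BIN_NS    # bin-center time: an assumption of this sketch
        g += counts * math.cos(OMEGA * t)
        s += counts * math.sin(OMEGA * t)
    g /= total
    s /= total
    return g, s, s / (g * OMEGA)  # phase lifetime tau = s / (g * omega)


# Synthetic mono-exponential decay with a 2 ns lifetime (illustrative only).
decay = [1000.0 * math.exp(-(k + 0.5) * BIN_NS / 2.0) for k in range(N_BINS)]
g, s, tau_ns = phasor(decay)
```

Treating the 13.44 ns window as one period gives a frequency of roughly 74.4 MHz, in line with the 74 MHz stated above, and the recovered lifetime for the synthetic decay stays close to the simulated 2 ns.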
The segmentation of individual cells from all images was achieved by using a pretrained deep neural network, Cellpose, which generated masks for individual cells based on the SLAM composite images44. During cell selection, dead cells and cells touching the border of an image were removed. Dead cells were characterized as the ones with abnormally high NAD(P)H intensity, which has been reported to be a biomarker of cell death60. A simple thresholding method was used to remove these dead cells. Cells with mean NAD(P)H intensity (total photon count per cell divided by cell size in pixels) values above 60 were classified as dead and excluded from further analysis. Based on the multimodal images and corresponding cell masks, cellular features were extracted using the CellProfiler software45. In total, 1480 features were calculated using the CellProfiler Measurement modules, which measure area, shape, intensity, intensity distribution, texture, colocalization, and granularity features. All features, together with their corresponding CellProfiler Measurement modules and image channels, are listed in Table S8. The definitions of different types of correlation features used in this study are provided in Supplementary Note 1. The CellProfiler modules are illustrated in detail in the software’s documentation (https://cellprofiler.org/manuals).
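The cell selection rule can be sketched as a simple filter. This is an illustrative sketch rather than the study's code: the dictionary keys (`bbox`, `photon_count`, `area_px`) and the toy cells are hypothetical, and only the threshold of 60 and the 512-pixel image size come from the text.

```python
DEAD_THRESHOLD = 60.0   # mean NAD(P)H photons per pixel, from the text
IMAGE_SIZE = 512        # pixels per side of one field of view


def keep_cell(cell):
    """cell: dict with hypothetical keys 'bbox', 'photon_count', 'area_px'."""
    x_min, y_min, x_max, y_max = cell["bbox"]  # inclusive pixel bounds
    touches_border = (x_min == 0 or y_min == 0
                      or x_max == IMAGE_SIZE - 1 or y_max == IMAGE_SIZE - 1)
    mean_intensity = cell["photon_count"] / cell["area_px"]
    return not touches_border and mean_intensity <= DEAD_THRESHOLD


# Toy cells: one healthy, one touching the border, one above the threshold.
cells = [
    {"id": 1, "bbox": (100, 120, 160, 175), "photon_count": 31700, "area_px": 1180},
    {"id": 2, "bbox": (0, 40, 55, 90), "photon_count": 20000, "area_px": 900},
    {"id": 3, "bbox": (300, 300, 350, 350), "photon_count": 90000, "area_px": 1200},
]
kept_ids = [c["id"] for c in cells if keep_cell(c)]
```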
The feature engineering procedure consisted of data cleaning, feature normalization, and feature selection. Features with a significant ratio (20% in this study) of missing values were removed. In addition, the zero-variance features were excluded from the cell profile table. Z-score normalization was applied to all features. During feature selection, recursive feature elimination (RFE) was implemented to select the most informative set of features for the classification of cell lines. Using a linear support vector machine classifier, RFE recursively selected a smaller set of features by pruning the least important features based on the feature weights (i.e., coefficients of the support vector) assigned by the classifier. A minimum number of features of 10 and a step size of 10 were chosen during the RFE. The optimal set of features was determined as the one that yielded the best classification performance.
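The data cleaning and normalization steps can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the function name `clean_and_normalize` and the toy feature table are hypothetical, missing values are represented by `None`, the 20% rule is assumed to mean "more than 20% missing", and the population standard deviation is assumed for the z-score.

```python
import statistics


def clean_and_normalize(table, max_missing=0.20):
    """table maps feature name -> list of per-cell values (None = missing)."""
    cleaned = {}
    for name, values in table.items():
        present = [v for v in values if v is not None]
        n_missing = len(values) - len(present)
        if n_missing > max_missing * len(values):
            continue                      # too many missing values
        if len(present) < 2 or statistics.pstdev(present) == 0:
            continue                      # zero-variance feature
        mean = statistics.fmean(present)
        std = statistics.pstdev(present)
        cleaned[name] = [None if v is None else (v - mean) / std
                         for v in values]
    return cleaned


# Toy feature table with one usable, one constant, and one sparse feature.
table = {
    "area": [1.0, 2.0, 3.0, 4.0, 5.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],
    "sparse": [1.0, None, None, 3.0, 4.0],
}
cleaned = clean_and_normalize(table)
```

For the subsequent selection step, scikit-learn's `RFECV` exposes `step` and `min_features_to_select` parameters that correspond to the step size and minimum feature count quoted above; whether the authors used this particular class is not stated here.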
Cell line classification
Machine learning classifiers were trained to classify individual cells in each passage. For cells in the training set, certain cell line labels (i.e., A, B, C, or D) were assigned to the cells in monoclonal wells, while cells from artificial pools were given uncertain labels. For instance, cells from the artificial pool that was a mixture of cell lines A and B were labeled as AB, meaning that they could be either A or B (2 candidate classes). During the training of the ML classifiers, two approaches were adopted and compared. For the first approach, the ML classifiers were trained using cells only from monoclonal wells. For the second approach, the training set included cells from both monoclonal wells and artificial pools. To train ML classifiers using cells with uncertain labels, the EM-ILR algorithm was developed, which iteratively refined uncertain cell labels and retrained the ML classifiers using Expectation Maximization (Fig. S5). The EM-ILR training procedure started with the initial label assignment step, where cells were randomly assigned to one of the candidate classes. The ML model was then trained on the dataset with the current label assignment (M-step). After training, the model generated the per-class prediction scores (class probabilities) for each cell in the training set. Cell labels were then updated by reassigning each cell to the candidate class with the highest prediction score (E-step). The E-steps and M-steps alternated until all cell labels were unchanged or the maximum number of iterations was reached. Notably, cells with certain labels can be considered as having one candidate class, which would not be changed during the E-step. In addition, the EM-based algorithm is compatible with a wide range of ML models as long as the model can generate class probabilities or a proxy to that during model prediction. The mathematical formulation of the EM-ILR algorithm is further illustrated in Supplementary Note 2.
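The iterative label refinement loop can be sketched as follows. This is an illustrative sketch under stated assumptions, not the EM-ILR implementation from the study: a nearest-centroid classifier stands in for the ML models, negative squared distance serves as the proxy for class probability, and the names `NearestCentroid`, `em_ilr`, and the toy data are hypothetical.

```python
import random


class NearestCentroid:
    """Stand-in classifier: scores are negative squared distances to centroids."""

    def fit(self, X, y):
        self.centroids_ = {}
        for c in set(y):
            pts = [x for x, label in zip(X, y) if label == c]
            self.centroids_[c] = tuple(sum(col) / len(pts) for col in zip(*pts))
        return self

    def scores(self, x):
        return {c: -sum((a - b) ** 2 for a, b in zip(x, mu))
                for c, mu in self.centroids_.items()}


def em_ilr(X, candidates, max_iter=50, seed=0):
    rng = random.Random(seed)
    # Initial assignment: pick one candidate class at random per cell.
    labels = [rng.choice(sorted(c)) for c in candidates]
    for _ in range(max_iter):
        model = NearestCentroid().fit(X, labels)           # M-step
        new_labels = []
        for x, cand in zip(X, candidates):                 # E-step
            sc = model.scores(x)
            best = max(sorted(cand), key=lambda c: sc.get(c, float("-inf")))
            new_labels.append(best)
        if new_labels == labels:                           # labels unchanged
            break
        labels = new_labels
    return model, labels


# Toy data: six certain cells per class plus a four-cell "AB" pool.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (0.2, 0.8),
     (10, 10), (11, 10), (10, 11), (11, 11), (10.5, 10.5), (10.2, 10.8),
     (0.4, 0.6), (0.9, 0.1), (10.4, 10.6), (10.9, 10.1)]
candidates = [{"A"}] * 6 + [{"B"}] * 6 + [{"A", "B"}] * 4
model, labels = em_ilr(X, candidates)
```

Cells with a single candidate class never change label, so the monoclonal cells anchor the class centroids while the pool cells are reassigned between iterations.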
Five ML models were utilized and compared for the single cell classification, including the k-nearest neighbors (kNN), random forest (RF), gradient boosting classifier (GBC), support vector machine (SVM), and multi-layer perceptron (MLP). For SVMs, the estimates of class probabilities were generated using 5-fold cross-validation during prediction. To avoid overfitting, the MLP classifier’s architecture was constrained to a single hidden layer with 25 neurons. Additionally, early stopping was employed during training by monitoring the model’s performance using 10% of the training data as a validation set. Training was halted if the validation score did not improve for 50 consecutive epochs. The configurations of each ML classifier are reported in Table S9.
Classification performance evaluation
To compare the cell line classification of different passages, the same number of cells (i.e., 2915 cells) were randomly sampled from passage 0, passage 1, and passage 2, individually. In addition, the ratios of cells with certain and uncertain labels were kept the same among all passages. Namely, for each passage, 857 cells were sampled from monoclonal wells, whereas 2058 cells were sampled from the artificial pools. Each dataset was randomly partitioned into a training set (70% of cells) and a test set (30% of cells) ten times for the 10-fold Monte Carlo cross-validation. To evaluate classification performance, we introduced the impurity score (\({S}_{{impurity}}\)) as a metric to measure classification performance when uncertain cell labels exist. The impurity score is defined as:
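The formula is not reproduced in this text. A form consistent with the definitions that follow (values between 0 and 1, with lower values indicating better performance) is given below as an assumption:

\[
S_{impurity} = 1 - \frac{N_{correct}}{N_{total}}
\]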
where \({N}_{{total}}\) is the total number of cells in the test set, and \({N}_{{correct}}\) is the number of cells whose predicted cell line class is among their annotated candidate class(es). The value of the impurity score ranges from 0 to 1, with lower values meaning better classification performance. In addition, several traditional classification evaluation metrics were calculated, including balanced accuracy, F1-score, precision, recall, Jaccard score, and Area-Under-the-ROC-Curve (AUC). Due to the varying cell densities and cell sizes of the four cell lines, different cell line classes had different numbers of cells. To account for imbalanced datasets, balanced accuracy (\({bACC}\)) was used, which is defined as:
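The formula is not reproduced in this text. The common multi-class definition, the mean of per-class recall, is shown below as an assumption; \(TP_{i}\) and \(FN_{i}\) (true positives and false negatives for class \(i\)) are symbols introduced here for illustration:

\[
bACC = \frac{1}{C}\sum_{i=1}^{C}\frac{TP_{i}}{TP_{i} + FN_{i}}
\]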
where \(C\) is the number of classes (\(C=4\) in this study). Notably, the traditional classification evaluation metrics can only be calculated on cells with certain ground truth labels (i.e., cells from monoclonal wells), whereas the impurity scores can be calculated on cells from both monoclonal wells and artificial pools.
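Both metrics can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the impurity score equals one minus the fraction of predictions falling inside the candidate class set, and that balanced accuracy is the mean per-class recall; the function names and toy labels are hypothetical.

```python
from collections import defaultdict


def impurity_score(predicted, candidates):
    """1 - N_correct / N_total; a prediction is correct if it lies in the
    cell's candidate class set (one class for monoclonal wells)."""
    n_correct = sum(1 for p, c in zip(predicted, candidates) if p in c)
    return 1.0 - n_correct / len(predicted)


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall over the classes present in y_true."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy test set mixing monoclonal cells and artificial-pool cells.
predicted = ["A", "B", "A", "C", "D", "B"]
candidates = [{"A"}, {"B"}, {"A", "B"}, {"C", "D"}, {"C"}, {"A"}]
impurity = impurity_score(predicted, candidates)   # 4 of 6 inside candidates

# Balanced accuracy needs certain labels, so only monoclonal cells are used.
bacc = balanced_accuracy(["A", "A", "A", "B"], ["A", "A", "B", "B"])
```

In the toy example the plain accuracy would be 0.75, whereas the balanced accuracy weights the smaller class B equally with class A.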
Feature importance measurement and visualization
To measure the importance of features for the cell line classification, a model-agnostic feature importance measurement technique (i.e., permutation feature importance, or PIMP) was leveraged46. PIMP is defined as the decrease in classification performance when the values of the selected feature are randomly shuffled. The PIMP scores were calculated on the test set of one cross-validation fold. The top-performing ML classifier was used as the estimator. Highly correlated features were identified prior to the PIMP measurement to avoid underestimating PIMP scores. The Spearman rank-order correlation coefficient was used to measure the correlation between features. Hierarchical clustering was then performed on the Spearman rank-order correlations to identify clusters of correlated features, with the threshold of correlation coefficient being set to 0.1. Only one feature was randomly selected from each cluster, while others were excluded from the PIMP measurement. Each of the remaining features was randomly permuted 10 times. The differences in the balanced accuracy scores after feature permutation were recorded.
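The permutation step can be sketched as follows. This is an illustrative sketch, not the study's code: it omits the Spearman-based clustering of correlated features, uses a hypothetical rule-based `toy_classifier` in place of the trained model, and operates on a toy dataset.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in balanced accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = balanced_accuracy(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            permuted = [row[:j] + [v] + row[j + 1:]
                        for row, v in zip(X, column)]
            score = balanced_accuracy(y, [predict(row) for row in permuted])
            drops.append(baseline - score)
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_classifier(row):
    # Hypothetical model that only looks at feature 0.
    return "A" if row[0] < 5 else "B"


X = [[0, 3], [1, 9], [2, 1], [3, 7], [6, 2], [7, 8], [8, 4], [9, 6]]
y = ["A"] * 4 + ["B"] * 4
importances = permutation_importance(toy_classifier, X, y)
```

Shuffling the unused feature leaves the score unchanged, so its importance is exactly zero, whereas shuffling the informative feature lowers the balanced accuracy. scikit-learn provides `sklearn.inspection.permutation_importance` for fitted estimators; whether the authors used that function is not stated here.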
The distribution of different cell lines in the feature space was visualized in two dimensions (2D) using three dimensionality reduction methods, including t-SNE, PCA, and UMAP. For each passage, the dimensionality reduction was conducted on features selected by RFE. Each dot in the resulting 2D plots represents one single cell in the test set. In addition, single-cell images corresponding to the dots in the t-SNE plot were sampled and visualized in the same locations.
Importance of imaging modality
The importance of different imaging modalities used in this study was assessed by comparing the cell line classification performance using all modalities and subsets of modalities. The subsets of imaging modalities include 2PF, 3PF, THG, SLAM (2PF, 3PF, THG, and redox ratio), and FLIM (3PF and lifetime estimation results). Based on the selected imaging modalities, cell features were calculated using the single-cell analysis pipeline. Since correlation features were calculated on two image channels, they were absent in the cell features when using single imaging modalities (i.e., 2PF, 3PF, THG). The area and shape features were kept the same among all subsets of imaging modalities since they were calculated from the cell masks that were shared by all image channels. Based on the features calculated from each subset of imaging modalities, the cell line classification results of all three passages were generated using the same ML classifier (i.e., MLP trained with EM-ILR). The classification performance was evaluated using 10-fold Monte Carlo cross-validation (the training-test ratio is set at 7:3).
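The repeated random 7:3 partitioning can be sketched as below. This is a minimal illustration with hypothetical names (`monte_carlo_splits`); the rounding of the training-set size is an assumption, and scikit-learn's `ShuffleSplit` offers comparable functionality.

```python
import random


def monte_carlo_splits(n_cells, n_repeats=10, train_fraction=0.7, seed=0):
    """Return n_repeats independent (train_indices, test_indices) pairs."""
    rng = random.Random(seed)
    n_train = int(train_fraction * n_cells)   # rounded down: an assumption
    splits = []
    for _ in range(n_repeats):
        indices = list(range(n_cells))
        rng.shuffle(indices)                  # fresh random partition each fold
        splits.append((indices[:n_train], indices[n_train:]))
    return splits


# 2915 cells per passage, as used for the passage comparison.
splits = monte_carlo_splits(2915)
train_idx, test_idx = splits[0]
```

Unlike k-fold cross-validation, the test sets of different folds may overlap because each partition is drawn independently.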
Statistics and reproducibility
The statistical differences in cell feature distributions between different cell lines were assessed using Mann–Whitney U tests. Welch’s unequal variances t-tests were used for the comparison of classification performance between different passages and different imaging modalities. Each data point represents the classification evaluation result from one cross-validation fold (n = 10 for the 10-fold Monte Carlo cross-validation). A p value less than 0.05 was considered statistically significant.
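Welch's test statistic and its degrees of freedom (Welch–Satterthwaite approximation) can be computed as in the sketch below. The fold scores are made-up toy values, not results from this study, and the function name `welch_t` is hypothetical. The p value additionally requires the Student t distribution; in SciPy, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the complete test.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, degrees_of_freedom) for Welch's unequal variances t-test."""
    va = statistics.variance(a) / len(a)   # squared standard error, group a
    vb = statistics.variance(b) / len(b)   # squared standard error, group b
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy balanced accuracies from five folds of two hypothetical experiments.
scores_a = [0.91, 0.92, 0.93, 0.94, 0.95]
scores_b = [0.82, 0.84, 0.86, 0.88, 0.90]
t_stat, dof = welch_t(scores_a, scores_b)
```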
Data analysis hardware and software
The image analysis was conducted on a desktop computer equipped with a central processing unit (CPU) (i9-11900KF, Intel), a graphics processing unit (GPU) (GeForce RTX 3090, Nvidia), and 128 gigabytes of memory. The computer operated on a Windows system (Windows 10, Microsoft).
The single-cell analysis pipeline was developed in Python 3 using JupyterLab (version 3.6). Cellpose (version 2.0) was used for cell segmentation. CellProfiler (version 4.2.1) was used for cell feature extraction. FlimLib (version 2.2.3) was used for fluorescence lifetime estimation. Scikit-learn (version 0.23.2) was used to implement the ML models and calculate evaluation metrics. Plots were generated using Matplotlib (version 3.2.2) and Seaborn (version 0.11.0). Other Python libraries, including NumPy (version 1.19.1), Pandas (version 1.1.2), and SciPy (version 1.5.2), were used to assist data analysis.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The source data behind figures in the paper can be found in Supplementary Data 1. The raw CHO cell line images can be provided by University of Illinois Urbana-Champaign and GSK pending scientific review and a completed data use agreement. Requests for the CHO cell line dataset should be submitted to: boppart@illinois.edu.
Code availability
The analysis pipeline developed in this study is publicly available from GitHub: https://github.com/Biophotonics-COMI/Biopharm_cell_line_selection.
Acknowledgements
The authors would like to acknowledge Yuen-Ting Chim of the Biopharm Process Research team at GSK for providing the CHO cell lines and ambr®15 data used in this study. The authors also thank Kevin K. Tan for supporting the FLIM data analysis. This work was funded through an academic-industry partnership grant between the University of Illinois Urbana-Champaign and GSK. S.A.B., M.M., and D.R.S. were supported in part by the National Institutes of Health (Grant No. P41EB031772).
Author information
Contributions
S.A.B., S.R.H., R.T., G.F., and A.A. conceptualized the work. J.S., C.E.S., and A.H. developed the code used in this work and analyzed the data. E.J.C. prepared biological samples. J.E.S. contributed to the imaging system development. A.H. acquired the data. S.A.B., S.R.H., M.D., M.M., and D.R.S. supervised the study. The manuscript was written through contributions of all authors.
Ethics declarations
Competing interests
The authors declare the following competing interests: The GSK Center for Optical Molecular Imaging, its personnel, and the projects that are pursued are supported financially through an academic-industry partnership grant between the University of Illinois Urbana-Champaign and GSK. A.A., R.T., M.D., G.F., and S.R.H. are employees and shareholders of GSK. J.S., A.H., C.E.S., E.J.C., J.E.S., D.R.S., M.M., and S.A.B. declare no conflict of interest. The method and apparatus described in this paper have been disclosed as intellectual property to the Office of Technology Management of the University of Illinois Urbana-Champaign.
Peer review
Peer review information
Communications Biology thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editors: Dr Giulia Bertolin and Dr Ophelia Bu. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Shi, J., Ho, A., Snyder, C.E. et al. Accelerating biopharmaceutical cell line selection with label-free multimodal nonlinear optical microscopy and machine learning. Commun Biol 8, 157 (2025). https://doi.org/10.1038/s42003-025-07596-w