Introduction

Recent advances in imaging technology and machine learning have significantly improved high-throughput quantitative single-cell analysis1,2,3,4. Combined with genomic technologies, cell imaging techniques have great potential to elucidate the relationship between genotype and phenotype5,6,7. For instance, mapping genome-scale mutants of budding yeast to various cell biological characteristics (phenotypes) in single-cell images helps systematically explore and interpret how genetic variations influence cellular behavior and structure at the single-cell level8,9. However, cellular phenotypes in images are inherently unstructured and irregular, making it difficult to decode meaningful biological patterns from pixel values with accuracy and sensitivity10. Thus, the rapid, precise, and comprehensive acquisition and recognition of single-cell images is both critical and challenging.

The concept of Microscopic-diffraction Imaging Flow Cytometry (MDIFC) was introduced to capture both microscopic and polarized diffraction images of single cells and to classify them with supervised machine learning algorithms11. Polarized diffraction images record the spatial distribution of coherent light scattered from single cells excited by a laser beam, providing stain-free, non-invasive, and quantitative features12,13. Microscopic images offer intricate morphological insights and serve as references for polarized diffraction images. Previous studies have predominantly relied on Grey-Level Co-occurrence Matrix (GLCM) feature extraction and supervised machine learning techniques, such as Support Vector Machines (SVM) or Random Forests (RF), for recognition, typically handling microscopic and diffraction images independently11. However, these methods have been limited in accuracy and processing speed, necessitating new image processing and machine learning classification algorithms.

Image fusion addresses the incompleteness, uncertainty, and errors inherent in interpreting individual images. It provides a comprehensive visual representation that enriches information content, improves visibility, and reduces noise14,15,16. These techniques also increase the efficacy of image feature extraction, classification, and object recognition. For classification, Convolutional Neural Network (CNN) algorithms such as VGG-net have demonstrated superior performance across diverse computer vision tasks17,18,19. Therefore, there is a compelling need to develop a novel approach based on CNNs and compare it with traditional algorithms.

Saccharomyces cerevisiae serves as a classical model for studying global morphological regulation20,21. Its budding index usually correlates linearly with growth rate, offering insights into cell activity22,23. In this study, our objective is to employ image fusion and a VGG-structured CNN algorithm integrated into the MDIFC technique to evaluate its efficacy in recognizing unstained budding yeast cells.

Results

Dataset preparation

In this study, a dataset of 3,329 events was employed, of which 2,330 were designated for training and 999 for testing. The training dataset contains 687 budding, 806 single-celled, and 837 aggregation events (Fig. 1).

Fig. 1

Distribution of datasets in the study.

The impact of image fusion weight on classification accuracy

Various fusion weights α generate multiple fusion image sets. Figure 2 presents the classification accuracies of the three algorithms on different fusion images. Fusion images outperform purely microscopic (α = 1.0) or diffraction images (α = 0), with diffraction images performing better than microscopic ones. Among the tested weights, α = 0.2 yields the highest accuracy, so the α = 0.2 fusion images were chosen for subsequent experiments.

Fig. 2

Validation accuracies of fusion images at different fusion weight (α) values.

Comparison of performances of three algorithms for fusion image

Based on the selected fusion image dataset, the performances of the three algorithms are shown in Table 1. VGG-net achieves the best performance, followed by SVM (linear), while RF displays the least effective classification performance. This comparison underscores the superiority of the VGG-net deep learning algorithm. Figure 3 shows the confusion matrices corresponding to the three evaluated algorithms.

Table 1 Comparison of performances of three algorithms.
Fig. 3

Confusion matrices of three algorithms: VGG-net, SVM, and RF.

Comparison of classification speed

The speed of cell measurement is a crucial parameter for MDIFC. Table 2 presents a comparison of the classification times for different algorithms under the same experimental conditions.

Table 2 Time and speed of classification.

The analysis of classification time across the algorithms reveals that the SVM (linear) and RF methods, owing to their reliance on GLCM feature extraction, require more total time than the VGG-net algorithm. VGG-net achieves a measurement speed of about 260.42 cells/s, approximately 15 times faster than the other two methods.

Discussion

Different types of images convey distinct information. Through diffraction theory, diffraction images capture the spatial distribution of elastically scattered light, revealing the 3D morphology of biological cells through the intracellular distribution of refractive index24,25. This characteristic makes diffraction images suitable for GLCM texture feature extraction, analyzing second-order correlations among pixels in an input image26,27. In contrast, microscopic images lack these correlations, potentially explaining the superior performance of SVM and RF in diffraction images compared to microscopic images. VGG-net employs a convolutional kernel as a receptive field for feature extraction, generating more parameters than GLCM, enabling effective performance with both image types. In addition, unlike other algorithms that require parameter adjustments, the VGG-net deep learning model can self-adjust parameters once established, thereby enabling automatic analysis.

Employing image fusion techniques further enriches available information for analysis and interpretation. In this study, considering the necessity to minimize processing time, a simple weighted fusion approach was employed, and the best weight to achieve promising classification performance was selected.

Despite concerns about computational demands, VGG-net demonstrated a remarkable measurement speed of 260.42 cells/s, far surpassing the maximum speed of approximately 17.04 cells/s for GLCM-based classification. Even previous GPU implementations, which achieved average speedups of 7 times for GLCM calculation and 9.83 times for feature calculation28, did not match the speed demonstrated by VGG-net in this study.

In a separate study, Mehran Ghafari utilized a CNN to classify 99,380 microfluidic time-lapse images of dividing yeast cells, achieving an impressive accuracy of 98% with data augmentation29. In our study, a comparable level of accuracy was achieved with a relatively small dataset through image fusion and the VGG-net algorithm, underscoring the robust information provided by MDIFC.

In conclusion, this study introduces an innovative methodology that integrates VGG-net deep learning and image fusion into the MDIFC system for unstained budding yeast recognition, achieving better performance and higher speed than traditional methods. These findings highlight the potential value of MDIFC for future applications.

Methods

The comprehensive process of budding yeast recognition

This study comprised several sequential steps, elucidated through the diagram in Fig. 4. In summary, yeast cells were cultured and measured using the MDIFC system. Paired microscopic and diffraction images of the same events, including budding, single-celled, and aggregated states, were collected. Following data preparation (which included data labeling, preprocessing, and image fusion), the fusion images, along with separate microscopic and diffraction images, underwent training and classification using the VGG-net-structured deep learning algorithm. For comparative purposes, two traditional algorithms, SVM (linear) and RF, both relying on GLCM feature extraction, were also utilized. The classification performances were evaluated based on metrics including accuracy, precision, recall, F1 score, and speed.

Fig. 4

Graphical representation of the budding yeast recognition process of this study.

MDIFC system design

Figure 5 illustrates the design of the MDIFC system. In detail, a continuous-wave solid-state laser (MGL-III-532-100, CNI) was employed to generate a 532 nm incident beam along the x-axis with a power of up to 100 mW. A 450 nm LED with a power of 1 W along the z-axis served as the illumination source. The beam incident on the core fluid carrying the cells along the y-axis was scattered, and the light was split by the polarizing beam splitter (PBS) into two perpendicular, linearly polarized paths: the horizontal s-polarized path and the vertical p-polarized path. A short-pass filter (< 500 nm) and a narrow-band filter (532 nm) were added to the two optical paths, respectively, to prevent mutual interference between the light sources and enhance imaging quality. Two CCD cameras (LM075, Lumenera) were employed to capture the two cross-polarized images, each with a resolution of 640 × 480 pixels and a pixel depth of 12 bits. The p-polarized CCD captures microscopic images, while the s-polarized CCD captures diffraction images.

Fig. 5

Schematic of the MDIFC system, with specific components represented by the following abbreviations: OB (objective lens), PBS (polarizing beam splitter), SP (short-pass filter), BP (narrow-band filter), and T (tube lens).

Preparation of budding yeast cell suspension

To prepare the budding yeast cell sample, 2 mg of Saccharomyces cerevisiae (Sigma-Aldrich) powder was placed in 5 mL of YPD medium (1% yeast extract, 2% peptone, and 2% glucose) at room temperature for 8 h. The rate of cell budding, assessed from ten different fields of view under a microscope (IX71, Olympus), ranged from 8 to 15%. Figure 6 shows one microscopic field, in which the majority of yeast cells exist as single, isolated entities; some cells are in a budding state, and a small number adhere to each other to form cell clusters referred to as “aggregation.”

Fig. 6

Microscope image of the yeast cell sample. White arrows indicate single cells, squares highlight budding cells, and an ellipse indicates aggregation.

Cell image acquisition and preprocessing

For measurements, the final cell concentration was adjusted to 1 × 106 cells/ml. Each time, 50 µl of cell suspension was pumped into the core stream, and the flow velocity was set to 10 mm/s to reduce microscopic blur, resulting in a capture rate of 7 cells/s. Images with pixels of saturated intensities (over-exposed) or of very weak total pixel intensity (under-exposed) cause an unbalanced parameter distribution for machine learning, so they were first removed by in-house preprocessing software before analysis30. Figure 7 displays typical paired microscopic and diffraction images of the three categories.
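
A minimal sketch of such exposure-based filtering is given below, assuming the 12-bit images are loaded as NumPy arrays; the saturation and intensity thresholds are illustrative assumptions, not the values used by the in-house software.

```python
import numpy as np

# Illustrative exposure filter; thresholds are assumptions, not the settings
# of the in-house preprocessing software described in the text.
SATURATION_LEVEL = 4095          # maximum grey value of a 12-bit pixel
MAX_SATURATED_FRACTION = 0.001   # reject if > 0.1% of pixels are saturated
MIN_MEAN_INTENSITY = 50          # reject very dim (under-exposed) frames

def is_well_exposed(img: np.ndarray) -> bool:
    """Return True if the image is neither over- nor under-exposed."""
    saturated_fraction = np.mean(img >= SATURATION_LEVEL)
    return saturated_fraction <= MAX_SATURATED_FRACTION and img.mean() >= MIN_MEAN_INTENSITY

def filter_pairs(pairs):
    """Keep only paired events where both images pass the exposure check.
    `pairs` is a list of (microscopic_img, diffraction_img) NumPy arrays."""
    return [(m, d) for m, d in pairs if is_well_exposed(m) and is_well_exposed(d)]
```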

Fig. 7

Illustration of paired microscopic and corresponding diffraction images (640 × 480) of three categories: single-celled (A, B), budding (C, D), aggregation (E, F). Bar = 10 μm.

Image fusion

The grey values of separate microscopic and diffraction images (640 × 480 each) depicting the same events were combined using a weighted fusion approach. This involved multiplying the microscopic image grey values (denoted as Gm) by a weight factor α, and the diffraction image grey values (denoted as Gd) by 1 − α. The fusion image grey value (denoted as Gf) is determined by the following equation:

$$G_{f} = G_{m} \times \alpha + G_{d} \times \left( {1 - \alpha } \right)$$
(1)

Here, α represents the weight assigned to the microscopic image. The value of α ranges between 0 and 1, with intervals of 0.1. Figure 8 illustrates fusion images generated with various α values, where α = 0 represents diffraction images and α = 1 represents microscopic images.
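
A minimal NumPy sketch of Eq. (1) is shown below, assuming the paired 640 × 480 images are already aligned and available as arrays.

```python
import numpy as np

def fuse(microscopic: np.ndarray, diffraction: np.ndarray, alpha: float) -> np.ndarray:
    """Weighted fusion of Eq. (1): Gf = alpha * Gm + (1 - alpha) * Gd."""
    assert microscopic.shape == diffraction.shape  # paired 640 x 480 images
    return alpha * microscopic.astype(np.float64) + (1.0 - alpha) * diffraction.astype(np.float64)

# Generate the fusion image sets for alpha = 0.0, 0.1, ..., 1.0,
# where alpha = 0 keeps only the diffraction image and alpha = 1 only the microscopic one.
alphas = np.round(np.arange(0.0, 1.01, 0.1), 1)
# fused_sets = {a: [fuse(m, d, a) for m, d in pairs] for a in alphas}
```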

Fig. 8

Visualization of image fusion based on the weight factor α for the single-celled category (α was set between 0 and 1, with intervals of 0.2). The remaining α values and categories are not displayed.

Experimental setup for machine learning

Table 3 lists the hardware configurations and software environment used for subsequent machine learning procedures.

Table 3 Detailed information of experiment.

Image classification using VGG-net structured deep learning

A lightweight 11-layer VGG-net structure was devised, depicted in Fig. 9, to strike a balance between classification performance and speed. The VGG11 structure consists of eight convolutional layers, each featuring a 3 × 3 receptive field with a stride of one and zero-padding, followed by a Rectified Linear Unit (ReLU). Max-pooling layers with 2 × 2 filters and a stride of 2 (without zero-padding) are employed for all pooling layers. The number of channels in each convolutional layer progressively increases as the network deepens. The feature maps are then flattened and passed through fully connected (FC) layers to reduce dimensionality. Subsequently, the softmax function is applied to categorize the output into three classes. Cross-entropy serves as the loss function. During training, dropout layers are incorporated between the FC layers to mitigate overfitting, SGD with momentum is employed as the optimizer, and the network is trained for 150 epochs with a batch size of 16.
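
As a rough sketch of this kind of architecture, the PyTorch code below builds an eight-convolution, three-FC VGG11-style network with a three-class output. PyTorch itself, the channel widths, the FC sizes, and the dropout rate are illustrative assumptions; the exact lightweight configuration is the one shown in Fig. 9.

```python
import torch
import torch.nn as nn

class VGG11Classifier(nn.Module):
    """VGG11-style sketch: 8 conv layers (3x3, stride 1, zero-padding, ReLU),
    2x2 max-pooling, then 3 FC layers. Widths and dropout rate are assumptions."""
    def __init__(self, num_classes: int = 3):
        super().__init__()
        cfg = [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M']
        layers, in_ch = [], 1  # single-channel (grey-value) fusion images
        for v in cfg:
            if v == 'M':
                layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            else:
                layers += [nn.Conv2d(in_ch, v, kernel_size=3, stride=1, padding=1),
                           nn.ReLU(inplace=True)]
                in_ch = v
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(512, 512), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(512, num_classes))  # softmax is applied inside the loss below

    def forward(self, x):
        return self.classifier(self.features(x))

model = VGG11Classifier()
criterion = nn.CrossEntropyLoss()                                 # cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# training loop: 150 epochs with a batch size of 16, as stated in the text
```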

Fig. 9

The illustration of VGG11 deep learning architecture.

GLCM feature extraction based SVM and RF classification

A GLCM is constructed in which the rows and columns correspond to the grayscale levels of an image31. Each element p(i, j, d, θ) of the matrix denotes the relative frequency of two pixels with intensities i and j, respectively, separated by a distance d at a specified angle θ. The pixel distance d was set to 1, 2, 4, 8, 16, and 32, with four angles of 0°, 45°, 90°, and 135°, generating 24 Ng × Ng matrices for each image. From each GLCM, six second-order statistics are computed as characteristic parameters, yielding a total of 144 features. Below are the definitions of these six second-order parameters.

$$Angular\;second\;moment = \mathop \sum \limits_{{{\text{i}},j = 0}}^{{N_{g} - 1}} P_{{ij}}^{2}$$
(2)
$$Contrast = \mathop \sum \limits_{{i,j = 0}}^{{N_{g} - 1}} P_{{ij}} \left( {i - j} \right)^{2}$$
(3)
$$Correlation = \mathop \sum \limits_{{i,j = 0}}^{{N_{g} - 1}} P_{{ij}} \frac{{\left( {i - \mu _{i} } \right)\left( {j - \mu _{j} } \right)}}{{\sigma _{i} \sigma _{j} }}$$
(4)
$$Homogeneity = \mathop \sum \limits_{{i,j = 0}}^{{N_{g} - 1}} \frac{{P_{{ij}} }}{{1 + \left( {i - j} \right)^{2} }}$$
(5)
$$Dissimilarity = \mathop \sum \limits_{{i,j = 0}}^{{N_{g} - 1}} P_{{ij}} \left| {i - j} \right|$$
(6)
$$Energy = \sqrt {ASM}$$
(7)

where i and j represent the intensities of two pixels in an image with Ng grayscale levels, µi and µj are the horizontal and vertical means of the GLCM, respectively, and σi and σj are the standard deviations of the pixel intensities contributing to µi and µj.
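
The feature extraction described above can be sketched with scikit-image as follows. The number of grey levels and whether the GLCMs were symmetric or normalized are not specified in the text, so those settings are assumptions here.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

DISTANCES = [1, 2, 4, 8, 16, 32]
ANGLES = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]   # 0, 45, 90, 135 degrees
PROPS = ['ASM', 'contrast', 'correlation', 'homogeneity', 'dissimilarity', 'energy']

def glcm_features(img: np.ndarray, levels: int = 256) -> np.ndarray:
    """Return a 144-element feature vector (6 distances x 4 angles x 6 statistics).
    The image is assumed to be an integer array rescaled to `levels` grey levels."""
    glcm = graycomatrix(img, distances=DISTANCES, angles=ANGLES,
                        levels=levels, symmetric=True, normed=True)
    # graycoprops returns one (distances x angles) array per property
    feats = [graycoprops(glcm, p).ravel() for p in PROPS]
    return np.concatenate(feats)   # shape: (144,)
```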

The extracted GLCM features were employed for classification using either the SVM or the RF algorithm. SVM was originally designed for binary classification problems; in this study, the one-versus-one framework was chosen to avoid the classification imbalance inherent in the one-versus-rest framework32, which arises when certain classes contain significantly more samples than others. The linear kernel was chosen over non-linear kernels due to its higher accuracy observed in previous studies13,33. RF combines multiple randomized decision trees, aggregates their predictions by averaging, and has also demonstrated outstanding performance in diffraction image analysis13.
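
A possible realization of the two classifiers with scikit-learn is sketched below; the feature-scaling step and the number of trees are assumptions rather than reported settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: (n_samples, 144) GLCM feature matrix
# y: labels in {budding, single-celled, aggregation}
svm_clf = make_pipeline(StandardScaler(),
                        SVC(kernel='linear', decision_function_shape='ovo'))  # one-versus-one
rf_clf = RandomForestClassifier(n_estimators=100, random_state=0)

# svm_clf.fit(X_train, y_train); rf_clf.fit(X_train, y_train)
```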

Performance evaluation

To assess classification performance, metrics including accuracy, precision, recall, and F1 score were employed. Each reported metric is the average across all data blocks of 10-fold cross-validation, repeated under five different random seeds for consistent comparison. These measures are defined as follows:

$$Accuracy = \frac{{TP + TN}}{{TP + TN + FP + FN}}$$
(8)
$$Precision = \frac{{TP}}{{TP + FP}}$$
(9)
$$Recall = \frac{{TP}}{{TP + FN}}$$
(10)
$$F1\,score = \frac{{2 \times Precision \times Recall}}{{Precision + Recall}}$$
(11)

The equations use the following terms: TP for true positives (correctly predicted positive examples), TN for true negatives (correctly predicted negative examples), FN for false negatives (positive instances predicted as negative), and FP for false positives (negative instances predicted as positive).
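
A sketch of this evaluation protocol is shown below, assuming scikit-learn estimators; macro-averaging of precision, recall, and F1 over the three classes is an assumption, as the averaging scheme is not stated in the text.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate

SCORING = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']

def evaluate(clf, X, y, seeds=(0, 1, 2, 3, 4)):
    """10-fold cross-validation repeated under five random seeds;
    returns the mean of each metric over all folds and seeds."""
    results = {s: [] for s in SCORING}
    for seed in seeds:
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        scores = cross_validate(clf, X, y, cv=cv, scoring=SCORING)
        for s in SCORING:
            results[s].append(scores[f'test_{s}'].mean())
    return {s: float(np.mean(v)) for s, v in results.items()}
```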