Introduction

Mental health disorders are highly prevalent and associated with increased disability and mortality worldwide1. Mood disorders encompass two main groups: major depressive disorder (MDD, also referred to as unipolar depression) and bipolar disorder (BD), which place a substantial burden on healthcare systems2. MDD is a chronic mental illness characterized by recurrent episodes of depressed mood and anhedonia3. In contrast, BD alternates between periods of depression, abnormal euphoria, and irritability. MDD and BD have been recognized as major contributors to disability across various cohorts2,4. Worldwide, the lifetime prevalence of MDD is 11.1–14.6%, while that of BD reaches 3.4%1,5.

Unfortunately, the clinical profiles of depressive episodes in patients with MDD or BD are indistinguishable, often leading to misdiagnosis and delayed treatment. Up to 69% of BD patients may initially be misdiagnosed with MDD, and more than 30% may remain misdiagnosed for up to 10 years6. This results in significant disability and high economic costs6,7 due to incorrect selection of pharmacological treatment (i.e., antidepressant monotherapy or lack of mood stabilizers), which may increase treatment-emergent manic episodes and lead to a poor illness course, poor treatment response, low study/work productivity, low quality of life, and high societal costs8,9,10,11. Currently, the diagnosis of BD and MDD is based on clinical criteria and symptoms reported by patients, with laboratory tests or brain imaging used only to rule out other neurological disorders. Therefore, great efforts have been made to incorporate biological measurements that improve the differential diagnosis of BD and MDD.

Biomarkers (biological parameters that are associated with disease states or traits) arise as potential aids for the differential diagnosis of depression. In searching for such biomarkers, small extracellular vesicles (sEVs) have emerged as ideal sources of brain disease markers, because brain cell-released sEVs have been shown to reach the bloodstream and may provide a significant opportunity to study psychiatric disorders more precisely. Thus, sEVs derived from plasma could reveal neuroinflammation associated with psychiatric disorders12. sEVs are nanometric lipid structures (50–200 nm) released into the extracellular space by nearly all cell types. They carry a complex molecular cargo consisting of various soluble and transmembrane proteins, genetic material, and lipids13,14. Evidence suggests that plasma sEVs play a crucial role in the brain and systemic inflammatory response15. Microglial cells are critical decoders of neuroinflammation, which is closely associated with clear changes in their secretory and morphological responses16,17. Therefore, we reasoned that sophisticated computer vision strategies could provide a robust approach for discriminating the microglial cellular response to patient-derived sEVs.

Deep Learning (DL) models, such as Convolutional Neural Networks (CNNs), have been employed effectively to solve various computer vision problems. These CNN models can be trained using high-dimensional data to identify distinguishing features among various disorders18,19,20,21. CNN models for complex pattern recognition have been developed for image classification22,23, image segmentation24,25, face recognition26,27, and iris recognition28,29, among other applications. These models successfully extract and learn complex features from visual data, allowing them to achieve high accuracy (ACC) in previously challenging image analysis tasks. By leveraging the ability of CNNs to process high-dimensional data, they have also been applied in biomedical fields to analyze medical images in classification30,31,32, segmentation33,34,35, and regression tasks36,37,38, supporting expert diagnoses3,39,40,41,42,43. Once trained, CNNs could assist in reaching early and accurate diagnoses, reducing the dependence on symptom-based assessments and minimizing the risk of misdiagnosis44,45,46. Despite these advances, most current applications in biomedical imaging focus on isolated features or individual units of analysis, which may limit the capture of biologically relevant heterogeneity. This is particularly true in cellular systems, where pathological states often manifest not as discrete changes in single cells, but as emergent patterns across populations. In such cases, approaches that integrate spatial or contextual information across multiple cells can provide a more comprehensive and robust diagnostic signal. Addressing this gap could significantly enhance the accuracy and generalization capacity of image-based classification in complex biological scenarios, especially when subtle morphological shifts are involved47,48.

The diagnosis of mood disorders, particularly MDD and BD, remains a significant challenge due to overlapping symptomatology and reliance on subjective clinical evaluation. Recent advances in biomedical informatics have leveraged artificial intelligence to develop objective, data-driven diagnostic tools capable of capturing neurophysiological, behavioral, and linguistic signatures of these disorders. Neuroimaging-based studies continue to report important generalization challenges, as demonstrated by Belov et al.49, who conducted the largest multi-site brain-imaging analysis of MDD and found that traditional machine-learning models achieved only moderate balanced accuracy (~ 62%) before dropping to near-chance performance once site effects were controlled. Yang et al.50 used resting-state fMRI to compare brain functional efficiency across schizophrenia, bipolar disorder, and major depressive disorder. They found shared sensorimotor disruptions between schizophrenia and MDD, linked to genes involved in glutamatergic and calcium/cAMP signaling, suggesting shared neurobiological mechanisms among these disorders.

Electrophysiological studies using EEG have consistently demonstrated stronger performance. Zhao et al.51 introduced the SE-1DCNN-LSTM framework for distinguishing MDD from BD, reporting accuracies of 81.10% at the epoch level and 83.16% at the subject level. Hata et al.52 evaluated a transformer-based deep-learning model trained on frequency-domain features extracted from portable EEG recordings, achieving a balanced accuracy of 80.8% and an AUC of 0.872 in differentiating healthy volunteers from patients with dementia-related conditions. Subgroup analyses across diagnostic categories and severity levels yielded AUCs ranging from 0.812 to 0.898, with balanced accuracies of up to 86.4%. Anik et al.53 proposed an 11-layer Ex-1DCNN that identified Gamma-band activity in 15-second epochs as a highly discriminative biomarker, achieving 99.6% accuracy for depression detection and highlighting the potential of short-window, non-invasive electrophysiological screening. Liu et al.54 explored frontal resting-state EEG for distinguishing generalized anxiety disorder (GAD) from healthy controls, introducing a "Differential Channel" method and connectivity-based features to enhance anxiety-related signal discrimination. Using these representations, their Deep Forest classifier achieved an accuracy of up to 98.08% with short time windows, supporting the feasibility of frontal-channel EEG and functional connectivity metrics for reliable GAD identification.

Beyond physiological signals, voice and text-based machine learning models have shown equally strong performance. Huang et al.55 applied a pre-trained wav2vec 2.0 model to acoustic recordings, achieving approximately 96% accuracy in binary depression classification, while Xu et al.56 demonstrated that fusing text and voice embeddings (via BERT and Wav2Vec) within a CNN-BiLSTM architecture substantially outperformed unimodal voice or text models.

Parallel advances have emerged in social-media text analysis, where Ding et al.57 compared classical machine-learning methods (logistic regression, random forests, LightGBM) with deep learning approaches (ALBERT, GRU), reporting comparable performance. Their results suggest that classical models maintain interpretability advantages, while deep architectures capture more complex linguistic patterns. Additionally, Li et al.58 provided a comprehensive systematic review of 65 studies that combined audio and text data for automated depression detection, highlighting the growing relevance of natural language and paralinguistic features in mental health diagnostics.

Additional progress has been made using wearable sensor and behavioral data. Ricka et al.59 identified a stable physiological signature of depression using cardiac and electrodermal activity, enabling daily mood prediction with ~ 86% accuracy. Saad et al.60 transformed smartwatch motor-activity time series into Markov transition-field images for analysis via an attention-based CNN, achieving ~ 95% accuracy in the depression class. Similarly, Wu et al.61 demonstrated that digital biomarkers (heart rate, sleep, and activity) can predict bipolar mood states, reaching 83% accuracy for depressive symptoms and 91% for manic symptoms, supporting the feasibility of continuous mood-state monitoring. Psychometric information represents another modality with strong predictive value. Using DASS-42 scores together with demographic features, ShamsEldin et al.62 reported SVM-based classification accuracies above 98% across depression, anxiety, and stress categories.

Our contributions

In this study, we propose a proof-of-concept for a new diagnostic technology. First, we introduce a non-invasive strategy that uses sEVs to modulate microglial morphology for the classification of MDD, BD, and CTRL subjects. Second, we develop a DL-based image analysis pipeline that achieves high diagnostic accuracy from microglial cell morphology. Third, we propose a structured array-based image organization that enables spatially enriched image classification, and facilitates data augmentation through cell image permutation, flipping, and rotation. Finally, we address challenges posed by variable data quality and sample imbalance by generating multiple augmented image arrays per subject. This framework offers a powerful and scalable approach that integrates biological signal amplification with context-aware DL, paving the way for precision diagnostics in psychiatry, and beyond.

Materials and methods

Patients

Participants with bipolar disorder (BD) and major depressive disorder (MDD), as well as healthy control participants (CTRL), were recruited at Clínica Universidad de los Andes by a psychiatrist with expertise in mood disorders. The study was approved by the Comité Ético-Científico of Universidad de los Andes (approval #CEC201975). All procedures and methods were performed in accordance with relevant institutional and national guidelines and regulations, and in compliance with the Declaration of Helsinki. Written informed consent was obtained from all participants. Each group included 15 participants (BD, MDD, and CTRL; total n = 45). Inclusion and exclusion criteria are detailed in Supplementary Materials S1, and participant demographic and clinical characteristics are provided in Table S1.

Animals

Sprague-Dawley rats were acquired from the Animal Facility of Pontificia Universidad Católica de Chile, Santiago, Chile. All animal procedures and methods were conducted in accordance with the ARRIVE guidelines and with institutional regulations for the care and use of laboratory animals, and were approved by the Universidad de los Andes Bioethical Committee (approval #CEC202039).

Microglial cell culture

Postnatal day 1–2 (P1–P2) rat pups were euthanized without anesthesia by rapid decapitation using sharp scissors for immediate brain tissue collection, in accordance with institutional guidance for neonatal euthanasia. Euthanasia was performed by trained personnel in accordance with institutional regulations and international guidelines. Mixed glial cells were isolated from the telencephalic portion of 1–2-day-old Sprague–Dawley rat brains as previously described63 and seeded in 100 mm treated plates (1 whole brain per plate). After 14 days, confluent mixed glial cultures were gently swirled for 60 s in a clockwise and then an anticlockwise manner, as previously described64, to obtain a pure microglial suspension. Next, 10,000 to 20,000 cells were seeded in 96-well microscopy plates (Falcon, 353219) previously treated with 0.1 mg/ml poly-L-lysine (Sigma). After 48 h, each well was treated for 24 h with 12 µg of protein from the corresponding patient-derived plasma sEVs.

Extracellular vesicle isolation

To obtain patient plasma, 15–20 mL of blood were collected, centrifuged, and subjected to a Ficoll® gradient by mixing 4 mL of blood with 4 mL of Ficoll®. The mixture was then centrifuged at 400 RCF for 45 min at 4 °C without applying a brake; the plasma remained in the upper phase. Next, the plasma was centrifuged at 2000 RCF for 30 min at 4 °C, using 500 µL per tube. The resulting supernatant was centrifuged again at 10,000 RCF for 40 min at 4 °C. Then, 480 µL of the supernatant were collected and incubated with 240 µL of a commercial kit (Total Exosome Isolation Kit, Invitrogen, #4478359) for 16 h under rotary agitation at 4 °C. Finally, the sample was centrifuged at 10,000 RCF for 1 h at 4 °C, and the pellet, corresponding to sEVs, was resuspended in 500 µL of sterile PBS. This sEV fraction displayed the particle size distribution (in nm) and molecular markers (such as flotillin and CD63) expected for extracellular vesicles. The size distribution was determined by nanoparticle tracking analysis (NTA), while molecular markers were detected by Western blot as described65 and shown in Supplementary Materials, Fig. S1.

Immunofluorescence

Treated cells were fixed with a solution of 4% paraformaldehyde (PFA) plus 4% sucrose for 20 min at room temperature, then washed twice with PBS containing 0.5% BSA. Blocking was performed for 30 min at room temperature with 5% BSA, followed by two washes with PBS containing 0.5% BSA; finally, the cells were permeabilized with 0.3% Triton X-100. Cells were stained with the nuclear marker DAPI (D1306, Invitrogen), the microglia-specific cytoplasmic marker Iba-1 (019-19741, Fujifilm), and β-actin (A5441, Sigma-Aldrich).

Equipment and settings

Nanoparticle tracking analysis (NTA) was performed using a NanoSight NS300 instrument (Malvern Panalytical) with the manufacturer's NTA software, version 3.2. Samples were diluted 1:100 in DPBS immediately before analysis. The camera level was set to 8 and the detection threshold to 3 for all samples. Western blotting was performed using X-ray films, which were later scanned; the images were processed using Adobe Photoshop. Automated microscopy images encoded in 24-bit RGB color space were acquired using a Cellomics ArrayScan XTI microscope (Thermo Fisher Scientific) with a 20× objective (NA 0.4) at a resolution of 1104 × 1104 pixels per channel. Each raw image contained artifacts that were removed to leave only isolated microglial cells. Artifacts included stains, overlapping microglial cells, and microglial cells overlapping the edge of the image.

Image preprocessing

Individual cells were identified in the raw microglial images. The red channel of the raw images was used because the Iba-1 cytoplasmic stain renders microglial cells predominantly red, so this channel carries most of the cellular information. The red channel was binarized using an automatic threshold computed with the non-parametric, unsupervised Otsu method66. All blobs connected to the image boundary were removed, since the edge of the image would section the cells. Additionally, we removed small blobs with an area of less than 400 pixels because they are too small to contain microglial cells. Figure 1 shows an example of the blob detection process; note that blobs 9, 12, and 14 are eliminated because they are sectioned by the edge of the image. After blob detection, the centroid of each remaining blob inside the binarized images was computed. These centroids were used to extract the cells into sub-images of 75 × 75 pixels from the raw image, thus obtaining individual cells. Finally, stain artifacts (green spots) were removed, as shown in Fig. 2. In addition, Fig. 3 shows the distribution of individual cells for each class. The violin plot represents the density distribution of cell counts, highlighting the spread and concentration of data points. The embedded scatter points indicate individual subjects, while the dashed lines indicate quartiles. Subjects with the highest and lowest cell counts within each class are annotated accordingly. Subject 3,650,109 of the BD class has the largest number of cells (248 instances), while subject 1029 of the CTRL class has the fewest (71 instances). The final dataset is described in Table 1.
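The thresholding and blob-filtering steps above can be sketched in a dependency-light form. The snippet below is an illustrative numpy-only reconstruction, not the authors' implementation: the helper names `otsu_threshold`, `label_blobs`, and `keep_valid_blobs` are hypothetical, and only the 400-pixel minimum area and the border-exclusion rule come from the text.

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: pick the threshold maximizing between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    omega = np.cumsum(p)                # class-0 probability up to each level
    mu = np.cumsum(p * np.arange(256))  # cumulative intensity mean
    mu_t = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b[~np.isfinite(sigma_b)] = 0.0
    return int(np.argmax(sigma_b))

def label_blobs(mask):
    """4-connected component labeling via an iterative flood fill."""
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=int)
    current = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and labels[sy, sx] == 0:
                current += 1
                stack = [(sy, sx)]
                while stack:
                    y, x = stack.pop()
                    if 0 <= y < h and 0 <= x < w and mask[y, x] and labels[y, x] == 0:
                        labels[y, x] = current
                        stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return labels, current

def keep_valid_blobs(mask, min_area=400):
    """Drop blobs touching the image border or smaller than min_area pixels;
    return (label, area, centroid) for each surviving blob."""
    labels, n = label_blobs(mask)
    border = set(labels[0, :]) | set(labels[-1, :]) | set(labels[:, 0]) | set(labels[:, -1])
    kept = []
    for lab in range(1, n + 1):
        area = int((labels == lab).sum())
        if lab not in border and area >= min_area:
            ys, xs = np.nonzero(labels == lab)
            kept.append((lab, area, (float(ys.mean()), float(xs.mean()))))
    return kept
```

The returned centroids correspond to the points around which the 75 × 75 pixel sub-images would be cropped.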

Fig. 1

(a) Example of a raw microglial cells image. (b) The detected blobs of subject 1039 of the CTRL class. Blobs 9, 12, and 14 were eliminated because the edge of the image sectioned part of the cell in those blobs.

Fig. 2

Artifact removal process. (a) Example of an individual cell surrounded by two green stains. These stains were detected in (b) and refilled with the mean background color, as shown in (c).

Table 1 Number of cells available after artifact removal for each class BD, CTRL, and MDD.
Fig. 3

Distribution of cell counts per class after artifact removal. Arrows indicate the subjects with the highest and lowest cell counts within each class. Across the entire dataset, subject 3,650,109, from the BD class, exhibits the highest count with 248 samples, while subject 1029, from the CTRL class, has the lowest count with 71 samples.

Proposed pipeline

The overall pipeline of this research is shown in Fig. 4. Our proposal is that the combined use of CNNs and microglial cells, the latter acting as cellular sensors, can improve diagnostic accuracy. The method begins with the extraction of patient-derived sEVs from blood plasma. These vesicles are applied to cultured microglia, which act as biological sensors capable of exhibiting disease-specific morphological responses. Following this treatment, the microglia are imaged using fluorescence microscopy to capture their morphological features. These images are then processed to detect and segment individual microglial cells.

Since each microglial cell may not provide enough information to correctly classify the three classes \(\Omega=\{BD, MDD, CTRL\}\) (\(|\Omega|=3\)), we developed an alternative method using cells organized into arrays instead of individual cell instances67. In this way, there is a higher probability of presenting to the classifier at least one microglial cell that reacted positively to the patient-derived sEVs. In cases where two or more cell sensors within the array react positively to sEVs, the CNN model will learn common features among examples of the same class68. Therefore, rather than analyzing cell images in isolation, the pipeline groups the images into structured arrays of fixed dimensions (M×M cells per array), which serve as input to a CNN based on the DenseNet12123 architecture. This network was initially pre-trained on the ImageNet dataset69 and subsequently fine-tuned to classify microglial arrays into one of the three diagnostic categories of \(\Omega\): BD, MDD, or CTRL. Additionally, to improve accuracy, the method aggregates predictions from multiple arrays belonging to the same subject by summing the class-specific confidence scores output by the CNN model. The class with the highest cumulative confidence determines the final subject-level diagnosis. This hierarchical strategy, which shifts classification from individual cells to grouped arrays, improves accuracy and generalization.

Training and testing protocols and dataset generation

The evaluation protocol corresponds to repeated subject-disjoint random splits, also referred to as repeated random subsampling or repeated holdout validation, which has been similarly adopted in other problem settings70. Five independent iterations were performed, each constructing a subject-disjoint partition by randomly assigning subjects to training and test sets, stratified by class (\(N_{s}^{train}=10\) subjects per class for training, \(N_{s}^{test}=5\) per class for testing), yielding an overall 30/15 training/held-out split per iteration. As is inherent to this procedure, a given subject may appear in the held-out set across multiple iterations or potentially in none. In our implementation, the five random splits were constructed such that every subject appeared in the held-out test set in at least one iteration. Table 2 reports the held-out frequency for each of the 45 subjects across the five iterations. Consequently, the sets of cell images used for training, \(\phi^{train}\), and testing, \(\phi^{test}\), depend on the iteration partition and the selected subjects.
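The split protocol above can be sketched with the standard library alone. This is an illustrative sketch under the stated 10-train/5-test per-class assignment; `subject_disjoint_splits` is a hypothetical helper name, and the authors' actual implementation is not specified.

```python
import random

def subject_disjoint_splits(subjects_by_class, n_train=10, n_iters=5, seed=0):
    """Repeated subject-disjoint random splits, stratified by class.

    subjects_by_class: dict mapping class label -> list of subject IDs.
    Each iteration assigns n_train subjects per class to training and
    the remaining subjects of that class to the held-out test set.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(n_iters):
        train, test = [], []
        for cls, ids in subjects_by_class.items():
            shuffled = ids[:]
            rng.shuffle(shuffled)
            train += [(cls, s) for s in shuffled[:n_train]]
            test += [(cls, s) for s in shuffled[n_train:]]
        splits.append((train, test))
    return splits
```

Because subjects are reshuffled independently per iteration, a subject may be held out several times or never, as noted in the text.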

Table 2 Held-out frequency for each of the 45 subjects across the five iterations of the repeated subject-disjoint random splits. Each iteration randomly assigned 10 subjects per class to the training set and 5 per class to the test set. All subjects appeared in the held-out test set in at least one iteration.

Each array is created by randomly selecting individual microglial cell images from a single subject; thus, each array belongs to the class of that subject. Each cell image may occupy any location within the array, and therefore different arrays for the same subject can be generated by permuting the cell images. We use the letter M to specify the size of the array; for example, Fig. 6 shows arrays for M = 5, 6, and 7. Each cell image is selected only once within a given array. Note that the arrays have equal aspect ratios and that their size depends on M, since individual cell images are 75 × 75 pixels. In Fig. 6, the three arrays are presented at equal size, although the number of pixels in each depends on M.
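Assembling one M×M array from a subject's 75 × 75 pixel cell crops can be sketched as below. This is a minimal numpy sketch, not the authors' code; `build_array` is a hypothetical helper, and sampling without replacement reflects the rule that each cell image is selected only once per array.

```python
import numpy as np

CELL = 75  # pixel size of each individual cell image (from the text)

def build_array(cells, M, rng):
    """Tile M*M cell images (each CELL x CELL x 3), sampled without
    replacement from one subject's cells, into one (M*CELL, M*CELL, 3) mosaic."""
    idx = rng.choice(len(cells), size=M * M, replace=False)
    mosaic = np.zeros((M * CELL, M * CELL, 3), dtype=cells[0].dtype)
    for k, i in enumerate(idx):
        r, c = divmod(k, M)  # fill row by row
        mosaic[r * CELL:(r + 1) * CELL, c * CELL:(c + 1) * CELL] = cells[i]
    return mosaic
```

Repeated calls with fresh random index permutations yield the different arrays per subject described in the text.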

Data augmentation is performed by permutation of cell image position within each array, and by vertical or horizontal cell image flips, rotations, and rotations with flips. In summary, the affine transformations are the following: vertical flip, horizontal flip, double flip, rotate 90° (counterclockwise), rotate 90° + vertical flip, rotate 90° + horizontal flip, rotate 270° (clockwise).
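The original orientation plus the seven listed transformations form the eight orientation variants of a square (the dihedral group D4). The sketch below enumerates them with numpy; `dihedral_transforms` is a hypothetical helper name, and the "rotate 270°" entry is interpreted as the rotation complementary to the 90° counterclockwise one, so that all eight variants are distinct.

```python
import numpy as np

def dihedral_transforms(img):
    """Return the eight orientation variants used for augmentation:
    identity, vertical flip, horizontal flip, double flip, 90 deg rotation,
    90 deg + vertical flip, 90 deg + horizontal flip, and 270 deg rotation."""
    r90 = np.rot90(img)                 # 90 deg counterclockwise
    return [
        img,                            # original instance
        np.flipud(img),                 # vertical flip
        np.fliplr(img),                 # horizontal flip
        np.flipud(np.fliplr(img)),      # double flip (equals 180 deg rotation)
        r90,                            # rotate 90 deg
        np.flipud(r90),                 # rotate 90 deg + vertical flip
        np.fliplr(r90),                 # rotate 90 deg + horizontal flip
        np.rot90(img, k=3),             # rotate 270 deg (i.e., 90 deg clockwise)
    ]
```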

The weighted voting process is illustrated in Fig. 5. The pipeline proceeds through five stages. In Stage 1, for each subject U, a set of P = 150 image arrays is generated, each consisting of M×M microglial cell images randomly sampled from that subject, with affine transformations applied for augmentation. In Stage 2, each array is resized to 512 × 512 pixels and processed by the fine-tuned DenseNet121 CNN, which generates a confidence vector \(\pi=[\pi_{BD}, \pi_{MDD}, \pi_{CTRL}]\) via softmax activation. In Stage 3, the P confidence vectors generated for subject U are collected in the set \(\Psi_{U}=\{\pi_{1}, \pi_{2}, \dots, \pi_{P}\}\), where each array may yield different confidence levels for each class, reflecting the natural variability across different cell combinations. In Stage 4, the confidence scores are summed per class across the P arrays according to \(S_{c}=\sum_{i=1}^{P}\pi_{i,c}\) for each \(c\in\Omega\), accumulating soft probabilities rather than counting hard votes. In Stage 5, the argmax function is applied to the accumulated scores, \(\widehat{y}_{U}=\arg\max\left(S_{BD}, S_{MDD}, S_{CTRL}\right)\), and the class with the highest accumulated confidence score is assigned as the final subject-level predicted label. In the illustrated example, a BD subject has accumulated scores of \(S_{BD}=128.4\), \(S_{MDD}=14.2\), and \(S_{CTRL}=7.4\) after summing over P = 150 arrays, resulting in a correct BD classification.
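Stages 3–5 (accumulating softmax vectors and taking the argmax) reduce to a few lines. This is a minimal sketch under the assumption that each row of `confidences` is one array's softmax output; `subject_prediction` is a hypothetical helper name.

```python
import numpy as np

def subject_prediction(confidences, classes=("BD", "MDD", "CTRL")):
    """Confidence-weighted voting: sum the P softmax vectors per class
    (S_c = sum_i pi_{i,c}) and return the class with the highest score."""
    scores = np.asarray(confidences, dtype=float).sum(axis=0)
    return classes[int(np.argmax(scores))], scores
```

Summing soft probabilities (rather than counting hard per-array votes) lets arrays with confident predictions contribute more to the subject-level decision.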

Fig. 4

Detailed overview of the proposed image-based pipeline for morphological classification of cellular sensors (cultured rat microglial cells) using patient-derived sEVs. The pipeline is organized into three functional blocks. Block 1: sEV isolation and microglial treatment. sEVs are isolated from patient blood plasma and applied to cultured microglia, which serve as biosensors capable of developing disease-specific morphological responses. After exposure to sEVs, microglial cells are fixed and immunostained. Immunostaining is used to visualize nuclei (DAPI, blue), microglial cytoplasm (Iba-1, red), and β-actin filaments (green). Block 2: image preprocessing. Fluorescence images are analyzed using blob detection to extract individual microglial cells, which are grouped into structured image arrays of size M×M. Each array includes cells from a single subject and is augmented by applying random affine transformations, including flips and rotations. A total of P arrays are generated per subject, enhancing sampling diversity and generalization. Block 3: deep learning-based classification. Arrays serve as input to a DenseNet121 CNN, pre-trained on ImageNet and fine-tuned for this task. The final layer of the CNN is adapted to classify each microglial morphology into one of three diagnostic categories Ω: BD, MDD, or CTRL. Confidence scores π are computed for each class, and final subject-level predictions are obtained through a weighted voting mechanism that aggregates scores across all the arrays from a given subject. This hierarchical framework improves diagnostic accuracy by incorporating both cellular diversity and subject-level context.

Fig. 5

Weighted voting decision pipeline for subject-level classification. For each subject, P = 150 image arrays of 5 × 5 microglial cells are generated with random permutations and affine transformations (Stage 1) and individually classified by a fine-tuned DenseNet121 CNN, which outputs a softmax confidence vector \(\:\pi\:\:=\:[{\pi\:}_{BD},\:{\pi\:}_{MDD},\:{\pi\:}_{CTRL}]\) per array (Stages 2–3). The confidence scores are then accumulated per class across all arrays (Stage 4), and the final diagnosis is assigned via argmax over the cumulative scores (Stage 5). In the illustrated example, the BD class dominates with 128.40 versus 14.20 (MDD) and 7.40 (CTRL), resulting in a BD prediction.

Fig. 6

Example of three arrays of microglial cell images for M = 5, 6, and 7. The three images have been resized to show them at the same size.

Fig. 7

Eight transformations are applied to a selected cell: (a) Original instance. (b) Vertical flip. (c) Horizontal flip. (d) Double flip. (e) Rotated 90° counterclockwise. (f) Rotated 90° + vertical flip. (g) Rotated 90° + horizontal flip. (h) Rotated 270° clockwise.

Fig. 8

Set of arrays Φ that represent a subject. The configuration used for the figure corresponds to M = 5; P = 5.

Fig. 9

Example of two arrays from the same subject. Note that the cells within the green and red bounding boxes appear not only in different positions in the arrays but are also placed using a different affine transformation.

Therefore, when a cell is selected to be part of an array, one of the eight transformations (the original orientation plus the seven affine transformations) is applied at random, each with probability 1/8, as illustrated in Fig. 7. Thus, if the same cell is chosen again for a subsequent array, it may be placed at a different location within the M×M grid and under a different transformation. The number of arrays \(|\Phi|\) generated for each subject is P, as shown in Fig. 8; in this way, we obtain P arrays per subject. To increase variability across the P dimension, we also applied these affine transformations as data augmentation to the arrays of selected cells. Figure 9 shows two arrays from the same subject in which shared cells appear at different positions and under different affine transformations.

The method was tested with M ranging from 3 to 10 and P ranging from 100 to 200. M was also used to balance the number of samples from the different subjects. Therefore, the number of arrays \(|\Phi|\) generated for the training and testing datasets in each split iteration depends on the P values as follows:

$$\left|\Phi^{\text{train}}\right|=\left|\Omega\right|\times N_{s}^{train}\times P,$$
(1)
$$\left|\Phi^{\text{test}}\right|=\left|\Omega\right|\times N_{s}^{test}\times P,$$
(2)

where \(\left|\Omega\right|=3\), \(N_{s}^{train}=10\), and \(N_{s}^{test}=5\). The total number of cells in each array is M×M, while the number of cell instances per subject is M×M×P, as seen in Fig. 8.
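Plugging the selected configuration into Eqs. (1)–(2) gives the concrete dataset sizes. The sketch below is a simple worked computation; `dataset_sizes` is a hypothetical helper, with defaults taken from the text (3 classes, 10/5 train/test subjects per class, M = 5, P = 150).

```python
def dataset_sizes(n_classes=3, n_train=10, n_test=5, M=5, P=150):
    """Eqs. (1)-(2): number of arrays per split, plus the per-subject
    count of cell instances (M*M cells per array, P arrays per subject)."""
    return {
        "arrays_train": n_classes * n_train * P,  # |Phi_train|
        "arrays_test": n_classes * n_test * P,    # |Phi_test|
        "cells_per_subject": M * M * P,           # cell instances per subject
    }
```

With the defaults this yields 4500 training arrays, 2250 test arrays, and 3750 cell instances per subject (cells are reused across arrays, since subjects contribute far fewer unique cells).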

Convolutional neural network classifier

The proposed pipeline includes a classifier based on a CNN that has been pre-trained using the ImageNet dataset69. ImageNet is a large-scale visual database designed for visual object recognition research. It contains millions of labeled images across thousands of categories, making it an essential resource for training and benchmarking DL models23. The dataset is widely used in computer vision to pre-train models, which are then fine-tuned for specific tasks.

Using the pre-trained weights of a DenseNet12123, we performed fine-tuning on the cell dataset, deleting the last layer of the model, which contains 1000 neurons, and adding a layer with three neurons corresponding to the classes in \(\Omega=\{BD, MDD, CTRL\}\). The resulting model contains 6,956,931 trainable parameters.

The output of the final layer was activated using the softmax function, defined as

$$\pi_{i,c}=\frac{e^{z_{i,c}}}{\sum_{k\in\Omega}e^{z_{i,k}}},$$
(3)

where \(z_{i,c}\) denotes the logit of class \(c\) for image \(i\), and \(\pi_{i,c}\) represents the predicted probability for that class. The network was optimized using the categorical cross-entropy loss, defined as:

$$CE=-\sum_{i}\sum_{c\in\Omega}y_{i,c}\log\left(\pi_{i,c}\right),$$
(4)

where \(y_{i,c}\) is the one-hot-encoded ground-truth label.
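Equations (3) and (4) can be verified numerically. The sketch below is a plain numpy implementation of the softmax activation and categorical cross-entropy; the max-shift stabilization and the small `eps` guard are standard additions not mentioned in the text.

```python
import numpy as np

def softmax(z):
    """Eq. (3): row-wise softmax over class logits (max-shifted for stability)."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_onehot, pi, eps=1e-12):
    """Eq. (4): categorical cross-entropy summed over a batch; eps avoids log(0)."""
    return float(-(y_onehot * np.log(pi + eps)).sum())
```

For a single example, only the true class contributes, so the loss reduces to the negative log of its predicted probability.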

Experiments and model evaluation

We performed two types of experiments. In the first, the inputs to the classifier were individual cells φ, while in the second, cell arrays Φ were the inputs. We used the pre-trained DenseNet121 after evaluating various architectures and determining that several of them yielded similar results, as shown in Table S2, Supplementary Materials. The evaluated architectures were DenseNet121, DenseNet169, and DenseNet20123, as well as ResNet18, ResNet34, and ResNet5071. The performance obtained by DenseNet121 was similar to that of ResNet34 (p > 0.05); however, the DenseNet121 model contains fewer parameters.

Two distinct categories of hyperparameters were involved in the pipeline, and they were handled differently to guard against optimistic bias. The first category comprises the training hyperparameters of the CNN (learning rate, weight decay, batch size, number of epochs, and learning rate decay schedule). These were determined using only the training partition of the first split (i.e., monitored on a held-out validation subset drawn from the training subjects, not from the test subjects), and then kept fixed for the remaining four splits. Critically, the four remaining test partitions were never exposed to any tuning decision and therefore provide unbiased performance estimates.

The second category comprises the pipeline design parameters M (array size) and P (number of arrays per subject). These were selected using a nested cross-validation strategy to ensure that hyperparameter tuning and model evaluation were performed on strictly separate data. In the inner loop, the 30 training subjects (10 per class) were further divided into 24 subjects for training and 6 for validation, and a 5-fold cross-validation was conducted over all tested combinations (M ∈ {3, 5, 7}, P ∈ {100, 150, 200}). This inner evaluation identified M = 5 and P = 150 as the configuration offering the best balance between accuracy and variance. In the outer loop, the model was retrained using all 30 inner-loop subjects with the selected hyperparameters and evaluated on the remaining 15 held-out test subjects (5 per class), which were never used during hyperparameter selection (full results of the outer-loop evaluation across all tested combinations are reported in Table S4, Supplementary Materials). This nested design prevents information leakage between model selection and performance estimation, ensuring that the reported results reflect an unbiased evaluation of the chosen configuration. The CNN backbone architecture was likewise selected via an inner-loop evaluation conducted prior to the main evaluation pipeline, with full results reported in Table S2 (Supplementary Materials); DenseNet121 was chosen because it matched the best-performing alternatives (p > 0.05) while requiring fewer parameters.
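The structure of this nested, subject-disjoint selection can be sketched in pure Python (our own illustrative skeleton: subject IDs are placeholders and `validation_score` stands in for training and validating the CNN):

```python
import random
from itertools import product

random.seed(0)

# 45 subjects, 15 per class; IDs here are illustrative placeholders.
subjects = {c: [f"{c}_{i}" for i in range(15)] for c in ("BD", "MDD", "CTRL")}

# Outer split: 10 training and 5 held-out test subjects per class.
train = {c: ids[:10] for c, ids in subjects.items()}
test = {c: ids[10:] for c, ids in subjects.items()}

def validation_score(m, p, train_ids, val_ids):
    # Placeholder for training the CNN with m x m arrays (p per subject)
    # on train_ids and measuring accuracy on val_ids.
    return random.random()

# Inner loop: 5-fold CV over the 30 training subjects (24 train / 6 validation),
# grid-searching M and P.
best = None
for m, p in product((3, 5, 7), (100, 150, 200)):
    fold_scores = []
    for fold in range(5):
        val_ids = [s for ids in train.values() for s in ids[2 * fold:2 * fold + 2]]
        train_ids = [s for ids in train.values() for s in ids if s not in val_ids]
        fold_scores.append(validation_score(m, p, train_ids, val_ids))
    score = sum(fold_scores) / len(fold_scores)
    if best is None or score > best[0]:
        best = (score, m, p)

# Outer evaluation: retrain on all 30 training subjects with the selected
# (M, P) and test once on the 15 held-out subjects, never seen during tuning.
train_all = [s for ids in train.values() for s in ids]
test_all = [s for ids in test.values() for s in ids]
assert not set(train_all) & set(test_all)  # subject-disjoint by construction
```

The final assertion makes the leakage guarantee explicit: no subject contributing cells to model selection can appear in the outer test set.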

Finally, the subject-level aggregation strategy (confidence-weighted voting, Eq. 6) is not a tuned hyperparameter but a fixed design choice of the pipeline, applied uniformly across all configurations and folds.

During training, both classifiers (with inputs φ and Φ) were trained with a batch size of 64 for 150 epochs. We employed the Adam optimizer with a momentum parameter (β1) of 0.8 and a weight decay of 0.0001. Input images φ were resized to 64 × 64 pixels for the individual-cell classifier, and to 512 × 512 pixels for the array Φ classifier. Initial learning rates were set to 0.0001 and 0.001 for the individual-cell and array classifiers, respectively, and the learning rate was decayed by a factor of 0.8 every 50 epochs. These hyperparameters were tuned to achieve the best performance for both classifiers. Training was performed on two NVIDIA GeForce RTX 3080 Ti GPUs in parallel.
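The step-decay schedule amounts to multiplying the base learning rate by 0.8 every 50 epochs; a minimal sketch (function name is ours):

```python
def learning_rate(epoch, base_lr, gamma=0.8, step=50):
    """Step decay: the learning rate is multiplied by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Array classifier: base lr 0.001, decayed at epochs 50 and 100 over 150 epochs.
lr_start = learning_rate(0, 0.001)    # 0.001
lr_mid = learning_rate(50, 0.001)     # 0.001 * 0.8
lr_end = learning_rate(149, 0.001)    # 0.001 * 0.8**2
```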

We report accuracy at two levels. Accuracy for the multiclass classification task is computed as follows:

$$ACC\left(y, \widehat{y}\right) = \frac{1}{N}\sum_{i=0}^{N-1} 1\left(y_i = \widehat{y}_i\right),$$
(5)

where \(N\) represents the total number of samples (\(|\phi^{test}|\) and \(|\Phi^{test}|\) for cell-image and array inputs, respectively), \(y_i\) is the true label for the \(i\)-th instance, and \(\widehat{y}_i\) is the corresponding predicted label. The indicator function \(1(y_i = \widehat{y}_i)\) returns 1 if the predicted label matches the true label, and 0 otherwise. The summation counts the number of correct predictions, and dividing this sum by \(N\) yields the overall accuracy.
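Equation (5) is the standard fraction of correct predictions; in NumPy (illustrative, with an integer label encoding of our own choosing):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Eq. (5): fraction of instances whose predicted label matches the true one."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Example with labels 0 = BD, 1 = MDD, 2 = CTRL.
acc = accuracy([0, 1, 2, 2], [0, 1, 1, 2])  # 3 of 4 correct -> 0.75
```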

At the first level, accuracy assessed the model on individual cell images or array images, considering the labels and predictions for each cell or array, respectively. At the second level, the class of the subject was evaluated instead of that of individual cells or arrays; all cells/arrays from a subject voted for the predominant class. We applied confidence-weighted voting to determine the predominant class of the subject, as follows:

$$\widehat{y}_U = \arg\max_{c \in \Omega}\left(\sum_{i \in \psi_U} \pi_{i,c}\right)$$
(6)

where \(U\) denotes each subject in the test partition, \(\psi_U\) corresponds to the set of cells/arrays associated with subject \(U\) (from \(\phi^{test}\) and \(\Phi^{test}\), respectively), and \(\pi_{i,c}\) represents the model confidence for class \(c\) obtained from cell/array \(i\). The final subject label \(\widehat{y}_U\) thus corresponds to the class with the highest accumulated confidence across all its associated inputs.
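The confidence-weighted vote of Eq. (6) simply sums the per-input softmax rows for a subject and takes the arg max; a NumPy sketch (array and function names are ours):

```python
import numpy as np

CLASSES = ("BD", "MDD", "CTRL")

def subject_vote(pi):
    """Eq. (6): pi has shape (n_inputs, n_classes), one softmax row per cell/array."""
    return CLASSES[int(np.argmax(pi.sum(axis=0)))]

# Three arrays from one subject: two lean MDD, one leans BD.
pi = np.array([[0.2, 0.7, 0.1],
               [0.1, 0.6, 0.3],
               [0.5, 0.3, 0.2]])
label = subject_vote(pi)  # accumulated confidence: BD 0.8, MDD 1.6, CTRL 0.6
```

Note that because full confidences are accumulated rather than hard votes, a few high-confidence inputs can outweigh many uncertain ones.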

Results

We first assessed the potential of the morphology of individual cellular sensors for disease classification; for this, we trained a classifier with isolated microglia cell images \(\phi\). Performance across folds remained consistently low, with accuracy values under 60% (Fig. 11a, blue bars). This limitation suggests that morphological variance among individual microglia does not provide sufficient information for reliable differential diagnosis among BD, MDD, and CTRL samples. Interestingly, when votes from multiple cells belonging to the same subject were aggregated (subject-level voting), classification improved notably, reaching 80% accuracy or higher in four out of five folds, and 73.34% in the fifth (Fig. 11a, green bars). This marked difference reveals the importance of considering intercellular context within subjects when characterizing disease signatures.

To further improve prediction, we tested whether grouping cell images into arrays \(\Phi\) of fixed sizes (M = 3, 5, or 7), while varying the number of arrays generated for each subject (P = 100, 150, 200), improved diagnostic accuracy. This strategy boosted array-level accuracy significantly (Fig. 11b). With M = 3, average accuracy improved over the single-cell baseline but remained modest (83.2%). Increasing M to 5 yielded a considerable improvement (90.5%), while M = 7 resulted in the highest average accuracy (93.3%), although with slightly increased variance across folds, as shown in Table S3, Supplementary Materials. This trade-off suggests that larger arrays may better capture the morphological diversity induced by sEVs exposure of our cellular sensors, but may introduce variability in cases where individual subject samples are heterogeneous.
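The array construction (P random M × M mosaics sampled from a subject's segmented cells) can be sketched in NumPy; this is our own illustrative implementation, with random stand-ins for cell images, and sampling with replacement is an assumption:

```python
import numpy as np

def build_arrays(cells, m=5, p=150, rng=None):
    """Tile m*m randomly chosen cell images (each h x w) into p mosaic arrays."""
    rng = rng or np.random.default_rng()
    h, w = cells.shape[1:3]
    arrays = np.empty((p, m * h, m * w), dtype=cells.dtype)
    for k in range(p):
        # Random combination of the subject's cells (replacement is an assumption).
        idx = rng.choice(len(cells), size=m * m, replace=True)
        grid = cells[idx].reshape(m, m, h, w)
        # Rearrange (row, col, h, w) into a single (m*h, m*w) mosaic image.
        arrays[k] = grid.transpose(0, 2, 1, 3).reshape(m * h, m * w)
    return arrays

cells = np.random.default_rng(0).random((300, 64, 64))  # stand-ins for segmented cells
arrays = build_arrays(cells, m=5, p=10)                 # 10 mosaics of 5 x 5 cells
```

With 64 × 64 cell crops, M = 5 yields 320 × 320 mosaics, which are then resized to the 512 × 512 classifier input described above.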

Notably, voting among arrays of the same subject further improved classification, achieving near-perfect results for the M = 5 and M = 7 conditions (Fig. 11c). In four of the five folds, all subjects were classified correctly regardless of the number of arrays per subject, P. Only one subject (1160028) was repeatedly misclassified across configurations, suggesting either an atypical microglial response or a labeling error. Interestingly, although M = 7 achieved slightly higher overall accuracy than M = 5, it produced an error in fold 3, in addition to the misclassification of the same subject (1160028) in fold 5 for P = 150 (Fig. 11c). These results suggest that M = 5 may represent an optimal balance between accuracy and variance. Together, these results indicate that disease-specific patterns of microglial morphology emerge more clearly when cells are analyzed in structured groups rather than in isolation. Subject-level classification further enhances the reliability of predictions, highlighting the value of contextualizing single-cell features within higher-order spatial or sampling structures. These findings support the development of diagnostic tools that incorporate hierarchical aggregation strategies to account for both intercellular and intersubject variability.

To provide a comprehensive evaluation of diagnostic classification performance beyond overall accuracy, Fig. 10 presents the aggregated confusion matrices at both the array level and the subject level for the best-performing configuration (DenseNet121, M = 5, P = 150), pooled across all five iterations of the repeated holdout validation. Additionally, Table 3 reports the corresponding per-class precision, recall, and F1-score derived from these aggregated confusion matrices. At the subject level, the aggregated confusion matrix shows that 44 out of 45 subjects were correctly classified, with only one subject (1160028) consistently misclassified across iterations.

Fig. 10

Accumulated confusion matrices across all cross-validation folds. (A) Array-level predictions, where each sample corresponds to an individual array, achieving an overall accuracy of 90.57%. (B) Subject-level predictions obtained through weighted voting across arrays belonging to the same subject, yielding an overall accuracy of 98.67%.

Table 3 Per-class precision, recall, and F1-score for the best-performing configuration (DenseNet121, M = 5, P = 150), derived from the aggregated confusion matrices pooled across all five iterations of the repeated holdout validation. Results are reported at both the array level and the subject level (weighted voting). Macro and weighted averages are included for each level.

To enable full reproducibility and transparency, Table 2 lists for each of the 45 subjects: (i) the number of iterations in which the subject appeared in the held-out test set, and (ii) the class label. The subject-level classification accuracy reported for each iteration corresponds to the fraction of test subjects correctly classified via weighted voting (Eq. 6) within the held-out partition of that iteration. The overall result of 44/45 individuals correctly classified was obtained by pooling the per-iteration outcomes: across the five iterations, each correctly classified subject was counted once, and the one subject that was misclassified (subject 1160028) was consistently misclassified across multiple iterations. The per-iteration subject-level accuracy values for the selected configuration (M = 5, P = 150) are reported, yielding a mean accuracy of 98.67% across the five iterations, enabling the reader to independently verify the aggregate result.

To complement these results under a standard stratified protocol, we additionally performed a stratified 5-fold cross-validation on the full cohort (n = 45), with 12 subjects per class for training and 3 per class for testing in each fold (36/9 overall). The results of this additional analysis are reported in Table 4 and are consistent with those obtained under the repeated random splits protocol, with 44/45 subjects correctly classified using the configuration M = 5 and P = 150.

Table 4 Per-fold classification results for the stratified 5-fold cross-validation (DenseNet121, M = 5, P = 150). Each fold used 12 subjects per class for training and 3 per class for testing (36/9 split).

Array-level and subject-level (weighted voting) accuracies are reported, along with the misclassified subjects. Consistent with the repeated random splits protocol, 44 out of 45 subjects were correctly classified, with the same subject (1160028) being the only misclassified case.

Fig. 11

(a) Accuracy of classification using individual microglial cell images \(\phi\) (blue), and subject-level voting (green). Subject-level aggregation improves classification accuracy significantly across all folds, indicating the limitations of single-cell predictions for diagnostic purposes. (b) Accuracy of array-level classification per fold. Bar plots show classification accuracy for microglial image arrays \(\Phi\) grouped by array size (M = 3, 5, or 7), and number of arrays per subject (P = 100, 150, or 200). Each color gradient within M groups indicates P values. Results are averaged across five cross-validation folds. (c) Accuracy of subject-level classification per fold. Bars represent the percentage of subjects correctly classified using voting among arrays of the same subject. Notably, M = 5 achieves consistent performance with minimal variance across all P values. (d) Accuracy at the array level for different configurations of M (3, 5, 7) across the P dimension (100, 150, 200). The results show that larger array dimensions (M = 5 and M = 7) consistently outperform M = 3, with statistically significant differences confirmed through ANOVA and Tukey post-hoc tests.

Figure 11d presents the first-level accuracy across different values of M for each P configuration. The results indicate that configurations with M = 5 and M = 7 achieve higher accuracy compared to M = 3, regardless of the P dimension. However, increasing M also leads to a greater standard deviation. Among the highest performing configurations, M = 5 and P = 150 provide a good balance, yielding high accuracy while maintaining the lowest variability.

The results demonstrate that M = 5 and M = 7 yield superior performance compared to M = 3, regardless of the P dimension. The ANOVA test yielded p-values of 0.0021, 0.0003, and 0.0003 for P = 100, 150, and 200, respectively (p < 0.05 for all cases). This indicates that the differences in performance across the M dimension are statistically significant for each value of P. The Tukey post-hoc test was conducted to explore these differences further. For P = 100, the test revealed statistically significant differences between M = 3 and M = 5 (mean difference = 8.17, p = 0.0126) and between M = 3 and M = 7 (mean difference = 10.49, p = 0.0023). However, no significant difference was observed between M = 5 and M = 7 (mean difference = 2.32, p = 0.6057). Similarly, for P = 150, the Tukey test showed significant differences between M = 3 and M = 5 (mean difference = 7.54, p = 0.0038) and between M = 3 and M = 7 (mean difference = 10.21, p = 0.0003). Again, no statistically significant difference was observed between M = 5 and M = 7 (mean difference = 2.67, p = 0.3474). For P = 200, the test results were consistent with the other configurations, with statistically significant differences between M = 3 and M = 5 (mean difference = 7.50, p = 0.0053) and between M = 3 and M = 7 (mean difference = 10.80, p = 0.0003). As in previous cases, M = 5 and M = 7 did not differ significantly (mean difference = 3.30, p = 0.2352). While M = 5 and M = 7 are statistically equivalent, choosing M = 5 allowed the model to achieve second-level accuracies of 100% in four of the five cross-validation folds.
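This ANOVA-plus-Tukey comparison can be reproduced with SciPy; the sketch below uses synthetic per-fold accuracies of our own invention (the real values are in Table S3, Supplementary Materials), and `scipy.stats.tukey_hsd` requires SciPy ≥ 1.8:

```python
from scipy.stats import f_oneway, tukey_hsd

# Illustrative per-fold array-level accuracies (%) for one value of P;
# the actual values are reported in Table S3, Supplementary Materials.
acc_m3 = [82.0, 83.5, 82.8, 84.1, 83.6]
acc_m5 = [90.2, 91.0, 89.8, 90.9, 90.6]
acc_m7 = [92.8, 93.9, 92.5, 94.0, 93.3]

f_stat, p_value = f_oneway(acc_m3, acc_m5, acc_m7)  # one-way ANOVA across M
posthoc = tukey_hsd(acc_m3, acc_m5, acc_m7)         # pairwise Tukey HSD
# posthoc.pvalue[i][j] holds the adjusted p-value for group i vs group j.
```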

Discussion

This study presents a novel image-based diagnostic framework that integrates CNNs with microglial cells acting as functional cellular sensors for patient-derived sEVs, constituting the first proof-of-concept for this diagnostic technology. Unlike traditional approaches that analyze patient samples directly, our method leverages the capacity of microglia to undergo disease-specific morphological changes in response to sEVs exposure, effectively translating molecular disease signals into measurable cellular phenotypes. This transduction step introduces a biologically meaningful amplification of subtle diagnostic cues, which are then captured through standardized fluorescence imaging and interpreted by a fine-tuned DenseNet121 CNN model. Our approach correctly classified 44/45 subjects across the five repeated subject-disjoint random splits. The best results were achieved with a method that organizes images of individual microglial cells into arrays of 5 × 5 images and uses voting among arrays created from the same subject. A key innovation lies in the hierarchical analysis pipeline, which groups isolated microglial cells into structured image arrays prior to classification. This approach captures the intercellular variability within subjects, enhancing diagnostic accuracy significantly over single-cell-based predictions. Furthermore, the use of confidence-weighted subject-level voting aggregates predictions across arrays, improving accuracy while reducing variance. Together, these strategies represent a conceptual advance in computational pathology by shifting from isolated cell analysis to context-aware, multi-cell inference, and establish a new paradigm for applying AI to psychiatric diagnostics via immune-derived functional readouts.

Our method incorporates several methodological strategies to mitigate overfitting. The dataset partition was strictly subject-disjoint, ensuring that no subject appears in both training and testing within the same fold. All 45 subjects appear in the test partition at least once across the five iterations, providing a comprehensive assessment across the entire cohort. Cross-validation is the established standard practice for internal validation in neuroimaging and psychiatric deep learning studies, particularly when independent external datasets are not available72,73. The DenseNet121 model was pre-trained on ImageNet and subsequently fine-tuned, rather than trained from scratch. Transfer learning has been extensively demonstrated to reduce overfitting risk on small medical imaging datasets by leveraging robust feature representations learned from large-scale natural image datasets74,75. Our augmentation strategy goes beyond standard geometric transformations. The generation of multiple arrays (P ∈ {100, 150, 200}) per subject through random cell permutations creates biologically meaningful data diversity. Since each array contains a different random combination and spatial arrangement of the same subject’s cells, the model is forced to learn generalizable morphological patterns rather than memorize specific cell configurations. Data augmentation is widely recognized as one of the most effective strategies for reducing overfitting in deep learning when working with limited medical imaging data76. The weighted voting mechanism across multiple arrays introduces an ensemble-based aggregation that reduces the impact of individual misclassifications at the subject level. While near-perfect accuracy may initially suggest overfitting, several observations support the biological validity of our results.
Only one subject (1160028) was consistently misclassified across configurations, which may reflect atypical microglial responses or a potential labeling ambiguity rather than model memorization. Furthermore, accuracy at the array level (without voting) was notably lower (90–93%), indicating that the model has not memorized training patterns but rather benefits from the aggregation of multiple, partially informative signals. This progressive improvement from microglial-level (55%) to array-level (93%) to subject-level (98%) accuracy demonstrates a structured signal amplification consistent with genuine biological signals rather than noise fitting.

The present study is based on a reduced sample size, a limitation that should be addressed in future studies, together with the impact of sex, age, medication, other demographic variables, and childhood trauma on patient stratification.

There is currently no objective, scalable, and time-efficient method that can accurately classify mood disorders using peripheral biomarkers such as those found in blood samples. This highlights the urgent need for diagnostic approaches that are both biologically grounded and practically applicable to large populations.

These findings reinforce the effectiveness of combining microglial cell-based biosensing with DL-based classification strategies, providing a robust and scalable solution for mood disorder classification. By using primary microglial cells as cellular sensors, the method detects changes in cell morphology caused by the content of plasma-derived sEVs. These alterations, imperceptible to human observers or traditional image analysis, are effectively decoded through DL techniques. The use of CNNs, particularly when applied to structured image arrays and combined with subject-level voting, significantly enhances the ability of the model to generalize across individuals and diagnostic categories. This dual-layered strategy, biological amplification of patient-specific signals through microglial response, and computational enhancement via array-based CNN analysis, represents a major advancement toward the goal of providing a non-invasive and high-throughput diagnostic tool. Furthermore, the minimal invasiveness of blood collection, and the speed of the image acquisition and analysis pipeline suggest potential for integration into clinical workflows. These elements establish a foundation for future precision psychiatry approaches, where DL can be used to complement and strengthen clinical decision-making in complex mood disorders such as BD, MDD, and potentially other mental disorders.

Regarding clinical scalability, the 24-hour sEV incubation period is comparable to the turnaround times of several routine clinical assays, including blood cultures (2–5 days), batch-run autoimmune/serology panels (1–3 days), and many molecular/genetic tests77,78,79,80. Importantly, once the CNN model is trained, the computational classification step (inference) requires only seconds per subject. From a cost standpoint, the reagents and consumables for sEV isolation, microglial culture, and immunofluorescence staining are considerably lower than those required for neuroimaging modalities such as fMRI or PET. Moreover, the growing availability of automated high-content imaging platforms, such as Opera Phenix (Revvity/PerkinElmer) and ImageXpress Micro Confocal (Molecular Devices), provides feasible routes for further standardization and high-throughput scaling. Most importantly, the 24-hour turnaround must be weighed against the current clinical reality: up to 69% of BD patients are initially misdiagnosed with MDD, and more than 30% may remain misdiagnosed for up to 10 years, with substantial downstream consequences, including inappropriate pharmacotherapy, treatment-emergent mania, and significant personal and economic burden. In this context, a next-day objective diagnostic readout, even if it requires overnight incubation, represents a substantial improvement over years of diagnostic uncertainty.

To provide insight into the features driving classification, we applied Grad-CAM++81 to the final convolutional layer of the fine-tuned DenseNet121 model. The resulting class-discriminative heatmaps reveal that the CNN does not distribute attention uniformly across all cells in the array; instead, a subset of specific cells consistently concentrates the highest activation, indicating that these cells carry greater discriminative power for distinguishing the classes BD, MDD, and CTRL. Additionally, an equivariance analysis82 confirmed that the attention patterns of the model are robust to geometric transformations: when the arrays from a test partition are geometrically transformed, the same cells remain highly activated in most cases. Figure 12 illustrates this behavior for representative subjects from each diagnostic class. Panel (A) shows a BD subject (ID 3650017) under horizontal flip, panel (B) an MDD subject (ID 1160028) under double flip, and panel (C) a CTRL subject (ID 1012) under vertical flip. In all three cases, the Grad-CAM++ heatmaps between original and transformed arrays exhibit high spatial correlation (Pearson r > 0.90), demonstrating that the CNN attends to the same individual cells regardless of spatial arrangement and confirming that it learns genuine cellular features rather than large areas of the array. Notably, this consistency holds even at lower confidence levels, as observed for the MDD subject (60–72% confidence), suggesting that the learned representations are stable across the confidence spectrum. The Grad-CAM++ analysis suggests that the model attends to biologically meaningful cells. The observation that only a fraction of cells within each array drives classification also opens avenues for future work, including pre-selection strategies that prioritize morphologically informative cells to further improve classification performance.
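The equivariance check amounts to applying the same geometric transform to the heatmap of the original array and correlating it with the heatmap obtained from the transformed array; a NumPy sketch (names are ours, and the heatmap here is a random stand-in for a Grad-CAM++ output):

```python
import numpy as np

def equivariance_r(heatmap_orig, heatmap_transformed, transform):
    """Pearson r between transform(heatmap of original) and heatmap of transformed input."""
    a = transform(heatmap_orig).ravel()
    b = heatmap_transformed.ravel()
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
h = rng.random((512, 512))  # stand-in for a Grad-CAM++ heatmap of a 5 x 5 array

# A perfectly equivariant model would produce the flipped heatmap for the
# horizontally flipped input, giving r = 1; real models yield r > 0.90 here.
r = equivariance_r(h, np.fliplr(h), np.fliplr)
```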

Fig. 12

Grad-CAM++ equivariance analysis for representative subjects from each diagnostic class. Each row shows, from left to right, the original 5 × 5 microglial cell array, the transformed version, and the corresponding Grad-CAM++ heatmaps overlaid on the original and transformed arrays. (A) BD subject 3650017 under horizontal flip. The model correctly classifies both versions as BD with 100% confidence, and the attention maps are highly consistent (Pearson r = 0.919). (B) MDD subject 1160028 under double flip. The model correctly classifies both versions, with 60% and 72% confidence for the original and transformed arrays, respectively. The Grad-CAM++ heatmaps remain consistent across transformations (Pearson r = 0.903), indicating stable feature attention even at lower confidence levels. (C) CTRL subject 1012 under vertical flip. The model correctly classifies both versions as CTRL with 100% confidence, and the attention patterns are highly consistent (Pearson r = 0.919). Across all three cases, the high spatial correlation of heatmaps between the original and transformed arrays confirms that the CNN identifies genuine cellular features rather than large areas of the array.

Conclusion

Major Depressive Disorder and Bipolar Disorder are significant public health concerns, contributing extensively to global mental disability and placing a substantial burden on healthcare systems. Traditional diagnostic methods rely on symptom-based assessments, often leading to misdiagnosis and delayed treatment; misdiagnosis rates between BD and MDD are high, exceeding 30%.

In this work, we developed a DL-based method to classify images of microglial cells stimulated with patient-derived sEVs into three classes: BD, MDD, and CTRL. In this way, microglial cells are used as cellular sensors that respond with morphological changes to the complex immune-regulatory content of sEVs. The best results were achieved by organizing images of individual microglial cells into arrays of 5 × 5 images and applying voting among the arrays created from the same subject. We also tested accuracy without voting, and using individual cells with and without voting; both settings yielded lower results than arrays with voting, and voting by subject improved results for individual cells as well as for arrays of cell images. Our method, based on arrays of cell images and CNN models, yields excellent classification results for the three classes, achieving 100% subject-level accuracy in four of the five partitions and only one error in the fifth. These results provide an alternative to the traditional symptom-based diagnostic approach, which can result in a high rate of misdiagnosis, and constitute a promising tool that merits further development and validation.