Introduction

Urine sediment analysis is a cornerstone of clinical diagnostics, essential for the assessment and management of kidney diseases, urinary tract infections, and various systemic disorders1,2. In clinical practice, examining the microscopic components of urine, such as cells, casts, and crystals, provides vital clues about renal function and the presence of pathological abnormalities1. However, despite its clinical significance, manual urine sediment analysis remains a labor-intensive, subjective, and operator-dependent procedure, which leads to variability in results between professionals and laboratories3,4,5. With the rising volume of laboratory requests and limited availability of skilled personnel, there is a pressing need for accurate and efficient automated methods that deliver reliable and standardized results4.

The main problem addressed in our study is the challenge of automating urine sediment analysis using artificial intelligence (AI), especially for detecting and classifying the full spectrum of urinary particles. Many of these particles exhibit highly diverse morphologies, are small in size, and are frequently underrepresented in datasets6,7. Although state-of-the-art AI methods are promising, they often depend on large, labeled datasets and tend to struggle with real-world image diversity and rare category detection5,6,8. This situation highlights the need for novel solutions that can leverage both labeled and unlabeled data within robust architectures.

Deep learning, especially convolutional neural networks (CNNs), has revolutionized medical image analysis in recent years. It offers automated feature extraction and remarkable performance in tasks such as disease detection, localization, and segmentation across multiple imaging modalities9,10,11,12. However, applying deep learning to urine sediment images presents unique challenges, including the lack of large annotated datasets, high-resolution imaging requirements, and wide variation in image quality. Recent advances demonstrate the value of self-supervised learning13,14, where unlabeled data is used to pretrain models via methods such as image reconstruction. This method can help to overcome data scarcity and improve model generalizability15. In this context, we introduce a large and diverse dataset, OpenUrine, which contains 790 labeled images (with over 31,285 expert-annotated bounding boxes across 39 categories) and an additional 5,640 unlabeled images for self-supervised learning.

A key innovation in our study is the design of a multi-head YOLOv12 architecture. Six parallel detection heads are specifically dedicated to Cells, Casts, Crystals, Microorganisms/Yeast, Artifacts, and Others. These heads operate simultaneously to enable comprehensive and precise detection of all relevant urinary sediment particles and their respective subclasses. Unlike previous single-head models, this architecture allows the model to independently capture the distinct morphological and visual characteristics of diverse particle types. Such a multi-head mechanism is essential for robust identification and discrimination among particle classes that differ widely in size, shape, and appearance, ensuring that both common and rare elements are detected with high accuracy.

The primary goal of this research is to develop and validate an effective and scalable deep learning method based on the multi-head YOLOv12, complemented by self-supervised pretraining and Slicing Aided Hyper Inference (SAHI)-based inference, for comprehensive and automated detection and classification of urinary sediment particles in microscopy images.

Related works

The potential of deep learning to overcome the limitations of traditional urine sediment analysis has spurred significant research efforts in developing deep learning-based AI models for automated analysis6. These models are designed to automatically identify and classify the various microscopic particles found in urine sediment, including red blood cells, white blood cells, epithelial cells, casts, crystals, bacteria, and yeast. Researchers have explored a wide range of deep learning architectures for this purpose, with CNNs being particularly prominent. Models such as AlexNet16, ResNet17, GoogleNet18, DenseNet19, MobileNet20, and YOLO have been adapted and applied to the task of urine particle classification and detection.

Some studies have focused on specific clinical applications of these AI models. For instance, research has explored the use of deep learning to detect bacteria in urine samples directly from microscopic images, eliminating the need for traditional, time-consuming urine culture methods21. Another study has investigated the potential of AI to screen for rare diseases, such as Fabry disease, by identifying unique cellular morphologies in urine sediment images22. To enhance the performance of these models, researchers are continuously exploring various techniques, including novel image amplification methods to augment training datasets, the incorporation of attention mechanisms to focus on relevant image features, and the development of hybrid methods that combine the strengths of CNNs with traditional feature extraction techniques like Local Binary Patterns (LBP)5,6,23,24. The use of pre-trained models and transfer learning is also a common strategy, allowing researchers to leverage knowledge gained from training on large general image datasets to improve the performance of models on the often-smaller urine sediment image datasets8. Furthermore, object detection methods like Faster R-CNN25, SSD26, and YOLO are being applied to simultaneously locate and classify urine particles within microscopic images, providing a more comprehensive analysis than simple image-level classification6.

The application of deep learning to urine sediment analysis encompasses various methods tailored to specific analytical needs. Many studies focus on classification tasks, where the goal is to categorize individual urine sediment particles into predefined classes, such as red blood cells, white blood cells, and different types of crystals5,12. These models learn to recognize the distinct visual features of each particle type to perform accurate classification. Another significant method involves object detection tasks, where the AI model aims not only to classify but also to precisely locate multiple urine particles within a single microscopic image7,12. This is particularly valuable in clinical settings as it allows for the quantification of different particle types and the analysis of their spatial relationships within the urine sediment.

Liang et al.27 used a dataset of 10,752 images spanning seven classes of urinary particles (erythrocytes, leukocytes, epithelial, low-transitional epithelium, casts, crystal, and squamous epithelial cells). After balancing the image categories, the data were used to train a RetinaNet model28, which achieved 88.65% accuracy on the test set with a processing time of 0.2 s per image. Yildirim et al.5 used a dataset of 8,509 particle images spanning eight classes obtained from urine sediment and developed a hybrid model combining textural (LBP) features with ResNet50; after feature optimization and fusion, the proposed model achieved a high accuracy of 96.0%. Liang et al.23 conducted a series of studies aimed at improving urinary sediment analysis through deep learning-based object detection models. In one study, they proposed the Dense Feature Pyramid Network (DPFN) architecture, integrating DenseNet into the standard FPN model and incorporating attention mechanisms into the network head. This method significantly mitigated class confusion in urine sediment images, in particular improving erythrocyte detection accuracy from 65.4% to 93.8% and achieving a mean average precision (mAP) of 86.9% on the test set. In a complementary study29, they framed urinary particle recognition as an object detection task using CNN-based models such as Faster R-CNN and SSD; evaluated on a dataset of 5,376 labeled images across seven urinary particle categories, their best-performing model achieved an mAP of 84.1%.

Ji et al.15 proposed a semi-supervised network model (US-RepNet) to classify urine sediment images, using a dataset of 429,605 urine sediment images spanning 16 classes, and reported an accuracy of 94% with their model. Li et al.30 used a dataset of 2,551 urine sediment images with four classes (red blood cells, white blood cells, epithelial cells, and crystals) and developed a modified LeNet-531, reporting a classification accuracy of 92%. Khalid et al.24 compiled a dataset of 820 annotated urine sediment images, which was used to train and evaluate five convolutional neural network models (MobileNet, VGG1632, DenseNet, ResNet50, and InceptionV333) along with a proposed CNN architecture. MobileNet achieved the highest true positive recall, followed closely by the proposed model; both reached a top accuracy of 98.3%, while InceptionV3 and DenseNet demonstrated slightly lower but still comparable accuracies of 96.5%.

Avci et al.34 developed a model for urinary particle recognition that enhances the resolution of microscopic images using a super-resolution Faster R-CNN method. They utilized pre-trained architectures including AlexNet, VGG16, and VGG19. Among these, the AlexNet-based model delivered the best performance, achieving a recognition accuracy of 98.6%. In another study35, they introduced a combination of Discrete Wavelet Transform (DWT) and a neural network-based system, the ADWEENN algorithm, for recognizing 10 different categories of urine sediment particles, achieving an accuracy of 97.58%.

In another study, Erten et al. introduced Swin-LBP, a handcrafted feature engineering model for urine sediment classification that combines the Swin transformer architecture with local binary pattern (LBP) techniques. Their six-phase approach, which includes LBP-based feature extraction, neighborhood component analysis (NCA) for feature selection, and support vector machine (SVM)36 classification, achieved an accuracy of 92.60% across seven classes of urinary sediment elements, outperforming conventional deep learning methods applied on the same dataset37. In a subsequent study, the same group proposed another model integrating cryptographic-inspired image preprocessing techniques, notably the Arnold Cat Map (ACM), with patch-based mixing and transfer learning. Leveraging DenseNet201 for deep feature extraction and NCA for feature selection, this model reached an even higher classification accuracy of 98.52% for seven types of urinary particles8.

A recent study proposed a combined CNN model integrated with an Area Feature Algorithm (AFA), enabling improved recognition of 10 urine sediment categories from a large dataset of 300,000 images, achieving a test accuracy of 97% and significantly enhancing the recognition of visually similar particles such as RBCs and WBCs38. A deep learning model based on VGG-16 was developed to classify 15 types of urinary sediment crystals using 441 images, which were augmented to 60,000 images through targeted data augmentation. Removing the random cropping step in data augmentation significantly improved accuracy, and the model achieved a performance of 91.8%39.

Lyu et al.7 developed an advanced deep learning model, YUS-Net, based on an improved YOLOX40 architecture for multi-class detection of urinary sediment particles. The model integrates domain-specific data augmentation, attention mechanisms, and Varifocal loss to enhance the detection of challenging particle types, particularly small and densely distributed objects. Evaluated on the USE dataset, YUS-Net achieved impressive performance, with a mean Average Precision (mAP) of 96.07%, 99.35% average precision, and 96.77% average recall, demonstrating its potential for efficient and accurate end-to-end urine sediment analysis.

A critical limitation of existing research is the narrow scope of detection. The vast majority of published object detection studies focus on a small number of classes. For example, the influential work by Liang et al.29 used a dataset of 5,376 labeled images across seven urinary particle categories. The dataset used by Liang et al.27 also contained seven classes. The hybrid classification model by Yildirim et al.5 was trained on eight particle types. Even more ambitious studies, such as that by Ji et al.15, which used a large dataset, reached only 16 categories.

Beyond prior urine microscopy studies, several recent deep learning frameworks across other domains further highlight the rapid evolution of hybrid architectures. In biomedical imaging, models such as DCSSGA-UNet41 and EFFResNet-ViT42 adopt dense connectivity, semantic attention, and CNN–Transformer fusion to enhance segmentation and classification precision. Similarly, deep hybrid and self-supervised architectures from cyber-physical security research43,44,45 demonstrate parallel methodological advances in representation learning and encoder–decoder design. Comparable trends have also appeared in unrelated areas such as sports performance analytics and wearable sensor forecasting46,47, reflecting the general shift toward multi-branch and attention-driven deep models across domains.

This “granularity gap” between existing research and the diverse reality of clinical samples is a major barrier to practical deployment. Our work directly confronts this gap by introducing a model and a public dataset, OpenUrine, designed for the comprehensive detection of 39 distinct categories, representing a significant leap in complexity and clinical relevance.

Dataset

The dataset utilized in this study, named OpenUrine, comprises 6,430 urinary sediment images: 790 anonymized, expert-labeled microscopic images and an additional 5,640 unlabeled images used for self-supervised learning. This is the first publicly available dataset dedicated to urinary particle detection. No patient metadata was collected at any stage; all samples were fully anonymized and are referenced only by randomly assigned identification codes. None of the images carry patient-specific information, ensuring complete privacy and compliance with ethical data standards. Images were collected from multiple laboratories using different microscope models and various smartphone cameras to ensure a broad range of imaging conditions reflective of real-world clinical variability.

An overview of the dataset, including the number of labeled and unlabeled images as well as the total number of bounding box annotations, is presented in Table 1. Table 2 provides a detailed breakdown of all 39 categories, reporting the number of annotated objects, number of images containing each label, and a brief scientific description for each particle type, facilitating a comprehensive understanding of the dataset’s diversity and clinical relevance.

Table 1 Summary of the OpenUrine dataset, detailing the number of images and annotated bounding boxes for both labeled and unlabeled subsets.
Table 2 Detailed breakdown of the 39 categories in OpenUrine dataset. The total count represents image-label pairs, as individual images may contain multiple particle types. The actual number of unique images is 790.
Fig. 1

Representative annotated microscopic fields from the OpenUrine dataset. Each sub-image shows a clinical urine sample with expert-labeled bounding boxes over multiple particle types, reflecting the high density, diversity, and spatial complexity encountered in real-world urinalysis.

Fig. 2

Example images representing all 39 urinary sediment particle categories included in the OpenUrine dataset.

Data labeling

Each image was assigned a unique identification code upon acquisition. Two experienced clinical biochemistry experts conducted the labeling process independently, ensuring high reliability and consensus in recognizing and delineating all urinary sediment structures present. All detectable objects were marked with bounding boxes and assigned one of the 39 class labels. Figure 1 presents sample annotated microscopic fields from the OpenUrine dataset. Each sub-image shows a real clinical sample with expert-verified bounding box annotations identifying and localizing multiple urinary particles across diverse imaging conditions. Figure 2 displays representative examples of all 39 particle categories present in the dataset. Each image illustrates the unique morphology and appearance of a specific urinary sediment particle, such as various cell types, casts, crystals, microorganisms, and artifacts.

Unlabeled images for self-supervised learning

Beyond the labeled portion, the OpenUrine dataset also includes 5,640 unlabeled images. These images, which share the same acquisition characteristics as the labeled set, were used in the self-supervised stage of the proposed method to further boost model performance and robustness.

Data partitioning

The labeled dataset was partitioned at the patient level using 5-fold cross-validation, with an 80:20 training-to-testing ratio in each fold and all images from a single patient assigned to either the training or the testing set, never both. This patient-level split prevents data leakage and ensures a realistic evaluation of the model’s generalization to new patients. Each fold was trained independently, and the reported results represent the mean±std (%) across the five folds, ensuring fair and objective model assessment.
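For illustration, the patient-level 5-fold partitioning can be reproduced with scikit-learn's GroupKFold; the image paths and patient codes below are toy placeholders standing in for the 790 labeled images and their anonymized identifiers.

```python
from sklearn.model_selection import GroupKFold

# Toy stand-ins: in practice these would be the 790 labeled image paths and
# their anonymized patient codes (both names here are hypothetical).
image_paths = [f"img_{i:03d}.png" for i in range(20)]
patient_ids = [i // 2 for i in range(20)]  # two images per synthetic patient

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(image_paths, groups=patient_ids)):
    train_patients = {patient_ids[i] for i in train_idx}
    test_patients = {patient_ids[i] for i in test_idx}
    assert train_patients.isdisjoint(test_patients)  # no patient appears in both splits
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test images")
```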

Method

This section outlines the methodology for fully automated detection and categorization of urinary sediment particles in high-resolution microscopy, leveraging a custom multi-head YOLOv12 architecture designed specifically for the OpenUrine dataset.

Architecture overview

As illustrated in Fig. 3, the proposed method is a multi-head object detection framework based on YOLOv1248,49, adapted and optimized for challenging urinary sediment images (average size \(1800 \times 1800\) px). A key innovation is the separation of the detection module into six distinct semantic heads, each corresponding to a clinically relevant super-category of urinary sediment objects. This structure enhances discrimination and robustness, particularly for rare or visually subtle subclasses. To further address the challenges of detecting small, densely packed structures in large fields of view, Slicing Aided Hyper-Inference (SAHI)50 is tightly integrated into the inference pipeline.

Fig. 3

Overview of the proposed two-stage deep learning method. (A) The encoder–decoder network is pretrained via self-supervised reconstruction on unlabeled urine sediment images to learn rich feature representations. (B) The pretrained encoder (backbone) is fine-tuned for object detection using six parallel heads, enabling precise multi-class identification of urinary particles.

Backbone network

Each input image X is processed by a YOLO backbone, which extracts multiscale, high-level feature maps:

$$\begin{aligned} F = \textrm{Backbone}(X) \end{aligned}$$

These feature maps provide rich spatial and morphological representations crucial for accurate detection across a wide range of object scales.

Multi-head detection module

The detection module utilizes six parallel output heads, each specializing in one clinically important super-category of urinary sediment particles: Cells, Casts, Crystals, Microorganisms/Yeast, Artifacts, and Others. This categorization directly follows established clinical taxonomy and precisely matches the semantic groupings defined in Table 2. Each head is responsible for detecting all subcategories corresponding to its group.

For each group i, the shared feature map F is passed to the corresponding detection head:

$$\begin{aligned} Y_i = \textrm{Head}_i(F) \end{aligned}$$

where \(Y_i\) encodes the bounding boxes, objectness scores, and class probabilities for all subclasses assigned to that head.
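To make the shared-backbone, parallel-heads design concrete, a minimal PyTorch sketch is given below. The backbone layer, the per-group subclass counts, and the head definitions are illustrative placeholders rather than the actual YOLOv12 modules; the real subclass split follows Table 2.

```python
import torch
import torch.nn as nn

class MultiHeadDetector(nn.Module):
    """Illustrative shared-backbone detector with six parallel group heads.
    The backbone and heads are simplified stand-ins for the YOLOv12 modules."""

    # Assumed subclass counts per super-category (summing to 39); Table 2 defines the real split.
    GROUPS = {"cells": 10, "casts": 8, "crystals": 10,
              "microorganisms_yeast": 4, "artifacts": 3, "others": 4}

    def __init__(self, feat_channels: int = 256, anchors_per_cell: int = 1):
        super().__init__()
        # Stand-in for the YOLOv12 backbone producing a shared feature map F.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_channels, kernel_size=3, stride=8, padding=1), nn.SiLU()
        )
        # One detection head per clinical super-category: 4 box coords + 1 objectness + subclasses.
        self.heads = nn.ModuleDict({
            name: nn.Conv2d(feat_channels, anchors_per_cell * (4 + 1 + n_sub), kernel_size=1)
            for name, n_sub in self.GROUPS.items()
        })

    def forward(self, x: torch.Tensor) -> dict:
        f = self.backbone(x)  # shared features F = Backbone(X)
        return {name: head(f) for name, head in self.heads.items()}  # Y_i = Head_i(F)

model = MultiHeadDetector()
outputs = model(torch.randn(1, 3, 960, 960))
print({name: tuple(y.shape) for name, y in outputs.items()})
```

Each head emits its own box, objectness, and subclass predictions from the same feature map, which is the property exploited by the head-specific losses described next.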

Loss function

In our architecture, the detection loss follows the YOLOv12 formulation48, but is applied independently to each of the six output heads. This design allows every head, specialized for its own clinical super-category, to optimize its parameters without interference from unrelated particle types, while still contributing to the overall network performance.

The total loss is the weighted sum of the head-specific losses, as shown in Eq. (1).

$$\begin{aligned} L_{\textrm{total}} = \sum _{i=1}^{6} \lambda _i \, L_i \end{aligned}$$
(1)

where \(\lambda _i\) controls the relative weight of head i based on its clinical importance and representation in the dataset. The values are tuned as shown in Table 3.

Each head-specific loss \(L_i\) is defined in Eq. (2).

$$\begin{aligned} L_i = \textrm{gain}_{\textrm{box}} \cdot L_{\textrm{bbox}}^{(i)} + \textrm{gain}_{\textrm{obj}} \cdot L_{\textrm{obj}}^{(i)} + \textrm{gain}_{\textrm{cls}} \cdot L_{\textrm{cls}}^{(i)} \end{aligned}$$
(2)

where \(\textrm{gain}_{\textrm{box}}\), \(\textrm{gain}_{\textrm{cls}}\), and \(\textrm{gain}_{\textrm{obj}}\) correspond to box loss gain, classification loss gain, and objectness scaling hyperparameters defined in Table 3.

Bounding box regression in YOLOv12 combines IoU-based loss51 with the Distribution Focal Loss (DFL)52 to enhance localization precision. The bounding box regression loss for head i is expressed in Eq. (3).

$$\begin{aligned} L_{\textrm{bbox}}^{(i)} = L_{\textrm{CIoU}}^{(i)} + \textrm{gain}_{\textrm{DFL}} \cdot L_{\textrm{DFL}}^{(i)} \end{aligned}$$
(3)

Here, \(L_{\textrm{CIoU}}\) accounts for overlap, center distance, and aspect ratio, while \(L_{\textrm{DFL}}\) refines predicted box coordinates at sub-pixel resolution. The complete IoU loss term is defined in Eq. (4).

$$\begin{aligned} L_{\textrm{CIoU}}^{(i)} = 1 - \textrm{IoU}(b, b^{GT}) + \frac{\rho ^2(b_c, b_c^{GT})}{c^2} + \alpha v \end{aligned}$$
(4)

where \(\rho\) is the center-point distance, c is the diagonal length of the smallest enclosing box, and v is the aspect ratio term with balance factor \(\alpha\). The distribution focal loss is expressed in Eq. (5).

$$\begin{aligned} L_{\textrm{DFL}}^{(i)} = -\frac{1}{N_i}\sum _{j=1}^{N_i} \big [ q_{j} \log (p_{j}) + (1-q_{j})\log (1-p_{j}) \big ] \end{aligned}$$
(5)

where \(p_j\) is the predicted probability for the discretized bin of a coordinate value and \(q_j\) is the corresponding soft target.
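As a concrete reference for Eq. (4), a minimal PyTorch transcription of the CIoU term is shown below. This is a sketch of the standard CIoU formulation rather than the exact YOLOv12 implementation, and the balance factor \(\alpha\) follows the usual definition \(\alpha = v / (1 - \textrm{IoU} + v)\), which Eq. (4) leaves implicit.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """CIoU loss of Eq. (4) for boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    # IoU term
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared centre distance rho^2 and squared diagonal c^2 of the smallest enclosing box
    ctr_p = (pred[:, :2] + pred[:, 2:]) / 2
    ctr_t = (target[:, :2] + target[:, 2:]) / 2
    rho2 = ((ctr_p - ctr_t) ** 2).sum(dim=1)
    enc_lt = torch.min(pred[:, :2], target[:, :2])
    enc_rb = torch.max(pred[:, 2:], target[:, 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(dim=1) + eps

    # aspect-ratio term v and its balance factor alpha
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return (1 - iou + rho2 / c2 + alpha * v).mean()
```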

Objectness loss53 measures how well the model distinguishes objects from background. The objectness loss for head i is formulated in Eq. (6).

$$\begin{aligned} L_{\textrm{obj}}^{(i)} = -\frac{1}{M_i}\sum _{j=1}^{M_i} \big [ o_{j}^{GT}\log (o_{j}) + (1-o_{j}^{GT})\log (1-o_{j}) \big ] \end{aligned}$$
(6)

where \(o_j\) is the predicted objectness score for anchor j, and \(o_j^{GT}\in \{0,1\}\).

Classification loss53 ensures correct subclass identification within each head. The classification loss for head i is defined in Eq. (7).

$$\begin{aligned} L_{\textrm{cls}}^{(i)} = -\frac{1}{P_i} \sum _{j=1}^{P_i}\sum _{k=1}^{C_i} \big [ y_{j,k}\log (p_{j,k}) + (1-y_{j,k})\log (1-p_{j,k}) \big ] \end{aligned}$$
(7)

where \(p_{j,k}\) is the predicted probability for subclass k in sample j, and \(C_i\) is the number of subclasses in head i.

Unlike a unified detector that learns all categories together, here each head focuses only on the visual patterns of its assigned group, using Eqs. (3) through (7) independently. This separation avoids competition between unrelated classes, reduces the impact of severe class imbalance, and allows adjusting \(\lambda _i\) in Eq. (1) to boost underrepresented yet clinically significant categories. As our ablation studies show, removing this head-level independence leads to the sharpest drop in mean Average Precision (mAP) and recall.
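Assuming the component losses of Eq. (2) have already been computed per head, the weighted aggregation of Eq. (1) reduces to a few lines; the \(\lambda_i\) and gain values below are placeholders only, the tuned values being those reported in Table 3.

```python
import torch

# Hypothetical per-head weights and loss gains; the actual tuned values are listed in Table 3.
lambdas = {"cells": 1.0, "casts": 1.5, "crystals": 1.0,
           "microorganisms_yeast": 2.0, "artifacts": 0.5, "others": 0.5}
gains = {"box": 7.5, "obj": 1.0, "cls": 0.5}

def total_loss(per_head_losses: dict) -> torch.Tensor:
    """Eq. (2) per head, then the weighted sum of Eq. (1).

    per_head_losses maps each head name to a dict with 'bbox', 'obj' and 'cls' loss tensors."""
    total = torch.zeros(())
    for name, parts in per_head_losses.items():
        l_i = (gains["box"] * parts["bbox"]
               + gains["obj"] * parts["obj"]
               + gains["cls"] * parts["cls"])
        total = total + lambdas[name] * l_i
    return total
```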

Training procedure

A two-stage training strategy, optimized to leverage both labeled and unlabeled data, is applied.

(1) Self-supervised pretraining: during the self-supervised pretraining stage, all 5,640 unlabeled images were utilized in an image reconstruction autoencoder architecture (as illustrated in Fig. 3). The network follows an encoder-decoder structure: the encoder mirrors the YOLOv12 backbone to extract latent morphological representations, and the decoder reconstructs the input image using these features. The model was optimized with a combined L1 + SSIM reconstruction loss, enforcing both pixel-level accuracy and structural consistency between input and reconstructed outputs. This pretext task effectively encourages the backbone to capture intrinsic microscopic texture and morphology priors even without labels. The pretrained encoder weights were subsequently transferred to initialize the YOLOv12 backbone during the supervised fine-tuning stage.
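A minimal sketch of the combined reconstruction objective is given below. The SSIM term is simplified to image-global statistics for brevity (a windowed SSIM would normally be used), and the 0.5 weighting between the two terms is an assumed placeholder.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(recon: torch.Tensor, target: torch.Tensor,
                        ssim_weight: float = 0.5) -> torch.Tensor:
    """L1 + (1 - SSIM) reconstruction loss for images in [0, 1], shape (B, C, H, W).
    The SSIM here uses image-global statistics for brevity; a windowed SSIM is the usual choice."""
    l1 = F.l1_loss(recon, target)

    c1, c2 = 0.01 ** 2, 0.03 ** 2  # standard SSIM stabilising constants
    mu_x, mu_y = recon.mean(dim=(2, 3)), target.mean(dim=(2, 3))
    var_x, var_y = recon.var(dim=(2, 3)), target.var(dim=(2, 3))
    cov = ((recon - mu_x[..., None, None]) * (target - mu_y[..., None, None])).mean(dim=(2, 3))
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

    return (1 - ssim_weight) * l1 + ssim_weight * (1 - ssim.mean())
```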

(2) Supervised fine-tuning: the backbone’s pretrained weights initialize the detection model, which is then fine-tuned on the 790 labeled images with their bounding box annotations. Each particle is routed to its corresponding semantic head, and the total loss is jointly optimized. Diverse data augmentation (e.g., Mosaic, Scale) and SGD with momentum are employed.

Inference with SAHI

For inference, we employed SAHI to enhance detection performance on high-resolution microscopic images. SAHI systematically divides each input image into overlapping tiles of 640\(\times\)640 pixels with an overlap ratio of 0.25 (25%) in both horizontal and vertical directions. This slicing strategy enables the model to process smaller image regions with higher effective resolution, significantly improving detection sensitivity for small and densely packed urinary particles that might be missed in full-resolution inference.

Each slice is independently processed by our proposed method, generating separate predictions for particles within that region. The tiled outputs are subsequently merged through non-maximum suppression (NMS) to eliminate duplicate detections and produce consolidated, non-redundant bounding boxes.
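The slicing-and-merging procedure can be summarized by the following simplified stand-in for the SAHI pipeline; detect_fn is a placeholder for the trained multi-head detector, the tile size and overlap match the values above, and edge tiles are omitted for brevity.

```python
import torch
from torchvision.ops import nms

def sliced_inference(image: torch.Tensor, detect_fn, tile: int = 640,
                     overlap: float = 0.25, iou_thr: float = 0.5):
    """Simplified SAHI-style inference on a (C, H, W) image tensor.
    `detect_fn(crop)` is a placeholder for the trained detector and must return
    (boxes_xyxy, scores, labels) in the crop's local coordinates."""
    _, h, w = image.shape
    stride = int(tile * (1 - overlap))  # 640 px tiles with 25% overlap -> 480 px stride
    boxes, scores, labels = [], [], []
    for y0 in range(0, max(h - tile, 0) + 1, stride):      # edge padding omitted for brevity
        for x0 in range(0, max(w - tile, 0) + 1, stride):
            b, s, l = detect_fn(image[:, y0:y0 + tile, x0:x0 + tile])
            if len(b):
                # shift tile-local boxes back to full-image coordinates
                boxes.append(b + torch.tensor([x0, y0, x0, y0], dtype=b.dtype))
                scores.append(s)
                labels.append(l)
    if not boxes:
        return torch.empty(0, 4), torch.empty(0), torch.empty(0, dtype=torch.long)
    boxes, scores, labels = torch.cat(boxes), torch.cat(scores), torch.cat(labels)
    keep = nms(boxes, scores, iou_thr)  # merge duplicate detections from overlapping tiles
    return boxes[keep], scores[keep], labels[keep]
```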

Results

To comprehensively evaluate the effectiveness of our proposed object detection method, we conducted a series of experiments on the OpenUrine dataset. These experiments were specifically designed to demonstrate the superiority of our method compared to prior object detection methods under identical conditions.

Performance evaluation metrics

Model performance was assessed using several established object detection metrics. Precision quantifies the proportion of correctly identified positive detections among all predicted positives, as defined in Eq. (8):

$$\begin{aligned} \textrm{Precision} = \frac{TP}{TP + FP} \end{aligned}$$
(8)

where \(TP\) and \(FP\) denote the numbers of true positive and false positive predictions, respectively. Recall measures the proportion of actual positives that are correctly detected by the model, as shown in Eq. (9):

$$\begin{aligned} \textrm{Recall} = \frac{TP}{TP + FN} \end{aligned}$$
(9)

where \(FN\) is the number of false negatives. The overall detection capability is further summarized by the mean Average Precision (mAP), which is the unweighted mean of the Average Precision (AP) across all object classes, as presented in Eq. (10):

$$\begin{aligned} \textrm{mAP} = \frac{1}{N} \sum _{i=1}^{N} \textrm{AP}_i \end{aligned}$$
(10)

where \(N\) is the total number of classes under consideration. Additionally, the evaluation follows the COCO protocol54 by reporting mAP@50-95, which represents the mean AP computed over multiple intersection-over-union (IoU) thresholds ranging from 0.5 to 0.95 (in increments of 0.05), thereby providing a stricter and more comprehensive measure of detection performance.
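A small worked example of Eqs. (8) to (10) and the COCO-style threshold averaging is shown below; the counts and AP values are toy numbers, not results from this study.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision and recall from Eqs. (8) and (9)."""
    return tp / (tp + fp), tp / (tp + fn)

def map_50_95(ap: np.ndarray) -> float:
    """mAP@50-95: per-class AP averaged over classes (Eq. 10) and over the ten
    IoU thresholds 0.50, 0.55, ..., 0.95. Expected shape: (num_classes, 10)."""
    return float(ap.mean())

# Toy numbers only, not results from this study.
p, r = precision_recall(tp=80, fp=20, fn=25)
ap = np.random.default_rng(0).uniform(0.3, 0.9, size=(39, 10))
print(f"precision={p:.2f}, recall={r:.2f}, mAP@50-95={map_50_95(ap):.3f}")
```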

Table 3 Optimized hyperparameters for our method training on the OpenUrine dataset.
Fig. 4

Scatter plots illustrating the effect of key hyperparameters on object detection performance across 300 training runs (each for 100 epochs) on the OpenUrine dataset. Cross markers denote the final optimal values used for the main model training.

Implementation details

A comprehensive hyperparameter optimization protocol was carried out as part of our experimental design (see Fig. 4). For this purpose, we performed 300 independent training runs, each for 100 epochs, progressively searching the space of learning rate, momentum, weight decay, and various augmentation factors as listed in Table 3. Each configuration was evaluated on the validation split after every epoch, allowing us to systematically identify optimal values. All experiments, including baseline comparisons, were performed on the OpenUrine dataset for scientific consistency.

The scatter plots in Fig. 4 visualize the relationship between key hyperparameters and resulting detection metrics (such as mAP, mAP@50-95, Precision, and Recall); final selected values are denoted by a cross marker.

For the final training of our best-performing model, we utilized the optimal parameters over 300 epochs with a batch size of 16 and an input image size of \(960 \times 960\) pixels, ensuring maximal capacity to learn robust object representations.

Comparative evaluation

Quantitative and qualitative evaluation of automated urine sediment analysis models is crucial for establishing their accuracy, robustness, and clinical viability. In this section, we present a comprehensive comparative analysis of our proposed method and state-of-the-art methods, followed by investigations into how input image size and particle class influence model performance. The reliability and interpretability of the deep network are further validated through visual explanation techniques such as Grad-CAM.

Comparison with state-of-the-art methods

Table 4 provides a comprehensive comparison between our method and state-of-the-art methods. Our proposed method achieves the highest performance on all core metrics (precision, recall, mAP\(_{50}\), mAP\(_{50-95}\)), outperforming both the latest YOLO models and prior state-of-the-art methods. While absolute values such as 76.59% precision may appear modest compared to simpler tasks, it is important to note that the OpenUrine dataset includes 39 diverse classes, making it a far more complex challenge than datasets used in previous studies. The ablation results reveal that both the multi-head detection strategy and self-supervised pretraining contribute substantially to the observed gains. In particular, removing the multi-head scheme leads to the largest drop in precision, recall, and mean average precision, highlighting the value of specialized detection branches for different particle types. Our method, even without some of these advanced modules, remains competitive with or superior to prior works. YOLO-based baselines and state-of-the-art methods, while strong, are outperformed by our method, especially on the challenging OpenUrine dataset. These results demonstrate the effectiveness of our architectural innovations for improving the automated analysis of urine sediment images.

Table 4 Comparison of our proposed method and state-of-the-art methods on the OpenUrine dataset. Metrics are reported as mean±std (%) across five cross-validation folds. Bold values indicate the best result per metric.

Impact of input size

As shown in Table 6, an input resolution of \(960\times 960\) pixels yielded the highest overall mAP while maintaining stable convergence and feasible GPU memory usage (24 GB). Hence, this resolution was adopted for all subsequent experiments. The model achieves optimal results at \(960 \times 960\) pixels, outperforming both smaller and even larger input sizes on almost all metrics. While a further increase to \(1280 \times 1280\) yields competitive results, there is no consistent improvement and some metrics are slightly reduced, likely due to increased computational noise, overfitting, or diminished returns with upscaling. Notably, reducing the input size below 960 sharply degrades performance, especially for mAP50 and recall. This is particularly important because many urinary particles (such as bacteria and crystals) are small and easily lost at lower resolutions. At the lowest tested sizes (80 and 40 pixels), model recall especially collapses, confirming that sufficient image resolution is critical for the reliable detection of fine and small-scale particles. These findings underscore the need to optimize input size for automatic urine sediment analysis, balancing computational efficiency with the necessity to preserve particle detail.

Impact of heads

The influence of each detection head was evaluated through a head-wise ablation test, as summarized in Table 5. When any single head was removed, its corresponding samples were not excluded from training; instead, all annotations were reassigned to the Others head to preserve the dataset composition and training balance. The results show that disabling any individual head consistently reduced detection accuracy, indicating that each contributes unique and complementary information. The most pronounced performance drop occurred when the Microorganisms head was removed, reflecting their critical role in discriminating morphologically complex or clinically significant particle groups.

Table 5 Head-wise ablation analysis of the proposed method. Removing each head degrades detection accuracy even though all data remain in use (annotations of the removed head were redirected to the others head).
Table 6 Evaluation results of YOLO12-Nano with different input image sizes on the OpenUrine test set. Metrics are reported as mean±std (%) across five cross-validation folds. The table demonstrates how increasing input resolution substantially improves the detection of urinary sediment particles, with the best performance achieved at \(960 \times 960\) pixels.
Table 7 Class-wise detection performance of the best model on the OpenUrine test set. Numbers are reported as percentages. Metrics include Precision, Recall, mAP@50, and mAP@50-95 for each urinary sediment class, as well as unweighted and instance-weighted averages across all classes.

Class-wise analysis

The results in Table 7 demonstrate that our method substantially improves detection accuracy across most urinary sediment particle classes compared to previous baselines. The model achieves high precision and recall on classes with distinctive morphological features, such as Calcium Oxalate, Bilirubin, and Calcium Carbonate, showing the benefit of leveraging their unique visual patterns. However, certain classes remain challenging: for example, Bacteria achieve high precision but low recall, likely due to their small size and tendency to be overlooked in crowded backgrounds. Morphologically similar cells, notably RBC and WBC, are sometimes confused due to their overlapping appearance, limiting further accuracy improvements in these categories. Additionally, rare or subtle classes such as Fat Droplets and Renal Epithelial Cells still suffer from lower detection rates.

Typical failure cases include missed detections in densely clustered regions, merged bounding boxes where adjacent particles overlap, and occasional confusion between morphologically similar RBC and WBC, especially when illumination or focus artifacts blur their boundaries. Quantitative analysis indicates that approximately 38% of the undetected WBC instances were misclassified as RBC, while 43% of the missed RBC instances were incorrectly detected as WBC. This bidirectional confusion highlights their strong morphological resemblance under bright-field microscopy. In crowded microscopic fields, small Bacteria are sometimes undetected or merged with noise, while low-contrast Renal Epithelial Cells may be mistaken for background structures. These qualitative observations (illustrated in Fig. 5) reveal the key limitations of the current model and inform future improvements such as boundary-aware loss design and targeted synthetic data augmentation for rare or visually ambiguous categories.

Overall, while our model demonstrates meaningful advances in most categories, the reliable detection of small, ambiguous, or visually similar particles remains a significant challenge for automated urine sediment analysis.

Fig. 5

Grad-CAM visualization of model attention across representative classes. Top row: detection and attention patterns for Amorphous particles, showing that the model accurately localizes dense crystalline regions and focuses its activations (red/yellow) on texture-rich clusters relevant to this class. Bottom row: predictions for Epithelial Cells where the model highlights cell nuclei and boundary contours while de-emphasizing background noise and staining artifacts. In each pair, the left image displays predicted bounding boxes and class labels, while the right image presents the corresponding Grad-CAM heatmap. Warmer colors (red/yellow) indicate regions contributing most to the network’s decision, confirming that it primarily attends to morphologically informative structures such as epithelial cells and amorphous deposits rather than irrelevant background patterns.

Clinical validation

Urine microscopy results from 84 patients, previously analyzed and verified by experienced laboratory technologists, were employed for clinical validation of the proposed model. For each sample, three to five representative microscopic fields were processed by the model, and the predictions were averaged at the patient level before comparison with the laboratory-reported results. The predicted outputs were mapped to the standard five-level microscopic quantitation scale (none, rare, few, moderate, many) used in routine clinical reporting. A prediction was considered correct when the model’s categorical output matched the laboratory category for the same urinary component.
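To make the comparison procedure concrete, the patient-level mapping can be sketched as follows. The count thresholds separating the five categories are hypothetical placeholders, since the laboratory cut-offs are not specified here; only the averaging-then-mapping logic reflects the procedure described above.

```python
# Hypothetical cut-offs from average particle count per field to the five-level
# reporting scale; the actual laboratory thresholds are not specified here.
SCALE = [(0, "none"), (2, "rare"), (5, "few"), (20, "moderate"), (float("inf"), "many")]

def to_scale(avg_count: float) -> str:
    for upper, label in SCALE:
        if avg_count <= upper:
            return label
    return "many"

def patient_level_category(counts_per_field: list) -> str:
    """Average the per-field counts predicted for one particle type in one patient,
    then map the average onto the categorical reporting scale."""
    return to_scale(sum(counts_per_field) / len(counts_per_field))

# e.g. three fields with 3, 4 and 6 detected WBCs -> 'few' under these placeholder cut-offs
print(patient_level_category([3, 4, 6]))
```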

All major urinary sediment components, including RBCs, WBCs, epithelial cells, calcium oxalate crystals, bacteria, and mucus, were evaluated accordingly. Table 8 presents the clinical accuracy of the proposed model relative to technologist reports. Discrepant samples were further reviewed by an independent clinical biochemist to confirm the final reference label.

Table 8 Patient-level accuracy of the proposed method for major urinary particle types compared with laboratory technologist reports.

Interpretability via visual explanation

In Fig. 5, a comparison is presented between the model’s bounding box predictions (left) and Grad-CAM visualizations (right) for selected test images. The detection results illustrate the network’s ability to localize and classify different urine sediment constituents, such as amorphous particles and epithelial cells. Notably, the Grad-CAM activation maps reveal that the highlighted regions (red areas) are primarily concentrated over clear and well-defined particles within the microscopic fields, confirming that the model bases its predictions on relevant visual cues rather than background artifacts. This qualitative interpretability analysis demonstrates the reliability and transparency of the network’s decision-making process in real-world clinical samples.

Conclusion

In this study, we introduced a novel deep learning method tailored for automated urine sediment analysis, integrating a multi-head YOLOv12 architecture, self-supervised pretraining, and SAHI-based inference. Our method effectively addresses critical challenges such as small-object detection, class imbalance, and data scarcity, achieving a competitive precision of 76.59% on a large, diverse dataset. The deployment of six specialized detection heads allows simultaneous, fine-grained classification across all relevant urinary particles and artifacts, supporting detailed clinical interpretation. Furthermore, the establishment and public release of the OpenUrine dataset fill a crucial gap, providing a valuable resource for further research in this domain.

Future work will focus on refining the model’s performance, especially for rare or visually ambiguous particle types, by exploring adaptive focal loss weighting, targeted synthetic data augmentation, and self-supervised consistency regularization to mitigate class imbalance. We also intend to integrate physical and chemical urinalysis test data to further enhance diagnostic precision and generalizability.