Abstract
Deep learning has improved automated plant disease detection by increasing recognition accuracy and robustness compared with traditional vision-based methods. Self-supervised learning (SSL) further reduces dependence on manual labels, but its transferability across heterogeneous agricultural datasets remains insufficiently characterized. Here, we evaluate a contrastive SSL pretraining and fine-tuning pipeline, termed PlantCLR, for plant disease classification under cross-dataset transfer with target-domain fine-tuning. PlantCLR combines SimCLR-style contrastive pretraining with a lightweight convolutional classifier to balance representation quality and deployment efficiency. Experiments on PlantVillage and Cassava Leaf Disease show strong performance, achieving 99.10% accuracy and 99.04% F1-score on PlantVillage, and 96.83% accuracy and 96.70% F1-score on Cassava. Feature embedding visualization using t-SNE and explanation maps using Grad-CAM indicate improved class separability and attention to disease-relevant regions. These results suggest that contrastive SSL can improve representation transfer while maintaining computational efficiency, supporting scalable plant disease diagnostics in practical agricultural settings. Code is available at GitHub.
Introduction
Plant diseases remain a persistent threat to global food security by reducing crop yield and quality, with direct economic consequences for farmers and broader impacts on food supply1,2. Early detection is particularly difficult in regions with limited infrastructure and restricted access to trained agronomists. In practice, diagnosis often relies on visual inspection of symptoms such as discoloration, lesions, wilting, and malformations. Although commonly used, visual assessment is dependent on human expertise, can be slow, and may be subjective or inconsistent, especially under field conditions where symptoms vary across lighting, background clutter, and plant growth stage3.
Advances in artificial intelligence (AI), especially machine learning and deep learning, have enabled automated systems for plant disease recognition. Convolutional neural networks (CNNs) are a core technology in computer vision and have demonstrated strong performance across diverse domains, including medical diagnosis, autonomous driving, power system monitoring, and industrial inspection4,5. By learning hierarchical representations from raw images, CNNs can capture disease-related cues that are difficult to design manually, and they frequently outperform handcrafted feature pipelines in agricultural imagery6,7.
However, supervised CNNs typically require large, balanced, and reliably annotated datasets, which are often unavailable in real-world agricultural settings. Field-collected disease images exhibit substantial intra-class variability due to growth stage, illumination, occlusion, and device differences, alongside inter-class similarity where different diseases share visually similar symptoms such as yellowing or curling8. High-quality labeling is costly because it requires domain expertise and may be infeasible for rare diseases9. In addition, agricultural datasets are frequently imbalanced and may contain noisy labels, which can reduce generalization when models are trained purely in a supervised manner.
Self-supervised learning (SSL) has emerged as a promising alternative for reducing dependence on labeled data by learning transferable visual representations from unlabeled images10. SSL leverages surrogate objectives that do not require manual annotation, such as predicting transformations or enforcing consistency between different augmented views of the same image11. After pretraining, the learned encoder can be fine-tuned using a smaller labeled set, improving robustness and lowering annotation cost12. This paradigm is particularly relevant to agriculture, where unlabeled plant images are abundant but expert labeling is limited2. Recent work in medical and histopathological imaging has similarly emphasized representation-level alignment under heterogeneous acquisition conditions, demonstrating that careful feature fusion and cross-domain representation strategies improve robustness in visually variable environments13.
Despite its promise, applying SSL to plant disease recognition introduces practical challenges. Fine-grained discrimination can be difficult when symptom patterns are subtle and diseases are visually similar8. SSL performance is also sensitive to the choice of augmentations during pretraining: aggressive transformations can distort lesion patterns, whereas weak augmentations may not provide sufficient variability for robust representation learning12. Transfer performance may further degrade under domain bias when pretraining and target data differ substantially, for example controlled laboratory images versus noisy field imagery14. Finally, although SSL reduces the need for large labeled datasets, it does not inherently address class imbalance or label noise in the labeled data used for fine-tuning12. These considerations motivate reproducible SSL pipelines and evaluation protocols that explicitly account for dataset shift and deployment constraints.
In this study, we present PlantCLR, a contrastive self-supervised pretraining and fine-tuning pipeline for plant disease classification with explicit emphasis on evaluation under domain shift. We define a strict two-stage protocol with complete separation between training, validation, and test data to prevent information leakage. In addition to standard in-domain evaluation, we implement a cross-dataset adaptation setting that transfers representations from controlled laboratory imagery (PlantVillage) to real-field imagery (Cassava). This design enables a structured assessment of representation robustness under heterogeneous imaging conditions relevant to practical agricultural deployment.

PlantCLR follows a SimCLR-style strategy15, where an encoder is trained to pull together representations of two augmented views of the same image while separating representations of different images in the latent space. We adopt ConvNeXt-Tiny as the encoder backbone due to its favorable trade-off between accuracy and computational cost. In the supervised stage, a lightweight classification head is attached and the network is fine-tuned on labeled data. We evaluate PlantCLR on PlantVillage, a controlled dataset with 38 classes and over 54,000 images, and on the Cassava Leaf Disease dataset, which contains 21,818 field images across five disease classes. The proposed pipeline achieves 99.10% accuracy and 99.04% F1-score on PlantVillage, and 96.83% accuracy and 96.70% F1-score on Cassava. We further assess the learned representations using t-SNE and Grad-CAM to examine class separability and disease-relevant attention patterns.
Related work
Plant disease recognition has progressed from handcrafted feature pipelines to end-to-end deep learning and, more recently, self-supervised representation learning. Prior studies can be broadly grouped into (1) traditional computer vision and classical machine learning approaches that rely on manually engineered descriptors and shallow classifiers, and (2) modern deep learning methods, including supervised CNN/Transformer architectures and self-supervised learning (SSL) strategies that aim to reduce annotation dependence and improve transferability across imaging conditions and datasets.
Traditional plant disease detection approaches
Before deep learning became dominant, automated plant disease detection commonly relied on handcrafted feature engineering followed by classical classifiers. Typical pipelines extracted texture descriptors such as the Gray-Level Co-occurrence Matrix (GLCM) and Local Binary Patterns (LBP), together with color- and shape-based features, to characterize symptomatic regions29,30. These features were then used with conventional learners including Support Vector Machines (SVM), k-Nearest Neighbors (k-NN), and Decision Trees31,32. While such approaches established early baselines and provided interpretable feature-based workflows, they were constrained by the limited expressiveness of manually designed descriptors and their sensitivity to illumination variation, scale changes, and background clutter. Their generalization across datasets was often weak in realistic settings with high intra-class variability and strong inter-class similarity, and performance tended to degrade as dataset size and visual diversity increased, motivating a transition to representation learning with deep neural networks33.
Deep learning-based approaches
Deep learning has enabled end-to-end learning of discriminative representations for plant disease detection, and CNN backbones such as ResNet34 and more efficient variants have reported strong performance across multiple crops and disease categories35,36. However, purely supervised approaches generally require large, reliably annotated datasets, which is a major bottleneck in many agricultural contexts. To better model global context and improve robustness to background clutter, recent studies have explored Vision Transformers (ViTs) and hybrid CNN–Transformer designs8,18,37, although these models may incur higher computational cost and still show limited adaptability under dataset shift. To reduce annotation requirements and improve transferability, SSL methods, including contrastive representation learning, have been increasingly investigated7,15,38. In a typical contrastive SSL pipeline, an encoder is pretrained using a contrastive objective on unlabeled images and then fine-tuned using a smaller labeled set for disease classification. While SSL can improve performance in low-label regimes, relatively few studies evaluate contrastive SSL under clearly defined cross-dataset protocols (e.g., pretraining and/or training in one domain and transferring to another) with consistent settings and without extensive re-tuning. This gap motivates the present study, which evaluates a single SimCLR-based pretraining and fine-tuning protocol on both PlantVillage and Cassava datasets with emphasis on reproducibility, transferability, and deployment practicality. Table 1 summarizes representative work from 2023 to 2025 and highlights recurring limitations related to dataset diversity, field robustness, computational cost, and cross-dataset generalization. Figure 1 provides an overview of the PlantCLR pipeline evaluated in this work. 
While contrastive self-supervised learning has been explored in general computer vision, its behavior under explicitly defined agricultural domain shift has not been systematically evaluated using strict split isolation and controlled benchmarking. In particular, few studies analyze how contrastive pretraining influences robustness when transferring from controlled datasets to field-acquired imagery. This gap motivates our evaluation-driven investigation.
Methods
Overview
We study a two-stage contrastive self-supervised learning pipeline for plant disease classification that separates representation learning from supervised classification. The framework consists of (1) a lightweight convolutional encoder \(f_e(\theta )\), (2) a projection head \(g_\phi (\cdot )\) used only during contrastive pretraining, and (3) a classification head \(h_\psi (\cdot )\) used during supervised fine-tuning. In the first stage, the encoder performs unsupervised representation learning by contrasting augmented views of unlabeled images. In the second stage, the pretrained encoder is adapted on labeled data for disease classification using a cross-entropy objective. A graphical overview of the full pipeline is shown in Fig. 1.
Overview of the contrastive self-supervised learning pipeline for plant disease classification evaluated on PlantVillage and Cassava datasets.
Problem formulation
Let \(\mathscr {X} = \{\mathscr {X}_l, \mathscr {X}_u\}\) denote the complete collection of plant disease images available for model development, where \(\mathscr {X}_l\) represents the labeled subset and \(\mathscr {X}_u\) represents the unlabeled subset. The labeled dataset is defined as

$$\mathscr {X}_l = \{(x_i, y_i)\}_{i=1}^{N},$$

where \(x_i \in \mathbb {R}^{H \times W \times 3}\) denotes an RGB image of a plant leaf and \(y_i \in \{1, \ldots , C\}\) is the corresponding disease class label among C categories. The unlabeled dataset is given by

$$\mathscr {X}_u = \{x_j\}_{j=1}^{K},$$

which consists of K images without annotations, typically satisfying \(N \ll K\).
The objective is to learn a robust image classification function

$$F: \mathbb {R}^{H \times W \times 3} \rightarrow \{1, \ldots , C\},$$
capable of accurately predicting the disease class of unseen plant leaf images by leveraging both labeled and unlabeled data. In practical agricultural settings, labeled samples are limited due to the high cost and expertise required for annotation, whereas unlabeled images are abundant, motivating learning strategies that can exploit unlabeled data effectively.
A central challenge in this formulation arises from domain mismatch between datasets. The unlabeled dataset \(\mathscr {X}_u\) may originate from a target domain with different visual characteristics, such as field-acquired images with varying illumination, background clutter, and occlusions, while the labeled dataset \(\mathscr {X}_l\) is often drawn from a source domain with more controlled acquisition conditions. This domain shift can significantly degrade the generalization performance of purely supervised models trained only on \(\mathscr {X}_l\).
To address this challenge, we adopt a two-stage learning strategy. In the first stage, contrastive self-supervised learning is applied to the unlabeled dataset \(\mathscr {X}_u\) to learn general and domain-invariant visual representations using a convolutional encoder \(f_e\). In the second stage, the pretrained encoder is fine-tuned on the labeled dataset \(\mathscr {X}_l\) using a supervised classification loss to obtain class-discriminative representations. This design enables effective knowledge transfer from abundant unlabeled data to the downstream classification task, reduces misclassification on unseen samples, and improves robustness under domain shift, thereby supporting practical deployment in real-world agricultural environments.
Contrastive pretraining
Data augmentation
To promote invariance to photometric and geometric nuisance factors and to create informative positive pairs, we apply a stochastic augmentation pipeline to the unlabeled images in \(\mathscr {X}_u\). Let \(\mathscr {T}\) denote a family of random transformations, including random resized cropping, horizontal flipping, color jittering, and Gaussian blur. For each image \(x_i \in \mathscr {X}_u\), we independently sample two transformations \(t_1, t_2 \sim \mathscr {T}\) and generate two augmented views:

$$\tilde{x}_i^{(1)} = t_1(x_i), \qquad \tilde{x}_i^{(2)} = t_2(x_i).$$
The pair \((\tilde{x}_i^{(1)}, \tilde{x}_i^{(2)})\) forms a positive pair for contrastive learning, while augmented views originating from other images in the same mini-batch serve as negatives. This design encourages the encoder to learn representations that remain consistent across appearance changes while preserving disease-relevant cues.
Encoder and projection head
Given two augmented views \(\tilde{x}_i^{(1)}\) and \(\tilde{x}_i^{(2)}\), we use a shared encoder \(f_e(\cdot ;\theta )\) to extract feature representations. The encoder is trained to capture disease-relevant visual patterns while being invariant to nuisance factors introduced by the augmentation pipeline.
We adopt ConvNeXt-Tiny as the encoder backbone due to its favorable balance between representational capacity and computational efficiency. Rather than increasing architectural complexity, our objective is to examine whether contrastive pretraining itself can enhance cross-domain robustness using a compact model suitable for deployment in resource-constrained agricultural environments. This choice allows us to isolate the effect of representation learning strategy from model scale. During contrastive pretraining, the encoder output is passed to a projection head \(g_\phi (\cdot )\), implemented as a two-layer MLP with a ReLU activation. This head maps features to the space where the contrastive loss is applied, which is standard in SimCLR-style training and improves optimization without degrading downstream performance. The projected embeddings for the two views are computed as:

$$z_i^{(1)} = g_\phi \big (f_e(\tilde{x}_i^{(1)})\big ), \qquad z_i^{(2)} = g_\phi \big (f_e(\tilde{x}_i^{(2)})\big ).$$
The pair \((z_i^{(1)}, z_i^{(2)})\) forms a positive pair, while embeddings from other images in the mini-batch serve as negatives.
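In code, the projection head \(g_\phi\) can be sketched as a small PyTorch module. The 512-dimensional hidden layer and 128-dimensional output match the pretraining configuration reported below; the 768-dimensional input corresponds to the ConvNeXt-Tiny feature width.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Two-layer MLP g_phi mapping encoder features to the contrastive space."""
    def __init__(self, in_dim=768, hidden_dim=512, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, h):
        # h: encoder features of shape (batch, in_dim)
        return self.net(h)
```

After pretraining, this module is discarded and only the encoder is carried forward.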
Contrastive loss
The goal of contrastive learning is to pull together representations of two augmented views of the same image (positive pair) while pushing apart representations of different images (negative pairs). We adopt the Normalized Temperature-scaled Cross-Entropy (NT-Xent) loss as in SimCLR15. Given projected embeddings \(z_i\) and \(z_j\), we compute cosine similarity as:

$$\mathrm {sim}(z_i, z_j) = \frac{z_i^{\top } z_j}{\Vert z_i \Vert \, \Vert z_j \Vert }.$$
For a mini-batch of N images, each augmented twice to form 2N views, the NT-Xent loss for an anchor i and its positive j is:

$$\ell _{i,j} = -\log \frac{\exp \left( \mathrm {sim}(z_i, z_j)/\tau \right) }{\sum _{k=1}^{2N} \mathbb {1}_{[k \ne i]} \exp \left( \mathrm {sim}(z_i, z_k)/\tau \right) },$$
where \(\tau >0\) is a temperature parameter and \(\mathbb {1}_{[k\ne i]}\) prevents comparing the anchor with itself. After contrastive pretraining, the projection head \(g_\phi\) is discarded and the encoder \(f_e\) is retained for supervised fine-tuning.
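For concreteness, the NT-Xent computation can be sketched as a minimal NumPy reference implementation (an illustrative sketch, not the training code), with the 2N embeddings stacked so that rows i and i + N form positive pairs:

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """NT-Xent over 2N stacked embeddings; rows i and i+N are positive pairs."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # cosine sim via L2-normalization
    n2 = z.shape[0]
    n = n2 // 2
    sim = z @ z.T / tau                                  # pairwise similarities / temperature
    np.fill_diagonal(sim, -np.inf)                       # 1_{[k != i]}: exclude self-comparison
    pos = np.concatenate([np.arange(n, n2), np.arange(n)])  # index of each anchor's positive
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    # loss_i = -(sim(anchor, positive) - logsumexp over all other views)
    return (logsumexp - sim[np.arange(n2), pos]).mean()
```

Lower values indicate that positives are closer than in-batch negatives in the embedding space.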
Pretraining configuration. During SimCLR-style contrastive pretraining, we use ConvNeXt-Tiny as the encoder \(f_e(\cdot ;\theta )\). A projection head \(g_\phi (\cdot )\) is attached only for contrastive training and removed for downstream fine-tuning. Following standard SimCLR practice, \(g_\phi\) is implemented as a two-layer MLP with a hidden dimension of 512 and a 128-dimensional output embedding, with ReLU activation, projecting encoder features to the space where the NT-Xent loss is applied. Contrastive training uses a batch size of 32 and temperature \(\tau = 0.5\) (Eq. 4). The encoder is pretrained for 100 epochs using SGD with momentum 0.9 and weight decay \(10^{-4}\), on unlabeled images from the PlantVillage training split, after which the projection head is discarded and the pretrained encoder is transferred to the supervised fine-tuning stage. All experiments are conducted on a single NVIDIA RTX GPU with 24 GB memory (Table 2).
Supervised fine-tuning
Once the encoder \(f_e\) is pretrained using the contrastive objective, we remove the projection head and attach a lightweight classifier head \(h_\psi\). The model is then trained on the labeled subset \(\mathscr {X}_l\) using cross-entropy loss. Unless otherwise stated in the experiments, we fine-tune the encoder together with the classifier head to adapt representations to class-discriminative features under limited supervision.
Classifier head
After contrastive pretraining, we discard the projection head and attach a lightweight linear classifier \(h_\psi\) on top of the encoder \(f_e\). Given an input image \(x_i\), the encoder produces a feature vector which is mapped to class logits by \(h_\psi\), and class probabilities are obtained via a softmax operation. This design keeps inference efficient while allowing the pretrained encoder to adapt to the downstream disease classes during fine-tuning.
Cross-entropy loss with label smoothing
To reduce overconfident predictions and improve robustness under limited labeled data, we use cross-entropy training with label smoothing. Concretely, a small smoothing factor \(\epsilon\) redistributes a fraction of the target probability mass from the ground-truth class to the remaining classes, which acts as a regularizer and typically improves calibration and generalization. The supervised objective is:

$$\mathscr {L}_{\mathrm {sup}} = -\frac{1}{N} \sum _{i=1}^{N} \sum _{c=1}^{C} \tilde{y}_{i,c} \log \hat{y}_{i,c}, \qquad \tilde{y}_{i,c} = (1-\epsilon )\, \mathbb {1}_{[c = y_i]} + \frac{\epsilon }{C-1}\, \mathbb {1}_{[c \ne y_i]},$$
where \(\tilde{y}_{i,c}\) denotes the smoothed target for class c and \(\hat{y}_{i,c}\) is the predicted probability.
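A minimal NumPy sketch of this objective, redistributing \(\epsilon\) of the target mass uniformly over the C − 1 non-target classes as described above (illustrative, not the training code):

```python
import numpy as np

def smoothed_targets(labels, num_classes, eps=0.1):
    """Smoothed targets: 1 - eps for the true class, eps / (C - 1) for the rest."""
    t = np.full((len(labels), num_classes), eps / (num_classes - 1))
    t[np.arange(len(labels)), labels] = 1.0 - eps
    return t

def smoothed_cross_entropy(probs, labels, eps=0.1):
    """L_sup = -(1/N) sum_i sum_c smoothed_target * log(predicted probability)."""
    t = smoothed_targets(labels, probs.shape[1], eps)
    return -(t * np.log(probs)).sum(axis=1).mean()
```

In PyTorch, a closely related built-in variant is available via `torch.nn.CrossEntropyLoss(label_smoothing=...)`, which instead spreads \(\epsilon\) uniformly over all C classes.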
Workflow of the proposed model
Algorithm 1 summarizes the overall training workflow, which consists of two stages. In Stage 1, we perform contrastive self-supervised pretraining on unlabeled images to learn transferable visual representations. In Stage 2, we fine-tune the pretrained encoder using labeled data with a lightweight classifier head. Unless stated otherwise in the evaluation protocol, the unlabeled data used for Stage 1 and the labeled data used for Stage 2 are taken only from the corresponding training split to avoid leakage. After pretraining, the projection head is removed and only the encoder is transferred to the supervised stage.
Two-stage workflow of the PlantCLR training pipeline
Experimental setup
Datasets
We evaluate PlantCLR on two widely used plant disease benchmarks, PlantVillage and Cassava Leaf Disease. These datasets are deliberately chosen because they represent two very different imaging conditions. PlantVillage images are collected in a controlled setting with clean backgrounds and stable illumination, whereas Cassava images are captured in real field environments with complex backgrounds, shadows, occlusions, and large variations in leaf pose. This contrast allows us to assess both (i) performance under standard in-domain testing and (ii) robustness under dataset shift, which is critical for practical agricultural deployment.
PlantVillage PlantVillage is a commonly used benchmark for plant disease recognition, introduced in33. It contains more than 54,000 RGB images organized into 38 classes spanning 14 crop species, including tomato, potato, maize, grape, and apple. Most images depict a single leaf centered in the frame, often against a homogeneous or near-homogeneous background. This controlled acquisition makes PlantVillage suitable for benchmarking model capacity and for learning strong visual representations, since disease patterns such as lesions, discoloration, and texture changes are clearly visible.
At the same time, the controlled nature of PlantVillage can lead to overly optimistic results when models are deployed on field images. Real-world conditions include cluttered backgrounds, varying illumination, motion blur, and partial occlusions that are rarely present in PlantVillage. For this reason, we primarily use PlantVillage as a large and diverse source of unlabeled images for contrastive self-supervised pretraining. Representative examples are shown in Fig. 2.
Representative samples from the two benchmark datasets used in this study: (a) PlantVillage images captured under controlled conditions, and (b) Cassava Leaf Disease images collected in real-field environments.
Cassava Leaf Disease The Cassava Leaf Disease dataset was released as part of the Cassava Leaf Disease Classification challenge hosted on Kaggle and contains field-acquired cassava leaf images with substantial variability in illumination, background, and leaf pose. The dataset includes approximately 21,000 labeled images across five categories: Cassava Bacterial Blight (CBB), Cassava Brown Streak Disease (CBSD), Cassava Green Mottle (CGM), Cassava Mosaic Disease (CMD), and Healthy. Compared to PlantVillage, Cassava is significantly more challenging due to real-field noise factors and class imbalance, particularly for less frequent classes such as CGM.
We use Cassava as a field-realistic benchmark to evaluate whether representations learned by contrastive pretraining remain useful under practical conditions. Representative examples are shown in Fig. 2b.
Evaluation protocols and data splits
To eliminate ambiguity and prevent data leakage, we explicitly define the training and evaluation protocols used in all experiments.
Dataset splits. For both PlantVillage and Cassava Leaf Disease datasets, images are divided into disjoint training, validation, and test splits using stratified sampling to preserve class distributions. Unless stated otherwise, the training split is used for model optimization, the validation split for model selection and early stopping, and the test split is used exclusively for final performance reporting. Validation and test images are never used during either contrastive pretraining or supervised fine-tuning.
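The stratified splitting step can be sketched in plain Python. The 80/10/10 ratios and the seed below are illustrative assumptions, not the exact experimental configuration.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """Return disjoint train/val/test index lists preserving per-class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_train = int(ratios[0] * len(idxs))
        n_val = int(ratios[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test
```

Splitting per class before pooling guarantees that each split mirrors the overall class distribution and that the three index sets are disjoint.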
In-domain evaluation. For in-domain experiments on PlantVillage, contrastive self-supervised pretraining is performed using only unlabeled images from the PlantVillage training split. Supervised fine-tuning is then carried out using only labeled images from the same training split. Model selection is performed on the validation split, and final results are reported on the held-out test split.
Cross-domain adaptation under dataset shift. To evaluate robustness under dataset shift, we treat PlantVillage as the source domain and Cassava as the target domain. The encoder is first contrastively pretrained using only unlabeled images from the PlantVillage training split. The pretrained encoder is then fine-tuned using only labeled images from the Cassava training split. No Cassava validation or test images are used during pretraining or fine-tuning. Final evaluation is performed on the Cassava test split. This protocol evaluates cross-domain adaptation rather than zero-shot transfer.
These protocols ensure strict separation between training and evaluation data and enable reproducible assessment of both in-domain performance and robustness under domain shift.
Implementation details
Backbone and architecture. We use ConvNeXt-Tiny39 as the encoder due to its strong balance between accuracy and efficiency. During contrastive pretraining, we attach a two-layer MLP projection head, following standard contrastive learning practice. During supervised training, the projection head is removed and replaced with a lightweight linear classification head.
Training environment. All experiments are conducted on an NVIDIA GeForce RTX GPU with 24 GB memory.
Optimization and hyperparameters. We train using SGD with momentum 0.9 and weight decay \(10^{-4}\). The initial learning rate is set to \(10^{-2}\) and decayed using a polynomial schedule. The contrastive temperature is fixed to \(\tau =0.5\). Unless stated otherwise, the batch size is 32. Because contrastive learning is sensitive to training details, we explicitly report the pretraining and fine-tuning schedules in the “Results” section to avoid ambiguity.
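The polynomial schedule can be expressed as a simple function of the epoch index; the decay exponent below is an assumption, since the text specifies polynomial decay but not its degree.

```python
def poly_lr(epoch, total_epochs, base_lr=1e-2, power=0.9):
    """Polynomial decay from base_lr at epoch 0 down to 0 at the final epoch."""
    return base_lr * (1.0 - epoch / total_epochs) ** power
```

In PyTorch, the same schedule can be wired in by passing the decay factor \((1 - e/T)^{p}\) as the multiplier to `torch.optim.lr_scheduler.LambdaLR`.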
Baseline training parity. All baseline models reported in Table 5 were trained under identical data splits, input resolution (\(224 \times 224\)), augmentation pipelines, optimizer (SGD with momentum 0.9), weight decay (\(10^{-4}\)), learning-rate schedule (polynomial decay initialized at \(10^{-2}\)), and number of supervised training epochs. Label smoothing was applied consistently across all supervised models to ensure comparable regularization effects.
For transformer-based ViT-B16, we retained the same input resolution and supervised schedule for fairness, without introducing architecture-specific hyperparameter tuning that could bias comparison. Our objective was to maintain controlled optimization parity across models rather than perform per-model hyperparameter optimization. Therefore, any performance differences primarily reflect representational capacity and learning strategy rather than procedural discrepancies.
Batch size and negative diversity. Contrastive methods such as SimCLR are generally believed to benefit from large pools of negative samples, typically realized through large batch sizes. In our implementation, a batch size of 32 yields 62 in-batch negatives per anchor under the standard SimCLR formulation. Although smaller than the largest-scale configurations used in earlier pretraining studies, this setting reflects realistic computational limitations and deployment-oriented training scenarios. To mitigate the resulting negative-diversity bottleneck, we rely on diverse augmentations and extended pretraining (100 epochs) on rich unlabeled data. Empirically, training curves showed stable convergence without collapse, and cross-domain evaluation showed consistent gains over supervised-only baselines. These observations also suggest that, for the tested datasets, the gains from representation learning saturate within this batch-size regime. We did not explore alternative strategies such as memory banks or momentum encoders (e.g., MoCo-style frameworks) that decouple the negative-set size from the batch size; our goal was to benchmark a SimCLR-based pipeline that is reproducible and computationally lightweight without additional architectural complexity.
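The negative count per anchor follows directly from the SimCLR formulation: a batch of N images yields 2N augmented views, and each anchor excludes itself and its single positive.

```python
def in_batch_negatives(batch_size):
    """Negatives per anchor in SimCLR: 2N views minus the anchor and its positive."""
    return 2 * batch_size - 2
```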
Evaluation metrics
We evaluate model performance in a multi-class classification setting using Accuracy, Precision, Recall, and F1-score. Since both PlantVillage and Cassava Leaf Disease datasets exhibit class imbalance, we report macro-averaged metrics, which assign equal importance to each class and provide a balanced assessment independent of class frequency.
Let C denote the total number of classes. For each class \(c \in \{1,\dots ,C\}\), let \(TP_c\), \(FP_c\), \(TN_c\), and \(FN_c\) represent the number of true positives, false positives, true negatives, and false negatives, respectively, as obtained from the multi-class confusion matrix.
The class-wise Precision and Recall are defined as:

$$\mathrm {Precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad \mathrm {Recall}_c = \frac{TP_c}{TP_c + FN_c}.$$

The class-wise F1-score, which balances Precision and Recall, is computed as:

$$F1_c = \frac{2 \cdot \mathrm {Precision}_c \cdot \mathrm {Recall}_c}{\mathrm {Precision}_c + \mathrm {Recall}_c}.$$

Overall Accuracy is defined as the proportion of correctly classified samples across all classes:

$$\mathrm {Accuracy} = \frac{\sum _{c=1}^{C} TP_c}{\sum _{c=1}^{C} \left( TP_c + FN_c \right) }.$$
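These quantities follow directly from the multi-class confusion matrix; a NumPy sketch (assuming every class is predicted at least once, so no denominator is zero):

```python
import numpy as np

def macro_metrics(cm):
    """Macro Precision/Recall/F1 and overall Accuracy from a CxC confusion
    matrix with rows = true class and columns = predicted class."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp          # predicted as c but true class differs
    fn = cm.sum(axis=1) - tp          # true class c but predicted otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return precision.mean(), recall.mean(), f1.mean(), accuracy
```

Macro-averaging weights every class equally, so rare classes influence the reported score as much as frequent ones.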
All reported metrics are computed on held-out test sets following the evaluation protocols described in “Evaluation protocols and data splits” section.
Visualization and interpretability
To complement quantitative performance, we include qualitative analyses using t-SNE and Grad-CAM. t-SNE is applied to the learned feature embeddings to visually assess whether samples from different disease categories become more separable after contrastive pretraining and fine-tuning. Grad-CAM is used to highlight image regions that contribute most strongly to the predicted class, helping verify that the model attends to disease-relevant regions (such as lesion areas or discoloration patterns) rather than background artifacts. Together, these visualization tools provide additional evidence of representation quality, interpretability, and robustness, and they are reported alongside confusion matrices, per-class metrics, and ROC curves.
Results
Evaluation setting. All experiments are conducted under a strictly controlled evaluation setting designed to separately assess in-domain performance and cross-dataset generalization while preventing any form of data leakage. Each dataset is partitioned into disjoint training and test splits prior to experimentation, and test samples are never used during contrastive pretraining, supervised fine-tuning, or model selection.
For the PlantVillage benchmark, results are obtained using an in-domain setting in which both contrastive self-supervised pretraining and supervised fine-tuning are performed exclusively on the PlantVillage training split. During the self-supervised stage, only unlabeled PlantVillage training images are used to learn visual representations. In the supervised stage, the pretrained encoder is fine-tuned using labeled PlantVillage training images. Final performance is reported on the held-out PlantVillage test split, which is never accessed during training.
For the Cassava Leaf Disease benchmark, results are obtained under a cross-dataset transfer setting that explicitly evaluates robustness to domain shift. In this case, the encoder is first contrastively pretrained using only unlabeled PlantVillage training images, without access to any Cassava data. The pretrained encoder is then transferred to the Cassava dataset and fine-tuned using labeled Cassava training images only. Evaluation is performed on the held-out Cassava test split, which remains completely unseen during both pretraining and fine-tuning.
Under both settings, all reported quantitative metrics and qualitative visualizations are computed exclusively on held-out test data. This evaluation strategy ensures fair comparison, eliminates information leakage, and enables a clear assessment of both in-domain accuracy and cross-dataset generalization performance.
Training and validation loss and accuracy curves on PlantVillage and Cassava datasets.
Results on PlantVillage
The proposed PlantCLR framework achieves a test accuracy of 99.10% and a macro-F1 score of 99.04% on the PlantVillage dataset, demonstrating strong and well-balanced classification performance across all evaluated disease categories. The high macro-averaged score indicates that the model maintains consistent recognition accuracy even for less frequent classes, rather than being dominated by majority categories.
The training and validation curves shown in Fig. 3 exhibit smooth and stable convergence behavior, with no noticeable divergence between training and validation trends. Validation accuracy exceeds 98% after approximately epoch 25 and remains stable thereafter, suggesting effective optimization and minimal overfitting. This stability can be attributed to the contrastive self-supervised pretraining stage, which encourages the encoder to learn robust and transferable visual representations before supervised fine-tuning.
Overall, the results on PlantVillage indicate that contrastive pretraining enables the model to capture discriminative disease-related features under controlled imaging conditions while preserving strong generalization within the same domain. These findings establish a reliable in-domain performance baseline and provide a foundation for evaluating the robustness of the learned representations under more challenging real-field conditions, as explored in the following subsection. The PlantVillage confusion matrix (Fig. 4) exhibits strong diagonal dominance, indicating that most samples are classified correctly. Remaining errors occur primarily between visually similar disease pairs (e.g., early vs late blight), which is expected due to overlapping lesion characteristics. These observations suggest that PlantCLR learns discriminative representations but that highly similar symptom patterns remain challenging and may benefit from more fine-grained augmentation or localized attention mechanisms.
Confusion matrices for the evaluated datasets: (a) PlantVillage: Confusion matrix on the test set; (b) Cassava Leaf Disease: Confusion matrix on the test set.
Results on Cassava Leaf Disease
Cassava is substantially more challenging than PlantVillage because images are captured in the field with background clutter, illumination variation, occlusions, and leaf pose changes. After adaptation to Cassava training labels, PlantCLR achieves 96.83% test accuracy and 96.70% macro-F1 score. This relatively small degradation compared to PlantVillage indicates that contrastive pretraining provides a strong initialization that adapts effectively to real-world conditions after supervised fine-tuning.
The Cassava confusion matrix (Fig. 4) confirms strong class-wise performance, while most misclassifications occur between visually similar diseases such as CBSD and CMD, which are known to share overlapping symptoms (e.g., mottling and deformation). These errors highlight the practical difficulty of fine-grained cassava diagnosis and motivate future work on symptom localization and uncertainty-aware prediction. To further examine the effect of class imbalance, we report per-class precision, recall, F1-score, and support for the Cassava test set in Table 3. These results explicitly reflect the class imbalance present in the Cassava dataset, with CMD representing the majority class (13,158 samples) and CBB the smallest class (1,087 samples). Despite this disparity, minority classes maintain strong recognition performance. In particular, CBB achieves an F1-score of 95.57%, indicating no collapse toward majority-class dominance. The primary misclassifications occur between visually similar categories such as CBSD and CMD, suggesting that residual errors are driven by symptom similarity rather than class frequency alone. The small difference between macro-averaged F1 (97.55%) and weighted F1 (98.30%) further confirms that performance remains stable across classes despite imbalance.
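The contrast between macro- and weighted-averaged F1 is what makes this imbalance check meaningful: macro-F1 weights every class equally, so a collapsed minority class drags it down, whereas weighted-F1 is dominated by the majority class. A minimal sketch with synthetic labels (not the Cassava predictions) shows the mechanism:

```python
# Macro-F1 averages per-class F1 scores equally, so a weak minority
# class drags it down; weighted-F1 weights each class by its support,
# so the majority class dominates. Synthetic labels for illustration
# only (NOT the Cassava predictions).
from sklearn.metrics import f1_score

# Imbalanced ground truth: class 0 is the majority, class 1 the minority.
y_true = [0] * 90 + [1] * 10
# Predictions: perfect on class 0, half wrong on class 1.
y_pred = [0] * 90 + [1] * 5 + [0] * 5

macro = f1_score(y_true, y_pred, average="macro")
weighted = f1_score(y_true, y_pred, average="weighted")
print(f"macro-F1 = {macro:.3f}, weighted-F1 = {weighted:.3f}")
```

A large macro/weighted gap in this toy case signals minority-class failure; the small gap reported for Cassava (97.55% vs 98.30%) is therefore evidence against such a failure.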
Summary comparison across datasets
Table 4 summarizes the main quantitative results on PlantVillage and Cassava. Overall, PlantCLR maintains consistently high accuracy and macro-averaged metrics across both benchmarks, indicating that performance gains are not limited to majority classes and that recognition remains balanced under class imbalance. This is particularly important for agricultural decision support, where minority disease classes may be rare but highly consequential if missed.
A key observation is the relatively small performance drop when moving from PlantVillage (controlled acquisition) to Cassava (real-field acquisition). Such a gap is expected due to domain shift factors including background clutter, illumination variation, occlusion, and changes in leaf pose. The fact that PlantCLR sustains strong macro-F1 under these conditions suggests that contrastive pretraining learns transferable representations that are less dependent on dataset-specific cues and more aligned with disease-relevant visual characteristics.
From a practical perspective, these results imply that the proposed pipeline can serve as a robust initialization for real-world deployment settings, where models trained only on clean laboratory datasets often fail. In addition, the stability of precision and recall across the two datasets indicates that PlantCLR does not achieve higher performance by simply trading off one error type for another (e.g., reducing false positives while increasing false negatives). This balanced behavior is desirable for field use, where both missed detections and unnecessary interventions can have economic impact.
Finally, the cross-dataset comparison reinforces the motivation of this study: contrastive self-supervised learning is an effective strategy to improve generalization in plant disease recognition, particularly when training data are limited and collected under heterogeneous imaging conditions. These findings support further exploration of SSL-based pipelines for broader multi-crop settings and for robust deployment under diverse real-field environments.
Comparative analysis with baselines
To evaluate whether contrastive pretraining provides consistent gains beyond standard supervised training, we compare PlantCLR against a diverse set of widely used CNN and transformer-based baselines: DenseNet121, DenseNet16940, GoogLeNet41, InceptionV342, ResNet5034, VGG1643, and ViT-B1644. These models cover a broad spectrum of architectural paradigms, including densely connected networks, inception-style multi-scale feature extractors, deep residual learning, and global self-attention mechanisms.
Across both PlantVillage and Cassava datasets, PlantCLR achieves the strongest overall accuracy and macro-F1 score, indicating that contrastive pretraining improves both discriminability and cross-domain stability. On PlantVillage, several supervised CNN baselines achieve competitive in-domain accuracy; however, their macro-averaged metrics are consistently lower, suggesting sensitivity to class imbalance and reliance on dataset-specific visual cues. In contrast, PlantCLR maintains uniformly high precision and recall across classes, reflecting more balanced class-wise recognition.
The performance gap becomes more pronounced on the Cassava dataset, which represents a realistic field scenario with strong domain shift. Many supervised baselines exhibit noticeable degradation in accuracy and macro-F1 under these conditions, highlighting limited robustness to background clutter, illumination variation, and leaf pose changes. Transformer-based ViT-B16, while effective on PlantVillage, shows reduced stability on Cassava, likely due to its higher data requirements and sensitivity when trained with limited labeled samples. PlantCLR, despite using a compact ConvNeXt-Tiny backbone, demonstrates superior cross-domain generalization, underscoring that the learning strategy plays a more critical role than architectural complexity alone.
These results suggest that contrastive self-supervised pretraining enables the encoder to learn more transferable and disease-focused representations by exploiting large-scale unlabeled data prior to supervised adaptation. Consequently, PlantCLR not only surpasses conventional supervised baselines in accuracy but also offers improved robustness and reliability under domain shift, making it a practical and scalable alternative for real-world plant disease recognition systems.
Grad-CAM visualizations highlighting disease-relevant regions in (a) PlantVillage samples captured under controlled conditions and (b) Cassava field samples exhibiting strong domain shift.
t-SNE visualizations of learned feature embeddings on (a) PlantVillage and (b) Cassava.
ROC analysis
We further assess the discriminative capability of the proposed framework using Receiver Operating Characteristic (ROC) curves and the Area Under the Curve (AUC), which provide a threshold-independent evaluation of classification performance by capturing the trade-off between the true positive rate and false positive rate. Figure 7a,b present a comparative analysis of PlantCLR against the baseline models on the PlantVillage and Cassava datasets, respectively. Across both benchmarks, PlantCLR consistently exhibits higher ROC curves and larger AUC values, indicating stronger class separability and more reliable confidence calibration over a wide range of decision thresholds. This behavior suggests that contrastive pretraining leads to more discriminative and stable feature representations, particularly under challenging conditions such as class imbalance and domain shift. Importantly, robust ROC characteristics are critical for real-world agricultural decision support systems, where minimizing false negatives is essential to avoid delayed disease treatment, while controlling false positives helps prevent unnecessary interventions and resource expenditure. The superior AUC trends observed for PlantCLR therefore reinforce its practical suitability for deployment in field-level plant disease monitoring and management scenarios.
Although AUC values are reported for single trained models under fixed initialization, training was conducted using identical random seeds and controlled hyperparameters to ensure reproducibility. Optimization exhibited stable convergence across datasets without signs of representation collapse or instability during contrastive pretraining. The observed performance gains are further supported by consistent improvements in the controlled ablation and symmetric transfer experiments, suggesting that the reported AUC advantages reflect systematic representation benefits rather than stochastic fluctuation.
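The threshold-independent evaluation above reduces to sweeping a decision threshold over predicted scores and integrating the resulting TPR/FPR curve. A minimal sketch with synthetic scores (illustrative values, not PlantCLR outputs) makes the computation explicit:

```python
# ROC/AUC on synthetic binary scores, mirroring the threshold-
# independent evaluation described above (illustrative values,
# not PlantCLR outputs).
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = np.array([0] * 500 + [1] * 500)
# Negative and positive scores from overlapping Gaussians:
# the better separated the distributions, the higher the AUC.
scores = np.concatenate([rng.normal(0.3, 0.15, 500),
                         rng.normal(0.7, 0.15, 500)])

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one point per threshold
auc = roc_auc_score(y_true, scores)
print(f"AUC = {auc:.3f}  ({len(thresholds)} thresholds evaluated)")
```

For the multi-class setting in Fig. 7, the same computation is applied per class in a one-vs-rest fashion and the resulting curves are compared across models.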
ROC curves for baseline models and the proposed PlantCLR on (a) PlantVillage and (b) Cassava.
Qualitative visual insights
Representation structure and interpretability. Beyond scalar performance metrics, we analyze the structure and interpretability of learned representations to verify that performance improvements correspond to meaningful feature organization. In practical agricultural systems, robustness requires not only high accuracy but also stable and semantically aligned representations under visual variability.
To interpret model behavior beyond aggregate metrics, we conduct qualitative analyses using both t-SNE visualization and Gradient-weighted Class Activation Mapping (Grad-CAM). The t-SNE projections of learned feature embeddings (Fig. 6a,b) demonstrate that samples from different disease categories form compact and well-separated clusters, indicating that contrastive self-supervised pretraining encourages discriminative and class-aware representations. This effect is particularly evident for visually similar disease classes, where improved inter-class margins are observed after fine-tuning.
Complementary to this global feature analysis, Grad-CAM visualizations (Fig. 5a,b) highlight spatial regions contributing most strongly to predictions. The highlighted regions consistently correspond to disease-relevant visual cues such as lesions, necrotic areas, discoloration patterns, and vein distortions, while background regions receive minimal attention. Together, these qualitative results corroborate quantitative findings and indicate that PlantCLR learns semantically meaningful representations robust to domain shift.
Assessment of background bias under domain shift. While PlantVillage images are captured under controlled laboratory conditions with relatively homogeneous backgrounds, we explicitly consider the possibility of background-induced bias. If classification were driven primarily by background artifacts, substantial degradation would be expected when transferring to Cassava, which contains field-acquired images with cluttered scenes, soil, shadows, overlapping leaves, and varying illumination. However, PlantCLR maintains strong cross-dataset performance (96.83% accuracy; 96.70% macro-F1), providing indirect quantitative evidence that learned representations are not dominated by background cues. The symmetric transfer experiment (“Reverse transfer (Cassava → PlantVillage)” section) further supports this conclusion.
t-SNE configuration and reproducibility. For reproducibility, feature embeddings were first reduced to 50 dimensions using PCA prior to t-SNE projection. The t-SNE embedding was computed with perplexity set to 30, learning rate 200, 1,000 optimization iterations, and a fixed random seed (random_state = 42) to ensure deterministic visualization under identical conditions. Although t-SNE is inherently sensitive to initialization and hyperparameter selection, qualitative cluster structure remained stable under modest parameter variation.
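Under the stated configuration, the projection pipeline can be reproduced as follows; random vectors stand in for the learned encoder embeddings, and scikit-learn's default of 1,000 optimization iterations matches the setting reported above.

```python
# PCA-then-t-SNE pipeline with the configuration stated above:
# 50 PCA components, perplexity 30, learning rate 200, fixed seed.
# scikit-learn's t-SNE runs 1,000 optimization iterations by default,
# matching the reported setting. Random vectors stand in for the
# learned feature embeddings.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
features = rng.normal(size=(300, 768))   # placeholder for encoder outputs

reduced = PCA(n_components=50, random_state=42).fit_transform(features)
embedding = TSNE(n_components=2, perplexity=30, learning_rate=200,
                 random_state=42).fit_transform(reduced)
print(embedding.shape)  # (300, 2)
```

The initial PCA step both speeds up the pairwise-affinity computation and suppresses high-dimensional noise before the nonlinear projection.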
Ablation study
To isolate the influence of contrastive self-supervised pretraining from architectural strength, we perform a controlled set of ablation experiments on both PlantVillage (in-domain) and Cassava (cross-domain adaptation) tasks. As a baseline, we use the vanilla ConvNeXt-Tiny backbone trained only in a supervised manner. We then add SimCLR-style contrastive pretraining before supervised fine-tuning, yielding the final PlantCLR pipeline. All variants are trained under strictly equivalent conditions for a fair comparison: input resolution of \(224 \times 224\), batch size of 32, SGD optimizer with momentum set to 0.9, weight decay equal to \(10^{-4}\), polynomial learning-rate schedule starting at \(10^{-2}\), and 30 epochs of supervised fine-tuning. In the pretraining variant, we contrastively pretrain the encoder on unlabeled PlantVillage training images for 100 epochs before fine-tuning. No other architectural or optimization changes are made. Evaluation is limited to held-out test splits as described in “Evaluation protocols and data splits” section, which prevents data leakage and ensures that observed differences derive solely from representation initialization. As reported in Table 6, contrastive pretraining provides a consistent improvement over supervised-only training. The improvement on PlantVillage is modest (in-domain setting), while the gain under cross-dataset adaptation to Cassava is markedly larger. Macro-F1, for example, improves from 93.52% to 96.70%, suggesting that self-supervised contrastive representation learning provides robustness to domain shift beyond what is achievable by simply increasing backbone capacity. These findings constitute strong evidence that representation initialization is an essential component for attaining stable cross-domain generalization in plant disease recognition.
Such disciplined ablation design aligns with recent recommendations in biomedical and medical image analysis literature, where multi-stage pipelines require controlled experimental isolation to ensure that incremental gains are interpreted rigorously rather than procedurally45.
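For reference, the SimCLR-style objective used in the pretraining variant is the NT-Xent (normalized temperature-scaled cross-entropy) loss, which can be sketched in a framework-agnostic way; the NumPy stand-in below is illustrative only, since the actual pipeline optimizes this loss via backpropagation through the encoder.

```python
# Framework-agnostic sketch of the SimCLR NT-Xent loss: each image
# contributes two augmented views, and each view must identify its
# partner among the 2N - 1 other embeddings in the batch. NumPy
# stand-in only; the actual pipeline optimizes this via backprop
# through the ConvNeXt-Tiny encoder.
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """z1, z2: (N, d) embeddings of two augmented views of the same N images."""
    z = np.concatenate([z1, z2], axis=0)                 # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # L2-normalize
    sim = (z @ z.T) / temperature                        # scaled cosine similarity
    np.fill_diagonal(sim, -np.inf)                       # exclude self-pairs
    n = len(z1)
    # The positive for row i is its other view: i+n in the first half, i-n after.
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Row-wise softmax cross-entropy against the positive index.
    logits = sim - sim.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), targets].mean()

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16))
# Identical views are perfectly aligned, so the loss is lower than
# for unrelated random views.
print(nt_xent_loss(a, a), nt_xent_loss(a, rng.normal(size=(8, 16))))
```

Minimizing this loss pulls the two views of each image together while pushing apart all other batch samples, which is the mechanism credited above for the domain-shift robustness.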
Reverse transfer (Cassava → PlantVillage)
To further evaluate robustness to source-domain bias and provide a symmetric cross-dataset generalization analysis, we conduct the reverse transfer experiment in which the encoder is contrastively pretrained on the Cassava training split and subsequently fine-tuned on PlantVillage under the identical supervised schedule and hyperparameters defined in “Ablation Study” section. All evaluation is performed exclusively on the held-out PlantVillage test split, preserving strict separation between training and testing data.
Because Cassava contains fewer samples (21k vs 54k) and substantially fewer classes (5 vs 38), the diversity of contrastive negatives during pretraining is reduced relative to the primary transfer direction. This setting therefore provides a complementary test of representation transfer from a smaller and less diverse source domain.
As shown in Table 7, reverse transfer improves performance over supervised-only training (98.21% → 98.67% accuracy; 98.05% → 98.48% macro-F1). Although the improvement margin is smaller than in the primary PlantVillage → Cassava direction, the results confirm that contrastive pretraining enhances cross-domain robustness beyond architectural capacity alone. The observed asymmetry suggests that source-domain scale and class diversity influence the magnitude of transferable representation benefits, while still supporting the overall claim that representation initialization plays a central role in stable cross-dataset generalization.
Conclusion
This paper presented PlantCLR, a reproducible contrastive self-supervised pretraining and fine-tuning pipeline for plant disease classification designed to reduce dependence on large labeled datasets and improve robustness under dataset shift. Using SimCLR-style contrastive pretraining on unlabeled PlantVillage images and supervised fine-tuning with a lightweight classifier, the proposed pipeline achieved 99.10% accuracy and 99.04% macro F1-score on the PlantVillage test set.
To evaluate performance in realistic field conditions, we further assessed the approach on the Cassava Leaf Disease dataset, which contains substantial variation in background, illumination, and leaf pose. After fine-tuning on labeled Cassava training data, the model achieved 96.83% accuracy and 96.70% macro F1-score on the Cassava test set, indicating strong transferability of the learned representations. Qualitative analyses using t-SNE and Grad-CAM supported these findings by showing improved class separability in the embedding space and consistent focus on disease-relevant regions. Overall, the results suggest that contrastive SSL can provide strong performance with compact architectures when evaluation protocols are clearly defined and data leakage is avoided.
Data availability
The datasets used and/or analyzed during the current study are publicly available. The Cassava Leaf Disease Classification dataset is available at https://www.kaggle.com/c/cassava-leaf-disease-classification, and the PlantVillage dataset is available at https://www.kaggle.com/datasets/emmarex/plantdisease.
References
Prommakhot, A., Onshaunjit, J., Ooppakaew, W., Samseemoung, G. & Srinonchat, J. Hybrid CNN and transformer-based sequential learning techniques for plant disease classification. IEEE Access 13, 122876–122887. https://doi.org/10.1109/ACCESS.2025.3586285 (2025).
Mamun, A. A. et al. Plant disease detection using self-supervised learning: A systematic review. IEEE Access 12, 171926–171943. https://doi.org/10.1109/ACCESS.2024.3475819 (2024).
Lu, F. et al. Leafconvnext: Enhancing plant disease classification for the future of unmanned farming. Comput. Electron. Agric. 233, 110165. https://doi.org/10.1016/j.compag.2025.110165 (2025).
Yu, F. et al. Healthnet: A health progression network via heterogeneous medical information fusion. IEEE Trans. Neural Netw. Learn. Syst. 34, 6940–6954. https://doi.org/10.1109/TNNLS.2022.3202305 (2023).
Saeed, F. et al. A robust approach for industrial small-object detection using an improved faster regional convolutional neural network. Sci. Rep. 11, 23390. https://doi.org/10.1038/s41598-021-02805-y (2021).
Kirti, K., Rajpal, N., Vishwakarma, V. P. & Soni, P. K. Fusion of non-iterative deep neural network feature extraction with kernel extreme learning machine for plant disease classification. Discov. Comput. 28, 154. https://doi.org/10.1007/s10791-025-09679-y (2025).
Yilma, G., Dagne, M., Ahmed, M. K. & Bellam, R. B. Attentive self-supervised contrastive learning (ASCL) for plant disease classification. Results Eng. 25, 103922. https://doi.org/10.1016/j.rineng.2025.103922 (2025).
De Silva, M. & Brown, D. Multispectral plant disease detection with vision transformer-convolutional neural network hybrid approaches. Sensors 23, 8531. https://doi.org/10.3390/s23208531 (2023).
Rezaei, M., Diepeveen, D., Laga, H., Jones, M. G. & Sohel, F. Plant disease recognition in a low data scenario using few-shot learning. Comput. Electron. Agric. 219, 108812. https://doi.org/10.1016/j.compag.2024.108812 (2024).
Miftahushudur, T., Sahin, H. M., Grieve, B. & Yin, H. A survey of methods for addressing imbalance data problems in agriculture applications. Remote Sens. 17, 454. https://doi.org/10.3390/rs17030454 (2025).
Noroozi, M. & Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In Computer Vision—ECCV 2016, Proceedings, Part VI, vol. 9910 of Lecture Notes in Computer Science, 69–84 (Springer, 2016). https://doi.org/10.1007/978-3-319-46466-4_5
Jing, L. & Tian, Y. Self-supervised visual feature learning with deep neural networks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 43, 4037–4058. https://doi.org/10.1109/TPAMI.2020.2992393 (2021).
Hayat, M. et al. Cross-attention patch fusion for few-shot colorectal tissue generation. In 2025 5th International Conference on Digital Futures and Transformative Technologies (ICoDT2), 1–6 (2025). https://doi.org/10.1109/ICoDT269104.2025.11360745
Zeng, Z., Mahmood, T., Wang, Y., Rehman, A. & Mujahid, M. A. Ai-driven smart agriculture using hybrid transformer-CNN for real time disease detection in sustainable farming. Sci. Rep. 15, 25408. https://doi.org/10.1038/s41598-025-10537-6 (2025).
Jaiswal, A., Babu, A. R., Zadeh, M. Z., Banerjee, D. & Makedon, F. A survey on contrastive self-supervised learning. Technologies 9, 2. https://doi.org/10.3390/technologies9010002 (2021).
Zhang, J. et al. Rectifying the extremely weakened signals for cassava leaf disease detection. Comput. Electron. Agric. 232, 110107. https://doi.org/10.1016/j.compag.2025.110107 (2025).
Sambasivam, G., Prabu Kanna, G., Chauhan, M. S., Raja, P. & Kumar, Y. A hybrid deep learning model approach for automated detection and classification of cassava leaf diseases. Sci. Rep. 15, 7009. https://doi.org/10.1038/s41598-025-90646-4 (2025).
Dosset, A. et al. Cassava disease detection using a lightweight modified soft attention network. Pest Manag. Sci. 81, 607–617 (2025).
Srivathsan, M. S., Jenish, S. A., Arvindhan, K. & Karthik, R. An explainable hybrid feature aggregation network with residual inception positional encoding attention and efficientnet for cassava leaf disease classification. Sci. Rep. 15, 11750 (2025).
Che, C., Xue, N., Li, Z., Zhao, Y. & Huang, X. Automatic cassava disease recognition using object segmentation and progressive learning. PeerJ Comput. Sci. 11, e2721 (2025).
Zhu, S. & Gao, H. Mc-shufflenetv2: A lightweight model for maize disease recognition. Egypt. Inf. J. 27, 100503. https://doi.org/10.1016/j.eij.2024.100503 (2024).
Praveen, R. et al. AI powered plant identification and plant disease classification system. In 2024 4th International Conference on Sustainable Expert Systems (ICSES), 1610–1616 (IEEE, 2024).
Zhang, J. et al. Maianet: Signal modulation in cassava leaf disease classification. Comput. Electron. Agric. 225, 109351. https://doi.org/10.1016/j.compag.2024.109351 (2024).
Prashanth, J. S. et al. Mpcsar-ahh: A hybrid deep learning model for real-time detection of cassava leaf diseases and fertilizer recommendation. Comput. Electr. Eng. 119, 109628. https://doi.org/10.1016/j.compeleceng.2024.109628 (2024).
Thai, H.-T., Le, K.-H. & Nguyen, N.L.-T. Formerleaf: An efficient vision transformer for cassava leaf disease detection. Comput. Electron. Agric. 204, 107518. https://doi.org/10.1016/j.compag.2022.107518 (2023).
Zhao, R., Zhu, Y. & Li, Y. Cla: A self-supervised contrastive learning method for leaf disease identification with domain adaptation. Comput. Electron. Agric. 211, 107967. https://doi.org/10.1016/j.compag.2023.107967 (2023).
Verma, S., Kumar, P. & Singh, J. P. A meta-learning framework for recommending CNN models for plant disease identification tasks. Comput. Electron. Agric. 207, 107708. https://doi.org/10.1016/j.compag.2023.107708 (2023).
Gourisaria, M. et al. Pneunetv1: A deep neural network for classification of pneumothorax using cxr images. IEEE Access. https://doi.org/10.1109/ACCESS.2023.3289842 (2023).
Pujari, J. D., Yakkundimath, R. & Byadgi, A. S. Image processing based detection of fungal diseases in plants. Proc. Comput. Sci. 46, 1802–1808. https://doi.org/10.1016/j.procs.2015.02.137 (2015).
Barbedo, J. G. A. Digital image processing techniques for detecting, quantifying and classifying plant diseases. Springerplus 2, 660. https://doi.org/10.1186/2193-1801-2-660 (2013).
Phadikar, S., Sil, J. & Das, A. K. Rice diseases classification using feature selection and rule generation techniques. Comput. Electron. Agric. 90, 76–85. https://doi.org/10.1016/j.compag.2012.11.001 (2013).
Al-Hiary, H., Bani-Ahmad, S., Ryalat, M., Braik, M. & Alrahamneh, Z. Fast and accurate detection and classification of plant diseases. Int. J. Comput. Appl. https://doi.org/10.5120/2183-2754 (2011).
Mohanty, S. P., Hughes, D. P. & Salathé, M. Using deep learning for image-based plant disease detection. Front. Plant Sci. https://doi.org/10.3389/fpls.2016.01419 (2016).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
Bhagat, S. et al. Advancing real-time plant disease detection: A lightweight deep learning approach and novel dataset for pigeon pea crop. Smart Agric. Technol. 7, 100408. https://doi.org/10.1016/j.atech.2024.100408 (2024).
Fan, Y., & Zhang, X. Classification of plant leaf diseases based on efficientnet. In 4th International Signal Processing Communications and Engineering Management Conference (ISPCEM), 224–229 (2024). https://doi.org/10.1109/ISPCEM64498.2024.00046
Nasser, A. A. & Akhloufi, M. A. Ctplantnet: A hybrid CNN-transformer architecture for plant disease classification. In 2022 International Conference on Microelectronics (ICM), 156–159 (2022). https://doi.org/10.1109/ICM56065.2022.10005433
Moussaoui, A. & Berrimi, M. Vision transformer based models for plant disease detection and diagnosis, 1–6 (2022). https://doi.org/10.1109/ISIA55826.2022.9993508
Liu, Z. et al. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11976–11986 (2022).
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708 (2017).
Szegedy, C. et al. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9 (2015).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2818–2826 (2016). https://doi.org/10.1109/CVPR.2016.308, arXiv:1512.00567
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition (2014). arXiv preprint arXiv:1409.1556
Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale (2020). arXiv preprint arXiv:2010.11929
Hayat, M. & Aramvith, S. Superpixel-guided graph-attention boundary gan for adaptive feature refinement in scribble-supervised medical image segmentation. IEEE Access 13, 196654–196668. https://doi.org/10.1109/ACCESS.2025.3634156 (2025).
Acknowledgements
This research was supported by Kyungpook National University Research Fund, 2025, and by the Institute of Information & Communications Technology Planning & Evaluation (IITP) Innovative Human Resource Development for Local Intellectualization program grant funded by the Korea government (MSIT) (IITP-2026-RS-2022-00156389).
Author information
Contributions
Syed Shayan Ali Shah and Faisal Saeed wrote the main manuscript text, and Muhammad Umair Raza, Abdul Rehman, Muhammad Shaheryar, II-Min Kim, Sangseok Yun, and Jae-Mo Kang reviewed it and improved the presentation.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Shah, S.S.A., Saeed, F., Raza, M.U. et al. PlantCLR: contrastive self-supervised pretraining for generalizable plant disease detection. Sci Rep 16, 10550 (2026). https://doi.org/10.1038/s41598-026-45684-x







