Introduction

Annual sowthistle (Sonchus oleraceus) and little mallow (Malva parviflora) are widespread weeds known to compete with crops for essential resources including water, nutrients, and light1. These weeds frequently interact with lettuce, one of Monterey County’s highest-yielding crops, causing significant economic losses. In 2022, farmers in Monterey County faced roughly $150 million in losses from crops infected with Impatiens Necrotic Spot Virus (INSV), infections associated with INSV-transmitting weeds2. Annual sowthistle and little mallow, shown in Fig. 1, are major contributors to the spread of INSV and facilitate infections across seasons, adversely impacting crop yields in Monterey County3. Annual sowthistle is a highly adaptable weed species capable of thriving in a wide range of environmental conditions4. It grows year-round and has an INSV infectivity rate exceeding 20%5. Similarly, little mallow, which also grows year-round, is commonly found in production fields, waste areas, and along roadsides. It displays an infectivity rate of 20% and serves as a reservoir for INSV5.

Fig. 1

Pictures of annual sowthistle (left) and little mallow (right). The weeds were grown under controlled conditions for this study.

Traditional weed management strategies have typically relied on methods such as pesticide application, manual herbicide spraying, and physical labor. These approaches are often expensive and can negatively impact the environment6. Recent advancements in agricultural robotics have enabled the development of autonomous systems that can perform tasks such as planting, watering, and harvesting crops. However, these tasks demand less precision and effort than selective weeding7. Traditional image processing technologies and machine learning methods have become increasingly useful for differentiating between crops and weeds8. As robotics continues to evolve within the agricultural sector, the availability of robust datasets containing weed specimens has become critical for accurate weed detection and identification9.

Machine learning (ML), and in particular deep learning (DL), has shown significant promise in automating weed classification through image-based methods, enabling more accurate and timely detection of invasive plants1. ML methods have found widespread applications in multiple domains including biochar10, drug design11, solar cells12, glassware identification of chemistry equipment13, structural motif identification in asphaltenes14, classification of fungi based on microscopic images15, and numerous others. While ML is widely applied across agricultural and scientific domains, prior studies on weed identification have largely focused on common weed species with limited regional specificity16,17,18. Traditional visual identification of plant species can be time-consuming and inaccurate, making it often unsuitable in practice. Despite these advances, current datasets often lack representation of region-specific weed species that act as disease vectors. Sonchus oleraceus and Malva parviflora, two species linked to Impatiens Necrotic Spot Virus outbreaks in California’s Salinas Valley, have not been the subject of dedicated image-based classification efforts. Existing large-scale datasets such as PlantCLEF17 and DeepWeeds18 focus on broader geographic regions and do not include high-resolution images of these specific weeds under simulated field conditions. Our study bridges this gap by developing a localized dataset and testing the performance of three Convolutional Neural Networks (CNNs)19, ResNet-5020,21,22, ResNet-10122,23, and DenseNet-12124,25, under conditions designed to simulate practical deployment26.

Additionally, DL techniques have been used to identify diseases by extracting and classifying features indicative of leaf blotch, powdery mildew, and rust, as well as disease symptoms arising from abiotic environmental stressors. However, these methods have limitations that hinder the detection of subtle or early-stage disease and the processing of certain images27. DL techniques, including CNNs and deep belief networks (DBNs), are growing in popularity for their ability to detect plant diseases through detailed scanning of images, but they come with drawbacks, including the need for extensively labeled data, expensive computation, and poor handling of unfamiliar data. To address these challenges, researchers have increasingly applied transfer learning and ensemble methods, which improve model performance on limited datasets and enhance generalizability27. Data augmentation has also grown in popularity as a means of creating more robust datasets. Support vector machines are advanced techniques used to analyze images and detect specific plant features and diseases; however, their effectiveness is limited by the need for well-labeled, robust datasets and the challenge of incorporating new diseases. DL models such as CNNs and DBNs can learn from new data without manual feature engineering, but similarly require large labeled datasets and expensive computational resources27.

DL extracts hierarchical feature structures directly from images, and these learned features have proven more effective than manually extracted ones. Methods associated with deep learning use spatial and semantic feature differences to accurately identify and detect crops and weeds8. These methods improve the overall accuracy of weed detection and are combined with the use of CNNs or Fully Convolutional Networks (FCNs). CNNs utilize learning-based segmentation and color space transformations, allowing the network to segment images and identify weeds. These networks must learn complex patterns that include lighting variations, overlapping leaves, and occlusions28. CNNs and DBNs have gained prominence in agricultural applications due to their capacity to learn complex patterns from labeled image data, enabling efficient detection of weeds and plant diseases27. However, the effectiveness of these models relies heavily on the availability of curated, annotated datasets representative of real-world variability. Although model training can be resource-intensive, well-trained neural networks have demonstrated high performance across a range of image recognition tasks, with CNNs achieving classification accuracies exceeding 99% under ideal conditions27. In this study, we address the challenge of weed identification using a regionally specific dataset featuring Sonchus oleraceus and Malva parviflora, cultivated under varied greenhouse conditions to emulate the field-level heterogeneity typical of Monterey County. High-resolution RGB images were collected in controlled lighting environments that approximate natural field scenarios. We evaluated the classification performance of three well-established CNN architectures, ResNet-5021,22,29, DenseNet-12124,25, and ResNet-10123, on two datasets: a baseline (non-augmented) image set and an augmented version designed to improve generalization. This design enabled assessment of both model accuracy and the role of data augmentation in enhancing classification robustness.

Methods

Image acquisition and dataset preparation

Images were collected using a Kodak PIXPRO AZ401RD point-and-shoot digital camera (16 MP CCD sensor, 24 mm wide-angle lens, 40× optical zoom, and 3-inch LCD display) under uniform lighting conditions. Each image was cropped, labeled, and standardized using the Thumbnail function in the Wolfram Language, ensuring consistent input dimensions for deep-learning model training. The dataset focused on two weed species: Sonchus oleraceus (annual sowthistle) and Malva parviflora (little mallow), grown in a greenhouse environment designed to simulate field-relevant stress conditions representative of coastal California agriculture. For annual sowthistle, plants were cultivated under six treatment conditions: standard, overwatering, excess fertilizer, low fertilizer, mechanical injury, and drought. Little mallow was grown under four treatment conditions: standard, excess fertilizer, mechanical injury, and drought. The detailed specifications for each treatment condition are summarized in Table 1. To improve image consistency and minimize background interference, all images were manually cropped to center the weed specimen and exclude non-essential visual elements.
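A minimal sketch of this standardization step in the Wolfram Language, assuming the manually cropped images are stored in per-class folders; the folder path and the 224-pixel target width are hypothetical choices for illustration (224 px matches the input resolution of the CNNs used later):

```wolfram
(* Standardize every cropped image in one class folder to a uniform width.
   The path and target size are assumptions for illustration. *)
standardize[file_] := Thumbnail[Import[file], 224];

files = FileNames["*.jpg", "weeds/sowthistle/standard"];  (* hypothetical layout *)
standardized = standardize /@ files;
```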

Table 1 Environmental growing protocols and abbreviations for each weed species including its stressors.

Data augmentation

The full dataset comprised images from ten distinct classes, corresponding to the environmental treatments applied to each of the two weed species. For each of the ten trials, we performed stratified sampling such that 80% of the images from each class were allocated to the training set and the remaining 20% to the test set. This ensured that the distribution of classes was preserved across both training and testing subsets, thereby maintaining representation of all treatment conditions. By generating multiple randomized yet class-balanced partitions, we obtained a statistically reliable estimate of model performance while mitigating the influence of any particular data split. Data augmentation was applied exclusively to the training set in each split to expand variability and simulate real-world image conditions. Augmentation techniques included controlled transformations such as blurring, lighting adjustment, sharpening, and geometric modifications (e.g., small-angle rotations and horizontal reflections), thereby increasing model robustness to common visual perturbations30,31,32. The test set remained unaugmented to ensure an unbiased assessment of generalization performance. This design, also referred to as repeated random sub-sampling validation, is especially well-suited to datasets where stratified sampling and repeated trials offer improved reliability over fixed k-fold partitioning.
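The stratified split can be sketched in the Wolfram Language as follows; this is a minimal illustration assuming the dataset is held as a list of image → class rules (the variable and function names are hypothetical):

```wolfram
(* One stratified 80:20 split: shuffle and split within each class,
   then pool, so class proportions carry over to both subsets. *)
stratifiedSplit[data_, frac_ : 0.8] := Module[{byClass, parts},
  byClass = GroupBy[data, Last];  (* group image -> label rules by label *)
  parts = TakeDrop[RandomSample[#], Ceiling[frac Length[#]]] & /@ Values[byClass];
  {Flatten[parts[[All, 1]]], Flatten[parts[[All, 2]]]}  (* {train, test} *)
];

{train, test} = stratifiedSplit[data];  (* repeated independently for each of the ten trials *)
```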

To assess the contribution of data augmentation to model generalization, all three neural networks were also trained and evaluated on the original, non-augmented dataset using the same stratified sampling procedure. This allowed for a direct comparison between models trained on augmented and non-augmented data, quantifying the impact of augmentation on classification accuracy and robustness. Following image acquisition, the original RGB images were computationally augmented using the ImageEffect33 and related image processing functions available in the Wolfram Language34 (Mathematica), thereby generating a larger and more diverse dataset for training and evaluating machine learning models.

DL techniques often require large amounts of labeled training data to achieve generalizable performance, a challenge mitigated through data augmentation. Augmentation techniques computationally expand and diversify datasets, improving model robustness and generalization. In this study, a range of image transformations was employed to simulate real-world variability and increase dataset complexity (Figs. 2 and 3). The augmentations included blur, noise, lightening, darkening, enhancement, and sharpening33. The blur35 transformation was applied with varying intensities (levels 1–5), introducing controlled smoothing to reduce local pixel variation and simulate focus inconsistencies. The noise effect introduced zero-mean uniform noise with amplitude α, emulating real-world imperfections and sensor artifacts. Lightening36 and darkening37 transformations adjusted image brightness to simulate varying illumination conditions, creating visually lightened and dimmed versions, respectively. These augmentations collectively enriched the dataset with diverse image conditions, thereby enhancing the neural networks’ capacity to generalize across a wider range of field scenarios.
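The photometric transformations above map directly onto built-in Wolfram Language functions (Blur, ImageEffect, Lighter, Darker); the sketch below illustrates one variant of each, with parameter values chosen as assumptions for illustration rather than the exact settings used in this study:

```wolfram
(* Produce four photometric variants of one input image. *)
photometricVariants[img_] := {
  Blur[img, 3],                      (* smoothing; the study used levels 1-5 *)
  ImageEffect[img, {"Noise", 0.1}],  (* zero-mean uniform noise, amplitude 0.1 *)
  Lighter[img, 0.3],                 (* brightened variant *)
  Darker[img, 0.3]                   (* dimmed variant *)
};
```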

Fig. 2

Graphic depicting the image augmentations applied to create a robust, augmented dataset of Annual Sowthistle.

Fig. 3

Schematic representing the augmentations utilized to create a robust, augmented dataset of Little Mallow.

To further diversify the dataset through image detail manipulation, sharpen and enhance transformations were employed33. The sharpen function was used to increase the visual clarity of images by accentuating edges and improving focus in blurred regions. The enhance transformation applied a spatial spread parameter (σ) and a pixel value spread parameter (µ) to intensify fine details and local contrast.

In addition to pixel-level modifications, physical transformations were implemented using rotation and reflection. The rotation function altered image orientation by rotating samples counterclockwise about the center of their bounding box38. Random rotations were applied within the intervals of –10° to –1° and 1° to 10°, thereby introducing slight angular variance while preserving semantic content. The reflect function introduced horizontal symmetry by flipping images along the vertical axis (i.e., left–right reflection)39. For the purposes of this study, only left–right reflections were included as part of the augmentation strategy.
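The detail and geometric transformations can be sketched similarly; the sharpening radius and the random-angle construction below are assumptions for illustration (the σ/µ enhance transformation is omitted, since its exact implementation is not specified in the text):

```wolfram
(* Produce sharpened, rotated, and reflected variants of one input image. *)
detailAndGeometricVariants[img_] := {
  Sharpen[img, 2],  (* accentuate edges *)
  ImageRotate[img,  (* counterclockwise rotation drawn from [-10,-1] or [1,10] degrees *)
    RandomChoice[{-1, 1}] RandomReal[{1, 10}] Degree],
  ImageReflect[img, Left -> Right]  (* left-right reflection only *)
};
```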

The original dataset consisted of 433 non-augmented images of Sonchus oleraceus (annual sowthistle) and 397 images of Malva parviflora (little mallow), totaling 830 standard images. Table 2 shows the number of images present in each class corresponding to a growing condition. Through image augmentation, each original image was expanded by a factor of 28, generating 27 additional variants per image and resulting in a final dataset of 23,240 images. These images were systematically organized into training and testing sets for three neural network architectures: DenseNet-121, ResNet-101, and ResNet-50. Each model was trained and evaluated on two datasets: one comprising only the original cropped RGB images, and the other consisting of the augmented image set.

Table 2 Classes and the number of non-augmented images in each class.

In these experiments, the dataset was randomly partitioned into stratified training and testing sets (80:20 ratio), ensuring class balance across all ten trials. Performance metrics were recorded for each trial, providing an additional perspective on model reliability.

Deep learning architectures

To evaluate the performance of deep learning models in classifying weed species under variable environmental conditions, we selected three widely adopted convolutional neural network (CNN) architectures: ResNet-50, ResNet-101, and DenseNet-121. These models represent complementary design philosophies in deep learning. ResNet-50 and ResNet-101 are part of the residual network family introduced by He et al.20, which utilizes skip connections to mitigate the vanishing gradient problem and enable stable training of deeper networks. ResNet-50 offers a balance between accuracy and computational efficiency, while ResNet-101 increases model depth and representational capacity, often leading to improved classification performance in fine-grained image recognition tasks22. Both networks have demonstrated strong results in agricultural imaging and plant species classification18,40.

ResNet-50 is a convolutional neural network (CNN) renowned for its depth and computational efficiency. It employs a Bottleneck Residual Block, which consists of three convolutional layers: a 1 × 1 layer that reduces dimensionality, a 3 × 3 layer that captures spatial features, and a final 1 × 1 layer that restores the original channel dimension. A key innovation in ResNet-50 is the use of shortcut connections, which add the block’s input directly to its output. This residual learning mechanism facilitates the training of very deep networks by mitigating the vanishing gradient problem and preserving information flow. Through this architecture, ResNet-50 achieves high accuracy in image classification tasks while maintaining manageable computational demands41.
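As a concrete illustration of this design, a single bottleneck residual block with an identity shortcut can be expressed as a small NetGraph in the Wolfram Language; the channel sizes (256 → 64 → 64 → 256 at 56 × 56 resolution) follow the first ResNet-50 stage and are assumptions for illustration:

```wolfram
(* One bottleneck block: 1x1 reduce -> 3x3 spatial -> 1x1 restore,
   with the block input added back through a shortcut connection. *)
bottleneck = NetGraph[
  <|
    "reduce"  -> {ConvolutionLayer[64, {1, 1}], BatchNormalizationLayer[], Ramp},
    "spatial" -> {ConvolutionLayer[64, {3, 3}, "PaddingSize" -> 1], BatchNormalizationLayer[], Ramp},
    "restore" -> {ConvolutionLayer[256, {1, 1}], BatchNormalizationLayer[]},
    "add"     -> ThreadingLayer[Plus],  (* shortcut: input + block output *)
    "out"     -> Ramp
  |>,
  {NetPort["Input"] -> "reduce", "reduce" -> "spatial" -> "restore",
   {"restore", NetPort["Input"]} -> "add", "add" -> "out"},
  "Input" -> {256, 56, 56}]
```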

ResNet-101 is a deep convolutional neural network (CNN) designed for high-accuracy image classification. Central to its design are residual connections, commonly referred to as skip connections, which effectively address the vanishing gradient problem prevalent in deep neural architectures. These connections promote stable gradient propagation, thereby improving both the convergence rate and training robustness. The network comprises a series of residual blocks, each integrating convolutional layers with identity mappings, enabling enhanced feature extraction and improved recognition of complex visual patterns. Owing to this hierarchical and modular structure, ResNet-101 demonstrates strong generalization performance across a broad spectrum of image recognition tasks22.

DenseNet-121 was selected to complement the ResNet architecture with a more compact, feature-reusing design. DenseNet24 introduces direct connections between all layers within a dense block, promoting feature propagation, reducing the number of parameters, and improving gradient flow. This architecture is known to perform well on small to moderate-sized datasets while maintaining high accuracy, making it particularly attractive for agricultural applications where data acquisition may be constrained42. Together, these three models allow for a comprehensive comparison of residual and densely connected networks in the context of weed classification, enabling an informed assessment of architectural efficiency and generalization performance.
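The dense-connection pattern can be illustrated with one layer of a dense block, where the layer’s output is concatenated onto its input so that subsequent layers see all earlier feature maps; the growth rate of 32 and the input shape are assumptions for illustration:

```wolfram
(* One DenseNet-style layer: BN -> ReLU -> 3x3 conv, then channel-wise
   concatenation of input and output (64 + 32 = 96 output channels). *)
denseLayer = NetGraph[
  <|
    "bnReluConv" -> {BatchNormalizationLayer[], Ramp,
      ConvolutionLayer[32, {3, 3}, "PaddingSize" -> 1]},
    "cat" -> CatenateLayer[]
  |>,
  {NetPort["Input"] -> "bnReluConv",
   {NetPort["Input"], "bnReluConv"} -> "cat"},
  "Input" -> {64, 56, 56}]
```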

Model training configuration

To evaluate model performance and ensure robust generalization across weed phenotypes and stress conditions, we implemented a stratified Monte Carlo cross-validation strategy using ten independent random splits. In each trial, 80% of images from each class were assigned to the training set, and 20% were held out for testing, preserving class distribution across splits.

This method was chosen over conventional k-fold cross-validation due to the limited dataset size and the presence of uneven class counts (see Table 2). Fixed k-folds can result in biased or unbalanced folds, particularly when class distributions are small or skewed. By contrast, stratified Monte Carlo sampling generates multiple random, class-balanced partitions, which reduces variance in performance estimates and mitigates the effects of atypical splits. This approach, also referred to as repeated random sub-sampling validation, has been shown to produce more stable results in small datasets with class imbalance or heterogeneity in image content43. We adopted ten repetitions to strike a balance between computational efficiency and statistical robustness, allowing us to evaluate model performance across a range of class distributions and environmental conditions.

Each convolutional neural network (ResNet-50, ResNet-101, and DenseNet-121) was fine-tuned using the Wolfram Language’s NetTrain function. We replaced the final two layers of each pre-trained ImageNet model with a task-specific classification head consisting of a LinearLayer and SoftmaxLayer to distinguish among the ten weed treatment classes. The training configuration included explicit control over several parameters: all convolutional layers were frozen (LearningRateMultipliers → {“linearNew” → 1, _ → 0}), a 10% validation split was reserved from the training set (ValidationSet → Scaled[0.1]), GPU acceleration was used (TargetDevice → “GPU”), and early stopping was applied based on validation loss (TrainingStoppingCriterion → “Loss”). Other hyperparameters, including batch size, base learning rate, and optimizer choice, were not manually tuned and were instead handled through the Wolfram Language’s internal AutoML-style pipeline, which adjusts such parameters based on dataset size, class balance, and convergence behavior44,45,46. This approach is consistent with prior machine learning workflows in the Wolfram Language47. Across the ten stratified Monte Carlo splits, training generally converged between 14 and 18 rounds, depending on the class composition of each fold. Because all architectures were trained under the same configuration protocol, observed differences in performance reflect architectural and dataset effects rather than hyperparameter optimization33,34,35,36,37,38,39.
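A sketch of this fine-tuning setup for one architecture is shown below. The NetTrain options mirror those stated above; the NetModel resource name, the NetDrop-based head replacement, and the class-label strings are assumptions about the exact mechanics, and trainSet is assumed to be a list of image → class rules from a stratified split:

```wolfram
(* Fine-tune a pre-trained ResNet-50: replace the final two layers with a
   new LinearLayer + SoftmaxLayer head and train only the new head. *)
base = NetModel["ResNet-50 Trained on ImageNet Competition Data"];
classes = {"SS", "OS", "HFS", "LFS", "IS", "DS", "SM", "HFM", "IM", "DM"};  (* hypothetical labels *)

net = NetChain[<|
    "body" -> NetDrop[base, -2],  (* drop the original classification head *)
    "linearNew" -> LinearLayer[Length[classes]],
    "softmax" -> SoftmaxLayer[]|>,
  "Output" -> NetDecoder[{"Class", classes}]];

trained = NetTrain[net, trainSet,
  LearningRateMultipliers -> {"linearNew" -> 1, _ -> 0},  (* freeze convolutional layers *)
  ValidationSet -> Scaled[0.1],
  TargetDevice -> "GPU",
  TrainingStoppingCriterion -> "Loss"];
```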

Results and discussion

Compared to existing weed detection efforts such as DeepWeeds18, which focused on broad vegetation classes in Australian rangelands, our study provides a region-specific, high-resolution image dataset targeting two morphologically similar weeds (Sonchus oleraceus and Malva parviflora) with known roles in viral disease transmission. Global repositories like PlantCLEF and LifeCLEF include diverse plant imagery but lack focused datasets under simulated agronomic conditions relevant to California’s Central Coast. The observed improvements from data augmentation (up to +17.8% in accuracy and +0.198 in Cohen’s Kappa for DenseNet-121) align with similar findings in the weed detection literature, where geometric and photometric variation helps generalize models to real-world field scenarios. Among the tested CNNs, ResNet-101 offered the best overall accuracy-stability tradeoff, while DenseNet-121 provided the highest F1 and AUC scores in some stress classes.

Limitations of this study include the use of controlled greenhouse imagery, which, although varied in treatment conditions, does not fully capture the complexity of open-field environments such as occlusions, shadows, or soil/crop backgrounds. Additionally, the dataset was limited to two weed species, albeit two highly relevant to INSV. Although data augmentation substantially improved model performance, this improvement may result from both increased feature diversity and the enlarged dataset size. The augmentation process expanded the dataset approximately twenty-eight-fold, which could reduce overfitting but may also mask the effect of true feature enrichment. Because augmentation increases both variability and total sample count, disentangling their relative contributions would require a larger base dataset or controlled augmentation ablation experiments. However, expanding the dataset further is constrained by the seasonal and logistical limitations inherent to agricultural research. Future work will explore incremental dataset growth and hybrid augmentation-synthesis methods to isolate these effects.

The use of pretrained models limits architectural customization, and the fixed augmentation strategy, while diverse, may not fully simulate all real-world perturbations.

Future directions include: (1) expanding the dataset to include additional weed species prevalent in California lettuce fields, (2) testing object detection frameworks (e.g., YOLOv5, EfficientDet) for in-situ plant localization, (3) exploring transformer-based or EfficientNet-based classifiers for improved performance–efficiency balance, and (4) deploying trained models on robotic or drone platforms for in-field real-time weed identification. The next few subsections detail the results of commonly used metrics in machine learning to quantify model performance.

Each neural network was trained to classify ten image classes, representing different environmental stress conditions applied to Sonchus oleraceus and Malva parviflora. The performance of each model was evaluated using standard classification metrics, as described in the following subsections. The original dataset consisted of 433 images of annual sowthistle and 397 images of little mallow. After applying augmentation strategies designed to simulate field variability, the dataset expanded to 23,240 images. To assess the impact of augmentation, all three neural networks—ResNet-50, ResNet-101, and DenseNet-121—were trained separately on both the original and augmented datasets. For each configuration, model evaluation was performed using ten independent stratified train-test splits to ensure a robust comparison across all metrics.
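Under the same assumptions as the training sketch above (a trained network carrying a "Class" decoder and a held-out test list of image → class rules), the metrics reported in the following subsections are standard ClassifierMeasurements properties and can be gathered per split as follows:

```wolfram
(* Evaluate one trained model on one held-out split. *)
cm = ClassifierMeasurements[trained, test];
cm /@ {"Accuracy", "MeanCrossEntropy", "F1Score", "AreaUnderROCCurve", "CohenKappa"}
```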

Accuracy

Accuracy is a standard performance metric that quantifies the proportion of correctly classified instances relative to the total number of predictions. While often used as a general indicator of classifier performance, it does not account for class imbalance or the distribution of errors. Classification accuracy across 10 independent data splits is shown in Fig. 4. All three convolutional neural networks (DenseNet-121, ResNet-50, and ResNet-101) exhibited substantial performance gains when trained on augmented datasets. The largest relative improvement was observed for DenseNet-121, where augmentation increased mean accuracy from 0.64 to 0.81 (a 26.6% relative gain). ResNet-101 and ResNet-50 showed smaller but consistent improvements of 18.5% and 17.2%, respectively. Among the augmented models, ResNet-101 achieved the highest median accuracy with minimal variability, while DenseNet-121 without augmentation had both the lowest accuracy and the highest variance.

Fig. 4

Classification accuracy across 10 independent data splits for three convolutional neural networks trained on augmented and unaugmented weed image datasets. Data augmentation significantly improves accuracy across all architectures. ResNet-101 with augmentation achieved the highest median accuracy, while DenseNet-121 showed the greatest sensitivity to augmentation.

Mean cross entropy

Mean cross entropy values across 10 independent splits are presented in Fig. 5. Models trained on augmented datasets consistently exhibited lower entropy, indicating more confident and better-calibrated predictions. DenseNet-121 trained without augmentation showed the highest average entropy (~1.17), suggesting frequent low-confidence predictions or misclassifications. In contrast, ResNet-101 and ResNet-50 with augmentation achieved the lowest mean entropy values (~0.48 and ~0.52, respectively), consistent with their superior classification accuracy. These results reinforce the role of data augmentation not only in improving accuracy but also in enhancing the reliability of class probability estimates.
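For reference, the metric reported here is the standard mean cross entropy over the test set,

$$H = -\frac{1}{N}\sum_{i=1}^{N} \ln p_i(y_i),$$

where $N$ is the number of test images and $p_i(y_i)$ is the model’s predicted probability for the true class $y_i$ of image $i$; confident, correct predictions drive $H$ toward zero.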

Fig. 5

Mean cross entropy across 10 independent data splits for each neural network trained on augmented and unaugmented weed image datasets. Models trained on augmented data consistently exhibited lower entropy, reflecting more confident and well-calibrated predictions. DenseNet-121 showed the highest entropy in the unaugmented regime, suggesting overfitting and unreliable output probabilities.

F1 score

The F1 Score is a standard metric used to assess the balance between precision and recall, particularly in binary classification contexts. It is defined as the harmonic mean of precision and recall, with values approaching 1 indicating high classification performance. In this study, F1 Scores were computed for both non-augmented and augmented image datasets across all neural network models evaluated. Specifically, Fig. 6 presents the F1 Scores for the augmented dataset across ten image classes for DenseNet-121 (a), ResNet-101 (b), and ResNet-50 (c). F1 scores were computed across ten independent train-test splits using the augmented image dataset to evaluate the classification performance of each convolutional neural network. All three models (ResNet-50, ResNet-101, and DenseNet-121) demonstrated strong F1 scores, indicating effective identification of Sonchus oleraceus and Malva parviflora despite their morphological similarity. DenseNet-121 achieved the highest median F1 score with the least variability across splits, reflecting superior generalization and classification consistency. ResNet-101 also maintained high F1 scores but showed slightly greater variance. ResNet-50, while performing well overall, exhibited comparatively lower F1 scores in several splits, suggesting reduced robustness under certain training conditions. These results highlight the benefit of deeper or more densely connected architectures and emphasize the critical role of image augmentation in enhancing model performance and reliability.
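Written out, the harmonic-mean definition used here is

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$

so a high F1 requires both few false positives (high precision) and few false negatives (high recall).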

Fig. 6

F1 scores for the augmented 10 image classes across three neural networks: (a) DenseNet-121, (b) ResNet-101, and (c) ResNet-50. All models show strong class-level performance, particularly for high-fertility and drought conditions. ResNet-101 achieves the most consistent F1 scores across classes, while DenseNet-121 exhibits slightly more variability, especially for the overwatered (OS) and injured (IS) sowthistle classes. These results highlight the class-specific predictive strengths of each architecture when trained with data augmentation.

Area under ROC curve

The discrimination capabilities of the models were further quantified by computing the Area Under the Receiver Operating Characteristic Curve (AUC) for each class under all treatment conditions. AUC is a threshold-independent metric that quantifies a classifier’s ability to distinguish between classes; values closer to 1.0 indicate better separability. This metric is particularly useful when evaluating performance on imbalanced or noisy datasets, as it reflects the trade-off between true positive and false positive rates across varying thresholds. Figure 7 highlights the spread of 10 AUC values for each neural network on the augmented weed image dataset.

ResNet-50 achieved high AUC scores across all classes, with medians generally exceeding 0.98. Despite this strong average performance, certain conditions, particularly injured sowthistle (IS) and standard sowthistle (SS), exhibited wider interquartile ranges and lower whisker values approaching 0.92, indicating greater variability in classification confidence across splits. This suggests that the ResNet-50 model is somewhat sensitive to visual perturbations introduced by injury or environmental noise. ResNet-101 showed improved stability across splits, with AUC medians consistently near or above 0.99 for most classes. The model demonstrated reduced variance, particularly in standard and nutrient-stressed treatments. Although slight dispersion was observed in injured samples (IM and IS), AUC values remained high overall, confirming the model’s robust performance across a wider range of morphological conditions. DenseNet-121 exhibited the most consistent and highest AUC values among the three models. Nearly all treatment conditions yielded tightly clustered AUC distributions with medians at or above 0.99. The model maintained strong separability even in the most variable classes, including injured sowthistle and drought-stressed mallow. These results confirm DenseNet-121’s superior generalization capacity and stability when applied to augmented datasets featuring real-world variability.

Fig. 7

Area under the ROC curve (AUC) for each class across 10 independent training/testing splits using (left) DenseNet-121, (middle) ResNet-101, and (right) ResNet-50, all trained on augmented weed image datasets. All three models achieved excellent AUC values across most classes, with median scores consistently above 0.96. ResNet-101 displayed the most stable and uniformly high discrimination performance, while DenseNet-121 exhibited slightly more variation for the injured (IS) and standard mallow (SM) classes. These results align with the F1 score trends and reinforce the effectiveness of data augmentation in enhancing model generalization and separability.

Cohen’s kappa

Cohen’s Kappa was computed to evaluate the agreement between predicted and true class labels while adjusting for chance. Unlike raw accuracy, which may be inflated in imbalanced datasets, Cohen’s Kappa provides a more robust assessment of model reliability by quantifying the extent of classifier agreement beyond random chance. Values range from -1 (complete disagreement) to 1 (perfect agreement), with values above 0.80 typically interpreted as strong agreement.
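Formally,

$$\kappa = \frac{p_o - p_e}{1 - p_e},$$

where $p_o$ is the observed agreement between predicted and true labels and $p_e$ is the agreement expected by chance given the marginal class frequencies.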

As shown in Fig. 8, all three CNNs achieved Kappa values consistent with substantial to near-perfect agreement. ResNet-101 attained the highest median Kappa value, approaching 0.87, with minimal interquartile variability, indicating both high predictive accuracy and consistency across splits. ResNet-50 followed closely with a slightly lower median but comparably tight variance, reflecting stable performance. DenseNet-121, while performing strongly in terms of AUC and F1 scores, exhibited greater variability in Kappa values, with a median near 0.79 and a broader interquartile range. This discrepancy may reflect increased sensitivity to class-specific imbalances or outlier effects in certain splits. Overall, Cohen’s Kappa results support the conclusion that ResNet-101 provides the most consistently reliable classification, while DenseNet-121 offers high discriminative power with some trade-off in agreement robustness.

Fig. 8

Cohen’s Kappa values across 10 independent training/testing splits for each model trained on augmented data. All models demonstrate substantial inter-label agreement (Kappa > 0.75), with ResNet-101 achieving the most consistent performance. These values support the reliability of model predictions beyond chance agreement.

While this study employed widely used convolutional backbones (ResNet-50, ResNet-101, DenseNet-121), future work will explore more recent architectures that offer improvements in both computational efficiency and accuracy. EfficientNet48 has demonstrated state-of-the-art performance with fewer parameters through compound scaling of depth, width, and resolution. Similarly, attention-based architectures such as the Vision Transformer49 (ViT) enable global feature interactions that may improve performance on morphologically similar weed species. For tasks involving spatial localization or mapping, segmentation networks like U-Net50 or DeepLabV3+51 can provide pixel-level identification of weeds, which is critical for real-time precision spraying or autonomous navigation.

Conclusion

This study employed a total of 23,240 images of Sonchus oleraceus (annual sowthistle) and Malva parviflora (little mallow), generated through extensive augmentation of an initial dataset comprising 830 images. Augmentation techniques were designed to emulate real-world field conditions, thereby enriching the dataset’s variability and representativeness. This enhanced dataset served as a robust foundation for evaluating the efficacy of deep learning in weed identification. Three convolutional neural networks (DenseNet-121, ResNet-101, and ResNet-50) were trained and tested using the Wolfram Language, with performance assessed across both non-augmented and augmented datasets. Results indicated superior model accuracy and reliability on the augmented dataset, corroborating established principles that link dataset size and diversity to improved generalization in machine learning.

This study demonstrates the effectiveness of deep learning models in the classification of regionally significant weed species, Sonchus oleraceus and Malva parviflora, under simulated field conditions. Through the creation of a highly augmented and context-specific image dataset, we captured morphological variability reflective of real-world agricultural environments in coastal California. Comparative evaluation of three convolutional neural networks (DenseNet-121, ResNet-101, and ResNet-50) revealed that all models benefitted from dataset augmentation, with marked improvements in accuracy, F1 score, and AUC values. DenseNet-121 achieved the highest classification performance overall, particularly in terms of discriminative ability across complex treatment conditions. ResNet-101, while slightly lower in discriminative metrics, exhibited the highest Cohen’s Kappa, indicating strong agreement with ground truth across splits and high prediction consistency.

These results support the integration of CNN-based image classification into automated weed identification pipelines for precision agriculture. Notably, this study presents the first curated image dataset focused on weed species endemic to the Monterey County region, where Sonchus oleraceus and Malva parviflora play a significant role in crop loss and disease transmission. The dataset and methods developed here offer a foundation for future efforts by researchers, agronomists, and agricultural technology developers seeking to build localized detection tools or integrate weed classification into broader disease risk management systems. In future work, we will expand the dataset to include additional species, test object detection frameworks for plant localization, and transition toward in-field validation using mobile or aerial robotic platforms. We also plan to explore more recent deep learning backbones and segmentation-based architectures to improve accuracy and efficiency in more complex deployment environments.