Introduction

Edge artificial intelligence (AI) has rapidly gained importance as computer vision systems are increasingly deployed in cost-sensitive and resource-constrained environments. Among various tasks, object detection is a central component in applications such as surveillance, robotics, automotive systems, and consumer electronics1,2,3,4. While cloud-based solutions can leverage high-end accelerators, practical deployment often requires running models directly on embedded devices, where low latency, limited memory bandwidth, and tight energy budgets are dominant constraints.

To address such requirements, system-on-chip (SoC) platforms that integrate CPUs, NPUs, and dedicated accelerators have become increasingly common5,6. However, most benchmarking efforts have targeted high-performance edge devices–such as server-grade NVIDIA GPUs, Google TPUs, or NVIDIA Jetson platforms–rather than the low-compute SoC accelerators that dominate embedded vision applications7,8,9,10. These lower-power SoCs, though ubiquitous in industrial cameras, IoT endpoints, and consumer devices, offer substantially constrained compute capacity and memory bandwidth. Their performance under realistic operating conditions, including resource contention and bandwidth saturation, remains insufficiently characterized.

In parallel, object detection models have advanced rapidly. The YOLO family, from early versions11,12,13,14 to more recent YOLOv5, YOLOv8, and YOLO1115,16,17,18,19, has become one of the most widely adopted frameworks for real-time detection, offering scalable variants that balance accuracy and efficiency. Recent contributions have also emphasized lightweight adaptations of YOLO for constrained environments, including LAI-YOLOv5s for UAV applications20, SOD-YOLO for small-object detection with reduced memory overhead21, MSGD-YOLO for low-cost edge intelligence22, FRYOLO for IoT embedded devices23, and PCPE-YOLO with dynamically reconfigurable backbones24. These studies illustrate the breadth of model-level innovations aimed at efficiency, yet how these models behave when deployed on entry-level SoCs–across different architectures, model sizes, and input resolutions–remains insufficiently studied. In particular, the interaction between algorithmic complexity and hardware constraints poses practical challenges for selecting the right combination of models and configurations.

To fill this gap, we present a comprehensive and reproducible benchmarking study of YOLO models on three representative Rockchip SoCs (RV1106, RK3568, RK3588). We evaluate Nano, Small, and Medium scales across multiple input resolutions, analyze compute scalability across SoCs, and examine the effects of multi-core scheduling and system-level contention. Our deployment pipeline (ONNX export, quantization, compilation, and execution) is described in detail to ensure reproducibility25. This article extends our earlier conference work26 by broadening the scope of evaluated SoCs and YOLO model families, incorporating detailed measurements of power consumption and energy per inference, and conducting deeper analysis of system-level constraints such as memory-bandwidth limitations, multi-core NPU scheduling, and resource contention.

Throughout this work, we use the term latency (or equivalently, inference time) to denote the end-to-end per-frame processing time measured at the NPU runtime. We define real-time following common practice in embedded vision: 25–30 FPS for surveillance and consumer video pipelines, and 50–60 FPS for high-speed perception tasks such as UAV obstacle avoidance. These thresholds are widely referenced in embedded robotics and vision literature.
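As a quick reference, the frame-rate thresholds above map to per-frame latency budgets as follows; this minimal helper simply inverts the frame rate, and the printed values match the thresholds used throughout this paper:

```python
def latency_budget_ms(fps: float) -> float:
    """Per-frame latency budget (ms) implied by a target frame rate."""
    return 1000.0 / fps

# The four real-time thresholds considered in this work.
for fps in (25, 30, 50, 60):
    print(f"{fps} FPS -> {latency_budget_ms(fps):.1f} ms/frame")
```

A model meets a given real-time target on a given SoC exactly when its mean per-frame latency stays below the corresponding budget.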

Our quantitative analysis reveals several key observations: (i) inference latency correlates more strongly with model accuracy (mAP) than with FLOPs or parameter count; (ii) resolution scaling deviates from quadratic theoretical predictions due to memory-bandwidth limitations; (iii) performance scaling with theoretical TOPS is consistently sublinear (typically 1.3–2.0\(\times\) observed speedup per 2\(\times\) increase in TOPS); (iv) multi-core NPU scheduling yields only marginal gains (generally \(<10\)%); (v) memory-bandwidth contention induces significant slowdowns on RV1106 and RK3568 (50–270% degradation), while RK3588 remains largely unaffected due to its higher sustained bandwidth; and (vi) although power draw is similar across SoCs, energy per inference varies substantially with both model scale and hardware capability, with higher-performance SoCs consistently achieving significantly higher energy efficiency. These results demonstrate the importance of hardware-aware deployment strategies and motivate further work on bandwidth- and energy-efficient model design for embedded AI.

Related work

Since the introduction of YOLOv111, the framework has undergone continuous improvements, making it one of the most widely adopted object detection architectures. The YOLO9000 work12 introduced YOLOv2, enhancing the original design with improvements drawn from prior work, and addressed the prediction of unlabeled object categories by jointly training on COCO and ImageNet, enabling real-time detection of over 9000 classes. YOLOv313 enhanced detection robustness through multi-scale feature fusion, while YOLOv414 introduced new training strategies to improve accuracy and efficiency. YOLOv515 further optimized the architecture for scalability, and subsequent versions such as YOLOv716 and YOLOv8 improved feature extraction and computational efficiency. YOLOv917 and YOLOv1018 further enhanced the framework by introducing advanced backbone architectures, improved feature aggregation strategies, and more efficient training techniques, leading to better accuracy-speed trade-offs and improved adaptability across diverse deployment scenarios. The latest YOLO1119 continues this trend, pushing the boundaries of detection accuracy and speed. Beyond object detection, the YOLO framework has been extended to image classification, instance segmentation, pose estimation, and oriented object detection (OBB), further broadening its applicability.

Beyond architectural advances in YOLO, a substantial body of research investigates how object detection models behave on different hardware platforms. Existing studies, however, primarily target high-compute accelerators or focus on designing new lightweight models rather than systematically benchmarking standard YOLO models on low-compute SoCs.

EdgeYOLO9 proposes a customized, anchor-free detector tailored for edge devices, achieving real-time performance on the NVIDIA Jetson AGX Xavier. While this work demonstrates the feasibility of designing lightweight architectures for edge inference, its focus lies in model innovation rather than in characterizing the system-level behavior of existing YOLO models across heterogeneous low-power hardware. Similarly, Zhu et al.8 evaluate YOLOv3 and PP-YOLO on Jetson Nano and Xavier NX, providing latency measurements and deployment recommendations for specific NVIDIA platforms. These studies offer valuable insights but are limited to GPU-based embedded devices and do not analyze the broader design space involving multiple SoCs, varying input resolutions, NPU architectures, or system resource contention.

At the datacenter scale, Jouppi et al.7 distill lessons learned from three generations of Google TPUv4i deployment, emphasizing compiler design, quantization strategies, workload evolution, and cost-effective large-scale inference. Although highly influential, this line of work focuses on industrial-grade AI accelerators rather than resource-constrained SoCs, and therefore does not address the deployment bottlenecks unique to low-bandwidth embedded NPUs.

More comprehensive benchmarking efforts such as AIBench27 cover a wide range of AI tasks–including object detection–and provide benchmark suites for datacenter, HPC, IoT, and edge computing environments. However, AIBench primarily evaluates algorithmic workloads and system scalability rather than conducting fine-grained, model-level performance analyses on commercial low-compute SoCs. In particular, it does not examine how modern YOLO variants scale with input resolution, how performance is affected by memory bandwidth limitations, or how multi-core NPU scheduling and system-level contention impact real-time deployment.

In summary, prior benchmarking studies either evaluate a limited set of models on mid-to-high-end accelerators or focus on new lightweight detector designs. To the best of our knowledge, no prior work provides a systematic, reproducible, and cross-platform benchmark of contemporary YOLO models on low-compute SoCs. Furthermore, key deployment factors–such as bandwidth-driven latency scaling, sublinear TOPS-to-latency relationships, multi-core NPU scheduling overheads, energy-per-inference behavior, and real-world CPU/memory contention–remain largely unexplored. Our work fills this gap by offering a comprehensive evaluation and practical deployment guidelines tailored for widely used Rockchip SoCs, integrating latency, bandwidth sensitivity, multi-core behavior, and power–energy measurements into a unified benchmarking study.

Methods

Hardware platforms

In order to assess the practicality of running YOLO detectors on lightweight edge processors, we investigated three Rockchip SoCs–RV1106, RK3568, and RK3588–covering a broad range of compute power and memory resources. These platforms are widely adopted in embedded AI applications, including smart surveillance cameras, industrial monitoring, and IoT systems. Their detailed specifications, including CPU type, NPU capacity, memory size, and measured memory bandwidth, are listed in Table 1. Notably, the bandwidth results were obtained using the mbw benchmark28, reflecting practical throughput rather than theoretical peak specifications.

Table 1 Summary of processor architectures and memory configurations for the selected Rockchip SoCs, reflecting their compute and bandwidth diversity.

In addition to raw specifications, all three SoCs incorporate Rockchip’s third-generation Neural Processing Unit (NPU) architecture. The design integrates on-chip buffering, weight decompression, and zero-skipping to improve computational efficiency. As illustrated in Fig. 1, the NPU consists of the Convolution Neural Accelerator (CNA), Data Processing Unit (DPU), and Pooling Processing Unit (PPU), supported by a shared MAC array. These components form the core execution pipeline for convolutional and activation operations, while large intermediate feature maps are stored in off-chip DRAM. Similar to other embedded accelerators, practical performance on Rockchip NPUs is shaped not only by compute capacity but also by memory throughput and scheduling efficiency5,6.

Fig. 1

The architecture of Rockchip’s third-generation NPU, which is adopted in RV1106, RK3568, and RK3588 SoCs.

YOLO models

We selected three families of YOLO models–YOLOv5, YOLOv8, and YOLO11–that represent widely adopted and recently developed object detection frameworks. For each series, we examined three configurations (Nano, Small, and Medium). Their architectural characteristics, including depth, width, number of parameters, GFLOPs, and mean Average Precision (mAP) on the COCO dataset, are listed in Table 2.

To aid interpretation, we briefly clarify the architectural terms used throughout this paper. C3 refers to the CSP-like bottleneck adopted in YOLOv5, composed of multiple convolutions and residual branches. C2f, introduced in YOLOv8, fuses intermediate features to enhance parallelism and reduce redundant operations, but requires more sophisticated backend optimizations. C3k2, integrated in YOLO11, modifies C3 with stacked kernels (\(k=2\)), which improves representational capacity but may increase activation memory usage. In addition, the Nano, Small, and Medium variants are obtained by scaling depth and width multipliers, directly affecting parameter count, activation size, and GFLOPs. These scaling factors play an important role in determining how efficiently a model can be executed on resource-limited NPUs.
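To make the Nano/Small/Medium scaling concrete, the sketch below applies the depth and width multipliers published in the YOLOv5 configuration files (0.33/0.25 for Nano, 0.33/0.50 for Small, 0.67/0.75 for Medium). The rounding conventions mirror the `make_divisible` helper in the YOLOv5 codebase; the base values (256 channels, 9 block repeats) are illustrative only, not taken from any specific layer:

```python
import math

def scale_channels(c: int, width_mult: float, divisor: int = 8) -> int:
    # Round scaled channel counts up to the nearest multiple of `divisor`,
    # mirroring the make_divisible convention used by YOLOv5.
    return math.ceil(c * width_mult / divisor) * divisor

def scale_depth(n: int, depth_mult: float) -> int:
    # Scale the number of repeated blocks, keeping at least one.
    return max(round(n * depth_mult), 1)

# Illustration with YOLOv5's published (depth, width) multipliers.
for name, d, w in [("Nano", 0.33, 0.25), ("Small", 0.33, 0.50), ("Medium", 0.67, 0.75)]:
    print(name, scale_channels(256, w), scale_depth(9, d))
```

The width multiplier shrinks channel counts (and thus parameters and activation size), while the depth multiplier shrinks the number of stacked blocks; together they account for the parameter and GFLOP differences listed in Table 2.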

Table 2 Architectural specifications of representative YOLO variants tested with 640\(\times\)640 input images. Parameter counts are reported in millions (M), and mAP denotes the COCO validation score averaged over IoU thresholds 0.50–0.95.

This selection covers a broad spectrum from lightweight, latency-oriented models to more accurate, compute-intensive variants. It allows us to systematically explore the trade-offs between accuracy, latency, and hardware constraints across different SoCs.

Deployment workflow

The deployment of YOLO models on Rockchip SoCs follows a multi-stage process to ensure compatibility with low-compute NPUs. The workflow is illustrated in Fig. 2, which outlines model conversion, quantization, compilation, and execution.

  • Model Conversion: Trained YOLO models were exported into the ONNX format to ensure interoperability across platforms.

  • Quantization: Models were quantized to INT8 using either Post-Training Quantization (PTQ) or Quantization-Aware Training (QAT)25. PTQ was used for most experiments, while QAT was applied to preserve accuracy for selected Medium models.

  • Compilation: Quantized models were compiled into Rockchip-specific offline binaries using the RKNN Toolkit, translating high-level neural operations into low-level NPU instructions.

  • Execution: Compiled models were deployed within the Rockchip runtime environment, with inference conducted directly on the NPUs of RV1106, RK3568, and RK3588.

This workflow mirrors typical industrial deployment pipelines on embedded hardware, and ensures reproducibility by aligning with Rockchip’s SDK conventions. Similar workflows are reported for Jetson and Edge TPU deployments8.

All models were quantized using PTQ with the RKNN Toolkit (version 2.3.0). A fixed subset of 64 images randomly sampled from the COCO 2017 validation set was used for calibration, and the same subset was applied across all YOLO variants and all SoCs to ensure consistent quantization behavior. All deployments used the original FP32 PyTorch weights; no model was retrained, fine-tuned, pruned, or structurally modified beyond quantization. We adopted the default RKNN PTQ configuration, which automatically selects symmetric or asymmetric quantization and applies channel-wise activation quantization as appropriate. The compiler’s standard graph fusion and optimization passes were enabled, and no additional vendor-specific optimizations were introduced. These unified settings ensure reproducibility and provide a fair comparison across hardware platforms.
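The fixed calibration subset described above can be reproduced with a short script. The function below is a sketch under the assumption that the RKNN Toolkit accepts a plain-text list with one image path per line as its calibration dataset; the directory argument is a placeholder for wherever the COCO 2017 validation images reside:

```python
import random
from pathlib import Path

def write_calibration_list(img_dir: str, out_file: str = "dataset.txt",
                           n_images: int = 64, seed: int = 0) -> list:
    """Sample a fixed calibration subset and write one image path per line.

    Sorting before sampling and fixing the RNG seed guarantees that the
    same 64 images are selected on every run, for every model and SoC.
    """
    images = sorted(str(p) for p in Path(img_dir).glob("*.jpg"))
    rng = random.Random(seed)        # fixed seed -> identical subset every run
    subset = rng.sample(images, n_images)
    Path(out_file).write_text("\n".join(subset))
    return subset
```

Reusing the resulting `dataset.txt` across all nine YOLO variants and all three SoCs keeps the quantization calibration identical, which is the property the comparison relies on.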

Fig. 2

Schematic diagram of inference workflow on resource-constrained AI processors.

Experimental setup and evaluation metrics

We evaluated nine YOLO variants (v5/v8/v11 in n/s/m scales) at input resolutions \(\{480,640,800,960,1120,1280\}^2\). Each model–SoC pair was tested over 100 inference runs, discarding the first 10 to eliminate initialization effects. Across all models, resolutions, and SoCs, the run-to-run latency variance remained extremely low. The standard deviation was consistently below 1% of the mean after warm-up. This stability is expected because the Rockchip NPUs operate with fixed-frequency, deterministic execution pipelines. Therefore, the mean latency is a reliable and representative metric for comparison.
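The latency aggregation protocol above (discard the first 10 runs, then report mean and standard deviation) can be sketched as:

```python
from statistics import mean, stdev

def summarize_latency(samples_ms: list, warmup: int = 10) -> dict:
    """Discard warm-up iterations, then report mean, std, and relative std."""
    steady = samples_ms[warmup:]
    mu, sigma = mean(steady), stdev(steady)
    return {"mean_ms": mu, "std_ms": sigma, "rel_std": sigma / mu}
```

A run is treated as stable when `rel_std` is below 0.01, i.e., the under-1% criterion stated above; under that condition the mean is a faithful single-number summary.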

Real-time feasibility was analyzed against thresholds of 25 FPS (40 ms), 30 FPS (33.3 ms), 50 FPS (20 ms), and 60 FPS (16.7 ms). System-level robustness was examined by introducing controlled CPU and memory loads using stress-ng29.

Our evaluation emphasizes not only model complexity (parameters, FLOPs, mAP) but also SoC scalability, multi-core scheduling behavior, and sensitivity to resource contention. This comprehensive approach expands upon prior embedded benchmarking studies10,27, providing detailed insights into low-compute SoC deployment.

Results

Inference latency and model complexity

To examine the relationship between model complexity and inference performance, we evaluated nine YOLO variants (YOLOv5n/s/m, YOLOv8n/s/m, YOLO11n/s/m) on the three Rockchip SoCs at a fixed input resolution of 640\(\times\)640. Latency was measured as the mean inference time over 100 runs, excluding warm-up iterations.

As shown in Fig. 3, latency varied significantly across SoCs. On the entry-level RV1106, Nano-scale models required 68–97 ms, while Medium-scale models exceeded 250 ms, making real-time inference unattainable. The mid-tier RK3568 achieved faster execution, reducing Nano models to 34–79 ms and Medium models to 158–312 ms. The high-end RK3588, operating in single-core NPU mode, achieved the best performance: Nano variants consistently ran in 21–30 ms, sufficient for 30 FPS, while Medium variants remained below 100 ms (e.g., YOLO11m at 92 ms).

Interestingly, inference latency did not scale directly with parameter count or GFLOPs. For example, YOLO11n, with fewer parameters than YOLOv8n, exhibited higher latency on all three SoCs. Similarly, YOLOv8m consistently showed slower performance than YOLOv5m despite comparable parameter counts. Instead, latency correlated more closely with model accuracy (mAP), suggesting that architectural innovations (e.g., C2f in YOLOv8, C3k2 in YOLO11) introduce additional computational overheads that do not map efficiently onto NPUs.

These results demonstrate that, while more powerful SoCs can substantially reduce inference time, the relationship between model complexity and latency is influenced not only by FLOPs and parameters but also by architectural design choices.

Fig. 3

Average inference time of YOLO series models evaluated on three Rockchip SoCs using 640\(\times\)640 input images.

Resolution scaling

We next investigated how inference latency scales with input resolution, which was varied from 480\(\times\)480 to 1280\(\times\)1280 across the three SoCs. Since computational complexity theoretically increases quadratically with resolution, a fourfold increase in FLOPs would be expected when doubling both height and width (e.g., from 640\(\times\)640 to 1280\(\times\)1280). However, measured inference latency deviated from this quadratic trend, reflecting hardware-specific bottlenecks.
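A convenient way to quantify these deviations is the empirical scaling exponent \(\alpha\) obtained from two (resolution, latency) measurements under the model \(t \propto r^{\alpha}\): \(\alpha \approx 2\) indicates quadratic scaling, larger values indicate super-quadratic (bandwidth-limited) behavior, and smaller values sub-quadratic behavior. The numeric values below are illustrative, not measurements from our tables:

```python
import math

def scaling_exponent(r1: float, t1: float, r2: float, t2: float) -> float:
    """Empirical exponent alpha assuming t ~ r**alpha between two points."""
    return math.log(t2 / t1) / math.log(r2 / r1)

# Hypothetical illustration: doubling resolution 640 -> 1280 with latency
# rising 4.9x gives alpha > 2, i.e., super-quadratic scaling.
alpha = scaling_exponent(640, 50.0, 1280, 245.0)
```

Computing \(\alpha\) per model and per SoC makes the qualitative "sub-quadratic vs. super-quadratic" comparison in this section directly quantitative.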

As illustrated in Fig. 4, latency growth was nonlinear and strongly dependent on the SoC. On the low-power RV1106, Nano models such as YOLOv5n increased from 68 ms at 640\(\times\)640 to nearly twice that at 1280\(\times\)1280, while Medium variants experienced even steeper slowdowns, far exceeding quadratic expectations. This behavior indicates that limited memory bandwidth (718 MiB/s) amplifies the effect of resolution scaling. On the mid-tier RK3568, latency grew more gradually but still displayed super-quadratic increases for larger models, with YOLO11m nearly doubling its inference time between 640\(\times\)640 and 1280\(\times\)1280. By contrast, the high-end RK3588 exhibited sub-quadratic growth: Nano models increased by less than 1.5\(\times\) and Medium variants by approximately 1.4\(\times\), demonstrating the stabilizing effect of higher bandwidth (8807 MiB/s) and more efficient parallel processing.

Model scale further influenced sensitivity to resolution changes. Nano variants remained relatively stable across resolutions, while Medium models consistently suffered from super-quadratic growth on bandwidth-limited platforms. This highlights the disproportionate impact of memory traffic and intermediate feature map size on larger architectures.

Overall, these results show that while theoretical complexity scales quadratically with input resolution, practical inference latency can be either sub-quadratic or super-quadratic, depending on the SoC’s memory subsystem and NPU scheduling. For low-compute devices, high-resolution inputs quickly become impractical, whereas higher-end SoCs retain usable performance even at 1280\(\times\)1280.

Fig. 4

Analysis of inference speed scaling with input size for YOLO-based detectors deployed on RV1106, RK3568, and RK3588 chips.

Sublinear compute scalability

We next analyzed how inference latency scales with the theoretical compute power of the three SoCs. The comparison was made between RV1106 (0.5 TOPS), RK3568 (1 TOPS), and RK3588 (2 TOPS, single-core mode), using the same set of nine YOLO models at 640\(\times\)640 resolution. Latency values were normalized to RV1106 to highlight relative improvements.

As summarized in Table 3, latency consistently decreased as compute capacity increased, but the improvements were generally sublinear relative to the expected ratios. For example, YOLOv5n achieved a 2.0\(\times\) speedup on RK3568 compared to RV1106, while YOLO11s improved by only 1.2\(\times\). On RK3588, the gains ranged from 2.9\(\times\) (YOLOv8n) to 4.9\(\times\) (YOLO11m), with most models falling short of the theoretical 4\(\times\) reduction expected when moving from 0.5 TOPS to 2 TOPS.

Scaling behavior also varied with model size. Larger models such as YOLO11m and YOLOv8m benefited more from increased compute resources, achieving speedups close to 5\(\times\) on RK3588. In contrast, smaller models such as YOLOv8n and YOLO11n showed more limited improvements, indicating that factors other than raw compute power constrain performance.

Overall, these results show that inference latency does not scale proportionally with theoretical compute capacity. Differences in memory bandwidth, scheduling efficiency, and hardware utilization introduce bottlenecks that prevent SoCs from achieving ideal scaling.
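This sublinearity can be expressed as a scaling-efficiency ratio: observed speedup divided by the ideal speedup implied by the TOPS ratio, where 1.0 means perfectly linear scaling with compute capacity. The numeric values below are illustrative, in the spirit of Table 3 rather than copied from it:

```python
def scaling_efficiency(t_base_ms: float, t_new_ms: float,
                       tops_base: float, tops_new: float) -> float:
    """Observed speedup divided by the ideal (TOPS-ratio) speedup.

    1.0 means latency improved linearly with theoretical compute;
    values below 1.0 quantify the sublinear scaling discussed above.
    """
    return (t_base_ms / t_new_ms) / (tops_new / tops_base)

# Hypothetical example: a 2.9x measured speedup against a 4x TOPS
# increase (0.5 -> 2 TOPS) corresponds to ~0.73 efficiency.
eff = scaling_efficiency(100.0, 100.0 / 2.9, 0.5, 2.0)
```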

Table 3 Relative inference time of YOLO family detectors on Rockchip SoCs, normalized to RV1106 baseline.

Multi-core scheduling efficiency

To assess the effect of workload distribution across multiple NPU cores, we benchmarked the nine YOLO models on RK3588, which integrates a three-core NPU. Experiments were conducted under different scheduling modes, including single-core execution, dual-core, three-core, and automatic scheduling.

The results, summarized in Table 4, show that multi-core scheduling did not lead to proportional reductions in latency. For small models such as YOLOv5n, the difference between single-core and multi-core execution was less than 5% (21.5 ms vs. 20.5 ms). For larger models, the gains were similarly modest or even negligible. For example, YOLO11m achieved 93.7 ms in single-core mode and 95.5 ms in three-core mode, indicating that overhead from synchronization and memory contention offset the benefits of parallel execution. Automatic scheduling also failed to outperform manual configurations and, in some cases, produced slightly higher latencies.

These findings suggest that while RK3588 provides hardware support for multi-core inference, the efficiency of current scheduling mechanisms remains limited. Performance improvements are constrained by inter-core communication, synchronization overhead, and shared memory bandwidth, preventing linear scaling with the number of cores.
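The scheduling overhead can be summarized with a per-core parallel-efficiency metric. Using the YOLO11m figures quoted above (93.7 ms single-core vs. 95.5 ms three-core), the efficiency is roughly 0.33, i.e., adding cores is counterproductive for this workload:

```python
def parallel_efficiency(t_single_ms: float, t_multi_ms: float, n_cores: int) -> float:
    """Speedup per core: 1.0 means ideal linear scaling across NPU cores."""
    return (t_single_ms / t_multi_ms) / n_cores

# YOLO11m on RK3588 (values from Table 4 as quoted in the text):
eff = parallel_efficiency(93.7, 95.5, 3)   # well below 1.0
```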

Table 4 Inference latency (ms) of YOLO models under different core scheduling strategies on RK3588.

Real-time feasibility

To evaluate whether the tested models can meet the requirements of real-time applications, we analyzed inference latency against common frame-rate thresholds: 25 FPS (40 ms), 30 FPS (33.3 ms), 50 FPS (20 ms), and 60 FPS (16.7 ms).

As shown in Fig. 5, real-time feasibility strongly depends on both the SoC and the model scale. On the entry-level RV1106, none of the models met the 25 FPS requirement, even at lower resolutions, confirming that this platform is unsuitable for real-time object detection with current YOLO variants. On the mid-tier RK3568, Nano models and some Small variants satisfied the 25–30 FPS thresholds at 640\(\times\)640 or below, but performance degraded quickly at higher resolutions. In contrast, the high-end RK3588 achieved real-time operation for all Nano and Small models, maintaining 30 FPS across most tested resolutions. However, Medium-scale models remained above 40 ms in many cases, and none of the models consistently achieved the stricter 50–60 FPS requirements.

These results demonstrate that while real-time inference is achievable on RK3588 for lightweight YOLO models, ultra-low-latency targets (50–60 FPS) remain challenging, particularly for larger architectures and higher input resolutions. On lower-end SoCs, only highly constrained use cases are feasible without further optimization.

Fig. 5

Real-time feasibility analysis of YOLO series models across edge AI SoCs.

System-level resource contention

To assess the robustness of inference under realistic multitasking scenarios, we introduced controlled CPU and memory loads while running the nine YOLO models at 640\(\times\)640 resolution. CPU utilization levels were set to 30%, 50%, 70%, and 90%, while memory loads were configured at approximately 25%, 50%, and 75% of each SoC’s measured bandwidth. CPU load was generated using the cpu class in stress-ng, while memory pressure was applied using the vm class, saturating DRAM bandwidth with streaming memory operations.

The results, summarized in Fig. 6, reveal that inference latency was more sensitive to memory contention than to CPU utilization across all platforms. On the entry-level RV1106, latency for YOLOv5n increased from 68 ms (no load) to 107 ms under high memory load, a 57% slowdown, whereas CPU stress at 90% utilization increased latency by only 13%. Similar trends were observed for larger models, with memory load consistently producing greater degradation. The mid-tier RK3568 showed even stronger sensitivity: for YOLOv5n, latency increased by more than 270% under heavy memory load, while CPU load effects remained below 30% for most cases. By contrast, the high-end RK3588 demonstrated resilience, with latency changes remaining below 5% even under heavy memory or CPU stress.

These findings highlight memory bandwidth as the dominant factor affecting inference robustness in resource-constrained SoCs. While CPU utilization introduces some variability, the ability of NPUs to offload compute limits its impact. In contrast, memory-intensive operations–especially for larger models–compete directly with other system tasks, leading to significant slowdowns on devices with limited bandwidth.

Fig. 6

Inference latency of YOLO models under varying CPU and memory loads on RV1106, RK3568, and RK3588 SoCs at 640\(\times\)640 resolution.

Power consumption and energy efficiency

Power consumption and energy-per-inference are critical metrics for evaluating the practicality of object detection models on low-compute edge SoCs, especially in embedded and battery-powered deployments. To characterize the energy efficiency of different YOLO variants across platforms, we measured the average power draw and computed the per-frame energy consumption for RV1106, RK3568, and RK3588 under continuous inference. Power was recorded using an external USB power meter under steady-state conditions. Energy per inference (mJ/frame) was obtained as: \(E = P \times t\), where P is the average power consumption (W) and t is the measured per-frame latency (s). This formulation directly reflects the effective energy cost of running a single model inference on each SoC.
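The conversion is conveniently unit-consistent: watts multiplied by milliseconds yields millijoules directly. A minimal helper, with illustrative values chosen within the ranges reported below:

```python
def energy_per_frame_mj(power_w: float, latency_ms: float) -> float:
    """E = P * t: power in watts times latency in milliseconds -> millijoules."""
    return power_w * latency_ms

# Illustrative values in the ranges reported in this section:
# ~4.2 W at ~92 ms per frame gives roughly 386 mJ/frame.
example = energy_per_frame_mj(4.2, 92.0)
```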

Fig. 7

Energy per inference across input resolutions on the three Rockchip SoCs. (a) RV1106, (b) RK3568, and (c) RK3588. The y-axis shows the energy per frame in millijoules (mJ/frame).

Figure 7 illustrates the energy-per-frame scaling trend across resolutions from 480 to 1280. All three SoCs exhibit monotonic growth in energy consumption with increasing input resolution; however, the rate of growth differs significantly across platforms. RV1106 shows the steepest scaling behavior, with energy increasing from tens of millijoules at low resolutions to several joules for Medium-sized models at 1280\(\times\)1280. RK3568 demonstrates a more moderate trend, benefiting from higher compute density and improved memory bandwidth. RK3588 achieves the best energy scaling: even at high resolutions, its energy-per-frame remains substantially lower due to its significantly reduced latency.

The differences across model families are also evident. Nano models maintain low energy consumption across all SoCs, ranging from roughly 20–80 mJ/frame on RK3588 and RK3568, and 100–150 mJ/frame on RV1106. In contrast, Medium-scale models impose a much higher energy cost: YOLO11m reaches 978 mJ/frame on RV1106, 768 mJ/frame on RK3568, and 385 mJ/frame on RK3588 at the 640\(\times\)640 resolution. These differences underscore the importance of selecting appropriate model sizes for constrained hardware.

Table 5 summarizes the detailed measurements at the commonly used input resolution of 640\(\times\)640. RV1106 generally operates around 2.1–2.3 W across all models, RK3568 ranges from 2.2 to 2.5 W, and RK3588 consumes between 3.4 and 4.2 W. Despite drawing more power, RK3588 is consistently the most energy-efficient device due to its significantly lower latency, achieving up to 3\(\times\) lower energy-per-frame than RV1106 for Medium-sized models. This illustrates a key conclusion: higher power consumption does not necessarily imply higher energy cost, and higher-performance SoCs may achieve superior efficiency per inference.

Overall, the power and energy measurements highlight clear design guidelines: (1) RV1106 is suitable only for Nano/Small models at moderate resolutions; (2) RK3568 supports Nano and Small models efficiently and can run Medium models when energy constraints are relaxed; and (3) RK3588 offers the best energy-per-inference performance, enabling deployment of Medium-scale YOLO variants while preserving favorable energy efficiency. These results provide a practical reference for selecting model–hardware combinations in resource-constrained embedded or battery-powered applications.

Table 5 Average power and energy per inference at 640\(\times\)640 input resolution for all evaluated YOLO models on the three Rockchip SoCs.

Discussion

Model-level insights: architecture vs. complexity

The results reveal that inference performance on low-compute SoCs is shaped more profoundly by architectural design choices than by conventional model complexity metrics such as parameter count or FLOPs. Across all platforms, models with higher mAP–typically incorporating more expressive modules such as C2f (YOLOv8) or C3k2 (YOLO11)–exhibited noticeably higher latency despite having similar or even smaller FLOP counts compared to earlier YOLO variants. These modules improve representational capacity but introduce additional feature-map transformations that do not map efficiently onto Rockchip NPUs. The strong correlation between latency and accuracy (mAP), and the weak correlation with FLOPs, indicate that FLOPs alone are insufficient predictors of real-world inference costs. This highlights the importance of benchmarking model variants directly on the target SoC rather than relying on theoretical indicators or GPU-based measurements.

Resolution scaling and memory-bandwidth interactions

Scaling the input resolution exposed fundamental limits imposed by memory subsystems. Although theoretical inference cost should grow quadratically with resolution, empirical scaling deviated substantially across SoCs due to bandwidth constraints. On RV1106 and RK3568, latency increased super-quadratically for Medium-scale models, driven by the sharp rise in intermediate feature-map sizes and associated DRAM traffic. These devices possess measured DRAM bandwidths of only 718 MiB/s and 2323 MiB/s, respectively, making them highly susceptible to bandwidth saturation at higher resolutions. In contrast, RK3588–with 8807 MiB/s bandwidth and more efficient parallelism–exhibited sub-quadratic scaling across all YOLO variants. The stress-ng experiments further support these findings: memory pressure inflates latency by 50–270% on RV1106 and RK3568, mirroring the effects observed with large input resolutions. These results show that resolution scaling is fundamentally bounded by available memory throughput rather than compute alone, making high-resolution inference feasible only on higher-end SoCs. As shown in Fig. 1, the Rockchip NPU relies on off-chip DRAM for large feature maps, which explains the bandwidth-sensitive behavior observed in our experiments.

Compute scalability and NPU scheduling efficiency

Comparative analysis across SoCs demonstrated a consistent gap between theoretical compute capacity (TOPS) and achievable speedups. While transitioning from RV1106 (0.5 TOPS) to RK3568 (1 TOPS) or RK3588 (2 TOPS in single-core mode) reduced latency, the gains were consistently sublinear. Smaller models, which are more memory-bound, benefited the least, while Medium-scale models achieved greater speedups owing to their more compute-bound behavior. Furthermore, multi-core NPU scheduling on RK3588 did not yield proportional latency reductions. For most models, three-core configurations offered marginal or even negative improvements compared to single-core mode. This is attributable to inter-core synchronization overhead and shared-memory contention, which dominate the execution pipeline when multiple cores attempt to access overlapping regions of activation memory. These findings imply that both compute and memory subsystems must be co-optimized for effective multi-core utilization; simply increasing available TOPS does not guarantee proportional latency reductions.
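The gap between TOPS ratios and achieved speedups can be expressed as a scaling-efficiency figure. A minimal sketch, with illustrative latencies standing in for the measured values:

```python
# Hypothetical single-model latencies (ms) on the three SoCs; the efficiency
# column compares achieved speedup against the ideal TOPS ratio (1.0 = linear).
platforms = {           # name: (TOPS, latency_ms) -- illustrative numbers only
    "RV1106": (0.5, 120.0),
    "RK3568": (1.0, 75.0),
    "RK3588": (2.0, 45.0),
}

base_tops, base_lat = platforms["RV1106"]
for name, (tops, lat) in platforms.items():
    ideal = tops / base_tops        # speedup if latency scaled with TOPS
    achieved = base_lat / lat       # measured speedup over the baseline SoC
    eff = achieved / ideal          # < 1.0 indicates sublinear scaling
    print(f"{name}: ideal {ideal:.1f}x, achieved {achieved:.2f}x, "
          f"efficiency {eff:.0%}")
```

With measured latencies substituted in, efficiencies well below 100% on the higher-TOPS parts make the memory-bound bottleneck immediately visible.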

System robustness and energy efficiency

System-level stress experiments revealed a substantial disparity in robustness across SoCs. CPU utilization had a relatively minor impact on inference latency, as the compute-intensive portion of YOLO execution is handled by the NPU. However, memory contention produced substantial slowdowns on RV1106 and RK3568, reaching 50–270% degradation depending on model scale. RK3588, with its significantly higher bandwidth, kept latency within 5% of baseline even under heavy memory pressure, underscoring bandwidth as the dominant constraint in low-compute environments.
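The memory-contention condition can be reproduced with stress-ng's virtual-memory stressors. The sketch below shows the general shape of such a run; the worker count, memory fraction, duration, and the benchmark step itself are placeholders to adapt to the actual RKNN inference command:

```shell
# Sketch of a memory-pressure run: background VM workers repeatedly dirty a
# fraction of available memory while the NPU benchmark executes.
# (Short timeout for illustration; real experiments use longer windows.)
if command -v stress-ng >/dev/null 2>&1; then
    stress-ng --vm 2 --vm-bytes 75% --timeout 5s &
    STRESS_PID=$!
    # Placeholder: launch the NPU inference benchmark here and collect
    # per-frame latencies while memory pressure is active.
    wait "$STRESS_PID"
    RESULT=ran
else
    RESULT=skipped   # stress-ng not installed on this machine
fi
echo "memory-pressure run: $RESULT"
```

Comparing per-frame latencies from this run against an idle baseline yields the degradation percentages reported above.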

Energy-per-inference analysis further highlights the practical differences among SoCs. Although power consumption remained relatively stable within each platform, the resulting energy-per-frame varied drastically due to latency differences. For Nano models at 640\(\times\)640, energy ranged from 20 to 80 mJ/frame on RK3588 and RK3568, compared to 100–150 mJ/frame on RV1106. Medium-scale models consumed 300–1000 mJ/frame, with RK3588 achieving 2–3\(\times\) greater energy efficiency than RV1106. These findings emphasize that higher absolute power draw does not necessarily imply higher energy cost, and that high-performance SoCs may be more suitable for battery-powered systems when per-inference energy is the primary constraint.
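Since 1 W sustained for 1 ms delivers exactly 1 mJ, energy per frame is simply average power times per-frame latency. The sketch below uses illustrative figures (not our measurements) to show why a higher-power SoC can still cost less energy per inference:

```python
# Energy per frame from average power draw and per-frame latency.
def energy_mj_per_frame(power_w: float, latency_ms: float) -> float:
    """mJ/frame = W * ms, since 1 W * 1 ms = 1 mJ."""
    return power_w * latency_ms

# Illustrative comparison: a low-power but slow SoC versus a faster,
# higher-power one. The faster part finishes sooner and wins on energy.
slow_soc = energy_mj_per_frame(power_w=1.2, latency_ms=110.0)
fast_soc = energy_mj_per_frame(power_w=4.0, latency_ms=18.0)
print(f"slow SoC: {slow_soc:.0f} mJ/frame, fast SoC: {fast_soc:.0f} mJ/frame")
```

This is the arithmetic behind the observation that absolute power draw alone is a poor guide for battery-powered deployments.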

Deployment implications and practical guidelines

The combined results have direct implications for real-world embedded applications. In surveillance scenarios operating at 25–30 FPS with 720p video, only Nano and Small models on RK3588 meet real-time constraints, while RK3568 supports limited configurations and RV1106 falls short across all tested models. UAVs and mobile robots require end-to-end latencies of 20–40 ms, which restrict feasible deployments to Nano variants on RK3588. IoT devices with relaxed FPS requirements may adopt Nano or Small variants on RK3568, while RV1106 is limited to low-resolution, low-FPS settings.
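The deployment decisions above reduce to a frame-budget check: inference latency plus pipeline overhead must fit within \(1000/\text{FPS}\) ms. A minimal sketch with hypothetical latency and overhead values:

```python
# Feasibility check against a real-time frame budget (illustrative values).
def fits_budget(inference_ms: float, overhead_ms: float, fps: float) -> bool:
    """True if per-frame latency fits within the 1000/fps ms budget."""
    return inference_ms + overhead_ms <= 1000.0 / fps

# Hypothetical Nano-model latency with capture/post-processing overhead
# against a 30 FPS (33.3 ms) surveillance budget:
print(fits_budget(inference_ms=22.0, overhead_ms=8.0, fps=30))  # fits: 30 ms
print(fits_budget(inference_ms=45.0, overhead_ms=8.0, fps=30))  # misses: 53 ms
```

The same check with a 20–40 ms end-to-end budget reproduces the UAV and mobile-robot constraints discussed above.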

The results also suggest several software optimization strategies to improve deployment efficiency on constrained SoCs. Operator fusion and layer reordering can reduce intermediate memory traffic; activation compression or lower-precision intermediate tensors can alleviate bandwidth stress; and memory-aware scheduling may improve multi-core utilization on future NPU architectures. Together, these observations motivate hardware–software co-design for optimizing inference on bandwidth-limited, low-compute edge processors.

Overall, this study provides a practical framework for selecting and deploying YOLO models on resource-constrained SoCs. The findings highlight the interplay between model architecture, resolution, compute capacity, memory bandwidth, and system-level conditions, and underline the importance of hardware-aware deployment strategies in edge AI applications.

Conclusion

This work presents the first comprehensive and reproducible benchmarking of contemporary YOLO object detectors on low-compute Rockchip SoCs, evaluating nine model variants across multiple resolutions, bandwidth conditions, and power regimes. The results demonstrate that inference performance on embedded NPUs is shaped jointly by model architecture, memory-bandwidth constraints, and system-level behavior rather than by theoretical compute (TOPS) or FLOPs alone.

Three overarching conclusions emerge from our analysis. First, inference latency correlates more strongly with mAP and architectural complexity than with parameter count or FLOPs, indicating that modern modules such as C2f and C3k2 impose additional overhead on constrained NPUs. Second, resolution scaling and stress-induced slowdowns reveal memory bandwidth as the dominant limiting factor on RV1106 and RK3568, while RK3588’s higher throughput enables sub-quadratic scaling and robust performance under contention. Third, both compute scalability and multi-core NPU scheduling exhibit sublinear behavior, highlighting that synchronization and shared-memory interactions hinder parallel efficiency. Energy-per-inference measurements further reinforce these trends: although power draw varies modestly across SoCs, the resulting energy cost per frame differs by up to 3\(\times\) due to latency disparities.

These findings underscore the importance of hardware-aware deployment strategies. Lightweight models and moderate resolutions are best suited for low-compute SoCs, whereas high-end platforms can support larger architectures with improved latency and energy efficiency. Memory-centric optimizations–such as operator fusion, activation compression, and bandwidth-aware scheduling–represent promising directions for further improving performance on embedded NPUs.

By providing a unified evaluation methodology, detailed measurements, and cross-platform comparisons, this work offers a practical reference framework for deploying object detection models in real-world edge scenarios. The insights presented here directly inform applications in smart surveillance, robotics, and IoT, where achieving the right balance of accuracy, latency, and energy efficiency is essential for reliable and scalable system design.