Bridging the latency gap with a continuous stream evaluation framework in event-driven perception

Chu, Jie; Zhang, Runze; Yang, Chu; Yu, Zongyou; Bu, Zongtao; Liu, Haotian; Röhrbein, Florian; Knoll, Alois; Chen, Guang; Jiang, Changjun

doi:10.1038/s41467-026-70240-6

Article
Open access
Published: 16 March 2026

Nature Communications volume 17, Article number: 2441 (2026)

Abstract

Neuromorphic vision systems process continuous event streams and offer transformative potential for real-time applications. However, their evaluation remains tethered to methodologies from RGB imaging. These approaches convert asynchronous event streams into synchronized frames and ignore perception latency, creating a critical gap between benchmarks and real-world performance. To address this, we introduce the STream-based lAtency-awaRe Evaluation (STARE) framework. STARE integrates two core components: Continuous Sampling, maximizing model throughput to reduce the impact of latency, and Latency-Aware Evaluation, quantifying latency-induced online accuracy. To rigorously validate STARE, we developed ESOT500, a high-dynamic object tracking dataset with 500 Hz annotations. Experiments reveal that latency severely degrades online accuracy by over 50%. We further introduce two model enhancement strategies: Asynchronous Tracking, a fast-slow architecture that boosts model throughput, and Context-Aware Sampling, which dynamically adapts input to handle low event density cases. Overall, our work bridges the latency gap between models’ theoretical potential and real-world deployment.

Introduction

Biological vision systems excel at perceiving dynamic environments through continuous, adaptive sensing^{1,2,3}. In contrast, artificial vision systems, even those inspired by neural mechanisms, typically rely on discrete and frame-based processing inherited from conventional RGB cameras^{4,5,6,7,8}, as shown in Fig. 1a–b. This mismatch introduces a critical limitation: the temporal discontinuity of static frames inherently induces perceptual delay between sensory inputs^{9,10,11,12,13}. Such delay accumulates in real-world applications, degrading performance in tasks requiring rapid reactions, such as autonomous navigation or human-robot interaction^{9,14,15,16,17}. Event cameras, which are neuromorphic sensors that asynchronously encode pixel-level intensity changes at microsecond resolution, promise to bridge this gap by capturing continuous event streams^{3,18,19,20,21}.

**Fig. 1: STream-based lAtency-awaRe Evaluation (STARE) for event-driven perception.**

However, the traditional perception framework, which forces the continuous event stream into fixed-rate event frames^{3,4,21,22,23,24}, as shown in Fig. 1b, reintroduces perceptual delays and discards the sensor’s inherent capacity for real-time, event-driven computation. Moreover, such a framework typically evaluates model performance by comparing each model output against the ground truth of the corresponding input frame. This evaluation paradigm assumes instantaneous computation and fails to account for the influence of perception latency. We define perception latency as the total time elapsed from the moment an event is triggered to the moment a downstream application receives the perception model’s corresponding output, which is a critical factor in real-world deployment. In such scenarios, downstream applications such as robotic control require continuous access to the most recent outputs of perception models, and even minor latency can lead to outdated predictions and compounding errors. This mismatch between evaluation and real-time application is especially pronounced in neuromorphic vision, where event cameras generate data at over 1 MHz, yet existing frameworks process these streams as low-frequency frames (typically ~25 Hz), discarding temporal precision and penalizing lightweight, high-speed algorithms by confining them to suboptimal input frequencies.

To unlock the potential of event cameras in real-time applications, recent advances in perception frameworks have attempted to reduce latency by either optimizing frame-based processing pipelines^{9,25,26,27,28} or redesigning the perception framework to directly process the online streaming data^{29,30,31,32}. Despite these improvements, evaluation methodologies remain predominantly based on fixed frame-rate sequential data. To address this issue, we introduce the STream-based lAtency-awaRe Evaluation (STARE) framework, which is designed to transcend traditional limitations and realistically assess event-driven models through two core components, as shown in Fig. 1c. First, Continuous Sampling schedules the model to immediately process the latest events right after the previous cycle, maximizing model throughput for real-time use. Second, Latency-Aware Evaluation integrates latency into accuracy metrics: it aligns each timestamped ground truth with the latest available prediction, directly quantifying accuracy degradation caused by outdated model outputs. An example of the perception result from STARE is shown in Fig. 1d.

Given that existing event datasets are constrained by low frame-rate annotations and designed for the traditional latency-ignored framework, we developed a new event dataset with high-frequency annotations: ESOT500³³, a 500 Hz object tracking dataset in which each annotation simulates a real-world downstream query (Fig. 2). This annotation density ensures resistance to temporal aliasing³⁴ and supports rigorous Latency-Aware Evaluation (Fig. 3). Leveraging STARE with ESOT500, we demonstrate that perception latency reduces accuracy by over 50% (Fig. 4a–b). Robotic experiments further validate this finding: a 55% increase in perception latency leads to complete task failure (Fig. 5).

**Fig. 2: The ESOT500 dataset for high-dynamic event-driven perception.**

**Fig. 3: Temporal aliasing in event-driven perception and ESOT500’s solution.**

**Fig. 4: Impact of perception latency on event-driven trackers across datasets and hardware.**

**Fig. 5: Real-world robotic ping-pong experiment.**

To address the accuracy degradation stemming from latency, we propose two model enhancement strategies that underscore the need for advanced architectural designs and operational paradigms for event-driven perception systems (Fig. 6). First, Asynchronous Tracking, which draws inspiration from harnessing the continuity of event streams to improve model throughput and thus fulfill the dense query requirements from downstream applications^{9,25}, adopts a dual-component architecture: a heavyweight base model generates high-accuracy yet low-frequency predictions, while a lightweight residual model is integrated to rapidly refine the most recent output of the base model. Second, Context-Aware Sampling originates from our key observation that model performance correlates with the event density surrounding the target object; it dynamically adjusts the model’s activation state based on this contextual event density. Experimental results demonstrate the efficacy of these strategies: Asynchronous Tracking improves latency-aware accuracy by up to 60% (from 31.83 to 51.06 AUC) via a 78% enhancement in model throughput (from 118 Hz to 210 Hz), whereas Context-Aware Sampling further enhances robustness, yielding over 51% performance improvement (from 18.73 to 28.29 AUC) in challenging scenarios (Fig. 7).

**Fig. 6: Conceptual illustration of the proposed Asynchronous Tracking and Context-Aware Sampling.**

**Fig. 7: Quantitative evaluation of model enhancement strategies under STARE.**

Our approach breaks free from the static, latency-ignored constraints of traditional vision research by centering on a critical, underaddressed principle: temporal congruence between event cameras’ continuous sensing, model computation, and real-world task demands. Unlike prior work that forces event streams into frame-based pipelines or treats latency as an isolated metric, we present a cohesive solution that integrates STARE (a stream-based latency-aware evaluation framework that mirrors real deployment), ESOT500 (a dense 500 Hz dataset for validating the framework), and our asynchronous/context-aware strategies, thereby narrowing the divide between the theoretical speed of event cameras and their practical utility in real-time scenarios. This represents the main contribution of our work.

Results

The stream-based latency-aware evaluation framework

We developed the STream-based lAtency-awaRe Evaluation (STARE) framework: a realistic, rigorous benchmarking tool explicitly tailored to event-driven vision systems, designed to transcend the limitations of conventional latency-ignored paradigms. STARE’s design is anchored in two core, mutually reinforcing principles that govern both the operation of models in online streaming scenarios and the quantification of their performance under real-time constraints, namely Continuous Sampling and Latency-Aware Evaluation.

Continuous sampling in STARE

The traditional framework operates offline by converting a continuous event stream into a sequence of fixed-rate frames and processing them sequentially, as shown in Fig. 1b. However, adapting this fixed-rate method for online operation would disrupt the event stream’s continuity, lowering model throughput and creating inefficiency. The core issue is the inherent misalignment between a model’s variable inference latency and the fixed frame interval. This misalignment inevitably causes idle time for both faster and slower models, as they must wait for the next frame to arrive, thereby artificially limiting maximum throughput^{10,11,12,13}. More details can be found in Section “Formal Definition”.

To address this, STARE employs the Continuous Sampling strategy during the model’s perception stage. As shown in Fig. 1c, this approach leverages the continuity of the event stream by scheduling the model to sample the most recent events for processing immediately upon completing the prior processing cycle. As illustrated by the comparison between Fig. 1b, d, Continuous Sampling eliminates idle time and enables the model to operate at maximum throughput, which is critical for real-time applications such as robotic control or high-speed interaction.
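The idle-time argument can be made concrete with a toy scheduler. The sketch below is an illustration only, not the paper's implementation: it assumes a constant per-inference latency, integer microsecond timestamps, and a fixed-rate baseline in which an inference may only start on a frame boundary. The function names and the numbers in the usage note are hypothetical.

```python
def fixed_rate_outputs(latency_us, frame_period_us, duration_us):
    """Output timestamps when inference may only start on a frame boundary."""
    outputs, t = [], 0
    while True:
        # Round t up to the next frame boundary; the gap is idle time.
        start = -(-t // frame_period_us) * frame_period_us
        done = start + latency_us
        if done > duration_us:
            break
        outputs.append(done)
        t = done
    return outputs


def continuous_outputs(latency_us, duration_us):
    """Output timestamps when each inference starts as soon as the last one ends."""
    return list(range(latency_us, duration_us + 1, latency_us))
```

Under these assumptions, a model with 30 ms latency on a 40 Hz frame grid (25 ms period) produces 20 outputs in one second, because every inference waits for the next boundary, whereas the same model sampled continuously produces 33. A faster 20 ms model is capped at 40 outputs on the grid but reaches 50 when sampled continuously.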

Latency-aware evaluation in STARE

The second core principle of STARE lies in its evaluation protocol, which explicitly integrates perception latency into the final accuracy metrics. As shown in Fig. 1b, the traditional framework evaluates a model’s prediction against the ground truth at the input’s timestamp, implicitly assuming that predictions are available instantaneously^{6,7,35,36,37,38,39}. Such an assumption neglects the inevitable delay between sensory input and computational output, thereby overlooking the critical influence of perception latency on real-time performance^{9,10,11,12,13}.

To address this, as shown in Fig. 1c, STARE employs a Latency-Aware Evaluation protocol, which simulates the interaction between the perception model and the downstream application. The core of this protocol involves comparing sparse model outputs against a dense sequence of high-frequency ground truth annotations, with each annotation representing a real-time query. For any query at time t_query, the most recent perception result prior to that time (i.e., t_output ≤ t_query) is retrieved to calculate the performance metric (e.g., AUC, IoU).

This mechanism inherently highlights the impact of latency. When a model exhibits high latency, its sparse outputs result in a single prediction being repeatedly utilized as the best available estimate for a sequence of queries. For example, as shown in Fig. 1c, the perception result generated at time t_{i+2} is reused to evaluate queries after t_{i+2}, such as t_j where t_{i+2} < t_j < t_{i+3}, thereby accumulating error over time. This approach provides a direct quantification of the performance degradation attributable to perception latency, yielding a faithful assessment of a model’s suitability for real-time deployment.
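The matching rule described above (for each query, take the latest output with t_output ≤ t_query) can be sketched as follows. This is a minimal illustration under stated assumptions: boxes are axis-aligned `(x, y, w, h)` tuples, output times are the moments predictions become available, and queries issued before the first output score 0. That last choice is an assumption of this sketch, not something this section specifies. All function names are hypothetical.

```python
from bisect import bisect_right


def latest_outputs(output_times, query_times):
    """Index of the most recent output with t_output <= t_query, else None."""
    matches = []
    for tq in query_times:
        i = bisect_right(output_times, tq) - 1
        matches.append(i if i >= 0 else None)
    return matches


def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def latency_aware_ious(outputs, queries):
    """outputs: time-sorted (t_output, box); queries: (t_query, gt_box)."""
    times = [t for t, _ in outputs]
    idx = latest_outputs(times, [t for t, _ in queries])
    return [0.0 if i is None else iou(outputs[i][1], gt)
            for i, (_, gt) in zip(idx, queries)]
```

With outputs available at times 10 and 30 and queries at 0, 10, 20, 30 and 40, the matched indices are `[None, 0, 0, 1, 1]`: the output from time 10 is reused for the query at 20, which is how a stale prediction accumulates error against a moving target.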

The stream-based latency-aware evaluation dataset

To enable rigorous validation of the STARE framework and address the limitations of existing event datasets, which lack dense temporal annotations for Latency-Aware Evaluation, we introduce the ESOT500 dataset. ESOT500 comprises two subsets with distinct sensor resolutions: ESOT500-L (346 × 260) and ESOT500-H (1280 × 720), covering diverse indoor/outdoor scenes, object classes, and challenging environmental conditions (Fig. 2a–b). A comparative analysis with related event datasets is presented in Fig. 2c, while detailed protocols for data capture, annotation, training/test splits, and implementation are provided in Supplementary Note 3. Latency-Aware Evaluation (a core goal of STARE) requires two critical dataset attributes, as identified in prior analysis: (1) dense ground truth to simulate the high-frequency queries of real-world downstream applications (e.g., robotic control)^{40,41,42,43,44}, and (2) high temporal resolution to capture high-dynamic object motion without the temporal aliasing inherent in low-frequency annotations^{34,45}. ESOT500 is purpose-built to satisfy both requirements via its 500 Hz annotation rate, as elaborated below.

Simulating dense downstream queries with 500 Hz annotations

The 500 Hz annotation frequency, which is ESOT500’s defining feature, generates a dense ground truth sequence that mimics the continuous, high-rate query demands of downstream applications^{40,41,42,43,44}. This enables STARE to directly assess latency-induced accuracy: by comparing a model’s sparse, timestamped outputs with ESOT500’s dense query stream, the framework captures how latency causes predictions to become outdated relative to real-time task demands, which would be impractical with low-frequency datasets.

High-fidelity capture of high-dynamic object motion

The 500 Hz annotation rate enables high-fidelity capture of high-dynamic object motion, addressing a critical shortcoming of low-frequency datasets, which struggle to resolve rapid, transient motion transitions. From an information-theoretic perspective, sampling continuous high-dynamic motion at discrete, low frequencies acts as a lossy channel⁴⁵: the original fast-changing trajectory serves as the input signal, while sparse annotations produce degraded outputs that omit key motion details. As the speed or variability of object motion increases, this degradation becomes more severe (Fig. 3a)³⁴, leading to substantial discrepancies between the true high-dynamic motion and trajectories interpolated from low-frequency sparse annotations (Fig. 3e).

To quantify this distortion, we introduce the Reconstruction Error (RE) metric, which measures motion information loss at low sampling frequencies by: (1) downsampling ESOT500’s 500 Hz ground truth to a target frequency; (2) reconstructing the trajectory via linear interpolation; and (3) computing the error relative to the original 500 Hz data. As shown in Fig. 3b–d, RE is pronounced at conventional low frequencies (e.g., 25 Hz) but gradually declines as the sampling rate approaches 500 Hz. Fig. 3e presents several visual examples. These results confirm that high temporal resolution is critical to preserve the integrity of high-dynamic motion. The RE curve is expected to be even steeper in more complex tasks (e.g., multi-object detection in dynamic scenes⁶, 6D pose estimation for high-speed targets²⁵, visual odometry in rapid locomotion⁷), as higher degrees of freedom introduce greater uncertainty. These results provide quantitative evidence that ESOT500’s high-frequency annotations are crucial for validating event-driven models in high-dynamic scenarios. A formal definition of RE is provided in Section “Reconstruction Error” (Methods).
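The three RE steps (downsample, linearly interpolate, compare with the original) can be sketched for a single scalar track such as a box-centre coordinate. The formal definition lives in the Methods and is not reproduced in this section, so the error measure used here (mean absolute error), the integer-stride downsampling, and the function name are assumptions made for illustration.

```python
import math


def reconstruction_error(traj, stride):
    """Mean absolute error after keeping every stride-th sample and
    linearly interpolating back to the original sampling rate."""
    n = len(traj)
    if n < 2:
        raise ValueError("need at least two samples")
    kept = list(range(0, n, stride))
    if kept[-1] != n - 1:
        kept.append(n - 1)  # keep the last sample so the whole span is covered
    recon = [0.0] * n
    for a, b in zip(kept, kept[1:]):
        for i in range(a, b + 1):
            w = (i - a) / (b - a)
            recon[i] = (1 - w) * traj[a] + w * traj[b]
    return sum(abs(r - x) for r, x in zip(recon, traj)) / n


# Usage: a synthetic 20 Hz oscillation sampled at 500 Hz for one second.
signal = [math.sin(2 * math.pi * 20 * k / 500) for k in range(500)]
re_25hz = reconstruction_error(signal, 20)   # every 20th sample, i.e. 25 Hz
re_250hz = reconstruction_error(signal, 2)   # every 2nd sample, i.e. 250 Hz
```

On this synthetic signal the 25 Hz reconstruction has a much larger error than the 250 Hz one, and the error is exactly zero at stride 1. This matches the qualitative trend the text reports for Fig. 3b–d, although the synthetic sine says nothing about the magnitudes measured on ESOT500.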

Experimental results and analysis

Event-driven perception evaluation with STARE

To quantify the impact of perception latency on event-driven perception, we leverage the STARE framework to evaluate state-of-the-art models. This evaluation centers on the ESOT500 dataset, tailored for high-dynamic, low-latency assessment, and extends to external benchmarks to verify generalizability.

Experimental setting

We selected representative trackers spanning diverse model families, offline accuracy, and inference speeds (Fig. 4a–b): Siamese-based (PrDiMP18⁴⁶), Transformer-based (MixFormer⁴⁷), GNN-based (KeepTrack⁴⁸), RNN-based (KYS⁴⁹) and Segmentation-Centric (RTS⁵⁰). We also incorporated models specifically designed for event streams, such as those operating directly on raw events (EGT²⁹), grid-like representations²¹ (HDETrack⁵), and a recent cross-modal tracker using the Mamba architecture (Mamba-FETrackV2⁵¹). Trackers were assessed under two paradigms: (a) Traditional framework: Event streams were preprocessed into 20 Hz fixed-rate frames (with varying event durations per frame)²². (b) STARE framework: Trackers used Continuous Sampling, processing the most recent events within different sampling window sizes. More details about the experimental settings can be found in Supplementary Note 1.

Evaluation on ESOT500

The results, summarized in Fig. 4a–b, reveal two key findings. First, most models experienced a substantial accuracy drop of up to 50% when moving from the traditional framework to STARE. This demonstrates that accuracy degradation caused by perception latency has been largely overlooked by the traditional evaluation framework. Although the accuracy drop was widespread, the results under STARE highlighted the superiority of some lightweight models over heavy ones. As shown in Fig. 4f–h, some models that achieved higher offline accuracy in the traditional framework were outperformed by more lightweight models under STARE. For example, MixFormer⁴⁷ outperformed KeepTrack⁴⁸ under STARE, whereas KeepTrack performed better in the traditional framework. This result is consistent with our real-world robotic ping-pong experiment in Fig. 5c. Notably, such ranking reversals are observed even on our RTX 3090 GPU with high computational power. It is reasonable to infer that the impact of latency is much more pronounced on resource-constrained devices, which are common in robotics and autonomous systems^{52,53}. These findings provide direct experimental evidence that the traditional framework can yield misleading conclusions about a model’s practical capability, underscoring the necessity of a latency-aware evaluation paradigm such as STARE.

Second, the performance of most trackers under STARE followed a unimodal trend with respect to the sampling window size. The average peak performance appeared around a 20 ms window, reflecting a trade-off: longer windows gather more information but risk redundancy, while shorter windows avoid redundancy but miss critical data. This finding indicated that the effectiveness of Continuous Sampling depends on the sampling window size. We therefore examined Continuous Sampling more closely by comparing model performance under STARE with that under a frame-based latency-aware evaluation. Under the frame-based latency-aware evaluation, the event stream was converted into event frames in advance, and the model was constrained to sample events only at the discrete timestamps of these frames. More details about the frame-based latency-aware evaluation can be found in Section “Formal Definition” (Methods). In contrast, under STARE with Continuous Sampling, the model sampled and processed the most recent events immediately after the previous processing cycle was complete, without requiring any fixed-rate preprocessing. We evaluated the model under these two settings with varying sampling window sizes. As shown in Fig. 4e, Continuous Sampling improved the model’s online accuracy by 51–129% across sampling windows, demonstrating that Continuous Sampling improves model throughput and thereby enhances accuracy by leveraging the continuity of event data. Furthermore, Continuous Sampling was most effective with a sampling window of 50 ms, which best balanced window size and information richness on our dataset. This indicates that the window size in Continuous Sampling should be chosen carefully to unlock a model’s full real-time potential.

The above experiments highlighted the impact of latency on model performance. To comprehensively evaluate a model under STARE in different latency conditions, we conducted experiments using a latency simulator, different physical GPU and CPU configurations, and resource contention from a parallel task. The results are shown in Fig. 4i–l, respectively. In all cases, accuracy under STARE dropped consistently as latency increased, demonstrating the consistent negative impact of perception latency on model performance.

Evaluation on other datasets

Apart from the proposed ESOT500 dataset, we also applied STARE to two other commonly used event-based tracking datasets: FE108²¹ and VisEvent²⁴. As shown in Fig. 4c–d, the performance of models on these datasets also followed a unimodal trend with respect to the sampling window size, similar to our findings on ESOT500. It is worth noting that the optimal sampling window for Continuous Sampling on FE108 (e.g., 20 ms for KYS⁴⁹) is much shorter than that on VisEvent (e.g., 50 ms for KYS). This is likely because the objects in FE108 are hung on a stick and swung, generating high-density events. As a result, a short sampling window already aggregates sufficient information in this case.

Evaluation on ping-pong robot player with STARE

To examine the real-world impact of perception latency, we developed an event-based ping-pong robot player. This platform enabled us to test the immediate impact of latency on applications that demand rapid responses. The system consisted of a robotic arm holding a bat, an event camera, a ball launcher, an event-driven tracker, and a robot control policy. These components formed a tightly coupled perception-action loop. The event camera recorded the ball’s trajectory at 1 MHz. The tracker, operating at tens of Hz, requested the latest event data and provided estimated 3D positions of the ball. The robot control policy queried the tracker at 2 kHz, computing hitting trajectories, and sent control commands to the arm at 200 Hz, guiding it to strike the ball. The complete setup is illustrated in Fig. 5a. An example of a successful robot hitting back a ping-pong ball is shown in Fig. 5b. More details are provided in Supplementary Note 2.

To quantify the impact of perception latency on the ping-pong robot player, we conducted experiments under five conditions, with results summarized in Fig. 5c. In the first three settings, we applied STARE to the MixFormer⁴⁷ model and simulated different levels of perception latency by varying the model processing speed. Increasing the processing speed from 23.0 Hz to 55.3 Hz improved the success rate from 0 to 7 out of 20 trials, demonstrating that reduced latency contributed to real-world task success. The fourth setting served as a critical ablation of STARE in which the Continuous Sampling operation was removed. When the event stream was converted into 40 Hz frames, the success rate dropped to 2 out of 20 trials, lower than in the second setting (5 out of 20), which had a similar processing speed but retained Continuous Sampling. This demonstrated that the traditional frame-based online setting failed to unlock the model’s real-time potential. In the final setting, we selected KeepTrack⁴⁸, which has a higher traditional offline accuracy than MixFormer (62.87 AUC vs 61.56 AUC) but a lower processing speed (33.2 Hz). KeepTrack achieved a lower task success rate (1 out of 20 trials), highlighting the negative impact of high processing latency even though the model had a higher offline accuracy.

Overall, these findings emphasize two essential requirements for evaluating event-driven systems. First, perception latency should be explicitly considered in evaluation to ensure that performance metrics reflect realistic deployment conditions rather than idealized assumptions. Second, to fully realize the capabilities of event-driven models, the evaluation methodology should preserve the continuous nature of event streams instead of constraining them into fixed-rate frames, which can suppress real-time potential even when the nominal throughput appears high.

Strategies for continuous event-driven perception

To mitigate latency-induced performance loss while aligning with the intrinsic continuity of event streams, we propose two targeted model enhancement strategies: Asynchronous Tracking (boosts throughput via dual-model collaboration) and Context-Aware Sampling (optimizes input efficiency via dynamic event selection).

Asynchronous tracking

Increasing model throughput is a direct way to produce more timely predictions, which is critical for satisfying the high-frequency query demands from downstream applications (e.g., robotic control). Event streams inherently provide continuous visual information at near-arbitrary timestamps, theoretically supporting ultra-high throughput. But this potential is not fully utilized by single-model architectures, which are limited by the computational latency of feature extraction. To leverage this continuity, we develop Asynchronous Tracking: a dual-model framework that uses a lightweight residual model to recursively update predictions from a heavyweight base model, thereby boosting throughput without sacrificing accuracy.

Figure 6a details the architecture: the heavyweight base model extracts high-quality features from the event stream to estimate target positions, but results in significant computational latency. While the base model runs inference, the lightweight residual model processes the latest incoming events to refine the base model’s features and predictions in real time. This design has two key advantages: it avoids redundant, computationally expensive feature extraction (by reusing the base model’s shared features) and integrates the most recent event information (preserving temporal continuity).
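One way to picture the fast-slow interplay is as two output timelines: a slow one from the base model and a fast one from the residual model, where each residual step refines the most recent base result that had finished before that step began. The sketch below models only this timing, under assumptions that are not taken from the paper: both models run in parallel with constant latencies, timestamps are integer microseconds, and the function name and parameters are hypothetical.

```python
from bisect import bisect_right


def async_schedule(base_latency_us, residual_latency_us, duration_us):
    """Return (t_output, base_index) pairs for the refined output stream.

    base_index identifies which base-model result a residual step refined.
    Residual steps that start before any base result exists emit nothing.
    """
    base_done = list(range(base_latency_us, duration_us + 1, base_latency_us))
    out = []
    for t in range(residual_latency_us, duration_us + 1, residual_latency_us):
        step_start = t - residual_latency_us
        k = bisect_right(base_done, step_start) - 1
        if k >= 0:
            out.append((t, k))
    return out
```

For a hypothetical 10 ms base model and 4 ms residual model over 40 ms, the base model alone yields 4 outputs while the refined stream yields 7, each tagged with the base result it builds on. The real throughput gain depends on how the two models share hardware, which this timing sketch ignores.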

Experimental validation (Fig. 7a) confirms the strategy’s efficacy: Asynchronous Tracking increases model throughput by 78% (from 118 to 210 Hz) and lifts latency-aware accuracy by 60% (from 31.83 to 51.06 AUC). For comparison, we also explored a single-model alternative, Predictive Motion Extrapolation, which extrapolates future target positions via velocity vectors output by the base model (details in Methods, Section “Predictive Motion Extrapolation”). Both strategies aim to boost throughput by updating base model predictions, but Asynchronous Tracking outperforms Predictive Motion Extrapolation across all settings (Fig. 7b). We attribute this performance gap to the limitations of motion extrapolation in high-dynamic scenarios, which is particularly critical for the high-dynamic motion annotated at 500 Hz in ESOT500. To quantify this, we define the unpredictability score (the negative accuracy gain of Predictive Motion Extrapolation relative to a non-predictive baseline), where higher scores indicate more unpredictable, that is, more high-dynamic, motion. As shown in Fig. 7b, model accuracy decreases sharply as the unpredictability score increases, which explains why Predictive Motion Extrapolation underperforms.
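The percentages quoted above are relative gains, and the unpredictability score is a sign-flipped accuracy difference. The short sketch below reproduces the arithmetic from the reported figures. Whether the paper's score uses an absolute or a relative accuracy difference is not stated in this section, so the absolute form here is an assumption, as are the function names.

```python
def relative_gain(before, after):
    """Relative change from before to after, as a fraction."""
    return (after - before) / before


def unpredictability_score(acc_extrapolation, acc_baseline):
    """Negative accuracy gain of extrapolation over a non-predictive baseline.

    Assumes an absolute difference; positive when extrapolation hurts.
    """
    return -(acc_extrapolation - acc_baseline)


throughput_gain = relative_gain(118, 210)    # about 0.78
accuracy_gain = relative_gain(31.83, 51.06)  # about 0.60
```

With hypothetical accuracies of 40.0 AUC for extrapolation and 45.0 AUC for the baseline, the score is 5.0. A sequence on which extrapolation helps would receive a negative score.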

Context-aware sampling

Our earlier STARE evaluations (Section “Event-Driven Perception Evaluation with STARE”) revealed a key insight: model performance exhibits a unimodal distribution with respect to event sampling window size. Specifically, too small a window lacks sufficient motion information, while too large a window introduces redundant information. This finding motivated Context-Aware Sampling: a dynamic input refinement strategy that adapts the sampling window while activating/deactivating model inference based on event density surrounding the target’s last known location. The goal is to balance computational efficiency (avoiding unnecessary and inaccurate inference) and perception accuracy (preventing the target from slowly drifting).

Qualitative examples in Fig. 6b–c illustrate the mechanism: when event density is low (indicating slow or no motion, shown in Fig. 6b), inference is deactivated, and the last valid prediction is reused. This can reduce computational latency without accuracy loss. To avoid target drift from prolonged deactivation (Fig. 6c), the strategy enforces periodic reactivation to re-localize the target using newly accumulated events. The detailed formulation of the event density threshold and reactivation interval is provided in Section “Context-Aware Sampling” (Methods). Experimental results (Fig. 7a, c) confirm its effectiveness: Context-Aware Sampling improves performance across most settings, and when combined with Asynchronous Tracking, achieves a 61% accuracy gain (from 31.83 to 51.12 AUC), as shown in Fig. 7a. Its effectiveness is most pronounced in sparse-event scenarios. Fig. 7c shows that the strategy lifts accuracy by over 51%, from an average of 18.73 to 28.29 AUC, demonstrating its robustness in challenging scenarios with sparse events.
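The activation logic described above (skip inference when few events fall near the last known target, but force a periodic re-run) can be sketched as a small gate. The actual threshold and reactivation interval are defined in the Methods and are not given in this section, so `min_events`, `max_skips`, the rectangular region of interest and the function names are all hypothetical. The sketch also covers only the on/off gating, not the adaptation of the sampling window itself.

```python
def should_run_inference(events, roi, min_events, skipped, max_skips):
    """events: (x, y) pixel coordinates in the current sampling window.
    roi: (x, y, w, h) region around the target's last known location."""
    if skipped >= max_skips:
        return True  # periodic reactivation to limit drift
    x0, y0, w, h = roi
    count = sum(1 for x, y in events
                if x0 <= x < x0 + w and y0 <= y < y0 + h)
    return count >= min_events


def gate_sequence(windows, roi, min_events, max_skips):
    """Activation decision for each successive event window."""
    decisions, skipped = [], 0
    for events in windows:
        active = should_run_inference(events, roi, min_events, skipped, max_skips)
        decisions.append(active)
        skipped = 0 if active else skipped + 1
    return decisions
```

For one dense window followed by four sparse ones, with `min_events=3` and `max_skips=2`, the decisions are `[True, False, False, True, False]`: the fourth window triggers inference even though it is sparse, because two consecutive skips have elapsed. In a real tracker the region of interest would move with each new prediction, whereas here it is held fixed for simplicity.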

Discussion

Neuromorphic vision systems hold unique potential for real-time applications, yet their practical deployment has been hindered by misalignment between evaluation paradigms and the intrinsic properties of event streams. To address this critical gap, we developed the STream-based lAtency-awaRe Evaluation (STARE) framework, which is designed to prioritize the temporal continuity of event data and explicitly account for perception latency, two factors overlooked by traditional RGB-derived evaluation methods.

STARE’s value lies in its ability to bridge theoretical benchmarks and real-world performance. By integrating Continuous Sampling and Latency-Aware Evaluation, the framework aligns the model operation pipeline with the asynchronous nature of event streams: Continuous Sampling maximizes throughput by processing the latest events immediately after the prior processing cycle, while Latency-Aware Evaluation quantifies latency-induced accuracy loss by matching high-frequency ground truth to the most recent available model output. This design ensures that STARE captures performance degradation that traditional frame-based, latency-ignored frameworks miss. Our experiments show that perception latency reduces online accuracy by over 50% and even reverses model rankings (favoring lightweight, high-throughput architectures over computationally heavy alternatives). Critically, event-driven robotic tests validated this finding: a 55% increase in latency can lead to a complete (100%) drop in task success rate (ping-pong return accuracy), underscoring the consequences of ignoring latency in real-world systems.

To enable rigorous validation of STARE, we developed the ESOT500 dataset, which provides 500 Hz dense annotations to accurately capture high-dynamic object motion and avoid temporal aliasing. This dataset addresses a key limitation of existing low-frequency event datasets, which fail to faithfully capture high-dynamic state changes and thus struggle to provide reliable Latency-Aware Evaluation results. Complementing STARE and ESOT500, we proposed two model enhancement strategies, Asynchronous Tracking and Context-Aware Sampling, which mitigate latency-induced degradation without sacrificing model computational efficiency. Asynchronous Tracking boosts throughput via a dual lightweight-heavyweight architecture, while Context-Aware Sampling dynamically adapts input based on event density surrounding the target object. The two strategies together improve latency-aware accuracy by up to 61% and increase model speed by 78%. Furthermore, Asynchronous Tracking outperforms alternative approaches like Predictive Motion Extrapolation, which struggles in highly dynamic scenarios due to its reliance on trajectory forecasting rather than improving throughput by leveraging the continuous event stream.

Despite these advances, our work has limitations that point to future research directions. First, while our vision is for the STARE framework to be applied across a wide range of perception tasks, our current work demonstrates its methodology on the single-object tracking task, chosen for its clarity as a foundational perception problem. This work thus paves the way for STARE’s application to other demanding real-time tasks, as more high temporal resolution datasets tailored to these scenarios are established in the future. Second, we currently treat upstream perception modules and downstream applications (e.g., robotic control) as decoupled systems, focusing on how perception latency impacts control outcomes rather than optimizing them jointly. A more integrated approach would involve co-designing perception and control policies within a unified latency-aware framework. Recent work has explored learning a separate high-level controller to make runtime decisions⁵⁴ or directly forecasting future states to provide better state estimation⁵⁵, effectively creating reactive policies that adapt to dynamic environments. End-to-end policies, however, may offer even greater potential. For example, such policies could learn to prioritize safer, lower-risk actions when high perception latency is anticipated, which can make systems more efficient and more robust to latency-induced uncertainty.

Looking forward, STARE also opens avenues to explore broader questions in neuromorphic vision, such as co-design of algorithms and hardware, real-time robotic control, and learning-based event sampling strategies. By centering temporal congruence in model operation and evaluation, our work provides a path towards unlocking the full potential of event-driven systems, enabling technologies that are accurate, low-latency, and reliable in real-world scenarios.

Methods

In this section, we use visual object tracking (VOT) as a concrete example to formally describe the STARE framework and compare it with latency-ignored and frame-based approaches.

Framework

Formal definition

Preliminary and Notation

Given an event stream

$${{\mathcal{E}}}={\{{e}_{i}\}}_{i=0}^{{N}_{e}}\,,$$

(1)

we regard the sampled event stream segment as

$${{{\mathcal{E}}}}_{l,r}={\{{e}_{i}\}}_{i=l}^{r}\,,$$

(2)

where e_i = (x_i, t_i, p_i) comprises the event coordinate, timestamp and polarity; N_e is the maximum event index, so that N_e + 1 is the total number of events in the stream; and l, r are event indexes marking the left and right bounds of the sampled event stream segment. In this work, the basic sampling strategies fall into two types: one fixes the time duration of the sampled segment, while the other fixes the number of sampled events. Accordingly, the indexes l, r are obtained from one of two indexing functions:

1) Given a sampling window of time duration L, the time-fixed indexing function φ_τ can be defined as

$$[{l}_{\tau },{r}_{\tau }]={\varphi }_{\tau }({{\mathcal{E}}},t,L)\,,$$

(3)

where

$${r}_{\tau }=\max \{i| ({{{\bf{x}}}}_{i},{t}_{i},{p}_{i})\in {{\mathcal{E}}},{t}_{i}\le t\}\,,$$

(4)

$${l}_{\tau }=\min \{i| ({{{\bf{x}}}}_{i},{t}_{i},{p}_{i})\in {{\mathcal{E}}},{t}_{i}\ge \max (t-L,{t}_{0})\}\,.$$

(5)

2) Given a sampling window of N events, the number-fixed indexing function φ_μ can be defined as

$$[{l}_{\mu },{r}_{\mu }]={\varphi }_{\mu }({{\mathcal{E}}},t,N)\,,$$

(6)

where

$${r}_{\mu }=\max \{i| ({{{\bf{x}}}}_{i},{t}_{i},{p}_{i})\in {{\mathcal{E}}},{t}_{i}\le t\}\,,$$

(7)

$${l}_{\mu }=\max ({r}_{\mu }-N,0)\,.$$

(8)
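The two indexing functions in Eqs. (3)-(8) can be sketched as follows. This is a minimal illustration, assuming the event timestamps are held in a sorted list (integer microseconds here, an illustrative choice) and that t is not earlier than the first event; the names `phi_tau` and `phi_mu` are hypothetical and are not taken from the authors' code.

```python
from bisect import bisect_left, bisect_right

def phi_tau(timestamps, t, L):
    """Time-fixed indexing (Eqs. 3-5): bounds of events with max(t - L, t_0) <= t_i <= t."""
    r = bisect_right(timestamps, t) - 1       # max i with t_i <= t
    lower = max(t - L, timestamps[0])         # clamp the window start to t_0
    l = bisect_left(timestamps, lower)        # min i with t_i >= lower
    return l, r

def phi_mu(timestamps, t, N):
    """Number-fixed indexing (Eqs. 6-8): right bound as above, left bound N indexes earlier."""
    r = bisect_right(timestamps, t) - 1       # max i with t_i <= t
    l = max(r - N, 0)                         # Eq. (8), clamped at the first event
    return l, r

# Hypothetical timestamps in microseconds (assumed sorted).
ts = [0, 1000, 3000, 4000, 7000, 9000]
print(phi_tau(ts, t=8000, L=5000))   # (2, 4)
print(phi_mu(ts, t=8000, N=3))       # (1, 4)
```

Read literally, Eq. (8) gives an inclusive segment of r − l + 1 = N + 1 events whenever r_μ ≥ N; the sketch follows the equation as written rather than adjusting for this.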

Before feeding the sampled event stream ${{{\mathcal{E}}}}_{l,r}$ into the perception model, we need to convert it to an event frame with a specific event representation that the model can process. The general conversion process is denoted by the function χ:

$${\widehat{{{\mathcal{E}}}}}_{l,r}=\chi ({{{\mathcal{E}}}}_{l,r})\,.$$

(9)

The relevant event representations are briefly described in Supplementary Note 1.

As illustrated in Section “Results”, to meet the Latency-Aware Evaluation paradigm, each ground truth bounding box B_j within an event stream sequence is associated with a timestamp t_j. Specifically, we assume that time starts from zero and that the total duration of an event stream is T. We denote the ground truth set as

$${{\mathcal{Q}}}={\{({{{\bf{B}}}}_{j},{t}_{j})\}}_{j=0}^{{N}_{{{\rm{gt}}}}}\,,$$

(10)

where N_gt is the maximum ground truth index, N_gt + 1 is the total number of ground truth annotations, and B₀ serves as the template bounding box for tracking. If the sampling frequency is H segments per second, we can calculate that

$${N}_{{{\rm{gt}}}}=\left\lfloor T\cdot H\right\rfloor -1\,.$$

(11)

Frame-based Latency-ignored Evaluation

As discussed in Section “Introduction” and illustrated in Fig. 8, frame-based latency-ignored evaluation involves processing preprocessed event frames one by one to calculate the offline accuracy. We mark the sequence of event frames as

$${\widehat{{{\mathcal{E}}}}}_{f}={\{{\widehat{{{\mathcal{E}}}}}_{{l}_{i},{r}_{i}}\}}_{i=0}^{{N}_{f}}\,,$$

(12)

where N_f is the max event frame index and N_f + 1 is the total number of event frames. With the introduced notations above, we can get

$${N}_{f}={N}_{{{\rm{gt}}}}\,,$$

(13)

and

$$[{l}_{i},{r}_{i}]={\varphi }_{\tau }({{\mathcal{E}}},\frac{i+1}{H},L)\,.$$

(14)

**Fig. 8: The traditional frame-based latency-ignored evaluation framework.**

Since each event frame in ${\widehat{{{\mathcal{E}}}}}_{f}$ should be fed into the model, we can get the output bounding box set described as

$${{{\mathcal{Q}}}}_{{{\rm{offline}}}}={\{{\widehat{{{\bf{B}}}}}_{k}\}}_{k=1}^{{N}_{f}}\,,$$

(15)

and the corresponding output-ground-truth pairs set described as

$${{{\mathcal{D}}}}_{{{\rm{offline}}}}={\{({{{\bf{B}}}}_{j},{\widehat{{{\bf{B}}}}}_{j})| {{{\bf{B}}}}_{j}\in {{\mathcal{Q}}},{\widehat{{{\bf{B}}}}}_{j}\in {{{\mathcal{Q}}}}_{{{\rm{offline}}}}\}}_{j=1}^{{N}_{{{\rm{gt}}}}}\,.$$

(16)

With ${{{\mathcal{D}}}}_{{{\rm{offline}}}}$, we can calculate standard accuracy metrics, such as AUC and AP.

Frame-based Latency-Aware Evaluation

As illustrated in Fig. 9, frame-based latency-aware evaluation evaluates a system’s reactivity to the external world by processing successively arriving frames and comparing the ground truth with the corresponding timestamped outputs. The original and detailed definition can be found in ref. ¹⁰. We adapt it to event-based vision using the fixed-rate event sampling shown in Fig. 9. Here we provide a brief description using the previously introduced notation.

Unlike frame-based latency-ignored evaluation, the frame-based latency-aware evaluation framework does not schedule the model to process each frame in ${\widehat{{{\mathcal{E}}}}}_{f}$ one by one. If the model is still processing an input, frames arriving in the meantime are skipped. After finishing the inference, the model briefly idles while waiting for the next input, as shown in Fig. 9.

The outputs contain not only the bounding boxes but also corresponding output timestamps for latency-aware evaluation. The output set can be marked as

$${\widehat{{{\mathcal{Q}}}}}_{{{\rm{streaming}}}}={\{({\widehat{{{\bf{B}}}}}_{k},{\widehat{t}}_{k})\}}_{k=1}^{{N}_{{{\rm{streaming}}}}}\,,$$

(17)

where N_streaming is the max output index and the total number of timestamped outputs. To implement the latency-aware evaluation, we still need to match each item in ${{\mathcal{Q}}}$ with an item in ${\widehat{{{\mathcal{Q}}}}}_{{{\rm{streaming}}}}$. As shown in Fig. 9, the matching principle and the matched output-ground-truth pair set ${{{\mathcal{D}}}}_{{{\rm{streaming}}}}$ can be described as

$${{{\mathcal{D}}}}_{{{\rm{streaming}}}} =\left\{({{{\bf{B}}}}_{j},{\widehat{{{\bf{B}}}}}_{{k}_{j}})| \right. \\ \quad{\left.({{{\bf{B}}}}_{j},{t}_{j})\in {{\mathcal{Q}}},({\widehat{{{\bf{B}}}}}_{{k}_{j}},{\widehat{t}}_{{k}_{j}})\in {\widehat{{{\mathcal{Q}}}}}_{{{\rm{streaming}}}}\right\}}_{j=1}^{{N}_{{{\rm{gt}}}}}\,,$$

(18)

where each item in ${{{\mathcal{D}}}}_{{{\rm{streaming}}}}$ satisfies

$${k}_{j}=\max \{k| {\widehat{t}}_{k}\le {t}_{j},\forall ({\widehat{{{\bf{B}}}}}_{k},{\widehat{t}}_{k})\in {\widehat{{{\mathcal{Q}}}}}_{{{\rm{streaming}}}}\}\,.$$

(19)

With ${{{\mathcal{D}}}}_{{{\rm{streaming}}}}$, we can calculate the same metrics as in frame-based latency-ignored evaluation; the results now reflect latency-aware performance.
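The matching rule in Eq. (19) can be sketched with a binary search over the output timestamps. This is a minimal sketch, assuming output timestamps are sorted in increasing order and using 0-based indexes; the function name `match_latest_output` is hypothetical, and returning `None` when no output exists yet is an assumption of this sketch, since this section does not state how that case is handled.

```python
from bisect import bisect_right

def match_latest_output(gt_times, out_times):
    """For each ground-truth time t_j, return the 0-based index of the latest
    output with timestamp <= t_j (Eq. 19), or None if no output exists yet."""
    matches = []
    for t_j in gt_times:
        k = bisect_right(out_times, t_j) - 1   # last output not later than t_j
        matches.append(k if k >= 0 else None)
    return matches

# Hypothetical timestamps in milliseconds.
gt_times = [2, 4, 6, 8, 10]     # ground-truth annotation times
out_times = [3, 7, 9]           # times at which model outputs became available
print(match_latest_output(gt_times, out_times))   # [None, 0, 0, 1, 2]
```

In this example the output produced at 3 ms is reused for the annotations at 4 ms and 6 ms, which is exactly how a slow model is penalized: stale outputs are compared against newer ground truth.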

Stream-based Latency-Aware Evaluation

As shown in Fig. 1c, in STARE, perception models get the input from continuous event streams instead of discrete frames. Algo. 1 of Supplementary Note 5 illustrates the stream-based tracking process implemented in our work.

In general, we start by setting the first ground truth bounding box B₀’s timestamp as the current world time t_curr. The stream-based tracking process then begins, where in each cycle, we sample the event stream using the indexing function φ based on t_curr and the sampling window ω. The sampled segment ${{{\mathcal{E}}}}_{l,r}$ is converted into an event frame ${\widehat{{{\mathcal{E}}}}}_{l,r}$ in a specific representation using the conversion function χ. We then input ${\widehat{{{\mathcal{E}}}}}_{l,r}$ into the model along with the latest output bounding box $\widehat{{{\bf{B}}}}$ to predict a new bounding box and return the computational latency Δt. Then, we update t_curr, store the new timestamped output in ${\widehat{{{\mathcal{Q}}}}}_{{{\rm{STARE}}}}$, and repeat the process until t_curr surpasses the end time T of the event stream. Notably, the first cycle specifically serves to initialize the perception model with B₀.

After the end of the tracking, we can get the timestamped bounding box list, described as

$${\widehat{{{\mathcal{Q}}}}}_{{{\rm{STARE}}}}={\{({\widehat{{{\bf{B}}}}}_{k},{\widehat{t}}_{k})\}}_{k=1}^{{N}_{{{\rm{STARE}}}}}\,,$$

(20)

where ${N}_{{{\rm{STARE}}}}$ is the max output index and also the total number of outputs. The method of aligning ${{\mathcal{Q}}}$ and ${\widehat{{{\mathcal{Q}}}}}_{{{\rm{STARE}}}}$ is consistent with the frame-based latency-aware evaluation. The matched output-ground-truth pair set ${{{\mathcal{D}}}}_{{{\rm{STARE}}}}$ can be described as

$${{{\mathcal{D}}}}_{{{\rm{STARE}}}}={\{({{{\bf{B}}}}_{j},{\widehat{{{\bf{B}}}}}_{{k}_{j}})| ({{{\bf{B}}}}_{j},{t}_{j})\in {{\mathcal{Q}}},({\widehat{{{\bf{B}}}}}_{{k}_{j}},{\widehat{t}}_{{k}_{j}})\in {\widehat{{{\mathcal{Q}}}}}_{{{\rm{STARE}}}}\}}_{j=1}^{{N}_{{{\rm{gt}}}}}\,,$$

(21)

where each item in ${{{\mathcal{D}}}}_{{{\rm{STARE}}}}$ satisfies

$${k}_{j}=\max \{k| {\widehat{t}}_{k}\le {t}_{j},\forall ({\widehat{{{\bf{B}}}}}_{k},{\widehat{t}}_{k})\in {\widehat{{{\mathcal{Q}}}}}_{{{\rm{STARE}}}}\}\,.$$

(22)

With ${{{\mathcal{D}}}}_{{{\rm{STARE}}}}$, we can calculate accuracy metrics in the same way as in the frame-based latency-ignored evaluation, obtaining the perception model’s performance on stream-based tracking.
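The stream-based tracking loop described above can be sketched as follows. This is an illustrative simplification, not Algo. 1 of Supplementary Note 5: `sample` stands in for the composition of φ and χ, `model_step` stands in for one inference call that returns a box and its latency, both are hypothetical callables, and the initialization cycle is not treated specially.

```python
def run_stare(t_start, t_end, init_box, sample, model_step):
    """Continuous Sampling loop: start the next cycle as soon as the previous one ends.

    sample(t)             -> model input built from events up to world time t (assumed)
    model_step(frame, b)  -> (new_box, latency) for one inference call (assumed)
    Returns a list of (box, output_timestamp) pairs.
    """
    t_curr = t_start
    box = init_box
    outputs = []
    while t_curr <= t_end:                 # stop once t_curr surpasses the end time
        frame = sample(t_curr)             # latest events available at t_curr
        box, latency = model_step(frame, box)
        t_curr += latency                  # world time advances by the compute time
        outputs.append((box, t_curr))      # the output only becomes available now
    return outputs

# Toy run in milliseconds: a stub model with a constant 3 ms latency.
outs = run_stare(
    t_start=0, t_end=10, init_box=(100, 50, 20, 20),
    sample=lambda t: t,                    # stand-in for sampling and conversion
    model_step=lambda frame, b: (b, 3),    # stand-in tracker: box unchanged
)
print([t for _, t in outs])                # [3, 6, 9, 12]
```

The last output carries a timestamp beyond the end time and therefore never matches any ground truth under Eq. (22).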

Analysis

From the above description, we can see that for a given event stream ${{\mathcal{E}}}$ and sampling setting (fps and window size), both frame-based latency-ignored evaluation and frame-based latency-aware evaluation share the same event frame set ${\widehat{{{\mathcal{E}}}}}_{f}$ as input. However, the frame-based latency-aware evaluation skips some incoming frames. Therefore,

$${N}_{{{\rm{streaming}}}}\le {N}_{{{\rm{gt}}}}-1\,,$$

(23)

with equality if and only if

$$\Delta t\le \frac{1}{H}\,,$$

(24)

which means that the model’s inference latency does not exceed the time interval between frames. Consequently, the model will always wait for the arrival of the next frame and never skip it. When this equality holds, Eq. (19) can be simplified as

$${k}_{j}=j-1\,.$$

(25)

In contrast, the STARE framework schedules the perception model to operate on the event stream continuously, which we call Continuous Sampling, avoiding idle time caused by waiting for subsequent frames. The perception model can receive input immediately after generating the current bounding box, which leads to a higher frequency of outputs and maximally exploits the real-time capability of the perception system.

With the same algorithm for tracking and evaluation as in frame-based latency-aware evaluation, we can get

$${N}_{{{\rm{streaming}}}}\le {N}_{{{\rm{STARE}}}}\,,$$

(26)

with equality if and only if

$$\Delta t=\lambda \cdot \frac{1}{H},\,\lambda \in {{\mathbb{Z}}}^{+}\,.$$

(27)

Moreover, in the STARE framework, it follows that

$${N}_{{{\rm{gt}}}}-1 < {N}_{{{\rm{STARE}}}},\,\,{{\rm{if}}}\,\,\,\Delta t < \frac{1}{H}\,.$$

(28)

In real-world online scenarios, both H and N_gt can be considered infinite, while Δt cannot be equal to zero. However, in computer-simulated evaluations, they are all finite. Thus, higher values of H and N_gt enable a more precise evaluation of a perception model’s real-time performance, which in part motivated the creation of our ESOT500 dataset.
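The relation between the two output counts can be illustrated with a toy simulation. The sketch below makes several simplifying assumptions that are not in the text: time is measured in integer ticks, frames arrive at 0, P, 2P, … with P = 1/H, the latency d is constant, and an output is counted if it finishes by time T. Because of these conventions, the counts illustrate the ordering in Eq. (26) but do not reproduce the exact index offsets of Eqs. (23) and (28). Both function names are hypothetical.

```python
def count_frame_based(T, P, d):
    """Outputs finished by T when inference can only start on a frame arrival."""
    n, start = 0, 0
    while start + d <= T:
        n += 1
        finish = start + d
        start = -(-finish // P) * P    # idle until the next arrival at or after finish
    return n

def count_continuous(T, d):
    """Outputs finished by T when the next inference starts immediately."""
    n, t = 0, 0
    while t + d <= T:
        n += 1
        t += d
    return n

T, P = 1000, 50                        # 1 s horizon, 20 Hz frames, in milliseconds
for d in (30, 70, 100):
    print(d, count_frame_based(T, P, d), count_continuous(T, d))
# 30  -> 20 vs 33   (latency below the frame interval: idle time is wasted)
# 70  -> 10 vs 14   (every other frame is skipped)
# 100 -> 10 vs 10   (latency is a multiple of the frame interval: no idle time)
```

In this toy setting the counts coincide when d is an integer multiple of P, consistent with the condition in Eq. (27), and Continuous Sampling yields more outputs otherwise.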

Reconstruction error

A significant secondary benefit of ESOT500’s 500 Hz annotation rate is its ability to faithfully capture high-dynamic object motion without the temporal aliasing³⁴ inherent in lower-frequency datasets. Furthermore, the process of recording a continuous real-world motion trajectory at a discrete frequency can be viewed as a lossy information channel⁴⁵. The original, continuous trajectory serves as the input signal, while the set of low-frequency annotations is the channel’s output. The fundamental question is: how much information about the original signal is preserved after passing through this channel?

To quantify this, we define the Reconstruction Error (RE), which serves as a practical measure of the channel’s fidelity. The RE evaluates our ability to reconstruct the original high-frequency trajectory ${{{\mathcal{B}}}}_{{{\rm{high}}}}$, given only a downsampled set of low-frequency annotations ${{{\mathcal{B}}}}_{{{\rm{low}}}}$. Specifically, for a given low frequency f_low, we generate ${{{\mathcal{B}}}}_{{{\rm{low}}}}$ by subsampling ${{{\mathcal{B}}}}_{{{\rm{high}}}}$. We then use a linear interpolator ${{\mathcal{L}}}$ to reconstruct the trajectory between every two consecutive low-frequency samples ${{{\bf{B}}}}_{{t}_{i}^{{\prime} }}$ and ${{{\bf{B}}}}_{{t}_{i+1}^{{\prime} }}$ in ${{{\mathcal{B}}}}_{{{\rm{low}}}}$.

The RE at frequency f_low is then the expected loss between the reconstructed trajectory ${{{\mathcal{B}}}}_{{{\rm{recon}}}}$ and the original high-frequency ground truth:

$$\,{{\rm{RE}}}(\,{f}_{{{\rm{low}}}})={{\mathbb{E}}}_{t\in [{t}_{i}^{{\prime} },{t}_{i+1}^{{\prime} }]}\left[{{\rm{Loss}}}({{{\bf{B}}}}_{t},{{\mathcal{L}}}(({{{\bf{B}}}}_{{t}_{i}^{{\prime} }},{{{\bf{B}}}}_{{t}_{i+1}^{{\prime} }}),t))\right]\,,$$

(29)

where Loss( ⋅ , ⋅ ) is a metric such as 1 − IoU; here we adopt the L2 distance. A high reconstruction error signifies that the sampling frequency is too low to capture the signal’s complexity, leading to severe information loss, which is also known as temporal aliasing³⁴. This error is thus a direct, quantitative measure of aliasing, reflecting the gap between the information content of the original signal and that of its low-frequency samples.
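A minimal NumPy sketch of the Reconstruction Error follows. It assumes boxes are stored as an (n, 4) array sampled at the high frequency, that the low-frequency set is obtained by keeping every `stride`-th annotation (plus the final one so that interpolation covers the whole sequence), and that the expectation in Eq. (29) is approximated by the mean over all high-frequency timestamps. The function name and the synthetic trajectory are illustrative assumptions.

```python
import numpy as np

def reconstruction_error(times, boxes, stride):
    """Mean L2 distance between the original boxes and a linear reconstruction
    built from every `stride`-th annotation."""
    times = np.asarray(times, dtype=float)
    boxes = np.asarray(boxes, dtype=float)            # shape (n, 4)
    keep = np.arange(0, len(times), stride)           # subsampled indexes
    if keep[-1] != len(times) - 1:
        keep = np.append(keep, len(times) - 1)        # keep the endpoint
    # Linearly interpolate each box coordinate back onto the original timestamps.
    recon = np.stack(
        [np.interp(times, times[keep], boxes[keep, c]) for c in range(boxes.shape[1])],
        axis=1,
    )
    return float(np.mean(np.linalg.norm(boxes - recon, axis=1)))

# Synthetic 1 s trajectory annotated at 500 Hz: the box oscillates horizontally at 5 Hz.
t = np.arange(500) / 500.0
wave = np.stack(
    [100 + 50 * np.sin(2 * np.pi * 5 * t), np.full_like(t, 80.0),
     np.full_like(t, 20.0), np.full_like(t, 20.0)],
    axis=1,
)
for stride in (1, 5, 25):                             # 500 Hz, 100 Hz, 20 Hz
    print(stride, reconstruction_error(t, wave, stride))
```

On this synthetic trajectory the error is zero at the full rate and grows as the stride increases, which is the qualitative behaviour the RE is meant to expose.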

Details and insights of model enhancement

Predictive motion extrapolation

We define the function κ to represent the linear compensation function based on model-predicted velocity, which can be formally described as

$$\kappa (\widehat{{{\bf{B}}}},\widehat{{{\bf{v}}}},\Delta t)=\widehat{{{\bf{B}}}}+\widehat{{{\bf{v}}}}\cdot \Delta t\,,$$

(30)

where the latency Δt is a scalar, and $\widehat{{{\bf{B}}}}$ and $\widehat{{{\bf{v}}}}$ are both four-dimensional vectors, including the information of coordinates of the top-left corner and the size of the bounding box.

The predictive head is trained by adding the loss of the predictive inference result ${\widehat{{{\bf{B}}}}}_{{{\rm{pred}}}}$, obtained via Eq. (30), to the loss of the inference result for the current input, ${\widehat{{{\bf{B}}}}}_{{{\rm{curr}}}}$. The combined loss can be expressed as

$${{\rm{Loss}}}({{{\bf{B}}}}_{{{\rm{curr}}}},{\widehat{{{\bf{B}}}}}_{{{\rm{curr}}}})+{{\rm{Loss}}}({{{\bf{B}}}}_{{{\rm{pred}}}},{\widehat{{{\bf{B}}}}}_{{{\rm{pred}}}})\,,$$

(31)

where

$${\widehat{{{\bf{B}}}}}_{{{\rm{pred}}}}=\kappa ({\widehat{{{\bf{B}}}}}_{{{\rm{curr}}}},{\widehat{{{\bf{v}}}}}_{{{\rm{curr}}}},{{\rm{latency}}})\,.$$

(32)

The losses for ${\widehat{{{\bf{B}}}}}_{{{\rm{pred}}}}$ and ${\widehat{{{\bf{B}}}}}_{{{\rm{curr}}}}$ follow the same formula, as defined by the specific perception model. Furthermore, because the tracked object does not always move in a straight line, the parameter latency is set to a small value (e.g., 8 ms) to facilitate model convergence. In practice, the predictive module is more effective at predicting the near future; when the required compensation time is too long, it can produce incorrect predictions.

An intriguing finding during debugging was that simply equipping the model with predictive capability, even when its velocity prediction is not used for the final output, already boosts STARE performance. If we further use κ to adjust ${\widehat{{{\bf{B}}}}}_{{{\rm{curr}}}}$ according to ${\widehat{{{\bf{v}}}}}_{{{\rm{curr}}}}$ and Δt, the performance can be improved further.
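Eqs. (30)-(32) can be sketched numerically as follows. This is a minimal illustration, assuming boxes are stored as (x, y, w, h) arrays and velocities in the same units per second; the mean absolute error used here is only a stand-in, since the actual loss is defined by the specific perception model, and the function names are hypothetical.

```python
import numpy as np

def kappa(box, velocity, dt):
    """Linear compensation of Eq. (30): shift the box by velocity * dt."""
    return np.asarray(box, dtype=float) + np.asarray(velocity, dtype=float) * dt

def l1_loss(target, pred):
    """Stand-in box loss (assumption): mean absolute error over the 4 box entries."""
    return float(np.mean(np.abs(np.asarray(target, dtype=float) - np.asarray(pred, dtype=float))))

def predictive_loss(gt_curr, gt_pred, box_curr, vel_curr, latency, loss_fn=l1_loss):
    """Combined loss of Eq. (31), with the predicted box obtained via Eq. (32)."""
    box_pred = kappa(box_curr, vel_curr, latency)
    return loss_fn(gt_curr, box_curr) + loss_fn(gt_pred, box_pred)

# Hypothetical values: the box moves right at 500 px/s; compensation horizon of 8 ms.
box_curr = [101.0, 50.0, 20.0, 20.0]
vel_curr = [500.0, 0.0, 0.0, 0.0]
print(kappa(box_curr, vel_curr, 0.008))     # x shifts by about 4 px
print(predictive_loss([100, 50, 20, 20], [105, 50, 20, 20], box_curr, vel_curr, 0.008))
```

At inference time the same `kappa` call can be applied with the measured Δt instead of the fixed training horizon, which corresponds to the second use of κ described above.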