Object state optimization algorithm based on Bayesian random sampling for visual object tracking

Zhao, Zhiqiang; Zhao, Huijie; Wen, Daitu; Ma, Tao; Luo, Xiaoli; Wu, Bin

doi:10.1038/s41598-025-21033-2


Article
Open access
Published: 24 October 2025

Scientific Reports volume 15, Article number: 37237 (2025)

Abstract

From the perspective of object state modeling, visual object tracking can be regarded as a unified process that combines object state estimation and object localization. In this framework, state estimation refers to predicting the complete state vector of the object (such as its position, scale, and motion dynamics), while localization specifically denotes identifying the object’s spatial position within the image, typically in the form of bounding box coordinates. Traditional optimization-based methods for state estimation often become trapped in local optima, primarily because the objective function is non-convex and the algorithm is sensitive to initialization. To address these issues, this research proposes an object state optimization algorithm based on Bayesian random sampling for visual object tracking. First, a dense sampling method is introduced to mitigate the problem of local optima. Second, a hybrid model that merges Bayesian random sampling and gradient ascent is proposed to refine the bounding box and alleviate convergence instability. Finally, experimental results show that the proposed algorithm significantly improves tracking performance on multiple datasets, validating its efficiency and applicability in object state estimation tasks.

Introduction

Object tracking plays a crucial role in a wide range of application scenarios, with the aim of accurately and stably estimating the position and state of objects within a sequence^1,2. In recent years, object tracking technology has advanced significantly within computer vision, with applications such as remote sensing satellites³ and unmanned aerial vehicles (UAVs)⁴. However, due to factors such as illumination variations and occlusions, object tracking remains an inherently challenging task.

Currently, deep learning-based object tracking methods^1,5 have primarily been categorized into four types: Siamese network-based architectures^6,7, discriminative model-based architectures^8,9, Transformer-based architectures^10,11, and multi-technology fusion architectures^12,13. Siamese network-based trackers correlate object template features with search region convolutional features, trained end-to-end on datasets^14,15. Discriminative model-based methods employ a discriminator to distinguish between object and background information, achieving accurate tracking by predicting the object’s bounding box¹⁶. Transformer-based approaches emphasize the use of self-attention and cross-attention mechanisms for object tracking and localization^17,18. Multi-technology fusion-based algorithms emphasize the integration of diverse techniques or models to complement each other, achieving high-precision and robust object tracking^19,20. These object tracking algorithms focus on object classification or state estimation as the core for analyzing and tracking objects. Object state estimation, as a key step in object tracking, aims to further refine the object’s location and optimize the bounding box based on the initial detection results of the object. In order to achieve more accurate object state estimation, researchers have proposed various strategies to optimize this process. These algorithms can be broadly divided into two categories based on their design philosophies: one category integrates object localization and bounding box parameter regression into a unified model for joint optimization²¹, while the other models them separately in a staged manner²². On the one hand, some researchers estimate the object bounding box by first performing object center point localization or keypoint localization and then deducing the bounding box parameters. 
For example, Hui et al.²³ utilize a Vision Transformer (ViT) and the Template-Bridged Search Region Interaction (TBSI) module to extract spatio-temporal features, predicting the object position and subsequently inferring the bounding box. Talaoubrid et al.²² employ particle filtering to approximate the object state probability distribution and estimate the bounding box accordingly. Zhao et al.⁷ convert 3D point cloud data into 2D feature maps via a bird's-eye view (BEV) projection and use convolutional neural networks (CNNs) supervised by a loss function to predict the object position for bounding box estimation. These methods improve robustness and accuracy to some extent. However, due to real-time constraints, they show limited precision in predicting object bounding boxes. On the other hand, some researchers focus on directly optimizing bounding box parameters to address real-time limitations in object tracking. Models such as Siamese-RPN²⁴, ATOM²⁵, DiMP²⁶ and SeqTrack²⁷ have effectively enhanced tracking robustness and accuracy. Specifically, ATOM²⁵ combines a classifier with an IoU predictor and uses gradient ascent to optimize candidate box parameters, alleviating the classification-localization inconsistency in traditional tracking methods. DiMP²⁶ further introduces a discriminative filter and employs steepest descent in continuous space to iteratively search for the optimal bounding box, enabling end-to-end training and fast online updating. These tracking methods have achieved good results in object state estimation, improving tracking performance and accuracy while alleviating real-time constraints to a certain extent. However, some challenges remain. First, due to non-convex loss functions and dependence on initialization, they are prone to local optima and fail to reach the global optimum within limited time. Second, during the optimization process, pronounced oscillations often occur near the optimal solution region. This is mainly caused by unstable gradients and ambiguous feature responses, which lead the solution to fluctuate repeatedly among multiple approximate optima and make convergence unstable.

To address the above challenges, this paper proposes an object state optimization algorithm based on Bayesian random sampling for visual object tracking (BRSO), aiming to capture the optimal state information of the tracked object. First, a dense sampling method is introduced to help the model overcome the limitations of conventional optimization algorithms and escape local optima. Second, to address oscillations in the predicted object state values, a hybrid model combining gradient ascent and Bayesian random sampling is developed. The core of this approach lies in leveraging random sampling to enhance the diversity and randomness of candidate bounding boxes, while maximum a posteriori (MAP) estimation improves the accuracy of the predicted object state. Additionally, IoU features are introduced as the key metric for optimizing and predicting object bounding boxes, and extensive experiments are conducted across various datasets and tracking methods. The results show that the method significantly improves object tracking performance and verify its efficiency and applicability in the task of object state estimation.

The main contributions of this paper are:

1) We propose a dense sampling method aimed at overcoming the iterative limitations of traditional optimization algorithms, enabling global optimization of the model.
2) We introduce a hybrid optimization strategy based on Bayesian random sampling principles and the gradient ascent algorithm to mitigate oscillation issues in object state prediction, resulting in more precise object bounding boxes.
3) We validate the effectiveness and compatibility of the proposed method on multiple datasets. Experimental results demonstrate that the method achieves outstanding performance in object tracking state estimation tasks.

Related work

IoU

In recent years, the IoU feature has been widely used as a key metric for measuring bounding box overlap in visual perception and prediction systems. IoU is defined as the ratio between the overlapping area and the union area of two bounding boxes, with values ranging from 0 to 1. Its applications include bounding box regression loss optimization in object detection^28,29; trajectory matching and candidate box quality evaluation in visual tracking³⁰; multi-model detection result fusion and duplicate removal³¹; as well as object state estimation³².
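As a concrete illustration of the definition above, the following minimal Python sketch computes the IoU of two axis-aligned boxes. The corner-based box format `(x_min, y_min, x_max, y_max)` and the function name `iou` are assumptions made for this example; they are not taken from the paper's implementation.

```python
from typing import Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Ratio of the overlapping area to the union area of two boxes."""
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate boxes with zero total area.
    return inter / union if union > 0 else 0.0
```

For instance, two 2×2 boxes offset by one unit in each direction overlap in an area of 1 and have a union of area 7, so their IoU is 1/7.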

To improve the accuracy of object state estimation, a variety of studies have explored integrating and extending IoU features. Overall, these methods can be categorized into the following three types:

Joint Modeling of IoU Features with Spatio-Temporal Information: These methods combine IoU features with appearance, motion, and other Spatio-Temporal features as joint inputs to the model^33,34. For example, NeighborTrack³⁴ incorporates IoU as a spatial location feature into a bipartite graph-based matching framework, enhancing object matching discrimination and stability. However, such approaches suffer from convergence instability due to feature conflicts and neighbor dependency.

Introducing Spatial Uncertainty Modeling Mechanisms: These methods introduce spatial uncertainty evaluation into IoU-based state estimation to refine bounding box estimation^35,36,37. For instance, SSUTracker³⁷ improves the robustness and accuracy of bounding box estimation by combining weighted GIoU with spatial uncertainty scores to predict more accurate object positions. Nevertheless, IoU as a distance metric provides limited information, making it difficult to handle complex scenarios.

State Estimation Mechanisms Based on IoU Prediction and Fusion: These methods explicitly or implicitly use IoU prediction results to optimize object bounding boxes^{25,26,38,39,40,41}. For example, PrDiMP⁴⁰ extends DiMP by incorporating KL divergence to model the distribution of IoU predictions, mitigating fluctuations caused by single-point predictions. However, its static single-step optimization structure struggles to dynamically adapt to spatio-temporal variations in object states. KYS⁴¹ implicitly integrates IoU-equivalent spatial overlap features through state vector propagation to achieve object region discrimination and localization optimization. Yet, its state updating depends heavily on previous frame estimations, leading to error accumulation and amplification.

In summary, existing methods have improved the accuracy of bounding box prediction to a certain extent. However, due to the non-convexity of the loss function and dependency on initial bounding boxes, there remains a risk of falling into local optima. To address this issue, our method takes IoU features as the primary metric for object state optimization and prediction, introducing a dense sampling strategy combined with sample diversity modeling mechanisms to generate high-quality candidate boxes.

Object bounding box estimation

Inspired by object detection methods⁴², SiamRPN²⁴ and its extension SiamRPN++⁴³ adopt an RPN-based regression framework, using classification and regression branches with anchor boxes to predict object position and size. While these methods demonstrate strong performance, the decoupling between classification and regression leads to suboptimal matching when handling large scale variations or objects with irregular shapes. To overcome the limitations of anchor-based designs, anchor-free approaches such as SiamBAN⁴⁴, SiamCAR⁴⁵, and DEST⁴⁶ utilize fully convolutional networks (FCN) to estimate object confidence scores and predict bounding boxes. These methods improve prediction accuracy to some extent, but still suffer from regression instability. ATOM²⁵ and DiMP²⁶ further extend anchor-free strategies by introducing IoU prediction networks for posterior gradient-based optimization. By averaging multiple high-IoU candidates, these methods refine bounding box estimation. However, gradient instability remains a challenge during the parameter search process, affecting the consistency and convergence of results. In recent years, Transformer-based models have been widely applied for direct bounding box regression^{27,47,48,49,50,51,52,53,54}. For example, Trtr⁴⁷ employs a Transformer encoder–decoder structure, TransT⁴⁹ integrates ResNet-50 with attention mechanisms, and SeqTrack²⁷ predicts bounding box sequences through an end-to-end autoregressive method. These approaches effectively leverage temporal information to enhance prediction performance. Nevertheless, due to model complexity and real-time constraints, achieving stable convergence remains challenging. To further improve bounding box quality, E.T.Track⁵⁵ and IoUformer³² incorporate IoU prediction networks into Transformer-based tracking frameworks. By filtering confidence scores and refining bounding box regression, they enhance prediction accuracy. 
However, constrained by optimization complexity and real-time requirements, the models still suffer from convergence instability. In summary, the aforementioned tracking algorithms have improved tracking performance and robustness to a certain extent. However, existing optimization algorithms struggle to reach global optima within a limited number of iterations and often oscillate near optimal solutions. As a result, the predicted bounding boxes may not fully reflect the actual object boundaries. To address these issues, this paper proposes a hybrid model that integrates gradient ascent⁵⁶ with Bayesian random sampling⁵⁷ to optimize object state estimation through dense sampling.

Proposed method

The process of visual object tracking is divided into two stages: object localization and bounding box optimization. To clearly describe the proposed bounding box optimization algorithm, we assume that the prediction network obtains the object prediction spectrogram M during the object localization phase, and that the preliminary prediction of the object state is $B_{t}=(C,W,H)$, where C denotes the center coordinates, W denotes the width, and H denotes the height. $M_{B_{t}}$ can be expressed as:

$$\begin{aligned} M_{B_{t}} = g\big (a(X_{0},B_{0}),z(X_{t},B_{t-1})\big ), \end{aligned}$$

(1)

Here, a represents the reference vector of the object, derived from $X_{0}$ and $B_{0}$, where $X_{0}$ denotes the backbone features extracted from frame 0, and $B_{0}$ represents the object state in frame 0. z denotes the modulation vector computed from $X_{t}$ and $B_{t-1}$, while g is the prediction network that estimates the object state based on the reference vector and the modulation vector.

To achieve more accurate prediction of the object state, the paper proposes an optimization algorithm based on Bayesian random sampling that further refines $B_{t}$, building upon the predicted spectrogram M. The optimization process is as follows. First, starting from the preliminarily predicted object state $B_{pre}$, dense sampling is performed to generate a set of candidate object states, and the IoU network produces a confidence score for each state. Then, a few steps of gradient ascent are used to optimize the confidence scores of the object states, improving the accuracy of each state evaluation, and several optimal object states are selected. Finally, Bayesian random sampling is applied to the optimized object state set to generate a randomly sampled state set, from which the maximum a posteriori rule yields the final object state. The detailed process is illustrated in Fig. 1.

Dense sampling

Based on the given object bounding box, traditional object state optimization algorithms tend to cause the bounding box state $B_{t}$ at frame t to fall into a local optimum during the initial sampling optimization phase. To address this issue, we propose a dense sampling method that increases sample diversity and effectively achieves global optimization. To better describe the process of dense sampling, we define the object state space as $\mathbb {R}^4$ and the preliminary bounding box in frame t as $B_{pre}= (x/w, y/h, \log w, \log h)$, where (x, y) represents the center coordinates of the bounding box, and (w, h) represent its width and height, respectively. Starting from $B_{pre}$, random perturbations are applied to sample the center coordinates and the dimensions of the bounding box separately. While the center coordinates are sampled, the width and height of the bounding box remain unchanged.

We utilize a center-point perturbation method that samples q center points around the center on predefined mesh points, and define the bounding box scale factor s that determines the interval density of the predefined mesh. The sampling formula for center coordinates $C(x_i, y_i)$ can be expressed as:

$$\begin{aligned} \begin{aligned} x_{i}&=x/w+o_{x_{i}} \cdot s_{x} , \, \\ y_{i}&=y/h+o_{y_{i}} \cdot s_{y} , \end{aligned} \end{aligned}$$

(2)

where, $o_{x_{i}},o_{y_{i}}\in \{-1,0,1\}$. We perform random sampling of the width and height of the bounding box based on the approximate square side length $l=\sqrt{w \cdot h}$ of the initial bounding box, which is done to improve the robustness of the algorithm to variations in bounding box dimensions and enhance sample diversity. The width and height of the bounding box after random sampling are expressed as:

$$\begin{aligned} \begin{aligned} w_{i}&=\log w+r_{w_{i}} , \, \\ h_{i}&=\log h+r_{h_{i}} , \end{aligned} \end{aligned}$$

(3)

where $r_{w_{i}},r_{h_{i}}$ are random variables sampled from a uniform distribution on $[-0.5\delta l,0.5\delta l ]$, and $\delta$ is a predefined perturbation ratio. Sampling the center points and bounding box sizes in this way generates a dense sampling set $B=\{B_{1},B_{2},\ldots ,B_{q}\}$.
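The sampling rules in Eqs. (2) and (3) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: it assumes that all nine offset pairs from $\{-1,0,1\}^2$ are enumerated (so that q = 9), and the function name `dense_samples` and the example values of `s_x`, `s_y`, and `delta` are hypothetical.

```python
import math
import random
from itertools import product


def dense_samples(x, y, w, h, s_x, s_y, delta, rng):
    """Generate candidate states (x_i, y_i, w_i, h_i) following Eqs. (2)-(3).

    Assumption of this sketch: every offset pair in {-1, 0, 1}^2 is used,
    which yields q = 9 candidates.
    """
    side = math.sqrt(w * h)          # approximate square side length l
    half = 0.5 * delta * side        # half-width of the uniform interval
    samples = []
    for o_x, o_y in product((-1, 0, 1), repeat=2):
        x_i = x / w + o_x * s_x                        # Eq. (2)
        y_i = y / h + o_y * s_y
        w_i = math.log(w) + rng.uniform(-half, half)   # Eq. (3)
        h_i = math.log(h) + rng.uniform(-half, half)
        samples.append((x_i, y_i, w_i, h_i))
    return samples


# Hypothetical example values, chosen only for illustration.
candidates = dense_samples(100.0, 50.0, 20.0, 10.0,
                           s_x=0.1, s_y=0.1, delta=0.01,
                           rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the perturbations reproducible, which is convenient when comparing runs.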

Optimization of bounding box

For simplicity, this paper utilizes the IoU-prediction network²⁵ to generate confidence scores based on $B_{i}$ for optimizing object candidate bounding boxes. The objective is to maximize the confidence scores predicted by the IoU-network, and the objective function is as follows:

$$\begin{aligned} S_{i}=IoU(B_{i}). \end{aligned}$$

(4)

To enhance the accuracy and reliability of the object candidate boxes, we employ a stochastic gradient ascent-based method⁵⁸ to optimize the confidence scores of the object states. Through a limited number of iterative optimizations, the positions and sizes of each object bounding box in the sample set B are progressively refined to maximize the degree of alignment between the object states and the actual object.

We utilize gradient updates to iteratively adjust the parameters of the bounding box. In each iteration, the gradient of the IoU-prediction network is calculated for each candidate bounding box to update the object parameters. The formula is as follows:

$$\begin{aligned} B_{i}^{(j+1)}=B_{i}^{(j)}+\alpha _{j} \cdot \nabla IoU(B_{i}^{(j)}), \end{aligned}$$

(5)

Here, $B_{i}^{(j)}$ represents the $i$-th sample state at the $j$-th iteration, $\alpha _{j}$ is the learning rate, and $\nabla IoU(B_{i}^{(j)})$ represents the gradient of the objective function evaluated at the current state. The gradient components of IoU with respect to $B_{i}$ are expressed as:

$$\begin{aligned} \nabla IoU(B_{i})=\left( \frac{\partial IoU}{\partial (x_{i}/w_{i})},\frac{\partial IoU}{\partial (y_{i}/h_{i})},\frac{\partial IoU}{\partial \log w_{i}},\frac{\partial IoU}{\partial \log h_{i}}\right) . \end{aligned}$$

(6)

The entire optimization process undergoes $\lambda$ iterations, with the bounding box from each iteration serving as the starting point for the subsequent optimization. The set of object state samples after a limited number of optimization steps is denoted as $B^{'}=\{ B_{1}^{'},B_{2}^{'},\ldots ,B_{q}^{'}\}$.
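The update rule in Eq. (5) can be illustrated with a small, self-contained Python sketch. Several simplifications are assumed here and are not part of the paper: the IoU-prediction network is replaced by the analytic IoU against a hypothetical reference box, gradients are approximated by central finite differences instead of backpropagation, and the state is written as $(c_x, c_y, \log w, \log h)$ in pixel units rather than in the normalized form used above. All function names are hypothetical.

```python
import math


def state_to_box(state):
    """Convert (cx, cy, log w, log h) to corner form (x0, y0, x1, y1)."""
    cx, cy, log_w, log_h = state
    w, h = math.exp(log_w), math.exp(log_h)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a, b):
    """Analytic IoU of two corner-form boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def numerical_grad(f, state, eps=1e-4):
    """Central-difference approximation of the gradient of f at state."""
    grad = []
    for k in range(len(state)):
        up, down = list(state), list(state)
        up[k] += eps
        down[k] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def gradient_ascent(state, f, alpha, steps):
    """Apply B^(j+1) = B^(j) + alpha * grad f(B^(j)) for a fixed number of steps."""
    state = list(state)
    for _ in range(steps):
        grad = numerical_grad(f, state)
        state = [v + alpha * g for v, g in zip(state, grad)]
    return state


# Hypothetical reference box standing in for the IoU-prediction network.
target = (10.0, 10.0, 30.0, 30.0)
score = lambda s: iou(state_to_box(s), target)

start = (22.0, 21.0, math.log(18.0), math.log(16.0))
refined = gradient_ascent(start, score, alpha=0.05, steps=10)
```

Each step uses the state produced by the previous one as its starting point, mirroring the iteration described above; the step size and step count in the example are illustrative and differ from the settings reported in the experiments.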

Random sampling

The gradient ascent optimization method causes the prediction of the object to gradually approach the optimal solution. However, two issues arise: first, it often requires many optimization steps; second, in the later stages of optimization, it tends to oscillate near the optimal solution. To address these challenges, the BRSO method uses only a few optimization steps to quickly approximate the optimal solution and obtain an initial prediction of the object. Subsequently, Bayesian random sampling is employed to refine the object state estimate for improved accuracy.

Given the set of object state samples $B^{'}=\{B_{1}^{'},B_{2}^{'},\ldots ,B_{q}^{'}\}$, we select the n best states from the sample set as priors for prediction, $F=\{f_{1},f_{2},\ldots ,f_{n}\},1<n<q$, where n represents the number of selected object states. For each element $f_{i}$ in the state set, we generate m new object states, denoted as $\Theta =\{\theta _{1},\theta _{2},\ldots ,\theta _{m}\}$. To enhance the diversity of the object states, Gaussian noise is added to the object states, specifically formulated as:

$$\begin{aligned} \theta _{j}=f_{i}+N_{j}(c_{x},c_{y},d_{w},d_{h}),\quad 1\le j\le m , \end{aligned}$$

(7)

Here, $N_{j}$ represents the four-dimensional Gaussian noise added to the $j$-th sample, and $(c_{x},c_{y})$ and $(d_{w},d_{h})$ represent the position noise and size noise, respectively. The noise in each dimension follows a Gaussian distribution $G(\mu ,\sigma ^{2})$, where $\mu$ and $\sigma ^{2}$ denote the mean and variance of the Gaussian noise. Finally, the posterior probability of the new object state is obtained using Bayes’ theorem⁵⁷ with the prior samples, expressed as:

$$\begin{aligned} p(f_{i}|\theta _{j}^{(i)})=\frac{p(\theta _{j}^{(i)}|f_{i})p(f_{i})}{p(\Theta )}, \end{aligned}$$

(8)

Since all elements in the set F have the same prior probability, the posterior probability is given by:

$$\begin{aligned} p(f_{i}|\theta _{j}^{(i)})=\frac{p(\theta _{j}^{(i)}|f_{i})}{\sum _{k=1}^{n}{p(\theta _{j}^{(i)}|f_{k})}}, \end{aligned}$$

(9)
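The step from Eq. (8) to Eq. (9) can be made explicit. As a sketch, assume that the evidence term in Eq. (8) is the marginal probability of the sampled state $\theta _{j}^{(i)}$, expanded by the law of total probability over the n prior states, and that each prior state has probability $p(f_{k})=1/n$. The summation index is written as k here to keep it distinct from the fixed index i:

$$\begin{aligned} p(\theta _{j}^{(i)})&=\sum _{k=1}^{n}p(\theta _{j}^{(i)}|f_{k})\,p(f_{k})=\frac{1}{n}\sum _{k=1}^{n}p(\theta _{j}^{(i)}|f_{k}), \\ p(f_{i}|\theta _{j}^{(i)})&=\frac{p(\theta _{j}^{(i)}|f_{i})\cdot \frac{1}{n}}{\frac{1}{n}\sum _{k=1}^{n}p(\theta _{j}^{(i)}|f_{k})}=\frac{p(\theta _{j}^{(i)}|f_{i})}{\sum _{k=1}^{n}p(\theta _{j}^{(i)}|f_{k})}. \end{aligned}$$

The uniform prior therefore cancels, and the posterior reduces to a likelihood normalized over the n prior states.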

Therefore, for all object states in the set F, the posterior probability set $E=\{p(f_{1}|\theta _{1}^{(1)}),\ldots ,p(f_{1}|\theta _{m}^{(1)}),\ldots , p(f_{n}|\theta _{1}^{(n)}),\ldots ,p(f_{n}|\theta _{m}^{(n)})\}$ can be obtained by applying the Bayes’ theorem to randomly sampled object states.

The final object state is selected from the posterior probability set E using the Maximum A Posteriori (MAP) criterion:

$$\begin{aligned} \hat{\theta }=\underset{\theta }{\arg \max }\{p(f_{1}|\theta _{1}^{(1)}),\ldots ,p(f_{1}|\theta _{m}^{(1)}),\ldots , p(f_{n}|\theta _{1}^{(n)}),\ldots ,p(f_{n}|\theta _{m}^{(n)})\}. \end{aligned}$$

(10)
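The random sampling and MAP selection steps can be sketched as follows. The paper does not spell out the form of the likelihood $p(\theta |f)$, so this sketch assumes it is the density of the same per-dimension Gaussian noise model used to generate the samples in Eq. (7). The function names, the example prior states, and the noise parameters are hypothetical.

```python
import math
import random


def gaussian_likelihood(theta, f, mu, sigma):
    """Density of theta under theta = f + N, with N ~ G(mu, sigma^2) per dimension.

    Assumed likelihood model for this sketch; not specified in the paper.
    """
    log_p = 0.0
    for t, c in zip(theta, f):
        z = (t - c - mu) / sigma
        log_p += -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))
    return math.exp(log_p)


def map_refine(priors, m, mu, sigma, rng):
    """Sample m noisy states per prior (Eq. 7) and return the MAP state (Eq. 10)."""
    best_theta, best_post = None, -1.0
    for i, f_i in enumerate(priors):
        for _ in range(m):
            # Eq. (7): perturb the prior state with Gaussian noise.
            theta = tuple(c + rng.gauss(mu, sigma) for c in f_i)
            # Eq. (9): likelihood normalized over all prior states.
            liks = [gaussian_likelihood(theta, f_k, mu, sigma) for f_k in priors]
            evidence = sum(liks)
            post = liks[i] / evidence if evidence > 0 else 0.0
            if post > best_post:
                best_theta, best_post = theta, post
    return best_theta, best_post


# Hypothetical prior states in (x/w, y/h, log w, log h) form.
priors = [(5.00, 5.00, 3.00, 2.30),
          (5.10, 5.00, 3.00, 2.30),
          (5.00, 5.10, 3.05, 2.30)]
theta_hat, posterior = map_refine(priors, m=50, mu=0.0, sigma=0.05,
                                  rng=random.Random(0))
```

Working with the log-density inside `gaussian_likelihood` and exponentiating once at the end is a small numerical-stability choice for this sketch.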

Experimentation

The experiments were conducted using Python and the PyTorch framework on an NVIDIA 2080 GPU. The ResNet-18 architecture was selected as the backbone network. The proposed optimization method, BRSO, was integrated into representative object tracking algorithms, including KYS⁴¹, DiMP²⁶, PrDiMP⁴⁰, and SuperDiMP, for comparative analysis. The experimental results are analyzed in depth on four publicly available datasets: OTB-100⁵⁹, UAV123⁶⁰, VOT2018⁶¹, and Temple-color-128⁶².

Because tracking involves randomness, the average of the results from three runs is reported. In this experiment, 9 dense bounding boxes were selected, and 10 iterations of optimization were performed for each dense sample, with an initial learning rate $\alpha$ set to 0.01. Three optimal states were selected based on the optimization results, and 50 random sample boxes were generated for each optimal state. Gaussian noise was added to each random sample, following a two-dimensional Gaussian distribution $G\left( 0.8,0.18^2 \right)$.
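For reference, the hyperparameters reported above can be collected in a small configuration object. This is only an organizational sketch: the class name, field names, and the two derived per-frame counts are hypothetical and do not come from the authors' code; the numeric values are those stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BRSOSettings:
    """Hyperparameters as reported in the experimental setup (names are hypothetical)."""
    dense_boxes: int = 9                 # number of dense bounding boxes (q)
    ascent_iterations: int = 10          # gradient ascent iterations per dense sample
    learning_rate: float = 0.01          # initial learning rate alpha
    top_states: int = 3                  # optimal states kept as priors (n)
    random_samples_per_state: int = 50   # random sample boxes per optimal state (m)
    noise_mean: float = 0.8              # Gaussian noise mean
    noise_std: float = 0.18              # Gaussian noise standard deviation

    @property
    def gradient_updates(self) -> int:
        """Rough count of per-frame gradient updates: boxes times iterations."""
        return self.dense_boxes * self.ascent_iterations

    @property
    def random_candidates(self) -> int:
        """Per-frame number of randomly sampled candidate states: n times m."""
        return self.top_states * self.random_samples_per_state


settings = BRSOSettings()
```

Under these settings, each frame involves 90 gradient updates and 150 randomly sampled candidates, which gives a rough sense of the per-frame workload.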

Qualitative analysis

To validate the effectiveness of the proposed BRSO algorithm, five representative sequences from the OTB-100 benchmark⁵⁹ (Soccer, Bolt, Coupon, Ironman, and Subway) were selected to evaluate and compare the tracking performance of the baseline SuperDiMP algorithm and the BRSO-enhanced SuperDiMP (SuperDiMP+BRSO). In the SuperDiMP algorithm, IoU features and the steepest descent method are used to select the top three candidates with the highest IoU, and their average is taken to estimate the object state. In this study, SuperDiMP+BRSO replaces this with a MAP-based object state estimation using Bayesian random sampling. Figure 2 illustrates examples of tracking results on the OTB-100 dataset. In the Bolt sequence, fast motion and continuous scale changes, combined with interference from similar objects, cause SuperDiMP to produce delayed and inaccurate state updates. Similarly, in the Coupon sequence, distractor objects lead to state estimation failures. After integrating BRSO, both the Bolt and Coupon sequences exhibit notably improved object state estimation and tracking accuracy, indicating that the proposed method adapts well to complex scenarios. Furthermore, in the Soccer sequence, a combination of lighting changes, motion blur, and occlusion causes SuperDiMP to generate unstable object state predictions. In the Ironman sequence, drastic fluctuations in lighting intensity result in SuperDiMP misestimating the object state, mistakenly identifying the shoulder region as the object. In the Subway sequence, when the walking object is occluded, SuperDiMP tends to misidentify similar objects as the object, leading to tracking failure. By comparison, SuperDiMP+BRSO demonstrates consistently superior tracking performance across the Soccer, Ironman, and Subway sequences. This improvement is primarily attributed to the Bayesian random sampling mechanism introduced by BRSO, which enhances both the diversity and discriminative ability of candidate states. In summary, the BRSO optimization strategy enhances both the accuracy and robustness of the tracker under complex scenarios, confirming its effectiveness and applicability in visual object tracking.

Quantitative analysis

This section presents a quantitative analysis of the proposed algorithm on the OTB-100⁵⁹, UAV123⁶⁰, VOT2018⁶¹, and Temple-color-128⁶² datasets. Success rate is adopted as the primary evaluation metric, while precision and normalized precision are also reported for a comprehensive assessment of performance.

VOT2018⁶¹: The VOT2018 dataset, released in 2018, consists of 60 video sequences encompassing diverse scenarios and challenges, including indoor, outdoor, daytime, and nighttime conditions. On the VOT2018 dataset, we integrated the proposed BRSO optimization algorithm into SuperDiMP, KYS⁴¹, DiMP²⁶, and PrDiMP⁴⁰ tracking algorithms to evaluate its performance. The evaluation results are shown in Fig. 3. After incorporating the BRSO optimization algorithm, the SuperDiMP algorithm achieved improvements in success rate, normalized precision, and precision by 1.6%, 2.5%, and 2.5%, respectively. Additionally, the BRSO optimization algorithm demonstrated performance gains across various other tracking architectures. Specifically, the success rate of the DiMP algorithm was improved by 1.3% and its precision was improved by 2%. The PrDiMP algorithm improved its success rate by 0.9% and normalized precision by 0.8%. For the KYS tracker, the success rate improved by 0.4% on the VOT2018 dataset after applying the BRSO optimization algorithm. In summary, the BRSO optimization algorithm exhibited outstanding performance across the aforementioned trackers, effectively enhancing their tracking capabilities.

OTB-100⁵⁹: The OTB-100 dataset contains 100 video sequences representing common tracking scenarios. On this dataset, the BRSO optimization algorithm was integrated into four representative trackers (KYS⁴¹, DiMP²⁶, PrDiMP⁴⁰ and SuperDiMP) for comparative evaluation. Figure 4 illustrates the evaluation results. The most significant improvement was observed with the DiMP tracker, where the success rate increased by 1.4% and the precision improved by 2.4% after incorporating BRSO. Similarly, the SuperDiMP tracker’s success rate rose from 69.7% to 70.7%, accompanied by a 1.6% gain in precision. For PrDiMP, a 0.6% improvement in success rate and a 0.4% increase in precision were recorded. In conclusion, on the OTB-100 dataset, the BRSO optimization algorithm yielded moderate improvements in the performance metrics of the aforementioned trackers.

Table 1 Success rate, precision, and normalized precision evaluation results of various algorithms on the Temple-color128 Dataset.

Temple-color128⁶²: The Temple-Color128 dataset, released by Temple University, consists of 128 video sequences. In this experiment, the proposed BRSO optimization algorithm was integrated into the DiMP²⁶, KYS⁴¹, PrDiMP⁴⁰, and SuperDiMP frameworks for evaluation. The evaluation results are summarized in Table 1. On this dataset, the BRSO optimization algorithm demonstrated significant improvements over the baseline optimization methods in the aforementioned tracking frameworks. For instance, the SuperDiMP tracker showed a 2.5% increase in success rate, a 3% improvement in precision, and a 3.8% boost in normalized precision, leading to a substantial enhancement in tracking performance. The algorithm also delivered notable performance gains in the KYS framework, with a 1.8% rise in success rate, a 1.9% increase in precision, and a 1.6% improvement in normalized precision. For the other two frameworks, DiMP and PrDiMP, the success rates increased by 1.2% and 1.6%, respectively. In summary, the BRSO algorithm achieved consistent and significant performance improvements on the Temple-Color128 dataset, significantly improving the performance metrics of the evaluated trackers.

UAV123⁶⁰: The UAV123 dataset, consisting of 123 high-definition video sequences captured from a low-altitude aerial perspective, was used to evaluate the proposed BRSO optimization algorithm. In this experiment, the algorithm was integrated into the KYS⁴¹, SuperDiMP, PrDiMP⁴⁰, and DiMP²⁶ trackers for performance evaluation. The evaluation metrics and results are presented in Fig. 5. On this dataset, while the BRSO optimization algorithm did not demonstrate substantial improvements over the aforementioned trackers, it still achieved measurable gains. Specifically, both success rate and accuracy were enhanced by over 0.1%.

Ablation study

To analyze the roles of the dense sampling module and the random sampling module within the BRSO framework, ablation experiments were conducted on the OTB-100⁵⁹ dataset. The complete BRSO structure was taken as the baseline and integrated into the SuperDiMP algorithm. Comparative analyses were then performed by selectively removing either the dense sampling module or the random sampling module. The experimental results are shown in Table 2.

Table 2 Ablation Results of Dense Sampling and Random Sampling.

Analysis of the Dense Sampling Module: SuperDiMP+BRSO_Dense retains the original bounding box refinement module, while SuperDiMP+BRSO introduces the dense sampling strategy from the BRSO framework. The experimental results demonstrate that SuperDiMP+BRSO outperforms SuperDiMP+BRSO_Dense in both success rate and normalized precision. The dense sampling module effectively enhances the global search capability in bounding box estimation. It achieves this by increasing the spatial coverage and diversity of candidate samples, which helps reduce the risk of falling into local optima.

Analysis of the Random Sampling Module: In SuperDiMP+BRSO_Rand, the random sampling module was removed while retaining the dense sampling module. Compared with SuperDiMP+BRSO, the success rate decreased by 0.4%, precision decreased by 0.5%, and normalized precision declined by 0.2%. These results demonstrate that the random sampling module plays a critical role in refining candidate bounding boxes during optimization by mitigating convergence instability, thereby enabling more accurate and stable predictions.

Convergence analysis of MAP

To verify the optimization behavior of the proposed object state optimization algorithm in multi-frame object tracking tasks, we further analyze its performance under different sampling numbers and iteration settings. In this experiment, the BRSO optimization algorithm is integrated into the SuperDiMP tracking framework and evaluated on the OTB-100 dataset. By default, the number of samples is set to 50 and the maximum number of iterations to 10. The experimental results are summarized in Table 3 and Table 4.

Table 3 Tracking performance under different random sampling sizes (m).

Table 4 Tracking performance under different optimization iterations (iter).

The experimental results indicate that increasing the number of samples leads to an overall improvement in both success rate and precision, with performance peaking at $m = 120$ before slightly declining. This suggests that excessive sampling may introduce noise or redundant particles, thereby affecting the stability of bounding box estimation. Moreover, a higher sampling number significantly reduces the frame rate, which drops to only 9 FPS at $m = 300$, highlighting substantial computational overhead. When the number of samples is fixed, the number of optimization iterations also influences performance. The best results are observed around $iter = 9$, beyond which performance begins to slightly degrade. The normalized precision remains relatively stable across different settings, while the frame rate consistently decreases as the number of iterations increases. In summary, the BRSO optimization process exhibits a clear convergence trend within an appropriate parameter range, yet reveals a notable trade-off between accuracy and computational efficiency. Therefore, careful selection of sampling number and iteration count is essential for improving the stability and practical applicability of the algorithm.

Conclusions

This paper proposes an object state optimization algorithm based on Bayesian random sampling for visual object tracking, aimed at improving object state estimation and bounding box optimization. By using dense sampling, the framework enhances the diversity of samples, effectively mitigating the limitations of the optimization algorithm and achieving global optimization. The hybrid architecture, combining Bayesian random sampling with gradient ascent, increases the diversity and randomness of candidate boxes, enabling more accurate object state predictions with fewer optimization steps. Experiments demonstrate that the optimized framework improves the success rate of object bounding box prediction on four challenging benchmark datasets, validating the effectiveness and compatibility of the BRSO optimization algorithm.

Data availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

References

1. Marvasti-Zadeh, S. M., Cheng, L., Ghanei-Yakhdan, H. & Kasaei, S. Deep learning for visual tracking: A comprehensive survey. IEEE Transactions on Intell. Transp. Syst. 23, 3943–3968 (2021).
2. Kugarajeevan, J., Kokul, T., Ramanan, A. & Fernando, S. Transformers in single object tracking: an experimental survey. IEEE Access (2023).
3. Lin, B. et al. Motion-aware correlation filter-based object tracking in satellite videos. IEEE Transactions on Geosci. Remote. Sens. 62, 1–13 (2024).
4. Xue, C. et al. Similarity-guided layer-adaptive vision transformer for uav tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 6730–6740 (2025).
5. Agrawal, H., Halder, A. & Chattopadhyay, P. A systematic survey on recent deep learning-based approaches to multi-object tracking. Multimed. Tools Appl. 83, 36203–36259 (2024).
6. Liu, Q., Li, Y., Jiang, Y. & Fu, Y. Siamese-detr for generic multi-object tracking. IEEE Transactions on Image Process. (2024).
7. Zhao, K., Zhao, H., Wang, Z., Peng, J. & Hu, Z. Object-preserving siamese network for single-object tracking on point clouds. IEEE Transactions on Multimed. (2023).
8. He, X. & Chen, C.Y.-C. Enhancing discriminative appearance model for visual tracking. Expert. Syst. with Appl. 219 (2023).
9. Touil, D. E., Terki, N. & Medouakh, S. Learning spatially correlation filters based on convolutional features via pso algorithm and two combined color spaces for visual tracking. Appl. Intell. 48, 2837–2846 (2018).
10. Gong, X., Zhang, Y. & Hu, S. Asaformer: Visual tracking with convolutional vision transformer and asymmetric selective attention. Knowl. Based Syst. 291, 111562. https://doi.org/10.1016/J.KNOSYS.2024.111562 (2024).
11. Hu, K. et al. Sequential fusion based multi-granularity consistency for space-time transformer tracking. In Proceedings of the AAAI Conference on Artificial Intelligence 38, 12519–12527 (2024).
12. Yu, B. et al. High-performance discriminative tracking with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, 9856–9865 (2021).
13. Deng, Y. et al. Gradually spatio-temporal feature activation for target tracking. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3925–3929 (IEEE, 2024).
14. Xu, T., Feng, Z., Wu, X.-J. & Kittler, J. Toward robust visual object tracking with independent target-agnostic detection and effective siamese cross-task interaction. IEEE Transactions on Image Processing 32, 1541–1554 (2023).
15. Rahman, M. M. & Hammond, T. Learning random noise salient feature fusion siamese network for low-resolution object tracking (student abstract). In Proceedings of the AAAI Conference on Artificial Intelligence 38, 23626–23627 (2024).
16. Wang, L. & Pan, C. Visual object tracking via a manifold regularized discriminative dual dictionary model. Pattern Recognit. 91, 272–280 (2019).
17. Rahman, M. M. Target focused shallow transformer framework for efficient visual tracking. In Proceedings of the AAAI Conference on Artificial Intelligence 38, 23409–23410 (2024).
18. Gopal, G. Y. & Amer, M. A. Separable self and mixed attention transformers for efficient object tracking. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 6708–6717 (2024).
19. Moorthy, S. & Joo, Y. H. Adaptive spatial-temporal surrounding-aware correlation filter tracking via ensemble learning. Pattern Recognit. 139 (2023).
20. Xue, C., Zhong, B., Liang, Q., Xia, H. & Song, S. Unifying motion and appearance cues for visual tracking via shared queries. IEEE Transactions on Circuits Syst. for Video Technol. (2024).
21. Huang, M., Li, X., Hu, J., Peng, H. & Lyu, S. Tracking multiple deformable objects in egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1461–1471 (2023).
22. Talaoubrid, H., Hayat, K. & Magnier, B. Straightforward adaptation of particle filter to fish eye images for top view pedestrian tracking. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4300–4304 (IEEE, 2024).
23. Hui, T. et al. Bridging search region interaction with template for rgb-t tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13630–13639 (2023).
24. Li, B., Yan, J., Wu, W., Zhu, Z. & Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE conference on computer vision and pattern recognition, 8971–8980 (2018).
25. Danelljan, M., Bhat, G., Khan, F. S. & Felsberg, M. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019).
26. Bhat, G., Danelljan, M., Van Gool, L. & Timofte, R. Learning discriminative model prediction for tracking. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 6181–6190, https://doi.org/10.1109/ICCV.2019.00628 (2019).
27. Chen, X., Peng, H., Wang, D., Lu, H. & Hu, H. Seqtrack: Sequence to sequence learning for visual object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 14572–14581 (2023).
28. Zheng, Z. et al. Distance-iou loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI conference on artificial intelligence 34, 12993–13000 (2020).
29. Liu, C. et al. Powerful-iou: More straightforward and faster bounding box regression loss with a nonmonotonic focusing mechanism. Neural Networks 170, 276–284 (2024).
30. Toida, K., Kato, N., Segawa, O., Nakamura, T. & Hotta, K. Gr-iou: Ground-intersection over union for robust multi-object tracking with 3d geometric constraints. In Bue, A. D., Canton, C., Pont-Tuset, J. & Tommasi, T. (eds.) Computer Vision - ECCV 2024 Workshops - Milan, Italy, September 29-October 4, 2024, Proceedings, Part XV, vol. 15637 of Lecture Notes in Computer Science, 79–89, https://doi.org/10.1007/978-3-031-91581-9_6 (Springer, 2024).
31. Raghavan, D. & Selvi, S. S. Setnet: A sparse ensemble network for drone localization and zero shot drone tracking in real time surveillance videos. In 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 1–5 (IEEE, 2023).
32. Cai, H. et al. Iouformer: Pseudo-iou prediction with transformer for visual tracking. Neural Networks 170, 548–563. https://doi.org/10.1016/j.neunet.2023.10.055 (2024).
33. You, S., Yao, H., Bao, B.-K. & Xu, C. Utm: A unified multiple object tracking model with identity-aware feature enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 21876–21886 (2023).
34. Chen, Y.-H. et al. Neighbortrack: Single object tracking by bipartite matching with neighbor tracklets and its applications to sports. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5139–5148 (2023).
35. Solano-Carrillo, E. et al. Utrack: Multi-object tracking with uncertain detections. In Bue, A. D., Canton, C., Pont-Tuset, J. & Tommasi, T. (eds.) Computer Vision - ECCV 2024 Workshops - Milan, Italy, September 29-October 4, 2024, Proceedings, Part XVII, vol. 15639 of Lecture Notes in Computer Science, 219–236, https://doi.org/10.1007/978-3-031-91585-7_14 (Springer, 2024).
36. Lee, C. W. & Waslander, S. L. Uncertaintytrack: Exploiting detection and localization uncertainty in multi-object tracking. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 4946–4953, https://doi.org/10.1109/ICRA57147.2024.10610458 (2024).
37. Liao, P.-J., Huang, Y.-C., Chiang, C.-K. & Lai, S.-H. Robust multi-object tracking with spatial uncertainty. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2023).
38. Zhou, Z., Li, X., Fan, N., Wang, H. & He, Z. Target-aware state estimation for visual tracking. IEEE Transactions on Circuits and Systems for Video Technol. 32, 2908–2920 (2021).
39. Xu, Y., Wang, Z., Li, Z., Yuan, Y. & Yu, G. Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, 12549–12556, https://doi.org/10.1609/AAAI.V34I07.6944 (AAAI Press, 2020).
40. Danelljan, M., Gool, L. V. & Timofte, R. Probabilistic regression for visual tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 7183–7192 (2020).
41. Bhat, G., Danelljan, M., Van Gool, L. & Timofte, R. Know your surroundings: Exploiting scene information for object tracking. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16, 205–221 (Springer, 2020).
42. Jiang, B., Luo, R., Mao, J., Xiao, T. & Jiang, Y. Acquisition of localization confidence for accurate object detection. In Proceedings of the European conference on computer vision (ECCV), 784–799 (2018).
43. Li, B. et al. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019).
44. Chen, Z., Zhong, B., Li, G., Zhang, S. & Ji, R. Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 6668–6677 (2020).
45. Guo, D., Wang, J., Cui, Y., Wang, Z. & Chen, S. Siamcar: Siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 6269–6277 (2020).
46. Zhao, S., Xu, T., Wu, X.-J. & Kittler, J. Distillation, ensemble and selection for building a better and faster siamese based tracker. IEEE transactions on circuits and systems for video technology 34, 182–194 (2022).
47. Zhao, M., Okada, K. & Inaba, M. Trtr: Visual tracking with transformer. arXiv preprint arXiv:2105.03817 (2021).
48. Yu, Q., Ma, Y., He, J., Yang, D. & Zhang, T. A unified transformer based tracker for anti-uav tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3036–3046 (2023).
49. Chen, X. et al. High-performance transformer tracking. IEEE Transactions on Pattern Analysis Mach. Intell. 45, 8507–8523 (2022).
50. Xie, J. et al. Autoregressive queries for adaptive tracking with spatio-temporal transformers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, 19300–19309, https://doi.org/10.1109/CVPR52733.2024.01826 (IEEE, 2024).
51. Li, S. et al. Learning target-aware vision transformers for real-time uav tracking. IEEE Transactions on Geosci. Remote. Sens. 62, 1–18 (2024).
52. Ma, F. et al. Unified transformer tracker for object tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 8781–8790 (2022).
53. Zhao, J., Edstedt, J., Felsberg, M., Wang, D. & Lu, H. Leveraging the power of data augmentation for transformer-based tracking. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, 6469–6478 (2024).
54. Zhang, S., Liu, H., Lin, S. & He, K. You only need less attention at each stage in vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6057–6066 (2024).
55. Blatter, P., Kanakis, M., Danelljan, M. & Van Gool, L. Efficient visual tracking with exemplar transformers. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, 1571–1581 (2023).
56. Ruder, S. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 (2016).
57. Gelman, A., Carlin, J. B., Stern, H. S. & Rubin, D. B. Bayesian data analysis (Chapman and Hall/CRC, 1995).
58. Friedman, J. H. Greedy function approximation: a gradient boosting machine. Annals of statistics 1189–1232 (2001).
59. Wu, Y., Lim, J. & Yang, M.-H. Online object tracking: A benchmark. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2411–2418 (2013).
60. Mueller, M., Smith, N. & Ghanem, B. A benchmark and simulator for uav tracking. In European Conference on Computer Vision (2016).
61. Kristan, M. et al. The sixth visual object tracking VOT2018 challenge results. In Leal-Taixé, L. & Roth, S. (eds.) Computer Vision - ECCV 2018 Workshops - Munich, Germany, September 8-14, 2018, Proceedings, Part I, vol. 11129 of Lecture Notes in Computer Science, 3–53, https://doi.org/10.1007/978-3-030-11009-3_1 (Springer, 2018).
62. Liang, P., Blasch, E. & Ling, H. Encoding color information for visual tracking: Algorithms and benchmark. IEEE transactions on image processing 24, 5630–5644 (2015).

Acknowledgements

The authors thank the editor and the anonymous referees for their valuable comments. This work was supported by the Nature Science Foundation of Ningxia, People’s Republic of China (Nos. 2024AAC03310, 2023AAC03338), the Key Research and Development Program Project of Ningxia, People’s Republic of China (No. 2023BEG02072), the Hui Autonomous Region Education Department Higher School Scientific Research Project of Ningxia (No. NYG202420), the Higher Education Institution Scientific Research Project of Ningxia, People’s Republic of China (No. NYG2024164), and the Nature Science Foundation of China (Nos. 62262054, 62262033, 12261070).

Author information

Authors and Affiliations

The School of Mathematics and Computer Science, Ningxia Normal University, Guyuan, Ningxia, 756099, People’s Republic of China
Zhiqiang Zhao, Huijie Zhao, Daitu Wen, Tao Ma & Xiaoli Luo
Artificial Intelligence and Intelligent Medical Engineering Technology Research Center, Guyuan, Ningxia, 756099, People’s Republic of China
Zhiqiang Zhao, Tao Ma & Xiaoli Luo
The School of Information Science and Technology, University of Jiujiang, Jiujiang, Jiangxi, 332005, People’s Republic of China
Bin Wu

Contributions

Zhiqiang Zhao: Designed the research study, performed the statistical analysis, wrote the manuscript, and coordinated the project. Huijie Zhao: Collected and analyzed the data, contributed to the writing of the manuscript, and reviewed the manuscript. Daitu Wen: Designed the experiments, supervised the laboratory work, and reviewed the manuscript. Tao Ma: Provided reagents and materials, participated in data analysis, helped with the experimental design, and reviewed the manuscript. Xiaoli Luo: Performed the literature search, helped with data interpretation, and revised the manuscript. Bin Wu: Contributed to the conception of the study, provided administrative and technical support, and revised the manuscript.

Corresponding author

Correspondence to Zhiqiang Zhao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhao, Z., Zhao, H., Wen, D. et al. Object state optimization algorithm based on Bayesian random sampling for visual object tracking. Sci Rep 15, 37237 (2025). https://doi.org/10.1038/s41598-025-21033-2

Received: 03 March 2025
Accepted: 18 September 2025
Published: 24 October 2025
Version of record: 24 October 2025
DOI: https://doi.org/10.1038/s41598-025-21033-2