Introduction

Video surveillance, autonomous driving, and face recognition are widely used in contemporary society, and the demand for video analysis technology is increasing accordingly1. Video analysis tasks can be broadly classified into basic tasks and advanced tasks. Object detection, video de-noising, and video compression can be regarded as basic tasks, and Moving Objects Detection (MOD) is one of the most important research directions among them. Some advanced tasks, such as person re-identification and video semantic understanding, build on MOD, so the accuracy of the MOD algorithm largely determines the performance of these advanced algorithms.

With constant advances in sensor technology, high-quality cameras capture more detailed information from the scene, which increases the difficulty of MOD research. Furthermore, challenges such as moving backgrounds, rainy weather2, camouflage, tiny object detection, and varying lighting conditions further complicate MOD research3. MOD approaches generally fall into two categories: pixel-based approaches4 and frame-based approaches5. Pixel-based approaches include classic approaches such as Mixture of Gaussian (MOG)6, Support Vector Machine (SVM)7, K-means clustering, Fuzzy C-means clustering (FCM)8, and others9. However, pixel-based approaches often exhibit inconsistent performance and tend to misclassify foreground objects under noisy conditions, making frame-based methods more favorable than pixel-based methods. Candès et al. proposed Robust Principal Component Analysis (RPCA)10, which is an influential contribution among different MOD methods. This method decomposes video into two parts: a low rank background component and a sparse foreground component. However, this method fails to detect dynamic background effectively. The GRASTA method11, based on a subspace model, was proposed to overcome this issue by using an \(l_1\)-norm model to detect moving objects. The robust subspace tracking method has been proven successful in dynamic object detection, but there are still some problems that need to be addressed12.

Another MOD method, which combines Total Variation (TV) and RPCA, was proposed13. It utilizes the TV model to capture spatial and temporal relationships. However, it fails to detect tiny moving objects and struggles to capture spatial information effectively, because some information is lost when video frames are converted into low-dimensional matrices. Subsequently, the LR-\(l_1\)TV method14, which is also based on the total variation model, was proposed. This method can effectively detect static backgrounds and point objects but is less adept at handling noisy videos. A noise-robust model that combines tensor low rank approximation and tensor total variation (TTV) regularization was also proposed. It employs \(l_{1/2}\) regularization and TTV regularization to suppress dynamic backgrounds and extract foreground information. However, this method exhibits limited precision in background separation15.

Tensor-based models have been proven effective in extracting low rank information from high-order data16. In the same year, a Tucker-based model named RTCUR was proposed, in which CUR decomposition is employed to reduce the computational complexity of the resulting large-scale nonconvex problem17. However, its number of parameters scales exponentially with the tensor order. A multi-mode outlier-robust tensor ring decomposition (ORTRD) method was also proposed, which demonstrates a strong low rank representation capability on high-order data18. The tensor nuclear norm (TNN) is a conventional tool for solving low rank models, and the FC-TNN method provides new inspiration for solving the tensor-based RPCA model. It incorporates the Chebyshev polynomial approximation (CPA) method into the alternating direction method of multipliers (ADMM) algorithm, and the reported results show its efficiency19.

Supervised learning is another approach to MOD research. Convolutional Neural Networks (CNNs), You Only Look Once (YOLO), and Long Short-Term Memory (LSTM)20,21,22,23 are well-known deep learning models, and semi-supervised learning has also demonstrated strong potential in MOD research24. However, these algorithms require large-scale annotated databases, and the training process demands considerable time, computation, and electrical power.

Drawing inspiration from the aforementioned works, we develop a new MOD model, with the specific contributions as follows:

  • This paper proposes a novel tensor ring low rank decomposition-based method for moving objects detection, which enhances the ability to estimate the low rank information (static backgrounds).

  • The proposed method employs \(l_{1/2}\) regularization and tensor total variation model to capture dynamic foreground information. The TTV model is used to ensure the smoothness of moving objects, and the \(l_{1/2}\) regularization model is utilized to separate moving objects and dynamic background. To solve the proposed minimization model, we employ the augmented Lagrange multiplier (ALM) method in conjunction with the alternating direction method of multipliers (ADMM). The model is decomposed into several subproblems, each of which can be efficiently solved to iteratively approach the optimal solution.

  • We conducted a series of experiments under various real-world scenarios to evaluate the performance of the proposed method. The results verify its superiority over existing state-of-the-art (SOTA) approaches. Furthermore, the proposed method demonstrates better suitability for high-dimensional video data and exhibits increased robustness against different types of noise.

Tensor ring

Notation description

In this paper, calligraphic letters are used to represent an N-order tensor, such as \(\mathscr {X} \in \mathbb {R}^{I_1 \times I_2 \times \cdots \times I_N}\), where \(I_n\) denotes the dimension of the tensor along mode n, for \(n = 1, 2, \dots , N\). Bold uppercase letters (e.g., \(\textbf{A}\), \(\textbf{B}\)) indicate matrices, while bold lowercase letters (e.g., \(\textbf{x}\), \(\textbf{y}\)) represent vectors. The nuclear norm \(\Vert \textbf{X}\Vert _*\) represents the sum of the singular values of the matrix \(\textbf{X}\).

Tensor ring low rank decomposition

Tensor ring decomposition represents a high-dimensional N-order tensor as a series of low-dimensional tensors. Figure 1 illustrates that an N-order tensor can be represented by a sequence of third-order tensors, where the edges represent the tensor dimensions. Each mode is indicated by the edge numbers, which determine the multilinear product between two tensors, also known as tensor contraction.

Fig. 1. Tensor ring decomposition diagram.

With \((R_1, R_2, \dots , R_N)\) representing the ranks of the tensor ring, the decomposition yields a series of third-order core tensors, denoted as \(\mathscr {U}^{(n)} \in \mathbb {R}^{R_n \times I_n \times R_{n+1}}\), where \(R_1 = R_{N+1}\). The elements of tensor \(\mathscr {Y}\) can be expressed as \(\mathscr {Y}(i_1, i_2, \dots , i_N) = \text {Trace} (U^{(1)}_{i_1}U^{(2)}_{i_2} \cdots U^{(N)}_{i_N})\), where \(U^{(n)}_{i_n} \in \mathbb {R}^{R_n \times R_{n+1}}\) represents the \(i_n\)-th lateral slice of \(\mathscr {U}^{(n)}\), \(\text {Trace}(\cdot )\) represents the matrix trace operation, and \(Y _{(n)}\) represents the mode-n unfolding of the tensor \(\mathscr {Y}\). Yuan et al.25 defined another standard matrix unfolding of \(\mathscr {Y}\) associated with the n-th tensor core \(\mathscr {U}^{(n)}\), namely \(Y _{\langle n \rangle } \in \mathbb {R}^{I_n \times I_{n+1} \cdots I_N I_1 I_2 \cdots I_{n-1}}\), where \(Y _{\langle n \rangle } = U ^{(n)}_{(2)} \left( U ^{\ne n}_{\langle 2 \rangle } \right) ^{T}\). In this context, \(U ^{(n)}_{(i)}\) represents the mode-i unfolding of the n-th core tensor, so that \(U ^{(n)}_{(2)} \in \mathbb {R}^{I_n \times R_n R_{n+1}}\), and \(U ^{\ne n}_{\langle 2 \rangle }\) represents the mode-2 unfolding of the subchain obtained by merging all core tensors except the n-th one. For all \(n = 1, 2, \dots , N\), the rank relationship between the tensor ring and the corresponding core is as follows:

$$\begin{aligned} \text {Rank} \left( \textbf{Y}_{(n)} \right) \le \text {Rank} \left( \textbf{U}^{(n)}_{(2)} \right) \end{aligned}$$
(1)

Equation (1) shows that the rank of the mode-n unfolding of the tensor \(\mathscr {Y}\) is bounded by the rank of the mode-2 unfolding of the corresponding core tensor. In this paper, we therefore explore the low rank structure of tensors by imposing a low rank constraint on the tensor cores \(\mathscr {U}^{(n)}\).
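The trace formula for the tensor ring elements can be illustrated with a small NumPy sketch. It assumes that each core is stored with shape (R_n, I_n, R_{n+1}) and that the last rank equals the first so the ring closes. The function names `tr_element` and `tr_to_full` and the sizes used are illustrative assumptions, not part of the original method description.

```python
import numpy as np

def tr_element(cores, index):
    """Single entry Y(i_1, ..., i_N) as the trace of a product of core slices."""
    prod = np.eye(cores[0].shape[0])
    for core, i in zip(cores, index):
        prod = prod @ core[:, i, :]      # slice of shape (R_n, R_{n+1})
    return np.trace(prod)

def tr_to_full(cores):
    """Contract all cores into the full tensor of shape (I_1, ..., I_N)."""
    full = cores[0]                      # shape (R_1, I_1, R_2)
    for core in cores[1:]:
        # contract the shared rank index between neighbouring cores
        full = np.tensordot(full, core, axes=([full.ndim - 1], [0]))
    # full now has shape (R_1, I_1, ..., I_N, R_1); the trace closes the ring
    return np.trace(full, axis1=0, axis2=full.ndim - 1)

# Illustrative sizes only (assumed for this example)
rng = np.random.default_rng(0)
dims, ranks = (4, 5, 6), (2, 3, 2)
cores = [rng.standard_normal((ranks[n], dims[n], ranks[(n + 1) % 3]))
         for n in range(3)]
Y = tr_to_full(cores)
```

Both routines compute the same quantity: `tr_element` follows the element-wise formula literally, while `tr_to_full` builds the whole tensor with pairwise contractions.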

Data tensorization processing

Tensorization is an important pre-processing step that leverages local structures and low rank features of the data. High-order tensors provide more significant image structures through tensor ring decomposition. In this paper, the non-local coupled tensorization (NCT) method26 is employed to transform 3-order video into a high-order tensor, allowing for better utilization of low rank representations while exploring non-local self-similarity and spatial correlations in local regions.

Fig. 2. High-dimensional tensorization method of tensor \(\mathscr {T}\).

For a third-order video \(\mathscr {T} \in \mathbb {R}^{M \times N \times B}\), \(M\) and \(N\) represent the spatial width and height, and \(B\) denotes the number of frames. To exploit the redundancy of the video data, the video is divided into several small cubic patches \(\mathscr {C}_i\), each of size \(s \times s \times B\). For each cubic patch, within the local temporal domain, the Euclidean distance is used to search for the \(k^2-1\) nearest neighboring patches in the local window, where each neighboring patch is of size \(s \times s \times p\). The patch and its \(k^2-1\) neighbors are combined into a new cubic tensor of size \(sk \times sk \times p\). At the same spatial location, there are \((2b/p - 1)\) cubes containing temporal information. As shown in Fig. 2, these cubes are combined into \(\textbf{T}\) fourth-order tensors of size \(sk \times sk \times p \times h\), where \(h = \frac{2b}{p} - 1\), and b is a parameter that controls the extent of the search area for local spatial information.

Overview of MOD algorithms

Tensor total variation model

MOD algorithms are used to detect objects located at salient positions of each frame, and these objects exhibit continuity in the temporal dimension. This study focuses on salient target detection, in which the continuity of moving objects across consecutive frames is preserved in the temporal direction. In this paper, temporal-spatial continuity is used to regularize the detection of moving objects. We construct the model using the tensor total variation (TTV) framework. Specifically, we adopt anisotropic tensor total variation (TTV-A) as the regularization model:

$$\begin{aligned} \Vert \mathscr {T}\Vert _{TTV-A} = \Vert \nabla _h \mathscr {T}\Vert _1 + \Vert \nabla _v \mathscr {T}\Vert _1 + \Vert \nabla _f \mathscr {T}\Vert _1 \end{aligned}$$
(2)

where \(\nabla _h\), \(\nabla _v\) and \(\nabla _f\) represent the horizontal, vertical, and temporal difference operators, respectively, which can be expressed as:

$$\begin{aligned} \nabla _h \mathscr {T}= & \text {vec}(\mathscr {T}_h), \quad \nabla _v \mathscr {T} = \text {vec}(\mathscr {T}_v), \quad \nabla _f \mathscr {T} = \text {vec}(\mathscr {T}_f); \nonumber \\ \text {where} \quad \mathscr {T}_h(x, y, z)= & \mathscr {T}(x, y, z) - \mathscr {T}(x + 1, y, z), \nonumber \\ \mathscr {T}_v(x, y, z)= & \mathscr {T}(x, y, z) - \mathscr {T}(x, y + 1, z), \nonumber \\ \mathscr {T}_f(x, y, z)= & \mathscr {T}(x, y, z) - \mathscr {T}(x, y, z + 1), \nonumber \\ \Vert \mathscr {T}\Vert _{TTV-A}= & \sum _{x, y, z} \left( \left| \mathscr {T}_h(x, y, z) \right| + \left| \mathscr {T}_v(x, y, z) \right| + \left| \mathscr {T}_f(x, y, z) \right| \right) \end{aligned}$$
(3)
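As a minimal sketch of how the anisotropic norm in Eq. (2) can be evaluated numerically, the following NumPy function sums the absolute forward differences along the three modes. It assumes that the first, second, and third array axes correspond to the horizontal, vertical, and temporal directions, and that no difference is taken across the tensor boundary. The function name `ttv_a` is hypothetical.

```python
import numpy as np

def ttv_a(T):
    """Anisotropic tensor total variation: l1 norm of the differences along each mode."""
    # np.diff returns T[x+1] - T[x]; the sign differs from Eq. (3),
    # but the absolute values, and hence the l1 norm, are identical.
    d_h = np.diff(T, axis=0)   # horizontal differences
    d_v = np.diff(T, axis=1)   # vertical differences
    d_f = np.diff(T, axis=2)   # temporal (frame) differences
    return np.abs(d_h).sum() + np.abs(d_v).sum() + np.abs(d_f).sum()

# A constant tensor has zero total variation; a single step along the
# first mode of a 3 x 3 x 3 tensor contributes one unit jump per (y, z) pair.
step = np.zeros((3, 3, 3))
step[1:, :, :] = 1.0
tv_step = ttv_a(step)
```

A spatially and temporally smooth foreground yields a small value of this norm, which is why it is used as a smoothness regularizer for the moving objects.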

Tensor robust principal component analysis (TRPCA)

RPCA is an algorithm that decomposes a matrix into two distinct components: a low rank part and a sparse part10. Typically, an original black-and-white video is represented as a 3-D tensor. When matrix-based RPCA is applied to video data, each frame is vectorized and the video is transformed into a matrix, which loses some of the spatial and temporal information in the original data. To address this limitation, TRPCA27 was proposed to decompose the video along the temporal direction into low rank and sparse tensors. For MOD algorithms, the background information is in the low rank tensor, while the moving foreground information is in the sparse tensor. This decomposition is mathematically expressed as follows:

$$\begin{aligned} \min _{\mathscr {L}, \mathscr {T}} \text {rank}(\mathscr {L}) + \lambda \Vert \mathscr {T}\Vert _0 \quad \text {s.t.} \quad \mathscr {Z} = \mathscr {L} + \mathscr {T} \end{aligned}$$
(4)

where \(\mathscr {Z} \in \mathbb {R}^{x \times y \times z}\) represents the input video data, and \(\mathscr {L} \in \mathbb {R}^{x \times y \times z}\) is the static background part, which exhibits low rank characteristics because the background has strong temporal correlations and resides in a low-dimensional subspace that slowly changes along time direction. \(\lambda\) is a parameter to control the weight of low rank and sparse terms.

\(l_{1/2}\) regularization model

In addition to the moving foreground, real-world video scenarios may also include additional salient moving objects that are not the primary targets of MOD detection, such as falling raindrops. Another type of moving object exhibits irregular and discontinuous motion, such as water ripples or swaying trees. These small objects also belong to the sparse tensor and can be effectively analyzed using the \(l_{1/2}\) regularization model.

The \(l_{1/2}\) regularization is essentially a non-convex regularization model. Zhang et al.28 proposed an iterative half-thresholding algorithm based on a matrix framework to solve the \(l_{1/2}\) regularization problem. Inspired by this work, a tensor framework-based algorithm was proposed29. The formula is shown as follows:

$$\begin{aligned} \min _{\mathscr {A}} \Vert \mathscr {A} - \mathscr {Z}\Vert _F^2 + \lambda \Vert \mathscr {A}\Vert _{l_{1/2}}^{1/2} \quad \text {s.t.} \quad \mathscr {Z} = \mathscr {A} + \varepsilon _o \end{aligned}$$
(5)

where \(\mathscr {Z} \in \mathbb {R}^{x \times y \times z}\) represents the input data, \(\mathscr {A} \in \mathbb {R}^{x \times y \times z}\) represents the sparse component of the input data, and \(\varepsilon _o\) represents the noise.
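For illustration, the element-wise half-thresholding operator commonly used for \(l_{1/2}\) problems of the form in Eq. (5) can be sketched as below. The closed form follows the widely cited half-thresholding formula for minimizing \((a - y)^2 + \lambda |a|^{1/2}\) entry by entry. The function name `half_threshold` is an assumption, and the exact scaling of the threshold in the authors' implementation may differ depending on how the data term is normalized.

```python
import numpy as np

def half_threshold(Y, lam):
    """Element-wise minimizer of (a - y)^2 + lam * |a|^(1/2) (half-thresholding)."""
    Y = np.asarray(Y, dtype=float)
    out = np.zeros_like(Y)
    # Entries whose magnitude is below this level are set to zero
    thresh = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)
    mask = np.abs(Y) > thresh
    y = Y[mask]
    phi = np.arccos((lam / 8.0) * (np.abs(y) / 3.0) ** (-1.5))
    out[mask] = (2.0 / 3.0) * y * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))
    return out

# Small entries are suppressed, large entries are shrunk slightly
a = half_threshold(np.array([10.0, 0.5, -10.0]), lam=1.0)
```

Compared with soft-thresholding, this operator shrinks large entries less and removes small entries more aggressively, which matches the role of the \(l_{1/2}\) term in promoting sparsity.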

MOD algorithm based on tensor ring low rank decomposition

This part describes the development of the proposed model. The original video data contains a low rank tensor (background) and a sparse tensor (foreground). Consider any black-and-white video data \(\mathscr {Z} \in \mathbb {R}^{x \times y \times z}\), where x, y, and z represent the frame width, the frame height, and the number of frames. The original video \(\mathscr {Z}\) can be divided into a background part \(\mathscr {L} \in \mathbb {R}^{x \times y \times z}\) and a dynamic foreground part \(\mathscr {T} \in \mathbb {R}^{x \times y \times z}\). The foreground part usually comprises a dynamic background \(\mathscr {S} \in \mathbb {R}^{x \times y \times z}\) and moving objects \(\mathscr {W} \in \mathbb {R}^{x \times y \times z}\), because real-world video almost always contains dynamic background caused by illumination changes, camera jitter, moving shadows, etc. Combining the low rank model with the \(l_{1/2}\) norm allows dynamic backgrounds to be detected effectively, but a regularization term that characterizes the dynamic nature of the background is also required to prevent moving background components from being misclassified as target objects. This paper assumes that the dynamic background is sparser than the moving objects in the temporal dimension. The model can be expressed as follows:

$$\begin{aligned} & \min _{\mathscr {L}, \mathscr {T}, \mathscr {S}, \mathscr {W}} \text {Rank}(\mathscr {L}) + \lambda _1 \Vert \mathscr {T}\Vert _1 + \lambda _2 \Vert \mathscr {S}\Vert _1 + \lambda _3 \Vert \rho (\mathscr {W})\Vert \nonumber \\ & \quad \text {s.t.} \quad \mathscr {Z} = \mathscr {L} + \mathscr {T} \quad \mathscr {T} = \mathscr {S} + \mathscr {W} \end{aligned}$$
(6)

where \(\mathscr {Z} \in \mathbb {R}^{x \times y \times z}\) represents the original video data, and \(\mathscr {L} \in \mathbb {R}^{x \times y \times z}\) and \(\mathscr {T} \in \mathbb {R}^{x \times y \times z}\) represent the low rank tensor and the sparse tensor of the input video, respectively. \(\mathscr {S}\) and \(\mathscr {W}\) represent the dynamic background tensor and the moving target foreground tensor. \(\mathscr {S}\) is sparser than \(\mathscr {W}\), and the two exhibit different continuity in the temporal dimension. This paper employs the tensor ring low rank model to extract global low rank information, and \(l_{1/2}\) regularization is applied to enhance the sparsity of the dynamic background. For the moving foreground, this paper leverages temporal-spatial smoothness. To make the foreground smoother, \(TTV-A\) regularization is incorporated into the proposed model:

$$\begin{aligned} & \min _{\mathscr {L}, \mathscr {T}, \mathscr {S}, \mathscr {W}} \sum _{n=1}^{N} \sum _{i=1}^{3} \Vert \mathscr {U}^{(n)}_{(i)}\Vert _{*} + \lambda _1 \Vert \mathscr {T}\Vert _{l_{1/2}}^{1/2} + \lambda _2 \Vert \mathscr {S}\Vert _{l_{1/2}}^{1/2} + \lambda _3 \Vert \mathscr {W}\Vert _{TTV-A} \nonumber \\ & \quad \text {s.t.} \quad \mathscr {Z} = \mathscr {L} + \mathscr {T} \quad \mathscr {T} = \mathscr {S} + \mathscr {W} \end{aligned}$$
(7)

where \(\mathscr {U}^{(n)}_{(i)}\) represents the \(i\)-mode unfolded matrix of the \(n\)-th core tensor after tensorization of \(\mathscr {L}\). \(\lambda _1\), \(\lambda _2\), and \(\lambda _3\) are balancing weight parameters. The \(TTV-A\) term denotes the \(TTV\) norm of the foreground \(\mathscr {W}\).

In this paper, the ALM method is used to solve the optimization problem in model (7). The augmented Lagrangian function is as follows:

$$\begin{aligned} & \min _{\mathscr {L}, \mathscr {T}, \mathscr {S}, \mathscr {W}, \Lambda _1, \Lambda _2} \sum _{n=1}^{N} \sum _{i=1}^{3} \Vert \mathscr {U}^{(n)}_{(i)}\Vert _{*} + \lambda _1 \Vert \mathscr {T}\Vert _{l_{1/2}}^{1/2} + \lambda _2 \Vert \mathscr {S}\Vert _{l_{1/2}}^{1/2} + \lambda _3 \Vert \mathscr {W}\Vert _{TTV-A} \nonumber \\ & \quad + \frac{\beta }{2} \Vert \mathscr {Z} - \mathscr {L} - \mathscr {T}\Vert _F^2 + \langle \Lambda _1, \mathscr {Z} - \mathscr {L} - \mathscr {T} \rangle \nonumber \\ & \quad + \frac{\beta }{2} \Vert \mathscr {T} - \mathscr {S} - \mathscr {W}\Vert _F^2 + \langle \Lambda _2, \mathscr {T} - \mathscr {S} - \mathscr {W} \rangle \end{aligned}$$
(8)

where \(\Lambda _1 \in \mathbb {R}^{x \times y \times z}\) and \(\Lambda _2 \in \mathbb {R}^{x \times y \times z}\) are Lagrange multipliers, \(\beta\) is the penalty parameter, \(\langle \mathscr {X}, \mathscr {Y} \rangle\) represents the inner product between tensors \(\mathscr {X}\) and \(\mathscr {Y}\), and \(\Vert \cdot \Vert _F\) denotes the Frobenius norm of the tensor. In this paper, we decompose Eq. (8) into subproblems for solution. Subproblem 1:

$$\begin{aligned} & \min _{\mathscr {L}} \sum _{n=1}^{N} \sum _{i=1}^{3} \Vert \mathscr {U}^{(n)}_{(i)}\Vert _{*} + \frac{\beta }{2} \Vert \mathscr {Z} - \mathscr {L} - \mathscr {T}\Vert _F^2 + \langle \Lambda _1, \mathscr {Z} - \mathscr {L} - \mathscr {T} \rangle \nonumber \\ & \quad = \min _{\mathscr {L}} \sum _{n=1}^{N} \sum _{i=1}^{3} \Vert \mathscr {U}^{(n)}_{(i)}\Vert _{*} + \frac{\beta }{2} \left\| \mathscr {L} - \left( \mathscr {Z} - \mathscr {T} + \frac{\Lambda _1}{\beta } \right) \right\| _F^2 \end{aligned}$$
(9)
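The second equality in Eq. (9) follows from completing the square. As a worked step that uses only the symbols already defined, the two terms involving \(\Lambda _1\) combine as:

$$\begin{aligned} \frac{\beta }{2} \Vert \mathscr {Z} - \mathscr {L} - \mathscr {T}\Vert _F^2 + \langle \Lambda _1, \mathscr {Z} - \mathscr {L} - \mathscr {T} \rangle = \frac{\beta }{2} \left\| \mathscr {L} - \left( \mathscr {Z} - \mathscr {T} + \frac{\Lambda _1}{\beta } \right) \right\| _F^2 - \frac{1}{2\beta } \Vert \Lambda _1\Vert _F^2 \end{aligned}$$

The last term does not depend on \(\mathscr {L}\), so it can be dropped from the minimization over \(\mathscr {L}\). The same step underlies the rewriting of the \(\mathscr {W}\) subproblem.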

Let \(\mathscr {Z} - \mathscr {T} + \frac{\Lambda _1}{\beta } = \mathscr {M}\); then Equation (9) can be transformed into:

$$\begin{aligned} \min _{\mathscr {L}_{[n]}} \frac{1}{\beta } \sum _{n=1}^{N} \sum _{i=1}^{3} \Vert \mathscr {U}^{(n)}_{(i)}\Vert _{*} + \frac{1}{2} \Vert \mathscr {L}_{[n]} - \mathscr {M}_{[n]}\Vert _F^2 \end{aligned}$$
(10)

where \(\mathscr {M}_{[n]}\) denotes the matricization of tensor \(\mathscr {M}\) along mode n, and \(\mathscr {U}^{(n)}\) can be obtained by the following equation21:

$$\begin{aligned} \mathscr {U}^{(n)} = \text {fold} \left( \frac{\sum _{i=1}^{3} \left( \beta U^{(n,i)}_{(2)} + {\Lambda _1}^{(n,i)}_{(2)} \right) + T_{\langle n \rangle } U ^{(\ne n)}_{\langle 2 \rangle }}{ U ^{(\ne n)T}_{\langle 2 \rangle } U ^{(\ne n)}_{\langle 2 \rangle } + 3E} \right) \end{aligned}$$
(11)

where \(E\) represents the identity matrix. Equation (10) can be solved using the SiLRTC algorithm30. Subproblem 2 can be solved using half-quadratic minimization combined with a shrinkage operator:

$$\begin{aligned} & \min _{\mathscr {T}} \lambda _1 \Vert \mathscr {T}\Vert _{l_{1/2}}^{1/2} + \frac{\beta }{2} \Vert \mathscr {Z} - \mathscr {L} - \mathscr {T}\Vert _F^2 + \langle \Lambda _1, \mathscr {Z} - \mathscr {L} - \mathscr {T} \rangle \nonumber \\ & \quad + \frac{\beta }{2} \Vert \mathscr {T} - \mathscr {S} - \mathscr {W}\Vert _F^2 + \langle \Lambda _2, \mathscr {T} - \mathscr {S} - \mathscr {W} \rangle \nonumber \\ & \quad \mathscr {T} = H_{\frac{\lambda _1}{\beta ^k}} \left[ \frac{\mathscr {Z} - \mathscr {L} + \mathscr {S} + \mathscr {W}}{2} + \frac{\Lambda _1 - \Lambda _2}{2 \beta ^k} \right] \end{aligned}$$
(12)

where \(H[\cdot ]\) represents the half-thresholding shrinkage operator31, and \(\lambda _1\) denotes the regularization parameter. Equation (12) can be solved using the tensor-based half-quadratic alternate minimization algorithm. Similarly, subproblem 3 can also be solved using this method:

$$\begin{aligned} & \min _{\mathscr {S}} \lambda _2 \Vert \mathscr {S}\Vert _{l_{1/2}}^{1/2} + \frac{\beta }{2} \Vert \mathscr {T} - \mathscr {S} - \mathscr {W}\Vert _F^2 + \langle \Lambda _2, \mathscr {T} - \mathscr {S} - \mathscr {W} \rangle \nonumber \\ & \quad \mathscr {S} = H_{\frac{\lambda _2}{\beta ^k}} \left[ \mathscr {T} - \mathscr {W} + \frac{\Lambda _2}{\beta ^k} \right] \end{aligned}$$
(13)

The equation for subproblem 4 is:

$$\begin{aligned} & \min _{\mathscr {W}} \lambda _3 \Vert \mathscr {W}\Vert _{TTV-A} + \frac{\beta }{2} \Vert \mathscr {T} - \mathscr {S} - \mathscr {W}\Vert _F^2 + \langle \Lambda _2, \mathscr {T} - \mathscr {S} - \mathscr {W} \rangle \nonumber \\ & \quad = \min _{\mathscr {W}} \lambda _3 \Vert \mathscr {W}\Vert _{TTV-A} + \frac{\beta }{2} \left\| \mathscr {W} - \left( \mathscr {T} - \mathscr {S} + \frac{\Lambda _2}{\beta }\right) \right\| _F^2 \end{aligned}$$
(14)

Equation (14) can be solved using the following equation:

$$\begin{aligned} \mathscr {W} = \text {sth}\left( \mathscr {T} - \mathscr {S} + \frac{\Lambda _2}{\beta }, \frac{\lambda _3}{\beta } \right) \end{aligned}$$
(15)

where \(\text {sth}()\) represents the soft-thresholding operator, and its formula is as follows:

$$\begin{aligned} \text {sth}(x,t) = \text {sgn}(x) \max (|x| - t, 0) \end{aligned}$$
(16)

The multipliers \(\Lambda _1\) and \(\Lambda _2\) can be updated using the following equations:

$$\begin{aligned} \Lambda _1= & \Lambda _1 + \beta (\mathscr {Z} - \mathscr {L} - \mathscr {T}) \nonumber \\ \Lambda _2= & \Lambda _2 + \beta (\mathscr {T} - \mathscr {S} - \mathscr {W}) \end{aligned}$$
(17)
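The soft-thresholding step and the multiplier updates can be sketched in NumPy as follows. This is only an illustration of Eqs. (15) to (17) on dense arrays: the function names, the argument `tau` (the threshold level passed to the soft-thresholding operator in Eq. (15)), and the toy values are assumptions made for the example, and the remaining subproblems are not shown.

```python
import numpy as np

def sth(x, t):
    """Soft-thresholding of Eq. (16): sgn(x) * max(|x| - t, 0), applied element-wise."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def update_w_and_multipliers(Z, L, T, S, Lam1, Lam2, beta, tau):
    """One pass over the W-update (Eq. 15) and the multiplier updates (Eq. 17)."""
    W = sth(T - S + Lam2 / beta, tau)
    Lam1 = Lam1 + beta * (Z - L - T)    # residual of the constraint Z = L + T
    Lam2 = Lam2 + beta * (T - S - W)    # residual of the constraint T = S + W
    return W, Lam1, Lam2

# Toy data (assumed): the first constraint holds exactly, so Lam1 stays at zero
shape = (2, 2, 2)
Z, L, T = 3.0 * np.ones(shape), np.ones(shape), 2.0 * np.ones(shape)
S, Lam1, Lam2 = np.zeros(shape), np.zeros(shape), np.zeros(shape)
W, Lam1, Lam2 = update_w_and_multipliers(Z, L, T, S, Lam1, Lam2, beta=2.0, tau=0.5)
```

The multipliers grow only where a constraint is violated, which is how the iteration pushes \(\mathscr {L} + \mathscr {T}\) towards \(\mathscr {Z}\) and \(\mathscr {S} + \mathscr {W}\) towards \(\mathscr {T}\).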

The proposed algorithm:

figure a

Experiment and result analysis

The MOD experiments are conducted in MATLAB 2024. The test equipment is a laptop equipped with a 2.2 GHz Intel Core i7 processor and 16 GB of 1600 MHz DDR3 memory. The parameter \(\lambda ={1}/{\sqrt{\max (M,N) \times B}}\) is used to separate the low rank and sparse parts of the input tensor. In these experiments, \(\lambda _1={0.2}/{\sqrt{\max (M,N) \times B}}\), \(\lambda _2={2}/{\sqrt{\max (M,N) \times B}}\), and \(\lambda _3={20}/{\sqrt{\max (M,N) \times B}}\) guarantee the desired detection result.

The background and foreground detection results are evaluated using the \(f_1\) score and the \(f_2\) score, and the formula is expressed as follows:

$$\begin{aligned} f_1 = \frac{2R_1P_1}{R_1 + P_1} \quad f_2 = \frac{2R_2P_2}{R_2 + P_2} \end{aligned}$$
(18)

where \(R_1 = {TN}/{(TN + FP)}\), \(P_1 = {TN}/{(TN + FN)}\), \(R_2 = {TP}/{(TP + FN)}\), and \(P_2 = {TP}/{(TP + FP)}\). FP, FN, TP, and TN represent false positives, false negatives, true positives, and true negatives, respectively. Additionally, the parameter \(f\) is used to evaluate the algorithm32, and the formula for parameter \(f\) is expressed as follows:

$$\begin{aligned} f = \frac{2 f_1 f_2}{f_1 + f_2} \end{aligned}$$
(19)
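As a small worked example of Eqs. (18) and (19), the following sketch computes \(f_1\), \(f_2\), and \(f\) from a binary foreground mask and its ground truth. The function name `f_scores` and the toy masks are assumptions, and the sketch does not guard against empty classes, which would cause a division by zero.

```python
import numpy as np

def f_scores(pred, truth):
    """Background score f1, foreground score f2, and combined score f."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)      # foreground pixels detected as foreground
    tn = np.sum(~pred & ~truth)    # background pixels detected as background
    fp = np.sum(pred & ~truth)     # background pixels detected as foreground
    fn = np.sum(~pred & truth)     # foreground pixels detected as background
    r1, p1 = tn / (tn + fp), tn / (tn + fn)
    r2, p2 = tp / (tp + fn), tp / (tp + fp)
    f1 = 2 * r1 * p1 / (r1 + p1)
    f2 = 2 * r2 * p2 / (r2 + p2)
    f = 2 * f1 * f2 / (f1 + f2)
    return f1, f2, f

# Toy masks: TP = 1, FP = 1, FN = 1, TN = 3
pred = [1, 1, 0, 0, 0, 0]
truth = [1, 0, 0, 0, 0, 1]
f1, f2, f = f_scores(pred, truth)
```

For these toy masks the background score is 0.75, the foreground score is 0.5, and their harmonic mean is 0.6, which shows how a weak foreground result pulls the combined score down.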

The proposed algorithm is compared with \(TVPCRA\)13, SCLR-\(L_{1/2}\)14, \(DNLRITV\)33, \(DHLRTTV\)34, and TQR-SVD15 to demonstrate its advantages. Two datasets are tested in the experiments: the UCSD background subtraction dataset and the Change Detection.net (CD)35 background subtraction dataset.

Fig. 3. The visual results of different algorithms. Columns I-VII represent the original data, ground truth, TVPCRA, SCLR-\(L_{1/2}\), DHLRTTV, TQR-SVD, and the proposed method. (A) Highway (static background), (B) Indoor corridor (static background), (C) Boat (dynamic background), (D) Snowfall (dynamic background), (E) Fast moving object (illusion).

Fig. 3 illustrates the performance of different algorithms under multiple scenarios. Scenarios A and B involve static backgrounds, scenarios C and D contain dynamic backgrounds, and scenario E features a fast-moving object with motion blur. The results of scenarios A and B show that only TQR-SVD and the proposed method detect the moving objects correctly, whereas the TVPCRA method even misclassifies leaf shadows as moving objects. In scenarios C and D, the waves and the falling snow are recognized as dynamic background, and the results demonstrate the superiority of the proposed method in detecting the exact profile of the moving object. Motion blur in images can be caused by camera jitter, slow shutter speed, or fast-moving objects. In scenario E, the proposed method detects the moving object more accurately than the other methods.

Table 1 Comparison of \(f_1\), \(f_2\), f results for different algorithms.
Fig. 4. The relationship graph of the \(f\) parameter for different algorithms with different iteration counts.

Table 1 compares the proposed algorithm with \(TVPCRA\) and SCLR-\(L_{1/2}\). The experimental results indicate that the proposed method achieves higher accuracy in background detection than the other algorithms, particularly for the fast-moving objects in the high-jump video, and that it performs better in terms of \(f_1\), \(f_2\), and f. In the ocean video, redundant background information occupies a large portion of the data, and the proposed method again achieves better background detection and overall performance than the other algorithms. Compared with the \(DHLRTTV\) algorithm, the proposed algorithm achieves a MOD performance parameter f of 0.91, which is comparable to that of DHLRTTV. The tensor ring low rank constraint enhances the ability to capture global low rank background information in video data. By combining the tensor total variation model and the \(l_{1/2}\) norm regularization model, the MOD performance is significantly improved on dynamic background video data.

Figure 4 illustrates the results based on highway video data. This experiment compares the comprehensive parameter \(f\) for moving object detection, calculated over the first 20 iterations of different algorithms. As shown, our algorithm achieves the highest accuracy during initial iterations and maintains a performance advantage across subsequent iterations.

To test the robustness of the proposed algorithm, Gaussian noise and salt-and-pepper noise with variances of 0.001 and 0.01 were added to the dataset. As shown in Table 2, the proposed algorithm can effectively detect moving objects and the foreground in noisy data, demonstrating that the tensor low rank model has strong noise separation ability. The results confirm that the method is robust against different types of noise.

In the next experiment, we evaluated the proposed method against several deep learning methods using the parameter \(f\). Table 3 shows that the proposed method performs well on the ocean data, which has a simple background and slow-moving objects, and even surpasses Triple-CNN36 and MsEDNET37. However, when the foreground is relatively complex and the objects move at high speed, the detection accuracy decreases. The proposed method struggles to detect fast-moving objects accurately because it assumes that moving objects are sparse and smooth in the spatial dimension, whereas in reality they are not continuous in the spatial dimension.

Table 2 Comparison of \(f_1\), \(f_2\), f results for different algorithms under different noise levels.
Table 3 Performance comparison of different deep learning algorithms.
Table 4 Execution time comparison of different methods.

In the final experiment, we evaluated the computational time of various methods on 4-dimensional video data from the MOTChallenge dataset. The proposed method and the TQR-SVD method are sensitive to the rank parameter, so we chose a small rank while ensuring relatively similar accuracy. As shown in Table 4, the proposed method achieved the shortest computational time among all compared approaches. The majority of the computational cost in our method is attributed to the updates of the core tensor and the tensor total variation (TTV) term. In contrast, the main cost in DHLRTTV arises from the t-SVD and TTV computations, whereas in TQR-SVD it comes from QR decomposition and TTV updates. These findings indicate that the proposed method is more efficient and thus better suited for high-dimensional video data.

Discussion

In this part, we tested the impact of the rank values on the proposed method. We assume that the tensor ring ranks are equal for every core tensor, i.e., \(R_1 = R_2 =\dots = R_{N} = R\), with \(R \in \{5, 10, 15, 20, 25, 30, 35\}\), and the test data is the first image of Fig. 3. When \(R = 5\), the parameter f is 0.52, and the best performance is obtained at \(R = 25\). Increasing the rank further does not improve the performance but reduces computational efficiency.

Fig. 5. The result of moving object detection under crowded road condition.

In another experiment, we tested the limitations of the proposed method. The MOTChallenge dataset is used to evaluate the performance of the proposed method with multiple moving objects. The video shows a hallway in a mall that is filled with pedestrians and lit by both sunshine and artificial lights, which further increases the difficulty of detection. Fig. 5 shows that most pedestrians can be detected, but the result is inaccurate and the contours of the pedestrians are not coherent. When multiple moving objects occupy a large portion of the video, the detection accuracy of our method decreases because the moving objects become less sparse. Our method is therefore limited by the number of moving targets and the proportion of the video they occupy.

Conclusions

The proposed algorithm combines the tensor ring low rank model and the total variation model to extract background information and dynamic foreground information from video data. The \(l_{1/2}\)-norm regularization model is applied to separate the dynamic background, which is sparse and temporally non-continuous. Experimental results demonstrate that the proposed algorithm outperforms other methods, such as \(TVPCRA\) and \(DHLRTTV\) in both background detection and moving target detection. Moreover, the proposed method demonstrates greater robustness against Gaussian and salt-and-pepper noise interference compared to other algorithms.