Abstract
Driven by diverse needs, sports video analysis has produced many valuable applications in which moving-target detection plays an indispensable role. However, the unique characteristics of sports videos pose significant challenges to target detection and tracking. The purpose of this article is therefore to propose an efficient multi-target detection algorithm that quickly and effectively detects all target objects in a video. We propose a multi-target detection and tracking framework based on a deep conditional random field network, adding a conditional random field layer to the output of the target detection network to model the mutual relationships and contextual information between targets. In addition, we introduce local adaptive filters and a spatial-temporal attention mechanism into this framework to further improve detection performance, especially in complex scenes and under target interactions. Experimental results show that the proposed method outperforms state-of-the-art methods in both accuracy and efficiency.
Introduction
In recent years, with the continuous development of science and technology, computer vision has been increasingly applied to sports video analysis, both for sports training and for sports competitions. In sports training, because machine vision offers better accuracy and memory than the human eye, it can quickly capture moving targets and record their various motion data. By collecting large amounts of video of high-level athletes in training and in large-scale events and analyzing this information effectively, coaches no longer need to rely solely on manual observation and experience to guide athletes' technical movements, which can greatly improve training outcomes. In sports competition, detecting the moving targets in competition video makes it possible, on the one hand, to analyze high-level semantic events and generate video summaries and, on the other hand, to analyze competition strategy, behavior and officiating, such as formations, attack routes, cooperative actions and controversial penalty decisions. Driven by different needs, sports video analysis has produced many valuable applications, including highlight extraction, video summarization, action analysis and indexing, tactical statistics, strategy analysis, virtual content insertion and virtual scene construction. In these applications, target detection technology plays an indispensable role.
For decades, target detection has been a research hotspot, because it is a long-standing, fundamental and challenging problem in computer vision. The goal of image-based object detection is to detect instances of objects of predefined classes in an image and draw a compact bounding box around each object1. Specifically, target detection comprises two tasks, target localization and classification: finding the position of each target in the image and determining which predefined class it belongs to. As the cornerstone of image understanding and computer vision, target detection is the basis for solving complex or high-level visual tasks such as segmentation, scene understanding, target tracking, image captioning, event detection and activity recognition. Target detection is widely used in many fields, including robot vision, consumer electronics, security, autonomous driving, human-computer interaction, content-based image retrieval, intelligent video surveillance and augmented reality.
Moreover, deep learning has recently become a powerful method for automatically learning feature representations from data, greatly improving the target detection task2. Since 2012, deep learning has become a very powerful tool because it can process large amounts of data. The use of increasingly many hidden layers and intermediate fully connected layers has moved beyond traditional image processing and computer vision techniques and driven significant progress on a wide range of problems such as target detection, speech recognition, natural language processing, medical image analysis, drug discovery and genomics. In particular, among the different types of deep neural networks, research on deep convolutional neural networks (DCNNs) has brought breakthroughs in the processing of image, video, speech and audio data.
However, unlike data for general target detection, sports video has the following characteristics: first, the camera often overlooks the whole field from an oblique downward angle, so target resolution is generally low; second, the athletes on the field typically wear one of two kinds of uniforms, so many instances with similar appearance surround each target, testing the discriminative ability of the algorithm; third, the colors of the athletes and the background field may be similar, the athletes' movement causes severe deformation, rotation and occlusion, and the camera lens may be blurred or shaken3. This uniqueness of sports video poses a great challenge to target detection and tracking technology. In addition, an important task of sports video analysis is to detect and track selected targets in real time and feed back their tracking and trajectory data in real time, so that the targets can be analyzed from these data. To ensure accurate and complete analysis results, the target detection and tracking algorithm must be highly robust and run in real time. Therefore, the purpose of this paper is to propose an efficient multi-target detection and tracking algorithm that detects objects in video quickly and effectively, providing the basis for subsequent video image analysis.
In this work, we propose a multi-target detection and tracking framework based on deep conditional random field (CRF) networks, which combines deep neural networks with conditional random fields by adding a CRF layer to the output of the target detection network. A CRF is a probabilistic graphical model used to model the interrelationships and contextual information between objects. In addition, we introduce local adaptive correlation (LAC) filters and a spatiotemporal attention mechanism (STAM) into this framework to further improve target detection performance, especially in complex scenes and under target interactions.
In summary, the main contributions of our work are as follows:
-
We propose a multi-object detection and tracking framework based on a deep conditional random field network. Its primary advantage lies in its ability to effectively capture the relationships and contextual information among objects, thereby enhancing the accuracy of object detection, particularly in dense object scenes or overlapping object scenarios.
-
Within this framework, we employ local adaptive filters to optimize the discriminability of each object. This enhances the robustness of object detection, especially in cases of varying illumination or high levels of image noise.
-
We introduce a spatiotemporal attention mechanism within this framework, allowing the network to focus on time steps and spatial locations relevant to the current target. This reduces interference between objects, effectively handling occlusions and drift caused by interactions among objects.
The rest of this article includes the following. “Related work” presents target detection and machine learning methods, and “Proposed method” presents the proposed method. “Experiments” demonstrates the performance of the proposed method with experimental results. Finally, conclusions and future research directions are given in “Conclusions”.
Related work
Target detection based on traditional methods
Target detection is a basic direction of computer vision research. The ideal goal of target detection is to develop an effective algorithm that achieves the two competing goals of accuracy and efficiency. High-quality detection must accurately locate and identify targets in an image or video frame; that is, it must distinguish a wide variety of real-world object categories and recognize object instances of the same category under intra-class variation4. High efficiency requires the whole detection task to run in real time under acceptable memory and storage requirements.
Traditional target detection technologies include passive infrared, ultrasonic and radio wave detection methods, but in practice these methods operate only under limited conditions, because they are vulnerable to environmental interference and lack sensitivity. Computer vision methods also provide many routes to target detection. Early research on image-based target recognition was based on template matching techniques and simple part-based models, focusing mainly on specific objects with roughly rigid spatial layouts. The main paradigm of target recognition was based on geometric representations. Appearance features ranged from global to local representations, designed to remain invariant to translation, scale, rotation, lighting, viewing angle and occlusion. Handcrafted local invariant features gained great popularity: starting from the scale-invariant feature transform (SIFT), progress on various visual recognition tasks was essentially based on local descriptors, such as Haar-like features, SIFT, shape context, histograms of oriented gradients (HOG), local binary patterns (LBP) and region covariance5. Later, the focus shifted from geometric and earlier models to the use of statistical classifiers. This series of successful object detectors laid the foundation for follow-up research in this field.
DCNNs for target detection
The use of CNNs for detection and localization can be traced back to the 1990s. Early CNNs with a small number of hidden layers were used for target detection and succeeded in limited domains such as face detection. Recently, however, deeper CNNs have made record-breaking progress in target detection. This change occurred when the successful application of DCNNs in image classification was carried over to target detection6. Detection frameworks based on DCNNs can be mainly divided into two categories: one-stage detectors and two-stage detectors7.
One-stage detectors are end-to-end object detection methods that are generally simpler and faster. They predict the location and category of objects directly from the input image without explicitly generating candidate regions. The YOLO series of models (e.g., YOLOv1, YOLOv2, YOLOv3, YOLOv4)8,9,10,11 is representative of one-stage detectors: the input image is divided into multiple grid cells, and the positions and categories of objects are predicted for each cell. The YOLO series is known for its speed and real-time performance. The Single Shot MultiBox Detector (SSD) is another representative one-stage detector; it employs multi-scale feature maps and predicts objects of different sizes at each scale, which lets SSD perform well across object sizes. In addition, EfficientDet12 was an efficient one-stage detector that used EfficientNet as the backbone and performed target detection and segmentation simultaneously by sharing convolutional networks and feature maps, reducing computation and parameters. CenterNet13 simultaneously detected the center point and size of each object, simplifying the object detection problem and achieving a good balance between speed and accuracy. DETR14 was based on the Transformer architecture and took the distinctive approach of casting target detection as a set prediction problem, predicting target positions and categories simultaneously. RepPoints15 not only detected object bounding boxes but also generated object-specific points that describe object shape and pose more precisely. FCOS16 proposed a fully convolutional one-stage detector that solves object detection in a pixel-wise prediction manner; FCOS contains neither anchor boxes nor proposal boxes, and by eliminating the predefined anchor set it completely avoids anchor-related computations, such as overlap calculations during training.
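To make the one-stage paradigm concrete, the sketch below decodes a YOLOv1-style prediction tensor into scored boxes. The tensor layout (S×S cells, B boxes of (x, y, w, h, confidence) plus C shared class probabilities per cell) and all parameter values are illustrative assumptions, not the exact layout of any particular YOLO version:

```python
import numpy as np

def decode_yolo_grid(pred, S=7, B=2, C=20, conf_thresh=0.25):
    """Decode a YOLOv1-style prediction tensor of shape (S, S, B*5 + C).

    Each of the S x S grid cells predicts B boxes (x, y, w, h, confidence)
    plus C class probabilities shared by the cell. (x, y) are offsets of the
    box center within the cell; (w, h) are fractions of the image size.
    """
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[B * 5:]
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                if conf < conf_thresh:
                    continue
                cx = (col + x) / S          # cell-relative -> image-relative
                cy = (row + y) / S
                cls = int(np.argmax(class_probs))
                detections.append((cx, cy, w, h, conf * class_probs[cls], cls))
    return detections  # non-maximum suppression would normally follow
```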
Two-stage detectors generally achieve higher detection accuracy because they operate in two steps: first generating candidate regions, then classifying objects and regressing their positions within these regions. The Region-based Convolutional Neural Network (RCNN) series is representative of two-stage detectors. RCNN17 can be considered the first successful algorithm to apply deep learning to object detection. It followed the traditional object detection pipeline of region proposal, feature extraction, image classification, and non-maximum suppression; the key difference was that, in the feature extraction step, it replaced traditional features (such as SIFT and HOG) with features extracted by a deep convolutional network. Fast R-CNN18 improved on the original RCNN by sharing feature extraction, speeding up the entire process. Faster R-CNN19 introduced the Region Proposal Network (RPN) to make candidate region generation more efficient and accurate. Mask R-CNN20 extended Faster R-CNN with instance segmentation support, enabling simultaneous object detection and segmentation. Cascade R-CNN21 introduced a cascade structure that gradually improves detection accuracy by chaining multiple detection stages. Sparse R-CNN22 was a two-stage detector specially designed for sparsely distributed targets.
Target detection for sports videos
Target detection in sports videos is a challenging task in computer vision because sports videos contain complex dynamic scenes, fast-moving targets, and multi-target interactions. Researchers have made many efforts in this area, aiming to improve the efficiency and quality of sports video analysis and sports game management. Castellano et al. described an automated soccer player detection and tracking system developed during the 2014 FIFA World Cup23, used for real-time match monitoring and generating statistics. Huang et al. aimed to detect and track soccer players in real time, especially in broadcast videos24, combining object detection and multi-object tracking techniques to achieve high-quality player tracking. Scott et al. developed algorithms to detect and track athletes during competition based on video data provided by drones25. Li et al. explored event perception and target detection in football matches, using deep learning techniques to detect key events such as goals and offsides and performing the associated target detection26. Huang et al. focused on player localization and tracking in football videos, developing a method based on vision and motion models to detect and track football players and generate their trajectories27. Manzoor et al. aimed to achieve real-time multi-person target tracking in sports competitions28, optimizing the non-maximum suppression algorithm to increase tracking speed and making it suitable for fast-moving scenes. This field is still evolving, and as deep learning technology continues to develop and the demand for sports video analysis increases, more innovative methods and applications are expected to emerge.
Although existing methods have made some progress in sports video target detection and tracking, they still struggle with the complex dynamic scenes, multi-target interactions and target occlusions found in sports videos. Therefore, we propose a multi-target detection and tracking framework based on a two-stage detector, which more effectively captures the relationships and contextual information between targets and reduces inter-target interference and the impact of illumination changes, thereby improving the accuracy and robustness of target detection and tracking.
Proposed method
Overview
To quickly and effectively detect multiple objects in sports videos, we propose a multi-target detection and tracking framework that integrates conditional random fields, local adaptive filters, and spatiotemporal attention mechanisms. The schematic diagram of the framework is shown in Fig. 1.
First, for each frame in the sports video, the CRF module is employed to obtain a search region for each target, within which candidates are sampled. ROI-Pooling then extracts candidate features for each target, which are weighted using spatial attention within the Spatial-Temporal Attention Block. Next, the LAC filter identifies the highest-scoring best-match candidate, which serves as the estimated target state. The visibility map for each tracked target is then inferred from the features of the corresponding estimated target state, and this visibility map, together with the spatial arrangement of the target and its neighboring targets, is used to infer temporal attention. Finally, each target's state is updated based on the temporal-attention-weighted loss over training samples from both the current and historical frames within the Spatial-Temporal Attention Block, and the motion model of each target is updated according to its estimated state.
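The per-frame logic can be summarized with the following Python-style sketch; all module and method names are placeholders standing in for the components detailed in the subsections below, fixing only the order of operations rather than any concrete API:

```python
def process_frame(frame, targets, history,
                  crf_module, st_attention, lac_filter):
    """One iteration of the proposed detect-and-track loop (illustrative).

    The module arguments are placeholders for the CRF module, the
    Spatial-Temporal Attention Block and the LAC filter described below.
    """
    for target in targets:
        # 1. The CRF module yields a search region; candidates are sampled in it.
        region = crf_module.search_region(frame, target)
        candidates = region.sample_candidates()

        # 2. ROI-Pooling extracts per-candidate features, which are weighted
        #    by spatial attention inside the Spatial-Temporal Attention Block.
        feats = st_attention.spatial_weight(region.roi_pool(candidates))

        # 3. The LAC filter scores candidates; the best match becomes the
        #    estimated target state.
        scores = lac_filter.correlate(target, feats)
        best = scores.argmax()
        target.state = candidates[best]

        # 4. A visibility map is inferred from the estimated state's features;
        #    with the layout of neighboring targets it yields temporal attention.
        vis_map = st_attention.visibility(feats[best])
        t_weights = st_attention.temporal_weight(vis_map, targets)

        # 5. Update the target model with the temporal-attention-weighted loss
        #    over current and historical samples, then update the motion model.
        target.update(history.weighted_loss(target, t_weights))
        target.motion_model.update(target.state)
```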
Conditional random field module
CRF is a probabilistic graphical model mainly used in sequence labeling problems, such as part-of-speech tagging and named entity recognition. It is a conditional probability distribution model over a set of output sequences given a set of input sequences, represented with a graph structure. Specifically, CRF uses an undirected graph (Markov random field) to represent the relationships between random variables, rather than a directed graph (Bayesian network). One of the core concepts of CRF is the feature function, which describes the relationship between input features and labels; each feature function is associated with a weight representing the importance of that relationship. These feature functions are usually based on observations of the input data and the choice of labels. The goal of CRF is to model the conditional probability distribution of the output label sequence given the input, which can be expressed as Eq. (1):

$$P(Y|X) = \frac{1}{Z(X)} \exp \left( \sum _{i} \lambda _i f_i(Y, X) \right) \qquad (1)$$
where \(Y=[y_1, y_2, \ldots , y_n]\) represents the label sequence, \(X\) represents the input features, \(P(Y|X)\) is the conditional probability of the label sequence \(Y\) given the input \(X\), \(f_i(Y, X)\) is the feature function, \(\lambda _i\) is the corresponding weight, and \(Z(X)\) is the normalization factor ensuring that the probability distribution sums to 1.
Among the three fundamental problems of conditional random fields (inference, learning, and computation of the normalization constant), the main focus is on computing conditional probabilities, training the model, and prediction. For inference, the most likely label sequence \(Y\) is found given the input features \(X\), typically using the forward-backward algorithm or the Viterbi algorithm, and the conditional probability \(P(Y|X)\) is then computed from the known input sequence \(X\) and output sequence \(Y\). During training, maximum likelihood estimation is typically used to estimate the model parameters. Finally, in the prediction stage, the most likely output sequence is predicted from a known input sequence using the trained model.
CRF performs excellently in sequence modeling because it naturally handles sequence data such as text, time series, and image pixel sequences. In sequence labeling problems, CRF can capture the dependencies between labels in the sequence, improving performance. Additionally, the normalization factor \(Z(X)\) ensures that the conditional probability distribution is normalized, so that the probabilities of all possible label sequences sum to 1. Computing \(Z(X)\) requires summing over all label sequences, which can be a challenge during inference, but it can be computed efficiently using dynamic programming.
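As a concrete illustration of these dynamic-programming routines, the sketch below computes \(\log Z(X)\) with the forward algorithm and the most likely label sequence with the Viterbi algorithm for a linear-chain CRF; the unary and transition score arrays are assumed to come from the weighted feature functions and are illustrative inputs:

```python
import numpy as np
from scipy.special import logsumexp

def forward_log_z(unary, trans):
    """log Z(X) for a linear-chain CRF via the forward algorithm.

    unary: (T, K) per-position label scores; trans: (K, K) transition scores.
    Both are assumed to come from the weighted feature functions.
    """
    alpha = unary[0].copy()                 # log forward messages at t = 0
    for t in range(1, len(unary)):
        # alpha_k <- logsumexp_j(alpha_j + trans[j, k]) + unary[t, k]
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + unary[t]
    return logsumexp(alpha)

def viterbi(unary, trans):
    """Most likely label sequence argmax_Y P(Y|X) by dynamic programming."""
    T, K = unary.shape
    score, back = unary[0].copy(), np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans       # (prev label, next label) scores
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + unary[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):           # trace the best path backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```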
In multi-target detection, targets typically interact with each other, and the presence of one target can affect the location and category of nearby targets. Because CRFs capture contextual information in multi-object detection tasks, associating targets and accounting for their interactions, they can better infer target locations and labels, thereby improving detection accuracy. Furthermore, CRFs naturally extend to the spatio-temporal domain, enabling tracking and detection in video sequences. By combining CRFs with deep neural networks, end-to-end learning can be achieved, better integrating the feature extraction and inference processes. Based on these advantages, we introduce CRF into multi-target detection and tracking for sports videos to improve detection accuracy and robustness.
Details of the CRF module used in our framework are shown in Fig. 2. The first part of the module performs target detection on the input video sequence and links the detection results into short but reliable trajectories; each trajectory is then treated as a graph node, for which node potential values are generated. The second part is an RNN model that performs CRF inference using a gradient descent algorithm. One challenge in CRF inference is that realizing assignment hypotheses involves values from a discrete set, such as \(\{0, 1\}\), whereas existing deep learning methods are not designed for discrete problems. To address this issue and use DNNs to generate optimal assignment hypotheses, we first perform a continuous relaxation of the original binary labels and then formulate an optimization problem. Specifically, we expand the label variable \(y_i\) into a new Boolean variable \(y_{i:\alpha }\), \(\alpha \in \{0, 1\}\), where an assignment of label 1 to \(y_{i:\alpha }\) means that \(y_i\) receives label \(\alpha\); this is equivalent to assigning a Boolean label 0 or 1 to each node \(y_{i:\alpha }\). A constraint is introduced to ensure that exactly one label value is assigned to each node. Our tracking problem is thus transformed into an energy minimization problem, which we write as the following binary integer program:

$$\min _{y} \; \sum _{i} \sum _{\alpha } \theta _{i:\alpha }\, y_{i:\alpha } + \sum _{i<j} \sum _{\alpha ,\beta } \theta _{ij:\alpha \beta }\, y_{i:\alpha }\, y_{j:\beta } \quad \text {s.t.} \;\; \sum _{\alpha } y_{i:\alpha } = 1, \;\; y_{i:\alpha } \in \{0, 1\}, \qquad (2)$$

where \(\theta _{i:\alpha }\) and \(\theta _{ij:\alpha \beta }\) denote the unary and pairwise potentials, respectively.
Next, we relax the integer program to allow real numbers on the unit interval [0, 1] instead of only Boolean values. Letting \(q_{i:\alpha } \in [0, 1]\) denote the relaxed variables, the energy function can be expressed as the following quadratic program:

$$\min _{q} \; \sum _{i} \sum _{\alpha } \theta _{i:\alpha }\, q_{i:\alpha } + \sum _{i<j} \sum _{\alpha ,\beta } \theta _{ij:\alpha \beta }\, q_{i:\alpha }\, q_{j:\beta } \quad \text {s.t.} \;\; \sum _{\alpha } q_{i:\alpha } = 1, \;\; q_{i:\alpha } \in [0, 1]. \qquad (3)$$
We can now view the CRF inference in Eq. (3) as a gradient descent-based minimization problem, which can readily be formulated as a recurrent neural network since all operations are differentiable with respect to \(q\).
As all parts of the model are formulated as standard network operations, our final model can be trained end-to-end using common deep learning strategies. Specifically, during the forward pass, the output of the CNN is passed to the RNN as its initialization, i.e., the CRF state \(q_0\). After \(T\) iterations, the RNN outputs \(q_T\), the solution of Eq. (3). It is worth noting that all CRF learning and inference is embedded in a unified neural network.
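A minimal sketch of this unrolled inference follows, assuming label-agnostic pairwise coupling weights for brevity (the potentials in our module may be richer). Each iteration takes a mirror-descent step on the relaxed energy of Eq. (3) and re-normalizes \(q\) with a softmax, so every operation stays differentiable and the loop can be embedded in a network as described:

```python
import torch

def crf_inference_rnn(unary, pairwise, T=10, lr=0.1):
    """Unrolled gradient-descent CRF inference (illustrative sketch).

    unary:    (N, L) node potentials from the CNN (lower = more likely).
    pairwise: (N, N) label-agnostic coupling weights between nodes, a
              simplification of the pairwise potentials in Eq. (3).
    Returns q_T, the relaxed assignment after T iterations.
    """
    q = torch.softmax(-unary, dim=1)      # q_0 initialized from the CNN
    for _ in range(T):
        # Gradient of E(q) = sum_i <theta_i, q_i> + sum_ij w_ij <q_i, q_j>
        grad = unary + pairwise @ q
        # Entropic mirror-descent step keeps each row of q on the simplex.
        q = torch.softmax(torch.log(q + 1e-12) - lr * grad, dim=1)
    return q
```

For example, `crf_inference_rnn(torch.randn(5, 2), torch.rand(5, 5))` returns a (5, 2) tensor of relaxed assignments whose rows sum to 1.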
Spatial-temporal attention block
Multi-object detection necessitates consideration of the dynamic changes of objects in both temporal and spatial dimensions. Spatiotemporal attention mechanisms aid in establishing temporal and spatial correlations among targets, enabling models to focus on the relationships between targets across different frames or time steps. This facilitates accurate tracking and detection of moving objects, such as vehicles in traffic scenarios or athletes in sports competitions. Furthermore, spatiotemporal attention contributes to achieving continuous object tracking, maintaining the identity of targets across multiple frames. Models can employ spatiotemporal relationships to predict the positions and attributes of targets in the next frame, ensuring smooth tracking without instability arising from frame-by-frame processing. Moreover, in multi-object detection, false positives are a prevalent issue, wherein the model erroneously detects non-existent targets. Spatiotemporal attention assists models in comprehending the motion patterns and contextual information of targets more effectively, thereby reducing false positive rates. Models can make detection and tracking decisions based on target motion direction, velocity, acceleration, and other information, thus mitigating false alarms. Multi-object detection often confronts complex scenarios with numerous objects that may overlap, occlude, or interfere with each other. Spatiotemporal attention permits models to dynamically adjust their focus based on the spatiotemporal characteristics of targets, facilitating improved discrimination and tracking of these objects. This enhances model performance in congested or intricate environments. Therefore, spatiotemporal attention has become a crucial technical tool for handling video data and dynamic scenes, contributing to the enhancement of accuracy and robustness in multi-object detection.
The principle of applying the spatiotemporal attention mechanism to multi-target detection in sports videos is to introduce attention in both the temporal and spatial dimensions, dynamically adjusting the feature maps according to the changing spatiotemporal dynamics of the game so that the model can better perceive the spatio-temporal relationships and dynamic changes of players and the ball. This improves detection and tracking performance and makes the model better suited to dynamic scenes such as sports matches. We therefore introduce a spatiotemporal attention mechanism into our framework.
As shown in Fig. 3, the Spatio-Temporal Attention Block consists of traditional convolutional layers, temporal attention models and spatial attention models. The input is a sports match video of consecutive image frames, each capturing a different moment in the match. First, each video frame passes through a convolutional neural network for feature extraction, yielding a spatial feature representation at each time step (each frame). These feature maps reflect the different objects and elements in the scene, such as players, the ball and the field.

Temporal attention operates on feature maps across time steps to model the temporal relationships and dynamic changes of players and the ball, helping the model focus on player movement, interactions and ball trajectory. For example, the model can learn how players typically chase the ball, or how the ball's speed and direction change over time. Spatial attention operates on feature maps at different locations within the same time step to capture the position and pose of the players and ball, helping the model distinguish players from other objects such as field boundaries or spectator stands.

During feature fusion, the generated spatiotemporal attention weights are multiplied with the original feature maps to produce feature maps adjusted to the spatiotemporal dynamics; these capture the movement and interaction of players and the ball during a match. The enhanced feature maps are then used by the multi-target detection and tracking modules: the detection module uses them to detect and locate players and the ball, while the tracking module uses them for continuous tracking.

Specifically, to learn the temporal weights of a sports video clip, the feature detectors of multiple clips are transformed into multiple feature detectors of a single clip, turning the learning of temporal weights into the learning of channel weights. Spatial average pooling then compresses the spatial information, and fully connected layers learn the temporal attention features and refine the weights of the transformed feature detectors. Spatial attention information is then learned through average pooling and convolutional layers, and the detection results are finally output. Spatiotemporal attention improves detection accuracy and tracking stability, especially in dynamic and complex sports scenes, and helps the model better understand player interactions, tactical strategies and game progression, capturing key moments and important dynamics of the game.
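A rough sketch of this block is given below, assuming clip features of shape (batch, time, channels, height, width). Temporal weights are learned as channel weights after folding the time steps into channels (pooling followed by fully connected layers), and spatial weights come from channel-pooled maps passed through a convolution, mirroring the structure described above; the layer sizes and the use of both mean- and max-pooled spatial statistics are illustrative choices, not the precise architecture:

```python
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    """Illustrative sketch of the Spatial-Temporal Attention Block.

    Input x: (B, T, C, H, W) clip features from a convolutional backbone.
    Temporal attention is learned as channel attention after folding the
    T clip features into channels; spatial attention comes from pooled
    feature maps passed through a small convolution.
    """
    def __init__(self, channels, clip_len, reduction=8):
        super().__init__()
        folded = channels * clip_len  # multiple clip detectors -> one clip
        self.temporal_fc = nn.Sequential(
            nn.Linear(folded, folded // reduction), nn.ReLU(inplace=True),
            nn.Linear(folded // reduction, clip_len), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):
        b, t, c, h, w = x.shape
        # Temporal weights: spatial average pooling compresses H x W,
        # then FC layers produce one weight per time step.
        pooled = x.mean(dim=(3, 4)).reshape(b, t * c)    # (B, T*C)
        tw = self.temporal_fc(pooled).view(b, t, 1, 1, 1)
        x = x * tw
        # Spatial weights: mean- and max-pool over channels per frame,
        # then a 7x7 convolution yields one weight per location.
        flat = x.reshape(b * t, c, h, w)
        stats = torch.cat([flat.mean(1, keepdim=True),
                           flat.amax(1, keepdim=True)], dim=1)
        sw = self.spatial_conv(stats)                    # (B*T, 1, H, W)
        return (flat * sw).reshape(b, t, c, h, w)
```

For example, `SpatioTemporalAttention(channels=256, clip_len=8)` maps a (2, 8, 256, 14, 14) feature tensor to a re-weighted tensor of the same shape.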
Local adaptive correlation filters
LAC filters can be used in multi-target detection to improve accuracy and robustness. By filtering out interfering signals, an LAC filter makes the characteristics of the target signal more prominent. In multi-target detection, targets are affected by many kinds of noise and interference, such as illumination changes, occlusion and deformation; the LAC filter adaptively adjusts its parameters and structure according to the characteristics of the local target region, allowing it to cope with different interference conditions. Because filtering is targeted to the local region, computation and interference in non-target areas are reduced, which lowers computational complexity and improves detection efficiency. It should be noted, however, that the performance of an LAC filter depends heavily on its algorithm and parameter design. For example, an improperly set learning rate prevents timely adaptation to illumination changes, weakening the filter response, reducing tracking accuracy or even causing drift, and a search window that cannot adapt to target scale changes leads to partial target loss or an insufficient search area, harming tracking continuity. In practice, the filter must therefore be designed and tuned according to the specific task and data characteristics to obtain good detection results.
Therefore, in the proposed framework, we design a composite correlation filter optimized for distortion-tolerant pattern recognition, based on the characteristics of sports videos, to identify targets in video frames. The detection algorithm based on this composite correlation filter is robust to pose changes and appearance modifications of objects, as well as to scene noise, lighting changes, and target occlusion. The algorithm begins with an object selection phase, then formulates optimal correlation filters for reliable object detection and target position estimation, after which a composite locally adaptive correlation filter is synthesized. Our algorithm also incorporates an automatic re-initialization mechanism that can re-establish tracking after a failure. Figure 4 illustrates the flow chart of the LAC-filter-based algorithm, and we detail its steps below.
Firstly: For each object, select a small target, denoted as \(T_i(x, y)\), from the captured scene frame, \(F_i(x, y)\), containing the object to be tracked. Here, x and y represent the location coordinates of the target.
Secondly: Create the optimal correlation filter, \(H_i(x, y)\), to ensure reliable detection and estimation of the position of the target, \(T_i(x, y)\), in the observed local frame, \(L_i(x, y)\).
Thirdly: Synthesize the composite local adaptive correlation filter, \(P_i(x, y)\), using the following procedure. First, objects are detected and localized within the observed local frame, \(L_i(x, y)\), using the \(H_i(x, y)\) filter. If the resulting correlation peak DC exceeds a predefined threshold (DC > DCrec), the target is deemed successfully detected, \(T_i(x, y)\) is added to the set T, and the recursion halts. Otherwise, the target, \(S_i(x, y)\), corresponding to the false peak is added to the set S, and the composite filter, \(P_i(x, y)\), is synthesized. Target detection and localization are then performed recursively within the observed local frame, \(L_i(x, y)\), using the \(P_i(x, y)\) filter until the condition DC > DCrec is met.
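The recursion of this third step can be sketched as follows with a simple FFT-based correlator. The synthesis of the optimal filter \(H_i\) and the composite filter \(P_i\) is abstracted behind a synthesize_filter callback, and the normalized-peak definition of DC and the threshold value are our own illustrative assumptions:

```python
import numpy as np

def correlate(filt, frame):
    """Cross-correlate a filter with a local frame in the Fourier domain."""
    F = np.fft.fft2(frame)
    H = np.fft.fft2(filt, s=frame.shape)
    return np.real(np.fft.ifft2(F * np.conj(H)))

def extract_patch(frame, center, shape):
    """Crop a target-sized patch around a correlation peak (edge-clipped)."""
    h, w = shape
    r0, c0 = max(0, center[0] - h // 2), max(0, center[1] - w // 2)
    return frame[r0:r0 + h, c0:c0 + w]

def detect_with_lac(local_frame, target, synthesize_filter,
                    dc_rec=0.8, max_iter=10):
    """Recursive LAC detection, step three of the algorithm above.

    synthesize_filter(target, false_targets) is an assumed callback that
    returns the optimal filter H_i when false_targets is empty and the
    composite filter P_i thereafter; dc_rec plays the role of DCrec.
    """
    false_targets = []
    for _ in range(max_iter):
        filt = synthesize_filter(target, false_targets)
        corr = correlate(filt, local_frame)
        peak = np.unravel_index(corr.argmax(), corr.shape)
        # Normalized correlation peak used as the detection confidence DC.
        dc = corr[peak] / (np.linalg.norm(filt) *
                           np.linalg.norm(local_frame) + 1e-12)
        if dc > dc_rec:
            return peak                    # target detected at the peak
        # False peak: add the patch S_i under it and re-synthesize P_i.
        false_targets.append(extract_patch(local_frame, peak, target.shape))
    return None                            # triggers re-initialization
```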
Experiments
Implementation details
To evaluate detection performance, we use four benchmark datasets: MOT2015, MOT2016, PET2009, and MS COCO. MOT2015 and MOT2016 contain 22 (11 training, 11 test) and 14 (7 training, 7 test) video sequences, respectively, recorded in unconstrained environments; the ground-truth annotations of the training sequences are publicly released. PET2009 contains 20 video sequences covering a variety of scenes and environments, including lighting changes, weather changes (for outdoor scenes), and complex backgrounds, providing rich challenges. MS COCO consists of 82K training and 40K validation images belonging to 80 classes; it is a widely used and relatively difficult object detection dataset, since object sizes are small compared with other datasets. All experiments were conducted on a PC equipped with an NVIDIA GeForce RTX2080 Ti GPU (11 GB VRAM), an Intel Core i7-8086K CPU (6 cores, 4.0 GHz), and 32 GB RAM.
Comparisons with state-of-the-art methods
We conducted a comprehensive comparison of our proposed method with several state-of-the-art approaches, evaluating them on two key aspects: detection accuracy and efficiency. This rigorous assessment provides an objective comparison against current object detection techniques and highlights the strengths and potential areas for improvement of our approach, offering useful insights for researchers and practitioners in the domain.
In terms of accuracy, we meticulously assessed the performance of our method alongside the other approaches in accurately identifying and localizing objects within the given datasets. This evaluation involved measuring metrics such as accuracy, recall, F1-score, and Area Under the Curve (AUC). Accuracy is one of the most commonly used classification performance indicators, giving the proportion of samples the model predicts correctly out of the total number of samples. Recall focuses on the ability to identify positive classes and suits tasks concerned with missed detections. F1-score jointly considers precision and recall and suits class-imbalance problems. AUC measures the overall discriminative ability of the model and is more robust to threshold selection and changes in class distribution. Additionally, we considered the ability of these methods to handle challenging scenarios, occlusions, and variations in object appearance. By conducting extensive experiments and analyzing the results, we gained insight into how our method compares in detection accuracy. In Table 1, we compare the accuracy of different methods on the MOT2015 and MOT2016 datasets, and Table 2 shows the comparison between our method and others on the PET2009 and MS COCO datasets. From the tables, it is evident that our method consistently outperforms the other state-of-the-art methods in accuracy on all datasets (MOT2015, MOT2016, PET2009, and MS COCO). This indicates that our approach is more effective for object tracking and detection, achieving higher accuracy across scenarios and suggesting its potential for real-world detection and tracking applications.
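For reference, these classification-style metrics can be computed with scikit-learn as below; the label and score arrays and the 0.5 decision threshold are purely illustrative:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])                    # ground truth
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.8, 0.6, 0.1])   # model scores
y_pred = (y_score >= 0.5).astype(int)        # threshold chosen for illustration

print("accuracy:", accuracy_score(y_true, y_pred))
print("recall:  ", recall_score(y_true, y_pred))
print("F1:      ", f1_score(y_true, y_pred))
print("AUC:     ", roc_auc_score(y_true, y_score))   # threshold-independent
```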
Efficiency was another pivotal dimension of comparison. We evaluated the computational efficiency of our proposed method and the competing approaches, taking into account factors such as parameter size, FLOPs, inference time, and training time, aiming to determine how efficiently each method performs, particularly in real-time or resource-constrained scenarios. Efficiency comparisons between our method and state-of-the-art methods on MOT2015, MOT2016, PET2009 and MS COCO are shown in Tables 3 and 4. In both the MOT2015/MOT2016 and PET2009/MS COCO experiments, our method achieved a significant reduction in model size and training time with lower computational complexity and faster inference, indicating superior efficiency compared to the state-of-the-art methods and making it a promising choice for practical applications with resource constraints. It should be noted that the parameter size for MOT2015 was 336.61, larger than the 310.79 for MOT2016, yet MOT2016 required more training time than MOT2015. A likely reason is that the MOT2016 dataset is more complex in terms of scene complexity, target density, environmental changes, background interference, and target motion patterns. This increased difficulty slows training convergence and requires more iterations to learn effective feature representations and decision boundaries; as a result, although the parameters were slightly fewer, training took longer than on MOT2015.
Ablation study
In this section, we take an in-depth look at how each component contributes to the overall object detection performance of the proposed model. It should be noted that in order to illustrate the effectiveness of our method for sports videos, in the ablation experiment, we deliberately added a large-scale multi-object tracking dataset named SportsMOT, consisting of 240 video clips from 3 categories (i.e., basketball, football and volleyball).
In our experiments, we tested various combinations of these components and compared their detection errors. Table 5 presents the results of ablation experiments for the proposed method across the datasets. Ablation experiments systematically disable or alter specific components of the method to assess their impact on performance. Here, the following components are analyzed: the CRF (Conditional Random Field) module, the LAC filter, and the S-T Attention (Spatial-Temporal Attention) Block. The experiments are conducted on five datasets: MOT2015, MOT2016, PET2009, MS COCO, and SportsMOT, with four error metrics measured for each: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Root Mean Square Error (RMSE), and Mean Squared Error (MSE). MAE measures the average error between predicted and true values; MAPE measures the relative error as a percentage of the true value, reflecting the proportion of the prediction error to the true value; RMSE measures how far predictions deviate from the true values; and MSE is the average of the squared prediction errors.
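For clarity, the four error metrics can be computed as in the following NumPy sketch; the small epsilon guarding against division by zero in MAPE is our own illustrative choice:

```python
import numpy as np

def error_metrics(y_true, y_pred):
    """MAE, MAPE, RMSE and MSE as reported in the ablation tables."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_pred - y_true
    mse = np.mean(err ** 2)
    return {
        "MAE":  np.mean(np.abs(err)),
        "MAPE": 100.0 * np.mean(np.abs(err) / (np.abs(y_true) + 1e-12)),
        "RMSE": np.sqrt(mse),
        "MSE":  mse,
    }
```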
The results of the ablation experiment are shown in Table 5, and Fig. 5 shows example images of detection results on the SportsMOT dataset. The comparison shows that the CRF module effectively captures the relationships and contextual information between targets, improving detection accuracy; the LAC filter enhances robustness, especially under illumination changes or high image noise; and the Spatial-Temporal Attention Block reduces interference between targets and effectively handles the occlusion and drift caused by target interactions. The full model, incorporating all components, shows significant improvements across all datasets, achieving the lowest MAE, MAPE, RMSE, and MSE values and indicating superior accuracy and predictive power.
In summary, the ablation experiments demonstrate the impact of different components or techniques on the proposed method’s performance. Our method, which combines all components, consistently outperforms other configurations, achieving the best results across all datasets and metrics. This suggests that each component contributes positively to the model’s overall effectiveness, with the combination yielding the most significant improvements in accuracy and predictive power.
Conclusions
In view of the characteristics of sports videos, we proposed a comprehensive multi-target detection framework that combines CRF, LAC filters and a spatiotemporal attention mechanism to significantly improve multi-target detection in challenging scenarios such as sports video analysis. We add a CRF layer to the output of the target detection network to model the interrelationships and contextual information between objects. To further enhance detection performance, especially in complex scenes and under target interactions, we incorporate LAC filters and a Spatial-Temporal Attention Block into the framework. Experimental results show that, compared with existing advanced methods, our proposed method performs better in terms of both accuracy and efficiency.
Data availability
All data generated or analysed during this study are included in this published article.
References
Mauri, A., Khemmar, R., Decoux, B., Haddad, M. & Boutteau, R. Lightweight convolutional neural network for real-time 3D object detection in road and railway environments. J. Real-Time Image Process. 19, 499–516 (2022).
Liu, Y., Sun, P., Wergeles, N. & Shang, Y. A survey and performance evaluation of deep learning methods for small object detection. Expert Syst. Appl. 172, 114602 (2021).
Mukilan, P. & Semunigus, W. Human and object detection using hybrid deep convolutional neural network. Signal Image Video Process. 16, 1913–1923 (2022).
Lin, X., Li, C.-T., Sanchez, V. & Maple, C. On the detection-to-track association for online multi-object tracking. Pattern Recognit. Lett. 146, 200–207 (2021).
Matveev, I., Karpov, K., Chmielewski, I., Siemens, E. & Yurchenko, A. Fast object detection using dimensional based features for public street environments. Smart Cities 3, 93–111 (2020).
Liu, L. et al. Deep learning for generic object detection: A survey. Int. J. Comput. Vis. 128, 261–318 (2020).
Kalake, L., Wan, W. & Hou, L. Analysis based on recent deep learning approaches applied in real-time multi-object tracking: A review. IEEE Access 9, 32650–32671 (2021).
Lu, J. et al. A vehicle detection method for aerial image based on yolo. J. Comput. Commun. 6, 98–107 (2018).
Sang, J. et al. An improved yolov2 for vehicle detection. Sensors 18, 4272 (2018).
Redmon, J. & Farhadi, A. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
Wang, C.-Y., Bochkovskiy, A. & Liao, H.-Y. M. Scaled-yolov4: Scaling cross stage partial network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13029–13038 (2021).
Tan, M., Pang, R. & Le, Q. V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10781–10790 (2020).
Duan, K. et al. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6569–6578 (2019).
Dai, Z., Cai, B., Lin, Y. & Chen, J. Up-detr: Unsupervised pre-training for object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1601–1610 (2021).
Yang, Z., Liu, S., Hu, H., Wang, L. & Lin, S. Reppoints: Point set representation for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9657–9666 (2019).
Tian, Z., Shen, C., Chen, H. & He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9627–9636 (2019).
Girshick, R., Donahue, J., Darrell, T. & Malik, J. Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 38, 142–158 (2015).
Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision. 1440–1448 (2015).
Ren, S., He, K., Girshick, R. & Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 28 (2015).
He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision. 2961–2969 (2017).
Cai, Z. & Vasconcelos, N. Cascade r-cnn: High quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 43, 1483–1498 (2019).
Sun, P. et al. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14454–14463 (2021).
Castellano, J., Alvarez-Pastor, D. & Bradley, P. S. Evaluation of research using computerised tracking systems (Amisco® and Prozone®) to analyse physical performance in elite soccer: A systematic review. Sports Med. 44, 701–712 (2014).
Huang, W. et al. Open dataset recorded by single cameras for multi-player tracking in soccer scenarios. Appl. Sci. 12, 7473 (2022).
Scott, A. et al. Soccertrack: A dataset and tracking algorithm for soccer with fish-eye and drone videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3569–3579 (2022).
Li, H., Manickam, A. & Samuel, R. Automatic detection technology for sports players based on image recognition technology: The significance of big data technology in China's sports field. Ann. Oper. Res. (2022).
Huang, C., He, J., Tan, R. & Yu, Z. Research on recognition of football trajectory based on robot vision system. In 2022 2nd International Conference on Networking, Communications and Information Technology (NetCIT). 401–404 (IEEE, 2022).
Manzoor, S. et al. Spt: Single pedestrian tracking framework with re-identification-based learning using the siamese model. Sensors 23, 4906 (2023).
Shin, J., Kim, H., Kim, D. & Paik, J. Fast and robust object tracking using tracking failure detection in kernelized correlation filter. Appl. Sci. 10, 713 (2020).
Kreymer, S. & Bendory, T. Two-dimensional multi-target detection: An autocorrelation analysis approach. IEEE Trans. Signal Process. 70, 835–849 (2022).
Hong, Y. et al. An improved end-to-end multi-target tracking method based on transformer self-attention. Remote Sens. 14, 6354 (2022).
Fang, S., Zhang, B. & Hu, J. Improved mask r-cnn multi-target detection and segmentation for autonomous driving in complex scenes. Sensors 23, 3853 (2023).
Sun, L., Liu, H., Wang, C. & Li, B. Hdt network: A high-resolution range profile multi-target detection and tracking method based on neural network. IET Radar Sonar Navig. 17, 1430–1440 (2023).
Ding, P., Qian, H., Bao, J., Zhou, Y. & Yan, S. L-yolov4: Lightweight yolov4 based on modified rfb-s and depthwise separable convolution for multi-target detection in complex scenes. J. Real-Time Image Process. 20, 71 (2023).
Author information
Authors and Affiliations
Contributions
X.C. and H.Z. conceived the experiments and wrote the original draft, H.Z. and A.S. conducted the experiment(s), B.B. and K.J. analysed the results. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chen, X., Zhang, H., Shankar, A. et al. Multi-target detection and tracking based on CRF network and spatio-temporal attention for sports videos. Sci Rep 15, 6808 (2025). https://doi.org/10.1038/s41598-025-89929-7