Introduction

Multi-Object Tracking (MOT) is a core task in computer vision that estimates the trajectories of multiple objects in video sequences and is an essential technology for applications including autonomous driving, security surveillance, and robotics1,2,3. Although deep learning has significantly improved object detection performance4,5, the data association stage that connects detected objects across frames remains a critical bottleneck for the overall performance of MOT systems. The Tracking-by-Detection (TBD) paradigm6,7, which dominates modern MOT, decomposes the tracking problem into two stages: detecting objects in each frame and associating the detection results across frames to form trajectories. The association stage relies mainly on two cues, motion and appearance. Motion cues are primarily defined as the spatial distance between positions predicted by Kalman filters8 and detection results, whereas appearance cues are computed as the similarity between visual features extracted by Re-Identification (ReID) networks9.

Typical TBD trackers combine these two costs and match tracks to detections through optimal assignment techniques such as the Hungarian algorithm10. However, this approach has two fundamental limitations in how it combines and uses the two cues. The first is the limitation of fixed-weight strategies. Early approaches combined motion and appearance cues with fixed weights. DeepSORT11 combines the Mahalanobis and cosine distances at a fixed ratio, and this approach exhibits stable performance under general tracking conditions. However, the reliability of each cue varies significantly depending on the characteristics of the tracking environment. In crowded environments such as those of the MOT20 benchmark12, spatial overlap reduces the discriminative power of motion cues, and in environments dominated by non-linear motion such as DanceTrack13, motion blur destabilizes appearance features. In other words, fixed weights fail to reflect environment-specific changes in reliability, thereby causing association errors. ByteTrack took a different route and introduced a hierarchical structure based on detection confidence to enhance occlusion recovery. However, it relies only on Intersection over Union (IoU) at all stages and therefore does not utilize appearance information.

The second problem is the absence of a motion-appearance consistency check. An ideal association presupposes consistency, where motion predictions and appearance similarity agree. However, existing methods simply sum the two costs without explicitly verifying the coherence between them and fail to fully utilize the relational information between the two cues14,15. A high IoU with a low ReID similarity is an early indicator of mismatches in crowded environments, whereas the opposite case suggests reappearance after occlusion. Such inconsistency patterns provide useful contextual information for determining the matching reliability of each track-detection pair; however, their utilization has been limited in prior research. Some studies have attempted consistency checks. OC-SORT16 added a velocity consistency check that compares past and current observed velocities; however, this approach is biased toward motion cues only. BoT-SORT17 uses Mahalanobis-distance-based prediction range gating; however, by blocking candidates outside the range, it risks prematurely excluding actual objects in situations with abrupt direction changes or large prediction errors. Consequently, existing trackers do not exploit motion-appearance consistency information and struggle to balance False Positive (FP) suppression against False Negative (FN) recovery.

To overcome these limitations, we propose three complementary modules: Balanced Cascade Association (BCA), Condition-Aware Matching with Weights (CAMW), and Motion-Appearance Consistency Check (MACC). BCA establishes a stable association foundation by combining ReID feature integration, balanced 50:50 fusion of motion and appearance cues, and a detection confidence-based two-stage hierarchical structure on top of a standard geometric tracking baseline. CAMW is a 2-level cost selection mechanism that defines pre-optimized parameter sets for each environment and conditionally selects environment-specific cost matrices by evaluating the tracking quality of individual track-detection pairs in real time. MACC classifies the consistency patterns of motion and appearance cues into four cases and applies case-specific cost refinement, which relaxes the rigidity of hard gating while balancing FP suppression and FN recovery. The three modules are applied sequentially to establish clear role separation.

Fig. 1

Tracking performance comparison on the MOT17 test set. Motion-based methods are indicated by blue circles, and appearance-enhanced methods are indicated by orange circles. ClarityTrack, marked in pink, achieves the highest AssA of 67.7 and HOTA of 66.5 among all compared methods.

Figure 1 shows that the proposed ClarityTrack achieves superior performance in Association Accuracy (AssA) and Higher Order Tracking Accuracy (HOTA) compared to existing motion-based methods and appearance-enhanced methods on the MOT17 test set. The main contributions of this study are as follows:

  1. We propose ClarityTrack, an environment-aware rule-based system that performs decision-making by leveraging the balanced, crowded, and unstable characteristics of the tracking environment as prior information.

  2. We design a 2-level cost selection mechanism that defines candidate parameter sets optimized at the environment level and dynamically applies them at the individual track-detection pair level in real time. The conditional switching structure of the CAMW module adapts to each matching situation, thereby enhancing the overall matching reliability of the system.

  3. We systematically present a dataset-specific hyperparameter optimization methodology. We validated the generalization performance of the proposed model on the MOT17 benchmark18 and performed rule combination analysis and hyperparameter fine-tuning for the crowded environment of MOT20 and the unstable environment of DanceTrack to derive optimized strategies for each environment.

  4. We demonstrated the superiority of the proposed method on the MOT17, MOT20, and DanceTrack standard benchmarks. We achieved consistent performance improvements in HOTA, IDF1, and AssA19,20, and confirmed the meaningful contribution of ClarityTrack, particularly on DanceTrack, which features complex and dynamic movements. These results indicate that the proposed method effectively overcomes the limitations of existing TBD approaches.

Related work

Motion-appearance cost combination strategy

The approach of combining motion and appearance cues in the TBD framework is a critical factor in determining the tracking performance. Early strategies such as DeepSORT and StrongSORT21 predominantly employed a linear combination of the two costs with fixed weights. Although this approach is straightforward to implement, it has a structural limitation in that it cannot adapt to situations where the reliability of each cue changes rapidly, such as in crowded environments or non-linear motion environments. To overcome this, studies on adaptive weight adjustment based on detection quality, appearance similarity consistency, and spatial proximity22,23,24, as well as dynamic fusion that optimizes the prediction ratio of the Kalman filter25,26, have been proposed. However, these approaches share a common structural limitation in that they are ultimately fused into a single cost matrix. More recently, the combination strategy has been extended, with a method proposed in the OVMOT domain that groups appearance costs and motion into a joint similarity and updates them according to dormant and vibrant states27. In contrast, a pure motion-based approach has also been proposed that excludes appearance and extends motion with pseudo-depth cues to handle occlusion28. In the Single Object Tracking field, adaptive fusion is also actively being explored, including integrating cues via shared queries29, merging multiple cues30, and performing modality confidence-based multi-expert fusion31. Despite the variety of cue integration studies, existing methods remain confined to continuous weight adjustment within a single cost matrix. Therefore, a rule-based mechanism is required that conditionally selects explicitly different cost matrices according to the dataset-level prior of the target environment.

Consistency verification and quality-aware association

In MOT, efforts to verify the consistency between predictions and observations have been continuously pursued to enhance the reliability of association decisions. Early approaches such as DeepSORT and BoT-SORT employed hard gating using Kalman filter prediction ranges; however, this introduced the side effect of prematurely excluding actual objects during abrupt motion changes. Therefore, OC-SORT and Deep OC-SORT32 enhanced the robustness by verifying the velocity consistency or integrating appearance information. More recently, association strategies that explicitly evaluate quality have been investigated, including track quality grading33, selective utilization of weak cues34,35, and detection confidence incorporation36. However, existing methods have primarily focused on motion-domain verification or post-association state updates, and explicit mechanisms that simultaneously consider motion and appearance within the association decision process itself to classify consistency patterns remain limited. The concept of consistency verification is being extended across diverse tracking domains. In satellite video tracking, discrepancies between single exponential smoothing (SES)-based motion estimation and observations are detected to automatically correct trajectories upon occlusion37, while in SOT research, inaccurate cues are filtered through low-threshold and local multi-peak filtering38. Quality management based on dormant and vibrant states has also been introduced in OVMOT. Furthermore, a study achieving near-linear complexity that maintains long-term temporal consistency via summary tokens39 is complementary to approaches that verify immediate frame-level motion-appearance consistency. 
Although quality-aware strategies are being actively explored across various domains, mechanisms that simultaneously consider motion and appearance information during the association decision process to explicitly classify consistency patterns and differentially adjust matching costs remain insufficient.

Hierarchical association and structural tracking design

Single-stage association in MOT fails to clearly distinguish uncertain matchings, leading to a cascading degradation of overall trajectory quality. To overcome this, DeepSORT introduced cascade matching based on track age, and ByteTrack adopted a two-stage association based on detection confidence to suppress false positives. More recently, hierarchical structures have been further refined, including four-stage matching that combines track state and detection quality40, scene space characteristic-based split tracking41,42,43, and tracklet-level association44,45. In addition, an approach has emerged that inherits the existing two-stage structure of OC-SORT comprising general association and observation-centric recovery (OCR), while applying depth volume IoU and quantized pseudo-depth measurement (QPDM) to the matching cost at each stage to enhance occlusion recovery performance. This method represents a recovery-based hierarchy that uses track matching status as its criterion, which differs from approaches that separate hierarchies based on detection confidence. Furthermore, an approach has also been proposed that extends to multi-class generalized tracking by combining occlusion-aware re-identification and road structure-based trajectory correction on top of ByteTrack’s two-stage association structure46. Track state-based hierarchical management that classifies trajectories into dormant and vibrant states has also been applied in OVMOT. Structural hierarchization for achieving computational efficiency and accuracy is also actively pursued in SOT research. Methods have been proposed that dynamically activate only the optimal single layer after the saturation layer, or simultaneously generate trackers of varying complexity through a dual-branch framework47. A study achieving efficient tracking using only a small number of samples and motion mimicry data augmentation (AMMC) without frame-by-frame annotation has also been reported48. 
This network-level structural hierarchization shares its design philosophy with hierarchization in the data association matching process. However, existing hierarchical methods have the limitation that the cost composition strategy at each hierarchy level is fixed. Therefore, a design is required that, on top of a clear detection quality-based hierarchical structure, evaluates the quality of each track-detection pair and systematically integrates conditional cost matrix selection and consistency verification, thereby providing a stable association backbone.

Proposed method

We propose ClarityTrack, an environment-aware rule-based system in which three modules operate sequentially. In contrast to existing trackers that rely on fixed-weight fusion approaches, ClarityTrack builds an interpretable framework that pre-defines parameter sets reflecting environmental characteristics, conditionally switches them to suit individual matching situations, and systematically verifies the consistency between motion and appearance cues. Each module maintains a clear role separation while sequentially cooperating to simultaneously secure dataset-specific optimization and matching reliability.

System overview

Figure 2 shows an overview of the ClarityTrack system, in which the three modules (BCA, CAMW, and MACC) operate sequentially. The processing flow is as follows. Tracks from the previous frame are propagated to predicted positions in the current frame through an 8-dimensional Kalman filter and Oriented FAST and Rotated BRIEF (ORB)49-based Camera Motion Compensation (CMC). The YOLOX detection results of the current frame are separated into high-confidence and low-confidence sets based on their confidence scores. In the first stage, the predicted tracks are associated with the high-confidence detections. BCA combines the IoU-based motion cost and the cosine distance-based appearance cost at a 50:50 ratio to compute the baseline cost matrix. CAMW computes environment-specific cost matrices in parallel and evaluates the tracking quality of each track-detection pair to generate an environment-specific mask, \(\:{M}_{env}\), a binary matrix indicating whether each pair satisfies the conditions. In the conditional selection stage, environment-specific costs are applied to activated pairs, and baseline costs are maintained for deactivated pairs. MACC verifies the motion-appearance consistency of each pair and adds adjustment values to the costs. The Hungarian algorithm then performs optimal matching on the final costs, separating matched from unmatched tracks. In the second stage, tracks left unmatched in the first stage are associated with low-confidence detections based only on the IoU. After matching with the Hungarian algorithm, successfully matched tracks are updated and tracks that fail to match transition to the Lost state. When all association stages are complete, the matching results from each stage and the newly created tracks are integrated to output the updated track set.
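The confidence-based split that feeds the two association stages can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `split_by_confidence`, the detection layout, and both threshold values are assumptions introduced here, and the lower cutoff that discards very weak detections is a common convention, not something stated in the text.

```python
from typing import List, Tuple

# A detection is ((x1, y1, x2, y2), confidence); this layout is assumed for illustration.
Detection = Tuple[Tuple[float, float, float, float], float]


def split_by_confidence(
    detections: List[Detection],
    high_thresh: float = 0.6,  # hypothetical value
    low_thresh: float = 0.1,   # hypothetical value
) -> Tuple[List[Detection], List[Detection]]:
    """Separate detections into the high- and low-confidence sets."""
    high = [d for d in detections if d[1] >= high_thresh]
    low = [d for d in detections if low_thresh <= d[1] < high_thresh]
    return high, low


detections = [
    ((10.0, 20.0, 50.0, 120.0), 0.92),
    ((200.0, 40.0, 240.0, 140.0), 0.35),
    ((300.0, 60.0, 330.0, 130.0), 0.04),
]
high_dets, low_dets = split_by_confidence(detections)
print(len(high_dets), len(low_dets))
```

In this sketch the high-confidence set would go to the first stage (motion plus appearance) and the low-confidence set to the second stage (geometry only).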

Fig. 2

Overview of the entire ClarityTrack system.

BCA

Existing MOT approaches exhibit biased strategies for utilizing motion and appearance cues. Some abandon the discriminative power of appearance cues by using only motion information or reveal vulnerability in fast-motion situations by over-relying on appearance information. Although hierarchical association techniques based on detection confidence have been widely adopted, the impact of the combination ratio of the two cues at each association stage on the tracking performance tends to be overlooked.

To address these limitations, we constructed a BCA, a hierarchical association structure that strategically combines validated techniques. BCA is built upon a standard geometric tracking system comprising an 8-dimensional Kalman filter, ORB-based CMC, and single-stage Hungarian matching based on Height Modulated IoU (HMIoU). This Baseline performs association using only the geometric cost computed as \(\:1-HMIoU\), without utilizing appearance information. Each component is as follows. The 8-dimensional Kalman filter uses a state vector comprising the object’s center coordinates \(\:(u,\:v)\), width \(\:w\), height \(\:h\), and the velocities of each component \(\:(\dot{u},\:\dot{v},\:\dot{w},\:\dot{h})\). The 8-dimensional configuration independently tracks the rate of change of all geometric attributes to improve the prediction accuracy in non-uniform motion situations. In particular, the velocity components of width and height contribute to prediction stability in situations where an object abruptly changes direction or undergoes size changes. ORB-based CMC computes an affine transformation matrix through feature point matching between consecutive frames and applies it to the Kalman filter prediction results to preemptively eliminate the prediction errors caused by camera motion. The ORB approach is feature point-based, offering high computational efficiency and robust matching under illumination changes or partial occlusion. HMIoU emphasizes vertical alignment by multiplying the general IoU and the height-direction IoU, reflecting the importance of height information as a critical cue in pedestrian tracking. It provides higher discriminability compared to standard IoU in crowded environments and occlusion situations.
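The HMIoU description above (standard IoU multiplied by a height-direction IoU) can be illustrated with a short sketch. It assumes boxes in `(x1, y1, x2, y2)` corner format and interprets the height-direction IoU as the 1-D IoU of the vertical extents; both are assumptions made here for illustration, and the function names are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def iou(a: Box, b: Box) -> float:
    """Standard 2-D intersection over union."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def height_iou(a: Box, b: Box) -> float:
    """1-D IoU of the vertical extents [y1, y2] (assumed definition)."""
    inter = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    union = (a[3] - a[1]) + (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def hmiou(a: Box, b: Box) -> float:
    """Height Modulated IoU: standard IoU scaled by the height-direction IoU."""
    return iou(a, b) * height_iou(a, b)


track = (0.0, 0.0, 10.0, 20.0)
shifted_down = (0.0, 10.0, 10.0, 30.0)   # vertically misaligned
shifted_right = (5.0, 0.0, 15.0, 20.0)   # horizontally shifted, same vertical extent

hm_down = hmiou(track, shifted_down)
hm_right = hmiou(track, shifted_right)
print(hm_down, hm_right)
```

Both candidates have the same standard IoU of 1/3, but the vertically misaligned box is penalized more strongly, which is the behavior the text attributes to HMIoU.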

BCA introduces three core association strategies on top of this geometric tracking foundation. First, ReID features are integrated in the first stage association to secure appearance-based discriminability. Unlike performing association using only geometric costs, the first stage association of BCA computes the cosine distance \(\:{D}_{reid}[i,\:j]\) between ReID feature vectors of each track-detection pair \(\:(i,\:j)\) as the appearance cost and combines it with the motion cost. Second, motion and appearance costs are balanced at a 50:50 ratio. The baseline cost for each track-detection pair \(\:(i,\:j)\) is given by Eq. (1).

$$\:{C}_{base}[i,\:j]\:=\:0.5\times\:{D}_{reid}[i,\:j]+0.5\times\:{D}_{iou}[i,\:j]$$
(1)

\(\:{D}_{reid}[i,\:j]\) is the cosine distance between the ReID feature vectors of track \(\:i\) and detection \(\:j\) and \(\:{D}_{iou}[i,\:j]\) is \(\:1-HMIoU\). At this time, pairs with low spatial overlap were set at high costs to be excluded from matching. This balanced fusion prevents dependency bias toward a specific cue, thereby reducing the risk of mismatches in crowded environments or fast-motion situations. In other words, when objects with similar appearances are in proximity, motion cues act complementarily, and when position prediction is unstable owing to fast-motion, appearance cues act complementarily, simultaneously reducing FP and FN. The validity of the 50:50 weight setting was experimentally verified through an ablation study, as described in Sect.  4.5. Third, differentiated association strategies based on detection quality are applied through a detection confidence-based two-stage hierarchical structure. Tracks unmatched in the first stage association are passed to the second stage association, where matching with low-confidence detections is attempted. The second stage association uses only HMIoU-based geometric costs and excludes ReID information. Since low-confidence detection results are highly likely to distort appearance features due to occlusion or motion blur, association is performed using only geometric cues to prevent track contamination caused by inaccurate appearance information. This prevents trajectory fragmentation of temporarily occluded objects while suppressing incorrect matching owing to low-quality appearance information. Both association stages perform optimal matching with the Hungarian algorithm by applying different thresholds to the first stage association and second stage association. Pairs whose matching costs exceeded the threshold were excluded from the matching candidates to prevent incorrect connections of spatially distant objects. 
This two-stage hierarchical structure applies appropriate association strategies based on the detection quality to enhance the robustness of the tracking system. Specifically, the first stage association achieves accurate matching by utilizing motion and appearance in a balanced manner for high-quality detections, whereas the second stage association conservatively processes low-quality detections to minimize trajectory fragmentation. By combining ReID integration, balanced 50:50 fusion, and a detection confidence-based two-stage hierarchical structure on top of a standard geometric tracking foundation, BCA secures stable tracking performance across diverse detection quality conditions. This provides a robust foundation upon which the subsequent modules, CAMW and MACC, perform environment-specific optimization and consistency verification.
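A minimal sketch of the first-stage baseline cost of Eq. (1) follows. The cosine distance and the 50:50 fusion come from the text; the gating rule that assigns a high cost to pairs with low spatial overlap is described only qualitatively, so `min_overlap` and `blocked_cost` are hypothetical parameters chosen for illustration.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity between two ReID feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat a degenerate feature as maximally dissimilar (assumption)
    return 1.0 - dot / (norm_u * norm_v)


def base_cost(
    d_reid: Matrix,
    d_iou: Matrix,
    min_overlap: float = 0.1,   # hypothetical gate on HMIoU
    blocked_cost: float = 1.0,  # hypothetical cost for gated pairs
) -> Matrix:
    """Eq. (1): C_base = 0.5 * D_reid + 0.5 * D_iou, with D_iou = 1 - HMIoU."""
    cost = []
    for reid_row, iou_row in zip(d_reid, d_iou):
        row = []
        for dr, di in zip(reid_row, iou_row):
            c = 0.5 * dr + 0.5 * di
            if 1.0 - di < min_overlap:  # overlap too low: exclude from matching
                c = blocked_cost
            row.append(c)
        cost.append(row)
    return cost


# One track against two detections.
d_reid = [[0.2, 0.8]]
d_iou = [[0.1, 0.95]]
c_base = base_cost(d_reid, d_iou)
print(c_base)
```

The second stage would reuse only the geometric term (`d_iou`) with its own threshold, since ReID information is excluded for low-confidence detections.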

CAMW

Tracking benchmarks can be broadly classified according to three environmental characteristics. In balanced environments with moderate crowd densities, both the motion and appearance cues can be trusted. In contrast, crowded environments are dominated by frequent occlusions and similar appearances, and unstable environments suffer from degraded reliability of both cues owing to unpredictable non-linear motion and similar appearances. Under these heterogeneous conditions, the balanced fusion of BCA provides stability but cannot achieve optimal performance, as it does not fully reflect environment-specific characteristics. Therefore, we propose CAMW, an environment-specific cost selection mechanism. ClarityTrack adopts a 2-level design for environment adaptation. At the environment level, the tracking environment type is pre-specified based on the known characteristics of the target domain, such as crowd density and motion linearity. In this study, MOT17 corresponds to a balanced environment, MOT20 to a crowded environment, and DanceTrack to an unstable environment. This classification determines the conditional logic of CAMW and the pre-optimized weight pair \(\:({w}_{reid}^{env},\:{w}_{iou}^{env})\) for each environment. Unlike existing adaptive methods, which continuously compute weights and fuse them into a single cost, CAMW adopts a hard-switching approach that computes two independent cost matrices in parallel and then selects between them based on the conditions. Specifically, the first matrix \(\:{C}_{base}\) follows the balanced fusion cost of Eq. (1). The second matrix \(\:{C}_{CAMW}\) applies environment-specific optimized weights, and its cost for each pair \(\:(i,\:j)\) is given by Eq. (2).

$$\:{C}_{CAMW}[i,\:j]\:=\:{w}_{reid}^{env}\times\:{D}_{reid}[i,\:j]+{w}_{iou}^{env}\times\:{D}_{iou}[i,\:j]$$
(2)

Equation (2) maintains the same linear combination structure as Eq. (1), but the weights are optimized for each environment. The weight pair \(\:({w}_{reid}^{env},\:{w}_{iou}^{env})\) is pre-optimized according to the environmental characteristics, and the relative importance of the appearance and motion information is adjusted. \(\:{C}_{base}\) maintains a 50:50 balance for all pairs, whereas \(\:{C}_{CAMW}\) uses weight ratios suited to the environmental characteristics. At the individual track-detection pair level, the key aspect of CAMW is that \(\:{C}_{CAMW}\) is not applied uniformly to all matching candidates but is conditionally selected. Applying it identically to all matching candidates would merely amount to changing the fixed weights and would fail to handle the diverse tracking situations within a frame individually. The conditional selection mechanism evaluates the tracking quality indicators of each track-detection pair in real time and adopts an environment-specific strategy that applies \(\:{C}_{CAMW}\) only to pairs where substantial improvement is expected at each frame, while independently determining whether to maintain \(\:{C}_{base}\) for pairs with uncertain matching. Therefore, although the environment strategy is pre-configured, the cost selection for individual track-detection pairs operates dynamically, enabling differentiated and conservative responses to the diverse tracking conditions coexisting within the same scene.

Fig. 3

Conditional selection mechanism of CAMW. (a) Pairs with clear tracking conditions within the same frame select environment-specific strategies, (b) Pairs with ambiguous conditions maintain balanced fusion criteria to ensure stability. Dashed boxes represent tracks and solid boxes represent detections.

Figure 3 shows the conditional selection mechanism of the CAMW. The main frame presents two pairs under different tracking conditions within the same scene. (a) In Example 1, track T1 is matched with detection D1, which shows clear appearance features and high spatial coherence. Because the ReID distance is low at 0.21 and the IoU score is high at 0.88, satisfying all conditions, the system selects \(\:{C}_{CAMW}\) to enhance the matching accuracy. (b) In Example 2, track T2 has ambiguous tracking conditions owing to the presence of adjacent objects with similar appearances, with a high ReID distance of 0.65 and a low IoU score of 0.42, failing to satisfy the conditions; therefore, the system maintains \(\:{C}_{base}\) to suppress the risk of mismatches. This implies that the CAMW operates as a conditional mechanism that independently evaluates the tracking quality of each pair and selects \(\:{C}_{CAMW}\) or \(\:{C}_{base}\) even within the same frame.

In balanced environments, a simple cost comparison strategy is adopted. Because both the motion and appearance information can be trusted, the matrix can be selected through a simple cost comparison without additional verification. The application condition is expressed in Eq. (3).

$$\:{M}_{balanced}[i,\:j]\:=\:\left({C}_{CAMW}[i,\:j]+\alpha\:<{C}_{base}[i,\:j]\right)$$
(3)

In Eq. (3), the safety margin α suppresses unnecessary switches caused by negligible cost differences. For each pair, \(\:{C}_{CAMW}\) and \(\:{C}_{base}\) are computed simultaneously, and \(\:{C}_{CAMW}\) is adopted only when its cost is lower than that of \(\:{C}_{base}\) by more than α, so that environment-specific strategies are applied to highly reliable pairs.
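The interaction between Eq. (2) and Eq. (3) can be checked with a small numerical sketch. The environment weights `W_REID_ENV` and `W_IOU_ENV`, the margin `ALPHA`, and the sample distances are hypothetical values chosen only for illustration; the source does not state them here.

```python
def weighted_cost(d_reid: float, d_iou: float, w_reid: float, w_iou: float) -> float:
    """Linear combination shared by Eq. (1) and Eq. (2)."""
    return w_reid * d_reid + w_iou * d_iou


def balanced_mask(c_camw: float, c_base: float, alpha: float) -> bool:
    """Eq. (3): switch only if C_CAMW beats C_base by more than the margin alpha."""
    return c_camw + alpha < c_base


W_REID_ENV, W_IOU_ENV = 0.7, 0.3  # hypothetical environment-specific weights
ALPHA = 0.02                      # hypothetical safety margin

# (D_reid, D_iou) for two track-detection pairs; values are illustrative.
pairs = [(0.10, 0.40), (0.30, 0.32)]

results = []
for d_reid, d_iou in pairs:
    c_base = weighted_cost(d_reid, d_iou, 0.5, 0.5)
    c_camw = weighted_cost(d_reid, d_iou, W_REID_ENV, W_IOU_ENV)
    results.append(balanced_mask(c_camw, c_base, ALPHA))

print(results)
```

For the first pair the environment-specific cost (about 0.19) undercuts the baseline (0.25) by more than the margin, so the mask is active. For the second pair the gap (0.306 versus 0.31) is smaller than the margin, so the baseline cost is kept.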

In crowded environments, a strict mask strategy is used. When multiple detection candidates with similar appearances coexist for a single track, indiscriminately increasing the appearance weight increases the risk of a mismatch. Therefore, CAMW imposes additional verification conditions beyond the cost advantage, as given by Eq. (4).

$$\:{M}_{crowd}[i,\:j]\:=\:\left({M}_{balanced}[i,\:j]\right)\wedge\:\left({D}_{reid}[i,\:j]<{\tau\:}_{reid}\right)\wedge\:\left(IoU[i,\:j]>{\tau\:}_{iou}\right)$$
(4)

The three conditions in Eq. (4) act complementarily to verify safe CAMW application. The first condition \(\:{M}_{balanced}[i,\:j]\) verifies the substantial performance improvement of \(\:{C}_{CAMW}\). The second condition allows only pairs with high appearance similarity when \(\:{D}_{reid}[i,\:j]\) is smaller than the threshold \(\:{\tau\:}_{reid}\). The third condition allows only spatially close pairs when \(\:IoU[i,\:j]\) is greater than the threshold \(\:{\tau\:}_{iou}\). \(\:{C}_{CAMW}\) is applied only when all conditions are satisfied and \(\:{C}_{base}\) is maintained when the conditions are not met, thereby ensuring trajectory continuity even under temporary occlusion.
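The three-way conjunction of Eq. (4) can be sketched for a single pair as follows. The ReID distances and IoU scores reuse the two examples from Fig. 3 purely as illustrative inputs; the cost values, the margin `alpha`, and the thresholds `tau_reid` and `tau_iou` are hypothetical and are not the values used in the paper.

```python
def crowd_mask(
    c_camw: float,
    c_base: float,
    d_reid: float,
    iou: float,
    alpha: float = 0.02,     # hypothetical safety margin
    tau_reid: float = 0.30,  # hypothetical ReID distance threshold
    tau_iou: float = 0.50,   # hypothetical IoU threshold
) -> bool:
    """Eq. (4): cost advantage AND high appearance similarity AND spatial proximity."""
    cost_advantage = c_camw + alpha < c_base  # M_balanced from Eq. (3)
    similar_appearance = d_reid < tau_reid
    spatially_close = iou > tau_iou
    return cost_advantage and similar_appearance and spatially_close


# Pair with clear conditions (ReID distance 0.21, IoU 0.88).
clear_pair = crowd_mask(c_camw=0.14, c_base=0.20, d_reid=0.21, iou=0.88)

# Pair with ambiguous conditions (ReID distance 0.65, IoU 0.42): it has a cost
# advantage but fails both verification conditions, so C_base is kept.
ambiguous_pair = crowd_mask(c_camw=0.50, c_base=0.60, d_reid=0.65, iou=0.42)

print(clear_pair, ambiguous_pair)
```

The second pair shows why the extra conditions matter: a cost advantage alone would have triggered the switch under Eq. (3).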

In unstable environments, a conservative safety strategy is used. Unpredictable non-linear motion degrades the reliability of Kalman filter predictions and similar appearances weaken the discriminative power of ReID features, thereby limiting the reliability of both motion and appearance cues. In such environments, indiscriminate cost matrix switching can degrade performance. Therefore, CAMW defines four strict, independent conditions. Each condition requires several quality indicators to be satisfied simultaneously, including the ReID confidence, motion quality, spatial overlap, track length, and cost advantage. If any of the four conditions is satisfied, the environment-specific cost is selected, as given by Eq. (5).

$$\:{M}_{unstable}[i,\:j]\:=\:\left({S}_{1}[i,\:j]\vee\:{S}_{2}[i,\:j]\vee\:{S}_{3}[i,\:j]\vee\:{S}_{4}[i,\:j]\right)$$
(5)

Each condition in Eq. (5) represents a different tracking scenario. The first condition \(\:{S}_{1}\) accepts a minimal cost advantage when all quality indicators are simultaneously excellent, securing clear matching opportunities. The second condition \(\:{S}_{2}\) requires strict stability criteria and a moderate cost advantage for long-term tracks to maintain trajectory continuity even under temporary quality degradation. The third condition \(\:{S}_{3}\) operates only when all quality indicators are excellent but requires a strong cost advantage to suppress FP. The fourth condition \(\:{S}_{4}\) handles reappearance after occlusion with a strong cost advantage under clear ReID similarity and minimal spatial overlap to reduce FN. \(\:{C}_{CAMW}\) is applied when any one of these four conditions is satisfied, and the strict requirements of each condition maintain overall safety. The complete threshold specifications for each condition are presented in Table 2. Finally, the cost matrix \(\:{C}_{high}\) for the first stage association is determined, as expressed in Eq. (6).

$$\:{C}_{high}[i,\:j]\:=\left\{\begin{array}{c}{C}_{CAMW}[i,\:j],\:\:\:if\:{M}_{env}[i,\:j]=1\\\:{C}_{base}[i,\:j],\:\:\:\:\:\:\:\:\:otherwise\end{array}\right.\:$$
(6)

The conditional selection in Eq. (6) is evaluated independently for each pair. \(\:{M}_{env}\) is set to one of \(\:{M}_{balanced}\), \(\:{M}_{crowd}\), or \(\:{M}_{unstable}\) according to the environmental characteristics. When \(\:{M}_{env}[i,\:j]\) is 1, \(\:{C}_{CAMW}\) is applied to the corresponding pair, and when it is 0, \(\:{C}_{base}\) is maintained as a safety mechanism. Different strategies can therefore be applied to each track-detection pair, even within the same frame. CAMW is thus a dual mechanism that combines environment-level strategy selection with pair-level conditional application.
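The element-wise logic of Eq. (5) and Eq. (6) can be sketched with plain nested lists. The condition masks `s1` to `s4` are given directly as hypothetical inputs because their thresholds are specified in Table 2 and not reproduced here; the helper names `any_condition` and `select_cost` are likewise introduced only for this illustration.

```python
from typing import List

BoolMatrix = List[List[bool]]
Matrix = List[List[float]]


def any_condition(*conditions: BoolMatrix) -> BoolMatrix:
    """Eq. (5): element-wise OR over the condition masks."""
    return [
        [any(cells) for cells in zip(*rows)]
        for rows in zip(*conditions)
    ]


def select_cost(c_camw: Matrix, c_base: Matrix, mask: BoolMatrix) -> Matrix:
    """Eq. (6): take C_CAMW where the mask is active, otherwise keep C_base."""
    return [
        [cw if m else cb for cw, cb, m in zip(row_cw, row_cb, row_m)]
        for row_cw, row_cb, row_m in zip(c_camw, c_base, mask)
    ]


# Hypothetical 2 x 2 example (2 tracks, 2 detections).
s1 = [[True, False], [False, False]]
s2 = [[False, False], [False, True]]
s3 = [[False, False], [False, False]]
s4 = [[False, False], [False, False]]

m_unstable = any_condition(s1, s2, s3, s4)

c_camw = [[0.10, 0.40], [0.55, 0.20]]
c_base = [[0.18, 0.35], [0.50, 0.30]]
c_high = select_cost(c_camw, c_base, m_unstable)
print(m_unstable, c_high)
```

Only the two pairs on the diagonal satisfy one of the conditions, so only those entries switch to the environment-specific cost.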

MACC

BCA provides a stable tracking foundation through hierarchical association and balanced fusion, whereas CAMW selects cost matrices with environment-specific optimized weights. MACC is a post-verification mechanism that independently evaluates the motion-appearance consistency of each track-detection pair within the cost matrix selected by CAMW and adjusts the costs. Unlike existing studies that simply sum the two costs, MACC is performed as an independent stage after the cost matrix selection of CAMW, preserving fine cost differences while integrating consistency information.

Fig. 4

Motion-appearance consistency classification mechanism of MACC.

Figure 4 shows the motion-appearance consistency classification mechanism of MACC. Step 1 presents the cost matrix selected by CAMW, and the four sample pairs within the matrix represent the consistency cases, each highlighted in its own color. Step 2 visualizes the characteristics of the four consistency cases. The color pattern contrast between track feature \(\:{f}_{t}\) and detection feature \(\:{f}_{d}\) represents the appearance similarity. The connecting lines, drawn as solid or dashed, represent spatial proximity, and the cost adjustment indicators specify the response strategy for each case. This shows how MACC independently evaluates the consistency state of each pair and responds with differentiated adjustment strategies, even within the same frame. Step 3 presents the final cost calculation. The adjustment value \(\:{\beta\:}_{ij}\) determined according to the consistency rules is added to the cost \(\:{C}_{high}[i,\:j]\) selected by CAMW to compute the final cost \(\:{C}_{final}[i,\:j]\). The final cost matrix is then input into the Hungarian algorithm to generate the globally optimal matching. The detailed process of consistency classification in Step 2 is as follows. Each pair \(\:(i,\:j)\) within the cost matrix selected by CAMW is evaluated based on its motion quality and appearance quality. Motion quality is determined by whether \(\:{IoU}_{ij}\) exceeds the threshold \(\:{\tau\:}_{m}\), and appearance quality is determined by whether \(\:{D}_{reid}[i,\:j]\) is below the threshold \(\:{\tau\:}_{a}\). \(\:{\beta\:}_{ij}\) is assigned according to the combination of the two quality indicators, as expressed in Eq. (7).

$$\beta_{ij}=\begin{cases}\beta_{1}, & IoU_{ij}>\tau_{m}\ \wedge\ D_{reid}[i,j]<\tau_{a}\\ \beta_{2}, & IoU_{ij}>\tau_{m}\ \wedge\ D_{reid}[i,j]\ge\tau_{a}\\ \beta_{3}, & IoU_{ij}\le\tau_{m}\ \wedge\ D_{reid}[i,j]<\tau_{a}\\ \beta_{4}, & IoU_{ij}\le\tau_{m}\ \wedge\ D_{reid}[i,j]\ge\tau_{a}\end{cases}$$
(7)

The adjustment values \(\:{\beta\:}_{1}\), \(\:{\beta\:}_{2}\), \(\:{\beta\:}_{3}\), \(\:{\beta\:}_{4}\) in Eq. (7) correspond to Cases 1, 2, 3, and 4 shown in Fig. 4, respectively, and \(\:{\beta\:}_{4}\) is set to 0, performing no cost adjustment. The adjustment direction is determined according to cost-based matching principles. Because the Hungarian algorithm preferentially matches pairs with low costs, when motion and appearance align, costs are reduced through negative adjustment to increase the matching likelihood. When the two cues contradict, costs are increased through positive adjustment to suppress matching. In Case 1, both the motion and appearance cues are strong: the track and detection are spatially close and have similar appearance features. A negative adjustment \(\:{\beta\:}_{1}\) is applied to reduce costs, thereby enhancing matching. Case 2 is a contradictory situation in which motion is strong but appearance is weak: the pair is spatially close but has different appearance features, thus risking confusion with nearby objects. A positive adjustment \(\:{\beta\:}_{2}\) is applied to increase costs, thereby suppressing mismatches. Case 3 is a re-identification situation in which motion is weak but appearance is strong, such as when an object reappears at a different location after occlusion or when Kalman filter predictions deviate owing to fast motion. A conservative negative adjustment \(\:{\beta\:}_{3}\) is applied to support re-identification while preventing excessive intervention. Case 4 is an uncertain situation in which both the motion and appearance cues are weak; no adjustment is performed, and the cost remains neutral because there is an insufficient basis for judgment.
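The four-case rule of Eq. (7) and the additive update of the final cost can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names `macc_beta` and `apply_macc`, the thresholds `TAU_M` and `TAU_A`, the adjustment values in `BETAS`, and all matrix entries are hypothetical values chosen only to exercise each case.

```python
def macc_beta(iou, d_reid, tau_m, tau_a, betas):
    """Return the Eq. (7) adjustment for one track-detection pair."""
    motion_strong = iou > tau_m          # motion quality test
    appearance_strong = d_reid < tau_a   # appearance quality test
    if motion_strong and appearance_strong:
        return betas[0]  # Case 1: cues agree, lower the cost
    if motion_strong:
        return betas[1]  # Case 2: cues contradict, raise the cost
    if appearance_strong:
        return betas[2]  # Case 3: re-identification, lower the cost mildly
    return betas[3]      # Case 4: both cues weak, no adjustment


def apply_macc(c_high, iou, d_reid, tau_m, tau_a, betas):
    """Compute C_final[i][j] = C_high[i][j] + beta_ij for every pair."""
    return [
        [
            c_high[i][j] + macc_beta(iou[i][j], d_reid[i][j], tau_m, tau_a, betas)
            for j in range(len(c_high[i]))
        ]
        for i in range(len(c_high))
    ]


# Hypothetical thresholds and adjustment values (not taken from the paper).
TAU_M, TAU_A = 0.5, 0.4
BETAS = (-0.05, 0.05, -0.02, 0.0)

# Toy 2x2 example: one pair per consistency case.
c_high = [[0.30, 0.45], [0.50, 0.80]]
iou = [[0.8, 0.7], [0.1, 0.2]]
d_reid = [[0.2, 0.6], [0.3, 0.9]]

c_final = apply_macc(c_high, iou, d_reid, TAU_M, TAU_A, BETAS)
for row in c_final:
    print([round(v, 2) for v in row])
```

Because the adjustment is added to the selected cost rather than replacing it, the relative ordering produced by CAMW changes only for pairs whose costs lie within the magnitude of the adjustment of each other.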

Consistency check strategies and adjustment values were differentiated according to environmental characteristics. In balanced environments, motion and appearance cues are generally reliable, and BCA and CAMW provide a stable tracking foundation; therefore, MACC selectively intervenes only on a minority of pairs where clear consistency or inconsistency is detected. Clear matching is enhanced and contradictory associations are suppressed through Cases 1 and 2, while most pairs maintain the decisions of BCA and CAMW. In crowded environments, the proportion of pairs with a clear judgment basis is low, owing to frequent occlusions and appearance similarities; therefore, the intervention frequency of the MACC is also limited. The adjustment values of Cases 1 and 2 were set to 0, and only Case 3 was activated to support recovery in reappearance situations, where clear appearance matching was confirmed. Simultaneously, \(\:{\tau\:}_{a}\) is set strictly to prevent mismatches in crowded environments. In unstable environments, the reliability of motion prediction is significantly degraded owing to unpredictable non-linear motion, causing frequent motion-appearance inconsistency situations; therefore, the MACC’s intervention ratio increases. Re-identification based on appearance is actively supported in Case 3 to maintain trajectory continuity even in situations where the Kalman filter prediction fails, and contradictory associations are suppressed in Case 2.

The post-application and additive approach of the MACC ensured the structural coherence of the system. The CAMW performs macroscopic decisions that select the overall appropriate cost matrix structures at the environmental level, and the MACC is a microscopic verification stage responsible for the consistency evaluation of individual pairs and cost refinement within the selected cost matrix. In this process, the approach of adding \(\:{\beta\:}_{ij}\) to \(\:{C}_{high}[i,\:j]\) injects consistency constraints while preserving the relative superiority of cost values computed by CAMW. Pairs with low ReID distance have already been assigned low costs in the \(\:{C}_{high}[i,\:j]\) computation stage; therefore, matching reliability is doubly enhanced through negative adjustment of \(\:{\beta\:}_{ij}\). If the order is reversed and MACC precedes it, microscopic adjustments may be lost due to CAMW matrix switching; therefore, a sequential application that performs microscopic verification after macroscopic selection confirmation ensures complete reflection of refinement information.

Algorithm 1

Multi-object tracking with BCA, CAMW, and MACC.

Experiments and results

Datasets

For the experiments, we used three standard benchmark datasets: MOT17, MOT20, and DanceTrack. MOT17 is a pedestrian tracking benchmark consisting of seven training and seven test sequences covering urban pedestrian environments with moderate crowd density. It is widely used to evaluate the overall robustness of algorithms in a balanced environment in which both motion and appearance cues are reliable. MOT20 consists of four training sequences and four test sequences, providing an extremely crowded environment with an average of approximately 170 pedestrians per frame. Owing to frequent occlusions and high spatial overlap, the reliability of the IoU-based motion information is significantly degraded, thereby providing suitable conditions for the intensive evaluation of appearance-based re-identification and association strategies in crowded environments. DanceTrack consists of 100 video sequences that primarily include group dance scenes. Objects have very similar appearances while exhibiting unpredictable non-linear motion and frequent mutual crossings. This provides a challenging scenario suitable for intensively evaluating the association stage, because the reliability of both motion predictions and appearance features is limited.

Metrics

For performance evaluation of ClarityTrack, we used HOTA, IDF1, and Multiple Object Tracking Accuracy (MOTA)50 as standard metrics. HOTA is calculated as the geometric mean of Detection Accuracy (DetA) and AssA and serves as a comprehensive metric that evaluates detection and association performance in a balanced manner. Because we fixed the detector and focused on improving the data association stage, improvements in HOTA were primarily attributed to improvements in AssA. AssA measures the association accuracy between detections and Ground Truth, reflecting the impact of the environment-specific cost selection strategy of CAMW on association quality. IDF1 is the harmonic mean of precision and recall that measures the correspondence accuracy between ground truth IDs and predicted IDs across the entire sequence and is utilized to verify the association quality improvement effects of CAMW and MACC in crowded and occluded environments. MOTA integrates FP, FN, and ID Switches (IDSW) to measure overall tracking accuracy. We used HOTA as the primary metric and verified the improvement effects of pure tracking algorithms using the same detector.
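As a reminder of how the three metrics are formed, the following Python sketch writes out their standard definitions. It is an illustration only: the function names and the toy counts are hypothetical, and the HOTA function covers a single localization threshold, whereas the reported HOTA averages this quantity over a range of thresholds.

```python
import math


def hota_at_threshold(det_a, ass_a):
    """Geometric mean of detection and association accuracy at one threshold."""
    return math.sqrt(det_a * ass_a)


def idf1(idtp, idfp, idfn):
    """Harmonic mean of ID precision and ID recall, written in count form."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


def mota(fn, fp, idsw, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / number of ground-truth boxes."""
    return 1.0 - (fn + fp + idsw) / num_gt


# Toy values, not results from the paper.
print(round(hota_at_threshold(0.64, 0.49), 4))
print(round(idf1(idtp=80, idfp=10, idfn=30), 4))
print(round(mota(fn=10, fp=5, idsw=5, num_gt=100), 4))
```

With a fixed detector, DetA stays nearly constant, so the geometric mean moves mainly with AssA, which is why HOTA gains in this setting are attributed to the association stage.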

Implementation details

ClarityTrack was implemented using the PyTorch framework, and the hardware and software specifications used in the experiments are listed in Table 1.

Table 1 Hardware specifications and software specifications.
Table 2 Complete hyperparameter specification for ClarityTrack.

Publicly available YOLOX-X51 detection results were used, and the SBS-5052 model from the FastReID framework was used for appearance feature extraction. CMC is an ORB feature-point-based homography estimation method, and the object motion is modeled using an 8-dimensional state vector Kalman filter. The baseline weights of the BCA were fixed at 50:50 across all datasets, and Gradient Boosting Interpolation (GBI)53 with a maximum interval of 30 frames was applied to compensate for trajectory fragmentation. Dataset-specific hyperparameters were optimized on each validation dataset. The detection confidence threshold \(\:{\tau\:}_{det}\) for the first-stage association is set to 0.6 for MOT17 and DanceTrack, and 0.4 for MOT20. The conditional selection margin \(\:\alpha\:\) of CAMW and the consistency refinement parameter \(\:\beta\:\) of MACC are adjusted to reflect the crowd density and non-linearity of motion patterns of each dataset. When evaluating on the test datasets, two sets of results are reported depending on whether post-processing techniques are applied, in order to verify the robustness of the proposed model. In contrast, the ablation study was performed on the validation datasets without post-processing to measure the pure contribution of each module.

The complete hyperparameter specification of ClarityTrack is presented in Table 2. The BCA section lists the fundamental tracking parameters including detection separation and matching thresholds, where the matching thresholds are commonly used across all datasets. The CAMW section presents the environment-specific cost matrix selection parameters, with additional verification conditions corresponding to Eq. (4) applied in crowded environments. The MACC section specifies the motion-appearance consistency adjustment parameters of Eq. (7). All parameters were optimized on the respective validation datasets.

In unstable environments, CAMW determines whether to apply \(\:{C}_{CAMW}\) through four parallel safety conditions of Eq. (5). Each condition simultaneously evaluates five quality indicators to ensure safe cost selection in high-uncertainty tracking situations. Table 3 presents the specific thresholds and indicator assignment information for each safety condition. The dataset-specific hyperparameter optimization process is presented in Sect. 4.5.1.

Table 3 CAMW safety strategy thresholds for the unstable environment.

Experimental results

To verify the performance of ClarityTrack, we conducted quantitative comparisons with State-of-the-Art (SOTA) models on three standard benchmark datasets. The comparison methods were selected primarily from online trackers based on the TBD paradigm. Specifically, the selection includes methods adopting hierarchical association structures similar to ClarityTrack, such as ByteTrack and OC-SORT, as well as representative methods employing fixed-weight motion-appearance fusion strategies, such as StrongSORT, BoT-SORT, and Deep OC-SORT, with a broad range of related studies selected to verify competitiveness at the current SOTA level. The effects in each dataset environment are presented using HOTA, IDF1, AssA, and MOTA.

Table 4 lists the results for the MOT17 test dataset. ClarityTrack with post-processing achieved HOTA 66.5, IDF1 81.8, and AssA 67.7, demonstrating superior association performance compared to the comparison models. Furthermore, even under the standalone evaluation condition without post-processing, HOTA 65.7, IDF1 81.6, and AssA 67.0 were recorded, surpassing recent models including CMTrack, Hybrid-SORT-ReID, and PD-SORT, which were evaluated with post-processing applied. This demonstrates that despite the characteristic of balanced environments dominated by linear motion, where trajectory interpolation tends to be advantageous for performance improvement, the proposed data association strategy forms a structurally robust tracking foundation. ByteTrack proposed an efficient two-stage association structure but showed limitations in flexibly responding to environmental variables, such as similar object identification in dense crowds or illumination changes, due to maintaining a fixed parameter strategy. Additionally, PIA complements long-term tracking stability by utilizing temporal appearance information, but has constraints in actively responding to frame-by-frame changing tracking environments because the combination ratio of motion and appearance costs is fixed. ClarityTrack resolves these structural limitations through two complementary mechanisms. The CAMW evaluates the tracking quality of each track-detection pair and conditionally applies environment-specific cost matrices suited to the dataset characteristics. The MACC selectively intervenes in the minority of pairs that show clear consistency patterns, reinforcing consistent matches and suppressing contradictory ones. The balanced fusion of BCA provides a stable association foundation, and CAMW’s conditional cost selection and MACC’s selective consistency check complement this, directly leading to meaningful improvements in HOTA and AssA in complex situations where appearance similarity and spatial proximity are mixed.

Table 4 Comparison with state-of-the-art methods on MOT17 test set.

Table 5 lists the results of the MOT20 test dataset. ClarityTrack consistently maintains superior tracking performance compared to the comparison models regardless of whether post-processing is applied. In particular, the IDF1 and MOTA recorded under the standalone evaluation condition without post-processing surpassed the performance of models with post-processing applied. This suggests that applying trajectory interpolation and linking techniques in extremely crowded environments entails the risk of inducing false positives and ID switches between similar-appearance objects. In contrast, the proposed model adopts a conservative association strategy that intervenes only in a limited manner when clear tracking quality indicators are satisfied. This strategy prevents confusion between similar-appearance objects caused by forcibly connecting trajectories under uncertain conditions, consequently demonstrating the effect of more robustly preserving the overall system’s IDF1 and MOTA. ClarityTrack recorded HOTA 63.7, IDF1 78.2, and AssA 64.7, demonstrating superior tracking performance compared to the comparison models, even in crowded environments, while maintaining a stable MOTA of 75.8. In particular, the superiority of AssA and IDF1 demonstrates that the proposed environment-specific strategy is effective in distinguishing objects in crowds and improving the association quality. SparseTrack achieved the highest MOTA performance by maximizing the object separation performance in crowded environments using virtual depth cues while focusing primarily on MOTA rather than IDF1 and AssA. This suggests that scene decomposition-based approaches are effective for momentary object separation, but are limited in maintaining ID consistency over time. 
Additionally, StrongSORT++ increased the proportion of appearance information utilization, but had difficulties in actively responding to situations where appearance features were corrupted by frequent occlusions owing to the application of fixed weights. In ClarityTrack, strategies customized to the environmental constraints of MOT20 were applied. The balanced fusion and two-stage hierarchical structure of the BCA provide a stable association foundation in environments with frequent occlusions and appearance similarities, contributing to major performance improvements. The CAMW optimizes the cost matrix by increasing the appearance weights and relaxing the IoU gates. The MACC activates Case 3 to support re-identification after occlusion, but applies it only under strict conditions to secure stability, given the extremely crowded environment.

Table 5 Comparison with state-of-the-art methods on MOT20 test set.

Table 6 lists the results of the DanceTrack test dataset. ClarityTrack achieved HOTA 64.6, IDF1 66.3, and MOTA 93.7, demonstrating the robustness of the proposed system in non-linear motion environments. In particular, the high MOTA performance is the result of the high detection quality of DanceTrack, along with CAMW’s conservative strategy that maintains safe baseline costs in high-uncertainty situations to prevent FP caused by unnecessary intervention and MACC’s active appearance-based re-identification that effectively reduces FN. OC-SORT introduced Observation-Centric Momentum to compensate for Kalman filter errors but showed limitations in distinguishing dancers with similar appearances due to limited appearance information utilization. Deep OC-SORT, which has improved upon this, also has constraints in complex crossing situations despite the addition of appearance cues, as it fails to flexibly reflect the rapidly changing cue reliability on a per-frame basis. These issues were resolved through the cooperation of the three modules. BCA’s balanced fusion provides a stable association foundation, and CAMW reflects high-uncertainty environmental characteristics to maintain safe baseline costs for most pairs, thereby preventing errors from unnecessary interventions. Particularly in non-linear motion environments, the MACC operates extensively to contribute to major performance improvements. It actively supported appearance-based re-identification centered on Case 3, and suppressed contradictory associations through Case 2, maintaining trajectory continuity even during Kalman filter prediction failures. Consequently, high HOTA and IDF1 values were obtained compared to the comparison models, even during dynamic movements.

Table 6 Comparison with state-of-the-art methods on DanceTrack test set.

Ablation study

To independently verify the contribution of each module, we conducted component-wise experiments on the validation datasets of MOT17, MOT20, and DanceTrack. All experiments were performed without post-processing techniques to evaluate the contribution of each module.

The Baseline is a geometric tracker comprising an 8D Kalman filter, ORB-based CMC, and single-stage HMIoU-based matching. BCA adds ReID integration, balanced 50:50 fusion, and confidence-based two-stage hierarchical association.

Table 7 lists the performance changes when each module was sequentially applied across the three datasets. The experimental results showed that the addition of each module resulted in consistent performance improvements for HOTA, IDF1, and AssA. BCA provides substantial performance improvements over the Baseline across all datasets, demonstrating that ReID integration, balanced fusion, and the hierarchical association structure constitute a robust tracking foundation. The addition of CAMW and MACC achieved meaningful additional performance improvements based on the stable foundation provided by BCA. In particular, the performance of the complete system exceeds those of the cases in which CAMW or MACC are added individually, demonstrating that the two modules operate independently while contributing complementarily. Dataset-specific analysis revealed that the contribution of each module varied according to the tracking environment. In the balanced MOT17 environment, CAMW and MACC contributed at similar levels, and the combination of the two modules provided an optimal performance improvement. In the extremely crowded environment of MOT20, BCA establishes the core performance foundation, and CAMW and MACC perform optimization tailored to the dataset characteristics to compensate for occlusion problems. In the non-linear motion environment of DanceTrack, the contribution of MACC is most prominent, confirming that motion-appearance consistency checks effectively control errors occurring in unpredictable motion patterns. The consistent improvements across all three datasets demonstrate that the proposed modules systematically improve the data association stage.

Table 7 Component-wise ablation study on validation sets (without post-processing).

To verify the validity of the BCA 50:50 balanced fusion setting, we evaluated various ReID and IoU weight ratios. Table 8 lists the performance changes according to the ratios across the three datasets. MOT17 recorded HOTA 67.70 at 40:60, preferring motion information, whereas MOT20 and DanceTrack achieved HOTA 68.41 and 62.74 at 60:40, respectively, preferring appearance information. These contrasting patterns originate from the dataset characteristics. In the balanced environment of MOT17, Kalman filter predictions are stable with predictable linear motion, and the reliability of motion cues is high. In contrast, in the crowded environment of MOT20, IoU information becomes ambiguous owing to frequent spatial overlap, and in DanceTrack, which is dominated by non-linear motion, Kalman filter predictions are unstable, increasing the dependence on appearance information. In these situations, a 50:50 ratio satisfied two key requirements. First, it minimizes environmental bias at a neutral position between motion and appearance preferences, consistently maintaining top-tier performance across all three datasets. Secondly, it avoids the risk of extreme weight allocation. The ratio of 30:70 recorded the lowest HOTA across all three datasets, and 70:30 consistently showed low performance. Strategies that over-rely on one cue impair the overall system robustness in environments where that cue is unstable. A comprehensive analysis of the experimental results concluded that while dataset-specific optimal values exist, a 50:50 ratio consistently maintains superior performance across all environments and provides a stable and balanced baseline without bias toward specific environments. BCA adopted a two-stage strategy that provided a consistent foundation through balanced fusion, whereas CAMW handled environment-specific optimization. 
This enables selective adaptation according to environmental characteristics, while maintaining a baseline cost matrix that operates stably across all environments.
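The effect of the ReID:IoU weight ratio can be illustrated with a simple weighted sum. The sketch below assumes the fused cost takes the form w · D_reid + (1 − w) · (1 − IoU); this form, the function names, and the two toy candidate pairs are assumptions made for illustration and may differ from the exact BCA cost defined earlier in the paper.

```python
def fused_cost(iou, d_reid, w_reid=0.5):
    """Weighted sum of an appearance cost and an IoU-based motion cost.

    w_reid is the ReID share of the ratio (0.5 corresponds to 50:50).
    The motion cost is assumed to be 1 - IoU.
    """
    return w_reid * d_reid + (1.0 - w_reid) * (1.0 - iou)


# Two hypothetical candidate detections for the same track.
PAIR_A = {"iou": 0.8, "d_reid": 0.6}  # spatially close, different appearance
PAIR_B = {"iou": 0.3, "d_reid": 0.2}  # spatially far, similar appearance


def preferred_pair(w_reid):
    """Return which candidate receives the lower fused cost."""
    cost_a = fused_cost(PAIR_A["iou"], PAIR_A["d_reid"], w_reid)
    cost_b = fused_cost(PAIR_B["iou"], PAIR_B["d_reid"], w_reid)
    return "A" if cost_a < cost_b else "B"


for w in (0.3, 0.5, 0.7):
    print(f"ReID:IoU = {round(w * 100)}:{round((1 - w) * 100)} -> pair {preferred_pair(w)}")
```

In this toy case the decision flips once the appearance share is large enough, which is the kind of cue over-reliance that the balanced setting is intended to avoid.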

Table 8 BCA weight ratio analysis on MOT17-val, MOT20-val, DanceTrack-val.

Table 9 presents the association-stage inference speed of ClarityTrack measured on a single NVIDIA RTX 3090 GPU, excluding the time required for object detection and post-processing. ClarityTrack achieved speeds of 87.99 FPS and 305.26 FPS on the MOT17 and DanceTrack datasets, respectively, both of which exceed the real-time processing threshold of 30 FPS. The speed difference across datasets is attributable to the average number of objects per frame. DanceTrack contains 10 to 30 objects per frame, forming a relatively small association matrix, whereas MOT17 comprises 5 to 20 objects but exhibits more complex scene characteristics. In particular, for MOT20, where an average of approximately 170 pedestrians appear per frame, the extreme object density expands the dimensionality of the association matrix. This increases the computational burden of CAMW’s pair-level condition evaluation and the Hungarian algorithm, resulting in a tendency for the overall inference speed to decrease to 10.05 FPS. The CAMW and MACC modules perform lightweight rule-based computations based on pre-computed cost matrices. Consequently, the primary computational bottleneck of the system occurs at the cost matrix construction and optimal assignment stages, where the computational complexity scales proportionally to the number of tracks \(\:N\) and detected objects \(\:M\) at the level of \(\:O(N\times\:M)\).
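To make the speed figures easier to compare, the short Python sketch below converts the reported FPS values into per-frame association latency and contrasts the size of the association matrix for sparse and dense scenes. The FPS values are those reported above; treating the per-frame object count as both the number of tracks and the number of detections is a simplifying assumption.

```python
# Association-stage speeds reported in the text (frames per second).
fps = {"MOT17": 87.99, "DanceTrack": 305.26, "MOT20": 10.05}

# Per-frame latency in milliseconds.
latency_ms = {name: 1000.0 / value for name, value in fps.items()}


def matrix_entries(num_tracks, num_detections):
    """Number of track-detection pairs in an N x M cost matrix."""
    return num_tracks * num_detections


for name, value in latency_ms.items():
    print(f"{name}: {value:.2f} ms per frame")

# Assumption: N and M both equal the approximate per-frame object count.
print("20 objects :", matrix_entries(20, 20), "pairs")
print("170 objects:", matrix_entries(170, 170), "pairs")
```

Under this assumption, a scene with about 170 objects yields roughly 72 times as many pairs to evaluate as a scene with 20 objects, which is consistent with the lower speed reported for MOT20.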

Table 9 Association-stage inference speed measured independently of detector inference and post-processing.

Table 10 shows the effects of adding CAMW and MACC on IDSW and Fragmentation (Frag). The analysis confirms that an inherent trade-off exists between HOTA, IDF1, and AssA, which reflect overall association quality, and IDSW, which reflects ID consistency. Although CAMW and MACC consistently improve HOTA, IDF1, and AssA compared to BCA alone, IDSW tends to increase marginally. The effect of CAMW varies across environments. In MOT17 and MOT20, CAMW reduces IDSW by 1 and 10, respectively, compared to BCA alone, demonstrating that conditional cost selection effectively suppresses mismatches in linear motion environments. In contrast, in DanceTrack, despite the \(\:{C}_{CAMW}\) application ratio being the lowest among all datasets at 7.08 as shown in Fig. 6, IDSW increases by 4 compared to BCA alone. This is because the high prediction uncertainty inherent in non-linear motion environments limits the effectiveness of environment-specific cost selection and acts as a factor that amplifies prediction errors. For MACC, an increase in IDSW compared to BCA+CAMW is observed across all three datasets. In particular, in DanceTrack, the extensive intervention of Case 3, which accounts for 88.3% of all adjustments as shown in Fig. 7, increases IDSW by 44 when comparing BCA+MACC to BCA alone. However, this aggressive appearance-based re-identification enables correct re-association after Kalman filter prediction failures, leading to a meaningful contribution of improving HOTA by 0.96 compared to BCA+CAMW. This phenomenon is attributable to the aggressive re-association strategy of the proposed modules. Case 3 of MACC attempts appearance-based re-association in situations where motion prediction fails, and in this process, the termination of existing tracks and new associations occur simultaneously, which may increase IDSW. 
Additionally, the contradiction penalty of Case 2 has the side effect of prematurely terminating tracks by suppressing matchings where motion and appearance cues conflict. Consequently, the improvement in AssA and IDF1 despite the increase in IDSW indicates that the association quality improvement driven by mismatch reduction contributes more significantly to overall tracking quality enhancement than the negative impact of the IDSW increase. This demonstrates the mechanism by which the proposed modules prevent error accumulation caused by mismatches to achieve an overall improvement in association quality.

Table 10 ID Switches and Fragmentation analysis on validation sets. All experiments were conducted without post-processing.

To evaluate the cross-dataset generalization capability of ClarityTrack, all CAMW and MACC parameters optimized on the MOT17 validation dataset were fixed and evaluated on the MOT20 and DanceTrack validation datasets. As shown in Table 11, the frozen parameters transferred effectively to MOT20, with a HOTA reduction of 0.31. This is because MOT17 and MOT20 share the same domain of pedestrian tracking, and despite differences in crowd density, their fundamental motion and appearance characteristics are similar. In contrast, HOTA decreases by 3.63 on DanceTrack, as the environmental characteristics of DanceTrack, dominated by non-linear motion patterns and frequent mutual crossings, are fundamentally different from the linear pedestrian motion assumed by the parameters optimized on MOT17. The above results demonstrate that ClarityTrack’s parameters transfer effectively within similar tracking domains, but that validation-based tuning is necessary for environments with fundamentally different motion characteristics.

Table 11 Cross-dataset generalization experiment.

Parameter sensitivity analysis

We optimized the hyperparameters of the MACC module to conform to the characteristics of each domain using validation datasets. The optimization follows a three-step protocol. First, in the rule combination analysis step, each consistency case of MACC is applied individually to identify valid rule combinations for the target environment. Second, in the individual parameter optimization step, the adjustment values and thresholds of the selected rules are progressively tuned to search for optimal performance. Third, in the final combination tuning step, the derived individual parameters are combined and interaction effects are verified to finalize the configuration. MOT17 achieved optimal performance with the initial settings alone, as motion and appearance reliability were balanced. Accordingly, we maintained the basic strategy with Case 1 and Case 2 activated without additional tuning. In contrast, MOT20 and DanceTrack, which have distinct environmental constraints, undergo dataset-specific hyperparameter optimization processes to derive strategies tailored to each environment. This section presents the optimization process and quantitative analysis results to elucidate the mechanism by which MACC compensates for the constraints of different tracking environments.
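The three-step protocol can be sketched as a small search routine. The following Python code is only one possible reading of the protocol: the function `tune_macc`, the rule names, the candidate grids, and in particular the synthetic `toy_score` function are hypothetical stand-ins for a real validation run and do not reproduce any result from the paper.

```python
def tune_macc(evaluate, grids):
    """Three-step search over MACC rules.

    evaluate: callable mapping a config dict {rule: value} to a score.
    grids: maps each rule name to candidate values, initial value first.
    """
    baseline = evaluate({})

    # Step 1: rule combination analysis. Apply each rule alone at its
    # initial value and keep the rules that do not hurt the baseline.
    active = [rule for rule, values in grids.items()
              if evaluate({rule: values[0]}) >= baseline]

    # Step 2: individual parameter optimization for each retained rule.
    tuned = {rule: max(grids[rule], key=lambda v, r=rule: evaluate({r: v}))
             for rule in active}

    # Step 3: final combination tuning. Keep the combined configuration
    # only if it is no worse than the best single-rule configuration.
    singles = [(evaluate({r: v}), {r: v}) for r, v in tuned.items()]
    best_score, best_cfg = max(singles, key=lambda s: s[0],
                               default=(baseline, {}))
    return tuned if evaluate(tuned) >= best_score else best_cfg


def toy_score(cfg):
    """Synthetic score used only to exercise the search (not real data)."""
    targets = {"beta_2": 0.02, "beta_3": -0.03}
    score = 60.0
    for rule, value in cfg.items():
        if rule in targets:
            score += 1.0 - 10.0 * abs(value - targets[rule])
        else:
            score -= 0.5  # this rule is unhelpful in the toy environment
    return score


grids = {
    "beta_1": [-0.05, -0.03],
    "beta_2": [0.05, 0.035, 0.02],
    "beta_3": [-0.05, -0.04, -0.03],
}
best = tune_macc(toy_score, grids)
print(best)
```

In practice, `evaluate` would run the tracker on a validation split and return a metric such as HOTA; the threshold \(\:{\tau\:}_{a}\) could be added to the grids in the same way.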

MOT20 experiences frequent occlusions in extremely crowded environments, destabilizing the reliability of the appearance features, thus requiring strategies to prevent mismatches and enhance re-identification after occlusion. To derive the optimal strategy, we individually evaluated the three rule components of the MACC. Table 12 lists the performance when each consistency case was applied independently. Comparing Cases 1, 2, and 3, Case 3 shows the most stable performance at HOTA 68.43, which is the closest to the baseline. Case 1 is a strategy that applies negative adjustment to pairs with strong motion and appearance; however, in crowded environments, multiple similar candidates simultaneously acquire low costs, making differentiation difficult. Case 2 is a strategy that applies positive adjustment to pairs with strong motion but weak appearance; however, given MOT20’s frequent occlusion characteristics, it risks excessively suppressing even temporary appearance corruption. Therefore, establishing the strategy of Case 3 is essential to exclude unnecessary constraints in frequent-occlusion situations and to intensively support appearance-based re-identification. Based on these analysis results, the hyperparameters were optimized as follows. First, \(\:{\beta\:}_{3}\) is set strongly to maximize recovery after occlusion. Starting from an initial value of \(\:-\)0.050 and progressively strengthening, when adjusted to \(\:-\)0.125, it achieved high performance with HOTA 68.46, IDF1 89.86, and AssA 69.77. This suggests that cost reduction is essential when appearance information is certain to avoid missing target objects in complex crowds. Second, \(\:{\tau\:}_{a}\) is a threshold determining the allowable range of appearance-based re-identification, strictly adjusted to prevent mismatches. Given MOT20’s extremely crowded environmental characteristics, many objects were densely packed in narrow spaces, with multiple objects having similar appearances coexisting. 
Applying a general threshold of 0.40 causes situations where appearance similarity conditions are met despite being different objects, increasing the mismatch risk. To prevent this, \(\:{\tau\:}_{a}\) is lowered to 0.35 to tighten the criterion, ensuring that re-identification support applies only in situations with clear appearance matching, achieving a final performance of HOTA 68.53, IDF1 89.99, and AssA 69.91.

Table 12 MACC rule component analysis on MOT20 validation set.

DanceTrack requires strategies that support appearance-based re-identification rather than excessive constraints, because the reliability of Kalman filter predictions is lowered by abrupt direction changes and non-linear motion. Although prediction instability due to non-linear motion in DanceTrack and motion disconnection due to occlusion in MOT20 have different physical causes, the fundamental problem of compensating for uncertain motion predictions with appearance information is the same. Accordingly, an analysis centered on Case 3 was performed to derive the optimal rule combination. Table 13 lists the performances of various rule combinations. Comparing Cases 1, 2, and 3, Case 3 shows a slight performance improvement at HOTA 63.26 over the baseline. The combination of Case 1 and Case 3, applying the strong adjustment values used in MOT17, recorded a lower level than the baseline at HOTA 62.62, and the combination of Case 2 and Case 3 also achieved only a limited improvement at HOTA 63.00. This is because strongly applying motion-based cost adjustment in non-linear motion environments, where Kalman filter predictions frequently fail, over-relies on unstable predictions, impairing matching flexibility.

Table 13 MACC rule combination analysis on DanceTrack validation set.

Based on these analysis results, hyperparameter optimization of Case 3 was performed. Strategies contrary to those used for MOT20 were adopted according to the dataset characteristics. First, \(\:{\beta\:}_{3}\) requires a conservative approach to motion uncertainty. In the crowded environment of MOT20, a strong cost reduction is required to support object reappearance after occlusion. However, in DanceTrack, excessive cost reduction risks causing mismatches owing to the discrepancy between predicted and actual positions. Accordingly, by progressively relaxing the adjustment strength, high performance was achieved at \(\:-\)0.035, which was reduced by approximately 3.5 times from the initial value, recording HOTA 63.88, IDF1 67.18, and AssA 50.47. This demonstrates that minimal intervention is effective in environments with unstable motion information. Second, \(\:{\tau\:}_{a}\) is maintained at 0.40, the same as MOT17. MOT20 tightened the threshold to 0.35 to prevent confusion between objects with similar appearances; however, the core problem of DanceTrack is motion uncertainty, not appearance similarity. Tightening the threshold excessively causes a loss of re-identification opportunities, whereas relaxing it increases mismatches; therefore, a balanced threshold ensures optimal performance. Third, \(\:{\beta\:}_{2}\) is reintroduced to control DanceTrack’s contradiction patterns. In MOT20, the corresponding positive adjustment was disabled, considering temporary appearance corruption owing to occlusion; however, in DanceTrack, motion prediction errors are frequent. Therefore, when high motion similarity and low appearance similarity coexist, the pair is regarded as an incorrect association, and a weak positive adjustment is applied to suppress it. The final derived hyperparameter combination achieved HOTA 64.20, IDF1 67.49, and AssA 50.88.
This is an improvement of HOTA 0.96, IDF1 1.38, and AssA 1.61, compared with MACC Off, demonstrating that setting an appropriate adjustment strength suited to environmental characteristics contributes to tracking stability in non-linear motion environments.
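The two adjustments discussed in this paragraph (the Case 3 cost reduction and the weak positive adjustment for contradictory pairs) can be illustrated with a minimal sketch. It is not the authors' implementation. Only the values β3 = −0.035 and τa = 0.40 come from the text; the function and class names, the motion-similarity thresholds, the value of β2, and the reading of τa as a minimum appearance similarity are assumptions made for illustration. Case 1 is omitted because its setting for DanceTrack is not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaccParams:
    beta3: float = -0.035     # Case 3 cost reduction reported for DanceTrack
    tau_a: float = 0.40       # appearance threshold reported for DanceTrack
    beta2: float = 0.02       # ASSUMED: "weak positive adjustment", value not given
    tau_m_high: float = 0.7   # ASSUMED: motion similarity counted as "high"
    tau_m_low: float = 0.3    # ASSUMED: motion similarity counted as "unreliable"


def macc_adjust(cost, motion_sim, appearance_sim, p=MaccParams()):
    """Return (adjusted_cost, label) for one track-detection pair.

    Similarities are assumed to lie in [0, 1], higher meaning more similar.
    """
    if motion_sim >= p.tau_m_high and appearance_sim < p.tau_a:
        # Contradiction: motion agrees but appearance does not.
        # A small positive value raises the cost to discourage the match.
        return cost + p.beta2, "case2"
    if motion_sim <= p.tau_m_low and appearance_sim >= p.tau_a:
        # Motion prediction is unreliable but appearance matches.
        # A small negative value lowers the cost to support re-identification.
        return cost + p.beta3, "case3"
    # No consistency rule fires: the cost is left unchanged.
    return cost, "none"
```

Under these assumptions, `macc_adjust(0.5, 0.1, 0.8)` lowers the cost by 0.035, whereas `macc_adjust(0.5, 0.9, 0.1)` raises it slightly. Keeping the magnitudes small mirrors the minimal-intervention setting reported for DanceTrack.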

Qualitative results

Fig. 5

Comparison of qualitative tracking results on MOT17 and DanceTrack. (a) Baseline, (b) ClarityTrack. The highlighted boxes mark the main objects for which tracking performance differs from the baseline. Compared with the baseline, the proposed system tracks more stably under light reflection and occlusion in MOT17 and under rapid direction changes and crossings in DanceTrack.

Figure 5 shows the qualitative comparison results between the Baseline and the proposed ClarityTrack on MOT17 and DanceTrack. In MOT17, system robustness was verified in situations of appearance feature corruption due to light reflection and occlusion environments. The Baseline fails to establish stable matching for objects whose appearance features are corrupted by strong light reflection from a streetlamp in the left frame, and also fails to form trajectories in the right frame due to the combined effect of continuous light reflection and partial occlusion. In contrast, ClarityTrack consistently tracks the ID 32 object, indicated by the purple highlighted box, across both frames. This is the result of CAMW’s environment-specific cost selection reflecting the relative advantage of motion cues under degraded appearance reliability, while BCA’s two-stage hierarchical association performs conservative processing for low-quality matching candidates to secure trajectory continuity. In DanceTrack, the responsiveness to prediction-unstable situations in environments dominated by non-linear motion patterns was evaluated. In the left frame of the Baseline, ID 3 indicated by the pink highlighted box and ID 6 indicated by the sky blue highlighted box are each tracked with their respective unique IDs. However, in the right frame, abrupt direction changes and mutual crossings occur, degrading the prediction accuracy of the Kalman filter, resulting in an ID swap error where the object of ID 3 is reassigned to ID 9 indicated by the yellow highlighted box and the object of ID 6 transitions to ID 3. In contrast, ClarityTrack consistently maintains both ID 3 and ID 6 on the correct objects, as indicated by the pink and sky blue solid arrows. 
This is the result of MACC’s Case 3 performing correct re-association through appearance-based re-identification in situations where motion prediction fails, while BCA’s balanced fusion and CAMW’s conservative cost selection ensure stable processing for uncertain matching candidates. These qualitative results visually demonstrate that the cooperation among the proposed modules achieves superior trajectory consistency compared to the Baseline under heterogeneous environmental conditions including appearance corruption, occlusion, and non-linear motion.

Fig. 6

Selective intervention patterns of CAMW across datasets. The selection ratios of \(C_{base}\) and \(C_{CAMW}\) for each dataset are shown, indicating that CAMW operates conservatively in proportion to environmental uncertainty.

Figure 6 shows the distribution of the CAMW environment-specific selective intervention patterns. In MOT17, \(C_{CAMW}\) was selected for 28.59% of all matching candidates, suggesting that the selection mechanism operates relatively actively in balanced environments. The simple cost comparison condition used for MOT17 reflects an environment in which motion prediction and appearance features are both generally reliable, allowing \(C_{CAMW}\) to be selected on cost advantage alone without additional strict verification. In MOT20, the application ratio decreased to 15.15% owing to the strict verification mechanism of three AND conditions, which accounts for the complexity of crowded environments. Requiring ReID reliability, the IoU threshold, and cost advantage to be satisfied simultaneously is intended to suppress mismatches and ensure stability in environments dominated by frequent occlusions and high appearance similarity. The most notable result is the lowest application ratio, 7.08%, observed in DanceTrack. In this high-uncertainty environment, characterized by non-linear motion patterns and similar appearances, CAMW behaves most conservatively, intervening only when all quality indicators are simultaneously excellent. This quantitatively demonstrates the validity of the strict settings for the four parallel conditions described in Sect. 3.3. Each condition requires several quality indicators, including ReID reliability, motion quality, spatial overlap, track length, and cost advantage, to be satisfied simultaneously, and this multi-condition evaluation structure maintains \(C_{base}\) in most cases and selects \(C_{CAMW}\) only where a clear performance improvement is expected.
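The per-pair selection between the two cost matrices, and the selection ratio reported in Fig. 6, can be sketched as follows for the three-condition (MOT20-style) rule. This is an illustrative sketch only: the function names, the tuple layout, and the threshold values `reid_min` and `iou_min` are hypothetical and are not taken from the paper.

```python
def select_cost(c_base, c_camw, reid_conf, iou, reid_min=0.6, iou_min=0.3):
    """Pick C_CAMW for one pair only if all three conditions hold (AND).

    reid_min and iou_min are ASSUMED thresholds chosen for illustration.
    Returns (selected_cost, used_camw).
    """
    use_camw = (
        reid_conf >= reid_min      # ReID reliability
        and iou >= iou_min         # IoU threshold
        and c_camw < c_base        # cost advantage
    )
    return (c_camw if use_camw else c_base), use_camw


def selection_ratio(pairs, **thresholds):
    """Fraction of pairs (c_base, c_camw, reid_conf, iou) that select C_CAMW."""
    if not pairs:
        return 0.0
    picked = sum(select_cost(*pair, **thresholds)[1] for pair in pairs)
    return picked / len(pairs)


# Toy candidates: only the first one passes all three conditions.
pairs = [
    (0.6, 0.4, 0.9, 0.5),  # reliable ReID, enough overlap, cheaper: C_CAMW
    (0.6, 0.4, 0.2, 0.5),  # ReID confidence too low: C_base
    (0.6, 0.4, 0.9, 0.1),  # overlap too small: C_base
    (0.4, 0.6, 0.9, 0.5),  # no cost advantage: C_base
]
print(selection_ratio(pairs))  # 0.25
```

Setting both thresholds to 0 reduces the rule to a pure cost comparison, similar in spirit to the simpler MOT17 condition, and raises the ratio on the same toy data to 0.75. Adding AND conditions can only lower the selection ratio, which is the direction of the trend in Fig. 6.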

The difference in application ratios across the datasets shows how CAMW differentiates its intervention intensity according to environmental characteristics. As environmental uncertainty increases, the risk of performance degradation from inappropriate intervention also increases; therefore, CAMW operates conservatively to prioritize safety. This selective intervention mechanism is consistent with the ablation study results listed in Table 7, explaining how CAMW achieves performance improvement in each environment. Specifically, HOTA improvements of 0.53, 0.17, and 0.58 over BCA were recorded in MOT17, MOT20, and DanceTrack, respectively, quantitatively supporting the effectiveness of the environment-tuned selective intervention strategy. Furthermore, comparison with the IDSW analysis in Table 10 confirms an environment-specific relationship between CAMW’s intervention patterns and IDSW changes. In MOT17 and MOT20, CAMW reduces IDSW by 1 and 10, respectively, compared to BCA alone, demonstrating its mismatch suppression effect, whereas in DanceTrack, IDSW increases by 4 compared to BCA alone despite the minimal intervention ratio of 7.08%, indicating the limitation of conditional cost selection in non-linear motion environments. In terms of Frag, the magnitude of change is limited across all three datasets compared to BCA alone, suggesting that CAMW’s conditional selection does not exert a meaningful influence on trajectory fragmentation and primarily acts on the accuracy of ID assignment.

Fig. 7

Case distribution patterns of MACC across datasets. The occurrence ratios of the four consistency cases for each dataset are shown. In MOT17 and MOT20, selective intervention strategies operate, whereas in DanceTrack, extensive appearance-based re-identification centered on Case 3 is performed, indicating differentiated consistency check patterns according to environmental characteristics.

Figure 7 shows the environment-specific case distribution patterns of MACC. In MOT17, a selective intervention strategy was observed. Case 1 accounts for 2.6% and Case 2 for 0.2%, meaning that cost adjustment is applied to only approximately 2.8% of all pairs. Although 1.7% of pairs satisfy the Case 3 condition, \(\beta_{3}\) is set to 0 in balanced environments, so no actual cost adjustment is performed. MACC thus applies adjustments only to Case 1, where motion and appearance are both strongly consistent, and Case 2, which suppresses contradictory associations. This selective intervention achieves effective performance improvement by focusing on clear patterns while preventing excessive intervention in uncertain pairs. In MOT20, a conservative verification pattern was observed. Owing to the characteristics of the crowded environment, few pairs satisfied clear consistency patterns; therefore, only Case 3 was activated, applying re-identification support to only 0.2% of all pairs. This strategy secures the stability and precision of the tracking system by strictly selecting only reappearing objects with verified reliability in crowded environments with a high mismatch risk, intervening only when clear cues are available. By contrast, DanceTrack shows a starkly different pattern. Case 3 operated extensively at 88.3%, with adjustments applied to 98.2% of the pairs across all three cases. This is because appearance-based re-identification operates as the core mechanism in environments dominated by motion prediction instability caused by non-linear motion. The high utilization of Case 3 demonstrates that MACC actively supports appearance-based re-identification to maintain trajectory continuity even when prediction fails.
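The MOT17 figures above distinguish between pairs that satisfy a case condition and pairs whose cost is actually adjusted. That distinction can be checked with a short calculation. In this sketch the occurrence ratios and the setting β3 = 0 are taken from the text, while the helper name and the nonzero β values for Cases 1 and 2 are placeholders, since only their being nonzero matters here.

```python
def adjusted_ratio(case_ratio, beta):
    """Sum the occurrence ratios (in %) of cases whose adjustment is nonzero."""
    total = sum(
        ratio for case, ratio in case_ratio.items()
        if beta.get(case, 0.0) != 0.0
    )
    return round(total, 1)


# Occurrence ratios reported for MOT17 (percent of all pairs).
mot17_ratio = {"case1": 2.6, "case2": 0.2, "case3": 1.7}

# beta3 = 0 is stated in the text; the other two values are PLACEHOLDERS.
mot17_beta = {"case1": -0.1, "case2": 0.1, "case3": 0.0}

print(adjusted_ratio(mot17_ratio, mot17_beta))  # 2.8
```

If β3 were nonzero in this configuration, the adjusted share would rise to 4.5%, which shows how a single zero-valued parameter switches a case off without changing the case detection itself.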

The difference in case distribution across datasets indicates that MACC exhibits consistency check patterns adapted to environmental characteristics. In MOT17 and MOT20, stability was secured through selective precision, whereas in DanceTrack, a major performance improvement was achieved through appearance-based re-identification. These environment-specific operational patterns quantitatively confirm that the dataset-specific hyperparameter optimization strategy described in Sect. 4.5.1 conforms to the tracking characteristics of each environment. The IDSW analysis in Table 10 shows results consistent with these case distribution patterns. In DanceTrack, the extensive intervention of Case 3 increases IDSW by 44 compared to BCA alone, while simultaneously improving HOTA by 0.96. This demonstrates that the IDSW increase occurring during appearance-based re-identification is offset by the overall improvement in association quality. Frag also increases by 13 compared to BCA alone; however, as with IDSW, the improvements in AssA and IDF1 outweigh this increase, indicating that the incidental rise in trajectory fragmentation has a limited impact on overall tracking quality.

Conclusions

We propose ClarityTrack, an environment-aware rule-based system designed to overcome the limitations of fixed-weight fusion approaches in MOT. ClarityTrack comprises three core modules. BCA uses a two-stage hierarchical structure based on detection confidence: in the first stage, it performs balanced 50:50 fusion of motion and appearance costs for high-quality detections, and in the second stage, it conducts robust association using only motion cues for low-quality detections. CAMW pre-defines parameter sets optimized for each environment and dynamically selects the appropriate cost matrix for individual track-detection pairs by evaluating explicit criteria such as ReID confidence, IoU, and cost advantage. MACC cross-validates the consistency between the motion prediction results and appearance similarity to refine costs, thereby preventing mismatches and enhancing trajectory continuity. This rule-based approach, which involves validation-based tuning, ensures transparency in the decision-making process and enables systematic hyperparameter tuning according to the characteristics of each dataset. We demonstrated the effectiveness of ClarityTrack through experiments on three benchmarks: MOT17, MOT20, and DanceTrack. In MOT17, competitive performance was achieved in terms of HOTA, IDF1, and AssA through an integrated verification strategy that reflects balanced environmental characteristics. In MOT20, stable tracking was maintained through a hierarchical association structure and a conservative verification strategy that account for frequent occlusions and high appearance similarity in crowded environments. In DanceTrack, robustness was secured by actively supporting appearance-based re-identification and extensively applying consistency checks to respond to the motion prediction instability caused by non-linear motion.
The process of systematically deriving environment-specific strategies tailored to the characteristics of each dataset provides methodological contributions to environment-specific optimization beyond simple performance improvement. However, the current implementation determines the environment type based on dataset-level prior information, which is well-suited for benchmark evaluations where scene characteristics are known. The cross-dataset generalization experiments confirmed that parameters transfer effectively within similar tracking domains, but also demonstrated that domain-specific optimization is necessary for environments with fundamentally different motion characteristics. This characterizes ClarityTrack as an environment-aware rule-based system that delivers strong performance through validation-based tuning, where the rule-based design facilitates systematic adaptation to new domains. Future work aims to develop an automatic environment classifier that analyzes scene-level statistics such as crowd density, motion linearity, and occlusion frequency to select parameters without manual specification, and plans to extend the proposed rule-based mechanism to real-time tracking scenarios and verify its generalizability through integration with diverse detectors.