Background & Summary

Trajectory datasets of traffic participants (TPs) are fundamental for advancing Intelligent Transportation Systems. Modern data acquisition techniques provide these datasets with unprecedented granularity, enabling the empirical observation of microscopic interactions1. Consequently, they support a range of research areas, including microscopic simulation2, behavioral modeling3, and trajectory generation4. This level of detail is especially critical at urban intersections, which are dense with complex traffic conflicts. Specifically, the insights gained from this data—covering yielding strategies, violation patterns, and near-collision events5—are essential for designing effective control strategies6, developing safety interventions, and testing autonomous driving systems. Therefore, high-resolution intersection trajectory data is invaluable for improving urban traffic operations and proactive safety.

Methods for observing traffic participant (TP) mobility and interactions are primarily categorized into ego-centric, roadside, and aerial perspectives7. Ego-centric approaches, foundational to autonomous driving datasets like KITTI8, Apolloscape9, Waymo10, and nuPlan11, equip vehicles with multi-modal sensors to capture surrounding interactions12, but are inherently limited by occlusions and a restricted field of view that prevent a complete scene understanding. Roadside methods, employing either single-sensor (e.g., NGSIM13, Zen Traffic Data14, I-24 MOTION15, TJRD16) or multi-sensor systems (e.g., DLR-UT17), offer long-term observation but suffer from limited spatial coverage and potential error accumulation18. A critical drawback for both ground-based methods is that visible hardware can alter driver behavior, compromising data naturalness19. In contrast, aerial observation—typically using drones for their stability and cost-effectiveness—naturally overcomes these limitations by providing a complete, bird’s-eye view of the entire traffic scene and all simultaneous interactions within it. Crucially, high-altitude operation ensures the drone remains virtually unnoticed, thus preserving the naturalness of TP behavior. These advantages have spurred the development of numerous drone-based datasets for highways (e.g., highD20, exiD21, CQSkyEyeX22, MiTra23), urban junctions (e.g., inD24, SIND25, Hohhot-HDI26, Songdo Traffic27, RounD28), and mixed urban environments (e.g., INTERACTION29, pNEUMA30, CitySim31).

Among the various traffic scenarios, urban intersections are particularly critical due to the significant efficiency losses and safety risks associated with the interrupted traffic flow. In the domain of intersection datasets, our review focuses on three key aspects: the captured scenes, the provided information, and the quality of trajectory data. Prominent existing datasets typically cover a limited number of scenarios (e.g., 3-4 intersections)24,25,29,31, but generally provide comprehensive information, including precise object dimensions and detailed scene elements. In contrast, the HDI dataset26, while featuring diverse trajectory patterns across more intersections, lacks the aforementioned details. This discrepancy precludes a direct and fair comparison. Therefore, our subsequent analysis evaluates the datasets presented in Table 1 against these three aspects, with separate considerations for motor vehicles (MVs) and vulnerable road users (VRUs, including pedestrians and two-wheelers).

Table 1 Summary of intersection drone datasets.

The first aspect is the captured scenes, analyzed from both static (intersection type) and dynamic (traffic flow) perspectives. From a static perspective, intersections are classified by their geometries and channelization, such as three-way/four-way intersections (TI/FI) and four-way intersections with dedicated right-turn lanes (FIDRT), and by their control methods. The latter include unsignalized (e.g., uncontrolled/right-before-left, all-way-stop, priority-controlled) and signalized types, which feature permissive or protected phases for conflicting directions32. These characteristics directly affect the speeds and angles at which TPs cross an intersection, and therefore the traffic conflicts that arise. However, Fig. 1 shows that existing drone datasets seldom focus on intersections with dedicated channelization and usually provide incomplete coverage of diverse control strategies. From a dynamic perspective, we evaluate traffic flow using three key metrics: TP arrival rates, MV conflict ratio, and the number of associated MVs per conflict (NCMVCP), where conflicts are identified using time-based surrogate safety measures (SSMs). A comparative analysis reveals several limitations in existing datasets, such as low or imbalanced participant counts (e.g., a low VRU proportion in INTERACTION and a low overall TP count in inD) and consistently low MV conflict ratios across several major datasets. Furthermore, the pNEUMA dataset’s high shooting altitude leads to decimeter-level target positioning accuracy, making it unsuitable for conflict analysis. In contrast, our proposed FLUID dataset shows clear advantages, featuring a higher overall TP arrival rate and a higher MV conflict ratio of over 15%. Its NCMVCP metric, comparable to that of SIND, further underscores its strength in capturing dense and interconnected traffic conflicts.

Fig. 1

Comparative view of intersection networks from drone-based datasets.

Beyond the scenes themselves, the richness and standardization of the provided information present further challenges. This encompasses both object attributes and behaviors. Large-scale datasets such as pNEUMA and Songdo Traffic lack structured maps, which constrains spatial analysis and simulation of traffic behavior. For TPs from aerial perspectives, the inherent lack of detail hinders both VRU detection and MV classification. Consequently, CitySim omits VRUs entirely, INTERACTION lacks a classification for them, and even in class-annotated datasets like SIND, misclassifications are common. Furthermore, comprehensive behavioral annotations are rare. Although existing datasets analyze the spatial distribution of traffic conflicts, they seldom provide individual conflict attributes (e.g., object types, conflict types), and the conflicts themselves are often temporally sparse (see Fig. 2). Intent information, crucial for understanding driving behavior, is another neglected area, with only the SIND dataset offering preliminary annotations for turning and violation intentions.

Fig. 2

Comparison of conflict types and frequencies across drone-based intersection datasets. Notes: Conflicts are classified by vehicle yaw angle difference; rates shown are conflicts per minute.

The ultimate value of a trajectory dataset hinges on its data quality, which raises challenges related to both the final data outputs and the methodological transparency of the generation process. Regarding the final outputs, quantitative assessments of spatio-temporal accuracy are scarce. Most works vaguely mention manual annotation ratios and typically only release polished, final-version trajectories without the corresponding raw data (e.g., videos). This practice prevents users from independently verifying data fidelity or assessing potential error accumulation from over-processing. Methodological transparency is also a widespread issue. Implementation details for detection algorithms are often sparse (e.g., the application of U-Net24, Mask R-CNN31, YOLOv525) or undisclosed, and descriptions of tracking algorithms are almost universally absent. Furthermore, commercial platforms like DataFromSky33 or GoodVision34 are not viable alternatives for many researchers due to barriers such as high costs, accuracy issues, and stringent eligibility requirements. Collectively, the lack of transparency and access to raw data severely undermines the reproducibility of existing datasets.

To address the aforementioned challenges, we introduce FLUID, a new fine-grained trajectory dataset for urban intersections. We conducted 14 flight campaigns at three carefully selected signalized intersections in Xuancheng, Anhui, China, while simultaneously recording their traffic signal states. This process yielded a dataset rich in diverse traffic behaviors, generated via a clear and lightweight pipeline. FLUID is characterized by the following key features:

  • Scene Representativeness: FLUID features three distinct types of signalized intersections, chosen to cover a range of common traffic conflict types. The high arrival rates of TPs and a significant proportion of conflict-involved vehicles result in a dataset characterized by dense and frequent conflict scenarios.

  • Information Richness: The dataset provides detailed attributes for multiple classes of TPs. It is further supplemented with synchronized traffic signal states, road maps, and fine-grained annotations of traffic conflicts and behavioral intentions (e.g., turning maneuvers and traffic violations).

  • Data Fidelity: The spatio-temporal accuracy of the trajectory data was validated against the DataFromSky platform and ground-truth measurements from the RTK-GNSS device. Furthermore, we provide a comprehensive description of our entire data acquisition, processing, and fusion pipeline. This provides a basis for assessing the dataset’s reliability and extending the methodology to new scenarios.

Methods

To obtain high-quality and fine-grained annotated trajectory results from raw collected data, we propose the construction and quality enhancement framework shown in Fig. 3, which is divided into three parts: raw recording, trajectory acquisition, and data fusion.

Fig. 3

Road map of the construction and quality enhancement of FLUID.

Original Records

Raw Data

The raw data was collected by a three-person team: one operator for the drone, and two observers who recorded traffic signal phases using ground-based cameras. We used a DJI Mini 3 drone to capture high-definition video at 4K resolution (3840 × 2160 pixels) and approximately 30 FPS (29.97 FPS). Due to regulations, the maximum altitude for drones is 120 meters, which is sufficient for capturing microscopic traffic behavior. The drone maintained a consistent flight altitude of 100 ~ 120 meters (100 ~ 105 m for the FI scenario to capture finer VRU details; 120 m for FIDRT and TI). The positional drift of this type of drone during stationary shooting does not exceed 1.5 meters, and each flight records for at most 30 minutes. The precise flight altitude for each session can be retrieved from the drone’s flight logs. To synchronize the timestamps, the drone and ground cameras were aligned to a unified mobile phone clock with second-level precision, using the coordinated movement of a ground marker as a temporal reference.

Video Pre-processing

Due to the drone’s lightweight design, the raw footage was susceptible to wind-induced instability and required stabilization. We implemented a two-stage stabilization pipeline based on the open-source tools developed by Fonod et al.35. The lower level selected a feature detector according to video quality: AKAZE36 was used when sharpness remained consistent, leveraging its robust scale-space construction via non-linear diffusion filtering; BRISK37 was preferred for videos with significant variations in clarity. Both offer a favorable balance of speed and accuracy compared to alternatives like SIFT or ORB38. The upper level then utilized a RANSAC algorithm with block matching and motion compensation to robustly estimate inter-frame motion parameters while rejecting false matches from dynamic objects. This process effectively eliminated significant jitter from the footage. Moreover, we developed a masking program to define a Region of Interest (ROI) that strictly encompassed the intersection area. This step served to both anonymize areas outside the road network and reduce the computational load for subsequent processing. The FIDRT scenario was exempted from masking to preserve the full range of VRU activities. Finally, the videos were downsampled to 10 FPS. This frame rate is sufficient to capture the decision-making time range of human TPs and is also conducive to detecting the normal motion of VRUs25. The primary output of this stage is the set of pre-processed videos.

Trajectory Acquisition

The foundation of our analysis lies in the extraction of TP trajectories from videos, which involves identifying track points and associating them with specific individuals in the video.

Object Detection

Detection was accomplished using the YOLOv8 architecture39, selected for its native support for both horizontal/oriented bounding boxes (HBB/OBB) and its efficiency as a single-stage detector. We prioritized an OBB representation as it provides a tighter and more accurate encapsulation of object boundaries and orientation from our aerial perspective, which is critical for the subsequent analysis of kinematic parameters and TP behaviors. To overcome single-dataset limitations and achieve higher detection accuracy across diverse object classes, we employ a multi-detector ensemble strategy. This strategy involves training the same YOLOv8 architecture independently on three distinct datasets, resulting in three sets of specialised model weights, each optimised to leverage the unique strengths of its training data.

The three specialised detectors were trained on the following datasets:

  • DroneVehicle_Revised: Our custom-curated dataset, which augments the DroneVehicle benchmark40 with manually-annotated samples of underrepresented classes (e.g., tricycles from FLUID) and additional vehicle types (e.g., motorcycles from the VETRA dataset41) to enhance its robustness.

  • CODrone42: A recent high-quality, 4K-resolution OBB detection benchmark, used to ensure state-of-the-art performance on common object categories, especially small objects such as pedestrians and two-wheelers.

  • Songdo Vision35: A third detector was trained on this dataset to enhance the detection of VRUs (e.g., motorcycles). Although notable for its extensive and precise HBB annotations, its HBB detections were subsequently converted to the OBB format to serve as a supplement to the primary OBB detectors.

For each of the three training processes, the respective dataset was re-partitioned into training, validation, and test subsets using a 70%/20%/10% ratio, and each model instance was trained for 200 epochs. Subsequently, the final detections were generated through a category-based fusion of the outputs from these specialised detectors, as illustrated in Fig. 4.

Fig. 4

Training result metrics of three detection models (the advantage categories of the model are marked in red).

To evaluate the performance of each detector, we adopt the standard metrics from the YOLOv8 framework, including bounding box loss, classification loss, and Distribution Focal Loss (DFL) for both training and validation phases. The detection performance is quantified by Precision, Recall, and mean Average Precision (mAP) at Intersection over Union (IoU) thresholds of 0.5 and 0.5:0.95. Detailed definitions of each metric are provided in the supplementary materials43,44. The F1-Confidence curve shows how the F1-score varies as the confidence threshold increases. The F1-score exceeded 0.8 for all categories, indicating a well-balanced performance between precision and recall and confirming the robustness of the detection model.

In this process, we identified the object categories where each detector outperforms the others (termed its advantage categories). The final detection output for any given frame was constructed by integrating only the detections for these designated advantage categories from their respective specialist detector. This ensemble approach ensures a comprehensive and precise set of object detections.
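The category-based fusion described above can be illustrated with a minimal sketch. The detector names, class labels, and the assignment of advantage categories below are hypothetical placeholders; in practice the assignment would follow the per-class validation metrics of each detector.

```python
from typing import Dict, List, Set, Tuple

# A detection: (class_name, confidence, oriented box (cx, cy, w, h, angle_rad)).
Detection = Tuple[str, float, Tuple[float, float, float, float, float]]

# Hypothetical advantage categories per detector (illustrative only).
ADVANTAGE: Dict[str, Set[str]] = {
    "drone_vehicle_revised": {"car", "truck", "bus", "tricycle"},
    "codrone": {"pedestrian", "bicycle"},
    "songdo_vision": {"motorcycle"},
}


def fuse_by_category(
    per_detector: Dict[str, List[Detection]],
    advantage: Dict[str, Set[str]] = ADVANTAGE,
) -> List[Detection]:
    """Keep a detection only if its class is an advantage category
    of the detector that produced it."""
    fused: List[Detection] = []
    for name, detections in per_detector.items():
        keep = advantage.get(name, set())
        fused.extend(d for d in detections if d[0] in keep)
    return fused
```

Because each class is taken from exactly one detector, this scheme needs no cross-detector non-maximum suppression, at the cost of ignoring detections from the other models for that class.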

Lightweight Tracking

Once object detection was complete, we employed SparseTrack45 to link the detections over time and form continuous trajectories. This algorithm was chosen for its effectiveness in dense and complex scenes, as it uses only intersection over union (IoU) for matching. As an enhancement to the widely-used ByteTrack46, SparseTrack introduces pseudo-depth estimation and deep cascade matching (DCM), ensuring robustness against occlusions and in mixed-traffic scenarios with VRUs. The pseudo-depth (d_p) is defined as:

$${d}_{p}=H-{y}_{p}$$
(1)

The pseudo-depth of a target is determined by its distance from the camera. This value, denoted as d_p (where a larger value implies a greater distance), is calculated using the image height H and the y-coordinate of the bounding box’s bottom-center point, y_p, within the image’s pixel coordinate system. Next, the Deep Cascade Matching (DCM) algorithm refines the association process for confirmed tracks by assigning matching priorities. This multi-level strategy employs a cost matrix that combines cosine and Mahalanobis distances alongside the standard IoU score. The final data association is then resolved using the Hungarian algorithm.
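As a rough illustration of Eq. (1), the sketch below computes the pseudo-depth of axis-aligned boxes and groups them into depth levels that could be matched in order of priority. The equal-width binning and the default of three levels are assumptions made for this example, not settings reported for the pipeline.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # axis-aligned (x1, y1, x2, y2) in pixels


def pseudo_depth(box: Box, image_height: float) -> float:
    """Pseudo-depth = H - y_p, with y_p the y-coordinate of the bottom-centre point."""
    y_bottom = max(box[1], box[3])
    return image_height - y_bottom


def split_into_depth_levels(
    boxes: List[Box], image_height: float, levels: int = 3
) -> List[List[int]]:
    """Group box indices into equal-width pseudo-depth bins, nearest level first."""
    groups: List[List[int]] = [[] for _ in range(levels)]
    if not boxes:
        return groups
    depths = [pseudo_depth(b, image_height) for b in boxes]
    lo, hi = min(depths), max(depths)
    width = (hi - lo) / levels or 1.0  # fall back to 1.0 when all depths are equal
    for idx, depth in enumerate(depths):
        level = min(int((depth - lo) / width), levels - 1)
        groups[level].append(idx)
    return groups
```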

For efficient inference on large video files, we employed a stream-based reading and parallel preprocessing pipeline to enable lightweight data loading and mitigate the risk of out-of-memory (OOM) errors.

Georeferencing

To georeference the pixel-based trajectories, we performed camera calibration and lens distortion correction. Initial intrinsic and extrinsic parameters were sourced from the image metadata and the drone’s flight logs. The distortion coefficients were then refined via a least-squares optimization. This process minimized the discrepancy between theoretical distances, calculated using the Ground Sampling Distance (GSD) from a constant flight altitude, and measured pixel distances in the pre-stabilized video feed. These coefficients account for both radial distortion (k_1, k_2, k_3) and tangential distortion (p_1, p_2):

$${\rm{Dist}}=[{k}_{1},{k}_{2},{p}_{1},{p}_{2},{k}_{3}]$$
(2)

Given that the distortion function is modeled as a polynomial, the corrected coordinates (x_dist, y_dist) can be calculated from (x, y) and the coefficient Dist, where (x, y) is the normalized coordinate computed from the pixel position using the camera intrinsic parameters.
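For readers unfamiliar with the polynomial model, the sketch below evaluates the widely used Brown-Conrady form with the coefficient order of Eq. (2). It is an assumption that the pipeline uses exactly this form; the intrinsic parameters fx, fy, cx, cy are illustrative names.

```python
from typing import Sequence, Tuple


def normalize_pixel(
    u: float, v: float, fx: float, fy: float, cx: float, cy: float
) -> Tuple[float, float]:
    """Convert a pixel position to normalized image coordinates."""
    return (u - cx) / fx, (v - cy) / fy


def apply_distortion(x: float, y: float, dist: Sequence[float]) -> Tuple[float, float]:
    """Map normalized (x, y) to (x_dist, y_dist) with Dist = [k1, k2, p1, p2, k3]."""
    k1, k2, p1, p2, k3 = dist
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3      # radial polynomial
    x_dist = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_dist = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return x_dist, y_dist
```

With all coefficients set to zero the mapping reduces to the identity, which is a convenient sanity check when fitting the coefficients by least squares.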

Following distortion correction, we obtained a set of refined pixel coordinates for each trajectory. These trajectories, however, contained jitter and missing frames resulting from residual stabilization errors, occlusions, or indistinct object features. A multi-stage process was implemented in the pixel domain to refine these trajectories, focusing on interpolation and smoothing:

  • Savitzky-Golay (S-G) Filter: The S-G filter with a dynamic window size was applied to the pixel coordinates of the bounding box vertices and their orientation angles. This procedure mitigates high-frequency jitter in the raw detections.

  • Kinematic Interpolation: Missing data points in the trajectories were interpolated. The positions were estimated by calculating the linear and angular velocities from adjacent valid data points in the pixel coordinate system. This kinematic method was supplemented by linear and nearest-neighbor interpolation as fallback routines. All interpolated points were explicitly flagged.

  • Rauch-Tung-Striebel (RTS) Smoother: An RTS smoother was applied to the complete trajectories (containing both original and interpolated points). It is based on a Constant Velocity (CV) motion model. From the resulting smoothed velocity profile, acceleration was computed using the central difference method.
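Two of the simpler steps in the list above can be sketched in a few lines: the linear and nearest-neighbor fallback interpolation with explicit flags, and the central-difference acceleration. This is a minimal sketch on a single scalar series; the function names are illustrative, and the kinematic interpolation and RTS smoothing themselves are not reproduced here.

```python
from typing import List, Optional, Tuple


def fill_gaps_linear(values: List[Optional[float]]) -> Tuple[List[float], List[bool]]:
    """Fill None gaps linearly (nearest value at the ends); flag every filled sample."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("need at least one valid sample")
    filled: List[float] = []
    flags: List[bool] = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(v)
            flags.append(False)
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:            # leading gap: nearest neighbor
            est = values[nxt]
        elif nxt is None:           # trailing gap: nearest neighbor
            est = values[prev]
        else:                       # interior gap: linear interpolation
            w = (i - prev) / (nxt - prev)
            est = values[prev] + w * (values[nxt] - values[prev])
        filled.append(est)
        flags.append(True)
    return filled, flags


def central_difference(v: List[float], dt: float) -> List[float]:
    """Acceleration from a velocity series; one-sided differences at both ends."""
    n = len(v)
    if n < 2:
        return [0.0] * n
    acc = [(v[1] - v[0]) / dt]
    acc += [(v[i + 1] - v[i - 1]) / (2.0 * dt) for i in range(1, n - 1)]
    acc.append((v[-1] - v[-2]) / dt)
    return acc
```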

Upon completion of the pixel-domain processing, the trajectories were georeferenced to a local Cartesian coordinate system (in meters), as illustrated in Fig. 5. This was achieved by first computing a homography matrix that mapped pixel coordinates to WGS84 geodetic coordinates. The matrix was calibrated using several ground control points, whose positions were precisely measured with an RTK-GNSS device. These geodetic coordinates were then projected into the final local system via a Universal Transverse Mercator (UTM) projection. The origin of this local system is the centroid of the intersection’s inner polygon, which is derived from a larger stopLine polygon by excluding the crosswalk areas. The stopLine polygon itself is the area bounded by the stop lines and is used for subsequent violation analysis.
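The geometric core of this step can be sketched as follows. For brevity the example assumes the homography maps pixels directly to planar metric coordinates (for instance UTM easting and northing), skipping the intermediate WGS84 stage; the matrix values in the usage note are made up.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def apply_homography(h: Sequence[Sequence[float]], u: float, v: float) -> Point:
    """Project pixel (u, v) through a 3x3 homography, including the perspective divide."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity")
    return x / w, y / w


def polygon_centroid(vertices: List[Point]) -> Point:
    """Area centroid of a simple polygon (shoelace formula)."""
    area2 = cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if abs(area2) < 1e-12:
        raise ValueError("degenerate polygon")
    return cx / (3.0 * area2), cy / (3.0 * area2)


def to_local(point: Point, origin: Point) -> Point:
    """Shift a planar coordinate so that the chosen origin becomes (0, 0)."""
    return point[0] - origin[0], point[1] - origin[1]
```

A typical call chain would be `to_local(apply_homography(H, u, v), polygon_centroid(inner_polygon))`, where `H` is estimated from the ground control points and `inner_polygon` is expressed in the same planar system.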

Fig. 5

Illustration of local geographic coordinate system construction.

Data Fusion

This process is applied to complete trajectories, aiming to fuse tracks of diverse TP types from multiple sources and align them through spatio-temporal matching.

Motion Refinement

A stable representation for each target’s physical dimensions was established by taking the median width and height (converted to meters) from its complete set of detections. Subsequently, we performed kinematic correction by computing and refining two orientation angles: heading (direction of motion) and yaw (TPs’ longitudinal orientation). The raw yaw sequence was first stabilized via a bidirectional method, which traverses the sequence to identify stable values and replace intermittent outliers, mitigating abrupt changes. Concurrently, the heading angle, computed from the target’s displacement vector, was used as a reference to correct anomalous yaw values that deviated significantly from the direction of motion.
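A minimal sketch of these refinement ideas follows. The 90° deviation threshold and the choice to correct an anomalous yaw by flipping it 180° (a common ambiguity of oriented boxes) are assumptions for illustration; the bidirectional stabilization pass is not reproduced.

```python
import math
from statistics import median
from typing import List, Tuple


def stable_size(widths: List[float], heights: List[float]) -> Tuple[float, float]:
    """Median width and height over all detections of one target."""
    return median(widths), median(heights)


def heading_from_displacement(x0: float, y0: float, x1: float, y1: float) -> float:
    """Direction of motion in radians, measured from the +x axis."""
    return math.atan2(y1 - y0, x1 - x0)


def angle_diff(a: float, b: float) -> float:
    """Signed difference a - b wrapped to [-pi, pi)."""
    return (a - b + math.pi) % (2.0 * math.pi) - math.pi


def correct_yaw(yaw: float, heading: float, max_dev: float = math.pi / 2) -> float:
    """Flip yaw by 180 degrees when it deviates from the heading by more than max_dev."""
    if abs(angle_diff(yaw, heading)) > max_dev:
        yaw += math.pi
    return math.atan2(math.sin(yaw), math.cos(yaw))  # wrap to (-pi, pi]
```

The heading is only meaningful while the target moves, so a practical implementation would skip the correction below some minimum displacement.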

Bounding-box Filter

Erroneous and redundant bounding boxes were then filtered in a two-stage process. First, a heuristic pass removed trajectories with a short duration, minimal displacement, or an average confidence below 0.5, targeting false positives like static objects and shadows. The second stage resolved persistent overlaps on single objects (i.e., dual detections caused by classification ambiguity). Although Surrogate Safety Measures (SSMs) are typically used to analyze traffic conflicts47, their inherent ability to quantify spatio-temporal proximity provides a new perspective for lightweight trajectory post-processing. We perform scene-wide removal of redundant bounding boxes using Two-dimensional SSMs (2D-SSMs), specifically Time-to-Collision (TTC) and Dynamic Gap Time (DGT), which distinguish persistent redundancy from transient overlaps. As shown in Fig. 6, vehicles A and B represent valid, distinct targets, whereas vehicle C is abnormal.
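The first, heuristic stage could look like the sketch below. Only the 0.5 confidence threshold comes from the text; the duration and displacement thresholds are hypothetical values chosen for the example.

```python
import math
from typing import List, Tuple

Sample = Tuple[float, float, float, float]  # (t_seconds, x_m, y_m, confidence)


def keep_track(
    track: List[Sample],
    min_duration: float = 1.0,       # hypothetical threshold, seconds
    min_displacement: float = 2.0,   # hypothetical threshold, meters
    min_confidence: float = 0.5,     # threshold stated in the text
) -> bool:
    """Return False for tracks that are too short, nearly static, or low-confidence."""
    if not track:
        return False
    duration = track[-1][0] - track[0][0]
    xs = [s[1] for s in track]
    ys = [s[2] for s in track]
    extent = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    mean_conf = sum(s[3] for s in track) / len(track)
    return (
        duration >= min_duration
        and extent >= min_displacement
        and mean_conf >= min_confidence
    )
```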

Fig. 6

Filtering abnormal vehicles based on 2D-SSMs calculations (left: TTC, right: DGT).

As depicted in the left panel of Fig. 6, we consider a scenario where at least one of two objects, vehicle A and vehicle B, is in motion. Building upon the open-source work of Jiao et al.48, we compute TTC to assist in determining potential overlaps between the bounding boxes of these two relatively moving objects.

Let v_AB be the relative velocity vector between vehicle A and vehicle B. For each corner point c_i of vehicle A’s bounding box, we define k_j as the intersection point of the line originating from c_i with direction v_AB and the line segments forming the bounding box of vehicle B. If such an intersection point k_j exists, the vector (k_j − c_i) represents the relative displacement from the corner point c_i to the edge of vehicle B’s bounding box. To determine whether an edge is approaching or receding from a corner point, we compute the scalar product of this displacement vector with the relative velocity vector, v_AB. As illustrated in the figure, for the intersection point k_1, the scalar product v_AB · (k_1 − c_i) yields a positive value, indicating that the corresponding edge of B is approaching corner c_i. Conversely, for k_2, the product v_AB · (k_2 − c_i) is negative, signifying that the edge is receding from c_i. We use an indicator function I(c_i) to denote this relationship, where I(c_i) = +1 for an approaching edge and I(c_i) = −1 for a receding one.

Let d_{i→j} denote the minimum distance from corner c_i to the bounding box of vehicle B along the direction of the relative velocity vector v_AB. It is calculated as:

$${d}_{i\to j}=\left\{\begin{array}{ll}\left\Vert {{\boldsymbol{k}}}_{j}-{{\boldsymbol{c}}}_{i}\right\Vert , & \left({{\boldsymbol{k}}}_{j}-{{\boldsymbol{c}}}_{i}\right)\cdot {{\boldsymbol{v}}}_{AB}\ge 0\\ \infty , & \left({{\boldsymbol{k}}}_{j}-{{\boldsymbol{c}}}_{i}\right)\cdot {{\boldsymbol{v}}}_{AB} < 0\,{\rm{or}}\,{{\boldsymbol{k}}}_{j}\,{\rm{does\; not\; exist}}\end{array}\right.$$
(3)

By iterating over all corner points c_i of vehicle A, we define the Distance-to-Collision (DTC) as the minimum among all possible distances d_{i→j}. The DTC between vehicles A and B at the current time step is thus given by:

$$\begin{array}{rcl}DT{C}_{A\to B} & = & \mathop{\min }\limits_{i,j}\,{d}_{i\to j}\\ DTC & = & \min \{DT{C}_{A\to B},DT{C}_{B\to A}\}\end{array}$$
(4)

Then, TTC can be calculated from the DTC as follows, where I_+(c) equals 1 if corner c faces an approaching edge (I(c) = +1) and 0 otherwise, I_−(c) is the analogous indicator for a receding edge (I(c) = −1), and C denotes the set of corner points considered:

$${\rm{TTC}}=\left\{\begin{array}{ll}-1, & {\rm{if}}\,{\sum }_{c\in C}{I}_{+}(c) > 0\,{\rm{and}}\,{\sum }_{c\in C}{I}_{-}(c) > 0\\ \frac{DTC}{\parallel {{\boldsymbol{v}}}_{AB}\parallel }, & {\rm{if}}\,{\sum }_{c\in C}{I}_{+}(c) > 0\,{\rm{and}}\,{\sum }_{c\in C}{I}_{-}(c)=0\\ \infty , & {\rm{if}}\,{\sum }_{c\in C}{I}_{+}(c)=0\end{array}\right.$$
(5)

The decision rule for TTC is as follows: a value of −1, which indicates that parts of the two objects are simultaneously approaching and receding, signifies a bounding box overlap. In such cases, the trajectory with the shorter duration is identified as redundant and is removed.
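The corner-ray construction behind Eqs. (3) to (5) can be sketched as below. This is an independent illustration written from the description in the text, not the referenced open-source implementation; boxes are given as four corner points, and a pair with zero relative velocity is simply reported as infinite TTC.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _hit_times(c: Point, v: Point, box: Sequence[Point]) -> List[float]:
    """Signed parameters t where the line c + t*v crosses an edge of box."""
    times = []
    n = len(box)
    for i in range(n):
        p, q = box[i], box[(i + 1) % n]
        ex, ey = q[0] - p[0], q[1] - p[1]
        denom = v[0] * ey - v[1] * ex            # cross(v, e)
        if abs(denom) < 1e-12:                   # edge parallel to the motion
            continue
        wx, wy = p[0] - c[0], p[1] - c[1]
        t = (wx * ey - wy * ex) / denom          # position along the ray
        s = (wx * v[1] - wy * v[0]) / denom      # position along the edge
        if 0.0 <= s <= 1.0:
            times.append(t)
    return times


def ttc_2d(box_a: Sequence[Point], vel_a: Point,
           box_b: Sequence[Point], vel_b: Point) -> float:
    """Return -1 for overlap, a finite TTC when approaching, inf otherwise."""
    v_ab = (vel_a[0] - vel_b[0], vel_a[1] - vel_b[1])
    speed = math.hypot(v_ab[0], v_ab[1])
    if speed < 1e-9:
        return math.inf                          # relatively static pair
    approaching: List[float] = []                # distances to approaching edges
    receding = False
    # Corners of A move along v_AB in B's frame; corners of B move along -v_AB.
    pairs = ((box_a, box_b, v_ab), (box_b, box_a, (-v_ab[0], -v_ab[1])))
    for src, dst, v in pairs:
        for corner in src:
            for t in _hit_times(corner, v, dst):
                if t >= 0.0:
                    approaching.append(t * speed)
                else:
                    receding = True
    if approaching and receding:
        return -1.0                              # simultaneous approach and recession
    if approaching:
        return min(approaching) / speed          # DTC divided by relative speed
    return math.inf
```

For two convex boxes, an approaching hit and a receding hit along the same direction can only coexist when the boxes share a point, which is why the mixed-sign case is interpreted as an overlap.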

However, the TTC-based method is insufficient for cases where an erroneous bounding box moves in close parallel with the true target, maintaining a near-constant relative velocity (i.e., they are relatively static). To address this limitation, we introduce DGT, as illustrated in the right panel of Fig. 6.

The most common time-based SSM, Post-Encroachment Time (PET), is ill-suited for this task because it requires a clear exit time from a conflict zone, which is ambiguous when considering full bounding boxes. We therefore turned to the concept of Gap Time (GT), which only considers the time difference between the two vehicles entering the conflict zone49. Based on this, we define DGT. We generalize the conflict point to a conflict area—the intersection of the two bounding boxes’ swept areas, detected via the Separating Axis Theorem (SAT)—and calculate DGT as the time difference between the moments each vehicle first enters this area. A DGT that remains zero for a sustained duration signifies a persistent overlap, prompting the removal of the shorter trajectory.
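A simplified illustration of the overlap test and the persistence check is given below. It substitutes a per-frame SAT overlap between the two current boxes for the swept-area conflict zone described above, so it should be read as a proxy for "DGT remains zero", not as the DGT computation itself; the one-second persistence threshold is a hypothetical value.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _project(poly: Sequence[Point], axis: Point) -> Tuple[float, float]:
    """Interval covered by the polygon when projected onto the axis."""
    dots = [p[0] * axis[0] + p[1] * axis[1] for p in poly]
    return min(dots), max(dots)


def sat_overlap(a: Sequence[Point], b: Sequence[Point]) -> bool:
    """Separating Axis Theorem for two convex polygons (touching counts as overlap)."""
    for poly in (a, b):
        n = len(poly)
        for i in range(n):
            p, q = poly[i], poly[(i + 1) % n]
            axis = (-(q[1] - p[1]), q[0] - p[0])     # edge normal
            a_min, a_max = _project(a, axis)
            b_min, b_max = _project(b, axis)
            if a_max < b_min or b_max < a_min:       # found a separating axis
                return False
    return True


def gap_time(t_enter_a: float, t_enter_b: float) -> float:
    """Time difference between the two entries into the conflict area."""
    return abs(t_enter_a - t_enter_b)


def persistent_overlap(
    frames_a: List[Sequence[Point]],
    frames_b: List[Sequence[Point]],
    fps: float = 10.0,
    min_duration: float = 1.0,   # hypothetical persistence threshold, seconds
) -> bool:
    """True if the boxes overlap in at least min_duration of consecutive frames."""
    needed = max(1, int(round(min_duration * fps)))
    run = 0
    for box_a, box_b in zip(frames_a, frames_b):
        run = run + 1 if sat_overlap(box_a, box_b) else 0
        if run >= needed:
            return True
    return False
```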

By applying this sequential filtering process (first TTC, then DGT), we effectively screened out the majority of overlapping and erroneous bounding boxes, retaining only the validated TP trajectories.

S-T Matching

The final step in our data fusion pipeline is the precise spatio-temporal matching of the validated TP trajectories.

Spatial matching was achieved by integrating the trajectory data with geographic information. For each aerial video, a georeferenced TIFF image with a local Cartesian coordinate system was generated. Then, the intersection’s road layout was semantically segmented. We partitioned each road edge into its constituent entry and exit lane groups. By jointly matching a trajectory’s position and its refined yaw angle against these directional layers, we were able to assign a specific turning movement (e.g., left-turn, straight, right-turn) to each TP.
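Once a trajectory has been matched to an entry and an exit lane group, the turning movement follows from their relative position. The sketch below assumes a four-way layout with arms labeled clockwise; the labels and the reduction of lane groups to arms are simplifications introduced for this example.

```python
ARMS = ["N", "E", "S", "W"]  # hypothetical arm labels, listed clockwise

MOVEMENTS = {0: "u-turn", 1: "left-turn", 2: "straight", 3: "right-turn"}


def turning_movement(entry_arm: str, exit_arm: str) -> str:
    """Classify a movement from the arm a TP arrives on and the arm it leaves by."""
    steps = (ARMS.index(exit_arm) - ARMS.index(entry_arm)) % len(ARMS)
    return MOVEMENTS[steps]
```

For example, a vehicle arriving on the south arm and leaving by the west arm has turned left. A T-intersection or a layout with skewed arms would need its own mapping.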

In the temporal matching stage, we focused on time-series data. Each trajectory was synchronized with intersection entry/exit time and the corresponding traffic signal status. This temporal integration allowed for the fusion of supplementary information, such as traffic violations (e.g., red-light running).
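The temporal lookup can be sketched as follows. The signal log is modeled here as a sorted list of state changes for one movement, and the state names are hypothetical; the released signal files may be organized differently (for instance one row per second).

```python
from bisect import bisect_right
from typing import List, Tuple

SignalLog = List[Tuple[float, str]]  # (start_time_s, state), sorted by time


def state_at(log: SignalLog, t: float) -> str:
    """Signal state in effect at time t."""
    starts = [start for start, _ in log]
    i = bisect_right(starts, t) - 1
    if i < 0:
        raise ValueError("time precedes the signal log")
    return log[i][1]


def is_red_light_running(entry_time: float, log: SignalLog) -> bool:
    """Flag a TP that crosses the stop line while its signal is red."""
    return state_at(log, entry_time) == "red"
```

A fuller check would also exempt movements that are permitted on red and allow a small tolerance around phase changes, given the second-level synchronization precision.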

Data Records

The full dataset is available through the figshare repository50. Figure 7 shows all the intersections, which are located in the central urban area. For privacy protection, the specific names and locations of these three intersections have been anonymized. Data was collected at these sites during clear daytime hours on selected days in January and May 2025. After an anomaly screening process, a total of approximately 5 hours of raw video footage was obtained. An overview of these three intersections is as follows:

Fig. 7

Scene information recorded in FLUID.

  • FI (Four-way Intersection): This intersection is governed by a three-phase signal control. It is characterized by high volumes of MVs and VRUs, leading to direct conflicts. Furthermore, conflicts arise from the concurrent release of through and left-turning traffic in the east-west direction.

  • FIDRT (Four-way Intersection with Dedicated Right-turn Lanes): Operating on a four-phase signal, this intersection has no direct internal conflict points but exhibits high traffic density. Conflicts are present externally at U-turn spots where paths cross with other traffic streams.

  • TI (T-Intersection): This three-way intersection is managed by a two-phase signal. It serves as a valuable site for observing frequent conflicts between opposing through and left-turning vehicles, particularly in lower-volume traffic scenarios.

The files in the FLUID dataset use the following formats: MPEG-4 Part 14 (MP4), Comma-Separated Values (CSV), Tagged Image File Format (TIFF), and OpenStreetMap (OSM). Because the files contain many fields, their meanings are explained in the Markdown document README.md. The following components are available, organized as shown in Fig. 8:

  • Privacy-preserved videos (video): Provided in MP4 format at 10 FPS.

  • Signal timings (signal): Manually annotated signal control data in CSV format, sampled per second.

  • Flight log during videos (flightLog): Drone flight posture recording in CSV format.

  • Maps (map): Georeferenced TIFF images and vector maps largely compatible with the Lanelet2 (OSM) format51.

  • Processed trajectories (traj/route/conflict): Offered in CSV format (the annotations for conflicts and violations belong to different files than the original tracks).

Fig. 8

Structure of files of FLUID.

It is worth noting that FLUID offers not only a unified processing framework, but also high-quality trajectory extraction results, which act as a solid baseline and can be further enhanced by the research community.

Technical Validation

The validation of the FLUID dataset is three-fold. First, we assess the effectiveness of our data processing pipeline by benchmarking its results against the DataFromSky platform and ground truth data. Second, we establish the significance of our chosen scenes by contrasting their conflict profiles with those of other public datasets. Finally, we confirm the dataset’s richness by demonstrating that each of the three scenes features a unique distribution of conflicts and violations, capturing a wide spectrum of behaviors.

Trajectory Accuracy

Previous datasets have rarely benchmarked their results against alternative processing methods or conducted systematic accuracy validation using supplementary data sources. Acknowledging that many prominent trajectory datasets (such as pNEUMA30, MAGIC52, and MiTra23) rely on the DataFromSky (DFS)33 platform, we selected DFS as a benchmark to validate our FLUID framework’s effectiveness. Furthermore, inspired by vehicle kinematics studies that often leverage ground-based positioning for validation53, we equipped a test vehicle with an RTK-GNSS device to establish a high-precision ground truth trajectory.

For this validation, we collected ground control points for georeferencing, as shown in Fig. 5. We used the 5-minute video analysis offered by the DFS free tier to process the first five minutes of footage from the FIDRT scene recorded on May 26, 2025. This DFS-generated trajectory set serves as the baseline. We then compared it against the trajectories extracted by our FLUID framework from two video sources: the original ~30 FPS footage and a downsampled 10 FPS version. Manual object counts served as the ground truth for the count assessment. The comparison focuses on three aspects: the accuracy of MV/VRU position and count, the overall speed distribution, and individual speed profiles.

Position and Count

Figure 9 presents the aligned trajectory positions. For consistency, all object classes were grouped into MV and VRU categories. The spatial comparison reveals that FLUID-extracted trajectories closely correspond with the DFS output. This positional accuracy is further corroborated by the RTS-GNSS ground truth: the Hausdorff distance between the reference and extracted trajectories for the test vehicle ranges from 0 to 0.97 m, with errors under 0.3 m on straight segments. Table 2 reports the object counts. DFS exhibited a 5 to 6% miss rate against the ground truth, a consequence of its policy of discarding stationary and short trajectories. While this may improve tracking precision, it sacrifices recall. In contrast, FLUID achieved near-zero missed detections and a low ID switch rate of 2 to 5%, demonstrating performance comparable or superior to the DFS benchmark in comprehensive object detection.
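
The Hausdorff comparison above can be illustrated with a minimal sketch. It assumes both trajectories are already expressed as (x, y) points in the same local metric frame; the coordinates below are hypothetical and are not FLUID data. How the authors segment the trajectory to obtain a range of values is not stated here, so the sketch computes a single distance for one pair of point sets.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets, in metres."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Hypothetical points (metres): a GNSS reference and an extracted trajectory.
reference = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
extracted = [(0.0, 0.2), (5.0, 0.3), (10.0, 0.1)]

print(round(hausdorff(reference, extracted), 3))
```

This brute-force form is quadratic in the number of points, which is acceptable for a single test vehicle but would need a spatial index for large trajectory sets.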

Fig. 9 Position comparison (trajectory coordinates).

Table 2 Comparison of MV and VRU counts across different sources.

Speed Distribution

Figure 10 presents the overall MV speed distributions from DFS and FLUID. A notable finding is that the speed profile from the 10 FPS video appears more stable. We hypothesize that the lower frame rate mitigates detection jitter from bounding boxes, suggesting that processing at very high frame rates may, counterintuitively, complicate the velocity post-processing stage.
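
One possible mechanism behind this hypothesis is that finite-difference speed estimates amplify position error in proportion to the frame rate. The sketch below is only an illustration under assumed values (a constant true speed and an alternating position error of fixed size, both hypothetical); it is not the authors' analysis or their post-processing method.

```python
def finite_difference_speeds(positions, dt):
    """Speeds (m/s) from consecutive 1-D positions (m) sampled every dt seconds."""
    return [(b - a) / dt for a, b in zip(positions, positions[1:])]

def spread(values):
    """Peak-to-peak range of a list of values."""
    return max(values) - min(values)

FPS = 30            # original frame rate
TRUE_SPEED = 10.0   # m/s, hypothetical constant speed
JITTER = 0.05       # m, hypothetical alternating bounding-box error

# One second of positions at 30 FPS with an alternating +/- jitter.
positions_30 = [TRUE_SPEED * k / FPS + JITTER * (-1) ** k for k in range(31)]
# Keeping every third frame gives the 10 FPS version of the same track.
positions_10 = positions_30[::3]

speeds_30 = finite_difference_speeds(positions_30, 1 / FPS)
speeds_10 = finite_difference_speeds(positions_10, 3 / FPS)

print(spread(speeds_30), spread(speeds_10))
```

With the same position error, the speed estimate fluctuates three times more at 30 FPS than at 10 FPS, because the error is divided by a time step that is three times shorter.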

Fig. 10 Speed distributions of MVs obtained with different methods.

Individual Speed Profiles

This observation is reinforced by examining individual speed profiles, here those of the two MVs with the longest trajectories in the FLUID scenario. The left panel of Fig. 11 illustrates that existing datasets employ inconsistent smoothing strategies. For instance, CitySim thresholds all near-zero speeds to zero, whereas SIND and inD apply this only to persistently stationary objects, leaving transient stops untreated. Smoothing granularity also varies substantially across scenes within datasets such as SIND and inD, which could introduce analytical biases. In contrast, FLUID enhances transparency by providing both raw and smoothed data (Fig. 11, right). The analysis again shows that 10 FPS processing yields accurate, stable speed profiles that align well with the DFS results, supporting our methodological choices.

Fig. 11 Comparison of processed speed results.

Scene Significance

A central application of the FLUID dataset is traffic conflict analysis. Defined as hazardous traffic interactions54, conflicts serve as a proxy for collisions, and their link to accident risk can be quantified using Surrogate Safety Measures (SSMs)49. Due to the lack of standardized SSM thresholds across different scenarios, we developed a new conflict quantification process based on a comprehensive literature review. This process includes conflict extraction, classification, and the identification of associated TPs.

To maintain a consistent comparative basis, our analysis is confined to MVs. We selected Time-to-Collision (TTC) as the primary predictive SSM, given its robustness after standardizing the temporal resolution of velocity data across datasets. Potential conflicts are initially identified using the minimum TTC (minTTC) observed between any two trajectories49. Subsequently, to improve precision, we introduce our Dynamic Gap Time (DGT) method for post-validation, which effectively filters out kinematically plausible but physically impossible conflicts (e.g., those separated by infrastructure). Drawing on a review of established practices55, we define a conflict event by the thresholds 0 s < DGT ≤ 4.0 s and 0 s < TTC ≤ 2.0 s.
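
The minTTC screening step can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes constant velocities and approximates each vehicle as a circle, so that contact occurs when the centres are `contact_dist` metres apart (a hypothetical value). The DGT post-validation is not sketched because its definition is not given in this section.

```python
import math

TTC_THRESHOLD_S = 2.0  # upper TTC bound quoted in the text

def ttc(p1, v1, p2, v2, contact_dist=2.0):
    """Constant-velocity time-to-collision in seconds, or None if no contact.

    Positions are (x, y) in metres and velocities (vx, vy) in m/s.
    `contact_dist` is an assumed centre-to-centre contact distance.
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    dvx, dvy = v2[0] - v1[0], v2[1] - v1[1]
    a = dvx * dvx + dvy * dvy
    b = 2.0 * (dx * dvx + dy * dvy)
    c = dx * dx + dy * dy - contact_dist ** 2
    if c <= 0.0:
        return 0.0            # already in contact
    if a == 0.0 or b >= 0.0:
        return None           # no relative motion, or moving apart
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None           # paths pass without touching
    return (-b - math.sqrt(disc)) / (2.0 * a)

def min_ttc(states_a, states_b, contact_dist=2.0):
    """Smallest TTC over time-aligned (position, velocity) samples, or None."""
    values = [ttc(pa, va, pb, vb, contact_dist)
              for (pa, va), (pb, vb) in zip(states_a, states_b)]
    values = [t for t in values if t is not None]
    return min(values) if values else None

def is_candidate_conflict(value):
    """Apply the TTC part of the threshold: 0 s < TTC <= 2.0 s."""
    return value is not None and 0.0 < value <= TTC_THRESHOLD_S

# Hypothetical car-following pair: follower at 10 m/s closing on a leader at 5 m/s.
follower = [((0.0, 0.0), (10.0, 0.0)), ((1.0, 0.0), (10.0, 0.0))]
leader = [((12.0, 0.0), (5.0, 0.0)), ((12.5, 0.0), (5.0, 0.0))]

value = min_ttc(follower, leader)
print(value, is_candidate_conflict(value))
```

In the example, the 10 m gap between the circles closes at 5 m/s, so the first sample gives a TTC of 2.0 s and the second a slightly smaller one.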

The conflict angle, Δψ, is a critical determinant of potential accident severity. While recent research widely adopts angle-based classification for conflicts and collisions, existing methodologies often suffer from ambiguous criteria56,57, overly complex rules requiring bounding box overlap analysis58, or incomplete taxonomies59. To address these limitations, we adopt the comprehensive definition60, categorizing conflicts into four distinct types: rear-end, sideswipe, angle, and head-on, as depicted in Fig. 12.

Fig. 12 Traffic conflict identification and classification.

Figure 12 illustrates the process of identifying associated conflicting objects. At instant t1, the two blue vehicles constitute a primary conflict pair. All other vehicles within a 10-meter radius of this pair—the two gray vehicles—are designated as associated objects, which will subsequently engage in new conflicts at a later time (t2). The green vehicle, being outside this radius, is excluded. Associated objects can be repeatedly identified if they are proximate to multiple conflict TPs.

Table 1 shows that FLUID’s conflict quantification demonstrates clear advantages over other datasets. A key differentiator is the traffic composition surrounding conflict events. Our analysis reveals that VRUs constitute 35.4% of all agents within a 10-meter radius of a conflict pair in FLUID. This proportion is substantially higher than that in SIND (7.2%), inD (23.7%), and INTERACTION (4.4%), indicating that FLUID provides a unique environment for studying MV interactions in the presence of VRUs, which impacts driver decision-making.
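
The associated-object rule and the VRU share described above can be sketched as follows. The sketch assumes that "within a 10-meter radius of this pair" means within 10 m of either vehicle in the pair; the agent IDs, positions, and class labels are hypothetical.

```python
import math

RADIUS_M = 10.0  # radius stated in the text

def associated_objects(frame, pair, radius=RADIUS_M):
    """IDs of agents within `radius` metres of either member of the conflict pair.

    `frame` maps agent ID -> (x, y) position in metres at one instant.
    """
    anchors = [frame[agent] for agent in pair]
    return sorted(
        agent for agent, pos in frame.items()
        if agent not in pair
        and any(math.dist(pos, anchor) <= radius for anchor in anchors)
    )

def vru_share(agents, classes):
    """Fraction of the given agents whose class label is 'VRU'."""
    if not agents:
        return 0.0
    return sum(classes[agent] == "VRU" for agent in agents) / len(agents)

# Hypothetical snapshot at instant t1.
frame_t1 = {
    "blue1": (0.0, 0.0), "blue2": (3.0, 0.0),   # primary conflict pair
    "gray1": (8.0, 0.0), "gray2": (-6.0, 7.0),  # nearby agents
    "green": (30.0, 0.0),                       # far away, excluded
}
classes = {"gray1": "MV", "gray2": "VRU", "green": "MV"}

nearby = associated_objects(frame_t1, ("blue1", "blue2"))
print(nearby, vru_share(nearby, classes))
```

Aggregating `vru_share` over all conflict events, rather than per event, would correspond more closely to the dataset-level percentages quoted above.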

Behavioral Richness

In FLUID, conflict and violation annotations exhibit rich spatiotemporal behavioral characteristics. A 1 m × 1 m grid is used to discretize the scene, enabling finer conflict identification. The previously classified conflict types are then clustered by grid cell. Figure 13 shows that the scenarios exhibit distinct density distributions for each conflict type. Figure 14 shows that the violation rates vary across signal cycles (non-consecutive videos, 110 for FI, 26 for FIDRT, and 68 for TI). Because the rates are calculated as a proportion of total MVs, they may be higher during cycles with lower traffic volume.
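
The grid aggregation and the per-cycle rate can be sketched as follows. This is a minimal illustration, assuming each conflict event is reduced to one (x, y) location in metres with a type label; the event list and counts are hypothetical.

```python
import math
from collections import Counter

CELL_M = 1.0  # grid resolution stated in the text (1 m x 1 m)

def grid_cell(x, y, cell=CELL_M):
    """Integer (column, row) index of the grid cell containing (x, y)."""
    return (math.floor(x / cell), math.floor(y / cell))

def conflict_density(events, cell=CELL_M):
    """Count events per (conflict type, grid cell); events are (x, y, type)."""
    counts = Counter()
    for x, y, kind in events:
        counts[(kind, grid_cell(x, y, cell))] += 1
    return counts

def violation_rate(n_violations, n_mvs):
    """Violations as a proportion of the MVs observed in one signal cycle."""
    return n_violations / n_mvs if n_mvs else 0.0

# Hypothetical conflict locations (metres) and types.
events = [
    (0.2, 0.7, "rear-end"), (0.9, 0.1, "rear-end"),
    (1.4, 0.5, "angle"), (-0.3, 2.2, "angle"),
]
density = conflict_density(events)
print(density[("rear-end", (0, 0))], density[("angle", (1, 0))])

# The same number of violations gives a higher rate in a low-volume cycle.
print(violation_rate(2, 10), violation_rate(2, 40))
```

Using `math.floor` rather than `int` keeps cells consistent for negative coordinates, which matters when the map origin lies inside the scene.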

Fig. 13 Heatmaps of spatial density for different conflict types.

Fig. 14 Per-cycle violation rates for straight and left-turn movements, calculated as a proportion of total MVs.

Usage Notes

Benefiting from the fine-grained details of our dataset, we can categorize distinct traffic behavior patterns through turn labels, as illustrated in Fig. 15. Beyond basic traffic flow analysis, the high precision of FLUID facilitates multi-domain research, including human preference mining, traffic behavior modeling, and autonomous driving. Figure 16 showcases three representative cases that highlight the dataset’s unique potential in these specialized research contexts.

Fig. 15 Trajectory visualization and turn annotation result.

Fig. 16 Unique application prospects of FLUID.

Analysis of Passing and Yielding Behaviors

Leveraging the standardized geometric layout, high density of passenger vehicles, and integrated traffic signal data in the FI scenario, FLUID provides a robust foundation for analyzing complex interaction behaviors. By calculating the convex hull of conflicting trajectories, we can accurately delineate conflict zones and analyze the temporal sequences of vehicles entering and exiting both these zones and the intersection boundaries. This methodology enables a refined quantification of passing and yielding dynamics for specific maneuvers. For instance, among the through-left interaction events identified during green phases across the FI dataset, we recorded 502 instances of left-turn yielding, 415 instances of through-moving yielding, and 5 instances where no clear yielding behavior was discernible (Fig. 16, Left).
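
The conflict-zone idea can be sketched as follows. The sketch builds a convex hull with Andrew's monotone chain algorithm and reads entry and exit times from sampled tracks. Treating the later entrant as the yielding vehicle is an assumption made only for this illustration; the authors' exact yielding criterion is not specified here, and all points and tracks are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside(hull, p):
    """True if p lies inside or on a counter-clockwise convex polygon."""
    n = len(hull)
    if n < 3:
        return False  # degenerate zone, no area
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

def occupancy(track, hull):
    """(entry_time, exit_time) of a track [(t, x, y), ...] in the zone, or None."""
    times = [t for t, x, y in track if inside(hull, (x, y))]
    return (times[0], times[-1]) if times else None

def later_entrant(occupancies):
    """ID of the vehicle entering the zone last (assumed here to be the yielder)."""
    present = {k: v for k, v in occupancies.items() if v is not None}
    return max(present, key=lambda k: present[k][0]) if present else None

# Hypothetical conflicting trajectory points (metres) defining the zone.
zone = convex_hull([(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)])

# Hypothetical tracks as (time in s, x, y).
through = [(0, -2, 2), (1, 1, 2), (2, 3, 2), (3, 6, 2)]
left_turn = [(0, 2, -6), (1, 2, -4), (2, 2, -2), (3, 2, 1), (4, 2, 3), (5, 2, 6)]

occ = {"through": occupancy(through, zone), "left_turn": occupancy(left_turn, zone)}
print(occ, later_entrant(occ))
```

Entry and exit times are resolved only to the sampling interval of the tracks, so interpolation between samples would be needed for finer timing.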

Spatiotemporal Violation Analysis of VRUs

While previous research61 (conducted on a scenario similar to FIDRT) was often constrained by limited tracking precision and necessitated grid-based spatial partitioning, the FLUID dataset provides individual-level trajectories of VRUs integrated with precise Lanelet2 semantic maps. Moreover, the significantly higher VRU arrival rate in FLUID, compared to existing benchmarks, allows for a more granular analysis of spatial occupancy. These features facilitate a deeper understanding of VRUs’ spatiotemporal violation intentions and movement patterns (Fig. 16, Middle).

Optimization of Large Vehicle Detection

In the TI scenario, large vehicles constitute over 15% of the traffic, and their diverse dimensions present substantial challenges for accurate detection. By providing both raw data and bounding box labels, FLUID offers significant potential for optimizing large-target detection in complex environments. This makes its contribution to multi-object tracking and detection research similar in value to that of pNEUMA Vision62 (Fig. 16, Right).

In addition to these unique capabilities, our dataset also serves as a high-quality resource for broader applications in traffic engineering, such as:

  • Driving Decision Modeling: Analyze the impact of interconnected information on driving decisions and traffic operations in interactive dilemmas where two vehicles approach an intersection from conflicting directions63.

  • Intersection Operation Quantification: Evaluate the safety and efficiency characteristics of different intersection control strategies26.

  • Conflict Correlation Analysis: Quantify conflict severity using SSMs, and analyze the relationship between this severity and other kinematic parameters64.

  • Trajectory Generation: Learn from human mobility patterns to generate human-like and socially-inspired behaviors and movements65.