Introduction

With the increasing complexity of port logistics, automation and safety in handling equipment have become critical challenges in both research and engineering applications1,2,3. As the core machinery for bulk cargo handling, the unloader performs essential tasks such as grabbing, transferring, and placing heavy materials. However, it often operates in open and unstructured environments with strong lighting variation, dense dust, and irregular movement patterns4,5. These conditions create significant risks for both equipment and personnel, raising operational costs and compromising efficiency.

Traditional safety monitoring methods mainly rely on three approaches6,7,8: physical devices such as limiters and laser sensors to prevent overtravel and collisions, PLC-based control logic to constrain movement, and manual video surveillance by dispatchers. While useful, these methods are inherently limited. For instance, Zhang et al.9 developed a real-time monitoring system combining multi-source sensing and finite element analysis, but such systems struggle with adaptability in fast-changing environments. Kara et al.10 applied fuzzy analytical hierarchy methods for risk assessment, yet these remain qualitative and lack real-time responsiveness. Chen et al.11 explored a vision-based system for gantry cranes, demonstrating accurate target tracking, but vision-only methods are vulnerable to occlusion and poor visibility. Overall, existing approaches cannot robustly handle dynamic obstacles or maintain perception stability under challenging port conditions12,13.

Recent advances in artificial intelligence and machine vision have opened new opportunities for smart ports14,15,16. For example, Shen et al.17 proposed a monocular vision system to locate unloader buckets, reducing dependence on costly LiDAR, while Wang et al.18 used CNNs and GANs to generate a 3D coordinate system, showing promising accuracy in simulations. Sunwoo Hwang et al.19 improved object detection by combining YOLO and Mask R-CNN with fuzzy clustering, achieving high recognition accuracy at real-time speeds. In parallel, Visual SLAM and point cloud reconstruction techniques have enabled non-contact 3D perception and rich spatial awareness20,21. However, most of these methods remain isolated, either focusing on vision-only perception or simulation-based demonstrations, with limited integration into dynamic unloader operations. Key gaps persist: maintaining spatial consistency under structural changes, detecting obstacles in real time, and transforming perception into actionable safety warnings.

To address these challenges, this paper proposes a machine vision–based framework for dynamic 3D coordinate system construction and unloader safety optimization. By fusing multi-view vision and IMU data with LiDAR point clouds through a sliding window nonlinear optimization method, the system reduces drift, maintains spatiotemporal consistency, and improves adaptability in disturbed environments. The main contributions are as follows:

  • Construction of a stable, dynamic 3D coordinate system adapted to structural changes;

  • Real-time 3D reconstruction of the unloader’s workspace;

  • Collision risk detection and a safety warning mechanism;

  • Experimental validation in real port environments with high accuracy and low false alarms.

In this study, “optimization” refers to systematically improving unloader safety and operational performance under defined objectives and constraints. The objective is to minimize collision risk and mapping error while maximizing real-time performance; the constraints include mechanical movement limits, environmental variability such as dust and lighting, and the computational capacity of embedded hardware. The solution is realized by integrating LiDAR-based point cloud reconstruction, visual SLAM, and Voronoi skeleton extraction into a unified framework. This definition provides a clear foundation for the technical developments and evaluations presented in the following sections.

Compared with existing approaches, our method addresses key weaknesses. IMU-only or visual odometry systems suffer from drift and lose consistency under structural changes or occlusions. Vision-based methods are sensitive to lighting, dust, and obstruction, while LiDAR-only frameworks, though geometrically accurate, lack semantic depth for risk prediction. Our approach fuses LiDAR, multi-view vision, and IMU data through sliding window nonlinear optimization, reducing drift, improving robustness, and enabling more accurate real-time obstacle detection with fewer false alarms.

System approach

To ensure reliable object localization and environment perception in complex port scenarios, this study adopts LiDAR-based point cloud data rather than relying solely on RGB or thermal imaging. RGB cameras, though capable of capturing rich texture, are highly susceptible to environmental noise such as variable lighting, heavy dust, and occlusion—common challenges in crane operations. Thermal imaging, while useful for detecting temperature-differentiated targets, lacks the spatial resolution and structural detail required for accurate 3D modeling and collision risk prediction. In contrast, LiDAR provides dense and lighting-independent point cloud data with sub-centimeter accuracy, enabling precise spatial structure acquisition, robust object contour mapping, and real-time depth perception. These capabilities are especially critical for capturing the dynamic geometry of grab buckets, material piles, and hull edges. Furthermore, LiDAR exhibits strong resilience to nighttime conditions, rapid mechanical movement, and interference from multiple obstacles, making it a robust and indispensable sensing modality for unstructured and highly dynamic port environments.

LiDAR-based intelligent environmental sensing system

A mechanical surround-type LiDAR is selected as the environmental sensing device to scan the grab crane’s operating environment and ship cargo. It offers the operator intuitive spatial awareness and reliable data support for efficient path planning. This type of LiDAR provides 360° panoramic coverage by rotating at high speed through mechanical components such as rotating platforms or lenses. It features high spatial resolution, enabling accurate detection of both large structures and small objects across wide operating areas—an essential capability for port unloading scenarios22,23,24. Structurally, the system comprises three main modules: a transmission module that emits laser beams, a scanning module that directs the beams, and a receiving module that captures reflected pulses. The distance to objects is calculated by measuring the time delay between laser emission and return. The working principle is illustrated in Fig. 1.

Fig. 1
figure 1

Principle of LiDAR distance measurement.

Figure 1 illustrates the working principle of LiDAR distance measurement. A laser source emits pulses that are directed by the optical system and scanning mechanism toward the object. The reflected signals are collected by the optical and electrical systems, then amplified and processed. By measuring the time delay between emission and detection, the system calculates the distance to the object, enabling accurate 3D spatial mapping.
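To make the relation between the measured delay and the reported distance concrete, the sketch below computes range from a round-trip time; the pulse timestamps are illustrative values only and are not taken from the actual device.

```python
# Minimal time-of-flight range calculation: the measured quantity is the
# round-trip delay between pulse emission and echo detection, so the
# one-way distance is c * delay / 2.
C = 299_792_458.0  # speed of light in m/s


def tof_distance(t_emit_s: float, t_return_s: float) -> float:
    """Distance to the reflecting surface in metres."""
    delay = t_return_s - t_emit_s
    return C * delay / 2.0


# Example: a 400 ns round trip corresponds to roughly 60 m.
print(tof_distance(0.0, 400e-9))  # ~59.96 m
```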

To perceive the grab crane’s operational environment, the LiDAR device is mounted in a fixed position to ensure proper alignment with the material pile. Prior to scanning, calibration is performed by adjusting the device’s horizontal and vertical orientation, ensuring accurate positioning relative to the cargo in the ship’s hold.

Global calibration involves four coordinate systems reflecting the degrees of freedom of the unloader’s movement: the stationary track coordinate system \(O_{\text{rail}}-X_{\text{rail}}Y_{\text{rail}}Z_{\text{rail}}\), which also serves as the yard reference frame; the yaw coordinate system \(O_{\text{yaw}}-X_{\text{yaw}}Y_{\text{yaw}}Z_{\text{yaw}}\); the pitch coordinate system \(O_{\text{pitch}}-X_{\text{pitch}}Y_{\text{pitch}}Z_{\text{pitch}}\); and the LiDAR’s intrinsic coordinate system \(O_{\text{laser}}-X_{\text{laser}}Y_{\text{laser}}Z_{\text{laser}}\). Except for the track system, all coordinate frames shift dynamically during operation.
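As an illustration of how a LiDAR measurement is propagated through these frames, the following minimal sketch chains homogeneous transforms from the laser frame through the pitch and yaw frames into the stationary rail frame. The offsets, angles, and axis conventions are placeholder assumptions, not the unloader’s actual calibration parameters, which in practice come from calibration and live encoder readings.

```python
import numpy as np


def rot_z(theta):
    """Homogeneous rotation about the Z axis (yaw)."""
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T


def rot_y(theta):
    """Homogeneous rotation about the Y axis (pitch)."""
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[0, 0], T[0, 2], T[2, 0], T[2, 2] = c, s, -s, c
    return T


def trans(x, y, z):
    """Homogeneous translation."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T


# Chain: rail <- yaw <- pitch <- laser (placeholder offsets and angles).
T_rail_yaw = trans(12.0, 0.0, 18.0) @ rot_z(np.deg2rad(30.0))
T_yaw_pitch = trans(0.0, 0.0, 2.5) @ rot_y(np.deg2rad(-10.0))
T_pitch_laser = trans(1.2, 0.0, 0.3)


def laser_to_rail(p_laser):
    """Map a LiDAR point (x, y, z) into the stationary track/yard frame."""
    p = np.append(np.asarray(p_laser, dtype=float), 1.0)
    return (T_rail_yaw @ T_yaw_pitch @ T_pitch_laser @ p)[:3]


print(laser_to_rail([5.0, 0.0, 0.0]))
```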

Once activated, the LiDAR emits laser pulses, capturing the time delay of reflected signals to compute spatial distances and generate 3D positional data for the material pile. The resulting point cloud undergoes outlier removal, noise filtering, and error correction. Multiple scans are then fused to reconstruct a comprehensive point cloud of the pile and surrounding environment. This visual representation supports real-time situational awareness, helping operators detect obstacles and operational hazards more intuitively (see Fig. 2)25,26.
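A minimal sketch of the outlier removal and multi-scan fusion step described above, assuming the scans are already registered in a common frame; the neighbour count, standard-deviation ratio, and voxel size are illustrative choices rather than the values used in the deployed system.

```python
import numpy as np
from scipy.spatial import cKDTree


def remove_statistical_outliers(points, k=16, std_ratio=2.0):
    """Drop points whose mean distance to their k nearest neighbours is more
    than `std_ratio` standard deviations above the global mean distance."""
    tree = cKDTree(points)
    # k+1 because the nearest neighbour of each point is the point itself.
    dists, _ = tree.query(points, k=k + 1)
    mean_d = dists[:, 1:].mean(axis=1)
    keep = mean_d < mean_d.mean() + std_ratio * mean_d.std()
    return points[keep]


def fuse_scans(scans, voxel=0.05):
    """Merge several already-registered scans and thin them on a voxel grid
    (voxel size in metres) to keep the fused cloud compact."""
    merged = np.vstack(scans)
    keys = np.floor(merged / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return merged[np.sort(idx)]


# Usage: `scans` is a list of (N_i, 3) arrays already expressed in the rail frame.
# cloud = fuse_scans([remove_statistical_outliers(s) for s in scans])
```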

Fig. 2
figure 2

3D point cloud of the grab crane.

Voronoi-based grab skeleton recognition algorithm

Voronoi diagrams, also known as Thiessen polygons or Dirichlet tessellations, are composed of contiguous polygons whose edges lie on the perpendicular bisectors of the segments joining adjacent generator points. Voronoi diagrams have the following characteristics: each cell contains exactly one generator, and every interior point of a cell is closer to that generator than to any other generator27; points lying on a cell edge are equidistant from the two generators whose cells the edge separates; and the Voronoi boundary between adjacent cells is a subset of the corresponding perpendicular bisector28.

The most critical part of the Voronoi algorithm is constructing the Delaunay triangulation. Triangles are constructed automatically from the discrete point set to form the Delaunay triangulation; the discrete points and the triangles they form are numbered, and the three discrete points making up each triangle are recorded. Once the traversal process is complete, all Voronoi edges are identified, and the Voronoi diagram is drawn from these edges. The specific process is shown in Fig. 3.

Fig. 3
figure 3

Visual recognition process of the Voronoi algorithm.

Figure 3 illustrates the visual recognition process of the Voronoi algorithm. The procedure begins with generating a Delaunay triangulation and calculating the circumcenter of each triangle. Neighboring triangles are then searched; when a neighbor exists, the segment joining the two circumcenters (dual to the shared edge) is recorded as a Voronoi edge, and when no neighbor exists, the outer perpendicular bisector ray is recorded instead. This iterative process produces the Voronoi diagram, which forms the basis for skeleton extraction and path recognition. The key to constructing a Voronoi diagram is the generation of the Delaunay triangulation. The characteristics of the Delaunay triangulation29,30 are:

The empty circumcircle property; the nearest-neighbor property; uniqueness; optimality; regularity; locality; and a convex hull boundary.
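The pipeline of Fig. 3 (triangulate, compute the circumcenter of each triangle, and connect circumcenters across shared edges) can be reproduced with standard tools. The snippet below is only an illustration of this Delaunay–Voronoi duality on placeholder 2D data using SciPy, not the system’s implementation.

```python
import numpy as np
from scipy.spatial import Delaunay, Voronoi

pts = np.random.rand(200, 2)   # 2D projection of the point cloud (placeholder data)

tri = Delaunay(pts)            # Delaunay triangulation
vor = Voronoi(pts)             # its dual Voronoi diagram


def circumcenter(a, b, c):
    """Circumcentre of one triangle: the Voronoi vertex associated with it."""
    ax, ay = a; bx, by = b; cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return np.array([ux, uy])


# One Voronoi vertex per Delaunay triangle.
centers = np.array([circumcenter(*pts[s]) for s in tri.simplices])

# Finite Voronoi edges: pairs of circumcentres of neighbouring triangles
# (ridge entries containing -1 correspond to the unbounded outer bisectors).
finite_edges = [vor.vertices[r] for r in vor.ridge_vertices if -1 not in r]
```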

Delaunay triangulation is a standard form of triangular division, and there are several algorithms to achieve it. In this case, the Bowyer-Watson algorithm is used. The basic steps of the algorithm are:

  1. Construct a super triangle that contains all the scattered points and place it into a triangle list;

  2. Insert the scattered points of the point set in sequence; find the triangles in the triangle list whose circumcircles contain the inserted point (the influencing triangles for that point), delete the edges shared by the influencing triangles, and connect the inserted point to all vertices of the influencing triangles, completing the insertion of one point into the Delaunay triangle list;

  3. Perform local optimization on the newly formed triangles according to the optimization criteria and place them into the Delaunay triangle list31.

Repeat step 2 above until all the scattered points are inserted, as shown in Fig. 4.

The Bowyer–Watson algorithm was adopted for Delaunay triangulation due to its robustness, topological consistency, and computational efficiency when processing unstructured 2D spatial data derived from LiDAR point clouds. The algorithm follows an incremental construction strategy within an enclosing super-triangle, repeatedly removing triangles whose circumcircles contain the new point and re-triangulating the resulting cavity. This process guarantees that the resulting mesh satisfies the empty circumcircle property fundamental to Delaunay triangulation, ensuring non-overlapping triangles with maximized minimum angles, an essential feature for preserving geometric fidelity in LiDAR-derived environments. Compared with the basic incremental insertion method, Bowyer–Watson offers improved clarity in conflict zone identification and better cache coherence, which is especially beneficial when handling millions of spatial points. Furthermore, while the divide-and-conquer approach achieves optimal asymptotic complexity \(O(n\log n)\), its recursive structure introduces significant overhead in real-time or streaming scenarios, and its implementation becomes less tractable for the non-uniform point distributions often observed in port unloading scenes.

Fig. 4
figure 4

Generation of the Delaunay triangulation.

The Bowyer–Watson method, in contrast, supports dynamic point set updates with localized triangulation repair, making it more adaptive and scalable. This local adaptability is particularly advantageous in safety–critical environments like port unloading, where moving machinery and cargo generate frequent updates in the point cloud. Thus, the Bowyer–Watson algorithm offers a theoretically sound and practically efficient foundation for Delaunay-based Voronoi diagram generation under real-world operational constraints.
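For concreteness, the following is a minimal, self-contained sketch of the single-point insertion step (step 2 of the list above): it finds the influencing triangles, extracts the cavity boundary, and re-triangulates by connecting the new point to that boundary. Super-triangle setup and the final clean-up are omitted, and the helpers are simplified for illustration rather than taken from the deployed implementation.

```python
import numpy as np


def _circumcenter(a, b, c):
    """Circumcentre of a 2D triangle, solved from the perpendicular bisectors."""
    ax, ay = a; bx, by = b; cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
          + (cx**2 + cy**2) * (ay - by)) / d
    uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
          + (cx**2 + cy**2) * (bx - ax)) / d
    return np.array([ux, uy])


def _in_circumcircle(a, b, c, p):
    """True if p lies inside the circumcircle of triangle (a, b, c)."""
    center = _circumcenter(a, b, c)
    return np.linalg.norm(p - center) < np.linalg.norm(a - center)


def bowyer_watson_insert(triangles, points, p):
    """One insertion step of the Bowyer-Watson algorithm (step 2 above).

    `points` is a list of 2D numpy arrays and `triangles` a list of index
    triples into it. The new point is appended, the influencing triangles
    (those whose circumcircle contains it) are removed, and the cavity is
    re-triangulated by connecting the new point to the cavity boundary.
    """
    points.append(np.asarray(p, dtype=float))
    pi = len(points) - 1

    bad = [t for t in triangles
           if _in_circumcircle(points[t[0]], points[t[1]], points[t[2]], points[pi])]

    # The cavity boundary consists of edges shared by exactly one bad triangle.
    edge_count = {}
    for i, j, k in bad:
        for e in ((i, j), (j, k), (k, i)):
            key = tuple(sorted(e))
            edge_count[key] = edge_count.get(key, 0) + 1
    boundary = [e for e, n in edge_count.items() if n == 1]

    triangles = [t for t in triangles if t not in bad]
    triangles.extend((i, j, pi) for i, j in boundary)
    return triangles
```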

The technical solution for grab skeleton machine vision in the fully automated unloader system is built upon a grid map generated via SLAM software, which is passed to a global planner. This planner applies search algorithms (such as A*, dynamic A*, the framework method, and the jump point method) under specific constraints to derive a smooth and visually efficient path structure. The resulting Voronoi skeleton map offers strong edge-pruning capabilities and supports robust global path planning.

For deeper recognition, the algorithm identifies the main framework of the grab skeleton by selecting N nodes counterclockwise or by defining N side edges. Based on Voronoi rules, the system generates the skeleton and guiding points through a multi-step process: first, LiDAR and software collaboratively map the environment and define the initial skeleton shape. Then, Voronoi points are extracted, filtered by proximity to obstacles, and traversed to identify and connect valid intersection points. Guiding points are established from the spatial relationship between intersection lines and obstacles, and this information is exported and compared with the pre-processed grid map. If a potential collision is detected, immediate braking is triggered; otherwise, the system loops back for continuous scanning. Each full cycle takes approximately 15 ms (roughly 1/60 of a second), enabling real-time safety assurance and high-efficiency operation.

To further enhance algorithmic transparency, pseudocode is provided to illustrate the core logic: a 2D projection of the point cloud is computed, followed by Delaunay triangulation and Voronoi diagram generation, forming the foundation for dynamic skeleton extraction and real-time monitoring. The skeleton is obtained by filtering Voronoi edges that pass through obstacle-free areas, and redundant branches are pruned based on connectivity and angle heuristics to retain the main path, as shown in Table 1.

The Voronoi-based skeleton extraction thus proceeds step by step, with each stage linked to its function in the system. First, the 2D projection of the point cloud is computed and transformed into a Delaunay triangulation, which provides the geometric foundation for Voronoi generation. Second, Voronoi edges are derived from the triangulation to represent potential medial axes. Third, edges intersecting obstacles are removed so that only safe candidate paths remain. Fourth, redundant or noisy branches are pruned, leaving a simplified skeleton that highlights the main structural paths of the grab and its environment. Finally, the extracted skeleton is fed directly into the global planner for collision detection and safety warning. This sequence ensures that each algorithmic step serves a specific purpose, from geometric construction to real-time safety decision-making.

Table 1 Detailed steps of the Voronoi-based skeleton extraction algorithm.
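As a complement to Table 1, the filtering and pruning stages can be sketched compactly: Voronoi edges of the projected cloud are kept only if they maintain a minimum clearance from obstacle points, and dangling branches are then pruned. The clearance threshold, obstacle representation, and pruning depth below are assumptions for illustration, not the system’s tuned parameters.

```python
import numpy as np
from scipy.spatial import Voronoi, cKDTree


def extract_skeleton(free_points, obstacle_points, clearance=0.5, min_branch_len=3):
    """Voronoi-based skeleton: Voronoi edges of the projected cloud that stay
    at least `clearance` metres away from the obstacle points."""
    vor = Voronoi(free_points)
    obstacles = cKDTree(obstacle_points)

    edges = []
    for ridge in vor.ridge_vertices:
        if -1 in ridge:                       # skip unbounded outer bisectors
            continue
        a, b = vor.vertices[ridge[0]], vor.vertices[ridge[1]]
        mid = 0.5 * (a + b)
        # Keep the edge only if both endpoints and the midpoint are clear.
        if min(obstacles.query(q)[0] for q in (a, b, mid)) >= clearance:
            edges.append((ridge[0], ridge[1]))

    # Prune dangling branches: repeatedly drop edges ending in a degree-1
    # vertex so that only the main connected paths of the skeleton survive.
    for _ in range(min_branch_len):
        degree = {}
        for i, j in edges:
            degree[i] = degree.get(i, 0) + 1
            degree[j] = degree.get(j, 0) + 1
        edges = [(i, j) for i, j in edges if degree[i] > 1 and degree[j] > 1]

    return vor.vertices, edges
```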

3D mapping module design

To achieve real-time spatial perception of the unloader’s operational area, this paper constructs a 3D mapping module based on Visual SLAM. The system utilizes multiple RGB-D cameras placed on the grab, trolley, and track structure to capture continuous image frames and depth information. The ORB-SLAM2 framework is used to perform keyframe extraction, feature point matching, and map optimization. After feature points are extracted from the image frames by the visual frontend, they are matched with historical keyframes to calculate pose transformations, forming a continuous trajectory that is then reprojected into 3D space to generate a dense point cloud. The system then combines IMU information for tightly coupled optimization, effectively eliminating trajectory drift caused by mechanical vibrations and lighting changes.

The generated point cloud map is mapped in real time to the unloader’s operation interface, enabling dynamic modeling of structures such as the grab, material hopper, and ship’s hold32. This module supports heterogeneous sensor fusion and can be extended to incorporate LiDAR data for calibration and correction, ultimately outputting high-precision 3D mapping results in a unified coordinate system. These results provide spatial support for subsequent collision prediction and task scheduling, as shown in Fig. 5.

In this module, optimization refers to the joint refinement of camera poses, IMU states, and point cloud alignment within a unified factor graph framework. The goal is to minimize reprojection errors of visual landmarks and integration errors of inertial measurements, thereby reducing trajectory drift and ensuring spatial consistency of the reconstructed environment. This process directly improves the accuracy of 3D mapping and provides a stable coordinate reference for subsequent collision detection and task scheduling.
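To illustrate how a depth frame becomes a set of world-frame points before fusion, the sketch below back-projects a depth image through the pinhole camera model and applies an estimated camera pose. The intrinsic parameters shown are generic placeholders, not the calibrated values of the cameras used in this study.

```python
import numpy as np


def depth_to_cloud(depth_m, fx, fy, cx, cy, T_world_cam=np.eye(4)):
    """Back-project a depth image (metres, shape H x W) into a world-frame
    point cloud using the pinhole camera model."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    valid = z > 0
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts_cam = np.stack([x[valid], y[valid], z[valid], np.ones(valid.sum())])
    return (T_world_cam @ pts_cam)[:3].T        # (N, 3) points in the world frame


# Placeholder intrinsics roughly typical of a 640 x 480 RGB-D stream:
# cloud = depth_to_cloud(depth_image, fx=615.0, fy=615.0, cx=320.0, cy=240.0,
#                        T_world_cam=estimated_pose)
```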

Fig. 5
figure 5

System flowchart of the 3D mapping module.

To effectively integrate multi-source information such as RGB-D data, inertial measurements, and visual keyframes, the back-end of the system employs a nonlinear optimization model based on a factor graph representation. The objective is to minimize the following cost function:

$$\mathcal{C}(\mathbf{X}) = \sum_{i \in \mathcal{K}} \left\| \mathbf{r}_{\text{reproj}}^{(i)}(\mathbf{X}) \right\|_{\Sigma_r}^{2} + \sum_{j \in \mathcal{I}} \left\| \mathbf{r}_{\text{imu}}^{(j)}(\mathbf{X}) \right\|_{\Sigma_i}^{2}$$
(1)

where \(\mathbf{X}\) is the set of all state variables, including camera poses, velocities, and IMU biases; \(\mathbf{r}_{\text{reproj}}^{(i)}\) denotes the reprojection residual of visual landmark \(i\), computed as the difference between the observed pixel coordinates and those projected from the estimated 3D landmark; \(\mathbf{r}_{\text{imu}}^{(j)}\) represents the IMU residual computed via pre-integration between frames \(j\) and \(j+1\); and \(\Sigma_r\), \(\Sigma_i\) are the associated covariance matrices (information matrices) that weight the confidence of each measurement type. The optimization is solved iteratively using the Levenberg–Marquardt algorithm within a sliding window of recent keyframes to maintain computational efficiency.
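A minimal sketch of how a cost of the form in Eq. (1) can be minimized in practice: each Mahalanobis term is whitened by the square root of its information matrix, and the stacked residual vector is handed to a Levenberg–Marquardt solver. The residual callables and state layout are hypothetical stand-ins for the actual visual and IMU pre-integration factors, not the system’s implementation.

```python
import numpy as np
from scipy.optimize import least_squares


def whiten(residual, cov):
    """Weight a residual by its information matrix: after whitening, the
    ordinary squared norm equals the Mahalanobis norm in Eq. (1)."""
    L = np.linalg.cholesky(np.linalg.inv(cov))
    return L.T @ residual


def stacked_residuals(x, reproj_factors, imu_factors):
    """x: flattened window state (poses, velocities, IMU biases).

    reproj_factors / imu_factors: lists of (residual_fn, covariance) pairs;
    the residual_fn callables are hypothetical stand-ins for the real visual
    reprojection and IMU pre-integration factors.
    """
    parts = [whiten(fn(x), cov) for fn, cov in reproj_factors]
    parts += [whiten(fn(x), cov) for fn, cov in imu_factors]
    return np.concatenate(parts)


# Sliding-window solve with Levenberg-Marquardt over the stacked residuals:
# result = least_squares(stacked_residuals, x0,
#                        args=(reproj_factors, imu_factors), method="lm")
```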

Experiments and results analysis

The LiDAR calibration diagram illustrates the intrinsic and extrinsic calibration procedure involving geometric alignment with known reference markers. The Voronoi skeleton extraction diagram presents a step-by-step representation of spatial segmentation, medial axis extraction, and topology simplification, as shown in Fig. 6.

Fig. 6
figure 6

LiDAR calibration and Voronoi skeleton extraction.

Evaluation of 3D mapping performance

The use of point cloud data in this study is essential for capturing accurate spatial and geometric information of the unloading environment, which cannot be reliably obtained through conventional RGB or night vision cameras. Unlike 2D imaging, point cloud data provides dense three-dimensional measurements that are critical for modeling the structure of the grab crane, its surroundings, and dynamic obstacles with high precision. This spatial fidelity enables real-time depth estimation, volumetric analysis, and trajectory planning for the unloading process. Moreover, in dusty or low-light port environments where RGB imaging may suffer from occlusion or poor visibility, LiDAR-generated point clouds remain robust and unaffected by lighting conditions. These advantages justify the adoption of 3D mapping techniques for ensuring operational safety and automation reliability.

To validate the feasibility and effectiveness of the machine vision-based dynamic 3D coordinate system construction method in the unloader operation process, typical experimental platforms were set up in both laboratory and actual port environments, and data collection and system operation tests were conducted in multiple scenarios. The experiments cover three operational conditions: the model laboratory, daytime operations at the port, and nighttime operations at the port. These span typical conditions ranging from an ideal controlled environment to complex dynamic field conditions, ensuring that the experimental results are representative and generalizable.

In terms of hardware configuration, the core of the 3D mapping system adopts a dual-vision sensor structure that combines the Intel RealSense D455 RGB-D camera and the Livox Mid-70 LiDAR. The former is responsible for capturing image and depth information, while the latter provides high-density point cloud support. Additionally, to further improve the robustness of pose estimation, the system integrates an Xsens MTi-30 IMU unit, which provides acceleration and angular velocity data for auxiliary optimization. These devices are installed at key structural locations of the unloader, including the grab beam, pitch platform, and the main frame of the trolley, and are fixed using aluminum alloy mounting brackets to ensure stability of the viewpoint and consistency of the coordinate system. In the port environment (Fig. 7), to avoid the impact of day-night lighting differences on recognition performance, the RGB image channel is equipped with an automatic exposure and median filtering enhancement module.

Fig. 7
figure 7

Schematic diagram of the unloader operational environment.

In terms of data collection, the system continuously acquires raw image and IMU data at a frequency of 20 frames per second and uses the ROS platform to perform timestamp alignment and buffering, ensuring consistency of the input for subsequent visual SLAM processing. During the experiment, the system automatically records the grab’s motion trajectory, changes in the grabbing position, and environmental point cloud data, while using GNSS and LiDAR as external reference sources to evaluate the accuracy of the mapping trajectory. A total of over 4 h of sample data was collected, covering real-world factors such as day-night lighting changes, wind speed interference, and non-uniform trolley movement, providing abundant data support for subsequent 3D mapping and safety performance analysis.

To visually demonstrate the system’s 3D mapping performance in a typical port operational environment, Fig. 8 presents a spatial point cloud image obtained through the fusion of LiDAR and RGB-D camera sensing. The various structural areas in the image are rendered with different color distributions based on distance information. The boundaries of the grab, trolley track, and material hopper are accurately reconstructed, and the boundary of the ship’s hold is clearly reflected in the red region, verifying the system’s spatial recognition ability in complex industrial scenarios. The white rectangular box further highlights the grab operation target area, facilitating precise constraints for the subsequent collision detection and path planning modules.

Fig. 8
figure 8

Schematic diagram of 3D point cloud mapping results in the unloader operational environment.
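The timestamp alignment and buffering described above can be realized on ROS with an approximate-time synchronizer, as in the sketch below; the topic names and the 25 ms slop are assumptions for illustration, not the exact configuration used in the experiments.

```python
import rospy
import message_filters
from sensor_msgs.msg import Image, Imu


def on_synced(rgb_msg, depth_msg, imu_msg):
    # Messages arriving within the allowed slop are grouped under one
    # timestamp and buffered together for the SLAM front end.
    rospy.loginfo("synced frame at %s", rgb_msg.header.stamp.to_sec())


rospy.init_node("unloader_sync")

rgb_sub = message_filters.Subscriber("/camera/color/image_raw", Image)
depth_sub = message_filters.Subscriber("/camera/depth/image_rect_raw", Image)
imu_sub = message_filters.Subscriber("/imu/data", Imu)

# Approximate-time policy: messages whose stamps differ by less than `slop`
# seconds are delivered to a single synchronized callback.
sync = message_filters.ApproximateTimeSynchronizer(
    [rgb_sub, depth_sub, imu_sub], queue_size=50, slop=0.025)
sync.registerCallback(on_synced)

rospy.spin()
```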

To improve the scientific rigor of the system evaluation, this paper uses the RMSE (Root Mean Square Error) metric to quantify the mapping accuracy, while the frame rate (FPS) is collected to measure the system’s real-time performance33. The evaluation results for the three experimental scenarios are shown in Table 2.
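For reference, the mapping RMSE corresponds to the usual root mean square of per-sample position errors against the GNSS/LiDAR reference. A minimal sketch, assuming the estimated and reference trajectories are already time-aligned:

```python
import numpy as np


def mapping_rmse(estimated_xyz, reference_xyz):
    """Root mean square error between estimated positions and the reference,
    both given as (N, 3) arrays sampled at matched timestamps."""
    err = np.linalg.norm(np.asarray(estimated_xyz) - np.asarray(reference_xyz), axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```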

Table 2 3D mapping performance evaluation table.

From Table 2, it can be seen that the system performs most stably in the model laboratory environment, with the mapping error controlled within 2.8 cm and a frame rate of up to 20 FPS. Under the daytime conditions at the port, the system maintains good stability, with the mapping error slightly increasing to 4.3 cm. However, in the nighttime environment with insufficient lighting and increased background interference, the RMSE rises to 6.1 cm, but it remains within the allowable range of the system.

To further evaluate the feasibility of real-time deployment on embedded platforms, a detailed system performance analysis was conducted. In addition to RMSE and FPS, we measured the processing latency breakdown, memory usage, and CPU/GPU computational load on an NVIDIA Jetson Xavier NX. As shown in Table 3, the total processing time per frame is approximately 86.2 ms, with LiDAR point cloud pre-processing consuming 24.7 ms, coordinate system transformation 15.4 ms, Voronoi diagram construction 19.1 ms, and visualization output 27.0 ms. The system occupies approximately 1.3 GB of memory during operation and maintains an average CPU utilization of 48.7% and GPU utilization of 38.4%, indicating sufficient headroom for real-time performance.

Table 3 Real-time performance evaluation on embedded hardware.
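Per-stage latency figures of this kind can be collected with a simple timing harness such as the one below; the stage callables are placeholders for the actual pre-processing, transformation, Voronoi construction, and visualization routines rather than the deployed code.

```python
import time


def timed_pipeline(frame, stages):
    """Run one frame through the pipeline, recording per-stage latency in ms.

    `stages` maps stage name -> callable; the callables stand in for the real
    pre-processing, coordinate transformation, Voronoi construction, and
    visualization steps.
    """
    timings, data = {}, frame
    for name, fn in stages.items():
        t0 = time.perf_counter()
        data = fn(data)
        timings[name] = (time.perf_counter() - t0) * 1000.0
    timings["total"] = sum(timings.values())
    return data, timings
```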

Furthermore, to visually demonstrate the system’s mapping performance in different environments, Fig. 9 presents a bar comparison chart of the mapping error RMSE and frame rate FPS. It can be observed that although complex environments cause some performance fluctuations, the overall mapping accuracy and real-time performance still exhibit strong adaptability.

Fig. 9
figure 9

Comparison of mapping error RMSE and frame rate FPS in different test scenarios.

Figure 9 presents the comparison of mapping error RMSE and frame rate FPS across three scenarios: laboratory, port daytime, and port nighttime. The results indicate that the laboratory condition achieves the lowest RMSE of about 2.8 cm with a frame rate close to 20 FPS, showing the most stable performance. Under port daytime conditions, the RMSE increases to approximately 4.3 cm while the frame rate remains around 17 FPS, reflecting good adaptability in real environments. During nighttime operations, the RMSE rises to about 6.1 cm and the frame rate drops to nearly 14 FPS, which demonstrates weaker performance due to lighting and environmental interference. The combined results confirm that the system maintains acceptable accuracy and real-time capability across varying scenarios, supporting reliable perception and safety monitoring during unloader operations.

Safety detection performance metrics

During the operation of the unloader’s grab, factors such as the large operating radius, irregular material pile shapes, and frequent spatial obstructions result in a high risk of collisions and operational errors. Therefore, to verify the perception ability of the proposed 3D vision system in detecting collision risks for key structures and its response efficiency during actual operation, this paper conducted multiple rounds of safety detection experiments under typical unloading conditions to assess the system’s performance in collision prediction, false alarm control, and response latency. The safety detection module primarily relies on point cloud space analysis and grab motion trajectory prediction algorithms. By capturing the relative positional relationship between the grab and the surrounding environmental model in real time, potential contact risks are identified in advance, and early warning signals are sent to the operational system. In the experimental tests, 82 actual grab motion trajectories were recorded, including those involving close interactions with the edges of material hoppers, ship hull panels, and other structures.

To further validate the practical performance of the proposed safety detection module in unloading operations, this study visually demonstrated the dynamic spatial relationship between the grab’s path and the ship’s structure through 3D point cloud mapping and trajectory tracking experiments of typical grab operation scenarios. During the tests, the system recorded the grab’s motion trajectory in real time and, combined with the actual unloading environment at the port, achieved intelligent recognition of potential collision risks with the ship’s boundaries and cargo hold contours. Figure 10 shows the visualization results of two typical test samples, comparing error paths with safe paths from dimensions such as spatial structure, trajectory shape, and relative positioning.
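Before turning to these visual results, the core clearance check of the safety detection module can be sketched as a nearest-neighbor query: the predicted grab trajectory is tested against a KD-tree of the environment cloud, and a warning is raised when the clearance falls below a safety margin. The margin value and data layout are assumptions for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree


def collision_risk(predicted_trajectory, environment_cloud, safe_distance=1.0):
    """Flag trajectory samples whose clearance to the environment point cloud
    drops below `safe_distance` (metres). Returns (risk_flag, min_clearance)."""
    tree = cKDTree(environment_cloud)
    clearances, _ = tree.query(np.asarray(predicted_trajectory))
    return bool(np.any(clearances < safe_distance)), float(clearances.min())


# risk, margin = collision_risk(grab_path_xyz, hold_boundary_cloud)
# if risk: issue an early warning / trigger braking
```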

Fig. 10
figure 10

Visualization mapping results of grab path and ship hull edge proximity risk based on point cloud.

As shown in Fig. 10, the point cloud map accurately reconstructs the boundaries of the cargo hold, the hull panels, and the grab’s suspension structure at the unloading site. The red curve indicates the grab’s actual motion trajectory during a specific operation, corresponding to the path originally planned by the operator, and it exhibits a clear tendency to deviate toward the left boundary of the cargo hold. As this segment overlaps increasingly with the boundary point cloud, the system identifies it as a high-risk zone and promptly issues an early warning signal, thereby preventing a potential collision. In contrast, the green curve represents the system-recommended safe trajectory. Comparative analysis reveals that the original path poses significant proximity risks, especially during the final descent and near the right rear boundary, whereas the system-generated path effectively maintains a minimum safe distance from structural boundaries while ensuring efficient unloading operations. The point cloud visualization of the grab’s posture and movement not only enhances situational awareness but also provides a reliable foundation for refining safety strategies.

Additionally, the system’s collision detection results were rigorously validated against manual judgments, achieving a high prediction accuracy of 91.2%. The false alarm rate was limited to 4.7%, ensuring minimal disruption to regular operations, and the average early warning response time was 1.4 s, which is sufficient for executing obstacle avoidance or emergency stopping within operational constraints. Figure 11 illustrates a comparative bar chart summarizing the three core safety metrics, clearly demonstrating that the system sustains over 90% prediction accuracy while keeping both the false alarm rate and response latency within acceptable engineering thresholds, thus confirming its robust risk perception performance under multi-source data fusion conditions.

Fig. 11
figure 11

Safety performance metrics.

To visually demonstrate the system’s spatial risk recognition performance under specific operating conditions, Fig. 12 shows a 3D point cloud image of the scenario where the grab approaches the edge of the material hold during unloading. In the image, the grab structure and the boundary wall outline are clearly visible. Through dynamic envelope analysis and trajectory fitting, the system successfully identifies the area where the grab is about to make contact and provides an early warning signal in advance. This highlights the system’s strong structural recognition and temporal judgment capabilities in a multi-target spatial environment.

Fig. 12
figure 12

3D point cloud visualization of grab bucket approaching cargo hold edge for risk identification.

System operational stability analysis

To further validate the stability and anti-interference capability of the proposed system during long-term operation, a continuous 7-day operational test was conducted in the actual port working environment. The core performance indicators of the system were recorded and analyzed for fluctuations under different time periods and operating conditions. During the test, the unloader operated for an average of 8 h per day, with the system running continuously online. The built-in log module automatically recorded key parameters such as mapping frame rate, error drift, warning frequency, and visual perception integrity.

Figure 13 shows the fluctuation curve of the mapping frame rate (FPS) during continuous operation. The results indicate that, despite strong sunlight during the day and insufficient lighting at night causing some interference with the visual image quality, the overall frame rate fluctuations remain within a ± 3 FPS range, demonstrating that the SLAM module has strong image adaptability and optimization capabilities. Table 4 summarizes the comparison of the total number of early warning triggers and false alarm frequencies during the entire testing period. It can be seen that, out of a total of 68 early warning events, 3 false alarms were recorded, accounting for only 4.4%. All false alarms were related to short-term occlusion misrecognition, indicating that the system still maintains a high level of stability and reliability in the context of frequent grab movements and multi-target interactions.

Fig. 13
figure 13

Fluctuation curve of mapping frame rate (FPS) during continuous operation.

Table 4 Statistics of early warning events during system continuous operation.

Discussion

The proposed machine vision-based system for dynamic 3D coordinate frame construction has demonstrated its effectiveness in spatial awareness and risk detection during ship unloader operations. Experimental data validate the system’s robust performance across various working conditions, maintaining low mapping errors and stable frame rates. The integration of visual SLAM and LiDAR point cloud data proved effective, especially in low-light nighttime environments at the port, where the system sustained a frame rate of 14 FPS and a mapping error under 6.1 cm. This highlights the adaptability of the auto-exposure and filtering modules to challenging illumination changes.

Although various studies have explored SLAM or LiDAR independently for localization or obstacle detection, and Voronoi-based strategies have been used for path planning, there is limited evidence in the literature of integrated frameworks combining all three specifically for unloader safety optimization. The proposed hybrid approach, fusing SLAM-based environmental mapping, LiDAR-driven spatial data acquisition, and Voronoi skeleton extraction, represents a novel direction for enhancing real-time perception and collision avoidance during grab crane operations. By explicitly integrating these modules, the system achieves both geometric precision and path-aware decision-making, which existing siloed solutions often fail to address holistically.

From the safety monitoring perspective, the adoption of collision analysis based on dynamic envelope prediction and motion trajectory fitting allows early detection of potential contact between the grab bucket and critical structures like the cargo hold edge and hull walls. Unlike traditional threshold-based methods, this approach models spatial relationships among multiple targets across frames, effectively identifying nonlinear dynamic hazards. The experimental results showed a high collision prediction accuracy of 91.2%. Moreover, the system maintained stable performance during a seven-day deployment under real port conditions, demonstrating resilience to lighting variation, vibration, and electromagnetic interference.

Nonetheless, certain technical limitations remain, and they may be further amplified under large-scale or multi-unloader operations. Specifically, point cloud sparsity caused by rapid mechanical movement can lead to incomplete spatial representation, which may hinder the accurate detection of adjacent unloaders, structural boundaries, or overlapping operational zones. Similarly, SLAM’s vulnerability to occlusion becomes more pronounced when multiple unloaders operate simultaneously within confined visual fields, increasing the risk of pose drift and coordinate frame misalignment. Moreover, the current system does not implement inter-device communication, which limits collaborative spatial awareness among multiple units. Without a shared coordinate framework or synchronized trajectory prediction, collision avoidance may become suboptimal in densely populated unloading environments. Addressing these limitations will be essential for scaling the system up towards multi-agent coordination and real-time risk management at intelligent ports.

For instance, during high-speed acceleration phases of the grab bucket, localized sparsity in the point cloud can compromise edge detection accuracy. In addition, the SLAM algorithm is sensitive to occlusions, and substantial visual blockage in the operation zone can lead to pose drift and coordinate inconsistency. The system currently relies on embedded computing platforms, which restricts scalability for more complex deep neural networks such as GCNs or Transformers.

While the experimental validation in both laboratory and port environments demonstrates promising results, it should be noted that the scale of the trials remains limited. The reported performance indicators, such as mapping error, frame rate, and collision prediction accuracy, confirm the feasibility of the proposed approach under the tested conditions, but further large-scale deployment and extended testing across diverse operational scenarios will be necessary to fully establish its robustness and reliability. The system also shows weaker performance at night, with increased mapping errors and lower frame rates, and environmental factors such as dust and lighting changes may affect perception stability. In addition, the cost and complexity of integrating multiple sensors may hinder large-scale deployment. Future work will therefore focus on improving robustness, reducing cost, and validating the system in broader applications.

Future developments should focus on several key aspects. First, integrating advanced object recognition models based on Transformers or Graph Convolutional Networks could enhance multi-target recognition under complex conditions. Transformer models offer strong global context modeling capabilities, which can improve the system’s ability to recognize overlapping or partially occluded objects in cluttered port environments. Their attention mechanisms allow the system to weigh spatial dependencies across frames, improving accuracy in identifying cargo holds, structural edges, and dynamic targets such as grab buckets. Meanwhile, Graph Convolutional Networks (GCNs) are well suited to relational modeling among multiple moving entities, enabling improved spatial reasoning in multi-unloader coordination scenarios. However, these advanced architectures require significantly more memory and computation, which may exceed the capabilities of current embedded platforms. This trade-off necessitates lightweight model compression strategies, such as quantization or pruning, and may call for dedicated edge AI chips or offloading inference tasks to centralized servers over 5G networks.

Second, the adoption of edge computing and 5G communication modules can address the latency and bandwidth bottlenecks often encountered in multi-unloader collaborative tasks. By deploying lightweight computing nodes directly on or near the unloader, the system can perform real-time perception, collision prediction, and decision-making locally, reducing reliance on centralized servers and improving responsiveness. The integration of 5G enables ultra-low-latency data transmission and high-speed uplink/downlink throughput, allowing spatial coordinates, risk alerts, and operational status to be rapidly shared among different units. This capability is particularly critical when multiple unloaders operate in close proximity within a confined port environment, as it facilitates synchronized motion planning and conflict avoidance. Moreover, edge computing reduces data transmission costs and mitigates privacy concerns, making it more suitable for industrial-scale deployment.

Conclusion

This study presents a comprehensive framework for dynamic 3D coordinate system construction and safety optimization in ship unloader operations under complex port environments. By integrating LiDAR-based point cloud reconstruction, visual SLAM, and Voronoi skeleton recognition, the proposed system establishes a robust spatial reference adaptable to the unloader’s dynamic structure. Furthermore, a multi-modal sensing and risk prediction module is incorporated to achieve real-time collision avoidance and intelligent path planning. Experimental validations across laboratory and real port scenarios confirm the system’s strong performance, with mapping errors consistently under 6.1 cm, frame rates up to 20 FPS, and collision prediction accuracy reaching 91.2%. The system also maintained a low false alarm rate (< 5%) and short average response time (≤ 1.4 s). A seven-day field deployment further demonstrated its operational stability, with a false alarm rate limited to 4.4%. These results affirm the system’s potential for real-world applications in large-scale port machinery.

Future work will focus on integrating deep learning-based multi-object detection models and digital twin simulation platforms to enhance remote coordination and safety resilience in automated unloading operations. To further improve real-world applicability, the system was designed with consideration for adverse environmental conditions such as fog or rain. Although extreme weather scenarios were not the focus of this study, future experiments will include tests under such conditions to fully evaluate sensing robustness.