Introduction

Environment perception is a key technology for the Intelligent Connected Vehicle (ICV), and object detection, as the basis for more complex environment sensing tasks, has a direct impact on traffic safety through its accuracy1. In recent years, with the continuous development of deep-learning-based object detection algorithms, prediction accuracy has steadily improved2. However, for occluded traffic targets, truncation cripples feature acquisition, often resulting in missed detections and false alarms that seriously degrade detection performance. Therefore, the study of occluded object detection is of great significance to traffic safety.

Object detection algorithms can be classified into two types: traditional and deep-learning-based methods. Traditional methods3,4 rely heavily on manually designed feature descriptors and cannot adapt to complex scenes, a limitation that deep-learning-based methods overcome; deep-learning-based object detection algorithms have therefore become the mainstream in applications. They are further classified into two-stage and single-stage methods according to their detection logic. The RCNN series of algorithms proposed in the literature5,6,7,8 is representative of two-stage methods; all algorithms in this series must first generate a large number of candidate regions to mark every possible target, which yields high detection accuracy, but the redundant anchor boxes and repeated computations reduce the computing speed. Single-stage algorithms, which do not need to select candidate regions and directly generate the target position coordinates, were subsequently developed.

Meanwhile, the optimization of object detection algorithms under occlusion follows two main technical approaches: improving detectors based on whole-object features, and enhancing detectors based on partial semantic features. Although existing methods have made progress on occluded object detection along these directions, several issues remain. For instance, common occlusion datasets such as KITTI and Caltech cover limited scene types, so detectors trained on them lack robustness. The complexity of occlusion also makes it impossible for algorithms to learn the effectively unlimited variety of occlusion patterns. Moreover, under occlusion, severe overlap between prediction boxes may cause correct predictions to be erroneously suppressed, resulting in missed detections. There is therefore an urgent need for more targeted and robust algorithms designed around the characteristics of occluded targets, which is of great significance for complex perception systems.

YOLOX provides four networks of different scales for different application scenarios. Among them, YOLOX-s is the most lightweight and is therefore of great value for engineering applications. This paper accordingly carries out its research based on the YOLOX-s algorithm; the network structure is shown in Fig. 1.

Fig. 1. YOLOX network structure.

The YOLOX-s backbone network is CSPDarknet9, which deepens the network structure and improves image feature representation while reducing network redundancy. The SPP structure is used at the end of the backbone network, with three maximum pooling kernels of different sizes, to enlarge the receptive field of the network. The neck network uses three effective feature layers to construct an FPN (Feature Pyramid Network)10 + PAN (Path Aggregation Network)11 bidirectional feature pyramid, which performs multidimensional feature fusion on the effective features. The prediction end of YOLOX-s adopts a decoupled head and obtains three prediction outputs (classification, regression, and objectness) for each effective feature layer.
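To make the decoupled prediction structure concrete, the sketch below (not the authors' code; the channel widths and class count are illustrative) shows a YOLOX-style decoupled head for one feature layer, with separate branches producing the classification, regression, and objectness outputs.

```python
# Minimal sketch of a YOLOX-style decoupled head for one feature layer.
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_channels: int = 256, num_classes: int = 5, width: int = 128):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, width, kernel_size=1)
        self.cls_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
        )
        self.cls_pred = nn.Conv2d(width, num_classes, 1)  # per-class scores
        self.reg_pred = nn.Conv2d(width, 4, 1)            # box (x, y, w, h)
        self.obj_pred = nn.Conv2d(width, 1, 1)            # objectness

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        cls_feat = self.cls_branch(x)
        reg_feat = self.reg_branch(x)
        return self.cls_pred(cls_feat), self.reg_pred(reg_feat), self.obj_pred(reg_feat)

# Usage: one such head is attached to each of the three effective feature layers.
head = DecoupledHead(in_channels=256, num_classes=5)
cls_out, reg_out, obj_out = head(torch.randn(1, 256, 40, 40))
```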

In this paper, YOLOX is improved to address the poor detection of occluded targets from the vehicle viewpoint, and an adaptive deformable YOLOX occlusion object detection algorithm is proposed. In summary, we make the following contributions in this work:

1. We have applied adaptive deformable convolution to the backbone network of YOLOX, effectively enhancing the algorithm's transformation capability for irregular geometries to adapt to complex and dynamic occlusion scenes, and to strengthen the feature representation ability of occluded targets.

2. The coordinate attention mechanism12 is applied in the Neck part to overcome the restriction of applying weights across different channels and spatial positions, so as to improve the fusion of the features extracted for occluded targets.

3. We propose the Overlapping IoU (OL-IoU) regression loss function based on the intersection over union (IoU)13, adding a proportional penalty on the overlapping width and height to accelerate the convergence of the bounding box while increasing detection accuracy.

4. The Spatial Pyramid Pooling (SPP)14 is replaced by the more efficient Fast SPP15 to balance the real-time performance and accuracy of occlusion object detection. Experiments in occlusion scenarios show that the algorithm improves both the accuracy and the real-time performance of occluded traffic object detection, verifying its effectiveness.

Related work

Single-stage object detection

Literature16 proposed the YOLO (You Only Look Once) algorithm, which treats detection as a regression problem and greatly improves detection speed, but it suffers from serious missed detections for irregular and small targets. Literature17 proposed YOLO9000, which is trained on a joint dataset and designs the Darknet19 backbone network, improving detection accuracy and speed, but it is hindered by having only one detection branch and therefore cannot handle targets of multiple sizes. Literature18 proposed SSD (Single Shot MultiBox Detector), which uses convolutional layers to replace the fully connected layers of YOLO detection and applies feature maps of different scales to detect targets of different sizes, but the lack of constraints on the standard sizes means that incorrectly selected feature maps produce large detection errors. Literature9 proposed YOLOv3, designing the Darknet53 backbone network to fill the gap in multi-size detection and using binary cross-entropy loss for multi-label classification, but it does not consider the regression loss, so the localization results are imprecise. Literature15 proposed YOLOv4 with the CSPDarknet53 backbone network, which applies a deeper network with more parameters than Darknet53 and adds several plug-in modules, effectively balancing accuracy and real-time performance; however, the adopted DropBlock random discarding does not guarantee the diversity of feature information, so indeterminate duplicated operations remain. Literature19 proposed the efficient YOLOv5 detection model, which provides four network models of different depths to suit different detection requirements and applies adaptive anchor boxes to train customized datasets, improving running speed. However, the manual matching rule strategy still requires empirical settings and offers little improvement for detecting irregular targets.

YOLOX20 incorporates the anchor-free framework and the decoupled head, which reduces the tuning burden of manually setting anchors, makes feature learning and the classification and regression tasks easier to learn, and improves accuracy and computing speed compared with YOLOv5 models of all sizes. However, it still cannot detect effectively in scenes with heavy occlusion. In summary, object detection algorithms still have problems, especially with irregular targets. From the vehicle viewpoint in intelligent transportation, such irregular targets appear as occluded vehicles, pedestrians, cyclists, and so on, and the variability of traffic scenes makes the occlusion relationships of the detection targets extremely complex, further increasing the difficulty of object detection.

Occlusion object detection

Occlusion object detection algorithms build on deep learning detectors and are optimized for the characteristics of occlusion, with the aim of training a network model that can cope with it. Literature21 proposes an occlusion handling technique based on the parts of the pedestrian's body, applying a Histogram of Oriented Gradients (HOG) detector to compute classification scores over a sliding window and feeding the sum of their responses to a global detector to reflect possible partial occlusion of pedestrians. Literature22 proposes a method for detecting partially occluded pedestrians by determining the visible part of the object, using a discriminatively trained Deformable Parts Model (DPM) to solve a concave optimization problem that indicates whether an image part belongs to the target object or to the occluder. Literature23 proposes the Soft-NMS non-maximum suppression method to attenuate the scores of highly overlapping prediction boxes, but in essence it still treats overlapping proposal boxes as false positives that cannot be classified accurately. Literature24 integrates deep convolutional neural networks with compositional convolutional neural networks and replaces the fully connected classification layer with a composable layer in order to classify and localize occluded targets. Literature25 uses background subtraction to model two-wheelers in crowded scenes and a decision tree to evaluate the geometric features of overlapping two-wheelers for classification. All of the above occluded-target detectors are designed for a single class of targets and cannot be adapted to occluded object detection in multi-category complex scenes. Literature26,27 requires manual annotation at the part level, which incurs higher data preparation costs. Literature28,29,30 focuses primarily on video targets, whereas our method is aimed at static images. Literature31 employs a combination of classification and proposal methods, which is less efficient and unsuitable for complex, dynamic traffic scenarios. Literature32 concentrates on pedestrian targets, failing to cover the various types of traffic participants, and does not address the lightweight model design required by the computational constraints of intelligent vehicles.

Attention mechanism for computer vision

Attention mechanisms aim to enhance the ability of neural networks to judge the importance of feature information; Squeeze-and-Excitation (SE)33 and CBAM34 are currently the main approaches used in computer vision. However, SE only weights the importance of each channel by modelling the dependencies between channels and ignores positional information. CBAM introduces spatial information encoding on top of the channel attention by using large convolution kernels, but this approach can only capture local information, not long-range dependencies. Motivated by these limitations, Coordinate Attention (CA)12 was proposed, which decomposes 2D global pooling into one-dimensional encodings along two spatial directions to obtain position information and channel relationships, respectively.

OCC-YOLOX Algorithm

Deformable convolution

When conventional convolution models image features, the receptive field is a regular, fixed rectangle with limited ability to handle geometric variation. However, the location and shape of occluded targets cannot be predicted, so the algorithm needs the ability to adaptively perceive the target region and adjust its receptive field. Deformable Convolutional Networks (DCN)35,36 provide adaptive, learnable offsets for the spatial sampling points, giving the network adaptive modelling capability. The deformation process can be described by Eq. (1).

$$y\left( {{p_0}} \right)=\sum\limits_{{n=1}}^{N} {{\omega _n}} \times x\left( {{p_0}+{p_n}+\Delta {p_n}} \right) \times \Delta {m_n}$$
(1)

where \(\Delta m_n\) denotes the modulation scalar, normalized by a sigmoid so that the feature-level range can be controlled more efficiently, \(p_0\) is the pixel position of the output feature, \(\Delta p_n\) denotes the learnable offset, \(\omega_n\) is the convolution weight, and \(p_n\) enumerates the sampling positions of the regular grid \(R\), which can be expressed as Eq. (2). The sampling position of each sample point is obtained by adding the learned offset to each regular convolution position.

$${p_n} \in R=\{ ( - 1, - 1),( - 1,0), \ldots ,(0,1),(1,1)\}$$
(2)

Deformable convolution uses learnable offsets to describe target feature orientations, so that the receptive field of the network is not limited to a fixed range and can adapt more flexibly to changes in target geometry. As a result, DCNs are better suited to detection in complex scenes. Although deformable convolution itself does not add much computation, applying a large number of DCNs increases the algorithm's inference time. Therefore, to balance inference efficiency and detection accuracy, this paper replaces the standard 3 × 3 convolution in the FEAT3 layer of the backbone network with deformable convolution, whose kernel sampling positions are adjusted adaptively through the learned offsets; the offset process is shown in Fig. 2, and it effectively adapts to the shape of the occluded target.
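As a concrete illustration, the following sketch builds the modulated deformable 3 × 3 convolution of Eq. (1) on top of torchvision's DeformConv2d; the wrapper module, its name, and the zero initialization of the offset branch are our assumptions rather than the authors' implementation.

```python
# Sketch of a modulated deformable 3 x 3 convolution: an extra convolution
# predicts the offsets (delta p_n) and the sigmoid-normalized modulation
# scalars (delta m_n) for each sampling point, as in Eq. (1).
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AdaptiveDeformConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, stride: int = 1):
        super().__init__()
        pad = k // 2
        # 2*k*k offset channels (x and y per sampling point) + k*k modulation channels
        self.offset_mask = nn.Conv2d(in_ch, 3 * k * k, k, stride=stride, padding=pad)
        nn.init.zeros_(self.offset_mask.weight)   # start from the regular sampling grid
        nn.init.zeros_(self.offset_mask.bias)
        self.deform = DeformConv2d(in_ch, out_ch, k, stride=stride, padding=pad)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = self.k * self.k
        om = self.offset_mask(x)
        offset = om[:, : 2 * n]                    # learned offsets
        mask = torch.sigmoid(om[:, 2 * n :])       # modulation scalars in (0, 1)
        return self.deform(x, offset, mask)

# Example: a drop-in replacement for a standard 3 x 3 convolution in a feature layer.
layer = AdaptiveDeformConv(256, 256)
y = layer(torch.randn(1, 256, 20, 20))
```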

Fast SPP spatial pyramid pooling

YOLOX-s applies the Spatial Pyramid Pooling (SPP) structure in the backbone network. SPP passes the feature map through three maximum pooling kernels of different sizes (5 × 5, 9 × 9, and 13 × 13) and then aggregates the results, which efficiently enlarges the receptive field of the network.

However, using multiple concatenated pooling kernels of different sizes greatly increases the computational cost and reduces the detection speed of the algorithm. Therefore, the FAST SPP structure is used as a replacement in the backbone network of OCC-YOLOX; its structure is shown in Fig. 3. Compared with SPP, FAST SPP applies three 5 × 5 pooling layers in series, so that the output of each pooling becomes the input of the next, and aggregates features at different scales of the same feature map, which improves the utilization of the network parameters and reduces the computational cost.

Fig. 2. Deformable convolution offset effect.

Fig. 3. FAST SPP structural model.
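The serial-pooling idea can be illustrated with the following sketch (assuming PyTorch; the 1 × 1 fusion convolution is an assumption), which also checks that three serial 5 × 5 max poolings reproduce the 13 × 13 pooling of SPP exactly.

```python
# Sketch of Fast SPP: three 5 x 5 max-pooling layers in series cover the same
# receptive fields as the parallel 5 x 5 / 9 x 9 / 13 x 13 kernels of SPP.
import torch
import torch.nn as nn

class FastSPP(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.fuse = nn.Conv2d(channels * 4, channels, kernel_size=1)  # fuse concatenated scales

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p1 = self.pool(x)        # equivalent receptive field: 5 x 5
        p2 = self.pool(p1)       # equivalent receptive field: 9 x 9
        p3 = self.pool(p2)       # equivalent receptive field: 13 x 13
        return self.fuse(torch.cat([x, p1, p2, p3], dim=1))

# Serial poolings match the largest parallel SPP pooling exactly:
x = torch.randn(1, 8, 16, 16)
pool5 = nn.MaxPool2d(5, stride=1, padding=2)
assert torch.allclose(pool5(pool5(pool5(x))), nn.MaxPool2d(13, stride=1, padding=6)(x))
out = FastSPP(8)(x)              # same spatial size and channel count as the input
```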

Coordinate attention network

The CA attention module is shown in Fig. 4. Given an input feature map, two one-dimensional pooling operations are used to encode each channel along the horizontal and vertical coordinates, respectively. The output of the c-th channel at height h can therefore be formulated as Eq. (3):

$$Z_{c}^{h}(h)=\frac{1}{W}\sum\limits_{{0 \leqslant i \leqslant W}} {{x_c}} (h,i)$$
(3)

The output of the c-th channel at width w can be written as Eq. (4):

$$Z_{c}^{w}(w)=\frac{1}{H}\sum\limits_{{0 \leqslant j \leqslant H}} {{x_c}} (j,w)$$
(4)

The features aggregated along the two spatial directions are then concatenated and transformed, as in Eq. (5), producing an intermediate feature map \(f\) that encodes spatial information in both the horizontal and vertical directions.

$$f=\delta \left( {{F_1}\left( {\left[ {{Z^h},{Z^w}} \right]} \right)} \right)$$
(5)

where \(F_1\) denotes the 1 × 1 convolutional transform applied to the concatenation of the two directional features and \(\delta\) is a nonlinear activation function. The map \(f\) is then split and fed into two 1 × 1 convolutions \(F_h\) and \(F_w\), which restore the channel dimension and produce the attention weights for the two spatial directions, as in Eqs. (6) and (7); the coordinate attention output is finally obtained as in Eq. (8).

$${g^h}=\sigma \left( {{F_h}\left( {{f^h}} \right)} \right)$$
(6)
$${g^w}=\sigma \left( {{F_w}\left( {{f^w}} \right)} \right)$$
(7)
$${y_c}(i,j)={x_c}(i,j) \times g_{c}^{h}(i) \times g_{c}^{w}(j)$$
(8)

The CA attention module captures long-range dependencies along one spatial direction while preserving accurate position information along the other. After adding the attention module to the up-sampling and down-sampling paths of the bidirectional FPN feature pyramid, it decides which regions need attention, assigns weights to the different feature maps, and activates them with the swish activation function, so that low-weight regions receive less attention and processing resources are allocated more reasonably. The location of the CA attention module is shown in Fig. 5.
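For reference, a minimal PyTorch sketch of the coordinate attention operations in Eqs. (3)-(8) is given below; the reduction ratio and the use of swish for the nonlinearity δ follow the description above, while the module name and channel sizes are illustrative.

```python
# Sketch of coordinate attention: directional average pooling along H and W,
# a shared 1 x 1 transform F1, and two 1 x 1 convolutions F_h and F_w that
# produce the attention maps g^h and g^w re-weighting the input feature map.
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.f1 = nn.Conv2d(channels, mid, kernel_size=1)   # shared transform F1
        self.act = nn.SiLU()                                 # nonlinear activation (swish)
        self.f_h = nn.Conv2d(mid, channels, kernel_size=1)   # F_h
        self.f_w = nn.Conv2d(mid, channels, kernel_size=1)   # F_w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                    # Eq. (3): (n, c, h, 1)
        z_w = x.mean(dim=2, keepdim=True)                    # Eq. (4): (n, c, 1, w)
        # Concatenate the two directional encodings and transform, Eq. (5)
        f = self.act(self.f1(torch.cat([z_h, z_w.permute(0, 1, 3, 2)], dim=2)))
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.f_h(f_h))                       # Eq. (6): (n, c, h, 1)
        g_w = torch.sigmoid(self.f_w(f_w.permute(0, 1, 3, 2)))   # Eq. (7): (n, c, 1, w)
        return x * g_h * g_w                                      # Eq. (8)

ca = CoordinateAttention(128)
out = ca(torch.randn(1, 128, 40, 40))
```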

Fig. 4. CA attention module diagram.

Fig. 5. Enhancement of the structure of the feature extraction section.

YOLOX uses dynamic matching of positive samples to obtain the feature point corresponding to each ground-truth box, takes the prediction box at that feature point, and computes the IoU regression loss between the ground-truth box and the prediction box. IoU (Intersection over Union) reflects how well the prediction box matches the ground-truth box and is calculated as Eq. (9). However, it cannot accurately represent how the two boxes overlap: two box configurations with the same IoU can correspond to very different regression quality, so this representation alone is flawed.

$$\operatorname{IoU} (A,B)=\frac{{|A \cap B|}}{{|A \cup B|}}$$
(9)

DIoU37 adds a penalty based on the Euclidean distance between the centre points of the two boxes, normalized by the diagonal length of their minimum enclosing rectangle, which accelerates regression and avoids the situation where the loss cannot be effectively optimized when the two boxes are far apart. The formula is expressed as Eq. (10):

$${L_{DIoU}}=1 - \operatorname{IoU} (A,B)+\frac{{{\rho ^2}\left( {b,{b^{gt}}} \right)}}{{{c^2}}}$$
(10)

where \(c\) denotes the diagonal length of the minimum enclosing rectangle of the prediction box and the real box, \(b\) and \(b^{gt}\) denote the centre points of the prediction box and the real box, respectively, and \(\rho(\cdot)\) is the Euclidean distance.

CIoU37 adds an aspect-ratio consistency term between the predicted and real boxes on the basis of DIoU, as in Eq. (11). Here \(\alpha\) is the weight factor indicating the proportion of the consistency term in the total loss, and \(v_1\) measures the aspect-ratio consistency; they are calculated as Eqs. (12) and (13):

$${L_{CIoU}}=1 - \operatorname{IoU} (A,B)+\frac{{{\rho ^2}\left( {b,{b^{gt}}} \right)}}{{{c^2}}}+\alpha {v_1}{\text{ }}$$
(11)
$${v_1}=\frac{4}{{{\pi ^2}}}{\left( {\arctan \frac{{{w^{gt}}}}{{{h^{gt}}}} - \arctan \frac{w}{h}} \right)^2}$$
(12)
$$\alpha =\frac{{{v_1}}}{{1 - \operatorname{IoU} (A,B)+{v_1}}}$$
(13)

EIoU38 splits the aspect-ratio term of CIoU and separately penalizes the differences between the width and height of the overlapping region and those of the minimum enclosing rectangle, as in Eq. (14), where \(c_w\) denotes the width of the minimum enclosing rectangle, \(c_h\) its height, \(I_w\) the width of the overlapping region, and \(I_h\) its height. By treating the width and height errors separately, EIoU promotes effective optimization of the model.

$${L_{EIoU}}=1 - \operatorname{IoU} (A,B)+\frac{{{\rho ^2}\left( {b,{b^{gt}}} \right)}}{{{c^2}}}+\frac{{{{\left( {{c_w} - {I_w}} \right)}^2}}}{{c_{w}^{2}}}+\frac{{{{\left( {{c_h} - {I_h}} \right)}^2}}}{{c_{h}^{2}}}$$
(14)

In this paper, we propose OL-IoU, which builds on EIoU by introducing the aspect-ratio consistency factor \(v_1\) between the prediction box and the real box, together with an overlap width-height accelerated-regression consistency factor \(v_2\). Here \(v_1\) is the same as in CIoU, and \(v_2\) is calculated as Eq. (15):

$${v_2}=\frac{4}{{{\pi ^2}}}{\left( {\arctan \frac{{{I_w}}}{{{c_w}}} - \arctan \frac{{{I_h}}}{{{c_h}}}} \right)^2}$$
(15)

where \(\beta\) is the weight coefficient indicating the proportion of the accelerated-regression consistency term in the total loss; a larger value drives the width and height of the prediction box toward those of the real box simultaneously. The OL-IoU loss is expressed as Eq. (16):

$${L_{OL-IoU}}=1 - \operatorname{IoU} (A,B)+\frac{{{\rho ^2}\left( {b,{b^{gt}}} \right)}}{{{c^2}}}+\frac{{{{\left( {{c_w} - {I_w}} \right)}^2}}}{{c_{w}^{2}}}+\frac{{{{\left( {{c_h} - {I_h}} \right)}^2}}}{{c_{h}^{2}}}+\alpha {v_1}+\beta {v_2}$$
(16)

The term \(\beta v_2\) is designed as an overlapping width-height accelerated-regression consistency penalty that encourages the prediction box to approach the real box at equal speed in both the width and height directions, avoiding the situation where optimization in a single direction stalls. At the same time, this design encourages progress in one direction to drive an equivalent change in the other, so that the overlap area can grow quadratically rather than linearly, accelerating the increase of IoU. A schematic is shown in Fig. 6, where (a) shows the intersection-over-union representation without the overlap width-height accelerated-regression consistency penalty, and (b) shows that, with the penalty added, the prediction box moves toward the real box more actively. By applying the above improvements to the YOLOX model, the OCC-YOLOX shown in Fig. 7 is finally obtained.
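A minimal PyTorch sketch of the OL-IoU loss in Eq. (16) is given below; it follows the definitions above, and since the paper specifies β only as a weight coefficient, β is treated here as a hyperparameter.

```python
# Sketch of the OL-IoU regression loss of Eq. (16); boxes are (x1, y1, x2, y2).
import math
import torch

def ol_iou_loss(pred: torch.Tensor, target: torch.Tensor, beta: float = 1.0, eps: float = 1e-7) -> torch.Tensor:
    # widths/heights and centres of the prediction and the ground-truth box
    pw, ph = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    tw, th = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    pcx, pcy = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
    tcx, tcy = (target[..., 0] + target[..., 2]) / 2, (target[..., 1] + target[..., 3]) / 2

    # overlap width/height I_w, I_h and IoU, Eq. (9)
    iw = (torch.min(pred[..., 2], target[..., 2]) - torch.max(pred[..., 0], target[..., 0])).clamp(min=0)
    ih = (torch.min(pred[..., 3], target[..., 3]) - torch.max(pred[..., 1], target[..., 1])).clamp(min=0)
    inter = iw * ih
    union = pw * ph + tw * th - inter + eps
    iou = inter / union

    # minimum enclosing rectangle: width c_w, height c_h, squared diagonal c^2
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps

    rho2 = (pcx - tcx) ** 2 + (pcy - tcy) ** 2            # squared centre distance
    v1 = (4 / math.pi ** 2) * (torch.atan(tw / (th + eps)) - torch.atan(pw / (ph + eps))) ** 2  # Eq. (12)
    alpha = v1 / (1 - iou + v1 + eps)                     # Eq. (13)
    v2 = (4 / math.pi ** 2) * (torch.atan(iw / (cw + eps)) - torch.atan(ih / (ch + eps))) ** 2  # Eq. (15)

    return (1 - iou
            + rho2 / c2
            + (cw - iw) ** 2 / (cw ** 2 + eps)
            + (ch - ih) ** 2 / (ch ** 2 + eps)
            + alpha * v1
            + beta * v2)

# Example: a prediction box partially overlapping its ground-truth box.
loss = ol_iou_loss(torch.tensor([[10., 10., 60., 80.]]), torch.tensor([[20., 15., 70., 90.]]))
```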

Fig. 6. Schematic diagram of overlapping IoU regression.

Fig. 7. OCC-YOLOX structural model.

Experiments

Experimental environment

To verify the effectiveness of the proposed algorithm, the experiments were run on the Windows 10 platform with the PyTorch deep learning framework, an RTX 3060Ti graphics card, and an i7-11700F processor. Development was carried out in Python, with Anaconda for environment management and VS Code as the IDE.

Datasets preprocessing

To adapt to occluded-object scenes, a large number of multi-target occlusion scenes need to be selected. In addition, to meet the needs of detection from the vehicle viewpoint, public in-vehicle traffic datasets such as KITTI39, CityPersons40, and BDD100K41 are used. The KITTI dataset contains more than 10,000 real images of multiple scenes captured by an in-vehicle camera and labels eight types of traffic targets, including various degrees of occlusion and truncation. The CityPersons dataset, built on the Cityscapes in-vehicle imagery, contains 5,000 images with pedestrian annotations, including a large number of pedestrian occlusion scenes. The BDD100K dataset collects more than 100,000 images from the in-vehicle viewpoint, covering different lighting, weather, and other conditions, and annotates 10 types of traffic targets. These three datasets meet the requirements of the experimental scenarios, so this paper combines 7,481 KITTI images, 1,599 CityPersons images, and 2,500 BDD100K images, 11,580 images in total, as the data source for training and testing, with image selection paying particular attention to occlusion scenarios. Samples from the three selected datasets are shown in Fig. 8.

Fig. 8. Sample plot of data set selection.

First, using the KITTI label format as the template, we extract the category, degree of occlusion, and 2D bounding-box coordinates from the labels of the three datasets, convert all TXT labels into the VOC format, and rename the images consistently. Second, we merge equivalent targets that have different category names in different datasets, for example merging Human, Person, and Pedestrian into Pedestrian, and finally build a vehicle-view occlusion object detection dataset with five target categories (Car, Pedestrian, Truck, Van, and Cyclist). The occlusion object detection dataset is divided into training, test, and validation sets in the ratio 8:1:1.
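The label conversion can be sketched as follows (file paths, the image-size source, and the exact category merge table are assumptions); it reads a KITTI-style TXT label, merges equivalent category names, and writes a VOC-format XML annotation.

```python
# Sketch of converting one KITTI-style TXT label into a VOC-format XML file
# while merging equivalent category names into the five kept classes.
import xml.etree.ElementTree as ET

MERGE = {"Human": "Pedestrian", "Person": "Pedestrian", "Pedestrian": "Pedestrian",
         "Car": "Car", "Truck": "Truck", "Van": "Van", "Cyclist": "Cyclist"}

def kitti_txt_to_voc_xml(txt_path: str, image_name: str, width: int, height: int) -> ET.ElementTree:
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = image_name
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    with open(txt_path) as f:
        for line in f:
            fields = line.split()
            # KITTI label line: type truncated occluded alpha x1 y1 x2 y2 ...
            name = MERGE.get(fields[0])
            if name is None:                  # skip categories outside the five kept classes
                continue
            obj = ET.SubElement(root, "object")
            ET.SubElement(obj, "name").text = name
            ET.SubElement(obj, "occluded").text = fields[2]   # keep the occlusion level
            box = ET.SubElement(obj, "bndbox")
            for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), fields[4:8]):
                ET.SubElement(box, tag).text = str(int(float(value)))
    return ET.ElementTree(root)

# Usage (hypothetical file names):
# kitti_txt_to_voc_xml("000123.txt", "000123.png", 1242, 375).write("000123.xml")
```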

Evaluation indicators

Model performance is evaluated using the precision-recall (P-R) curve and the detection speed in FPS (frames per second). The IoU threshold is set to 0.5; the average precision (AP) of a category is the area under its P-R curve, and the mean average precision (mAP) is the mean of the AP values over all categories. The detection speed FPS indicates how many frames can be processed per second; a larger FPS means faster detection and better real-time performance.
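For clarity, the sketch below (assuming NumPy; not the evaluation code used in the paper) shows how the AP of one category is computed from scored detections at the 0.5 IoU threshold; mAP is then the mean of such values over the five categories.

```python
# Sketch of per-category AP from the precision-recall curve.
import numpy as np

def average_precision(scores: np.ndarray, is_true_positive: np.ndarray, num_gt: int) -> float:
    """scores: confidence of each detection; is_true_positive: 1 if the detection
    matches an unmatched ground-truth box with IoU >= 0.5, else 0."""
    order = np.argsort(-scores)                       # rank detections by confidence
    tp = np.cumsum(is_true_positive[order])
    fp = np.cumsum(1 - is_true_positive[order])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # area under the P-R curve (monotonically decreasing precision envelope)
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

# Example with three detections (two correct) against four ground-truth boxes;
# the numbers are illustrative placeholders, not results from this paper.
ap = average_precision(np.array([0.9, 0.75, 0.6]), np.array([1, 0, 1]), num_gt=4)
```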

Model training

Training is performed on a single GPU for 300 epochs with a batch size of 8. The Adam optimizer is selected with a momentum of 0.937, and the learning rate is decayed from an initial value of 0.001 to a minimum of 0.00001.
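The training configuration can be expressed as the following sketch (assuming PyTorch; mapping the momentum of 0.937 to Adam's first beta and using cosine annealing for the decay from 0.001 to 0.00001 are our assumptions, since the paper states only the initial and minimum learning rates).

```python
# Sketch of the stated training hyperparameters: 300 epochs, batch size 8,
# Adam with momentum 0.937, learning rate decayed from 1e-3 to 1e-5.
import torch

model = torch.nn.Conv2d(3, 16, 3)   # placeholder standing in for the OCC-YOLOX model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.937, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300, eta_min=1e-5)

for epoch in range(300):
    # ... one training epoch over the occlusion dataset (batch size 8) ...
    scheduler.step()
```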

The model training results are shown in Fig. 9. The left side of the figure shows the change in training loss, and the right side shows the training accuracy of each epoch. It can be observed that the OCC-YOLOX algorithm reaches its peak accuracy at epoch 210.

Fig. 9. Comparison of average accuracy.

Results and discussion

Model testing

Model performance is evaluated on the test set of the fused dataset and compared with the YOLOX model; the comparison results are shown in Table 1 below. In addition, visual comparisons of the test results are shown in Figs. 10, 11, 12, 13, 14 and 15, where Figs. 10, 11 and 12 are occlusion scenes selected from the test sets of the public datasets, and Figs. 13, 14 and 15 are occlusion scenes that we collected ourselves. In each figure, the left side (a) shows the YOLOX test results and the right side (b) shows the OCC-YOLOX test results. As the visual comparison shows, OCC-YOLOX detects occluded objects markedly better than YOLOX. As shown in Fig. 10, OCC-YOLOX effectively detects small and occluded objects at a distance, demonstrating its ability to focus on distant occluded targets. Figure 11 illustrates the superior performance of OCC-YOLOX in detecting closely packed pedestrians in the near field in the KITTI dataset. Figure 12 highlights the greater attention OCC-YOLOX gives to occluded vehicles in the opposite lane in a nighttime scene. Figures 13 and 14 show the higher detection rate and fewer false positives of OCC-YOLOX in complex, congested traffic within the campus environment. Finally, Fig. 15 shows the robustness of OCC-YOLOX in detecting occluded targets under the challenging conditions of snowy weather and low visibility at night. These results collectively demonstrate the effectiveness of OCC-YOLOX in addressing occlusion in diverse scenarios.

Table 1 Comparison of YOLOX, OCC-YOLOX accuracy, recall, and AP values.
Fig. 10. Comparison of test results on the CityPersons dataset.

Fig. 11. Comparison of test effects on the KITTI dataset.

Fig. 12. Comparison of test effects on the BDD100K dataset.

Fig. 13. Comparison of Car and Pedestrian test results.

Fig. 14. Comparison of Car and Cyclists test results.

Fig. 15. Comparison of Car test results at night and in snowy weather.

Ablation experiments

The average accuracy and detection speed of each improvement, evaluated on the validation set, are compared in Table 2. Replacing the standard convolution with adaptive deformable convolution increased accuracy by 0.77% while still meeting real-time requirements; the improvement stems from the ability of adaptive deformable convolution to adjust the convolution kernel according to the shape and position of the target, effectively extracting target features in occlusion scenarios. Introducing Fast SPP increased the detection speed by 10.42 FPS without any significant loss of accuracy; this comes from simplifying the spatial pyramid pooling structure, which reduces the computational burden while retaining multi-scale feature fusion, so the model can efficiently handle targets of various sizes. Adding the CA module increased accuracy by 0.86%; the module strengthens the model's attention to the positional information of occluded targets, improving detection accuracy in complex occlusion scenarios. Replacing the IoU loss with the OL-IoU loss increased accuracy by 0.18%; the OL-IoU loss evaluates the overlap between the predicted and ground-truth boxes more accurately, especially under partial occlusion, which leads to more precise bounding-box regression. Overall, the refinements in the OCC-YOLOX algorithm preserve real-time responsiveness while markedly improving accuracy in detecting occluded objects; these enhancements are corroborated experimentally and have a sound theoretical basis.

Table 2 Ablation study of OCC-YOLOX modules.

Comparison of models

Table 3 shows the comparison between the method proposed in this paper and the current mainstream single-stage object detection algorithms in terms of average accuracy, detection speed, model size, and model complexity. The algorithm’s metrics of mAP, Detection speed, and Model weight size were obtained after training on the occluded object detection dataset. Through the comparison, it can be concluded that the accuracy of the OCC-YOLOX algorithm has achieved competitive results with YOLOv7, while also surpassing other mainstream algorithms. At the same time, the advantage of our work lies in achieving a balance between detection precision and speed. Although the accuracy of OCC-YOLOX is 2.39% lower than that of YOLOv7, its detection frame rate is 93.7% higher than YOLOv7, and the model complexity and size are less than a quarter of YOLOv7, with the added benefit of being lightweight compared to other mainstream algorithms. Therefore, the OCC-YOLOX algorithm has more advantages in applications with an onboard perspective.

Table 3 Comparison of the effectiveness of single-stage object detection algorithms.

Conclusions

In this paper, aiming at occluded targets in vehicle-view object detection and taking the single-stage algorithm YOLOX as the basis, we integrate improvements to the backbone network, the neck network, and the loss function and propose the occluded object detection algorithm OCC-YOLOX. A fusion of three public datasets and self-collected data is used to validate the algorithm, demonstrating its effectiveness. The conclusions are as follows:

(1) The OL-IoU is proposed, with overlapping width-height accelerated-regression consistency penalties designed to make the prediction box regress toward the real box at equal speed in both the width and height directions, which improves the real-time performance of the algorithm.

(2) Adaptive deformable convolution replaces the standard convolution in the backbone and changes the spatial distribution of the sampling points, allowing the network to adapt more flexibly to changes in target geometry and effectively improving detection accuracy in occlusion scenes. FAST SPP simplifies the number, size, and connection of the spatial pyramid pooling kernels, which effectively improves detection efficiency. Coordinate attention is added to the enhanced feature extraction part; by using two-dimensional positional information it improves the convolutional network's ability to judge the importance of feature information and, in turn, the algorithm's feature extraction capability.

Additionally, for heavily occluded objects, the algorithm can only detect their approximate location and shape. It is unable to recover the complete object information, and the algorithm currently does not address the understanding of occlusion relationships. Therefore, it cannot yet be applied to scenarios requiring scene understanding. In future work, we plan to collect more data from extreme occlusion scenarios. We will enhance the algorithm’s robustness in these scenarios through techniques such as data augmentation and model refinement. Additionally, we will explore the modeling of occlusion relationships to achieve a better understanding of the occlusion relationships between occluded objects.