Introduction

The vigorous development of the e-commerce industry has brought significant challenges to sorting work in the logistics industry. Traditional manual sorting suffers from low efficiency and rising labor costs, making it difficult to meet the growing demand for parcel sorting1. Consequently, logistics enterprises now tend to adopt semi-automated and fully automated express package sorting schemes, and related research is emerging in fields such as mechanical structure, control systems, and recognition algorithms2. Among these, identifying parcel box information is the primary task of sorting in the logistics industry and one of the prerequisites for the efficient operation of a sorting system. Package information identification mainly includes determining the three-dimensional size of the package, determining its three-dimensional position and posture, and recognizing its barcode. As basic attributes of the package, this information supports robotic grasping of the package, selection of the transportation mode, assignment of the storage area, and correction of the transport position. Therefore, vision-based package information recognition has important research value for the logistics industry.

Manual measurement is vulnerable to human factors, resulting in unstable measurement accuracy and efficiency. Laser measurement methods, in turn, place requirements on the working environment (such as airborne dust and conveyor belt speed) and are relatively expensive, which limits their adoption. This paper performs three-dimensional positioning of the parcel box at loading time, aiming to solve the high-speed identification of the parcel box's three-dimensional size in an intelligent logistics sorting system.

The main contributions of this paper are as follows:

  1. This paper proposes a visual detection algorithm for the three-dimensional positioning of parcel boxes in high-speed sorting systems, which restores the three-dimensional information of a parcel box from its two-dimensional images. Based on this algorithm, a lightweight parcel box detection model, the Efficient Object Detection Network (EODNet), is designed. The model achieves fast and accurate parcel box identification with few parameters and low computational cost.

  2. To reduce the number of parameters and the computational complexity of the model while improving its multi-scale feature extraction capability, this paper designs an efficient feature extraction network, ES-ViT, as the backbone of the model.

  3. We design a new neck structure that selectively fuses features from different levels and applies adaptive convolution to different inputs. This structure comprises a fusion module for features of different scales, the Channel Weight Screening (CWS) module, and a feature fusion module with efficient convolution operations, C2f-GCC.

  4. To verify the universality of the proposed model and the effectiveness of its improvement strategies, we tested the model on several datasets; compared with other object detection models, ours achieves the best detection performance on the parcel box dataset.

Related work

Over the past few years, deep learning based detection models have greatly promoted the application of vision technology in real-world logistics scenarios.

For parcel inspection tasks, Zhang et al.3 designed a lightweight deep learning network, EPYOLO, to detect small targets and overlapping express packages. Zhang et al.4 designed a lightweight SFN-YOLOv8 model based on YOLOv8 for defect detection on express packages. Qi et al.5 constructed a package recognition system based on ResNet and DenseNet that is highly robust and meets the requirements of autonomous package retrieval by robotic arms. Zhu et al.6 designed a YOLO-based prohibited item packaging detection system that efficiently identifies prohibited items and handles overlapping and occluded items well. Chen et al.7 proposed a hybrid model called Parcels-DNet that efficiently captures the characteristics of parcel deformation and penetration damage and can be deployed in resource-constrained scenarios such as mobile or embedded devices.

In addition to parcel inspection, researchers have also focused on parcel tracking. Wu et al.8 used an improved network model based on Faster R-CNN to identify and track express packages. Liu et al.9 proposed a multi-target tracking algorithm based on improved YOLOv5 and DeepSORT, which tracks express packages accurately and in real time. Ren et al.10 used the Mask R-CNN model to segment packages in two-dimensional instances to determine their contours, then filtered and fitted the three-dimensional point cloud data of the package surface to obtain the package's three-dimensional data; finally, a robot arm grasped and classified the packages. Kim et al.11 used the YOLOv5 model to achieve rapid detection and location estimation of parcel boxes. Li et al.12 used the multi-source vision sensor Kinect to collect image information and combined the minimum bounding rectangle algorithm with background subtraction to achieve non-contact, uninterrupted parcel box volume measurement with a relative error of less than 5%. Arpenti et al.13 used a deep convolutional model to semantically segment objects on a shelf to obtain their approximate positions, and computed the position of goods by combining the depth information collected by an RGB-D camera. Ladplee et al.14 designed a rectangular parcel box volume measurement system based on a single LiDAR depth camera, which measures the three-dimensional size of a parcel box with an average measurement time of 1.01 s. Dai et al.15 proposed a package positioning algorithm that combines traditional vision with artificial intelligence vision to detect packages and correct their positions. Lu et al.16 combined traditional visual processing methods with an enhanced lightweight YOLOv5 object detection algorithm to achieve high-speed, high-precision package positioning.

From the above work, we can draw two conclusions: 1. deep learning based visual inspection methods significantly improve the performance of logistics sorting systems; 2. size measurement techniques for rectangular packaging boxes remain scarce, and existing measurement solutions suffer from high hardware costs, low detection accuracy, large numbers of model parameters, high computational complexity, and low recognition speed. Based on this, this paper proposes a three-dimensional size positioning algorithm for parcel boxes built on a lightweight detection model, which achieves a good balance between recognition accuracy, recognition speed, and equipment cost.

Methods

As shown in Fig. 1, when a worker places the parcel box on the sorting table against the side of the right-angle frame, three industrial cameras simultaneously record the position and attitude of the parcel box at that moment. Camera 1 captures the top view, camera 2 the front view, and camera 3 the rear view of the parcel box. In all three views, the deviation angle between the parcel box and the camera is 0°, i.e., the parcel box has no positional deviation in the image. The three-dimensional size of the parcel box can then be restored from the data collected in these three directions.

Fig. 1 Algorithm framework for three-dimensional size recognition of parcel boxes.

Before the logistics conveyor belt starts operating, a square object is used as a standard part to calibrate the proportion between the size of an object in the image and its actual size. Suppose the rear view of the standard part captured by camera 3 is the input data. Since the standard part is close to the rear wall (the reference object) and there is no deviation angle between the standard part and the camera direction, the ratio \(r\) between the pixel size of the standard part in the image and its real size can be obtained directly. Assuming the parcel box is placed in a position similar to that of the standard part, the real length \(L_{R}\), width \(W_{R}\), and height \(H_{R}\) of the parcel box are calculated as follows:

$$L_{R} = L_{P} \times r,W_{R} = \frac{{L_{R} }}{{R_{LW} }},H_{R} = \frac{{L_{R} }}{{R_{LH} }}$$
(1)
$$r = \frac{{l_{R} }}{{l_{P} }},R_{LW} = \frac{{L_{P} }}{{W_{P} }},R_{LH} = \frac{{L_{P} }}{{H_{P} }}$$
(2)

where \(r\) is the ratio of the real standard part length \(l_{R}\) to its pixel length \(l_{P}\) in the image, \(R_{LW}\) is the length–width ratio of the parcel box in the image, and \(R_{LH}\) is the length–height ratio of the parcel box in the image.
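To make the calculation concrete, here is a minimal sketch of the size-recovery step in formulas (1) and (2); the function name and the numeric values are illustrative, not from the paper.

```python
def recover_box_size(l_pixel_std, l_real_std, L_p, W_p, H_p):
    """Recover real parcel dimensions from pixel measurements.

    l_pixel_std, l_real_std: pixel and real lengths of the square
    standard part, giving the scale r of formula (2).
    L_p, W_p, H_p: pixel length, width, and height of the parcel box
    measured in the camera views.
    """
    r = l_real_std / l_pixel_std      # formula (2): r = l_R / l_P
    R_lw = L_p / W_p                  # length-width ratio in the image
    R_lh = L_p / H_p                  # length-height ratio in the image
    L_r = L_p * r                     # formula (1)
    return L_r, L_r / R_lw, L_r / R_lh

# Example: a 100 mm standard part spans 200 px, and the parcel box
# measures 400 x 300 x 240 px in the views.
print(recover_box_size(200, 100, 400, 300, 240))  # (200.0, 150.0, 120.0)
```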

After the original parcel box image is input into the Efficient and Spatial Vision Transformer (ES-ViT), the parcel box feature map is obtained. In the neck, the Channel Weight Screening (CWS) module fuses features from different levels, and the recognition results are finally output through the detection head of the model.

Model structure

Backbone

Compared with the purely CNN structure of mainstream models (such as Faster R-CNN, SSD, and YOLO), the global attention mechanism of the Vision Transformer can effectively capture the global information of an image and avoid losing target features as the number of network layers increases. In this paper, ES-ViT (shown in Fig. 2) is designed as the backbone of the parcel box 3D positioning detection model to meet the practical needs of intelligent logistics in terms of parameter count, computation, and accuracy. In the middle and high levels of the network, different combinations of MBConv, EfficientViT17, and the Spatial Pyramid Pooling Fast (SPPF) module are used to output four feature maps of different scales.

Fig. 2 ES-ViT structure.

To improve the efficiency of parcel box feature extraction, the EfficientViT modules in the middle and higher layers of the backbone introduce the nonlinear function ReLU into the self-attention mechanism. Specifically, given the input \(x \in R^{N \times f}\), the generalized form of the attention mechanism \(Attention\left( {Q,K,V} \right)\) can be written as:

$$Attention\left( {Q,K,V} \right)_{i} = \sum\limits_{j = 1}^{N} {\frac{{Sim\left( {Q_{i} ,K_{j} } \right)}}{{\sum\nolimits_{j = 1}^{N} {Sim\left( {Q_{i} ,K_{j} } \right)} }}V_{j} }$$
(3)

where \(Q = xW_{Q}\), \(K = xW_{K}\), \(V = xW_{V}\). The attention function maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. \(W_{Q} ,W_{K} ,W_{V} \in R^{f \times d} \left( {N \gg d} \right)\) are all learnable linear projection matrices; after initialization, they are updated continuously during network training. \(Attention\left( {Q,K,V} \right)_{i}\) is row \(i\) of the matrix \(Attention\left( {Q,K,V} \right)\), and \(Sim\left( {Q,K} \right)\) is the similarity function.

In the EfficientViT module, a ReLU-based linear attention mechanism is selected as the similarity function, as shown in formula (4). Using the associativity of matrix multiplication, formula (4) can be rewritten step by step, as shown in formula (5). During computation, \((\sum\nolimits_{j = 1}^{N} {ReLU(K_{j} )^{T} V_{j} } ) \in R^{d \times d}\) and \((\sum\nolimits_{j = 1}^{N} {ReLU(K_{j} )^{T} } ) \in R^{d \times 1}\) need to be computed only once and can be reused for every query. Compared with softmax-based attention, the computational complexity of ReLU-based linear attention is therefore reduced from \(O(N^{2} )\) to \(O(N)\) (the softmax attention map lies in \(R^{N \times N}\), whereas \(ReLU(Q) \in R^{N \times d}\)); moreover, the ReLU function consumes fewer computing resources than softmax and is friendlier to the hardware of a real intelligent logistics center.

$$Sim\left( {Q,K} \right) = ReLU\left( Q \right)ReLU\left( K \right)^{T}$$
(4)
$$\begin{aligned} Attention\left( {Q,K,V} \right)_{i} & = \sum\limits_{j = 1}^{N} {\frac{{ReLU\left( {Q_{i} } \right)ReLU\left( {K_{j} } \right)^{T} }}{{\sum\nolimits_{j = 1}^{N} {\left[ {ReLU\left( {Q_{i} } \right)ReLU\left( {K_{j} } \right)^{T} } \right]} }}} V_{j} \\ & = \frac{{\sum\nolimits_{j = 1}^{N} {\left[ {ReLU\left( {Q_{i} } \right)ReLU\left( {K_{j} } \right)^{T} } \right]V_{j} } }}{{ReLU\left( {Q_{i} } \right)\left[ {\sum\nolimits_{j = 1}^{N} {ReLU\left( {K_{j} } \right)}^{T} } \right]}} \\ & = \frac{{\sum\nolimits_{j = 1}^{N} {ReLU\left( {Q_{i} } \right)\left[ {ReLU\left( {K_{j} } \right)^{T} V_{j} } \right]} }}{{ReLU\left( {Q_{i} } \right)\left[ {\sum\nolimits_{j = 1}^{N} {ReLU\left( {K_{j} } \right)}^{T} } \right]}} \\ & = \frac{{ReLU\left( {Q_{i} } \right)\left[ {\sum\nolimits_{j = 1}^{N} {ReLU\left( {K_{j} } \right)}^{T} V_{j} } \right]}}{{ReLU\left( {Q_{i} } \right)\left[ {\sum\nolimits_{j = 1}^{N} {ReLU\left( {K_{j} } \right)}^{T} } \right]}} \\ \end{aligned}$$
(5)
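As a minimal sketch of formula (5), the single-head PyTorch function below computes the two shared sums once and reuses them for every query. The tensor shapes and the epsilon added for numerical stability are our assumptions, not details of the EfficientViT implementation.

```python
import torch

def relu_linear_attention(q, k, v, eps=1e-6):
    """ReLU linear attention of formula (5); q, k, v have shape (N, d).

    Cost is linear in the sequence length N because the key-value
    summary is computed once, versus the O(N^2) softmax attention map.
    """
    q, k = torch.relu(q), torch.relu(k)
    kv = k.T @ v                    # (d, d): sum_j ReLU(K_j)^T V_j, computed once
    k_sum = k.sum(dim=0)            # (d,):   sum_j ReLU(K_j)^T, computed once
    numer = q @ kv                  # (N, d)
    denom = (q @ k_sum).unsqueeze(-1) + eps   # (N, 1)
    return numer / denom

out = relu_linear_attention(torch.rand(1024, 32), torch.rand(1024, 32),
                            torch.randn(1024, 32))
print(out.shape)  # torch.Size([1024, 32])
```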

Depthwise convolution is used after each FFN layer in the EfficientViT module to enhance the ReLU-based linear attention and improve the model's attention to local spatial information. Group convolution is used within the module to reduce the overall computational overhead of information aggregation, further improving the efficiency of parcel box feature extraction.

To improve the feature extraction capability of the model, an SPPF module is added to the end of the backbone, further improving the efficiency of extracting package feature information in complex industrial environments. Because SPPF contains multi-scale spatial pooling bins, the model can process spatial feature information of different sizes, in contrast to single-scale sliding-window pooling; this avoids the imbalanced spatial feature information caused by cropping and warping the package images (as shown in Fig. 3) and improves both training quality and package recognition accuracy.

Fig. 3 (a) The image has been cropped. (b) The image has been warped.
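For reference, the following is a minimal SPPF sketch in the common YOLOv5-style form, where three chained 5 × 5 max-pools reproduce the 5/9/13 multi-scale pooling of SPP at lower cost; the channel sizes are illustrative, not those of ES-ViT.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Standard SPPF: chained 5x5 max-pools emulate parallel 5/9/13 pooling."""

    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_mid, 1)
        self.cv2 = nn.Conv2d(c_mid * 4, c_out, 1)
        self.pool = nn.MaxPool2d(5, stride=1, padding=2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)      # 5x5 receptive field
        y2 = self.pool(y1)     # equivalent to 9x9
        y3 = self.pool(y2)     # equivalent to 13x13
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

print(SPPF(256, 256)(torch.randn(1, 256, 20, 20)).shape)  # (1, 256, 20, 20)
```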

Neck

To improve the model's attention to the input data, we use a channel attention mechanism (as shown in Fig. 4) at the connection between the backbone and the neck, so that the network can more accurately handle the relationships among parcel box features and dynamically adjust its attention distribution. The channel attention mechanism first applies average pooling and max pooling to the parcel box feature map \(F_{in} \in R^{H \times W \times C}\) (where H is the height of the feature map, W is the width, and C is the number of channels) to reduce its dimensionality, eliminate redundant data, and reduce parameters and computation. The two pooling results are then combined directly, and the sigmoid activation function computes the weight of each channel in the feature map, i.e., \(F_{W} \in R^{1 \times 1 \times C}\); finally, a convolution adjusts the number of output channels.

Fig. 4 Channel Weight Screening Module (CWS).

The model needs to process information from both high- and low-level layers to classify and locate parcel boxes well. The FPN structure of current mainstream models adjusts feature maps by upsampling; this relies on human prior knowledge and a fixed interpolation method, making it difficult to adapt to the task at hand. In the neck of the proposed network, we therefore opt for transposed convolution instead of traditional upsampling. During training of the parcel box detection model, the parameters of the transposed convolution kernels are updated according to the gradient of the loss function, which minimizes the error between the predicted and true values for the parcel box and thereby adapts and optimizes the model's upsampling strategy. In addition, we add a channel attention mechanism to the feature fusion part of the neck so that the model can dynamically adjust the weights of the feature maps and improve parcel box detection performance.

The feature fusion process across scales is shown in formulas (6) and (7). Given a high-level feature input \(F_{h} \in R^{H \times W \times C}\) and a low-level feature input \(F_{l} \in R^{{H_{1} \times W_{1} \times C}}\), a transposed convolution T-Conv (kernel size = 3, stride = 2, padding = 1) expands the high-level feature \(F_{h}\) into \(F^{\prime}_{h} \in R^{{H_{1} \times W_{1} \times C}}\). The expanded high-level feature \(F^{\prime}_{h}\) is converted into attention weights by the CA module to filter the low-level feature. Finally, the filtered feature and the expanded high-level feature are fused to obtain \(F_{O} \in R^{{H_{1} \times W_{1} \times C}}\).

$$F^{\prime}_{h} = \text{T-Conv}\left( {F_{h} } \right)$$
(6)
$$F_{o} = F_{l} * CA\left( {F^{\prime}_{h} } \right) + F^{\prime}_{h}$$
(7)
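The following is a minimal PyTorch sketch of formulas (6) and (7) under our reading of the CWS description: channel weights come from added average- and max-pooled descriptors passed through a sigmoid and a 1 × 1 convolution, and output_padding=1 is assumed so the transposed convolution exactly doubles the spatial size.

```python
import torch
import torch.nn as nn

class CWSFusion(nn.Module):
    """Sketch of formulas (6)-(7): F_o = F_l * CA(T-Conv(F_h)) + T-Conv(F_h)."""

    def __init__(self, channels):
        super().__init__()
        # T-Conv: learnable upsampling (kernel 3, stride 2, padding 1);
        # output_padding=1 is our assumption so H and W exactly double.
        self.tconv = nn.ConvTranspose2d(channels, channels, 3, stride=2,
                                        padding=1, output_padding=1)
        self.proj = nn.Conv2d(channels, channels, 1)  # channel adjustment

    def forward(self, f_h, f_l):
        f_h = self.tconv(f_h)                                # formula (6)
        pooled = (f_h.mean(dim=(2, 3), keepdim=True)
                  + f_h.amax(dim=(2, 3), keepdim=True))      # avg + max pooling
        w = self.proj(torch.sigmoid(pooled))                 # channel weights
        return f_l * w + f_h                                 # formula (7)

m = CWSFusion(64)
print(m(torch.randn(1, 64, 20, 20), torch.randn(1, 64, 40, 40)).shape)
# torch.Size([1, 64, 40, 40])
```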

At the end of the feature fusion, we add a C2f module to improve the efficiency of parcel box feature fusion (as shown in Fig. 5). Two improvements are made to the C2f module to increase the efficiency of its convolution operations.

Fig. 5 C2f-GhostCondConv structure.

The output of a convolutional layer can be divided into multiple feature maps along the channel dimension, and many of these feature maps form groups of similar maps. In the Ghost Module18, one feature map of each similar group is designated the intrinsic feature map and the rest are ghost feature maps; the ghost feature maps can be generated from the intrinsic feature maps through cheap linear transformations. Given an input \(F_{in} \in R^{h \times w \times c}\), the layer's convolutional filters \(f_{C} \in R^{k \times k \times c \times n}\) are evenly divided into \(s\) groups, and one group of filters (\(f_{i} \in R^{k \times k \times c \times m}\), \(m = n/s\)) performs ordinary convolution to obtain \(m\) intrinsic feature maps. A group of \(m\) intrinsic feature maps is then taken each time for linear transformation; for computational uniformity and to reduce resource consumption, the \(m\) ghost feature maps of each group are computed with the same linear transformation. Finally, the \(m\) intrinsic feature maps and \((n - m)\) ghost feature maps are combined to obtain the final output \(F_{out} \in R^{{h_{1} \times w_{1} \times n}}\). Assuming all convolution kernels are of size \(k \times k\), the theoretical acceleration ratio \(r_{S}\) and compression ratio \(r_{C}\) of the Ghost Module convolution are calculated as follows:

$$\begin{aligned} r_{s} & = \frac{{n \times h_{1} \times w_{1} \times c \times k^{2} }}{{\frac{n}{s} \times h_{1} \times w_{1} \times c \times k^{2} + (s - 1) \times \frac{n}{s} \times h_{1} \times w_{1} \times k^{2} }} \\ & \approx \frac{s \times c}{{s + c - 1}} \approx s \\ \end{aligned}$$
(8)
$$r_{c} = \frac{{n \times c \times k^{2} }}{{\frac{n}{s} \times c \times k^{2} + (s - 1) \times \frac{n}{s} \times k^{2} }} \approx \frac{s \times c}{{s + c - 1}} \approx s$$
(9)

where \(s \ll c\).
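A minimal Ghost-style convolution consistent with the description above, assuming \(s = 2\) (half intrinsic, half ghost channels) and a 3 × 3 depthwise convolution as the cheap linear transformation; these choices follow the common Ghost Module configuration rather than settings stated in this paper.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost convolution with s = 2: n/2 intrinsic maps from an ordinary
    convolution, n/2 ghost maps from a cheap depthwise transform."""

    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        c_mid = c_out // 2                      # m = n / s with s = 2
        self.primary = nn.Conv2d(c_in, c_mid, k, padding=k // 2)
        # Cheap linear transformation on the intrinsic feature maps.
        self.cheap = nn.Conv2d(c_mid, c_mid, 3, padding=1, groups=c_mid)

    def forward(self, x):
        intrinsic = self.primary(x)
        ghost = self.cheap(intrinsic)
        return torch.cat([intrinsic, ghost], dim=1)   # m + (n - m) maps

g = GhostConv(64, 128)
print(g(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```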

We selected dynamic convolution to improve each convolution \(f \in R^{k \times k \times n}\) in the C2f module, enlarging the capacity of the convolution layer while keeping the computation as small as possible. The parameters of the convolution kernel are obtained by formula (10). In conventional convolution, one kernel with fixed parameters is applied to all input samples; in a CondConv19 layer, the kernel parameters are computed for each input sample before the convolution, with an effect equivalent to a linear combination of \(n\) ordinary convolutions [as shown in formula (12)], so that different convolutions are applied to different input samples. In formula (11), \(\alpha\) denotes the sample-dependent routing weights and \(R\) is the learned routing matrix. As \(n\) increases, the capacity of the CondConv layer grows with only a small increase in inference cost, and each aggregated CondConv kernel needs to be applied to the image only once.

$$Output\left( x \right) = \sigma \left( {\left( {\alpha_{1} \cdot W_{1} + ... + \alpha_{n} \cdot W_{n} } \right) \cdot x} \right)$$
(10)
$$\alpha \left( x \right) = Sigmoid\left( {AvgPool\left( x \right)R} \right)$$
(11)
$$\begin{aligned} Output\left( x \right) & = \sigma \left( {\left( {\alpha_{1} \cdot W_{1} + \cdots + \alpha_{n} \cdot W_{n} } \right) \cdot x} \right) \\ & = \sigma \left( {\alpha_{1} \cdot \left( {W_{1} \cdot x} \right) + \cdots + \alpha_{n} \cdot \left( {W_{n} \cdot x} \right)} \right) \\ \end{aligned}$$
(12)
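Below is a minimal CondConv-style sketch of formulas (10)–(12), assuming four expert kernels and the routing of formula (11) (global average pooling followed by the learned linear map R and a sigmoid). The combined-kernel route of formula (10) is used, so each sample's aggregated kernel is applied only once, via a grouped convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondConv2d(nn.Module):
    """CondConv sketch: each sample gets its own kernel, a routed
    mixture of n expert kernels."""

    def __init__(self, c_in, c_out, k=3, n_experts=4):
        super().__init__()
        self.experts = nn.Parameter(
            torch.randn(n_experts, c_out, c_in, k, k) * 0.02)
        self.route = nn.Linear(c_in, n_experts)   # routing matrix R
        self.c_out, self.k = c_out, k

    def forward(self, x):
        b, c, h, w = x.shape
        alpha = torch.sigmoid(self.route(x.mean(dim=(2, 3))))  # formula (11)
        # Formula (10): one aggregated kernel per sample.
        kernels = torch.einsum('bn,noihw->boihw', alpha, self.experts)
        kernels = kernels.reshape(b * self.c_out, c, self.k, self.k)
        # Grouped conv applies each sample's kernel to that sample only.
        out = F.conv2d(x.reshape(1, b * c, h, w), kernels,
                       padding=self.k // 2, groups=b)
        return out.reshape(b, self.c_out, h, w)

layer = CondConv2d(64, 64)
print(layer(torch.randn(2, 64, 40, 40)).shape)  # torch.Size([2, 64, 40, 40])
```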

These improvements to the neck's feature fusion and to the convolution operations of its modules improve the generalization ability and inference speed of the model.

Head

In real logistics scenarios, our parcel box detection model must offer low computational power consumption, real-time target detection, and easy deployment. Accordingly, the detection head of YOLOv8 is selected as the detection head of our model; its structure is shown in Fig. 6. The CIoU20 loss and Distribution Focal Loss (DFL)21 are calculated as shown in formulas (13)–(14) and (15)–(16).

$$L_{CIoU} = 1 - IoU + \frac{{\rho^{2} \left( {b,b^{gt} } \right)}}{{c^{2} }} + \alpha \upsilon ,$$
(13)
$$IoU = \frac{A \cap B}{{A \cup B}},\alpha = \frac{\upsilon }{{\left( {1 - IoU} \right) + \upsilon }},\upsilon = \frac{4}{{\pi^{2} }}\left( {\arctan \frac{{w^{gt} }}{{h^{gt} }} - \arctan \frac{w}{h}} \right)^{2}$$
(14)

where \(A\) is the area of the prediction box, \(B\) the area of the ground truth box, \(b\) the center of the prediction box, \(b^{gt}\) the center of the ground truth box, \(w\) and \(h\) the width and height of the prediction box, \(w^{gt}\) and \(h^{gt}\) the width and height of the ground truth box, \(c\) the diagonal length of the smallest rectangle that contains both the prediction box and the ground truth box, and \(\rho ( \cdot )\) the Euclidean distance.
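As a worked sketch of formulas (13) and (14), the function below computes the CIoU loss for axis-aligned boxes; the (x1, y1, x2, y2) box format and the epsilon terms are our assumptions.

```python
import math
import torch

def ciou_loss(pred, gt, eps=1e-7):
    """CIoU loss of formulas (13)-(14); boxes are (x1, y1, x2, y2) rows."""
    # IoU from intersection and union areas.
    xi1, yi1 = torch.max(pred[:, 0], gt[:, 0]), torch.max(pred[:, 1], gt[:, 1])
    xi2, yi2 = torch.min(pred[:, 2], gt[:, 2]), torch.min(pred[:, 3], gt[:, 3])
    inter = (xi2 - xi1).clamp(0) * (yi2 - yi1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + eps)

    # rho^2: squared center distance; c^2: squared enclosing-box diagonal.
    rho2 = ((pred[:, 0] + pred[:, 2] - gt[:, 0] - gt[:, 2]) ** 2
            + (pred[:, 1] + pred[:, 3] - gt[:, 1] - gt[:, 3]) ** 2) / 4
    cw = torch.max(pred[:, 2], gt[:, 2]) - torch.min(pred[:, 0], gt[:, 0])
    ch = torch.max(pred[:, 3], gt[:, 3]) - torch.min(pred[:, 1], gt[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and trade-off weight alpha.
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_g, h_g = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_g / (h_g + eps))
                              - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss(torch.tensor([[10., 10., 50., 40.]]),
                torch.tensor([[12., 12., 52., 44.]])))
```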

$$DFL\left( {S_{i} ,S_{i + 1} } \right) = - \left( {y_{i + 1} - y} \right)\log \left( {S_{i} } \right) - \left( {y - y_{i} } \right)\log \left( {S_{i + 1} } \right)$$
(15)
$$S_{i} = \frac{{y_{i + 1} - y}}{{y_{i + 1} - y_{i} }},S_{i + 1} = \frac{{y - y_{i} }}{{y_{i + 1} - y_{i} }}$$
(16)

where \(y_{i}\) and \(y_{i + 1}\) are the two labels closest to the continuous target \(y\), with \(y_{i} \le y \le y_{i + 1}\); \(S_{i}\) and \(S_{i + 1}\) are the predicted probabilities of labels \(y_{i}\) and \(y_{i + 1}\), respectively, and formula (16) gives the values at which the loss is minimized.
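A minimal numeric sketch of formulas (15) and (16): given a continuous target y between the two nearest labels, DFL is a cross-entropy against the two interpolation weights, and it is minimized when the predicted probabilities match formula (16). The values below are illustrative.

```python
import math

def dfl(s_i, s_ip1, y, y_i, y_ip1):
    """DFL of formula (15); s_i, s_ip1 are the predicted probabilities
    of the two labels y_i and y_ip1 bracketing the target y."""
    return -(y_ip1 - y) * math.log(s_i) - (y - y_i) * math.log(s_ip1)

# Target y = 4.3 between labels 4 and 5. Formula (16) gives the
# minimizing predictions s_i = 0.7, s_ip1 = 0.3.
print(dfl(0.7, 0.3, 4.3, 4, 5))  # ~0.611 (minimum)
print(dfl(0.5, 0.5, 4.3, 4, 5))  # ~0.693 (higher)
```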

Fig. 6 Detect head structure.

Experiments

Datasets and experiments platform

Parcel box dataset: To verify the effectiveness of the proposed parcel box 3D positioning detection algorithm, we used real-time images of parcel boxes collected by the image acquisition setup of 3.1 (as shown in Fig. 7) to construct a real and valid dataset. In total, 2916 real-time images of parcel boxes were collected as the experimental sample. These images were taken by the on-site industrial camera MV-CA003-20GC at a resolution of \(672 \times 512\). For model training and validation, we used the open-source labeling tool LabelImg to annotate the images and export them in three dataset formats (YOLO, COCO, and VOC), and randomly split the data into a training pool (training and validation sets) and a test set at a ratio of \(3:1\).

Fig. 7 Parcel box dataset.

Traffic Signs Detection dataset22: This dataset contains 4969 samples in 15 categories, including red light, green light, stop, and speed limit, at a resolution of \(416 \times 416\). It was used to verify the effectiveness of the model improvements, with 3530 images for training, 801 for validation, and 638 for testing.

African Wildlife dataset23: This dataset shows images of four types of animals commonly found in South African nature reserves, with resolutions ranging from \(128 \times 86\) to \(4288 \times 3000\). To verify the effectiveness of the model improvements, 1203 images were used for training, 150 for validation, and 151 for testing.

Pascal VOC dataset24: Created by the Computer Vision Group at the University of Oxford, this dataset is widely used in computer vision tasks such as object detection, image segmentation, and image classification. It contains 20 common object categories, such as people, cars, planes, and dogs. In this paper, the dataset was used to verify the universality of the model in other visual tasks: 16,551 images from Pascal VOC 2007-trainval and Pascal VOC 2012-trainval were used for training, and 4952 images from Pascal VOC 2007-test served as the test set.

To ensure fairness, all experiments were conducted on the same server for both training and inference. Table 1 shows the configuration of the experimental environment.

Table 1 Model training environment configuration.

Model measurement criteria

Since the object detection model studied in this paper will be deployed in a real logistics center for real-time three-dimensional parcel box positioning, detection accuracy, model size, and inference speed are important evaluation criteria. In this experiment, Precision (P), Recall (R), F1, and mAP50 are selected as the accuracy metrics, calculated as follows:

$$P = \frac{TP}{{TP + FP}}, \, R = \frac{TP}{{TP + FN}}, \, F1 = \frac{2 \times P \times R}{{P + R}},$$
(17)
$$mAP = \frac{{\sum\nolimits_{i = 1}^{C} {AP_{i} } }}{C}$$
(18)
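For reference, a minimal sketch of formulas (17) and (18) from per-class detection counts; the counts and AP values are illustrative only.

```python
def precision_recall_f1(tp, fp, fn):
    """Formula (17): precision, recall, and F1 from TP/FP/FN counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

def mean_ap(ap_per_class):
    """Formula (18): mAP is the mean of the per-class average precision."""
    return sum(ap_per_class) / len(ap_per_class)

print(precision_recall_f1(tp=90, fp=10, fn=20))  # (0.9, 0.818..., 0.857...)
print(mean_ap([0.95, 0.88, 0.91]))               # 0.913...
```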

In the ablation experiments, following the definitions in TIDE25, localization error, duplicate detection error, background error, and false positive error are used to intuitively reflect the discrepancy between the prediction boxes and the ground truth boxes of the 3D positioning model. Among the prediction boxes whose category matches the ground truth box and whose IoU26 exceeds the set threshold t, the one with the highest overlap is counted as a True Positive and the rest as False Positives. The four errors are then defined as:

  • localization error \(E_{loc}\): the prediction box is classified correctly but localized incorrectly.

  • duplicate detection error \(E_{dupe}\): the prediction box has the correct category and an overlap greater than t, but a higher-scoring prediction box already matches the ground truth box.

  • background error \(E_{bkg}\): background is detected as foreground.

  • false positive error \(E_{FP}\): a localization error, background confusion, or duplicate detection.

Ablation experiments

In this paper, open-source datasets and the parcel box dataset were selected for ablation experiments, to verify that the proposed improvement strategies effectively enhance the model's detection performance.

For the open-source ablation experiments, the Traffic Signs Detection dataset and the African Wildlife dataset were selected to verify that the proposed improvement strategy yields the same gains on different data. The results are shown in Tables 2 and 3: detection performance improves when the ES-ViT and C2f-GCC modules are used, so the improvement strategy proposed in this paper is effective across different types of datasets.

Table 2 Experimental results of ablation of Traffic Signs detection dataset by the model.
Table 3 Experimental results of ablation of African Wildlife dataset by the model. (B): backbone group, (N): neck group; (\(\pm x\)) denotes the change in model detection performance within the corresponding group.

Two groups of ablation experiments, on the network structure improvements and on the loss function, were carried out on the parcel box dataset.

According to the results in Tables 4 and 5, detection performance improves when the ES-ViT and C2f-GCC modules are used, and the localization, duplicate detection, and background errors of the model are all reduced. The experimental results indicate that the proposed improvement strategy is effective.

Table 4 Experimental results of ablation of Parcel box Dataset by the model.
Table 5 Experimental results of ablation of Parcel box Dataset by the model under TIDE index.

We recorded the parcel box detection performance of models with different improvement strategies at different confidence levels, as shown in Fig. 8a–c, as well as the training process before and after the model improvement, as shown in Fig. 8d–f.

Fig. 8 Performance of the model in the parcel box dataset.

To verify that the loss function selected in this paper suits the parcel detection task of a real intelligent logistics center, we combined various loss functions and recorded the results, shown in Table 6. The experiments show that, on the parcel box dataset, the selected loss function delivers high-precision detection and meets the requirements of the parcel box detection task.

Table 6 The influence of different loss function combinations on the performance of model parcel box detection.

Considering that different loss functions may affect the quality of the model's prediction boxes, the TIDE index is also used to evaluate the error between the prediction boxes and the ground truth boxes under different loss functions (as shown in Table 7). The experiments show that when BCELoss and CIoU are used in the detection head of our model, the model produces high-quality predicted bounding boxes for parcel boxes, with low probabilities of duplicate detection and background confusion, and the highest overall detection accuracy.

Table 7 The influence of different loss function combinations on the performance of model parcel box detection under TIDE index.

Based on the experimental outcomes on the distinct datasets above, the proposed model achieves relatively high detection precision, recall, and mAP for the target, indicating strong classification capability. Furthermore, under the TIDE index, the quality of the model's predicted bounding boxes is superior, demonstrating excellent localization ability. In summary, the proposed model is applicable to the tasks of target classification and localization.

We recorded the detection performance of the model under different IoU thresholds; the results are presented in Table 8. The model exhibits relatively high overall detection accuracy across IoU thresholds, with accuracy fluctuating within approximately 1%, indicating stable detection performance.

Table 8 Detection performance of the model on the parcel box dataset under different IoU thresholds.

In addition, to verify that the proposed network structure balances detection performance and model lightweighting in parcel box detection tasks, two further groups of comparison experiments were conducted on the neck and the backbone of the model, respectively. To control experimental variables, each group replaces only the corresponding part of the network, substituting lightweight structures commonly used in the object detection field.

According to the results in Tables 9 and 10, the backbone and neck networks designed in this paper achieve a good balance between structural lightweighting and detection accuracy.

Table 9 Detection performance of models on parcel box dataset under different backbones.
Table 10 Detection performance of models on parcel box dataset under different Necks.

Model comparison experiment

To verify the balance between detection performance and model lightweighting of the designed parcel box detection model, commonly used object detection models were selected for comparison in the parcel box detection experiment. The results are shown in Table 11: our model has the smallest number of parameters and file size, and the highest detection accuracy. The inference results of some models are also shown in Fig. 9.

Table 11 Detection performance of different models on parcel box dataset.
Fig. 9 Inference results of different models on parcel box dataset.

Comparison experiment of parcel box size prediction

We selected 450 images from the parcel box test set, covering 150 parcels of different sizes, and chose common object detection models for the experiment. The results are shown in Table 12. Our model has the best overall performance in parcel box size prediction, with all four error metrics relatively low, demonstrating high-precision detection that meets the requirements of the parcel box detection task.

Table 12 Experimental results of comparison of prediction errors of different models for parcel boxes.

In addition, we compared the proposed method with existing methods of smaller model size on the PASCAL VOC 2007 test set. As shown in Table 13, our model has the best detection performance, with mAP50 and mAP50-95 reaching 78.4% and 44.6%, respectively, showing good universality.

Table 13 Detection performance of different models on Pascal VOC 2007-test.

Conclusions

In this paper, a deep learning based vision algorithm is proposed to recognize the three-dimensional size of rectangular parcel boxes on the sorting table of an intelligent logistics system. The algorithm achieves a good balance among recognition efficiency, recognition accuracy, algorithm universality, and deployment cost.

According to the requirements of the parcel box detection task, a deep learning based end-to-end object detection model for parcel box recognition is designed. This paper introduces innovations in the backbone and neck of the model's network structure and uses multiple datasets to verify, in turn, the effectiveness of the improvement strategy, the effectiveness of the localization algorithm, and the universality of the detection model. In the backbone, the EfficientViT and SPPF modules improve the model's focus on target parcel boxes with fewer parameters and lower computational cost, and efficiently process parcel box features at different scales. In the neck, we design the high-low level feature fusion structure CWS and use the C2f-GCC module to improve convolution efficiency while reducing parameters and computation. The ablation results indicate that the proposed improvement strategy improves detection performance on datasets from different domains and reduces the error of the model's prediction boxes. In the comparisons with other object detection models on parcel box detection and size prediction error, our model performed best in parameter count, model size, inference speed, and detection accuracy. Although the comprehensive index of our model is the best in these experiments, there is still considerable room for improvement in GFLOPs and parameter count. In future work, we will therefore focus on the network structure, computational complexity, and inference speed of the model, to design a general model for parcel box detection in the logistics field.