Introduction

The rapid development of smart libraries has driven the digital and intelligent transformation of library services1,2,3. Leveraging information technology and big data analytics, smart libraries can gain deeper insights into user needs, optimize resource allocation, and thereby improve operational efficiency and service quality4. In this process, data analysis is increasingly becoming a key tool for supporting management decisions. As one of the core metrics, visitor flow monitoring not only directly reflects resource usage and space utilization but also provides a scientific basis for service optimization and the formulation of management strategies5. Therefore, achieving accurate visitor flow detection is a crucial step towards intelligent management in smart libraries6.

Currently, most libraries rely on access control systems and video detection to collect crowd flow data7. Access control systems based on card swiping or facial recognition can provide accurate entry and exit records, but they are prone to congestion during peak hours and suffer from delayed data export, making it difficult to meet real-time monitoring needs8. In contrast, video analysis technology has gradually become an effective complementary solution: by detecting and tracking pedestrians in surveillance footage, it provides more dynamic, real-time pedestrian flow data9,10. Video analysis methods rely on deep learning techniques, such as convolutional neural network (CNN) and Transformer object detection models, and have made significant progress in recent years11. However, owing to the spatial layout of libraries, existing monitoring equipment can often obtain video data from only a single perspective, and detection performance remains heavily affected in densely populated and heavily occluded environments12,13. Specifically, targets in the side view exhibit varying degrees of deformation and occlusion, so the accuracy of traditional pedestrian detection methods drops significantly in this scene, while the interactive motion of dense crowds further increases the difficulty of target tracking and counting. In addition, although mainstream object detection models such as Faster R-CNN and YOLO perform well in object recognition, they struggle to balance high accuracy with real-time performance, and lightweight improvement schemes, while faster, often fail to maintain sufficient detection accuracy in high-density scenes. Therefore, achieving accurate and real-time monitoring of pedestrian flow in the high-density, view-limited dynamic environment of a smart library remains an important challenge in this field.

To address this challenge, this paper proposes a visitor flow monitoring method based on an improved YOLOv8n model. The main contributions of this work can be summarized as follows:

  1. A dataset for high-density and occluded scenarios from a side view in a library setting is constructed, and a novel data collection approach is provided. The proposed method has been validated on this dataset, with the validation results demonstrating its superior performance.

  2. A lightweight Dense-Stream YOLOv8n model for high-density visitor flow monitoring is introduced. Experimental results show that this model outperforms existing mainstream models in this task, exhibiting higher real-time performance and accuracy.

  3. A highly adaptable benchmark area counting algorithm is designed for complex layouts in specific library settings, which effectively addresses occlusion issues under side views. Additionally, this algorithm demonstrates good scalability, making it applicable to other similar scenarios.

  4. New methods and technical support for big data monitoring and analysis in libraries are provided, contributing to enhanced data-driven management and service levels.

Related research

With the rapid development of smart libraries, achieving accurate and real-time visitor flow monitoring in high-density and perspective-limited dynamic environments has gradually become a key research direction14,15. However, most of the currently applied methods often struggle to balance accuracy and real-time performance when dealing with the high-density and perspective-limited dynamic environments of libraries16. To address these shortcomings, the application of deep learning techniques and lightweight models has increasingly become a central focus of research in recent years17.

Traditional counting methods

Table 1 Characteristics of traditional counting methods.

Traditional methods such as manual counting, sensor counting, and access control systems are still used for crowd monitoring in libraries, but these methods have significant limitations in practical applications. Table 1 summarizes the characteristics and drawbacks of these methods to provide a clearer illustration of their limitations in the context of smart libraries.

Early visitor flow monitoring primarily relied on manual counting, which, although simple and easy to implement, was inefficient and prone to significant errors, failing to meet the requirements for precision and efficiency in smart libraries18. Subsequently, sensor-based counting was gradually introduced, improving the level of automation to some extent. However, the accuracy of these systems could be easily affected by variations in individuals’ body sizes and movement speeds.

In recent years, access control systems based on card swiping and facial recognition have been widely adopted, enabling more accurate counting and supporting management19. Nevertheless, these systems tend to cause congestion during peak entry and exit times and typically rely on offline data export, resulting in poor real-time performance. Additionally, their monitoring accuracy can be influenced by the installation location and configuration, especially in high-density scenarios where it may be difficult to comprehensively capture all entry and exit data20.

Video analysis technology, through the real-time processing of camera footage, achieves contactless and continuous monitoring, demonstrating greater adaptability to various environments21. However, due to layout constraints in some libraries, cameras often can only capture images from a single side perspective, leading to occlusion issues in high-density and dynamic crowds, which in turn affects the accuracy of visitor flow monitoring. Particularly in high-density dynamic environments with limited perspectives, traditional video detection algorithms still struggle to achieve the required levels of recognition accuracy and processing efficiency22.

Crowded scene detection method under deep learning

In recent years, with the rise of Convolutional Neural Networks (CNNs), deep learning has made breakthrough progress in object detection for crowded scenes. Early object detection methods were mainly based on single-stage and two-stage detectors. Among them, the YOLO series proposed by Redmon et al. achieved real-time detection by casting the object detection problem as a single regression problem23, while Ren et al. proposed the Faster R-CNN framework and introduced the Region Proposal Network (RPN), which significantly improved detection accuracy and efficiency24. However, the accuracy of these detectors in complex scenarios still has certain limitations25,26,27,28,29.

To improve detection performance, researchers have proposed a series of improvements: CenterNet, proposed by Zhou et al., improves the detection accuracy of small targets by locating the center point of the target, avoiding the instability of bounding box regression in traditional methods and demonstrating stronger robustness30; Luo et al. introduced an attention mechanism that lets the model focus on the features of key regions, improving pedestrian detection in complex scenes31; Zhao et al. proposed the MS2ship dataset for object detection in drone imagery, providing high-quality training data for small object detection in maritime environments32. Although these methods achieve good detection results in their specific application scenarios, problems such as reduced detection accuracy and insufficient generalization remain when facing dense occlusion, dynamic changes, and multi-scale targets in complex environments.

With the increasing complexity of object detection tasks, researchers have proposed many advanced models that combine contextual information, multi-scale analysis, and other strategies, improving both the accuracy and speed of object detection. For example, Liang et al. proposed a method combining the Swin Transformer (SwinT) with Faster R-CNN, which uses a window attention mechanism to extract global contextual information and improve the robustness of pedestrian detection in highly occluded scenes33; Li et al. addressed occlusion in underwater fish detection and optimized YOLOv8 with an RT-DETR structure to improve detection performance in complex scenes34. Although these methods demonstrate strong detection capabilities in crowded scenarios, they often rely on substantial computing resources and still require further optimization for deployment on edge devices. In addition, the combination of 3D LiDAR and Internet of Things (IoT) technology used by Guefrachi et al. can provide richer environmental information, but the hardware cost is high and its applicability in indoor environments remains limited35; Meuser et al. focused on Edge AI and explored the potential of model training and inference at the edge, providing a new research direction for combining object detection with edge computing, although challenges such as model architecture constraints, limited computing resources, and data privacy remain36.

Therefore, in intelligent environments such as smart libraries, smart agriculture, and transportation hubs, achieving high-density, side-view, real-time pedestrian flow monitoring still faces many challenges, especially when computing resources are limited37. How to improve model lightweighting and real-time performance while preserving detection accuracy remains an urgent problem. To address this issue, this paper proposes an improved YOLOv8n model that integrates the DensityNet module and applies model pruning and knowledge distillation techniques.

Model lightweighting

As deep learning models are applied across various scenarios, model lightweighting has become a critical direction for improving real-time detection efficiency. The design of lightweight convolutional modules enables deep learning models to perform more efficient crowd monitoring on resource-constrained edge devices38. Pruning techniques, as a common optimization approach, reduce model complexity by removing redundant parameters, thereby significantly enhancing computational efficiency39. Studies on applying pruning techniques in YOLO models have shown that pruning can effectively reduce computational overhead, making the model more suitable for deployment on embedded devices in high-density scenarios. Furthermore, combining knowledge distillation strategies, where a teacher model guides the training of a student model, can further improve the detection accuracy and robustness of lightweight models in high-density scenarios, offering superior performance in practical deployments40.

Data collection and preprocessing

Data collection

This study continuously collected crowd flow video data over multiple days and different time periods on the second floor of Nanchang University Library using a side-angle top-down shooting method. Based on the specific layout of the library’s entrances and exits, the camera equipment was installed at positions and angles that could clearly capture the dynamics of people entering and exiting (see Fig. 1), ensuring coverage of scene changes with varying crowd densities at different times. This collection process recorded human activity data in various dynamic environments, providing a rich and representative data foundation for subsequent model training.

Fig. 1
figure 1

Scene diagram of data collection.

Data preprocessing and partitioning

To further support model training, the collected video data underwent several preprocessing steps. First, video frame extraction was applied to the captured library crowd flow videos, extracting one frame every five seconds and saving it in “.jpg” format, converting the dynamic video into a sequence of static images. This allows the model to train on static images for object detection, more accurately capturing the dynamic details of individuals. Next, the extracted image set was filtered to remove frames with no people or redundant information, retaining only effective images with varying crowd densities to ensure the diversity and representativeness of the dataset. Finally, the selected images were manually annotated using the LabelImg tool, marking the positions and categories of individuals in each image, providing high-quality labeled data for model training. The high-quality dataset generated through this preprocessing workflow lays a solid foundation for subsequent model training and performance optimization.
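For illustration, the following is a minimal sketch of the frame-extraction step described above, assuming OpenCV is used; the function name, file paths, and the fallback frame rate are illustrative rather than the exact implementation employed in this study.

```python
import cv2
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, interval_s: float = 5.0) -> int:
    """Save one .jpg frame every `interval_s` seconds of the input video."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0      # fall back if FPS metadata is missing
    step = max(1, int(round(fps * interval_s)))  # number of frames between saved images
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(str(Path(out_dir) / f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```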

After annotation, the obtained image dataset was divided into training, testing, and validation sets in a ratio of 7:2:1, resulting in 3745 images for the training set, 1070 images for the testing set, and 535 images for the validation set. Each dataset covers multiple time periods and different crowd densities to ensure that the model learns from a sufficient number of samples during training, capturing features under various conditions. Additionally, during validation and testing, the model’s generalization ability can be comprehensively evaluated, further enhancing the overall performance of the model.

Constructing Dense-Stream YOLOv8n model

YOLOv8n model

YOLO (You Only Look Once) is an efficient deep learning object detection model that simplifies detection into a regression problem, enabling rapid identification and tracking of various objects. YOLO not only excels in terms of accuracy and speed but also offers multiple functional features and variants to accommodate different scenario requirements41,42,43. Although YOLO has achieved significant success in various applications, it still requires further improvements to meet the resource constraints and real-time performance demands in high-density, occluded environments such as those found in library settings. To address these challenges, this paper introduces the Dense-Stream YOLOv8n model, which enhances the YOLOv8n model through model pruning and knowledge distillation. These modifications aim to balance detection accuracy and real-time performance, enabling efficient monitoring of visitor flow at library entrances and exits. The architecture of the proposed Dense-Stream YOLOv8n model is illustrated in Fig. 2.

Fig. 2
figure 2

Dense-Stream YOLOv8n Library Pedestrian Flow Detection Structure Diagram.

Design DensityNet data augmentation module

To effectively address the occlusion issues caused by high-density crowds and side-angle data collection in the dynamic scenes of libraries, this study integrated a lightweight convolutional enhancement module, DensityNet, into the YOLOv8n model to improve its ability to extract features from pedestrian images. By comparing the original image (Fig. 3a) with the image processed by DensityNet (Fig. 3b), it is evident that after DensityNet processing, the model captures the contours and detailed features of pedestrians more accurately, making the key features of pedestrians clearer.

Fig. 3
figure 3

Comparison of DensityNet feature extraction.

The core structure of the Convolutional Enhancement Module, DensityNet, includes convolutional layers, batch normalization layers, ReLU activation functions, and skip connections. Firstly, DensityNet uses a 3\(\times\)3 convolutional layer to extract local features, enhancing the model’s sensitivity to detailed variations. Secondly, the batch normalization layer is employed to standardize the feature distribution, reducing training instability caused by differences in input data. The standardized output is then transformed through a ReLU activation function, which introduces non-linearity and improves the model’s responsiveness to key features. To further preserve the original image information, DensityNet employs skip connections to perform a weighted addition of the convolutional output and the input image features. The expression for the skip connection is:

$$\begin{aligned} y_{\text {out}} = 0.05 \cdot y_{\text {conv}} + x \end{aligned}$$
(1)

Here, \(y_{\text {conv}}\) represents the convolutional enhancement output feature, x is the input image feature, and the coefficient 0.05 is used to balance the proportion of convolutional features and the original image features. The final output \(y_{\text {out}}\), obtained after the skip connection, is converted into an image format, making the enhanced image features more recognizable and helping to improve the model’s real-time pedestrian detection performance in high-density scenarios.
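A minimal PyTorch sketch consistent with this description (a 3\(\times\)3 convolution, batch normalization, ReLU, and the 0.05-weighted skip connection of Eq. (1)) is given below; the channel configuration and class name are assumptions for illustration rather than the exact module integrated into YOLOv8n.

```python
import torch
import torch.nn as nn

class DensityNet(nn.Module):
    """Lightweight convolutional enhancement block: Conv3x3 -> BN -> ReLU,
    merged with the input through a 0.05-weighted skip connection (Eq. 1)."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y_conv = self.relu(self.bn(self.conv(x)))
        return 0.05 * y_conv + x  # y_out = 0.05 * y_conv + x
```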

Model pruning

To meet the real-time detection requirements in high-density, view-limited dynamic environments of smart libraries, this study optimized the YOLOv8n model through pruning to reduce computational complexity and enhance detection efficiency on device endpoints, meeting the real-time requirements during peak crowd flow periods. Pruning, as a method of compressing deep learning models, significantly reduces model computational costs and storage requirements by removing redundant or low-importance model parameters, enabling efficient inference on resource-constrained devices.

During the pruning process, first, the model undergoes sparsity regularization training to guide the weight distribution of the model towards sparsity, weakening the impact of non-important parameters, making subsequent pruning easier. Specifically, sparsity regularization is introduced to specific layers of the model (such as BatchNorm layers), adjusting the original loss function to:

$$\begin{aligned} \text {Loss}_{\text {total}} = \text {Loss}_{\text {original}} + \lambda \sum \nolimits _i \left| w_i \right| \end{aligned}$$
(2)

Here, \(\lambda\) is the sparsity coefficient that controls the strength of the sparsity regularization term on the weights, and \(\sum \nolimits _i \left| w_i \right|\) is the sum of the absolute values of the weights to be pruned. This process gradually reduces the absolute values of low-importance weights, achieving the reduction of redundant parameters.
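The following sketch illustrates how the sparsity term of Eq. (2) could be added to the training loss, assuming PyTorch and taking the BatchNorm scale factors as the penalized weights; the coefficient value and module traversal are illustrative.

```python
import torch
import torch.nn as nn

def sparsity_penalty(model: nn.Module, lam: float = 1e-4) -> torch.Tensor:
    """L1 penalty on BatchNorm scale factors: lam * sum_i |w_i| (Eq. 2)."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            penalty = penalty + m.weight.abs().sum()
    return lam * penalty

# During training: loss_total = loss_original + sparsity_penalty(model)
```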

During the weight screening phase, the model further performs global pruning based on the sparsity distribution of the weights. Specifically, the absolute values \(\left| w_i \right|\) of all model weights are sorted in ascending order. A pruning threshold \(\tau\) is then determined globally from the specified sparsity rate p: \(\tau\) is the value at the p-th percentile of the sorted absolute weights, so that the smallest \(p\%\) of weights fall below it. The formula is expressed as:

$$\begin{aligned} \tau = \text {Quantile}\left( \left\{ \left| w_i \right| \right\} , p \right) \end{aligned}$$
(3)

Here, \(\text {Quantile}\left( \left\{ \left| w_i \right| \right\} , p \right)\) represents the percentile of the absolute values of the weights sorted at the proportion p. Ultimately, weights below this threshold are set to zero, achieving the pruning of redundant parameters. The formula is expressed as:

$$\begin{aligned} \text {compress}(w_i) = {\left\{ \begin{array}{ll} 0, & \text {if } \left| w_i \right| < \tau \\ w_i, & \text {otherwise} \end{array}\right. } \end{aligned}$$
(4)

After the pruning strategy involving sparsity regularization and global screening, the model parameters are streamlined. Subsequent fine-tuning training will be conducted to recover the accuracy, ensuring that the compressed model maintains effective detection and real-time performance in the high-density, view-limited dynamic environment of smart libraries.
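A sketch of the global threshold selection and weight zeroing of Eqs. (3) and (4) is given below, assuming PyTorch and, for illustration, collecting the BatchNorm scale factors discussed above; the sparsity rate is a placeholder value.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def global_prune(model: nn.Module, sparsity: float = 0.5) -> float:
    """Zero out the smallest-magnitude BatchNorm scale factors globally.
    Returns the pruning threshold tau (Eq. 3)."""
    bn_weights = torch.cat([m.weight.abs().flatten()
                            for m in model.modules()
                            if isinstance(m, nn.BatchNorm2d)])
    tau = torch.quantile(bn_weights, sparsity)   # global percentile threshold (Eq. 3)
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            mask = (m.weight.abs() >= tau).float()  # keep weights at or above tau
            m.weight.mul_(mask)                     # compress(w_i) = 0 if |w_i| < tau (Eq. 4)
    return float(tau)
```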

Knowledge distillation

To enhance the performance of the pruned model, this study adopts the knowledge distillation method by constructing a teacher model YOLOv8l and a student model YOLOv8n, to achieve knowledge transfer. The teacher model YOLOv8l has the complete original model architecture and parameters, capable of providing rich features and accurate predictions. In contrast, the student model YOLOv8n is a lightweight model that has undergone data augmentation and pruning optimization. The knowledge distillation process mainly includes two parts of loss: logits distillation and feature distillation, to guide the learning of the student model, making it as close as possible to the performance of YOLOv8l in the reduced scale.

First, logits distillation is based on the similarity of the output probability distributions between YOLOv8l and the improved YOLOv8n, using the Kullback-Leibler divergence to measure the difference between the two44. The formula is as follows:

$$\begin{aligned} L_{\text {logical}} = D_{\text {KL}}(P_T \Vert P_S) \end{aligned}$$
(5)

Here, \(P_T\) and \(P_S\) represent the output probability distributions of the teacher model and the student model, respectively. By minimizing the logits distillation loss, the improved YOLOv8n can better mimic the prediction behavior of the teacher model.

Secondly, feature distillation focuses on aligning the intermediate feature layers of YOLOv8l and the improved YOLOv8n, calculating the difference between the two using mean squared error (MSE) to enable the student model to learn the rich feature representations from the teacher model. The feature distillation loss is defined as:

$$\begin{aligned} L_{\text {feature}} = \frac{1}{N} \sum _{i=1}^N \left( F_T^{(i)} - F_S^{(i)} \right) ^2 \end{aligned}$$
(6)

Here, \(F_T^{(i)}\) and \(F_S^{(i)}\) are the feature outputs of the teacher model and the student model at the i-th layer, respectively, and N is the number of aligned feature layers. Through feature distillation, the improved YOLOv8n can effectively acquire the deep-level information from YOLOv8l, enhancing its feature representation capability during inference.

The overall loss function of the distillation process is the weighted sum of the logits distillation loss and the feature distillation loss, which is given by:

$$\begin{aligned} L_{\text {distill}} = \alpha \cdot L_{\text {logical}} + \beta \cdot L_{\text {feature}} \end{aligned}$$
(7)

Here, \(\alpha\) and \(\beta\) are the weighting factors that balance the contributions of the logits and feature distillation losses.
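The combined loss of Eqs. (5)-(7) can be sketched as follows, assuming PyTorch; the softmax-based probability construction, the averaging over feature layers, and the default values of \(\alpha\) and \(\beta\) are illustrative simplifications rather than the exact distillation configuration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher_logits, student_logits,
                      teacher_feats, student_feats,
                      alpha: float = 0.5, beta: float = 0.5) -> torch.Tensor:
    """L_distill = alpha * KL(P_T || P_S) + beta * MSE(F_T, F_S)  (Eqs. 5-7)."""
    p_t = F.softmax(teacher_logits, dim=-1)
    log_p_s = F.log_softmax(student_logits, dim=-1)
    l_logits = F.kl_div(log_p_s, p_t, reduction="batchmean")  # D_KL(P_T || P_S)
    # Assumes teacher/student feature maps are paired and shape-matched per layer.
    l_feature = sum(F.mse_loss(f_s, f_t.detach())
                    for f_t, f_s in zip(teacher_feats, student_feats)) / len(teacher_feats)
    return alpha * l_logits + beta * l_feature
```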

Through this knowledge distillation process, the lightweight model YOLOv8n, after data augmentation and pruning optimization, can approach the accuracy and robustness of the teacher model YOLOv8l while maintaining a small scale and low computational complexity. This makes it effectively adaptable to the real-time detection needs in the high-density, view-limited dynamic environment of smart libraries.

Reference area counting algorithm

To meet the needs of personnel entry and exit judgment and precise counting in the high-density, view-limited dynamic environment of smart libraries, this study proposes a baseline region counting algorithm. By improving the traditional baseline detection method45 to baseline region detection, the counting accuracy in occluded and dense environments is enhanced, thereby improving the monitoring precision in dynamic scenes of personnel entry and exit, achieving more reliable crowd flow monitoring. The flowchart of this algorithm is shown in Fig. 4.

Fig. 4
figure 4

Flow chart of reference area counting algorithm.

The algorithm first expands the baseline to a baseline region to provide a longer detection trigger time, thus better adapting to high-density, view-limited dynamic environments. The position of the baseline region is set by the ratio parameters a and b. Specifically, these parameters represent the relative positions of the left and right boundaries of the baseline region within the video frame, with values ranging from 0 to 1. The specific values of the parameters need to be adjusted according to the actual scene layout and the installation position of the camera equipment to ensure that the baseline region effectively covers areas that people must pass through. The positioning of the baseline region can be expressed by the following formulas:

$$\begin{aligned} & \text {baseline}_{x_1} = W \times a \end{aligned}$$
(8)
$$\begin{aligned} & \text {baseline}_{x_2} = W \times b \end{aligned}$$
(9)
$$\begin{aligned} & \text {baseline}_{y_1} = 0 \end{aligned}$$
(10)
$$\begin{aligned} & \text {baseline}_{y_2} = H \end{aligned}$$
(11)

Here, W and H are the width and height of the video frame, respectively; a and b are the ratio parameters for the left and right boundaries of the baseline region.

During the object detection process, the algorithm uses the improved lightweight model YOLOv8n to obtain the center position of each detected object and records the position information in the track_history dictionary. By analyzing the trajectory data, the movement direction of each target can be determined, providing support for subsequent counting. When a target is detected entering from the left, i.e., the coordinates of the target in the previous frame satisfy the following condition relative to the boundary coordinates of the baseline region, it is marked as “in”:

$$\begin{aligned} \begin{aligned}&\text {if } \text {previous}\_\text {position}[0] < \text {baseline}_{x_1} \\&\text {and } \text {current}\_\text {position}[0] \ge \text {baseline}_{x_1} : \text {in}\_\text {count} += 1 \end{aligned} \end{aligned}$$
(12)

When a target is detected entering from the right, i.e., the coordinates of the target in the previous frame satisfy the following condition relative to the boundary coordinates of the baseline region, it is marked as “out”:

$$\begin{aligned} \begin{aligned}&\text {if } \text {previous}\_\text {position}[0] > \text {baseline}_{x_2} \\&\text {and current}\_\text {position}[0] \le \text {baseline}_{x_2} : \text {out}\_\text {count} += 1 \end{aligned} \end{aligned}$$
(13)

The pseudocode for this process is shown in Table 2:

Table 2 Reference area counting algorithm.
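For illustration, the following is a simplified sketch of the baseline region counting logic (Eqs. (8)-(13)), assuming the Ultralytics tracking API; the weight file, video path, and the values of a and b are hypothetical placeholders, and the pseudocode in Table 2 remains the authoritative description.

```python
from collections import defaultdict
from ultralytics import YOLO
import cv2

model = YOLO("dense_stream_yolov8n.pt")   # hypothetical path to the pruned/distilled weights
a, b = 0.45, 0.55                         # illustrative ratio parameters for the baseline region
track_history = defaultdict(list)
in_count = out_count = 0

cap = cv2.VideoCapture("entrance.mp4")    # hypothetical entrance video
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    H, W = frame.shape[:2]
    x1, x2 = W * a, W * b                 # baseline_x1, baseline_x2 (Eqs. 8-9)
    result = model.track(frame, persist=True, verbose=False)[0]
    if result.boxes.id is None:
        continue
    for box, tid in zip(result.boxes.xywh.cpu(), result.boxes.id.int().cpu().tolist()):
        cx = float(box[0])                # center x-coordinate of the detected person
        history = track_history[tid]
        if history:
            prev = history[-1]
            if prev < x1 <= cx:           # crossed left boundary moving right -> "in" (Eq. 12)
                in_count += 1
            elif prev > x2 >= cx:         # crossed right boundary moving left -> "out" (Eq. 13)
                out_count += 1
        history.append(cx)
cap.release()
print(f"in: {in_count}, out: {out_count}")
```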

Analysis of results and evaluation of the models

Experimental environment

The hardware environment consists of a 12th Gen Intel(R) Core(TM) i5-12400F processor @ 2.50 GHz and an NVIDIA GeForce RTX 4060 GPU, running Windows 11. The development environment is PyCharm Community Edition 2023.3.4, with Python 3.8.19 as the interpreter. The imported libraries include Ultralytics 8.0.135, OpenCV 4.6.0.66, and Numpy 1.24.4, among others. The main parameters set during training in this study are shown in Table 3:

Table 3 Main parameter settings for the experiment.

In the experiment, a learning rate annealing strategy was adopted, with an initial learning rate of lr0 = 0.001 that decays over training to a final learning rate of lr0 × lrf (lrf = 0.01), to prevent unstable convergence in the later stages of training. The early stopping strategy is set to patience = 30, meaning that if the fitness metric (0.9 × mAP@0.5:0.95 + 0.1 × mAP@0.5) does not improve for 30 consecutive epochs, training is terminated early. In addition, SGD, which performs more stably in high-density object detection tasks, was chosen as the optimizer. Through these optimization and hyperparameter settings, Dense-Stream YOLOv8n maintains excellent detection accuracy while remaining lightweight, ensuring the stability and generalization ability of the model in high-density occlusion scenes.
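A training call consistent with these settings is sketched below, assuming the Ultralytics API; the dataset configuration file, epoch count, and base weight file are illustrative placeholders, since the full parameter list is given in Table 3.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")        # base student model before DensityNet/pruning/distillation
model.train(
    data="library_flow.yaml",     # hypothetical dataset config (train/val/test splits)
    epochs=200,                   # illustrative; loss curves in Fig. 6b stabilize near 200 epochs
    optimizer="SGD",
    lr0=0.001,                    # initial learning rate
    lrf=0.01,                     # final learning-rate factor (final lr = lr0 * lrf)
    patience=30,                  # early stopping on the fitness metric
)
```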

Ablation experiment

To verify the effectiveness of the various optimization modules in improving the model’s detection performance, this experiment progressively introduces the DensityNet module, model compression (pruning), and knowledge distillation strategies on the basis of the YOLOv8n model, forming multiple ablation studies. The compared models include the original YOLOv8n, YOLOv8n-DensityNet, YOLOv8n-DensityNet-compress, and Dense-Stream YOLOv8n, to systematically analyze the contribution of each optimization module to the model’s performance. The experimental results are shown in Table 4. From the data in the table, it can be observed that the various optimization strategies affect the model’s detection accuracy, inference speed, and computational load to varying degrees.

Table 4 Comparison of experimental results.

Specifically, the original YOLOv8n model achieves mAP@0.5 (IoU = 0.5) and mAP@0.5:0.95 (average mAP over IoU = 0.5 to 0.95) of 0.991 and 0.898, respectively, with a frame rate (FPS) of 189.5, a computational load of 8.1 GFLOPs, and a parameter count of 3.01M. After introducing the lightweight convolution module DensityNet, the YOLOv8n-DensityNet model maintains an mAP@0.5 of 0.991, with a slight drop in mAP@0.5:0.95 to 0.896, while the FPS increases to 191.7 and the parameter count remains at 3.01M. This indicates that the DensityNet module preserves detection accuracy and detailed features while having a limited impact on computational resources.

Additionally, it improves the convergence speed of model training. On this basis, the YOLOv8n-DensityNet model is pruned using compression strategies to obtain the YOLOv8n-DensityNet-compress model. After pruning, the model’s mAP@0.5 slightly decreases to 0.989 and mAP@0.5:0.95 drops to 0.847, but the frame rate (FPS) increases significantly to 253.4, the parameter count is reduced to 2.04M, and the computational load (GFLOPs) decreases to 4.0. This result shows that although there is a slight decrease in mAP, the loss in accuracy is within an acceptable range, while inference speed and resource efficiency improve significantly.

Furthermore, by introducing knowledge distillation strategies to the YOLOv8n-DensityNet-compress model, the Dense-Stream YOLOv8n model is obtained. Experimental data show that this model achieves mAP@0.5 and mAP@0.5:0.95 of 0.99 and 0.861, respectively, with the FPS further increasing to 254.0, and the GFLOPs and parameter count remaining at 4.0 and 2.04M. The introduction of knowledge distillation strategies ensures low computational cost while further enhancing the model’s generalization ability, significantly optimizing its detection performance in high-density scenarios.

Fig. 5
figure 5

Heatmaps chart of test results.

To more intuitively demonstrate the detection effects of pruning and knowledge distillation strategies in high-density occlusion scenarios, Fig. 5b,c show the detection result heatmaps of the original YOLOv8n model and the optimized Dense-Stream YOLOv8n model in high-density crowds. As shown in Fig. 5c, the optimized model can more accurately focus on the key parts of side-by-side individuals, improving detection performance in occluded situations. In contrast, the original model performs less effectively in handling similar occlusion scenarios (see Fig. 5b). Figure 5a shows the original image of the actual scene for comparative analysis. The experimental results further validate that the Dense-Stream YOLOv8n model, with the support of pruning and knowledge distillation strategies, can achieve superior detection performance in high-density, occluded environments.

Fig. 6
figure 6

Effect diagram of each optimization strategy.

Additionally, to comprehensively analyze the effects of the various optimization strategies, this experiment also plots precision-recall curves (Fig. 6a) and loss function curves (box_loss, dfl_loss, cls_loss) (Fig. 6b) to assist in verifying the improvements in detection performance brought by each optimization module.

From the precision-recall curves (Fig. 6a), it can be observed that after introducing the lightweight convolution module DensityNet, YOLOv8n-DensityNet shows improvements in both detection speed and accuracy compared to the original YOLOv8n model. After incorporating the pruning compression strategy, YOLOv8n-DensityNet-compress retains a high mAP@0.5 in high-density scenarios, indicating that the pruning operation effectively reduces redundant parameters and enhances inference speed, making the model suitable for real-time detection on edge devices. With the addition of the knowledge distillation strategy, Dense-Stream YOLOv8n performs best in high-density scenarios, further validating the critical role of distillation in enhancing model generalization and robustness.

In terms of the loss functions (Fig. 6b), as each optimization strategy is progressively introduced, the box_loss, dfl_loss, and cls_loss in both the training and validation processes gradually decrease and stabilize around 200 epochs. This trend indicates that the model achieves good convergence under the optimization of each strategy. Among them, Dense-Stream YOLOv8n performs best in terms of box_loss and dfl_loss, further confirming that the introduction of distillation strategies allows the model to maintain low detection errors while improving detection accuracy.

At the same time, this experiment compared and analyzed the F1 score curves (see Fig. 7a,b) and precision-recall curves (see Fig. 7c,d) of YOLOv8n and Dense-Stream YOLOv8n during training to evaluate the impact of parameter reduction on model performance.

Fig. 7
figure 7

F1 and P-R variation curves during training.

From Fig. 7a,b, it can be seen that, compared with YOLOv8n, Dense-Stream YOLOv8n shows a decrease of only 0.01 in the highest F1 score and 0.023 in the lowest F1 score despite a roughly 50% reduction in parameter count, indicating that the model still balances precision and recall stably. Meanwhile, Fig. 7c,d shows that, under the same training conditions, Dense-Stream YOLOv8n exhibits a decrease of only 0.002 in mAP@0.5, further validating that the model maintains excellent detection performance while reducing computational cost. These experimental results indicate that Dense-Stream YOLOv8n can keep detection accuracy close to that of YOLOv8n while significantly reducing parameter size, fully demonstrating the effectiveness and efficiency of its lightweight design.

In summary, through the improvements introduced by the DensityNet module, pruning compression strategy, and knowledge distillation strategy, the YOLOv8n model has achieved a good balance between accuracy, inference speed, and model complexity. Especially in high-density dynamic scenarios, the pruning and knowledge distillation strategies have significantly enhanced the model’s real-time performance and robustness, validating the effectiveness of the proposed optimization schemes.

Simulation

To verify the practical application effects of the improved YOLOv8n model and the baseline region counting algorithm in high-density, view-limited dynamic library environments, simulation experiments were designed and conducted to systematically evaluate the performance of the optimization strategies in this scenario.

This study randomly selected 10 densely populated video clips from peak hours and 10 non-densely populated video clips from off-peak hours from previously collected videos of the Nanchang University Library. To ensure the representativeness and rigor of the results, each video segment was trimmed to 1-2 minutes, retaining the segments with the most characteristic human flow, to fully reflect the typical human flow environment in actual applications. Such processing can more effectively test the detection accuracy and robustness of the model and algorithm under different human flow densities, providing more realistic and reliable sample data for the subsequent analysis of detection performance.

During the sample video detection process, the actual number of people entering and exiting, the number of people detected as entering and exiting, and the processing speed (FPS) were recorded to analyze the model’s detection accuracy and real-time performance under different human flow densities. Please refer to Supplementary Video 1 (detection effects after trimming, at triple speed, for the entire day) and Supplementary Video 2 (detection effects at normal speed for a specific time period). To further illustrate the model’s performance in dense and non-dense human flow scenarios, Fig. 8a,b show the real-time detection results in high-density and low-density scenarios, respectively.

Fig. 8
figure 8

Detection result diagram.

To further analyze the consistency of the improved model in classification tasks, Fig. 9a,b show the confusion matrices of the original YOLOv8n model and the optimized Dense-Stream YOLOv8n model. It can be observed that both models exhibit similar classification performance on the “person” and “background” categories, achieving high classification accuracy. This indicates that the optimized model, while reducing computational complexity and improving inference speed, can maintain classification performance comparable to the original model, ensuring detection stability and accuracy in high-density environments.

Fig. 9
figure 9

Comparison of confusion matrix.

At the same time, in order to evaluate the generalization performance of the model in different scenarios, the detection ability of the original YOLOv8n and the optimized Dense-Stream YOLOv8n was compared on two datasets: the ShanghaiTech Crowd Counting Dataset (Fig. 10a,b) and AudioVisual (Fig. 10c,d).

Fig. 10
figure 10

Generalization testing.

From Fig. 10, it can be seen that the lightweight Dense-Stream YOLOv8n performs better in pedestrian detection in side-view scenes in certain situations. This can be attributed to the fact that, during knowledge distillation, the model learned the feature extraction capability of the high-accuracy YOLOv8l teacher, allowing Dense-Stream YOLOv8n to maintain high detection accuracy even with a reduced parameter count. This result further validates the generalization ability of Dense-Stream YOLOv8n on different datasets, indicating that its lightweight design remains highly adaptable while ensuring detection accuracy and can be extended to a wider range of intelligent monitoring scenarios.

Additionally, this study recorded the processing speeds in both CPU and NVIDIA RTX 4060 GPU environments to evaluate the real-time performance of the model. Table 5 summarizes the detection results for the two types of videos, including key metrics such as the number of people entering and exiting, detection accuracy, and processing speed.

Table 5 Model performance evaluation table.

The experimental results show that the model exhibits very high detection accuracy in both high-density and low-density crowd flow environments, reaching 99.41% and 99.88% respectively, effectively verifying its adaptability in complex crowd flow environments. In terms of processing speed, the frame rate on the NVIDIA RTX 4060 GPU reached around 34 FPS, fully meeting the real-time monitoring needs of high-density dynamic scenes in smart libraries. Although CPU processing is comparatively slow (5.14 FPS for dense scenes and 5.53 FPS for non-dense scenes), these frame rates are reported to demonstrate the model’s applicability in resource-constrained environments; for application scenarios without strict real-time requirements, CPU processing still has reference value. Despite occasional challenges from extreme occlusion and environmental complexity, the model exhibits good overall stability in practical applications.

Comparative experiment

To further validate the superior performance of Dense-Stream YOLOv8n in detecting high-density and side-view visitor flows in dynamic smart library environments, this study selected several commonly used object detection models as comparative benchmarks. These include DDOD (Disentangle Your Dense Object Detector)46, Faster R-CNN (Faster Region-based Convolutional Neural Network), SSD (Single Shot Multibox Detector), YOLOv11n, and the original YOLOv8n model. The models were evaluated on metrics such as mean Average Precision (mAP@0.5 and mAP@0.5:0.95), inference frames per second (FPS), computational complexity (GFLOPs), and the number of parameters. The experimental results are summarized in Table 6.

Table 6 Comparison of experimental results.

Based on the experimental results in Table 6, Dense-Stream YOLOv8n demonstrates superior performance across multiple key metrics. In terms of mean Average Precision (mAP), it achieves an mAP@0.5 of 0.990, which is nearly on par with YOLOv8n and SSD at 0.991, and slightly higher than other models such as Faster R-CNN (0.981), DDOD (0.984), and YOLOv11n (0.859). This indicates its competitive accuracy. Additionally, for the mAP@0.5:0.95 metric, Dense-Stream YOLOv8n reaches 0.861, which, although lower than SSD (0.896) and YOLOv8n (0.898), remains within an acceptable range. Compared to DDOD (0.846), Faster R-CNN (0.835), and YOLOv11n (0.606), its accuracy is improved by 1.8%, 3.1%, and 29.61%, respectively. This suggests that Dense-Stream YOLOv8n has good adaptability across various density scenarios, making it suitable for real-world visitor flow monitoring tasks in smart libraries.

In terms of inference speed, Dense-Stream YOLOv8n achieves a frame rate (FPS) of 254.0, significantly higher than the other benchmark models. Specifically, DDOD and Faster R-CNN have FPS rates of 22.5 and 18.3, respectively, while SSD has an FPS of 39.7. In comparison, the original YOLOv8n and YOLOv11n achieve 189.5 FPS and 243.9 FPS, respectively, which, although good, are still lower than Dense-Stream YOLOv8n. This significant improvement in performance clearly indicates that Dense-Stream YOLOv8n has a substantial advantage in real-time monitoring capabilities, effectively meeting the demands of high-density visitor flow environments in smart libraries.

Furthermore, in terms of computational complexity (GFLOPs) and the number of parameters, Dense-Stream YOLOv8n has values of 4.0 GFLOPs and 2.04M parameters, respectively. Compared to other models such as Faster R-CNN (203 GFLOPs, 43.61M parameters) and SSD (30.49 GFLOPs, 23.74M parameters), it exhibits a lightweight characteristic, effectively reducing resource consumption. This makes it well-suited for deployment on edge devices, which is crucial for real-time monitoring in high-density visitor flow environments.

In summary, through comparisons with other models, Dense-Stream YOLOv8n demonstrates superior performance in high-density, side-view visitor flow monitoring in smart libraries. It provides strong support for achieving accurate and real-time visitor flow monitoring. The model’s balanced performance in accuracy, real-time capability, and resource efficiency makes it an excellent choice for practical applications in smart library settings.

Discussion

Performance advantages and comparative analysis of the Dense-Stream YOLOv8n model

The proposed Dense-Stream YOLOv8n model in this study demonstrates significant performance advantages in object detection, notably outperforming existing modified YOLOv8 models. For instance, the FPS of the improved YOLOv8 model proposed by Lei et al. is 41.4647, and the FPS of the improved YOLOv8 model proposed by Safaldin et al. is 30.048, whereas the FPS of our method reaches 254.0, significantly enhancing real-time detection capabilities. In terms of accuracy, Dense-Stream YOLOv8n achieves an mAP@0.5 of 0.99 and an mAP@0.5:0.95 of 0.861, which surpasses the GR-YOLO proposed by Li et al. (mAP@0.5 of 0.855 and mAP@0.5:0.95 of 0.600) and other modified models49.

Moreover, the model size of Dense-Stream YOLOv8n is only 2.04M, which is significantly smaller than the 13.3MB improved YOLOv8 model proposed by Wang et al.50. This results in lower storage requirements and higher computational efficiency, making it more suitable for embedded and edge device applications.

In addition to outperforming existing modified YOLOv8 models, Dense-Stream YOLOv8n also significantly outperforms lightweight networks such as PFEL-Net and PeleeNet. Compared to PFEL-Net51, Dense-Stream YOLOv8n has a clear advantage in accuracy. Although PFEL-Net has a smaller model size (0.9MB), Dense-Stream YOLOv8n (2.04M) achieves an mAP@0.5 of 0.99, far exceeding PFEL-Net’s 0.656. Furthermore, PeleeNet is limited in detection accuracy in complex scenarios, especially for small objects and occluded scenes52. In contrast, Dense-Stream YOLOv8n, with the introduction of the DensityNet module, effectively enhances the detection capability for small objects and occluded scenes.

Therefore, Dense-Stream YOLOv8n not only ensures high real-time performance but also offers superior accuracy and broader applicability. Overall, Dense-Stream YOLOv8n achieves an excellent balance between precision, speed, and resource consumption, making it well-suited for object detection tasks in complex and dynamic environments.

Limitations

Although the Dense-Stream YOLOv8n model performs well when detecting high-density crowds from a side view, it still has certain limitations under extreme occlusion and complex lighting conditions. Specifically, when targets are severely occluded, the model may have difficulty distinguishing individuals, resulting in missed or false detections; in scenes with dense pedestrian overlap, detection accuracy decreases significantly. In addition, changes in lighting conditions may affect the extraction of target features, making the model’s detection capability unstable under different illumination. Finally, although this study optimized detection performance for side-view scenes, when the model is applied to top-down or other non-ideal viewing angles, changes in target appearance may reduce its generalization ability, thereby affecting the robustness and stability of detection.

To further improve the generalization ability of Dense-Stream YOLOv8n in complex scenes, future research can proceed along several directions, including multimodal data expansion, feature enhancement, and structural optimization53. On the one hand, multimodal data fusion provides richer feature representations for occluded targets, enhancing the model’s understanding of target morphology and spatial information54. On the other hand, introducing more advanced attention mechanisms, such as the Swin Transformer structure or self-attention-based feature enhancement methods, can improve the capture of global information during detection, enabling the model to locate targets more accurately under occlusion55. Together, these optimization directions would not only improve the detection capability of Dense-Stream YOLOv8n under different viewpoints and occlusion conditions but also provide a more adaptive solution for future real-time object detection systems in complex environments.

Conclusions

This study proposes Dense-Stream YOLOv8n, a lightweight pedestrian flow detection method based on an improved YOLOv8n model, aimed at addressing the challenges of pedestrian flow monitoring in the high-density, side-view dynamic environments of smart libraries. The method introduces the lightweight convolutional enhancement module DensityNet together with pruning and knowledge distillation techniques to significantly reduce computational complexity while maintaining detection accuracy, making it suitable for real-time use on edge devices. The experimental results show that Dense-Stream YOLOv8n reaches an mAP@0.5 of 0.99 in high-density scenes, close to the original YOLOv8n (0.991), while its mAP@0.5:0.95 reaches 0.861. In terms of real-time performance, the frame rate of Dense-Stream YOLOv8n increases to 254 FPS, 34.0% higher than the original YOLOv8n (189.5 FPS), while the computational load is reduced to 4.0 GFLOPs and the parameter count to 2.04M, significantly reducing computational resource consumption and making the model better suited to computation-limited edge devices. In addition, simulation experiments and generalization tests show that the method maintains high detection stability under different crowd densities and, in particular, retains good detection performance in dynamic side-view environments with occlusion. Comparisons with other detection models further confirm that Dense-Stream YOLOv8n achieves a good balance among detection accuracy, inference speed, and computational resource consumption. The method proposed in this study provides theoretical and technical support for integrated solutions in intelligent environments such as smart library management systems and smart city infrastructure.