Key frame extraction algorithm for surveillance videos using an evolutionary approach

Rajan, Manjusha; Parameswaran, Latha

doi:10.1038/s41598-024-84324-0

Article
Open access
Published: 02 January 2025

Manjusha Rajan¹ &
Latha Parameswaran¹

Scientific Reports volume 15, Article number: 536 (2025) Cite this article

Abstract

With rapid technological advancements, videos are captured, stored, and shared in multiple formats, increasing the need for summarization techniques that enable shorter viewing durations. Key Frame Extraction (KFE) algorithms are crucial in video summarization, compression, and offline analysis. This study aims to develop an efficient KFE approach for generic videos. Existing methods include the Adaptive Key Frame Extraction Algorithm, which reduces redundancy while ensuring maximum content coverage; the Optimal Key Frame Extraction Algorithm, which uses a Genetic Algorithm (GA) to select key frames optimally; and the Rapid Key Frame Extraction Algorithm, which employs clustering techniques to identify typical key frames. However, a clear need remains for a more versatile KFE technique that addresses generic applications rather than specific use cases. Evolutionary algorithms offer a powerful means of achieving optimal KFE. The proposed method leverages an interactive GA with a well-designed Fitness Function and elitism-based survivor selection to enhance performance. The algorithm has been tested on diverse datasets, including VSUMM, SumMe, Mall, user-generated videos, surveillance footage from Amrita Vishwa Vidyapeetham University (Coimbatore, India), and web-sourced videos. The results demonstrate that the proposed KFE approach agrees with the benchmark data and captures additional significant frames. Compared to Differential Evolution (DE) techniques and Deep Learning (DL) models from the literature, the proposed algorithm demonstrates superior efficiency, as verified through quantitative and qualitative evaluation metrics. Furthermore, the computational complexity of the GA is compared in detail with that of DE and DL-based approaches, highlighting their distinct efficiencies and performance characteristics.

Introduction

With the explosion of media content in the information age, videos have assumed a predominant role in content sharing. With videos of varying sizes exchanged unabated on the Internet, reviewing the complete set of videos retrieved for a particular search is an onerous exercise. Viewers would be better served by an option to preview a concise video synopsis than by spending time on an irrelevant, poorly composed video. Video summarization entails identifying the key frames (KF) of a video that capture the vital aspects of its visual content. An incomplete, skewed summarization that promotes only one side of an issue results in an inadequate video synopsis. Efficient depiction of selected video clips enables a well-balanced summary of the video.

Computer Vision (CV) focuses on mimicking the human visual system, facilitating thorough analysis of visual content with minimal investment of time and effort. CV’s footprint is apparent in diverse domains, such as medical computer vision, machine vision, surveillance, unmanned vehicles, and missile guidance. Video analytics, which examines a video in real-time or offline mode, addresses problems such as event detection, pattern recognition, surveillance, crowd management, and public safety. Analytics can be performed on the entire video or only on frames with perceptible changes. When detecting events of interest, restricting the analysis to frames bearing prominent change, instead of scanning the entire video, saves time and computing resources. The accuracy of the analysis depends on the precision with which the prominent KF are detected.

Extensive research has produced a variety of Key Frame Extraction (KFE) approaches. KFE schemes are shot-based, motion-based, or evolutionary. Shot-based KFE partitions a video into shots and identifies the KF within each shot. Motion-based KFE techniques pinpoint radical changes among the video frames.

Evolutionary approaches solve problems using techniques inspired by nature. Evolutionary Algorithms (EA) are subdivided into evolutionary programming, evolution strategies, Genetic Algorithms (GA), and Genetic Programming (GP). GAs, inspired by natural genetics and evolution, can search large spaces in pursuit of optimal combinations of solutions. A GA imitates the natural selection process, in which the fittest individuals of one generation are designated as parents and produce the offspring of the next generation.
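The generational loop described above can be illustrated with a minimal sketch. It is not the algorithm proposed in this paper: the bit-string encoding, binary tournament selection, one-point crossover, and every name and parameter value (`run_ga`, `pop_size`, `elite`, `mutation_rate`) are assumptions chosen only to keep the example small.

```python
import random

def run_ga(fitness, n_bits, pop_size=20, generations=30, elite=2,
           mutation_rate=0.05, seed=0):
    """Toy generational GA over bit strings with elitism (illustrative only)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best_per_gen = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        best_per_gen.append(fitness(pop[0]))
        # Elitism: the fittest individuals survive unchanged.
        next_pop = [ind[:] for ind in pop[:elite]]
        while len(next_pop) < pop_size:
            # Binary tournament: the fitter of two random individuals is a parent.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # One-point crossover followed by bit-flip mutation.
            cut = rng.randint(1, n_bits - 1)
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_pop.append(child)
        pop = next_pop
    pop.sort(key=fitness, reverse=True)
    best_per_gen.append(fitness(pop[0]))
    return pop[0], best_per_gen

# Toy objective: maximize the number of ones in a 16-bit string.
best, history = run_ga(fitness=sum, n_bits=16)
```

Because the elite individuals are copied unchanged, the best fitness recorded in `history` never decreases from one generation to the next.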

Motivation of the study

The need for efficient, automated video content summarization motivates the development of a GA-based approach to KFE. KFE plays a crucial role in reducing the volume of data that needs to be processed, stored, and transmitted while preserving vital information. By leveraging the capabilities of the GA, this work optimizes the selection of KF through a well-defined Fitness Function (FF) that evaluates the significance of each frame.

Key motivations for this approach include:

Optimization power: GAs are adept at finding optimal or near-optimal solutions within large search spaces, making them particularly suited to extensive video datasets.
Adaptability: GAs are highly adaptable to different types of videos and extraction criteria. The flexibility of the FF allows for customization across domains and use cases.
Reducing redundancy: By selecting a diverse yet representative set of frames, the GA minimizes redundancy in the extracted summary, resulting in a more concise and informative representation of the video content.
Efficiency: GA-based KFE methods often outperform exhaustive search techniques in speed and performance, especially on complex or lengthy video sequences. This is substantiated in section 4.5.

This work focuses on performing KFE using a GA on unconstrained video. The choice and design of the fitness function and of the survivor selection are the core of the proposed work.

Objective of the work

The primary objective of developing a GA-based KFE is to achieve efficient, automated video content summarization by identifying and selecting the most representative frames. This approach aims to optimize the selection process, ensuring that important information is retained while minimizing redundancy and reducing data processing requirements.

Specific objectives include:

Maximizing relevance: Ensure that the extracted KF represent the video’s most informative and relevant moments, capturing the core narrative or action. The results in section "Experimental results" demonstrate the efficacy of the proposed algorithm in identifying the most relevant information, as shown through comparison with ground truth data.
Minimizing redundancy: Select diverse frames to avoid repetitive content, making the summary more concise and informative.
Enhancing adaptability: Design the extraction process to be flexible across various video types and domains, with the GA-FF tailored to adapt to different extraction criteria.
Improving computational efficiency: Leverage the optimization power of GAs to reduce computational load compared to exhaustive search methods, enabling faster and more scalable video summarization. This is substantiated in section 4.5.

The algorithm has been tested across various datasets to demonstrate its efficacy. The rest of this paper is organized as follows: section "Related works" reviews various techniques for KFE reported in the literature; section "Proposed algorithm for KFE" outlines the proposed framework; section "Experimental results" presents the validation results that attest to the efficacy of the proposed work.

Related works

Many investigators have developed algorithms for KFE. A few significant related works are discussed below. The authors¹ suggested a method for video summarization using Capsules Net. Capsules Net was trained to extract content and motion features, which enabled the generation of an inter-frame motion curve. A cubic H-Bezier curve was fitted to the frames over a sliding window. The G1-continuity error was computed to represent the change in regularity between the piecewise curves. Transition Effects Detection (TED), which indicates sudden changes in the video, was used to identify shots. A self-attention model was then used to extract KF based on the position content of each shot.

In², KFE was performed in a summary space, with Lipschitz functions mapping the frames of the video to a higher-dimensional summary space. K-means clustering was applied to the frames in the summary space, and the anchor points, which are the most representative frames in each cluster, were inferred from the generated weight matrix. As the selected representative frames exhibited redundancies, the KF were extracted from these sets of representative frames. Applying the Discrete Cosine Transform (DCT) to the representative frames produced a DCT matrix, from which Hamming distances were derived for KFE.
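The summary above does not say how the DCT matrix is turned into Hamming distances, so the following is only one plausible reading, in the spirit of perceptual hashing: the low-frequency DCT coefficients of a grayscale block are binarized and two blocks are compared by the Hamming distance between their bit strings. The sign-based binarization, the 4×4 low-frequency window, the tolerance `eps`, and all function names are assumptions, not the procedure of the cited work.

```python
import math

def dct_1d(values):
    """Unnormalized 1-D DCT-II of a sequence."""
    n = len(values)
    return [sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                for i, v in enumerate(values))
            for k in range(n)]

def dct_2d(block):
    """Separable 2-D DCT: transform the rows, then the columns."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def dct_signature(block, keep=4, eps=1e-6):
    """Binary signature from the low-frequency coefficients (assumed scheme)."""
    coeffs = dct_2d(block)
    low = [coeffs[u][v] for u in range(keep) for v in range(keep)]
    # Drop the DC term so a uniform brightness shift does not change the bits.
    return [1 if c > eps else 0 for c in low[1:]]

def hamming(a, b):
    """Number of positions at which two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical 8x8 grayscale blocks: a falling ramp, a brighter copy, a rising ramp.
falling = [[70 - 10 * c for c in range(8)] for _ in range(8)]
brighter = [[p + 20 for p in row] for row in falling]
rising = [[10 * c for c in range(8)] for _ in range(8)]

d_same = hamming(dct_signature(falling), dct_signature(brighter))
d_diff = hamming(dct_signature(falling), dct_signature(rising))
```

A small distance marks two representative frames as redundant, so only one of them needs to be kept; a large distance marks them as distinct content.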

The authors in³ proposed a non-linear Sparse Dictionary Selection (SDS) for KFE. The video frames were mapped to a higher-dimensional feature space using the Mercer kernel to convert the non-linear relationship between frames into a linear one. The standard kernel SDS, derived from Simultaneous Orthogonal Matching Pursuit (SOMP), was used to select frames correlated with the residuals. Next, frames orthogonal to the latter were selected to ensure that the current frame was dissimilar to the previously selected frame. A robust kernel SDS was used to identify candidate KF. The best candidate KF was then selected based on its significance. This process was repeated for several iterations, with the reconstruction coefficient and the energy ratio of the residual updated at each iteration; the KF were extracted on termination of this sequence.

A method for KFE in assembly processes was suggested in⁴. A semantic graph was constructed for each frame. Objects were represented as 3-D ellipsoids. Meaningful semantic segments were formed in each frame. The graph’s vertices corresponded to segments, and the edges corresponded to the spatial relations between two segments. Differences in eigenvalues and structural changes between the graphs of consecutive frames indicated the presence of KF.

The authors⁵ put forth KFE for odometry estimation. Drift errors accumulate over time during odometry estimation because of local transformations, so errors in pose estimation increase as the process continues. Euclidean distance was used to evaluate the error between the real and the estimated poses. KFE minimized the accumulated error by identifying KF based on scan similarity at each step. The scan data that had the lowest uncertainty and exceeded the user-defined similarity threshold, compared to other candidates, were selected as KF.

The authors⁶ focused on annotating videos with sentences. The video was first partitioned into shots, which were taken as the frames occurring at intervals of 25 in the video under consideration. The frame from each shot that had lower visual features than the other frames was selected as a KF. A 215-D visual feature vector, comprising a 45-D color moment feature vector, a 170-D Hierarchical Wavelet Packet (HWVP) descriptor, SIFT, and Color SIFT, was used for sentence generation. Each sentence comprised elements such as objects, events, scenes, and adjectives. The Weighted Scoring Algorithm (WSA) was used to find the best frame elements. The relationships among the best elements were then analyzed, and a sentence was constructed using the Correlation Graph Algorithm (CGA).

The authors⁷ performed image segmentation based on differential evolution. A histogram was used to obtain the preliminary threshold. The Gaussian Mixture Model (GMM) was used to obtain multiple thresholds. The mean square error of the difference between the Gaussian function and the histogram was used as the fitness function (FF). The FF expedites the determination of the best threshold for image segmentation. The authors⁸ proposed an evolutionary approach to image fusion. Input images were divided into blocks, and the more evident blocks were used to constitute the final image. An EA was employed to dynamically find the best block size for the composition of the final image. The authors⁹ reviewed the application of various EAs to combinatorial optimization problems.
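The histogram-fitting fitness described for the segmentation work can be sketched as follows. The mixture parameterization as `(weight, mean, std)` tuples, the normalization of the histogram to unit sum, and the function names are assumptions made for illustration; the cited work may define these details differently.

```python
import math

def gaussian_mixture(x, components):
    """Mixture density at x; components is a list of (weight, mean, std)."""
    return sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for w, m, s in components)

def mse_fitness(hist, components):
    """Mean squared error between the mixture and the normalized histogram.

    Lower is better, so an optimizer such as DE would minimize this value.
    """
    total = sum(hist)
    probs = [h / total for h in hist]
    return sum((gaussian_mixture(i, components) - p) ** 2
               for i, p in enumerate(probs)) / len(probs)

# Hypothetical 64-bin histogram generated from a known two-mode mixture.
true_params = [(0.5, 16.0, 3.0), (0.5, 48.0, 3.0)]
hist = [gaussian_mixture(i, true_params) for i in range(64)]

err_true = mse_fitness(hist, true_params)
err_shifted = mse_fitness(hist, [(0.5, 20.0, 3.0), (0.5, 40.0, 3.0)])
```

The candidate whose modes match the histogram scores a lower error than the shifted one, which is the signal an evolutionary search would follow; the segmentation thresholds would then be placed between the fitted modes.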

In¹⁰, KFE was carried out in three phases: first, extracting features from video frames; next, segmenting the video into distinct parts; and finally, identifying the KF that best represent the content of each segment. Features such as Scale-Invariant Feature Transform (SIFT), Global Image Structure Tensor (GIST), HSV histograms, and Pyramid Histogram of Oriented Gradients (PHOG) were extracted from the video frames, and the K-means clustering algorithm was applied to group the video into distinct segments based on these features. The KF were subsequently extracted using dynamic programming combined with a 0–1 integer linear programming algorithm, optimizing the selection process for greater efficiency and accuracy.

In¹¹, the authors used spatio-temporal subtitles for KFE. Initially, mid-frames between the appearance and disappearance of the subtitles were selected. An SSPA curve was generated, and catastrophic points were detected using edge detection methods. The KF were then extracted based on the steady points derived from these catastrophic points and the presence of subtitles.

The work in¹² focuses on identifying KF for object detection while minimizing False Negatives (FN). The FN were determined by combining motion detection using optical flow with object detection in each frame. The pre-trained YOLOv5 was then fine-tuned based on the identified FN, which resulted in better KFE than standard YOLOv5. In¹³, KF were identified based on abrupt shot transitions. Abrupt shots were detected by applying Sobel, Canny, and Roberts edge detection techniques and extracting features using block-based local binary patterns. Histogram-based thresholding was then used to extract the abrupt shots. The Sobel gradient was applied, and frames with higher coefficients were selected as KF based on the Z-score of their magnitudes.
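The final selection step, choosing frames whose Sobel gradient magnitude stands out by Z-score, can be sketched as below. Averaging the magnitude over interior pixels to obtain one score per frame, the threshold `z_thresh`, and the function names are assumptions; frames are taken to be small grayscale grids given as lists of rows.

```python
import math
from statistics import mean, pstdev

def sobel_energy(frame):
    """Mean Sobel gradient magnitude over the interior pixels of a frame."""
    h, w = len(frame), len(frame[0])
    total = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (frame[y - 1][x + 1] + 2 * frame[y][x + 1] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y][x - 1] - frame[y + 1][x - 1])
            gy = (frame[y + 1][x - 1] + 2 * frame[y + 1][x] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y - 1][x] - frame[y - 1][x + 1])
            total += math.hypot(gx, gy)
    return total / ((h - 2) * (w - 2))

def select_by_zscore(frames, z_thresh=1.0):
    """Indices of frames whose edge energy lies z_thresh deviations above the mean."""
    energies = [sobel_energy(f) for f in frames]
    mu, sigma = mean(energies), pstdev(energies)
    if sigma == 0:
        return []  # All frames look alike, so nothing stands out.
    return [i for i, e in enumerate(energies) if (e - mu) / sigma > z_thresh]

# Hypothetical clip: five flat frames and one frame containing a strong vertical edge.
flat = [[10] * 6 for _ in range(6)]
edge = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
clip = [flat, flat, flat, edge, flat, flat]

selected = select_by_zscore(clip)
```

In this toy clip only the frame with the vertical edge exceeds the threshold, so `selected` holds its index alone.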

In¹⁴, KF were identified using a simple GA. Color-based distance metrics were used to detect frames with significant intensity changes, which were marked as shots. The FF was then applied to each frame within every shot, and over multiple generations, the frames with the highest fitness values were selected as KF. In¹⁵, the authors aimed to identify KF in dance videos. The video frames were analyzed using directional gradient histogram and optical flow directional features, with music features also integrated to assist in KFE. The authors of¹⁶ compared KFE using entropy, absolute difference, optical flow, and YOLO. The experimental results demonstrated that YOLO outperformed all other methods for KFE.
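Detecting shots from a distance between consecutive frames can be sketched as follows. For brevity the sketch uses a grayscale intensity histogram and an L1 distance rather than a true color metric; the bin count, the threshold, and the function names are assumptions rather than the settings of the cited work.

```python
def histogram(frame, bins=8, max_val=256):
    """Normalized intensity histogram of a grayscale frame (list of rows)."""
    counts = [0] * bins
    for row in frame:
        for p in row:
            counts[p * bins // max_val] += 1
    n = sum(counts)
    return [c / n for c in counts]

def shot_boundaries(frames, threshold=0.5):
    """Indices i where frame i differs strongly from frame i - 1."""
    hists = [histogram(f) for f in frames]
    cuts = []
    for i in range(1, len(hists)):
        # L1 distance between normalized histograms lies in [0, 2].
        dist = sum(abs(a - b) for a, b in zip(hists[i - 1], hists[i]))
        if dist > threshold:
            cuts.append(i)
    return cuts

# Hypothetical clip: two dark frames, two bright frames, then a dark frame again.
dark = [[20] * 4 for _ in range(4)]
bright = [[220] * 4 for _ in range(4)]
cuts = shot_boundaries([dark, dark, bright, bright, dark])
```

Each index in `cuts` starts a new shot; a fitness function could then be evaluated on the frames inside each shot, as the GA-based method summarized above does.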

Table 1 compares the methodologies discussed in the literature. A close examination of the published literature indicates that KFE in videos is still an open problem. Accordingly, this work proposes a different variant of the GA, with an efficient FF and carefully crafted crossover, mutation, and survivor selection mechanisms, for unconstrained videos.

Table 1 Comparison of different algorithms for KFE.
