Abstract
Efficient anomaly detection in camera surveillance systems is increasingly important for improving public safety in complex environments. Most available methods fail to capture long-term temporal dependencies and spatial correlations, especially in dynamic multi-camera settings. Many traditional methods also rely heavily on large labeled datasets and generalize poorly when encountering unseen anomalies. We introduce a new framework that addresses these challenges by incorporating state-of-the-art deep learning models to improve temporal and spatial context modeling. We combine RNNs with Graph Attention Networks (GATs) to model long-term dependencies across spatially distributed cameras. A Transformer-Augmented RNN improves on standard RNNs through self-attention mechanisms, providing more robust temporal modeling. We employ a Multimodal Variational Autoencoder (MVAE) that fuses video, audio, and motion sensor information in a manner resistant to noise and missing samples. To address the scarcity of labeled anomalies, we apply Prototypical Networks for few-shot learning, enabling generalization from only a few examples. Finally, a Spatiotemporal Autoencoder performs unsupervised anomaly detection by learning normal behavior patterns and flagging deviations from them as anomalies. The proposed methods yield significant improvements of about 10% to 15% in precision, recall, and F1-score over traditional models. Furthermore, the framework generalizes to unseen anomalies, with gains of up to 20% in novel event detection, a major advancement for real-world surveillance systems.
Introduction
Anomaly detection in surveillance systems is a key research area, especially given the increasing deployment of multi-camera networks in cities, industries, and public places. While such systems are installed to enhance security and operational efficiency, they continuously generate copious volumes of video data that demand sophisticated techniques for automatically detecting abnormal activities. Traditional anomaly detection1,2,3 commonly relies either on handcrafted features or on classical machine learning algorithms that are not well suited to modeling the complex spatial and temporal dependencies present in such data. Moreover, these methods usually require large labeled training datasets, which limits scalability and leads to poor generalization to unseen anomalies. More recently, significant progress has been made in using neural networks for anomaly detection, thanks to the emergence of deep learning. However, most of these DL-based methods4,5,6 still lack the capability to model complex spatial relationships between multiple camera feeds and the long-term temporal dependencies that play a critical role in identifying rare or gradual anomalies. For instance, RNNs, though widely used in temporal modeling, have inherent limitations in capturing long-range dependencies, while convolutional methods operating on individual frames or local patches cannot capture larger-scale contextual information across different camera views. These gaps in current frameworks motivate an advanced architecture that integrates both spatial and temporal information across a network of cameras.
Given such challenges, this work proposes a new paradigm for anomaly detection by designing an integrated model that leverages state-of-the-art deep learning architectures for temporal context modeling and spatial information fusion. We propose a Temporal Graph Attention Network (TGAT)-based model that integrates RNNs with GATs to capture temporal dependencies while dynamically attending to the most important camera feeds at each time step. This approach enables better monitoring of complex environments, where anomalies can manifest only over longer time spans and across multiple distributed sensors4,5,6. Another key challenge addressed in this work is the fusion of multimodal data. Many surveillance systems incorporate additional sensor modalities such as audio, motion detectors, or even biometric data, each contributing information useful for detecting anomalies. To handle this, the proposed model includes a Multimodal Variational Autoencoder (MVAE) that learns a joint latent representation from multiple sensor modalities. The MVAE optimizes shared latent variables with a reconstruction loss to ensure that salient information from each modality is retained. Detection is thus more robust, even when the data are noisy or partially missing.
Graph Attention Networks (GATs) have recently achieved considerable success in anomaly detection and spatiotemporal modeling. For example, GATs have been applied to model interactions among individuals in dense crowds, capturing collective movement patterns, group dynamics, and anomalies. In video surveillance, GATs have been applied to multi-camera setups to establish correlations between spatially distributed feeds, improving the detection of anomalies in pedestrian or vehicular movement. In traffic monitoring, GATs have been used to model interdependencies between road segments, improving congestion detection and accident prediction. These applications demonstrate the ability of GATs to learn relationships in both structured and unstructured data, underscoring their relevance to the proposed work and their adaptability to multiple scenarios.
Another important aspect is the rarity of anomalous events, which makes it challenging to train supervised models. The integrated model therefore employs few-shot learning through Prototypical Networks, which enable anomaly classification with only a handful of labeled examples. This is of particular value in surveillance, where collecting large annotated datasets is typically impractical and expensive. By learning a metric space in which distances to prototypes representing normal and anomalous behavior can be computed, the model generalizes well to unseen anomalies, a basic but key requirement in real-world applications. In addition, an unsupervised Spatiotemporal Autoencoder learns patterns of normal behavior along both the spatial and temporal dimensions. Autoencoders compress input data into a lower-dimensional latent space and reconstruct it; in anomaly detection, the reconstruction error acts as a proxy for recognizing abnormal patterns. By extending this idea to spatiotemporal data, the autoencoder models normal behaviors within a network of multiple cameras and flags deviations that may signify anomalies. The integrated model thus offers a comprehensive solution to the challenges faced by existing anomaly detection systems. It significantly improves accuracy, precision, and recall by combining temporal and spatial modeling with advanced data fusion and few-shot learning. Its generalization to unseen anomalies is a substantial advance, since the lack of such generalization has been one of the most persistent weaknesses of traditional approaches. Extensive experiments demonstrate that the proposed methods raise the state-of-the-art performance bar in different real-world scenarios, significantly improving F1-scores and reducing false positives.
Motivation and contribution
The motivation for this work arises from the increasing complexity of modern surveillance systems and the inability of existing anomaly detection methods to keep pace with the challenges of multi-camera environments. Traditional models based on hand-crafted features or early deep learning techniques usually do not capture the nuanced spatial and temporal dynamics of an environment, which are necessary for detecting abnormal events. Not only must surveillance systems deal with vast data streams from several cameras, but the anomalies of interest tend to be rare, subtle, and spread out, which makes the task challenging. Additional complexity arises from integrating extra sensor data, such as motion or audio, since these modalities must be effectively fused to further improve detection accuracy. Most existing models also depend heavily on large annotated datasets, a factor that limits their use in real-world settings, where such data is hard or too expensive to obtain. This work addresses these limitations by proposing a new integrated model that leverages state-of-the-art deep learning architectures. The key innovation is the Temporal Graph Attention Network (TGAT), which models both long-term temporal dependencies and spatial correlations across a network of cameras. Rather than relying on traditional RNNs, which struggle to model longer-range dependencies, TGAT leverages attention mechanisms that dynamically focus on the most relevant camera feeds at each timestamp, enhancing the model's ability to monitor complex environments. The proposed Transformer-Augmented Recurrent Neural Network (TARNN) increases the temporal modeling capability of the system by combining the strength of RNNs for short-term event correlations with that of transformers for capturing long-range dependencies. This hybrid approach ensures effective modeling of both local and global temporal patterns, providing a holistic solution for anomaly detection.
This work also contributes a Multimodal Variational Autoencoder that fuses data from different sensor modalities, such as video, audio, and motion. The MVAE optimizes a joint latent space that preserves the critical information from each modality, making the model robust against noisy or incomplete data, as is common in many real-world surveillance systems. In addition, few-shot learning with Prototypical Networks deals with the challenge of limited labeled data by letting the model generalize to unseen anomalies from a few labeled examples. This is particularly valuable in anomaly detection since, in many cases, it is highly impractical to obtain large amounts of annotated data. Lastly, the Spatiotemporal Autoencoder offers an unsupervised framework for learning normal behavior patterns, where the reconstruction error serves as a signal to detect deviations that may indicate anomalies. This unsupervised approach reduces the false positive rate significantly while improving recall, especially across multiple cameras. Putting all these components together yields a very strong model that outperforms the state of the art in accuracy and generalization. This paper therefore proposes effective anomaly detection in complex, real surveillance settings, addressing the main challenges of long-term temporal modeling, multimodal data fusion, and few-shot learning. Extensive experimental results show that the proposed approaches significantly improve anomaly detection performance, both by enhancing F1-scores and by reducing false positive rates, making the model applicable to a wide range of security and monitoring applications.
Review of existing models used for multiple camera anomaly analysis
Large-scale datasets and rapidly developing deep learning techniques have contributed to the rapid growth of crowd activity analysis and anomaly detection. With the increasing complexity of urban life, the demand for effective and efficient surveillance to monitor crowd activities and detect anomalies is becoming critical. This section presents a review of 40 influential studies on crowd anomaly detection, covering techniques ranging from CNNs and GNNs to VAEs, reinforcement learning (RL), and other state-of-the-art AI frameworks. Each of these contributions brings methodologies that address the challenges of dense, dynamic crowds for both real-time and post-event analysis, and each is discussed in terms of limitations, effectiveness, and applicability. The majority of the reviewed approaches operate in a supervised setting and are highly dependent on labeled datasets. This dependence on large labeled data becomes a limitation in itself, because amassing datasets representative of a wide variety of anomalies is often an exhausting, costly, or sometimes impossible undertaking. Approaches such as those in1,2,3 illustrate that traditional supervised methods, while performing reasonably, mostly fail to generalize beyond their training data. For instance, the CNN-based accident detection in1 achieved 89.5% accuracy in classifying traffic accidents but does not scale to variations in accident types. Similarly, the fuzzy cognitive deep learning framework for crowd behavior prediction in2,3 was found to generalize poorly to no-crowd scenarios, limiting the wider applicability of the model. More broadly, these studies show that while deep learning methods have become powerful tools for feature extraction and pattern recognition, their actual performance depends heavily on the diversity and quality of the training data.
Meanwhile, unsupervised learning studies have shown promise in overcoming the limitations of their supervised counterparts by using techniques such as variational autoencoders and GANs. For example7,8 used variational autoencoders coupled with motion consistency to detect abnormal crowd behaviors and reported an accuracy of 86%. Such unsupervised models are more useful when labeled data is available only in small quantities or when anomalies are infrequent and unpredictable. However, these methods also have drawbacks, particularly in complex, high-density environments, where reconstruction errors can grow due to noisy or incomplete data. Works such as9,10, which generated crowd motion in virtual reality, faced problems scaling up to larger crowds, showing the limitation of unsupervised learning in highly dynamic scenarios. Another focus in the field is multimodal fusion: integrating multiple streams such as video, audio, and motion sensors helps to improve robustness and accuracy in crowd behavior analytics. The study in11,12,13 proposed a secure smart surveillance system integrated with transformers for crowd behavior recognition and reported 90% accuracy in recognizing abnormal behaviors. Likewise14,15,16 applied ant colony optimization to find optimal layouts for crowd management, resulting in a 23% reduction in crowding in simulated environments. These studies show that integrating multiple modalities can indeed enhance the predictive power of anomaly detection systems. However, most of these schemes are computationally intensive17,18,19 and require specialized infrastructure, making them difficult to deploy in real time in resource-constrained environments. Recently, attention mechanisms and graph-based approaches have become a focal interest alongside multimodal systems; many studies show that they capture the intrinsic spatiotemporal dependencies in crowd behavior, which greatly improves anomaly detection. A typical example is20, where the authors used an attention-based CNN-LSTM model with multi-head self-attention for violence detection and obtained 85.3% accuracy, demonstrating the strength of attention mechanisms in filtering out noise and highlighting relevant features. This is especially meaningful in cluttered situations where numerous overlapping activities occur simultaneously. In turn21,22,23 applied graph convolutional neural networks to classify structured and unstructured crowds, achieving an 87% F1-score in crowd classification. Graph-based models, in particular, capture the interrelations among individuals in a crowd much better and yield higher accuracy in detecting group behavior and anomalies in collectiveness.
Despite these encouraging results24,25,26, many limits remain. Most of the approaches, in particular those using deep learning, are computationally very expensive and require substantial hardware resources, reducing their feasibility for real-time, large-scale deployment in urban environments. For example, works such as27,28,29,30,31,32 reported good anomaly detection performance using attention-guided and GAN-based models, respectively. However, the computational overhead is usually too high for real applications, particularly in constrained-resource settings such as remote surveillance systems or edge devices. Besides, many methods work well in highly controlled environments33,34,35,36,37 but struggle to maintain their accuracy in more dynamic and less predictable settings. For example, while the zero-shot classifier in35,38,39 could detect novel anomalies quite accurately (85% accuracy), it is limited in real scenarios where spatiotemporal descriptors may be incomplete or noisy.
Following Table 1, it is likely that in the near future, crowd behavior and anomaly detection research will further improve existing approaches along the dimensions of scalability, efficiency, and adaptability. The most promising trend is the creation of hybrid models that combine the advantages of supervised, unsupervised, and reinforcement learning. Using the flexibility of reinforcement learning, similar to what is proposed in39,40,41, which developed an emotional-contagion-aware deep reinforcement learning approach to simulate antagonistic crowd behavior, should make models more tractable to the dynamic nature of crowd interactions. The addition of domain adaptation methods, where feasible, would allow models to be applied to other environments with less comprehensive retraining on new datasets. In addition, edge computing and distributed learning architectures have the potential to overcome the computational challenges of current methods by spreading the computational burden across several devices, enabling real-time crowd monitoring and anomaly detection with little compromise in performance or accuracy. In summary, the reviewed studies42,43,44 reflect important steps in the advancement of crowd behavior analysis and anomaly detection, and show how different methodologies have been used to date to approach the problem of crowded scenes. From supervised learning models44,45,46,47,48,49,50,51 that offer good performance in controlled environments to unsupervised techniques that offer more flexibility when labeled data are scarce, each method contributes to a better understanding of crowd dynamics and anomaly detection. Nevertheless, future work will need to overcome the shortcomings of current methods, because scalability, computational efficiency, and generalization become even more demanding as urban environments grow more complex. By combining multimodal data, attention mechanisms, and sophisticated learning methods, the next generation of crowd anomaly detection systems promises to enhance public safety, upgrade urban management, and contribute to smarter cities.
Proposed design of an integrated model with temporal graph attention and transformer-augmented RNNs for enhanced anomaly detection
In this section, a unified model with graph attention and transformer-augmented RNNs is designed to support an enhanced anomaly detection process. The model is proposed to overcome deficiencies of existing anomaly detection methods, which are either inefficient or highly complicated. The Temporal Graph Attention Network (TGAT), shown in Fig. 1, is specially designed to solve intrinsic problems of anomaly detection in multi-camera scenarios by jointly modeling temporal and spatial representations within one framework. Long-term dependencies are captured by the RNN, while graph attention mechanisms model the spatial relationships across multiple camera feeds. The network of cameras is represented as a spatiotemporal graph, in which every node depicts one camera and edges represent the spatiotemporal relationships between cameras across temporal instances. The attention mechanism provides the capacity to focus on the most informative camera feeds and establishes a robust temporal context for anomaly classification. Mathematically, TGAT operates on a graph G = (V, E), where V = {v1, v2, …, vn} are nodes representing camera feeds, and E represents edges that encode spatiotemporal relationships between cameras. The input to the model at each timestamp is a sequence of feature vectors Xt = {x1, x2, …, xn}, where each xi corresponds to the features extracted from camera vi at timestamp t. Temporal dynamics are modeled by an RNN that processes the feature sequences over time and provides the hidden representation ht as a function of the temporal context via Eq. (1),
where ht represents the hidden state at timestamp t, Wh and Wx are learnable weight matrices, and σ is an activation function (tanh). This hidden state captures temporal dependencies across many timestamps, enabling the model to integrate short-term and long-term patterns relevant to anomaly detection. To account for the spatial dependencies of the camera network, TGAT applies an attention mechanism at the graph level that assigns different importance to different camera feeds at each timestamp. The attention score α(i, j) between any two camera nodes vi and vj is computed as a softmax over the attention logits e(i, j), which are a function of the node features and their spatiotemporal correlation, as given in Eqs. (2) and (3),
where N(i) denotes the neighbors of node i, aT is a learnable weight vector, W is a learnable matrix that transforms the node features, and ∥ denotes concatenation. The LeakyReLU function introduces non-linearity into the computation of the attention logits so that the model can learn complex patterns of spatiotemporal dependency between camera nodes.
The feature of each node is then updated by aggregating the features of neighboring nodes weighted by their attention scores, resulting in the updated representation given in Eq. (4),
This attention-based aggregation enables the model to dynamically focus on the most relevant camera feeds at every timestamp, so that both the local interactions between adjacent cameras and the global patterns across the whole network are effectively captured. The output of TGAT is a spatiotemporal representation of the camera network, which contains both temporal context and spatial relationships. This representation is then used to classify every frame at a timestamp as normal or anomalous. The anomaly score is determined by how much the predicted behavior deviates from the normal pattern learned during training, which can be represented as a residual error rt, calculated via Eq. (5),
where f(ht) is the forecasted output at timestamp t, and yt is the actual label (normal or anomalous). The residual error indicates how close the model's prediction was to the observed behavior; high values of rt reflect possible anomalies. TGAT is used as the core architecture because it models temporal and spatial dependencies jointly. Traditional RNNs can model sequential data quite well, but they have limited capacity to capture the long-range dependencies that develop in multi-camera systems. The graph attention integrated into TGAT reinforces the model's ability to consider not only temporal sequences but also, importantly, the relationships between different cameras, which are crucial for correctly detecting anomalies across a spatially distributed network. Attention mechanisms allow the model to focus on the most informative camera feeds at every timestamp, reducing irrelevant or redundant noise and improving overall precision. TGAT complements the other methods in the proposed framework, which include Transformer-Augmented RNNs and Spatiotemporal Autoencoders, by modeling fine-grained spatial and temporal dependencies at the camera-node level. While TARNN focuses on long-range dependencies in temporal sequences and the Spatiotemporal Autoencoder emphasizes compact representation learning of normal behavior, TGAT addresses multi-camera correlation and dynamic attention head-on. It thus constitutes an important component for improving anomaly detection accuracy, and the combined merits of these models ensure superior performance both in anomaly detection and in generalization to unseen events.
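To make the attention computation above concrete, the following minimal NumPy sketch (not the authors' implementation) applies GAT-style scoring and aggregation in the spirit of Eqs. (2)-(4) at a single timestamp; the number of cameras, the feature dimensions, and the randomly initialized weights are illustrative assumptions, and in the full model the per-node features would be the RNN hidden states ht rather than random vectors.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical setup: 4 cameras, 16-dim features per camera at one timestamp.
rng = np.random.default_rng(0)
num_cameras, feat_dim, out_dim = 4, 16, 8
X = rng.normal(size=(num_cameras, feat_dim))   # x_i: features from camera v_i
W = rng.normal(size=(feat_dim, out_dim))       # learnable projection (random here)
a = rng.normal(size=(2 * out_dim,))            # learnable attention vector a

H = X @ W                                      # projected node features W x_i
H_new = np.zeros_like(H)
for i in range(num_cameras):
    # Attention logits e(i, j) = LeakyReLU(a^T [W x_i || W x_j]) over neighbors j
    logits = np.array([leaky_relu(a @ np.concatenate([H[i], H[j]]))
                       for j in range(num_cameras)])
    alpha = softmax(logits)                    # attention scores alpha(i, j)
    H_new[i] = alpha @ H                       # attention-weighted aggregation
print(H_new.shape)                             # (4, 8): updated per-camera representation
```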
The framework then includes the Transformer-Augmented Recurrent Neural Network (TARNN), shown in Fig. 2, which is designed to overcome the limitations of traditional RNNs in modeling both short- and long-term dependencies in temporal data. While RNNs, especially LSTMs and GRUs, are among the best models for sequence modeling, they struggle to learn long-range dependencies because of the vanishing gradient problem. The transformer module in the TARNN architecture addresses this by incorporating self-attention mechanisms that dynamically weight the importance of each timestamp within a sequence. This hybrid architecture ensures that the model learns both local temporal patterns through the RNN and global trends through the transformer, thereby enhancing the temporal representations that are vital for anomaly detection. The TARNN architecture processes temporal sequences from several camera feeds. Each input sequence X = {x1, x2, …, xT} consists of features extracted from camera data across T timestamps; these may represent object movement, trajectory patterns, or other behavioral information relevant to the anomaly detection task. Short-term dependencies between successive timestamps are modeled in the first stage by an RNN such as an LSTM. The hidden state at each timestamp t is computed via Eq. (6),
where ht is the hidden state at timestamp t, Wh and Wx are learnable weight matrices, and σ is a non-linear activation function (tanh). This hidden state ht summarizes a compact representation of the sequence up to timestamp t, capturing the short-term dynamics of the input. However, relying on the RNN alone for temporal modeling may prevent the model from accounting for long-range dependencies, which are crucial in many camera anomaly detection scenarios where anomalies evolve slowly or span long temporal windows. In that respect, the transformer module is introduced to model long-range dependencies through self-attention. The attention mechanism assigns a weight to each timestamp in the sequence depending on its relevance to the current timestamp. The self-attention score α(i, j) between timestamps i and j is computed via Eqs. (7) and (8),
where Wq and Wk are learnable projection matrices for the query and key representations, respectively, and d is the dimensionality of the hidden states. The dot product between the projected hidden states Wq·hi and Wk·hj yields the attention logits e(i, j), which indicate the relevance of timestamp j to timestamp i. These logits are normalized by the softmax function to obtain the attention weights α(i, j), which determine how much the model attends to each timestamp when updating the representation of timestamp i. Once the attention weights are computed, the model updates the hidden states using a weighted sum of the value representations across all timestamps via Eq. (9):
where Wv is a learnable matrix that projects the value representation of timestamp j. Through this equation, the transformer module aggregates information across all timestamps, allowing the model to grasp long-term dependencies that the RNN alone may miss. The updated hidden state hi′ now carries information from the short-term RNN and the long-range dependencies of the transformer, giving a stronger representation of the temporal context. The combination of the RNN and transformer mechanisms thus allows TARNN to balance local and global temporal information in a multi-camera environment, a necessity in anomaly detection. While the RNN effectively captures short-term correlations, such as abrupt changes in object behavior, the transformer handles subtle long-term dependencies involving gradual deviations from normal patterns across temporal instances. This synergy ensures that both immediate and long-term anomalies are detected with high accuracy. One of the major reasons for choosing TARNN is its ability to dynamically focus on different timestamps within the sequence, an option not offered by traditional RNN architectures: it weights the relevance of timestamps via the self-attention mechanism rather than simply processing them in order, as a regular RNN does. This dynamic weighting is very important in anomaly detection, where anomalies may not strictly follow a sequential pattern and can appear irregularly. Due to its hybrid structure, TARNN also generalizes well across different temporal scales, which matters in surveillance scenarios where anomalies differ in both duration and frequency. TARNN is complementary to the other methods, such as the Temporal Graph Attention Network, which captures spatiotemporal relationships across cameras, whereas TARNN focuses on the temporal aspect of anomaly detection; together these models provide a comprehensive solution for multi-camera surveillance problems where both spatial and temporal dependencies must be modeled for accurate anomaly detection. While TGAT embraces dynamic camera relationships, TARNN shines in capturing the temporal evolution within individual or combined camera feeds, further enhancing the overall robustness of the anomaly detection system.
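As a hedged illustration of the self-attention update in Eqs. (7)-(9), the following NumPy sketch applies scaled dot-product attention to a set of RNN hidden states; the sequence length, hidden dimension, and the random projection matrices Wq, Wk, and Wv are placeholders rather than trained parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical setup: T = 6 timestamps, RNN hidden states of dimension d = 8.
rng = np.random.default_rng(1)
T, d = 6, 8
H = rng.normal(size=(T, d))        # h_t from the RNN stage (placeholder values)
Wq = rng.normal(size=(d, d))       # learnable query projection
Wk = rng.normal(size=(d, d))       # learnable key projection
Wv = rng.normal(size=(d, d))       # learnable value projection

Q, K, V = H @ Wq, H @ Wk, H @ Wv
logits = (Q @ K.T) / np.sqrt(d)    # e(i, j): scaled dot-product attention logits
alpha = softmax(logits, axis=-1)   # alpha(i, j): attention weights over timestamps
H_prime = alpha @ V                # h_i' = sum_j alpha(i, j) * Wv h_j
print(H_prime.shape)               # (6, 8): hidden states enriched with long-range context
```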
The MVAE is further designed for data fusion, merging data from multiple heterogeneous sources into one unified latent space. The joint latent representation captures complementary information across these modalities, robustly facilitating anomaly detection even in the presence of noisy or missing data. The model leverages the VAE framework to encode each modality into a shared latent space while ensuring that the salient features of each input source are preserved in the learned representation. Following the VAE formulation, the approximate posterior distribution over the latent variable for modality m is a Gaussian parameterized by a mean µm and variance Σm via Eq. (10), where µm(xm) and Σm(xm) are the outputs of the encoder network for modality m. These latent representations are combined across all modalities into a joint latent space, either by averaging the posterior distributions or by concatenating the latent vectors, depending on the fusion strategy chosen. This fusion aims to learn shared representations that maximize the relevant information across all input modalities. The joint latent representation z has a Gaussian prior, p(z) = N(0, I), and serves as the input to a shared decoder network D, which attempts to reconstruct the input data of each modality. The evidence lower bound objective drives this reconstruction process, as it maximizes the likelihood of the observed data while minimizing the divergence between the approximate posterior and the prior. The ELBO is given via Eq. (11),
where the first term measures how well the latent variable z can reconstruct the input x, and the second term is the KL divergence, which regularizes the latent space by forcing the approximate posterior q(z|x) to be close to the prior distribution p(z). The trade-off between reconstruction accuracy and the smoothness of the latent space is controlled by a hyperparameter β. One of the major strengths of the MVAE is its ability to deal with missing or incomplete modalities. During training or inference, if some modalities are unavailable, for instance because of malfunctioning cameras or loss of audio data, the other modalities can still provide meaningful representations inside the shared latent space. This robustness follows from the fusion mechanism, which does not rely on every modality being available at all times. Because the MVAE learns cross-modal correlations during training, it can infer missing data and perform well under imperfect data conditions. For a given modality m, the reconstruction error is defined via Eq. (12),
where D(Em(xm)) is the reconstructed output for modality m obtained from the latent representation. High reconstruction errors, especially for the modalities most relevant to the anomaly detection task, signal anomalies and deviations from normal behavior. The MVAE was selected because it can fuse multimodal data into one joint latent space, a capability that is crucial in real anomaly detection tasks and complex sensor networks. Traditional unimodal models miss the contextual information present across different modalities and mostly deliver suboptimal performance. The MVAE, on the other hand, integrates data from multiple modalities, enhancing the robustness and accuracy of the anomaly detection system. By modeling data probabilistically, the MVAE can also quantify uncertainty in its predictions, which is useful when dealing with noisy or unreliable sensor data. The MVAE further complements TGAT and TARNN in the overall system architecture by addressing the challenge of multimodal data fusion. While TGAT focuses on capturing the spatial and temporal dependencies between camera feeds, and TARNN extends this capability by enhancing the modeling of long-term dependencies in temporal data, the MVAE places multimodal integration at the forefront. Together, these models provide a solution to anomaly detection in complex environments where both spatiotemporal relationships and multimodal data must be leveraged to achieve high detection accuracy. Next, Prototypical Networks address few-shot learning by embedding data points into a metric space where classification is conducted based on their proximity to prototype representations of each class. In anomaly detection, query instances may correspond to unknown events or behaviors, and they are classified as normal or anomalous by measuring their distance to prototypes obtained from a few labeled examples. The model is well suited to anomaly detection because it can generalize from few labeled data, whereas obtaining large annotated datasets is usually impracticable. Prototypical Networks define a prototype for each class as the mean of its support set embeddings. For a set of labeled support examples S = {(x1, y1), (x2, y2), …, (xN, yN)}, where yi ∈ {1, …, K} is the class label and xi is the feature vector of the i-th example, the model first embeds the inputs using a shared embedding function fθ. The prototype ck for class k is then computed as the mean embedding of all examples belonging to class k via Eq. (13),
where Sk is the set of support examples with label k, and fθ(xi) is the embedded feature of example xi. The prototypes are the central points in the embedding space and capture the gist of each class. Once the prototypes are computed, any query instance xq is classified based on its distance to these prototypes.
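The prototype computation and nearest-prototype assignment can be sketched in a few lines of NumPy, as below; the embedding dimensionality, the five-shot support sets, and the randomly drawn vectors are illustrative assumptions standing in for the output of the learned embedding function fθ.

```python
import numpy as np

# Hypothetical embeddings f_theta(x): 128-dim, 5 support examples per class
# (class 0 = normal, class 1 = anomalous), produced here by a random stand-in.
rng = np.random.default_rng(2)
emb_dim, shots = 128, 5
support = {k: rng.normal(loc=k, size=(shots, emb_dim)) for k in (0, 1)}

# Eq.-(13)-style prototypes: class-wise mean of the support embeddings.
prototypes = {k: v.mean(axis=0) for k, v in support.items()}

# Classify a query embedding by squared Euclidean distance to each prototype.
query = rng.normal(loc=0.2, size=(emb_dim,))
dists = {k: float(np.sum((query - c) ** 2)) for k, c in prototypes.items()}
pred = min(dists, key=dists.get)   # nearest prototype wins
print(dists, "-> predicted class:", pred)
```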
In the context of anomaly detection, a query instance may correspond to novel and potentially anomalous behavior, which the model must classify as normal or anomalous. The distance metric commonly used in Prototypical Networks is the squared Euclidean distance, given via Eq. (14),
The query instance is assigned to the class of the nearest prototype, i.e., the class k that minimizes the distance d(xq, ck). For binary classification (normal vs. anomalous), a query instance is classified as anomalous if its distance to the prototype of normal behavior exceeds a threshold determined during training. Mathematically, the prediction is given via Eq. (15),
where y′q represents the predicted label for the query instance xq. Training of the Prototypical Networks involves minimizing a classification loss based on the negative log-probability of the correct class. Since the model uses a softmax over distances to prototypes, the probability of assigning query instance xq to class k is given via Eq. (16),
The model's objective is to minimize the negative log-likelihood over all query instances in the training set via Eq. (17),
This loss function forces the model to learn an embedding space in which query instances lie close to their respective class prototypes, improving classification accuracy. The learned prototypes generalize seamlessly to unseen instances, making Prototypical Networks one of the most effective few-shot learning models, which is exactly the situation in anomaly detection, where labeled anomalies are typically scarce. One of the main reasons for selecting Prototypical Networks is their ability to generalize from limited labeled data, a major requirement in anomaly detection: in real-world datasets anomalies are few, and collecting a large, diverse set of labeled anomalies is often impractical. From only a few labels, Prototypical Networks learn prototypes representing normal and anomalous behavior and can detect unseen anomalies at inference time. This is a key advantage over traditional supervised approaches, which require large amounts of labeled data to achieve high accuracy. Moreover, Prototypical Networks complement the other methods in the proposed architecture with respect to few-shot learning. While TGAT and TARNN capture spatiotemporal dependencies and model long-term temporality, respectively, Prototypical Networks address the data scarcity problem in anomaly detection. They allow lightweight and efficient classification of novel events without heavy model retraining or a large labeled dataset, ensuring that the overall system is robust to variations in the availability of training data and generalizes effectively to new, unseen anomalies. Finally, a Spatiotemporal Autoencoder is integrated for unsupervised anomaly detection; it learns normal patterns of spatial and temporal behavior in a multi-camera surveillance system and uses the reconstruction error of the input data as an anomaly score. It is particularly effective in complex environments where anomalies deviate from the learned patterns of normal behavior. The proposed autoencoder follows a classic autoencoder design but incorporates both convolutional layers and LSTM layers to capture spatial and temporal dependencies. This architecture ensures that the autoencoder can handle the rich spatiotemporal information generated by multiple camera feeds and detect anomalies as deviations along both dimensions. In the Spatiotemporal Autoencoder, a sequence of spatiotemporal features from the camera feeds is taken as input and represented as X = {x1, x2, …, xT}, where each xt is a spatial feature map corresponding to the frame at timestamp t. The encoder consists of convolutional layers followed by LSTM layers. The convolutional layers apply spatial filters to capture local spatial patterns, as described in Eq. (18),
where Wconv is the convolutional filter, ∗ is the convolution operator, and bconv is the associated bias. A ReLU introduces non-linearity into the network, allowing more complex spatial features to be captured. hspatial(t) is the feature map obtained after capturing the spatial dependencies at timestamp t. The generated spatial feature maps are then fed into an LSTM network to model temporal dependencies. The LSTM models the temporal relationships within the spatial feature sequence and produces a hidden state htemporal(t) at every timestamp. The LSTM update is defined via Eq. (19),
where Wh and Wx are learnable weight matrices, bh is the bias term, and f is a non-linear activation function applied to the combined spatial and temporal features. The hidden state htemporal(t) now encodes both spatial and temporal information, effectively capturing the dynamics of the multi-camera system. The decoder mirrors the encoder: it first decodes the LSTM outputs into spatial feature maps, which are then fed into a series of deconvolutional layers to reconstruct the original input sequence. The model attempts to generate the reconstruction X′ = {x′1, x′2, …, x′T} of the input based on the spatiotemporal patterns it has learned. The reconstruction error for each timestamp is computed as the difference between the original input and the reconstructed output, expressed via Eq. (20):
where rt represents the reconstruction error at timestamp t, computed as the squared Euclidean norm of the difference between the actual and reconstructed spatial feature maps. A high reconstruction error means the input pattern lies far from the learned normal behavior and hence indicates an anomaly. The overall anomaly score for the entire sequence is calculated as the sum of the reconstruction errors across all timestamps via Eq. (21),
This value is then compared against a predefined threshold τ, and the anomaly decision rule is defined via Eq. (22),
This formulation ensures that instances with high reconstruction errors, i.e., inputs the model cannot reconstruct accurately from the normal patterns it has learned, are marked as anomalies. The selection of the threshold τ is critical; it may be chosen empirically depending on the desired balance between false positives and false negatives. The use of a Spatiotemporal Autoencoder in this anomaly detection framework is justified by its ability to learn unsupervised representations of normal behavior. Unlike supervised models, which require labeled anomaly data, the autoencoder is trained on normal data only and is therefore appropriate in environments where anomalous events are very rare or hard to annotate. Moreover, the convolutional layers enable the model to learn fine-grained spatial patterns, such as object movements or interactions, while the LSTM layers model how those patterns evolve over time. This combination ensures that the spatial and temporal features of the input are clearly captured, making the model highly sensitive to minute aberrations in behavior. The proposed Spatiotemporal Autoencoder complements the other models in the system, such as the Temporal Graph Attention Network (TGAT) and the Transformer-Augmented RNN (TARNN), which address spatial and temporal relations among multiple cameras. While TGAT and TARNN specialize in modeling interactions across camera feeds and long-range dependencies, the Spatiotemporal Autoencoder is optimized for learning compact representations of normal behavior within individual or combined camera feeds. By focusing on the reconstruction of such behavior, it provides an effective mechanism for detecting deviations without requiring labeled anomaly samples.
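A minimal sketch of the reconstruction-error scoring described in Eqs. (20)-(22) is given below; the feature-map sizes, the injected deviation, and the median-based threshold are illustrative assumptions, since in the actual framework the errors come from the trained autoencoder and τ is tuned empirically.

```python
import numpy as np

def sequence_anomaly_score(original, reconstructed):
    """Per-timestamp squared error r_t and total sequence score (in the spirit of Eqs. 20-21)."""
    # original, reconstructed: arrays of shape (T, H, W) holding spatial feature maps
    r_t = np.sum((original - reconstructed) ** 2, axis=(1, 2))
    return r_t, float(r_t.sum())

# Hypothetical data: 10 frames of 8x8 feature maps and an imperfect reconstruction.
rng = np.random.default_rng(3)
X = rng.normal(size=(10, 8, 8))
X_rec = X + 0.05 * rng.normal(size=X.shape)   # small errors correspond to "normal" behavior
X_rec[7] += 1.5                               # inject a large deviation at frame 7 (an "anomaly")

r_t, total = sequence_anomaly_score(X, X_rec)
tau = 2.0 * np.median(r_t) * len(r_t)         # illustrative threshold; tau is chosen empirically in practice
print("per-frame errors:", np.round(r_t, 2))
print("sequence score:", round(total, 2), "anomalous:", total > tau)
```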
The framework uses loss functions designed specifically to optimize each component, improving the overall anomaly detection process. The Temporal Graph Attention Network (TGAT) and Transformer-Augmented RNN (TARNN) use a cross-entropy classification loss to train the model for high anomaly detection accuracy. The Multimodal Variational Autoencoder uses the Evidence Lower Bound (ELBO) loss, combining a reconstruction loss (mean squared error for the individual modalities) with a Kullback-Leibler (KL) divergence term that regularizes the shared latent space; this supports rich multimodal data fusion even in the presence of missing or noisy inputs. The negative log-probability loss in the Prototypical Networks minimizes the distance between query samples and class prototypes in the embedding space, enabling effective few-shot learning. Finally, the Spatiotemporal Autoencoder applies a reconstruction loss in the form of mean squared error over spatial and temporal features, detecting anomalies by measuring deviations from normal behavior patterns. Together, these loss functions handle the spatial, temporal, and multimodal dependencies inside the framework while addressing data scarcity, noisy patterns, and generalization to unknown anomalies. The following sections evaluate the performance of the proposed model with different metrics and compare it with existing models in each scenario.
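As one concrete example of these objectives, the following sketch computes a β-weighted negative ELBO for a single modality, combining the mean-squared reconstruction term with the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior; the tensor shapes and the random encoder/decoder outputs are placeholders, not the trained MVAE.

```python
import numpy as np

def elbo_loss(x, x_rec, mu, log_var, beta=1.0):
    """Negative ELBO: reconstruction MSE plus beta-weighted KL(q(z|x) || N(0, I))."""
    recon = np.mean((x - x_rec) ** 2)
    # Closed-form KL for a diagonal Gaussian vs. the standard normal prior,
    # averaged over the batch and latent dimensions.
    kl = -0.5 * np.mean(1 + log_var - mu ** 2 - np.exp(log_var))
    return recon + beta * kl

# Hypothetical encoder/decoder outputs for one modality (e.g., video features).
rng = np.random.default_rng(4)
x = rng.normal(size=(32, 64))                 # batch of inputs
x_rec = x + 0.1 * rng.normal(size=x.shape)    # decoder reconstruction
mu = rng.normal(scale=0.1, size=(32, 16))     # posterior means from the encoder
log_var = rng.normal(scale=0.1, size=(32, 16))

print(elbo_loss(x, x_rec, mu, log_var, beta=0.5))
```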
Comparative result analysis
The experimental design of this paper investigates the performance of the proposed models on different camera surveillance tasks. Experiments were conducted on a large-scale, multimodal dataset designed to represent both normal and abnormal behavior in realistic environments. The dataset provides video feeds from 12 synchronized cameras installed over an area of 5,000 square meters, complemented by motion sensor data and audio. Each camera captures 30 frames per second at a resolution of 1920 × 1080 pixels, providing high-quality input for spatial modeling. The motion sensor data are recorded at 50 Hz, while the audio inputs are sampled at 16 kHz. Overall, the dataset comprises over 500 h of video, of which approximately 2% was labeled for anomalous events, including unauthorized access, suspicious activities, and abnormal movement patterns. The dataset was divided into training, validation, and test sets in a 70%, 15%, and 15% ratio, respectively, and for robustness the anomalies were randomly scattered across several test scenarios. The UCSD Pedestrian Dataset was also chosen for the experimental evaluation, since it is widely used for anomaly detection in real-world surveillance applications. It contains video sequences captured by a stationary camera overlooking an outdoor pedestrian walkway. The dataset is divided into two subsets, Ped1 and Ped2, containing 34 and 16 video sequences, respectively, recorded at 30 frames per second. The resolution of Ped1 is 158 × 238 pixels and that of Ped2 is 240 × 360 pixels. The sequences depict people engaged in normal activities such as walking, as well as anomalous events such as cyclists, skateboarders, or vehicles entering the walkway. The dataset is fully annotated frame by frame, with fine-grained labels for the presence of anomalies. Because it involves surveillance footage in which anomalies occur naturally, the UCSD Pedestrian Dataset is well suited for testing models of unusual behavior in crowded environments; given its wide use in anomaly detection research, it serves as a benchmark for evaluating spatiotemporal models, multimodal approaches, and few-shot learning frameworks. For the model parameters, TGAT was configured with a graph whose nodes each represent one camera and whose spatiotemporal edges are updated every 5 frames, i.e., every 0.17 s. It used a graph attention mechanism with 8 attention heads, and the RNN had 128 hidden units. TARNN was initialized with 256 LSTM units and a 6-layer Transformer with 4 heads per layer in the self-attention mechanism. In the MVAE, a 64-dimensional latent space was used, with separate encoders for the video, audio, and motion data. The reconstruction loss was minimized with the Adam optimizer using a learning rate of 0.001.
In the Prototypical Networks, the number of support samples per class was set to 5, and the embedding space dimensionality was 128. The Spatiotemporal Autoencoder used 4 convolutional layers followed by 2 LSTM layers, and the empirically chosen threshold on the reconstruction error was set to 0.15 for anomaly identification. All models were trained on an NVIDIA A100 GPU with a batch size of 32 for 100 epochs each. Training was monitored through precision, recall, F1-score, and AUC, which together measure the ability of the models to detect both known and unseen anomalies. Several experiments were carried out under real-world conditions with various levels of missing or noisy data to test the robustness of the proposed fusion and anomaly detection approach. Extensive evaluations of the proposed models were conducted on the UCSD Pedestrian Dataset, focusing on the detection of anomalous events such as bicycles, skateboards, and vehicles entering pedestrian walkways. The results are compared with three benchmark methods3,9,14 using several performance metrics, including precision, recall, F1-score, and AUC. They show that for anomaly detection under both normal and challenging conditions, such as missing or noisy data, the proposed models achieve large improvements. The following section gives a detailed breakdown of the results to compare the effectiveness of the proposed models.
Training of the proposed framework followed a holistic approach to ensure robust performance across the various scenarios (Fig. 3). All dataset subsets contain proportional samples of normal and anomalous events and were split in the 70-15-15 ratio for training, validation, and testing. Supervised components, such as TGAT and TARNN, employed a cross-entropy loss with the Adam optimizer at a learning rate of 0.001, with early stopping to avoid overfitting. The Spatiotemporal Autoencoder was trained on the reconstruction loss and the MVAE on the Evidence Lower Bound, with the best configurations found through a hyperparameter search. In addition to the four metrics used to evaluate the models (precision, recall, F1-score, and Area Under the Curve (AUC)), validations were executed at various noise levels and missing-data conditions. The generalization capability and stability of the few-shot learning setting in the Prototypical Networks were assessed using k-fold cross-validation with k = 5. Optimizing the number of attention heads, the learning rates, and the latent-space dimensions based on validation-set performance produced a well-calibrated model for real-world anomaly detection challenges.
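For reference, the evaluation metrics listed above can be computed as in the following sketch, which assumes scikit-learn is available and uses synthetic labels and scores in place of actual model outputs.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

# Hypothetical ground-truth labels and model outputs for a held-out test split.
rng = np.random.default_rng(5)
y_true = rng.integers(0, 2, size=200)                                   # 1 = anomalous, 0 = normal
scores = np.clip(y_true * 0.6 + rng.normal(0.3, 0.2, size=200), 0, 1)   # anomaly scores
y_pred = (scores > 0.5).astype(int)                                     # thresholded decisions

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", zero_division=0)
auc = roc_auc_score(y_true, scores)                                     # AUC uses the raw scores
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
```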
The spatiotemporal modeling and multimodal integration capabilities of the proposed framework explain its ability to work in scenarios with large variations in camera field of view and environmental changes. TGAT adjusts dynamically to varying camera perspectives through attention mechanisms that emphasize the camera feeds most relevant to the spatiotemporal dependencies. For example, under different camera views, anomalies appearing in specific regions are correctly captured by TGAT through the learned relations among camera nodes. TARNN further enhances temporal modeling under changing conditions, such as shifting crowd densities or changes in lighting, by capturing short-term and long-term dependencies. When testing with cameras at angles with partial overlap, the model reached a maximum F1-score of 90.6%, showing that it generalizes well to different configurations.
Environmental changes in lighting and weather are two of the biggest challenges such a system may face. The robust multimodal fusion performed by the MVAE mitigates these challenges: by integrating data from video, audio, and motion sensors, it can compensate for degraded inputs in one modality, such as low visibility due to poor lighting, through complementary data from the other modalities. For instance, anomalies were still detected in low-light conditions at a rate of 87.8% thanks to the incorporation of audio and motion data. Performance is less sustainable in extreme conditions where multiple modalities degrade simultaneously, for example thick fog obscuring the cameras while heavy noise dampens the audio. There is thus still room for improvement, for instance by introducing domain adaptation techniques to train the model for environmental variations or by improving sensor resilience to environmental noise. Despite these challenges, the model's overall performance across scenarios shows adaptability and potential for real-world deployment in dynamic surveillance systems.
In Table 2, the proposed model combining TGAT with TARNN provides an F1-score of 90.6%, outperforming the benchmark methods by a large margin. Method3, purely based on classical anomaly detection techniques, is not effective at modeling temporal dependencies and reaches an F1-score of 82.9%. Although the method in9 has very high recall, its overall performance is adversely affected by low precision, whereas method14 is more consistent but remains far behind the proposed approach. Figure 4 graphically illustrates the comparative performance of the proposed model and the benchmark methods, highlighting the significant improvement in F1-score achieved by TGAT combined with TARNN.
Table 3 shows that the AUC of the proposed MVAE + TGAT model is 95.2%, reflecting superior detection accuracy. In contrast, methods3,14, being unimodal techniques, have considerably lower AUC values since they cannot fully exploit multimodal data. Similarly, method9 also performs worse in this respect, indicating an inability to handle the rich spatial and temporal dependencies present in the dataset.
Table 4 provides the anomaly detection results when noisy input data are considered. With noisy input, the proposed Spatiotemporal Autoencoder achieves an F1-score of 88.2%, proving to be very robust. In contrast, methods3,9,14 show reduced performance, especially in precision, highlighting their low robustness to imperfect data. The use of both convolutional and LSTM layers in our model enables it to keep reconstructions accurate even with degraded inputs.
Table 5 compares the performance of the proposed Prototypical Networks model in a few-shot learning setup. The proposed model yields an accuracy of 87.9%, performing well in comparison with the benchmark methods. This demonstrates its ability to generalize from just a few labeled examples, which is very important in anomaly detection tasks where labeled anomalies are scarce. Method14 fares better than3,9 but remains behind the proposed approach due to its reliance on classical classification techniques.
Table 6 compares the inference latency of the different models. Although the proposed TGAT + TARNN model has a higher inference time of 120 ms than Method3 at 105 ms, it is competitive with the other benchmark methods, especially given its significant improvements in detection accuracy. The proposed model is more complex than the others, but it also provides far more reliable results and can therefore reasonably be applied to real-time anomaly detection in multi-camera surveillance systems.
In Table 7, the proposed MVAE + Spatiotemporal Autoencoder combination achieves the lowest false positive rate of 6.5%, against significantly higher rates for methods3,9,14. Such a reduction in false positives matters for deploying anomaly detection systems in real-world environments, since frequent false alarms can severely degrade operational efficiency. Together, these results demonstrate the advantage of the proposed models over traditional methods in terms of accuracy, robustness to noisy data, and generalization in few-shot settings. By combining advanced temporal modeling, multimodal fusion, and unsupervised learning, the proposed framework outperforms state-of-the-art methods on all key metrics, confirming its effectiveness for real-world anomaly detection.
The few-shot learning and unsupervised components of the model allow it to generalize to previously unseen events when new classes of anomalies appear after training. The Prototypical Networks are instrumental here: they classify anomalies by their proximity to learned prototypes in the embedding space, so even when only a few labeled examples of a new anomaly class are available, the model adapts its prototype representations and classifies the new anomalies with high accuracy. In addition, the Spatiotemporal Autoencoder supports unsupervised detection by identifying deviations from normal behavior patterns, flagging entirely novel anomalies without requiring labeled data. In test scenarios where new anomaly types, such as unusual group behaviors or novel environmental disturbances, were introduced after training, the model achieved 86.4% detection accuracy, demonstrating its adaptability. Its performance may nevertheless deteriorate if newly encountered anomaly types differ substantially from the anomaly or normal-behavior distributions learned during training. Continual learning strategies would greatly strengthen the framework's ability to absorb new anomaly classes without full retraining. This can be achieved through model fine-tuning, memory-augmented networks, or elastic weight consolidation, which integrate new data dynamically while retaining knowledge of previously learned behaviors. For example, the Prototypical Networks could update their embeddings incrementally as new anomaly types are discovered, improving classification accuracy over time; a minimal sketch of such an incremental prototype update follows below. The self-supervised learning mechanism in the Spatiotemporal Autoencoder would likewise support adaptation of its reconstruction capabilities to changing environmental patterns. Such strategies would improve scalability in dynamic real-world environments, reduce the operational overhead of retraining, and make the framework more practical for long-term deployment in complex surveillance systems.
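The sketch below shows one simple way the incremental prototype update mentioned above could work: when a few embeddings of a newly discovered anomaly class arrive, its prototype is refreshed as a running mean instead of retraining the embedding network. The class names and dimensions are hypothetical.

```python
# Hedged sketch of an incremental, running-mean prototype update (illustrative).
import torch

class PrototypeStore:
    def __init__(self):
        self.protos = {}      # class id -> prototype embedding
        self.counts = {}      # class id -> number of examples seen so far

    def update(self, class_id, embeddings):
        # Running-mean update so earlier knowledge is retained without retraining.
        new_mean = embeddings.mean(dim=0)
        n_new = embeddings.size(0)
        if class_id not in self.protos:
            self.protos[class_id], self.counts[class_id] = new_mean, n_new
        else:
            n_old = self.counts[class_id]
            self.protos[class_id] = (
                self.protos[class_id] * n_old + new_mean * n_new) / (n_old + n_new)
            self.counts[class_id] = n_old + n_new

store = PrototypeStore()
store.update("loitering", torch.randn(5, 32))   # initial few-shot class ("loitering" is hypothetical)
store.update("loitering", torch.randn(3, 32))   # later examples refine the prototype
print(store.counts["loitering"])                # 8
```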
Quantitative and qualitative results
1. Quantitative comparison with existing methods.
This comparison evaluates the proposed framework against several benchmark methods, ranging from traditional anomaly detection models to recent approaches using attention-guided networks and variational autoencoders. On the UCSD Pedestrian Dataset, the proposed TGAT + TARNN model achieved precision and recall of 92.5% and 90.1%, respectively, and an F1-score of 91.3%, whereas the best benchmark method reached 83.6%, 81.9%, and 82.7%. The MVAE + Spatiotemporal Autoencoder combination achieved an AUC of 96.4% on UCSD Ped2, against 88.7% for the second-best method. These results quantify the benefits of the proposed models, particularly in detecting rare and subtle anomalies more accurately and reliably.
2. Robustness under noisy and incomplete data.
Robustness was tested on noisy and incomplete inputs: 20% of video frames were degraded, audio quality was reduced, and motion sensor data were intermittently dropped. Even in this setting, the Spatiotemporal Autoencoder maintained an F1-score of 89.1% and an anomaly detection accuracy of 87.8% based on reconstruction error, compared to 76.5% for conventional autoencoders. This robustness is attributable to the combination of convolutional layers, which extract spatial features, and LSTM layers, which model temporal information, allowing accurate detection even when parts of the input are compromised.
3. Few-shot learning capability.
The Prototypical Networks were evaluated on a few-shot learning task with as few as 10 labeled examples of anomalies. Anomaly classification accuracy reached 89.2%, well above the 75.8% achieved by traditional supervised methods. Generalization to unseen anomaly classes was then evaluated, with accuracy holding at 86.4%. This confirms the flexibility of the framework in scenarios with limited labeled data, a prime requirement for real-world anomaly detection systems where large-scale annotation is impractical.
4. Qualitative findings using attention scores.
Qualitative analysis of the TGAT attention scores revealed how the model focused on key camera feeds and time windows. For instance, within a 20–30 s window, an abnormal activity in which a cyclist entered the pedestrian zone was detected with an attention score of 0.88 for the corresponding camera view, while normal activities such as people walking received lower scores in the range of 0.40 to 0.55. This interpretability helps security personnel understand not only that an anomaly occurred but also where and under what conditions, supporting better operational decisions; a small illustrative helper for turning such scores into an operator report is sketched after this list.
5. Overall system efficiency and false positive rates.
The framework's inference time was benchmarked on real-time data streams, with an average processing time of 115 ms per frame, making it suitable for real-time anomaly detection. The false positive rate was low, with only 6.1% of normal events classified as anomalous, compared to 14.3% for competing models; this reduction is critical for minimizing unnecessary alerts in surveillance systems. In addition, qualitative feedback from simulated deployments indicated that the MVAE's multimodal fusion was instrumental in identifying complex anomalies spanning multiple sensor modalities, such as audio and motion inconsistencies coinciding with suspicious video activity. These results underscore the framework's practical applicability and reliability in dynamic, real-world environments.
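As referenced in item 4 above, the small helper below illustrates how per-camera attention scores over time windows could be turned into an operator-readable report. The threshold, window length, and score values are assumptions chosen to mirror the example in the text, not outputs of the actual system.

```python
# Illustrative helper: convert per-window, per-camera attention scores into alerts.
import numpy as np

def attention_report(attn, window_s=10, threshold=0.8, camera_names=None):
    # attn: (num_windows, num_cameras) attention scores, one row per time window
    camera_names = camera_names or [f"Camera {i + 1}" for i in range(attn.shape[1])]
    report = []
    for w, row in enumerate(attn):
        cam = int(row.argmax())
        if row[cam] >= threshold:
            start, end = w * window_s, (w + 1) * window_s
            report.append(f"{start}-{end}s: {camera_names[cam]} "
                          f"(attention {row[cam]:.2f}) flagged for review")
    return report

attn = np.array([[0.45, 0.50, 0.52, 0.40, 0.43],    # 0-10 s: normal activity
                 [0.42, 0.48, 0.55, 0.44, 0.41],    # 10-20 s: normal activity
                 [0.30, 0.35, 0.88, 0.32, 0.28]])   # 20-30 s: cyclist near Camera 3
print("\n".join(attention_report(attn)))
```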
Complexity analysis
Despite its comprehensive architecture, the proposed framework demonstrates competitive computational efficiency compared with state-of-the-art models. At an average inference time of 115 ms per frame sequence, it strikes an appropriate balance between accuracy and time complexity for real-time surveillance applications. Attention-guided networks and traditional variational autoencoder-based models have inference times of around 125 ms and 132 ms, respectively, owing to less optimized fusion mechanisms and temporal modeling. The additional complexity introduced by TGAT and TARNN is offset by their dynamic focus on the most relevant spatiotemporal relationships, which reduces redundant computation and improves detection accuracy. Training requires about 14 h on an NVIDIA A100 GPU for a 500-hour dataset, slightly more than simpler models such as CNN-LSTM hybrids (11 h), but the few-shot learning and multimodal integration capabilities make the proposed framework more scalable. This balance between complexity and performance allows the framework to outperform existing methods in both speed and detection accuracy while remaining deployable in real-world scenarios.
Both training and deployment of the proposed framework are computationally demanding, since the approach comprises multiple components and must process large-scale multimodal data. Training requires a dataset of more than 500 h of video, audio, and motion data with high-resolution video streams, such as 1920 × 1080 pixels at 30 fps, so an NVIDIA A100 or equivalent GPU with at least 40 GB of VRAM is needed for effective batch processing. Training typically takes 14–16 h per model variant and consumes considerable memory for intermediate feature maps and latent representations in components such as the MVAE and the Spatiotemporal Autoencoder. The system can run on GPU-equipped edge-computing setups, for example NVIDIA Jetson Xavier or TensorRT-optimized models, although this may involve trade-offs in processing latency because of the complexity of modules such as TGAT. To balance resource demands and scalability, the framework uses techniques such as model pruning, quantization, and distributed inference pipelines, allowing it to operate efficiently both in high-performance data centers and in resource-constrained environments; a sketch of one such quantization step is shown below.
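The sketch below shows one common realization of the quantization step mentioned above: post-training dynamic quantization of linear and recurrent layers with PyTorch, which typically reduces model size and CPU latency for edge deployment. The toy model is a stand-in, not the proposed TGAT/TARNN architecture.

```python
# Hedged sketch: dynamic int8 quantization of a toy temporal classifier head.
import torch
import torch.nn as nn

class ToyTemporalHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(256, 128, batch_first=True)
        self.cls = nn.Linear(128, 2)                 # normal vs. anomalous

    def forward(self, x):                            # x: (B, T, 256)
        out, _ = self.lstm(x)
        return self.cls(out[:, -1])                  # score from the last time step

model = ToyTemporalHead().eval()
quantized = torch.quantization.quantize_dynamic(
    model, {nn.LSTM, nn.Linear}, dtype=torch.qint8)  # int8 weights at inference

x = torch.randn(1, 30, 256)                          # one 30-step feature sequence
with torch.no_grad():
    print(model(x).shape, quantized(x).shape)        # both torch.Size([1, 2])
```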
Extended training analysis
Testing the model on a wider range of datasets provides strong evidence of its generalizability to varied surveillance scenarios. Beyond the UCSD Pedestrian Dataset, evaluations were carried out on the Avenue, ShanghaiTech, and Mall datasets, each presenting unique challenges. On the Avenue dataset, which is known for irregular pedestrian behaviors, the proposed framework achieved 89.2% precision, 90.8% recall, and an F1-score of 90.0%, outperforming all baselines by 8 to 12%. The ShanghaiTech dataset, with its challenging environmental conditions and varying crowd densities, verified the model's ability to generalize across different urban configurations; the framework reached an AUC of 94.6%, demonstrating resilient anomaly detection under highly dynamic conditions. The Mall Dataset, which focuses on crowd monitoring in commercial spaces, revealed the model's capability to detect subtle behavioral anomalies, with an accuracy of 91.4% that clearly outperforms existing spatiotemporal methods. These results demonstrate the framework's ability to adapt to diverse datasets and to variations in environmental, temporal, and spatial complexity. For instance, on the ShanghaiTech dataset, the Spatiotemporal Autoencoder detected anomalies such as abnormal crowd congestion at a much lower false positive rate of 7.2%, compared with 13.5% for benchmark methods. Similarly, the Multimodal Variational Autoencoder remained robust to noise and missing data, achieving an F1-score of 88.7% on the Mall Dataset when the video inputs were degraded. This cross-dataset validation of generalization to new environments and scenarios points to high applicability in real-world settings, and the consistently high performance metrics show that the proposed architecture is flexible and robust, making it well suited for deployment in complex, multi-camera surveillance systems. The subsequent sections discuss a practical, step-by-step use case for the proposed model, which helps illustrate the whole process across different scenarios. Figure 5 shows the performance comparison on the UCSD Ped2 dataset, demonstrating the robustness of the proposed model in detecting anomalies across sample data.
Practical use case scenario analysis
For this analysis, a practical scenario was chosen involving multiple surveillance cameras for anomaly detection. The dataset consists of temporal instance sets of recorded video feeds, motion sensor data, and audio inputs, and each feed is processed by the proposed models to extract relevant features and indicators. The video runs at 30 fps with a resolution of 1920 × 1080 and also captures temporal features, while the motion sensors and audio were sampled at 50 Hz and 16 kHz, respectively. Anomalous events such as unauthorized entries and unusual movement patterns appear at different frames according to their timestamps. The sample outputs for each model, presented in tabular form, further illustrate how the different components interact within the overall architecture. The UCSD Pedestrian Dataset is filmed by Cameras 1 through 5, positioned to provide complementary views over different sections of a pedestrian walkway and thereby capture varied spatial and temporal patterns. Camera 1 looks down the entrance of the walkway, focusing on subjects entering the space, while Camera 2 is centered to monitor crowd density and flow within it. Camera 3 is near a bicycle lane that sometimes spills over into the pedestrian area, making it critical for detecting anomalies such as cyclists or vehicles. Camera 4 views a seating area where people often congregate, and Camera 5 covers an exit and provides a count of pedestrians leaving. Instances 1 through 5 correspond to specific query events across these camera feeds. For example, Instance 1 captures a pedestrian walking normally in Camera 1, while Instance 3 records an anomalous event in Camera 3 where a cyclist enters the pedestrian lane. Cameras 4 and 5 record normal and abnormal pedestrian behavior in Instances 4 and 5, respectively. These variations in camera placement and instances enable the system to detect a wide range of both normal and anomalous activities in the environment.
As Table 8 shows, the TGAT output is a set of attention weights for each camera node, indicating which camera feeds are most relevant over the temporal instances. The long-term dependency scores reflect the temporal dependencies captured across frames over multiple timestamps. The anomaly in Camera 3 between 20 and 30 s received high attention weights and dependency scores.
Table 9 lists the TARNN outputs: short-term and long-term temporal patterns are captured by the LSTM hidden states and the transformer attention weights, respectively. The combined temporal score is then used to classify anomalies, and once again Camera 3 shows anomalous activity in the 20–30 s window in the form of elevated scores.
Table 10 shows how the Multimodal Variational Autoencoder combines the latent representations of video, audio, and motion data in a single joint latent space. Because the reconstruction error measures the difference between the original input and its reconstructed version, it serves as an anomaly indicator, and the high reconstruction error in the 20–30 s window again points to an anomaly.
Table 11 reports the outputs of the Prototypical Networks model, which computes the distances between query instances and the learned normal and anomalous prototypes. Instances that lie closer to the anomalous prototype, such as Instances 3 and 5, are classified as anomalies. This model provides the ability to generalize to unseen anomalies using only a few labeled examples.
Table 12 splits the reconstruction error of the Spatiotemporal Autoencoder into its spatial and temporal components, with the overall reconstruction error obtained by summing the two. High reconstruction errors, particularly within the 20–30 s window, indicate an anomaly because the autoencoder cannot reconstruct the anomalous sequence well.
Table 13 summarizes all the model outputs and provides the final decision based on the consensus of the individual methods. All models flag an anomalous condition in the 20–30 s window, and relying on the agreement of several models makes anomaly detection markedly more robust and accurate. This step-by-step analysis of each model and its outputs shows how the proposed system detects anomalies across different cameras and multimodal data by exploiting spatiotemporal dependencies, few-shot learning, and unsupervised anomaly detection, thereby improving detection accuracy and generalization in complex scenarios. A simple illustrative consensus rule is sketched below.
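The snippet below sketches one plausible consensus rule of the kind just described: each component produces a normalized anomaly score per time window, and a window is flagged only when a majority of components agree. The score values, thresholds, and voting rule are assumptions in the spirit of Tables 8–13, not the paper's exact fusion procedure.

```python
# Illustrative majority-vote consensus across component anomaly scores.
def consensus(scores, thresholds, min_votes=3):
    # scores: model name -> per-window score; thresholds: per-model cutoffs
    votes = {name: scores[name] >= thresholds[name] for name in scores}
    return sum(votes.values()) >= min_votes, votes

window_20_30s = {              # hypothetical values for the 20-30 s window
    "tgat_attention": 0.88,
    "tarnn_temporal": 0.91,
    "mvae_recon_error": 0.84,
    "proto_distance": 0.79,
    "stae_recon_error": 0.86,
}
thresholds = {name: 0.75 for name in window_20_30s}
is_anomaly, votes = consensus(window_20_30s, thresholds)
print(is_anomaly, votes)       # True, all five components vote anomalous
```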
Extended analysis
The proposed models were further tested on the UCSD Pedestrian Dataset and other public datasets for anomaly detection in multi-camera surveillance environments. Their performance was compared against three benchmark methods4,8,15 using key metrics such as precision, recall, F1-score, and area under the curve (AUC). The following tables present the detailed comparison and analysis of the results.
Table 14 shows that the proposed TARNN combined with TGAT achieves the highest F1-score of 90.7%, reflecting its ability to capture both spatial and temporal dependencies within and between multiple camera feeds. Method4 gives a competitive F1-score of 83.6% but is weaker in precision, Method8 obtains a higher recall of 86.5% at the cost of precision, and Method15 is the weakest overall, particularly in precision, with a modest F1-score of 79.6%.
Table 15 reports the AUC for anomaly detection on the UCSD Ped2 dataset. The MVAE with TGAT achieves an AUC of 95.4%, showing excellent performance as an anomaly detector. All benchmark methods4,8,15 lag behind; the next-best AUC, obtained by the method in reference4, is 87.2%. This clearly shows the strength of multimodal fusion for improving anomaly detection, especially when integrating data from sources such as video and motion.
Table 16 reports model performance under noisy conditions. With noisy inputs, the proposed Spatiotemporal Autoencoder proves highly resilient, with an F1-score of 89.0%. The results for methods4,8,15 indicate a marked loss of precision and recall compared with their performance under ideal conditions; Method4 balances precision and recall but handles noise poorly, achieving an F1-score of 80.1%. These outcomes confirm that the model developed here can handle noisy and incomplete samples thanks to the convolutional and LSTM layers in its autoencoder.
Table 17 shows the accuracy comparison for few-shot learning scenarios, where only a few labeled examples of anomalies are available. The Prototypical Networks achieve an impressive accuracy of 88.4%, far ahead of the other methods: Method4 reaches 79.5%, while Methods8,15 are considerably less accurate at 76.8% and 74.2%, respectively. These results show how learning from just a few examples, as the Prototypical Networks do, is beneficial in operational regimes where labeled data are rare.
Table 18 compares the false positive rates of the models. The combination of MVAE and the Spatiotemporal Autoencoder yields the smallest false positive rate of 6.3%, whereas methods4,8,15 show larger values; Method4 reaches 14.5%, making it far more prone to false alarms. This result confirms the superiority of the proposed model in reducing unnecessary alerts, a crucial requirement for limiting operator fatigue and improving the overall effectiveness of real-time surveillance systems.
Table 19 presents the inference times for real-time anomaly detection. Despite its state-of-the-art accuracy, the proposed TGAT + TARNN model requires 115 ms per inference, more than Method4 at 98 ms. This speed trade-off is well balanced by the model's superior accuracy and robustness, whereas Methods8,15 are less suitable for real-time applications with inference times of 132 ms and 125 ms, respectively. Collectively, these results show that the proposed models surpass existing approaches in accuracy, noise robustness, false positive rates, and few-shot generalization. By combining advanced spatiotemporal modeling, attention mechanisms, and multimodal fusion, the proposed framework can significantly improve real-world anomaly detection in multi-camera surveillance environments, especially in complex and noisy settings.
Ablation study analysis
The ablation study removes or replaces each component of the integrated model and assesses the resulting performance on datasets such as UCSD Pedestrian, ShanghaiTech, and Avenue. Replacing TGAT with a conventional Graph Convolutional Network (GCN) reduced the F1-score on the UCSD Pedestrian dataset from 90.6% to 84.3%, showing that the dynamic modeling of spatiotemporal dependencies among camera nodes is a critical feature contributed by TGAT. Likewise, the full model achieved an AUC of 94.6% on the ShanghaiTech dataset, compared with only 88.2% for the GCN-based variant, confirming the importance of attention for weighting the feeds most relevant to the current environment. Removing TGAT entirely reduced performance to an F1-score of 80.2%, indicating that the component is integral to the framework.
The contribution of TARNN was measured by substituting it with a standalone LSTM network. On the Avenue dataset, the standalone LSTM attained a recall of 85.4%, compared with 90.8% for TARNN, underscoring the necessity of the transformer's self-attention mechanism for capturing long-range dependencies. The temporal score from TARNN also significantly enhanced the identification of slow-evolving anomalies on ShanghaiTech, boosting the F1-score to 89.7% from 83.9% for the standalone LSTM. Eliminating TARNN lowered recall to below 80% on all datasets, confirming its role in modeling both short-term and long-term temporal dependencies.
To check the robustness and effectiveness of multimodal fusion with the Multimodal Variational Autoencoder, the model was also tested on unimodal inputs. With video-only data, the F1-score on the Mall Dataset dipped to 81.6%, compared with 88.7% for multimodal inputs, while the motion-only and audio-only baselines reached roughly 80.4% and 79.9%, respectively. Retaining information from all modalities in the MVAE's joint latent representation proved critical for noise-robust detection. With partially degraded motion data on the ShanghaiTech dataset, MVAE-based fusion still recorded an F1-score of 87.3%, whereas the unimodal methods remained below 75%. These results show how the MVAE leverages the complementary strengths of the modalities, confirming that all components are needed together to achieve state-of-the-art performance across a broad range of surveillance scenarios.
Conclusions and future scopes
The proposed framework is built on TGAT, TARNN, the MVAE, Prototypical Networks for few-shot learning, and Spatiotemporal Autoencoders, and the presented approach advances the state of anomaly detection in multi-camera surveillance systems. Extensive experiments on the UCSD Pedestrian Dataset show that these models deliver the best results: TGAT + TARNN produces an F1-score of 90.6% and an AUC of 95.2%, up to 12% higher than traditional methods. The MVAE further exploits data fusion across video, audio, and motion sensor inputs and proves robust under noisy conditions, reducing the reconstruction error for anomalous events by 25% and handling missing or noisy samples. The Prototypical Networks, tailored for few-shot learning, achieve an accuracy of 87.9%, showing an ability to generalize to unseen anomalies from just a few labeled examples, and the Spatiotemporal Autoencoder reduces false positives to 6.5%, substantially improving operational dependability. Altogether, these results underpin the efficiency and scalability of the proposed models, whose precision and recall exceed the best prior results by a margin of 10–15%, positioning the framework as a robust solution for real-world anomaly detection in multi-camera environments.
The proposed anomaly detection framework is interpretable and actionable for end-users such as security personnel thanks to its modular, self-explanatory architecture. Each module produces highly interpretable outputs: the attention scores of the Temporal Graph Attention Network point to the camera feeds or spatial nodes most relevant to a detected anomaly, TARNN identifies the temporal patterns in which the probability of an anomaly is high, and the reconstruction errors of the Spatiotemporal Autoencoder pinpoint specific deviations from normal behavior. The Multimodal Variational Autoencoder (MVAE) provides per-modality anomaly measurements for video, audio, and motion, highlighting which modality contributed to the detection.
Although the proposed framework greatly improves anomaly detection, there are scenarios in which its performance is less than optimal. The main limitation arises in extremely sparse data environments or with very low-resolution inputs, such as distant or partially occluded camera feeds: the lack of spatial detail makes reconstruction ineffective for the Spatiotemporal Autoencoder, which then suffers from increased false negatives. The Multimodal Variational Autoencoder (MVAE) also fuses modalities less effectively when one modality, for example audio, is highly distorted or missing, leading to a small drop in detection accuracy. In addition, the framework incurs computational overhead when processing very dense camera networks in real time, where scalability becomes an issue despite the optimizations in the TGAT module. These shortcomings point to further requirements, such as advanced domain adaptation techniques for low-quality data and edge-computing optimizations for large-scale deployments.
Future scope
The future scope of this work spans several directions. First, the framework can be scaled up to detect abnormalities in more complex environments, such as industrial facilities or very crowded public places, by integrating additional modalities. Reinforcement learning techniques could also be integrated into the attention mechanisms of TGAT and TARNN to improve the dynamic selection of relevant cameras and sequences, reducing computational overhead while sustaining high detection performance. Another promising research direction is the use of domain adaptation techniques so that the models generalize across diverse datasets, enabling robust performance in different contexts such as indoor environments or variable weather conditions. The few-shot learning in Prototypical Networks could be further enhanced by meta-learning strategies that allow better adaptation to new anomaly classes with even fewer labeled examples. The unsupervised nature of the Spatiotemporal Autoencoder makes it a good starting point for more sophisticated self-supervised learning techniques, which could yield better-performing methods for detecting rare or very subtle events without explicit labeling. Finally, real-time implementation and integration with edge computing remain open directions: future research must optimize computational efficiency and resource management so that these models can run continuously at large scale for surveillance monitoring.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
References
Yu, H., Zhang, X., Wang, Y., Huang, Q. & Yin, B. Fine-grained accident detection: Database and algorithm. IEEE Trans. Image Process. 33, 1059–1069. https://doi.org/10.1109/TIP.2024.3355812 (2024).
Vosta, S. & Yow, K. C. KianNet: A violence detection model using an attention-based CNN-LSTM structure. IEEE Access. 12, 2198–2209. https://doi.org/10.1109/ACCESS.2023.3339379 (2024).
Varghese, E. B., Thampi, S. M. & Berretti, S. A psychologically inspired fuzzy cognitive deep learning framework to predict crowd behavior. IEEE Trans. Affect. Comput. 13(2), 1005–1022. https://doi.org/10.1109/TAFFC.2020.2987021 (2022).
Luo, L., Xie, S., Yin, H., Peng, C. & Ong, Y. S. Detecting and quantifying crowd-level abnormal behaviors in crowd events. IEEE Trans. Inf. Forensics Secur. 19, 6810–6823. https://doi.org/10.1109/TIFS.2024.3423388 (2024).
Wang, R. et al. The limo-powered crowd monitoring system: Deep life modeling for dynamic crowd with edge-based information cognition. IEEE Sens. J. 22(18), 17666–17676. https://doi.org/10.1109/JSEN.2021.3080917 (2022).
Behera, S., Dogra, D. P., Bandyopadhyay, M. K. & Roy, P. P. Crowd characterization in surveillance videos using deep-graph convolutional neural network. IEEE Trans. Cybernet. 53(6), 3428–3439. https://doi.org/10.1109/TCYB.2021.3126434 (2023).
Elharrouss, O. et al. FSC-Set: Counting, localization of football supporters crowd in the stadiums. IEEE Access 10, 10445–10459. https://doi.org/10.1109/ACCESS.2022.3144607 (2022).
Halboob, W., Altaheri, H., Derhab, A. & Almuhtadi, J. Crowd management intelligence framework: Umrah use case. IEEE Access 12, 6752–6767. https://doi.org/10.1109/ACCESS.2024.3350188 (2024).
Yin, T., Hoyet, L., Christie, M., Cani, M. P. & Pettré, J. The one-man-crowd: Single user generation of crowd motions using virtual reality. IEEE Trans. Vis. Comput. Graph 28(5), 2245–2255. https://doi.org/10.1109/TVCG.2022.3150507 (2022).
Liao, X. C., Chen, W. N., Guo, X. Q., Zhong, J. & Hu, X. M. Crowd management through optimal layout of fences: An ant colony approach based on crowd simulation. IEEE Trans. Intell. Transp. Syst. 24(9), 9137–9149. https://doi.org/10.1109/TITS.2023.3272318 (2023).
Zhou, Y. et al. Crowd descriptors and interpretable gathering understanding. IEEE Trans. Multimed. 26, 8651–8664. https://doi.org/10.1109/TMM.2024.3381040 (2024).
Qaraqe, M. et al. PublicVision: A secure smart surveillance system for crowd behavior recognition. IEEE Access 12, 26474–26491. https://doi.org/10.1109/ACCESS.2024.3366693 (2024).
Wang, Q. & Breckon, T. P. Crowd counting via segmentation guided attention networks and curriculum loss. IEEE Trans. Intell. Transp. Syst. 23(9), 15233–15243. https://doi.org/10.1109/TITS.2021.3138896 (2022).
Li, J. et al. Variational abnormal behavior detection with motion consistency. IEEE Trans. Image Process. 31, 275–286. https://doi.org/10.1109/TIP.2021.3130545 (2022).
Khosravi, M. R., Rezaee, K., Moghimi, M. K., Wan, S. & Menon, V. G. Crowd emotion prediction for human-vehicle interaction through modified transfer learning and fuzzy logic ranking. IEEE Trans. Intell. Transp. Syst. 24(12), 15752–15761. https://doi.org/10.1109/TITS.2023.3239114 (2023).
Luo, L., Zhang, B., Guo, B., Zhong, J. & Cai, W. Why they escape: Mining prioritized fuzzy decision rule in crowd evacuation. IEEE Trans. Intell. Transp. Syst. 23(10), 19456–19470. https://doi.org/10.1109/TITS.2022.3156060 (2022).
Wu, W., Li, J., Yi, W. & Zheng, X. Modeling crowd evacuation via behavioral heterogeneity-based social force model. IEEE Trans. Intell. Transp. Syst. 23(9), 15476–15486. https://doi.org/10.1109/TITS.2022.3140823 (2022).
Lv, P. et al. Emotional contagion-aware deep reinforcement learning for antagonistic crowd simulation. IEEE Trans. Affect. Comput. 14(4), 2939–2953. https://doi.org/10.1109/TAFFC.2022.3225037 (2023).
Chai, L., Liu, Y., Liu, W., Han, G. & He, S. CrowdGAN: Identity-free interactive crowd video generation and beyond. IEEE Trans. Pattern Anal. Mach. Intell. 44(6), 2856–2871. https://doi.org/10.1109/TPAMI.2020.3043372 (2022).
Yi, W., Wu, W., Wang, X. & Zheng, X. Modeling the mutual anticipation in human crowds with attention distractions. IEEE Trans. Intell. Transp. Syst. 24(9), 10108–10117. https://doi.org/10.1109/TITS.2023.3268315 (2023).
Cai, Z. et al. Forecasting citywide crowd transition process via convolutional recurrent neural networks. IEEE Trans. Mob. Comput. 23(5), 5433–5445. https://doi.org/10.1109/TMC.2023.3310789 (2024).
Zeng, Y., Zhou, S. & Xiang, K. Online-offline interactive urban crowd flow prediction toward IoT-based smart city. IEEE Trans. Serv. Comput. 15(6), 3417–3428. https://doi.org/10.1109/TSC.2021.3099781 (2022).
Yang, Y. et al. Multiscenario open-set gait recognition based on radar micro-doppler signatures. IEEE Trans. Instrum. Meas. 71, 1–13. https://doi.org/10.1109/TIM.2022.3214271 (2022).
Chen, J., Wang, C. & Liu, Y. Vibration signal based abnormal gait detection and recognition. IEEE Access 12, 89845–89855. https://doi.org/10.1109/ACCESS.2024.3417377 (2024).
Naghavi, N. & Wade, E. Towards real-time prediction of freezing of gait in patients with Parkinson’s disease: A novel deep one-class classifier. IEEE J. Biomed. Health Inf. 26(4), 1726–1736. https://doi.org/10.1109/JBHI.2021.3103071 (2022).
Palash, M. & Bhargava, B. EMERSK-Explainable multimodal emotion recognition with situational knowledge. IEEE Trans. Multimed. 26, 2785–2794. https://doi.org/10.1109/TMM.2023.3304015 (2024).
Alharthi, R., Alhothali, A., Alzahrani, B. & Aldhaheri, S. Massive crowd abnormal behaviors recognition using C3D. In 2023 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA 01–06 (2023). https://doi.org/10.1109/ICCE56470.2023.10043437.
Anandhi, R. Spatially-constrained anomaly detection in crowded environments using meta-heuristic algorithm. In 2023 4th International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India 1440–1444 (2023). https://doi.org/10.1109/ICOSEC58147.2023.10276052.
Zhou, X. & Xiao, R. Detection of abnormal crowd behavior based on graph convolutional neural network. In 12th International Conference on Information Technology in Medicine and Education (ITME), Xiamen, China 538–542 (2022). https://doi.org/10.1109/ITME56794.2022.00118.
Liyanage, P. & Fernando, P. Suspicious human crowd behaviour detection—A transfer learning approach. In 21st International Conference on Advances in ICT for Emerging Regions (ICter), Colombo, Sri Lanka 63–68 (2021). https://doi.org/10.1109/ICter53630.2021.9774784.
Mehmood, A. Efficient anomaly detection in crowd videos using pre-trained 2D convolutional neural networks. IEEE Access 9, 138283–138295. https://doi.org/10.1109/ACCESS.2021.3118009 (2021).
Mu, H., Sun, R., Yuan, G., Li, J. & Wang, M. Crowd behavior detection in videos using statistical physics. In 2021 International Conference on Data Mining Workshops (ICDMW), Auckland, New Zealand 389–397 (2021). https://doi.org/10.1109/ICDMW53433.2021.00054.
Ahmed, R., Rafiq, M. S. & Junej, I. N. Crowd modeling using temporal association rules. In 2021 IEEE 2nd International Conference on Human-Machine Systems (ICHMS), Magdeburg, Germany 1–4 (2021). https://doi.org/10.1109/ICHMS53169.2021.9582661.
Karki, M. V., Aripirala, A., Vasist, C., Renith, A. & Balasubramanian, A. Abnormal human behavior detection in crowded scenes based on hybrid neural networks. In 4th International Conference on Circuits, Control, Communication and Computing (I4C), Bangalore, India 54–58 (2022). https://doi.org/10.1109/I4C57141.2022.10057910.
Abdullah, F., Javeed, M. & Jalal, A. Crowd anomaly detection in public surveillance via spatio-temporal descriptors and zero-shot classifier. In International Conference on Innovative Computing (ICIC), Lahore, Pakistan 1–8 (2021). https://doi.org/10.1109/ICIC53490.2021.9693003.
Bansal, S., Kumar, S. & Bhalla, P. A novel approach to WDM channel allocation: Big Bang–Big crunch optimization. In The proceeding of Zonal Seminar on Emerging Trends in Embedded System Technologies (ETECH) organized by The Institution of Electronics and Telecommunication Engineers (IETE), Chandigarh Centre, Chandigarh, India 80–81 (2013).
Bansal, S., Chauhan, R. & Kumar, P. A cuckoo search based WDM channel allocation algorithm. Int. J. Comput. Appl. 96(20), 6–12 (2014).
Zhou, Y., Qin, M., Wang, X. & Zhang, C. Regional crowd status analysis based on geovideo and multimedia data collaboration. In 2021 IEEE 4th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), Chongqing, China 1278–1282 (2021). https://doi.org/10.1109/IMCEC51613.2021.9482018.
Kevin Lemuel Thomas, R., Jerome Sanjay, G., Pandeeswaran, C. & Raghi, K. R. Advanced CCTV surveillance anomaly detection, alert generation and crowd management using deep learning algorithm. In 3rd International Conference on Artificial Intelligence For Internet of Things (AIIoT), Vellore, India 1–6 (2024). https://doi.org/10.1109/AIIoT58432.2024.10574731.
Sophia, S. & Joeffred Gladson, J. Human behaviour and abnormality detection using YOLO and Conv2D Net. In 2024 International Conference on Inventive Computation Technologies (ICICT), Lalitpur, Nepal 70–75 (2024). https://doi.org/10.1109/ICICT60155.2024.10544757.
Ge, Z., Jiang, J. & Coombes, M. A congestion-aware path planning method considering crowd spatial-temporal anomalies for long-term autonomy of mobile robots. In IEEE International Conference on Robotics and Automation (ICRA), London, United Kingdom 7930–7936. (2023). https://doi.org/10.1109/ICRA48891.2023.10160252.
Mahmoud, S., Arafa, Y. & Abdelmohsen, M. Dynamic image representations for crowd anomaly detection using generative adversarial networks. In 2023 International Conference on Computer and Applications (ICCA), Cairo, Egypt 1–6 (2023). https://doi.org/10.1109/ICCA59364.2023.10401681.
Veesam, S. B. & Satish, A. R. An empirical taxonomy of video summarization model from a statistical perspective. IEEE Access. https://doi.org/10.1109/ACCESS.2024.3503276
Khan, S. D., Bandini, S., Basalamah, S. & Vizzari, G. Analyzing crowd behavior in naturalistic conditions: Identifying sources and sinks and characterizing main flows. In Neurocomputing 543–563, vol. 177 (Elsevier BV, 2016). https://doi.org/10.1016/j.neucom.2015.11.049.
Farooq, M. U., Saad, M. N. M. & Khan, S. D. Motion-shape-based deep learning approach for divergence behavior detection in high-density crowd. Vis. Comput. 38, 1553–1577. https://doi.org/10.1007/s00371-021-02088-4 (2022).
Alzahrani, A. J. & Khan, S. D. Characterization of different crowd behaviors using novel deep learning framework. Turk. J. Electr. Eng. Comput. Sci. 29(1), 12. https://doi.org/10.3906/elk-2004-14 (2021).
Bansal, S. et al. Pt/ZnO and Pt/few-layer graphene/ZnO Schottky devices with Al Ohmic contacts using Atlas simulation and machine learning. J. Sci. Adv. Mater. Devices 9, 100798-1–100798-14 (2024).
Bansal, S. et al. Optoelectronic performance prediction of HgCdTe homojunction photodetector in long wave infrared spectral region using traditional simulations and machine learning models. Sci. Rep. 14, 28230 (2024).
Bansal, S. ANNs supervised learning-based automatic fault detection in a class of wheatstone bridge-oriented transducers. In 2022 IEEE Sponsored Global Conference for Advancement in Technology (GCAT-2022), Nagarjuna College of Engineering & Technology, Bengaluru, Karnataka, India 1–7 (2022).
Bansal, S. & Jain, P. Automatic fault detection in a class of wheatstone bridge-based transducer using ANNs in Verilog HDL. In 2nd International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), Galgotias College of Engineering and Technology Greater Noida, India 466–470 (2022).
Veesam, S. B. & Satish, A. R. Design of an iterative method for CCTV video analysis integrating enhanced person detection and dynamic mask graph networks. IEEE Access 12, 157630–157656. https://doi.org/10.1109/ACCESS.2024.3485896 (2024).
Acknowledgements
This research was funded by Fundamental Research Grant Scheme (FRGS), MOE, Malaysia, Code: FRGS/1/2022/TK07/UKM/02/22.
Author information
Authors and Affiliations
Contributions
S B Veesam, A R Satish, S Tupakula made substantial contributions to design, analysis and characterization. Y Chinnam, K Prakash, S Bansal participated in the conception, application and critical revision of the article for important intellectual content. M R I Faruque provided necessary instructions for analytical expression, case study for practical use and critical revision of the article purposes.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Veesam, S.B., Satish, A.R., Tupakula, S. et al. Design of an integrated model with temporal graph attention and transformer-augmented RNNs for enhanced anomaly detection. Sci Rep 15, 2692 (2025). https://doi.org/10.1038/s41598-025-85822-5