Introduction

The increasing occurrence of deepfakes, or altered video content, has introduced a persistent need for reliable detection procedures. These manipulated videos replicate human facial expressions, speech synchronization, and even emotional expressions with high fidelity1. Deepfakes can blend facial transitions seamlessly and maintain consistent lighting and texture across frames, which makes identification a complex process. Their increasing sophistication introduces further challenges in domains such as digital journalism, political media, forensic analysis, and cybersecurity2. A major aspect in this context is the role of video compression, which can hide small variations in frame content. Since most digital videos are compressed before being shared or stored, a detection model must account for how compression affects visible information, while alterations caused by re-encoding may introduce or remove manipulation traces. Therefore, an ideal solution must perform more than image-level classification, combining motion tracking and structure-aware analysis to handle both compressed and uncompressed content effectively.

Most existing detection systems depend mainly on visual features extracted from individual frames without modeling the temporal continuity between successive video segments. As deepfakes appear in many applications, conventional static detection approaches are insufficient to detect frame-level inconsistencies that are masked by smooth transitions3. Another major limitation is sensitivity to compression artifacts, which introduce considerable variability in real-world applications. When videos are uploaded to social networks or shared online, they undergo re-compression using different formats. This compression changes texture, resolution, and color depth, so the visual traces left behind by editing or generation techniques may no longer be visible. As a result, systems trained on clean or high-quality datasets may not perform well when evaluated on lower-quality or heavily compressed data. Furthermore, methods already in use cannot identify new types of manipulation involving multi-modal content, such as audio-visual mismatches and partial face swaps. This lack of adaptability requires the development of modular and flexible detection systems that can scale with the evolving nature of synthetic media creation techniques.

A wide range of deepfake detection models has been developed in recent times, considering different features, architectures, and design objectives4. Convolutional Neural Networks (CNNs) have been widely used in many approaches due to their effectiveness in extracting spatial patterns5. Low-level noise patterns and boundary mismatches have later been identified using CNN variants such as ResNet and VGG6. Although these models were effective in detecting deepfakes, their performance suffered from an inability to understand temporal behavior. Other models introduced frequency-domain analysis to uncover artifacts left behind by GAN-based generation7. These approaches use discrete cosine or Fourier transforms to analyze inconsistencies in image compression or spectral energy distribution. While frequency-domain features are informative, they are vulnerable to post-processing steps such as blurring or noise injection, which can suppress detectable signals. Further research led to the use of capsule networks, which aim to preserve spatial hierarchies and relationships8. These models attempted to address limitations of CNNs by tracking viewpoint and pose, but they struggled with long-range dependency modeling and were computationally intensive. Additionally, few-shot learning techniques have been explored to enable quick adaptation to new types of manipulations, though these approaches require careful fine-tuning and often underperform when class imbalance exists. Though these methods contribute valuable insights, they primarily function at a frame or patch level and do not fully address the continuity and narrative coherence present in video sequences.

To overcome the limitations of purely spatial models, temporal modeling strategies have gained attention in recent literature. Recurrent Neural Networks (RNNs), especially variants like LSTM and GRU, have been adopted to capture frame-to-frame correlations9. These networks are capable of learning sequential features and identifying unnatural transitions. However, basic RNN-based models face challenges in maintaining long memory and are susceptible to vanishing gradients. To mitigate these issues, attention-based models have been introduced, where self-attention mechanisms assign importance to relevant frames or spatial zones. Although attention helps in contextualizing features, standalone attention layers lack hierarchical compression and can become computationally expensive. Furthermore, hybrid approaches combining CNNs for spatial encoding with LSTM or GRU units for temporal analysis show promise but often depend on extensive labeled datasets and do not incorporate structural prediction or compression-based discrepancies10. The motivation for this research stems from the observed gap in integrating structural analysis, frequency-based compression clues, and sequential modeling into a unified framework. The objective is to develop a model that not only captures visual and temporal anomalies but also examines intra-frame prediction inconsistencies using standard image compression techniques. Such an approach aims to exploit subtle distortions introduced during generation while leveraging temporal attention to model frame dependencies, thus improving classification accuracy in both controlled and real-world datasets.

The proposed research presents a deepfake detection framework, IP-GTA Net, which incorporates intra prediction through block-wise reconstruction using a hybrid convolutional autoencoder, combined with gated temporal attention, to effectively classify original and altered video sequences. The model first applies a convolutional autoencoder for intra prediction to each frame: the encoder captures spatial redundancies and the decoder reconstructs the frame to reveal hidden discrepancies. This intra-frame compression provides a learned alternative to traditional methods, which struggle with the delicate visual distortions commonly found in tampered content. The reconstructed frames are then processed through MobileNetV3 to extract spatially rich features with reduced computational complexity. To model temporal dependencies, a gated convolutional GRU module is incorporated, which captures sequential transitions and selectively filters relevant information through attention-weighted gates. To enhance convergence stability, RMSprop is used for optimization. The novelty of this framework lies in its combination of intra prediction with deep spatio-temporal analysis, creating a hybrid system that addresses both generative residue and motion fidelity. Unlike existing methods that treat frames independently or model sequences without structural validation, the proposed IP-GTA Net analyzes content both in isolation and in context, which increases its resilience to adversarial manipulations and improves generalization across varied data. The major contributions of this research work are summarized as follows.

  • A novel hybrid deepfake detection model (IP-GTA Net) is proposed, combining block-wise intra prediction using a convolutional autoencoder to simulate compression artifacts and reveal fine-grained frame-level inconsistencies.

  • The model integrates MobileNetV3 for spatial feature encoding and a gated convolutional GRU with temporal attention to capture manipulation patterns across frame sequences for robust video-level classification.

  • A detailed comparative analysis is presented with existing models such as XceptionNet, Two-Stream CNN, EfficientNet-B0, and Capsule Network, demonstrating improved accuracy, precision, recall, and F1-score on benchmark datasets.

The following sections of this article are organized as follows: Sect. 2 provides a detailed review of related literature and existing methodologies relevant to deepfake detection. Section 3 outlines the mathematical foundation and structural formulation of the proposed IP-GTA Net model. Section 4 highlights the experimental findings, including quantitative evaluations and comparative analysis with baseline models. Finally, Sect. 5 summarizes the overall conclusions of this research work and suggests future directions for continued research.

Related works

Recent advancements in deepfake detection have introduced different models based on deep learning approaches such as convolutional networks, transformer-based models, and hybrid spectral-spatial frameworks. Each method has its own merits in analyzing facial inconsistencies, frequency cues, or temporal dependencies. However, these models still face challenges in generalizing across manipulation types and maintaining real-time efficiency under constrained environments.

The deepfake detection model presented in11 combines the Local Binary Pattern (LBP) descriptor with an ensemble of deep CNN models such as VGG19, ResNet50, and InceptionV3. The model aims to enhance spatial texture recognition using LBP pre-processing, as it highlights complex pixel-level irregularities in manipulated facial regions. The pre-processed frames are then processed by the ensemble for classification. Experimental analysis reports the model's average accuracy on benchmark datasets. However, the model is computationally intensive and depends heavily on static frame-based analysis, which restricts its ability to capture temporal inconsistencies across frames. The lightweight deepfake detection model presented in12 incorporates a modified MobileNetV2 with a depthwise separable attention module. The approach processes individual frames and extracts discriminative features to differentiate real and fake content. Experimental evaluation on a benchmark dataset demonstrates the model's accuracy and low complexity. However, the dependence on frame-level analysis limits the model's ability to capture temporal features associated with face manipulation artifacts.

The deepfake detection model presented in13 incorporates capsule networks to overcome challenges in the deepfake detection procedure. The model captures hierarchical relationships by utilizing dynamic routing between capsules, which allows it to detect complex objects and inconsistencies within facial features. Experimental validation on benchmark deepfake datasets shows that the capsule-based model achieves better accuracy than conventional CNNs. However, the lack of temporal modeling and the absence of cross-frame temporal dependency analysis limit the model's adaptability in real-time detection. The 3D CNN model presented in14 simultaneously learns spatial artifacts and temporal inconsistencies in deepfake videos. The model captures short-term motion dynamics by analyzing consecutive frames as volumetric data, allowing it to detect manipulations that are invisible in static frames. Experimental evaluation on deepfake datasets shows average accuracy. However, the computational complexity of the model is high compared to 2D CNNs.

The frequency-aware deepfake detection model presented in15 incorporates Discrete Cosine Transform (DCT) features into a convolutional neural network. By transforming input images into the frequency domain, the presented model aims to capture high-frequency inconsistencies present in forged content, which might be missed in the spatial domain. The DCT coefficients are combined with RGB data and fed into a dual-stream CNN for joint feature learning. Experimental validation on FaceForensics++, Celeb-DF, and DFDC datasets reveals that the method achieves an accuracy of 91.1%, with improved robustness on compressed videos. The frequency analysis enhances the model’s ability to detect manipulations in subtle texture regions. However, the added complexity of frequency-stream processing increases training time and memory requirements. Moreover, the model focuses on static frame-based analysis, and thus lacks mechanisms to capture temporal inconsistencies, which are crucial in many realistic deepfake scenarios involving motion and dynamic facial transitions.

The dual-stream recurrent convolutional network presented in16 captures both spatial and temporal inconsistencies in manipulated facial videos. The first stream uses a conventional CNN to extract spatial features from individual frames, while the second employs a recurrent unit, specifically a Bidirectional GRU, to model temporal dependencies across sequences. By combining these two components, the architecture addresses limitations of static-frame methods and enhances the ability to detect frame-level anomalies caused by forgery artifacts. The model was tested on benchmark datasets, achieving a peak accuracy of 91.5%, which is better than CNN baseline models. However, the dependency on recurrent units increases computational complexity and inference time, which limits the model's real-time deployment.

The detection method presented in17 incorporates a Temporal Convolutional Network (TCN) with a lightweight EfficientNet-based encoder. The model processes frame sequences to extract motion-related features. Specifically, the temporal block captures long-range dependencies across frames, which helps in detecting indirect manipulations in face videos. The combined architecture is evaluated on benchmark deepfake datasets, and the attained average accuracy of 90.9% demonstrates the model's balanced trade-off between detection performance and computational cost.

The two-branch architecture presented in18 considers both spatial artifacts and temporal inconsistencies for deepfake detection. The first branch is responsible for learning spatial features through a convolutional network. The second branch incorporates a temporal attention mechanism to focus on frame-wise relationships and continuity. The model is trained and evaluated on benchmark datasets and exhibits 91.9% accuracy. Additionally, the temporal attention module enhances detection accuracy by prioritizing frames with higher manipulation likelihood and improves robustness in detecting modifications. However, the model requires frame alignment preprocessing and is sensitive to sudden motion transitions, which reduces its effectiveness in uncontrolled environments.

The self-supervised deepfake detection model presented in19 utilizes contrastive learning for improved generalization across manipulation types. The method generates positive and negative pairs from augmented video frames to train the model to distinguish real and fake representations. The architecture includes a CNN encoder followed by a projection head that maps features into an embedding space. Experimental evaluations on benchmark Celeb datasets report the model's average accuracy. However, performance degrades slightly when processing low-quality or highly compressed videos.

The lightweight CNN model presented in20 for deepfake video detection combines spatial feature extraction with a motion-aware attention module that focuses on dynamic inconsistencies across consecutive frames. The CNN backbone captures localized facial features, and the attention unit identifies complex manipulative transitions introduced by deepfake generation algorithms. The method was tested on benchmark datasets and achieved average accuracy with reduced computational overhead. However, the model exhibits poor performance when detecting adversarially enhanced forgeries that mimic natural transitions.

The Vision Transformer (ViT)-based deepfake detection model presented in21 processes input images as sequences of patches, allowing global context learning without depending on localized kernels. The model is trained and tested on benchmark datasets and achieves an accuracy of 91.4%, which is better than existing CNN-based baselines. The attention maps produced by the model highlight facial areas affected by alterations and provide better interpretability of the detection process. However, the model requires large-scale training data and extensive computational resources, which may limit its deployment in low-resource environments.

The transformer-based model presented in22 for deepfake detection utilizes global contextual information in face video sequences. The model incorporates a Vision Transformer (ViT) to capture long-range dependencies and changes across frames. The method segments input sequences into patches and encodes them using positional embeddings, enabling the model to analyze spatial patterns. Experimental evaluations on benchmark datasets demonstrate an average detection accuracy superior to conventional CNN-based methods. However, the model is computationally complex and requires a large volume of training data for effective convergence. Also, its performance decreases for low-resolution videos or poor lighting conditions.

The multi-task learning framework presented in23 simultaneously performs deepfake detection and manipulation localization using a shared backbone with dual heads. The architecture utilizes a modified EfficientNet-B4 encoder followed by two branches: the first branch is used for classification and the second for pixel-level heatmap prediction to highlight manipulated regions. This dual-objective structure allows the model not only to detect the presence of fakes but also to visually interpret manipulated zones within the input frames. However, the model's dependency on accurate pixel-level annotations limits its scalability to large or diverse datasets.

The detection model presented in24 extracts spatiotemporal features using a 3D ResNet integrated with an attention-guided module. The model is designed to extract both appearance and motion-based features from short video clips, enabling it to recognize inconsistencies introduced by facial changes over time. The attention mechanism highlights critical regions in both the spatial and temporal dimensions, which further improves interpretability and performance. Experimental evaluations showed better accuracy than existing deep learning models. However, the model is computationally intensive due to the use of 3D convolutions and requires substantial GPU resources for training.

The hybrid deepfake detection model presented in25 combines a dual-stream CNN architecture with a spatiotemporal attention mechanism. The spatial stream processes RGB frame data, while the temporal stream captures dynamic changes across consecutive frames using optical flow. Further the attention fusion module integrates both streams to highlight manipulation-relevant regions and motion inconsistencies. The model is evaluated on benchmark datasets and demonstrates improved performance over baseline single-stream methods. However, the dependency on accurate optical flow computation increases processing time and reduces efficiency under variable motion blur or occlusion.

The hybrid approach presented in26 for deepfake detection uses convolutional and frequency-domain features to enhance artifact sensitivity. The model integrates a CNN-based encoder with a wavelet decomposition module to extract multi-resolution frequency patterns from facial regions. This fusion allows the model to capture both spatial and indirect spectral inconsistencies, which are commonly introduced by manipulation techniques. Evaluation on benchmark datasets shows an accuracy of 91.7%, which is better than single-domain models. However, the model processes individual frames without modeling temporal continuity, which limits its effectiveness in detecting manipulations that preserve inter-frame consistency.

The hybrid deepfake detection model presented in27 incorporates both convolutional and frequency-based feature extraction to improve detection reliability. The approach employs a dual-stream architecture, with one stream dedicated to spatial features using a ResNet-based encoder and the other processing frequency artifacts using the Discrete Cosine Transform (DCT). The outputs from both branches are fused before classification, which enables the model to detect inconsistencies in both pixel distribution and spectral patterns. Experimental validation on FaceForensics++ and Celeb-DF datasets shows an accuracy of 91.2%, demonstrating competitive performance against several state-of-the-art techniques. The method is particularly effective in capturing compression-induced anomalies and fine-grained facial manipulations. However, its reliance on hand-crafted frequency processing increases sensitivity to low-resolution and highly compressed inputs, which may diminish accuracy in real-world scenarios. Moreover, the computational cost of managing dual streams poses a limitation for real-time or embedded system deployment.

Table 1 Summary of research works.

Research gap

The comprehensive analysis summarized in Table 1 reveals several persistent challenges in existing deepfake detection systems that justify the need for an improved approach. Many models rely heavily on spatial domain features using CNNs or hybrid CNN-transformer architectures, which often overlook subtle temporal inconsistencies or frequency-based anomalies inherent in manipulated content. Although methods like attention modules, temporal modeling, and frequency transforms such as DCT have been attempted, they are either computationally intensive, require large volumes of labeled data, or fail to generalize well across datasets. Several models achieve high accuracy on specific datasets, but their robustness significantly drops in real-world scenarios involving compression artifacts, profile views, or lighting variations. Additionally, transformer-based models, while effective in capturing global context, demand extensive computational resources and are not optimized for real-time applications. Motion-aware or temporal attention mechanisms, though promising, often lack integration with frequency-domain cues, missing fine-grained manipulation patterns. Furthermore, self-supervised approaches reduce dependency on labels but fall short in low-quality settings. This research gap in spatial-frequency-temporal fusion, computational efficiency, and generalization capability highlights the need for a novel architecture which integrates intra prediction, feature extraction, and temporal attention modeling as an efficient and resource-aware approach.

Proposed work

The proposed approach presents a novel model for detecting and classifying deepfake content by combining intra-frame reconstruction, efficient spatial feature extraction, temporal analysis, and attention-based learning. The complete architecture has been carefully designed to maintain a balance between accuracy and computational feasibility. The proposed architecture has an intra prediction module, which uses a hybrid convolutional autoencoder to simulate block-based video compression. This component processes each frame by dividing it into uniform blocks, then reconstructs the input to expose delicate frame-level inconsistencies, which are introduced through editing or generation techniques. Further the frames are processed by MobileNetV3 for extracting the necessary features. This network is chosen for its ability to retain spatial detail while maintaining low computational cost, making it suitable for tasks that require quick processing of large image sequences.

Fig. 1
figure 1

Process flow of proposed model.

The extracted features from each frame are then organized as a temporal sequence to capture visual changes across consecutive frames. To model this sequence, a convolutional gated recurrent unit (ConvGRU) is employed in the proposed architecture, which preserves both spatial and temporal characteristics in the data. Additionally, an attention mechanism is applied before sequence modeling to assign higher importance to frames that may carry signs of tampering or distortion. These steps ensure that both local and global patterns are considered when analyzing frame transitions and motion consistency. The final prediction is made by passing the output of the sequence model into a fully connected layer equipped with SoftMax activation, which classifies the video segment as either real or fake. The complete process flow of the proposed model is presented in Fig. 1, starting from input frame preparation to classification, guided by a compression-aware reconstruction strategy that strengthens the model’s ability to detect fine manipulations, particularly in real-world video scenarios where compression significantly alters content quality.

Input representation and frame sequence

In the initial phase of the Deepfake detection and classification system, the video input is treated as a structured temporal collection of image frames. The goal is to prepare each individual frame for further processing, including intra prediction and feature extraction. This stage lays the foundation for the entire pipeline by ensuring that every video is decomposed in a way that retains both temporal and spatial coherence. Let the incoming digital video be denoted by the variable \(\:\mathcal{V}\), which consists of a sequence of discrete time-indexed frames. This can be mathematically formulated as

$$\:\mathcal{V}\mathcal{\:}=\mathcal{\:}\left\{\:{F}_{t}\:\right|\hspace{0.17em}\:t\:=\:1,\:2,\:3,\:\dots\:,\:T\:\}$$
(1)

where \(\:\mathcal{V}\) indicates the full video sequence, \(\:{F}_{t}\) indicates the individual frame extracted at time index \(\:t\), \(\:T\) indicates the total number of frames considered from the video. Each frame \(\:{F}_{t}\) is treated as a static image with three color channels corresponding to Red, Green, and Blue (RGB). Formally, the dimensionality of each frame is expressed as \(\:{F}_{t}\in\:{R}^{H\times\:W\times\:3}\:\)in which \(\:H\) indicates the height (number of vertical pixels) of the frame, \(\:W\) indicates the width (number of horizontal pixels) of the frame, 3 indicates the number of color channels in an RGB image. This representation ensures that each frame is processed as a color image with detailed spatial information, suitable for manipulation analysis.

To enable block-wise intra prediction and frequency-domain transformation in later stages, each frame \(\:{F}_{t}\) is segmented into non-overlapping square regions. Let \(\:{B}_{i,j}^{t}\) denote the block located at row \(\:i\) and column \(\:j\) within frame \(\:t\). Mathematically it is formulated as

$$\:{F}_{t}={\bigcup\:}_{i=1}^{M}{\bigcup\:}_{j=1}^{N}{B}_{i,j}^{t},\hspace{1em}{B}_{i,j}^{t}\in\:{R}^{b\times\:b\times\:3}$$
(2)

where \(\:{B}_{i,j}^{t}\) indicates the block within frame \(\:t\), covering a square region of size \(\:b\:\times\:b\), \(\:M=\frac{H}{b}\) indicates the total number of vertical blocks, \(\:N=\frac{W}{b}\) indicates the total number of horizontal blocks. The union operation ensures full coverage of the frame without overlap. This step enables each frame to be treated as a composition of smaller regions, allowing localized frequency analysis, which is highly sensitive to subtle alterations commonly introduced in Deepfake content.

Although each frame is handled individually in this phase, the eventual classification depends on how these frames relate to each other over time. Therefore, each frame \(\:{F}_{t}\) carries an implicit temporal index \(\:t\), which serves as a reference for sequencing in downstream operations such as recurrent modeling. To maintain a uniform input structure across videos of varying duration, temporal normalization is applied, mathematically expressed as \(\:\stackrel{\sim}{t}=\frac{t}{T}\), in which \(\:t\) indicates the actual time index of each frame in the input sequence, \(\:T\) indicates the total number of frames under consideration, and \(\:\stackrel{\sim}{t}\) indicates the normalized time index in the range \(\:\left(\text{0,1}\right]\). The index \(\:t\) preserves the original order of frames as they appear in the video, which helps the model retain temporal information, while \(\:\stackrel{\sim}{t}\) maps this index into a standard range between 0 and 1 so that videos of different lengths are treated consistently. Using \(\:\stackrel{\sim}{t}\), the model can learn time-based patterns without depending on exact frame counts, which improves temporal learning and makes the model more general across different video durations. This temporal reference becomes vital in aligning motion-based cues and detecting inconsistent transitions between frames, which are often indicators of tampered video content.
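A minimal sketch of this input-preparation stage is given below, covering frame extraction, the block partitioning of Eq. (2), and the normalized time index. The use of OpenCV for decoding and the helper names (video_to_frames, frame_to_blocks) are illustrative assumptions rather than part of the proposed implementation.

```python
# Sketch of the input stage: read frames F_1..F_T, split each frame
# into non-overlapping b x b x 3 blocks (Eq. 2), and attach the
# normalized time index t/T. OpenCV usage and b = 8 are assumptions.
import cv2
import numpy as np

def video_to_frames(path, max_frames=None):
    """Return a list of RGB frames F_1 ... F_T from a video file."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, bgr = cap.read()
        if not ok or (max_frames and len(frames) >= max_frames):
            break
        frames.append(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))  # F_t in R^{HxWx3}
    cap.release()
    return frames

def frame_to_blocks(frame, b=8):
    """Split one frame into non-overlapping b x b x 3 blocks (Eq. 2)."""
    H, W, _ = frame.shape
    H, W = H - H % b, W - W % b            # crop so H and W are multiples of b
    frame = frame[:H, :W]
    blocks = frame.reshape(H // b, b, W // b, b, 3).swapaxes(1, 2)
    return blocks                           # shape (M, N, b, b, 3)

frames = video_to_frames("sample.mp4", max_frames=64)
T = len(frames)
for t, F_t in enumerate(frames, start=1):
    t_norm = t / T                          # normalized time index in (0, 1]
    blocks = frame_to_blocks(F_t, b=8)
```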

Intra prediction and frame analysis using hybrid convolutional autoencoder

To effectively simulate the structure-preserving behavior of traditional video compression schemes, the proposed model partitions each input frame into non-overlapping square sub-blocks prior to reconstruction. A fixed block size of 8 × 8 pixels is utilized, which aligns with the standard macroblock size used in conventional intra-frame coding protocols such as H.264 and JPEG. Although formats like JPEG and H.264 use different compression methods, both produce standard RGB frames after decoding. These frames still contain compression artifacts such as block noise, blurring, or color shifts. The proposed model uses these decoded frames for learning and detects the residual visual distortions left by compression, regardless of the format, which makes it suitable for videos encoded in JPEG, H.264, or similar formats. The dimensional choice also ensures that the autoencoder learns compression-induced distortions at a local level, facilitating the identification of delicate manipulations in small regions of the image. Smaller blocks such as 8 × 8 allow higher spatial resolution when detecting inconsistencies, particularly in facial areas where blending artifacts or pixel-level anomalies may be confined to small zones. In this model, 8 × 8 blocks are used to match common video compression standards. If the block size is reduced to 4 × 4, the model captures smaller details; this may help detect very fine manipulations, but it increases the number of blocks, so the model takes more time and memory to process. On the other hand, if the block size is increased to 16 × 16, the number of blocks becomes smaller. This reduces the processing load but may miss small tampering patterns, and larger blocks also average out local artifacts, which can hide manipulation traces. The block size therefore affects both reconstruction quality and detection accuracy, and the computational cost changes as well, because the number of encoder-decoder operations depends on the block count. The 8 × 8 setting gives a balance between detail detection and computational efficiency. By reconstructing the frame on a per-block basis, the model implicitly mimics the quantization and loss behavior found in real-world compression pipelines, thus enabling it to better distinguish natural content from generated artifacts. The 8 × 8 sub-block configuration also offers a practical trade-off between computational efficiency and spatial granularity, making it suitable for high-resolution frames processed in real-time settings.
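The computational effect of the block-size choice discussed above can be illustrated with a short calculation of the block count per frame, which is proportional to the number of encoder-decoder passes. The 960 × 512 frame size used here is assumed purely for round numbers.

```python
# Worked example of the block-size trade-off: the number of blocks
# per frame (and hence autoencoder passes) scales with (H/b)*(W/b).
H, W = 960, 512                      # assumed frame size for illustration
for b in (4, 8, 16):
    blocks = (H // b) * (W // b)
    print(f"b = {b:2d} -> {blocks:6d} blocks per frame")
# b =  4 ->  30720 blocks per frame
# b =  8 ->   7680 blocks per frame
# b = 16 ->   1920 blocks per frame
```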

In this stage, the focus is on extracting compression-aware features from each video frame using a hybrid convolutional autoencoder, which functions as the intra prediction module. Unlike traditional transform-based methods such as DCT, this approach reconstructs each input frame by learning compact spatial representations, revealing hidden irregularities typically introduced during manipulation. Each incoming frame \(\:{F}_{t}\) at time index \(\:t\), with spatial dimensions \(\:H\:\times\:W\:\times\:3\), is divided into non-overlapping square blocks of size \(\:b\:\times\:b\), where each block \(\:{B}_{ij}^{t}\) is formulated as

$$\:{B}_{ij}^{t}={F}_{t}\left[i\cdot\:b:\left(i+1\right)\cdot\:b,j\cdot\:b:\left(j+1\right)\cdot\:b\right]$$
(3)

Here, \(\:i\) and \(\:j\) denote the row and column indices of the block grid, respectively. Each block is passed through the encoder part of the network, composed of stacked convolutional layers with non-linear activations. Let the encoder function be denoted by \(\:\mathcal{E}\), then the encoded latent representation \(\:{Z}_{ij}^{t}\) is formulated as

$$\:{Z}_{ij}^{t}=\mathcal{E}\left({B}_{ij}^{t}\right)$$
(4)

This compressed form captures the essential spatial characteristics of the block. The decoder \(\:\mathcal{D}\), which mirrors the encoder structure, reconstructs the original block as follows

$$\:{\widehat{B}}_{ij}^{t}=\mathcal{D}\left({Z}_{ij}^{t}\right)$$
(5)

where \(\:\mathcal{D}\) represents the decoder function. Further, all the reconstructed blocks are arranged back into the original frame, which is mathematically expressed as

$$\:\widehat{{F}_{t}}=\bigcup\:_{i=1}^{{N}_{H}}\bigcup\:_{j=1}^{{N}_{w}}{\widehat{B}}_{ij}^{t}$$
(6)

where \(\:{N}_{H}\) and \(\:{N}_{w}\) are the number of blocks in height and width directions. \(\:\widehat{{F}_{t}}\) represents the full intra-predicted frame. The complete reconstructed frame \(\:\widehat{{F}_{t}}\) is assembled by placing all \(\:{\widehat{B}}_{ij}^{t}\) blocks back into their original positions, preserving the overall spatial structure. The hybrid autoencoder acts as a soft compression model, mimicking the loss characteristics of actual video encoding. By using learned filters instead of fixed mathematical transforms, it adapts better to real-world distortions, including those caused by various compression levels. This block-wise reconstruction-based prediction forms the basis for the next stages, where the processed frames are passed into spatial feature encoders and temporal analyzers. The approach not only provides a high-resolution view of local distortions but also simulates the structural impact of video compression, making it highly relevant for both manipulation detection and compression-aware content validation.
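The block-wise intra prediction of Eqs. (3)-(6) can be sketched as a small Keras autoencoder operating on 8 × 8 × 3 blocks. The layer widths, latent size, and reconstruction loss shown here are illustrative assumptions; the paper specifies only that a hybrid convolutional encoder-decoder is used.

```python
# Minimal sketch of the block-wise intra-prediction autoencoder:
# the encoder E compresses an 8 x 8 x 3 block into a latent code Z
# (Eq. 4) and the decoder D reconstructs it (Eq. 5). Layer widths
# and the latent size are assumed here for illustration.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_block_autoencoder(b=8, latent_dim=32):
    block_in = layers.Input(shape=(b, b, 3))
    # Encoder E: stacked convolutions with non-linear activations
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(block_in)
    x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x)   # b/2 x b/2
    z = layers.Conv2D(latent_dim, 3, strides=2, padding="same",
                      activation="relu", name="latent")(x)                      # b/4 x b/4
    # Decoder D: mirrors the encoder and restores the block
    y = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(z)
    y = layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu")(y)
    block_out = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(y)
    return Model(block_in, block_out, name="intra_prediction_ae")

ae = build_block_autoencoder()
ae.compile(optimizer="rmsprop", loss="mse")   # trained to reconstruct blocks
# Reconstructed blocks B_hat are tiled back into F_hat_t (Eq. 6)
# before the frame is passed to the MobileNetV3 feature extractor.
```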

Deep feature extraction using MobileNetV3

In this stage, MobileNetV3 is used to extract deep features in the proposed work. The network extracts high-level representations from the video frames and captures patterns that are crucial for identifying signs of manipulation, such as unnatural facial textures, inconsistent lighting, or blending artifacts. The reconstructed frame \(\:\widehat{{F}_{t}}\in\:{R}^{H\times\:W\times\:3}\), obtained after intra prediction, is resized to a fixed resolution to match the input requirement of MobileNetV3. Mathematically it is expressed as \(\:{\widehat{{F}_{t}}}^{\left(resized\right)}=\text{Resize}\left(\widehat{{F}_{t}},{H}^{{\prime\:}},{W}^{{\prime\:}}\right)\) in which \(\:{\widehat{{F}_{t}}}^{\left(resized\right)}\) indicates the frame resized to height \(\:{H}^{{\prime\:}}\) and width \(\:{W}^{{\prime\:}}\). The resized frame preserves the RGB format with 3 channels. The first step in MobileNetV3’s internal processing is a standard convolution followed by a non-linear activation, designed to increase the representation capacity of the input image. Mathematically it is expressed as

$$\:{f}_{1}={\updelta\:}\left(BN\left({W}_{1}\text{*}{\widehat{F}}_{t}^{\left(resized\right)}+{b}_{1}\right)\right)$$
(7)

where \(\:{W}_{1}\) indicates the convolutional filter bank, \(\:{b}_{1}\) indicates the bias term, ‘\(\:*\)’ indicates the convolution operation, \(\:BN\left(\cdot\:\right)\) indicates the batch normalization, \(\:{\updelta\:}\left(\cdot\:\right)\) indicates the non-linear activation, \(\:{f}_{1}\) indicates the intermediate feature map. This step converts raw pixel data into a set of basic feature maps capturing edges, contours, and color gradients. MobileNetV3 applies depthwise separable convolution to reduce the number of trainable parameters and computation cost. This process involves two separate operations such as depthwise convolution and pointwise convolution. Mathematically it is expressed as

$$\:{f}_{dw}={\updelta\:}\left(BN\left({W}_{dw}\star\:{f}_{1}+{b}_{dw}\right)\right)$$
(8)

Depthwise Convolution

$$\:{f}_{pw}={\updelta\:}\left(BN\left({W}_{pw}\text{*}{f}_{dw}+{b}_{pw}\right)\right)$$
(9)

Pointwise Convolution

where \(\:{W}_{dw}\) indicates the depthwise filter applied to each input channel separately, \(\:\star\:\) indicates the channel-wise convolution, \(\:{W}_{pw}\) indicates the pointwise filter (1 × 1 convolution), \(\:{f}_{dw},{f}_{pw}\) indicates the intermediate outputs, \(\:{b}_{dw},{b}_{pw}\) indicates the respective bias terms. This two-stage process allows the model to extract complex spatial features with significantly lower resource requirements.
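A hedged Keras sketch of the depthwise separable block in Eqs. (8)-(9) is shown below; the filter count, kernel size, and the ReLU activation are illustrative (MobileNetV3 itself alternates between ReLU and hard-swish depending on the stage).

```python
# Sketch of a depthwise-separable convolution block (Eqs. 8-9):
# a per-channel depthwise filter followed by a 1 x 1 pointwise
# convolution, each with batch normalization and an activation.
import tensorflow as tf
from tensorflow.keras import layers

def depthwise_separable_block(x, pointwise_filters, stride=1):
    # Depthwise convolution: one spatial filter per input channel (Eq. 8)
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    # Pointwise 1 x 1 convolution mixes information across channels (Eq. 9)
    x = layers.Conv2D(pointwise_filters, 1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    return x
```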

The Squeeze-and-Excitation (SE) block is a critical component embedded within MobileNetV3’s architecture to enhance the network’s sensitivity to the most informative channels in the feature maps. Unlike traditional convolutional layers that treat all channels equally, the SE module selectively emphasizes channels that carry more discriminative power, which is especially useful for Deepfake detection, where fine texture inconsistencies and subtle anomalies are often buried in specific feature channels. This operation unfolds in three principal stages: Squeeze, Excitation, and Scaling.

In the first phase, the spatial dimensions of the feature map are collapsed to extract a single descriptor per channel. Let the input feature map from the previous convolutional layer be \(\:f\in\:{R}^{H\times\:W\times\:C}\) in which \(\:H\), \(\:W\) indicates the height and width of the feature map, \(\:C\) indicates the number of channels, \(\:f\left(x,y,c\right)\) indicates the activation value at spatial position \(\:\left(x,y\right)\) in channel \(\:c\). A global average pooling is applied across the spatial plane to compute a channel-wise descriptor \(\:s\in\:{R}^{C}\) which is mathematically formulated as

$$\:s\left(c\right)=\frac{1}{H\cdot\:W}{\sum\:}_{x=1}^{H}{\sum\:}_{y=1}^{W}f\left(x,y,c\right)$$
(10)

where \(\:s\left(c\right)\) indicates the scalar representing the average activation of channel \(\:c\). The resulting vector \(\:s\) contains \(\:C\) values, one for each feature channel. This operation aggregates the entire spatial context into a compact form, allowing the network to learn global channel dependencies. The output vector \(\:s\) from the squeeze step is passed through a bottleneck structure formed by two fully connected layers. This structure introduces non-linearity and compresses inter-channel relationships. Mathematically the FC layers are formulated as

$$\:z={\updelta\:}\left({W}_{1}s+{b}_{1}\right),\hspace{1em}z\in\:{R}^{{C}_{r}}$$
(11)
$$\:a={\upsigma\:}\left({W}_{2}z+{b}_{2}\right),\hspace{1em}a\in\:{R}^{C}$$
(12)

where \(\:{W}_{1}\in\:{R}^{{C}_{r}\times\:C}\) indicates the weight matrix reducing the dimensionality from \(\:C\) to \(\:{C}_{r}\), \(\:{W}_{2}\in\:{R}^{C\times\:{C}_{r}}\) indicates the weight matrix expanding the reduced representation back to \(\:C\), \(\:{b}_{1}\), \(\:{b}_{2}\) indicates the respective bias terms, \(\:{\updelta\:}\left(\cdot\:\right)\) indicates the activation function, \(\:{\upsigma\:}\left(\cdot\:\right)\) indicates the sigmoid function to constrain outputs between 0 and 1, \(\:{C}_{r}\) indicates the reduced dimension, typically \(\:{C}_{r}=C/r\), where \(\:r\) is the reduction ratio, \(\:a\left(c\right)\) indicates the attention weight assigned to channel \(\:c\). This phase enables the model to learn which channels are more relevant by analyzing their global statistics and outputting a dynamic reweighting factor.

The final attention vector \(\:a\) is applied to the original feature map \(\:f\) via channel-wise multiplication which is mathematically formulated as \(\:{f}^{{\prime\:}}\left(:,:,c\right)=a\left(c\right)\cdot\:f\left(:,:,c\right)\) in which \(\:{f}^{{\prime\:}}\in\:{R}^{H\times\:W\times\:C}\) indicates the recalibrated feature map, \(\:{f}^{{\prime\:}}\left(:,:,c\right)\) indicates the updated channel obtained by scaling all spatial values in channel \(\:c\) by the scalar \(\:a\left(c\right)\). This scaling step suppresses channels with lower importance and enhances the influence of channels carrying stronger predictive features. Also, it introduces dynamic, context-driven modulation that helps the network focus on content most indicative of video tampering in Deepfake regions.
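The three SE stages of Eqs. (10)-(12) and the channel-wise rescaling can be expressed compactly in Keras as follows; the reduction ratio r = 4 is an assumed value.

```python
# Minimal sketch of a Squeeze-and-Excitation block: global average
# pooling (Eq. 10), a bottleneck of two dense layers (Eqs. 11-12),
# and channel-wise rescaling of the feature map.
from tensorflow.keras import layers

def squeeze_excite(f, r=4):
    C = f.shape[-1]
    s = layers.GlobalAveragePooling2D()(f)            # squeeze: s(c), Eq. (10)
    z = layers.Dense(C // r, activation="relu")(s)    # excitation bottleneck, Eq. (11)
    a = layers.Dense(C, activation="sigmoid")(z)      # attention weights a(c), Eq. (12)
    a = layers.Reshape((1, 1, C))(a)
    return layers.Multiply()([f, a])                  # f'(:,:,c) = a(c) * f(:,:,c)
```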

The SE module adaptively learns to highlight or suppress feature channels by embedding spatial context into compact descriptors. This is achieved by analyzing inter-channel dependencies and re-scaling feature responses. The lightweight design aligns well with MobileNetV3’s architecture, which makes it ideal for real-time applications like Deepfake classification. After several blocks of separable convolution and attention, the final feature representation is compressed to a lower-dimensional vector suitable for sequence-level modeling, which is mathematically formulated as

$$\:{z}_{t}={\upvarphi\:}\left({\widehat{F}}_{t}^{\left(resized\right)}\right)\in\:{R}^{d}$$
(13)

where \(\:{\upvarphi\:}\) indicates the entire MobileNetV3 function representing all layers from input to projection, \(\:{z}_{t}\) indicates the feature vector encoding the semantic content of frame \(\:t\), \(\:d\) indicates the dimensionality of the final embedding.

The feature extraction stage using MobileNetV3 plays a critical role in capturing semantic information from reconstructed frames that have undergone intra prediction. Beyond general spatial representations, the network is designed to extract a diverse set of feature types that are particularly relevant to identifying fake video content. These include high-frequency edge discontinuities, which often emerge along manipulated facial contours; non-uniform illumination patterns, which may result from inconsistent lighting artifacts introduced during face synthesis; and regional texture mismatches, which frequently occur due to blending operations between synthetic and authentic regions. MobileNetV3’s architecture, which includes depthwise separable convolutions and embedded Squeeze-and-Excitation (SE) blocks, further allows the model to isolate channel-wise variations that carry discriminative cues such as color aberrations, surface gloss mismatches, and local blur. These features are not isolated at a single scale but are captured hierarchically across multiple layers, enabling the network to retain both low-level detail and abstract representations. This multi-level encoding is essential for identifying inconsistencies that vary in spatial scale and context, which often go undetected in single-layer CNNs. By passing these semantically rich feature maps into the temporal modeling block, the model retains fine spatial cues necessary for deepfake classification under various compression and resolution conditions.
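For concreteness, the per-frame mapping \(\:{\upvarphi\:}\) of Eq. (13) can be realized with the Keras MobileNetV3 backbone as sketched below. The Small variant, the 224 × 224 input resolution, and ImageNet initialization are assumptions made for illustration; the paper states only that MobileNetV3 is used for feature extraction.

```python
# Sketch of the per-frame feature extractor phi (Eq. 13) using the
# Keras MobileNetV3-Small backbone with global average pooling.
import tensorflow as tf

backbone = tf.keras.applications.MobileNetV3Small(
    input_shape=(224, 224, 3), include_top=False, pooling="avg")

def extract_frame_features(reconstructed_frames):
    """Map a batch of intra-predicted frames F_hat_t to vectors z_t in R^d."""
    x = tf.image.resize(reconstructed_frames, (224, 224))
    # Recent Keras versions of MobileNetV3 include their own input
    # preprocessing, so pixel values in roughly [0, 255] are assumed.
    return backbone(x, training=False)      # shape (batch, d)
```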

Sequence modeling with temporal attention and convolutional GRU

This phase of the proposed Deepfake detection framework is focused on learning the temporal dynamics across consecutive frames in a video. While previous stages handle frame-wise feature extraction, this stage models dependencies between frames—critical for detecting inconsistencies introduced by manipulation techniques such as frame blending, temporal flickering, or unnatural motion transitions. A hybrid approach combining temporal attention and convolutional gated recurrent units (ConvGRUs) is used to effectively model both short-term and long-range temporal patterns. From the previous stage, each frame \(\:t\) is encoded into a compact feature vector \(\:{z}_{t}\in\:{R}^{d}\). The full sequence of such vectors forms the input for temporal modeling which is mathematically expressed as \(\:Z=\{{z}_{1},{z}_{2},\dots\:,{z}_{T}\}\) in which \(\:{z}_{t}\) indicates the deep feature representation of frame \(\:t\), extracted via MobileNetV3, \(\:T\) indicates the total number of frames in the sequence, \(\:d\) indicates the dimensionality of each feature vector. This sequence is used to model temporal relationships across frames and to identify inconsistencies specific to Deepfake manipulation.

Temporal attention mechanism

Before feeding the sequence into the recurrent unit, an attention mechanism is applied to assign an importance weight to each frame’s feature representation based on its relevance to the overall video context. The attention score computation is mathematically formulated as

$$\:{e}_{t}={v}^{T}\cdot\:\text{tanh}\left({W}_{a}{z}_{t}+{b}_{a}\right)$$
(14)

Further weight normalization is formulated as

$$\:{{\upalpha\:}}_{t}=\frac{\text{exp}\left({e}_{t}\right)}{{\sum\:}_{k=1}^{T}\text{exp}\left({e}_{k}\right)}$$
(15)

where \(\:{e}_{t}\) indicates the unnormalized attention score for frame \(\:t\), \(\:v\in\:{R}^{h}\) indicates the learnable vector used to project the transformed feature, \(\:{W}_{a}\in\:{R}^{h\times\:d}\) indicates the learnable weight matrix, \(\:{b}_{a}\in\:{R}^{h}\) indicates the bias vector, \(\:{{\upalpha\:}}_{t}\) indicates the normalized attention weight for frame \(\:t\), \(\:h\) indicates the size of the hidden projection used for scoring. This mechanism helps highlight frames with strong evidence of manipulation, allowing the model to concentrate more on them during sequence modeling. Further the frame features are weighted by their attention scores to form a context vector which is mathematically expressed as

$$\:c={\sum\:}_{t=1}^{T}{{\upalpha\:}}_{t}\cdot\:{z}_{t}$$
(16)

where \(\:c\in\:{R}^{d}\) indicates the aggregated representation summarizing the entire sequence, guided by attention weights. This vector provides a temporally aware summary of the video segment, emphasizing frames likely influenced by synthetic alterations.
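A minimal Keras layer implementing the scoring, normalization, and aggregation of Eqs. (14)-(16) could look as follows; the hidden projection size is an illustrative choice.

```python
# Sketch of the temporal attention mechanism: a learned score for
# each frame feature z_t (Eq. 14), softmax normalization over time
# (Eq. 15), and an attention-weighted context vector c (Eq. 16).
import tensorflow as tf
from tensorflow.keras import layers

class TemporalAttention(layers.Layer):
    def __init__(self, hidden=128, **kwargs):
        super().__init__(**kwargs)
        self.W_a = layers.Dense(hidden, activation="tanh")  # tanh(W_a z_t + b_a)
        self.v = layers.Dense(1, use_bias=False)            # projection by v

    def call(self, Z):                        # Z: (batch, T, d)
        e = self.v(self.W_a(Z))               # unnormalized scores e_t, Eq. (14)
        alpha = tf.nn.softmax(e, axis=1)      # attention weights alpha_t, Eq. (15)
        c = tf.reduce_sum(alpha * Z, axis=1)  # context vector c, Eq. (16)
        return c, alpha
```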

After attention weighting, the full sequence \(\:Z\) is processed through a Convolutional Gated Recurrent Unit (ConvGRU), which combines convolution operations with the memory and gating principles of traditional GRUs. ConvGRUs preserve spatial structure during recurrence, making them ideal for sequences of feature maps or dense vectors. At each time step \(\:t\), the recurrent computations are mathematically expressed as

$$\:{r}_{t}={\upsigma\:}\left({W}_{r}\text{*}{z}_{t}+{U}_{r}\text{*}{h}_{t-1}+{b}_{r}\right)$$
(17)

Reset Gate

$$\:{u}_{t}={\upsigma\:}\left({W}_{u}\text{*}{z}_{t}+{U}_{u}\text{*}{h}_{t-1}+{b}_{u}\right)$$
(18)

Update Gate

$$\:{\stackrel{\sim}{h}}_{t}=\text{tanh}\left({W}_{h}\text{*}{z}_{t}+{U}_{h}\text{*}\left({r}_{t}\odot\:{h}_{t-1}\right)+{b}_{h}\right)$$
(19)

Candidate Activation

$$\:{h}_{t}=\left(1-{u}_{t}\right)\odot\:{h}_{t-1}+{u}_{t}\odot\:{\stackrel{\sim}{h}}_{t}$$
(20)

Final State

where \(\:{z}_{t}\) indicates the current input at time \(\:t\), \(\:{h}_{t-1},{h}_{t}\in\:{R}^{d}\) indicates the previous and current hidden states, \(\:{r}_{t}\) indicates the reset gate vector, \(\:{u}_{t}\) indicates the update gate vector, \(\:{\stackrel{\sim}{h}}_{t}\) indicates the candidate activation, \(\:{W}_{*},{U}_{*}\) indicates the convolution-based weight matrices for input and recurrent connections, \(\:{b}_{*}\) indicates the respective biases, \(\:{\upsigma\:}\left(\cdot\:\right)\) indicates the sigmoid activation function, \(\:{tanh}\left(\cdot\:\right)\) indicates the hyperbolic tangent activation, \(\:\odot\:\) indicates the element-wise multiplication. This structure selectively retains relevant temporal features while filtering out irrelevant or redundant patterns. The output of the ConvGRU across all time steps is mathematically expressed as \(\:H=\{{h}_{1},{h}_{2},\dots\:,{h}_{T}\}\). The last state \(\:{h}_{T}\) is used as the final summary vector which is expressed as \(\:\widehat{h}={h}_{T}\) in which \(\:H\) indicates the sequence of all hidden states, \(\:\widehat{h}\) indicates the final hidden state summarizing the entire frame sequence. This output vector encodes both spatial and temporal features which are fine-tuned by attention weighting and recurrent memory. Finally, it is processed in the classification stage to determine whether the input sequence is original or modified.
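A single ConvGRU time step implementing Eqs. (17)-(20) can be sketched as below, assuming the per-frame features retain a spatial grid so that the gates are computed with convolutions; the kernel size and channel count are illustrative.

```python
# Sketch of one ConvGRU time step: reset gate (Eq. 17), update gate
# (Eq. 18), candidate activation (Eq. 19), and final state (Eq. 20),
# with convolutional weight operators W_* and U_*.
import tensorflow as tf
from tensorflow.keras import layers

class ConvGRUCellStep(layers.Layer):
    def __init__(self, channels, **kwargs):
        super().__init__(**kwargs)
        conv = lambda: layers.Conv2D(channels, 3, padding="same")
        self.Wr, self.Ur = conv(), conv()   # reset-gate weights
        self.Wu, self.Uu = conv(), conv()   # update-gate weights
        self.Wh, self.Uh = conv(), conv()   # candidate weights

    def call(self, z_t, h_prev):
        r_t = tf.sigmoid(self.Wr(z_t) + self.Ur(h_prev))          # Eq. (17)
        u_t = tf.sigmoid(self.Wu(z_t) + self.Uu(h_prev))          # Eq. (18)
        h_tilde = tf.tanh(self.Wh(z_t) + self.Uh(r_t * h_prev))   # Eq. (19)
        return (1.0 - u_t) * h_prev + u_t * h_tilde               # Eq. (20)
```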

Classification layer

The final stage in the Deepfake detection and classification approach transforms the temporally modeled feature sequence into a final decision. This step incorporates a fully connected classification layer to assign each video segment a class label indicating whether it is original or modified. The classifier processes the output of the temporal modeling block and maps it to a probability distribution over the predefined classes.

The input to this stage is the output vector obtained from the temporal modeling module. If the final hidden state from the Convolutional Gated Recurrent Unit (ConvGRU) is used, this vector is mathematically expressed as \(\:\widehat{h}\in\:{R}^{d}\) in which \(\:\widehat{h}\) indicates the temporally encoded representation of the entire frame sequence, \(\:d\) indicates the dimensionality of the hidden state vector output by the ConvGRU. This vector encodes both spatial and temporal characteristics of the video, having passed through the attention and recurrent layers. Before assigning class labels, a linear transformation is applied to the feature vector \(\:\widehat{h}\). This transformation computes the unnormalized prediction scores which is mathematically expressed as

$$\:o={W}_{o}\widehat{h}+{b}_{o}$$
(21)

where \(\:o\in\:{R}^{K}\) indicates the logit vector containing raw scores for each of the \(\:K\) possible classes, \(\:{W}_{o}\in\:{R}^{K\times\:d}\) indicates the weight matrix of the classification layer, \(\:{b}_{o}\in\:{R}^{K}\) indicates the bias vector associated with each class, \(\:K\) indicates the number of output classes. This transformation creates a mapping from the learned feature space to the decision space, preparing the data for probabilistic interpretation. The logit vector \(\:o\) is passed through a softmax function to convert the raw scores into a probability distribution across the output classes which is mathematically expressed as

$$\:{\widehat{y}}_{k}=\frac{\text{exp}\left({o}_{k}\right)}{{\sum\:}_{j=1}^{K}\text{exp}\left({o}_{j}\right)},\hspace{1em}\text{for\:}k=\text{1,2},\dots\:,K$$
(22)

where \(\:{\widehat{y}}_{k}\) indicates the predicted probability of the input belonging to class \(\:k\), \(\:{o}_{k}\) indicates the logit corresponding to class \(\:k\), \(\:\widehat{y}\in\:{R}^{K}\) indicates the vector containing probabilities for all classes. This function ensures that all class probabilities are positive and sum to one, enabling clear interpretation and threshold-based decision-making. The predicted label is determined by selecting the class with the highest probability which is mathematically formulated as

$$\:\widehat{c}=\text{arg}\underset{k}{\text{max}}\left({\widehat{y}}_{k}\right)$$
(23)

where \(\:\widehat{c}\) indicates the index of the predicted class, \(\:\text{arg}max\) indicates the operator that returns the index of the maximum value in the predicted probability vector. During the training phase, the model parameters are optimized using a loss function that compares predicted probabilities with the actual class labels. For classification, categorical cross-entropy is employed, which is mathematically expressed as

$$\:\mathcal{L}=-{\sum\:}_{k=1}^{K}{y}_{k}\cdot\:\text{log}\left({\widehat{y}}_{k}\right)$$
(24)

where \(\:{y}_{k}\) indicates the ground-truth label for class \(\:k\), \(\:{\widehat{y}}_{k}\) indicates the predicted probability for class \(\:k\), \(\:\mathcal{L}\) indicates the loss value for a single input sample.
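The classification head of Eqs. (21)-(24) reduces to a dense projection, a softmax, an argmax, and the categorical cross-entropy loss, as sketched below with K = 2 classes (real and fake).

```python
# Sketch of the classification layer: logits o = W_o h_hat + b_o
# (Eq. 21), softmax probabilities (Eq. 22), argmax prediction
# (Eq. 23), and categorical cross-entropy training loss (Eq. 24).
import tensorflow as tf
from tensorflow.keras import layers

K = 2                                               # real vs. fake
classifier = layers.Dense(K)                        # logits o, Eq. (21)
loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)  # Eq. (24)

def classify(h_hat, y_true=None):
    logits = classifier(h_hat)                      # raw class scores
    y_prob = tf.nn.softmax(logits, axis=-1)         # Eq. (22)
    y_pred = tf.argmax(y_prob, axis=-1)             # Eq. (23)
    loss = loss_fn(y_true, logits) if y_true is not None else None
    return y_pred, y_prob, loss                     # y_true is one-hot encoded
```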

Model update using RMSprop optimizer

The final component in the Deepfake detection framework involves updating the model parameters to improve classification accuracy over successive training iterations. This is achieved through a gradient-based optimization technique. The RMSprop (Root Mean Square Propagation) optimizer is selected due to its effectiveness in handling non-stationary objectives and its ability to maintain stable learning rates for each parameter independently. It is particularly useful for processing sequential data where gradient magnitudes fluctuate across different layers or time steps. Once the classification output \(\:\widehat{y}\) is compared with the ground truth \(\:y\), the loss \(\:\mathcal{L}\) is computed using the loss function. The next step is to calculate the partial derivatives of the loss with respect to each trainable parameter \(\:{\uptheta\:}\), which is mathematically formulated as

$$\:{g}_{t}=\frac{\partial\:\mathcal{L}}{\partial\:{{\uptheta\:}}_{t}}$$
(25)

where \(\:{g}_{t}\) indicates the gradient of the loss at training step \(\:t\), \(\:{{\uptheta\:}}_{t}\) indicates the model parameter at iteration \(\:t\), \(\:\mathcal{L}\) indicates the loss value resulting from incorrect prediction. These gradients indicate the direction and magnitude of change needed to reduce the loss. RMSprop maintains an exponentially weighted moving average of the squared gradients for each parameter. This helps in adapting the learning rate by scaling it inversely to the magnitude of recent gradient values. Mathematically it is expressed as

$$\:E{\left[{g}^{2}\right]}_{t}={\uprho\:}\cdot\:E{\left[{g}^{2}\right]}_{t-1}+\left(1-{\uprho\:}\right)\cdot\:{g}_{t}^{2}$$
(26)

where \(\:E{\left[{g}^{2}\right]}_{t}\) indicates the moving average of the squared gradients at time \(\:t\), \(\:{\uprho\:}\) indicates the decay rate, \(\:{g}_{t}^{2}\) indicates the element-wise square of the gradient at time \(\:t\), \(\:E{\left[{g}^{2}\right]}_{t-1}\) indicates the previous moving average value. This step smooths the gradient behavior and prevents sudden jumps during training caused by large updates. Using the computed gradient and the running average of squared gradients, the model parameters are updated as follows

$$\:{{\uptheta\:}}_{t+1}={{\uptheta\:}}_{t}-\frac{{\upeta\:}}{\sqrt{E{\left[{g}^{2}\right]}_{t}+{\upepsilon\:}}}\cdot\:{g}_{t}$$
(27)

where \(\:{\theta\:}_{t+1}\) indicates the updated parameter after applying the optimization step, \(\:\eta\:\) indicates the learning rate, \(\:\epsilon\:\) indicates a small constant which is added to prevent division by zero, \(\:\sqrt{E{\left[{g}^{2}\right]}_{t}+\epsilon\:}\) indicates the normalization factor that adapts the learning rate per parameter, \(\:{g}_{t}\) indicates the current gradient value. This adaptive scaling allows parameters with consistently large gradients to be updated slowly and those with small gradients to be updated more aggressively. RMSprop normalizes each gradient independently and helps the model to converge faster even when processing long frame sequences or learning temporal dependencies in manipulated videos. These update steps are repeated across multiple epochs and batches of training data. This continual refinement leads the model towards optimal parameter values that minimize the classification loss and improve its ability to generalize to unseen Deepfake patterns.
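The per-parameter update of Eqs. (25)-(27) can be written in a few lines of NumPy, which in the actual pipeline corresponds to the built-in tf.keras.optimizers.RMSprop; the decay rate, learning rate, and epsilon used here are common default values assumed for illustration.

```python
# NumPy sketch of one RMSprop update for a single parameter tensor:
# moving average of squared gradients (Eq. 26) and the scaled
# parameter step (Eq. 27), given the gradient g_t (Eq. 25).
import numpy as np

def rmsprop_step(theta, grad, avg_sq, rho=0.9, eta=1e-3, eps=1e-8):
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2        # Eq. (26)
    theta = theta - eta / np.sqrt(avg_sq + eps) * grad     # Eq. (27)
    return theta, avg_sq

theta = np.zeros(4)                                        # illustrative parameters
avg_sq = np.zeros(4)
grad = np.array([0.5, -0.2, 0.0, 1.0])                     # g_t from backpropagation
theta, avg_sq = rmsprop_step(theta, grad, avg_sq)
```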

This model integrates spatial compression to highlight discrepancies in structure, leverages deep networks for semantic extraction, and exploits sequential dependencies for temporal validation. The use of intra-prediction-based processing ensures computational efficiency, while the hybrid temporal classifier effectively captures manipulation artifacts common in Deepfake content—delivering robust classification performance in real-time streaming environments.

Algorithm 1
figure a

Pseudocode for the proposed Deepfake Detection and Classification model.

The key innovation in the proposed model lies in its effective handling of video compression, which brings significant challenges in deepfake detection due to the loss of fine visual clues. Unlike traditional methods that process frames directly without considering the distortion introduced during encoding, the proposed model performs intra-frame compression using a hybrid convolutional autoencoder. This module is specifically designed to perform block-based compression by dividing each frame into fixed-size segments and reconstructing them individually. This reconstruction process not only reduces dimensionality but also highlights complex irregularities that emerge from manipulation, such as unnatural blending or inconsistent textures, which are typically flattened or obscured during standard compression procedures. To strengthen this process, the autoencoder is trained to capture high-frequency loss patterns by learning spatial redundancy through its encoder layers and restoring structural integrity via a decoder that re-emphasizes texture inconsistencies. This adaptive mechanism serves as a better alternative to mathematical transforms like DCT and provides better generalization under varied compression levels and formats. Moreover, by embedding this compression-aware prediction, the model ensures that IP-GTA Net outperforms frequency-aware CNNs and static-frame models, especially when dealing with re-encoded or low-bitrate videos often found in online environments.

Results and discussion

The experimentation for the proposed intra prediction-based Deepfake detection model was carried out in a carefully structured simulation environment using Python and the TensorFlow-Keras framework. All computations and model training were conducted on a workstation equipped with an NVIDIA GPU to facilitate real-time processing efficiency. Initially, a collection of Deepfake and authentic videos was acquired from publicly available datasets. The initial video compression for intra-frame prediction is performed by a hybrid convolutional autoencoder. Following reconstruction, each frame was resized and passed through MobileNetV3 to derive compact feature vectors, which were temporally ordered for each video segment. These frame-wise features were then processed using a hybrid temporal structure composed of an attention mechanism followed by a convolutional gated recurrent network to capture temporal inconsistencies commonly found in manipulated content. The classification module produced a final decision based on the temporal output, with predictions evaluated against known labels. Model optimization was performed using the RMSprop algorithm, and metrics such as accuracy, precision, recall, F1-score, and AUC were recorded. Hyperparameters were tuned empirically through iterative validation runs, and the performance was benchmarked against conventional models to validate the superiority of the proposed architecture.
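As an illustration of the frame-level feature extraction step, the snippet below uses a frozen MobileNetV3Small backbone with global average pooling; the 224 × 224 input size and the use of ImageNet weights are assumptions made for this sketch rather than the reported configuration.

```python
import numpy as np
import tensorflow as tf

# Frozen MobileNetV3Small backbone used as a per-frame feature extractor.
# The 224 x 224 input size and ImageNet weights are assumptions for this sketch.
backbone = tf.keras.applications.MobileNetV3Small(
    input_shape=(224, 224, 3), include_top=False,
    weights="imagenet", pooling="avg")
backbone.trainable = False

def video_to_feature_sequence(frames):
    """frames: (T, 224, 224, 3) array of temporally ordered frames.
    Returns a (T, D) matrix of compact per-frame feature vectors."""
    return backbone.predict(np.asarray(frames, dtype=np.float32), verbose=0)

# Example with dummy frames standing in for a decoded video segment
features = video_to_feature_sequence(np.random.randint(0, 256, (16, 224, 224, 3)))
```

The resulting feature sequence, kept in temporal order, is what the attention and convolutional gated recurrent stages operate on.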

Table 2 Simulation hyperparameters of proposed model.

The dataset employed in this study includes two structured layers of information: the video-level data and the frame-level representation. Initially, two directories, Celeb-real and Celeb-synthesis, were utilized, each containing 127 short video clips spanning durations of 10 to 15 s. These folders represent genuine and artificially generated facial content, respectively, providing a balanced pool of source material. From these videos, frames were extracted and stored in separate folders labeled Real and Fake, each consisting of 1247 image frames. All images are formatted in RGB with a resolution of 942 × 500 pixels, a bit depth of 24, and a spatial density of 96 dots per inch both horizontally and vertically. This dual-level structure allows both temporal and spatial feature extraction, critical for detecting synthetic inconsistencies. The frame-level separation ensures the model can process individual images while preserving the context derived from their sequence in video form. The dataset thus supports end-to-end training and evaluation by aligning visual quality, manipulation type, and structural consistency across both time and space, which is essential for robust and real-time Deepfake classification.
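A minimal sketch of the frame extraction stage is shown below, assuming OpenCV is available; the sampling step, output format, and file names (e.g. clip_001.mp4) are hypothetical and only indicate how the Real and Fake frame folders could be populated from the two video directories.

```python
import os
import cv2  # OpenCV

def extract_frames(video_path, out_dir, target_size=(942, 500), step=5):
    """Extract every `step`-th frame from a clip and save it as an image.
    The sampling step and naming scheme are illustrative assumptions."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frame = cv2.resize(frame, target_size)  # (width, height) in pixels
            name = f"{os.path.splitext(os.path.basename(video_path))[0]}_{saved:04d}.png"
            cv2.imwrite(os.path.join(out_dir, name), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Hypothetical usage mirroring the directory layout described above:
# extract_frames("Celeb-real/clip_001.mp4", "Real")
# extract_frames("Celeb-synthesis/clip_001.mp4", "Fake")
```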

Fig. 2. (a) Fake video frame (b) Detected face (c) Feature image (d) Intra-predicted frame.

Fig. 3. (a) Real video frame (b) Detected face (c) Feature image (d) Intra-predicted frame.

Figure 2 illustrates a manipulated video sample processed through a structured pipeline designed for deepfake detection. The first image shows the original frame extracted from an altered video segment. This frame is subjected to face detection, isolating the region of interest as shown in the second image, which focuses exclusively on the facial area to enhance the relevance of feature analysis. The third image presents the feature map generated by the convolutional layers of the feature extraction backbone, highlighting texture inconsistencies and spatial discontinuities often associated with manipulated visuals. These localized features, captured at multiple scales, serve as key indicators for distinguishing authentic content from fabricated sequences. The same workflow is uniformly applied across the dataset. Figure 3 presents an example of a real video frame following the identical processing route. The initial image captures an unaltered facial frame from a genuine sequence. The second image extracts the target facial area, while the third shows the resulting feature map, preserving structural continuity and surface consistency that reflect natural content patterns. The final column in both Figs. 2 and 3 displays the intra prediction outputs derived from the reconstruction-based transformation carried out by a hybrid convolutional autoencoder. In the manipulated frame, visible inconsistencies appear along the facial contours, particularly near high-detail regions such as the eyes and lips, where unnatural transitions and textural imbalances are observed. In contrast, the real frame shows smoother gradients and coherent luminance spread, indicating visual continuity. These differences, though minor at the pixel level, become more evident when processed through the autoencoder, which simulates the behavior of intra-frame compression. The encoder condenses the frame while the decoder attempts full reconstruction, thus exposing areas where manipulated regions disrupt normal spatial flow. This block-wise reconstruction provides compression-aware evidence of manipulation, reinforcing the model’s ability to differentiate tampered videos through subtle visual imprints.

Figure 4 presents a comparison of intra prediction evaluation metrics between manipulated (Fake) and original (Real) frames using the hybrid convolutional autoencoder applied for compression-aware reconstruction. The Compression Ratio values are nearly identical, with Fake frames recording 88.159 and Real frames at 88.241, indicating consistent reduction in data dimensions during encoding across both types. The Mean Squared Error (MSE) is lower for Fake frames at 106.004 compared to 117.225 for Real frames, suggesting that synthesized content, due to its smoother surface features, is easier to reconstruct and presents less pixel-level deviation. In terms of Peak Signal-to-Noise Ratio (PSNR), Fake frames register a slightly higher value of 28.381 dB than Real frames at 27.990 dB, pointing again to reduced variance in manipulated textures that leads to less reconstruction distortion. The Structural Similarity Index (SSIM) values are closely matched, with 0.883 for Fake and 0.875 for Real, showing that contrast, structure, and luminance are preserved well in both cases. Similarly, the Correlation Coefficient remains consistently high, with values of 0.981 and 0.980 for Fake and Real respectively, reflecting strong alignment between original and reconstructed frames. These subtle metric differences reinforce the challenge of relying solely on reconstruction quality for detection and highlight the need for integrated spatial and temporal modeling to expose deepfake alterations.

Fig. 4. Intra prediction performance metrics.

Fig. 5. Loss curve of the hybrid convolutional autoencoder model used for video compression intra-frame prediction.

Figure 5 shows the training and validation loss trends for the hybrid convolutional autoencoder applied in intra-frame prediction for video compression analysis. Initially, both losses begin above 0.045 and rapidly decrease within the first 10 epochs, reaching approximately 0.0075 for training and 0.0083 for validation. By epoch 20, the losses stabilize near 0.003, and continue with slight variation, ending below 0.0025 by epoch 50. The consistent proximity between the two curves indicates reliable generalization and minimal overfitting. These results confirm that the model efficiently reconstructs frames while preserving spatial detail, aligning with its role in compression-aware prediction.

While intra prediction evaluation metrics such as MSE, PSNR, SSIM, correlation coefficient, and compression ratio assess the reconstruction quality of frames in the proposed IP-GTA Net model, they do not on their own establish its compression resilience. A comparative analysis is therefore conducted against representative baseline methods, including DCT-based frequency CNNs, dual-stream RGB models, and frequency-fusion architectures. Quantitatively, IP-GTA Net achieves a Peak Signal-to-Noise Ratio (PSNR) of 28.38 dB and a Structural Similarity Index (SSIM) of 0.883, surpassing DCT-CNN models, which record lower PSNR values of around 26.4 dB and SSIM values below 0.87. Moreover, the Mean Squared Error (MSE) for IP-GTA Net remains the lowest at 106.00, reflecting improved fidelity in frame reconstruction under compressed conditions. A higher correlation coefficient (0.981) and compression ratio (88.16) further demonstrate the proposed model's ability to preserve content structure while simulating compression dynamics. These metrics confirm that IP-GTA Net not only reconstructs frames with greater accuracy but also captures localized inconsistencies more effectively than fixed-transform or static-frame models. The quantitative results of the compared compression techniques are presented in detail in Table 3.

Table 3 Comparative analysis of compression techniques.
Fig. 6. (a) Accuracy (b) Loss curves of proposed model for training and validation.

The training and validation accuracy plot depicted in Fig. 6 (a) demonstrates the learning behavior of the proposed model over 200 epochs. As the training progresses, the accuracy improves steadily, surpassing 70% after around 50 epochs. From epoch 100 onward, accuracy values consistently exceed 85%, with minimal deviation between the training and validation curves. By epoch 200, the model reaches a training accuracy of approximately 90% and validation accuracy slightly above 91%. The corresponding loss graph depicted in Fig. 6(b) exhibits a consistent downward trend for both training and validation sets, starting from an initial value above 0.75. Over time, the loss decreases steadily, crossing the 0.4 mark around epoch 75. Beyond this point, the loss continues to drop and stabilizes near 0.2 after 175 epochs. Although some minor fluctuations are observed in the validation loss curve between epochs 100 and 160, the overall pattern indicates convergence without divergence. The RMSprop optimizer effectively smooths gradient updates, while dropout regularization and reduced model complexity prevent overfitting. The final training and validation losses both converge around 0.2, reinforcing that the model maintains high classification confidence with minimal prediction uncertainty across unseen samples. This performance supports the effectiveness of the proposed hybrid structure for real-time Deepfake detection.
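The training behavior described above can be reproduced in outline with the simplified stand-in classifier sketched below; a plain GRU with dropout replaces the gated temporal attention and convolutional recurrent head purely for illustration, and the sequence length, feature dimension, learning rate, batch size, and dropout rate are assumptions rather than the tuned values in Table 2.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Simplified stand-in classifier over per-frame feature sequences.
T, D = 30, 576          # assumed sequence length and per-frame feature size
clf = models.Sequential([
    layers.GRU(64, input_shape=(T, D)),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),
])
clf.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
            loss="binary_crossentropy", metrics=["accuracy"])

# Dummy tensors stand in for the real training and validation sequences;
# the reported curves span 200 epochs, truncated here to keep the demo short.
x_tr, y_tr = np.random.rand(8, T, D), np.random.randint(0, 2, 8)
x_va, y_va = np.random.rand(4, T, D), np.random.randint(0, 2, 4)
history = clf.fit(x_tr, y_tr, validation_data=(x_va, y_va),
                  epochs=2, batch_size=4, verbose=0)
```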

Fig. 7. Confusion matrix for the proposed model for training and testing.

The confusion matrices depicted in Fig. 7(a) and (b) exhibit the classification performance of the proposed model on both training and testing datasets. For the training set, the model correctly identified 786 real and 820 fake samples, while misclassifying 90 real samples as fake and 50 fake samples as real. This indicates strong learning capability with minor confusion between categories. On the test set, the model achieved 327 correct predictions for real inputs and 351 for fake, with only 45 real samples misclassified as fake and 26 fake samples misclassified as real. The test performance demonstrates reliable generalization, supported by reduced misclassification in both categories. These results confirm the model's effectiveness in distinguishing subtle manipulation features while preserving high recognition accuracy across both known and unseen video segments. The incorporation of temporal modeling and intra-prediction-based preprocessing plays a key role in minimizing error margins during classification.
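The aggregate scores reported later in Table 4 follow directly from these counts; the short calculation below macro-averages the per-class precision, recall, and F1 over the real and fake classes, which reproduces the tabulated test values up to rounding.

```python
import math

# Test-set confusion counts reported above (fake treated as the positive class):
tp = 351   # fake frames correctly detected
tn = 327   # real frames correctly accepted
fp = 45    # real frames flagged as fake
fn = 26    # fake frames missed

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Per-class precision, recall, and F1, then macro-averaged over the two classes
p_fake, r_fake = tp / (tp + fp), tp / (tp + fn)
p_real, r_real = tn / (tn + fn), tn / (tn + fp)
f1_fake = 2 * p_fake * r_fake / (p_fake + r_fake)
f1_real = 2 * p_real * r_real / (p_real + r_real)
precision = (p_fake + p_real) / 2
recall = (r_fake + r_real) / 2
f1 = (f1_fake + f1_real) / 2

mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(round(accuracy, 4), round(precision, 4), round(recall, 4),
      round(f1, 4), round(mcc, 4))
# ≈ 0.9052, 0.9064, 0.905, 0.9051, 0.8114, consistent with Table 4
```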

Fig. 8. PR analysis for the proposed model for training process.

Fig. 9. PR analysis for the proposed model for testing process.

The precision-recall curves depicted in Figs. 8 and 9 reflect the model's classification performance for both real and fake categories on training and testing data. During training, shown in Fig. 8, the average precision (AP) for real samples reaches 0.9829, while fake samples yield an AP of 0.9821. This high level of precision, sustained across all recall values, indicates that the model maintains excellent confidence and balance in correctly identifying both classes. For the testing phase, shown in Fig. 9, the proposed model demonstrates robust performance with AP values of 0.9787 for real and 0.9789 for fake, confirming that the learned representation generalizes well to unseen data. The consistent curvature near the upper boundary and minimal drop-off at higher recall values suggest that false positives are effectively minimized and true positive identification remains strong even when recall approaches 1. These results are a direct consequence of integrating compression-aware intra-predicted inputs, spatial feature encoding, and temporal attention mechanisms, all of which collectively reinforce the model's ability to distinguish fine-grained manipulation artifacts with minimal degradation under real-world testing conditions.

Fig. 10. ROC curve for the proposed model for training process.

Fig. 11. ROC curve for the proposed model for testing process.

The ROC curves depicted in Figs. 10 and 11 exhibit the classification sensitivity and specificity of the model for both training and testing stages. On the training set results given in Fig. 10, the model achieves an area under the curve (AUC) of 0.9817 for both real and fake classes, indicating near-perfect separation between categories with minimal overlap. This suggests that the model correctly identifies manipulated and genuine content with a very low false positive rate. On the test set results given in Fig. 11, the AUC scores decrease slightly to 0.9777 for both classes, which still reflects strong generalization capability. The proposed model consistently achieves high true positive rates while maintaining low false alarm rates.
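Assuming per-sample prediction scores are available, the PR and ROC summaries shown in Figs. 8, 9, 10 and 11 can be computed with standard scikit-learn utilities, as sketched below with dummy labels and scores.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Dummy per-sample labels (1 = fake, 0 = real) and prediction scores stand in
# for the classifier outputs used to draw the PR and ROC curves.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.10, 0.32, 0.85, 0.67, 0.91, 0.45, 0.58, 0.22])

ap_fake = average_precision_score(y_true, y_score)          # PR summary, fake as positive
ap_real = average_precision_score(1 - y_true, 1 - y_score)  # PR summary, real as positive
auc = roc_auc_score(y_true, y_score)                         # ROC summary
print(ap_fake, ap_real, auc)
```

Note that the ROC AUC is invariant to which class is treated as positive, which is consistent with identical AUC values being quoted for the real and fake curves.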

Table 4 Proposed model performances on training and testing.

Table 4 presents key performance indicators that assess the proposed model's effectiveness in classifying Deepfake and real video data. On the training set, the model achieves an accuracy of 91.98%, with precision and recall closely aligned at 92.06% and 91.99%, respectively, resulting in an F1-score of 91.98%. On the testing set, accuracy reaches 90.52%, while precision and recall drop slightly to 90.64% and 90.50%, yielding an F1-score of 90.51%. The Matthews Correlation Coefficient (MCC), which evaluates the overall quality of the classifications, is 0.8405 for training and 0.8114 for testing. This strong performance across all metrics, with minimal drop between training and testing, demonstrates that the model maintains generalization without overfitting.

Table 5 Simulation hyperparameters of existing methods.

The comparative analysis presented in Fig. 12 demonstrates that the proposed IP-GTA Net achieves better precision than all existing models. Specifically, IP-GTA Net reaches a precision of 0.9064, exceeding XceptionNet (0.8795), Two-Stream CNN (0.8612), EfficientNet-B0 (0.8467), and Capsule Network (0.8248). This improvement is attributed to the inclusion of intra-frame prediction coupled with gated temporal attention, which allows the model to retain the fine-grained spatial-temporal cues necessary for discriminating subtle manipulations in deepfake sequences.

Fig. 12. Comparative analysis of precision metric.

Fig. 13. Comparative analysis of recall metric.

The recall analysis given in Fig. 13 confirms that the proposed IP-GTA Net model outperforms all existing methods across training epochs, demonstrating a superior ability to identify manipulated video content. At the final epoch, IP-GTA Net achieves a recall of 0.905, significantly higher than XceptionNet at 0.8802, Two-Stream CNN at 0.8645, EfficientNet-B0 at 0.842, and Capsule Network at 0.821. This performance indicates that IP-GTA Net is more effective at minimizing false negatives, which is critical in deepfake detection where missing altered frames can compromise trust. The improved recall stems from the integration of intra prediction and gated temporal attention, which collectively preserve structural coherence and contextual flow across frames.

Fig. 14. Comparative analysis of F1-score metric.

The F1-score analysis presented in Fig. 14 highlights the superior balance achieved by the proposed IP-GTA Net model between precision and recall in the task of deepfake classification. As shown in the analysis, IP-GTA Net exhibits consistent improvement throughout training and reaches a final F1-score of 0.9051, which surpasses all existing benchmarks. XceptionNet trails behind at 0.8798, followed by Two-Stream CNN at 0.8628, EfficientNet-B0 at 0.8443, and Capsule Network at 0.8229. The enhanced performance of IP-GTA Net is due to its integrated intra-frame prediction mechanism and gated temporal attention module, which allow for efficient spatial-temporal feature alignment across video frames.

Fig. 15. Comparative analysis of accuracy metric.

The accuracy comparison depicted in Fig. 15 demonstrates the proposed IP-GTA Net's superior classification performance over all other models across epochs. Its final accuracy of 0.9052 outperforms XceptionNet (0.8821), Two-Stream CNN (0.8673), EfficientNet-B0 (0.8519), and Capsule Network (0.8382). This higher accuracy indicates the proposed model's improved ability to correctly identify both real and fake samples with minimal misclassification. The enhanced result is due to the intra prediction module, which enhances spatial structure consistency, and the gated temporal attention unit, which improves temporal dependency learning. These components collectively allow the model to analyze delicate frame-by-frame inconsistencies common in deepfake manipulations. In contrast, Capsule Network records the lowest accuracy, suggesting its dynamic routing mechanism is insufficient to handle the complex spatial and temporal variations present in manipulated sequences.

Table 6 Proposed model performances compared with existing methods.

The comparative analysis given in Table 6 highlights that the proposed IP-GTA Net achieves superior results across all evaluation metrics. It records the highest accuracy of 0.9052, while XceptionNet and Capsule Network fall behind with 0.8821 and 0.8382 respectively. In terms of precision, the proposed model reaches 0.9064, outperforming XceptionNet at 0.8795 and EfficientNet-B0 at 0.8467, indicating fewer false positives. The recall value of 0.905 confirms IP-GTA Net’s strong capability in detecting true positives, surpassing the next best score of 0.8802 from XceptionNet. The balanced nature of its classification is evident in the F1-score of 0.9051, ahead of all others. The proposed model also secures the highest Matthews Correlation Coefficient at 0.8114, reflecting improved robustness and reliability. These improvements stem from the use of intra prediction and gated temporal attention, which enhance contextual feature integration. Capsule Network scores the lowest across most metrics, likely due to limitations in capturing long-term dependencies in sequential data.

Conclusion

The proposed research introduces a robust deepfake detection framework, IP-GTA Net, which integrates intra prediction mechanisms with gated temporal attention to enhance the learning of spatial and sequential inconsistencies in manipulated videos. The approach begins with intra-frame reconstruction, where each video frame is processed using an encoder–decoder network that simulates the behavior of video compression, revealing subtle distortions and inconsistencies, followed by deep feature extraction through MobileNetV3 and classification using a gated convolutional GRU model optimized via RMSprop. Experiments were conducted on a dataset composed of real and synthetically generated videos from the Celeb-real and Celeb-synthesis repositories, comprising over 120 video sequences in each category and thousands of processed frames. The proposed model achieved superior test accuracy of 0.9052, precision of 0.9064, recall of 0.905, F1-score of 0.9051, and MCC of 0.8114, outperforming benchmark models including XceptionNet, Two-Stream CNN, EfficientNet-B0, and Capsule Network across all metrics. The results validate the effectiveness of the proposed temporal and structural analysis pipeline in discerning subtle artifacts common in deepfakes. However, the current system is limited by the dataset’s diversity, primarily featuring frontal face manipulations. In future work, the model can be extended to multi-modal datasets and real-time detection settings, while exploring transformer-based sequence encoding and broader manipulative techniques such as audio-visual mismatches and 3D facial retargeting.

Appendix A

Acronym – Definition

IP-GTA Net – Intra Prediction–Gated Temporal Attention Network
RGB – Red, Green, Blue (Color Channels)
MSE – Mean Squared Error
PSNR – Peak Signal-to-Noise Ratio
SSIM – Structural Similarity Index
CR – Compression Ratio
MCC – Matthews Correlation Coefficient
GRU – Gated Recurrent Unit
ConvGRU – Convolutional Gated Recurrent Unit
SE Block – Squeeze-and-Excitation Block
DCT – Discrete Cosine Transform
CNN – Convolutional Neural Network
ReLU – Rectified Linear Unit
AUC – Area Under the Curve
ROC – Receiver Operating Characteristic
PR Curve – Precision–Recall Curve
FC Layer – Fully Connected Layer
BN – Batch Normalization
CE Loss – Cross-Entropy Loss
ViT – Vision Transformer
HEVC – High Efficiency Video Coding (H.265)
H.264 – Advanced Video Coding (AVC) Standard
JPEG – Joint Photographic Experts Group (Image Compression Standard)

Appendix B

Evaluation metrics for frame quality and consistency

To determine the accuracy of reconstruction and detect potential irregularities, several metrics are applied as follows.

  • Mean Squared Error (MSE): It measures the average squared difference between the original and reconstructed frame.

$$\:MSE=\frac{1}{HW}{\sum\:}_{x=1}^{H}{\sum\:}_{y=1}^{W}{\left({F}_{t}\left(x,y\right)-\widehat{{F}_{t}}\left(x,y\right)\right)}^{2}$$

where \(\:{F}_{t}\left(x,y\right)\) is the original frame pixel at position \(\:\left(x,y\right)\), \(\:\widehat{{F}_{t}}\left(x,y\right)\) is the corresponding reconstructed pixel, and \(\:H\) and \(\:W\) are the frame height and width.

  • Peak Signal-to-Noise Ratio (PSNR): It indicates how much signal content remains after transformation; higher values suggest better preservation.

$$\:PSNR=10\cdot\:{\text{log}}_{10}\left(\frac{{255}^{2}}{MSE}\right)$$
  • Structural Similarity Index (SSIM): It compares the structure, luminance, and contrast of two images.

$$\:SSIM=\frac{\left(2{{\upmu\:}}_{F}{{\upmu\:}}_{\widehat{F}}+{c}_{1}\right)\left(2{{\upsigma\:}}_{F\widehat{F}}+{c}_{2}\right)}{\left({{\upmu\:}}_{F}^{2}+{{\upmu\:}}_{\widehat{F}}^{2}+{c}_{1}\right)\left({{\upsigma\:}}_{F}^{2}+{{\upsigma\:}}_{\widehat{F}}^{2}+{c}_{2}\right)}$$

where \(\:{\upmu\:}\) represents the mean intensity, \(\:{{\upsigma\:}}^{2}\) the variance, \(\:{{\upsigma\:}}_{F\widehat{F}}\) the covariance between the original and reconstructed frames, and \(\:{c}_{1},{c}_{2}\) are small stabilizing constants.

  • Correlation Coefficient (\(\:{\uprho\:}\)): It measures the linear dependence between the original and reconstructed frames; values close to 1 indicate strong similarity. Mathematically, it is formulated as

$$\:{\uprho\:}=\frac{\sum\:\left({F}_{t}-{{\upmu\:}}_{F}\right)\left(\widehat{{F}_{t}}-{{\upmu\:}}_{\widehat{F}}\right)}{\sqrt{\sum\:{\left({F}_{t}-{{\upmu\:}}_{F}\right)}^{2}\sum\:{\left(\widehat{{F}_{t}}-{{\upmu\:}}_{\widehat{F}}\right)}^{2}}}$$
  • Compression Ratio (CR): It evaluates the efficiency of the transformation process in terms of storage or transmission size.

$$\:CR=\frac{\text{Raw Bits}}{\text{Compressed Bits}}$$

Together, these metrics quantify reconstruction quality and help to highlight the structural features and texture inconsistencies that may indicate forgery.
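A compact NumPy sketch of these measures is given below; SSIM is omitted for brevity (a library routine such as skimage.metrics.structural_similarity can be used in practice), and the example frame size and bit counts are illustrative only.

```python
import numpy as np

def frame_quality_metrics(original, reconstructed, raw_bits, compressed_bits):
    """MSE, PSNR, correlation coefficient, and compression ratio for one frame.
    Frames are H x W arrays of 8-bit intensities (single-channel sketch)."""
    f = original.astype(np.float64)
    g = reconstructed.astype(np.float64)
    mse = np.mean((f - g) ** 2)
    psnr = 10.0 * np.log10(255.0 ** 2 / mse) if mse > 0 else float("inf")
    rho = np.sum((f - f.mean()) * (g - g.mean())) / np.sqrt(
        np.sum((f - f.mean()) ** 2) * np.sum((g - g.mean()) ** 2))
    cr = raw_bits / compressed_bits
    return mse, psnr, rho, cr

# Example with a synthetic 8-bit frame and a mildly perturbed "reconstruction"
rng = np.random.default_rng(0)
orig = rng.integers(0, 256, size=(500, 942), dtype=np.uint8)
recon = np.clip(orig + rng.normal(0, 5, orig.shape), 0, 255).astype(np.uint8)
print(frame_quality_metrics(orig, recon,
                            raw_bits=orig.size * 8, compressed_bits=85_000))
```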

Appendix C

figure b