Introduction

The rapid evolution of artificial intelligence (AI) and machine learning has not only revolutionised traditional computing paradigms but has also fundamentally transformed the field of image generation. Over the past decade, cutting-edge techniques have emerged that enable machines to produce images with astonishing realism. Models such as Generative Adversarial Networks (GANs)1,2 have been at the forefront of this revolution, introducing adversarial frameworks that pit two neural networks against each other to refine image quality iteratively. Similarly, Diffusion models3 have harnessed the power of stochastic processes to reverse-engineer high-quality images from noisy inputs. These technological breakthroughs have not only broadened the horizons of what is possible in digital content creation but have also spurred rapid innovation across various creative fields. In art and design, for instance, these generative models empower artists to explore novel aesthetics and generate complex visual patterns that were previously unimaginable. In the entertainment industry, they contribute to more immersive visual effects and realistic character renderings, enhancing the overall viewer experience. However, while these advancements have catalysed creativity and innovation, they have concurrently introduced a host of challenges that extend beyond the realm of art.

One of the most pressing concerns is the potential for misuse in the spread of misinformation and digital forgery. The hyper-realistic images produced by these models can be weaponised to create deceptive content that undermines public trust in media. As synthetic images become increasingly convincing, there is a growing risk that they could be deployed to manipulate opinions, distort facts, or even influence political outcomes. This erosion of public trust is particularly alarming in a digital age where visual media often serves as a primary source of information and evidence. Traditionally, classifiers trained on comprehensive datasets containing natural and AI-generated images have performed robustly within the confines of the known data distribution. These classifiers rely on patterns learned during training to distinguish between authentic and synthetic images accurately. However, a critical vulnerability arises when novel generative techniques produce images that fall outside the scope of these training datasets. As new methods evolve rapidly, they generate images with features and characteristics that differ significantly from those encountered during training. This phenomenon, known as the out-of-distribution (OOD) problem, poses a serious threat to the reliability of conventional classifiers. When faced with OOD images, classifiers may fail to recognise subtle deviations, allowing these images to bypass detection mechanisms4,5,6,7.

Recent research has underscored the fragility of current classification models in the face of such distributional shifts. Even minor changes in the data distribution can lead to marked performance degradation, highlighting the inherent challenges in reliably detecting OOD examples8,9. This degradation is particularly concerning in real-world applications where the data landscape continually evolves, and adversaries may deliberately exploit these vulnerabilities. The resulting uncertainty not only compromises the integrity of automated systems but also calls into question the broader trustworthiness of AI-driven decision-making processes.

This work addresses the challenge by building on existing research and introducing a unified framework that integrates complementary uncertainty measures. The framework combines entropy derived from Monte Carlo dropout, Gaussian Process predictive variance, and several Fisher Information-based metrics, namely, total Fisher information, the Frobenius norm of the Fisher Information Matrix, and Fisher entropy, to yield a single scalar uncertainty value for each sample. While MC dropout offers an ensemble-like estimate of epistemic uncertainty through multiple stochastic forward passes, its uncertainty estimates are often overly sensitive to dropout hyperparameters and may not correctly reflect changes in data density. In contrast, GP predictive variance provides a probabilistic measure of uncertainty that naturally decreases in regions with high training data density. This occurs because the GP employs a kernel function to quantify the similarity between a new test input and the training examples. Fisher Information metrics capture the local sensitivity of the model’s predictions to small parameter perturbations, indicating how well-determined the predictions are. An overview of this unified framework, including the flow of data from input image to the final accept/reject decision, is illustrated in Fig. 1. By employing Particle Swarm Optimisation (PSO) to optimally combine these diverse metrics, our framework leverages their complementary strengths to yield a well-calibrated uncertainty measure. This integrated scalar not only enhances the detector’s ability to reject or flag uncertain predictions, particularly when encountering novel or evolving AI-generated images, but also improves generalisation by incorporating both global data-driven confidence and local model sensitivity into a single metric.

The key contributions of this work are as follows:

  • First, this work proposes a comprehensive framework that combines multiple uncertainty measures, namely per-instance Fisher Information, entropy measures from MC dropout, and GP predictive variance via Deep Kernel Learning, into a single, cohesive scalar uncertainty score. This integration is achieved through Particle Swarm Optimisation, which leverages the complementary strengths of each metric.

  • Second, combining these uncertainty measures, our framework provides a more robust mechanism for filtering out-of-distribution (OOD) AI-generated images. This is particularly important in dynamic environments where the rapid emergence of new generative techniques may lead to images that deviate from the training distribution.

  • Third, the integrated uncertainty metric not only aids in detecting novel, potentially harmful AI images but also enhances overall classifier generalisation by incorporating both global data-driven confidence (via GP variance) and local model sensitivity (via Fisher Information and MC dropout entropy).

  • Fourth, this work demonstrates through extensive experiments that our integrated approach significantly improves the detection of OOD samples. The framework’s ability to combine multiple non-conformity measures into a single uncertainty value leads to more accurate and well-calibrated predictions compared to traditional classifiers.

Fig. 1 Overview of the proposed framework. Input images are preprocessed and passed through a feature extractor and task head for classification. Fisher Information, MC Dropout, and GP variance are computed in parallel, normalised, and combined via PSO to produce a unified uncertainty score for accept/reject decisions.

Related works

The detection of AI-generated images has gained significant attention due to the rapid advancement of generative models like GANs, diffusion models, and others. These models, including BigGAN, GLIDE, Stable Diffusion, and MidJourney, have revolutionised image synthesis but also pose risks, such as misinformation and digital forgery. Researchers have proposed diverse techniques to address the challenge of distinguishing AI-generated images from natural ones.

A notable direction has been using Convolutional Neural Networks (CNNs) for image classification, leveraging hierarchical feature representations to detect subtle differences. For instance, Zhu et al. proposed GenImage, a benchmark dataset to test the performance of classifiers in detecting synthetic images4. However, CNNs often rely on confidence scores derived from softmax outputs, which may not adequately capture out-of-distribution anomalies. Among CNN-based detection models, ResNet50 has been widely adopted to classify real and AI-generated images based on learned features10,11,12. Furthermore, the Patch Selection Module (PSM) enhances detection performance by leveraging global and local image features, integrating attention-based fusion mechanisms into the ResNet50 backbone network12. Alternative approaches employ handcrafted or statistical features. For example, spectral analysis has been explored to identify artefacts introduced during synthesis, as demonstrated by Durall et al., who analysed frequency domain differences in AI-generated images13. Another spectral-based approach, Deep Image Fingerprint (DIF)14, extracts high-frequency artefacts from images using a U-Net-based high-pass filter, achieving strong generalisation even on unseen AI-generated images. Techniques like Wasserstein distances and Sinkhorn approximations have also been employed to quantify dissimilarities in feature spaces, improving detection robustness15. Dogoulis et al. focus on the quality of the fake dataset based on the rationale that if a network is trained on high-perceptual-quality synthetic images, it will focus less on noticeable artefacts from the generative process and more on invariant characteristics across different concepts. As a result, this approach enhances the detector’s generalisation ability, allowing it to distinguish AI-generated images from real ones more effectively. The study employs a ResNet-50 model for the downstream classification task of identifying real and fake images16.

However, previous work has shown that such detectors do not generalise well and fail to detect AI-generated images when the image generation method changes4,5,6,7. To address this challenge, online training strategies have been explored, in which detectors incrementally update their knowledge by incorporating AI-generated images in chronological order based on their historical release dates. This approach has been shown to improve performance on unseen generative models as the training history expands17. However, a key limitation of incremental updates is the gap between model updates and the emergence of new generative techniques. Before the detector is trained with new training data, AI-generated images produced by recently developed models may still evade detection; therefore, there is a need for uncertainty-based methods that can reject predictions when encountering data that has not yet been seen, ensuring more reliable detection in rapidly evolving AI-generation landscapes.

The application of uncertainty measures in classification is quite common. Among them, MC dropout18 has been widely used to estimate uncertainty in image classification and segmentation by keeping dropout active during inference and aggregating multiple stochastic predictions for the same input to obtain the predictive distribution19,20,21. Although less commonly used, Fisher Information provides an alternative approach for uncertainty estimation by measuring the sensitivity of a model’s predictions to small perturbations in its parameters. In this context, Fisher Information-Based Evidential Deep Learning (I-EDL)22 incorporates Fisher Information into an evidential deep learning framework to reweight the learning process based on the informativeness of predictions. By dynamically adjusting uncertainty penalties using the Fisher Information Matrix (FIM), this approach ensures that highly uncertain samples are handled more effectively. Another approach to uncertainty quantification is Deep Gaussian Processes (DGPs), which extend traditional Gaussian Processes (GPs) by stacking multiple layers of latent functions, providing a hierarchical representation of uncertainty. Recent studies have applied DGPs in radiation image reconstruction, where a Gaussian Process prior (GPP) enables Bayesian inference with structured priors, improving both image quality and uncertainty quantification23. Similarly, Convolutional Deep Gaussian Processes (CDGPs) have been explored for decision-making under uncertainty in critical applications such as autonomous driving and healthcare. Furthermore, Bayesian formulations of Deep Convolutional Gaussian Processes (DCGPs) improve uncertainty calibration and model selection, offering an alternative to dropout-based Bayesian deep learning methods24.

More recently, Gozalo-Salellas et al.25 introduced a training-free uncertainty-based framework, WePe, that detects AI-generated images by applying random weight perturbations to large pre-trained vision models such as DINOv2. Their method builds on the observation that natural and generated images exhibit different sensitivities to parameter perturbations: while the features of natural images remain relatively stable, those of generated images fluctuate more substantially, leading to higher predictive uncertainty. This approach is conceptually related to our work in that it also employs uncertainty as a detection signal. However, WePe is designed as a training-free method and directly assigns class labels based on feature instability, without incorporating a mechanism for selective rejection. In contrast, our work combines the entropy estimates derived from MC dropout, the predictive variance obtained from Gaussian Process models, and several Fisher Information-based metrics, including the total Fisher information, the Frobenius norm and Fisher entropy of the Fisher Information Matrix, to produce a single scalar uncertainty value for each sample. This composite uncertainty measure not only encapsulates the strengths of each approach but also mitigates their isolated limitations, providing a more robust assessment of model confidence.

Methodology

Data preparation and preprocessing

To evaluate the effectiveness of our proposed framework for detecting AI-generated images, we utilised the GenImage dataset4, a large-scale benchmark specifically designed for this purpose. GenImage comprises over 2.6 million images, evenly split between real and AI-generated content. The real images in GenImage are sourced from the ImageNet dataset, encompassing 1,331,167 images across 1,000 distinct classes. The AI-generated images are produced using eight state-of-the-art generative models, namely BigGAN26, GLIDE27, Vector Quantized Diffusion Model (VQDM)28, Stable Diffusion V1.429, Stable Diffusion V1.529, ADM30, Wukong, and Midjourney31. Each generative model contributes approximately the same number of images per class, maintaining balance across the dataset. Typically, each model generates around 162 images per class for training and 6 for testing. An exception is Stable Diffusion V1.5, which provides slightly more, with 166 images for training and 8 for testing per class. The dataset is structured into subsets corresponding to each generative model. Each subset contains both the AI-generated images from that specific model and their corresponding real images from ImageNet for the same classes. Importantly, the real images are not shared between subsets, preventing any overlap that could introduce bias during training or evaluation. In our research, we concentrated on a specific subset of the GenImage dataset composed of images generated by Stable Diffusion V1.5, Midjourney, GLIDE, VQDM and BigGAN. To evaluate our model’s performance, we exclusively trained it on images produced by Stable Diffusion V1.5 and then assessed its in-domain performance using additional images from the same generator. Meanwhile, images generated by Midjourney, GLIDE, VQDM and BigGAN were employed to gauge the model’s ability to handle out-of-domain scenarios. In addition, we included images generated by StyleGAN3. Since StyleGAN3 is trained exclusively for face generation, its distribution differs markedly from the ImageNet-based subsets in GenImage. We therefore used it solely as a supplementary, highly out-of-distribution test case to further evaluate the robustness of our framework. The StyleGAN3 subset32 comprised 10,392 images for training and 2,598 images for testing, complemented by real face images obtained from the Deepfake Face Images dataset33.

Statistical analysis on datasets

We performed a rigorous and scientifically grounded statistical analysis to quantify the textural differences between natural (real) and AI-generated (fake) images across five datasets. Each dataset corresponds to one of five prominent generative models: BigGAN, VQDM, Glide, Midjourney, and Stable Diffusion. From each model, we collected 1,000 real and 1,000 AI-generated image samples, ensuring a balanced and controlled comparative setup between the two classes. This setup allowed us to make fair and interpretable comparisons between real and synthetic textures.

Table 1 Summary of Statistical Analysis of Textural Features
Fig. 2 Pairplots comparing real (blue) and AI-generated (purple) images across four GLCM features: Contrast, Energy, Entropy, and Homogeneity. Each subplot shows the feature distribution and inter-feature relationships for a specific model.

Our methodology is based on the hypothesis that generative models, despite their impressive visual quality, introduce subtle yet consistent textural artefacts that deviate from the statistical properties of real images. To investigate this, we employed the Grey-Level Co-occurrence Matrix (GLCM) method, which quantifies spatial relationships between pixel intensities and provides interpretable measures of image texture. From each image, we extracted four key features: Contrast, Energy, Entropy, and Homogeneity. These features serve as a compact representation of each image’s texture signature.
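
As an illustration of this feature-extraction step, the sketch below shows one way the four GLCM descriptors could be computed per image with scikit-image; the single distance/angle setting and the grey-level quantisation are illustrative assumptions rather than the exact configuration used in this work.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_img, levels=256):
    """Extract Contrast, Energy, Entropy, and Homogeneity from one greyscale image.

    gray_img: 2-D uint8 array. A single distance (1) and angle (0) are
    illustrative choices, not the paper's exact settings.
    """
    glcm = graycomatrix(gray_img, distances=[1], angles=[0],
                        levels=levels, symmetric=True, normed=True)
    p = glcm[:, :, 0, 0]                        # normalised co-occurrence matrix
    contrast = graycoprops(glcm, "contrast")[0, 0]
    energy = graycoprops(glcm, "energy")[0, 0]
    homogeneity = graycoprops(glcm, "homogeneity")[0, 0]
    entropy = -np.sum(p * np.log(p + 1e-12))    # GLCM entropy (not provided by graycoprops)
    return np.array([contrast, energy, entropy, homogeneity])
```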

To statistically evaluate whether these features differ between real and AI-generated images, we applied three complementary analytical techniques. First, we conducted a univariate analysis, applying Welch’s t-test independently to each feature. This test is particularly appropriate because it does not assume equal variances between groups. The results, summarised in Table 1, revealed strong evidence of significant differences in most features across datasets. For instance, in the VQDM dataset, Welch’s t-test showed a t-statistic of 24.50 for contrast and 20.02 for entropy, both with \(p < 0.0001\), indicating highly significant differences. Interestingly, the entropy values in the Midjourney dataset were not significantly different between real and AI images (\(t = 0.32\), \(p = 0.7477\)), suggesting that this specific generative model produces images with entropy levels more closely aligned to natural images.

Second, we conducted a multivariate analysis using Hotelling’s T2 test to assess the joint significance of all four GLCM features. This multivariate test evaluates whether the combined distribution of features for real images differs from that of AI-generated images. As shown in Table 1, all five models yielded highly significant results (\(p < 0.0001\)), with particularly high T2 statistics for BigGAN (849.95) and VQDM (795.16), indicating substantial multivariate separation between real and synthetic images.

Third, to measure the degree of distributional drift beyond hypothesis testing, we calculated the Kullback-Leibler (KL) divergence between the feature distributions of real and AI-generated images. This metric quantifies how much the probability distribution of each feature in AI-generated images deviates from its real counterpart. Notably, the Energy feature showed a large KL divergence in multiple models; for example, Stable Diffusion and Glide both had Energy KL values of 58.15, highlighting significant shifts in the underlying textural distributions, as reported in Table 1.
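
The following sketch illustrates, under similar caveats, how the univariate Welch’s t-test and a histogram-based KL-divergence estimate could be obtained for one feature with SciPy; the bin count is an arbitrary choice, and Hotelling’s T² (computed from the pooled covariance of the four features) is omitted for brevity.

```python
import numpy as np
from scipy.stats import ttest_ind, entropy

def welch_and_kl(real_vals, fake_vals, n_bins=50):
    """Welch's t-test and a histogram-based KL divergence for one GLCM feature.

    real_vals, fake_vals: 1-D arrays of the feature over the real / AI-generated sets.
    The number of bins is an illustrative choice.
    """
    t_stat, p_val = ttest_ind(real_vals, fake_vals, equal_var=False)  # Welch's t-test

    # Estimate KL(real || fake) from histograms over a shared support.
    lo = min(real_vals.min(), fake_vals.min())
    hi = max(real_vals.max(), fake_vals.max())
    p_hist, edges = np.histogram(real_vals, bins=n_bins, range=(lo, hi), density=True)
    q_hist, _ = np.histogram(fake_vals, bins=edges, density=True)
    eps = 1e-12
    kl = entropy(p_hist + eps, q_hist + eps)    # scipy normalises both distributions
    return t_stat, p_val, kl
```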

These statistical findings are further supported by the pairplot visualisations presented in Fig. 2, which illustrate the inter-feature relationships and distributional differences between the real and AI-generated classes for each dataset. In these plots, real images are shown in blue and AI-generated images in purple. The degree of clustering or overlap visually reinforces the results observed in the statistical tests. Glide, for instance, demonstrates moderate overlap in Entropy and Homogeneity but shows clear distinctions in Contrast and Energy. Midjourney displays greater separation in Contrast and Energy but minimal difference in Entropy, aligning with the non-significant p-value observed. Stable Diffusion exhibits some overlap but shows lower Contrast and Energy in AI-generated samples. In contrast, VQDM and BigGAN both show clear separation across all four features, indicating highly distinguishable synthetic textures.

Overall, this combination of statistical analysis and visual evidence provides robust support for the claim that AI-generated images exhibit distinct and quantifiable differences from natural images in terms of their textural properties.

Base classifier

Our framework leverages existing base classifiers, fine-tuning them to differentiate between AI-generated and natural images. Our primary model is a pre-trained ResNet50 architecture34, which has been extensively employed in AI versus natural image detection10,11,12. By capitalising on the hierarchical features learned from large-scale datasets like ImageNet, we retain the convolutional base of ResNet50 as a robust feature extractor. The original classification head is replaced with a custom-designed module tailored for our binary classification task. This module begins with a global average pooling layer that condenses the feature maps into a 2048-dimensional vector, which is then processed through a classification head consisting of two fully connected layers with 512 and 256 units, respectively. Each fully connected layer is accompanied by ReLU activations and dropout regularisation (with a dropout rate of 0.5) to mitigate overfitting, culminating in a final layer that outputs a single logit for decision-making.
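
A minimal PyTorch sketch of the ResNet50 variant described above is given below; the torchvision weight enum and the exact module layout are assumptions about one possible implementation, not the training code used in this work.

```python
import torch.nn as nn
from torchvision import models

class ResNet50Detector(nn.Module):
    """ResNet50 backbone with the custom binary-classification head described above."""

    def __init__(self, dropout=0.5):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        # Keep everything up to and including global average pooling; drop the fc layer.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.head = nn.Sequential(
            nn.Flatten(),                        # 2048-dimensional pooled feature vector
            nn.Linear(2048, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 1),                   # single logit: AI (0) vs Nature (1)
        )

    def forward(self, x):
        return self.head(self.features(x))
```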

In addition to our ResNet50-based approach, we also explored a Vision Transformer (ViT) model to assess the generalizability of our uncertainty measures across diverse architectures. For the ViT model, we employed the vit_base_patch16_224 architecture, renowned for its effective feature extraction. Similar to the ResNet50 adaptation, the original classification head of the ViT was removed and substituted with a custom-designed classifier. The ViT extracts a 768-dimensional embedding from its feature extraction stage, which is then reduced to 512 units via a fully connected layer, followed by ReLU activation and dropout. This intermediate representation is further refined by another fully connected layer, reducing it to 256 units, again using ReLU activation and dropout, before a final fully connected layer outputs a single logit for binary classification. In order to maintain the integrity of the pre-trained representations, most of the ViT layers are kept frozen during training, with only the final two transformer blocks being fine-tuned to capture task-specific nuances.
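
Analogously, the following sketch illustrates the ViT variant and its partial fine-tuning scheme, in which only the last two transformer blocks remain trainable; the use of the timm implementation of vit_base_patch16_224 is an assumption about the underlying library.

```python
import timm
import torch.nn as nn

class ViTDetector(nn.Module):
    """vit_base_patch16_224 feature extractor with the custom head; only the final
    two transformer blocks are left trainable, as described in the text."""

    def __init__(self, dropout=0.5):
        super().__init__()
        self.backbone = timm.create_model("vit_base_patch16_224",
                                          pretrained=True, num_classes=0)  # 768-d embedding
        for p in self.backbone.parameters():          # freeze the pre-trained backbone
            p.requires_grad = False
        for block in self.backbone.blocks[-2:]:       # fine-tune the last two blocks
            for p in block.parameters():
                p.requires_grad = True
        self.head = nn.Sequential(
            nn.Linear(768, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 1),                        # single logit for binary classification
        )

    def forward(self, x):
        return self.head(self.backbone(x))
```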

By integrating both the ResNet50 and ViT architectures within our framework, we demonstrate that our approach is robust and adaptable to different neural architectures. While our primary focus is on the extensively validated ResNet50, the experiments with ViT show the potential of our uncertainty measures to enhance classification performance across various models.

Integration of uncertainty measures

Per-instance Fisher information

Fisher Information serves as the first uncertainty measure in our framework, quantifying how each image, whether AI-generated or natural, influences the parameters of our classification model. This metric assesses the sensitivity of the model’s likelihood function to changes in its parameters on a per-instance basis, thereby providing insights into how well an image conforms to the learned patterns.

The foundation of this approach is the Fisher Information Matrix (FIM) for a single data point, which quantifies the amount of information the observable data carries about the model parameters. For a given data point \((x_i, y_i)\) and model parameters \(\theta\), the per-instance Fisher Information is defined as:

$$\begin{aligned} I_i(\theta ) = \mathbb {E}_{y}\left[ \left( \nabla _\theta \log p(y | x_i; \theta ) \right) \left( \nabla _\theta \log p(y | x_i; \theta ) \right) ^\top \right] , \end{aligned}$$
(1)

where \(p(y | x_i; \theta )\) represents the likelihood of observing the label \(y\) given the input \(x_i\) under parameters \(\theta\), and \(\nabla _\theta\) denotes the gradient with respect to \(\theta\).

Due to the high dimensionality of deep neural network parameters, computing the full Fisher Information Matrix (FIM) for every layer is highly resource-intensive. To simplify this, we concentrate on the custom classification head that sits atop our pre-trained ResNet50 and ViT models. This part of the network is solely responsible for making the final binary decision between AI-generated and natural images. By extracting the FIM from just this custom head, we can capture the most important information about how individual inputs influence the model’s decisions without overwhelming our computational resources.

For each input image \(x_i\), we perform a forward pass to obtain the output logits \(f(x_i; \theta )\). We then compute the gradients of the binary cross-entropy loss \(\mathscr {L}(y, f(x_i; \theta ))\) with respect to the model parameters for each potential class label \(y \in \{0, 1\}\), regardless of the true label. We assume equal prior probabilities for both classes, \(p(0) = p(1) = 0.5\), and the gradient for each class is given by:

$$\begin{aligned} g_i^{(y)} = \nabla _\theta \mathscr {L}(y, f(x_i; \theta )). \end{aligned}$$
(2)

The per-instance Fisher Information Matrix is then approximated as:

$$\begin{aligned} I_i(\theta ) = \sum _{y \in \{0, 1\}} p(y) \left( g_i^{(y)} \odot g_i^{(y)} \right) , \end{aligned}$$
(3)

where \(\odot\) denotes element-wise multiplication. This formulation effectively accumulates the squared gradients weighted by the class probabilities, capturing the expected curvature of the loss function with respect to each parameter.

To distil this matrix into a set of scalar metrics that summarise the uncertainty associated with each instance, we extract several key measures:

  • Total Fisher Information (Trace): This is computed as

    $$\begin{aligned} \text {Trace}(I_i(\theta )) = \sum _{k} \left[ I_i(\theta ) \right] _k, \end{aligned}$$
    (4)

    where \(\left[ I_i(\theta ) \right] _k\) represents the diagonal elements of the matrix, corresponding to the variance of each parameter.

  • Frobenius Norm: This metric captures the overall magnitude of the Fisher Information Matrix, calculated as

    $$\begin{aligned} \Vert I_i(\theta ) \Vert _F = \sqrt{ \sum _{k} \left( \left[ I_i(\theta ) \right] _k \right) ^2 }. \end{aligned}$$
    (5)
  • Entropy-Based Measure: By normalizing the diagonal elements, we define a probability distribution \(p_i(k) = \frac{ \left[ I_i(\theta ) \right] _k }{ \text {Trace}(I_i(\theta )) }\) and compute the entropy as:

    $$\begin{aligned} \text {Entropy} = - \sum _{k} p_i(k) \log \left( p_i(k) + \epsilon \right) , \end{aligned}$$
    (6)

    where \(\epsilon\) is a small constant for numerical stability.

By concentrating on the parameters of the custom classification head, we directly assess the sensitivity of the model’s decision-making process. In this architecture, the pre-trained convolutional base extracts general features from the images, and the classification head interprets these features to produce task-specific predictions. Consequently, the per-instance Fisher Information derived from the custom head clearly measures how each image affects the model’s capacity to distinguish between AI-generated and natural images. Lower Fisher Information values suggest that an image exerts minimal influence on the model parameters, potentially indicating non-conformity with learned patterns. In contrast, higher values indicate that the image strongly aligns with the training data. Finally, these scalar metrics are aggregated across the relevant layers of the classification head. We sum the values from all selected layers for cumulative metrics, such as the Total Fisher Information and Frobenius Norm. In contrast, we compute the mean across layers for metrics expressed as averages, such as the entropy-based measure.
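
The following sketch summarises Eqs. (2)–(6) for a single image, restricted to the classification-head parameters; for simplicity it pools all head parameters into one vector rather than aggregating per layer as described above, and the helper name and interface are illustrative.

```python
import torch
import torch.nn.functional as F

def fisher_metrics(model, head_params, x, eps=1e-12):
    """Per-instance diagonal Fisher approximation over the classification-head
    parameters, returning (trace, Frobenius norm, entropy) as in Eqs. (4)-(6).

    head_params: e.g. list(model.head.parameters()); x: one preprocessed image tensor.
    """
    model.eval()
    logit = model(x.unsqueeze(0)).squeeze()            # single logit for this image
    diag_fisher = [torch.zeros_like(p) for p in head_params]
    for y, prior in ((0.0, 0.5), (1.0, 0.5)):          # equal priors p(0) = p(1) = 0.5
        loss = F.binary_cross_entropy_with_logits(logit, torch.tensor(y))
        grads = torch.autograd.grad(loss, head_params, retain_graph=True)
        for d, g in zip(diag_fisher, grads):
            d += prior * g.pow(2)                      # Eq. (3): prior-weighted squared gradients
    diag = torch.cat([d.flatten() for d in diag_fisher])
    trace = diag.sum()                                 # Eq. (4): total Fisher information
    frob = diag.pow(2).sum().sqrt()                    # Eq. (5): Frobenius norm of the diagonal
    p = diag / (trace + eps)
    entropy = -(p * torch.log(p + eps)).sum()          # Eq. (6): entropy-based measure
    return trace.item(), frob.item(), entropy.item()
```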

Deep kernel learning with Gaussian processes

Deep Kernel Learning (DKL) provides a robust hybrid framework that combines the representational strength of deep neural networks with the probabilistic flexibility of Gaussian Processes (GPs). This approach enables the extraction of rich, semantically meaningful features from high-dimensional image data while leveraging GPs to model uncertainty and make robust predictions. In our work, DKL is employed to classify images into two categories: AI-generated and natural images, with the variance of the GP posterior serving as our primary measure of uncertainty.

Due to the complexity and high dimensionality of image data, we first transform raw pixel inputs into a compact and semantically meaningful feature space using a deep neural network. This transformation is achieved through a modified ResNet50 or ViT network trained as a feature extractor. Let the neural network feature extractor be denoted as

$$\begin{aligned} \mathscr {F}_\theta : \mathbb {R}^{\text {image dims}} \rightarrow \mathbb {R}^d, \end{aligned}$$
(7)

parameterized by \(\theta\). For an image \(\textbf{x}\), the extracted feature vector is:

$$\begin{aligned} \textbf{z} = \mathscr {F}_\theta (\textbf{x}), \end{aligned}$$
(8)

where \(\textbf{z} \in \mathbb {R}^d\) represents the learned feature embedding. These embeddings are designed to capture the essential distinctions between AI-generated and natural images, reducing the complexity of the input space while preserving discriminative information. To model the classification task, we place a Gaussian Process prior over a latent function \(f: \mathbb {R}^d \rightarrow \mathbb {R}\) that maps the feature vectors \(\textbf{z}\) to a real-valued latent score. The GP prior is defined as:

$$\begin{aligned} f(\textbf{z}) \sim \mathscr{G}\mathscr{P}(m(\textbf{z}), k(\textbf{z}, \textbf{z}')), \end{aligned}$$
(9)

where \(m(\textbf{z})\) is the mean function, commonly assumed to be zero, and \(k(\textbf{z}, \textbf{z}')\) is a kernel function that encodes similarity between feature vectors. By setting \(m(\textbf{z}) = 0\), the prior simplifies to:

$$\begin{aligned} f(\textbf{z}) \sim \mathscr{G}\mathscr{P}(0, k(\textbf{z}, \textbf{z}')). \end{aligned}$$
(10)

In our implementation, we use the Radial Basis Function (RBF) kernel given by:

$$\begin{aligned} k(\textbf{z}, \textbf{z}') = \sigma ^2 \exp \left( -\frac{\Vert \textbf{z} - \textbf{z}'\Vert ^2}{2l^2}\right) , \end{aligned}$$
(11)

where \(\sigma ^2\) is the output scale and \(l\) is the length scale, both of which are learned during training. By combining the deep feature extractor with the RBF kernel, we effectively learn a deep kernel that adapts to the specific characteristics of our data.

Although GPs provide a flexible approach to modelling functions, their computational cost scales cubically with the number of training points \(N\), making exact GP inference infeasible for large datasets. To address this limitation, we introduce \(M \ll N\) inducing points,

$$\begin{aligned} \textbf{Z}_u = [\textbf{z}_u^{(1)}, \ldots , \textbf{z}_u^{(M)}], \end{aligned}$$
(12)

which act as a low-rank approximation of the GP. These inducing points summarise the function’s behaviour over the input space, reducing computational complexity while retaining much of the GP’s expressive power. In our implementation, we use 1000 inducing points, 500 for each class. The corresponding inducing function values

$$\begin{aligned} \textbf{u} = [f(\textbf{z}_u^{(1)}), \ldots , f(\textbf{z}_u^{(M)})]^\top \end{aligned}$$
(13)

are assumed to follow the same GP prior:

$$\begin{aligned} \textbf{u} \sim \mathscr {N}(\textbf{0}, \textbf{K}_{u,u}), \end{aligned}$$
(14)

where \(\textbf{K}_{u,u}\) is the kernel matrix evaluated on the inducing points. The full latent function values \(\textbf{f}\) for the training data are then conditionally Gaussian given \(\textbf{u}\):

$$\begin{aligned} \textbf{f}|\textbf{u} \sim \mathscr {N}(\textbf{K}_{f,u}\textbf{K}_{u,u}^{-1}\textbf{u}, \textbf{K}_{f,f} - \textbf{K}_{f,u}\textbf{K}_{u,u}^{-1}\textbf{K}_{u,f}), \end{aligned}$$
(15)

where \(\textbf{K}_{f,u}\) and \(\textbf{K}_{f,f}\) are kernel matrices evaluated between training points and inducing points, and among training points alone, respectively. This sparse approximation reduces the cost of GP inference and training to \(\mathscr {O}(NM^2)\), making it feasible for larger datasets.

Since this is a binary classification problem, we map the latent function values \(f(\textbf{z})\) to class probabilities using the logistic sigmoid function:

$$\begin{aligned} \sigma (f) = \frac{1}{1 + \exp (-f)}. \end{aligned}$$
(16)

The likelihood of a label \(y \in \{0,1\}\) given \(f(\textbf{z})\) is:

$$\begin{aligned} p(y|f) = \sigma (f)^{y}[1-\sigma (f)]^{1-y}. \end{aligned}$$
(17)

For a dataset with \(N\) training points, the joint likelihood is:

$$\begin{aligned} p(\textbf{y}|\textbf{f}) = \prod _{i=1}^N p(y_i|f_i). \end{aligned}$$
(18)

Because this likelihood is non-Gaussian, exact posterior inference is intractable. Instead, we employ variational inference, where we approximate the posterior \(p(\textbf{f}, \textbf{u}|\textbf{y})\) using a variational distribution:

$$\begin{aligned} q(\textbf{f}, \textbf{u}) = p(\textbf{f}|\textbf{u})q(\textbf{u}), \end{aligned}$$
(19)

and \(q(\textbf{u})\) is chosen to be Gaussian:

$$\begin{aligned} q(\textbf{u}) = \mathscr {N}(\textbf{u}|\varvec{\mu }, \textbf{S}). \end{aligned}$$
(20)

The Evidence Lower Bound (ELBO) provides a tractable objective for optimising the variational parameters \(\varvec{\mu }\) and \(\textbf{S}\):

$$\begin{aligned} \mathscr {L} = \mathbb {E}_{q(\textbf{f})}[\log p(\textbf{y}|\textbf{f})] - \textrm{KL}[q(\textbf{u})||p(\textbf{u})]. \end{aligned}$$
(21)

Here, the first term measures how well the variational approximation explains the data. In contrast, the second term regularises the variational posterior \(q(\textbf{u})\), ensuring it remains consistent with the GP prior. By maximising \(\mathscr {L}\), we jointly optimise the kernel hyperparameters, the inducing point locations, and the deep feature extractor parameters \(\theta\). For a test point \(\textbf{x}_*\), we compute the predictive distribution over the latent function \(f(\textbf{z}_*)\) using the GP posterior. The resulting distribution is Gaussian:

$$\begin{aligned} q(f_*) = \mathscr {N}(f_*|\mu _*, \sigma _*^2), \end{aligned}$$
(22)

where \(\mu _*\) and \(\sigma _*^2\) are derived from the variational posterior. Importantly, the variance \(\sigma _*^2\) serves as our uncertainty measure, quantifying the model’s confidence in its prediction. The class probabilities are obtained by integrating over this predictive distribution:

$$\begin{aligned} p(y_* = 1|\textbf{z}_*) = \int \sigma (f_*)q(f_*) \, df_*. \end{aligned}$$
(23)

Due to the nonlinearity of the sigmoid function, this integral lacks a closed-form solution. To approximate it, we employ Monte Carlo methods. In this approach, we draw \(S\) latent function samples \(f_*^{(s)}\) from the Gaussian distribution \(q(f_*) = \mathscr {N}(f_*|\mu _*, \sigma _*^2)\), where each sample is generated as:

$$\begin{aligned} f_*^{(s)} = \mu _* + \epsilon ^{(s)} \sigma _*, \end{aligned}$$
(24)

and \(\epsilon ^{(s)}\) is drawn from a standard normal distribution \(\mathscr {N}(0, 1)\). For each sample, we evaluate the sigmoid function to produce probabilities \(\sigma (f_*^{(s)})\). The integral is then approximated by averaging these probabilities:

$$\begin{aligned} p(y_* = 1|\textbf{z}_*) \approx \frac{1}{S} \sum _{s=1}^S \sigma (f_*^{(s)}). \end{aligned}$$
(25)

This Monte Carlo approximation not only provides the predicted class probabilities but also captures predictive uncertainty through the variance \(\sigma _*^2\) of the GP posterior. This variance serves as the uncertainty measure in our framework, as it reflects the confidence of the model’s predictions and enables robust decision-making in our classification task.
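
A compact GPyTorch-style sketch of the sparse variational GP layer and the Monte Carlo prediction step (Eqs. 9–15 and 22–25) is shown below; class and function names are illustrative, and the GP layer is assumed to have been trained beforehand with the ELBO of Eq. (21) (e.g., gpytorch.mlls.VariationalELBO together with a Bernoulli likelihood).

```python
import torch
import gpytorch

class GPClassificationLayer(gpytorch.models.ApproximateGP):
    """Sparse variational GP over deep features with a zero mean and RBF kernel."""

    def __init__(self, inducing_points):                  # e.g. a 1000 x d tensor
        var_dist = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0))
        strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, var_dist, learn_inducing_locations=True)
        super().__init__(strategy)
        self.mean_module = gpytorch.means.ZeroMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, z):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(z), self.covar_module(z))

def predict_with_uncertainty(feature_extractor, gp_layer, x, n_samples=64):
    """Monte Carlo estimate of p(y*=1|x) (Eq. 25); the posterior variance is the
    uncertainty used by the framework."""
    feature_extractor.eval(); gp_layer.eval()
    with torch.no_grad():
        z = feature_extractor(x)                           # deep embeddings
        q_f = gp_layer(z)                                  # q(f*) = N(mu*, sigma*^2)
        f_samples = q_f.rsample(torch.Size([n_samples]))   # S x batch latent samples
        prob = torch.sigmoid(f_samples).mean(dim=0)        # Eq. (25)
    return prob, q_f.variance                              # predictive probability, GP variance
```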

MC dropout

MC Dropout is integrated into our framework to quantify uncertainty in both the ResNet50-based and ViT-based classifiers. It is applied during inference to generate a distribution of probabilistic outputs through multiple stochastic forward passes. Dropout layers within the custom classification heads of ResNet50 and ViT remain activated during inference, enabling the model to produce \(N = 20\) stochastic predictions for each input image. The multiple outputs obtained via MC Dropout are aggregated to compute key uncertainty metrics. The Entropy of Expected is calculated as:

$$\begin{aligned} \text {Entropy of Expected} = -\bar{p} \log (\bar{p}) - (1-\bar{p}) \log (1-\bar{p}), \end{aligned}$$
(26)

where \(\bar{p}\) is the mean predicted probability across all MC passes. We also compute the Expected Entropy:

$$\begin{aligned} \text {Expected Entropy} = -\frac{1}{N} \sum _{i=1}^{N} \left( p_i \log (p_i) + (1 - p_i) \log (1 - p_i) \right) , \end{aligned}$$
(27)

with \(p_i\) representing the predicted probability from the \(i\)th forward pass. The difference between these two metrics is defined as Knowledge Uncertainty:

$$\begin{aligned} \text {Knowledge Uncertainty} = \text {Entropy of Expected} - \text {Expected Entropy}. \end{aligned}$$
(28)

The rationale for using MC Dropout lies in its ability to capture the variability in model predictions that arises from uncertainty. For instance, the Entropy of Expected metric reflects the uncertainty in the model’s parameters, typically peaking in regions with sparse or unseen data. In contrast, Expected Entropy highlights the inherent noise within the data, such as overlapping classes or noisy labels. The difference between these two metrics, termed knowledge uncertainty, offers additional insight by identifying instances where uncertainty arises from a combination of data ambiguity and model limitations. In our experiments, we observed that Expected Entropy yields nearly the same values as the Entropy of Expected; therefore, we use the Entropy of Expected and Knowledge Uncertainty as the uncertainty measures derived from MC dropout in the overall framework.
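
A short sketch of the MC Dropout pass, assuming a binary classifier that outputs a single logit, could look as follows; keeping only the dropout modules in training mode is one common way to realise the stochastic forward passes.

```python
import torch
import torch.nn as nn

def mc_dropout_uncertainty(model, x, n_passes=20, eps=1e-12):
    """Entropy of Expected, Expected Entropy and Knowledge Uncertainty (Eqs. 26-28)
    from n_passes stochastic forward passes with dropout kept active."""
    model.eval()
    for m in model.modules():                     # re-enable only the dropout layers
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(model(x)).squeeze(-1)
                             for _ in range(n_passes)])            # N x batch
    p_bar = probs.mean(dim=0)                                      # mean probability
    entropy_of_expected = -(p_bar * torch.log(p_bar + eps)
                            + (1 - p_bar) * torch.log(1 - p_bar + eps))   # Eq. (26)
    expected_entropy = (-(probs * torch.log(probs + eps)
                          + (1 - probs) * torch.log(1 - probs + eps))).mean(dim=0)  # Eq. (27)
    knowledge_uncertainty = entropy_of_expected - expected_entropy         # Eq. (28)
    return entropy_of_expected, expected_entropy, knowledge_uncertainty
```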

Combining and normalising uncertainty measures

In this section, we describe how to combine the uncertainty metrics introduced earlier (Fisher-based measures, Gaussian Process variance, and MC Dropout-based metrics) into a single unified uncertainty score per sample. The goal is to leverage a weighted combination of these measures so that samples with high overall uncertainty can be rejected rather than incorrectly accepted. To optimise the weights of each uncertainty measure and the rejection threshold, we employ Particle Swarm Optimisation (PSO).

From the previous section, recall that each sample \(i\) is associated with multiple uncertainty measures: Fisher-based uncertainties, Gaussian Process (GP) variance, and MC Dropout metrics. To ensure higher uncertainty values reflect greater non-conformity, the Fisher-based measures are transformed by taking their reciprocals:

  • Fisher Total Uncertainty:

    $$\begin{aligned} \text {Fisher Total Uncertainty} = \frac{1}{\text {Trace}(I_i(\theta )) + \epsilon }, \end{aligned}$$
    (29)
  • Fisher Frobenius Uncertainty:

    $$\begin{aligned} \text {Fisher Frobenius Uncertainty} = \frac{1}{\Vert I_i(\theta ) \Vert _F + \epsilon }, \end{aligned}$$
    (30)
  • Fisher Entropy Uncertainty:

    $$\begin{aligned} \text {Fisher Entropy Uncertainty} = \frac{1}{\text {Fisher Entropy} + \epsilon }. \end{aligned}$$
    (31)

The GP variance is directly used as an uncertainty measure. Finally, MC Dropout provides two additional metrics (Entropy of Expected and Knowledge Uncertainty). All these measures are column-stacked into a single feature matrix \(\textbf{X}_{\text {uncertainty}}\). Z-score normalisation is applied to each feature to balance the contributions of each uncertainty measure:

$$\begin{aligned} x_{\text {norm}} = \frac{x - \mu }{\sigma }, \end{aligned}$$
(32)

yielding the normalised matrix

$$\begin{aligned} \textbf{X}_{\text {norm}} = \begin{bmatrix} x_{1,1}^{\text {norm}} & x_{1,2}^{\text {norm}} & \ldots & x_{1,6}^{\text {norm}} \\ x_{2,1}^{\text {norm}} & x_{2,2}^{\text {norm}} & \ldots & x_{2,6}^{\text {norm}} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N,1}^{\text {norm}} & x_{N,2}^{\text {norm}} & \ldots & x_{N,6}^{\text {norm}} \end{bmatrix}. \end{aligned}$$
(33)

Here, \(x\) is the raw uncertainty value, \(\mu\) is its mean, and \(\sigma\) is the standard deviation for that feature. A linear combination of the normalised features, weighted by \(\textbf{w}\), produces a single combined uncertainty score \(u_i\) for each sample:

$$\begin{aligned} u_i = \sum _{j=1}^m w_j \cdot x_{i,j}, \end{aligned}$$
(34)

which can be written in matrix form as:

$$\begin{aligned} \textbf{u} = \textbf{X}_{\text {norm}} \cdot \textbf{w}. \end{aligned}$$
(35)

A sample is rejected if its combined score \(u_i\) exceeds a threshold \(\tau\):

$$\begin{aligned} \text {Rejected} = \{ i \mid u_i > \tau \}. \end{aligned}$$
(36)
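
A minimal sketch of Eqs. (32)–(36), assuming the six uncertainty measures are stacked column-wise in the order described, is given below; the weights and threshold are taken as given here and are obtained by the PSO step described next.

```python
import numpy as np

def combined_uncertainty(X_uncertainty, weights, tau):
    """Z-score normalise the stacked uncertainty features (Eq. 32), form the weighted
    score u_i (Eqs. 34-35) and return the rejection mask (Eq. 36)."""
    mu = X_uncertainty.mean(axis=0)
    sigma = X_uncertainty.std(axis=0) + 1e-12
    X_norm = (X_uncertainty - mu) / sigma           # N x 6 normalised feature matrix
    u = X_norm @ np.asarray(weights)                # combined uncertainty score per sample
    return u, u > tau                               # reject sample i where u_i > tau
```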

To optimise the weights \(\textbf{w}\) and the rejection threshold \(\tau\), we use Particle Swarm Optimisation (PSO). The objective function aims to maximise the sum of Correctly Predicted Acceptance (CPA%) and Incorrectly Predicted Rejection (IPR%) rates for both AI and Nature, given by the following formulae.

$$\begin{aligned} \text {CPA\%} = \frac{\text {CA}}{\text {CA} + \text {CR}}, \end{aligned}$$
(37)

where \(\text {CA}\) is the number of correctly classified samples with \(u_i \le \tau\), and \(\text {CR}\) the number of correctly classified samples with \(u_i > \tau\).

$$\begin{aligned} \text {IPR\%} = \frac{\text {IR}}{\text {IR} + \text {IA}}, \end{aligned}$$
(38)

where \(\text {IR}\) is the number of misclassified samples with \(u_i > \tau\), and \(\text {IA}\) the number of misclassified samples with \(u_i \le \tau\).

The total score to be maximised is:

$$\begin{aligned} \text {Score} = \text {CPA\%}_{\text {AI}} + \text {IPR\%}_{\text {AI}} + \text {CPA\%}_{\text {Nature}} + \text {IPR\%}_{\text {Nature}}. \end{aligned}$$
(39)

Because PSO solves a minimisation problem, we minimise the negative of this score:

$$\begin{aligned} (\textbf{w}^*, \tau ^*) = \arg \min _{\textbf{w}, \tau } \Bigl (- \sum _{\text {classes}} \bigl (\text {CPA\%} + \text {IPR\%}\bigr )\Bigr ). \end{aligned}$$
(40)

Each PSO particle \(\textbf{p}\) is defined as:

$$\begin{aligned} \textbf{p} = [w_1, w_2, \ldots , w_m, \tau ], \end{aligned}$$
(41)

with positions (weights and threshold) initialised randomly within predefined bounds \(\bigl [w_j^\text {min}, w_j^\text {max}\bigr ]\) and \([\tau ^\text {min}, \tau ^\text {max}]\). At each iteration, the velocity \(\textbf{v}_i\) and position \(\textbf{p}_i\) of particle \(i\) are updated:

$$\begin{aligned} \textbf{v}_i = \omega \textbf{v}_i + \phi _p \textbf{r}_p \odot (\textbf{p}_i^{\text {best}} - \textbf{p}_i) + \phi _g \textbf{r}_g \odot (\textbf{g}^{\text {best}} - \textbf{p}_i), \end{aligned}$$
(42)
$$\begin{aligned} \textbf{p}_i = \textbf{p}_i + \textbf{v}_i. \end{aligned}$$
(43)

Here, \(\omega\) is the inertia weight, \(\phi _p\) and \(\phi _g\) are cognitive and social coefficients, \(\textbf{r}_p\) and \(\textbf{r}_g\) are random vectors in \([0,1]\), and \(\odot\) denotes elementwise multiplication. Positions are clipped to respect the specified bounds, and velocities are reset if they reach the boundaries. Upon convergence, the PSO returns the optimal weights \(\textbf{w}^*\), the optimal threshold \(\tau ^*\), and the highest final score. These weights determine the relative importance of each uncertainty measure, and \(\tau ^*\) serves as the unified decision boundary for rejecting samples with high combined uncertainty. Figure 3 illustrates the final weights \(\textbf{w}^*\) discovered by PSO for each uncertainty measure derived from ResNet50. Notably, Total Fisher Information and GP variance receive higher weights, indicating their more influential role in determining whether a sample is ultimately accepted or rejected. By contrast, MC Dropout and other measures contribute less in this setup. Once derived, these PSO-based weights can be applied consistently across all datasets, providing a unified acceptance–rejection mechanism that generalises under diverse data distributions.
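
The following is a self-contained sketch of this optimisation step, implementing the velocity and position updates of Eqs. (42)–(43) with the score of Eq. (39) as the objective; the swarm size, iteration count, inertia and acceleration coefficients, and the bounds on the weights and threshold are illustrative assumptions, and the velocity-reset rule is omitted for brevity.

```python
import numpy as np

def pso_optimise(X_norm, preds, labels, n_particles=30, n_iter=100,
                 omega=0.7, phi_p=1.5, phi_g=1.5, seed=0):
    """Minimal PSO over [w_1..w_m, tau]; maximises the score of Eq. (39)."""
    rng = np.random.default_rng(seed)
    m = X_norm.shape[1]
    lo = np.array([0.0] * m + [X_norm.min()])       # assumed bounds on weights / threshold
    hi = np.array([1.0] * m + [X_norm.max()])

    def score(particle):
        w, tau = particle[:m], particle[m]
        u = X_norm @ w                               # combined uncertainty (Eq. 35)
        total = 0.0
        for cls in (0, 1):                           # AI = 0, Nature = 1
            mask = labels == cls
            correct = (preds == labels) & mask
            incorrect = (preds != labels) & mask
            ca = np.sum(correct & (u <= tau)); cr = np.sum(correct & (u > tau))
            ia = np.sum(incorrect & (u <= tau)); ir = np.sum(incorrect & (u > tau))
            cpa = ca / (ca + cr) if (ca + cr) else 0.0      # Eq. (37)
            ipr = ir / (ir + ia) if (ir + ia) else 0.0      # Eq. (38)
            total += cpa + ipr
        return total

    pos = rng.uniform(lo, hi, size=(n_particles, m + 1))
    vel = np.zeros_like(pos)
    p_best = pos.copy()
    p_best_val = np.array([score(p) for p in pos])
    g_best = p_best[p_best_val.argmax()].copy()

    for _ in range(n_iter):
        r_p, r_g = rng.random(pos.shape), rng.random(pos.shape)
        vel = omega * vel + phi_p * r_p * (p_best - pos) + phi_g * r_g * (g_best - pos)
        pos = np.clip(pos + vel, lo, hi)             # respect the predefined bounds
        vals = np.array([score(p) for p in pos])
        improved = vals > p_best_val
        p_best[improved], p_best_val[improved] = pos[improved], vals[improved]
        g_best = p_best[p_best_val.argmax()].copy()

    return g_best[:m], g_best[m]                     # optimal weights w*, threshold tau*
```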

Fig. 3 PSO-derived weighting scheme for combining multiple uncertainty measures from ResNet50.

Results

We first evaluated the model using data drawn from the same source as the training set, partitioning that dataset into separate training and testing splits. Table 2 reports the resulting precision, recall and F1 scores for the ResNet50 model, while Table 3 presents the corresponding confusion matrix. To examine performance on images generated by techniques different from those the model was trained on, we next tested the model trained solely on Stable Diffusion data against four additional datasets (Midjourney, GLIDE, VQDM and BigGAN). The results of these cross-dataset experiments are summarised in Table 4 for ResNet50 and Table 5 for ViT. As shown in Tables 4 and 5, the detection accuracy decreases significantly when applied to previously unseen generative models, and the confusion matrices in Tables 6 and 7 indicate that the majority of misclassifications occur for AI-generated images.

Table 2 Performance Metrics for Each Dataset with test set drawn from the same AI image generation technique (AI = 0, Nature = 1) using ResNet50
Table 3 Confusion Matrices for Each Dataset with test set drawn from the same AI image generation technique using ResNet50 (AI=0, Nature=1).
Table 4 Performance Metrics for each Dataset when tested on ResNet50 model trained with Stable Diffusion
Table 5 Performance Metrics for each Dataset when tested on ViT model trained with stable diffusion
Table 6 Confusion Matrices for each Dataset when tested on ResNet50 model trained with Stable Diffusion (AI=0, Nature=1)
Table 7 Confusion Matrices for each Dataset when tested on ViT model trained with stable diffusion (AI=0, Nature=1)

To address the limitations inherent in standard classification, we introduced an acceptance-rejection mechanism governed by a threshold \(\tau\), where the model rejects samples whose confidence (or, conversely, uncertainty) does not meet this specified threshold. Each sample \(i\) in the dataset is assigned a scalar score \(u_i\), which can be interpreted either as uncertainty (higher values indicating lower confidence) or as confidence (higher values indicating greater confidence). By accepting only those samples that exceed the confidence threshold, the system reduces the risk of erroneously labelling AI-generated images as Natural images. The model either accepts or rejects a sample based on whether \(u_i\) crosses the threshold \(\tau\); formally, a sample is rejected if \(u_i > \tau\) (for uncertainty-based scoring) or if \(u_i < \tau\) (for confidence-based scoring). We define four primary outcomes: Correct Accepted (CA), the number of samples for which the prediction is correct and the sample is accepted under the threshold rule; Correct Rejected (CR), the number of samples for which the prediction is correct but the sample is rejected due to high uncertainty (or low confidence); Incorrect Accepted (IA), the number of samples for which the prediction is incorrect yet the sample is accepted; and Incorrect Rejected (IR), the number of samples for which the prediction is incorrect and the sample is rejected, often a desirable outcome in scenarios where errors should be quarantined or flagged for further review. Formally, the formulae are given below.

$$\begin{aligned} \text {Correct Accepted (CA)} = \sum _{i=1}^{N} \textbf{1}\Bigl [\hat{y}_i = y_i \,\wedge \, (\text {acceptance criterion})\Bigr ], \end{aligned}$$
(44)
$$\begin{aligned} \text {Correct Rejected (CR)} = \sum _{i=1}^{N} \textbf{1}\Bigl [\hat{y}_i = y_i \,\wedge \, (\text {rejection criterion})\Bigr ], \end{aligned}$$
(45)
$$\begin{aligned} \text {Incorrect Accepted (IA)} = \sum _{i=1}^{N} \textbf{1}\Bigl [\hat{y}_i \ne y_i \,\wedge \, (\text {acceptance criterion})\Bigr ], \end{aligned}$$
(46)
$$\begin{aligned} \text {Incorrect Rejected (IR)} = \sum _{i=1}^{N} \textbf{1}\Bigl [\hat{y}_i \ne y_i \,\wedge \, (\text {rejection criterion})\Bigr ], \end{aligned}$$
(47)
$$\begin{aligned} \tau ^* = \arg \max _{\tau } \Bigl (\text {AcceptedCorrect\%}_{\text {AI}} + \text {RejectedIncorrect\%}_{\text {AI}} + \text {AcceptedCorrect\%}_{\text {Nature}} + \text {RejectedIncorrect\%}_{\text {Nature}}\Bigr ). \end{aligned}$$
(48)

where \(\textbf{1}[\cdot ]\) is the indicator function returning 1 if the condition is true and 0 otherwise. The acceptance criterion depends on whether \(u_i\) is treated as confidence or uncertainty, for example \((u_i \le \tau )\) versus \((u_i \ge \tau )\). For each class of interest, such as AI (\(y_i = 0\)) or Nature (\(y_i = 1\)), these four outcomes can be computed separately by restricting the sums to samples in that class. To determine the best threshold \(\tau\), we optimised Equation 48 by performing an exhaustive search over \(\tau\) across a finely spaced range derived from the minimum and maximum probabilities or uncertainties in the Stable Diffusion test set. We identified the threshold \(\tau ^*\) that maximises the model’s ability to reject incorrect predictions while retaining as many correct predictions as possible. This selection ensures a balance between retaining confidently correct predictions and rejecting uncertain, potentially incorrect ones. Once the optimal threshold \(\tau ^*\) was determined using Stable Diffusion, it was applied to the other datasets, including Midjourney, GLIDE, VQDM and BigGAN, without further re-optimisation. This approach ensures a consistent evaluation framework across datasets and allows us to assess the generalisation of the acceptance-rejection mechanism under different data distributions. Figures 4 and 5 illustrate the correctly predicted acceptance rates and incorrectly predicted rejection rates across varying classifier probability thresholds and total Fisher Information thresholds, respectively, as examples. These plots highlight the impact of different thresholds on classification performance, with the optimal threshold \(\tau ^*\) obtained using the Stable Diffusion dataset on ResNet-50. The corresponding plots for the other uncertainty measures are omitted due to space constraints, but the CPA and IPR obtained at the threshold for each uncertainty measure are summarised in Table 8. Figure 10 shows a scatter plot of the trade-off between CPA and IPR, where the values are computed as averages over the Midjourney, GLIDE, VQDM, BigGAN and StyleGAN3 datasets. In this plot, each point, coloured uniquely according to its corresponding uncertainty measure, depicts its average CPA (i.e., the percentage of correct predictions accepted) on the x-axis and its average IPR (i.e., the percentage of incorrect predictions rejected) on the y-axis. Ideally, an uncertainty measure would achieve both high CPA and high IPR (i.e., cluster near the optimal performance region of (100, 100)), indicating that it retains correct predictions while effectively rejecting most incorrect ones.
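
A sketch of this exhaustive threshold search, assuming per-sample scores together with the classifier's predictions and ground-truth labels on the Stable Diffusion test split, is shown below; the grid resolution is an arbitrary choice.

```python
import numpy as np

def best_threshold(scores, preds, labels, n_steps=1000, reject_above=True):
    """Exhaustive search for tau* maximising Eq. (48).

    scores: per-sample uncertainty (reject_above=True) or confidence (reject_above=False).
    """
    grid = np.linspace(scores.min(), scores.max(), n_steps)
    best_tau, best_val = grid[0], -np.inf
    for tau in grid:
        rejected = scores > tau if reject_above else scores < tau
        val = 0.0
        for cls in (0, 1):                                   # AI = 0, Nature = 1
            mask = labels == cls
            correct = (preds == labels) & mask
            incorrect = (preds != labels) & mask
            ca = np.sum(correct & ~rejected); cr = np.sum(correct & rejected)
            ia = np.sum(incorrect & ~rejected); ir = np.sum(incorrect & rejected)
            val += ca / (ca + cr) if (ca + cr) else 0.0      # accepted-correct fraction
            val += ir / (ir + ia) if (ir + ia) else 0.0      # rejected-incorrect fraction
        if val > best_val:
            best_tau, best_val = tau, val
    return best_tau
```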

Fig. 4 Correctly predicted acceptance rates and incorrectly predicted rejection rates across varying classifier probability thresholds, including the threshold \(\tau ^*\) obtained using Stable Diffusion on ResNet-50.

Fig. 5 Correctly predicted acceptance rates and incorrectly predicted rejection rates across varying total Fisher Information thresholds, including the optimal threshold \(\tau ^*\) obtained using Stable Diffusion on ResNet-50.

Table 8 Comparison of Correctly Predicted-Accepted (CPA) and Incorrectly Predicted-Rejected (IPR) for ResNet50 and ViT across various datasets and uncertainty measures
Table 9 Rejection rates (Total, Correct Prediction, and Incorrect Prediction) for various confidence and uncertainty measures when ResNet50, trained on Stable Diffusion, is tested on Stable Diffusion (SD), Midjourney (MJ), GLIDE (GD), VQDM (VQ), BigGAN (BG) and StyleGAN3 (SG)
Table 10 Rejection rates (Total, Correct Prediction, and Incorrect Prediction) for various confidence and uncertainty measures when ViT, trained on Stable Diffusion, is tested on Stable Diffusion (SD), MidJourney (MJ), GLIDE (GD), VQDM (VQ), and BigGAN (BG)

One of the main aims of our analysis is to examine the total rejection rate, correct rejection rate, and incorrect rejection rate separately for the AI and Nature classes. Under conditions of dataset shift, misclassifications are concentrated in the AI class, making it particularly important to detect and reject AI-generated errors before they can be mistakenly accepted as Natural images. By breaking down these rates for AI versus Nature, we can pinpoint where errors occur, how many are correctly intercepted, and whether the system is discarding valid samples at an acceptable level. We define these rates as follows. The total rejection rate measures the fraction of all samples, whether correct or incorrect, that end up being rejected:

$$\begin{aligned} \text {Total Rejection Rate} = 100 \times \frac{CR + IR}{CA + CR + IA + IR}, \end{aligned}$$
(49)

where \(CA\) denotes Correct Accepted, \(CR\) Correct Rejected, \(IA\) Incorrect Accepted, and \(IR\) Incorrect Rejected. For the correctly classified samples, the correct rejection rate specifies the percentage of those that still undergo rejection.

$$\begin{aligned} \text {Correct Rejection Rate} = 100 \times \frac{CR}{CA + CR}, \end{aligned}$$
(50)

Among the misclassified samples, the incorrect rejection rate highlights how many of those errors are caught and flagged by the model:

$$\begin{aligned} \text {Incorrect Rejection Rate} = 100 \times \frac{IR}{IA + IR}. \end{aligned}$$
(51)

Each of these rates can be computed separately for AI and Nature by restricting \((CA, CR, IA, IR)\) to instances belonging to the respective label. To capture an overall sense of how strict the mechanism is, we look at the total rejection rate, which indicates what fraction of all samples, correct or not, are flagged. The correct rejection rate then shows how many legitimate predictions are needlessly filtered out, shedding light on whether the threshold might be too conservative. Finally, the incorrect rejection rate reveals how effectively actual misclassifications are quarantined.
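
These per-class rates can be computed directly from the four counts, as in the following sketch; the function signature is illustrative.

```python
import numpy as np

def rejection_rates(preds, labels, rejected, cls):
    """Total, correct and incorrect rejection rates (Eqs. 49-51) for one class."""
    mask = labels == cls
    correct = (preds == labels) & mask
    incorrect = (preds != labels) & mask
    ca = np.sum(correct & ~rejected); cr = np.sum(correct & rejected)
    ia = np.sum(incorrect & ~rejected); ir = np.sum(incorrect & rejected)
    total_rr = 100 * (cr + ir) / max(ca + cr + ia + ir, 1)    # Eq. (49)
    correct_rr = 100 * cr / max(ca + cr, 1)                    # Eq. (50)
    incorrect_rr = 100 * ir / max(ia + ir, 1)                  # Eq. (51)
    return total_rr, correct_rr, incorrect_rr
```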

In scenarios where dataset shifts magnify AI errors, it may be desirable to tolerate a relatively high total rejection rate for AI, even if it includes some correctly predicted AI samples, if doing so substantially reduces the risk of synthetic content slipping through. This trade-off can be acceptable as long as the acceptance of correctly predicted Nature samples and in-domain AI images is not unduly compromised. As an example, Figs. 6 and 7 illustrate the AI and Nature rejection rates, respectively, under Total Fisher Information for ResNet-50 using the best threshold obtained from Stable Diffusion. We focus on ResNet-50 due to its superior performance (in terms of accuracy, precision, recall, and F1 score) compared to ViT, providing a more reliable assessment of how the acceptance-rejection mechanism handles data distribution shifts. Due to space limitations, the remaining rejection rates computed at this threshold are presented in Table 9. While the Vision Transformer (ViT) was also evaluated as part of our broader study, its slightly lower baseline performance makes ResNet-50 a more illustrative example. As a result, our discussion primarily focuses on ResNet-50, with ViT results included for completeness. Table 10 presents the rejection rates obtained using the ViT model.

Fig. 6 Rejection rates for the AI class (Total, Correct Prediction, and Incorrect Prediction) using the optimal threshold \(\tau ^*\) based on Total Fisher Information, derived from ResNet50 trained on Stable Diffusion.

Fig. 7 Rejection rates for the Nature class (Total, Correct Prediction, and Incorrect Prediction) using the optimal threshold \(\tau ^*\) based on Total Fisher Information, derived from ResNet50 trained on Stable Diffusion.

To further illustrate this trade-off, Figs. 11 and 12 present scatter plots that compare the average Correct Prediction Rejection Rate (CPRR) and Incorrect Prediction Rejection Rate (IPRR) computed over the five non–Stable Diffusion datasets (Midjourney, GLIDE, VQDM, BigGAN and StyleGAN3). In these plots, each point is coloured uniquely according to its corresponding uncertainty measure, with the x-axis indicating the average CPRR (i.e., the percentage of correctly predicted samples that are mistakenly rejected) and the y-axis representing the average IPRR (i.e., the percentage of incorrect predictions that are successfully rejected). The variations in generation methods can lead to shifts in the data distribution, making AI classification more challenging, as shown in Tables 6 and 7; hence, a more conservative rejection mechanism (i.e., a higher CPRR) is sometimes justified to ensure that incorrect predictions are effectively filtered out.

Detailed analysis and interpretation of results

The results in Table 8 present a detailed comparison of Correctly Predicted Accepted (CPA) and Incorrectly Predicted Rejected (IPR) rates for ResNet50 across different uncertainty measures and datasets. On the in-distribution dataset, Stable Diffusion, the model performs exceptionally well across most uncertainty metrics. Probability, Total Fisher Information, and Fisher Frobenius Norm each achieve CPA rates above 98% and IPR rates exceeding 94%, indicating strong reliability in both accepting correct and rejecting incorrect predictions. Fisher Entropy, while slightly less conservative, still maintains a good balance. Gaussian Process uncertainty also performs strongly here. However, Entropy of Expected and Knowledge Uncertainty, though achieving perfect IPR (100%), do so at the cost of drastically reduced CPA, rejecting most correct predictions. The Combined Uncertainty measure offers a more balanced alternative, accepting around 67% of correct predictions while still rejecting all incorrect ones.

Fig. 8. Rejection rates for the AI class (Total, Correct Prediction, and Incorrect Prediction) using the optimal threshold \(\tau ^*\) based on Combined Uncertainty, derived from ResNet50 trained on Stable Diffusion.

Fig. 9. Rejection rates for the Nature class (Total, Correct Prediction, and Incorrect Prediction) using the optimal threshold \(\tau ^*\) based on Combined Uncertainty, derived from ResNet50 trained on Stable Diffusion.

Fig. 10. Trade-off between average Correct Prediction Acceptance (CPA) and Incorrect Prediction Rejection (IPR).

Fig. 11. Trade-off between average Correct Prediction Rejection Rate (CPRR) and Incorrect Prediction Rejection Rate (IPRR) for the AI class.

Fig. 12. Trade-off between average Correct Prediction Rejection Rate (CPRR) and Incorrect Prediction Rejection Rate (IPRR) for the Nature class.

Under distributional shift, evaluated on Midjourney, GLIDE, VQDM, BigGAN and StyleGAN3, the performance of the individual uncertainty measures declines, particularly in rejecting incorrect predictions. For instance, on Midjourney, while CPA remains above 85% for probability and the Fisher-based methods, their IPR drops below 33%. In contrast, the Gaussian Process measure demonstrates slightly better robustness under shift, improving IPR by 5-10 percentage points compared with probability or the Fisher metrics, though still far from ideal. Entropy of Expected and Knowledge Uncertainty behave differently. Across all out-of-distribution datasets, they consistently achieve near-perfect IPR, successfully rejecting almost all incorrect predictions. However, they also reject a significant fraction of correct ones: CPA falls below 25% in most cases. This overly conservative behaviour limits their practicality in settings where preserving accurate predictions is critical. Combined Uncertainty emerges as the most robust and balanced approach across datasets. While its CPA under shift is lower than in-distribution (around 52% on Midjourney, 49% on GLIDE, 62% on VQDM, 61% on BigGAN and 85% on StyleGAN3), it compensates with markedly higher IPR, ranging from 29% to 93%, outperforming all other individual metrics in rejecting incorrect predictions. This trade-off, rejecting more correct predictions to prevent misclassification, may be acceptable in high-risk applications, particularly since rejected synthetic samples can be used for further retraining.

The trade-offs become even clearer when these metrics are visualised as a scatter plot in Fig. 10, generated from data averaged over Midjourney, GLIDE, VQDM and BigGAN. The red star at (100, 100) denotes ideal performance: accepting all correct predictions while rejecting all incorrect ones. Under dataset shift, however, most measures fall well short of this ideal. For instance, probability, Total Fisher Information, and Fisher Frobenius Norm cluster at relatively high CPA values (mid-80s to low-90s) but exhibit low IPR values (around 29-33%), indicating that although they retain most correct predictions, they do not effectively reject incorrect ones. Conversely, measures based on Entropy of Expected and Knowledge Uncertainty achieve near-perfect IPR values (close to 100%), but their CPA values are significantly lower, reflecting an overly aggressive rejection policy that discards many correct predictions. As noted above, the Combined Uncertainty measure demonstrates a more balanced performance, achieving a moderate CPA of approximately 60% while delivering a high IPR of about 70%.

Given that most misclassifications under dataset shift occur within the AI class, adopting a more conservative rejection threshold can be justified, particularly if it helps block misclassified synthetic content while keeping performance stable for the Nature class and in-distribution AI samples. Table 9 illustrates this dynamic in detail. On the in-distribution Stable Diffusion data, uncertainty measures such as probability, Total Fisher Information and Fisher Frobenius Norm perform nearly perfectly, with high Correct Prediction Acceptance (CPA) and strong Incorrect Prediction Rejection (IPR) rates across both the AI and Nature classes. However, this reliable behaviour deteriorates once the model is exposed to distributional shift, as seen on Midjourney (MJ), GLIDE (GD), VQDM (VQ), BigGAN (BG) and StyleGAN3 (SG). For the AI class in particular, these same measures now exhibit Correct Prediction Rejection Rates (CPRRs) ranging from 40% to 75% and low IPRRs between 20% and 30%, as visualised in Fig. 11. In other words, they begin rejecting many correct AI predictions while failing to filter out a significant number of incorrect ones. On the other hand, Entropy of Expected and Knowledge Uncertainty behave in a highly conservative manner. In the AI class, both measures yield CPRRs and IPRRs close to 100%, effectively rejecting nearly every prediction, correct or incorrect. While this strategy succeeds at blocking synthetic errors, it does so by indiscriminately discarding a large portion of correct predictions, particularly from the Nature class, as shown in Fig. 12. The Combined Uncertainty measure, however, achieves a more practical balance. While it does have a higher CPRR in the AI class, reflecting that some correct predictions are sacrificed, it compensates with an IPRR of approximately 70%, meaning it filters out most misclassified synthetic inputs (Fig. 11). In the Nature class (Fig. 12), Combined Uncertainty retains nearly all correct predictions, showing a low CPRR while still maintaining a high rejection rate of incorrect ones at 80%. This class-specific stability is crucial in scenarios where natural content must be preserved. Figures 8 and 9 further support this strategy: when the optimal rejection threshold \(\tau ^*\) derived from the Combined Uncertainty signal is applied, the AI class benefits from strong rejection of incorrect predictions, while the Nature class continues to preserve accurate predictions and reject errors effectively.

In summary, the results from Tables 8 and 9, along with Figs. 11 and 12, highlight an important trade-off. While standard uncertainty measures perform well on in-distribution data, their effectiveness drops significantly under distributional shifts, particularly for the AI class. Overly conservative approaches like Entropy of Expected and Knowledge Uncertainty reject nearly all predictions, limiting their utility. In contrast, the Combined Uncertainty measure offers a more desirable middle ground. This balance is especially valuable given that most errors under shift occur in the AI class, where strict filtering of synthetic content is essential. Meanwhile, performance in the Nature class remains stable. Still, as dataset shift continues to increase and overall rejection rates rise, it may become necessary to periodically retrain the model to maintain robustness and adapt to evolving distributions.

Adversarial robustness evaluation

To evaluate the reliability and security of the proposed AI-generated image detection framework, we conducted a series of adversarial robustness experiments. We employed two widely used white-box attack methods: the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD). FGSM generates adversarial examples by adding a single-step perturbation in the direction of the sign of the loss gradient, while PGD iteratively applies smaller perturbations and projects the perturbed inputs back into a constrained region around the original inputs after each step. Both attacks were applied to the validation set of Stable Diffusion images using the ResNet50 classifier trained on the same domain.

Table 11 Performance Metrics on ResNet50 Model Trained with Stable Diffusion under FGSM and PGD Attacks
Table 12 Confusion Matrices for each Attack when tested on ResNet50 model trained with Stable Diffusion (AI=0, Nature=1)
Table 13 Correctly Predicted-Accepted (CPA) and Incorrectly Predicted-Rejected (IPR) under FGSM and PGD adversarial attacks across uncertainty measures.

All inputs were normalised using standard ImageNet statistics before attack generation. Each RGB channel was standardised by subtracting the mean \(\mu = [0.485,\ 0.456,\ 0.406]\) and dividing by the standard deviation \(\sigma = [0.229,\ 0.224,\ 0.225]\), yielding normalised inputs of the form:

$$\begin{aligned} \text {Input}_{\text {norm}} = \frac{\text {Input} - \mu }{\sigma } \end{aligned}$$
(52)

To ensure perturbed inputs remained valid within the normalised pixel range, we applied clamping after each perturbation using:

$$\begin{aligned} \text {Input}_{\text {clamped}} = \min \left( \max \left( \text {Input}_{\text {adv}}, \text {CLAMP}_{\text {MIN}}\right) , \text {CLAMP}_{\text {MAX}}\right) \end{aligned}$$
(53)

where \(\text {CLAMP}_{\text {MIN}} = \frac{0.0 - \mu }{\sigma }\) and \(\text {CLAMP}_{\text {MAX}} = \frac{1.0 - \mu }{\sigma }\).

The attacks were applied under varying perturbation budgets of \(\varepsilon \in \{0.01, 0.03, 0.05\}\). For PGD, we adopted a step size of \(\alpha = \varepsilon / 3\) and performed ten iterations per sample. The impact of these perturbations on the model’s classification performance is summarised in Table 11, with the corresponding confusion matrices presented in Table 12. Under clean, unperturbed conditions, the model achieved near-perfect results, with an accuracy and F1 score of 0.9979. However, even minimal adversarial perturbations caused a substantial performance drop. FGSM with \(\varepsilon = 0.01\) reduced accuracy to 44.71%, with an F1 score of 0.5995. As the perturbation increased to 0.05, the F1 score improved modestly to 0.6597, driven primarily by an increase in recall rather than balanced precision. This trend suggests that under attack the model increasingly predicts the “Nature” class, correctly identifying most Nature images but at the cost of misclassifying a large number of AI-generated samples. Specifically, precision remains low across all FGSM perturbation levels (approximately 47–50%), indicating a high false positive rate in which AI images are incorrectly classified as Nature. PGD, being a stronger, iterative attack, caused even more pronounced degradation. At \(\varepsilon = 0.01\), the model’s accuracy dropped to 35.45% with an F1 score of 0.5207. At higher perturbations, performance plateaued at low levels, with accuracy around 40.57% and an F1 score of 0.5773 at \(\varepsilon = 0.05\). These results show PGD’s superior ability to generate effective adversarial examples that exploit the model’s vulnerabilities more thoroughly than single-step attacks such as FGSM.

The confusion matrices provide further insight into classification behaviour under attack. Without perturbation, the model correctly classified nearly all examples, with only 19 false positives and 14 false negatives. Under FGSM, however, the number of true negatives (correctly identified AI samples) dropped dramatically, from 531 at \(\varepsilon = 0.01\) to just 520 at \(\varepsilon = 0.05\), while false positives surged to over 7400. This indicates that the model’s decision boundary becomes heavily skewed towards Nature predictions across the board. Similarly, PGD attacks essentially eliminated the model’s ability to detect AI-generated content: at \(\varepsilon = 0.05\), the true negative count fell to zero, meaning every AI image was misclassified as Nature. While the model continued to correctly identify a large number of Nature images (e.g., 6492 true positives), this came at the cost of a complete failure to reject adversarial AI samples.

These findings highlight a significant asymmetry in the model’s robustness. Under adversarial conditions, particularly with PGD, the classifier becomes overly confident in assigning the Nature label, likely because the perturbations push the input representations into regions of the feature space associated with natural images. This behaviour is especially concerning for a detector designed to flag AI-generated content, as high false positive rates mean that synthetic images go undetected, potentially undermining trust and safety applications.
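For clarity, the following PyTorch-style sketch shows how FGSM and PGD adversarial examples can be generated in the normalised input space of Eqs. (52) and (53), with a step size of \(\alpha = \varepsilon / 3\) and ten PGD iterations as described above; it is a simplified illustration under these assumptions, not our exact attack code, and `model`, `images`, and `labels` are assumed to be defined elsewhere.

```python
# Illustrative PyTorch sketch of the FGSM and PGD attacks in normalised space.
import torch
import torch.nn.functional as F

MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
STD  = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
CLAMP_MIN = (0.0 - MEAN) / STD   # Eq. (53) bounds in normalised space
CLAMP_MAX = (1.0 - MEAN) / STD


def fgsm(model, x, y, eps):
    """Single-step FGSM on already-normalised inputs x."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    x_adv = x_adv + eps * x_adv.grad.sign()
    # Clamp to the valid normalised pixel range (Eq. 53).
    return torch.max(torch.min(x_adv, CLAMP_MAX), CLAMP_MIN).detach()


def pgd(model, x, y, eps, steps=10):
    """Iterative PGD with step size eps/3 inside an L_inf ball of radius eps."""
    alpha = eps / 3.0
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back into the eps-ball around the original input ...
        x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)
        # ... and clamp to the valid normalised pixel range (Eq. 53).
        x_adv = torch.max(torch.min(x_adv, CLAMP_MAX), CLAMP_MIN)
    return x_adv.detach()


# Example: the perturbation budgets used in the experiments.
# for eps in (0.01, 0.03, 0.05):
#     adv = pgd(model, images, labels, eps)
```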

Fig. 13. Trade-off between average Correct Prediction Acceptance (CPA) and Incorrect Prediction Rejection (IPR), evaluated by averaging their values over all adversarial attack types across the range of perturbation magnitudes (\(\epsilon\)) employed in the experiment.

To enhance robustness against adversarial attacks, we evaluated an uncertainty-based rejection mechanism. We tested the implemented uncertainty measures, namely the Fisher Information-based metrics, Gaussian Process (GP) variance, Monte Carlo Dropout-based metrics, and the Combined Uncertainty approach, under both FGSM and PGD attacks at \(\varepsilon \in \{0.01, 0.03, 0.05\}\). The detailed CPA (Correctly Predicted-Accepted) and IPR (Incorrectly Predicted-Rejected) results are reported in Table 13 for each perturbation level. For a clearer assessment, we computed the average CPA and IPR across all perturbation levels and both attack types, capturing each method’s overall reliability, as illustrated in Fig. 13. Among all methods, Gaussian Process uncertainty achieved the highest IPR (80.45%), meaning it was the most effective at rejecting incorrect predictions, a key goal in adversarial settings. It also maintained a solid CPA of 84.20%, showing that it retained a good portion of correct predictions despite its aggressive filtering. The Combined Uncertainty method also performed well, averaging 61.41% IPR with a higher CPA of 92.17%. While its IPR was slightly lower than GP’s, it struck a more balanced trade-off, making it a strong general-purpose option, especially when both reliability and usability are priorities. Fisher Entropy offered the best balance among the Fisher-based metrics, with 81.44% CPA and 20.59% IPR, providing selective but more conservative rejection. In contrast, Total Fisher Information and the Frobenius Norm were overly cautious under stronger attacks, averaging only around 47.7% CPA and 51.6% IPR, making them less suitable for robust rejection. Monte Carlo Dropout-based metrics such as Entropy of Expected had perfect CPA (100.00%) but only 37.76% IPR, meaning they preserved all correct predictions but missed many incorrect ones, limiting their effectiveness in adversarial defence.
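The CPA and IPR figures above follow from a simple threshold test on the scalar uncertainty of each prediction. The sketch below shows one way such a test and the averaging over attack types and perturbation levels could be implemented; the threshold value and the randomly generated inputs are placeholders, not our experimental data.

```python
# Sketch: thresholded rejection and averaged CPA / IPR across attack settings.
import numpy as np


def cpa_ipr(uncertainty, preds, labels, tau):
    """Accept a prediction if its uncertainty does not exceed the threshold tau.

    Returns (CPA, IPR) in percent:
      CPA = accepted correct predictions / all correct predictions
      IPR = rejected incorrect predictions / all incorrect predictions
    """
    uncertainty = np.asarray(uncertainty)
    correct = np.asarray(preds) == np.asarray(labels)
    accepted = uncertainty <= tau
    cpa = 100.0 * np.mean(accepted[correct]) if correct.any() else 0.0
    ipr = 100.0 * np.mean(~accepted[~correct]) if (~correct).any() else 0.0
    return cpa, ipr


# Average over both attacks and all perturbation budgets (toy placeholder data).
results = []
for attack in ("fgsm", "pgd"):
    for eps in (0.01, 0.03, 0.05):
        # In practice u, preds, labels come from evaluating the attacked set.
        u = np.random.rand(100)
        preds = np.random.randint(0, 2, 100)
        labels = np.random.randint(0, 2, 100)
        results.append(cpa_ipr(u, preds, labels, tau=0.5))

avg_cpa, avg_ipr = np.mean(results, axis=0)
print(f"average CPA = {avg_cpa:.2f}%, average IPR = {avg_ipr:.2f}%")
```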

In summary, when prioritising high rejection of incorrect predictions with reasonable retention of correct ones, Gaussian Process uncertainty proves most effective. Combined Uncertainty offers a practical alternative, delivering moderate performance in both acceptance of correct prediction and rejection of incorrect prediction.

Fig. 14. Trade-off between average Correct Prediction Acceptance (CPA) and Incorrect Prediction Rejection (IPR), computed as the mean across the Midjourney, GLIDE, VQDM, and BigGAN datasets using the DIRE model.

Comparison with other related work

Table 14 Comparison of Performance Metrics across Datasets for FreqNet, DIRE, GLCM+CNN, LBP+CNN, and Our classifier (AI = 0, Nature = 1), trained on Stable Diffusion

The core innovation of our approach lies in its ability to reject incorrect predictions under out-of-distribution (OOD) conditions rather than relying solely on raw classification accuracy. Unlike conventional classifiers that provide deterministic outputs for every input, our method introduces a rejection mechanism based on uncertainty, allowing the model to defer decisions when confidence is low, a capability that becomes crucial under dataset shift. Prior methods such as FreqNet35 and DIRE offer interpretable, feature-based predictions but lack this flexibility. FreqNet detects spectral anomalies believed to be induced by architectural artefacts (e.g., upsampling layers) in GAN-generated images. DIRE36, in contrast, leverages reconstruction residuals from a pre-trained diffusion model, operating under the assumption that synthetic images are better reconstructed than natural ones due to their proximity to the learned generation manifold. While both models offer intuitive insights, they produce fixed outputs for all samples, regardless of confidence, making them vulnerable when exposed to unseen data from generators such as Midjourney, GLIDE, VQDM, or BigGAN.

To explore alternative representations, we also implemented two texture-based hybrid approaches, GLCM+CNN and LBP+CNN, that combine handcrafted feature extraction with deep learning. The GLCM+CNN pipeline computes Gray-Level Co-occurrence Matrices to capture spatial texture patterns and passes them to a CNN for classification. This method has shown promise in tasks such as presentation attack detection37 and histopathology classification38. Similarly, LBP+CNN encodes local texture variations using Local Binary Patterns, which are then fed into a CNN. LBP-based features have proven effective in liveness detection and synthetic face recognition39 owing to their robustness to subtle visual artefacts. When evaluated on our cross-generator detection task, both hybrid models performed well on in-distribution data (Stable Diffusion), achieving F1 scores of 0.9863 (GLCM+CNN) and 0.9788 (LBP+CNN). However, their generalisation to OOD data declined significantly. For example, GLCM+CNN reached only 0.6522 on Midjourney and 0.6629 on GLIDE, while LBP+CNN dropped to 0.5981 and 0.6795 on the same datasets. These results suggest that while handcrafted texture features capture certain class-specific artefacts, they are less adaptive to shifts across generative models. In comparison, our ResNet50 model trained on Stable Diffusion achieves better generalisation across all OOD datasets. As shown in Tables 14 and 4, ResNet50 outperforms FreqNet, DIRE, and the hybrid models on Midjourney (F1 = 0.7458) and GLIDE (F1 = 0.7908). While FreqNet performs slightly better on VQDM (0.7387 vs 0.6994) and GLCM+CNN approaches similar performance (0.6777), our model demonstrates more consistent cross-generator robustness. More importantly, our approach does not stop at classification. The critical distinction lies in our confidence-aware rejection mechanism, which enables the model to handle OOD inputs more safely. By combining uncertainty estimates from Fisher Information, Gaussian Process variance, and Monte Carlo Dropout, the system can flag and reject low-confidence predictions. This not only improves reliability but also prevents overconfident misclassifications without requiring retraining for each new data distribution. In high-risk applications where misclassified synthetic content must be blocked, this rejection capability is far more impactful than accuracy alone.
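To make the GLCM+CNN and LBP+CNN pipelines concrete, the sketch below extracts GLCM statistics and a uniform-LBP histogram with scikit-image and feeds them to a small classifier head; the specific offsets, properties, and network used here are illustrative assumptions rather than the exact configurations evaluated above.

```python
# Sketch of handcrafted texture features (GLCM and LBP) feeding a small classifier.
import numpy as np
import torch
import torch.nn as nn
from skimage.feature import graycomatrix, graycoprops, local_binary_pattern


def glcm_features(gray_uint8):
    """GLCM statistics at a few offsets/angles (assumed, not the paper's exact set)."""
    glcm = graycomatrix(gray_uint8, distances=[1, 2], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.concatenate([graycoprops(glcm, p).ravel() for p in props])


def lbp_features(gray_uint8, points=8, radius=1):
    """Histogram of uniform Local Binary Patterns."""
    lbp = local_binary_pattern(gray_uint8, points, radius, method="uniform")
    hist, _ = np.histogram(lbp, bins=points + 2, range=(0, points + 2), density=True)
    return hist


class TextureClassifier(nn.Module):
    """Small fully connected head standing in for the CNN stage of the hybrids."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, x):
        return self.net(x)


# Example usage on a random greyscale image (labels: 0 = AI, 1 = Nature).
gray = (np.random.rand(224, 224) * 255).astype(np.uint8)
feats = np.concatenate([glcm_features(gray), lbp_features(gray)]).astype(np.float32)
model = TextureClassifier(in_dim=feats.shape[0])
logits = model(torch.from_numpy(feats).unsqueeze(0))
print(logits.shape)  # torch.Size([1, 2])
```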

Importantly, we demonstrate that this rejection mechanism is model-agnostic. To validate this, we trained a separate ResNet on DIRE’s residual features and confirmed that all uncertainty measures from our main pipeline remain applicable. As shown in Fig. 14, the trade-off between average Correct Prediction Acceptance (CPA) and Incorrect Prediction Rejection (IPR) remains consistent, underscoring the robustness of our approach. This illustrates that our rejection strategy is not limited to deep pixel-space classifiers; it can also be effectively integrated into interpretable, feature-driven models such as DIRE, FreqNet, or hybrid architectures built from handcrafted features like GLCM+CNN and LBP+CNN.

Conclusion

The empirical results across both in-distribution (Stable Diffusion) and shifted datasets (Midjourney, GLIDE, VQDM, BigGAN and StyleGAN3) highlight the critical role of uncertainty measures in controlling the acceptance or rejection of samples under varying data distributions. When the test data closely resembles the training distribution, standard metrics such as probability, total Fisher information, and the Fisher Frobenius norm excel; they accept nearly all correct predictions while successfully rejecting most incorrect ones. However, under dataset shift, particularly involving AI-generated samples, their effectiveness at rejecting incorrect predictions diminishes significantly, indicating a vulnerability to unseen or evolving generative methods.

Conversely, highly conservative measures such as Entropy of Expected and Knowledge Uncertainty tend to reject nearly all newly encountered AI data, resulting in near-perfect incorrect rejection rates. However, this aggressive behaviour comes at a substantial cost: a large portion of correct predictions across both the AI and Nature classes is also discarded. In contrast, the Combined Uncertainty measure demonstrates a more balanced and robust performance under dataset shift. While it does reject more correct out-of-distribution AI predictions than the standard measures, it consistently achieves a high incorrect rejection rate of approximately 70% on previously unseen generative models. Importantly, it still retains a reasonable proportion of correct predictions, particularly for the Nature class and in-distribution AI images, where rejection is minimal. In high-risk scenarios, it is often desirable to adopt this stricter stance on AI predictions, as it helps quarantine potentially misclassified synthetic samples. These rejected but high-quality examples can then be leveraged during retraining, enabling the model to better adapt to evolving generative distributions. This trade-off between aggressiveness and selectivity is also observed under adversarial attacks. When subjected to FGSM and PGD perturbations, Gaussian Process-based uncertainty achieves approximately 80% rejection of successful adversarial examples, outperforming the other individual measures in terms of error filtering. However, it also exhibits a lower acceptance rate for correct predictions, retaining only about 85% of them. In contrast, the Combined Uncertainty measure rejects around 60% of adversarially perturbed errors while preserving over 90% of correct predictions, making it a more reliable and conservative defence mechanism that balances caution with accuracy.

In summary, this analysis underscores that as dataset shift intensifies, whether through novel generative models or more pronounced domain shifts, no single measure perfectly preserves the balance between correct acceptance and incorrect rejection. Retraining the model or adopting advanced domain adaptation strategies may be necessary to maintain stringent performance standards. Nevertheless, the Combined Uncertainty strategy presented here offers a versatile framework that can be tuned according to specific risk tolerances. By leveraging multiple uncertainty indicators together, it mitigates blind spots inherent in individual measures, supporting more informed decision-making when confronted with rapidly evolving synthetic image generation methods.