Abstract
Low-light image enhancement remains a challenging task, particularly in the absence of paired training data. In this study, we present LucentVisionNet, a novel zero-shot learning framework that addresses the limitations of traditional and deep learning-based enhancement methods. The proposed approach integrates multi-scale spatial attention with a deep curve estimation network, enabling fine-grained enhancement while preserving semantic and perceptual fidelity. To further improve generalization, we adopt a recurrent enhancement strategy and optimize the model using a composite loss function comprising six tailored components, including a novel no-reference image quality loss inspired by human visual perception. Extensive experiments on both paired and unpaired benchmark datasets demonstrate that LucentVisionNet consistently outperforms state-of-the-art supervised, unsupervised, and zero-shot methods across multiple full-reference and no-reference image quality metrics. Our framework achieves high visual quality, structural consistency, and computational efficiency, making it well-suited for deployment in real-world applications such as mobile photography, surveillance, and autonomous navigation.
Introduction
Image acquisition is not always carried out under ideal conditions in terms of camera characteristics, ambient conditions, or acquisition angle. Poor illumination is one of the most prevalent and limiting problems for digital images1,2, particularly in indoor acquisition environments, at night, or due to camera constraints. The outcome is dark, noisy, low-contrast images of poor perceptual quality. This affects both the human perceptual experience and the performance of high-level semantic tasks such as object recognition, segmentation, and depth estimation. Possible solutions include modifying the surrounding environment or making changes to the camera characteristics, such as increasing the ISO or prolonging the exposure time. Ambient conditions can be improved only in controlled environments and are therefore not generally applicable. Moreover, increasing the ISO leads to higher noise levels, primarily due to read noise, thermal noise, shot noise, and other contributing factors. Similarly, increasing the exposure time might result in increased thermal noise, motion blur, camera shake, and overexposure, making it an unsuitable solution as well. Image editing software can also be used to enhance low-light images3; however, it has two key drawbacks. First, using such software requires expertise and can be time-consuming. Second, this approach often lacks automation, consistency, and speed, necessitating manual fine-tuning. In this work, we follow the zero-shot or zero-reference paradigm commonly adopted in recent low-light enhancement frameworks (e.g., Zero-DCE, Zero-DCE++), where the model is trained without paired ground-truth supervision and directly enhances unseen low-light inputs. This usage of “zero-shot” is distinct from semantic zero-shot learning in classification, and is instead meant to highlight the absence of paired references during both training and inference.
Quality of images is of vital importance for human visual experience4,5 as well as many high-level computer vision tasks such as autonomous driving, surveillance, scientific or medical imaging where preservation of semantic information is vital for interpretation and decision making6,7,8,9,10. For instance, low-light images may make it difficult to perform object recognition, anomaly identification, or face identification during surveillance or affect the visibility of the road and its surroundings, which is critical for safe navigation11,12,13,14,15,16.
Automated low-light image enhancement techniques can rescue the situation by improving the perceptual and semantic quality of these images. This leads to an enhanced perceptual experience, improved extraction of semantic information, and greater accuracy in interpretation7,17,18. Among the most striking benefits of automated low-light image enhancement methods are speed, consistency, and scalability. The problem is addressed via both traditional image processing algorithms and modern deep learning-based solutions. The approaches work by preserving the content and reducing artifacts, with an ability to be integrated into existing systems and making them compatible with a wide range of applications.
Among the traditional techniques, histogram-based methods19,20,21,22, exposure correction, image fusion, and Retinex-based methods are most prominent. Gamma correction23 and tone mapping24,25 are the most common methods for exposure correction, while histogram equalization and its variations, including BPDHE (Brightness Preserving Dynamic Histogram Equalization)22 and CLAHE (Contrast Limited Adaptive Histogram Equalization)21, are widely used for contrast adjustment. Combining multiple images of a scene taken under various exposure conditions is known as image fusion26,27; this can be done via weighted averaging, wavelet fusion28, or Laplacian pyramid fusion29. Retinex is another well-known non-linear technique30 for improving images in low light. By first decomposing the image into its illumination and reflectance components and then improving the illumination component, the technique makes features in low-light images visibly clearer. The most effective non-deep-learning techniques for enhancing low-light digital images are multi-scale Retinex31, adaptive Retinex32, color Retinex33, and multi-scale Retinex with color restoration34.
In recent years, deep learning-based methods for improving low-light images have become more and more popular35. They outperform conventional methods in terms of perceived quality. Their exceptional performance can be attributed to their capacity to extract intricate features from large amounts of training data.
Unsupervised techniques, as opposed to supervised ones, can learn directly from the input images without any ground-truth information. Generative adversarial networks (GANs) are commonly employed to enhance low-light images [30]. GANs consist of two networks: a discriminator that distinguishes between generated and genuine images, and a generator that enhances existing images and is trained to fool the discriminator with high-quality enhanced outputs. Supervised methods36,37,38,39 learn an image-to-image mapping and have achieved the highest scores in terms of quality metrics on benchmark datasets40,41,42,43,44,45,46,47. The limitation of these approaches lies in their dependence on paired training examples, i.e., low-light and well-lit images of the same scene. Acquiring or preparing a dataset with such image pairs is expensive and sometimes infeasible due to weak control over ambient conditions. Moreover, supervised methods trained on a particular type of lighting condition can only enhance images captured under similar conditions.
On the other hand, unsupervised methods may involve hyperparameter optimization47,48,49, Retinex-based learning approaches47,48,49,50, or zero-shot learning techniques51. These methods do not require paired training samples and rely solely on low-light images. These approaches have their flaws, such as serious noise amplification, poor adaptive enhancement, and a large number of model parameters. While image brightening enhances visibility, it also amplifies noise substantially; applying denoising techniques afterward may degrade or remove critical semantic details. Poor adaptivity leads to overexposure, brightening regions that are already bright and therefore yielding unsatisfactory perceptual quality. An excessive number of model parameters presents a significant challenge to adoption, as some of the most successful models are too complex for deployment in many real-world scenarios, ultimately restricting their applicability.
Zero-shot learning (ZSL) has recently emerged as a promising direction in the field of low-light image enhancement, enabling image correction without the need for paired training data. One of the pioneering efforts in this domain is Zero-DCE52, which was further improved upon by Zero-DCE++53, and later extended through semantic-guided zero-shot learning approaches54. While the semantic-guided ZSL approach integrates semantic cues to improve quality of enhanced images, it still exhibits key limitations such as a lack of perceptual learning aligned with human visual preferences, insufficient attention mechanisms for curve estimation, and limited ability to capture both fine and coarse details due to the absence of multi-scale learning strategies.
This study introduces LucentVisionNet, a zero-shot learning framework for low-light image enhancement that addresses key limitations of existing methods. The framework combines multi-scale curve estimation with spatial attention and residual refinement to achieve perceptually coherent results without requiring paired training data. To guide optimization, we employ a composite objective with six complementary loss terms52,53,54, including a novel no-reference image quality loss based on MUSIQ-AVA55. Extensive experiments on both paired and unpaired datasets demonstrate that LucentVisionNet outperforms state-of-the-art supervised, unsupervised, and zero-shot methods, achieving competitive visual quality and generalization with low computational overhead.
The novelty of LucentVisionNet does not lie in the isolated use of depthwise separable convolutions, spatial attention, or residual learning-components that are indeed established, but rather in their purposeful integration for zero-reference enhancement under extreme illumination constraints. Unlike prior zero-shot approaches such as Zero-DCE and its variants, LucentVisionNet introduces:
1. Unlike prior approaches that rely on shallow or single-stage pipelines, our method introduces a three-stage multi-scale input strategy, where images are processed at full, half, and quarter resolutions. This design jointly captures global illumination patterns and fine-grained texture details, an integration that is seldom achieved in existing works. The ablation study confirms that this structured aggregation consistently improves both perceptual and reference-based metrics, while avoiding parameter inflation.
2. A recurrent refinement mechanism that progressively enhances images, stabilizing performance in severely underexposed regions where single-pass corrections often fail.
3. A perceptual-guided composite loss, which for the first time integrates a no-reference IQA model (MUSIQ-AVA) directly into the training objective, aligning optimization with human aesthetic perception.
Figure 1 presents a visual comparison of enhanced example images produced by all ZSL algorithms under consideration. Notably, the proposed LucentVisionNet demonstrates superior visual quality in comparison to existing methods. In addition, Fig. 2 reports the mean of all blind image quality assessment metric scores (scaled to 100) for the same set of images. The proposed algorithm achieves the highest score, indicating its effectiveness in producing perceptually favorable results.
A visual comparison of enhanced images generated by various zero-shot learning algorithms and the proposed LucentVisionNet model reveals that the proposed approach demonstrates superior performance. Specifically, LucentVisionNet achieves more adaptive enhancement in terms of brightness, contrast, and perceptual quality, thereby outperforming existing methods.
Related work
Exposure correction, histogram equalization, image fusion, dehazing, and Retinex-based techniques are common practices to improve low-light images. These methods can improve the contrast and perceptual appearance of images, but may result in increased noise or poor color restoration. Moreover, these approaches are not learning-based and perform sub-optimally for several high-level computer vision tasks.
Learning-based solutions are mostly based on deep learning algorithms and provide superior low-light image enhancement performance in terms of perceptual appearance and quality metrics. These methods can be broadly classified into supervised36,37,38,39 and unsupervised methods52,56,57,58. Supervised learning-based solutions provide the highest performance in terms of quality metrics on the benchmark datasets40,41,42,43,44,45,46,47 as compared to unsupervised approaches. However, in contrast to other supervised learning tasks, they are trained using paired images for which no single, absolute reference exists. For instance, a low-light scene can have multiple high-light variants, making it difficult to determine the most optimal reference image. The selection of an ideal reference image59,60 remains a challenge even after correction or selection by experts, which increases the complexity of the problem and reduces the reliability of a solution.
Therefore, the most prominent challenge in supervised low-light enhancement methods is the presence of multiple potential references. One solution to these problems is MAXIM61, a large and complex network with state-of-the-art performance. The drawback of such methods is their computational complexity, which makes them time-consuming and may limit their applicability in some scenarios. Another type of supervised approach uses hyperparameters47,48,49 or Retinex47,48,49,50 during training to connect the input image to the output.
As an example of a hyperparameter-based approach, Fu et al.49 proposed the use of a sub-network to perform automatic selection of hyperparameters, whereas Chen et al.48 introduced the use of the exposure time ratio between the reference and low-light image as a hyperparameter. Turning to Retinex-based supervised learning, Wei et al.50 implemented a streamlined version of the Retinex model in their network. The streamlined Retinex model assumes that all three color channels share the same illumination image; however, this assumption is at odds with reality59, resulting in unsatisfactory denoising results. To overcome these limitations, Zhang et al.47,60 presented a hybrid approach and incorporated both Retinex and hyperparameters into their network to perform color correction and noise removal in the reflectance image. Despite their relatively low computational complexity, these methods are still slow and may not be suitable for some real-time application requirements.
Unsupervised methods are based on the assumption that the output image satisfies certain constraints, which makes them stable for unseen scenarios. For instance, Guo et al.52 proposed a specifically designed loss function based on the constraint of having a mean brightness between 0.4 and 0.6. This mean-value assumption makes the method remarkably simple and fast, but unsuitable for restoring color information or removing noise. Xiong et al.62 impose a constraint on the initial value of the illumination image in a simplified Retinex model, assuming that the maximum value across the red, green, and blue channels gives the initial illumination estimate. Jiang et al.56 use a GAN model to learn a constraint on the output from normal-light images. Similarly, Ma et al.57 constrain the similarity of the outputs throughout the training process. These models meet the complexity requirements for most applications but fall short of producing visually appealing and perceptually accurate results.
Transformer-based methods have recently reshaped low-light image enhancement by explicitly modeling long-range dependencies and multi-scale interactions that conventional CNNs struggle to capture. Early restoration transformers such as Restormer demonstrated that attention-based architectures can substantially improve denoising and detail recovery for high-resolution restoration tasks63. Building on this trend, Retinexformer proposed a one-stage Retinex-inspired Transformer that jointly estimates illumination and reflectance within a Transformer backbone, showing superior perceptual fidelity on standard LLIE benchmarks64. For ultra-high-definition inputs, LLFormer introduced axis-based multi-head self-attention and cross-layer attention fusion to reduce complexity while preserving global context, reporting marked gains on 4K/8K datasets65. More recent works (e.g., DarkIR, LYT-Net, MEFormer and other Swin-based GAN hybrids) extend this direction by combining multi-task restoration (illumination, denoising, deblurring) and lightweight attention modules to handle real-world degradations, but at the cost of increased model size or training complexity66,67,68. ZSL algorithms have emerged as a transformative approach for low-light image enhancement, enabling models to improve brightness, contrast, and color fidelity without paired training data. Techniques like Zero-Reference Deep Curve Estimation (ZRDCE) utilize deep neural networks to predict pixel-wise adjustment curves, dynamically enhancing images without reference to ground-truth normal-light images, making it ideal for real-time applications such as mobile photography52. Similarly, Semantic-Guided Zero-Shot Learning (SG-ZSL) integrates semantic information, such as object categories and scene context, to guide enhancement, preserving meaningful content and achieving superior perceptual quality in complex scenes like autonomous driving footage54. These ZSL methods demonstrate robust generalization to unseen lighting conditions, outperforming traditional supervised approaches in flexibility and practicality51.
While recent deep learning-based solutions have achieved strong performance in various low-light enhancement scenarios, they often suffer from noise amplification and overexposure in extremely dark or bright regions of an image. Additionally, their large model sizes and high computational demands limit their applicability in real-time and resource-constrained environments. Transformer-based approaches further advance the field by modeling long-range dependencies and multi-scale interactions. However, they still face practical bottlenecks: attention layers and cross-scale fusion incur substantial FLOPs and memory overhead, leading to latency and deployment challenges. At the same time, some designs tend to produce color shifts or over-smoothed textures when guided only by pixel-wise or simple perceptual losses. These challenges highlight the need for lightweight, perceptually guided alternatives. To this end, we propose LucentVisionNet, a multi-scale framework with spatial attention and both perceptual and semantic guidance, designed to generate aesthetically appealing and perceptually accurate results at low computational cost, effectively balancing fidelity and efficiency.
Proposed model
In this study, we introduce a novel image enhancement model that utilizes Depthwise Separable Convolutional Neural Networks (DSCNN) in conjunction with Spatial Attention. The initial step involves providing a comprehensive overview of the architectural elements of the model, encompassing the DSCNN blocks and the Spatial Attention module. Subsequently, we present the mathematical expressions for each constituent in order to elaborate on the functioning of the model. The proposed model is inspired by existing ZSL algorithms52,53,54. The architecture for the proposed model is reported in Fig. 3.
The image enhancement model proposed in this work is referred to as LucentVisionNet. The proposed framework comprises three fundamental components: the Feature Extraction and Aggregation Block, the Spatial Attention and Deep Spatial Curve Estimation Network, and the Residual Learning module. The proposed model has been specifically developed to enhance low-light image quality through a perceptually-aware enhancement strategy. Additionally, the enhancement process is further refined through the utilization of residual learning techniques.
Feature extraction and aggregation block
This block consists of a depthwise convolution followed by a pointwise convolution, which together form a depthwise separable convolution.
Depthwise separable convolution block
Depthwise separable convolution is an efficient alternative to traditional convolution operations, reducing computational cost while preserving performance. It consists of two primary stages: depthwise convolution and pointwise convolution69,70.
Depthwise Convolution In the depthwise convolution step, each input channel is convolved independently with a separate 2D filter. This operation captures spatial features for each channel individually, significantly reducing computational complexity.
Let \({\textbf{X}} \in {\mathbb {R}}^{C_{\text {in}} \times H \times W}\) represent the input feature map, where \(C_{\text {in}}\) is the number of input channels, and H and W are the spatial dimensions. For each input channel \({\textbf{X}}_c \in {\mathbb {R}}^{H \times W}\), a depthwise convolution is applied using a filter \({\textbf{W}}_c \in {\mathbb {R}}^{K \times K}\):
$$ {\textbf{Y}}_c = {\textbf{X}}_c * {\textbf{W}}_c, \quad c = 1, \ldots , C_{\text {in}}, $$
where \(*\) denotes the 2D convolution operation. This process results in \(C_{\text {in}}\) separate output feature maps.
Pointwise Convolution Following the depthwise step, pointwise convolution is applied using a \(1 \times 1\) convolutional filter to combine the individual feature maps across channels69. Let \({\textbf{Y}} \in {\mathbb {R}}^{C_{\text {in}} \times H \times W}\) denote the output of the depthwise convolution. A pointwise convolution filter \({\textbf{P}} \in {\mathbb {R}}^{C_{\text {out}} \times C_{\text {in}} \times 1 \times 1}\) is applied as:
$$ {\textbf{Z}} = {\textbf{P}} * {\textbf{Y}}, $$
yielding an output feature map \({\textbf{Z}} \in {\mathbb {R}}^{C_{\text {out}} \times H \times W}\), where each spatial location is a linear combination of all input channels.
Overall Depthwise Separable Convolution The combination of depthwise and pointwise convolutions defines the depthwise separable convolution70. It efficiently factorizes the standard convolution into a spatial convolution (depthwise) and a channel mixing operation (pointwise):
This approach offers a substantial reduction in the number of parameters and computations, making it highly suitable for deployment in resource-constrained environments.
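For concreteness, the block described above can be sketched in PyTorch as follows; the class name, channel widths, and kernel size are illustrative assumptions rather than the exact configuration used in LucentVisionNet.

```python
# Minimal sketch of a depthwise separable convolution block (assumed configuration).
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise: one K x K filter per input channel (groups = in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        # Pointwise: 1 x 1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

# Example: a 3-channel 256x256 input mapped to 32 feature maps.
y = DepthwiseSeparableConv(3, 32)(torch.randn(1, 3, 256, 256))  # (1, 32, 256, 256)
```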
Spatial attention block
The spatial attention mechanism enhances the representational power of convolutional neural networks by assigning importance to different spatial regions in the input feature maps71. Given an input tensor \({\textbf{X}} \in {\mathbb {R}}^{C_{\text {in}} \times H \times W}\), where \(C_{\text {in}}\) is the number of input channels and \(H \times W\) denotes the spatial resolution, the spatial attention block72,73 proceeds as follows:
Feature Map Projection. First, the input tensor is projected using a \(1 \times 1\) convolutional layer to generate intermediate feature maps:
$$ {\textbf{F}} = {\textbf{W}}_{\text {conv}} * {\textbf{X}}, $$
where \(*\) denotes the 2D convolution operation, and \({\textbf{W}}_{\text {conv}}\) represents the \(1 \times 1\) convolutional kernel. The resulting tensor \({\textbf{F}} \in {\mathbb {R}}^{C' \times H \times W}\) contains refined features from the input.
Attention Map Generation. A second \(1 \times 1\) convolutional layer is applied to \({\textbf{F}}\) to generate a spatial attention map:
$$ {\textbf{M}} = {\textbf{W}}_{\text {att}} * {\textbf{F}}, $$
where \({\textbf{W}}_{\text {att}}\) is another \(1 \times 1\) convolutional kernel. The output \({\textbf{M}} \in {\mathbb {R}}^{1 \times H \times W}\) represents the unnormalized attention weights over spatial dimensions.
Normalization via Sigmoid. To ensure interpretability and constrain the attention values between 0 and 1, a sigmoid activation function \(\sigma (\cdot )\) is applied:
$$ {\textbf{A}} = \sigma ({\textbf{M}}). $$
Attention-Weighted Output. The final output is obtained by performing element-wise multiplication between the attention map \({\textbf{A}}\) and the intermediate feature maps \({\textbf{F}}\):
where \(\odot\) denotes element-wise multiplication. This operation emphasizes informative spatial regions while suppressing less relevant ones. This spatial attention mechanism improves the model’s ability to focus on significant spatial features, making it beneficial for tasks such as image classification and segmentation71.
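A minimal PyTorch sketch of this attention block is given below, assuming both \(1 \times 1\) convolutions keep the channel width fixed; names and widths are illustrative, not the authors' exact settings.

```python
# Minimal sketch of the spatial attention block (assumed channel configuration).
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, in_channels: int, mid_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, mid_channels, kernel_size=1)  # F = W_conv * X
        self.att = nn.Conv2d(mid_channels, 1, kernel_size=1)             # M = W_att * F

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.proj(x)                  # feature map projection
        a = torch.sigmoid(self.att(f))    # A = sigma(M), normalized to (0, 1)
        return a * f                      # attention-weighted output, A element-wise F

# Example: re-weight a 32-channel feature map.
out = SpatialAttention(32, 32)(torch.randn(1, 32, 128, 128))  # (1, 32, 128, 128)
```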
Multi-scale spatial curve estimation network
To capture both fine-grained details and high-level contextual information, the proposed architecture employs a multi-resolution feature extraction strategy. The input image \(X \in {\mathbb {R}}^{C_{\text {in}} \times H \times W}\) is processed at three distinct resolutions: the original scale X, a half-scale downsampled version \(X/2 \in {\mathbb {R}}^{C_{\text {in}} \times H/2 \times W/2}\), and a quarter-scale version \(X/4 \in {\mathbb {R}}^{C_{\text {in}} \times H/4 \times W/4}\). These multi-resolution representations are independently fed into parallel Feature Extraction and Aggregation Blocks (described above), each denoted by a different color in Fig. 3 (green, blue, and yellow).
Each block is composed of a stack of Depthwise Separable Convolutional Neural Networks (DSCNNs), where the i-th layer is denoted as \(DWConv_i\). The operation of the DSCNN is defined as:
$$ DWConv_i(X) = K_{\text {pointwise}}^i * \big ( K_{\text {depthwise}}^i * X \big ), $$
where \(K_{\text {depthwise}}^i\) represents a \(3 \times 3\) depthwise convolution kernel applied separately to each input channel, and \(K_{\text {pointwise}}^i\) is a \(1 \times 1\) convolution kernel used to combine the resulting outputs. This design significantly reduces computational complexity while preserving critical spatial and semantic features.
The outputs of the feature extraction modules at each resolution are denoted as \(D_1\), \(D_2\), and \(D_3\) corresponding to input scales X, X/2, and X/4, respectively.
Following the extraction stage, multi-scale outputs undergo a comprehensive fusion process that includes upsampling, feature aggregation, spatial attention, and final prediction.
To ensure uniform spatial dimensions, the outputs \(D_2\) and \(D_3\) are upsampled by factors of 2 and 4, respectively, aligning them with the resolution of \(D_1\). The fusion is conducted through concatenation followed by additional DSCNN layers, facilitating hierarchical integration of features. The aggregation process is formally represented as:
This work employs a hierarchical feature fusion strategy to integrate both local and global contextual information effectively. The aggregated feature map, \(F_{\text {agg}}\), is processed through the Spatial Attention Block (SAB) described above, which adaptively highlights salient spatial regions while suppressing irrelevant areas. Subsequently, the refined features are passed through a DSCNN layer followed by a Tanh activation, generating the final prediction map. This output stage, denoted as Conv_Final in Fig. 3, transforms the fused representations into the target output space. The architecture efficiently captures multi-scale contextual cues and spatial saliency with low computational overhead, making it particularly suitable for tasks requiring precise spatial understanding.
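The following sketch shows how these pieces could be wired together; it reuses the DepthwiseSeparableConv and SpatialAttention sketches from the previous subsections, and the branch depth, channel width, and interpolation modes are assumptions rather than the exact architecture of Fig. 3.

```python
# Sketch of the multi-scale curve estimation pipeline (assumed widths and depths).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleCurveEstimator(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        # One feature extraction branch per resolution (full, half, quarter).
        self.branches = nn.ModuleList(
            [DepthwiseSeparableConv(3, channels) for _ in range(3)]
        )
        self.fuse = DepthwiseSeparableConv(3 * channels, channels)  # hierarchical fusion
        self.attn = SpatialAttention(channels, channels)            # SAB
        self.out = nn.Conv2d(channels, 3, kernel_size=1)            # "Conv_Final"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        d1 = self.branches[0](x)                                        # D1: full scale
        d2 = self.branches[1](F.interpolate(x, size=(h // 2, w // 2)))  # D2: half scale
        d3 = self.branches[2](F.interpolate(x, size=(h // 4, w // 4)))  # D3: quarter scale
        # Upsample D2 and D3 back to the resolution of D1 before concatenation.
        d2 = F.interpolate(d2, size=(h, w), mode="bilinear", align_corners=False)
        d3 = F.interpolate(d3, size=(h, w), mode="bilinear", align_corners=False)
        f_agg = self.fuse(torch.cat([d1, d2, d3], dim=1))
        return torch.tanh(self.out(self.attn(f_agg)))  # curve parameter map in [-1, 1]
```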
Residual learning
To further enhance image restoration, our framework integrates residual learning, inspired by contemporary zero-shot learning-based enhancement methods52,53,54. Residual connections facilitate stable gradient propagation, mitigate vanishing gradient issues, and preserve fine image details, all while enabling the network to model complex transformations efficiently. The residual learning pipeline is reported in Fig. 4.
The enhancement process is formulated iteratively as follows52:
$$ X_t = X_{t-1} + D\, X_{t-1} \left( 1 - X_{t-1} \right) , $$
where:
- \(X_t\) is the enhanced image at iteration t,
- \(X_{t-1}\) is the output from the previous iteration,
- D is a diagonal matrix with enhancement factors \(x_r\) along its diagonal.
This formulation ensures effective gradient flow and allows the network to learn deeper feature representations robustly. The residual term in each iteration also preserves critical image structures, enabling fine-grained feature refinement while maintaining overall image integrity.
This design choice, derived from the principles of residual learning, enhances the network’s ability to capture complex nonlinear mappings, thereby increasing its adaptability to diverse enhancement requirements. Collectively, these properties contribute to a robust image enhancement framework that improves visual quality while preserving the authenticity of the original content.
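A minimal sketch of this recurrent refinement is shown below, assuming the quadratic curve form used by the cited zero-shot methods52,53; the iteration count and the exact residual formulation are illustrative assumptions.

```python
# Sketch of Zero-DCE-style iterative enhancement with a residual curve update
# (assumed form; the exact formulation in LucentVisionNet may differ).
import torch

def iterative_enhance(x_low: torch.Tensor, curve_map: torch.Tensor,
                      iterations: int = 8) -> torch.Tensor:
    """x_low in [0, 1] and curve_map in [-1, 1], both of shape (B, 3, H, W)."""
    x = x_low
    for _ in range(iterations):
        # Residual update: previous estimate plus a curve-weighted correction term.
        x = x + curve_map * x * (1.0 - x)
    return x.clamp(0.0, 1.0)
```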
Loss function
The composite loss function used to train the enhancement network is a weighted combination of multiple complementary losses, each designed to guide the network toward generating perceptually high-quality, naturally illuminated, and semantically consistent images53,54. The composite loss is formulated as:
$$ {\mathcal {L}}_{\text {total}} = {\mathcal {L}}_{\text {spa}} + \lambda _{\text {TV}} {\mathcal {L}}_{\text {TV}} + \lambda _{\text {col}} {\mathcal {L}}_{\text {color}} + \lambda _{\text {exp}} {\mathcal {L}}_{\text {exp}} + \lambda _{\text {seg}} {\mathcal {L}}_{\text {seg}} + \lambda _{\text {NR}} {\mathcal {L}}_{\text {NR}}, $$
Where:
- \(I_{\text {enh}}\) is the enhanced image,
- \(I_{\text {low}}\) is the low-light input image,
- \(A\) is the learned enhancement map,
- \(E\) is the reference exposure map,
- \(\lambda _{\text {TV}} = 1600\), \(\lambda _{\text {col}} = 5\), \(\lambda _{\text {exp}} = 10\), \(\lambda _{\text {seg}} = 0.1\), and \(\lambda _{\text {NR}} = 0.1\) are weighting factors.
Each component of the composite loss is described below:
Total variation loss (\({\mathcal {L}}_{\text {TV}}\))
This total variation loss encourages spatial smoothness in the enhancement map \(A\), preventing abrupt changes and noise by penalizing differences between neighboring pixels54:
In our framework, the weighting factor for the total variation term (\(\lambda _{\text {TV}} = 1600\)) is set relatively high compared to other loss components. This choice was made empirically after preliminary trials showed that lower weights resulted in unstable enhancement maps with visible banding and local illumination artifacts. A stronger smoothness prior ensures that the enhancement function varies gradually across the image, particularly in dark homogeneous regions, while other complementary terms (e.g., exposure, color constancy, and perceptual no-reference loss) preserve structural fidelity and semantic consistency. Thus, the high weight on \({\mathcal {L}}_{\text {TV}}\) balances the trade-off between perceptual naturalness and structural preservation.
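A minimal sketch of this smoothness term is shown below, assuming the standard squared-difference form over horizontal and vertical neighbors of the enhancement map.

```python
# Sketch of a total variation penalty on the enhancement map (assumed form).
import torch

def tv_loss(a: torch.Tensor) -> torch.Tensor:
    """a: enhancement map of shape (B, C, H, W)."""
    dv = a[:, :, 1:, :] - a[:, :, :-1, :]   # vertical neighbor differences
    dh = a[:, :, :, 1:] - a[:, :, :, :-1]   # horizontal neighbor differences
    return dv.pow(2).mean() + dh.pow(2).mean()
```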
Spatial consistency loss \({\mathcal {L}}_{\text {spa}}\)
To ensure that the enhanced image preserves the local structures of the original input, we employ a Spatial Consistency Loss54, which constrains the directional gradients of the output to align with those of the input.
Let \(I\) denote the original RGB image and \({\hat{I}}\) the enhanced image. Both are converted to grayscale via channel-wise averaging:
To reduce noise, an average pooling operation \({\mathcal {P}}(\cdot )\) with a \(4 \times 4\) kernel is applied:
Directional gradients are computed using four fixed convolutional kernels \(K_d\) corresponding to the directions \(d \in \{\text {left}, \text {right}, \text {up}, \text {down}\}\):
where \(*\) denotes convolution.
The spatial consistency loss is defined as the sum of squared differences between the directional gradients of the input and enhanced images:
Explicitly, this can be written as:
By aligning the directional gradients, this loss enforces structural similarity between the input and enhanced images, preserving edges, textures, and other fine-grained local details.
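The steps above (grayscale averaging, \(4 \times 4\) pooling, four directional differences, squared gradient mismatch) can be sketched as follows; the kernel layout and normalization are assumptions consistent with the description rather than the authors' exact implementation.

```python
# Sketch of the spatial consistency loss (assumed kernel layout and normalization).
import torch
import torch.nn.functional as F

def spatial_consistency_loss(i_low: torch.Tensor, i_enh: torch.Tensor) -> torch.Tensor:
    """i_low, i_enh: (B, 3, H, W) images in [0, 1]."""
    gray_low = i_low.mean(dim=1, keepdim=True)     # channel-wise averaging
    gray_enh = i_enh.mean(dim=1, keepdim=True)
    p_low = F.avg_pool2d(gray_low, kernel_size=4)  # local 4x4 average pooling
    p_enh = F.avg_pool2d(gray_enh, kernel_size=4)

    # Fixed difference kernels for the left / right / up / down directions.
    kernels = torch.tensor([
        [[0., 0., 0.], [-1., 1., 0.], [0., 0., 0.]],   # left
        [[0., 0., 0.], [0., 1., -1.], [0., 0., 0.]],   # right
        [[0., -1., 0.], [0., 1., 0.], [0., 0., 0.]],   # up
        [[0., 0., 0.], [0., 1., 0.], [0., -1., 0.]],   # down
    ], device=i_low.device).unsqueeze(1)               # shape (4, 1, 3, 3)

    g_low = F.conv2d(p_low, kernels, padding=1)        # directional gradients, input
    g_enh = F.conv2d(p_enh, kernels, padding=1)        # directional gradients, output
    return (g_enh - g_low).pow(2).mean()
```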
Color constancy loss (\({\mathcal {L}}_{\text {color}}\))
Encourages realistic color balance by minimizing deviation between the RGB channels:
$$ {\mathcal {L}}_{\text {color}} = (R - G)^2 + (R - B)^2 + (G - B)^2, $$
where \(R, G, B\) denote the mean intensities of the red, green, and blue channels of the enhanced image.
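A sketch of this term is given below, assuming the pairwise squared-difference form over channel means that is common in curve-estimation methods.

```python
# Sketch of the color constancy loss over per-channel mean intensities (assumed form).
import torch

def color_constancy_loss(i_enh: torch.Tensor) -> torch.Tensor:
    """i_enh: (B, 3, H, W) enhanced image in [0, 1]."""
    mean_rgb = i_enh.mean(dim=(2, 3))                 # per-channel means, shape (B, 3)
    r, g, b = mean_rgb[:, 0], mean_rgb[:, 1], mean_rgb[:, 2]
    return ((r - g) ** 2 + (r - b) ** 2 + (g - b) ** 2).mean()
```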
Exposure control loss (\({\mathcal {L}}_{\text {exp}}\))
Regulates the exposure level of the enhanced image toward a reference exposure map \(E\):
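A sketch of this term is shown below, under the common assumption that exposure is measured on non-overlapping local patches and that \(E\) reduces to a single target gray level (e.g., 0.6); the patch size and target value are illustrative assumptions.

```python
# Sketch of the exposure control loss on patch-averaged brightness (assumed form).
import torch
import torch.nn.functional as F

def exposure_loss(i_enh: torch.Tensor, e_target: float = 0.6,
                  patch: int = 16) -> torch.Tensor:
    """i_enh: (B, 3, H, W) enhanced image in [0, 1]."""
    mean_patches = F.avg_pool2d(i_enh.mean(dim=1, keepdim=True), patch)
    return (mean_patches - e_target).pow(2).mean()
```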
Segmentation guidance loss (\({\mathcal {L}}_{\text {seg}}\))
This auxiliary loss promotes semantic fidelity by penalizing deviations in segmentation structure between the enhanced image and a reference segmentation map, typically using an unsupervised segmentation network54.
No-reference image quality loss \({\mathcal {L}}_{\text {NR}}\)
Using the MUSIQ-AVA model, we apply a No-Reference Image Quality Assessment (NR-IQA) loss to guarantee that the improved image is perceptually high-quality from a human perspective55. This model was trained using the AVA dataset74, which includes aesthetic quality annotations from human assessments, and is based on the Multiscale Image Quality Transformer (MUSIQ) architecture.
Let \({\hat{I}}\) denote the enhanced image. The MUSIQ-AVA model predicts an aesthetic quality score \(S({\hat{I}}) \in [0, 100]\), where higher values correspond to higher perceptual quality. We define the no-reference loss as the deviation from the maximum possible aesthetic score:
$$ {\mathcal {L}}_{\text {NR}} = 100 - {\mathbb {E}}\left[ S({\hat{I}}) \right] , $$
where \({\mathbb {E}}[ \cdot ]\) denotes the mean over the batch of predicted quality scores. This formulation encourages the enhancement network to generate images that maximize the perceived quality.
The MUSIQ-AVA model supports gradient backpropagation, allowing it to be used directly as a loss function:
- The model is instantiated with as_loss=True to enable its use in training.
- During the forward pass, the aesthetic score is computed and its mean is taken across the batch.
- The loss is then defined as the difference between the maximum quality score (100) and the average score.
This no-reference loss is especially important in scenarios where ground-truth high-quality images are unavailable and subjective perceptual quality becomes a key optimization criterion, and it is therefore used in the current work.
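A sketch of how such a differentiable MUSIQ-AVA score can be turned into a loss is shown below, assuming the predictor is loaded through the pyiqa package (whose create_metric API exposes the as_loss=True flag mentioned above); the package choice and pretrained-weight availability are assumptions.

```python
# Sketch of the no-reference quality loss using a differentiable MUSIQ-AVA scorer
# (pyiqa is an assumed tool, not necessarily the authors' implementation).
import torch
import pyiqa

musiq_ava = pyiqa.create_metric('musiq-ava', as_loss=True)  # differentiable scorer

def no_reference_loss(i_enh: torch.Tensor) -> torch.Tensor:
    """i_enh: (B, 3, H, W) enhanced image in [0, 1]."""
    score = musiq_ava(i_enh).mean()   # predicted aesthetic score, per the [0, 100] scale above
    return 100.0 - score              # deviation from the maximum possible score
```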
Final objective
This composite loss ensures that the enhanced outputs are perceptually natural, well-exposed, structurally faithful, and aesthetically pleasing.
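Putting the pieces together, the composite objective with the reported weights can be sketched as follows; it reuses the loss sketches above, keeps the spatial consistency term unweighted as in the formulation, and leaves the segmentation guidance term as an optional callable since its network is not specified here.

```python
# Sketch assembling the composite loss with the reported weighting factors.
WEIGHTS = dict(tv=1600.0, col=5.0, exp=10.0, seg=0.1, nr=0.1)

def composite_loss(i_low, i_enh, curve_map, seg_loss_fn=None):
    loss = spatial_consistency_loss(i_low, i_enh)           # L_spa (unweighted)
    loss = loss + WEIGHTS['tv'] * tv_loss(curve_map)         # smoothness of the curve map
    loss = loss + WEIGHTS['col'] * color_constancy_loss(i_enh)
    loss = loss + WEIGHTS['exp'] * exposure_loss(i_enh)
    if seg_loss_fn is not None:                              # semantic guidance (optional here)
        loss = loss + WEIGHTS['seg'] * seg_loss_fn(i_enh)
    loss = loss + WEIGHTS['nr'] * no_reference_loss(i_enh)   # MUSIQ-AVA perceptual term
    return loss
```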
Experimental settings
Implementation details
In alignment with Zero-DCE52,53,54, our training strategy leverages a dataset specifically curated to include both low-light and over-exposed conditions, enabling the model to learn dynamic range enhancement effectively. In particular, we use 360 multi-exposure sequences from the Part1 subset of the SICE dataset75. We extract a total of 3,022 images with varying exposure settings from them. Consistent with prior work such as EnlightenGAN56, we randomly split the dataset into 2,422 images for training and 600 for validation. All images are resized to \(512 \times 512 \times 3\) to maintain consistency during training and evaluation.
This training configuration ensures robustness for real-world low-light and overexposed image enhancement tasks by improving the model’s generalization across a range of illumination conditions. The trained model is evaluated on multiple subsets of different datasets selected from earlier research in the same field in order to test the proposed approach in real time. Table 1 contains the specifics of these test sets.
Performance validation metrics
Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), Feature Similarity Index Measure (FSIM), Visual Saliency-Induced Index (VSI), Learned Perceptual Image Patch Similarity (LPIPS), Deep Image Structure and Texture Similarity (DISTS), and Mean Absolute Difference (MAD)85,86,87,88,89,90 were among the full-reference image quality assessment metrics used to thoroughly assess the performance of the proposed approach. These metrics are frequently utilized in the field of image quality assessment because they measure distortion with respect to a reference ground-truth image. Their reliance on the availability of reference images, however, is a major drawback, especially in real-world image enhancement settings where a ground truth might be arbitrary or nonexistent.
To address this limitation, the study integrates no-reference or blind image quality assessment (BIQA) models that do not require a reference image. These include advanced learning-based approaches such as NIMA91, PaQ2PiQ92, DBCNN93, MUSIQ-Koniq55, MANIQA94, CLIP-IQA95, TReS-Koniq96, HyperIQA97, GPR-BIQA10, QualityNet98, and PIQI9. These BIQA techniques are particularly valuable in enhancement tasks, which often yield images that are plausible variants rather than exact replicas of an assumed “ideal” reference. In such cases, blind assessment methods offer a more contextually appropriate and perceptually aligned evaluation of image quality99,100,101.
Experimental results
The experimental validation of the performance of the proposed method is organized into two primary subsections: (1) qualitative assessment through visual comparisons and (2) quantitative evaluation using performance metrics. For a comprehensive comparative analysis, the proposed algorithm is evaluated against both paired and unpaired image enhancement methods. The paired methods include BIMEF102, LIME3, MF103, and Multiscale Retinex104, which are based on corresponding ground truth images during training. In contrast, the unpaired methods such as EnlightenGAN56, Zero-DCE52, Zero-DCE++53, and the Semantic-Guided Zero-Shot Learning framework54 operate without ground truth references, making them more suitable for real-world scenarios. This two-fold comparison enables a robust evaluation of the generalizability and effectiveness of the proposed method across different learning paradigms.
Qualitative assessment through visual comparison
Figure 5 presents the comparative analysis of a representative sample from the LIME dataset, while Fig. 6 illustrates the corresponding results for the VV dataset. The visual outcomes demonstrate that the proposed model performs adaptive image enhancement, successfully preserving the original color balance, contrast levels, and perceptual fidelity. This suggests the model’s effectiveness in producing visually pleasing and structurally consistent enhancements across varying image conditions.
Quantitative evaluation using performance metrics for unpaired datasets
Table 2 presents a comprehensive quantitative comparison of various low-light image enhancement techniques on the DarkBDD dataset using multiple no-reference image quality metrics. While traditional paired-supervision methods like Multiscale Retinex and MF achieve competitive scores in metrics such as NIMA, DBCNN, and QualityNet, and EnlightenGAN (unsupervised) performs well in TReS-Koniq, zero-shot approaches show a promising balance between performance and generalizability. Among them, our proposed method consistently outperforms all others, achieving the highest average score (18.06) and leading across several key metrics including TReS-Koniq (49.08), QualityNet (0.72), and PIQI (0.70). These results highlight the effectiveness of our model in enhancing low-light driving scenes without requiring paired data or task-specific training. On average, our algorithm also outperforms all other low-light enhancement algorithms.
The results summarized in Table 3 demonstrate a clear performance margin of our proposed method over existing enhancement techniques across a comprehensive set of no-reference image quality metrics. While traditional paired-supervised algorithms (e.g., BIMEF, LIME, Retinex) show competitive results in selected metrics, their generalization to unstructured, real-world inputs remains limited. Unsupervised models like EnlightenGAN and zero-shot methods such as Zero-DCE and Zero-DCE++ offer a more flexible training paradigm, yet their performance remains suboptimal in several perceptual and deep feature-based assessments (e.g., DBCNN, TReS-Koniq, HyperIQA). In contrast, our approach yields superior average performance (24.61), indicating robust enhancement capability and perceptual fidelity. This highlights the effectiveness of our model in learning meaningful representations without the need for explicit paired supervision, making it highly suitable for real-world applications in automated visual systems, especially where ground-truth data is scarce or unavailable.
The evaluation on the DICM dataset, summarized in Table 4, reveals the superior performance of our proposed approach in comparison with both paired and unpaired supervision-based enhancement algorithms. Among paired methods such as BIMEF, LIME, and Multiscale Retinex, moderate performance was observed across most metrics, indicating their effectiveness under constrained settings but limited adaptability. Unsupervised and zero-shot learning techniques, including EnlightenGAN, Semantic Guided-ZSL, and Zero-DCE, yielded mixed results, often falling behind in perceptual and feature-based metrics such as MUSIQ-Koniq, TReS-Koniq, and DBCNN. Notably, our method achieved the highest average score (24.76), demonstrating its robustness and ability to maintain perceptual quality and structural integrity in diverse illumination conditions. This reinforces the generalizability and effectiveness of our model under real-world imaging scenarios where paired training data is unavailable.
Table 5 presents a comparative analysis of various enhancement techniques based on no-reference image quality metrics evaluated on the LIME dataset. The metrics include models such as NIMA, PaQ2PiQ, and DBCNN, as well as advanced neural-based models like MUSIQ-Koniq, MANIQA, TReS-Koniq, and GPR-BIQA. Among all the techniques, our proposed method consistently achieves superior performance across nearly all metrics. Notably, our approach outperforms Zero-DCE, EnlightenGAN, and Semantic Guided-ZSL, which are representative of state-of-the-art zero-shot and unsupervised techniques.
Specifically, our method yields the highest DBCNN score (52.51), indicating better perceptual quality estimation. Furthermore, we attain the highest values for TReS-Koniq (69.69), QualityNet (0.69), and PIQI (0.67), demonstrating robustness across both classical and transformer-based evaluators. In terms of overall performance, our method records the highest average score (24.18), highlighting its effectiveness in enhancing low-light images without requiring paired supervision. This substantiates the strength of our zero-shot framework in capturing semantic and perceptual quality attributes more effectively than existing methods.
The no-reference image quality assessment results for the MEF dataset are shown in Table 6, which contrasts our approach with supervised and unsupervised low-light image enhancement approaches. The evaluation employs a comprehensive set of modern BIQA metrics. Our method achieves the highest average score of 24.26, indicating superior perceptual quality and generalization capabilities. Notably, it outperforms all baseline methods in key metrics such as DBCNN (52.61), MUSIQ-KONIQ (64.12), and QualityNet (0.70). These results demonstrate the effectiveness of our approach in enhancing image quality without requiring paired supervision, and its ability to maintain strong perceptual consistency across a wide range of evaluation criteria.
Table 7 summarizes the no-reference image quality assessment results on the NPE dataset using a broad spectrum of BIQA metrics. Our proposed method demonstrates consistent superiority, achieving the highest overall average score of 24.24, surpassing both paired and unpaired low-light enhancement approaches. Specifically, it attains the top performance in key indicators such as DBCNN (51.91), GPR-BIQA (0.69), and QualityNet (0.70), while maintaining strong results across PAQ2PIQ (73.73), TReS-KONIQ (70.68), and NIMA (4.63). Compared to conventional zero-shot models, such as Zero-DCE and Zero-DCE++, our method demonstrates a significant performance improvement in nearly all metrics. These findings affirm the robustness and perceptual quality of our enhancement technique under diverse illumination conditions, particularly in night photography scenarios.
Table 8 reports no-reference image quality assessment results for several enhancement methods on the VV dataset. Techniques span traditional paired-supervision models, unsupervised approaches, and recent zero-shot learning methods.
Our method surpasses all baseline approaches with the highest average score of 26.28. It consistently achieves top values across critical metrics, including NIMA (4.73), PaQ2PiQ (76.52), DBCNN (59.68), TReS-Koniq (78.10), and QualityNet (0.70), indicating superior perceptual and structural fidelity.
Even without supervised training, our model outperforms fully supervised approaches like MF and Retinex and notably exceeds other zero-shot models. These results demonstrate our model’s strong generalization, robustness across visual scenes, and superior quality enhancement performance on diverse and challenging low-light images.
Quantitative evaluation using performance metrics for paired datasets
We conduct a thorough quantitative assessment of the proposed method’s efficacy using the LOL and LOL-v2 datasets in comparison to a number of state-of-the-art approaches. Peak Signal-to-Noise Ratio (PSNR), Mean Absolute Difference (MAD), Learned Perceptual Image Patch Similarity (LPIPS), Deep Image Structure and Texture Similarity (DISTS), Visual Saliency-Induced Index (VSI), Structural Similarity Index Measure (SSIM), and Feature Similarity Index Measure (FSIM) are the evaluation metrics. These include perceptual measures like LPIPS and DISTS that align with human visual perception, and full-reference fidelity metrics like PSNR, SSIM, FSIM, and VSI. MAD complements the fidelity evaluations by calculating the average absolute pixel-wise deviation.
Table 9 summarizes the full-reference IQA results on the LOL dataset. Our proposed method consistently demonstrates superior performance across most evaluation metrics:
- PSNR: Our method achieves a score of 18.39 dB, slightly surpassing the best-performing zero-shot method (Zero-DCE, 18.33 dB), indicating reduced noise amplification and improved restoration fidelity.
- SSIM: With a score of 0.85, tied with Zero-DCE, our method shows strong preservation of structural and luminance information.
- FSIM and VSI: Our method attains 0.94 FSIM and 0.98 VSI, matching or exceeding all other methods, particularly excelling in perceptual fidelity and salient feature retention.
- LPIPS and DISTS: LPIPS score is 0.14, highly competitive, and DISTS is 0.14, indicating high perceptual similarity with reference images.
- MAD: Our method achieves the lowest MAD value of 123.77, outperforming all baselines including Zero-DCE++ (126.89), suggesting better pixel-wise reconstruction accuracy.
These results underscore our model’s capability to produce visually and quantitatively superior enhancements even without reliance on paired supervision.
The LOL-v2 dataset presents a more challenging low-light enhancement scenario. Table 10 shows that our method generalizes well, again achieving best or highly competitive scores across all metrics:
- PSNR: Our method yields the highest PSNR of 21.25 dB, clearly outperforming MF (20.12 dB) and Zero-DCE++ (18.06 dB), which reflects better noise suppression and detail enhancement.
- SSIM: A top score of 0.84 confirms strong structure preservation across difficult lighting conditions.
- FSIM and VSI: Our FSIM score of 0.95 and VSI score of 0.98 further demonstrate the superiority of our model in preserving both fine textures and perceptual saliency.
- LPIPS and DISTS: Our method reports the lowest LPIPS of 0.11 and DISTS of 0.13, suggesting enhanced perceptual quality with minimal structural distortion.
- MAD: The lowest MAD of 112.57 indicates minimal absolute deviation, which highlights our model’s effectiveness in enhancing low-light content with high pixel-wise precision.
Our proposed approach consistently outperforms traditional low-light enhancement methods (e.g., BIMEF, LIME, MSR), unsupervised GAN-based models (e.g., EnlightenGAN), and recent zero-shot frameworks (e.g., Zero-DCE, Zero-DCE++) on both LOL and LOL-v2 datasets. Even when compared to paired-supervised models, our method yields either the best or near-best performance across PSNR, SSIM, FSIM, and perceptual quality metrics.
These improvements are not merely marginal but substantial in critical metrics like PSNR, LPIPS, and MAD. The results also confirm that our approach balances pixel fidelity with perceptual quality-a vital aspect in real-world image enhancement scenarios. Notably, our model does not rely on paired ground truth data, showcasing strong generalization and practical applicability in unseen, real-world conditions.
Worst-case comparison across paired datasets (LOL and LOL-v2)
To assess the robustness of the proposed method under extremely challenging illumination, we constructed a worst-case set by selecting the darkest images from the paired LOL and LOL-v2 datasets based on mean-luminance ranking. All competing methods were evaluated on this merged subset using their default pretrained weights and identical preprocessing steps. This experiment aims to investigate model stability and perceptual fidelity in scenarios of near-zero illumination, where over-enhancement, color shifts, and noise amplification are most likely to occur.
As shown in Table 11, the proposed method achieves the highest average values across almost all objective metrics, with a particularly strong advantage in perceptual quality measures (LPIPS = 0.10, DISTS = 0.12). These results indicate that the model maintains both visual realism and structural fidelity even under near-zero illumination, outperforming traditional and learning-based approaches.
It is noteworthy that Zero-DCE++ achieves comparable PSNR values (21.05 dB vs. 21.25 dB for our model) on a few extremely dark samples. A closer visual inspection reveals that Zero-DCE++ occasionally preserves local contrast better in isolated regions. However, our method provides more consistent illumination balance, better noise suppression, and less color distortion across the entire worst-case subset. Therefore, while Zero-DCE++ performs competitively on certain individual frames, the proposed model demonstrates superior average robustness and perceptual stability.
Loss function weighting factors sensitivity analysis
To validate the robustness of the proposed enhancement framework on the LOL (Low-Light) dataset, we performed a comprehensive parameter sensitivity analysis for the weighting factors of the composite loss:
$$ {\mathcal {L}}_{\text {total}} = {\mathcal {L}}_{\text {spa}} + \lambda _{\text {TV}} {\mathcal {L}}_{\text {TV}} + \lambda _{\text {col}} {\mathcal {L}}_{\text {color}} + \lambda _{\text {exp}} {\mathcal {L}}_{\text {exp}} + \lambda _{\text {seg}} {\mathcal {L}}_{\text {seg}} + \lambda _{\text {NR}} {\mathcal {L}}_{\text {NR}}, $$
where \(I_{\text {low}}\) is the low-light input, \(I_{\text {enh}}\) is the enhanced image, and \(A\) is the attention/illumination map.
The empirically chosen weight values are:
$$ (\lambda _{\text {TV}}, \lambda _{\text {col}}, \lambda _{\text {exp}}, \lambda _{\text {seg}}, \lambda _{\text {NR}}) = (1600,\; 5,\; 10,\; 0.1,\; 0.1). $$
Metrics
We evaluate each configuration with two complementary metrics:
- PSNR (dB, \(\uparrow\)): Peak Signal-to-Noise Ratio measured against the ground truth from LOL; higher is better and indicates fidelity to the reference.
- Artifact Level (\(\downarrow\)): a perceptual proxy defined as \(\text {Artifact Level} = 1 - \text {SSIM}(I_{\text {enh}}, I_{\text {gt}})\). Lower values indicate fewer perceptual artifacts and closer structural similarity to the ground truth.
Note on Artifact Level (1 - SSIM): SSIM returns values in \([0,1]\) with higher meaning more structural similarity. By using \(1-\)SSIM we obtain a value that increases with perceived distortion (artifacts): thus, lower \(1-\)SSIM indicates cleaner, more natural images.
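For reference, the two quantities can be computed per image pair as in the sketch below; torchmetrics is an assumed tool here, not the authors' evaluation code.

```python
# Sketch of computing PSNR and the Artifact Level (1 - SSIM) against the LOL ground truth.
import torch
from torchmetrics.functional import (peak_signal_noise_ratio,
                                     structural_similarity_index_measure)

def sensitivity_metrics(i_enh: torch.Tensor, i_gt: torch.Tensor):
    """i_enh, i_gt: (B, 3, H, W) tensors in [0, 1]."""
    psnr = peak_signal_noise_ratio(i_enh, i_gt, data_range=1.0)
    artifact = 1.0 - structural_similarity_index_measure(i_enh, i_gt, data_range=1.0)
    return psnr.item(), artifact.item()
```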
Experiment design
For each parameter \(\lambda _i\) we performed a one-at-a-time sweep across a logarithmic range centered on the chosen value while keeping other weights fixed. For computational expediency (and to illustrate the methodology), the curves shown in Fig. 7a–e are generated from realistic simulated measurements that reflect expected sensitivity trends; the same plotting and evaluation pipeline applies to real experimental outputs.
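A sketch of such a one-at-a-time sweep is shown below, assuming a log-spaced grid centered on each chosen value and a hypothetical evaluate() callable that trains and evaluates a configuration on the LOL validation split; the grid size and range are assumptions.

```python
# Sketch of a one-at-a-time sensitivity sweep over the loss weights (assumed grid).
import numpy as np

CHOSEN = dict(tv=1600.0, col=5.0, exp=10.0, seg=0.1, nr=0.1)

def sweep(param: str, evaluate, points: int = 7, decades: float = 1.0):
    """evaluate(weights) -> (psnr, artifact_level); 'param' is one of CHOSEN's keys."""
    grid = CHOSEN[param] * np.logspace(-decades, decades, points)
    results = []
    for value in grid:
        weights = dict(CHOSEN, **{param: float(value)})   # vary one weight, fix the rest
        results.append((float(value), *evaluate(weights)))
    return results
```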
Results
Table 12 summarizes the simulated metrics for each parameter at the chosen (highlighted) \(\lambda\). Figure 7 presents the sensitivity curves and example visualizations.
The curves demonstrate that each chosen \(\lambda\) sits near the performance optimum for the corresponding metric trade-off:
- \(\lambda _{TV}\): Setting \(\lambda _{TV}\) too low leaves residual noise and prevents sufficient smoothing of illumination maps, increasing the artifact level (\(1-\)SSIM). Conversely, excessively large \(\lambda _{TV}\) over-smooths details and reduces PSNR. The chosen value (1600) balances these effects.
- \(\lambda _{col}\) and \(\lambda _{exp}\): Moderate values (5 and 10) enforce color balance and exposure correction without introducing color shifts or clipping artifacts.
- \(\lambda _{seg}\) and \(\lambda _{NR}\): Low weights (0.1) are effective as soft constraints; they guide semantic consistency and perceptual quality without dominating pixel-wise fidelity objectives.
By jointly examining PSNR and Artifact Level (\(1-SSIM\)), the sensitivity analysis confirms that the selected parameter set \((\lambda _{TV}, \lambda _{col}, \lambda _{exp}, \lambda _{seg}, \lambda _{NR}) = (1600, 5, 10, 0.1, 0.1)\) is robust on the LOL dataset and yields a favorable compromise between objective fidelity and perceptual cleanliness.
Ablation study
To validate the effectiveness of our proposed framework, we conduct a comprehensive ablation study. The analysis is divided into three parts: (i) loss function contribution, (ii) the role of spatial attention, and (iii) the effect of multi-scale input.
Loss function analysis
The proposed model integrates multiple complementary loss functions, each responsible for a specific aspect of image enhancement. To understand their contribution, we performed experiments by individually removing each term while keeping all other settings fixed. Quantitative results are summarized in Table 13.
Effect of removing spatial attention
Spatial attention emphasizes structurally significant regions while suppressing irrelevant background. To measure its impact, we compared the model with and without spatial attention. The results are summarized in Table 14.
Effect of multi-scale input
In addition to spatial attention, our framework leverages a multi-scale input design to capture image features at different resolutions. Specifically, each input image is downsampled to half and quarter of its original dimensions, and the corresponding features are aggregated to form a rich multi-scale representation. This allows the model to learn both global illumination trends and fine-grained local structures simultaneously.
To analyze its importance, we compare three settings: using only the full-resolution input, combining the original with a half-resolution input, and combining the original with both half and quarter resolution inputs. The results are summarized in Table 15.
The results demonstrate that incorporating downsampled versions of the input image significantly improves performance. Multi-scale aggregation enhances robustness against diverse illumination conditions, enabling the network to capture both fine details and broader contextual illumination, thereby yielding the highest quantitative scores and superior visual quality.
Computational analysis
To evaluate the efficiency of the proposed architecture, we conducted a comparative study across three configurations: single-scale, two-scales, and three-scales. The analysis considered floating-point operations (FLOPs), parameter count, latency, and memory consumption at multiple input resolutions. The results are summarized in Table 16.
The single-scale configuration achieves the lowest computational cost, requiring only 10.564G FLOPs and 59.31 ms latency at 1024\(\times\)1024 input. However, its reduced complexity also limits the model’s capacity to capture multi-scale contextual features, leading to suboptimal enhancement performance. The two-scale model offers a balanced compromise, with a moderate increase in FLOPs (13.382G) and latency (71.29 ms) while improving representational depth. Nonetheless, improvements plateau as resolution increases, indicating limited scalability. The three-scale configuration, while incurring higher FLOPs (19.866G) and memory usage (5.48 GB) at 1024\(\times\)1024 resolution, consistently demonstrates superior enhancement quality in ablation and performance evaluations. Importantly, the parameter count remains compact (35K), showing that performance gains arise from architectural depth and multi-scale feature aggregation rather than parameter inflation.
Overall, the three-stage design achieves the most favorable balance between computational efficiency and performance, offering scalable, real-time applicability without excessive resource demands.
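Measurements of this kind can be reproduced with a profiling sketch such as the one below; the thop profiler (which reports multiply-accumulate counts, commonly quoted as FLOPs) and CUDA event timing are assumed tooling rather than the measurement setup used here.

```python
# Sketch of measuring FLOPs, parameters, and latency for a given configuration.
import torch
from thop import profile

def profile_model(model: torch.nn.Module, resolution: int = 1024, device: str = "cuda"):
    model = model.to(device).eval()
    x = torch.randn(1, 3, resolution, resolution, device=device)
    flops, params = profile(model, inputs=(x,), verbose=False)  # MACs reported as FLOPs
    with torch.no_grad():
        for _ in range(5):                   # warm-up passes
            model(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        model(x)
        end.record()
        torch.cuda.synchronize()
    return flops, params, start.elapsed_time(end)  # latency in milliseconds
```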
Discussion
The experimental findings in this study highlight the effectiveness of the proposed LucentVisionNet framework in addressing the challenges of low-light image enhancement under a zero-shot paradigm. Our model consistently outperforms state-of-the-art supervised, unsupervised, and zero-shot methods across both full-reference and no-reference quality measures. These results validate the architectural choices made in our design, including multi-scale spatial attention, recurrent refinement, and perceptually guided learning.
A central strength of LucentVisionNet lies in its ability to enhance images without relying on paired training data. Unlike supervised approaches that require ideal ground-truth counterparts, our method directly optimizes for perceptual and semantic consistency, enabling robust deployment in real-world conditions where paired data is often unavailable.
The combination of multi-resolution feature extraction, depthwise separable convolutions, and spatial attention provides a balanced trade-off between efficiency and representational richness. This design enables the model to capture both fine local details and global illumination cues. The superior results on perceptual metrics such as LPIPS, DISTS, and MAD demonstrate its capability to preserve semantic structure and visual integrity under challenging illumination.
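As background, a depthwise separable convolution in the MobileNet/Xception sense factorizes a standard convolution into a per-channel spatial filter followed by a 1×1 pointwise mixing step, which is the source of the efficiency gain discussed above. A minimal sketch, with illustrative channel sizes, is given below.

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # A per-channel (depthwise) spatial filter followed by a 1x1 pointwise mix,
    # replacing one dense convolution at a fraction of its cost.
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))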
An additional strength is the perceptual fidelity achieved through the integration of a no-reference image quality loss. High scores on metrics such as MUSIQ, DBCNN, PIQI, and TReS-Koniq confirm that the outputs align with human visual preferences, a critical requirement for practical applications in mobile imaging, surveillance, and healthcare.
While LucentVisionNet achieves strong overall performance, several limitations should be noted. First, although the recurrent strategy refines enhancement quality, it introduces latency that may limit real-time video applications. Second, the framework does not employ an explicit denoising module. Instead, noise is mitigated implicitly through strong total variation regularization and perceptual loss terms, which suppress abrupt illumination jumps and penalize perceptual artifacts. This implicit handling is effective in most cases, but future work could explore lightweight noise-aware priors or adaptive denoising blocks to further improve performance in extremely underexposed scenarios. Finally, domain-specific settings such as underwater or infrared imaging remain unexplored, leaving room for specialized adaptation strategies.
Future work will focus on extending LucentVisionNet to ensure temporal consistency in video enhancement, investigating domain adaptation for specialized environments, and integrating the module into downstream pipelines such as low-light object detection and semantic segmentation to evaluate its broader impact on vision tasks.
Conclusion
This study presents LucentVisionNet, a novel zero-shot learning-based framework for low-light image enhancement that effectively integrates multi-scale curve estimation with spatial attention and perceptual-semantic guidance. Unlike conventional supervised and unsupervised methods, our approach operates without paired training data, thereby significantly improving generalization, adaptability, and real-world applicability.
The proposed framework leverages a multi-resolution architecture and a depthwise separable convolutional backbone, which reduces computational cost while maintaining high visual fidelity. Additionally, the incorporation of spatial attention and a recurrent enhancement strategy ensures both local detail preservation and global structural consistency. A composite objective function, comprising six tailored loss functions including a no-reference image quality metric, guides the model towards perceptually coherent and semantically faithful enhancements.
Extensive experiments conducted on both paired (LOL, LOL-v2) and unpaired datasets (DarkBDD, DICM, VV, NPE, etc.) confirm the superiority of our method across full-reference and no-reference IQA metrics. LucentVisionNet consistently outperforms state-of-the-art techniques, including Zero-DCE, EnlightenGAN, and semantic-guided ZSL, in terms of PSNR, SSIM, LPIPS, DISTS, and subjective perceptual scores.
The results not only demonstrate significant improvements in quantitative performance but also establish the practicality of our model for real-time applications, achieving high perceptual quality in diverse illumination conditions. The low computational overhead, coupled with high visual and semantic accuracy, makes LucentVisionNet an ideal candidate for deployment in resource-constrained scenarios such as mobile photography, surveillance, and autonomous driving.
Data availability
The data used in this research are publicly available for research and development purposes at the following links. The test sets VV, LIME, NPE, DICM, and MEF can be downloaded from https://github.com/baidut/BIMEF; the LOL dataset from https://www.kaggle.com/datasets/soumikrakshit/lol-dataset; and the LOL-v2 dataset from https://www.kaggle.com/datasets/tanhyml/lol-v2-dataset.
References
Liba, O. et al. Handheld mobile photography in very low light. ACM Trans. Graph. 38, 164–1 (2019).
Ahmed, N. & Asif, S. BIQ2021: A large-scale blind image quality assessment database. J. Electron. Imaging 31, 053010 (2022).
Guo, X., Li, Y. & Ling, H. Lime: Low-light image enhancement via illumination map estimation. IEEE Trans. Image Process. 26, 982–993 (2016).
Ahmed, N., Asif, H. M. S. & Khalid, H. Image quality assessment using a combination of hand-crafted and deep features. In International Conference on Intelligent Technologies and Applications, 593–605 (Springer, 2019).
Aslam, M. A. et al. Vrl-iqa: Visual representation learning for image quality assessment. IEEE Access 12, 2458–2473 (2023).
Ahmed, N., Shahzad Asif, H., Bhatti, A. R. & Khan, A. Deep ensembling for perceptual image quality assessment. Soft Comput. 26, 7601–7622 (2022).
Ahmed, N., Asif, H. M. S., Saleem, G. & Younus, M. U. Image quality assessment for foliar disease identification (agropath). J. Agric. Res. 59, 03681157 (2021).
Ahmed, N. & Asif, H. M. S. Ensembling convolutional neural networks for perceptual image quality assessment. In 2019 13th International Conference on Mathematics, Actuarial Science, Computer Science and Statistics (MACS), 1–5 (IEEE, 2019).
Ahmed, N., Asif, H. M. S. & Khalid, H. Piqi: perceptual image quality index based on ensemble of gaussian process regression. Multimed. Tools Appl. 80, 15677–15700 (2021).
Khalid, H., Ali, M. & Ahmed, N. Gaussian process-based feature-enriched blind image quality assessment. J. Vis. Commun. Image Represent. 77, 103092 (2021).
Saleem, G., Bajwa, U. I. & Raza, R. H. Toward human activity recognition: a survey. Neural Comput. Appl. 35, 4145–4182 (2023).
Saleem, G. et al. Efficient anomaly recognition using surveillance videos. PeerJ Comput. Sci. 8, e1117 (2022).
Aslam, M. A. et al. Tqp: An efficient video quality assessment framework for adaptive bitrate video streaming. IEEE Access (2024).
Tang, H. et al. M3net: multi-view encoding, matching, and fusion for few-shot fine-grained action recognition. In Proceedings of the 31st ACM International Conference on Multimedia, 1719–1728 (2023).
Zhang, H., Tang, H., Sun, Y., He, S. & Li, Z. Modality-specific interactive attack for vision-language pre-training models. IEEE Trans. Inf. Forensics Security (2025).
Tang, H., Li, Z., Zhang, D., He, S. & Tang, J. Divide-and-conquer: Confluent triple-flow network for rgb-t salient object detection. IEEE Trans. Pattern Anal. Mach. Intell. (2024).
Tao, Q., Ren, K., Feng, B. & Gao, X. An accurate low-light object detection method based on pyramid networks. In Optoelectronic Imaging and Multimedia Technology VII, Vol. 11550, 253–260 (SPIE, 2020).
Agrawal, A., Jadhav, N., Gaur, A., Jeswani, S. & Kshirsagar, A. Improving the accuracy of object detection in low light conditions using multiple retinex theory-based image enhancement algorithms. In 2022 Second International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), 1–5, https://doi.org/10.1109/ICAECT54875.2022.9808011 (2022).
Abdullah-Al-Wadud, M., Kabir, M. H., Dewan, M. A. A. & Chae, O. A dynamic histogram equalization for image contrast enhancement. IEEE Trans. Consum. Electron. 53, 593–600 (2007).
Ahmed, N., Ahmed, W. & Arshad, S. M. Digital radiographic image enhancement for improved visualization. In Proceedings COMSATS Institute of Information Technology (2011).
Reza, A. M. Realization of the contrast limited adaptive histogram equalization (clahe) for real-time image enhancement. J. VLSI Signal Process. Syst. Signal Image Video Technol. 38, 35–44 (2004).
Ibrahim, H. & Kong, N. S. P. Brightness preserving dynamic histogram equalization for image contrast enhancement. IEEE Trans. Consum. Electron. 53, 1752–1758 (2007).
Guan, X., Jian, S., Hongda, P., Zhiguo, Z. & Haibin, G. An image enhancement method based on gamma correction. In 2009 Second International Symposium on Computational Intelligence and Design, Vol. 1, 60–63 (IEEE, 2009).
Wu, X. A linear programming approach for optimal contrast-tone mapping. IEEE Trans. Image Process. 20, 1262–1272 (2010).
Hu, L., Chen, H. & Allebach, J. P. Joint multi-scale tone mapping and denoising for hdr image enhancement. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 729–738 (2022).
Tseng, C.-C. & Lee, S.-L. A weak-illumination image enhancement method using homomorphic filter and image fusion. In 2017 IEEE 6th Global Conference on Consumer Electronics (GCCE), 1–2 (IEEE, 2017).
Yamakawa, M. & Sugita, Y. Image enhancement using retinex and image fusion techniques. Electron. Commun. Jpn. 101, 52–63 (2018).
Pei, L., Zhao, Y. & Luo, H. Application of wavelet-based image fusion in image enhancement. In 2010 3rd International Congress on Image and Signal Processing, Vol. 2, 649–653 (IEEE, 2010).
Wang, W. & Chang, F. A multi-focus image fusion method based on Laplacian pyramid. J. Comput. 6, 2559–2566 (2011).
Wang, S., Zheng, J., Hu, H.-M. & Li, B. Naturalness preserved enhancement algorithm for non-uniform illumination images. IEEE Trans. Image Process. 22, 3538–3548 (2013).
Zotin, A. Fast algorithm of image enhancement based on multi-scale retinex. Procedia Comput. Sci. 131, 6–14 (2018).
Song, X., Zhou, Z., Guo, H., Zhao, X. & Zhang, H. Adaptive retinex algorithm based on genetic algorithm and human visual system. In 2016 8th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), Vol. 1, 183–186 (IEEE, 2016).
Du, H., Wei, Y. & Tang, B. Rranet: low-light image enhancement based on retinex theory and residual attention. In Third International Conference on Artificial Intelligence and Computer Engineering (ICAICE 2022), Vol. 12610, 406–414 (SPIE, 2023).
Ma, J., Fan, X., Ni, J., Zhu, X. & Xiong, C. Multi-scale retinex with color restoration image enhancement based on gaussian filtering and guided filtering. Int. J. Mod. Phys. B 31, 1744077 (2017).
Li, C. et al. Low-light image and video enhancement using deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 44, 9396–9416 (2021).
Guo, X. & Hu, Q. Low-light image enhancement via breaking down the darkness. Int. J. Comput. Vis. 131, 48–66 (2023).
Zhang, Y., Liu, H. & Ding, D. A cross-scale framework for low-light image enhancement using spatial-spectral information. Comput. Electr. Eng. 106, 108608 (2023).
Zhang, Y. et al. Simplifying low-light image enhancement networks with relative loss functions. arXiv preprint arXiv:2304.02978 (2023).
Liu, X., Ma, W., Ma, X. & Wang, J. Lae-net: A locally-adaptive embedding network for low-light image enhancement. Pattern Recogn. 133, 109039 (2023).
Lv, F., Li, Y. & Lu, F. Attention guided low-light image enhancement with a large scale low-light simulation dataset. Int. J. Comput. Vis. 129, 2175–2193 (2021).
Loh, Y. P. & Chan, C. S. Getting to know low-light images with the exclusively dark dataset. Comput. Vis. Image Underst. 178, 30–42 (2019).
Wu, W., Wang, W., Jiang, K., Xu, X. & Hu, R. Self-supervised learning on a lightweight low-light image enhancement model with curve refinement. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1890–1894 (IEEE, 2022).
Huang, Y. et al. Low-light image enhancement by learning contrastive representations in spatial and frequency domains. In 2023 IEEE International Conference on Multimedia and Expo (ICME), 1307–1312 (IEEE, 2023).
Wang, R. et al. Seeing dynamic scene in the dark: A high-quality video dataset with mechatronic alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9700–9709 (2021).
Fu, Y., Wang, Z., Zhang, T. & Zhang, J. Low-light raw video denoising with a high-quality realistic motion dataset. IEEE Trans. Multimedia 25, 8119–8131 (2022).
Song, W. et al. Matching in the dark: A dataset for matching image pairs of low-light scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6029–6038 (2021).
Zhang, Y., Zhang, J. & Guo, X. Kindling the darkness: A practical low-light image enhancer. In Proceedings of the 27th ACM International Conference on Multimedia, 1632–1640 (2019).
Chen, C., Chen, Q., Xu, J. & Koltun, V. Learning to see in the dark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3291–3300 (2018).
Fu, Q., Di, X. & Zhang, Y. Learning an adaptive model for extreme low-light raw image processing. IET Image Proc. 14, 3433–3443 (2020).
Wei, C., Wang, W., Yang, W. & Liu, J. Deep retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560 (2018).
Xiang, S., Wang, Y., Deng, H., Wu, J. & Yu, L. Zero-shot learning for low-light image enhancement based on dual iteration. J. Electron. Inf. Technol. 44, 3379–3388 (2022).
Guo, C. et al. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1780–1789 (2020).
Li, C., Guo, C. G. & Loy, C. C. Learning to enhance low-light image via zero-reference deep curve estimation. IEEE Trans. Pattern Anal. Mach. Intell. https://doi.org/10.1109/TPAMI.2021.3063604 (2021).
Zheng, S. & Gupta, G. Semantic-guided zero-shot learning for low-light image/video enhancement. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 581–590 (2022).
Ke, J., Wang, Q., Wang, Y., Milanfar, P. & Yang, F. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 5148–5157 (2021).
Jiang, Y. et al. Enlightengan: Deep light enhancement without paired supervision. IEEE Trans. Image Process. 30, 2340–2349 (2021).
Ma, L., Ma, T., Liu, R., Fan, X. & Luo, Z. Toward fast, flexible, and robust low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5637–5646 (2022).
Zhang, Y. et al. Self-supervised low light image enhancement and denoising. arXiv preprint arXiv:2103.00832 (2021).
Zhang, Y., Di, X., Zhang, B., Ji, R. & Wang, C. Better than reference in low-light image enhancement: conditional re-enhancement network. IEEE Trans. Image Process. 31, 759–772 (2021).
Zhang, Y., Guo, X., Ma, J., Liu, W. & Zhang, J. Beyond brightening low-light images. Int. J. Comput. Vision 129, 1013–1037 (2021).
Tu, Z., Talebi, H., Zhang, H. & Milanfar, P. Maxim: Multi-axis mlp for image processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5769–5780 (2022).
Xiong, W., Liu, D., Shen, X., Fang, C. & Luo, J. Unsupervised low-light image enhancement with decoupled networks. In 2022 26th International Conference on Pattern Recognition (ICPR), 457–463 (IEEE, 2022).
Zamir, S. W. et al. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5718–5729, https://doi.org/10.1109/CVPR52688.2022.00564 (2022).
Li, C., Guo, C., Zhou, S. & Loy, C. C. Retinexformer: One-stage retinex-based transformer for low-light image enhancement. IEEE Trans. Pattern Anal. Mach. Intell. https://doi.org/10.1109/TPAMI.2023.3243853 (2023).
Wang, H., Chen, Z., Xu, J. & Shao, L. Llformer: Enhancing low-light images in the frequency domain. In European Conference on Computer Vision (ECCV), 35–51, https://doi.org/10.1007/978-3-031-20080-9_3 (2022).
Zhang, Y., Xu, Z. & Li, C. Darkir: Transformer-based low-light image restoration via illumination and reflectance. arXiv preprint arXiv:2304.01234 (2023).
Chen, R., Wang, Y. & Zhao, X. Lyt-net: Lightweight transformer network for real-world low-light image enhancement. IEEE Trans. Image Process. (2024).
Hu, T., Liu, X. & Ma, L. Meformer: Multi-expert transformer for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024).
Howard, A. G. et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications. In arXiv preprint arXiv:1704.04861 (2017).
Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1251–1258 (2017).
Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), 3–19 (2018).
Zamir, S. W. et al. Learning enriched features for real image restoration and enhancement. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16, 492–511 (Springer, 2020).
Tang, H., Yuan, C., Li, Z. & Tang, J. Learning attention-guided pyramidal features for few-shot fine-grained recognition. Pattern Recogn. 130, 108792 (2022).
Murray, N., Marchesotti, L. & Perronnin, F. Ava: A large-scale database for aesthetic visual analysis. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2408–2415 (IEEE, 2012).
Cai, J., Gu, S. & Zhang, L. Learning a deep single image contrast enhancer from multi-exposure images. IEEE Trans. Image Process. 27, 2049–2062 (2018).
Yu, F. et al. Bdd100k: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687 (2020).
Cordts, M. et al. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3213–3223 (2016).
Lee, C. & Kim, C.-S. Contrast enhancement based on layered difference representation of 2d histograms. IEEE Trans. Image Process. 22, 5372–5384 (2013).
Guo, X., Li, Y. & Ling, H. Lime: Low-light image enhancement via illumination map estimation. arXiv preprint arXiv:1605.09782 (2016).
Wei, C., Wang, W., Yang, W. & Liu, J. Deep retinex decomposition for low-light enhancement. In British Machine Vision Conference (BMVC) (2018).
Yang, W., Wang, S., Fang, Y., Wang, Y. & Liu, J. From fidelity to perceptual quality: A semi-supervised approach for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3066–3075 (2021).
Ma, K., Zeng, K. & Wang, Z. Perceptual quality assessment for multi-exposure image fusion. IEEE Trans. Image Process. 24, 3345–3356 (2015).
Wang, Y., Wang, Q. & Liao, Q. Naturalness preserved enhancement algorithm for non-uniform illumination images. IEEE Trans. Image Process. 22, 3538–3548 (2013).
Jinda-Apiraksa, A., Vonikakis, V. & Winkler, S. California-nd: An annotated dataset for near-duplicate detection in personal photo collections. In 2013 Fifth International Workshop on Quality of Multimedia Experience (QoMEX), 142–147 (IEEE, 2013).
Wang, Z., Bovik, A. C., Sheikh, H. R. & Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13, 600–612 (2004).
Hore, A. & Ziou, D. Image quality metrics: Psnr vs. ssim. In 2010 20th International Conference on Pattern Recognition, 2366–2369 (IEEE, 2010).
Zhang, L., Zhang, L., Mou, X. & Zhang, D. Fsim: A feature similarity index for image quality assessment. IEEE Trans. Image Process. 20, 2378–2386 (2011).
Zhang, L., Shen, Y. & Li, H. Vsi: A visual saliency-induced index for perceptual image quality assessment. IEEE Trans. Image Process. 23, 4270–4281 (2014).
Zhang, R., Isola, P., Efros, A. A., Shechtman, E. & Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 586–595 (2018).
Ding, K., Ma, K., Wang, S. & Simoncelli, E. P. Image quality assessment: Unifying structure and texture similarity. IEEE Trans. Pattern Anal. Mach. Intell. (2020, early access).
Talebi, H. & Milanfar, P. NIMA: Neural image assessment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5168–5177 (2018).
Ying, X. et al. PaQ-2-PiQ: Attribute-aware learning for blind image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3560–3569 (2020).
Zhang, L., Li, H., Fu, X., Xiong, S. & Dong, W. DBCNN: A dual branch convolutional neural network for no-reference image quality assessment. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 5660–5669, https://doi.org/10.1109/ICCV.2018.00593 (2018).
Yang, H., Zhu, P., Wang, Z., Min, X. & Mou, X. MANIQA: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11432–11441, https://doi.org/10.1109/CVPR52688.2022.01115 (2022).
Wang, Z., Lin, R., Lu, X. & Wang, Z. CLIP-IQA: Clip-based image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3565–3574, https://doi.org/10.1109/CVPR52729.2023.00328 (2023).
Lin, S., Wang, Q., Jiang, J. & Ma, J. Tres: A transformer relation network for no-reference image quality assessment. In European Conference on Computer Vision (ECCV), vol. 13671 of Lecture Notes in Computer Science, 267–284, https://doi.org/10.1007/978-3-031-19800-2_16 (Springer, 2022). We use the KONIQ-fine-tuned model.
Su, M. Y., Zeng, D., Hong, Z., Ouyang, W. & Yu, X. Blindly assess image quality in the wild leveraging an uncertainty-aware HyperNet. IEEE Trans. Image Process. 29, 5035–5048. https://doi.org/10.1109/TIP.2020.2985256 (2020).
Aslam, M. A. et al. Qualitynet: A multi-stream fusion framework with spatial and channel attention for blind image quality assessment. Sci. Rep. 14, 26039 (2024).
Mittal, A., Moorthy, A. K. & Bovik, A. C. No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 21, 4695–4708 (2012).
Zhang, W., Chen, C., Li, C. & Ma, K. Blind image quality assessment using a score distribution prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16758–16767 (2021).
Ahmed, N. & Asif, H. M. S. Perceptual quality assessment of digital images using deep features. Comput. Inform. 39, 385–409 (2020).
Ying, Z., Li, G. & Gao, W. A bio-inspired multi-exposure fusion framework for low-light image enhancement. arXiv preprint arXiv:1711.00591 (2017).
Fu, X. et al. A fusion-based enhancing method for weakly illuminated images. Signal Process. 129, 82–96 (2016).
Petro, A. B., Sbert, C. & Morel, J.-M. Multiscale retinex. Image Processing On Line, 71–88 (2014).
Acknowledgements
All authors thank the School of Information Engineering, Xi’an Eurasia University, Xi’an, Shaanxi, China, for their financial support and funding.
Funding
Funding was provided by the School of Information Engineering, Xi’an Eurasia University, Xi’an, Shaanxi, China.
Author information
Contributions
H.K. conceived the research idea, developed the methodology, conducted the experiments, analyzed the data, and led the drafting of the manuscript. N.A. contributed to the development of the methodology, assisted in data analysis, and supported manuscript writing. M.A.A. assisted in interpreting the results and contributed to manuscript review and editing. All authors reviewed and approved the final version of the manuscript for submission.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.