Abstract
Images captured in low-light conditions often suffer from poor visibility and noise corruption. Low-light image enhancement (LLIE) aims to restore the brightness of under-exposed images. However, most previous LLIE solutions enhance low-light images via global mapping without considering the various degradations of dark regions. Besides, these methods rely on convolutional neural networks for training, which have limitations in capturing long-range dependencies. To this end, we construct a hybrid framework dubbed HybLLIE that combines transformer and convolutional designs for the LLIE task. Firstly, we propose a light-aware transformer (LAFormer) block that utilizes brightness representations to direct the modeling of valuable information in low-light regions. It is achieved by utilizing a learnable feature reassignment modulator to encourage inter-channel feature competition. Secondly, we introduce a SeqNeXt block to capture the local context, which is a ConvNet-based model to process sequences of image patches. Thirdly, we devise an efficient self-supervised mechanism to eliminate inappropriate features from the given under-exposed samples and employ high-order curves to brighten the low-light images. Extensive experiments demonstrate that our HybLLIE achieves comparable performance to 17 state-of-the-art methods on 7 representative datasets.
Introduction
Images captured in low-light conditions often suffer from reduced visibility and various degradations (e.g., noise corruption, unclear details, and color deviation). Therefore, low-light images are unsatisfactory for information transfer as they pose challenges to human perception and subsequent high-level computer vision tasks1,2,3. To uncover the contents of images in low-light conditions, great efforts have been devoted to low-light image enhancement (LLIE)4,5,6. Histogram Equalization7 and Retinex theory8,9 are two representative traditional LLIE methods. The former adjusts the histogram by redistributing the most common intensity values to enhance the limited dynamic range. The latter assumes that an image can be decomposed into reflectance and illumination components. However, these traditional methods rely on careful parameter tuning to achieve ideal results, which often produces enhanced samples that are perceptually inconsistent with real-world images captured under normal lighting conditions.
With the advancement of computing power and the availability of vast amounts of data, learning-based methods have been introduced to the LLIE task. These methods can be categorized into supervised learning-based approaches and unsupervised learning-based techniques. Supervised learning-based approaches rely on low/normal-light image pairs for training, which are expensive and time-consuming to collect due to the challenges in obtaining high-quality reference images. On the other hand, unsupervised learning techniques leverage manual priors and intricate loss functions to bypass the need for normal-light images. However, most unsupervised learning methods10,11,12,13,14 employ Convolutional Neural Networks (CNNs) for training. Their local receptive fields limit the ability to model long-range dependencies, resulting in suboptimal enhancement effects. Although transformer-based models have shown promise in perceiving non-local features, directly applying original vision transformers to LLIE is not optimal: they are computationally expensive and do not account for the illumination representation in low-light images, which makes it difficult to address blurred details. Consequently, existing methods often struggle to balance efficiency, detail preservation, and natural enhancement results.
To address these challenges, we present HybLLIE, a hybrid architecture that combines transformer and convolutional operators. In HybLLIE, we design a light-aware transformer (LAFormer) block to capture more valuable information. The key component of LAFormer is a learnable feature reassignment modulator (FRM). It encourages inter-channel feature competition, which is formulated as a deep adaptive bias to enhance image quality. To leverage the strengths of U-shaped architecture and the LAFormer block, we take the proposed LAFormer blocks as the decoder. The reason is that a U-shaped architecture comprised entirely of transformer blocks would primarily focus on capturing contextual features at all stages, resulting in lower feature resolution and limited local spatial information. Therefore, we propose a SeqNeXt block and use it as the encoder. Unlike the prior ConvNeXt block15, our SeqNeXt block can be directly applied to sequences of image patches and utilizes a smaller convolution kernel for modeling. Furthermore, we introduce high-order curves (HC) to improve the brightness of low-light images. HC allows pixel-wise adjustments in dynamic range while preserving the contrast of neighboring pixels.
Compared with the existing curve estimation-based LLIE methods14,16,17, we go a step further to explore the properties of the curve parameters instead of utilizing them directly to enhance low-light images. Specifically, we adopt an efficient self-supervised mechanism to filter out inappropriate features from the input instance and employ the optimized image to participate in the curve adjustment. This avoids introducing additional noise into the image during curve iterations. Benefiting from the above designs, our method achieves better results, allowing it to suppress noise and preserve detailed information while restoring image brightness. An example of enhancing a low-light image in a complex scene is shown in Fig. 1.
Overall, we summarize the contributions of this paper as follows:
- A Transformer-CNN hybrid network, HybLLIE, is proposed for enhancing low-light images, which can be trained end-to-end.
- The novel LAFormer and SeqNeXt blocks are designed to improve feature extraction capabilities.
- A learnable feature reassignment modulator is adopted to encourage inter-channel feature competition, which is formulated as a deep adaptive bias to adjust multi-scale features and improve image quality.
- A self-supervised mechanism is applied to remove inappropriate features for brightness adjustment.
Related work
Traditional methods for enhancing low-light images
Early methods for enhancing low-light images are based on histogram equalization7,18 and Retinex theory8,19,20. These approaches require manual priors to enhance the visibility of low-light images. However, the resulting enhanced images often exhibit perceptual inconsistencies compared to normal-light images.
Deep learning-based low-light image enhancement methods
Deep learning-based methods4,21,22,23,24 have become the dominant approaches in low-light image enhancement with tremendous potential. Wei et al.25 propose an end-to-end Retinex network that decomposes images into reflectance and illumination components, effectively enhancing the images. Lv et al.26 utilize a multi-branch network to fuse features from multiple modules, enhancing the quality of the resulting images. Guo et al.14 present a neural network that fits a high-order curve and uses it to adjust input images on a pixel-wise basis. Ma et al.27 introduce a self-calibrated illumination framework that utilizes unsupervised learning to restore brightness in underexposed regions of images. Furthermore, several studies employ models based on U-Net or U-Net-like architectures to simplify the encoder-decoder structures10,13,14, which effectively improves the enhanced image quality. Nevertheless, these methods may introduce noticeable noise and color deviations in enhanced images.
U-shaped transformer-based framework
Transformer28 was originally proposed in the field of natural language processing and achieved remarkable success. In the following years, Dosovitskiy et al. introduced ViT29, which divided images into different image patches for training and achieved state-of-the-art results in various computer vision tasks. Recently, Yu et al. abstracted a general MetaFormer architecture30 from transformers, which does not specify the token mixer. They demonstrated that the power of transformers primarily originated from this general architecture rather than the design of self-attention mechanisms in the token mixer. Additionally, many researchers have built transformer blocks upon the elegant architecture U-Net. For instance, Chen et al.31 incorporated the transformer as part of the encoder and mitigated the resolution loss by introducing convolutional layers at the beginning of the encoder. Wang et al.32 introduced the LeWin Transformer, a window-based self-attention with reduced computational complexity, and designed a U-Net entirely composed of LeWin Transformers for image restoration tasks. In general, researchers balance network performance and computational complexity when designing U-shaped architectures with transformers by modifying the self-attention mechanism or introducing convolutional layers. Our HybLLIE applies the hybrid structure to build multi-scale features while using the proposed LAFormer block and SeqNeXt block, which can combine the benefits of the transformer and convolutional designs.
Method
Overall pipeline
As shown in Fig. 2a, the overall structure of the proposed HybLLIE is a U-shaped hierarchical network with skip connections between the encoder and the decoder. Given a low-light image \(I\in \mathbb {R} ^{H\times W\times 3}\), HybLLIE first applies a patch partition and a linear embedding layer to split it into non-overlapping patches with a patch size of \(4\times 4\) and projects the feature dimension to an arbitrary C (we set C to 32). At this stage, the given input is downsampled by \(4\times\). Next, following the design of the U-Net, the transformed features pass through three encoder stages and patch merging layers. Each stage contains a stack of the proposed SeqNeXt blocks, which take advantage of depthwise convolution for extracting local context and spatial features. In each patch merging layer, the feature resolution is downsampled by \(2\times\). For example, given an input feature map, the k-th stage produces features with dimension \(2^{k-1}\times C\) and resolution \(\frac{H}{2^{k+1} } \times \frac{W}{2^{k+1}} (k=1, 2, 3)\). Then, a bottleneck stage with a stack of SeqNeXt blocks follows. In this stage, the hierarchical structure allows SeqNeXt blocks to capture longer dependencies, facilitated by the close proximity between the convolutional kernel size and the patch size. For feature reconstruction, the decoder contains three stages and patch extension layers. Each stage consists of a stack of LAFormer blocks, similar to the encoder stages. In contrast to the patch merging layer, the patch extension layer performs upsampling. Take the first one as an example. Before upsampling, a linear layer is applied to the input features (\(\frac{H}{32} \times \frac{W}{32} \times 8C\)) to increase the feature dimension to \(2\times\) the original dimension (\(\frac{H}{32} \times \frac{W}{32} \times 16C\)). Then, we use the rearrange operation to expand the resolution of the input features to \(2\times\) the input resolution and reduce the feature dimension to a quarter of the input dimension (\(\frac{H}{32} \times \frac{W}{32} \times 16C\rightarrow \frac{H}{16} \times \frac{W}{16}\times 4C\)). The last patch extension layer performs \(4\times\) upsampling to restore the resolution to the input resolution (\(H\times W\)). Finally, we design a projection block to remove the inappropriate features from the low-light image and utilize the high-order curves to obtain the enhanced image.
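To make the dimension bookkeeping of the patch extension layer concrete, the following is a minimal PyTorch sketch of such a layer. It is an illustration under stated assumptions rather than the authors' implementation: the module name PatchExpand, the bias-free linear layer, the use of einops for the rearrange operation, and the trailing LayerNorm are our choices.

```python
import torch
import torch.nn as nn
from einops import rearrange


class PatchExpand(nn.Module):
    """Hypothetical patch extension layer: double the feature dimension with a
    linear layer, then rearrange to double the resolution and quarter the channels."""

    def __init__(self, dim):
        super().__init__()
        self.expand = nn.Linear(dim, 2 * dim, bias=False)  # e.g. 8C -> 16C at the first decoder stage
        self.norm = nn.LayerNorm(dim // 2)                 # channels after rearrange: 2*dim / 4

    def forward(self, x):
        # x: (B, H/32, W/32, 8C) at the first decoder stage
        x = self.expand(x)                                 # (B, H/32, W/32, 16C)
        # spread groups of 4 channels over a 2x2 spatial neighborhood
        x = rearrange(x, 'b h w (p1 p2 c) -> b (h p1) (w p2) c', p1=2, p2=2)
        return self.norm(x)                                # (B, H/16, W/16, 4C)


if __name__ == "__main__":
    feat = torch.randn(1, 7, 7, 256)       # 8C with C = 32, a 224 x 224 input at 1/32 resolution
    print(PatchExpand(256)(feat).shape)    # torch.Size([1, 14, 14, 128]), i.e. 4C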
Basic components
LAFormer and SeqNeXt
The standard transformer architecture computes self-attention globally between all patch tokens, which leads to a quadratic computation cost relative to the number of patch tokens29. This makes it unsuitable to stack such transformers in a network and apply global self-attention to high-resolution feature maps. Therefore, we propose a LAFormer block, as shown in Fig. 2b. Specifically, given an input image I, it first undergoes patch partitioning and a linear embedding layer, producing tokens \(I_{k}\in \mathbb {R}^{L\times C }\) with sequence length L. These tokens are then processed through a series of repeated blocks, each consisting of two residual sub-blocks. The first sub-block mainly contains a feature reassignment modulator (FRM) to capture brightness features. This sub-block can be expressed as,
where \(LN(\cdot )\) denotes Layer Normalization and \(FRM(\cdot )\) denotes the feature reassignment modulator.
The second sub-block primarily consists of a two-layered MLP with non-linear activation,
where \(W_{1} \in \mathbb {R } ^{C\times rC}\) and \(W_{2} \in \mathbb {R} ^{rC\times C}\) are learnable parameters with MLP expansion ratio r, \(\sigma (\cdot )\) is an activation function.
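Since the two sub-block equations are rendered as display images in the source, the structure they describe can be sketched as follows: a pre-norm residual sub-block around the FRM, followed by a pre-norm residual sub-block around the two-layer MLP \(W_{1}\), \(\sigma\), \(W_{2}\). The exact equation forms and the placeholder token-mixer argument are assumptions based on the surrounding text; GELU is used as \(\sigma\) following the ablation study.

```python
import torch.nn as nn


class LAFormerBlock(nn.Module):
    """Sketch of a LAFormer block: two pre-norm residual sub-blocks, a feature
    reassignment modulator (FRM) followed by a two-layer MLP. The precise
    equations are assumed from the textual description."""

    def __init__(self, dim, mlp_ratio=4, frm=None):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.frm = frm if frm is not None else nn.Identity()  # pass in the FRM sketched below
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(          # W1 (C -> rC), sigma, W2 (rC -> C)
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                  # x: (B, L, C) token sequence
        x = x + self.frm(self.norm1(x))    # first sub-block: brightness-aware token mixing
        x = x + self.mlp(self.norm2(x))    # second sub-block: channel MLP
        return x
```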
To enable the proposed network to capture more useful information, we draw inspiration from ConvNeXt15 and modify the LAFormer architecture. Specifically, we replace the FRM and the first LayerNorm layer in LAFormer with a \(3\times 3\) depthwise convolution. Additionally, we combine the two residual connections, resulting in a new module called SeqNeXt, as shown in Fig. 2c. This design helps to further improve the performance of the model.
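A corresponding sketch of the SeqNeXt block is given below, assuming the token sequence is reshaped to its 2D grid for the \(3\times 3\) depthwise convolution and flattened back afterwards; the single merged residual follows the description above.

```python
import torch.nn as nn


class SeqNeXtBlock(nn.Module):
    """Sketch of a SeqNeXt block: the FRM and its LayerNorm are replaced by a 3x3
    depthwise convolution, and the two residual connections are merged into one.
    Reshaping the token sequence to a spatial grid for the convolution is assumed."""

    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x, hw):                      # x: (B, L, C), hw: (H, W) with L = H * W
        b, l, c = x.shape
        h, w = hw
        y = x.transpose(1, 2).reshape(b, c, h, w)  # tokens -> spatial grid
        y = self.dwconv(y).reshape(b, c, l).transpose(1, 2)
        y = self.mlp(self.norm(y))
        return x + y                               # single merged residual connection
```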
Feature reassignment modulator
The key component of the LAFormer block is a feature reassignment modulator (FRM). Different from the previous self-attention mechanism, FRM exploits the illumination representations to constrain the interactions between different dark regions. At first, we employ a max-pooling operator \(MaxPool(\cdot )\) to extract the maximum brightness value and utilize a residual connection to compensate for the feature loss caused by pooling operations, which can be formulated as follows,
Then, we employ a norm-based aggregation operator to perceive information and find that the L2-norm yields better performance. This gives us a set of aggregated values \(Norm(T) = \left\{ \left\| t_{1} \right\| , \left\| t_{2} \right\| ,...,\left\| t_{L} \right\| \right\} \in \mathbb {R}^{L\times 1}\), where \(Norm\left( \cdot \right)\) denotes the L2-norm and each \(\left\| t_{i} \right\|\) is a scalar that aggregates the channel features of the i-th token. In other words, after applying the L2-norm operation along the channel dimension, the resulting dimension is reduced to 1, while the other dimensions remain unchanged. This means that for each of the L positions, there is a scalar value representing the aggregated channel feature at that position. Next, we apply a divisive normalization function \(N\left( \cdot \right)\) to these scalars as follows,
Equation (4) computes the relative importance of \(\left\| t_{i} \right\|\) compared to all tokens, and it can be applied to all aggregated values: \(C_{T} = N (Norm (T))\). For calibrating the input responses, we use pixel-wise multiplication \(T*C_{T}\).
In addition, different images have their own distinctive degradations, such as unclear details and color deviations, to be handled or restored. To further boost the capability of FRM for dealing with different situations, we introduce a learnable calibration factor \(\gamma\) to improve the image quality. \(\gamma\) is a trainable tensor that is updated through backpropagation. To preserve the inherent features, we design a residual connection between the input and output of the FRM and add an additional parameter \(\beta\) for optimization. For more configuration attempts for the FRM, refer to Table 1. The resulting final FRM can be formulated as follows,
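Because Eqs. (3)–(5) appear only as display images in the source, the following sketch reconstructs the FRM from the textual description. The pooling layout (1D max-pooling over the token sequence; pooling over the 2D token grid is an equally plausible reading), the mean-based form of the divisive normalization \(N(\cdot )\), and the exact placement of \(\gamma\) and \(\beta\) are assumptions.

```python
import torch
import torch.nn as nn


class FRM(nn.Module):
    """Sketch of the feature reassignment modulator under stated assumptions."""

    def __init__(self, dim, pool_size=5, eps=1e-6):
        super().__init__()
        # stride-1 max-pooling (default size 5, as in the ablation study)
        self.pool = nn.MaxPool1d(pool_size, stride=1, padding=pool_size // 2)
        self.gamma = nn.Parameter(torch.ones(dim))  # learnable calibration factor
        self.beta = nn.Parameter(torch.ones(dim))   # learnable residual weighting (assumed form)
        self.eps = eps

    def forward(self, x):                           # x: (B, L, C) token sequence
        # max-pooled brightness plus a residual to compensate for the pooling loss
        t = self.pool(x.transpose(1, 2)).transpose(1, 2) + x
        # L2-norm over channels: one aggregated scalar per token, Norm(T) in R^{L x 1}
        agg = t.norm(p=2, dim=-1, keepdim=True)
        # divisive normalization N(.): importance of each token relative to all tokens
        c_t = agg / (agg.mean(dim=1, keepdim=True) + self.eps)
        # calibrate the responses and preserve inherent features via the input residual
        return self.gamma * (t * c_t) + self.beta * x
```

Together with the earlier LAFormerBlock sketch, FRM(dim) can be passed as the frm argument to reproduce the first residual sub-block.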
As shown in Fig. 3, we demonstrate the effectiveness of the FRM in two aspects: exposure control and texture detail restoration. With the FRM, our models show sharper details and more natural colors.
Curve adjustment
Lightness curve
Inspired by image processing software such as Photoshop, we introduce a lightness curve to improve the brightness of images. The curve can be expressed as,
where x is the pixel coordinate, \(I\left( x \right)\) denotes the given input, and \(\alpha \in \left[ -1,1 \right]\) is a learnable parameter that controls the curvature range. Guo et al.14 suggest that the optimal curves for enhancing low-light images are often of very high order and can be applied iteratively to adjust the brightness. Thus, Eq. (6) can be reformulated as,
where n is the number of iterations; we set n to 4, which can enhance most low-light scenarios. As shown in Fig. 4, the curve has two characteristics. Firstly, it is monotonic, which preserves the contrast of neighboring pixels. Secondly, the curve is differentiable, which is a necessary condition for backpropagation in neural networks.
We normalize the pixels to the range of [0, 1]. The horizontal axis represents input pixel values, while the vertical axis represents output pixel values. (a) The light enhancement curves. (b,c) Curves obtained after 2 and 3 iterations. (d) Curves after 4 iterations, where \(\alpha\) for the first three iterations is 0.8, 0.7, and 0.7, respectively.
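Equations (6) and (7) are likewise rendered as images; assuming they follow the quadratic curve of Zero-DCE14 that the text builds on, the iterative adjustment can be sketched as below with per-pixel \(\alpha\) maps and n = 4 iterations.

```python
import torch


def curve_adjust(img, alphas):
    """Iterative high-order curve adjustment, assuming the Zero-DCE-style quadratic
    curve LE(I) = I + alpha * I * (1 - I) with pixels normalized to [0, 1]."""
    x = img                               # (B, 3, H, W)
    for alpha in alphas:                  # n = 4 alpha maps, each broadcastable to x
        x = x + alpha * x * (1.0 - x)     # monotonic and differentiable in x for alpha in [-1, 1]
    return x


if __name__ == "__main__":
    low = torch.rand(1, 3, 224, 224) * 0.3                 # a dim input
    alphas = torch.full((4, 1, 3, 224, 224), 0.7)          # constant alpha = 0.7 for illustration
    print(curve_adjust(low, alphas).mean() > low.mean())   # brightness increases: tensor(True)
```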
Projection block
Compared with previous curve estimation-based LLIE methods14,16,17, which directly employ the estimated parameters to enhance low-light images, our HybLLIE introduces a projection block to remove redundant and inappropriate features. Specifically, we propose removing inappropriate features to ensure a more reasonable enhancement of low-light images, rather than directly using \(\alpha\) for curve adjustment. To this end, we design a projection block consisting of a convolutional layer with a Tanh activation function and incorporate a derivative operator to refine feature selection. This process can be expressed as:
where \(Proj( \cdot )\) denotes the projection block and \({I_f}(x)\) is the filtered image. We provide an example of the filtered image and the corresponding different features in Fig. 5. One can see that the filtered image closely resembles the low-light sample and exhibits clearer details with noise removed. To assist curve adjustment, we combine \({I_f}(x)\) with the output of the framework through element-wise addition, and then split the result along the channel dimension to derive the \(\alpha\) values for curve adjustment. Furthermore, we devise a projection loss \(L_{pro}\) that ensures the filtered image \(I_{f}(x)\) remains close to the original input image I(x), which can be expressed as \({L_{pro}} = \left| {{I_f}(x) - I(x)} \right|\). We keep the other loss functions the same as Zero-DCE14.
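A sketch of the projection block and the projection loss under stated assumptions is shown below: the kernel size of the convolution is not specified in the text, the derivative operator is omitted for brevity, and \(L_{pro}\) is taken as a mean absolute (L1) difference.

```python
import torch
import torch.nn as nn


class ProjectionBlock(nn.Module):
    """Sketch of the projection block: a convolution followed by Tanh that yields
    the filtered image I_f(x). The derivative-based refinement mentioned in the
    text is not detailed and is omitted here; the 3x3 kernel is an assumption."""

    def __init__(self, channels=3):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def forward(self, img):        # img: low-light input I(x), shape (B, 3, H, W)
        return self.proj(img)      # filtered image I_f(x)


def projection_loss(i_f, img):
    """L_pro = |I_f(x) - I(x)|: keeps the filtered image close to the original input."""
    return (i_f - img).abs().mean()
```

The \(\alpha\) maps for the curve adjustment are then obtained, as described above, by adding \(I_{f}(x)\) to the framework output and splitting the result along the channel dimension.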
Experiments
This section begins with a discussion of the experimental configuration. Then, we verify the generalization ability of HybLLIE in seven datasets under a variety of low-light conditions. Finally, we conduct ablation experiments to evaluate the proposed HybLLIE components.
Qualitative comparisons on the paired LOL dataset25. The yellow and gray boxes indicate obvious differences.
Experimental setup
To maximize the adjustment capability of the curve adjustment over a wide dynamic range, our training dataset includes both underexposed and overexposed images. Specifically, we select 2002 images of varying exposure levels from the Part1 subset of the SICE dataset33 and resize them to \(224\times 224\) pixels for training. Our experiments are conducted using the PyTorch framework on a single NVIDIA GTX 1080Ti GPU. We utilize the Adam optimizer with \(\beta _{1} = 0.9\) and \(\beta _{2} = 0.999\), and a learning rate of \(1\times 10^{-4}\). The batch size is set to 8.
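For reference, the reported optimizer settings correspond to a standard PyTorch configuration such as the following; the stand-in module and random batch are placeholders, not the authors' training code.

```python
import torch
import torch.nn as nn

# Placeholder standing in for the HybLLIE network (illustration only).
model = nn.Conv2d(3, 3, kernel_size=3, padding=1)

# Optimizer settings reported in the experimental setup.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

# One training batch: 8 images resized to 224 x 224 (drawn from the SICE Part1 subset).
batch = torch.rand(8, 3, 224, 224)
```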
Comparison with state-of-the-arts
Our HybLLIE is compared with 17 state-of-the-art (SOTA) LLIE methods. These methods include two traditional approaches (LIME18 and Brain8), four supervised learning-based methods (RetinexNet25, MBLLEN26, KinD34, and URetinex35), and eleven unsupervised learning-based methods (LE-GAN36, RUAS11, EnGAN10, SSIENet13, Zero-DCE14, Zero-DCE++16, SCI27, ReLLIE17, UNIE37, PairLIE12, and SSIENetV238). The results are reproduced using publicly available official source codes and the recommended parameters.
Quantitative and qualitative analysis
To measure the color, structure, and high-level feature similarity of different images, we test our method on the Part2 subset of the SICE dataset33, which consists of 229 multiple-exposure sequences and their corresponding reference images. Besides, for a more convincing comparison, we test our model on the LOL dataset25 without retraining. The LOL test set consists of 15 pairs of images taken in both low and normal lighting conditions. These images are captured in real-world settings with varying exposure times, resulting in noticeable noise. For full-reference image quality assessment, we adopt Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM)39, Learned Perceptual Image Patch Similarity (LPIPS)40, and Mean Absolute Error (MAE). Higher values of PSNR and SSIM indicate better image quality, while lower values of LPIPS and MAE indicate better image quality.
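For completeness, PSNR and MAE can be computed as in the generic sketch below (not the authors' evaluation code); SSIM and LPIPS are usually taken from standard library implementations of the cited works39,40.

```python
import torch


def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio for images in [0, max_val]; higher is better."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)


def mae(pred, target):
    """Mean Absolute Error; lower is better. The pixel scale used for the reported
    numbers is not stated, so this sketch is scale-agnostic."""
    return torch.mean(torch.abs(pred - target))
```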
The quantitative evaluation results are shown in Table 2. As can be seen, among the eight metrics across the two datasets, HybLLIE outperforms the other 17 methods by ranking first in three metrics and second in another three. Among the other unsupervised learning methods, the strongest competitor, UNIE, ranks first in two metrics and third in one. Therefore, HybLLIE demonstrates superior performance among the unsupervised learning methods. When compared to the best-performing supervised learning method, URetinex, it ranks first in three metrics, second in two, and third in one. This indicates that our method also exhibits strong competitiveness in comparison with supervised learning methods. The reason our method does not achieve the best results in some metrics is that it is based on an unsupervised learning approach and was only trained on the SICE dataset. In the future, we plan to further improve the model structure and loss function to enhance the model's performance.
For a more intuitive comparison, we report qualitative comparisons on the LOL dataset25. As shown in Fig. 6, RetinexNet, SCI, SSIENet, and Zero-DCE exhibit low brightness in the hot air balloon region highlighted by the yellow box, failing to reflect the inherent brightness of the corresponding area in the ground truth (GT). MBLLEN and RUAS, on the other hand, tend to over-enhance this region, resulting in overexposure. Additionally, SCI, EnGAN, SSIENet, and Zero-DCE produce images with many dark areas overall. In the region marked by the white box, the floor details are noticeably blurred in SSIENet and Zero-DCE, whereas our method achieves results closest to the GT in this area. This is because the proposed hybrid U-shaped framework and projection technique provide sufficient information for addressing the low-light image enhancement task, enabling HybLLIE to enhance brightness while preserving structural details.
Qualitative comparisons on the unpaired NPE dataset41.
Qualitative comparisons on the unpaired DICM dataset42.
Qualitative comparisons on the unpaired LIME dataset18.
Qualitative comparisons on the unpaired MEF dataset43.
Generalization ability on real-world images
To demonstrate the generalization ability of the proposed method, we further conduct experiments on five real-world low-light image datasets, including NPE41 (84 images), DICM42 (64 images), LIME18 (10 images), MEF43 (17 images), and VV (24 images). As all of these datasets are unpaired, we employ NIQE41 and CEIQ44 to make quantitative comparisons between the various methods. Smaller NIQE scores and higher CEIQ scores indicate better perceptual quality. The comparison results are displayed in Table 3. It can be observed that, across the five datasets and ten metrics, our method ranks in the top three for seven of the metrics. Specifically, CEIQ achieves the highest value on the LIME, MEF, and VV datasets, while NIQE performs best on the NPE dataset. Additionally, NIQE and CEIQ rank third on the DICM dataset, and NIQE ranks third on the MEF dataset. In comparison, LIME and Zero-DCE++ each have only two metrics in the top position, and LE-GAN and SSIENet have just one metric ranked first. Furthermore, our method places the most metrics in the top three, further proving that our approach achieves the best overall performance.
Figure 7 shows the results of a challenging image from the NPE dataset. From the results, we can observe that our method enhances dark regions while simultaneously preserving the color of the sky. The result is visually pleasing without obvious noise or artifacts. In contrast, RetinexNet, EnGAN, and Zero-DCE++ generate visually good results, but they contain some undesired noise and color casts. LIME fails to recover the image. MBLLEN, URetinex, RUAS, and SCI cause overexposure in the sky region of the image. Our proposed method and SSIENet enhance dark regions and preserve the color of the input image simultaneously. This demonstrates that our method has great generalization ability on real-world images with a more naturalistic quality. Figure 8 presents a visual comparison of different methods on the DICM dataset. It can be observed that RetinexNet and SSIENetV2 result in unrealistic colors for the house, while SCI, Zero-DCE++, and LE-GAN either overexpose the sky or lose a significant amount of detail. KinD still leaves the house area underexposed. Only Brain, PairLIE, and our method successfully restore image brightness, with our approach better preserving the sky features of the given input. Figure 9 shows qualitative results on the LIME dataset. LE-GAN and SCI still exhibit some overexposure in the sky region, while SSIENet and SSIENetV2 produce unnatural colors. Brain, SCI, and UNIE display clear content in the upper part of the image, but the lower part remains underexposed with visible dark regions. Compared to other methods, our approach yields the clearest and most natural visual effect. Additionally, Fig. 10 presents visual comparisons on the MEF dataset. KinD, Brain, MBLLEN, EnGAN, and SCI fail to enhance the bookshelf on the left side of the image properly, making it difficult to discern details. RUAS causes severe overexposure in the right half of the image. While Zero-DCE, SSIENet, and our method appropriately enhance image brightness, our method delivers the sharpest texture details and the most natural color rendition, especially in the bookshelf and the view outside the window.
In Fig. 11, we further demonstrate examples of noise suppression. It can be seen that while all methods successfully improve the image brightness, our method effectively suppresses noise on the hair, face, and neck, resulting in a clear and natural appearance. In contrast, noise is still noticeable in the other methods, further highlighting the effectiveness of our approach.
Computational efficiency
We report the parameters and FLOPs of several SOTA methods in Table 4. As can be seen, our method has fewer parameters and significantly lower FLOPs compared to the others. This is mainly due to our proposed LAFormer architecture, where the feature reassignment modulator avoids the high computational complexity of traditional self-attention mechanisms. LAFormer constructs this layer using max-pooling, the L2-norm, and element-wise multiplication, which significantly reduces computational complexity. Additionally, the MLP in LAFormer consists solely of fully connected layers, with no convolutional operations, making it a lightweight module. In the ablation experiments, we compare LAFormer with other Transformer backbones to validate its lightweight and efficient design. Furthermore, we introduce the SeqNeXt module, which uses \(3\times 3\) depthwise convolutions to further reduce computational overhead. In contrast, Zero-DCE14 performs neither downsampling nor depthwise convolutions, so all of its convolution operations run on high-resolution images, and it requires 8 curve iterations. Our method uses downsampling, so most of the computation is carried out on low-resolution feature maps, and it requires only 4 curve iterations. As a result, our method achieves the lowest FLOPs.
Ablation study
In this section, we analyze the effect of each component of HybLLIE in detail. The evaluations are conducted on SICE dataset33. We discuss the ablation below according to the following aspects.
Effects of the components
The ablation results are reported in Table 5. Compared with transformers and their variants, the main change made by the LAFormer block is the use of the feature reassignment modulator (FRM). We first conduct an ablation for this operator by replacing FRM with a self-attention mechanism. Surprisingly, FRM outperforms the self-attention mechanism by 1.2 dB on the PSNR metric. Then, FRM is replaced with depthwise convolution for spatial modeling, and we also attempt to replace the max-pooling operator in FRM with an average-pooling operator. Our FRM still achieves competitive results because it takes into account the characteristics of low-light image data. We also test the effect of the max-pooling size in FRM and observe obvious performance drops for other sizes. Thus, we adopt the default pooling size of 5 for FRM. So far, we have tried multiple operators in the LAFormer block, and only our proposed FRM achieves satisfactory results. This fully demonstrates that FRM is the key component of the LAFormer block. Furthermore, we conduct an ablation analysis on the operators within the SeqNeXt block and find that the best results are achieved only when the depthwise convolution size is \(3\times 3\). We also verify the effectiveness of the activation functions: when we replace GELU with SiLU and ReLU, all values decline. Compared with Table 2, our HybLLIE, which combines LAFormer and SeqNeXt blocks, achieves promising results even when different operators are used instead of the FRM. This indicates that our HybLLIE is a general architecture for enhancing low-light images.
Backbone
To analyze the influence of different encoder and decoder designs, we switch the roles of LAFormer blocks and SeqNeXt blocks: LAFormer blocks serve as the encoder, while SeqNeXt blocks are the decoder. Besides, we also compare the effects of HybLLIE when composed entirely of SeqNeXt or LAFormer blocks on performance. The comparative results are shown in Table 6. We observe that when changing the components of both the encoder and decoder, there is a varying degree of decline in both PSNR and SSIM metrics. These results demonstrate that our hybrid framework effectively combines the advantages of both transformer and convolutional designs, ultimately yielding the most optimal outcomes.
To evaluate the effectiveness of our U-shaped framework and the impact of other Transformer modules on the LLIE task, we design several variants. Specifically, we remove the patch embedding layer and the curve adjustment from HybLLIE, resulting in a backbone called HybUNet. Then, we replace the backbone network HybUNet with TransUNet31. TransUNet is a hybrid model that incorporates convolutional layers and transformers. Similarly, we also introduce a framework named Swin-Unet45, which is composed entirely of transformers. At the end of the encoder in the above frameworks, we add the curve adjustment as in HybLLIE. We thus obtain two variants dubbed TransLLIE and SwinLLIE and use them to enhance the low-light images. The visual comparison of these enhanced results is shown in Fig. 12. We can observe that TransLLIE fails to produce satisfactory results. Although SwinLLIE restores the brightness of the images, it introduces color deviations compared to the ground truth. Table 7 presents the comparison results, which show that our backbone network attains competitive performance and that the potential of transformers for LLIE can be further explored.
Filtered image
This section examines the effects of the filtered image. As shown in Table 8, we conduct a comparison on SICE Part2 dataset to evaluate the effectiveness of incorporating filtered images into curve estimation, assessing the impact of removing inappropriate features. We can observe that HybLLIE with the filtered image can reach a 19.72 PSNR value and a 23.47 MAE value. It demonstrates that utilizing the filtered image can improve the enhanced results.
Conclusion
In this paper, we present HybLLIE, a general architecture for unsupervised low-light image enhancement. HybLLIE is designed with a hybrid U-shaped network, comprising LAFormer and SeqNeXt blocks as its core components. To deal with the image degradations and improve restoration quality, we propose a learnable feature reassignment modulator. It allows HybLLIE to adapt and consistently perform well across various scenarios. Additionally, to assist in the enhancement, we devise a projection block to remove inappropriate features and implement the projected image to estimate parameters. Extensive experiments demonstrate that our method consistently achieves the best performance on real-world images using the same network structure.
In future work, we will further develop components of HybLLIE to reduce time complexity, and dig into learning better feature representation. We hope that this work can bring a new perspective to using a hybrid U-shaped vision backbone.
Data availability
The datasets generated and/or analyzed during the current study are available at https://paperswithcode.com/dataset/lol and https://ieeexplore.ieee.org/document/8259342/.
References
Li, C. et al. Photon-limited object detection using non-local feature matching and knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3976–3987 (2021).
Liu, W. et al. Improving nighttime driving-scene segmentation via dual image-adaptive learnable filters. IEEE Trans. Circ. Syst. Video Technol. (2023).
Dong, Z., Fang, T., Li, J. & Shao, X. Weakly supervised fine-grained semantic segmentation via spatial correlation-guided learning. Comput. Vis. Image Underst. 236, 103815 (2023).
Wang, M., Li, J. & Zhang, C. Low-light image enhancement by deep learning network for improved illumination map. Comput. Vis. Image Underst. 232, 103681 (2023).
Wang, J., Huang, S., Huo, Z., Zhao, S. & Qiao, Y. Bilateral enhancement network with signal-to-noise ratio fusion for lightweight generalizable low-light image enhancement. Sci. Rep. 14, 29832 (2024).
Chen, J., Wang, Y. & Han, Y. A semi-supervised network framework for low-light image enhancement. Eng. Appl. Artif. Intell. 126, 107003 (2023).
Jebadass, J. R. & Balasubramaniam, P. Low light enhancement algorithm for color images using intuitionistic fuzzy sets with histogram equalization. Multimedia Tools Appl. 81, 8093–8106 (2022).
Cai, R. & Chen, Z. Brain-like retinex: A biologically plausible retinex algorithm for low light image enhancement. Pattern Recogn. 136, 109195 (2023).
Chen, Y., Wen, C., Liu, W. & He, W. A depth iterative illumination estimation network for low-light image enhancement based on retinex theory. Sci. Rep. 13, 19709 (2023).
Jiang, Y. et al. Enlightengan: Deep light enhancement without paired supervision. IEEE Trans. Image Process. 30, 2340–2349 (2021).
Liu, R., Ma, L., Zhang, J., Fan, X. & Luo, Z. Retinex-inspired unrolling with cooperative prior architecture search for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10561–10570 (2021).
Fu, Z. et al. Learning a simple low-light image enhancer from paired low-light instances. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22252–22261 (2023).
Zhang, Y., Di, X., Zhang, B. & Wang, C. Self-supervised image enhancement network: Training with low light images only. arXiv preprint arXiv:2002.11300 (2020).
Guo, C. et al. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 1780–1789 (2020).
Liu, Z. et al. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 11976–11986 (2022).
Li, C., Guo, C. & Loy, C. C. Learning to enhance low-light image via zero-reference deep curve estimation. IEEE Trans. Pattern Anal. Mach. Intell. 44, 4225–4238 (2021).
Zhang, R., Guo, L., Huang, S. & Wen, B. Rellie: Deep reinforcement learning for customized low-light image enhancement. In Proceedings of the 29th ACM international conference on multimedia, 2429–2437 (2021).
Guo, X., Li, Y. & Ling, H. Lime: Low-light image enhancement via illumination map estimation. IEEE Trans. Image Process. 26, 982–993 (2016).
Zhou, M. et al. Low-light enhancement method based on a retinex model for structure preservation. IEEE Trans. Multimedia (2023).
Du, S. et al. Low-light image enhancement and denoising via dual-constrained retinex model. Appl. Math. Model. 116, 1–15 (2023).
Wang, L., Zhao, L., Zhong, T. & Wu, C. Low-light image enhancement using generative adversarial networks. Sci. Rep. 14, 18489 (2024).
Liang, X. et al. Low-light image enhancement via adaptive frequency decomposition network. Sci. Rep. 13, 14107 (2023).
Liu, S., Wang, J., Zhang, S., Yu, P. & Ma, X. Low-light image enhancement via multi-stream vision state space module. SIViP 19, 244 (2025).
Fan, G., Yao, Z. & Gan, M. Illumination-aware and structure-guided transformer for low-light image enhancement. Comput. Vis. Image Underst. 252, 104276 (2025).
Wei, C., Wang, W., Yang, W. & Liu, J. Deep retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560 (2018).
Lv, F., Lu, F., Wu, J. & Lim, C. Mbllen: Low-light image/video enhancement using cnns. In BMVC 220, 4 (2018).
Ma, L., Ma, T., Liu, R., Fan, X. & Luo, Z. Toward fast, flexible, and robust low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5637–5646 (2022).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
Yu, W. et al. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 10819–10829 (2022).
Chen, J. et al. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021).
Wang, Z. et al. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 17683–17693 (2022).
Cai, J., Gu, S. & Zhang, L. Learning a deep single image contrast enhancer from multi-exposure images. IEEE Trans. Image Process. 27, 2049–2062 (2018).
Zhang, Y., Zhang, J. & Guo, X. Kindling the darkness: A practical low-light image enhancer. In Proceedings of the 27th ACM international conference on multimedia, 1632–1640 (2019).
Wu, W. et al. Uretinex-net: Retinex-based deep unfolding network for low-light image enhancement. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 5901–5910 (2022).
Fu, Y., Hong, Y., Chen, L. & You, S. Le-gan: Unsupervised low-light image enhancement network using attention module and identity invariant loss. Knowl.-Based Syst. 240, 108010 (2022).
Jin, Y., Yang, W. & Tan, R. T. Unsupervised night image enhancement: When layer decomposition meets light-effects suppression. In European Conference on Computer Vision, 404–421 (Springer, 2022).
Zhang, Y. et al. Self-supervised low light image enhancement and denoising. arXiv preprint arXiv:2103.00832 (2021).
Wang, Z., Bovik, A. C., Sheikh, H. R. & Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13, 600–612 (2004).
Zhang, R., Isola, P., Efros, A. A., Shechtman, E. & Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, 586–595 (2018).
Wang, S., Zheng, J., Hu, H.-M. & Li, B. Naturalness preserved enhancement algorithm for non-uniform illumination images. IEEE Trans. Image Process. 22, 3538–3548 (2013).
Lee, C., Lee, C. & Kim, C.-S. Contrast enhancement based on layered difference representation of 2d histograms. IEEE Trans. Image Process. 22, 5372–5384 (2013).
Lee, C., Lee, C., Lee, Y.-Y. & Kim, C.-S. Power-constrained contrast enhancement for emissive displays based on histogram equalization. IEEE Trans. Image Process. 21, 80–93 (2011).
Fang, Y. et al. No-reference quality assessment of contrast-distorted images based on natural scene statistics. IEEE Signal Process. Lett. 22, 838–842 (2014).
Cao, H. et al. Swin-unet: Unet-like pure transformer for medical image segmentation. In Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part III, 205–218 (Springer, 2023).
Acknowledgements
This work was supported by the National Natural Science Foundation of China [grant number 61903274] and the Natural Science Foundation of Tianjin [grant number 23YDTPJC00500].
Author information
Authors and Affiliations
Contributions
Yutao Jin and Yue Sun developed the method, conducted experiments, and wrote the manuscript. Jiabao Liang and Xiaoning Yan analyzed the results, Zeyao Hou and Shuangwu Zheng reviewed the manuscript, and Xiaoyan Chen managed this project.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Jin, Y., Sun, Y., Liang, J. et al. A hybrid framework for curve estimation based low light image enhancement. Sci Rep 15, 8611 (2025). https://doi.org/10.1038/s41598-025-92161-y
DOI: https://doi.org/10.1038/s41598-025-92161-y