Introduction

The demand for efficient image transmission over wireless channels is escalating, driven by applications such as autonomous driving, remote sensing, and the internet of things (IoT) that require reliable visual data in real time. These applications often operate in environments with limited bandwidth and high interference, where noisy channel conditions can severely degrade image quality1,2. Achieving both high compression efficiency and robustness to bit errors is therefore paramount, yet traditional approaches struggle to meet these dual requirements. Conventional compression standards, such as JPEG20003 and WebP4, while widely used, falter at very low bitrates and are highly sensitive to transmission errors. Even minor errors can corrupt an entire image, necessitating robust error correction or retransmission schemes that increase latency and reduce system efficiency5. This limitation highlights the need for new compression paradigms that can gracefully handle the imperfections of wireless channels. Noisy wireless channels, particularly those affected by additive white Gaussian noise (AWGN), intensify the challenges associated with image transmission. High bit error rates and limited bandwidth can severely degrade image quality, making it difficult to achieve acceptable performance at very low bits per pixel (BPP). While channel coding techniques can mitigate some errors, relying solely on them may not be sufficient or efficient for ultra-low-bitrate image transmission. In such scenarios, achieving both high compression efficiency and robustness to noise becomes paramount6. Recent advancements in deep learning-based compression and reconstruction methods offer promising solutions to these challenges. 
By leveraging deep learning architectures, these techniques can achieve significant compression while maintaining or even enhancing perceptual image quality2,7,8. Learned image compression (LIC) models, such as hyperprior models8,9 and vector quantized generative adversarial networks (VQGAN)10, have demonstrated superior performance compared to traditional methods. While hyperprior models focus on quantization and lossless compression in the bottleneck of their autoencoders, VQGAN was originally developed as an image tokenizer for image generation transformers. Nonetheless, its utilization of vector quantization in the latent space facilitates robust image transmission even in the absence of channel coding, which is advantageous for real-time image transmission as it reduces the latency and computational overhead associated with encoding and decoding. This characteristic makes VQGAN a valuable component in image transmission systems, especially under dynamic and poor channel conditions where traditional compression models may falter. The integration of recent advancements in neural joint source-channel coding (JSCC) further underscores the progress and ongoing challenges in efficient image transmission over wireless channels. Notably, Yang et al.11 introduce a neural JSCC backbone based on the Swin Transformer architecture, which demonstrates remarkable adaptability to diverse channel conditions and transmission requirements. Their approach incorporates a code mask module that prioritizes channel importance, enabling adaptive transmission through a single, unified model. This is achieved by integrating the target rate into multiple layers of the image compressor. In contrast, our methodology leverages a hyperprior model that organizes feature maps in a sorted manner, facilitating a different form of adaptability. Additionally, while Yang et al. 
focus primarily on the inference time of the encoder and decoder, our study focuses on the waiting time induced by dynamic channel conditions, providing a more comprehensive evaluation of system latency. Furthermore, Wu et al.12 present JSCCformer-f, a wireless image transmission paradigm that capitalizes on feedback from the receiver to enhance transmission efficacy. The unified encoder in JSCCformer-f effectively utilizes semantic information from the source image, channel state information, and the decoder’s current belief about the source image derived from the feedback signal to generate coded symbols dynamically at each transmission block. However, similar to Yang et al., their work primarily addresses the inference time of the model without delving into the waiting time associated with image transmission. Moreover, JSCCformer-f necessitates continuous feedback from the decoder, which introduces additional communication overhead and complexity. Unlike their approach, our research focuses on progressive transmission and provides an in-depth analysis of transmission waiting time, an important factor in delay-sensitive applications, eliminating the dependency on receiver feedback and thereby streamlining the transmission process. These recent studies by Yang et al.11 and Wu et al.12 highlight significant strides in the development of flexible and efficient neural JSCC frameworks. While these frameworks demonstrate significant flexibility, their latency analysis is typically confined to the model’s inference time. They do not fully address the transmission waiting time induced by fluctuating channel conditions, where a system must pause or adapt its transmission rate. Our work complements these studies by focusing specifically on this transmission-related latency, which is a critical bottleneck in real-world wireless deployments. 
Addressing these gaps, our work aims to enhance the robustness and efficiency of image transmission systems by focusing on both inference and transmission waiting times, thereby contributing to more reliable and low-latency wireless communication solutions. Building upon these advancements, we propose a novel adaptive and progressive image transmission pipeline based on state-of-the-art LIC architectures. Specifically, we leverage the strengths of the hyperprior model and VQGAN to address the challenges of dynamic wireless channels by balancing robustness, throughput, and latency. The hyperprior model is known for its exceptional compression performance due to effective quantization and lossless compression in the bottleneck of the autoencoder. However, it is highly sensitive to bit errors, which limits its application in noisy channels unless robust error correction or retransmission mechanisms are employed. Conversely, VQGAN utilizes vector quantization in the latent space, inherently providing robustness to bit errors and allowing the decoder to reconstruct images even without channel coding. This characteristic enhances robustness in noisy environments but may not achieve the same compression efficiency as the hyperprior model. To harness the advantages of both models and overcome their individual limitations, we introduce progressive versions of these architectures. Our progressive transmission framework allows for partial image transmission and decoding, enabling immediate availability of coarse images under suboptimal channel conditions or limited throughput. As channel conditions improve or more bandwidth becomes available, additional data can be transmitted to progressively refine the image quality. This approach not only maintains image integrity under poor channel conditions but also significantly reduces latency by allowing immediate partial image availability. 
We evaluate our proposed pipeline on the Kodak high-resolution image dataset, measuring performance in terms of peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM). Experimental results demonstrate that our progressive transmission framework significantly enhances robustness and reduces latency compared to non-progressive counterparts. Moreover, the progressive approach is particularly beneficial for IoT applications and task-oriented communications, where initial low-quality images are sufficient for immediate processing, and subsequent refinements improve performance as conditions permit. By integrating adaptive and progressive LIC architectures, this work contributes to the advancement of intelligent communications by providing a robust and efficient solution for real-time image transmission in wireless systems. Our approach addresses the critical need for reliable, low-latency image delivery in environments with dynamic and challenging channel conditions, thereby supporting the demands of real-time computer vision tasks and other emerging applications. In this work, we make the following key contributions:

  1.

    We design and implement an adaptive, progressive transmission pipeline for two distinct, state-of-the-art LIC architectures, one hyperprior-based and one VQGAN-based, tailored for dynamic wireless environments.

  2.

    We provide a detailed comparative analysis of these models over wireless channels, with a unique focus on transmission waiting time, a critical component of end-to-end latency that is often overlooked in favor of model inference time.

  3.

    We demonstrate the practical trade-offs between latency, robustness, and throughput, showing that the progressive hyperprior model is superior for low-latency applications, while the progressive VQGAN excels in robust image reconstruction without channel coding.

These contributions collectively advance the state-of-the-art in intelligent image transmission over wireless channels, providing robust, low-latency, and efficient solutions tailored to the demands of modern and emerging applications. The project code is available at https://github.com/M0574F4/LIC_TX.

Traditional image compression techniques have been the cornerstone of digital image transmission for decades. Standards such as JPEG13, WebP4, and high efficiency video coding (HEVC)14 are widely adopted due to their balance between compression efficiency and computational complexity. JPEG, one of the earliest compression standards, employs the discrete cosine transform (DCT) to reduce spatial redundancy, followed by quantization and entropy coding13. Despite its widespread use, JPEG suffers from significant artifacts and quality degradation at extremely low bitrates due to large quantization steps, which discard essential image details15. WebP, developed by Google, extends the capabilities of JPEG by incorporating both lossy and lossless compression techniques, leveraging predictive coding and entropy encoding to achieve better compression ratios4. HEVC further improves compression efficiency by using advanced techniques such as block partitioning, motion compensation, and in-loop filtering, making it suitable for high-resolution and high-dynamic-range images14. However, these traditional methods are highly sensitive to bit errors, especially in noisy wireless channels. Minor bit errors can lead to significant artifacts or render the entire image undecodable, necessitating the use of robust error correction or retransmission strategies to maintain image integrity5. Recent advancements in deep learning have encouraged the development of LIC models, which leverage neural networks to outperform traditional compression standards in both compression efficiency and adaptability. 
Hyperprior models, introduced by Ballé et al.8, utilize a hypernetwork to model the distribution of latent representations, enabling more efficient entropy coding and improved compression performance. These models achieve state-of-the-art results by optimizing rate-distortion trade-offs through end-to-end training9. VQGAN, originally designed as an image tokenizer for generative transformers10, has been repurposed for image compression by leveraging vector quantization in the latent space of autoencoder architectures6. Unlike traditional LIC models, VQGAN is not inherently a compression model, but it offers robust image reconstruction capabilities: errors in the tokenized latent representation result in only localized quality degradation rather than making the entire image undecodable. However, VQGAN’s primary focus on image generation rather than compression introduces challenges in achieving optimal compression ratios without compromising image quality. Other notable LIC approaches include variational autoencoders (VAEs)16 and generative adversarial networks (GANs)17, which have been tailored for compression by incorporating probabilistic models and adversarial training to enhance perceptual quality. These models demonstrate significant improvements over traditional methods, particularly in maintaining image fidelity at lower bitrates. While early LIC models required training separate networks for each target bitrate, significant research has focused on creating single, flexible models capable of progressive decoding. This allows for the reconstruction of an image from a single, partially transmitted bitstream. Researchers have explored various strategies to achieve this. For instance, one study18 developed a progressive decoder demonstrating controllable rates within a hyperprior architecture. Other works have proposed complete frameworks for progressive compression. 
The method by Lu et al.19 builds upon latent scaling techniques to create a fully embedded bitstream using nested quantization and an importance-based ordering of latent elements. As an alternative quantization-based strategy, Li et al.20 proposed a progressive compression framework that leverages dead-zone quantizers, achieving scalability by adjusting the quantizer parameters in stages. These advanced methods highlight the potential of progressive coding, and our work analyzes two such state-of-the-art progressive schemes in the context of wireless transmission challenges. Progressive transmission techniques have gained traction as a means to enhance image transmission robustness and reduce latency in wireless communications. Progressive compression schemes, such as scalable video coding (SVC)21 and layered image compression22, enable the transmission of image data in multiple layers, allowing receivers to reconstruct low-quality images quickly and progressively enhance them as more data becomes available. This approach is particularly beneficial in scenarios with fluctuating channel conditions, as it ensures that partial data can be utilized effectively without waiting for complete transmission23. In the context of wireless communications, progressive transmission aligns well with the need for balancing robustness, throughput, and latency. Techniques such as unequal error protection24 and adaptive modulation and coding (AMC)25 have been employed to dynamically adjust transmission parameters based on channel quality. To the best of our knowledge, progressive image transmission tailored specifically for learned image compression has not been extensively investigated. However, researchers outside the communications domain have explored progressive decoding strategies. For instance, one study18 developed a progressive decoder, demonstrating the feasibility of controllable rates in an autoencoder with a hyperprior branch. 
Additionally, Flowers et al.26 employed a vector quantized variational autoencoder (VQ-VAE) model and proposed a hierarchical architecture that performs residual vector quantization in the bottleneck of the image autoencoder. These studies highlight the potential of progressive decoding and hierarchical quantization approaches, inspiring our proposed adaptive and progressive image transmission pipeline integrated with LIC models for wireless environments (Table 1).

Table 1 Comparison with Prior Art in Progressive Learned Image Compression

Results

Performance under simulated fading channels

Table 2 illustrates the performance of our proposed models across various signal-to-noise ratio (SNR) values, providing a comprehensive view of throughput, latency, and image quality under realistic channel conditions. Each model is configured to maximize its performance within its respective architecture. The progressive-hyperprior model is designed to transmit up to 32 feature maps, balancing resolution against minimal latency. Progressive-VQGAN operates with 10 residual quantization stages (\({M}_{{\rm{stages}}}=10\)), aiming for higher fidelity under limited channel conditions. Finally, adaptive WebP dynamically adjusts quality factors (here from 1 to 4) to utilize available channel bandwidth effectively, optimizing image quality. A snapshot of the Rayleigh fading channel magnitude h over a timespan of 300 ms is presented in Fig. 1. This plot illustrates the temporal variations in channel conditions, which directly impact the image quality and transmission latency of each model. As observed, significant drops in channel magnitude correspond to stopped transmission for adaptive WebP, where it fails to deliver even the lowest-quality image (quality factor = 1). In contrast, the progressive transmission schemes continue to function under these challenging channel conditions. This resilience is crucial for applications requiring consistent image quality under varying channel conditions. Table 3 summarizes the notation used throughout this paper.

Fig. 1: Comparison of channel masking impact on PSNR.

A 300 ms snapshot of the fading channel magnitude h, PSNR of the transmitted images for the three models, and waiting time T of the image transmission system for progressive-hyperprior, progressive VQGAN, and adaptive WebP models over the snapshot.

Table 2 Performance comparison of the adaptive WebP, progressive-VQGAN, and progressive-hyperprior models on the Kodak dataset across various SNR values
Table 3 Notations

Analysis of latency, quality, and throughput

The experimental data offers insights into each model’s strengths in terms of latency, image quality, and throughput across diverse SNR settings. The progressive-hyperprior model consistently outperforms the other models in minimizing latency across all SNR conditions. At low SNR levels (−10, −5, and 0 dB), it achieves the lowest average waiting time Tavg and 99.9th-percentile waiting time T99.9%, demonstrating its effectiveness for real-time applications where speed is essential. Notably, even at 5 dB, where adaptive WebP briefly offers a slightly lower Tavg, progressive-hyperprior maintains a strong lead in 99.9th-percentile latency, supporting its robustness under both challenging and favorable conditions for delay-sensitive applications. This superior latency performance is primarily due to the progressive-hyperprior model’s design, which bases its progressive decoding on compact feature maps. These compact feature maps enable the transmission of a small, granular number of bits, even in poor channel conditions. Image quality, as measured by PSNR and SSIM, varies significantly with SNR and model choice: adaptive WebP yields the highest quality metrics (PSNR and SSIM) at higher SNR levels (0 dB and 5 dB), achieving peak fidelity in ideal channel conditions. However, it cannot transmit at −10 and −5 dB due to channel capacity constraints. In contrast, progressive-VQGAN excels at lower SNR levels, outperforming progressive-hyperprior in terms of PSNR and SSIM. Its architecture is resilient to adverse conditions, providing higher image quality when adaptive WebP fails to transmit. This suggests progressive-VQGAN’s suitability for scenarios demanding quality retention under limited bandwidth. Throughput performance underscores the progressive-hyperprior model’s efficiency: at low to moderate SNR (−10 to 0 dB), progressive-hyperprior achieves the highest throughput, indicating optimal use of available channel capacity even under stringent conditions. 
This throughput advantage makes it suitable for applications where maximizing data transfer is essential despite poor channel quality. At 5 dB, adaptive WebP surpasses other models in throughput, taking advantage of the higher quality factor options available at improved SNR. This transition illustrates adaptive WebP’s capacity to exploit good channel conditions effectively but also underscores its dependency on sufficient bandwidth. Figure 2 illustrates the impact of the parameter \({N}_{\max }\) on the tradeoff between throughput, robustness (PSNR and SSIM), and latency for the hyperprior model. The parameter \({N}_{\max }\) specifies the total number of feature maps to be transmitted for each image, enabling us to adapt the model’s performance to different channel conditions. By examining this figure, we can select the appropriate value of \({N}_{\max }\) to achieve a desired tradeoff. For instance, if the target performance requirements are SSIM > 0.75 and PSNR > 27 dB, then \({N}_{\max }=96\) and \({N}_{\max }=192\) both meet these criteria. However, since \({N}_{\max }=96\) allows for higher throughput (transmitting more images per second), it would be the preferred choice. This analysis guides the selection of \({N}_{\max }\) based on specific performance and throughput priorities.

Fig. 2: Visualization of Tavg (ms) and performance metrics for the hyperprior model.

a PSNR (dB) and b SSIM. The size of the circles represents throughput, with annotations on the circles indicating the channel SNR. By adjusting \({N}_{\max }\), which determines the number of feature maps to be transmitted for each image, we can observe the impact on PSNR and SSIM, reflecting the tradeoff between robustness, latency, and throughput in varying channel conditions.

Discussion

The findings emphasize the adaptability of each model to specific use-case demands, showcasing the trade-offs between latency, quality, and throughput. Progressive-hyperprior is ideal for delay-sensitive applications, minimizing latency while maintaining throughput across various SNR levels. Its selective transmission of feature maps allows it to remain functional under poor conditions, a trait that could benefit applications such as interactive video streaming or remote sensing. Progressive-VQGAN’s robustness in low-SNR conditions makes it advantageous where quality is a priority and latency is secondary. Its vector quantization approach allows reliable image reconstruction, reducing the need for channel coding (and hence encoding and decoding time) and lowering computational load. Progressive-VQGAN’s performance under degraded channels highlights its utility for applications that prioritize visual fidelity under bandwidth constraints. Performing best in high-quality conditions (5 dB), adaptive WebP achieves the highest image quality and throughput in such environments. However, its dependency on channel capacity restricts its usability in lower-SNR scenarios. Adaptive WebP’s efficiency in optimal conditions positions it as an ideal choice for high-definition media streaming where channel conditions are controlled or consistently high. Our results underscore an important trade-off: while specialized codecs like adaptive WebP are highly optimized for high-SNR conditions, our progressive LIC frameworks are designed for robustness, providing a significant performance advantage and maintaining service continuity in challenging low-SNR regimes where standard codecs fail. The VQGAN’s robustness “without channel coding” refers to its ability to function without an outer error-correction code (e.g., LDPC) applied to the bitstream, as the inherent error resilience of vector quantization ensures bit errors lead to localized, non-catastrophic artifacts. 
The adaptability and progressive nature of these models provide valuable tools for optimizing wireless communication systems, particularly in variable or constrained environments. By tailoring the transmission approach to the channel’s real-time conditions, our framework enables targeted optimization for latency, quality, or throughput as the application requires. Real-time and low-latency applications can benefit from progressive-hyperprior’s latency performance, which supports immediate data access needed for interactive services. For scenarios where image quality is essential, particularly under stable and high SNR conditions, Adaptive WebP or progressive-VQGAN can achieve high PSNR and SSIM values, making them ideal for applications where visual clarity is crucial. The robustness of progressive-VQGAN without channel coding reduces processing demands, beneficial in resource-limited environments or applications where computational efficiency is critical. This paper introduced and evaluated an adaptive and progressive image transmission pipeline for wireless systems, integrating two state-of-the-art LIC architectures. By tailoring a hyperprior-based model and a VQGAN-based model for progressive delivery, our framework effectively enhances robustness and reduces latency, particularly in challenging low-SNR channel conditions where traditional methods like adaptive WebP fail. Our key finding is a clear trade-off between the two architectures: the progressive hyperprior model consistently achieves the lowest transmission latency, making it the optimal choice for delay-sensitive applications. Conversely, the progressive VQGAN model delivers robust image quality without requiring outer channel coding, making it highly suitable for scenarios where computational resources are limited or channel conditions are severe. 
These results underscore the substantial benefits of integrating intelligent, learning-based compression techniques into modern wireless communication protocols. We acknowledge the limitations of this study, which also open avenues for future research. Our evaluation was conducted on the Kodak dataset and a specific Rayleigh fading channel model. Future work should validate these findings across larger and more diverse datasets (e.g., DIV2K, CLIC) and a wider range of channel models, including Rician fading and AWGN-only scenarios. Investigating the impact of varying mobility conditions (i.e., different Doppler frequencies) would also provide a more comprehensive understanding of the system’s performance. Furthermore, our latency analysis focused on transmission time; a complete end-to-end evaluation including compression and decompression times on resource-constrained devices is an important next step. Finally, evaluating the pipeline’s performance on specialized, correlated datasets (such as underwater or road imagery) and reporting variance or confidence intervals for our numerical results would further strengthen the conclusions.

Methods

This section details the proposed system architecture, the adaptive transmission mechanism, and the comprehensive experimental setup used to evaluate our pipeline. Efficient image transmission in wireless communications necessitates a comprehensive understanding of both the communication channel characteristics and the fundamentals of image compression techniques. We first provide the essential background required to appreciate the challenges and innovations presented in this study. The performance of image transmission systems in wireless environments is intrinsically linked to the properties of the communication channels. To design robust image compression and transmission schemes, it is crucial to model these channels accurately while maintaining a balance between complexity and practicality. A wireless communication channel can be mathematically modeled as:

$$y=hx+n,$$
(1)

where:

  • \(y\in {{\mathbb{C}}}^{L}\) is the received signal vector,

  • \(h\in {{\mathbb{C}}}^{L}\) represents the channel coefficients,

  • \(x\in {{\mathbb{C}}}^{L}\) is the transmitted signal vector,

  • \(n \sim {\mathcal{CN}}(0,{\sigma }^{2}I)\) is the AWGN vector with zero mean and covariance matrix \({\sigma }^{2}I\).

and L is the dimensionality of the signal vectors. In this model, h encompasses both large-scale fading (path loss, shadowing) and small-scale fading (multipath effects). The AWGN component n captures the thermal noise inherent in the communication system. This formulation allows us to abstract the complex channel characteristics into an effective transmission rate R, which is crucial for designing adaptive and progressive transmission schemes. Shannon’s capacity theorem provides a fundamental limit on the maximum achievable transmission rate C for a given channel, ensuring reliable communication. However, Shannon’s theorem assumes infinitely long codewords, which is impractical for real-world applications where finite blocklengths n are used. The finite-blocklength capacity introduces a trade-off between throughput and delay:

$$R=C-\sqrt{\frac{V}{n}}{Q}^{-1}(\epsilon ),$$
(2)

where:

  • R is the achievable rate,

  • V is the channel dispersion,

  • n is the blocklength,

  • Q−1(ϵ) is the inverse of the Q-function evaluated at the error probability ϵ.

This relationship highlights that shorter blocklengths, which are desirable for low latency, result in rates R that are below the Shannon capacity C. Consequently, there is a trade-off between maximizing throughput and minimizing delay, which is particularly relevant for designing progressive and adaptive transmission schemes where latency is a critical factor27. For the purposes of this study, we use the achievable transmission rate R, which is a function of the available bandwidth B and the SNR. According to Shannon’s capacity theorem28, the channel capacity C in bits per second (bps) is given by:

$$C=B\,{\log }_{2}\left(1+\frac{P}{{N}_{0}B}\right),$$
(3)

where:

  • P is the transmit power,

  • N0 is the noise power spectral density.
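To make Eqs. (1)–(3) concrete, the sketch below draws one Rayleigh-fading realization of the channel in Eq. (1), evaluates the Shannon capacity of Eq. (3), and applies the finite-blocklength penalty of Eq. (2). This is an illustrative sketch, not the simulator used in our experiments: the channel dispersion V is taken as the standard AWGN approximation, and all numerical values (bandwidth, power, blocklengths) are hypothetical.

```python
import math
import random
from statistics import NormalDist

def rayleigh_channel_snr(avg_snr_db, rng):
    """Instantaneous SNR for one realization of y = h*x + n (Eq. 1).

    h ~ CN(0, 1) models small-scale fading, so |h|^2 scales the average SNR.
    """
    h = complex(rng.gauss(0, 1), rng.gauss(0, 1)) / math.sqrt(2)
    return abs(h) ** 2 * 10 ** (avg_snr_db / 10)

def shannon_capacity_bps(bandwidth_hz, power, n0):
    """C = B log2(1 + P / (N0 B)) from Eq. (3), in bits per second."""
    return bandwidth_hz * math.log2(1 + power / (n0 * bandwidth_hz))

def finite_blocklength_rate(snr, n, eps):
    """R = C - sqrt(V/n) Q^{-1}(eps) from Eq. (2), in bits per channel use.

    V is the AWGN channel dispersion; Q^{-1} is obtained from the
    standard normal quantile function.
    """
    c = math.log2(1 + snr)
    v = (snr * (snr + 2)) / (2 * (snr + 1) ** 2) * math.log2(math.e) ** 2
    return c - math.sqrt(v / n) * NormalDist().inv_cdf(1 - eps)

inst_snr = rayleigh_channel_snr(5.0, random.Random(42))  # one fading draw
c_bps = shannon_capacity_bps(1e6, 1.0, 1e-7)             # P / (N0 * B) = 10
r_short = finite_blocklength_rate(10.0, n=200, eps=1e-3)
r_long = finite_blocklength_rate(10.0, n=5000, eps=1e-3)
```

Here `r_short` pays a larger penalty below capacity than `r_long`, which is exactly the trade-off between throughput and delay described above.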

Throughput

Throughput refers to the effective data transmission rate and is constrained by the channel capacity C. Typically, in communication systems, throughput is measured in bits per second (bps) to indicate how much data can be transmitted within a given time frame. However, in this work, which focuses on computer vision tasks, we define throughput as pixels per second, as this metric more directly relates to the amount of visual data processed over time. Higher throughput, measured in pixels per second, enables faster image processing and transmission, which is essential for applications requiring real-time or near-real-time data delivery.
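As a toy illustration of this pixels-per-second definition (the rate and BPP operating points below are hypothetical):

```python
def throughput_pixels_per_sec(rate_bps, bits_per_pixel):
    """Convert a channel rate in bps into the pixels-per-second metric.

    bits_per_pixel is the codec's operating point (BPP); lowering BPP
    raises the number of pixels deliverable per second at a fixed rate.
    """
    return rate_bps / bits_per_pixel

# At 1 Mbps, a 0.1 BPP operating point moves 10x more pixels than 1.0 BPP
px_low_bpp = throughput_pixels_per_sec(1e6, 0.1)
px_high_bpp = throughput_pixels_per_sec(1e6, 1.0)
```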

Robustness

Robustness pertains to the accuracy and integrity of the transmitted image data. It is influenced by factors such as the bit error rate (BER). In task-oriented communications, robustness is defined not only by the BER but also by the successful completion of specific tasks, such as object detection or recognition, based on the received images. However, in this work, where specific tasks are not the primary focus, image fidelity is primarily assessed using metrics like PSNR and SSIM. These metrics provide a quantifiable measure of image fidelity and integrity, effectively capturing robustness in the absence of task-specific evaluations.
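A minimal PSNR implementation follows (SSIM is substantially more involved and is omitted here); the reference image and the uniform error magnitude are purely illustrative:

```python
import numpy as np

def psnr(ref, rec, max_val=255.0):
    """Peak signal-to-noise ratio (dB) of a reconstruction vs. a reference."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.zeros((8, 8), dtype=np.uint8)
rec = np.full((8, 8), 16, dtype=np.uint8)  # uniform error of 16 gray levels
val = psnr(ref, rec)
```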

Latency

Latency L is the time delay between the initiation of image transmission and its successful reception and reconstruction at the receiver. End-to-end latency can be decomposed into several components:

$$L={L}_{c}+{L}_{t}+{L}_{d},$$
(4)

where:

  • Lc is the compression latency,

  • Lt is the transmission latency,

  • Ld is the decompression latency.
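The decomposition in Eq. (4) can be sketched as follows; all timing numbers are hypothetical and serve only to show how a small base layer shortens the time to a first decodable image:

```python
def transmission_latency(n_bits, rate_bps):
    """Lt: time to move n_bits through a channel sustaining rate_bps."""
    return n_bits / rate_bps

def end_to_end_latency(l_c, n_bits, rate_bps, l_d):
    """L = Lc + Lt + Ld (Eq. 4), with Lt computed from the bit count."""
    return l_c + transmission_latency(n_bits, rate_bps) + l_d

# Hypothetical numbers: a 20 kb base layer vs. a 200 kb full bitstream
# over a 1 Mbps link, with 5 ms each for compression and decompression.
t_base = end_to_end_latency(0.005, 20_000, 1e6, 0.005)   # first coarse image
t_full = end_to_end_latency(0.005, 200_000, 1e6, 0.005)  # fully refined image
```

With these numbers the coarse image is available after 30 ms, while the fully refined image takes 210 ms, illustrating why progressive delivery reduces perceived latency.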

Compression latency (Lc) is the time taken to compress the image before transmission. Transmission latency (Lt) is the time required to transmit the compressed data over the wireless channel, which is influenced by the available transmission rate and the size of the data. Decompression latency (Ld) is the time taken to decompress and reconstruct the image at the receiver.

Progressive transmission primarily aims to reduce the transmission latency (Lt) by enabling the receiver to reconstruct a low-quality version of the image quickly, followed by incremental enhancements as more data becomes available. This approach minimizes the perceived latency, allowing for immediate partial image availability at the cost of initially lower image quality.

Our system model assumes that channel state information is available at the transmitter. This allows the transmitter to estimate the achievable transmission rate for a given time slot and determine the corresponding bit budget, \({N}_{{\rm{bits}}}\). While channel state information enables adaptive-rate transmission, a progressive approach offers superior robustness. In severe fading conditions where the channel cannot support even the lowest rate of a non-progressive codec, our method can still transmit a minimal base layer, ensuring service continuity. This is demonstrated in our results, where adaptive WebP fails entirely at low SNRs.

LIC leverages deep learning techniques to surpass traditional compression methods in terms of compression efficiency and adaptability to varying channel conditions. LIC models typically employ neural network architectures that are trained end-to-end to optimize the trade-off between compression rate and image quality. At the heart of most LIC models lies the autoencoder architecture, which comprises an encoder, a bottleneck (latent space), and a decoder8. The encoder transforms the input image I into a compact latent representation z, which is then quantized and compressed for transmission. 
The decoder reconstructs the image \(\widehat{I}\) from the compressed latent code:

$$z={\text{Quantize}}\,({\text{Encoder}}(I)),$$
(5)
$$\widehat{I}={\text{Decoder}}(z).$$
(6)

The autoencoder framework is highly flexible and can be enhanced with various mechanisms to improve compression efficiency and image quality. For instance, incorporating attention mechanisms or residual connections can enhance the model’s ability to capture intricate image details9. Additionally, probabilistic models within the autoencoder allow for better entropy modeling, leading to more efficient compression29.

Vector quantization (VQ) is a technique used to discretize the continuous latent representations produced by the encoder. In the context of LIC, VQ reduces the dimensionality of the latent space, facilitating more efficient compression30. For example, VQGAN utilizes VQ in its autoencoder architecture to create discrete latent codes that are easier to compress and transmit10. The integration of VQ in LIC models without lossless compression offers several advantages. Discrete latent spaces are inherently more resilient to transmission noise, enhancing the robustness of image reconstruction31. Additionally, hierarchical VQ approaches, such as residual vector quantization, enable the progressive refinement of image details by encoding residual information at multiple levels26. This hierarchical structure not only improves compression efficiency but also supports progressive transmission by allowing partial data to enhance image quality incrementally. Furthermore, recent advancements have explored transformer architectures in conjunction with VQ-based models for generating high-fidelity images from discrete latent codes, bridging the gap between compression and image generation10. These features make VQ-based LIC models particularly well-suited for adaptive and progressive image transmission in dynamic wireless environments.
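As a concrete illustration of this discretization step, the sketch below implements a nearest-codeword lookup in NumPy. The shapes, codebook size, and function names are illustrative choices, not taken from any specific model:

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each latent vector to the index of its nearest codeword (L2 distance)."""
    # z: (n, d) latent vectors, codebook: (K, d) codewords
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (n, K) squared distances
    return d2.argmin(axis=1)

def dequantize(indices, codebook):
    """Recover approximate latents from the transmitted indices."""
    return codebook[indices]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 4))   # 256 codewords -> 8 bits per index
z = rng.normal(size=(96 * 64, 4))      # e.g. a flattened (W/8 x H/8) token grid
idx = vector_quantize(z, codebook)     # discrete tokens; only indices are transmitted
z_hat = dequantize(idx, codebook)      # receiver-side reconstruction of the latents
```

Because only integer indices are sent, a corrupted index maps to a wrong but valid codeword rather than desynchronizing a bitstream, which is the robustness property exploited above.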

System architecture overview

Our transmission pipeline is structured around two primary LIC models: the hyperprior-based model and the VQGAN-based model. Each model’s architecture is illustrated in Figs. 3 and 4.

Fig. 3: System model for transmission based on the hyperprior model.
Full size image

The encoder ga downsamples the input image (W × H) by a factor of 16 to produce a latent representation Y of size \(\frac{W}{16}\times \frac{H}{16}\times N\). The system retains k% of the feature maps in Y and Z at the transmitter, masking the remainder; the receiver reconstructs by padding the masked 100 − k% feature maps with zeroes. The selection of k is based on channel conditions and service requirements.

Hyperprior-based model integration

The hyperprior-based model utilizes a variational autoencoder structure that enhances compression efficiency through the use of a hyperprior. The architecture of this model, depicted in Fig. 3, employs the following mathematical formulation8:

$$Y={g}_{a}(X),$$
(7)
$${Y}_{{\text{masked}}}={\text{Mask}}(Y,k \% ),$$
(8)
$${\widehat{Y}}_{{\text{masked}}}={\text{Quantize}}({Y}_{{\text{masked}}}),$$
(9)
$$Z={h}_{a}(Y),$$
(10)
$${Z}_{{\text{masked}}}={\text{Mask}}(Z,k \% ),$$
(11)
$${\widehat{Z}}_{{\text{masked}}}={\text{Quantize}}({Z}_{{\text{masked}}}),$$
(12)
$$\widehat{\sigma }={h}_{s}({\widehat{Z}}_{{\text{masked}}}),$$
(13)
$${y}_{{\text{bytes}}}={\text{ArithmeticEncode}}({\widehat{Y}}_{{\text{masked}}},\widehat{\sigma }),$$
(14)
$${z}_{{\text{bytes}}}={\text{ArithmeticEncode}}({\widehat{Z}}_{{\text{masked}}}),$$
(15)

where,

  • ga is the encoder that transforms the input image \(X\in {{\mathbb{R}}}^{W\times H\times 3}\) into a latent representation \(Y\in {{\mathbb{R}}}^{\frac{W}{16}\times \frac{H}{16}\times N}\), where the spatial dimensions are reduced by downsampling factors inherent to the encoder’s architecture.

  • Mask(Y, k%) preserves k% of the channels in Y, masking the remainder.

  • Quantize(Ymasked) quantizes the masked latent representation to produce \({\widehat{Y}}_{masked}\).

  • ha is the hyperprior encoder that processes Y to generate hyper latent \(Z\in {{\mathbb{R}}}^{\frac{W}{64}\times \frac{H}{64}\times M}\).

  • Mask(Z, k%) and Quantize(Zmasked) follow similar processing steps for the hyper latent.

  • hs decodes \({\widehat{Z}}_{masked}\) to estimate the scale parameter \(\widehat{\sigma }\), crucial for arithmetic encoding.

  • ArithmeticEncode generates byte streams ybytes and zbytes using the quantized and masked representations.

At the receiver end, depicted in Fig. 3, the decoding process unfolds as follows:

$${\widetilde{Z}}_{{\text{masked}}}={\text{ArithmeticDecode}}({z}_{{\text{bytes}}}),$$
(16)
$$\widehat{\sigma }={h}_{s}({\widetilde{Z}}_{{\text{masked}}}),$$
(17)
$${\widetilde{Y}}_{{\text{masked}}}={\text{ArithmeticDecode}}({y}_{{\text{bytes}}},\widehat{\sigma }),$$
(18)
$$\widetilde{X}={g}_{s}({\widetilde{Y}}_{{\text{masked}}}),$$
(19)

where:

  • gs is the decoder that reconstructs the image \(\widetilde{X}\) from the processed \({\widetilde{Y}}_{masked}\).

During training, a random k% of the feature maps is masked, where k = 100 × u and u ~ Uniform(0, 1), so that the network learns to focus on significant features. During inference, the selection of k is dynamically adjusted based on real-time channel conditions and application requirements, enabling adaptive transmission efficiency. This adaptive selection allows the system to decode the image from a minimal number of feature maps initially, with subsequent transmissions providing additional feature maps to progressively enhance image quality.
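The masking scheme above can be sketched as follows. Here `keep_fraction` is the retained fraction k/100, and the helper name, array shapes, and optional channel ordering are hypothetical:

```python
import numpy as np

def mask_channels(y, keep_fraction, order=None):
    """Zero out all but the first keep_fraction of feature maps.

    y: (C, H, W) latent; order: optional importance ranking of channel indices
    (most important first). Masked channels are zero-padded, mirroring what the
    receiver does for untransmitted maps.
    """
    c = y.shape[0]
    k = int(round(keep_fraction * c))
    order = np.arange(c) if order is None else np.asarray(order)
    masked = np.zeros_like(y)
    keep = order[:k]
    masked[keep] = y[keep]
    return masked

rng = np.random.default_rng(1)
y = rng.normal(size=(192, 32, 48))   # latent with N = 192 channels
u = rng.uniform()                    # training: keep k = 100 * u percent of maps
y_train = mask_channels(y, u)
y_infer = mask_channels(y, 0.25)     # inference: keep the top 25% of maps
```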

Algorithm 1

Progressive Hyperprior Transmission

1: Offline Analysis:

2: for each feature map i in the latent representation do

3: Calculate the average MSE degradation Δi by masking map i across a validation dataset.

4: end for

5: Sort feature maps in ascending order of Δi to create an importance-ranked list L.

6: Online Transmission:

7: At transmission time t, obtain the bit budget Nbits from the channel state.

8: Determine the maximum number of feature maps, k, from list L such that the compressed size fits the budget Nbits.

9: Transmit the top k feature maps.

10: The receiver reconstructs the image \(\widetilde{X}\) using the arithmetic decoding and synthesis transform defined in Eq. (16), utilizing all feature maps received up to time t.
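A minimal Python sketch of Algorithm 1, in which the MSE degradation values and per-map compressed sizes are hypothetical stand-ins for the offline analysis and the arithmetic-coded sizes:

```python
import numpy as np

def rank_feature_maps(mse_degradation):
    """Offline step: sort feature maps by ascending masking-induced MSE degradation."""
    return np.argsort(mse_degradation)  # least important first

def select_maps_for_budget(ranking, bits_per_map, n_bits):
    """Online step: take a prefix of the most important maps that fits the budget."""
    chosen, used = [], 0
    for i in ranking[::-1]:                  # iterate most important first
        if used + bits_per_map[i] > n_bits:  # prefix semantics: stop at first overflow
            break
        chosen.append(int(i))
        used += bits_per_map[i]
    return chosen

delta = np.array([0.5, 0.02, 1.7, 0.3])   # hypothetical averaged MSE degradation per map
bits = np.array([900, 850, 1200, 700])    # hypothetical compressed size of each map (bits)
ranking = rank_feature_maps(delta)        # ascending importance: [1, 3, 0, 2]
maps = select_maps_for_budget(ranking, bits, n_bits=2000)
```

The receiver then decodes with the received prefix, zero-padding the remaining maps as in Eq. (16)-(19).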

VQGAN-based model integration

The VQGAN-based model employs vector quantization within its generative adversarial network architecture. This model’s system architecture is illustrated in Fig. 4, with the following process flow:

$$Y={g}_{a}(X),$$
(20)
$$Z={P}_{E}(Y),$$
(21)
$${Z}_{q}={\text{VectorQuantize}}(Z,CB),$$
(22)
$$\widetilde{Z}={\text{Dequantize}}({\widetilde{Z}}_{q},CB),$$
(23)
$$\widetilde{Y}={P}_{D}(\widetilde{Z}),$$
(24)
$$\widetilde{X}={g}_{s}(\widetilde{Y}),$$
(25)

where,

  • ga is the encoder transforming the input image X into the latent representation \(Y\in {{\mathbb{R}}}^{\frac{W}{8}\times \frac{H}{8}\times N}\).

  • PE projects Y into a lower-dimensional space \(Z\in {{\mathbb{R}}}^{\frac{W}{8}\times \frac{H}{8}\times 4}\).

  • VectorQuantize(Z, CB) quantizes Z into the token map Zq using the codebook CB; \({\widetilde{Z}}_{q}\) denotes the token map as received after transmission over the channel.

Additionally, the VQGAN-based model incorporates residual codebook clustering, enabling a hierarchical architecture that performs residual vector quantization in the bottleneck of the image autoencoder. This hierarchical structure allows for progressive decoding by transmitting indices corresponding to coarse and fine codebooks in successive transmissions, thereby refining the image quality incrementally without relying on lossless compression.

Fig. 4: Overview of the VQGAN-based transmission framework.
Full size image

System model for VQGAN-based image transmission, illustrating the generation of feature maps (Y) of size \(\frac{W}{8}\times \frac{H}{8}\times N\) and token maps (Z) of size \(\frac{W}{8}\times \frac{H}{8}\times 4\) during the encoding and decoding process.

Algorithm 2

Progressive VQGAN Transmission

1: Offline Setup:

2: A pre-trained VQGAN model with \({M}_{stages}^{\max }\) residual codebooks is used.

3: Online Transmission:

4: At transmission time t, obtain the bit budget Nbits from the channel state.

5: Calculate nbits_stage and determine the allowable residual stages Mstages as per Eq. (29).

6: Transmit the token indices corresponding to the first Mstages residual codebooks.

7: The receiver reconstructs the image by summing the contributions of all received residual stages and passing the result through the generator gs (Eq. (20)).
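Algorithm 2's budget computation (Eq. (29)) can be sketched as below; the image size, bit budget, and bpi are illustrative values:

```python
def bits_per_stage(width, height, bpi=8, downsample=8):
    """Each residual stage sends one index per token: (W/8)(H/8) indices at bpi bits."""
    return (width // downsample) * (height // downsample) * bpi

def stages_for_budget(n_bits_total, n_bits_stage, max_stages):
    """Eq. (29): number of residual stages that fit the slot's bit budget."""
    return min(n_bits_total // n_bits_stage, max_stages)

n_stage = bits_per_stage(768, 512)                        # 96 * 64 * 8 = 49,152 bits/stage
m = stages_for_budget(200_000, n_stage, max_stages=10)    # 4 stages fit this slot's budget
```

With a larger budget in the next slot, further stages are transmitted and summed at the receiver to refine the reconstruction.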

Adaptive and progressive transmission mechanism

Our progressive transmission mechanism is designed to adaptively adjust the amount of data transmitted based on real-time channel conditions and application-specific requirements. This mechanism is integral to both the hyperprior-based and VQGAN-based models, enabling them to balance robustness, throughput, and latency effectively.

Adaptive pipeline design

Rather than redesigning source and channel coding5,11,12, our approach leverages existing systems in which modulation and coding schemes are dynamically selected based on channel conditions. This adaptability allows our pipeline to be seamlessly integrated into various communication infrastructures. In each transmission slot, a permissible bit budget Nbits is allocated, determined by the current channel rate, which is in turn influenced by factors such as bandwidth and noise level. Based on this Nbits, the system performs the following selections:

  • Hyperprior-Based Model: Selects the maximum number of feature maps such that the top k% of channels fit within the allocated bit budget.

  • VQGAN-Based Model: Chooses the number of relevant codebooks for vector quantization based on the available channel rate.

Progressive image decoding

Hyperprior-based model

To enable progressive transmission for the hyperprior-based model, which is based on the architecture from Liu et al.9, we analyzed its sensitivity to channel masking. Specifically, we masked a percentage of the channels in the bottleneck and observed non-uniform sensitivity across channels: masking certain channels caused significant PSNR degradation, while masking others had minimal impact.

To investigate further, we treated the problem as a feature pruning task. We performed inference on 100 randomly selected batches of images (batch size = 8) from the ImageNet dataset, systematically masking each feature map and recording the mean squared error (MSE) degradation in image reconstruction. Averaging the MSE degradation across all tested images yields an importance metric for each feature map. Our results demonstrated that masking channels in order of least to most importance, based on this metric, achieved a smooth PSNR drop and maintained better reconstruction quality across varying mask percentages, as illustrated in Fig. 5. This finding motivates our progressive image transmission approach for hyperprior-based models. We utilize the sorting mechanism from Hojjat et al.18, which ranks feature maps according to their importance metrics, allowing an adaptive and progressive transmission strategy that prioritizes critical feature maps under varying channel conditions.

At the receiver, progressive transmission is handled by maintaining the previously received \({\widetilde{Y}}_{masked}\) feature maps. With each new transmission, additional feature maps are received and concatenated with the existing ones, allowing the decoder to incrementally refine the reconstructed image \(\widetilde{X}\). This approach ensures that a low-quality version of the image is available quickly, with subsequent transmissions enhancing image quality as more data becomes available.

Fig. 5: PSNR degradation versus the percentage of masked bottleneck channels for the image kodim05.
Fig. 5: PSNR degradation versus the percentage of masked bottleneck channels for the image kodim05.
Full size image

The solid line represents unsorted masking, while the dashed line corresponds to observer-based feature masking.

VQGAN-based model

For the VQGAN-based model, we adopt a hierarchical vector quantization approach to facilitate progressive image transmission. Building upon the methodology proposed by Zhu et al.32, we use their projector to compress the embedding dimension to 4, reducing the latent space complexity, and expand the codebook to 100K entries, establishing a large codebook that serves as the foundation for our residual vector quantization-based progressive transmission. To manage this extensive codebook efficiently, we perform two types of clustering. K-Means-Based Clustering: We utilize Facebook AI Similarity Search (Faiss)33 to perform k-means clustering, partitioning the large codebook into smaller, manageable subsets corresponding to different bits per index (bpi). For each desired bpi, the algorithm clusters the embeddings into \({2}^{bpi}\) clusters, each cluster count representing a different reconstruction quality. These clustered codebooks are extracted and stored, facilitating efficient codebook selection based on the available bit budget Nbits during transmission. The codebook selection process is formalized as an optimization problem:

$$C{B}^{* }=\arg \mathop{\max }\limits_{CB\in B}\,bpi\quad {\rm{s}}.{\rm{t}}.\quad {n}_{{\text{bits}}}\le {N}_{{\text{bits}}},$$
(26)

where the number of bits nbits is defined as:

$${n}_{{\text{bits}}}=\left(\frac{W}{8}\right)\left(\frac{H}{8}\right)\,bpi,$$
(27)

with CB* being the selected codebook from the library of available codebooks B, W × H denoting the dimensions of the image, and 8 being the downsampling factor along each spatial dimension of the encoder. Here, bpi represents the number of bits per index corresponding to the selected codebook. This optimization is directly constrained by the wireless environment. The channel SNR determines the instantaneous capacity C (Eq. (3)), which dictates the available bit budget Nbits for the current timeslot. The codebook selection CB* is thus a function of the link budget: as the SNR decreases, Nbits drops, forcing the selection of a codebook with a lower bpi to satisfy the constraint \({n}_{{\text{bits}}}\le {N}_{{\text{bits}}}\). Conversely, memory constraints at the edge device limit the maximum size of the codebook library B available for selection, creating a hardware-bound upper limit on the achievable reconstruction quality. This formulation ensures that the selected codebook maximizes the bits per index without exceeding the transmission bit budget.
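To make the clustering and budgeted selection concrete, the following sketch substitutes a minimal NumPy k-means for the Faiss implementation used in the paper; the codebook sizes and bpi values are scaled down for illustration:

```python
import numpy as np

def kmeans(embeddings, n_clusters, n_iter=100, seed=0):
    """Minimal Lloyd's k-means, standing in here for Faiss clustering."""
    rng = np.random.default_rng(seed)
    # fancy indexing copies, so updating centroids does not touch embeddings
    centroids = embeddings[rng.choice(len(embeddings), n_clusters, replace=False)]
    for _ in range(n_iter):
        assign = ((embeddings[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
        for k in range(n_clusters):
            members = embeddings[assign == k]
            if len(members):
                centroids[k] = members.mean(0)
    return centroids

def select_bpi(available_bpis, width, height, n_bits_budget):
    """Eqs. (26)-(27): largest bpi whose index stream fits the bit budget."""
    fitting = [b for b in available_bpis
               if (width // 8) * (height // 8) * b <= n_bits_budget]
    return max(fitting) if fitting else None

rng = np.random.default_rng(2)
big_codebook = rng.normal(size=(1000, 4))   # scaled-down stand-in for the 100K-entry codebook
clustered = {bpi: kmeans(big_codebook, 2 ** bpi) for bpi in (4, 6)}   # 2^bpi clusters per bpi
bpi_star = select_bpi((8, 10, 12), 768, 512, n_bits_budget=60_000)    # Kodak-sized image
```

In the actual pipeline, Faiss performs this clustering at scale and the selected clustered codebook is looked up per transmission slot.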

This formulation illustrates the principle of selecting the optimal codebook to maximize quality for a given bit budget. In our full pipeline, the budget Nbits is determined dynamically based on the real-time channel SNR and the corresponding modulation and coding scheme.

Residual Quantizer (RQ): For progressive transmission, we implement a residual quantizer (RQ) that quantizes the residuals of the input vectors sequentially. At each encoding stage m, the RQ selects the codeword cm that best approximates the residual of the input vector x relative to the previous encoding stages:

$${c}_{m}=\mathop{{\text{argmin}}}\limits_{j=1,\ldots ,K}{\left|\left|\mathop{\sum }\limits_{i=1}^{m-1}{T}_{i}[{c}_{i}]+{T}_{m}[j]-x\right|\right|}^{2},$$
(28)

where Ti[ci] represents the codeword from the ith transmission stage indexed by ci, and K is the number of codewords in the codebook. This sequential encoding allows the receiver to reconstruct the image progressively by first decoding coarse details and then refining them with finer residuals as more indices are received.
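This sequential encoding and the corresponding progressive decoding can be sketched in NumPy; the codebook sizes, stage count, and vector dimensions are illustrative:

```python
import numpy as np

def rq_encode(x, codebooks):
    """Sequentially quantize the residual at each stage, in the spirit of Eq. (28)."""
    indices, residual = [], x.copy()
    for T in codebooks:                                  # T: (K, d) codebook of one stage
        d2 = ((residual[:, None] - T[None]) ** 2).sum(-1)
        c = d2.argmin(1)                                 # nearest codeword to the residual
        indices.append(c)
        residual = residual - T[c]                       # pass the residual to the next stage
    return indices

def rq_decode(indices, codebooks, n_stages):
    """Sum the codeword contributions of the first n_stages received stages."""
    return sum(T[c] for T, c in zip(codebooks[:n_stages], indices[:n_stages]))

rng = np.random.default_rng(3)
codebooks = [rng.normal(size=(256, 4)) for _ in range(3)]   # 3 stages, bpi = 8 each
x = rng.normal(size=(32, 4))
idx = rq_encode(x, codebooks)
coarse = rq_decode(idx, codebooks, 1)    # base layer from the first stage only
fine = rq_decode(idx, codebooks, 3)      # refined once all residual stages arrive
```

Each additional stage received tightens the approximation of x, which is exactly the progressive refinement behavior described above.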

Progressive transmission strategy

In the VQGAN-based model, the receiver maintains a list of previously decoded indices corresponding to coarse codebooks. As new indices referencing finer codebooks are received, they are integrated with the existing indices to progressively decode and enhance the image. This hierarchical decoding process enables the immediate reconstruction of a basic image from the coarse indices, followed by incremental improvements as additional fine-grained indices are received. To optimize progressive transmission, we dynamically determine the number of encoding stages that can be transmitted within the available bit budget Nbits. Given that each stage of residual quantization requires a fixed number of bits nbits, the maximum number of stages Mstages that can be accommodated is calculated as:

$${M}_{{\text{stages}}}=\min \left(\left\lfloor \frac{{N}_{{\text{bits}}}}{{n}_{{\text{bits}}}}\right\rfloor ,{M}_{{\text{stages}}}^{\max }\right),$$
(29)

where \(\lfloor \cdot \rfloor\) denotes the floor operation and \({M}_{stages}^{\max }\) is the maximum number of codebooks learned for the residual vector quantization. This ensures that the total number of bits used does not exceed the budget Nbits. Each encoding stage corresponds to a residual codebook that quantizes the residuals of the input vector relative to the previous stages. All residual codebooks are of the same size, and each stage incrementally refines the image quality by addressing the residual errors from earlier stages. By determining Mstages, we can transmit up to Mstages stages of residual quantization within the given bit budget, allowing the receiver to progressively enhance the image quality as more data becomes available. This adaptive selection of encoding stages ensures that the most critical residuals are transmitted first, facilitating a balanced trade-off between image quality and transmission efficiency. Consequently, our approach leverages residual vector quantization to progressively enhance the fidelity of the reconstructed image, enabling robust and scalable image transmission under varying channel conditions without relying on lossless compression.

In summary, by integrating hyperprior-based and VQGAN-based LIC models into adaptive and progressive transmission pipelines that adjust the transmitted data to real-time channel conditions and employ progressive decoding strategies, we effectively balance robustness, throughput, and latency.

Complexity considerations

The two proposed pipelines exhibit different complexity profiles. The hyperprior-based model relies on arithmetic coding, which is computationally intensive and sequential, potentially leading to higher encoding and decoding latency; however, its latent representation is compact. In contrast, the VQGAN-based model replaces arithmetic coding with vector quantization, which involves efficient nearest-neighbor searches (codebook lookups). While this is highly parallelizable and fast, it requires storing large codebooks, leading to higher memory requirements. The choice between them thus involves a trade-off between computational latency (hyperprior) and memory footprint (VQGAN).

To evaluate the performance of our proposed adaptive and progressive image transmission pipeline, we conducted comprehensive experiments designed to assess the efficacy of our models under realistic wireless communication conditions. This section details the experimental setup, including the datasets used, model configurations, simulation parameters, implementation environment, and evaluation metrics.

Dataset and models

Dataset

We evaluated our models on the widely recognized Kodak dataset34, which consists of 24 high-quality, uncompressed color images frequently used as a benchmark for image compression and transmission evaluations. The images cover a variety of scenes and are of size 768 × 512 or 512 × 768 pixels. All images were preprocessed to match the input requirements of the pretrained models, including normalization. No data augmentation techniques were applied during testing, to maintain consistency.

Models

Our study investigates two primary LIC models: a hyperprior-based model and a VQGAN-based model.

Hyperprior-based model

We employed the architecture described in18, which is based on a hyperprior variational autoencoder for image compression. The model configurations are as follows:

  • Latent Channels (NY): 192 channels

  • Hyper Latent Channels (NZ): 128 channels

    The model is trained with rate-distortion parameter λ = 0.1 and with 0–100% of the channels retained. It is trained to support progressive decoding by randomly masking a percentage of channels during training, allowing it to reconstruct images from a partial set of feature maps.

VQGAN-based model

For the VQGAN-based model, we utilized the LIC architecture proposed in ref. 32, which integrates vector quantization with GANs for image compression. The model specifications include:

  • Codebook: A large initial codebook of 100,000 entries.

  • Projection: A projector that compresses the embedding dimension to 4, facilitating efficient vector quantization.

We use pretrained models on the ImageNet dataset35, and no additional training was conducted as part of this study.

Adaptive WebP model

In addition to the learned image compression models, we utilize adaptive WebP4 as a benchmark. Adaptive WebP is a widely adopted, non-learning-based compression standard whose compression ratio can be dynamically adjusted based on channel conditions. We chose adaptive WebP as a benchmark because, similar to our progressive transmission approach, it can be used to optimize performance by adjusting compression settings in real time. Furthermore, unlike some learning-based image transmission pipelines that are not publicly available, adaptive WebP is publicly accessible, ensuring transparency and reproducibility in our comparative analysis.

Implementation environment

Hardware

All experiments were conducted on a high-performance computing setup equipped with:

  • Processor: Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz with 48 cores.

  • Graphics Processing Unit (GPU): NVIDIA A40 with 46,068 MiB memory, supporting accelerated computations for deep learning models.

Software

The software environment was configured as follows:

  • Operating System: Ubuntu 20.04 LTS.

  • Deep Learning Framework: PyTorch version 2.2.1.

  • CUDA: Version 12.2, enabling GPU acceleration.

  • FAISS33: Utilized for efficient k-means clustering during codebook generation.

  • Additional Libraries: NumPy, SciPy, and other standard scientific computing libraries.

Channel simulation

To model realistic wireless communication environments characterized by multipath propagation and rapid signal attenuation, we employed a Rayleigh fading channel model. This model is widely used to represent multipath fading in wireless communications. The channel coefficients were generated using the sum-of-sinusoids method, approximating Clarke’s model for flat fading channels36. The key parameters of the channel simulation are as follows:

  • Maximum Doppler Frequency (fd): 10 Hz, representing a moderate level of mobility.

  • Symbol Duration (Ts): 1 ms.

  • Bandwidth (B): 100 kHz.

  • SNR: Varied over {−10, −5, 0, 5} dB to simulate different channel conditions.

We are mainly interested in image transmission under challenging channel conditions; hence, low SNRs and limited channel bandwidth are assumed. To ensure statistical robustness and average out the randomness inherent in Rayleigh fading channels, we simulated 1000 independent channel realizations.
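A minimal sum-of-sinusoids generator in the spirit of Clarke's model, using the simulation parameters listed above; the number of sinusoids and the phase conventions are illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def rayleigh_sos(n_samples, f_d, t_s, n_sinusoids=32, seed=0):
    """Sum-of-sinusoids approximation of Clarke's flat-fading model.

    Returns complex channel gains h[t]; |h| is approximately Rayleigh distributed.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n_samples) * t_s
    alpha = rng.uniform(0, 2 * np.pi, n_sinusoids)   # random arrival angles
    phi = rng.uniform(0, 2 * np.pi, n_sinusoids)     # random phases, in-phase branch
    psi = rng.uniform(0, 2 * np.pi, n_sinusoids)     # random phases, quadrature branch
    doppler = 2 * np.pi * f_d * np.cos(alpha)        # per-path Doppler shifts
    h_i = np.cos(doppler[None] * t[:, None] + phi[None]).sum(1)
    h_q = np.sin(doppler[None] * t[:, None] + psi[None]).sum(1)
    return (h_i + 1j * h_q) / np.sqrt(n_sinusoids)

h = rayleigh_sos(1000, f_d=10.0, t_s=1e-3)           # fd = 10 Hz, Ts = 1 ms, 1000 realizations
snr_db = 0.0
# per-slot achievable rate (bits/s) for B = 100 kHz, scaling the SNR by the fading power
rate = 100e3 * np.log2(1 + (10 ** (snr_db / 10)) * np.abs(h) ** 2)
```

The per-slot rate multiplied by the slot duration then yields the bit budget Nbits used by both pipelines.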

Performance metrics

Evaluating the performance of image compression and transmission systems involves multiple metrics that assess various aspects of system efficiency and image quality. The key metrics considered in this study are PSNR, SSIM, throughput, and latency.

Image quality metrics

PSNR is a widely used objective metric for measuring the quality of reconstructed images compared to the original images. It is defined as:

$${\text{PSNR}}=10\cdot {\log }_{10}\left(\frac{{{\text{MAX}}}^{2}}{{\text{MSE}}}\right),$$
(30)

where \(MAX\) represents the maximum possible pixel value of the image and mean squared error (MSE) is given by:

$${\text{MSE}}=\frac{1}{{N}_{1}{N}_{2}}\mathop{\sum }\limits_{i=1}^{{N}_{1}}\mathop{\sum }\limits_{j=1}^{{N}_{2}}{(I(i,j)-\widehat{I}(i,j))}^{2},$$
(31)

with N1 × N2 being the dimensions of the image. The SSIM evaluates the similarity between two images based on luminance, contrast, and structural information. It is calculated as:

$${\text{SSIM}}(x,y)=\frac{(2{\mu }_{x}{\mu }_{y}+{C}_{1})(2{\sigma }_{xy}+{C}_{2})}{({\mu }_{x}^{2}+{\mu }_{y}^{2}+{C}_{1})({\sigma }_{x}^{2}+{\sigma }_{y}^{2}+{C}_{2})},$$
(32)

where μx and μy are the mean values, \({\sigma }_{x}^{2}\) and \({\sigma }_{y}^{2}\) are the variances, and σxy is the covariance of the original and reconstructed images. C1 and C2 are constants to stabilize the division. SSIM values range from -1 to 1, with higher values indicating greater structural similarity between images.
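The PSNR computation of Eqs. (30)-(31) can be sketched directly; the test image and perturbation below are synthetic:

```python
import numpy as np

def psnr(img, img_hat, max_val=255.0):
    """Eqs. (30)-(31): PSNR computed from the mean squared error."""
    mse = np.mean((img.astype(np.float64) - img_hat.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images have unbounded PSNR
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(4)
img = rng.integers(0, 256, size=(512, 768), dtype=np.uint8)   # Kodak-sized frame
noise = rng.integers(-5, 6, size=img.shape)                   # small uniform perturbation
noisy = np.clip(img.astype(np.int16) + noise, 0, 255).astype(np.uint8)
quality = psnr(img, noisy)
```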

Transmission performance metrics

Latency is a critical performance indicator. While end-to-end latency includes compression (Lc), transmission (Lt), and decompression (Ld), our study focuses primarily on transmission latency (Lt), because in dynamic wireless environments Lt is often the most significant and variable bottleneck, directly impacted by fluctuating channel conditions. Our progressive pipeline is specifically designed to mitigate this component. Therefore, in our experiments, latency was quantified by the number of transmission slots required to successfully transmit the image, excluding encoding and decoding times, allowing us to isolate and evaluate the effectiveness of our transmission strategy. We further break down latency metrics into:

  • Average Waiting Time (Tavg): The mean time delay (in ms) for image transmission.

  • 99.9th Percentile Waiting Time (T99.9%): The time below which 99.9% of image transmissions are completed, critical for delay-sensitive applications.
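These two statistics can be computed from recorded per-image waiting times as follows; the exponential waits are purely illustrative stand-ins for measured data:

```python
import numpy as np

def latency_stats(wait_times_ms):
    """Average and 99.9th-percentile waiting time across channel realizations."""
    w = np.asarray(wait_times_ms, dtype=float)
    return w.mean(), np.percentile(w, 99.9)

rng = np.random.default_rng(5)
waits = rng.exponential(scale=20.0, size=1000)   # hypothetical per-image waits (ms)
t_avg, t_999 = latency_stats(waits)
```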

Throughput is measured as the effective data rate achieved during transmission, influenced by the modulation scheme and channel conditions. We use megapixels per second (Mpps) as the unit, where each pixel is counted only once to reflect the actual image area transmitted. This metric ensures that multiple bits transmitted to refine the same pixels do not inflate the throughput measurement.

Image fidelity assesses the accuracy and integrity of the transmitted image data. In this study, robustness is evaluated based on the fidelity of image reconstruction, as indicated by PSNR and SSIM values under varying channel conditions. Robustness also encompasses the successful completion of image transmission without significant degradation, especially under poor SNR conditions.

Implementation details

VQGAN-based model

For the VQGAN-based model, codebook generation and clustering are critical for progressive transmission. K-means clustering was performed with FAISS33 for efficient clustering of the embeddings. Codebooks were generated for bits per index (bpi) ranging from 8 to 16, using a maximum of 100 iterations of the clustering algorithm and Euclidean distance (L2 norm) as the distance metric. Ultimately, Mstages = 10 residual codebooks were trained, each with bpi = 8 (256 codewords), to support progressive residual transmission. Progressive transmission is achieved by transmitting indices corresponding to coarse codebooks first, followed by finer codebooks; the receiver integrates these indices to incrementally refine the image reconstruction.

Progressive transmission implementation

In the hyperprior-based model, we implemented progressive transmission by masking feature maps based on their importance, as detailed in Section IV. The receiver progressively reconstructs the image as more feature maps are received.

Experimental procedure

The experimental workflow was designed to systematically evaluate both the hyperprior-based and VQGAN-based models under identical channel conditions. The procedure is as follows:

  1. Data Preparation: The Kodak dataset images were preprocessed to match the input requirements of the models.

  2. Codebook Generation: For the VQGAN-based model, codebooks were generated using k-means clustering and residual quantization as described above.

  3. Channel Simulation: Simulated Rayleigh fading channels with varying SNR levels were generated for transmission simulations.

  4. Transmission Simulation: Each image was subjected to transmission over the simulated channels. The bit budget Nbits for each transmission slot was determined based on the current SNR and selected modulation scheme.

  5. Progressive Decoding: The receiver progressively reconstructed images by integrating newly received data with previously received data, allowing for incremental improvements in image quality.

  6. Metric Collection: At each stage of progressive decoding, PSNR, SSIM, latency, and throughput metrics were recorded to evaluate performance.

  7. Result Aggregation: Results were averaged over all images in the dataset and multiple channel realizations to obtain statistically significant performance evaluations.