Introduction

With the iterative innovation of information technology and the disruptive evolution of computational paradigms, the deep integration of communication technology and artificial intelligence (AI) is propelling human society into a new era of intelligent interconnection of all things1,2,3,4. In this process, the inevitable strain on bandwidth resources and explosive growth of data transmission present fundamental challenges to traditional transmission frameworks based on Shannon’s information theory5. Semantic communication, an emerging communication paradigm centered on information meaning, aims not merely to ensure accurate transmission of data symbols, but to place greater emphasis on the receiving end’s comprehension and effective utilization of informational intent and semantics6,7. This concept is similar to technologies such as zero-knowledge proof and homomorphic encryption in blockchain8, which attempt to strip away the surface representation of data and directly operate on or convey the underlying logic or meaning. Semantic communication transcends the conventional focus on “signal fidelity” in traditional communication systems, shifting towards “semantic fidelity” as its primary objective, thereby enhancing communication efficiency and intelligent capabilities.

In traditional communication systems, source coding and channel coding are designed as two independent functional modules. For instance, in conventional wireless image transmission systems, the transmission process is typically divided into two primary stages: image compression (employing standards like JPEG9, JPEG200010 and BPG11) and data transmission (utilizing error-correcting codes such as LDPC12, Turbo codes13, and Polar codes14). Such a decoupled approach facilitates the flexible design, development, and maintenance of communication systems through modular optimization. However, the separated source-channel coding (SSCC) paradigm inherently prevents the source and channel coders from operating synergistically, so the overall system cannot reach its optimal communication performance. Moreover, under low signal-to-noise ratio (SNR) conditions where channel decoding fails to ensure zero bit error rate (BER), this architecture is prone to the “cliff effect” – a phenomenon characterized by abrupt performance degradation once the SNR falls below critical thresholds. Consequently, in recent years, as semantic communication technologies have demonstrated their potential in improving wireless network performance, neural/deep learning-based end-to-end optimized joint source-channel coding (JSCC) for data transmission has emerged as an active research domain within semantic communications. This approach demonstrates consistent superiority over the SSCC paradigm across various tasks including text15, speech16, and image processing17,18,19,20,21,22,23,24,25,26, particularly in scenarios requiring semantic-aware transmission robustness.

As one of the pioneering works in this domain, a CNN-based deep JSCC scheme (DEEPJSCC) for wireless image transmission was first proposed in Ref.17. By constructing a CNN-parameterized encoder-decoder architecture and adopting an end-to-end jointly optimized training strategy, this approach transcended the theoretical limitations of SSCC designs in conventional systems, demonstrating superior performance over SSCC schemes in both Gaussian and Rayleigh fading channels. Building upon this foundation, Ref.18 expanded the research boundaries of deep joint coding by innovatively proposing a feedback-enabled semantic-aware image transmission system (DEEPJSCC-f). Through the introduction of a dual-mode channel feedback mechanism (incorporating ideal noiseless feedback and practical noisy feedback), the system dynamically adjusts encoder-side semantic feature extraction strategies to strengthen receiver-side image quality, though it should be noted that feedback does not theoretically increase the capacity of memoryless communication channels. To mitigate the performance degradation caused by channel SNR mismatch in DEEPJSCC, Xu et al.19 proposed attention-enhanced JSCC (ADJSCC), which employed attention modules to recalibrate channel characteristics, enabling dynamic adjustment of source coding compression ratios and channel coding rates according to varying SNR conditions. Subsequently, Yuan et al.20 refined the ADJSCC methodology into channel-blind JSCC (CBJSCC), achieving superior performance across different SNR levels without requiring prior channel state information. The work in Ref.21 optimized wireless image transmission efficiency and quality through dynamic responses to channel conditions and image content, realizing adaptive rate control in deep learning-based JSCC models. However, these adaptive schemes primarily focus on either rate control or SNR adaptation, but not both concurrently. To address this limitation, a flexible dual-adaptive JSCC scheme (DEEPJSCC-V) was proposed in Ref.22, integrating ADJSCC methodologies with adaptive masking mechanisms. This hybrid architecture enables transmission scheme adjustments based on both SNR variations and channel bandwidth ratios (CBR), demonstrating enhanced robustness albeit at the cost of marginal performance degradation compared to single-parameter adaptation approaches. On the other hand, JSCC frameworks based on Transformer architectures23,24,25 and Mamba state-space models26 have recently emerged as a prominent research direction due to their exceptional capability in capturing long-term dependencies and their semantic representation capacities. However, these methods are excluded from our performance discussion due to their substantially larger model sizes (quantitative comparisons are provided in subsequent sections) compared to CNN-based solutions, which makes them unfavorable for deployment on edge devices.

The aforementioned systems have demonstrated the immense potential of deep learning technologies in semantic communications. However, in DEEPJSCC and its variants17,18,19,20,21,22, the inherent limitations of conventional convolutions for feature extraction significantly compromise semantic information extraction efficiency, communication resource utilization, and system robustness. Specifically, the quadratic growth of parameters and computational complexity with increasing input channels and kernel dimensions leads to inefficient feature extraction. Furthermore, the lack of explicit model size consideration during design restricts their applicability in resource-constrained semantic communication scenarios (e.g., mobile terminals, IoT devices, edge nodes), where computational and memory budgets are strictly limited. Regarding channel state adaptation, the conventional squeeze-and-excitation (SE) channel attention mechanism27 primarily focuses on global information extraction while inadequately exploiting the value of local patterns. Consequently, in semantic communications, designing a channel attention mechanism capable of effectively fusing global and local information without significantly increasing computational complexity remains an unresolved yet critically important research challenge. Such a mechanism demands efficient operation while enabling rational weight allocation to achieve comprehensive yet precise information capture and utilization, particularly crucial for semantic-aware systems requiring balanced performance between feature granularity and processing efficiency.

To address the aforementioned challenges, this paper proposes an efficient semantic coding-decoding network based on star operation for wireless image transmission, termed STARJSCC. The star operation maps input features into an ultra-high-dimensional nonlinear feature space28, thereby amplifying the model’s representational capacity. By leveraging star blocks to extract semantic features and project them into this high-dimensional nonlinear space, our approach simultaneously improves the characterization of latent semantic features and boosts model capacity while maintaining computational efficiency. Our primary innovations are summarized as follows:

  • An encoder-decoder architecture based on a star-operation modulation network is designed. Considering the challenges associated with model size and complexity, the proposed scheme retains a convolution-based design approach. Distinctively, we develop a novel JSCC backbone network by integrating the star operation for feature fusion and leveraging the strengths of depthwise convolutions. This architecture is anticipated to enhance the model’s semantic feature representation capacity and transmission performance while significantly reducing model parameters and computational complexity.

  • A channel state adaptive module (CSA Mod) based on dynamic fine-grained attention mechanism is proposed. To enhance transmission quality and reconstruction fidelity by adapting to real-time channel conditions, we propose a critical plug-in module in STARJSCC, termed CSA Mod. This module refines the SENet architecture by introducing interactions between global and local features while incorporating SNR information as a guidance factor, thereby achieving channel state adaptation.

System model design

The system model of STARJSCC

As a fundamental paradigm in semantic communication systems, the DEEPJSCC framework achieves semantic-level joint source-channel modeling through the synergistic optimization of an end-to-end neural encoder-decoder. The input image is mapped by the joint source-channel encoder into a form suitable for transmission over wireless channels. After undergoing specific CBR and channel conditions, the receiver aims to reconstruct the original semantic information via the decoder, minimizing the semantic discrepancy between the reconstructed and input images. This approach emphasizes semantic fidelity while balancing communication efficiency and robustness.

Fig. 1
figure 1

Overview of STARJSCC system model for wireless image transmission.

Figure 1 illustrates the system model of STARJSCC. The essential part of it consists of a joint encoder \(E_{\theta }\) and a joint decoder \(D_{\varphi }\), where \(\theta\) and \(\varphi\) denote the trainable parameters of the encoder and decoder, respectively. Moreover, \({\varvec{x}}\in \mathbb {R}^{N}\) denotes the input image, where \(\mathbb {R}\) is the real number field and the dimensionality is \(N=C\times H \times W\), with C, H, and W representing the number of channels, height, and width of the input image, respectively. During the encoding process, \(E_{\theta }\) maps the input information \({\varvec{x}}\) and the SNR into an N-dimensional complex-valued semantic code (CSC) \({\varvec{y}}\in \mathbb {C}^{N}\), where \(\mathbb {C}\) denotes the complex number set. The encoding process is formulated as:

$$\begin{aligned} {\varvec{y}}=E_{\theta } \left( {\varvec{x}},\varepsilon ,\gamma \right) \in \mathbb {C}^{N}, \end{aligned}$$
(1)

where \(\varepsilon\) is the SNR and \(\gamma\) is the CBR of the input data. Before transmission through the wireless channel, the real and imaginary parts of the masked CSC (obtained after the SC operation described below) are extracted and concatenated to form the channel input signal. After channel processing, the output is reconstructed by concatenating the real and imaginary parts into a single real-valued vector, which serves as the input for the subsequent stage.
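For concreteness, the real–imaginary packing step can be sketched in PyTorch as follows; the function names and the convention of splitting a flat vector into halves are illustrative assumptions rather than the exact implementation.

```python
import torch

def to_complex(features: torch.Tensor) -> torch.Tensor:
    # Treat the first half of a flat real vector as real parts and the
    # second half as imaginary parts, forming complex channel symbols.
    real, imag = features.chunk(2, dim=-1)
    return torch.complex(real, imag)

def to_real(symbols: torch.Tensor) -> torch.Tensor:
    # Inverse mapping: concatenate real and imaginary parts back into a
    # single real-valued vector for the next processing stage.
    return torch.cat([symbols.real, symbols.imag], dim=-1)

x = torch.randn(8)                    # 8 real features -> 4 complex symbols
assert torch.allclose(to_real(to_complex(x)), x)
```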

To enable the model to control transmission rate according to different CBR values, the CSC \({\varvec{y}}\) needs to be processed through the semantic compression (SC) operation. During this process, the system generates a semantic mask tensor \({\varvec{\rho }} \in \left\{ 0,1 \right\} ^{N}\) conditioned on the CBR with input value \(\gamma\). Each element of the tensor \({\varvec{\rho }}\) is determined by the following formulation:

$$\begin{aligned} {\varvec{\rho _{i}}}={\left\{ \begin{array}{ll} 1 & \text { if } i< k \\ 0 & \text { otherwise } \end{array}\right. }, \end{aligned}$$
(2)

The SC process applied to \({\varvec{y}}\) can be formulated as:

$$\begin{aligned} {\varvec{y}}^{\prime } = {\varvec{y}}\cdot {\varvec{\rho }} \in \mathbb {C} ^{k}, \end{aligned}$$
(3)

where \(\cdot\) denotes the element-wise product. The compressed semantic feature dimension is given by \(k=\gamma \times N\). When \({\varvec{\rho _{i}}}=1\), the i-th semantic symbol is selected for transmission; when \({\varvec{\rho _{i}}}=0\), the i-th symbol is discarded as non-informative. In this way, the first k elements of the CSC \({\varvec{y}}\) are selected by the mask tensor \({\varvec{\rho }}\), generating the masked semantic information \({\varvec{y}}^{\prime }\).
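A minimal PyTorch sketch of the SC masking in Eqs. (2)–(3), assuming the mask is applied to a flat complex code, is:

```python
import torch

def semantic_compression(y: torch.Tensor, cbr: float) -> torch.Tensor:
    # Build the mask rho (1 for i < k, 0 otherwise) and keep only the
    # first k = cbr * N symbols, cf. Eqs. (2)-(3).
    n = y.numel()
    k = int(cbr * n)
    rho = torch.zeros(n, dtype=torch.bool)
    rho[:k] = True
    return y[rho]                     # masked semantic code y' in C^k

y = torch.randn(192, dtype=torch.cfloat)
y_prime = semantic_compression(y, cbr=1 / 6)   # keeps 32 symbols
```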

To satisfy the imposed power constraint P, the masked semantic information \({\varvec{y}}^{\prime }\) must be power normalized before transmission, formulated as follows:

$$\begin{aligned} \tilde{{\varvec{y}}}=\sqrt{kP}\times \frac{{\varvec{y}}^{\prime }}{\sqrt{{\varvec{y}}^{\prime *} {\varvec{y}}^{\prime } }}, \end{aligned}$$
(4)

where k is the number of semantic symbols to be transmitted, and \({\varvec{y}}^{\prime *}\) represents the conjugate transpose of \({\varvec{y}}^{\prime }\).
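The corresponding power normalization of Eq. (4) can be sketched as follows (a single-vector version; batching is omitted for brevity):

```python
import math
import torch

def power_normalize(y_prime: torch.Tensor, power: float = 1.0) -> torch.Tensor:
    # Scale the k selected symbols so that their average power equals P,
    # i.e. the scaled code satisfies (1/k) * y~^H y~ = P, cf. Eq. (4).
    k = y_prime.numel()
    energy = (y_prime.conj() * y_prime).real.sum()   # y'^H y' (total energy)
    return math.sqrt(k * power) * y_prime / torch.sqrt(energy)
```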

During transmission over physical channels, these semantic features are corrupted by stochastic noise, emulating the semantic interference encountered in practical communication scenarios. In this work, we consider the widely-used additive white Gaussian noise (AWGN) channel, whose transfer function can be expressed as:

$$\begin{aligned} \tilde{{\varvec{y}}}^{\prime }=h\left( \tilde{{\varvec{y}}},\varepsilon ,\gamma \right) =\tilde{{\varvec{y}}}+n, \end{aligned}$$
(5)

where n is zero-mean complex Gaussian noise with variance \(\sigma ^{2}\), i.e., \(n\sim \mathbb{C}\mathbb{N}\left( 0,\sigma ^{2} \right)\).
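A sketch of the AWGN channel in Eq. (5), with the noise variance derived from the SNR definition in Eq. (8), is given below; the per-component variance split is the standard convention for circularly symmetric complex noise.

```python
import math
import torch

def awgn_channel(y_tilde: torch.Tensor, snr_db: float, power: float = 1.0) -> torch.Tensor:
    # Noise variance from Eq. (8): SNR = 10*log10(P / sigma^2)  =>  sigma^2 = P / 10^(SNR/10).
    sigma2 = power / (10 ** (snr_db / 10))
    # Zero-mean complex Gaussian noise n ~ CN(0, sigma^2), cf. Eq. (5).
    noise = math.sqrt(sigma2 / 2) * (
        torch.randn_like(y_tilde.real) + 1j * torch.randn_like(y_tilde.real)
    )
    return y_tilde + noise
```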

Building on the preceding steps, we apply zero-padding to the received semantic information \(\tilde{{\varvec{y}}}^{\prime }\) transmitted through the wireless channel. This ensures that only the semantic symbols selected for transmission are affected by interference, while the rest retain zero values. Similar to the SC process, an identical mask tensor \({\varvec{\rho }} \in \left\{ 0,1 \right\} ^{N}\) is first produced based on the input CBR \(\gamma\). The detailed operation can be mathematically expressed as:

$$\begin{aligned} \hat{{\varvec{y}}}=\tilde{{\varvec{y}}}^{\prime }\cdot {\varvec{\rho }}, \end{aligned}$$
(6)
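The receiver-side zero-padding of Eq. (6) then amounts to the following (again written for a single flat code):

```python
import torch

def zero_pad(received: torch.Tensor, n: int) -> torch.Tensor:
    # Place the k received symbols back into a length-N code, with zeros in
    # the positions that were never transmitted, cf. Eq. (6).
    padded = torch.zeros(n, dtype=received.dtype)
    padded[: received.numel()] = received
    return padded
```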

Finally, the semantic decoder \(D_{\varphi }\) at the receiver reconstructs the original data based on the semantic information \(\hat{{\varvec{y}}}\). The decoded output \(\hat{{\varvec{x}}}\) can be formulated as:

$$\begin{aligned} \hat{{\varvec{x}}}=D_{\varphi }\left( \hat{{\varvec{y}}},\varepsilon ,\gamma \right) , \end{aligned}$$
(7)

Within the STARJSCC semantic communication model described above, the channel SNR serves as a critical factor influencing image semantic reconstruction, typically governed by the noise power. The SNR \(\varepsilon\) is mathematically defined as:

$$\begin{aligned} \varepsilon =10\log _{10}{\frac{P}{\sigma ^{2} }}(dB), \end{aligned}$$
(8)

We employ a loss function \(\mathcal {L}\) to jointly optimize the parameters of the encoder \(E_{\theta }\) and decoder \(D_{\varphi }\). This loss function is defined as the mean squared error (MSE) between the original information and the reconstructed output, expressed as follows:

$$\begin{aligned} \mathcal {L}_{\theta ,\varphi }\left( {\varvec{x}},\hat{{\varvec{x}}} \right) =d\left( {\varvec{x}},\hat{{\varvec{x}}} \right) = \frac{1}{N}\sum _{i=1}^{N}\left\| {\varvec{x}}_{i}-\hat{{\varvec{x}}}_{i} \right\| ^{2}, \end{aligned}$$
(9)

where \(d\left( {\varvec{x}},\hat{{\varvec{x}}} \right)\) denotes the MSE between the input image and the reconstructed image, N represents the dimensionality of the images \({\varvec{x}}\) and \(\hat{{\varvec{x}}}\), and \({\varvec{x}}_{i}\) and \(\hat{{\varvec{x}}}_{i}\) denote their i-th elements.

The objective of our training process is to minimize the loss function \(\mathcal {L}\). By iteratively applying the backpropagation algorithm, the encoder parameters \(\theta\) and decoder parameters \(\varphi\) are gradually updated so that the loss value decreases toward a minimum:

$$\begin{aligned} \left( \theta ^{*},\varphi ^{*} \right) =\underset{\theta ,\varphi }{\arg \min }\,\mathbb {E}_{{\varvec{x}},\varepsilon ,\gamma }\left[ \mathcal {L}_{\theta ,\varphi }\left( {\varvec{x}}, \hat{{\varvec{x}}} \right) \right] , \end{aligned}$$
(10)

where \(\theta ^{*}\) and \(\varphi ^{*}\) denote the optimal encoder and decoder parameters, respectively. During the training of a deep neural network, the network parameters are progressively refined until the loss value stabilizes and ceases to decrease significantly, at which point the neural network attains a convergent state.

The overall architecture of STARJSCC

Figure 2 delineates the overall architecture of the proposed STARJSCC for wireless image transmission. Within the joint encoder-decoder of this architecture, the Star Blocks and CSA Mod serve as the pivotal components driving its functionality.

Fig. 2
figure 2

The overall architecture design of our STARJSCC for wireless image transmission.

The encoder takes an RGB image \({\varvec{x}}\in \mathbb {R}^{H \times W \times 3}\) as input. Initially, the image is processed through a convolutional (Conv) layer with \(kernel\_size=9\) and \(stride=2\) to extract hierarchical features and perform downsampling. This operation reduces the spatial resolution of the input to \(\frac{H}{2}\times \frac{W}{2}\times C_{1}\), where \(C_{1}\) denotes the number of output channels after the Conv layer. Subsequently, the semantic features of the input image are further learned through \(n_{1}\) successive Star Blocks, during which the feature dimensionality remains unchanged. Finally, the CSA Mod integrates the learned semantic features with the channel state via an attention mechanism, adaptively recalibrating the weight distribution to achieve channel state adaptation. The aforementioned steps are collectively referred to as Stage 1. The encoding process comprises three stages, where the operations of Stage 2 and Stage 3 are similar to Stage 1, except that the Conv layer is replaced with a Down Sample layer. At each stage, the spatial resolution of the input is progressively reduced to lower computational complexity and facilitate the extraction of higher-level semantic features.

Following three stages of deep semantic feature extraction and learning, the CSC \({\varvec{y}}\) of input image is obtained. Prior to transmission over the wireless channel, the SC masking operation and power normalization procedure described in the preceding subsection are applied to \({\varvec{y}}\) to achieve transmission rate control. Finally, the received semantic information \(\tilde{{\varvec{y}}}^{\prime }\), transmitted over the wireless channel, is subjected to zero padding to obtain the decoder’s input \(\hat{{\varvec{y}}}\).

The decoder adopts a symmetrical design to the encoder, with the objective of analyzing and learning the received semantic information. Through a series of feature processing and reconstruction operations, it ultimately outputs a reconstructed image \(\hat{{\varvec{x}}}\) with dimensions \(H \times W \times 3\), which should closely approximate the original input image \({\varvec{x}}\). Corresponding to the Down Sampling layers in the encoder, the decoder employs Up Sampling layers to progressively restore the spatial resolution of the image. Ultimately, a transposed convolutional (Trans Conv) layer transforms the semantic features into the reconstructed image.

In this paper, we adopt Star Blocks as the primary semantic feature extraction module. As illustrated in Fig. 3a, each Star Block consists of two depthwise convolution (DW-Conv) layers and three fully connected (FC) layers. For normalization, we employ generalized divisive normalization (GDN)29, which is better suited for image reconstruction tasks. Regarding the activation function, the parametric rectified linear unit (PReLU)30, which is commonly adopted in the DEEPJSCC framework and its variants, is considered. The two branches within the Star Block leverage element-wise multiplication (i.e., the star operation) to fuse semantic features. This design not only enhances the efficiency of the model in extracting image semantic features but also strengthens its representational capacity for latent semantic features. This is because the star operation enables high-dimensional, nonlinear semantic feature mapping while circumventing the limitations of traditional approaches, which typically require a substantial increase in network width or computational overhead28. Additionally, the Down Sample modules used in Stage 2 and Stage 3 of the encoder, as well as the corresponding Up Sample modules in the decoder, are illustrated in Fig. 3b and c, respectively. The Down Sample module consists of a convolutional layer with \(kernel\_size=5\) and \(stride=2\), followed by a GDN layer. The Up Sample module comprises a transposed convolutional layer with \(kernel\_size=5\) and \(stride=2\), followed by an inverse GDN (IGDN) layer.
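To make the block structure concrete, a simplified PyTorch sketch of a Star Block is given below. The 7×7 depthwise kernels, the channel expansion factor, the residual connection, and the use of nn.BatchNorm2d in place of GDN are illustrative assumptions; the blocks described above use GDN/IGDN and PReLU.

```python
import torch
import torch.nn as nn

class StarBlock(nn.Module):
    """Sketch of a Star Block: DW-Conv -> two 1x1 "FC" branches fused by
    element-wise multiplication (the star operation) -> 1x1 projection ->
    DW-Conv, with a residual connection."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.dw1 = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)   # depthwise conv
        self.norm = nn.BatchNorm2d(dim)                            # stand-in for GDN
        self.fc1 = nn.Conv2d(dim, hidden, 1)                       # branch 1 (FC as 1x1 conv)
        self.fc2 = nn.Conv2d(dim, hidden, 1)                       # branch 2
        self.act = nn.PReLU(hidden)
        self.fc3 = nn.Conv2d(hidden, dim, 1)                       # projection back to dim
        self.dw2 = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(self.dw1(x))
        h = self.act(self.fc1(h)) * self.fc2(h)   # star operation: element-wise fusion
        h = self.dw2(self.fc3(h))
        return x + h                               # residual connection

feat = torch.randn(1, 48, 32, 32)
out = StarBlock(48)(feat)                          # shape preserved: (1, 48, 32, 32)
```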

Fig. 3
figure 3

(a) A fundamental Star Block. (b) Down Sample module used in the encoder. (c) Up Sample module used in the decoder.

In summary, the proposed model employs a symmetrical encoder-decoder architecture that integrates Conv layer, Trans Conv layer, Star Blocks, CSA Mod, Down Sample and Up Sample operations. This design enables effective extraction, encoding, and decoding of semantic features during wireless channel transmission, thereby achieving efficient image communication and high-fidelity reconstruction.

Proposed channel state adaptive module

Constructing a model capable of adapting to diverse channel environments without requiring fine-tuning to achieve efficient image semantic transmission remains a significant challenge in the domain of semantic communication. To optimize transmission quality and reconstruction fidelity, a critical plug-in module is proposed in our work, namely CSA Mod. By leveraging accurate modeling and prediction of input SNR, the CSA Mod dynamically adjusts its parameters and configurations in real time. This optimizes its adaptation to varying channel conditions, ensuring efficient and stable semantic transmission.

In ADJSCC19 and DEEPJSCC-V22, the traditional SE channel attention mechanism is employed to achieve SNR adaptation. However, it relies on FC layers to extract global features while lacking effective interaction and fusion with local information, resulting in suboptimal weight allocation for features critical to semantic reconstruction. Inspired by Ref.31, we introduce the CSA Mod, which is designed based on dynamic fine-grained attention. This module efficiently integrates global and local features and optimizes weight assignment to extract more discriminative image features, thereby providing precise feature representations for image semantic reconstruction. The design is illustrated in Fig. 4. The CSA Mod leverages a correlation matrix to capture interdependencies between global and local semantic information, facilitating their interaction and enabling more effective allocation of feature weights. Additionally, by incorporating the channel SNR as a reference factor for attention weight updates through an auxiliary network, the model learns diverse channel states, which in turn improves the robustness of the entire semantic communication system.

The fundamental principle of the CSA Mod is to enable multi-scale feature interaction and adaptive channel attention allocation through global-local contrastive modeling, which in turn strengthens the expressive capability of semantic features. The implementation workflow is described in detail as follows (a simplified code sketch is provided after the list):

Fig. 4
figure 4

Network structure of the proposed CSA Mod.

  1. 1)

    Global aggregation: Given an input semantic feature \(S\in \mathbb {R}^{H \times W \times C}\), a global feature vector \(F \in \mathbb {R}^{C \times 1 \times 1}\) is generated via the global average pooling (GAP) operation. The n-th channel element of F, denoted \(F_{n}\), can be expressed as:

    $$\begin{aligned} F_{n}=GAP\left( S_{n} \right) =\frac{1}{H\times W}\sum _{i=1}^{H}\sum _{j=1}^{W}S_{n}\left( i,j \right) , \end{aligned}$$
    (11)

    where \(S_{n}\left( i,j \right)\) is the value of the n-th channel feature map at position \(\left( i,j \right)\), and \(GAP\left( S_{n} \right)\) represents the global average pooling function, defined as:

    $$\begin{aligned} GAP\left( x \right) =\frac{1}{H\times W}\sum _{i=1}^{H}\sum _{j=1}^{W}x\left( i,j \right) , \end{aligned}$$
    (12)
  2. 2)

    Feature decomposition: The feature F is decomposed into \(F_{gc}\) (global contrastive features) and \(F_{lc}\) (local contrastive features), preserving global and local semantic information, respectively. The diag and band operations are then employed to extract global and local semantic features, enabling explicit modeling of multi-scale contextual dependencies. To capture local channel information while keeping the number of model parameters small, a band matrix B is employed for localized channel interaction. Let \(B=[b_{1}, b_{2}, b_{3},\dots ,b_{k}]\); then \(F_{lc}\) is given by

    $$\begin{aligned} F_{lc}=\sum _{i=1}^{k}F\cdot b_{i}, \end{aligned}$$
    (13)

    where k represents the number of adjacent channels considered. To capture global channel information and enhance the model’s representational capacity for global contexts, a diagonal matrix D is introduced to precisely characterize the interdependencies among all channels, thereby enabling effective extraction of global features. Let \(D=[d_{1}, d_{2}, d_{3},\dots ,d_{c}]\); then \(F_{gc}\) is given by

    $$\begin{aligned} F_{gc}=\sum _{i=1}^{c}F\cdot d_{i}, \end{aligned}$$
    (14)

    where c denotes the number of channels.

  3. 3)

    Transposed mutual interaction: The global contrastive feature \(F_{gc}\) and local contrastive feature \(F_{lc}\) are multiplied with the transposed counterparts of each other (\(F_{gc}^{T}\) and \(F_{lc}^{T}\)), respectively, to generate two distinct contrastive feature matrices. This operation strengthens the interaction between global and local features by explicitly modeling their cross-dimensional dependencies. The cross-correlation operation is employed to capture correlations across varying granularities between global and local information, formulated as follows:

    $$\begin{aligned} Q=F_{gc}\cdot F_{lc}^{T}, \end{aligned}$$
    (15)

    where Q denotes the correlation matrix.

  4. 4)

    Feature integration and control: Row-wise summation is performed on the two contrastive feature matrices, generating two sets of weight vectors. A learnable factor \(\eta\) and the Sigmoid function are employed to adjust their relative importance, followed by a weighted fusion to produce the integrated channel weight W. Additionally, an auxiliary network is introduced to extract SNR information as a reference term for weight adjustment. Considering the model complexity, this network comprises two FC layers and an activation function. The extracted SNR is fused with the channel weight W via an attention-based fusion operation, enabling the model to learn and adapt to wireless channel conditions, ultimately yielding the refined attention weight factor \(W^{*}\). The detailed procedure is outlined as follows:

    $$\begin{aligned}&F^{\omega }_{gc}=\sum _{j}^{c}Q_{i,j}, i\in 1,2,3,\dots ,c, \end{aligned}$$
    (16a)
    $$\begin{aligned}&F^{\omega }_{lc}=\sum _{j}^{c}\left( F_{lc}\cdot F_{gc}^{T} \right) _{i,j} = \sum _{j}^{c}Q_{i,j}^{T} , i\in 1,2,3,\dots ,c, \end{aligned}$$
    (16b)
    $$\begin{aligned}&W=\sigma \left( \sigma \left( \eta \right) \times \sigma \left( F_{gc}^{\omega } \right) + \left( 1-\sigma \left( \eta \right) \right) \times \sigma \left( F_{lc}^{\omega } \right) \right) , \end{aligned}$$
    (16c)
    $$\begin{aligned}&W^{*}=\mathcal {F}^{1\times 1}\left( W,\varepsilon \right) \end{aligned}$$
    (16d)

    where \(F^{\omega }_{gc}\) and \(F^{\omega }_{lc}\) denote the fused global and local channel weights, respectively, c represents the number of channels, and \(\mathcal {F}^{1\times 1}\) corresponds to the 1\(\times\)1 convolutional operation.

  5. 5)

    Feature recalibration: The final weight vector is multiplied with the original semantic feature to obtain the output semantic feature \(S^{*}\), as follows:

    $$\begin{aligned} S^{*}=W^{*}\otimes S, \end{aligned}$$
    (17)
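The following is a simplified PyTorch sketch of the CSA Mod described in steps 1)–5). The band width, the sizes of the auxiliary FC layers, and the exact way the SNR branch is fused with W are assumptions made for illustration rather than the precise module configuration.

```python
import torch
import torch.nn as nn

class CSAMod(nn.Module):
    """Simplified CSA module: global average pooling, diagonal (global) and
    band (local) channel interactions, a correlation matrix between the two,
    a learnable fusion factor, and an auxiliary FC network injecting the SNR."""

    def __init__(self, channels: int, band_k: int = 3):
        super().__init__()
        self.diag = nn.Parameter(torch.ones(channels))                 # diagonal matrix D (global)
        self.band = nn.Conv1d(1, 1, band_k, padding=band_k // 2, bias=False)  # band matrix B (local)
        self.eta = nn.Parameter(torch.zeros(1))                        # learnable fusion factor eta
        self.snr_net = nn.Sequential(nn.Linear(1, channels), nn.PReLU(), nn.Linear(channels, channels))
        self.fuse = nn.Conv1d(2, 1, 1)                                 # 1x1 conv fusing W and the SNR branch

    def forward(self, s: torch.Tensor, snr_db: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = s.shape
        f = s.mean(dim=(2, 3))                                         # GAP: (B, C), Eq. (11)
        f_gc = f * self.diag                                           # global contrastive features
        f_lc = self.band(f.unsqueeze(1)).squeeze(1)                    # local contrastive features
        q = f_gc.unsqueeze(2) @ f_lc.unsqueeze(1)                      # correlation matrix Q: (B, C, C), Eq. (15)
        w_gc = torch.sigmoid(q.sum(dim=2))                             # row-wise sums of Q, Eq. (16a)
        w_lc = torch.sigmoid(q.transpose(1, 2).sum(dim=2))             # row-wise sums of Q^T, Eq. (16b)
        g = torch.sigmoid(self.eta)
        w = torch.sigmoid(g * w_gc + (1 - g) * w_lc)                   # fused channel weight W, Eq. (16c)
        snr = self.snr_net(snr_db.view(b, 1))                          # auxiliary SNR branch
        w_star = torch.sigmoid(self.fuse(torch.stack([w, snr], dim=1)).squeeze(1))  # Eq. (16d)
        return s * w_star.view(b, c, 1, 1)                             # feature recalibration, Eq. (17)

s = torch.randn(2, 48, 16, 16)
snr = torch.full((2,), 10.0)                                           # 10 dB channel SNR
out = CSAMod(48)(s, snr)                                               # same shape as s
```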

Training methodology

In the preceding sections, we presented the overall architecture of the wireless image transmission system, the data transmission pipeline of STARJSCC, and the adopted solutions. To validate the effectiveness of the proposed methodology, a carefully designed training scheme is required to derive the STARJSCC model. The training process is outlined as follows:

At the macro level, the system model takes dataset samples as input, with the final outputs being the optimized model parameters \(\theta ^{*}\) and \(\varphi ^{*}\), enabling the model to autonomously learn the joint source-channel encoding process. Specifically, the model parameters \(\theta\) and \(\varphi\) are initialized, and the dataset is loaded. Iterative training is then conducted across multiple epochs.

During each training batch, the input data is first partitioned into \({\varvec{X_{b}}}=[{\varvec{x}}_{1},{\varvec{x}}_{2},{\varvec{x}}_{3},\dots ,{\varvec{x}}_{b} ]\). After that, the CBR \(\gamma\) and the SNR \(\varepsilon\) are randomly drawn from the ranges [1/20, 1/4] and [0, 25] dB, respectively. This stochastic training strategy ensures the model adapts to varying channel states and different CBR values through the decoupled attention mechanism and the static SC masking scheme. The encoder \(E_{\theta }\) encodes \(X_{b}\) using \(\gamma\) and \(\varepsilon\) to obtain the CSC \({\varvec{y}}=[{\varvec{y}}_{1},{\varvec{y}}_{2},{\varvec{y}}_{3},\dots ,{\varvec{y}}_{b}]\), and then the masked semantic information \({\varvec{y}}^{\prime }\) is generated by Eq. (2) and Eq. (3). Finally, the decoder \(D_{\varphi }\) decodes the noise-corrupted semantic information \(\hat{{\varvec{y}}}=[\hat{{\varvec{y}}}_{1}, \hat{{\varvec{y}}}_{2}, \hat{{\varvec{y}}}_{3},\dots ,\hat{{\varvec{y}}}_{b}]\), which is transmitted over the wireless channel, into the reconstructed data \(\hat{{\varvec{X}}}=[\hat{{\varvec{x}}}_{1}, \hat{{\varvec{x}}}_{2}, \hat{{\varvec{x}}}_{3},\dots ,\hat{{\varvec{x}}}_{b}]\).

Prior to concluding each training batch, the MSE loss \(\mathcal {L}_{i}=\frac{1}{N}\left\| {\varvec{x}}_{i}-\hat{{\varvec{x}}}_{i} \right\| ^{2}\) is computed for each sample, and the batch-averaged loss \(\mathcal {L}=\frac{1}{b} \sum _{i=1}^{b}\mathcal {L}_{i}\) is derived. The gradient information from \(\mathcal {L}\) is utilized to update the model parameters \(\theta\) and \(\varphi\) via backpropagation. The complete training procedure is summarized in Algorithm 1.
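A schematic PyTorch version of one training batch, reusing the component sketches from the system model section and assuming illustrative encoder/decoder call signatures, is:

```python
import random
import torch

# `encoder`, `decoder`, `semantic_compression`, `power_normalize`,
# `awgn_channel` and `zero_pad` refer to the sketches given earlier.
# For clarity, the channel helpers are applied as if the whole batch
# were one flat code; per-sample handling is omitted.
def train_epoch(encoder, decoder, loader, optimizer, device="cuda"):
    encoder.train()
    decoder.train()
    for x in loader:                                       # batch X_b
        x = x.to(device)
        cbr = random.uniform(1 / 20, 1 / 4)                # random CBR in [1/20, 1/4]
        snr = random.uniform(0.0, 25.0)                    # random SNR in [0, 25] dB
        y = encoder(x, snr, cbr)                           # CSC y, Eq. (1)
        y_prime = semantic_compression(y, cbr)             # Eqs. (2)-(3)
        y_tilde = power_normalize(y_prime)                 # Eq. (4)
        y_recv = awgn_channel(y_tilde, snr)                # Eq. (5)
        y_hat = zero_pad(y_recv, y.numel())                # Eq. (6)
        x_hat = decoder(y_hat, snr, cbr)                   # Eq. (7)
        loss = torch.mean((x - x_hat) ** 2)                # batch-averaged MSE, Eq. (9)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```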

Algorithm 1
figure a

Training Process for the STARJSCC

Experimental results

In this section, we first provide a comprehensive description of the simulation experimental configuration. Subsequently, we present the results of the simulation experiments, which aim to validate the effectiveness and robustness of our STARJSCC model in executing transmission tasks under diverse channel conditions. Furthermore, an ablation study is conducted to investigate the impact of different block designs on the experimental outcomes. Finally, a comparative analysis is carried out among the ADJSCC, DEEPJSCC-V, and our STARJSCC models, focusing on critical metrics such as model parameters and storage requirements.

Experimental setup

Dataset selection

In our simulation experiments, two datasets with distinct resolutions are considered. For the low-resolution images, the CIFAR10 dataset32 is employed, which comprises 32\(\times\)32-pixel color images spanning 10 different categories with a total of 60,000 images. These images are partitioned into 50,000 training images and 10,000 test images. Since the focus of our study is on communication tasks, category labels are not used during the experiments. The training set is used for model optimization, while the test set serves to evaluate model performance. For high-resolution images, the DIV2K dataset33 is adopted, containing 900 images, each exceeding 2000\(\times\)1000 pixels. For the testing phase, the Kodak dataset34, which includes 24 color images of 768\(\times\)512 pixels, and the CLIC2020 test set35, with images of approximately 2K resolution, are selected to assess the model’s performance on high-resolution images. This dataset selection strategy ensures a comprehensive evaluation of the proposed model’s performance and robustness across varying image resolutions in transmission tasks.

Metrics

To comprehensively evaluate the end-to-end semantic transmission performance of the proposed STARJSCC model against comparable methods, we employed two well-established evaluation metrics: the pixel-level measurement, peak signal-to-noise ratio (PSNR); and the perception-oriented assessment, structural similarity index (SSIM)36.

For PSNR, defined as the ratio of the peak signal power to the mean noise power, it is formulated as follows:

$$\begin{aligned} PSNR=10\log _{10}{\frac{MAX^{2} }{MSE} }\left( dB \right) , \end{aligned}$$
(18)

where \(MSE=d({\varvec{x}},\hat{{\varvec{x}}})\) represents the mean squared error between the input image \({\varvec{x}}\) and the reconstructed image \(\hat{{\varvec{x}}}\), and MAX denotes the maximum possible pixel value. All experiments were conducted on 24-bit-depth RGB images, where each color channel (red, green, blue) is allocated 8 bits. Consequently, \(MAX=2^{8}-1=255\).

SSIM is computed as follows:

$$\begin{aligned} SSIM\left( {\varvec{x}},\hat{{\varvec{x}}} \right) =\frac{\left( 2\mu _{{\varvec{x}}}\mu _{\hat{{\varvec{x}}}} +c_{1} \right) \left( 2\sigma _{{\varvec{x}}\hat{{\varvec{x}}} } +c_{2} \right) }{\left( \mu _{{\varvec{x}}}^{2}+\mu _{\hat{{\varvec{x}}} }^{2}+c_{1} \right) \left( \sigma _{{\varvec{x}}}^{2}+\sigma _{\hat{{\varvec{x}}} }^{2}+c_{2} \right) }, \end{aligned}$$
(19)

where \(\mu _{{\varvec{x}}}\) and \(\mu _{\hat{{\varvec{x}}}}\) represent the mean values of the original image \({\varvec{x}}\) and the reconstructed image \(\hat{{\varvec{x}}}\), respectively. \(\sigma _{{\varvec{x}}}^{2}\) and \(\sigma _{\hat{{\varvec{x}}} }^{2}\) denote the variances of the original and reconstructed images, quantifying their intensity distributions, and \(\sigma _{{\varvec{x}}\hat{{\varvec{x}}}}\) is the covariance between \({\varvec{x}}\) and \(\hat{{\varvec{x}}}\). The constants \(c_{1}\) and \(c_{2}\) are stabilization terms introduced to prevent division by near-zero values, ensuring numerical stability during computation.
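Both metrics can be computed as sketched below; the SSIM call relies on scikit-image and is shown only as one possible implementation (older scikit-image versions use the multichannel flag instead of channel_axis).

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(x: np.ndarray, x_hat: np.ndarray, max_val: float = 255.0) -> float:
    # PSNR in dB for 8-bit RGB images, cf. Eq. (18).
    mse = np.mean((x.astype(np.float64) - x_hat.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim(x: np.ndarray, x_hat: np.ndarray) -> float:
    # SSIM over the RGB image; channel_axis selects the color axis.
    return structural_similarity(x, x_hat, channel_axis=-1, data_range=255)
```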

Training details

The Adam optimizer37 is employed to update model parameters, where the weight decay and learning rate are both set to \(5 \times 10^{-4}\). During the training process, a StepLR scheduler is employed to dynamically adjust the learning rate, thereby mitigating the risk of convergence to local optima. The scheduler configuration utilizes a step size of 100 epochs and a multiplicative factor of 0.5. This implementation reduces the learning rate by 50% at every 100-epoch interval, with the maximum training duration set to 400 epochs. After each epoch, validation is performed with gradient updates disabled to evaluate the model’s performance on the validation set.
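A minimal sketch of this optimizer and scheduler configuration is shown below; the placeholder encoder/decoder modules merely stand in for the actual STARJSCC networks.

```python
import torch
import torch.nn as nn

# Placeholder modules stand in for the actual STARJSCC encoder/decoder.
encoder = nn.Conv2d(3, 16, 3, padding=1)
decoder = nn.Conv2d(16, 3, 3, padding=1)

params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=5e-4, weight_decay=5e-4)        # lr and weight decay from the paper
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)  # halve lr every 100 epochs

for epoch in range(400):                      # maximum training duration: 400 epochs
    # ... run one training epoch and a no-grad validation pass here ...
    scheduler.step()
```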

For training on the CIFAR10 dataset, the batch size is set to 64. For the DIV2K dataset, due to its higher demands on GPU memory and computational resources, the batch size is reduced to 4. Additionally, these images are resized to 256\(\times\)256 pixels during training to facilitate model optimization. This configuration aims to balance training efficiency and model performance, ensuring optimal training outcomes across datasets of varying resolutions. To maintain architectural consistency, both high- and low-resolution images are processed using a three-stage STARJSCC scheme with parameters \([n_{1},n_{2},n_{3}]=[2,3,6]\). All experiments were conducted on a Linux system utilizing the PyTorch framework and a single NVIDIA RTX 3060 GPU.

Results analysis

The proposed STARJSCC scheme is compared with the ADJSCC and DEEPJSCC-V methods. To mitigate the impact of stochastic channel noise on experimental results during performance evaluation, 10 transmission trials are conducted for each image, and the average PSNR and SSIM values are computed. This approach not only facilitates the acquisition of stable performance metrics but also ensures the reliability and comparability of the experimental outcomes.

Low-resolution experimental results

The proposed STARJSCC model is comprehensively evaluated under AWGN channel conditions, and its performance is rigorously analyzed through comparative experiments across varying SNR levels. Figure 5 presents a performance comparison on the CIFAR10 dataset under AWGN channel conditions across varying SNR levels. Specifically, Fig. 5a illustrates the PSNR performance of the proposed STARJSCC model against ADJSCC and DEEPJSCC-V at CBR = 1/12 and CBR = 1/6. It is evident that the STARJSCC model outperforms DEEPJSCC-V under both CBR settings. Compared to ADJSCC, the proposed model demonstrates comparable or superior adaptability to channel states. Notably, although STARJSCC does not surpass ADJSCC at low SNR levels, it achieves significant reductions in model complexity (detailed analysis is provided later). Furthermore, STARJSCC can adapt to different CBR values in a single model, a capability absent in ADJSCC. Figure 5b depicts the SSIM performance comparison under the same conditions. The SSIM results exhibit a trend consistent with Fig. 5a: the better the channel conditions, the more pronounced the performance advantage of STARJSCC becomes.

Fig. 5
figure 5

(a) PSNR performance curves versus the SNR over the AWGN channel. (b) SSIM performance curves versus the SNR over the AWGN channel. CBR = 1/12 and 1/6 for the CIFAR10 dataset.

Fig. 6
figure 6

(a) PSNR performance curves versus the CBR over the AWGN channel. (b) SSIM performance curves versus the CBR over the AWGN channel. SNR = 1 dB, 4 dB, and 10 dB for the CIFAR10 dataset.

Figure 6 illustrates the performance comparison of the CIFAR10 test set under AWGN channel conditions across varying CBR. The proposed STARJSCC and DEEPJSCC-V models are evaluated and compared at SNR levels of 1 dB, 4 dB, and 10 dB to assess their adaptability and robustness to different CBR conditions. The results demonstrate that the proposed model significantly outperforms DEEPJSCC-V in terms of both PSNR and SSIM under most conditions, while maintaining comparable performance even under extremely adverse channel scenarios.

High-resolution experimental results

To further validate the superiority of the proposed STARJSCC model in high-resolution semantic transmission, additional tests are conducted using the Kodak and CLIC2020 datasets. For high-resolution images, inputs are preprocessed by cropping to 512\(\times\)512 pixels, with other experimental settings consistent with the low-resolution tests. Figure 7 presents the performance comparison on the Kodak and CLIC2020 datasets under AWGN channel conditions across varying SNR levels. The results demonstrate that the STARJSCC scheme exhibits even more pronounced advantages in high-resolution image transmission. Specifically, Fig. 7a and b illustrate the high-resolution PSNR performance of STARJSCC against ADJSCC and DEEPJSCC-V at CBR = 1/12 and 1/6. The proposed STARJSCC outperforms both DEEPJSCC-V and ADJSCC, with the performance gap widening as SNR increases, reaching a maximum advantage of 2.73 dB. Figure 7c and d compare the high-resolution SSIM performance under the same settings. The STARJSCC model consistently achieves higher SSIM values than ADJSCC and DEEPJSCC-V, with a maximum SSIM gain of 0.01577 over DEEPJSCC-V. At CBR = 1/12, STARJSCC demonstrates significant SSIM advantages at lower SNR levels, though this margin diminishes as SNR increases. This phenomenon arises because semantic feature loss becomes more pronounced at lower CBR, inherently limiting the achievable reconstruction fidelity. Conversely, STARJSCC maintains substantial performance gains across all SNR levels at CBR = 1/6, with improvements becoming increasingly prominent as SNR rises.

Fig. 7
figure 7

(a)-(b) PSNR performance curves versus the SNR over the AWGN channel. (c)-(d) SSIM performance curves versus the SNR over the AWGN channel. CBR = 1/12 and 1/6 for the Kodak and CLIC2020 datasets.

Fig. 8
figure 8

(a)-(b) PSNR performance curves versus the CBR over the AWGN channel. (c)-(d) SSIM performance curves versus the CBR over the AWGN channel. SNR = 1 dB, 4 dB, and 10 dB for the Kodak and CLIC2020 datasets.

Figure 8 illustrates the performance comparison on the Kodak and CLIC2020 datasets under AWGN channel conditions across varying CBR conditions. Similar to the low-resolution test results, the proposed model exhibits strong adaptability under diverse CBR constraints, consistently outperforming DEEPJSCC-V across three distinct SNR levels. Specifically, for PSNR, the performance advantage of STARJSCC becomes increasingly pronounced as CBR increases, achieving a maximum performance gap of 2.15 dB. For SSIM, the highest improvement reaches 0.02267 under SNR = 1 dB.

In summary, compared to the low-resolution dataset evaluations, the proposed scheme demonstrates even more substantial improvements in high-resolution testing scenarios. This underscores the superior capability of STARJSCC in high-resolution image semantic transmission tasks relative to other models. This phenomenon arises because STARJSCC’s hybrid attention architecture effectively integrates both local and global information, enhancing the model’s capacity to capture fine-grained semantic features. This architectural strength enables exceptional performance on high-resolution images where abundant structural details exist. In contrast, the inherent loss of finer object boundaries and texture variations in low-resolution images fundamentally limits the full exploitation of STARJSCC’s advantages.

Ablation study

Study on block design

In order to investigate the impact of different Star Block designs on the performance of the STARJSCC system, an ablation study is conducted to analyze their architectural variations. We also compare the Star Block with a standard MobileNet block and an EfficientNet-style bottleneck under the same training configurations to justify its use. Four distinct Star Block variants (Block I, Block II, Block III, Block IV) are designed, as illustrated in Fig. 9, and tested within the STARJSCC framework. Table 1 presents the performance metrics of these variants under CBR = 1/12 and SNR = 9 dB, where the Storage metric indicates the storage overhead of models trained with each respective module.

The experimental results demonstrate that models based on Star Block and its variants outperform those using standard MobileNet block and EfficientNet-style bottleneck, justifying the selection of Star Block as the fundamental module for STARJSCC. Among these, Block I achieves the best performance in terms of both PSNR and SSIM, followed by Block II and Block IV, while Block III shows slightly inferior results. Notably, the model trained with Block I exhibits significantly lower storage overhead compared to the other three variants. This substantial reduction in storage requirements, combined with its superior performance, further validates Block I as the optimal design choice for our system.

Fig. 9
figure 9

Four distinct Star Block variants are designed, with Block I adopted as the standard configuration in the proposed STARJSCC framework.

Table 1 Performance comparison of different block variants under CBR = 1/12 and SNR = 9 dB, tested on the Kodak dataset.

Study on attention mechanism

Generally, the principle of the attention mechanism lies in minimizing the loss by adjusting the model’s focus on different image regions. In STARJSCC, we introduce the CSA mechanism to adapt to varying channel conditions, thereby improving the model’s semantic preservation and transmission capabilities. To understand the impact mechanism of the CSA Mod on semantic features during transmission, we focus on the scaling coefficients it generates. Specifically, we conduct 10 transmissions for images from the Kodak dataset at 5 different SNR values and compute the average of the scaling coefficients produced by the CSA Mod. The detailed distributions are illustrated in Fig. 10. We extract the scaling coefficients from the first 48 channels of the first and second CSA Mod in the encoder and visualize their distributions using heatmaps. The results show that after processing by the first CSA Mod, the scaling coefficients exhibit a distinct “stratification” phenomenon, indicating that they still retain noticeable variations at different SNR values. However, after refinement by the second CSA Mod, the differences in scaling coefficients at different SNRs diminish and become nearly identical. This trend aligns with the analysis in Ref.19, suggesting that channel noise has a more pronounced impact on low-level features than on high-level features.

Furthermore, to validate the effectiveness of the proposed CSA mechanism, we conduct comparative experiments with standard attention mechanisms including SE and Convolutional Block Attention Module (CBAM), and investigate the impact of SNR input on the attention module’s performance. Specifically, under the same training configurations, we evaluate different attention mechanisms within the STARJSCC framework for wireless image transmission, with the corresponding performance results summarized in Table 2. For ease of understanding, we refer to the attention modules without SNR input as CSA_wo_SNR, SE_wo_SNR, and CBAM_wo_SNR, while those with SNR input are denoted as CSA+SNR, SE+SNR and CBAM+SNR. The results of Table 2 demonstrate that our proposed CSA attention mechanism achieves the highest PSNR and SSIM values regardless of SNR input incorporation. Moreover, by embedding SNR into the CSA module, our approach achieves performance improvements of 0.26 dB and 0.00213 in PSNR and SSIM, respectively.

Fig. 10
figure 10

The scaling coefficients of the first 48 channels in the encoder of STARJSCC on AWGN channel (CBR = 1/12). (a) the scaling coefficients of the first CSA Mod. (b) the scaling coefficients of the second CSA Mod. The scaling coefficients of each channel are evaluated on the Kodak dataset.

Table 2 The PSNR and SSIM comparison of different attention mechanism under CBR = 1/6 and SNR = 13 dB, tested on the Kodak dataset.

Analysis of model parameters, computational complexity, storage requirements, and inference time

Under the condition of CBR = 1/6, we further evaluate the model parameters, storage requirements, FLOPs, and inference time of the conventional BPG + LDPC scheme, ADJSCC, DEEPJSCC-V, SwinJSCC_Base, MambaJSCC, and STARJSCC on the Kodak dataset. Notably, metrics such as parameters and storage overhead are independent of input image dimensions, as they solely depend on the model architecture. During testing, the kodim02 image is preprocessed by resizing to 256\(\times\)256 pixels before being fed into the model. We transmit it 10 times and take the average as the final inference time result.

As shown in Table 3, the proposed STARJSCC scheme demonstrates significant advantages in FLOPs, parameters, and storage overhead. Specifically, STARJSCC demonstrates significantly faster inference than the BPG + LDPC scheme. In terms of FLOPs and parameters, STARJSCC achieves reductions of 30.83% and 52.78% compared to MambaJSCC and ADJSCC, respectively. Furthermore, the storage cost of STARJSCC is only 59.90% of that of ADJSCC and 49.03% of that of DEEPJSCC-V. These improvements stem from its lightweight backbone network design, which replaces standard convolutions with DW-Convs and employs the star operation to fuse features from dual-branch structures, thereby improving the efficiency of semantic feature modeling. Although our model does not hold an inference-time advantage over the other JSCC schemes, its outstanding computational efficiency, parameter reduction, and storage economy make it particularly suitable for resource-constrained deployment scenarios, demonstrating significant practical value.
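For reference, the parameter count and the averaged inference time reported above can be measured as sketched below (FLOPs are typically obtained with a separate profiler such as ptflops); the model and image arguments stand in for the end-to-end pipeline and the preprocessed kodim02 input.

```python
import time
import torch

def count_parameters(model: torch.nn.Module) -> int:
    # Total number of trainable parameters (the "Parameters" column in Table 3).
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def average_inference_time(model: torch.nn.Module, image: torch.Tensor, runs: int = 10) -> float:
    # Average wall-clock time over repeated transmissions of the same image.
    model.eval()
    with torch.no_grad():
        if image.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(image)
        if image.is_cuda:
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs
```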

Table 3 Comparison of inference time, FLOPs, parameters, and storage requirements across different codec schemes.
Fig. 11
figure 11

Visual comparison of STARJSCC, ADJSCC and DEEPJSCC-V under AWGN channel at SNR=1dB, 5dB, 9dB, 15dB and 21dB.

Visualization analysis

To further validate the effectiveness of the proposed model, a set of visual comparisons is provided using the kodim21 image from the Kodak dataset. As illustrated in Fig. 11, the image reconstruction quality of STARJSCC, ADJSCC, and DEEPJSCC-V under AWGN channel conditions is visually compared, demonstrating the robustness and adaptability of the proposed framework. It is important to note that even when the SNR is constant during testing, inherent randomness in channel noise may introduce subtle variations in reconstruction outcomes. From these results, the STARJSCC scheme exhibits significant advantages across all SNR levels. Compared to DEEPJSCC-V, which supports SNR and CBR adaptation, STARJSCC achieves PSNR and SSIM improvements of 4.25 dB and 0.0276, respectively, under SNR = 21 dB. Furthermore, the proposed method effectively mitigates granular artifacts and shadowing distortions, producing reconstructed images with rich details and high fidelity. Consequently, STARJSCC better satisfies human visual perception requirements in semantic communication systems, bringing more natural and authentic visual experiences.

Fig. 12
figure 12

Performance comparison of original and transmitted images in object detection task, where SNR = 6dB and CBR = 1/12 during the transmission. (a)-(d) show the detection results of original images. (e)-(h) show the detection results of the images transmitted using ADJSCC. (i)-(l) show the detection results of the images transmitted using DEEPJSCC-V. (m)-(p) show the detection results of the images transmitted using STARJSCC.

In practical semantic communication scenarios, images received after wireless transmission are typically utilized for downstream tasks. To validate the usability of transmitted images in subsequent semantic processing, we conduct an object detection task using these images. The evaluation is performed with a YOLOv8 network initialized with officially released pre-trained weights. Specifically, we utilize images kodim06, kodim11, kodim20, and kodim23 from the Kodak dataset, along with their semantically transmitted versions (processed through ADJSCC, DEEPJSCC-V, and STARJSCC), as input to YOLOv8 to obtain detection results. The transmission parameters are configured with SNR = 6 dB and CBR = 1/12. The comparative results and performance metrics output by YOLOv8 are presented in Fig. 12. From these results, we observe that the transmitted images of all three models consistently meet the performance standards for object detection. Notably, the transmitted images perform better than the originals in terms of detection success rate in many cases. Furthermore, under conditions with strong background interference (e.g., kodim11), the detection performance of ADJSCC- and STARJSCC-transmitted images is significantly superior to that of both the original images and DEEPJSCC-V-transmitted images. This demonstrates that STARJSCC-transmitted images can be effectively utilized for downstream semantic tasks without compromising accuracy.

Conclusion

This paper proposes STARJSCC, a novel and highly flexible JSCC architecture that demonstrates exceptional adaptability within a single model, dynamically addressing diverse channel states and CBR conditions. Specifically, we design a star operation-based modulation network for the wireless image transmission codec framework, incorporating a plug-in CSA Mod that enables dynamic channel sensing and parameter adjustment while maintaining transmission quality. Extensive experimental results demonstrate that, compared to conventional CNN-based JSCC frameworks, STARJSCC not only significantly reduces model parameters, computational complexity, and storage overhead but also achieves superior image transmission quality. These advancements position STARJSCC as a promising solution for semantic communication systems in resource-constrained wireless scenarios.

In future work, we will explore the deep integration of the proposed framework with SOTA architectures such as Transformer and Mamba, with a focus on developing more efficient rate-adaptive compression algorithms and optimizing the model’s transmission efficiency and generalization capabilities. Furthermore, we have verified the model’s compatibility with mainstream deployment tools such as TensorFlow Lite and ONNX Runtime. We will explore STARJSCC’s generalizability in broader semantic communication scenarios (such as IoT device communication and satellite communication) and conduct cross-modal research to extend it to other data transmission types (such as video streams, voice signals and text information). Through the synergistic integration and refinement of these advanced techniques, we aim to lay the groundwork for novel methodologies and solutions in the design and optimization of next-generation wireless communication systems.