Introduction

The ability to see through occlusions1 is a key challenge in computer vision, with wide-reaching implications for tasks such as object detection, tracking, and scene reconstruction2,3,4,5,6,7. Single-view image approaches have explored inpainting-based solutions8,9,10. However, inpainting in the absence of explicit mask guidance becomes an ill-posed problem, as these methods rely on predefined masks to infer missing content. This constraint restricts their adaptability to real-world occlusions, which exhibit diverse and unpredictable patterns. Moreover, without leveraging auxiliary information, single-view methods struggle to reconstruct occluded regions with high fidelity, limiting their effectiveness.

LF imaging provides a promising alternative by capturing both spatial and angular information. This multi-view representation allows regions hidden in one view to be revealed in another, offering complementary cues that make occlusion removal more reliable11,12. The concept of LF imaging originated with Faraday13 and was later formalized by Gershun14. Its computational foundation was established by Adelson et al.15 through the plenoptic function, which represents light as a seven-dimensional (7D) radiance field, modeling it at wavelength \(\lambda\), time t, and spatial position (x, y, z) along direction \((\theta ,\phi )\) (Fig. 1A):

$$\begin{aligned} L = P(\theta ,\phi ,\lambda ,t,x,y,z). \end{aligned}$$
(1)

Due to its high dimensionality, a widely adopted simplification assumes that light radiance remains constant along its path and does not vary over time, leading to a five-dimensional (5D) representation16,17 (Fig. 1B):

$$\begin{aligned} L = P(x, y, z, \theta , \phi ). \end{aligned}$$
(2)

Levoy and Hanrahan18, building on earlier work by Curless and Levoy19 and Hanrahan20, introduced the widely used four-dimensional (4D) two-plane representation (Fig. 1C):

$$\begin{aligned} L = P(u, v, s, t), \end{aligned}$$
(3)

enabling efficient multi-view capture, post-capture refocusing, and depth-aware imaging, laying the foundation for LF applications in computer vision, occlusion removal, and immersive technologies 21,22.

Early LF occlusion removal methods primarily relied on encoder-decoder architectures. For instance, DeOccNet 23 used ResASPP to capture multi-scale features but struggled with large occlusions, while Mask4D 24 introduced 4D convolutions to maintain spatial-angular coherence. GAN-based approaches 25 refined occluded areas via semantic inpainting, and Zhang et al. 26 proposed a specialized adaptive filter for restoring occluded regions in sparse LFs, though its high memory demands limited applicability to denser datasets. Modular frameworks, such as ISTY 27, improved reconstruction accuracy by separating feature extraction, occlusion detection, and inpainting into distinct stages. More recent methods have emphasized expanding the receptive field to better capture both global context and local details. Hybrid CNN-Transformer architectures, such as those by Wang et al.28 and SwinSccNet29, combine local feature extraction with global contextual modeling to handle complex occlusions efficiently. Additionally, approaches integrating CSPDarknet53 and BiFPN 30 have enhanced multi-scale feature fusion, improving occlusion removal across diverse LF datasets.

Fig. 1

Visualization of LF representations: (A) 7D \(L(\theta , \phi , \lambda , t, X, Y, Z)\), (B) 5D \(L(X, Y, Z, \theta , \phi )\), and (C) 4D \(L(u, v, s, t)\) using two-plane parameterization, where light rays intersect parallel planes.

Despite recent progress, existing LF occlusion removal methods still face several critical limitations that motivate the development of our approach. First, they struggle to capture fine spatial details under large occlusions due to limited receptive fields and insufficient mechanisms for supplementing high-frequency information, making it challenging to detect all occluded points across sub-aperture images (SAIs). Second, effectively combining all SAIs in a grid-like angular sampling is non-trivial, as improper fusion can disrupt the intrinsic LF structure, hinder full exploitation of inter-view complementary information, and result in blurred reconstructions and color inconsistencies. Third, many approaches treat occluded and non-occluded pixels indiscriminately, allowing invalid pixels from occluded regions to degrade reconstruction quality and introduce artifacts. These challenges underscore the need for a model that preserves LF structure, adaptively discriminates between pixel types, and enriches multi-scale spatial details through an enlarged receptive field for accurate and robust occlusion removal.

To this end, we propose LF-PyrNet, a novel end-to-end deep learning model designed for the efficient removal of occlusions in LF images. LF-PyrNet addresses these challenges by expanding the receptive field through the integration of ResASPP and a modified receptive field block (RFB), enabling multi-scale feature extraction and improving spatial dependency modeling. The occlusion reconstruction network consists of three cascaded residual dense blocks (RDBs), each with four densely connected layers, followed by a feature pyramid network (FPN) for multi-scale feature fusion, ensuring robust occlusion reconstruction. Furthermore, the refinement module, integrating separable and standard convolutions, enhances texture details and structural consistency, leading to sharper and more realistic occlusion-free images. By explicitly addressing receptive field constraints, preserving LF structure, and enhancing multi-scale representation learning, LF-PyrNet demonstrates superior performance in reconstructing occluded regions, establishing it as a highly effective and reliable solution for complex LF occlusion scenarios.

Our contributions are fourfold:

  • Multi-scale receptive field expansion: We integrate ResASPP with a modified RFB to enhance receptive field learning, effectively capturing both local and global spatial dependencies for robust occlusion removal.

  • Hierarchical feature fusion and reconstruction: Our network leverages three cascaded RDBs and an FPN to enable effective multi-scale feature fusion, ensuring fine-grained occlusion reconstruction with minimal artifacts.

  • Detail enhancement and texture refinement: Our network uses a refinement module with separable and standard convolutions to improve texture details and maintain structural consistency, resulting in clearer and more realistic occlusion-free images.

  • Extensive evaluation: Experiments on synthetic and real LF data show the superiority of LF-PyrNet in high-quality reconstruction and handling complex occlusions.

The remainder of this paper is organized as follows: Section Related work explores related work and sets the stage for our contributions. Section Proposed method describes the technical design of our architecture. Section Experiments provides a comprehensive evaluation, both quantitatively and qualitatively, along with a comparative analysis. Section Ablation study investigates the effect of individual components on performance through an ablation study. Finally, Section Conclusion and future work discusses challenges, suggests future research directions, and concludes with the key findings of the paper.

Related work

This section explores prior methods for handling occlusions, including classical techniques and modern deep learning approaches for LF occlusion removal. It highlights key advancements to provide a clear foundation for our work.

Conventional methods

The foundation of synthetic aperture focusing in LF imaging was laid by Vaish et al.31, who introduced a resampling approach to improve visibility through partial occlusions. Their method aligned 4D LF data to enable refocusing across multiple planes, enhancing background clarity while blurring occluders in the foreground. Expanding on this, Vaish et al.32 applied synthetic aperture techniques to 3D reconstruction, contrasting their approach with traditional stereo methods and introducing occlusion-aware multi-view techniques based on color and entropy metrics. To facilitate LF acquisition, Vaish et al.33 developed a calibration method utilizing planar parallax to estimate camera positions and project images onto different planes, achieving greater accuracy than conventional techniques. Pei et al.34 later proposed a pixel-labeling approach to detect and mask occlusions. In a subsequent work, they incorporated image matting to generate all-in-focus images35. However, this method was limited by depth-specific focal ranges. Yang et al.4 addressed these limitations by segmenting scenes into layers, allowing refocusing at varying depths. Xiao et al.36 introduced an iterative reconstruction strategy that used clustering to separate occlusions from the background, refining the results through optimization. Challenges remain in handling large occlusions and improving depth consistency, which require further advances in LF imaging.

Deep learning-based methods

To overcome the limitations of conventional occlusion removal techniques, Wang et al. 23 introduced DeOccNet, a deep learning architecture. The network integrates an encoder-decoder structure with ResASPP to expand the receptive field and enhance occlusion comprehension. It employs a mask embedding technique to simulate occluded LF images, but it faces challenges in handling large occlusions and suffers from blurring due to weak spatial dependency modeling in SAI stacking. Addressing these drawbacks, Li et al. 24 proposed Mask4D, leveraging 4D convolution to maintain spatial coherence and angular consistency, significantly improving occlusion handling. Similarly, Zhao et al. 25 explored generative adversarial networks (GANs) for occlusion removal, enabling semantic inpainting that integrates occluded regions with background content for more realistic reconstruction. Zhang et al. 26 developed a dynamic microlens filter to extract features from shifted lenslet images in sparse LFs; however, its reliance on rigid background assumptions and substantial memory usage limit its applicability to denser LFs. To address these constraints, Zhang et al. 37 introduced LFORNet. The network integrates Foreground Occlusion Location (FOL), Background Content Recovery (BCR), and a refinement module. This design enables effective management of occlusions of varying scales using multi-angle view stacks (MVAS). Further advancements include Song et al.38, who proposed a dual-pathway fusion network that separately synthesizes center views and predicts occlusions, later combining them for improved reconstruction accuracy. Hur et al.27 introduced ISTY, a framework incorporating modules for feature extraction, occlusion detection, and inpainting, effectively handling occlusions in both sparse and dense datasets. However, its reliance on CNNs restricts its capacity to capture complex occlusion patterns due to limited receptive fields. To address this, Wang et al.28 developed a hybrid model integrating CNNs for local feature extraction and Swin Transformers for capturing global structures, improving performance in large occlusion scenarios. Expanding on this, Zhang et al.29 introduced SwinSccNet, incorporating ScConv blocks for efficient feature compression and a Swin-Unet to optimize occlusion removal performance while maintaining computational efficiency. Building on this progress, Senussi et al.30 integrated CSPDarknet53 for hierarchical features, BiFPN for multi-scale fusion, and HiNet refinement to improve occlusion robustness. Chen et al.39 proposed ATM-OAC, combining adaptive threshold mask prediction, occlusion-aware convolution, and subpixel complementation for enhanced spatial-angular extraction and view reconstruction. To address disparity-based LF challenges, Zhang et al.40 developed a progressive MPI method using occlusion priors and attention filtering to recover backgrounds accurately. Lastly, Zhang et al.41 introduced MANet, a lightweight framework jointly predicting occlusion masks and restoring backgrounds via gated spatial-angular aggregation and texture-semantic attention.

Proposed method

In this section, we introduce LF-PyrNet, a novel deep learning model designed to address the challenges of occlusion removal in LF images. Existing methods often face difficulties in capturing long-range dependencies and effectively reconstructing missing details. To overcome these limitations, LF-PyrNet enhances occlusion removal through multi-scale receptive field learning and hierarchical feature refinement. By expanding the receptive field and improving feature fusion, LF-PyrNet accurately reconstructs occluded regions while preserving fine structural details and texture fidelity.

As illustrated in Fig. 2, LF-PyrNet consists of three key components: a feature extraction module, an occlusion reconstruction module, and a refinement module. The network processes a 5\(\times\)5 grid of SAIs as input. The feature extraction module integrates ResASPP and a modified RFB to effectively capture both local and global contextual dependencies, ensuring a rich feature representation across multiple scales. These features form the foundation for occlusion-aware reconstruction. They are then processed by the occlusion reconstruction module, which consists of three cascaded RDBs, each composed of four densely connected layers, followed by an FPN for multi-scale feature fusion and progressive reconstruction of missing regions. Finally, the refinement module, integrating both separable and standard convolutions, enhances texture consistency and structural integrity, ensuring a seamless transition between reconstructed and non-occluded regions. The network is trained with the center-view (CV) SAI as the reference, guiding the model towards precise reconstruction and supervision.

Fig. 2

Overall architecture of LF-PyrNet for occlusion removal in light field images, consisting of feature extraction, reconstruction, and refinement.

LF feature extractor

As shown in Fig. 2, the feature extraction process begins with an initial convolutional layer, followed by the combination of ResASPP and a modified RFB. This design expands the receptive field by leveraging atrous convolutions with varying dilation rates in ResASPP to capture multi-scale features. The RFB further broadens the receptive field, enhancing the model’s ability to capture both local and long-range dependencies across the LF image.

The input tensor, denoted as \(L_0 \in \mathbb {R}^{U \times V \times H \times W \times C_{\text {in}}}\), represents the LF image, where \(U\) and \(V\) correspond to the angular dimensions, \(H\) and \(W\) denote the spatial dimensions, and \(C_{\text {in}}\) represents the number of input channels. Specifically, for our LF images, the input tensor is structured as \(L_0 \in \mathbb {R}^{5 \times 5 \times 256 \times 192 \times 3}\), capturing both the angular and spatial components, as well as the color channels of the LF data.

The convolutional layer initially processes the tensor by applying a \(1 \times 1\) kernel, with a stride of 1 and padding of 1, to the input. This operation serves as an efficient mechanism to interact with the angular dimensions, effectively merging the information from the \(U\) and \(V\) angles across the channel dimension. By doing so, the network preserves the spatial and angular resolutions of the LF data while ensuring that features from all channels interact seamlessly. The output of this initial convolution is formally defined as follows:

$$\begin{aligned} F_C = \text {Conv}_{1 \times 1}(L_0), \end{aligned}$$
(4)

where \(F_C \in \mathbb {R}^{(3 \times U \times V) \times H \times W}\) represents the transformed tensor, rich with angular, spatial, and channel-wise features. Subsequently, the output tensor \(F_C\) is fed into the Residual Atrous Spatial Pyramid Pooling (ResASPP) module, as shown in Fig. 3. This design expands the receptive field by leveraging atrous convolutions with varying dilation rates, \(d = \{1, 2, 4, 8\}\), to capture multi-scale features, enabling the model to effectively understand both local and global contextual information. The parallel atrous convolutions, each followed by a LeakyReLU activation with a leaky factor of 0.1, enrich the feature extraction process, making the model sensitive to diverse spatial patterns across the LF image.

Fig. 3

The overall structure of the ResASPP block.

The outputs of these convolutions are concatenated and processed through a \(1 \times 1\) convolution for channel reduction, followed by a residual connection with the input tensor \(F_C\). This residual connection ensures that the model retains essential details from the original features while enriching them with multi-scale contextual information.

The final output, \(F_R\), is computed as:

$$\begin{aligned} F_R = F_C + \text {Conv}_{1 \times 1} \left( \text {Cat} \left( \{ \text {LReLU} \left( \text {Conv}_d ( F_C ) \right) \} \right) \right) , \end{aligned}$$
(5)

where \(d \in \{1, 2, 4, 8\}\) represents the dilation rates. This process results in the output tensor \(F_R \in \mathbb {R}^{H \times W \times C_{\text {out}}}\), which effectively preserves multi-scale and context-aware features essential for accurate occlusion removal. By expanding the receptive field and capturing multi-scale patterns, the ResASPP module enhances the feature extraction process, ensuring better reconstruction of occluded regions.
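For concreteness, the following is a minimal PyTorch sketch of the ResASPP block in Eq. (5). Only the dilation rates \(\{1, 2, 4, 8\}\) and the LeakyReLU slope of 0.1 are specified above; the \(3 \times 3\) kernel of the atrous branches and the channel width of 64 are assumptions.

```python
import torch
import torch.nn as nn

class ResASPP(nn.Module):
    """Sketch of the ResASPP block of Eq. (5): parallel atrous convolutions
    with dilation rates {1, 2, 4, 8}, each followed by LeakyReLU(0.1),
    concatenated, fused by a 1x1 convolution, and added back to the input.
    Kernel size 3 and channel width are assumptions."""
    def __init__(self, channels=64, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d),
                nn.LeakyReLU(0.1, inplace=True),
            )
            for d in dilations
        ])
        # 1x1 convolution for channel reduction after concatenation
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, f_c):
        multi_scale = torch.cat([branch(f_c) for branch in self.branches], dim=1)
        return f_c + self.fuse(multi_scale)  # residual connection, Eq. (5)
```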

Following the ResASPP module, the output tensor \(F_{\text {R}}\) is further refined through the modified RFB, as shown in Fig. 4, to enhance feature extraction with multi-scale receptive fields. The modified RFB module, consisting of four parallel convolutional branches, captures information at various spatial scales, using different dilation rates to extend the receptive field without introducing excessive computational complexity. Each branch of the RFB performs the following: Branch 0 applies a simple \(1 \times 1\) convolution to capture fine local features, preserving detailed spatial information from the \(F_R\) tensor. Branch 1 first applies a \(1 \times 3\) convolution followed by a \(3 \times 1\) convolution, which captures medium-range dependencies. Then, a dilated \(3 \times 3\) convolution is used to expand the receptive field and integrate broader contextual information, effectively capturing both local and intermediate spatial interactions. Branch 2 employs \(1 \times 5\) and \(5 \times 1\) convolutions to extract even larger-scale features, with a dilated \(5 \times 5\) convolution applied afterward. This branch enhances the receptive field further, enabling the model to capture larger contextual information relevant to complex occlusions. Finally, Branch 3 uses \(1 \times 7\) and \(7 \times 1\) convolutions to extract long-range dependencies, followed by a dilated \(7 \times 7\) convolution. This allows the model to capture global context, critical for handling large occlusions and preserving overall scene coherence.

Fig. 4

Modified RFB architecture with parallel branches for multi-scale feature extraction and receptive field enhancement.

The outputs of these branches are concatenated along the channel dimension and passed through a \(3 \times 3\) convolution to reduce the dimensionality. The final result is added to the input tensor \(F_R\) through a residual connection, where a \(1 \times 1\) convolution of \(F_R\) serves as the shortcut. This ensures that both fine-grained features from the input tensor and the multi-scale contextual information from the RFB are preserved. Mathematically, the output of the RFB is computed as:

$$\begin{aligned} \begin{aligned} F_{\text {M}} =&\, \text {ReLU} \left( \text {Conv}_{3 \times 3} \left( \text {Cat} \left( \{ x_0, x_1, x_2, x_3 \} \right) \right) \right. \\&\, + \left. \text {Conv}_{1 \times 1} ( F_R ) \right) , \end{aligned} \end{aligned}$$
(6)

where \(\{ x_0, x_1, x_2, x_3 \}\) represents the outputs from each branch, and \(\text {Conv}_{1 \times 1} ( F_R )\) is the residual shortcut. The resulting tensor \(F_{\text {M}} \in \mathbb {R}^{H \times W \times C_{\text {out}}}\) contains enriched multi-scale features, capturing both local details and broader contextual dependencies, which are crucial for the next model stages.
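A minimal sketch of the modified RFB in Eq. (6) follows. The branch structure mirrors the description above; the dilation rate of 2 used for the dilated \(3 \times 3\), \(5 \times 5\), and \(7 \times 7\) convolutions and the uniform channel width are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

class ModifiedRFB(nn.Module):
    """Sketch of the modified RFB (Eq. 6): four parallel branches with
    asymmetric 1xk / kx1 convolutions followed by dilated convolutions,
    concatenation, a 3x3 fusion convolution, and a 1x1 residual shortcut.
    Dilation rates and channel widths are assumptions."""
    def __init__(self, channels=64):
        super().__init__()
        c = channels
        self.branch0 = nn.Conv2d(c, c, 1)                   # fine local features
        self.branch1 = nn.Sequential(                       # medium-range context
            nn.Conv2d(c, c, (1, 3), padding=(0, 1)),
            nn.Conv2d(c, c, (3, 1), padding=(1, 0)),
            nn.Conv2d(c, c, 3, padding=2, dilation=2),      # dilated 3x3
        )
        self.branch2 = nn.Sequential(                       # larger-scale context
            nn.Conv2d(c, c, (1, 5), padding=(0, 2)),
            nn.Conv2d(c, c, (5, 1), padding=(2, 0)),
            nn.Conv2d(c, c, 5, padding=4, dilation=2),      # dilated 5x5
        )
        self.branch3 = nn.Sequential(                       # long-range / global context
            nn.Conv2d(c, c, (1, 7), padding=(0, 3)),
            nn.Conv2d(c, c, (7, 1), padding=(3, 0)),
            nn.Conv2d(c, c, 7, padding=6, dilation=2),      # dilated 7x7
        )
        self.fuse = nn.Conv2d(4 * c, c, 3, padding=1)       # dimensionality reduction
        self.shortcut = nn.Conv2d(c, c, 1)                  # residual path on F_R
        self.act = nn.ReLU(inplace=True)

    def forward(self, f_r):
        x = torch.cat([self.branch0(f_r), self.branch1(f_r),
                       self.branch2(f_r), self.branch3(f_r)], dim=1)
        return self.act(self.fuse(x) + self.shortcut(f_r))  # Eq. (6)
```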

Occlusion Reconstruction (OR)

The occlusion reconstruction subnetwork, shown in Fig. 2, is built with three cascaded Residual Dense Blocks (RDBs), each enriched with Local Feature Fusion (LFF) to refine spatial details. To enhance multi-scale representation and global coherence, we incorporate Dense Feature Fusion (DFF), which combines Global Feature Fusion (GFF) and Global Residual Learning (GRL) for robust contextual aggregation. A Feature Pyramid Network (FPN) is also utilized to facilitate hierarchical feature integration and progressive refinement of occluded regions for precise restoration.

The first RDB receives the multi-scale feature tensor \(F_M\) from the previous feature extraction stage. Through dense connectivity and local residual learning, it begins refining the occluded regions. Each subsequent RDB further enhances the feature representation by adaptively fusing newly extracted features with those from earlier layers, improving the network’s capacity to handle complex occlusions.

More formally, let D represent the number of residual dense blocks. The output of the d-th RDB, denoted as \(F_d\), is computed recursively as

$$\begin{aligned} F_d = H_{\text {RDB},d}(F_{d-1}) = H_{\text {RDB},d} \big ( H_{\text {RDB},d-1} ( \dots H_{\text {RDB},1}(F_M) \dots ) \big ), \end{aligned}$$
(7)

where \(H_{\text {RDB},d}\) represents the composite operations (e.g., convolutions and ReLU activations) in the d-th RDB. Each \(F_d\) encapsulates rich local features, leveraging all convolutional layers within its block. After obtaining these hierarchical features from the RDB cascade, we perform Dense Feature Fusion (DFF) to integrate global information. The fused feature map is computed as

$$\begin{aligned} F_{\text {DF}} = H_{\text {DFF}}(F_{-1}, F_0, F_1, \dots , F_D), \end{aligned}$$
(8)

where \(F_{-1}\) denotes shallow feature maps from earlier stages, and \(H_{\text {DFF}}\) merges features across different levels.

Residual Dense Block (RDB)

As depicted in Fig. 5, the proposed RDB is designed to enhance feature representation through three key mechanisms: dense connectivity, local feature fusion (LFF), and local residual learning (LRL), which together establish a contiguous memory (CM) effect. In the d-th RDB, let \(F_{d-1}\) and \(F_d\) denote the input and output feature maps, respectively, each initially comprising \(G_0\) channels.

Fig. 5

Illustration of the RDB structure, which improves feature representation through dense connectivity, local feature fusion (LFF), and local residual learning (LRL).

Within each RDB, the output of the c-th convolutional layer is formulated as

$$\begin{aligned} F_{d,c} = \sigma \left( W_{d,c} \left[ F_{d-1}, F_{d,1}, \dots , F_{d,c-1} \right] \right) , \end{aligned}$$
(9)

where \(\sigma\) is the ReLU activation function, \(W_{d,c}\) represents the weights of the c-th convolutional layer, and the concatenation \(\left[ F_{d-1}, F_{d,1}, \dots , F_{d,c-1} \right]\) aggregates feature maps from the preceding RDB and all previous layers within the current block. This concatenation results in a feature set with \(G_0 + (c-1) \times G\) channels, where G is the growth rate. To mitigate the rapid increase in feature dimensions, local feature fusion (LFF) is applied using a \(1 \times 1\) convolution:

$$\begin{aligned} F_{d,\text {LF}} = H_{d\text {LF}} \left( \left[ F_{d-1}, F_{d,1}, \dots , F_{d,C} \right] \right) , \end{aligned}$$
(10)

where \(H_{d\text {LF}}\) denotes the transformation induced by the \(1 \times 1\) convolution across all C layers within the block. Finally, local residual learning (LRL) is incorporated to improve information flow, yielding the final output of the RDB as

$$\begin{aligned} F_d = F_{d-1} + F_{d,\text {LF}}. \end{aligned}$$
(11)

This design effectively combines dense feature aggregation with residual connections, thereby preserving hierarchical information and enhancing gradient propagation, which are key to accurate occlusion reconstruction.
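The following PyTorch sketch illustrates one RDB as defined by Eqs. (9)-(11), with four densely connected \(3 \times 3\) convolutional layers; the growth rate G = 32 and base width \(G_0\) = 64 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    """Sketch of a Residual Dense Block (Eqs. 9-11): C densely connected
    3x3 convolutions with growth rate G, local feature fusion (LFF) via a
    1x1 convolution, and local residual learning (LRL)."""
    def __init__(self, g0=64, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = g0
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth, 3, padding=1),
                nn.ReLU(inplace=True),
            ))
            channels += growth  # dense connectivity grows the concatenated width
        # LFF: 1x1 convolution back to G_0 channels, Eq. (10)
        self.lff = nn.Conv2d(channels, g0, 1)

    def forward(self, f_prev):
        feats = [f_prev]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # Eq. (9)
        f_lf = self.lff(torch.cat(feats, dim=1))          # Eq. (10)
        return f_prev + f_lf                              # Eq. (11), LRL
```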

Dense Feature Fusion (DFF)

After extracting local dense features using a series of RDBs, the network performs dense feature fusion (DFF) to exploit these hierarchical features on a global scale. The DFF module first applies Global Feature Fusion (GFF) by concatenating the outputs of all RDBs:

$$\begin{aligned} F_{GF} = H_{GF} \left( [F_1, F_2, \dots , F_D] \right) , \end{aligned}$$
(12)

where \(H_{GF}\) is a composite function that typically involves a \(1 \times 1\) convolution for adaptive fusion, followed by a \(3 \times 3\) convolution to further extract salient features. Subsequently, global residual learning (GRL) is employed to integrate the shallow features \(F_{-1}\) with the fused global feature \(F_{GF}\):

$$\begin{aligned} F_{DF} = F_{-1} + F_{GF}. \end{aligned}$$
(13)

This hierarchical fusion of local dense features and global residual learning creates a multi-scale representation essential for subsequent accurate occlusion-free image reconstruction.
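A compact sketch of the DFF stage (Eqs. 12 and 13), assuming three RDB outputs of equal channel width:

```python
import torch
import torch.nn as nn

class DFF(nn.Module):
    """Sketch of Dense Feature Fusion (Eqs. 12-13): global feature fusion
    (1x1 then 3x3 convolution over the concatenated RDB outputs) followed
    by global residual learning with the shallow features F_{-1}."""
    def __init__(self, channels=64, num_rdbs=3):
        super().__init__()
        self.gff = nn.Sequential(
            nn.Conv2d(num_rdbs * channels, channels, 1),  # adaptive fusion
            nn.Conv2d(channels, channels, 3, padding=1),  # salient-feature extraction
        )

    def forward(self, shallow, rdb_outputs):
        f_gf = self.gff(torch.cat(rdb_outputs, dim=1))    # Eq. (12)
        return shallow + f_gf                             # Eq. (13), GRL
```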

Feature Pyramid Network (FPN)

After executing Dense Feature Fusion (DFF), the occlusion reconstruction subnetwork incorporates a Feature Pyramid Network (FPN) to effectively aggregate multi-scale contextual features and reinforce hierarchical feature propagation. Designed to balance fine-grained spatial details with high-level semantic understanding, the FPN employs a structured fusion mechanism that integrates features across multiple scales. Operating through a bottom-up feature encoding process, the FPN extracts multi-resolution representations via a sequence of transformative ResBlocks, as shown in Fig. 6. This is followed by a top-down refinement stage, where hierarchical feature fusion is achieved through lateral connections, enabling effective cross-scale feature propagation. Using these pathways, the FPN preserves high-frequency details and broader context, improving the network’s ability to reconstruct occluded regions more clearly.

Fig. 6

Schematic representation of the FPN architecture, enhancing feature integration by fusing multi-scale features through pyramidal layers.

Given an input feature map \(F^{\text {DFF}}\) from the preceding DFF module, the bottom-up hierarchical encoding process extracts multi-scale features through successive transformations, formulated as:

$$\begin{aligned} F^{\ell } = H_{\text {ResBlock},\ell } (F^{\ell -1}), \end{aligned}$$
(14)

where \(H_{\text {ResBlock},\ell }(\cdot )\) represents the residual transformations at level \(\ell\), typically implemented using convolutional layers with skip connections to maintain feature integrity and enhance gradient flow. The ResBlock is a fundamental component in this process, structured with three sequential convolutional layers, batch normalization, and Leaky ReLU activation functions. Batch normalization stabilizes training by normalizing feature distributions, while Leaky ReLU activation addresses vanishing gradients, enhancing feature representation. A key aspect of the ResBlock is its skip connection, which directly propagates the input feature map to the output through convolution and batch normalization layers. This mechanism helps retain spatial details, prevents degradation in deeper network layers, and facilitates more effective training. Through residual learning, the network focuses on refining occlusion-reconstructed features while preserving the underlying structural integrity. The final activation layer ensures robust feature propagation, maintaining consistency in reconstructed features across multiple scales.

To enable effective feature propagation, the network employs a top-down fusion mechanism, where high-resolution details are preserved by combining deeper feature representations with intermediate-level features via lateral connections. This is formally expressed as:

$$\begin{aligned} \hat{F}^{\ell } = H_{\text {lat},\ell } (F^{\ell }) + \text {Upsample}(F^{\ell +1}), \end{aligned}$$
(15)

where \(H_{\text {lat},\ell }(\cdot )\) is a \(1 \times 1\) convolution applied to intermediate features for dimensionality alignment, and \(\text {Upsample}\) represents a spatial upsampling operation that enhances feature consistency across scales. The fused hierarchical features are further processed through a smoothing transformation to refine the occlusion-reconstructed representation, ensuring both local consistency and global contextual integrity:

$$\begin{aligned} F_{\text {FPN}} = H_{\text {smooth}}(\hat{F}^{1}, \hat{F}^{2}, \dots , \hat{F}^{L}), \end{aligned}$$
(16)

where \(H_{\text {smooth}}\) consists of convolutional layers, typically a combination of \(3 \times 3\) convolutions and nonlinear activations, which act to enhance local feature continuity while preserving multi-scale information. Thus, the FPN constitutes the final stage of the occlusion reconstruction subnetwork, consolidating multi-scale features into a robust occlusion-reconstructed representation \(F_{\text {FPN}}\). This output is subsequently transmitted to the refinement subnetwork, which performs additional fine-grained enhancement to ensure superior occlusion-free image reconstruction.
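The sketch below captures the FPN computation of Eqs. (14)-(16). For brevity, each bottom-up ResBlock is reduced to a single strided convolution with batch normalization and Leaky ReLU, and the smoothed finest-scale map is returned as the consolidated output; the number of pyramid levels, the stride, and the upsampling mode are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Sketch of the FPN stage (Eqs. 14-16): bottom-up encoding, lateral 1x1
    convolutions, top-down upsampling with addition, and 3x3 smoothing."""
    def __init__(self, channels=64, levels=3):
        super().__init__()
        # Bottom-up path: each level halves the spatial resolution (assumed stride 2);
        # a full ResBlock with three convolutions and a skip connection is simplified here.
        self.bottom_up = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, stride=2, padding=1),
                nn.BatchNorm2d(channels),
                nn.LeakyReLU(0.1, inplace=True),
            )
            for _ in range(levels - 1)
        ])
        self.lateral = nn.ModuleList([nn.Conv2d(channels, channels, 1)
                                      for _ in range(levels)])
        self.smooth = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=1)
                                     for _ in range(levels)])

    def forward(self, f_dff):
        # Bottom-up hierarchical encoding, Eq. (14)
        feats = [f_dff]
        for block in self.bottom_up:
            feats.append(block(feats[-1]))
        # Top-down fusion with lateral connections, Eq. (15)
        fused = self.lateral[-1](feats[-1])
        out = self.smooth[-1](fused)
        for lvl in range(len(feats) - 2, -1, -1):
            up = F.interpolate(fused, size=feats[lvl].shape[-2:], mode="nearest")
            fused = self.lateral[lvl](feats[lvl]) + up
            out = self.smooth[lvl](fused)                 # Eq. (16) smoothing
        return out  # finest-scale occlusion-reconstructed representation F_FPN
```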

Refinement module

The refinement subnetwork is designed to enhance the occlusion-reconstructed features at a fine-grained level, ensuring the generation of a high-quality, occlusion-free output image. As shown in Fig. 2, it consists of two key components: a Separable Convolution Block (SeparableConvBlock) followed by a standard convolutional layer that reduces the channel dimension to three, representing an RGB image.

Fig. 7

Illustrative diagram of the Separable Convolution (SepConv) Block.

Central to the refinement module is the SeparableConvBlock, which decomposes the standard convolution operation into two distinct processes: a depthwise convolution followed by a pointwise convolution, as shown in Fig. 7. In the depthwise stage, a \(k \times k\) convolution layer is applied, where \(k \times k\) represents the kernel size (set to \(3 \times 3\) in our work). A unique filter is applied independently to each input channel, enabling the block to capture spatial information while significantly reducing computational cost. The pointwise convolution, implemented as a \(1 \times 1\) convolution, then fuses these per-channel responses, facilitating the integration of information across channels. Finally, the output from the SeparableConvBlock, which refines the 64-channel feature representation, is passed through a \(3 \times 3\) convolutional layer that reduces the feature map to three channels, corresponding to the final occlusion-free RGB image. The full refinement operation can be expressed by the following equation:

$$\begin{aligned} I_{\text {out}} = \text {Conv}_{3 \times 3} \left( \text {SepConvBlock} \left( F_{\text {FPN}} \right) \right) . \end{aligned}$$
(17)

This structured sequence of operations guarantees that the output image \(I_{\text {out}}\) is free from occlusions, with enhanced clarity and detail.
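A minimal sketch of the refinement module in Eq. (17), combining a depthwise and a pointwise convolution with a final \(3 \times 3\) projection to RGB; the 64-channel input width follows the description above.

```python
import torch
import torch.nn as nn

class SeparableConvBlock(nn.Module):
    """Depthwise 3x3 convolution (one filter per channel) followed by a
    pointwise 1x1 convolution that fuses information across channels."""
    def __init__(self, channels=64):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class RefinementModule(nn.Module):
    """Sketch of Eq. (17): SeparableConvBlock then a standard 3x3 convolution
    mapping the 64-channel features to the occlusion-free RGB image."""
    def __init__(self, channels=64):
        super().__init__()
        self.sep = SeparableConvBlock(channels)
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, f_fpn):
        return self.to_rgb(self.sep(f_fpn))  # I_out, Eq. (17)
```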

Loss function

To optimize image reconstruction, we employ a hybrid loss function comprising MAE, SSIM, and Perceptual Loss, balancing pixel-wise accuracy, perceptual fidelity, and structural consistency:

$$\begin{aligned} \mathcal {L} = k_1 L_{\text {MAE}} + k_2 L_{\text {SSIM}} + (1-k_1-k_2) L_{\text {PER}}, \end{aligned}$$
(18)

where \(k_1 = 0.30\) and \(k_2 = 0.35\) are empirically set.

MAE loss minimizes pixel-wise discrepancies:

$$\begin{aligned} L_{\text {MAE}} = \frac{1}{H W} \sum _{i,j} | I_{i,j} - \hat{I}_{i,j} |. \end{aligned}$$
(19)

SSIM loss enhances structural coherence:

$$\begin{aligned} L_{\text {SSIM}} = 1 - \text {SSIM}(I, \hat{I}), \end{aligned}$$
(20)

where SSIM evaluates luminance, contrast, and structure.

Perceptual loss enforces high-level feature and texture consistency:

$$\begin{aligned} L_{\text {PER}} = L_{\text {FEAT}}^{\tau } + L_{\text {STYLE}}^{\tau }, \end{aligned}$$
(21)

where \(L_{\text {FEAT}}^{\tau }\) and \(L_{\text {STYLE}}^{\tau }\) denote the feature and style terms computed from the activations of a pretrained feature extractor \(\tau\). This formulation ensures fine detail preservation and robust occlusion handling.
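A sketch of the hybrid objective in Eq. (18) is shown below; `ssim_fn` and `perceptual_fn` are placeholders for external SSIM and VGG-style feature/style loss implementations, which are not specified here.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(pred, target, ssim_fn, perceptual_fn, k1=0.30, k2=0.35):
    """Sketch of the hybrid loss in Eq. (18). `ssim_fn` and `perceptual_fn`
    stand in for external SSIM and feature/style loss implementations
    (e.g. pytorch-msssim and a frozen VGG extractor), which are assumptions."""
    l_mae = F.l1_loss(pred, target)               # Eq. (19), pixel-wise MAE
    l_ssim = 1.0 - ssim_fn(pred, target)          # Eq. (20), structural term
    l_per = perceptual_fn(pred, target)           # Eq. (21), feature + style terms
    return k1 * l_mae + k2 * l_ssim + (1.0 - k1 - k2) * l_per
```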

Experiments

Experimental setup

To build upon the foundations laid in27 and30, we adopted their core principles for training and evaluating our network while making adjustments to better suit the specifics of our experimental setup. Our approach was guided by their comprehensive training techniques and evaluation criteria, ensuring a robust comparison while integrating our own modifications to optimize performance. The subsequent subsections detail the procedural steps taken during both training and testing phases.

Training dataset

For our LF-PyrNet network, we built a well-structured training pipeline using a carefully selected dataset that includes both real-world and synthetic occlusions. To create realistic occlusion cases, we applied the mask embedding strategy from23, where occlusion masks are added to occlusion-free LF images. This method allows us to generate a variety of occlusion patterns, helping the model learn to handle different levels of complexity. To introduce variability in occlusion patterns, we randomly embed one to three occlusion masks per sample, simulating different disparity conditions within the LF images. While the original dataset included 80 synthetic masks from23, we expanded it by adding 21 more masks, specifically designed to be larger and denser, making the occlusion removal task more challenging. These new masks were taken from real-world images to improve the model’s ability to reconstruct missing details in practical scenarios. To ensure accurate training, we only included LF images where occluded objects had negative disparity, guaranteeing reliable ground-truth data. In total, we selected 1,418 LF images from the DUTLF-V2 dataset42, which was captured using the Lytro Illum camera43. By integrating an enriched occlusion dataset and strategic augmentation, our approach enables the model to learn how to remove occlusions effectively, making it more adaptable to real-world challenges.
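The following is a hedged sketch of the mask embedding augmentation described above, pasting one to three RGBA occluders into every SAI with a per-view shift proportional to a sampled disparity. The array layout, alpha compositing, and disparity convention are assumptions rather than the exact procedure of the original strategy23.

```python
import random
import numpy as np

def embed_occlusions(lf, masks, num_range=(1, 3), disparity_range=(1, 4)):
    """Sketch of mask embedding: paste 1-3 RGBA occluders into every SAI of an
    LF array shaped (U, V, H, W, 3), shifting each paste by the angular offset
    times a sampled disparity to emulate a foreground occluder."""
    U, V, H, W, _ = lf.shape
    uc, vc = U // 2, V // 2                       # center-view angular index
    out = lf.astype(np.float32)
    for _ in range(random.randint(*num_range)):
        m = random.choice(masks)                  # (h, w, 4) RGBA occluder
        h, w = m.shape[:2]
        d = random.uniform(*disparity_range)
        y0, x0 = random.randint(0, H - h), random.randint(0, W - w)
        for u in range(U):
            for v in range(V):
                y = int(np.clip(y0 + (u - uc) * d, 0, H - h))
                x = int(np.clip(x0 + (v - vc) * d, 0, W - w))
                alpha = m[..., 3:4] / 255.0       # alpha-composite the occluder
                out[u, v, y:y + h, x:x + w] = (
                    alpha * m[..., :3] + (1 - alpha) * out[u, v, y:y + h, x:x + w]
                )
    return out
```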

Testing dataset

To thoroughly evaluate our network’s ability to handle occlusions in sparse LF images, we conduct experiments on both synthetic and real-world datasets. Specifically, we test our model on two synthetic sparse LF datasets: the 4-Syn and 9-Syn datasets23,28, which are particularly challenging due to their angular sparsity. These datasets contain sparsely sampled LF images with occluded regions at multiple disparity levels, enabling us to analyze the model’s ability to reconstruct missing details in complex scenarios. To further validate real-world performance, we use the Stanford CD dataset33 for real sparse LF data, which provides a ground truth occlusion-free image, allowing us to measure how accurately our model reconstructs occluded regions. In addition, for dense LF scenarios, we extract 615 images from the DUTLF-V2 test dataset42, supplemented by 33 real occlusion cases, providing a realistic evaluation under complex occlusion patterns. To mimic occlusions with multiple depth levels, we employ a mask embedding strategy, generating Single Occ and Double Occ cases with disparities ranging from 1 to 4. This setup enables a controlled yet realistic evaluation of how the model restores occluded details across different depth layers. Beyond synthetic dataset evaluations, we conduct qualitative tests on publicly available real-world LF datasets. The Stanford Lytro dataset44 and the EPFL-10 dataset45 feature dense LF scenes with intricate occlusion patterns and varying depth complexities, providing a strong testbed for assessing real-world applicability. By using both synthetic and real-world datasets, we set up a complete evaluation framework to test our network on different types of occlusions in LF data.

Training details

For our experiments, we use the DUTLF-V2 dataset42, which offers high-resolution LF images with a \(9 \times 9\) angular and \(600 \times 400\) spatial resolution. For our purposes, we focus on the central \(5 \times 5\) views, reducing the spatial resolution to \(300 \times 200\). To enhance the training procedure, we center-crop and horizontally flip the images, resizing them to \(256 \times 192\). To introduce occlusions, we employ a mask embedding strategy that randomly selects, combines, and positions one to three RGB masks within the images. We train the model using the ADAM optimizer with settings \((\beta _1, \beta _2) = (0.5, 0.9)\) and a batch size of 18. Regularization is applied with \(\lambda _1 = 0.01\) and \(\lambda _2 = 120\), while the learning rate begins at 0.001 and is halved every 150 epochs. The entire training process spans 500 epochs and takes around 1 day on a single Nvidia GeForce RTX 4090 GPU, using the PyTorch framework.
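The optimizer and schedule described above can be sketched as follows; `model`, `train_loader`, `ssim_fn`, and `perceptual_fn` are placeholders, the batch size of 18 is configured in the data loader, and the regularization terms \(\lambda _1\) and \(\lambda _2\) are omitted since their exact form is not detailed here.

```python
import torch

# Minimal sketch of the optimization schedule: Adam with betas (0.5, 0.9),
# initial learning rate 1e-3 halved every 150 epochs, 500 epochs in total.
# hybrid_loss is the sketch given in the loss-function section.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.5, 0.9))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=150, gamma=0.5)

for epoch in range(500):
    for lf_batch, target_cv in train_loader:   # occluded LF input and center-view ground truth
        optimizer.zero_grad()
        pred = model(lf_batch)
        loss = hybrid_loss(pred, target_cv, ssim_fn, perceptual_fn)  # Eq. (18)
        loss.backward()
        optimizer.step()
    scheduler.step()
```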

Experimental results

To evaluate our model, we perform a series of experiments on de-occluded LF images, benchmarking its performance against ten state-of-the-art LF occlusion removal techniques, including DeOccNet 23, Mask4D 24, Zhang et al.26, LFORNet37, Wang et al.28, ISTY 27, Senussi et al. 30, ATM-OAC 39, Zhang et al. 40, and MANet 41. To gain deeper insight into the role of angular information, we also compare our model with popular single-image inpainting methods, such as RFR 46 and LBAM 47. For fairness, DeOccNet 23 and Senussi et al.’s method 30 are retrained using our consistent training setup. The ISTY 27 model is evaluated using the authors’ pretrained weights, while performance results for RFR 46, LBAM 47, and Zhang et al.26 are adopted from ISTY27. Other methods are evaluated using results from their original papers due to unavailable code.

Quantitative results

LF-PyrNet delivers outstanding performance on sparse LF datasets and exhibits competitive results on dense LF datasets. This is evidenced by the quantitative analysis in Table 1, where the PSNR and SSIM metrics highlight its strong effectiveness in occlusion removal.

For synthetic sparse LF datasets, LF-PyrNet achieves the highest PSNR score on 9-Syn (28.13 dB), outperforming all methods including Senussi et al. 30, the previous top performer in this category. Although LF-PyrNet's 4-Syn score (27.41 dB) ranks just outside the top three, behind Wang et al. 28 (27.87), MANet 41 (29.17), and Zhang et al. 40 (29.70), these results highlight its effectiveness in managing occlusions across different disparity levels. Furthermore, on the real sparse LF dataset (CD), LF-PyrNet again achieves the best PSNR (25.79 dB), demonstrating its strong generalization to real-world occlusion cases. In terms of SSIM, which measures the structural similarity and perceptual quality of the reconstructed images, LF-PyrNet consistently achieves the highest scores on 9-Syn (0.870) and CD (0.887). On 4-Syn, it ranks third (0.872), yet still outperforms most competing methods. This highlights its strong visual precision and structure preservation despite challenging occlusions. The model's ability to excel in sparse LF settings can be attributed to its innovative architectural components. The RFB enhances contextual understanding by capturing multi-scale features, allowing the model to process occlusions of varying sizes more effectively. The RDB facilitates feature reuse and long-range dependency modeling, improving texture consistency and fine-detail reconstruction. The FPN strengthens multi-scale occlusion-aware feature fusion, ensuring that occluded regions are restored with higher accuracy. Unlike LF-PyrNet, methods like RFR 46 and LBAM 47 face limitations, as their single-image inpainting approach neglects the rich angular and background cues inherent in LF data, leading to suboptimal performance in occlusion removal. DeOccNet 23 demonstrates reasonable performance but lacks consistency, particularly in more complex occlusions. Zhang et al. 26 make notable strides in specific cases, but their approach is constrained by assumptions about background visibility, limiting its performance in broader scenarios. Conversely, ISTY 27 and Senussi et al. 30 face challenges in handling occlusions and sparse data due to their reliance on the local receptive fields of CNNs. While Wang et al. 28, MANet 41, and Zhang et al. 40 previously achieved the best results, their methods were not evaluated on the 9-Syn or CD datasets, which limits their comparability.

Table 1 Detailed quantitative evaluation on sparse and dense LF datasets using PSNR/SSIM. The best, second-best, and third-best results are shown in bold, underline, and bold-underline, respectively. A dash (–) indicates no evaluation of the model for that dataset.

For dense LF datasets, LF-PyrNet remains competitive. In the Single Occ scenario, it attains a PSNR of 31.96 dB, which is slightly lower than ISTY27 (32.44 dB) but still superior to other competing methods. Similarly, in the Double Occ setting, LF-PyrNet achieves 30.11 dB, demonstrating effective handling of occluded regions, even under multi-occlusion conditions. Although its SSIM scores (0.826 for Single Occ and 0.828 for Double Occ) are slightly lower than those of some methods, its higher PSNR in most other cases suggests it recovers more accurate pixel values, indicating stronger pixel-wise reconstruction.

Overall, LF-PyrNet achieves state-of-the-art performance, leading in PSNR across most cases and delivering robust occlusion removal, even when slightly behind in SSIM for dense scenarios.

Qualitative results

Fig. 8

Visual comparison of LF-PyrNet and existing methods on the sparse LF dataset.

By effectively reconstructing occluded regions with enhanced structural integrity and precise texture details, LF-PyrNet surpasses existing methods on sparse LF datasets, as seen in Fig. 8. In the first four rows, which correspond to synthetic sparse scenes, single-image inpainting methods struggle to recover fine details. RFR 46 and LBAM 47 produce imperfect reconstructions, exhibiting noticeable blurring and missing textures in the occluded regions. While DeOccNet 23 shows a partial improvement, it still introduces structural inconsistencies, particularly in complex patterns, such as the lattice structures in rows 2 and 3. Zhang et al.26 yields outputs with excessive smoothness, which blurs crucial details, whereas ISTY27 achieves better structural restoration but fails to eliminate residual distortions. Senussi et al. 30 offer notable results, yet exhibit difficulties in restoring intricate patterns and textures, as seen in rows 3 and 4. LF-PyrNet clearly outperforms other methods in preserving textural details and geometric consistency. The integration of the RFB allows for more effective multi-scale context aggregation, enhancing occlusion recovery in structured patterns, as demonstrated in row 2, where fine details in the interior scene are reconstructed more accurately. The RDB strengthens feature propagation, enabling the model to recover high-frequency details, which is particularly evident in the restoration of lattice structures in rows 3 and 4. Additionally, the FPN enables better multi-scale learning, improving the handling of both thin and small occlusions, as seen in row 1. In the last row, corresponding to a real-world scene, the complexity of occlusions increases due to natural textures and variations in illumination. RFR 46, LBAM 47, and Zhang et al.26 produce reconstructions with severe blurring, while ISTY27 and Senussi et al. 30 manage to retain some details but still exhibit residual occlusions and fail to preserve texture continuity. In contrast, LF-PyrNet successfully restores intricate textures and maintains the original scene characteristics, demonstrating its effectiveness in occlusion removal for both synthetic and real-world scenarios.

In the dense LF dataset, as depicted in Fig. 9, the visual comparison highlights the strengths and weaknesses of various state-of-the-art occlusion removal methods, particularly in single and double occlusion scenarios. LF-PyrNet, while performing well in recovering fine details, shows a trade-off when it comes to maintaining structural consistency. In the single occlusion cases (rows 1 and 3), LF-PyrNet effectively reconstructs key visual elements with sharpness and reduced artifacts. For instance, in the first row, it successfully recovers the overall shape and texture of the branches with impressive clarity, avoiding the haziness observed in RFR 46 and LBAM 47, as well as the color bleeding seen in DeOccNet 23. However, when compared closely with ISTY 27 and Senussi et al. 30, subtle inconsistencies emerge. These include slightly irregular edges and a mild misalignment of fine patterns. While the pixel-wise accuracy is high, as reflected by a competitive PSNR of 31.96 dB, the perceptual integrity of the reconstructed content, indicated by a lower SSIM of 0.826, is slightly compromised. This is more apparent in the third row, where LF-PyrNet generates a clean fill of the occluded region in the flower scene, but the continuity of shadows and spatial alignment fall short compared to the smoother outputs of ISTY27.

Fig. 9

Evaluation of LF-PyrNet and occlusion removal methods on the dense LF dataset, illustrating key visual differences.

Moving to the double occlusion scenarios (rows 2 and 4), where challenges intensify due to overlapping regions, LF-PyrNet continues to excel in clarity and pixel-level restoration. In the second row, the method succeeds in reestablishing the repetitive vertical structures of the bridge, outperforming other approaches that either blur or warp these patterns. High-frequency textures, such as pole shadows and gaps, are better preserved than in most competing models, further supporting its PSNR result of 30.11 dB. However, compared to ISTY 27, its spatial coherence is slightly weaker—while ISTY maintains a more perceptually accurate scene layout, LF-PyrNet prioritizes detail fidelity, sometimes at the expense of structural uniformity. This subtle imbalance is also evident in the fourth row, where LF-PyrNet produces crisp shapes, avoiding the distortions seen in methods like DeOccNet 23 and LBAM 47. Nevertheless, close observation reveals minor geometric distortions around object boundaries, once again highlighting the PSNR-SSIM trade-off.

Performance evaluation on real-world data

Restoring occluded regions in real-world LF images poses a significant challenge, requiring both structural accuracy and texture preservation. Figure 10 presents a comparative visual analysis of three occlusion removal methods: DeOccNet 23, Senussi et al.30, and our LF-PyrNet, which highlights the effectiveness of LF-PyrNet in reconstructing missing details. The visual comparison begins with the original occluded scenes in the first column, followed by the results of DeOccNet 23, Senussi et al.30, and LF-PyrNet in the subsequent columns. In the first row, where the occlusion involves a bicycle wheel obstructing the background, LF-PyrNet effectively reconstructs the missing details with high clarity. It ensures a smooth and natural transition between occluded and non-occluded regions. In contrast, DeOccNet 23 exhibits significant blurring, failing to recover fine details, while Senussi et al.30 leave residual traces of occlusion, impacting the overall visual coherence of the scene.

Fig. 10

Real-world scene restoration with LF-PyrNet, showcasing its superiority in occlusion removal compared to DeOccNet23 and Senussi et al.30.

The second row introduces a more complex occlusion, where a metal fence partially obstructs a natural scene. In this case, LF-PyrNet demonstrates its ability to recover occluded regions while preserving structural consistency. Meanwhile, DeOccNet 23 produces overly smooth textures that detract from the natural appearance, while Senussi et al.30 leave subtle occlusion remnants, leading to visible structural distortions in the scene. The third row, characterized by dense and thin occlusions from overlapping tree branches, presents an even greater challenge. While DeOccNet 23 struggles to differentiate between occlusion and background, resulting in texture loss and excessive smoothing, and Senussi et al.30 fail to maintain spatial consistency, LF-PyrNet effectively reconstructs the fine details of the background, preserving natural textures with minimal distortion and fewer inconsistencies.

Evaluation of computational efficiency

Table 2 provides a detailed comparison, showing how LF-PyrNet performs relative to state-of-the-art methods in terms of computational efficiency, particularly model size, inference speed, and training time on an Nvidia RTX 4090 GPU. LF-PyrNet stands out with just 16.97M parameters, making it one of the most efficient architectures in terms of model size among competing methods. It is significantly smaller than RFR 46 (30.59M), LBAM 47 (69.3M), ISTY 27 (80.6M), and Senussi et al.30 (52.59M), demonstrating its ability to maintain strong performance while using fewer parameters. Even compared to DeOccNet 23 (39.0M) and Wang 28 (44.02M), which are also known for their efficient parameter usage, LF-PyrNet offers a more optimized structure. Although Zhang et al.26 (2.7M), ATM-OAC 39 (5.48M), and MANet 41 (2.4M) are designed to be lightweight, their efficiency is severely compromised by long inference times, making them impractical for real-time applications.

Table 2 Comparison of model size, inference time, and training time, where lower values (↓) indicate better performance and ‘–’ denotes unreported data.

Regarding inference time, while DeOccNet23 (0.01 s), LBAM47 (0.012 s), and ISTY27 (0.024 s) exhibit faster inference times, these models require significantly more parameters, leading to higher memory demands and potential scalability issues. In contrast, LF-PyrNet achieves an inference time of 0.043 s, outperforming RFR 46 (6.76 s), Senussi et al.30 (0.138 s), Wang 28 (2.63 s), ATM-OAC 39 (1.04 s), MANet 41 (1.74 s), and Zhang et al.26 (3.05 s) by a considerable margin, offering a strong balance between computational cost and processing speed. In terms of training efficiency, LF-PyrNet is likewise efficient, requiring only 1 day of training, which is substantially faster than Zhang et al. 26 (3 days), MANet 41 (9.7 days), and other efficient methods such as DeOccNet 23 and Wang 28 (2 days), demonstrating rapid convergence without sacrificing performance. While LF-PyrNet may not be the fastest model overall, its efficient design strikes an optimal balance between parameter efficiency and inference speed, making it ideal for real-world applications.

Performance analysis under varying disparity ranges

To further demonstrate the robustness of our approach, we evaluate performance under five disparity ranges using both single-image inpainting methods (RFR 46, LBAM 47) and LF-specific methods (DeOccNet 23, Zhang et al. 26). For consistency of evaluation, part of the comparative results are directly adopted from Zhang et al. 26.

Table 3 Performance analysis under varying disparity ranges, best results highlighted in bold.

As shown in Table 3, our method consistently achieves superior performance across different LF types and disparity ranges. Specifically, our model attains the highest PSNR and SSIM in four out of five cases, with notable improvements in large disparity ranges such as \((-9, -4)\), where it surpasses DeOccNet 23 by more than 3.5 dB in the 4-Syn setting and nearly 5 dB in the 9-Syn setting. In contrast, inpainting methods (RFR 46 and LBAM 47) generate plausible semantics but fail to reconstruct complex occluded backgrounds, while the LF-based methods DeOccNet 23 and Zhang et al. 26 underexploit angular correlations, leading to artifacts and degraded textures. Moreover, the sharp performance drop of DeOccNet 23 and Zhang et al. 26 in these cases highlights their limited generalization ability. By comparison, our method more effectively leverages spatial-angular consistency to recover fine textures, reduce artifacts, and preserve structural fidelity across diverse disparity ranges. These results confirm that our method scales more effectively to challenging disparity variations than both inpainting-based and LF-specific baselines.

Ablation study

The ablation study presented in Table 4 evaluates the impact of different network components on the performance of LF-PyrNet. While the full LF-PyrNet model generally achieves the highest performance across most cases, some ablated versions surpass it in specific scenarios, highlighting the precise role of individual components. Omitting the ResASPP module leads to a consistent drop in performance across all scenarios. For instance, in the 9-Syn Sparse case, the PSNR decreases from 28.13 (full model) to 27.99, while the SSIM drops from 0.872 to 0.858, demonstrating the contribution of ResASPP to improving feature extraction and reconstruction quality. Interestingly, removing the RFB slightly improves performance in the 9-Syn Sparse scenario, where the SSIM increases from 0.870 (LF-PyrNet) to 0.876 (w/o RFB). This indicates that while the RFB enhances global feature integration and helps extend the receptive field, in certain cases removing it allows the network to focus on more localized details, which can lead to marginal gains. While the RDB effectively reconstructs fine textures and preserves occlusion boundaries, improving detail restoration in most cases, its presence may introduce slight over-smoothing in heavily occluded dense scenarios. Excluding the RDB results in the highest SSIM for Dense Single Occlusion (0.832) and Dense Double Occlusion (0.832), surpassing the full model's SSIM of 0.826 and 0.828, respectively.

Table 4 Ablation Analysis: Highlighting the Best Performance in bold.

The most severe performance drop is observed when the FPN is removed, particularly in the CD and dense (Double Occ) cases, where the PSNR drops from 25.79 to 24.41 and from 30.11 to 26.37, respectively. This highlights the crucial role of the FPN in multi-scale feature fusion, ensuring that both fine and coarse details are effectively captured. Without the refinement module, performance takes a sharp hit, especially in the Dense (Single and Double Occ) cases. In Single Occ, the PSNR drops from 31.96 to 28.36, while in Double Occ the decrease is even more pronounced, falling from 30.11 to 27.07. This decline underscores the crucial role of the refinement module, particularly the SepConv block, which enhances texture restoration and accurately recovers occlusion boundaries. Without it, the model struggles to refine the fine details of occluded regions, leading to a noticeable decline in overall quality. The visual results in Fig. 11 align with the quantitative findings, highlighting the significant enhancements achieved when the complete architecture is employed.

Fig. 11

Visual analysis of the ablation study, showcasing the influence of each model component on overall performance.

Conclusion and future work

This study presents LF-PyrNet, a new deep learning model designed to improve occlusion removal in LF images. By employing multi-scale receptive field learning through the integration of ResASPP and a modified RFB, LF-PyrNet expands the receptive field, capturing long-range dependencies and enhancing spatial feature extraction. The core network structure consists of three cascaded RDBs followed by an FPN, facilitating effective occlusion removal by progressively refining features at multiple scales. Additionally, our refinement module, which incorporates both separable and standard convolutions, ensures the restoration of even the finest details in occluded regions, enhancing the overall quality of the reconstruction. However, while LF-PyrNet performs well on sparse LF datasets, it faces challenges with denser datasets where occlusions are more complex. This limitation arises because the model does not predict an occlusion mask prior to inpainting occluded regions, which makes handling dense occlusions with high accuracy more difficult. Future work will focus on incorporating mask prediction mechanisms to guide the inpainting process, allowing the model to better isolate and reconstruct occluded regions, especially in dense LF datasets. Furthermore, advanced techniques for feature fusion will be explored to improve performance on large and complex occlusions. These enhancements will make the model more reliable and adaptable in real-world applications.