Introduction

Since the time of Galileo, imaging has been the “eyes of science”1. The acquisition of high-quality biological images has assumed a pivotal role in biological science research, as advancements in imaging technology have provided essential tools for direct investigation2,3. Traditional techniques like widefield fluorescence and laser scanning confocal microscopy are crucial in biology4,5, and provide structural data6 but face resolution limits7. Advanced methods like confocal microscopy7,8,9,10,11, multiphoton microscopy12, structured illumination4,13,14, light-sheet imaging6,15,16,17,18, and STED microscopy19,20,21,22 aim to increase resolution, aiding neuroscience research. Nonetheless, these methods involve inherent physical resolution constraints7, necessitating the continued development of advanced techniques to enhance image quality through denoising, deblurring, and resolution improvement.

Deep learning (DL) excels at, and provides a promising tool for, image enhancement and restoration23,24. Given large datasets, supplied in a supervised way25,26,27,28,29,30 (e.g., low-/high-quality image pairs) or an unsupervised way31,32,33,34,35,36 (e.g., corrupted pairs), DL can perform well in many cases28,29,30,32,34,37,38,39. However, it should be noted that most DL algorithms assume independently and identically distributed40 (i.i.d.) training and inference data. Regrettably, this assumption is frequently violated in real-world scenarios due to factors such as sample variation (biological samples can exhibit significant differences even within the same species or cell line, stemming from genetic variation, environmental factors, or developmental stage), variability in imaging conditions (variations in lighting, exposure time, and focus can yield non-i.i.d. data, and different microscopes, or separate sessions with the same microscope, can introduce variability in the captured images), and temporal effects in living cells (time-lapse live-cell imaging can introduce non-i.i.d. data because cells or imaging conditions change during the experiment), ultimately resulting in less-than-optimal outcomes. In a simple experiment, we found that the PSNR may drop by approximately 0.5 to 6 dB when a DL model is applied to a different dataset. A novel approach is therefore required to address the non-i.i.d. gap between training and inference and unlock the model's performance potential; developing effective single-image restoration methods is pivotal to surmounting these data- and non-i.i.d.-related challenges.

The detrimental effects of non-i.i.d. data stem from the segregation of training and inference (as shown in Fig. 1a, the traditional paradigm is a two-stage framework), which hinders the model's ability to adjust to the distinct attributes of individual inference images. The central challenge in improving performance lies in effectively using a single inference image to optimize the model for the specific characteristics of that image. Inspired by biological intelligence, where past experiences sculpt neural circuits for future learning41, we introduce the HitchLearning paradigm, which unifies training and inference (as shown in Fig. 1a, the HitchLearning paradigm is a one-stage framework) by instantaneously adapting the model with any new inference data and the existing training data. Our paradigm (Fig. 1a, b) uses historical non-i.i.d. image data (Source) as prior experience and merges this prior experience with a single new inference image (Target) to optimize the model by aligning their distributions. In this way, the prior data provide general information for the task, while the inference image supplies detailed features for the specific scenario. Traditional domain-alignment strategies require multiple target and source samples and a deep neural network to achieve alignment, whereas our paradigm uses only a single target image and the source data to realize domain alignment in the Fourier domain (Fig. 1b, c). With the adapted model, better performance can be achieved. Experiments demonstrate that HitchLearning is versatile: it can be applied to both supervised and unsupervised frameworks for deblurring, denoising, and super-resolution across various datasets (Fig. 1b).

Fig. 1: HitchLearning paradigm mechanism.
figure 1

a Comparison of the traditional and HitchLearning paradigms. The traditional machine learning paradigm has two stages: the first stage (upper left) uses a large amount of training data to create a well-performing model through various learning methods, and the second stage (upper right) uses this trained model to infer a restored output. Our new paradigm, shown at the bottom, unifies these stages by incorporating inference data during training. The two-stage framework faces a non-i.i.d. problem due to the domain gap between source and target data (middle left). In HitchLearning, this problem is alleviated by using inference data to generate new source data during model training. b The architecture of the HitchLearning paradigm. Domain alignment (blue) can utilize any method to align the distribution of source and target images. The image restoration learning method can be any approach (supervised, self-supervised, or unsupervised) suitable for tasks such as image denoising, deblurring, and super-resolution. The image restoration model is a well-trained model with strong performance on a single target image. The blue dotted box illustrates our method's domain alignment in the Fourier domain, which requires only one target image, unlike commonly used DNN methods, which need multiple target images and thus increase costs. The black dotted box displays results of three tasks using our paradigm, with the bar chart at the bottom showing the PSNR comparison of two main training paradigms against HitchLearning across these tasks. c A simple but effective distribution alignment method. Here, we use the Fourier Domain Adaptation (FDA) method to achieve domain alignment; as a result, the source image acquires a style similar to the target's, as verified by the data shown in the Results.

Results

HitchLearning facilitates distribution consistency

Training with a single target image offers important but limited information, whereas historical images are abundant. However, because the historical images and the single target image are not i.i.d., domain adaptation42 (DA) is needed to bridge this gap. Here, we propose a simple yet effective strategy for HitchLearning based on Fourier domain adaptation43 (FDA). HitchLearning aligns the source and target image distributions in the Fourier domain by replacing the source's amplitude spectrum with that of the target (Fig. 1c), without requiring complex networks, convolution operations, or extensive data. This process increases source-target distribution similarity, providing consistent data for further analysis.
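A minimal sketch of this amplitude-swap step is shown below, assuming same-sized, single-channel images stored as NumPy arrays; the function name, the `beta` band-size parameter, and the choice of which frequency band to exchange are illustrative (the sketch swaps a centered band, following the FDA reference) rather than the exact implementation used here.

```python
import numpy as np

def fda_amplitude_swap(source: np.ndarray, target: np.ndarray, beta: float = 0.1) -> np.ndarray:
    """Replace part of the source amplitude spectrum with the target's, keeping the source phase.

    Minimal illustration of Fourier domain adaptation for two same-sized, single-channel images.
    `beta` controls the half-size of the centered square that is exchanged; which band is swapped,
    and its size, are hyperparameters in practice.
    """
    fft_src = np.fft.fftshift(np.fft.fft2(source))
    fft_trg = np.fft.fftshift(np.fft.fft2(target))

    amp_src, pha_src = np.abs(fft_src), np.angle(fft_src)
    amp_trg = np.abs(fft_trg)

    h, w = source.shape
    b = int(min(h, w) * beta)            # half-size of the swapped square
    ch, cw = h // 2, w // 2
    amp_src[ch - b:ch + b, cw - b:cw + b] = amp_trg[ch - b:ch + b, cw - b:cw + b]

    # Recombine the swapped amplitude with the original source phase and invert the FFT.
    fft_new = np.fft.ifftshift(amp_src * np.exp(1j * pha_src))
    return np.real(np.fft.ifft2(fft_new))
```

In practice, the transferred source images produced this way can then be paired with the single target image for training, as illustrated in Fig. 1b.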

HitchLearning minimizes discrepancies between training and inference images. We employ the Kullback-Leibler divergence (KLD) to quantify the distribution shift of the source image data, before and after HitchLearning, relative to the target image data44,45. Figure 2a depicts the KLD to the target for the original Golgi data versus Golgi transferred toward F-actin, and for the original Microtubule data versus Microtubule transferred toward ER, indicating a reduction in KLD following the HitchLearning process. Additionally, the processed images became visually similar to the target images (Fig. 2b). The KLD results and the visual evidence show that HitchLearning enables consistent distributions, indicating that HitchLearning can mitigate non-i.i.d. problems.
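A hedged sketch of one way to compute such a KLD from pixel-intensity histograms follows; the bin count, smoothing constant, and divergence direction are illustrative choices, not necessarily those used in this work.

```python
import numpy as np

def histogram_kld(source_imgs: np.ndarray, target_imgs: np.ndarray,
                  bins: int = 256, eps: float = 1e-10) -> float:
    """KL divergence D(source || target) between pixel-intensity histograms of two image sets.

    Images are assumed to be normalized to [0, 1]; `bins`, `eps`, and the divergence
    direction are illustrative choices for this sketch.
    """
    p, _ = np.histogram(source_imgs.ravel(), bins=bins, range=(0.0, 1.0))
    q, _ = np.histogram(target_imgs.ravel(), bins=bins, range=(0.0, 1.0))

    # Smooth and normalize to proper probability vectors.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))
```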

Fig. 2: Evaluation of FDA mechanism.
figure 2

a Quantitative Analysis of Kullback–Leibler Divergence (KLD) Reduction. A side-by-side comparison of the KLD values before and after applying our proposed transformation method. The left panel focuses on F-actin as the target, while the right panel addresses the endoplasmic reticulum (ER) as the target. The data clearly illustrate a significant decrease in KLD following the transformation process. This reduction is indicative of the enhanced alignment between the source and target distributions, thereby affirming the efficacy of our method in mitigating discrepancies between disparate datasets. The graphical representation provides a compelling visual proof of concept for the effectiveness of our approach in harmonizing data distributions in distinct biological contexts. b Comparative Visualization of Domain Adaptation Algorithm Effectiveness. This figure demonstrates the outcomes (indicated by the red box) of employing Fourier Domain Adaptation (FDA) on a single source image (highlighted in the blue box) across multiple target images (depicted in the gray box). Following the transformation, the source image shows significant visual alignment with each target image, affirming the adaptability and efficacy of the FDA algorithm for adapting to various target images. The figure's details emphasize subtle shifts in visual characteristics, highlighting the transformative capacity and flexibility of the approach. Specifically, the left panel depicts the transformation of the Golgi apparatus to F-actin, while the right panel illustrates the transformation of Microtubule to endoplasmic reticulum (ER) attributes.

HitchLearning provides advanced one-image denoising techniques for enhancing live-cell imaging

We employed a state-of-the-art model, Neighbor2Neighbor33, as the foundation for unsupervised denoising of live-cell images. We adapted the Neighbor2Neighbor framework to include source data and a single target image, expanding its applications. Our study comprised seven distinct experimental settings (Supplementary Table 1), each utilizing a differently composed training dataset as described in Methods. We analyzed four datasets: endoplasmic reticulum (ER), F-actin filaments (F-actin)46, Golgi, and Microtubule38. ER and F-actin were imaged using multimodal SIM, while the Golgi apparatus and microtubules were examined via confocal microscopy. These datasets vary in structure, shape, and photographic style, which affects their noise levels and resolution. In our approach, Golgi served as the source for F-actin and Microtubule as the source for ER. We divided the F-actin and ER datasets into training and inference segments (details in Methods), using the inference ground-truth data only for metric evaluation. All datasets were normalized to the range 0 to 1 to reduce scale differences and improve model performance. The model underwent 10 trial runs to obtain average results across the inference dataset, facilitating comparison of the seven experimental settings. Here, Src embodies the existing traditional deep learning paradigm, while AllTrg represents the theoretical maximum achievable under this paradigm, albeit one requiring extensive additional data collection.

To quantitatively evaluate our HitchLearning method, we used four metrics: the mean squared error (MSE), mean absolute error (MAE), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM). For the F-actin dataset, HitchLearning notably exceeded AllTrg, our benchmark for the theoretical maximum performance, across all metrics. These results, averaged over ten runs, are detailed in Fig. 3c and Supplementary Table 2. The Trg setting, which used only a single inference image for training, failed dramatically, achieving a PSNR of 11.64 dB and an SSIM of 0.15. In contrast, the SrcTrg setting performed much better (35.63 dB PSNR and 0.97 SSIM) but still did not match AllTrg (35.95 dB PSNR and 0.97 SSIM), which requires collecting a large amount of data and is akin to training and testing on data from the same domain; this underscores a significant gap in challenging scenarios. Remarkably, HitchLearning achieved 36.03 dB PSNR and 0.98 SSIM, demonstrating its superior ability to address the non-i.i.d. problem in denoising tasks. Similarly, on the ER dataset, HitchLearning also yielded the best results, as shown in Supplementary Fig. 1b and Supplementary Table 3. In addition to the quantitative measures, visual comparisons (Fig. 3a and Supplementary Figs. 1a, 4) further confirmed HitchLearning's superiority over the other six settings. To directly visualize the improvements in these denoised images, we computed the pixelwise absolute difference between each denoised image and the corresponding ground truth, shown under each restored image.

Fig. 3: Systematic denoising analysis of F-actin.
figure 3

a Representative denoised (DN) images inferred from noisy images of F-actin. Following each noisy image, seven DN renditions produced under different settings are presented. Directly beneath each DN image is its corresponding difference image, juxtaposed against the ground truth (GT). The Mean Absolute Error (MAE) is quantified and annotated at the top left corner of each difference image. b Higher-magnification views of F-actin (marked by the yellow rectangular regions in (a)). This sequence of images demonstrates the comparative effectiveness of the denoising process, with insets providing a close-up view of the enhanced structural details after denoising. c Statistical comparison of the seven settings in terms of the PSNR, SSIM, MSE and MAE for F-actin. This analysis comprehensively contrasted the performance of seven distinct settings, and Tukey box-and-whisker plots with significance analysis are shown (Methods). d Intensity profiles along the lines indicated by the arrowheads in the images of Noise (blue), GT (white), Src (orange) and ours (green). e Comparison of the four metrics between the traditional paradigm (Src) and HitchLearning (ours). These side-by-side comparisons facilitate the assessment of denoising algorithm performance in preserving structural integrity while reducing noise. This set of images underscores the advanced capability of our paradigm to maintain critical biological structures amidst substantial noise reduction. Through a combination of visual results and quantitative analysis across seven settings, our innovative paradigm, HitchLearning, demonstrated superior performance in single-image denoising tailored for live-cell imaging. Scale bar, 4 μm (a), 2 μm (b) for magnified images.

Compared with the raw noisy inference images, the denoised images revealed a multitude of finer structures, including F-actin filaments. Moreover, HitchLearning more precisely resolves the densely crisscrossed regions of the F-actin cytoskeleton (Fig. 3b), and the line-scan profiles of F-actin inferred by HitchLearning are closer to the GT profiles than those from Src (Fig. 3d). The statistical analysis confirmed that HitchLearning typically achieved better DN imaging performance in terms of the four metrics (Fig. 3e). This comparison clearly demonstrates that HitchLearning effectively facilitates the denoising of live-cell images, as evidenced by the reduced discrepancies in fine structural details. In summary, HitchLearning presents a highly effective solution for addressing non-i.i.d. challenges in image denoising tasks.

HitchLearning significantly enhances the performance of bioimage deblurring

When photographing living cells, cell movement causes motion blur, defocus causes defocus blur, and so on. To better analyze cell growth and morphology, it is essential to obtain deblurred cell images. We chose MPRNet47 as the base model for this task; when MPRNet was combined with HitchLearning, deblurring with only one image became possible. MPRNet is a supervised method; here, we used the deblurred image obtained from the Src model as the pseudo ground truth to train the models in the remaining settings. The data for deblurring were generated by adding motion blur to the raw images38. We used different blur kernels for the source datasets (Golgi and Microtubule) and the target datasets (F-actin and ER) to generate the blurry datasets.
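A hedged sketch of how such blurry inputs can be synthesized with a line-shaped motion-blur kernel is given below; the kernel length and angle shown are illustrative, not the values used for these datasets.

```python
import numpy as np
from scipy.ndimage import rotate, convolve

def motion_blur_kernel(length: int = 15, angle_deg: float = 30.0) -> np.ndarray:
    """Line-shaped PSF approximating linear motion blur (illustrative parameters)."""
    kernel = np.zeros((length, length))
    kernel[length // 2, :] = 1.0                       # horizontal line of ones
    kernel = rotate(kernel, angle_deg, reshape=False)  # rotate to the desired motion direction
    kernel = np.clip(kernel, 0, None)                  # remove small negative interpolation values
    return kernel / kernel.sum()

def add_motion_blur(image: np.ndarray, length: int = 15, angle_deg: float = 30.0) -> np.ndarray:
    """Convolve a raw image with a motion-blur kernel to create a synthetic blurry input."""
    return convolve(image, motion_blur_kernel(length, angle_deg), mode="reflect")
```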

As previously discussed, we assessed the performance of seven training settings for the deblurring task on both the F-actin and ER datasets. In this comparison, HitchLearning outperformed all the other settings across the four metrics for both the F-actin and ER datasets, with the exception of the PSNR metric on the ER dataset, where the Finetune setting was slightly higher, as shown in Supplementary Tables 4 and 5. Notably, HitchLearning exhibited significant improvements over Src and AllTrg in the F-actin dataset, with average PSNR enhancements of 6.34 dB and 1.02 dB, respectively. Similarly, on the ER dataset, HitchLearning improved the PSNR by 1.82 dB and 0.06 dB compared to Src and AllTrg, respectively. Although Finetune slightly outperformed HitchLearning in terms of the PSNR by 0.13 dB, HitchLearning achieved a notably greater SSIM, indicating a more substantial improvement. The visual results of the F-actin and ER datasets, which vary in structural complexity, are displayed in Fig. 4a and Supplementary Figs. 2a and 5. Despite a minor reduction in deblurring effectiveness with increasing complexity, HitchLearning consistently delivered the best visual outcomes among all the settings. These results, encompassing both quantitative metrics and visual analysis, demonstrate that HitchLearning can surpass models trained with extensive i.i.d. data, even with the use of just a single inference image.

Fig. 4: Systematic deblurring analysis of F-actin and ER.
figure 4

a Representative deblurred (DB) images derived from blurry F-actin. Subsequent to each blurry image, seven distinct DB renditions, realized through varied settings, are exhibited. Beneath each DB image, the corresponding difference image is presented, strategically aligned against the ground truth (GT) for direct comparison. The Mean Absolute Error (MAE) for each difference image was meticulously quantified and annotated at the upper left corner, facilitating a precise evaluation of the deblurring efficacy. Scale bar, 4 μm. b Comparison of DB images of ERs inferred by Src and ours. Blur and GT images are shown for reference. The lower row shows the magnified images of the boxed regions in the upper images. Scale bar, 1 μm. c, d Statistical comparison of the traditional paradigm and HitchLearning in terms of PSNR, SSIM, MSE and MAE for F-actin (c) and ER (d). Combining visual results with a comprehensive quantitative analysis across seven distinct settings, our novel paradigm, HitchLearning, achieves unparalleled performance in single-image deblurring specifically for bioimaging applications.

To test whether HitchLearning outperforms the traditional Src paradigm in the deblurring task, we compared the restored images obtained from these two paradigms. The visual results in Fig. 4b demonstrate that HitchLearning recovers more precise cellular structures than Src. Moreover, the four metrics also reveal that our HitchLearning paradigm performs better (Fig. 4c, d).

HitchLearning empowers the single-image super-resolution of live-cell images

To address the single-image super-resolution (SISR) challenge30,37,46,48,49,50,51,52, we employed SwinIR53 for SISR tasks on the F-actin and ER datasets, using Golgi and Microtubule as the respective sources. SwinIR, noted for its shallow and deep feature extraction capabilities coupled with high-quality image reconstruction, leverages the Swin Transformer for improved performance in image restoration tasks. In our experiments, we tailored the input module of SwinIR to facilitate source-to-target domain transfer, enhancing the model's efficacy in single-image super-resolution. Since SwinIR is a supervised model, we first generated low-resolution (LR) images of Golgi and Microtubule by downsampling the raw images and used these as the source to train the model; this model then generated pseudo super-resolution (SR) images of F-actin and ER, respectively, and these pseudo-SR images were finally used to train the model under the remaining settings.
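A minimal sketch of the LR-generation step is shown below, assuming cubic-interpolation downsampling of single-channel images by a factor of 2; the actual downsampling operator and scale used here may differ.

```python
import numpy as np
from skimage.transform import resize

def make_lr(image: np.ndarray, scale: int = 2) -> np.ndarray:
    """Generate a low-resolution counterpart of a raw image by downsampling.

    A factor of 2 and cubic interpolation (order=3) are illustrative choices for this sketch.
    """
    h, w = image.shape
    return resize(image, (h // scale, w // scale), order=3, anti_aliasing=True)
```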

Metric comparisons reveal that HitchLearning achieves optimal performance across all the metrics in both the F-actin and ER experiments, as detailed in Supplementary Tables 6 and 7. In the F-actin experiment, HitchLearning not only outperformed Src and AllTrg by 0.84 dB and 1.99 dB, respectively, but also surpassed the second-best model by an average of 0.05 dB in PSNR, signifying substantial improvements. The visual results, depicted in Fig. 5a–c and Supplementary Fig. 6, show that HitchLearning produces superior SR images for both the F-actin and ER datasets compared to Src; however, performance diminishes slightly as complexity increases. The SR images obtained via HitchLearning demonstrated a marked enhancement in image details (Fig. 5d). In the Trg setting, the performance is notably poor, which suggests that relying on a single inference image is insufficient for effective model training. In our HitchLearning paradigm, however, this problem can be mitigated by utilizing another dataset. These findings underscore the efficacy of our method in addressing the non-i.i.d. nature of training and inference datasets.

Fig. 5: Systematic super-resolution analysis for F-actin and ER.
figure 5

a, b Representative SR images reconstructed by Src (a) and ours (b) from raw images of ER and F-actin. c The corresponding GT images. Scale bar, 1 μm for ER and F-actin. d Comparison of SR images of ERs inferred by Src and ours. LR and GT images are shown for reference. The right column shows the magnified images of the boxed regions in the left images. Scale bar, 5 μm; 1 μm for magnified images. e Intensity profiles of F-actin (2× upscaling) along the lines indicated by the two arrowheads in the images of (a) Src (orange), (b) ours (green) and (c) GT (white). The arrows indicate that the structures reconstructed by ours are closer to GT than those reconstructed by Src. f Representative SR images of F-actin generated by HitchLearning. Bottom: a fraction of the corresponding LR image. Scale bar, 2 μm. Through a combination of visual assessments and quantitative analysis across seven distinct settings, our innovative paradigm, HitchLearning, was demonstrated to achieve the most effective performance in Single-Image Super-Resolution (SISR) specifically for live-cell imaging. For comparison purposes, the low-resolution (LR) images were resized to match the dimensions of the super-resolution (SR) images.

As exemplified in Fig. 5e, the line-scan profiles from HitchLearning are closer to those of GT than the profiles from Src. As the complexity of the structure increases, the agreement of both HitchLearning and Src with GT decreases, but HitchLearning consistently outperforms Src. Representative colored frames (Fig. 5f and Supplementary Fig. 3) show that HitchLearning resolves the crisscrossing regions of the F-actin and ER cytoskeleton well, going beyond the traditional Src paradigm.

Discussion

The field of image restoration has undergone a significant transformation with the advent of deep learning. The introduction of convolutional neural networks54 (CNNs) and advanced architectures has led to substantial advancements in denoising, super-resolution, and deblurring. Deep neural networks, especially CNNs, have excelled in capturing complex image patterns and learning restoration processes directly from training data. The ability of these methods to automatically learn complex mappings has notably enhanced image restoration, often surpassing traditional methods. However, these advancements have come with challenges. Supervised methods41,55 demand large quantities of low-quality and high-quality image pairs, while self-supervised31,32,33,56,57 and unsupervised methods34,35,58 aim to reduce this requirement. Nevertheless, a common limitation across these methods is the separation of training and inference phases, leading to the non-i.i.d. problem when the inference dataset distribution differs from the training dataset and adversely affecting model performance. Moreover, conventional methods typically require numerous training images from the same distribution, a requirement not always feasible, especially in biology. To address these issues, we introduce a novel deep learning paradigm that combines the target image with any available data for training, demonstrating effective performance, as shown in our results.

At its core, HitchLearning is a transfer learning paradigm, yet traditional transfer learning frameworks are inadequate for our specific challenges. Current transfer learning methods predominantly rely on deep learning models employing complex operations, such as convolution59,60 and generative adversarial networks61,62. Recent studies have shown that such models require extensive image data for training, a demand often impractical in real-world applications. Our paradigm offers a solution to this limitation. By utilizing any existing image dataset as source data and transforming it to match the distribution of a single target image, we can train the model using both the transformed source images and the target image, effectively restoring the image from a single inference image. This approach ensures that a single image is sufficient for successful outcomes. Our results demonstrate that HitchLearning is effective in aligning the distributions of different datasets and in accomplishing transfer learning with a single inference image.

In this study, we evaluated the efficacy of our method across three tasks—denoising, deblurring, and super-resolution—and within both supervised and unsupervised frameworks on four datasets. Our findings demonstrate that our approach consistently yields the best overall results. Specifically, in denoising and super-resolution tasks, HitchLearning achieves superior performance both quantitatively and visually. In deblurring, while Finetune for ER marginally outperforms HitchLearning, the difference is minimal, and visually, HitchLearning yields excellent results. These outcomes illustrate the broad applicability of our paradigm across various tasks and datasets, regardless of whether supervised or unsupervised models are employed. We further simulated variations in feature extraction capacity by altering the model’s depth and architectural design. Experimental results (Supplementary Fig. 7 and Tables 8–10) demonstrate that HitchLearning consistently performs well across models with different feature extraction capabilities, further indicating its robustness and broad applicability.

We further provide a theoretical analysis of HitchLearning by examining its generalization behavior through the lens of the target-domain error bound proposed by Ben-David et al.63. Let \(h\) denote the true restoration function, and \(f\) the hypothesis learned by HitchLearning. The expected risks of \(f\) on the source domain \({{\mathcal{D}}}^{s}\) and the target domain \({{\mathcal{D}}}^{t}\) are defined respectively as:

$$\begin{array}{c}{\epsilon }_{s}(f)={{\mathbb{E}}}_{x\sim {P}_{s}}[|h(x)-f(x)|]\\ {\epsilon }_{t}(f)={{\mathbb{E}}}_{x\sim {P}_{t}}[|h(x)-f(x)|]\end{array}$$

where \(x\) represents the raw input data. According to Theorem 1 in ref. 63, if the hypothesis space \({\mathcal H}\) containing \(f\) is of VC-dimension \(\bar{d}\), then with probability at least \(1-\delta\) for every \(f\in {\mathcal H}\), the expected target domain error \({\epsilon }_{t}(f)\) in \({{\mathcal{D}}}^{t}\) is upper-bounded by:

$${\epsilon }_{t}(f)\le {\hat{\epsilon }}_{s}(f)+\sqrt{\frac{4}{n}\left(\bar{d}\,\log \frac{2en}{\bar{d}}+\,\log \frac{4}{\delta }\right)}+{d}_{ {\mathcal{H}}}({{\mathcal{D}}}^{s},{{\mathcal{D}}}^{t})+\lambda$$

where \({\hat{\epsilon }}_{s}(f)\) is the empirical risk on the source domain, \({d}_{ {\mathcal{H}} }({{\mathcal{D}}}^{s},{{\mathcal{D}}}^{t})\) denotes the divergence between source and target distributions with respect to \({\mathcal{H}}\), and \(\lambda\) is the minimum joint error achievable by any hypothesis in \({\mathcal{H}}\) across both domains. This decomposition highlights three critical factors that influence generalization to unseen domains: (1) the empirical risk on the source domain, (2) the distributional divergence between domains, and (3) the inherent difficulty of the task, captured by the joint error term. HitchLearning is specifically designed to address the second component, domain discrepancy, by aligning the source and target distributions at the input level prior to training. This alignment is accomplished through Fourier Domain Adaptation (FDA), which reduces degradation-specific discrepancies such as noise patterns, blur kernels, and spectral distortions.

The core idea behind HitchLearning is to approximate the i.i.d. assumption fundamental to most learning-theoretic guarantees. By narrowing the distributional gap between source and target data, HitchLearning facilitates more stable training and improved generalization to unseen domains. In the ideal case, when source and target domains are fully aligned, the distributional distance, measured using metrics such as the Kullback-Leibler Divergence (KLD), will be very small, thus minimizing the upper bound of the target-domain error. It is also worth noting that the performance of the underlying base model \(f\) plays a significant role in the final outcomes. Differences in model architecture or learning capacity can affect both the empirical risk and the alignment quality, which may in part explain the variability in performance observed across different models.

Although our paradigm has performed well in low-vision tasks such as image denoising, image deblurring and image super-resolution, there are still several limitations. First, performance varies across datasets due to differences in raw data quality, with some exhibiting higher degradation and greater restoration difficulty, which inherently impacts results. Despite this, HitchLearning achieves notable PSNR improvements on F-actin and ER, indicating its robustness under challenging conditions. This suggests that although absolute performance may fluctuate with dataset difficulty, relative gains remain substantial. Moreover, experiments show that HitchLearning more effectively reduces distributional discrepancies—reflected by a greater reduction in Kullback–Leibler divergence—between F-actin and Golgi than between ER and Microtubule. The performance gain for F-actin was also more pronounced than that for ER, suggesting that uneven improvements may hinder downstream tasks. Addressing how to consistently minimize inter-dataset differences remains an important direction for future research. Second, the compatibility of our paradigm with various image restoration models varies. Essentially, the better the inherent performance of a restoration model is, the greater its compatibility with our paradigm. This is primarily due to the differing abilities of the models.

We recognize that employing KLD as a measure of distributional consistency entails certain limitations, particularly in capturing complex, high-dimensional alignment characteristics. In future work, we plan to investigate the theoretical properties of the feature alignment process in HitchLearning, particularly how the implicit constraints introduced during alignment affect feature representations across different model architectures. Additionally, we aim to explore more principled metrics for distribution compatibility beyond KLD and to consider model-agnostic alignment strategies that can better adapt to architectures with differing feature representations.

Methods

HitchLearning

The central dilemma of HitchLearning is how to utilize a single test image to construct a high-performance model with thousands of parameters. A straightforward approach is to use the single test image to fine-tune a well-pretrained model; despite the advantages of fine-tuning, its drawbacks, such as loss of original knowledge and additional training time and resource requirements, cannot be ignored. Moreover, when the distribution of the inference image is far from that of the training dataset, fine-tuning may lose its effect. Here, we propose a simple new paradigm that utilizes old image resources to provide additional information for a new test image and thereby addresses these issues. The focus of HitchLearning is how to take advantage of the information already available to pave the way for a new task and achieve a better result; since the gap is the non-i.i.d. problem, our first idea was to align the distributions of the two datasets. Many domain adaptation methods are available; however, by analyzing the restoration problem carefully, we find that the core objective is to align the distribution of the noise and blur, which is easier to separate in the Fourier domain. We therefore harness the Fourier domain adaptation method in our work to make the distributions of the two datasets consistent. The main idea of FDA is to Fourier transform the image and replace the amplitude of the high-frequency region of the source-domain image with that of the target-domain image while keeping the phase invariant.

Neighbor2Neighbor model for denoising

The Neighbor2Neighbor methodology is a self-supervised approach designed to train Convolutional Neural Network (CNN) denoisers using a single observation of noisy imagery. The proposed training strategy encompasses two components. The initial phase involves generating pairs of noisy images through the utilization of random neighbor subsampling techniques. In the subsequent phase, these sub-sampled image pairs are employed for self-supervised training, complemented by a regularized loss introduced to address the nonzero ground-truth gap existing between the paired, subsampled, noisy images. This regularized loss algorithm integrates a reconstruction segment with a regularization segment as follows:

$${{\mathcal{L}}}_{{Neighbor}2{Neighbor}}={{\mathcal{L}}}_{{rec}}+{{\mathcal{L}}}_{{reg}}$$

where

$${{\mathcal{L}}}_{{rec}}=\Vert{f}_{\theta }\left({g}_{1}\left(y\right)\right)-{g}_{2}\left(y\right)\Vert$$
$${{\mathcal{L}}}_{{reg}}={{\gamma }}\,{{\cdot }}\,\left\Vert{f}_{\theta }\left({g}_{1}\left(y\right)\right)-{g}_{2}\left(y\right)-\left({g}_{1}\left({f}_{\theta }\left(y\right)\right)-{g}_{2}\left({f}_{\theta }\left(y\right)\right)\right)\right\Vert$$

Here, \({f}_{\theta }\left(\cdot \right)\) is the denoising network with an arbitrary network design (here we use UNet), \(y\) is the inference image, \({g}_{1}\left(\cdot \right)\) and \({g}_{2}\left(\cdot \right)\) represent the neighbor subsampler, and \({\rm{\gamma }}\) is a hyperparameter controlling the strength of the regularization term. To ensure learning stability, we halt the gradients of \({g}_{1}\left({f}_{\theta }\left(y\right)\right)\) and \({g}_{2}\left({f}_{\theta }\left(y\right)\right)\), and incrementally increase the parameter γ to its predetermined value throughout the training process.
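A hedged PyTorch sketch of this regularized loss follows; the squared-error norm, the fixed `gamma` value, and passing one fixed realization of the subsamplers `g1`, `g2` as arguments are illustrative choices rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def neighbor2neighbor_loss(f, y, g1, g2, gamma: float = 1.0):
    """Regularized Neighbor2Neighbor loss for a noisy batch `y` (sketch).

    `f` is the denoising network; `g1`, `g2` are one fixed realization of the random
    neighbor subsampler (the same pair must be applied to `y` and to `f(y)`).
    The squared-error form and `gamma` value are illustrative choices.
    """
    out_sub = f(g1(y))

    # Reconstruction term between the two noisy sub-images.
    l_rec = F.mse_loss(out_sub, g2(y))

    # Regularization term; gradients through f(y) are stopped to stabilize training.
    with torch.no_grad():
        denoised = f(y)
    l_reg = F.mse_loss(out_sub - g2(y), g1(denoised) - g2(denoised))

    return l_rec + gamma * l_reg
```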

The MPRNet model for deblurring

MPRNet is a tri-stage framework designed for progressive restoration. The first two stages rely on encoder-decoder subnetworks to learn broad contextual information. The final stage, in recognition of the position-sensitive nature of image restoration, employs a subnetwork that operates at the original input resolution, ensuring preservation of fine textures. This is not a simple cascade; supervised attention modules are integrated between stages to rescale feature maps, and a cross-stage feature fusion mechanism is used to consolidate intermediate features. MPRNet has access to the input image at each stage, with the input divided into nonoverlapping patches differently for each stage. At each stage, a residual image is predicted, which, when added to the degraded input, yields the restored image. MPRNet is optimized end-to-end with the following loss function:

$${{\mathcal{L}}}_{{MPRNet}}={\sum }_{s=1}^{3}\left[{{\mathcal{L}}}_{{char}}\left({X}_{s},Y\right)+\lambda {{\mathcal{L}}}_{{edge}}\left({X}_{s},Y\right)\right]$$

\(s\) indexes the stage, \({X}_{s}\) is the restored image obtained by adding the predicted residual to the input at each stage, \(Y\) is the ground truth, and \(\lambda\) is set to 0.5 as a parameter that controls the relative importance of the Charbonnier loss \({{\mathcal{L}}}_{{char}}\left(\cdot \right)\) and the edge loss \({{\mathcal{L}}}_{{edge}}\left(\cdot \right)\), defined as follows:

$${{\mathcal{L}}}_{{char}}=\sqrt{{\Vert {X}_{s}-Y\Vert }^{2}+{\varepsilon }^{2}}$$
$${{\mathcal{L}}}_{{edge}}=\sqrt{{\Vert \varDelta \left({X}_{s}\right)-\varDelta \left(Y\right)\Vert }^{2}+{\varepsilon }^{2}}$$

\(\varDelta\) denotes the Laplacian operator, and \(\varepsilon\) is a constant empirically set to \({10}^{-3}\).
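A hedged PyTorch sketch of these loss terms is shown below; the batch reduction and the fixed 3×3 Laplacian kernel used as \(\varDelta\) are illustrative assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def charbonnier(x, y, eps: float = 1e-3):
    """Charbonnier loss sqrt(||x - y||^2 + eps^2), averaged over the batch (sketch)."""
    return torch.sqrt(((x - y) ** 2).sum(dim=(1, 2, 3)) + eps ** 2).mean()

def laplacian(x):
    """Apply a fixed 3x3 Laplacian kernel per channel (illustrative edge extractor)."""
    k = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                     device=x.device, dtype=x.dtype)
    k = k.view(1, 1, 3, 3).repeat(x.shape[1], 1, 1, 1)
    return F.conv2d(x, k, padding=1, groups=x.shape[1])

def mprnet_loss(restored_stages, y, lam: float = 0.5):
    """Sum of Charbonnier + edge losses over the three stage outputs (sketch)."""
    return sum(charbonnier(x, y) + lam * charbonnier(laplacian(x), laplacian(y))
               for x in restored_stages)
```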

SwinIR model for super-resolution

SwinIR consists of three modules: shallow feature extraction, deep feature extraction, and high-quality (HQ) image reconstruction. SwinIR employs the same feature extraction modules for all restoration tasks but uses different reconstruction modules for different tasks. Given a low-quality input, SwinIR uses a convolution layer to extract the shallow feature F0. Then, a deep feature extraction module consisting of residual Swin Transformer blocks and a convolution layer is used to extract the deep feature FDF. Specifically, the intermediate features F1, F2, …, FK and the output deep feature FDF are extracted block by block. The high-quality image IRHQ is then reconstructed by feeding the aggregated shallow and deep features to a reconstruction module, with the loss as follows:

$${{\mathcal{L}}}_{{SwinIR}}={{||}X-Y{||}}_{1}$$

The loss is simply an \({{\mathcal{L}}}_{1}\) pixel loss, where \(X\) and \(Y\) represent the restored image and the ground truth, respectively.

The shallow feature mainly contains low frequencies, while the deep feature focuses on recovering lost high frequencies. With a long skip connection, SwinIR can transmit low-frequency information directly to the reconstruction module, which can help the deep feature extraction module focus on high-frequency information and stabilize training.
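A minimal structural sketch of this data flow is given below; the residual Swin Transformer blocks are replaced by a plain convolutional placeholder, and the single-channel input, channel count, and 2× upscaling are illustrative assumptions, not SwinIR's actual configuration.

```python
import torch.nn as nn

class SwinIRSkeleton(nn.Module):
    """Sketch of the SwinIR data flow: shallow features, deep features, long skip, reconstruction."""

    def __init__(self, channels: int = 64, upscale: int = 2):
        super().__init__()
        self.shallow = nn.Conv2d(1, channels, 3, padding=1)      # shallow feature extraction
        self.deep = nn.Sequential(                               # placeholder for RSTB blocks + conv
            nn.Conv2d(channels, channels, 3, padding=1), nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.reconstruct = nn.Sequential(                        # HQ reconstruction with pixel shuffle
            nn.Conv2d(channels, channels * upscale ** 2, 3, padding=1),
            nn.PixelShuffle(upscale),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, x):
        f0 = self.shallow(x)
        f_df = self.deep(f0)
        # Long skip connection carries low-frequency content directly to reconstruction.
        return self.reconstruct(f0 + f_df)
```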

Image preprocessing and training

Here, we used F-actin and ER from the figshare repository as the target datasets; Golgi and Microtubule from the Zenodo repository were used as the sources for F-actin and ER, respectively. Although we have the ground truths of all four datasets, we use the ground truths of the source datasets only in the super-resolution task for training; in all other conditions, we use the ground truths only to compute the evaluation metrics. There are 51 cells in the F-actin dataset with 12 shots per cell, and we randomly chose 10 cells (approximately 20% of the data volume) as the test set, meaning that 120 images were used for testing and 492 images for training. For ER, we used the same strategy: ER contains 68 cells with 6 shots per cell, and we chose 13 cells (approximately 20%) as the test set and 55 cells as the training set, that is, 78 test images and 330 training images. For the sources, 388 images of Golgi and 618 images of Microtubule were used.

In the denoising, deblurring, and super-resolution tasks, Golgi was set as the source for F-actin, and Microtubule was set as the source for ER. In these tasks, we remove only the background of the images by subtracting the 10% minimum value of the whole image, setting negative values to 0, and then normalizing the image to the range 0 to 1. In the deblurring task specifically, because we do not have real blurred versions of these datasets, we generate blurred images by adding motion blur to each image at different angles and strengths.
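A hedged sketch of this preprocessing step follows; interpreting the "10% minimum value" as the 10th intensity percentile of the image is an assumption of the sketch.

```python
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """Background removal and normalization to [0, 1] (sketch).

    The background level is taken here as the 10th intensity percentile of the whole image;
    this reading of the "10% minimum value" is an assumption of the sketch.
    """
    background = np.percentile(image, 10)
    image = np.clip(image - background, 0, None)   # subtract background, clip negatives to 0
    max_val = image.max()
    return image / max_val if max_val > 0 else image
```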

In our training protocol, as detailed in Supplementary Table 1, we employ seven distinct configurations. AllTrg, PartTrg, and Trg involve training the model with datasets matching the distribution of the inference dataset called the target. These settings differ in terms of the volume of imagery utilized: AllTrg employs the entire training set, PartTrg uses a subset of 10 images, and Trg leverages the inference image itself. In contrast, Src, the conventional approach, trains the model with datasets that typically do not match the inference dataset’s distribution, called the source. Finetune refines the model trained under the Src setting using the inference image. Finally, SrcTrg and HitchLearning integrate both the source training data and the inference image in the training process.
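The composition of the seven settings can be summarized compactly; the mapping below is an illustrative restatement of the text and Supplementary Table 1 with descriptive labels, not the actual data-loading configuration.

```python
# Illustrative summary of the seven training configurations (descriptive labels only).
# "source" = historical dataset with a different distribution; "target training set" = data
# matching the inference distribution; "inference image" = the single target image.
SETTINGS = {
    "AllTrg":        ["target training set (all images)"],
    "PartTrg":       ["target training set (10 images)"],
    "Trg":           ["inference image only"],
    "Src":           ["source dataset (all images)"],
    "Finetune":      ["source dataset (pretraining)", "inference image (fine-tuning)"],
    "SrcTrg":        ["source dataset", "inference image"],
    "HitchLearning": ["source dataset transferred by FDA", "inference image"],
}
```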

Assessment metrics calculation

In this work, we utilized four key metrics for image quality evaluation: the mean absolute error (MAE), mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM). These metrics jointly allowed for robust and comprehensive assessment of the quality of image reconstruction methodologies.

The MAE is a widely used statistical metric for evaluating the accuracy of predictions in regression analysis. It is calculated as the average of the absolute differences between predicted and observed (actual) values. The MAE provides a straightforward interpretation of the average magnitude of errors in predictions without considering their direction and is calculated as

$${MAE}=\frac{1}{{mn}}\mathop{\sum }\limits_{i=1}^{m}\mathop{\sum }\limits_{j=1}^{n}\left|I\left(i,j\right)-K\left(i,j\right)\right|$$

where \(I\) and \(K\) are the original and compared images, respectively; \(m\) and \(n\) are the dimensions of the images; and \(i\) and \(j\) are the pixel positions.

The MSE quantifies the average of the squared errors between the original and compared images and is a common metric in statistical analysis, particularly in regression analysis and forecasting. It evaluates an estimator by the average of the squared errors, that is, the average squared difference between the estimated and actual values, calculated as

$${MSE}=\frac{1}{{mn}}\mathop{\sum }\limits_{i=1}^{m}\mathop{\sum }\limits_{j=1}^{n}{\left[I\left(i,j\right)-K\left(i,j\right)\right]}^{2}$$

where \(I\) and \(K\) are the original and compared images, respectively; \(m\) and \(n\) are the dimensions of the images; and \(i\) and \(j\) are the pixel positions.

The PSNR is commonly employed in image processing to measure the quality of reconstructed images, particularly for assessing the quality of reconstructed or compressed images in comparison to their original, uncompressed versions. The PSNR is based on the mean squared error (MSE) between the original image and a compressed or reconstructed image and provides a measure of reconstruction quality. It is defined as

$${PSNR}=20\times {\log }_{10}\left({MAX}_{I}\right)-10\times {\log }_{10}\left({MSE}\right)$$

where \({MAX}_{I}\) is the maximum possible pixel value of the image.

The SSIM quantifies image quality by comparing image luminance, contrast, and structure; this metric has a decimal value between -1 and 1, and a value of 1 is only reachable when two sets of data are identical. The SSIM was developed to provide a more accurate and consistent assessment of the perceived quality of an image. The SSIM is based on the understanding that the human visual system detects structural information in a scene; thus, a measure of structural information can provide an indication of perceived image quality. A higher SSIM index indicates greater structural similarity between two images. It is calculated using the following formula:

$${SSIM}=\frac{\left(2{\mu }_{x}{\mu }_{y}+{c}_{1}\right)\times \left(2{\sigma }_{{xy}}+{c}_{2}\right)}{\left({{\mu }_{x}}^{2}+{{\mu }_{y}}^{2}+{c}_{1}\right)\times \left({{\sigma }_{x}}^{2}+{{\sigma }_{y}}^{2}+{c}_{2}\right)}$$

where \(x\) and \(y\) are the two images being compared; \({\mu }_{x}\) and \({\mu }_{y}\) are the averages of \(x\) and \(y\), respectively; \({{\sigma }_{x}}^{2}\) and \({{\sigma }_{y}}^{2}\) are the variances of \(x\) and \(y\), respectively; \({\sigma }_{{xy}}\) is the covariance of \(x\) and \(y\); and \({c}_{1}\) and \({c}_{2}\) are small constants that stabilize the division when the denominators are close to zero.
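The four metrics follow directly from the formulas above; a hedged NumPy/scikit-image sketch is given below, where data_range = 1.0 assumes images normalized to [0, 1].

```python
import numpy as np
from skimage.metrics import structural_similarity

def evaluate(restored: np.ndarray, gt: np.ndarray, data_range: float = 1.0) -> dict:
    """MAE, MSE, PSNR and SSIM between a restored image and its ground truth (sketch).

    `data_range=1.0` assumes images normalized to [0, 1].
    """
    mae = np.mean(np.abs(restored - gt))
    mse = np.mean((restored - gt) ** 2)
    psnr = 20 * np.log10(data_range) - 10 * np.log10(max(mse, 1e-12))  # guard against mse = 0
    ssim = structural_similarity(restored, gt, data_range=data_range)
    return {"MAE": float(mae), "MSE": float(mse), "PSNR": float(psnr), "SSIM": float(ssim)}
```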