Abstract
Deep learning has significantly improved the performance of single-molecule localization microscopy (SMLM), but many existing methods remain computationally intensive, limiting their applicability in high-throughput settings. To address these challenges, we present LiteLoc, a scalable framework for high-throughput SMLM data analysis. LiteLoc employs a lightweight neural network architecture and integrates parallel processing across central processing unit (CPU) and graphics processing unit (GPU) resources to reduce latency and energy consumption without sacrificing localization accuracy. LiteLoc demonstrates substantial gains in processing speed and resource efficiency, making it an effective and scalable tool for routine SMLM workflows in biological research.
Introduction
In recent years, deep learning has emerged as a powerful tool for improving data analysis accuracy, especially in three-dimensional single-molecule localization microscopy (SMLM)1,2,3. In SMLM, a super-resolved image is reconstructed by precisely localizing individual molecules across tens of thousands of single-molecule images. Consequently, the technique generates an enormous amount of data that must be analyzed promptly to obtain timely and actionable results, and this computational burden often restricts its throughput4. Deep learning-based methods have been successfully applied to localize densely overlapping single-molecule data5,6, significantly reducing the number of frames required and accelerating the imaging speed of SMLM. However, existing methods are often designed with high complexity to achieve the desired localization precision, which results in long latency and high energy consumption7, hindering the application of deep learning-based algorithms to high-throughput super-resolution imaging.
The efficiency of deep learning-based analysis algorithms for bioimaging has attracted growing interest8. Several techniques have been proposed to enhance efficiency by compressing models, such as network pruning9, quantization10, and knowledge distillation11. At high degrees of compression, however, these techniques cause loss of accuracy and poor generalization due to structural deterioration of the deep learning models12. Deep learning-based localization is a structured prediction task that transforms pixelated images into localization lists of varying length and requires high localization precision. The trade-off between precision and degree of compression therefore hinders the direct application of these techniques to single-molecule localization. Furthermore, beyond network inference itself, data loading, pre-processing, post-processing, and writing are also time-consuming for the ultra-large volumes of data generated by high-throughput imaging. Executing these processes efficiently is thus crucial for handling high-throughput imaging data.
Here, we designed a deep learning-based framework for single-molecule localization, named LiteLoc, which is optimized for low computational overhead through asynchronous execution on graphics processing units (GPUs) and central processing units (CPUs). LiteLoc includes a lightweight deep learning network that significantly accelerates network inference without compromising precision, and it even shows advantages for single-molecule data imaged with large PSF models (i.e., better localization precision and fewer artifacts). Moreover, we present a scalable and competitive data analysis framework that massively increases deep learning-based SMLM data analysis throughput. This framework is highly scalable and compatible with heterogeneous GPU clusters, with no need for inter-GPU communication. As a result, it maximally exploits the computational power of modern hardware, enabling seamless parallel processing and efficient resource utilization. The lightweight network, supported by the parallel computation framework, enables LiteLoc to approach the full acquisition speed of modern sCMOS cameras.
Results
Scalable and lightweight deep learning for SMLM data analysis
In conventional images, object features are usually spread across large regions, so tasks such as image classification or restoration require processing the whole image. In contrast, SMLM data can be naturally decomposed into numerous small spatiotemporal blocks, each containing all the necessary information about the local molecular events, independent of other areas of the image (Fig. 1a). However, existing deep learning-based software for SMLM typically processes the entire image sequentially, often operating in a serial mode with only one GPU or CPU active at a time. To fully exploit both the intrinsic characteristics of SMLM data and available computing resources, we investigated state-of-the-art (SOTA) deep learning-based SMLM analysis pipelines and identified two key areas for improving throughput: (1) implementing full-pipeline parallel computation to maximize hardware utilization; and (2) designing a lightweight network architecture to enable faster and more efficient localization.
a The spatiotemporal locality inherent in SMLM data enables the independent analysis of small data blocks without requiring synchronization to integrate global information. b Schematic timelines for deep learning-based SMLM data analysis in serial and parallel modes. c Architecture of the LiteLoc network, including a lightweight coarse feature extractor (CFE) and a fine feature extractor (FFE).
In general, deep learning-based SMLM analysis can be divided into three steps: data reading/preprocessing, inference, and postprocessing/data writing (Fig. 1b, “Methods”). Preprocessing and postprocessing are typically handled by the CPU, while neural network inference is performed on the GPU. Among these steps, network inference is usually the most computationally intensive, often accounting for the majority of the total processing time. Naturally, this step can be accelerated by leveraging multiple GPUs. Inspired by popular deep learning frameworks such as PyTorch and TensorFlow, which provide built-in support for multi-GPU computation, we designed LiteLoc to specifically exploit the independence of SMLM data blocks without requiring inter-GPU synchronization. LiteLoc enables multiple GPUs to simultaneously retrieve and process data blocks in a competitive, unsynchronized manner (Supplementary Fig. 1).
After the optimized deployment of the deep learning-based localization framework across multiple GPUs and CPUs, we found that the inference process remains the primary computational bottleneck, accounting for approximately 90% of the overall analysis time. To tackle this challenge, LiteLoc incorporates a lightweight deep learning network specifically designed for single-molecule localization, aiming to accelerate inference without compromising localization precision. To learn informative features with minimal computational cost, the LiteLoc network integrates two key modules: a coarse feature extractor (CFE) and a fine feature extractor (FFE) (Fig. 1c and Supplementary Table 2). The CFE employs a series of dilated convolutions13 with exponentially increasing receptive fields, enabling the extraction of both global and local features with far fewer parameters than traditional convolutional layers. The FFE further refines the extracted features to capture localization information using a simplified U-Net14 architecture (Supplementary Fig. 2). Together, these modules enable high-precision localization with low computational overhead. As shown in Fig. 2a and Supplementary Table 3, the LiteLoc network contains only 1.33 million parameters, approximately half the number in DECODE6 (2.20 M). For an input stack of 10 images (each 128 × 128 pixels), LiteLoc requires 71.08 giga floating-point operations (GFLOPs), only about 6.6% and 41.7% of the computational demand of DeepSTORM3D5 (1068.77 GFLOPs) and DECODE (170.46 GFLOPs), respectively.
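Parameter counts such as these can be reproduced for any PyTorch model with a one-line reduction; a minimal sketch (the toy module below stands in for the actual LiteLoc network):

```python
import torch

def count_parameters(model: torch.nn.Module) -> float:
    """Return the number of trainable parameters in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Example with a toy module; substituting the actual LiteLoc network
# should reproduce the 1.33 M figure quoted above.
toy = torch.nn.Conv2d(3, 64, kernel_size=3)
print(f"{count_parameters(toy):.2f} M parameters")
```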
a Analysis speed of LiteLoc, DECODE, and DeepSTORM3D running on a single GPU. Insets show model size (number of parameters) and computational workload. b LiteLoc demonstrates near-linear scalability of analysis speed on multiple GPUs. All tests were conducted on NVIDIA GeForce RTX 4090 GPUs. c, d Localization accuracy for the astigmatic PSF and the 6 µm DMO-Tetrapod PSF, respectively. The blue line for \({\mathrm{CRLB}}_{x}^{1/2}\) is underneath the green line for \({\mathrm{CRLB}}_{y}^{1/2}\). 5000 signal photons and 50 background photons were used for each single molecule. 2000 single-emitter images with random x, y positions were generated for each axial position. e, f Performance evaluation on simulated datasets with different densities and SNRs using 3D Efficiency and RMSE, based on the astigmatic PSF and the 6 µm DMO-Tetrapod PSF, respectively. Scale bar, 1 µm (c, d).
LiteLoc enables fast, accurate, and generalizable localization
To comprehensively validate LiteLoc’s performance, we first compared its analysis speed with SOTA methods using simulated data. As shown in Supplementary Table 1, DECODE and DeepSTORM3D achieve overall data analysis throughputs of 22.6 MB/s and 1.2 MB/s, respectively, on a single RTX 4090 GPU. LiteLoc demonstrates superior computational efficiency, with a pure network inference speed of 85 MB/s: 3.3 times that of DECODE (25.8 MB/s) and 35.4 times that of DeepSTORM3D (2.4 MB/s). When implemented within the LiteLoc framework, DECODE exhibits an immediate 17% improvement in analysis speed (26.4 MB/s) over its original deployment (22.6 MB/s), while the overall analysis throughput of LiteLoc is 76.4 MB/s. To further leverage LiteLoc’s scalable computation framework, we deployed the networks on a GPU workstation equipped with eight RTX 4090 GPUs. With high parallelism between the data loader, analyzer, and saver processes, LiteLoc reached a throughput of 567.6 MB/s (roughly 7.4 times the single-GPU throughput of 76.4 MB/s, i.e., ~93% parallel efficiency on eight GPUs), a 25-fold improvement over DECODE’s original performance (22.6 MB/s). This enables real-time analysis of high-throughput super-resolution imaging powered by modern sCMOS cameras. The scalability stems from LiteLoc’s tailored optimization for SMLM, which minimizes the overhead of data transfer and synchronization, critical bottlenecks in common multi-GPU tasks, especially in systems lacking efficient communication support such as NVLink15. Notably, LiteLoc is also compatible with GPUs of different computing capabilities, where the analysis speed is proportional to the additive computing capability of the available GPUs (Supplementary Table 1 and Fig. 2b).
To evaluate the performance of the LiteLoc network, we first assessed its single-molecule localization accuracy on simulated datasets (Supplementary Methods). The root mean square error (RMSE) values of both LiteLoc and DECODE approached the theoretical localization precision limit (Cramér-Rao lower bound, CRLB) for the widely used astigmatic PSF (70 nm astigmatism, Fig. 2c). For the deformable mirror-based optimal (DMO) Tetrapod PSF16 (optimized for a 6 µm axial range), the lateral and axial RMSE values of LiteLoc are about 19.4% and 17.7% better than those of DECODE, respectively (Fig. 2d). This advantage is probably attributable to the larger receptive field of the dilated convolutions in LiteLoc, which captures the full features of large PSF patterns. We further evaluated LiteLoc on simulated SMLM datasets with varying molecule densities and SNRs using both astigmatic and DMO-Tetrapod PSFs (Supplementary Methods, Fig. 2e, and Supplementary Fig. 3). Common metrics, including the Jaccard index, RMSE, and 3D Efficiency17, were used for comparison (Supplementary Note 2). Training datasets for LiteLoc and DECODE were simulated with identical settings to ensure fairness (Supplementary Note 3). After testing various dilation factors, we settled on [2, 4, 8, 16] for LiteLoc, owing to their superior performance on large PSFs and comparable results on the astigmatic PSF, eliminating the need for user adjustment (Supplementary Note 4 and Supplementary Tables 4 and 5). The results indicate that LiteLoc performs comparably to DECODE with the astigmatic PSF and slightly outperforms it with the DMO-Tetrapod PSF (Supplementary Tables 6 and 7), consistent with the single-molecule CRLB comparison.
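For reference, the CRLB curves in Fig. 2c, d follow the standard Fisher-information formulation for an estimator of the emitter parameters \(\theta=(x,y,z,N,{bg})\) under Poisson-distributed pixel counts; with \(\mu_{i}(\theta)\) the expected count in pixel \(i\) (signal plus background), the standard expressions are

$$I_{jk}(\theta)=\sum_{i}\frac{1}{\mu_{i}(\theta)}\frac{\partial\mu_{i}(\theta)}{\partial\theta_{j}}\frac{\partial\mu_{i}(\theta)}{\partial\theta_{k}},\qquad {\mathrm{CRLB}}_{x}^{1/2}=\sqrt{{\left[I(\theta)^{-1}\right]}_{xx}},$$

where the plotted precision limits for x, y, and z are the square roots of the corresponding diagonal entries of the inverse Fisher information matrix.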
LiteLoc is a supervised learning approach whose training labels are generated from a PSF model, so its performance depends on the labeled training data. Because only PSF models are used for training data generation, it is applicable to biological structures of different shapes imaged with the same PSF model. Still, mismatches in SNR, emitter density, or PSF aberrations between synthetic and real experimental data can degrade network performance (Supplementary Note 5 and Supplementary Table 8). To ensure precise localization and enhance practical applicability, our software estimates the single-molecule signal photon range and background from the raw data before training. In density-mismatched simulated experiments, we found that LiteLoc trained on relatively high-density data generalizes well to low-density data, while performance degrades slightly when high-density data are analyzed by a network trained on low-density data. To alleviate the effect of PSF model mismatch in deep sample imaging, we employed a robust training strategy18 in which the training PSF model was varied by adding zero-mean normally distributed perturbations (standard deviation λ/100) to the Zernike aberration coefficients; this reduced the RMSE by ~36% for an unknown 20 nm spherical aberration. To explore broader parameter ranges, we conducted systematic experiments investigating LiteLoc’s generalization under mismatched SNRs when training with extended photon ranges. As shown in Supplementary Table 9, expanded photon-range training has little effect on medium- and high-SNR datasets. However, the RMSE deteriorates by ~42% for low-SNR datasets under expanded photon-range training. These results demonstrate that broader parameter ranges enhance practical applicability, while also highlighting the importance of SNR-aware optimization for low-SNR scenarios.
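As a concrete illustration, this aberration perturbation amounts to a few lines of code; the sketch below assumes Zernike coefficients expressed in the same length units as the wavelength, and the function name is ours rather than the published API:

```python
import numpy as np

def perturb_zernike(zernike_coeffs: np.ndarray, wavelength: float,
                    rng: np.random.Generator) -> np.ndarray:
    """Add zero-mean Gaussian noise with standard deviation lambda/100
    to each Zernike aberration coefficient (robust-training step)."""
    sigma = wavelength / 100.0
    return zernike_coeffs + rng.normal(0.0, sigma, size=zernike_coeffs.shape)

# Example: 680 nm emission, 21 Zernike modes, a fresh perturbation per
# simulated training batch so the network sees a distribution of PSFs.
rng = np.random.default_rng(seed=0)
coeffs = np.zeros(21)                       # nominal aberrations (nm)
noisy_coeffs = perturb_zernike(coeffs, wavelength=680.0, rng=rng)
```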
LiteLoc enables high-fidelity 3D reconstruction of biological samples
We then applied LiteLoc to biological data to verify its performance in real imaging experiments. We first imaged the reference standard nuclear pore complex protein Nup96 in U2OS cells (“Methods”) using the astigmatic PSF and reconstructed the super-resolution image with different algorithms. As shown in Fig. 3a–c, LiteLoc, DECODE, and DeepSTORM3D all nicely resolved the double-ring structure in 3D near the coverslip. However, DeepSTORM3D tended to misidentify more background as emitters. When imaging with the 6 µm DMO-Tetrapod PSF, the localizations of DeepSTORM3D appeared dispersed (Fig. 3e) and fewer emitters were detected, as shown in the side view of the cell (Fig. 3i). The super-resolution image reconstructed using DECODE is prone to grid-like artifacts, whereas no obvious artifacts are observed in the images reconstructed by LiteLoc (Fig. 3e).
a Top view of Nup96 in a U2OS cell reconstructed by LiteLoc based on the astigmatic PSF. The inference times of DeepSTORM3D, DECODE, and LiteLoc were approximately 1.6 h, 310 s, and 85 s on a single RTX 4090 GPU, respectively. b, c Magnified top and side views of the regions indicated by boxes (b and c) in (a), reconstructed using DeepSTORM3D, DECODE, and LiteLoc, respectively. d Top view of Nup96 in a U2OS cell reconstructed by LiteLoc using the 6 µm DMO-Tetrapod PSF. The inference times of DeepSTORM3D, DECODE, and LiteLoc were approximately 4.2 h, 13 min, and 225 s on a single RTX 4090 GPU, respectively. e Zoomed-in view of the region indicated by box (e) in (d), reconstructed using DeepSTORM3D, DECODE, and LiteLoc, respectively. f, g Autocorrelation and normalized Fourier spectra along the x-direction for the region marked (e) in (d), reconstructed by DeepSTORM3D, DECODE, and LiteLoc, respectively. The blue circle indicates the peak frequency and amplitude at the inverse of the single-molecule image pixel size (110 nm). h Side view of the region indicated by box (h) in (d). i Zoomed-in view of the region indicated by box (i) in (h), reconstructed using DeepSTORM3D, DECODE, and LiteLoc, respectively. Scale bars, 5 µm (a, d), 200 nm (b, c), 500 nm (e), 1 µm (h, i).
For 3D astigmatic PSF imaging at relatively high molecule density, grid artifacts also emerge in the reconstruction produced by DeepSTORM3D (Supplementary Fig. 4). This is because DeepSTORM3D predicts binary up-sampled voxels for 3D localizations, an approach that introduces discretization artifacts: the predicted voxels are inherently discrete, and the localization precision is constrained by the voxel size. In contrast, both the LiteLoc and DECODE networks are trained to predict pixel-wise Gaussian distributions for lateral subpixel offsets. However, DECODE is biased toward placing predictions with large uncertainties near the pixel center, as the likelihood changes slowly within the pixel (Supplementary Fig. 5). The superior ability of LiteLoc to mitigate grid artifacts can be attributed to its network design: dense connections promote confident localizations, and dilated convolutions introduce sparse pixel-level computation, which may relieve the center bias, as demonstrated in the ablation experiments (Supplementary Fig. 6).
To further validate the robustness and versatility of LiteLoc, we compared its performance with ThunderSTORM, a conventional non-deep-learning method. LiteLoc exhibited superior axial resolution on astigmatic nuclear pore datasets (Supplementary Fig. 7) and provided sharper reconstructions of microtubule structures on high-density datasets (Supplementary Fig. 8), demonstrating its distinct advantages under challenging imaging conditions. We also analyzed DNA-PAINT data to demonstrate LiteLoc’s applicability to different SMLM modalities; the results show that LiteLoc nicely reconstructs the hollow structure of microtubules in 3D (Supplementary Methods and Supplementary Fig. 9). To ensure robust performance across different imaging conditions, LiteLoc incorporates both Cspline-interpolated19 and vectorial PSF20 models for network training. When combined with in-situ PSF modeling21, LiteLoc can compensate for sample-induced, depth-dependent aberrations (Supplementary Fig. 10).
Discussion
To summarize, LiteLoc represents a significant advancement in high-throughput SMLM by addressing the issue of computational efficiency in existing deep learning-based localization methods. Its lightweight architecture, combined with an optimized use of computational resources, enables substantial acceleration of SMLM data processing without sacrificing localization accuracy, making it particularly suitable for large-scale experiments in modern SMLM workflows. Beyond its lightweight network design, LiteLoc features a modular and generalizable acceleration framework, allowing users to substitute their preferred localization models while still benefiting from the same efficient parallel processing pipeline.
LiteLoc is specifically designed to tackle challenges inherent to SMLM, including complex PSF models, the presence of grid artifacts in structured datasets, and the demand for high localization precision under varying conditions, demonstrating its potential for widespread application in super-resolution imaging of different biological samples. Moreover, LiteLoc’s robustness to different engineered PSFs and imaging conditions underscores its utility across diverse experimental SMLM setups. Another key strength of LiteLoc lies in its real-time processing capability, which is of growing importance for closed-loop SMLM systems22, online quality control23, etc. By lowering the computational barrier, LiteLoc supports the broader adoption of deep learning-based localization techniques in standard SMLM pipelines, including those operating under limited computational resources.
While LiteLoc demonstrates strong performance across various experimental conditions, certain limitations remain. As a supervised learning-based method, its performance may degrade when there is a mismatch between the training and testing datasets, such as differences in SNRs, emitter densities, or PSF aberrations. Although strategies like expanding the SNR range, slightly increasing the training density, or applying robust training can mitigate the impact of such mismatches, we recommend that users first evaluate the SNR of their raw data using the assessment tool provided in our software, estimate density through visual inspection, and obtain a realistic PSF model that closely resembles the actual imaging system. The latter can be achieved through our built-in calibration function or other publicly available tools21,24.
In conclusion, LiteLoc provides a robust, efficient, and scalable solution tailored to the specific needs of high-throughput SMLM. Its advantages position it as a valuable tool for researchers working with large SMLM datasets. Future development could integrate LiteLoc with downstream SMLM data analysis pipelines, such as clustering, tracking, or structural reconstruction, to provide a seamless and efficient end-to-end workflow.
Methods
Implementation details of the scalable competitive SMLM data analysis framework
The framework is compatible with Linux, Windows, and macOS operating systems. The entire data analysis pipeline of LiteLoc can be divided into three steps: preprocessing, inference, and postprocessing. To fully utilize the valuable GPU computing resources and the multi-core capabilities of common CPUs, we designed three parallel processes: a data loader process for file reading and data preprocessing, an analyzer process for network inference, and a saver process for result postprocessing and file writing (Fig. 1 and Supplementary Fig. 1). Queues based on shared memory are used for efficient inter-process communication: all processes independently and continuously push data to, or retrieve data from, these shared-memory queues. The data analysis framework is built on multi-process programming to efficiently parallelize these steps.
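The sketch below illustrates this three-process layout using Python’s multiprocessing primitives; the dummy network and random frames are placeholders, so it shows the structure of the design rather than the released implementation:

```python
import torch
import torch.multiprocessing as mp

# Minimal, self-contained sketch of the loader/analyzer/saver layout
# described above (not the released implementation).

def loader(batch_q, n_batches, n_analyzers):
    for i in range(n_batches):
        batch_q.put((i, torch.randn(10, 1, 128, 128)))  # "preprocessed" batch
    for _ in range(n_analyzers):
        batch_q.put(None)                 # one stop signal per analyzer

def analyzer(batch_q, result_q, device):
    model = torch.nn.Conv2d(1, 1, 3, padding=1).to(device)  # dummy network
    while True:
        item = batch_q.get()              # competitive, unsynchronized retrieval
        if item is None:
            result_q.put(None)
            break
        idx, batch = item
        with torch.no_grad():
            out = model(batch.to(device))
        result_q.put((idx, out.cpu()))

def saver(result_q, n_analyzers):
    done = 0
    while done < n_analyzers:
        item = result_q.get()
        if item is None:
            done += 1
            continue
        idx, out = item                   # real code would format and write here
        print(f"batch {idx}: output shape {tuple(out.shape)}")

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)   # safe start method for CUDA
    devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())] or ["cpu"]
    batch_q, result_q = mp.Queue(maxsize=32), mp.Queue(maxsize=32)
    procs = [mp.Process(target=loader, args=(batch_q, 8, len(devices)))]
    procs += [mp.Process(target=analyzer, args=(batch_q, result_q, d))
              for d in devices]
    procs += [mp.Process(target=saver, args=(result_q, len(devices)))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```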
First, the framework implements a data loader process responsible for reading consecutive SMLM files and performing data preprocessing. Before reading the files, the framework inspects the entire data folder and creates a reading plan to process the data in units of blocks in CPU memory. Users can define the block size, effectively mitigating CPU memory issues associated with loading large datasets. After loading a data block into CPU memory, the data loader process pads it with an extra frame at both ends and splits it into batches with a temporal overlap of 2 frames. This seamlessly provides a 3-frame temporal context for LiteLoc inference on each frame in a batch. In addition, if the image size is too large or GPU memory is limited, the data loader process employs a divide-and-conquer strategy: it splits the large field-of-view (FOV) images into mini-FOV images with a certain degree of overlap, similar to FD-DeepLoc18. These spatiotemporally split batches, along with header information (frame number, mini-FOV position, overlap size, etc.), are fed into a batch queue. The queue leverages the shared memory mechanisms provided by the Python multiprocessing library, using memory addresses instead of direct data copies. This optimization reduces both memory usage and communication time between processes.
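A minimal sketch of this padding-and-splitting step follows (block and batch sizes are illustrative, and boundary handling in the released code may differ):

```python
import torch

def split_with_context(block: torch.Tensor, batch_size: int):
    """Split a (T, H, W) image block into batches that overlap by 2 frames,
    after replicating the first and last frame, so every original frame
    carries a 3-frame temporal context."""
    padded = torch.cat([block[:1], block, block[-1:]], dim=0)  # pad both ends
    step = batch_size - 2                # 2-frame overlap between batches
    return [padded[s:s + batch_size]
            for s in range(0, padded.shape[0] - 2, step)]

block = torch.randn(100, 128, 128)       # one 100-frame data block
for b in split_with_context(block, batch_size=12):
    print(tuple(b.shape))                # each batch yields batch_size-2 frames
```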
At the same time, the framework initiates multiple analyzer processes, each corresponding to an available GPU device. We employ a data-parallel strategy in which each GPU holds a copy of the LiteLoc network and performs inference independently. Unlike common parallel data processing in deep learning-based analysis, where batch results must be aggregated, each LiteLoc analyzer retrieves data from the batch queue in a competitive manner without any synchronization. This design effectively removes the bottleneck arising from GPUs with varying computing capabilities, since analyzers do not need to wait for each other to complete their tasks. After inference on each GPU, the network outputs multi-channel predictions, which require further postprocessing to transform them into molecule lists. To accelerate this step, we also implemented the time-consuming extraction of molecule lists from the network predictions on the GPU. This part of the postprocessing runs in the same process, leveraging the GPU for pixel-wise molecule detection using convolutions and avoiding the transfer of large-volume network outputs from GPU to CPU. The intermediate results are then fed into a result queue shared between the analyzer and saver processes.
Finally, the framework includes a background saver process that monitors the result queue. Each item in the queue represents intermediate results from a batch of data, which may overlap spatially. The saver process retrieves these items and performs additional postprocessing tasks, including mapping the mini-FOV results back to their correct FOV positions, filtering spatially overlapping results, unit conversion, and data formatting for storage. The formatted results are continuously written to a user-defined file.
LiteLoc architecture
LiteLoc consists of a coarse feature extractor (CFE), a fine feature extractor (FFE), and an output module (Fig. 1c and Supplementary Table 2). Except for the convolutions in the output module, each convolutional layer is followed by a rectified linear unit and batch normalization. The input comprises three consecutive images that are analyzed independently by the CFE, and the outputs are concatenated and passed to the FFE.
We assume that the input image contains redundant information that is not useful for localization. Therefore, LiteLoc begins by extracting down-sampled features from the normalized input image using a convolutional layer and max pooling. These down-sampled features are subsequently passed through three convolutional layers to enhance feature learning. Given that PSF patterns vary with different optical setups and axial positions, four dilated convolutions with exponentially increasing dilation factors are introduced. This architecture allows the extraction of multi-scale features without losing resolution, learning local and global features of the PSF simultaneously through receptive fields of various sizes. We found that a single group of dilated convolutions is insufficient to extract features for localization. Therefore, we designed a simplified U-Net structure (Supplementary Fig. 2) to capture more abstract patterns through down-sampling and up-sampling operations. Additionally, specifically designed dense and residual connections among layers alleviate the vanishing gradient problem and encourage feature reuse by combining features from different layers.
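To make the idea concrete, the snippet below sketches a group of dilated convolutions with exponentially increasing dilation factors followed by a dense concatenation of their outputs; channel counts and layer composition are illustrative, not the published CFE specification:

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """Illustrative dilated-convolution block: each conv is followed by
    ReLU and batch normalization, as in the LiteLoc convention."""
    def __init__(self, channels: int = 64, dilations=(2, 4, 8, 16)):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=d, dilation=d),  # padding=d keeps H, W fixed
                nn.ReLU(inplace=True),
                nn.BatchNorm2d(channels),
            )
            for d in dilations
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = []
        for layer in self.layers:
            x = layer(x)                  # receptive field grows exponentially
            feats.append(x)
        return torch.cat(feats, dim=1)    # dense connection: reuse all scales

x = torch.randn(1, 64, 64, 64)
print(DilatedBlock()(x).shape)            # torch.Size([1, 256, 64, 64])
```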
In the output module, a group of convolutions transforms the features into the final outputs. Similar to DECODE, the outputs include a pixel-wise probability map p indicating whether an emitter exists, continuous subpixel offsets \(\Delta x,\,\Delta y\) relative to the pixel center, the axial distance \(\Delta z\) relative to the focal plane, the brightness (photon count), and the corresponding uncertainties. For post-processing, we first identify pixel candidates as local maxima with \({p}_{k}\) > 0.3 or non-maxima with \({p}_{k}\) > 0.6. This dual-threshold strategy ensures the detection of emitters in adjacent pixels, even when \({p}_{k}\) is not a local maximum. Subsequently, the probability values from the four nearest neighboring pixels around each candidate are aggregated to form an enhanced probability map, which is thresholded at 0.7 to generate a binary detection map. The pixel-wise detections are used to index all predicted offsets and uncertainties, thereby forming the final localization lists. We used SMAP24 to further process and render the localizations returned by LiteLoc.
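The detection step maps naturally onto a few tensor operations; the sketch below is one plausible reading of the scheme (in particular, including the candidate’s own probability in the neighbor aggregation is our assumption), using max pooling for local-maximum detection and a cross-shaped convolution kernel for the aggregation:

```python
import torch
import torch.nn.functional as F

def detect(p: torch.Tensor) -> torch.Tensor:
    """p: (1, 1, H, W) probability map -> boolean detection map."""
    # Local maxima: pixels equal to the maximum of their 3x3 neighborhood.
    is_max = p == F.max_pool2d(p, kernel_size=3, stride=1, padding=1)
    candidates = ((is_max & (p > 0.3)) | (p > 0.6)).float()
    # Aggregate probability from each candidate and its 4 nearest neighbors.
    kernel = torch.tensor([[0., 1., 0.],
                           [1., 1., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3)
    enhanced = F.conv2d(p * candidates, kernel, padding=1)
    return (enhanced * candidates) > 0.7   # binary detection map

p = torch.rand(1, 1, 16, 16)
print(detect(p).sum().item(), "detections")
```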
GPU cluster configurations
In this work, we built a multi-GPU server housed in a single 9U computer rack to significantly improve data analysis throughput. The server is equipped with two Intel Xeon Platinum 8375C CPUs clocked at 2.90 GHz, 16 Samsung 32 GB DDR4 SDRAM modules, and eight NVIDIA GeForce RTX 4090 GPUs with 24 GB of memory each. The motherboard is an ASRock SP2C621D32GM-2T. The server runs the Ubuntu 22.04 LTS operating system, utilizing CUDA 12.4 and cuDNN for optimized GPU performance. The program framework uses PyTorch 2.3. For storage, the server is configured with a 4 TB NVMe SSD to ensure high-speed access and retrieval. The system is powered by four 2600 W power supplies and cooled by a combination of heat-pipe radiators and high-efficiency fans to control operating temperatures.
Optical setups
In this work, we used a custom-built microscope based on a previously established design25, with all optical components and configurations detailed below. The excitation light (iBEAM-SMART-405-S, 150 mW, and iBEAM-SMART-640-S-HP, 200 mW, TOPTICA Photonics) was coupled into a single-mode fiber (P3-405BPM-FC-2, Thorlabs) through a fiber coupler (PAF2-A4A, Thorlabs). By adjusting the translation stage at the fiber output, we were able to achieve different illumination angles. The illumination beam was passed through a laser clean-up filter (ZET405/488/561/640xv2, Chroma). Beam collimation and reshaping were accomplished using a pair of lenses with focal lengths of 75 mm and 400 mm, in conjunction with a slit (SP60, Owis). The collimated beam was subsequently reflected by the main dichroic mirror (ZT405/488/561/640rpcxt-UF2, Chroma) and directed into the objective for sample illumination. Fluorescence emitted from the sample was collected by a high numerical aperture (NA) objective (NA 1.35, UPLSAPO 100 XS or NA 1.5, UPLSAPO 100XOHR, Olympus) and imaged through the tube lens (TTL-180-A, Thorlabs), with the imaging field confined by a slit (SP40, Owis) positioned after the tube lens. We employed two bandpass filters (NF03-405/488/561/635E25 and FF01-676/37-25, Semrock) mounted on a filter wheel to discriminate the emitted fluorescence from the excitation light. We established a 4f system in the imaging path using lenses with focal lengths of 125 mm and 75 mm and positioned a deformable mirror (DM140A-35-P01, Boston Micromachines) at the Fourier plane for PSF engineering and aberration correction. The back focal plane of the objective could be visualized through a lens with a focal length of 40 mm placed before the camera. Image acquisition was performed using an sCMOS camera (Dhyana 400BSI V3, Tucsen), providing a pixel size of 110 nm in the sample plane. Additionally, we incorporated a closed-loop focus lock control system into our optical setup, which used a 785 nm laser (iBEAM-SMART-785-S, 125 mW, TOPTICA Photonics). The laser was reflected by a dichroic mirror (FF750-SDi02, Semrock) and was detected by a quadrant photodiode (SD197-23-21-041, Advanced Photonix Inc), which provided feedback to a z-piezo stage (P-726.1CD, Physik Instrumente) to ensure stable focus.
Biological sample preparation
Cell culture
U2OS cells (Nup96-SNAP no. 300444, Cell Line Services) were cultured in DMEM (catalog no. 10569, Gibco) supplemented with 10% (v/v) fetal bovine serum (catalog no. 10099-141C, Gibco), 100 U/ml penicillin, 100 μg/ml streptomycin (PS; catalog no. 15140-122, Gibco), and 1× MEM NEAA (catalog no. 11140-050, Gibco). Prior to cell plating, high-precision 25-mm round glass coverslips (catalog no. CG15XH, Thorlabs) were cleaned by sequential sonication in 1 M potassium hydroxide (KOH), Milli-Q water, and ethanol, followed by UV irradiation for 30 min. For super-resolution imaging, U2OS cells were plated on the clean coverslips and cultured for 2 days until reaching approximately 80% confluency.
Nup96 labeling
To label Nup96, U2OS-Nup96-SNAP cells were processed as follows26. The cells were prefixed with 2.4% (w/v) paraformaldehyde (PFA) for 30 s, permeabilized in 0.4% (v/v) Triton X-100 for 3 min, and then fixed again in 2.4% PFA for 30 min. The fixing and permeabilization buffers were preheated to 37 °C before use. Cells were quenched in 0.1 M NH4Cl for 5 min and rinsed twice with PBS. To minimize nonspecific binding, cells were blocked with Image-iT FX Signal Enhancer (catalog no. I36933, Invitrogen) for 30 min. For labeling, cells were incubated in a dye solution containing 1 μM SNAP-tag ligand BG-AF647 (catalog no. S9136S, New England Biolabs), 1 mM DTT (catalog no. 1111GR005, BioFroxx), and 0.5% (w/v) bovine serum albumin in PBS for 2 h. Excess dyes were removed by washing the cells three times in PBS for 5 min each. Finally, cells were postfixed with 4% PFA for 10 min, washed three times in PBS for 3 min each, and stored at 4 °C until imaging.
Imaging buffer
Samples were imaged in refractive index-matching buffer consisting of 50 mM Tris-HCl (pH 8.0), 10 mM NaCl, 10% (w/v) glucose, 0.5 mg/ml glucose oxidase (G7141, Sigma), 40 μg/ml catalase (C100, Sigma), 35 mM cysteamine, and 28.5% (v/v) 2,2′-thiodiethanol (166782, Sigma). The refractive index of the final imaging buffer was 1.406.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The data that support the findings of this study are publicly available on Zenodo27 at https://zenodo.org/records/15696718. Source data are provided with this paper.
Code availability
Source code for the software used in this manuscript is contained in Supplementary Software 1 (ref. 28), and updated versions can be accessed from our GitHub page at https://github.com/Li-Lab-SUSTech/LiteLoc.
References
Zhang, P. et al. Deep learning-driven adaptive optics for single-molecule localization microscopy. Nat. Methods 20, 1748–1758 (2023).
Carnevali, D. et al. A deep learning method that identifies cellular heterogeneity using nanoscale nuclear features. Nat. Mach. Intell. 6, 1021–1033 (2024).
Xiao, D. et al. Large-FOV 3D localization microscopy by spatially variant point spread function generation. Sci. Adv. 10, eadj3656 (2024).
Barentine, A. E. et al. An integrated platform for high-throughput nanoscopy. Nat. Biotechnol. 41, 1549–1556 (2023).
Nehme, E. et al. DeepSTORM3D: dense 3D localization microscopy and PSF design by deep learning. Nat. Methods 17, 734–740 (2020).
Speiser, A. et al. Deep learning enables fast and dense single-molecule localization with high accuracy. Nat. Methods 18, 1082–1090 (2021).
Desislavov, R., Martínez-Plumed, F. & Hernández-Orallo, J. Trends in AI inference energy consumption: beyond the performance-vs-parameter laws of deep learning. Sustain. Comput. 38, 100857 (2023).
Zhou, Y. et al. EfficientBioAI: making bioimaging AI models efficient in energy and latency. Nat. Methods 21, 368–369 (2024).
Anwar, S., Hwang, K. & Sung, W. Structured pruning of deep convolutional neural networks. ACM J. Emerg. Technol. Comput. Syst. 13, 1–18 (2017).
Yang, J. et al. Quantization networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 7308–7316 (IEEE, 2019).
Gou, J., Yu, B., Maybank, S. J. & Tao, D. Knowledge distillation: a survey. Int. J. Comput. Vis. 129, 1789–1819 (2021).
Rahul, M., Gupta, H. & Dutta, T. A survey on deep neural network compression: challenges, overview, and solutions. Preprint at https://arxiv.org/abs/2010.03954 (2020).
Yu, F. & Koltun, V. Multi-scale context aggregation by dilated convolutions. In Proceedings of the 4th International Conference on Learning Representations (ICLR, 2016).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (eds Navab, N. et al.) 234–241 (Springer, 2015).
Pal, S. et al. Optimizing multi-GPU parallelization strategies for deep learning training. IEEE Micro 39, 91–101 (2019).
Fu, S. et al. Deformable mirror based optimal PSF engineering for 3D super-resolution imaging. Opt. Lett. 47, 3031–3034 (2022).
Sage, D. et al. Super-resolution fight club: assessment of 2D and 3D single-molecule localization microscopy software. Nat. Methods 16, 387–395 (2019).
Fu, S. et al. Field-dependent deep learning enables high-throughput whole-cell 3D super-resolution imaging. Nat. Methods 20, 459–468 (2023).
Li, Y. et al. Real-time 3D single-molecule localization using experimental point spread functions. Nat. Methods 15, 367–369 (2018).
Leutenegger, M., Rao, R., Leitgeb, R. A. & Lasser, T. Fast focus field calculations. Opt. Express 14, 11277–11291 (2006).
Liu, S. et al. Universal inverse modeling of point spread functions for SMLM localization and microscope characterization. Nat. Methods 21, 1082–1093 (2024).
Li, M. et al. Real-time image resolution measurement for single molecule localization microscopy. Opt. Express 30, 28079–28090 (2022).
Mailfert, S., Djendli, M., Fabre, R., Marguet, D. & Bertaux, N. Quality control maps: real-time quantitative control of single-molecule localization microscopy data. Biophys. J. 124, 1132–1145 (2025).
Ries, J. SMAP: a modular super-resolution microscopy analysis platform for SMLM data. Nat. Methods 17, 870–872 (2020).
Shi, W. et al. Aberration correction for deformable-mirror-based remote focusing enables high-accuracy whole-cell super-resolution imaging. Photonics Res. 12, 821–832 (2024).
Thevathasan, J. V. et al. Nuclear pores as versatile reference standards for quantitative superresolution microscopy. Nat. Methods 16, 1045–1053 (2019).
Fei, Y. et al. Scalable and lightweight deep learning for efficient high accuracy single-molecule localization microscopy. Zenodo. https://doi.org/10.5281/zenodo.15696718 (2024).
Fei, Y. et al. Li-Lab-SUSTech/LiteLoc: LiteLoc (v1.0.0). Zenodo. https://doi.org/10.5281/zenodo.16070098 (2025).
Acknowledgements
This work was supported by the National Key Research and Development Program of China (2024YFF0726003 to Y.L.), Shenzhen Medical Research Fund (B2302038 to Y.L.), National Natural Science Foundation of China (62375116 to Y.L., 623B2044 to S.F.), Shenzhen Science and Technology Innovation Program (JCYJ20220818100416036 and KQTD20200820113012029 to Y.L.), Basic and Applied Basic Research Fund of Guangdong Province (2024A1515011565 to Y.L.), China Postdoctoral Science Foundation (GZC20240651 and 2025T180788 to W.S., 2025T180225, 2025M772887, and GZC20250546 to S.F.), SUSTech Presidential Postdoctoral Fellowship (S.F.), Guangdong Provincial Key Laboratory of Advanced Biomaterials (2022B1212010003), and a startup grant from Southern University of Science and Technology.
Author information
Contributions
Y.L. conceived the concept and supervised the entire project. Y.F. developed the deep learning-based algorithms. S.F. developed the multi-process framework. W.S. developed the PSF calibration software. W.S. and K.F. performed the imaging experiments. Y.F., S.F., W.S., R.W., and T.Z. analyzed the data. Y.L., Y.F., and S.F. wrote the manuscript with input from all other authors.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Alexander Hillsley and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Fei, Y., Fu, S., Shi, W. et al. Scalable and lightweight deep learning for efficient high accuracy single-molecule localization microscopy. Nat Commun 16, 7217 (2025). https://doi.org/10.1038/s41467-025-62662-5