Introduction

Single-molecule localization microscopy (SMLM) has revolutionized the understanding of biological systems by enabling visualization at the nanoscale1. Techniques like photoactivated localization microscopy (PALM)2 and stochastic optical reconstruction microscopy (STORM)3 have paved the way for this revolution by overcoming the diffraction limit of light and providing insights into molecular structures and dynamics with high resolution.

Traditionally, discriminating single-molecule binding types in SMLM relies on several methods. These include the use of different fluorescent dyes1,4,5, temporal separation of fluorophores6,7,8, or the analysis of blinking kinetics such as the binding on/off times of multiple binding events9,10,11,12,13. While effective, these approaches come with significant drawbacks. The use of different fluorescent dyes adds complexity to experimental design and analysis and may introduce potential issues, such as phototoxicity14 and dye crosstalk15. Temporal separation of fluorophores can lead to asynchronization issues, and analyzing blinking kinetics necessitates prolonged observation periods, limiting the practicality of these methods for certain applications, such as high-throughput studies16, live-cell imaging17, and real-time imaging18.

Recent advances in deep learning, particularly convolutional neural networks (CNNs), have made a significant impact across diverse fields, including fluorescence microscopy19, molecular imaging20, and single-molecule analysis21. While CNNs have been widely adopted in image processing and video classification, their application to single-molecule detection and fluorescence microscopy is still in its early stages. Previous studies21,22,23,24 demonstrated that deep learning models can enhance detection accuracy and reduce analysis time in single-molecule localization microscopy (SMLM) experiments. However, these approaches have yet to fully exploit CNNs’ ability to process both temporal and spatial information within a single model.

In this work, we reveal previously overlooked discriminative spatiotemporal information within diffraction-limited fluorescent spots, enabling synchronous classification of binding types at the single-event level using the same fluorescent dye. To leverage this insight, we propose a convolutional neural network architecture, Temporal-to-Context (T2C) CNN, which transforms long temporal fluorescence signals into enriched contextual representations. While temporal-to-channel integration has been explored in generic video analysis25, its application to fluorescence microscopy presents unique challenges due to low signal-to-noise ratios and subtle spatiotemporal dynamics. In our design, the temporal dimension is reshaped into the channel axis, allowing spatial convolutions to capture long-range temporal dependencies–a strategy we term “long-term spatial convolution”. We demonstrate that the combination of three architectural elements–long-term spatial convolutions, shallow cross-connected blocks, and a pooling-free design–enables the effective capture of fine-grained temporal context and multi-scale features essential for binding-type classification. This combination enables robust and generalizable classification performance under noisy conditions, and its effectiveness is supported by ablation experiments and comparisons to state-of-the-art deep learning baselines. We validate the approach using DNA-PAINT (Points Accumulation for Imaging in Nanoscale Topography26,27), where the T2C CNN achieves a substantial increase in classification accuracy–from 75% using probability density function (PDF) estimation based on binding time to approximately 95%. This performance gain is accompanied by a significant reduction in measurement time, from 10 min to just 5 s, enabled by the model’s efficient spatiotemporal pattern recognition. Moreover, T2C CNN significantly outperforms representative state-of-the-art deep learning methods–including 3D ResNet-1828, Video Transformer29, ED-TCN30, and SqueezeTime25–which achieve accuracies ranging from approximately 70% to 80% on the same dataset. Beyond superior accuracy, T2C CNN offers practical advantages: its model size is 1.19 to 77 times smaller, and it requires 4 to 534 times less computation than these alternatives. These features make T2C CNN also efficient, facilitating broader deployment in real-time, resource-constrained, or high-throughput single-molecule imaging and sensing applications. These results underscore the potential of T2C CNN to greatly enhance the analytical capabilities of SMLM and other fluorescence-based techniques for rapid and precise molecular investigations.

Results

Overview of raw data and methods

DNA-PAINT experiments generate raw data as videos, with each frame containing multiple diffraction-limited fluorescent spots. To minimize spot overlap, experimental parameters such as concentration and pH are carefully adjusted to control spot density. Figure 1a displays a representative single frame (100 ms exposure) from such a video. Each frame corresponds to a snapshot in time and captures transient binding events individually. The dynamic appearance and disappearance of spots over consecutive frames reflect the stochastic binding and unbinding of imager strands at docking sites. Traditional analysis methods, such as Picasso26, detect and localize fluorescent spots, perform drift correction, group spots, and extract multiple binding events at each site to generate a sequence of on-off signals (Fig. 1b).

Fig. 1: An example of raw images and time courses from the DNA-PAINT experiments used in this study.

a Schematic of the two DNA-binding domains used to collect diffraction-limited fluorescent spots (adapted from ref. 11). Both domains share the same dye ("ATTO532") and imager strand ("10nt P3"), but differ in their complementary sequences: Domain 1 uses "8nt P3'" (a partial complement of P3), while Domain 2 uses the fully complementary "10nt P3'". Example frames from the raw video data are shown next to the corresponding domains. b Example time courses annotated with diffraction-limited fluorescent spots from binding events in Domain 1 and Domain 2. Domain 2 exhibits longer binding durations than Domain 1 (dwell-time analysis gives t_on = 2.30 ± 2.38 s for Domain 1 and 11.56 ± 16.27 s for Domain 2, with survival curves provided in Supplementary Fig. 1), while both show similar intensity jumps. The length-based method classifies molecules by total binding durations, whereas the image-based method classifies individual events. Axis offsets (+17133/+18117, +6229/+271) indicate base intensity and event start frame, respectively. Source data are provided as a Source Data file.

Binding time, or on-time, is defined as the duration for which an imager strand remains bound to the docking strand before dissociating. It is detected by local intensity maxima with a gradient decrease toward surrounding pixels26 and depends on the binding affinity and local imager strand concentration. Off-time, also known as dark time, refers to the interval between the dissociation of one imager strand and the subsequent binding of another to the same docking site, influenced by imager strand concentration and diffusion kinetics.
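To make these definitions concrete, the sketch below extracts on-times and off-times from a thresholded (binary) intensity trace of a single binding site. This is a minimal illustration, not the Picasso pipeline; the binary trace, the 0.1 s frame interval, and the function name are assumptions of this example.

```python
import numpy as np

def on_off_times(trace, frame_interval=0.1):
    """Extract on-times and off-times (in seconds) from a binary binding trace.

    `trace` is a 1D array of 0/1 values (1 = imager bound), e.g. obtained by
    thresholding the spot intensity at one docking site over all frames.
    """
    padded = np.concatenate(([0], trace.astype(int), [0]))
    edges = np.flatnonzero(np.diff(padded))                # on-starts and on-ends, alternating
    starts, ends = edges[0::2], edges[1::2]
    on_times = (ends - starts) * frame_interval            # binding (on) durations
    off_times = (starts[1:] - ends[:-1]) * frame_interval  # dark (off) intervals
    return on_times, off_times
```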

Length-based methods classify binding affinities by comparing average binding durations. While effective for distinguishing interactions with large affinity differences, they struggle to differentiate binding types with similar affinities. To overcome this limitation, we propose an alternative approach that directly extracts discriminative features from raw video signals, referred to as image-based methods. As illustrated in Fig. 1b, length-based methods classify molecule types based on all binding events at a site, whereas image-based methods classify individual binding events. Extracting meaningful features from video signals improves binding type resolution, reduces the need for repeated observations, and enables real-time, high-throughput molecular detection. The proposed T2C CNN, an image-based method, processes videos of diffraction-limited fluorescent spots captured during single DNA binding events and classifies each event into predefined DNA binding types. The following sections present the experimental results, discussion, and methodology on these data and methods.

Binding-type information in diffraction-limited fluorescent spots

A standard DNA binding design in DNA-PAINT11 is used in this study (as shown in Fig. 2a), with different docking strands for the same 10nt imager strand 5’-GTAATGAAGA-3’: partially complementary 8nt 5’-TT-TCTTCATT-3’ (domain 1) and fully complementary 10nt 5’-TT-TCTTCATTAC-3’ (domain 2), where “-TT-” is a spacer between the docking strands and the DNA origami. Appropriate concentrations were chosen: 1 nM imager strands, 1 nM domain-1 docking strands, and 200 pM domain-2 docking strands. Note that the concentration of domain-2 docking strands was reduced to decrease the probability of spatial and temporal overlap of binding events at different sites, due to its longer binding time. We recorded fluorescence microscopy videos of 20,000 frames (33 min 20 s at a 10 fps frame rate) for both domains. After drift correction, we identified 4977 and 1183 binding sites with 19,457 and 6073 binding events for domains 1 and 2, respectively. We analyzed 25,530 binding events from the two domains by applying background and blinking corrections to the diffraction-limited fluorescence spots, computing inter-frame correlations and summary statistics, and evaluating their association with domain labels.

Fig. 2: Diffraction-limited fluorescent spots hold binding-type information that can be utilized to more accurately differentiate between single binding events.

a Two types of DNA-binding domains used to collect diffraction-limited fluorescent spots. Blinking kinetics refers to the binary signals of binding on and off. (Revised from ref. 11). b The correlation between domain labels and some common statistics (n = 25,530 independent binding events), as well as the classification accuracy accounted for by the missing information. The “YZ,” “XZ,” and “XY” statistics refer to those calculated from projections of the video onto the Y-Z, X-Z, and X-Y planes, with X as the width, Y as the height, and Z as the time axis. c The correlation between the final classification output of the proposed T2C CNN (FC2 feat1 and FC2 feat2) and some common statistics, as well as the classification accuracy accounted for by the additional information captured by the model and the remaining missing information. d Scatter plot of mean on-time versus mean off-time. Below, the Gaussian kernel density estimate of the probability density function (PDF) for the mean on-time is displayed. Based on multiple binding events, 6.33% of binding sites are misclassified. e A 2D uniform manifold approximation and projection (UMAP) plot of the output embeddings from the convolutional layers of the proposed T2C CNN. This model achieves a lower error rate (3.62%) in more fine-grained recognition of single binding events, significantly reducing the time required to determine the binding type for a binding site. Clearer visualizations with each domain plotted on top are provided in Supplementary Fig. 2. Source data are provided as a Source Data file.

Frame correlation quantifies the similarity between frames based on pixel intensity patterns, revealing temporal and structural consistency (definition in Supplementary Note 1). Figure 3 shows the average inter-frame Pearson correlations for both domains. Initially, both Domain 1 (8nt-10nt) and Domain 2 (10nt-10nt) show stable correlations. However, Domain 2 maintains this stability longer, while Domain 1 rapidly develops alternating blocks of high and low correlation–mainly due to fewer long-duration events. These intermittent low-correlation segments in Domain 1 may arise from increased torsional and lateral fluctuations enabled by its asymmetric partial duplex geometry31 and flexible overhangs32, which subtly shift the spot pattern across frames without reducing mean intensity. This is likely due to a shorter binding duration, which reduces the inter-frame persistence of fluorophore localization. Such correlation patterns provide useful cues for identifying heterogeneity in binding site behavior in fluorescence videos.
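As a rough illustration of this analysis, the sketch below computes the pairwise inter-frame Pearson correlation matrix for one event video; the (T, H, W) array layout is an assumption here, and the exact definition used in the study is given in Supplementary Note 1.

```python
import numpy as np

def frame_correlation_matrix(video):
    """Pairwise Pearson correlations between the frames of a (T, H, W) video."""
    t = video.shape[0]
    flat = video.reshape(t, -1).astype(float)  # one row per frame
    return np.corrcoef(flat)                   # (T, T); entry (i, j) compares frames i and j
```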

Fig. 3: Inter-frame correlations.

Correlations between diffraction-limited fluorescent spot frames during binding events for Domain 1 and Domain 2. Detailed definitions, analyses, and visualizations are provided in Supplementary Note 1, Supplementary Note 2, and Supplementary Fig. 3, respectively. Source data are provided as a Source Data file.

Common statistics of diffraction-limited fluorescent spots (Fig. 2b) show Pearson correlations with binding type of up to 0.5, beyond binding time alone. These include the following (a sketch for computing them appears below).
Sum features (e.g., “Sum”, “YZ sum”, “XZ sum”, “XY sum”): partially matched 8nt-10nt bindings exhibit lower overall fluorescence intensity than fully matched 10nt-10nt bindings, as observed in projections onto the height-time (YZ), width-time (XZ), and height-width (XY) planes.
Non-zero features (e.g., “non-zero”, “YZ non-zero”, “XZ non-zero”): fully matched 10nt-10nt bindings show fewer blinking or localization losses than partially matched 8nt-10nt bindings. Signal loss produces zero values in the data because background intensities have been removed. “XY non-zero” is less informative because the time axis is collapsed, masking temporal fluctuations.
Non-masked features (e.g., “non-masked”, “YZ non-masked”, “XZ non-masked”): partially matched 8nt-10nt bindings exhibit weaker edge intensities, which are masked when their values fall below the average background intensity within the same frame. As with the previous metric, “XY non-masked” provides limited insight due to the collapsed time axis in the XY projection.
XY statistics (e.g., “mean”, “std”, “min”, “max”, “median”, “range”): 8nt-10nt bindings show lower average intensity, less variation, and more random spatial changes.
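A minimal sketch of how such statistics could be computed from a background-corrected event video is given below; the (Z, Y, X) = (time, height, width) axis order and the dictionary layout are assumptions of this illustration.

```python
import numpy as np

def spot_statistics(video):
    """Summary statistics of a background-corrected spot video with axes (Z=time, Y, X)."""
    stats = {"sum": video.sum(), "non_zero": np.count_nonzero(video)}
    projections = {"YZ": video.sum(axis=2),  # collapse X -> height-time plane
                   "XZ": video.sum(axis=1),  # collapse Y -> width-time plane
                   "XY": video.sum(axis=0)}  # collapse Z -> height-width plane
    for name, proj in projections.items():
        stats[f"{name} sum"] = proj.sum()
        stats[f"{name} non-zero"] = np.count_nonzero(proj)
    xy = projections["XY"]
    stats.update({"mean": xy.mean(), "std": xy.std(), "min": xy.min(),
                  "max": xy.max(), "median": np.median(xy),
                  "range": xy.max() - xy.min()})
    return stats
```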

In this paper, “missing information” is defined as the gap between the current classification accuracy and an ideal scenario in which all discriminative features are captured (i.e., 100% accuracy). This gap quantifies the amount of additional information needed to achieve perfect classification. By voting on binding types using all the statistics in Fig. 2b, weighted by their correlation with binding type, we found that the classification accuracy could potentially be improved by 4.05% in Domain 1 and 37.75% in Domain 2. These differences indicate that the available statistics capture most of the necessary discriminative features in Domain 1, but much less so in Domain 2, where a substantial portion of key information is missing. The proposed T2C CNN compensates for the missing information by extracting additional spatiotemporal features from fluorescence videos, recovering 0.25% in Domain 1 and 34.72% in Domain 2, ultimately improving classification accuracy to 96% in both domains. This highlights the effectiveness of T2C CNN in enhancing image-based binding type classification, where traditional image statistical features are insufficient.

To evaluate the temporal information in fluorescence image sequences, we applied three scrambling strategies: within events (removing order), across events (preserving length but mixing events), and into random-length segments (removing both order and structure); a sketch of the three strategies follows this paragraph. As shown in Supplementary Table 1, the performance of the image-based method (T2C CNN) degrades progressively with increasing disruption. Scrambling within events moderately reduces accuracy (DNA Origami: 94.76% → 89.47%; Cell: 74.09% → 71.59%), highlighting the value of temporal order. Cross-event scrambling further lowers accuracy (DNA Origami: 87.66%; Cell: 70.25%), emphasizing the role of event-level coherence. Random-length scrambling yields the lowest accuracy (DNA Origami: 78.98%; Cell: 62.64%), falling below even length-based methods (DNA Origami: 83.88%; Cell: 66.34%), suggesting that disrupting both order and structure severely degrades discriminative power. These results confirm that both the temporal order and the event structure of fluorescence image sequences encode meaningful binding information. Additional results are provided in subsequent sections.
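The three scrambling strategies can be sketched as follows, assuming each event is a (T, H, W) frame array; the function names and random generator are illustrative, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def scramble_within_event(frames):
    """Shuffle frame order inside one event (removes temporal order)."""
    return rng.permutation(frames, axis=0)

def scramble_across_events(events):
    """Rebuild events with the original lengths from frames pooled across all events."""
    pool = rng.permutation(np.concatenate(events, axis=0), axis=0)
    out, i = [], 0
    for n in (len(e) for e in events):
        out.append(pool[i:i + n])
        i += n
    return out

def scramble_random_segments(events, max_len=50):
    """Cut pooled frames into random-length segments (removes order and structure)."""
    pool = rng.permutation(np.concatenate(events, axis=0), axis=0)
    out, i = [], 0
    while i < len(pool):
        n = int(rng.integers(1, max_len + 1))
        out.append(pool[i:i + n])
        i += n
    return out
```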

Cross-experiment evaluation of existing methods for discriminating single-dye binding types

Using the same DNA binding type design as in previous sections, we increased the domain-2 docking strand concentration from 200 pM to 400 pM and recollected fluorescence microscopy videos of equal duration for both domains. Binding event data from this setup were used to train and evaluate cross-experiment classification models, including: (1) PDF33–Gaussian kernel density estimation of binding durations; (2) MLP34–trained on sliced binding durations; (3) 3D ResNet-1828–a standard video classification CNN; (4) Video Transformer29–a self-attention-based video classification model; and (5) T2C CNN–the proposed CNN designed for classifying binding types from stacks of diffraction-limited fluorescent spot images.
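For orientation, a minimal version of the PDF baseline (1) might look like the sketch below: a Gaussian KDE is fitted to the training binding durations of each domain, and a site is assigned to the domain with the higher estimated density at its mean on-time. Equal class priors and the function names are simplifying assumptions.

```python
from scipy.stats import gaussian_kde

def fit_pdf_classifier(durations_domain1, durations_domain2):
    """Fit the duration-only 'PDF' baseline: one Gaussian KDE per domain."""
    kde1 = gaussian_kde(durations_domain1)
    kde2 = gaussian_kde(durations_domain2)

    def classify(mean_on_time):
        # Assign the domain with the higher estimated density (equal priors assumed).
        return 1 if kde1(mean_on_time)[0] >= kde2(mean_on_time)[0] else 2

    return classify
```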

Classification accuracy and measurement time (with standard deviations) for all methods are shown in Fig. 4a. The PDF and MLP models, relying on binding durations, reach only ~75% accuracy per event and require at least 3 min of observation to exceed 85%. In contrast, image-based methods (dashed edges in the plot), such as 3D ResNet-18, achieve ~80% accuracy per 5-s binding event. The proposed T2C CNN attains ~95% accuracy with only one 5-s event by using cross-connected long-term spatial convolutions. Unlike traditional 3D CNNs (e.g., 3D ResNet-18) with shallow temporal strides, T2C CNN treats long temporal stacks as channels, enabling 2D convolutions equivalent to long-step 3D convolutions. This design captures frequency variations across spatial regions, effectively encoding DNA binding dynamics that can affect fluorescence intensity, angle, and density. Although orientation-dependent effects are likely averaged out over the 100 ms frame duration, hybridization or partial immobilization may still induce subtle changes in fluorophore behavior. For instance, the linker length and binding strength can influence the mobility35 of the fluorophore bound to the docking strands. Although the exact mechanism remains undetermined, slight changes in the PSF36, time-averaged over 100 ms, may be one origin of the enhanced classification accuracy achieved with diffraction-limited images. Figure 4b shows confusion matrices from the best-performing models across five cross-validation experiments. Duration-based (PDF) and standard image-based models (Video Transformer, 3D ResNet-18) often misclassify Domain 2 events as Domain 1. In contrast, T2C CNN significantly reduces such errors by better distinguishing subtle binding type differences. Figure 4c visualizes predicted binding types with pseudo-colored images across increasing measurement times. T2C CNN consistently reconstructs ground-truth binding site distributions with 95–100% accuracy, enabling reliable interpretation of multi-target fluorescence microscopy data.

Fig. 4: Cross-experiment results for discriminating binding types.

a Comparison of average performance on 25,530 binding events across the two domains. The width of the colored regions in the horizontal and vertical directions represents the standard deviation of measurement time and accuracy, respectively, across five cross-validation trials. b Comparison of confusion matrices. The PDF method provides molecule-level predictions based on multiple binding events. In contrast, other methods are image-based and offer predictions at the single-event level. c Comparison of pseudo-colored composite fluorescence images, created by frame-wise summation of the two separated videos representing different binding types. By localizing fluorescence spots before summation, we accurately assigned each spot in the composite video to its corresponding binding type, establishing the ground truth. Image-based methods were trained on the mixed data from the composite video, without knowledge of the original intensities in the separate videos. The rightmost column displays the observation time and the classification accuracy for binding sites, representing the cumulative accuracy of binding type classification for all events occurring at the site. To reduce spot overlap from longer binding events, a lower concentration was used for Domain 2 (10nt) than for Domain 1 (8nt), leading to a higher number of observed 8nt binding sites over time. Source data are provided as a Source Data file.

Multi-class experiments

We extended the proposed approach to three-class classification by introducing an additional 6nt–6nt R1 strand binding type (Fig. 5a). As shown in Fig. 5c, T2C CNN outperforms state-of-the-art video models by a large margin, offering a robust tool for high-accuracy, single-fluorophore, three-target SR imaging. Additional results are provided in Supplementary Note 3 and Supplementary Fig. 4.

Fig. 5: Multi-class data, workflow, and results.

a Illustration of the three binding types (6nt–6nt, 8nt–10nt, and 10nt–10nt) used in the multi-class experiments. b Data distributions from the experiments. Each of the three binding types includes two biologically independent replicates (Experiments 1 & 4, 2 & 5, and 3 & 6), with one replicate used for training and the other for testing. Violin plots show the full distribution of the data. The embedded boxplots indicate the 25th-75th percentiles (bounds of the box), the median (center line), and the minima and maxima (whiskers). Notably, the 6nt R1 imager exhibited lower average frame intensity (approximately 50 and 100 for training and testing) due to its shorter binding time, which sometimes resulted in incomplete fluorescence accumulation within a single frame. c Workflow and results of image-based binding type classification. Source data are provided as a Source Data file.

Interpretations of the T2C CNN

The proposed T2C CNN takes as input the video of diffraction-limited fluorescent spots, captured by a microscope, that result from the excitation of conjugated dyes in single DNA binding events. The output is the classification of the DNA binding event into predefined DNA binding types. This process does not require manually defined video features, which are difficult to discern with the naked eye. In this section, we attempt to explain the additional information captured by the T2C CNN to differentiate between different DNA-binding types, building on the cross-experiment studies.

First, we sought to identify which part of the diffraction-limited fluorescent spot most accurately reflects the DNA binding type. By overlaying the model input with its saliency map, we found that the edge region of the diffraction-limited fluorescence spot contributes most to the model’s prediction (an example is shown in Fig. 6a). This is likely because the edge plays a crucial role in forming the overall diffraction pattern. In diffraction-limited fluorescence imaging, the edges of the spots often exhibit unique interference patterns and intensity gradients that encode critical information about the underlying molecular interactions37. The saliency map generated by T2C CNN shows that these edge regions are particularly influential in the model’s decision-making process. This suggests that T2C CNN leverages the subtle variations along the edges–such as differences in brightness, shape, and gradient–that arise from diffraction effects to distinguish between binding types. By focusing on these edge features, the network can capture nuanced differences that may not be apparent in the central region of the spots, leading to more accurate classification outcomes.
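Saliency maps of this kind are typically derived from input gradients; a minimal sketch, assuming a PyTorch model that takes a (batch, T, H, W) tensor with frames acting as channels, is shown below. The paper does not specify its exact saliency formulation, so treat this as one standard variant.

```python
import torch

def saliency_map(model, video):
    """Gradient-based saliency for one event video of shape (T, H, W).

    Returns an (H, W) map of how strongly each pixel influences the
    predicted class logit (maximum absolute gradient over frames).
    """
    model.eval()
    x = video.clone().requires_grad_(True)
    logits = model(x.unsqueeze(0))[0]   # add batch dimension; frames act as channels
    logits[logits.argmax()].backward()  # gradient of the top-class logit
    return x.grad.abs().amax(dim=0)     # collapse the time/channel axis
```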

Fig. 6: Interpretations of the temporal-to-context convolutional neural network (T2C CNN).

a T2C CNN saliency maps for diffraction-limited fluorescent spots in different DNA binding domains. The saliency map visually highlights the regions within each frame that T2C CNN considers most critical for its classification decisions. b F-test on T2C CNN convolutional layer outputs, with each domain treated as a separate group. c Comparison of T2C CNN binding type classification accuracies with different masked frame sections. d Average classification accuracies of T2C CNN for binding events of varying lengths. The peak at 1.1 s arises from merged short events due to blinking correction thresholds (1 s). e Example T2C CNN feature heatmaps and classification probabilities (Prob.) for single-frame diffraction-limited fluorescent spots, demonstrating that T2C CNN assigns a classification probability of over 85% to similar single-frame fluorescence spots in the correct domain, attributable to its ability to capture discriminative features through learnable convolutions. As the first learnable convolution layer, “Conv1” applies 64 convolutional kernels of size (3 × 3) with a (2 × 2) stride and (1,1,1,1) padding, converting the (1 × 10 × 10) input images into 64 feature maps with spatial dimensions of (5 × 5). More examples and analysis are provided in Supplementary Figs. 5, 6, 7, and Supplementary Note 4. Source data are provided as a Source Data file.

We also analyzed the correlation between the model output and common statistics of the diffraction-limited fluorescent spot video (Fig. 2c), which generally reflects the true domain correlations (Fig. 2b). Additionally, we performed an F-test on the features output by T2C CNN (Fig. 6b), revealing that the majority of features have a high F-statistic (F > 100). This indicates that the model has learned discriminative features from the diffraction-limited fluorescent spot videos to distinguish DNA binding types.
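Such a per-feature one-way F-test can be sketched with scipy, treating each domain as a group; the array shapes below are assumptions of this illustration.

```python
import numpy as np
from scipy.stats import f_oneway

def feature_f_statistics(features, labels):
    """One-way F-statistic per feature, with each domain treated as a group.

    `features`: (n_events, n_features) array of convolutional-layer outputs;
    `labels`: (n_events,) array of domain labels.
    """
    groups = [features[labels == d] for d in np.unique(labels)]
    return np.array([f_oneway(*[g[:, j] for g in groups]).statistic
                     for j in range(features.shape[1])])
```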

Next, we explored the importance of different periods of the DNA-binding events in determining the DNA-binding type. For each DNA-binding event’s diffraction-limited fluorescent spot video, we sequentially masked out 10% of the frames (not masking fewer than 10 frames). As shown in Fig. 6c, when we masked the first 10% or between 10% and 20% of the frames, the classification accuracy of domain 1 DNA binding types dropped from around 96% to below 90%. We hypothesize that this is because the early binding stages of a partially matched 8nt docking strand with a 10nt imager strand in domain 1 are still unstable, compared to the more stable early binding stages of a fully matched 10nt docking strand in domain 2. This hypothesis is consistent with the conclusions in Fig. 3.
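The masking probe can be sketched as below: a contiguous 10% window of frames (never fewer than 10, as noted above) is zeroed before the event is re-classified. The function interface is illustrative.

```python
import numpy as np

def mask_frames(video, start_frac, width_frac=0.1, min_frames=10):
    """Zero out a contiguous window of frames in a (T, H, W) event video."""
    t = video.shape[0]
    n = max(int(round(width_frac * t)), min_frames)  # window size, at least 10 frames
    i = int(round(start_frac * t))                   # window start position
    masked = video.copy()
    masked[i:i + n] = 0
    return masked
```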

Finally, we evaluated T2C CNN on DNA binding events of varying durations. As shown in Fig. 6d, T2C CNN achieves ≥87.5% accuracy for events shorter than 4.9 s (49 frames), which constitute 72.9% of all 25,530 events. The probability density function in Fig. 2d shows substantial overlap between the two domains in this range, underscoring the challenge addressed by the proposed method. Occasionally, lower accuracies (<80%) arise in a small subset (1.6%) of longer events. Notably, only 1–6 events per binding-length category (across 12 of 463 categories, or 2.6%) yielded 0% or 50% accuracy, remaining within acceptable limits for typical applications.

Interestingly, T2C CNN achieves approximately 95% accuracy in distinguishing binding types on single-frame events (see Fig. 6d at x = 100). Accordingly, Fig. 6e illustrates a representative example of single-frame events from both domains, including the input image, a representative feature map, and the corresponding classification probability. Feature maps reveal differences in spot shape (e.g., round vs. square) and pixel intensity distribution (e.g., concentrated vs. dispersed). The model consistently outputs high-confidence predictions (confidence ≥ 85%). These results explain how the T2C CNN works at the single-frame level. For multi-frame spots, the model integrates temporal variations of these features. Given its ability to differentiate single-frame spots from the same dye, it should, in principle, also distinguish different dyes–explored further in the last section of Results.

Robustness test of image-based binding-type classification models

In single-molecule fluorescence binding experiments, nonspecific binding, background intensity noise, and camera defects are common issues, making it crucial to evaluate the robustness of various methods under noisy conditions. Supplementary Note 5 and Fig. 7 assess the robustness of image-based binding-type classification models (3D ResNet, Video Transformer, ED-TCN, SqueezeTime, and the proposed T2C CNN) by simulating these three types of noise interference. Among all tested models, the proposed T2C CNN consistently demonstrated strong robustness, particularly against Poisson and Gaussian noise, due to its temporal-to-channel transformation and efficient parameter design. These results suggest that T2C CNN can reliably handle signal disturbances commonly encountered in fluorescence imaging, provided that interference spot intensity is kept below half that of the signal and the effective pixel ratio—the proportion of the signal area that remains unobstructed by camera defects—exceeds 93%.
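A simplified sketch of such noise injection is given below, covering intensity-dependent shot noise, additive Gaussian noise, and hot (stuck-bright) pixels; it omits the Gaussian spatial decay of the interference spots described in Fig. 7a, so treat it as an approximation of the simulation protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(video, poisson_scale=0.0, gauss_std=0.0, hot_pixel_frac=0.0):
    """Inject simulated disturbances into a (T, H, W) fluorescence video."""
    noisy = video.astype(float)
    if poisson_scale > 0:   # intensity-dependent shot noise
        noisy = rng.poisson(np.clip(noisy, 0, None) * poisson_scale) / poisson_scale
    if gauss_std > 0:       # additive background noise
        noisy = noisy + rng.normal(0.0, gauss_std, size=noisy.shape)
    if hot_pixel_frac > 0:  # camera-defect pixels stuck at a bright value
        mask = rng.random(noisy.shape[1:]) < hot_pixel_frac
        noisy[:, mask] = noisy.max()
    return noisy
```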

Fig. 7: Robustness test results.

a Examples of generated noise added to the original signal, including Poisson light noise with Gaussian decay, Gaussian noise, and hot pixel noise. The symbols μ and σ denote the mean and standard deviation (std.), respectively. b Average classification accuracies of image-based binding type classification models under different noise conditions. The shaded regions represent the std. of accuracies from noises generated with 5 random seeds. Models exhibit stable performance across different random seeds (with minimal std., resulting in nearly invisible shading) but show performance degradation with increasing noise intensity. Source data are provided as a Source Data file.

Hyperparameter analysis of the T2C CNN

We systematically evaluated the impact of network depth, width, and input slice length on T2C CNN’s performance (Fig. 8 and Supplementary Note 6). An optimal depth of 8 layers (two per block) achieved the highest accuracy (94.76%). Configurations with evenly distributed layers (4, 8, and 12) consistently outperformed uneven ones (e.g., 5, 6, 7, 9, 10, 11), underscoring the importance of architectural symmetry. The default width of 64 provided the best trade-off between accuracy and efficiency, with 128 yielding only marginal gains (94.76% → 95.38%) at the cost of doubling model size and computation. Remarkably, even with a small width of 16, accuracy remained above 90%, indicating T2C CNN’s robust feature extraction and the redundancy in fluorescence video data. Regarding input length, increasing the slice length improved performance up to 512 frames by capturing complete binding events, with performance plateauing beyond that.

Fig. 8: Ablation study and hyperparameter analysis of T2C CNN.

a Comparison of classification accuracy, computational cost, and model sizes for T2C CNN variants. Baseline methods, including 3D ResNet-1828, Video Transformer29, ED-TCN30, and SqueezeTime25, are also evaluated for comparison (detailed comparison discussed in the Discussion and Motivation sections). The abbreviation “w/o” represents “without.” Giga floating-point operations (GFLOPs) quantify the computation required for processing, indicating hardware demands. Curved arrows visualize the performance trend across variants with the same hyperparameter type. b T2C CNN architecture variants investigated in the ablation study. Dashed boxes highlight the four blocks of T2C CNN, while solid rectangles represent a convolutional or fully connected layer followed by BN and ReLU layers (except for the final output layer, which omits BN and ReLU). The convolution kernel size is labeled as “N × N,” and the rectangle length indicates the kernel size or the number of hidden units (i.e., width). Hidden dimension transformations are denoted as “N → M.” Rectangles without labeled kernel sizes or dimension transformations follow the default hidden layer configuration for an 8-layer depth. Source data are provided as a Source Data file.

In Fig. 8, we also included the baseline models with their optimal depths and widths, which were evaluated on randomly split training and validation datasets (Supplementary Table 2).

Ablation study of T2C CNN architecture

We dissected T2C CNN’s architecture to assess the contributions of its key blocks and design components (Supplementary Note 7, Supplementary Table 2). Removing either the second (Hidden Transformation) or third (Multi-Scale Feature Fusion) block significantly degraded performance (94.76% → 90.33% and 88.80%, respectively), confirming their critical roles. The second block proved particularly essential for transforming global features into discriminative representations that facilitate effective multi-scale fusion. Further analysis of individual components–long-term spatial convolutions, skip concatenations, and no-pooling–revealed that each contributed incrementally to performance, with their combined use achieving the highest accuracy. Notably, the default configuration also yielded the lowest standard deviation, reflecting stable generalization. The optimal configuration consistently outperformed the baseline across both single-event and single-molecule tasks. These results demonstrate that the full T2C CNN architecture forms a synergistic and domain-adapted design optimized for robust classification under noisy single-dye fluorescence conditions.

Statistical significance analysis of model performance

A Wilcoxon signed-rank test conducted on single-event class-wise accuracies across five folds confirmed that T2C CNN significantly outperformed all of its ablated variants (p = 0.031). To assess performance across a broader set of baseline models—including all seven ablated variants, four PDF variants (1-event, 2-event, 3-event, and all-event), and four image-based methods—we performed a Friedman test on the same single-event accuracies (Supplementary Table 4), which revealed a statistically significant difference among the 16 methods (χ² = 64.11, p = 4.89 × 10⁻⁸). All models were evaluated using five-fold cross-validation and tested on an independent sample. Although additional random seeds were not explored, the cross-validation results were consistent across folds, indicating stable performance and low sensitivity to model initialization. Post-hoc analysis using the Nemenyi test further showed that T2C CNN significantly outperformed the state-of-the-art baselines, including SqueezeTime (p = 0.0030), ED-TCN (p = 0.0023), and Video Transformer (p = 0.0145). These results underscore the necessity of domain-specific architectural design in fluorescence video classification.
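For reference, the Wilcoxon and Friedman tests can be reproduced with scipy as sketched below (the Nemenyi post-hoc test is available in the separate scikit-posthocs package); the `accs` layout is an assumption of this illustration.

```python
import numpy as np
from scipy.stats import wilcoxon, friedmanchisquare

def compare_methods(accs, reference="T2C CNN"):
    """`accs` maps method name -> paired per-fold class-wise accuracies."""
    ref = np.asarray(accs[reference])
    pairwise = {name: wilcoxon(ref, np.asarray(a)).pvalue  # signed-rank vs. reference
                for name, a in accs.items() if name != reference}
    overall = friedmanchisquare(*[np.asarray(a) for a in accs.values()])
    return pairwise, overall
```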

HER2-targeted cell experiments

As illustrated in the central panel of Fig. 9a, two groups of HER2-positive AU565 cell samples (Domain 1 and Domain 2) were immobilized on separate glass slides. Herceptin antibodies conjugated with P3' docking strands (complementary to the 10nt P3 imager strand) were used to specifically bind to the HER2 proteins on the cell membranes; as in the previous experiments, Domain 1 used the truncated 8nt P3' strand and Domain 2 the full-length 10nt P3' strand. Identical fluorophore (ATTO532)-labeled imagers were introduced to both samples. TIRF microscopy was used to image a thin layer at the cell-glass interface, where imagers can access the contact region through gaps between the cell and the slide. A higher number of binding events is expected near the edges of the contact area, which are more exposed to the imaging buffer. Details of the experimental preparation procedures can be found in the “Sample preparation” section of Methods.

Fig. 9: Cell experimental settings and data analysis.

a Settings: HER2-positive AU565 cells were cultured, fixed, and immunolabeled with custom antibody-oligo conjugates (8nt or 10nt P3' strands), followed by incubation with imager strands (10nt P3 strands) in salt-buffered PBS for fluorescence imaging. Example fluorescence images are shown on the corresponding sides of the two domains. Experiments were repeated twice biologically and five times technically (5-fold validation) with similar results. b Binding time analysis: The average binding durations for Domain 1 (8nt-10nt) and Domain 2 (10nt-10nt) overlap substantially, with an intersection-over-union (IoU) of 45.76% between the probability density functions (PDFs) estimated using Gaussian kernels. This implies that nearly half of the binding times cannot be clearly assigned to either Domain 1 or Domain 2. c Fluorescence spot video analysis: Correlations were calculated between domain labels and several common statistical features (n = 19,695 independent binding events). The “YZ,” “XZ,” and “XY” statistics refer to those computed from projections of the video onto the Y-Z, X-Z, and X-Y planes, where X represents the width, Y the height, and Z the time axis. Compared to the Origami data in Fig. 2b, the cell data is markedly more difficult to distinguish–both in terms of the overlapping binding times and the reduced correlation of fluorescence spot video features (dropping from 0.5 in Origami data to 0.2 in cell data)–likely due to the increased molecular complexity of the cellular surface. Source data are provided as a Source Data file.

Figure 9a shows raw fluorescence images from two hybridization types: 8nt-10nt and 10nt-10nt, where 10nt imager strands bind to HER2-targeted docking strands on AU565 cells. By carefully adjusting the concentration ratio between the docking and imager strands, we minimize spot overlap to accurately reveal the locations of individual HER2 proteins. After drift correction over 33 min 20 s (20,000 frames at 10 fps), super-resolution reconstruction (Fig. 10a, middle) reveals HER2 positions (orange/green dots).

Fig. 10: Cell experimental results.

a Bright-field and super-resolution microscopy images of two test HER2-positive AU565 cell samples, with HER2 proteins immunolabeled using 8nt and 10nt DNA strands. The DNA-PAINT technique26 is used to reconstruct HER2 protein locations. Fluorescence videos from the two samples are aligned frame-by-frame to synthesize the multiplexed scenario, serving as ground truth. Experiments were repeated twice biologically and five times technically (5-fold validation) with similar results. b Real-time classification of binding events from fluorescence spot videos on the two test samples. The models were trained on another two separate HER2-positive samples prepared under the same experimental conditions. Results show that the proposed T2C CNN outperforms the state-of-the-art video classification models by a significant margin. Source data are provided as a Source Data file.

Binding events at these sites were tracked to compute average on/off times. As shown in Fig. 9b, Domain 1 (8nt-10nt) shows shorter binding durations than Domain 2 (10nt-10nt), with similar off-times. Compared to the Origami setup (Fig. 2d), domain durations in cells overlap more, likely due to environmental factors such as membrane interactions or local field instabilities. Kernel density estimation (Fig. 9b, bottom) reveals a 45.76% intersection-over-union (IoU) between domains. Using the intersection point of the distributions (4.7 s) as a threshold, 31.85% (402/1262) of events are misclassified, highlighting the limits of duration-based classification in cells.

To enhance feature richness, each event is converted into a video by stacking cropped regions across frames. Figure 9c shows feature-label correlations: duration correlates weakly (±0.13), while other features–e.g., min/max intensity and Y-Z median intensity–show stronger trends (up to ±0.17), offering improved discriminative power. Still, these cell-derived features are less informative than those from the Origami dataset (correlations ≈ ±0.5 in Fig. 2b), underscoring the increased complexity of cellular classification.

HER2 site-localized videos (Fig. 10a) were background-corrected and normalized before being input to the video classifiers. Labels (Domain 1 vs. Domain 2) guided model training via parameterized operations (convolution, matrix multiplication, etc.). During testing, the models classified unseen videos. To evaluate multiplexing, we overlaid super-resolution images from the two samples (Fig. 10b), reconstructed at multiple durations (10 s to 33 min 20 s). Accuracy trends across time are summarized in the right-hand bar plot.

Despite cellular complexity lowering overall performance, the proposed T2C CNN consistently outperformed other models, reaching 78.48–80.61% accuracy versus 40.19–70.00% for others. The performance gap widened with longer measurements, with T2C CNN surpassing the second-best by 10.00–20.18%. This enables more accurate reconstruction of binding types in real time, advancing high-accuracy classification for single-molecule fluorescence in cells.

Single-frame discrimination of different dye-labeled binding events

As a fluorescence classification model, T2C CNN can be used for multiplexing using the wavelength dependence of the emission PSF. This is an alternative to multi-fluorophore experiments using spectral separation38,39. As shown in Supplementary Note 8 and Supplementary Fig. 8, T2C CNN demonstrated superior classification accuracy (92.88%), comparable to larger models like VGG16. At this level of accuracy, T2C CNN requires less than 0.1% of ResNet-18’s computational cost, making it well-suited for deployment on lightweight devices. This outcome confirms the ability to differentiate multicolor data at the single fluorophore level by analyzing the PSF patterns of emission wavelengths. This finding lays the groundwork for future multicolor microscopy techniques that do not require wavelength-specific analysis.

Discussion

In this study, we reveal that beyond binding time, diffraction-limited fluorescence spots also contain information related to different types of binding interactions. These interactions may arise from various molecular processes that affect fluorescence. For instance, in DNA binding, the fluorescence intensity of conjugated dyes may either increase or decrease upon hybridization, depending on the sequence and position of the dye40. Beyond DNA binding, other molecular interactions and conformational changes have also been shown to influence fluorescence. For example, acrylodan fluorescence emission is sensitive to its local environment; when bound to a protein, it exhibits changes in both intensity and emission wavelength, reflecting the degree of solvent exclusion and the effective dielectric constant of the fluorophore’s environment41. The fluorescence spectrum of a conjugated fluorophore can be sensitive to microenvironmental changes, such as solvent variations, and may fluctuate over time due to slow, spontaneous conformational changes in the protein molecule42. Certain ligands can induce distinct conformational states in the binding protein, altering the environment around the conjugated fluorophore side chain41. Additionally, the conformational changes in conjugated polymers can modify the distance between the polymer (acting as an energy donor) and the reporter dye molecule (acting as an energy acceptor). These detection mechanisms typically result in fluorescence turn-on or turn-off, or changes in either the visible color or fluorescence emission color of the conjugated polymer43. This work opens the door to exploring the effects of a broader range of molecular binding events on fluorescence, thereby simplifying molecular detection and enhancing specificity.

Some deep learning models that perform well in natural video classification, such as 3D ResNet28 and Video Transformer29, do not perform as well on fluorescence videos. A key reason for this discrepancy lies in the way video semantics are represented. Natural videos emphasize the motion patterns of spatial features over time, which involves spatiotemporal correspondence44. In contrast, fluorescence videos require attention to the frequency domain of spatial features over a long time span, providing information that is at least as valuable as what can be obtained from the temporal domain45. Based on this distinction, we replaced conventional temporal-domain convolutions with transformations spanning a wide temporal range, leading to the development of long-term spatial convolutions in the Temporal-to-Context (T2C) CNN model. While treating the time dimension as the channel dimension, T2C CNN features a design distinct from the classical temporal convolutional network architecture ED-TCN30 and the recently developed SqueezeTime25, which focuses on reducing computational cost and memory usage for video processing. Supplementary Table 5 highlights their differences and evaluates their suitability for fluorescence video analysis. Experimental results demonstrated that the T2C CNN, which captures spatial frequency features, significantly outperforms traditional spatiotemporal models in fluorescence videos with low spatial-temporal ratios. This discovery provides valuable insights into the processing and recognition of fluorescence videos.

While this study demonstrates the feasibility of distinguishing binding types from diffraction-limited fluorescence videos, several practical challenges remain, particularly in complex cellular imaging scenarios. These include the presence of more than three distinct binding kinetics, variations in fluorescence intensity across molecular species, and the occurrence of overlapping fluorescence spots from multiple species. Below, we outline these limitations and suggest possible avenues for addressing them.

Although the current experiments involve three binding types, future applications may involve even more complex kinetic behaviors. In such cases, the T2C CNN could be extended by incorporating training data from additional purified target samples. For inference on mixed or unknown samples, one strategy is to flag videos that deviate significantly from known profiles as “unknown” or “nonspecific”, thereby preventing overconfident misclassification.

Differences in fluorescence emission profiles across species can introduce variability in the input signals. To mitigate this, calibration data from purified targets across different species may help the model learn invariant features, potentially improving its robustness and generalization in heterogeneous biological environments.

In multi-species imaging, partial overlap between spots is often unavoidable. For moderate overlap, image deconvolution using Gaussian fitting or related techniques may resolve individual events. However, for more severe overlap, additional modeling strategies may be required. These may include training the model on simulated composite signals, identifying distinct dynamic patterns within overlapping videos, or using outlier detection methods to identify atypical fluorescence dynamics.

Overall, while the approach has shown promising results under controlled conditions, further refinements and methodological extensions will be essential to enable its broader applicability in more complex and variable cellular contexts.

Methods

This study did not involve human participants, animal experiments, or other procedures requiring ethics oversight. Therefore, no ethics approval was required.

Sample preparation

The DNA-origami configuration used in this paper is obtained from the design module of Picasso26. We used 4 docking strands per origami at the 4 corners, with the sequences of the docking and imager strands adapted from previous research11. The origami is prepared by mixing the M13mp18 single-stranded scaffold DNA (10 nM), core staples (100 nM), biotinylated staples (100 nM), and staples with docking strands (1 μM), and annealing in a thermocycler. First, the temperature is raised to 80 °C, followed by a thermal gradient from 60 °C to 4 °C over 3 h. The origami was then purified by centrifugal filtration using an Amicon 0.5 mL 50 kDa MWCO filter, at 5000 × g for 6 min followed by collection at 5000 × g for 5 min.

The substrate is prepared by cleaning a coverslip and attaching a CoverWell perfusion chamber to it. The origami is immobilized on the coverslip using a BSA-biotin-streptavidin linkage, and the chamber is filled with an imager solution (1 nM imager strands in imaging buffer consisting of 1× PBS, 500 mM NaCl, saturated with Trolox). The same 1 nM ATTO532-labeled imager strand solution was used for both domains to ensure consistent imager availability and minimize potential differences in binding frequency due to imager concentration. To limit the number of fluorescence spots per frame and reduce overlap, we adjusted the surface density of docking strands: 400 pM for the 10nt docking strand (which has a longer binding time) and 1 nM for the 8nt and 6nt strands (which have shorter binding times).

The antibody-oligo conjugate probes were prepared using the method described in ref. 46. The 5’-amine-modified docking strands were conjugated with the Herceptin monoclonal antibody, sourced from Trastuzumab (Roche, Graz, Steiermark, Austria), which targets the HER2 protein on the cell membrane, using a disuccinimidyl suberate (DSS) linker. For cell experiments, the antibody-oligo conjugates were prepared at a 1:1 labeling ratio and used at a final concentration of 1 nM. The oligo (200 μM in nuclease-free water) was first mixed with an equal volume of acetonitrile, DSS (25 mM, dissolved in dimethylformamide), and 1:800 (v/v) triethylamine for 15 min at room temperature, and then purified via ethanol precipitation. Sodium acetate (0.3 M, pH 5.2) and magnesium chloride (10 mM) were added to the conjugation product and mixed with three times the volume of cold absolute ethanol. After 1 h of incubation at −20 °C, the slurry was centrifuged at 24,100 × g for 15 min at 4 °C. The pellet was washed once with ice-cold 75% ethanol and reconstituted in nuclease-free water. The activated oligo was then incubated with a 3-fold molar excess of Herceptin antibody in 50 mM phosphate buffer (pH 7.2) for 12 h at room temperature.

The antibody-oligo conjugates were purified by ion exchange chromatography (IEX) using an Agilent Bio SAX NP3 (4.6  × 50 mm) column on an Agilent 1260 Infinity HPLC system. Elution began with 100% buffer A (50 mM phosphate buffer, pH 7.2), followed by a step change to 70% buffer A and 30% buffer B (50 mM phosphate buffer, pH 7.2, supplemented with 1.0 M NaCl). A linear gradient was then applied, increasing the proportion of buffer B to 65% (corresponding to 35% buffer A) over 13 min to achieve progressive salt-mediated elution. Each collected fraction, containing antibodies conjugated with discrete numbers of oligo strand(s), was concentrated using Amicon ultrafiltration columns with a 50 kDa cut-off. The final antibody-oligo conjugates were quantified using the Qubit™ ssDNA Assay Kit. The purified conjugates were stored at 4 °C in buffer containing phosphate saline (pH 7.4) and 1 mM EDTA.

HER2 receptor protein-expressing AU565 cells (CRL-2351™, ATCC; Manassas, Virginia, USA) were cultivated in Roswell Park Memorial Institute 1640 Medium (RPMI/ATCC Modification #A1049101, Gibco, Thermo Fisher Scientific, Massachusetts, USA). The AU565 line was obtained from ATCC (CRL-2351) and was isolated from a pleural effusion of a 43-year-old White female patient with breast adenocarcinoma. The medium was supplemented with 1% penicillin-streptomycin (#15070063, Gibco) and 10% fetal bovine serum (FBS; #10270106, Gibco). The cells were seeded at a density of 50,000 cells per dish onto culture dishes (#P35G-1.5-20-C, MatTek, Massachusetts, USA). The cultures were maintained in a heated CO2 incubator at 37 °C and 5% (v/v) CO2 concentration (Forma Steri-Cycle CO2 incubator, Thermo Fisher Scientific) for 36 h.

Cells were then fixed with 4% paraformaldehyde (PFA, EMS) in PBS for 15 min, followed by quenching with 1 mg/mL sodium borohydride for 7 min. Cells were washed three times (5 min per wash) with PBS, and incubated with 1% BSA and 0.05% Tween-20 in PBS for 2 h to block non-specific binding. The blocked cells were washed four times with 1% BSA and incubated with conjugated antibodies (1 nM in PBS containing 500 mM NaCl) for 1 h at room temperature. Cells were then washed four times with PBS supplemented with 500 mM NaCl. Imager solution (1–5 nM in PBS with 500 mM NaCl and 2 mM Trolox) was added immediately before imaging.

All experiments were conducted on separate samples.

Fluorescence microscopy

The intermittent binding of an imager and the corresponding immobilization of the fluorophore at the docking strand on the origami are imaged in total internal reflection fluorescence (TIRF) mode. We used an inverted microscope (IX-71, Olympus, Tokyo, Japan) with an oil immersion objective (PlanApo, 100×, NA 1.5, Olympus). A 532 nm laser (Samba, Cobolt AB, Sweden) is coupled to a TIRF illuminator (model IX2-RFAEVA-2, Olympus) through an optical fiber. The laser is focused on the back-focal plane of the objective, away from the center, to achieve TIRF. The laser power density, measured after the objective, is 125 W/cm2. The emission is collected through the same objective, passed through a dichroic beamsplitter (Semrock, Di03-R405/488/532/635-t1-25 × 36) and an emission filter (Semrock, NF03-405/488/532/635E-25), and imaged with a scientific complementary metal-oxide semiconductor (sCMOS) camera (ORCA-Flash 4.0, Hamamatsu, Shizuoka, Japan). The image sequences are collected at a frame rate of 10 Hz and analyzed without further processing.

Data preprocessing

This section describes the data preprocessing steps from raw microscope images to the model input. Because the fluorescence is continuously excited by the laser, the number of photons received by the microscope gradually decreases and fluctuates over time. This causes the overall intensity of the raw microscope images to vary over time, and there are noticeable intensity differences between images from different experiments, depending on the laser and dye conditions during the experiments. To estimate the background intensity of each frame of the raw microscope images, we used Picasso v0.7.526 to localize diffraction-limited fluorescent spots on the raw microscope images and masked out all detected fluorescent spots to calculate the average intensity of the remaining pixels. As shown in Supplementary Fig. 9a, the background intensity exhibits frame-level variations and significant gaps between experiments across the four experiments in the two domains. To reduce the interference of background intensity variations with fluorescence intensity, we subtracted the background intensity from the detected fluorescent spots and assigned a small value (such as 100, to distinguish it from the zeros used in padding) to pixels below the background intensity. An example of background intensity correction is shown in Supplementary Fig. 9b. This example demonstrates a significant change in background intensity within just three frames (0.3 s), making the fluorescent spots appear much brighter. After correction, we found that the fluorescence intensity remained at the same level after 0.3 s. Before inputting the background-corrected fluorescent spots into the model, it is recommended to divide all intensity values by a scaling constant (e.g., 500) or to apply event-level normalization, bringing the numerical scale into a suitable range (e.g., no greater than 10).
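The background-correction step can be summarized by the sketch below, assuming a per-frame boolean mask of the detected spots; the floor value of 100 and the scaling constant follow the description above, while the function interface is illustrative.

```python
import numpy as np

def correct_background(frame, spot_mask, floor_value=100.0):
    """Subtract the frame's background from a raw microscope frame.

    `spot_mask` marks detected fluorescent spots (True = spot pixel); the
    background is the mean intensity of the remaining pixels.
    """
    background = frame[~spot_mask].mean()
    corrected = frame - background
    # Pixels below the background get a small positive floor,
    # distinguishing them from the zeros used for padding.
    corrected[corrected <= 0] = floor_value
    return corrected

# Before model input, rescale, e.g. video /= 500.0, or normalize per event.
```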

Fluorescence blinking and detection algorithm errors can cause the loss of a few fluorescent spots in the same binding event, including some at the edges and those with high overlaps. As shown in Supplementary Fig. 9c, we consider two binding-on signals detected very close in space and time to be from the same binding event and pad zeros between them. By reading the drift of the slides using Picasso Render26 (as shown in Supplementary Fig. 9d), we obtained a series of binding events grouped by binding site location after drift correction. Subsequently, in Supplementary Fig. 9e, we analyzed the number of binding events per location (denoted as “#Binding events/group”) and the duration of each binding event over 33 min and 20 s across four experiments in the two domains. The number of binding events in both domains is concentrated between 2 and 4, and the binding durations are concentrated between 0 and 10 s. Compared to the partially matched 8nt-10nt binding events in domain 1, the fully matched 10nt-10nt binding events in domain 2 have a smoother distribution and longer binding times. In the cross-experiment binding type classification task, we used data from the first two experiments to train models and data from the last two experiments to test models.
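The blinking correction described above (bridging nearby on-signals with zero-padded frames) can be sketched as follows; the 10-frame gap (1 s at 10 fps) matches the threshold mentioned in Fig. 6d, while the data layout is an assumption of this illustration.

```python
import numpy as np

def merge_blinking_events(events, max_gap_frames=10):
    """Merge on-signals at one site separated by short gaps (blinking).

    `events` is a list of (start_frame, frames) tuples sorted by start frame,
    where `frames` is an (n, H, W) array; gaps up to `max_gap_frames` are
    bridged with zero frames.
    """
    merged = [list(events[0])]
    for start, frames in events[1:]:
        prev_start, prev_frames = merged[-1]
        gap = start - (prev_start + len(prev_frames))
        if 0 <= gap <= max_gap_frames:  # same binding event interrupted by blinking
            pad = np.zeros((gap,) + frames.shape[1:], dtype=frames.dtype)
            merged[-1][1] = np.concatenate([prev_frames, pad, frames])
        else:                           # a genuinely separate binding event
            merged.append([start, frames])
    return [tuple(e) for e in merged]
```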

Motivation

Unlike conventional video classification tasks29,47,48,49, diffraction-limited fluorescence spot videos used for DNA binding type classification are characterized by low spatial resolution and long temporal sequences. We refer to these as low spatial-temporal ratio objects. Traditional video models typically rely on narrow 3D convolutional kernels47,49 or combine small-scale convolutions with sequential models29,48. In general, 3D convolutions use strides smaller than the kernel size50,51,52 to capture fine-grained features. However, such settings can lead to excessive convolutions along the temporal dimension, resulting in a proliferation of transitional features53. For sparse, pattern-like fluorescence spots, these transitional features can introduce redundancy and increase the risk of overfitting. On the other hand, small-scale convolutions with sequential methods29,48 emphasize token dependencies but often fail to capture the global temporal patterns essential for identifying distinct binding events.

To address these challenges, we propose a domain-specific architecture optimized for grayscale fluorescence videos with low spatial-temporal ratios: the temporal-to-context convolutional neural network (T2C CNN). Unlike prior work such as SqueezeTime25, which also treats the temporal dimension as channels, the T2C CNN incorporates long temporal strides that match the convolutional kernel’s temporal extent, yielding long-term spatial convolutions. This design effectively abstracts global temporal features while reducing unnecessary transitional representations. As illustrated in Supplementary Fig. 10a, treating the time dimension as the channel dimension allows the T2C CNN’s 2D convolution to achieve a receptive field equivalent to that of a 3D convolution, with a smaller model size and more flexible input handling.
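This equivalence can be checked directly: a 2D convolution over T frames stacked along the channel axis has exactly the same weights and output as a 3D convolution whose temporal kernel spans all T frames. The PyTorch sketch below (with arbitrary toy shapes) demonstrates the general principle; it is not the model’s actual configuration.

```python
import torch
import torch.nn as nn

T, H, W, C = 8, 7, 7, 16
video = torch.randn(2, 1, T, H, W)                  # (N, 1, T, H, W) grayscale video

conv3d = nn.Conv3d(1, C, kernel_size=(T, 3, 3), bias=False)
conv2d = nn.Conv2d(T, C, kernel_size=3, bias=False)
conv2d.weight.data = conv3d.weight.data.squeeze(1)  # share the same weights

out3d = conv3d(video).squeeze(2)                    # temporal extent collapses to 1
out2d = conv2d(video.squeeze(1))                    # time treated as channels
print(torch.allclose(out3d, out2d, atol=1e-5))      # True: identical receptive field
```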

Moreover, the T2C CNN includes additional domain-specific innovations: long-term spatial convolutions to capture frequency-like patterns in the spatial domain, shallow cross-connected blocks to retain multi-scale spatiotemporal features, and a pooling-free strategy to preserve spatial detail and temporal continuity. These components collectively enhance the model’s ability to discriminate between binding types (Supplementary Table 3). For long binding events, the T2C CNN performs slice-wise inference and aggregates predictions to yield a robust final output. This architecture achieves superior accuracy and stability (Figs. 4, 5, 6, 7, and 10) compared to general-purpose video models. In the following section, we detail the design and implementation of the T2C CNN.

Notations

Let the function f(x, y, z, c) represent the input video, where f(x = X, y = Y, z = Z, c = C) denotes the pixel value at the three-dimensional coordinates (X, Y, Z) and the C-th channel in the input video. Assume the input video f(x, y, z, c) has a frame width of Win, a frame height of Hin, a number of frames Din, and a number of channels Cin. Therefore, \(x,y,z,c\in {{\mathbb{N}}}_{0}\), x < Win, y < Hin, z < Din, and c < Cin. For values of x, y, z, c outside these ranges, f(x, y, z, c) is considered to be 0, which makes padding compatible with the expressions below.

Let the convolution kernel ω have a width of Wkernel, a height of Hkernel, a depth of Dkernel, and a channel size of Cin, with its value at the three-dimensional coordinates (u, v, w) and channel index c given by ω(u, v, w, c). Assume the convolution strides in width, height, and depth are sw, sh, and sd, respectively, and the padding sizes on both sides for width, height, and depth are pw, ph, and pd, respectively. Let Wout, Hout, and Dout represent the output dimensions in width, height, and depth, respectively. Assume the number of output channels is Cout, each corresponding to a unique kernel ωd(u, v, w, c), where d ranges from 1 to Cout. Then, the output of a typical 3D CNN convolution54 is represented as:

$${{{{\rm{Conv}}}}}^{({{{\rm{3D}}}})}(\;f,\omega )={g}^{({{{\rm{3D}}}})},$$
(1)

where

$${g}^{({{{\rm{3D}}}})}(i,j,k,d)={\sum}_{u=0}^{{{{{\rm{W}}}}}_{{{{\rm{kernel}}}}}-1}{\sum}_{v=0}^{{{{{\rm{H}}}}}_{{{{\rm{kernel}}}}}-1}{\sum}_{w=0}^{{{{{\rm{D}}}}}_{{{{\rm{kernel}}}}}-1}{\sum}_{c=0}^{{{{{\rm{C}}}}}_{{{{\rm{in}}}}}-1}{\omega }_{d}(u,v,w,c)\cdot \\ f({s}_{w}\cdot i+u-{p}_{w},{s}_{h}\cdot j+v-{p}_{h},{s}_{d}\cdot k+w-{p}_{d},c).$$
(2)

Here, g(3D)(i, j, k, d) represents the output of the 3D convolution operation at position (i, j, k, d) in the output volume. The indices i, j, k, and d iterate over the output dimensions Wout, Hout, Dout, and Cout, respectively. The sums iterate over the kernel dimensions Wkernel, Hkernel, Dkernel, and Cin, with u, v, w, c indexing into the kernel. The function f(x, y, z, c) represents the input video, where values outside the video dimensions are assumed to be 0. The convolution kernel is denoted by ωd(u, v, w, c).
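For reference, the output extents follow standard convolution arithmetic (a well-known relation, stated here for completeness); for the width,

$${{{{\rm{W}}}}}_{{{{\rm{out}}}}}=\left\lfloor \frac{{{{{\rm{W}}}}}_{{{{\rm{in}}}}}+2{p}_{w}-{{{{\rm{W}}}}}_{{{{\rm{kernel}}}}}}{{s}_{w}}\right\rfloor+1,$$

with Hout and Dout obtained analogously from Hin and Din with the corresponding kernel sizes, strides, and paddings.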

T2C layer

Different fluorophore signals are typically split into separate channels in fluorescence microscopy55,56. Further distinguishing different binding types within a single channel can significantly increase the labeling capacity of the fluorescence channel and help eliminate potential non-specific binding events. In this study, the proposed T2C layer reinterprets the extensive temporal dimension inherent in fluorescence microscopy data by treating it as the channel dimension of a grayscale image. This transformation allows spatial convolutions to be applied across the broad temporal domain, effectively capturing frequency variations. This approach enhances the efficiency of spatiotemporal information fusion, thereby improving the CNN’s accuracy in predicting binding types. The output of a T2C layer, denoted by Conv(T2C), for each fluorescence channel can be represented as:

$${{{{\rm{Conv}}}}}^{({{{\rm{T2C}}}})}(\;f,\omega )={\{{g}_{k}^{({{{\rm{T2C}}}})}\}}_{k=0}^{{{{{\rm{N}}}}}_{{{{\rm{slice}}}}}-1},$$
(3)

where

$${g}_{k}^{({{{\rm{T2C}}}})}(i,j,d)={\sum}_{u=0}^{{{{{\rm{W}}}}}_{{{{\rm{kernel}}}}}-1}{\sum}_{v=0}^{{{{{\rm{H}}}}}_{{{{\rm{kernel}}}}}-1}{\sum}_{w=0}^{{{{{\rm{T}}}}}_{{{{\rm{slice}}}}}-1}{\omega }_{d}(u,v,w)\cdot \\ f({s}_{w}\cdot i+u-{p}_{w},{s}_{h}\cdot j+v-{p}_{h},{{{{\rm{T}}}}}_{{{{\rm{slice}}}}}\cdot k+w).$$
(4)

Here, \({g}_{k}^{({{{\rm{T2C}}}})}(i,j,d)\) represents the output of the temporal-to-channel convolution operation at position (i, j, d) in the output volume for the k-th temporal slice of length Tslice. The number of slices is denoted by Nslice, defined as \(\lceil \frac{{{{{\rm{D}}}}}_{{{{\rm{in}}}}}}{{{{{\rm{T}}}}}_{{{{\rm{slice}}}}}}\rceil\). The indices i and j iterate over the output dimensions Wout and Hout, respectively, and d iterates over the output channels Cout. The sums iterate over the kernel dimensions Wkernel, Hkernel, and Tslice, with u, v, w indexing into the kernel. The function f(x, y, z) represents the input single-channel video, where values outside the video dimensions are assumed to be 0. The convolution kernel is denoted by ωd(u, v, w).

At the beginning of the T2C CNN, a batch of input videos is passed to the T2C convolution to produce the output \({g}_{k}^{({{{\rm{T2C}}}})}\), which then undergoes batch normalization (BN) and rectified linear unit (ReLU) activation. BN accelerates T2C CNN training by reducing internal covariate shift57, and ReLU introduces non-linearity58 and accelerates learning convergence59. Through the T2C layer, each 3D input video slice is thus converted into a 2D feature map with a fixed number of channels.
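As a concrete illustration, the PyTorch sketch below implements the slice-wise temporal-to-channel convolution of Eq. (4) followed by BN and ReLU. The slice length, channel width, and kernel size are placeholders, not the published configuration.

```python
import torch
import torch.nn as nn

class T2CLayer(nn.Module):
    """Temporal-to-channel layer sketch following Eq. (4): each slice of
    T_slice consecutive frames is moved into the channel axis, so a single
    2D convolution spans the full temporal extent of the slice."""

    def __init__(self, t_slice=50, out_channels=32, kernel_size=3):
        super().__init__()
        self.t_slice = t_slice
        self.conv = nn.Conv2d(t_slice, out_channels, kernel_size,
                              padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, video):                        # video: (N, D_in, H, W)
        n, d, h, w = video.shape
        n_slice = -(-d // self.t_slice)              # ceil(D_in / T_slice)
        pad = n_slice * self.t_slice - d
        if pad:                                      # zero-pad the last slice
            video = torch.cat([video, video.new_zeros(n, pad, h, w)], dim=1)
        # fold the slice index into the batch axis: (N * N_slice, T_slice, H, W)
        slices = video.reshape(n * n_slice, self.t_slice, h, w)
        return self.act(self.bn(self.conv(slices))), n_slice
```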

T2C CNN architecture

After the T2C layer, we apply 2D convolutions and fully connected (FC) layers60 to further extract features and make the final prediction. Supplementary Fig. 10b shows the dimensions of each convolutional and FC layer in the T2C CNN used in this study, which can be functionally divided into four blocks: Temporal2Channel block, Hidden transformation block, Multi-scale feature fusion block, and Feature classification block. The definition of the four blocks is detailed in Supplementary Note 9.

The architecture of the T2C CNN is specifically tailored for fluorescence video analysis. In the first block, long-term spatial convolutions are used to capture extended spatial frequency patterns. In the intermediate blocks (i.e., neither input nor output), skip concatenations61 are adopted to preserve multi-scale spatiotemporal features. Across all blocks of the T2C CNN, a pooling-free strategy62,63 is employed to promote efficient spatiotemporal information fusion. While each of these components has been explored individually in prior work, their integration within the T2C CNN yields an architecture purpose-built for this domain. The individual and combined contributions of these three components are systematically analyzed in the ablation study.
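A schematic PyTorch sketch of the four blocks is shown below, reusing the T2CLayer sketch above. All layer widths and the two-class output are assumptions for illustration; the actual dimensions are those in Supplementary Fig. 10b. Note the skip concatenation between blocks and the absence of any pooling layer.

```python
import torch
import torch.nn as nn

class T2CCNN(nn.Module):
    """Illustrative four-block T2C CNN (layer widths are assumptions)."""

    def __init__(self, t_slice=50, num_classes=2):
        super().__init__()
        self.t2c = T2CLayer(t_slice, out_channels=32)          # Temporal2Channel block
        self.hidden = nn.Sequential(                           # Hidden transformation block
            nn.Conv2d(32, 64, 3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
        )
        self.fuse = nn.Sequential(                             # Multi-scale feature fusion block
            nn.Conv2d(32 + 64, 64, 3, padding=1),              # skip-concatenated input
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
        )
        self.classify = nn.Sequential(                         # Feature classification block
            nn.Flatten(), nn.LazyLinear(num_classes),
        )

    def forward(self, video):                                  # (N, D_in, H, W)
        h0, n_slice = self.t2c(video)                          # per-slice 2D feature maps
        h1 = self.hidden(h0)
        h2 = self.fuse(torch.cat([h0, h1], dim=1))             # cross-connected, pooling-free
        logits = self.classify(h2)
        return logits.view(-1, n_slice, logits.shape[-1])      # (N, N_slice, num_classes)
```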

Training on T2C CNN

After the construction of this four-block architecture, the T2C CNN can be trained with a classic cross-entropy loss64 to achieve accurate classification of binding types. We use stochastic gradient descent65 on a single Tesla V100 32G GPU to update the parameters of the T2C CNN. The training set is randomly divided into five equal parts. For each experiment, one part is used for validation to assess the model’s classification performance, while the remaining four parts are used for training. Training stops when the validation performance has not improved for 100 epochs, and the model with the highest validation performance is selected for testing. In the training environment (CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz, GPU: Tesla V100 32G), preparing the data for 3000 binding events takes approximately 5 min, and training each epoch takes about 3 s. The optimal model is typically determined after around 200 epochs, so the entire training process for 3000 binding events, including data preparation, takes about 15 min.
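The loop below sketches this procedure (cross-entropy loss, SGD, early stopping with a patience of 100 epochs), assuming a model that returns per-slice logits as in the architecture sketch above. The learning rate, the data loaders, the accuracy-based validation metric, and the logit averaging during training are assumptions, not the published settings.

```python
import copy
import torch
import torch.nn as nn

def train_t2c(model, train_loader, val_loader, patience=100, lr=0.01):
    """Training sketch: cross-entropy + SGD with early stopping."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    best_acc, best_state, bad_epochs = 0.0, None, 0

    while bad_epochs < patience:
        model.train()
        for videos, labels in train_loader:
            videos, labels = videos.to(device), labels.to(device)
            logits = model(videos).mean(dim=1)     # event-level logits from slices
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        model.eval()                               # validation accuracy
        correct = total = 0
        with torch.no_grad():
            for videos, labels in val_loader:
                preds = model(videos.to(device)).mean(dim=1).argmax(dim=-1)
                correct += (preds.cpu() == labels).sum().item()
                total += labels.numel()
        acc = correct / total
        if best_state is None or acc > best_acc:   # keep the best validation model
            best_acc, best_state, bad_epochs = acc, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1

    model.load_state_dict(best_state)
    return model, best_acc
```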

Test on T2C CNN

The aggregated prediction \({\hat{y}}_{ij}\) is derived from the average softmax output (as shown in the last block in Supplementary Note 9) and can generally be interpreted as the confidence score for the i-th binding event (fi) belonging to the j-th binding type (yj)66,67, i.e., \({{{\rm{Confidence}}}}(\;{f}_{i}\in {y}_{j}| \,{f}_{i}):={\hat{y}}_{ij}\). During inference, we classify each binding event by selecting the binding type with the highest confidence score. That is, after model training is completed, the T2C CNN predicts the binding type \({\hat{c}}_{i}\) for each single binding event (where i denotes its index) based on the sequence of its diffraction-limited fluorescent spots: \({\hat{c}}_{i}={\arg \max }_{j}{{{\rm{Confidence}}}}(\;{f}_{i}\in {y}_{j}| \,{f}_{i})\). This ensures that the final prediction reflects the most probable binding type under the model’s learned feature representations.
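A minimal sketch of this slice-averaged inference, again assuming a model that returns per-slice logits as in the architecture sketch above:

```python
import torch

@torch.no_grad()
def predict_binding_type(model, video):
    """Classify one binding event by averaging per-slice softmax outputs.

    video : (D_in, H, W) tensor of background-corrected fluorescent spots
            for a single binding event.
    Returns the predicted class index and the per-class confidence scores.
    """
    model.eval()
    logits = model(video.unsqueeze(0))                # (1, N_slice, num_classes)
    confidence = logits.softmax(dim=-1).mean(dim=1)   # average softmax over slices
    return confidence.argmax(dim=-1).item(), confidence.squeeze(0)
```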

In the testing environment (CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz, GPU: Tesla V100 32G), extracting data for all 31,479 binding events from a 33-min and 20-s observation video takes approximately 25 min, while predicting their binding types takes around 18 s.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.