Introduction

Diagnostic ultrasound is widely used due to its excellent accessibility, affordability, and radiation-free nature. Objects in ultrasound are distinguishable only when the wavelength is shorter than the distance between them. Higher frequencies cause increased attenuation and reflection by tissues, and hence the depth of imaging reduces. Lower frequency waves penetrate deeper, but result in lower resolution. Hence, half of a wavelength is defined as the ultrasound resolution limitation. Despite its capability to identify the general structure of organs, this resolution is limited in its ability to provide detailed visual representations of finer vessels.

Inspired by various approaches in microscopy to overcome diffraction limitations, such as single-molecule localization microscopy (SMLM)1 and photoactivated localization microscopy (PALM)2,3, Ultrasound Localization Microscopy (ULM) has been introduced and utilized to facilitate the visualization of microvasculture using ultrasound4. This technique relies on the presence of microbubble (MB) contrast agent. This agent, currently in clinical use, consists of a solution of small, encapsulated gas bubbles on the order of 1–8 μm in size5. As they pass through the ultrasound beam, they can vibrate non-linearly, exhibiting a rich resonant structure that is not produced by neighboring tissue scattering6. A key feature of this agent is that it is restricted to the vasculature due to its size and thus acts as a true red blood cell tracer. Accurate MB localization and tracking form the basis of ULM and result in sub-wavelength resolution of the vessel map visualization. In order to form the ULM localization density image, multiple imaging frames are accumulated over time after MB positions are localized. Flow velocity and direction maps are then generated by tracking bubbles in frames7. In the context of ULM, instead of generating an image based on the traditional use of nonlinear bubble scattering, the focus shifts to mapping the location of the MB through time5. Indeed, due to its deep and noninvasive nature, ULM allows for many applications in the early diagnosis and characterization of diseases, including cancer8, cardiovascular conditions9, and neurological disorders10.

There are mainly two approaches to facilitating MB detection: signal filtering techniques and image processing ones. Signal filtering techniques such as cross-correlation analysis11, maximum likelihood estimation12, minimum variance beamforming13, and Bayesian inference14, are limited in that they require a relatively low concentration of MBs to achieve sparse MB signals, since excessive use of contrast agents can cause signal interference. In addition to long acquisition times, these methods also suffer from extensive parameter tuning. A statistical analysis of the image is another approach proposed15, which exploits the sparsity of the underlying vascular architecture. In that work, a sparsity-enhancing regularization term is applied to the low-resolution correlation image after solving an optimization problem in the correlation space.

With the advancement of machine learning-based methods in a variety of medical imaging applications, deep learning methods have also emerged in the field of super-resolution ultrasound to detect and localize MBs. U-Net16 stands out as one of the widely adopted architectures employed in MB localization due to its encoder-decoder structure, which efficiently recovers high-resolution information from low-resolution ultrasound images. A U-Net architecture was trained on a set of simulated datasets to improve MB localization accuracy in high-density regions while using radiofrequency (RF) signals to generate the input patches17. The same architecture was used while employing Gaussian blurs on the ground truth masks18. To generate training data, authors used Field II ultrasound simulations to simulate each dataset of the IEEE IUS 2022 UltraSR challenge19 and selected the acquisition parameters to best match the point spread functions (PSFs) of the challenge dataset. In this method, MB localization is done on a frame-by-frame basis using images sized to match the final super-resolved image. In light of this, the detection and localization time, as well as the volume processed, increase exponentially as the resolution increases.

In another study, a 3D Convolutional Neural Network (CNN) was deployed to exploit the spatiotemporal information of ultrasound frames over time to generate the maps of mouse brain vasculature20. They defined the problem as a classification problem (whether MB is present or not); hence, the binary tracks that are provided do not contain velocity and flow information. RF and envelope data were utilized to train another U-Net style CNN on datasets with different MB concentrations to achieve high-resolution MB localization21. In this case, however, simulation datasets (using Field II) were used to generate the ground truths, which fail to capture nonlinear MB dynamics. Further more, a sub-pixel CNN architecture with residual blocks to achieve ULM imaging was developed22, but it still relies on the use of spatially-dependent PSFs.

Recently, there have been exciting advances towards the use of transformers and attention-based architectures for the ULM problem, for example, a Swin transformer was employed to improve localization accuracy for different MB densities23. A local to global feature learning framework using deformable convolutions and a vision transformer is used for super-resolution in medical imaging and tested for ultrasound images as well24. Although it performs well, it has limitations when it comes to complex and small vessel networks. A multi-channel CNN along with transformer blocks were used to produce high-resolution images using low-resolution temporal mean of a few number of frames25. In this way, the authors have successfully reduced the data acquisition time while still managing to achieve a high resolution. However, the issue is that their network does not retain individual MB data, resulting in the loss of flow characteristics essential for many applications, such as cardiovascular studies.

The above-mentioned algorithms, while effective and innovative, present challenges in MB localization. Most techniques use a fixed template, the PSF of an MB, neglecting MB dynamics and its variations at different imaging depths. This reduces accuracy as gas-filled MBs contract and expand, generating harmonics and sub-harmonics, and appear differently over time. Additionally, imaging parameters vary with different ultrasound machine settings, so training on parameters from one ultrasound machine or environment does not ensure effectiveness across all others. These issues also hinder generalization from simulation training sets to in vivo test sets. Thus, a straightforward localization method is needed that accounts for MB echo appearance changes over time and the fundamental differences between simulation and in vivo images.

In this paper, a transformer-based architecture was developed to achieve a more generalizable solution for MB localization. Despite CNNs being the dominant architecture in the field, transformers consisting of attention modules26 have emerged as an effective alternative due to their ability to capture long-range dependencies in data. Transformers utilize an encoder-decoder architecture while employing multiple self-attention layers, a mechanism that allows each element in a sequence to attend to other elements, capturing their dependencies. Self-attention enables transformers to capture both local and global contextual information, making them suitable for tasks that require understanding relationships among different image regions. Additionally, pre-training transformers on large-scale datasets enhances their ability to generalize. This advantage is particularly valuable in our case, where data scarcity is one of the challenges. Pre-training enables transfer learning, allowing models to leverage knowledge from diverse datasets and improve performance on specific tasks.

Previously, DEtection TRansformer (DETR)27, was used to show the prominence of the transformer networks in MB localization28. The application of DEformable DETR (DE-DETR)29 for MB localization was also explored by our group30. Both our previous studies were restricted to simulation datasets. In this paper, we argue that a fine-tuned DE-DETR, a network that employs a deformable attention mechanism to further capture object variations, side-steps various methods by manifesting its superiority in the generalization from simulation to in vivo datasets.

Results

Metrics

The MB localization for both simulation and in vivo dataset along with the ground truth is presented in Fig. 1 as a representative sample of our results. To manifest the prominence of DE-DETR in MB localization and quantify our results, the precision, recall and average Root Mean Square Error (RMSE) of each network are presented in Tables 1 and 2. The aforementioned metrics are calculated as follows:

$$\begin{aligned} \text {Precision} = \frac{TP}{FP + TP} \end{aligned}$$
(1)
$$\begin{aligned} \text {Recall} = \frac{TP}{FN + TP} \end{aligned}$$
(2)
$$\begin{aligned} RMSE = \frac{1}{{N_{{TP}}}}{\sum _{i \in {TP}}^{N_{{TP}}} \sqrt{\left( x_i-x'_i\right) ^2+\left( z_i-z_i'\right) ^2}} \end{aligned}$$
(3)

The above \(\left( z'; x'\right)\) and \(\left( z; x\right)\) represent the predicted and ground truth MB positions, respectively. TP is the number of MBs correctly predicted by the network; FP denotes the erroneous classification of a sample as an MB, and FN signifies the incorrect classification of a sample as background according to a precision of \(\frac{\lambda }{2}\) with \({\lambda }\) being the ultrasound wavelength.

Fig. 1
figure 1

Simulation (a) and in vivo (b) sample frame with localizations. Red: ground truth, blue: predictions. The ground truth is unknown in (b).

Table 1 Evaluation Metrics for DETR.
Table 2 Evaluation metrics for DE-DETR.

As shown in Tables 1 and 2, increasing the number of patches generally improves the metrics. Precision for DETR ranges from 50.01–80.53% and recall is between 48.32–67.07% with the “best” case being patch size of 256 by 171. For DE-DETR, precision ranges between 62.12 and 81.95% and recall is between 58.13 and 74.37% and the network does the best for patch size of 256 by 256. The improvement seen by increasing the number of patches can be attributed to the fact that attention modules struggle to capture smaller spatial details, and therefore, patching can help capture fine-grained details more effectively, as each patch can be processed at a higher resolution. However, caution is necessary regarding the size of the number of patches, as an increase in this number leads to a corresponding increase in the number of parameters and computations needed for the model to process each patch individually and then combine the results. Furthermore, as the patches become too small, the network might struggle to capture the global structure of the frames, resulting in poor performance. The training was done on a single NVIDIA GeForce RTX 3080 GPU in a Docker container.

Super-resolution maps

Figure 2 shows the results of our method on the last 100 frames of the simulation video (test dataset) after considering a fixed Gaussian around each localized MB to produce the super-resolution localization maps. The results are also compared with DETR, a morphological-based31 method, temporal mean, Gaussian fit32, Deep-ULM or UNet-based solution18, and the ground truth. As illustrated in Fig. 2, the vessels in our super-resolution localization maps are well-defined and closely resemble the ground truth (g), showcasing its efficacy in accurately representing complex vascular networks. Using DE-DETR’s bounding box based localization, the fine details of the vessel branches and intersections are preserved, indicating high precision in MB detection. This level of detail is less evident in the temporal mean, where vessels appear blurred, and in the morphological method, where the structure look less precise. The Gaussian fit method seems to perform well in the central area, however it does not seem to do as well in the deeper regions, which complies with the fact that MB PSFs look different in deeper regions of an ultrasound B-mode image. Furthermore, UNet seems to be missing some contrast between vessel and background regions, indicating a higher number of false negatives. Furthermore, compared to its predecessor DETR (e), DE-DETR shows enhanced delineation of smaller vessels and improved overall network connectivity by reducing the number of false positives.

Fig. 2
figure 2

SR images of the test dataset, i.e., the last 100 frames of the first simulation video of the challenge dataset obtained from (a) UNet18 (b) Morphological-based31 (c) temporal mean (d) Gaussian fit32 (e) DETR27 (f) DE-DETR (g) ground truth, respectively.

The super-resolution maps based on different localization methods with tracking are displayed in Fig. 3. To appraise our localization method, in Fig. 3, we have compared the results using each method’s localization and our tracking algorithm, leaving localization as the only variable. When compared with Figs. 2, 3 shows the effectiveness of utilizing tracks to visualize vessel maps as the vessels have a smoother and more natural appearance.

The comparison between different methods in Fig. 3 demonstrates the improved performance of DE-DETR, which generates super-resolution maps closely matching the ground truth and providing a powerful tool for detailed vascular imaging. The UNet-based method (a) captures the general structure of the vessels but introduces significant artifacts and noise, particularly in the dense regions. The vessels appear fragmented and less continuous, which can be attributed to the limitations of the UNet in handling complex vessel patterns and the high density of localized MBs. The morphological method (b) provides more continuous vessel structures. However, it still struggles with accurately delineating smaller vessels and fine details. The method tends to over-smooth the vessel boundaries, which leads to a loss of intricate structural information. The Gaussian fit method (c) provides a better approximation of the vessel structures, capturing both large and small vessels with reasonable accuracy. However, some regions still exhibit blurring, particularly in areas with high vessel density, indicating that the Gaussian fit may not fully resolve closely spaced MBs. The DETR (d) and DE-DETR (e) methods show substantial improvements over the previous methods and both provide a high level of detail and continuity in the vessel structures. The DE-DETR method (e), in particular, stands out by more closely approximating the ground truth (f) both at the denser parts of the map in the middle and at the deeper areas in the bottom left of the super-resolution maps.

Fig. 3
figure 3

SR images of the test dataset, i.e. the last 100 frames of the first simulation video of the challenge dataset obtained from (a) UNet-based18 (b) Morphological-based31 (c) Gaussian fit32 (d) DETR27 (e) DE-DETR (f) ground truth.

To evaluate our results further, Tables 3 and 4 present the Dice score between every compared localization method and the ground truth. Here, the Dice score provides a quantitative measure of the similarity between predicted and ground-truth vessel delineations. Among these, DE-DETR achieves the highest Dice score at 83.19%, followed closely by DETR at 83.00%. The Temporal mean method has the lowest score at 78.14%. This indicates that DE-DETR and DETR perform better in terms of segmentation accuracy for fixed Gaussian-based maps compared to the other methods.

Table 4 displays the Dice scores for tracking-based super-resolution maps using the same set of methods. Here, DE-DETR again performs the best with a Dice score of 92.00%, slightly outperforming DETR, which has a score of 91.38%. The Morph method also shows a strong performance with an 84.83% score. These results suggest that tracking-based super-resolution maps generally yield higher Dice scores across all methods compared to fixed Gaussian-based maps, indicating that tracking-based approaches might be more effective for enhancing segmentation accuracy. In summary, DE-DETR consistently outperforms other methods in both scenarios, while tracking-based super-resolution maps tend to produce better segmentation results than fixed Gaussian-based maps.

Table 3 Dice scores of different methods for fixed Gaussian-based super-resolution maps.
Table 4 Dice scores of different methods for tracking-based super-resolution maps.

DE-DETR also demonstrates superior performance in maintaining precision and recall across varying depths of ultrasound images. Figure 4 , which segments the image into three distinct depths, illustrates less reduction in both precision and recall for DE-DE compared to other methods. This performance can be attributed to the fact that our approach does not rely on PSFs, which are often variable with depth variations. By circumventing the dependency on PSFs, our method ensures consistent and robust image segmentation, thereby achieving better precision and recall metrics at different image depths.

Fig. 4
figure 4

Precision and Recall for different methods in 3 different depths of imaging.

Next, we assessed the results of our method on the in vivo rat brain dataset from the UltraSR challenge. Figure 5 demonstrates the results of localization and tracking for Deep-ULM18, DETR, DE-DETR, Gfit and temporal mean. It’s worth mentioning that the tracking along each of these SR maps is the same as the tracking used in the rest of the paper. The UNet-based method (a) shows a reasonable attempt at capturing the vessel structures but suffers from notable artifacts and noise. In regions marked by blue arrows, the vessels appear fragmented and less continuous. This discontinuity and noise can obscure the fine vascular details, making it challenging to distinguish between true vessel structures and artifacts. We’d like to note that the UNet results are taken from18, where the network was trained on Field-II simulations specifically derived from the in vivo dataset. In contrast, our DETR and DEDETR models were trained exclusively on the challenge-provided simulation data, which is entirely independent of the in vivo data. This distinction underscores the robustness of our approach and its ability to generalize across different data domains without relying on dataset-specific priors. The temporal mean method (b) results in a highly blurred output, losing much of the fine detail necessary for accurate vessel visualization. The vessels are poorly defined, and the overall map lacks the sharpness required for precise analysis. This method fails to capture dynamic changes and fine structures, leading to a loss of critical information in the super-resolution map. The Gaussian fit method (c) provides a clearer representation of the vessel structures. However, in regions indicated by the blue arrows, the vessels still exhibit some blurring and loss of fine detail. The DETR method (d) shows a significant improvement over the previous methods, providing a high level of detail and continuity in the vessel structures. The vessels are more accurately represented, with fewer artifacts and noise. However, there are still some regions, marked by blue arrows, where the method could improve in capturing the finest details and resolving closely spaced MBs. The DE-DETR method excels in providing a detailed and continuous representation of the vessel structures, closely approximating the ground truth. The regions indicated by the blue arrows show that DE-DETR effectively captures intricate vessel details that other methods miss. This method’s advanced detection and tracking capabilities allow for more precise localization and better handling of overlapping MBs, resulting in a high-quality super-resolution map.

To further analyse the superiority of our method to visualize smaller vessels, more dense areas and deeper regions, we have chosen three different areas to zoom in and compared the results of each method accordingly in Fig. 6. Temporal mean has been excluded since the results from Fig. 5 already shows the poor details provided by this method. In Fig. 6, the image on the left is the same as Fig. 5, this time indicating the zoomed in boxes. Boxes (a) to (c) are selected from different regions of the super-resolution map. Each row shows the results for one of the boxes and each of the columns indicate the method used to get the super-resolved maps. As illustrated by the subfigures in Fig. 6, DEDETR provides more detailed and smooth vessels maps in different regions of the image, show casing it’s ability in providing high quality microvasculature maps.

Fig. 5
figure 5

SR images of the in vivo test dataset, obtained using Gaussians based on tracking from (a) UNet-based solution [9] (b) temporal mean (c) Gaussian Fit (d) DETR [19] (e) DE-DETR, respectively.

Fig. 6
figure 6

Zoomed-in boxes from different in vivo SR maps, showing the positions of the boxes in the DE-DETR super-resolution map on left and the results from each method (indicated by columns) for each of the (a), (b) and (c) boxes.

To quantify spatial resolution, we computed the Modulation Transfer Function (MTF) from in vivo super-resolution images using an edge-based method, in Fig. 7. Specifically, a region of interest (ROI) containing tightly packed vessels was selected, with the aim of evaluating each method’s ability to resolve fine spatial details. Within this ROI, a horizontal line profile was extracted across the vessels.

The intensity profile along this line represents a one-dimensional cross-section of the structural detail in the image. This profile was first smoothed using a Gaussian filter to reduce high-frequency noise and mitigate local fluctuations that could bias the frequency analysis. The smoothed signal was then mean-centered and windowed (e.g., using a Hamming window) to minimize spectral leakage during Fourier transformation.

The one-dimensional Fourier transform of the line profile was computed, and the magnitude of the resulting spectrum was normalized with respect to its DC component. This yielded the normalized MTF curve, which characterizes the image’s ability to transmit spatial frequencies.

As illustrated in Fig. 7, the superior performance of deformable DETR is clearly supported by the MTF measurements compared to alternative approaches in terms of its ability to detect, localize, and separate closely spaced vessels.

Fig. 7
figure 7

MTF calculated across a horizontal line (color: cyan) in an ROI with tightly packed vessels (color: green) for (a) DEDETR, (b) DETR, (c) UNet and (d) Gaussian fit. (e) Comparison of the normalized MTF of all methods.

Discussion

Main findings

In this study, we introduced an MB localization framework based on the attention mechanism. Our approach is not reliant on MB PSFs and is trained using datasets characterized by dynamic PSFs.

Our findings demonstrate an increased MB detection compared to traditional MB localization techniques. Our simulation results indicate superior precision and recall, and consistently demonstrate better performance with fewer missed pairs, as shown in Fig. 5.

Results indicate that the incorporation of additional feature maps has evidently enhanced the performance of DE-DETR. The additional feature maps integrated into DE-DETR enhance its performance by providing richer contextual information and features for MB detection which leads to improved accuracy in identifying and localizing MBs.

This study further illustrates the benefits of patching and finds the optimal number of patches to achieve the best precision and recall. Specifically, increasing the quantity of non-overlapping patches has been observed to enhance the network’s performance. We hypothesize that this improvement arises from an increased dataset, akin to a form of data augmentation. Also, patching helps avoid increasing the number of queries (which determines the maximum number of detected objects in an image).

We employed random data augmentation to mitigate overfitting and enhance generalization capabilities on the test dataset. We further validated our results by calculating the Dice score in Tables 3 and 4. The success of our method, without explicit background rejection or fine-tuned post processing, could be attributed to the capabilities of DETR and DE-DETR in capturing relevant complex features to effectively identify unique MB characteristics. The transformer-based architecture of these networks increases their capacity, likely enabling them to differentiate between MBs and the background, even in the absence of background rejection.

Furthermore, during the training phase of these networks, the background is treated as a distinguished class, allowing us to control the significance of this class in the loss function and doing an implicit background rejection, effectively ignoring the background. Additionally, both networks have undergone extensive pre-training on computer vision datasets and the foundational knowledge gained during the initial phase significantly enhances their performance and adaptability for our specialized task.

It’s crucial to note that the quality and consistency of the data used for training and testing which can significantly influence the model’s performance. Utilizing datasets, such as the UltraSR 2022 challenge dataset which contains high-quality images with clear distinctions between MBs and the background, facilitates the training phase of the networks by leveraging this information effectively without requiring additional preprocessing steps.

The adoption of bounding boxes, as opposed to solely using the center points of MBs, represents an advancement over conventional deep learning methods. This approach provides the network with a broader perspective of the surroundings of each MB. When properly defined, these bounding boxes can potentially maintain the network’s performance consistency when transitioning from simulated environments to real-world (in vivo) scenarios. This enhanced contextual awareness contributes to more robust object detection and localization.

Our analysis revealed optimal results when the mean contrast of the test set closely matched that of the training set. This consistency in contrast levels between training and testing data significantly enhanced the model’s performance. The alignment of these visual characteristics appears to be a crucial factor in achieving peak accuracy and generalization.

An ablation study of some samples of encoder and decoder queries is provided in the Supplementary Material, which demonstrates where the transformer encoder focuses to extract features and generate encoded representations of input data and how the model aligns the detected MBs in the frames with corresponding outputs in the final prediction. This analysis offers insights into the model’s reasoning processes concerning MB positions and whether or not there is a relationship between these positions.

In the in vivo analyses, the enhanced efficacy of the proposed approach relative to the baseline method is depicted in Fig. 5. As indicated by the arrows, the proposed technique surpasses the baseline method. The intensity profile visualization corroborates the advantages of utilizing our network for localization within a sample ROI.

Finally, as demonstrated in Fig. 3 and Table 4 the inclusion of velocity information facilitated a more precise reconstruction of MB vessels.

Limitations and future work

In the MB tracking phase, we employ a straightforward KDTree-based algorithm to determine the velocity magnitude and direction of each MB. While this method suffices, more sophisticated approaches could also be utilized, e.g. Kalman filter based tracking33 or optical flow based techniques34. Additionally, the extensive parameters of our proposed networks, coupled with the increased number of patches required for enhanced precision and recall, introduce computational challenges and prolonged training durations. Our investigation has predominantly centered on 2D MB localization. Further exploration is needed concerning the 3D MB localization and super-resolution ultrasound, such as research by35,36, to compensate for out-of-plane motion of MBs. Additionally, it is imperative to assess our proposed method across varying MB concentrations, environments and vessel structures over time. Another next step could be incorprating RF signals as another modality to help localize the MBs. Recent research manifested the benefits of using RF signals in mitigating the domain shift between simulation and the real world37. Finally, optimizing the network’s output threshold to achieve the ideal balance between precision and recall is a complex, dataset-dependent process. This critical step requires substantial time investment and careful analysis, as the optimal threshold can vary significantly across diverse datasets.

Methods

Dataset

The simulation and in vivo dataset used to evaluate our network are provided in the Ultra SR Challenge IEEE IUS 202238. The simulation involved generating two random networks of MB-seeded 3D vessel trees using a microvascular tree generator. The flow within these networks was modeled using the Hagen–Poiseuille equation, while a linear acoustic simulator, incorporating nonlinear MB response, was used for ultrasound signal simulation. To introduce variability, different spatial sections had varying vessel densities and vessel tree parameters—including bifurcation probability, segment angles, and MB seeding probability—were randomized. Multiple branching vessel trees were then merged into a single network with a higher MB concentration. To simulate realistic noise conditions, Gaussian White Noise was band-pass filtered to match the transducer’s bandwidth and added to the RF signals. Time gain compensation was applied before beamforming to account for wave attenuation with depth.

The networks were intentionally designed to lack a recognizable structure, to make localization and tracking more challenging. The generated frames are highly realistic and present a challenging scenario for algorithms due to their high motion blur and track concentration, relatively low frame rate, fluctuating signal-to-noise ratio, and the presence of out-of-plane motion in some motion blocks within the 3D simulation. More details can be found in the original paper38.

The training dataset is chosen as the simulation data with four videos and 1640 frames in total. The test set is the last hundred frames of one of the videos; the next 20 frames are then omitted to make sure that there is no more than 0.1689\({\%}\) correlation between train and test data. The in vivo dataset contains 8,000 rat brain frames and all of it was used as the test set. Training and test set from Simulation data and the in vivo test set all undergo singular value decomposition (SVD). All MB annotations for the training set were masked and converted into COCO39 format to comply with our network’s input format. To capture a wider range of variations and patterns in the MB appearances and to allow the model to generalize better, a supervised random augmentation was implemented. The term supervised here indicates that the augmentations did not alter the appearance of the training images in a way that they no longer resemble an ultrasound image. The augmentations are choosen randomly between affine, Gaussian blur, dropout, fliplr, added speckle noise and resize.

The training and test datasets were then patched with varying numbers of patches to increase the size of the training dataset, and facilitate the use of our transformer-based network. The optimal number of patches is discussed in Results.

Network architecture

DETR27 is an end-to-end object detection encoder-decoder network, featuring multi-head and self-attention mechanisms and enabling the model to consider the entire image and the relationships between objects. Figure 8 illustrates the DETR architecture. Image features are extracted using a CNN, specifically ResNet50. These features are then processed by the transformer’s encoder-decoder structure as queries and keys. The decoder’s output is converted into bounding box and class predictions through a 3-layer perceptron, with the model containing 41M trainable parameters. Positional embeddings are used to ensure permutation-invariance, with learned positional encodings added to the input of each attention layer in the decoder. In DETR, bipartite matching was utilized to minimize a set-based Hungarian loss to avoid near-duplicates and enforce permutation invariance, eliminating the need for any pre- or post-processing algorithms. Previously, we demonstrated the performance of DETR for MB localization28. Despite this, DETR network faces shortcomings such as high complexity and limited feature resolution when it comes to smaller objects27. The complexity of transformer encoders in DETR depend on the resolution of the input image and is derived from O(HWC2), with H, W and C being the height, width of input images and the number of channels in the network, respectively. Therefore, unlike CNNs that only pay attention to neighboring elements, in DETR each element has to pay attention to all else, losing the advantages of the inductive bias of CNNs. This unavoidable quadratic cost, in turn, results in a slow convergence.

To mitigate the issues of DETR, DE-DETR proposes a multi-scale feature map and a novel deformable attention module. The deformable attention module’s mechanism is as follows29

$$\begin{aligned} \operatorname {DeformAttn}\left( {P}, x\right)&= \sum _{n=1}^N W_n [\sum _{k=1}^K A_{nk} \cdot W_n^{\prime } x\left( P+\Delta P_{n k}\right) ] \end{aligned}$$
(4)

with x being the input feature map, P being the reference point, N indicating the number of heads, and K being the number of sample keys. \({A_{nk}}\) shows the attention weight for each head and sample key, and \({W_n}\) and \({W_n^{\prime }}\) are the weights for the linear layers. Finally, \(\Delta P_{n k}\) shows the neighbours (offsets) of the reference point to pay attention to, given the limited budget of K sample keys. Attention weights for heads and offsets are calculated via a linear projection of the query features of the encoder and decoder. The encoder queries are the pixels of the feature map, and the decoder queries are object queries as defined by29. A graphic representation of the deformable attention module is depicted in Fig. 9. In this example, we have three heads and two sample keys for each head. The offsets for each head are shown in different colors. As long as the number of sample keys is less than HW, the complexity of DE-DETR would be determined by the number of queries29 and hence avoid the aforementioned quadratic cost compared to DETR.

To improve the accuracy in predicting the correct class labels (between classes MB and “not MB”), we increased the class cost coefficient in the loss function of the network. To prevent over-fitting, we used a higher dropout rate compared to the original DE-DETR and finally the number of encoder and decoder layers was reduced from 6 to 2 to enhance the generalization abilities of the network and facilitate the training process.

Fig. 8
figure 8

The architecture of DETR27. DETR combines a CNN backbone, transformer encoder-decoder, and learned object queries to perform end-to-end object detection.

Fig. 9
figure 9

Architecture of the deformable attention module. The input feature map is linearly projected and reference points are identified. Attention is computed using multiple heads (AHead1, AHead2, AHead3) and combined through a linear projection with their corresponding weights to produce the final output.

Tracking

After finding the MB locations in each frame using DE-DETR, we used the linking algorithm based on40, which allows MB tracking with a significantly greater spatial resolution than the ultrasound wavelength. The algorithm was originally developed to track the particles with Brownian diffusion by considering the best guess for where a particle is going to be, according to the following formula:

$$\begin{aligned} || P\left( t_1, t_0, \overrightarrow{x}\left( t_0\right) \right) - \overrightarrow{x}\left( t_0\right) || \le \text {R} \end{aligned}$$
(5)

with R being the search range, \(t_0\) and \(t_1\) being the current and the next time steps, x the position of the particle and finally \(P\left( t_1, t_0, \overrightarrow{x}\left( t_0\right) \right)\) the Brownian probability of where the particle would be in the next time step. Since the MB movements are not Brownian, i.e., MB velocities between frames are not uncorrelated, we utilized the ”TrackPy”41 library’s dynamic predictors to allow different particles to have different velocities that can change over time. To accomplish this, we use the most recent MB velocities, and define P as follows41:

$$\begin{aligned} P\left( t_1, t_0, \vec {x}_i\left( t_0\right) \right) =\vec {x}_i\left( t_0\right) +\frac{\vec {x}_i\left( t_0\right) -\vec {x}_i\left( t_{-1}\right) }{t_0-t_{-1}}\left( t_1-t_0\right) \end{aligned}$$
(6)

The pairwise linking has been done using KDTree, and if an MB in the current frame has no corresponding particle in the next frame, it is marked as lost. We’ve also discarded tracks that are too short and trajectories with few points since they are often spurious.

To address velocity rejection, a temporal-based filter is employed that discards velocities exceeding three times the velocity of the corresponding MB in the preceding frame.

Conclusion

We presented and evaluated two transformer-based localization algorithms that do not rely on using MB PSFs and mitigate the drastic change between the simulation and in vivo datasets. This was accomplished by treating the background as a class, leveraging different levels of feature maps, employing bounding box based object detection and a deformable attention module. Our method is trained on datasets with dynamic PSFs coming from the UltraSR challenge dataset. Finally, using random augmentation has helped the network to generalize better on test datasets. These advancements can play a crucial role in mitigating the domain shift from simulation to in vivo in the ULM field, significantly improving the resolution and accuracy of microvascular imaging and thereby enhancing the diagnostic and therapeutic capabilities of ultrasound technology.