Introduction

Bloom filters (BFs) are widely recognized as a robust data structure for efficiently performing approximate set membership queries1. Their compact design and ability to quickly assess whether an element likely belongs to a set, albeit with a small chance of false positives, have made them a widely favored tool across diverse applications2. While BFs have conventionally been used for managing textual and numerical data, they also hold considerable promise for image processing, where efficiently handling large volumes of visual data poses a significant challenge.

Building on this idea, Image-based Bloom Filters (IBFs) have gained traction as an effective tool for tasks such as image similarity searches and recognition3. Unlike conventional BFs that deal with text or numeric data, IBFs are tailored for visual content. By compressing images into concise representations, IBFs enable fast visual data comparison, making them ideal for large-scale image retrieval and similar applications. Essentially, IBFs apply the core principles of BFs to image processing by transforming features like color, texture, and structure into a form that works with BF operations.

The process starts with feature extraction, where key visual elements of an image are analyzed and converted into a digital format. These elements can include color histograms, texture patterns, or more sophisticated descriptors from deep learning models4. After extraction, the features are hashed into a bit array through multiple hash functions, resulting in a compact yet informative image representation. During a query, the IBF compares the hashed features to the bit array to determine if a match is probable. While this method significantly accelerates the search process, it still maintains the probabilistic characteristics of BFs, meaning false positives, though infrequent, can still occur.
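To make this pipeline concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of a Bloom filter keyed on a pre-extracted image feature vector, such as a color histogram. Salted SHA-256 digests stand in for the multiple hash functions; all class and parameter names here are assumptions for illustration.

```python
import hashlib

class ImageBloomFilter:
    """Minimal Bloom filter over image feature vectors (illustrative sketch)."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, feature_vector):
        # Serialize the (quantized) feature vector, then derive one bit
        # position per hash function by salting SHA-256 with the index.
        data = ",".join(str(round(v, 3)) for v in feature_vector).encode()
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, feature_vector):
        for pos in self._positions(feature_vector):
            self.bits[pos] = 1

    def query(self, feature_vector):
        # True means "probably present"; False is definitive.
        return all(self.bits[pos] for pos in self._positions(feature_vector))
```

A query only reports a probable match, so false positives remain possible, which is exactly the limitation the check bits introduced later are meant to mitigate.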

IBFs offer several benefits, especially in situations that demand rapid filtering of irrelevant data while conserving computational power for more relevant options. In applications like large-scale image retrieval, content-based classification, or biometric recognition, IBFs’ ability to quickly focus on likely matches proves invaluable5. Their strength lies in efficient storage, processing, and scalability, making them highly effective for managing large datasets. However, this efficiency comes with tradeoffs, such as the possibility of false positives and the necessity of fine-tuning parameters like bit array size and the number of hash functions.

Thus, this paper investigates the design and application of IBF, emphasizing how they can be optimized for handling image-based data effectively. A primary focus is on addressing the issue of false positives, a common limitation of BFs, by leveraging the use of check bits6. These check bits, generated from image captions and visual content, provide an extra layer of verification, helping to lower the probability of false positives without significantly increasing the requirements for storage or computation. Our analysis and experiments show that this method offers a favorable tradeoff between efficiency and accuracy in image retrieval systems.

The key contributions of this paper are as follows:

  • Implementing check bits to minimize false positives in BF;

  • Leveraging caption-based check bits as a form of content-based verification to further reduce false positives;

  • Utilizing image intensity-derived check bits to enhance the efficiency of BF;

  • Combining the caption-based and image-based check bits derived from image captions and content; together, the two sources further reduce false positives and deliver strong BF efficiency with a smaller-sized BF.

The remainder of the paper is organized as follows: section “Related work” highlights the related work. Section “Proposed approach” sheds light on the proposed approach. Section “Experiments and results” thoroughly discusses the experiments and results. A comparative evaluation is provided in section “Comparative evaluation”. The paper concludes with a summary and future directions.

Related work

IBFs have emerged as a focal point of research due to their effectiveness in handling diverse data storage and retrieval tasks. Mitzenmacher and Upfal7 and Broder and Mitzenmacher8 laid the theoretical groundwork for BFs, which serves as the foundation for IBFs. Building upon this theoretical framework, Kirsch et al.9 proposed optimizations to traditional BFs, which can be extended to IBFs, enhancing their performance. The versatility of BF, demonstrated in applications like probabilistic verification by10, suggests promising integration opportunities for IBFs into probabilistic models within image-based systems. In addition to the conventional application of BFs, the work in11 has pioneered an extension of this concept into approximate state machines. This leap expands the utility of BFs, particularly in image data processing, presenting exciting new avenues for exploration.

Putze et al.12 introduce cache-hash and space-efficient variants of BF, specifically to alleviate resource constraints encountered in image processing environments utilizing Invertible BF. A detailed review by Broder and Mitzenmacher8 examines the diverse network applications of BF. Additionally, the study of data locality in algorithms by Acar et al.13 offers valuable perspectives for optimizing IBF performance in image processing. Compact and concurrent caching strategies proposed by Fan et al.14 promise to improve storage and retrieval efficiency operations within IBFs. Furthermore, alternate hashing methods, such as cuckoo hashing15, provide practical solutions for enhancing collision resolution within IBFs.

Similarly, Araujo et al.4 introduce a framework that applies BFs for large-scale video retrieval using image-based queries. Conventional methods involve indexing each video frame separately, which results in high latency and considerable memory usage. To address this, the researchers propose a system that employs BFs to index extended video segments (or scenes), facilitating faster and more scalable retrieval. This approach significantly enhances retrieval speed and accuracy compared to traditional frame or shot-level indexing. Using scene-based descriptors with Fisher embeddings in conjunction with BFs, the system achieves up to a 24% improvement in mean average precision.

Chakrabarti16 proposes an image retrieval approach that utilizes neural hash codes and BFs to minimize false positives. The study explores how features extracted through Convolutional Neural Networks (CNNs) can be used to match images, thereby simplifying the retrieval process. The method leverages high-level and low-level feature maps by incorporating multiple neural hash codes derived from various CNN layers. A hierarchical retrieval strategy is employed, beginning with semantically similar images at a broad level and then narrowing down to structural similarities managed through BFs.

In a different context, Breidenbach et al.3 introduce a privacy-enhanced image hashing technique that utilizes BFs. This research focuses on applying BFs to store robust image hashes, thereby improving privacy in applications such as forensic analysis and content filtering. The approach preserves privacy by embedding cryptographic hash functions into the BF structure while ensuring efficient identification of known illegal content. The authors evaluate the effectiveness of this method by analyzing error rates and the level of privacy protection it achieves. Furthermore, Li et al.17 explore secure image search in cloud-based Internet of Things (IoT) environments by developing the Merged and Repeated Indistinguishable Bloom Filter (MRIBF) structure. MRIBF reduces storage overhead and enhances security by achieving a lower false positive rate. Their proposed Bloom Filter-based Image Search (BFIS) approach enables faster and more precise searching. Both theoretical analysis and extensive experiments demonstrate the scheme’s accuracy, security, and practical efficiency.

Additionally, BFs have found applications in steganography and image security. Salim et al.18 introduce a steganography method that combines chaotic systems with BFs to improve security. By employing Lorenz’s chaotic system to generate pseudorandom image positions, the method overcomes weaknesses in the Least Significant Bit (LSB) approach used in steganography. Integrating BFs helps prevent data repetition and loss within the same pixel, making it more resistant to visual and analytical attacks. In another context, Jiang et al.19 propose enhancing the traditional BF algorithm for barcode recognition and processing. The paper tackles the challenge of high false positive rates in large-scale applications. Their approach splits the BF’s bit vector into two sections and uses differential amplification to accentuate the differences between elements. This refinement reduces the false positive rate without significantly increasing processing costs.

Ng et al.20 propose a scalable image retrieval system utilizing a two-layer BF to enhance retrieval performance on mobile devices. The first layer leverages Asymmetric Cyclical Hashing (ACH) for initial verification, while the second layer applies secure hashing to further minimize false positives. This model increases retrieval accuracy, accelerates the process, and requires less space than other techniques. In a related development, Pontarelli et al.21 introduce the Fingerprint Counting Bloom Filter (FPCBF), which uses fingerprints to decrease false positives in Counting Bloom Filters (CBFs). In contrast to previous approaches like the Variable Increment Counting Bloom Filter (VICBF), FPCBF provides superior performance, particularly in reducing false positives, with a more straightforward implementation. The theoretical analysis and simulations indicate that FPCBF surpasses VICBF, mainly when more bits per element are used. Yu et al.22 developed an image hashing technique using low-rank sparse matrix decomposition.

Recent advancements in cross-modal retrieval and medical image analysis have been driven by deep learning techniques, leading to significant improvements in efficiency and accuracy. Wang et al.23 introduced an innovative approach to enhance the alignment and interaction between different data modalities, improving how information is retrieved in cross-modal tasks. By improving alignment mechanisms, the model enhances retrieval accuracy across modalities. This study directly supports the motivation for combining captions with image features, reinforcing the check-bit strategy employed in the current paper. In the medical field, Li et al.24 explored a method for visualizing 3D medical images emphasizing importance-aware retrieval, making it easier for professionals to access relevant medical content. Meanwhile, Cai et al.25 leveraged generative adversarial networks (GANs) to create multi-style cartoonization, utilizing multiple datasets to enhance image transformation quality. Similarly, deep learning has also been applied to remote sensing. Feng et al.26 developed a technique to improve road detection from satellite images by enhancing how road features are captured and analyzed.

In ophthalmic image analysis, researchers have made significant strides in automating disease detection and image quality assessment. One study27 introduced a fuzzy broad learning system to segment the optic disk and cup for glaucoma screening, providing a more efficient approach to early diagnosis. Another approach focused on ensuring the reliability of medical imaging by developing a domain-invariant, interpretable method for assessing fundus image quality28. Additionally, vessel extraction from non-fluorescein fundus images has been improved using an orientation-aware detector, which aids in accurately identifying blood vessels—an essential step in diagnosing retinal diseases29. These studies highlight how artificial intelligence is transforming fields such as medical imaging, remote sensing, and information retrieval, paving the way for more accurate, efficient, and automated analysis in healthcare and beyond.

Li et al.30 proposed the Merged and Repeated Indistinguishable Bloom Filter (MRIBF) to enable secure and efficient image retrieval in cloud-based IoT environments. Their approach reduces storage overhead and minimizes false positives, common limitations in traditional Bloom Filter implementations. This work demonstrates the potential of BF optimizations for scalable and secure retrieval, closely related to the present study’s focus on enhancing retrieval accuracy.

Salim et al.31 introduced a novel image steganography method that combines the Lorenz chaotic system with Bloom Filters to improve security and robustness. Integrating BFs strengthens resistance against visual and statistical attacks, offering improved data hiding capabilities. Although targeted at steganography, this study underscores how Bloom Filters can be extended to support security-critical multimedia applications, which complements the retrieval-oriented perspective of the current work.

Although there have been some attempts, as discussed in the aforementioned studies, to handle diverse data storage and retrieval tasks through different approaches, a content-based approach that utilizes image and caption data to generate bits has not yet been considered. Therefore, we propose this approach, and the details are discussed in the following section.

Proposed approach

We present a content-based approach that utilizes image and caption data to generate bits that help minimize false positives in BFs. These bits, referred to as check bits, enhance the accuracy of the filtering process.

One potential method for decreasing the occurrence of false positives in BFs is to increase the filter size. However, this technique requires large memory and computational resources, which may not be suitable for large datasets. On the other hand, our approach focuses on optimizing performance without requiring a larger BF. Our proposed approach thus aims to reduce the size of the BF while still achieving good overall performance by reducing the number of false positives in the filter. We keep the BF smaller and reduce the false positives by introducing content bits (check bits) from the caption of the image and the image (intensity content) components. Fig. 1 shows the flow of the image caption-based check bits generation, and Fig. 2 shows the image content-based check bits generation. These are evaluated in the experimental evaluation.

Fig. 1

Overview of the image caption-based check bits’ generation.

Fig. 2

Overview of the image content-based check bits’ generation.

First, we explain the general approach to generating the check bits; the same strategy underlies both Figs. 1 and 2.

We can use numerical or textual data to generate check bits. The input data is first converted into binary form. Next, the binary digits are summed to produce a decimal value, which is then converted back into a string of 1’s and 0’s. Since this string can contain many 1’s and 0’s, we select only a few representative bits (at least three) from it; we refer to these representative bits as the check bits. The objective is to keep the number of check bits small: increasing it can reduce false positives, but it also increases processing and memory requirements. Selecting the optimal number of check bits is therefore crucial for balancing false positive reduction against computational cost.

After the binary conversion, there are several possibilities for where to extract the check bits: from the beginning, middle, or end of the bit string. In the experimental evaluation, we found that selecting the check bits from the middle is optimal in most cases. For the proposed approach, the check bit offset is set to 1, meaning that the incremental offset for the next check bit is 1, and the number of hash functions used to generate check bits is set to three, which we found to be optimal in many scenarios.

The algorithm is provided next:

Algorithm 1

Proposed method pseudocode.
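As a rough illustration of the steps described above (binary conversion, summing the bits, and selecting representative middle bits with an offset of 1), the following hedged sketch shows one possible realization; the exact encoding and selection details of the paper's implementation may differ.

```python
def generate_check_bits(data: str, num_bits: int = 3, offset: int = 1) -> str:
    """Illustrative check-bit generation following the steps in the text."""
    # 1. Convert the input (caption text or a numeric string) to binary.
    binary = "".join(format(byte, "08b") for byte in data.encode("utf-8"))
    # 2. Sum the binary digits to obtain a decimal value.
    decimal_sum = sum(int(b) for b in binary)
    # 3. Convert that decimal value back into a string of 1's and 0's.
    bit_string = format(decimal_sum, "b")
    # 4. Select a few representative bits from the middle, stepping by `offset`.
    mid = len(bit_string) // 2
    selected = [bit_string[(mid + i * offset) % len(bit_string)]
                for i in range(num_bits)]
    return "".join(selected)
```

Because the same input always yields the same check bits, a query can recompute them and compare against the stored bits as a cheap secondary verification after the BF reports a probable match.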

Experiments and results

The dataset selected for this study, proposed by32, comprises 31,783 photographs sourced from Flickr, depicting a wide range of everyday activities, events and scenes, along with 158,915 captions generated through crowdsourcing. A sample is shown in Fig. 3 (obtained from https://www.flickr.com/).

Fig. 3

Sample images from the utilized dataset with the corresponding captions.

This extensive collection builds upon and extends the corpus initially created by33, which included 8,092 images. The size of the dataset offers a suitable balance, making it comprehensive enough for in-depth analysis yet manageable for computational processes, which is crucial for the efficacy of machine learning models. The methodology for gathering images and generating captions follows that established by33, incorporating the same annotation guidelines and quality controls to correct spelling errors and eliminate ungrammatical or non-descriptive sentences. This approach guarantees consistency and reliability in the dataset.

For each image, five independent annotators, unfamiliar with the specific details or entities in the images, provide captions. This approach reduces individual bias, resulting in more generalized descriptions like “A group of people cooking outdoors” instead of personal narratives such as “My family’s barbecue party.” This format proves valuable as it allows for exploring various computational techniques, including the use of BFs, which benefit from categorizing captions to improve semantic analysis.

The specificity range in captions provides a broad spectrum of data for analysis. This diversity is crucial for inducing denotational similarities between expressions that extend beyond straightforward syntactic rewrite rules, aligning perfectly with the goals of the BF implementation.

We conducted diverse experiments and tested how different setups affect the likelihood of false positives in a BF-based system dealing with a large dataset. More specifically, we examined factors such as the number of images, the size of the BF, and the use of additional check bits.

Using captions only

In the first set of experiments, we rely solely on the captions to generate check bits. As shown in Table 1, 10,000 images are considered. The size of the BF is set to 500. Moreover, only three check bits are used. It is important to note that the size of the BF is much smaller than the number of images.

Table 1 Parameters used in the experiment with captions for generating check bits in a BF.

The results showed a marked difference when check bits were employed compared to when they were not. Without check bits, the system made false positive identifications about 99% of the time. However, the rate dropped to 90% when check bits were added. This improvement shows that adding check bits helped reduce false positives, though they still occurred frequently. The reason for the extraordinarily high false positive rate is the vast difference between the BF size and the number of images.
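This near-saturation is consistent with the standard Bloom filter false-positive approximation p ≈ (1 − e^(−kn/m))^k. Plugging in the Table 1 settings (a quick illustrative check, not the experimental pipeline itself) shows the filter is essentially full:

```python
import math

def bf_false_positive_rate(n: int, m: int, k: int) -> float:
    """Standard approximation of the BF false positive probability."""
    return (1 - math.exp(-k * n / m)) ** k

# Settings from Table 1: 10,000 images, a 500-bit filter, three hash functions.
print(bf_false_positive_rate(n=10_000, m=500, k=3))  # effectively 1.0
```

With 10,000 items hashed into only 500 bits, virtually every bit is set, so almost any query collides; the check bits are what pull the observed rate below this theoretical ceiling.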

This experiment highlights the complexity of designing computer systems to handle large amounts of data. It suggests adding extra features like check bits to improve accuracy, but there is still room for improvement. Understanding these dynamics is crucial for developing reliable systems for tasks such as searching through information, managing databases, and securing networks.

Usage of image hash only

In this analysis, we used the image hash rather than the image captions. The number of images was reduced to 1,000, while the remaining parameters were kept the same as in the captions-only experiment. Table 2 depicts the values of the different parameters.

Table 2 Parameters used in the experiment with image hashes only for generating check bits in a BF.

Without check bits, false positives occur about 73% of the time. While this represents an improvement over the previous results, errors persist; the main reason is that the number of images is still twice the size of the BF. However, with the inclusion of check bits, the false positive rate drops significantly to just 25%. This is a substantial reduction, both relative to the no-check-bits case and to the earlier caption-only experiment with check bits.

These results emphasize the importance of incorporating supplementary features like check bits to refine the accuracy of systems managing extensive datasets. They also underscore how even minor configuration adjustments can profoundly impact system performance. Such insights are pivotal for developing dependable computer systems for data retrieval and network security tasks.

Augmentation of the image hash with the image caption

In the last round of these experiments, we shifted attention from parameters such as the BF size and check bits to the positioning of captions in the hash construction. The captions were affixed in the following positions: only at the beginning of the hash, at both the beginning and end of the hash, and only at the end. Since this structure determines the arrangement of the hashed data, we varied the caption positions to test how they affected the false positive rates. The overall parameters were kept the same as in the previous experiment, with 1,000 images, a BF size of 500, and the use of 3 check bits.

Captions at the beginning of the hash: Initially, placing captions only at the start of the hash structure resulted in a high false positive rate of 82%. However, this rate significantly dropped to 31% with the addition of check bits, showing their effectiveness in reducing errors. This reduction suggests that using check bits improves the efficacy of the BF when captions are placed at the start of the hash.

At the beginning and end of the hash: When captions were placed at both the beginning and end, the false positive rate decreased to 74% without check bits and to 24% with their inclusion, showing a substantial improvement in accuracy.

At the end of the hash: On the other hand, placing captions only at the end resulted in a high false positive rate of 82% without check bits, which, however, decreased to 31% when check bits were used.
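The three placements amount to simple variations of the string fed to the hash function. The sketch below is illustrative only; the function and variable names are assumptions, not the paper's code.

```python
import hashlib

def hash_input(image_hash: str, caption: str, placement: str) -> str:
    """Build and hash the combined string for each caption placement tested."""
    if placement == "start":
        combined = caption + image_hash
    elif placement == "end":
        combined = image_hash + caption
    elif placement == "both":
        combined = caption + image_hash + caption
    else:
        raise ValueError(f"unknown placement: {placement}")
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()
```

Each placement yields a different digest for the same image and caption, which is why the resulting bit patterns, and hence the false positive behavior, differ across the three configurations.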

These findings emphasize how the placement of captions within the hash structure influences system accuracy, particularly when check bits are employed. The results suggest that positioning captions at both the beginning and end of the hash structure yields the most significant reduction in false positives, as illustrated in Fig. 4. Furthermore, this shows the importance of considering implementation details alongside fundamental parameters to optimize computational systems for data analysis and management.

Fig. 4

Comparison of false positive rates with and without check bits for different caption positions in the hash of an image BF.

Comparative evaluation

The comparative evaluation is based on the approximate Neural Bloom Filter (NBF) of34,35. For this evaluation, the proposed approach is termed the Check-bits-based Bloom Filter (CBF), whereas the Neural Bloom Filter is represented as the NBF. For the CBF, we selected the best-performing configuration, which places the caption in the middle of the hash. In this setting, we achieved the highest reduction of false positives, with a total FP rate of 23%. We thus use these settings in all comparative evaluations with the NBF. Table 3 depicts the settings for the evaluation of the CBF and NBF.

Table 3 Settings for the evaluation of the CBF and NBF.

False positive rate

In the first set of comparative experiments, we evaluate the total number of false positives (FPs) when the BF size is kept constant at a 500-bit array.

In the first comparative evaluation experiment, we compared two approaches for handling false positives: CBF and NBF. As depicted in Fig. 5, the results showed that the CBF had a lower FP rate of 23%, while the NBF had a considerably higher FP rate of 35%. The CBF combines the image and caption data, providing a richer and more detailed understanding of the content. Therefore, when the CBF checks whether an image is a match, it has more unique features to consider, which reduces the chances of mistakenly identifying something as a match. In the NBF, while captions provide some context, they do not capture as much detail as images; as a result, the system may struggle to distinguish between similar captions, leading to an increase in false positives. In addition, the simple check-bits mechanism of the CBF gives it a more tractable feature space to work with, which further helps in reducing false positives.

Fig. 5

Comparison of the false positive rate for NBF and CBF.

Fixed false positives

In the second set of comparative evaluation experiments, we fix the total FP rate and check how much memory space (array size) is required to achieve the desired rate. One approach to decreasing the FP rate in BF-based approaches is to keep increasing the memory size of the filter array. We take this as an evaluation measure and compare the NBF and CBF at fixed FP rates of 1% and 5%. The performance of the two algorithms is shown in Table 4 and Fig. 6.

Table 4 Performance of CBF and NBF with fixed FP rates.
Fig. 6

Performance of NBF and CBF with different fixed FP rates.

The results show that the CBF, which uses images and captions, requires considerably more memory: to keep the FP rate at 5%, it needs 300 KB of memory, and for 1%, it needs 500 KB. On the other hand, the NBF, which only employs captions, is much more memory-efficient, requiring just 500 B for a 5% FP rate and 1 KB for a 1% rate. This shows that while the CBF can reduce false positives more effectively by using both image and text data, it comes with a tradeoff of higher memory usage. The NBF, although more memory-efficient, does not capture the additional information from images and therefore offers less detailed results.
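For context, the classical sizing rule for a standard BF, m = −n ln p / (ln 2)², gives the number of bits needed to hold n items at a target FP rate p. The quick calculation below is purely illustrative of the memory-versus-accuracy tradeoff and is not one of the CBF/NBF measurements above.

```python
import math

def required_bits(n: int, p: float) -> int:
    """Classical BF sizing: bits needed for n items at target FP rate p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))

n = 1000  # number of items, as in the image-hash experiments
for p in (0.05, 0.01):
    bits = required_bits(n, p)
    print(f"p={p}: {bits} bits (~{bits // 8} bytes)")
```

Tightening the target from 5% to 1% raises the requirement from roughly 6.2 kilobits to roughly 9.6 kilobits for 1,000 items, illustrating why a fixed-FP comparison is a meaningful way to contrast the two filters.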

Computational tradeoffs

We also evaluate the computational costs of querying for the CBF and NBF. For this comparison, we only consider the captions of images for both the CBF and the NBF. We measure latency as the time to complete a single insertion/query of a row and a key string of a fixed length, and we report the average computational cost over 1,000 queries plus insertions. We found that the combined query and insert latency of the NBF amounts to 4.9 ms on the CPU, while for the CBF it is around 2.5 ms, as depicted in Fig. 7. Therefore, the CBF is approximately 50% faster in this evaluation. We attribute the NBF’s higher latency to the sequential application of the neural model, a common and standard issue with learned structures such as the Learned Index.

Fig. 7

Combined Query and Insertion Latency for NBF and CBF.
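A latency measurement of this kind can be reproduced with a simple harness. The sketch below times combined insert+query operations over fixed-length keys; it is an illustrative setup, not the exact benchmark used here, and absolute numbers depend on hardware.

```python
import hashlib
import time

def measure_latency(num_ops: int = 1000, key_length: int = 32) -> float:
    """Average combined insert+query latency (seconds) over num_ops keys."""
    bits = [0] * 500  # 500-bit filter, matching the experimental settings
    k = 3             # three hash functions

    def positions(key: str):
        for i in range(k):
            digest = hashlib.sha256(bytes([i]) + key.encode()).hexdigest()
            yield int(digest, 16) % len(bits)

    keys = [f"caption-{i}".ljust(key_length, "x") for i in range(num_ops)]
    start = time.perf_counter()
    for key in keys:
        for pos in positions(key):                    # insertion
            bits[pos] = 1
        _ = all(bits[pos] for pos in positions(key))  # membership query
    return (time.perf_counter() - start) / num_ops

print(f"{measure_latency() * 1000:.3f} ms per insert+query")
```

Averaging over many operations, as done here, smooths out timer resolution and scheduling noise, which matters when individual operations complete in microseconds.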

Conclusion and future works

This paper introduced a novel method for enhancing Bloom Filters (BFs) in image retrieval applications. We added small pieces of verification data, called check bits, which are created from both image captions and the images themselves. This approach significantly reduces the number of false positives, a well-known limitation of BFs, while maintaining a small filter size. Experiments on a large dataset showed that using these check bits substantially reduces false positive rates. We also found that the placement of caption data (e.g., at the start, middle, or end of the hash) affects performance, with the best results achieved when caption data is placed at both the beginning and the end of the hash. In a comparison with a modern Neural Bloom Filter (NBF), our method, called the Check-bits-based BF (CBF), achieved a lower false positive rate (23% vs. 35%) under the same memory constraints. Although the NBF requires less memory at stringent error rates, our CBF is approximately 50% faster, making it more suitable for applications where speed is a priority. In summary, this work demonstrates that incorporating both text and image content enhances the accuracy and efficiency of BFs. This improvement is useful for real-world tasks such as image search, database management, and network security.

Several interesting directions for future research exist. First, we plan to test this idea on other types of data, such as audio and video. Another avenue is the development of intelligent algorithms that can adaptively determine the number of check bits to use and their placement, based on the input data. We also aim to test the method on more diverse and challenging datasets to ensure it works well in different situations. Finally, employing more advanced feature extraction techniques, such as newer deep learning models, may further enhance performance. These refinements would enable significant cost savings while improving BF performance across various applications and supporting the increased diversity and complexity of data.