Bloom Filters Made Easy: Python Code & Explanation

Leapcell: The Best Serverless Platform for Python Web Hosting Bloom Filter: Principles, Usage, Advantages, Disadvantages and Python Implementation I. Usage and Application Scenarios The Bloom Filter is a highly space - efficient probabilistic data structure used to determine whether an element belongs to a set. It has wide applications in many fields: Spelling Check in Word - processing Software: In word - processing software, it can quickly check if an English word is spelled correctly. For example, when a user enters a word, the Bloom Filter can rapidly determine whether the word is likely to be in the set of correct words. If not, it prompts a spelling error. FBI Suspect List Query: In institutions like the FBI, it can be used to quickly determine whether a suspect's name is already on the suspect list. When new suspect information is available, it can be screened initially and rapidly, improving efficiency. Web Crawler URL Access Judgment: In web crawlers, it can efficiently determine whether a URL has been visited. This avoids repeated access to the same URL and saves resources. Email Spam Filtering: Email spam - filtering functions of email services like Yahoo and Gmail can use the Bloom Filter to determine whether an email is spam. First, some features are used to judge whether the email is likely to be spam. If it is, further processing is carried out. Preventing Cache Penetration: In the cache system, the Bloom Filter can prevent the cache - penetration problem. When a large number of requests simultaneously access data that does not exist in the cache, it may cause excessive pressure on the database. The Bloom Filter can first determine whether the data is likely to exist. If not, it returns directly, avoiding invalid database queries. II. Advantages and Disadvantages of the Algorithm (I) Advantages Small Data Space: The Bloom Filter does not need to store the data itself. Instead, it uses bit arrays and hash functions to mark the existence of data, greatly saving storage space. Compared with the traditional way of storing all elements, it has obvious advantages when storing a large amount of data. (II) Disadvantages Existence of Misjudgment: A failed match can determine that the element is "definitely not in the set", but a successful match does not guarantee that the value is definitely in the set. There is a certain false - positive probability. This is because different elements may occupy the same bit positions in the bit array after being mapped by the hash function. Elements Cannot Be Deleted: Once an element is added to the set, it cannot be deleted. Deleting a certain element may affect the judgment results of other elements, unless the Bloom Filter is rebuilt. False - positive Rate Changes with Capacity: When the set is almost full, that is, when it approaches the estimated maximum capacity, the false - positive probability will increase. This is because the number of bits set to 1 in the bit array increases, making misjudgment more likely. Amplified Space Occupancy: Generally, for a 1% false - positive probability, each element requires less than 10 bits, and this is independent of the size or number of elements in the set. Although it is relatively space - saving, the overall space occupied will still be large for large - scale data. Slower Query Process: Due to the use of multiple hash functions, each matching process needs to check multiple bits (the number of hashes) to confirm the existence, resulting in a slower query process. III. Principle Introduction (I) Algorithm Principle The Bloom Filter is an algorithm for judging whether a set contains a specific element. It does not need to store the data itself but is implemented through a very large bit array and several hash functions. Suppose the length of the bit array is (m) and the number of hash functions is (k). First, initialize the bit array by setting each bit to 0. When an element is added to the set, the input string is mapped to the sub - script of the bit array through the hash function, and then the value corresponding to the sub - script is set to 1. For example, for a string, after calculation by (k) hash functions, (k) sub - script positions will be obtained, and the bits at these (k) positions are all set to 1. When the same string is calculated for the second time, it will also be mapped to the same sub - script positions by the hash function. When querying whether an element exists, the element is mapped to (k) points on the bit array through the same (k) hash functions. If one of these (k) points is not 1, it can be judged that the element definitely does not exist in the set; conversely, if all (k) points are 1, the element may exist in the set. Note that it cannot be judged that the element must exist in the set here, and there is a certain false - positive rate. For example, assume that there are 3 element

Feb 8, 2025 - 11:39

Bloom Filters Made Easy: Python Code & Explanation

Leapcell: The Best Serverless Platform for Python Web Hosting

Bloom Filter: Principles, Usage, Advantages, Disadvantages and Python Implementation

I. Usage and Application Scenarios

The Bloom Filter is a highly space - efficient probabilistic data structure used to determine whether an element belongs to a set. It has wide applications in many fields:

Spelling Check in Word - processing Software: In word - processing software, it can quickly check if an English word is spelled correctly. For example, when a user enters a word, the Bloom Filter can rapidly determine whether the word is likely to be in the set of correct words. If not, it prompts a spelling error.
FBI Suspect List Query: In institutions like the FBI, it can be used to quickly determine whether a suspect's name is already on the suspect list. When new suspect information is available, it can be screened initially and rapidly, improving efficiency.
Web Crawler URL Access Judgment: In web crawlers, it can efficiently determine whether a URL has been visited. This avoids repeated access to the same URL and saves resources.
Email Spam Filtering: Email spam - filtering functions of email services like Yahoo and Gmail can use the Bloom Filter to determine whether an email is spam. First, some features are used to judge whether the email is likely to be spam. If it is, further processing is carried out.
Preventing Cache Penetration: In the cache system, the Bloom Filter can prevent the cache - penetration problem. When a large number of requests simultaneously access data that does not exist in the cache, it may cause excessive pressure on the database. The Bloom Filter can first determine whether the data is likely to exist. If not, it returns directly, avoiding invalid database queries.

II. Advantages and Disadvantages of the Algorithm

(I) Advantages

Small Data Space: The Bloom Filter does not need to store the data itself. Instead, it uses bit arrays and hash functions to mark the existence of data, greatly saving storage space. Compared with the traditional way of storing all elements, it has obvious advantages when storing a large amount of data.

(II) Disadvantages

Existence of Misjudgment: A failed match can determine that the element is "definitely not in the set", but a successful match does not guarantee that the value is definitely in the set. There is a certain false - positive probability. This is because different elements may occupy the same bit positions in the bit array after being mapped by the hash function.
Elements Cannot Be Deleted: Once an element is added to the set, it cannot be deleted. Deleting a certain element may affect the judgment results of other elements, unless the Bloom Filter is rebuilt.
False - positive Rate Changes with Capacity: When the set is almost full, that is, when it approaches the estimated maximum capacity, the false - positive probability will increase. This is because the number of bits set to 1 in the bit array increases, making misjudgment more likely.
Amplified Space Occupancy: Generally, for a 1% false - positive probability, each element requires less than 10 bits, and this is independent of the size or number of elements in the set. Although it is relatively space - saving, the overall space occupied will still be large for large - scale data.
Slower Query Process: Due to the use of multiple hash functions, each matching process needs to check multiple bits (the number of hashes) to confirm the existence, resulting in a slower query process.

III. Principle Introduction

(I) Algorithm Principle

The Bloom Filter is an algorithm for judging whether a set contains a specific element. It does not need to store the data itself but is implemented through a very large bit array and several hash functions.

Suppose the length of the bit array is (m) and the number of hash functions is (k). First, initialize the bit array by setting each bit to 0.

When an element is added to the set, the input string is mapped to the sub - script of the bit array through the hash function, and then the value corresponding to the sub - script is set to 1. For example, for a string, after calculation by (k) hash functions, (k) sub - script positions will be obtained, and the bits at these (k) positions are all set to 1. When the same string is calculated for the second time, it will also be mapped to the same sub - script positions by the hash function.

When querying whether an element exists, the element is mapped to (k) points on the bit array through the same (k) hash functions. If one of these (k) points is not 1, it can be judged that the element definitely does not exist in the set; conversely, if all (k) points are 1, the element may exist in the set. Note that it cannot be judged that the element must exist in the set here, and there is a certain false - positive rate.

For example, assume that there are 3 elements ({x, y, z}) in the set, and the number of hash functions is 3. For each element in the set, the element is successively mapped through 3 hash functions. Each mapping will generate a hash value, which corresponds to a point on the bit array, and then the corresponding position of the bit array is marked as 1. When querying whether the element (W) exists in the set, (W) is mapped to 3 points on the bit array through the same method. If one of the 3 points is not 1, it can be judged that the element definitely does not exist in the set. Conversely, if all 3 points are 1, the element may exist in the set. As can be seen from the figure: assume that a certain element is mapped to the sub - scripts 4, 5, and 6 through mapping. Although these 3 points are all 1, it is obvious that these 3 points are positions obtained by different elements through hashing. Therefore, this situation shows that although the element is not in the set, the corresponding bits may all be 1, which is the reason for the existence of the false - positive rate.

(II) Simple Python Implementation

The following is only a Python code implementation example of the BloomFilter. The real BloomFilter implementation needs to determine the number of hash functions according to the length of the initialized bit array and also has a complex hashing process.

#_*_coding:utf_8_
import BitVector
import os
import sys


class SimpleHash():
    def __init__(self, cap, seed):
        self.cap = cap
        self.seed = seed

    def hash(self, value):
        ret = 0
        for i in range(len(value)):
            # weighted sum
            ret += self.seed * ret + ord(value[i])
        # bit operation to ensure the final value is between 0 and self.cap
        return (self.cap - 1) & ret


class BloomFilter():
    def __init__(self, BIT_SIZE=1 << 25):
        self.BIT_SIZE = 1 << 25
        self.seeds = [5, 7, 11, 13, 31, 37, 61]
        self.bitset = BitVector.BitVector(size=self.BIT_SIZE)
        self.hashFunc = []

        for i in range(len(self.seeds)):
            self.hashFunc.append(SimpleHash(self.BIT_SIZE, self.seeds[i]))
        print(self.hashFunc)

    def insert(self, value):
        for f in self.hashFunc:
            loc = f.hash(value)
            self.bitset[loc] = 1
        print(self.bitset)

    def is_contaions(self, value):
        if value == None:
            return False
        ret = True
        for f in self.hashFunc:
            loc = f.hash(value)
            ret = ret & self.bitset[loc]
        return ret

In the above code, the SimpleHash class implements a simple hash function, and the BloomFilter class initializes the bit array and multiple hash functions to achieve the functions of element insertion and existence judgment. The insert method is used to insert elements into the Bloom Filter, and the is_contaions method is used to judge whether an element exists in the Bloom Filter.