A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It can produce false positives but never false negatives, making it ideal for quickly ruling out non-membership before performing expensive lookups.
A Bloom filter is a bit array of m bits, all initialized to 0, combined with k independent hash functions. When you insert an element, each of the k hash functions maps it to a position in the array and sets that bit to 1. To query membership, the same k hash functions are applied and all resulting bit positions are checked.
Bloom filters use far less memory than storing actual elements — often just a few bits per item regardless of element size. They are widely used in databases (e.g., Cassandra, RocksDB), web caches, network routers, and spell checkers to avoid slow disk reads or network calls for data that definitely does not exist. The trade-off is a controllable false positive rate, not accuracy guarantees.
If any checked bit is 0, the element is definitively NOT in the set — this is the guaranteed no-false-negatives property. If all bits are 1, the element is probably in the set, but those bits may have been set by other elements, producing a false positive. The probability of a false positive increases as more elements are inserted and more bits become 1.
The false positive rate p is controlled by three parameters: the bit array size m, the number of hash functions k, and the number of inserted elements n. The optimal number of hash functions is k = (m/n) × ln(2), and the resulting false positive rate approximates (1 − e^(−kn/m))^k. In practice, targeting a 1% false positive rate requires roughly 9.6 bits per element.
Standard Bloom filters do not support deletion — setting a bit back to 0 could invalidate other elements that share that bit. Counting Bloom filters address this by storing small counters instead of single bits, at the cost of more memory. Additionally, the filter must be sized for the expected number of insertions upfront; inserting far more elements than planned will degrade accuracy significantly.
Always pre-calculate your expected dataset size and acceptable false positive rate before allocating the filter. Use well-distributed hash functions such as MurmurHash3 or xxHash rather than cryptographic hashes, which are slower and offer no benefit here. In distributed systems, serialize and share the bit array so multiple nodes can query the same filter without redundant storage.
© RM Full Stack & AI Engineer · All guides · Roadmaps · Open the app