On Bloom Filters
I’ve got to admit that I don’t come from a hard core CS background (and I’m sure it shows in some of my articles on more complicated topics :P ), but I know a little bit about data structures and algorithms. While I’m no expert on the topic of data structures, I was surprised that – up until now – I had never even heard of Bloom filters.
What are Bloom filters?
Let’s say you have a set of millions of records in a database. Running a query over such a database can be quite time consuming. It would be great if we a cheap way to determine if a given record is likely to be within the database without necessarily performing a query. Bloom filters to the rescue!
Rather than executing a complex database query, we can ask our Bloom filter whether or not a certain record exists. It will then respond with either a definitive “no” (i.e. the record is definitely not in the database) or a “maybe”. In the event our Bloom filter says “maybe”, we must (unfortunately) query the database to determine whether or not the record really exists.
While this may sound like double the workload for our poor application, querying the Bloom filter is likely to be very fast. The goal of the Bloom filter here is to reduce the number of times we actually hit the database, a way to determine the probability a given input exists in our data set. Bloom filters are also small in terms of memory usage, requiring nothing more than a simple bitfield.
A quick overview
Bloom filters primarily consist of a bitfield of size m and one or more (call it k) hash functions. The classic Bloom filter provides two main operations:
- add, which is used to add new records to the Bloom filter; and
- query, which is used to determine whether a record is likely to be found in the data set.
Starting out with a zero-value bitfield, each time we call add our hash functions are each called on the input record in turn to produce a set of k indices (one for each hash function). For each one of these indices, we set the corresponding bit in the bitfield.
To query this filter, we again generate k indices using our hash functions on the input record. Rather than setting the bits in the bitfield, we instead check each bit. If any one of the bits is not set, then there is zero chance our input record exists in the data set (a definitive “no”). Otherwise, if all k bits are set, we have a “maybe”.
Problems with the classic implementation
There’s one glaring problem with the classic Bloom filter: there are no deletions! If records in your source data set can be deleted, your Bloom filter may need to be regenerated (depending on your accuracy needs). There exists a variation on the classic filter which uses an array of counters instead of a bitfield. The add operation becomes a counter increment. This allows for a delete operation which, predictably, becomes a decrement of the counter. This variant trades memory for flexibility.
A reference implementation
I’m uploading a C reference implementation of a Bloom filter with 1 (super naive) hash function and a 32-bit-long bitfield. Hopefully there is somebody out there who finds it informative.
Download my crummy (but hopefully instructive!) reference implementation here.
WARNING I wrote this implementation from a quick read of the Wikipedia page, so if there are any CS geeks out there who can see I’m clearly doing something wrong, please tell me how wrong I am :)