An algorithm invented by Burton Bloom for querying whether an object is in a set when the full non-sparse index is too big to be quickly accessible (for example, it doesn't fit into working memory and has to be loaded from disk in parts).

You need a set of hashing functions Hn, and a vector of bits M[n]. The hashing function should return a number smaller than the size of the bit vector (i.e. it can be used as an index into it).

To add an object to the set, you generate Hn(Object) for all n, and for each outcome set the bit at that index to 1.

To query whether an object is in the set, you generate Hn(Object) for all n, and for each outcome check the bit at that index. If any of these are 0, the object is not in the set.

This is most used for caches, as it doesn't generate any false negatives (if you get a negative from the algorithm you can safely say you don't have the data) while the false positives it gives (which, with the appropriate number of hashing functions and appropriate size of M[n] will be quite small) will be detected when trying to retrieve the actual content.

Note that you don't usally need to actually hash the data (which is quite computationally expensive) for every hashing function, you can generate a much larger hash and split it into chunks, or generate just one hash and use it as a seed for for instance a linear congruential PRNG.

Log in or register to write something here or to contact authors.