Bloom Filters, Probabilistic

Bloom Filters, Probabilistic

Today I implemented a Bloom Filter and I felt quite good with the results, which is strange compared to the experience that I have had in the past learning other algorithms. Thanks a lot to Robert Lefkowitz and Stanley Zheng for their help during the implementation of the code, their input helped me to not only understand how non-cryptographic hash functions work but also how doing pair programming makes you more productive.

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate. In other words, a query returns either “possibly in set” or “definitely not in set”. Elements can be added to the set, but not removed (though this can be addressed with a “counting” filter). The more elements that are added to the set, the larger the probability of false positives. — Bloom Filter, Wikipedia

The basic idea is that you have a vector with m size initialized with either Zeroes or False values, k hash functions, and two methods void Add(s <T>) and bool Test(s <T>). For every entry that you want to introduce into the vector the Add() method will hash the input one time per each hash function, then will set the value of that index in the vector to One or True.

Once you have injected all the data you can start testing for existing elements, you do this by calling Test() which executes the same procedure described in the Add() method, instead of adding the input to the vector the algorithm will check if the result of each hash function is equal to False in the vector, if they are all False then you can assume, for sure, that the element does not exists in the list, otherwise you will consider it as a “Maybe”.

— Read What are Bloom filters? by Jamie Talbot.

There are two important things that you have to remember when the Bloom Filter is initialized, one is to choose a good size m for the vector, @r0ml suggests to use prime numbers but for the sake of simplicity we will do it with only 1024. The second important aspect of the implementation of this algorithm is k which is the hash functions that will fill the vector.

type BloomFilter interface {
    Add(s string)
    Test(s string) bool
    Size() int
}

type BF struct {
    m int /* vector size */
    k int /* hash functions */
    v []bool /* boolean vector */
}

Initially I thought to generate one of the hashes by simply returning the length of the input. The second hash would return the sum of all the characters as their ASCII representation and the multiplication of the indexes of all the vowels. And the third hash function would XOR all the bits starting with the length of the input multiplied by the ASCII representation of the first character.

func (b *BF) Hash1(s string) uint32 {
    return uint32(len(s) % b.m)
}

func (b *BF) Hash2(s string) uint32 {
    var total int
    for idx, char := range s {
        total += int(char - '0')
        if char == 'a' ||
            char == 'e' ||
            char == 'i' ||
            char == 'o' ||
            char == 'u' {
            total *= idx
        }
    }
    return uint32(total % b.m)
}

func (b *BF) Hash3(s string) uint32 {
    total := len(s) * int(s[0]-'0')
    for i := 0; i < len(s)-2; i++ {
        total *= int(s[i]) ^ int(s[i+1])
    }
    return uint32(total % b.m)
}

Consider this input:

abcdefghi
something
foobarfoo

The hash functions implemented above will return the following results:

INPUT      Hash1  Hash2  Hash3
abcdefghi  9      9      9
something  176    475    928
foobarfoo  1001   256    0

Now the Test() method will take an input and hash it the same was as the data inserted into the vector, if the indexes in the vector are set to One or True in at least one of the hash functions then we can return a “Input maybe exists” otherwise we will return “Input definitely does not exists”.

func (b *BF) Test(s string) bool {
    if !b.v[b.Hash1(s)] &&
        !b.v[b.Hash2(s)] &&
        !b.v[b.Hash3(s)] {
        return false
    }

    return true
}

func (b *BF) Debug(s string) {
    if !b.Test(s) {
        fmt.Println("definitely does not exists")
    } else {
        fmt.Println("maybe exists")
    }
}

Of course, the hash functions implemented above will generate a lot of false positives because they are too simple for our use case. The size of the vector is also important as I explained before. Here is a working implementation of a Bloom Filter using FNV, CRC32 and Murmur that feeds the vector with a dictionary of random words provided by most Unix systems at /usr/share/dict/words.

Bloom Filter — Written in Go

Do you have a project idea? Let's make it together!