Bloom Filters, Probabilistic

Today I implemented a Bloom Filter and I felt quite good with the results, which is strange compared to the experience that I have had in the past learning other algorithms. Thanks a lot to Robert Lefkowitz and Stanley Zheng for their help during the implementation of the code, their input helped me to not only understand how non-cryptographic hash functions work but also how doing pair programming makes you more productive.

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate. In other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the larger the probability of false positives. — Bloom Filter, Wikipedia

The basic idea is that you have a vector with m size initialized with either Zeroes or False values, k hash functions, and two methods void Add(s <T>) and bool Test(s <T>). For every entry that you want to introduce into the vector the Add() method will hash the input one time per each hash function, then will set the value of that index in the vector to One or True.

Once you have injected all the data you can start testing for existing elements, you do this by calling Test() which executes the same procedure described in the Add() method, instead of adding the input to the vector the algorithm will check if the result of each hash function is equal to False in the vector, if they are all False then you can assume, for sure, that the element does not exists in the list, otherwise you will consider it as a "Maybe".

— Read What are Bloom filters? by Jamie Talbot.

There are two important things that you have to remember when the Bloom Filter is initialized, one is to choose a good size m for the vector, @r0ml suggests to use prime numbers but for the sake of simplicity we will do it with only 1024. The second important aspect of the implementation of this algorithm is k which is the hash functions that will fill the vector.

type BloomFilter interface {
    Add(s string)
    Test(s string) bool
    Size() int
}

type BF struct {
    m int /* vector size */
    k int /* hash functions */
    v []bool /* boolean vector */
}

Initially I thought to generate one of the hashes by simply returning the length of the input. The second hash would return the sum of all the characters as their ASCII representation and the multiplication of the indexes of all the vowels. And the third hash function would XOR all the bits starting with the length of the input multiplied by the ASCII representation of the first character.

func (b *BF) Hash1(s string) uint32 {
    return uint32(len(s) % b.m)
}

func (b *BF) Hash2(s string) uint32 {
    var total int
    for idx, char := range s {
        total += int(char - '0')
        if char == 'a' ||
            char == 'e' ||
            char == 'i' ||
            char == 'o' ||
            char == 'u' {
            total *= idx
        }
    }
    return uint32(total % b.m)
}

func (b *BF) Hash3(s string) uint32 {
    total := len(s) * int(s[0]-'0')
    for i := 0; i < len(s)-2; i++ {
        total *= int(s[i]) ^ int(s[i+1])
    }
    return uint32(total % b.m)
}

Consider this input:

# Input
abcdefghi
something
foobarfoo

The hash functions implemented above will return the following results:

INPUT      Hash1  Hash2  Hash3
abcdefghi  9      9      9
something  176    475    928
foobarfoo  1001   256    0

Now the Test() method will take an input and hash it the same was as the data inserted into the vector, if the indexes in the vector are set to One or True in at least one of the hash functions then we can return a "Input maybe exists" otherwise we will return "Input definitely does not exists".

func (b *BF) Test(s string) bool {
    if !b.v[b.Hash1(s)] &&
        !b.v[b.Hash2(s)] &&
        !b.v[b.Hash3(s)] {
        return false
    }

    return true
}

func (b *BF) Debug(s string) {
    if !b.Test(s) {
        fmt.Println("definitely does not exists")
    } else {
        fmt.Println("maybe exists")
    }
}

Of course, the hash functions implemented above will generate a lot of false positives because they are too simple for our use case. The size of the vector is also important as I explained before. Here is a working implementation of a Bloom Filter using FNV, CRC32 and Murmur that feeds the vector with a dictionary of random words provided by most Unix systems at /usr/share/dict/words.

Bloom Filter — Written in Go

3 months ago
  • 8da57acFix minor bugs found by the code static ana…
3 months ago
  • caa417fAdd option to configure the malware scanner…
3 months ago
  • 9c86744Modify default value for some of the alert …
3 months ago
  • 84dd39dAdd option to stop sending the failed login…
3 months ago
  • eb05935Add pre-checks for every plugin page for si…
3 months ago
  • d21a062Modify mechanism to ignore files from integ…
3 months ago
  • b1a9169Add developer option to disable failed pass…
3 months ago
  • 4e3ef13Add support for other English and Spanish b…
3 months ago
  • 2d07b4eFix error interception for Firewall API err…
3 months ago
pushed to master at cixtor/slackapi
  • 4a2c1c8Modify data type for methods related to cha…
  • 6716199Add CLI handler for the users.identity API …
  • f9c448dAdd CLI handler for the mpim.open API endpo…
  • 305d1c4Add CLI handler for the mpim.mark API endpo…
  • 8bb89afAdd CLI handler for the mpim.close API endp…
  • 202a017Add CLI handler for the dnd.teamInfo API en…
  • 68819e9Add CLI handler for the dnd.info API endpoi…
  • 9a2b29aAdd CLI handler for the dnd.endSnooze API e…
  • e7dc86aAdd CLI handler for the dnd.setSnooze API e…
  • 111c53fAdd CLI handler for the dnd.endDnd API endp…
  • cdb620dFix token usage when there are no extra par…
  • View comparison for these 11 commits
3 months ago
pushed to master at cixtor/slackapi
3 months ago
  • 6497e80Remove unnecessary automatic blacklisting o…
3 months ago
opened pull request Sucuri/sucuri-wordpress-plugin#40
Fix multiple bugs with the API calls and queue system
18 commit with 793 additions and 293 deletion
3 months ago
3 months ago
3 months ago
  • 38cc02aModify timing for the dashboard alerts afte…
3 months ago
  • 350c074Fix infinite loop with email alerts and SMT…
3 months ago
  • acff4aaFix detection of base URL with built-in fun…
3 months ago
3 months ago
3 months ago
3 months ago
3 months ago
opened pull request Sucuri/sucuri-wordpress-plugin#39
Add queue system for the security logs and cache improvement
5 commit with 517 additions and 541 deletion
3 months ago
3 months ago
  • 4c51445Fix static function call of non-static Site…
3 months ago
3 months ago
  • af47581Add changelog to release version 1.8.6
3 months ago
opened pull request Sucuri/sucuri-wordpress-plugin#38
Add changelog to release version 1.8.5
6 commit with 4062 additions and 1841 deletion
3 months ago
  • dc1a05aAdd changelog to release version 1.8.5
3 months ago
Do you have a project idea? Let's make it together!