Byzantine Reality

Searching for Byzantine failures in the world around us

Bloom Filters

I’ve been getting more into my programming roots lately, and after I read this CodeKata, I decided to give it a try. The idea is to write a Bloom Filter, fiddle with the parameters that make it what it is, and test its performance. That being said, let’s look at it.

The way I understood the Bloom Filter is as follows: get a large list (here, the UNIX dictionary, located at /usr/share/dict/words for Mac OS X users), perform some kind of mapping function, keep some subset of the result, and store that in an array. It doesn’t matter if multiple words in the dictionary map to the same result, but you will see a small percentage of words NOT in the dictionary match up and appear to be in the dictionary (a false positive). This summary is pretty much the same as what is said in the previous link, so let’s get to something original here.

The Perl script below takes in each word from the dictionary file, gets the MD5 hex value associated with it, and keeps the first 6 characters of it. The “6″ can be changed to whatever you like, but lowering it will increase the number of false positives and increasing it will reduce the savings in file size you’re getting from using the Bloom Filter in the first place.

We then (as the Kata post above recommends) generate random 5 character words and see if they match up in our Bloom Filtered Array. If so, we make sure that we didn’t accidentally generate a real word. My Perl script is set to do this 10000 times, taking on average 10 minutes.

I ran this script 3 times, getting 159 false positives the first time, 145 the second, and 141 the third time. This averages to 147 false positives when checking 10000 random words, yielding a 1.47% error rate. That’s pretty good, if I don’t say so myself. The savings in file size is not too shabby as well. The dictionary file is 2.49MB, while the Bloom Filtered Array is 1.64 MB (a 34% reduction in size). This may not look too great, but on bigger data sets, a 1.47% error rate may be justifiable in exchange for cutting your data set’s size down by a third. Below is my code if you wish to try it out (requires a *NIX of some sort):

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex md5_base64);

# Chris Bunch
# Bloom Filter implemented along these guidelines:
# http://codekata.pragprog.com/2007/01/kata_five_bloom.html
# Creates an array and for each word in /usr/dict/words
# MD5′s the word and places the first n bits of it in the array

# Do not create array if -nocreate is seen

# Create array, size = wc -l /usr/share/dict/words

my $arraySize = `wc -l /usr/share/dict/words`;
my @wordList = `cat /usr/share/dict/words`;
my @bloomFilter;

# Length of words in the Bloom Filter

my $md5Length = 6;

# For each line of /usr/share/dict/words

foreach my $thisWord (@wordList) {

# MD5 this word, save the first n bits

chomp ($thisWord);
my $hash = md5_hex($thisWord);
my $substring = substr($hash, 0, $md5Length);

# Place it into the array at this place

push(@bloomFilter, $substring);
}

my $i = 0;
my @letters = qw(a b c d e f g h i j k l m n o p q r s t u v w x y z);

while ($i < 10000) {
my $searchWord = $letters[int(rand(@letters))] . $letters[int(rand(@letters))] . $letters[int(rand(@letters))] . $letters[int(rand(@letters))] . $letters[int(rand(@letters))];

# map the word to the first n bits of its MD5

my $hashedWord = md5_hex($searchWord);
my $substring = substr($hashedWord, 0, $md5Length);

# look it up in the array

my $found = “false”;
foreach my $thisWord (@bloomFilter) {

if ($substring eq $thisWord) {
$found = “true”;
}
}

# if present, print present, otherwise, print not present

if ($found eq “true”) {
my @grepResults = `grep $searchWord /usr/share/dict/words`;
foreach my $result (@grepResults) {
chomp ($result);
if ($result eq $searchWord) {
$found = “false”;
}
}

if ($found eq “true”) {
print “\n$searchWord was found”;
}

}
else {

}

$i++;
}

As always, if you see something miserably wrong with the code or have any comments, feel free to drop me a line.