Edward Capriolo

Friday Jan 02, 2015

Bloom Filters - Building a Column Family store - Part 7

In the last blog in this series we built a Commitlog for our quickly evolving NoSQL/Column Family store nibiru. I decided this morning that I had been dodging Bloom Filters for a while and it was time to get to it.

You may ask what Bloom Filters are and why a Column Family store might need or want them. If you read my post on SSTables and Compaction, you can see that we are implementing a Log Structured Merge system. Namely, we are not doing updates in place; instead we periodically merge files. If you imagine data in these files over time, you can picture a scenario where there are 30 or more SSTables on disk. Now, imagine that only 1 of those 30 files has the rowkey being searched for. To find that rowkey we would have to search all 30 of the SSTables.
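To make that fan-out concrete, here is a minimal sketch of a read with no filtering at all. The names here are hypothetical stand-ins, not nibiru's actual read path:

import java.io.IOException;
import java.util.List;

public class NaiveReadPath {
  // Hypothetical sketch: every sstable pays the full cost of an index
  // lookup and disk seek, even though at most one of them holds the rowkey.
  public static Val naiveGet(List<SsTable> tables, Token rowkey, String column)
      throws IOException {
    for (SsTable table : tables) {
      Val v = table.get(rowkey, column);
      if (v != null) {
        return v;
      }
    }
    return null; // searched all 30 tables just to find nothing
  }
}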

A Bloom Filter (described here in detail) could potentially help us avoid searching tables. This is because a Bloom Filter can give one of two answers: 'Absolutely No' or 'Maybe'. (Answers I used to get when I was asking someone out on a date!)

'Absolutely No' in this case means we do not have to bother searching, because we are sure the rowkey is not in this table.

'Maybe' means we do have to search: maybe the rowkey is in the table and maybe it is not. (We would have searched it anyway without the Bloom Filter, so no big loss.)
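To make those two answers concrete, here is a tiny self-contained demo using Guava's BloomFilter directly (the same library we lean on below). A false from mightContain is a guaranteed no; a true is only a maybe:

import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomFilterDemo {
  public static void main(String[] args) {
    // sized for 1000 insertions at a 1% false positive rate
    BloomFilter<CharSequence> filter = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), 1000, 0.01);
    filter.put("rowkey1");

    // false is definitive: "rowkey2" was never added
    System.out.println(filter.mightContain("rowkey2"));
    // true only means maybe: present, or a false positive
    System.out.println(filter.mightContain("rowkey1"));
  }
}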

Because of these properties, Bloom Filters make a very nice option for a negative cache. (For those interested, the pull request with the majority of the feature is here. I did make other subtle changes along the way.) So let's have at it and implement this bad boy.

Writing and Persisting Bloom Filters

Remember that our only write path is Memtable, then flushing, then SSTables. Thus we can simply hook into the SSTable writing process and attach a Bloom Filter writer.

public class SsTableStreamWriter {

  ...

  private BloomFilterWriter bloomFilter;

  public SsTableStreamWriter(String id, ColumnFamily columnFamily){
    ...
    indexWriter = new IndexWriter(id, columnFamily.getKeyspace().getConfiguration());
    bloomFilter = new BloomFilterWriter(id, columnFamily.getKeyspace().getConfiguration());
  }

  public void write(Token t, Map<String,Val> columns) throws IOException {
    long startOfRecord = ssOutputStream.getWrittenOffset();
    // record every token as it is written so the filter matches the table
    bloomFilter.put(t);
    ...
  }

  public void close() throws IOException {
    indexWriter.close();
    ssOutputStream.close();
    // persist the filter alongside the other sstable artifacts
    bloomFilter.writeAndClose();
  }
}

I did not go out and write a Bloom Filter from scratch; I decided to use Guava. But I did wrap it in a facade to make it look like I did something.

package io.teknek.nibiru.engine;

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnel;
import com.google.common.hash.PrimitiveSink;

public class BloomFilterWriter {

  private final BloomFilter<Token> bloomFilter;
  private static final int expectedInsertions = 50000; //Be smarter here
  private final String id;
  private final Configuration configuration;

  public BloomFilterWriter(String id, Configuration configuration){
    this.id = id;
    this.configuration = configuration;
    bloomFilter = BloomFilter.create(TokenFunnel.INSTANCE, expectedInsertions);
  }

  /** Tells Guava how to feed a Token's bytes into the filter's hash functions. */
  public enum TokenFunnel implements Funnel<Token> {
    INSTANCE;
    public void funnel(Token token, PrimitiveSink into) {
      into.putUnencodedChars(token.getToken()).putUnencodedChars(token.getRowkey());
    }
  }

  public void put(Token t){
    bloomFilter.put(t);//see I did something
  }

  public static File getFileForId(String id, Configuration configuration){
    return new File(configuration.getSstableDirectory(), id + ".bf");
  }

  public void writeAndClose() throws IOException {
    BufferedOutputStream bo = new BufferedOutputStream(
        new FileOutputStream(getFileForId(id, configuration)));
    bloomFilter.writeTo(bo);
    bo.close();
  }
}

You do have to size the filter correctly, otherwise it is not efficient: undersize it and the false positive rate climbs, oversize it and you waste memory. A better thing to do here would be to estimate the count from the data actually being flushed or compacted. I just cheated and hardcoded it at 50000 for the time being.
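As a sketch of what smarter sizing could look like (the rowsBeingFlushed parameter is hypothetical; nibiru does not expose it this way), Guava's three-argument create also lets you choose the target false positive rate:

import com.google.common.hash.BloomFilter;

public class BloomFilterSizing {
  // Hypothetical sketch: size the filter from the number of rows actually
  // being flushed or compacted, rather than a hardcoded 50000. The 1%
  // false positive rate trades a little memory for fewer wasted searches.
  public static BloomFilter<Token> createFor(int rowsBeingFlushed) {
    return BloomFilter.create(
        BloomFilterWriter.TokenFunnel.INSTANCE, rowsBeingFlushed, 0.01);
  }
}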

Reading Bloom Filters

When we open up an SSTable we also open the associated Bloom Filter.

public class SsTable implements Comparable<SsTable>{

  private KeyCache keyCache;
  private ColumnFamily columnFamily;
  private SsTableReader ssTableReader;
  private BloomFilter bloomFilterReader;

  public SsTable(ColumnFamily columnFamily){
    this.columnFamily = columnFamily;
  }

  public void open(String id, Configuration conf) throws IOException {
    bloomFilterReader = new BloomFilter();
    bloomFilterReader.open(id, conf);

    keyCache = new KeyCache(columnFamily.getColumnFamilyMetadata().getKeyCachePerSsTable());
    ssTableReader = new SsTableReader(this, keyCache, bloomFilterReader);
    ssTableReader.open(id);
  }
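Note that the BloomFilter here is nibiru's own reader wrapper, not Guava's class, and this post does not show it. A minimal sketch of what it might look like, assuming Guava's BloomFilter.readFrom (the counterpart of the writeTo call used earlier) and reusing the writer's funnel and file naming; see the pull request for the real code:

package io.teknek.nibiru.engine;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class BloomFilter {

  // fully qualified to avoid clashing with this class's own name
  private com.google.common.hash.BloomFilter<Token> bloomFilter;

  public void open(String id, Configuration conf) throws IOException {
    // read back the filter that BloomFilterWriter.writeAndClose persisted,
    // using the same funnel so the hashes line up
    BufferedInputStream in = new BufferedInputStream(
        new FileInputStream(BloomFilterWriter.getFileForId(id, conf)));
    try {
      bloomFilter = com.google.common.hash.BloomFilter.readFrom(
          in, BloomFilterWriter.TokenFunnel.INSTANCE);
    } finally {
      in.close();
    }
  }

  public boolean mightContain(Token t){
    return bloomFilter.mightContain(t);
  }
}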

Now here is the payday! Any operation that does a read can ask the Bloom Filter for help, and skip the entire rest of the read path if the filter says no!

  public Val get(Token searchToken, String column) throws IOException {
    boolean mightContain = bloomFilter.mightContain(searchToken);
    if (!mightContain) {
      // definite miss: skip the key cache, index, and disk entirely
      return null;
    }

    BufferGroup bgIndex = new BufferGroup();
    ...

Performance testing this is simple. First, we can just write a test that searches for a rowkey not in the SSTable. This was a run I did without the Bloom Filter code.

  {
    long x = System.currentTimeMillis();
    for (int i = 0; i < 50000; i++) {
      Assert.assertEquals(null, s.get(
          ks1.getKeyspaceMetadata().getPartitioner().partition("wontfindthis"), "column2"));
    }
    System.out.println("non existing key " + (System.currentTimeMillis() - x));
  }

result >> non existing key 11694

(I suspect the code is not bailing out early when it reads a rowkey greater than the search key, but that is another performance tweak; there is a sketch of the idea at the end of this post.) Next we put the code into action.

result >> index match 891
result >> far from index 186
result >> index match 159
result >> far from index 139
result >> non existing key 26

Wow! Maximum performance!
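As an aside, the early-exit tweak suspected missing above might look something like this minimal sketch (hypothetical, not nibiru's actual scan code):

import java.util.List;

public class SortedScan {
  // Hypothetical sketch: because sstable rows are written in sorted order,
  // a sequential scan can stop as soon as it passes the point where the
  // search key would have been, instead of reading to the end of the file.
  public static String scanSorted(List<String> sortedKeys, List<String> values,
      String searchKey) {
    for (int i = 0; i < sortedKeys.size(); i++) {
      int cmp = sortedKeys.get(i).compareTo(searchKey);
      if (cmp == 0) {
        return values.get(i); // found it
      }
      if (cmp > 0) {
        return null; // passed where it would be: definite miss
      }
    }
    return null;
  }
}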

There you have it. Bloom Filters: great for looking for data that is not there :) Sounds stupid, but the case happens more often than you might first think!


