Edward Capriolo

Thursday Dec 04, 2014

Building a Column Family store for fun and enjoyment

Recently I thought I would have some fun. What is my definition of fun? I decided to take on building my own Column Family data store. I decided to start by making a small server and adding features one by one. I started with memtables, but I decided it was time to up my game.

Cassandra writes go to memtables; eventually these memtables flush to disk as SSTables. SS stands for Sorted String. Besides the sorted nature of the SSTables, Cassandra uses Bloom filters and indexes (every N rows) to make searches for keys efficient.
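To make the "index every N rows" idea concrete, here is a minimal sketch of a sparse index. This is not Cassandra's implementation: the class SparseIndexSketch, its method names, and the assumption that rows are ordered by a plain string key are all made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sparse index: remembers the file offset of every Nth row key. */
final class SparseIndexSketch {
  private final TreeMap<String, Long> samples = new TreeMap<>();

  /**
   * sortedKeys must be in the same order as the rows on disk;
   * offsets.get(i) is the byte offset where sortedKeys.get(i) starts.
   */
  SparseIndexSketch(List<String> sortedKeys, List<Long> offsets, int interval) {
    for (int i = 0; i < sortedKeys.size(); i += interval) {
      samples.put(sortedKeys.get(i), offsets.get(i));
    }
  }

  /** Offset to start scanning from, or -1 if the key sorts before the first row. */
  long scanStart(String rowKey) {
    // the closest sampled key that is <= rowKey marks where the scan should begin
    Map.Entry<String, Long> e = samples.floorEntry(rowKey);
    return e == null ? -1L : e.getValue();
  }
}
```

A reader would seek to scanStart(key) and scan forward at most N rows, instead of scanning the whole file. The trade-off is index size against scan length.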

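To build an SSTable I have been playing around with memory-mapped byte buffers and file channels. The first step is turning a memtable (conceptually a Map<String,Map<String,Column>>; in the code below the outer key is a Token and the inner map is a ConcurrentSkipListMap<String, Val>) into an sstable.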

Storing numbers as strings is not the most efficient design in some cases, but I can always evolve this later.
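The nested sorted-map shape described above can be sketched in a few lines. This is a simplified stand-in, not my actual Memtable class: MemtableSketch and Cell are hypothetical names, the row key is a plain String rather than a Token, and the rule that the write with the higher timestamp wins is an assumption.

```java
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical, simplified memtable: sorted rows, each holding sorted columns. */
final class MemtableSketch {
  /** A column value plus the timestamp it was written with. */
  static final class Cell {
    final String value;
    final long time;
    Cell(String value, long time) {
      this.value = value;
      this.time = time;
    }
  }

  private final ConcurrentSkipListMap<String, ConcurrentSkipListMap<String, Cell>> rows =
      new ConcurrentSkipListMap<>();

  void put(String rowKey, String column, String value, long time) {
    rows.computeIfAbsent(rowKey, k -> new ConcurrentSkipListMap<>())
        // assumed conflict rule: keep whichever write carries the newer timestamp
        .merge(column, new Cell(value, time),
               (old, neu) -> neu.time >= old.time ? neu : old);
  }

  /** Returns the stored value, or null if the row or column does not exist. */
  String get(String rowKey, String column) {
    ConcurrentSkipListMap<String, Cell> cols = rows.get(rowKey);
    if (cols == null) {
      return null;
    }
    Cell c = cols.get(column);
    return c == null ? null : c.value;
  }
}
```

Because both levels are skip lists, iterating the structure yields rows and columns already in sorted order, which is what makes flushing to a Sorted String table a straight walk over the map.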

public void flushToDisk(String id, Configuration conf, Memtable m) throws IOException{
    File f = new File(conf.getSstableDirectory(), id + ".ss");
    // try-with-resources closes the stream even when a write fails, and avoids
    // a NullPointerException in cleanup if the file could not be opened
    try (OutputStream output = new BufferedOutputStream(new FileOutputStream(f))) {
      for (Entry<Token, ConcurrentSkipListMap<String, Val>> i : m.getData().entrySet()){
        // row header: token, then row key
        output.write(START_RECORD);
        output.write(i.getKey().getToken().getBytes());
        output.write(END_TOKEN);
        output.write(i.getKey().getRowkey().getBytes());
        output.write(END_ROWKEY);
        boolean first = true;
        for (Entry<String, Val> j : i.getValue().entrySet()){
          // END_COLUMN separates columns, so skip it before the first one
          if (!first){
            output.write(END_COLUMN);
          }
          first = false;
          // column layout: name, create time, time, ttl, value
          output.write(j.getKey().getBytes());
          output.write(END_COLUMN_PART);
          output.write(String.valueOf(j.getValue().getCreateTime()).getBytes());
          output.write(END_COLUMN_PART);
          output.write(String.valueOf(j.getValue().getTime()).getBytes());
          output.write(END_COLUMN_PART);
          output.write(String.valueOf(j.getValue().getTtl()).getBytes());
          output.write(END_COLUMN_PART);
          output.write(String.valueOf(j.getValue().getValue()).getBytes());
        }
        // one record per line
        output.write('\n');
      }
    }
  }
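The record layout the flush method is aiming for can be sketched as a small round trip. Everything here is an assumption for illustration: the class RecordFormatSketch is hypothetical, the delimiter values are invented (the real constants are not shown), and each column is reduced to a name and a value, leaving out the create time, time, and ttl parts.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical, simplified version of the delimited one-record-per-line layout. */
final class RecordFormatSketch {
  // Invented delimiter values; the real constants may differ.
  static final char START_RECORD = '\0';
  static final char END_TOKEN = '\1';
  static final char END_ROWKEY = '\2';
  static final char END_COLUMN_PART = '\3';
  static final char END_COLUMN = '\4';

  static byte[] encode(String token, String rowKey, Map<String, String> columns) {
    StringBuilder sb = new StringBuilder();
    sb.append(START_RECORD).append(token).append(END_TOKEN)
      .append(rowKey).append(END_ROWKEY);
    boolean first = true;
    for (Map.Entry<String, String> c : columns.entrySet()) {
      if (!first) {
        sb.append(END_COLUMN); // separator goes between columns only
      }
      first = false;
      sb.append(c.getKey()).append(END_COLUMN_PART).append(c.getValue());
    }
    sb.append('\n');
    return sb.toString().getBytes(StandardCharsets.UTF_8);
  }

  /** Parses the columns of a single encoded record back into a map. */
  static Map<String, String> decodeColumns(byte[] record) {
    String s = new String(record, StandardCharsets.UTF_8);
    // everything between END_ROWKEY and the trailing newline is column data
    String body = s.substring(s.indexOf(END_ROWKEY) + 1, s.length() - 1);
    Map<String, String> out = new LinkedHashMap<>();
    if (body.isEmpty()) {
      return out;
    }
    for (String col : body.split(Pattern.quote(String.valueOf(END_COLUMN)))) {
      int sep = col.indexOf(END_COLUMN_PART);
      out.put(col.substring(0, sep), col.substring(sep + 1));
    }
    return out;
  }
}
```

A format like this only works if keys and values never contain the delimiter bytes or a newline. A length-prefixed binary layout would avoid that restriction, at the cost of no longer being readable in a text editor.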

Cool, we can write a memtable to disk! The next step is reading it.

  private RandomAccessFile raf;
  private FileChannel channel;
 
  public void open(String id, Configuration conf) throws IOException {
    File sstable = new File(conf.getSstableDirectory(), id + ".ss");
    raf = new RandomAccessFile(sstable, "r");
    channel = raf.getChannel();
  }
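Once the channel is open, the file can be mapped into memory. Here is a minimal, self-contained sketch of that step using the standard FileChannel.map call. The class MmapReadSketch and the method readAll are hypothetical names, and copying the whole file into a byte array is only for demonstration; a real reader would pull out small blocks at a time.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

/** Hypothetical helper showing how a read-only file can be memory mapped. */
final class MmapReadSketch {
  static byte[] readAll(Path file) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r");
         FileChannel channel = raf.getChannel()) {
      // map the entire file read-only; the OS pages data in on demand
      MappedByteBuffer mbb = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
      byte[] out = new byte[mbb.remaining()];
      mbb.get(out); // relative bulk get: copies from the buffer's current position
      return out;
    }
  }
}
```

Note that a single mapping is limited to Integer.MAX_VALUE bytes, so very large files would need several mappings.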

I am not going to share all the code to read through the data. It is somewhat tricky, and I do not write stuff like this often, so I am sure I could do a better job by looking at some other data stores and following what they do. This piece can always be evolved later.

I am proud of this class in particular. This BufferGroup makes it easier to navigate the file channel and buffer and read in blocks while iterating byte by byte.

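public class BufferGroup {
  private int blockSize = 1024;
  byte [] dst = new byte [blockSize];
  int startOffset = 0;   // file offset of the block currently held in dst
  int currentIndex = 0;  // position of the current byte within dst
  FileChannel channel;
  MappedByteBuffer mbb;
 
  public BufferGroup(){}
 
  void read() throws IOException{
    // shrink the final block so we never read past the end of the file
    if (channel.size() - startOffset < blockSize){
      blockSize = (int) (channel.size() - startOffset);
      dst = new byte[blockSize];
    }
    // the offset argument of get(byte[], int, int) indexes into dst, not the
    // file, so position the buffer at startOffset and fill dst from index 0
    mbb.position(startOffset);
    mbb.get(dst, 0, blockSize);
    currentIndex = 0;
  }
 
  void advanceIndex() throws IOException{
    currentIndex++;
    if (currentIndex == blockSize){
      // current block is used up: move on to the next one
      startOffset += blockSize;
      read();
    }
  }
}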
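The same read-a-block, walk-byte-by-byte idea can be tried without a file, since MappedByteBuffer is just a ByteBuffer. The sketch below is a standalone variation, not the BufferGroup class itself: BlockReaderSketch, its method names, and the countRecords helper (which counts newline-terminated records) are hypothetical.

```java
import java.nio.ByteBuffer;

/** Hypothetical block reader: copies fixed-size blocks, exposes one byte at a time. */
final class BlockReaderSketch {
  private final ByteBuffer source;
  private final int maxBlock;
  private byte[] block = new byte[0];
  private int index = 0;

  BlockReaderSketch(ByteBuffer source, int maxBlock) {
    this.source = source.duplicate(); // leave the caller's position untouched
    this.maxBlock = maxBlock;
    fill();
  }

  /** Copies the next block; the last block may be shorter than maxBlock. */
  private void fill() {
    int n = Math.min(maxBlock, source.remaining());
    block = new byte[n];
    source.get(block);
    index = 0;
  }

  boolean hasCurrent() {
    return index < block.length;
  }

  byte current() {
    return block[index];
  }

  void advance() {
    index++;
    // refill only when the block is exhausted and more data remains
    if (index == block.length && source.hasRemaining()) {
      fill();
    }
  }

  /** Counts newline-terminated records by walking the data byte by byte. */
  static int countRecords(ByteBuffer data, int blockSize) {
    BlockReaderSketch r = new BlockReaderSketch(data, blockSize);
    int n = 0;
    while (r.hasCurrent()) {
      if (r.current() == '\n') {
        n++;
      }
      r.advance();
    }
    return n;
  }
}
```

The result should not depend on the block size, which makes a cheap sanity check: count the records with a block size of 1, 4, and 1024 and compare.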

Write to a memtable, flush it to disk, load the disk back into memory as a memory mapped buffer, and use that buffer for reads. Here is what it looks like when you put it all together.

  @Test
  public void aTest() throws IOException{
    File tempFolder = testFolder.newFolder("sstable");
    System.out.println("Test folder: " + testFolder.getRoot());
    Configuration configuration = new Configuration();
    configuration.setSstableDirectory(tempFolder);
    Memtable m = new Memtable();
    Keyspace ks1 = MemtableTest.keyspaceWithNaturalPartitioner();
    // write "c", read it back, then overwrite it with "d"
    m.put(ks1.getKeyspaceMetadata().getPartitioner().partition("row1"), "column2", "c", 1, 0L);
    Assert.assertEquals("c", m.get(ks1.getKeyspaceMetadata().getPartitioner().partition("row1"), "column2").getValue());
    m.put(ks1.getKeyspaceMetadata().getPartitioner().partition("row1"), "column2", "d", 2, 0L);
    // flush the memtable to disk and reopen it as an sstable
    SSTable s = new SSTable();
    s.flushToDisk("1", configuration, m);
    s.open("1", configuration);
    // time 50,000 reads against the sstable
    long x = System.currentTimeMillis();
    for (int i = 0 ; i < 50000 ; i++) {
      Assert.assertEquals("d", s.get("row1", "column2").getValue());
    }
    System.out.println((System.currentTimeMillis() - x));
  }

I am pretty excited. I am not sure what I will do next: indexes, bloom filters, or a commit log?



