On Key Value vs Column Family
The one line, mile-high view of Apache Cassandra is as follows:
The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together a fully distributed design and a ColumnFamily-based data model.
I have worked with the project for some time now, and I was doing some reflection on this statement and how profound Cassandra's design is. Amazon Dynamo's data model is a key value store, while Cassandra's data model is a ColumnFamily-based one. You can 'encapsulate' the difference like this:
Key value:There are many successful tools that are key value stores, notably Memcache. Some NoSQL products take the next step and turn key value store into a distributed key value store. The ColumnFamily model is more complex with its larger signature Map<Object,Object>, but that provides greater flexibility and allows certain applications to be modelled that would be difficult to do with a pure key/value store.
For example, imagine your are collecting page visits for a users web session. User bob visits 4 pages, index.html, stuff.html, purchase.html, and checkout.html.
In a pure key-value system this requirement is difficult to model. The first stumbling block is you can not append in a pure key/value store, so you will have to read, append, and then write back.
List<Object> newResults = get visit[bob]The above pseudo code only works in a single threaded writer otherwise you could face lost update problems.
results.add( $time-index.html,$time-stuff.html,$time-purchase.html,$time-checkout.html )
set visit[bob]= newResults
One way the key value system could handle this is:
server.lock( bob )
List<Object> newResults = get visit[bob]
results.add( $time-index.html , $time-stuff.html, $time-purchase.html,$time-checkout.html ) <--client side stuff
set visit[bob]= newResults
server.unlock ( bob )
This has many drawbacks. The first is you are going to need 3-4 client server exchanges to complete this, and the second is handling the situation where a client vanishes after acquiring the lock is a problem along with all the other distributed locking problems.
Vector Clocks are another system to handle this type of problem without the multiple round-trips. Vector clock systems are cool but the explanation/proof of how it works involves a bit more math then I am used to.
Let's take a look at how the ColumnFamily based model solves this problem:
set visits[bob][$time-index.html] =''
set visits[bob][$time-stuff.html] =''
set visits[bob][$time-purchase.html] =''
set visits[bob][$time-checkout.htm] = ''
Cassandra's use of ColumnFamily are a straight forward solution when compared to Vector Clocks, locking, or transactions. The application is responsible for creating a unique key for the map. In this example, we appended a timeUUID infront of the page name. This design does not need to "read before write" to accomplish an "append", nor does it need vector clocks on the server side to reconsile the data internally.
The structured log format using Memtables and SSTable allows for fast write and a system can merge on read. This can be a huge optimization for many applications. For example, at m6d we have our mapping servers writing URL information into something like the visit table I described above. If we had to 'read before write' or 'select before insert' we would need significantly more hardware, (if I had to guess I would say 2-3 times). Likewise we can add (or remove) Segments information to or from a user without ever having to issue a read.
Also for we could just use Cassandra as a key value store if we had no need for the map. In the example below 'k' is a static char is the only column in the column family.
This was a contrived example, but many applications can be fit well into the ColumnFamily model. Time series is one such example, lists inside a key, sets inside a key, different structure types under the same key. The key being a map rather then just a single opaque entity is very powerful and practical. The tunable consistency model, distributed design, and column family based approach are an optimal tool for many applications.