Edward Capriolo

Saturday Mar 03, 2012

What is the deal with IronCount?

You may have noticed that real time analytics is the new hotness. Just about every NoSQL these days seems to support some write without read counter system. Likewise many people are building their own system to funnel data to processing systems and increment said counters .

I view real time analytics as the next etl. By this I mean to say that stuffing tons of data into hadoop and building hourly or daily aggregations was cutting edge up until a minute ago. It is still a critical part of a "web scale" infrastructure but real time analytics is the next/current big deal.

What are the key parts of a (near) real time analytics infrastructure?

1) Data Aggregation

The system has to get data together in a way that it can be processed.

2) Distributed processing

With data being aggregated some algorithm or transformation needs to be applied to it.

3) On demand results

The system has to work at small granularity. This could be one or five minute windows, but possibly smaller, writing and retrieving inside these windows should be low latency.

4) Scale

The system has to support more data, more algorithms/aggregations , and more requests without needing to be redesigned, every part should be horizontally scalable.

I am going to talk about IronCount and why I think it can be a key part of a real time infrastructure. First, a brief history lesson. I came up with the inspiration for IronCount while talking with Joe Stein. I was tired of waiting for twitter to open source RainBird. Which looks like it is never going to happen at this point. I work by the iconic FlatIron building, hence Iron in the name, and Count because the standard use case is typically to read raw data and increment N counters (usually in a system like Cassandra).

Flat Iron Building

IronCount is a consumer manager for Apache Kafka. Lets talk about kafka which solves need #1 in RTA. (also deals with 1,3 and 4)

Kafka offers high throughput distributed low latency producer/consumer queue system. Unlike the JMS message queues it does not try to implement all types of complicated and complex JMS semantics. However it does support one feature that is very useful. Messages are routed by key to the same partition. This is critical because it allows an application consuming messages to know that messages for a key are only being consumed by a single consumer. For example, you can route weblogs by user_id to ensure that all the hits for one user are processed by the same consumer. (note that loss or addition of a consumer changes where messages will route). With kafka's distributed architecture we can process data and horizontally scale out.

Distributed processing is RTA requirement #2 which is where IronCount comes in. It is great that we can throw tons of messages into Kafka, but we do not have a system to process these messages. We could pick say 4 servers on our network and write a program implementing a Kafka Consumer interface to process messages, write init scripts, write nagios check, manage it. How to stop it start it upgrade it? How should the code even be written? What if we need to run two programs, or five or ten?

IronCount gives simple answers for these questions. It starts by abstracting users from many of the questions mentioned above. Users need only to implement a single interface and possibly only a single method handleMessage(Message m).

public interface MessageHandler {
  public void setWorkload(Workload w);
  public void handleMessage(Message m) ;
  public void stop();
  public void setWorkerThread(WorkerThread wt) ;
}

Turns out this pattern makes many problems that seem complicated with other systems easy.

For example, if someone wants to implement something like scribe or flume aka aggregate logs into HDFS.

Caligraphy

What 100 lines of code!? Doesn't this have to be hard?

What about joining and re-routing streams? AKA yahoo's s4.

Map Reduce demo

What? Don't you need some special application made just for this?

What if you wanted to count URLs like rainbird and save them into Cassandra?

Mocking Bird

I think by now you might be getting what I am driving at. Rather then having a highly specialized infrastructure to handle task X, Kafka, IronCount and maybe a little Cassandra or Hadoop can get the job done.

I am not trying to take anything away from Rainbird, Flume, Scribe, s4, Storm or anyone of the other technologies I have mentioned. But if IronCount can demonstrate the same or similar results without having to learn a complex API or implement a large special purpose architecture that says something. With IronCount you can quickly write a Handler to do exactly what you need, rather then dealing with complexity of making another system work well for exactly what you are trying to do.


Comments:

Hi Edward, Thanks for a great post. I've considered using Kafka before - it seems very useful fpr something like this. Have you looked at Acunu Analytics? [http://www.acunu.com/blogs/andy-twigg/acunu-analytics-preview/] It runs on Cassandra, and does some tricks with counters to allow interesting partial aggregates. Maybe we can integrate Kafka into its interface? Andy

Posted by Andy Twigg on April 12, 2012 at 05:16 AM EDT #

Younger generations or child easily learn how to send SMS messages.

Posted by sms messages on May 29, 2012 at 04:19 AM EDT #

Want to say someone sorry Visit for Sorry messages

Posted by Auste on June 01, 2012 at 08:50 AM EDT #

Post a Comment:
Comments are closed for this entry.

Calendar

Feeds

Search

Links

Navigation